CN117763504A

CN117763504A - Electric energy data null value processing method, system and storage medium

Info

Publication number: CN117763504A
Application number: CN202311611388.4A
Authority: CN
Inventors: 戚成飞; 刘岩; 毕超然; 王耀宇; 杨晓波; 易忠林; 程杰; 焦东翔; 张希蔚; 熊洪樟; 吕凛杰; 王亚超; 李文文; 王杰; 张茹
Original assignee: State Grid Jibei Electric Power Co Ltd
Current assignee: State Grid Jibei Electric Power Co Ltd
Priority date: 2023-11-29
Filing date: 2023-11-29
Publication date: 2024-03-26

Abstract

The invention discloses a method, a system and a storage medium for processing null values of electric energy data, belonging to the technical field of electric energy metering data processing, and aims to solve the problem in the prior art that traditional electric power data analysis often faces missing data, which hinders power analysis. The method comprises the steps of: dividing intelligent power data; constructing a regression model in a source domain for iterative training, and solving according to the iteration result to obtain source data feature weights and source instance weights; mapping the source data feature weights and instance weights to a construction domain by genetic programming; performing missing value completion on the training data set containing missing values; importing a test data set containing missing values into the regression model for training to obtain a prediction result; and optimizing the model parameters of the regression model according to its processing effect, and deploying the optimized model into the power information acquisition system. The invention improves the availability of data samples and is of great significance for optimizing the power analysis flow and improving service quality.

Description

Electric energy data null value processing method, system and storage medium

Technical Field

The invention relates to a method, a system and a storage medium for processing null values of electric energy data, and belongs to the technical field of electric energy metering data processing.

Background

Building a new-type power system that effectively aggregates massive adjustable resources and supports real-time dynamic response helps to build industry consensus, promote collaborative innovation, solve the technical problems of the energy transition and seize the commanding heights of industrial development. It also places higher demands on the data monitoring and analysis mechanisms of the power distribution network.

However, because the power distribution network serves users directly, it is large in scale, widely distributed, exposed to harsh equipment operating environments, and monitored by devices of uneven quality. Problems such as unclear sensing, missed reports and false reports of power data therefore occur easily, so that the power data collected by the back end contain many null and invalid values, which makes subsequent data mining and in-depth analysis very difficult. For a long time, power operators have often simply discarded data containing null or invalid values. However, when data samples are scarce, for example for occasional power outage anomalies or peak power consumption pressure, simply discarding data inevitably harms work such as power problem investigation and power consumption characteristic analysis.

In recent years, to promote the consumption and storage of new energy and accelerate power market reform, a new power information acquisition system 2.0, based on the Advanced Metering Infrastructure (AMI) and composed of smart meter terminals and the like, has gradually been put into large-scale construction, providing software and hardware support for applications such as data acquisition, fault diagnosis, and statistics and verification. Compared with the previous generation, the new-generation power information acquisition system supports flexible access for diverse equipment and also offers intelligent scheduling, real-time online diagnosis and other functions. With the new acquisition terminals, the new system has a stronger ability to sense the electricity consumption behavior of end low-voltage customers and, combined with multi-dimensional analysis of massive data, builds a new application mode of intelligent penetrating analysis of problems, extends the depth of service analysis and shortens system analysis time. Thanks to the underlying architecture, real-time analysis of power data has improved to some extent.

However, without targeted supporting algorithms, the level of data analysis still cannot fully meet the integrity requirement. As the accuracy of power data analysis and the personalized requirements of user services have gradually increased, the VEE standard for data quality and data management in the AMI acquisition system 2.0 has also been put into use. The VEE (Validation, Estimation and Editing) standard provides a series of auditing standards and processing methods that can be applied to meter transmission data, and supports generating daily operational reports with detailed results, helping power service enterprises understand the details of grid operation. The pulse overcurrent check, time-varying check, data abnormality check, data null value check and other checks provided by the standard can, to a certain extent, guarantee the reliability of the power data and the accuracy of the subsequent analysis.

In the data review methods provided by the VEE standard, the data null value check is an important item. This is because, for staff at the data analysis end, data records with null or missing values generally cannot be fed into the usual analysis models and processing flows. How analysts handle the problem involves subjective arbitrariness and introduces objective data bias. Traditional null value processing methods are based solely on simple engineering logic or on the experience of technicians. At present, many researchers in the field also propose using the generalization ability of artificial intelligence and machine learning to fill in data. However, these methods require sufficient data for reliable modeling, place high demands on data volume and on operators' learning thresholds, and cannot be applied to small-sample scenarios.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a method, a system and a storage medium for processing null values of electric energy data, which can estimate and complete null values and abnormal zero values in collected electric power data in small-sample scenarios, thereby improving the usability of data samples, which is of great significance for optimizing the power analysis flow and improving service quality.

In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:

In a first aspect, the present invention provides a method for processing null values of electric energy data, including the following steps:

Acquiring intelligent power data from the power information acquisition system.

Dividing the intelligent power data, wherein relatively complete data form a source domain and the remaining data containing missing values form a target domain.

Constructing a regression model in the source domain for iterative training, and obtaining an iteration result after training is completed.

Solving according to the iteration result to obtain the source data feature weights and the source instance weights.

Dividing the target domain data into a training data set containing missing values and a test data set containing missing values.

Mapping the source data feature weights and instance weights to the construction domain through genetic programming to obtain migration data feature weights and migration instance weights.

Performing missing value completion on the training data set containing missing values by using the migration data weights and the migration instance weights to obtain a completed training data set.

Importing the completed training data set into the regression model and training the regression model in the target domain; importing the test data set containing missing values into the regression model for training to obtain a prediction result; and analyzing the processing effect of the regression model according to the prediction result.

Optimizing the model parameters of the regression model according to its processing effect to obtain an optimized regression model and its parameters.

Deploying the optimized regression model and its parameters into the power information acquisition system.

Further, the intelligent power data includes a time stamp, temperature, humidity, voltage, current, and power consumption.

Further, let the total number of iterations of the regression model be r; the l-th iteration of training the regression model in the source domain proceeds as follows:

constructing a regression model on the source domain;

calculating the number of times each source data feature of the source instances appears in the regression model;

for each source instance, obtaining the prediction error of that instance from the regression model and the expected value of the source data corresponding to the source instance, wherein the prediction error is determined by the expected value of the i-th source instance and the prediction result of the regression model trained in the l-th iteration for that source instance.

Further, the expressions of the source data feature weights and the source instance weights are as follows:

wherein the expression for the j-th source data feature weight involves: r, the total number of iterations; l, the current iteration index; the number of times the j-th source data feature occurs in the regression model of iteration l; the number of times the p-th source data feature occurs in that regression model; and m^s, the number of source data features. The expression for the i-th source instance weight involves: the expected value of the i-th source instance; the value predicted for the i-th instance by the regression model; the expected value of the o-th source instance; the mean of all source instance expected values; and n^s, the number of source instances.

Further, mapping the source data feature weights and the instance weights to training data sets containing missing values through genetic programming to obtain migration data feature weights and migration instance weights, including:

constructing a plurality of genetic programming tree structures;

converting the source data features and the source instances into the construction domain by using the genetic programming tree structures, and weighting them to obtain the migration data feature weights and the migration instance weights, wherein the expression is as follows:

wherein the expression for the j-th migration data feature weight involves: the p-th migration data feature; M_j, the j-th genetic programming tree structure; the weight of the p-th migration data feature in M_j; and terminal, the terminal set of the specified genetic programming tree, which contains all migration data features in that tree. The expression for the migration instance weight of the i-th instance involves: the i-th instance weight; I_i, the i-th instance; D^s, the source domain; and n^s, the number of source instances.

Further, the migration instance weight is normalized before being used for missing value completion, and the expression is as follows:

wherein the expression involves the normalized migration instance weight of the i-th instance, the migration instance weight of the i-th instance, and the migration instance weight of the l-th instance.
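The normalization expression itself is not reproduced in this text. As a minimal sketch, assuming the common choice of dividing each migration instance weight by the sum of all migration instance weights (the function name `normalize_weights` is hypothetical), the step could look like this:

```python
import numpy as np

def normalize_weights(weights):
    """Scale instance weights so that they sum to 1 (assumed normalization)."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum()

# Hypothetical migration instance weights for three instances.
normalized = normalize_weights([2.0, 1.0, 1.0])
```

With this assumed form the relative ordering of the instances is unchanged and only the scale is fixed, which keeps later weighted averages well defined.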

Further, the missing value data is complemented to the training data set containing the missing value by using the migration data weight and the migration instance weight, so as to obtain a complemented training data set, which comprises the following steps:

and measuring the distance between the to-be-complemented missing instance and the instance without the missing value in the training data set containing the missing value by using the weighted Euclidean distance, and calculating the sample weight according to the distance, wherein the expression is as follows:

wherein I_a is the a-th missing instance to be completed, I_i is the i-th instance without missing values, d(I_a, I_i) is the distance between I_a and I_i, the weighting term is the j-th migration data feature weight, I_a[j] is the j-th feature value of I_a, I_i[j] is the j-th feature value of I_i, J denotes the training data set containing missing values, and w(I_a, I_i) is the sample weight;

the entropy weight of the missing instance to be complemented is obtained through sample weight calculation, and the expression is as follows:

wherein E(I_a, I_i) is the scalar used to compute the entropy weight from the sample weight between I_a and I_i, n^s is the number of source instances, and the result is the entropy weight of the missing instance I_a to be completed;

and estimating the missing value according to the entropy weight of the missing instance to be complemented, wherein the expression is as follows:

wherein the expression yields the completed a-th missing instance and involves the normalized migration instance weight of the i-th instance and K, the hyperparameter of the K-nearest-neighbor algorithm.
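The estimation expression is not reproduced in this text, so the following is only a sketch of a K-nearest-neighbor fill under stated assumptions. It uses plain inverse-distance weights in place of the entropy weights, multiplies them by the normalized migration instance weights, and takes a weighted average over the K nearest complete instances. The function name `knn_impute` and the small constant added to the distance are hypothetical.

```python
import numpy as np

def knn_impute(inst_a, complete, feature_weights, instance_weights, k=2):
    """Fill the NaN entries of inst_a from its k nearest complete instances."""
    inst_a = np.asarray(inst_a, dtype=float)
    complete = np.asarray(complete, dtype=float)
    fw = np.asarray(feature_weights, dtype=float)
    iw = np.asarray(instance_weights, dtype=float)

    missing = np.isnan(inst_a)
    # Weighted Euclidean distance over the features observed in inst_a.
    diff = complete[:, ~missing] - inst_a[~missing]
    dist = np.sqrt((fw[~missing] * diff ** 2).sum(axis=1))

    nearest = np.argsort(dist)[:k]
    # Assumed combination: instance weight times inverse distance, renormalized.
    w = iw[nearest] / (dist[nearest] + 1e-12)
    w = w / w.sum()

    filled = inst_a.copy()
    filled[missing] = w @ complete[nearest][:, missing]
    return filled

# Hypothetical data: the second feature of inst_a is missing.
filled = knn_impute(
    inst_a=[1.0, np.nan],
    complete=[[0.0, 10.0], [2.0, 20.0], [9.0, 90.0]],
    feature_weights=[1.0, 1.0],
    instance_weights=[1.0, 1.0, 1.0],
    k=2,
)
```

In this toy case the two nearest instances are equally distant, so the missing value is filled with the midpoint of their values.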

Further, the regression model processing effect is represented by a complement error and a distribution error, and the expression is as follows:

wherein fitness denotes the regression model processing effect; λ is the regularization factor; the expression combines the complement error with the distribution error ω(D^t, D^c); T^t is the real target data in the target domain, and the complement error compares it with the predicted target data in the target domain; D^t denotes the target domain; and D^c denotes the construction domain;

the complement error is used for measuring the deviation condition of predicted target data and real data, and the expression is as follows:

where RSE denotes the relative square error; the predicted target data are the predictions of the DR model on the target domain for the data to be predicted after the missing values have been completed; the i-th real target data item in the target domain is compared with the model's prediction for the corresponding item; the mean of the real target data in the target domain is also used; and n is the number of target instances;

the distribution error is used for measuring the distribution similarity of the complement data and the real data, and the expression is as follows:

wherein the expression involves the j-th source data feature and the j-th target data feature, together with the two functions that correspond respectively to these two features; sup is the supremum function; and m^t is the number of target data features.
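The distribution error expression is not reproduced in this text. Because it is described as a distribution similarity measure built on a supremum, one plausible reading is a Kolmogorov-Smirnov style statistic, that is, the largest gap between two empirical cumulative distribution functions, averaged over the features. The sketch below implements that assumed reading; the function names are hypothetical.

```python
import numpy as np

def ks_statistic(x, y):
    """Largest absolute gap between the empirical CDFs of two samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.concatenate([x, y])  # the gap can only change at sample points
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))

def distribution_error(data_a, data_b):
    """Average per-feature KS statistic between two data sets (assumed form)."""
    data_a = np.asarray(data_a, dtype=float)
    data_b = np.asarray(data_b, dtype=float)
    m = data_a.shape[1]
    return sum(ks_statistic(data_a[:, j], data_b[:, j]) for j in range(m)) / m

# Hypothetical two-feature data sets: feature 0 matches, feature 1 does not overlap.
err = distribution_error([[1.0, 3.0], [2.0, 4.0]], [[1.0, 1.0], [2.0, 2.0]])
```

A value of 0 means the two empirical distributions coincide on every feature, and a value of 1 means they do not overlap on any feature.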

In a second aspect, the present invention provides an electrical energy data null processing system, including a processor and a storage medium.

The storage medium is for storing instructions.

The processor is operative to perform the steps of any one of the methods described above in accordance with the instructions.

In a third aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of any of the methods described above.

Compared with the prior art, the invention has the beneficial effects that:

According to the electric energy data null value processing method provided by the invention, genetic programming based missing value filling and transfer learning are applied to the completion of power data in the power distribution network, and the information to be analyzed is transferred from the complete source domain to the incomplete target domain. The method is particularly suitable for data completion in small-sample scenarios, can improve the efficiency of power data analysis at low cost, and ultimately improves the precision of power event analysis and the quality of user service.

Drawings

FIG. 1 is a flow chart of a method for processing null values of electric energy data according to an embodiment of the invention;

FIG. 2 is a schematic diagram illustrating a characteristic migration process of an electric energy data null processing method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a deployment implementation flow of an electrical energy data null processing system according to an embodiment of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.

Example 1:

As shown in FIG. 1, the electric energy data null value processing method provided by this embodiment of the invention can be divided into three parts, specifically:

A first part: knowledge extraction

Intelligent power data are acquired from the power information acquisition system. According to the factors related to electric energy data, the intelligent power data comprise six attributes: DateTime (time stamp), temperature, humidity, voltage, current and PowerConsumption (power consumption).

The intelligent power data are divided: relatively complete data form the source domain, and the remaining data containing missing values form the target domain. (So that the model can be optimized and trained later and the data distribution remains reasonable, the target domain also contains complete real data in addition to the data to be completed; otherwise the effect of the subsequent model could not be verified.)
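The division step can be illustrated with a short sketch. It makes a simplifying assumption that the text does not state: every record with at least one missing value goes to the target domain and every fully observed record goes to the source domain, whereas the text also keeps some complete records in the target domain for verification. The function name `split_domains` and the sample readings are hypothetical.

```python
import numpy as np

def split_domains(data):
    """Return (source, target): rows without NaN and rows with at least one NaN."""
    data = np.asarray(data, dtype=float)
    has_missing = np.isnan(data).any(axis=1)
    return data[~has_missing], data[has_missing]

# Hypothetical readings: temperature, humidity, voltage, current, power consumption.
readings = np.array([
    [21.5, 40.0, 220.1, 5.2, 1.14],
    [22.0, np.nan, 219.8, 5.0, 1.10],
    [20.9, 42.0, 221.0, np.nan, np.nan],
    [21.1, 41.0, 220.5, 5.1, 1.12],
])
source, target = split_domains(readings)
```

A practical split would likely use a tolerance for "relatively complete" records and reserve part of the complete data for the target domain, as the parenthetical above requires.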

A regression model is constructed in the source domain for iterative training. The data regression model may be a logistic regression model, a stepwise regression model, a ridge regression model or the like, selected according to the actual data characteristics and business requirements.

During iterative training of the regression model, prediction errors are first obtained over multiple iterations, and the source data feature weights and source instance weights are then estimated from the regression model under the influence of these prediction errors. The more important a feature is in the source domain, the more likely it is to benefit the target domain, and the greater its weight in the conversion and construction process. For ease of description, D^s, D^t and D^c denote the source domain data, the target domain data and the construction domain data, respectively.

Assuming that the total number of iterations of the regression model is r, the l-th iteration of training the regression model in the source domain proceeds as follows:

Construct a regression model on the source domain D^s.

Compute the number of times each source data feature of the source instances occurs in the regression model.

For each source instance, obtain the prediction error of that instance from the regression model and the expected value of the source data corresponding to the source instance, where the prediction produced for the source instance by the regression model trained in the l-th iteration deviates from the expected value by some error.

After the iteration is completed, the source data characteristic weight and the instance weight can be solved through an iteration result, and the solving expression is as follows:

The more frequently a source data feature occurs, the greater its likely influence on the result, and hence the greater its weight. An instance whose prediction deviates less from its label is easier for the model to learn, that is, it contains less noise, and is therefore given a greater weight. Noisy and anomalous instances tend to have larger prediction deviations, which means they contribute little to alleviating the missing-data problem in the target domain.
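The solving expressions are not reproduced in this text, so the following sketch only illustrates the qualitative behavior described above under assumed forms. The feature weight is taken as the feature's share of all feature occurrences, averaged over the r iterations, and the instance weight is taken as one minus the squared prediction error relative to the total variation of the expected values, averaged over the iterations. Both forms and both function names are assumptions.

```python
import numpy as np

def feature_weights(counts):
    """counts[l, j]: occurrences of feature j in the model of iteration l."""
    counts = np.asarray(counts, dtype=float)
    shares = counts / counts.sum(axis=1, keepdims=True)  # share within each iteration
    return shares.mean(axis=0)                           # average over iterations

def instance_weights(expected, preds):
    """expected[i]: expected values; preds[l, i]: prediction of iteration l."""
    expected = np.asarray(expected, dtype=float)
    preds = np.asarray(preds, dtype=float)
    total_variation = np.sum((expected - expected.mean()) ** 2)
    relative_error = (preds - expected) ** 2 / total_variation
    # Smaller error gives a larger weight; negative values are clipped to zero.
    return np.clip(1.0 - relative_error.mean(axis=0), 0.0, None)

# Hypothetical results of r = 2 iterations with 3 features and 3 instances.
fw = feature_weights([[2, 1, 1], [3, 0, 1]])
iw = instance_weights([1.0, 2.0, 3.0], [[1.0, 2.5, 3.0], [1.0, 2.5, 2.0]])
```

In this toy run the most frequently used feature receives the largest weight, and the instance that is always predicted exactly receives weight 1.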

A second part: data completion

A plurality of genetic programming trees are constructed, each genetic programming tree attempting to map out a particular feature of the target domain using all data features of the source domain, minimizing distribution mismatch between the source domain and the target domain. The model migration process is also a data enhancement process, as shown in FIG. 2.

The genetic programming trees convert the source data features and the source instances into the construction domain D^c, and the result is weighted to obtain the migration data feature weights and the migration instance weights, wherein the expression is as follows:

wherein the expression for the j-th migration data feature weight involves: the p-th migration data feature; M_j, the j-th genetic programming tree structure; the weight of the p-th migration data feature in M_j; and terminal, the terminal set of the specified genetic programming tree, which contains all migration data features in that tree. The expression for the migration instance weight of the i-th instance involves: the i-th instance weight; I_i, the i-th instance; D^s, the source domain; and n^s, the number of source instances.
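To make the role of the terminal set concrete, the sketch below represents one genetic programming tree as nested tuples, evaluates it on a record to construct a feature, and assigns the constructed feature a weight. The weight rule used here, the sum of the source weights of the features in the tree's terminal set, is an assumption, since the expression is not reproduced in this text. The evolutionary search that produces the trees is omitted, and all names and values are hypothetical.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def evaluate(tree, row):
    """Evaluate a tree: a leaf is a feature name, a node is (op, left, right)."""
    if isinstance(tree, str):
        return row[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, row), evaluate(right, row))

def terminals(tree):
    """Collect the set of feature names that appear as leaves of the tree."""
    if isinstance(tree, str):
        return {tree}
    _, left, right = tree
    return terminals(left) | terminals(right)

def migration_feature_weight(tree, source_weights):
    """Assumed rule: sum the source weights of the features in the terminal set."""
    return sum(source_weights[name] for name in terminals(tree))

# Hypothetical tree, record and source feature weights.
tree = ("add", "voltage", ("mul", "current", "temperature"))
row = {"voltage": 220.0, "current": 5.0, "temperature": 20.0}
source_weights = {"voltage": 0.5, "current": 0.25,
                  "temperature": 0.125, "humidity": 0.125}

constructed_value = evaluate(tree, row)
weight = migration_feature_weight(tree, source_weights)
```

Under this rule a constructed feature inherits more weight when its tree draws on source features that were important in the source domain.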

When the data complement is carried out later, the weight of the migration instance needs to be normalized, and the expression is as follows:

Using the migration data feature weights and migration instance weights obtained in the migration process, the missing target data are completed by an entropy-weighted K-nearest-neighbor (EKNN) method. EKNN is chosen because it can serve as an implicit weighting scheme for the transferred source instances. In addition to weighting the source instances by their influence on the prediction error, weighting them by their entropy-based contribution to imputing missing values in the target domain can give more accurate localization along the feature dimensions.

When the fill values are calculated, the weighted Euclidean distance is used to measure the distance between the missing instance to be completed and each instance without missing values in the training data set containing missing values. The sample weight w(I_a, I_i) is then taken as the inverse of that distance. The calculation is as follows:

wherein I_a is the a-th missing instance to be completed, I_i is the i-th instance without missing values, d(I_a, I_i) is the distance between I_a and I_i, the weighting term is the j-th migration data feature weight, I_a[j] is the j-th feature value of I_a, I_i[j] is the j-th feature value of I_i, J denotes the training data set containing missing values, and w(I_a, I_i) is the sample weight.
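The distance and sample weight described above can be sketched as follows. The sketch assumes the usual form of a weighted Euclidean distance, the square root of the weighted sum of squared feature differences, computed only over the features observed in the missing instance. The function names and the small constant `eps` that guards against division by zero are hypothetical.

```python
import numpy as np

def weighted_distance(inst_a, inst_i, feature_weights):
    """Weighted Euclidean distance over the features observed in inst_a."""
    inst_a = np.asarray(inst_a, dtype=float)
    inst_i = np.asarray(inst_i, dtype=float)
    w = np.asarray(feature_weights, dtype=float)
    observed = ~np.isnan(inst_a)  # skip the features missing in inst_a
    diff = inst_a[observed] - inst_i[observed]
    return float(np.sqrt(np.sum(w[observed] * diff ** 2)))

def sample_weight(distance, eps=1e-12):
    """Inverse of the distance; eps avoids division by zero for identical rows."""
    return 1.0 / (distance + eps)

# Hypothetical instances: the second feature of inst_a is missing.
d = weighted_distance([3.0, np.nan, 0.0], [0.0, 7.0, 4.0], [1.0, 1.0, 1.0])
w = sample_weight(d)
```

Closer complete instances therefore receive larger sample weights and contribute more to the fill value.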

The entropy weight is then calculated and used in the EKNN imputation method; its expression is as follows:

wherein E(I_a, I_i) is the scalar used to compute the entropy weight from the sample weight between I_a and I_i, n^s is the number of source instances, and the result is the entropy weight of the missing instance I_a to be completed.

And estimating the missing value according to the entropy weight of the missing instance to be complemented, wherein the calculation expression is as follows:

A third part: model testing

This stage is designed to bring the construction domain D^c as close as possible to the target domain D^t, that is, to make the model fit the actual situation as closely as possible. After model migration is complete, the test data are used to verify the feasibility of the model.

The analysis process uses a fitness function, which comprises two parts of a complement error and a distribution error, and the calculation process is as follows:

wherein fitness denotes the regression model processing effect; the expression combines the complement error with the distribution error ω(D^t, D^c); T^t is the real target data in the target domain, and the complement error compares it with the predicted target data in the target domain; D^t denotes the target domain; and D^c denotes the construction domain.

λ is a regularization factor. Because the distribution error usually takes larger values and therefore has a stronger influence on the convergence of the model parameters, a relatively small λ is generally chosen to balance the complement error against the distribution error. λ is kept within 0.2 to 0.4, and the specific value needs slight adjustment for the particular model and data.
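The fitness expression is not reproduced in this text. A minimal sketch, assuming the two parts are combined additively with λ scaling the distribution error, is given below; the function name, the default value and the sample error values are hypothetical.

```python
def fitness(complement_error, distribution_error, lam=0.3):
    """Assumed form: complement error plus lam times distribution error."""
    return complement_error + lam * distribution_error

# Hypothetical error values, evaluated across the recommended range of lam.
scores = {lam: fitness(0.12, 0.5, lam) for lam in (0.2, 0.3, 0.4)}
```

Sweeping λ over 0.2 to 0.4 in this way shows how strongly the distribution error would influence the score for a given model and data set.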

The complement error is used for measuring the deviation condition of the predicted target data and the real data, and the expression is as follows:

where RSE denotes the relative square error; the predicted target data are the predictions of the DR model on the target domain for the data to be predicted after the missing values have been completed; the i-th real target data item in the target domain is compared with the model's prediction for the corresponding item; the mean of the real target data in the target domain is also used; and n is the number of target instances.
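The relative square error can be sketched using its standard definition: the sum of squared prediction errors divided by the sum of squared deviations of the real values from their mean. The text does not reproduce the expression, so the assumption here is that this standard definition is the one intended; the function name is hypothetical.

```python
import numpy as np

def relative_squared_error(y_true, y_pred):
    """Sum of squared errors relative to the errors of always predicting the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerator = np.sum((y_true - y_pred) ** 2)
    denominator = np.sum((y_true - y_true.mean()) ** 2)
    return float(numerator / denominator)

# Hypothetical real and predicted target values.
rse = relative_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
baseline = relative_squared_error([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A value below 1 indicates that the model predicts better than the mean of the real data, and 0 indicates a perfect fit.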

Example 2:

the embodiment provides an electric energy data null value processing system which comprises a processor and a storage medium.

The storage medium is for storing instructions.

The processor is operative to perform the steps of the method of embodiment 1 in accordance with the instructions.

Example 3:

the present embodiment provides a computer-readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of the method described in embodiment 1.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. The electric energy data null value processing method is characterized by comprising the following steps of:

acquiring intelligent power data from a power information acquisition system;

dividing intelligent power data, wherein relatively complete data form a source domain, and the rest data containing a missing value form a target domain;

constructing a regression model in a source domain for iterative training, and obtaining an iterative result after training is completed;

solving according to the iteration result to obtain a source data characteristic weight and a source instance weight;

dividing the target domain data into a training data set containing a missing value and a test data set containing a missing value;

mapping the source data characteristic weight and the instance weight to a construction domain through genetic programming to obtain migration data characteristic weight and migration instance weight;

performing missing value data complementation on the training data set containing the missing values by using the migration data weight and the migration instance weight to obtain a complemented training data set;

importing the completed training data set into a regression model, carrying out regression model training in a target domain, importing the test data set containing the missing values into the regression model, carrying out training to obtain a prediction result, and analyzing the processing effect of the regression model according to the prediction result;

model parameter optimization is carried out on the regression model according to the regression model processing effect, and an optimized regression model and parameters thereof are obtained;

2. The electrical energy data null processing method of claim 1, wherein the intelligent power data comprises a time stamp, temperature, humidity, voltage, current and power usage.

3. The electrical energy data null processing method according to claim 1, wherein the process of carrying out the l-th iteration of training of the regression model in the source domain is as follows: constructing a regression model on the source domain

4. The electrical energy data null processing method of claim 1, wherein the expressions of the source data characteristic weights and source instance weights are as follows:

wherein the expression for the j-th source data feature weight involves: r, the total number of iterations; l, the current iteration index; the number of times the j-th source data feature occurs in the regression model of iteration l; the number of times the p-th source data feature occurs in that regression model; and m^s, the number of source data features;

the expression for the i-th source instance weight involves: the expected value of the i-th source instance; the value predicted for the i-th instance by the regression model; the expected value of the o-th source instance; the mean of all source instance expected values; and n^s, the number of source instances.

5. The electrical energy data null processing method of claim 1, wherein mapping the source data feature weights and the instance weights to the training data set containing the missing values by genetic programming to obtain the migration data feature weights and the migration instance weights comprises:

constructing a plurality of genetic programming tree structures;

wherein the expression for the j-th migration data feature weight involves: the p-th migration data feature; M_j, the j-th genetic programming tree structure; the weight of the p-th migration data feature in M_j; and terminal, the terminal set of the specified genetic programming tree, which contains all migration data features in that tree;

the expression for the migration instance weight of the i-th instance involves: the i-th instance weight; I_i, the i-th instance; D^s, the source domain; and n^s, the number of source instances.

6. The electrical energy data null processing method according to claim 5, wherein the migration instance weights are normalized before being used for missing value completion, and the expression is as follows:

7. The electrical energy data null processing method according to claim 1, wherein performing missing value completion on the training data set containing missing values by using the migration data weights and the migration instance weights to obtain a completed training data set comprises:

wherein I_a is the a-th missing instance to be completed, I_i is the i-th instance without missing values, d(I_a, I_i) is the distance between I_a and I_i, the weighting term is the j-th migration data feature weight, I_a[j] is the j-th feature value of I_a, I_i[j] is the j-th feature value of I_i, J denotes the training data set containing missing values, and w(I_a, I_i) is the sample weight; the entropy weight of the missing instance to be completed is obtained from the sample weight, and the expression is as follows:

8. The electrical energy data null processing method according to claim 1, wherein the regression model processing effect is represented by a complement error and a distribution error, and the expression thereof is as follows:

wherein fitness denotes the regression model processing effect; λ is the regularization factor; the expression combines the complement error with the distribution error ω(D^t, D^c); T^t is the real target data in the target domain, and the complement error compares it with the predicted target data in the target domain; D^t denotes the target domain; and D^c denotes the construction domain;

where RSE denotes the relative square error; the predicted target data are the predictions of the DR model on the target domain for the data to be predicted after the missing values have been completed; the i-th real target data item in the target domain is compared with the model's prediction for the corresponding item; the mean of the real target data in the target domain is also used; and n is the number of target instances,

9. The electric energy data null value processing system is characterized by comprising a processor and a storage medium;

the storage medium is used for storing instructions;

the processor being operative according to the instructions to perform the steps of the method according to any one of claims 1 to 8.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1-8.