CN111027612B

CN111027612B - Energy metering data feature reduction method and device based on weighted entropy FCM

Info

Publication number: CN111027612B
Application number: CN201911226594.7A
Authority: CN
Inventors: 孙虹; 董得龙; 卢静雅; 乔亚男; 孔祥玉; 童庆; 李野; 李刚; 杨光; 何泽昊; 季浩; 白涛; 顾强; 赵紫敬; 许迪; 吕伟嘉; 刘浩宇; 张兆杰; 翟术然
Original assignee: Tianjin University; State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Current assignee: Tianjin University; State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Priority date: 2019-12-04
Filing date: 2019-12-04
Publication date: 2024-01-30
Anticipated expiration: 2039-12-04
Also published as: CN111027612A

Abstract

The invention provides an energy metering data feature reduction method and device based on weighted entropy FCM, relating to the technical field of power system automation. The method comprises the following steps: step 1: acquiring data to be processed; step 2: extracting features from the data to be processed using a Gaussian mixture model; step 3: performing reduction processing on the features extracted in step 2 based on the FCM function; step 4: if the data reduction is determined to have converged, finishing feature reduction and exiting the execution logic; if the data reduction has not converged, removing the features with smaller weights by updating the weights, and then executing step 3 again. The invention facilitates accurate, real-time screening of feature data.

Description

Energy metering data feature reduction method and device based on weighted entropy FCM

Technical Field

The invention relates to the technical field of power system automation, in particular to a data feature reduction method and device based on feature weighted entropy FCM.

Background

The power industry has entered the digital information age. Information volume is growing explosively, and the era of electric power big data has arrived with it. While people enjoy the convenience that this information brings, they also face the challenge posed by its sheer volume, namely the usability of the information. Power big data has the "4V" characteristics: large Volume, rapid growth (Velocity), many categories (Variety), and low Value density. Given these characteristics, obtaining decision information that benefits power grid operation from massive, rapidly generated, multi-source, heterogeneous, low-value-density data requires processing the data, that is, finding the inherent relevance among the attribute data and screening out the information useful for power grid decisions. Data reduction is a key link in big data preprocessing: it reduces the amount of data to be processed during big data analysis and improves data processing efficiency.

Traditional data reduction methods include the PCA manifold learning dimension reduction method and the RELIEF algorithm, but both have limitations. The PCA manifold learning dimension reduction method cannot resolve the uncertainty in mapping data points from the high-dimensional space to the low-dimensional space, and cannot give the corresponding inverse mapping, so it can only be used to reveal the latent low-dimensional structure of a data set. The RELIEF algorithm can process both discrete and continuous data, but it is not applicable to all types of problems, does not consider missing data, and cannot remove redundant features. It therefore cannot cope with highly complex probability calculations, and its probability values are difficult to estimate accurately.

Therefore, in view of the above drawbacks, there is a need to develop a new data feature reduction method.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, and provides an energy metering data feature reduction method and device with high accuracy and good real-time performance based on weighted entropy FCM, so as to solve the technical problem that the existing data processing mode cannot effectively screen out feature data.

In order to achieve the above purpose, the present invention provides the following technical solutions:

the data feature reduction method based on the feature weighted entropy FCM provided by the invention comprises the following steps:

step 1: acquiring data to be processed;

step 2: extracting data information characteristics of the data to be processed by adopting a Gaussian mixture model;

step 3: performing reduction processing on the features extracted in the step 2 based on the FCM function;

step 4: if the data reduction is determined to have converged, finishing feature reduction and exiting the execution logic; if the data reduction has not converged, removing the features with smaller weights by a weight updating method and then performing the reduction processing again, that is, repeating step 3.

Furthermore, a preprocessing step is included between step 1 and step 2, comprising:

(1) Judging abnormal data of the data to be processed by adopting a 3 sigma principle and correcting the abnormal data;

(2) Establishing an interpolation function from the known points, and filling the missing values in the corrected data by interpolation;

(3) Applying a smoothing technique to the corrected and interpolated data set to reduce data noise.

The specific method for extracting the data features in step 2 is as follows:

GMM modeling is carried out on the data, the best-fitting GMM parameters are calculated, and the GMM model parameters are extracted as the feature variables of the original data. The GMM model is:

$$p(x \mid \lambda) = \sum_{i=1}^{M} \pi_i \, p_i(x; \mu_i, \Sigma_i), \qquad p_i(x; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_i)^{T} \Sigma_i^{-1} (x-\mu_i)\right)$$

wherein x is the input sample variable; $\lambda$ denotes the set of model parameters; M is the number of Gaussian components; $\pi_i$ is the mixing weight, with $\sum_{i=1}^{M} \pi_i = 1$; d is the dimension of the input variable; $p_i(x; \mu_i, \Sigma_i)$ is the ith Gaussian component of the GMM; $\mu_i$ is the mean of the ith Gaussian component; and $\Sigma_i$ is the covariance matrix of the ith Gaussian component.
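The mixture density can be illustrated with a minimal sketch. It assumes one-dimensional inputs (d = 1), so that each covariance matrix reduces to a scalar variance; the function names `gaussian_pdf` and `gmm_pdf` and the example parameter values are hypothetical and chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Mixture density: sum over components of pi_i * N(x; mu_i, sigma_i^2)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing weights must sum to 1")
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Illustrative two-component mixture (values are assumptions, not from the source).
weights, means, variances = [0.3, 0.7], [0.0, 5.0], [1.0, 2.0]
density_at_zero = gmm_pdf(0.0, weights, means, variances)
```

In this reading, the fitted triples (weight, mean, variance) of all components are what would be collected as the feature variables of the original data.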

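The specific method for the feature reduction processing in step 3 is as follows:

FCM is improved so that features are reduced by means of a feature weighted entropy. The objective function is:

$$J(U, V, W) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \sum_{j=1}^{d} \delta_j w_j (x_{ij} - v_{kj})^2 + \frac{n}{c} \sum_{j=1}^{d} w_j \ln(\delta_j w_j)$$

wherein: $\sum_{k=1}^{c} u_{ik} = 1$ and $\sum_{j=1}^{d} w_j = 1$; $x_{ij}$ is the jth feature of the ith data point; $v_{kj}$ is the jth component of the kth cluster center; m is the fuzzifier; $\delta_j$ is used to control the feature weights; $w_j$ is the feature weight of the jth feature; W is the matrix formed by the feature weights; U is the membership matrix; and V is the set of cluster centers.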
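As a rough illustration, the sketch below evaluates a feature-weighted-entropy objective of the form J = Σ_i Σ_k u_ik^m Σ_j δ_j w_j (x_ij − v_kj)² + (n/c) Σ_j w_j ln(δ_j w_j). This form, the fuzzifier `m`, and the function name `frfcm_objective` are assumptions made for the example; the data are invented.

```python
import math


def frfcm_objective(data, memberships, centers, weights, deltas, m=2.0):
    """Evaluate the assumed FRFCM objective for given U, V, W and delta."""
    n, c = len(data), len(centers)
    distance_term = 0.0
    for i, x in enumerate(data):
        for k, v in enumerate(centers):
            # Feature-weighted squared distance between point i and center k.
            weighted_dist = sum(dl * w * (xj - vj) ** 2
                                for dl, w, xj, vj in zip(deltas, weights, x, v))
            distance_term += memberships[i][k] ** m * weighted_dist
    # Entropy-like term; features with zero weight contribute nothing.
    entropy_term = sum(w * math.log(dl * w)
                       for dl, w in zip(deltas, weights) if w > 0.0)
    return distance_term + (n / c) * entropy_term


# Two points that sit exactly on two centers: only the entropy term remains.
example_value = frfcm_objective(
    data=[[0.0, 0.0], [1.0, 1.0]],
    memberships=[[1.0, 0.0], [0.0, 1.0]],
    centers=[[0.0, 0.0], [1.0, 1.0]],
    weights=[0.5, 0.5],
    deltas=[1.0, 1.0],
)
```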

The invention also provides an energy metering data feature reduction device based on the weighted entropy FCM, which comprises:

the data acquisition unit is used for acquiring data to be processed;

the feature extraction unit is used for carrying out feature extraction processing on the data to be processed acquired by the data acquisition unit based on the Gaussian mixture model;

the reduction unit is used for performing reduction processing, based on the FCM function, on the features extracted by the feature extraction unit; if the data reduction is determined to have converged, it finishes feature reduction and exits the execution logic; if the data reduction has not converged, it removes the features with smaller weights by a weight updating method and returns to the reduction processing.

The invention has the advantages and beneficial effects that:

1. The invention adopts a feature-reduction fuzzy C-means clustering (FRFCM) method based on weighted entropy. Adding the weighted entropy to the FRFCM objective function automatically reduces the number of features and produces a good clustering effect.

2. The FRFCM algorithm embeds feature reduction with feature weights into the FCM process. This improves the performance of FCM: important features are selected through weighting, and unimportant features are discarded, which reduces the number of features and achieves feature reduction.

3. The calculation method adopted by the invention can utilize the reduced important features to perform clustering, can ensure the accuracy and simultaneously give consideration to the real-time performance, and can greatly reduce the calculation time.

4. The calculation method selected by the invention can estimate the parameters used in the FRFCM objective function in the learning process, so that the FRFCM algorithm is not limited by parameter selection, and the use is more flexible and convenient.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a process flow diagram of the present invention;

FIG. 2 is a flow chart of step 2 of the present invention;

FIG. 3 is a flow chart of step 3 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described in detail below. The described embodiments are only some, but not all, embodiments of the invention. All other embodiments derived from the examples herein that fall within the scope defined by the claims are within the protection scope of the invention.

Embodiments of the invention are described in further detail below with reference to the attached drawing figures:

an energy metering data characteristic reduction method based on weighted entropy FCM comprises the following steps:

step 1: acquiring data to be processed; the data to be processed is energy metering data;

the specific method for extracting the characteristics in the step 2 is as follows:

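In use, the parameters of the GMM model are generally estimated by maximum likelihood. For training samples $X = \{x_1, \ldots, x_N\}$ (N is the number of samples and M the number of Gaussian components), the likelihood function is:

$$L(\lambda \mid X) = p(X \mid \lambda) = \prod_{t=1}^{N} p(x_t \mid \lambda)$$

The purpose of training is to find a set of parameters $\hat{\lambda}$ such that the likelihood function $L(\lambda \mid X)$ takes its maximum value, namely:

$$\hat{\lambda} = \arg\max_{\lambda} L(\lambda \mid X)$$

Maximum likelihood estimation is a nonlinear optimization problem, generally solved iteratively by the expectation-maximization (EM) algorithm, which has two steps: the expectation step (E-step) and the maximization step (M-step). The E-step calculates the expected value of the likelihood function of the complete data using the current parameter set; the M-step obtains new parameters by maximizing the expected function. The E-step and the M-step are iterated until convergence.

To illustrate the EM algorithm, first define the Q function:

$$Q(\lambda, \hat{\lambda}) = E\left[\ln p(X, y \mid \hat{\lambda}) \mid X, \lambda\right]$$

where y is a hidden variable indicating the index of the Gaussian component that generated the sample (y = i if the sample was generated by the ith Gaussian component), $\lambda$ is the existing model parameter, and $\hat{\lambda}$ is the new parameter to be calculated.

The Q function is therefore rewritten as follows:

$$Q(\lambda, \hat{\lambda}) = \sum_{t=1}^{N} \sum_{i=1}^{M} p(i \mid x_t, \lambda) \, \ln\left[\hat{\pi}_i \, p_i(x_t; \hat{\mu}_i, \hat{\Sigma}_i)\right]$$

After defining the Q function, an initial value $\lambda^{(0)}$ and an iteration termination condition must be set for the EM algorithm. The solving process is briefly as follows:

1) E-step: calculate the probability $p(i \mid x_t, \lambda)$ that a sample in the training data set falls in the assumed hidden state y = i:

$$p(i \mid x_t, \lambda) = \frac{\pi_i \, p_i(x_t; \mu_i, \Sigma_i)}{\sum_{k=1}^{M} \pi_k \, p_k(x_t; \mu_k, \Sigma_k)}$$

2) M-step: maximize the Q function by setting the partial derivatives of $Q(\lambda, \hat{\lambda})$ with respect to $\pi_i$, $\mu_i$, $\Sigma_i$ (i = 1, 2, …, M) to zero, which gives the updated $\pi_i$, $\mu_i$, $\Sigma_i$.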

The features extracted by the method can be used for feature reduction processing.
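The E-step and M-step described above can be sketched for the one-dimensional case. This is a minimal illustration under stated assumptions: scalar variances instead of covariance matrices, a log-likelihood change as the termination condition, and a small variance floor for numerical safety. The function name `em_gmm_1d`, the tolerances, and the sample values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, weights, means, variances, max_iter=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture by EM, starting from the given initial values."""
    weights, means, variances = list(weights), list(means), list(variances)
    n, m = len(data), len(weights)
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: posterior probability that each sample came from component i.
        resp, ll = [], 0.0
        for x in data:
            joint = [weights[i] * gaussian_pdf(x, means[i], variances[i]) for i in range(m)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        # M-step: closed-form updates of weights, means and variances.
        for i in range(m):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / n
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var_i = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var_i, 1e-6)  # floor is an assumption of this sketch
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances


# Invented samples forming two well-separated groups around 0 and 10.
samples = [-0.5, 0.0, 0.5, 0.2, -0.2, 9.5, 10.0, 10.5, 10.2, 9.8]
fit_weights, fit_means, fit_vars = em_gmm_1d(samples, [0.5, 0.5], [1.0, 8.0], [1.0, 1.0])
```

The fitted weights, means and variances returned here play the role of the extracted feature variables in this simplified setting.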

It should be noted that the data to be processed may contain missing or erroneous values. In that case, a preprocessing step may be performed before feature extraction (between step 1 and step 2). The preprocessing step includes:

(1) Identifying abnormal data in the data to be processed using the 3σ principle and correcting it. In the judgment formula, ε is a threshold, typically 1 to 1.5. In the correction formula, $p'_{i,t}$ is the user's corrected load data at time t on day i; α, β, γ are weight coefficients satisfying α + β + γ = 1; and the remaining two terms are the observed daily loads closest to $p_{i,t}$.
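A 3σ-type check with a threshold factor can be sketched as follows. The criterion used here, |p − mean| > 3·ε·σ computed over all observation days at one time of day, is an assumed form of the rule, since the exact judgment formula is not legible in this text; the function name `flag_anomalies` and the load values are hypothetical.

```python
import statistics


def flag_anomalies(loads, epsilon=1.0):
    """Flag loads whose deviation from the mean exceeds 3 * epsilon * sigma."""
    mean = statistics.fmean(loads)
    sigma = statistics.pstdev(loads)  # population standard deviation (assumption)
    return [abs(p - mean) > 3.0 * epsilon * sigma for p in loads]


# Invented loads at one time of day over 20 observation days, with one spike.
loads_at_t = [100.0] * 19 + [500.0]
flags = flag_anomalies(loads_at_t, epsilon=1.0)
```

With a larger ε the test becomes more tolerant: in this example the spike is flagged at ε = 1.0 but not at ε = 1.5.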

(2) Establishing an interpolation function from the known points, and filling the missing values in the corrected data by interpolation. The specific method is as follows:

Missing values are handled by cubic spline interpolation. Given the values $y_i = f(x_i)$ (i = 0, 1, …, n) of a function y = f(x) at n+1 nodes $a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b$ on the interval [a, b], an interpolation function S(x) is sought such that:

A: $S(x_i) = y_i$ (i = 0, 1, …, n);

B: on each subinterval $[x_i, x_{i+1}]$ (i = 0, 1, …, n-1), S(x) is a cubic polynomial, denoted $S_i(x)$;

C: S(x) is twice continuously differentiable on [a, b].

The function S(x) is then called the cubic spline interpolation function of f(x).

Condition B can be expressed as:

$$S(x) = \{ S_i(x),\ x \in [x_i, x_{i+1}],\ i = 0, 1, \ldots, n-1 \}, \qquad S_i(x) = a_i x^3 + b_i x^2 + c_i x + d_i$$

wherein $a_i, b_i, c_i, d_i$ are the undetermined coefficients, 4n in total.

From condition C, at each interior node:

$$S(x_i - 0) = S(x_i + 0), \quad S'(x_i - 0) = S'(x_i + 0), \quad S''(x_i - 0) = S''(x_i + 0), \qquad i = 1, 2, \ldots, n-1$$

Together with the n+1 interpolation conditions in A, this gives 4n-2 equations in total, so two boundary conditions are needed to determine the 4n undetermined parameters of S(x).

Three types of boundary condition are commonly used for cubic splines:

1) $S'(a) = y'_0$, $S'(b) = y'_n$. The spline established with this boundary condition is called the complete spline interpolation function of f(x). In particular, when $y'_0 = y'_n = 0$, the spline curve is horizontal at the end points.

2) $S''(a) = y''_0$, $S''(b) = y''_n$. In particular, $y''_0 = y''_n = 0$ is called the natural boundary condition.

3) $S'(a+0) = S'(b-0)$, $S''(a+0) = S''(b-0)$. This is called the periodic condition.

Any one of these boundary conditions is selected according to the actual situation.

The interpolation substitution process is to supplement the missing values in the data.
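Filling a missing value with a cubic spline can be sketched as below. The sketch assumes the natural boundary condition (second derivative zero at both ends) and uses the standard tridiagonal solution, with each piece written in the shifted form y_j + b_j (x − x_j) + c_j (x − x_j)² + d_j (x − x_j)³, which is equivalent to the a_i x³ + b_i x² + c_i x + d_i form. The function name `natural_cubic_spline` and the hourly load values are hypothetical.

```python
def natural_cubic_spline(xs, ys):
    """Return S(x) interpolating (xs, ys) with S''(xs[0]) = S''(xs[-1]) = 0."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward sweep of the tridiagonal system for the second-derivative terms.
    diag = [1.0] + [0.0] * n
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]
    # Back substitution for the coefficients of each cubic piece.
    b, c, d = [0.0] * n, [0.0] * (n + 1), [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def spline(x):
        j = n - 1
        for i in range(n):  # locate the subinterval containing x
            if x <= xs[i + 1]:
                j = i
                break
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return spline


# Invented hourly loads with the reading at hour 3 missing.
known_hours = [0.0, 1.0, 2.0, 4.0, 5.0]
known_loads = [1.0, 2.0, 1.5, 3.0, 2.5]
spline = natural_cubic_spline(known_hours, known_loads)
filled_value = spline(3.0)
```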

(3) A smoothing technique is applied to the corrected and interpolated data to reduce data noise. In the smoothing formula, $p_{i,t}$ is the user's original load data at time t on day i; n is the total number of observation days; and the remaining two quantities are the user's mean load and variance at time t, respectively.

In the preprocessing step, the operation sequences of (1) and (2) can be interchanged, namely, the missing value in the data can be interpolated to replace the processing, and then the correction processing can be carried out on the abnormal data in the obtained data.

Step 3: the extracted features are reduced according to the fuzzy C-means clustering mode of the feature weighted entropy, namely the features extracted in the step 2 are reduced based on the FCM function;

the specific method for the reduction processing in step 3 is to minimize the FRFCM objective function:

$$J(U, V, W) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \sum_{j=1}^{d} \delta_j w_j (x_{ij} - v_{kj})^2 + \frac{n}{c} \sum_{j=1}^{d} w_j \ln(\delta_j w_j)$$

wherein $\sum_{k=1}^{c} u_{ik} = 1$ and $\sum_{j=1}^{d} w_j = 1$; $x_{ij}$ is the jth feature of the ith data point, $v_{kj}$ is the jth component of the kth cluster center, $u_{ik}$ is the membership of data point i in cluster k, m is the fuzzifier, and $\delta_j$ is used to control the feature weights.

The FRFCM objective function has two terms. The first term is the sum of the feature-weighted distances between the data points and the cluster centers; it is smallest when the data points are close to their centers. The second term is a variant of the feature-weight entropy $-\sum_{j} w_j \ln w_j$, namely $\sum_{j} w_j \ln(\delta_j w_j)$. Because $\delta_j$ appears in both terms as the variable that controls the feature weights, a good estimate of $\delta_j$ is important for the calculation result to meet the requirement.

To estimate $\delta_j$ more conveniently, the invention also provides a learning procedure.

In probability theory and statistics, standard deviation and variance are used to measure the dispersion of data. Another well-known dispersion index is the variance-to-mean ratio (VMR), defined as $\mathrm{VMR} = \sigma^2 / \mu$. VMR can be used to tell whether a data set is scattered or aggregated: the smaller the dispersion, the closer the data are to the cluster center; the larger the dispersion, the farther the data are from the cluster center. Because features with small dispersion must be kept and features with large dispersion discarded, the inverse of VMR, the mean-to-variance ratio (MVR), is used in the algorithm. That is, $\delta_j$ is estimated as:

$$\delta_j = \left(\frac{\operatorname{mean}(x)}{\operatorname{var}(x)}\right)_j$$

So that $\delta_j$ can play its screening role, it is normalized to $\delta_j' = \delta_j / \max_j(\delta_j)$, so that $\delta_j' \in (0, 1)$, and $\delta_j'$ replaces the original $\delta_j$.
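The mean-to-variance estimate and its normalization can be sketched as follows. The sketch assumes positive feature means and nonzero variances, and uses the population variance; the function name `estimate_delta` and the data are hypothetical. Note that after dividing by the maximum, the largest normalized value equals 1.

```python
import statistics


def estimate_delta(data):
    """Return normalized mean-to-variance ratios, one per feature column."""
    d = len(data[0])
    deltas = []
    for j in range(d):
        column = [row[j] for row in data]
        # MVR = mean / variance, the inverse of the variance-to-mean ratio.
        deltas.append(statistics.fmean(column) / statistics.pvariance(column))
    largest = max(deltas)
    return [delta / largest for delta in deltas]


# Invented data: feature 0 is tightly grouped, feature 1 is widely spread.
example_data = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
normalized_deltas = estimate_delta(example_data)
```

Here the tightly grouped feature receives the larger value, which matches the stated intent of favoring features with small dispersion.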

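The FRFCM calculation is carried out in three minimization steps. In the following, $x_{ij}$ is the jth feature of the ith data point, $v_{kj}$ is the jth component of the kth cluster center, $u_{ik}$ is the membership of data point i in cluster k, and m is the fuzzifier.

The first step: fix W and V, then minimize J(U, V, W) with respect to U. Consider the Lagrangian function:

$$\tilde{J}(U, \eta) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \sum_{j=1}^{d} \delta_j w_j (x_{ij} - v_{kj})^2 + \frac{n}{c} \sum_{j=1}^{d} w_j \ln(\delta_j w_j) - \sum_{i=1}^{n} \eta_i \left( \sum_{k=1}^{c} u_{ik} - 1 \right)$$

Setting the partial derivative of the Lagrangian function with respect to $u_{ik}$ to zero gives:

$$m \, u_{ik}^{m-1} \sum_{j=1}^{d} \delta_j w_j (x_{ij} - v_{kj})^2 - \eta_i = 0$$

thereby obtaining the update equation of $u_{ik}$:

$$u_{ik} = \frac{\left[ \sum_{j=1}^{d} \delta_j w_j (x_{ij} - v_{kj})^2 \right]^{-\frac{1}{m-1}}}{\sum_{s=1}^{c} \left[ \sum_{j=1}^{d} \delta_j w_j (x_{ij} - v_{sj})^2 \right]^{-\frac{1}{m-1}}}$$

The second step: fix W and U, then minimize J(U, V, W) with respect to V, which gives:

$$\frac{\partial J}{\partial v_{kj}} = -2 \sum_{i=1}^{n} u_{ik}^{m} \delta_j w_j (x_{ij} - v_{kj}) = 0$$

thereby obtaining the update equation of $v_{kj}$:

$$v_{kj} = \frac{\sum_{i=1}^{n} u_{ik}^{m} x_{ij}}{\sum_{i=1}^{n} u_{ik}^{m}}$$

The third step: fix U and V, then minimize J(U, V, W) with respect to W under the constraint $\sum_{j=1}^{d} w_j = 1$ with multiplier $\theta$, which gives:

$$\sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \delta_j (x_{ij} - v_{kj})^2 + \frac{n}{c} \left[ \ln(\delta_j w_j) + 1 \right] - \theta = 0$$

thereby obtaining the update equation of $w_j$:

$$w_j = \frac{\dfrac{1}{\delta_j} \exp\left( -\dfrac{c}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \delta_j (x_{ij} - v_{kj})^2 \right)}{\sum_{s=1}^{d} \dfrac{1}{\delta_s} \exp\left( -\dfrac{c}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \delta_s (x_{is} - v_{ks})^2 \right)}$$

In addition, to maintain the constraint $\sum_{j=1}^{d} w_j = 1$, $w_j$ must be adjusted to $w_j' = w_j / \sum_{s=1}^{d} w_s$.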
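One pass through the three minimization steps can be sketched as follows. The update rules used are the ones that follow from an objective of the assumed form Σ u_ik^m Σ δ_j w_j (x_ij − v_kj)² + (n/c) Σ w_j ln(δ_j w_j); that form, the fuzzifier `m`, the small distance floor, and the function name `frfcm_step` are assumptions of this sketch, and the data are invented.

```python
import math


def frfcm_step(data, centers, weights, deltas, m=2.0):
    """Run one U-update, V-update and W-update; return (U, V, W)."""
    n, c, d = len(data), len(centers), len(weights)
    # Step 1: memberships from feature-weighted squared distances.
    memberships = []
    for x in data:
        dist = [max(sum(deltas[j] * weights[j] * (x[j] - v[j]) ** 2 for j in range(d)), 1e-12)
                for v in centers]
        inv = [dk ** (-1.0 / (m - 1.0)) for dk in dist]
        total = sum(inv)
        memberships.append([val / total for val in inv])
    # Step 2: cluster centers as membership-weighted means.
    new_centers = []
    for k in range(c):
        um = [memberships[i][k] ** m for i in range(n)]
        denom = sum(um)
        new_centers.append([sum(um[i] * data[i][j] for i in range(n)) / denom
                            for j in range(d)])
    # Step 3: feature weights from the exponential update, then normalization.
    raw = []
    for j in range(d):
        cost = sum(memberships[i][k] ** m * deltas[j] * (data[i][j] - new_centers[k][j]) ** 2
                   for i in range(n) for k in range(c))
        raw.append(math.exp(-(c / n) * cost) / deltas[j])
    total = sum(raw)
    new_weights = [r / total for r in raw]
    return memberships, new_centers, new_weights


# Invented data: feature 0 separates two groups, feature 1 is scattered.
points = [[0.0, 5.0], [0.2, 1.0], [4.0, 3.0], [4.2, 7.0]]
memberships, new_centers, new_weights = frfcm_step(
    points, centers=[[0.5, 4.0], [3.5, 4.0]], weights=[0.5, 0.5], deltas=[1.0, 1.0])
```

In this example the weight of the scattered feature drops sharply after a single pass, which is the behavior that later allows it to be discarded.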

It should be noted that, because the FRFCM algorithm performs feature reduction, insignificant features (that is, features with small weights) must be culled during actual data processing. For features to be culled properly, a suitable scaling constant is needed. If the sum $\sum_{i}\sum_{k} u_{ik}^{m} \delta_j (x_{ij} - v_{kj})^2$ in the exponent of the update equation of $w_j$ is too large, the exponential factor in the numerator becomes so small that it approaches 0, and too many feature weights are lost in the update step. Conversely, if this sum is too small, the exponential factor in the numerator approaches 1, and it becomes difficult to discard features in the update step. To avoid both cases, a suitable constant is set in the FRFCM clustering algorithm. One goal of the FRFCM clustering algorithm is to cluster a data set of n data points into c clusters. The numbers n and c are generally given constants, so the constant n/c can be used to scale this sum and prevent the extreme cases above. Recalling the objective function J(U, V, W), the constant n/c is therefore applied to the entropy term $\sum_{j} w_j \ln(\delta_j w_j)$.

In order to determine whether the feature reduction converges, a certain threshold value needs to be set for determination.

The data set to be processed has n data points, each with d feature components, and the constraint on the feature weights is $\sum_{j=1}^{d} w_j = 1$. If d is large, the feature reduction threshold is intuitively chosen to be 1/d.

To further improve the applicability of the algorithm, the number of data points n must be considered as another influencing factor, because the value of d may be small. It is known that $1/d = 1/\sqrt{d \cdot d}$. To balance between a large d and a small d, one d is replaced by n, so that the threshold becomes $1/\sqrt{nd}$. When the number of data points n is much larger than the number of feature components d, the threshold 1/d is inaccurate, and using $1/\sqrt{nd}$ in place of 1/d is more accurate.
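The threshold choice and the culling of small-weight features can be sketched as follows. The rule used here, 1/d when n ≤ d and 1/√(nd) when n > d, follows the description above; the function names `reduction_threshold` and `discard_small_features`, the renormalization of the kept weights, and the example weights are assumptions of this sketch.

```python
import math


def reduction_threshold(n, d):
    """Threshold on feature weights: 1/d if n <= d, otherwise 1/sqrt(n*d)."""
    return 1.0 / d if n <= d else 1.0 / math.sqrt(n * d)


def discard_small_features(weights, n):
    """Drop features whose weight is below the threshold and renormalize."""
    d = len(weights)
    threshold = reduction_threshold(n, d)
    kept = [j for j, w in enumerate(weights) if w >= threshold]
    total = sum(weights[j] for j in kept)
    return kept, [weights[j] / total for j in kept]


# Invented weights for d = 4 features and n = 100 data points (threshold 0.05).
kept_indices, kept_weights = discard_small_features([0.6, 0.37, 0.02, 0.01], n=100)
```

If no feature falls below the threshold in a pass, the set of kept features stops changing, which is one way to read the convergence test on feature reduction.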

the data acquisition unit is used for acquiring data to be processed;

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims

1. The energy metering data characteristic reduction method based on the weighted entropy FCM is characterized by comprising the following steps of:

step 1: acquiring data to be processed;

step 2: carrying out feature extraction on the data information of the data to be processed by adopting a Gaussian mixture model;

step 4: if the data reduction is determined to have converged, finishing feature reduction and exiting the execution logic; if the data reduction has not converged, removing the features with smaller weights by updating the weights, and then executing step 3 again to perform reduction processing;

the data information feature extraction of the data to be processed by adopting a Gaussian mixture model comprises the following steps:

GMM modeling is conducted on the data to be processed, optimal GMM fitting parameters are calculated, and GMM model parameters are extracted to serve as feature variables of original data to conduct feature extraction;

the built GMM model is as follows:

$$p(x \mid \lambda) = \sum_{i=1}^{M} \pi_i \, p_i(x; \mu_i, \Sigma_i)$$

wherein x is the input sample variable; $\lambda$ denotes the set of model parameters; M is the number of Gaussian components; $\pi_i$ is the mixing weight, with $\sum_{i=1}^{M} \pi_i = 1$; d is the dimension of the input variable; $p_i(x; \mu_i, \Sigma_i)$ is the ith Gaussian component of the GMM; $\mu_i$ is the mean of the ith Gaussian component; and $\Sigma_i$ is the covariance matrix of the ith Gaussian component;

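the performing of reduction processing on the features extracted in step 2 based on the FCM function comprises minimizing the objective function:

$$J(U, V, W) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \sum_{j=1}^{d} \delta_j w_j (x_{ij} - v_{kj})^2 + \frac{n}{c} \sum_{j=1}^{d} w_j \ln(\delta_j w_j)$$

wherein $\sum_{k=1}^{c} u_{ik} = 1$ and $\sum_{j=1}^{d} w_j = 1$; $x_{ij}$ is the jth feature of the ith data point; $v_{kj}$ is the jth component of the kth cluster center; m is the fuzzifier;

$\delta_j$ is used to control the feature weights, $w_j$ is the feature weight of the jth feature, W is the matrix formed by the feature weights, U is the membership matrix, and V is the set of cluster centers.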

2. The energy metering data feature reduction method based on weighted entropy FCM according to claim 1, further comprising a preprocessing step between the step 1 and the step 2, wherein the preprocessing step comprises:

(1) Abnormal data judgment is carried out on the data to be processed by adopting a 3 sigma principle, and the abnormal data is corrected;

3. The energy metering data feature reduction method based on weighted entropy FCM according to claim 1, wherein the estimated value of $\delta_j$ is $\delta_j = \left(\operatorname{mean}(x) / \operatorname{var}(x)\right)_j$.

4. The weighted entropy FCM-based energy metering data feature reduction method of claim 1, wherein δ _j ' replace the delta _j Wherein delta' _j ＝δ _j /max(δ _j ) And delta' _j ∈(0，1)。

5. The method for feature reduction of energy metering data based on weighted entropy FCM according to claim 1, wherein the constraint condition of the feature weights is $\sum_{j=1}^{d} w_j = 1$.

6. The method for feature reduction of energy metering data based on weighted entropy FCM according to claim 5, wherein when the number of data points in the data set satisfies n ≤ d, the reduction threshold of the feature weights is set to 1/d;

when n > d, the reduction threshold of the feature weights is set to $1/\sqrt{nd}$. The reduction threshold is used to determine whether the feature reduction has converged.

7. The reduction device of the weighted entropy FCM-based energy metering data feature reduction method according to any one of claims 1 to 6, comprising:

the data acquisition unit is used for acquiring data to be processed;

the reduction unit is used for carrying out reduction processing on the features extracted by the feature extraction unit based on the FCM function, and completing feature reduction under the condition that the convergence of data reduction is determined, and exiting from the execution logic;

and, if the data reduction has not converged, removing the features with smaller weights by a weight updating method and returning to the reduction processing.