CN115409317A

CN115409317A - Transformer area line loss detection method and device based on feature selection and machine learning

Info

Publication number: CN115409317A
Application number: CN202210809650.5A
Authority: CN
Inventors: 刘度度; 周钢; 任盛; 付兵权; 肖坤; 吴邦飞; 夏赞; 刘谋海
Original assignee: State Grid Corp of China SGCC; State Grid Hunan Electric Power Co Ltd; Zhangjiajie Power Supply Co of State Grid Hunan Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Hunan Electric Power Co Ltd; Zhangjiajie Power Supply Co of State Grid Hunan Electric Power Co Ltd
Priority date: 2022-07-11
Filing date: 2022-07-11
Publication date: 2022-11-29

Abstract

The invention discloses a transformer area line loss detection method and device based on feature selection and machine learning. The method comprises the following steps: S01, extracting original electrical characteristic indexes from historical data samples of the transformer area, and constructing an electrical characteristic index set; S02, clustering the data in the electrical characteristic index set, and calculating the degree of correlation between each electrical characteristic index and the line loss rate; S03, selecting a final electrical characteristic index subset from the electrical characteristic index set according to the clustering result and the correlation result; and S04, inputting the electrical characteristic index subset into a machine learning model for training to obtain a transformer area line loss detection model, thereby realizing transformer area line loss detection. The invention is simple to implement and low in cost, and balances detection efficiency and detection precision.

Description

Transformer area line loss detection method and device based on feature selection and machine learning

Technical Field

The invention relates to the technical field of transformer area monitoring, in particular to a transformer area line loss detection method and device based on feature selection and machine learning.

Background

Reducing power grid losses is an important technical measure for energy conservation and emission reduction. The line loss rate is defined as the ratio of the line loss energy to the supplied energy, and line loss rate detection is an important means of setting loss reduction targets and predicting carbon emission trends.

Traditional theoretical line loss calculation methods include the average current method, the equivalent resistance method, the maximum current method and the like. These methods usually rely on a series of assumptions to simplify the network. They are simple to compute but have low accuracy, and are suitable only where precision requirements are low. With the rapid development of artificial intelligence research, many machine learning algorithms have also been applied to line loss calculation and prediction. Line loss prediction based on machine learning generally involves two main steps: feature selection and extraction, and construction of a prediction model. Feature extraction mainly uses grey correlation analysis, and the prediction model is mainly a neural network. Neural networks offer good generalization and strong nonlinear mapping capability, but their network structure and parameters are difficult to select, and convergence is slow when the parameters are complex. Prior-art machine learning line loss prediction methods therefore emphasize model optimization, improving the neural network structure by introducing optimization algorithms so as to improve prediction performance.

However, prior-art machine learning line loss prediction methods neglect feature selection and extraction, and simply extract conventional features such as load rate, user load and power supply amount directly. In fact, feature selection and extraction take up most of the time (about 80%) of the transformer area line loss prediction process, while model construction takes only a small proportion (about 20%). Feature selection and extraction are therefore the key factor determining the execution efficiency of transformer area line loss prediction and affecting its accuracy. In the prior art, directly extracting a fixed set of features makes it difficult to obtain accurate predictions. To improve accuracy, the number of features has to be increased greatly, for example by using multi-dimensional, multi-level electrical characteristic indexes as inputs to the prediction model. Because the correlation between the features and the line loss rate is not considered, many redundant features are treated as key indexes, which makes the model complex to build and slow to train, and reduces both prediction efficiency and accuracy.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the technical problems in the prior art, the invention provides a transformer area line loss detection method and device based on feature selection and machine learning that are simple to implement, low in cost, and able to balance detection efficiency and detection precision.

In order to solve the technical problems, the technical scheme provided by the invention is as follows:

A transformer area line loss detection method based on feature selection and machine learning comprises the following steps:

S01, extracting original electrical characteristic indexes from a historical data sample set of a transformer area, and constructing an electrical characteristic index set;

S02, clustering each data sample in the historical data sample set of the transformer area, and calculating the degree of correlation between each electrical characteristic index and the line loss rate;

S03, selecting a final electrical characteristic index subset from the electrical characteristic index set according to the clustering result and the correlation result;

S04, inputting the electrical characteristic index subset into a machine learning model for training to obtain a transformer area line loss detection model, thereby realizing transformer area line loss detection.

Further, in step S01, the power supply radius, load rate and line type are selected as electrical characteristic indexes of the transformer area, and a plurality of original electrical characteristic indexes are constructed, including: on-grid energy ratio, end energy ratio, head-to-end voltage drop, power factor, load rate, load shape coefficient, three-phase unbalance degree, power supply radius, grid structure, total number of users in the transformer area, and power supply amount.

Further, step S01 includes performing a visual analysis of each electrical characteristic index for data cleansing. The visual analysis uses skewness and kurtosis to characterize the distribution of each electrical characteristic index, where skewness represents the asymmetry of the probability distribution of a random variable and kurtosis represents the steepness of that distribution. Pearson correlation coefficients are used to calculate the correlation among the electrical characteristic indexes and between each index and the line loss rate.

Further, step S01 also includes normalizing the data values of the electrical characteristic indexes, where the normalization formula is:

$$\hat{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}$$

where $\hat{x}_{ij}$ is the normalized result of the $i$-th data value of feature $j$, $x_{ij}$ is the $i$-th data sample value of feature $j$, $\bar{x}_j$ is the mean of feature $j$, i.e. $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, and $\sigma_j$ is the standard deviation of feature $j$, i.e. $\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^2}$, with $n$ the number of data samples.

Further, in step S02, an improved k-means clustering algorithm is used to cluster the data samples, and the steps include:

calculating the transformer area performance index $P_E$ of each data sample, where $P_E$ is defined as follows:

wherein $x_{ij}$ is the $i$-th data sample value of feature $j$;

sorting the data samples in ascending order of $P_E$ to obtain a $P_E$ sorting result;

selecting a number of cluster centers $k$, dividing the data samples equally into $k$ parts according to the $P_E$ sorting result, selecting the center of each part as an initial cluster center, and executing the k-means clustering algorithm;

calculating the silhouette coefficient of the clustering result to check the clustering effect, and reselecting the number of cluster centers $k$ if the effect is not satisfactory.

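Further, in step S03, the maximal information coefficient (MIC) between each electrical characteristic index and the line loss rate is calculated to obtain the degree of correlation between them;

the MIC between variables $X$ and $Y$ is calculated as:

$$\mathrm{MIC}(X;Y) = \max_{|X||Y|<B}\frac{I(X;Y)}{\log_2\min(|X|,|Y|)}$$

wherein $|X|$ and $|Y|$ are the numbers of grid cells along the two axes, $B = n^{0.6}$ with $n$ the number of data samples, and $I(X;Y)$ is the mutual information computed from the joint probability $p(x,y)$ of the variables $x$ and $y$, i.e.:

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\log_2\frac{p(x,y)}{p(x)\,p(y)}$$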

Further, in step S04 the LightGBM model is used as the machine learning model. During training, the gradients of all data samples are calculated and the samples are sorted in descending order of absolute gradient. The first $a \times 100\%$ of the samples form the large-gradient subset, and $b \times (1-a) \times 100\%$ of the samples are drawn at random from the remaining samples to form the small-gradient subset. The two subsets are merged into a new sample set $S$, the gradients of the small-gradient samples are multiplied by a weight coefficient, and a new weak learner is trained on $S$. These steps are repeated until the iteration limit is reached or the loss function converges, yielding the transformer area line loss detection model.

Further, in step S04, during model training, the mean absolute percentage error, the root mean square error and the relative error percentage are used as evaluation indexes to assess whether the transformer area line loss detection model meets preset requirements.

A computer apparatus comprising a processor and a memory, the memory being arranged to store a computer program, the processor being arranged to execute the computer program to perform the method as described above.

A computer-readable storage medium having stored thereon a computer program which, when executed, implements the method as described above.

Compared with the prior art, the invention has the advantages that:

1. After the original electrical characteristic indexes are determined, the method clusters the data samples and calculates the degree of correlation between each electrical characteristic index and the line loss rate, then combines the clustering result with the correlation result to screen the features. Features correlated with transformer area line loss are retained and redundant, irrelevant features are removed, which improves both the detection precision of the transformer area line loss detection model and its training efficiency, so that detection precision and detection efficiency are balanced.

2. The method screens out the correlated electrical characteristic indexes based on clustering and on the calculated correlation between each feature and the line loss rate, and then uses LightGBM as the machine learning model. This reveals how the model output changes under different electrical characteristic index inputs and makes full use of the advantages of the LightGBM model, achieving efficient and accurate line loss prediction that reduces the amount of computation while preserving prediction accuracy as far as possible.

Drawings

Fig. 1 is a schematic flow chart of an implementation process of the transformer area line loss detection method based on feature selection and machine learning according to the embodiment.

Fig. 2 is a schematic diagram of a detailed process for implementing the line loss detection of the distribution room in an embodiment of the present invention.

Fig. 3 is a diagram illustrating the distribution result of the electrical characteristic index obtained in the specific application example.

Fig. 4 is a diagram of the Pearson product-moment correlation coefficient results obtained in a specific application embodiment.

Fig. 5 is a schematic diagram of the transformer area line loss rate calculation results obtained after feature selection in a specific application embodiment.

Fig. 6 is a diagram comparing the detailed results of the line loss rate calculation of each method obtained in the specific application example.

Detailed Description

The invention is further described below with reference to the drawings and the specific preferred embodiments, without thereby limiting the scope of protection of the invention.

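As shown in Fig. 1, the steps of the transformer area line loss detection method based on feature selection and machine learning in this embodiment include:

S01, extracting original electrical characteristic indexes from a historical data sample set of a transformer area, and constructing an electrical characteristic index set;

S02, clustering each data sample in the historical data sample set of the transformer area, and calculating the degree of correlation between each electrical characteristic index and the line loss rate;

S03, selecting a final electrical characteristic index subset from the electrical characteristic index set according to the clustering result and the correlation result;

S04, inputting the electrical characteristic index subset into a machine learning model for training to obtain a transformer area line loss detection model, thereby realizing transformer area line loss detection.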

After the original electrical characteristic indexes are determined, the data samples are clustered and the degree of correlation between each electrical characteristic index and the line loss rate is calculated. Feature screening then combines the clustering result with the correlation result, so that features correlated with transformer area line loss are retained and redundant, irrelevant features are removed. This improves both the detection precision of the transformer area line loss detection model and its training efficiency, so that detection precision and detection efficiency are balanced.

Because the grid structure of a transformer area is complex and its operating conditions vary, many factors affect the line loss, such as line length, power supply radius, line type, transformer capacity, power supply amount and user load. Considering both how strongly each data index influences the line loss and how difficult it is to acquire, this embodiment selects the power supply radius, load rate and line type as electrical characteristic indexes of the transformer area and constructs 11 original electrical characteristic indexes: on-grid energy ratio, end energy ratio, head-to-end voltage drop, power factor, load rate, load shape coefficient, three-phase unbalance degree, power supply radius, grid structure, total number of users in the transformer area, and power supply amount. Each index is defined as follows:

1) On-grid energy ratio = (sum of the on-grid energy of photovoltaic users / power supply amount of the transformer area) × 100%.

2) End energy ratio: the calculation distinguishes a power supply radius of less than 150 meters from one of 150 meters or more. If the power supply radius is less than 150 meters, end energy ratio = (sum of the energy consumption of users in meter boxes whose power supply distance ranks in the top 30% / energy consumption of the transformer area) × 100%. If the power supply radius is greater than or equal to 150 meters, end energy ratio = (sum of the energy consumption of users in meter boxes whose power supply distance exceeds 70% of the transformer area's power supply radius / energy consumption of the transformer area) × 100%.

3) The power factor is calculated as follows:

where $w_p$ is the daily active energy of the transformer area master meter, $w_{\text{PV}}$ is the daily on-grid energy of photovoltaic users, and $w_q$ is the daily reactive energy of the transformer area master meter.

4) Head-to-end voltage drop = average public-transformer voltage − end voltage, where the average public-transformer voltage is the one-day average of the three-phase voltages at the assessment meter, and the end voltage is the average of the end-user voltages.

5) Three-phase unbalance degree:

where $I_{At}$, $I_{Bt}$ and $I_{Ct}$ are the phase A, B and C secondary-side currents, sampled at $N$ points per day, and $k$ is the CT transformation ratio.

6) The load shape coefficient is calculated as follows:

7) The load rate is calculated as follows:

Load rate = (average power / distribution transformer capacity) × 100% (4)

8) Power supply radius: the distance between the transformer and the farthest meter box.

9) Grid structure: the line type obtained from the PMS, where 01 is cable, 02 is overhead insulated line, 03 is overhead bare conductor, and 04 is hybrid. If the number of data points collected in one day is $N$, the load shape coefficient is calculated as follows:

where $I_{At}$, $I_{Bt}$ and $I_{Ct}$ are the phase A, B and C secondary-side currents in amperes (A); if primary-side currents are used, they do not need to be multiplied by the CT ratio $\eta$.

10) Total number of users in the transformer area: the total number of low-voltage users in the public-transformer operating area.

11) Power supply amount: the energy supplied to the public-transformer operating area.

The original electrical characteristic index can be determined according to actual requirements, and the calculation formula can be adjusted correspondingly according to the actual requirements.
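As an illustration of index 2) above, the piecewise rule for the end energy ratio can be sketched as follows. This is a minimal sketch and not the embodiment's implementation: the function name `end_energy_ratio`, the `(distance_m, energy_kwh)` tuple layout, and the reading of "ranks in the top 30%" as the 30% of meter boxes farthest from the transformer are all assumptions made for the example.

```python
def end_energy_ratio(meter_boxes, supply_radius_m, area_energy_kwh):
    """Return the end energy ratio in percent.

    meter_boxes: list of (distance_m, energy_kwh) pairs, one per meter box
                 (hypothetical layout chosen for this sketch).
    """
    if area_energy_kwh <= 0:
        raise ValueError("area energy must be positive")
    if supply_radius_m < 150:
        # Assumption: "top 30%" means the 30% of boxes farthest from the transformer.
        ranked = sorted(meter_boxes, key=lambda box: box[0], reverse=True)
        n_end = max(1, round(0.3 * len(ranked)))
        end_boxes = ranked[:n_end]
    else:
        # Boxes whose distance exceeds 70% of the power supply radius.
        limit = 0.7 * supply_radius_m
        end_boxes = [box for box in meter_boxes if box[0] > limit]
    end_energy = sum(energy for _, energy in end_boxes)
    return 100.0 * end_energy / area_energy_kwh
```

The 150 m threshold and the 30% and 70% limits come from the definition above; everything else, including how ties in distance are ranked, would need to follow the actual metering data model.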

In step S01 of this embodiment, each electrical characteristic index is analyzed visually for data cleansing, and abnormal and missing data are removed. During the visual analysis, skewness and kurtosis are used to characterize the distribution of each electrical characteristic index, where skewness represents the asymmetry of the probability distribution of a random variable and kurtosis represents the steepness of that distribution. Pearson correlation coefficients are used to calculate the correlation among the electrical characteristic indexes and between each index and the line loss rate. The visual analysis gives a preliminary view of the distribution of the electrical characteristic indexes, the associations among the features and their linear relationships, so that correlated features can be selected accurately in later steps.

Given data samples from multiple transformer areas $D=\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i$ is the input feature vector of a single sample and $y_i$ is the actual line loss rate, the distribution of each electrical characteristic index is first measured by skewness and kurtosis. Skewness measures the asymmetry of the probability distribution of a random variable; the larger its absolute value, the more asymmetric the distribution. Skewness is calculated as:

$$\mathrm{Skew}(X) = E\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right]$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

Kurtosis measures the steepness of the probability distribution of a random variable; the larger the kurtosis, the sharper the peak of the distribution. Kurtosis is calculated as:

$$\mathrm{Kurt}(X) = E\left[\left(\frac{X-\mu}{\sigma}\right)^{4}\right]$$

Pearson correlation coefficients are then calculated among the electrical characteristic indexes and between each index and the line loss rate:

$$\rho_{X,Y} = \frac{E\left[(X-\bar{X})(Y-\bar{Y})\right]}{\sigma_X\,\sigma_Y}$$

where $E[\cdot]$ is the expectation, $\bar{X}$ is the sample mean and $\sigma_X$ is the sample standard deviation. The Pearson correlation coefficient ranges from −1 to 1. A positive value indicates positive correlation and a negative value indicates negative correlation, and the larger the absolute value, the stronger the correlation: a coefficient close to 0 means the two variables are essentially uncorrelated, and a coefficient close to −1 or 1 means they are strongly correlated.
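The three statistics used in this exploratory step can be sketched in a few lines. This is a minimal sketch under stated assumptions: it uses population moments (dividing by the sample count) and the non-excess form of kurtosis, neither of which the source specifies, and the function names are chosen for the example.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def _std(values):
    # Population standard deviation (divides by n); an assumption of this sketch.
    mu = _mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))

def skewness(values):
    mu, sigma = _mean(values), _std(values)
    if sigma == 0:
        raise ValueError("skewness is undefined for a constant feature")
    return _mean([((v - mu) / sigma) ** 3 for v in values])

def kurtosis(values):
    # Non-excess kurtosis: a normal distribution gives a value near 3.
    mu, sigma = _mean(values), _std(values)
    if sigma == 0:
        raise ValueError("kurtosis is undefined for a constant feature")
    return _mean([((v - mu) / sigma) ** 4 for v in values])

def pearson(xs, ys):
    mx, my = _mean(xs), _mean(ys)
    sx, sy = _std(xs), _std(ys)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    cov = _mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)
```

Applying `pearson` to every pair of features, and to each feature against the line loss rate, yields the kind of correlation matrix that Fig. 4 is described as showing.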

Because the input features differ greatly in scale and units, step S01 of this embodiment also normalizes the input historical data samples of the transformer area, which removes the units and speeds up the optimization process. The normalization formula is:

$$\hat{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j} \tag{9}$$

where $\hat{x}_{ij}$ is the normalized result of the $i$-th data value of feature $j$, $x_{ij}$ is the $i$-th data value of feature $j$, $\bar{x}_j$ is the mean of feature $j$, i.e. $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, and $\sigma_j$ is the standard deviation of feature $j$, i.e. $\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^2}$, with $n$ the number of data samples.
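A minimal sketch of this normalization step follows. It assumes the population standard deviation (dividing by the sample count), which the source does not state, and it maps a constant feature to zeros to avoid dividing by zero; the name `zscore_columns` is hypothetical.

```python
import math

def zscore_columns(rows):
    """Normalize each feature (column) of a list of sample rows to zero mean, unit std."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)  # population std, an assumption
        for col, m in zip(columns, means)
    ]
    return [
        [
            (value - means[j]) / stds[j] if stds[j] > 0 else 0.0  # constant feature -> 0
            for j, value in enumerate(row)
        ]
        for row in rows
    ]
```

For example, a column holding power supply radii in meters and a column holding load rates in percent end up on the same scale after this step, so neither dominates distance-based clustering.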

In step S02 of this embodiment, an improved k-means clustering algorithm is used to classify the historical data samples, and the steps include:

calculating the transformer area performance index $P_E$ of each data sample, where $P_E$ is defined as follows:

wherein $x_{ij}$ is the $i$-th data sample value of feature $j$;

sorting the data samples in ascending order of $P_E$ to obtain a $P_E$ sorting result;

selecting a number of cluster centers $k$, dividing the data samples equally into $k$ parts according to the $P_E$ sorting result, selecting the center of each part as an initial cluster center, and executing the k-means clustering algorithm;

calculating the silhouette coefficient of the clustering result to check the clustering effect, and reselecting the number of cluster centers $k$ if the effect is not satisfactory.
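The sorted-partition initialization described in these steps can be sketched as below. This is a sketch under assumptions: the definition of $P_E$ is not recoverable from the text, so the caller supplies one score per sample in a hypothetical `scores` argument; "the center of each part" is read as the mean of the samples in that part; and the helper names are invented for the example.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def init_centers_by_score(samples, scores, k):
    """Sort samples by score, split into k equal parts, use each part's mean as a center."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    order = sorted(range(len(samples)), key=lambda i: scores[i])  # ascending score
    n = len(order)
    centers = []
    for part in range(k):
        chunk = order[part * n // k:(part + 1) * n // k]
        centers.append(centroid([samples[i] for i in chunk]))
    return centers

def assign(samples, centers):
    return [
        min(range(len(centers)), key=lambda c: squared_distance(s, centers[c]))
        for s in samples
    ]

def kmeans(samples, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    for _ in range(max_iter):
        labels = assign(samples, centers)
        updated = []
        for c, old in enumerate(centers):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            updated.append(centroid(members) if members else old)  # keep empty clusters in place
        if updated == centers:
            break
        centers = updated
    return assign(samples, centers), centers
```

Compared with random initialization, starting from score-ordered partitions makes the run deterministic, which is presumably the motivation for the improvement, although the source does not say so explicitly.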

In this embodiment, clustering the data samples yields a clustering effect, from which it is determined whether the data need to be classified. For example, if the silhouette coefficients of the clustering results are very close (their difference is less than a preset threshold) and the clustering effect is not pronounced, it is determined that classification is not needed; otherwise it is determined that classification is needed.
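The silhouette coefficient used to judge the clustering effect can be computed as in the sketch below. It assumes Euclidean distance and the standard mean-silhouette definition; the function name is chosen for the example and the threshold comparison is left to the caller because the source gives no threshold value.

```python
import math

def silhouette_coefficient(samples, labels):
    """Mean silhouette value over all samples (Euclidean distance)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i, point in enumerate(samples):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the sample's own cluster
        a = sum(math.dist(point, samples[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, samples[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(samples)
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest that splitting the samples into classes adds little.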

Mutual information methods can capture both linear and nonlinear relationships between each feature and the output label. The maximal information coefficient (MIC) measures the linear or nonlinear correlation between variables. In feature selection, an MIC-based method computes the MIC between each independent variable and the dependent variable, identifies the features strongly correlated with the dependent variable, and removes features whose MIC falls below a set threshold to obtain a preferred feature subset. In step S03, the degree of correlation between each electrical characteristic index and the line loss rate is obtained by calculating the MIC between them. The mutual information is calculated as follows:

$$I(X;Y) = \iint p(x,y)\log_2\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$$

where $p(x,y)$ is the joint probability of the variables $x$ and $y$.

Because the joint probability density is difficult to calculate, the MIC calculation scatters the data points of variables $X$ and $Y$ in a two-dimensional space, partitions the space with a grid of $|X|$ columns and $|Y|$ rows, counts the data points falling into each cell, and estimates the joint probability from these counts, i.e.

$$I(X;Y) \approx \sum_{x}\sum_{y} p(x,y)\log_2\frac{p(x,y)}{p(x)\,p(y)}$$

MIC is then calculated as

$$\mathrm{MIC}(X;Y) = \max_{|X||Y|<B}\frac{I(X;Y)}{\log_2\min(|X|,|Y|)}$$

where $B = n^{0.6}$, with $n$ the number of data samples.
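The grid-based estimate and the threshold filter can be sketched as follows. This is a simplified approximation and not a full MIC implementation: it only tries equal-frequency grids for each admissible grid size instead of searching over all cell boundaries, and it assumes the data have no ties. The names `mic_approx` and `select_features`, and the threshold value passed by the caller, are assumptions of the example.

```python
import math
from collections import Counter

def _equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold (almost) equal numbers of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins

def _mutual_information(bins_x, bins_y):
    n = len(bins_x)
    joint = Counter(zip(bins_x, bins_y))
    px, py = Counter(bins_x), Counter(bins_y)
    return sum(
        (count / n) * math.log2((count / n) / ((px[bx] / n) * (py[by] / n)))
        for (bx, by), count in joint.items()
    )

def mic_approx(xs, ys, exponent=0.6):
    """Approximate MIC: maximize normalized mutual information over grids with nx * ny < n**exponent."""
    budget = len(xs) ** exponent
    best = 0.0
    for nx in range(2, int(budget) + 1):
        for ny in range(2, int(budget) + 1):
            if nx * ny >= budget:
                break
            mi = _mutual_information(
                _equal_frequency_bins(xs, nx), _equal_frequency_bins(ys, ny)
            )
            best = max(best, mi / math.log2(min(nx, ny)))
    return best

def select_features(features, target, threshold):
    """Keep the names of features whose approximate MIC with the target reaches the threshold."""
    return [name for name, values in features.items() if mic_approx(values, target) >= threshold]
```

Here `features` would map each electrical characteristic index name to its column of values and `target` would hold the line loss rates; because of the restricted grid search, the scores are lower bounds on the true MIC.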

In step S04 of this embodiment, the LightGBM model is used as the machine learning model. During training, the gradients of all data samples are first calculated and the samples are sorted in descending order of absolute gradient. The first $a \times 100\%$ of the samples form the large-gradient subset, and $b \times (1-a) \times 100\%$ of the samples are drawn at random from the remaining samples to form the small-gradient subset. The two subsets are merged into a new sample set $S$, the gradients of the small-gradient samples are multiplied by a weight coefficient, and a new weak learner is trained on $S$. These steps are repeated until the iteration limit is reached or the loss function converges, yielding the transformer area line loss detection model.

A Gradient Boosting Decision Tree (GBDT) is an ensemble learning method based on Boosting. The GBDT updates the weights of the training set by iterating the residuals of a plurality of weak learners (usually decision trees), learns the weak learners, and constructs a strong learner by linear combination, thereby realizing optimization of learning.

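Let $D=\{(x_i, y_i)\}_{i=1}^{N}$ be a training set of $N$ samples, where $x_i$ is the vector of electrical characteristic indexes and $y_i$ is the line loss rate, and let the loss function be $L(y, f(x))$, where $f(x)$ is the value predicted by the model. First, the fitting function is initialized:

$$f_0(x) = \arg\min_{\gamma}\sum_{i=1}^{N} L(y_i, \gamma)$$

Second, the number of iterations $M$ is set, and in the $m$-th iteration the negative gradient is calculated:

$$r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f=f_{m-1}}, \quad i = 1,\dots,N$$

A regression tree with leaf node regions $R_{jm}$, $j = 1,\dots,J_m$, where $J_m$ is the number of leaf nodes, is fitted to the negative gradients, i.e. trained on the set $\{(x_i, r_{im})\}_{i=1}^{N}$. For each leaf node $j = 1,\dots,J_m$, the multiplier $\gamma_{jm}$ is calculated as:

$$\gamma_{jm} = \arg\min_{\gamma}\sum_{x_i \in R_{jm}} L\big(y_i, f_{m-1}(x_i) + \gamma\big)$$

The result is used to update $f_m(x)$:

$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m}\gamma_{jm}\,\mathbb{I}(x \in R_{jm})$$

Finally, the final model is obtained:

$$\hat{f}(x) = f_M(x)$$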
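The boosting loop described above can be illustrated with a deliberately small example. This sketch assumes the squared loss $L = \tfrac{1}{2}(y - f)^2$, for which the negative gradient is the residual and the leaf multiplier is the mean residual in the leaf; it uses a single input feature and two-leaf trees (stumps), and it requires at least two distinct feature values. None of these choices comes from the source.

```python
def fit_stump(xs, residuals):
    """Find the single split that best fits the residuals; return (threshold, left, right)."""
    best = None
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        gamma_left = sum(left) / len(left)      # leaf multiplier under squared loss
        gamma_right = sum(right) / len(right)
        sse = sum((r - gamma_left) ** 2 for r in left) + sum((r - gamma_right) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, gamma_left, gamma_right)
    return best[1:]

def fit_gbdt(xs, ys, n_rounds):
    f0 = sum(ys) / len(ys)                      # minimizer of the squared loss for a constant
    predictions = [f0] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]  # negative gradient
        threshold, gamma_left, gamma_right = fit_stump(xs, residuals)
        stumps.append((threshold, gamma_left, gamma_right))
        predictions = [
            p + (gamma_left if x <= threshold else gamma_right)
            for x, p in zip(xs, predictions)
        ]
    return f0, stumps

def predict(model, x):
    f0, stumps = model
    return f0 + sum(gl if x <= thr else gr for thr, gl, gr in stumps)
```

A production GBDT would use multi-feature trees of greater depth and usually a shrinkage factor on each update, which the formulas above omit.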

GBDT is a classic Boosting algorithm, but it has difficulty handling massive data with many features. LightGBM is a distributed, efficient gradient boosting framework that addresses GBDT's long training time and its difficulty with massive data. LightGBM is optimized with Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which address the data volume and the feature count respectively.

In this embodiment, LightGBM samples the data with the GOSS algorithm, dividing the samples into large-gradient samples and small-gradient samples. First, the gradients of all samples are calculated and the samples are sorted in descending order of absolute gradient. The first $a \times 100\%$ of the samples form the large-gradient subset, and $b \times (1-a) \times 100\%$ of the samples are drawn at random from the remaining samples to form the small-gradient subset. The two subsets are merged into a new sample set $S$, the small-gradient samples are multiplied by a weight coefficient, and a new weak learner is trained on $S$. These steps are repeated until the iteration limit is reached or the loss function converges. With this conditional stratified sampling, the amount of training data is reduced to $(a + b - ab) \times 100\%$ of the original.

High-dimensional data are often sparse, that is, they contain mutually exclusive features. Whereas GOSS reduces computation by reducing the number of data samples, EFB reduces the number of features through feature fusion: mutually exclusive features are bundled into a new feature, which lowers the data dimensionality and hence the complexity. When constructing the bundles, the EFB algorithm models the relationships among the features as a weighted undirected graph whose edge weights correspond to the total conflicts between features, converts the task into a graph coloring problem, and uses a greedy algorithm to approximate the optimal solution. In this embodiment, the correlated electrical characteristic indexes are screened out based on clustering and on the calculated correlation between each feature and the line loss rate, and LightGBM is then used as the machine learning model. This makes full use of the advantages of the LightGBM model, achieving efficient and accurate line loss prediction that reduces the amount of computation and the computational complexity while preserving prediction accuracy as far as possible.
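One round of this sampling scheme can be sketched as below. The source does not give the weight coefficient, so the sketch assumes the weight that keeps the total gradient of the discarded remainder represented, namely (size of the remainder) / (size of the small-gradient sample), which equals $1/b$ under the sampling ratio stated here. The function name and return layout are chosen for the example.

```python
import random

def goss_sample(gradients, a, b, rng=None):
    """Return (large_indices, small_indices, small_weight) for one GOSS round."""
    rng = rng or random.Random()
    n = len(gradients)
    # Sort sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_large = round(a * n)
    large = order[:n_large]
    rest = order[n_large:]
    # b * (1 - a) * n samples in total, i.e. a fraction b of the remainder.
    n_small = round(b * len(rest))
    small = rng.sample(rest, n_small)
    # Assumed weight: rescales the sampled small gradients to stand in for the whole remainder.
    small_weight = len(rest) / n_small if n_small else 0.0
    return large, small, small_weight
```

With `a = 0.2` and `b = 0.5` on 100 samples, 20 large-gradient and 40 small-gradient samples are kept, which is $(a + b - ab) \times 100\% = 60\%$ of the data, matching the reduction stated above. The original GOSS formulation in LightGBM instead draws $b \times 100\%$ of all samples from the remainder and uses the weight $(1-a)/b$.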
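The greedy bundling idea can be sketched as follows. The sketch counts a conflict whenever two features are both nonzero on the same sample, visits features in descending order of total conflicts, and places each feature in the first bundle whose conflict count stays within a tolerance. The `max_conflicts` parameter and the function name are assumptions of the example, and the step that merges bundled features into a single column is omitted.

```python
def greedy_bundle(columns, max_conflicts=0):
    """Group feature columns into bundles of (nearly) mutually exclusive features."""
    n = len(columns)
    nonzero = [{row for row, value in enumerate(col) if value != 0} for col in columns]
    # conflict[i][j]: number of samples on which features i and j are both nonzero.
    conflict = [
        [len(nonzero[i] & nonzero[j]) if i != j else 0 for j in range(n)]
        for i in range(n)
    ]
    # Visit features with the most total conflicts first (degree in the conflict graph).
    order = sorted(range(n), key=lambda i: sum(conflict[i]), reverse=True)
    bundles, bundle_conflicts = [], []
    for feature in order:
        for b, members in enumerate(bundles):
            added = sum(conflict[feature][m] for m in members)
            if bundle_conflicts[b] + added <= max_conflicts:
                members.append(feature)
                bundle_conflicts[b] += added
                break
        else:
            # No existing bundle can take this feature: start a new one.
            bundles.append([feature])
            bundle_conflicts.append(0)
    return bundles
```

With only 11 candidate indexes, as in this embodiment, bundling is unlikely to matter much in practice; it becomes relevant for wide, sparse inputs such as one-hot encoded line types.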

In this embodiment, when the model is trained in step S04, the mean absolute percentage error (MAPE), the root-mean-square error (RMSE) and the relative error percentage $E_C$ are used as evaluation indexes to assess whether the transformer area line loss detection model meets the preset requirements. MAPE and RMSE measure the overall performance of the model, and the smaller their values, the higher the accuracy of the model. The relative error percentage $E_C$ measures the calculation error of a single sample. The three are defined respectively as:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right| \times 100\%$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

$$E_C = \frac{\left|\hat{y}_i - y_i\right|}{y_i} \times 100\%$$

where $y_i$ is the actual line loss rate of sample $i$, $\hat{y}_i$ is the value calculated by the model, and $n$ is the number of samples.

The method selects the electrical characteristic indexes by exploring their distributions in the transformer area and their associations with the line loss rate, and thereby determines the model input. A LightGBM-based transformer area line loss rate calculation model is then built from the feature selection result. This characterizes the influence of different model parameters and electrical characteristic indexes on the calculation result and improves the detection precision of transformer area line loss.
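The three evaluation indexes can be computed as in the sketch below. MAPE and RMSE follow their standard definitions; the per-sample relative error is written with an absolute value, which is an assumption because the source's formula is not recoverable. Function names are chosen for the example.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs((p - a) / a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error, in the units of the line loss rate."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def relative_error_pct(actual_value, predicted_value):
    """Relative error of a single sample, in percent (absolute value assumed)."""
    if actual_value == 0:
        raise ValueError("relative error is undefined when the actual value is zero")
    return 100.0 * abs(predicted_value - actual_value) / abs(actual_value)
```

Because line loss rates can be close to zero, the percentage-based indexes can become very large for individual transformer areas, which is one reason to report RMSE alongside them.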

As shown in fig. 2, when the line loss detection of the transformer area is implemented in the specific application embodiment of the present invention, the distribution, the association relationship, and the linear relationship of the electrical characteristic index are first explored through the visualized data; then identifying important information by using a feature selection strategy; finally, an analysis training model is made through historical data of the transformer area, a test set is used for verifying a feature selection strategy and a machine learning model, and the detailed process comprises the following steps:

step 1) selecting T = {on-grid power ratio, tail end power ratio, head and tail end voltage drop, power factor, load rate, load shape coefficient, three-phase unbalance degree, power supply radius, grid structure, total number of users in the transformer area, power supply amount} as the electrical characteristic indexes;

step 2) carrying out visual analysis on the original electrical characteristic indexes, exploring the feature distribution, feature correlation, linear relationship and the like, so as to gain a preliminary understanding of the data; removing abnormal and missing data; normalizing the input according to formula (9);

step 3) clustering the data samples by adopting an improved k-means algorithm, and checking whether the samples need to be classified; then measuring the degree of correlation between each electrical characteristic index and the line loss rate by adopting the maximal information coefficient (MIC);

step 4) selecting an electrical characteristic index subset according to the clustering and MIC results, inputting the subset into a LightGBM model to calculate the mean absolute percentage error and the root mean square error, and checking whether the transformer area data samples need to be classified and whether the input features need to be reduced;

and step 5) inputting the finally selected electrical characteristic indexes into the LightGBM model, and comparing its performance with that of other related research results by taking the mean absolute percentage error, the root mean square error and the relative error percentage as evaluation indexes.
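Step 2) above normalizes each electrical characteristic index by its mean and standard deviation. The sketch below illustrates that normalization on a small table of samples. It assumes the population standard deviation (division by the number of samples) and maps a constant feature to 0; both choices, and the function name `zscore_columns`, are assumptions made for illustration rather than details stated in the text.

```python
import math


def zscore_columns(rows):
    """Normalize every feature column to zero mean and unit standard deviation.

    rows: list of samples, each a list of feature values of equal length.
    Assumption: population standard deviation (divide by n).
    """
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(n_features)
    ]
    normalized = []
    for r in rows:
        normalized.append([
            # A constant feature has zero spread; map it to 0 instead of dividing by 0.
            (r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
            for j in range(n_features)
        ])
    return normalized


if __name__ == "__main__":
    # Illustrative values only: two samples, two features.
    print(zscore_columns([[1.0, 10.0], [3.0, 10.0]]))
```

Normalizing in this way puts features with very different units, such as power supply radius and load rate, on a comparable scale before clustering and model training.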

In order to verify the effectiveness of the invention, a data set containing real line loss data of 8000 transformer areas in a certain city is used, the above scheme of the invention is adopted to carry out a transformer area line loss detection test, and the results are compared with conventional models such as the standard BPNN, DBN, LM-BPNN and support vector machine. The conventional models are described below, and the parameter settings of the models are shown in Table 1.

(1) Standard BP neural network (BPNN): the BP neural network is generally composed of an input layer, a hidden layer and an output layer. An input sample is passed through the hidden layer to the output layer by forward propagation, and the errors are passed back layer by layer by the back propagation algorithm to adjust the weights of each network layer.

(2) Deep belief network (DBN): the DBN model is formed by stacking a plurality of restricted Boltzmann machines (RBMs) and one BP neural network output layer.

(3) Improved K-means clustering and BP neural network (LM-BPNN): firstly, the optimal K value of the K-means clustering is determined through the silhouette coefficient; secondly, the error function of the BP neural network in the back propagation process is optimized through the Levenberg-Marquardt (LM) algorithm, so as to correct the network weights and thresholds.

(4) Model based on hierarchical clustering and random forest (RF): firstly, hierarchical clustering is adopted to classify the transformer areas so that each class is studied separately; secondly, RF is adopted to model the different classes of transformer areas.

(5) Support vector machine (SVM): an SVM is adopted as the calculation model, and a grid search method is used to search for the optimal training parameters of the SVM.

TABLE 1 different method parameter settings

The invention adopts the improved k-means method to calculate the P_E of each transformer area sample and sort the samples in descending order, with the results shown in Table 2. The samples are equally divided into k parts, and the center of each part is selected as the initial clustering center of that class. The number of clusters is increased from 2 to 9, and the silhouette coefficient S_t of the clustering result under each k value is calculated; the results are shown in Table 3. The results show that the silhouette coefficients of all the clustering results are very close, the clustering effect is not obvious, and clustering would leave some classes with only a small number of samples. Taking these results together, no classification is performed when calculating the line loss rate of the transformer area data.
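The clustering check described above can be sketched as follows: samples are ordered by a per-sample score, split into k equal parts whose means seed k-means, and the silhouette coefficient is computed for each k. This is a simplified illustration in plain Python, not the implementation used in the experiments. The P_E score is passed in by the caller because its formula is not reproduced in this passage, and all function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(p[j] for p in points) / len(points) for j in range(len(points[0]))]


def initial_centers(samples, scores, k):
    """Sort samples by score (descending), split into k near-equal parts, use part means."""
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    parts = [order[c * n // k:(c + 1) * n // k] for c in range(k)]
    return [mean_point([samples[i] for i in part]) for part in parts]


def kmeans(samples, centers, max_iter=100):
    """Plain k-means starting from the given centers; returns one label per sample."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: dist(s, centers[c])) for s in samples
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centers)):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = mean_point(members)
    return labels


def silhouette(samples, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(dist(samples[i], samples[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(samples[i], samples[j]) for j in idx) / len(idx)
            for other, idx in clusters.items() if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(samples)


def silhouette_by_k(samples, scores, ks=range(2, 10)):
    """Silhouette coefficient for each candidate number of clusters."""
    return {k: silhouette(samples, kmeans(samples, initial_centers(samples, scores, k)))
            for k in ks}
```

If the coefficients returned by `silhouette_by_k` are all close to one another, as reported in Table 3, no value of k separates the samples clearly, which supports the decision not to classify the transformer areas before modeling.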

TABLE 2 Transformer area performance index P_E

TABLE 3 Silhouette coefficients for different numbers of cluster centers

The present embodiment further analyzes the electrical characteristic index distributions and the correlation coefficient results. The electrical characteristic index distributions are shown in fig. 3, where the abscissa represents all possible values of a feature and the ordinate represents the probability of occurrence of each value. The Pearson correlation coefficient matrix of the electrical characteristic indexes and the line loss rate is shown in fig. 4. Many feature distributions exhibit positive skewness, such as the tail end power ratio, the load rate and the power supply radius; some feature distributions exhibit positive kurtosis, such as the head and tail end voltage drop, the load shape coefficient and the power supply radius; some features, such as the power supply radius, follow a distribution similar to a power law, and the on-grid power ratio is almost entirely concentrated at 0%.

As can be seen from fig. 3, among the initial electrical characteristic indexes, the load shape coefficient, the three-phase unbalance degree and the power supply radius are strongly positively correlated with the line loss rate, and the head and tail end voltage drop is weakly positively correlated with the line loss rate. Other features, such as the on-grid power ratio, the tail end power ratio and the load rate, are only weakly correlated with the line loss rate, so there may be irrelevant features among the original electrical characteristic indexes. Correlation between a feature and the prediction target helps the model infer the prediction target from that feature and thus improves the model performance to a certain extent, whereas irrelevant features do not contribute to the final prediction result. Meanwhile, there is a strong correlation between the three-phase unbalance degree and the load shape coefficient, so some of the features are redundant. Redundant features include, in addition to duplicate data, features that are strongly associated with each other, i.e., one attribute can be inferred from another. Redundant features may cause multicollinearity, i.e., the independent variables are correlated with each other, which reduces the generalization capability of the model and makes its results unstable.
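The distribution and correlation statistics discussed above can be computed as in the following sketch. It assumes moment-based skewness and excess kurtosis (so that a normal distribution has kurtosis 0 and "positive kurtosis" means a sharper peak or heavier tails than normal), together with the standard Pearson correlation coefficient. The function names are illustrative.

```python
import math


def central_moment(values, order):
    """Central moment of the given order, dividing by the number of values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: positive when the right tail is longer."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Excess kurtosis: 0 for a normal distribution (assumed convention)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A pair of features whose `pearson` value is close to 1 or -1, as reported for the three-phase unbalance degree and the load shape coefficient, is a candidate for the redundancy and multicollinearity problem described above.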

In this embodiment, the transformer area line loss rate calculation and error analysis are further performed. The number of training iterations of the LightGBM model is set to 100, 1000, 10000 and 100000 respectively, and the other parameters are set according to Table 1. The line loss rate of the test transformer area samples is calculated under each number of iterations, and the calculation results are shown in Table 4. As can be seen from Table 4, the accuracy of the model gradually increases as the number of iterations increases. When the maximum number of iterations is 100, the mean square error (MSE) of the model is 1.229 and the mean absolute percentage error is 28.612%; 92 transformer areas have a relative error percentage smaller than 1%, and 774 transformer areas have a relative error percentage larger than 1% and smaller than 10%. When the maximum number of iterations is 100000, the MSE of the model is 0.020; 100 transformer areas have a relative error percentage smaller than 0.05%, 45 have a relative error percentage larger than 0.05% and smaller than 0.1%, 2294 have a relative error percentage larger than 0.1% and smaller than 5%, 213 have a relative error percentage larger than 5% and smaller than 10%, and only 67 exceed 10%. In terms of time, training and testing take 0.187 s when the maximum number of iterations is 100 and 85.785 s when it is 100000. When the maximum number of iterations is 1000 or 10000, the calculation results lie between these two cases; that is, the calculation accuracy of the model increases with the maximum number of iterations.
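The per-band counts reported above (below 0.05%, 0.05% to 0.1%, 0.1% to 5%, 5% to 10%, above 10%) can be obtained from the per-sample relative error percentages as sketched below. The band edges are taken from the text; assigning a value that falls exactly on an edge to the higher band is an assumption, and the function name is illustrative.

```python
from bisect import bisect_right


def count_error_bands(errors_pct, edges=(0.05, 0.1, 5.0, 10.0)):
    """Count samples per relative-error band.

    errors_pct: relative error percentages, one per transformer area.
    edges: ascending band boundaries in percent; a value equal to an edge
           is counted in the higher band (assumed convention).
    Returns len(edges) + 1 counts, ordered from the lowest band to the highest.
    """
    counts = [0] * (len(edges) + 1)
    for e in errors_pct:
        counts[bisect_right(edges, e)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative values only, not taken from Table 4.
    print(count_error_bands([0.01, 0.07, 3.0, 7.0, 20.0, 0.2]))
```

Counting by band complements the aggregate MSE and MAPE figures, because it shows how many individual transformer areas still carry large errors.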

TABLE 4 comparison of the results of different iterations

The above results show that the MSE of the LightGBM model is greatly reduced when the maximum number of iterations reaches 100000 compared with the model with a maximum of 100 iterations. Although training and testing on the 8000 transformer area samples takes 85.785 seconds, this is still within an acceptable range. Since some features may be redundant or irrelevant, which may reduce the model performance, the present invention calculates the MIC values between the electrical characteristic indexes and the line loss rate according to equations (10) to (12), and the results are shown in Table 5.

TABLE 5 MIC value calculation results

The features are sorted in ascending order of their MIC values, and the features with the smallest MIC values are removed one by one, so that the feature set is reduced from 11 features to 6; each resulting feature set is input into the LightGBM model, and the obtained line loss rate calculation results are shown in fig. 5. As can be seen from fig. 5, as the number of input electrical characteristic indexes decreases, the MSE and MAPE increase, i.e., the accuracy of the model decreases. Therefore, although some features are highly correlated with each other and some electrical characteristic indexes are only weakly correlated with the line loss rate, deleting these features still causes information loss and reduces the accuracy of the model.
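The one-by-one removal described above can be sketched as a schedule of feature subsets, each of which is then scored by the model. In this sketch the MIC values are supplied by the caller, and `score_fn` is a hypothetical callback standing in for "train the LightGBM model on this subset and return its error"; neither the names nor the callback signature come from the original text.

```python
def elimination_schedule(mic_by_feature, min_features):
    """Return successive feature subsets, dropping the lowest-MIC feature each round.

    mic_by_feature: dict mapping feature name to its MIC with the line loss rate.
    min_features: stop once the subset has shrunk to this many features.
    """
    ranked = sorted(mic_by_feature, key=mic_by_feature.get)  # ascending MIC
    keep = list(mic_by_feature)  # preserve the original feature order
    subsets = [list(keep)]
    for name in ranked:
        if len(keep) <= min_features:
            break
        keep.remove(name)
        subsets.append(list(keep))
    return subsets


def evaluate_subsets(subsets, score_fn):
    """Apply a caller-supplied scoring function to every subset.

    score_fn(subset) is assumed to train and test a model on the given
    features and return an error value such as MSE or MAPE.
    """
    return [(len(subset), score_fn(subset)) for subset in subsets]


if __name__ == "__main__":
    # Placeholder feature names and MIC values, for illustration only.
    mic = {"f1": 0.3, "f2": 0.1, "f3": 0.2}
    for subset in elimination_schedule(mic, min_features=1):
        print(subset)
```

With the 11 indexes of step 1) and `min_features=6`, the schedule would contain the six subset sizes from 11 down to 6 that are compared in fig. 5.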

To further verify the effectiveness of the method, a side-by-side comparison is made with conventional theoretical line loss calculation methods (models such as the standard BPNN, DBN, LM-BPNN and support vector machine), and the results are shown in Table 6 and fig. 6, where (a) in fig. 6 corresponds to the calculation result of the present invention, (b) to that of the conventional RF method, (c) to that of the conventional BPNN, (d) to that of the conventional LMNN method, (e) to that of the conventional DBN, and (f) to that of the conventional SVM. In terms of MSE and MAPE, the present invention performs best, reaching 0.020 and 2.459% respectively; that is, the precision of the present invention is superior to that of the conventional methods, and the time consumption of the method is also short, at only 85.785 s.

TABLE 6 comparison of results calculated by different methods

In order to further verify that the method of the present invention is generally applicable to data samples from different time periods, part of the historical data samples are used as training samples and the remaining data samples are used as test samples, and the results are shown in Table 7. The results show that the method still achieves high calculation precision on data from different time periods, with the MSE between 0.010 and 0.015 and the MAPE between 1.7% and 2.0%. Meanwhile, the results for individual transformer areas are examined by taking April 2 as an example; only the data of 10 transformer areas are listed, and the results are shown in Table 8. As can be seen from the table, the method still generalizes well to data from different transformer areas and different time periods.

TABLE 7 calculation results of different time-period data samples

TABLE 8 Line loss calculation results of transformer areas on April 2

According to the method, the influence of the input electrical characteristic indexes on the model performance is analyzed on the basis of the data distribution, the feature correlation and the clustering algorithm. The experimental results show that reducing the number of features in the data set causes information loss and therefore degrades performance, and that introducing the clustering algorithm does not classify the data set clearly while also reducing the number of training samples. A transformer area line loss rate calculation model based on LightGBM is therefore established on the basis of the feature selection result, and it can represent how the model calculation changes under different input model parameters and electrical characteristic indexes. Experiments show that the method balances the efficiency and the precision of transformer area line loss prediction.

The present embodiment further comprises a computer apparatus comprising a processor and a memory, the memory being configured to store a computer program, the processor being configured to execute the computer program to perform the method as described above.

The present embodiments also include a computer-readable storage medium storing a computer program that, when executed, implements a method as described above.

The foregoing describes preferred embodiments of the present invention and is not to be construed as limiting the invention in any way. Although the present invention has been described with reference to the preferred embodiments, it is not limited thereto. Therefore, any simple modification, equivalent change or variation made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical scheme of the present invention, shall fall within the protection scope of the technical scheme of the present invention.

Claims

1. A transformer area line loss detection method based on feature selection and machine learning is characterized by comprising the following steps:

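S01, extracting original electrical characteristic indexes from a historical data sample set of a transformer area, and constructing an electrical characteristic index set;

S02, clustering the data in the electrical characteristic index set, and calculating the degree of correlation between each electrical characteristic index and the line loss rate;

S03, selecting a final electrical characteristic index subset from the electrical characteristic index set according to the clustering result and the correlation degree result;

and S04, inputting the subset of the electrical characteristic indexes into a machine learning training model for training to obtain a transformer area line loss detection model so as to realize transformer area line loss detection.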

2. The transformer area line loss detection method based on feature selection and machine learning as claimed in claim 1, wherein in the step S01, a plurality of original electrical feature indexes are constructed by selecting the power supply radius, the load rate and the line model as the electrical feature indexes of the transformer area, and the original electrical feature indexes include: on-grid power ratio, tail end power ratio, head and tail end voltage drop, power factor, load rate, load shape coefficient, three-phase unbalance degree, power supply radius, grid structure, total number of users in the transformer area and power supply amount.

3. The transformer area line loss detection method based on feature selection and machine learning as claimed in claim 1, wherein the step S01 further comprises performing visual analysis on each electrical feature index so as to carry out data cleaning; the visual analysis uses skewness and kurtosis to characterize the feature distribution of each electrical feature index, wherein the skewness represents the asymmetry of the probability distribution of a random variable and the kurtosis represents the steepness of the probability distribution of a random variable; and the correlation coefficient between each electrical feature index and the line loss rate is calculated by using the Pearson correlation coefficient.

4. The transformer area line loss detection method based on feature selection and machine learning as claimed in claim 1, wherein the step S01 further comprises a step of normalizing the data values of the electrical feature indexes, and the calculation formula of the normalization process is as follows:

$$x'_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}$$

in the formula, $x'_{ij}$ is the normalized result of the ith data value of feature j, $x_{ij}$ is the ith data sample value of feature j, $\bar{x}_j$ is the mean of feature j, i.e. $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ with n the number of data samples, and $\sigma_j$ is the standard deviation of feature j.

5. The transformer area line loss detection method based on feature selection and machine learning as claimed in claim 1, wherein in the step S02, an improved k-means clustering algorithm is adopted to cluster the data samples, and the steps include:

calculating the performance index P_E of each data sample, wherein x_ij is the ith data sample value of feature j;

sorting the data samples in ascending order according to the size of their P_E values to obtain a P_E sorting result;

selecting a number of cluster centers k, equally dividing the data samples into k parts according to the P_E sorting result, selecting the center of each part as an initial clustering center, and executing the k-means clustering algorithm;

6. The transformer area line loss detection method based on feature selection and machine learning according to any one of claims 1 to 5, wherein in the step S03, the maximal information coefficient (MIC) between each electrical feature index and the line loss rate is calculated to obtain the degree of correlation between the electrical feature index and the line loss rate;

the MIC between variables X, Y is calculated as:

7. The method as claimed in any one of claims 1 to 5, wherein the machine learning training model in step S04 uses a LightGBM model, and when the model is trained, the gradients of all data samples are first calculated and sorted in descending order of their absolute values; the data samples in the first a × 100% are set as the large-gradient sample subset, and the remaining data samples are randomly sampled at a proportion of b × (1-a) × 100% and set as the small-gradient sample subset; the large-gradient samples and the small-gradient samples are then combined to generate a new sample set S, the gradients of the small-gradient samples are multiplied by a weight coefficient, and a new weak learner is learned by using the sample set S; the above steps are repeated until the number of iterations is reached or the loss function converges, so as to obtain the transformer area line loss detection model.

8. The method according to claim 6, wherein during model training in step S04, the mean absolute percentage error, the root mean square error and the relative error percentage are used as evaluation indexes to evaluate whether the transformer area line loss detection model meets preset requirements.

9. A computer arrangement comprising a processor and a memory for storing a computer program, wherein the processor is adapted to execute the computer program to perform the method according to any of claims 1-8.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed, implements the method according to any one of claims 1 to 8.