CN115423008A

CN115423008A - Method, system and medium for cleaning operation data of power grid equipment

Info

Publication number: CN115423008A
Application number: CN202211023992.0A
Authority: CN
Inventors: 龙云; 卢有飞; 梁雪青; 吴任博; 张扬; 刘璐豪; 赵宏伟; 陈明辉; 张少凡; 邹时容; 蔡燕春; 刘璇
Original assignee: Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Current assignee: Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date: 2022-08-24
Filing date: 2022-08-24
Publication date: 2022-12-02

Abstract

The invention discloses a method, a system and a medium for cleaning operation data of power grid equipment. The method comprises the following steps: acquiring original operation data of the power grid equipment from a database and preprocessing it; identifying noise data with abnormal attributes in the preprocessed operation data and setting them to null values, to obtain operation data containing missing values; constructing and training a data prediction model based on a back propagation feedforward neural network, predicting the values to be filled for the operation data containing missing values, and filling them in to obtain filled operation data; and clustering the filled operation data with an improved DBSCAN algorithm, selecting and deleting highly matched duplicate data records, to obtain the cleaned operation data. By identifying noise data and setting it to null, the method converts abnormal values into missing values, fills them by prediction with the data prediction model, and finally removes duplicate records through the clustering algorithm, thereby cleaning both types of dirty data.

Description

Method, system and medium for cleaning operation data of power grid equipment

Technical Field

The invention belongs to the technical field of power grid equipment data processing, and particularly relates to a method, a system and a medium for cleaning power grid equipment operation data.

Background

With the diversification of energy types and the continuous optimization of the energy structure in China, generating units belonging to different systems produce massive data streams during operation, and the information that can be mined from these data is very rich. However, the data include erroneous values caused by factors such as sensor failure and external interference, as well as mutually conflicting records originating from different service systems. Such data, called dirty data, cannot be used directly and can even obstruct information mining. Cleaning the dirty data to improve data quality, and thereby guaranteeing the accuracy of subsequent data analysis, is therefore of great significance.

In the prior art, the cleaning of unit equipment operation data falls into two categories: cleaning of data with erroneous attribute values, and cleaning of duplicate data. For data with abnormal attribute values, the task is the detection of abnormal values. A common, simple and easily implemented approach is statistical: based on Chebyshev's theorem, values lying more than n standard deviations from the mean of the data set are treated as abnormal. However, this traditional method is difficult to apply when the data dimension is high and the distribution of abnormal values is complex. For missing values in the operation data, filling with the median, mean or mode often impairs the reliability of the data and can even mislead the analysis results and subsequent decisions. Meanwhile, multi-source data from different systems may contain duplicate records; the most direct approach, pairwise comparison of the records in a database, is cumbersome, consumes resources, has high time complexity, and does not scale as the data cleaning work expands.

Disclosure of Invention

The invention mainly aims to overcome the defects and shortcomings of the prior art and provides a method, a system and a medium for cleaning operation data of power grid equipment.

In order to achieve the above object, an aspect of the present invention provides a method for cleaning operation data of a power grid device, including the following steps:

acquiring original operation data of the power grid equipment from a database and preprocessing the original operation data;

identifying noise data with abnormal attributes in the preprocessed running data and setting null values to obtain running data containing missing values;

constructing and training a data prediction model based on a back propagation feedforward neural network, predicting operation data containing a missing value, obtaining a data value to be filled, filling the data, and obtaining filled operation data;

and aiming at the filled running data, clustering by using an improved DBSCAN algorithm, selecting and deleting the highly matched repeated data records, and obtaining the cleaned running data.

As a preferred technical solution, the identifying noise data with abnormal attributes in the preprocessed operating data includes:

randomly selecting a data point p from the preprocessed running data, and calculating the neighborhood $N_\varepsilon(p)$ of the data point p, the neighborhood having radius ε, with the calculation formula:

$$N_\varepsilon(p) = \{q \in D \mid \mathrm{dist}(p, q) \le \varepsilon\}$$

wherein $N_\varepsilon(p)$ represents the neighborhood of the data point p, dist(p, q) represents the distance from another data point q in the running data to the data point p, and D represents the set of data points;

when dist(p, q) ≤ ε, adding 1 to the number of data points in the neighborhood of the data point p, and repeating the calculation until the distances from all data points to the data point p have been evaluated;

requiring the neighborhood $N_\varepsilon(p)$ of a data point p to contain at least Minpts data points: if $q \in N_\varepsilon(p)$ and $|N_\varepsilon(p)| \ge \mathrm{Minpts}$, marking the data point p as a core point; otherwise marking the data point p as noise data and setting the data point p to a null value, to be treated as a missing value;

and repeating the steps until all data points in the preprocessed running data are marked, so as to obtain the running data containing the missing values.
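The neighborhood test above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes that dist is the Euclidean distance and that a point counts as a member of its own neighborhood, and the function names `mark_noise` and `null_out_noise` are hypothetical. As in the text, every point that fails the core-point test is nulled; no separate border-point class is kept.

```python
import numpy as np

def mark_noise(data, eps, min_pts):
    """Return a boolean mask that is True where a point is treated as noise.

    A point p is a core point when N_eps(p) = {q in D | dist(p, q) <= eps}
    holds at least min_pts points.
    Assumption: Euclidean distance; p is counted in its own neighborhood.
    """
    data = np.asarray(data, dtype=float)
    # Pairwise Euclidean distances between all points (n x n matrix).
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # |N_eps(p)| for every p: number of points within radius eps.
    neighbor_counts = (dist <= eps).sum(axis=1)
    return neighbor_counts < min_pts

def null_out_noise(data, eps, min_pts):
    """Copy the data and replace every noise row with NaN (the null value)."""
    cleaned = np.array(data, dtype=float)
    cleaned[mark_noise(cleaned, eps, min_pts)] = np.nan
    return cleaned

# Three mutually close points and one isolated point.
points = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [8.0, 8.0]]
cleaned = null_out_noise(points, eps=0.5, min_pts=3)
```

In this toy input the isolated fourth point has only itself in its neighborhood, so it is the only row set to null and handed to the filling step.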

As a preferred technical scheme, the back propagation feedforward neural network, namely the BP neural network, comprises an input layer, a hidden layer and an output layer which are sequentially connected; the input layer, the hidden layer and the output layer all comprise a plurality of neuron nodes, and the neurons are connected through weights; the transformation function of the neuron adopts a Sigmoid function; the input layer inputs attribute values of the running data containing missing values; the output layer outputs the predicted data value to be filled;

and performing optimization training on the weight and the threshold of the BP neural network by adopting a genetic algorithm to obtain a data prediction model.

As a preferred technical scheme, the optimization training is performed by adopting a genetic algorithm to obtain a data prediction model, and the steps are as follows:

determining the network topology of the BP neural network, and dividing a data sample set prepared in advance into a training sample and a test sample;

coding the weight and the threshold of the BP neural network to obtain an initial population;

training a BP neural network by using a training sample, testing the BP neural network by using a testing sample, and calculating an error between an output value and an expected value;

calculating the fitness of chromosomes in the BP neural network, selecting chromosomes with high fitness to copy, and performing crossover and mutation operations to generate a new population;

judging whether the BP neural network reaches the performance index or the maximum iteration number, if so, decoding the chromosome obtained by encoding the weight and the threshold of the BP neural network to obtain the weight and the threshold of the optimal neural network as the initial weight and the threshold of the data prediction model;

otherwise, the initial population is obtained again and the execution is continued.

As a preferred technical solution, the network topology of the BP neural network is represented as: m-l-n, wherein m, l and n respectively represent the number of neuron nodes of an input layer, a hidden layer and an output layer;

the data sample set prepared in advance consists of N power grid data samples, which are randomly divided into training samples and test samples, respectively expressed as:

$$\{(x(t), y(t)) \mid x(t) \in R^m,\ y(t) \in R^n,\ t = 1, 2, \dots, N_1\}$$

$$\{(x(t), y(t)) \mid x(t) \in R^m,\ y(t) \in R^n,\ t = N_1 + 1, \dots, N\}$$

wherein (x(t), y(t)) represents a sample in the sample space, x(t) represents the sample value associated with y(t), y(t) represents the actual value of the sample to be predicted by the BP neural network, t is the time parameter, $R^m$ and $R^n$ are the m-dimensional and n-dimensional real spaces, $N_1$ represents the number of training samples, and N represents the number of samples in the data sample set;

the training sample is used for establishing an input-output mapping relation of the BP neural network; the test sample is used for verifying the correctness of the input and output mapping relation;

the input and output mapping relation of the BP neural network is expressed as:

wherein the left-hand side of the relation represents the output of the BP neural network, l is the number of hidden layer neuron nodes, $v_{jk}$ represents the weight from hidden layer neuron node j to output layer neuron node k, $k = 1, 2, \dots, n$, $w_{ij}$ is the weight from input layer neuron node i to hidden layer neuron node j, $x_i(t)$ represents the input of the BP neural network, $\theta_j$ is the threshold at hidden layer neuron node j, $r_k$ is the threshold at output layer neuron node k, and $f[\cdot]$ is the activation function, for which the Sigmoid function is adopted, expressed as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

when the total error $E_1$ of the BP neural network is less than or equal to the network total error target threshold $e_1$, there is:

wherein $y_k(t)$ represents a sample actual value;

when the average detection error $E_2$ of the BP neural network is less than or equal to the target threshold $e_2$ of the average error of the detected samples, there is:

where n represents the number of neuron nodes of the output layer.
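To make the m-l-n mapping concrete, the following Python sketch runs one forward pass and evaluates an error over a sample set. Several details are assumptions rather than statements of the invention: the thresholds $\theta_j$ and $r_k$ are added to the weighted sums, the output layer also applies the Sigmoid function, and the total error is taken as half the summed squared difference between actual and predicted values. The function names and the random toy data are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w, theta, v, r):
    """Forward pass of an m-l-n network.

    x: (m,) input; w: (m, l) input-to-hidden weights; theta: (l,) hidden thresholds;
    v: (l, n) hidden-to-output weights; r: (n,) output thresholds.
    Assumption: thresholds are added to the weighted sums and the
    output layer also applies the sigmoid.
    """
    hidden = sigmoid(x @ w + theta)
    return sigmoid(hidden @ v + r)

def total_error(samples, w, theta, v, r):
    """Assumed total error: half the summed squared difference over all samples."""
    return 0.5 * sum(
        float(np.sum((y - forward(x, w, theta, v, r)) ** 2)) for x, y in samples
    )

rng = np.random.default_rng(0)
m, l, n = 3, 4, 1
w, theta = rng.uniform(-1, 1, (m, l)), rng.uniform(-1, 1, l)
v, r = rng.uniform(-1, 1, (l, n)), rng.uniform(-1, 1, n)
# Toy samples (x(t), y(t)) with targets scaled into the sigmoid's output range.
samples = [(rng.uniform(0, 1, m), rng.uniform(0, 1, n)) for _ in range(5)]
E1 = total_error(samples, w, theta, v, r)
```

Because the assumed output activation is bounded in (0, 1), real measurements such as voltages would need to be normalized before training and rescaled after prediction.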

As a preferred technical scheme, the encoding is performed on the weight and the threshold of the BP neural network to obtain an initial population, and specifically the method comprises the following steps:

adopting a real number coding method, forming a chromosome from the network weights of the nodes in the BP neural network, arranged in sequence into a string, and initializing a population P(t) for optimizing the connection weights among neurons; the hidden layer transfer function in the BP neural network adopts a sigmoid cross entropy loss function, which introduces nonlinear mapping capability.

As a preferred technical solution, the calculating the fitness of the chromosome in the BP neural network specifically includes:

taking out a chromosome i from the initialized population P(t), inputting the network weights of the nodes in chromosome i into the BP neural network in sequence, calculating the total error E of the BP neural network, and defining the fitness $f_i$ of chromosome i, expressed as:

selecting chromosomes with high fitness for replication, setting the crossover probability $P_c$ and the mutation probability $P_m$, and initializing the selection probability $p_i$ of chromosome i in the population; the selection probability is defined as:

$$p_i = \alpha (1 - \alpha)$$

wherein α is a random number in [0, 1], and $i = 1, 2, \dots, t$;

performing crossover and mutation operations, wherein the crossover operations specifically comprise:

sorting the chromosomes by fitness from high to low, and calculating the cumulative probability $q_i$ of each chromosome i:

$$q_i = \sum_{j=1}^{i} p_j$$

using the roulette algorithm, generating a random number r ∈ [0, 1] in each round; if $q_{i-1} < r \le q_i$, taking chromosome i as a parent chromosome; running the roulette algorithm t times to obtain t parent chromosomes, adjacent parent chromosomes forming chromosome pairs; for each chromosome pair, determining the crossover position k according to the crossover probability $P_c$, and interchanging the genes numbered in [1, k] of the two parents of the chromosome pair, thereby obtaining two new chromosomes;

the mutation operation specifically comprises:

according to the mutation probability $P_m$, determining u mutation positions on the two new chromosomes, and performing the mutation operation on the genes at the mutation positions, that is, adding to each such gene a random decimal uniformly distributed on [-1, 1], to obtain two new child chromosomes;

new individuals are inserted into the population P (t), resulting in a new population P (t + 1).
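The selection, crossover and mutation operations can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the fitness is taken as 1 / (1 + E), selection probabilities are made proportional to fitness instead of using the $p_i$ definition above, the crossover position is drawn at random for every pair, and a sum of squared genes stands in for the network error E. All function names are hypothetical.

```python
import numpy as np

def roulette_select(cum_prob, rng):
    """Pick index i such that q_{i-1} < r <= q_i for a random r in [0, 1]."""
    r = rng.uniform(0.0, 1.0)
    # side="left" returns the first i with cum_prob[i] >= r; clamp guards rounding.
    return min(int(np.searchsorted(cum_prob, r, side="left")), len(cum_prob) - 1)

def crossover(parent_a, parent_b, k):
    """Swap the genes at positions 1..k (1-based) between two parents."""
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[:k], child_b[:k] = parent_b[:k], parent_a[:k]
    return child_a, child_b

def mutate(chromosome, p_m, rng):
    """Add a uniform random value from [-1, 1] to each gene chosen with probability p_m."""
    mask = rng.uniform(0.0, 1.0, chromosome.shape) < p_m
    return chromosome + mask * rng.uniform(-1.0, 1.0, chromosome.shape)

rng = np.random.default_rng(42)
population = rng.uniform(-3.0, 3.0, (6, 8))          # 6 chromosomes, 8 genes each
errors = np.array([(c ** 2).sum() for c in population])  # stand-in for the network error E
fitness = 1.0 / (1.0 + errors)                       # assumed fitness: larger for smaller error
order = np.argsort(-fitness)                         # sort by fitness, high to low
population, fitness = population[order], fitness[order]
cum_prob = np.cumsum(fitness / fitness.sum())        # cumulative probability q_i

parents = [population[roulette_select(cum_prob, rng)] for _ in range(len(population))]
children = []
for a, b in zip(parents[0::2], parents[1::2]):       # adjacent parents form a pair
    k = int(rng.integers(1, a.size))                 # crossover position in 1..genes-1
    for child in crossover(a, b, k):
        children.append(mutate(child, p_m=0.1, rng=rng))
new_population = np.array(children)                  # plays the role of P(t + 1)
```

Sorting before the cumulative sum is not required for roulette selection to work, but it mirrors the order of operations described in the text.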

As a preferred technical solution, the selecting and deleting of the highly matched repeated data records by using the improved DBSCAN algorithm specifically includes:

establishing a word-document matrix in an inverted index mode, quickly acquiring a document list containing a certain word according to the word, and dividing the filled running data into a plurality of subsets according to the same type of equipment;

clustering a plurality of subsets by using an improved DBSCAN algorithm to enable repeated data to form a cluster, calculating the similarity of records in the cluster and judging whether the records are the repeated data records, wherein the method specifically comprises the following steps:

determining parameters of the improved DBSCAN algorithm, the parameters comprising a radius ε′, a minimum number of points Minpts′ within the radius, and an initial similarity threshold R;

randomly selecting a data point A from a subset as a core point, and calculating the similar distances between the remaining data points and the data point A by using the similar distance function aproxDist();

clustering the filled operation data in the subset by the set distance thresholds $N_1$, $N_2$:

and calculating the similarity of any two pieces of filled running data by combining the attribute weights, with the calculation formula:

wherein n is the total number of attributes of the filled running data,

$S_{A_i}(x, y)$ represents the similarity of the attributes between the filled running data x and the filled running data y, and is expressed as:

wherein d represents the distance between the two-dimensional spatial points to which x and y are respectively mapped;

judging, according to the initial similarity threshold R, whether to iteratively update the radius ε′, with the iteration formula:

when iteration is carried out until the output result meets the acceptable range of the similarity threshold, clustering is completed to obtain a repeated data record set;

and calculating the similarity of each data record according to the repeated data record set, keeping the data record with the maximum similarity, and deleting the rest data to obtain the cleaned running data.
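The final selection step can be illustrated with a short Python sketch. Every formula in it is an assumption introduced for illustration only: the attribute similarity is taken as 1 / (1 + d) with d the absolute difference of the attribute values, the record similarity is the weighted sum of attribute similarities with weights summing to 1, and the record kept from a cluster is the one with the highest average similarity to the other records. The function names and sample values are hypothetical.

```python
import numpy as np

def record_similarity(x, y, weights):
    """Weighted similarity of two records.

    Assumptions: attribute similarity is 1 / (1 + |x_i - y_i|) and the
    weights sum to 1, so the result lies in (0, 1].
    """
    x, y, weights = (np.asarray(a, dtype=float) for a in (x, y, weights))
    return float(np.sum(weights / (1.0 + np.abs(x - y))))

def keep_representative(cluster, weights):
    """Return the index of the record most similar, on average, to the others.

    Expects a cluster of at least two records.
    """
    scores = [
        np.mean([record_similarity(rec, other, weights)
                 for j, other in enumerate(cluster) if j != i])
        for i, rec in enumerate(cluster)
    ]
    return int(np.argmax(scores))

# One cluster of near-duplicate records with two attributes each.
cluster = [[110.0, 35.2], [110.1, 35.2], [110.4, 35.3]]
weights = [0.5, 0.5]
kept = keep_representative(cluster, weights)
deduplicated = [cluster[kept]]
```

Here the middle record is retained because it lies closest to both of the others; the remaining two records would be deleted as duplicates.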

In another aspect, the invention provides a system for cleaning operation data of power grid equipment, which applies the above method for cleaning operation data of power grid equipment and comprises a data acquisition module, a noise point identification module, a data filling module and a data deleting module;

the data acquisition module is used for acquiring original operation data of the power grid equipment from the database and preprocessing the original operation data;

the noise point identification module is used for identifying noise point data with abnormal attributes in the preprocessed running data and setting null values to obtain running data containing missing values;

the data filling module builds and trains a data prediction model based on the back propagation feedforward neural network, predicts running data containing a missing value, obtains a data value to be filled, fills the data and obtains filled running data;

and the data deleting module selects and deletes the highly matched repeated data records by using the improved DBSCAN algorithm aiming at the filled running data to obtain the cleaned running data.

In another aspect, the present invention provides a computer-readable storage medium, which stores a program, and when the program is executed by a processor, the program implements the method for cleaning the operation data of the power grid device.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The density-based clustering adopted by the method can cluster dense data sets of arbitrary shape without the number of clusters being specified in advance; it finds abnormal points while clustering and is insensitive to the abnormal points in the data set; and its clustering results are not biased, whereas the initial values of the K-Means clustering algorithm have a great influence on its clustering results.

2. The method for predicting and filling the missing value constructs a data prediction model based on the genetic neural network, fully utilizes the global search capability of the genetic algorithm and the nonlinear mapping capability of the neural network, greatly improves the prediction precision of the data, and has controllable data prediction precision.

3. Aiming at the problem of low detection accuracy of repeated records of a density clustering algorithm, the improved DBSCAN clustering algorithm improves the accuracy of detecting repeated data records to a certain extent and ensures the effectiveness of data cleaning.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings required for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a method for cleaning operation data of power grid equipment according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a training data prediction model according to an embodiment of the present invention;

FIG. 3 is a flow chart of an improved DBSCAN algorithm according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a system for cleaning operation data of power grid equipment according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

In this embodiment, based on the problem of cleaning BPA database data in the southern power grid NF2022DX, with reference to fig. 1 to 3, the method for cleaning the operating data of the power grid equipment provided by the present invention is described in detail, and includes the following steps:

S1, acquiring original operation data of the power grid equipment from a database and preprocessing the data;

S2, identifying noise data with abnormal attributes in the preprocessed running data and setting null values to obtain running data containing missing values;

S3, constructing and training a data prediction model based on a back propagation feedforward neural network, predicting the operation data containing the missing values, obtaining the data values to be filled, filling the data, and obtaining the filled operation data;

S4, clustering the filled running data by using an improved DBSCAN algorithm, selecting and deleting highly matched repeated data records, and obtaining the cleaned running data.

More specifically, in step S1, since the original operation data is from different power grid devices, the data needs to be preprocessed, which facilitates effective cleaning of the operation data; in this embodiment, the data attribute is preliminarily selected according to experience to preprocess the data.

More specifically, in step S2, the invention adopts a guiding concept based on a density clustering algorithm to cluster the preprocessed operating data and identify noise data with abnormal attributes, and the steps are as follows:

S21, randomly selecting a data point p from the preprocessed running data, and calculating the neighborhood $N_\varepsilon(p)$ of the data point p, the neighborhood having radius ε, with the calculation formula:

$$N_\varepsilon(p) = \{q \in D \mid \mathrm{dist}(p, q) \le \varepsilon\}$$

wherein $N_\varepsilon(p)$ represents the neighborhood of the data point p, dist(p, q) represents the distance from another data point q in the running data to the data point p, and D represents the set of data points;

defining a core object: the neighborhood $N_\varepsilon(p)$ of a data point p is required to contain at least Minpts data points; for the given neighborhood $N_\varepsilon$ and Minpts, if $q \in N_\varepsilon(p)$ and $|N_\varepsilon(p)| \ge \mathrm{Minpts}$, marking the data point p as a core point; otherwise marking the data point p as noise data and setting the data point p to a null value, to be treated as a missing value, thereby facilitating the subsequent step of predicting and filling the missing value.

More specifically, in step S3, the back propagation feedforward neural network, that is, the BP neural network, includes an input layer, a hidden layer, and an output layer; each layer includes a plurality of neuron nodes (the number depending on the dimension of the specific data attributes), the neurons are connected by weights, and the transformation function of the neurons adopts a Sigmoid function; the input layer receives the attribute values of the running data containing the missing value, and the output layer outputs the predicted data value to be filled;

As shown in FIG. 2, the weights and thresholds of the BP neural network are optimized and trained by using a genetic algorithm to obtain the data prediction model and improve its prediction performance, specifically:

S31, firstly, determining the network topology of the BP neural network, and dividing a data sample set prepared in advance into training samples and test samples;

wherein the network topology of the BP neural network is represented as m-l-n, where m, l and n respectively represent the numbers of neuron nodes of the input layer, the hidden layer and the output layer; the data sample set prepared in advance consists of N power grid data samples, which are randomly divided into training samples and test samples, respectively expressed as:

$$\{(x(t), y(t)) \mid x(t) \in R^m,\ y(t) \in R^n,\ t = 1, 2, \dots, N_1\}$$

$$\{(x(t), y(t)) \mid x(t) \in R^m,\ y(t) \in R^n,\ t = N_1 + 1, \dots, N\}$$

in this embodiment, the data sample set prepared in advance comes from the power grid data of the subordinate 110KV bus AB, and the voltage of the AB line is taken as the prediction target of the BP neural network, so that x(t) represents the active power associated with the voltage, and y(t) is the actual voltage value to be predicted by the BP neural network; after training, the BP neural network takes x(t) as input and outputs a predicted value; when the predicted value is almost equal to the actual voltage value y(t), the prediction performance of the BP neural network reaches the standard; the BP neural network can then be used to fill in missing AB line voltage data, the corresponding AB line voltage being predicted from the recorded active power.

In the invention, a training sample is used for establishing an input-output mapping relation of a BP neural network; the test sample is used for verifying the correctness of the input and output mapping relation; the input and output mapping relation of the BP neural network is expressed as:

wherein the left-hand side of the relation represents the output of the BP neural network, l is the number of hidden layer neuron nodes, $v_{jk}$ represents the weight from hidden layer neuron node j to output layer neuron node k, $k = 1, 2, \dots, n$, $w_{ij}$ is the weight from input layer neuron node i to hidden layer neuron node j, $x_i(t)$ represents the input of the BP neural network, $\theta_j$ is the threshold at hidden layer neuron node j, $r_k$ is the threshold at output layer neuron node k, and $f[\cdot]$ is the activation function, for which the Sigmoid function is adopted, expressed as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

wherein $y_k(t)$ represents a sample actual value, and n represents the number of neuron nodes of the output layer.

S32, coding the weight and the threshold of the BP neural network to obtain an initial population;

A real number coding method is adopted to avoid weight feeding changes: the network weights of all nodes in the BP neural network form a chromosome, arranged in sequence into a string, and a population P(t) is initialized for optimizing the connection weights among neurons; the hidden layer transfer function in the BP neural network adopts a sigmoid cross entropy loss function, which introduces nonlinear mapping capability; in this embodiment, the population is initialized with random decimals uniformly distributed on [-3, 3], which mitigates the slow convergence of the algorithm caused by overly small weight adjustments.

S33, training the BP neural network by using the training sample, testing the BP neural network by using the test sample, and calculating the error between the output value and the expected value;

S34, calculating the fitness of the chromosomes in the BP neural network, selecting the chromosomes with high fitness to copy, and performing crossover and mutation operations to generate a new population;

selecting chromosomes with high fitness for replication, setting the crossover probability $P_c$ and the mutation probability $P_m$, and initializing the selection probability $p_i$ of chromosome i in the population, defined as:

$$p_i = \alpha (1 - \alpha)$$

wherein α is a random number in [0, 1], and $i = 1, 2, \dots, t$;

using the roulette algorithm, generating a random number r ∈ [0, 1] in each round; if $q_{i-1} < r \le q_i$, where $q_i$ is the cumulative probability of chromosome i, taking chromosome i as a parent chromosome; running the roulette algorithm t times to obtain t parent chromosomes, adjacent parent chromosomes forming chromosome pairs; for each chromosome pair, determining the crossover position k according to the crossover probability $P_c$, and interchanging the genes numbered in [1, k] of the two parents of the chromosome pair, thereby obtaining two new chromosomes;

S35, judging whether the BP neural network reaches the performance index or the maximum iteration number, if so, decoding the chromosome obtained by encoding the weight and the threshold of the BP neural network to obtain the weight and the threshold of the optimal neural network as the initial weight and the threshold of the data prediction model; if not, returning to the step S32 to obtain the initial population again and continuing to execute.
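The real number coding of step S32 and the decoding of step S35 can be sketched as a flat vector that is split back into weight matrices and threshold vectors. The gene order used here (input-to-hidden weights, hidden thresholds, hidden-to-output weights, output thresholds) is an assumption, as are the function names; the [-3, 3] initialization range is the one stated in the embodiment.

```python
import numpy as np

def chromosome_length(m, l, n):
    """Number of genes: all weights plus all thresholds of an m-l-n network."""
    return m * l + l + l * n + n

def encode(w, theta, v, r):
    """Concatenate weights and thresholds, in a fixed order, into one string of genes."""
    return np.concatenate([w.ravel(), theta, v.ravel(), r])

def decode(chromosome, m, l, n):
    """Split a flat real-valued chromosome back into (w, theta, v, r)."""
    sizes = [m * l, l, l * n, n]
    w, theta, v, r = np.split(chromosome, np.cumsum(sizes)[:-1])
    return w.reshape(m, l), theta, v.reshape(l, n), r

def init_population(size, m, l, n, rng):
    """Draw every gene uniformly from [-3, 3]."""
    return rng.uniform(-3.0, 3.0, (size, chromosome_length(m, l, n)))

rng = np.random.default_rng(1)
# A 1-5-1 topology, e.g. active power in, line voltage out (hidden size is arbitrary here).
population = init_population(size=10, m=1, l=5, n=1, rng=rng)
w, theta, v, r = decode(population[0], m=1, l=5, n=1)
```

Keeping `encode` and `decode` as exact inverses means the best chromosome found by the genetic algorithm can be loaded directly as the initial weights and thresholds of the prediction model.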

More specifically, in step S4, as shown in FIG. 3, the improved DBSCAN algorithm is used to group approximately duplicate data records into the same class through multiple iterations, and the steps are as follows:

in the first stage, a word-document matrix is established by means of an inverted index, so that the list of documents containing a given word can be obtained quickly from that word; the filled operation data are divided into a plurality of subsets according to equipment type, so that the second-stage screening of duplicate records by improved DBSCAN clustering can be carried out on each subset separately, which reduces the computing resources and time complexity of the second stage;

in the second stage, the improved DBSCAN algorithm is used to cluster the plurality of subsets so that duplicate data records form a cluster, the similarity of the records in the cluster is calculated, and whether the records are duplicate data records is judged, specifically:

S41, determining parameters of the improved DBSCAN algorithm, including a radius ε′, a minimum number of points Minpts′ within the radius, and an initial similarity threshold R;

S42, randomly selecting a data point A from a subset as a core point, and calculating the similar distances between the remaining data points and the data point A by using the similar distance function aproxDist();

S43, clustering the filled operation data in the subset by the set distance thresholds $N_1$, $N_2$:

wherein n is the total number of attributes of the filled running data,

wherein d represents the distance between the two-dimensional spatial points to which x and y are respectively mapped;

judging, according to the initial similarity threshold R, whether to iteratively update the radius ε′, with the iteration formula:

S44, after clustering is completed, obtaining a repeated data record set, calculating the similarity of each data record, keeping the data record with the maximum similarity, and deleting the remaining data records to obtain the cleaned running data.
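The first-stage partitioning can be pictured as an inverted index from equipment type to record identifiers. The sketch below is a minimal Python illustration; the record fields `equipment_type` and `voltage` and the function name `build_inverted_index` are hypothetical and not taken from the BPA database.

```python
from collections import defaultdict

def build_inverted_index(records, key):
    """Map each value of `key` to the list of record ids that contain it."""
    index = defaultdict(list)
    for record_id, record in enumerate(records):
        index[record[key]].append(record_id)
    return dict(index)

# Hypothetical filled operation records.
records = [
    {"equipment_type": "transformer", "voltage": 112.7},
    {"equipment_type": "bus", "voltage": 113.5},
    {"equipment_type": "transformer", "voltage": 112.7},
]
index = build_inverted_index(records, key="equipment_type")
# One subset per equipment type; clustering then runs on each subset separately.
subsets = {term: [records[i] for i in ids] for term, ids in index.items()}
```

Since duplicate candidates are only compared within a subset, the number of pairwise distance evaluations in the second stage falls from the square of the total record count to the sum of the squares of the subset sizes.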

In order to verify the cleaning performance of the invention on power grid operation data, in this embodiment the missing values are predicted, based on the power grid operation data of the subordinate 110KV bus AB, the subordinate 110KV bus BC and the subordinate 110KV bus CA, by using a mean value method (K-Means) and the BP neural network provided by the invention, giving the data shown in Table 1 below:

Method	Subordinate 110KV bus AB	Subordinate 110KV bus BC	Subordinate 110KV bus CA
Mean value method	112.560	112.950	114.220
BP neural network	112.739	113.533	113.148
True value	112.741	113.531	113.150

TABLE 1

Therefore, the missing values predicted by the BP neural network provided by the invention are closer to the true values, and the prediction is more accurate than that of the mean value method.
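The gap between the two methods can be quantified directly from the values in Table 1. The short Python calculation below computes the absolute error of each prediction against the true value; the dictionary and function names are only illustrative.

```python
true_value = {"AB": 112.741, "BC": 113.531, "CA": 113.150}
mean_method = {"AB": 112.560, "BC": 112.950, "CA": 114.220}
bp_network = {"AB": 112.739, "BC": 113.533, "CA": 113.148}

def abs_errors(pred):
    """Absolute error per bus, rounded to the three decimals used in Table 1."""
    return {bus: round(abs(pred[bus] - true_value[bus]), 3) for bus in true_value}

mean_err = abs_errors(mean_method)
bp_err = abs_errors(bp_network)
```

The mean value method is off by 0.181, 0.581 and 1.070 on buses AB, BC and CA respectively, while the BP neural network is off by 0.002 on each bus.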

Meanwhile, in order to verify the accuracy of repeated record identification of the improved DBSCAN algorithm, the invention randomly selects 3 groups of running data with different record numbers (500, 5000, 50000) from a BPA database in a southern power grid NF2022DX, after missing values are predicted through a BP neural network, a basic record matching algorithm and the improved DBSCAN algorithm are respectively used for identification, and the identification results are shown in the following table 2:

Number of data records	Basic record matching algorithm	Improved DBSCAN algorithm
500	90.33%	90.78%
5000	80.54%	86.66%
50000	69.47%	79.68%

TABLE 2

As can be seen from Table 2, as the number of data records increases, the accuracy of the improved DBSCAN algorithm remains higher than that of the basic record matching algorithm and declines more slowly, so the improved DBSCAN algorithm offers higher accuracy and better performance in screening duplicate data.
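The trend read from Table 2 can be checked with a few lines of arithmetic. The Python calculation below uses only the accuracies listed in the table; the variable names are illustrative.

```python
record_counts = [500, 5000, 50000]
basic = [90.33, 80.54, 69.47]        # basic record matching accuracy, percent
improved = [90.78, 86.66, 79.68]     # improved DBSCAN accuracy, percent

# Advantage of the improved algorithm at each data size, in percentage points.
gap = [round(i - b, 2) for b, i in zip(basic, improved)]
# Loss of accuracy between 500 and 50000 records, in percentage points.
basic_drop = round(basic[0] - basic[-1], 2)
improved_drop = round(improved[0] - improved[-1], 2)
```

The advantage grows from 0.45 to 6.12 to 10.21 percentage points as the record count increases, and over the whole range the basic algorithm loses 20.86 points against 11.10 points for the improved DBSCAN algorithm.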

It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.

Based on the same idea as the method for cleaning the operation data of the power grid equipment in the above embodiment, the invention further provides a system for cleaning the operation data of the power grid equipment, which can be used to execute that method. For convenience of illustration, the schematic structural diagram of the system embodiment shows only the parts related to the embodiments of the present invention; those skilled in the art will understand that the illustrated structure does not limit the device, which may include more or fewer components than illustrated, combine some components, or arrange the components differently.

Referring to fig. 4, in another embodiment of the present application, a system for cleaning operation data of a power grid device is provided, where the system includes a data obtaining module, a noise point identifying module, a data filling module, and a data deleting module;

the data filling module builds and trains a data prediction model based on a back propagation feedforward neural network, predicts running data containing missing values, obtains data values to be filled, fills the data, and obtains filled running data;

and the data deleting module selects and deletes the highly matched repeated data records by using the improved DBSCAN algorithm according to the filled running data to obtain the cleaned running data.

It should be noted that the system for cleaning operation data of power grid equipment of the present invention corresponds one-to-one with the method for cleaning operation data of power grid equipment of the present invention. The technical features and beneficial effects described in the above method embodiment all apply to the system embodiment; for specific contents, reference may be made to the description of the method embodiment, which is not repeated here.

In addition, in the system for cleaning operation data of power grid equipment of the foregoing embodiment, the logical division into program modules is only an example. In practical applications, the functions may be allocated to different program modules as needed, for example to suit the configuration requirements of the corresponding hardware or for convenience of software implementation; that is, the internal structure of the system may be divided into different program modules to perform all or part of the functions described above.

Referring to fig. 5, an embodiment of the present invention further provides a computer-readable storage medium, in which a program is stored in a memory, and when the program is executed by a processor, the method for cleaning operation data of a power grid device is implemented, where the method includes:

S4, clustering the filled running data by using an improved DBSCAN algorithm, selecting and deleting the highly matched repeated data records, and obtaining the cleaned running data.

It will be understood by those skilled in the art that all or part of the processes of the method embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered within the scope of the present specification.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A method for cleaning operation data of power grid equipment is characterized by comprising the following steps:

2. The method for cleaning the operation data of the power grid equipment according to claim 1, wherein noise data with abnormal attributes are identified in the preprocessed operation data, and the method comprises the following steps:

$$N_\varepsilon(p) = \{\, q \in D \mid \mathrm{dist}(p, q) \le \varepsilon \,\}$$

wherein $N_\varepsilon(p)$ represents the $\varepsilon$-neighborhood of the data point $p$, $\mathrm{dist}(p, q)$ represents the distance from another data point $q$ in the running data to the data point $p$, and $D$ represents the set of data points;
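The neighborhood definition above can be sketched in a few lines. The claim does not fix the distance function at this point, so Euclidean distance is assumed here; the `min_pts` threshold and the function names are likewise hypothetical, shown only to indicate how a sparse neighborhood could flag a noise candidate.

```python
import math

def eps_neighborhood(p, data, eps):
    """Return every point q in data with dist(p, q) <= eps.
    Euclidean distance is an assumption; p itself is included if present."""
    return [q for q in data if math.dist(p, q) <= eps]

def is_noise_candidate(p, data, eps, min_pts):
    """Hypothetical check: fewer than min_pts neighbors suggests noise."""
    return len(eps_neighborhood(p, data, eps)) < min_pts

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
print(eps_neighborhood((0.0, 0.0), data, 1.0))
print(is_noise_candidate((5.0, 5.0), data, 1.0, 2))
```

In this toy data set the isolated point (5.0, 5.0) has only itself in its neighborhood and is therefore flagged.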

3. The method for cleaning the operation data of the power grid equipment according to claim 2, wherein the back propagation feedforward neural network (BP neural network) comprises an input layer, a hidden layer and an output layer which are sequentially connected; the input layer, the hidden layer and the output layer all comprise a plurality of neuron nodes, and the neurons are connected through weights; the transformation function of the neuron adopts a Sigmoid function; the input layer inputs attribute values of the running data containing missing values; the output layer outputs the predicted data value to be filled;

4. The method for cleaning the operation data of the power grid equipment according to claim 3, wherein the optimization training is performed by adopting a genetic algorithm to obtain a data prediction model, and the method comprises the following steps:

training a BP neural network by using a training sample, testing the BP neural network by using a test sample and calculating an error between an output value and an expected value;

5. The method for cleaning the operation data of the power grid equipment according to claim 4, wherein the network topology of the BP neural network is represented as m-l-n, wherein m, l and n respectively represent the number of neuron nodes of the input layer, the hidden layer and the output layer;

and a test sample

are respectively expressed as:

the input and output mapping relation of the BP neural network is expressed as:

wherein

representing the output of the BP neural network, $l$ is the number of hidden layer neuron nodes, $v_{jk}$ represents the weight from hidden layer neuron node $j$ to output layer neuron node $k$, $k = 1, 2, \ldots, n$, $w_{ij}$ is the weight from input layer neuron node $i$ to hidden layer neuron node $j$, $x_i(t)$ represents the input of the BP neural network, $\theta_j$ is the threshold at hidden layer neuron node $j$, $r_k$ is the threshold at output layer neuron node $k$, and $f[\cdot]$ is the activation function, expressed as:

wherein $y_k(t)$ represents the actual sample value;

where n represents the number of neuron nodes of the output layer.
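The m-l-n mapping described by these symbols can be sketched as a forward pass. The formulas themselves are not reproduced in this text, so the sketch makes three labeled assumptions: the thresholds are subtracted from the weighted sums, the Sigmoid function of claim 3 is applied at both the hidden and the output layer, and the error is half the sum of squared differences. The function names are hypothetical.

```python
import math

def sigmoid(x):
    """Sigmoid transformation function named in claim 3."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w, theta, v, r):
    """x: m inputs; w[i][j]: input i -> hidden j; theta[j]: hidden thresholds;
    v[j][k]: hidden j -> output k; r[k]: output thresholds.
    Assumption: thresholds are subtracted, Sigmoid is used at both layers."""
    l, n = len(theta), len(r)
    hidden = [sigmoid(sum(w[i][j] * x[i] for i in range(len(x))) - theta[j])
              for j in range(l)]
    return [sigmoid(sum(v[j][k] * hidden[j] for j in range(l)) - r[k])
            for k in range(n)]

def total_error(outputs, targets):
    """Assumed error: half the sum of squared differences over n outputs."""
    return 0.5 * sum((y - o) ** 2 for o, y in zip(outputs, targets))
```

With a 2-2-1 network whose weights and thresholds are all zero, `forward` returns 0.5, which gives a quick sanity check before training.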

6. The method for cleaning the operation data of the power grid equipment according to claim 5, wherein the BP neural network weight and the threshold are encoded to obtain an initial population, and specifically the method comprises the following steps:

7. The method for cleaning the operation data of the power grid equipment according to claim 6, wherein the calculating the fitness of the chromosome in the BP neural network specifically comprises:

taking out a chromosome $i$ from the initialized population $P(t)$, inputting the network weights of each node into the BP neural network in sequence, calculating the total error $E$ of the BP neural network, and defining the fitness $f_i$ of chromosome $i$, expressed as:

selecting chromosomes with high fitness for replication, setting the crossover probability $P_c$ and the mutation probability $P_m$, and initializing the selection probability $p_i$ of chromosome $i$ in the population; the selection probability is defined as:

$$p_i = \alpha \cdot (1 - \alpha)$$

wherein $\alpha$ is a random number in $[0, 1]$, and $i = 1, 2, \ldots, t$;

using the roulette algorithm, generating a random number $r \in [0, 1]$ in each round; if $q_{i-1} < r \le q_i$, taking chromosome $i$ as a parent chromosome; performing the roulette algorithm $t$ times to obtain $t$ parent chromosomes, with adjacent parent chromosomes forming a chromosome pair; for each chromosome pair, determining the crossover position $k$ according to the crossover probability $P_c$, and interchanging the genes numbered within $[1, k]$ between the two parents of the chromosome pair, thereby obtaining two new chromosomes;
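The selection and crossover steps can be sketched as follows. The definition of $q_i$ does not appear in this text, so it is assumed to be the cumulative sum of the normalized selection probabilities; it is also assumed that a pair is crossed with probability $P_c$ at a uniformly random position $k$. All function names are hypothetical.

```python
import random

def cumulative(probs):
    """q_i as the running sum of selection probabilities (assumed definition)."""
    total, q = 0.0, []
    for p in probs:
        total += p
        q.append(total)
    return q

def roulette_pick(q, r):
    """Return the 0-based index of the first q_i with r <= q_i, that is,
    q_{i-1} < r <= q_i; falls back to the last chromosome."""
    for i, qi in enumerate(q):
        if r <= qi:
            return i
    return len(q) - 1

def crossover(a, b, k):
    """Swap the genes at positions 1..k (1-based) between two parents."""
    return b[:k] + a[k:], a[:k] + b[k:]

def next_generation(population, probs, p_c, rng=random):
    """Pick t parents by roulette, pair adjacent parents, cross each pair."""
    s = sum(probs)
    q = cumulative([p / s for p in probs])  # normalize so the last q_i is 1
    parents = [population[roulette_pick(q, rng.random())] for _ in population]
    children = []
    # An unpaired last parent (odd population size) is dropped by zip.
    for a, b in zip(parents[0::2], parents[1::2]):
        if rng.random() < p_c:  # assumed: the pair is crossed with probability P_c
            k = rng.randint(1, len(a) - 1)
            a, b = crossover(a, b, k)
        children += [a, b]
    return children
```

For example, `crossover([1, 2, 3, 4], [5, 6, 7, 8], 2)` exchanges the first two genes and yields `([5, 6, 3, 4], [1, 2, 7, 8])`.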

the mutation operation specifically comprises:

8. The method for cleaning the operation data of the power grid equipment according to claim 7, wherein the improved DBSCAN algorithm is used to select and delete the highly matched repeated data records, specifically:

randomly selecting a data point A from a certain subset as a core point, and calculating the similar distances between the remaining data points and the data point A by using the similar distance function aproxDent();

wherein n is the total number of attributes of the filled running data,

9. A cleaning system of power grid equipment operation data is characterized by being applied to the cleaning method of the power grid equipment operation data in any one of claims 1 to 8, and comprising a data acquisition module, a noise point identification module, a data filling module and a data deleting module;

the data acquisition module is used for acquiring original operation data of the power grid equipment from the database and preprocessing the original operation data;

the noise point identification module is used for identifying noise data with abnormal attributes in the preprocessed running data and setting null values to obtain running data containing missing values;

the data filling module constructs and trains a data prediction model based on a back propagation feedforward neural network, predicts the operation data containing the missing values, obtains the data values to be filled, fills the data and obtains the filled operation data;

10. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements a method for cleaning operation data of a power grid device according to any one of claims 1 to 8.