CN115423008A - Method, system and medium for cleaning operation data of power grid equipment - Google Patents

Info

Publication number
CN115423008A
Authority
CN
China
Prior art keywords
data
neural network
power grid
value
running
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211023992.0A
Other languages
Chinese (zh)
Inventor
龙云
卢有飞
梁雪青
吴任博
张扬
刘璐豪
赵宏伟
陈明辉
张少凡
邹时容
蔡燕春
刘璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202211023992.0A
Publication of CN115423008A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04: INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S: SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00: Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a method, a system and a medium for cleaning operation data of power grid equipment. The method comprises the following steps: acquiring original operation data of the power grid equipment from a database and preprocessing it; identifying noise data with abnormal attributes in the preprocessed operation data and setting them to null values, to obtain operation data containing missing values; constructing and training a data prediction model based on a back propagation (BP) feedforward neural network, predicting the data values to be filled for the operation data containing missing values, and filling them in to obtain filled operation data; and clustering the filled operation data with an improved DBSCAN algorithm, selecting and deleting highly matched duplicate data records, to obtain the cleaned operation data. In this way, missing values are obtained by identifying noise data and setting them to null, a data prediction model is constructed to predict the fill values, and a clustering algorithm finally completes the cleaning of both types of dirty data.

Description

Method, system and medium for cleaning operation data of power grid equipment
Technical Field
The invention belongs to the technical field of power grid equipment data processing, and particularly relates to a method, a system and a medium for cleaning power grid equipment operation data.
Background
With the diversification of energy types and the continuous optimization of the energy structure in China, units belonging to different systems generate massive data streams during operation, and the information that can be mined from these data is very rich. However, the data include erroneous records caused by factors such as sensor failure and external interference, as well as mutually conflicting records that come from different service systems. Such data are called dirty data: they cannot be used directly and can even obstruct information mining. Cleaning dirty data to improve data quality, and thereby guaranteeing the accuracy of subsequent data analysis, is therefore of great significance.
In the prior art, the cleaning of unit equipment operation data falls into two categories: cleaning of data with erroneous attribute values and cleaning of duplicate data. For data with abnormal attribute values, the main task is outlier detection. A common, simple and easy-to-implement approach is statistical: on the basis of Chebyshev's theorem, values are judged abnormal according to how many standard deviations (n) they lie from the mean of the data set. However, this traditional approach is difficult to apply when the data dimension is high and the outlier distribution is complex. For missing values in the operation data, filling with the median, mean or mode often reduces the reliability of the data and can even mislead the analysis results and subsequent decisions. Meanwhile, multi-source data from different systems may contain duplicate records; the most direct approach, pairwise comparison of the records in a database, involves too many steps, consumes resources and has high time complexity, which makes the data cleaning work hard to scale.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art and provides a method, a system and a medium for cleaning operation data of power grid equipment.
In order to achieve the above object, an aspect of the present invention provides a method for cleaning operation data of a power grid device, including the following steps:
acquiring original operation data of the power grid equipment from a database and preprocessing the original operation data;
identifying noise data with abnormal attributes in the preprocessed operation data and setting them to null values, to obtain operation data containing missing values;
constructing and training a data prediction model based on a back propagation feedforward neural network, predicting the data values to be filled for the operation data containing missing values, and filling the data to obtain filled operation data;
clustering the filled operation data using an improved DBSCAN algorithm, selecting and deleting highly matched duplicate data records, and obtaining the cleaned operation data.
As a preferred technical solution, the identifying noise data with abnormal attributes in the preprocessed operation data includes:
randomly selecting a data point p from the preprocessed operation data and calculating its neighborhood $N_\varepsilon(p)$ of radius $\varepsilon$, according to the formula:
$N_\varepsilon(p) = \{q \in D \mid \mathrm{dist}(p, q) \le \varepsilon\}$
wherein $N_\varepsilon(p)$ represents the neighborhood of the data point p, $\mathrm{dist}(p, q)$ represents the distance from another data point q in the operation data to the data point p, and D represents the set of data points;
whenever $\mathrm{dist}(p, q) \le \varepsilon$, adding 1 to the number of data points in the neighborhood of the data point p, and looping until the distances from all data points to the data point p have been calculated;
requiring the neighborhood $N_\varepsilon(p)$ of a data point p to contain at least Minpts data points: if $q \in N_\varepsilon(p)$ and $|N_\varepsilon(p)| \ge$ Minpts, marking the data point p as a core point; otherwise marking the data point p as noise data and setting it to a null value, as a missing value;
and repeating the above steps until all data points in the preprocessed operation data are marked, so as to obtain the operation data containing missing values.
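The neighborhood count and noise marking described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes Euclidean distance for dist(p, q), assumes that p counts as a member of its own neighborhood (since dist(p, p) = 0), and uses the hypothetical names `mark_noise` and `null_out_noise`. Every point that is not a core point is treated as noise, as the text states.

```python
import math

def mark_noise(points, eps, min_pts):
    """Return a list of booleans: True where a point is treated as noise."""
    flags = []
    for p in points:
        # |N_eps(p)|: count every q in D with dist(p, q) <= eps.
        # Assumption: p is its own neighbour, because dist(p, p) = 0.
        count = sum(1 for q in points if math.dist(p, q) <= eps)
        flags.append(count < min_pts)
    return flags

def null_out_noise(points, eps, min_pts):
    """Replace each noise record with None so it can be filled later."""
    flags = mark_noise(points, eps, min_pts)
    return [None if is_noise else list(p) for p, is_noise in zip(points, flags)]

# Illustrative values only: three nearby records and one isolated record.
data = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.05], [8.0, 8.0]]
print(null_out_noise(data, eps=0.5, min_pts=3))
```

The loop is quadratic in the number of records, which matches the "calculate the distance from all data points to p" wording; a spatial index would be a natural optimization for large data sets.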
As a preferred technical scheme, the back propagation feedforward neural network, namely the BP neural network, comprises an input layer, a hidden layer and an output layer which are sequentially connected; the input layer, the hidden layer and the output layer all comprise a plurality of neuron nodes, and the neurons are connected through weights; the transformation function of the neuron adopts a Sigmoid function; the input layer inputs attribute values of the running data containing missing values; the output layer outputs the predicted data value to be filled;
and performing optimization training on the weight and the threshold of the BP neural network by adopting a genetic algorithm to obtain a data prediction model.
As a preferred technical scheme, the optimization training is performed by adopting a genetic algorithm to obtain a data prediction model, and the steps are as follows:
determining the network topology of the BP neural network, and dividing a data sample set prepared in advance into a training sample and a test sample;
coding the weight and the threshold of the BP neural network to obtain an initial population;
training a BP neural network by using a training sample, testing the BP neural network by using a testing sample, and calculating an error between an output value and an expected value;
calculating the fitness of chromosomes in the BP neural network, selecting chromosomes with high fitness to copy, and performing crossover and mutation operations to generate a new population;
judging whether the BP neural network reaches the performance index or the maximum iteration number, if so, decoding the chromosome obtained by encoding the weight and the threshold of the BP neural network to obtain the weight and the threshold of the optimal neural network as the initial weight and the threshold of the data prediction model;
otherwise, the initial population is obtained again and the execution is continued.
As a preferred technical solution, the network topology of the BP neural network is represented as: m-l-n, wherein m, l and n respectively represent the number of neuron nodes of an input layer, a hidden layer and an output layer;
the data sample set prepared in advance consists of N power grid data samples, which are randomly divided into training samples
Figure BDA0003813261020000031
and test samples
Figure BDA0003813261020000032
respectively expressed as:
Figure BDA0003813261020000033
Figure BDA0003813261020000034
wherein $(x(t), y(t))$ represents a sample in the sample space, $x(t)$ represents the sample value associated with $y(t)$, $y(t)$ represents the actual value of the sample to be predicted by the BP neural network, t is the time parameter, $\mathbb{R}^m$ and $\mathbb{R}^n$ are the m-dimensional and n-dimensional real spaces, $N_1$ represents the number of training samples, and N represents the number of samples in the data sample set;
the training sample is used for establishing an input-output mapping relation of the BP neural network; the test sample is used for verifying the correctness of the input and output mapping relation;
the input and output mapping relation of the BP neural network is expressed as:
Figure BDA0003813261020000035
wherein
Figure BDA0003813261020000036
represents the output of the BP neural network, l is the number of hidden-layer neuron nodes, $v_{jk}$ represents the weight from hidden-layer neuron node j to output-layer neuron node k, $k = 1, 2, \ldots, n$, $w_{ij}$ is the weight from input-layer neuron node i to hidden-layer neuron node j, $x_i(t)$ represents the input of the BP neural network, $\theta_j$ is the threshold at hidden-layer neuron node j, $r_k$ is the threshold at output-layer neuron node k, and $f[\cdot]$ is the activation function, expressed as:
Figure BDA0003813261020000037
when the total error $E_1$ of the BP neural network is less than or equal to the target threshold $e_1$ for the total network error, we have:
Figure BDA0003813261020000038
wherein $y_k(t)$ represents the actual sample value;
when the mean test error $E_2$ of the BP neural network is less than or equal to the target threshold $e_2$ for the mean error on the test samples, we have:
Figure BDA0003813261020000041
where n represents the number of neuron nodes of the output layer.
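As a rough illustration of the m-l-n mapping and the error check, the sketch below runs one forward pass with Sigmoid activations and computes a total error. The patent's exact formulas are equation images not reproduced in this text, so two points are assumptions: the thresholds $\theta_j$ and $r_k$ are subtracted inside the activation, and the total error is taken as half the sum of squared differences. The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, theta, v, r):
    """x: m inputs; w[i][j]: input i -> hidden j; theta[j]: hidden thresholds;
    v[j][k]: hidden j -> output k; r[k]: output thresholds."""
    # Assumption: thresholds are subtracted before the activation is applied.
    hidden = [sigmoid(sum(w[i][j] * x[i] for i in range(len(x))) - theta[j])
              for j in range(len(theta))]
    return [sigmoid(sum(v[j][k] * hidden[j] for j in range(len(hidden))) - r[k])
            for k in range(len(r))]

def total_error(samples, params):
    """Assumed form: half the summed squared error over samples and outputs."""
    err = 0.0
    for x, y in samples:
        y_hat = forward(x, *params)
        err += sum((yk - yhk) ** 2 for yk, yhk in zip(y, y_hat))
    return 0.5 * err

# A 2-2-1 network with all weights and thresholds set to zero.
params = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [[0.0], [0.0]], [0.0])
print(forward([1.0, 2.0], *params))
print(total_error([([1.0, 2.0], [0.5])], params) <= 1e-3)  # compare with e_1
```

With zero weights every neuron outputs sigmoid(0), which makes the example easy to check by hand before real weights are substituted.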
As a preferred technical scheme, the encoding is performed on the weight and the threshold of the BP neural network to obtain an initial population, and specifically the method comprises the following steps:
forming a chromosome from the network weights of all nodes in the BP neural network using a real-number coding method, arranging the weights into a string in sequence, and initializing a population P(t) whose connection weights between neurons are to be optimized; the hidden-layer transfer function in the BP neural network adopts a sigmoid cross-entropy loss function, which introduces nonlinear mapping capability.
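A real-number encoding of this kind can be sketched as a flat list of floats that is sliced back into weights and thresholds. The gene order used here (input-to-hidden weights, hidden thresholds, hidden-to-output weights, output thresholds) is an assumption, as are the function names; the uniform [-3, 3] initialization follows the embodiment described later in the specification.

```python
import random

def chromosome_length(m, l, n):
    # Assumed gene order: w (m*l), theta (l), v (l*n), r (n).
    return m * l + l + l * n + n

def init_population(size, m, l, n, rng=random):
    """Create `size` chromosomes with genes drawn uniformly from [-3, 3]."""
    length = chromosome_length(m, l, n)
    return [[rng.uniform(-3.0, 3.0) for _ in range(length)] for _ in range(size)]

def decode(chrom, m, l, n):
    """Slice a flat chromosome back into (w, theta, v, r)."""
    pos = 0
    w = [chrom[pos + i * l: pos + (i + 1) * l] for i in range(m)]
    pos += m * l
    theta = chrom[pos: pos + l]
    pos += l
    v = [chrom[pos + j * n: pos + (j + 1) * n] for j in range(l)]
    pos += l * n
    r = chrom[pos: pos + n]
    return w, theta, v, r

population = init_population(size=4, m=2, l=3, n=1)
w, theta, v, r = decode(population[0], 2, 3, 1)
print(len(population[0]), len(w), len(theta), len(v), len(r))
```

Keeping encode and decode symmetric matters more than the particular order: the genetic operators only ever see the flat string.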
As a preferred technical solution, the calculating the fitness of the chromosome in the BP neural network specifically includes:
taking a chromosome i from the initialized population P(t), inputting the network weights of the nodes encoded in chromosome i into the BP neural network in sequence, calculating the total error E of the BP neural network, and defining the fitness $f_i$ of chromosome i as:
Figure BDA0003813261020000042
selecting chromosomes with high fitness for replication, setting the crossover probability $P_c$ and the mutation probability $P_m$, and initializing the selection probability $p_i$ of chromosome i in the population; the selection probability is defined as:
$p_i = \alpha(1-\alpha)$
wherein $\alpha$ is a random number in [0, 1], and $i = 1, 2, \ldots, t$;
performing crossover and mutation operations, wherein the crossover operations specifically comprise:
sorting the chromosomes by fitness in descending order, and calculating the cumulative probability $q_i$ of each chromosome i:
Figure BDA0003813261020000043
using the roulette algorithm, generating a random number $r \in [0, 1]$ in each round; if $q_{i-1} < r \le q_i$, taking chromosome i as a parent chromosome; running the roulette algorithm t times to obtain t parent chromosomes, and forming chromosome pairs from adjacent parent chromosomes; for each chromosome pair, determining the crossover position k according to the crossover probability $P_c$, and interchanging the genes numbered in [1, k] between the two parents of the pair, thereby obtaining two new chromosomes;
the mutation operation specifically comprises:
determining u mutation positions on the two new chromosomes according to the mutation probability $P_m$, and mutating the genes at those positions, i.e. adding to each such gene a random decimal uniformly distributed on [-1, 1], to obtain two new daughter chromosomes;
new individuals are inserted into the population P (t), resulting in a new population P (t + 1).
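The selection, crossover and mutation operators can be sketched as below. The cumulative-probability formula is an equation image in the source, so the running sum of the selection probabilities, normalized so that the last value is 1, is an assumption. Selection probabilities are passed in rather than computed, and the crossover position and mutation positions are parameters, because how they are derived from $P_c$ and $P_m$ is not spelled out in the text. All function names are hypothetical.

```python
import random
from itertools import accumulate

def cumulative(probs):
    """Assumed form: q_i = p_1 + ... + p_i, normalised so the last q is 1."""
    total = sum(probs)
    return [q / total for q in accumulate(probs)]

def roulette_pick(cum, r):
    """Return the 0-based index i with q_{i-1} < r <= q_i (q_0 taken as 0)."""
    for i, q in enumerate(cum):
        if r <= q:
            return i
    return len(cum) - 1

def crossover(a, b, k):
    """Swap the genes numbered 1..k between two parent chromosomes."""
    return b[:k] + a[k:], a[:k] + b[k:]

def mutate(chrom, positions, rng=random):
    """Add a uniform random value from [-1, 1] to each selected gene."""
    child = list(chrom)
    for pos in positions:
        child[pos] += rng.uniform(-1.0, 1.0)
    return child

cum = cumulative([1.0, 1.0, 2.0])
parents = [roulette_pick(cum, random.random()) for _ in range(4)]
child_a, child_b = crossover([1, 2, 3, 4], [5, 6, 7, 8], k=2)
print(cum, parents, child_a, child_b, mutate([0.0, 0.0, 0.0], [1]))
```

Separating the random draws from the operators keeps each operator deterministic for given inputs, which makes the behaviour easy to verify.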
As a preferred technical solution, the selecting and deleting of the highly matched repeated data records by using the improved DBSCAN algorithm specifically includes:
establishing a word-document matrix in an inverted index mode, quickly acquiring a document list containing a certain word according to the word, and dividing the filled running data into a plurality of subsets according to the same type of equipment;
clustering a plurality of subsets by using an improved DBSCAN algorithm to enable repeated data to form a cluster, calculating the similarity of records in the cluster and judging whether the records are the repeated data records, wherein the method specifically comprises the following steps:
determining the parameters of the improved DBSCAN algorithm, including the radius $\varepsilon'$, the minimum number of points Minpts' within the radius, and the initial similarity threshold R;
randomly selecting a data point A from a subset as a core point, and calculating the similar distances between the remaining data points and the data point A using the similar-distance function aproxDist();
clustering the filled operation data in the subset using the set distance thresholds $N_1$ and $N_2$;
calculating the similarity of any two pieces of filled operation data by combining the attribute weights, according to the formula:
Figure BDA0003813261020000051
wherein n is the total number of attributes of the filled running data,
Figure BDA0003813261020000052
$S_{A_i}(x, y)$ represents the attribute similarity between the filled operation data x and the filled operation data y, and is expressed as:
Figure BDA0003813261020000053
wherein d represents the distance that x and y are respectively mapped to a two-dimensional spatial point;
judging, according to the initial similarity threshold R, whether to iteratively update the radius $\varepsilon'$, the iteration formula being:
Figure BDA0003813261020000054
iterating until the output result falls within the acceptable range of the similarity threshold, at which point clustering is complete and a set of duplicate data records is obtained;
and calculating the similarity of each data record in the duplicate data record set, keeping the data record with the maximum similarity and deleting the rest, to obtain the cleaned operation data.
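Two pieces of this procedure lend themselves to a short sketch: partitioning records by equipment type through an inverted index, and an attribute-weighted similarity between two records. The similarity formulas are equation images in the source, so the per-attribute similarity used here, 1 / (1 + d), is purely a hypothetical stand-in, as are the field names, the weights and the sample records.

```python
from collections import defaultdict

def build_inverted_index(records, field):
    """Map each value of `field` to the positions of the records that contain it."""
    index = defaultdict(list)
    for pos, rec in enumerate(records):
        index[rec[field]].append(pos)
    return dict(index)

def weighted_similarity(x, y, weights):
    """Sum of weight * per-attribute similarity over the weighted attributes."""
    total = 0.0
    for attr, w in weights.items():
        d = abs(x[attr] - y[attr])
        # Hypothetical attribute similarity: 1 when equal, decaying with distance.
        total += w * (1.0 / (1.0 + d))
    return total

# Illustrative records only; not taken from the patent.
records = [
    {"device_type": "transformer", "voltage": 110.0, "power": 35.0},
    {"device_type": "breaker", "voltage": 10.5, "power": 2.0},
    {"device_type": "transformer", "voltage": 110.0, "power": 35.0},
]
subsets = build_inverted_index(records, "device_type")
weights = {"voltage": 0.5, "power": 0.5}
for positions in subsets.values():
    # Compare records only within the same equipment-type subset.
    for a in range(len(positions)):
        for b in range(a + 1, len(positions)):
            x, y = records[positions[a]], records[positions[b]]
            print(positions[a], positions[b], weighted_similarity(x, y, weights))
```

Restricting comparisons to one subset at a time is what saves work: records of different equipment types are never compared.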
In another aspect, the invention provides a system for cleaning operation data of power grid equipment, which applies the above method for cleaning operation data of power grid equipment and comprises a data acquisition module, a noise point identification module, a data filling module and a data deleting module;
the data acquisition module is used for acquiring original operation data of the power grid equipment from the database and preprocessing the original operation data;
the noise point identification module is used for identifying noise data with abnormal attributes in the preprocessed operation data and setting them to null values, to obtain operation data containing missing values;
the data filling module is used for constructing and training a data prediction model based on the back propagation feedforward neural network, predicting the data values to be filled for the operation data containing missing values, and filling the data to obtain filled operation data;
the data deleting module is used for clustering the filled operation data using the improved DBSCAN algorithm, and selecting and deleting highly matched duplicate data records to obtain the cleaned operation data.
In another aspect, the present invention provides a computer-readable storage medium, which stores a program, and when the program is executed by a processor, the program implements the method for cleaning the operation data of the power grid device.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method can cluster dense data sets of arbitrary shape without the number of clusters being specified in advance; it finds abnormal points while clustering and is insensitive to abnormal points in the data set; and its clustering results are not biased by initial values, whereas the initial values of the K-Means clustering algorithm strongly affect its clustering results.
2. The method for predicting and filling the missing value constructs a data prediction model based on the genetic neural network, fully utilizes the global search capability of the genetic algorithm and the nonlinear mapping capability of the neural network, greatly improves the prediction precision of the data, and has controllable data prediction precision.
3. Aiming at the problem of low detection accuracy of repeated records of a density clustering algorithm, the improved DBSCAN clustering algorithm improves the accuracy of detecting repeated data records to a certain extent and ensures the effectiveness of data cleaning.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a method for cleaning operation data of power grid equipment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of training the data prediction model according to an embodiment of the present invention;
FIG. 3 is a flow chart of the improved DBSCAN algorithm according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system for cleaning operation data of power grid equipment according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In this embodiment, based on the problem of cleaning BPA database data in the southern power grid NF2022DX, with reference to fig. 1 to 3, the method for cleaning the operating data of the power grid equipment provided by the present invention is described in detail, and includes the following steps:
S1, acquiring original operation data of the power grid equipment from a database and preprocessing the data;
S2, identifying noise data with abnormal attributes in the preprocessed operation data and setting them to null values, to obtain operation data containing missing values;
S3, constructing and training a data prediction model based on a back propagation feedforward neural network, predicting the data values to be filled for the operation data containing missing values, and filling the data to obtain the filled operation data;
S4, clustering the filled operation data using an improved DBSCAN algorithm, selecting and deleting highly matched duplicate data records, and obtaining the cleaned operation data.
More specifically, in step S1, since the original operation data is from different power grid devices, the data needs to be preprocessed, which facilitates effective cleaning of the operation data; in this embodiment, the data attribute is preliminarily selected according to experience to preprocess the data.
More specifically, in step S2, the invention follows the idea of density-based clustering to cluster the preprocessed operation data and identify noise data with abnormal attributes, with the following steps:
S21, randomly selecting a data point p from the preprocessed operation data and calculating its neighborhood $N_\varepsilon(p)$ of radius $\varepsilon$, according to the formula:
$N_\varepsilon(p) = \{q \in D \mid \mathrm{dist}(p, q) \le \varepsilon\}$
wherein $N_\varepsilon(p)$ represents the neighborhood of the data point p, $\mathrm{dist}(p, q)$ represents the distance from another data point q in the operation data to the data point p, and D represents the set of data points;
whenever $\mathrm{dist}(p, q) \le \varepsilon$, adding 1 to the number of data points in the neighborhood of the data point p, and looping until the distances from all data points to the data point p have been calculated;
defining a core object: with the neighborhood $N_\varepsilon$ and Minpts determined, the neighborhood of a core point must contain at least Minpts data points; if $q \in N_\varepsilon(p)$ and $|N_\varepsilon(p)| \ge$ Minpts, marking the data point p as a core point; otherwise marking the data point p as noise data and setting it to a null value as a missing value, which facilitates the subsequent prediction and filling of missing values;
and repeating the above steps until all data points in the preprocessed operation data are marked, so as to obtain the operation data containing missing values.
More specifically, in step S3, the back propagation feedforward neural network, that is, the BP neural network, includes an input layer, a hidden layer and an output layer; each layer includes a plurality of neuron nodes (the number depends on the attribute dimension of the specific data); the neurons are connected by weights, and the transformation function of the neurons is a Sigmoid function; the input layer receives the attribute values of the operation data containing missing values, and the output layer outputs the predicted data value to be filled;
as shown in FIG. 2, a genetic algorithm is used to optimize the weights and thresholds of the BP neural network, yielding the data prediction model and improving its prediction performance, specifically:
S31, first determining the network topology of the BP neural network, and dividing a data sample set prepared in advance into training samples and test samples;
wherein the network topology of the BP neural network is represented as m-l-n, where m, l and n respectively represent the number of neuron nodes of the input layer, the hidden layer and the output layer; the data sample set prepared in advance consists of N power grid data samples, which are randomly divided into training samples
Figure BDA0003813261020000081
and test samples
Figure BDA0003813261020000082
respectively expressed as:
Figure BDA0003813261020000083
Figure BDA0003813261020000084
wherein $(x(t), y(t))$ represents a sample in the sample space, $x(t)$ represents the sample value associated with $y(t)$, $y(t)$ represents the actual value of the sample to be predicted by the BP neural network, t is the time parameter, $\mathbb{R}^m$ and $\mathbb{R}^n$ are the m-dimensional and n-dimensional real spaces, $N_1$ represents the number of training samples, and N represents the number of samples in the data sample set;
in this embodiment, the data sample set prepared in advance comes from the power grid data of the subordinate 100 kV bus AB, and the voltage of the AB line is taken as the prediction target of the BP neural network, so that $x(t)$ represents the active power associated with the voltage, and $y(t)$ is the actual voltage value to be predicted by the BP neural network; after training, when the predicted output value obtained by the BP neural network for an input $x(t)$,
Figure BDA0003813261020000085
is almost equal to the actual voltage value $y(t)$, the prediction performance of the BP neural network reaches the standard; the BP neural network can then be used to fill in missing AB line voltage data, predicting the corresponding AB line voltage from the recorded active power.
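The fill step in this example can be sketched as follows: the model is first checked against test samples, and only then used to replace missing voltages. The predictor `predict_voltage` is a hypothetical stand-in for the trained BP neural network, and the sample values and the tolerance are illustrative, not taken from the patent.

```python
def mean_abs_error(samples, predict):
    """Average |y - predict(x)| over (x, y) pairs with known y."""
    return sum(abs(y - predict(x)) for x, y in samples) / len(samples)

def fill_missing(records, predict):
    """Replace each missing (None) voltage with the model prediction."""
    return [(p, predict(p) if v is None else v) for p, v in records]

# Hypothetical stand-in for the trained network: voltage as a function of power.
def predict_voltage(active_power):
    return 100.0 + 0.5 * active_power

# (active power, voltage) pairs; None marks a voltage nulled as noise.
test_samples = [(10.0, 105.0), (20.0, 110.0)]
records = [(10.0, 105.0), (20.0, None), (30.0, 115.0)]

# Fill only if the model is accurate enough on the test samples.
if mean_abs_error(test_samples, predict_voltage) <= 0.1:  # assumed tolerance
    records = fill_missing(records, predict_voltage)
print(records)
```

Gating the fill on a test-sample error keeps an under-trained model from writing unreliable values into the data set.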
In the invention, a training sample is used for establishing an input-output mapping relation of a BP neural network; the test sample is used for verifying the correctness of the input and output mapping relation; the input and output mapping relation of the BP neural network is expressed as:
Figure BDA0003813261020000086
wherein
Figure BDA0003813261020000087
represents the output of the BP neural network, l is the number of hidden-layer neuron nodes, $v_{jk}$ represents the weight from hidden-layer neuron node j to output-layer neuron node k, $k = 1, 2, \ldots, n$, $w_{ij}$ is the weight from input-layer neuron node i to hidden-layer neuron node j, $x_i(t)$ represents the input of the BP neural network, $\theta_j$ is the threshold at hidden-layer neuron node j, $r_k$ is the threshold at output-layer neuron node k, and $f[\cdot]$ is the activation function, expressed as:
Figure BDA0003813261020000088
when the total error $E_1$ of the BP neural network is less than or equal to the target threshold $e_1$ for the total network error, we have:
Figure BDA0003813261020000091
wherein $y_k(t)$ represents the actual sample value;
when the mean test error $E_2$ of the BP neural network is less than or equal to the target threshold $e_2$ for the mean error on the test samples, we have:
Figure BDA0003813261020000092
where n represents the number of neuron nodes of the output layer.
S32, coding the weight and the threshold of the BP neural network to obtain an initial population;
A real-number coding method is adopted to avoid the weight feeding change: the network weights of all nodes in the BP neural network form a chromosome and are arranged into a string in sequence, and the population P(t), whose connection weights between neurons are to be optimized, is initialized; the hidden-layer transfer function in the BP neural network adopts a sigmoid cross-entropy loss function, which introduces nonlinear mapping capability; in this implementation, the population is initialized with random decimals uniformly distributed on [-3, 3], which mitigates the slow convergence of the algorithm caused by overly small weight adjustments.
S33, training the BP neural network by using the training sample, testing the BP neural network by using the test sample, and calculating the error between the output value and the expected value;
S34, calculating the fitness of the chromosomes in the BP neural network, selecting the chromosomes with high fitness for replication, and performing crossover and mutation operations to generate a new population;
taking a chromosome i from the initialized population P(t), inputting the network weights of the nodes encoded in chromosome i into the BP neural network in sequence, calculating the total error E of the BP neural network, and defining the fitness $f_i$ of chromosome i as:
Figure BDA0003813261020000093
selecting chromosomes with high fitness for replication, setting the crossover probability $P_c$ and the mutation probability $P_m$, and initializing the selection probability $p_i$ of chromosome i in the population, defined as:
$p_i = \alpha(1-\alpha)$
wherein $\alpha$ is a random number in [0, 1], and $i = 1, 2, \ldots, t$;
performing crossover and mutation operations, wherein the crossover operations specifically comprise:
sorting the chromosomes by fitness in descending order, and calculating the cumulative probability $q_i$ of each chromosome i:
Figure BDA0003813261020000094
using the roulette algorithm, generating a random number $r \in [0, 1]$ in each round; if $q_{i-1} < r \le q_i$, taking chromosome i as a parent chromosome; running the roulette algorithm t times to obtain t parent chromosomes, and forming chromosome pairs from adjacent parent chromosomes; for each chromosome pair, determining the crossover position k according to the crossover probability $P_c$, and interchanging the genes numbered in [1, k] between the two parents of the pair, thereby obtaining two new chromosomes;
determining u mutation positions on the two new chromosomes according to the mutation probability $P_m$, and mutating the genes at those positions, i.e. adding to each such gene a random decimal uniformly distributed on [-1, 1], to obtain two new daughter chromosomes;
new individuals are inserted into the population P (t), resulting in a new population P (t + 1).
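The selection, crossover and mutation operations of step S34 can be sketched roughly as below. Several details are assumptions of this sketch rather than statements from the source: the cumulative probability q_i is taken as the running sum of normalised selection probabilities (the source gives its formula only as a figure), each gene mutates independently with probability P_m instead of choosing exactly u positions, and the children simply form the next population. All function names are hypothetical.

```python
import random


def cumulative(probs):
    """Running sum q_i of the selection probabilities, normalised to end at 1."""
    total = sum(probs)
    q, acc = [], 0.0
    for p in probs:
        acc += p / total
        q.append(acc)
    q[-1] = 1.0  # guard against floating-point drift
    return q


def roulette(population, q, rng):
    """Return the first chromosome i whose cumulative probability satisfies r <= q_i."""
    r = rng.random()
    for chrom, qi in zip(population, q):
        if r <= qi:
            return chrom
    return population[-1]


def crossover(a, b, pc, rng):
    """With probability pc, swap the genes in positions [1, k] of the two parents."""
    if len(a) < 2 or rng.random() >= pc:
        return a[:], b[:]
    k = rng.randint(1, len(a) - 1)
    return b[:k] + a[k:], a[:k] + b[k:]


def mutate(chrom, pm, rng):
    """Add a uniform random number from [-1, 1] to each gene chosen for mutation."""
    return [g + rng.uniform(-1.0, 1.0) if rng.random() < pm else g for g in chrom]


def next_generation(population, probs, pc, pm, seed=None):
    """Select t parents, pair adjacent parents, then cross over and mutate them."""
    rng = random.Random(seed)
    q = cumulative(probs)
    parents = [roulette(population, q, rng) for _ in range(len(population))]
    children = []
    for i in range(0, len(parents) - 1, 2):
        c1, c2 = crossover(parents[i], parents[i + 1], pc, rng)
        children += [mutate(c1, pm, rng), mutate(c2, pm, rng)]
    if len(parents) % 2:  # an unpaired last parent is only mutated
        children.append(mutate(parents[-1], pm, rng))
    return children
```

Passing a fixed `seed` makes a run reproducible, which helps when comparing settings of P_c and P_m.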
S35, judging whether the BP neural network reaches the performance index or the maximum iteration number, if so, decoding the chromosome obtained by encoding the weight and the threshold of the BP neural network to obtain the weight and the threshold of the optimal neural network as the initial weight and the threshold of the data prediction model; if not, returning to the step S32 to obtain the initial population again and continuing to execute.
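The termination check of step S35 amounts to a bounded loop. The sketch below is illustrative only: `total_error` and `evolve` are hypothetical callables standing in for the error calculation of step S33 and for whatever procedure produces the next population, and the performance index is assumed to be a target value for the total error.

```python
def optimise(population, total_error, evolve, target_error, max_iter):
    """Iterate until the best error meets the target or max_iter is reached.

    Returns (best chromosome, its error, number of completed iterations).
    """
    best, best_err = None, float("inf")
    for iteration in range(max_iter):
        # Track the best chromosome seen so far.
        for chrom in population:
            err = total_error(chrom)
            if err < best_err:
                best, best_err = chrom, err
        # Performance index reached: stop and hand back the best chromosome.
        if best_err <= target_error:
            return best, best_err, iteration
        population = evolve(population)
    return best, best_err, max_iter
```

The returned chromosome would then be decoded into the initial weights and thresholds of the data prediction model.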
More specifically, in step S4, as shown in fig. 3, the improved DBSCAN algorithm is used to classify the approximately repeated data records into the same class by multiple iterations, and the steps are as follows:
in the first stage, a word-document matrix is established by means of an inverted index, so that the list of documents containing a given word can be obtained quickly from that word; the filled operation data is divided into a plurality of subsets by equipment type, so that the second-stage clustering and duplicate-record screening with the improved DBSCAN algorithm can be carried out on each subset separately, which reduces the computational resources and time complexity of the second stage;
in the second stage, the improved DBSCAN algorithm is used to cluster the plurality of subsets so that repeated data records fall into the same cluster, the similarity of the records in each cluster is calculated, and whether the records are repeated data records is judged, specifically:
S41, determining the parameters of the improved DBSCAN algorithm, including the radius ε', the minimum number of points Minpts' within the radius, and the similarity initial threshold R;
S42, randomly selecting a data point A from a certain subset as a core point, and calculating the similar distances between the remaining data points and the data point A using the similar-distance function aproxDatt();
S43, clustering the filled operation data in the subset by means of the set distance thresholds N_1, N_2:
and calculating the similarity of any two pieces of filled running data by combining the attribute weights, wherein the calculation formula is as follows:
Figure BDA0003813261020000101
wherein n is the total number of attributes of the padded running data,
Figure BDA0003813261020000102
S_{A_i}(x, y) represents the similarity of attribute A_i between the padded running data x and the padded running data y, and is expressed as:
Figure BDA0003813261020000103
wherein d represents the distance that x and y are mapped to two-dimensional spatial points, respectively;
judging whether to update and iterate the radius epsilon' according to the initial threshold value R of the similarity, wherein the iterative formula is as follows:
Figure BDA0003813261020000111
when iteration is carried out until the output result meets the acceptable range of the similarity threshold, clustering is completed to obtain a repeated data record set;
S44, after clustering is completed and the repeated data record set is obtained, calculating the similarity of each data record, keeping the data record with the maximum similarity, and deleting the remaining data records to obtain the cleaned running data.
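The similarity formulas of step S43 appear only as figures in the source, so the following Python sketch rests on explicit assumptions: the record similarity is taken as a weighted sum of per-attribute similarities with weights summing to 1, the per-attribute similarity is taken as 1/(1 + d) with d the absolute difference of two numeric attribute values, and the "maximum similarity" of step S44 is read as the highest total similarity to the other records in the cluster. The function names are hypothetical.

```python
def attribute_similarity(a, b):
    """Assumed per-attribute similarity: 1 for equal values, decaying with distance d."""
    d = abs(a - b)
    return 1.0 / (1.0 + d)


def record_similarity(x, y, weights):
    """Weighted sum of attribute similarities; weights are assumed to sum to 1."""
    return sum(w * attribute_similarity(a, b) for w, a, b in zip(weights, x, y))


def keep_representative(cluster, weights):
    """Keep the record with the highest total similarity to the rest of its cluster."""
    def total(i):
        return sum(
            record_similarity(cluster[i], cluster[j], weights)
            for j in range(len(cluster))
            if j != i
        )

    best = max(range(len(cluster)), key=total)
    return cluster[best]
```

For a cluster of near-identical records this keeps the most central record and discards the others, which is one plausible reading of step S44.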
In order to verify the cleaning performance of the invention on power grid operation data, in this embodiment the missing values are predicted with the mean value method (K-Means) and with the BP neural network provided by the invention, based on the power grid operation data of the subordinate 110 kV bus AB, the subordinate 110 kV bus BC and the subordinate 110 kV bus CA; the data shown in Table 1 below is obtained:
Method               Subordinate 110 kV bus AB   Subordinate 110 kV bus BC   Subordinate 110 kV bus CA
Mean value method    112.560                     112.950                     114.220
BP neural network    112.739                     113.533                     113.148
True value           112.741                     113.531                     113.150
TABLE 1
Therefore, the missing values predicted by the BP neural network provided by the invention are closer to the true values, and its prediction results are more accurate than those of the mean value method.
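The comparison can be made concrete by computing the absolute errors from the values in Table 1. The short script below uses only the table's numbers; the variable and function names are chosen for this illustration. It shows the mean value method missing the true values by 0.181, 0.581 and 1.070, against 0.002 on every bus for the BP neural network.

```python
# Values copied from Table 1, keyed by bus name.
true_value = {"AB": 112.741, "BC": 113.531, "CA": 113.150}
mean_method = {"AB": 112.560, "BC": 112.950, "CA": 114.220}
bp_network = {"AB": 112.739, "BC": 113.533, "CA": 113.148}


def abs_errors(predicted, truth):
    """Absolute prediction error per bus, rounded to the table's three decimals."""
    return {bus: round(abs(predicted[bus] - truth[bus]), 3) for bus in truth}


mean_errors = abs_errors(mean_method, true_value)
bp_errors = abs_errors(bp_network, true_value)

for bus in true_value:
    print(f"bus {bus}: mean method {mean_errors[bus]:.3f}, BP network {bp_errors[bus]:.3f}")
```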
Meanwhile, in order to verify the accuracy of repeated record identification of the improved DBSCAN algorithm, the invention randomly selects 3 groups of running data with different record numbers (500, 5000, 50000) from a BPA database in a southern power grid NF2022DX, after missing values are predicted through a BP neural network, a basic record matching algorithm and the improved DBSCAN algorithm are respectively used for identification, and the identification results are shown in the following table 2:
Number of data records   Basic record matching algorithm   Improved DBSCAN algorithm
500                      90.33%                            90.78%
5000                     80.54%                            86.66%
50000                    69.47%                            79.68%
TABLE 2
As can be seen from Table 2, as the number of data records increases, the accuracy of the improved DBSCAN algorithm is always higher than that of the basic record matching algorithm, and its accuracy declines more slowly than that of the basic record matching algorithm, so the improved DBSCAN algorithm has high accuracy and better performance in repeated data screening.
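The trend can be quantified directly from Table 2. The script below uses only the table's figures, with names chosen for this illustration. The advantage of the improved DBSCAN algorithm grows from 0.45 to 6.12 to 10.21 percentage points, and from 500 to 50000 records the basic algorithm loses 20.86 points while the improved algorithm loses 11.10.

```python
# Accuracy in percent, copied from Table 2.
record_counts = [500, 5000, 50000]
basic = [90.33, 80.54, 69.47]
improved = [90.78, 86.66, 79.68]

# Advantage of the improved algorithm at each data size, in percentage points.
gaps = [round(i - b, 2) for i, b in zip(improved, basic)]

# Total accuracy lost between the smallest and largest data set.
drop_basic = round(basic[0] - basic[-1], 2)
drop_improved = round(improved[0] - improved[-1], 2)

for count, gap in zip(record_counts, gaps):
    print(f"{count} records: improved DBSCAN ahead by {gap:.2f} points")
print(f"accuracy drop: basic {drop_basic:.2f}, improved {drop_improved:.2f}")
```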
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.
Based on the same idea as the method for cleaning the operation data of the power grid equipment in the embodiment, the invention further provides a system for cleaning the operation data of the power grid equipment, and the system can be used for executing the method for cleaning the operation data of the power grid equipment. For convenience of illustration, only the parts related to the embodiments of the present invention are shown in the schematic structural diagram of the system embodiment, and those skilled in the art will understand that the illustrated structure does not constitute a limitation on the device, which may include more or fewer components than illustrated, combine some components, or arrange the components differently.
Referring to fig. 4, in another embodiment of the present application, a system for cleaning operation data of a power grid device is provided, where the system includes a data obtaining module, a noise point identifying module, a data filling module, and a data deleting module;
the data acquisition module is used for acquiring original operation data of the power grid equipment from the database and preprocessing the original operation data;
the noise point identification module is used for identifying noise point data with abnormal attributes in the preprocessed running data and setting null values to obtain running data containing missing values;
the data filling module builds and trains a data prediction model based on a back propagation feedforward neural network, predicts running data containing missing values, obtains data values to be filled, fills the data, and obtains filled running data;
and the data deleting module selects and deletes the highly matched repeated data records by using the improved DBSCAN algorithm according to the filled running data to obtain the cleaned running data.
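The division into four modules can be pictured as a simple chain of stages. The sketch below is only an illustration of that structure: the class name, the record type and the stage signatures are assumptions, and the actual noise identification, prediction and clustering logic is left to the callables supplied by the caller.

```python
from typing import Callable, Dict, List, Optional

# A record maps attribute names to values; None marks a missing value.
Record = Dict[str, Optional[float]]
Stage = Callable[[List[Record]], List[Record]]


class CleaningSystem:
    """Chains the four modules in the order given in the embodiment."""

    def __init__(
        self,
        acquire: Callable[[], List[Record]],
        identify_noise: Stage,
        fill: Stage,
        deduplicate: Stage,
    ) -> None:
        self.acquire = acquire                # data acquisition module
        self.identify_noise = identify_noise  # noise point identification module
        self.fill = fill                      # data filling module
        self.deduplicate = deduplicate        # data deleting module

    def run(self) -> List[Record]:
        data = self.acquire()
        data = self.identify_noise(data)  # noise points become missing values
        data = self.fill(data)            # missing values are predicted and filled
        return self.deduplicate(data)     # highly matched duplicates are removed
```

Keeping each module behind a plain callable reflects the remark in the description that the logical division of program modules is only an example and may be rearranged.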
It should be noted that the system for cleaning operation data of power grid equipment of the present invention corresponds one-to-one to the method for cleaning operation data of power grid equipment of the present invention; the technical features and beneficial effects described in the method embodiments above all apply to the system embodiments. For specific content, refer to the description in the method embodiments of the present invention, which is not repeated here.
In addition, in the implementation of the system for cleaning operation data of power grid equipment in the foregoing embodiment, the logical division into program modules is only an example. In practical applications, the above functions may be allocated to different program modules as needed, for example because of the configuration requirements of the corresponding hardware or for convenience of software implementation; that is, the internal structure of the system for cleaning operation data of power grid equipment may be divided into different program modules to perform all or part of the functions described above.
Referring to fig. 5, an embodiment of the present invention further provides a computer-readable storage medium, in which a program is stored in a memory, and when the program is executed by a processor, the method for cleaning operation data of a power grid device is implemented, where the method includes:
acquiring original operation data of the power grid equipment from a database and preprocessing the original operation data;
identifying noise data with abnormal attributes in the preprocessed running data and setting null values to obtain running data containing missing values;
constructing and training a data prediction model based on a back propagation feedforward neural network, predicting operation data containing a missing value, obtaining a data value to be filled, filling the data, and obtaining filled operation data;
and clustering the filled running data by using an improved DBSCAN algorithm, selecting and deleting the highly matched repeated data records, and obtaining the cleaned running data.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described, but as long as there is no contradiction between the combined technical features, the combinations should be considered to be within the scope of the present specification.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A method for cleaning operation data of power grid equipment is characterized by comprising the following steps:
acquiring original operation data of the power grid equipment from a database and preprocessing the original operation data;
identifying noise data with abnormal attributes in the preprocessed running data and setting null values to obtain running data containing missing values;
constructing and training a data prediction model based on a back propagation feedforward neural network, predicting operation data containing a missing value, obtaining a data value to be filled, filling the data, and obtaining filled operation data;
and aiming at the filled running data, clustering by using an improved DBSCAN algorithm, selecting and deleting the highly matched repeated data records, and obtaining the cleaned running data.
2. The method for cleaning the operation data of the power grid equipment according to claim 1, wherein noise data with abnormal attributes are identified in the preprocessed operation data, and the method comprises the following steps:
randomly selecting a certain data point p from the preprocessed running data, and calculating the neighborhood N_ε of the data point p, the radius of the neighborhood N_ε being ε, with the calculation formula:
N_ε(p) = {q ∈ D | dist(p, q) ≤ ε}
wherein N_ε(p) represents the neighborhood of the data point p, dist(p, q) represents the distance from the other data points q in the running data to the data point p, and D represents the number of data points;
when dist(p, q) ≤ ε, adding 1 to the number of data points in the neighborhood of the data point p, and repeating the calculation until the distance values from all the data points to the data point p have been found;
letting the neighborhood N_ε of a data point p contain at least Minpts data points, if q ∈ N_ε(p) and |N_ε(p)| ≥ Minpts, marking the data point p as a core point, otherwise marking the data point p as noise data and setting the data point p to a null value as a missing value;
and repeating the steps until all data points in the preprocessed running data are marked, so as to obtain the running data containing the missing values.
3. The method for cleaning the operation data of the power grid equipment according to claim 2, wherein the back propagation feedforward neural network (BP neural network) comprises an input layer, a hidden layer and an output layer which are sequentially connected; the input layer, the hidden layer and the output layer all comprise a plurality of neuron nodes, and the neurons are connected through weights; the transformation function of the neuron adopts a Sigmoid function; the input layer inputs attribute values of the running data containing missing values; the output layer outputs the predicted data value to be filled;
and performing optimization training on the weight and the threshold of the BP neural network by adopting a genetic algorithm to obtain a data prediction model.
4. The method for cleaning the operation data of the power grid equipment according to claim 3, wherein the optimization training is performed by adopting a genetic algorithm to obtain a data prediction model, and the method comprises the following steps:
determining the network topology of the BP neural network, and dividing a data sample set prepared in advance into a training sample and a test sample;
coding the weight and the threshold of the BP neural network to obtain an initial population;
training a BP neural network by using a training sample, testing the BP neural network by using a test sample and calculating an error between an output value and an expected value;
calculating the fitness of chromosomes in the BP neural network, selecting chromosomes with high fitness to copy, and performing crossover and mutation operations to generate a new population;
judging whether the BP neural network reaches the performance index or the maximum iteration number, if so, decoding the chromosome obtained by encoding the weight and the threshold of the BP neural network to obtain the weight and the threshold of the optimal neural network as the initial weight and the threshold of the data prediction model;
otherwise, the initial population is obtained again and the execution is continued.
5. The method for cleaning the operation data of the power grid equipment, according to claim 4, wherein the network topology of the BP neural network is represented as: m-l-n, wherein m, l and n respectively represent the number of neuron nodes of an input layer, a hidden layer and an output layer;
the data sample set prepared in advance consists of N power grid data samples which are randomly divided into training samples
Figure FDA0003813261010000021
and test samples
Figure FDA0003813261010000022
respectively expressed as:
Figure FDA0003813261010000023
Figure FDA0003813261010000024
wherein (x(t), y(t)) represents a sample in the sample space, x(t) represents the sample value associated with y(t), y(t) represents the actual value of the sample to be predicted by the BP neural network, t is the time parameter, R^m and R^n are real-number spaces, N_1 represents the number of training samples, and N represents the number of samples in the data sample set;
the training sample is used for establishing an input-output mapping relation of the BP neural network; the test sample is used for verifying the correctness of the input and output mapping relation;
the input and output mapping relation of the BP neural network is expressed as:
Figure FDA0003813261010000025
wherein
Figure FDA0003813261010000026
represents the output of the BP neural network, l is the number of hidden layer neuron nodes, v_jk represents the weight from hidden layer neuron node j to output layer neuron node k, k = 1, 2, …, n, w_ij is the weight from input layer neuron node i to hidden layer neuron node j, x_i(t) represents the input of the BP neural network, θ_j is the threshold at hidden layer neuron node j, r_k is the threshold at output layer neuron node k, and f[·] is the activation function, expressed as:
Figure FDA0003813261010000027
when the total error E_1 of the BP neural network is less than or equal to the network total error target threshold e_1, there is:
Figure FDA0003813261010000031
wherein y_k(t) represents the actual value of the sample;
when the detection mean error E_2 of the BP neural network is less than or equal to the target threshold e_2 of the mean error of the detected samples, there is:
Figure FDA0003813261010000032
where n represents the number of neuron nodes of the output layer.
6. The method for cleaning the operation data of the power grid equipment according to claim 5, wherein the BP neural network weight and the threshold are encoded to obtain an initial population, and specifically the method comprises the following steps:
forming a chromosome from the network weights of all nodes in the BP neural network by adopting a real-number encoding method, arranging the chromosome into a string in sequence, and optimizing the connection weights among neurons to initialize a population P(t); the hidden layer transfer function in the BP neural network adopts a sigmoid cross entropy loss function, which introduces nonlinear mapping capability.
7. The method for cleaning the operation data of the power grid equipment according to claim 6, wherein the calculating the fitness of the chromosome in the BP neural network specifically comprises:
taking out a chromosome i from the initialized population P(t), inputting the network weight of each node into the BP neural network in sequence, calculating the total error E of the BP neural network, and defining the fitness f_i of chromosome i, expressed as:
Figure FDA0003813261010000033
selecting chromosomes with high fitness for replication, setting the crossover probability P_c and the mutation probability P_m, and initializing the selection probability p_i of chromosome i in the population; the selection probability is defined as:
p_i = α*(1-α)
wherein α is a random number in [0, 1], and i = 1, 2, …, t;
performing crossover and mutation operations, wherein the crossover operations specifically comprise:
sorting the chromosomes by fitness from high to low, and calculating the cumulative probability q_i of each chromosome i:
Figure FDA0003813261010000034
using the roulette algorithm to generate a random number r ∈ [0, 1] in each round; if q_{i-1} < r ≤ q_i, taking chromosome i as a parent chromosome; running the roulette algorithm t times to obtain t parent chromosomes, adjacent parent chromosomes forming a chromosome pair; for each chromosome pair, determining the crossover position k according to the crossover probability P_c, and interchanging the genes numbered in [1, k] of the two parents of the chromosome pair, thereby obtaining two new chromosomes;
the mutation operation specifically comprises:
according to the mutation probability P_m, determining u mutation positions on the two new chromosomes, and performing the mutation operation on the genes at those positions, i.e. adding a random decimal number uniformly distributed on [-1, 1] to each such gene, to obtain two new child chromosomes;
new individuals are inserted into the population P (t), resulting in a new population P (t + 1).
8. The method for cleaning the operation data of the power grid equipment according to claim 7, wherein the improved DBSCAN algorithm is used to select and delete the highly matched repeated data records, specifically:
establishing a word-document matrix in an inverted index mode, quickly acquiring a document list containing a certain word according to the word, and dividing the filled running data into a plurality of subsets according to the same type of equipment;
clustering the plurality of subsets by using the improved DBSCAN algorithm so that repeated data records fall into the same cluster, calculating the similarity of the records in each cluster and judging whether the records are repeated data records, specifically comprising the following steps:
determining the parameters of the improved DBSCAN algorithm, wherein the parameters comprise the radius ε', the minimum number of points Minpts' within the radius, and the similarity initial threshold R;
randomly selecting a data point A from a certain subset as a core point, and calculating the similar distances between the remaining data points and the data point A using the similar-distance function aproxDent();
clustering the filled operation data in the subset by means of the set distance thresholds N_1, N_2:
and calculating the similarity of any two pieces of filled running data by combining the attribute weights, wherein the calculation formula is as follows:
Figure FDA0003813261010000041
wherein n is the total number of attributes of the filled running data,
Figure FDA0003813261010000042
S_{A_i}(x, y) represents the similarity of attribute A_i between the padded running data x and the padded running data y, and is expressed as:
Figure FDA0003813261010000043
wherein d represents the distance that x and y are respectively mapped to a two-dimensional spatial point;
judging whether to update and iterate the radius epsilon' according to the initial threshold value R of the similarity, wherein the iterative formula is as follows:
Figure FDA0003813261010000044
when iteration is carried out until the output result meets the acceptable range of the similarity threshold, clustering is completed to obtain a repeated data record set;
and calculating the similarity of each data record according to the repeated data record set, keeping the data record with the maximum similarity, and deleting the rest data to obtain the cleaned running data.
9. A cleaning system of power grid equipment operation data is characterized by being applied to the cleaning method of the power grid equipment operation data in any one of claims 1 to 8, and comprising a data acquisition module, a noise point identification module, a data filling module and a data deleting module;
the data acquisition module is used for acquiring original operation data of the power grid equipment from the database and carrying out pretreatment;
the noisy point identification module is used for identifying noisy point data with abnormal attributes in the preprocessed running data and placing null values in the noisy point data to obtain running data containing missing values;
the data filling module builds and trains based on a back propagation feedforward neural network to obtain a data prediction model, predicts the operation data containing the missing value, obtains a data value to be filled, fills the data and obtains the filled operation data;
and the data deleting module selects and deletes the highly matched repeated data records by using the improved DBSCAN algorithm aiming at the filled running data to obtain the cleaned running data.
10. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements a method for cleaning operation data of a power grid device according to any one of claims 1 to 8.
CN202211023992.0A 2022-08-24 2022-08-24 Method, system and medium for cleaning operation data of power grid equipment Pending CN115423008A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211023992.0A CN115423008A (en) 2022-08-24 2022-08-24 Method, system and medium for cleaning operation data of power grid equipment

Publications (1)

Publication Number Publication Date
CN115423008A true CN115423008A (en) 2022-12-02

Family

ID=84199196

Country Status (1)

Country Link
CN (1) CN115423008A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination