CN109472293A

CN109472293A - A grid equipment file data error correction method based on machine learning

Info

Publication number: CN109472293A
Application number: CN201811187606.5A
Authority: CN
Inventors: 龙婧; 刘伟; 徐文峰
Original assignee: HUBEI CENTRAL CHINA TECHNOLOGY DEVELOPMENT OF ELECTRIC POWER Co Ltd; State Grid Corp of China SGCC
Current assignee: HUBEI CENTRAL CHINA TECHNOLOGY DEVELOPMENT OF ELECTRIC POWER Co Ltd; State Grid Corp of China SGCC
Priority date: 2018-10-12
Filing date: 2018-10-12
Publication date: 2019-03-15

Abstract

The present invention provides a grid equipment file data error correction method based on machine learning. Existing mass data is processed to mine the rules hidden in it and to generate judgment rules automatically, and data is then diagnosed automatically against these rules. This substantially reduces the difficulty of the work and provides an important basis for data quality screening, data rectification, and data governance. The invention applies big data technology to data governance: it automatically diagnoses anomalies in mass data and offers suggestions for rectifying the data, which reduces the strong dependence of data checking on business personnel. Scattered data anomalies that occur completely at random and cannot be distilled into rules can also be processed automatically with machine learning, avoiding the heavy workload of manual screening. The invention also uses big data to classify and analyze data anomalies and supplies the results to the data producers for rectification, so that data problems are reduced at the source and a reference is provided for rectification at the data source.

Description

A grid equipment file data error correction method based on machine learning

Technical field

The present invention relates to the field of grid equipment data correction, and specifically to a grid equipment file data error correction method based on machine learning.

Background art

Power grid production equipment account (ledger) data is the basis on which power grid production work is carried out. At present, all kinds of production equipment account data are stored in the equipment (asset) operation and maintenance lean management system (PMS2.0). The total data volume exceeds 60 GB and covers more than 200 types of equipment, such as buses, overhead transmission lines, switchgear, capacitors, transformers, and cables.

Grassroots team personnel are responsible for updating and maintaining equipment data in a timely manner. Field equipment operation and maintenance, overhaul, inspection, testing, and other production work all depend on equipment data. Only when the equipment data is accurate can the related operation, maintenance, and overhaul records be registered accurately in the PMS2.0 system, which provides an important basis for condition-based maintenance evaluation and whole-life-cycle asset management and is an important embodiment of lean operation and maintenance management. In addition, equipment scale is an important basis for staffing quotas and cost accounting, so the accuracy of equipment data is particularly important.

At present, power grid production equipment account data suffers from problems such as incompleteness and inaccuracy. Abnormal equipment parameters affect not only the management of the equipment files themselves but also directly affect operation and maintenance work. For example:

1. Key parameters in the equipment account are incomplete.

2. Equipment account data does not correspond to the GIS graphic data.

3. Equipment account parameters are filled in incorrectly.

4. Equipment account data in the system differs from the equipment on site.

Problems 3 and 4 cannot be screened by distilling error rules and developing a program. At present they are checked manually, and every 100 records generally require about 3 person-days of work. The work is difficult and the results are poor. These data problems directly affect daily operation and maintenance work: operation, maintenance, and inspection records cannot be registered normally, and work such as marketing-distribution integration and same-period line loss management is also affected. In addition, abnormal equipment account data makes staffing quotas and operation and maintenance cost accounting inaccurate.

Summary of the invention

In view of the above shortcomings of the prior art, the present invention proposes a grid equipment file data error correction method based on machine learning. Existing mass data is processed to mine the rules hidden in it and to generate judgment rules automatically, and data is diagnosed automatically against these rules. This substantially reduces the difficulty of the work and provides an important basis for data quality screening, data rectification, and data governance.

A grid equipment file data error correction method based on machine learning includes the following steps:

Step 1: data extraction, obtaining the training set: all grid equipment file data held by the company is imported into a database as historical data, and the historical data in the database is used as the training set F;

Step 2: feature extraction is performed on the training set F, and a feature set S = {s₁, s₂, s₃, ..., s_n} is obtained by splitting character strings;

Step 3: feature values are selected manually from the feature set S to form the feature vector S', S' = {s'₁, s'₂, s'₃, ..., s'_m}, wherein S' ⊆ S;

Step 4: the feature vector S' is weighted by the TF-IDF algorithm. The frequency with which a feature value s'_m of the feature vector occurs in the training set F is denoted N_m, and the frequency with which data records in the training set contain the feature word s'_m is denoted N'_m; the IDF value of the feature word s'_m, written IDF(s'_m), is then obtained from N'_m.

The weight ω_m of the feature word can therefore be expressed as ω_m = N_m*IDF(s'_m). A weight is calculated in this way for each feature word in the feature vector S', giving the weight vector ω;

Step 5: using the feature vector selected in Step 3 and the feature-vector weights obtained in Step 4, the original data is clustered by the distributed K-Means algorithm, and the training set F is finally divided into k classes;

Step 6: for the clustering result obtained in Step 5, the data in each cluster is verified manually: abnormal data and misjudged data are picked out, the misjudgments in each class are checked, and the data verification accuracy of each class is obtained. The accuracies of all classes are averaged to give the accuracy of the model, and it is then judged whether the model accuracy reaches the expected threshold. If not, the method returns to Step 3 to reselect the feature values and the feature vector and to redetermine the weights, until the accuracy reaches the expected threshold;

Step 7: after the model is determined, in the use stage the data is clustered, the abnormal data of each class is returned to the user, and the normal data is recommended to the user for reference when making modifications.

Further, the method includes Step 8: model correction: during use, the feature values and weights of the model are corrected according to user feedback.

Further, in Step 5 the value of k in the clustering algorithm is determined according to the sample size of the training set and is the value of k at which the sum of the distances within all classes is minimal. Distances are calculated with the Euclidean distance formula; for two records i and j in the training set F, the Euclidean distance is

$$d(i, j) = \sqrt{\sum_{t=1}^{m} (x_{it} - x_{jt})^{2}}$$

where x_{it} and x_{jt} denote the values of the t-th feature of records i and j.

The present invention applies big data technology to data governance: it automatically diagnoses anomalies in mass data and offers suggestions for rectifying the data, which reduces the strong dependence of data checking on business personnel. Scattered data anomalies that occur completely at random and cannot be distilled into rules can also be processed automatically with machine learning, avoiding the heavy workload of manual screening. The invention also uses big data to classify and analyze data anomalies and supplies the results to the data producers for rectification, so that data problems are reduced at the source and a reference is provided for rectification at the data source.

Brief description of the drawings

Fig. 1 is a flow diagram of the grid equipment file data error correction method based on machine learning according to the present invention;

Fig. 2 is a flow diagram of the data preprocessing stage of clustering;

Fig. 3 is a schematic diagram of the parallelized K-means operation flow;

Fig. 4 is the fitted curve for selecting the k value from 20 to 1000 when clustering with the distributed K-Means algorithm; the horizontal axis is the k value and the vertical axis is the loss function value;

Fig. 5 is a schematic diagram of the clustering result of one class when the distributed K-Means algorithm is used.

Detailed description of the embodiments

The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings.

Because power grid production equipment account data has until now been collected and consolidated manually, the data inevitably contains various mistakes and errors, and these greatly affect the effectiveness of field equipment operation and maintenance, overhaul, inspection, and testing. It is therefore necessary to correct the abnormal data to normal data. However, the data volume is so large that screening and correcting it purely by hand is very difficult. The present invention therefore uses a Spark-based distributed K-Means algorithm to process and correct abnormal data automatically, which can greatly reduce the workload.

An embodiment of the present invention provides a grid equipment file data error correction method based on machine learning. The method mainly serves the abnormal data diagnosis business and has two goals: first, to diagnose abnormal data; second, to provide improvement suggestions for the abnormal data. Overall, the invention comprises three major stages: data extraction, model construction, and model correction. The overall flow is shown in Fig. 1, and the method mainly includes the following steps:

Step 5: using the feature vector selected in Step 3 and the feature-vector weights obtained in Step 4, the original data is clustered by the distributed K-Means algorithm. The purpose of clustering is to gather the attribute values of the same kind of equipment into one class so that abnormal data can be picked out conveniently; because the same kind of equipment is recorded by different personnel with different naming conventions, features have to be selected. The value of k in the clustering algorithm is determined according to the sample size of the training set and is the value of k at which the sum of the distances within all classes is minimal. Distances are calculated with the Euclidean distance formula; for two records i and j in the training set F, the Euclidean distance is

$$d(i, j) = \sqrt{\sum_{t=1}^{m} (x_{it} - x_{jt})^{2}}$$

where x_{it} and x_{jt} denote the values of the t-th feature of records i and j. Finally, the training set F is divided into k classes;

Step 6: for the clustering result obtained in Step 5, the data in each cluster is verified manually: abnormal data and misjudged data are picked out, the misjudgments in each class are checked, and the data verification accuracy of each class is obtained. (Once the data has been classified, the values of the same attribute for the same kind of equipment should not differ greatly; if a record's value deviates markedly from the mode and mean of that attribute across the whole class, the record is taken to be abnormal data.) The accuracies of all classes are averaged to give the accuracy of the model, and it is then judged whether the model accuracy reaches the expected threshold (for example 90%). If not, the method returns to Step 3 to reselect the feature values and the feature vector and to redetermine the weights, until the accuracy reaches the expected threshold.

Step 7: after the model is determined, in the use stage the data is clustered, the abnormal data of each class is returned to the user, and the normal data is recommended to the user for reference when making modifications;

Step 8: model correction

During use, the feature values and weights of the model are corrected according to user feedback. Whatever algorithm a machine learning task uses, model evaluation is a stage of the end-to-end machine learning pipeline. The performance of the model algorithm in the production environment is monitored, model accuracy is evaluated objectively against indicators such as diagnosis effect, customer experience, and user feedback, and the model algorithm is then optimized by adjusting the model and its parameters.
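The acceptance check of Step 6 can be sketched as follows. This is a minimal illustration, assuming that manual verification yields, for each cluster, a count of correctly judged records and a count of checked records; the function names and the sample counts are hypothetical, and the 0.90 default only mirrors the example threshold mentioned in Step 6.

```python
def model_accuracy(cluster_checks):
    """Average the per-cluster verification accuracy (unweighted mean).

    cluster_checks -- one (correctly_judged, total_checked) pair per cluster,
                      as counted during manual verification (hypothetical input)
    Assumes at least one cluster has been checked.
    """
    per_cluster = [ok / total for ok, total in cluster_checks if total > 0]
    return sum(per_cluster) / len(per_cluster)


def needs_reselection(cluster_checks, threshold=0.90):
    """True when accuracy is below the threshold, i.e. Step 3 should be repeated."""
    return model_accuracy(cluster_checks) < threshold


# Hypothetical verification counts for three clusters.
checks = [(95, 100), (40, 50), (27, 30)]
print(round(model_accuracy(checks), 4), needs_reselection(checks))
```

The mean is taken over clusters rather than over records, which follows the wording "the accuracies of all classes are averaged"; a record-weighted mean would be a different design choice.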

In Step 5 above, running the K-means algorithm in parallel on a Spark cluster can be divided into two stages: the data preprocessing stage and the K-means clustering stage. The flow of the data preprocessing stage is shown in Fig. 2.

In the clustering stage, the data set already meets the requirements of clustering after the preprocessing stage, so it is only necessary to compute k cluster centers from the preprocessed result with the K-means algorithm; these k points serve as the k cluster centers of the entire data set. The parallelized flow of the clustering stage is shown in Fig. 3.
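Fig. 3 is not reproduced in this text, so the following is only a sketch of one common way to parallelize a K-means iteration, not necessarily the flow in the figure. Each data partition computes per-cluster coordinate sums and counts (a map step), the partial results are merged (a reduce step), and the new centers are the merged sums divided by the merged counts. The code is plain Python rather than Spark, and all function names are hypothetical.

```python
from functools import reduce


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partial_stats(partition, centers):
    """Map step: per-cluster coordinate sums and counts for one partition."""
    stats = {}
    for p in partition:
        idx = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        total, count = stats.get(idx, ((0.0,) * len(p), 0))
        stats[idx] = (tuple(t + x for t, x in zip(total, p)), count + 1)
    return stats


def merge_stats(a, b):
    """Reduce step: add the sums and counts of two partial results."""
    merged = dict(a)
    for idx, (total, count) in b.items():
        if idx in merged:
            old_total, old_count = merged[idx]
            merged[idx] = (
                tuple(s + t for s, t in zip(old_total, total)),
                old_count + count,
            )
        else:
            merged[idx] = (total, count)
    return merged


def kmeans_step(partitions, centers):
    """One K-means iteration over partitioned data; empty clusters keep their center."""
    merged = reduce(merge_stats, (partial_stats(p, centers) for p in partitions), {})
    return [
        tuple(s / merged[i][1] for s in merged[i][0]) if i in merged else c
        for i, c in enumerate(centers)
    ]


# Two hypothetical partitions of 2-D points and two starting centers.
partitions = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(0.0, 2.0), (10.0, 12.0)],
]
new_centers = kmeans_step(partitions, [(1.0, 1.0), (9.0, 9.0)])
print(new_centers)
```

Only the small per-cluster sums and counts travel between partitions, which is what makes this scheme suitable for data that does not fit on one machine.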

The technical solution of the present invention is described in detail below with a specific example:

Problem description:

In actual power grid work, the same equipment is often purchased in large batches and put into use in batches, so it should exist in the system in a certain order of magnitude and cannot appear only once. The specific attribute values corresponding to the same equipment should also be consistent. In the equipment file data, the "model" field (the equipment model designation) serves as the unique identifier of the equipment, so whether the value of a specified attribute of the equipment is correct can be judged according to the model.

Table 1 Experimental sample data table

Table 2 Sample data table

At present, the data set has mainly the following problems:

1. Model entries are not standardized and cannot be identified

Grassroots team personnel are responsible for entering equipment data. Because the personnel of each team fill in data according to their own habits, equipment of the same model appears differently in the "model" field. For example, S11-M-100/10, S11-100/10, S11-100, and S11-100KVA in the table above are in fact the same equipment but are entered differently in the database; a single model may be written in dozens of ways.

2. The same model corresponds to multiple attribute values

Grassroots team personnel often make mistakes when entering data, and data users do not know the condition of the field equipment, which directly prevents many business activities from being carried out. At present, equipment of the same model often has several different rated capacity values in the database.

Specific implementation:

1. Extraction of feature values

Feature selection plays an important role in machine learning. Features are extracted from the "model" character string; for example, the "model" "S11-M-100/10" can be split by character into 5 features: S, 11, M, 100, and 10. Together, these 5 features give rise to a number of possible combinations. In this experiment, through repeated tests and comparison against real data, irrelevant or redundant features were screened out of these features and removed, leaving a feature subset.
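The character split described above can be sketched as follows. This is one possible splitting rule, assumed for illustration: runs of letters and runs of digits become features, separators such as "-" and "/" are dropped, and letters are upper-cased. The function name is hypothetical.

```python
import re


def split_model_features(model):
    """Split a "model" string into runs of letters and runs of digits."""
    # Upper-casing is an assumption so that "kva" and "KVA" give the same feature.
    return re.findall(r"[A-Z]+|[0-9]+", model.upper())


# The four spellings of the same equipment quoted in the problem description.
variants = ["S11-M-100/10", "S11-100/10", "S11-100", "S11-100KVA"]
for v in variants:
    print(v, "->", split_model_features(v))
```

For "S11-M-100/10" this yields the same five features named in the text. The other spellings share the features S, 11, and 100, which is what allows clustering to bring them together.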

2. Weights of feature values

Because the features are extracted from the "model" text and each feature contributes differently to classification, the features are weighted before they are used; the features cannot all be given the same weight. The TF-IDF weighting method is used here. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus.
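A minimal sketch of the weighting ω_m = N_m*IDF(s'_m) is given below. The IDF expression itself is not shown in this text, so the code assumes the standard form log(|F| / N'_m), where |F| is the number of records in the training set (a symbol introduced here for illustration). The function name and the sample records are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(records, selected):
    """Return {feature: N_m * IDF} for each selected feature.

    records  -- list of feature lists, one list per data record
    selected -- the manually chosen features (the vector S')
    """
    chosen = set(selected)
    total = len(records)
    occurrences = Counter()  # N_m: occurrences of the feature across F
    containing = Counter()   # N'_m: number of records containing the feature
    for feats in records:
        occurrences.update(f for f in feats if f in chosen)
        containing.update(set(feats) & chosen)
    weights = {}
    for f in selected:
        if containing[f] == 0:
            weights[f] = 0.0  # feature never observed in F
        else:
            # Assumed IDF form: log(|F| / N'_m); the source does not show it.
            weights[f] = occurrences[f] * math.log(total / containing[f])
    return weights


# Hypothetical records, already split into features.
records = [
    ["S", "11", "M", "100", "10"],
    ["S", "11", "100", "10"],
    ["S", "11", "100"],
    ["S", "9", "M", "200", "10"],
]
weights = tfidf_weights(records, ["S", "M", "100", "200"])
print(weights)
```

Under this assumed form, a feature that appears in every record (here "S") receives a weight of zero, since it does not help to tell records apart.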

3. Selection of the k value

For the selection of the k value, we follow the elbow method commonly used in industry: a function of k is defined, and as k changes it is expected to show an extreme value (a turning point) at the correct k. That is, given a suitable cluster index such as the mean radius or diameter, if the assumed number of clusters is equal to or higher than the true number of clusters, the index rises very slowly; once the assumed number is less than the true number of clusters, the index rises sharply. In this experiment, k ∈ (20, 1000) is set: starting from 20, k is increased by 98 each time, and the relation curve between the loss function and k is shown in Fig. 4.
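The elbow method can be sketched on toy data as follows. This is an illustration in plain Python, not the Spark-based distributed K-Means of the embodiment: the loss is taken to be the within-cluster sum of squared Euclidean distances, the initial centers are chosen by a deterministic farthest-point rule, and the elbow is located at the largest second difference of the loss curve. All three choices, the function names, and the sample points are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sse(points, k, iterations=20):
    """Run a basic K-means and return the within-cluster sum of squared distances."""
    # Deterministic farthest-point initialisation (illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[idx].append(p)
        # New center is the mean of its group; an empty group keeps its center.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def elbow_k(points, candidates):
    """Return the candidate k at the sharpest bend of the loss curve, and the losses."""
    losses = [kmeans_sse(points, k) for k in candidates]
    # Second difference of the loss curve; needs at least three candidates.
    bends = [
        losses[i - 1] - 2 * losses[i] + losses[i + 1]
        for i in range(1, len(losses) - 1)
    ]
    return candidates[1 + bends.index(max(bends))], losses


# Hypothetical toy data: three tight groups of four points each.
points = [
    (0, 0), (0.2, 0), (0, 0.2), (0.2, 0.2),
    (10, 0), (10.2, 0), (10, 0.2), (10.2, 0.2),
    (0, 10), (0.2, 10), (0, 10.2), (0.2, 10.2),
]
best_k, losses = elbow_k(points, [1, 2, 3, 4, 5])
print(best_k, [round(v, 3) for v in losses])
```

For the grid described in the experiment, the candidate list would be `list(range(20, 1001, 98))`, which steps from 20 to 1000 in increments of 98.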

According to the clustering result, the data is processed further in combination with the business. The "model" uniquely identifies a kind of equipment, and equipment also has many other attributes, such as "rated capacity", "voltage class", and "dielectric"; for the same equipment each attribute value is unique. Error correction of "rated capacity" is taken as the example here: the error correction result divides the records into "correct data", "abnormal data", and "wrong data"; the wrong data and abnormal data are analyzed and modification suggestions are given, realizing intelligent diagnosis. The specific analysis process is as follows:

(1) The clustered results are counted. Based on the principle that, within one class of data, correct data occurs most often and wrong data is in the minority, the most frequent combination of "model" and "rated capacity" is taken as the recommended correct data.

(2) When both "model" and "rated capacity" are inconsistent with the recommended correct data, the record is judged to be wrong data.

(3) When one of "model" and "rated capacity" is inconsistent with the recommended correct data, the record is judged to be abnormal data. When "model" and "rated capacity" are both fully consistent with the recommended data, the record is correct data.
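The three rules above can be sketched as follows for a single cluster. The sketch reads rule (2) as "both fields differ" and rule (3) as "exactly one field differs"; the function name and the sample records, including the capacity values, are hypothetical.

```python
from collections import Counter


def diagnose_cluster(records):
    """Label each (model, rated_capacity) record of one cluster.

    The most frequent pair is taken as the recommended correct data.
    """
    recommended = Counter(records).most_common(1)[0][0]
    labels = []
    for model, capacity in records:
        # Count how many of the two fields differ from the recommendation.
        mismatches = (model != recommended[0]) + (capacity != recommended[1])
        labels.append(("correct", "abnormal", "wrong")[mismatches])
    return recommended, labels


# Hypothetical records of one cluster.
cluster = [
    ("S11-M-100/10", 100),
    ("S11-M-100/10", 100),
    ("S11-M-100/10", 100),
    ("S11-M-100/10", 80),   # rated capacity differs
    ("S11-100", 100),       # model spelling differs
    ("S9-50", 50),          # both fields differ
]
recommended, labels = diagnose_cluster(cluster)
print(recommended, labels)
```

The recommended pair doubles as the modification suggestion returned to the user for every record that is not labelled "correct".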

The experimental clustering result is shown in Fig. 5.

The above is merely a specific embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A grid equipment file data error correction method based on machine learning, characterized by comprising the following steps:

Step 1: data extraction, obtaining the training set: all grid equipment file data held by the company is imported into a database as historical data, and the historical data in the database is used as the training set F;

Step 2: feature extraction is performed on the training set F, and a feature set S = {s₁, s₂, s₃, ..., s_n} is obtained by splitting character strings;

Step 3: feature values are selected manually from the feature set S to form the feature vector S', S' = {s'₁, s'₂, s'₃, ..., s'_m}, wherein S' ⊆ S;

Step 4: the feature vector S' is weighted by the TF-IDF algorithm. The frequency with which a feature value s'_m of the feature vector occurs in the training set F is denoted N_m, and the frequency with which data records in the training set contain the feature word s'_m is denoted N'_m; the IDF value of the feature word s'_m, written IDF(s'_m), is then obtained from N'_m.

The weight ω_m of the feature word can therefore be expressed as ω_m = N_m*IDF(s'_m). A weight is calculated in this way for each feature word in the feature vector S', giving the weight vector ω;

Step 5: using the feature vector selected in Step 3 and the feature-vector weights obtained in Step 4, the original data is clustered by the distributed K-Means algorithm, and the training set F is finally divided into k classes;

Step 6: for the clustering result obtained in Step 5, the data in each cluster is verified manually: abnormal data and misjudged data are picked out, the misjudgments in each class are checked, and the data verification accuracy of each class is obtained. The accuracies of all classes are averaged to give the accuracy of the model, and it is then judged whether the model accuracy reaches the expected threshold. If not, the method returns to Step 3 to reselect the feature values and the feature vector and to redetermine the weights, until the accuracy reaches the expected threshold;

Step 7: after the model is determined, in the use stage the data is clustered, the abnormal data of each class is returned to the user, and the normal data is recommended to the user for reference when making modifications.

2. The grid equipment file data error correction method based on machine learning according to claim 1, characterized in that it further comprises Step 8: model correction: during use, the feature values and weights of the model are corrected according to user feedback.

3. The grid equipment file data error correction method based on machine learning according to claim 1, characterized in that: in Step 5 the value of k in the clustering algorithm is determined according to the sample size of the training set and is the value of k at which the sum of the distances within all classes is minimal; distances are calculated with the Euclidean distance formula, and for two records i and j in the training set F the Euclidean distance is

$$d(i, j) = \sqrt{\sum_{t=1}^{m} (x_{it} - x_{jt})^{2}}$$

where x_{it} and x_{jt} denote the values of the t-th feature of records i and j.