CN111737993B

CN111737993B - Method for extracting equipment health state from fault defect text of power distribution network equipment

Info

Publication number: CN111737993B
Application number: CN202010455039.8A
Authority: CN
Inventors: 成菲; 李海龙; 傅丁莉; 李小飞; 鲁鹏; 庞志飞; 周俊林; 王培波; 胡景博; 黄义皓; 施玉彬
Original assignee: Hangzhou Yuzhi Technology Co ltd; Jinhua Power Supply Co of State Grid Zhejiang Electric Power Co Ltd; Haiyan Power Supply Co of State Grid Zhejiang Electric Power Co Ltd; Zhejiang Huayun Electric Power Engineering Design Consulting Co
Current assignee: Hangzhou Yuzhi Technology Co ltd; Jinhua Power Supply Co of State Grid Zhejiang Electric Power Co Ltd; Haiyan Power Supply Co of State Grid Zhejiang Electric Power Co Ltd; Zhejiang Huayun Electric Power Engineering Design Consulting Co
Priority date: 2020-05-26
Filing date: 2020-05-26
Publication date: 2024-04-02
Anticipated expiration: 2040-05-26
Also published as: CN111737993A

Abstract

The invention discloses a method for extracting the health state of equipment from fault defect text of power distribution network equipment. The method performs word segmentation and vectorization on the fault defect text of power distribution network equipment; constructs a training set and a test set; calculates the similarity between each fault defect text in the test set and each fault defect text in the training set; calculates the optimal number of neighbors k for the k-nearest neighbor algorithm and, for each fault defect text in the test set, selects the k fault defect texts with the highest similarity in the training set; and calculates the health state of each fault defect text in the test set as the weighted average of the health states of the k selected fault defect texts, then performs a weighted summation to obtain the final health state. The invention realizes a self-learning mapping from fault defect text to health state data, changes the existing fault/defect grade evaluation mode, extracts the health state of the equipment from the fault defect text based on the k-nearest neighbor algorithm, and accurately obtains health state data for the whole equipment.

Description

Method for extracting equipment health state from fault defect text of power distribution network equipment

Technical Field

The invention belongs to the field of intelligent power operation and inspection, combines natural language processing with power distribution network equipment data, and particularly relates to a method for extracting equipment health state from fault defect text of power distribution network equipment.

Background

With the acceleration of grid informatization, a large amount of unstructured and semi-structured data has accumulated in the databases of grid enterprises. As one of the most typical and most complex kinds of unstructured data, text data has long been a hot topic in the field of data mining.

In the overhaul and maintenance of the power system, a large number of overhaul test records, inspection and defect elimination records, fault and defect description reports, sequence-of-events records and the like are produced. These logs and reports mainly take the form of short Chinese text (mixed with numbers and alphabetic characters). They are rich in information on equipment operating history, overhaul effectiveness, reliability and the like, and are of great benefit to the objective evaluation of how the health state of the equipment develops.

However, because text is polysemous, hard to segment, ambiguous, and noisy, this information has not yet been fully mined. Chinese text processing for the power grid is still at an early stage. Mining intuitive reliability statistics from Chinese text requires exploring sophisticated information mining techniques and well-designed mining processes. Chinese text mining has long been regarded as an important and difficult technique, and it becomes more difficult still when applied to a professional field, where it must be closely combined with domain knowledge. In the power field, researchers abroad have proposed mining the massive historical defect data of the New York power grid with machine learning methods, providing a basis for power equipment fault prediction and preventive maintenance. However, Chinese text differs greatly from English text: there are no spaces between words, and the parts of speech and syntactic structures are also very different, so directly transplanting English text processing methods is not feasible. Moreover, texts in the power system such as fault texts are typed manually, have complex syntactic structures, and are difficult to divide accurately into subject and object components, which makes their processing complex and difficult.

The equipment defect classification, analysis and statistics work that power grid enterprises must carry out every year is often done manually. The workload is large and consumes time and labor, and because of subjective factors and differences in experience, the correctness of the classification and statistics is difficult to verify. Data analysis and mining based on fault defect text is therefore both important and urgent.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, and discloses a method for mapping fault defect text to an equipment health state evaluation.

The technical scheme adopted for solving the technical problems is as follows:

the method comprises the following steps:

1) Word segmentation is carried out on each fault defect text of the power distribution network equipment, and word segmentation results are obtained;

the fault defect text refers to text data which are input into a computer by different maintenance/repair staff at different times for power distribution network equipment of a power grid system.

2) Vectorizing fault defect text;

3) Constructing a training set and a testing set of fault defect texts of power distribution network equipment;

collecting all fault defect texts of known power distribution network equipment, carrying out vectorization processing according to the method of the step 2), classifying according to health states, forming a fault defect text library, and taking the fault defect text library as a training set; vectorizing all fault defect texts of the power distribution network equipment to be tested according to the method of the step 2) to form a test set;

4) Calculating the similarity between each fault defect text in the test set and each fault defect text in the training set;

5) Calculating the optimal value of the neighbor number k in the k-nearest neighbor algorithm, and then, for each fault defect text in the test set, selecting the k fault defect texts with the highest similarity from the training set;

in the conventional method, the value of the parameter k is preset. In the present method, the optimal k value is calculated from the training set of fault defect texts, so that the prediction accuracy on the training set is highest.

6) Calculating a health state result HI (Health Index) for each fault defect text in the test set, where HI is a decimal between 0 and 1, i.e., 0 ≤ HI ≤ 1, and 0 and 1 represent equipment fault and complete health, respectively; the health state of each fault defect text in the test set is taken as the weighted average sum of the health states corresponding to the k fault defect texts in the training set obtained in step 5); then, a weighted summation is carried out over the health states of the fault defect texts in the test set to obtain the final health state H of the power distribution network equipment.

The step 5) is specifically as follows:

5.1 Dividing the health state HI into 3 classes, with corresponding intervals [0, 0.33], [0.33, 0.67], [0.67, 1]; the initial value of the neighbor number k is 20, and the initial value of the number n of fault defect texts selected from the training set is set;

5.2 For the vectorized results of the n fault defect texts in the training set, n = 0 to J, selecting the k fault defect texts with the highest similarity from the training set according to the similarity calculated in step 4);

5.3 Taking the classes of the health states HI corresponding to the k fault defect texts and averaging them to obtain the predicted class, and then judging:

if the class of the fault defect text in the training set is consistent with the predicted class, the prediction is considered correct; otherwise, it is considered wrong;

5.4 Counting the accuracy over the n fault defect texts;

5.5 If the accuracy of the current iteration has increased significantly, by 20% or more, over that of the previous iteration, the k value is reduced by 1, the process returns to step 5.2), and the steps are repeated;

if the accuracy of the current iteration has not increased significantly, by 20% or more, over that of the previous iteration, the current k value is the optimal value;

5.6 For each fault defect text in the test set, selecting the k fault defect texts with the highest similarity in the training set.
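Steps 5.1 to 5.5 can be sketched in Python as follows. This is only an illustrative reading of the procedure, under several assumptions that the text does not state: the accuracy is computed leave-one-out (a training text is not counted as its own neighbor), the predicted class is the rounded mean of the neighbors' class indices, the "20% or more" increase is taken as a relative gain, and the first iteration (which has no previous accuracy) always proceeds to the next k. The names `loo_accuracy` and `select_k` are hypothetical.

```python
import numpy as np


def loo_accuracy(sim, classes, k):
    """Accuracy on the training set when each text is classified from its k nearest peers.

    sim: (J, J) similarity matrix between training texts.
    classes: length-J integer class indices (0, 1, 2).
    """
    sim = np.asarray(sim, dtype=float)
    classes = np.asarray(classes)
    correct = 0
    for n in range(len(classes)):
        row = sim[n].copy()
        row[n] = -np.inf  # assumption: a text is not its own neighbor
        neighbors = np.argsort(-row, kind="stable")[:k]  # k most similar texts
        # Assumption: "average calculation" = rounded mean of class indices.
        predicted = int(np.floor(np.mean(classes[neighbors]) + 0.5))
        correct += int(predicted == classes[n])
    return correct / len(classes)


def select_k(sim, classes, k_init=20, min_gain=0.20):
    """Decrease k from k_init while the accuracy keeps rising by min_gain or more."""
    k = min(k_init, len(classes) - 1)  # cannot use more neighbors than exist
    previous = None
    while True:
        accuracy = loo_accuracy(sim, classes, k)
        improved = previous is None or (
            accuracy > previous and accuracy >= previous * (1 + min_gain)
        )
        if improved and k > 1:
            previous, k = accuracy, k - 1  # step 5.5: reduce k by 1 and repeat
        else:
            return k  # no sufficient gain: the current k is taken as optimal
```

Note that this stopping rule is greedy: it halts at the first k whose accuracy fails to improve enough, even if a smaller k would later score higher.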

In the step 6), the weighted average sum of the health states corresponding to the k fault defect texts in the training set obtained in the step (5) is used as the health state of each fault defect text in the test set by adopting the following formula:

where $HI_x$ denotes the health state of the x-th fault defect sample among the k fault defect texts selected from the training set, and the quantity given by the formula denotes the health state of the $w_i$-th fault defect sample selected from the test set.
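As an illustration of how the health state of one test text could be derived from its k neighbors, the following Python sketch may help. The exact weights of the formula are not recoverable from this text, so the sketch assumes similarity-proportional weights (falling back to equal weights when all similarities are zero); both the weighting and the name `predict_hi` are assumptions, not the patented formula.

```python
import numpy as np


def predict_hi(similarities, train_hi, k, weighted=True):
    """Estimate HI of one test text from its k most similar training texts.

    similarities: similarity of the test text to every training text.
    train_hi: known HI values (0..1) of the training texts.
    """
    similarities = np.asarray(similarities, dtype=float)
    train_hi = np.asarray(train_hi, dtype=float)
    neighbors = np.argsort(-similarities, kind="stable")[:k]
    s = similarities[neighbors]
    if weighted and s.sum() > 0:
        weights = s / s.sum()  # assumption: weights proportional to similarity
    else:
        weights = np.full(len(neighbors), 1.0 / len(neighbors))  # plain mean
    return float(np.dot(weights, train_hi[neighbors]))
```

For example, with similarities `[0.9, 0.1, 0.6, 0.3]`, training HI values `[1.0, 0.0, 0.5, 0.2]` and `k=2`, the two neighbors have weights 0.6 and 0.4, giving an HI of 0.8; with `weighted=False` the result is the plain mean 0.75.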

Step 1) uses a hidden Markov model based Chinese word segmentation (HMM-CWS) method to preprocess the text by word segmentation.

In step 2), each word obtained by word segmentation corresponds to one dimension of the vector space; the words are sorted by word frequency and repeated words are removed to obtain a sequence of non-repeated words, and this sequence forms the complete vector space.

In step 4), the similarity between each fault defect text in the test set and each fault defect text in the training set is calculated according to a similarity measurement formula.

The invention has the beneficial effects that:

The invention realizes a self-learning mapping from fault defect text to health state data, changes the existing fault/defect grade evaluation mode, extracts the health state of the equipment from the fault defect text based on the k-nearest neighbor algorithm, and provides a basis for power grid enterprises to accurately obtain health state data for the whole equipment.

Drawings

Fig. 1 is a flow chart of the method of the present invention.

Detailed Description

The invention is further described below with reference to the drawings and examples.

The embodiment of the invention and the implementation process are as follows:

The fault conditions of the power distribution network equipment of the power grid system are recorded at different times by different maintenance/repair staff and entered into a computer as text, forming the fault defect text data of the power distribution network equipment.

Specifically, a hidden Markov model based Chinese word segmentation (HMM-CWS) method is used to preprocess the text by word segmentation.
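To illustrate the general idea behind HMM-based Chinese word segmentation, the following Python sketch tags each character as B (begin), M (middle), E (end) or S (single-character word) and decodes the most likely tag sequence with the Viterbi algorithm. The text gives no model parameters, so every probability below is invented for illustration; a real HMM-CWS model would estimate them from a segmented corpus. All names are hypothetical.

```python
import math

STATES = "BMES"  # Begin / Middle / End of a multi-character word, Single-character word

# Toy parameters, invented for illustration (NOT taken from the patent).
START = {"B": 0.6, "S": 0.4}
TRANS = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.3, "E": 0.7},
    "E": {"B": 0.5, "S": 0.5},
    "S": {"B": 0.5, "S": 0.5},
}
EMIT = {
    "B": {"开": 0.5, "损": 0.5},
    "M": {},
    "E": {"关": 0.5, "坏": 0.5},
    "S": {},
}
FLOOR = 1e-6  # emission probability assumed for characters unseen in a state


def _log(p):
    return math.log(p) if p > 0 else -math.inf


def segment(text):
    """Viterbi-decode BMES tags for `text` and cut after every E or S tag."""
    if not text:
        return []
    score = {s: _log(START.get(s, 0.0)) + _log(EMIT[s].get(text[0], FLOOR)) for s in STATES}
    backpointers = []
    for ch in text[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best previous state for reaching state s at this character.
            prev = max(STATES, key=lambda p: score[p] + _log(TRANS[p].get(s, 0.0)))
            new_score[s] = score[prev] + _log(TRANS[prev].get(s, 0.0)) + _log(EMIT[s].get(ch, FLOOR))
            pointer[s] = prev
        backpointers.append(pointer)
        score = new_score
    tag = max("ES", key=lambda s: score[s])  # the last character must close a word
    tags = [tag]
    for pointer in reversed(backpointers):
        tag = pointer[tag]
        tags.append(tag)
    tags.reverse()
    words, current = [], ""
    for ch, t in zip(text, tags):
        current += ch
        if t in "ES":
            words.append(current)
            current = ""
    return words
```

With these toy parameters, `segment("开关损坏")` splits the text into `开关` (switch) and `损坏` (damaged).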

2) Vectorizing fault defect text;

In step 2), each word obtained by word segmentation corresponds to one dimension of the vector space; the words are sorted by word frequency and repeated words are removed to obtain a sequence of non-repeated words, and this sequence forms the complete vector space.

Specifically, the complete vector space $W_{ALL}$ is formed as follows:

$$W_{ALL} = [w_{ij}]_{I \times J}$$

where $w_{ij}$ represents the weight between the i-th fault defect text in the test set and the j-th fault defect text in the training set, with $w_{ij} = 0$ or $1$: $w_{ij} = 1$ when the text contains the word vector, and $w_{ij} = 0$ otherwise; $I$ represents the total number of fault defect texts in the test set, $J$ represents the total number of fault defect texts in the training set, and $[\,]_{I \times J}$ indicates that the matrix has size $I \times J$.
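One possible reading of the vectorization step is a binary bag-of-words encoding over a frequency-sorted vocabulary. The Python sketch below follows that reading; it is an interpretation rather than the patented construction, and the names `build_vocabulary` and `vectorize` are hypothetical. English tokens stand in for segmented Chinese words.

```python
from collections import Counter


def build_vocabulary(segmented_texts):
    """Return the distinct words of all texts, sorted by descending frequency."""
    counts = Counter(word for words in segmented_texts for word in words)
    # most_common() orders by count; equal counts keep first-seen order.
    return [word for word, _ in counts.most_common()]


def vectorize(words, vocabulary):
    """Binary vector: 1 where the text contains the vocabulary word, else 0."""
    present = set(words)
    return [1 if word in present else 0 for word in vocabulary]
```

For instance, for the three segmented texts `["switch", "damaged"]`, `["switch", "oil", "leak"]` and `["oil", "leak", "switch"]`, the vocabulary is `["switch", "oil", "leak", "damaged"]`, and a text containing only `oil` and `leak` is encoded as `[0, 1, 1, 0]`.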

collecting all fault defect texts of known power distribution network equipment, carrying out vectorization processing according to the method of the step 2), classifying according to the health state to form a fault defect text library, and taking the fault defect text library as a training set, wherein each fault defect text corresponds to a specific health state classification; vectorizing all fault defect texts of the power distribution network equipment to be tested according to the method of the step 2) to form a test set; the test set is a vectorized set of fault defect text to be classified.

Specifically, the health states are classified into three categories: complete health, sub-health, and failure.
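The relationship between the health index HI and these three categories can be sketched in Python as below. The thresholds 0.33 and 0.67 come from the class intervals of step 5.1; assigning each boundary value to the higher category is an assumption, since the text does not say which class a boundary value belongs to. The function name `hi_category` is hypothetical.

```python
def hi_category(hi):
    """Map a health index in [0, 1] to one of the three health categories."""
    if not 0.0 <= hi <= 1.0:
        raise ValueError("HI must lie between 0 and 1")
    if hi < 0.33:
        return "failure"  # 0 represents equipment failure
    if hi < 0.67:
        return "sub-health"
    return "complete health"  # 1 represents complete health
```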

specifically, the similarity between each fault defect text in the test set and each fault defect text in the training set is calculated according to a similarity measurement formula.

The similarity formula is as follows:

where $S_{ij}$ is the similarity between the i-th fault defect text in the test set and the j-th fault defect text in the training set, $w_i$ is the feature vector of text i, $w_j$ is the feature vector of text j, $M$ is the dimension of the vectors, and $w_{il}$ and $w_{jl}$ are the l-th dimension values of the vectors $w_i$ and $w_j$, respectively.
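The similarity formula itself is not reproduced in this text. As a stand-in, the Python sketch below uses cosine similarity, a common measure that is consistent with the symbols defined above (a sum over the M dimensions of the products of $w_{il}$ and $w_{jl}$, normalized by the vector lengths). Treat the choice of cosine similarity as an assumption; `similarity_matrix` is a hypothetical name.

```python
import numpy as np


def similarity_matrix(test_vectors, train_vectors):
    """Cosine similarity S[i, j] between test text i and training text j."""
    a = np.asarray(test_vectors, dtype=float)   # shape (I, M)
    b = np.asarray(train_vectors, dtype=float)  # shape (J, M)
    dots = a @ b.T                              # sum over l of w_il * w_jl
    norms = np.linalg.norm(a, axis=1)[:, None] * np.linalg.norm(b, axis=1)[None, :]
    # All-zero vectors have no direction; their similarity is defined as 0 here.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
```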

5.1 The health state HI (Health Index; HI is a decimal between 0 and 1, i.e., 0 ≤ HI ≤ 1, where 0 and 1 represent equipment failure and complete health, respectively) is divided into 3 classes, with corresponding intervals [0, 0.33], [0.33, 0.67], [0.67, 1], and the initial value of the neighbor number k is 20;

5.2 For the vectorized results of the n fault defect texts in the training set, where n is an integer in [1, J], the k fault defect texts with the highest similarity are selected from the training set according to the similarity calculated in step 4);

5.4 Counting the correct rate corresponding to n fault defect texts;

5.5 If the accuracy of the current iteration is increased by 20% or more than that of the previous iteration, the k value is reduced by 1, and the step is returned to the step 5.2), and the step process is repeated;

if the correct rate of the current iteration is not increased by 20% or more than that of the previous iteration, the current k value is an optimal value;

6) Calculating health state results HI of each fault defect text in the test set, and taking the health state of each fault defect text in the test set as a weighted average sum of health states corresponding to k fault defect texts in the training set obtained in the step (5); and then, carrying out weighted summation on the health states of the fault defect texts in the test set to obtain the final health state HI of the power distribution network equipment.
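The final weighted summation over the test-set texts can be sketched in Python as follows. The text does not specify how the per-text weights are chosen, so the sketch accepts arbitrary non-negative weights, normalizes them, and defaults to equal weights; this default and the name `device_health` are assumptions.

```python
def device_health(text_his, weights=None):
    """Combine per-text HI values into one device-level health state."""
    if not text_his:
        raise ValueError("at least one fault defect text is required")
    if weights is None:
        weights = [1.0] * len(text_his)  # assumption: equal weights by default
    if len(weights) != len(text_his):
        raise ValueError("one weight per text is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing keeps the result inside [0, 1] like the per-text HI values.
    return sum(w * h for w, h in zip(weights, text_his)) / total
```

For example, two texts with HI values 0.8 and 0.4 give 0.6 under equal weights, and 0.7 when the first text is weighted three times as heavily as the second.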

Specifically, the weighted average sum of health states corresponding to k fault defect texts in the training set obtained in the step (5) is used as the health state of each fault defect text in the test set by adopting the following formula:

Claims

1. A method for extracting the health state of equipment from fault defect text of power distribution network equipment, characterized in that the method comprises the following steps:

2) Vectorizing fault defect text;

6) Calculating the health state result HI of each fault defect text in the test set, where HI is a decimal between 0 and 1, i.e., 0 ≤ HI ≤ 1, and 0 and 1 represent equipment fault and complete health, respectively; the health state of each fault defect text in the test set is taken as the weighted average sum of the health states corresponding to the k fault defect texts in the training set obtained in step 5); then, a weighted summation is carried out over the health states of the fault defect texts in the test set to obtain the final health state H of the power distribution network equipment;

the step 5) is specifically as follows:

5.4 Counting the correct rate corresponding to n fault defect texts;

5.6 For each fault defect text in the test set, selecting the k fault defect texts with the highest similarity in the training set;

where $HI_x$ denotes the health state of the x-th fault defect sample among the k fault defect texts selected from the training set, and the quantity given by the formula denotes the health state of the $w_i$-th fault defect sample selected from the test set;

step 1) uses a hidden Markov model based Chinese word segmentation method to preprocess the text by word segmentation;

2. The method for extracting the health state of equipment from fault defect text of power distribution network equipment according to claim 1, characterized in that: in step 4), the similarity between each fault defect text in the test set and each fault defect text in the training set is calculated according to a similarity measurement formula.