CN112364928A

CN112364928A - Random forest classification method in transformer substation fault data diagnosis

Info

Publication number: CN112364928A
Application number: CN202011292591.6A
Authority: CN
Inventors: 蒋一波; 冯缘
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2020-11-18
Filing date: 2020-11-18
Publication date: 2021-02-12

Abstract

A random forest classification method in transformer substation fault data diagnosis extracts data from a transformer substation fault diagnosis system and preprocesses the data to obtain an original sample set. The method comprises the following steps: (1) establishing a random forest model; (2) analyzing the feature importance of the original random forest model; (3) processing the original sample set, retaining the result and the selected features, generating a new sample set, and applying the same processing to the test set; (4) repeating step (1) with the new sample set to obtain a final random forest model; (5) testing the random forest model with the test set and evaluating the performance of the model; (6) classifying new data with the random forest classifier, determining the classification result according to the votes of the tree classifiers, and storing the classification result in a database. The invention greatly reduces the amount of real-time data to be processed, speeds up classification and ensures the real-time performance of the decision-making system; the classification performance is good; over-fitting is avoided.

Description

Random forest classification method in transformer substation fault data diagnosis

Technical Field

The invention relates to a random forest classification method in transformer substation fault data diagnosis.

Background

In the prior art, when a power grid fails, monitoring devices promptly generate and upload alarm information such as switch tripping, automatic protection device action, undervoltage, overcurrent and device overload. In particular, when a fault occurs in a power system of large structure and scale, the system can generate a large amount of alarm information in a short time, and this information contains a large amount of uncertain knowledge and data caused by factors such as maloperation or failure to operate of protection devices or circuit breakers, channel transmission interference errors and protection action time deviations. At present, the transformer substation fault data diagnosis techniques proposed at home and abroad mainly include expert systems, artificial neural networks, optimization algorithms, Petri nets, fuzzy set theory and rough set theory. Each of these intelligent techniques has its own advantages when applied to fault diagnosis, but each also exposes problems. For example, an expert system is difficult to maintain and has poor fault tolerance; an artificial neural network cannot explain its own behavior and needs a large number of training samples. Existing transformer substation fault data diagnosis and classification methods cannot ensure accuracy and efficiency at the same time, whereas an actual transformer substation fault diagnosis system places high requirements on both diagnosis speed and accuracy.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a random forest classification method in a substation fault data diagnosis project, which applies the ensemble learning approach on the basis of decision trees, trains on randomly selected samples and randomly selected features to generate a random forest, and classifies data with the random forest.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a random forest classification method in a transformer substation fault diagnosis project extracts data from a transformer substation fault diagnosis system, and preprocesses the data to obtain an original sample set, wherein the method comprises the following steps:

(1) establishing a random forest model, wherein the process is as follows:

(1.1) let T be the original sample set containing n samples; in each round, draw n samples from T by bootstrap sampling (sampling with replacement) to obtain a training set T_i of size n. During sampling, some samples may be drawn repeatedly and others may never be drawn. Perform k rounds of sampling to obtain the training sets T₁,T₂,…,T_k; the data not contained in a training set is called out-of-bag data;

(1.2) establishing a decision tree;

(1.3) repeating the steps (1.1) and (1.2) until all CART trees are trained and all decision trees are combined to construct an original random forest model;

(2) performing importance analysis on the original random forest model, specifying L = ⌊√M⌋, and selecting the L top-ranked features;

(3) processing the original sample set T, reserving the result and the selected characteristics, generating a new sample set Y, and simultaneously carrying out the same processing on the test set;

(4) repeating the step (1) by using the new sample set Y to obtain a final random forest model H;

(5) testing the random forest model H by using a test set, and evaluating the performance of the model;

(6) classifying new data with the random forest classifier, determining the classification result according to the votes of the tree classifiers, and storing the classification result in a database.

Further, the process of (1.2) is:

(1.2.1) let each sample have M features; specify a number m = ⌊log₂M⌋ satisfying m << M; at each internal node, randomly select m of the M features to form a new feature set D_i, and select the optimal attribute from D_i to split the node;

(1.2.2) splitting each node according to (1.2.1) until no further split is possible; each tree is grown to its maximum size using the CART method, without pruning.

Still further, the transformer substation fault diagnosis system is an SCADA or EMS system.

The working principle of the invention is as follows: the invention provides a random forest classification method in substation fault diagnosis. Data are acquired from a power grid company; during construction of each decision tree, features are selected by the Gini index minimization criterion, producing a binary tree. An original random forest model is built from the original sample set, the feature importance of this model is analyzed, the key features are screened out, and the original sample set is processed accordingly. The final random forest model is then built from the new sample set, which greatly reduces the amount of data to be processed. Finally, the random forest classification model obtains the classification result by a voting rule.
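As an illustration of the Gini index criterion mentioned above, the following is a minimal sketch rather than the implementation of the invention. The function names `gini` and `split_gini` are assumptions introduced here for illustration; the specification does not give a formula, so the standard definition of Gini impurity (one minus the sum of squared class proportions) is assumed.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Weighted Gini impurity of a binary split; a lower value is a better split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```

A split that separates the classes perfectly has a weighted impurity of 0, while an evenly mixed two-class node has an impurity of 0.5.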

The invention has the following beneficial effects: 1. The method greatly reduces the amount of real-time data to be processed, speeds up classification and ensures the real-time performance of the decision system. 2. The classification performance is good. 3. Over-fitting is avoided.

Drawings

Fig. 1 is a flowchart of a random forest classification method in a substation fault diagnosis project.

FIG. 2 is a two-level random forest classification system for substation fault data.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to fig. 1 and 2, a random forest classification method in a substation fault diagnosis project includes the following steps:

The first step: extracting original data from systems such as SCADA and EMS.

The second step: preprocessing the original data to obtain an original sample set T, wherein the preprocessing comprises the following steps:

2.1) converting non-numeric data into numeric data;

2.2) if a sample contains missing values, deleting the sample;

2.3) if two or more samples have identical attribute values and identical categories, retaining only one and deleting the remaining duplicates;

2.4) if two or more samples have identical attribute values but different categories, deleting these invalid samples.
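The preprocessing rules 2.1) to 2.4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `preprocess`, the representation of a sample as a `(features, label)` pair, the use of `None` for a missing value, and the integer encoding of non-numeric values (codes assigned per column in order of first appearance) are all assumptions, since the specification does not say how the conversion is done. For simplicity the sketch drops incomplete samples before encoding.

```python
def preprocess(samples):
    """samples: list of (features, label); a feature may be a number,
    a string, or None (missing). Returns the cleaned sample list."""
    # 2.2) drop samples that contain a missing value
    complete = [(f, y) for f, y in samples if all(v is not None for v in f)]

    # 2.1) map each non-numeric value to an integer code, per column
    #      (assumed encoding: codes in order of first appearance)
    codes = {}
    encoded = []
    for f, y in complete:
        row = []
        for j, v in enumerate(f):
            if isinstance(v, (int, float)):
                row.append(v)
            else:
                col = codes.setdefault(j, {})
                row.append(col.setdefault(v, len(col)))
        encoded.append((tuple(row), y))

    # 2.3) keep one copy of fully identical samples (dict keeps insertion order)
    unique = list(dict.fromkeys(encoded))

    # 2.4) drop every sample whose attribute values occur with more than one label
    labels_seen = {}
    for f, y in unique:
        labels_seen.setdefault(f, set()).add(y)
    return [(f, y) for f, y in unique if len(labels_seen[f]) == 1]
```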

The third step: let T be the original sample set containing n samples; in each round, n samples are drawn from T by sampling with replacement to obtain a training set T_i of size n. During sampling, some samples may be drawn repeatedly and others may never be drawn. k rounds of sampling are performed, giving the training sets T₁,T₂,…,T_k; the data not contained in a training set is called the out-of-bag data, which serves as the test set for this random forest model.
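The bootstrap sampling of the third step can be sketched as follows. This is a minimal illustration working on sample indices; the function name `bootstrap_rounds` and the `seed` parameter are assumptions introduced here and do not appear in the specification.

```python
import random

def bootstrap_rounds(n, k, seed=0):
    """Return k (in_bag, out_of_bag) pairs of sample indices for n samples."""
    rng = random.Random(seed)
    rounds = []
    for _ in range(k):
        # n draws with replacement: some indices repeat, some never appear
        in_bag = [rng.randrange(n) for _ in range(n)]
        # indices never drawn in this round form the out-of-bag data
        out_of_bag = sorted(set(range(n)) - set(in_bag))
        rounds.append((in_bag, out_of_bag))
    return rounds
```

For large n, the probability that a given sample is never drawn in a round is (1 - 1/n)^n, which approaches 1/e, so roughly 36.8% of the samples are out-of-bag for each tree.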

The fourth step: building k decision trees from the training sets T₁,T₂,…,T_k.

Each sample has M features. A number m = ⌊log₂M⌋ is specified, satisfying m << M. At each internal node, m of the M features are randomly selected to form a new feature set D_i, and the optimal attribute is selected from D_i to split the node.

Each node is split according to the above steps until no further split is possible. Each tree is grown to its maximum size using the CART algorithm, without pruning.
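The node-splitting rule of the fourth step can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name `best_split`, the threshold search over observed feature values, and the `max(1, ...)` guard for very small M are assumptions introduced here, and the Gini criterion is taken in its standard form.

```python
import math
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, rng):
    """Draw m = floor(log2(M)) random features (the set D_i) and return
    (score, feature, threshold) with the lowest weighted Gini impurity,
    or None if no candidate feature admits a split."""
    M = len(X[0])
    m = max(1, int(math.log2(M)))  # guard so at least one feature is tried
    candidates = rng.sample(range(M), m)
    best = None
    for j in candidates:
        # every observed value except the largest is a valid threshold,
        # so both sides of the split are always non-empty
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best
```

Growing a tree then amounts to applying `best_split` recursively to the left and right subsets until it returns `None` or a node is pure.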

The fifth step: combining the k decision trees, each with the same weight, to construct the original random forest model.

The sixth step: performing importance analysis on the original random forest model, specifying L = ⌊√M⌋, and selecting the L top-ranked features.

The seventh step: processing the original sample set T, retaining the result and the selected features, and generating a new sample set Y; the data not contained in the training sets (the out-of-bag data) are used as the test data.
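The feature screening of the sixth and seventh steps can be sketched as follows. This is a minimal illustration: the function names `top_features` and `project` are assumptions, and the importance scores are taken as given, because the specification does not state how they are computed (mean decrease in Gini impurity and out-of-bag permutation importance are two common choices).

```python
import math

def top_features(importances):
    """Return the indices of the L = floor(sqrt(M)) most important features."""
    M = len(importances)
    L = math.isqrt(M)  # integer floor of the square root of M
    ranked = sorted(range(M), key=lambda j: importances[j], reverse=True)
    return ranked[:L]

def project(samples, keep):
    """Build the reduced sample set: keep the selected feature columns
    and retain the class label of every sample."""
    return [([f[j] for j in keep], y) for f, y in samples]
```

With M = 9 features, for example, only L = 3 columns remain in the new sample set Y, and the same projection is applied to the test data.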

The eighth step: repeating the steps for establishing the random forest model (namely the third step to the fifth step) with the new sample set Y to obtain the final random forest model H.

The ninth step: testing the random forest model H with the test set, wherein the classification result is determined according to the votes of the tree classifiers, and the classification result obtained is compared with the known result of the test set to verify the reliability of the model.

The tenth step: classifying new data with the random forest classifier and storing the classification result in a database.
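The voting rule used for classification in the ninth and tenth steps can be sketched as follows. This is a minimal illustration in which each tree is modeled as a callable returning a class label; the function name `forest_predict` is an assumption, and the tie-breaking behavior (the class that received its first vote earliest wins) is a property of this sketch, not something the specification defines.

```python
from collections import Counter

def forest_predict(trees, x):
    """Each tree casts one equally weighted vote; the most common class wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]
```

Usage with three stand-in trees: `forest_predict([lambda x: "fault", lambda x: "normal", lambda x: "fault"], [0.0])` returns `"fault"`, since two of the three votes agree.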

Referring to fig. 2, the two-layer random forest classification system in the substation fault data identification project implemented by the method mainly includes: a classification module and a user interaction module. The classification module classifies according to the model and calculates the classification accuracy; the user interaction module realizes data visualization display, Web interface configuration and application program configuration.

The embodiments described in this specification are merely illustrative of implementations of the inventive concepts, which are intended for purposes of illustration only. The scope of the present invention should not be construed as being limited to the particular forms set forth in the examples, but rather as being defined by the claims and the equivalents thereof which can occur to those skilled in the art upon consideration of the present inventive concept.

Claims

1. A random forest classification method in transformer substation fault data diagnosis is characterized in that data are extracted from a transformer substation fault diagnosis system and preprocessed to obtain an original sample set, and the method comprises the following steps:

(1) establishing a random forest model, wherein the process is as follows:

(1.1) let T be the original sample set containing n samples; in each round, draw n samples from T by bootstrap sampling to obtain a training set T_i of size n. During sampling, some samples may be drawn repeatedly and others may never be drawn. Perform k rounds of sampling to obtain the training sets T₁,T₂,…,T_k; the data not contained in a training set is called out-of-bag data;

(1.2) establishing a decision tree;

2. A random forest classification method in substation data fault diagnosis according to claim 1, characterized in that the process of (1.2) is:

3. A random forest classification method in substation data fault diagnosis according to claim 1 or 2, characterized in that the substation fault diagnosis system is a SCADA or EMS system.