CN111858270A

CN111858270A - Interlocking system fault positioning method based on data mining algorithm

Info

Publication number: CN111858270A
Application number: CN202010475231.3A
Authority: CN
Inventors: 黄鲁江; 成燚
Original assignee: Casco Signal Ltd
Current assignee: Casco Signal Ltd
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2020-10-30

Abstract

The invention relates to an interlocking system fault positioning method based on a data mining algorithm, which comprises the following steps: step 1), obtaining a fault log; step 2), extracting characteristic variables and target variables; step 3), data processing; step 4), algorithm selection; step 5), training and evaluating a model; step 6), acquiring the importance of the characteristic variables; and step 7), determining the fault cause. Compared with the prior art, the method has advantages such as greatly reducing the workload of engineers and improving working efficiency.

Description

Interlocking system fault positioning method based on data mining algorithm

Technical Field

The invention relates to a fault positioning method for an interlocking system, in particular to a fault positioning method for an interlocking system based on a data mining algorithm.

Background

Data mining is the process of finding potentially valuable information or knowledge hidden in large, complex, noisy, or even incomplete data. Data mining algorithms have been widely used in commercial fields such as retail, insurance, finance, medical treatment and transportation, and in industrial fields such as aerospace, electric power and machine manufacturing. Data mining technology is also gradually being explored in railway signal systems, but it has not yet been applied to fault diagnosis or fault positioning of the interlocking system. Ensemble learning is an important class of algorithms in data mining technology.

The computer interlocking system is the core control equipment that ensures driving safety in a railway signal system, and its reliable and stable operation is the guarantee of train operation. Complex faults in computer interlocking systems are generally characterized by ambiguity, coupling, and the like. At present, complex faults are still located by manually checking a large number of log records. This approach depends on the experience and knowledge level of the analyst, takes a large amount of time, and has low efficiency.

Fault analysis of the computer interlocking system is a complex piece of systems engineering that involves various troubleshooting means and processing methods, and it is unrealistic to cover all fault analysis with a single algorithm.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide an interlocking system fault positioning method based on a data mining algorithm.

The purpose of the invention can be realized by the following technical scheme:

An interlocking system fault positioning method based on a data mining algorithm comprises the following steps:

step 1), obtaining a fault log;

step 2), extracting characteristic variables and target variables;

step 3), data processing;

step 4), algorithm selection;

step 5), training and evaluating a model;

step 6), acquiring the importance of the characteristic variables;

step 7), determining the fault cause.

Preferably, the step 1) of obtaining the fault log specifically includes:

acquiring the database records of a period of time in which double-computer asynchronous faults occurred multiple times, and dividing all fault logs by minute to obtain multiple groups of fault records.

Preferably, the extracting of the characteristic variables and the target variables in step 2) is specifically:

determining, according to an analysis of the principle of double-computer asynchronous faults, which variables are characteristic variables and which are target variables;

obtaining 18 characteristic variables and one target variable.

Preferably, the target variable is whether an out-of-sync alarm occurs.

Preferably, the 18 characteristic variables include acquisition-class information, driving-class information, station yard representation information and network state information.

Preferably, the data processing in the step 3) is specifically:

(1) data type analysis;

(2) missing value handling;

(3) deletion of features with a variance of 0;

(4) outlier handling;

(5) deletion of all-0 value data;

(6) sample equalization;

(7) normalization;

(8) data set partitioning.

Preferably, the algorithm selection in step 4) is specifically: a decision tree (DT), random forest (RF) or XGBT algorithm is selected.

Preferably, the model training and evaluation in step 5) is specifically:

training the three algorithms on the training data set, and evaluating them on the test data set.

Preferably, the evaluation is specifically:

evaluating the three algorithms using three evaluation indexes: recall, precision and F1 score.

Preferably, the step 6) of obtaining the importance of the characteristic variables specifically includes:

obtaining the characteristic variable that has the largest influence on the target variable from the feature importance scores.

Compared with the prior art, the invention has the following advantages:

1. The data mining algorithm is applied to the fault analysis of the computer interlocking system for the first time, providing a new idea and a feasible method for intelligent fault analysis of the computer interlocking system.

2. Feature selection in data mining is normally a series of operations performed to improve the accuracy of an algorithm. The method instead uses feature selection as a fault positioning and screening strategy, inferring the fault cause backwards from the fault result.

3. For the positioning and troubleshooting of one class of complex faults, the invention provides an intelligent method that saves time and labor and has a certain degree of automation, thereby greatly reducing the workload of engineers and improving working efficiency.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of feature importance scores for a decision tree algorithm;

FIG. 3 is a schematic diagram of feature importance scores for a random forest algorithm;

FIG. 4 is a graph illustrating feature importance scores for the XGBT algorithm.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

The present invention will be described in detail below with reference to FIG. 1 and one actual fault type (the double-computer asynchronous fault).

1 obtaining fault logs

Firstly, the database records of a period of time (one month) in which double-computer asynchronous faults occurred multiple times are obtained, and all fault logs are divided by minute, so that 30 × 24 × 60 = 43200 groups of fault records are obtained.
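The per-minute division can be sketched as follows. This is a minimal illustration only: the record layout (a timestamp paired with a message string) and the sample messages are assumptions, since the actual log format is not given here.

```python
from collections import defaultdict
from datetime import datetime

# One month of 30 days split into one-minute windows.
GROUPS_PER_MONTH = 30 * 24 * 60


def group_by_minute(records):
    """Group (timestamp, message) log records into one-minute buckets."""
    buckets = defaultdict(list)
    for timestamp, message in records:
        # Truncate to the minute so records in the same minute share a key.
        key = timestamp.replace(second=0, microsecond=0)
        buckets[key].append(message)
    return dict(buckets)


# Hypothetical records, invented for illustration.
sample = [
    (datetime(2020, 5, 1, 8, 0, 5), "S_In changed"),
    (datetime(2020, 5, 1, 8, 0, 42), "out-of-sync alarm"),
    (datetime(2020, 5, 1, 8, 1, 3), "network ok"),
]
groups = group_by_minute(sample)
```

Here the two records logged during 08:00 fall into one group and the record logged at 08:01 into another.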

2 extracting characteristic variables and target variables

A large amount of variable information is stored in the log files of a computer interlocking system. The large volume of unclassified data must first be classified according to business understanding, and the characteristic variables to be used are then determined. The information that may affect the asynchronous fault includes acquisition information, driving information, station yard representation information and network state information, of which the first three types can be further divided into turnout information, signal information, track information, system information and the like.

A total of 18 characteristic variables and one target variable are obtained. The target variable is whether an asynchronous alarm occurs, and information such as the representation, control and drive of turnouts, signals and track circuits is used as the characteristic variables. 43200 sets of statistics are then extracted from the 43200 sets of fault log data.
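One possible way to turn a one-minute group into a row of statistics is sketched below. The statistic used (a per-minute event count), the variable names other than S_In, and the alarm marker are all assumptions made for illustration; the text does not say which statistics are extracted.

```python
from collections import Counter

# Hypothetical variable names; only S_In is named in the text.
FEATURES = ["S_In", "S_Out", "NET_STATE"]
# Assumed marker for the asynchronous (out-of-sync) alarm.
ALARM = "OUT_OF_SYNC"


def to_row(events):
    """Convert one minute's event names into (feature_counts, target)."""
    counts = Counter(events)
    # Counter returns 0 for names that did not occur in this minute.
    features = [counts[name] for name in FEATURES]
    target = 1 if counts[ALARM] > 0 else 0
    return features, target


minute_events = ["S_In", "S_In", "NET_STATE", "OUT_OF_SYNC"]
row = to_row(minute_events)
```

Applying such a function to every one-minute group would give one row per group, with the alarm flag as the target variable.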

3 data processing

A large number of log files are converted into statistical data that the algorithm can use, and the data are screened, filtered and grouped. The 43200 groups of data are processed by the following steps:

(1) Data type analysis

(2) Missing value handling

(3) Deletion of variance 0 features

(4) Outlier handling

(5) Deletion of all-0 value data

(6) Sample equalization

(7) Normalization

(8) Data set partitioning
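Several of the steps above can be sketched in plain Python. The concrete choices are assumptions, since the text names the steps but not the methods: random oversampling for sample equalization, min-max scaling for normalization, and a 75/25 shuffled split. The sketch also assumes both classes are present in the data.

```python
import random


def drop_zero_variance(rows):
    """Remove columns whose value is identical in every row (variance 0)."""
    keep = [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]
    return [[r[j] for j in keep] for r in rows], keep


def drop_all_zero(rows, labels):
    """Remove samples whose features are all 0."""
    kept = [(r, y) for r, y in zip(rows, labels) if any(v != 0 for v in r)]
    return [r for r, _ in kept], [y for _, y in kept]


def oversample(rows, labels, seed=0):
    """Duplicate random minority-class samples until the classes match."""
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    minority, label = (pos, 1) if len(pos) < len(neg) else (neg, 0)
    extra = [rng.choice(minority) for _ in range(abs(len(pos) - len(neg)))]
    return rows + extra, labels + [label] * len(extra)


def min_max(rows):
    """Scale every column to [0, 1]; a constant column maps to 0.0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - a) / (b - a) if b > a else 0.0
             for v, a, b in zip(r, lo, hi)] for r in rows]


def split(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle the sample indices and cut them into train and test parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


# Tiny made-up data set: 5 samples, 3 features, one alarm sample.
X = [[0, 5, 1], [0, 0, 0], [0, 3, 1], [0, 9, 4], [0, 1, 2]]
y = [0, 0, 0, 1, 0]
X, kept_columns = drop_zero_variance(X)
X, y = drop_all_zero(X, y)
X, y = oversample(X, y)
X = min_max(X)
X_train, y_train, X_test, y_test = split(X, y)
```

The sketch follows the listed order, with equalization before partitioning. With oversampling this means duplicated samples can land in both the training and the test set, so an implementation may prefer to split first.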

4 Algorithm selection

Since the method needs to obtain feature importance during model training, the algorithm must meet certain requirements: an algorithm capable of generating feature importance must be selected. The decision tree (DT), random forest (RF) and XGBT algorithms are selected in the method.
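To illustrate what "generating feature importance" means for tree-based algorithms, the sketch below scores each feature by the Gini impurity decrease of its best single split. This is a simplified stand-in, not the DT, RF or XGBT implementations used in the method; real tree libraries accumulate such decreases over every split in the fitted trees. The feature names other than S_In and the data are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_gain(values, labels):
    """Largest Gini impurity decrease over all thresholds on one feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Every distinct value except the largest is a candidate threshold.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def importances(rows, labels, names):
    """Single-split importance per feature, normalised to sum to 1."""
    gains = [best_split_gain([r[j] for r in rows], labels)
             for j in range(len(names))]
    total = sum(gains) or 1.0
    return {name: g / total for name, g in zip(names, gains)}


# Made-up data: the first feature separates the classes, the second does not.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = [1, 1, 0, 0]
scores = importances(rows, labels, ["S_In", "NET_STATE"])
```

In this toy data set the first feature receives the whole importance score because it alone separates the two classes.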

5 model training and evaluation

The fault case is a binary classification problem with imbalanced samples. Three evaluation indexes, namely recall, precision and F1 score, are used to evaluate the three algorithms, and the characteristic variable that has the largest influence on the target variable is obtained from the feature importance scores.

The three algorithms were each trained on the training data set and evaluated on the test data set, as shown in Table 1.
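The three evaluation indexes follow the standard definitions: recall = TP / (TP + FN), precision = TP / (TP + FP), and F1 is the harmonic mean of the two. A minimal sketch, with made-up labels in which 1 denotes an asynchronous alarm:

```python
def recall_precision_f1(y_true, y_pred):
    """Return (recall, precision, F1) for binary labels, positive class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators when a class is never seen or predicted.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return recall, precision, f1


# Made-up labels: 3 alarms among 8 samples, one missed, one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
recall, precision, f1 = recall_precision_f1(y_true, y_pred)
```

Plain accuracy would be misleading here because a model that never predicts an alarm would still be right for most samples, which is why these indexes suit an imbalanced problem.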

TABLE 1

The importance scores of the characteristic variables are obtained while the models are trained. The five characteristic variables with the highest feature importance scores for each of the three algorithms are shown in FIGS. 2-4. In all three algorithms, the highest-scoring characteristic variable is S_In (a representation-type variable of the signal), so the variable S_In is determined to be the main cause of the double-computer asynchronous fault.
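The final step, ranking the characteristic variables per algorithm and checking whether the algorithms agree on the top one, can be sketched as below. The score values and every variable name except S_In are invented for illustration and are not the results shown in FIGS. 2-4.

```python
def top_k(scores, k=5):
    """Names of the k features with the highest importance score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def consensus_top(score_tables):
    """Top feature shared by all algorithms, or None if they disagree."""
    tops = {top_k(scores, 1)[0] for scores in score_tables.values()}
    return tops.pop() if len(tops) == 1 else None


# Invented importance scores for three algorithms (illustration only).
score_tables = {
    "DT": {"S_In": 0.41, "F2": 0.20, "F3": 0.15,
           "F4": 0.10, "F5": 0.08, "F6": 0.06},
    "RF": {"S_In": 0.33, "F3": 0.22, "F2": 0.18,
           "F5": 0.12, "F4": 0.09, "F6": 0.06},
    "XGBT": {"S_In": 0.38, "F2": 0.24, "F4": 0.14,
             "F3": 0.11, "F6": 0.07, "F5": 0.06},
}
suspect = consensus_top(score_tables)
dt_top5 = top_k(score_tables["DT"])
```

When all algorithms rank the same variable first, that variable is the natural starting point for troubleshooting; when they disagree, the function returns None and the top-five lists can be compared instead.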

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An interlocking system fault positioning method based on a data mining algorithm is characterized by comprising the following steps:

step 1), obtaining a fault log;

step 2), extracting characteristic variables and target variables;

step 3), data processing;

step 4), algorithm selection;

step 5), training and evaluating a model;

step 6), acquiring the importance of the characteristic variables;

step 7), determining the fault cause.

2. The interlocking system fault location method based on the data mining algorithm according to claim 1, wherein the step 1) of obtaining the fault log specifically comprises:

3. The interlocking system fault location method based on the data mining algorithm according to claim 1, wherein the step 2) of extracting the characteristic variables and the target variables specifically comprises:

18 feature variables and one target variable are obtained.

4. The interlocking system fault location method based on the data mining algorithm as claimed in claim 3, characterized in that the target variable is whether an out-of-sync alarm occurs.

5. The interlocking system fault location method based on the data mining algorithm as claimed in claim 3, wherein the 18 characteristic variables comprise collection class information, driving class information, station yard representation information and network state information.

6. The interlocking system fault location method based on the data mining algorithm as claimed in claim 1, wherein the data processing in the step 3) is specifically:

(1) data type analysis;

(2) missing value handling;

(3) deletion of features with a variance of 0;

(4) outlier handling;

(5) deletion of all-0 value data;

(6) sample equalization;

(7) normalization;

(8) data set partitioning.

7. The interlocking system fault location method based on the data mining algorithm as claimed in claim 1, wherein the algorithm selection of step 4) is specifically: a decision tree (DT), random forest (RF) or XGBT algorithm is selected.

8. The interlocking system fault location method based on the data mining algorithm as claimed in claim 1, wherein the step 5) model training and evaluation:

9. The interlocking system fault location method based on the data mining algorithm as claimed in claim 8, wherein the evaluation is specifically:

10. The interlocking system fault location method based on the data mining algorithm according to claim 8, wherein the step 6) of obtaining the importance of the characteristic variable specifically comprises: