CN114969267A

CN114969267A - Nuclear power quality defect cause analysis method

Info

Publication number: CN114969267A
Application number: CN202210665278.5A
Authority: CN
Inventors: 邵凯文; 赵芝芸; 王梦灵; 王理; 李垚; 陆潜慧; 陈佳洋; 张羽
Original assignee: East China University of Science and Technology
Current assignee: East China University of Science and Technology
Priority date: 2022-06-13
Filing date: 2022-06-13
Publication date: 2022-08-30

Abstract

The invention provides a nuclear power quality defect cause analysis method, which comprises the following steps: extracting text keywords by using a natural language processing technology to form standardized defect targets and standardized defect reasons; selecting the features with a high degree of correlation from all the features by using a chi-square test feature selection technology; taking the selected features as input, and training with a random forest algorithm to obtain a random forest model comprising a plurality of decision trees; and searching for the optimal chi-square test feature selection number k and the optimal hyper-parameters of the random forest by using an intelligent optimization algorithm, wherein the optimization goal is to make the accuracy of the random forest model as high as possible, thereby obtaining a chi-square test-random forest scheduling model. The method extracts text keywords through natural language processing, which solves the problem of standardizing nuclear power text data; it screens the defect reasons through chi-square test feature selection before the random forest model is built; and it optimizes the model parameters with a genetic algorithm so as to improve the accuracy.

Description

Nuclear power quality defect cause analysis method

Technical Field

The invention relates to data-driven defect cause analysis, and in particular to a tree-diagram method for analyzing the causes of nuclear power quality defects based on a genetic algorithm, chi-square test and random forest.

Background

In the nuclear power production process, both equipment and operating procedures affect the normal operation of a nuclear power plant. Improper handling can cause defects in unit equipment, such as pump water leakage, orifice plate fracture and dropped loads during lifting, and these defects bring potential safety hazards to nuclear power production.

In the prior art, disordered and redundant nuclear power text data is mostly analyzed manually: forms such as the NCR report are used to analyze nuclear power quality defects and their causes and to feed back experience. However, this approach is inefficient, and its results are not highly accurate.

Disclosure of Invention

The invention aims to provide a nuclear power quality defect cause analysis method to solve the problem that the existing nuclear power defect cause analysis method is low in efficiency and low in precision.

In order to achieve the purpose, the invention provides a nuclear power quality defect cause analysis method, which comprises the following steps:

S1: extracting text keywords in nuclear power defect event data by using a natural language processing technology to form a standardized defect target and a standardized defect reason;

S2: taking each standardized defect target as a target and each standardized defect reason as a feature, calculating chi-square values between the features and the targets by using a chi-square test feature selection technology to measure the degree of correlation between them, and selecting the first k features with the highest degree of correlation from all the features, wherein k is the chi-square test feature selection number;

S3: taking all the selected features as input, and training by using a random forest algorithm to obtain a random forest model comprising a plurality of decision trees;

S4: searching for the optimal chi-square test feature selection number k and the optimal hyper-parameters of the random forest by using an intelligent optimization algorithm, wherein the optimization goal is to make the accuracy of the random forest model obtained in steps S2 and S3 as high as possible, so that a chi-square test-random forest scheduling model is obtained.

Preferably, the step S1 includes:

S11: acquiring the correspondence between the defect event description and the defect reason description from the nuclear power defect event data;

S12: performing a natural language processing operation on the defect event description and the defect reason description to obtain a standardized defect target and a standardized defect reason which correspond to each other;

wherein S12 includes:

S121: acquiring a nuclear power knowledge base of the provider of a historical defect text base, and constructing a nuclear power text dictionary by using the nuclear power knowledge base;

S122: inputting all character feature vectors in the nuclear power text dictionary into the algorithm framework of a natural language processing model, and training the natural language processing model;

S123: inputting the defect event description and the defect reason description into the trained natural language processing model to obtain the corresponding standardized defect target and standardized defect reason.

Preferably, the algorithm framework of the natural language processing model adopts a Bert-LSTM-CRF model framework.

Preferably, the step S2 includes:

S21: performing data cleaning on the standardized defect target and the standardized defect reason, wherein the standardized defect target and the standardized defect reason after data cleaning are used as the target and the feature, respectively;

S22: performing chi-square test feature engineering on the target and the features to obtain the chi-square value of each feature, and selecting the k features with the largest chi-square values.

Preferably, the step S3 includes:

S31: drawing samples at random with replacement by the bootstrap sampling method, so as to form a plurality of data subsets;

S32: training a decision tree on each data subset, so as to form a random forest with a plurality of decision trees;

S33: determining the standardized defect target to which a sample belongs by using the decision trees and a majority voting strategy, and determining the accuracy of the random forest model.

Preferably, the intelligent optimization algorithm adopts a genetic algorithm in a heuristic algorithm, and the hyper-parameters of the random forest comprise the number n of decision trees and the maximum depth d of each decision tree.

Preferably, the step S4 includes:

S41: taking each group consisting of the hyper-parameters of the random forest and the chi-square test feature selection number k as a chromosome, setting a coding rule for converting the chromosome into the genetic space, and then creating initial population data through random initialization as the population of the current round;

S42: calculating the fitness function value of each individual in the population of the current round by repeating steps S2 and S3, wherein the fitness function takes the accuracy of the random forest model as its objective function;

S43: performing a selection operation according to the calculated fitness function values to obtain selected individuals; then performing a crossover operation and a mutation operation on all the selected individuals to obtain mutated individuals;

S44: calculating the fitness function values of all the mutated individuals;

S45: combining all the selected individuals and all the mutated individuals into a combined population, and probabilistically selecting individuals from the combined population according to their fitness function values to obtain the next round of population as the new population of the current round;

S46: determining whether the current round has reached the maximum number of iterations; if so, executing step S47; otherwise, returning to step S42;

S47: selecting the individual with the largest fitness function value in the population of the current round as the optimal chi-square test feature selection number k and the optimal hyper-parameters of the random forest, and correspondingly obtaining the chi-square test-random forest scheduling model by repeating steps S2 and S3.

Preferably, the nuclear power quality defect cause analysis method further includes step S5: obtaining the visual propagation path diagram corresponding to the decision tree with the highest accuracy in the chi-square test-random forest scheduling model.

Preferably, the nuclear power quality defect cause analysis method further includes step S5': evaluating the importance of a plurality of standardized defect reasons in the chi-square test-random forest scheduling model by using a GBDT algorithm, and ranking the standardized defect reasons.

Preferably, the step S5' includes:

S51': recording the total number of splits, the total information gain and the average information gain of the features by using a GBDT algorithm while training the decision trees that form the chi-square test-random forest scheduling model;

S52': calculating the importance of the jth standardized defect reason in a single decision tree by the formula

$$\hat{J}_j^2(T) = \sum_{t=1}^{L-1} \hat{i}_t^2 \, \mathbb{1}(v_t = j)$$

wherein j is the ordinal number of the standardized defect reason, L is the number of leaf nodes of the decision tree (so that L - 1 is the number of non-leaf nodes), t is the ordinal number of a non-leaf node, $v_t$ is the standardized defect reason associated with non-leaf node t, $\hat{i}_t^2$ is the reduction in squared loss after non-leaf node t splits, $\mathbb{1}(v_t = j)$ is an indicator function that equals 1 when the feature associated with non-leaf node t is the standardized defect reason j and 0 otherwise, and T represents a decision tree;

S53': calculating the global importance of the jth standardized defect reason over all decision trees by the formula

$$\hat{J}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_j^2(T_m)$$

wherein $\hat{J}_j^2(T_m)$ is the importance of the jth standardized defect reason in the mth decision tree, M is the total number of decision trees, and m is the ordinal number of a decision tree;

S54': ranking the standardized defect reasons by importance and forming an importance histogram of the standardized defect reasons.
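
As an illustration of steps S52' to S54', the per-tree importance sums the squared-loss reduction of every non-leaf node that splits on a given cause, and the global importance averages that sum over all trees. The following is a minimal sketch under stated assumptions: each tree is represented only by a list of `(cause_index, loss_reduction)` pairs for its non-leaf nodes, and the function names and the numbers are hypothetical, not taken from the patent.

```python
def tree_importance(non_leaf_nodes, j):
    """Importance of cause j in one tree: sum of loss reductions at nodes that split on j."""
    return sum(gain for cause, gain in non_leaf_nodes if cause == j)


def global_importance(trees, j):
    """Importance of cause j averaged over all M trees."""
    return sum(tree_importance(nodes, j) for nodes in trees) / len(trees)


# Hypothetical forest of M = 2 trees; each pair is (cause index, squared-loss reduction).
trees = [
    [(0, 4.0), (1, 1.0), (0, 2.0)],
    [(1, 3.0), (2, 0.5)],
]
scores = {j: global_importance(trees, j) for j in range(3)}
# Causes ranked from most to least important, ready to be drawn as a histogram.
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these made-up numbers, cause 0 ranks first because it accounts for the largest total loss reduction even though it appears in only one tree.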

According to the nuclear power quality defect cause analysis method, on the one hand, text keywords are extracted by natural language processing technologies such as BERT and LSTM to form standardized defects and defect causes, which solves the problem of standardizing disordered and redundant nuclear power text data. On the other hand, before the random forest model is built, redundant defect reasons are screened out by chi-square test feature selection, and the model parameters are optimized with a genetic algorithm, so that the accuracy of the random forest model of the defect causes is improved.

In addition, the invention can generate an accurate propagation path diagram of the defect cause and a defect cause importance degree histogram through a random forest model, and can carry out all-round, rapid and accurate analysis on the defect cause.

Drawings

FIG. 1 is a flow chart of a nuclear power quality defect cause analysis method of the present invention.

Fig. 2A and 2B are graphs of convergence of accuracy after 30 generations of optimization by the genetic algorithm, wherein fig. 2A is a scatter diagram and fig. 2B is a line diagram.

FIG. 3 is a graph of visualized propagation paths of a random forest model.

FIG. 4 is a histogram of the importance of normalized defect causes.

Detailed Description

The method of the present invention is described in detail below with reference to the accompanying drawings.

The invention provides a nuclear power quality defect cause analysis method based on a genetic algorithm, chi-square test and random forest. It is used to solve the problem that the existing analysis of nuclear power defect causes through NCR reports is inefficient and imprecise, to standardize nuclear power text data, and to form standardized defects and defect causes. The following description uses an example of specific defects of a nuclear power plant to illustrate the implementation of the invention and its advantages over the conventional defect cause analysis method. As shown in fig. 1, the nuclear power quality defect cause analysis method specifically includes the following steps:

Step S0: collecting nuclear power defect event data from the stages of the nuclear power full life cycle (EPCS);

wherein the nuclear power defect event data is mainly collected from existing NCR reports. The NCR (non-conformance report) is the written form in which engineers record nuclear power non-conformances. The nuclear power defect event data refers to the events among the non-conforming items of the nuclear power full life cycle recorded in the NCR reports.

In this embodiment, the application object is the nuclear power quality defect data provided by China General Nuclear Power Group (CGN), which comprises defect event data reports from the stages of the nuclear power full life cycle (EPCS), in particular the nuclear power defect event data in the NCR reports (i.e., non-conformance reports). A complete NCR report contains detailed descriptive information such as unit information, primary stage information, secondary stage information, event information and reason information, of which the event information and the reason information are the data required by the invention.

In addition, the nuclear power historical defect text base is broader than the nuclear power defect event data: it contains the NCR reports and therefore the nuclear power defect event data, so the nuclear power defect event data can be obtained directly from the nuclear power historical defect text base.

The related text library consists of texts that do not concern nuclear power defects but concern other fields. For example, cross-enterprise supplier responsibility chain information cannot be found in the nuclear power defect text library. The related text library is parallel to the nuclear power historical defect text base.

In other embodiments, step S0 may be omitted as long as the nuclear power defect event data is available in step S1.

Step S1: extracting text keywords in nuclear power defect event data by using a natural language processing technology to form a standardized defect target and a standardized defect reason. The input is a text report, and each text report describes a defect together with its corresponding defect causes, so this step completes the conversion of the defect causes into features (i.e., the standardized defect causes).

In step S1, the natural language processing technique includes at least one of BERT, LSTM and similar techniques.

Step S1 includes:

step S11: acquiring a corresponding relation between defect event description and defect reason description from nuclear power defect event data;

wherein the defect event description and the defect reason description are used as one of input parameters of a natural language processing model.

In the present embodiment, the defect event description and the defect cause description are event information and cause information reported by the NCR, respectively.

Step S12: and carrying out natural language processing operation on the defect event description and the defect reason description to obtain a standardized defect target and a standardized defect reason which correspond to each other.

The standardized defect targets and standardized defect causes obtained here are the key defect targets and key defect causes. Since the defect event description and the defect reason description come from the historical defect text base, and each text report describes a defect together with its corresponding defect reasons, the correspondence between each standardized defect target and its standardized defect reasons is obtained as well.

The step S12 specifically includes:

Step S121: acquiring a nuclear power knowledge base of the provider of the historical defect text base, and constructing a nuclear power text dictionary by using the nuclear power knowledge base; the nuclear power text dictionary serves as the other input parameter of the natural language processing model;

In this embodiment, the provider of the NCR reports is China General Nuclear Power Group (CGN). Constructing the nuclear power text dictionary from the nuclear power knowledge base includes: traversing the nuclear power knowledge base word by word and assigning a serial number to each word, thereby obtaining the character feature vector of each word and hence the nuclear power text dictionary. The nuclear power text dictionary is usually in JSON format. That is, the character feature vector of each word is obtained by exhaustively traversing the nuclear power text and attaching a serial number to each word, in a form similar to {nucleus: 1}, {electricity: 2} and so on.
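
The dictionary construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `build_char_dictionary` is hypothetical, serial numbers are assigned in first-seen order starting from 1, and the two sample strings stand in for a real knowledge base.

```python
import json


def build_char_dictionary(knowledge_base_texts):
    """Assign a 1-based serial number to each distinct character, in first-seen order."""
    dictionary = {}
    for text in knowledge_base_texts:
        for ch in text:
            if ch.isspace():
                continue  # whitespace carries no dictionary entry
            if ch not in dictionary:
                dictionary[ch] = len(dictionary) + 1
    return dictionary


# Hypothetical two-entry knowledge base, for illustration only.
knowledge_base = ["核电", "核岛泵"]
char_dict = build_char_dictionary(knowledge_base)

# The dictionary is serialized to JSON, the format the text says is usually used.
dictionary_json = json.dumps(char_dict, ensure_ascii=False)
```

Repeated characters keep the serial number they received the first time, so the mapping stays one-to-one and can later be inverted to decode feature vectors back into text.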

Step S122: and inputting all character feature vectors in the nuclear power text dictionary into an algorithm framework of a natural language processing model, and training the natural language processing model.

In this embodiment, the algorithm framework of the natural language processing model adopts a Bert-LSTM-CRF model framework. The Bert-LSTM-CRF model framework is a representative algorithm framework for natural language processing models.

As described above, the character feature vector of each word in the nuclear power text dictionary has a form similar to {nucleus: 1}, {electricity: 2} and so on, and the nuclear power text dictionary is converted into feature vectors through the characteristics of bert and MasterL during training. The character feature vectors of the nuclear power text dictionary convert the semantic information of the text into feature information that a computer can recognize, and are one of the input parameters.

Step S123: and inputting the defect event description and the defect reason description into the trained natural language processing model to obtain a corresponding standardized defect target and a standardized defect reason.

That is, the input data of the natural language processing model are the defect event description and the defect reason description (i.e., the event information and the reason information) together with the nuclear power text dictionary, and the output data are the standardized defect target and the standardized defect reason (i.e., the feature vectors of the defect event description and the defect reason description). Such standardized data is well suited to being fed into the algorithm as features. Decoding the feature vectors with the nuclear power text dictionary yields the standardized textual form of the defect target and the defect reason.

In each correspondence between a standardized defect target and standardized defect causes, there is one standardized defect target and a plurality of standardized defect causes.

Step S2: taking each standardized defect target as a target, taking each standardized defect reason as a feature, calculating chi-square values of the features and the target by using a chi-square test feature extraction technology to measure the correlation degree of the features and the target, and selecting a part of (namely the first k) features with high correlation degree from all the features; k is a chi-square test feature selection number, and k is a positive integer;

the step S2 includes:

step S21: data cleaning is carried out on the standardized defect target and the standardized defect reason, and the standardized defect target and the standardized defect reason subjected to data cleaning are respectively used as a target and a characteristic;

Data cleaning includes, but is not limited to, missing value processing and outlier processing, which here mean directly deleting the missing values and the abnormal values.
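
A minimal sketch of this deletion-based cleaning is given below, assuming each record is a `(target, causes)` pair. The text does not define what counts as an abnormal value, so the `is_abnormal` predicate is a hypothetical placeholder supplied by the caller.

```python
def clean_records(records, is_abnormal=lambda record: False):
    """Drop records with a missing target, missing causes, or flagged as abnormal."""
    cleaned = []
    for target, causes in records:
        # Missing value processing: delete the record outright.
        if not target or not causes or any(not cause for cause in causes):
            continue
        # Outlier processing: delete records the (hypothetical) predicate flags.
        if is_abnormal((target, causes)):
            continue
        cleaned.append((target, causes))
    return cleaned


# Hypothetical records, for illustration only.
records = [
    ("pump leakage", ["seal wear"]),
    ("", ["weld flaw"]),                   # missing target
    ("plate fracture", []),                # no causes recorded
    ("plate fracture", ["weld flaw", None]),  # one cause missing
]
cleaned = clean_records(records)
```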

Step S22: performing characteristic engineering processing of chi-square inspection on the target and the characteristics to obtain a chi-square value of each characteristic; the normalized defect cause is subjected to feature selection, i.e., the k features with the largest chi-squared value (i.e., the first k important features) are selected.

The selected features are used as input for a random forest model in the following.

In step S22, the specific steps of the feature engineering process for chi-square verification of the target and the feature are as follows:

Step S221: the chi-square analysis first assumes that the two variables, the target and the feature, are mutually independent, and, under this assumption, obtains the actual frequency and the theoretical frequency of each cell in the contingency table of the two variables;

the actual frequency is the number of times the defect occurs together with the defect cause, and the theoretical frequency is the frequency expected under the independence assumption, calculated by multiplying the frequency of the defect cause by the probability that the defect occurs.

Step S222: then calculating the actual frequencies of the two variables, the target and the feature, and comparing the difference between the actual frequency and the theoretical frequency in each cell of the contingency table by calculating the chi-square value between each feature and all the targets, wherein the larger the chi-square value, the larger the difference, and the smaller the chi-square value, the smaller the difference;

Step S223: the difference (i.e., the chi-square value of each feature) is expressed as the chi-square test result; namely, if the chi-square test result is not significant, the original assumption holds, and the target variable and the feature variable are independent of each other; if the chi-square test result is significant, the original assumption is rejected, and the target variable and the feature variable are related.

That is, the chi-square value $\chi^2$ of each feature is calculated by the following formula (1):

$$\chi^2 = \sum \frac{(o - e)^2}{e} \tag{1}$$

wherein $\chi^2$ is the chi-square value of the feature, o is the actual frequency, e is the theoretical frequency, and the sum runs over all cells of the table.

The larger the chi-square value $\chi^2$ of a feature, the more relevant the feature is to the target, i.e., the more important the defect cause is to the defect.
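
The chi-square computation and the selection of the k highest-scoring features can be sketched as follows. This is a minimal sketch, assuming the statistic is the Pearson chi-square over the contingency table of one binary feature column and the target column; the function names and the toy data are hypothetical.

```python
from collections import Counter


def chi_square(feature, target):
    """Pearson chi-square statistic between two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    statistic = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            expected = f_count * t_count / n       # theoretical frequency under independence
            actual = observed.get((f_value, t_value), 0)  # actual frequency in this cell
            statistic += (actual - expected) ** 2 / expected
    return statistic


def select_top_k(features, target, k):
    """Return the names of the k features with the largest chi-square values."""
    scores = {name: chi_square(column, target) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical data: 1 means the cause was reported for that defect record.
target = ["leak", "leak", "leak", "crack", "crack", "crack"]
features = {
    "seal_wear": [1, 1, 1, 0, 0, 0],
    "vibration": [1, 1, 1, 1, 0, 0],
    "operator_error": [1, 0, 1, 0, 1, 0],
}
selected, scores = select_top_k(features, target, k=2)
```

In this toy data `seal_wear` coincides exactly with the `leak` target, so it receives the largest chi-square value, while `operator_error` is close to independent of the target and is dropped for k = 2.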

Further, the step S2 may further include the step S23: and respectively carrying out defect label numeralization and one-hot coding on all targets and the selected characteristics.

Defect label numeralization expresses each standardized defect target as a numeric value (0, 1, 2, …, n) so that a program can identify it easily. One-hot encoding represents each standardized defect cause as a binary vector.
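
Both encodings can be sketched in a few lines. This is a minimal sketch under stated assumptions: labels are numbered in first-seen order, the one-hot positions follow the alphabetical order of the causes, and the function names and sample strings are hypothetical.

```python
def numeralize_labels(targets):
    """Map each distinct defect target to an integer 0, 1, 2, ... in first-seen order."""
    mapping = {}
    for target in targets:
        mapping.setdefault(target, len(mapping))
    return [mapping[target] for target in targets], mapping


def one_hot_table(causes):
    """Give each distinct defect cause a binary vector with a single 1."""
    vocabulary = sorted(set(causes))
    return {cause: [int(cause == other) for other in vocabulary] for cause in vocabulary}


# Hypothetical standardized targets and causes, for illustration only.
labels, label_map = numeralize_labels(["pump leakage", "plate fracture", "pump leakage"])
vectors = one_hot_table(["seal wear", "weld flaw", "improper operation"])
```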

Step S3: taking all selected features as input, and training by using a random forest algorithm to obtain a random forest model comprising a plurality of decision trees;

in machine learning, a random forest model is a classifier that contains multiple decision trees, and the class that it outputs is determined by the mode of the class that the individual decision trees output.

In step S3, all the features selected are the one-hot encoded features.

The step S3 includes:

Step S31: drawing samples at random with replacement by the bootstrap sampling method, so as to form a plurality of data subsets;

wherein each data subset comprises a plurality of samples taken at the same time.

Step S32: training each data subset to obtain a decision tree, so as to form a random forest with a plurality of decision trees;

During training, features are randomly extracted to split the nodes until the nodes can no longer be split, and a plurality of decision trees is thus built. Randomly extracting features to split a node means that the optimal feature is found among the randomly extracted features and applied to the node to split it.

Step S33: determining the standardized defect target to which a sample belongs by using the decision trees and a majority voting strategy, and determining the accuracy of the random forest model.

The samples here are of the same kind as the samples randomly drawn in step S31: each consists of the causes and the target of a single defect.
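
Steps S31 and S33 can be sketched as follows. This is a minimal sketch, not the patented training procedure: tree training (step S32) is omitted, and the "trees" are hypothetical stand-in functions so that bootstrap sampling, majority voting and the accuracy computation can be shown on their own.

```python
import random
from collections import Counter


def bootstrap_subsets(samples, n_subsets, seed=0):
    """S31: draw n_subsets data subsets, each sampled with replacement."""
    rng = random.Random(seed)
    size = len(samples)
    return [[rng.choice(samples) for _ in range(size)] for _ in range(n_subsets)]


def majority_vote(votes):
    """S33: the predicted class is the mode of the individual tree outputs."""
    return Counter(votes).most_common(1)[0][0]


def forest_accuracy(trees, samples):
    """Fraction of samples whose majority-vote prediction equals the true target."""
    hits = sum(majority_vote([tree(x) for tree in trees]) == y for x, y in samples)
    return hits / len(samples)


# Hypothetical samples: (one-hot cause features, numeric defect label).
samples = [((1, 0), 0), ((0, 1), 1), ((1, 1), 0), ((0, 0), 1)]
subsets = bootstrap_subsets(samples, n_subsets=3)

# Stand-in "trees": two split on the first feature, one always predicts class 1.
trees = [
    lambda x: 0 if x[0] == 1 else 1,
    lambda x: 0 if x[0] == 1 else 1,
    lambda x: 1,
]
accuracy = forest_accuracy(trees, samples)
```

The weak third tree is outvoted on every sample, which is the point of the majority voting strategy.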

Step S4: searching for the optimal chi-square test feature selection number k and the optimal hyper-parameters of the random forest by using an intelligent optimization algorithm, wherein the optimization goal is to make the accuracy of the random forest model obtained in steps S2 and S3 as high as possible, so that a chi-square test-random forest scheduling model is obtained.

Both the hyper-parameters of the random forest and the chi-square test feature selection number k should reach their best values. To this end, the invention uses an intelligent optimization algorithm to optimize the model and the features selected by the chi-square test. Specifically, the intelligent optimization algorithm is a genetic algorithm, which is a heuristic algorithm; it tunes the hyper-parameters of the random forest and the chi-square test feature selection number k and automatically searches for the optimal point. This optimizes the random forest model and the features selected by the chi-square test, makes the decision trees in the random forest model more accurate, and thus makes the visualized propagation path of the defect cause more accurate.

The hyper-parameters of the random forest may include the number n of decision trees, the maximum depth d of each decision tree, the number of features randomly extracted when finding the optimal split point, and other parameters.

In this embodiment, the optimized parameters at least include three parameters, namely a chi-square test feature selection number k, a number n of decision trees, and a maximum depth d of each decision tree, that is, the hyper-parameters of the optimized random forest include the number n of decision trees and the maximum depth d of each decision tree.

Further, step S4 includes:

step S41: taking the hyper-parameters and chi-square test feature selection number k of each group of random forests as a chromosome, setting a coding rule for converting the chromosome into a genetic space, and then creating initial population data as a population of a current round through random initialization;

specifically, the hyper-parameters and chi-squared test feature selection numbers k of the random forest are expressed as chromosomes or individuals in the genetic space, namely converted into the genetic space, by setting a coding rule for converting the hyper-parameters and chi-squared test feature selection numbers k of the random forest into binary genotypes.

The hyper-parameters of the random forest and the chi-square test feature selection number k are converted into a binary genotype as follows: since the chi-square test feature selection number k and the hyper-parameters of the random forest are decimal values, they are encoded in binary (0/1). For example, suppose the binary genotype is 000100100100 and the basic coefficient of a chromosome is set to 0.1 in the coding rule; then the left four bits represent k with a value of 1 × 0.1, the middle four bits represent the second parameter with a value of 2 × 0.1, and the right four bits represent the third parameter with a value of 4 × 0.1.

The initial population data is the initial population that serves as the starting point of the search. The initial population is random and does not affect the final convergence result.

Random initialization means randomly assigning a binary value to each bit of the binary genotype that encodes the hyper-parameters of the random forest and the chi-square test feature selection number k.
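
The coding rule of the example and the random initialization can be sketched as follows. This is a minimal sketch, assuming three parameters of four bits each and the basic coefficient 0.1 from the example; the function names are hypothetical.

```python
import random


def decode_chromosome(bits, n_params=3, bits_per_param=4, coefficient=0.1):
    """Split a bit string into equal-width genes and scale each gene's integer value."""
    if len(bits) != n_params * bits_per_param:
        raise ValueError("chromosome length does not match the coding rule")
    values = []
    for i in range(n_params):
        gene = bits[i * bits_per_param:(i + 1) * bits_per_param]
        values.append(int(gene, 2) * coefficient)
    return values


def random_population(size, n_bits=12, seed=0):
    """Random initialization: every bit of every chromosome is set to 0 or 1 at random."""
    rng = random.Random(seed)
    return ["".join(rng.choice("01") for _ in range(n_bits)) for _ in range(size)]


# The genotype from the example decodes to 1 x 0.1, 2 x 0.1 and 4 x 0.1.
decoded = decode_chromosome("000100100100")
population = random_population(size=6)
```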

Step S42: calculating a fitness function value of each individual in the population of the current round by repeating steps S2 and S3;

The fitness function takes the accuracy of the random forest model as its objective function. The fitness function value of each chromosome (comprising the hyper-parameters and the value of k) in each round's population is calculated, and the genetic algorithm evaluates the quality of each individual by its fitness, which determines its chance of being inherited.

Step S43: carrying out selection operation according to the calculated fitness function value to obtain a selected individual; then, performing cross operation and mutation operation on all the selected individuals to obtain mutated individuals;

The selection operation means applying a selection operator to the population of the current round. The purpose of selection is to pass the better individuals on to the next generation, either directly or as new individuals generated by pairwise crossover. The selection operation is based on the fitness evaluation of the individuals in the population of the current round.

The selection operation adopts the roulette selection method: the probability of each individual appearing among the offspring is calculated from its fitness function value, and individuals are randomly selected according to these probabilities to form the offspring population.

The crossover operation means setting a crossover probability, according to which each individual randomly selects another individual with which to perform crossover.

The mutation operation means mutating each individual according to the mutation probability. The mutation is random; the mutation probability is generally set to 0.1, i.e., a proportion of 0.1 of the bits in the chromosome is mutated, and a mutation changes a bit from 0 to 1 or from 1 to 0.
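
The crossover and mutation operators can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not say which crossover variant is used, so single-point crossover is chosen here for illustration, and mutation is applied independently to each bit with the given probability. The function names are hypothetical.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Single-point crossover (an assumed variant): swap the tails after a random cut."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]


def mutate(chromosome, rate, rng):
    """Flip each bit (0 to 1 or 1 to 0) independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )


rng = random.Random(0)
parent_a, parent_b = "000100100100", "111000111000"
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(child_a, 0.1, rng)
```

With a rate of 0.1, about one bit in ten is flipped on average, which matches the reading of the mutation probability given above.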

Step S44: calculating fitness function values of all the mutated individuals obtained in the step S43;

step S45: and combining all the selected individuals and all the variant individuals obtained in the step S43 into a combined population, and probabilistically selecting the individuals from the combined population according to the fitness function value to obtain the next round of population as a new current round of population.

In step S45, the individuals are probabilistically selected according to their fitness function values by a roulette selection strategy, with the following specific steps:

step S451: superposing the fitness function values of all individuals in the combined population to obtain a total fitness function value;

step S452: the fitness function value of each individual is divided by the total fitness function value to obtain the probability of the individual being selected.

Step S453: calculating the cumulative probability of the individual according to the probability of the individual being selected to construct a wheel;

step S454: and generating a random number in the interval of [0,1], and if the random number is less than or equal to the cumulative probability of one individual and is greater than the cumulative probability of the previous individual, selecting the individual to enter the offspring population.

Step S455: step S454 is repeated until the population size is satisfied.
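
Steps S451 to S455 can be sketched as follows. This is a minimal sketch, assuming non-negative fitness values with a positive total; the function name `roulette_select` is hypothetical.

```python
import random
from itertools import accumulate


def roulette_select(population, fitness, n_select, seed=0):
    """Select n_select individuals with probability proportional to fitness."""
    rng = random.Random(seed)
    total = sum(fitness)                              # S451: total fitness
    probabilities = [f / total for f in fitness]      # S452: selection probability
    wheel = list(accumulate(probabilities))           # S453: cumulative probabilities
    wheel[-1] = 1.0                                   # guard against rounding error
    chosen = []
    while len(chosen) < n_select:                     # S455: repeat until full
        r = rng.random()                              # S454: random number in [0, 1)
        for individual, upper in zip(population, wheel):
            if r <= upper:                            # first slot whose bound covers r
                chosen.append(individual)
                break
    return chosen


# Hypothetical combined population with fitness values (model accuracies).
population = ["0001", "0010", "0100"]
offspring = roulette_select(population, [0.6, 0.8, 0.9], n_select=4)
```

Taking the first individual whose cumulative probability is at least the random number is equivalent to the condition in step S454, since the previous individual's cumulative probability was then smaller than the random number.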

Step S46: determining whether the current round has reached the maximum number of iterations; if so, stopping the iteration and executing step S47; otherwise, returning to step S42;

Step S47: selecting the individual with the largest fitness function value in the population of the current round as the optimal chi-square test feature selection number k and the optimal hyper-parameters of the random forest, and correspondingly obtaining the chi-square test-random forest scheduling model by repeating steps S2 and S3.
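
The overall loop of steps S41 to S47 can be sketched as follows. This is a minimal sketch under loud assumptions: the fitness function is a hypothetical stand-in (the fraction of 1 bits) for the model accuracy that steps S2 and S3 would return, the crossover is single-point, the crossover rate of 0.8 and the population size are illustrative, and a tiny constant keeps every roulette weight positive. Only the mutation probability 0.1 and the 30 generations of fig. 2 come from the text.

```python
import random


def run_ga(fitness_fn, n_bits=12, pop_size=8, generations=30,
           crossover_rate=0.8, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)

    def roulette(individuals, scores, count):
        # Fitness-proportional selection; the epsilon avoids an all-zero wheel.
        return rng.choices(individuals, weights=[s + 1e-9 for s in scores], k=count)

    # S41: random initial population of binary chromosomes.
    population = ["".join(rng.choice("01") for _ in range(n_bits)) for _ in range(pop_size)]
    for _ in range(generations):                                   # S46: iteration limit
        scores = [fitness_fn(c) for c in population]               # S42: fitness values
        selected = roulette(population, scores, pop_size)          # S43: selection
        mutated = []
        for parent in selected:
            child = parent
            if rng.random() < crossover_rate:                      # S43: crossover
                mate = rng.choice(selected)
                point = rng.randint(1, n_bits - 1)
                child = parent[:point] + mate[point:]
            child = "".join(                                       # S43: mutation
                ("1" if b == "0" else "0") if rng.random() < mutation_rate else b
                for b in child
            )
            mutated.append(child)
        combined = selected + mutated                              # S45: combined population
        combined_scores = [fitness_fn(c) for c in combined]        # S44: fitness of new individuals
        population = roulette(combined, combined_scores, pop_size)
    return max(population, key=fitness_fn)                         # S47: best individual


def ones_fraction(chromosome):
    """Hypothetical stand-in fitness; the real one is the random forest accuracy."""
    return chromosome.count("1") / len(chromosome)


best = run_ga(ones_fraction)
```

In an actual application, `fitness_fn` would decode the chromosome into k, n and d, rerun the chi-square selection and the random forest training, and return the resulting accuracy.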

Further, the method includes step S5: obtaining a visual propagation path diagram (namely, a defect-cause propagation path) corresponding to the decision tree with the highest accuracy in the chi-square test-random forest scheduling model.

In this embodiment, the number of the visual propagation path diagrams is 1, and a decision tree with the highest accuracy is selected from the chi-square test-random forest scheduling model for visualization to obtain a corresponding visual propagation path diagram, where the visual propagation path diagram shows an association relationship between a normalized defect target and a normalized defect cause in the decision tree with the highest accuracy of the chi-square test-random forest scheduling model. Thereby, cause analysis of each defect is achieved.

In addition, to analyze the importance of the defect causes, the invention also provides a GBDT-based method for ranking the importance of defect causes and visualizing it as a histogram; that is, the method further comprises step S5': evaluating the importance of the plurality of standardized defect causes in the chi-square test-random forest scheduling model by using a GBDT (gradient boosting decision tree) algorithm, and ranking the standardized defect causes. Thus, the importance of the cause of each defect is shown.

It should be noted that GBDT analyzes the full data set rather than data that has undergone feature selection, so the importance of all defect causes can be revealed. Therefore, when the GBDT algorithm is used in step S5', the data output by step S1, i.e., the standardized data after natural language processing, is used.

The GBDT algorithm is a gradient boosting tree algorithm and can evaluate the importance of each feature in the random forest model.

The step S5' includes:

Step S51': recording, by using the GBDT algorithm, the total number of splits, the total information gain, and the average information gain of each feature while training the decision trees that form the chi-square test-random forest scheduling model; these parameters are the evaluation indexes for measuring feature importance.

It should be noted that the more times a feature (i.e., a normalized defect cause) is used for splitting, the more important the feature is; the importance contribution of each feature on each tree is calculated and then simply averaged.

Step S52': calculating the importance of the jth normalized defect cause in a single decision tree by formula (2):

$$\hat{J}_j^2(T) = \sum_{t=1}^{L-1} \hat{i}_t^2 \, \mathbf{1}(v_t = j) \qquad (2)$$

wherein j is the ordinal number of the normalized defect cause, L is the number of leaf nodes of the decision tree (a binary tree with L leaf nodes has L-1 non-leaf nodes), t is the ordinal number of a non-leaf node, $v_t$ is the normalized defect cause associated with non-leaf node t, $\hat{i}_t^2$ is the reduction in squared loss after non-leaf node t is split, $\mathbf{1}(v_t = j)$ is an indicator function that equals 1 when the feature associated with non-leaf node t is the normalized defect cause j and 0 otherwise, and T denotes the decision tree.

Step S53': calculating the global importance of the jth normalized defect cause over all decision trees by formula (3):

$$\hat{J}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_j^2(T_m) \qquad (3)$$

wherein $\hat{J}_j^2(T_m)$ is the importance of the jth normalized defect cause on the mth decision tree, M is the total number of decision trees, and m is the ordinal number of a decision tree.

Step S54': ranking the importance of each standardized defect cause and forming an importance histogram of the standardized defect causes. A histogram visualization of the importance of the defect causes is thereby achieved.
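Steps S52' to S54' can be sketched in a few lines. The sketch assumes that each decision tree is available as a list of (defect cause, squared-loss reduction) pairs, one per non-leaf node; this representation, the function names, and the numbers are hypothetical and serve only to illustrate the per-tree sum and the average over trees.

```python
from collections import defaultdict

def tree_importance(splits):
    """Per-tree importance: sum the squared-loss reductions of the
    non-leaf nodes that split on each defect cause (step S52')."""
    importance = defaultdict(float)
    for cause, loss_reduction in splits:
        importance[cause] += loss_reduction
    return dict(importance)

def global_importance(trees):
    """Average the per-tree importance over all M trees (step S53')."""
    total = defaultdict(float)
    for splits in trees:
        for cause, value in tree_importance(splits).items():
            total[cause] += value
    return {cause: value / len(trees) for cause, value in total.items()}

# Made-up split records for two trees, for illustration only.
trees = [
    [("human error", 0.5), ("design error", 0.25), ("human error", 0.25)],
    [("design error", 0.25)],
]

# Step S54': rank the causes by global importance, highest first.
ranking = sorted(global_importance(trees).items(),
                 key=lambda item: item[1], reverse=True)
print(ranking)  # [('human error', 0.375), ('design error', 0.25)]
```

The ranked list is what a bar chart of importance would be drawn from.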

Experimental Results

In the experiments of the present invention, the chi-square value of each defect cause with respect to the defect is calculated according to step S2 and the defect causes are sorted accordingly, so that the defect causes highly correlated with the defect, such as management defects, human error, failure, temporary, fault, and deficiency, are obtained as features.
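The ranking of defect causes by chi-square value can be illustrated with a minimal sketch. It assumes the Pearson chi-square statistic on the contingency table of a categorical feature against the target; the exact formula of step S2 may differ, and the names `chi_square` and `top_k` and the toy data are hypothetical.

```python
from collections import Counter

def chi_square(feature, target):
    """Pearson chi-square statistic of a feature column against the target."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            expected = f_count * t_count / n  # count expected under independence
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat

def top_k(features, target, k):
    """Return the names of the k features with the largest chi-square values."""
    scores = {name: chi_square(column, target)
              for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up data: 1 means the cause keyword appears in the record.
target = ["A", "A", "B", "B"]
features = {"cause_x": [1, 1, 0, 0], "cause_y": [1, 0, 1, 0]}
print(top_k(features, target, 1))  # ['cause_x']
```

Here `cause_x` separates the two targets perfectly (chi-square value 4), while `cause_y` is independent of the target (chi-square value 0), so only `cause_x` is kept for k = 1.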

Table 1 shows specific experimental results for the standardized defect targets and standardized defect causes extracted by the natural language processing technique. The standardized defect target occupies columns 1 and 2 of Table 1, NAN represents a missing value, and the standardized defect causes are the fields after columns 1 and 2. As shown in Table 1, each record pairs one standardized defect target with a plurality of standardized defect causes.

Table 1. Standardized defect targets and standardized defect causes after extraction

| Equipment object | Phenomenon | Reason classification |
| --- | --- | --- |
| Orifice plate | Fracture | Design errors; Mismatch; Human error; Not discovered; Unclear |
| Stone | NAN | Design errors; Destruction; Cancellation; Transformation; Dismantling danger; Conflict; Influence; Consideration |
| Switch | NAN | Design errors; Fail to work; Deficiency; Lack |
| Fan | Reverse rotation | Design errors; Deficiency; Modification; Violation; Fault; Not strict; Failure; Reverse connection |
| Cable | Damage | Management defects; Installation problem; Human error; Not found; Not executed |
| Support frame | Rusty | Others; Design safety; Change; Influence |
| Hoisting device | NAN | Human error; Mismatch; Consideration; Installation problem; Lack |
| Pit | NAN | Management defects; Lag |
| Power supply | Power off | Management defects; Power failure; Instability; Fail to work; Fault; Temporary |
| Flange | NAN | |
| Generator | NAN | Management defects; Sundries; Safety; Adjustment; Temporary; Progress; Influence |
| Cable | Power off | Management defects; Human error; Is provided with |
| Hoisting device | Falling off | Management defects; Deficiency; Safety; Neglect |
| Valve | NAN | Others; Variation |
| Penetration piece | NAN | Procedure defects; Construction process; Out of control |
| Generator | NAN | Management defects; Adjustment; Progress; Lag; Temporary; Lack |

The chi-square test-random forest scheduling model finally obtained is the random forest model corresponding to the optimal parameters found by the genetic algorithm, and the corresponding visual propagation path diagram is the decision tree with the highest accuracy in that model. After optimization by the intelligent optimization algorithm, the defect identification accuracy improves from 96% to 99.2%. The identification accuracy is the classification accuracy of the model, calculated as the number of correctly classified samples in the test set divided by the total number of samples in the test set.
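The identification accuracy described above is the usual classification accuracy. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: three of the four predictions are correct.
print(accuracy(["a", "b", "c", "d"], ["a", "b", "c", "x"]))  # 0.75
```

In the optimization, this value serves as the fitness function of each chromosome.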

Figs. 2A and 2B show the convergence of accuracy over 30 generations of optimization by the genetic algorithm. The abscissa of both figures is the number of iterations; the ordinate of Fig. 2A is the accuracy of the random forest model (i.e., the fitness function value), and the ordinate of Fig. 2B is the accuracy of the optimal random forest model. Fig. 2A is a scatter plot in which the points show the distribution of the fitness function values of the random forest models for the chromosomes in each iteration of the genetic algorithm. Fig. 2B is the corresponding line graph.

The random forest is output as a visual tree diagram, and the decision tree with the highest accuracy is selected as the visual propagation path diagram. Fig. 3 shows the visual propagation path diagram of the random forest model, i.e., the visualization of the optimal decision tree in the random forest model. Fig. 3 shows the path from the root node to each leaf node of the decision tree, where all non-leaf nodes are defect causes and the leaf nodes are defects, so the paths make clear which defect is associated with which causes.
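Reading defect-cause paths from a decision tree can be sketched as a traversal from the root to every leaf. The nested-dictionary tree format, the key names, and the placeholder labels below are assumptions made for illustration; they are not taken from Fig. 3.

```python
def propagation_paths(node, prefix=()):
    """Yield (conditions, defect) for every root-to-leaf path of a binary tree."""
    if "defect" in node:  # leaf node: holds the predicted defect
        yield list(prefix), node["defect"]
        return
    cause = node["cause"]  # non-leaf node: holds a defect cause
    yield from propagation_paths(node["yes"], prefix + ((cause, True),))
    yield from propagation_paths(node["no"], prefix + ((cause, False),))

# Made-up tree for illustration only.
tree = {
    "cause": "cause_A",
    "yes": {"defect": "defect_X"},
    "no": {
        "cause": "cause_B",
        "yes": {"defect": "defect_Y"},
        "no": {"defect": "defect_Z"},
    },
}

for conditions, defect in propagation_paths(tree):
    print(conditions, "->", defect)
```

Each printed line lists the causes tested along one path, with True or False indicating whether the cause is present, followed by the defect at the leaf.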

A GBDT model is constructed from the standardized defect data, and the importance of all defect causes is analyzed, ranked, and visualized as an importance histogram. Fig. 4 shows the importance histogram of the standardized defect causes; the standardized defect cause corresponding to each letter code can be looked up to learn its importance.

The figures show that the method is more accurate and efficient than defect cause analysis based on NCR reports. The embodiment of the invention discloses a method for analyzing the causes of quality defect events over the whole life cycle of a nuclear power plant. The process by which a defect forms can thus be judged in time and the importance of the defect causes is presented visually, so that more targeted measures can be taken to improve the efficiency of experience feedback work at the nuclear power plant.

The above embodiments are merely preferred embodiments of the present invention, which are not intended to limit the scope of the present invention, and various changes may be made in the above embodiments of the present invention. All simple and equivalent changes and modifications made according to the claims and the content of the specification of the present application fall within the scope of the claims of the present patent application. The invention has not been described in detail in order to avoid obscuring the invention.

Claims

1. A nuclear power quality defect cause analysis method is characterized by comprising the following steps:

step S1: extracting text keywords in nuclear power defect event data by using a natural language processing technology to form a standardized defect target and a standardized defect reason;

step S2: taking each standardized defect target as a target and each standardized defect reason as a feature, calculating the chi-square value between each feature and the target by using a chi-square test feature extraction technique to measure the degree of correlation between the feature and the target, and selecting the top k features with the highest correlation from all the features, wherein k is the chi-square test feature selection number;

step S3: taking all the selected features as input, and training by using a random forest algorithm to obtain a random forest model comprising a plurality of decision trees;

2. The nuclear power quality defect cause analysis method according to claim 1, wherein the step S1 includes:

step S11: acquiring the corresponding relation between defect event description and defect reason description from nuclear power defect event data;

step S12: performing natural language processing operation on the defect event description and the defect reason description to obtain a standardized defect target and a standardized defect reason which correspond to each other;

the step S12 includes:

step S121: acquiring a nuclear power knowledge base of a provider of a historical defect text base, and constructing a nuclear power text dictionary by using the nuclear power knowledge base;

step S122: inputting all character feature vectors in the nuclear power text dictionary into an algorithm framework of a natural language processing model, and training the natural language processing model;

3. The nuclear power quality defect cause analysis method of claim 2, wherein the algorithm framework of the natural language processing model employs a Bert-LSTM-CRF model framework.

4. The nuclear power quality defect cause analysis method according to claim 1, wherein the step S2 includes:

step S22: performing chi-square test feature engineering on the target and the features to obtain a chi-square value for each feature, and selecting the k features with the largest chi-square values.

5. The nuclear power quality defect cause analysis method according to claim 1, wherein the step S3 includes:

6. The nuclear power quality defect cause analysis method of claim 1, wherein the intelligent optimization algorithm employs a genetic algorithm in a heuristic algorithm, and the hyper-parameters of the random forest comprise the number n of decision trees and the maximum depth d of each decision tree.

7. The nuclear power quality defect cause analysis method according to claim 6, wherein the step S4 includes:

step S41: taking the hyper-parameters and chi-square test feature selection number k of each group of random forests as a chromosome, setting a coding rule for converting the chromosome into a genetic space, and then creating initial population data as the population of the current round through random initialization;

step S42: calculating a fitness function value of each individual in the population of the current round by repeating the steps S2 and S3, wherein the fitness function takes the accuracy of the random forest model as a target function;

step S44: calculating fitness function values of all the mutated individuals;

step S45: combining all selected individuals and all mutated individuals into a combined population, and probabilistically selecting the individuals according to the fitness function values to obtain the next round of population as a new current round of population;

step S46: determining whether the current round reaches the maximum iteration number, if so, executing the step S47, otherwise, returning to the step S42;

8. The nuclear power quality defect cause analysis method according to claim 1, further comprising step S5: obtaining a visual propagation path diagram corresponding to the decision tree with the highest accuracy in the chi-square test-random forest scheduling model.

9. The nuclear power quality defect cause analysis method according to claim 1, further comprising step S5': evaluating the importance of a plurality of standardized defect reasons in the chi-square test-random forest scheduling model by using a GBDT algorithm, and ranking the standardized defect reasons.

10. The nuclear power quality defect cause analysis method according to claim 9, wherein the step S5' includes:

step S51': recording the total splitting times, the total information gain and the average information gain of the features by utilizing a GBDT algorithm in the process of training a decision tree forming a chi-square test-random forest scheduling model;

step S52': calculating the importance of the jth standardized defect reason in a single decision tree by the formula

$$\hat{J}_j^2(T) = \sum_{t=1}^{L-1} \hat{i}_t^2 \, \mathbf{1}(v_t = j)$$

wherein j is the ordinal number of the standardized defect reason, L is the number of leaf nodes of the decision tree, t is the ordinal number of a non-leaf node, $v_t$ is the standardized defect reason associated with non-leaf node t, $\hat{i}_t^2$ is the reduction in squared loss after non-leaf node t is split, $\mathbf{1}(v_t = j)$ is an indicator function that equals 1 when the feature associated with non-leaf node t is the standardized defect reason j, and T denotes the decision tree;

step S53': calculating the global importance of the jth standardized defect reason over all decision trees by the formula

$$\hat{J}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_j^2(T_m)$$

wherein $\hat{J}_j^2(T_m)$ is the importance of the jth standardized defect reason on the mth decision tree, M is the total number of decision trees, and m is the ordinal number of a decision tree;

step S54': and ranking the importance of each standardized defect reason and forming an importance histogram of the standardized defect reasons.