CN114969267A - Nuclear power quality defect cause analysis method - Google Patents


Info

Publication number
CN114969267A
Authority
CN
China
Prior art keywords
defect
nuclear power
chi
random forest
standardized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210665278.5A
Other languages
Chinese (zh)
Inventor
邵凯文
赵芝芸
王梦灵
王理
李垚
陆潜慧
陈佳洋
张羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology
Priority to CN202210665278.5A
Publication of CN114969267A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention provides a nuclear power quality defect cause analysis method comprising the following steps: extracting text keywords using natural language processing to form standardized defect targets and standardized defect reasons; selecting the features most strongly correlated with the targets from all the features using chi-square test feature selection; taking the selected features as input and training with a random forest algorithm to obtain a random forest model comprising a plurality of decision trees; and searching for the optimal chi-square test feature selection number k and the hyper-parameters of the random forest using an intelligent optimization algorithm, with the optimization goal of maximizing the accuracy of the random forest model, thereby obtaining a chi-square test-random forest scheduling model. The method extracts text keywords through natural language processing, which solves the problem of standardizing nuclear power text data; it screens the defect reasons by chi-square test feature selection before the random forest model is built, and it optimizes the model parameters with a genetic algorithm to improve accuracy.

Description

Nuclear power quality defect cause analysis method
Technical Field
The invention relates to data-driven defect cause analysis, and in particular to a tree-diagram method for nuclear power quality defect cause analysis based on particle swarm optimization, the chi-square test, and a random forest.
Background
In nuclear power production, both equipment and operating procedures affect the normal operation of a nuclear power plant. When they are handled improperly, unit equipment can develop defects such as pump leakage, orifice plate fracture, and dropped loads during lifting, and these defects introduce potential safety hazards into nuclear power production.
In the prior art, defect causes are mostly analyzed manually from unstructured, redundant nuclear power text data, using forms such as the NCR report to analyze nuclear power quality defects and their causes and to feed back experience. However, this approach is inefficient, and its results are not highly accurate.
Disclosure of Invention
The invention aims to provide a nuclear power quality defect cause analysis method to solve the problem that the existing nuclear power defect cause analysis method is low in efficiency and low in precision.
In order to achieve the purpose, the invention provides a nuclear power quality defect cause analysis method, which comprises the following steps:
S1: extracting text keywords from nuclear power defect event data using natural language processing to form standardized defect targets and standardized defect reasons;
S2: taking each standardized defect target as a target and each standardized defect reason as a feature, calculating the chi-square value between each feature and the targets using chi-square test feature selection to measure their degree of correlation, and selecting from all the features the first k features with the highest degree of correlation, where k is the chi-square test feature selection number;
S3: taking all the selected features as input and training with a random forest algorithm to obtain a random forest model comprising a plurality of decision trees;
S4: searching for the optimal chi-square test feature selection number k and the hyper-parameters of the random forest using an intelligent optimization algorithm, the optimization goal being to maximize the accuracy of the random forest model obtained in steps S2 and S3, thereby obtaining a chi-square test-random forest scheduling model.
Preferably, the step S1 includes:
S11: acquiring the correspondence between the defect event description and the defect reason description from the nuclear power defect event data;
S12: performing natural language processing on the defect event description and the defect reason description to obtain a standardized defect target and standardized defect reasons that correspond to each other;
Step S12 includes:
S121: acquiring the nuclear power knowledge base of the provider of the historical defect text base, and constructing a nuclear power text dictionary from the nuclear power knowledge base;
S122: inputting all the character feature vectors in the nuclear power text dictionary into the algorithm framework of a natural language processing model, and training the natural language processing model;
S123: inputting the defect event description and the defect reason description into the trained natural language processing model to obtain the corresponding standardized defect target and standardized defect reasons.
Preferably, the algorithm framework of the natural language processing model is a BERT-LSTM-CRF model framework.
Preferably, the step S2 includes:
S21: performing data cleaning on the standardized defect targets and the standardized defect reasons, the cleaned standardized defect targets and standardized defect reasons serving as the targets and the features, respectively;
S22: performing chi-square test feature engineering on the targets and the features to obtain the chi-square value of each feature, and selecting the k features with the largest chi-square values.
Preferably, the step S3 includes:
S31: drawing random samples with replacement by the bootstrap sampling method to form a plurality of data subsets;
S32: training a decision tree on each data subset, thereby forming a random forest with a plurality of decision trees;
S33: determining the standardized defect target to which each sample belongs using the decision trees and a majority voting strategy, and determining the accuracy of the random forest model.
Preferably, the intelligent optimization algorithm is a genetic algorithm, a type of heuristic algorithm, and the hyper-parameters of the random forest comprise the number n of decision trees and the maximum depth d of each decision tree.
Preferably, the step S4 includes:
S41: taking each group of random forest hyper-parameters and chi-square test feature selection number k as a chromosome, setting a coding rule for converting the chromosome into the genetic space, and then creating initial population data by random initialization as the population of the current round;
S42: calculating the fitness function value of each individual in the population of the current round by repeating steps S2 and S3, where the fitness function takes the accuracy of the random forest model as its objective function;
S43: performing a selection operation according to the calculated fitness function values to obtain the selected individuals, and then performing a crossover operation and a mutation operation on all the selected individuals to obtain the mutated individuals;
S44: calculating the fitness function values of all the mutated individuals;
S45: combining all the selected individuals and all the mutated individuals into a combined population, and probabilistically selecting individuals from the combined population according to their fitness function values to obtain the next round's population, which becomes the new population of the current round;
S46: determining whether the current round has reached the maximum number of iterations; if so, executing step S47; otherwise, returning to step S42;
S47: selecting the individual with the largest fitness function value in the population of the current round as the optimal chi-square test feature selection number k and the optimal hyper-parameters of the random forest, and obtaining the corresponding chi-square test-random forest scheduling model by repeating steps S2 and S3.
Preferably, the nuclear power quality defect cause analysis method further includes step S5: obtaining the visualized propagation path diagram corresponding to the decision tree with the highest accuracy in the chi-square test-random forest scheduling model.
Preferably, the nuclear power quality defect cause analysis method further includes step S5': evaluating the importance of the plurality of standardized defect reasons in the chi-square test-random forest scheduling model using a GBDT algorithm, and ranking the standardized defect reasons.
Preferably, the step S5' includes:
S51': recording the total number of splits, the total information gain, and the average information gain of each feature with the GBDT algorithm while training the decision trees that form the chi-square test-random forest scheduling model;
S52': calculating the importance of the jth standardized defect reason in a single decision tree by the formula
$$\hat{J}_j^2(T) = \sum_{t=1}^{L-1} \hat{i}_t^2 \, \mathbb{1}(v_t = j)$$
where j is the ordinal number of the standardized defect reason, L is the number of leaf nodes of the decision tree, t is the ordinal number of a non-leaf node, $v_t$ is the standardized defect reason associated with non-leaf node t, $\hat{i}_t^2$ is the reduction in squared loss after non-leaf node t splits, $\mathbb{1}(v_t = j)$ is an indicator function equal to 1 when the feature associated with non-leaf node t is the standardized defect reason j, and T denotes a decision tree;
S53': calculating the global importance of the jth standardized defect reason over all the decision trees by the formula
$$\hat{J}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_j^2(T_m)$$
where $\hat{J}_j^2(T_m)$ is the importance of the jth standardized defect reason on the mth decision tree, M is the total number of decision trees, and m is the ordinal number of the decision tree;
S54': ranking the standardized defect reasons by importance and forming an importance histogram of the standardized defect reasons.
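A minimal sketch of steps S52' to S54', assuming the per-tree importance of a defect reason is the sum of the squared-loss reductions at the non-leaf nodes that split on it, and the global importance is the average over all trees. The `(feature_index, loss_reduction)` pairs and their values are hypothetical stand-ins for what a trained tree would record; they are not data from the patent.

```python
def tree_importance(splits, n_features):
    """Sum the loss reduction of every non-leaf node, grouped by split feature."""
    importance = [0.0] * n_features
    for feature_index, loss_reduction in splits:
        importance[feature_index] += loss_reduction
    return importance


def global_importance(trees, n_features):
    """Average the per-tree importance of each feature over all trees."""
    per_tree = [tree_importance(splits, n_features) for splits in trees]
    return [sum(tree[j] for tree in per_tree) / len(per_tree)
            for j in range(n_features)]


# Hypothetical forest: each tree is a list of (feature_index, loss_reduction)
# pairs, one pair per non-leaf node.
trees = [
    [(0, 4.0), (1, 1.0), (0, 2.0)],
    [(1, 3.0), (2, 1.0)],
]
importance = global_importance(trees, n_features=3)
# Rank features from most to least important (step S54').
ranking = sorted(range(3), key=lambda j: importance[j], reverse=True)
print(importance, ranking)
```

The ranked values are what an importance histogram would plot, one bar per standardized defect reason.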
With the nuclear power quality defect cause analysis method of the invention, on the one hand, text keywords are extracted by natural language processing techniques such as BERT and LSTM to form standardized defects and defect reasons, which solves the problem of standardizing unstructured, redundant nuclear power text data. On the other hand, before the random forest model is built, redundant defect reasons are screened out by chi-square test feature selection, and the model parameters are optimized with a genetic algorithm, which improves the accuracy of the random forest model of the defect causes.
In addition, the invention can generate an accurate propagation path diagram of the defect cause and a defect cause importance degree histogram through a random forest model, and can carry out all-round, rapid and accurate analysis on the defect cause.
Drawings
FIG. 1 is a flow chart of a nuclear power quality defect cause analysis method of the present invention.
Fig. 2A and 2B are graphs of convergence of accuracy after 30 generations of optimization by the genetic algorithm, wherein fig. 2A is a scatter diagram and fig. 2B is a line diagram.
FIG. 3 is a graph of visualized propagation paths of a random forest model.
FIG. 4 is a histogram of the importance of the standardized defect reasons.
Detailed Description
The method of the present invention is described in detail below with reference to the accompanying drawings.
The invention provides a nuclear power quality defect cause analysis method based on a genetic algorithm, the chi-square test, and a random forest. It addresses the low efficiency and low precision of existing NCR-report-based nuclear power defect cause analysis, standardizes nuclear power text data, and forms standardized defects and defect reasons. The following description uses a specific nuclear power plant defect example to illustrate the implementation of the invention and its advantages over the conventional defect cause analysis method. As shown in FIG. 1, the nuclear power quality defect cause analysis method specifically includes the following steps:
Step S0: collecting nuclear power defect event data from the nuclear power full life cycle (EPCS) stages;
Nuclear power defect event data are collected mainly by gathering existing NCR reports. An NCR report is the written record in which engineers document nuclear power non-conformances. The nuclear power defect event data are the events among the non-conforming items recorded in NCR reports across the nuclear power full life cycle.
In this embodiment, the data collected are nuclear power quality defect data provided by China General Nuclear Power Group (CGN), comprising defect event data reports from the nuclear power full life cycle (EPCS) stages, in particular the nuclear power defect event data in NCR reports (i.e., non-conformance reports). A complete NCR report contains detailed descriptive information such as unit information, primary stage information, secondary stage information, event information, and reason information, of which the event information and the reason information are the data required by the invention.
In addition, the nuclear power defect history text base is broader than the nuclear power defect event data: it contains the NCR reports, which in turn contain the nuclear power defect event data, so the event data can be obtained directly from the nuclear power defect history text base.
The related text library consists of texts from other fields that do not concern nuclear power defects. For example, cross-enterprise supplier responsibility-chain information cannot be found in a nuclear power defect text library. The related text library sits alongside the nuclear power historical defect text library.
In other embodiments, step S0 may be omitted as long as the nuclear power defect event data are available in step S1.
Step S1: extracting text keywords from nuclear power defect event data using natural language processing to form standardized defect targets and standardized defect reasons. The input is a text report, and each text report describes a defect together with its corresponding defect reasons; this step thereby converts the defect reasons into features (i.e., the standardized defect reasons).
In step S1, the natural language processing technique includes at least one of BERT, LSTM, and similar techniques.
Step S1 includes:
Step S11: acquiring the correspondence between the defect event description and the defect reason description from the nuclear power defect event data;
The defect event description and the defect reason description serve as one of the input parameters of the natural language processing model.
In this embodiment, the defect event description and the defect reason description are the event information and the reason information of the NCR report, respectively.
Step S12: performing natural language processing on the defect event description and the defect reason description to obtain a standardized defect target and standardized defect reasons that correspond to each other.
The standardized defect targets and standardized defect reasons obtained here are the key defect targets and key defect reasons. Because the defect event description and the defect reason description come from the historical defect text library, and each text report describes a defect together with its corresponding defect reasons, the correspondence between each standardized defect target and its standardized defect reasons is also obtained.
The step S12 specifically includes:
Step S121: acquiring the nuclear power knowledge base of the provider of the historical defect text base, and constructing a nuclear power text dictionary from it; the nuclear power text dictionary is the other input parameter of the natural language processing model;
In this embodiment, the provider of the NCR reports is China General Nuclear Power Group (CGN). Constructing the nuclear power text dictionary from the nuclear power knowledge base includes traversing the knowledge base word by word and assigning a serial number to each word, which yields the character feature vector of each word and hence the nuclear power text dictionary. The dictionary is usually stored in JSON format. The character feature vector of each word is thus obtained by exhaustively enumerating the nuclear power text and attaching a serial number to each word, giving entries such as {nucleus: 1}, {electricity: 2}, and so on.
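The dictionary construction described in step S121 can be sketched as follows, assuming character-level entries numbered from 1 in order of first appearance. The two-sentence `knowledge_base` is a hypothetical stand-in for the real nuclear power knowledge base, and the function name is illustrative only.

```python
import json


def build_text_dictionary(corpus):
    """Assign a 1-based serial number to each distinct character, in order of first appearance."""
    dictionary = {}
    for text in corpus:
        for ch in text:
            if ch.isspace():
                continue  # whitespace carries no vocabulary entry in this sketch
            if ch not in dictionary:
                dictionary[ch] = len(dictionary) + 1
    return dictionary


# Hypothetical knowledge-base sentences ("nuclear power pump leaks water",
# "nuclear power orifice plate fractures"); the real knowledge base is not public.
knowledge_base = ["核电泵漏水", "核电孔板断裂"]
nuclear_dict = build_text_dictionary(knowledge_base)

# Serialize to JSON, the storage format the description mentions.
print(json.dumps(nuclear_dict, ensure_ascii=False))
```

With this input the first two entries are 核 (nucleus) mapped to 1 and 电 (electricity) mapped to 2, matching the form of the example entries in the description.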
Step S122: inputting all the character feature vectors in the nuclear power text dictionary into the algorithm framework of a natural language processing model, and training the natural language processing model.
In this embodiment, the algorithm framework of the natural language processing model is a BERT-LSTM-CRF model framework, which is a typical algorithm framework for natural language processing models.
As described above, the character feature vector of each word in the nuclear power text dictionary has a form such as {nucleus: 1}, {electricity: 2}, and so on, so the nuclear power text dictionary is converted into feature vectors through the characteristics of bert and MasterL during training. The character feature vectors of the nuclear power text dictionary convert the semantic information of the text into feature information that a computer can recognize, and they are one of the input parameters.
Step S123: and inputting the defect event description and the defect reason description into the trained natural language processing model to obtain a corresponding standardized defect target and a standardized defect reason.
That is, the input data of the natural language processing model are the defect event description and the defect reason description (i.e., the event information and the reason information) together with the nuclear power text dictionary, and the output data are the standardized defect target and the standardized defect reasons (i.e., the feature vectors of the defect event description and the defect reason description). Standardized data of this kind are well suited for input into the algorithm as features. Decoding the feature vectors with the nuclear power text dictionary yields the defect target and the defect reasons in standardized text form.
In each correspondence between a standardized defect target and its standardized defect reasons, there is one standardized defect target and a plurality of standardized defect reasons.
Step S2: taking each standardized defect target as a target and each standardized defect reason as a feature, calculating the chi-square value between each feature and the targets using chi-square test feature selection to measure their degree of correlation, and selecting from all the features a subset (namely the first k) with the highest degree of correlation; k is the chi-square test feature selection number, and k is a positive integer;
Step S2 includes:
Step S21: performing data cleaning on the standardized defect targets and the standardized defect reasons, the cleaned standardized defect targets and standardized defect reasons serving as the targets and the features, respectively;
Data cleaning includes, but is not limited to, missing value processing and outlier processing. Here, missing value processing and outlier processing mean deleting the missing values and outliers directly.
Step S22: performing chi-square test feature engineering on the targets and the features to obtain the chi-square value of each feature; feature selection is then applied to the standardized defect reasons, i.e., the k features with the largest chi-square values (the first k important features) are selected.
The selected features are used as input for a random forest model in the following.
In step S22, the chi-square test feature engineering of the targets and the features proceeds as follows:
Step S221: the chi-square analysis first assumes that the two variables, target and feature, are independent of each other and, under this assumption, obtains the actual frequency and the theoretical frequency of each cell in the contingency table of the variables;
The actual frequency is the number of defects attributed to a given defect reason, and the theoretical frequency is calculated by multiplying the actual frequency count by the probability (the probability that the defect occurs).
Step S222: the actual frequencies of the two variables, target and feature, are then calculated, and the chi-square value of each feature against all the targets is computed to compare the actual frequency with the theoretical frequency in each cell of the contingency table; the larger the difference between them, the larger the chi-square value;
Step S223: the difference (i.e., the chi-square value of each feature) is expressed as the chi-square test result. If the result is not significant, the original hypothesis cannot be rejected, and the target variable and the feature variable are treated as independent of each other; if the result is significant, the original hypothesis is rejected, and the target variable and the feature variable are related.
That is, the chi-square value $X^2$ of each feature is calculated by the following formula (1):
$$X^2 = \sum \frac{(o - e)^2}{e} \tag{1}$$
where $X^2$ is the chi-square value of the feature, o is the actual frequency, and e is the theoretical frequency.
The larger the chi-square value $X^2$ of a feature, the more relevant the feature is to the target, i.e., the more important the defect reason is to the defect.
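The chi-square scoring and top-k selection of step S22 can be sketched in plain Python as below. This sketch assumes the standard Pearson definition, in which the theoretical frequency of a contingency-table cell is (row total × column total) / sample count; the feature names and toy data are hypothetical, not taken from the patent.

```python
from collections import Counter


def chi_square(feature, target):
    """Pearson chi-square statistic between two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))   # actual frequency per cell
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    statistic = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            e = f_count * t_count / n          # theoretical frequency under independence
            o = observed.get((f_value, t_value), 0)
            statistic += (o - e) ** 2 / e
    return statistic


def select_top_k(features, target, k):
    """Return the names of the k features with the largest chi-square values."""
    scores = {name: chi_square(column, target) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: six defect records, two candidate defect reasons.
target = ["leak", "leak", "leak", "crack", "crack", "crack"]
features = {
    "seal_worn": [1, 1, 1, 0, 0, 0],    # perfectly associated with the target
    "night_shift": [1, 0, 1, 0, 1, 0],  # weakly associated
}
selected, scores = select_top_k(features, target, k=1)
print(selected, scores)
```

Here `seal_worn` scores 6.0 and `night_shift` about 0.67, so a selection number of k = 1 keeps only `seal_worn`.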
Further, step S2 may also include step S23: performing defect label numeralization on all the targets and one-hot encoding on the selected features.
Defect label numeralization expresses each standardized defect target as a numeric value (0, 1, 2, …, n) so that programs can identify it easily. One-hot encoding represents each standardized defect reason as a binary vector.
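A minimal sketch of step S23 follows. It assumes that labels are numbered in sorted order and that each sample row holds one binary column per defect reason, set to 1 when that reason is present; the target names, reason names, and function names are hypothetical.

```python
def encode_labels(targets):
    """Map each distinct defect target to an integer 0..n-1 (sorted order)."""
    classes = sorted(set(targets))
    index = {name: i for i, name in enumerate(classes)}
    return [index[t] for t in targets], classes


def one_hot(reasons_per_sample, vocabulary):
    """Binary row per sample: 1 where the reason occurs in that sample, else 0."""
    return [[1 if reason in reasons else 0 for reason in vocabulary]
            for reasons in reasons_per_sample]


# Hypothetical standardized targets and reasons for three defect records.
targets = ["pump leak", "plate fracture", "pump leak"]
vocabulary = ["seal worn", "overload", "weld flaw"]
reasons = [{"seal worn"}, {"overload", "weld flaw"}, {"seal worn", "overload"}]

labels, classes = encode_labels(targets)
rows = one_hot(reasons, vocabulary)
print(labels, classes)
print(rows)
```

Because one defect target can correspond to several defect reasons, a row may contain more than one 1.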
Step S3: taking all selected features as input, and training by using a random forest algorithm to obtain a random forest model comprising a plurality of decision trees;
In machine learning, a random forest model is a classifier that contains multiple decision trees; the class it outputs is the mode of the classes output by the individual decision trees.
In step S3, all the features selected are the one-hot encoded features.
The step S3 includes:
Step S31: drawing random samples with replacement by the bootstrap sampling method to form a plurality of data subsets;
Each data subset comprises a plurality of samples drawn in the same sampling pass.
Step S32: training a decision tree on each data subset, thereby forming a random forest with a plurality of decision trees;
During training, features are drawn at random to split each node until the node can no longer be split, and a plurality of decision trees is built in this way. Drawing features at random to split a node means finding the best feature among the randomly drawn features and applying it to the node to split it.
Step S33: determining the standardized defect target to which each sample belongs using the decision trees and a majority voting strategy, and determining the accuracy of the random forest model.
The samples here are the same as the samples randomly drawn in step S31; each sample consists of the reasons and the target of a single defect.
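The bootstrap sampling of step S31 and the majority voting of step S33 can be sketched as below. Tree training (step S32) is omitted; the three `trees` are hypothetical hand-written stand-ins for trained decision trees, and the sample data are invented for illustration.

```python
import random
from collections import Counter


def bootstrap_subsets(samples, n_subsets, rng):
    """Draw n_subsets data subsets, each sampled with replacement from samples."""
    size = len(samples)
    return [[rng.choice(samples) for _ in range(size)] for _ in range(n_subsets)]


def majority_vote(predictions):
    """Return the most frequent prediction (the mode)."""
    return Counter(predictions).most_common(1)[0][0]


def forest_accuracy(trees, samples):
    """Fraction of samples whose majority-vote class equals the true target."""
    correct = 0
    for features, target in samples:
        votes = [tree(features) for tree in trees]
        if majority_vote(votes) == target:
            correct += 1
    return correct / len(samples)


# Hypothetical samples: (one-hot defect reasons, defect target).
samples = [((1, 0), "leak"), ((0, 1), "crack"), ((1, 1), "leak")]
subsets = bootstrap_subsets(samples, n_subsets=3, rng=random.Random(0))

# Stand-ins for trained trees: callables mapping features to a class.
trees = [
    lambda x: "leak" if x[0] else "crack",
    lambda x: "leak" if x[0] else "crack",
    lambda x: "crack",
]
accuracy = forest_accuracy(trees, samples)
print(accuracy)
```

In a full implementation each subset in `subsets` would train one tree, and the resulting accuracy would be the fitness value used by the optimization in step S4.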
Step S4: and (3) searching for the optimal chi-square test characteristic selection number k and the hyper-parameters of the random forest by using an intelligent optimization algorithm, wherein the optimization goal is to ensure that the accuracy of the random forest model obtained in the steps S2 and S3 is high, so that a chi-square test-random forest scheduling model is obtained.
Both the hyper-parameters of the random forest and the chi-square test feature selection number k should be tuned to their best values. To this end, the invention uses an intelligent optimization algorithm to optimize the model and the features selected by the chi-square test. Specifically, the intelligent optimization algorithm is a genetic algorithm, a type of heuristic algorithm, which tunes the hyper-parameters of the random forest and the chi-square test feature selection number k and automatically searches for locally optimal points. This optimizes the random forest model and the features selected by the chi-square test, so that the decision trees in the random forest model are more accurate and the visualized propagation path of the defect cause is more accurate.
The hyper-parameters of the random forest may include the number n of decision trees, the maximum depth d of each decision tree, the number of features randomly extracted when finding the optimal split point, and other parameters.
In this embodiment, the optimized parameters at least include three parameters, namely a chi-square test feature selection number k, a number n of decision trees, and a maximum depth d of each decision tree, that is, the hyper-parameters of the optimized random forest include the number n of decision trees and the maximum depth d of each decision tree.
Further, step S4 includes:
Step S41: taking each group of random forest hyper-parameters and chi-square test feature selection number k as a chromosome, setting a coding rule for converting the chromosome into the genetic space, and then creating initial population data by random initialization as the population of the current round;
Specifically, by setting a coding rule that converts the hyper-parameters of the random forest and the chi-square test feature selection number k into a binary genotype, these parameters are expressed as chromosomes, or individuals, in the genetic space.
The hyper-parameters of the random forest and the chi-square test feature selection number k are converted into a binary genotype as follows: since they are decimal values, they are encoded as binary strings of 0s and 1s. For example, suppose the binary genotype is 000100100100 and the base coefficient of the chromosome is set to 0.1. Under the coding rule, the left four bits (0001) represent k with a value of 1 × 0.1, the middle four bits (0010) represent the second parameter with a value of 2 × 0.1, and the right four bits (0100) represent the third parameter with a value of 4 × 0.1.
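The decoding of the example genotype can be sketched as follows, assuming three parameters of four bits each and a base coefficient of 0.1, as in the example above. The function name and its keyword arguments are illustrative only.

```python
def decode_chromosome(bits, n_params=3, bits_per_param=4, base=0.1):
    """Split a binary genotype into equal-width genes and scale each by base."""
    if len(bits) != n_params * bits_per_param:
        raise ValueError("chromosome length does not match the coding rule")
    values = []
    for i in range(n_params):
        gene = bits[i * bits_per_param:(i + 1) * bits_per_param]
        values.append(int(gene, 2) * base)  # binary gene -> integer -> scaled value
    return values


# The genotype from the example: 0001 | 0010 | 0100
k_value, second, third = decode_chromosome("000100100100")
print(k_value, second, third)
```

How the scaled values map onto an integer k, a tree count n, and a depth d is not specified in this passage, so the sketch stops at the scaled values.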
The initial population data is the initial population of the initial search point. Wherein, the initial population data is random and does not influence the final convergence result.
The random initialization is to randomly set a binary value for the coding result of the binary gene type of the hyper-parameters and chi-square test characteristics of the random forest.
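The coding rule and random initialization of step S41 can be sketched as follows. This is a minimal illustration, assuming four bits per parameter and a basic coefficient of 0.1 as in the 000100100100 example; the names `decode`, `init_population`, `BITS_PER_GENE` and `BASE_COEFF` are hypothetical and do not come from the patent text.

```python
import random
from typing import List, Optional

BITS_PER_GENE = 4   # four bits per parameter, as in the 000100100100 example
BASE_COEFF = 0.1    # basic coefficient of the chromosome from the same example


def decode(chromosome: str) -> List[float]:
    """Split the bit string into 4-bit genes and scale each integer by BASE_COEFF."""
    if len(chromosome) % BITS_PER_GENE != 0:
        raise ValueError("chromosome length must be a multiple of BITS_PER_GENE")
    genes = [chromosome[i:i + BITS_PER_GENE]
             for i in range(0, len(chromosome), BITS_PER_GENE)]
    return [int(gene, 2) * BASE_COEFF for gene in genes]


def init_population(size: int, n_genes: int = 3,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Random initialization: every bit of every chromosome is drawn as 0 or 1."""
    rng = rng or random.Random()
    n_bits = n_genes * BITS_PER_GENE
    return ["".join(rng.choice("01") for _ in range(n_bits)) for _ in range(size)]


# The example genotype decodes to 1 x 0.1, 2 x 0.1 and 4 x 0.1.
k, second, third = decode("000100100100")
population = init_population(size=5, rng=random.Random(0))
```

How the decoded values are mapped onto integer settings such as k, n and d is not stated in the text, so the sketch stops at the scaled values.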
Step S42: calculating a fitness function value of each individual in the population of the current round by repeating steps S2 and S3;
The fitness function takes the accuracy of the random forest model as the objective function. The fitness function value of each chromosome (comprising the hyper-parameters and the k value) in each round's population is calculated. The genetic algorithm evaluates the quality of each individual by its fitness and thereby determines its chance of being inherited.
Step S43: carrying out a selection operation according to the calculated fitness function values to obtain selected individuals; then performing a crossover operation and a mutation operation on all the selected individuals to obtain mutated individuals;
The selection operation applies a selection operator to the population of the current round. The purpose of selection is to pass optimized individuals directly to the next generation, or to generate new individuals by pairwise crossover and then pass them to the next generation. The selection operation is based on the fitness evaluation of the individuals in the population of the current round.
The selection operation uses the roulette selection method: the probability of each individual appearing in the offspring is calculated from its fitness function value, and individuals are randomly selected according to these probabilities to form the offspring population.
The crossover operation means that a crossover probability is set, and each individual randomly selects another individual with which to cross over according to that probability.
The mutation operation means that each individual is mutated according to the mutation probability. Mutation is random. The mutation probability is generally set to 0.1, i.e., a proportion of 0.1 of the bits in the chromosome are mutated, where mutating a bit changes 0 to 1 or 1 to 0.
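The crossover and mutation operations can be illustrated with the short sketch below. It assumes single-point crossover and an independent flip of each bit with the mutation probability; the text fixes only the probabilities, not the operators, so these choices and the names `crossover`, `mutate` and `vary` are illustrative assumptions.

```python
import random
from typing import List, Tuple


def crossover(a: str, b: str, rng: random.Random) -> Tuple[str, str]:
    """Single-point crossover (assumed operator): swap the tails after a random cut."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(chromosome: str, rate: float, rng: random.Random) -> str:
    """Flip each bit (0 -> 1, 1 -> 0) independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )


def vary(selected: List[str], cross_prob: float, mut_rate: float,
         rng: random.Random) -> List[str]:
    """Apply crossover with a randomly chosen partner, then mutation, to each individual."""
    offspring = []
    for i, individual in enumerate(selected):
        child = individual
        if len(selected) > 1 and rng.random() < cross_prob:
            # Pick a partner other than the individual itself.
            j = rng.choice([idx for idx in range(len(selected)) if idx != i])
            child, _ = crossover(individual, selected[j], rng)
        offspring.append(mutate(child, mut_rate, rng))
    return offspring
```

A call such as `vary(selected, cross_prob=0.8, mut_rate=0.1, rng=random.Random(0))` returns one mutated individual per selected individual; the crossover probability of 0.8 is a placeholder, since the text does not give a value.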
Step S44: calculating fitness function values of all the mutated individuals obtained in the step S43;
Step S45: combining all the selected individuals and all the mutated individuals obtained in step S43 into a combined population, and probabilistically selecting individuals from the combined population according to their fitness function values to obtain the next round of population as the new current round of population.
In step S45, individuals are selected probabilistically according to their fitness function values using a roulette selection strategy, with the following specific steps:
Step S451: summing the fitness function values of all individuals in the combined population to obtain a total fitness function value;
Step S452: dividing the fitness function value of each individual by the total fitness function value to obtain the probability of that individual being selected;
Step S453: calculating the cumulative probability of each individual from the selection probabilities to construct the wheel;
Step S454: generating a random number in the interval [0,1], and if the random number is less than or equal to the cumulative probability of an individual and greater than the cumulative probability of the previous individual, selecting that individual to enter the offspring population;
Step S455: repeating step S454 until the population size is reached.
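Steps S451 to S455 can be sketched as follows. The function name `roulette_select` and its signature are assumptions made for illustration; the comments map each line to the step it implements. Fitness values are assumed to be non-negative with a positive sum.

```python
import random
from bisect import bisect_left
from itertools import accumulate
from typing import List, Sequence


def roulette_select(population: Sequence[str], fitness: Sequence[float],
                    size: int, rng: random.Random) -> List[str]:
    total = sum(fitness)                        # S451: total fitness
    if total <= 0:
        raise ValueError("total fitness must be positive")
    probs = [f / total for f in fitness]        # S452: selection probabilities
    wheel = list(accumulate(probs))             # S453: cumulative probabilities
    wheel[-1] = 1.0                             # guard against rounding error
    chosen = []
    while len(chosen) < size:                   # S455: repeat until full
        r = rng.random()                        # S454: random number in [0, 1)
        # First index whose cumulative probability is >= r, i.e. the individual
        # with wheel[idx - 1] < r <= wheel[idx].
        idx = bisect_left(wheel, r)
        chosen.append(population[idx])
    return chosen
```

Using a binary search over the cumulative probabilities keeps each draw at O(log N); a linear scan over the wheel would implement the same rule.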
Step S46: determining whether the current round reaches the maximum iteration number, if so, stopping iteration, executing the step S47, otherwise, returning to the step S42;
Step S47: selecting the individual with the maximum fitness function value in the population of the current round as the optimal chi-square test feature selection number k and the optimal hyper-parameters of the random forest, and obtaining the corresponding chi-square test-random forest scheduling model by repeating steps S2 and S3.
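The overall iteration of steps S41 to S47 can be outlined as a compact loop. This is only a sketch under stated assumptions: `toy_fitness` is a hypothetical stand-in for the random forest accuracy obtained by repeating steps S2 and S3, the crossover probability of 0.8 is a placeholder, and the helper operators are simplified versions of roulette selection, single-point crossover and bit-flip mutation.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def roulette(pop, fit, size, rng):
    """Roulette selection; assumes strictly positive total fitness."""
    total = sum(fit)
    wheel = list(accumulate(f / total for f in fit))
    wheel[-1] = 1.0
    return [pop[bisect_left(wheel, rng.random())] for _ in range(size)]


def vary(selected, cross_prob, mut_rate, rng):
    """Single-point crossover with a random mate, then per-bit mutation."""
    out = []
    for ind in selected:
        child = ind
        if rng.random() < cross_prob:
            mate = rng.choice(selected)
            cut = rng.randint(1, len(ind) - 1)
            child = ind[:cut] + mate[cut:]
        child = "".join(("1" if b == "0" else "0") if rng.random() < mut_rate else b
                        for b in child)
        out.append(child)
    return out


def run_ga(fitness_fn, n_bits=12, pop_size=20, generations=30,
           cross_prob=0.8, mut_rate=0.1, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(n_bits))
                  for _ in range(pop_size)]                        # S41
    for _ in range(generations):                                   # S46
        fit = [fitness_fn(c) for c in population]                  # S42
        selected = roulette(population, fit, pop_size, rng)        # S43: selection
        mutated = vary(selected, cross_prob, mut_rate, rng)        # S43: crossover, mutation
        combined = selected + mutated                              # S45: combined population
        combined_fit = [fitness_fn(c) for c in combined]           # S44 (all re-scored for brevity)
        population = roulette(combined, combined_fit, pop_size, rng)
    return max(population, key=fitness_fn)                         # S47


def toy_fitness(chromosome):
    """Hypothetical positive score; the patent uses random forest accuracy instead."""
    return (chromosome.count("1") + 1) / (len(chromosome) + 1)


best = run_ga(toy_fitness)
```

In the method described above, `fitness_fn` would decode the chromosome into k, n and d, rerun the chi-square selection and random forest training, and return the resulting accuracy.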
Further, the method includes step S5: obtaining the visual propagation path diagram (namely the defect-cause propagation path) corresponding to the decision tree with the highest accuracy in the chi-square test-random forest scheduling model.
In this embodiment, there is one visual propagation path diagram. The decision tree with the highest accuracy is selected from the chi-square test-random forest scheduling model and visualized to obtain the corresponding visual propagation path diagram, which shows the association between the normalized defect target and the normalized defect causes in that decision tree. Cause analysis of each defect is thereby achieved.
In addition, to analyze the importance of the defect causes, the invention also provides a method for ranking the defect causes by GBDT importance and visualizing the ranking as a histogram. That is, the method further comprises step S5': evaluating the importance of the standardized defect reasons in the chi-square test-random forest scheduling model by using the GBDT (gradient boosting decision tree) algorithm, and ranking the standardized defect reasons. The importance of each defect cause is thereby shown.
It should be noted that the GBDT analyzes the full data set rather than data that has undergone feature selection, so the importance of all defect causes can be revealed. Therefore, when the GBDT algorithm is used in step S5', the data output by step S1, i.e., the standardized data after natural language processing, is used.
The GBDT algorithm is a gradient boosting tree algorithm and can evaluate the importance of each feature in the random forest model.
The step S5' includes:
Step S51': recording, by using the GBDT algorithm, the total number of splits, the total information gain and the average information gain of each feature in the process of training the decision trees that form the chi-square test-random forest scheduling model; these quantities are the evaluation indexes for measuring the importance of the features.
It should be noted that the more times a feature (i.e., a normalized defect cause) is used for splitting, the more important it is. The importance contribution of each feature on each tree is calculated and then simply averaged.
Step S52': calculating the importance $\hat{J}_j^2(T)$ of the j-th normalized defect cause in a single decision tree by formula (2):

$$\hat{J}_j^2(T) = \sum_{t=1}^{L-1} \hat{i}_t^2 \, \mathbf{1}(v_t = j) \qquad (2)$$

where j is the ordinal number of the normalized defect cause, L is the number of leaf nodes of the decision tree (a binary tree with L leaf nodes has L − 1 non-leaf nodes), t is the ordinal number of a non-leaf node, $v_t$ is the normalized defect cause associated with non-leaf node t, $\hat{i}_t^2$ is the reduction in squared loss after non-leaf node t is split, $\mathbf{1}(v_t = j)$ is an indicator function that equals 1 when the feature associated with non-leaf node t is the normalized defect cause j and 0 otherwise, and T denotes the decision tree.
Step S53': calculating the global importance $\hat{J}_j^2$ of the j-th standardized defect reason over all decision trees by formula (3):

$$\hat{J}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_j^2(T_m) \qquad (3)$$

where $\hat{J}_j^2(T_m)$ is the importance of the j-th normalized defect cause on the m-th decision tree, M is the total number of decision trees, and m is the ordinal number of a decision tree.
Step S54': ranking the standardized defect reasons by importance and forming an importance histogram of the standardized defect reasons. A histogram visualization of the importance of the defect causes is thereby achieved.
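The calculations of steps S52' to S54' can be sketched in a few lines. The sketch reads formula (2) as the sum of squared-loss reductions over the non-leaf nodes that split on cause j, and formula (3) as the average of that quantity over the M trees, which is how the symbol definitions read. Each tree is represented here as a list of `(cause, loss_reduction)` pairs for its non-leaf nodes; this representation, the function names and the example numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Split = Tuple[str, float]   # (cause used at a non-leaf node, squared-loss reduction)


def tree_importance(splits: List[Split]) -> Dict[str, float]:
    """Formula (2): per-cause sum of loss reductions within a single tree."""
    importance: Dict[str, float] = defaultdict(float)
    for cause, reduction in splits:
        importance[cause] += reduction
    return dict(importance)


def global_importance(trees: List[List[Split]]) -> Dict[str, float]:
    """Formula (3): average the per-tree importance over all M trees."""
    m = len(trees)
    total: Dict[str, float] = defaultdict(float)
    for splits in trees:
        for cause, value in tree_importance(splits).items():
            total[cause] += value
    return {cause: value / m for cause, value in total.items()}


def ranked(trees: List[List[Split]]) -> List[Tuple[str, float]]:
    """Step S54': causes sorted from most to least important."""
    return sorted(global_importance(trees).items(), key=lambda kv: kv[1], reverse=True)


# Made-up numbers for two small trees, for illustration only.
example_trees = [[("A", 4.0), ("B", 1.0), ("A", 2.0)], [("B", 3.0)]]
ranking = ranked(example_trees)
```

The ranked pairs can be passed directly to any plotting routine to draw the importance histogram.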
Results of the experiment
In the experiments of the present invention, the chi-square value of each defect cause with respect to the defect is calculated according to step S2 and the defect causes are ranked, so that the defect causes highly correlated with the defect, such as management defect, human error, failure, temporal, failure, deficiency, and the like, are obtained as features.
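The chi-square ranking of step S2 can be illustrated with the sketch below. It assumes that each defect cause is coded as a presence/absence column and that the statistic is the Pearson chi-square of the cause-by-target contingency table; the text does not state which chi-square variant is used, and the names `chi_square` and `select_top_k` as well as the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def chi_square(feature: Sequence, target: Sequence) -> float:
    """Pearson chi-square statistic of the feature-by-target contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            expected = f_count * t_count / n
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat


def select_top_k(features: Dict[str, Sequence], target: Sequence,
                 k: int) -> Tuple[List[str], Dict[str, float]]:
    """Score every cause against the target and keep the k highest-scoring causes."""
    scores = {name: chi_square(column, target) for name, column in features.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return top, scores


# Toy data: four defect events, two candidate causes (1 = cause present).
toy_target = ["defect A", "defect A", "defect B", "defect B"]
toy_features = {"cause X": [1, 1, 0, 0], "cause Y": [1, 0, 1, 0]}
top_causes, chi_scores = select_top_k(toy_features, toy_target, k=1)
```

In the toy data, cause X separates the two defects perfectly while cause Y is independent of them, so only cause X survives the selection.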
Table 1 shows specific experimental results for the normalized defect targets and normalized defect causes extracted by the natural language processing technique. The standardized defect target occupies columns 1 and 2 of Table 1, NAN represents a missing value, and the standardized defect reasons are the fields in the remaining columns. As Table 1 shows, each row pairs one normalized defect target with multiple normalized defect causes.
TABLE 1 Normalized defect targets and normalized defect causes after extraction
Equipment object | Phenomenon | Reason classification
Orifice plate | Fracture of | Design errors | Mismatch | Human error | No discovery was made | Is not clear
Stone (stone) | NAN | Design errors | Destruction of | Cancellation | Transformation of | Dismantling danger | Conflict | Influence of | Consider that
Switch with a switch body | NAN | Design errors | Fail to work | Deficiency of | Lack of
Fan blower | Reverse rotation | Design errors | Deficiency of | Modifying | Violation of | Fault of | Is not strict | Failure of | Reverse connection
Cable with a protective layer | Injury of the skin | Managing defects | Problem of installation | Human error | Not found out | Is not executing
Support frame | Rusty | Others | Safety of design | Change | Influence of
Hoisting device | NAN | Human error | Mismatch | Consider that | Problem of installation | Lack of
Pit | NAN | Managing defects | Hysteresis
Power supply | Power off | Managing defects | Power failure | Instability of the film | Fail to work | Fault of | Temporary
Flange | NAN | NAN | NAN
Generator | NAN | Managing defects | Sundries | Secure | Adjustment of | Temporary | Progress of a game | Influence of
Cable with a protective layer | Power off | Managing defects | Human error | Is provided with
Hoisting device | Falling off | Managing defects | Deficiency of | Secure | Neglect of
Valve with a valve body | NAN | Others | Variations in
Penetration piece | NAN | Procedure defect | Construction process | Out of control
Generator | NAN | Managing defects | Adjustment of | Progress of a game | Hysteresis | Temporary | Lack of
The finally obtained chi-square test-random forest scheduling model is the random forest model corresponding to the optimal parameters found by the genetic algorithm, and the corresponding visual propagation path diagram is the decision tree with the highest accuracy in that model. After optimization by the intelligent optimization algorithm, the defect identification accuracy improves from 96% to 99.2%. The identification accuracy is the classification accuracy of the model, computed as the number of correctly classified samples in the test set divided by the total number of samples in the test set.
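The identification accuracy described above can be written as a short function. The name `accuracy` and the example labels are hypothetical; the function simply divides the number of correctly classified test samples by the total number of test samples.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Share of test samples whose predicted defect target matches the true one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equally long")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Three of the four made-up predictions match, so the accuracy is 3 / 4.
example_accuracy = accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```

This value is also what the genetic algorithm would use as the fitness of a chromosome.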
Fig. 2A and Fig. 2B are convergence graphs of the accuracy over 30 generations of optimization by the genetic algorithm. The abscissa of Fig. 2A and Fig. 2B is the number of iterations, the ordinate of Fig. 2A is the accuracy of the random forest model (i.e., the fitness function value), and the ordinate of Fig. 2B is the accuracy of the optimal random forest model. Fig. 2A is a scatter plot in which the points show the distribution of the fitness function values of the random forest models for the chromosomes in each iteration of the genetic algorithm. Fig. 2B is the corresponding line graph.
The random forest visualization tree diagrams are output, and the decision tree with the highest accuracy is selected as the visual propagation path diagram. Fig. 3 shows the visual propagation path diagram of the random forest model, i.e., the visualized result of the optimal decision tree in the random forest model. Fig. 3 shows the path from the root node to each leaf node of the decision tree, where all non-leaf nodes are defect causes and the leaf nodes are defects, so each path makes clear which causes are associated with which defect.
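Reading root-to-leaf paths off a decision tree can be sketched as follows. The tree is represented as a nested dictionary whose non-leaf nodes test whether a cause is present and whose leaves are defect labels; this representation, the function `propagation_paths` and the toy tree are hypothetical and are not the tree of Fig. 3.

```python
from typing import List, Tuple, Union

Node = Union[str, dict]          # a leaf is a defect label, a non-leaf is a dict
Path = List[Tuple[str, bool]]    # (cause, whether the cause is present)


def propagation_paths(node: Node, prefix: Tuple = ()) -> List[Tuple[Path, str]]:
    """Enumerate every root-to-leaf path as (list of cause tests, defect label)."""
    if isinstance(node, str):                       # leaf node: a defect
        return [(list(prefix), node)]
    paths = []
    paths += propagation_paths(node["present"], prefix + ((node["cause"], True),))
    paths += propagation_paths(node["absent"], prefix + ((node["cause"], False),))
    return paths


# Toy tree with made-up structure; the defect labels are placeholders.
toy_tree = {
    "cause": "Managing defects",
    "present": {"cause": "Human error", "present": "defect A", "absent": "defect B"},
    "absent": "defect C",
}
paths = propagation_paths(toy_tree)
```

Each returned pair lists the causes along one path together with the defect at its leaf, which is the information the visual propagation path diagram conveys graphically.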
A GBDT model is constructed from the standardized defect data, and the importance of all defect causes is analyzed, ranked, and visualized as an importance histogram. Fig. 4 shows the importance histogram of the standardized defect reasons; the standardized defect reason corresponding to each letter code can be looked up to learn its importance.
The figures show that the method is more accurate and efficient than defect cause analysis based on NCR reports. The embodiments of the invention thus disclose a method for analyzing the causes of quality defect events over the whole life cycle of a nuclear power plant. The process by which a defect forms can therefore be judged in time and the importance of each defect cause is shown visually, so that targeted measures can be taken to improve the efficiency of experience feedback work at the nuclear power plant.
The above embodiments are merely preferred embodiments of the present invention and are not intended to limit its scope; various changes may be made to them. All simple and equivalent changes and modifications made according to the claims and the content of the specification of the present application fall within the scope of the claims of the present patent application. Matters not described in detail are conventional and are omitted to avoid obscuring the invention.

Claims (10)

1. A nuclear power quality defect cause analysis method is characterized by comprising the following steps:
step S1: extracting text keywords in nuclear power defect event data by using a natural language processing technology to form a standardized defect target and a standardized defect reason;
step S2: taking each standardized defect target as a target and each standardized defect reason as a feature, calculating the chi-square value between each feature and the target by using a chi-square test feature selection technique to measure their degree of correlation, and selecting the top k features with the highest degree of correlation from all the features, wherein k is the chi-square test feature selection number;
step S3: taking all the selected features as input, and training by using a random forest algorithm to obtain a random forest model comprising a plurality of decision trees;
step S4: searching for the optimal chi-square test feature selection number k and the optimal hyper-parameters of the random forest by using an intelligent optimization algorithm, wherein the optimization goal is a high accuracy of the random forest model obtained in steps S2 and S3, thereby obtaining a chi-square test-random forest scheduling model.
2. The nuclear power quality defect cause analysis method according to claim 1, wherein the step S1 includes:
step S11: acquiring the corresponding relation between defect event description and defect reason description from nuclear power defect event data;
step S12: performing natural language processing operation on the defect event description and the defect reason description to obtain a standardized defect target and a standardized defect reason which correspond to each other;
the step S12 includes:
step S121: acquiring a nuclear power knowledge base of a provider of a historical defect text base, and constructing a nuclear power text dictionary by using the nuclear power knowledge base;
step S122: inputting all character feature vectors in the nuclear power text dictionary into an algorithm framework of a natural language processing model, and training the natural language processing model;
step S123: and inputting the defect event description and the defect reason description into the trained natural language processing model to obtain a corresponding standardized defect target and a standardized defect reason.
3. The nuclear power quality defect cause analysis method of claim 2, wherein the algorithm framework of the natural language processing model employs a Bert-LSTM-CRF model framework.
4. The nuclear power quality defect cause analysis method according to claim 1, wherein the step S2 includes:
step S21: data cleaning is carried out on the standardized defect target and the standardized defect reason, and the standardized defect target and the standardized defect reason subjected to data cleaning are respectively used as a target and a characteristic;
step S22: performing chi-square test feature engineering on the target and the features to obtain a chi-square value for each feature, and selecting the k features with the largest chi-square values.
5. The nuclear power quality defect cause analysis method according to claim 1, wherein the step S3 includes:
step S31: randomly sampling samples which are put back by a self-service sampling method, and extracting to form a plurality of data subsets;
step S32: training each data subset to obtain a decision tree, so as to form a random forest with a plurality of decision trees;
step S33: and judging the standardized defect target to which the sample belongs by utilizing the decision trees and the majority voting strategy, and determining the accuracy of the random forest model.
6. The nuclear power quality defect cause analysis method of claim 1, wherein the intelligent optimization algorithm employs a genetic algorithm in a heuristic algorithm, and the hyper-parameters of the random forest comprise the number n of decision trees and the maximum depth d of each decision tree.
7. The nuclear power quality defect cause analysis method according to claim 6, wherein the step S4 includes:
step S41: taking the hyper-parameters and chi-square test feature selection number k of each group of random forests as a chromosome, setting a coding rule for converting the chromosome into a genetic space, and then creating initial population data as the population of the current round through random initialization;
step S42: calculating a fitness function value of each individual in the population of the current round by repeating the steps S2 and S3, wherein the fitness function takes the accuracy of the random forest model as a target function;
step S43: carrying out a selection operation according to the calculated fitness function values to obtain selected individuals; then performing a crossover operation and a mutation operation on all the selected individuals to obtain mutated individuals;
step S44: calculating fitness function values of all the mutated individuals;
step S45: combining all selected individuals and all mutated individuals into a combined population, and probabilistically selecting the individuals according to the fitness function values to obtain the next round of population as a new current round of population;
step S46: determining whether the current round reaches the maximum iteration number, if so, executing the step S47, otherwise, returning to the step S42;
step S47: selecting the individual with the maximum fitness function value in the population of the current round as the optimal chi-square test feature selection number k and the optimal hyper-parameters of the random forest, and obtaining the corresponding chi-square test-random forest scheduling model by repeating steps S2 and S3.
8. The nuclear power quality defect cause analysis method according to claim 1, further comprising step S5: obtaining the visual propagation path diagram corresponding to the decision tree with the highest accuracy in the chi-square test-random forest scheduling model.
9. The nuclear power quality defect cause analysis method according to claim 1, further comprising step S5': evaluating the importance of a plurality of standardized defect reasons in the chi-square test-random forest scheduling model by using a GBDT algorithm, and ranking the standardized defect reasons.
10. The nuclear power quality defect cause analysis method according to claim 9, wherein the step S5' includes:
step S51': recording the total splitting times, the total information gain and the average information gain of the features by utilizing a GBDT algorithm in the process of training a decision tree forming a chi-square test-random forest scheduling model;
step S52': calculating, by the formula

$$\hat{J}_j^2(T) = \sum_{t=1}^{L-1} \hat{i}_t^2 \, \mathbf{1}(v_t = j),$$

the importance $\hat{J}_j^2(T)$ of the j-th normalized defect cause in a single decision tree, wherein j is the ordinal number of the normalized defect cause, L is the number of leaf nodes of the decision tree, t is the ordinal number of a non-leaf node, $v_t$ is the normalized defect cause associated with non-leaf node t, $\hat{i}_t^2$ is the reduction in squared loss after non-leaf node t is split, $\mathbf{1}(v_t = j)$ is an indicator function that equals 1 when the feature associated with non-leaf node t is the normalized defect cause j, and T represents the decision tree;
step S53': calculating, by the formula

$$\hat{J}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_j^2(T_m),$$

the global importance $\hat{J}_j^2$ of the j-th standardized defect reason over all decision trees, wherein $\hat{J}_j^2(T_m)$ is the importance of the j-th normalized defect cause on the m-th decision tree, M is the total number of decision trees, and m is the ordinal number of a decision tree;
step S54': ranking the standardized defect reasons by importance and forming an importance histogram of the standardized defect reasons.
CN202210665278.5A 2022-06-13 2022-06-13 Nuclear power quality defect cause analysis method Pending CN114969267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210665278.5A CN114969267A (en) 2022-06-13 2022-06-13 Nuclear power quality defect cause analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210665278.5A CN114969267A (en) 2022-06-13 2022-06-13 Nuclear power quality defect cause analysis method

Publications (1)

Publication Number Publication Date
CN114969267A true CN114969267A (en) 2022-08-30

Family

ID=82961265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210665278.5A Pending CN114969267A (en) 2022-06-13 2022-06-13 Nuclear power quality defect cause analysis method

Country Status (1)

Country Link
CN (1) CN114969267A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117331047A (en) * 2023-12-01 2024-01-02 德心智能科技(常州)有限公司 Human behavior data analysis method and system based on millimeter wave radar

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination