CN113268936A - Key quality characteristic identification method based on multi-target evolution random forest characteristic selection - Google Patents
- Publication number: CN113268936A
- Application number: CN202110752786.2A
- Authority: CN (China)
- Prior art keywords: quality, algorithm, random forest, characteristic, data set
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24323—Tree-organised classifiers
- G06N20/00—Machine learning
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
- G06F2111/06—Multi-objective optimisation, e.g. Pareto optimisation using simulated annealing [SA], ant colony algorithms or genetic algorithms [GA]
- G06F2111/08—Probabilistic or stochastic CAD
Abstract
The invention discloses a key quality characteristic identification method based on multi-objective evolutionary random forest feature selection, comprising the following steps. First, multivariate quality characteristic data are acquired through digital inspection on the shop floor to form a product quality characteristic data set. Next, the ReliefF algorithm pre-selects the quality characteristics used for classification, and the pre-selected data set is divided into two parts: a product quality characteristic training data set and a product quality characteristic testing data set. The training data set is then fed into the multi-objective evolutionary random forest feature selection algorithm to obtain a set of key quality characteristics. Finally, the identified key quality characteristic set is verified with the testing data set. The method accounts for the complex influence of multivariate quality characteristics on final product quality, accurately identifies the key quality characteristics in the product, provides a reference for key quality characteristic identification, supports quality control, and improves product quality prediction.
Description
Technical Field
The invention provides a key quality characteristic identification method based on multi-objective evolutionary random forest feature selection, belonging to the field of quality management.
Background
In modern industrial production, a large amount of process data and quality characteristic data are generated during the production of a product, including production environment data, product characteristic data, assembly characteristic data, customer demand characteristic data, and so on. Some of these quality characteristics have a very important influence on product quality while others have little, so identifying the key quality characteristics closely related to product quality is of great significance for continuous product improvement, product quality prediction, and product quality control.
Traditional key quality characteristic identification methods include critical characteristic flow-down and quality function deployment (QFD). The flow-down method decomposes the whole product layer by layer, expanding it in terms of product characteristics, part characteristics, process characteristics, and so on, and then applies qualitative and quantitative analysis to identify the key quality characteristics. QFD is driven by customer demand and focuses on the quality characteristics customers care most about, but relying purely on customer demand easily overlooks potential key quality characteristics. A complex product contains a large number of component-level quality characteristics whose influence relationships are intricate, and their mutual influence and degree of criticality are difficult to determine with traditional qualitative and quantitative methods.
The invention combines a multi-objective optimisation algorithm with a machine learning model to identify key quality characteristics from the quality characteristic data collected in industrial production. The method supports continuous product improvement in the production process and strengthens capabilities such as product quality control and product quality prediction.
Disclosure of Invention
(1) Objects of the invention
The invention aims to provide a key quality characteristic identification method based on multi-objective evolutionary random forest feature selection, so as to solve the problem that key quality characteristics are difficult to identify and determine with conventional qualitative and quantitative methods.
(2) Technical scheme
To solve the above problems, the invention provides a key quality characteristic identification method based on multi-objective evolutionary random forest feature selection. As shown in fig. 1, the method comprises the following steps:
Step 1: Acquire multivariate quality characteristic data in the production process through digital inspection on the shop floor, including process parameters, product size parameters, product grade classifications, and other quality characteristics that are important factors influencing the overall quality level of the product, thereby forming a product quality characteristic data set;
Step 2: Pre-select the quality characteristics used for classification with the ReliefF algorithm, and divide the pre-selected data set into two parts: a product quality characteristic training data set and a product quality characteristic testing data set;
Step 3: Divide the training data set into an internal training set and an internal test set; the internal training set trains the random forest classifier, and the internal test set evaluates part of the objective function values of the selected quality characteristic sets s generated by the algorithm. Then input the data into the multi-objective evolutionary random forest feature selection algorithm, establish the corresponding algorithm targets, generate an initial population, and set the number of iteration generations to obtain the dominant key quality characteristic set;
Step 4: Verify and evaluate the obtained key quality characteristic set using the test data set and the random forest classifier trained on the key quality characteristic set.
The term "product quality characteristic data set" in step 1 refers to a data set that, for the same object (product), has a certain number of quality characteristics (characteristic attributes), a certain number of samples (sampled products), and a definite class for each sample, as shown in fig. 2.
In step 2, the ReliefF algorithm pre-selects the quality characteristics used for classification; the algorithm flow is shown in fig. 3, and the specific method is as follows:
2-1: Extract an individual E from a sample of a certain class, then search the same-class and different-class samples to find the k nearest-neighbour samples of each, forming the same-class neighbour sample set F and the different-class neighbour sample set G;
2-2: Use the differences between E and the average per-feature differences of the samples in F and G to define the feature weight W. For any feature m, the feature weight W_m after n rounds of sampling is computed as:
W_m = W_m − Σ_{j=1..k} |E[m] − F_j[m]| / (n·k) + Σ_{c ≠ class(E)} [ p(c) / (1 − P(class(E))) ] · Σ_{j=1..k} |E[m] − G(c)_j[m]| / (n·k)
where:
c denotes a class of the different-class samples;
E[m] denotes the value of feature m of individual E;
F_j[m] denotes the value of feature m of the j-th nearest same-class sample;
p(c) denotes the probability that a different-class sample belongs to class c;
class(E) denotes the class of individual E;
P(class(E)) denotes the probability that a sample belongs to the same class as E;
G(c)_j[m] denotes the value of feature m of the j-th nearest sample of class c.
The larger a feature's weight, the larger the between-class distance and the smaller the within-class distance it induces among the samples, and the stronger its discriminative effect on the classes;
2-3: Eliminate the quality characteristics whose between-class distance is smaller than their within-class distance, and divide the pre-selected data set into two parts: a product quality characteristic training data set and a product quality characteristic testing data set. The training and testing sets use the k-fold cross-validation method, in which the data are divided evenly by sample size into k parts; k−1 parts form the training set and the remaining part forms the testing set.
The flow of the multi-objective evolutionary random forest feature selection algorithm in step 3 is shown in fig. 4. The algorithm consists of two parts: first, the multi-objective evolutionary algorithm, for which NSGA-II is selected and implemented with Matlab software; second, the random forest classifier, implemented with Python. The whole algorithm runs through Matlab-Python interaction.
In the multi-objective evolutionary random forest feature selection algorithm of step 3, the NSGA-II algorithm is selected as the multi-objective evolutionary algorithm, and individuals in the population are encoded in binary. A solution s is encoded as C = (c_1, c_2, c_3, c_4, c_5, …, c_N), a vector of length N, where N is the total number of quality characteristics. Each element c_i ∈ {0, 1} (i = 1, 2, …, N) indicates whether the i-th feature is selected: '1' means selected and '0' means not selected. Each code corresponds to a solution, i.e. a subset of quality characteristics.
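A minimal sketch of this encoding and its decoding in Python; the feature names used here are hypothetical, purely for illustration:

```python
def decode(C, feature_names):
    """Decode a binary chromosome C = (c_1, ..., c_N): feature i belongs
    to the selected quality-characteristic subset iff c_i == 1."""
    return [name for name, bit in zip(feature_names, C) if bit == 1]

# Hypothetical quality characteristics of a machined part
names = ["temperature", "pressure", "diameter", "roughness", "humidity"]
subset = decode([1, 0, 1, 1, 0], names)  # -> ["temperature", "diameter", "roughness"]
```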
In the multi-objective evolutionary random forest feature selection algorithm of step 3, selection in the population uses binary tournament selection: two individuals are drawn from the parent population each time, compared using the crowded-comparison operator, and the better individual is added to the offspring population.
In the multi-objective evolutionary random forest feature selection algorithm of step 3, single-point crossover is used between individuals in the population. Individuals C_1 = (c_11, c_12, c_13, c_14, c_15, …, c_1N) and C_2 = (c_21, c_22, c_23, c_24, c_25, …, c_2N) undergo a crossover operation with probability p_c at a cut point e, producing two new individuals: C_1' = (c_11, c_12, c_13, …, c_1(e−1), c_2e, …, c_2N) and C_2' = (c_21, c_22, c_23, …, c_2(e−1), c_1e, …, c_1N).
In the multi-objective evolutionary random forest feature selection algorithm of step 3, individuals in the population mutate by multi-point bit-flip mutation: for an individual C = (c_1, c_2, c_3, c_4, c_5, …, c_N), each gene undergoes a mutation operation with probability p_m to generate a new individual; a '0' at the original position mutates to '1', and a '1' mutates to '0'.
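The two variation operators can be sketched as follows; this is a hedged illustration, with parameter values such as p_c and p_m left to the caller:

```python
import random

def single_point_crossover(C1, C2, p_c, rng):
    """With probability p_c, pick a cut point e and exchange the tails
    of the two parents, yielding two new individuals."""
    if rng.random() < p_c:
        e = rng.randrange(1, len(C1))
        return C1[:e] + C2[e:], C2[:e] + C1[e:]
    return C1[:], C2[:]

def multi_point_mutation(C, p_m, rng):
    """Each gene flips independently with probability p_m:
    '0' mutates to '1' and '1' mutates to '0'."""
    return [1 - c if rng.random() < p_m else c for c in C]
```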
In the multi-objective evolutionary random forest feature selection algorithm of step 3, the NSGA-II algorithm is selected as the multi-objective evolutionary algorithm, and the optimisation targets are set by the actual production requirements, including but not limited to:
Min F(s) = {f_1(s), f_2(s), f_3(s)}, where f_1 is the classification error rate, f_2 is the inverse of the sum of the ReliefF weights of the features in s, and f_3 is the size of the quality characteristic subset.
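A sketch of the three objective values for one chromosome. The classifier's error rate is abstracted behind `error_rate_fn`, a stand-in (an assumption of this sketch) for evaluating the trained random forest on the selected subset:

```python
def objectives(C, relieff_w, error_rate_fn):
    """Compute Min F(s) = (f1, f2, f3) for a binary chromosome C:
    f1 = classification error rate on the selected subset,
    f2 = inverse of the sum of the subset's ReliefF weights,
    f3 = subset size.
    Assumes at least one feature is selected with positive weight sum."""
    selected = [i for i, bit in enumerate(C) if bit == 1]
    f1 = error_rate_fn(selected)
    f2 = 1.0 / sum(relieff_w[i] for i in selected)
    f3 = len(selected)
    return f1, f2, f3
```

All three components are minimised together: low error, high total ReliefF weight, and few features.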
In the multi-objective evolutionary random forest feature selection algorithm of step 3, the non-dominated sorting of individuals in the population is based on the following: for a minimisation multi-objective problem with n objective components f_i(s) (i = 1, 2, …, n), given any two decision variables X_a and X_b, X_a is said to dominate X_b if both of the following conditions hold:
1. for every i ∈ {1, 2, …, n}, f_i(X_a) ≤ f_i(X_b);
2. there exists i ∈ {1, 2, …, n} such that f_i(X_a) < f_i(X_b).
If no other decision variable dominates a given decision variable, it is called a non-dominated solution. Within a set of solutions, the non-dominated solutions are assigned Pareto rank 1 and deleted from the set; the non-dominated solutions among the remainder are assigned rank 2; and so on until every solution in the set has a Pareto rank. The resulting ranking is shown in fig. 5.
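The dominance test and the rank-peeling procedure described above can be sketched directly (minimisation throughout); this is an illustrative implementation, not the patented code:

```python
def dominates(fa, fb):
    """fa dominates fb (minimisation): no worse in every objective
    and strictly better in at least one."""
    return (all(a <= b for a, b in zip(fa, fb))
            and any(a < b for a, b in zip(fa, fb)))

def pareto_ranks(F):
    """Peel off successive non-dominated fronts: rank-1 solutions are
    dominated by nobody; remove them, rank the rest as 2, and so on."""
    ranks = [0] * len(F)
    remaining = set(range(len(F)))
    r = 1
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(F[j], F[i]) for j in remaining if j != i)}
        for i in front:
            ranks[i] = r
        remaining -= front
        r += 1
    return ranks
```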
In the multi-objective evolutionary random forest feature selection algorithm of step 3, individuals at the same non-dominated level in the population are ranked by crowding degree. The crowding degree represents the density of individuals surrounding a given point in the population and is denoted i_d; intuitively, it corresponds to the largest rectangle around individual i that contains i but no other individual, as shown in fig. 6.
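The crowding measure within a single front can be sketched per the standard NSGA-II definition, where boundary individuals receive infinite distance so that they are always preferred:

```python
def crowding_distance(front):
    """NSGA-II crowding distance within one non-dominated front: for
    each objective, sort the front, give the two boundary individuals
    infinite distance, and add each interior individual's normalised
    gap between its two neighbours."""
    n = len(front)
    if n <= 2:
        return [float("inf")] * n
    m = len(front[0])
    d = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front[i][k])
        d[order[0]] = d[order[-1]] = float("inf")
        span = (front[order[-1]][k] - front[order[0]][k]) or 1.0
        for p in range(1, n - 1):
            d[order[p]] += (front[order[p + 1]][k]
                            - front[order[p - 1]][k]) / span
    return d
```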
The specific method of step 3 is as follows:
3-1: Initialise and randomly generate a population P_t; each individual of P_t is a selected quality characteristic set s;
3-2: Apply selection, crossover, and mutation to population P_t to obtain population P_t';
3-3: Evaluate the fitness of each individual in the combined population R_t = P_t + P_t' against the objective functions set by the algorithm; each individual obtains the objective values {f_1(s), f_2(s), f_3(s)};
3-4: Apply fast non-dominated sorting to rank every individual in R_t;
3-5: Add the individuals with the lowest current non-dominated rank to the selected population P_{t+1} until P_{t+1} cannot accommodate the next rank;
3-6: Rank the individuals of the next non-dominated level by crowding distance;
3-7: Add the individuals with the largest crowding distance to the selected population P_{t+1} until P_{t+1} is full;
3-8: Repeat steps 3-2 to 3-7 until the algorithm termination condition is reached. Output the population individuals at termination and decode them to obtain the identified key quality characteristic set.
Specifically, the "fitness evaluation" in step 3-3 comprises the following steps:
3-3-1: Decode each individual of R_t into its corresponding quality characteristic set;
3-3-2: The number of quality characteristics in the decoded set gives the value of fitness function f_3(s);
3-3-3: The inverse of the sum of the ReliefF weights of the decoded set s gives the value of f_2(s);
3-3-4: Extract the quality characteristic columns of the internal training set corresponding to each decoded set and train the random forest classifier;
3-3-5: Extract the corresponding columns of the internal test set, verify the prediction accuracy of each trained random forest classifier, and obtain the value of fitness function f_1(s).
In step 4, the obtained key quality characteristic set is verified and evaluated using the test data set: the columns of the test set corresponding to the key quality characteristic set are extracted, and the prediction accuracy of the trained random forest classifier is verified, again using the k-fold cross-validation method.
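The k-fold split used in steps 2-3 and 4 can be sketched as follows; this dependency-free illustration mirrors what `sklearn.model_selection.KFold` provides:

```python
def k_fold_splits(n_samples, k):
    """Divide n_samples indices evenly into k parts; each part in turn
    serves as the test set while the other k-1 parts form the
    training set."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for j in indices if j not in set(test)]
        yield train, test
```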
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Fig. 2 is a block diagram of a product quality characteristic data set.
Fig. 3 is a flowchart of the ReliefF algorithm.
FIG. 4 is a flow chart of a multi-objective evolutionary random forest feature selection algorithm.
FIG. 5 is a graph of Pareto rank after non-dominated sorting.
Fig. 6 is a schematic diagram of the crowding degree ranking.
FIG. 7 is a population evolution mode diagram of a multi-objective evolution random forest feature selection algorithm.
Detailed Description
The invention provides a key quality characteristic identification method based on multi-objective evolution random forest characteristic selection, and the invention is further described in detail with reference to the attached drawings.
The term "product quality characteristic data set", as shown in fig. 2, refers to a data set that, for the same research object (product), has a certain number of quality characteristics (characteristic attributes), a certain number of samples (sampled products), and a definite class for each sample.
The "ReliefF algorithm pre-selects the quality characteristics used for classification"; the algorithm flow is shown in fig. 3, and the specific method is as follows:
2-1: Extract an individual E from a sample of a certain class, then find the k nearest-neighbour samples among the same-class and different-class samples respectively, forming the same-class neighbour sample set F and the different-class neighbour sample set G;
2-2: Use the differences between E and the average per-feature differences of the samples in F and G to define the feature weight W. For any feature m, the feature weight W_m after n rounds of sampling is computed as:
W_m = W_m − Σ_{j=1..k} |E[m] − F_j[m]| / (n·k) + Σ_{c ≠ class(E)} [ p(c) / (1 − P(class(E))) ] · Σ_{j=1..k} |E[m] − G(c)_j[m]| / (n·k)
where:
c denotes a class of the different-class samples;
E[m] denotes the value of feature m of individual E;
F_j[m] denotes the value of feature m of the j-th nearest same-class sample;
p(c) denotes the probability that a different-class sample belongs to class c;
class(E) denotes the class of individual E;
P(class(E)) denotes the probability that a sample belongs to the same class as E;
G(c)_j[m] denotes the value of feature m of the j-th nearest sample of class c.
The larger a feature's weight, the larger the between-class distance and the smaller the within-class distance it induces among the samples, and the stronger its discriminative effect on the classes;
2-3: Eliminate the quality characteristics whose between-class distance is smaller than their within-class distance, and divide the pre-selected data set into two parts: a product quality characteristic training data set and a product quality characteristic testing data set. The training and testing sets use the k-fold cross-validation method, in which the data are divided evenly by sample size into k parts; k−1 parts form the training set and the remaining part forms the testing set.
The flow of the multi-objective evolutionary random forest feature selection algorithm is shown in fig. 4. It consists of two parts: first, the multi-objective evolutionary algorithm, for which NSGA-II is selected and implemented with Matlab software; second, the random forest classifier, implemented with Python. The whole algorithm runs through Matlab-Python interaction. The specific method of step 3 is as follows:
3-1: Initialise and randomly generate a population P_t; each individual of P_t is a selected quality characteristic set s. Individuals in the population are encoded in binary: a solution s is encoded as C = (c_1, c_2, c_3, c_4, c_5, …, c_N), a vector of length N, where N is the total number of quality characteristics. Each element c_i ∈ {0, 1} (i = 1, 2, 3, …, N) indicates whether the i-th feature is selected ('1' selected, '0' not selected). Each code corresponds to a solution, i.e. a subset of quality characteristics;
3-2: Apply selection, crossover, and mutation to population P_t to obtain population P_t'. The specific steps are as follows:
3-2-1: Selection in the population uses binary tournament: two individuals are drawn from the parent population each time, compared using the crowded-comparison operator, and the better individual is added to the offspring population;
3-2-2: Crossover between individuals uses single-point crossover: individuals C_1 = (c_11, c_12, c_13, c_14, c_15, …, c_1N) and C_2 = (c_21, c_22, c_23, c_24, c_25, …, c_2N) undergo a crossover operation with probability p_c at a cut point e, producing two new individuals C_1' = (c_11, c_12, c_13, …, c_1(e−1), c_2e, …, c_2N) and C_2' = (c_21, c_22, c_23, …, c_2(e−1), c_1e, …, c_1N);
3-2-3: Mutation of individuals uses multi-point bit-flip mutation: for an individual C = (c_1, c_2, c_3, c_4, c_5, …, c_N), each gene undergoes a mutation operation with probability p_m to generate a new individual; a '0' at the original position mutates to '1' and a '1' mutates to '0';
3-3: Evaluate the fitness of each individual in the combined population R_t = P_t + P_t' against the objective functions set by the algorithm. The targets are set by actual production requirements, including but not limited to Min F(s) = {f_1(s), f_2(s), f_3(s)}, where f_1 is the classification error rate, f_2 is the inverse of the sum of the ReliefF weights of s, and f_3 is the size of the quality characteristic subset. Each individual obtains the objective values {f_1(s), f_2(s), f_3(s)}. The specific steps are as follows:
3-3-1: will be provided withREach individual is decoded into a corresponding set of quality characteristics;
3-3-2: set of corresponding quality characteristics after decoding, number of quality characteristicsQuantity as a fitness functionf 3 (s)A value of (d);
3-3-3: corresponding quality characteristic set after decoding and corresponding to a Relief F algorithmsIs a function of the inverse of the sum of the weights off 2 (s) A value of (d);
3-3-4: extracting quality characteristic data sets corresponding to the internal training set to train the random forest classifier respectively;
3-3-5: extracting quality characteristic data sets corresponding to the internal test set, respectively verifying and predicting the precision of the trained random forest classifier, and obtaining a fitness functionf 1 (s) A value of (d);
3-4: using fast non-dominated sorting method pairsR t Each individual is subjected to non-dominated ranking, and the specific steps are as follows:
the non-dominant ranking of individuals in the population is based on: for minimizing the multiobjective optimization problem, fornA target componentf i (s),(i=1,2,…,n) Any given two decision variablesX a ,X b If the following two conditions are satisfied, it is calledX a DominatingX b :
1, for anyi∈1,2,…,nAll are provided withf i (X a )≤f i (X b ) If true;
2, existence ofi∈1,2,…,nSo thatf i (X a )≤f i (X b ) If true;
if one decision variable does not have other decision variables capable of dominating the decision variable, the decision variable is called as a non-dominated solution, in a group of solutions, the Pareto level of the non-dominated solution is defined as 1, the non-dominated solution is deleted from the solution set, the Pareto level of the rest solutions is defined as 2, and by analogy, the Pareto levels of all solutions in the solution set can be obtained, and the Pareto levels are ranked as shown in fig. 5;
3-5: selecting the individuals with the smallest current non-dominant gradeEntry populationP t+1 Up toP t+1 Until the population cannot accommodate the next level;
3-6: and carrying out congestion distance sequencing on the next non-dominated level individual by using a congestion distance distribution method, wherein the specific method comprises the following steps:
the crowding degree sequencing basis of the individuals with the same non-dominant grade in the population is as follows: the crowdedness represents the density of individuals around a given point in a population, and is defined asi d Representing, visually, the individualiThe surroundings include the individualiBut does not include the length of the largest rectangle of the rest of the individuals, as shown in fig. 6;
3-7: selecting the individual with the largest crowding distance to enter the selected populationP t+1 Until population is completedP t+1 . FIG. 7 shows the evolution process of the population, including steps 3-4 to 3-7;
3-8: and repeating the steps 3-2 to 3-7 until the algorithm termination condition is reached. And outputting the population individuals after the algorithm is terminated, and decoding to obtain the identified key quality characteristic set.
In step 4, the obtained key quality characteristic set is verified and evaluated using the test data set and the random forest classifier trained on the key quality characteristic set. The specific verification method is to extract the columns of the test set corresponding to the key quality characteristic set and verify the prediction accuracy of the trained random forest classifier, again using the k-fold cross-validation method.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the applicant has described the invention in detail, those skilled in the art will understand that modifications or equivalent substitutions may be made to the technical solutions without departing from their spirit and scope, and all such modifications shall be covered by the claims of the present invention.
Claims (8)
1. A key quality characteristic identification method based on multi-objective evolutionary random forest feature selection, characterised by comprising the following steps: step 1: acquiring multivariate quality characteristic data in the production process through digital inspection on the shop floor, the data comprising process parameters, product size parameters, product grade classifications, and other quality characteristics that are important factors influencing the overall quality level of the product, thereby forming a product quality characteristic data set; step 2: pre-selecting the quality characteristics used for classification with the ReliefF algorithm to obtain the algorithm weight of each quality characteristic, eliminating the quality characteristics whose between-class distance is smaller than their within-class distance, and dividing the pre-selected data set into two parts: a product quality characteristic training data set and a product quality characteristic testing data set; step 3: inputting the training data set into the multi-objective evolutionary random forest feature selection algorithm, establishing a plurality of corresponding algorithm targets, generating an initial population, and setting the number of iteration generations to obtain the dominant key quality characteristic set; step 4: verifying and evaluating the obtained key quality characteristic set with the test data set.
2. The method for identifying key quality characteristics based on multi-objective evolutionary random forest feature selection as claimed in claim 1, characterized in that: the "product quality characteristic data set" in step 1 refers to a data set, for a single study object (product), having a certain number of quality characteristics (feature attributes), a certain number of samples (sampled products), and a definite classification for each sample.
3. The method for identifying key quality characteristics based on multi-objective evolutionary random forest feature selection as claimed in claim 1, characterized in that: the ReliefF algorithm in step 2 is an extension of the Relief algorithm, and the specific process is as follows: an individual E is drawn from the samples of a certain class, and its k nearest neighbour samples are found among the same-class samples and among the different-class samples, forming a same-class neighbour sample set F and a different-class neighbour sample set G; the feature weight W is then defined by the difference between the average per-feature differences of E from the samples in F and from the samples in G.
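The ReliefF process of claim 3 can be sketched as follows. This is a minimal sketch assuming numeric features, L1 distances, and class-prior weighting of the near-miss sets; the iteration count, neighbour count, and scaling choices are illustrative assumptions, not the patent's parameters.

```python
import numpy as np

def relieff_weights(X, y, k=5, n_iter=100, rng=None):
    """ReliefF feature weights (sketch): for a sampled instance E, find its
    k nearest same-class neighbours (set F) and k nearest neighbours in each
    other class (sets G); weights grow with the E-to-G differences and
    shrink with the E-to-F differences."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Scale each feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    w = np.zeros(d)
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    for _ in range(n_iter):
        i = rng.integers(n)
        E = Xs[i]
        dist = np.abs(Xs - E).sum(axis=1)
        dist[i] = np.inf  # exclude E itself from its own neighbours
        for c in classes:
            idx = np.where(y == c)[0]
            near = idx[np.argsort(dist[idx])[:k]]
            diff = np.abs(Xs[near] - E).mean(axis=0)
            if c == y[i]:
                w -= diff / n_iter                                   # set F
            else:
                w += prior[c] / (1 - prior[y[i]]) * diff / n_iter    # sets G
    return w
```

A feature whose between-class differences exceed its within-class differences ends up with a positive weight, which is the elimination criterion used in step 2 of claim 1.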
4. The method for identifying key quality characteristics based on multi-objective evolutionary random forest feature selection as claimed in claim 1, characterized in that: the training set and the test set of step 2 are obtained by the k-fold cross-validation method, in which the data is divided equally by sample size into k parts; k-1 parts are taken as the training set and the remaining 1 part as the test set.
5. The method for identifying key quality characteristics based on multi-objective evolutionary random forest feature selection as claimed in claim 1, characterized in that: the multi-objective evolutionary random forest feature selection algorithm comprises two parts: a multi-objective evolutionary algorithm, for which NSGA-II is selected and implemented in Matlab; and a random forest classifier, implemented in Python. The overall algorithm is realized through Matlab and Python interaction.
6. The method for identifying key quality characteristics based on multi-objective evolutionary random forest feature selection as claimed in claim 1, characterized in that: in the multi-objective evolutionary random forest feature selection algorithm of step 3, each population individual encodes a selected quality characteristic set s, and the objective functions are given by production practice requirements, including but not limited to:
Min F(s) = {f1(s), f2(s), f3(s)}, where f1 is the classification error rate, f2 is the reciprocal of the sum of the ReliefF weights of the characteristics in s, and f3 is the size of the characteristic subset s.
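The three minimisation targets of claim 6 can be sketched for one candidate subset as follows. This uses scikit-learn's `RandomForestClassifier` as a plausible Python random forest; the model settings and the boolean-mask encoding of s are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def objectives(mask, X_tr, y_tr, X_te, y_te, relief_w):
    """Evaluate claim 6's objectives for a candidate feature subset s,
    encoded as a boolean mask over all features (sketch; model settings
    are assumptions).
    f1: classification error rate of a random forest trained on s;
    f2: reciprocal of the summed ReliefF weights of the features in s;
    f3: size of s."""
    cols = np.flatnonzero(mask)
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(X_tr[:, cols], y_tr)
    f1 = 1.0 - rf.score(X_te[:, cols], y_te)
    f2 = 1.0 / relief_w[cols].sum()
    f3 = cols.size
    return f1, f2, f3
```

NSGA-II would call such an evaluation for every individual in the population, seeking subsets that are simultaneously accurate, heavily weighted by ReliefF, and small.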
7. The method for identifying key quality characteristics based on multi-objective evolutionary random forest feature selection as claimed in claim 1, characterized in that: in step 3, the training set within the multi-objective evolutionary random forest feature selection algorithm is divided into an internal training set and an internal test set, wherein the internal training set is used to train the random forest classifier, and the internal test set is used to evaluate part of the objective function values of the selected quality characteristic set s generated by the algorithm.
8. The method for identifying key quality characteristics based on multi-objective evolutionary random forest feature selection as claimed in claim 1, characterized in that: the verification and evaluation method of step 4 adopts the k-fold cross-validation method, performing k verifications in total over the k test data sets and taking the average of the objective values over the k verifications.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110752786.2A CN113268936B (en) | 2021-07-03 | 2021-07-03 | Key quality characteristic identification method based on multi-objective evolution random forest characteristic selection |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110752786.2A CN113268936B (en) | 2021-07-03 | 2021-07-03 | Key quality characteristic identification method based on multi-objective evolution random forest characteristic selection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113268936A true CN113268936A (en) | 2021-08-17 |
CN113268936B CN113268936B (en) | 2022-07-19 |
Family
ID=77236356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110752786.2A Expired - Fee Related CN113268936B (en) | 2021-07-03 | 2021-07-03 | Key quality characteristic identification method based on multi-objective evolution random forest characteristic selection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113268936B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106845796A (en) * | 2016-12-28 | 2017-06-13 | 中南大学 | One kind is hydrocracked flow product quality on-line prediction method |
JP2017161991A (en) * | 2016-03-07 | 2017-09-14 | 三菱重工業株式会社 | Quality evaluation system, quality evaluation method and program |
CN109523086A (en) * | 2018-11-26 | 2019-03-26 | 浙江蓝卓工业互联网信息技术有限公司 | The qualitative forecasting method and system of chemical products based on random forest |
CN110288199A (en) * | 2019-05-29 | 2019-09-27 | 北京航空航天大学 | The method of product quality forecast |
CN110456756A (en) * | 2019-03-25 | 2019-11-15 | 中南大学 | A method of suitable for continuous production process overall situation operation conditions online evaluation |
CN110582091A (en) * | 2018-06-11 | 2019-12-17 | 中国移动通信集团浙江有限公司 | method and apparatus for locating wireless quality problems |
CN112418538A (en) * | 2020-11-30 | 2021-02-26 | 武汉科技大学 | Continuous casting billet inclusion prediction method based on random forest classification |
Worldwide Applications
2021: 2021-07-03 CN CN202110752786.2A patent/CN113268936B/en not_active Expired - Fee Related
Non-Patent Citations (3)
Title |
---|
QIAO SHI等: "The Application of Tobacco Product Quality Prediction Using Ensemble Learning Method", 《2019 IEEE 4TH ADVANCED INFORMATION TECHNOLOGY, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (IAEAC)》 * |
QIAO Peirui (乔佩蕊): "Research on Complex Product Quality Prediction Based on Improved LASSO-RF", 《CNKI Outstanding Master's Dissertations Full-text Database, Economics and Management Sciences》 *
WU Xinye (伍薪烨): "Research on Multi-Characteristic Decision-Making and Operation Optimization Scheduling Methods for Assembly Quality of Complex Equipment", 《CNKI Outstanding Master's Dissertations Full-text Database, Engineering Science and Technology I》 *
Also Published As
Publication number | Publication date |
---|---|
CN113268936B (en) | 2022-07-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110213222B (en) | Network intrusion detection method based on machine learning | |
CN108921604B (en) | Advertisement click rate prediction method based on cost-sensitive classifier integration | |
CN108898479B (en) | Credit evaluation model construction method and device | |
CN111414849B (en) | Face recognition method based on evolution convolutional neural network | |
CN112232413B (en) | High-dimensional data feature selection method based on graph neural network and spectral clustering | |
CN108681742B (en) | Analysis method for analyzing sensitivity of driver driving behavior to vehicle energy consumption | |
CN106446602A (en) | Prediction method and system for RNA binding sites in protein molecules | |
CN107016416B (en) | Data classification prediction method based on neighborhood rough set and PCA fusion | |
CN110222838B (en) | Document sorting method and device, electronic equipment and storage medium | |
CN112633337A (en) | Unbalanced data processing method based on clustering and boundary points | |
CN112906890A (en) | User attribute feature selection method based on mutual information and improved genetic algorithm | |
CN101923604A (en) | Classification method for weighted KNN oncogene expression profiles based on neighborhood rough set | |
CN106951728B (en) | Tumor key gene identification method based on particle swarm optimization and scoring criterion | |
CN115481841A (en) | Material demand prediction method based on feature extraction and improved random forest | |
CN113268936B (en) | Key quality characteristic identification method based on multi-objective evolution random forest characteristic selection | |
CN115481844A (en) | Distribution network material demand prediction system based on feature extraction and improved SVR model | |
CN117272025A (en) | High-dimensional data feature selection method based on fuzzy competition particle swarm multi-objective optimization | |
CN108305174B (en) | Resource processing method, device, storage medium and computer equipment | |
CN113657441A (en) | Classification algorithm based on weighted Pearson correlation coefficient and combined with feature screening | |
KR100727555B1 (en) | Creating method for decision tree using time-weighted entropy and recording medium thereof | |
Mahfuz et al. | Clustering heterogeneous categorical data using enhanced mini batch K-means with entropy distance measure | |
CN113269217A (en) | Radar target classification method based on Fisher criterion | |
CN112801197A (en) | K-means method based on user data distribution | |
CN115017125B (en) | Data processing method and device for improving KNN method | |
Irawan et al. | Accounts Receivable Seamless Prediction for Companies by Using Multiclass Data Mining Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20220719 |