CN117435904A - Single feature ordering and composite feature extraction method - Google Patents

Single feature ordering and composite feature extraction method Download PDF

Info

Publication number
CN117435904A
CN117435904A
Authority
CN
China
Prior art keywords
feature
sample
features
expression
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311753604.9A
Other languages
Chinese (zh)
Other versions
CN117435904B (en
Inventor
胡旺
陈业航
章语
李欣悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202311753604.9A priority Critical patent/CN117435904B/en
Publication of CN117435904A publication Critical patent/CN117435904A/en
Application granted granted Critical
Publication of CN117435904B publication Critical patent/CN117435904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2111Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/27Regression, e.g. linear or logistic regression
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C60/00Computational materials science, i.e. ICT specially adapted for investigating the physical or chemical properties of materials or phenomena associated with their design, synthesis, processing, characterisation or utilisation

Abstract

The invention discloses a single feature ordering and composite feature extraction method, belonging to the technical field of data processing. The method comprises the following steps: S1, constructing an input data set; S2, partitioning and clustering; S3, performing symbolic regression per cluster and decoding the symbolic regression results into expressions; S4, ordering single features according to the symbolic regression results; S5, extracting composite features according to the symbolic regression results. The method can effectively improve the interpretability of single-feature selection results and remove irrelevant or redundant features; at the same time, composite features that are interpretable within the domain can be extracted explicitly, promoting knowledge exchange between fields; in addition, selecting the truly relevant features effectively removes the interference caused by noise features, thereby simplifying the model, improving model accuracy, and aiding understanding of the data-generating process.

Description

Single feature ordering and composite feature extraction method
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a single feature ordering and composite feature extraction method.
Background
Feature selection is an important problem in the field of data processing, whose goal is to find the optimal feature subset. Feature selection can eliminate irrelevant or redundant features, thereby reducing the number of features, improving model accuracy, and reducing running time. Moreover, selecting the truly relevant features effectively simplifies the model and assists in understanding the data-generating process.
Feature selection is an NP-hard problem: given a set of candidate features, searching all possible combinations for the optimal feature subset is prohibitively expensive. In the field of feature selection, genetic algorithms determine an optimal feature subset by an evolution-based method: different feature subsets are encoded as a population by a suitable encoding scheme. In each generation, the subsets in the population are evaluated by the accuracy of a prediction model on the target task, and they compete to determine which subsets continue to the next generation. The next generation consists of the contest winners, which undergo crossover (a winning feature set is updated with features of other winners) and mutation (some features are randomly introduced or deleted). After the algorithm has run for a number of generations, the best members of the population constitute the optimal feature subset.
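The genetic-algorithm loop described above can be sketched as follows. Everything here is illustrative: the fitness function is a hypothetical stand-in for training a predictor on the encoded subset (it simply rewards two "truly relevant" features and penalises subset size), and the operator probabilities are arbitrary choices, not values from the source.

```python
import random

def fitness(mask):
    # hypothetical stand-in for a prediction model's accuracy: reward the
    # "truly relevant" features {0, 2}, penalise subset size
    return sum(1.0 for i in (0, 2) if mask[i]) - 0.1 * sum(mask)

def ga_feature_select(n_features, pop_size=20, generations=50, seed=1):
    rng = random.Random(seed)
    # each individual encodes a feature subset as a 0/1 mask
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        # tournament selection: subsets compete on (stand-in) model accuracy
        winners = [max(rng.sample(pop, 3), key=fitness) for _ in range(pop_size)]
        nxt = []
        for a, b in zip(winners[0::2], winners[1::2]):
            cut = rng.randrange(1, n_features)     # crossover: mix two winners
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                 # mutation: flip one feature
                child[rng.randrange(n_features)] ^= 1
            nxt += [a[:], child]
        pop = nxt
    return max(pop, key=fitness)

best = ga_feature_select(6)
```

After the configured number of generations, the best member of the final population is returned as the selected feature mask.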
Symbolic regression is a machine learning technique that aims to identify an underlying mathematical expression. It first builds a population of naive random formulas to represent the relationship between the known independent variables and their dependent-variable targets, in order to predict new data. Each successive generation of programs evolves from the previous one, and the fittest individuals in the population are selected for genetic operations. Symbolic regression draws on Darwin's theory of natural selection: by simulating operations such as gene replication, crossover and mutation among computer programs, and given a sufficiently large initial population and reasonable crossover and mutation probabilities, it avoids falling into local optima and can discover the rules hidden behind observed values from large amounts of real data, giving it wider applicability and higher accuracy than traditional regression methods. Genetic programming is the core algorithm of symbolic regression and, through mechanisms such as user-defined functions, has achieved notable results in fields such as machine learning, artificial intelligence, combinatorial optimization, adaptive systems and control. Based on the properties of functions, genetic programming adopts a binary-tree data structure to represent function expressions, and adapts the genetic operations on binary strings in genetic algorithms into genetic operations on binary trees.
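To illustrate the binary-tree encoding that genetic programming uses for expressions, here is a minimal evaluation sketch. Nested tuples stand in for tree nodes; the protected division is a common genetic-programming convention, not something the source specifies.

```python
def evaluate(node, env):
    # a leaf is either a constant (float) or a variable name looked up in env
    if isinstance(node, float):
        return node
    if isinstance(node, str):
        return env[node]
    op, left, right = node              # internal node: (operator, left, right)
    a, b = evaluate(left, env), evaluate(right, env)
    if op == '+':
        return a + b
    if op == '-':
        return a - b
    if op == '*':
        return a * b
    if op == '/':                       # protected division avoids blow-ups
        return a / b if abs(b) > 1e-12 else 1.0
    raise ValueError('unknown operator: %s' % op)

# the tree for (x0 * x1) + 2.0
tree = ('+', ('*', 'x0', 'x1'), 2.0)
result = evaluate(tree, {'x0': 3.0, 'x1': 4.0})   # 3*4 + 2 = 14.0
```

Crossover on this representation swaps subtrees between two such trees, and mutation replaces a subtree with a freshly generated one.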
Symbolic regression and feature selection both build on the ideas of evolutionary computation: the former uses an evolutionary algorithm to obtain a symbolic expression that better fits the relationship in the data, while the latter uses an evolutionary algorithm to obtain the optimal feature subset that best predicts the label value. However, most existing feature selection methods based on evolutionary algorithms can only extract important features implicitly and cannot provide an interpretable reason, which hinders knowledge exchange and verification across fields. Furthermore, features are not isolated in real life; in many cases they act on the result in combination, and symbolic regression can better reconstruct such composite features when used for feature extraction.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a single feature ordering and composite feature extraction method. From the symbolic-regression expression results, the method performs Pareto non-dominated sorting based on the occurrence frequency of the related features and the average values of their partial derivatives in each expression, thereby obtaining an importance ranking of the related features; and it extracts composite features that conform to domain knowledge by extracting frequent sub-formulas from the symbolic regression results and combining them with domain knowledge.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the single feature ordering and compound feature extracting method is characterized by comprising the following steps:
S1, constructing an input data set: for the sample data to be processed, the parameter to be optimized is selected as the label, and at least 3 features to be screened are selected as the related features; after data preprocessing, the related features of each sample are spliced with the corresponding label to obtain the input data of a single sample, completing the construction of the input data set.
S2, partitioning and clustering: cluster division is carried out on the input data set to obtain the cluster in which each sample lies.
S3, symbolic regression: symbolic regression is carried out separately for each cluster according to the cluster-division result. During symbolic regression, the hyperparameters of each cluster are kept consistent and the root mean square error is used as the fitness function. After the symbolic regression iterations finish, the symbolic regression results are decoded into expressions, giving the expression of each cluster.
S4, single feature ordering: the number of occurrences of each related feature in the expressions is counted to obtain the total occurrence count of each related feature; meanwhile, samples whose fitting errors are smaller than a set threshold are selected for each expression, and the average value of the partial derivative of each related feature in the expression over the selected samples is calculated by finite differences; non-dominated sorting is then carried out according to the total occurrence count of each related feature and its average partial-derivative value in the expressions, yielding a ranking of the degree to which the related features influence the parameter to be optimized.
S5, extracting composite features: substructures whose occurrence frequency is greater than a set threshold are extracted from the expressions, and the extracted substructures are screened by principal component analysis or the correlation-coefficient method to obtain the composite features.
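Step S5 can be sketched as follows under simplifying assumptions: sub-expressions are matched as substrings of the decoded expression strings, and the correlation-coefficient screen keeps substructures whose Pearson correlation with the label exceeds a threshold. The candidate substructures, expressions, data values and the 0.8 threshold are all illustrative, not from the source.

```python
from statistics import mean

def frequent_substructures(expressions, candidates, min_count):
    # count each candidate sub-expression across all decoded expressions
    counts = {c: sum(e.count(c) for e in expressions) for c in candidates}
    return [c for c in candidates if counts[c] > min_count]

def pearson(xs, ys):
    # sample Pearson correlation coefficient
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

exprs = ["(G*e0) + SF", "SF * (G*e0)", "(G*e0) - T"]
kept = frequent_substructures(exprs, ["G*e0", "SF*T"], min_count=2)

# correlation screen: values of a kept substructure vs. the label
sub_vals = [1.0, 2.0, 3.0, 4.0]
labels = [2.1, 3.9, 6.2, 8.0]
keep = abs(pearson(sub_vals, labels)) > 0.8
```

A production version would match subtrees of the expression trees rather than raw substrings, but the counting-then-screening structure is the same.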
Further, the data preprocessing includes: outlier rejection and data normalization;
the outlier-removal process is: outliers in the data sequences are detected using the Pauta (3σ) criterion; if outliers exist, they are removed;
the data-standardization process is: data are standardized based on the mean and standard deviation of the raw data, such that the standardized samples within a single related feature have mean 0 and variance 1.
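The preprocessing above can be sketched as follows, with made-up values; the sketch assumes the Pauta (3σ) criterion is applied per feature and standardization uses the population standard deviation.

```python
from statistics import mean, pstdev

def remove_outliers_3sigma(values):
    # Pauta (3-sigma) criterion: drop values farther than 3 standard
    # deviations from the mean of the raw data
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) <= 3 * s]

def standardize(values):
    # z-score standardization: resulting values have mean 0, variance 1
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

feature = [1.0, 1.2, 0.9, 1.1] * 5 + [100.0]   # 100.0 is a gross outlier
clean = remove_outliers_3sigma(feature)
z = standardize(clean)
```

Note that with very few samples a single outlier may not exceed the 3σ band (the maximum possible z-score is bounded by (N−1)/√N), so the criterion is only meaningful on reasonably long sequences.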
Further, the clustering division mode is as follows:
the input data for a single sample is expressed as:

x_i = (f_{i,1}, f_{i,2}, …, f_{i,n}, y_i)    (1)

where f_{i,j} denotes the j-th related feature of sample x_i, j = 1, 2, 3, …, n, and n is the total number of related features of the input data set; y_i denotes the label value of sample x_i.

The number of clusters K is designated, and any K samples are selected from the input data set as initial center points, giving the center-point set C = {c_1, c_2, …, c_K}, where c_1, c_2 and c_K denote the 1st, 2nd and K-th center-point samples respectively. For each remaining sample not selected as a center point, the Euclidean distance between the sample and every center point is calculated using formula (2), and the sample is assigned to the cluster of the center point with the smallest Euclidean distance:

d(x_a, c_b) = sqrt( Σ_{j=1}^{n} (f_{a,j} − f_{b,j})² )    (2)

where d(x_a, c_b) denotes the Euclidean distance between any sample x_a in the input data set and any center-point sample c_b in the center-point set, and f_{a,j} and f_{b,j} denote the values of the j-th feature of sample x_a and center-point sample c_b respectively.
The cluster-division process above is repeated and iterated until the division no longer changes or the maximum number of iterations is reached, completing the cluster division and yielding the clustering result.
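A minimal Python sketch of this cluster-division loop follows. The patent does not say how center points are updated between iterations; this sketch recomputes each center as its cluster's mean, as in standard K-means, and uses illustrative 2-D points.

```python
import random

def euclidean(a, b):
    # Euclidean distance, as in formula (2)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(samples, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(samples, k)]  # k random initial centers
    assign = None
    for _ in range(max_iter):
        # assign every sample to its nearest center
        new = [min(range(k), key=lambda c: euclidean(s, centers[c]))
               for s in samples]
        if new == assign:          # division unchanged -> converged
            break
        assign = new
        for c in range(k):         # update centers as cluster means
            members = [s for s, a in zip(samples, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels, _ = kmeans(pts, 2)
```

On these two well-separated point pairs the loop converges in a couple of iterations regardless of which samples are drawn as initial centers.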
Further, in step S3, symbolic regression is implemented using an evolutionary algorithm and tree encoding.
Further, in step S4, the total number of times each related feature appears in the expressions is calculated using formula (4):

C_j = Σ_{i=1}^{m} c_{i,j}    (4)

where m is the number of expressions and c_{i,j} denotes the frequency with which related feature f_j occurs in the i-th expression;

the average value of the partial derivative of each related feature in the expressions is calculated using formula (5):

D_j = (1/m) Σ_{i=1}^{m} | ∂g_i/∂f_j |    (5)

where ∂g_i/∂f_j denotes the partial derivative of the i-th expression g_i with respect to related feature f_j, averaged over the selected samples.
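The two indices can be sketched as follows: the total occurrence count of formula (4), counted here over expression strings, and the average partial derivative of formula (5), estimated by central finite differences (the patent says the derivatives are computed differentially; the stand-in expression, sample values and step size h are illustrative).

```python
def total_occurrences(expr_strings, feature):
    # formula (4): sum the feature's occurrence count over all expressions
    return sum(e.count(feature) for e in expr_strings)

def avg_partial_derivative(exprs, samples, j, h=1e-4):
    # formula (5): average |df/dx_j| over expressions and selected samples,
    # estimated by a central finite difference in coordinate j
    total, count = 0.0, 0
    for f in exprs:
        for x in samples:
            hi, lo = list(x), list(x)
            hi[j] += h
            lo[j] -= h
            total += abs(f(hi) - f(lo)) / (2 * h)
            count += 1
    return total / count

f1 = lambda x: 2 * x[0] + x[1] ** 2        # stand-in for a decoded expression
samples = [[1.0, 1.0], [2.0, 3.0]]
d0 = avg_partial_derivative([f1], samples, 0)   # df/dx0 = 2 everywhere
d1 = avg_partial_derivative([f1], samples, 1)   # df/dx1 = 2*x1 -> mean of 2 and 6
```

Central differences are exact for the linear and quadratic terms used here; in general the step size trades truncation error against rounding error.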
Further, in step S4, a Pareto non-dominated sorting algorithm is used to rank the single features.
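The non-dominated sorting over those two indices (both to be maximised) can be sketched as follows: front 0 holds features dominated by no other feature, front 1 those dominated only by front 0, and so on. The score tuples are illustrative, not from the patent's data.

```python
def dominates(a, b):
    # a dominates b if a is at least as good in every index and
    # strictly better in at least one
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_fronts(points):
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

# (occurrence count, average |partial derivative|) per feature
scores = [(10, 0.5), (8, 0.9), (3, 0.2), (10, 0.9)]
fronts = non_dominated_fronts(scores)   # earlier fronts rank higher
```

Feature 3 dominates all others and forms the first front alone; features 0 and 1 are mutually non-dominated and share the second front.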
Compared with other methods in the data-processing field, the method can effectively improve the interpretability of single-feature selection results and eliminate irrelevant or redundant features, thereby reducing the number of features and improving model accuracy; at the same time, composite features that are interpretable within the domain can be explicitly extracted from the symbolic regression results, promoting knowledge exchange between fields; in addition, selecting the truly relevant features effectively removes interference caused by noise features, simplifying the model and assisting in understanding the data-generating process.
Drawings
FIG. 1 is a flow chart of feature selection based on symbolic regression according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a symbolic regression process according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the accuracy results of different feature selection algorithms according to embodiment 1 of the present invention.
Fig. 4 is a schematic diagram of the accuracy results of different feature selection algorithms according to embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantageous effects of the present invention more apparent, the following detailed description of the present invention will be given with reference to specific examples.
Example 1:
in this embodiment, taking the selection of nickel-base superalloy creep-life features as an example, creep-life data of 100,000 nickel-base superalloy samples are obtained, together with nine corresponding features to be screened: gamma prime volume fraction, shear modulus, domain inversion interfacial energy, stacking fault energy, gamma prime melting temperature, degree of mismatch, initial creep rate, applied stress, and creep temperature. Considering practical process limitations and cost factors, 40,000 samples are selected as the raw data set of this embodiment.
Based on the above-mentioned nickel-based superalloy creep life data set, the present embodiment provides a single feature ordering and composite feature extraction method, the flow of which is shown in fig. 1, specifically including the following steps:
step 1: constructing an input data set;
for each sample in the nickel-base superalloy creep-life raw data set, the life data of the sample is taken as the label and the 9 features to be screened as the related features. The related features are preprocessed as follows: outliers in the data sequences are detected with the Pauta (3σ) criterion and removed if present; the data are then standardized based on the mean and standard deviation of the raw data, so that within each related feature the sample mean is 0 and the variance is 1. The labels are processed as follows: the continuous creep-life values are mapped to discrete label values from 1 to 10 according to the value of the creep life.
Splicing the preprocessed related features with the corresponding labels to obtain input data of a single sample:
x_i = (f_{i,1}, f_{i,2}, …, f_{i,9}, y_i)    (1)

where f_{i,j} denotes the j-th related feature of sample x_i, j = 1, 2, 3, …, 9, and y_i denotes the label value of sample x_i.
Step 2: partitioning and clustering;
the number of clusters K is specified. If K is too small, the number of samples in a single cluster is too large and clustering loses its purpose; if K is too large, the symbolic regression result within a single cluster is not generalizable. Therefore, for the nickel-base superalloy creep-life data set, K is chosen as 20 by an empirical formula.
Any 20 samples of the input data set are selected as initial center points, giving the center-point set C = {c_1, c_2, …, c_20}. For each remaining sample not selected as a center point, the Euclidean distance to every center point is calculated using formula (2):

d(x_a, c_b) = sqrt( Σ_{j=1}^{9} (f_{a,j} − f_{b,j})² )    (2)

where d(x_a, c_b) denotes the Euclidean distance between any sample x_a in the input data set and any center-point sample c_b in the center-point set, and f_{a,j} and f_{b,j} denote the values of the j-th feature of sample x_a and center-point sample c_b; according to the calculation results, each sample is assigned to the cluster of the center point with the smallest Euclidean distance.
The cluster division is repeated and iterated until the division no longer changes or the maximum number of iterations is reached, completing the clustering and yielding the clustering result.
Step 3: symbol regression;
symbolic regression is carried out separately for each cluster according to the cluster-division result of step 2; the flow is shown in fig. 2.
Specifically, when symbolic regression is implemented with the evolutionary algorithm, each generated expression is treated as an individual, and the fitness function of the evolution is the root mean square error RMSE, calculated as:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² )    (3)

where N is the total number of samples, y_i is the label value of the i-th sample, and ŷ_i denotes the expression's life prediction for the i-th sample.
In each generation's environmental selection, individuals with smaller root mean square error, i.e., higher fitness, are more likely to survive, so expressions with smaller error are obtained as the number of iterations increases. In the symbolic regression of this embodiment, the number of iterations is set to 1000, the population size to 100, the mutation probability to 0.8, and the crossover probability to 0.4.
During symbolic regression, expressions are encoded as multi-gene binary trees: each gene is a binary tree, several genes together form an expression, and the coefficients combining the genes are determined by least squares. In this example, the tree depth is set to 6 and the maximum number of genes to 4. After the symbolic regression iterations finish, the symbolic regression results are decoded into expressions.
Step 4: sequencing single features;
ordering the single features requires computing two indices. One is the frequency with which a related feature occurs in the expressions: the more frequently it occurs, the more important it is. The other is the average value of the partial derivative of the standardized related feature in the expressions: the larger the average partial derivative, the more sensitive the label is to fluctuations of that feature, and the more important the feature is.
The total number of occurrences of each related feature in the expressions is calculated using formula (4):

C_j = Σ_{i=1}^{m} c_{i,j}    (4)

where m is the number of expressions and c_{i,j} denotes the frequency with which related feature f_j occurs in the i-th expression.
For each expression, the samples whose fitting errors rank in the smallest ten percent are selected, and the average value of the partial derivative of each related feature in the expression over the selected samples is calculated using formula (5):

D_j = (1/m) Σ_{i=1}^{m} | ∂g_i/∂f_j |    (5)

where ∂g_i/∂f_j denotes the partial derivative of the i-th expression g_i with respect to related feature f_j, evaluated on the selected samples.
After the occurrence counts and partial-derivative averages of the related features are obtained, Pareto non-dominated sorting is applied to obtain the ranking of the related features.
In this embodiment, the features with higher occurrence frequency are: gamma prime volume fraction, shear modulus and stacking fault energy; the features with higher average partial derivative are: shear modulus, stacking fault energy and initial creep rate. Thus, according to the Pareto non-dominated sorting, the top 4 single features in this embodiment are: gamma prime volume fraction, shear modulus, stacking fault energy, and initial creep rate.
Step 5: extracting composite features;
according to the symbolic-regression result expressions obtained in step 3, substructures whose occurrence count exceeds 10% of the population size set for the symbolic regression, i.e., more than 10 occurrences, are extracted; the extracted substructures are then screened by the correlation-coefficient method to obtain the composite features.
In this embodiment, the extracted composite substructures are the product of shear modulus and initial creep rate, and the product of stacking fault energy and initial creep rate.
The 4 single features obtained in step 4 and the two composite features obtained in step 5 form a new feature data set, which serves as the set of main features affecting the creep life of nickel-base superalloys; these features give better performance in predicting nickel-base superalloy creep life.
Verification: the creep life of the nickel-base superalloy is predicted using both the raw data set of 9 related features and the new feature data set obtained in this embodiment; the model prediction accuracies are shown in fig. 3. It can be seen that the new feature data set obtained in this embodiment helps to predict the creep-life value of the nickel-base superalloy better.
Example 2:
this example uses the same nickel-base superalloy data as example 1 as the raw data set; the difference is that the initial creep rate is used as the label, and single feature ordering and composite feature extraction are carried out on the other eight features to be screened, namely: gamma prime volume fraction, shear modulus, domain inversion interfacial energy, stacking fault energy, gamma prime melting temperature, degree of mismatch, applied stress, and creep temperature.
The nickel-base superalloy initial-creep-rate data set is subjected to single feature ordering and composite feature extraction using the method described in example 1. The experimental results show that, for the initial creep rate of the nickel-base superalloy, the features occurring more frequently in the symbolic-regression result expressions are stacking fault energy, gamma prime melting temperature and degree of mismatch, while the features with higher average partial derivatives are gamma prime volume fraction, gamma prime melting temperature and degree of mismatch. After non-dominated sorting on occurrence frequency and average partial derivative, the selected single features are gamma prime melting temperature and degree of mismatch.
According to the symbolic-regression result expressions, substructures whose occurrence count exceeds 10% of the population size set for the symbolic regression, i.e., more than 10 occurrences, are extracted; the extracted substructures are then screened by the correlation-coefficient method, and the resulting composite features are the product of gamma prime volume fraction and degree of mismatch, and the product of stacking fault energy and gamma prime melting temperature.
A new feature data set is constructed from the gamma prime melting temperature, the degree of mismatch, the product of gamma prime volume fraction and degree of mismatch, and the product of stacking fault energy and gamma prime melting temperature. The initial creep rate of the nickel-base superalloy is predicted using both the raw data set of 8 related features and the new feature data set constructed in this embodiment; the model prediction accuracies are shown in fig. 4, from which it can be seen that the new feature data set obtained in this embodiment helps to predict the initial creep rate of the nickel-base superalloy better.
While the invention has been described in terms of specific embodiments, any feature disclosed in this specification may, unless expressly stated otherwise, be replaced by an alternative feature serving an equivalent or similar purpose; all of the disclosed features, or all of the steps of a method or process, may be combined in any manner, except for mutually exclusive features and/or steps.

Claims (6)

1. A single feature ordering and composite feature extraction method, characterized by comprising the following steps:
s1, constructing an input data set: for sample data to be processed, selecting parameters to be optimized in the sample data as labels, and selecting at least 3 features to be screened as related features; the relevant characteristics of the samples are spliced with the corresponding labels after data preprocessing, so that input data of a single sample are obtained, and the construction of an input data set is completed;
s2, partitioning and clustering: carrying out cluster division on an input data set to obtain clusters in which each sample is located;
s3, symbolic regression: symbolic regression is carried out separately for each cluster according to the cluster-division result; during symbolic regression, the hyperparameters of each cluster are kept consistent and the root mean square error is used as the fitness function; after the symbolic regression iterations finish, the symbolic regression results are decoded into expressions to obtain the expression of each cluster;
s4, single feature ordering: the number of occurrences of each related feature in the expressions is counted to obtain the total occurrence count of each related feature; meanwhile, samples whose fitting errors are smaller than a set threshold are selected for each expression, and the average value of the partial derivative of each related feature in the expression over the selected samples is calculated by finite differences; non-dominated sorting is then carried out according to the total occurrence count of each related feature and its average partial-derivative value in the expressions, obtaining a ranking of the degree to which the related features influence the parameter to be optimized;
s5, extracting composite features: substructures whose occurrence frequency is greater than a set threshold are extracted from the expressions, and the extracted substructures are screened by principal component analysis or the correlation-coefficient method to obtain the composite features.
2. The single feature ordering and composite feature extraction method of claim 1, wherein the data preprocessing comprises outlier removal and data standardization;
the outlier-removal process is: outliers in the data sequences are detected using the Pauta (3σ) criterion; if outliers exist, they are removed;
the data-standardization process is: data are standardized based on the mean and standard deviation of the raw data, such that the standardized samples within a single related feature have mean 0 and variance 1.
3. The single feature ordering and composite feature extraction method according to claim 2, wherein the cluster division is performed as follows:
the input data for a single sample is expressed as:

x_i = (f_{i,1}, f_{i,2}, …, f_{i,n}, y_i)    (1)

wherein f_{i,j} denotes the j-th related feature of sample x_i, j = 1, 2, 3, …, n, n being the total number of related features of the input data set, and y_i denotes the label value of sample x_i;

the number of clusters K is designated, and any K samples are selected from the input data set as initial center points to obtain the center-point set C = {c_1, c_2, …, c_K}, wherein c_1, c_2 and c_K denote the 1st, 2nd and K-th center-point samples respectively; for each remaining sample not selected as a center point, the Euclidean distance between the sample and every center point is calculated using formula (2), and the sample is assigned to the cluster of the center point with the smallest Euclidean distance:

d(x_a, c_b) = sqrt( Σ_{j=1}^{n} (f_{a,j} − f_{b,j})² )    (2)

wherein d(x_a, c_b) denotes the Euclidean distance between any sample x_a in the input data set and any center-point sample c_b in the center-point set, and f_{a,j} and f_{b,j} denote the values of the j-th feature of sample x_a and center-point sample c_b respectively;
the cluster-division process is repeated and iterated until the division no longer changes or the maximum number of iterations is reached, completing the cluster division and obtaining the clustering result.
4. The single feature ordering and composite feature extraction method according to claim 3, wherein in step S3, symbolic regression is implemented using an evolutionary algorithm and tree encoding.
5. The single feature ordering and composite feature extraction method according to claim 4, wherein in step S4, the total number of occurrences of each related feature in the expressions is calculated using formula (4):

$$F_j = \sum_{k=1}^{m} t_{k,j} \tag{4}$$

where $m$ is the number of expressions and $t_{k,j}$ denotes the number of occurrences of related feature $f_j$ in the $k$-th expression, so that $F_j$ is the occurrence frequency of $f_j$;

and the average value of the partial derivatives of each related feature over the expressions is calculated using formula (5):

$$\bar{D}_j = \frac{1}{m} \sum_{k=1}^{m} \frac{\partial E_k}{\partial f_j} \tag{5}$$

where $\partial E_k / \partial f_j$ denotes the partial derivative of the $k$-th expression $E_k$ with respect to related feature $f_j$.
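The two statistics of claim 5 can be sketched as follows. The expressions, feature names and evaluation point are invented for the example; occurrences are counted naively on string forms (which would over-count if one feature name were a prefix of another), and the partial derivative is estimated by central differences rather than symbolically.

```python
# hypothetical expressions produced by symbolic regression (m = 2), kept
# both as strings (to count feature occurrences) and as functions
# (to differentiate numerically)
expr_strings = ["f1*f2 + f1", "f1 - f3**2"]
expr_funcs = [lambda f1, f2, f3: f1 * f2 + f1,
              lambda f1, f2, f3: f1 - f3 ** 2]

def occurrence_total(strings, name):
    """Formula (4): total occurrences of feature `name` over all expressions."""
    return sum(s.count(name) for s in strings)

def mean_partial_derivative(funcs, arg_index, point, h=1e-6):
    """Formula (5): average over expressions of the partial derivative
    w.r.t. the `arg_index`-th feature, estimated by central differences."""
    total = 0.0
    for f in funcs:
        up = list(point); up[arg_index] += h
        down = list(point); down[arg_index] -= h
        total += (f(*up) - f(*down)) / (2 * h)
    return total / len(funcs)

freq_f1 = occurrence_total(expr_strings, "f1")                      # 2 + 1 = 3
mean_d_f1 = mean_partial_derivative(expr_funcs, 0, [1.0, 2.0, 3.0])  # ((f2+1) + 1)/2
```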
6. The single feature ordering and composite feature extraction method according to claim 4, wherein in step S4, the single feature ordering is performed using a Pareto non-dominated sorting algorithm.
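A minimal sketch of Pareto non-dominated sorting over two per-feature criteria, assuming (as suggested by claim 5, though not stated by claim 6) that the objectives are occurrence frequency and mean partial derivative magnitude, both to be maximized. The score values are hypothetical.

```python
def dominates(a, b):
    """a dominates b if a is no worse on every objective and strictly
    better on at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_fronts(scores):
    """Partition items into successive Pareto fronts (lists of indices):
    front 0 holds the non-dominated items, front 1 those dominated only
    by front 0, and so on."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i])
                            for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts

# hypothetical per-feature scores: (occurrence count, mean |partial derivative|)
scores = [(5, 0.2), (3, 0.9), (5, 0.9), (1, 0.1)]
fronts = non_dominated_fronts(scores)   # feature 2 dominates all others
```

This naive O(n²) pass per front is fine for ranking a handful of features; NSGA-II-style bookkeeping would be used at scale.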
CN202311753604.9A 2023-12-20 2023-12-20 Single feature ordering and composite feature extraction method Active CN117435904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311753604.9A CN117435904B (en) 2023-12-20 2023-12-20 Single feature ordering and composite feature extraction method

Publications (2)

Publication Number Publication Date
CN117435904A true CN117435904A (en) 2024-01-23
CN117435904B CN117435904B (en) 2024-03-15

Family

ID=89551966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311753604.9A Active CN117435904B (en) 2023-12-20 2023-12-20 Single feature ordering and composite feature extraction method

Country Status (1)

Country Link
CN (1) CN117435904B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105719253A (en) * 2016-01-20 2016-06-29 桂林电子科技大学 Kalman filtering phase unwrapping method having heapsort function in embedded manner
US9596196B1 (en) * 2013-10-17 2017-03-14 Amazon Technologies, Inc. Message grouping
US20180137219A1 (en) * 2016-11-14 2018-05-17 General Electric Company Feature selection and feature synthesis methods for predictive modeling in a twinned physical system
CN109800801A (en) * 2019-01-10 2019-05-24 浙江工业大学 K-Means clustering lane method of flow based on Gauss regression algorithm
CN110415111A (en) * 2019-08-01 2019-11-05 信雅达系统工程股份有限公司 Merge the method for logistic regression credit examination & approval with expert features based on user data
CN112257892A (en) * 2020-08-27 2021-01-22 中国石油化工股份有限公司 Method for optimizing complex gas reservoir drainage gas recovery process system
CN112330060A (en) * 2020-11-25 2021-02-05 新智数字科技有限公司 Equipment fault prediction method and device, readable storage medium and electronic equipment
CN113111308A (en) * 2021-03-15 2021-07-13 华南理工大学 Symbolic regression method and system based on data-driven genetic programming algorithm
CN113127864A (en) * 2019-12-31 2021-07-16 奇安信科技集团股份有限公司 Feature code extraction method and device, computer equipment and readable storage medium
CN115035966A (en) * 2022-08-09 2022-09-09 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Superconductor screening method, device and equipment based on active learning and symbolic regression
CN115329269A (en) * 2022-07-01 2022-11-11 四川大学 Differentiable genetic programming symbol regression method
CN115392361A (en) * 2022-08-12 2022-11-25 中国平安财产保险股份有限公司 Intelligent sorting method and device, computer equipment and storage medium
CN116596574A (en) * 2023-06-07 2023-08-15 国网安徽省电力有限公司电力科学研究院 Power grid user portrait construction method and system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HOSEONG JEONG et al.: "Semantic Cluster Operator for Symbolic Regression and Its Applications", Advances in Engineering Software, 8 July 2022, pages 1-22 *
KE SHI et al.: "A Two-Stage Evolutionary Algorithm with Repair Strategy for Heat Component-Constrained Layout Optimization", Advances in Swarm Intelligence, 8 July 2023, page 401, XP047662242, DOI: 10.1007/978-3-031-36622-2_33 *
LIU Yuan et al.: "A Review of Data-Driven Research on Performance Prediction of Wear-Resistant Steels", Journal of Mechanical Engineering, 28 February 2022, pages 31-50 *
PENG Maojun: "Urban Power Grid Spatial Load Forecasting Based on a Geographic Information System Platform", China Master's Theses Full-text Database, Engineering Science and Technology II, 15 February 2007, pages 042-263 *
DENG Yuedan et al.: "Data-Driven Multi-Objective Optimization Design and Development of Nickel-Based Superalloys", Foundry Technology, 18 May 2022, pages 351-356 *

Similar Documents

Publication Publication Date Title
CN111914873A (en) Two-stage cloud server unsupervised anomaly prediction method
CN112070125A (en) Prediction method of unbalanced data set based on isolated forest learning
US20060230018A1 (en) Mahalanobis distance genetic algorithm (MDGA) method and system
CN109858714B (en) Tobacco shred quality inspection index prediction method, device and system based on improved neural network
CN111080408A (en) Order information processing method based on deep reinforcement learning
CN111832101A (en) Construction method of cement strength prediction model and cement strength prediction method
CN115375031A (en) Oil production prediction model establishing method, capacity prediction method and storage medium
CN111723367A (en) Power monitoring system service scene disposal risk evaluation method and system
CN117349782B (en) Intelligent data early warning decision tree analysis method and system
CN104732067A (en) Industrial process modeling forecasting method oriented at flow object
CN114528949A (en) Parameter optimization-based electric energy metering abnormal data identification and compensation method
CN115409292A (en) Short-term load prediction method for power system and related device
CN115145901A (en) Multi-scale-based time series prediction method and system
CN114548591A (en) Time sequence data prediction method and system based on hybrid deep learning model and Stacking
CN115185804A (en) Server performance prediction method, system, terminal and storage medium
CN116340726A (en) Energy economy big data cleaning method, system, equipment and storage medium
CN110110447B (en) Method for predicting thickness of strip steel of mixed frog leaping feedback extreme learning machine
CN114580546A (en) Industrial pump fault prediction method and system based on federal learning framework
CN113672871A (en) High-proportion missing data filling method and related device
CN117435904B (en) Single feature ordering and composite feature extraction method
CN115618751B (en) Steel plate mechanical property prediction method
CN113033419A (en) Method and system for identifying equipment fault based on evolutionary neural network
CN115600926A (en) Post-project evaluation method and device, electronic device and storage medium
CN112749211B (en) Novel tea yield prediction method based on electric power big data
CN113743453A (en) Population quantity prediction method based on random forest

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant