CN116720058A - Method for realizing key feature combination screening of machine learning candidate features - Google Patents

Info

Publication number
CN116720058A
Authority
CN
China
Prior art keywords
feature
screening
features
candidate
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310481517.6A
Other languages
Chinese (zh)
Inventor
方继恒
杨尚荣
谢明
胡洁琼
张吉明
刘国化
杨有才
赵上强
马洪伟
陈永泰
李爱坤
宁德魁
王塞北
毕亚男
张巧
段云昭
陈松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Precious Metals Laboratory Co ltd
Sino Platinum Metals Co Ltd
Kunming Institute of Precious Metals
Original Assignee
Yunnan Precious Metals Laboratory Co ltd
Sino Platinum Metals Co Ltd
Kunming Institute of Precious Metals
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan Precious Metals Laboratory Co ltd, Sino Platinum Metals Co Ltd, Kunming Institute of Precious Metals
Priority to CN202310481517.6A
Publication of CN116720058A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Physiology (AREA)
  • Biomedical Technology (AREA)
  • Genetics & Genomics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a method for screening key feature combinations from machine learning candidate features, comprising the following steps: first, the candidate feature set is preliminarily screened by linear correlation filtering; next, the features remaining after linear correlation filtering are further screened by a genetic algorithm search that limits the number of features; the features selected by the genetic algorithm are then ranked by importance using feature weight ranking, and the top-ranked key features form the candidate features for exhaustive screening; finally, the feature combination with the best model prediction accuracy is selected by exhaustive screening as the final machine learning feature combination. The method overcomes the difficulties faced when traditional feature selection techniques are used to screen key feature combinations from a large candidate feature set, such as high domain-knowledge requirements, high computational complexity, low feature universality and low interpretability.

Description

Method for realizing key feature combination screening of machine learning candidate features
Technical Field
The application relates to a method for screening key feature combinations from machine learning candidate features, and belongs to the technical field of noble metal alloys.
Background
Data and features determine the upper bound of machine learning performance, and models and algorithms only approach this upper bound. Feature selection is therefore particularly important. In materials research, each feature set is typically specific to one application under particular conditions, and no unified feature set is valid for all applications. Selecting the most appropriate features for each machine learning task therefore remains a challenge.
At present, much work has been devoted to feature selection. Feature selection methods can be classified into 4 types according to the form of feature selection.
(1) Feature selection based on domain knowledge. These techniques are highly interpretable, but in many cases the available domain knowledge is inadequate.
(2) Filter-based feature selection. Common filter methods include correlation coefficients, variance screening, mutual information and hypothesis testing. Filter methods are efficient in computation time and robust to over-fitting; their disadvantage is that, because the correlation between features is not considered, useful correlated features may be mistakenly discarded.
(3) Wrapper-based feature selection. Common wrapper methods include complete search (e.g., branch-and-bound search, breadth-first traversal, beam search), heuristic search (e.g., bidirectional search, sequential forward selection, sequential backward selection) and random search (e.g., randomly generated sequence selection, genetic algorithms, simulated annealing). The feature subsets found by wrapper methods generally give better classification performance than those found by filter methods. However, the selected features have weak universality: when the learning algorithm is changed, feature selection must be performed again for the new algorithm. The computational complexity is also high, because a classifier must be trained and tested every time a subset is evaluated, and the execution time is long, especially for large-scale data sets.
The above analysis shows that, although each of these methods can perform feature selection, none of them simultaneously meets the requirements of low domain-knowledge demand, low computational complexity, strong feature universality and high interpretability. The problem becomes more prominent when the number of candidate features is large or the application scenario is complex. A reasonable framework for screening the most suitable feature group is therefore needed, and this has long been an open problem in predicting material properties with machine learning.
Disclosure of Invention
The problem to be solved by the application is as follows: when traditional feature selection techniques are used to screen key feature combinations from a large set of candidate machine learning features for noble metal alloys, they face difficulties such as high domain-knowledge requirements, high computational complexity, low feature universality and low interpretability.
The application provides a method for screening key feature combinations from machine learning candidate features, comprising the following steps:
(1) The candidate feature set is preliminarily screened by linear correlation filtering. In this screening, the degree of linear correlation between every pair of alloy features is evaluated with the linear regression correlation coefficient R, given by formula (1):
$$R = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2}} \qquad (1)$$
where N is the number of samples, $x_i$ and $y_i$ are two different features of the i-th alloy (i = 1, 2, ..., N), and $\bar{x}$ and $\bar{y}$ are the averages of these two features over the N alloys.
A correlation coefficient with |R| > 0.95 is taken as strong linear correlation, and alloy features that are strongly linearly correlated with one another are placed in the same group. Within each group, the alloy feature that gives the lowest modeling error when used as the single input feature is selected to represent the group and enters the subsequent screening. After grouping, alloy features within the same group are strongly linearly correlated (|R| > 0.95), while alloy features from different groups are not.
In a preferred scheme of the application, the candidate feature set contains 60 or more features. When there are fewer than 60 features, existing mature feature selection techniques can already screen key feature combinations while simultaneously meeting the requirements of low domain-knowledge demand, low computational complexity, strong feature universality and high interpretability.
(2) Using a genetic algorithm search that limits the number of features, the features remaining after linear correlation filtering are further screened K times, with m features selected each time.
In a further preferred scheme of the application, the number m of features selected per screening in step (2) is 5 to 15, and the number of screenings K is greater than 15. The model parameters used are: 50 to 150 generations, a population of 150 to 250, model parameter optimization, 20 to 70 random samplings, and 5- to 10-fold cross-validation for each sampling.
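The genetic algorithm search with a fixed number of features can be sketched as follows. This is a minimal illustration and not the implementation used in the application: the function name `ga_select`, the crossover and mutation operators, the elitism fraction and the toy fitness function are all assumptions, since the text does not specify them. In the method described here the fitness would be the prediction accuracy of the learner under cross-validation; a toy fitness is used so that the sketch runs on its own.

```python
import random

def ga_select(n_features, m, fitness, generations=100, pop_size=200,
              mutation_rate=0.2, seed=0):
    """Search for an m-feature subset of range(n_features) that maximises
    fitness(subset). Every individual always holds exactly m distinct indices.
    Operators below are illustrative assumptions, not taken from the patent."""
    rng = random.Random(seed)

    def random_subset():
        return tuple(sorted(rng.sample(range(n_features), m)))

    def crossover(a, b):
        # The child draws m distinct indices from the union of both parents.
        pool = sorted(set(a) | set(b))
        return tuple(sorted(rng.sample(pool, m)))

    def mutate(ind):
        # With probability mutation_rate, swap one selected index for an
        # index that is currently not selected.
        outside = [i for i in range(n_features) if i not in ind]
        if not outside or rng.random() > mutation_rate:
            return ind
        kept = list(ind)
        kept[rng.randrange(m)] = rng.choice(outside)
        return tuple(sorted(kept))

    population = [random_subset() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        elite = scored[: max(2, pop_size // 5)]  # keep the best 20 %
        children = [mutate(crossover(*rng.sample(elite, 2)))
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    best = max(population, key=fitness)
    return best, fitness(best)

# Toy fitness (hypothetical): counts how many "informative" indices are chosen.
informative = {3, 7, 11}

def toy_fitness(subset):
    return len(informative & set(subset))

# K independent screenings, each returning m features and its score.
K, m = 3, 5
runs = [ga_select(20, m, toy_fitness, generations=30, pop_size=40, seed=k)
        for k in range(K)]
```

Keeping every individual at exactly m indices is one simple way to enforce the limit on the number of features; a penalty term in the fitness would be another.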
(3) After the features have been selected by the genetic algorithm, feature weight ranking is used to rank the features by importance, as follows:
(a) The n features remaining after linear correlation screening constitute the feature set:
$$F = \{X_1, X_2, X_3, X_4, \ldots, X_i, X_{i+1}, \ldots, X_n\} \qquad (3)$$
where $X_i$ denotes the i-th remaining feature and n is the number of features remaining after linear correlation screening.
(b) The genetic algorithm is run K times, selecting m features each time. Regression modeling is performed on the m selected features of each run, giving the prediction accuracy $P_k$ of the k-th screening as shown in formula (4), where $P_k = 1 - \mathrm{MAPE}$ and MAPE is the mean absolute percentage error of the model:
$$\begin{aligned}
1 &= \{X_1, X_2, X_3, X_4, \ldots, X_i, X_{i+1}, \ldots, X_m\} \to P_1 \\
2 &= \{X_1, X_2, X_3, X_4, \ldots, X_i, X_{i+1}, \ldots, X_m\} \to P_2 \\
3 &= \{X_1, X_2, X_3, X_4, \ldots, X_i, X_{i+1}, \ldots, X_m\} \to P_3 \\
&\;\;\vdots \\
K &= \{X_1, X_2, X_3, X_4, \ldots, X_i, X_{i+1}, \ldots, X_m\} \to P_K
\end{aligned} \qquad (4)$$
(c) The feature weight is given by formula (5). For each feature, it equals the sum of the model prediction accuracies of all genetic algorithm screenings in which that feature was selected; the sums obtained for the different features are ranked afterwards.
$$W_A = \sum_{k=1}^{K} \delta_k^{A} P_k,\qquad W_B = \sum_{k=1}^{K} \delta_k^{B} P_k,\qquad \ldots,\qquad W_N = \sum_{k=1}^{K} \delta_k^{N} P_k \qquad (5)$$
where $W_A$ is the feature weight of feature A, n is the number of candidate features, and $\delta_k^{A}$ indicates whether feature A was selected in the k-th screening: $\delta_k^{A} = 1$ if feature A was selected, otherwise $\delta_k^{A} = 0$.
$\delta_k^{B}$ indicates whether feature B was selected in the k-th screening: $\delta_k^{B} = 1$ if feature B was selected, otherwise $\delta_k^{B} = 0$.
And so on.
$\delta_k^{N}$ indicates whether feature N was selected in the k-th screening: $\delta_k^{N} = 1$ if feature N was selected, otherwise $\delta_k^{N} = 0$.
(d) Feature ranking: the feature weights obtained from formula (5), i.e. the sums of prediction accuracies for the different features, are ranked; the resulting plot shows the features ordered by accumulated accuracy:
$$I_n = \mathrm{rank}(W_A, W_B, W_C, \ldots, W_N) \qquad (6)$$
where $I_n$ is the feature weight ranking result for the n candidate features and rank() denotes ranking the weights of the different features.
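The weight accumulation and ranking of step (3) can be illustrated with a short sketch. The function names and the three example runs are hypothetical, and the numbers are made up for illustration; they are not data from the application.

```python
from collections import defaultdict

def feature_weights(runs):
    """runs: iterable of (selected_features, precision) pairs, one per
    genetic algorithm screening. Returns {feature: summed precision of the
    screenings that selected it}."""
    weights = defaultdict(float)
    for selected, precision in runs:
        for feature in set(selected):
            weights[feature] += precision
    return dict(weights)

def rank_features(weights):
    """Features in descending order of weight (ties broken by name)."""
    return sorted(weights, key=lambda f: (-weights[f], f))

# Illustrative input: K = 3 screenings of m = 3 features each.
example_runs = [
    (("A", "B", "C"), 0.90),
    (("A", "D", "E"), 0.85),
    (("B", "A", "F"), 0.80),
]
weights = feature_weights(example_runs)
ranking = rank_features(weights)
```

Feature A is selected in all three screenings and so accumulates 0.90 + 0.85 + 0.80; a feature selected once keeps only the accuracy of that single screening.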
(4) The 12 to 16 most important key features according to the feature weight ranking form the candidate features for exhaustive screening, and the feature combination with the best model prediction accuracy is then selected exhaustively: the accuracy and generalization ability of each feature combination are evaluated by the model mean absolute percentage error (MAPE), and the feature combination with the lowest relative error is taken as the final feature combination.
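A minimal sketch of the exhaustive screening in step (4) follows. It assumes that every non-empty subset of the candidate features is evaluated, which the text does not state explicitly. The names `exhaustive_search`, `n_subsets` and `toy_error` are hypothetical, and the toy error stands in for the model MAPE.

```python
from itertools import combinations

def exhaustive_search(candidates, error_of):
    """Evaluate every non-empty subset of candidates and return the
    (subset, error) pair with the lowest error."""
    best_subset, best_error = None, float("inf")
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            err = error_of(subset)
            if err < best_error:
                best_subset, best_error = subset, err
    return best_subset, best_error

def n_subsets(n):
    """Number of non-empty subsets of n candidate features."""
    return 2 ** n - 1

# Toy error (hypothetical): penalise missing useful features heavily and
# superfluous features lightly.
useful = {"A", "C"}

def toy_error(subset):
    chosen = set(subset)
    return len(useful - chosen) + 0.1 * len(chosen - useful)

best, err = exhaustive_search(["A", "B", "C", "D"], toy_error)
```

Under this assumption, 12 candidates give 4095 subsets and 16 candidates give 65535, which is why the candidate list is cut to a small number of top-ranked features before the exhaustive step.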
Steps (1), (2), (3) and (4) are progressive and must be carried out in this order: linear correlation screening, genetic algorithm search with a limited number of features, feature weight ranking, and exhaustive screening. If the order of the steps is changed or a step is omitted, the method can no longer overcome the difficulties faced when traditional feature selection techniques are used to screen key feature combinations from a large candidate feature set, such as high domain-knowledge requirements, high computational complexity, low feature universality and low interpretability.
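A rough count of the search space helps to show why the exhaustive step is placed last. The figures below take the 55 remaining features, the 10 features per screening and the 12 exhaustive candidates from Example 1, and assume that exhaustive screening evaluates every non-empty subset:

```latex
\begin{aligned}
\text{all non-empty subsets of 55 features:}\quad & 2^{55}-1 \approx 3.6\times10^{16} \\
\text{all 10-feature subsets of 55 features:}\quad & \binom{55}{10} \approx 2.9\times10^{10} \\
\text{all non-empty subsets of 12 features:}\quad & 2^{12}-1 = 4095
\end{aligned}
```

Under these assumptions, exhausting the 55 remaining features directly would be impractical, whereas exhausting the 12 top-ranked features requires only a few thousand model evaluations.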
Compared with the prior art, the application has the following beneficial effects:
(1) The method is highly effective for screening key feature combinations from a large number of candidate features in machine learning for noble metal alloys, and can be extended to the same problem in machine learning for other, non-noble-metal alloys.
(2) The feature screening strategy of the method simultaneously achieves high interpretability, low computational complexity, strong feature universality and good model prediction performance.
Drawings
FIG. 1 shows the importance ranking of the solidus temperature features obtained by feature weight ranking;
FIG. 2 shows the exhaustive screening results for the solidus temperature features;
FIG. 3 shows the prediction results of the solidus temperature model built on the screened key feature combination.
Detailed Description
The application is described in more detail below with reference to the drawings and preferred embodiments, but the scope of the application is not limited to this description.
Example 1
This embodiment takes the screening of key machine learning feature combinations for the solidus temperature of multi-element noble metal alloys as an example. The specific steps are as follows:
(1) Machine learning data for 267 multi-element noble metal alloys and their corresponding solidus temperatures were collected; part of the data set is shown in Table 1 (only part is shown because of the large amount of data).
Table 1. Part of the collected multi-element noble metal alloy compositions (wt%) and corresponding solidus temperatures (°C)
(2) A solidus temperature feature set containing 100 features was constructed for the multi-element noble metal alloys by machine learning feature engineering (see Table 2; the value of each feature for each alloy can be calculated with the feature construction formulas below). Candidate features for machine learning can generally be obtained from domain knowledge, feature engineering and similar means. The composition of the candidate set is determined mainly by the application scenario, and the types and number of candidate features usually differ between scenarios. In this Example 1, the machine learning features are constructed by feature engineering (compared with obtaining features from domain knowledge, features constructed by feature engineering can improve the universality of the machine learning model for a small-sample data set such as the one in this embodiment). Features are chosen starting from the property to be predicted, so that features associated with the solidus temperature are selected as far as possible. The feature engineering process mainly comprises: establishing a set of physicochemical parameters and, in place of direct input of the chemical formula, constructing from the chemical proportions of each collected alloy solder a feature set that evaluates the influence of each parameter on the target quantity. The feature set is constructed as follows:
For each alloy, the mean factor $f_{mi}$ of each basic physicochemical parameter is calculated by formula (7) and the variance factor $f_{vi}$ by formula (8); $f_{mi}$ and $f_{vi}$ are used as inputs to the machine learning property prediction model:
$$f_{mi} = \frac{\sum_{j=1}^{n} f_{ij}\, c_j}{\sum_{j=1}^{n} c_j} \qquad (7)$$
where $f_{mi}$ is the mean alloy factor feature, $f_{vi}$ is the variance alloy factor feature, $f_{ij}$ is the i-th physicochemical parameter of the j-th element (i = 1, 2, ...; j = 1, 2, ..., n), n is the number of components in the alloy (the alloys collected in this example have at most eight components, so n = 8), and $c_j$ is the mass percentage of the j-th element in the alloy.
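As an illustration of formula (7), the sketch below computes the mass-weighted mean factor of one physicochemical parameter for a single alloy. The function name `mean_factor` is hypothetical, and the Ag-Cu composition and the choice of melting point as the parameter are examples only. The variance factor is not sketched because formula (8) is not reproduced in this text.

```python
def mean_factor(element_values, mass_percent):
    """Formula (7): sum_j(f_ij * c_j) / sum_j(c_j) for one parameter i.

    element_values: {element: value of the physicochemical parameter}
    mass_percent:   {element: mass percentage c_j in the alloy}
    """
    total = sum(mass_percent.values())
    weighted = sum(element_values[el] * c for el, c in mass_percent.items())
    return weighted / total

# Illustrative input: melting points of the pure elements in degrees Celsius
# and an alloy containing 72 wt% Ag and 28 wt% Cu.
melting_point_c = {"Ag": 961.78, "Cu": 1084.62}
alloy = {"Ag": 72.0, "Cu": 28.0}

f_m = mean_factor(melting_point_c, alloy)  # about 996.18
```

Dividing by the sum of the mass percentages makes the result independent of whether the composition is given in percent or as fractions.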
For how each alloy is converted by formulas (7) and (8) into the 100 new alloy features shown in Table 2, refer to the feature construction method in Example 1 of patent CN114580271A (a method for predicting the solid-liquid phase temperatures of multi-element precious metal alloy solder). Because constructing the features in this example involves a large amount of work and closely follows the construction process in CN114580271A, and because the focus of this document is the feature screening process rather than feature construction, the construction process is not described in detail here.
Table 2. Constructed alloy features and their corresponding numbers
(3) The candidate feature set was preliminarily screened by linear correlation filtering. The degree of linear correlation between every pair of alloy features was evaluated with the linear regression correlation coefficient R, calculated as follows:
$$R = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2}}$$
where N is the number of samples (N = 267 in this embodiment), $x_i$ and $y_i$ are two different features of the i-th alloy (i = 1, 2, ..., N), and $\bar{x}$ and $\bar{y}$ are the averages of these two features over the N alloys.
As the formula shows, computing R for every pair of features over the 267 alloys involves a very large amount of calculation, and displaying the results would take a great deal of space, so no specific calculation example is shown here; this work is usually done programmatically in Matlab or Python.
A correlation coefficient with |R| > 0.95 is taken as strong linear correlation, and alloy features that are strongly linearly correlated with one another are placed in the same group. Within each group, the alloy feature that gives the lowest modeling error when used as the single input feature is selected to represent the group and enters the subsequent screening. After grouping, alloy features within the same group are strongly linearly correlated (|R| > 0.95), while alloy features from different groups are not. After linear correlation screening, 55 alloy features remained for the solidus temperature model; they are listed in Table 3 (the value of each feature for each alloy is calculated in the same way as for Table 1).
Table 3. The 55 features remaining after linear correlation screening
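The linear correlation screening of step (3) might be coded along the following lines. This is a sketch under stated assumptions rather than the implementation used in the application: groups are formed as connected components of the "|R| > 0.95" relation, which is one possible reading of the grouping rule; all features are assumed to be non-constant; and the function names, toy feature values and single-feature errors are hypothetical.

```python
import math

def pearson_r(x, y):
    """Linear correlation coefficient R between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # assumes neither feature is constant

def group_correlated(features, threshold=0.95):
    """features: {name: list of values}. Links every pair with
    |R| > threshold and returns the resulting groups of names."""
    names = list(features)
    parent = {name: name for name in names}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson_r(features[a], features[b])) > threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())

def pick_representatives(groups, single_feature_error):
    """Keep, from each group, the feature with the lowest modeling error
    when used as the only input feature."""
    return [min(group, key=single_feature_error) for group in groups]

# Toy data: f1 and f2 are almost proportional, f3 is unrelated.
features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10.5],
    "f3": [5, 1, 4, 2, 3],
}
errors = {"f1": 0.12, "f2": 0.08, "f3": 0.30}  # made-up single-feature errors

groups = group_correlated(features)
kept = pick_representatives(groups, errors.get)
```

In practice the single-feature errors would come from fitting the learner on each feature alone.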
(4) Using the genetic algorithm search that limits the number of features, the features remaining after linear correlation filtering were further screened 50 times, with 10 features selected each time. The model parameters were: 100 generations, a population of 200, model parameter optimization, 50 random samplings, and 5-fold cross-validation for each sampling.
(5) After the features had been selected by the genetic algorithm, they were ranked by importance using feature weight ranking; the result is shown in FIG. 1. The ranking of the accumulated prediction accuracy shows that the top 5 key features affecting the solidus temperature are Nos. 35, 38, 81, 74 and 85, corresponding respectively to the melting enthalpy mean, bulk modulus mean, atomic radius 2 (coordination number 12) variance, ambient atomic number variance and melting enthalpy variance. The fitness values of features 35 and 38 are particularly high, indicating that these two are the most critical machine learning features for solidus temperature prediction.
(6) The 12 most important key features according to the feature weight ranking were taken as the candidate features for exhaustive screening, and the feature combination with the best model prediction accuracy was then selected exhaustively. The accuracy and generalization ability of each feature combination were evaluated by the model mean absolute percentage error (MAPE), and the combination with the lowest relative error was taken as the final feature combination. MAPE is calculated by the following formula; the exhaustive screening results for the solidus temperature are shown in FIG. 2, and the alloy features finally selected for the solidus temperature model are listed in Table 4.
$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_i-\hat{y}_i}{y_i}\right| \times 100\%$$
where N is the number of samples (N = 267 in this embodiment), $y_i$ is the actual solidus temperature, $\hat{y}_i$ is the predicted solidus temperature, and $\bar{y}$ is the average of the actual solidus temperatures.
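For reference, the mean absolute percentage error and the derived prediction accuracy P = 1 - MAPE can be computed as in the sketch below. The function names and the two temperature pairs are illustrative only, not data from the application.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.05 means 5 %)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def precision(actual, predicted):
    """Prediction accuracy P = 1 - MAPE, as used for the feature weights."""
    return 1.0 - mape(actual, predicted)

# Made-up actual and predicted temperatures in degrees Celsius.
actual = [800.0, 1000.0]
predicted = [760.0, 1050.0]

error = mape(actual, predicted)     # 0.05, i.e. 5 %
acc = precision(actual, predicted)  # 0.95
```

Because each term is divided by the actual value, the measure is undefined when an actual value is zero; that does not arise for the alloy temperatures considered here.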
Table 4. Alloy features selected for the solidus temperature model
Based on the feature screening result, regression modeling was performed with Support Vector Regression (SVR), the same learner used during feature screening. The data set was randomly divided into a training set (80%) and a test set (20%). The modeling result for the solidus temperature is shown in FIG. 3: the percentage error is 4.72% on the training set and 9.83% on the test set. Both errors are small, indicating that the model trained on the screened solidus temperature feature combination performs well and generalizes well.
Example 2
This embodiment takes the screening of key machine learning feature combinations for the liquidus temperature of multi-element noble metal alloys as an example. The specific steps are as follows:
(1) Machine learning data for 267 multi-element noble metal alloys and their corresponding liquidus temperatures were collected; part of the data set is shown in Table 5 (only part is shown because of the large amount of data).
Table 5. Part of the collected multi-element noble metal alloy compositions (wt%) and corresponding liquidus temperatures (°C)
(2) A liquidus temperature feature set containing 100 features was constructed for the multi-element noble metal alloys by machine learning feature engineering. Because liquidus temperature and solidus temperature are similar properties, the same feature construction method and the same feature types and number as in Example 1 were used, and they are not shown again here.
(3) The candidate feature set was preliminarily screened by linear correlation filtering. The degree of linear correlation between every pair of alloy features was evaluated with the linear regression correlation coefficient R (see formula (1)). A correlation coefficient with |R| > 0.95 is taken as strong linear correlation, and alloy features that are strongly linearly correlated with one another are placed in the same group. Within each group, the alloy feature that gives the lowest modeling error when used as the single input feature is selected to represent the group and enters the subsequent screening. After grouping, alloy features within the same group are strongly linearly correlated (|R| > 0.95), while alloy features from different groups are not. After linear correlation screening, 55 alloy features remained for the liquidus temperature model.
(4) Using the genetic algorithm search that limits the number of features, the features remaining after linear correlation filtering were further screened 40 times, with 5 features selected each time. The model parameters were: 100 generations, a population of 200, model parameter optimization, 50 random samplings, and 5-fold cross-validation for each sampling.
(5) After the features had been selected by the genetic algorithm, they were ranked by importance using feature weight ranking.
(6) The 14 most important key features according to the feature weight ranking were taken as the candidate features for exhaustive screening, and the feature combination with the best model prediction accuracy was then selected exhaustively. The accuracy and generalization ability of each feature combination were evaluated by the model mean absolute percentage error (MAPE), and the combination with the lowest relative error was taken as the final feature combination. The alloy features finally selected for the liquidus temperature model are listed in Table 6.
Table 6. Alloy features selected for the liquidus temperature model
Feature number | Feature name
35 | Mean melting enthalpy
81 | Variance of atomic radius 2 (coordination number 12)
95 | Variance of mass attenuation coefficient (Cr Kα)
39 | Mean Young's modulus
68 | Variance of melting point 1
(7) Based on the feature screening result, regression modeling was performed with Support Vector Regression (SVR), the same learner used during feature screening. The modeling result for the liquidus temperature shows a percentage error of 3.72% on the training set and 9.35% on the test set. Both errors are small, indicating that the model trained on the screened liquidus temperature feature combination performs well and generalizes well.
Example 3
This embodiment takes the screening of key machine learning feature combinations for the conductivity of noble metal electrical contact alloy materials as an example. The specific steps are as follows:
(1) Machine learning data for 205 precious metal electrical contact alloy compositions and their corresponding conductivities were collected; part of the data set is shown in Table 7 (only part is shown because of the large amount of data).
Table 7. Part of the collected noble metal electrical contact alloy compositions (wt%) and corresponding conductivities (% IACS)
No. | Ag | Cu | Ni | Au | Pd | Ce | Pt | Cd | V | Conductivity
1 | 90 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90.74
2 | 85 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 82.1
3 | 80 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 82.1
4 | 75 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 82.1
5 | 72 | 28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 78.37
6 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 78.37
7 | 78 | 0 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 16.9
8 | 70 | 0 | 0 | 0 | 30 | 0 | 0 | 0 | 0 | 11.49
9 | 60 | 0 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | 8.62
10 | 50 | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 5.7
11 | 90 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 47.89
12 | 80 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0 | 28.74
13 | 60 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | 0 | 20.28
14 | 40 | 0 | 0 | 60 | 0 | 0 | 0 | 0 | 0 | 15.67
15 | 95 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 45.37
16 | 90 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 29.73
17 | 88 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 28.74
18 | 80 | 0 | 0 | 0 | 0 | 0 | 20 | 0 | 0 | 17.07
19 | 20 | 0 | 0 | 0 | 0 | 0 | 80 | 0 | 0 | 5.75
20 | 25 | 0 | 0 | 0 | 0 | 0 | 75 | 0 | 0 | 5.3
21 | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 59.45
22 | 84 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 43.1
23 | 83 | 0 | 0 | 0 | 0 | 0 | 0 | 17 | 0 | 35.92
(2) The precious metal electric contact alloy material conductivity feature set containing 120 features is constructed through machine learning feature engineering, and since the part works in another unpublished research result, no specific feature display is performed here, and the feature construction process is similar to that of the embodiment 1, except that the types and the numbers of the features are different.
(3) Preliminary screening is carried out on candidate feature sets through linear correlation filtering, in the linear correlation screening, the linear correlation degree of each alloy feature is analyzed, the linear correlation degree between the features is evaluated through a linear regression correlation coefficient R (see formula 1 in detail), the correlation coefficient is more than 0.95 and is taken as strong linear correlation, and the alloy features with strong linear correlation among the alloy features are classified into the same group; each group selects the alloy characteristic with the lowest modeling error by using a single characteristic quantity in the group to represent the combined gold characteristic and enter the subsequent screening; after grouping, the alloy features in each group are in strong linear correlation (|R| > 0.95), and the alloy features in each group have no strong linear correlation. After linear correlation screening, the alloy characteristics of the conductivity model remained as 63.
(4) A genetic algorithm search with a limited number of features is then applied to the features remaining after linear correlation filtering: 60 screening runs are performed, each selecting 15 features. The model parameters used are: 100 generations, a population size of 200, model parameter optimization, 50 random samplings, and 5-fold cross-validation for each sampling.
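A genetic algorithm restricted to a fixed number of features might be sketched as follows. This is an illustrative sketch only: the encoding (a sorted tuple of feature indices), the tournament selection, the crossover and mutation operators, and the name `ga_select` are all assumptions, since the text specifies only the generation count, population size and features per run. The `fitness` callable is a hypothetical stand-in for the cross-validated model error.

```python
import random

def ga_select(n_features, m, fitness, generations=100, pop_size=200,
              mutation_rate=0.2, seed=0):
    """Search for m feature indices (out of n_features) with low fitness.

    fitness: callable taking a tuple of indices and returning an error
    to minimize (for example a cross-validated MAPE; assumed here).
    """
    rng = random.Random(seed)
    indices = list(range(n_features))

    def random_individual():
        return tuple(sorted(rng.sample(indices, m)))

    def crossover(a, b):
        # The child draws exactly m features from the union of both parents.
        pool = sorted(set(a) | set(b))
        return rng.sample(pool, m)

    def mutate(child):
        # With some probability, swap one feature for a currently unused one.
        child = list(child)
        if rng.random() < mutation_rate:
            unused = [i for i in indices if i not in child]
            if unused:
                child[rng.randrange(m)] = rng.choice(unused)
        return tuple(sorted(child))

    population = [random_individual() for _ in range(pop_size)]
    best = min(population, key=fitness)
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]

        def tournament():
            # Best of three randomly drawn individuals.
            return min(rng.sample(scored, 3))[1]

        children = [best]  # elitism: the best individual always survives
        while len(children) < pop_size:
            children.append(mutate(crossover(tournament(), tournament())))
        population = children
        best = min(population, key=fitness)
    return best
```

Under these assumptions, the 60 runs of step (4) would correspond to calling `ga_select(63, 15, fitness, seed=k)` for 60 different seeds `k` and recording the selected features and model error of each run.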
(5) After the genetic algorithm has screened the features, feature weight ranking is used to rank the features by importance.
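The feature weight ranking can be sketched as below, following the description in claim 5: a feature's weight is the sum of the prediction accuracies p_k = 1 - MAPE of all genetic algorithm runs in which it was selected. The function names and the input layout (a list of `(selected_features, mape)` pairs, with MAPE expressed as a fraction rather than a percentage) are assumptions made for illustration.

```python
def feature_weights(runs):
    """Accumulate, per feature, the accuracy of every run that selected it.

    runs: iterable of (selected_features, mape) pairs, one per GA run,
    where mape is a fraction (0.05 means 5 %).
    """
    weights = {}
    for selected, mape in runs:
        p_k = 1.0 - mape  # prediction accuracy of this run
        for feature in set(selected):
            weights[feature] = weights.get(feature, 0.0) + p_k
    return weights

def rank_features(weights):
    """Feature names sorted by descending weight (name breaks ties)."""
    return sorted(weights, key=lambda f: (-weights[f], f))
```

For example, `rank_features(feature_weights(runs))[:15]` would give the 15 highest-weighted features that feed the exhaustive search in the next step.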
(6) The 15 most important key features according to the feature weight ranking form the candidate features for exhaustive screening, and the feature combination with the best model prediction accuracy is then selected by exhaustive search. The accuracy and generalization ability of each feature combination in the model are evaluated by the mean absolute percentage error (MAPE), and the combination with the lowest relative error is taken as the final feature combination. The alloy features finally selected for the conductivity model are shown in Table 8.
TABLE 8 Results of alloy feature screening for the conductivity model
Feature number | Feature name
3 | Mean group number
10 | Mean third ionization energy
95 | Variance of the Cr Kα mass attenuation coefficient
62 | Variance of chemical potential energy
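The exhaustive screening of step (6) can be sketched as a loop over every non-empty subset of the candidate features. With 15 candidates this is 2^15 - 1 = 32767 combinations, which is why the candidate list is first cut down by weight ranking. The function name `exhaustive_search` and the `error_fn` callable (a stand-in for the model's MAPE on a given feature combination) are hypothetical.

```python
from itertools import combinations

def exhaustive_search(candidates, error_fn):
    """Return (best_combo, best_error) over all non-empty subsets.

    error_fn: callable taking a tuple of features and returning the model
    error for that combination (for example MAPE; assumed here).
    """
    best_combo, best_err = None, float("inf")
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            err = error_fn(combo)
            if err < best_err:  # strict: the first-found combo wins ties
                best_combo, best_err = combo, err
    return best_combo, best_err
```

Because subsets are visited from small to large and ties keep the earlier result, this sketch prefers the smaller combination when two combinations have the same error; that tie-breaking rule is a design choice of the sketch, not something the text specifies.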
(7) Based on the alloy feature screening result, regression modeling is performed with a support vector machine (SVM), the same learner used during alloy feature screening. The conductivity modeling results show a percentage error of 4.12% on the training set and 3.99% on the test set. Both errors are small, indicating that the model trained on the selected conductivity feature combination fits well and generalizes well.
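The percentage errors quoted above are mean absolute percentage errors. A minimal sketch of that metric, using the standard definition (the function name `mape` is an assumption, and the patent's own formula is not reproduced in this section):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes no true value is zero, since each error is divided by it.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)
```

Comparing this value on the training set and on a held-out test set, as the step does, is what supports the statement about generalization: a test error close to the training error suggests the model is not overfitting the selected features.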

Claims (6)

1. A method for realizing key feature combination screening of machine learning candidate features, characterized by comprising the following steps:
(1) preliminarily screening the candidate feature set by linear correlation filtering, wherein a correlation coefficient greater than 0.95 is taken as strong linear correlation, and features in the feature set that are strongly linearly correlated are placed in the same group;
(2) performing a genetic algorithm search with a limited number of features, further screening the features remaining after linear correlation filtering K times, with m features selected each time;
(3) after the features have been screened by the genetic algorithm, ranking the features by importance using feature weight ranking;
(4) selecting the 12 to 16 most important key features by feature weight ranking to form the candidate features for exhaustive screening, and then selecting the feature combination with the best model prediction accuracy by exhaustive search.
2. The method for implementing key feature combination screening for a machine learning candidate feature of claim 1, wherein: the number of features in the candidate feature set is greater than or equal to 60.
3. The method for implementing key feature combination screening for a machine learning candidate feature of claim 1, wherein: the formula for evaluating the correlation coefficient in step (1) is as follows:
$$R = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^{2}}}$$
where N is the number of samples, $x_i$ and $y_i$ represent two different features of the i-th alloy (i = 1, 2, ..., N), and $\bar{x}$ and $\bar{y}$ represent the averages of these two features over the N alloys.
4. The method for implementing key feature combination screening for a machine learning candidate feature of claim 1, wherein: in step (2), the number of features selected in each screening is limited to 5 to 15, i.e. m = 5 to 15; the number of screenings K is greater than 15; and the model parameters used are: 50 to 150 generations, a population size of 150 to 250, model parameter optimization, 20 to 70 random samplings, and 5- to 10-fold cross-validation for each sampling.
5. The method for implementing key feature combination screening for a machine learning candidate feature of claim 1, wherein: in step (3), the features are ranked by importance using feature weight ranking, as follows:
(a) The n features remaining after linear correlation screening constitute the feature set F:
$$F = \{X_1, X_2, X_3, X_4, \ldots, X_i, X_{i+1}, \ldots, X_n\} \quad (3)$$
where $X_i$ denotes the i-th remaining feature and n is the number of features remaining after linear correlation screening;
(b) The genetic algorithm is used to perform K screenings, each selecting m features; regression modeling is carried out on those m features, and the prediction accuracy of the k-th screening is $p_k$, as shown in formula (4), where $p_k = 1 - \mathrm{MAPE}$ and MAPE is the mean absolute percentage error of the model;
$$\begin{aligned}
1 &: \{X_1, X_2, X_3, X_4, \ldots, X_i, X_{i+1}, \ldots, X_m\} \to p_1 \\
2 &: \{X_1, X_2, X_3, X_4, \ldots, X_i, X_{i+1}, \ldots, X_m\} \to p_2 \\
3 &: \{X_1, X_2, X_3, X_4, \ldots, X_i, X_{i+1}, \ldots, X_m\} \to p_3 \\
&\;\;\vdots \\
K &: \{X_1, X_2, X_3, X_4, \ldots, X_i, X_{i+1}, \ldots, X_m\} \to p_K
\end{aligned} \quad (4)$$
(c) The feature weight is given by formula (5): the weight of a feature equals the sum, over all genetic algorithm screenings, of the product of that feature's selection indicator and the model prediction accuracy of the screening; these sums are subsequently ranked across the different features;
$$W_A = \sum_{k=1}^{K} \delta_k^{A}\, p_k \quad (5)$$
where $W_A$ is the feature weight of feature A, n is the number of candidate features, and $\delta_k^{A}$ indicates whether feature A is selected in the k-th screening: $\delta_k^{A} = 1$ if feature A is selected, otherwise $\delta_k^{A} = 0$;
$\delta_k^{B}$ indicates whether feature B is selected in the k-th screening: $\delta_k^{B} = 1$ if feature B is selected, otherwise $\delta_k^{B} = 0$;
and so on;
$\delta_k^{N}$ indicates whether feature N is selected in the k-th screening: $\delta_k^{N} = 1$ if feature N is selected, otherwise $\delta_k^{N} = 0$;
(d) Feature ranking: after the feature weight formula has been applied, the sums of products of selection and prediction accuracy are ranked across the different features; the plotted graph shows the ranking of each feature's accumulated accuracy:
$$I_n = \operatorname{rank}(W_A, W_B, W_C, \ldots, W_N) \quad (6)$$
where $I_n$ denotes the feature weight ranking result for the n candidate features, and rank() denotes ranking the weights of the different features.
6. The method for implementing key feature combination screening for a machine learning candidate feature of claim 1, wherein: in step (4), the accuracy and generalization ability of each feature combination in the model are evaluated by the mean absolute percentage error (MAPE), and the feature combination with the lowest relative error is taken as the finally selected feature combination.
CN202310481517.6A 2023-04-28 2023-04-28 Method for realizing key feature combination screening of machine learning candidate features Pending CN116720058A (en)


Publications (1)

Publication Number Publication Date
CN116720058A true CN116720058A (en) 2023-09-08

Family

ID=87870410



Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180150746A1 (en) * 2016-02-05 2018-05-31 Huawei Technologies Co., Ltd. Feature Set Determining Method and Apparatus
US20180300333A1 (en) * 2017-04-13 2018-10-18 General Electric Company Feature subset selection and ranking
CN110135469A (en) * 2019-04-24 2019-08-16 北京航空航天大学 It is a kind of to improve the characteristic filter method and device selected based on correlative character
CN111126426A (en) * 2019-10-11 2020-05-08 平安普惠企业管理有限公司 Feature selection method and device, computer equipment and storage medium
CN112149702A (en) * 2019-06-28 2020-12-29 北京百度网讯科技有限公司 Feature processing method and device
CN112215278A (en) * 2020-10-09 2021-01-12 吉林大学 Multi-dimensional data feature selection method combining genetic algorithm and dragonfly algorithm
CN112216356A (en) * 2020-10-22 2021-01-12 哈尔滨理工大学 High-entropy alloy hardness prediction method based on machine learning
CN113837271A (en) * 2021-09-23 2021-12-24 山东纬横数据科技有限公司 Classification improvement algorithm based on feature selection
CN114580272A (en) * 2022-02-16 2022-06-03 昆明贵金属研究所 Design method for simultaneously optimizing conductivity and hardness of multi-element electric contact alloy
CN114580271A (en) * 2022-02-16 2022-06-03 昆明贵金属研究所 Method for realizing solid-liquid phase temperature prediction of multi-element precious metal alloy brazing filler metal
CN115527625A (en) * 2022-10-19 2022-12-27 西安邮电大学 Hardness prediction method and system for high-entropy alloy
CN115879638A (en) * 2022-12-30 2023-03-31 东北石油大学 Carbon emission prediction method for oil field transfer station system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIHENG FANG et al.: "Solid-Liquid Phase Temperature Prediction of Alloys Based on Machine Learning Key Feature Screening", 《HTTPS://PAPERS.SSRN.COM/SOL3/PAPERS.CFM?ABSTRACT_ID=4423396》, 19 April 2023 (2023-04-19), pages 1 - 30 *
李郅琴;杜建强;聂斌;熊旺平;黄灿奕;李欢: "A Survey of Feature Selection Methods", Computer Engineering and Applications, no. 24, 31 December 2019 (2019-12-31), pages 16 - 25 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196918A (en) * 2023-09-21 2023-12-08 国家电网有限公司大数据中心 Building carbon emission determining method, device, equipment and storage medium
CN117196918B (en) * 2023-09-21 2024-06-07 国家电网有限公司大数据中心 Building carbon emission determining method, device, equipment and storage medium
CN117787476A (en) * 2023-12-07 2024-03-29 聊城大学 Quick evaluation method for blocking flow shop scheduling based on key machine


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination