CN116720058A - Method for realizing key feature combination screening of machine learning candidate features - Google Patents
- Publication number
- CN116720058A (application number CN202310481517.6A)
- Authority
- CN
- China
- Prior art keywords
- feature
- screening
- features
- candidate
- machine learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The application discloses a method for screening key feature combinations from machine learning candidate features, comprising the following steps: first, the candidate feature set is pre-screened by linear correlation filtering; the features remaining after linear correlation filtering are then searched and further screened by a genetic algorithm with a limited feature count; after the genetic algorithm screening, feature weight ranking is used to order the features by importance, and the top-ranked key features are selected to form the candidate set for exhaustive screening; finally, the feature combination with the best model prediction accuracy is found by exhaustive screening and taken as the final machine learning feature combination. The method overcomes the difficulties faced by conventional feature selection techniques when screening key feature combinations from large candidate feature sets: heavy domain-knowledge requirements, high computational complexity, poor feature generality, and low interpretability.
Description
Technical Field
The application relates to a method for screening key feature combinations from machine learning candidate features and belongs to the technical field of noble metal alloys.
Background
Data and features determine the upper bound of machine learning, and models and algorithms merely approach this bound; feature selection is therefore particularly important. In materials research, each feature set is typically valid only for a specific application under specific conditions, and there is no unified feature set effective for all applications. Selecting the most appropriate features for each machine learning task is thus one of the recurring challenges.
At present, much effort has been devoted to feature selection. Existing feature selection methods can be grouped by form into several categories. (1) Feature selection based on domain knowledge: such techniques are highly interpretable, but in many cases the available domain knowledge is inadequate. (2) Filter methods: common filters include correlation coefficients, variance screening, mutual information, and hypothesis testing. Filter methods are computationally efficient and robust to overfitting, but they do not consider correlations between features, so useful correlated features may be wrongly discarded. (3) Wrapper methods, which include: full search (e.g., branch-and-bound search, breadth-first traversal, directed search); heuristic search (e.g., bidirectional search, sequential forward selection, sequential backward selection); and random search (e.g., randomly generated sequence selection, genetic algorithms, simulated annealing). The classification performance of feature subsets found by wrapper methods is generally better than that of filter methods, but the selected features do not generalize well: feature selection must be repeated whenever the learning algorithm changes, and the computational complexity is high because a classifier is trained and tested every time a subset is evaluated, making execution times long, especially on large data sets.
As the analysis above shows, each of these methods can perform feature selection, but none of them simultaneously satisfies the requirements of low domain-knowledge demand, low computational complexity, strong feature generality, and high interpretability. The problem becomes especially acute when the number of candidate features is large or the application scenario is complex. A sound framework for screening the most suitable feature set is therefore needed; this has been a persistent difficulty in applying machine learning to material property prediction.
Disclosure of Invention
The application aims to solve the following problem: when conventional feature selection techniques are used to screen key feature combinations from the large candidate feature sets of noble metal alloy machine learning, it is difficult to overcome heavy domain-knowledge requirements, high computational complexity, poor feature generality, and low interpretability.
The application provides a method for screening key feature combinations from machine learning candidate features, comprising the following steps:
(1) The candidate feature set is pre-screened by linear correlation filtering. In this step, the degree of linear correlation of each alloy feature is analyzed, and the linear correlation between features is evaluated through the linear regression correlation coefficient R of formula (1):

R = Σ(x_i − x̄)(y_i − ȳ) / √[ Σ(x_i − x̄)² · Σ(y_i − ȳ)² ]   (1)

where N is the number of samples, x_i and y_i (i = 1, 2, ..., N) are the values of two different features for the i-th alloy, and x̄ and ȳ are the averages of these two features over the N alloys.
A correlation coefficient |R| greater than 0.95 is taken as strong linear correlation, and alloy features that are strongly linearly correlated with one another are placed in the same group. From each group, the alloy feature with the lowest single-feature modeling error is selected to represent the group in the subsequent screening. After grouping, the alloy features within each group are strongly linearly correlated (|R| > 0.95), while features from different groups have no strong linear correlation.
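The grouping-and-representative step above can be sketched as follows. This is a minimal illustration on toy data, not the patent's alloy dataset; the greedy grouping order and the single-feature least-squares error used to pick each group's representative are assumptions, since the patent does not spell out those details.

```python
import numpy as np

def correlation_groups(X, threshold=0.95):
    """Greedily group feature columns whose pairwise Pearson |R| exceeds threshold."""
    n_features = X.shape[1]
    R = np.corrcoef(X, rowvar=False)          # pairwise Pearson coefficients
    groups, assigned = [], set()
    for i in range(n_features):
        if i in assigned:
            continue
        group = [i] + [j for j in range(i + 1, n_features)
                       if j not in assigned and abs(R[i, j]) > threshold]
        assigned.update(group)
        groups.append(group)
    return groups

def pick_representatives(X, y, groups):
    """From each group keep the feature with the lowest single-feature
    least-squares modeling error against the target y."""
    reps = []
    for group in groups:
        errors = []
        for j in group:
            a, b = np.polyfit(X[:, j], y, 1)           # 1-D linear fit y ~ a*x + b
            errors.append(np.sum((y - (a * X[:, j] + b)) ** 2))
        reps.append(group[int(np.argmin(errors))])
    return reps

# toy data: feature 1 is an almost exact linear copy of feature 0
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
X = np.column_stack([x0,
                     2 * x0 + 0.001 * rng.normal(size=200),
                     rng.normal(size=200)])
y = 3 * x0 + rng.normal(scale=0.1, size=200)
groups = correlation_groups(X)
reps = pick_representatives(X, y, groups)
```

Here features 0 and 1 fall into one group and feature 2 stands alone, so only two representatives enter the subsequent screening.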
As a preferable scheme of the application, the candidate feature set contains at least 60 features; when there are fewer than 60 features, existing mature feature selection techniques can already achieve key feature combination screening while simultaneously satisfying the requirements of low domain-knowledge demand, low computational complexity, strong feature generality, and high interpretability.
(2) A genetic algorithm search with a limited feature count further screens the features remaining after linear correlation filtering; the screening is repeated K times, with m features selected each time.

As a further preferable scheme of the application, the limited number m of features per screening in step (2) is 5 to 15, and the number of screenings K is greater than 15. The model parameters used are: 50-150 generations, a population of 150-250, model parameter optimization, 20-70 random samplings, and 5- to 10-fold cross-validation for each sampling.
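A compact genetic-algorithm search with a fixed subset size m can be sketched as below. This is an illustrative stand-in: the fitness here is 1 − MAPE of an ordinary least-squares fit on synthetic data, whereas the patent evaluates subsets with an SVR model and cross-validation, and the generation and population counts are reduced for brevity.

```python
import numpy as np

def fitness(X, y, subset):
    """1 - MAPE of an ordinary least-squares fit on the selected features."""
    A = np.column_stack([X[:, sorted(subset)], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1.0 - np.mean(np.abs((y - A @ coef) / y))

def ga_select(X, y, m, generations=30, pop_size=40, seed=7):
    """Evolve subsets of exactly m feature indices toward maximal fitness."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = [frozenset(rng.choice(n, m, replace=False).tolist())
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda s: fitness(X, y, s), reverse=True)
        parents = scored[: pop_size // 2]            # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            i, j = rng.choice(len(parents), 2, replace=False)
            pool = list(parents[i] | parents[j])     # crossover: resample m from union
            child = set(rng.choice(pool, m, replace=False).tolist())
            if rng.random() < 0.2:                   # mutation: swap one index
                child.discard(int(rng.choice(sorted(child))))
                while len(child) < m:
                    child.add(int(rng.integers(n)))
            children.append(frozenset(child))
        pop = parents + children
    best = max(pop, key=lambda s: fitness(X, y, s))
    return sorted(best), fitness(X, y, best)

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=(150, 10))
y = 5.0 + 3.0 * X[:, 0] + 2.0 * X[:, 1]          # target driven by features 0 and 1
best, best_fit = ga_select(X, y, m=2)
```

In the patent's setting, each call to a search like `ga_select` would correspond to one of the K screenings, and the selected subsets together with their accuracies p_k feed the feature weight ranking of step (3).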
(3) After the features are screened by the genetic algorithm, feature weight ranking is used to order the features by importance, as follows:
(a) The n features remaining after linear correlation screening form the feature set:

F = {X_1, X_2, X_3, X_4, ..., X_i, X_{i+1}, ..., X_n}   (3)

where X_i denotes the i-th remaining feature and n is the number of features remaining after linear correlation screening.
(b) The genetic algorithm screening is performed K times, with m features selected each time; regression modeling on the m selected features yields the prediction accuracy p_k = 1 − MAPE, where MAPE is the model mean absolute percentage error (the m features selected generally differ from screening to screening):

screening 1: {X_1, X_2, ..., X_m} → p_1
screening 2: {X_1, X_2, ..., X_m} → p_2
screening 3: {X_1, X_2, ..., X_m} → p_3
...
screening K: {X_1, X_2, ..., X_m} → p_K   (4)
(c) The feature weight is shown in the formula (5), and is equal to the sum of products of the same feature and model prediction precision after each genetic algorithm screening, and the sum of products of the features of different types and prediction precision is sequenced later.
W A Is the feature weight of feature A, n is the number of candidate feature sets,for the kth screening, whether or not a characteristic A is selected, if the selected characteristic is A, then +.>No->
For the kth screening, whether or not a characteristic B is selected, if the selected characteristic is B, then +.>Otherwise->
…
And so on
For the kth screening, whether or not a characteristic N is selected, if the selected characteristic is N, then +.>Otherwise
(d) Feature ranking: the feature weights produced by formula (5) are sorted, and the resulting plot ranks the features by their accumulated accuracy sums:

I_n = rank(W_A, W_B, W_C, ..., W_N)   (6)

where I_n is the feature weight ranking result for the n candidate features and rank(·) denotes sorting the weights of the different features.
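The weight accumulation of formula (5) and the ranking of formula (6) can be implemented directly; the screening subsets and accuracies below are toy values, not results from the patent.

```python
def feature_weights(screenings):
    """screenings: list of (selected_features, p_k) pairs, one per GA screening.
    Each feature's weight is the sum of p_k over the screenings that selected it."""
    weights = {}
    for selected, p_k in screenings:
        for feat in selected:
            weights[feat] = weights.get(feat, 0.0) + p_k   # W_A = sum_k delta_A^k * p_k
    return weights

# toy results of K = 3 genetic-algorithm screenings: (subset, accuracy p_k)
screenings = [({"X1", "X2", "X5"}, 0.95),
              ({"X2", "X3", "X5"}, 0.90),
              ({"X2", "X4", "X6"}, 0.85)]
W = feature_weights(screenings)
ranking = sorted(W, key=W.get, reverse=True)       # formula (6): rank by weight
```

With these toy values, X2 (selected in all three screenings) accumulates the highest weight and tops the ranking.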
(4) The 12 to 16 most important key features identified by the feature weight ranking form the candidate set for exhaustive screening; the feature combination with the best model prediction accuracy is then found exhaustively. The accuracy and generalization ability of each feature combination in the evaluation model are assessed by the model mean absolute percentage error (MAPE), and the combination with the lowest relative error is taken as the final screened feature combination.
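The exhaustive stage can be sketched as follows: every combination of the top-ranked candidates is evaluated and the lowest-MAPE combination wins. The evaluator here is an ordinary least-squares fit on synthetic data, not the patent's SVR model, and the candidate list stands in for the 12 to 16 top-ranked features.

```python
import itertools
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error (as a fraction)."""
    return np.mean(np.abs((y_true - y_pred) / y_true))

def ols_mape(X, y, subset):
    """MAPE of an ordinary least-squares fit on the chosen feature subset."""
    A = np.column_stack([X[:, list(subset)], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return mape(y, A @ coef)

def exhaustive_screen(X, y, candidates, max_size):
    """Evaluate every combination of the candidates up to max_size features."""
    best_subset, best_err = None, np.inf
    for size in range(1, max_size + 1):
        for subset in itertools.combinations(candidates, size):
            err = ols_mape(X, y, subset)
            if err < best_err:
                best_subset, best_err = subset, err
    return best_subset, best_err

rng = np.random.default_rng(1)
X = rng.uniform(1, 2, size=(120, 6))
y = 4 + 2 * X[:, 0] + 3 * X[:, 2]            # depends only on features 0 and 2
best, err = exhaustive_screen(X, y, candidates=range(6), max_size=3)
```

Because the toy target depends only on features 0 and 2, the winning combination contains both, with essentially zero error.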
The feature screening steps (1) to (4) form a strict stepwise progression: linear correlation screening, genetic algorithm search with a limited feature count, feature weight ranking, and exhaustive screening. Changing the order of these steps, or omitting any of them, fails to overcome the difficulties (heavy domain-knowledge requirements, high computational complexity, poor feature generality, low interpretability) faced by conventional feature selection techniques when screening key feature combinations from large candidate feature sets.
Compared with the prior art, the application has the following beneficial effects:
(1) The method is highly effective for screening key feature combinations from large numbers of candidate features in precious metal alloy machine learning, and it can be extended to key feature screening from large candidate sets in machine learning for other, non-precious-metal alloys.
(2) The proposed feature screening strategy simultaneously offers high interpretability, low computational complexity, strong feature generality, and good model prediction performance.
Drawings
FIG. 1 shows the solid phase temperature feature importance results obtained by feature weight ranking;
FIG. 2 shows the exhaustive screening results for the solid phase temperature features;
FIG. 3 shows the prediction results of the solid phase temperature model based on the screened combination of key solid phase temperature features.
Detailed Description
The application will now be described in more detail with reference to the drawings and the preferred embodiments, but the scope of the application is not limited to this description.
Example 1
This embodiment takes the screening of key machine learning feature combinations for the solid phase temperature of multi-element noble metal alloys as an example for analysis and explanation. The specific steps are as follows:
(1) Machine learning data for 267 sets of multi-element noble metal alloys and corresponding solid phase temperatures were collected, with a partial data set as shown in table 1 (only a portion of the data is shown due to the large amount of data).
TABLE 1 Partial collected multi-element noble metal alloy compositions (wt%) and corresponding solid phase temperatures (°C)
(2) A multi-element noble metal alloy solid phase temperature feature set containing 100 features is constructed through machine learning feature engineering (as shown in table 2; the specific value of each feature for each alloy can be calculated from the feature construction formulas below). Machine learning candidate features are generally obtained through domain knowledge, feature engineering, and similar means; the composition of the candidate set depends mainly on the application scenario, and the feature types and counts usually differ between scenarios. In this embodiment 1, the machine learning features are constructed by feature engineering (compared with features obtained from domain knowledge, engineered features improve the generality of the machine learning model for the small-sample data set of this embodiment), and features associated with the solid phase temperature are chosen as far as possible, since solid phase temperature is the property to be predicted. The feature engineering process is mainly: establish a set of basic physicochemical parameters, and then, according to the stoichiometry of the collected alloy solder formulas, construct a feature set that evaluates the influence of each parameter on the target quantity, replacing direct input of the chemical formula. The construction process of this feature set is as follows:
calculating the mean value factor f of each basic physicochemical parameter of each alloy by the formula (7) mi Calculating the variance factor f of each basic physicochemical parameter of each alloy by the formula (8) vi Feature quantity, and f mi And f vi As input to a machine learning performance prediction model;
f mi =∑(f ij ×c j )/∑c j (7)
wherein f mi For the average alloy factor characteristic, f vi Is characterized by a variance alloy factor, f ij The ith physicochemical parameter (i=1, 2, …; j=1, 2, … n) representing the jth element, n representing the number of components of the alloy, the alloy collected in this example being eight-component alloy at the maximum, so that n=8, c j Representing the mass percent of the jth element in the alloy.
For how each alloy is converted by formulas (7) and (8) into the 100 new alloy features shown in table 2, reference may be made to the feature construction method in embodiment 1 of patent CN114580271A (a method for predicting the solid-liquid phase temperatures of multi-element precious metal alloy solders). Because the feature construction work in this example is extensive, and the focus of this document is the feature screening process rather than feature construction, which closely follows the construction process of CN114580271A, it is not described in detail here.
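The two construction formulas can be sketched in a few lines. The composition-weighted variance used for f_vi below is an assumption made consistent with the mean of formula (7), and the element parameter values are illustrative placeholders rather than the patent's data.

```python
def alloy_factors(params, fractions):
    """params: per-element values f_ij of one physicochemical parameter;
    fractions: per-element mass percentages c_j (same element order).
    Returns the mean factor (formula (7)) and a composition-weighted
    variance factor (assumed form of formula (8))."""
    total = sum(fractions)
    f_m = sum(f * c for f, c in zip(params, fractions)) / total
    f_v = sum(c * (f - f_m) ** 2 for f, c in zip(params, fractions)) / total
    return f_m, f_v

# placeholder example: elemental melting points (degrees C) of a
# hypothetical Ag-Cu alloy with 72/28 mass percentages
melting_points = [961.8, 1084.6]
mass_pct = [72.0, 28.0]
f_m, f_v = alloy_factors(melting_points, mass_pct)
```

Repeating this for every physicochemical parameter of every alloy yields the mean/variance feature columns of table 2.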
TABLE 2 Constructed alloy features and their corresponding numbers
(3) The candidate feature set is pre-screened by linear correlation filtering: the degree of linear correlation of each alloy feature is analyzed, and the linear correlation between features is evaluated through the linear regression correlation coefficient R of formula (1), where N is the number of samples (N = 267 in this embodiment), x_i and y_i (i = 1, 2, ..., N) are the values of two different features for the i-th alloy, and x̄ and ȳ are the averages of these two features over the N alloys.

As the formula shows, computing R for every pair of features across 267 alloys involves a very large amount of calculation, and displaying it would take considerable space, so no specific calculation examples are shown here; this work is usually done by coding the calculation in Matlab or Python.
A correlation coefficient |R| greater than 0.95 is taken as strong linear correlation, and alloy features strongly linearly correlated with one another are placed in the same group. From each group, the alloy feature with the lowest single-feature modeling error is selected to represent the group in the subsequent screening. After grouping, the alloy features within each group are strongly linearly correlated (|R| > 0.95), while features from different groups have no strong linear correlation. After linear correlation screening, 55 alloy features remained for the solid phase temperature model, as listed in table 3 (each feature corresponds to a specific value for each alloy, calculated in the same way as in table 1).
TABLE 3 The 55 alloy features remaining after linear correlation screening
(4) The genetic algorithm search with a limited feature count further screened the features remaining after linear correlation filtering; 50 screenings were performed, with 10 features selected each time. The model parameters used were: 100 generations, a population of 200, model parameter optimization, 50 random samplings, and 5-fold cross-validation for each sampling.
(5) After the genetic algorithm screening, feature weight ranking was used to order the features by importance; the result is shown in figure 1. From the accumulated prediction accuracy ranking: the top 5 key features affecting the solid phase temperature are numbers 35, 38, 81, 74, and 85, corresponding to the melting enthalpy mean, bulk modulus mean, atomic radius 2 (coordination number 12) variance, ambient atomic number variance, and melting enthalpy variance features. In particular, the fitness values of features 35 and 38 are high, indicating that these two are the most critical machine learning features for solid phase temperature prediction.
(6) The 12 most important key features identified by the feature weight ranking form the candidate set for exhaustive screening; the feature combination with the best model prediction accuracy was then found exhaustively. The accuracy and generalization ability of each combination in the evaluation model were assessed by the model mean absolute percentage error (MAPE), and the combination with the lowest relative error was taken as the final feature combination. The MAPE is calculated as

MAPE = (1/N) Σ |y_i − ŷ_i| / y_i × 100%

where N is the number of samples (N = 267 in this embodiment), y_i is the actual solid phase temperature, and ŷ_i is the predicted solid phase temperature. The exhaustive screening result for the solid phase temperature is shown in figure 2, and the specific alloy feature types screened out exhaustively for the solid phase temperature model are listed in table 4.
TABLE 4 Results of alloy feature screening for the solid phase temperature model
Based on the feature screening result, regression modeling was performed with the same learner used during feature screening, a support vector regression (SVR) algorithm. The data set was randomly split into a training set (80%) and a test set (20%); the solid phase temperature modeling result is shown in figure 3. For the solid phase temperature prediction model, the training set percentage error is 4.72% and the test set percentage error is 9.83%. These errors are small, indicating that the model trained on the screened solid phase temperature feature combination performs well and generalizes well.
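The final modeling step, SVR regression on the screened features with an 80/20 split and MAPE evaluation, can be sketched with scikit-learn. The dataset below is synthetic and the hyperparameters (kernel, C) are illustrative choices, not the patent's tuned values or its alloy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(267, 5))         # stand-in for the screened features
y = 900 + 300 * X[:, 0] + 150 * X[:, 1] + rng.normal(scale=10, size=267)

# random 80% / 20% train-test split, as in the example
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = SVR(kernel="rbf", C=100.0, gamma="scale").fit(X_tr, y_tr)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

train_mape = mape(y_tr, model.predict(X_tr))
test_mape = mape(y_te, model.predict(X_te))
```

Comparing train and test MAPE, as the embodiment does, gives a quick check on both accuracy and generalization.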
Example 2
This embodiment takes the screening of key machine learning feature combinations for the liquid phase temperature of multi-element noble metal alloys as an example for analysis and explanation. The specific steps are as follows:
(1) Machine learning data for 267 sets of multi-element noble metal alloys and corresponding liquidus temperatures were collected, with a partial data set as shown in table 5 (only a portion of the data is shown due to the large amount of data).
TABLE 5 Partial collected multi-element noble metal alloy compositions (wt%) and corresponding liquidus temperatures (°C)
(2) A multi-element noble metal alloy liquidus temperature feature set containing 100 features is constructed through machine learning feature engineering; because the liquidus and solidus temperature properties are similar, the same feature construction method and the same feature types and numbers as in embodiment 1 are adopted, and they are not shown again here.
(3) The candidate feature set is pre-screened by linear correlation filtering: the degree of linear correlation of each alloy feature is analyzed, the linear correlation between features is evaluated through the linear regression correlation coefficient R (formula (1)), and a correlation coefficient greater than 0.95 is taken as strong linear correlation; alloy features strongly linearly correlated with one another are placed in the same group. From each group, the alloy feature with the lowest single-feature modeling error is selected to represent the group in the subsequent screening. After grouping, the alloy features within each group are strongly linearly correlated (|R| > 0.95), while features from different groups have no strong linear correlation. After linear correlation screening, 55 alloy features remained for the liquidus temperature model.
(4) The genetic algorithm search with a limited feature count further screened the remaining features after linear correlation filtering; 40 screenings were performed, with 5 features selected each time. The model parameters used were: 100 generations, a population of 200, model parameter optimization, 50 random samplings, and 5-fold cross-validation for each sampling.
(5) After the genetic algorithm screening, feature weight ranking was used to order the features by importance.
(6) The 14 most important key features identified by the feature weight ranking form the candidate set for exhaustive screening; the feature combination with the best model prediction accuracy was then found exhaustively. The accuracy and generalization ability of each combination in the evaluation model were assessed by the model mean absolute percentage error (MAPE), and the combination with the lowest relative error was taken as the final feature combination. The specific alloy feature types screened out exhaustively for the liquid phase temperature model are listed in table 6.
TABLE 6 Results of alloy feature screening for the liquid phase temperature model

Feature number | Feature
35 | Mean of melting enthalpy
81 | Variance of atomic radius 2 (coordination number 12)
95 | Variance of mass attenuation coefficient CrKα
39 | Mean of Young's modulus
68 | Variance of melting point 1
(7) Based on the feature screening result, regression modeling was performed with the same learner used during feature screening, a support vector regression (SVR) algorithm. The liquid phase temperature modeling result shows a training set percentage error of 3.72% and a test set percentage error of 9.35%. These errors are small, indicating that the model trained on the screened liquid phase temperature feature combination performs well and generalizes well.
Example 3
This embodiment takes the screening of key machine learning feature combinations for the electrical conductivity of noble metal electrical contact alloy materials as an example for analysis and explanation. The specific steps are as follows:
(1) Machine learning data for 205 sets of precious metal electrical contact alloy material compositions and corresponding conductivity properties were collected, with a partial data set as shown in table 7 (only a portion of the data is shown due to the large amount of data).
TABLE 7 Partial collected noble metal electrical contact alloy material compositions (wt%) and corresponding conductivity (%IACS) performance data
Sequence number | Ag | Cu | Ni | Au | Pd | Ce | Pt | Cd | V | Conductivity (%IACS)
1 | 90 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90.74 |
2 | 85 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 82.1 |
3 | 80 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 82.1 |
4 | 75 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 82.1 |
5 | 72 | 28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 78.37 |
6 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 78.37 |
7 | 78 | 0 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 16.9 |
8 | 70 | 0 | 0 | 0 | 30 | 0 | 0 | 0 | 0 | 11.49 |
9 | 60 | 0 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | 8.62 |
10 | 50 | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 5.7 |
11 | 90 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 47.89 |
12 | 80 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0 | 28.74 |
13 | 60 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | 0 | 20.28 |
14 | 40 | 0 | 0 | 60 | 0 | 0 | 0 | 0 | 0 | 15.67 |
15 | 95 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 45.37 |
16 | 90 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 29.73 |
17 | 88 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 28.74 |
18 | 80 | 0 | 0 | 0 | 0 | 0 | 20 | 0 | 0 | 17.07 |
19 | 20 | 0 | 0 | 0 | 0 | 0 | 80 | 0 | 0 | 5.75 |
20 | 25 | 0 | 0 | 0 | 0 | 0 | 75 | 0 | 0 | 5.3 |
21 | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 59.45 |
22 | 84 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 43.1 |
23 | 83 | 0 | 0 | 0 | 0 | 0 | 0 | 17 | 0 | 35.92 |
(2) A noble metal electrical contact alloy conductivity feature set containing 120 features was constructed through machine learning feature engineering. Because this part of the work belongs to another unpublished study, the specific features are not displayed here; the feature construction process is similar to that of Embodiment 1, differing only in the types and number of features.
(3) The candidate feature set was preliminarily screened by linear correlation filtering. In the linear correlation screening, the degree of linear correlation of each alloy feature is analyzed: the linear correlation between features is evaluated by the linear regression correlation coefficient R (see formula 1), a correlation coefficient greater than 0.95 is taken as strong linear correlation, and alloy features that are strongly linearly correlated with one another are placed in the same group. From each group, the alloy feature with the lowest single-feature modeling error is selected to represent the group's alloy features in the subsequent screening. After grouping, the alloy features within each group are strongly linearly correlated (|R| > 0.95), while features from different groups are not. After linear correlation screening, 63 alloy features remained for the conductivity model.
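The grouping under the |R| > 0.95 criterion can be sketched as follows; `group_by_strong_correlation` is an illustrative helper not named in the patent, and the per-group representative selection by single-feature modeling error is omitted:

```python
import numpy as np

def group_by_strong_correlation(X, threshold=0.95):
    """Group feature columns whose pairwise |R| exceeds the threshold.

    Uses a small union-find so transitively correlated features land in one
    group, mirroring the patent's grouping of strongly correlated features.
    """
    n = X.shape[1]
    R = np.corrcoef(X, rowvar=False)     # pairwise linear correlation matrix
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(R[i, j]) > threshold:     # strong linear correlation: same group
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# toy data: feature 1 is a near-exact linear copy of feature 0
rng = np.random.default_rng(1)
f0 = rng.normal(size=300)
X = np.column_stack([f0, 2.0 * f0 + 1e-6 * rng.normal(size=300),
                     rng.normal(size=300)])
groups = group_by_strong_correlation(X)   # features 0 and 1 end up grouped
```

In real usage, one group member would then be kept (the one with the lowest single-feature modeling error) and the rest discarded before the genetic search.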
(4) A genetic algorithm search based on a limited feature number further screened the features remaining after linear correlation filtering 60 times, selecting 15 features each time, with the following model parameters: 100 generations, a population of 200, model hyperparameter optimization, 50 random samplings, and 5-fold cross-validation for each sampling.
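A heavily simplified sketch of a feature-count-limited genetic search follows. The selection, crossover, and mutation operators here are illustrative choices, not the patent's exact GA, and the model hyperparameter optimization it mentions is omitted:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def ga_select(X, y, m=5, generations=20, pop_size=30, seed=0):
    """Simplified genetic search for a feature subset of fixed size m.

    Individuals are arrays of exactly m feature indices; fitness is the mean
    5-fold cross-validated score of an SVR restricted to those features.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = [rng.choice(n, size=m, replace=False) for _ in range(pop_size)]

    def fitness(idx):
        return cross_val_score(SVR(), X[:, idx], y, cv=5).mean()

    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        elite = [pop[i] for i in np.argsort(scores)[::-1][: pop_size // 2]]
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.choice(len(elite), size=2, replace=False)
            union = np.union1d(elite[a], elite[b])   # crossover: draw from parents' union
            child = rng.choice(union, size=m, replace=False)
            if rng.random() < 0.2:                   # mutation: swap in one outside feature
                outside = np.setdiff1d(np.arange(n), child)
                child[rng.integers(m)] = rng.choice(outside)
            children.append(child)
        pop = elite + children
    return np.sort(max(pop, key=fitness))

# toy usage: 10 stand-in features, target driven by features 2 and 7
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = X[:, 2] + 0.5 * X[:, 7] + 0.05 * rng.normal(size=60)
selected = ga_select(X, y, m=3, generations=3, pop_size=8)
```

Running such a search K times (60 times in this embodiment) with different seeds yields the repeated screenings whose results feed the weight ranking in the next step.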
(5) After feature screening with the genetic algorithm, feature weight ranking was used to rank the importance of the features.
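The weight-based ranking described here (and formalized in claim 5) can be sketched with accumulated accuracies; the feature ids and p_k values below are made up for illustration:

```python
from collections import defaultdict

def rank_features(screenings):
    """screenings: list of (selected_feature_ids, p_k) pairs, p_k = 1 - MAPE.

    A feature's weight is the accuracy accumulated over every screening that
    selected it (indicator = 1), per the weight formula in the claims.
    """
    weights = defaultdict(float)
    for selected, p_k in screenings:
        for feat in selected:
            weights[feat] += p_k
    return sorted(weights, key=weights.get, reverse=True)

# three illustrative screenings (feature ids and accuracies are invented):
screenings = [({3, 10, 95}, 0.96), ({62, 95, 10}, 0.94), ({95, 3, 62}, 0.95)]
ranking = rank_features(screenings)   # feature 95 appears in all three, so it ranks first
```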
(6) The 15 most important key features are selected by feature weight ranking to form the candidate features for exhaustive search; the feature combination with the best model prediction accuracy is then selected exhaustively. The accuracy and generalization ability of each feature combination are evaluated in an evaluation model via the model's mean absolute percentage error (MAPE), and the combination with the lowest relative error is taken as the final feature combination. The specific alloy feature types screened for the conductivity model are shown in Table 8.
TABLE 8 Alloy feature screening results for the conductivity model
Feature number | Feature description |
3 | Average group number |
10 | Average third ionization energy |
95 | Variance of mass attenuation coefficient Cr Kα |
62 | Variance of chemical potential energy |
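The exhaustive step in (6) can be sketched with `itertools.combinations`. Real usage would enumerate subsets of the top 15 ranked alloy features (2^15 − 1 combinations); this toy version uses 6 synthetic features and an illustrative SVR:

```python
import itertools
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.metrics import make_scorer, mean_absolute_percentage_error

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6)) + 3.0                 # 6 stand-in candidate features
y = 2.0 * X[:, 0] + X[:, 3] + 10.0 + rng.normal(scale=0.1, size=80)

# scorer returns negated MAPE so that higher is better for cross_val_score
neg_mape = make_scorer(mean_absolute_percentage_error, greater_is_better=False)

best_combo, best_mape = None, np.inf
for r in range(1, X.shape[1] + 1):                 # every non-empty subset size
    for combo in itertools.combinations(range(X.shape[1]), r):
        mape = -cross_val_score(SVR(C=10.0), X[:, list(combo)], y,
                                cv=5, scoring=neg_mape).mean()
        if mape < best_mape:
            best_combo, best_mape = combo, mape
```

Because the candidate pool is capped at 12 to 16 features by the preceding weight ranking, the exhaustive enumeration stays computationally tractable.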
(7) Based on the alloy feature screening results, regression modeling was carried out with the same learner used in the feature screening, a Support Vector Machine (SVM). The conductivity modeling results show a training-set percentage error of 4.12% and a test-set percentage error of 3.99%. Both errors are small, indicating that the model trained on the screened conductivity feature combination performs well and has good generalization ability.
Claims (6)
1. A method for realizing key feature combination screening of machine learning candidate features, characterized by comprising the following steps:
(1) The candidate feature set is preliminarily screened by linear correlation filtering; a correlation coefficient greater than 0.95 is taken as strong linear correlation, and features in the feature set with strong linear correlation are placed in the same group;
(2) A genetic algorithm search based on a limited feature number further screens the features remaining after linear correlation filtering K times, selecting m features each time;
(3) After the features are screened by the genetic algorithm, feature weight ranking is used to rank the importance of the features;
(4) The 12 to 16 most important key features are selected by feature weight ranking to form the candidate features for exhaustive search, and the feature combination with the best model prediction accuracy is then selected exhaustively.
2. The method for realizing key feature combination screening of machine learning candidate features according to claim 1, wherein: the number of features in the candidate feature set is greater than or equal to 60.
3. The method for realizing key feature combination screening of machine learning candidate features according to claim 1, wherein the formula for evaluating the correlation coefficient in step (1) is as follows:
R = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}}   (1)
where N is the number of samples, x_i and y_i denote two different features of the i-th alloy (i = 1, 2, …, N), and \bar{x} and \bar{y} denote the averages of these two features over the N alloys.
4. The method for realizing key feature combination screening of machine learning candidate features according to claim 1, wherein: the number of features selected in each limited-feature screening in step (2) is 5 to 15, i.e., m = 5–15; the number of screenings K is greater than 15, and the model parameters are: 50–150 generations, a population of 150–250, model hyperparameter optimization, 20–70 random samplings, and 5- to 10-fold cross-validation for each sampling.
5. The method for realizing key feature combination screening of machine learning candidate features according to claim 1, wherein feature weight ranking is used in step (3) to rank the feature importance, as follows:
(a) The n features remaining after linear correlation screening constitute the feature set F:
F = {X_1, X_2, X_3, X_4, …, X_i, X_{i+1}, …, X_n}   (3)
where X_i denotes the i-th screened feature and n is the number of features remaining after linear correlation screening;
(b) The genetic algorithm performs K screenings, selecting m features each time; regression modeling based on these m features yields the prediction accuracy p_k shown in formula (4), where p_k = 1 − MAPE and MAPE is the model's mean absolute percentage error;
F_1 = {X_1, X_2, X_3, X_4, …, X_i, X_{i+1}, …, X_m} → p_1
F_2 = {X_1, X_2, X_3, X_4, …, X_i, X_{i+1}, …, X_m} → p_2
F_3 = {X_1, X_2, X_3, X_4, …, X_i, X_{i+1}, …, X_m} → p_3
…
F_K = {X_1, X_2, X_3, X_4, …, X_i, X_{i+1}, …, X_m} → p_K   (4)
(c) The feature weight is given by formula (5): a feature's weight equals the sum, over the K genetic algorithm screenings, of the product of an indicator of whether the feature was selected and the model prediction accuracy, and the accumulated weights of the different features are then ranked:
W_A = \sum_{k=1}^{K} \delta_A^{(k)}\, p_k   (5)
where W_A is the feature weight of feature A, n is the number of candidate features, and \delta_A^{(k)} indicates whether feature A was selected in the k-th screening: \delta_A^{(k)} = 1 if feature A was selected, otherwise \delta_A^{(k)} = 0; the indicators \delta_B^{(k)}, …, \delta_N^{(k)} for features B through N are defined in the same way.
(d) Feature ranking: the accumulated products of each feature with the prediction accuracy, obtained from the feature weight formula, are ranked; the resulting plot is the ranking of each feature's accumulated accuracy sum:
I_n = rank(W_A, W_B, W_C, …, W_N)   (6)
where I_n denotes the feature weight ranking result for the n candidate features, and rank() denotes ranking the weights of the different features.
6. The method for realizing key feature combination screening of machine learning candidate features according to claim 1, wherein: in step (4), the accuracy and generalization ability of the feature combinations are evaluated in an evaluation model via the mean absolute percentage error (MAPE), and the feature combination with the lowest relative error is taken as the finally screened feature combination.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310481517.6A CN116720058A (en) | 2023-04-28 | 2023-04-28 | Method for realizing key feature combination screening of machine learning candidate features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116720058A true CN116720058A (en) | 2023-09-08 |
Family
ID=87870410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310481517.6A Pending CN116720058A (en) | 2023-04-28 | 2023-04-28 | Method for realizing key feature combination screening of machine learning candidate features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116720058A (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180150746A1 (en) * | 2016-02-05 | 2018-05-31 | Huawei Technologies Co., Ltd. | Feature Set Determining Method and Apparatus |
US20180300333A1 (en) * | 2017-04-13 | 2018-10-18 | General Electric Company | Feature subset selection and ranking |
CN110135469A (en) * | 2019-04-24 | 2019-08-16 | 北京航空航天大学 | It is a kind of to improve the characteristic filter method and device selected based on correlative character |
CN111126426A (en) * | 2019-10-11 | 2020-05-08 | 平安普惠企业管理有限公司 | Feature selection method and device, computer equipment and storage medium |
CN112149702A (en) * | 2019-06-28 | 2020-12-29 | 北京百度网讯科技有限公司 | Feature processing method and device |
CN112215278A (en) * | 2020-10-09 | 2021-01-12 | 吉林大学 | Multi-dimensional data feature selection method combining genetic algorithm and dragonfly algorithm |
CN112216356A (en) * | 2020-10-22 | 2021-01-12 | 哈尔滨理工大学 | High-entropy alloy hardness prediction method based on machine learning |
CN113837271A (en) * | 2021-09-23 | 2021-12-24 | 山东纬横数据科技有限公司 | Classification improvement algorithm based on feature selection |
CN114580272A (en) * | 2022-02-16 | 2022-06-03 | 昆明贵金属研究所 | Design method for simultaneously optimizing conductivity and hardness of multi-element electric contact alloy |
CN114580271A (en) * | 2022-02-16 | 2022-06-03 | 昆明贵金属研究所 | Method for realizing solid-liquid phase temperature prediction of multi-element precious metal alloy brazing filler metal |
CN115527625A (en) * | 2022-10-19 | 2022-12-27 | 西安邮电大学 | Hardness prediction method and system for high-entropy alloy |
CN115879638A (en) * | 2022-12-30 | 2023-03-31 | 东北石油大学 | Carbon emission prediction method for oil field transfer station system |
Non-Patent Citations (2)
Title |
---|
JIHENG FANG 等: "Solid-Liquid Phase Temperature Prediction of Alloys Based on Machine Learning Key Feature Screening", 《HTTPS://PAPERS.SSRN.COM/SOL3/PAPERS.CFM?ABSTRACT_ID=4423396》, 19 April 2023 (2023-04-19), pages 1 - 30 * |
LI Zhiqin; DU Jianqiang; NIE Bin; XIONG Wangping; HUANG Canyi; LI Huan: "A Survey of Feature Selection Methods", Computer Engineering and Applications, no. 24, 31 December 2019 (2019-12-31), pages 16 - 25 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117196918A (en) * | 2023-09-21 | 2023-12-08 | 国家电网有限公司大数据中心 | Building carbon emission determining method, device, equipment and storage medium |
CN117196918B (en) * | 2023-09-21 | 2024-06-07 | 国家电网有限公司大数据中心 | Building carbon emission determining method, device, equipment and storage medium |
CN117787476A (en) * | 2023-12-07 | 2024-03-29 | 聊城大学 | Quick evaluation method for blocking flow shop scheduling based on key machine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||