CN112966447A

CN112966447A - Chemical material adsorption performance prediction method and device based on automatic machine learning

Info

Publication number: CN112966447A
Application number: CN202110318374.8A
Authority: CN
Inventors: 王坤峰; 杨培松; 张欢; 赖欣; 阳庆元; 俞度立
Original assignee: Beijing University of Chemical Technology
Current assignee: Beijing University of Chemical Technology
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2021-06-15

Abstract

The invention relates to the technical field of machine learning. In one aspect, it provides a chemical material adsorption performance prediction method based on automatic machine learning, comprising the following steps: acquiring structural characteristics of a chemical material to construct an original data set; preprocessing the original data set and generating an initial model containing hyper-parameters by machine learning; iteratively training the initial model by a pipeline method to generate an optimal prediction model; and inputting a test data set into the optimal prediction model to predict the adsorption performance of the chemical material. The method can predict the adsorption performance of a material quickly and accurately. The invention also provides a chemical material adsorption performance prediction device based on automatic machine learning, comprising a data set construction module, a model pre-training module, a model construction module and a test module; the method is applied in the device to predict the adsorption performance of the chemical material.

Description

Chemical material adsorption performance prediction method and device based on automatic machine learning

Technical Field

The invention relates to the technical field of machine learning, in particular to a chemical material adsorption performance prediction method and device based on automatic machine learning.

Background

The screening and design of chemical materials are of great significance for the storage and transportation of important chemical gases. However, synthesized gas storage materials are numerous and varied, and studying their gas adsorption working capacity requires molecular dynamics simulation, which is accurate but time-consuming. Traditional calculation methods are therefore clearly not feasible for finding suitable storage materials among a massive number of candidates.

Machine learning provides great convenience for material property calculation, but also has some problems. With the increase of the variety and complexity of the algorithm, engineers need to select a corresponding model architecture, a training process, a regularization method, a hyper-parameter and the like, which have great influence on the performance of the algorithm. The process of building accurate and powerful learning models requires advanced data science skills and it is also a difficult task to select the appropriate method to solve the problem and configure the optimal parameter values for a particular model. Therefore, how to quickly and effectively calculate the adsorption performance of the material and screen out a suitable gas storage material becomes a problem which needs to be solved urgently.

The prior art has the following problems. First, traditional material calculation methods are slow and inefficient and cannot meet current requirements. Second, the parameter tuning process of common machine learning algorithms is complex, which sets a high barrier for non-professionals. Third, no pipeline model has been designed for material property prediction.

Disclosure of Invention

To this end, one aspect of the present invention provides a method for predicting material adsorption performance based on automatic machine learning. The method obtains characteristics of a chemical material and establishes an original data set through a data set construction module, establishes initial models through a model pre-training module, generates an optimal prediction model through a model construction module, and predicts the adsorption performance of the chemical material through a test module. The method addresses the problem in the prior art that, because no pipeline model has been designed for material property prediction, the prediction of material adsorption performance is slow and therefore inefficient.

In order to achieve the above object, the present invention provides a method for predicting chemical material adsorption performance based on automatic machine learning, comprising:

acquiring various characteristics related to the adsorption performance of the chemical material, establishing an original data set by combining different types of characteristics, and preprocessing the original data set;

performing feature processing on the preprocessed original data set, and generating, by machine learning, a plurality of initial models containing hyper-parameters from the feature-processed data;

iteratively training a plurality of the initial models by a pipeline method to generate an optimal prediction model;

inputting the test data set into the optimal prediction model to predict the adsorption performance of the chemical material.

Further, the method for preprocessing the raw data comprises one or more of data sampling, data cleaning, feature compression, feature conversion and feature extraction;

the method for generating the plurality of initial models containing hyper-parameters through machine learning comprises performing feature processing on the original data to ensure the reasonableness of the data, and selecting different machine learning algorithms to generate the plurality of initial models containing hyper-parameters according to prior knowledge.

Further, the method for iteratively training the plurality of initial models by a pipeline method to obtain an optimal prediction model comprises: performing data screening and feature processing on the feature-processed data set according to feature importance, and tuning the parameters of the initial models through a genetic algorithm and an iterative method.

Further, the data screening includes selecting the top n% of feature information by using a SelectKBest method and removing feature information that does not meet a minimum variance threshold, wherein the selection method determines the top n% of features by combining the chi-square test and mutual information, with the formulas:

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

$$\chi^2 = \sum_{i} \frac{(F_i - E_i)^2}{E_i}$$

wherein $p(x, y)$ is the joint distribution function of x and y, $p(x)$ and $p(y)$ are the marginal probability density functions of x and y, respectively, $F_i$ is the observed value of the i-th feature, and $E_i$ is the expected value of the i-th feature.

Further, the feature importance includes the correlation between each feature and the target variable and the correlation between features. The feature importance is generated by analyzing these correlations, retaining features that are strongly correlated with the target variable and deleting features that are strongly correlated with other features. The correlation is calculated by the formula:

$$r(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $r(x, y)$ represents the correlation coefficient between two variables x and y, and $\bar{x}$ and $\bar{y}$ represent the means of x and y, respectively.

Further, the method for tuning the parameters of the initial models through the genetic algorithm comprises: generating a plurality of best-performing initial models by optimizing the hyper-parameters of each initial model separately, and generating an optimal prediction model by selecting the optimal parameters.

Further, the method for generating the optimal prediction model comprises: integrating the plurality of best-performing initial models into an optimal prediction model set through superposition combination, wherein the integration formula is as follows:

$$A^{*}_{\lambda^{*}} \in \underset{A^{j} \in A,\ \lambda \in \Lambda^{j}}{\arg\min}\ \frac{1}{k} \sum_{i=1}^{k} L\left(A^{j}_{\lambda},\ D^{(i)}_{train},\ D^{(i)}_{valid}\right)$$

wherein $A = \{A^1, \ldots, A^n\}$ is the set of machine learning algorithms, each element representing a data processing and machine learning algorithm, and each $A^j \in A$ (j = 1, ..., n) has a corresponding hyper-parameter space $\Lambda^j$;

when model selection is performed, k-fold cross-validation is performed on the data set, dividing it into k training sets $D^{(i)}_{train}$ and k validation sets $D^{(i)}_{valid}$; $L(A^{j}_{\lambda}, D^{(i)}_{train}, D^{(i)}_{valid})$ is the loss on the validation set $D^{(i)}_{valid}$ of algorithm $A^j$ with hyper-parameter $\lambda \in \Lambda^j$ after training on the training set $D^{(i)}_{train}$. The optimal combination of prediction models and the optimal hyper-parameter combination are thereby generated.

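Further, the optimal prediction model is evaluated by the goodness of fit $R^2$ and the RMSE:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

where n represents the total number of data samples, $\hat{y}_i$ and $y_i$ are respectively the value predicted by the optimal model and the actual value of the i-th data sample, and $\bar{y}$ is the average of all predicted values.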

Another aspect of the present invention provides an automatic machine learning-based chemical material adsorption performance prediction apparatus for performing the automatic machine learning-based chemical material adsorption performance prediction method according to any one of claims 1 to 8, comprising:

the data set construction module is used for acquiring physical and chemical structural characteristics of the chemical material, judging and filtering invalid data and null values, and establishing an original data set for the filtered characteristics;

the model pre-training module is connected with the data set construction module and used for generating a plurality of initial models containing hyper-parameters according to different algorithms;

a model construction module connected with the model pre-training module and used for iteratively training a plurality of initial models through a pipeline method to generate an optimal prediction model;

and the test module is connected with the model construction module and used for inputting a test data set to the optimal prediction model to predict the adsorption performance of the chemical material.

Further, the model building module comprises:

the feature engineering module is connected with the model pre-training module and is used for performing feature processing on the original data set and performing feature selection, compression and extraction on the original data in the original data set according to the feature importance;

the model selection module is connected with the feature engineering module and used for selecting algorithm models for the feature-processed original data set in combination with the feature importance, and building initial models with different algorithms;

the parameter optimizing module is connected with the model selection module and used for optimizing the hyper-parameters of the plurality of initial models separately by a genetic algorithm optimization method to generate a plurality of best-performing initial models;

the pipeline module is connected with the parameter optimizing module and is used for integrating the plurality of best-performing initial models into an optimal prediction model set through a superposition combination method;

and the model evaluation module is connected with the pipeline module and the test module respectively and is used for evaluating the performance of the models in the optimal prediction model set formed in the pipeline module and selecting an optimal prediction model.

Compared with the prior art, the invention constructs an original data set from the structural characteristics of the chemical material, preprocesses the original data set, generates initial models containing hyper-parameters by machine learning, iteratively trains the initial models by the pipeline method to generate an optimal prediction model, and inputs a test data set into the optimal prediction model to predict the adsorption performance of the chemical material.

Furthermore, the original data is preprocessed by sampling and/or data cleaning and/or feature compression and/or feature extraction, feature processing is performed on the original data, and different machine learning algorithms are selected to generate a plurality of initial models containing hyper-parameters according to prior knowledge. This ensures the reasonableness of the data and simplifies the parameter tuning process, thereby further improving the prediction efficiency.

And further, data screening and feature processing are carried out on the data set subjected to feature processing according to feature importance, and parameter adjustment is carried out on the initial model in an iterative manner through a genetic algorithm, so that the parameter adjustment process is further simplified, and the prediction efficiency is further improved.

Further, the optimal top n% of feature information is selected by a SelectKBest method, and features which do not accord with a minimum variance threshold are removed, so that the feature selection of the data is improved, and the prediction efficiency is further improved.

Further, the feature importance is generated by retaining the features with strong correlation with the target variable and deleting the features with strong correlation between the features, so that the feature selection of the data is improved, and the prediction efficiency is further improved.

Furthermore, the hyper-parameters of the initial models are optimized through a genetic algorithm to generate the initial models with the optimal performance, and the optimal parameters are selected to generate the optimal prediction model, so that the automatic processing of the models is realized, and the prediction efficiency is further improved.

Further, the initial models with the best performance are integrated into an optimal prediction model set through superposition combination, k cross verifications are performed during model selection, the data set is divided into k training sets and k verification sets, and the algorithm is verified after training is completed to generate the optimal prediction model combination and the optimal hyper-parameter combination, so that automatic processing of the models is realized, and the prediction efficiency is further improved.

Further, the optimal prediction model is evaluated and verified by the goodness of fit $R^2$ and the RMSE, thereby further improving the prediction efficiency.

Furthermore, the chemical material adsorption performance prediction method based on automatic machine learning is implemented in the chemical material adsorption performance prediction device based on automatic machine learning, so that automatic prediction of the chemical material adsorption performance is realized and the prediction efficiency is further improved.

Drawings

FIG. 1 is a flow chart of a method for predicting the adsorption performance of a chemical material based on automatic machine learning according to the present invention;

FIG. 2 is a detailed flowchart of step S300;

fig. 3 is a block diagram of the structure of the prediction device for chemical material adsorption performance based on automatic machine learning according to the present invention.

FIG. 4 is a schematic diagram of an evaluation model performance of the adsorption performance of the organic molecule to methane gas according to the present invention.

Detailed Description

In order that the objects and advantages of the invention will be more clearly understood, the invention is further described below with reference to examples; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and do not limit the scope of the present invention.

It should be noted that in the description of the present invention, the terms of direction or positional relationship indicated by the terms "upper", "lower", "left", "right", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, which are only for convenience of description, and do not indicate or imply that the device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.

Furthermore, it should be noted that, in the description of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

Fig. 1 is a flow chart of a method for predicting the adsorption performance of a chemical material based on automatic machine learning according to the present invention.

The method for predicting the adsorption performance of the chemical material based on automatic machine learning comprises the following steps:

Step S100, acquiring various characteristics of chemical materials, and constructing an original data set by combining different types of characteristics;

Step S200, inputting the original data set for feature processing, and obtaining a plurality of initial models containing hyper-parameters by using base learners;

Step S300, iteratively training the plurality of initial models by a pipeline method to obtain an optimal prediction model: performing data screening and feature processing on the input data set; tuning the parameters of the models through a genetic algorithm, and obtaining the optimal prediction model through step-by-step iteration;

Step S400, inputting a test data set to the optimal prediction model to predict the adsorption performance of the chemical material.

In the step S100, physical and chemical characteristics of the material, including physical characteristics such as pore size characteristics, volume, density, surface area, and element content percentage of the material, and chemical characteristics such as heat of adsorption, are calculated by using a conventional molecular simulation method, and invalid data and null values are determined and filtered, and an original data set is established according to the characteristics.
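The filtering of invalid data and null values in step S100 can be illustrated with a minimal Python sketch. The field names (`pore_size`, `uptake` and so on), the sample values and the helper functions are hypothetical; the source names the feature types but does not specify a data format.

```python
import math

# Hypothetical feature names; the source lists feature types, not column names.
FEATURES = ["pore_size", "volume", "density", "surface_area", "heat_of_adsorption"]


def is_valid(value):
    """A value is valid if it is a finite real number (not None, NaN or inf)."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )


def build_raw_dataset(records, target="uptake"):
    """Keep only records whose features and target are all valid numbers."""
    rows, targets = [], []
    for rec in records:
        values = [rec.get(name) for name in FEATURES]
        y = rec.get(target)
        if all(is_valid(v) for v in values) and is_valid(y):
            rows.append(values)
            targets.append(y)
    return rows, targets


# Illustrative records only; the numbers are made up.
records = [
    {"pore_size": 8.1, "volume": 0.9, "density": 0.6,
     "surface_area": 3200.0, "heat_of_adsorption": 12.5, "uptake": 150.0},
    {"pore_size": None, "volume": 1.1, "density": 0.5,          # null value
     "surface_area": 4100.0, "heat_of_adsorption": 11.0, "uptake": 180.0},
    {"pore_size": 6.4, "volume": float("nan"), "density": 0.7,  # invalid value
     "surface_area": 2800.0, "heat_of_adsorption": 13.1, "uptake": 140.0},
]

X, y = build_raw_dataset(records)
```

Only the first record survives the filter here, because the other two contain a null and a NaN respectively.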

In step S200, feature processing is performed on the original data set, and a plurality of initial models containing hyper-parameters are generated by machine learning, with the initial parameters of each model set according to prior knowledge.

In the step S300, data screening and feature processing are performed on the input data set; adjusting parameters of the model through a genetic algorithm; and obtaining the optimal prediction model by a step-by-step iteration method.

In the embodiment, a plurality of initial models containing hyper-parameters are obtained according to prior knowledge, and the initial models are subjected to iterative training through a pipeline method to generate an optimal prediction model;

specifically, data screening and characteristic processing are carried out on an input data set, parameters of the initial model are adjusted through a genetic algorithm, an optimal prediction model is generated through a step-by-step iteration method, and a test data set is input into the optimal prediction model to predict the adsorption performance of the chemical material. The pipeline method comprises the steps of performing data preprocessing, feature engineering, model selection, model evaluation and the like end to end by using the pipeline, so that the prediction model can be optimized without manual participation. By combining the method of automatic machine learning in the chemical material performance prediction, the time and labor cost can be minimized while the prediction accuracy is ensured.
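The end-to-end idea of the pipeline method described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three stages (`clean`, `scale`, `predict`) are hypothetical placeholders, and the "model" is a trivial stand-in rather than anything from the source.

```python
from functools import reduce


def run_pipeline(stages, data):
    """Apply each stage to the output of the previous one, in order."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


def clean(rows):
    """Data preprocessing stage: drop rows that contain a null value."""
    return [r for r in rows if None not in r]


def scale(rows):
    """Feature engineering stage: min-max scale every column to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
        for r in rows
    ]


def predict(rows):
    """Placeholder model stage: the mean of the scaled features."""
    return [sum(r) / len(r) for r in rows]


stages = [clean, scale, predict]
preds = run_pipeline(stages, [[1.0, 10.0], [None, 5.0], [3.0, 30.0]])
```

Because every stage has the same shape (data in, data out), stages can be swapped or re-run during iteration without changing the surrounding code.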

Fig. 2 is a flowchart illustrating step S300 of the method for predicting the chemical material adsorption performance based on automatic machine learning according to the present invention.

The embodiment of the invention provides a chemical material adsorption performance prediction method based on automatic machine learning, in which step S300 comprises the following steps:

Step 310, feature engineering: performing feature processing on the original data and preprocessing the original data set, wherein the preprocessing comprises one or more of data sampling, data cleaning, feature compression, feature conversion and feature extraction, and performing feature screening according to feature importance.

Specifically, the top n% of feature information is selected by a SelectKBest method and features that do not meet the minimum variance threshold are removed, wherein the top n% of features are selected by combining the chi-square test and mutual information:

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

$$\chi^2 = \sum_{i} \frac{(F_i - E_i)^2}{E_i}$$

wherein $p(x, y)$ is the joint distribution function of x and y, $p(x)$ and $p(y)$ are the marginal probability density functions of x and y, $F_i$ is the observed value of the i-th feature, and $E_i$ is the expected value of the i-th feature. The feature importance includes the correlation between each feature and the target variable and the correlation between features: features strongly correlated with the target variable are retained, and features strongly correlated with other features are deleted to avoid multicollinearity in the model.
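The two scoring statistics and the two filters named above can be sketched in plain Python. This is a minimal illustration, assuming discrete (binned) variables so that the probabilities can be estimated by counting; the function names and the way scores are ranked are hypothetical, and the source does not state how the chi-square and mutual information scores are combined.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X;Y) in nats from paired discrete samples by counting."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def chi_square(observed, expected):
    """Chi-square statistic: sum of (F_i - E_i)^2 / E_i."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))


def passes_variance_threshold(values, threshold):
    """True if the population variance of a feature exceeds the threshold."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance > threshold


def select_top_percent(scores, percent):
    """Return the names of the top `percent` % of features by score."""
    k = max(1, math.ceil(len(scores) * percent / 100))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Illustrative scores only; the feature names are made up.
top = select_top_percent({"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}, 50)
```

A perfectly dependent pair of binary variables gives a mutual information of log 2, an independent pair gives 0, and a constant feature fails any non-negative variance threshold.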

Specifically, the correlation coefficient of two feature variables is calculated by the following formula, and if the correlation coefficient of the two feature variables is greater than 0.9, the correlation between them is determined to be strong:

$$r(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ represent the means of x and y, respectively.
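The correlation-based screening can be illustrated as follows. This is a sketch under assumptions: the greedy order (visit features by decreasing absolute correlation with the target, keep a feature only if it is not strongly correlated with any feature already kept) and the use of the absolute value are illustrative choices; the source only gives the 0.9 threshold. The feature names and values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient r(x, y)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return num / den if den else 0.0


def drop_collinear(features, target, threshold=0.9):
    """Greedily keep features that are not strongly correlated with a kept one.

    `features` maps a feature name to its list of values.
    """
    order = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in order:
        if all(
            abs(pearson(features[name], features[k])) <= threshold for k in kept
        ):
            kept.append(name)
    return kept


features = {
    "surface_area": [1.0, 2.0, 3.0, 4.0, 5.0],
    "volume": [2.0, 4.0, 6.0, 8.0, 11.0],   # nearly proportional to surface_area
    "density": [5.0, 3.0, 4.0, 1.0, 2.0],
}
target = [1.0, 2.0, 3.0, 4.0, 5.0]
kept = drop_collinear(features, target)
```

In this toy data `volume` is dropped because its correlation with `surface_area` exceeds 0.9, while `density` is retained.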

Step 320, model selection: selecting, among the initialized machine learning models, the models with excellent performance.

Step 330, model parameter tuning: optimizing the hyper-parameters of the plurality of initial models through the genetic algorithm to generate a plurality of best-performing initial models, and selecting a set of optimal parameters so that the model performance is optimal.
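Hyper-parameter tuning by a genetic algorithm, as in step 330, can be sketched as below. Everything specific here is an assumption for illustration: the search space, the stand-in `validation_score` function (a real run would return a cross-validated model score), and the operator choices (truncation selection, uniform crossover, Gaussian mutation, elitism). The source does not describe its genetic operators.

```python
import random

# Hypothetical search space: hyper-parameter name -> (low, high) bounds.
SPACE = {"learning_rate": (0.001, 1.0), "subsample": (0.1, 1.0)}


def validation_score(params):
    """Stand-in for a cross-validated model score (higher is better)."""
    return -(
        (params["learning_rate"] - 0.1) ** 2 + (params["subsample"] - 0.8) ** 2
    )


def random_individual(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each gene comes from either parent."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}


def mutate(ind, rng, rate=0.2, scale=0.1):
    """Gaussian mutation, clipped to the bounds of the search space."""
    out = dict(ind)
    for k, (lo, hi) in SPACE.items():
        if rng.random() < rate:
            out[k] = min(hi, max(lo, out[k] + rng.gauss(0.0, scale * (hi - lo))))
    return out


def genetic_search(score, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        parents = population[: pop_size // 2]   # truncation selection
        children = [population[0]]              # elitism: keep the best as is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    best = max(population, key=score)
    history.append(score(best))
    return best, history


best, history = genetic_search(validation_score)
```

Because the best individual is carried over unchanged into each new generation, the best score recorded in `history` never decreases from one generation to the next.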

Step 340, combining the optimal models obtained in step 330 by superposition in a pipeline: the plurality of best-performing initial models generated by the genetic algorithm are integrated into an optimal prediction model set by a superposition combination method:

$$A^{*}_{\lambda^{*}} \in \underset{A^{j} \in A,\ \lambda \in \Lambda^{j}}{\arg\min}\ \frac{1}{k} \sum_{i=1}^{k} L\left(A^{j}_{\lambda},\ D^{(i)}_{train},\ D^{(i)}_{valid}\right)$$

wherein $A = \{A^1, \ldots, A^n\}$ represents the set of base learners, each element representing a data processing and machine learning algorithm, and each $A^j \in A$ (j = 1, ..., n) has a corresponding hyper-parameter space $\Lambda^j$.

In particular, k-fold cross-validation is performed during model selection: the data set is divided into k training sets $D^{(i)}_{train}$ and k validation sets $D^{(i)}_{valid}$, and $L(A^{j}_{\lambda}, D^{(i)}_{train}, D^{(i)}_{valid})$ is the loss on the validation set $D^{(i)}_{valid}$ of algorithm $A^j$ with hyper-parameter $\lambda \in \Lambda^j$ trained on the training set $D^{(i)}_{train}$. The optimal model combination and hyper-parameter combination are found through this procedure.
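The joint selection of an algorithm and its hyper-parameter by k-fold cross-validation can be sketched as follows. This is an illustration under assumptions: the two toy base learners (a one-feature ridge regression through the origin and a nearest-neighbour regressor), their hyper-parameter grids, the squared-error loss and the interleaved fold split are all hypothetical choices, and an exhaustive grid replaces the genetic search for brevity.

```python
def k_fold_indices(n, k):
    """Yield (train, valid) index lists for k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def fit_ridge(xs, ys, lam):
    """One-feature ridge regression through the origin; lam is the penalty."""
    w = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
    return lambda x: w * x


def fit_knn(xs, ys, n_neighbors):
    """Nearest-neighbour regression; n_neighbors is the hyper-parameter."""
    def predict(x):
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))
        nearest = nearest[:n_neighbors]
        return sum(ys[i] for i in nearest) / len(nearest)
    return predict


# Algorithm set A with one hyper-parameter grid per algorithm (hypothetical).
ALGORITHMS = {
    "ridge": (fit_ridge, [0.0, 1.0, 10.0]),
    "knn": (fit_knn, [1, 3]),
}


def cv_loss(fit, lam, xs, ys, k):
    """Mean over the k folds of the validation mean squared error."""
    total = 0.0
    for train, valid in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((model(xs[i]) - ys[i]) ** 2 for i in valid) / len(valid)
    return total / k


def select(xs, ys, k=3):
    """Return (loss, algorithm name, hyper-parameter) with the lowest CV loss."""
    candidates = [
        (cv_loss(fit, lam, xs, ys, k), name, lam)
        for name, (fit, grid) in ALGORITHMS.items()
        for lam in grid
    ]
    return min(candidates, key=lambda c: c[0])


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2 * x for x in xs]          # toy data: exactly linear
loss, name, lam = select(xs, ys)
```

On this exactly linear toy data the unpenalized ridge model is selected, since it is the only candidate that reproduces the validation targets without error.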

Specifically, the optimal pipeline model is obtained by integration through the superposition combination method. This avoids discarding high-performing models and forms a more complex model structure with stronger prediction capability; it also avoids oversimplified hyper-parameters and overfitting of the data, so that the resulting model is more robust. The pipeline method is iterative: each step is executed repeatedly, continuously improving the accuracy of the model until a successful algorithm is obtained. The methods and models of all processing stages are combined into one pipelined prediction model, finally yielding a complete pipeline prediction model comprising data preprocessing, feature engineering and model prediction.
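One common way to combine several tuned models is stacking, where a second-level model learns how to weight the base models' predictions. Reading "superposition combination" as stacking is an assumption, and the sketch below is a deliberately small instance: two base models combined by least-squares weights, with made-up prediction values. In practice the weights would be fitted on out-of-fold predictions to limit overfitting.

```python
def fit_stack_weights(pred_a, pred_b, y):
    """Least-squares weights (w_a, w_b) minimizing sum((w_a*a + w_b*b - y)^2).

    Solves the 2x2 normal equations directly.
    """
    s_aa = sum(a * a for a in pred_a)
    s_ab = sum(a * b for a, b in zip(pred_a, pred_b))
    s_bb = sum(b * b for b in pred_b)
    s_ay = sum(a * t for a, t in zip(pred_a, y))
    s_by = sum(b * t for b, t in zip(pred_b, y))
    det = s_aa * s_bb - s_ab * s_ab
    if det == 0:
        raise ValueError("base-model predictions are collinear")
    w_a = (s_ay * s_bb - s_by * s_ab) / det
    w_b = (s_aa * s_by - s_ab * s_ay) / det
    return w_a, w_b


def stacked_predict(weights, pred_a, pred_b):
    """Combine the base-model predictions with the fitted weights."""
    w_a, w_b = weights
    return [w_a * a + w_b * b for a, b in zip(pred_a, pred_b)]


# Hypothetical held-out predictions of two tuned base models, and true values
# constructed as 0.5 * pred_a + 2.0 * pred_b.
pred_a = [1.0, 2.0, 3.0, 4.0]
pred_b = [1.0, 0.0, 1.0, 0.0]
y_true = [2.5, 1.0, 3.5, 2.0]

weights = fit_stack_weights(pred_a, pred_b, y_true)
combined = stacked_predict(weights, pred_a, pred_b)
```

Keeping both base models in the combination, rather than picking a single winner, is what allows a strong but slightly worse model to still contribute to the final prediction.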

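Step 350, pipeline model evaluation: the pipeline model formed by superposition combination of the optimal models is evaluated through the objective functions goodness of fit $R^2$ and RMSE:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

where n represents the total number of data samples, $\hat{y}_i$ and $y_i$ are respectively the value predicted by the optimal model and the actual value of the i-th data sample, and $\bar{y}$ is the average of all predicted values.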
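The two evaluation metrics can be computed in a few lines of Python. This sketch uses the conventional definition of $R^2$, in which the mean in the denominator is taken over the actual (observed) values; that choice is an assumption here, and the sample numbers are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, using the mean of the actual values."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]

score_r2 = r_squared(y_true, y_pred)
score_rmse = rmse(y_true, y_pred)
```

An $R^2$ close to 1 together with a small RMSE indicates that the predictions track the observed adsorption values closely; RMSE is expressed in the same units as the target.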

Fig. 3 is a block diagram of a chemical material adsorption performance prediction apparatus based on automatic machine learning according to the present invention.

The embodiment of the invention provides a chemical material adsorption performance prediction device based on automatic machine learning, which comprises:

the data set building module 10 is used for acquiring physical and chemical structural characteristics of chemical materials, judging and filtering invalid data and null values, and building an original data set by combining the characteristics;

the model pre-training module 20 is used for obtaining a plurality of initial models containing hyper-parameters according to different algorithms; and setting a group of initial hyper-parameter values for each model according to the prior knowledge so as to facilitate the subsequent iteration optimization operation.

A model building module 30 for iteratively training a plurality of initial models by a pipeline method to obtain an optimal prediction model, the model building module comprising:

the feature engineering module 31 is used for performing feature processing on the original data set, and performing feature selection, compression, extraction and the like on the original data set according to feature importance;

the model selection module 32 is used for selecting an algorithm model for the original data set after the feature processing by combining the feature importance, and selecting different algorithms to build an initial model;

the parameter optimizing module 33 is configured to optimize the hyper-parameters of the plurality of initial models separately by using a genetic algorithm optimization method to obtain a plurality of best-performing initial models;

a pipeline module 34, configured to integrate the initial models with the best performance into an optimal prediction model by a superposition combination method;

a model evaluation module 35 for evaluating the performance of the models in the formed pipeline modules to select an optimal prediction model;

the test module 36 is used for inputting the test data set to the optimal prediction model to predict the chemical material adsorption performance.

Fig. 4 is a schematic diagram of a model performance of the method for predicting the adsorption performance of the chemical material based on automatic machine learning according to the present invention for evaluating the adsorption performance of the organic molecule to methane gas.

The performance evaluation for the methane adsorption performance of organic molecules compares the values predicted by the pipeline model for the methane adsorption performance of covalent organic compounds with the true observed values. Compared with the traditional molecular simulation calculation method, the efficiency is improved by 2 to 3 orders of magnitude.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A chemical material adsorption performance prediction method based on automatic machine learning is characterized by comprising the following steps:

performing feature processing on the preprocessed raw data set and utilizing machine learning to generate a plurality of initial models containing hyper-parameters according to the raw data subjected to the feature processing;

2. The method of predicting the adsorption performance of a chemical material based on automatic machine learning of claim 1, wherein the method of preprocessing the raw data comprises: one or more of data sampling, data cleaning, feature compression, feature conversion and feature extraction;

the method for generating a plurality of initial models containing hyper-parameters through machine learning comprises: performing feature processing on the original data to ensure the reasonableness of the data, and selecting different machine learning algorithms to generate a plurality of initial models containing hyper-parameters according to prior knowledge.

3. The method of predicting the adsorption performance of a chemical material based on automatic machine learning according to claim 1, wherein the method of iteratively training a plurality of initial models by a pipeline method to obtain an optimal prediction model comprises: performing data screening and feature processing on the feature-processed data set according to feature importance, and tuning the parameters of the initial models through a genetic algorithm and an iterative method.

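4. The automated machine learning-based prediction method of chemical material adsorption performance of claim 3, wherein the data screening comprises: selecting the top n% of feature information by using a SelectKBest method and removing feature information that does not meet the minimum variance threshold, wherein the selection method determines the top n% of features by combining the chi-square test and mutual information, with the formulas:

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

$$\chi^2 = \sum_{i} \frac{(F_i - E_i)^2}{E_i}$$

wherein $p(x, y)$ is the joint distribution function of x and y, $p(x)$ and $p(y)$ are the marginal probability density functions of x and y, respectively, $F_i$ is the observed value of the i-th feature, and $E_i$ is the expected value of the i-th feature.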

5. The method of predicting adsorption performance of chemical material based on automatic machine learning according to claim 4, wherein the feature importance includes the correlation between features and the target variable and the correlation between features; the feature importance is generated by analyzing these correlations, retaining features having a strong correlation with the target variable and deleting features having a strong correlation with other features; if the correlation coefficient of two feature variables is greater than 0.9, the correlation between the two feature variables is determined to be strong; and the calculation formula is:

$$r(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ respectively represent the means of x and y.

6. The method of predicting the adsorption performance of a chemical material based on automatic machine learning according to claim 3, wherein the method of tuning the parameters of the initial models by a genetic algorithm comprises: generating a plurality of best-performing initial models by optimizing the hyper-parameters of each initial model separately, and generating the optimal prediction model by selecting the optimal parameters.

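7. The automated machine learning-based prediction method of chemical material adsorption performance of claim 6, wherein the method of generating an optimal prediction model comprises: integrating a plurality of best-performing initial models into an optimal prediction model set through superposition combination, wherein the integration formula is as follows:

$$A^{*}_{\lambda^{*}} \in \underset{A^{j} \in A,\ \lambda \in \Lambda^{j}}{\arg\min}\ \frac{1}{k} \sum_{i=1}^{k} L\left(A^{j}_{\lambda},\ D^{(i)}_{train},\ D^{(i)}_{valid}\right)$$

wherein $A = \{A^1, \ldots, A^n\}$ is the set of machine learning algorithms, each element representing a data processing and machine learning algorithm, and each $A^j \in A$ (j = 1, ..., n) has a corresponding hyper-parameter space $\Lambda^j$;

when model selection is performed, k-fold cross-validation is performed on the data set, and the data set is divided into k training sets $D^{(i)}_{train}$ and k validation sets $D^{(i)}_{valid}$; $L(A^{j}_{\lambda}, D^{(i)}_{train}, D^{(i)}_{valid})$ is the loss on the validation set $D^{(i)}_{valid}$ of algorithm $A^j$ with hyper-parameter $\lambda \in \Lambda^j$ trained on the training set $D^{(i)}_{train}$, so as to generate the optimal combination of prediction models and the optimal hyper-parameter combination.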

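8. The method of predicting the adsorption performance of a chemical material based on automatic machine learning according to claim 7, wherein the optimal prediction model is evaluated by the goodness of fit $R^2$ and the RMSE, which are calculated as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

where n represents the total number of data samples, $\hat{y}_i$ and $y_i$ are respectively the value predicted by the optimal model and the actual value of the i-th data sample, and $\bar{y}$ is the average of all predicted values.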

9. An automatic machine learning-based chemical material adsorption performance prediction apparatus for performing the automatic machine learning-based chemical material adsorption performance prediction method according to any one of claims 1 to 8, comprising:

a model construction module connected with the model pre-training module and used for iteratively training a plurality of initial models through a pipeline method to generate an optimal prediction model;

10. The apparatus of claim 9 for predicting the adsorption performance of a chemical material based on automatic machine learning, wherein the model building module comprises: