CN115148307A - Material performance automatic prediction system - Google Patents

Publication number
CN115148307A
Authority
CN
China
Prior art keywords
meta
algorithm
data set
data
metadata
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210735361.5A
Other languages
Chinese (zh)
Inventor
刘悦
王双燕
杨正伟
涂章伟
施思齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202210735361.5A priority Critical patent/CN115148307A/en
Publication of CN115148307A publication Critical patent/CN115148307A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C60/00Computational materials science, i.e. ICT specially adapted for investigating the physical or chemical properties of materials or phenomena associated with their design, synthesis, processing, characterisation or utilisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an automatic material performance prediction system comprising: a data acquisition module for acquiring material data sets and general data sets from other research fields; a data preprocessing module for preprocessing the data sets; a domain knowledge base construction module for constructing a material category tree, representing it quantitatively, and constructing a domain knowledge matrix to guide collaborative meta-learning recommendation; a metadata base construction module for performing meta-feature calculation and algorithm performance evaluation on the material data sets and the general data sets respectively, so as to construct material metadata and general metadata; and an algorithm recommendation module for combining the algorithm recommendation result based on the material metadata with the algorithm recommendation result based on the general metadata to obtain an optimal recommended algorithm, and for predicting the target property of the material with that algorithm. The system addresses the difficulty of algorithm selection and the complexity of parameter determination in machine learning modeling, and improves the reliability and usability of machine learning in the field of material performance prediction.

Description

Material performance automatic prediction system
Technical Field
The invention belongs to the field of material performance prediction, and particularly relates to an automatic prediction system for material performance.
Background
With the development of materials science, the volume of accumulated material data keeps growing, and extracting useful information from hundreds of millions of complex records in order to model new materials with good performance has become a core issue in materials research. Traditional approaches in materials science, namely experimental measurement and computational simulation, can solve some material problems well, but because of limited experimental conditions and incomplete theoretical foundations they are time-consuming, offer limited experimental accuracy, and make it difficult to accelerate material discovery and design. In recent years, data-driven machine learning (ML) has become popular for accelerating material performance prediction and new material discovery, because it can quickly learn a regression mapping between the factors that influence a material (descriptors such as composition, process and external environment) and target quantities (such as performance), with a prediction accuracy approaching that of ab initio calculation.
Currently, ML is widely used in many material fields, such as solid electrolytes, photovoltaic materials, energy storage materials, catalysts and photocatalysts, thermoelectric materials, high-temperature superconductors, high-entropy alloys and glasses. Materials researchers usually determine the most appropriate algorithm for a given material problem from historical experience or by comparing the performance of several ML algorithms. For example, Yi Hai Qing et al. (2018) of the University of Science and Technology Beijing studied support vector regression (SVR), sequential minimal optimization regression and a multilayer perceptron, and predicted the lattice mismatch of nickel-based single-crystal alloys from material descriptors such as chemical composition, dendrite information, specimen thickness and measurement temperature; the multilayer perceptron model gave a high correlation coefficient, a low error and good prediction performance. Rajan et al. (2018) of the Materials Research Centre, Indian Institute of Science, predicted the band gaps of two-dimensional transition metal carbides and nitrides (MXenes) with four ML models, namely kernel ridge regression, support vector regression, Gaussian process regression and bootstrap aggregating (bagging) regression; in their experiments the Gaussian process regression model obtained the lowest root mean square error, 0.14. Mastelini et al. (2021) of the University of São Paulo used three machine learning methods, RF, KNN and CART, to analyze the relationship between the elements of chalcogenide glasses and their glass transition temperature, Young's modulus, coefficient of thermal expansion and refractive index, and found that the RF and KNN algorithms outperformed CART in prediction performance.
Priya et al. (2021) of the University of Illinois, USA, applied several machine learning models, including support vector machines, linear regression, neural networks, k-nearest neighbors and XGBoost, to the total conductivity of solid oxide perovskites at different temperatures and under atmospheric pressure; XGBoost obtained the lowest RMSE, 0.25, and could quickly and accurately identify perovskite materials with high conductivity. In other words, linear models (e.g., MLR, ridge and LASSO), nonlinear models (e.g., SVR, GPR and ANN) and ensemble regression models (e.g., RF and AdaBoost) can each achieve high prediction accuracy on some material performance tasks. According to the No Free Lunch theorem, however, no single machine learning algorithm performs best on all material tasks.
At the same time, machine learning is a complex process whose performance, training speed and model complexity depend to a large extent on the hyper-parameter settings. Faced with a huge hyper-parameter search space, material experts without computer science expertise often determine the hyper-parameters of an algorithm by trial and error or by intuition, with minimizing the prediction error of the model as the learning objective. However, manually building a good machine learning model for every material task, that is, enumerating all possible hyper-parameter configurations and iterating by trial and error, is time-consuming and hardly feasible. Some researchers have adopted automatic hyper-parameter optimization methods such as grid search, random search and Bayesian optimization to simplify model tuning. Even so, because ML regression algorithms are diverse and the hyper-parameter search is complex and time-consuming, neither experience-based selection, trial and error, nor existing hyper-parameter optimization methods spare material experts a time-consuming and resource-intensive process. Therefore, how to select a regression model and optimize its hyper-parameters more quickly and accurately, so as to improve the usability and accuracy of ML models in materials science research, is an urgent problem to be solved.
Disclosure of Invention
In order to solve the above problems, the present invention provides the following solutions: an automatic prediction system for material properties, comprising:
the data acquisition module is used for acquiring a material data set which can be used for a regression task and a general data set in other fields as input data of the prediction system;
the data preprocessing module is connected with the data acquisition module and is used for preprocessing the material data set and the general data set;
the domain knowledge base building module is connected with the data preprocessing module and used for building a material class tree and carrying out quantitative representation to build a domain knowledge matrix by acquiring domain knowledge;
the metadata base building module is connected with the data preprocessing module and is used for respectively carrying out metadata feature calculation and algorithm performance evaluation on the material data set and the general data set to obtain a metadata feature matrix and a performance matrix of the material data set and the general data set so as to build the material metadata and the general metadata;
the algorithm recommendation module is respectively connected with the domain knowledge base construction module and the metadata base construction module and is used for collaborating the algorithm recommendation result based on the material metadata and the algorithm recommendation result based on the general metadata by adopting a collaborative recommendation mechanism of domain knowledge embedding to obtain an optimal recommendation algorithm; and predicting the target property of the material based on an optimal recommendation algorithm.
Preferably, the data information of the material data set at least comprises data source, sample number, characteristic dimension, target attribute and material category;
the data information of the general data set includes at least a data source, a number of samples, a characteristic dimension, and a target attribute.
Preferably, the data preprocessing module comprises a first preprocessing unit and a second preprocessing unit;
the first preprocessing unit is used for tracing the original information of the condition attribute and the target performance of the data set to obtain a uniform data format;
and the second preprocessing unit is used for performing missing value processing, categorical data processing and standardization/normalization on the data in the unified data format to obtain data that meet the requirements of machine learning modeling.
Preferably, the meta database construction module comprises a meta feature calculation unit and an algorithm performance evaluation unit;
the meta-feature calculation unit is used for calculating the meta-features of the material data set and the general data set to obtain a meta-feature matrix of the material data set and the general data set;
the algorithm performance evaluation unit is used for obtaining a mapping relation between the input and the output of the data set through a regression algorithm modeling rule, predicting target performance according to the mapping relation and obtaining a performance matrix.
Preferably, the meta-features comprise traditional meta-features and enhanced meta-features;
the traditional meta-feature is extracted based on the condition attribute of the data set, and comprises simple meta-feature, statistical meta-feature for describing the data distribution condition and meta-feature based on principal component;
the enhanced meta-features comprise meta-features based on a machine learning model, statistical meta-features describing target attributes, and meta-features describing uncertainty of the target attributes;
the meta-features based on the machine learning model are obtained by extracting model performance measurement through a machine learning algorithm;
and the statistical meta-features of the target attributes and the meta-features describing uncertainty of the target attributes are extracted according to target performance of the data set.
Preferably, the meta-feature describing uncertainty of the target attribute is used for measuring uncertainty of target attribute data and performing conceptualized numerical representation;
the numerical representation processes target attribute data carrying uncertain concepts by means of a Gaussian-distribution triple consisting of expectation, entropy and super entropy;
the expectation is the most representative data representation in a concept; the entropy is used to represent a granularity scale of the concept; the super entropy is used to describe the uncertainty of the concept granularity.
Preferably, the domain knowledge base building module comprises a domain knowledge acquisition unit and a domain knowledge representation unit;
the domain knowledge acquisition unit is used for constructing a material category tree of a multi-branch tree structure to visualize material domain knowledge according to the target attribute of the material data set; classifying the material data set step by step according to the material category tree to obtain a classification result;
the domain knowledge representation unit is used for obtaining the material types according to the material type tree, carrying out quantitative representation and constructing a domain knowledge matrix.
Preferably, the classification result comprises a metal material, an inorganic non-metal material, a polymer material and a composite material;
the metal material comprises ferrous metal and nonferrous metal;
the inorganic non-metallic materials comprise ceramics, cement, refractory materials and glass;
the polymer material comprises plastics, fibers, paints, organic solvents, small organic molecules, biofuel compounds and rubber materials;
the composite material comprises a metal-based material, a ceramic-based material, a polymer-based material and a carbon-carbon composite material.
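By way of illustration only, the category tree listed above could be held in a nested mapping and each data set's top-level class turned into one row of a domain knowledge matrix. The one-hot scheme, the function names and the example categories passed in are assumptions made for this sketch; the quantitative representation actually used by the system is not specified in this passage.

```python
# Hypothetical encoding of the material category tree (two levels shown).
MATERIAL_TREE = {
    "metal": ["ferrous metal", "nonferrous metal"],
    "inorganic non-metal": ["ceramic", "cement", "refractory", "glass"],
    "polymer": ["plastic", "fiber", "paint", "organic solvent",
                "small organic molecule", "biofuel compound", "rubber"],
    "composite": ["metal-based", "ceramic-based", "polymer-based",
                  "carbon-carbon"],
}


def top_level_category(subcategory):
    """Walk the tree and return the top-level class of a (sub)category."""
    for parent, children in MATERIAL_TREE.items():
        if subcategory == parent or subcategory in children:
            return parent
    raise KeyError(f"unknown material category: {subcategory}")


def domain_knowledge_matrix(dataset_categories):
    """One-hot row per data set over the top-level classes (assumed scheme)."""
    columns = list(MATERIAL_TREE)
    matrix = []
    for sub in dataset_categories:
        parent = top_level_category(sub)
        matrix.append([1 if c == parent else 0 for c in columns])
    return columns, matrix


# Three imaginary data sets labelled by their material category.
columns, K = domain_knowledge_matrix(["glass", "rubber", "ferrous metal"])
```

Rows of `K` that coincide mark data sets from the same top-level class, which is one simple way such a matrix could steer a similarity-based recommender.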
Preferably, the algorithm recommending module comprises a first algorithm recommending unit, a second algorithm recommending unit and a collaborative recommending and ordering unit;
the first algorithm recommending unit is used for quantifying and embedding domain knowledge according to the metadata of the material data set so as to guide the meta-learning recommendation process and obtain a first recommendation result;
the second algorithm recommending unit is used for performing meta-learning recommendation directly according to the metadata of the general data set to obtain a second recommendation result;
the collaborative recommendation sorting unit is used for performing collaborative sorting calculation according to the first recommendation result and the second recommendation result and obtaining an optimal recommendation algorithm according to ranking; and predicting the target property of the material based on an optimal recommendation algorithm.
The invention discloses the following technical effects:
With the automatic material performance prediction system provided by the invention, given a new material data set, a recommended (predicted) ranking of all candidate regression algorithms on that data set is obtained through a collaborative meta-learning component with embedded domain knowledge. The three regression algorithms with the highest overall performance ranking are selected and recommended to the user as the final optimal algorithms. Using the historical performance rankings of the algorithms, the system aims to select a set of relatively good algorithms for a given data set rather than a single absolutely optimal algorithm, which reduces the algorithm and parameter search space as well as the time and computational cost that domain experts spend on selecting a machine learning algorithm, and improves the reliability and usability of machine learning in the field of material performance prediction.
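The ranking step can be pictured with a minimal Python sketch that blends two rank lists and keeps the top three algorithms. The weighted average of ranks, the `weight` parameter and the example ranks are assumptions chosen for illustration; the passage does not state how the two recommendation results are actually combined.

```python
def collaborative_top_k(material_ranks, general_ranks, weight=0.5, k=3):
    """Blend two rank dictionaries (1 = best) and return the k best algorithms.

    `weight` is the share given to the material-based ranking; it is a
    hypothetical parameter, not a value taken from the source.
    """
    algorithms = material_ranks.keys() & general_ranks.keys()
    blended = {
        a: weight * material_ranks[a] + (1 - weight) * general_ranks[a]
        for a in algorithms
    }
    # Sort by blended rank, breaking ties alphabetically for determinism.
    return sorted(blended, key=lambda a: (blended[a], a))[:k]


# Made-up ranks from the two recommenders for five candidate algorithms.
material = {"RF": 1, "SVR": 2, "GPR": 3, "KNN": 4, "LASSO": 5}
general = {"RF": 2, "SVR": 4, "GPR": 1, "KNN": 3, "LASSO": 5}
top3 = collaborative_top_k(material, general)
```

Returning a short list rather than a single winner matches the stated aim of recommending a set of relatively good algorithms.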
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a system configuration according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a four-layer material class tree structure for visualizing knowledge in the material domain, in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of the recommendation results of six meta-learning methods according to an embodiment of the present invention;
FIG. 4 is a graph of true and predicted rankings for a regression algorithm in accordance with an embodiment of the invention;
FIG. 5 is a graph of the predicted results of the recommendation algorithm in different data sets according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Currently, data-driven regression machine learning (ML) can analyze and map the complex relationships among material composition, structure, process and performance by fitting latent patterns in existing historical and empirical data. According to the No Free Lunch theorem, each algorithm has its own range of application, and no algorithm suits all problems. Material experts who are not computing professionals can often only enumerate possible algorithms and parameter configurations by trial and error or from historical experience, which is a complex process that consumes time, resources and labor. Automated machine learning (AutoML) based on meta-learning can speed up the design of ML models for new tasks by learning the performance and parameter configurations of different ML models from previous modeling experience. However, research applying meta-learning-based AutoML to material performance regression tasks is currently lacking. At the same time, the number of publicly available material regression data sets suitable for machine learning is limited.
Therefore, in order to accelerate material performance prediction research, the embodiment provides an automatic material performance prediction system with embedded domain knowledge. The system comprises a regression-oriented metadata base builder, a material domain knowledge base builder and a collaborative algorithm recommender. The metadata base mainly comprises enhanced meta-features for measuring the similarity of material data sets and performance data of regression algorithms suitable for predicting material target properties; the material domain knowledge is visualized by a material category tree built from the material categories to which the traceable material data sets belong; collaborative algorithm recommendation uses a collaborative mechanism to combine the algorithm recommendation results based on material metadata with those based on metadata from other domains, with domain knowledge quantified and embedded into the automatic algorithm recommendation process.
AutoML: AutoML automates important steps of machine learning such as feature engineering, algorithm selection and hyper-parameter optimization, reduces the complexity of algorithm selection and parameter setting during model construction, lowers the cost and the degree of involvement required of domain experts who use machine learning for scientific discovery and analysis, and improves the usability of machine learning.
Material performance prediction: data-driven machine learning can analyze and map the complex relationships among material composition, structure, process and performance by fitting latent patterns in existing historical and empirical data.
Meta-learning: meta-learning learns modeling experience from previous similar tasks in order to automatically select an appropriate machine learning algorithm and determine its parameters for a given new task; it is widely used in automated machine learning.
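One common way to realize this idea, shown here only as a minimal sketch, is nearest-neighbor meta-learning: the meta-features of a new data set are compared with those of previously modeled data sets, and the algorithm ranks observed on the most similar ones are averaged. The function name, the Euclidean distance and the toy history are assumptions; they are not taken from the source.

```python
import math


def recommend_by_similarity(new_meta, history, k=2):
    """Average the algorithm ranks of the k most similar past data sets.

    history: list of (meta_feature_vector, {algorithm: rank}) pairs,
    where rank 1 is the best-performing algorithm on that data set.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbours = sorted(history, key=lambda item: distance(new_meta, item[0]))[:k]
    algorithms = neighbours[0][1].keys()
    mean_rank = {
        alg: sum(ranks[alg] for _, ranks in neighbours) / len(neighbours)
        for alg in algorithms
    }
    # Best (lowest) mean rank first; ties are broken alphabetically.
    return sorted(mean_rank, key=lambda alg: (mean_rank[alg], alg))


# Toy metadata: two-dimensional meta-features and ranks of three algorithms.
history = [
    ([0.1, 0.2], {"RF": 1, "SVR": 2, "KNN": 3}),
    ([0.2, 0.1], {"RF": 2, "SVR": 1, "KNN": 3}),
    ([5.0, 5.0], {"RF": 3, "SVR": 2, "KNN": 1}),
]
ranking = recommend_by_similarity([0.15, 0.15], history, k=2)
```

No model is trained on the new data set at recommendation time, which is where the time saving over exhaustive trial and error comes from.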
AutoML was proposed, and has continued to develop, in response to the difficulty of algorithm selection and the complexity of parameter optimization that accompany the development of machine learning. AutoML automates important steps of the traditional machine learning process, such as feature engineering, hyper-parameter optimization, model selection and pipeline configuration, and selects algorithms and trains models with minimal domain knowledge or actual data, which lowers the threshold for domain experts to use machine learning and deep learning and allows machine learning models to be applied with as little human intervention as possible. Researchers have proposed many AutoML frameworks, but most focus on only certain parts of the AutoML pipeline. For example, Auto-WEKA, Auto-sklearn and TPOT focus on traditional ML models (e.g., SVM, DT and KNN); TPOT involves neural networks but only supports multilayer perceptrons; Auto-Keras is an open-source library built on Keras that focuses on searching for deep learning models and supports multiple modalities and tasks.
It is worth mentioning Auto-WEKA, regarded as the pioneering work of AutoML. It was the first to consider the Combined Algorithm Selection and Hyper-parameter optimization (CASH) problem, and it uses the machine learning framework WEKA together with Bayesian optimization to reduce the reliance of machine learning on human effort, so that non-experts can easily build a high-quality classifier suited to a specific application scenario. CASH is the core of AutoML; it can be regarded as a hierarchical hyper-parameter optimization (HPO) problem and can be solved with HPO methods such as random search, Bayesian optimization, evolutionary optimization, reinforcement learning and gradient-based search. However, Auto-WEKA incurs a large optimization cost because of its high-dimensional hyper-parameter space. Subsequently, Feurer et al. released Auto-sklearn in 2015, which mainly adds a meta-learning step to warm-start Bayesian optimization and an automated ensemble construction step that uses all classifiers found by Bayesian optimization. Meta-learning learns, from existing experience (metadata), how different hyper-parameter configurations or machine learning models perform on similar tasks (meta-knowledge) in order to select ML instances that are likely to predict well, which greatly reduces the time and computational cost of CASH and speeds up the design of machine learning models for new tasks. Today, more and more ML researchers focus on how to better describe learning tasks, how to build and learn meta-knowledge from metadata quickly and efficiently, and how to make recommendations automatically from meta-knowledge.
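For reference, the CASH problem is usually stated as a single joint minimization over algorithms and their hyper-parameters. The notation below follows the formulation commonly given in the Auto-WEKA literature and is not taken from this document: \mathcal{A} is the set of candidate algorithms, \Lambda^{(j)} the hyper-parameter space of algorithm A^{(j)}, \mathcal{L} a loss, and D_train^{(i)}, D_valid^{(i)} the i-th of k cross-validation splits.

```latex
A^{*}_{\lambda^{*}} \in
\operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}}
\frac{1}{k} \sum_{i=1}^{k}
\mathcal{L}\!\left(A^{(j)}_{\lambda},\; D^{(i)}_{\mathrm{train}},\; D^{(i)}_{\mathrm{valid}}\right)
```

Because the choice of algorithm determines which hyper-parameters exist at all, the search space is hierarchical, which is why CASH can be treated as a hierarchical HPO problem.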
Predicting a continuous-valued material property is a supervised regression problem. According to the No Free Lunch theorem, no single universal regression algorithm outperforms all others on every material task, and different algorithms and their parameters must be selected and tuned repeatedly by domain experts. AutoML automates the machine learning modeling process by treating algorithm selection and hyper-parameter optimization jointly, which greatly lowers the threshold for domain experts to use machine learning. However, little research has combined AutoML with material property prediction. In materials science, Dunn et al. proposed Automatminer in 2021 for predicting the properties of inorganic bulk materials. It is a tool that automatically creates a complete machine learning pipeline for materials science and relies mainly on the existing TPOT tool to solve the underlying CASH problem. However, like most AutoML methods, Automatminer selects the best model by training multiple ML models from scratch; that is, it removes the need for user intervention but does not reduce the waiting time of the modeling process. Therefore, applying meta-learning-based AutoML to material performance prediction is expected to improve ML regression performance and efficiency at a lower cost in time and computing resources.
However, directly introducing meta-learning-based AutoML into material property regression tasks faces three difficulties: 1) metadata suitable for material regression tasks is lacking (for material data sets governed by complex physicochemical mechanisms, meta-features that distinguish the similarity between material tasks, and performance data of ML algorithms applicable to material problems, have to be constructed); 2) since most material machine learning studies rarely disclose their experimental data, there are too few material data sets to construct metadata, which may cause the meta-learning results to overfit; 3) data-driven AutoML tends to produce meta-learning results that are inconsistent with, or even contradict, material domain knowledge. The latter two problems can be treated together as a single problem.
The invention introduces meta-learning-based AutoML into the material performance regression prediction task and provides an automatic material performance prediction system. Specifically, for metadata construction, three new meta-features are proposed on top of the traditional meta-features to enhance the similarity measure between data sets, and, based on a survey of the material machine learning literature, 18 regression algorithms are used as candidate algorithms for material performance prediction. To address the limited number of material data sets and the lack of domain knowledge in ML results, a collaborative learning mechanism with embedded domain knowledge is proposed: meta-learning is performed by jointly using the material data sets and general data sets collected from the two public machine learning repositories OpenML and UCI, which alleviates meta-overfitting and improves the interpretability and reliability of the model.
As shown in FIG. 1, the invention provides an automatic material performance prediction system comprising the following five modules: 1) the data acquisition module collects material data sets usable for regression tasks and public general data sets from other fields; 2) the data preprocessing module performs missing value processing, standardization/normalization and similar operations on all data sets; 3) the domain knowledge base construction module builds a material category tree from acquired domain knowledge and represents it quantitatively to construct a domain knowledge matrix; 4) the metadata base construction module obtains the meta-feature matrices and performance matrices of the material data sets and the public general data sets through meta-feature calculation and algorithm performance evaluation, and constructs the two types of metadata; 5) the algorithm recommendation module uses a collaborative recommendation mechanism with embedded domain knowledge to combine the algorithm recommendation result based on the material metadata with that based on the general metadata, and finally predicts the material target attribute with the optimal recommended algorithm.
A data acquisition module:
From 41 publications on machine learning for materials, 54 material regression data sets for material property prediction were collected. The data set information recorded from these publications includes data source, number of samples, feature dimension, target attribute and material category, and covers a range of property studies in materials science, such as the conductivity of polymer materials in nanocomposite polymer electrolyte systems, the ionic conductivity of LISICON-type lithium fast ion conductors, the sublimation enthalpy of small organic molecules and the density of biofuel compounds. In addition, 60 general data sets suitable for regression tasks were collected from the two public ML repositories OpenML and UCI; their basic information includes data source, number of samples, feature dimension and target attribute.
A data preprocessing module:
Because the public data sets were collected by different research institutions and are not stored in a systematic way, each data set undergoes two preprocessing steps, manual and programmatic: the influencing factors (X) and the target performance (y) are unified manually by tracing the original source information, and missing values, categorical data and standardization/normalization are then handled programmatically with an ML program.
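The programmatic step might look roughly like the following standard-library sketch, which imputes missing numeric values with the column mean, standardizes numeric columns and one-hot encodes categorical ones. The function name, the record format and the choice of mean imputation and z-score scaling are assumptions; the source does not say which concrete operations or libraries are used.

```python
import statistics


def preprocess(rows, categorical_columns):
    """Impute, standardise and one-hot encode a list of dict records."""
    columns = list(rows[0])
    numeric = [c for c in columns if c not in categorical_columns]

    # 1) Missing numeric values (None) are replaced by the column mean.
    #    Missing categorical values are not handled in this sketch.
    means = {c: statistics.fmean([r[c] for r in rows if r[c] is not None])
             for c in numeric}
    filled = [{c: (means[c] if c in numeric and r[c] is None else r[c])
               for c in columns} for r in rows]

    # 2) Numeric columns are scaled to zero mean and unit variance;
    #    a constant column keeps a divisor of 1.0 to avoid division by zero.
    stds = {c: statistics.pstdev([r[c] for r in filled]) or 1.0
            for c in numeric}

    # 3) Categorical columns are one-hot encoded over their sorted levels.
    levels = {c: sorted({r[c] for r in filled}) for c in categorical_columns}
    out = []
    for r in filled:
        vec = [(r[c] - means[c]) / stds[c] for c in numeric]
        for c in categorical_columns:
            vec += [1.0 if r[c] == lv else 0.0 for lv in levels[c]]
        out.append(vec)
    return out


# Made-up records: one numeric descriptor with a gap, one categorical one.
records = [
    {"temp": 300.0, "phase": "a"},
    {"temp": None, "phase": "b"},
    {"temp": 500.0, "phase": "a"},
]
X = preprocess(records, categorical_columns=["phase"])
```

The imputed row ends up at the column mean and therefore maps to zero after scaling.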
The metadata base building module:
In general, the metadata used in meta-learning consists of the evaluated performance of different ML models on previous data sets and a set of meta-features (data set attributes), which influence the recommendation of ML algorithms through their effect on the similarity measure between data sets. The system constructs its metadata base from all 54 material data sets and 60 general data sets. Specifically, the four traditional types of meta-features are combined with two new types to give 27 meta-features that represent the attributes of a data set; the material data sets and the general data sets are then modeled with 18 regression algorithms suitable for material property prediction. Finally, meta-feature calculation and algorithm performance evaluation yield, for the two groups of data sets, two meta-feature matrices and two algorithm performance matrices.
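To make the notion of an algorithm performance matrix concrete, the sketch below evaluates two toy regressors on one synthetic data set with k-fold cross-validation and stores the RMSE values in a matrix with one row per data set and one column per algorithm. The two regressors, the interleaved fold assignment and the use of RMSE are stand-ins chosen for brevity; they are not the 18 algorithms or the evaluation protocol of the source.

```python
import math


def mean_regressor(train_X, train_y, test_X):
    """Baseline: always predict the training mean."""
    mu = sum(train_y) / len(train_y)
    return [mu for _ in test_X]


def one_nn_regressor(train_X, train_y, test_X):
    """1-nearest-neighbour regression with squared Euclidean distance."""
    def nearest(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
        return train_y[dists.index(min(dists))]
    return [nearest(x) for x in test_X]


def cv_rmse(model, X, y, folds=3):
    """Root mean squared error under simple interleaved k-fold CV."""
    sq_err = 0.0
    for f in range(folds):
        test_idx = [i for i in range(len(X)) if i % folds == f]
        train_idx = [i for i in range(len(X)) if i % folds != f]
        preds = model([X[i] for i in train_idx], [y[i] for i in train_idx],
                      [X[i] for i in test_idx])
        sq_err += sum((p - y[i]) ** 2 for p, i in zip(preds, test_idx))
    return math.sqrt(sq_err / len(X))


def performance_matrix(datasets, models):
    """Rows are data sets, columns are algorithms, entries are CV RMSE."""
    return [[cv_rmse(m, X, y) for m in models.values()] for X, y in datasets]


models = {"mean": mean_regressor, "1-NN": one_nn_regressor}
# Synthetic data set: y = 2x on nine points.
linear = ([[float(i)] for i in range(9)], [2.0 * i for i in range(9)])
P = performance_matrix([linear], models)
```

Ranking the entries of each row gives the per-data-set algorithm ranking that a meta-learner can later reuse.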
Meta-feature computation
Meta-features are the core of meta-learning; they characterize the attributes of a data set so that the data-level similarity between different data sets can later be measured. In constructing the present automatic material property prediction system, the proposed meta-features are divided into six classes to capture data set properties that may affect regression model performance: 8 simple meta-features describing the basic structure of the data set, 8 statistical meta-features describing the data distribution, 3 meta-features based on principal component analysis (PCA), 5 landmark meta-features based on ML models, 2 statistical meta-features describing the target attribute, and 1 meta-feature describing the uncertainty of the target attribute. The first three groups are common features extracted from the condition attributes X of the data set. The fourth group relies on ML algorithms to extract model performance metrics, while the last two groups are extracted from the target y of the data set as enhanced meta-features. Table 1 lists all the meta-features used in the method and their definitions.
TABLE 1
Figure BDA0003715132300000141
Figure BDA0003715132300000151
Regarding the enhanced meta-features: purely data-driven machine learning modeling and prediction typically assumes that the learning samples follow a certain data distribution. On the other hand, because the mechanisms governing material behavior are complex and varied, experimental measurement errors or errors from insufficiently accurate approximate calculations introduce uncertainty into target property values, which are easily influenced by physical and chemical factors inside the material and by external factors such as time and temperature. Studying the data distribution and the uncertainty of the target property is therefore critical for distinguishing between different material data sets. Here, the group of statistical meta-features on the target attribute is extracted from the kurtosis and skewness of the target vector, characterizing the distribution of the output data through descriptive statistics. Furthermore, following the pan-Gaussian distribution used in the study of uncertainty, the uncertain concept of a material target attribute can be represented numerically by the expectation Ey, the entropy En and the hyper-entropy He. Ey is the most representative value of the concept; En represents the granularity scale of the concept; and He describes the uncertainty of that granularity. A concept C is then defined as the triple (Ey, En, He) for processing data with uncertain information. Given a new data set, its target vector is defined as y = (y_1, ..., y_i, ..., y_p), and Ey, En and He are calculated by formulas (1) to (4), where S^2 denotes the variance of the target property of the data set and p denotes the number of samples.
Figure BDA0003715132300000152
Figure BDA0003715132300000153
Figure BDA0003715132300000161
Figure BDA0003715132300000162
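Formulas (1) to (4) are reproduced only as images, so their exact form cannot be confirmed here. As a hedged illustration, the sketch below computes the target-attribute meta-features under the assumption that the triple (Ey, En, He) is estimated from the sample mean, the mean absolute deviation scaled by sqrt(pi/2), and the sample variance S^2, which is one commonly used estimator for such triples. The function name, the use of excess kurtosis and the p - 1 divisor for S^2 are all assumptions rather than the patent's definitions.

```python
import math

def target_meta_features(y):
    """Skewness, excess kurtosis and an assumed (Ey, En, He) triple for target y.

    Requires at least two samples and a non-constant target vector.
    """
    p = len(y)
    ey = sum(y) / p                                    # expectation Ey
    m2 = sum((v - ey) ** 2 for v in y) / p             # 2nd central moment
    m3 = sum((v - ey) ** 3 for v in y) / p             # 3rd central moment
    m4 = sum((v - ey) ** 4 for v in y) / p             # 4th central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0                      # excess kurtosis (assumed)
    s2 = sum((v - ey) ** 2 for v in y) / (p - 1)       # sample variance S^2 (assumed divisor)
    en = math.sqrt(math.pi / 2) * sum(abs(v - ey) for v in y) / p   # entropy En
    he = math.sqrt(max(s2 - en ** 2, 0.0))             # hyper-entropy He, clipped at 0
    return {"skewness": skewness, "kurtosis": kurtosis,
            "Ey": ey, "En": en, "He": he}

# Hypothetical target vector.
features = target_meta_features([1.0, 2.0, 3.0, 4.0, 5.0])
```

The clipping in `he` is a design choice for this sketch: with small samples S^2 can fall below En^2, and the square root would otherwise be undefined.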
Algorithm performance evaluation
In terms of prediction accuracy, classical and statistical machine learning methods (e.g., linear regression, support vector regression, K-nearest neighbor regression and decision trees) are better suited to smaller data sets, whereas neural networks require large amounts of data and only approach good predictive accuracy when thousands or more training data points are available. From the perspective of model interpretability, linear models (e.g., linear regression, ridge regression, lasso regression) are easy to implement and their learning results are usually easy to understand. In materials science, however, complex nonlinear relationships often exist between the condition factors and the target attributes; the learning results of nonlinear models (such as multilayer perceptron, Gaussian process regression, Bayesian ridge regression) are therefore a "black box", yet these models are widely used by materials experts. The present invention finally considers 18 regression algorithms for predicting material properties, as shown in Table 2. They use different modeling rules to establish a mapping between the inputs and outputs of a material data set. The hyperparameters are optimized through five-fold cross-validation and Bayesian optimization (BO), and the root mean square error (RMSE) is used to evaluate the predictive performance of the models.
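The evaluation protocol described above (five-fold cross-validation scored by RMSE) can be illustrated with a small sketch. It is not the patent's implementation: a single-feature least-squares line stands in for the 18 regression algorithms, the folds are contiguous rather than shuffled, Bayesian optimization of hyperparameters is omitted, and all names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for a single input feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx

def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def cross_val_rmse(xs, ys, k=5):
    """Mean RMSE over k contiguous folds (no shuffling, for reproducibility)."""
    n = len(xs)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold * n // k, (fold + 1) * n // k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        scores.append(rmse([y for _, y in test], [a * x + b for x, _ in test]))
    return sum(scores) / k

# Hypothetical noise-free data: the fitted line should reproduce it almost exactly.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
score = cross_val_rmse(xs, ys)
```

Running the same routine for every candidate algorithm on every prior data set would fill one row of the algorithm performance matrix per data set.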
TABLE 2
Figure BDA0003715132300000163
Figure BDA0003715132300000171
A domain knowledge base construction module:
Domain knowledge acquisition
The method designs a multi-way tree, named the material category tree (MCT-DK), to visualize material domain knowledge. The target attribute of each material data set is clarified by tracing its source machine learning publication, and the various target attributes are treated as leaf nodes of the MCT-DK. The domain experts then design the penultimate layer by analyzing the material classes to which the leaf nodes belong, and proceed upward in the same way until the root node (level 0) is reached. The result is the four-layer material category tree shown in Fig. 2, which visualizes the material domain knowledge; the specific subdivided material attributes represented by the leaf nodes are listed in Table 3. Overall, all material data sets are grouped into four main material classes: metal materials, inorganic non-metal materials, high polymer materials and composite materials. Metal materials are classified into ferrous and non-ferrous metals and cover 9 subdivided materials such as iron-based metallic glass, nickel-based single-crystal superalloys, high-entropy alloys and multi-component β-titanium alloys. Inorganic non-metal materials mainly comprise four classes (glass, ceramic materials, cement and refractory materials) and cover 23 subdivided materials such as LISICON-type lithium fast-ion conductors, perovskite compounds, iron-based superconductors and NASICON solid electrolytes. High polymer materials are divided into 8 subclasses, such as plastics, rubber, fibers, coatings and organic solvents, covering 17 subdivided materials.
For example, plastics can be classified into thermoplastics and thermosets according to their physicochemical properties; rubber materials can be divided into natural rubber and synthetic rubber according to the manufacturing method; and coating materials can be classified into organic coating materials, inorganic coating materials, organic-composite coating materials and so on according to the kind of binder. Composite materials can be classified by substrate into metal-based materials, ceramic-based materials, polymer-based materials, carbon-carbon composite materials and so on. Note that some branches (material classes) have no corresponding leaf nodes (data sets), but the materials experts retain these classes to keep the material class system complete. Moreover, the leaf-node branches are expected to become richer as more public data sets become available; that is, the material category tree can be dynamically extended.
TABLE 3
Figure BDA0003715132300000181
Figure BDA0003715132300000191
Representation of domain knowledge
First, the four-level material class of each data set is obtained from the material category tree. Specifically, for each prediction data set, a bottom-up strategy is used to traverse from the leaf node to the root, and the nodes on the traversed path together form the material class. Because the material category tree has four levels including the root node, the complete material class of a data set consists of four nodes. For example, for a data set predicting the superconducting transition temperature of iron-based superconductors, the complete material class obtained from the tree is [material, inorganic non-metallic material, ceramic material, iron-based superconductor]; for a data set predicting the methane hydrate formation temperature, the complete material class is [material, high polymer material, small organic molecule, methane hydrate].
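The bottom-up traversal described above can be sketched with a child-to-parent mapping. The fragment of the tree below contains only the two example paths given in the text; the dictionary layout and the function name are illustrative assumptions, not the patent's data structure.

```python
# Fragment of the material category tree as child -> parent links.
# Only the two example branches from the text are included.
PARENT = {
    "inorganic non-metallic material": "material",
    "high polymer material": "material",
    "ceramic material": "inorganic non-metallic material",
    "small organic molecule": "high polymer material",
    "iron-based superconductor": "ceramic material",
    "methane hydrate": "small organic molecule",
}

def material_class(leaf, parent=PARENT):
    """Walk bottom-up from a leaf to the root and return the path root-first."""
    path = [leaf]
    while path[-1] in parent:          # the root has no parent entry
        path.append(parent[path[-1]])
    return list(reversed(path))

superconductor_class = material_class("iron-based superconductor")
hydrate_class = material_class("methane hydrate")
```

Storing only parent links keeps the structure easy to extend when new leaf nodes (data sets) are added to the tree.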
Next, a top-down strategy is employed to quantify the similarity of any two data sets at the domain knowledge level. Assume there are two material data sets i and j, and define
Figure BDA0003715132300000192
as the material class similarity of i and j. The value of
Figure BDA0003715132300000193
is given by formula (5):
Figure BDA0003715132300000194
wherein
Figure BDA0003715132300000195
indicates that i and j are similar only at the first-level material category, that is, they belong to two different classes among the four main classes (metal materials, inorganic non-metal materials, high polymer materials and composite materials);
Figure BDA0003715132300000196
indicates that i and j are also similar at the second level in addition to the first, that is, they belong to the same one of the four main material classes; similarly,
Figure BDA0003715132300000197
indicates that i and j are similar in material class up to the third level, and
Figure BDA0003715132300000198
indicates that i and j are similar through the fourth level of category, which is the highest domain knowledge similarity that the material category tree can express. For example, the material class similarity between the superconducting transition temperature prediction data set for iron-based superconductors and the methane hydrate formation temperature prediction data set described above is 0.2.
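The top-down comparison can be illustrated by counting how many leading levels two material class paths share and mapping that count to a similarity value. Formula (5) is available only as an image, so the only value taken from the text is 0.2 for paths that share just the root level; the other three values in the table below are evenly spaced placeholders chosen for illustration and should be replaced by the values of formula (5).

```python
# Similarity by number of shared leading levels.
# 0.2 comes from the worked example in the text; 0.4, 0.6 and 0.8 are
# hypothetical placeholders, NOT the values of formula (5).
ASSUMED_SIMILARITY = {1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8}

def shared_levels(path_i, path_j):
    """Count matching nodes from the root downward, stopping at the first mismatch."""
    count = 0
    for node_i, node_j in zip(path_i, path_j):
        if node_i != node_j:
            break
        count += 1
    return count

def class_similarity(path_i, path_j, table=ASSUMED_SIMILARITY):
    """Look up the similarity for two root-first material class paths."""
    return table.get(shared_levels(path_i, path_j), 0.0)

superconductor = ["material", "inorganic non-metallic material",
                  "ceramic material", "iron-based superconductor"]
hydrate = ["material", "high polymer material",
           "small organic molecule", "methane hydrate"]
similarity = class_similarity(superconductor, hydrate)
```

Because the two example paths diverge immediately below the root, they share one level and the lookup returns the 0.2 stated in the text.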
An algorithm recommending module:
In summary, two types of metadata, each consisting of a meta-feature matrix and an algorithm performance matrix, are constructed for the material data sets and the general data sets respectively, together with a quantitative representation of material domain knowledge derived from the material category tree. The method provides a domain-knowledge-embedded collaborative recommendation mechanism that improves on algorithm recommendation with existing meta-learning methods. It has two branches: the first quantifies and embeds domain knowledge alongside the metadata of the material data sets to guide the meta-learning recommendation process; the second makes a meta-learning recommendation directly from the metadata of the general data sets, where by default all general data sets are taken to have no category similarity with the material test data set at the domain knowledge level (or, put another way,
Figure BDA0003715132300000201
takes the same value for all of them).
Specifically, suppose a new material test set d_new is given. First, the 27 meta-features describing the attributes of the data set are computed. Formula (6) is then used to calculate the Euclidean distance in meta-feature space between d_new and every prior task of each type (material data sets and general data sets), and this distance serves as the similarity measure between data sets.
Figure BDA0003715132300000202
where F_i and F_j denote the meta-feature sets of any two data sets, each defined as {f_1, ..., f_k, ..., f_n}, and n is the total number of meta-features proposed by the method, i.e. 27.
Next, all prior material data sets and all prior general data sets are each sorted by their Euclidean distance to d_new, giving the m nearest similar material data sets and the p nearest similar general data sets. At the data level, the shorter the Euclidean distance between two data sets, the higher their similarity. Finally, the algorithm performance data of the similar data sets and the material classes (domain knowledge) of the similar material data sets are returned.
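The distance computation and neighbor selection described above can be sketched as follows. The three-dimensional vectors stand in for the 27-dimensional meta-feature vectors, and the data set names and function names are hypothetical.

```python
import math

def euclidean(f_i, f_j):
    """Euclidean distance between two meta-feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))

def nearest_datasets(new_features, prior_features, count):
    """Return the names of the `count` prior data sets closest to the new one."""
    ranked = sorted(prior_features,
                    key=lambda name: euclidean(new_features, prior_features[name]))
    return ranked[:count]

# Hypothetical prior meta-features (3 dimensions instead of 27 for brevity).
prior = {
    "A": [0.0, 0.0, 0.0],
    "B": [1.0, 1.0, 1.0],
    "C": [3.0, 4.0, 0.0],
}
d_new = [0.9, 1.0, 1.1]
similar = nearest_datasets(d_new, prior, count=2)
```

The same routine would be called twice, once over the prior material data sets (returning m neighbors) and once over the prior general data sets (returning p neighbors). Meta-features on very different scales would normally be standardized first so that no single feature dominates the distance.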
Furthermore, on the one hand, based on the similar general data sets and their performance data, the average rank of each regression algorithm among all algorithms is determined using the Average Ranking (AR) method, as shown in formula (7):
Figure BDA0003715132300000211
where
Figure BDA0003715132300000212
represents the predicted performance rank of the jth regression algorithm on the ith similar data set;
Figure BDA0003715132300000213
represents the average predicted performance rank of the jth regression algorithm over the different similar data sets, j = 1, ..., a (a is 18, the total number of candidate regression algorithms); and p represents the number of similar general data sets.
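As an illustration of the Average Ranking step, the sketch below averages the per-data-set ranks of each algorithm over the p similar general data sets. It assumes that formula (7) is the plain arithmetic mean of the ranks, which matches the description but cannot be checked against the image; the rank values and the three algorithm names are hypothetical.

```python
def average_ranking(rank_table):
    """Average the rank of each algorithm over all similar data sets.

    rank_table[i][name] is the rank (1 = best) of algorithm `name`
    on the i-th similar data set.
    """
    p = len(rank_table)
    return {name: sum(row[name] for row in rank_table) / p
            for name in rank_table[0]}

# Hypothetical ranks of three algorithms on p = 3 similar data sets.
ranks = [
    {"KNN": 1, "MLP": 2, "LR": 3},
    {"KNN": 2, "MLP": 1, "LR": 3},
    {"KNN": 1, "MLP": 3, "LR": 2},
]
avg = average_ranking(ranks)
best = min(avg, key=avg.get)   # lowest average rank is the best algorithm
```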
On the other hand, based on the similar material data sets and their performance data, the AR method is improved by embedding domain knowledge to determine the average rank of each regression algorithm among all algorithms, as shown in formula (8):
Figure BDA0003715132300000214
where m represents the number of similar material data sets, and
Figure BDA0003715132300000215
represents the similarity at the domain knowledge level between the ith similar data set and d_new, calculated by formula (5) above.
Then a collaboration mechanism is used to combine the two computed results,
Figure BDA0003715132300000216
and
Figure BDA0003715132300000217
which are two different recommendations, as shown in formula (9), where the collaborative learning factor C* takes values in the range [0.1, 1]. After the final rank of each regression algorithm has been computed collaboratively, the algorithms are re-sorted and the top-ranked regression models are recommended. The ranking position of each regression algorithm thus reflects its combined evaluation: the better the combined evaluation of a regression algorithm, the higher its ranking position and the better its expected regression fit in prediction.
col_R(R_md, R_pd, C*) = C* × R_md + (1 − C*) × R_pd (9)
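Formula (9) and the subsequent re-ranking can be sketched as below. The sketch assumes that R_md is the average ranking obtained from the material metadata and R_pd the one obtained from the general metadata (the subscripts are not defined explicitly in the text), that a lower combined value means a better algorithm, and that three algorithms are recommended; the rank values and function names are hypothetical.

```python
def collaborative_rank(r_md, r_pd, c_star):
    """Combine two average rankings per algorithm as in formula (9)."""
    if not 0.1 <= c_star <= 1.0:
        raise ValueError("C* is expected to lie in [0.1, 1]")
    return {name: c_star * r_md[name] + (1.0 - c_star) * r_pd[name]
            for name in r_md}

def recommend(r_md, r_pd, c_star, top=3):
    """Return the `top` algorithms with the lowest combined rank."""
    scores = collaborative_rank(r_md, r_pd, c_star)
    return sorted(scores, key=scores.get)[:top]

# Hypothetical average ranks from the material and the general metadata.
r_md = {"KNN": 1.5, "MLP": 2.0, "LR": 4.0, "RF": 2.5}
r_pd = {"KNN": 3.0, "MLP": 1.0, "LR": 3.5, "RF": 2.5}
top_three = recommend(r_md, r_pd, c_star=0.5)
```

With C* = 1 the recommendation relies on the material metadata alone; smaller values give more weight to the ranking derived from the general data sets.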
In response to the shortcomings of the prior art, an automatic material property predictor is presented. The method improves meta-learning-based AutoML so that a regression model is selected and its hyperparameters are optimized automatically, promoting the design and discovery of novel materials with excellent properties. It takes into account the problems of both traditional ML and AutoML in materials science: applying meta-learning-based AutoML to material property prediction reduces the need for intervention by materials experts and the consumption of time and resources, and addresses two problems of the traditional machine learning modeling process, namely that algorithm selection and parameter optimization are difficult, time-consuming and labor-intensive, and that purely data-driven ML results are prone to be inconsistent with, or even contradict, materials domain knowledge. First, as many material data sets and other-domain data sets as possible are collected from published materials ML literature and public ML communities, respectively, to populate the meta-learning samples. Next, 27 meta-features, including three new meta-features, and the performance data of 18 regression algorithms suitable for material property prediction are computed for the two types of data sets to construct enhanced metadata for meta-learning. In addition, a material category tree is built by tracing the material categories of the material data sets, to visualize material domain knowledge. Finally, a domain-knowledge-embedded collaboration mechanism combines the meta-recommendation results obtained from the two types of metadata to produce the final regression algorithm recommendation and property prediction.
Given a new material data set, the recommended ranking of all 18 candidate regression algorithms on that data set is derived by the domain-knowledge-embedded collaborative meta-learning component described above. The three regression algorithms with the best overall ranking are recommended to the user as the final optimal algorithms. The automatic material property predictor aims to select, from the historical performance rankings of the algorithms, a relatively good set of algorithms for a given data set rather than a single absolutely optimal algorithm. This narrows the algorithm search space and reduces the time and computational cost that domain experts spend on choosing a machine learning algorithm, improving the reliability and usability of machine learning for material property prediction.
FIG. 3 shows, for six different meta-learning methods, the distribution of data sets across recommendation accuracy levels (in the figure, the four colored bars represent the number of material test sets with recommendation accuracies of 0-0.2, 0.4, 0.6 and 0.8-1, respectively, and "acc" denotes the average recommendation result over all 100 experiments). The experiment uses leave-one-out cross-validation: in each split, one data set serves as the material test data set d_new, while the remaining 53 material data sets and the 60 general data sets serve as prior data sets for meta-learning. Each d_new is divided by holdout into 80% training samples and 20% test samples. A recommended algorithm ranking is obtained by feeding the 80% training samples into the automatic material property predictor (the domain-knowledge-embedded collaborative learning recommendation method, MtL-ColDK). Under the same experimental setup, the 18 regression algorithms are also trained from scratch on the 80% training samples with five-fold cross-validation, and their performance is normalized and ranked to give what is denoted the true algorithm ranking. The verification experiment was performed 100 times in total. The recommendation evaluation index is whether the top five regression algorithms in the recommended ranking agree with the true top five.
Specifically, the proposed MtL-Col (using equations (7) and (9)) has corresponding data set distributions of 39/54, 43/54, 41/54, 44/54, 49/54, 48/54, 46/54, 44/54 at the ideal recommendation accuracy (threshold set to 0.5). Compared with the meta-learning methods that use fewer meta-tasks (MtL-M and MtL-P), the two methods that use more meta-tasks (MtL-Colcmp and MtL-Col) always obtain better recommendation accuracy. More importantly, using the same large-scale metadata, MtL-Col outperformed the comparison method MtL-Colcmp, which uses mixed metadata, in every experiment. Further, as shown in FIGS. 3(a) and 3(e), in 9 of 10 experiments MtL-DK obtained more material data sets with the ideal recommendation accuracy than MtL-M. In experiments num.1, num.4, num.5, num.6 and num.7 on MtL-DK, the improvements in the number of data sets at the ideal recommendation accuracy were 2, 4, 2 and 3, respectively. Although embedding domain knowledge encoding into meta-learning did not significantly improve the ideal accuracy in the other rounds of validation experiments, it still reduced the number of data sets with low accuracy. In addition, the proposed MtL-ColDK (using equations (7) to (9)) was compared with MtL-Col. As can be seen from FIGS. 3(d) and 3(f), the former achieved a larger or unchanged number of material data sets with the ideal recommendation accuracy in 8 of 10 experiments. In general, compared with the other five automatic meta-learning methods, which either do not use the collaborative learning mechanism or do not embed domain knowledge guidance, the proposed MtL-ColDK has higher and more stable recommendation capability.
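The evaluation index based on the top five algorithms can be illustrated as follows. The text does not give a formula for the recommendation accuracy, so this sketch assumes it is the fraction of the true top five algorithms that also appear in the recommended top five, which would produce the accuracy levels 0, 0.2, ..., 1 shown in FIG. 3. The rankings below are hypothetical and the function name is illustrative.

```python
def top_k_accuracy(recommended, true, k=5):
    """Fraction of the true top-k algorithms found in the recommended top-k.

    Both arguments are lists of algorithm names ordered from best to worst.
    """
    return len(set(recommended[:k]) & set(true[:k])) / k

# Hypothetical rankings for one material test data set.
recommended_ranking = ["KNN", "MLP", "RF", "KRR", "AdaBoost", "LR"]
true_ranking = ["MLP", "KNN", "SVR", "RF", "GPR", "KRR"]
accuracy = top_k_accuracy(recommended_ranking, true_ranking)

# A data set counts toward the "ideal" group when accuracy exceeds the threshold.
is_ideal = accuracy > 0.5
```

Here three of the five true best algorithms are recovered, so the accuracy is 0.6 and the data set would count toward the ideal group under the 0.5 threshold mentioned in the text.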
Further, 12 data sets were randomly selected from the 54 data sets, and the predictive ability of the 3 optimal regression algorithms recommended by the proposed meta-learning method was analyzed. Fig. 4 shows the true and predicted rankings of the 18 regression algorithms on each data set. The prediction performance of linear models such as LR, LS, Ridge, LARSLasso and RANSAC is not outstanding, whereas nonlinear models such as KNN, AdaBoost, MLP, KRR and RF rank higher than the other regression models. This is attributed to the complex physicochemical mechanisms inside materials, which prevent linear algorithms from capturing the nonlinear relationships between the various factors and the target properties. FIG. 5 shows the prediction accuracy of the optimal algorithms on the 20% test samples of each data set. The AdaBoost, MLP and KRR models achieve a low average RMSE and a high average R² in most cases. In addition, the PAR, KNN, SGD and ElasticNet regression models were also predicted to perform well on material data sets 3, 4, 12, 20 and 34.
The automatic material property prediction system provided by the embodiment has the following advantages:
1. An automatic material property prediction system is provided, introducing meta-learning-based automated machine learning into regression prediction of material properties.
2. 27 meta-features and the performance data of 18 regression algorithms suitable for material property prediction are constructed as metadata to enhance meta-learning.
3. A collaboration mechanism combines the two types of meta-recommendation results based on different metadata, avoiding meta-overfitting and improving the accuracy of meta-learning recommendation.
4. The material category multi-way tree, which visualizes domain knowledge, is quantified and embedded into the collaborative recommendation process, making the regression algorithm recommendations and prediction results more accurate and reliable.
The above-described embodiments are only intended to illustrate the preferred embodiments of the present invention, and not to limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims (9)

1. An automatic prediction system for material properties, comprising:
the data acquisition module is used for acquiring a material data set which can be used for a regression task and a general data set in other fields as input data of the prediction system;
the data preprocessing module is connected with the data acquisition module and is used for preprocessing the material data set and the general data set;
the domain knowledge base building module is connected with the data preprocessing module and used for building a material class tree and carrying out quantitative representation to build a domain knowledge matrix by acquiring domain knowledge;
the metadata base building module is connected with the data preprocessing module and is used for respectively carrying out meta-feature calculation and algorithm performance evaluation on the material data set and the general data set to obtain a meta-feature matrix and a performance matrix of the material data set and the general data set so as to build the material metadata and the general metadata;
the algorithm recommendation module is respectively connected with the domain knowledge base construction module and the metadata base construction module and is used for collaborating the algorithm recommendation result based on the material metadata and the algorithm recommendation result based on the general metadata by adopting a collaborative recommendation mechanism of domain knowledge embedding to obtain an optimal recommendation algorithm; and predicting the target property of the material based on an optimal recommendation algorithm.
2. The system of claim 1,
the data information of the material data set at least comprises a data source, a sample number, a characteristic dimension, a target attribute and a material category;
the data information of the universal data set includes at least a data source, a number of sample pieces, a feature dimension, and a target attribute.
3. The system of claim 1,
the data preprocessing module comprises a first preprocessing unit and a second preprocessing unit;
the first preprocessing unit is used for tracing the original information of the condition attribute and the target performance of the data set to obtain a uniform data format;
and the second preprocessing unit is used for carrying out missing value processing, categorical data processing and standard normalization on the data in the unified data format to obtain the data meeting the machine learning modeling requirement.
4. The system of claim 1,
the metadata base building module comprises a meta-feature calculation unit and an algorithm performance evaluation unit;
the meta-feature calculation unit is used for calculating the meta-features of the material data set and the general data set to obtain a meta-feature matrix of the material data set and the general data set;
the algorithm performance evaluation unit is used for obtaining a mapping relation between input and output of a data set through a regression algorithm modeling rule, predicting target performance according to the mapping relation and obtaining a performance matrix.
5. The system of claim 4,
the meta-features comprise traditional meta-features and enhanced meta-features;
the traditional meta-feature is extracted based on the condition attribute of the data set, and comprises simple meta-feature, statistical meta-feature for describing the data distribution condition and meta-feature based on principal component;
the enhanced meta-features comprise meta-features based on a machine learning model, statistical meta-features describing target attributes and meta-features describing uncertainty of the target attributes;
the meta-features based on the machine learning model are obtained by extracting model performance measurement through a machine learning algorithm;
and the statistical meta-feature of the target attribute and the meta-feature describing uncertainty of the target attribute are obtained by extracting according to the target performance of the data set.
6. The system of claim 5, wherein,
the meta-feature describing the uncertainty of the target attribute is used for measuring the uncertainty of the target attribute data and performing conceptualized numerical representation;
the numerical representation processes target attribute data carrying uncertain concepts by means of a Gaussian distribution triple comprising the expectation, the entropy and the hyper-entropy;
the expectation is the most representative data representation in a concept; the entropy is used to represent a granularity scale of the concept; the hyper-entropy is used to describe the uncertainty of the concept granularity.
7. The system of claim 1,
the domain knowledge base building module comprises a domain knowledge acquisition unit and a domain knowledge representation unit;
the domain knowledge acquisition unit is used for constructing a material category tree of a multi-branch tree structure to visualize material domain knowledge according to the target attribute of the material data set; classifying the material data set step by step according to the material category tree to obtain a classification result;
the domain knowledge representation unit is used for obtaining the material types according to the material type tree, carrying out quantitative representation and constructing a domain knowledge matrix.
8. The system of claim 7,
the classification result comprises a metal material, an inorganic non-metal material, a high polymer material and a composite material;
the metal material comprises ferrous metal and nonferrous metal;
the inorganic non-metallic materials comprise ceramics, cement, refractory materials and glass;
the high polymer material comprises plastics, fibers, paint, organic solvent, organic micromolecules, biofuel compounds and rubber materials;
the composite material comprises a metal-based material, a ceramic-based material, a polymer-based material and a carbon-carbon composite material.
9. The system of claim 1,
the algorithm recommending module comprises a first algorithm recommending unit, a second algorithm recommending unit and a collaborative recommending and sequencing unit;
the first algorithm recommending unit is used for quantifying and embedding knowledge in different fields according to metadata of the material data set so as to guide a meta-learning recommending process and obtain a first recommending result;
the second algorithm recommending unit is used for directly recommending meta-learning according to the meta-data of the universal data set to obtain a second recommending result;
the collaborative recommendation sorting unit is used for performing collaborative sorting calculation according to the first recommendation result and the second recommendation result and obtaining an optimal recommendation algorithm according to ranking; and predicting the target property of the material based on an optimal recommendation algorithm.
CN202210735361.5A 2022-06-27 2022-06-27 Material performance automatic prediction system Pending CN115148307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210735361.5A CN115148307A (en) 2022-06-27 2022-06-27 Material performance automatic prediction system


Publications (1)

Publication Number Publication Date
CN115148307A true CN115148307A (en) 2022-10-04

Family

ID=83409237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210735361.5A Pending CN115148307A (en) 2022-06-27 2022-06-27 Material performance automatic prediction system

Country Status (1)

Country Link
CN (1) CN115148307A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713987A (en) * 2022-11-17 2023-02-24 广州瑞博新材料技术研究有限公司 Polycaprolactone test data analysis method and system
CN115713987B (en) * 2022-11-17 2023-06-13 广州瑞博新材料技术研究有限公司 Polycaprolactone test data analysis method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination