CN113674807A

CN113674807A - Molecular screening method based on deep learning technology qualitative and quantitative model

Info

Publication number: CN113674807A
Application number: CN202110914312.3A
Authority: CN
Inventors: 王建浦; 朱琳; 章亮
Original assignee: Nanjing Tech University
Current assignee: Nanjing Tech University
Priority date: 2021-08-10
Filing date: 2021-08-10
Publication date: 2021-11-19

Abstract

The invention discloses a molecular screening method based on a deep-learning qualitative and quantitative model, the Enhanced Molecular Information Model (EMIM). Collected molecules are converted into SMILES, and the SMILES are then converted into qualitative and quantitative descriptors as preprocessing. Feature engineering is applied to the descriptors: hash mapping for the qualitative descriptors, and a variance threshold and Pearson correlation coefficient for the quantitative descriptors. A qualitative and quantitative model is set up according to the feature-engineered input data, with a Sigmoid function as the final output; the model parameters are optimized by a back-propagation optimization algorithm, performance is evaluated, and training is iterated to obtain a model with high validation accuracy. The molecules to be predicted are preprocessed and feature-engineered in the same way and input to the qualitative and quantitative model (EMIM) with the highest validation accuracy for prediction screening, yielding their prediction results. The invention achieves more efficient and accurate screening of new molecules and overcomes the limitations of traditional screening.

Description

Molecular screening method based on deep learning technology qualitative and quantitative model

Technical Field

The invention relates to the intersection of deep learning with physics, chemistry and materials science, and in particular to a qualitative and quantitative model built on a deep-learning neural network framework that can be used for screening molecules.

Background

Molecular descriptors characterize the structural sub-fragments or physicochemical properties of a molecule, and can be divided into qualitative descriptors and quantitative descriptors. Qualitative descriptors generally refer to molecular fingerprints, which convert chemical molecules into bit strings consisting only of 0s and 1s. Quantitative descriptors describe molecular attributes based on molecular composition (for example, the number of hydrogen bond donors and the number of benzene rings), physicochemical properties (for example, topological polar surface area and the octanol-water partition coefficient) and experimental data (for example, ultraviolet spectra and solvent ratios); they are numerical features of a molecule's chemical information.

Technologies such as machine learning and deep learning can mine data from a molecular data set and model the linear and nonlinear relationships between molecular chemical-information features and target properties, so that new molecules can be predicted and screened accurately and efficiently and experimental design can be guided. At present, data sets are built from a single quantitative descriptor or a single qualitative descriptor before machine learning and deep learning models are trained, which has limitations. For example, a molecular fingerprint alone lacks physicochemical properties, while molecular attributes alone lack information such as molecular structure. The application of machine learning and deep learning to screening new molecules is therefore limited to a certain extent.

Disclosure of Invention

The technical problem to be solved by the invention is to establish a molecular screening method based on an Enhanced Molecular Information Model (EMIM), in which new molecules are input to the EMIM for prediction screening after data preprocessing and feature engineering, so that new molecules are screened efficiently and accurately.

The technical scheme of the invention for achieving this aim is as follows:

A molecular screening method based on a deep-learning qualitative and quantitative model (EMIM), comprising the following steps: a molecular data set is constructed, preprocessed and feature-engineered in sequence and then input to the qualitative and quantitative model for prediction screening, yielding a screening result for new molecules.

Preferably, the construction and preprocessing of the molecular data set consists of converting the collected molecular structural formulas into SMILES, and converting the SMILES into a qualitative descriptor and a quantitative descriptor, respectively.

The qualitative descriptor is a molecular fingerprint, and the quantitative descriptor is a set of molecular attributes. Preferably, the molecular fingerprint is a MACCS fingerprint, an FP2 fingerprint or an ECFP fingerprint.

Preferably, the feature engineering consists of applying hash mapping to the qualitative descriptors, applying a variance threshold and a Pearson correlation coefficient matrix algorithm to the quantitative descriptors, and dividing the qualitative and quantitative descriptor data sets into a training set and a test set at a ratio of 9:1.
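The 9:1 division described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_train_test`, the fixed random seed and the choice to round the test size up are assumptions, not details given in the patent. The total of 135 samples is the sum of the 121 training and 14 test samples reported in Example 1.

```python
import math
import random

def split_train_test(samples, test_fraction=0.1, seed=0):
    """Shuffle the samples and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Rounding the test size up is an assumption; for 135 samples it gives 14.
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(range(135))
print(len(train), len(test))  # 121 14
```

With rounding up, a 10% test fraction of 135 samples gives the 121/14 division stated in Example 1; rounding down would give 122/13 instead.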

The qualitative and quantitative model is constructed and trained as follows: the model is set up according to the feature-engineered input data; the first layer is an input layer whose input data is the qualitative and quantitative descriptors, the middle layers are dense layers, and the last layer is an output layer that uses a Sigmoid function as the final output; the model parameters are optimized by a back-propagation optimization algorithm, performance is evaluated, and training is iterated to obtain a model with high validation accuracy.
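As a rough illustration of the layer sequence just described, the sketch below runs one forward pass in pure Python. It is a simplified stand-in, not the patented architecture: the weights are random rather than trained, the fingerprint length L = 16 is a toy value, the helper names are hypothetical, and the layer widths 12, 4 and 2 are borrowed from Example 1 and the preferred parameters.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    """Small random weights; a stand-in for trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(fingerprint_bits, attributes, layers):
    # Joint input: qualitative bits followed by quantitative attributes.
    x = [float(b) for b in fingerprint_bits] + list(attributes)
    for (weights, biases), activation in layers:
        x = dense(x, weights, biases, activation)
    return x

rng = random.Random(0)
L, n_attr = 16, 65                      # toy fingerprint length, 65 attributes
sizes = [L + n_attr, 12, 4, 2]          # input, two dense layers, output
activations = [relu, relu, sigmoid]
layers = [(random_layer(sizes[i], sizes[i + 1], rng), activations[i])
          for i in range(3)]

bits = [rng.randint(0, 1) for _ in range(L)]
attrs = [rng.random() for _ in range(n_attr)]
output = forward(bits, attrs, layers)
print(output)                           # two values, each between 0 and 1
```

The point of the sketch is the shape of the data flow: a vector of length L+65 enters, ReLU dense layers compress it, and the Sigmoid output keeps every score strictly between 0 and 1.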

The prediction screening specifically comprises the following steps:

(1) based on the constructed and trained qualitative and quantitative model, the model with the highest validation accuracy is retained after multiple rounds of iterative training; N new molecules that do not appear in the training set (N is usually 1-20) are then randomly selected and converted into SMILES, and the SMILES are converted into qualitative and quantitative descriptors, with preprocessing and feature engineering steps identical to those used for training the model;

(2) the N new molecules are predicted using the qualitative and quantitative model with the highest validation accuracy, which takes the ECFP fingerprint and the molecular attributes as joint input, to obtain the prediction results for the molecules to be predicted.

The molecules used in the data set are preferably luminescent-layer materials and additive molecules for perovskite light-emitting and photovoltaic devices.

The molecular screening method based on the deep learning technology qualitative and quantitative model specifically comprises the following steps:

Step a: the collected molecular structural formulas are first converted into SMILES, and the SMILES are then converted into qualitative descriptors and quantitative descriptors, respectively;

Step b: for the converted qualitative descriptor, a hash algorithm maps each numeric code to its position in the fingerprint; for the converted quantitative descriptor, a variance threshold and a Pearson correlation coefficient matrix algorithm are applied to the 208 attribute features, 143 redundant attributes are deleted, and 65 attributes are retained for training the model;

Step c: a neural-network-based qualitative and quantitative model is set up according to the qualitative and quantitative descriptor input data; the model has a five-layer network architecture in which the first layer is an input layer whose input data is the molecular fingerprint and molecular attributes, the middle layers are dense layers, and the last layer is an output layer that uses a Sigmoid function as the final output; the molecular fingerprint and molecular attribute data are input to the model for iterative training with back-propagation using the Adam optimizer, parameters such as the learning rate and the number of neurons in each layer are set through performance evaluation, and the qualitative and quantitative model with the highest validation accuracy is finally obtained;

Step d: new molecules of the material are selected, subjected to the feature engineering of step b, and input to the neural network model obtained in step c for prediction screening, yielding the screening result for the new molecules.

In the above molecular screening method, preferably in step a:

(1) the collected molecules are converted via SMILES into the qualitative descriptors, namely MACCS fingerprints, FP2 fingerprints and ECFP fingerprints;

(2) the collected molecules are converted via SMILES into quantitative descriptors comprising 208 attribute features.

In the above molecular screening method, preferably in step b:

(1) for the converted qualitative descriptor, a hash algorithm is applied to the molecular fingerprint index data to map each numeric code to the position in the molecular fingerprint where that code is located;

(2) for the converted quantitative descriptor, variance-threshold selection is applied to the 208 molecular attributes, deleting 107 attribute features and leaving 101; a further 36 attributes are deleted using a Pearson correlation coefficient matrix algorithm, finally leaving 65 attributes for training the model;

(3) the labeled qualitative and quantitative descriptor data sets are divided into a training set and a test set at a ratio of 9:1.
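Item (1) above, the mapping of fingerprint index codes to bit positions, can be pictured with the following sketch. The patent does not specify the hash function, so folding by modulo is an assumption, as are the function name `indices_to_bits`, the 1024-bit length and the example codes.

```python
def indices_to_bits(index_codes, n_bits):
    """Fold integer fingerprint codes into a fixed-length list of 0s and 1s."""
    bits = [0] * n_bits
    for code in index_codes:
        bits[code % n_bits] = 1   # modulo acts as the hash; collisions share a bit
    return bits

codes = [3, 17, 1029, 2051]       # made-up index data for one molecule
bits = indices_to_bits(codes, 1024)
on_positions = [i for i, b in enumerate(bits) if b]
print(on_positions)               # [3, 5, 17]
```

Here codes 3 and 2051 collide on bit 3, which is the usual trade-off of a fixed-length hashed fingerprint: the vector length stays constant for every molecule at the cost of occasional shared bits.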

In the above molecular screening method, preferably in step c:

(1) the number of neurons in the input layer is determined by the input molecular fingerprint and molecular attributes and is generally L+65, where L is the bit length of the molecular fingerprint and 65 is the number of input attributes; the number of neurons in each middle dense layer is set to 4-32, the number of neurons in the final output layer is set to 2, and a Sigmoid function is used for the output;

(2) the Adam algorithm is selected as the optimizer, and the optimizer learning rate is set to 0.001-0.1;

(3) the validation accuracy is selected as the performance metric for evaluating the quality of the network model during training;

(4) the training batch size is set to 5-15, the number of training rounds is set to 100-200, and the early-termination patience is set to 5-10 rounds;

(5) in each training round, the training set is divided into several parts according to the batch size and input to the network model for training, and the model weights are updated using the optimizer and the loss function; in each round the test set data is also input to the qualitative and quantitative model to obtain the validation accuracy and loss value, which guide training and help prevent over-fitting or under-fitting.
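The control flow of items (4) and (5), mini-batches within each round and early termination on the validation loss, might look like the sketch below. It is framework-free and hypothetical: `run_batch` and `validate` are placeholder callbacks for a real optimizer step and a real evaluation, and the scripted loss curve only exercises the stopping logic.

```python
def batches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

def train(train_set, run_batch, validate, batch_size=8, max_epochs=200, patience=10):
    """Train until the validation loss has not improved for `patience` rounds."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        for batch in batches(train_set, batch_size):
            run_batch(batch)                 # one optimizer step per mini-batch
        val_loss = validate()                # evaluate on held-out data each round
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break                            # early termination
    return best_epoch, epoch

# Toy usage: a scripted validation-loss curve stands in for a real model.
curve = iter([0.9, 0.7, 0.6, 0.65, 0.64, 0.66, 0.7] + [0.8] * 200)
steps = []
best, stopped = train(list(range(121)), steps.append, lambda: next(curve), patience=3)
print(best, stopped, len(steps) // stopped)  # 3 6 16
```

In the toy run the loss is lowest in round 3, three rounds pass without improvement, and training stops in round 6 after 16 mini-batches per round. A real implementation would also keep a copy of the weights from the best round.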

In the above molecular screening method, preferably in step d:

(1) new molecules that do not appear in the training set are selected, processed in the same way as the training set data, and input to the model;

(2) the new molecules are screened by prediction using the qualitative and quantitative model with the highest validation accuracy to obtain a screening result.
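The two items of step d amount to reusing the training-time featurization and then reading off the model output for each new molecule. The sketch below shows that shape only; `screen`, `featurize` and `predict` are hypothetical names, the stand-in callbacks are toys, and interpreting the two Sigmoid outputs by taking the larger one as the predicted class is an assumption not stated in the patent.

```python
def screen(candidates, featurize, predict):
    """Return {name: predicted class index} for each candidate molecule."""
    results = {}
    for name, smiles in candidates.items():
        scores = predict(featurize(smiles))          # two Sigmoid outputs
        results[name] = max(range(len(scores)), key=scores.__getitem__)
    return results

# Toy stand-ins: a real pipeline would build descriptors from the SMILES
# and call the trained qualitative and quantitative model.
def toy_featurize(smiles):
    return [len(smiles)]

def toy_predict(features):
    return [0.2, 0.8] if features[0] > 5 else [0.7, 0.3]

new_molecules = {"ethanol": "CCO", "phenol": "c1ccccc1O"}
screened = screen(new_molecules, toy_featurize, toy_predict)
print(screened)  # {'ethanol': 0, 'phenol': 1}
```

Passing the featurizer in as an argument makes it harder to accidentally preprocess new molecules differently from the training set, which is the requirement stated in item (1).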

In summary, the invention converts collected molecules into SMILES and then converts the SMILES into qualitative and quantitative descriptors as preprocessing; applies feature engineering to the descriptors, using hash mapping for the qualitative descriptors and a variance threshold and Pearson correlation coefficient for the quantitative descriptors; sets up a qualitative and quantitative model according to the feature-engineered input data, in which the first layer is an input layer, the middle layers are dense layers, and the last layer is an output layer that uses a Sigmoid function as the final output; optimizes the model parameters by a back-propagation optimization algorithm, evaluates performance, and iterates training to obtain a model with high validation accuracy; and preprocesses and feature-engineers the molecules to be predicted before inputting them to the qualitative and quantitative model (EMIM) with the highest validation accuracy for prediction screening, yielding their prediction results. By inputting the qualitative descriptor and the quantitative descriptor simultaneously for joint learning, the invention overcomes limitations such as the insufficient chemical information available when a single qualitative descriptor or a single quantitative descriptor is input, and achieves efficient screening of molecules.

Compared with the prior art, the invention has the following advantages:

(1) By inputting the qualitative descriptor and the quantitative descriptor simultaneously, the invention effectively solves the problems that physicochemical properties are missing when only the qualitative descriptor (molecular fingerprint) is input, and that molecular structure fragments are missing when only the quantitative descriptor (molecular attributes) is input. At the same time, the strong information-extraction capability of the deep-learning neural network framework can comprehensively mine the structural fragment information carried by molecular fingerprints and the physicochemical features provided by molecular attributes, achieving a highest validation accuracy of 92.86%. When the model is deployed for screening, all 8 new molecules of the material are screened correctly. Compared with the highest validation accuracy of 82.14% for ECFP fingerprints in a DNN model, the qualitative and quantitative model achieves more efficient and accurate screening of new molecules and overcomes the limitations of traditional screening.

(2) The qualitative and quantitative model established by the invention takes a qualitative descriptor and a quantitative descriptor as simultaneous input, and can achieve high-accuracy molecular prediction from a small-sample data set.

Drawings

FIG. 1 is an algorithmic flow chart of a molecular screening method;

FIG. 2 shows the hash mapping of the qualitative descriptor molecular fingerprint index data;

FIG. 3 is a heat map of the Pearson correlation coefficient matrix for the quantitative descriptor molecular attributes;

FIG. 4 is a schematic diagram of the qualitative and quantitative model;

FIG. 5 shows the learning curves of accuracy and loss value for the qualitative and quantitative model with the highest validation accuracy; a shows the training accuracy and validation accuracy curves, and b shows the training loss and validation loss curves;

FIG. 6 shows the prediction of new molecules by the qualitative and quantitative model with the highest validation accuracy;

FIG. 7 compares the validation accuracy of the qualitative and quantitative model with the ECFP fingerprint and 65 attributes as input against that of the DNN model with the ECFP fingerprint as input;

FIG. 8 compares the validation accuracy of the qualitative and quantitative model with the FP2 fingerprint and 65 attributes as input against that of the DNN model with the FP2 fingerprint as input;

FIG. 9 compares the validation accuracy of the qualitative and quantitative model with the MACCS fingerprint and 65 attributes as input against that of the DNN model with the MACCS fingerprint as input;

FIG. 10 compares the validation accuracy of the qualitative and quantitative model with the ECFP fingerprint and 65 attributes as input against that of the DNN model with the 65 attributes as input;

FIG. 11 compares the validation accuracy of the qualitative and quantitative model with the MACCS fingerprint and 65 attributes as input against that of the DNN model with the 65 attributes as input;

FIG. 12 compares the validation accuracy of the qualitative and quantitative model with the FP2 fingerprint and 65 attributes as input against that of the DNN model with the 65 attributes as input.

Detailed Description

The present invention will be described in detail with reference to specific examples.

Example 1

A molecular screening method based on a deep-learning qualitative and quantitative model is implemented by the flow shown in FIG. 1. The method comprises: constructing and preprocessing the data set, feature engineering of the data set, constructing and training the qualitative and quantitative model, and deploying the model for prediction screening.

1. The construction and preprocessing of the data set specifically comprises the following steps:

(1) The molecular structures are converted to SMILES. First, the molecules used to train the model and their structural formulas are obtained. Following the SMILES conversion rules developed by David Weininger in 1988, and taking the additives in perovskite FAPbI₃ light-emitting diodes as an example, the structural formula of each additive molecule is converted into SMILES.

(2) The SMILES are converted into qualitative and quantitative descriptors, respectively. Depending on the algorithm of each molecular fingerprint, the fingerprint algorithms packaged in two cheminformatics toolkits available in Python, RDKit and Open Babel, are called with SMILES as input to generate the MACCS, FP2 and ECFP molecular fingerprints. The SMILES are then input to RDKit to generate 208 attributes for each molecule, and the qualitative and quantitative descriptor data set is constructed.

2. The feature engineering of the data set specifically comprises the following steps:

(1) The qualitative descriptor molecular fingerprint index data is hash-mapped. The molecular fingerprint index data obtained is a series of numeric codes, and a hash algorithm maps each code to the position in the molecular fingerprint where that code is located; the mapping process is shown in FIG. 2;

(2) A variance threshold and a Pearson correlation coefficient matrix algorithm are applied to the quantitative descriptor molecular attributes. Attributes with low variance carry little chemical information, so variance-threshold selection is applied to the 208 attributes of the quantitative descriptor, deleting 107 attributes and retaining 101. Strongly correlated attributes are redundant, and deleting them effectively reduces information redundancy, so a further 36 attributes are deleted using the Pearson correlation coefficient matrix algorithm, finally leaving 65 attributes for training the model. The Pearson correlation coefficient matrix is shown in FIG. 3;

(3) The qualitative and quantitative descriptor data set is divided at a training set to test set ratio of 9:1; that is, 90% of the data is used for training and 10% for testing. Each data set has 121 training samples and 14 test samples.
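The two-stage attribute selection in item (2) can be sketched with the standard library alone. The patent does not give the variance threshold, the correlation cut-off or the rule for choosing which of two correlated attributes to drop, so the values 0.0 and 0.9, the greedy keep-first rule, and the attribute names and numbers below are all assumptions for illustration.

```python
import math
import statistics

def variance_filter(columns, threshold):
    """Keep attributes whose population variance exceeds the threshold."""
    return {name: vals for name, vals in columns.items()
            if statistics.pvariance(vals) > threshold}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm

def correlation_filter(columns, max_abs_r):
    """Greedily drop any attribute strongly correlated with one already kept."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) < max_abs_r for other in kept.values()):
            kept[name] = vals
    return kept

# Toy data: four attributes for five molecules (names and values are made up).
data = {
    "mol_weight":  [100.0, 150.0, 200.0, 250.0, 300.0],
    "heavy_atoms": [7.0, 10.0, 14.0, 17.0, 21.0],   # nearly proportional to mol_weight
    "h_donors":    [2.0, 0.0, 3.0, 1.0, 0.0],
    "constant":    [1.0, 1.0, 1.0, 1.0, 1.0],       # zero variance
}
step1 = variance_filter(data, threshold=0.0)
step2 = correlation_filter(step1, max_abs_r=0.9)
print(list(step2))  # ['mol_weight', 'h_donors']
```

The variance filter must run first: a constant column has zero variance, which would make the Pearson denominator zero in the second stage.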

3. The construction and training of the qualitative and quantitative model specifically comprises the following steps:

(1) When the qualitative and quantitative model is constructed, the qualitative descriptor (molecular fingerprint) and the quantitative descriptor (molecular attributes) are selected as joint input. A five-layer network structure is set in the model, as shown in FIG. 4. The input layer has L+65 neurons (L is the bit length of the molecular fingerprint, and 65 is the number of input attributes). The second and third layers are dense layers with 12 and 4 neurons respectively, and their activation function is the Rectified Linear Unit (ReLU); the dense layers map the features extracted from the molecular fingerprint and the molecular attributes to the output space through a nonlinear transformation. The fourth layer merges the output of the third layer using a Concatenate function, and the final layer produces the output using a Sigmoid function. The mean square error function is selected as the loss function and the Adam algorithm as the optimizer, with the optimizer learning rate set to 0.001; the training batch size and the number of training rounds are set to 8 and 200, respectively. To obtain a model with high validation accuracy and the best generalization ability, an early-termination function that monitors the validation loss is set in the qualitative and quantitative model, with the early-termination patience set to 10 rounds.

(2) When the qualitative and quantitative model is trained, the training set is divided into 16 parts according to the batch size and input to the model for training; the optimizer and the loss function are used to update the model weights, and in each training round the test set data is input to the model to obtain its loss value and accuracy, which guide training and help prevent over-fitting or under-fitting. The learning curves of the qualitative and quantitative model with the highest validation accuracy are shown in FIG. 5; the validation accuracy reaches 92.86%. The right side of FIG. 5 shows the relationship between the number of training rounds and the loss value, where "Training loss" is the training loss and "Validation loss" is the validation loss. Because the early-termination function monitors the validation loss, training stopped after a total of 41 rounds.
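The figures quoted in this step can be cross-checked with a little arithmetic. The sketch below assumes that the "16 parts" are mini-batches of 8 drawn from the 121 training samples, and that the reported accuracy is the fraction of the 14 test samples classified correctly; under that second assumption, 92.86% would correspond to 13 of 14.

```python
import math

n_train, n_test, batch_size = 121, 14, 8

# 121 samples in batches of 8: fifteen full batches plus one with a single sample.
batches_per_round = math.ceil(n_train / batch_size)
print(batches_per_round)               # 16

# 13 correct out of 14, expressed as a percentage.
accuracy_percent = round(100 * 13 / n_test, 2)
print(accuracy_percent)                # 92.86
```

With only 14 test samples, each additional correct prediction moves the accuracy by about 7.14 percentage points, which is worth keeping in mind when comparing the accuracies reported in the examples.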

4. The deployment of the model for prediction screening specifically comprises the following steps:

(1) Based on the construction and training of the qualitative and quantitative model in step 3, the model with the highest validation accuracy is retained after repeated iterative training. Then 8 new molecules that do not appear in the training set are randomly selected and converted into SMILES, and the SMILES are converted into qualitative and quantitative descriptors, with preprocessing and feature engineering steps identical to those used for training the model.

(2) The 8 new molecules are predicted using the qualitative and quantitative model with the highest validation accuracy, which takes the ECFP fingerprint and the molecular attributes as joint input; all predictions are correct. The prediction results are shown in FIG. 6.

Example 2

When the qualitative and quantitative model is constructed, the FP2 fingerprint and the 65 attributes are selected as joint input, and a five-layer network is set in the model as shown in FIG. 4; for the specific construction and training of the model, refer to Example 1.

In this case the validation accuracy of the qualitative and quantitative model reaches 85.71%; compared with the 75.00% validation accuracy reached by the DNN model with only the FP2 fingerprint as input, this is an improvement of 10.71 percentage points. The comparison is shown in FIG. 8.

Example 3

When the qualitative and quantitative model is constructed, the MACCS fingerprint and the 65 attributes are selected as joint input, and a five-layer network is set in the model as shown in FIG. 4; for the specific construction and training of the model, refer to Example 1.

In this case the validation accuracy of the qualitative and quantitative model reaches 85.71%; compared with the 71.43% validation accuracy reached by the DNN model with only the MACCS fingerprint as input, this is an improvement of 14.28 percentage points. The comparison is shown in FIG. 9.

Example 4

When the qualitative and quantitative model is constructed, the ECFP fingerprint and the 65 attributes are selected as joint input, and a five-layer network is set in the model as shown in FIG. 4; for the specific construction and training of the model, refer to Example 1.

In this case the validation accuracy of the qualitative and quantitative model reaches 92.86%; compared with the 78.57% validation accuracy reached by the DNN model with only the 65 attributes as input, this is an improvement of 14.29 percentage points. The comparison is shown in FIG. 10.

Example 5

When the qualitative and quantitative model is constructed, the FP2 fingerprint and the 65 attributes are selected as joint input, and a five-layer network is set in the model as shown in FIG. 4; for the specific construction and training of the model, refer to Example 1.

In this case the validation accuracy of the qualitative and quantitative model reaches 85.71%; compared with the 78.57% validation accuracy reached by the DNN model with only the 65 attributes as input, this is an improvement of 7.14 percentage points. The comparison is shown in FIG. 11.

Example 6

When the qualitative and quantitative model is constructed, the MACCS fingerprint and the 65 attributes are selected as joint input, and a five-layer network is set in the model as shown in FIG. 4; for the specific construction and training of the model, refer to Example 1.

In this case the validation accuracy of the qualitative and quantitative model reaches 85.71%; compared with the 78.57% validation accuracy reached by the DNN model with only the 65 attributes as input, this is an improvement of 7.14 percentage points. The comparison is shown in FIG. 12.
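The accuracy gains reported in Examples 2 to 6 are differences between two percentages, so they are percentage-point gains. The short sketch below recomputes them from the figures quoted in the examples; the dictionary layout and labels are only an illustrative way of organizing those numbers.

```python
# (joint-input accuracy, single-input DNN accuracy), in percent, from Examples 2-6.
results = {
    "FP2 + attributes vs FP2 only": (85.71, 75.00),
    "MACCS + attributes vs MACCS only": (85.71, 71.43),
    "ECFP + attributes vs attributes only": (92.86, 78.57),
    "FP2 + attributes vs attributes only": (85.71, 78.57),
    "MACCS + attributes vs attributes only": (85.71, 78.57),
}

gains = {name: round(joint - single, 2) for name, (joint, single) in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

The recomputed gains are 10.71, 14.28, 14.29, 7.14 and 7.14 percentage points, matching the values stated in the examples.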

It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims

1. A molecular screening method based on a deep-learning qualitative and quantitative model, characterized by comprising the following steps: a molecular data set is constructed, preprocessed and feature-engineered in sequence and then input to the qualitative and quantitative model for prediction screening, yielding a screening result for new molecules.

2. The method of claim 1, wherein constructing and preprocessing the data set comprises converting the collected molecular structural formulas into SMILES, and converting the SMILES into a qualitative descriptor and a quantitative descriptor, respectively.

3. The molecular screening method based on a deep-learning qualitative and quantitative model according to claim 2, wherein the qualitative descriptor is a molecular fingerprint and the quantitative descriptor is a set of molecular attributes.

4. The molecular screening method based on a deep-learning qualitative and quantitative model according to claim 3, wherein the molecular fingerprint is a MACCS fingerprint, an FP2 fingerprint, an ECFP fingerprint, etc.

5. The molecular screening method based on a deep-learning qualitative and quantitative model according to claim 1, wherein the feature engineering consists of applying hash mapping to the qualitative descriptors, applying a variance threshold and a Pearson correlation coefficient matrix algorithm to the quantitative descriptors, and dividing the qualitative and quantitative descriptor data sets into a training set and a test set at a ratio of 9:1.

6. The molecular screening method based on a deep-learning qualitative and quantitative model according to claim 1, wherein the qualitative and quantitative model is constructed and trained as follows: the model is set up according to the feature-engineered input data; the first layer is an input layer whose input data is the qualitative and quantitative descriptors, the middle layers are dense layers, and the last layer is an output layer that uses a Sigmoid function as the final output; the model parameters are optimized by a back-propagation optimization algorithm, performance is evaluated, and training is iterated to obtain a model with high validation accuracy.

7. The molecular screening method based on deep learning qualitative and quantitative model according to claim 1, characterized in that the predictive screening specifically comprises:

8. The method of claim 1, wherein the data set comprises molecules selected from the group consisting of perovskite luminescent materials, additive molecules, and combinations thereof.

9. The molecular screening method based on deep learning qualitative and quantitative model according to claim 1, characterized in that the method specifically comprises the following steps:

Step b: for the converted qualitative descriptor, a hash algorithm maps each numeric code to its position in the fingerprint; a variance threshold and a Pearson correlation coefficient matrix algorithm are applied to the 208 converted attribute features, 143 redundant attributes are deleted, and 65 attributes are retained for training the model;