CN113674807A - Molecular screening method based on deep learning technology qualitative and quantitative model - Google Patents
- Publication number
- CN113674807A (CN202110914312.3A)
- Authority
- CN
- China
- Prior art keywords
- qualitative
- quantitative
- model
- molecular
- quantitative model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C10/00—Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/90—Programming languages; Computing architectures; Database systems; Data warehousing
Abstract
The invention discloses a molecular screening method based on a deep learning qualitative and quantitative model, the Enhanced Molecular Information Model (EMIM). Collected molecules are converted into SMILES, and the SMILES are then converted into qualitative and quantitative descriptors as preprocessing. Feature engineering is applied to the descriptors: hash mapping for the qualitative descriptors, and a variance threshold and a Pearson correlation coefficient matrix for the quantitative descriptors. A qualitative and quantitative model is set up according to the feature-engineered input data, with a Sigmoid function as the final output; the model parameters are optimized by a back-propagation optimization algorithm, performance is evaluated, and training is iterated to obtain a model with high validation accuracy. The molecules to be predicted are preprocessed and feature-engineered in the same way and then input into the qualitative and quantitative model (EMIM) with the highest validation accuracy for predictive screening, giving their prediction results. The invention enables more efficient and accurate screening of new molecules and overcomes the limitations of traditional screening.
Description
Technical Field
The invention relates to the interdisciplinary field of deep learning with physics, chemistry and materials science, and in particular to a qualitative and quantitative model, built with a deep learning neural network framework, that can be used for screening molecules.
Background
Molecular descriptors characterize the structural sub-fragments or the physicochemical properties of a molecule and can be divided into qualitative descriptors and quantitative descriptors. Qualitative descriptors generally refer to molecular fingerprints, which convert a chemical molecule into a bit string consisting only of 0s and 1s. Quantitative descriptors describe molecular attributes based on molecular composition (for example, the number of hydrogen bond donors and the number of benzene rings), physicochemical properties (for example, the topological polar surface area and the octanol-water partition coefficient) and experimental data (for example, the ultraviolet spectrum and the solvent ratio); they are numerical features of a molecule's chemical information.
Technologies such as machine learning and deep learning can mine data from a molecular data set and model the linear and nonlinear relationships between molecular chemical features and target properties, so that new molecules can be predicted and screened accurately and efficiently and experimental design can be guided. At present, data sets are built from a single quantitative descriptor or a single qualitative descriptor before machine learning and deep learning models are trained, which has limitations. For example, a molecular fingerprint used alone lacks physicochemical properties, while molecular attributes used alone lack information such as the molecular structure. The application of machine learning and deep learning to screening new molecules is therefore limited to a certain extent.
Disclosure of Invention
The technical problem to be solved by the invention is to establish a molecular screening method based on an Enhanced Molecular Information Model (EMIM), in which new molecules are input into the EMIM for predictive screening after data preprocessing and feature engineering, so that new molecules are screened efficiently and accurately.
The technical scheme of the invention for achieving this aim is as follows:
A molecular screening method based on a deep learning qualitative and quantitative model (EMIM) comprises the following steps: a molecular data set is constructed, preprocessed and subjected to feature engineering in sequence, and is then input into the qualitative and quantitative model for predictive screening to obtain a screening result for new molecules.
Preferably, constructing and preprocessing the molecular data set means converting the collected molecular structural formulas into SMILES and then converting the SMILES into qualitative descriptors and quantitative descriptors, respectively.
The qualitative descriptor is a molecular fingerprint, and the quantitative descriptor is a set of molecular attributes. Preferably, the molecular fingerprint is a MACCS fingerprint, an FP2 fingerprint or an ECFP fingerprint.
Preferably, the feature engineering applies hash mapping to the qualitative descriptors and a variance threshold and a Pearson correlation coefficient matrix algorithm to the quantitative descriptors, and divides the qualitative and quantitative descriptor data sets into a training set and a test set at a ratio of 9:1.
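The 9:1 division can be illustrated with a minimal sketch. The function name `split_train_test`, the shuffling step and the random seed are assumptions made for illustration; the text does not say how samples are assigned to each set. With 135 labelled molecules, this split reproduces the 121 training and 14 test samples reported in Example 1.

```python
import random

def split_train_test(samples, train_fraction=0.9, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:n_train], shuffled[n_train:]

# 135 placeholder molecule IDs stand in for the labelled data set.
train, test = split_train_test(range(135))
print(len(train), len(test))
```

The same split indices would have to be used for the fingerprint data and the attribute data so that each molecule's two descriptors stay on the same side of the split.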
Constructing and training the qualitative and quantitative model specifically comprises: setting up the model according to the feature-engineered input data, where the first layer is an input layer that receives the qualitative and quantitative descriptors, the middle layers are dense layers, and the last layer is an output layer that uses a Sigmoid function as the final output; optimizing the model parameters through a back-propagation optimization algorithm and evaluating performance; and training iteratively to obtain a model with high validation accuracy.
The predictive screening specifically comprises the following steps:
(1) based on the constructed and trained qualitative and quantitative model, the model with the highest validation accuracy is retained after multiple rounds of iterative training; N new molecules (N is usually 1-20) that do not appear in the training set are then randomly selected and converted into SMILES, and the SMILES are converted into qualitative and quantitative descriptors using exactly the same preprocessing and feature engineering steps as for the training data;
(2) the N new molecules are predicted with the highest-validation-accuracy qualitative and quantitative model that takes the ECFP fingerprint and the molecular attributes as joint input, giving the prediction results for the molecules to be predicted.
The molecules used in the data set are preferably light-emitting layer materials and additive molecules for perovskite light-emitting and photovoltaic devices.
The molecular screening method based on the deep learning qualitative and quantitative model specifically comprises the following steps:
Step a: the collected molecular structural formulas are first converted into SMILES, and the SMILES are then converted into qualitative descriptors and quantitative descriptors, respectively;
Step b: for the converted qualitative descriptors, each numeric code is mapped to its position in the fingerprint by a hash algorithm; for the converted quantitative descriptors, the 208 attribute features are filtered with a variance threshold and a Pearson correlation coefficient matrix algorithm, which deletes 143 redundant attributes and retains 65 attributes for training the model;
Step c: a neural-network-based qualitative and quantitative model is set up according to the qualitative and quantitative descriptor inputs. The model has a five-layer network architecture: the first layer is an input layer that receives the molecular fingerprints and molecular attributes, the middle layers are dense layers, and the last layer is an output layer that uses a Sigmoid function as the final output. The molecular fingerprint and molecular attribute data are input into the model for iterative training with back-propagation using the Adam optimizer; hyperparameters such as the learning rate and the number of neurons in each layer are set and evaluated for performance, and the qualitative and quantitative model with the highest validation accuracy is finally obtained;
Step d: new molecules of the material are selected, subjected to the feature engineering of step b, and input into the neural network model obtained in step c for predictive screening to obtain the screening result for the new molecules.
In the above molecular screening method, preferably in step a:
(1) the qualitative descriptors generated from the collected molecules via SMILES are MACCS fingerprints, FP2 fingerprints and ECFP fingerprints;
(2) the quantitative descriptors generated from the collected molecules via SMILES are 208 attribute features.
In the above molecular screening method, preferably in step b:
(1) for the converted qualitative descriptors, a hash algorithm maps each numeric code in the molecular fingerprint index data to its position in the molecular fingerprint;
(2) for the converted quantitative descriptors, variance-threshold selection is applied to the 208 molecular attributes, deleting 107 attribute features and leaving 101; a Pearson correlation coefficient matrix algorithm then deletes a further 36 attributes, leaving 65 attributes for training the model;
(3) the labelled qualitative and quantitative descriptor data sets are divided into a training set and a test set at a ratio of 9:1.
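One common way to map numeric codes onto fingerprint positions is to fold each integer identifier into a fixed-length bit vector with a modulo operation. The sketch below assumes that reading; the text does not specify the hash function, and the name `fold_identifiers`, the 16-bit length and the identifier values are hypothetical choices made to keep the example small.

```python
def fold_identifiers(identifiers, n_bits=16):
    """Map integer codes onto a fixed-length 0/1 fingerprint vector."""
    bits = [0] * n_bits
    for ident in identifiers:
        bits[ident % n_bits] = 1  # colliding codes share the same bit
    return bits

# Hypothetical codes, not taken from a real molecule.
example_codes = [3, 19, 40, 7]
fingerprint = fold_identifiers(example_codes, n_bits=16)
on_bits = [i for i, b in enumerate(fingerprint) if b]
print(on_bits)
```

Codes 3 and 19 land on the same position here, which shows why a longer fingerprint reduces collisions at the cost of a wider input layer.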
In the above molecular screening method, preferably in step c:
(1) the number of neurons in the input layer is determined by the input molecular fingerprint and molecular attributes and is generally L + 65, where L is the bit length of the molecular fingerprint and 65 is the number of input attributes; the number of neurons in each middle dense layer is set to 4-32, the number of neurons in the final output layer is set to 2, and a Sigmoid function is used for the output;
(2) the Adam algorithm is selected as the optimizer, and its learning rate is set to 0.001-0.1;
(3) validation accuracy is selected as the performance metric for evaluating the quality of the network model during training;
(4) the training batch size is set to 5-15, the number of training epochs is set to 100-200, and the early-stopping patience is set to 5-10 epochs;
(5) in each epoch, the training set is divided into batches according to the batch size and input into the network model for training, and the model weights are updated using the optimizer and the loss function; in each epoch the test set data are also input into the qualitative and quantitative model to obtain the validation accuracy and loss value, which guide training so as to prevent over-fitting or under-fitting.
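The early-stopping rule in items (4) and (5) can be sketched as follows. This is an illustration under assumptions: the function `early_stopping_epoch` is hypothetical, it takes a precomputed list of per-epoch validation losses instead of training a model, and it stops once the loss has not improved for `patience` consecutive epochs, which is one common definition of the rule.

```python
def early_stopping_epoch(val_losses, patience=10, max_epochs=200):
    """Return (epochs_run, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return min(len(val_losses), max_epochs), best_epoch

# Hypothetical losses: improvement for three epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.9] * 20
epochs_run, best_epoch = early_stopping_epoch(losses, patience=5)
print(epochs_run, best_epoch)
```

In a real training loop the weights from `best_epoch` would be the ones retained, since later epochs did not lower the validation loss.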
In the above molecular screening method, preferably in step d:
(1) new molecules that do not appear in the training set are selected, processed in the same way as the training set data, and input into the model;
(2) the new molecules are screened by prediction with the qualitative and quantitative model that has the highest validation accuracy, giving the screening result.
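Step d can be sketched as below. Everything named here is an assumption for illustration: `toy_model` is a stand-in for the trained network, `kept_columns` stands for the indices of the attributes retained during feature engineering, and the 0.5 decision threshold is not stated in the text. The point of the sketch is that new molecules must pass through the same attribute selection as the training data before being scored.

```python
import math

def select_columns(rows, kept_columns):
    """Keep only the attribute columns that were retained during training."""
    return [[row[j] for j in kept_columns] for row in rows]

def screen(model, fingerprints, attributes, kept_columns, threshold=0.5):
    """Return (probability, passed) for each new molecule."""
    results = []
    for fp, attrs in zip(fingerprints, select_columns(attributes, kept_columns)):
        p = model(fp + attrs)  # joint input: fingerprint bits, then attributes
        results.append((p, p >= threshold))
    return results

def toy_model(x):
    """Hypothetical stand-in for the trained model: a fixed logistic score."""
    return 1.0 / (1.0 + math.exp(-(sum(x) - 2.0)))

new_fps = [[1, 0, 1, 1], [0, 0, 0, 1]]
new_attrs = [[0.5, 9.0, 0.2], [0.1, 9.0, 0.3]]  # column 1 was dropped in training
screened = screen(toy_model, new_fps, new_attrs, kept_columns=[0, 2])
print([passed for _, passed in screened])
```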
In summary, the invention converts collected molecules into SMILES and then converts the SMILES into qualitative and quantitative descriptors as preprocessing; applies feature engineering to the descriptors (hash mapping for the qualitative descriptors, and a variance threshold and a Pearson correlation coefficient matrix for the quantitative descriptors); sets up a qualitative and quantitative model according to the feature-engineered input data, in which the first layer is an input layer, the middle layers are dense layers, and the last layer is an output layer with a Sigmoid function as the final output; optimizes the model parameters through a back-propagation optimization algorithm, evaluates performance, and trains iteratively to obtain a model with high validation accuracy; and finally preprocesses and feature-engineers the molecules to be predicted and inputs them into the qualitative and quantitative model (EMIM) with the highest validation accuracy for predictive screening, giving their prediction results. By inputting the qualitative and quantitative descriptors simultaneously for joint learning, the invention overcomes limitations such as the insufficient chemical information available when a single qualitative or single quantitative descriptor is input, and achieves efficient screening of molecules.
Compared with the prior art, the invention has the following advantages:
(1) By inputting the qualitative descriptor and the quantitative descriptor simultaneously, the invention effectively solves the problem that a molecular fingerprint used alone lacks physicochemical properties and that molecular attributes used alone lack molecular structure fragments. Meanwhile, the strong information extraction capability of the deep learning neural network framework can comprehensively mine the structural fragment information carried by the molecular fingerprints and the physicochemical property features provided by the molecular attributes, achieving a highest validation accuracy of 92.86%. In deployment, the model correctly screened 8 new molecules of the material. Compared with the highest validation accuracy of 82.14% for ECFP fingerprints in a DNN model, the qualitative and quantitative model screens new molecules more efficiently and accurately and overcomes the limitations of traditional screening.
(2) The qualitative and quantitative model established by the invention takes a qualitative descriptor and a quantitative descriptor as simultaneous inputs and can achieve high-accuracy molecular prediction with a small-sample data set.
Drawings
FIG. 1 is an algorithmic flow chart of a molecular screening method;
FIG. 2 shows the hash mapping of the qualitative descriptor molecular fingerprint index data;
FIG. 3 is a heat map of the Pearson correlation coefficient matrix for the quantitative descriptor molecular attributes;
FIG. 4 is a schematic diagram of the qualitative and quantitative model;
FIG. 5 shows the learning curves of accuracy and loss for the qualitative and quantitative model with the highest validation accuracy; a is the curve of training accuracy and validation accuracy, and b is the curve of training loss and validation loss;
FIG. 6 shows the predictions for new molecules by the qualitative and quantitative model with the highest validation accuracy;
FIG. 7 compares the validation accuracy of the ECFP fingerprint plus 65 attributes input to the qualitative and quantitative model with that of the ECFP fingerprint input to the DNN model;
FIG. 8 compares the validation accuracy of the FP2 fingerprint plus 65 attributes input to the qualitative and quantitative model with that of the FP2 fingerprint input to the DNN model;
FIG. 9 compares the validation accuracy of the MACCS fingerprint plus 65 attributes input to the qualitative and quantitative model with that of the MACCS fingerprint input to the DNN model;
FIG. 10 compares the validation accuracy of the ECFP fingerprint plus 65 attributes input to the qualitative and quantitative model with that of the 65 attributes input to the DNN model;
FIG. 11 compares the validation accuracy of the MACCS fingerprint plus 65 attributes input to the qualitative and quantitative model with that of the 65 attributes input to the DNN model;
FIG. 12 compares the validation accuracy of the FP2 fingerprint plus 65 attributes input to the qualitative and quantitative model with that of the 65 attributes input to the DNN model.
Detailed Description
The present invention will be described in detail with reference to specific examples.
Example 1
A molecular screening method based on a deep learning qualitative and quantitative model is implemented by the flow shown in FIG. 1. The method comprises the following steps: constructing and preprocessing a data set, performing feature engineering on the data set, constructing and training a qualitative and quantitative model, and deploying the model for predictive screening.
1. Constructing and preprocessing the data set specifically comprises the following steps:
(1) The molecular structures are converted to SMILES. The molecules used for training the model and their structural formulas are first obtained; then, following the SMILES conversion rules developed by David Weininger in 1988 and taking the additives in perovskite FAPbI3 light-emitting diodes as an example, the structural formula of each additive molecule is converted into SMILES.
(2) The SMILES are converted into qualitative and quantitative descriptors, respectively. Because different molecular fingerprints use different algorithms, the fingerprint algorithms packaged in two cheminformatics toolkits for Python, RDKit and Open Babel, are called with the SMILES as input to generate MACCS, FP2 and ECFP molecular fingerprints; the SMILES are then input to RDKit to generate 208 attributes for each molecule, and the qualitative and quantitative descriptor data set is constructed from these.
2. Feature engineering of the data set specifically comprises the following steps:
(1) The qualitative descriptor molecular fingerprint index data are hash-mapped. The molecular fingerprint index data obtained are a series of numeric codes; a hash algorithm maps each code to its position in the molecular fingerprint, as shown in FIG. 2;
(2) A variance threshold and a Pearson correlation coefficient matrix algorithm are applied to the quantitative descriptor molecular attributes. Attributes with low variance carry little chemical information, so variance-threshold selection is applied to the 208 attributes of the quantitative descriptors, deleting 107 attributes and retaining 101. Strongly correlated attributes are redundant, and deleting them effectively reduces information redundancy, so a further 36 attributes are deleted using the Pearson correlation coefficient matrix algorithm, leaving 65 attributes for training the model; the Pearson correlation coefficient matrix is shown in FIG. 3;
(3) The qualitative and quantitative descriptor data set is divided into a training set and a test set at a ratio of 9:1; that is, 90% of the data are used for training and 10% for testing. Each data set has 121 training samples and 14 test samples.
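The two-stage attribute filter in step 2(2) can be sketched in plain Python. The threshold values, the greedy keep-the-first rule for correlated pairs, the function names and the toy columns are all assumptions; the text gives only the attribute counts (208 to 101 to 65), not the cut-offs used.

```python
import math
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_attributes(columns, var_threshold=0.0, corr_threshold=0.9):
    """columns: dict of attribute name -> values. Returns retained names."""
    # Stage 1: drop attributes whose variance does not exceed the threshold.
    kept = [n for n, v in columns.items() if pvariance(v) > var_threshold]
    # Stage 2: drop an attribute if it is strongly correlated with one already kept.
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) < corr_threshold
               for s in selected):
            selected.append(name)
    return selected

# Toy attribute table with hypothetical names and values.
columns = {
    "const": [1.0, 1.0, 1.0, 1.0],      # zero variance, removed in stage 1
    "a": [1.0, 2.0, 3.0, 4.0],
    "a_scaled": [2.0, 4.0, 6.0, 8.1],   # nearly collinear with "a"
    "b": [1.0, -1.0, 1.0, -1.0],
}
retained = select_attributes(columns)
print(retained)
```

The list of retained names would need to be stored alongside the model so that new molecules can be reduced to the same attributes at prediction time.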
3. Constructing and training the qualitative and quantitative model specifically comprises the following steps:
(1) When the qualitative and quantitative model is constructed, the qualitative descriptor (molecular fingerprint) and the quantitative descriptor (molecular attributes) are input jointly. The model has a five-layer network structure, as shown in FIG. 4. The input layer has L + 65 neurons (L is the bit length of the molecular fingerprint and 65 is the number of input attributes). The second and third layers are dense layers with 12 and 4 neurons, respectively, and use the Rectified Linear Unit (ReLU) activation function; the dense layers map the features extracted from the molecular fingerprint and the molecular attributes to the output space through a nonlinear transformation. The fourth layer merges the third-layer outputs with a Concatenate function, and the final layer produces the output with a Sigmoid function. The mean squared error is selected as the loss function and the Adam algorithm as the optimizer, with a learning rate of 0.001; the batch size and the number of training epochs are set to 8 and 200, respectively. To obtain the model with high validation accuracy and the best generalization ability, an early-stopping function is set in the model to monitor the validation loss, with the early-stopping patience set to 10 epochs.
(2) When the qualitative and quantitative model is trained, the training set is divided into 16 batches according to the batch size and input into the model; the optimizer and the loss function are used to update the model weights, and in each epoch the test set data are input into the model to obtain its loss value and accuracy, which guide training so as to prevent over-fitting or under-fitting. The learning curves of the model with the highest validation accuracy are shown in FIG. 5, where the validation accuracy reaches 92.86%. The right side of FIG. 5 shows the relationship between the number of training epochs and the loss value, where "Training loss" is the training loss and "Validation loss" is the validation loss. Because the early-stopping function monitors the validation loss, training stopped after a total of 41 epochs.
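The architecture described in step 3 can be sketched as an untrained forward pass in plain Python. Several points are assumptions, since the text leaves them open: the fingerprint and the attributes are each given their own 12-neuron and 4-neuron dense branch before concatenation, the fingerprint length is taken as 1024 bits, a single Sigmoid output neuron is used (the disclosure elsewhere mentions 2 output neurons), and the weights are random placeholders. All function names are hypothetical.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """Fully connected layer; `weights` holds one row of weights per neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_model(fp_len, n_attrs=65, seed=0):
    rng = random.Random(seed)
    return {
        "fp": [init_layer(fp_len, 12, rng), init_layer(12, 4, rng)],
        "attr": [init_layer(n_attrs, 12, rng), init_layer(12, 4, rng)],
        "out": init_layer(8, 1, rng),  # 4 + 4 concatenated features
    }

def forward(model, fingerprint, attributes):
    h_fp, h_attr = fingerprint, attributes
    for w, b in model["fp"]:
        h_fp = dense(h_fp, w, b, relu)
    for w, b in model["attr"]:
        h_attr = dense(h_attr, w, b, relu)
    merged = h_fp + h_attr  # list concatenation plays the Concatenate role
    w, b = model["out"]
    return dense(merged, w, b, sigmoid)[0]

def mse(predictions, targets):
    """Mean squared error, the loss named in the text."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

model = build_model(fp_len=1024)
rng = random.Random(1)
fingerprint = [rng.randint(0, 1) for _ in range(1024)]  # placeholder bits
attributes = [rng.random() for _ in range(65)]          # placeholder values
p = forward(model, fingerprint, attributes)
n_batches = math.ceil(121 / 8)  # 121 training samples, batch size 8
print(0.0 < p < 1.0, n_batches)
```

The last line also checks the batch arithmetic in step (2): 121 training samples at a batch size of 8 give 16 batches, the final one holding a single sample.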
4. Deploying the model for predictive screening specifically comprises the following steps:
(1) Based on the construction and training of the qualitative and quantitative model in step 3, the model with the highest validation accuracy is retained after repeated iterative training. Then 8 new molecules that do not appear in the training set are randomly selected and converted into SMILES, and the SMILES are converted into qualitative and quantitative descriptors; the preprocessing and feature engineering steps are exactly the same as those used for the training data.
(2) The highest-validation-accuracy qualitative and quantitative model with the ECFP fingerprint and molecular attributes as joint input is used to predict the 8 new molecules, and all predictions are correct. The prediction results are shown in FIG. 6.
Example 2
When a qualitative and quantitative model is constructed, the FP2 fingerprint and 65 attributes are selected to be input jointly, five layers of networks are arranged in the model as shown in FIG. 4, and the specific construction and training of the qualitative and quantitative model are described with reference to example 1.
At the moment, the accuracy of the verification precision of the qualitative and quantitative model reaches 85.71%, and compared with the verification precision that the fingerprint input by FP2 in the DNN model can only reach 75.00%, the precision of the qualitative and quantitative model is improved by 10.71%. The comparison accuracy is listed in fig. 8.
Example 3
When a qualitative and quantitative model is constructed, the MACCS fingerprint and 65 attributes are selected to be input jointly, five layers of networks are set in the model as shown in fig. 4, and the specific construction and training of the qualitative and quantitative model are described with reference to example 1.
At the moment, the verification precision accuracy of the qualitative and quantitative model reaches 85.71%, and compared with the verification precision that MACCS fingerprints input in the DNN model can only reach 71.43%, the precision of the qualitative and quantitative model is improved by 14.28%. The comparison accuracy is shown in fig. 9.
Example 4
When a qualitative and quantitative model is constructed, the ECFP fingerprints and 65 attributes are selected to be input in a combined manner, five layers of networks are arranged in the model as shown in FIG. 4, and the specific construction and training of the qualitative and quantitative model are described with reference to example 1.
At the moment, the accuracy of the verification precision of the qualitative and quantitative model reaches 92.86%, and compared with the verification precision that 65 attributes input in the DNN model can only reach 78.57%, the precision of the qualitative and quantitative model is improved by 14.29%. The comparison accuracy is listed in fig. 10.
Example 5
When a qualitative and quantitative model is constructed, an FP2 fingerprint and 65 attributes are selected to be input in a combined manner, five layers of networks are arranged in the model as shown in FIG. 5, and the specific construction and training of the qualitative and quantitative model are described with reference to example 1.
At the moment, the accuracy of the verification precision of the qualitative and quantitative model reaches 85.71%, and compared with the verification precision that 65 attributes input in the DNN model can only reach 78.57%, the precision of the qualitative and quantitative model is improved by 7.14%. The comparison accuracy is listed in fig. 11.
Example 6
When the qualitative and quantitative model is constructed, MACCS fingerprints and the 65 attributes are selected as a combined input, and a five-layer network is set in the model as shown in FIG. 4. For the specific construction and training of the qualitative and quantitative model, refer to Example 1.
At this point, the verification accuracy of the qualitative and quantitative model reaches 85.71%. Compared with the 78.57% verification accuracy obtained when the 65 attributes are input to the DNN model, this is an improvement of 7.14 percentage points. The accuracy comparison is shown in FIG. 12.
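The improvements quoted in the examples above are absolute differences between two accuracies, that is, percentage points rather than relative gains. The following minimal sketch recomputes them from the reported figures; the dictionary labels and function names are chosen for this illustration only and do not come from the source.

```python
# Verification accuracies (percent) as reported in the examples above:
# (combined fingerprint + attribute input, single-input DNN baseline).
# The labels are descriptive names chosen for this sketch.
reported = {
    "ECFP + 65 attributes": (92.86, 78.57),
    "FP2 + 65 attributes": (85.71, 78.57),
    "MACCS + 65 attributes": (85.71, 78.57),
}


def gain_points(combined, baseline):
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(combined - baseline, 2)


def relative_gain(combined, baseline):
    """Gain expressed relative to the baseline accuracy, in percent."""
    return round(100.0 * (combined - baseline) / baseline, 2)


for label, (combined, baseline) in reported.items():
    print(label, gain_points(combined, baseline), relative_gain(combined, baseline))
```

The absolute differences reproduce the figures stated in the text; the relative gains are larger because they are measured against the baseline accuracy.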
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.
Claims (9)
1. A molecular screening method based on a deep learning qualitative and quantitative model, characterized by comprising the following steps: a molecular data set is constructed, preprocessed, and subjected to feature engineering in sequence, and is then input into the qualitative and quantitative model for predictive screening to obtain a screening result for new molecules.
2. The method of claim 1, wherein the step of constructing and preprocessing the data set comprises converting the collected molecular structure into SMILES, and converting the SMILES into a qualitative descriptor and a quantitative descriptor, respectively.
3. The molecular screening method based on deep learning qualitative and quantitative model as claimed in claim 2, wherein the qualitative descriptor is molecular fingerprint and the quantitative descriptor is molecular attribute.
4. The molecular screening method based on deep learning qualitative and quantitative model according to claim 3, wherein the molecular fingerprint is a MACCS fingerprint, an FP2 fingerprint, an ECFP fingerprint, or the like.
5. The molecular screening method based on deep learning qualitative and quantitative model as claimed in claim 1, wherein the feature engineering comprises applying HashMap to the qualitative descriptors, applying a variance threshold and a Pearson correlation coefficient matrix algorithm to the quantitative descriptors, and dividing the qualitative and quantitative descriptor data sets into a training set and a test set at a ratio of 9:1.
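The attribute filtering and data split named in claim 5 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the threshold values, the greedy order in which correlated attributes are dropped, the random seed, and the attribute names are all hypothetical, since the source does not specify them.

```python
import math
import random


def variance(col):
    """Population variance of one attribute column."""
    mean = sum(col) / len(col)
    return sum((x - mean) ** 2 for x in col) / len(col)


def pearson(a, b):
    """Pearson correlation coefficient of two equally long columns."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    if sa == 0 or sb == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (sa * sb)


def select_features(columns, var_threshold=0.0, corr_threshold=0.95):
    """Return attribute names that survive both filters (thresholds assumed)."""
    # Filter 1: drop attributes whose variance does not exceed the threshold.
    kept = [name for name, col in columns.items() if variance(col) > var_threshold]
    # Filter 2: greedily drop attributes strongly correlated with a kept one.
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) < corr_threshold
               for s in selected):
            selected.append(name)
    return selected


def split_9_to_1(rows, seed=0):
    """Shuffle and split rows into training and test sets at a 9:1 ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * 0.9)
    return rows[:cut], rows[cut:]


# Toy attribute table; the attribute names are hypothetical.
columns = {
    "mol_weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mol_weight_x2": [2.0, 4.0, 6.0, 8.0, 10.0],  # redundant copy
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],        # zero variance
    "logp": [2.0, 1.0, 4.0, 3.0, 6.0],
}
print(select_features(columns))
train, test = split_9_to_1(range(20))
print(len(train), len(test))
```

On the toy table, the constant column is removed by the variance filter and the scaled copy is removed by the correlation filter, which mirrors how redundant attributes are discarded before training.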
6. The molecular screening method based on the deep learning qualitative and quantitative model according to claim 1, wherein the qualitative and quantitative model is constructed and trained by the following specific steps: a qualitative and quantitative model is set according to the input data processed by feature engineering, in which the first layer is an input layer whose input data are the qualitative and quantitative descriptors, the middle layer is a dense layer, and the last layer is an output layer that uses a Sigmoid function as the final output; the parameters of the qualitative and quantitative model are optimized and set through a back propagation optimization algorithm, performance evaluation is performed, and iterative training is carried out to obtain a model with high verification accuracy.
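The layer arrangement described in claim 6 can be illustrated with a forward pass through a small dense network whose input is the concatenation of fingerprint bits and molecular attributes and whose output passes through a Sigmoid function. The sketch below makes several assumptions that the source does not state: the hidden layer sizes, the ReLU activation in the dense layers, and the random weight initialization. Training by back propagation is not shown.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relu(z):
    """Hidden-layer activation (an assumption; not stated in the source)."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def build_model(n_inputs, hidden_sizes, seed=0):
    """Create untrained weights for the dense layers plus one output neuron."""
    rng = random.Random(seed)
    sizes = [n_inputs] + list(hidden_sizes) + [1]
    model = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        model.append((weights, [0.0] * n_out))
    return model


def predict_proba(model, fingerprint_bits, attributes):
    """Joint input: qualitative bits followed by quantitative attributes."""
    x = list(fingerprint_bits) + list(attributes)
    for weights, biases in model[:-1]:
        x = dense(x, weights, biases, relu)
    weights, biases = model[-1]
    return dense(x, weights, biases, sigmoid)[0]


# Input layer, three dense layers, and a Sigmoid output: five layers in total.
model = build_model(n_inputs=8 + 3, hidden_sizes=[16, 8, 4])
p = predict_proba(model, [1, 0, 0, 1, 0, 1, 1, 0], [0.2, -1.3, 0.7])
print(p)
```

Because the output neuron uses a Sigmoid function, the returned value always lies strictly between 0 and 1 and can be read as a screening score.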
7. The molecular screening method based on deep learning qualitative and quantitative model according to claim 1, characterized in that the predictive screening specifically comprises:
(1) based on the constructed and trained qualitative and quantitative model, the model with the highest verification accuracy is retained after multiple rounds of iterative training; N new molecules (N is usually 1 to 20) that do not appear in the training set are then randomly selected and converted into SMILES, and the preprocessing and feature engineering steps that convert the SMILES into qualitative and quantitative descriptors are exactly the same as those used for training the model;
(2) the N new molecules are predicted using the qualitative and quantitative model with the highest verification accuracy, which takes ECFP fingerprints and molecular attributes as a combined input, to obtain the prediction results for the molecules to be predicted.
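The two steps of claim 7 amount to keeping the candidate model with the highest verification accuracy and then scoring only molecules that were not in the training set. The sketch below illustrates that logic with stand-in models written as plain functions that return a probability; the 0.5 decision threshold, the function names, and the molecule identifiers are assumptions made for this illustration.

```python
def accuracy(model, samples):
    """Fraction of (features, label) pairs that the model classifies correctly."""
    hits = sum(1 for x, y in samples if (model(x) >= 0.5) == bool(y))
    return hits / len(samples)


def keep_best(candidates, validation_set):
    """Retain the candidate with the highest verification accuracy."""
    return max(candidates, key=lambda m: accuracy(m, validation_set))


def screen(model, new_molecules, training_ids, threshold=0.5):
    """Predict only molecules that did not appear in the training set."""
    results = {}
    for mol_id, features in new_molecules.items():
        if mol_id in training_ids:
            continue  # skip molecules already seen during training
        results[mol_id] = model(features) >= threshold
    return results


# Stand-in models: each maps a feature vector to a probability.
def model_a(x):
    return 0.9 if x[0] > 0 else 0.1


def model_b(x):
    return 0.5  # always predicts the positive class


validation = [([1], 1), ([0], 0), ([1], 1), ([0], 0)]
best = keep_best([model_a, model_b], validation)
new = {"mol_1": [1], "mol_2": [0], "seen": [1]}
print(screen(best, new, training_ids={"seen"}))
```

In practice the new molecules would pass through the same preprocessing and feature engineering as the training data before being scored, as step (1) requires.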
8. The method of claim 1, wherein the data set comprises molecules selected from the group consisting of perovskite luminescent materials, additive molecules, and combinations thereof.
9. The molecular screening method based on deep learning qualitative and quantitative model according to claim 1, characterized in that the method specifically comprises the following steps:
step a: firstly, converting a collected molecular structural formula into SMILES, and then respectively converting the SMILES into qualitative descriptors and quantitative descriptors;
step b: mapping each numeric code of the converted qualitative descriptors to a position in the fingerprint through a hash algorithm; applying a variance threshold and a Pearson correlation coefficient matrix algorithm to the 208 converted attribute features, deleting 143 redundant attributes, and retaining 65 attributes for training the model;
step c: setting a qualitative and quantitative model based on a neural network according to the input data of the qualitative and quantitative descriptors, wherein the qualitative and quantitative model has a five-layer network architecture, the first layer is an input layer whose input data are the molecular fingerprints and molecular attributes, the middle layer is a dense layer, and the last layer is an output layer that uses a Sigmoid function as the final output; inputting the molecular fingerprints and the molecular attribute data into the qualitative and quantitative model for iterative training, optimizing by back propagation with an Adam optimizer, setting parameters such as the model learning rate and the number of neurons in each layer, and performing performance evaluation, to finally obtain the qualitative and quantitative model with the highest verification accuracy;
step d: selecting new molecules of the material, subjecting the new molecules to the feature engineering of step b, and inputting them into the neural network model obtained in step c for predictive screening, to obtain the screening result of the new molecules.
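The hash mapping in step b, in which each numeric code is assigned to a position in the fingerprint, can be sketched as folding integer codes into a fixed-length bit vector. CRC32 from the Python standard library is used here only as a deterministic stand-in; it is not the hash used by the source or by any particular cheminformatics toolkit, and the vector length and example codes are likewise assumptions.

```python
import zlib


def fold_codes(codes, n_bits=1024):
    """Set one bit per numeric code, at the position chosen by a hash."""
    bits = [0] * n_bits
    for code in codes:
        # CRC32 is a stand-in hash; the modulo folds it into the vector.
        position = zlib.crc32(str(code).encode("ascii")) % n_bits
        bits[position] = 1
    return bits


# Hypothetical substructure codes; a repeated code maps to the same bit.
bits = fold_codes([12345, 67890, 12345], n_bits=64)
print(sum(bits))
```

Two different codes can land on the same position, so the number of set bits is at most the number of distinct codes. A longer vector reduces such collisions at the cost of a wider input layer.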
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110914312.3A CN113674807A (en) | 2021-08-10 | 2021-08-10 | Molecular screening method based on deep learning technology qualitative and quantitative model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113674807A (en) | 2021-11-19 |
Family
ID=78542141
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110914312.3A Pending CN113674807A (en) | 2021-08-10 | 2021-08-10 | Molecular screening method based on deep learning technology qualitative and quantitative model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113674807A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106777986A (en) * | 2016-12-19 | 2017-05-31 | 南京邮电大学 | Ligand molecular fingerprint generation method based on depth Hash in drug screening |
CN110444250A (en) * | 2019-03-26 | 2019-11-12 | 广东省微生物研究所(广东省微生物分析检测中心) | High-throughput drug virtual screening system based on molecular fingerprint and deep learning |
US20200176087A1 (en) * | 2018-12-03 | 2020-06-04 | Battelle Memorial Institute | Method for simultaneous characterization and expansion of reference libraries for small molecule identification |
CN111798935A (en) * | 2019-04-09 | 2020-10-20 | 南京药石科技股份有限公司 | Universal compound structure-property correlation prediction method based on neural network |
Non-Patent Citations (1)
Title |
---|
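Zhou Shiying et al.: "Research on drug molecule classification and virtual screening based on deep learning", China Master's Theses Full-text Database, pages 11 - 51 * |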
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114818948A (en) * | 2022-05-05 | 2022-07-29 | 北京科技大学 | Data-mechanism driven material attribute prediction method of graph neural network |
CN114818948B (en) * | 2022-05-05 | 2023-02-03 | 北京科技大学 | Data-mechanism driven material attribute prediction method of graph neural network |
CN116646024A (en) * | 2023-07-26 | 2023-08-25 | 苏州创腾软件有限公司 | Open loop polymerization enthalpy prediction method and device based on machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||