CN113674807A - Molecular screening method based on deep learning technology qualitative and quantitative model

Info

Publication number
CN113674807A
Authority
CN
China
Prior art keywords
qualitative
quantitative
model
molecular
quantitative model
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110914312.3A
Other languages
Chinese (zh)
Inventor
王建浦
朱琳
章亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Application filed by Nanjing Tech University
Priority to CN202110914312.3A
Publication of CN113674807A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00 Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/90 Programming languages; Computing architectures; Database systems; Data warehousing

Abstract

The invention discloses a molecular screening method based on a deep-learning qualitative and quantitative model (EMIM). Collected molecules are converted into SMILES, and the SMILES are then converted into qualitative and quantitative descriptors as preprocessing. Feature engineering is applied to the descriptors: hash mapping for the qualitative descriptors, and a variance threshold and Pearson correlation coefficients for the quantitative descriptors. A qualitative and quantitative model is configured according to the feature-engineered input data, with a Sigmoid function as the final output; the model parameters are tuned through a back-propagation optimization algorithm with performance evaluation, and training is iterated to obtain a model with high validation accuracy. The molecules to be predicted undergo the same preprocessing and feature engineering and are input into the qualitative and quantitative model (EMIM) with the highest validation accuracy for predictive screening, giving their prediction results. The invention achieves more efficient and accurate screening of new molecules and overcomes the limitations of traditional screening.

Description

Molecular screening method based on deep learning technology qualitative and quantitative model
Technical Field
The invention relates to the intersection of deep learning with physics, chemistry and materials science, and in particular to a qualitative and quantitative model built on a deep learning neural network framework that can be used for screening molecules.
Background
Molecular descriptors characterize the structural sub-fragments or physicochemical properties of a molecule and can be divided into qualitative descriptors and quantitative descriptors. Qualitative descriptors generally refer to molecular fingerprints, which convert a chemical molecule into a bit string consisting only of 0s and 1s. Quantitative descriptors describe molecular attributes based on molecular composition (e.g. number of hydrogen bond donors, number of benzene rings), physicochemical properties (e.g. topological polar surface area, octanol-water partition coefficient) and experimental data (e.g. ultraviolet spectrum, solvent ratio); they are numerical features of a molecule's chemical information.
Techniques such as machine learning and deep learning can mine a molecular data set and model the linear and nonlinear relationships between molecular chemical-information features and target properties, so that new molecules can be predicted and screened accurately and efficiently, which in turn guides experimental design. At present, data sets are built from quantitative descriptors alone or qualitative descriptors alone before training machine learning and deep learning models, which has limitations. For example, a molecular fingerprint alone lacks physicochemical properties, while molecular attributes alone lack information such as molecular structure. The application of machine learning and deep learning to screening new molecules is therefore limited to some extent.
Disclosure of Invention
The technical problem to be solved by the invention is to establish a molecular screening method based on an Enhanced Molecular Information Model (EMIM), in which new molecules, after data preprocessing and feature engineering, are input into the EMIM for predictive screening, so that new molecules are screened efficiently and accurately.
The technical scheme by which the invention achieves this aim is as follows:
A molecular screening method based on a deep-learning qualitative and quantitative model (EMIM), comprising the following steps: a molecular data set is constructed, preprocessed and subjected to feature engineering in sequence, and is then input into the qualitative and quantitative model for predictive screening to obtain a screening result for new molecules.
Preferably, the construction and preprocessing of the molecular data set consist of converting the collected molecular structural formulas into SMILES and then converting the SMILES into a qualitative descriptor and a quantitative descriptor, respectively.
The qualitative descriptor is a molecular fingerprint, and the quantitative descriptor is a set of molecular attributes. Preferably the molecular fingerprint is a MACCS fingerprint, an FP2 fingerprint or an ECFP fingerprint.
Preferably, the feature engineering applies hash mapping to the qualitative descriptors and a variance threshold and a Pearson correlation coefficient matrix algorithm to the quantitative descriptors, and splits the qualitative and quantitative descriptor data sets at a training set : test set ratio of 9:1.
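A minimal sketch of a 9:1 split, for illustration only. The function name `split_9_to_1`, the fixed shuffle seed and the use of integer indices as stand-ins for molecules are assumptions; the patent states the ratio (and, in Example 1, 121 training and 14 test samples) but not how the split is drawn.

```python
import random

def split_9_to_1(samples, seed=0):
    """Shuffle a copy of `samples` and split it 9:1 into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption, for repeatability
    n_train = len(shuffled) * 9 // 10      # integer floor: 135 samples -> 121
    return shuffled[:n_train], shuffled[n_train:]

# 135 placeholder samples (121 + 14, the sizes reported in Example 1).
train, test = split_9_to_1(range(135))
print(len(train), len(test))  # 121 14
```

Flooring the training share reproduces the 121/14 sizes reported in Example 1 for 135 samples; whether the original split was random or stratified is not stated.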
The qualitative and quantitative model is constructed and trained as follows: the model is configured according to the feature-engineered input data; the first layer is an input layer whose input data are the qualitative and quantitative descriptors, the middle layers are dense layers, and the last layer is an output layer with a Sigmoid function as the final output; the model parameters are tuned through a back-propagation optimization algorithm with performance evaluation, and training is iterated to obtain a model with high validation accuracy.
The predictive screening specifically comprises the following steps:
(1) based on the constructed and trained qualitative and quantitative model, the model with the highest validation accuracy is retained after multiple rounds of iterative training; N new molecules that do not appear in the training set (N is usually 1-20) are then randomly selected and converted into SMILES, and the SMILES are converted into qualitative and quantitative descriptors using exactly the same preprocessing and feature engineering steps as for the training data;
(2) the N new molecules are predicted using the qualitative and quantitative model with the highest validation accuracy, which takes the ECFP fingerprint and the molecular attributes as joint input, to obtain the prediction results for the molecules to be predicted.
The molecules used in the data set are preferably light-emitting-layer materials or additive molecules of perovskite light-emitting and photovoltaic devices.
The molecular screening method based on the deep-learning qualitative and quantitative model specifically comprises the following steps:
Step a: the collected molecular structural formulas are first converted into SMILES, and the SMILES are then converted into qualitative descriptors and quantitative descriptors, respectively;
Step b: for the converted qualitative descriptors, each digital code is mapped to its position in the fingerprint by a hash algorithm; for the 208 attribute features of the converted quantitative descriptors, a variance threshold and a Pearson correlation coefficient matrix algorithm are applied, 143 redundant attributes are deleted, and 65 attributes are retained for training the model;
Step c: a neural-network-based qualitative and quantitative model is configured according to the qualitative and quantitative descriptor input data; the model has a five-layer network architecture in which the first layer is an input layer whose input data are the molecular fingerprints and molecular attributes, the middle layers are dense layers, and the last layer is an output layer with a Sigmoid function as the final output; the molecular fingerprint and molecular attribute data are input into the model for iterative training, with back-propagation optimization by the Adam optimizer; parameters such as the learning rate and the number of neurons in each layer are set and evaluated for performance, finally giving the qualitative and quantitative model with the highest validation accuracy;
Step d: new molecules of the material are selected, subjected to the feature engineering of step b, and input into the neural network model obtained in step c for predictive screening to obtain the screening result for the new molecules.
In the above molecular screening method, preferably in step a:
(1) the qualitative descriptors generated from the SMILES of the collected molecules are MACCS fingerprints, FP2 fingerprints and ECFP fingerprints;
(2) the quantitative descriptors generated from the SMILES of the collected molecules are 208 attribute features.
In the above molecular screening method, preferably in step b:
(1) for the molecular fingerprint index data of the converted qualitative descriptors, each digital code is mapped by a hash algorithm to the position in the molecular fingerprint to which it corresponds;
(2) variance-threshold selection is applied to the 208 molecular attributes of the converted quantitative descriptors, deleting 107 attribute features and leaving 101; a further 36 attributes are deleted using the Pearson correlation coefficient matrix algorithm, leaving 65 attributes for training the model;
(3) the labeled qualitative and quantitative descriptor data sets are split at a training set : test set ratio of 9:1.
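The hash-mapping step in item (1) can be pictured as turning a list of integer feature codes into a fixed-length bit vector. The sketch below is only an illustration under stated assumptions: the function name `fold_to_bit_vector`, the vector length of 1024 and the example codes are hypothetical, and simple modulo folding stands in for whatever hash function the fingerprint toolkits actually use.

```python
def fold_to_bit_vector(feature_ids, n_bits=1024):
    """Map each integer feature code to a position in a fixed-length 0/1 vector."""
    bits = [0] * n_bits
    for fid in feature_ids:
        bits[fid % n_bits] = 1  # modulo folding stands in for the hash function
    return bits

# Hypothetical feature codes for one molecule (not taken from the patent's data).
example_ids = [3, 80, 1027, 2100]
vector = fold_to_bit_vector(example_ids, n_bits=1024)
on_bits = [i for i, bit in enumerate(vector) if bit]
print(on_bits)  # [3, 52, 80]
```

Codes 3 and 1027 land on the same position here, which shows the usual trade-off of folding: a shorter vector means more collisions between distinct structural features.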
In the above molecular screening method, preferably in step c:
(1) the number of neurons in the input layer is determined by the input molecular fingerprint and molecular attributes and is generally L + 65, where L is the bit length of the molecular fingerprint and 65 is the number of input attributes; the number of neurons in each middle dense layer is set to 4-32, the number of neurons in the final output layer is set to 2, and a Sigmoid function is used for the output;
(2) the Adam algorithm is selected as the optimizer, with its learning rate set to 0.001-0.1;
(3) validation accuracy is selected as the performance metric for judging the quality of the network model during training;
(4) the batch size is set to 5-15, the number of training epochs to 100-200, and the early-stopping patience to 5-10 epochs;
(5) in each training epoch, the training set is divided into several batches according to the batch size and fed into the network model, and the model weights are updated using the optimizer and the loss function; in each epoch the test set data are also input into the qualitative and quantitative model to obtain the validation accuracy and loss value, which guide training and help prevent over-fitting or under-fitting.
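The batching and early-termination mechanics in items (4) and (5) can be sketched in plain Python. This is an illustrative sketch, not the patent's implementation: the helper names `make_batches` and `early_stopping_epoch`, the "no improvement for `patience` epochs" stopping rule and the loss values are assumptions. The sizes 121 and 8 are the training-set size and batch setting given in Example 1.

```python
def make_batches(samples, batch_size):
    """Split `samples` into consecutive mini-batches; the last one may be smaller."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs (an assumed, commonly used rule).
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

batches = make_batches(list(range(121)), batch_size=8)
print(len(batches), len(batches[-1]))  # 16 1

# Hypothetical validation-loss curve: improves for three epochs, then stalls.
losses = [0.30, 0.25, 0.22, 0.23, 0.24, 0.22, 0.25]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)  # 6
```

With 121 training samples and a batch size of 8, the training set falls into 16 batches per epoch, which matches the "16 parts" mentioned in Example 1.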
In the above molecular screening method, preferably in step d:
(1) new molecules that do not appear in the training set are selected, processed in the same way as the training set data, and input into the model;
(2) the new molecules are screened by prediction with the qualitative and quantitative model with the highest validation accuracy to obtain the screening result.
Therefore, the invention first preprocesses the collected molecules by converting them into SMILES and then converting the SMILES into qualitative and quantitative descriptors. Feature engineering is then applied: hash mapping for the qualitative descriptors, and a variance threshold and Pearson correlation coefficients for the quantitative descriptors. A qualitative and quantitative model is configured according to the feature-engineered input data, in which the first layer is an input layer, the middle layers are dense layers and the last layer is an output layer with a Sigmoid function as the final output; the model parameters are tuned through a back-propagation optimization algorithm with performance evaluation, and training is iterated to obtain a model with high validation accuracy. The molecules to be predicted undergo the same preprocessing and feature engineering and are input into the qualitative and quantitative model (EMIM) with the highest validation accuracy for predictive screening, giving their prediction results. By inputting the qualitative descriptor and the quantitative descriptor simultaneously for joint learning, the invention overcomes limitations such as the insufficient chemical information available when only a single qualitative or quantitative descriptor is input, and achieves efficient screening of molecules.
Compared with the prior art, the invention has the following advantages:
(1) By inputting the qualitative descriptor and the quantitative descriptor simultaneously, the invention effectively addresses the lack of physicochemical properties when only the molecular fingerprint (qualitative descriptor) is input, and the lack of molecular structure fragments when only the molecular attributes (quantitative descriptor) are input. At the same time, the strong information-extraction capability of the deep learning neural network framework allows the model to mine both the structural fragment information carried by molecular fingerprints and the physicochemical features provided by molecular attributes, achieving a highest validation accuracy of 92.86%. When the model was deployed for screening, all 8 new molecules of the material were screened correctly. Compared with the highest validation accuracy of 82.14% obtained with ECFP fingerprints in a DNN model, the qualitative and quantitative model screens new molecules more efficiently and accurately and overcomes the limitations of traditional screening.
(2) The qualitative and quantitative model established by the invention takes a qualitative descriptor and a quantitative descriptor as simultaneous inputs and can achieve high-accuracy molecular prediction with a small-sample data set.
Drawings
FIG. 1 is an algorithm flow chart of the molecular screening method;
FIG. 2 is a hash mapping diagram of the qualitative descriptor molecular fingerprint index data;
FIG. 3 is a heat map of the Pearson correlation coefficient matrix for the quantitative descriptor molecular attributes;
FIG. 4 is a schematic diagram of the qualitative and quantitative model;
FIG. 5 shows the accuracy and loss learning curves of the qualitative and quantitative model with the highest validation accuracy; a is the training accuracy and validation accuracy curve, and b is the training loss and validation loss curve;
FIG. 6 shows the predictions for new molecules made by the qualitative and quantitative model with the highest validation accuracy;
FIG. 7 compares the validation accuracy of the qualitative and quantitative model with ECFP fingerprint and 65 attributes as input against the DNN model with ECFP fingerprint as input;
FIG. 8 compares the validation accuracy of the qualitative and quantitative model with FP2 fingerprint and 65 attributes as input against the DNN model with FP2 fingerprint as input;
FIG. 9 compares the validation accuracy of the qualitative and quantitative model with MACCS fingerprint and 65 attributes as input against the DNN model with MACCS fingerprint as input;
FIG. 10 compares the validation accuracy of the qualitative and quantitative model with ECFP fingerprint and 65 attributes as input against the DNN model with the 65 attributes as input;
FIG. 11 compares the validation accuracy of the qualitative and quantitative model with MACCS fingerprint and 65 attributes as input against the DNN model with the 65 attributes as input;
FIG. 12 compares the validation accuracy of the qualitative and quantitative model with FP2 fingerprint and 65 attributes as input against the DNN model with the 65 attributes as input.
Detailed Description
The present invention will be described in detail with reference to specific examples.
Example 1
A molecular screening method based on a deep-learning qualitative and quantitative model is implemented by the flow shown in FIG. 1. The method comprises: constructing and preprocessing a data set, performing feature engineering on the data set, constructing and training the qualitative and quantitative model, and deploying the model for predictive screening.
1. Constructing and preprocessing the data set specifically comprises the following steps:
(1) The molecular structure is converted into SMILES. The molecules used for training the model and their structural formulas are first obtained; then, following the SMILES conversion rules developed by David Weininger in 1988 and taking the additives in perovskite FAPbI3 light-emitting diodes as an example, the structural formula of each additive molecule is converted into SMILES.
(2) The SMILES are converted into qualitative and quantitative descriptors, respectively. Depending on the algorithm of each molecular fingerprint, the fingerprint algorithms packaged in two Python cheminformatics toolkits, RDKit and Open Babel, are called with the SMILES as input to generate MACCS, FP2 and ECFP molecular fingerprints; the SMILES are then input into RDKit to generate 208 attributes for each molecule, and the qualitative and quantitative descriptor data set is thereby constructed.
2. Feature engineering of the data set specifically comprises the following steps:
(1) Hash mapping is applied to the qualitative descriptor molecular fingerprint index data. The molecular fingerprint index data obtained are a series of digital codes; each digital code is mapped by a hash algorithm to the position in the molecular fingerprint to which it corresponds. The mapping process is shown in FIG. 2;
(2) A variance threshold and a Pearson correlation coefficient matrix algorithm are applied to the quantitative descriptor molecular attributes. Attributes with low variance carry little chemical information, so variance-threshold selection is applied to the 208 attributes of the quantitative descriptors, deleting 107 attributes and retaining 101. Strongly correlated attributes are redundant, and deleting them effectively reduces information redundancy, so a further 36 attributes are deleted using the Pearson correlation coefficient matrix algorithm, leaving 65 attributes for training the model. The Pearson correlation coefficient matrix is shown in FIG. 3;
(3) The qualitative and quantitative descriptor data sets are split at a training set to test set ratio of 9:1, i.e. 90% of the data is used for training and 10% for testing. Each data set contains 121 training samples and 14 test samples.
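The two-stage attribute filtering in item (2) could look roughly like the following sketch. Everything specific in it is an assumption made for illustration: the function names, the attribute names, the toy values and both thresholds (the patent gives the attribute counts 208, 101 and 65 but not the cut-off values), and the greedy "keep the first of each correlated pair" rule.

```python
from statistics import pvariance

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (norm_x * norm_y)

def select_attributes(columns, var_threshold, corr_threshold):
    """Return names of attributes kept after variance and correlation filtering."""
    # Stage 1: drop attributes whose variance does not exceed the threshold.
    informative = [name for name, values in columns.items()
                   if pvariance(values) > var_threshold]
    # Stage 2: greedily keep an attribute only if it is not strongly
    # correlated with any attribute that has already been kept.
    selected = []
    for name in informative:
        if all(abs(pearson(columns[name], columns[kept])) < corr_threshold
               for kept in selected):
            selected.append(name)
    return selected

# Toy attribute table for four molecules (hypothetical names and values).
columns = {
    "mol_weight": [100.0, 150.0, 200.0, 250.0],
    "heavy_atoms": [7.0, 10.5, 14.0, 17.5],   # proportional to mol_weight
    "h_donors": [1.0, 0.0, 2.0, 1.0],
    "constant": [1.0, 1.0, 1.0, 1.0],          # zero variance
}
kept = select_attributes(columns, var_threshold=0.01, corr_threshold=0.9)
print(kept)  # ['mol_weight', 'h_donors']
```

In this toy table the constant column is removed by the variance stage and the column proportional to `mol_weight` by the correlation stage, mirroring the 208 to 101 to 65 reduction described above on a much smaller scale.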
3. Constructing and training the qualitative and quantitative model specifically comprises the following steps:
(1) When the qualitative and quantitative model is constructed, the qualitative descriptor (molecular fingerprint) and the quantitative descriptor (molecular attributes) are selected as joint input. A five-layer network structure is set up in the model as shown in FIG. 4. The input layer has L + 65 neurons (L is the bit length of the molecular fingerprint and 65 is the number of input attributes). The second and third layers are dense layers with 12 and 4 neurons, respectively, and a Rectified Linear Unit (ReLU) activation function; the dense layers map the features extracted from the molecular fingerprint and the molecular attributes to the output space through a nonlinear transformation. The fourth layer merges the outputs of the third layer with a Concatenate function, and the final layer produces the output with a Sigmoid function. The mean squared error is selected as the loss function and the Adam algorithm as the optimizer, with a learning rate of 0.001; the batch size and the number of training epochs are set to 8 and 200, respectively. To obtain a model with high validation accuracy and the best generalization ability, an early-stopping function is set in the model to monitor the validation loss, with the early-stopping patience set to 10 epochs.
(2) When the qualitative and quantitative model is trained, the training set is divided into 16 batches according to the batch size and input into the model; the optimizer and the loss function are used to update the model weights, and in each training epoch the test set data are input into the model to obtain its loss value and accuracy, which guide training and help prevent over-fitting or under-fitting. The learning curves of the model with the highest validation accuracy are shown in FIG. 5, where the validation accuracy reaches 92.86%. The right side of FIG. 5 shows the loss value against the number of training epochs, where Training loss is the training loss and Validation loss is the validation loss. Because the early-stopping function monitors the validation loss, training stopped after a total of 41 epochs.
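To make the data flow through the five layers concrete, here is a forward-pass sketch in plain Python. It rests on one reading of the description that is an assumption: the fingerprint and the attributes each pass through their own 12-unit and 4-unit ReLU dense layers, and the two 4-unit outputs are concatenated before the Sigmoid output layer. The fingerprint length of 1024, the 2 output units (taken from the preferred range stated earlier), the random weights and all function names are likewise illustrative; no training is performed.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W @ inputs + b), W given row-wise."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(rng, n_in, n_out):
    """Small random weights and zero biases (stand-ins for trained parameters)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_model(fp_len, n_attrs=65, seed=0):
    rng = random.Random(seed)
    return {
        "fp": [random_layer(rng, fp_len, 12), random_layer(rng, 12, 4)],
        "attr": [random_layer(rng, n_attrs, 12), random_layer(rng, 12, 4)],
        "out": random_layer(rng, 8, 2),  # 4 + 4 concatenated units -> 2 outputs
    }

def forward(model, fingerprint, attributes):
    h_fp, h_attr = fingerprint, attributes
    for weights, biases in model["fp"]:      # layers 2-3, fingerprint branch
        h_fp = dense(h_fp, weights, biases, relu)
    for weights, biases in model["attr"]:    # layers 2-3, attribute branch
        h_attr = dense(h_attr, weights, biases, relu)
    merged = h_fp + h_attr                   # layer 4: concatenation
    weights, biases = model["out"]
    return dense(merged, weights, biases, sigmoid)  # layer 5: Sigmoid output

model = build_model(fp_len=1024)
fingerprint = [i % 2 for i in range(1024)]   # dummy bit vector
attributes = [0.5] * 65                      # dummy scaled attribute values
outputs = forward(model, fingerprint, attributes)
print(outputs)
```

In practice such a network would be built and trained in a deep learning framework; the point of the sketch is only that the input width is L + 65 and that every output lies strictly between 0 and 1 because of the Sigmoid.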
4. Deploying the model for predictive screening specifically comprises the following steps:
(1) Based on the construction and training of the qualitative and quantitative model in step 3, the model with the highest validation accuracy is retained after repeated iterative training. Then 8 new molecules that do not appear in the training set are randomly selected and converted into SMILES, and the SMILES are converted into qualitative and quantitative descriptors, with preprocessing and feature engineering steps exactly the same as for the training data.
(2) The qualitative and quantitative model with the highest validation accuracy, which takes the ECFP fingerprint and molecular attributes as joint input, is used to predict the 8 new molecules, and all prediction results are correct. The prediction results are shown in FIG. 6.
Example 2
When the qualitative and quantitative model is constructed, the FP2 fingerprint and the 65 attributes are selected as joint input, and a five-layer network is set up in the model as shown in FIG. 4; the specific construction and training of the model are as described in Example 1.
In this case the validation accuracy of the qualitative and quantitative model reaches 85.71%; compared with the 75.00% validation accuracy reached by the DNN model with FP2 fingerprint input alone, this is an improvement of 10.71 percentage points. The accuracy comparison is shown in FIG. 8.
Example 3
When the qualitative and quantitative model is constructed, the MACCS fingerprint and the 65 attributes are selected as joint input, and a five-layer network is set up in the model as shown in FIG. 4; the specific construction and training of the model are as described in Example 1.
In this case the validation accuracy of the qualitative and quantitative model reaches 85.71%; compared with the 71.43% validation accuracy reached by the DNN model with MACCS fingerprint input alone, this is an improvement of 14.28 percentage points. The accuracy comparison is shown in FIG. 9.
Example 4
When the qualitative and quantitative model is constructed, the ECFP fingerprint and the 65 attributes are selected as joint input, and a five-layer network is set up in the model as shown in FIG. 4; the specific construction and training of the model are as described in Example 1.
In this case the validation accuracy of the qualitative and quantitative model reaches 92.86%; compared with the 78.57% validation accuracy reached by the DNN model with the 65 attributes as input alone, this is an improvement of 14.29 percentage points. The accuracy comparison is shown in FIG. 10.
Example 5
When the qualitative and quantitative model is constructed, the FP2 fingerprint and the 65 attributes are selected as joint input, and a five-layer network is set up in the model as shown in FIG. 4; the specific construction and training of the model are as described in Example 1.
In this case the validation accuracy of the qualitative and quantitative model reaches 85.71%; compared with the 78.57% validation accuracy reached by the DNN model with the 65 attributes as input alone, this is an improvement of 7.14 percentage points. The accuracy comparison is shown in FIG. 11.
Example 6
When the qualitative and quantitative model is constructed, the MACCS fingerprint and the 65 attributes are selected as joint input, and a five-layer network is set up in the model as shown in FIG. 4; the specific construction and training of the model are as described in Example 1.
In this case the validation accuracy of the qualitative and quantitative model reaches 85.71%; compared with the 78.57% validation accuracy reached by the DNN model with the 65 attributes as input alone, this is an improvement of 7.14 percentage points. The accuracy comparison is shown in FIG. 12.
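The improvements quoted in Examples 2 to 6 are differences between two accuracy percentages. The short check below recomputes them from the figures stated in those examples; the variable names and the tuple layout are only for illustration.

```python
# (example, combined-input accuracy %, single-input DNN accuracy %)
results = [
    ("Example 2: FP2 + attributes vs FP2 only", 85.71, 75.00),
    ("Example 3: MACCS + attributes vs MACCS only", 85.71, 71.43),
    ("Example 4: ECFP + attributes vs attributes only", 92.86, 78.57),
    ("Example 5: FP2 + attributes vs attributes only", 85.71, 78.57),
    ("Example 6: MACCS + attributes vs attributes only", 85.71, 78.57),
]

# Absolute gain in accuracy, rounded to two decimals.
gains = [round(combined - baseline, 2) for _, combined, baseline in results]
for (name, _, _), gain in zip(results, gains):
    print(f"{name}: +{gain:.2f}")
```

The recomputed gains (10.71, 14.28, 14.29, 7.14 and 7.14) agree with the values stated in the examples and are absolute differences in accuracy, not relative improvements.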
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (9)

1. A molecular screening method based on a deep-learning qualitative and quantitative model, characterized by comprising the following steps: a molecular data set is constructed, preprocessed and subjected to feature engineering in sequence, and is then input into the qualitative and quantitative model for predictive screening to obtain a screening result for new molecules.
2. The method of claim 1, wherein the step of constructing and preprocessing the data set comprises converting the collected molecular structure into SMILES, and converting the SMILES into a qualitative descriptor and a quantitative descriptor, respectively.
3. The molecular screening method based on deep learning qualitative and quantitative model as claimed in claim 2, wherein the qualitative descriptor is molecular fingerprint and the quantitative descriptor is molecular attribute.
4. The molecular screening method based on deep learning qualitative and quantitative model according to claim 3, wherein the molecular fingerprint is a MACCS fingerprint, an FP2 fingerprint, an ECFP fingerprint or the like.
5. The molecular screening method based on deep learning qualitative and quantitative model as claimed in claim 1, wherein the feature engineering applies hash mapping to the qualitative descriptors and a variance threshold and a Pearson correlation coefficient matrix algorithm to the quantitative descriptors, and splits the qualitative and quantitative descriptor data sets at a training set : test set ratio of 9:1.
6. The molecular screening method based on the deep learning qualitative and quantitative model according to claim 1, wherein the qualitative and quantitative model is constructed and trained by the following specific steps: the model is configured according to the feature-engineered input data; the first layer is an input layer whose input data are the qualitative and quantitative descriptors, the middle layers are dense layers, and the last layer is an output layer with a Sigmoid function as the final output; the model parameters are tuned through a back-propagation optimization algorithm with performance evaluation, and training is iterated to obtain a model with high validation accuracy.
7. The molecular screening method based on deep learning qualitative and quantitative model according to claim 1, characterized in that the predictive screening specifically comprises:
(1) based on the constructed and trained qualitative and quantitative model, retaining the model with the highest validation accuracy after multiple rounds of iterative training; then randomly selecting N new molecules (N is usually 1 to 20) that do not appear in the training set and converting them into SMILES, wherein the preprocessing and feature engineering steps that convert the SMILES into qualitative and quantitative descriptors are exactly the same as those used for the training model;
(2) predicting the N new molecules with the qualitative and quantitative model that has the highest validation accuracy, using the ECFP fingerprint and the molecular attributes as combined input, to obtain the prediction results for the molecules to be predicted.
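The selection logic in claim 7 (keep the best-validated model, draw N unseen molecules, reuse the same preprocessing) can be sketched as below. Everything concrete here is a hypothetical stand-in: `featurize` replaces the real SMILES-to-descriptor pipeline with a toy carbon-fraction feature, `ThresholdModel` replaces a trained network, and the SMILES strings and labels are invented for the example.

```python
import random

def featurize(smiles):
    """Toy stand-in for the shared preprocessing and feature engineering path."""
    return sum(ch in "Cc" for ch in smiles) / len(smiles)

class ThresholdModel:
    """Hypothetical one-parameter classifier standing in for a trained network."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, smiles):
        # The same featurize() is used for training-time and new molecules.
        return int(featurize(smiles) >= self.threshold)

def validation_accuracy(model, labelled):
    """Fraction of (smiles, label) pairs that the model classifies correctly."""
    return sum(model.predict(s) == y for s, y in labelled) / len(labelled)

def keep_best(models, labelled):
    """Return the candidate with the highest validation accuracy."""
    return max(models, key=lambda m: validation_accuracy(m, labelled))

def pick_new_molecules(pool, training_set, n, seed=0):
    """Randomly choose n molecules from the pool that never appeared in training."""
    unseen = [s for s in pool if s not in training_set]
    return random.Random(seed).sample(unseen, n)

training_set = {"CCO", "CCN", "O=O", "N#N"}
validation = [("CCC", 1), ("CC=O", 1), ("O", 0), ("N=N", 0)]
# One candidate per (imagined) training run.
candidates = [ThresholdModel(t) for t in (0.25, 0.75, 1.5)]
best = keep_best(candidates, validation)

pool = ["CCO", "CCCC", "OO", "CCCl", "NN"]
new_molecules = pick_new_molecules(pool, training_set, n=3)
results = {s: best.predict(s) for s in new_molecules}
```

Comparing raw strings is only a rough test for "not in the training set", because one molecule can be written as several different SMILES strings; a real pipeline would canonicalize the SMILES before the membership check.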
8. The method of claim 1, wherein the data set comprises molecules selected from the group consisting of perovskite luminescent materials, additive molecules, and combinations thereof.
9. The molecular screening method based on deep learning qualitative and quantitative model according to claim 1, characterized in that the method specifically comprises the following steps:
step a: firstly, converting a collected molecular structural formula into SMILES, and then respectively converting the SMILES into qualitative descriptors and quantitative descriptors;
step b: mapping each numeric code of the converted qualitative descriptor to a position in the fingerprint through a hash algorithm; and applying a variance threshold and a Pearson correlation coefficient matrix algorithm to the 208 converted attribute features, deleting 143 redundant attributes and retaining 65 attributes for training the model;
step c: setting up a neural-network-based qualitative and quantitative model according to the input data of qualitative and quantitative descriptors, wherein the qualitative and quantitative model has a five-layer network architecture in which the first layer is an input layer that receives the molecular fingerprints and molecular attributes, the middle layer is a dense layer, and the last layer is an output layer that uses a Sigmoid function as the final output; inputting the molecular fingerprint and molecular attribute data into the qualitative and quantitative model for iterative training, optimizing by back propagation with an Adam optimizer, setting parameters such as the model learning rate and the number of neurons in each layer, evaluating performance, and finally obtaining the qualitative and quantitative model with the highest validation accuracy;
step d: selecting new molecules of the material, subjecting them to the feature engineering of step b, and inputting them into the neural network model obtained in step c for predictive screening to obtain the screening result for the new molecules.
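The hash mapping in step b, in which each numeric code is assigned a position in the fingerprint, can be sketched as follows. This is an illustrative assumption about how such a mapping might work, not the method disclosed in the application: the fingerprint length of 1024 bits, the use of SHA-256, and the identifier values are all made up for the example.

```python
import hashlib

def bit_position(identifier, n_bits):
    """Map an integer substructure identifier to a stable bit index."""
    digest = hashlib.sha256(str(identifier).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits

def fold_fingerprint(identifiers, n_bits=1024):
    """Set one bit per identifier; identifiers that collide share a bit."""
    bits = [0] * n_bits
    for identifier in identifiers:
        bits[bit_position(identifier, n_bits)] = 1
    return bits

# Made-up substructure identifiers; the repeated value maps to the same bit.
identifiers = [864662311, 2245384272, 864662311, 3217380708]
fp = fold_fingerprint(identifiers)
on_bits = [i for i, b in enumerate(fp) if b]
```

A cryptographic digest is used here so that the mapping is identical across runs and machines, which matters because new molecules in step d must land on the same bit positions as the training molecules. Shorter fingerprints save memory but raise the chance that two different substructures collide on one bit.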
CN202110914312.3A 2021-08-10 2021-08-10 Molecular screening method based on deep learning technology qualitative and quantitative model Pending CN113674807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110914312.3A CN113674807A (en) 2021-08-10 2021-08-10 Molecular screening method based on deep learning technology qualitative and quantitative model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110914312.3A CN113674807A (en) 2021-08-10 2021-08-10 Molecular screening method based on deep learning technology qualitative and quantitative model

Publications (1)

Publication Number Publication Date
CN113674807A (en) 2021-11-19

Family

ID=78542141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110914312.3A Pending CN113674807A (en) 2021-08-10 2021-08-10 Molecular screening method based on deep learning technology qualitative and quantitative model

Country Status (1)

Country Link
CN (1) CN113674807A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777986A (en) * 2016-12-19 2017-05-31 南京邮电大学 Ligand molecular fingerprint generation method based on depth Hash in drug screening
US20200176087A1 (en) * 2018-12-03 2020-06-04 Battelle Memorial Institute Method for simultaneous characterization and expansion of reference libraries for small molecule identification
CN110444250A (en) * 2019-03-26 2019-11-12 广东省微生物研究所(广东省微生物分析检测中心) High-throughput drug virtual screening system based on molecular fingerprint and deep learning
CN111798935A (en) * 2019-04-09 2020-10-20 南京药石科技股份有限公司 Universal compound structure-property correlation prediction method based on neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
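Zhou Shiying et al.: "Research on drug molecule classification and virtual screening based on deep learning", China Master's Theses Full-text Database, pages 11-51 *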

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818948A (en) * 2022-05-05 2022-07-29 北京科技大学 Data-mechanism driven material attribute prediction method of graph neural network
CN114818948B (en) * 2022-05-05 2023-02-03 北京科技大学 Data-mechanism driven material attribute prediction method of graph neural network
CN116646024A (en) * 2023-07-26 2023-08-25 苏州创腾软件有限公司 Open loop polymerization enthalpy prediction method and device based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination