CN115050428A - Drug property prediction method and system based on deep learning fusion molecular graph and fingerprint - Google Patents


Info

Publication number
CN115050428A
Authority
CN
China
Prior art keywords
molecular
model
drug
module
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210654644.7A
Other languages
Chinese (zh)
Inventor
蔡涵萱
王领
巫景行
李奕锐
罗海林
刘政豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202210654644.7A
Publication of CN115050428A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a system for predicting drug properties based on deep-learning fusion of molecular graphs and fingerprints. The prediction method comprises the following steps: for the prediction of different drug properties, acquiring a targeted data set; constructing a deep learning model suitable for drug property prediction; selecting a splitting mode according to the requirements of model construction, and splitting the data set into a training set, a validation set and a test set; inputting the data set into the network model, training and updating the network parameters according to the difference between the predictions on the training set and its true values, determining the optimal network parameters according to the best result on the validation set, and evaluating the model on the test set; determining the optimal hyper-parameter combination of the model according to a hyper-parameter optimization strategy; and generating a targeted optimal model for each drug property, to be applied subsequently to predicting the properties of new small-molecule drugs. By integrating classical molecular fingerprint features, the method overcomes the inability of deep learning to extract important features effectively from small-scale data sets.

Description

Drug property prediction method and system based on deep learning fusion molecular graph and fingerprint
Technical Field
The invention relates to the technical field of drug property prediction using deep learning, and in particular to a method and a system for predicting drug properties based on deep-learning fusion of molecular graphs and fingerprints.
Background
Cancer is one of the major diseases that currently endanger human health and life. According to the 2020 world cancer report issued by the International Agency for Research on Cancer (IARC) under the World Health Organization, there were 18.1 million new cancer cases and 9.6 million cancer deaths worldwide in 2018. The 2018 cancer statistics show that China ranks first in the world in both cancer incidence and cancer mortality: of the 18.1 million new cases, 3.804 million occurred in China, and of the 9.6 million cancer deaths, 2.296 million occurred in China. Cancer prevention and treatment have become an important public health problem in China. For this reason, the urgent need for cancer drug development has been emphasized in successive national major new drug creation implementation plans.
From the perspective of traditional drug molecule design, accurate prediction of molecular properties, including physicochemical properties, biological activity, and ADME/T (absorption, distribution, metabolism, excretion and toxicity) properties, is a fundamental challenge. Computer-aided drug design has been developed and applied for some time, and quantitative structure-activity (property) relationship (QSAR/QSPR) modeling is one of the most widely used and well-established computational methods for molecular property prediction. QSAR/QSPR modeling fits known data with empirical, linear or non-linear functions to estimate the activity or properties of unknown chemical structures, and the resulting models are then used to predict and design new molecules with the desired functional properties. As a predecessor of artificial intelligence in today's drug development field, the QSAR/QSPR model could not be widely applied thirty years ago because computing hardware and experimental data were lacking. With the continuous accumulation of experimental data (such as chemical, biological and pharmacological data) and the upgrading of hardware, Artificial Intelligence (AI) and Machine Learning (ML) algorithms have produced many successful cases in drug development. They are now considered indispensable tools for building QSAR/QSPR models and help to predict and evaluate the physicochemical, biological and ADME/T characteristics of small molecules quickly and reliably in drug development practice.
Generally, ML-based QSAR/QSPR modeling depends heavily on a suitable molecular representation. The representations in common use fall into three major categories: molecular descriptors, molecular fingerprints and molecular graphs. Molecular descriptors and fingerprints are derived from human expert knowledge and are used to describe the structural, physicochemical and topological features of molecules. The molecular graph representation generally appears in Deep Learning (DL) based methods: the atoms and bonds of a molecule are regarded as nodes and edges, and the combined node and edge information is fed into a deep neural network as its learning material. Both the traditional ML-based approach and the DL-based approach proposed in recent years have produced many successful cases in drug development, but whether graph-based DL models outperform traditional descriptor-based ML models remains controversial. Studies have reported that graph-based DL models still have potential limitations when data are scarce. During development, the present invention hypothesized and verified that the information captured by graph-based and fingerprint-based molecular representations is different and complementary.
Deep learning in pharmaceutical research is driven by data and is also limited by data. Drug development data are difficult to accumulate because experimental conditions vary from source to source and are hard to unify, and because the data are noisy. Over decades of traditional drug development, the application of high-throughput screening and combinatorial chemistry has brought drug development data to the threshold of "big data", but because of the nature of the data types, artificial intelligence in the biomedical industry still mostly faces small-data problems once the data have been standardized. Therefore, in the current "data transition period" of the biomedical field, combining the respective advantages of traditional machine learning and deep learning, together with the complementary information each captures, is an innovative strategy, and research and verification show that it achieves higher prediction accuracy than existing algorithms.
Disclosure of Invention
The present invention aims to overcome the above-mentioned drawbacks of the prior art and provide a method and a system for predicting drug properties based on a deep learning fusion molecular graph and a fingerprint.
The purpose of the invention can be realized by the following technical scheme.
The method for predicting drug properties based on deep-learning fusion of molecular graphs and fingerprints is used for rapid property prediction of small-molecule drugs and comprises the following steps:
1) for the prediction of different drug properties, acquiring a targeted, specific data set containing a large amount of small-molecule drug data;
2) constructing a deep learning model suitable for drug property prediction, wherein the model extracts molecular graph features and molecular fingerprint features in separate modules and finally combines them through fully connected layers to form a fusion neural network framework;
3) selecting a splitting mode according to the requirements of model construction, and splitting the data set into a training set, a validation set and a test set;
4) inputting the data set into the network model, training and updating the network parameters according to the difference between the predictions on the training set and its true values, determining the optimal network parameters according to the best result on the validation set, and evaluating the model on the test set;
5) determining the optimal hyper-parameter combination of the model according to a hyper-parameter optimization strategy;
6) for the prediction of different drug properties, generating a targeted optimal model for subsequent application to predicting the properties of new small-molecule drugs;
7) for the generated optimal model, providing interpretability analysis for reference in subsequent drug design.
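Step 4) can be illustrated with a minimal sketch of validation-based model selection. The one-variable linear model, the synthetic data and the function names `mse` and `train_with_validation` are assumptions made only for illustration; they are not the network or the data of the invention.

```python
import random

def mse(w, b, data):
    """Mean squared error of the toy model y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train_with_validation(train, valid, epochs=500, lr=0.05):
    """Update parameters on the training set; keep those scoring best on validation."""
    w, b = 0.0, 0.0
    best = (float("inf"), w, b)
    for _ in range(epochs):
        # Gradients of the training MSE with respect to w and b.
        gw = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
        gb = sum(2 * (w * x + b - y) for x, y in train) / len(train)
        w, b = w - lr * gw, b - lr * gb
        val_loss = mse(w, b, valid)
        if val_loss < best[0]:
            best = (val_loss, w, b)
    return best

# Synthetic data (hypothetical): y = 2x + 1 plus small Gaussian noise.
rng = random.Random(0)
points = [(i / 10, 2.0 * (i / 10) + 1.0 + rng.gauss(0, 0.05)) for i in range(30)]
rng.shuffle(points)
train, valid, test = points[:20], points[20:25], points[25:]

val_loss, w, b = train_with_validation(train, valid)
test_loss = mse(w, b, test)  # the test set is used only for the final check
print(f"validation MSE={val_loss:.4f}, test MSE={test_loss:.4f}")
```

The test set plays no part in choosing the parameters here; it is touched once, after the validation set has selected them.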
In the step 1), obtaining a data set for training specifically comprises the following steps:
1-1) for drug properties for which an accepted classical data set exists in the industry, using that classical data set to construct the model;
1-2) for drug properties for which no accepted classical data set exists in the industry, collecting targeted drug activity data from laboratory records maintained in medicinal chemistry or biological laboratories, from compound activity data provided by publicly available online medicinal chemistry databases, or from databases obtained in other ways, and preprocessing the data for model construction.
In the step 1-2), preprocessing the acquired raw drug activity data set specifically comprises the following steps:
1-2-1) obtaining targeted raw drug activity data from the various sources;
1-2-2) checking the small drug molecules for duplicates, and averaging the activity data of duplicated molecules;
1-2-3) performing hydrogen-ion removal, salt-ion removal, structural force-field optimization and the like on the small drug molecules;
1-2-4) for regression tasks, retaining the specific activity values; for classification tasks, labeling the small drug molecules as positive or negative according to a specified threshold;
1-2-5) presenting the data set as the simplified molecular-input line-entry system (SMILES) strings of the small drug molecules together with their corresponding target values.
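Steps 1-2-2), 1-2-4) and 1-2-5) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `preprocess` and the toy records are hypothetical, the SMILES strings are assumed to be already standardized (step 1-2-3 is not shown), and the labeling direction (activity at or below the threshold counts as positive) follows the convention used for inhibitors in Example 1.

```python
from collections import defaultdict
from statistics import mean

def preprocess(records, task="regression", threshold=None):
    """Average duplicate measurements per SMILES, then keep or binarize the value.

    records: iterable of (smiles, activity) pairs, SMILES already standardized.
    threshold: classification only; activity <= threshold is labeled 1, else 0.
    """
    grouped = defaultdict(list)
    for smiles, activity in records:
        grouped[smiles].append(activity)  # step 1-2-2: collect duplicates

    dataset = []
    for smiles, values in grouped.items():
        value = mean(values)  # step 1-2-2: average repeated molecules
        if task == "classification":
            if threshold is None:
                raise ValueError("classification requires a threshold")
            target = 1 if value <= threshold else 0  # step 1-2-4
        else:
            target = value  # step 1-2-4: regression keeps the activity value
        dataset.append((smiles, target))  # step 1-2-5: SMILES plus target value
    return dataset

# Hypothetical activity records in micromolar units.
raw = [("CCO", 2.0), ("CCO", 4.0), ("c1ccccc1", 50.0)]
print(preprocess(raw, task="regression"))
print(preprocess(raw, task="classification", threshold=10.0))
```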
In the step 2), the constructed model specifically comprises the following key points:
2-1) the feature extraction part of the model contains two modules, one extracting molecular graph features and one extracting molecular fingerprint features, which separately extract features of the small drug molecule and generate the corresponding feature vectors;
2-2) the module that extracts the molecular graph features adopts a graph attention network structure; a molecular graph is generated from the input SMILES string by mapping the atoms of the molecule to nodes of the graph and the chemical bonds to edges of the graph, and the physicochemical properties of the atoms and chemical bonds are calculated as the initial feature vectors of the nodes and edges; the attention mechanism in the network focuses on the influence between adjacent atoms, that is, the attention between adjacent atoms, and is used to iteratively update the feature vectors of the atoms in the molecule; after the iterative updates are finished, the feature vectors of all atoms are combined and output as the feature vector of the molecular graph;
2-3) the module that extracts the molecular fingerprint features adopts several fully connected layers; three different types of molecular fingerprints are generated from the input SMILES string: the substructure-based molecular fingerprint MACCS FP, the substructure-based molecular fingerprint PubChem FP, and the pharmacophore-based molecular fingerprint ErG FP; the three fingerprints are concatenated and input into the fully connected network of the module to obtain the feature vector of the molecular fingerprints;
2-4) the feature vectors generated by the two modules are concatenated and input into multiple fully connected layers to predict the property of the small drug molecule and generate the final prediction result.
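The data flow of points 2-3) and 2-4) can be sketched in plain Python. Everything concrete here is an assumption for illustration: the toy fingerprint lengths, the layer sizes, the random initialization, the ReLU activation and the stand-in `graph_vec`. Real MACCS, PubChem and ErG fingerprints are much longer and would be computed by a cheminformatics toolkit.

```python
import random

def linear(vec, weights, bias):
    """Fully connected layer: out[j] = sum_i vec[i] * weights[i][j] + bias[j]."""
    return [sum(v * row[j] for v, row in zip(vec, weights)) + bias[j]
            for j in range(len(bias))]

def relu(vec):
    return [max(0.0, v) for v in vec]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (toy initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out

rng = random.Random(0)

# Toy stand-ins for the three fingerprints (hypothetical lengths and values).
maccs_fp = [1, 0, 1, 0]
pubchem_fp = [0, 1, 1, 0, 1]
erg_fp = [0, 2, 1]

# Point 2-3): concatenate the fingerprints, then a fully connected layer.
fp_input = maccs_fp + pubchem_fp + erg_fp
fp_w, fp_b = init_layer(len(fp_input), 6, rng)
fp_vec = relu(linear(fp_input, fp_w, fp_b))

# Stand-in for the output of the molecular graph module (point 2-2).
graph_vec = [0.2, 0.0, 0.4, 0.1, 0.3, 0.1]

# Point 2-4): concatenate both feature vectors and map them to one prediction.
fused = graph_vec + fp_vec
out_w, out_b = init_layer(len(fused), 1, rng)
prediction = linear(fused, out_w, out_b)[0]
print(len(fp_input), len(fused), prediction)
```

A classification model would typically pass `prediction` through a sigmoid; that choice is left open here because the text does not specify the output activation.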
In the step 2-2), the molecular graph feature extraction module of the model extracts the molecular graph features through the following steps:
2-2-1) calculating the physicochemical properties of each atom as the initial feature vector of the corresponding node in the molecular graph; the physicochemical properties specifically include: atom type (carbon, nitrogen, oxygen, fluorine and other elements with atomic numbers up to one hundred), number of attached chemical bonds, charge, chirality, number of attached hydrogen atoms, hybridization, atomic mass, aromaticity, and the like;
2-2-2) calculating the attention between adjacent atoms, and iteratively updating the atom representations according to the attention, with the following expressions:
$$e_{ij} = \mathrm{LeakyReLU}\left(a \cdot \left[W_1 h_i \,\|\, W_1 h_j\right]\right)$$
[Equation image BDA0003688848420000031, not reproduced in the text: normalization of the attention values e_ij over the neighbors of atom i to give α_ij]
[Equation image BDA0003688848420000032, not reproduced in the text: multi-head update of atom i giving h_i']
wherein h_i and h_j are the pre-iteration feature vectors of adjacent atoms i and j, W_1 is a weight matrix, and α_ij is an attention weight; the attention value calculated between adjacent atoms i and j is e_ij; the attention values e_ik between atom i and each of its adjacent atoms k are calculated and summed, from which the attention influence of atom j on atom i is obtained; before atom i is updated, the attention values of all neighbors of i are normalized to obtain α_ij; the number of attention heads is K: the multi-head attention mechanism repeats the attention calculation K times, and the average of the K results is used to update atom i, giving the iterated feature vector h_i';
2-2-3) calculating the feature vector of the molecular graph, with the following expression:
[Equation image BDA0003688848420000041, not reproduced in the text: average of the updated atom feature vectors h_i' over the N atoms]
wherein N is the total number of atoms in the molecule and h_i' is the feature vector of atom i after the iterative updates are finished; the atom feature vectors are averaged to give the feature vector of the molecule.
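Steps 2-2-2) and 2-2-3) can be sketched in plain Python. Because two of the equation images are not available in the text, this sketch assumes the standard graph attention forms: softmax normalization of e_ij over the neighbors of atom i, and an update that takes the attention-weighted sum of the transformed neighbor vectors. The LeakyReLU slope of 0.2, the toy dimensions and all function names are likewise assumptions, not details confirmed by the source.

```python
import math
import random

def matvec(W, h):
    """Multiply matrix W (a list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_head(h, neighbors, W, a):
    """One attention head over a molecular graph.

    h[i] is the feature vector of atom i; neighbors[i] lists the atoms bonded
    to atom i (every atom is assumed to have at least one neighbor).
    Returns the updated atom vectors and the normalized weights alpha[i][j].
    """
    Wh = [matvec(W, hi) for hi in h]
    updated, alpha = [], []
    for i, nbrs in enumerate(neighbors):
        # e_ij = LeakyReLU(a . [W h_i || W h_j]) for each neighbor j of atom i.
        e = [leaky_relu(sum(p * q for p, q in zip(a, Wh[i] + Wh[j]))) for j in nbrs]
        # Assumed softmax normalization over the neighbors of atom i.
        peak = max(e)
        weights = [math.exp(v - peak) for v in e]
        total = sum(weights)
        alpha_i = {j: w / total for j, w in zip(nbrs, weights)}
        alpha.append(alpha_i)
        # Assumed update: attention-weighted sum of transformed neighbor vectors.
        updated.append([sum(alpha_i[j] * Wh[j][d] for j in nbrs)
                        for d in range(len(W))])
    return updated, alpha

def multi_head_update(h, neighbors, heads):
    """Average the outputs of K heads, each given as a (W, a) pair."""
    outputs = [attention_head(h, neighbors, W, a)[0] for W, a in heads]
    dim = len(outputs[0][0])
    return [[sum(out[i][d] for out in outputs) / len(heads) for d in range(dim)]
            for i in range(len(h))]

def readout(h):
    """Step 2-2-3): molecule vector as the mean of the atom vectors."""
    return [sum(hi[d] for hi in h) / len(h) for d in range(len(h[0]))]

rng = random.Random(0)
in_dim, out_dim, K = 4, 3, 2
h0 = [[rng.random() for _ in range(in_dim)] for _ in range(3)]  # three atoms
neighbors = [[1], [0, 2], [1]]  # a three-atom chain, e.g. the heavy atoms of CCO
heads = [([[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)],
          [rng.uniform(-1, 1) for _ in range(2 * out_dim)]) for _ in range(K)]

h1 = multi_head_update(h0, neighbors, heads)
mol_vec = readout(h1)
print(mol_vec)
```

Only one update iteration is shown; repeating it would need weight matrices whose input dimension matches the previous output dimension.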
In the step 3), splitting the data set required for model construction specifically involves the following:
3-1) the splitting mode and the splitting ratio of the model can be defined by the user;
3-2) the built-in splitting modes of the model are random splitting and scaffold splitting; in random splitting, the data set is shuffled and split at random; in scaffold splitting, the scaffolds of the small drug molecules in the data set and the number of molecules belonging to each scaffold are calculated first, the scaffolds with the fewest molecules are then assigned, together with their molecules, to the validation set and the test set in turn until both sets contain enough molecules, and all remaining molecules are assigned to the training set; scaffold splitting ensures that the molecular scaffolds in the training set, the validation set and the test set do not overlap, which places higher demands on the predictive ability of the model and helps the model discover drug molecules with novel scaffolds.
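The scaffold split described in 3-2) can be sketched as follows. The sketch assumes that a scaffold string has already been computed for every molecule (for example, a Murcko scaffold from a cheminformatics toolkit); the function name `scaffold_split`, the fraction parameters and the order in which the validation set is filled before the test set are illustrative assumptions.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_valid=0.1, frac_test=0.1):
    """Split molecule indices so that no scaffold is shared between subsets.

    scaffolds[i] is the precomputed scaffold string of molecule i.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Smallest scaffold groups first; ties broken by scaffold string so the
    # split is reproducible.
    ordered = sorted(groups.items(), key=lambda kv: (len(kv[1]), kv[0]))

    n_valid = round(frac_valid * len(scaffolds))
    n_test = round(frac_test * len(scaffolds))
    train, valid, test = [], [], []
    for _, members in ordered:
        # A whole group always goes to one subset, so a subset may slightly
        # exceed its target size.
        if len(valid) < n_valid:
            valid.extend(members)
        elif len(test) < n_test:
            test.extend(members)
        else:
            train.extend(members)
    return train, valid, test

# Hypothetical scaffold labels for ten molecules.
scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, valid, test = scaffold_split(scaffolds)
print(train, valid, test)
```

Because rare scaffolds are sent to the validation and test sets, the model is evaluated on chemotypes it has not seen during training, which is the harder setting the text describes.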
In the step 5), the hyper-parameter optimization of the model specifically comprises the following:
5-1) six hyper-parameters are built into the model: the dropout rate of the molecular graph module, the number of attention heads of the molecular graph module, the number of attention iterations of the molecular graph module, the dropout rate of the molecular fingerprint module, the feature vector dimension of the molecular fingerprint module, and the ratio of the molecular graph vector to the molecular fingerprint vector at the input of the fully connected layers of the fusion module;
5-2) hyper-parameter optimization is carried out on the model by Bayesian optimization for 20 rounds, and the group of hyper-parameters with the best evaluation score on the test set is selected.
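The 20-round search over the six hyper-parameters can be sketched as a simple loop. Two things here are deliberately not the method of the text: uniform random sampling stands in for the Bayesian proposal step, and `toy_objective` stands in for "train the model and return its evaluation score". The candidate values in `SEARCH_SPACE` are hypothetical as well.

```python
import random

# Hypothetical candidate values for the six built-in hyper-parameters.
SEARCH_SPACE = {
    "graph_dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "attention_heads": [2, 4, 6, 8],
    "attention_iterations": [2, 3, 4, 5],
    "fp_dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "fp_hidden_dim": [128, 256, 512],
    "graph_fp_ratio": [0.25, 0.5, 0.75],
}

def search(objective, rounds=20, seed=0):
    """Evaluate `rounds` configurations and return the best one with its score.

    Random sampling is a stand-in: a Bayesian optimizer would instead propose
    each new configuration from the scores observed so far.
    """
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(rounds):
        params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

def toy_objective(params):
    """Stand-in for training the model and scoring it; higher is better."""
    return -abs(params["graph_dropout"] - 0.2) - abs(params["graph_fp_ratio"] - 0.5)

best_params, best_score = search(toy_objective)
print(best_params, best_score)
```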
In the step 6), the prediction application of the model specifically comprises the following steps:
6-1) generating an optimal prediction model for the specific drug property according to the optimal hyper-parameter combination selected in the step 5);
6-2) when predicting drug molecules with unknown properties, loading the corresponding optimal model and inputting the SMILES strings of the molecules into the model to obtain their prediction results;
6-3) the model supports batch prediction of drug molecules with unknown properties, enabling rapid and efficient assessment of molecular properties.
In the step 7), the interpretability analysis of the model specifically comprises the following:
7-1) for the optimal prediction model for the specific drug property generated in the step 6), two model interpretation functions, fingerprint interpretation and molecular graph interpretation, are provided according to the user's input requirements;
7-2) when the user requests fingerprint interpretation, importance scores of the different fingerprint bits in the model are calculated; the higher the score, the larger the role that bit played in building the model, and the more important the intramolecular information it represents is for designing drug molecules with the specific drug property;
7-3) when the user requests molecular graph interpretation, the attention values within the molecular graph are calculated in the model and mapped onto the molecular graph; the higher the attention values of a group of atoms, the larger the role that structure played in building the model, and the more important it is for designing drug molecules with the specific drug property.
A system implementing the drug property prediction method based on deep-learning fusion of molecular graphs and fingerprints comprises:
a data preprocessing module, used for preprocessing the collected raw data set of chemical molecule activities, so that the model can be applied to the construction of data sets for new drug molecule properties;
a model construction module, used for modeling the processed samples with a deep learning model based on molecular graphs and molecular fingerprints; the deep learning model comprises a feature extraction module based on the molecular graph, a feature extraction module based on molecular fingerprints, and a fusion module; the feature extraction module based on the molecular graph adopts a graph attention network and focuses on the influence of the relationships between adjacent atoms on the molecular property; the feature extraction module based on molecular fingerprints extracts, from three different types of molecular fingerprints, the influence of molecular substructures and pharmacophores on the molecular property; the fusion module combines the feature vectors obtained by the two feature extraction modules and inputs the combined vector into a multi-layer fully connected network;
a prediction module, used for predicting new small drug molecules with the optimal model generated by the model construction module, so that the model is applied to the prediction of new drug molecules;
an interpretation module, used for performing interpretability analysis on small drug molecules with the optimal model generated by the model construction module, so that the model can give the user drug design suggestions for the specific drug property.
A non-transitory computer-readable storage medium has a computer program stored thereon, wherein the program, when executed by a processor, implements the method for predicting drug properties based on deep-learning fusion of molecular graphs and fingerprints and its application.
A computer device comprises a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor, when executing the program, implements the method for predicting drug properties based on deep-learning fusion of molecular graphs and fingerprints and its application.
Compared with the prior art, the invention has the following beneficial effects:
Compared with existing deep-learning methods for predicting the properties of small drug molecules, the method of the invention integrates classical molecular fingerprint features and thereby overcomes the inability of deep learning to extract important features effectively from small-scale data sets. Compared with traditional machine learning methods that rely on manually extracted molecular features, the method brings in the advantages of deep learning: the computer attends to the relationships among atoms in the molecular graph and extracts the structural features of the graph by its own computation, which overcomes the shortcomings of manual feature extraction. The invention can be used for properties for which no accepted classical data set exists: it provides a data preprocessing function that collects and processes raw molecular activity data and constructs a sample set for modeling. Once the targeted optimal model has been established, the method can predict the specified properties of molecules accurately and efficiently, thereby improving the efficiency of drug research and development and accelerating the virtual screening of small-molecule drugs.
Drawings
FIG. 1 is a schematic flow diagram of the drug property prediction method based on deep-learning fusion of molecular graphs and fingerprints (hereinafter "the method");
FIG. 2 is a schematic diagram of the default network structure of the drug property prediction deep learning model that fuses the molecular graph and fingerprints;
FIG. 3 is a process diagram of the training of the drug property prediction deep learning model that fuses the molecular graph and fingerprints;
FIG. 4 compares the prediction accuracy of the method with that of other drug property prediction methods based on artificial intelligence;
FIG. 5 shows the results of ablation experiments in which the neural networks of the molecular graph part and of the molecular fingerprint part of the method are used alone;
FIG. 6 compares the accuracy of the method with that of several currently leading drug property prediction methods on the classical unbiased data set LIT-PCBA;
FIG. 7 compares the model accuracy of the method when different molecular fingerprints are used;
FIG. 8a shows, in an application of the method, the effect on the expression intensity of proteins downstream of the CDK9 pathway in protein experiments on drug molecules predicted to be positive for CDK9 inhibitory activity;
FIG. 8b is a gray-scale statistical analysis of the downstream protein p-RNAP II CTD (Ser2) in the protein experiments on drug molecules predicted by the method to be positive for CDK9 inhibitory activity;
FIG. 8c is a gray-scale statistical analysis of the downstream protein Mcl-1 in the protein experiments on drug molecules predicted by the method to be positive for CDK9 inhibitory activity;
FIG. 8d is a gray-scale statistical analysis of the downstream protein cleaved PARP in the protein experiments on drug molecules predicted by the method to be positive for CDK9 inhibitory activity;
FIG. 8e shows the results of apoptosis experiments on MOLM-13 cells, which contain the CDK9 target, for drug molecules predicted by the method to be positive for CDK9 inhibitory activity;
FIG. 8f shows the results, with quantitative analysis, of apoptosis experiments on MOLM-13 cells containing the CDK9 target with pyritinol as the control group in the application of the method;
FIG. 9a is an interpretability analysis verification diagram for a small molecule predicted by the method to be negative for blood-brain barrier permeability;
FIG. 9b is an interpretability analysis verification diagram for a small molecule predicted by the method to be positive for blood-brain barrier permeability;
FIG. 10a is an interpretability analysis verification diagram for small molecule 1, predicted by the method to have inhibitory activity against the target of rapamycin (mTOR); the molecule contains a morpholine ring and a urea pharmacophore, which are known to play a key role and are given higher attention;
FIG. 10b is an interpretability analysis verification diagram for small molecule 2, predicted by the method to have inhibitory activity against mTOR; the molecule contains a bridged morpholine ring pharmacophore, which is known to play a key role and is given higher attention;
FIG. 10c is an interpretability analysis verification diagram for small molecule 3, predicted by the method to have inhibitory activity against mTOR; the molecule contains a morpholine ring pharmacophore on a pyrazolopyrimidine scaffold, which is known to play a key role and is given higher attention.
Detailed Description
The following is a specific example of the invention as carried out in the development team's laboratory. It illustrates how the invention is used and the logic behind it, and does not limit the scope of the invention; equivalent modifications made by those skilled in the art to the various selected materials of the invention fall within the scope of the appended claims.
Example 1
This example provides a drug property prediction method based on deep learning fusion of molecular graphs and fingerprints. Taking as an example the prediction of the inhibitory activity of small molecules against cyclin-dependent kinase family members (CDK1-9, 14, 19), the method comprises the following steps:
1) obtaining inhibitory activity data of a plurality of small drug molecules against cyclin-dependent kinase family members (CDK1-9, 14, 19) and constructing a data set, wherein constructing the data set comprises the following steps:
1-2) because no accepted, well-curated, classical data set is available in the field, data must be collected from experimental records kept by medicinal chemistry or biology laboratories, or from compound activity data provided by publicly available online medicinal chemistry databases, and then pre-processed.
1-2-1) in this embodiment, all activity data records for the targets are retrieved from the medicinal chemistry database ChEMBL according to the target IDs of CDK1-9, 14 and 19.
1-2-2) data screening. Only bioactivity records with assay type B and a reported activity type of IC50, EC50, Ki or Kd are kept. Small drug molecules with multiple activity records are deduplicated, and the activity data of each repeated molecule are averaged.
1-2-3) data standardization. The small-molecule structures are cleaned by removing hydrogen ions and salt ions, and are then optimized with a force field.
1-2-4) data annotation. In this embodiment the task is a classification task, so the molecules must be labeled uniformly according to a specified threshold. Here the threshold is 10 μM: small molecules with a measured activity ≤ 10 μM are labeled as inhibitors, and those with a measured activity > 10 μM are labeled as non-inhibitors.
1-2-5) obtaining the standardized data. After standardization, a total of 12532 compounds and their enzyme inhibitory activity data were obtained for the 11 CDK isoform proteins. For example, the CDK1 target has test records for 1871 compounds, of which 883 are labeled as non-inhibitors and 988 as inhibitors; the CDK2 target has 4305 compounds, of which 1598 are labeled as non-inhibitors and 2707 as inhibitors; and the CDK9 target has 1330 compounds, of which 243 are labeled as non-inhibitors and 1087 as inhibitors. The resulting standardized data set consists of the SMILES strings of the 12532 small drug molecules and their corresponding CDK isoform inhibitory activity labels.
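The screening, averaging and labeling of steps 1-2-2) and 1-2-4) can be sketched as follows. This is a minimal illustration only: the record layout `(smiles, assay_type, activity_type, value_in_uM)`, the function name `build_labelled_set`, the use of an arithmetic mean, and the toy SMILES strings are all assumptions made for the sketch and are not taken from the embodiment.

```python
from collections import defaultdict
from statistics import mean

# Activity types retained in step 1-2-2).
KEPT_ACTIVITY_TYPES = {"IC50", "EC50", "Ki", "Kd"}

def build_labelled_set(records, threshold_um=10.0):
    """Filter records, average repeated molecules, and label by threshold.

    `records` is a hypothetical iterable of
    (smiles, assay_type, activity_type, value_in_uM) tuples.
    """
    grouped = defaultdict(list)
    for smiles, assay_type, activity_type, value_um in records:
        # Keep only assay type "B" with one of the four activity types.
        if assay_type == "B" and activity_type in KEPT_ACTIVITY_TYPES:
            grouped[smiles].append(value_um)
    labelled = {}
    for smiles, values in grouped.items():
        avg = mean(values)  # assumption: arithmetic mean of repeated records
        labelled[smiles] = (avg, 1 if avg <= threshold_um else 0)  # 1 = inhibitor
    return labelled

# Toy records; the SMILES strings are placeholders, not real CDK data.
records = [
    ("CCO", "B", "IC50", 4.0),
    ("CCO", "B", "Ki", 8.0),
    ("CCN", "B", "IC50", 25.0),
    ("CCC", "F", "IC50", 0.1),     # other assay type: dropped
    ("CCC", "B", "Potency", 0.1),  # other activity type: dropped
]
dataset = build_labelled_set(records)
```

With `threshold_um=1.0` the same function would apply the 1 μM cut-off used for mTOR in Example 3.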
2) constructing a deep learning model suitable for drug property prediction. The model processes two kinds of features, the molecular graph and the molecular fingerprint, in separate modules and finally combines them through fully connected layers into a fused neural network framework. Specifically, in this embodiment, the molecular-graph feature extraction module adopts a deep learning network model based on a graph attention mechanism; for the molecular-fingerprint feature extraction module, the concatenated bit strings of three fingerprints, MACCS FP, PubChem FP and Pharmacophore ErG FP, are used as the molecular fingerprint representation and input into the module. The output vectors of the two feature extraction modules are concatenated and input into a fusion module composed of multiple fully connected layers, which predicts the CDK isoform inhibitory activity of the small-molecule drug.
3) constructing the model data sets. Random splitting is used: the data are randomly divided into a training set, a test set and a validation set in a ratio of 8:1:1.
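A seeded random 8:1:1 split of the kind described in step 3) might look like the sketch below. The function name `split_811`, the integer rounding rule, and the choice to give the remainder to the validation set are assumptions; the embodiment only specifies the ratio and that the split is random.

```python
import random

def split_811(items, seed=0):
    """Shuffle with a fixed seed and split 8:1:1 into train/test/validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = n * 8 // 10
    n_test = n // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]    # remainder goes to validation
    return train, test, valid

# Example with the 1871 CDK1 compounds, represented here only by their indices.
train, test, valid = split_811(range(1871), seed=42)
```

Changing `seed` yields a different split of the same data, which is what the seed-averaged evaluation in the hyperparameter search relies on.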
4) inputting the divided data sets into the network model, training it by updating the network parameters according to the difference between the predictions on the training set and the true values, determining the optimal network parameters according to the best result on the validation set, and evaluating on the data in the test set.
Specifically, in this embodiment, when predicting the CDK isoform inhibitory activity of small drug molecules, a binary cross-entropy (BCE) loss preceded by a sigmoid function is used to compute the loss between the predictions and the true values. Gradients are then computed by back-propagation, and the network parameters are updated with the Adam optimizer for 20 training rounds. Finally, with ROC-AUC as the evaluation index, the network parameters of the round that performs best on the validation set are selected as the final model.
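The loss and the model-selection criterion named above can be written out in a few lines. The sketch below is framework-free and covers only the sigmoid-plus-BCE loss, a pairwise ROC-AUC, and the choice of the best round; back-propagation and the Adam update are omitted. The function names are illustrative and do not come from the embodiment.

```python
import math

def bce_with_logits(logits, labels):
    """Mean binary cross-entropy of sigmoid(logit) against 0/1 labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(labels)

def roc_auc(scores, labels):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    which is acceptable for a small validation set.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_round(val_auc_per_round):
    """Index of the training round with the highest validation ROC-AUC."""
    return max(range(len(val_auc_per_round)), key=val_auc_per_round.__getitem__)

# Toy usage: the parameters saved after round 1 would be kept as the final model.
chosen = best_round([0.71, 0.83, 0.80])
```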
5) determining the optimal hyperparameter combination of the model according to a hyperparameter optimization strategy.
Specifically, in this embodiment, six hyperparameters and their candidate ranges are set for predicting the CDK isoform inhibitory activity of small drug molecules: the dropout rate of the molecular graph extraction module ([0, 0.05, …, 0.6]), the number of attentions of the molecular graph extraction module ([2, 3, …, 8]), the number of attention iterations of the molecular graph extraction module ([40, 45, …, 80]), the dropout rate of the molecular fingerprint extraction module ([0, 0.05, …, 0.6]), the feature vector dimension of the molecular fingerprint extraction module ([300, 350, …, 600]), and the ratio of the molecular graph vector to the molecular fingerprint vector at the input of the fully connected layers of the fusion module ([0, 0.1, …, 1]).
To find a good hyperparameter combination as efficiently and accurately as possible, this embodiment uses a Bayesian optimization strategy to explore combinations of the six hyperparameters over their ranges. Given the hyperparameter combinations evaluated so far and their results, the Bayesian optimization strategy computes a posterior distribution through Gaussian process regression, obtains the expected mean and variance for each candidate value of the six hyperparameters, and on that basis decides which combination of values to try in the next optimization step. Because the data sets for CDK isoform inhibitory activity prediction contain relatively few molecules and differ in their chemical distribution, the effect of random splitting is reduced as follows: for each hyperparameter combination, ten random seeds are used to produce ten different splits of the data set, and the mean of the ten training results is taken as the evaluation value of that optimization step. In this embodiment the Bayesian optimization runs for 15 steps, and the hyperparameter combination with the best evaluation index on the validation set is selected as the final combination.
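The search space and the seed-averaged evaluation value can be sketched as below. The sketch does not implement the Gaussian-process proposal step of Bayesian optimization; it only shows the objective that such an optimizer would evaluate at each step. The dictionary keys are hypothetical names (the third key follows the wording used for this range in Example 3), and `train_and_validate` stands in for a real training run that returns a validation ROC-AUC.

```python
from statistics import mean

def grid(start, stop, step, ndigits=2):
    """Inclusive evenly spaced grid, rounded to avoid floating-point drift."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, ndigits) for i in range(count)]

# Candidate values as listed in the text; key names are assumptions.
SEARCH_SPACE = {
    "graph_dropout": grid(0.0, 0.6, 0.05),
    "graph_attention_number": list(range(2, 9)),
    "graph_attention_iterations": list(range(40, 81, 5)),
    "fp_dropout": grid(0.0, 0.6, 0.05),
    "fp_feature_dim": list(range(300, 601, 50)),
    "graph_fp_ratio": grid(0.0, 1.0, 0.1),
}

# Size of the full grid, for comparison with the 15 optimization steps.
n_combinations = 1
for values in SEARCH_SPACE.values():
    n_combinations *= len(values)

def seed_averaged_score(params, train_and_validate, seeds=range(10)):
    """Average the validation metric over ten differently seeded splits."""
    return mean(train_and_validate(params, seed) for seed in seeds)

# Stand-in objective used only to make the sketch runnable.
demo_score = seed_averaged_score({}, lambda params, seed: 0.8 + 0.01 * seed)
```

If the ranges are read as the discrete grids above, an exhaustive search would cover 819819 combinations, each requiring ten training runs, which is why a guided strategy with a small step budget is attractive here.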
6) generating a dedicated optimal model for each drug property to be predicted, for subsequent use in predicting the properties of new small-molecule drugs.
Specifically, in this embodiment, the data sets of small-molecule inhibitory activity against cyclin-dependent kinase family members (CDK1-9, 14, 19) are used to build a total of 11 optimal models, one for each of CDK1-9, 14 and 19, with which users can predict the properties of new small drug molecules against the CDK family.
In this embodiment, the 11 optimal models for the CDK family are applied to prediction, which specifically includes the following steps:
6-1) selecting an existing compound library that contains the desired class of compounds or whose target values need to be predicted.
The SPECS compound library (containing 208670 compounds, https://www.specs.net/) is selected to mine CDK9 inhibitors. A CDK9 inhibitor screening library (approximately 194916 compounds) is created by subjecting the SPECS compound library to the same standardization protocol as in step S103 and filtering it with Lipinski's rule of five.
6-2) inputting the compound library into the optimal deep learning prediction model built for CDK9. The SMILES string of each molecule in the compound library is input into the optimal model for CDK9 inhibitory activity prediction built with the drug property prediction algorithm based on deep learning fusion of molecular graphs and fingerprints. After computation through the nodes of the network, the model outputs the predicted degree of inhibition of CDK9 kinase by the molecule; the closer the output value is to 1, the stronger the predicted inhibition of CDK9 kinase.
6-3) the CDK9 inhibition scores computed for the compound library by the optimal CDK9 model are ranked from high to low, and the top 1000 of the 194916 compounds are selected for further analysis. Finally, based on molecular docking and visual analysis of ligand-protein interactions, 19 compounds were selected and purchased for bioassay validation. In the biological experiments, cell-level results show that 6 of the 19 compounds have significant anticancer activity at the cancer-cell level, and in vitro CDK9 kinase inhibition tests show that 5 compounds have significant inhibitory activity against the target.
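The library filtering and ranking of steps 6-1) to 6-3) can be illustrated as follows. The sketch assumes that molecular descriptors and model scores have already been computed elsewhere; the function names, the strict no-violation reading of Lipinski's rule of five, and the toy identifiers and scores are assumptions and are not taken from the embodiment.

```python
import heapq

def passes_rule_of_five(mol_weight, logp, h_donors, h_acceptors):
    """Lipinski's rule of five, read strictly (no violation allowed)."""
    return (mol_weight <= 500 and logp <= 5
            and h_donors <= 5 and h_acceptors <= 10)

def top_k(scored, k=1000):
    """Return the k (compound_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy predicted CDK9 inhibition scores in [0, 1]; identifiers are placeholders.
scored = [("a", 0.12), ("b", 0.97), ("c", 0.55), ("d", 0.91)]
shortlist = top_k(scored, k=2)
```

`heapq.nlargest` avoids sorting the whole library when only the top 1000 of roughly 194916 scores are needed, although a full sort would also be inexpensive at this scale.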
The drug property predictions obtained in this embodiment with the prediction model based on big data and a deep learning neural network are correct and consistent with the actual situation. The drug property prediction algorithm based on deep learning fusion of molecular graphs and fingerprints is advanced and practical, and enables medicinal chemists and practitioners in related fields to screen drug molecules by predicted properties quickly and efficiently.
The invention further provides a computer device corresponding to the embodiment.
The computer device of this embodiment includes a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor; when the processor executes the program, the drug property prediction method based on deep learning fusion of molecular graphs and fingerprints, and its application as described in this embodiment, are implemented. When executing the computer program, the computer device pre-processes the acquired data sets of small-molecule inhibitory activity against cyclin-dependent kinase family members (CDK1-9, 14, 19), generates a total of 11 optimal models for the different kinases, and uses these optimal models to predict the properties of new small drug molecules. The computer device of this embodiment can predict inhibitory activity against the 11 CDK family members quickly and efficiently, improving the efficiency of research and development of drugs targeting CDK family members and accelerating virtual screening.
The invention also provides a non-transitory computer readable storage medium corresponding to the above-mentioned embodiment.
The non-transitory computer-readable storage medium of this embodiment has stored thereon a computer program that, when executed by a processor, implements the drug property prediction method based on deep learning fusion of molecular graphs and fingerprints. The storage medium contains the acquired data sets of small-molecule inhibitory activity against cyclin-dependent kinase family members (CDK1-9, 14, 19), the pre-processed sample sets, and the 11 optimal models generated from the sample sets for the different kinases. With this storage medium, the optimal models generated in this embodiment can be used directly, which saves model generation time; users can predict inhibitory activity against the 11 CDK family members quickly and efficiently, improving the efficiency of research and development of drugs targeting CDK family members and accelerating virtual screening.
Example 2
This example provides a drug property prediction method based on deep learning fusion of molecular graphs and molecular fingerprints, together with an interpretability analysis of the molecular graph. Taking as an example the prediction of the blood-brain barrier permeability of small molecules, the method comprises the following steps:
1) acquiring a data set of the blood-brain barrier permeability of a large number of small drug molecules. The data set is derived from a publicly available online medicinal chemistry data set. After pre-processing, 2039 compounds are obtained in total, of which 1560 are positive and 479 are negative, so that negative compounds account for 23.49% of the data set. The standardized data set consists of the SMILES strings of the 2039 small drug molecules and their corresponding positive or negative blood-brain barrier permeability labels.
2) constructing a deep learning model suitable for drug property prediction. The model processes two kinds of features, the molecular graph and the molecular fingerprint, in separate modules and finally combines them through fully connected layers into a fused neural network framework. Specifically, in this embodiment, the molecular-graph feature extraction module adopts a deep learning network model based on a graph attention mechanism; for the molecular-fingerprint feature extraction module, the concatenated bit strings of three fingerprints, MACCS FP, PubChem FP and Pharmacophore ErG FP, are used as the molecular fingerprint representation and input into the module. The output vectors of the two feature extraction modules are concatenated and input into a fusion module composed of multiple fully connected layers, which predicts the blood-brain barrier permeability of the small-molecule drug.
3) constructing the model data sets. Random splitting is used: the data are randomly divided into a training set, a test set and a validation set in a ratio of 8:1:1.
4) inputting the divided data sets into the network model, training it by updating the network parameters according to the difference between the predictions on the training set and the true values, determining the optimal network parameters according to the best result on the validation set, and evaluating on the data in the test set.
Specifically, in this embodiment, when predicting the blood-brain barrier permeability of small drug molecules, a binary cross-entropy (BCE) loss preceded by a sigmoid function is used to compute the loss between the predictions and the true values. Gradients are then computed by back-propagation, and the network parameters are updated with the Adam optimizer for 20 training rounds. Finally, with ROC-AUC as the evaluation index, the network parameters of the round that performs best on the validation set are selected as the final model.
5) determining the optimal hyperparameter combination of the model according to a hyperparameter optimization strategy.
Specifically, in this embodiment, six hyperparameters and their candidate ranges are set for predicting the blood-brain barrier permeability of small drug molecules: the dropout rate of the molecular graph extraction module ([0, 0.05, …, 0.6]), the number of attentions of the molecular graph extraction module ([2, 3, …, 8]), the number of attention iterations of the molecular graph extraction module ([40, 45, …, 80]), the dropout rate of the molecular fingerprint extraction module ([0, 0.05, …, 0.6]), the feature vector dimension of the molecular fingerprint extraction module ([300, 350, …, 600]), and the ratio of the molecular graph vector to the molecular fingerprint vector at the input of the fully connected layers of the fusion module ([0, 0.1, …, 1]).
To find a good hyperparameter combination as efficiently and accurately as possible, this embodiment uses a Bayesian optimization strategy to explore combinations of the six hyperparameters over their ranges. Given the hyperparameter combinations evaluated so far and their results, the Bayesian optimization strategy computes a posterior distribution through Gaussian process regression, obtains the expected mean and variance for each candidate value of the six hyperparameters, and on that basis decides which combination of values to try in the next optimization step. Because the data set for blood-brain barrier permeability prediction contains relatively few molecules and varies in its chemical distribution, the effect of random splitting is reduced as follows: for each hyperparameter combination, ten random seeds are used to produce ten different splits of the data set, and the mean of the ten training results is taken as the evaluation value of that optimization step. In this embodiment the Bayesian optimization runs for 20 steps, and the hyperparameter combination with the best evaluation index on the validation set is selected as the final combination.
6) generating a dedicated optimal model for each drug property to be predicted, for subsequent use in property prediction and interpretability analysis of new small-molecule drugs.
Specifically, in this embodiment, the small-molecule blood-brain barrier permeability data set is used to build one optimal model for blood-brain barrier permeability, with which users can predict the blood-brain barrier permeability of new small drug molecules.
7) in this embodiment, the optimal model for blood-brain barrier permeability is used for interpretability analysis, which specifically includes the following steps:
7-1) generating the set of molecules to be analyzed; 2 small drug molecules are selected;
7-2) loading the previously generated optimal model for blood-brain barrier permeability, and performing property prediction and molecular-graph-module interpretability analysis on the small drug molecules;
7-3) generating the property prediction results and the molecular-graph-module interpretability analysis results (see FIG. 9a and FIG. 9b);
7-4) the SMILES of molecule 1 is [C@H]1CN(C[C@@H](C)N1)C2C(F)C(N)C3C(=O)C(=CN(C4CC4)C3C2F)C(O)=O. Its property prediction result is 0.134, so the molecule is predicted to be negative, which matches its actual blood-brain barrier permeability. The graph interpretability analysis of the molecule is shown in FIG. 9a. The higher the computed attention value of a part of the structure, the more the model attends to that part when predicting. The circled part has a relatively high computed attention value, that is, the model considers this substructure to play an important role in the molecule's inability to penetrate the blood-brain barrier. The substructure in the square frame is colored more deeply than the substructure in the circular frame, that is, the model considers the substructure in the square frame to play an even more important role in the molecule's inability to penetrate the blood-brain barrier. The blood-brain barrier is a membrane structure, and whether a molecule can penetrate it is related to the lipophilicity and polarity of the molecule: for a negative molecule, the greater its polarity and the lower its ClogP value, the less able it is to penetrate the blood-brain barrier.
The molecule was analyzed quantitatively with the software ChemBioDraw. The calculated ClogP value of the square region is -0.905 and that of the circular region is 0.934. Comparing the two regions, the square region has the lower ClogP value and the higher polarity, so it plays an important role in preventing the molecule from penetrating the blood-brain barrier; the model's high attention to the square region is therefore consistent with its overall judgment that the molecule cannot penetrate the blood-brain barrier. The SMILES of molecule 2 is c1CCN(CC1)CC1cccc(C1)OCCCNC(=O)C. Its property prediction result is 0.988, so the molecule is predicted to be positive, which is consistent with its actual blood-brain barrier permeability. The graph interpretability analysis of the molecule is shown in FIG. 9b. The circled part has a relatively large attention value, meaning that the model attends more to this part when predicting, that is, this substructure plays an important role in the molecule's ability to penetrate the blood-brain barrier. The substructure in the square frame is colored more deeply than the substructure in the circular frame, that is, the substructure in the square frame plays an even more important role in the molecule's ability to penetrate the blood-brain barrier. For a positive molecule, the lower its polarity and the higher its ClogP value, the more readily it penetrates the blood-brain barrier.
Quantitative analysis with the software shows that the ClogP value of the square region is 2.142 and that of the circular region is 1.389. Comparing the two regions, the square region has the higher ClogP value and the lower polarity, so it plays an important role in the molecule's penetration of the blood-brain barrier; the model's high attention to the square region is consistent with its overall judgment that the molecule can penetrate the blood-brain barrier. Comparing the two molecules, molecule 2 is positive, and the ClogP value of its lower-attention circular region (1.389) is higher than that of the lower-attention circular region of the negative molecule 1 (0.934), which indirectly indicates that molecule 2 as a whole is less polar and passes through the blood-brain barrier more easily. The model's predictions and interpretability analyses for the two molecules are consistent with the actual situation, showing that the optimal model trained and built with the algorithm can make correct property predictions and reasonable interpretability analyses for molecules, and can provide strong support to chemists in designing drug molecules.
The invention further provides a computer device corresponding to the embodiment.
The computer device of this embodiment includes a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor; when the processor executes the program, the drug property prediction method based on deep learning fusion of molecular graphs and fingerprints, and its application as described in this embodiment, are implemented. When executing the computer program, the computer device generates one optimal model from the acquired small-molecule blood-brain barrier permeability data set, and this optimal model can be used to predict and analyze the properties of new small drug molecules. The computer device of this embodiment can predict blood-brain barrier permeability quickly and efficiently, improving the efficiency of research and development of drugs that must penetrate the blood-brain barrier and accelerating virtual screening.
The invention also provides a non-transitory computer readable storage medium corresponding to the above-mentioned embodiment.
The non-transitory computer-readable storage medium of this embodiment has stored thereon a computer program that, when executed by a processor, implements the drug property prediction method based on deep learning fusion of molecular graphs and fingerprints. The storage medium contains the acquired small-molecule blood-brain barrier permeability data set and the one optimal model generated from it. With this storage medium, the optimal model generated in this embodiment can be used directly, which saves model generation time; users can quickly and efficiently predict whether small molecules can penetrate the blood-brain barrier and carry out molecular structure interpretability analysis, improving the efficiency of research and development of drugs that must penetrate the blood-brain barrier and accelerating virtual screening.
Example 3
This example provides a drug property prediction method based on deep learning fusion of molecular graphs and fingerprints. Taking as an example the prediction of the inhibitory activity of small molecules against target of rapamycin (mTOR), the method comprises the following steps:
1) obtaining inhibitory activity data of a large number of small drug molecules against target of rapamycin (mTOR) and constructing a data set, wherein constructing the data set specifically comprises the following steps:
1-2) because no accepted, well-curated, classical data set is available in the field, data must be collected from experimental records kept by medicinal chemistry or biology laboratories, or from compound activity data provided by publicly available online medicinal chemistry databases, and then pre-processed.
1-2-1) in this embodiment, the relevant protein-level information is obtained from the protein database UniProt, and all activity data records for mTOR kinase are collected from the medicinal chemistry database ChEMBL according to the UniProt ID of the kinase.
1-2-2) data screening. Only bioactivity records with assay type B and a reported activity type of IC50, EC50, Ki or Kd are kept. Small drug molecules with multiple activity records are deduplicated, and the activity data of each repeated molecule are averaged.
1-2-3) data standardization. The small-molecule structures are cleaned by removing hydrogen ions and salt ions, and are then optimized with a force field.
1-2-4) data annotation. In this embodiment the task is a classification task, so the molecules must be labeled uniformly according to a specified threshold. Here the threshold is 1 μM: small molecules with a measured activity ≤ 1 μM are labeled as inhibitors, and those with a measured activity > 1 μM are labeled as non-inhibitors.
1-2-5) obtaining the standardized data. After standardization, a total of 4104 compounds and their kinase inhibitory activity data were obtained for mTOR kinase, of which 565 compounds are labeled as non-inhibitors and 3539 as inhibitors, so that positive data account for 86.23% of the data set. The resulting standardized data set consists of the SMILES strings of the 4104 small drug molecules and their corresponding mTOR kinase inhibitory activity labels.
2) constructing a deep learning model suitable for drug property prediction. The model processes two kinds of features, the molecular graph and the molecular fingerprint, in separate modules and finally combines them through fully connected layers into a fused neural network framework. Specifically, in this embodiment, the molecular-graph feature extraction module adopts a deep learning network model based on a graph attention mechanism; for the molecular-fingerprint feature extraction module, the concatenated bit strings of three fingerprints, MACCS FP, PubChem FP and Pharmacophore ErG FP, are used as the molecular fingerprint representation and input into the module. The output vectors of the two feature extraction modules are concatenated and input into a fusion module composed of multiple fully connected layers, which predicts the mTOR kinase inhibitory activity of the small-molecule drug.
3) constructing the model data sets. Random splitting is used: the data are randomly divided into a training set, a test set and a validation set in a ratio of 8:1:1.
4) inputting the divided data sets into the network model, training it by updating the network parameters according to the difference between the predictions on the training set and the true values, determining the optimal network parameters according to the best result on the validation set, and evaluating on the data in the test set.
Specifically, in this embodiment, when predicting the mTOR kinase inhibitory activity of small drug molecules, a binary cross-entropy (BCE) loss preceded by a sigmoid function is used to compute the loss between the predictions and the true values. Gradients are then computed by back-propagation, and the network parameters are updated with the Adam optimizer for 40 training rounds. Finally, with ROC-AUC as the evaluation index, the network parameters of the round that performs best on the validation set are selected as the final model.
5) determining the optimal hyperparameter combination of the model according to a hyperparameter optimization strategy.
Specifically, in this embodiment, six hyperparameters and their candidate ranges are set for predicting the mTOR kinase inhibitory activity of small drug molecules: the dropout rate of the molecular graph extraction module ([0, 0.05, …, 0.6]), the number of attentions of the molecular graph extraction module ([2, 3, …, 8]), the number of attention iterations of the molecular graph extraction module ([40, 45, …, 80]), the dropout rate of the molecular fingerprint extraction module ([0, 0.05, …, 0.6]), the feature vector dimension of the molecular fingerprint extraction module ([300, 350, …, 600]), and the ratio of the molecular graph vector to the molecular fingerprint vector at the input of the fully connected layers of the fusion module ([0, 0.1, …, 1]).
To find a good hyperparameter combination as efficiently and accurately as possible, this embodiment uses a Bayesian optimization strategy to explore combinations of the six hyperparameters over their ranges. Given the hyperparameter combinations evaluated so far and their results, the Bayesian optimization strategy computes a posterior distribution through Gaussian process regression, obtains the expected mean and variance for each candidate value of the six hyperparameters, and on that basis decides which combination of values to try in the next optimization step. Because the data set for mTOR kinase inhibitory activity prediction contains relatively few molecules and varies in its chemical distribution, the effect of random splitting is reduced as follows: for each hyperparameter combination, ten random seeds are used to produce ten different splits of the data set, and the mean of the ten training results is taken as the evaluation value of that optimization step. In this embodiment the Bayesian optimization runs for 20 steps, and the hyperparameter combination with the best evaluation index on the validation set is selected as the final combination.
6) For prediction of different drug properties, a targeted optimal model is generated for subsequent application to new small molecule drug property prediction and explanatory analysis.
Specifically, in this embodiment, the small-molecule mTOR kinase inhibitory activity data set is used to construct one optimal model, which is provided to the user for property prediction and molecular graph model interpretation analysis of novel CDK family drug small molecules.
7) In this embodiment, the optimal model for mTOR kinase is applied to prediction and interpretive analysis, which specifically includes the following steps:
7-1) generating the molecular data set to be subjected to interpretive analysis by selecting 3 drug small molecules;
7-2) loading the pre-generated optimal model for mTOR kinase, and performing property prediction and molecular graph module interpretive analysis on the drug small molecules;
7-3) generating the property prediction results and the molecular graph module interpretive analysis results (see FIG. 10a, FIG. 10b and FIG. 10c);
7-4) the predicted mTOR inhibitory activity values of the three molecules are 0.984, 0.949 and 0.964, respectively; all three are predicted to be positive molecules, consistent with their actual inhibition of mTOR kinase. The inhibitory activity of small molecules against mTOR kinase depends to a great extent on the hydrogen-bonding capacity of the molecules. In the boxed regions of the three molecules, the morpholine rings and urea groups are hydrogen bond acceptors that can form hydrogen bonds and confer high inhibitory activity against mTOR kinase, which is consistent with the molecules being predicted as positive. With the pre-generated optimal model, both the predictions and the molecular graph interpretive analysis of the three small molecules agree with the actual inhibitory activity of the molecules and the actual hydrogen-bonding capacity of their substructures, showing that the model constructed according to the invention can achieve inhibitory activity prediction and model interpretation for mTOR kinase. The invention can effectively help chemists carry out large-scale drug screening and drug molecule design for mTOR kinase.
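The hyper-parameter evaluation of step 5) above can be sketched as follows. This is a minimal illustration only: the dictionary keys, the helper names `grid` and `evaluate_combination`, and the stand-in scorer `toy_train_and_score` are hypothetical and do not come from the patent. The sketch builds the six candidate ranges and averages a score over ten seeded splits; the Gaussian-process step that proposes the next combination is deliberately left out and would be supplied by a Bayesian optimization library.

```python
import random
from statistics import mean


def grid(start, stop, step, digits=2):
    """Inclusive arithmetic grid, rounded to avoid floating-point drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, digits) for i in range(n + 1)]


# Candidate values taken from the ranges in the text; key names are hypothetical.
SEARCH_SPACE = {
    "graph_dropout": grid(0.0, 0.6, 0.05),
    "attention_heads": list(range(2, 9)),
    "attention_iterations": list(range(40, 81, 5)),
    "fp_dropout": grid(0.0, 0.6, 0.05),
    "fp_dim": list(range(300, 601, 50)),
    "graph_fp_ratio": grid(0.0, 1.0, 0.1),
}


def evaluate_combination(params, train_and_score, seeds=range(10)):
    """Average the score over one data split per seed (ten seeds by default)."""
    return mean(train_and_score(params, seed) for seed in seeds)


def toy_train_and_score(params, seed):
    """Stand-in for real training: returns a seed-dependent pseudo-score."""
    rng = random.Random(seed)
    return 0.5 + 0.4 * params["graph_fp_ratio"] * rng.random()


# Evaluate one combination: the first candidate value of every hyper-parameter.
params = {name: values[0] for name, values in SEARCH_SPACE.items()}
score = evaluate_combination(params, toy_train_and_score)
```

In a real run, `toy_train_and_score` would be replaced by a function that splits the data set with the given seed, trains the fused model, and returns the verification-set metric.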
The invention further provides a computer device corresponding to the embodiment.
The computer device of this embodiment includes a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor; when executing the program, the processor implements the drug property prediction method based on deep learning fusion molecular graph and fingerprint, and its application, as described in the embodiments. When the computer device of this embodiment runs the computer program, the acquired small-molecule mTOR kinase inhibitory activity data set is preprocessed to generate one optimal model, and the optimal model is used for property prediction and interpretive analysis of new drug small molecules. The computer device of this embodiment can predict inhibitory activity against mTOR kinase quickly and efficiently, which improves research and development efficiency for drugs targeting mTOR kinase and accelerates virtual screening.
Corresponding to the above embodiments, this embodiment further provides a drug property prediction system based on deep learning fusion molecular graph and fingerprint, comprising: a data preprocessing module, used for preprocessing the collected raw chemical molecule activity data set so that the model can be applied to the construction of new drug molecule property data sets; a model construction module, used for modeling the processed samples with a deep learning model based on molecular graphs and molecular fingerprints, where the deep learning model comprises a molecular-graph-based feature extraction module, a molecular-fingerprint-based feature extraction module and a fusion module; the molecular-graph-based feature extraction module adopts a graph attention network and focuses on the influence of the relationships between adjacent atoms on the molecular property; the molecular-fingerprint-based feature extraction module extracts the influence of molecular structure and pharmacophores on the molecular property from three different types of molecular fingerprints; the fusion module combines the feature vectors obtained by the two feature extraction modules and inputs the combined feature vector into a multilayer fully connected network; a prediction module, used for predicting new drug small molecules with the optimal model generated by the model construction module, so that the model is applied to the prediction of new drug molecules; and an interpretation module, used for performing interpretive analysis on drug small molecules with the optimal model generated by the model construction module, so that the model can provide the user with drug design suggestions for specific drug properties.
The invention also proposes a non-transitory computer-readable storage medium.
The non-transitory computer-readable storage medium of this embodiment has stored thereon a computer program that, when executed by a processor, implements the drug property prediction method based on deep learning fusion molecular graph and fingerprint. The storage medium of this embodiment contains the acquired small-molecule mTOR kinase inhibitory activity data set and one optimal model generated from that data set. With the non-transitory computer-readable storage medium of this embodiment, the optimal model generated by the embodiment can be used directly, which saves model generation time; a user can quickly and efficiently perform activity prediction and molecular structure interpretive analysis on whether small molecules can inhibit mTOR kinase, which improves research and development efficiency for drugs targeting mTOR kinase and accelerates virtual screening.

Claims (10)

1. The drug property prediction method based on the deep learning fusion molecular graph and the fingerprint is characterized by comprising the following steps of:
1) for the prediction of different drug properties, acquiring a targeted and specific data set containing a large amount of drug small molecule data;
2) constructing a deep learning model suitable for drug property prediction, wherein the model extracts two kinds of features, molecular graph features and molecular fingerprint features, in separate modules, and finally uses fully connected layers to form a fused neural network architecture;
3) selecting a specific mode according to the requirements of model construction, and splitting a data set into a training set, a test set and a verification set;
4) inputting a data set into a network model, training and updating parameters in the network according to the difference between the prediction result of the training set and the true value of the training set, determining the optimal network parameters according to the optimal result on the verification set, and detecting the data in the test set;
5) determining the optimal hyper-parameter combination of the model according to the hyper-parameter optimization strategy;
6) for the prediction of different drug properties, generating a targeted optimal model for subsequent application in the prediction of new small molecule drug properties;
7) for the generated optimal model, an explanatory analysis is provided, which can be used for subsequent drug design.
2. The method for predicting drug properties based on deep learning fusion molecular graphs and fingerprints as claimed in claim 1, wherein the step 1) comprises the following steps:
1-1) for drug properties for which an industry-recognized classical data set exists, adopting the classical data set to construct the model;
1-2) for drug properties for which no industry-recognized classical data set exists, collecting targeted drug activity data from laboratory records maintained in medicinal chemistry or biological laboratories, from compound activity data provided by publicly available online medicinal chemistry databases, or from databases obtained through other channels, and preprocessing the data for model construction.
3. The method for predicting drug properties based on deep learning fusion molecular graphs and fingerprints as claimed in claim 2, wherein the steps 1-2) comprise the following steps:
1-2-1) obtaining targeted raw drug activity data from various sources;
1-2-2) checking the drug small molecules for duplicates, and averaging the activity data of duplicated molecules;
1-2-3) carrying out dehydroionization, salt-ion removal and structural force field optimization on the drug small molecules;
1-2-4) for the regression task, preserving specific activity values; for the classification task, carrying out negative and positive labeling on the small drug molecules according to a specified threshold value;
1-2-5) presenting the data set as the Simplified Molecular Input Line Entry System (SMILES) strings of the drug small molecules together with the corresponding target values.
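Steps 1-2-2), 1-2-4) and 1-2-5) above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `preprocess` is hypothetical, the SMILES strings are assumed to be already canonical (so equal strings mean equal molecules), and the rule "positive when the mean activity is at or above the threshold" is an assumed labeling direction. The structure clean-up of step 1-2-3) is not covered.

```python
from collections import defaultdict
from statistics import mean


def preprocess(records, threshold=None):
    """Merge duplicate molecules and build (SMILES, target) pairs.

    records   : iterable of (smiles, activity) tuples; SMILES are assumed to be
                canonical already, so identical strings are the same molecule.
    threshold : None keeps the averaged activity (regression task); a number
                labels a molecule 1 when its mean activity >= threshold, else 0
                (assumed direction for the classification task).
    """
    grouped = defaultdict(list)
    for smiles, activity in records:
        grouped[smiles].append(activity)

    dataset = []
    for smiles, values in grouped.items():
        avg = mean(values)  # average the activity of duplicated molecules
        target = avg if threshold is None else int(avg >= threshold)
        dataset.append((smiles, target))
    return dataset


raw = [("CCO", 5.0), ("CCO", 7.0), ("c1ccccc1", 4.0)]
regression_set = preprocess(raw)
classification_set = preprocess(raw, threshold=5.5)
```

For activity measures where smaller values mean stronger activity, the comparison would need to be reversed.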
4. The method for predicting drug properties based on deep learning fusion molecular graphs and fingerprints as claimed in claim 1, wherein the step 2) comprises the following steps:
2-1) the feature extraction part of the model combines two modules, one extracting molecular graph features and one extracting molecular fingerprint features, which separately extract features of the drug small molecules and generate the corresponding feature vectors;
2-2) the module of the model that extracts molecular graph features adopts a graph attention network structure; a molecular graph is generated from the input SMILES string: the atoms of the molecule are mapped to nodes of the molecular graph and the chemical bonds are mapped to edges of the molecular graph, and the physicochemical properties of the atoms and chemical bonds are calculated as the initial feature vectors of the nodes and edges; the attention mechanism in the network structure focuses on the influence between adjacent atoms, namely the attention between adjacent atoms, and is used for iteratively updating the feature vectors of the atoms in the molecule; after the iterative updating is finished, the feature vectors of all atoms are integrated and output as the feature vector of the molecular graph;
2-3) the module of the model that extracts molecular fingerprint features adopts multiple fully connected layers; three different types of molecular fingerprints are generated from the input SMILES string: the substructure-based molecular fingerprint MACCS FP, the substructure-based molecular fingerprint PubChem FP, and the pharmacophore-based molecular fingerprint Pharmacophore ErG FP; the three fingerprints are concatenated and input into the fully connected network of the module to obtain the feature vector of the molecular fingerprint;
2-4) the feature vectors generated by the two modules are concatenated and input into multiple fully connected layers to predict the properties of the drug small molecules and generate the final prediction result.
5. The method for predicting drug properties based on deep learning fusion molecular graphs and fingerprints as claimed in claim 4, wherein the step 2-2) comprises the following steps:
2-2-1) calculating the physicochemical properties of each atom as the initial feature vector of a node in the molecular graph; the physicochemical properties specifically include: atom type, number of attached chemical bonds, charge number, chirality, number of attached hydrogen atoms, orbital hybridization, atomic mass, and whether the atom is aromatic; the atom type covers atoms with atomic numbers up to one hundred;
2-2-2) calculating the attention between adjacent atoms, and iteratively updating the representations of the atoms according to the attention, with the following expressions:
$$e_{ij} = \mathrm{LeakyReLU}\left(a \cdot \left[W_1 h_i \,\|\, W_1 h_j\right]\right)$$
Figure FDA0003688848410000031
Figure FDA0003688848410000032
wherein $h_i$ and $h_j$ are the pre-iteration feature vectors of adjacent atoms $i$ and $j$, $W_1$ is a weight matrix, and $\alpha_{ij}$ is a weight; the attention value calculated between adjacent atoms $i$ and $j$ is $e_{ij}$; the attention value $e_{ik}$ is calculated for each atom $k$ adjacent to atom $i$ and these values are summed, from which the attention influence of atom $j$ on atom $i$ is calculated; before atom $i$ is updated, the attention values of all neighbors of $i$ are normalized to obtain $\alpha_{ij}$; the number of attention heads is $K$: the multi-head attention mechanism repeats the attention calculation multiple times, and the mean of these attention results is used to update atom $i$, giving the post-iteration feature vector $h_i'$;
2-2-3) calculating a feature vector of the molecular graph, wherein an expression is as follows:
Figure FDA0003688848410000033
wherein $N$ is the total number of atoms in the molecule and $h_i'$ is the feature vector of atom $i$ after the iterative updating is finished; the mean of the feature vectors of all atoms is taken as the feature vector of the molecule.
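The attention update and mean readout of steps 2-2-2) and 2-2-3) can be illustrated with a small pure-Python sketch. Several points are assumptions rather than statements of the patent: the normalization is taken to be a softmax over the neighbors of each atom, the update is taken to be the attention-weighted sum of the transformed neighbor vectors with no extra activation, each head is given its own weight matrix, and the names `attention_update` and `readout` are hypothetical. Every atom is assumed to have at least one neighbor.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def attention_update(h, neighbors, heads):
    """One attention iteration over all atoms.

    h         : list of atom feature vectors
    neighbors : neighbors[i] lists the atoms bonded to atom i (non-empty)
    heads     : list of (W, a) pairs, one per attention head; W is a weight
                matrix and a is applied to the concatenation [W h_i || W h_j]
    """
    per_head = []
    for W, a in heads:
        Wh = [matvec(W, v) for v in h]
        outputs = []
        for i, nbrs in enumerate(neighbors):
            # e_ij = LeakyReLU(a . [W h_i || W h_j]) for every neighbor j
            e = [leaky_relu(sum(x * y for x, y in zip(a, Wh[i] + Wh[j])))
                 for j in nbrs]
            # Assumed normalization: softmax over the neighbors of atom i.
            top = max(e)
            weights = [math.exp(v - top) for v in e]
            total = sum(weights)
            alpha = [w / total for w in weights]
            # Attention-weighted sum of the transformed neighbor vectors.
            outputs.append([sum(al * Wh[j][d] for al, j in zip(alpha, nbrs))
                            for d in range(len(W))])
        per_head.append(outputs)
    # Average the heads to obtain the updated vector h_i' of every atom.
    return [[sum(vals) / len(heads) for vals in zip(*atom_heads)]
            for atom_heads in zip(*per_head)]


def readout(h):
    """Molecule feature vector: mean of all atom feature vectors."""
    return [sum(col) / len(h) for col in zip(*h)]


# Toy three-atom chain 0-1-2 with an identity weight matrix and a zero
# attention vector, so every neighbor receives equal attention.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1], [0, 2], [1]]
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, [0.0, 0.0, 0.0, 0.0])]
updated = attention_update(h, neighbors, heads)
molecule_vector = readout(updated)
```

In this toy case the middle atom ends up with the average of its two neighbors, which makes the weighting easy to check by hand.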
6. The method for predicting drug properties based on deep learning fusion molecular graphs and fingerprints as claimed in claim 1, wherein the step 3) comprises the following steps:
3-1) the model allows the user to define the splitting mode and the splitting ratio;
3-2) the splitting modes built into the model comprise random splitting and skeleton splitting; in random splitting, the data set is shuffled and split at random; in skeleton splitting, the molecular skeletons of the drug small molecules in the data set and the number of molecules corresponding to each skeleton are calculated, the skeletons with few corresponding molecules, together with their molecules, are assigned in turn to the verification set and the test set until these sets contain enough molecules, and all remaining molecules are assigned to the training set.
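The skeleton splitting of step 3-2) can be sketched as below. The function name `scaffold_split`, the `scaffold_of` callable and the fill rule (fill the verification set first, then the test set, never exceeding the target size) are assumptions made for illustration; in practice the skeleton key would typically come from a cheminformatics toolkit, which is not shown here.

```python
from collections import defaultdict


def scaffold_split(smiles_list, scaffold_of, valid_frac=0.1, test_frac=0.1):
    """Split molecule indices into train / verification / test sets by skeleton.

    scaffold_of : hypothetical callable mapping a SMILES string to a skeleton
                  key; molecules sharing a key always land in the same set.
    """
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[scaffold_of(smi)].append(idx)

    # Skeletons with the fewest molecules are assigned first.
    ordered = sorted(groups.values(), key=len)
    n = len(smiles_list)
    n_valid, n_test = int(valid_frac * n), int(test_frac * n)

    train, valid, test = [], [], []
    for members in ordered:
        if len(valid) + len(members) <= n_valid:
            valid.extend(members)
        elif len(test) + len(members) <= n_test:
            test.extend(members)
        else:
            train.extend(members)
    return train, valid, test


# Placeholder identifiers stand in for SMILES strings; the mapping to
# skeleton keys is made up for the example.
molecules = ["m%d" % i for i in range(10)]
toy_scaffolds = {"m0": "A", "m1": "A", "m2": "A", "m3": "A", "m4": "A",
                 "m5": "B", "m6": "B", "m7": "B", "m8": "C", "m9": "D"}
train_idx, valid_idx, test_idx = scaffold_split(molecules, toy_scaffolds.get)
```

Because whole skeleton groups are moved together, the verification and test sets contain skeletons that never appear in the training set, which is the point of this splitting mode.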
7. The method for predicting drug properties based on deep learning fusion molecular graphs and fingerprints as claimed in claim 1, wherein the step 5) comprises the following steps:
5-1) six hyper-parameters are built into the model: the dropout rate of the molecular graph extraction module, the number of attention heads of the molecular graph extraction module, the number of attention iterations of the molecular graph extraction module, the dropout rate of the molecular fingerprint extraction module, the feature vector dimension of the molecular fingerprint extraction module, and the ratio of the molecular graph vector to the molecular fingerprint vector at the fully connected layer input of the fusion module;
5-2) carrying out hyper-parameter optimization on the model by Bayesian optimization for 20 rounds, and selecting the group of hyper-parameters with the best evaluation score on the test set.
8. The method for predicting drug properties based on deep learning fusion molecular graphs and fingerprints as claimed in claim 1, wherein the step 6) comprises the following steps:
6-1) generating an optimal prediction model aiming at the specific drug property according to the optimal hyper-parameter combination screened in the step 5);
6-2) loading a corresponding optimal model when predicting the drug molecules with unknown properties, and inputting the SMILES format of the molecules into the model to obtain the prediction result of the drug molecules;
6-3) the model supports mass prediction of drug molecules with unknown properties, and rapid and efficient molecular property judgment is realized.
9. The method for predicting drug properties based on deep learning fusion molecular graphs and fingerprints as claimed in claim 1, wherein the step 7) comprises the following steps:
7-1) providing two model interpretation functions, fingerprint interpretation and molecular graph interpretation, according to the optimal prediction model for the specific drug property generated in step 6) and the input requirements of the user;
7-2) when the user requests fingerprint interpretation, calculating the importance indexes of the different fingerprint bits in the model, wherein the higher the index, the greater the role the bit plays in the generation of the model, and the intramolecular information represented by the bit plays an important role in designing drug molecules for the specific drug property;
7-3) when the user requests molecular graph interpretation, calculating the attention values in the molecular graph in the model and mapping them onto the molecular graph, wherein the larger the attention value of a group of atoms, the greater the role that structure plays in the generation of the model, and the structure plays an important role in designing drug molecules for the specific drug property.
10. The system for realizing the drug property prediction method based on the deep learning fusion molecular graph and the fingerprint as claimed in any one of claims 1 to 9, is characterized by comprising a data preprocessing module, a model construction module, a model prediction module and a model interpretation module; the data preprocessing module is used for preprocessing the collected chemical molecule activity original data set so that the model can be applied to the construction of a new drug molecule property data set;
the model construction module is used for modeling the processed samples through a deep learning model based on the molecular graph and the molecular fingerprint; the deep learning model based on the molecular graph and the molecular fingerprint comprises a molecular-graph-based feature extraction module, a molecular-fingerprint-based feature extraction module and a fusion module; the molecular-graph-based feature extraction module adopts a graph attention network and focuses on the influence of the relationships between adjacent atoms on the molecular property; the molecular-fingerprint-based feature extraction module extracts the influence of molecular structure and pharmacophores on the molecular property from three different types of molecular fingerprints; the fusion module combines the feature vectors obtained by the two feature extraction modules and inputs the combined feature vector into a multilayer fully connected network;
the prediction module is used for predicting the new drug micromolecules according to the optimal model generated by the model construction module, so that the model is applied to the prediction of the new drug molecules;
the model interpretation module is used for performing interpretive analysis on the drug small molecules according to the optimal model generated by the model construction module, so that the model can provide the user with drug design suggestions for specific drug properties.
CN202210654644.7A 2022-06-10 2022-06-10 Drug property prediction method and system based on deep learning fusion molecular graph and fingerprint Pending CN115050428A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210654644.7A CN115050428A (en) 2022-06-10 2022-06-10 Drug property prediction method and system based on deep learning fusion molecular graph and fingerprint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210654644.7A CN115050428A (en) 2022-06-10 2022-06-10 Drug property prediction method and system based on deep learning fusion molecular graph and fingerprint

Publications (1)

Publication Number Publication Date
CN115050428A true CN115050428A (en) 2022-09-13

Family

ID=83161049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210654644.7A Pending CN115050428A (en) 2022-06-10 2022-06-10 Drug property prediction method and system based on deep learning fusion molecular graph and fingerprint

Country Status (1)

Country Link
CN (1) CN115050428A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164428A (en) * 2020-09-23 2021-01-01 常州微亿智造科技有限公司 Method and device for predicting properties of small drug molecules based on deep learning
CN112164427A (en) * 2020-09-23 2021-01-01 常州微亿智造科技有限公司 Method and device for predicting activity of small drug molecule target based on deep learning
CN112366001A (en) * 2020-11-25 2021-02-12 苏州莱奥生物技术有限公司 Comprehensive early patent medicine property evaluation method based on pharmacokinetics and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
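Gao Li et al.: "Progress in the application of computer-aided drug design in new drug research and development", Chinese Pharmaceutical Journal, no. 09, 8 May 2011 (2011-05-08), pages 641 - 645 *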

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115691703A (en) * 2022-10-15 2023-02-03 苏州创腾软件有限公司 Drug property prediction method and system based on pharmacokinetic model
CN116230109A (en) * 2023-05-10 2023-06-06 北京大学 Chiral separation prediction method based on deep learning
CN116502130A (en) * 2023-06-26 2023-07-28 湖南大学 Method for identifying smell characteristics of algae source
CN116502130B (en) * 2023-06-26 2023-09-15 湖南大学 Method for identifying smell characteristics of algae source

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination