CN113066525A - Multi-target drug screening method based on ensemble learning and hybrid neural network


Publication number
CN113066525A
Authority
CN
China
Prior art keywords
drug
candidate drug
candidate
target
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110339575.6A
Other languages
Chinese (zh)
Other versions
CN113066525B (en)
Inventor
陈观兴
谭晓军
陈语谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202110339575.6A
Publication of CN113066525A
Application granted
Publication of CN113066525B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/90 Programming languages; Computing architectures; Database systems; Data warehousing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioethics (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a multi-target drug screening method based on ensemble learning and a hybrid neural network, comprising the following steps: acquiring data; performing docking and obtaining candidate drugs according to the docking scores; determining multi-target proteins and docking the candidate drugs with them to obtain the number of target proteins acted on by each candidate drug; predicting the activity of the candidate drugs with a preset ensemble learning regression model to obtain their predicted activity values; predicting the binding affinity between the pathogenic target protein and the candidate drugs with a preset hybrid neural network framework to obtain the binding affinity score of each candidate drug and the target protein; and determining the final candidate drug from these results combined. By combining multi-target analysis, ensemble learning and a hybrid neural network, the invention achieves low-cost, efficient and accurate drug screening, and can be widely applied in the field of drug screening.

Description

Multi-target drug screening method based on ensemble learning and hybrid neural network
Technical Field
The invention relates to the field of drug screening, and in particular to a multi-target drug screening method based on ensemble learning and a hybrid neural network.
Background
Existing drug screening methods mainly consider only one aspect: the drug, the target, or the drug-target interaction. Such one-sided screening means that the drugs selected do not necessarily have good efficacy against the disease.
With respect to target proteins, existing methods generally perform virtual screening against a single protein to determine candidate drugs, yet in many pathways the pathogenic mechanism is influenced by several proteins, so analysing only one target protein is one-sided. With respect to drugs, existing methods do not effectively combine the physicochemical properties of drugs with statistical analysis to predict drug activity, and the predictions often contain considerable error. With respect to drug-target interaction, existing methods generally use a single neural network architecture to predict the interaction. A neural network is a black box, and the large amount of complex intermolecular information adds uncertainty to the prediction, so a single architecture can produce large errors. In summary, existing drug development schemes suffer from high cost, low efficiency and low accuracy.
Disclosure of Invention
To solve the technical problems of high cost, low efficiency and low accuracy in drug research and development, the invention aims to provide a multi-target drug screening method based on ensemble learning and a hybrid neural network.
The first technical scheme adopted by the invention is a multi-target drug screening method based on ensemble learning and a hybrid neural network, comprising the following steps:
acquiring a pathogenic target protein, its known ligands and drug molecule library data;
docking the drug molecule library data against the pathogenic target protein, and obtaining candidate drugs according to the docking scores;
determining multi-target proteins corresponding to the pathogenic target protein, and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug;
calculating the physicochemical properties of the known ligands and the candidate drugs, and predicting the activity of the candidate drugs with a preset ensemble learning regression model to obtain their predicted activity values;
predicting the binding affinity between the pathogenic target protein and the candidate drugs with a preset hybrid neural network framework to obtain the binding affinity score of each candidate drug and the target protein;
combining the number of target proteins acted on by each candidate drug, its predicted activity value and its binding affinity score with the target protein to determine the final candidate drug.
Further, the step of acquiring the pathogenic target protein, its known ligands and drug molecule library data specifically comprises:
obtaining the sequence and the crystal structure of the target protein from the UniProt database, and performing quality evaluation on the protein;
obtaining the known ligand molecules of the target protein and their corresponding simplified molecular-input line-entry system (SMILES) strings from the ChEMBL database;
obtaining the structures of the drug molecule library and their corresponding SMILES strings from the ZINC15 database.
Further, the step of docking the drug molecule library data against the pathogenic target protein and obtaining candidate drugs according to the docking scores specifically comprises:
preparing the target protein and the drug molecule library for docking;
docking with the target protein as the receptor and the drug molecules as ligands to obtain docking scores;
taking the drugs with the ten highest docking scores as the candidate drugs.
Further, the step of determining multi-target proteins corresponding to the pathogenic target protein and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug specifically comprises:
acquiring the protein-protein relationships of the pathogenic target protein from the STRING database, and selecting the protein combination with high confidence to obtain the multi-target proteins;
inputting the multi-target proteins into the DAVID database for analysis, and selecting proteins according to a preset rule;
docking the selected proteins with the candidate drugs to obtain the number of target proteins acted on by each candidate drug.
Further, the step of calculating the physicochemical properties of the known ligands and the candidate drugs, and predicting the activity of the candidate drugs with a preset ensemble learning regression model to obtain their predicted activity values, specifically comprises:
calculating the physicochemical properties of the candidate drugs and of the known ligand molecules of the target protein;
taking the physicochemical properties of the known ligand molecules of the target protein as features and performing feature selection to obtain the selected features;
training a preset ensemble learning regression model on the activity values of the known ligand molecules of the target protein and their corresponding selected features to obtain a trained ensemble learning regression model;
predicting the activity values of the candidate drug molecules with the trained ensemble learning regression model to obtain the predicted activity values of the candidate drugs.
Further, the ensemble learning regression model comprises the ensemble learning Boosting, Bagging and Stacking algorithms and their variants, and a voter based on the ensemble learning Voting algorithm.
Further, the step of predicting the binding affinity between the pathogenic target protein and the candidate drugs with the preset hybrid neural network framework to obtain the binding affinity score of each candidate drug and the target protein specifically comprises:
taking the sequence of the pathogenic target protein, the simplified molecular-input line-entry system (SMILES) strings of the known ligands and their activity values as a data set, and dividing the data set into a training set, a test set and a validation set;
training the preset hybrid neural network framework on the training, test and validation sets and tuning its parameters to obtain the trained hybrid neural network framework;
encoding and deep-embedding the sequence of the pathogenic target protein and the SMILES strings of the candidate drugs, respectively, with the trained hybrid neural network framework;
inputting the deep embeddings of the target protein, the known ligands and the candidate drugs into a multilayer perceptron, and outputting the concordance index, the mean squared error and the predicted binding affinity score;
plotting the concordance index and the mean squared error, each sorted from high to low, as heat maps, and averaging the prediction results of the interval with the deepest colour to obtain the final binding affinity score of each candidate drug and the target protein.
Further, the hybrid neural network framework comprises five basic models, namely a deep neural network, a convolutional neural network, a long short-term memory neural network, a graph attention neural network and a Transformer.
Further, the step of combining the number of target proteins acted on by each candidate drug, its predicted activity value and its binding affinity score with the target protein to determine the final candidate drug specifically comprises:
taking the number of target proteins acted on by each candidate drug directly as its score to obtain a first score;
ranking the predicted activity values of the candidate drugs from high to low, with 10 points for the strongest activity and 1 point for the weakest, to obtain a second score;
ranking the binding affinity scores of the candidate drugs and the target protein from high to low, with 10 points for the highest value and 1 point for the lowest, to obtain a third score;
summing the first, second and third scores of each candidate drug, and taking the candidate with the highest total as the final candidate drug.
The beneficial effects of the method and system are as follows: multi-target analysis is carried out on the proteins acting in the pathway; the physicochemical properties of the drugs are combined with ensemble learning algorithms to predict drug activity; and several neural network models are built to predict the interaction between the target protein and the candidate drug molecules. By combining multi-target analysis, ensemble learning and a hybrid neural network, the invention reduces experimental error, improves the robustness and prediction accuracy of the model, and achieves low cost, high efficiency and high accuracy, so that the drugs screened by the model are more reliable.
Drawings
FIG. 1 is a flow chart of the steps of the multi-target drug screening method based on ensemble learning and a hybrid neural network according to the present invention;
FIG. 2 illustrates the multi-target analysis step according to an embodiment of the present invention;
FIG. 3 illustrates the steps for predicting candidate drug activity based on ensemble learning according to an embodiment of the present invention;
FIG. 4 illustrates the steps for predicting binding affinity based on the preset hybrid neural network framework according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
Referring to FIG. 1, the invention provides a multi-target drug screening method based on ensemble learning and a hybrid neural network, comprising the following steps:
S1, acquiring a pathogenic target protein, its known ligands and drug molecule library data;
S2, docking the drug molecule library data against the pathogenic target protein, and obtaining candidate drugs according to the docking scores;
S3, determining multi-target proteins corresponding to the pathogenic target protein, and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug;
S4, calculating the physicochemical properties of the known ligands and the candidate drugs, and predicting the activity of the candidate drugs with a preset ensemble learning regression model to obtain their predicted activity values;
S5, predicting the binding affinity between the pathogenic target protein and the candidate drugs with a preset hybrid neural network framework to obtain the binding affinity score of each candidate drug and the target protein;
S6, combining the number of target proteins acted on by each candidate drug, its predicted activity value and its binding affinity score with the target protein to determine the final candidate drug.
Further as a preferred embodiment of the method, the step of acquiring the pathogenic target protein, its known ligands and drug molecule library data specifically comprises:
S11, obtaining the sequence and crystal structure of the target protein from the UniProt database, and performing quality evaluation on the protein;
Specifically, the sequence of phosphoglycerate kinase 1 (PGK1), a target protein for colon cancer, was obtained from the UniProt database, and its crystal structure (PDB ID 4O33) was obtained from the Protein Data Bank (PDB). The quality of the protein structure was evaluated with the SAVES server, and the evaluation report shows that the structure passed.
S12, obtaining the known ligand molecules of the target protein and their corresponding simplified molecular-input line-entry system (SMILES) strings from the ChEMBL database;
Specifically, the structures of the known ligand molecules of PGK1 and the corresponding SMILES strings were obtained from the ChEMBL database, for a total of 91 ligand molecules of known activity.
S13, obtaining the structures of the drug molecule library and their corresponding SMILES strings from the ZINC15 database; the FDA drug molecule library is selected in this embodiment.
Further as a preferred embodiment of the method, the step of docking the drug molecule library data against the pathogenic target protein and obtaining candidate drugs according to the docking scores specifically comprises:
S21, preparing the target protein and the drug molecule library for docking;
Specifically, PGK1 is prepared for docking in Discovery Studio; the drug molecule library is then opened and the FDA drug molecules are prepared for docking.
S22, docking with the target protein as the receptor and the drug molecules as ligands to obtain docking scores;
S23, taking the drugs with the ten highest docking scores as candidate drugs.
Specifically, in Discovery Studio the prepared PGK1 is selected as the receptor and the prepared drug molecules as ligands, docking is carried out to obtain docking scores, and the ten FDA drugs with the highest scores are selected as candidate drugs.
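The selection in S23 can be sketched in a few lines of Python. This is a minimal illustration, not the Discovery Studio workflow itself: it assumes the docking scores have already been exported as a mapping from molecule name to score, that a higher score means better docking, and the molecule names and values are placeholders.

```python
from typing import Dict, List, Tuple


def select_top_candidates(docking_scores: Dict[str, float], n: int = 10) -> List[Tuple[str, float]]:
    """Return the n molecules with the highest docking scores, best first."""
    ranked = sorted(docking_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Placeholder scores for 15 hypothetical molecules (not data from the embodiment).
scores = {f"drug_{i:02d}": float(i) for i in range(1, 16)}
top10 = select_top_candidates(scores)
```

If the scoring function in use reports lower values for better poses, the sort direction would need to be reversed.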
Further as a preferred embodiment of the method, the step of determining multi-target proteins corresponding to the pathogenic target protein and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug specifically comprises:
S31, acquiring the protein-protein relationships of the pathogenic target protein from the STRING database, and selecting the protein combination with high confidence to obtain the multi-target proteins;
Specifically, referring to FIG. 2, the protein-protein relationships of the target protein were retrieved from the STRING database. Triosephosphate isomerase (TPI1), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), alpha-enolase (ENO1), phosphoglycerate mutase 1 (PGAM1) and phosphoglycerate mutase 4 (PGAM4) have significant interaction scores with PGK1 of 0.999, 0.998, 0.991, 0.990 and 0.987, respectively, which are the highest confidence scores, so these proteins were selected for further analysis.
S32, inputting the multi-target proteins into the DAVID database for analysis, and selecting proteins according to a preset rule;
S33, docking the selected proteins with the candidate drugs to obtain the number of target proteins acted on by each candidate drug.
Specifically, the protein combination obtained in the previous step is input into the DAVID database for gene function annotation and pathway enrichment analysis, the proteins associated with colon cancer-related functions and the glycolysis pathway are selected, and the candidate drugs are docked with the TPI1, GAPDH, ENO1, PGAM1 and PGAM4 proteins. The numbers of proteins successfully docked by the ten candidate drugs are 2, 3, 2, 4, 2, 5, 1, 4, 5 and 4, respectively.
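Counting the targets acted on by each candidate in S33 amounts to tallying successful docking runs per drug. The sketch below assumes each docking run has been reduced to a boolean success flag; the embodiment does not state the success criterion, and the candidate names and outcomes are hypothetical.

```python
from typing import Dict

# The five multi-target proteins named in the embodiment.
TARGETS = ["TPI1", "GAPDH", "ENO1", "PGAM1", "PGAM4"]


def count_docked_targets(docking_results: Dict[str, Dict[str, bool]]) -> Dict[str, int]:
    """Count, for each candidate drug, how many target proteins it docked successfully."""
    return {
        drug: sum(1 for target in TARGETS if outcome.get(target, False))
        for drug, outcome in docking_results.items()
    }


# Hypothetical docking outcomes for two placeholder candidates.
results = {
    "candidate_A": {"TPI1": True, "GAPDH": True, "ENO1": False, "PGAM1": False, "PGAM4": False},
    "candidate_B": {"TPI1": True, "GAPDH": True, "ENO1": True, "PGAM1": True, "PGAM4": True},
}
counts = count_docked_targets(results)
```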
Further as a preferred embodiment of the method, the step of calculating the physicochemical properties of the known ligands and the candidate drugs, and predicting the activity of the candidate drugs with a preset ensemble learning regression model to obtain their predicted activity values, specifically comprises:
S41, calculating the physicochemical properties of the candidate drugs and of the known ligand molecules of the target protein;
Specifically, referring to FIG. 3, the physicochemical properties of the candidate drug molecules and of the known ligand molecules of PGK1 were calculated with the Discovery Studio software; 204 properties were calculated in this example.
S42, taking the physicochemical properties of the known ligand molecules of the target protein as features and performing feature selection to obtain the selected features;
Specifically, the 204 properties of the known ligand molecules of PGK1 are used as features and the corresponding activity values as targets, and support vector machine recursive feature elimination (SVM-RFE) is used for feature selection, yielding 98 selected features.
S43, training a preset ensemble learning regression model on the activity values of the known ligand molecules of the target protein and their corresponding selected features to obtain the trained ensemble learning regression model;
S44, predicting the activity values of the candidate drug molecules with the trained ensemble learning regression model to obtain the predicted activity values of the candidate drugs.
Specifically, the features of the candidate drug molecules are input into the trained ensemble learning regression model to obtain their predicted activity values.
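The SVM-RFE feature selection of S42 can be illustrated with scikit-learn. This is a sketch under stated assumptions: the data are synthetic stand-ins with the shapes given in the embodiment (91 ligands, 204 properties, 98 features kept), and the use of a linear-kernel `SVR` with standardised inputs is an assumption, since the embodiment does not specify the SVM settings.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 91 ligands x 204 properties.
rng = np.random.default_rng(0)
X = rng.normal(size=(91, 204))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=91)

# Standardise so the linear-SVM weights are comparable across properties.
X_scaled = StandardScaler().fit_transform(X)

# RFE refits the linear SVR repeatedly and drops the feature with the
# smallest squared weight each round until 98 features remain.
selector = RFE(estimator=SVR(kernel="linear"), n_features_to_select=98, step=1)
selector.fit(X_scaled, y)

selected_mask = selector.support_          # boolean mask over the 204 properties
X_selected = selector.transform(X_scaled)  # ligands restricted to the kept features
```

The same mask would then be applied to the candidate drug features so that training and prediction use identical columns.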
Further as a preferred embodiment of the method, the ensemble learning regression model comprises the ensemble learning Boosting, Bagging and Stacking algorithms and their variants, and a voter based on the ensemble learning Voting algorithm.
Specifically, grid search (GridSearchCV) is used to tune and select appropriate parameters. The ensemble learning Boosting algorithms and their variants include Adaptive Boosting (AdaBoost), eXtreme Gradient Boosting (XGBoost), Gradient Boosting (GB), gradient boosting with categorical features (CatBoost) and other algorithms; the ensemble learning Bagging algorithms and their variants include Random Forest (RF), Extremely Randomized Trees (ET) and other algorithms.
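A voter over boosting and bagging regressors, tuned with `GridSearchCV`, might look like the following. It is a sketch only: the data are synthetic, XGBoost and CatBoost are left out so that the example depends on scikit-learn alone, and the estimator list and the small parameter grid are illustrative choices rather than the configuration of the embodiment.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 91 ligands and 10 candidates, 98 features each.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(91, 98))
y_train = X_train[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=91)
X_candidates = rng.normal(size=(10, 98))

# The voter averages the predictions of two boosting and two bagging regressors.
voter = VotingRegressor(
    estimators=[
        ("ada", AdaBoostRegressor(n_estimators=30, random_state=0)),
        ("gb", GradientBoostingRegressor(n_estimators=30, random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=30, random_state=0)),
        ("et", ExtraTreesRegressor(n_estimators=30, random_state=0)),
    ]
)

# Parameters of the sub-models are addressed as "<name>__<parameter>".
param_grid = {"gb__learning_rate": [0.05, 0.1], "rf__max_depth": [3, None]}
search = GridSearchCV(voter, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

predicted_activity = search.predict(X_candidates)
```

`search.best_params_` reports the grid point with the lowest cross-validated mean squared error, and `search.predict` uses the voter refitted with those parameters.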
Further as a preferred embodiment of the method, the step of predicting the binding affinity between the pathogenic target protein and the candidate drugs with the preset hybrid neural network framework to obtain the binding affinity score of each candidate drug and the target protein specifically comprises:
S51, taking the sequence of the pathogenic target protein, the simplified molecular-input line-entry system (SMILES) strings of the known ligands and their activity values as a data set, and dividing the data set into a training set, a test set and a validation set;
S52, training the preset hybrid neural network framework on the training, test and validation sets and tuning its parameters to obtain the trained hybrid neural network framework;
S53, encoding and deep-embedding the sequence of the pathogenic target protein and the SMILES strings of the candidate drugs, respectively, with the trained hybrid neural network framework;
Specifically, the network parameters are tuned according to performance on the validation set, and the final model is determined according to performance on the test set. The network parameters depend on the specific data set; the parameters to be tuned include the numbers of convolutional and max-pooling layers, the number of fully connected layers, the number of units per layer, the convolution kernel size, the learning rate, the training batch size, the choice of activation function, the number of epochs, and so on.
S54, inputting the deep embeddings of the target protein, the known ligands and the candidate drugs into a multilayer perceptron, and outputting the concordance index, the mean squared error and the predicted binding affinity score;
S55, plotting the concordance index and the mean squared error, each sorted from high to low, as heat maps, and averaging the prediction results of the interval with the deepest colour to obtain the final binding affinity score of each candidate drug and the target protein.
Specifically, the steps for predicting the binding affinity of the candidate drugs and PGK1 in this embodiment are shown in FIG. 4.
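One possible reading of S54 and S55 is sketched below: each model is scored by its concordance index and mean squared error on held-out ligands, and the candidate predictions of the best-scoring models are averaged. Selecting the top `k` models by concordance index (ties broken by lower error) stands in for picking the deepest-coloured heat-map interval and is an assumption, as are the model names and all numbers.

```python
import numpy as np


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order."""
    concordant, comparable = 0.0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue  # pairs with tied true values are not comparable
            comparable += 1
            agreement = (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j])
            if agreement > 0:
                concordant += 1.0
            elif agreement == 0:
                concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def mean_squared_error(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


def average_best_models(metrics, candidate_predictions, k=2):
    """Average the candidate predictions of the k models with the best (CI, MSE)."""
    ranked = sorted(metrics, key=lambda m: (-metrics[m][0], metrics[m][1]))
    best = ranked[:k]
    return np.mean([candidate_predictions[m] for m in best], axis=0), best


# Hypothetical held-out activities and per-model predictions.
y_true = [5.0, 6.0, 7.0, 8.0]
test_predictions = {
    "cnn": [5.1, 6.2, 6.9, 8.1],
    "dnn": [6.0, 5.0, 7.5, 8.5],
    "transformer": [5.2, 5.9, 7.1, 7.8],
}
candidate_predictions = {"cnn": [6.0, 7.0], "dnn": [9.0, 1.0], "transformer": [6.4, 7.2]}

metrics = {
    name: (concordance_index(y_true, pred), mean_squared_error(y_true, pred))
    for name, pred in test_predictions.items()
}
final_scores, best_models = average_best_models(metrics, candidate_predictions)
```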
Further as a preferred embodiment of the method, the hybrid neural network framework comprises five basic models, namely a deep neural network, a convolutional neural network, a convolutional neural network combined with a long short-term memory neural network, a graph attention neural network and a Transformer.
Specifically, the deep neural network model comprises an encoding layer for the sequence and the SMILES strings and a fully connected layer: the sequence of PGK1 is encoded as amino acid composition (AAC) features, the SMILES strings of the known ligands and candidate drug molecules are encoded into feature vectors with the RDKit library, and the feature vectors are input into the fully connected layer, which outputs the corresponding deep embeddings. The convolutional neural network comprises an embedding layer, a convolutional layer and a max-pooling layer. The combined convolutional and long short-term memory neural network comprises an embedding layer, a convolutional layer, a max-pooling layer and an LSTM layer, with the output of the max-pooling layer fed to the LSTM layer. The graph attention neural network comprises an embedding layer, a graph attention layer, a convolutional layer and a max-pooling layer. The Transformer comprises a multi-layer neural network with multi-head attention that encodes the target protein sequence and the drug molecule SMILES strings. The network parameters depend on the specific data set; the parameters to be tuned include the numbers of convolutional and max-pooling layers, the number of fully connected layers, the number of units per layer, the convolution kernel size, the learning rate, the training batch size, the choice of activation function, the number of epochs, and so on.
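As an illustration of the AAC encoding used by the deep neural network branch, the sketch below computes the simplest form of amino acid composition: the fraction of each of the 20 standard residues in a sequence. The embodiment does not state which AAC variant is used, so the single-residue form is an assumption, and the input sequence is a toy string rather than the PGK1 sequence.

```python
from typing import List

# One-letter codes of the 20 standard amino acids, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Encode a protein sequence as the fraction of each standard amino acid."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


# Toy sequence for illustration only.
aac_vector = amino_acid_composition("ACDA")
```

The resulting fixed-length vector (20 values here) is what a fully connected layer would take as input, regardless of the length of the protein.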
Further as a preferred embodiment of the method, the step of combining the number of target proteins acted on by each candidate drug, its predicted activity value and its binding affinity score with the target protein to determine the final candidate drug specifically comprises:
S61, taking the number of target proteins acted on by each candidate drug directly as its score to obtain a first score;
S62, ranking the predicted activity values of the candidate drugs from high to low and scoring them, with 10 points for the strongest activity and 1 point for the weakest, to obtain a second score;
S63, ranking the binding affinity scores of the candidate drugs and the target protein from high to low and scoring them, with 10 points for the highest value and 1 point for the lowest, to obtain a third score;
S64, summing the first, second and third scores of each candidate drug, and taking the candidate with the highest total as the final candidate drug.
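The scoring of S61 to S64 can be worked through as follows. The target counts are the ten values reported in the embodiment; the predicted activity and binding affinity values are hypothetical placeholders, since the embodiment does not list them, and the handling of ties (first occurrence ranks higher) is likewise an assumption.

```python
from typing import List, Sequence


def rank_scores(values: Sequence[float]) -> List[int]:
    """Give len(values) points to the highest value, down to 1 point for the lowest."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    scores = [0] * len(values)
    for rank, index in enumerate(order):
        scores[index] = len(values) - rank
    return scores


# First score: number of docked target proteins per candidate (from the embodiment).
target_counts = [2, 3, 2, 4, 2, 5, 1, 4, 5, 4]
# Hypothetical predicted activity and binding affinity values (placeholders).
predicted_activity = [5.1, 6.3, 4.8, 7.0, 5.5, 7.9, 4.2, 6.8, 7.4, 6.0]
binding_affinity = [6.0, 6.5, 5.2, 7.1, 5.8, 8.0, 4.9, 6.9, 7.6, 6.2]

second = rank_scores(predicted_activity)
third = rank_scores(binding_affinity)
totals = [a + b + c for a, b, c in zip(target_counts, second, third)]

# The candidate with the highest summed score is taken as the final candidate.
best_index = max(range(len(totals)), key=lambda i: totals[i])
```

With these placeholder values the sixth candidate (index 5) has the highest total, 5 + 10 + 10 = 25.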
A multi-target drug screening device based on ensemble learning and a hybrid neural network comprises:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the multi-target drug screening method based on ensemble learning and a hybrid neural network described above.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A multi-target drug screening method based on ensemble learning and a hybrid neural network, characterized by comprising the following steps:
acquiring a pathogenic target protein, the corresponding known ligands, and drug molecule library data;
performing docking based on the drug molecule library data and the pathogenic target protein, and obtaining candidate drugs according to the docking scores;
determining the multi-target proteins corresponding to the pathogenic target protein and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug;
calculating the physicochemical properties of the corresponding known ligands and the candidate drugs, and predicting the activity of the candidate drugs based on a preset ensemble learning regression model to obtain the predicted activity values of the candidate drugs;
predicting the binding force between the pathogenic target protein and the candidate drugs based on a preset hybrid neural network framework to obtain the binding force scores between the candidate drugs and the target protein;
combining the number of target proteins acted on by each candidate drug, the predicted activity value of the candidate drug, and the binding force score between the candidate drug and the target protein to determine the final candidate drug.
2. The multi-target drug screening method based on ensemble learning and a hybrid neural network according to claim 1, wherein the step of acquiring a pathogenic target protein, the corresponding known ligands, and drug molecule library data specifically comprises:
obtaining the sequence and the crystal structure of the target protein from the UniProt database, and performing quality evaluation on the protein;
obtaining the known ligand molecules of the target protein and their corresponding simplified molecular-input line-entry system (SMILES) strings from the ChEMBL database;
obtaining the drug molecule library structures and their corresponding SMILES strings from the ZINC15 database.
3. The multi-target drug screening method based on ensemble learning and a hybrid neural network according to claim 2, wherein the step of performing docking based on the drug molecule library data and the pathogenic target protein and obtaining candidate drugs according to the docking scores specifically comprises:
preparing the target protein and the drug molecule library before docking;
docking with the target protein as the receptor and the drug molecules as the ligands to obtain docking scores;
taking the drugs with the top 10 docking scores as the candidate drugs.
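By way of non-limiting illustration, selecting the ten best-docked molecules may be sketched as follows. The source does not state the sign convention of the docking score; many docking programs report an estimated binding energy for which a more negative value is better, so the sketch exposes this as the assumed parameter `lower_is_better`. The function name, molecule identifiers and score values are hypothetical.

```python
def top_candidates(docking_scores, k=10, lower_is_better=True):
    """Return the names of the k best-scoring molecules.

    docking_scores: mapping from molecule identifier to docking score.
    lower_is_better: assumed convention; set to False if a higher
    docking score means stronger predicted binding.
    """
    ranked = sorted(
        docking_scores.items(),
        key=lambda item: item[1],
        reverse=not lower_is_better,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical library of twelve molecules with made-up scores.
library = {f"mol_{i:02d}": -5.0 - 0.5 * i for i in range(12)}
candidates = top_candidates(library)
```

Keeping the ranking direction explicit avoids silently selecting the ten worst binders when switching between docking programs with opposite conventions.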
4. The multi-target drug screening method based on ensemble learning and a hybrid neural network according to claim 3, wherein the step of determining the multi-target proteins corresponding to the pathogenic target protein and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug specifically comprises:
acquiring the protein-protein relations of the pathogenic target protein from the STRING database, and selecting the protein combinations with a high confidence level to obtain the multi-target proteins;
inputting the multi-target proteins into the DAVID database for analysis, and selecting proteins according to a preset rule;
docking the selected proteins with the candidate drugs to obtain the number of target proteins acted on by each candidate drug.
5. The method according to claim 4, wherein the step of calculating the physicochemical properties of the corresponding known ligands and the candidate drugs and predicting the activity of the candidate drugs based on a preset ensemble learning regression model to obtain the predicted activity values of the candidate drugs specifically comprises:
calculating the physicochemical properties of the candidate drugs and of the known ligand molecules of the target protein;
taking the physicochemical properties of the known ligand molecules of the target protein as features and performing feature selection to obtain selected features;
training the preset ensemble learning regression model using the activity values of the known ligand molecules of the target protein and their corresponding selected features to obtain a trained ensemble learning regression model;
predicting the activity values of the candidate drug molecules based on the trained ensemble learning regression model to obtain the predicted activity values of the candidate drugs.
6. The method according to claim 5, wherein the ensemble learning regression model comprises a voter based on the ensemble learning Boosting, Bagging and Stacking algorithms and their variants, together with the ensemble learning Voting algorithm.
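By way of non-limiting illustration, the voting idea for regression (averaging the predictions of several fitted base regressors) may be sketched as follows. The sketch is a simplification: it uses two toy base learners (a mean predictor and a k-nearest-neighbour regressor) in place of Boosting, Bagging or Stacking models, and assumes unweighted averaging. All class names, features and activity values are hypothetical.

```python
import math


class MeanRegressor:
    """Toy base learner: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class KNNRegressor:
    """Toy base learner: averages the k nearest training targets."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            nearest = sorted(
                range(len(self.X_)),
                key=lambda i: math.dist(x, self.X_[i]),
            )[: self.k]
            preds.append(sum(self.y_[i] for i in nearest) / len(nearest))
        return preds


class AveragingVoter:
    """Voting for regression: unweighted mean of base-model predictions."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        for model in self.models:
            model.fit(X, y)
        return self

    def predict(self, X):
        per_model = [model.predict(X) for model in self.models]
        return [sum(column) / len(column) for column in zip(*per_model)]


# Hypothetical one-feature training data (descriptor -> activity value).
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [1.0, 2.0, 3.0, 4.0]
voter = AveragingVoter([MeanRegressor(), KNNRegressor(k=1), KNNRegressor(k=3)])
prediction = voter.fit(X_train, y_train).predict([[2.1]])[0]
```

In practice the toy learners would be replaced by trained Boosting, Bagging or Stacking regressors; the averaging step itself stays the same.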
7. The multi-target drug screening method based on ensemble learning and a hybrid neural network according to claim 5, wherein the step of predicting the binding force between the pathogenic target protein and the candidate drugs based on a preset hybrid neural network framework to obtain the binding force scores between the candidate drugs and the target protein specifically comprises:
taking the sequence of the pathogenic target protein, the simplified molecular-input line-entry system (SMILES) strings corresponding to the known ligands, and the activity values as a data set, and dividing the data set into a training set, a test set and a validation set;
training the preset hybrid neural network framework based on the training set, the test set and the validation set, and tuning its parameters to obtain a trained hybrid neural network framework;
encoding and deep-embedding the sequence of the pathogenic target protein and the SMILES strings corresponding to the candidate drugs, respectively, based on the trained hybrid neural network framework;
inputting the deep embeddings of the target protein, the known ligands and the candidate drugs into a multilayer perceptron, and outputting the consistency (concordance) index, the mean square error, and the predicted binding force scores;
plotting the consistency index and the mean square error, each ordered from high to low, as heat maps, and averaging the prediction results of the interval with the darkest color to obtain the final binding force score between the candidate drug and the target protein.
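By way of non-limiting illustration, the two evaluation metrics named in claim 7 may be computed as sketched below. The sketch assumes that the consistency index is the concordance index commonly used for binding affinity regression: the fraction of pairs with different true values whose predicted order agrees with the true order, with prediction ties counted as one half. The function names and numeric values are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs ranked in the correct order."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # equal true values: the pair is not comparable
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5  # tied prediction counts as half
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return concordant / comparable


# Hypothetical activity values and predictions for four ligands.
y_true = [5.0, 6.2, 7.1, 8.0]
y_pred = [5.5, 6.0, 7.5, 7.0]
ci = concordance_index(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
```

A concordance index of 1.0 means every comparable pair is ordered correctly and 0.5 corresponds to random ordering, whereas a lower mean square error is better; the two metrics therefore have opposite directions when models are compared.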
8. The method according to claim 7, wherein the hybrid neural network framework comprises five base models, namely a deep neural network, a convolutional neural network, a long short-term memory neural network, a graph attention neural network and a Transformer.
9. The method according to claim 8, wherein the step of combining the number of target proteins acted on by each candidate drug, the predicted activity value of the candidate drug, and the binding force score between the candidate drug and the target protein to determine the final candidate drug specifically comprises:
taking the number of target proteins acted on by the candidate drug directly as a score to obtain a first score;
ranking the predicted activity values of the candidate drugs from high to low and scoring them according to the rule that the strongest activity receives 10 points and the weakest activity receives 1 point, to obtain a second score;
ranking the binding force scores between the candidate drugs and the target protein from high to low and scoring them according to the rule that the highest binding force value receives 10 points and the lowest binding force value receives 1 point, to obtain a third score;
summing the first score, the second score and the third score of each candidate drug, and taking the candidate drug with the highest total as the final candidate drug.
CN202110339575.6A 2021-03-30 2021-03-30 Multi-target drug screening method based on integrated learning and hybrid neural network Active CN113066525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110339575.6A CN113066525B (en) 2021-03-30 2021-03-30 Multi-target drug screening method based on integrated learning and hybrid neural network

Publications (2)

Publication Number Publication Date
CN113066525A (en) 2021-07-02
CN113066525B (en) 2023-06-23

Family

ID=76564489

Country Status (1)

Country Link
CN (1) CN113066525B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114432311A (en) * 2021-11-17 2022-05-06 中山大学 Anti-idiopathic pulmonary fibrosis compound and computer prediction screening method thereof
CN115762662A (en) * 2022-11-30 2023-03-07 苏州创腾软件有限公司 Specific target drug generation method and device based on graph neural network and MaxFlow platform
CN116155630A (en) * 2023-04-21 2023-05-23 北京邮电大学 Malicious traffic identification method and related equipment
CN117912591A (en) * 2024-03-19 2024-04-19 鲁东大学 Kinase-drug interaction prediction method based on deep contrast learning
CN117912591B (en) * 2024-03-19 2024-05-31 鲁东大学 Kinase-drug interaction prediction method based on deep contrast learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003038672A1 (en) * 2001-10-31 2003-05-08 Sumitomo Pharmaceuticals Company, Limited Screening method, screening system and screening program
CN102222178A (en) * 2011-03-31 2011-10-19 清华大学深圳研究生院 Method for screening and/or designing medicines aiming at multiple targets
CN106446607A (en) * 2016-09-26 2017-02-22 华东师范大学 Drug target virtual screening method based on interactive fingerprints and machine learning
CN107731309A (en) * 2017-08-31 2018-02-23 武汉百药联科科技有限公司 A kind of Forecasting Methodology of pharmaceutical activity and its application
CN110444250A (en) * 2019-03-26 2019-11-12 广东省微生物研究所(广东省微生物分析检测中心) High-throughput drug virtual screening system based on molecular fingerprint and deep learning
CN111445945A (en) * 2020-03-20 2020-07-24 北京晶派科技有限公司 Small molecule activity prediction method and device and computing equipment
CN111785320A (en) * 2020-06-28 2020-10-16 西安电子科技大学 Drug target interaction prediction method based on multilayer network representation learning
CN112489737A (en) * 2020-11-16 2021-03-12 南京希瑞斯细胞工程有限公司 Intelligent medicine target affinity prediction method and flow

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Zhang Yuzhuo et al.: "Multi-target drug screening method based on ensemble learning and hybrid neural network", Guangzhou Chemistry (《广州化学》), vol. 42, no. 6, 31 December 2017 (2017-12-31), pages 61 - 66 *
Yang Qian: "Prediction of anti-tumor target combinations based on a two-layer similarity fusion algorithm (TL-SEA)", China Master's Theses Full-text Database, Basic Sciences (《中国优秀硕士学位论文全文数据库 基础科学辑》), no. 1, 15 January 2019 (2019-01-15), pages 006 - 723 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114432311A (en) * 2021-11-17 2022-05-06 中山大学 Anti-idiopathic pulmonary fibrosis compound and computer prediction screening method thereof
CN114432311B (en) * 2021-11-17 2023-08-11 中山大学 Compound for resisting idiopathic pulmonary fibrosis and computer predictive screening method thereof
CN115762662A (en) * 2022-11-30 2023-03-07 苏州创腾软件有限公司 Specific target drug generation method and device based on graph neural network and MaxFlow platform
CN116155630A (en) * 2023-04-21 2023-05-23 北京邮电大学 Malicious traffic identification method and related equipment
CN116155630B (en) * 2023-04-21 2023-07-04 北京邮电大学 Malicious traffic identification method and related equipment
CN117912591A (en) * 2024-03-19 2024-04-19 鲁东大学 Kinase-drug interaction prediction method based on deep contrast learning
CN117912591B (en) * 2024-03-19 2024-05-31 鲁东大学 Kinase-drug interaction prediction method based on deep contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant