CN113066525A

CN113066525A - Multi-target drug screening method based on ensemble learning and hybrid neural network

Info

Publication number: CN113066525A
Application number: CN202110339575.6A
Authority: CN
Inventors: 陈观兴; 谭晓军; 陈语谦
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2021-03-30
Filing date: 2021-03-30
Publication date: 2021-07-02
Anticipated expiration: 2041-03-30
Also published as: CN113066525B

Abstract

The invention discloses a multi-target drug screening method based on ensemble learning and a hybrid neural network, comprising the following steps: acquiring data; performing docking and obtaining candidate drugs according to docking scores; determining multi-target proteins and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug; predicting the activity of the candidate drugs with a preset ensemble learning regression model to obtain a predicted activity value for each candidate drug; predicting the binding affinity between the pathogenic target protein and the candidate drugs with a preset hybrid neural network framework to obtain a binding affinity score for each candidate drug and the target protein; and determining the final candidate drug from these results combined. By combining multi-target analysis, ensemble learning and a hybrid neural network, the invention achieves low-cost, high-efficiency and high-accuracy drug screening. The method can be widely applied in the field of drug screening.

Description

Multi-target drug screening method based on ensemble learning and hybrid neural network

Technical Field

The invention relates to the field of drug screening, and in particular to a multi-target drug screening method based on ensemble learning and a hybrid neural network.

Background

Existing drug screening methods mainly consider only one aspect: the drug, the target, or the drug-target interaction. Such screening is one-sided, so the screened drugs do not necessarily have good efficacy in treating the disease.

With respect to target proteins, existing methods generally perform virtual screening against a single protein to determine candidate drugs. For some pathways, however, the pathogenic mechanism is often influenced by several proteins, so analysing only one target protein is one-sided. With respect to drugs, existing methods do not effectively combine the physicochemical properties of drugs with statistical analysis to predict drug activity, and the predictions often contain considerable error. With respect to drug-target interaction, existing methods generally use a single neural network framework to predict the interaction. A neural network acts as a black box, and the large amount of complex information between molecules adds uncertainty to drug prediction, so a single neural network framework can produce large errors. In summary, existing drug development schemes suffer from high cost, low efficiency and low accuracy.

Disclosure of Invention

To solve the technical problems of high cost, low efficiency and low accuracy in drug research and development, the invention aims to provide a multi-target drug screening method based on ensemble learning and a hybrid neural network.

The first technical scheme adopted by the invention is as follows: a multi-target drug screening method based on ensemble learning and a hybrid neural network, comprising the following steps:

acquiring a pathogenic target protein, its corresponding known ligands, and drug molecule library data;

docking the drug molecule library data with the pathogenic target protein, and obtaining candidate drugs according to the docking scores;

determining multi-target proteins corresponding to the pathogenic target protein, and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug;

calculating the physicochemical properties of the known ligands and the candidate drugs, and predicting the activity of the candidate drugs with a preset ensemble learning regression model to obtain a predicted activity value for each candidate drug;

predicting the binding affinity between the pathogenic target protein and the candidate drugs with a preset hybrid neural network framework to obtain a binding affinity score for each candidate drug and the target protein;

and combining the number of target proteins acted on by each candidate drug, the predicted activity value of each candidate drug, and the binding affinity score of each candidate drug and the target protein to determine the final candidate drug.

Further, the step of acquiring a pathogenic target protein, its corresponding known ligands, and drug molecule library data specifically comprises:

obtaining the sequence and the crystal structure of the target protein from the UniProt database, and performing quality evaluation on the protein;

obtaining known ligand molecules of the target protein and their corresponding simplified molecular-input line-entry system (SMILES) strings from the ChEMBL database;

obtaining drug molecule library structures and their corresponding SMILES strings from the ZINC15 database.

Further, the step of docking the drug molecule library data with the pathogenic target protein and obtaining candidate drugs according to the docking scores specifically comprises:

preparing the target protein and the drug molecule library for docking;

docking with the target protein as the receptor and the drug molecules as ligands to obtain docking scores;

taking the drugs with the top 10 docking scores as the candidate drugs.

Further, the step of determining multi-target proteins corresponding to the pathogenic target protein and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug specifically comprises:

acquiring the protein-protein relationships of the pathogenic target protein from the STRING database, and selecting the protein combinations with high confidence to obtain the multi-target proteins;

inputting the multi-target proteins into the DAVID database for analysis, and selecting proteins according to a preset rule;

and docking the selected proteins with the candidate drugs to obtain the number of target proteins acted on by each candidate drug.

Further, the step of calculating the physicochemical properties of the known ligands and the candidate drugs, and predicting the activity of the candidate drugs with a preset ensemble learning regression model to obtain a predicted activity value for each candidate drug, specifically comprises:

calculating the physicochemical properties of the candidate drugs and of the known ligand molecules of the target protein;

taking the physicochemical properties of the known ligand molecules of the target protein as features and performing feature selection to obtain the selected features;

training a preset ensemble learning regression model with the activity values of the known ligand molecules of the target protein and their corresponding selected features to obtain a trained ensemble learning regression model;

and predicting the activity values of the candidate drug molecules with the trained ensemble learning regression model to obtain the predicted activity value of each candidate drug.

Further, the ensemble learning regression model comprises the ensemble learning Boosting, Bagging and Stacking algorithms and their variants, and a voter based on the ensemble learning Voting algorithm.

Further, the step of predicting the binding affinity between the pathogenic target protein and the candidate drugs with the preset hybrid neural network framework to obtain the binding affinity score of each candidate drug and the target protein specifically comprises:

taking the sequence of the pathogenic target protein, the simplified molecular-input line-entry system (SMILES) strings of the known ligands, and their activity values as a data set, and dividing the data set into a training set, a test set and a validation set;

training a preset hybrid neural network framework on the training set, the test set and the validation set, and tuning its parameters to obtain a trained hybrid neural network framework;

encoding and deep-embedding the sequence of the pathogenic target protein and the SMILES strings of the candidate drugs, respectively, with the trained hybrid neural network framework;

inputting the deep embeddings of the target protein, the known ligands and the candidate drugs into a multilayer perceptron, and outputting the concordance index, the mean square error and the predicted binding affinity score;

and plotting the concordance index and the mean square error, each ordered from high to low, as heat maps, and averaging the prediction results in the interval with the deepest color to obtain the final binding affinity score of each candidate drug and the target protein.

Further, the hybrid neural network framework comprises five basic models, namely a deep neural network, a convolutional neural network, a long-short term memory neural network, a graph attention neural network and a Transformer.

Further, the step of determining the final candidate drug by combining the number of target proteins acted on by each candidate drug, the predicted activity value of each candidate drug, and the binding affinity score of each candidate drug and the target protein specifically comprises:

taking the number of target proteins acted on by each candidate drug as its actual score to obtain a first score;

ranking the predicted activity values of the candidate drugs from high to low, with the strongest activity scoring 10 points and the weakest scoring 1 point, to obtain a second score;

ranking the binding affinity scores of the candidate drugs and the target protein from high to low, with the highest value scoring 10 points and the lowest scoring 1 point, to obtain a third score;

and summing the first, second and third scores of each candidate drug, and taking the candidate drug with the highest total as the final candidate drug.

The beneficial effects of the invention are as follows: multi-target analysis is performed on the proteins acting in the pathway; the physicochemical properties of the drugs are combined with ensemble learning algorithms to predict drug activity; and several neural network models are built to predict the interaction between the target protein and the candidate drug molecules. By combining multi-target analysis, ensemble learning and a hybrid neural network, the invention reduces experimental error, improves the robustness and prediction accuracy of the model, achieves low cost, high efficiency and high accuracy, and makes the screened drugs more reliable.

Drawings

FIG. 1 is a flow chart illustrating the steps of a method for screening multi-target drugs based on an ensemble learning and hybrid neural network according to the present invention;

FIG. 2 illustrates a multi-target analysis step according to an embodiment of the present invention;

FIG. 3 illustrates steps for predicting candidate drug activity based on ensemble learning according to embodiments of the present invention;

FIG. 4 shows the steps for predicting binding affinity based on a preset hybrid neural network framework according to an embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.

Referring to FIG. 1, the invention provides a multi-target drug screening method based on ensemble learning and a hybrid neural network, comprising the following steps:

S1, acquiring a pathogenic target protein, its corresponding known ligands, and drug molecule library data;

S2, docking the drug molecule library data with the pathogenic target protein, and obtaining candidate drugs according to the docking scores;

S3, determining multi-target proteins corresponding to the pathogenic target protein, and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug;

S4, calculating the physicochemical properties of the known ligands and the candidate drugs, and predicting the activity of the candidate drugs with a preset ensemble learning regression model to obtain a predicted activity value for each candidate drug;

S5, predicting the binding affinity between the pathogenic target protein and the candidate drugs with a preset hybrid neural network framework to obtain a binding affinity score for each candidate drug and the target protein;

S6, combining the number of target proteins acted on by each candidate drug, the predicted activity value of each candidate drug, and the binding affinity score of each candidate drug and the target protein to determine the final candidate drug.

As a further preferred embodiment of the method, the step of acquiring the pathogenic target protein, its corresponding known ligands, and drug molecule library data specifically comprises:

S11, obtaining the sequence and crystal structure of the target protein from the UniProt database, and performing quality evaluation on the protein;

Specifically, the sequence of phosphoglycerate kinase 1 (PGK1), a target protein for colon cancer, was obtained from the UniProt database, and its crystal structure (entry 4O33) was obtained from the Protein Data Bank (PDB). The quality of the protein structure was evaluated with the SAVES server, and the evaluation report shows that the structure passed.

S12, obtaining known ligand molecules of the target protein and their corresponding simplified molecular-input line-entry system (SMILES) strings from the ChEMBL database;

Specifically, the structures of the known ligand molecules of PGK1 and their corresponding SMILES strings were obtained from the ChEMBL database, totaling 91 ligand molecules of known activity.

S13, obtaining drug molecule library structures and their corresponding SMILES strings from the ZINC15 database; the FDA drug molecule library is selected in this embodiment.

As a further preferred embodiment of the method, the step of docking the drug molecule library data with the pathogenic target protein and obtaining candidate drugs according to the docking scores specifically comprises:

S21, preparing the target protein and the drug molecule library for docking;

Specifically, PGK1 was prepared for docking in Discovery Studio; the drug molecule library was then opened and the FDA drug molecules were prepared for docking.

S22, docking with the target protein as the receptor and the drug molecules as ligands to obtain docking scores;

S23, taking the drugs with the top 10 docking scores as the candidate drugs.

Specifically, in Discovery Studio the prepared PGK1 is selected as the receptor and the prepared drug molecules as ligands, docking is carried out to obtain docking scores, and the FDA drugs with the top 10 scores are selected as candidate drugs.
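As a non-limiting illustration, the selection in step S23 can be sketched in Python as follows. The function name `select_top_candidates` and the example scores are hypothetical, and the sketch assumes that a larger docking score indicates a better pose; for scoring functions where a lower value is better, the sort order would be reversed.

```python
from typing import Dict, List, Tuple


def select_top_candidates(
    docking_scores: Dict[str, float], top_n: int = 10
) -> List[Tuple[str, float]]:
    """Return the top_n (drug, score) pairs, best score first.

    Assumes a larger docking score means a better pose.
    """
    ranked = sorted(docking_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical scores for illustration only; not data from the embodiment.
example_scores = {f"drug_{i:02d}": float(i) for i in range(1, 16)}
candidates = select_top_candidates(example_scores)
print([name for name, _ in candidates])
```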

As a further preferred embodiment of the method, the step of determining multi-target proteins corresponding to the pathogenic target protein and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted on by each candidate drug specifically comprises:

S31, acquiring the protein-protein relationships of the pathogenic target protein from the STRING database, and selecting the protein combinations with high confidence to obtain the multi-target proteins;

Specifically, referring to FIG. 2, the protein-protein relationships of the target protein were retrieved from the STRING database. Triosephosphate isomerase (TPI1), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), alpha-enolase (ENO1), phosphoglycerate mutase 1 (PGAM1) and phosphoglycerate mutase 4 (PGAM4) showed significant interaction with PGK1, with scores of 0.999, 0.998, 0.991, 0.990 and 0.987, respectively. These were the highest confidence scores, so these proteins were selected for further analysis.
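As a non-limiting illustration, the confidence-based selection of step S31 can be sketched as a simple filter over the interaction scores reported above. The function name `select_partners` and the 0.9 cutoff are assumptions; the embodiment states only that the highest-scoring partners were kept and gives no numeric threshold.

```python
from typing import Dict, List

# Interaction scores with PGK1 as reported in this embodiment.
STRING_SCORES: Dict[str, float] = {
    "TPI1": 0.999,
    "GAPDH": 0.998,
    "ENO1": 0.991,
    "PGAM1": 0.990,
    "PGAM4": 0.987,
}


def select_partners(scores: Dict[str, float], min_score: float = 0.9) -> List[str]:
    """Keep partners at or above min_score, strongest interaction first.

    The 0.9 default is an assumed cutoff for "high confidence".
    """
    kept = [(name, score) for name, score in scores.items() if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in kept]


multi_targets = select_partners(STRING_SCORES)
print(multi_targets)
```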

S32, inputting the multi-target proteins into the DAVID database for analysis, and selecting proteins according to a preset rule;

S33, docking the selected proteins with the candidate drugs to obtain the number of target proteins acted on by each candidate drug.

Specifically, the protein combinations obtained in the previous step are input into the DAVID database for gene function annotation and pathway enrichment analysis, and the proteins associated with colon cancer-related functions and the glycolysis pathway are selected. The candidate drugs are then docked with the TPI1, GAPDH, ENO1, PGAM1 and PGAM4 proteins to obtain the number of successfully docked proteins for each candidate drug, which is 2, 3, 2, 4, 2, 5, 1, 4, 5 and 4, respectively.
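As a non-limiting illustration, the counting in step S33 can be sketched as follows. The candidate names and the docking outcomes in `docked_example` are hypothetical, since the embodiment reports only the resulting counts and not which proteins each drug docked to.

```python
from typing import Dict, Set

MULTI_TARGETS = ("TPI1", "GAPDH", "ENO1", "PGAM1", "PGAM4")


def count_targets(docked: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count, per candidate, how many of the selected proteins it docked to."""
    allowed = set(MULTI_TARGETS)
    return {drug: len(proteins & allowed) for drug, proteins in docked.items()}


# Hypothetical docking outcomes for illustration only.
docked_example = {
    "candidate_A": {"TPI1", "GAPDH"},
    "candidate_B": {"TPI1", "GAPDH", "ENO1", "PGAM1", "PGAM4"},
    "candidate_C": {"ENO1", "ALB"},  # ALB is outside the selected set and is ignored
}
target_counts = count_targets(docked_example)
print(target_counts)
```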

As a further preferred embodiment of the method, the step of calculating the physicochemical properties of the known ligands and the candidate drugs, and predicting the activity of the candidate drugs with a preset ensemble learning regression model to obtain a predicted activity value for each candidate drug, specifically comprises:

S41, calculating the physicochemical properties of the candidate drugs and of the known ligand molecules of the target protein;

Specifically, referring to FIG. 3, the physicochemical properties of the candidate drug molecules and of the known ligand molecules of PGK1 were calculated with the Discovery Studio software; 204 genetic attributes were calculated in this example.

S42, taking the physicochemical properties of the known ligand molecules of the target protein as features and performing feature selection to obtain the selected features;

Specifically, the 204 genetic attributes of the known ligand molecules of PGK1 are used as features and the corresponding activity values as targets, and support vector machine recursive feature elimination (SVM-RFE) is used for feature selection, yielding 98 selected features.
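As a non-limiting illustration, SVM-RFE of the kind used in step S42 can be sketched with scikit-learn, which is an assumed tool here since the embodiment does not name its implementation. The data are random stand-ins with the same shape as the embodiment (91 ligands, 204 properties), and the feature scaling step is likewise an assumption.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in: 91 ligands x 204 properties, with random activity values.
X = rng.normal(size=(91, 204))
y = rng.normal(size=91)

# Scale features so that the linear SVM weights are comparable across properties.
X_scaled = StandardScaler().fit_transform(X)

# RFE repeatedly fits the linear SVR and drops the feature with the smallest weight
# until 98 features remain.
selector = RFE(estimator=SVR(kernel="linear"), n_features_to_select=98, step=1)
selector.fit(X_scaled, y)

X_selected = selector.transform(X_scaled)
selected_idx = np.flatnonzero(selector.support_)
print(X_selected.shape)
```

The same fitted `selector` would then be applied to the candidate drug properties so that the known ligands and the candidates share one feature set.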

S43, training a preset ensemble learning regression model with the activity values of the known ligand molecules of the target protein and their corresponding selected features to obtain a trained ensemble learning regression model;

S44, predicting the activity values of the candidate drug molecules with the trained ensemble learning regression model to obtain the predicted activity value of each candidate drug.

Specifically, the selected features of the candidate drug molecules are input into the trained ensemble learning regression model to obtain their predicted activity values.

As a further preferred embodiment of the method, the ensemble learning regression model comprises a voter based on the ensemble learning Boosting, Bagging and Stacking algorithms and their variants, together with the ensemble learning Voting algorithm.

Specifically, grid search (GridSearchCV) is used to tune and select appropriate parameters. The ensemble learning Boosting algorithms and variants include Adaptive Boosting (AdaBoost), eXtreme Gradient Boosting (XGBoost), Gradient Boosting (GB), gradient boosting with categorical features (CatBoost) and other algorithms, and the ensemble learning Bagging algorithms and variants include Random Forest (RF), Extremely Randomized Trees (ET) and other algorithms.
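As a non-limiting illustration, a voter over Boosting- and Bagging-style regressors with grid-search tuning can be sketched with scikit-learn. The data are synthetic, the parameter grid is an arbitrary small example, and XGBoost, CatBoost and Stacking are omitted to keep the sketch within one library; none of these choices is taken from the embodiment.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in: 91 known ligands x 98 selected features, 10 candidate drugs.
X_train = rng.normal(size=(91, 98))
y_train = X_train[:, 0] * 2.0 + rng.normal(scale=0.1, size=91)
X_candidates = rng.normal(size=(10, 98))

# Tune one base learner with a small illustrative grid.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, None]},
    cv=3,
)
search.fit(X_train, y_train)

# The voter averages the predictions of its base regressors.
voter = VotingRegressor(
    estimators=[
        ("ada", AdaBoostRegressor(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingRegressor(random_state=0)),
        ("rf", search.best_estimator_),
        ("et", ExtraTreesRegressor(n_estimators=50, random_state=0)),
    ]
)
voter.fit(X_train, y_train)
predicted_activity = voter.predict(X_candidates)
print(predicted_activity.shape)
```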

As a further preferred embodiment of the method, the step of predicting the binding affinity between the pathogenic target protein and the candidate drugs with the preset hybrid neural network framework to obtain the binding affinity score of each candidate drug and the target protein specifically comprises:

S51, taking the sequence of the pathogenic target protein, the simplified molecular-input line-entry system (SMILES) strings of the known ligands, and their activity values as a data set, and dividing the data set into a training set, a test set and a validation set;

S52, training a preset hybrid neural network framework on the training set, the test set and the validation set, and tuning its parameters to obtain a trained hybrid neural network framework;

S53, encoding and deep-embedding the sequence of the pathogenic target protein and the SMILES strings of the candidate drugs, respectively, with the trained hybrid neural network framework;

Specifically, the network parameters are tuned according to performance on the validation set, and the final model is determined according to performance on the test set. The network parameters depend on the specific data set; the parameters to be tuned include the number of convolutional and max-pooling layers, the number of fully connected layers, the number of units in each layer, the convolution kernel size, the learning rate, the training batch size, the choice of activation function, the number of epochs, and the like.
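As a non-limiting illustration, the data set division of step S51 can be sketched as follows. The function name `split_dataset`, the 70/10/20 ratio and the placeholder records are assumptions, since the embodiment does not specify a split ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    records: Sequence[T],
    fractions: Tuple[float, float, float] = (0.7, 0.1, 0.2),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle records and split them into (train, validation, test).

    The default 70/10/20 ratio is an assumption, not a value from the text.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Placeholder records standing in for (protein sequence, SMILES, activity) triples.
records = [("SEQ", f"SMILES_{i}", float(i)) for i in range(91)]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three subsets identical across the five base models, so that their validation and test metrics are comparable.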

S54, inputting the deep embeddings of the target protein, the known ligands and the candidate drugs into a multilayer perceptron, and outputting the concordance index, the mean square error and the predicted binding affinity score;

S55, plotting the concordance index and the mean square error, each ordered from high to low, as heat maps, and averaging the prediction results in the interval with the deepest color to obtain the final binding affinity score of each candidate drug and the target protein.

Specifically, the steps for predicting the binding affinity between the candidate drugs and PGK1 in the present embodiment are shown in FIG. 4.
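As a non-limiting illustration, the two metrics of step S54 and one possible reading of the averaging in step S55 can be sketched as follows. The model names, the example values and the rule of averaging the two best-ranked models are assumptions; the embodiment selects results from the deepest-colored heat map interval and does not state a numeric rule.

```python
from itertools import combinations
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true values are skipped; tied predictions count as 0.5.
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(y_true)), 2):
        if y_true[i] == y_true[j]:
            continue
        comparable += 1
        true_diff = y_true[i] - y_true[j]
        pred_diff = y_pred[i] - y_pred[j]
        if pred_diff == 0:
            concordant += 0.5
        elif (true_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    return concordant / comparable


# Hypothetical held-out activities and per-model predictions.
y_true = [5.0, 6.0, 7.0, 8.0]
predictions = {
    "model_a": [5.1, 5.9, 7.2, 7.8],
    "model_b": [6.0, 5.0, 8.0, 7.0],
    "model_c": [5.3, 6.1, 6.9, 8.2],
}
ci = {name: concordance_index(y_true, p) for name, p in predictions.items()}
mse = {name: mean_squared_error(y_true, p) for name, p in predictions.items()}

# Assumed rule: keep the two models with the highest index, breaking ties by lower error.
best_models = sorted(ci, key=lambda name: (-ci[name], mse[name]))[:2]

# Average the kept models; the same average would be applied to candidate-drug predictions.
final_scores = [
    sum(predictions[m][i] for m in best_models) / len(best_models)
    for i in range(len(y_true))
]
print(best_models)
```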

As a further preferred embodiment of the method, the hybrid neural network framework comprises five basic models, namely a deep neural network, a convolutional neural network, a combined convolutional and long short-term memory neural network, a graph attention neural network, and a Transformer.

Specifically, the deep neural network model comprises an encoding layer for sequences and SMILES and fully connected layers: the sequence of PGK1 is encoded as amino acid composition (AAC) features, the SMILES of the known ligands and candidate drug molecules are encoded as feature vectors with the RDKit library, and the feature vectors are input into the fully connected layers, which output the corresponding deep embeddings. The convolutional neural network comprises an embedding layer, a convolutional layer and a max-pooling layer. The combined convolutional and long short-term memory neural network comprises an embedding layer, a convolutional layer, a max-pooling layer and an LSTM layer, with the output of the max-pooling layer fed to the LSTM layer. The graph attention neural network comprises an embedding layer, a graph attention layer, a convolutional layer and a max-pooling layer. The Transformer comprises a multi-layer neural network with multi-head attention that encodes the target protein sequence and the drug molecule SMILES. The network parameters depend on the specific data set; the parameters to be tuned include the number of convolutional and max-pooling layers, the number of fully connected layers, the number of units in each layer, the convolution kernel size, the learning rate, the training batch size, the choice of activation function, the number of epochs, and the like.
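As a non-limiting illustration, an amino acid composition encoding of a protein sequence can be sketched as follows. The sketch assumes the simplest form of AAC, namely the fraction of each of the 20 standard residues; the embodiment does not define the exact variant used, and the example peptide is made up.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aac_encode(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Short made-up peptide for illustration; not the PGK1 sequence.
features = aac_encode("MKVLAAGK")
print(len(features))
```

The resulting fixed-length vector is what makes a variable-length sequence usable as input to fully connected layers.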

As a further preferred embodiment of the method, the step of determining the final candidate drug by combining the number of target proteins acted on by each candidate drug, the predicted activity value of each candidate drug, and the binding affinity score of each candidate drug and the target protein specifically comprises:

S61, taking the number of target proteins acted on by each candidate drug as its actual score to obtain a first score;

S62, ranking the predicted activity values of the candidate drugs from high to low and scoring them so that the strongest activity receives 10 points and the weakest receives 1 point, to obtain a second score;

S63, ranking the binding affinity scores of the candidate drugs and the target protein from high to low and scoring them so that the highest value receives 10 points and the lowest receives 1 point, to obtain a third score;

S64, summing the first, second and third scores of each candidate drug, and taking the candidate drug with the highest total as the final candidate drug.
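As a non-limiting illustration, the scoring of steps S61 to S64 can be sketched as follows. The target counts are those reported in this embodiment; the drug names, the activity values and the binding values are hypothetical placeholders, and ties are broken by input order, which the embodiment does not address.

```python
from typing import Dict, Tuple


def rank_scores(values: Dict[str, float]) -> Dict[str, int]:
    """Highest value gets len(values) points, lowest gets 1 (10 to 1 for ten drugs)."""
    ordered = sorted(values, key=values.get, reverse=True)
    n = len(ordered)
    return {name: n - position for position, name in enumerate(ordered)}


def final_candidate(
    target_counts: Dict[str, int],
    activity: Dict[str, float],
    binding: Dict[str, float],
) -> Tuple[str, Dict[str, int]]:
    """Sum the three scores per drug and return the best drug with all totals."""
    second = rank_scores(activity)
    third = rank_scores(binding)
    totals = {d: target_counts[d] + second[d] + third[d] for d in target_counts}
    return max(totals, key=totals.get), totals


names = [f"drug_{i}" for i in range(1, 11)]
# First score: number of docked target proteins, as reported in the embodiment.
target_counts = dict(zip(names, [2, 3, 2, 4, 2, 5, 1, 4, 5, 4]))
# Hypothetical predicted activities and binding values, for illustration only.
activity = dict(zip(names, [6.1, 5.2, 7.3, 4.8, 6.9, 7.0, 5.5, 6.4, 7.8, 5.9]))
binding = dict(zip(names, [5.0, 6.2, 5.8, 7.1, 6.6, 7.4, 4.9, 6.0, 7.2, 6.8]))

best, totals = final_candidate(target_counts, activity, binding)
print(best, totals[best])
```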

A multi-target drug screening device based on ensemble learning and a hybrid neural network comprises:

at least one processor;

at least one memory for storing at least one program;

wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the multi-target drug screening method based on ensemble learning and a hybrid neural network as described above.

The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A multi-target drug screening method based on ensemble learning and a hybrid neural network is characterized by comprising the following steps:

2. The method for screening multi-target drugs based on ensemble learning and hybrid neural network as claimed in claim 1, wherein the step of obtaining pathogenic target proteins, corresponding known ligands and drug molecule library data specifically comprises:

3. The method for screening multi-target drugs based on ensemble learning and hybrid neural network as claimed in claim 2, wherein the step of docking the drug molecule library data with the pathogenic target protein and obtaining candidate drugs according to the docking scores comprises:

preparing the target protein and the drug molecule library for docking;

taking the drugs with the top 10 docking scores as the candidate drugs.

4. The method for screening multi-target drugs based on ensemble learning and hybrid neural network as claimed in claim 3, wherein the step of determining the multi-target proteins corresponding to pathogenic target proteins and docking the candidate drugs with the multi-target proteins to obtain the number of target proteins acted by the candidate drugs specifically comprises:

5. The method of claim 4, wherein the step of calculating the physicochemical properties of the corresponding known ligands and the candidate drugs and predicting the activity of the candidate drugs based on a preset ensemble learning regression model to obtain the predicted activity values of the candidate drugs comprises:

6. The method of claim 5, wherein the ensemble learning regression model comprises a voter based on the ensemble learning Boosting, Bagging and Stacking algorithms and their variants, together with the ensemble learning Voting algorithm.

7. The method for screening multi-target drugs based on ensemble learning and hybrid neural network as claimed in claim 5, wherein the step of predicting the binding affinity between the pathogenic target protein and the candidate drugs with the preset hybrid neural network framework to obtain the binding affinity score of each candidate drug and the target protein specifically comprises:

8. The method of claim 7, wherein the hybrid neural network framework comprises five basic models, namely a deep neural network, a convolutional neural network, a long-short term memory neural network, a graph attention neural network and a Transformer.

9. The method as claimed in claim 8, wherein the step of determining the final candidate drug by combining the number of target proteins acted on by each candidate drug, the predicted activity value of each candidate drug, and the binding affinity score of each candidate drug and the target protein comprises:

ranking the predicted activity values of the candidate drugs from high to low and scoring them so that the strongest activity receives 10 points and the weakest receives 1 point, to obtain a second score;

ranking the binding affinity scores of the candidate drugs and the target protein from high to low and scoring them so that the highest value receives 10 points and the lowest receives 1 point, to obtain a third score.