CN109979541B

CN109979541B - Method for predicting pharmacokinetic property and toxicity of drug molecules based on capsule network

Info

Publication number: CN109979541B
Application number: CN201910216282.1A
Authority: CN
Inventors: 杨胜勇; 王译伟; 邹俊; 黄磊; 姜斯文
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2019-03-20
Filing date: 2019-03-20
Publication date: 2021-06-22
Anticipated expiration: 2039-03-20
Also published as: CN109979541A

Abstract

A method for predicting the pharmacokinetic properties and toxicity of drug molecules based on a capsule network. After comprehensive molecular fingerprints and molecular descriptors have been constructed and the preparatory work for modeling is complete, low-level features of the molecules are extracted from the original molecular representation by a convolution or restricted Boltzmann machine operation, high-level features are abstracted from these low-level features by the capsule network, and the relationship between the high-level features and the activity label is fitted by a dynamic routing algorithm, so that the pharmacokinetic properties and toxicity class of unknown small molecules can be predicted. The invention does not require a large-scale training data set, is optimized end to end from the input so that dimension reduction is automatic, updates the coupling coefficients through an iterative dynamic routing process, and can route all features of a capsule in the preceding layer to any capsule in the next layer, thereby largely preserving the hierarchical positional relationship between low-level and high-level features. Its predictions are better than those of traditional machine learning methods.

Description

Method for predicting pharmacokinetic property and toxicity of drug molecules based on capsule network

Technical Field

The invention relates to the field of computer-aided drug molecule design, in particular to a method for predicting pharmacokinetic properties and toxicity of drug molecules based on a capsule network.

Background

The success of a drug depends not only on its efficacy but also on favorable pharmacokinetic properties and low toxicity. According to statistics, poor absorption, distribution, metabolism, excretion and toxicity of drug candidates account for more than 50% of drug development failures, so excluding or optimizing compounds with poor pharmacokinetic properties and toxicity early in development can greatly improve the success rate. Although the pharmacokinetic properties and toxicity of compounds can now be measured by in vitro high-throughput screening, experimental assays are costly and slow, and they require the compounds to be synthesized first. If a synthesized drug candidate then turns out to have unsatisfactory pharmacokinetic properties or toxicity, it can only be discarded. Predicting these properties by virtual screening therefore offers a new strategy for drug development: it saves considerable manpower and material resources, shortens the development cycle and improves efficiency. Virtual prediction methods for pharmacokinetic properties and toxicity are generally divided into receptor-based and ligand-based methods. Because receptor-based prediction is limited by factors such as the flexibility, water solubility and inaccurate scoring of the receptor, ligand-based prediction is more widely used. Ligand-based methods are further divided into structure-activity relationships, pharmacophore models, similarity searching, machine learning and others.
Machine learning methods make classification predictions by finding patterns in data and applying them, and practice shows that they give more accurate classifications than the other methods. Although machine learning has greatly improved prediction accuracy, the complexity of pharmacokinetic properties and toxicity makes it difficult for conventional machine learning methods to predict several related properties at once. Recently, deep learning networks have been applied successfully in many fields, and models built on them have also been used in drug discovery. At least two factors still limit their use in drug development. First, deep learning needs a large amount of data to train a model that predicts accurately, and collecting large, reliable activity data sets in drug development is costly and time-consuming. Second, conventional deep learning networks such as convolutional neural networks (CNNs) were originally designed to recognize two-dimensional images. Some of their operations, such as max pooling for dimension reduction in CNNs, discard part of the original information, so conventional deep learning networks perform poorly in drug discovery studies. A new and more accurate method for predicting pharmacokinetic properties and toxicity is therefore still needed, to promote the use of machine learning in new drug development, shorten the development cycle and reduce development cost.

Disclosure of Invention

The purpose of the invention is to provide a new method, based on capsule networks (Capsule Networks), for predicting the pharmacokinetic properties and toxicity of drug molecules. It is a ligand-based prediction method: starting from the molecular fingerprints and molecular descriptors of the ligand, it uses a capsule network, a deep learning model, to relate the fingerprints and descriptors to pharmacokinetic properties and toxicity. It overcomes three defects of the prior art: poor classification performance, serious loss of the original information that characterizes the ligand's fingerprints and descriptors, and a strong dependence of prediction accuracy on the size of the training set.

The basic idea of the invention is as follows: compounds whose pharmacokinetic property or toxicity of interest has been determined experimentally are collected, together with their activity labels, as a training set; comprehensive molecular fingerprints and molecular descriptors are constructed to represent the small molecules and are calculated for every molecule in the training set; low-level features of the molecules are first obtained by a convolution or restricted Boltzmann machine operation; high-level features are then abstracted by the capsule network and the relationship between the high-level features and the activity labels is fitted, so that the activity class of unknown small molecules can be predicted.

The purpose of the invention is achieved by the following steps:

A capsule-network-based method for predicting the pharmacokinetic properties and toxicity of drug molecules, characterized in that: compounds with an experimentally determined pharmacokinetic property or toxicity of interest are collected together with their activity labels, and comprehensive molecular fingerprints and molecular descriptors are constructed to represent the molecules, which completes the preparatory work for modeling; low-level features of the molecules are then extracted from the original molecular representation by a convolution or restricted Boltzmann machine operation, high-level features are abstracted by the capsule network, and the relationship between the high-level features and the activity label is fitted by a dynamic routing algorithm, so that the pharmacokinetic properties and toxicity class of unknown small molecules can be predicted.

The prediction comprises the following six steps:

(1) preparing the training set: the training set is prepared from data that contain both the small-molecule structure and its specific activity label; if the activity information is quantitative, a reasonable threshold is chosen to convert it into a qualitative label (active = 1, inactive = 0); the set is stored in SDF format;

(2) calculating the molecular descriptors: the molecular descriptors are the 13 descriptors most commonly used by machine learning methods to build pharmacokinetic and toxicity prediction models, namely the lipid-water partition coefficient, apparent partition coefficient, molecular solubility, molecular weight, number of hydrogen bond donors, number of hydrogen bond acceptors, number of rotatable bonds, number of rings, number of aromatic rings, sum of the numbers of oxygen atoms and hydrogen atoms, polar surface area, molecular fractional polar surface area and molecular surface area; all of the descriptors can be calculated with the open-source PaDEL-Descriptor or with Discovery Studio;

(3) calculating the molecular fingerprint: the 166-bit MACCS fingerprint, based on substructure features, is used to characterize the molecular structure and is calculated with RDKit;

(4) preprocessing the molecular descriptors: because the value ranges of different descriptors vary widely, the descriptors are preprocessed so that their values are limited to the interval (0, 1); each compound is represented by a one-dimensional vector comprising the compound name, activity label, fingerprint and scaled descriptor values, stored in csv format;

(5) establishing a classification model between the low-level features of the preceding layer (u_i) and the high-level features of the next layer (U_j): low-level features of the molecules are first obtained with a convolution operation as the feature extractor, or with a restricted Boltzmann machine operation; the capsule network then abstracts the high-level features of the molecules and a dynamic routing algorithm fits the relationship between the high-level features and the activity labels; during dynamic routing two quantities are updated continuously, namely the weight/coupling coefficient c_i,j between low-level and high-level features and the likelihood b_i,j that a low-level feature maps to a high-level feature, until the optimal "consensus" prediction is obtained;

(6) predicting the activity of unknown compounds: whether a compound is active is predicted from the length of the capsules output by the digit capsule layer, and the performance of the established prediction model is verified.
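Steps (1) and (4) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the compound, the threshold and the rule that larger values mean "active" are all assumptions made for illustration, and a real record would carry 166 fingerprint bits and 13 scaled descriptors rather than the shortened lists used here.

```python
import csv
import io

def binarize_activity(value, threshold, active_if_greater=True):
    """Turn a quantitative activity value into a 1/0 label (assumed rule)."""
    is_active = value >= threshold if active_if_greater else value <= threshold
    return 1 if is_active else 0

def make_row(name, label, fingerprint_bits, scaled_descriptors):
    """Flatten one compound into a one-dimensional record: name, label, bits, descriptors."""
    return [name, label, *fingerprint_bits, *scaled_descriptors]

# Hypothetical compound with 4 fingerprint bits and 2 descriptors for brevity.
label = binarize_activity(7.2, threshold=6.0)
row = make_row("cmpd_001", label, [1, 0, 0, 1], [0.25, 0.8])

# Write the record as one csv line (kept in memory here instead of a file).
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue().strip()
```

The direction of the threshold depends on the assay (for example, a lower IC50 usually means a more active compound), which is why the comparison direction is left as a parameter.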

When the capsule network abstracts the high-level features of molecules, suppose the preceding layer holds the low-level molecular features u₁, u₂, u₃ … u_n and the next layer holds the high-level molecular features U₁ and U₂;

when a new low-level feature u_n+1 arrives, it must be decided whether to pass it to U₁ or U₂, which is done by adjusting the weights c_n+1,1 and c_n+1,2;

the high-level features U₁ and U₂ receive the outputs of the other low-level features; a region where these outputs are dense means that the predictions of several low-level features lie close to one another, i.e. a "consensus" output; the new low-level feature u_n+1 is passed to whichever high-level feature has the "consensus" output closest to its own prediction; dynamic routing uses this result to adjust the weights automatically: if u_n+1 is passed to the high-level feature U₁, the weight c_n+1,1 relative to U₁ is raised and the weight c_n+1,2 relative to U₂ is lowered.

The steps for obtaining the low-level features of the molecules with a convolution operation as the feature extractor are:

1) taking the csv file containing the compound names, activity labels, molecular fingerprints and scaled molecular descriptor values of the training set as the input file;

2) mapping the input vector to a convolutional layer, and tuning the number and size of its filters;

3) mapping the low-level features from the convolutional layer to a hidden feature layer through a fully connected operation, and tuning the number of neurons in the hidden feature layer;

4) mapping the feature vector output by the preceding layer, through a fully connected operation with activation, to the primary capsule layer (PrimaryCaps), and tuning the number of neurons in the primary capsule layer and the dimension of each capsule in it;

5) mapping all outputs of the primary capsule layer to the digit capsule layer (DigitCaps), and tuning the number of routing iterations;

when the capsule network abstracts the high-level features of the molecules, the capsule network comprises a primary capsule layer and a digit capsule layer, and the abstraction is divided into four parts:

1) matrix transformation: the low-level features u are converted into high-level features U through the weight matrix W that relates the low-level features of the preceding layer to the high-level features of the next layer:

U_j＝W_ij·u_i，

where i denotes the lower-layer capsule and j denotes the higher-layer capsule;

2) input weighting: the coupling coefficient/weight c_i,j of each low-level feature vector is adjusted to decide which high-level feature it is sent to; the coupling coefficients are calculated by the softmax function:

c_i,j＝exp(b_i,j)/Σ_k exp(b_i,k)，

where b_i,j is the log probability that lower-layer capsule i corresponds to higher-layer capsule j, with its initial value set to 0;

3) weighted summation: the weighted sum s_j of the resulting high-level feature vectors is:

s_j＝Σ_i c_i,j·W_ij·u_i，

which represents the "consensus" output of all low-level feature vectors; if the coupling coefficient between capsule i and capsule j is 1, the coupling coefficients from capsule i to all other higher-layer capsules are 0, i.e. the entire output of capsule i is sent to capsule j;

4) nonlinear activation: the "consensus" output is activated with the vector nonlinear activation/compression function (squash function) to produce the higher-layer capsule:

v_j＝(||s_j||²/(1+||s_j||²))·(s_j/||s_j||)，

where v_j is the vector output of the digit capsule layer.
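The softmax weighting of part 2) and the compression of part 4) can be sketched in plain Python. The sketch assumes the standard capsule-network forms of both functions, c_i,j = exp(b_i,j)/Σ_k exp(b_i,k) and v_j = (||s_j||²/(1+||s_j||²))·(s_j/||s_j||); the function names and the numeric inputs are illustrative only.

```python
import math

def softmax(logits):
    """Coupling coefficients c_i,j for one lower capsule i, from its logits b_i,j."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(b - peak) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def squash(s):
    """Compress vector s to a length in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

c = softmax([0.0, 0.0])  # zero initial logits give equal coupling to both capsules
v = squash([3.0, 4.0])   # input length 5 is compressed to length 25/26
length = math.sqrt(sum(x * x for x in v))
```

Because the compressed length always lies below 1, it can be read as the probability that the feature represented by the capsule is present.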

The relationship between the high-level features and the activity labels is fitted by a dynamic routing algorithm, which can route all features of a capsule in the preceding layer to any capsule in the next layer and adjusts the capsule weights automatically; the dynamic routing algorithm comprises the following steps:

1) encapsulating the output of the hidden feature layer into capsules, outputting U_j, and setting the number of routing iterations r;

2) defining b_i,j as the likelihood that a capsule vector in layer l is connected to a capsule vector in layer l+1, with initial value 0;

3) repeating steps 4) to 7) r times;

4) for the capsule vectors in layer l, converting b_i,j into c_i,j with the softmax function;

5) for the capsule vectors in layer l+1, computing the weighted sum s_j;

6) for the capsule vectors in layer l+1, activating s_j with the vector nonlinear activation to obtain v_j;

7) updating b_i,j according to the relationship between U_j and v_j: when the two are similar their dot product is large, b_i,j becomes larger, and the low-level feature becomes more likely to connect to the high-level feature; conversely, when the two differ greatly, b_i,j becomes smaller and the connection becomes less likely.

In step 7), b_i,j is updated by the dot product of U_j and v_j.
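The routing loop of steps 2) to 7) can be sketched in plain Python as follows. This is an illustrative sketch, not the patented code: `U[i][j]` stands for the prediction vector W_ij·u_i, the update in step 7) is assumed to be additive (b_i,j is increased by the dot product), and the toy input values and function names are made up.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(b - peak) for b in row]
    total = sum(exps)
    return [e / total for e in exps]

def squash(s):
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    return [x * sq / (1.0 + sq) / math.sqrt(sq) for x in s]

def dynamic_routing(U, r=3):
    """Route prediction vectors U[i][j] for r iterations; return (v, c)."""
    n_low, n_high, dim = len(U), len(U[0]), len(U[0][0])
    b = [[0.0] * n_high for _ in range(n_low)]       # step 2: logits start at 0
    for _ in range(r):                               # step 3: repeat r times
        c = [softmax(b_i) for b_i in b]              # step 4: b_i,j -> c_i,j
        s = [[sum(c[i][j] * U[i][j][d] for i in range(n_low))
              for d in range(dim)]
             for j in range(n_high)]                 # step 5: weighted sum s_j
        v = [squash(s_j) for s_j in s]               # step 6: v_j = squash(s_j)
        for i in range(n_low):                       # step 7: agreement update
            for j in range(n_high):
                b[i][j] += sum(U[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Toy input: three lower capsules, two higher capsules, 2-D vectors.
# All three predictions agree for higher capsule 0 and disagree for capsule 1.
U = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = dynamic_routing(U, r=3)
lengths = [math.sqrt(sum(x * x for x in v_j)) for v_j in v]
```

In this toy case the higher capsule whose incoming predictions agree ends up with the longer output vector and the larger coupling coefficients, which is the "consensus" behaviour the steps describe.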

The steps for obtaining the low-level features of the molecules with a restricted Boltzmann machine (RBM) as the feature extractor are:

2) mapping the input vector to a hidden feature layer, and tuning the number of neurons in the hidden feature layer, the number of RBM operations in that layer, the RBM learning rate and the number of RBM iterations;

3) mapping the feature vector output by the preceding layer, through a fully connected operation with activation, to the primary capsule layer (PrimaryCaps), and tuning the number of neurons in the primary capsule layer and the dimension of each capsule in it;

4) mapping all outputs of the primary capsule layer to the digit capsule layer (DigitCaps), and tuning the number of routing iterations.

In step (1), preparing the training set, the training set is built for the specific activity to be predicted and contains the molecular structures and their corresponding activity labels; the number of molecules is at least 1,000;

in step (2), calculating the molecular descriptors, and step (3), calculating the fingerprints, programs such as the open-source PaDEL-Descriptor and RDKit or the commercial Discovery Studio can be used to calculate all molecular descriptors and molecular fingerprints;

in step (4), preprocessing the molecular descriptors, the descriptor values are limited to the interval (0, 1) according to the following formula:

x*＝(x-min)/(max-min)，

where x is the original value of the molecular descriptor, x* is the scaled value, and max and min are the maximum and minimum values of that descriptor; each compound is represented by a one-dimensional vector comprising the compound name, activity label, molecular fingerprint and scaled molecular descriptor values, stored in csv format;

in step (6), predicting the activity of unknown compounds, the length of each capsule represents, by the definition of a capsule, the probability that the corresponding feature is present; whether a compound is active is finally predicted from the lengths output by the digit capsule layer, the length of each output vector being taken at classification time.

The capsule lengths are trained with the margin loss:

L_k＝T_k·max(0, m⁺-||v_k||)²+λ(1-T_k)·max(0, ||v_k||-m⁻)²

where k denotes the class, T_k is the indicator function of the class, m⁺ is the upper bound, m⁻ is the lower bound, λ is a proportionality coefficient, and the total loss is the sum of the losses over all samples; the usual setting is that ||v_k|| is not less than 0.9 if class k is present and not greater than 0.1 if class k is absent.
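A small Python sketch of the margin loss for one sample, L_k = T_k·max(0, m⁺-||v_k||)² + λ(1-T_k)·max(0, ||v_k||-m⁻)² summed over the classes, together with the length-based class decision. The bounds 0.9 and 0.1 follow the general settings stated above; the value λ = 0.5, the two-class layout (index 0 = inactive, index 1 = active) and the function names are assumptions, since the text does not fix them.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss for one sample; lengths[k] is ||v_k||, target is the true class."""
    total = 0.0
    for k, norm in enumerate(lengths):
        t_k = 1.0 if k == target else 0.0  # indicator T_k
        present = t_k * max(0.0, m_plus - norm) ** 2
        absent = lam * (1.0 - t_k) * max(0.0, norm - m_minus) ** 2
        total += present + absent
    return total

def predict(lengths):
    """Predicted class is the digit capsule with the longest output vector."""
    return max(range(len(lengths)), key=lambda k: lengths[k])

confident = margin_loss([0.05, 0.95], target=1)  # both margins satisfied
uncertain = margin_loss([0.5, 0.5], target=1)    # both margins violated
```

A sample contributes no loss once the capsule of its true class is longer than m⁺ and every other capsule is shorter than m⁻, so training stops pushing on examples that are already classified with a margin.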

In addition to the hyperparameters of the individual steps that must be tuned (the number and size of the filters, the number of neurons in each layer and the number of iterations), the hyperparameters of the whole network (batch size, number of training iterations and learning rate) are also tuned; the optimal values of all hyperparameters are obtained by 5-fold cross-validation on the training set and are then used to configure the model for predicting the activity of unknown compounds.
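The bookkeeping behind such a search can be sketched with the Python standard library: enumerate candidate settings and generate the five train/validation splits. The search space, its values and all names below are hypothetical, and training and scoring a capsule network on each split is deliberately left out.

```python
import itertools
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for held_out in range(k):
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, folds[held_out]

# Hypothetical search space; the text does not list concrete candidate values.
grid = {
    "n_filters": [32, 64],
    "routing_iters": [2, 3],
    "learning_rate": [1e-3, 1e-4],
}
candidates = [dict(zip(grid, values))
              for values in itertools.product(*grid.values())]
splits = list(k_fold_indices(1000, k=5))
```

Each candidate would be trained on the four training folds and scored on the held-out fold, and the setting with the best average score across the five folds would be kept.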

The performance of the established prediction model is verified with a test set independent of the training set and is evaluated with the following formulas:

Q＝(TP+TN)/(TP+TN+FP+FN)，SE＝TP/(TP+FN)，SP＝TN/(TN+FP)，

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives; Q is the overall prediction accuracy of the model, SE is the sensitivity, i.e. the proportion of positive/active compounds that the model predicts correctly, and SP is the specificity, i.e. the proportion of negative/inactive compounds that the model predicts correctly.
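These three measures can be computed from the confusion counts as in the sketch below, which assumes the usual definitions Q = (TP+TN)/(TP+TN+FP+FN), SE = TP/(TP+FN) and SP = TN/(TN+FP), with label 1 for active and 0 for inactive. The function name and the example labels are made up.

```python
def classification_metrics(y_true, y_pred):
    """Return (Q, SE, SP) for binary labels where 1 = active and 0 = inactive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    q = (tp + tn) / len(pairs)
    se = tp / (tp + fn) if tp + fn else 0.0  # guard against no actives
    sp = tn / (tn + fp) if tn + fp else 0.0  # guard against no inactives
    return q, se, sp

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
q, se, sp = classification_metrics(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy matters when active and inactive compounds are unevenly represented, because accuracy alone can look good for a model that always predicts the majority class.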

The positive effects of the invention are as follows. The method is a ligand-based prediction method: starting from the molecular fingerprints and molecular descriptors of the ligand, it uses a capsule network, a new deep learning method, to relate the fingerprints and descriptors to pharmacokinetic properties and toxicity. Its innovation is that the dynamic routing algorithm of the capsule network abstracts the high-level features of the molecules while capturing and preserving, to the greatest possible extent, the relationships between all low-level and high-level features. Compared with earlier machine-learning-based methods for predicting the pharmacokinetic properties and toxicity of drug molecules, the method has the following advantages:

First, the method predicts better than traditional machine learning methods. Traditional convolutional neural networks have two defects. They learn spatial position poorly: features are sampled repeatedly during extraction, max pooling amplifies frequently occurring features, and rarely occurring features are ignored. In addition, when learning from a one-dimensional vector representation of a molecule, they extract features of the global representation poorly; for example, they cannot learn logical relationships hidden among the features, such as positions and subtle changes, which seriously reduces recognition accuracy.

Second, the molecular descriptors do not need to be optimized or reduced in dimension beforehand; the network is optimized end to end from the input, so dimension reduction is automatic. The capsule network handles the spatial position of the molecular representation far better than a CNN. Throughout learning, the length of a capsule expresses the probability that a feature is present and its direction expresses the state of the feature; this information is passed from the lower layers to the higher layers with the various feature information encapsulated, so molecular feature information that occurs rarely is retained even as the number of training samples is reduced. The coupling coefficients are updated by the iterative dynamic routing process, all features of a capsule in the preceding layer can be routed to any capsule in the next layer, and the hierarchical positional relationship between low-level and high-level features is largely preserved.

Third, the method does not need a large-scale training data set; good prediction accuracy is obtained with a training set of more than 1,000 compounds.

Fourth, the method converges faster: under the same conditions, the time needed to reach convergence is one tenth of the training time of the corresponding standard convolutional neural network or of a deep belief network built from stacked restricted Boltzmann machines. The method is efficient and predicts well, and it has high practical value and is worth wide adoption.

Drawings

FIG. 1 shows how a molecule is processed from low-level features to high-level features.

FIG. 2 is a workflow diagram of the capsule model with convolution as the molecular low-level feature extractor.

FIG. 3 is a workflow diagram of the capsule model with a restricted Boltzmann machine as the molecular low-level feature extractor.

FIG. 4 is a process diagram of the dynamic routing algorithm of the capsule network.

FIG. 5 is a flow chart of capsule-network-based prediction of the pharmacokinetic properties and toxicity of drug molecules using the present invention.

Detailed Description

The drawings show the specific process by which the invention predicts the pharmacokinetic properties and toxicity of drug molecules.

The properties predicted in the implementation of the invention include: (1) blood-brain barrier penetration; (2) human oral bioavailability; (3) carcinogenicity; (4) the 12 genotoxic effects of Tox21; (5) hERG inhibition; (6) human intestinal absorption; (7) hepatotoxicity; (8) teratogenicity.

See figure 1.

The invention starts from the molecular fingerprints and molecular descriptors of the ligand and uses a capsule network, a new deep learning method, to relate them to pharmacokinetic properties and toxicity. Its innovation is that the dynamic routing algorithm of the capsule network abstracts the high-level features of the molecules while capturing and preserving, to the greatest possible extent, the relationships between all low-level and high-level features.

When the capsule network abstracts the high-level features of molecules, suppose the preceding layer holds the low-level molecular features u₁, u₂, u₃ … u_n and the next layer holds the high-level molecular features U₁ and U₂.

When the activity of unknown compounds is predicted, the length of each capsule represents, by the definition of a capsule, the probability that the corresponding feature is present; whether a compound is active is finally predicted from the lengths output by the digit capsule layer, the length of each output vector being taken at classification time.

The capsule lengths are trained with the margin loss:

L_k＝T_k·max(0, m⁺-||v_k||)²+λ(1-T_k)·max(0, ||v_k||-m⁻)²

where k denotes the class, T_k is the indicator function of the class, m⁺ is the upper bound, m⁻ is the lower bound, λ is a proportionality coefficient, and the total loss is the sum of the losses over all samples; the usual setting is that ||v_k|| is not less than 0.9 if class k is present and not greater than 0.1 if class k is absent.

The process of the specific embodiment of the method of the invention is as follows:

First step, training set preparation: data on the specific pharmacokinetic property or toxicity of known compounds are collected from a variety of reliable sources. Because even minor errors in compound structures can harm the predictive performance of the model, all collected compounds must be "pre-processed" with the following workflow: 1) remove inorganic substances and mixtures; 2) delete the data for any compound whose activity measurements disagree substantially; 3) standardize special chemical structure types; 4) remove structurally duplicate compounds; 5) perform the necessary manual checks. This workflow is implemented with open-source chemistry software, chemotype and Open Babel. The training set is prepared from data that contain both the small-molecule structure and its specific activity label; if the activity information is quantitative, a reasonable threshold is chosen to convert it into a qualitative label (active = 1, inactive = 0); the set is stored in SDF format.
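Items 2) and 4) of this workflow can be sketched in a few lines of Python. The sketch assumes that each record already carries a canonical structure string produced by a cheminformatics toolkit and a binary label, so that "disagreeing measurements" reduces to conflicting labels; the function name and the example records are made up.

```python
from collections import defaultdict

def deduplicate(records):
    """Drop compounds with conflicting labels and keep one copy of the rest."""
    labels_by_structure = defaultdict(set)
    for structure, label in records:
        labels_by_structure[structure].add(label)
    cleaned = []
    for structure, labels in labels_by_structure.items():
        if len(labels) == 1:  # all measurements of this structure agree
            cleaned.append((structure, next(iter(labels))))
    return cleaned

records = [
    ("CCO", 1),
    ("CCO", 1),        # exact duplicate, collapsed to one entry
    ("c1ccccc1", 0),
    ("CC(=O)O", 1),
    ("CC(=O)O", 0),    # conflicting labels, compound removed
]
cleaned = deduplicate(records)
```

Duplicate detection is only as good as the structure standardization that precedes it, which is why item 3) comes before item 4) in the workflow.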

Second step, calculating the molecular fingerprints and molecular descriptors with general-purpose software for drug molecules: the molecular descriptors are the 13 descriptors most commonly used by machine learning methods to build pharmacokinetic and toxicity prediction models, namely the lipid-water partition coefficient, apparent partition coefficient, molecular solubility, molecular weight, number of hydrogen bond donors, number of hydrogen bond acceptors, number of rotatable bonds, number of rings, number of aromatic rings, sum of the numbers of oxygen atoms and hydrogen atoms, polar surface area, molecular fractional polar surface area and molecular surface area. All of the descriptors are calculated with the open-source PaDEL-Descriptor or with Discovery Studio.

For the molecular fingerprint, the 166-bit MACCS fingerprint, based on substructure features, is used to characterize the molecular structure and is calculated with RDKit. This fingerprint was chosen because it is short, which reduces the number of model parameters and shortens training time.

Third step, preprocessing the molecular descriptors: the calculated molecular descriptors are preprocessed and rescaled. Because the value ranges of different descriptors vary widely, their values are limited to the interval (0, 1) according to the following formula:

x*＝(x-min)/(max-min)，

where x is the original value of the molecular descriptor, x* is the scaled value, and max and min are the maximum and minimum values of that descriptor. Each compound is represented by a one-dimensional vector comprising the compound name, activity label, molecular fingerprint and scaled molecular descriptor values, stored in csv format.
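The rescaling can be sketched as a min-max transform, x* = (x - min)/(max - min), applied to one descriptor column at a time. The function name, the handling of a constant column and the molecular weights below are illustrative assumptions.

```python
def min_max_scale(column):
    """Rescale one descriptor column with x* = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: no spread to rescale, so map every value to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

molecular_weights = [180.2, 320.5, 250.0, 451.9]  # made-up values
scaled = min_max_scale(molecular_weights)
```

With this transform the smallest and largest values of each descriptor map to exactly 0 and 1, and the minimum and maximum found on the training set would normally be reused when scaling new compounds.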

Fourth step, establishing a classification model between the low-level features of the preceding layer and the high-level features of the next layer: low-level features of the molecules are first obtained with a convolution operation as the feature extractor, or with a restricted Boltzmann machine operation; the capsule network then abstracts the high-level features of the molecules and a dynamic routing algorithm fits the relationship between the high-level features and the activity labels; during dynamic routing two quantities are updated continuously, namely the weight/coupling coefficient c_i,j between low-level and high-level features and the likelihood b_i,j that a low-level feature maps to a high-level feature, until the optimal "consensus" prediction is obtained.

The optimal values of all hyperparameters are obtained by 5-fold cross-validation on the training set while several evaluation metrics are monitored, including accuracy, specificity, sensitivity and the Matthews correlation coefficient. In practice, once the setting with the highest accuracy has been identified among all candidate hyperparameter settings, that setting is applied to the test set and to the prediction of compounds with unknown labels. If the training set is small (fewer than 10,000 compounds), an early stopping strategy should be used during training to mitigate overfitting: the original training set is randomly divided into a new training set and a validation set (4:1), and training stops as soon as the error on the validation set no longer improves on the previous iteration. The network is considered to have converged when the value of the loss function no longer decreases as the number of iterations increases. In this way a computer program for predicting the pharmacokinetic properties and toxicity of drug molecules is established.
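The 4:1 split and the early stopping check can be sketched as below. The stopping rule is an assumption: it stops at the first iteration whose validation error is not lower than that of the previous iteration, which is one common reading of early stopping. The function names and the error values are made up.

```python
import random

def split_train_validation(items, ratio=0.8, seed=0):
    """Randomly split items into a new training set and a validation set (4:1)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

def early_stopping_epoch(validation_errors):
    """Index of the first epoch whose validation error fails to improve."""
    for epoch in range(1, len(validation_errors)):
        if validation_errors[epoch] >= validation_errors[epoch - 1]:
            return epoch
    return len(validation_errors) - 1  # never worsened: run to the end

train, validation = split_train_validation(range(1000))
stop = early_stopping_epoch([0.40, 0.31, 0.27, 0.29, 0.26])
```

Stopping at the first non-improving iteration is strict; a common variation waits for several non-improving iterations before stopping so that a single noisy validation score does not end training early.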

Fifthly, the performance of the established prediction model is tested using a test set independent of the training set, and evaluated by the following formulas:

wherein Q represents the overall prediction accuracy of the prediction model, SE represents the sensitivity, and SP represents the specificity; the sensitivity is the proportion of positive/active compounds correctly predicted by the prediction model, and the specificity is the proportion of negative/inactive compounds correctly predicted by the prediction model.
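The evaluation quantities can be illustrated with the sketch below. It assumes the standard confusion-matrix definitions, with TP and TN the numbers of correctly classified active and inactive compounds and FP and FN the numbers of misclassified inactive and active compounds. The function name and the sample counts are hypothetical, and the Matthews correlation coefficient is included only because the text lists it among the monitored indexes.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy Q, sensitivity SE, specificity SP and MCC."""
    q = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)  # proportion of actives predicted correctly
    sp = tn / (tn + fp)  # proportion of inactives predicted correctly
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return q, se, sp, mcc


# Hypothetical confusion counts for a test set of 100 compounds.
q, se, sp, mcc = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

The sketch does not guard against a test set that contains no active or no inactive compounds, in which case SE or SP is undefined.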

Finally, classification prediction of compounds of unknown activity is achieved. The compounds to be predicted are input directly, in the form of a csv file, into the trained and tested model. After the run, the predicted activity result for each compound is output: 0 indicates inactive, and 1 indicates active.

See fig. 2.

The steps in obtaining low-level features of the molecules with a convolution operation as a feature extractor are:

4) The feature vector output by the previous layer is fully mapped to the primary capsule layer (PrimaryCaps) through a fully connected operation and activated, and the number of neurons in the primary capsule layer and the dimension of each capsule in that layer (dimension) are optimized;

5) All outputs of the primary capsule layer are mapped to the digit capsule layer (DigitCaps), and the number of routing iterations is optimized.

U_j＝W_ij·u_i，

i denotes the lower layer capsule, j denotes the higher layer capsule;

b_i,j represents the log probability that lower-layer capsule i corresponds to higher-layer capsule j, and its initial value is set to 0;

v_j represents the vector output of the digit capsule layer.

See figure 3.

The steps for obtaining low-level features of the molecules with a restricted Boltzmann machine as the feature extractor are:

The relation between the high-level features and the activity labels is fitted by a dynamic routing algorithm, which can convey all features of an upper-layer capsule to any capsule of the lower layer and automatically adjusts the capsule weights. The steps of the dynamic routing algorithm (figure 5) are as follows:

3) Steps 4) to 7) are executed in a loop r times;

5) For the capsule vectors of layer l+1, the weighted sum s_j is computed;

See fig. 5.
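Because only some of the routing steps survive in the text, the following is a hedged sketch of routing by agreement as it is commonly implemented for capsule networks, not a reproduction of the patented procedure. It assumes that the coupling coefficients are a softmax over b_i,j, that s_j is the coupling-weighted sum of the prediction vectors, that v_j is obtained with the usual "squash" nonlinearity, and that b_i,j is updated by the dot product of the prediction vector and v_j. All function and variable names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def squash(s):
    """Nonlinear vector activation: keeps direction, maps length into [0, 1)."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0 for _ in s]
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]


def dynamic_routing(u_hat, r=3):
    """u_hat[i][j] is the prediction vector of lower capsule i for higher capsule j."""
    n_low, n_high, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_high for _ in range(n_low)]  # routing logits, initial value 0
    v = []
    for _ in range(r):
        c = [softmax(row) for row in b]  # coupling coefficients per lower capsule
        v = []
        for j in range(n_high):
            # Weighted sum of the predictions for higher capsule j.
            s_j = [sum(c[i][j] * u_hat[i][j][k] for i in range(n_low))
                   for k in range(dim)]
            v.append(squash(s_j))
        for i in range(n_low):
            for j in range(n_high):
                # Agreement between prediction and output raises the logit.
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v, [softmax(row) for row in b]


# Hypothetical predictions: three lower capsules agree on higher capsule 0
# and disagree on higher capsule 1.
u_hat = [
    [[1.0, 0.0], [0.5, 0.0]],
    [[0.9, 0.1], [-0.5, 0.0]],
    [[1.0, 0.1], [0.0, 0.5]],
]
v, c = dynamic_routing(u_hat, r=3)
```

In this toy input the coupling coefficients shift toward higher capsule 0, where the lower capsules' predictions agree, which is the "consensus" behaviour the description refers to.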

Example 1.

Compounds active against the potassium channel encoded by hERG (human ether-a-go-go-related gene) are predicted, with the low-level features of the molecules obtained using a convolution operation as the feature extractor. The implementation process is as follows:

First, data relating to hERG activity are collected. Data related to hERG were obtained from the open ChEMBL database (https://www.ebi.ac.uk/chembl/). ChEMBL is a well-known bioactivity database established by the European Bioinformatics Institute. This public database can be downloaded by anyone from the website and is therefore widely used by researchers in cheminformatics. The workflow for building the initial ChEMBL-hERG dataset is as follows:

1) 17,952 compounds tested for hERG activity were extracted based on the hERG ID number in the database (CHEMBL240);

2) Compounds flagged as "Nonstandard unit for type", "Outside typical range" and "Not specified" were excluded. The initial dataset had 10,068 compounds in total, including 4,793 positives (hERG inhibitors, IC₅₀ < 10 μM) and 5,275 negatives (hERG non-inhibitors, IC₅₀ ≥ 10 μM). An accurate ChEMBL-hERG dataset was obtained by preprocessing the raw dataset. Finally, a dataset of 8,310 compounds was obtained (positive compounds: 3,860; negative compounds: 4,450). For model building and subsequent model testing, the entire dataset was randomly divided into a training set of 90% (7,460 compounds) and a test set of 10% (850 compounds).
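The conversion of quantitative IC₅₀ values into activity labels with the 10 μM threshold can be sketched as follows. The function name `label_herg`, the compound identifiers and the IC₅₀ values are hypothetical and serve only to show the rule.

```python
def label_herg(ic50_um, threshold_um=10.0):
    """Return 1 (hERG inhibitor) if IC50 < 10 uM, else 0 (non-inhibitor)."""
    return 1 if ic50_um < threshold_um else 0


# Hypothetical (compound id, IC50 in uM) pairs.
raw = [("cpd-0", 0.5), ("cpd-1", 12.0), ("cpd-2", 9.9),
       ("cpd-3", 10.0), ("cpd-4", 30.0), ("cpd-5", 0.02)]
labelled = [(name, label_herg(ic50)) for name, ic50 in raw]
n_positive = sum(label for _, label in labelled)
```

A compound measured at exactly 10 μM falls on the negative side, matching the "not less than 10 μM" definition of a non-inhibitor.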

In the second step, molecular fingerprints and descriptors are calculated with a program for computing molecular fingerprints and descriptors of drugs, and all compounds of the ChEMBL-hERG dataset are then characterized. The 13 molecular descriptors obtained are preprocessed so that their values are limited to the interval (0, 1). An input file is built in Excel in which each row holds the information of one molecule, represented as a one-dimensional vector; the molecular characterization comprises the name/number of the molecule, the activity label (active/positive is 1, inactive/negative is 0), the molecular fingerprint and the scaled molecular descriptor values. The input file is saved in csv format.
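One possible layout of such an input row is sketched below, assuming the 166 MACCS fingerprint bits named in the claims followed by the 13 scaled descriptors. The column order, the helper name `build_row` and the sample values are assumptions made for illustration.

```python
import csv
import io


def build_row(name, label, fingerprint_bits, scaled_descriptors):
    """Flatten one molecule into a one-dimensional record."""
    return [name, label, *fingerprint_bits, *scaled_descriptors]


# Hypothetical molecule: 166 fingerprint bits and 13 scaled descriptor values.
bits = [0] * 166
bits[41] = bits[120] = 1
descriptors = [0.5] * 13
row = build_row("cpd-0", 1, bits, descriptors)

# Write the record as one csv line (an in-memory buffer stands in for a file).
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue().strip()
```

Each molecule then occupies exactly one line of 181 comma-separated fields.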

In the third step, a capsule-network-based prediction model for hERG active/inactive molecules is established using the ChEMBL-hERG training set. The network weights are randomly initialized from a truncated normal distribution with stddev set to 0.01. The convolution operation and the restricted Boltzmann machine are used as feature extractors, with the rectified linear unit and the probability distribution of the energy function as their respective activation functions. To reduce internal covariate shift, the input distribution of each layer is normalized to a standard Gaussian distribution using batch normalization. The network is optimized with the Adam method. Multiple evaluation indexes, including accuracy, specificity, sensitivity and the Matthews correlation coefficient, are monitored through 5-fold cross-validation on the training set to obtain the optimal values of all hyper-parameters. In tuning the hyper-parameters of the actual model, the following aspects are adjusted according to the composition of the model.

Hyper-parameter tuning of the feature extractor, with convolution as the feature extractor: candidate filter sizes are 8 × 8, 16 × 16, 32 × 32, and 64 × 64; candidate numbers of kernels are 2, 3, 4, 5, and 6.

Hidden feature layer neuron number candidate range: 64, 128, 256, 512, 1024 and 2048.

Candidate range for the number of neurons in the primary capsule layer: 64, 128, 256, 512, 1024 and 2048.

Candidate range for the number of routing iterations in the capsule part: from 1 to 5, in increments of 1.

Hyper-parameter tuning of the whole network: (1) candidate range of the batch size: 128, 256, 512, and 1028; (2) candidate range of the number of network iterations (epochs): from 100 to 1000, in increments of 50; (3) candidate range of the network learning rate: from 0.001 to 0.01, in increments of 0.001.
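To give a sense of the size of the network-level search space, the candidate ranges above can be enumerated as in the sketch below. The variable names are hypothetical, and the exhaustive grid is an assumption, since the text does not say whether all combinations are tried.

```python
from itertools import product

batch_sizes = [128, 256, 512, 1028]  # values as listed in the text
epochs = list(range(100, 1001, 50))  # 100, 150, ..., 1000
learning_rates = [round(0.001 * k, 3) for k in range(1, 11)]  # 0.001 ... 0.01

# Every (batch size, epochs, learning rate) combination of the candidates.
grid = list(product(batch_sizes, epochs, learning_rates))
```

Each combination would then be scored by 5-fold cross-validation on the training set, and the setting with the highest accuracy kept.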

In the fourth step, the model is configured with the optimal hyper-parameter combination obtained in the previous step, switched to the test state, and its prediction performance is verified using the test set.

In the fifth step, compounds not tested in the hERG assay are processed and input to the model, which is switched to the predict state. After the run is finished, the output file is checked for the prediction result: 0 indicates no inhibition of hERG, and 1 indicates inhibition of hERG.

Example 2.

Compounds active against the potassium channel encoded by hERG (human ether-a-go-go-related gene) are again predicted, this time with the restricted Boltzmann machine as the feature extractor.

The same procedure as in example 1 was carried out in the first and second steps.

In the third step, a capsule-network-based prediction model for hERG active/inactive molecules is established using the ChEMBL-hERG training set. The network weights are randomly initialized from a truncated normal distribution with stddev set to 0.01. The feature extractor is a restricted Boltzmann machine, and the probability distribution of the energy function acts as the activation function. To reduce internal covariate shift, the input distribution of each layer is normalized to a standard Gaussian distribution using batch normalization. The network is optimized with the Adam method. Multiple evaluation indexes (accuracy, specificity, sensitivity, Matthews correlation coefficient and the like) are monitored through 5-fold cross-validation on the training set to obtain the optimal values of all hyper-parameters. In tuning the hyper-parameters of the actual model, the following aspects are adjusted according to the composition of the model.

Candidate numbers of restricted Boltzmann machines: 2, 3, 4 and 5.

Candidate numbers of neurons per restricted Boltzmann machine: 64, 128, 256, 512, 1024 and 2048.

Candidate numbers of iterations for the restricted Boltzmann machine: from 100 to 1000, in increments of 50.

Candidate learning rates for the restricted Boltzmann machine: from 0.001 to 0.01, in increments of 0.001.

Hyper-parameter tuning of the whole network: (1) candidate range of the batch size: 128, 256, 512, and 1028; (2) candidate range of the number of network iterations (epochs): from 100 to 1000, in increments of 50; (3) candidate range of the network learning rate: from 0.001 to 0.01, in increments of 0.001.

The fourth and fifth steps are the same as in example 1.

The present examples 1 and 2 were evaluated and verified by using the following formulas:

When convolution and the restricted Boltzmann machine are respectively used as the feature extractor, the overall prediction accuracy on the test set is about 90%, which shows that the established model has good predictive ability for compounds independent of the training set.

Claims

1. A novel method for predicting the pharmacokinetic properties and toxicity of drug molecules based on a capsule network, characterized in that: known compounds with experimentally determined specific pharmacokinetic properties or toxicity, together with their activity labels, are collected, and comprehensive molecular fingerprints and molecular descriptors are constructed to characterize the small molecules, completing the preliminary preparation for building the model; the low-level feature content of the molecules is then extracted in the upper-layer low-level features through a convolution or restricted Boltzmann machine operation, the high-level features of the molecules are abstracted in the next-layer high-level features using the capsule network method, and the relation between the high-level features and the activity labels is fitted through a dynamic routing algorithm, so that the method is used for predicting the pharmacokinetic property and toxicity classification of unknown small molecules;

the method comprises the following six steps:

(1) preparation of the training set: compounds are collected and preprocessed, a dataset for the specific activity is established, and the training set is prepared from data containing both the small-molecule structure and its specific activity label; if the activity information is expressed quantitatively, a reasonable threshold is selected to convert it into a qualitative expression, with active = 1 and inactive = 0, and the data are stored in sdf format;

(2) calculating a molecular descriptor;

the molecular descriptors comprise 13 most commonly used molecular descriptors in a machine learning method for establishing a pharmacokinetic and toxicity prediction model, namely a lipid-water partition coefficient, an apparent partition coefficient, molecular solubility, molecular weight, the number of hydrogen bond donors, the number of hydrogen bond acceptors, the number of rotatable bonds, the number of rings, the number of aromatic rings, the sum of the numbers of oxygen atoms and hydrogen atoms, polar surface area, molecular part polar surface area and molecular surface area; all the molecular descriptors are calculated by open source PaDEL-Descriptor or Discovery Studio programs;

(3) calculating molecular fingerprints: 166-bit MACCS molecular fingerprints based on substructure features are adopted to characterize the structure of the molecules, and the calculation of the molecular fingerprints is completed with RDKit software;

(4) preprocessing a molecular descriptor; the range of values for different molecular descriptors varies widely, their values being limited to the interval of (0, 1) by pre-processing; a one-dimensional vector is adopted to represent a compound, and the compound comprises a compound name, an activity label, a molecular fingerprint and a scaled molecular descriptor value, and is stored in a csv format;

(5) establishing a classification model between the upper-layer low-level features u_i and the next-layer high-level features U_j: first, a convolution operation or a restricted Boltzmann machine operation is used as the feature extractor to obtain the low-level features of the molecules; then the capsule network method is used to abstract the high-level features of the molecules, and a dynamic routing algorithm is used to fit the relation between the high-level features and the activity labels; during dynamic routing two weights are continuously updated, namely the weight/coupling coefficient c_ij between the low-level features u_i and the high-level features U_j and the likelihood b_ij that a low-level feature maps to a high-level feature, so that the optimal "consensus" prediction result is obtained;

2. The novel method for predicting the pharmacokinetic properties and toxicity of drug molecules based on capsule network as claimed in claim 1, wherein: the method for abstracting the high-level features of the molecules using the capsule network is as follows: given the low-level features u₁, u₂, u₃ … u_n of the molecules in the upper layer, the next layer holds the high-level features U₁ and U₂ of the molecules;

the high-level features U₁ and U₂ receive input from the low-level features; inside a high-level feature, a region where the outputs of the low-level features are dense means that the predictions of multiple low-level features are close to each other, i.e., the "consensus" output; a new low-level feature u_n+1 is delivered to whichever high-level feature has the "consensus" output closest to its output; based on this result, dynamic routing provides a mechanism that automatically adjusts the weights: if u_n+1 is delivered to the high-level feature U₁, the weight c_n+1,1 associated with U₁ is increased while the weight c_n+1,2 associated with U₂ is decreased.

3. The novel method for predicting the pharmacokinetic properties and toxicity of drug molecules based on capsule network as claimed in claim 1, wherein:

3) mapping the low-level features obtained by the convolutional layer to a hidden feature layer through a fully connected operation, and optimizing the number of neurons (nodes) of the hidden feature layer;

U_j＝W_ij·u_i，

i denotes the lower layer capsule, j denotes the higher layer capsule;

b_ij represents the log probability that lower-layer capsule i corresponds to higher-layer capsule j, and its initial value is set to 0; k represents the number of neurons within each capsule;

3) weighted summation to obtain the high-level feature vector s_j:

s_j represents the "consensus" output of all low-level feature vectors; if the coupling coefficient between capsule i and capsule j is 1, the coupling coefficients from capsule i to the other high-level capsules are 0, i.e., all outputs of capsule i are sent to capsule j;

v_j represents the final vector output of the higher-layer capsule j.

4. The novel method for predicting the pharmacokinetic properties and toxicity of drug molecules based on capsule network as claimed in claim 1, wherein: the relation between the high-level features and the activity labels is fitted by a dynamic routing algorithm, which can convey all features of an upper-layer capsule to any capsule of the lower layer and automatically adjusts the capsule weights, the dynamic routing algorithm comprising the following steps:

3) executing steps 4) to 7) in a loop r times;

5) for the capsule vectors of layer l+1, computing the weighted sum s_j, where s_j is the "consensus" output of all low-level feature vectors;

6) for the capsule vectors of layer l+1, activating s_j with the nonlinear vector activation to obtain v_j, where v_j represents the final vector output of the higher-layer capsule j;

5. The novel method for predicting the pharmacokinetic properties and toxicity of drug molecules based on capsule network as claimed in claim 4, wherein: according to the output U_j after capsule encapsulation and the v_j obtained, for the capsule vectors of layer l+1, by activating s_j with the nonlinear vector activation, the probability b_i,j that a capsule vector of layer l is connected to a capsule vector of the next layer is updated, b_i,j being updated by the dot product of U_j and v_j.

6. The novel method for predicting the pharmacokinetic properties and toxicity of drug molecules based on capsule network as claimed in claim 1, wherein: the steps for obtaining the low-level features of the molecules with the restricted Boltzmann machine as the feature extractor are:

4) mapping all outputs of the primary capsule layer to the digit capsule layer (DigitCaps), and optimizing the number of routing iterations.

7. The novel method for predicting the pharmacokinetic properties and toxicity of drug molecules based on capsule network as claimed in claim 1, wherein:

in step (2) of calculating the molecular descriptors and step (3) of calculating the molecular fingerprints, the open-source PaDEL-Descriptor or RDKit, or the commercial Discovery Studio program, is selected to complete the calculation of all molecular descriptors and molecular fingerprints;

in the step (6) of predicting the activity of the unknown compound, the length of each capsule represents the probability of appearance of the characteristic content according to the definition of the capsule, and finally, whether the compound has the activity or not is predicted according to the length output by the digital capsule layer, and the length of an output vector is taken when classification is carried out.
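Classification from capsule lengths can be illustrated with the sketch below. It assumes two digit capsules, one per class (index 0 for inactive, index 1 for active), and takes the class whose output vector is longest as the prediction. The function name and the sample vectors are hypothetical.

```python
import math


def predict_activity(digit_capsules):
    """Return (predicted class, capsule lengths); the longest capsule wins."""
    lengths = [math.sqrt(sum(x * x for x in vec)) for vec in digit_capsules]
    predicted = max(range(len(lengths)), key=lengths.__getitem__)
    return predicted, lengths


# Hypothetical digit-capsule outputs for one compound.
pred, lengths = predict_activity([[0.1, 0.1], [0.6, 0.5]])
```

Here the second capsule is longer, so the compound would be reported as active (1).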

8. The novel method for predicting the pharmacokinetic properties and toxicity of drug molecules based on capsule network as claimed in claim 7, wherein: the length of the capsule is calculated using the margin loss:

$$L_k = T_k \max(0,\, m^+ - \lVert v_k \rVert)^2 + \lambda\,(1 - T_k)\,\max(0,\, \lVert v_k \rVert - m^-)^2$$

k represents the class, T_k is the indicator function of the class, m⁺ is the upper bound, m⁻ is the lower bound, λ is a proportionality coefficient, and the total loss is the sum of the losses of the individual samples; ||v_k|| represents the output value of the k-th class; the general setting is: if class k exists, ||v_k|| is not less than the upper bound m⁺; if class k does not exist, ||v_k|| is not greater than 0.1.
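A minimal sketch of the margin loss for one sample follows. The lower bound 0.1 is taken from the text, whereas the upper bound 0.9 and the coefficient λ = 0.5 are assumptions borrowed from common capsule network practice and are not stated in the source. The function name and the sample inputs are hypothetical.

```python
def margin_loss(lengths, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum over classes of L_k for one sample.

    lengths[k] is ||v_k||; targets[k] is T_k (1 if class k is present, else 0).
    m_plus=0.9 and lam=0.5 are assumed defaults, not values from the source.
    """
    total = 0.0
    for length, t in zip(lengths, targets):
        present = t * max(0.0, m_plus - length) ** 2
        absent = lam * (1 - t) * max(0.0, length - m_minus) ** 2
        total += present + absent
    return total


# Hypothetical capsule lengths for an active compound (classes: active, inactive).
confident = margin_loss([0.95, 0.05], [1, 0])
uncertain = margin_loss([0.5, 0.5], [1, 0])
```

The loss vanishes once the capsule of the present class is at least as long as the upper bound and the capsule of the absent class is no longer than the lower bound.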

9. The novel method for predicting the pharmacokinetic properties and toxicity of drug molecules based on capsule network as claimed in claim 1, wherein: besides the hyper-parameters involved in each step, including the number and size of filters, the number of neurons in each layer and the iteration number, which need to be optimized and adjusted, the hyper-parameters of the whole network, including the batch processing size, the iteration number and the network learning rate, are also optimized and adjusted, the optimal values of all the hyper-parameters are obtained by performing 5-fold cross validation on the training set, and then the optimal values are used for model setting, so that the activity prediction of unknown compounds is realized.

10. The novel method for predicting the pharmacokinetic properties and toxicity of drug molecules based on capsule network as claimed in claim 1, wherein: the performance of the established prediction model is verified by using a test set independent of the training set, and the following formula is adopted for evaluation:

wherein TP represents correctly classified positive/active compounds, TN represents correctly classified negative/inactive compounds, FP represents misclassified negative/inactive compounds, FN represents misclassified positive/active compounds, Q represents the overall prediction accuracy of the prediction model, SE represents the sensitivity, which is the proportion of positive/active compounds correctly predicted by the prediction model, and SP represents the specificity, which is the proportion of negative/inactive compounds correctly predicted by the prediction model.