CN112599186B

CN112599186B - Compound target protein binding prediction method based on multi-deep learning model consensus

Info

Publication number: CN112599186B
Application number: CN202011619677.5A
Authority: CN
Inventors: 郑光; 胡成杰; 刘昊; 乔安杰; 陈俊楠; 高雅杰; 吕诚; 李立
Original assignee: Lanzhou University
Current assignee: Lanzhou University
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-09-27
Anticipated expiration: 2040-12-30
Also published as: CN112599186A

Abstract

The invention discloses a compound target protein binding prediction method based on multi-deep learning model consensus, comprising the following steps: acquiring binding and non-binding data for compound-target protein pairs; extracting a plurality of different compound target protein binding/non-binding data sets for training, testing and validation; constructing and training a plurality of compound target protein binding/non-binding deep learning models, extracting the binding/non-binding features of the compound target protein, and obtaining a plurality of final compound target protein binding/non-binding deep learning models; predicting multiple groups of compound target protein binding/non-binding relationships with the models obtained in the previous step; and integrating the multiple groups of binding results into a consensus binding relationship, and the multiple groups of non-binding results into a consensus non-binding relationship. The method has a low false positive rate and high accuracy, and is suitable for wide application.

Description

Compound target protein binding prediction method based on multi-deep learning model consensus

Technical Field

The invention belongs to the technical field of medicine research and development, and relates to a compound target protein binding prediction method based on multi-deep learning model consensus.

Background

Existing mainstream deep learning models (fully connected neural networks, convolutional neural networks, recurrent neural networks, encoder-decoder networks and deep belief networks) can be used to extract the features of compound target protein binding relationships and thereby predict the binding relationships of new compound-target protein pairs, which is of practical importance for the discovery and development of new drugs and for research into their mechanisms of action. However, although these models can reach very high training accuracy (> 90%), their high false positive rate in prediction (> 90%) prevents further application of deep learning models in this area. How to reduce the false positive rate of compound target protein binding prediction remains an open problem worldwide.

In the prior art, compound target protein binding relationships are predicted with cheminformatics software (see figure 1). The process is carried out entirely by hand, requires many parameters to be tuned, has a low success rate, and suffers from a high false positive rate. Deng, N. et al. (J. Phys. Chem. B 2015, 119, 976-988) and Nataraj S. et al. (Biophys Rev (2017) 9: 91-102. DOI 10.1007/S12551-016-0247-1) discuss these problems of numerous parameters, low success rate and high false positives from the perspectives of compound target protein binding free energy calculation and a review of compound target protein binding software, respectively.

Therefore, it is necessary to develop a method for predicting binding of a compound target protein with a low false positive rate and a high accuracy.

Disclosure of Invention

The invention aims to overcome the problems in the prior art and provides a compound target protein binding prediction method based on multi-deep learning model consensus.

The technical scheme is as follows:

a compound target protein binding prediction method based on multi-deep learning model consensus comprises the following steps:

(1) acquiring binding data (positive sample) of a compound target protein and non-binding data (negative sample) of the compound-target protein;

(2) extracting a plurality of different compound target protein binding/unbinding data sets (positive/negative sample sets) for training, testing and validation;

constructing a plurality of different compound target protein binding deep learning models (positive models); using a plurality of different positive/negative sample sets with positive-to-negative ratios of (1:1), (2:1), (4:1), (6:1) and (8:1), training the plurality of different positive models in the positive direction, extracting multiple groups of compound target protein binding features (positive features), and obtaining a plurality of different final compound target protein binding deep learning models (positive models); and

constructing a plurality of different compound-target protein non-binding deep learning preliminary models (negative models); using a plurality of different positive/negative sample sets with positive-to-negative ratios of (1:1), (1:2), (1:4), (1:6) and (1:8), training the plurality of different negative models in the negative direction, extracting multiple groups of compound-target protein non-binding features (negative features), and obtaining a plurality of different final compound-target protein non-binding deep learning models (negative models);

(3) obtaining the binding relationship (positive) of a plurality of groups of compound target proteins through a plurality of different positive models; obtaining a plurality of groups of compound-target protein non-binding relations (negatives) through a plurality of different negative models;

(4) integrating the multiple groups of binding relationship results obtained in the step (3) to obtain a consensus binding relationship, and integrating the multiple groups of non-binding relationship results obtained in the step (3) to obtain a consensus non-binding relationship;

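(5) the refined binding relationship of the compound target protein is the consensus binding relationship minus the consensus non-binding relationship, i.e. the consensus binding pairs that remain after the consensus non-binding pairs have been removed.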
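Steps (4) and (5) can be read as set operations on predicted compound-target protein pairs. The following is a minimal Python sketch of that reading. The text does not specify how the groups of results are integrated, so the rule used here (a pair enters the consensus only if every model predicts it, adjustable through the hypothetical `min_votes` parameter) is an assumption, as are the function name `consensus` and the toy pair identifiers.

```python
from collections import Counter

def consensus(prediction_sets, min_votes=None):
    """Return the pairs predicted by at least `min_votes` models.

    Assumption: by default a pair must be predicted by every model.
    """
    if min_votes is None:
        min_votes = len(prediction_sets)
    votes = Counter(pair for preds in prediction_sets for pair in set(preds))
    return {pair for pair, n in votes.items() if n >= min_votes}

# Hypothetical outputs of three positive models (predicted binding pairs)
positive_predictions = [
    {("c1", "p1"), ("c1", "p2"), ("c2", "p3")},
    {("c1", "p1"), ("c1", "p2"), ("c2", "p3"), ("c3", "p1")},
    {("c1", "p1"), ("c1", "p2"), ("c2", "p3")},
]
# Hypothetical outputs of three negative models (predicted non-binding pairs)
negative_predictions = [
    {("c1", "p2"), ("c3", "p1")},
    {("c1", "p2")},
    {("c1", "p2"), ("c2", "p2")},
]

consensus_binding = consensus(positive_predictions)
consensus_non_binding = consensus(negative_predictions)

# Step (5): consensus binding minus consensus non-binding
refined_binding = consensus_binding - consensus_non_binding
print(sorted(refined_binding))  # [('c1', 'p1'), ('c2', 'p3')]
```

In this toy case the pair ("c1", "p2") is predicted as binding by all positive models but is removed because all negative models also flag it as non-binding, which is how the negative models act as a false positive filter.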

As one embodiment, step (1) of the method of the present invention further comprises: obtaining the binding data (positive samples) of the compound target protein from compound target protein binding databases published in the scientific literature and on the web, said databases comprising: ChemSpider, PubChem, BindingDB, ZINC or ChEMBL, or two or more thereof.

As one embodiment, step (1) of the method of the present invention further comprises: obtaining the compound-target protein non-binding data (negative samples) by randomly generating compound-target protein combinations and excluding from them the compound-target protein binding data (positive samples).

As one embodiment, step (1) of the method of the present invention further comprises: the ratio of compound target protein binding data (positive samples) to compound-target protein non-binding data (negative samples) ranges from (0.1:1) to (1:100); preferably, the positive-to-negative ratios used for training the negative models are (1:1), (1:2), (1:4), (1:6) and (1:8), and the positive-to-negative ratios used for training the positive models are (1:1), (2:1), (4:1), (6:1) and (8:1).

As one embodiment, the compound target protein binding/non-binding data (positive/negative sample sets) are generated as follows: the positive samples are compound target protein binding data downloaded from BindingDB; the negative samples are obtained by randomly combining compounds with target proteins and removing the positive samples from the resulting combinations.
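The negative sample generation described above can be sketched as rejection sampling: draw random compound-target protein pairs and discard any pair that is a known positive. This is a minimal sketch rather than the patented implementation; the function name `sample_negatives`, the fixed seed and the toy identifiers are assumptions, and the sketch presumes that the compound and protein lists contain no duplicates and that every positive pair lies in the combination space.

```python
import random

def sample_negatives(compounds, proteins, positives, n, seed=0):
    """Draw n distinct random (compound, protein) pairs that are not known positives."""
    rng = random.Random(seed)
    positives = set(positives)
    # Size of the negative space: all combinations minus known binding pairs
    max_negatives = len(compounds) * len(proteins) - len(positives)
    if n > max_negatives:
        raise ValueError("requested more negatives than the negative space holds")
    negatives = set()
    while len(negatives) < n:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positives:  # exclude known binding pairs
            negatives.add(pair)
    return sorted(negatives)

compounds = ["c1", "c2", "c3"]
proteins = ["p1", "p2"]
positives = {("c1", "p1"), ("c2", "p2")}

negatives = sample_negatives(compounds, proteins, positives, n=4)
print(negatives)
```

With 3 compounds, 2 proteins and 2 known positives, the negative space holds exactly 4 pairs, so the call above returns all of them. For a space of billions of pairs, as in Example 1, rejection sampling rarely collides with a positive, which is why no enumeration of the full space is needed.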

As one embodiment, the compound target protein binding/non-binding data sets (positive/negative sample sets) used for training, testing and validation in the present invention are obtained by randomly extracting part of the compound target protein binding data (positive samples) and part of the compound-target protein non-binding data (negative samples), respectively;

as one embodiment, each single data set among the plurality of different compound target protein binding/non-binding data sets (positive/negative sample sets) of the present invention is formed as follows: the positive/negative sample data are randomly divided into training, testing and validation data sets at a ratio of 98:1:1.
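A 98:1:1 random split can be sketched in a few lines of Python. The function name `split_98_1_1`, the fixed seed and the rounding rule (1% each for testing and validation, rounded down, with the remainder going to training) are assumptions made for illustration.

```python
import random

def split_98_1_1(samples, seed=0):
    """Shuffle the samples and split them into training, testing and validation sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 100   # 1% for testing
    n_val = len(shuffled) // 100    # 1% for validation
    n_train = len(shuffled) - n_test - n_val
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val

train, test, val = split_98_1_1(range(1000))
print(len(train), len(test), len(val))  # 980 10 10
```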

As one embodiment, the compound target protein binding features (positive) are trained and extracted as follows: the positive samples are trained with a plurality of different deep learning models, and training is stopped when the loss value no longer decreases, giving multiple groups of extracted compound target protein binding features; the compound-target protein non-binding features (negative) are trained and extracted as follows: the negative samples are trained with a plurality of different deep learning models, and training is stopped when the loss value no longer decreases, giving multiple groups of extracted compound-target protein non-binding features.

As one embodiment, said step (2) in the method of the present invention further comprises:

the labels of the positive/negative sample sets are set to binding [0,1] and non-binding [1,0] (or, conversely, binding [1,0] and non-binding [0,1]) for training.

A single model of the plurality of different deep learning models used to train the positive and negative examples consists of any one of the following types: a fully-connected neural network model, a convolutional neural network model, a recurrent neural network model, a deep belief network, or an encoder-decoder network; as one embodiment, a recurrent neural network model is preferred.

As one embodiment, the single model in the plurality of different deep learning models (positive) in step (2) of the method of the present invention is composed of any one of the following types: a recurrent neural network (positive), a convolutional neural network (positive), a fully-connected neural network (positive), or an encoder-decoder network (positive);

a single model of the plurality of different deep learning models (negative) is comprised of any one of the following types: recurrent neural networks (negative), convolutional neural networks (negative), fully-connected neural networks (negative), or encoder-decoder networks (negative).

In the method of the present invention, as one embodiment, the step (2) of the method of the present invention further comprises:

(1) input parameter control of a single deep learning model:

TABLE 1 Input parameters of a single compound target protein binding/non-binding prediction deep learning model

(2) Structure and parameters of a single deep learning model

TABLE 2 Structure and parameters of a single compound target protein binding/non-binding prediction deep learning model

Model structure	Hidden layer structure	Neuron parameter range	Convolution kernel parameters
Fully connected neural network model	3 to 10 layers	2 to 5000	-
Convolutional neural network model	2 to 9 layers	-	4 to 200
Recurrent neural network model	1 to 7 layers	32 to 512	-
Deep belief network	3 to 10 layers	2 to 5000	-
Encoder-decoder network	-	-	-

Wherein the encoder-decoder network may be constructed from a fully connected neural network model, a convolutional neural network model, a recurrent neural network model, and a deep belief network.

(3) Output parameters of a single deep learning model

There are two types of output parameters: the first is a single continuous probability value in the range 0 to 1; the second is a discrete value, either [0,1] or [1,0].
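The two output types are related by a threshold: a continuous probability becomes a discrete one-hot label once a cut-off is chosen. The sketch below illustrates this under the positive-model label convention ([0,1] for binding, [1,0] for non-binding). The function name `to_one_hot`, the default threshold of 0.5 and the example scores are assumptions, not values given in the text.

```python
def to_one_hot(p_bind, threshold=0.5):
    """Map a binding probability to a one-hot label: [0, 1] binding, [1, 0] non-binding."""
    if not 0.0 <= p_bind <= 1.0:
        raise ValueError("p_bind must lie in [0, 1]")
    return [0, 1] if p_bind >= threshold else [1, 0]

# Hypothetical continuous outputs for three compound-target protein pairs
scores = {("c1", "p1"): 0.97, ("c1", "p2"): 0.62, ("c2", "p1"): 0.08}

default_calls = {pair: to_one_hot(p) for pair, p in scores.items()}
strict_calls = {pair: to_one_hot(p, threshold=0.9) for pair, p in scores.items()}

n_default = sum(label == [0, 1] for label in default_calls.values())
n_strict = sum(label == [0, 1] for label in strict_calls.values())
print(n_default, n_strict)  # 2 1
```

Raising the threshold shrinks the set of predicted binding pairs, which is one way to match the number of predictions to what is actually required.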

As one embodiment, step (2) of the method of the present invention further comprises the data used: on the compound side, the atoms of the compound and the chemical bonds between them; on the target protein side, the amino acid sequence; and, for the binding relationship, the secondary parameters Ki, IC50, Kd and EC50. Ki is the inhibition constant, which reflects how strongly an inhibitor inhibits its target. IC50 is the concentration of drug or inhibitor required to inhibit half of a given biological process (or a component of the process such as an enzyme, receptor or cell). Kd is the dissociation constant, which reflects the affinity of the compound for the target. EC50 is the concentration of drug, antibody or toxin that produces 50% of the maximum biological effect after a specified exposure time.

As one embodiment, step (2) of the method of the present invention further comprises: training with a plurality of different compound target protein binding positive samples, avoiding both under-fitting and over-fitting, to obtain a plurality of different final binding deep learning models (positive); and training with a plurality of different compound-target protein non-binding negative samples, avoiding both under-fitting and over-fitting, to obtain a plurality of different final non-binding deep learning models (negative).

As one embodiment, step (3) of the method of the present invention further comprises: according to the number of predictions actually required, adjusting the output thresholds of the plurality of different positive/negative deep learning models, and removing the consensus non-binding combinations (obtained by integrating the multiple groups of non-binding combinations) from the consensus binding combinations (obtained by integrating the multiple groups of binding combinations) to obtain a refined prediction result.

As an embodiment, the method of any of the above aspects of the invention further comprises: the system operation environment is as follows:

hardware: CPU + GPU;

software: Windows or Linux, Python + TensorFlow, PyTorch, Keras, etc.

As an embodiment, the method of the present invention further comprises:

(1) acquiring binding data (positive samples) of a compound target protein and non-binding data (negative samples) of the compound-target protein, wherein the positive samples are compound target protein bindings that have already been discovered in scientific experiments; the negative samples are the part of the compound-target protein combination space that excludes the known compound target protein bindings; the databases include: ChemSpider, PubChem, BindingDB, ZINC or ChEMBL, or a combination of two or more thereof;

constructing a plurality of different compound target protein binding deep learning models (positive models); using a plurality of different positive/negative sample sets with positive-to-negative ratios of (1:1), (2:1), (4:1), (6:1) and (8:1), training the plurality of different positive models in the positive direction, extracting multiple groups of compound target protein binding features (positive features), and obtaining a plurality of different final compound target protein binding deep learning models (positive models); and

constructing a plurality of different compound-target protein non-binding deep learning preliminary models (negative models); using a plurality of different positive/negative sample sets with positive-to-negative ratios of (1:1), (1:2), (1:4), (1:6) and (1:8), training the plurality of different negative models in the negative direction, extracting multiple groups of compound-target protein non-binding features (negative features), and obtaining a plurality of different final compound-target protein non-binding deep learning models (negative models);

a single model of the plurality of different deep learning models (positive models) is composed of any one of the following types: a recurrent neural network (positive model), a convolutional neural network (positive model), a fully-connected neural network (positive model), or an encoder-decoder network (positive model);

a single model of the plurality of different deep learning models (negative models) is composed of any one of the following types: a recurrent neural network (negative model), a convolutional neural network (negative model), a fully-connected neural network (negative model), or an encoder-decoder network (negative model);

(3) predicting and obtaining the binding relationship (positive) of a plurality of groups of compound target proteins through a plurality of different compound target protein binding deep learning models (positive); obtaining a plurality of groups of compound-target protein non-binding relations (negatives) through a plurality of different compound-target protein non-binding deep learning models (negatives);

(4) integrating the multiple groups of binding relation results obtained in the step (3) to obtain a consensus binding relation, and integrating the multiple groups of non-binding relation results obtained in the step (3) to obtain a consensus non-binding relation;

As an embodiment, the method of the present invention further comprises:

the single deep learning model (positive) includes four parts: a data input module, a recurrent neural network (RNN) module, a fully connected module and a binary classification output module; the data come from BindingDB and are downloaded in SDF (Structure-Data File) format, which contains data in Molfile format; after data preprocessing, the data input of the model is divided into 7 parts: the number of compound atoms, the compound atoms, the number of compound chemical bonds, the compound chemical bonds, the target protein sequence length, the target protein sequence and the binding label; the RNN module uses 3 two-layer LSTM neural networks to extract features from the compound atoms, the compound chemical bonds and the target protein sequence, respectively; the output features of the 3 RNNs are concatenated as the input of the fully connected module, which comprises 5 fully connected layers, the units of each layer being respectively 1024, 1024 and 2; the binary classification output module classifies the output of the fully connected module with a Softmax function, and the binding label uses one-hot format, in which [0,1] represents binding and [1,0] represents non-binding;

the single deep learning model (negative model) includes four parts: a data input module, an RNN module, a fully connected module and a binary classification output module; the data come from BindingDB and are downloaded in SDF (Structure-Data File) format, which contains data in Molfile format; after data preprocessing, the data input of the model is divided into 7 parts: the number of compound atoms, the compound atoms, the number of compound chemical bonds, the compound chemical bonds, the target protein sequence length, the target protein sequence and the binding label; the RNN module uses 3 two-layer LSTM neural networks to extract features from the compound atoms, the compound chemical bonds and the target protein sequence, respectively; the output features of the 3 RNNs are concatenated as the input of the fully connected module, which comprises 5 fully connected layers, the units of each layer being respectively 1024, 1024 and 2; the binary classification output module classifies the output of the fully connected module with a Softmax function, and the binding label uses one-hot format, in which [1,0] represents binding and [0,1] represents non-binding;

and the prediction results of the different negative models are used to filter the prediction results of the positive models, reducing the number of false positive samples.

As one embodiment, a single deep learning model (DNN) of the method of the present invention may be formed by any combination of the following specific models: a fully connected neural network model (FCNN), a convolutional neural network model (CNN), a recurrent neural network model (RNN), and an encoder-decoder network (ANN). The parameter ranges are shown in Tables 1 and 2.

As one embodiment, the method of the present invention uses deep learning models to learn both compound target protein binding features and compound-target protein non-binding features, and then uses both sets of features to predict potentially binding (positive) and non-binding (negative) combinations. On this basis, according to the number of predictions actually required, the output thresholds of the plurality of different positive and negative models are adjusted, and the consensus non-binding combinations (obtained by integrating the multiple groups of non-binding combinations) are removed from the consensus binding combinations (obtained by integrating the multiple groups of binding combinations) to obtain a refined prediction result.

The beneficial effects of the invention are as follows:

The compound target protein binding prediction method based on multi-deep learning model consensus addresses the high false positive rate of current compound target protein binding predictions: a plurality of different compound-target protein non-binding models are trained and used to clean the results of a plurality of different compound target protein binding models, which lowers the false positive rate of the prediction results.

The method also addresses the fact that existing prediction of compound target protein binding relationships with cheminformatics software (AutoDock Vina, GOLD, MOE-Dock and the like) cannot be carried out automatically; the multiple different deep learning models used by the invention run automatically, which reduces manual guesswork and false positive predictions.

Drawings

FIG. 1: flow chart of training and prediction with a mainstream deep learning model in the prior art;

FIG. 2: technical route map of the deep learning model for predicting the binding relationship of a compound target protein;

FIG. 3: technical route map of the recurrent neural network (positive) in series with recurrent neural network (negative) deep learning model of Example 1 for predicting the binding relationship of a compound target protein;

FIG. 4: schematic diagrams of single recurrent neural network (positive) and single recurrent neural network (negative) structures of example 1;

FIG. 5: schematic diagrams of single recurrent neural network (positive) and single encoder-decoder neural network (negative) system architectures of embodiment 2;

FIG. 6: schematic diagrams of single encoder-decoder neural network (positive) and single recurrent neural network (negative) system architectures of example 3;

FIG. 7: schematic diagrams of single encoder-decoder neural network (positive) and single encoder-decoder neural network (negative) system architectures of Example 4.

Detailed Description

The technical solution of the present invention is further described in detail with reference to the accompanying drawings and specific embodiments.

The technical route map of the deep learning model for predicting the binding relationship of the compound target protein is shown in figure 2.

Example 1

A method and system for predicting the binding relationship of a compound target protein, in which a plurality of recurrent neural network (positive) models are connected in series with a plurality of recurrent neural network (negative) models.

The overall system structure schematic diagram: see fig. 3-4.

The system operation environment is as follows:

hardware: CPU + GPU;

software: Windows or Linux, Python + TensorFlow, PyTorch, Keras, etc.

The technical scheme is as follows:

as shown in fig. 3 to 4.

(1) Binding data for compound target protein (positive sample), non-binding data for compound-target protein (negative sample)

The positive samples are compound target protein bindings that have already been found in scientific research, and are typically taken from ChemSpider, PubChem, BindingDB, ZINC, ChEMBL, and the like.

The negative samples are constructed by first randomly combining compounds with target proteins to form the raw data, and then removing the currently known binding "compound-target protein" combinations from the raw data.

The data used in the present invention comprise 759346 compounds and 7181 target proteins, so the raw data space constructed contains 759346 × 7181 = 5452863626 combinations; the number of known compound target protein binding records is 1433590, so the negative sample space contains 5452863626 - 1433590 = 5451430036 combinations.
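The two counts in this paragraph follow from one multiplication and one subtraction, which can be checked directly. The variable names below are illustrative only.

```python
n_compounds = 759_346
n_proteins = 7_181
n_known_binding = 1_433_590

# Every compound paired with every target protein
pair_space = n_compounds * n_proteins
# Negative space: all pairs minus the known binding pairs
negative_space = pair_space - n_known_binding

print(pair_space)      # 5452863626
print(negative_space)  # 5451430036
```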

(2-1) extracting compound target protein binding/non-binding data (positive/negative samples) according to nine positive-to-negative sample ratios, assembling the overall data sets, and dividing them into training, test and validation sets

(i) The extraction of the positive samples covers the training set, the test set and the validation set: the 1433590 binding records are sampled at random, 1410000 of them are taken as the positive samples of the training set, and 5000 records each are drawn from the remaining binding data as the positive samples of the test set and the validation set.

(ii) The extraction of the negative samples likewise covers the training set, the test set and the validation set: for the training set, the number of positive samples is fixed, the number of negative samples to draw for each data set is determined from the positive-to-negative ratio, and the negative samples are drawn at random from the 5451430036 non-binding records. The positive-to-negative ratios comprise nine types, 1:1, 2:1, 4:1, 6:1, 8:1, 1:2, 1:4, 1:6 and 1:8, and the corresponding training-set sizes (positive plus negative samples) are 2820000, 2115000, 1762500, 1645000, 1586250, 4230000, 7050000, 9870000 and 12690000. The test set and the validation set each also use 5000 non-binding records as negative samples, drawn at random from outside the training set.
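With the number of training positives fixed at 1410000, the negative count and the total set size for each of the nine ratios follow from simple arithmetic. The sketch below computes both; the function name `training_set_sizes` is an assumption. The totals it produces (positives plus negatives) reproduce the nine figures from 2820000 to 12690000 quoted above.

```python
N_TRAIN_POSITIVES = 1_410_000
# Positive-to-negative ratios of the nine training sets
RATIOS = [(1, 1), (2, 1), (4, 1), (6, 1), (8, 1), (1, 2), (1, 4), (1, 6), (1, 8)]

def training_set_sizes(n_pos, ratios):
    """Return (ratio, negative count, total size) for each positive:negative ratio."""
    rows = []
    for pos, neg in ratios:
        n_neg, remainder = divmod(n_pos * neg, pos)
        if remainder:
            raise ValueError(f"ratio {pos}:{neg} does not divide {n_pos} evenly")
        rows.append((f"{pos}:{neg}", n_neg, n_pos + n_neg))
    return rows

sizes = training_set_sizes(N_TRAIN_POSITIVES, RATIOS)
for ratio, n_neg, total in sizes:
    print(ratio, n_neg, total)
```

For example, the 1:1 set holds 1410000 negatives for a total of 2820000 samples, and the 8:1 set holds 176250 negatives for a total of 1586250.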

(iii) Combining the positive and negative samples obtained in the steps (i) and (ii) to respectively obtain nine training sets with different positive and negative sample ratios and corresponding test sets and verification sets

(2-2) construction of a plurality of different Compound target protein binding/non-binding deep learning models (Positive/negative models)

(i) Construction of multiple compound target protein binding deep learning models (positive)

A single positive model contains four parts: a data input module, an RNN module, a fully connected module and a binary classification output module.

The data come from BindingDB and are downloaded in SDF (Structure-Data File) format, which contains data in Molfile format. After data preprocessing, the data input of the model is divided into 7 parts: compound atom number, compound atoms, compound chemical bond number, compound chemical bonds, target protein sequence length, target protein sequence, and label.

Because the lengths of compounds and target proteins are not fixed, a recurrent neural network is well suited to extracting their features. The RNN module uses 3 two-layer LSTM neural networks to extract the features of the compound atoms, the compound chemical bonds and the target protein sequence, respectively, each as the output of its own RNN.

The output features of the 3 RNNs are concatenated as the input of the fully connected module, which comprises 5 fully connected layers; the units of each layer are respectively 1024, 1024 and 2.

The binary classification output module classifies the output of the fully connected module with a Softmax function. The binding label uses one-hot format, in which [0,1] represents binding and [1,0] represents non-binding.

The above model construction steps are repeated to construct nine different compound target protein binding deep learning models (positive), one for each of the nine data sets.

(ii) Construction of multiple compound-target protein non-binding deep learning models (negative)

A single negative model contains four parts: a data input module, an RNN module, a fully connected module and a binary classification output module.

The data come from BindingDB and are downloaded in SDF (Structure-Data File) format, which contains data in Molfile format. After data preprocessing, the data input of the model is divided into 7 parts: compound atom number, compound atoms, compound chemical bond number, compound chemical bonds, target protein sequence length, target protein sequence, and label. Because the lengths of compounds and target proteins are not fixed, a recurrent neural network is well suited to extracting their features. The RNN module uses 3 two-layer LSTM neural networks to extract the features of the compound atoms, the compound chemical bonds and the target protein sequence, respectively, each as the output of its own RNN.

The binary classification output module classifies the output of the fully connected module with a Softmax function. The label uses one-hot format, in which [1,0] represents binding and [0,1] represents non-binding.

The above model construction steps are repeated to construct nine different compound-target protein non-binding deep learning models (negative), one for each of the nine data sets.

(2-3) training a plurality of different compound target protein binding/non-binding deep learning models (positive/negative models), extracting a plurality of groups of compound target protein binding/non-binding characteristics (positive/negative characteristics), and obtaining a plurality of final compound target protein binding/non-binding deep learning models (positive/negative models)

(i) Obtaining a plurality of final compound target protein binding deep learning models (positive models)

The models are trained with the nine data sets obtained in step (2-1); training on each data set gives one corresponding model.

Each iteration of single-model training comprises two stages, forward propagation and back propagation. In forward propagation, the input layer receives the atom block, the chemical bond block and the target protein sequence; the corresponding features are extracted by three parallel RNN feature extraction networks, the extracted features are concatenated, and the prediction output y under the current weights and biases is obtained through the fully connected layers and the output layer. In back propagation, a cross entropy loss function compares y with the actual label of the data to obtain the loss value, the weights of each layer are updated from back to front by the back propagation algorithm according to this loss value, and the size of the update in each iteration is controlled by the learning rate.
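The output layer and loss described above can be illustrated with a small pure-Python sketch of Softmax followed by cross entropy on a two-class one-hot label. It is not the full network: as a simplifying assumption, the gradient step is applied directly to the two output logits rather than to the layer weights, and the logit values and learning rate are arbitrary.

```python
import math

def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    """Cross entropy between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)

logits = [0.2, 1.5]   # hypothetical output of the last fully connected layer
label = [1, 0]        # positive-model convention: [1, 0] means non-binding
learning_rate = 0.1   # controls the size of each update

y = softmax(logits)
loss = cross_entropy(y, label)

# For Softmax with cross entropy, the gradient w.r.t. the logits is (y - label)
grad = [p - t for p, t in zip(y, label)]
new_logits = [z - learning_rate * g for z, g in zip(logits, grad)]
new_loss = cross_entropy(softmax(new_logits), label)

print(loss > new_loss)  # True: one gradient step lowers the loss
```

A larger learning rate moves the logits further per step, which is the sense in which the learning rate controls the update size.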

During training, the test data set is fed to the model every 100 iterations and the test-set loss is recorded; the model with the smallest loss so far is saved as the current optimal model. When the test-set loss no longer shows a downward trend, or even begins to rise, the model is overfitting and its generalization performance is weakening, so training is stopped and the program ends, yielding the binding characteristics (positive characteristics) of the compound target protein. The model saved last is the one closest to optimal, that is, the final compound target protein binding deep learning model (positive model).
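The stopping rule above can be sketched as follows. The text does not say how many non-improving evaluations are tolerated, so the `patience` parameter is an assumption, as are the function name and the example loss values; each entry of `test_losses` stands for one evaluation taken every 100 iterations.

```python
def train_with_early_stopping(test_losses, patience=3):
    """Scan test-set losses, one per evaluation (e.g. every 100 iterations).

    Returns (best_index, best_loss, stop_index). `patience` is an assumed
    parameter: the number of consecutive evaluations without improvement
    that are tolerated before training stops.
    """
    best_index, best_loss = None, float("inf")
    bad_evals = 0
    stop_index = len(test_losses) - 1
    for i, loss in enumerate(test_losses):
        if loss < best_loss:
            # A real run would save the model checkpoint here.
            best_index, best_loss = i, loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                stop_index = i
                break
    return best_index, best_loss, stop_index

# Hypothetical loss curve: improves, then drifts upward (overfitting).
losses = [0.70, 0.55, 0.48, 0.50, 0.49, 0.52, 0.60]
result = train_with_early_stopping(losses)
print(result)
```

With these invented values the best checkpoint is the third evaluation, and training stops three evaluations later.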

The hyperparameters adjustable during model training, with their tuning ranges, are: learning rate, 1E-3, 1E-4, 1E-5 and 1E-6; number of RNN layers, 1 and 2; number of RNN hidden units, 128, 256 and 512; number of layers of the fully connected module, 1, 2, 3, 4, 5 and 6; number of units in each layer of the fully connected module, 256, 512 and 1024.

The optimal model parameter combination for obtaining the best binding characteristics is: learning rate 1E-4, 2 RNN layers, 256 RNN hidden units, 5 layers in the fully connected module, 1024 units in each layer of the fully connected module, and 20 training epochs.
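The tuning ranges listed for the recurrent positive model define a finite grid, which can be enumerated as in the sketch below. Whether the search was exhaustive is not stated in the text, so the full grid is an assumption; the dictionary keys are illustrative names, and the number of training epochs is not part of the listed ranges.

```python
from itertools import product

# Tuning ranges as listed for the recurrent model.
search_space = {
    "learning_rate": [1e-3, 1e-4, 1e-5, 1e-6],
    "rnn_layers": [1, 2],
    "rnn_hidden_units": [128, 256, 512],
    "fc_layers": [1, 2, 3, 4, 5, 6],
    "fc_units": [256, 512, 1024],
}

def iter_configs(space):
    """Yield one dict per point of the Cartesian product of all ranges."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))

# Combination reported as optimal in the text (epochs excluded from the grid).
reported_best = {
    "learning_rate": 1e-4,
    "rnn_layers": 2,
    "rnn_hidden_units": 256,
    "fc_layers": 5,
    "fc_units": 1024,
}
print(len(configs), reported_best in configs)
```

The grid has 4 x 2 x 3 x 6 x 3 = 432 points, and the reported optimum is one of them.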

(ii) Obtaining a plurality of final compound-target protein unbound deep learning models (negative models)

The training process is the same as (2-3) (i) above.

(3) Predicting to obtain the binding/non-binding relationship (positive/negative) of the target proteins of the compounds

A group of compound target protein binding/non-binding relationships (positive/negative) is obtained as follows. First, the input data required for prediction are generated. Because the binding relationship of each compound to all target proteins is to be predicted, each compound is paired with all 7181 protein sequences to form one batch of input data, which is fed to the model to obtain that model's prediction of the compound's binding relationships. Because the binding relationship of a single compound is uncertain, 100 compounds are drawn at random for prediction to improve the reliability of the result. The predicted result is one group of compound target protein binding relationships (positive/negative).
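The batch construction described above can be sketched as follows. The identifiers `compound_i` and `protein_j` are placeholders for real structures and sequences, and the function name, the pool of 500 compounds and the seed are assumptions for illustration; only the figures 100 and 7181 come from the text.

```python
import random

def build_prediction_batches(compounds, proteins, n_compounds=100, seed=0):
    """Draw compounds at random and pair each with every protein.

    Yields (compound, pairs), where `pairs` is one batch containing the
    compound combined with all protein sequences.
    """
    rng = random.Random(seed)
    chosen = rng.sample(compounds, min(n_compounds, len(compounds)))
    for compound in chosen:
        yield compound, [(compound, protein) for protein in proteins]

# Placeholder identifiers standing in for real compounds and sequences.
compounds = [f"compound_{i}" for i in range(500)]
proteins = [f"protein_{j}" for j in range(7181)]

batches = list(build_prediction_batches(compounds, proteins))
print(len(batches), len(batches[0][1]))
```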

The above process is repeated for each compound target protein binding/unbinding deep learning model (positive/negative model), and nine groups of compound target protein binding/unbinding relationships (positive/negative) can be obtained.

(4) Simplifying the binding relationship of the target proteins of the compounds to obtain the optimal positive/negative model combination

For each compound, the compound target protein binding relationship (positive) and the compound-target protein non-binding relationship (negative) obtained in step (3) are taken, and their intersection is removed from the binding relationship to obtain the simplified compound target protein binding relationship.

(i) Obtaining a consensus binding relationship (Positive model)

The consensus binding relationship is obtained as follows: the prediction results of the positive models are combined, the binding relationships of each model are ranked by predicted output probability, the comprehensive ranking of the same binding relationship across the positive models is taken as the final ranking, and the top 200 entries of each compound in the final ranking are taken as the binding relationships that the model combination predicts as able to bind.

(ii) Obtaining a consensus non-binding relationship (negative model)

The consensus non-binding relationship is obtained as follows: the prediction results of the negative models are combined, the binding relationships of each model are ranked by predicted output probability, the comprehensive ranking of the same binding relationship across the negative models is taken as the final ranking, and the top 6900 entries of each compound in the final ranking are taken as the non-binding relationships that the model combination predicts as unable to bind. The formula is as follows:

where Rank_score is the comprehensive ranking score, n is the number of models, and Rank_i is the rank given by the i-th model.
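The comprehensive ranking can be sketched as below. The formula itself is not reproduced in this text, so the sketch assumes that Rank_score is the sum of the per-model ranks Rank_i (which orders targets the same way as their mean); the function names, tie-breaking rule and probability values are likewise illustrative assumptions.

```python
def ranks_from_probs(probs):
    """Map target -> rank for one model (rank 1 = highest probability)."""
    ordered = sorted(probs, key=lambda t: (-probs[t], t))
    return {t: r for r, t in enumerate(ordered, start=1)}

def consensus_top_k(per_model_probs, k):
    """Sum each target's rank over all models and keep the k best targets.

    ASSUMPTION: the comprehensive score is the plain sum of ranks; ties are
    broken alphabetically so the result is deterministic.
    """
    rank_tables = [ranks_from_probs(p) for p in per_model_probs]
    targets = list(rank_tables[0])
    score = {t: sum(table[t] for table in rank_tables) for t in targets}
    return sorted(targets, key=lambda t: (score[t], t))[:k]

# Hypothetical output probabilities of two models for one compound.
model_a = {"P1": 0.9, "P2": 0.6, "P3": 0.2, "P4": 0.1}
model_b = {"P1": 0.7, "P2": 0.8, "P3": 0.1, "P4": 0.3}

top = consensus_top_k([model_a, model_b], k=2)
print(top)
```

In the described method k would be 200 for the positive models and 6900 for the negative models, with the negative models ranked by their non-binding probability.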

The intersection of the consensus binding relationship (positive model) and the consensus non-binding relationship (negative model) is removed from the consensus binding relationship (positive model) to obtain the final consensus binding relationship. The evaluation index of the consensus binding relationship is the real target hit rate once the average number of predicted targets has been reduced to within 10, that is, the proportion of the original positive samples of the 100 compounds that the model combination predicts correctly.
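The elimination step is a set difference per compound, as the sketch below shows. The hit-rate definition used here (fraction of compounds whose known target survives in the final set) is one reading of the evaluation index and is an assumption, as are the function names and the toy data.

```python
def final_consensus(consensus_binding, consensus_non_binding):
    """Remove every target that also appears in the consensus non-binding set."""
    return set(consensus_binding) - set(consensus_non_binding)

def hit_rate(final_by_compound, true_targets):
    """Fraction of compounds whose known target survives in the final set.

    ASSUMPTION: one known target per compound; the text does not spell out
    the exact definition of the real target hit rate.
    """
    hits = sum(
        1 for compound, target in true_targets.items()
        if target in final_by_compound.get(compound, set())
    )
    return hits / len(true_targets)

# Toy consensus results for two compounds.
binding = {"c1": ["P1", "P2", "P3"], "c2": ["P4", "P5"]}
non_binding = {"c1": ["P3", "P9"], "c2": ["P4"]}

final = {c: final_consensus(binding[c], non_binding[c]) for c in binding}
rate = hit_rate(final, {"c1": "P1", "c2": "P4"})
print(final, rate)
```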

The optimal consensus binding relationship is obtained as follows: the prediction results of the nine positive models and the nine negative models are combined, the real target hit rate of each combination is computed, and the combination with the largest real target hit rate is the combination with the optimal consensus binding relationship.
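One way to read "each combination" is every pairing of a non-empty subset of positive models with a non-empty subset of negative models; that reading is an assumption. The sketch below enumerates such pairings and keeps the best one. `toy_hit_rate` is a placeholder scoring function with no empirical meaning; a real run would compute the real target hit rate of the consensus prediction for each pairing.

```python
from itertools import combinations

def non_empty_subsets(items):
    """Yield every non-empty subset of `items` as a tuple."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)

def best_combination(positive_models, negative_models, evaluate):
    """Return (score, positive_subset, negative_subset) with the highest score."""
    best = None
    for pos in non_empty_subsets(positive_models):
        for neg in non_empty_subsets(negative_models):
            score = evaluate(pos, neg)
            if best is None or score > best[0]:
                best = (score, pos, neg)
    return best

# Placeholder model identifiers, one per training data set.
positive_models = [f"pos_{i}" for i in range(1, 10)]
negative_models = [f"neg_{i}" for i in range(1, 10)]

def toy_hit_rate(pos, neg):
    # Hypothetical stand-in for the real evaluation; values are meaningless.
    return 0.01 * len(pos) + 0.001 * len(neg)

n_subsets = sum(1 for _ in non_empty_subsets(positive_models))
best_score, best_pos, best_neg = best_combination(
    positive_models, negative_models, toy_hit_rate
)
print(n_subsets, len(best_pos), len(best_neg))
```

Under this reading nine models give 511 non-empty subsets per side, so 511 x 511 pairings are evaluated.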

The optimal model combination is a combination of two positive models trained using the datasets with positive and negative sample ratios of 1:1 and 2:1 and five negative models trained using the datasets with positive and negative sample ratios of 1:1, 1:2, 1:4, 1:6 and 1:8.

As shown in fig. 4, "9.80" is the simplified binding relationship of the compound target protein, "190.20" is the intersection of the consensus binding relationship and the consensus non-binding relationship, and "6709.80" is the number of non-binding relationships after the intersection is removed from the result of the consensus non-binding relationship.
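The three figures quoted for fig. 4 are related by simple subtraction, which the following check makes explicit. The variable names are illustrative; the values 200, 6900 and 190.20 are taken from the text.

```python
top_binding = 200        # consensus binding entries kept per compound
top_non_binding = 6900   # consensus non-binding entries kept per compound
overlap = 190.20         # average intersection reported for fig. 4

# Removing the intersection from each consensus set gives the other two figures.
simplified = round(top_binding - overlap, 2)
remaining_non_binding = round(top_non_binding - overlap, 2)
print(simplified, remaining_non_binding)
```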

In this embodiment, by adjusting the final output parameter softmax, the average predicted target protein number is reduced from 200 to 9.80 by adopting a method of obtaining consensus binding relationship according to comprehensive ranking, and the hit rate of the real target is 46.5%.

Example 2

A plurality of recurrent neural networks (positive) are connected in series with a plurality of encoder-decoder neural networks (negative) to predict the binding relationship of the compound target protein through a deep learning model.

The overall system structure schematic diagram: see fig. 5.

The system operation environment is as follows: the same as in "example 1".

(1) Obtaining Compound target protein binding/unbinding data (Positive/negative sample)

The same as the corresponding parts in "example 1".

The same as the section (2-1) of the technical scheme of the embodiment 1.

(i) Construction of Single Compound target protein binding deep learning model (Positive)

The same as the section (2-2) (i) of the technical scheme of the embodiment 1.

(ii) Construction of Single Compound-target protein unbound deep learning model (negative)

The negative model contains four parts: a data input module, an encoder module, a feature extraction module and a binary classification output module. The feature extraction module can be any one or a combination of a fully connected module, a convolution module or a recurrent module; the fully connected module is taken as the example here.

The data come from BindingDB. The downloaded file format is SDF (Structure Data File), which contains data in Molfile format. After preprocessing, the model input is divided into 7 parts: compound atom count, compound atoms, compound chemical bond count, compound chemical bonds, target protein sequence length, target protein sequence, and binding label. Because the compound and the target protein are of variable length, the encoder-decoder is trained with a recurrent neural network, which can fully extract features and convert variable-length inputs to a fixed length. The encoder-decoder uses two sets of 2 two-layer LSTM neural networks to train the encoders of the compound sequence and the target protein sequence, respectively.

The output features of the 2 encoders are concatenated and used as the input of the fully connected module. The fully connected module comprises 5 fully connected layers, and the number of units of each layer is respectively 1024, 1024 and 2.

The model training process is the same as that described in the technical scheme (2-3) (i) of 'example 1'.

In the encoder-decoder neural network, the encoder-decoder must be trained separately and then frozen. The compound encoder-decoder and the target protein encoder-decoder are first trained separately on the entire training data, with the training objective of continuously reducing the difference between the encoder's input data and the decoder's output data. After training, the encoder parameters are retained and frozen, and the encoder part is taken out to serve as the data input module and feature extraction module; a classification module built from fully connected layers and a binary classification output module are then attached to form the whole model.

Each iteration of single-model training comprises two stages, forward propagation and back propagation. In forward propagation, the compound and the target protein sequence are fed to the input layer, the corresponding feature data are extracted by the encoders, the extracted features are concatenated, and the prediction output y under the current weights and biases is obtained through the fully connected layers and the output layer. In back propagation, a cross-entropy loss function compares y with the actual data label to obtain a loss value; the weights of each fully connected layer are then updated from back to front by the back propagation algorithm according to this loss value (the encoder parameters are already frozen and need no update), with the update amplitude of each iteration controlled by the learning rate.
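The effect of freezing the encoders can be shown with a toy update step in which only the fully connected weights change. The parameter group names, values and gradients below are invented for illustration, and `sgd_update` is a hypothetical helper; a deep learning framework would normally handle this by excluding the frozen parameters from gradient computation.

```python
def sgd_update(params, grads, frozen, learning_rate):
    """Return updated parameters, leaving every group named in `frozen` untouched."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            updated[name] = list(values)  # frozen encoder: copy unchanged
        else:
            updated[name] = [
                v - learning_rate * g for v, g in zip(values, grads[name])
            ]
    return updated

# Hypothetical parameter groups of the assembled model.
params = {
    "compound_encoder": [0.5, -0.2],
    "protein_encoder": [0.1, 0.3],
    "fully_connected": [1.0, -1.0],
}
grads = {name: [1.0, 1.0] for name in params}

new_params = sgd_update(
    params,
    grads,
    frozen={"compound_encoder", "protein_encoder"},
    learning_rate=1e-4,
)
print(new_params)
```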

During training, the test data set is fed to the model every 100 iterations and the test-set loss is recorded; the model with the smallest loss so far is saved as the current optimal model. When the test-set loss no longer shows a downward trend, or even begins to rise, the model is overfitting and its generalization performance is weakening, so training is stopped and the program ends, yielding the non-binding characteristics (negative characteristics) of the compound-target protein. The model saved last is the one closest to optimal, that is, the final compound-target protein non-binding deep learning model (negative model).

The main parameters to be tuned for encoder-decoder training are: learning rate, 1E-3, 5E-4, 1E-4, 5E-5 and 1E-5; number of RNN layers, 1 and 2; number of RNN hidden units, 128, 256 and 512. The main parameters to be tuned for the feature extraction module are: learning rate, 1E-3, 5E-4, 1E-4, 5E-5 and 1E-5; number of layers of the fully connected module, 1, 2, 3, 4, 5 and 6; number of units in each layer of the fully connected module, 256, 512 and 1024.

Finally, the optimal model parameter combination for obtaining the best binding characteristics is: for the encoder-decoder, learning rate 1E-4, 2 RNN layers and 512 RNN hidden units; for the feature extraction module, learning rate 1E-4, 4 layers in the fully connected module, 1024 units in each layer of the fully connected module, and 20 training epochs.

The same as the corresponding parts in "example 1".

(4) Simplifying the binding relationship of the compound target protein to obtain the optimal positive/negative model combination

The same as the corresponding part of "example 1".

As shown in fig. 5, "9.65" is the simplified binding relationship of the compound target protein, "190.35" is the intersection of the consensus binding relationship and the consensus non-binding relationship, and "6709.65" is the number of non-binding relationships after the intersection is removed from the result of the consensus non-binding relationship.

In this embodiment, by adjusting the final output parameter softmax and taking the consensus binding relationship according to the comprehensive ranking as in step (4) of Example 1, the average number of target proteins is reduced from 200 to 9.65, and the hit rate of the real target is 41.5%.

Example 3

A plurality of encoder-decoder neural networks (positive) are connected in series with a plurality of recurrent neural networks (negative) to predict the binding relationship of the compound target protein through a deep learning model and a system.

The overall system structure schematic diagram: see fig. 6.

The system operation environment is as follows: the same as in "example 1".

The technical scheme is as follows:

The same as the corresponding parts in "example 1".

The same as the section (2-1) of the technical scheme of the embodiment 1.

The positive model contains four parts: a data input module, an encoder module, a feature extraction module and a binary classification output module. The feature extraction module can be any one or a combination of a fully connected module, a convolution module or a recurrent module; the fully connected module is taken as the example here.

The output features of the 2 encoders are concatenated and used as the input of the fully connected module. The fully connected module comprises 5 fully connected layers, and the number of units of each layer is respectively 1024, 1024 and 2.

The same as the section (2-2) (ii) of the technical scheme of the embodiment 1.

The model training process is the same as that described in the technical scheme (2-3) (ii) of "example 4".

The model training process is the same as that described in the technical scheme (2-3) (ii) of "example 1".

(3) Predicting to obtain the binding/non-binding relationship (positive/negative) of target proteins of multiple groups of compounds

The same as the corresponding parts in "example 1".

The same as the corresponding part of "example 1".

The optimal model combination is a combination of two positive models trained using the data sets with positive and negative sample ratios of 1:1 and 2:1 and four negative models trained using the data sets with positive and negative sample ratios of 1:2, 1:4, 1:6 and 1:8.

As shown in fig. 6, "9.85" is the simplified binding relationship of the compound target protein, "190.15" is the intersection of the consensus binding relationship and the consensus non-binding relationship, and "6709.85" is the number of non-binding relationships after the intersection is removed from the result of the consensus non-binding relationship.

In this embodiment, by adjusting the final output parameter softmax and taking the consensus binding relationship according to the comprehensive ranking as in step (4) of Example 1, the average number of target proteins is reduced from 200 to 9.85, and the hit rate of the real target is 43.5%.

Example 4

A plurality of encoder-decoder neural networks (positive) are connected in series with a plurality of encoder-decoder neural networks (negative) to predict the binding relationship of the compound target protein through a deep learning model and a system.

The overall system structure schematic diagram: see fig. 7.

The system operation environment is as follows: the same as in "example 1".

The technical scheme is as follows:

The same as the corresponding part of "example 1".

The same as the section (2-1) of the technical scheme of the embodiment 1.

The same as the section (2-2) (i) of the technical scheme of the 'embodiment 13'.

The same as the section (2-2) (ii) of the technical scheme of the 'embodiment 4'.

The model training process is the same as that described in "embodiment 13" technical solution (2-3) (i).

The same as the corresponding parts in "example 1".

The same as the corresponding part of "example 1".

The optimal model combination is a combination of two positive models trained using the data sets with positive and negative sample ratios of 1:1 and 2:1 and three negative models trained using the data sets with positive and negative sample ratios of 1:4, 1:6 and 1:8.

As shown in fig. 7, "9.97" is the simplified binding relationship of the compound target protein, "190.03" is the intersection of the consensus binding relationship and the consensus non-binding relationship, and "6709.97" is the number of non-binding relationships after the intersection is removed from the result of the consensus non-binding relationship.

In this embodiment, by adjusting the final output parameter softmax and taking the consensus binding relationship according to the comprehensive ranking as in step (4) of Example 1, the average number of target proteins is reduced from 200 to 9.97, and the hit rate of the real target is 42.5%.

The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto, and any simple modifications or equivalent substitutions of the technical solutions that can be obviously obtained by those skilled in the art within the technical scope of the present invention are within the scope of the present invention.

Claims

1. The compound target protein binding prediction method based on the multi-deep learning model consensus is characterized by comprising the following steps of:

(1) acquiring binding data of a compound target protein and non-binding data of the compound-target protein;

(2) extracting a plurality of different compound target protein binding/non-binding data sets for training, testing and verification, extracting compound target protein binding/non-binding data according to nine positive and negative sample ratios, synthesizing a total data set, and dividing to obtain a training set, a testing set and a verification set;

constructing a plurality of different compound target protein binding deep learning models, adopting sample sets with nine proportions to train the plurality of different compound target protein binding/non-binding deep learning models in a forward direction, extracting the binding characteristics of the compound target proteins, and obtaining a plurality of different final compound target protein binding deep learning models; constructing a plurality of different compound-target protein non-binding deep learning models, negatively training the plurality of different compound-target protein non-binding deep learning models by adopting sample sets with nine proportions, and extracting the binding characteristics of the compound target protein to obtain a plurality of different final compound-target protein non-binding deep learning models;

(3) predicting through a plurality of different deep learning models bound by optimal compound target proteins to obtain a binding relationship of a plurality of groups of compound target proteins; predicting through a plurality of compound-target protein non-binding deep learning models to obtain a plurality of groups of compound-target protein non-binding relations;

(5) the simplified binding relationship of the compound target protein is the consensus binding relationship minus the consensus non-binding relationship;

the step (1) further comprises: the acquisition of the binding data of the compound target protein comprises the acquisition of the binding data of the compound target protein from scientific literature and/or a compound target protein binding database published on the internet;

the database includes: ChemSpider, PubChem, BindingDB, ZINC or ChEMBL, or a combination of two or more thereof;

the step (1) further comprises: the compound-target protein non-binding data is the compound-target protein non-binding data after the compound-target protein binding data is excluded from the data of the randomly generated compound-target protein;

the proportion range of the compound target protein binding data and the compound-target protein non-binding data is (0.1:1) - (1:100); the proportions of the compound target protein binding data (positive sample) and the compound-target protein non-binding data (negative sample) for training the negative model are (1:1), (1:2), (1:4), (1:6) and (1:8), respectively, and the proportions for training the positive model are (1:1), (2:1), (4:1), (6:1) and (8:1), respectively.
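The construction of non-binding data recited in claim 1 (random compound-target pairs with known binding pairs excluded) can be sketched as below. The function name, placeholder identifiers, data sizes and seed are assumptions for illustration; the 1:4 ratio is one of the recited proportions.

```python
import random

def sample_negatives(compounds, proteins, positives, n_negatives, seed=0):
    """Draw random compound-protein pairs that are not known binding pairs."""
    rng = random.Random(seed)
    positives = set(positives)
    # Assumes every positive pair lies inside the compound x protein space.
    capacity = len(compounds) * len(proteins) - len(positives)
    if n_negatives > capacity:
        raise ValueError("not enough non-binding pairs available")
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)

# Placeholder identifiers and a toy set of 10 known binding pairs.
compounds = [f"compound_{i}" for i in range(20)]
proteins = [f"protein_{j}" for j in range(30)]
positives = [(f"compound_{i}", f"protein_{i}") for i in range(10)]

# Positive:negative ratio of 1:4, as used for one of the negative models.
negatives = sample_negatives(compounds, proteins, positives, 4 * len(positives))
print(len(positives), len(negatives))
```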

2. The method for predicting binding of a compound target protein based on multi-deep learning model consensus according to claim 1, wherein the single model in the multiple different deep learning models in step (2) is composed of any one of the following types: a recurrent neural network model, a fully-connected neural network model, a convolutional neural network model, or an encoder-decoder network model.

3. The method for predicting binding of a compound target protein based on multi-deep learning model consensus according to claim 1, wherein the step (2) further comprises:

a single binding/unbinding dataset of the plurality of different training, testing and validation compound target protein binding/unbinding datasets is obtained as follows: positive/negative sample data are randomly extracted and divided into training, testing and verification data sets according to a ratio of 98:1:1, the order is randomly shuffled, sample data are then randomly drawn in groups of 128 and fed into a single deep learning model for training, and the training is terminated when the loss value no longer decreases during training, thereby obtaining a group of compound target protein binding/non-binding characteristics.
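The 98:1:1 division and grouping into 128 samples recited in claim 3 can be sketched as follows. The function names, the seed, the rounding rule for the test and verification sets, and the use of consecutive slices of the shuffled data as groups are assumptions for illustration.

```python
import random

def split_98_1_1(samples, seed=0):
    """Shuffle and split into training, testing and verification sets (98:1:1)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = n // 100
    n_val = n // 100
    n_train = n - n_test - n_val
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val

def batches(data, batch_size=128):
    """Yield consecutive groups of `batch_size` samples from shuffled data."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# Integers stand in for (compound, protein, label) samples.
train, test, val = split_98_1_1(range(10000))
train_batches = list(batches(train))
print(len(train), len(test), len(val), len(train_batches))
```

With 10000 samples this gives 9800 training samples in 77 groups, the last of which holds the remaining 72 samples.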

4. The method for predicting binding of a compound target protein based on multi-deep learning model consensus according to claim 3, wherein the step (2) further comprises:

setting labels of the positive/negative sample sets as binding [1,0] and non-binding [0,1] (or dual binding [0,1] and non-binding [1,0]) respectively for training; and setting labels of the positive/negative sample sets as binding [0,1] and non-binding [1,0] (or dual binding [1,0] and non-binding [0,1]) respectively for training.

5. The method for predicting binding of a compound target protein based on multi-deep learning model consensus as claimed in claim 4, wherein the step (2) further comprises:

input parameter control of the single deep learning model:

1. the compound and the target protein are combined: the width of one-time input is 600-4000, and the width of segmented input is 4-60; 2. the compound and the target protein are treated separately: the width of one-time input of the compound is 100-1500 and the width of its segmented input is 4-60; the width of one-time input of the target protein is 500-1500 and the width of its segmented input is 4-60;

the structure and parameters of the single deep learning model are as follows:

the hidden layer of the fully connected neural network model is 3-10 layers, and the number of neuron parameters ranges from 2 to 5000; the hidden layers of the convolutional neural network model are 2-9 layers, and the parameter range of convolutional kernels is 4-200; the hidden layer of the recurrent neural network model is 1-7 layers, and the number of neuron parameters ranges from 32 to 512; the hidden layer of the deep belief network model is 3-10 layers, and the number of neuron parameters ranges from 2 to 5000;

the output parameters of the single deep learning model include two types: the first type outputs a single continuous probability value in the range 0-1; the second type outputs discrete values, including the two discrete values [0,1] and [1,0].

6. The method for predicting binding of a compound target protein based on multi-deep learning model consensus according to claim 1, wherein the step (2) further comprises: the data used include, for the compound, the compound atoms and the chemical bonds between compound atoms; for the target protein, the amino acid sequence; and for the binding relationship, the secondary parameters Ki, IC50, Kd and EC50.

7. The method for predicting binding of a compound target protein based on multi-deep learning model consensus according to claim 1, wherein the step (3) further comprises: according to the required number of actual predictions, adjusting the output thresholds of a plurality of different positive/negative deep learning models, and eliminating the consensus non-binding combinations obtained by integrating a plurality of groups of non-binding combinations from the consensus binding combinations obtained by integrating a plurality of groups of binding combinations to obtain a simplified prediction result.

8. The method for predicting binding of a compound target protein based on multi-deep learning model consensus as claimed in claim 1, further comprising:

the system operation environment is as follows:

hardware: CPU + GPU;

software: Windows or Linux, Python + TensorFlow, PyTorch, Keras;

(1) acquiring binding data of a compound target protein and non-binding data of the compound-target protein, wherein the positive sample is the binding of the compound target protein found in scientific research experiments by human beings; negative examples are moieties that exclude binding of a compound target protein in the compound-target protein combination space; the database includes: ChemSpider, PubChem, BindingDB, ZINC or ChEMBL, or a combination of two or more thereof;

constructing a plurality of different compound target protein binding deep learning models, adopting a plurality of different positive/negative sample sets with the proportion of (1:1), (2:1), (4:1), (6:1) and (8:1), forward training the plurality of different compound target protein binding/non-binding deep learning models, extracting the binding characteristics of a plurality of groups of compound target proteins, and obtaining a plurality of different final compound target protein binding deep learning models; and

constructing a plurality of different compound-target protein non-binding deep learning preliminary models, adopting a plurality of different positive/negative sample sets with the proportion of (1:1), (1:2), (1:4), (1:6) and (1:8), training a plurality of different compound-target protein non-binding deep learning models in a negative direction, extracting the binding characteristics of a plurality of groups of compound target proteins, and obtaining a plurality of different final compound-target protein non-binding deep learning models;

individual ones of the plurality of different deep learning models are comprised of any one of the following types: a recurrent neural network, a convolutional neural network, a fully-connected neural network, or an encoder-decoder network;

(3) obtaining a binding relation of a plurality of groups of compound target proteins through a plurality of different positive models; obtaining a plurality of groups of compound-target protein non-binding relations through a plurality of different negative models;