CN114360638A - Compound-protein interaction prediction method based on deep learning - Google Patents


Info

Publication number
CN114360638A
CN114360638A
Authority
CN
China
Prior art keywords
layer
compound
learning
characteristic diagram
cnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111530765.2A
Other languages
Chinese (zh)
Inventor
吴坚
钱莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202111530765.2A
Publication of CN114360638A
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 — ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 — Drug targeting using structural data; Docking or binding prediction
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/048 — Activation functions
    • G06N3/08 — Learning methods
    • G16B40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Medicinal Chemistry (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a compound-protein interaction prediction method based on deep learning, comprising the following steps: Step A, a CNN module learns local features of a compound picture, and a Transformer encoder then learns the semantic relationships of the learned features; Step B, the protein sequence is divided by a k-gram method, and several Transformer encoders then learn semantic relationships; Step C, the learned compound and protein features are learned further; Step D, a prediction result is obtained through a fully connected layer. The invention provides a method with high prediction accuracy that can effectively capture the molecular picture features of a compound.

Description

Compound-protein interaction prediction method based on deep learning
Technical Field
The invention relates to the technical field of drug design and image processing, and in particular to a method for predicting compound-protein interactions by combining a CNN and a Transformer to learn picture characteristics of a compound.
Background
Drug discovery is a field of bioinformatics that aims to discover new molecular structures with desirable pharmacological properties, involving a wide range of scientific disciplines, including biology, chemistry, and pharmacology. The new drug molecules provide benefits for patient treatment by interacting with the target protein.
In recent years, in the context of computer-aided drug design, there has been great interest in developing automated machine learning techniques to find large numbers of realistic, diverse, and novel candidate molecules in a broad and unstructured molecular space. A method that accurately predicts the interaction between a compound and a protein can help prevent disease and reduce drug development costs, which is important for patients and for society.
Disclosure of Invention
The invention aims to provide a method for predicting compound-protein interactions from compound molecule picture information by combining a CNN and a Transformer to learn compound picture characteristics. The method can significantly improve the accuracy of compound-protein interaction prediction.
The specific technical scheme for realizing the purpose of the invention is as follows:
a method for predicting compound-protein interactions based on deep learning, the method comprising the steps of:
1) obtaining a characteristic diagram of a compound molecule, specifically comprising:
1.1) generating a compound molecule picture by RDkit software according to the SMILES sequence of the compound molecule;
1.2) inputting the generated compound molecule picture into a CNN module to learn features, wherein the picture size is H × W × 3, H and W respectively represent the height and width of the picture, and 3 represents the number of color channels; the CNN module is used for learning local features of the compound molecule picture, and consists of: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max pooling layer; the size of the feature map after CNN learning is X_CNN ∈ R^(m×n×c), where m and n respectively represent the height and width of the compound molecule feature map after CNN learning, and c represents the number of channels of the feature map;
1.3) flattening the obtained feature map X_CNN along its channel dimension into input tokens for the Transformer, the new feature map size being X_token ∈ R^((m·n)×c), where m·n is the product of the height and width of the compound molecule feature map and c is the number of channels of the feature map;
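The flattening in step 1.3 can be sketched in a few lines of NumPy (an illustrative sketch with assumed shapes, not the patent's own code): a feature map of shape (m, n, c) is reshaped so that each of the m·n spatial positions becomes one c-dimensional token.

```python
import numpy as np

# Hypothetical CNN output: an m x n spatial grid with c channels.
m, n, c = 16, 16, 256
x_cnn = np.zeros((m, n, c))

# Flatten the spatial dimensions so each of the m*n positions
# becomes one input token of dimension c for the Transformer.
x_token = x_cnn.reshape(m * n, c)
print(x_token.shape)  # (256, 256)
```

With the example sizes used later in the patent (16 × 16 × 256), this yields a 256 × 256 token matrix.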
1.4) taking the new feature map X_token as the input of the Transformer encoder, and then learning the semantic relationships in the feature map through several Transformer encoders, each of which consists of:
1.4.1) the LayerNormalization layer processes data with different lengths;
1.4.2) the multi-head attention layer converts the input vectors into three different vectors, all of dimension d: a query vector q, a key vector k, and a value vector v, which are packed into corresponding matrices Q, K, V; the calculation process is: the score of the query matrix Q against the key matrix K is S = Q·K^T; the score normalized for gradient stability is S_n = S / √(d_k); the score is converted into probabilities using the softmax function, P = softmax(S_n), giving the weighted value matrix Attention = P·V; the formula for the whole process is:

Attention(Q, K, V) = softmax(Q·K^T / √(d_k)) · V

where softmax represents the softmax activation function, Q and V represent the matrices packed from the query vectors q and value vectors v, K^T is the transpose of the matrix packed from the key vectors, and d_k represents the current input dimension; then, to address the insensitivity of self-attention to position information, positional encodings of the same dimension are added to the original input embedding:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos represents the position of the word in the sentence, i represents the current dimension of the positional encoding, and d is the dimension of the input vector; the total attention combines several attention heads: MultiHead(Q′, K′, V′) = Concat(head_1, …, head_h)·W^O, where Concat denotes the concatenation operation, W^O is a learnable feature transformation matrix, head_i represents the Attention value of each head, and h represents the total number of heads;
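The scaled dot-product attention and sinusoidal positional encoding described above can be sketched as follows (a minimal NumPy illustration with assumed token counts and dimensions, not the patent's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # S = Q.K^T, scaled by sqrt(d_k) for gradient stability,
    # converted to probabilities, then used to weight V.
    d_k = Q.shape[-1]
    S_n = Q @ K.T / np.sqrt(d_k)
    P = softmax(S_n, axis=-1)
    return P @ V

def positional_encoding(seq_len, d):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(...).
    pos = np.arange(seq_len)[:, None]
    even_i = np.arange(0, d, 2)[None, :]
    angle = pos / np.power(10000.0, even_i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

tokens = np.random.randn(256, 64)        # 256 tokens, dimension 64 (assumed)
x = tokens + positional_encoding(256, 64)
out = attention(x, x, x)                 # self-attention: Q = K = V = x
print(out.shape)                         # (256, 64)
```

Each output row is a probability-weighted mixture of the value rows; the multi-head case simply runs this h times on learned projections and concatenates the results.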
1.4.3) obtaining the final semantic information feature map through a multilayer perceptron layer; the multilayer perceptron layer proceeds as: a fully connected layer, a GELU activation layer, a DropPath layer, a fully connected layer, and a DropPath layer;
1.5) after learning representations through several Transformer encoders, the final compound molecule feature map X_C is obtained;
2) Obtaining a characteristic map of the protein sequence, which specifically comprises:
2.1) protein sequences are in the FASTA format, where the sequences are made up of combinations of amino acids and the amino acids are all represented in a single letter; dividing the protein sequence into words by adopting a k-gram method, wherein the length of each word is k;
2.2) building a dictionary of the divided words, numbering the words in ascending order of first appearance, replacing each word of the original protein sequence representation with its dictionary serial number, and then embedding the serial number of each word as a representation;
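Steps 2.1-2.2 can be sketched in plain Python (assuming overlapping k-gram words and dictionary indices assigned in order of first appearance; the patent's exact segmentation and numbering details may differ):

```python
def kgram_words(sequence, k=3):
    # Slide a window of length k over the amino-acid sequence,
    # producing one "word" per position (overlapping k-grams).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_dictionary(words):
    # Assign each distinct word a serial number in order of
    # first appearance.
    index = {}
    for w in words:
        if w not in index:
            index[w] = len(index)
    return index

seq = "MKVLAAMKV"                  # toy FASTA-style sequence (hypothetical)
words = kgram_words(seq, k=3)
vocab = build_dictionary(words)
ids = [vocab[w] for w in words]    # sequence of dictionary serial numbers
print(words)  # ['MKV', 'KVL', 'VLA', 'LAA', 'AAM', 'AMK', 'MKV']
print(ids)    # [0, 1, 2, 3, 4, 5, 0]
```

With k = 3 the 20 single-letter amino acids expand to up to 20³ combined words, which is the expressiveness gain the patent attributes to the k-gram division.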
2.3) adding the position information of the original words in the protein sequence, and learning the semantic information among the words in the protein sequence through a plurality of transform encoder modules;
2.4) obtaining the final protein sequence feature map X_P after learning through several Transformer encoder modules;
3) further learning the compound molecule feature map X_C and the protein sequence feature map X_P and predicting the final result, specifically comprising:
3.1) passing the feature maps X_C and X_P through a multilayer perceptron for shallow feature learning;
3.2) overlapping the two feature maps, and then using a CNN module to learn deep features; the module design of CNN is: a convolutional layer, a BatchNormalization layer, a LeakyReLU function activation layer, a max pooling layer, a convolutional layer, a BatchNormalization layer, a LeakyReLU function activation layer, a max pooling layer;
3.3) flattening the characteristic diagram, and obtaining a final prediction result through a full connection layer, wherein the result is expressed as 0 or 1, 0 represents that no interaction exists between the compound and the protein, and 1 represents that the interaction exists.
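The fusion and prediction in steps 3.2-3.3 can be illustrated at the level of shapes (a NumPy sketch under assumed 256 × 256 feature-map sizes; the real model's convolutional and fully connected parameters are not given in this text):

```python
import numpy as np

x_c = np.random.randn(256, 256)   # compound feature map (assumed size)
x_p = np.random.randn(256, 256)   # protein feature map (assumed size)

# Step 3.2: stack the two maps into a 2-channel "image" for the CNN.
fused = np.stack([x_c, x_p], axis=0)
print(fused.shape)  # (2, 256, 256)

# Step 3.3 (shape only): flatten and apply a fully connected layer
# producing two logits; argmax gives the 0/1 interaction label.
flat = fused.reshape(-1)
W = np.random.randn(flat.size, 2) * 0.01   # hypothetical FC weights
logits = flat @ W
label = int(np.argmax(logits))             # 0: no interaction, 1: interaction
print(label in (0, 1))  # True
```

In the patent's pipeline a CNN module sits between the stacking and the flattening; the sketch skips it to show only how the two feature maps are combined and reduced to a binary decision.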
The technical conception of the invention is as follows: first, a compound molecule picture contains rich molecular information; the atomic information and the spatial structure of the molecule are well represented in the picture. A CNN is used to learn local features of the compound molecule picture, effectively capturing the molecular information contained in each pixel, and several Transformer encoders are then stacked to learn the semantic relationships of the feature maps, exploiting the Transformer's ability to learn global features. Second, to address the limited expressiveness of only 20 amino acid types, k-gram division expands the combined amino acid vocabulary to many more types, which benefits representation learning of the protein. Finally, based on the feature maps of the compound and the protein, a multilayer perceptron and a CNN module are constructed to learn the feature maps again, effectively improving the prediction ability of the method.
Compared with the prior art, the invention has the following beneficial effects. First, the features of the compound molecule picture are learned: the CNN and the Transformer together extract local and global features of the picture, effectively capturing the information in the compound molecule picture. Second, the k-gram method effectively addresses the shortage of amino acid letter representations, and a Transformer encoder is constructed that effectively extracts the semantic relationships of amino acid words in the protein sequence. Finally, multi-level learning with the multilayer perceptron and the CNN effectively improves the performance of the method; compared with other existing methods, it can significantly improve training speed and prediction accuracy.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the CNN module of step 1 in FIG. 1;
FIG. 3 is a schematic diagram of word segmentation by k-gram.
Detailed Description
The invention is further described with reference to the following figures and examples.
Example 1
Referring to fig. 1-3, a method for predicting compound-protein interaction based on deep learning, the method comprising the steps of:
1) obtaining a characteristic diagram of a compound molecule, specifically comprising:
1.1) generating a compound molecule picture by RDkit software according to the SMILES sequence of the compound molecule;
1.2) inputting the generated compound molecule picture into a CNN module to learn features, wherein the picture size is 256 × 256 × 3, 256 and 256 represent the height and width of the picture, and 3 represents the number of color channels; a CNN module is used to learn local features of the compound picture, consisting of: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max pooling layer; the size of the feature map after the CNN is X_CNN ∈ R^(16×16×256), where 16 and 16 represent the height and width of the compound molecule feature map after CNN learning, and 256 represents the number of channels of the feature map;
1.3) flattening the obtained feature map X_CNN along its channel dimension into input tokens for the Transformer, the new feature map size being X_token ∈ R^(256×256), where 256 = 16 × 16 is the product of the height and width of the compound molecule feature map and 256 is the number of channels of the feature map;
1.4) taking the new feature map X_token as the input of the Transformer encoder, and then learning the semantic relationships in the feature map through several Transformer encoder modules, each of which consists of:
1.4.1) the LayerNormalization layer processes data with different lengths;
1.4.2) the multi-head attention layer converts the input vectors into three different vectors, all of dimension 256: a query vector q, a key vector k, and a value vector v, which are packed into corresponding matrices Q, K, V; the calculation process is: the score of the query matrix Q against the key matrix K is S = Q·K^T; the score normalized for gradient stability is S_n = S / √(d_k); the score is converted into probabilities using the softmax function, P = softmax(S_n), giving the weighted value matrix Attention = P·V; the formula for the whole process is:

Attention(Q, K, V) = softmax(Q·K^T / √(d_k)) · V

where softmax represents the softmax activation function, Q and V represent the matrices packed from the query vectors q and value vectors v, K^T is the transpose of the matrix packed from the key vectors, and d_k represents the current input dimension; then, to address the insensitivity of self-attention to position information, positional encodings of the same dimension are added to the original input embedding:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos represents the position of the word in the sentence, i represents the current dimension of the positional encoding, and d is the dimension of the input vector; the total attention combines several attention heads: MultiHead(Q′, K′, V′) = Concat(head_1, …, head_h)·W^O, where Concat denotes the concatenation operation, W^O is a learnable feature transformation matrix, head_i represents the Attention value of each head, and h represents the total number of heads;
1.4.3) obtaining the final semantic information feature map through a multilayer perceptron layer; the multilayer perceptron layer proceeds as: a fully connected layer, a GELU activation layer, a DropPath layer, a fully connected layer, and a DropPath layer;
1.5) after learning representations with 4 Transformer encoders, the final compound molecule feature map X_C ∈ R^(256×256) is obtained; the size of the feature map is 256 × 256;
2) obtaining a characteristic map of the protein sequence, which specifically comprises:
2.1) protein sequences are in the FASTA format, where the sequences are made up of combinations of amino acids and the amino acids are all represented in a single letter; dividing the protein sequence into words by adopting a k-gram method, wherein the length of each word is 3;
2.2) building a dictionary of the divided words, numbering the words in ascending order of first appearance, replacing each word of the original protein sequence representation with its dictionary serial number, and then embedding the serial number of each word as a representation;
2.3) adding the position information of the original words in the protein sequence, and learning the semantic information among the words in the protein sequence through a plurality of transform encoder modules;
2.4) after learning with 4 Transformer encoder modules, the final protein sequence feature map X_P ∈ R^(256×256) is obtained; the size of the feature map is 256 × 256;
3) further learning the compound molecule feature map X_C and the protein sequence feature map X_P and predicting the final result, specifically comprising:
3.1) passing the feature maps X_C and X_P through a multilayer perceptron for shallow feature learning;
3.2) superposing the two feature maps to obtain the final feature representation X ∈ R^(2×256×256), where the feature map size is 256 × 256 and the number of channels is 2; a CNN module then performs deep feature learning; the CNN module is designed as: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, a max pooling layer, a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max pooling layer;
3.3) flattening the characteristic diagram, and obtaining a final prediction result through a full connection layer, wherein the result is expressed as 0 or 1, 0 represents that no interaction exists between the compound and the protein, and 1 represents that the interaction exists.
Example 2
This example performs the experiment on three datasets of compound-protein interactions (Human, Celegans, Davis). Dataset partitioning: each dataset is divided into a training set, a validation set, and a test set in a ratio of 8:1:1. The specific steps of the method are as follows:
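A standard train/validation/test split of this kind can be sketched as follows (a generic sketch assuming the common 8:1:1 ratio; the patent does not specify shuffling or seeding):

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=42):
    # Shuffle a copy, then cut into train/validation/test by the ratios.
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    total = sum(ratios)
    n_train = len(samples) * ratios[0] // total
    n_val = len(samples) * ratios[1] // total
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

data = list(range(1000))                 # toy stand-in for CPI pairs
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))   # 800 100 100
```

Fixing the seed makes the partition reproducible across runs, which matters when comparing models on the same split.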
1) obtaining a characteristic diagram of a compound molecule, specifically comprising:
1.1) generating a compound molecule picture by RDkit software according to the SMILES sequence of the compound molecule;
1.2) inputting the generated compound molecule picture into a CNN module to learn features, wherein the picture size is H × W × 3, H and W respectively represent the height and width of the picture, and 3 represents the number of color channels; a CNN module is used to learn local features of the compound picture, consisting of: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max pooling layer; the size of the feature map after the CNN is X_CNN ∈ R^(m×n×c), where m and n respectively represent the height and width of the compound molecule feature map after CNN learning, and c represents the number of channels of the feature map;
1.3) flattening the obtained feature map X_CNN along its channel dimension into input tokens for the Transformer, the new feature map size being X_token ∈ R^((m·n)×c), where m·n is the product of the height and width of the compound molecule feature map and c is the number of channels of the feature map;
1.4) taking the new feature map X_token as the input of the Transformer encoder, and then learning the semantic relationships in the feature map through several Transformer encoder modules, each of which consists of:
1.4.1) the LayerNormalization layer processes data with different lengths;
1.4.2) the multi-head attention layer converts the input vectors into three different vectors, all of dimension d: a query vector q, a key vector k, and a value vector v, which are packed into corresponding matrices Q, K, V; the calculation process is: the score of the query matrix Q against the key matrix K is S = Q·K^T; the score normalized for gradient stability is S_n = S / √(d_k); the score is converted into probabilities using the softmax function, P = softmax(S_n), giving the weighted value matrix Attention = P·V; the formula for the whole process is:

Attention(Q, K, V) = softmax(Q·K^T / √(d_k)) · V

where softmax represents the softmax activation function, Q and V represent the matrices packed from the query vectors q and value vectors v, K^T is the transpose of the matrix packed from the key vectors, and d_k represents the current input dimension; then, to address the insensitivity of self-attention to position information, positional encodings of the same dimension are added to the original input embedding:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos represents the position of the word in the sentence, i represents the current dimension of the positional encoding, and d is the dimension of the input vector; the total attention combines several attention heads: MultiHead(Q′, K′, V′) = Concat(head_1, …, head_h)·W^O, where Concat denotes the concatenation operation, W^O is a learnable feature transformation matrix, head_i represents the Attention value of each head, and h represents the total number of heads;
1.4.3) obtaining the final semantic information feature map through a multilayer perceptron layer; the multilayer perceptron layer proceeds as: a fully connected layer, a GELU activation layer, a DropPath layer, a fully connected layer, and a DropPath layer;
1.5) after learning representations through several Transformer encoders, the final compound molecule feature map X_C is obtained;
2) Obtaining a characteristic map of the protein sequence, which specifically comprises:
2.1) protein sequences are in the FASTA format, where the sequences are made up of combinations of amino acids and the amino acids are all represented in a single letter; dividing the protein sequence into words by adopting a k-gram method, wherein the length of each word is k;
2.2) building a dictionary of the divided words, numbering the words in ascending order of first appearance, replacing each word of the original protein sequence representation with its dictionary serial number, and then embedding the serial number of each word as a representation;
2.3) adding the position information of the original words in the protein sequence, and learning the semantic information among the words in the protein sequence through a plurality of transform encoder modules;
2.4) obtaining the final protein sequence feature map X_P after learning through several Transformer encoder modules;
3) further learning the compound molecule feature map X_C and the protein sequence feature map X_P and predicting the final result, specifically comprising:
3.1) passing the feature maps X_C and X_P through a multilayer perceptron for shallow feature learning;
3.2) overlapping the two feature maps, and then using a CNN module to learn deep features; the module design of CNN is: a convolutional layer, a BatchNormalization layer, a LeakyReLU function activation layer, a max pooling layer, a convolutional layer, a BatchNormalization layer, a LeakyReLU function activation layer, a max pooling layer;
3.3) flattening the characteristic diagram, and obtaining a final prediction result through a full connection layer, wherein the result is expressed as 0 or 1, 0 represents that no interaction exists between the compound and the protein, and 1 represents that the interaction exists.
TABLE 1
[results table reproduced as an image in the source; contents not extractable]
TABLE 2
[results table reproduced as an image in the source; contents not extractable]
TABLE 3
[results table reproduced as an image in the source; contents not extractable]
Following the above steps, the deep learning model CAT-CPI created by the invention was experimentally compared with other machine learning and deep learning methods on these datasets. Tables 1 and 2 give the results of comparing CAT-CPI with other machine learning methods on the Human and Celegans datasets, respectively. Table 3 gives the results of comparing CAT-CPI with other deep learning models on the Davis dataset. The model implemented by the invention achieves the best results in all cases.
The above describes the prediction performance of the invention using the Human, Celegans, and Davis datasets as examples; it does not limit the scope of the invention, and various modifications and improvements can be made without departing from that scope.

Claims (1)

1. A method for predicting compound-protein interaction based on deep learning, characterized by: the method comprises the following specific steps:
1) obtaining a characteristic diagram of a compound molecule, specifically comprising:
1.1) generating a compound molecule picture by RDkit software according to the SMILES sequence of the compound molecule;
1.2) inputting the generated compound molecule picture into a CNN module to learn features, wherein the picture size is H × W × 3, H and W respectively represent the height and width of the picture, and 3 represents the number of color channels; the CNN module is used for learning local features of the compound molecule picture, and consists of: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max pooling layer; the size of the feature map after CNN learning is X_CNN ∈ R^(m×n×c), where m and n respectively represent the height and width of the compound molecule feature map after CNN learning, and c represents the number of channels of the feature map;
1.3) flattening the obtained feature map X_CNN along its channel dimension into input tokens for the Transformer, the new feature map size being X_token ∈ R^((m·n)×c), where m·n is the product of the height and width of the compound molecule feature map and c is the number of channels of the feature map;
1.4) taking the new feature map X_token as the input of the Transformer encoder, and then learning the semantic relationships in the feature map through several Transformer encoders, each of which consists of:
1.4.1) the LayerNormalization layer processes data with different lengths;
1.4.2) the multi-head attention layer converts the input vectors into three different vectors, all of dimension d: a query vector q, a key vector k, and a value vector v, which are packed into corresponding matrices Q, K, V; the calculation process is: the score of the query matrix Q against the key matrix K is S = Q·K^T; the score normalized for gradient stability is S_n = S / √(d_k); the score is converted into probabilities using the softmax function, P = softmax(S_n), giving the weighted value matrix Attention = P·V; the formula for the whole process is:

Attention(Q, K, V) = softmax(Q·K^T / √(d_k)) · V

where softmax represents the softmax activation function, Q and V represent the matrices packed from the query vectors q and value vectors v, K^T is the transpose of the matrix packed from the key vectors, and d_k represents the current input dimension; then, to address the insensitivity of self-attention to position information, positional encodings of the same dimension are added to the original input embedding:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos represents the position of the word in the sentence, i represents the current dimension of the positional encoding, and d is the dimension of the input vector; the total attention combines several attention heads: MultiHead(Q′, K′, V′) = Concat(head_1, …, head_h)·W^O, where Concat denotes the concatenation operation, W^O is a learnable feature transformation matrix, head_i represents the Attention value of each head, and h represents the total number of heads;
1.4.3) obtaining the final semantic information feature map through a multilayer perceptron layer; the multilayer perceptron layer proceeds as: a fully connected layer, a GELU activation layer, a DropPath layer, a fully connected layer, and a DropPath layer;
1.5) after representation learning by several Transformer encoders, obtaining the final compound-molecule feature map X_C;
2) obtaining a feature map of the protein sequence, which specifically comprises:
2.1) the protein sequence is in FASTA format, where the sequence consists of combinations of amino acids and each amino acid is represented by a single letter; the protein sequence is divided into words using a k-gram method, each word having length k;
2.2) building a dictionary from the divided words, numbering the words in ascending order of first appearance, replacing the words of the original protein-sequence representation with their dictionary indices, and then embedding the index of each word;
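Steps 2.1 and 2.2 can be sketched in plain Python. Assumptions: an overlapping sliding window for the k-gram split (the claim fixes only the word length k) and 1-based dictionary indices; k = 3 and the sequence are illustrative:

```python
def kgram_words(sequence, k=3):
    # split a FASTA amino-acid string into overlapping words of length k
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_dictionary(words):
    # assign indices in ascending order of first appearance (1-based)
    vocab = {}
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab) + 1
    return vocab

def encode(words, vocab):
    # replace each word with its dictionary index
    return [vocab[w] for w in words]
```

For example, "MKVL" with k = 3 splits into ["MKV", "KVL"], which the dictionary maps to [1, 2] before the embedding step.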
2.3) adding the positional information of the original words in the protein sequence, and learning the semantic information among the words in the protein sequence through several Transformer encoder modules;
2.4) obtaining the final protein-sequence feature map X_P after learning by several Transformer encoder modules;
3) performing further learning on the compound-molecule feature map X_C and the protein-sequence feature map X_P and predicting the final result, which specifically comprises:
3.1) passing the feature maps X_C and X_P through a multilayer perceptron for shallow feature learning;
3.2) stacking the two feature maps, and then learning deep features with a CNN module; the CNN module consists of, in order: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, a max-pooling layer, a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max-pooling layer;
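The CNN module's layer ordering in step 3.2 can be sketched in NumPy for a single channel. This is a didactic stand-in, not the claimed network: kernel sizes, the 'valid' padding, single-channel input, and the pooling stride of 2 are all assumptions:

```python
import numpy as np

def conv2d(x, w):
    # single-channel 'valid' 2-D convolution via explicit loops
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def batch_norm(x, eps=1e-5):
    # normalize to zero mean, unit variance (BatchNormalization, inference-style)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def max_pool(x, s=2):
    # non-overlapping s x s max pooling, truncating ragged edges
    H, W = (x.shape[0] // s) * s, (x.shape[1] // s) * s
    return x[:H, :W].reshape(H // s, s, W // s, s).max(axis=(1, 3))

def cnn_module(x, w1, w2):
    # conv -> BatchNorm -> LeakyReLU -> max-pool, applied twice
    for w in (w1, w2):
        x = max_pool(leaky_relu(batch_norm(conv2d(x, w))))
    return x
```

On a 16x16 input with two 3x3 kernels the spatial size shrinks as 16 → 14 → 7 → 5 → 2.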
3.3) flattening the feature map and obtaining the final prediction result through a fully connected layer, the result being expressed as 0 or 1, where 0 indicates no interaction between the compound and the protein and 1 indicates an interaction.
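The final step 3.3 can be sketched as below. The claim only states a fully connected layer producing a 0/1 result; the sigmoid score thresholded at 0.5, and the weight names W and b, are assumptions of this sketch:

```python
import numpy as np

def predict_interaction(feature_map, W, b, threshold=0.5):
    # flatten the feature map, apply a fully connected layer, and
    # map the score to a binary label: 1 = interaction, 0 = no interaction
    flat = feature_map.reshape(-1)
    score = 1.0 / (1.0 + np.exp(-(flat @ W + b)))  # sigmoid probability
    return int(score >= threshold)
```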
CN202111530765.2A 2021-12-15 2021-12-15 Compound-protein interaction prediction method based on deep learning Pending CN114360638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111530765.2A CN114360638A (en) 2021-12-15 2021-12-15 Compound-protein interaction prediction method based on deep learning


Publications (1)

Publication Number Publication Date
CN114360638A true CN114360638A (en) 2022-04-15

Family

ID=81099344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111530765.2A Pending CN114360638A (en) 2021-12-15 2021-12-15 Compound-protein interaction prediction method based on deep learning

Country Status (1)

Country Link
CN (1) CN114360638A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761250A (en) * 2022-11-21 2023-03-07 北京科技大学 Compound inverse synthesis method and device
CN115761250B (en) * 2022-11-21 2023-10-10 北京科技大学 Compound reverse synthesis method and device
CN116072227A (en) * 2023-03-07 2023-05-05 中国海洋大学 Marine nutrient biosynthesis pathway excavation method, apparatus, device and medium

Similar Documents

Publication Publication Date Title
CN112560631B (en) Knowledge distillation-based pedestrian re-identification method
CN111755078B (en) Drug molecule attribute determination method, device and storage medium
CN114360638A (en) Compound-protein interaction prediction method based on deep learning
CN107341510B (en) Image clustering method based on sparse orthogonality double-image non-negative matrix factorization
CN112801280B (en) One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN112100346A (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN116417093A (en) Drug target interaction prediction method combining transducer and graph neural network
CN115098620A (en) Cross-modal Hash retrieval method for attention similarity migration
CN110837736B (en) Named entity recognition method of Chinese medical record based on word structure
CN112837741A (en) Protein secondary structure prediction method based on cyclic neural network
CN112132186A (en) Multi-label classification method with partial deletion and unknown class labels
CN115545018B (en) Multi-mode multi-granularity entity identification system and entity identification method
CN116758397A (en) Single-mode induced multi-mode pre-training method and system based on deep learning
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN113378938B (en) Edge transform graph neural network-based small sample image classification method and system
CN116884072A (en) Facial expression recognition method based on multi-level and multi-scale attention mechanism
CN114998647B (en) Breast cancer full-size pathological image classification method based on attention multi-instance learning
CN116204673A (en) Large-scale image retrieval hash method focusing on relationship among image blocks
CN111259176B (en) Cross-modal Hash retrieval method based on matrix decomposition and integrated with supervision information
CN109146058B (en) Convolutional neural network with transform invariant capability and consistent expression
CN105989094B (en) Image retrieval method based on middle layer expression of hidden layer semantics
CN113343710A (en) Unsupervised word embedding representation learning method based on Ising model
CN111339782B (en) Sign language translation system and method based on multilevel semantic analysis
Shahadat et al. Cross channel weight sharing for image classification
CN116486101B (en) Image feature matching method based on window attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination