CN114360638A - Compound-protein interaction prediction method based on deep learning - Google Patents


Info

Publication number
CN114360638A
CN114360638A
Authority
CN
China
Prior art keywords
layer
compound
learning
characteristic diagram
cnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111530765.2A
Other languages
Chinese (zh)
Inventor
吴坚
钱莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202111530765.2A
Publication of CN114360638A
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 — ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 — Drug targeting using structural data; Docking or binding prediction
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/048 — Activation functions
    • G06N3/08 — Learning methods
    • G16B40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Medicinal Chemistry (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a compound-protein interaction prediction method based on deep learning, comprising the following steps: Step A, a CNN module learns local features of a compound picture, and a Transformer encoder then learns the semantic relationships of the learned features; Step B, the protein sequence is divided by a k-gram method, and several Transformer encoders then learn semantic relationships; Step C, the learned compound and protein features are learned further; Step D, a prediction result is obtained through a fully connected layer. The invention provides a method with high prediction accuracy that can effectively capture the molecular picture features of a compound.

Description

Compound-protein interaction prediction method based on deep learning
Technical Field
The invention relates to the technical field of drug design and image processing, and in particular to a method for predicting compound-protein interactions by combining a CNN and a Transformer to learn picture characteristics of a compound.
Background
Drug discovery is a field of bioinformatics that aims to discover new molecular structures with desirable pharmacological properties, involving a wide range of scientific disciplines, including biology, chemistry, and pharmacology. The new drug molecules provide benefits for patient treatment by interacting with the target protein.
In recent years, in the context of computer-aided drug design, there has been great interest in developing automated machine learning techniques to find large numbers of realistic, diverse, and novel candidate molecules in a broad and unstructured molecular space. A method that accurately predicts the interaction between a compound and a protein can help prevent disease and reduce drug development costs, which is important for patients and for society.
Disclosure of Invention
The invention aims to provide a method for predicting compound-protein interactions from compound molecule picture information by combining a CNN and a Transformer to learn compound picture characteristics. The method can significantly improve the accuracy of compound-protein interaction prediction.
The specific technical scheme for realizing the purpose of the invention is as follows:
a method for predicting compound-protein interactions based on deep learning, the method comprising the steps of:
1) obtaining a characteristic diagram of a compound molecule, specifically comprising:
1.1) generating a compound molecule picture by RDkit software according to the SMILES sequence of the compound molecule;
1.2) inputting the generated compound molecule picture into a CNN module to learn features, wherein the picture size is H × W × 3, H and W respectively represent the height and width of the picture, and 3 represents the number of color channels; the CNN module is used for learning local features of the compound molecule picture, and consists of: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max pooling layer; the size of the feature map after CNN learning is X_CNN ∈ R^(m×n×c), where m and n respectively represent the height and width of the compound molecule feature map after CNN learning, and c represents the number of channels of the feature map;
1.3) flattening the obtained feature map X_CNN along its channel dimension into input tokens for the Transformer, the new feature map size being X_token ∈ R^((m·n)×c), where m·n is the product of the height and width of the compound molecule feature map and c is the number of channels of the feature map;
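The flattening in step 1.3 can be sketched in a few lines of NumPy (an illustrative sketch with assumed shapes, not the patent's own code): a feature map of shape (m, n, c) is reshaped so that each of the m·n spatial positions becomes one c-dimensional token.

```python
import numpy as np

# Hypothetical CNN output: an m x n spatial grid with c channels.
m, n, c = 16, 16, 256
x_cnn = np.zeros((m, n, c))

# Flatten the spatial dimensions so each of the m*n positions
# becomes one input token of dimension c for the Transformer.
x_token = x_cnn.reshape(m * n, c)
print(x_token.shape)  # (256, 256)
```

With the example sizes used later in the patent (16 × 16 × 256), this yields a 256 × 256 token matrix.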
1.4) taking the new feature map X_token as the input of the Transformer encoder, and then learning the semantic relationships in the feature map through several Transformer encoders, each of which consists of:
1.4.1) the LayerNormalization layer processes data with different lengths;
1.4.2) the multi-head attention layer converts the input vectors into three different vectors, all of dimension d: a query vector q, a key vector k, and a value vector v, which are packed into corresponding matrices Q, K, V; the calculation process is: the score of the query matrix Q against the key matrix K is S = Q·K^T; the score normalized for gradient stability is S_n = S / √(d_k); the score is converted into probabilities using the softmax function, P = softmax(S_n), giving the weighted value matrix Attention = P·V; the formula for the whole process is:

Attention(Q, K, V) = softmax(Q·K^T / √(d_k)) · V

where softmax represents the softmax activation function, Q and V represent the matrices packed from the query vectors q and value vectors v, K^T is the transpose of the matrix packed from the key vectors, and d_k represents the current input dimension; then, to address the insensitivity of self-attention to position information, positional encodings of the same dimension are added to the original input embedding:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos represents the position of the word in the sentence, i represents the current dimension of the positional encoding, and d is the dimension of the input vector; the total attention combines several attention heads: MultiHead(Q′, K′, V′) = Concat(head_1, …, head_h)·W^O, where Concat denotes the concatenation operation, W^O is a learnable feature transformation matrix, head_i represents the Attention value of each head, and h represents the total number of heads;
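The scaled dot-product attention and sinusoidal positional encoding described above can be sketched as follows (a minimal NumPy illustration with assumed token counts and dimensions, not the patent's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # S = Q.K^T, scaled by sqrt(d_k) for gradient stability,
    # converted to probabilities, then used to weight V.
    d_k = Q.shape[-1]
    S_n = Q @ K.T / np.sqrt(d_k)
    P = softmax(S_n, axis=-1)
    return P @ V

def positional_encoding(seq_len, d):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(...).
    pos = np.arange(seq_len)[:, None]
    even_i = np.arange(0, d, 2)[None, :]
    angle = pos / np.power(10000.0, even_i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

tokens = np.random.randn(256, 64)        # 256 tokens, dimension 64 (assumed)
x = tokens + positional_encoding(256, 64)
out = attention(x, x, x)                 # self-attention: Q = K = V = x
print(out.shape)                         # (256, 64)
```

Each output row is a probability-weighted mixture of the value rows; the multi-head case simply runs this h times on learned projections and concatenates the results.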
1.4.3) obtaining the final semantic information feature map through a multilayer perceptron layer; the multilayer perceptron layer proceeds as: a fully connected layer, a GELU activation layer, a DropPath layer, a fully connected layer, and a DropPath layer;
1.5) after learning representations through several Transformer encoders, the final compound molecule feature map X_C is obtained;
2) Obtaining a characteristic map of the protein sequence, which specifically comprises:
2.1) protein sequences are in the FASTA format, where the sequences are made up of combinations of amino acids and the amino acids are all represented in a single letter; dividing the protein sequence into words by adopting a k-gram method, wherein the length of each word is k;
2.2) building a dictionary of the divided words, numbering the words in ascending order of first appearance, replacing each word of the original protein sequence representation with its dictionary serial number, and then embedding the serial number of each word as a representation;
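Steps 2.1-2.2 can be sketched in plain Python (assuming overlapping k-gram words and dictionary indices assigned in order of first appearance; the patent's exact segmentation and numbering details may differ):

```python
def kgram_words(sequence, k=3):
    # Slide a window of length k over the amino-acid sequence,
    # producing one "word" per position (overlapping k-grams).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_dictionary(words):
    # Assign each distinct word a serial number in order of
    # first appearance.
    index = {}
    for w in words:
        if w not in index:
            index[w] = len(index)
    return index

seq = "MKVLAAMKV"                  # toy FASTA-style sequence (hypothetical)
words = kgram_words(seq, k=3)
vocab = build_dictionary(words)
ids = [vocab[w] for w in words]    # sequence of dictionary serial numbers
print(words)  # ['MKV', 'KVL', 'VLA', 'LAA', 'AAM', 'AMK', 'MKV']
print(ids)    # [0, 1, 2, 3, 4, 5, 0]
```

With k = 3 the 20 single-letter amino acids expand to up to 20³ combined words, which is the expressiveness gain the patent attributes to the k-gram division.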
2.3) adding the position information of the original words in the protein sequence, and learning the semantic information among the words in the protein sequence through a plurality of transform encoder modules;
2.4) obtaining the final protein sequence feature map X_P after learning through several Transformer encoder modules;
3) further learning the compound molecule feature map X_C and the protein sequence feature map X_P and predicting the final result, specifically comprising:
3.1) passing the feature maps X_C and X_P through a multilayer perceptron for shallow feature learning;
3.2) overlapping the two feature maps, and then using a CNN module to learn deep features; the module design of CNN is: a convolutional layer, a BatchNormalization layer, a LeakyReLU function activation layer, a max pooling layer, a convolutional layer, a BatchNormalization layer, a LeakyReLU function activation layer, a max pooling layer;
3.3) flattening the characteristic diagram, and obtaining a final prediction result through a full connection layer, wherein the result is expressed as 0 or 1, 0 represents that no interaction exists between the compound and the protein, and 1 represents that the interaction exists.
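The fusion and prediction in steps 3.2-3.3 can be illustrated at the level of shapes (a NumPy sketch under assumed 256 × 256 feature-map sizes; the real model's convolutional and fully connected parameters are not given in this text):

```python
import numpy as np

x_c = np.random.randn(256, 256)   # compound feature map (assumed size)
x_p = np.random.randn(256, 256)   # protein feature map (assumed size)

# Step 3.2: stack the two maps into a 2-channel "image" for the CNN.
fused = np.stack([x_c, x_p], axis=0)
print(fused.shape)  # (2, 256, 256)

# Step 3.3 (shape only): flatten and apply a fully connected layer
# producing two logits; argmax gives the 0/1 interaction label.
flat = fused.reshape(-1)
W = np.random.randn(flat.size, 2) * 0.01   # hypothetical FC weights
logits = flat @ W
label = int(np.argmax(logits))             # 0: no interaction, 1: interaction
print(label in (0, 1))  # True
```

In the patent's pipeline a CNN module sits between the stacking and the flattening; the sketch skips it to show only how the two feature maps are combined and reduced to a binary decision.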
The technical conception of the invention is as follows: first, a compound molecule picture contains rich molecular information; the atomic information and the spatial structure of the molecule are well represented in the picture. A CNN is used to learn local features of the compound molecule picture, effectively capturing the molecular information contained in each pixel, and several Transformer encoders are then stacked to learn the semantic relationships of the feature maps, exploiting the Transformer's ability to learn global features. Second, to address the limited expressiveness of only 20 amino acid types, k-gram division expands the combined amino acid vocabulary to many more types, which benefits representation learning of the protein. Finally, based on the feature maps of the compound and the protein, a multilayer perceptron and a CNN module are constructed to learn the feature maps again, effectively improving the prediction ability of the method.
Compared with the prior art, the invention has the following beneficial effects. First, the features of the compound molecule picture are learned: the CNN and the Transformer together extract local and global features of the picture, effectively capturing the information in the compound molecule picture. Second, the k-gram method effectively addresses the shortage of amino acid letter representations, and a Transformer encoder is constructed that effectively extracts the semantic relationships of amino acid words in the protein sequence. Finally, multi-level learning with the multilayer perceptron and the CNN effectively improves the performance of the method; compared with other existing methods, it can significantly improve training speed and prediction accuracy.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the CNN module of step 1 in FIG. 1;
FIG. 3 is a schematic diagram of word segmentation by k-gram.
Detailed Description
The invention is further described with reference to the following figures and examples.
Example 1
Referring to fig. 1-3, a method for predicting compound-protein interaction based on deep learning, the method comprising the steps of:
1) obtaining a characteristic diagram of a compound molecule, specifically comprising:
1.1) generating a compound molecule picture by RDkit software according to the SMILES sequence of the compound molecule;
1.2) inputting the generated compound molecule picture into a CNN module to learn features, wherein the picture size is 256 × 256 × 3, 256 and 256 represent the height and width of the picture, and 3 represents the number of color channels; a CNN module is used to learn local features of the compound picture, consisting of: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max pooling layer; the size of the feature map after the CNN is X_CNN ∈ R^(16×16×256), where 16 and 16 represent the height and width of the compound molecule feature map after CNN learning, and 256 represents the number of channels of the feature map;
1.3) flattening the obtained feature map X_CNN along its channel dimension into input tokens for the Transformer, the new feature map size being X_token ∈ R^(256×256), where 256 = 16 × 16 is the product of the height and width of the compound molecule feature map and 256 is the number of channels of the feature map;
1.4) taking the new feature map X_token as the input of the Transformer encoder, and then learning the semantic relationships in the feature map through several Transformer encoder modules, each of which consists of:
1.4.1) the LayerNormalization layer processes data with different lengths;
1.4.2) the multi-head attention layer converts the input vectors into three different vectors, all of dimension 256: a query vector q, a key vector k, and a value vector v, which are packed into corresponding matrices Q, K, V; the calculation process is: the score of the query matrix Q against the key matrix K is S = Q·K^T; the score normalized for gradient stability is S_n = S / √(d_k); the score is converted into probabilities using the softmax function, P = softmax(S_n), giving the weighted value matrix Attention = P·V; the formula for the whole process is:

Attention(Q, K, V) = softmax(Q·K^T / √(d_k)) · V

where softmax represents the softmax activation function, Q and V represent the matrices packed from the query vectors q and value vectors v, K^T is the transpose of the matrix packed from the key vectors, and d_k represents the current input dimension; then, to address the insensitivity of self-attention to position information, positional encodings of the same dimension are added to the original input embedding:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos represents the position of the word in the sentence, i represents the current dimension of the positional encoding, and d is the dimension of the input vector; the total attention combines several attention heads: MultiHead(Q′, K′, V′) = Concat(head_1, …, head_h)·W^O, where Concat denotes the concatenation operation, W^O is a learnable feature transformation matrix, head_i represents the Attention value of each head, and h represents the total number of heads;
1.4.3) obtaining the final semantic information feature map through a multilayer perceptron layer; the multilayer perceptron layer proceeds as: a fully connected layer, a GELU activation layer, a DropPath layer, a fully connected layer, and a DropPath layer;
1.5) after learning representations with 4 Transformer encoders, the final compound molecule feature map X_C ∈ R^(256×256) is obtained; the size of the feature map is 256 × 256;
2) obtaining a characteristic map of the protein sequence, which specifically comprises:
2.1) protein sequences are in the FASTA format, where the sequences are made up of combinations of amino acids and the amino acids are all represented in a single letter; dividing the protein sequence into words by adopting a k-gram method, wherein the length of each word is 3;
2.2) building a dictionary of the divided words, numbering the words in ascending order of first appearance, replacing each word of the original protein sequence representation with its dictionary serial number, and then embedding the serial number of each word as a representation;
2.3) adding the position information of the original words in the protein sequence, and learning the semantic information among the words in the protein sequence through a plurality of transform encoder modules;
2.4) after learning with 4 Transformer encoder modules, the final protein sequence feature map X_P ∈ R^(256×256) is obtained; the size of the feature map is 256 × 256;
3) further learning the compound molecule feature map X_C and the protein sequence feature map X_P and predicting the final result, specifically comprising:
3.1) passing the feature maps X_C and X_P through a multilayer perceptron for shallow feature learning;
3.2) superposing the two feature maps to obtain the final feature representation X ∈ R^(2×256×256), where the feature map size is 256 × 256 and the number of channels is 2; a CNN module then performs deep feature learning; the CNN module is designed as: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, a max pooling layer, a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max pooling layer;
3.3) flattening the characteristic diagram, and obtaining a final prediction result through a full connection layer, wherein the result is expressed as 0 or 1, 0 represents that no interaction exists between the compound and the protein, and 1 represents that the interaction exists.
Example 2
This example performs the experiment on three datasets of compound-protein interactions (Human, Celegans, Davis). Dataset partitioning: each dataset is divided into a training set, a validation set, and a test set in a ratio of 8:1:1. The specific steps of the method are as follows:
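A standard train/validation/test split of this kind can be sketched as follows (a generic sketch assuming the common 8:1:1 ratio; the patent does not specify shuffling or seeding):

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=42):
    # Shuffle a copy, then cut into train/validation/test by the ratios.
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    total = sum(ratios)
    n_train = len(samples) * ratios[0] // total
    n_val = len(samples) * ratios[1] // total
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

data = list(range(1000))                 # toy stand-in for CPI pairs
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))   # 800 100 100
```

Fixing the seed makes the partition reproducible across runs, which matters when comparing models on the same split.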
1) obtaining a characteristic diagram of a compound molecule, specifically comprising:
1.1) generating a compound molecule picture by RDkit software according to the SMILES sequence of the compound molecule;
1.2) inputting the generated compound molecule picture into a CNN module to learn features, wherein the picture size is H × W × 3, H and W respectively represent the height and width of the picture, and 3 represents the number of color channels; a CNN module is used to learn local features of the compound picture, consisting of: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max pooling layer; the size of the feature map after the CNN is X_CNN ∈ R^(m×n×c), where m and n respectively represent the height and width of the compound molecule feature map after CNN learning, and c represents the number of channels of the feature map;
1.3) flattening the obtained feature map X_CNN along its channel dimension into input tokens for the Transformer, the new feature map size being X_token ∈ R^((m·n)×c), where m·n is the product of the height and width of the compound molecule feature map and c is the number of channels of the feature map;
1.4) taking the new feature map X_token as the input of the Transformer encoder, and then learning the semantic relationships in the feature map through several Transformer encoder modules, each of which consists of:
1.4.1) the LayerNormalization layer processes data with different lengths;
1.4.2) the multi-head attention layer converts the input vectors into three different vectors, all of dimension d: a query vector q, a key vector k, and a value vector v, which are packed into corresponding matrices Q, K, V; the calculation process is: the score of the query matrix Q against the key matrix K is S = Q·K^T; the score normalized for gradient stability is S_n = S / √(d_k); the score is converted into probabilities using the softmax function, P = softmax(S_n), giving the weighted value matrix Attention = P·V; the formula for the whole process is:

Attention(Q, K, V) = softmax(Q·K^T / √(d_k)) · V

where softmax represents the softmax activation function, Q and V represent the matrices packed from the query vectors q and value vectors v, K^T is the transpose of the matrix packed from the key vectors, and d_k represents the current input dimension; then, to address the insensitivity of self-attention to position information, positional encodings of the same dimension are added to the original input embedding:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos represents the position of the word in the sentence, i represents the current dimension of the positional encoding, and d is the dimension of the input vector; the total attention combines several attention heads: MultiHead(Q′, K′, V′) = Concat(head_1, …, head_h)·W^O, where Concat denotes the concatenation operation, W^O is a learnable feature transformation matrix, head_i represents the Attention value of each head, and h represents the total number of heads;
1.4.3) obtaining the final semantic information feature map through a multilayer perceptron layer; the multilayer perceptron layer proceeds as: a fully connected layer, a GELU activation layer, a DropPath layer, a fully connected layer, and a DropPath layer;
1.5) after learning representations through several Transformer encoders, the final compound molecule feature map X_C is obtained;
2) Obtaining a characteristic map of the protein sequence, which specifically comprises:
2.1) protein sequences are in the FASTA format, where the sequences are made up of combinations of amino acids and the amino acids are all represented in a single letter; dividing the protein sequence into words by adopting a k-gram method, wherein the length of each word is k;
2.2) building a dictionary of the divided words, numbering the words in ascending order of first appearance, replacing each word of the original protein sequence representation with its dictionary serial number, and then embedding the serial number of each word as a representation;
2.3) adding the position information of the original words in the protein sequence, and learning the semantic information among the words in the protein sequence through a plurality of transform encoder modules;
2.4) obtaining the final protein sequence feature map X_P after learning through several Transformer encoder modules;
3) further learning the compound molecule feature map X_C and the protein sequence feature map X_P and predicting the final result, specifically comprising:
3.1) passing the feature maps X_C and X_P through a multilayer perceptron for shallow feature learning;
3.2) overlapping the two feature maps, and then using a CNN module to learn deep features; the module design of CNN is: a convolutional layer, a BatchNormalization layer, a LeakyReLU function activation layer, a max pooling layer, a convolutional layer, a BatchNormalization layer, a LeakyReLU function activation layer, a max pooling layer;
3.3) flattening the characteristic diagram, and obtaining a final prediction result through a full connection layer, wherein the result is expressed as 0 or 1, 0 represents that no interaction exists between the compound and the protein, and 1 represents that the interaction exists.
TABLE 1
[results table reproduced as an image in the source; contents not extractable]
TABLE 2
[results table reproduced as an image in the source; contents not extractable]
TABLE 3
[results table reproduced as an image in the source; contents not extractable]
Following the above steps, the deep learning model CAT-CPI created by the invention was experimentally compared with other machine learning and deep learning methods on these datasets. Tables 1 and 2 give the results of comparing CAT-CPI with other machine learning methods on the Human and Celegans datasets, respectively. Table 3 gives the results of comparing CAT-CPI with other deep learning models on the Davis dataset. The model implemented by the invention achieves the best results in all cases.
The above describes the prediction performance of the invention using the Human, Celegans, and Davis datasets as examples; it does not limit the scope of the invention, and various modifications and improvements can be made without departing from that scope.

Claims (1)

1. A method for predicting compound-protein interaction based on deep learning, characterized by: the method comprises the following specific steps:
1) obtaining a characteristic diagram of a compound molecule, specifically comprising:
1.1) generating a compound molecule picture by RDkit software according to the SMILES sequence of the compound molecule;
1.2) inputting the generated compound molecule picture into a CNN module to learn features, wherein the picture size is H × W × 3, H and W respectively represent the height and width of the picture, and 3 represents the number of color channels; the CNN module is used for learning local features of the compound molecule picture, and consists of: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max pooling layer; the size of the feature map after CNN learning is X_CNN ∈ R^(m×n×c), where m and n respectively represent the height and width of the compound molecule feature map after CNN learning, and c represents the number of channels of the feature map;
1.3) flattening the obtained feature map X_CNN along its channel dimension into input tokens for the Transformer, the new feature map size being X_token ∈ R^((m·n)×c), where m·n is the product of the height and width of the compound molecule feature map and c is the number of channels of the feature map;
1.4) taking the new feature map X_token as the input of the Transformer encoder, and then learning the semantic relationships in the feature map through several Transformer encoders, each of which consists of:
1.4.1) the LayerNormalization layer processes data with different lengths;
1.4.2) the multi-head attention layer converts the input vectors into three different vectors, all of dimension d: a query vector q, a key vector k, and a value vector v, which are packed into corresponding matrices Q, K, V; the calculation process is: the score of the query matrix Q against the key matrix K is S = Q·K^T; the score normalized for gradient stability is S_n = S / √(d_k); the score is converted into probabilities using the softmax function, P = softmax(S_n), giving the weighted value matrix Attention = P·V; the formula for the whole process is:

Attention(Q, K, V) = softmax(Q·K^T / √(d_k)) · V

where softmax represents the softmax activation function, Q and V represent the matrices packed from the query vectors q and value vectors v, K^T is the transpose of the matrix packed from the key vectors, and d_k represents the current input dimension; then, to address the insensitivity of self-attention to position information, positional encodings of the same dimension are added to the original input embedding:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos represents the position of the word in the sentence, i represents the current dimension of the positional encoding, and d is the dimension of the input vector; the total attention combines several attention heads: MultiHead(Q′, K′, V′) = Concat(head_1, …, head_h)·W^O, where Concat denotes the concatenation operation, W^O is a learnable feature transformation matrix, head_i represents the Attention value of each head, and h represents the total number of heads;
1.4.3) obtaining the final semantic information feature map through a multilayer perceptron layer; the multilayer perceptron layer proceeds as: a fully connected layer, a GELU activation layer, a DropPath layer, a fully connected layer, and a DropPath layer;
1.5) after representation learning by several Transformer encoders, obtaining the final compound-molecule feature map X_C;
2) obtaining a feature map of the protein sequence, which specifically comprises:
2.1) the protein sequence is in FASTA format, where the sequence consists of combinations of amino acids and each amino acid is represented by a single letter; the protein sequence is divided into words using a k-gram method, each word having length k;
2.2) building a dictionary from the divided words, numbering the words in ascending order of first appearance, replacing the words of the original protein-sequence representation with their dictionary indices, and then embedding the index of each word;
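Steps 2.1 and 2.2 can be sketched in plain Python. Assumptions: an overlapping sliding window for the k-gram split (the claim fixes only the word length k) and 1-based dictionary indices; k = 3 and the sequence are illustrative:

```python
def kgram_words(sequence, k=3):
    # split a FASTA amino-acid string into overlapping words of length k
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_dictionary(words):
    # assign indices in ascending order of first appearance (1-based)
    vocab = {}
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab) + 1
    return vocab

def encode(words, vocab):
    # replace each word with its dictionary index
    return [vocab[w] for w in words]
```

For example, "MKVL" with k = 3 splits into ["MKV", "KVL"], which the dictionary maps to [1, 2] before the embedding step.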
2.3) adding the positional information of the original words in the protein sequence, and learning the semantic information among the words in the protein sequence through several Transformer encoder modules;
2.4) obtaining the final protein-sequence feature map X_P after learning by several Transformer encoder modules;
3) performing further learning on the compound-molecule feature map X_C and the protein-sequence feature map X_P and predicting the final result, which specifically comprises:
3.1) passing the feature maps X_C and X_P through a multilayer perceptron for shallow feature learning;
3.2) stacking the two feature maps, and then learning deep features with a CNN module; the CNN module consists of, in order: a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, a max-pooling layer, a convolutional layer, a BatchNormalization layer, a LeakyReLU activation layer, and a max-pooling layer;
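The CNN module's layer ordering in step 3.2 can be sketched in NumPy for a single channel. This is a didactic stand-in, not the claimed network: kernel sizes, the 'valid' padding, single-channel input, and the pooling stride of 2 are all assumptions:

```python
import numpy as np

def conv2d(x, w):
    # single-channel 'valid' 2-D convolution via explicit loops
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def batch_norm(x, eps=1e-5):
    # normalize to zero mean, unit variance (BatchNormalization, inference-style)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def max_pool(x, s=2):
    # non-overlapping s x s max pooling, truncating ragged edges
    H, W = (x.shape[0] // s) * s, (x.shape[1] // s) * s
    return x[:H, :W].reshape(H // s, s, W // s, s).max(axis=(1, 3))

def cnn_module(x, w1, w2):
    # conv -> BatchNorm -> LeakyReLU -> max-pool, applied twice
    for w in (w1, w2):
        x = max_pool(leaky_relu(batch_norm(conv2d(x, w))))
    return x
```

On a 16x16 input with two 3x3 kernels the spatial size shrinks as 16 → 14 → 7 → 5 → 2.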
3.3) flattening the feature map and obtaining the final prediction result through a fully connected layer, the result being expressed as 0 or 1, where 0 indicates no interaction between the compound and the protein and 1 indicates an interaction.
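The final step 3.3 can be sketched as below. The claim only states a fully connected layer producing a 0/1 result; the sigmoid score thresholded at 0.5, and the weight names W and b, are assumptions of this sketch:

```python
import numpy as np

def predict_interaction(feature_map, W, b, threshold=0.5):
    # flatten the feature map, apply a fully connected layer, and
    # map the score to a binary label: 1 = interaction, 0 = no interaction
    flat = feature_map.reshape(-1)
    score = 1.0 / (1.0 + np.exp(-(flat @ W + b)))  # sigmoid probability
    return int(score >= threshold)
```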
CN202111530765.2A 2021-12-15 2021-12-15 Compound-protein interaction prediction method based on deep learning Pending CN114360638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111530765.2A CN114360638A (en) 2021-12-15 2021-12-15 Compound-protein interaction prediction method based on deep learning


Publications (1)

Publication Number Publication Date
CN114360638A true CN114360638A (en) 2022-04-15

Family

ID=81099344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111530765.2A Pending CN114360638A (en) 2021-12-15 2021-12-15 Compound-protein interaction prediction method based on deep learning

Country Status (1)

Country Link
CN (1) CN114360638A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761250A (en) * 2022-11-21 2023-03-07 北京科技大学 Compound inverse synthesis method and device
CN115761250B (en) * 2022-11-21 2023-10-10 北京科技大学 Compound reverse synthesis method and device
CN116072227A (en) * 2023-03-07 2023-05-05 中国海洋大学 Marine nutrient biosynthesis pathway excavation method, apparatus, device and medium

Similar Documents

Publication Publication Date Title
CN112560631B (en) Knowledge distillation-based pedestrian re-identification method
CN111755078B (en) Drug molecule attribute determination method, device and storage medium
CN114360638A (en) Compound-protein interaction prediction method based on deep learning
CN107341510B (en) Image clustering method based on sparse orthogonality double-image non-negative matrix factorization
CN112801280B (en) One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN112100346A (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN116417093A (en) Drug target interaction prediction method combining transducer and graph neural network
CN115098620A (en) Cross-modal Hash retrieval method for attention similarity migration
CN110837736B (en) Named entity recognition method of Chinese medical record based on word structure
CN112837741A (en) Protein secondary structure prediction method based on cyclic neural network
CN112132186A (en) Multi-label classification method with partial deletion and unknown class labels
CN115545018B (en) Multi-mode multi-granularity entity identification system and entity identification method
CN116758397A (en) Single-mode induced multi-mode pre-training method and system based on deep learning
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN113378938B (en) Edge transform graph neural network-based small sample image classification method and system
CN116884072A (en) Facial expression recognition method based on multi-level and multi-scale attention mechanism
CN114998647B (en) Breast cancer full-size pathological image classification method based on attention multi-instance learning
CN116204673A (en) Large-scale image retrieval hash method focusing on relationship among image blocks
CN111259176B (en) Cross-modal Hash retrieval method based on matrix decomposition and integrated with supervision information
CN109146058B (en) Convolutional neural network with transform invariant capability and consistent expression
CN105989094B (en) Image retrieval method based on middle layer expression of hidden layer semantics
CN113343710A (en) Unsupervised word embedding representation learning method based on Ising model
CN111339782B (en) Sign language translation system and method based on multilevel semantic analysis
Shahadat et al. Cross channel weight sharing for image classification
CN116486101B (en) Image feature matching method based on window attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination