CN111508556A - Protein contact map prediction method based on single sequence and full convolution neural network - Google Patents

Info

Publication number
CN111508556A
Authority
CN
China
Prior art keywords
protein
sequence
prediction
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911068072.9A
Other languages
Chinese (zh)
Inventor
於东军
陈明猜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology
Priority to CN201911068072.9A
Publication of CN111508556A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medicinal Chemistry (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a protein contact map prediction method based on a single sequence and a fully convolutional neural network. The invention combines single-sequence encoding with a deep fully convolutional neural network, so that the computational model can make predictions from a single protein sequence alone. This removes the dependence of the contact map prediction algorithm on homologous sequences and improves prediction accuracy for low-homology proteins. Because the method uses only a simple, direct encoding of the sequence as its feature, the algorithm is efficient and simple, which improves both the running efficiency of the prediction model and the usability of the prediction method.

Description

Protein contact map prediction method based on single sequence and full convolution neural network
Technical Field
The invention relates to the field of bioinformatics prediction of protein structures, and in particular to a protein contact map prediction method based on a single sequence and a deep fully convolutional neural network.
Background
The protein contact map contains important spatial geometric constraint information about a protein and helps solve structure-related bioinformatics problems such as 3D structure modeling and drug design. Determining protein contact maps with conventional wet-lab experiments is very expensive and time-consuming. An automated computational method for protein contact map prediction is therefore urgently needed, and many such methods have been developed in recent years.
Most protein contact prediction algorithms rest on the observation that spatially close residues tend to mutate together, and that this evolutionary information can be obtained from homologous sequences. Several typical statistical algorithms analyze these correlated mutations further, including coupling analysis, inverse covariance estimation based on direct coupling analysis, and pseudo-likelihood maximization based on direct coupling analysis. Indirect coupling is a key problem that these algorithms attempt to solve.
However, about one third of the 15000 protein families have no homolog of known structure, and over 90% of the sequences have no known homologous sequences. In such cases the accuracy of these advanced methods is greatly reduced, and accuracy on low-homology sequences remains low. The contact map prediction may then even harm the structure prediction instead of assisting it.
Disclosure of Invention
To address the low prediction accuracy for low-homology sequences in protein contact map prediction, the invention aims to provide a high-accuracy protein contact map prediction method based on a single sequence and a deep fully convolutional neural network.
The technical solution for realizing the purpose of the invention is as follows:
Compared with the prior art, the invention has the following remarkable advantages:
1. Improved prediction accuracy for low-homology proteins: single-sequence encoding is combined with a deep fully convolutional neural network, so that the computational model can predict from a single protein sequence alone and the contact map prediction algorithm no longer depends on homologous sequences;
2. Improved running efficiency of the prediction model and usability of the prediction method: other contact map prediction algorithms mostly rely on intermediate results of other algorithms to complete the prediction, or perform complex and time-consuming matrix computations to obtain input features. The present method avoids these drawbacks and uses only a simple, direct encoding of the sequence as its feature, so the algorithm is efficient and simple.
Drawings
FIG. 1 is a schematic flow chart of the protein contact map prediction method.
Detailed Description
The invention relates to a protein contact map prediction method based on a single sequence and a deep fully convolutional neural network, which comprises the following steps:
The first stage: training the model
Step 1.1: All known protein structures and sequences in the PDB database (before April 2019) are screened. Entries whose records are incomplete or contain inconsistent sequence information are removed, as are protein sequences longer than 400 or shorter than 40 residues. CD-HIT is used to remove redundancy so that the sequence similarity of any two protein sequences is below 70%. The resulting data set comprises 25394 proteins in total. The data set is then sorted by PDB ID: the first 1200 proteins form the test set, the next 1000 proteins form the validation set used for parameter tuning and early stopping, and the remaining 23194 proteins form the training set. Since the PDB ID is independent of the nature and type of the protein sequence itself, this procedure is effectively a random selection.
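The length filter and the ID-sorted split of Step 1.1 can be sketched as follows. This is a minimal illustration, not the applicants' code: the function names and the dictionary layout (PDB ID mapped to sequence) are hypothetical, and the CD-HIT redundancy removal is an external step that is assumed to have been run already.

```python
from typing import Dict, List, Tuple

MIN_LEN, MAX_LEN = 40, 400      # length bounds stated in Step 1.1
N_TEST, N_VALID = 1200, 1000    # split sizes stated in Step 1.1


def filter_by_length(seqs: Dict[str, str]) -> Dict[str, str]:
    """Keep only sequences whose length lies in [MIN_LEN, MAX_LEN]."""
    return {pid: s for pid, s in seqs.items() if MIN_LEN <= len(s) <= MAX_LEN}


def split_by_sorted_id(
    seqs: Dict[str, str], n_test: int = N_TEST, n_valid: int = N_VALID
) -> Tuple[List[str], List[str], List[str]]:
    """Sort by identifier, then slice off test, validation, and training IDs in that order."""
    ids = sorted(seqs)
    test = ids[:n_test]
    valid = ids[n_test:n_test + n_valid]
    train = ids[n_test + n_valid:]
    return train, valid, test
```

With the 25394 proteins reported above, the default sizes would leave 25394 - 1200 - 1000 = 23194 identifiers in the training list, which matches the figure given in the text.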
Step 1.2: Encoding the protein sequence. Each residue pair is one-hot encoded into a 441-dimensional feature vector: the position in the feature vector that corresponds to the residue pair is set to 1, and all other positions are set to 0.
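A possible NumPy sketch of this residue-pair encoding is given below. The 21-symbol alphabet (20 standard amino acids plus `X` for anything else) and the pair numbering `21 * a + b` are assumptions chosen so that 21 × 21 = 441 matches the stated dimension; the source does not specify either.

```python
import numpy as np

# Assumption: 20 standard amino acids plus 'X' as a catch-all gives 21 symbols,
# hence 21 * 21 = 441 residue-pair classes.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
N_SYMBOLS = len(ALPHABET)


def encode_pairs(sequence: str) -> np.ndarray:
    """Return an (L, L, 441) tensor holding one one-hot vector per residue pair (i, j)."""
    idx = np.array([INDEX.get(aa, INDEX["X"]) for aa in sequence.upper()])
    # 0-based pair number for every (i, j); this numbering is an assumption.
    pair_id = idx[:, None] * N_SYMBOLS + idx[None, :]
    features = np.zeros((len(idx), len(idx), N_SYMBOLS ** 2), dtype=np.float32)
    rows, cols = np.indices(pair_id.shape)
    features[rows, cols, pair_id] = 1.0
    return features
```

For a sequence of length L the result has L × L × 441 entries, exactly one of which is 1 for each residue pair.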
Step 1.3: Building the network model. A fully convolutional neural network built from residual blocks and convolutional layers is designed. All intermediate convolutional layers have 96 convolution kernels of size 3 × 3 and use the exponential linear unit (ELU) activation function; the final output layer has 1 convolution kernel. All convolutional layers use 'same' padding, so the network outputs an L×L prediction matrix in which each element represents the likelihood that the residue pair at the corresponding position is in contact. A total of 30 residual blocks are used. During training, the loss is computed with a cross-entropy loss function.
In addition, to prevent overfitting, L2 regularization with a coefficient of 0.05 is applied to all weights in the network.
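The following NumPy sketch illustrates why a network of 3 × 3 'same'-padded convolutions returns an L×L map for any sequence length L. It is a forward pass only, with untrained random weights. The internal layout of a residual block (two convolutions with a skip connection), the 3 × 3 size of the output kernel, the channels-first input layout, and all function names are assumptions; the sigmoid on the output follows claim 4. The described model would use a width of 96 and 30 blocks, and the small values in the test are only for speed.

```python
import numpy as np


def elu(x, alpha=1.0):
    """Exponential linear unit; np.minimum avoids overflow in exp for large x."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


def conv3x3_same(x, w, b):
    """3x3 convolution with zero 'same' padding.
    x: (C_in, L, L), w: (C_out, C_in, 3, 3), b: (C_out,) -> (C_out, L, L)."""
    L = x.shape[1]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], L, L)) + b[:, None, None]
    for di in range(3):
        for dj in range(3):
            patch = xp[:, di:di + L, dj:dj + L]
            out += np.einsum("oc,cij->oij", w[:, :, di, dj], patch)
    return out


def residual_block(x, params):
    """Assumed block layout: conv -> ELU -> conv, added to the input, then ELU."""
    (w1, b1), (w2, b2) = params
    h = elu(conv3x3_same(x, w1, b1))
    h = conv3x3_same(h, w2, b2)
    return elu(x + h)


def init_net(c_in, width, n_blocks, rng):
    """Random (untrained) weights for a stem, n_blocks residual blocks, and a 1-kernel head."""
    def conv(ci, co):
        return rng.normal(0.0, 0.05, (co, ci, 3, 3)), np.zeros(co)
    return {
        "stem": conv(c_in, width),
        "blocks": [(conv(width, width), conv(width, width)) for _ in range(n_blocks)],
        "head": conv(width, 1),
    }


def predict_contact_map(features, net):
    """features: (C_in, L, L) -> (L, L) contact probabilities via a final sigmoid."""
    h = elu(conv3x3_same(features, *net["stem"]))
    for block in net["blocks"]:
        h = residual_block(h, block)
    logits = conv3x3_same(h, *net["head"])[0]
    return 1.0 / (1.0 + np.exp(-logits))
```

Because no layer changes the spatial size and there is no dense layer tied to a fixed L, the same weights apply to sequences of different lengths.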
The second stage: based on the protein sequence information input by the user, the protein contact map is predicted with the trained prediction model, as follows:
Step 2.1: The target protein sequence is received from the web page, encoded on the server, and passed to the pre-trained model to complete the prediction. The prediction is then returned to the web page.
The invention is further described below with reference to the accompanying drawings.
Fig. 1 shows a system structure diagram of the prediction method of the present invention. As shown in Fig. 1, according to an embodiment of the present invention,
a protein contact map prediction method based on a single sequence and a deep fully convolutional neural network comprises the following steps:
First, the data set is screened and constructed, the protein sequences are one-hot encoded to build the input features, and the distances between residues are computed from the atomic coordinates of the protein to build the training labels. Second, the model parameters are optimized by gradient descent using the Adaptive Moment Estimation (Adam) algorithm, and the number of gradient descent iterations is controlled by the accuracy of the model on the validation set. Finally, after the web server is deployed, the user's protein input is received, encoded, and fed to the prediction model, and the model's prediction is returned to the page.
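The label construction mentioned above, computing inter-residue distances from atomic coordinates, might look like the sketch below. The 8 Å cutoff and the use of one representative atom per residue are assumptions made for illustration; the source does not state which atom or which distance threshold defines a contact.

```python
import numpy as np

CONTACT_THRESHOLD = 8.0  # angstroms; assumed value, not stated in the source


def contact_map(coords: np.ndarray, threshold: float = CONTACT_THRESHOLD) -> np.ndarray:
    """coords: (L, 3) array with one representative atom per residue.
    Returns an (L, L) matrix of 0/1 labels, 1 where the distance is below the threshold."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist < threshold).astype(np.float32)
```

The resulting matrix is symmetric with ones on the diagonal, and it has the same L×L shape as the network output it is compared against.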
The foregoing process will be described in more detail with reference to the accompanying drawings.
The first stage: training the model
Step 1.1: All known protein structures and sequences in the PDB database (before April 2019) are screened. Entries whose records are incomplete or contain inconsistent sequence information are removed, as are protein sequences longer than 400 or shorter than 40 residues. CD-HIT is used to remove redundancy so that the sequence similarity of any two protein sequences is below 70%. The resulting data set comprises 25394 proteins in total. We then randomly selected 1200 proteins as the test set and 1000 proteins as the validation set used for parameter tuning and early stopping; the remaining 23194 proteins were used as the training set.
$$\text{similarity} = \frac{N_{\text{identical}}}{L_{\text{seq}}} \qquad (1)$$
where CD-HIT is used to remove redundancy, and the sequence similarity equals the number of identical amino acids, $N_{\text{identical}}$, divided by the length of the sequence, $L_{\text{seq}}$.
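As a small worked illustration of this ratio, the sketch below counts identical positions in two already-aligned sequences. The function name is hypothetical, and taking the length of the shorter ungapped sequence as the denominator is an assumption; the text only says "the length of the sequence". CD-HIT itself computes identities internally during clustering, so this is not a substitute for running the tool.

```python
def sequence_identity(aligned_a: str, aligned_b: str) -> float:
    """Fraction of identical residues between two aligned sequences ('-' marks a gap).
    Denominator: length of the shorter ungapped sequence (an assumption)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return matches / shorter
```

Under the 70% criterion of Step 1.1, two sequences with three identical residues out of four (0.75) would count as redundant, and only one of them would be kept.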
Step 1.2: Encoding the protein sequence. Each residue pair is one-hot encoded into a 441-dimensional feature vector: the position in the feature vector that corresponds to the residue pair is set to 1, and all other positions are set to 0.
$$V_i = \begin{cases} 1, & i = n \\ 0, & i \neq n \end{cases} \qquad i = 1, \dots, 441 \qquad (2)$$
where n (1 ≤ n ≤ 441) is the residue pair number, which is converted into the feature vector V, and i is the index of an element of V.
Step 1.3: Building the network model. A fully convolutional neural network built from residual blocks and convolutional layers is designed. All intermediate convolutional layers have 96 convolution kernels of size 3 × 3 and use the exponential linear unit (ELU) activation function; the final output layer has 1 convolution kernel. All convolutional layers use 'same' padding, so the network outputs an L×L prediction matrix in which each element represents the likelihood that the residue pair at the corresponding position is in contact. A total of 30 residual blocks are used. During training, the loss is computed with the cross-entropy loss function of Equation (4).
$$R(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases} \qquad (3)$$
where R(x) is the ELU function, x is its argument, and α is a constant.
$$J = -\bigl(y \log(p) + (1 - y)\log(1 - p)\bigr) \qquad (4)$$
[Formula image BDA0002260031460000045]
where p is the predicted contact probability of a residue pair, y is the label, J is the cross-entropy loss function, x is the argument, and α is a constant.
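Equation (4) can be evaluated over a whole predicted contact map as in the sketch below. Averaging over all residue pairs and clipping probabilities away from 0 and 1 are assumptions added for numerical convenience; the source gives only the per-pair loss.

```python
import numpy as np


def binary_cross_entropy(p: np.ndarray, y: np.ndarray, eps: float = 1e-7) -> float:
    """Mean of J = -(y log p + (1 - y) log(1 - p)) over all residue pairs."""
    p = np.clip(p, eps, 1.0 - eps)  # keeps log() finite when p is exactly 0 or 1
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))
```

A prediction of 0.5 for every pair yields log 2 regardless of the labels, which is a convenient sanity check for an untrained model.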
The model is optimized with the Adaptive Moment Estimation (Adam) technique. Adam can be viewed as a combination of RMSprop and stochastic gradient descent (SGD) with momentum. Like RMSprop, it scales the learning rate using the squared gradients, and like SGD with momentum, it replaces the gradient itself with a moving average of the gradient. Adam is an adaptive learning rate method that computes an individual learning rate for each parameter, using estimates of the first and second moments of the gradient to adjust the learning rate of each weight of the neural network.
$$m_n = E[X^n]$$
where $m_n$ is the n-th moment and X is a random variable.
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
where $m_t$ and $v_t$ are the moving averages, $g_t$ is the current gradient, and $\beta_1$ and $\beta_2$ are hyperparameters whose default values are 0.9 and 0.999, respectively. The moving-average vectors are initialized to zero at the first iteration.
Furthermore, to prevent overfitting, we applied L2 regularization with a coefficient of 0.05 to all weights in the network.
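One Adam update with an L2 penalty can be sketched as follows. The β defaults (0.9, 0.999) and the coefficient 0.05 come from the text; the learning rate, the epsilon term, the bias correction (standard in Adam but not spelled out here), and the convention that the penalty λ‖w‖² contributes 2λw to the gradient are assumptions.

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, l2=0.05):
    """One Adam update of weights w at iteration t (t starts at 1).
    grad is the gradient of the data loss; the L2 term is added here."""
    g = grad + 2.0 * l2 * w                  # assumed penalty form: l2 * ||w||^2
    m = beta1 * m + (1.0 - beta1) * g        # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * g * g    # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

A training loop would start with `m` and `v` as zero arrays of the same shape as `w` and call `adam_step` once per iteration, stopping when validation accuracy no longer improves.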
An exemplary procedure for optimizing the model parameters with the gradient descent algorithm is as follows:
[Algorithm listing: images BDA0002260031460000044 and BDA0002260031460000051]
The second stage: based on the protein sequence information input by the user, the protein contact map is predicted with the trained prediction model, as follows:
Step 2.1: The target protein sequence is received from the web page, encoded on the server, and passed to the pre-trained model to complete the prediction. The prediction is then returned to the web page.
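The server-side handling in Step 2.1 could be sketched as below. The JSON request and response layout, the accepted alphabet, and the `predict` callable standing in for the pre-trained model are all hypothetical; the source does not describe the web interface in any detail.

```python
import json
from typing import Callable, List

# Assumed set of accepted symbols: 20 standard amino acids plus 'X'.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def handle_request(body: str, predict: Callable[[str], List[List[float]]]) -> str:
    """Validate the submitted sequence, run the supplied model, and return a JSON reply."""
    payload = json.loads(body)
    # Strip whitespace and normalize case before validation.
    seq = "".join(str(payload.get("sequence", "")).split()).upper()
    if not seq or set(seq) - VALID_RESIDUES:
        return json.dumps({"error": "invalid protein sequence"})
    return json.dumps({"length": len(seq), "contact_map": predict(seq)})
```

Passing the model in as a callable keeps the request handling independent of how the trained network is loaded and makes the handler easy to exercise with a stub.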
In summary, compared with existing prediction methods, the method has the following significant advantage: it addresses the low prediction accuracy for low-homology protein sequences and thereby improves the prediction accuracy of the model.
The advantages of the invention are as follows. First, other contact map prediction algorithms rely on intermediate results of other algorithms to complete the prediction, or perform complex and time-consuming matrix computations to obtain input features; the present method avoids these drawbacks and uses only a simple, direct encoding of the sequence as its feature, so the algorithm is efficient and simple. Second, other contact map prediction algorithms reach high accuracy only when the target protein has many homologous sequences, whereas the present method builds its features from a single sequence, so the model accuracy does not depend on sequence homology. Third, because a fully convolutional neural network is used, protein sequences of variable length can be handled and the contact map is predicted directly.

Claims (4)

1. A protein contact map prediction method based on a single sequence and a fully convolutional neural network, comprising the following steps:
The first stage: training the model
Step 1.1: constructing a data set for model testing and training; screening all sequences in the PDB database; using CD-HIT to remove redundancy from the data set, wherein all sequences are clustered according to the parameter settings, the longest sequence in each cluster is output as the representative sequence, and the names of all sequences in each cluster are given for similarity analysis; and randomly dividing the data into a training set, a test set, and a validation set;
Step 1.2: encoding the protein; first converting the one-dimensional protein sequence into residue pairs, each residue pair being one-hot encoded; and calculating the distances between residues from the PDB structure information to obtain the protein contact map;
Step 1.3: designing the network model structure; designing a fully convolutional neural network using residual blocks and convolutional layers; training the network with the Adaptive Moment Estimation (Adam) technique, and tuning parameters according to the accuracy on the validation set;
The second stage: based on the protein sequence information input by the user, predicting the protein contact map with the trained prediction model, as follows:
receiving the target protein sequence information from the web page, encoding the sequence on the server, completing the prediction with the pre-trained model, and returning the prediction to the web page.
2. The protein contact map prediction method based on a single sequence and a fully convolutional neural network of claim 1, wherein the input protein sequence is one-hot encoded in units of residue pairs to form the input of the model:
$$V_i = \begin{cases} 1, & i = n \\ 0, & i \neq n \end{cases} \qquad i = 1, \dots, 441$$
where n (1 ≤ n ≤ 441) is the residue pair number, which is converted into the feature vector V, and i is the index of an element of V.
3. The protein contact map prediction method based on a single sequence and a fully convolutional neural network of claim 1, wherein a data set for training the model is constructed by screening all known protein structures and sequences in the PDB database and filtering them with CD-HIT so that no two protein sequences are homologous; incomplete entries are removed; protein sequences that are too long or too short are removed; and a small portion of the samples in the data set is then randomly taken as the test set and the validation set, with the remaining samples used as the training set.
4. The protein contact map prediction method based on a single sequence and a fully convolutional neural network of claim 1, wherein the network model is built from residual blocks connected in series; all convolutional layers in the residual blocks use the exponential linear unit activation function, and the last convolutional layer uses the Sigmoid function; the weights of the convolutional layers are updated iteratively with the Adaptive Moment Estimation (Adam) technique, and the validation set is used to tune parameters and to control the number of training iterations.
CN201911068072.9A 2019-11-04 2019-11-04 Protein contact map prediction method based on single sequence and full convolution neural network Withdrawn CN111508556A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911068072.9A CN111508556A (en) 2019-11-04 2019-11-04 Protein contact map prediction method based on single sequence and full convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911068072.9A CN111508556A (en) 2019-11-04 2019-11-04 Protein contact map prediction method based on single sequence and full convolution neural network

Publications (1)

Publication Number Publication Date
CN111508556A true CN111508556A (en) 2020-08-07

Family

ID=71863779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911068072.9A Withdrawn CN111508556A (en) 2019-11-04 2019-11-04 Protein contact map prediction method based on single sequence and full convolution neural network

Country Status (1)

Country Link
CN (1) CN111508556A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185466A (en) * 2020-09-24 2021-01-05 中国科学院计算技术研究所 Method for constructing protein structure by directly utilizing protein multi-sequence association information
CN112185466B (en) * 2020-09-24 2023-05-23 中国科学院计算技术研究所 Method for constructing protein structure by directly utilizing protein multi-sequence association information

Similar Documents

Publication Publication Date Title
CN109142171B (en) Urban PM10 concentration prediction method based on feature expansion and fusing with neural network
CN113593631B (en) Method and system for predicting protein-polypeptide binding site
CN107622182B (en) Method and system for predicting local structural features of protein
CN112233723B (en) Protein structure prediction method and system based on deep learning
CN110599766B (en) Road traffic jam propagation prediction method based on SAE-LSTM-SAD
WO2020048389A1 (en) Method for compressing neural network model, device, and computer apparatus
CN108009674A (en) Air PM2.5 concentration prediction methods based on CNN and LSTM fused neural networks
CN107967517A (en) The method and apparatus quantified for neutral net
CN111898689A (en) Image classification method based on neural network architecture search
CN112084877B (en) NSGA-NET-based remote sensing image recognition method
CN112085161B (en) Graph neural network method based on random information transmission
CN109947652A (en) A kind of improvement sequence learning method of software defect prediction
CN112364119B (en) Ocean buoy trajectory prediction method based on LSTM coding and decoding model
CN110929798A (en) Image classification method and medium based on structure optimization sparse convolution neural network
CN113033786B (en) Fault diagnosis model construction method and device based on time convolution network
CN112818690A (en) Semantic recognition method and device combined with knowledge graph entity information and related equipment
CN113936738A (en) RNA-protein binding site prediction method based on deep convolutional neural network
CN112447265A (en) Lysine acetylation site prediction method based on modular dense convolutional network
CN115732034A (en) Identification method and system of spatial transcriptome cell expression pattern
CN117237733A (en) Breast cancer full-slice image classification method combining self-supervision and weak supervision learning
CN114792126A (en) Convolutional neural network design method based on genetic algorithm
CN113257357B (en) Protein residue contact map prediction method
CN111508556A (en) Protein contact map prediction method based on single sequence and full convolution neural network
CN114241267A (en) Structural entropy sampling-based multi-target architecture search osteoporosis image identification method
CN115908909A (en) Evolutionary neural architecture searching method and system based on Bayes convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200807