CN109637580B - Protein amino acid association matrix prediction method - Google Patents

Protein amino acid association matrix prediction method

Info

Publication number
CN109637580B
CN109637580B
Authority
CN
China
Prior art keywords
sequence
features
amino acid
prediction
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811484434.8A
Other languages
Chinese (zh)
Other versions
CN109637580A (en
Inventor
沈红斌
徐佳燕
冯世豪
杨静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201811484434.8A priority Critical patent/CN109637580B/en
Publication of CN109637580A publication Critical patent/CN109637580A/en
Application granted granted Critical
Publication of CN109637580B publication Critical patent/CN109637580B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

A protein amino acid association matrix prediction method comprises the following steps: S1, constructing a training data set for protein amino acid association diagram prediction; S2, extracting 6 features from each protein amino acid sequence in the training set, combining the 6 features of each sequence, and generating a label file and a weight mask matrix; S3, training an improved residual network using the combined features, the label files and the weight mask matrices; S4, retrieving a list of homologous sequences for the test sequence, and obtaining the combined features, label files and weight mask matrices of the homologous sequences; S5, further training the model obtained in step S3 using the combined features, label files and weight mask matrices of the homologous sequences obtained in step S4; S6, obtaining the combined features of the test amino acid sequence and inputting them into the prediction model obtained in step S5 for prediction.

Description

Protein amino acid association matrix prediction method
Technical Field
The invention relates to the technical field of protein biology, in particular to a protein amino acid association matrix prediction method based on deep learning.
Background
The spatial structure of protein molecules is of particular importance for understanding the function of proteins. In order to understand the mechanism of action of a protein at the molecular level, it is often necessary to determine its three-dimensional structure. Direct determination of a protein's structure by biological experiments, such as X-ray crystallography or nuclear magnetic resonance, typically requires significant effort. Therefore, computational methods that provide additional information for solving the three-dimensional structure of proteins have become important.
Among these, the amino acid associations (contacts) of a protein are considered to play an important role in solving protein structures; accurate amino acid associations alone can already yield an acceptable three-dimensional structural model of a protein. In particular, long-range amino acid interactions (those whose two interacting residues are separated by 24 or more positions in the sequence) are the most useful for solving protein structures, and their prediction is also more difficult, requiring models capable of capturing relationships between distant residues.
Disclosure of Invention
The invention aims to provide a protein amino acid association matrix prediction method based on deep learning, in order to alleviate the high cost of experimental protein structure determination.
According to one embodiment of the invention, a protein amino acid association matrix prediction method comprises the following steps:
S1, constructing a training data set for protein amino acid association diagram prediction;
S2, extracting 6 features from each protein amino acid sequence in the training set, combining the 6 features of each sequence, and generating a label file and a weight mask matrix;
S3, training an improved residual network using the combined features, the label files and the weight mask matrices;
S4, retrieving a list of homologous sequences for the test sequence, and obtaining the combined features, label files and weight mask matrices of the homologous sequences;
S5, further training the model obtained in step S3 using the combined features, label files and weight mask matrices of the homologous sequences obtained in step S4;
S6, obtaining the combined features of the test amino acid sequence and inputting them into the prediction model obtained in step S5 for prediction.
The protein amino acid association matrix prediction method is based on a deep learning model and greatly improves the accuracy of protein amino acid association matrix prediction.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 is a flow chart of prediction of protein amino acid association diagram based on deep learning in an embodiment of the invention.
FIG. 2 is a flow chart of sequence feature extraction used in an embodiment of the present invention.
FIG. 3 is a block diagram of a deep learning network used in an embodiment of the present invention.
Detailed Description
According to one or more embodiments, as shown in fig. 1, a protein amino acid association diagram prediction method based on deep learning includes the following steps:
S1, constructing a training data set for protein amino acid association diagram prediction;
S2, extracting 6 features from each protein amino acid sequence in the training set, combining the 6 features, and generating a label file and a weight mask matrix;
S3, training an improved residual network using the combined features, the label files and the weight mask matrices;
S4, retrieving a list of homologous sequences for the test amino acid sequence, and obtaining the combined features, label files and weight mask matrices of the homologous sequences;
S5, further training the model obtained in step S3 using the combined features, label files and weight mask matrices of the homologous sequences obtained in step S4;
S6, obtaining the combined features of the test sequence and inputting them into the prediction model obtained in step S5 for prediction.
According to one or more embodiments, step S1 further comprises the steps of:
s11, screening a protein amino acid chain list used for final training from all PDB databases according to a certain condition by using a PISCES tool, wherein 11217 sequences are found in the screening result. The screening conditions are shown in Table 1. Wherein PDB is an abbreviation for Protein Data Bank, a database of protein structures. Wherein the PISCES tool address is http:// dunbrack.fccc.edu/PISCES.php and the PDB database address ishttps://www.rcsb.org/
Table 1 shows the screening conditions used with the PISCES tool to construct the training data set for protein amino acid association diagram prediction.
(Table 1 appears as an image in the original publication; its contents are not reproducible as text.)
TABLE 1
S12, according to the ids and chain identifiers of the 11217 amino acid sequences, the fasta-format files of the 11217 sequences are downloaded from the PDB as input for subsequent feature generation, and the PDB files of the 11217 samples are downloaded as input for subsequent generation of the label files and weight mask files.
According to one or more embodiments, step S2 further comprises the steps of:
s21, generating sequence features. The specific operation is as follows: fasta format file of training set sequence is input, and the number of nr is calculated by using psiblast software in BLAST+software packageAlignment on a database (nr is an abbreviation for non-redundant, a database of protein sequences) generates PSSM (PSSM is an abbreviation for position specific score matrix) features and an alignment file in json format. The commands that may be used are blast+/bin/blast-query test.fasta-db nr-out_ascii_pssm test.matrix-save_pssm_after_last_round-out test.blast-value 0.001-max_target_seqs 10000-num_item 3. On the other hand, a fasta format file of the training set sequence is input, and a text format secondary structure and solution accessibility features are generated through SCRATCH software. The commands that can be used are SCRATCH-1D_1.1/bin/run_SCRATCH-1D_predictors.sh test.fasta test 4. And then the obtained comparison file in json format is automatically processed into the MSA file in text format. Wherein MSA is an abbreviation for Multi-sequence alignment, referring to multiple sequence alignment. The MSA file is then used as input to generate PSICOV, evfold and CCMPred features by PSICOV software, freecontact software and CCMPred software, respectively. The command in which PSICOV characteristics may be generated is PSICOV-p-r 0.001test.MSA>the command that may be used to generate the Evfold feature is freecontact- -parprof Evfold-i flat-o Evfold<'test.MSA'>the command that may be used to generate the CCMpred feature is CCMpred test.msa test.mat. As shown in fig. 2. BLAST+ software suite addressesftp://ftp.ncbi.nlm.nih.gov/blast/ executables/blast+/The address of the nr database is ftp:// ftp. Ncbi.nlm.nih.gov/blast/db/. The address of the SCRATCH software is http:// SCRATCH. Proteomics.
S22, splicing sequence features. The dimensions of the PSSM feature, the secondary structure feature and the solvent accessibility feature are L x 20, L x 3 and L x 2 respectively, where L is the sequence length. These three features of a training sequence are concatenated along the second dimension to obtain an L x 25-dimensional feature, which is then converted into an L x L x 50-dimensional feature by generating a residue-pair feature through concatenating the one-dimensional features of the two residues in each pair. The resulting two-dimensional feature is then concatenated along the third dimension with the PSICOV, Evfold and CCMPred features, each of dimension L x L x 1, producing the final L x L x 53-dimensional feature. During feature generation, some sequences fail to produce output; the resulting data set therefore contains 10591 sequence samples in total;
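The feature-splicing step can be sketched with numpy as follows. This is an illustrative reconstruction rather than the patented implementation; the array names and the random placeholder data are assumptions.

```python
import numpy as np

L = 5  # example sequence length (placeholder value)
f1d = np.random.rand(L, 25)  # PSSM (20) + secondary structure (3) + accessibility (2), concatenated

# Residue-pair feature: for pair (i, j), concatenate the 1-D features of residues i and j.
pair = np.concatenate(
    [np.repeat(f1d[:, None, :], L, axis=1),   # feature of residue i, tiled along j
     np.repeat(f1d[None, :, :], L, axis=0)],  # feature of residue j, tiled along i
    axis=-1)                                  # -> (L, L, 50)

# Stack the three L x L co-evolution maps (PSICOV, Evfold, CCMpred) as extra channels.
coevo = np.random.rand(L, L, 3)
features = np.concatenate([pair, coevo], axis=-1)  # -> (L, L, 53)
```

Slicing `features[i, j]` recovers the 1-D features of residue i, then residue j, then the three co-evolution values for the pair.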
s23, generating a sequence tag. The definition of the protein amino acid association diagram used in the invention is as follows: when the Euclidean distance of the beta carbon atom (alpha carbon atom for glycine) of two amino acids is smaller than
Figure BDA0001894112490000041
The two amino acids are considered to have interactions in the association diagram, positive samples and negative samples. And calculating Euclidean distance between every two residues according to three-dimensional coordinates in the pdb file of a sequence, and generating a tag matrix with the size of L x L, wherein L is the length of the sequence. The invention follows the manner in which the tag is given as follows: the tag is assigned a value of 1 when one residue pair is a positive sample and a value of 0 when one residue pair is a negative sample. In the actual situation, residues which cannot be calculated through experiments to obtain space three-dimensional coordinates exist in the pdb file of the sequence, when the situation occurs, the label is assigned to be 2, and the label is used as a mark, so that the residue pair at the position is ignored when the loss function value or the precision of a sequence is solved later;
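A minimal sketch of this label-generation rule follows. The 8 Å cutoff value, the function name, and the convention of marking missing residues with NaN coordinates are assumptions made for illustration.

```python
import numpy as np

CUTOFF = 8.0   # assumed contact distance threshold, in angstroms
MISSING = 2    # label marking residues without experimental coordinates

def label_matrix(coords):
    """coords: (L, 3) C-beta coordinates (C-alpha for glycine); NaN rows mark missing residues."""
    # Pairwise Euclidean distances via broadcasting: d[i, j] = ||coords[i] - coords[j]||
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    labels = (d < CUTOFF).astype(int)      # 1 = positive sample, 0 = negative sample
    bad = np.isnan(coords).any(axis=1)
    labels[bad, :] = MISSING               # ignored later via the weight mask
    labels[:, bad] = MISSING
    return labels
```

For example, residues at (0, 0, 0) and (3, 0, 0) are labeled 1, while a residue 20 Å away is labeled 0.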
s24, producing a sequence weight mask matrix in advance. The weight mask matrix is pre-formulated based on the ratio of positive and negative residue pairs in a training sequence. First, a tag matrix is obtained in which the number of positive residue pairs in the tag matrix generated in step S23 is calculated, and the number of negative residue pairs is calculated, and is given as nn. The positive residue pairs in the sequence are given a weight
Figure BDA0001894112490000051
The weight at the negative residue pair is 1. The generated weight mask matrix is consistent with the tag matrix size. In addition, as residues which cannot be calculated through experiments and the three-dimensional space coordinates exist in the training sequence, when the situation occurs, the weight is assigned to be 0, so that the effect of neglecting the residue pairs at the position when the loss function value or the precision of one sequence is solved later is achieved.
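The weight-mask construction can be sketched as follows, with labels following the 0/1/2 convention above; the function name is an assumption.

```python
import numpy as np

def weight_mask(labels):
    """Loss weights per residue pair: nn/np at positives, 1 at negatives, 0 where label == 2."""
    n_pos = (labels == 1).sum()   # np in the text
    n_neg = (labels == 0).sum()   # nn in the text
    w = np.where(labels == 1, n_neg / n_pos, 1.0)
    w[labels == 2] = 0.0          # residues without coordinates are ignored
    return w
```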
According to one or more embodiments, step S3 trains the improved residual network using the combined features, the label file and the weight mask matrix. Part of the sequences is randomly selected from the sorted data set as a validation set, and the remaining sequences are used as the training set. The loss function is the cross-entropy function, and the loss function value of a sequence is the average of the loss function values over all residue pairs in the sequence. The invention uses dilated (hole) convolution to increase the receptive field of the last-layer neurons of the original residual network, so that the network can model relationships between residues at larger sequence separations. A recursive method for computing receptive fields in a convolutional network is shown below:
r_n = r_(n-1) + (k_n - 1) · ∏_(i=1..n-1) s_i
wherein r_n represents the receptive field of the n-th layer, k_n represents the convolution kernel size of the n-th layer, and s_i represents the stride of the i-th layer. This equation shows that the convolution kernel size is a very important factor in the receptive field of the network. Directly increasing the kernel size enlarges the parameter count and the risk of overfitting; dilated convolution is then an appropriate choice, since it increases the receptive field of the network without increasing the number of parameters, improving the network's ability to model long-range relationships.
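The recursion above, extended with a dilation factor (a dilated kernel of size k covers d·(k−1)+1 positions), can be computed as follows. The layer stacks below are only illustrative, not the patent's actual architecture.

```python
def receptive_field(layers):
    """layers: iterable of (kernel, stride, dilation) per conv layer, first to last."""
    r, jump = 1, 1            # jump = product of strides of all preceding layers
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1          # effective kernel size of a dilated convolution
        r += (k_eff - 1) * jump          # r_n = r_(n-1) + (k_eff - 1) * prod(strides)
        jump *= s
    return r

plain = receptive_field([(3, 1, 1)] * 10)    # ten 3x3 convs, stride 1
dilated = receptive_field([(3, 1, 2)] * 10)  # same stack with dilation 2
```

With stride 1 throughout, ten plain 3x3 layers reach a receptive field of 21, while dilation 2 doubles the growth per layer to reach 41 with the same number of parameters.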
About 1000 sequences are randomly selected from the sorted data set as the validation set, and the remaining sequences form the training set. The deep learning code is implemented on the TensorFlow platform. The initial learning rate of training is set to 10^(-3.5), and the learning-rate reduction strategy multiplies the current learning rate by 0.1 every 10 epochs. L2 regularization is used during learning to reduce the possibility of overfitting, with the regularization coefficient set to 0.0001. The batch size during training is 1 sequence sample. The parameters of the deep network are initialized with the xavier method.
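The step-decay schedule just described (initial rate 10^(-3.5), multiplied by 0.1 every 10 epochs) can be written as a small function; the function name is illustrative.

```python
def learning_rate(epoch, base=10 ** -3.5, factor=0.1, every=10):
    """Step decay: multiply the learning rate by `factor` once per `every` epochs."""
    return base * factor ** (epoch // every)
```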
According to one or more embodiments, in step S4 a list of homologous sequences is retrieved for the test sequence, and the combined features, label files and weight mask matrices of these homologous sequences are obtained, as follows: the fasta-format file of the test sequence is taken as input, and an initial list of homologous sequences is obtained by alignment with the hhblits software against the uniprot20 database. The command used may be: hhblits -i test.fasta -d uniprot20 -oa3m test.oa3m -cpu 1 -diff inf -n 1 -id 99. The list of homologous sequences, i.e. the generated oa3m file, is then used as input to generate the final list of homologous sequences by a second hhblits alignment against the pdb70 database. The command used may be: hhblits -i test.oa3m -d pdb70 -o test.o -cpu 1 -diff inf -n 1 -id 99. Finally, the fasta-format files and PDB files of the homologous sequences in the resulting list, i.e. the test.o file, are downloaded from the PDB database, and the features, labels and weight mask files of these homologous sequences are obtained using the methods described in S21, S22, S23 and S24.
The specific operation is as follows: first, the hhblits alignment software is installed on the machine; the software is available at https://github.com/soedinglab/hh-suite. The two databases used to retrieve homologous sequences, uniprot20 and pdb70, are then downloaded and decompressed; the former is available at http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/uniprot20_2016_02.tgz and the latter at http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_from_mmcif_18025.tgz. The test sequence is then used as input to obtain the initial list of homologous sequences through hhblits alignment against the uniprot20 database, and that list is used as input to generate the final list of homologous sequences through a second hhblits alignment against the pdb70 database. The pdb70 database is used as the second alignment database so that the corresponding PDB files can be downloaded from the PDB database to obtain the true coordinate information of the homologous sequences, whereas the initial alignment against the uniprot20 database is performed first so that more homologous sequences can be found in the second step. Finally, the resulting list of homologous sequences is used to download the fasta-format files and PDB files of the sequences from the PDB database, and the features, labels and weight mask files of these homologous sequences are obtained using the methods described in S21, S22, S23 and S24.
According to one or more embodiments, step S5 further trains the model obtained in step S3 using the combined features, label files and weight mask matrices of the homologous sequences obtained in step S4. Since the model is already near an optimum, no learning-rate reduction is performed at this stage. When there is a certain difference between the training data and the test data, relying solely on what is learned from the training data imposes an upper bound on prediction accuracy for the test data; further training based on the idea of transfer learning is then a good choice.
According to one or more embodiments, step S6 obtains the combined features of the test sequence and then inputs them into the prediction model obtained in S5 for prediction. The specific operation is as follows: using the test sequence as input, the features corresponding to the test sequence are obtained by the methods described in S21 and S22. These features are then input into the model obtained in step S5 for prediction. The output is an L x L matrix in which the number at each coordinate lies in (0, 1) and indicates how likely the residue pair at that coordinate is to interact.
The development of machine learning and deep learning technologies, together with existing high-quality protein structure databases such as the PDB, which store ever more structural information on proteins determined by biological experiments, has made it possible to migrate suitable techniques into the field of protein amino acid association diagram prediction. With the continuous growth of protein structure information and the development of machine learning and deep learning in recent years, many excellent models have emerged. For example, SVM-SEQ uses a support vector machine to predict protein amino acid association diagrams, and DeepConPred uses a deep belief network to improve the prediction of long-range protein amino acid contacts.
One problem in using deep learning models is how to choose the input features of the network. In addition to traditional features such as the PSSM matrix, secondary structure and solvent accessibility, three co-evolution information features are used in the invention: the PSICOV feature, the Evfold feature and the CCMPred feature. Co-evolution information has proven to be very effective in protein amino acid association diagram prediction tasks, and the three features are complementary, so that the predictor can more completely identify interacting residue pairs.
The base predictor of the invention is built on a residual network. However, considering that the residual network was originally designed for image recognition tasks, the invention improves the original residual network for the protein amino acid association diagram prediction task in the following three respects: the fully connected layer is changed to a convolution layer to accommodate varying sequence lengths; the pooling layers are removed from the network to accommodate the frequent occurrence of isolated positive residue pairs; and, since the receptive field of the network is an important factor in modeling long-range relationships, the receptive field of the original model is increased by using dilated (hole) convolution.
In addition, the invention finds that the training samples suffer from severely imbalanced data distribution. In a sequence, the number of negative samples can often be several tens of times the number of positive samples, which poses no small challenge to model learning. The invention addresses this problem by giving different weights to the loss function values of different residue pairs in a sequence before computing the sequence loss function value.
Therefore, compared with the prior art, the embodiment of the invention has the following beneficial effects:
1. A training data set for protein amino acid association diagram prediction is constructed. Three effective co-evolution information features, namely the PSICOV, Evfold and CCMPred features, are fused on top of the common features (PSSM matrix, secondary structure and solvent accessibility).
2. The severe imbalance between positive and negative samples in the protein amino acid association diagram prediction problem is addressed by giving different weights to different residue pairs. The invention assigns negative residue pairs a weight of 1 and positive residue pairs a weight of nn/np, where nn is the number of negative residue pairs in a sequence and np is the corresponding number of positive residue pairs. The loss function value of the whole sequence is then obtained as the average of the loss function values of the residue pairs in the sequence. This reduces the classifier's bias toward negative residue pairs and long sequences.
3. A residual network model from deep learning is used to model the relationship between the input features and the predicted amino acid associations; it can score all residue pairs in a sequence simultaneously and yields highly accurate predictions. At the same time, the invention improves the original residual network for the specific characteristics of the protein amino acid association diagram prediction task, improving the prediction accuracy of the model.
4. After the model has been trained on the training data set, the homologous sequences of the test sequence are used for further training, so that the model achieves better prediction accuracy on the test sequence.
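The weighting and sequence-loss averaging described in point 2 above can be sketched as follows; the function name and the probability clipping are assumptions for illustration.

```python
import numpy as np

def sequence_loss(probs, labels, weights, eps=1e-9):
    """Weighted cross-entropy averaged over the valid residue pairs of one sequence.

    probs:   predicted contact probabilities, shape (L, L)
    labels:  0 (negative), 1 (positive), 2 (no experimental coordinates)
    weights: nn/np at positive pairs, 1 at negative pairs, 0 where label == 2
    """
    valid = weights > 0
    y = labels[valid].astype(float)
    p = np.clip(probs[valid], eps, 1 - eps)          # guard against log(0)
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # per-pair cross-entropy
    return float(np.sum(weights[valid] * ce) / valid.sum())
```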
It is to be understood that while the spirit and principles of the invention have been described in connection with several embodiments, the invention is not limited to the specific embodiments disclosed, nor does the division into aspects imply that features of these aspects cannot be combined; that division is for convenience of description only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (1)

1. A protein amino acid association matrix prediction method is characterized by comprising the following steps:
s1, constructing a protein amino acid association diagram prediction training data set;
s2, extracting 6 features from protein amino acid sequences in a training set, combining the 6 features of each sequence, and generating a tag file and a weight mask matrix;
s3, training by using the combined characteristics, the label file and the weight mask matrix on the basis of an improved residual error network;
s4, searching a homologous sequence list according to the test sequence, and obtaining the merging characteristics, the tag file and the weight mask matrix of the homologous sequences;
s5, further training by using the combination features of the homologous sequences, the tag file and the weight mask matrix obtained in the step S4 on the basis of the model obtained in the step S3;
s6, obtaining the merging characteristics of the test sequence according to the test amino acid sequence, and then inputting the prediction model obtained in the step S5 for prediction;
the specific method for constructing the protein amino acid association diagram prediction training data set in the step S1 is as follows:
s11, screening a sequence list used as final training from all PDB databases by using a PISCES tool;
s12, downloading the fasta format files of the sequences from a PDB database for inputting the subsequent generation features according to the ids and the chain symbols of the sequences, and downloading the PDB files of the sequences for inputting the subsequent generation tag files and the weight mask files;
the step S2 specifically includes the following steps:
s21, generating sequence characteristics, namely firstly inputting an amino acid sequence to generate PSSM characteristics and comparison files through comparison of psiblast software in BLAST+ software packages on an nr database, simultaneously inputting the sequence to generate secondary structures and solution accessibility characteristics through SCRATCH software, then processing the comparison files into MSA files, and then respectively taking the MSA files as inputs to generate PSICOV characteristics, evfold characteristics and CCMPred characteristics through PSICOV software, freecontact software and CCMPred software;
s22, splicing sequence features, namely splicing PSSM features, secondary structure features and solution accessibility features on a non-sequence length dimension, then converting one-dimensional features into two-dimensional features by using a method of generating residue pair features by connecting one-dimensional features corresponding to two residues, and then connecting the generated two-dimensional features with PSICOV features, evfold features and CCMPred features on the non-sequence length dimension to generate final two-dimensional features;
s23, generating sequence labels, calculating Euclidean distance between every two residues according to three-dimensional coordinates in a pdb file of a sequence, and generating a label matrix with the size of L, wherein L is the length of the sequence, and the label is given by the following modes: the tag is assigned a value of 1 when one residue pair is a positive sample, and 0 when one residue pair is a negative sample;
s24, producing a weight mask matrix of the sequence, and setting the number of positive residue pairs in one sequence as np and the number of negative residue pairs as nn, wherein the weight at the positive residue pairs in the sequence is assigned as
Figure FDA0004219004330000021
The weight at the negative residue pair is assigned 1;
the improved residual network in step S3 means that the pooling layers in the original residual network are completely removed to accommodate the frequent occurrence of isolated positive residue pairs, the final fully connected layer is changed to a convolution layer to accommodate varying sequence lengths, and the dilated (hole) convolution technique is used to improve the accuracy of the base predictor;
the step S4 specifically includes the following steps:
s41, taking a test sequence as input, generating a primary homologous sequence comparison file through the comparison of the hhblits software on the uniprot20 database, and then taking the file as input, generating a final homologous sequence comparison file again through the comparison of the hhblits software on the pdb70 database, so as to obtain a final homologous sequence list of the sequence;
s42, obtaining the combination characteristics, the tag file and the weight mask matrix of the homologous sequences according to the method described in the S2;
step S5 is used for further training by using the combination features of the homologous sequences, the tag files and the weight mask matrix obtained in step S4 on the basis of the model obtained in step S3;
and step S6, obtaining the merging characteristics of the test sequence, and then inputting the merging characteristics into the prediction model obtained in step S5 for prediction, wherein the specific operation is as follows:
using the test sequence as input, a profile corresponding to the test sequence is obtained by the methods described in step S21 and step S22, and then the profile is input to the model obtained in step S5 for prediction.
CN201811484434.8A 2018-12-06 2018-12-06 Protein amino acid association matrix prediction method Active CN109637580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811484434.8A CN109637580B (en) 2018-12-06 2018-12-06 Protein amino acid association matrix prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811484434.8A CN109637580B (en) 2018-12-06 2018-12-06 Protein amino acid association matrix prediction method

Publications (2)

Publication Number Publication Date
CN109637580A CN109637580A (en) 2019-04-16
CN109637580B true CN109637580B (en) 2023-06-13

Family

ID=66071420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811484434.8A Active CN109637580B (en) 2018-12-06 2018-12-06 Protein amino acid association matrix prediction method

Country Status (1)

Country Link
CN (1) CN109637580B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110176272B (en) * 2019-04-18 2021-05-18 浙江工业大学 Protein disulfide bond prediction method based on multi-sequence association information
CN111210869B (en) * 2020-01-08 2023-06-20 中山大学 Protein refrigeration electron microscope structure analysis model training method and analysis method
CN112085245B (en) * 2020-07-21 2024-06-18 浙江工业大学 Protein residue contact prediction method based on depth residual neural network
CN112085247B (en) * 2020-07-22 2024-06-21 浙江工业大学 Protein residue contact prediction method based on deep learning
CN112927753A (en) * 2021-02-22 2021-06-08 中南大学 Method for identifying interface hot spot residues of protein and RNA (ribonucleic acid) compound based on transfer learning
CN116884473B (en) * 2023-05-22 2024-04-26 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Protein function prediction model generation method and device

Citations (6)

Publication number Priority date Publication date Assignee Title
US7574306B1 (en) * 2003-11-20 2009-08-11 University Of Washington Method and system for optimization of polymer sequences to produce polymers with stable, 3-dimensional conformations
CN104951668A (en) * 2015-04-07 2015-09-30 上海大学 Method for predicting protein association graphs based on cascaded neural network structures
CN106126972A (en) * 2016-06-21 2016-11-16 哈尔滨工业大学 A hierarchical multi-label classification method for protein function prediction
CN106951736A (en) * 2017-03-14 2017-07-14 齐鲁工业大学 A protein secondary structure prediction method based on multiple evolution matrices
CN107563150A (en) * 2017-08-31 2018-01-09 深圳大学 Prediction method, device, equipment and storage medium for protein binding sites
CN107622182A (en) * 2017-08-04 2018-01-23 中南大学 Prediction method and system for protein local structure features

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
WO2001050355A2 (en) * 2000-01-05 2001-07-12 Structural Bioinformatics Advanced Technologies A/S Computer predictions of molecules
GB0006153D0 (en) * 2000-03-14 2000-05-03 Inpharmatica Ltd Database
US20130304432A1 (en) * 2012-05-09 2013-11-14 Memorial Sloan-Kettering Cancer Center Methods and apparatus for predicting protein structure


Non-Patent Citations (11)

Title
Improving protein structure prediction using multiple sequence-based contact predictions; WU S et al.; Structure; 2011 *
Reconstruction of 3D structures from protein contact maps; VASSURA M et al.; IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB); 2008 *
Protein contact map prediction based on a weighted naive Bayes classifier and extremely randomized trees; JIN Kangrong et al.; Journal of Nanjing University of Aeronautics and Astronautics; 2018-10-15; section 1 *
Predicting protein function via collaborative matrix factorization on multi-network data; YU Guoxian et al.; Journal of Computer Research and Development; 2017-12-15; no. 12 *
A protein feature vector construction method based on multiple evolution matrices; DU Yuehan et al.; Computer Systems & Applications; 2018-02-15; no. 02 *
Prediction of protein association maps based on a recurrent neural network with bias; LIU Guixia et al.; Journal of Jilin University (Science Edition); 2008-03-26; no. 02 *
Prediction of protein association maps based on an improved clonal selection algorithm; LIU Guixia et al.; Journal of Jilin University (Engineering and Technology Edition); 2009-09-15; no. 05 *
Protein association structure prediction based on sample selection; SUN Pengfei et al.; Computers and Applied Chemistry; 2010-07-28; no. 07; sections 2-5 *
Prediction of polyproline type II structure based on neural networks; LU Kezhong et al.; Journal of Food Science and Biotechnology; 2005-01-30; no. 01 *
Object detection based on hard example mining with a residual network; ZHANG Chao et al.; Laser & Optoelectronics Progress; 2018-05-11; no. 10 *

Also Published As

Publication number Publication date
CN109637580A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN109637580B (en) Protein amino acid association matrix prediction method
CN106886599B (en) Image retrieval method and device
JP7344900B2 (en) Choosing a neural network architecture for supervised machine learning problems
EP3821434A1 (en) Machine learning for determining protein structures
CN106951736B (en) A protein secondary structure prediction method based on multiple evolution matrices
Hurtado et al. Deep transfer learning in the assessment of the quality of protein models
Kumar et al. PRmePRed: A protein arginine methylation prediction tool
CN109902190B (en) Image retrieval model optimization method, retrieval method, device, system and medium
CN105893787A (en) Prediction method for protein post-translational modification methylation loci
CN111046979A (en) Method and system for discovering badcase based on small sample learning
CN111627494A (en) Protein property prediction method and device based on multi-dimensional features and computing equipment
CN112365921A (en) Protein secondary structure prediction method based on long-time and short-time memory network
CN112085245B (en) Protein residue contact prediction method based on depth residual neural network
CN116635940A (en) Training protein structure prediction neural networks using simplified multi-sequence alignment
Juan et al. A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy
CN104504299B (en) The method for predicting the interactively between the residue of memebrane protein
US20220165367A1 (en) System and method for exploring chemical space during molecular design using a machine learning model
CN114530195A (en) Protein model quality evaluation method based on deep learning
EP4189606A1 (en) Neural architecture and hardware accelerator search
CN111383710A (en) Gene splice site recognition model construction method based on particle swarm optimization gemini support vector machine
CN113469244B (en) Volkswagen app classification system
CN113257342B (en) Protein interaction site prediction method based on residue position characteristics
CN111539536B (en) Method and device for evaluating service model hyper-parameters
CN113851192B (en) Training method and device for amino acid one-dimensional attribute prediction model and attribute prediction method
CN116679981B (en) Software system configuration optimizing method and device based on transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant