CN109637580B - Protein amino acid association matrix prediction method - Google Patents

Protein amino acid association matrix prediction method

Info

Publication number
CN109637580B
CN109637580B
Authority
CN
China
Prior art keywords
sequence
features
amino acid
prediction
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811484434.8A
Other languages
Chinese (zh)
Other versions
CN109637580A (en
Inventor
沈红斌
徐佳燕
冯世豪
杨静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201811484434.8A priority Critical patent/CN109637580B/en
Publication of CN109637580A publication Critical patent/CN109637580A/en
Application granted granted Critical
Publication of CN109637580B publication Critical patent/CN109637580B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

A protein amino acid association matrix prediction method comprises the following steps: S1, constructing a training data set for protein amino acid association diagram prediction; S2, extracting 6 features from each protein amino acid sequence in the training set, combining the 6 features of each sequence, and generating a label file and a weight mask matrix; S3, training an improved residual network using the combined features, the label files and the weight mask matrices; S4, retrieving a list of homologous sequences for the test sequence, and obtaining the combined features, label files and weight mask matrices of the homologous sequences; S5, further training the model obtained in step S3 using the combined features, label files and weight mask matrices of the homologous sequences obtained in step S4; S6, obtaining the combined features of the test amino acid sequence and inputting them into the prediction model obtained in step S5 for prediction.

Description

Protein amino acid association matrix prediction method
Technical Field
The invention relates to the technical field of protein biology, in particular to a protein amino acid association matrix prediction method based on deep learning.
Background
The spatial structure of protein molecules is of particular importance for understanding the function of proteins. In order to understand the mechanism of action of a protein at the molecular level, it is often necessary to determine its three-dimensional structure. Direct determination of a protein's structure by biological experiments, such as X-ray crystallography or nuclear magnetic resonance, typically requires significant effort. Therefore, computational methods that provide additional information for solving the three-dimensional structure of proteins have become important.
Among these, the amino acid associations (contacts) of a protein are considered to play an important role in solving protein structures; accurate amino acid associations alone can already yield an acceptable three-dimensional structural model of a protein. In particular, long-range amino acid interactions (those whose two interacting residues are separated by 24 or more positions in the sequence) are the most useful for solving protein structures, and their prediction is also more difficult, requiring models capable of capturing relationships between distant residues.
Disclosure of Invention
The invention aims to provide a protein amino acid association matrix prediction method based on deep learning, in order to alleviate the high cost of experimental protein structure determination.
According to one embodiment of the invention, a protein amino acid association matrix prediction method comprises the following steps:
S1, constructing a training data set for protein amino acid association diagram prediction;
S2, extracting 6 features from each protein amino acid sequence in the training set, combining the 6 features of each sequence, and generating a label file and a weight mask matrix;
S3, training an improved residual network using the combined features, the label files and the weight mask matrices;
S4, retrieving a list of homologous sequences for the test sequence, and obtaining the combined features, label files and weight mask matrices of the homologous sequences;
S5, further training the model obtained in step S3 using the combined features, label files and weight mask matrices of the homologous sequences obtained in step S4;
S6, obtaining the combined features of the test amino acid sequence and inputting them into the prediction model obtained in step S5 for prediction.
The protein amino acid association matrix prediction method is based on a deep learning model and greatly improves the accuracy of protein amino acid association matrix prediction.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 is a flow chart of prediction of protein amino acid association diagram based on deep learning in an embodiment of the invention.
FIG. 2 is a flow chart of sequence feature extraction used in an embodiment of the present invention.
FIG. 3 is a block diagram of a deep learning network used in an embodiment of the present invention.
Detailed Description
According to one or more embodiments, as shown in fig. 1, a protein amino acid association diagram prediction method based on deep learning includes the following steps:
S1, constructing a training data set for protein amino acid association diagram prediction;
S2, extracting 6 features from each protein amino acid sequence in the training set, combining the 6 features, and generating a label file and a weight mask matrix;
S3, training an improved residual network using the combined features, the label files and the weight mask matrices;
S4, retrieving a list of homologous sequences for the test amino acid sequence, and obtaining the combined features, label files and weight mask matrices of the homologous sequences;
S5, further training the model obtained in step S3 using the combined features, label files and weight mask matrices of the homologous sequences obtained in step S4;
S6, obtaining the combined features of the test sequence and inputting them into the prediction model obtained in step S5 for prediction.
According to one or more embodiments, step S1 further comprises the steps of:
s11, screening a protein amino acid chain list used for final training from all PDB databases according to a certain condition by using a PISCES tool, wherein 11217 sequences are found in the screening result. The screening conditions are shown in Table 1. Wherein PDB is an abbreviation for Protein Data Bank, a database of protein structures. Wherein the PISCES tool address is http:// dunbrack.fccc.edu/PISCES.php and the PDB database address ishttps://www.rcsb.org/
Table 1 shows the screening conditions used with the PISCES tool to construct the training data set for protein amino acid association diagram prediction.
(Table 1 appears as an image in the original publication; its contents are not reproducible as text.)
TABLE 1
S12, according to the ids and chain identifiers of the 11217 amino acid sequences, the fasta-format files of the 11217 sequences are downloaded from the PDB as input for subsequent feature generation, and the PDB files of the 11217 samples are downloaded as input for subsequent generation of the label files and weight mask files.
According to one or more embodiments, step S2 further comprises the steps of:
s21, generating sequence features. The specific operation is as follows: fasta format file of training set sequence is input, and the number of nr is calculated by using psiblast software in BLAST+software packageAlignment on a database (nr is an abbreviation for non-redundant, a database of protein sequences) generates PSSM (PSSM is an abbreviation for position specific score matrix) features and an alignment file in json format. The commands that may be used are blast+/bin/blast-query test.fasta-db nr-out_ascii_pssm test.matrix-save_pssm_after_last_round-out test.blast-value 0.001-max_target_seqs 10000-num_item 3. On the other hand, a fasta format file of the training set sequence is input, and a text format secondary structure and solution accessibility features are generated through SCRATCH software. The commands that can be used are SCRATCH-1D_1.1/bin/run_SCRATCH-1D_predictors.sh test.fasta test 4. And then the obtained comparison file in json format is automatically processed into the MSA file in text format. Wherein MSA is an abbreviation for Multi-sequence alignment, referring to multiple sequence alignment. The MSA file is then used as input to generate PSICOV, evfold and CCMPred features by PSICOV software, freecontact software and CCMPred software, respectively. The command in which PSICOV characteristics may be generated is PSICOV-p-r 0.001test.MSA>the command that may be used to generate the Evfold feature is freecontact- -parprof Evfold-i flat-o Evfold<'test.MSA'>the command that may be used to generate the CCMpred feature is CCMpred test.msa test.mat. As shown in fig. 2. BLAST+ software suite addressesftp://ftp.ncbi.nlm.nih.gov/blast/ executables/blast+/The address of the nr database is ftp:// ftp. Ncbi.nlm.nih.gov/blast/db/. The address of the SCRATCH software is http:// SCRATCH. Proteomics.
S22, splicing sequence features. The dimensions of the PSSM feature, the secondary structure feature and the solvent accessibility feature are L x 20, L x 3 and L x 2 respectively, where L is the sequence length. These three features of a training sequence are concatenated along the second dimension to obtain an L x 25-dimensional feature, which is then converted into an L x L x 50-dimensional feature by generating a residue-pair feature through concatenating the one-dimensional features of the two residues in each pair. The resulting two-dimensional feature is then concatenated along the third dimension with the PSICOV, Evfold and CCMPred features, each of dimension L x L x 1, producing the final L x L x 53-dimensional feature. During feature generation, some sequences fail to produce output; the resulting data set therefore contains 10591 sequence samples in total;
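The feature-splicing step can be sketched with numpy as follows. This is an illustrative reconstruction rather than the patented implementation; the array names and the random placeholder data are assumptions.

```python
import numpy as np

L = 5  # example sequence length (placeholder value)
f1d = np.random.rand(L, 25)  # PSSM (20) + secondary structure (3) + accessibility (2), concatenated

# Residue-pair feature: for pair (i, j), concatenate the 1-D features of residues i and j.
pair = np.concatenate(
    [np.repeat(f1d[:, None, :], L, axis=1),   # feature of residue i, tiled along j
     np.repeat(f1d[None, :, :], L, axis=0)],  # feature of residue j, tiled along i
    axis=-1)                                  # -> (L, L, 50)

# Stack the three L x L co-evolution maps (PSICOV, Evfold, CCMpred) as extra channels.
coevo = np.random.rand(L, L, 3)
features = np.concatenate([pair, coevo], axis=-1)  # -> (L, L, 53)
```

Slicing `features[i, j]` recovers the 1-D features of residue i, then residue j, then the three co-evolution values for the pair.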
s23, generating a sequence tag. The definition of the protein amino acid association diagram used in the invention is as follows: when the Euclidean distance of the beta carbon atom (alpha carbon atom for glycine) of two amino acids is smaller than
Figure BDA0001894112490000041
The two amino acids are considered to have interactions in the association diagram, positive samples and negative samples. And calculating Euclidean distance between every two residues according to three-dimensional coordinates in the pdb file of a sequence, and generating a tag matrix with the size of L x L, wherein L is the length of the sequence. The invention follows the manner in which the tag is given as follows: the tag is assigned a value of 1 when one residue pair is a positive sample and a value of 0 when one residue pair is a negative sample. In the actual situation, residues which cannot be calculated through experiments to obtain space three-dimensional coordinates exist in the pdb file of the sequence, when the situation occurs, the label is assigned to be 2, and the label is used as a mark, so that the residue pair at the position is ignored when the loss function value or the precision of a sequence is solved later;
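A minimal sketch of this label-generation rule follows. The 8 Å cutoff value, the function name, and the convention of marking missing residues with NaN coordinates are assumptions made for illustration.

```python
import numpy as np

CUTOFF = 8.0   # assumed contact distance threshold, in angstroms
MISSING = 2    # label marking residues without experimental coordinates

def label_matrix(coords):
    """coords: (L, 3) C-beta coordinates (C-alpha for glycine); NaN rows mark missing residues."""
    # Pairwise Euclidean distances via broadcasting: d[i, j] = ||coords[i] - coords[j]||
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    labels = (d < CUTOFF).astype(int)      # 1 = positive sample, 0 = negative sample
    bad = np.isnan(coords).any(axis=1)
    labels[bad, :] = MISSING               # ignored later via the weight mask
    labels[:, bad] = MISSING
    return labels
```

For example, residues at (0, 0, 0) and (3, 0, 0) are labeled 1, while a residue 20 Å away is labeled 0.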
s24, producing a sequence weight mask matrix in advance. The weight mask matrix is pre-formulated based on the ratio of positive and negative residue pairs in a training sequence. First, a tag matrix is obtained in which the number of positive residue pairs in the tag matrix generated in step S23 is calculated, and the number of negative residue pairs is calculated, and is given as nn. The positive residue pairs in the sequence are given a weight
Figure BDA0001894112490000051
The weight at the negative residue pair is 1. The generated weight mask matrix is consistent with the tag matrix size. In addition, as residues which cannot be calculated through experiments and the three-dimensional space coordinates exist in the training sequence, when the situation occurs, the weight is assigned to be 0, so that the effect of neglecting the residue pairs at the position when the loss function value or the precision of one sequence is solved later is achieved.
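The weight-mask construction can be sketched as follows, with labels following the 0/1/2 convention above; the function name is an assumption.

```python
import numpy as np

def weight_mask(labels):
    """Loss weights per residue pair: nn/np at positives, 1 at negatives, 0 where label == 2."""
    n_pos = (labels == 1).sum()   # np in the text
    n_neg = (labels == 0).sum()   # nn in the text
    w = np.where(labels == 1, n_neg / n_pos, 1.0)
    w[labels == 2] = 0.0          # residues without coordinates are ignored
    return w
```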
According to one or more embodiments, step S3 trains the improved residual network using the combined features, the label file and the weight mask matrix. Part of the sequences is randomly selected from the sorted data set as a validation set, and the remaining sequences are used as the training set. The loss function is the cross-entropy function, and the loss function value of a sequence is the average of the loss function values over all residue pairs in the sequence. The invention uses dilated (hole) convolution to increase the receptive field of the last-layer neurons of the original residual network, so that the network can model relationships between residues at larger sequence separations. A recursive method for computing receptive fields in a convolutional network is shown below:
r_n = r_(n-1) + (k_n - 1) · ∏_(i=1..n-1) s_i
wherein r_n represents the receptive field of the n-th layer, k_n represents the convolution kernel size of the n-th layer, and s_i represents the stride of the i-th layer. This equation shows that the convolution kernel size is a very important factor in the receptive field of the network. Directly increasing the kernel size enlarges the parameter count and the risk of overfitting; dilated convolution is then an appropriate choice, since it increases the receptive field of the network without increasing the number of parameters, improving the network's ability to model long-range relationships.
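The recursion above, extended with a dilation factor (a dilated kernel of size k covers d·(k−1)+1 positions), can be computed as follows. The layer stacks below are only illustrative, not the patent's actual architecture.

```python
def receptive_field(layers):
    """layers: iterable of (kernel, stride, dilation) per conv layer, first to last."""
    r, jump = 1, 1            # jump = product of strides of all preceding layers
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1          # effective kernel size of a dilated convolution
        r += (k_eff - 1) * jump          # r_n = r_(n-1) + (k_eff - 1) * prod(strides)
        jump *= s
    return r

plain = receptive_field([(3, 1, 1)] * 10)    # ten 3x3 convs, stride 1
dilated = receptive_field([(3, 1, 2)] * 10)  # same stack with dilation 2
```

With stride 1 throughout, ten plain 3x3 layers reach a receptive field of 21, while dilation 2 doubles the growth per layer to reach 41 with the same number of parameters.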
About 1000 sequences are randomly selected from the sorted data set as the validation set, and the remaining sequences form the training set. The deep learning code is implemented on the TensorFlow platform. The initial learning rate of training is set to 10^(-3.5), and the learning-rate reduction strategy multiplies the current learning rate by 0.1 every 10 epochs. L2 regularization is used during learning to reduce the possibility of overfitting, with the regularization coefficient set to 0.0001. The batch size during training is 1 sequence sample. The parameters of the deep network are initialized with the xavier method.
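The step-decay schedule just described (initial rate 10^(-3.5), multiplied by 0.1 every 10 epochs) can be written as a small function; the function name is illustrative.

```python
def learning_rate(epoch, base=10 ** -3.5, factor=0.1, every=10):
    """Step decay: multiply the learning rate by `factor` once per `every` epochs."""
    return base * factor ** (epoch // every)
```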
According to one or more embodiments, in step S4 a list of homologous sequences is retrieved for the test sequence, and the combined features, label files and weight mask matrices of these homologous sequences are obtained, as follows: the fasta-format file of the test sequence is taken as input, and an initial list of homologous sequences is obtained by alignment with the hhblits software against the uniprot20 database. The command used may be: hhblits -i test.fasta -d uniprot20 -oa3m test.oa3m -cpu 1 -diff inf -n 1 -id 99. The list of homologous sequences, i.e. the generated oa3m file, is then used as input to generate the final list of homologous sequences by a second hhblits alignment against the pdb70 database. The command used may be: hhblits -i test.oa3m -d pdb70 -o test.o -cpu 1 -diff inf -n 1 -id 99. Finally, the fasta-format files and PDB files of the homologous sequences in the resulting list, i.e. the test.o file, are downloaded from the PDB database, and the features, labels and weight mask files of these homologous sequences are obtained using the methods described in S21, S22, S23 and S24.
The specific operation is as follows: first, the hhblits alignment software is installed on the machine; the software is available at https://github.com/soedinglab/hh-suite. The two databases used to retrieve homologous sequences, uniprot20 and pdb70, are then downloaded and decompressed; the former is available at http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/uniprot20_2016_02.tgz and the latter at http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_from_mmcif_18025.tgz. The test sequence is then used as input to obtain the initial list of homologous sequences through hhblits alignment against the uniprot20 database, and that list is used as input to generate the final list of homologous sequences through a second hhblits alignment against the pdb70 database. The pdb70 database is used as the second alignment database so that the corresponding PDB files can be downloaded from the PDB database to obtain the true coordinate information of the homologous sequences, whereas the initial alignment against the uniprot20 database is performed first so that more homologous sequences can be found in the second step. Finally, the resulting list of homologous sequences is used to download the fasta-format files and PDB files of the sequences from the PDB database, and the features, labels and weight mask files of these homologous sequences are obtained using the methods described in S21, S22, S23 and S24.
According to one or more embodiments, step S5 further trains the model obtained in step S3 using the combined features, label files and weight mask matrices of the homologous sequences obtained in step S4. Since the model is already near an optimum, no learning-rate reduction is performed at this stage. When there is a certain difference between the training data and the test data, relying solely on what is learned from the training data imposes an upper bound on prediction accuracy for the test data; further training based on the idea of transfer learning is then a good choice.
According to one or more embodiments, step S6 obtains the combined features of the test sequence and then inputs them into the prediction model obtained in S5 for prediction. The specific operation is as follows: using the test sequence as input, the features corresponding to the test sequence are obtained by the methods described in S21 and S22. These features are then input into the model obtained in step S5 for prediction. The output is an L x L matrix in which the number at each coordinate lies in (0, 1) and indicates how likely the residue pair at that coordinate is to interact.
The development of machine learning and deep learning technologies, together with existing high-quality protein structure databases such as the PDB, which store ever more structural information on proteins determined by biological experiments, has made it possible to migrate suitable techniques into the field of protein amino acid association diagram prediction. With the continuous growth of protein structure information and the development of machine learning and deep learning in recent years, many excellent models have emerged. For example, SVM-SEQ uses a support vector machine to predict protein amino acid association diagrams, and DeepConPred uses a deep belief network to improve the prediction of long-range protein amino acid contacts.
One problem in using deep learning models is how to choose the input features of the network. In addition to traditional features such as the PSSM matrix, secondary structure and solvent accessibility, three co-evolution information features are used in the invention: the PSICOV feature, the Evfold feature and the CCMPred feature. Co-evolution information has proven to be very effective in protein amino acid association diagram prediction tasks, and the three features are complementary, so that the predictor can more completely identify interacting residue pairs.
The base predictor of the invention is built on a residual network. However, considering that the residual network was originally designed for image recognition tasks, the invention improves the original residual network for the protein amino acid association diagram prediction task in the following three respects: the fully connected layer is changed to a convolution layer to accommodate varying sequence lengths; the pooling layers are removed from the network to accommodate the frequent occurrence of isolated positive residue pairs; and, since the receptive field of the network is an important factor in modeling long-range relationships, the receptive field of the original model is increased by using dilated (hole) convolution.
In addition, the invention finds that the training samples suffer from severely imbalanced data distribution. In a sequence, the number of negative samples can often be several tens of times the number of positive samples, which poses no small challenge to model learning. The invention addresses this problem by giving different weights to the loss function values of different residue pairs in a sequence before computing the sequence loss function value.
Therefore, compared with the prior art, the embodiment of the invention has the following beneficial effects:
1. A training data set for protein amino acid association diagram prediction is constructed. Three effective co-evolution information features, namely the PSICOV, Evfold and CCMPred features, are fused on top of the common features (PSSM matrix, secondary structure and solvent accessibility).
2. The severe imbalance between positive and negative samples in the protein amino acid association diagram prediction problem is addressed by giving different weights to different residue pairs. The invention assigns negative residue pairs a weight of 1 and positive residue pairs a weight of nn/np, where nn is the number of negative residue pairs in a sequence and np is the corresponding number of positive residue pairs. The loss function value of the whole sequence is then obtained as the average of the loss function values of the residue pairs in the sequence. This reduces the classifier's bias toward negative residue pairs and long sequences.
3. A residual network model from deep learning is used to model the relationship between the input features and the predicted amino acid associations; it can score all residue pairs in a sequence simultaneously and yields highly accurate predictions. At the same time, the invention improves the original residual network for the specific characteristics of the protein amino acid association diagram prediction task, improving the prediction accuracy of the model.
4. After the model has been trained on the training data set, the homologous sequences of the test sequence are used for further training, so that the model achieves better prediction accuracy on the test sequence.
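The weighting and sequence-loss averaging described in point 2 above can be sketched as follows; the function name and the probability clipping are assumptions for illustration.

```python
import numpy as np

def sequence_loss(probs, labels, weights, eps=1e-9):
    """Weighted cross-entropy averaged over the valid residue pairs of one sequence.

    probs:   predicted contact probabilities, shape (L, L)
    labels:  0 (negative), 1 (positive), 2 (no experimental coordinates)
    weights: nn/np at positive pairs, 1 at negative pairs, 0 where label == 2
    """
    valid = weights > 0
    y = labels[valid].astype(float)
    p = np.clip(probs[valid], eps, 1 - eps)          # guard against log(0)
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # per-pair cross-entropy
    return float(np.sum(weights[valid] * ce) / valid.sum())
```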
It is to be understood that while the spirit and principles of the invention have been described in connection with several embodiments, the invention is not limited to the specific embodiments disclosed, nor does the division into aspects imply that features of these aspects cannot be combined; that division is for convenience of description only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (1)

1. A protein amino acid association matrix prediction method is characterized by comprising the following steps:
s1, constructing a protein amino acid association diagram prediction training data set;
s2, extracting 6 features from protein amino acid sequences in a training set, combining the 6 features of each sequence, and generating a tag file and a weight mask matrix;
s3, training by using the combined characteristics, the label file and the weight mask matrix on the basis of an improved residual error network;
s4, searching a homologous sequence list according to the test sequence, and obtaining the merging characteristics, the tag file and the weight mask matrix of the homologous sequences;
s5, further training by using the combination features of the homologous sequences, the tag file and the weight mask matrix obtained in the step S4 on the basis of the model obtained in the step S3;
s6, obtaining the merging characteristics of the test sequence according to the test amino acid sequence, and then inputting the prediction model obtained in the step S5 for prediction;
the specific method for constructing the protein amino acid association diagram prediction training data set in the step S1 is as follows:
s11, screening a sequence list used as final training from all PDB databases by using a PISCES tool;
s12, downloading the fasta format files of the sequences from a PDB database for inputting the subsequent generation features according to the ids and the chain symbols of the sequences, and downloading the PDB files of the sequences for inputting the subsequent generation tag files and the weight mask files;
the step S2 specifically includes the following steps:
s21, generating sequence characteristics, namely firstly inputting an amino acid sequence to generate PSSM characteristics and comparison files through comparison of psiblast software in BLAST+ software packages on an nr database, simultaneously inputting the sequence to generate secondary structures and solution accessibility characteristics through SCRATCH software, then processing the comparison files into MSA files, and then respectively taking the MSA files as inputs to generate PSICOV characteristics, evfold characteristics and CCMPred characteristics through PSICOV software, freecontact software and CCMPred software;
s22, splicing sequence features, namely splicing PSSM features, secondary structure features and solution accessibility features on a non-sequence length dimension, then converting one-dimensional features into two-dimensional features by using a method of generating residue pair features by connecting one-dimensional features corresponding to two residues, and then connecting the generated two-dimensional features with PSICOV features, evfold features and CCMPred features on the non-sequence length dimension to generate final two-dimensional features;
s23, generating sequence labels, calculating Euclidean distance between every two residues according to three-dimensional coordinates in a pdb file of a sequence, and generating a label matrix with the size of L, wherein L is the length of the sequence, and the label is given by the following modes: the tag is assigned a value of 1 when one residue pair is a positive sample, and 0 when one residue pair is a negative sample;
s24, producing a weight mask matrix of the sequence, and setting the number of positive residue pairs in one sequence as np and the number of negative residue pairs as nn, wherein the weight at the positive residue pairs in the sequence is assigned as
Figure FDA0004219004330000021
The weight at the negative residue pair is assigned 1;
the improved residual network in step S3 means that the pooling layers in the original residual network are completely removed to accommodate the frequent occurrence of isolated positive residue pairs, the final fully connected layer is changed to a convolution layer to accommodate varying sequence lengths, and the dilated (hole) convolution technique is used to improve the accuracy of the base predictor;
the step S4 specifically includes the following steps:
s41, taking a test sequence as input, generating a primary homologous sequence comparison file through the comparison of the hhblits software on the uniprot20 database, and then taking the file as input, generating a final homologous sequence comparison file again through the comparison of the hhblits software on the pdb70 database, so as to obtain a final homologous sequence list of the sequence;
s42, obtaining the combination characteristics, the tag file and the weight mask matrix of the homologous sequences according to the method described in the S2;
step S5 is used for further training by using the combination features of the homologous sequences, the tag files and the weight mask matrix obtained in step S4 on the basis of the model obtained in step S3;
and step S6, obtaining the merging characteristics of the test sequence, and then inputting the merging characteristics into the prediction model obtained in step S5 for prediction, wherein the specific operation is as follows:
using the test sequence as input, a profile corresponding to the test sequence is obtained by the methods described in step S21 and step S22, and then the profile is input to the model obtained in step S5 for prediction.
CN201811484434.8A 2018-12-06 2018-12-06 Protein amino acid association matrix prediction method Active CN109637580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811484434.8A CN109637580B (en) 2018-12-06 2018-12-06 Protein amino acid association matrix prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811484434.8A CN109637580B (en) 2018-12-06 2018-12-06 Protein amino acid association matrix prediction method

Publications (2)

Publication Number Publication Date
CN109637580A CN109637580A (en) 2019-04-16
CN109637580B true CN109637580B (en) 2023-06-13

Family

ID=66071420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811484434.8A Active CN109637580B (en) 2018-12-06 2018-12-06 Protein amino acid association matrix prediction method

Country Status (1)

Country Link
CN (1) CN109637580B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110176272B (en) * 2019-04-18 2021-05-18 浙江工业大学 Protein disulfide bond prediction method based on multi-sequence association information
CN111210869B (en) * 2020-01-08 2023-06-20 中山大学 Protein refrigeration electron microscope structure analysis model training method and analysis method
CN112085245B (en) * 2020-07-21 2024-06-18 浙江工业大学 Protein residue contact prediction method based on depth residual neural network
CN112085247B (en) * 2020-07-22 2024-06-21 浙江工业大学 Protein residue contact prediction method based on deep learning
CN112927753A (en) * 2021-02-22 2021-06-08 中南大学 Method for identifying interface hot spot residues of protein and RNA (ribonucleic acid) compound based on transfer learning
CN116884473B (en) * 2023-05-22 2024-04-26 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Protein function prediction model generation method and device

Citations (6)

Publication number Priority date Publication date Assignee Title
US7574306B1 (en) * 2003-11-20 2009-08-11 University Of Washington Method and system for optimization of polymer sequences to produce polymers with stable, 3-dimensional conformations
CN104951668A (en) * 2015-04-07 2015-09-30 上海大学 Method for predicting protein association graphs based on cascaded neural network structures
CN106126972A (en) * 2016-06-21 2016-11-16 哈尔滨工业大学 A hierarchical multi-label classification method for protein function prediction
CN106951736A (en) * 2017-03-14 2017-07-14 齐鲁工业大学 A protein secondary structure prediction method based on multiple evolution matrices
CN107563150A (en) * 2017-08-31 2018-01-09 深圳大学 Prediction method, device, equipment and storage medium for protein binding sites
CN107622182A (en) * 2017-08-04 2018-01-23 中南大学 Prediction method and system for protein local structure features

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
WO2001050355A2 (en) * 2000-01-05 2001-07-12 Structural Bioinformatics Advanced Technologies A/S Computer predictions of molecules
GB0006153D0 (en) * 2000-03-14 2000-05-03 Inpharmatica Ltd Database
US20130304432A1 (en) * 2012-05-09 2013-11-14 Memorial Sloan-Kettering Cancer Center Methods and apparatus for predicting protein structure


Non-Patent Citations (11)

Title
Improving protein structure prediction using multiple sequence-based contact predictions; WU S et al.; Structure; 2011 *
Reconstruction of 3D structures from protein contact maps; VASSURA M et al.; IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB); 2008 *
Protein contact map prediction based on a weighted naive Bayes classifier and extremely randomized trees; JIN Kangrong et al.; Journal of Nanjing University of Aeronautics and Astronautics; 2018-10-15; section 1 *
Predicting protein function via collaborative matrix factorization on multi-network data; YU Guoxian et al.; Journal of Computer Research and Development; 2017-12-15; no. 12 *
A protein feature vector construction method based on multiple evolution matrices; DU Yuehan et al.; Computer Systems & Applications; 2018-02-15; no. 02 *
Prediction of protein association maps based on a recurrent neural network with bias; LIU Guixia et al.; Journal of Jilin University (Science Edition); 2008-03-26; no. 02 *
Prediction of protein association maps based on an improved clonal selection algorithm; LIU Guixia et al.; Journal of Jilin University (Engineering and Technology Edition); 2009-09-15; no. 05 *
Protein association structure prediction based on sample selection; SUN Pengfei et al.; Computers and Applied Chemistry; 2010-07-28; no. 07; sections 2-5 *
Prediction of polyproline type II structure based on neural networks; LU Kezhong et al.; Journal of Food Science and Biotechnology; 2005-01-30; no. 01 *
Object detection based on hard example mining with a residual network; ZHANG Chao et al.; Laser & Optoelectronics Progress; 2018-05-11; no. 10 *

Also Published As

Publication number Publication date
CN109637580A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN109637580B (en) Protein amino acid association matrix prediction method
CN106886599B (en) Image retrieval method and device
JP7344900B2 (en) Choosing a neural network architecture for supervised machine learning problems
EP3821434A1 (en) Machine learning for determining protein structures
CN106951736B (en) A protein secondary structure prediction method based on multiple evolution matrices
Hurtado et al. Deep transfer learning in the assessment of the quality of protein models
Kumar et al. PRmePRed: A protein arginine methylation prediction tool
CN109902190B (en) Image retrieval model optimization method, retrieval method, device, system and medium
CN105893787A (en) Prediction method for protein post-translational modification methylation loci
CN111046979A (en) Method and system for discovering badcase based on small sample learning
CN111627494A (en) Protein property prediction method and device based on multi-dimensional features and computing equipment
CN112365921A (en) Protein secondary structure prediction method based on long-time and short-time memory network
CN112085245B (en) Protein residue contact prediction method based on depth residual neural network
CN116635940A (en) Training protein structure prediction neural networks using simplified multi-sequence alignment
Juan et al. A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy
CN104504299B (en) The method for predicting the interactively between the residue of memebrane protein
US20220165367A1 (en) System and method for exploring chemical space during molecular design using a machine learning model
CN114530195A (en) Protein model quality evaluation method based on deep learning
EP4189606A1 (en) Neural architecture and hardware accelerator search
CN111383710A (en) Gene splice site recognition model construction method based on particle swarm optimization gemini support vector machine
CN113469244B (en) Volkswagen app classification system
CN113257342B (en) Protein interaction site prediction method based on residue position characteristics
CN111539536B (en) Method and device for evaluating service model hyper-parameters
CN113851192B (en) Training method and device for amino acid one-dimensional attribute prediction model and attribute prediction method
CN116679981B (en) Software system configuration optimizing method and device based on transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant