CN105046103A - Novel representation method for protein sequence fusing genetic information - Google Patents
- Publication number: CN105046103A
- Application number: CN201510382702.5A
- Authority: CN (China)
- Prior art keywords: protein, amino acid, sequence, protein sequence, PSSM
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Peptides Or Proteins (AREA)
Abstract
The invention provides a novel method for representing a protein sequence that fuses genetic information. The amino acids in the conserved region of the sequence of a protein P are kept unchanged, and each amino acid in the non-conserved regions is replaced, in turn, by other amino acids in order of the mutation probability given by the position-specific scoring matrix (PSSM), yielding 20 virtual proteins that carry the genetic information of protein P. Protein P is then fused with this group of virtual proteins to form a new vector description of P, addressing the relatively low accuracy of protein secondary structure type prediction and subcellular localization prediction. Compared with existing methods for fusing evolutionary information, the method has a clearer biological meaning. Representing a protein by the group of proteins it is most likely to evolve into can significantly increase the success rate of the relevant predictors, and the method has broad application prospects.
Description
Technical field
The present invention relates to the technical fields of bioinformatics, pseudo amino acid composition of proteins, and traditional protein sequence analysis, and in particular to a new protein sequence representation method that fuses genetic information.
Background art
With the completion of human genome sequencing, bioinformatics has entered a new stage of development: the post-genomic era. Genome projects have produced hundreds of millions of genome sequences. How to find in these sequences the answers to a series of questions, such as how life originated, how it evolved, and how these genes make living organisms active, is a focus of current research. Gene sequences can be analyzed at many levels, such as base sequences, proteins, and genomes. Because many biological phenotypic traits and much of gene regulation are determined by the amino acid sequences of proteins, analyzing amino acid sequences has certain advantages.
A protein sequence is a one-dimensional string composed of 20 kinds of amino acids, and revealing the biological properties hidden in it is very difficult. For this reason, many kinds of pseudo amino acid composition have been designed to describe protein sequences as vectors, such as dipeptide composition, tripeptide composition, grey theory factors, and complexity factors. Some of these describe the local amino acid order information of a protein sequence well, and others describe its global amino acid order information well; all have played a positive role in sequence-based prediction of protein structure and function classes.
All living species descend from a limited number of species in ancient times, and likewise existing proteins evolved from a few simple proteins. The evolutionary process includes base insertions and deletions, mutations, duplications, and fusion with other genes. As evolution proceeds, the similarity between sequences becomes lower and lower, yet the corresponding proteins mostly retain the same characteristics, such as the same biological function, three-dimensional structure, and subcellular localization. Extracting this sequence evolution information to form protein description vectors is therefore a research focus.
Current methods for fusing protein evolution information are generally based on the PSSM. Because protein sequence lengths vary, the resulting PSSM is a matrix with a variable number of rows and a fixed number of columns, of dimension L × 20 (L is the protein sequence length). Because existing machine learning methods require inputs of identical dimension, conventional methods all convert the PSSM into a vector of fixed dimension. Method 1 sums the PSSM over its rows, column by column, and divides by L to obtain a 20-dimensional vector representing the protein sequence. Method 2 sums all rows of the PSSM that correspond to the same amino acid and divides by the number of occurrences of that amino acid in the sequence, obtaining a 20-dimensional vector; since amino acid sequences are composed of 20 kinds of amino acids, this gives a 20 × 20 = 400-dimensional vector representing the protein. Method 3 first standardizes the PSSM and then computes $\mathrm{PSSM}^{T} \times \mathrm{PSSM}$ to obtain a 20 × 20 matrix; because this matrix is symmetric (and positive semidefinite), only 210 of its elements are needed to represent protein P. The inventor has also proposed Grey-PSSM, a new model for extracting PSSM information based on grey theory: a GM(2,1) grey model is built on the values in each column of the PSSM, giving two development coefficients and one interference coefficient per column, so that the PSSM is converted into a 3 × 20 = 60-dimensional vector.
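As an illustration of the fixed-length reductions described above, the following minimal Python sketch implements Method 1 (column averages) and the matrix product used by Method 3. It is only a sketch under stated assumptions: the function names are hypothetical, the PSSM is assumed to be a list of L rows of 20 scores, and the standardization step of Method 3 is omitted because the text does not specify it.

```python
from typing import List

Matrix = List[List[float]]


def column_means(pssm: Matrix) -> List[float]:
    """Method 1: sum each column over the L rows and divide by L."""
    length = len(pssm)
    n_cols = len(pssm[0])
    return [sum(row[j] for row in pssm) / length for j in range(n_cols)]


def gram_upper_triangle(pssm: Matrix) -> List[float]:
    """Method 3 without standardization (assumption): PSSM^T x PSSM.

    The product is symmetric, so only the upper triangle
    (including the diagonal) is returned, row by row.
    """
    n_cols = len(pssm[0])
    gram = [
        [sum(row[a] * row[b] for row in pssm) for b in range(n_cols)]
        for a in range(n_cols)
    ]
    return [gram[a][b] for a in range(n_cols) for b in range(a, n_cols)]
```

Both functions return vectors whose length does not depend on L, which is the property the conventional methods rely on; neither keeps the order of the amino acids along the sequence.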
The above methods all rely on simple summation statistics of the PSSM or on grey-model fitting. Although they extract some information, they inevitably lose the order information of the amino acids in the protein sequence, and the operations have no corresponding biological meaning, so the genetic information contained in the PSSM may be lost. Given the importance of genetic information, it is necessary to design a new protein sequence description method that fuses genetic information for predicting protein function and structure type from sequence information.
Summary of the invention
The technical problem to be solved by the present invention is to provide a new protein sequence representation method that fuses genetic information. The method fuses protein evolution information by expanding directly from the sequence and produces a new vector description of protein P, in order to address the low accuracy of protein secondary structure type prediction and subcellular localization prediction.
To solve the above technical problem, the technical scheme of the present invention is a new protein sequence representation method fusing genetic information, characterized by comprising the following steps:
(1) Use the PSI-BLAST program to search the Swiss-Prot database and generate the position-specific scoring matrix (PSSM) of protein sequence P;
(2) Compare protein P with the protein sequences in the NCBI database to find the conserved sequence of protein P;
(3) From the PSSM, obtain the probability that the amino acid at each position of protein sequence P mutates into each other amino acid; keep the amino acids at the conserved sequence positions unchanged, and convert the amino acids in the non-conserved regions into other amino acids in turn, in descending order of mutation probability, thereby obtaining 20 virtual proteins that contain the genetic information of protein P;
(4) Take the first n of these 20 virtual protein sequences to form, together with P, the protein group describing protein sequence P;
(5) Apply a pseudo amino acid composition feature extraction method to each of the n+1 proteins in the resulting protein group to obtain its vector description, and combine these n+1 vectors to obtain the final vector description of protein P.
The position-specific scoring matrix (PSSM) of protein sequence P is expressed as:
$$\mathrm{PSSM} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,20} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ p_{L,1} & p_{L,2} & \cdots & p_{L,20} \end{bmatrix}$$
where $p_{i,j}$ represents the likelihood that the amino acid at position i of the protein sequence mutates into the j-th amino acid during protein evolution (a larger value indicates a more likely change), L is the length of the protein sequence, and j = 1, ..., 20 denotes the amino acids A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y and V, respectively.
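The column convention above can be captured in a small lookup, sketched here in Python. The helper names are hypothetical and the PSSM is assumed to be stored as a list of rows of 20 scores; only the amino acid order is taken from the text.

```python
# Column order j = 1..20 as listed in the text.
AA_ORDER = "ARNDCQEGHILKMFPSTWYV"


def pssm_column(amino_acid: str) -> int:
    """Return the 1-based PSSM column index j for a one-letter amino acid code."""
    return AA_ORDER.index(amino_acid) + 1


def pssm_score(pssm, position: int, amino_acid: str):
    """Entry at 1-based sequence position i for the given target amino acid."""
    return pssm[position - 1][pssm_column(amino_acid) - 1]
```

For instance, `pssm_column("M")` evaluates to 13, so the score for a change to M at position i sits in column 13 of row i.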
The method is applied to protein secondary structure type prediction and subcellular localization prediction, where the success rate of the relevant predictors improves by 4% to 7%.
Compared with existing methods for fusing evolutionary information, the proposed method has a clearer biological meaning: a protein is represented by the group of proteins it is most likely to evolve into. These proteins have low sequence homology with one another but are likely to share the same structure and function. In protein structure and function type prediction, this helps with proteins that have low sequence similarity to the training set but are remote homologs of proteins in it. Applied to protein secondary structure type prediction and subcellular localization prediction, the method can significantly improve the success rate of the relevant predictors and has broad application prospects.
Embodiment
To make the object, technical scheme, and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment. It should be understood that the specific embodiment described here only serves to explain the invention and is not intended to limit it.
The specific steps of the new protein sequence representation method fusing genetic information are as follows:
1) Use the PSI-BLAST program to search the Swiss-Prot database and generate the position-specific scoring matrix (Position-Specific Scoring Matrix, PSSM) of protein sequence P;
Given the human protein:
>AAA61157
MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSHFNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV
To compute its position-specific scoring matrix (PSSM), BLAST is first installed locally: (1) download BLAST from NCBI and configure it locally (version used here: blast-2.2.28+); (2) download the protein database from http://www.uniprot.org/ (UniProtKB/Swiss-Prot database, Release 2013_10); (3) set the parameters (-num_iterations: 3, -evalue: 0.001).
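A local run with these settings could be launched roughly as sketched below in Python. The two parameter values come from the text; the file names, the database name `swissprot`, and the use of `-out_ascii_pssm` to write the matrix are assumptions for illustration, and the database is assumed to have been formatted for BLAST+ beforehand.

```python
import shlex
from typing import List


def build_psiblast_command(query_fasta: str, database: str, pssm_out: str) -> List[str]:
    """Assemble a psiblast command line with the parameters given in the text."""
    return [
        "psiblast",
        "-query", query_fasta,            # FASTA file holding protein P (hypothetical name)
        "-db", database,                  # local Swiss-Prot BLAST database (hypothetical name)
        "-num_iterations", "3",           # from the text
        "-evalue", "0.001",               # from the text
        "-out_ascii_pssm", pssm_out,      # assumed way of saving the PSSM as text
    ]


cmd = build_psiblast_command("AAA61157.fasta", "swissprot", "AAA61157.pssm")
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute it
```

Building the argument list separately from running it keeps the sketch testable on machines where BLAST+ is not installed.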
Using the PSI-BLAST program in BLAST-2.2.28+, we obtain the PSSM of the above protein. In this matrix, the first column gives the likelihood that the amino acid in the original protein sequence is converted to amino acid A, the second column gives the likelihood of conversion to amino acid R, and the 3rd to 20th columns likewise give the likelihood of conversion to amino acids N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y and V, respectively. The first row of the PSSM corresponds to the first amino acid of the protein sequence, the second row to the amino acid at position 2, and so on.
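If the matrix is saved as a text file, it can be read back into rows of 20 scores. The Python sketch below assumes a layout in which every data row starts with the position number and the one-letter residue, followed by whitespace-separated integer scores; that layout and the function name are assumptions, so the parser should be checked against the actual file.

```python
from typing import List, Tuple


def parse_ascii_pssm(text: str) -> Tuple[str, List[List[int]]]:
    """Return (sequence, rows) where each row holds the 20 scores of one position."""
    residues: List[str] = []
    rows: List[List[int]] = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed data row: "<position> <residue> <20 scores> ..."
        if (
            len(fields) >= 22
            and fields[0].isdigit()
            and len(fields[1]) == 1
            and fields[1].isalpha()
        ):
            residues.append(fields[1])
            rows.append([int(value) for value in fields[2:22]])
    return "".join(residues), rows
```

Header and footer lines are skipped because they do not begin with a position number followed by a single residue letter.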
2) Compare protein P with the protein sequences in the NCBI database to find the conserved sequence of protein P;
Submit the AAA61157 sequence to:
http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi, which provides a conserved-sequence search. With the default parameter values provided by the website, two conserved segments are found for sequence AAA61157: one at positions 44-83 and another at positions 47-121, which together cover positions 44-121. The sequence is shown below, split into its non-conserved and conserved regions:
Non-conserved region (1-43): MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSH
Conserved region (44-121): FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLI
Non-conserved region (122-160): HCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV
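Combining the reported conserved segments into a single region is an interval merge. The following Python sketch, with hypothetical function names and 1-based inclusive positions as used in the text, merges the segments and marks which positions are conserved.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]


def merge_intervals(intervals: Sequence[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def conserved_mask(length: int, intervals: Sequence[Interval]) -> List[bool]:
    """Return one flag per sequence position; True means the position is conserved."""
    mask = [False] * length
    for start, end in merge_intervals(intervals):
        for position in range(start, min(end, length) + 1):
            mask[position - 1] = True
    return mask


print(merge_intervals([(44, 83), (47, 121)]))  # [(44, 121)]
```

The resulting mask can then be used to decide, position by position, whether an amino acid is kept or substituted.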
3) Construct the evolved protein sequences of protein P from its PSSM and its conserved sequence information;
From the PSSM, the probability that the amino acid at each position of protein sequence P mutates into each other amino acid is known. The amino acids at the conserved sequence positions are kept unchanged, and the amino acids in the non-conserved regions are converted into other amino acids in turn, in descending order of mutation probability, which yields 20 virtual proteins that contain the genetic information of protein P.
For example, the first row of the PSSM of protein sequence AAA61157 is [-2, -2, -3, -4, -2, -2, -3, -4, -3, 1, 4, -2, 6, 0, -3, -2, -1, -2, -1, 0]. The largest of these 20 values is the entry in column 13 (amino acid M), $p_{1,13} = 6$, meaning that the first amino acid of AAA61157 is most likely to be converted to M (that is, to remain M). The second largest is the entry in column 11 (amino acid L), $p_{1,11} = 4$, so conversion to L is the second most likely.
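The construction of the virtual proteins can be sketched in Python as follows. This is an illustrative reading of the step, not a definitive implementation: the function names are hypothetical, the k-th virtual protein is assumed to take the amino acid with the k-th highest score at every non-conserved position, and ties are assumed to be broken by column order.

```python
from typing import List, Sequence

AA_ORDER = "ARNDCQEGHILKMFPSTWYV"  # PSSM column order given in the text


def virtual_protein(
    sequence: str,
    pssm: Sequence[Sequence[float]],
    conserved: Sequence[bool],
    rank: int,
) -> str:
    """Build the rank-th virtual protein (rank 1 = most likely substitutions)."""
    residues: List[str] = []
    for original, scores, keep in zip(sequence, pssm, conserved):
        if keep:
            residues.append(original)  # conserved positions stay unchanged
            continue
        # Columns sorted by descending score; sorted() is stable, so equal
        # scores keep their column order (tie-breaking is an assumption).
        order = sorted(range(len(AA_ORDER)), key=lambda j: -scores[j])
        residues.append(AA_ORDER[order[rank - 1]])
    return "".join(residues)


def virtual_proteins(sequence, pssm, conserved, count: int = 20) -> List[str]:
    """Return the first `count` virtual proteins, most likely first."""
    return [virtual_protein(sequence, pssm, conserved, k) for k in range(1, count + 1)]
```

With the first PSSM row quoted above, a non-conserved first position gives M for rank 1, L for rank 2, and I for rank 3.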
Following the above method, the sequence that AAA61157 is most likely to evolve into is:
MVPTAWQLAMLCAGCLICSCQSCDNCTAPDPTEPPERPAWRGH
FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCHKRKRCRWCRQYECKEEEPEKLLRQENGCCHSETVV
The second most likely evolved sequence is:
LLASWGHYMLMALFIVLPAGEALEDSPEALSNDDDHAAKVTSS
FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIYYYRWKRHKEHYKERIGEHPKRRTIQKGRTSNANADNIM
The third most likely evolved sequence is:
IISAGARCCCFSGHTPGAPDHCEPSTSPYMNPSETIEHRFQNR
FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLICRHCRYYQRKKQPNNLKRANRNNAMIQSGSAMGKGQSLI
4) Take the first 2 of these 20 virtual protein sequences (the sequences protein P is most likely to evolve into) to form the protein group describing protein sequence P;
Protein P and the first 2 virtual proteins form the protein group of protein P:
MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSH
FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV
MVPTAWQLAMLCAGCLICSCQSCDNCTAPDPTEPPERPAWRGH
FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCHKRKRCRWCRQYECKEEEPEKLLRQENGCCHSETVV;
LLASWGHYMLMALFIVLPAGEALEDSPEALSNDDDHAAKVTSS
FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIYYYRWKRHKEHYKERIGEHPKRRTIQKGRTSNANADNIM
5) Apply the pseudo amino acid composition feature extraction method to each sequence in the resulting protein group to obtain its vector description, and combine the resulting vectors to obtain the final vector description of protein P;
Submit the above 3 sequences to the web page http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/#, choosing Type 1 for the PseAA mode, selecting all amino acid characters, and setting the weight factor to 0.05 and the lambda parameter to 5. This gives the description vectors of the three proteins.
The first sequence gives: [10.435, 6.522, 2.174, 2.609, 2.174, 2.174, 4.348, 3.043, 2.609, 7.391, 0.435, 0.870, 3.043, 2.609, 2.609, 4.782, 3.043, 7.826, 0.435, 0.435, 5.680, 5.781, 6.153, 6.494, 6.328].
The second sequence gives: [6.727, 8.409, 2.523, 4.625, 1.682, 2.102, 3.784, 2.523, 2.943, 5.045, 0.841, 1.261, 3.784, 2.943, 3.784, 2.943, 3.363, 5.886, 1.261, 0.841, 6.505, 6.620, 6.041, 6.802, 6.765].
The third sequence gives: [8.752, 3.063, 3.939, 3.501, 2.188, 2.626, 4.376, 3.939, 3.939, 6.127, 1.313, 2.188, 2.188, 1.750, 3.501, 4.376, 3.063, 5.689, 0.875, 2.626, 5.782, 5.691, 5.771, 6.173, 6.565].
Combining these three 25-dimensional vectors gives a 75-dimensional vector that represents the original protein.
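The final combination step can be sketched in Python as a plain concatenation. The text only says the vectors are combined, so concatenating them in the order P, first virtual protein, second virtual protein is an assumption, as is the function name.

```python
from typing import List, Sequence


def fuse_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the n+1 description vectors into one feature vector."""
    dimensions = {len(vector) for vector in vectors}
    if len(dimensions) != 1:
        raise ValueError("all description vectors must have the same dimension")
    return [value for vector in vectors for value in vector]
```

Applied to three 25-dimensional vectors, the function returns a 75-dimensional list, and changing n only changes the length of the result to 25 × (n+1).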
This method improves on existing protein sequence description methods. When applied to protein secondary structure type prediction and subcellular localization prediction, the success rate of the relevant predictors improves by 5%.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (3)
1. A new protein sequence representation method fusing genetic information, characterized by comprising the following steps:
(1) Use the PSI-BLAST program to search the Swiss-Prot database and generate the position-specific scoring matrix (PSSM) of protein sequence P;
(2) Compare protein P with the protein sequences in the NCBI database to find the conserved sequence of protein P;
(3) From the PSSM, obtain the probability that the amino acid at each position of protein sequence P mutates into each other amino acid; keep the amino acids at the conserved sequence positions unchanged, and convert the amino acids in the non-conserved regions into other amino acids in turn, in descending order of mutation probability, thereby obtaining 20 virtual proteins that contain the genetic information of protein P;
(4) Take the first n of these 20 virtual protein sequences to form, together with P, the protein group describing protein sequence P;
(5) Apply a pseudo amino acid composition feature extraction method to each of the n+1 proteins in the resulting protein group to obtain its vector description, and combine these n+1 vectors to obtain the final vector description of protein P.
2. The new protein sequence representation method fusing genetic information according to claim 1, characterized in that the position-specific scoring matrix (PSSM) of the protein sequence P is expressed as:
$$\mathrm{PSSM} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,20} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ p_{L,1} & p_{L,2} & \cdots & p_{L,20} \end{bmatrix}$$
where $p_{i,j}$ represents the likelihood that the amino acid at position i of the protein sequence mutates into the j-th amino acid during protein evolution (a larger value indicates a more likely change), L is the length of the protein sequence, and j = 1, ..., 20 denotes the amino acids A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y and V, respectively.
3. The new protein sequence representation method fusing genetic information according to claim 1, characterized in that the method is applied to protein secondary structure type prediction and subcellular localization prediction, and the success rate of the relevant predictors improves by 4% to 7%.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510382702.5A CN105046103A (en) | 2015-07-03 | 2015-07-03 | Novel representation method for protein sequence fusing genetic information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105046103A true CN105046103A (en) | 2015-11-11 |
Family
ID=54452643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510382702.5A Pending CN105046103A (en) | 2015-07-03 | 2015-07-03 | Novel representation method for protein sequence fusing genetic information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105046103A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20151111 |
WD01 | Invention patent application deemed withdrawn after publication |