CN105046103A - Novel representation method for protein sequence fusing genetic information - Google Patents

Novel representation method for protein sequence fusing genetic information

Info

Publication number
CN105046103A
Authority
CN
China
Prior art keywords
protein
amino acid
sequence
protein sequence
pssm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510382702.5A
Other languages
Chinese (zh)
Inventor
肖绚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdezhen Ceramic Institute
Original Assignee
Jingdezhen Ceramic Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdezhen Ceramic Institute filed Critical Jingdezhen Ceramic Institute
Priority to CN201510382702.5A priority Critical patent/CN105046103A/en
Publication of CN105046103A publication Critical patent/CN105046103A/en
Pending legal-status Critical Current

Landscapes

  • Peptides Or Proteins (AREA)

Abstract

The invention provides a novel representation method for protein sequences that fuses genetic information. The method keeps the amino acids in the conserved region of the sequence of a protein P unchanged and converts the amino acids in the non-conserved regions into other amino acids in turn, in descending order of the mutation probability given by the PSSM (Position-Specific Scoring Matrix), thereby obtaining 20 virtual proteins that contain the genetic information of protein P. Protein P and the virtual proteome are then fused into a new vector description of protein P, in order to solve the problem of relatively low success rates in protein secondary structure type prediction and subcellular localization prediction. Compared with existing methods for fusing evolutionary information, the method has a more evident biological meaning. Representing a protein by the proteome into which it is most likely to evolve can significantly increase the prediction success rate of the relevant predictors, and the method has wide application prospects.

Description

A novel protein sequence representation method fusing genetic information
Technical field
The present invention relates to the technical fields of bioinformatics, protein pseudo amino acid composition, and traditional protein sequence analysis, and in particular to a novel protein sequence representation method that fuses genetic information.
Background art
With the completion of human genome sequencing, bioinformatics has entered a new stage of development: the post-genome era. Genome projects have produced hundreds of millions of genome sequences. How to find in these sequences the answers to a series of questions, such as how life originated, how it evolved, and how these genes make living organisms active, is the focus of current research. These gene sequences can be analyzed at many levels, such as the base sequence, the protein, and the genome. Because many biological phenotypic traits and gene regulation are determined by the amino acid sequences of proteins, analyzing amino acid sequences has certain advantages.
A protein sequence is a one-dimensional string composed of 20 kinds of amino acids, and revealing the biological properties hidden in it is very difficult. For this reason, many kinds of pseudo amino acid composition have been designed to describe protein sequences as vectors. Examples include the dipeptide composition, the tripeptide composition, grey-theory factors, and complexity factors. Some of them describe the local amino acid order information of a protein sequence well, others describe its global amino acid order information well, and all have played a positive role in sequence-based prediction of protein structure and function classes.
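As one illustration of such a vector description, the following is a minimal sketch of a type 1 pseudo amino acid composition in its commonly cited form: 20 normalized composition terms followed by λ sequence-order correlation terms. The function name, the alphabetical residue order, and the single toy property table are assumptions made for illustration; real implementations use standardized physicochemical values such as hydrophobicity.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical order of the 20 residues

# Hypothetical property table for illustration only: one made-up value per residue.
TOY_PROPERTY = {aa: i / 19.0 for i, aa in enumerate(AMINO_ACIDS)}


def pseaac_type1(sequence, properties, lam=5, weight=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    length = len(sequence)
    if not 0 < lam < length:
        raise ValueError("lam must be between 1 and len(sequence) - 1")
    # Conventional amino acid composition: occurrence frequency of each residue.
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]
    # Sequence-order correlation factors theta_1 .. theta_lam.
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(length - k):
            a, b = sequence[i], sequence[i + k]
            # Mean squared property difference between residues k positions apart.
            total += sum((p[b] - p[a]) ** 2 for p in properties) / len(properties)
        thetas.append(total / (length - k))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


vector = pseaac_type1("MVPSAGQLALFALGIVLAAC", [TOY_PROPERTY], lam=5, weight=0.05)
print(len(vector))  # 25
```

The first 20 entries carry composition information and the last λ entries carry sequence-order information, which is why a λ of 5 yields a 25-dimensional vector.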
All living species descend from a limited number of species in ancient times, and likewise existing proteins evolved from a small number of simple proteins. Evolution involves the insertion or deletion of bases, mutation, duplication, fusion with other genes, and so on. As evolution proceeds, the similarity between sequences decreases, yet the corresponding proteins mostly retain the same characteristics, such as the same biological function, three-dimensional structure, and subcellular localization. Extracting this sequence evolution information to form protein description vectors has therefore become a research focus. Current methods for fusing protein evolution information are generally based on the PSSM. Because protein sequence lengths vary, the resulting PSSM is an L × 20 matrix (L being the protein sequence length) whose number of rows varies and whose number of columns is fixed. Because existing machine learning methods require inputs of identical dimension, conventional methods all convert the PSSM into a vector of fixed dimension. Method 1 sums the rows of the PSSM and divides by L, giving a 20-dimensional vector that represents the protein sequence. Method 2 sums all PSSM rows that correspond to the same amino acid and divides by the number of occurrences of that amino acid in the sequence, giving a 20-dimensional vector; since amino acid sequences are composed of 20 kinds of amino acids, this yields a 20 × 20 = 400-dimensional vector representing the protein. Method 3 first standardizes the PSSM and then computes PSSM^T × PSSM, a 20 × 20 matrix; because this matrix is symmetric (it is positive semidefinite), only 210 of its elements (the upper triangle including the diagonal, 20 × 21 / 2) are needed to represent protein P. The inventor has proposed Grey-PSSM, a new model that extracts PSSM information based on grey theory: the grey model GM(2,1) is fitted to the values of each column of the PSSM, giving two development coefficients and one interference coefficient per column, so that the PSSM is converted into a 3 × 20 = 60-dimensional vector.
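The three conventional reductions described above can be sketched as follows. This is a minimal illustration rather than any published implementation: the function names are hypothetical, the standardization step of Method 3 is omitted, and residue types absent from the sequence are assumed to contribute zeros in Method 2. A symmetric 20 × 20 matrix has 20 × 21 / 2 = 210 distinct entries, which is what the third function returns.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # PSSM column order


def mean_vector(pssm):
    """Method 1: average each of the 20 columns over the L rows."""
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(20)]


def residue_grouped_vector(pssm, sequence):
    """Method 2: for each residue type, average the rows at positions holding it."""
    features = []
    for aa in AMINO_ACIDS:
        rows = [row for row, res in zip(pssm, sequence) if res == aa]
        for j in range(20):
            # Assumption: residue types absent from the sequence contribute zeros.
            features.append(sum(r[j] for r in rows) / len(rows) if rows else 0.0)
    return features


def gram_upper_triangle(pssm):
    """Method 3 (without standardization): upper triangle of PSSM^T x PSSM."""
    return [sum(row[a] * row[b] for row in pssm)
            for a in range(20) for b in range(a, 20)]


# Toy 3 x 20 matrix standing in for a real PSSM.
toy_pssm = [[i + j for j in range(20)] for i in range(3)]
toy_sequence = "AAR"
print(len(mean_vector(toy_pssm)),
      len(residue_grouped_vector(toy_pssm, toy_sequence)),
      len(gram_upper_triangle(toy_pssm)))  # 20 400 210
```

All three outputs have a length that does not depend on L, which is the property the machine learning input requires, but none of them keeps the order of the residues.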
The above methods all apply simple summation statistics or grey-model modeling to the PSSM. Although they can extract some information, they inevitably lose the order information of the amino acids in the protein sequence, and the operations have no corresponding biological meaning, so the genetic information contained in the PSSM may be lost. Given the importance of genetic information, designing a new protein sequence description method that fuses genetic information is necessary for sequence-based prediction of protein function and structure type.
Summary of the invention
The technical problem to be solved by the present invention is to provide a novel protein sequence representation method fusing genetic information. By fusing protein evolution information, the method expands directly from the sequence and fuses the result into a new vector description of protein P, in order to solve the problem of low success rates in protein secondary structure type prediction and subcellular localization prediction.
To solve the above technical problem, the technical solution of the present invention is a novel protein sequence representation method fusing genetic information, characterized by comprising the following steps:
(1) use the PSI-BLAST program to search the Swiss-Prot database and generate the position-specific scoring matrix (PSSM) of protein sequence P;
(2) compare protein P with the protein sequences in the NCBI database to find the conserved sequence of protein P;
(3) the PSSM gives the probability that the amino acid at each position of protein sequence P mutates into each other amino acid; keep the amino acids at the conserved positions of the protein unchanged, and convert each amino acid in the non-conserved regions into other amino acids in turn, in descending order of mutation probability, thereby obtaining 20 virtual proteins containing the genetic information of protein P;
(4) take the first n of these 20 virtual proteins to form, together with protein P, the proteome describing protein sequence P;
(5) apply a pseudo amino acid composition feature extraction method to each of the n+1 proteins of the proteome to obtain its vector description, and combine these n+1 vectors to obtain the final vector description of protein P.
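The position-specific scoring matrix (PSSM) of protein sequence P is expressed as:
$$P_{\mathrm{PSSM}} = \begin{bmatrix} E_{1\to 1} & E_{1\to 2} & \cdots & E_{1\to 20} \\ E_{2\to 1} & E_{2\to 2} & \cdots & E_{2\to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L\to 1} & E_{L\to 2} & \cdots & E_{L\to 20} \end{bmatrix}$$
where $E_{i\to j}$ represents the likelihood that the amino acid at the i-th position of the protein sequence mutates into the j-th amino acid during protein evolution (the larger the value, the more likely the change), L is the length of the protein sequence, and j = 1, ..., 20 denotes the amino acids A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, and V, respectively.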
The method is applied to protein secondary structure type prediction and subcellular localization prediction, where it raises the prediction success rate of the relevant predictors by 4% to 7%.
Compared with existing methods for fusing evolution information, the method proposed by the present invention has a more evident biological meaning. It represents a protein by the proteome into which that protein is most likely to evolve. The proteins in this proteome do not have high homology, but they are more likely to share the same structure and function. In protein structure and function type prediction, this helps with proteins that have low sequence similarity to the proteins in the training set but are remotely homologous to them. Applied to protein secondary structure type prediction and subcellular localization prediction, the method can significantly improve the prediction success rate of the relevant predictors and has broad application prospects.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment. It should be understood that the specific embodiment described here only explains the present invention and is not intended to limit it.
The specific steps of the novel protein sequence representation method fusing genetic information are as follows:
1) Use the PSI-BLAST program to search the Swiss-Prot database and generate the position-specific scoring matrix (Position-Specific Scoring Matrix, PSSM) of protein sequence P;
Given the human protein:
>AAA61157
MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSHFNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV
To calculate its position-specific scoring matrix (PSSM), BLAST is first installed locally: (1) download BLAST from NCBI and configure it locally (version used on this machine: blast-2.2.28+); (2) download the protein database from http://www.uniprot.org/ (UniProtKB/Swiss-Prot database, Release 2013_10); (3) set the parameters (-num_iterations: 3, -evalue: 0.001).
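A minimal sketch of how such a run might be assembled with the BLAST+ `psiblast` program is given below. The two parameter values come from the text; the file names and the database name are hypothetical placeholders, and the local Swiss-Prot database is assumed to have been formatted beforehand (for example with `makeblastdb`). The sketch only builds and prints the command; it does not execute it.

```python
import shlex


def psiblast_command(query_fasta, database, pssm_out):
    """Build a psiblast command line that writes an ASCII PSSM file."""
    return [
        "psiblast",
        "-query", query_fasta,        # FASTA file holding the protein sequence
        "-db", database,              # locally formatted Swiss-Prot database
        "-num_iterations", "3",       # value given in the text
        "-evalue", "0.001",           # value given in the text
        "-out_ascii_pssm", pssm_out,  # where the PSSM is written
    ]


# Hypothetical file and database names, for illustration only.
cmd = psiblast_command("AAA61157.fasta", "swissprot", "AAA61157.pssm")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where BLAST+ is installed.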
Using the PSI-BLAST program in BLAST-2.2.28+, we obtain the PSSM of the above protein. In this matrix, the first column gives the likelihood that an amino acid of the original protein sequence converts to amino acid A, the second column gives the likelihood that it converts to amino acid R, and likewise the 3rd to 20th columns give the likelihood of converting to amino acids N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, and V, respectively. The first row of the PSSM corresponds to the first amino acid of the protein sequence, the second row to the amino acid at position 2, and so on.
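The row and column layout described above can be read from an ASCII PSSM file with a small parser such as the sketch below. It assumes the usual whitespace-separated layout in which each data row starts with the position number and the residue letter, followed by the 20 scores; the function name is hypothetical, and the sample text is truncated to the score columns.

```python
def parse_ascii_pssm(text):
    """Return (sequence, matrix) from the text of an ASCII PSSM file."""
    residues, matrix = [], []
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like: position, residue letter, then at least 20 scores.
        if (len(parts) >= 22 and parts[0].isdigit()
                and len(parts[1]) == 1 and parts[1].isalpha()):
            residues.append(parts[1])
            matrix.append([int(value) for value in parts[2:22]])
    return "".join(residues), matrix


# Sample containing only the header and the first row quoted in this document.
SAMPLE = """
            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    1 M    -2  -2  -3  -4  -2  -2  -3  -4  -3   1   4  -2   6   0  -3  -2  -1  -2  -1   0
"""

sequence, matrix = parse_ascii_pssm(SAMPLE)
print(sequence, matrix[0][12])  # M 6
```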
2) Compare protein P with the protein sequences in the NCBI database to find the conserved sequence of protein P;
Submit the AAA61157 sequence to http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi, which provides a conserved-sequence search, and use the default parameter values provided by the website. Two conserved segments are found for sequence AAA61157, one at positions 44-83 and the other at positions 47-121, which together cover positions 44-121. In the sequence below, the conserved region spans positions 44-121 (it starts after the space and ends with ...ITCVLI), and the remaining positions form the non-conserved regions;
MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSH FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV.
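Combining the two reported segments into a single conserved region amounts to merging overlapping intervals. A minimal sketch follows, with hypothetical function names and 1-based inclusive positions as in the text.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


def conserved_mask(length, intervals):
    """Return a list of booleans, True where a position is conserved."""
    mask = [False] * length
    for start, end in merge_intervals(intervals):
        for pos in range(start, end + 1):
            mask[pos - 1] = True
    return mask


SEQUENCE = (
    "MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSH"
    "FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLI"
    "HCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV"
)
segments = [(44, 83), (47, 121)]
mask = conserved_mask(len(SEQUENCE), segments)
conserved = "".join(res for res, keep in zip(SEQUENCE, mask) if keep)
print(merge_intervals(segments))  # [(44, 121)]
```

The boolean mask is a convenient form for the next step, where only the positions marked False are allowed to change.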
3) Construct the evolved protein sequences of protein P from its PSSM and its conserved sequence information;
The PSSM gives the probability that the amino acid at each position of protein sequence P mutates into each other amino acid. Keep the amino acids at the conserved positions of the protein unchanged, and convert each amino acid in the non-conserved regions into other amino acids in turn, in descending order of mutation probability, thereby obtaining 20 virtual proteins containing the genetic information of protein P.
For example, the first row of the PSSM of protein sequence AAA61157 is [-2 -2 -3 -4 -2 -2 -3 -4 -3 1 4 -2 6 0 -3 -2 -1 -2 -1 0]. The largest of these 20 values is 6, in the 13th column (amino acid M), which means that the first amino acid of the AAA61157 protein sequence is most likely to convert to M; the second largest is 4, in the 11th column (amino acid L), so conversion to L is the second most likely.
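The ranking in this example can be reproduced with a few lines of code. The sketch below sorts the 20 scores of the quoted first row in descending order; breaking ties by column order is an assumption, since the text does not say how equal scores are ordered.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # PSSM column order

# First row of the PSSM of AAA61157, as quoted in the text.
FIRST_ROW = [-2, -2, -3, -4, -2, -2, -3, -4, -3, 1,
             4, -2, 6, 0, -3, -2, -1, -2, -1, 0]


def rank_substitutions(row):
    """Return (amino acid, score) pairs from most to least likely."""
    # sorted() is stable, so equal scores keep their column order (assumption).
    order = sorted(range(20), key=lambda j: -row[j])
    return [(AMINO_ACIDS[j], row[j]) for j in order]


ranking = rank_substitutions(FIRST_ROW)
print(ranking[:3])  # [('M', 6), ('L', 4), ('I', 1)]
```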
Following the above method, the sequence into which AAA61157 is most likely to evolve is:
MVPTAWQLAMLCAGCLICSCQSCDNCTAPDPTEPPERPAWRGH FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCHKRKRCRWCRQYECKEEEPEKLLRQENGCCHSETVV
The second most likely sequence is:
LLASWGHYMLMALFIVLPAGEALEDSPEALSNDDDHAAKVTSS FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIYYYRWKRHKEHYKERIGEHPKRRTIQKGRTSNANADNIM
The third most likely sequence is:
IISAGARCCCFSGHTPGAPDHCEPSTSPYMNPSETIEHRFQNR FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLICRHCRYYQRKKQPNNLKRANRNNAMIQSGSAMGKGQSLI.
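The construction of the virtual proteins can be sketched as follows: the k-th virtual protein takes, at every non-conserved position, the amino acid with the k-th highest score in that row of the PSSM, and copies the original residue at every conserved position. The function name, the tie-breaking by column order, and the three-residue toy input are assumptions for illustration; only the first toy row is taken from the text.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # PSSM column order


def virtual_proteins(sequence, pssm, conserved, count=20):
    """Return `count` virtual proteins ordered from most to least likely."""
    # Rank the 20 columns of every row by descending score (ties: column order).
    ranked = [sorted(range(20), key=lambda j, r=row: -r[j]) for row in pssm]
    proteins = []
    for k in range(count):
        residues = [res if keep else AMINO_ACIDS[order[k]]
                    for res, order, keep in zip(sequence, ranked, conserved)]
        proteins.append("".join(residues))
    return proteins


# Toy input: position 1 uses the first PSSM row quoted in the text, position 2
# is conserved, and position 3 uses a made-up row favouring K, then R.
row_m = [-2, -2, -3, -4, -2, -2, -3, -4, -3, 1, 4, -2, 6, 0, -3, -2, -1, -2, -1, 0]
row_k = [0] * 20
row_k[11] = 5  # K
row_k[1] = 2   # R
toy = virtual_proteins("MAK", [row_m, [0] * 20, row_k], [False, True, False])
print(toy[:3])  # ['MAK', 'LAR', 'IAA']
```

Because the highest-scoring amino acid in a row is not always the original residue, the first virtual protein can already differ from protein P in the non-conserved regions, as the sequences listed above show.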
4) Take the first 2 of these 20 virtual proteins (the sequences into which protein P is most likely to evolve) to form the proteome describing protein sequence P;
Protein P and the first 2 virtual proteins form the proteome of protein P:
MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSH FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV
MVPTAWQLAMLCAGCLICSCQSCDNCTAPDPTEPPERPAWRGH FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCHKRKRCRWCRQYECKEEEPEKLLRQENGCCHSETVV;
LLASWGHYMLMALFIVLPAGEALEDSPEALSNDDDHAAKVTSS FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIYYYRWKRHKEHYKERIGEHPKRRTIQKGRTSNANADNIM.
5) Apply a pseudo amino acid composition feature extraction method to each sequence of the resulting proteome to obtain its vector description, and combine the resulting vectors to obtain the final vector description of protein P;
Submit the above 3 sequences to the web page http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/#, with PseAA mode set to Type 1, all amino acid characters selected, weight factor set to 0.05, and lambda parameter set to 5. This gives the description vectors of the three proteins.
The first sequence gives: [10.435 6.522 2.174 2.609 2.174 2.174 4.348 3.043 2.609 7.391 0.435 0.870 3.043 2.609 2.609 4.782 3.043 7.826 0.435 0.435 5.680 5.781 6.153 6.494 6.328].
The second sequence gives: [6.727 8.409 2.523 4.625 1.682 2.102 3.784 2.523 2.943 5.045 0.841 1.261 3.784 2.943 3.784 2.943 3.363 5.886 1.261 0.841 6.505 6.620 6.041 6.802 6.765].
The third sequence gives: [8.752 3.063 3.939 3.501 2.188 2.626 4.376 3.939 3.939 6.127 1.313 2.188 2.188 1.750 3.501 4.376 3.063 5.689 0.875 2.626 5.782 5.691 5.771 6.173 6.565].
Combining these three 25-dimensional vectors forms a 75-dimensional vector that represents the original protein.
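The final combination step is a plain concatenation of the per-sequence vectors, original protein first. The sketch below uses the three 25-dimensional vectors reported above; the function name is hypothetical, and concatenation in this order is an assumption, since the text only says that the vectors are combined.

```python
def fuse(vectors):
    """Concatenate per-sequence feature vectors into one description vector."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


original = [10.435, 6.522, 2.174, 2.609, 2.174, 2.174, 4.348, 3.043, 2.609,
            7.391, 0.435, 0.870, 3.043, 2.609, 2.609, 4.782, 3.043, 7.826,
            0.435, 0.435, 5.680, 5.781, 6.153, 6.494, 6.328]
virtual_1 = [6.727, 8.409, 2.523, 4.625, 1.682, 2.102, 3.784, 2.523, 2.943,
             5.045, 0.841, 1.261, 3.784, 2.943, 3.784, 2.943, 3.363, 5.886,
             1.261, 0.841, 6.505, 6.620, 6.041, 6.802, 6.765]
virtual_2 = [8.752, 3.063, 3.939, 3.501, 2.188, 2.626, 4.376, 3.939, 3.939,
             6.127, 1.313, 2.188, 2.188, 1.750, 3.501, 4.376, 3.063, 5.689,
             0.875, 2.626, 5.782, 5.691, 5.771, 6.173, 6.565]

fused = fuse([original, virtual_1, virtual_2])
print(len(fused))  # 75
```

With n virtual proteins and a 25-dimensional description per sequence, the fused vector has 25 × (n + 1) dimensions, so its length is fixed regardless of the protein length.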
This method improves on existing protein sequence description methods. Applied to protein secondary structure type prediction and subcellular localization prediction, it raises the prediction success rate of the relevant predictors by 5%.
The above is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (3)

1. A novel protein sequence representation method fusing genetic information, characterized by comprising the following steps:
(1) use the PSI-BLAST program to search the Swiss-Prot database and generate the position-specific scoring matrix (PSSM) of protein sequence P;
(2) compare protein P with the protein sequences in the NCBI database to find the conserved sequence of protein P;
(3) the PSSM gives the probability that the amino acid at each position of protein sequence P mutates into each other amino acid; keep the amino acids at the conserved positions of the protein unchanged, and convert each amino acid in the non-conserved regions into other amino acids in turn, in descending order of mutation probability, thereby obtaining 20 virtual proteins containing the genetic information of protein P;
(4) take the first n of these 20 virtual proteins to form, together with protein P, the proteome describing protein sequence P;
(5) apply a pseudo amino acid composition feature extraction method to each of the n+1 proteins of the proteome to obtain its vector description, and combine these n+1 vectors to obtain the final vector description of protein P.
2. The novel protein sequence representation method fusing genetic information according to claim 1, characterized in that the position-specific scoring matrix (PSSM) of protein sequence P is expressed as:
$$P_{\mathrm{PSSM}} = \begin{bmatrix} E_{1\to 1} & E_{1\to 2} & \cdots & E_{1\to 20} \\ E_{2\to 1} & E_{2\to 2} & \cdots & E_{2\to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L\to 1} & E_{L\to 2} & \cdots & E_{L\to 20} \end{bmatrix}$$
where $E_{i\to j}$ represents the likelihood that the amino acid at the i-th position of the protein sequence mutates into the j-th amino acid during protein evolution (the larger the value, the more likely the change), L is the length of the protein sequence, and j = 1, ..., 20 denotes the amino acids A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, and V, respectively.
3. The novel protein sequence representation method fusing genetic information according to claim 1, characterized in that the method is applied to protein secondary structure type prediction and subcellular localization prediction, where it raises the prediction success rate of the relevant predictors by 4% to 7%.
CN201510382702.5A 2015-07-03 2015-07-03 Novel representation method for protein sequence fusing genetic information Pending CN105046103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510382702.5A CN105046103A (en) 2015-07-03 2015-07-03 Novel representation method for protein sequence fusing genetic information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510382702.5A CN105046103A (en) 2015-07-03 2015-07-03 Novel representation method for protein sequence fusing genetic information

Publications (1)

Publication Number Publication Date
CN105046103A true CN105046103A (en) 2015-11-11

Family

ID=54452643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510382702.5A Pending CN105046103A (en) 2015-07-03 2015-07-03 Novel representation method for protein sequence fusing genetic information

Country Status (1)

Country Link
CN (1) CN105046103A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324933A (en) * 2013-06-08 2013-09-25 南京理工大学常熟研究院有限公司 Membrane protein sub-cell positioning method based on complex space multi-view feature fusion
CN104636635A (en) * 2015-01-29 2015-05-20 南京理工大学 Protein crystallization predicting method based on two-layer SVM learning mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
姜小莹 等: "使用伪氨基酸组成和模糊支持向量机预测蛋白质结构类", 《生物物理学报》 *
李丹丹 等: "蛋白质序列的一种新的二维图形表示", 《南阳师范学院学报》 *
石卓兴: "蛋白质亚细胞定位预测中若干信息提取算法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845149A (en) * 2017-02-09 2017-06-13 景德镇陶瓷大学 A kind of new protein sequence method for expressing based on gene ontology information
CN106845149B (en) * 2017-02-09 2019-04-09 景德镇陶瓷大学 A kind of protein sequence representation method based on gene ontology information
CN107358064A (en) * 2017-07-03 2017-11-17 苏州大学 System and method for predicting influence of amino acid variation on protein structure stability
CN109448787A (en) * 2018-10-12 2019-03-08 云南大学 Based on the protein subnucleus localization method for improving PSSM progress feature extraction with merging
CN109448787B (en) * 2018-10-12 2021-10-08 云南大学 Protein subnuclear localization method for feature extraction and fusion based on improved PSSM
CN112242179A (en) * 2020-09-09 2021-01-19 天津大学 Method for identifying type of membrane protein

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20151111
