CN105046103A

CN105046103A - Novel representation method for protein sequence fusing genetic information

Info

Publication number: CN105046103A
Application number: CN201510382702.5A
Authority: CN
Inventors: 肖绚
Original assignee: Jingdezhen Ceramic Institute
Current assignee: Jingdezhen Ceramic Institute
Priority date: 2015-07-03
Filing date: 2015-07-03
Publication date: 2015-11-11

Abstract

The invention provides a novel protein sequence representation method that fuses genetic information. The method keeps the amino acids in the conserved region of the sequence of a protein P unchanged and, using the position-specific scoring matrix (PSSM), converts each amino acid in the non-conserved regions into other amino acids in order of the probability that it mutates into them, thereby obtaining 20 virtual proteins that carry the genetic information of protein P. Protein P and a proteome of virtual proteins are then fused into a new vector description of protein P, to address the relatively low success rates of protein secondary structure type prediction and subcellular localization prediction. Compared with existing methods for fusing evolutionary information, the method has a clearer biological meaning. Representing a protein by the proteome it is most likely to evolve into can significantly increase the success rate of the relevant predictors, and the method has broad application prospects.

Description

A novel protein sequence representation method fusing genetic information

Technical field

The present invention relates to the technical fields of bioinformatics, pseudo amino acid composition of proteins, and traditional protein sequence analysis, and in particular to a novel protein sequence representation method that fuses genetic information.

Background technology

With the completion of human genome sequencing, bioinformatics has entered a new stage of development: the post-genome era. Genome projects have produced hundreds of millions of genomic sequences. How to find in these sequences the answers to a series of questions, such as how life originated, how it evolved, and how these genes make living organisms active, is a focus of current research. These gene sequences can be analyzed at many levels, such as base sequences, proteins, and genomes. Because many biological phenotypic traits and gene regulation are determined by the amino acid sequences of proteins, amino acid sequence analysis has certain advantages.

A protein sequence is a one-dimensional string composed of 20 kinds of amino acids, and it is very difficult to reveal the biological properties hidden in it. For this reason, many kinds of pseudo amino acid composition have been designed to describe protein sequences as vectors. Some of these, such as dipeptide composition, tripeptide composition, grey theory factors, and complexity factors, describe the local amino acid order information of a protein sequence well, while others describe its global amino acid order information well. All have played a positive role in sequence-based prediction of protein structure and function classes.

All living species descend from a limited number of ancient ancestral forms, and likewise existing proteins evolved from a number of simple proteins. Evolution involves base insertions or deletions, mutations, duplications, and fusions with other genes. As evolution proceeds, the similarity between sequences becomes lower and lower, yet the corresponding proteins mostly retain the same characteristics, such as the same biological function, three-dimensional structure, and subcellular localization. Extracting this sequence evolution information to form protein description vectors has therefore become a research focus.

Current methods for fusing protein evolutionary information are generally based on the PSSM. Because protein sequence lengths vary, the PSSM is a matrix with a variable number of rows, L (the protein sequence length), and a fixed number of columns, 20. Because existing machine learning methods require inputs of identical dimension, conventional methods all convert the PSSM into a vector of fixed dimension. Method 1 sums the rows of the PSSM and divides by L, giving a 20-dimensional vector that represents the protein sequence. Method 2 sums all rows of the PSSM that correspond to the same amino acid and divides by the number of occurrences of that amino acid in the sequence, giving a 20-dimensional vector; since amino acid sequences are composed of 20 kinds of amino acids, this yields a 20 × 20-dimensional vector representing the protein. Method 3 first standardizes the PSSM and then computes PSSM^T × PSSM to obtain a 20 × 20 matrix; because this matrix is symmetric positive semidefinite, only 210 of its elements (the upper triangle including the diagonal) are usually needed to represent protein P. The inventor has proposed Grey-PSSM, a new model for extracting PSSM information based on grey theory. It fits a grey model GM(2,1) to the values of each column of the PSSM, obtaining two development coefficients and one interference coefficient, so that the PSSM is converted into a vector of 3 × 20 = 60 dimensions.
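The three conventional reductions described above can be sketched in a few lines. This is only an illustrative sketch, not code from the patent: the function names are hypothetical, the standardization step of Method 3 is omitted because the text does not specify it, and filling absent residue types with zeros in Method 2 is an assumption.

```python
from typing import List, Sequence

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # PSSM column order used in this document


def method1_column_means(pssm: Sequence[Sequence[float]]) -> List[float]:
    """Method 1: sum the L rows and divide by L (20 values)."""
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(20)]


def method2_grouped_means(sequence: str, pssm: Sequence[Sequence[float]]) -> List[float]:
    """Method 2: average the rows belonging to each residue type (20 x 20 = 400 values)."""
    features: List[float] = []
    for aa in AMINO_ACIDS:
        rows = [row for residue, row in zip(sequence, pssm) if residue == aa]
        if rows:
            features.extend(sum(r[j] for r in rows) / len(rows) for j in range(20))
        else:
            # Assumption: residue types absent from the sequence contribute zeros.
            features.extend([0.0] * 20)
    return features


def method3_gram_upper(pssm: Sequence[Sequence[float]]) -> List[float]:
    """Method 3: PSSM^T x PSSM is symmetric, so keep its upper triangle (210 values)."""
    gram = [[sum(row[a] * row[b] for row in pssm) for b in range(20)] for a in range(20)]
    return [gram[a][b] for a in range(20) for b in range(a, 20)]
```

All three outputs have a length that does not depend on L, which is what makes them usable as machine learning inputs. None of them keeps the order of residues along the sequence.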

The above methods all rely on simple summation statistics over the PSSM or on grey model fitting. Although they extract some information, they inevitably lose the order information of the amino acids in the protein sequence, and the operations have no corresponding biological meaning, so the genetic information contained in the PSSM may be lost. Given the importance of genetic information, designing a new protein sequence description method that fuses genetic information is highly necessary for predicting protein function and structure types from sequence information.

Summary of the invention

The technical problem to be solved by the present invention is to provide a novel protein sequence representation method fusing genetic information. The method fuses protein evolutionary information by expanding directly from the sequence and forms a new vector description of protein P, so as to solve the problem of low success rates in protein secondary structure type prediction and subcellular localization prediction.

To solve the above technical problem, the technical solution of the present invention is a novel protein sequence representation method fusing genetic information, characterized by comprising the following steps:

(1) use the PSI-BLAST program to search the Swiss-Prot database and generate the position-specific scoring matrix (PSSM) of protein sequence P;

(2) compare protein P with the protein sequences in the NCBI database to find the conserved sequence of protein P;

(3) the PSSM gives the probability that the amino acid at each position of protein sequence P mutates into each other amino acid; keep the amino acids at the conserved sequence positions of the protein unchanged, and convert the amino acids in the non-conserved regions into other amino acids successively, in order of their mutation probability, thereby obtaining 20 virtual proteins that contain the genetic information of protein P;

(4) take the first n protein sequences of these 20 virtual proteins to form, together with P, the proteome describing protein sequence P;

(5) apply a pseudo amino acid composition feature extraction method to each of the n+1 proteins in the resulting proteome to obtain its vector description, and combine these n+1 vectors to obtain the final vector description of protein P.

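The position-specific scoring matrix (PSSM) of the protein sequence P is expressed as:

$$
P_{\mathrm{PSSM}} =
\begin{bmatrix}
E_{1\to 1} & E_{1\to 2} & \cdots & E_{1\to 20} \\
E_{2\to 1} & E_{2\to 2} & \cdots & E_{2\to 20} \\
\vdots & \vdots & \ddots & \vdots \\
E_{L\to 1} & E_{L\to 2} & \cdots & E_{L\to 20}
\end{bmatrix}
$$

where L is the length of the protein sequence and $E_{i\to j}$ represents the likelihood that the amino acid at position i of the protein sequence mutates into the j-th amino acid during protein evolution; the larger the value, the more likely the change. The index j runs from 1 to 20 and denotes the amino acids A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, and V respectively.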

The method is used in protein secondary structure type prediction and subcellular localization prediction, where the success rate of the relevant predictors improves by 4% to 7%.

Compared with existing methods for fusing evolutionary information, the method proposed by the present invention has a clearer biological meaning. It represents a protein by the proteome into which it is most likely to evolve. These proteins do not have high sequence homology with one another but are more likely to share the same structure and function. This helps in predicting the structure and function types of proteins that have low sequence similarity to the proteins in the training set but are remotely homologous to them. Used in protein secondary structure type prediction and subcellular localization prediction, the method can significantly improve the success rate of the relevant predictors and has broad application prospects.

Embodiment

To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment. It should be understood that the specific embodiment described here only serves to explain the present invention and is not intended to limit it.

The concrete steps of the novel protein sequence representation method fusing genetic information are as follows:

1) use the PSI-BLAST program to search the Swiss-Prot database and generate the position-specific scoring matrix (PSSM) of protein sequence P;

Given the human protein:

>AAA61157

MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSHFNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV

To calculate its position-specific scoring matrix (PSSM), BLAST is first set up locally: (1) download BLAST from NCBI and configure it locally (version used: blast-2.2.28+); (2) download the protein database from http://www.uniprot.org/ (UniProtKB/Swiss-Prot database, Release 2013_10); (3) set the parameters (-num_iterations: 3, -evalue: 0.001).
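The local PSI-BLAST run can be prepared roughly as follows. This is a sketch under stated assumptions: `-num_iterations 3` and `-evalue 0.001` come from the text, while `-query`, `-db`, and `-out_ascii_pssm` are the BLAST+ `psiblast` options assumed here for supplying the input and writing the matrix. The file and database names are placeholders, and the Swiss-Prot FASTA file is assumed to have been formatted beforehand (for example with `makeblastdb`).

```python
from typing import List


def build_psiblast_command(query_fasta: str, database: str, pssm_out: str) -> List[str]:
    """Return the argument list for a PSI-BLAST run that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", "3",      # value stated in the text
        "-evalue", "0.001",          # value stated in the text
        "-out_ascii_pssm", pssm_out,
    ]


if __name__ == "__main__":
    # Placeholder file and database names, not taken from the patent.
    command = build_psiblast_command("AAA61157.fasta", "swissprot_db", "AAA61157.pssm")
    print(" ".join(command))
    # To execute it for real: import subprocess; subprocess.run(command, check=True)
```

Building the argument list separately from running it makes it easy to inspect the parameters before launching a search that can take a while.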

Running the PSI-BLAST program of BLAST-2.2.28+ gives the PSSM of the above protein. In this matrix, the first column represents the likelihood that an amino acid of the original protein sequence converts to amino acid A, the second column the likelihood that it converts to amino acid R, and likewise the 3rd to 20th columns represent the likelihood of converting to amino acids N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, and V respectively. The first row of the PSSM corresponds to the first amino acid of the protein sequence, the second row to the amino acid at position 2, and so on.

2) compare protein P with the protein sequences in the NCBI database to find the conserved sequence of protein P;
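A small reader for the resulting matrix might look like the sketch below. It assumes the ASCII PSSM layout written by PSI-BLAST, in which each data row starts with the position number and the residue letter, followed by the 20 integer scores in the column order A, R, N, ..., V and then further statistics columns. The function name is hypothetical, and the trailing statistics in the sample row are dummy values.

```python
from typing import List, Tuple


def parse_ascii_pssm(text: str) -> Tuple[str, List[List[int]]]:
    """Return (sequence, L x 20 score matrix) from ASCII PSSM text."""
    sequence: List[str] = []
    matrix: List[List[int]] = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows: position number, one-letter residue, then at least 20 scores.
        if (len(fields) >= 22 and fields[0].isdigit()
                and len(fields[1]) == 1 and fields[1].isalpha()):
            sequence.append(fields[1])
            matrix.append([int(value) for value in fields[2:22]])
    return "".join(sequence), matrix


# One data row built from the first PSSM row quoted in this document;
# the 20 percentages and 2 trailing statistics are dummy placeholders.
SAMPLE = (
    "           A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V\n"
    "    1 M   -2 -2 -3 -4 -2 -2 -3 -4 -3  1  4 -2  6  0 -3 -2 -1 -2 -1  0"
    + "  0" * 20 + "  0.00 0.00\n"
)

if __name__ == "__main__":
    seq, pssm = parse_ascii_pssm(SAMPLE)
    print(seq, pssm[0])
```

The header line is skipped automatically because its first field is a letter rather than a position number.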

Submit the AAA61157 sequence to the web page:

http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

This page provides a conserved-sequence search. With the default parameter values provided by the website, two conserved segments are found for sequence AAA61157: one at positions 44-83 and another at positions 47-121, which together cover positions 44-121. In the sequence below, positions 1-43 and 122-160 are non-conserved regions and positions 44-121 are the conserved region; the space marks where the conserved region begins:

MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSH FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV
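Combining the two reported segments into a single conserved region is an interval merge. The sketch below is an illustration with hypothetical function names; treating directly adjacent segments as one region is an assumption.

```python
from typing import List, Set, Tuple


def merge_intervals(intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or directly adjacent segment: extend the previous region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def conserved_positions(intervals: List[Tuple[int, int]]) -> Set[int]:
    """Return the set of 1-based positions covered by any conserved segment."""
    return {pos for start, end in merge_intervals(intervals)
            for pos in range(start, end + 1)}


if __name__ == "__main__":
    print(merge_intervals([(44, 83), (47, 121)]))  # [(44, 121)]
```

For AAA61157 the merged region covers 78 positions, leaving positions 1-43 and 122-160 free to vary.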

3) construct the evolved protein sequences of protein P from its PSSM and its conserved sequence information;

The PSSM gives the probability that the amino acid at each position of protein sequence P mutates into each other amino acid. Keep the amino acids at the conserved sequence positions of the protein unchanged, and convert the amino acids in the non-conserved regions into other amino acids successively, in order of their mutation probability. This yields 20 virtual proteins that contain the genetic information of protein P.

For example, the first row of the PSSM of protein sequence AAA61157 is [-2, -2, -3, -4, -2, -2, -3, -4, -3, 1, 4, -2, 6, 0, -3, -2, -1, -2, -1, 0]. The largest of these 20 values is 6, in column 13 (amino acid M), which means that the first amino acid of AAA61157 is most likely to convert to M; the second largest is 4, in column 11 (amino acid L), so conversion to L is the second most likely.
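The ranking of one PSSM row can be reproduced with a short sketch. The function name is hypothetical, and breaking ties by column order is an assumption, since the text does not say how equal scores are ordered.

```python
from typing import List, Sequence

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # PSSM column order used in this document


def rank_substitutions(row: Sequence[int]) -> List[str]:
    """Return the 20 amino acids ordered from most to least likely for one row."""
    # sorted() is stable, so equal scores keep their column order (an assumption).
    order = sorted(range(20), key=lambda j: -row[j])
    return [AMINO_ACIDS[j] for j in order]


# First PSSM row of AAA61157 as quoted in the text.
first_row = [-2, -2, -3, -4, -2, -2, -3, -4, -3, 1, 4, -2, 6, 0, -3, -2, -1, -2, -1, 0]
ranking = rank_substitutions(first_row)

if __name__ == "__main__":
    print(ranking[:3])  # ['M', 'L', 'I']
```

The top three letters agree with the first residues of the three evolved sequences listed in this embodiment (M, L, and I).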

According to the above method, the sequence that AAA61157 is most likely to evolve into is:

MVPTAWQLAMLCAGCLICSCQSCDNCTAPDPTEPPERPAWRGH FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCHKRKRCRWCRQYECKEEEPEKLLRQENGCCHSETVV

The second most likely sequence is:

LLASWGHYMLMALFIVLPAGEALEDSPEALSNDDDHAAKVTSS FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIYYYRWKRHKEHYKERIGEHPKRRTIQKGRTSNANADNIM

The third most likely sequence is:

IISAGARCCCFSGHTPGAPDHCEPSTSPYMNPSETIEHRFQNR FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLICRHCRYYQRKKQPNNLKRANRNNAMIQSGSAMGKGQSLI
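One way to read step 3 is that the k-th virtual protein takes, at every non-conserved position, the amino acid with the k-th highest PSSM score in that row. The sketch below implements that reading; it is an interpretation inferred from the example sequences, with hypothetical function names, and ties are again broken by column order as an assumption.

```python
from typing import List, Sequence, Set

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # PSSM column order used in this document


def rank_columns(row: Sequence[int]) -> List[int]:
    """Column indices of one PSSM row, highest score first (stable on ties)."""
    return sorted(range(20), key=lambda j: -row[j])


def virtual_proteins(sequence: str,
                     pssm: Sequence[Sequence[int]],
                     conserved: Set[int]) -> List[str]:
    """Return 20 virtual proteins; `conserved` holds 1-based positions to keep."""
    rankings = [rank_columns(row) for row in pssm]
    proteins: List[str] = []
    for k in range(20):
        residues: List[str] = []
        for pos, original in enumerate(sequence, start=1):
            if pos in conserved:
                residues.append(original)                      # conserved: unchanged
            else:
                residues.append(AMINO_ACIDS[rankings[pos - 1][k]])  # k-th ranked residue
        proteins.append("".join(residues))
    return proteins


if __name__ == "__main__":
    # Toy three-residue example; only the first row comes from the document.
    toy_pssm = [
        [-2, -2, -3, -4, -2, -2, -3, -4, -3, 1, 4, -2, 6, 0, -3, -2, -1, -2, -1, 0],
        [0] * 20,
        [3] + [-1] * 18 + [5],
    ]
    print(virtual_proteins("MFV", toy_pssm, conserved={2})[:3])
```

Because the highest-scoring residue is often the original one, the first virtual protein can coincide with P at many non-conserved positions, as the first evolved sequence of AAA61157 does at position 1.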

4) take the first 2 protein sequences of these 20 virtual proteins (the sequences that protein P is most likely to evolve into) to form the proteome describing protein sequence P;

Protein P and the first 2 virtual proteins form the proteome of protein P:

MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSH FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV

MVPTAWQLAMLCAGCLICSCQSCDNCTAPDPTEPPERPAWRGH FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCHKRKRCRWCRQYECKEEEPEKLLRQENGCCHSETVV

LLASWGHYMLMALFIVLPAGEALEDSPEALSNDDDHAAKVTSS FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIYYYRWKRHKEHYKERIGEHPKRRTIQKGRTSNANADNIM
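Assembling the proteome and checking that the conserved region is untouched can be sketched as follows. The sequences are copied from this embodiment with the marker space removed; the function names are hypothetical and the check itself is an added sanity test, not a step of the claimed method.

```python
from typing import List

CORE = ("FNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAAS"
        "QKKQAITALVVVSIVALAVLIITCVLI")  # positions 44-121

ORIGINAL = ("MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSH" + CORE
            + "HCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV")
VIRTUAL = [
    "MVPTAWQLAMLCAGCLICSCQSCDNCTAPDPTEPPERPAWRGH" + CORE
    + "HCCHKRKRCRWCRQYECKEEEPEKLLRQENGCCHSETVV",
    "LLASWGHYMLMALFIVLPAGEALEDSPEALSNDDDHAAKVTSS" + CORE
    + "YYYRWKRHKEHYKERIGEHPKRRTIQKGRTSNANADNIM",
    "IISAGARCCCFSGHTPGAPDHCEPSTSPYMNPSETIEHRFQNR" + CORE
    + "CRHCRYYQRKKQPNNLKRANRNNAMIQSGSAMGKGQSLI",
]


def build_proteome(original: str, virtual: List[str], n: int) -> List[str]:
    """Proteome = protein P followed by its first n virtual proteins."""
    return [original] + virtual[:n]


def conserved_region_preserved(proteome: List[str], start: int, end: int) -> bool:
    """True if 1-based positions start..end are identical in every member."""
    reference = proteome[0][start - 1:end]
    return all(member[start - 1:end] == reference for member in proteome)


proteome = build_proteome(ORIGINAL, VIRTUAL, n=2)

if __name__ == "__main__":
    print(len(proteome), [len(p) for p in proteome])
    print(conserved_region_preserved(proteome, 44, 121))
```

With n = 2 the proteome has n + 1 = 3 members, matching the three sequences passed on to feature extraction in the next step.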

5) apply a pseudo amino acid composition feature extraction method to each sequence of the resulting proteome to obtain its vector description, and combine the resulting vectors to obtain the final vector description of protein P;

Submit the above 3 sequences to the web page http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/# with PseAA mode set to Type 1, all amino acid characters selected, weight factor 0.05, and lambda parameter 5. This gives the description vectors of the three proteins.

The first sequence gives: [10.435, 6.522, 2.174, 2.609, 2.174, 2.174, 4.348, 3.043, 2.609, 7.391, 0.435, 0.870, 3.043, 2.609, 2.609, 4.782, 3.043, 7.826, 0.435, 0.435, 5.680, 5.781, 6.153, 6.494, 6.328]

The second gives: [6.727, 8.409, 2.523, 4.625, 1.682, 2.102, 3.784, 2.523, 2.943, 5.045, 0.841, 1.261, 3.784, 2.943, 3.784, 2.943, 3.363, 5.886, 1.261, 0.841, 6.505, 6.620, 6.041, 6.802, 6.765]

The third gives: [8.752, 3.063, 3.939, 3.501, 2.188, 2.626, 4.376, 3.939, 3.939, 6.127, 1.313, 2.188, 2.188, 1.750, 3.501, 4.376, 3.063, 5.689, 0.875, 2.626, 5.782, 5.691, 5.771, 6.173, 6.565]

These three 25-dimensional vectors are combined into one 75-dimensional vector that represents the original protein.
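The final fusion step is a plain concatenation, sketched below. The values are the three vectors of this embodiment read as 25 three-decimal numbers each; that each vector sums to roughly 100 is an observation used here as a sanity check on that reading, not a statement from the patent. The function name is hypothetical.

```python
from typing import List, Sequence

V1 = [10.435, 6.522, 2.174, 2.609, 2.174, 2.174, 4.348, 3.043, 2.609, 7.391,
      0.435, 0.870, 3.043, 2.609, 2.609, 4.782, 3.043, 7.826, 0.435, 0.435,
      5.680, 5.781, 6.153, 6.494, 6.328]
V2 = [6.727, 8.409, 2.523, 4.625, 1.682, 2.102, 3.784, 2.523, 2.943, 5.045,
      0.841, 1.261, 3.784, 2.943, 3.784, 2.943, 3.363, 5.886, 1.261, 0.841,
      6.505, 6.620, 6.041, 6.802, 6.765]
V3 = [8.752, 3.063, 3.939, 3.501, 2.188, 2.626, 4.376, 3.939, 3.939, 6.127,
      1.313, 2.188, 2.188, 1.750, 3.501, 4.376, 3.063, 5.689, 0.875, 2.626,
      5.782, 5.691, 5.771, 6.173, 6.565]


def fuse_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-protein feature vectors in proteome order."""
    return [value for vector in vectors for value in vector]


feature = fuse_vectors([V1, V2, V3])

if __name__ == "__main__":
    print(len(feature))                              # 75
    print([round(sum(v), 3) for v in (V1, V2, V3)])  # each close to 100
```

With n virtual proteins and λ = 5, the fused vector has (n + 1) × 25 dimensions, so a larger n gives the predictor more evolutionary context at the cost of a longer input.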

This method improves on existing protein sequence description methods. When it is used in protein secondary structure type prediction and subcellular localization prediction, the success rate of the relevant predictors improves by 5%.

The above is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A novel protein sequence representation method fusing genetic information, characterized by comprising the following steps:

2. The novel protein sequence representation method fusing genetic information according to claim 1, characterized in that the position-specific scoring matrix (PSSM) of the protein sequence P is expressed as:

3. The novel protein sequence representation method fusing genetic information according to claim 1, characterized in that the method is used in protein secondary structure type prediction and subcellular localization prediction, and the success rate of the relevant predictors improves by 4% to 7%.