CN106599611A

CN106599611A - Marking method and system for protein functions

Info

Publication number: CN106599611A
Application number: CN201611128108.4A
Authority: CN
Inventors: 邓磊; 曾丞
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2016-12-09
Filing date: 2016-12-09
Publication date: 2017-04-26
Anticipated expiration: 2036-12-09
Also published as: CN106599611B

Abstract

The invention relates to the technical field of bioinformatics and discloses a method and system for annotating protein functions, which improve annotation performance and address the high cost and low efficiency of biological experimental methods. The method comprises the following steps: estimating a first probability that a given function occurs in a query protein from the distribution of that function among first-level and second-level structure neighbors; estimating a second probability that the function occurs in the query protein from the distribution of that function among all homologous sequences; inputting the PSSM (position-specific scoring matrix) of the query protein into an SVM prediction model to obtain a third probability that the function occurs in the query protein; using cross-species gene co-expression scores to convert the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species; and fusing the first, second, third and fourth probabilities to estimate the overall probability that the function occurs in the query protein.

Description

Protein function annotation method and system

Technical field

The present invention relates to the technical field of bioinformatics, and more particularly to a protein function annotation method and system.

Background technology

Proteins are the material basis of all life and the final effectors and direct executors of biological activity. They take part in almost every vital process in an organism, such as heredity, development, reproduction, the metabolism of matter and energy, stress response, thinking and memory. A protein is built from 20 different amino acid residues joined by peptide bonds, and it acquires its biological activity and function only after folding into a specific spatial conformation. From a physiological point of view, protein function includes enzymatic catalysis, transport and storage of substances, nutrient storage, coordinated movement, mechanical support, immune protection, signal reception and transduction, and the control of growth and differentiation. Interest in protein function is also driven largely by the many links between proteins and human health: the great majority of hereditary diseases discovered so far are caused by gene mutations that disrupt the function of the encoded protein. For example, recessive phenylketonuria (PKU) is caused by a deficiency of phenylalanine hydroxylase; albinism is caused by a congenital lack of tyrosinase, or by reduced tyrosinase activity, which impairs melanin production; and hereditary cystic fibrosis (CF) is associated with loss of function of a chloride channel regulator located in the plasma membrane.

Determining the function of unknown proteins is important for understanding how organisms change under physiological or pathological conditions, and for disease prevention and drug development. Experimental techniques for identifying protein function mainly include gel electrophoresis, the yeast two-hybrid method, tandem affinity purification (TAP), fluorescence resonance energy transfer (FRET), protein chip technology and immunoelectron microscopy (IEM). Although these methods can determine the function of an unknown protein accurately, their complex experimental design, high cost and long turnaround restrict them to small-scale experiments, so they cannot meet the need for genome-wide protein function annotation. To date, the whole-genome sequences of more than 3000 cellular organisms have been determined, and publicly accessible databases hold more than 5,000,000 non-redundant protein sequences. Determining the function of all these proteins by biological experiment would be a very time-consuming and expensive task, so experimental annotation cannot keep pace with the growth of protein sequence data. At present, only about 20%, 7%, 10% and 1% of the proteins of human, mouse, fruit fly and nematode, respectively, have experimentally supported function annotations (the TAS evidence code of Gene Ontology). In view of this, scientists are increasingly turning to computational methods to support the annotation of the huge volume of sequence and structure data.

Existing computational techniques for protein function prediction include BLAST, ESG and Argot2, which rely mainly on sequence homology information. Function transfer based on sequence homology is currently the mainstream approach to protein function prediction, but its accuracy and coverage are limited. Inferring protein function from sequence is reasonably accurate only when the sequences are highly similar; when sequence similarity falls below 30%, the accuracy of homology-based function prediction drops sharply.

Summary of the invention

The object of the present invention is to disclose a protein function annotation method and system that improve the performance of protein annotation and address the high cost and low efficiency of biological experimental methods.

To achieve the above object, the present invention discloses a protein function annotation method, comprising:

Step S1: searching for first-level structure neighbors according to the representative structure of the query protein;

Step S2: searching for homologous sequences of the query protein, and finding second-level structure neighbors of the query protein according to the representative structures of the homologous sequences;

Step S3: assessing a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structure neighbors; and assessing a second probability that the function occurs in the query protein according to the distribution of that function among all the homologous sequences;

Step S4: building an SVM prediction model that predicts the function from a PSSM, and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;

Step S5: according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, calculating gene co-expression scores between the corresponding orthologous genes in other species, and, according to the gene co-expression scores, converting the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species;

Step S6: fusing the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.

Corresponding to the above method, the invention also discloses a protein function annotation system, comprising:

A first processing module, configured to search for first-level structure neighbors according to the representative structure of the query protein;

A second processing module, configured to search for homologous sequences of the query protein and to find second-level structure neighbors of the query protein according to the representative structures of the homologous sequences;

A third processing module, configured to assess a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structure neighbors, and to assess a second probability that the function occurs in the query protein according to the distribution of that function among all the homologous sequences;

A fourth processing module, configured to build an SVM prediction model that predicts the function from a PSSM, and to input the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;

A fifth processing module, configured to calculate, according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, gene co-expression scores between the corresponding orthologous genes in other species, and, according to the gene co-expression scores, to convert the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species;

A sixth processing module, configured to fuse the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.

The invention has the following advantages:

The overall probability that a given function occurs in the query protein is obtained by assessing and fusing evidence from structure, sequence, PSSM and cross-species co-expression information, which improves the performance of protein annotation. The approach extends naturally to a likelihood value for every candidate function of the query protein, and it addresses the high cost and low efficiency of biological experimental methods.

The present invention is described in further detail below with reference to the accompanying drawings.

Description of the drawings

The accompanying drawings, which form part of this application, provide a further understanding of the invention. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flow chart of the protein function annotation method disclosed in an embodiment of the invention;

Fig. 2 is a flow chart of the structure alignment algorithm used to search for structure neighbors in an embodiment of the invention;

Fig. 3 shows ROC curves on the GOA-PDB data set for the invention (PredGO) and for 8 other methods, each based on a different data source. The 8 data sources are structural information (Str), co-expression information (Coexpression), phylogenetic profiles (Phylogenetics), the position-specific scoring matrix (PSSM), sequence trigram information (Trigram), interaction information (PPI), functional domain information (Interpro) and orthology information (Orthology); a higher ROC curve indicates better prediction performance;

Fig. 4 compares the ROC curves of the invention (PredGO) and three other function prediction methods (BLAST, Argot2 and Str) on the CAFA data set; a higher ROC curve indicates better prediction performance;

Fig. 5 compares the maximum F value (Fmax) of the invention (PredGO) and other existing function prediction methods (BLAST, Jones-UCL, Argot2, ESG and Str) on the CAFA data set; a higher maximum F value indicates better prediction performance.

Detailed description of the embodiments

Embodiments of the invention are described in detail below with reference to the accompanying drawings, but the invention can be implemented in many different ways as defined and covered by the claims.

Embodiment 1

The present embodiment discloses a protein function annotation method which, as shown in Fig. 1, comprises:

Step S1: searching for first-level structure neighbors according to the representative structure of the query protein.

In this step, the function of a protein can be inferred from structural similarity, in particular remote three-dimensional structural similarity. For a given query protein Q, a representative structure M is obtained from the PDB or from a homology model database, and a structure neighbor search algorithm finds all structure neighbors of M in the protein structure library; these form the first-level structure neighbors (N1, N2, ...).

Step S2: searching for homologous sequences of the query protein, and finding second-level structure neighbors of the query protein according to the representative structures of the homologous sequences.

In this step, the sequence alignment tool PSI-BLAST (one iteration) can be used to find all homologous sequences (H1, H2, ...) in the PDB. For each homologous sequence Hi, the structure neighbors of that sequence are found by a procedure similar to step S1, and together they form the second-level structure neighbors of the query protein.

Step S3: assessing a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structure neighbors; and assessing a second probability that the function occurs in the query protein according to the distribution of that function among all the homologous sequences.

In this step, because the first-level structure neighbors are determined directly from the query protein's own representative structure, they reflect the probability that a function occurs in the query protein better than the second-level structure neighbors do. The present embodiment can therefore combine the function distributions of the first-level and second-level structure neighbors with the following formula to assess the probability that the function occurs in the query protein:

S(f_a) = w \cdot \sum_{i=1}^{N_s} (1 - P_i) \cdot I_{S_i}(f_a) + (1 - w) \cdot \sum_{i=1}^{N_{seq}} \max_{1 \le j \le N_{s_i}} \left( E_i \cdot (1 - P_{ij}) \cdot I_{S_{ij}}(f_a) \right)

where P_i is the structure distance between the query protein and the i-th first-level structure neighbor S_i; I_{S_i}(f_a) is 1 if S_i has function f_a and 0 otherwise; w is a weight; N_s is the number of first-level structure neighbors; N_seq is the number of homologous sequences; E_i is the sequence similarity between the i-th homologous sequence and the query protein; P_ij is the structure distance between the i-th homologous sequence and its j-th second-level structure neighbor S_ij; N_si is the number of second-level structure neighbors of the i-th homologous sequence; and, likewise, I_{S_ij}(f_a) is 1 if S_ij has function f_a and 0 otherwise.
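
The first-probability score can be illustrated with a minimal Python sketch. The function name, the tuple-based data layout and the default weight w = 0.7 are assumptions made for illustration only; the source does not give a value for w. Structure distances and sequence similarities are assumed to lie in [0, 1].

```python
from typing import List, Set, Tuple

# (structure distance P, set of functions annotated on that neighbor)
Neighbor = Tuple[float, Set[str]]
# (sequence similarity E_i to the query, second-level neighbors of homolog i)
Homolog = Tuple[float, List[Neighbor]]


def first_probability_score(
    function: str,
    first_level: List[Neighbor],
    homologs: List[Homolog],
    w: float = 0.7,  # hypothetical weight, not specified in the source
) -> float:
    """Score S(f_a) combining first- and second-level structure neighbors."""
    # First term: sum of (1 - P_i) over first-level neighbors that carry the function.
    direct = sum(1.0 - p for p, funcs in first_level if function in funcs)

    # Second term: for each homolog, keep only its best second-level neighbor.
    indirect = 0.0
    for e_i, neighbors in homologs:
        indirect += max(
            (e_i * (1.0 - p_ij) for p_ij, funcs in neighbors if function in funcs),
            default=0.0,  # no neighbor of this homolog carries the function
        )
    return w * direct + (1.0 - w) * indirect


first_level = [(0.2, {"GO:A"}), (0.5, {"GO:B"})]
homologs = [
    (0.9, [(0.1, {"GO:A"}), (0.4, {"GO:A"})]),
    (0.5, [(0.3, {"GO:B"})]),
]
score_a = first_probability_score("GO:A", first_level, homologs)
```

Taking the maximum per homolog, rather than a sum, keeps a homolog with many near-identical neighbors from dominating the second term.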

Optionally, for a given representative structure, the set of structure neighbors can be built by structure alignment using the two-stage method shown in Fig. 2. The first stage uses a double dynamic programming algorithm to compare the relative distances and directions of the secondary structure elements (SSEs) of the proteins; because a protein contains relatively few SSEs, this computation is fast. Based on a given structural similarity threshold, potentially similar proteins are selected for the second-stage comparison. The second stage performs a topological alignment of the relevant residue pairs by optimizing an objective function: the goal is to maximize the number of equivalent residue pairs while minimizing the Cα RMSD of the superimposed structures. The residue-residue alignment is optimized with an iterative dynamic programming algorithm and a rigid-body superposition algorithm. The second stage is computationally more expensive than the first, but because the first stage has already filtered out most structures, the overall computing cost is reduced.

Because the structure database (PDB structures with function annotations) contains a very large number of proteins, the proteins in the structure library are clustered by sequence similarity with CD-HIT to reduce the amount of computation, with proteins whose similarity exceeds 60% placed in the same cluster. When the structure alignment method is used to search for structure neighbors, the query structure is compared only with the representative structure of each cluster. If the query structure is sufficiently similar to a representative structure (for example, PSD < 0.6), all proteins in that representative's cluster are regarded as structure neighbors of the query protein. This is justified because proteins with high sequence similarity (> 60%) usually have very similar structures.
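
The cluster-based shortcut can be sketched as follows. This is an illustrative sketch only: the function name and dictionary layout are hypothetical, the PSD values are assumed to have been computed already by some structure alignment tool, and assigning the representative's PSD to every member of its cluster is an assumption not stated in the source.

```python
from typing import Dict, List


def expand_to_structure_neighbors(
    psd_to_representative: Dict[str, float],
    clusters: Dict[str, List[str]],
    psd_cutoff: float = 0.6,  # example threshold quoted in the text
) -> Dict[str, float]:
    """Return {structure id: PSD} for all members of clusters whose
    representative is closer to the query than psd_cutoff."""
    neighbors: Dict[str, float] = {}
    for representative, psd in psd_to_representative.items():
        if psd < psd_cutoff:
            # Every member inherits the representative's distance (assumption).
            for member in clusters.get(representative, [representative]):
                neighbors[member] = psd
    return neighbors


# Hypothetical identifiers, for illustration only.
psd = {"repA": 0.3, "repB": 0.8}
clusters = {"repA": ["repA", "memberA1"], "repB": ["repB"]}
neighbors = expand_to_structure_neighbors(psd, clusters)
```

Only one alignment per cluster is needed, so the number of structure comparisons scales with the number of clusters rather than the number of structures.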

In the present embodiment, second-level structure neighbors are used to uncover more remote functional relationships, particularly when first-level functional relationships are lacking. Finally, the function annotations of the two levels of structure neighbors are combined by the scoring function to predict the function of the query protein.

In addition, in this step, PSI-BLAST can be used to search the UniProtKB/Swiss-Prot database for homologous sequences of the query protein Q. For each homologous sequence H_k, the corresponding function annotations (Gene Ontology terms) are scored using the E-value of the sequence alignment. The score for annotating the query protein with a function Ti, i.e. the second probability, can be computed as:

S(T_i) = \sum_{k=1}^{n} \frac{-\log(E_k) + b}{\sum_{j=1}^{n} \left( -\log(E_j) + b \right)} \cdot Ind_k(T_i)

where E_k is the E-value of homologous sequence H_k, b is the constant log(10), n is the number of homologous sequences, and Ind_k(T_i) is 1 if H_k has function Ti and 0 otherwise.
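
The second-probability score can be sketched in a few lines of Python. Since b = log(10), each weight -log(E_k) + b equals log(10 / E_k), and because the weights are normalized the result does not depend on the base of the logarithm. The function name, the data layout and the floor applied to E-values reported as 0 are assumptions for illustration.

```python
import math
from typing import List, Set, Tuple

# (alignment E-value of homolog H_k, set of functions annotated on H_k)
Hit = Tuple[float, Set[str]]


def second_probability_score(function: str, hits: List[Hit], floor: float = 1e-180) -> float:
    """Normalized E-value-weighted vote S(T_i) over the homologous sequences."""
    # -log(E_k) + log(10) == log(10 / E_k); the floor guards against E = 0
    # (hypothetical safeguard, not part of the source formula).
    weights = [math.log(10.0 / max(e_value, floor)) for e_value, _ in hits]
    total = sum(weights)
    if total <= 0.0:
        return 0.0
    voted = sum(wt for wt, (_, funcs) in zip(weights, hits) if function in funcs)
    return voted / total


hits = [(1e-10, {"GO:A"}), (1e-4, {"GO:B"}), (1e-1, {"GO:A", "GO:B"})]
score_a = second_probability_score("GO:A", hits)
score_b = second_probability_score("GO:B", hits)
```

With these three hits the weights are in the ratio 11 : 5 : 2, so the hit with the smallest E-value contributes most of the score.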

Step S4: building an SVM prediction model that predicts the function from a PSSM (position-specific scoring matrix), and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein.

In this step, a sample set of positive and negative samples is constructed, PSSM-derived features serve as the SVM input, and the SVM prediction model is built, tested and evaluated using a training set and an independent test set drawn from the samples. This is routine for those skilled in the art and is not described further here. Preferably, the present embodiment uses the auto-covariance (AC) transformation to convert the PSSM into a fixed-length feature vector of AC variables, computed as:

AC_{lg,j} = \frac{1}{L - lg} \sum_{i=1}^{L - lg} \left( X_{i,j} - \frac{1}{L} \sum_{i=1}^{L} X_{i,j} \right) \left( X_{(i+lg),j} - \frac{1}{L} \sum_{i=1}^{L} X_{i,j} \right)

where X_{i,j} is the value of descriptor j at sequence position i; j denotes a descriptor, j = 1, 2, ..., D (D is the number of descriptors); i denotes the position in the sequence; L is the length of the amino acid sequence; lg is the lag, lg = 1, 2, ..., LG, where LG is the maximum value of lg; and the total number of AC variables per sequence is LG*D. Based on the AC features, a prediction model is trained with the support vector machine method for each function f_a to carry out function prediction.
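
The auto-covariance transformation can be sketched in plain Python as below. The function name and the list-of-rows PSSM layout (one row per sequence position, one column per descriptor) are assumptions; the sketch covers only the feature extraction and leaves the SVM training to any standard library.

```python
from typing import List


def auto_covariance(pssm: List[List[float]], max_lag: int) -> List[float]:
    """Convert an L x D PSSM into a fixed-length vector of LG * D AC variables."""
    length = len(pssm)
    n_desc = len(pssm[0])
    if not 0 < max_lag < length:
        raise ValueError("max_lag must satisfy 0 < max_lag < sequence length")

    # Column means over the whole sequence: (1/L) * sum_i X[i][j].
    means = [sum(row[j] for row in pssm) / length for j in range(n_desc)]

    features: List[float] = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 4 x 2 matrix; a real PSSM typically has 20 columns.
toy_pssm = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
ac = auto_covariance(toy_pssm, max_lag=2)
```

The output length depends only on LG and D, so sequences of different lengths map to feature vectors of the same size, which is what a fixed-input classifier such as an SVM requires.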

Step S5: according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, calculating gene co-expression scores between the corresponding orthologous genes in other species, and, according to the gene co-expression scores, converting the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species.

This step performs function annotation based on cross-species co-expression. Co-expression data for 11 species (human, nematode, dog, fly, zebrafish, chicken, rhesus monkey, mouse, rat, budding yeast and fission yeast) can be obtained from the COXPRESdb and ArrayExpress databases. The co-expression between any two genes within each species is precomputed as the Pearson correlation coefficient. For a query gene (protein) Q in species 1, P1, P2, ..., Pi are genes with expression similar to that of Q, and Q_oj and Pi_oj are the orthologous proteins of Q and Pi in another species j (species 2, ..., species n). Fusing the co-expression information of orthologous genes in other species improves the reliability and coverage of the gene co-expression relationships. A naive Bayes (NB) method can be used to compute the cross-species gene co-expression score (COXS), which fuses the co-expression relationship between the genes in the target species with the co-expression relationships between the corresponding orthologous genes in other species. Specifically, the gene co-expression score is computed as:

COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) \cdot (1 - w \cdot OS_i)

OS_i = 1 - \prod_{j=2}^{n} (1 - C(Q_{oj}, Pi_{oj}))

where Q is the query gene, Pi is a gene co-expressed with Q, C is the Pearson correlation coefficient of the expression of two genes, w is the weight given to orthologous gene expression, OS_i is the co-expression score between the orthologs (Q_oj, Pi_oj) of Q and Pi in the other species j, and n is the total number of species.
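
A minimal sketch of the COXS calculation follows. The function names, the default weight w = 0.5 and the clipping of negative correlations to 0 are assumptions for illustration; the source does not state a value for w or how negative Pearson coefficients are handled.

```python
from typing import Iterable


def _clip(c: float) -> float:
    """Clip a correlation into [0, 1] (assumption: anti-correlation counts as 0)."""
    return min(1.0, max(0.0, c))


def ortholog_score(ortholog_correlations: Iterable[float]) -> float:
    """OS_i = 1 - prod_j (1 - C(Q_oj, Pi_oj)) over the other species j = 2..n."""
    product = 1.0
    for c in ortholog_correlations:
        product *= 1.0 - _clip(c)
    return 1.0 - product


def coxs(c_target: float, ortholog_correlations: Iterable[float], w: float = 0.5) -> float:
    """COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) * (1 - w * OS_i)."""
    os_i = ortholog_score(ortholog_correlations)
    return 1.0 - (1.0 - _clip(c_target)) * (1.0 - w * os_i)


# Correlation 0.6 in the target species, 0.5 and 0.2 for orthologs in two other species.
score = coxs(0.6, [0.5, 0.2])
```

Both formulas have the noisy-OR form, so support from any additional species can only raise the score, never lower it.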

The steps above each take a function that the query protein may have and compute, for that function, the first, second, third and fourth probabilities.

Step S6: fusing the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.

This step can use a Bayesian network to fuse the structure, sequence, position-specific scoring matrix and cross-species co-expression information, yielding an automatic protein function annotation method (PredGO). A protein that has a given function is defined as positive. Given the number of positive pairs among all protein-function pairs, the prior probability of finding a positive protein-function pair is computed as:

P_{prior} = \frac{P(pos)}{P(neg)} = \frac{P(pos)}{1 - P(pos)}

where P(pos) is the probability that a prediction is correct (positive) and P(neg) is the probability that it is wrong (negative). Correspondingly, the posterior probability is given by:

P_{post} = \frac{P(pos | f_1, ..., f_N)}{P(neg | f_1, ..., f_N)}

where f_1, ..., f_N are the values from the given data sources, here the four data sources given by the first to fourth probabilities.

The likelihood ratio L is defined as:

L(f_1, ..., f_N) = \frac{P(f_1, ..., f_N | pos)}{P(f_1, ..., f_N | neg)}

By Bayes' rule, the prior and posterior probabilities are related by:

P_{post} = L(f_1, ..., f_N) \cdot P_{prior}

Thus, the higher the posterior probability of a function, the more likely the query protein is to have that function.
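
The relation between prior, likelihood ratio and posterior can be illustrated with a short sketch. Note that P_prior and P_post as defined here are odds (ratios of positive to negative probability). The sketch makes a simplifying assumption that the source does not state: the data sources are treated as conditionally independent, so the joint likelihood ratio factorizes into a product of per-source ratios (a naive Bayes simplification of the Bayesian network). All function names and the example numbers are hypothetical.

```python
from typing import Iterable


def prior_odds(n_positive: int, n_total: int) -> float:
    """P_prior = P(pos) / (1 - P(pos)), estimated from counts of protein-function pairs."""
    p_pos = n_positive / n_total
    return p_pos / (1.0 - p_pos)


def posterior_odds(prior: float, likelihood_ratios: Iterable[float]) -> float:
    """P_post = L * P_prior, with L taken as the product of per-source
    likelihood ratios (conditional-independence assumption)."""
    odds = prior
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds


def odds_to_probability(odds: float) -> float:
    """Convert odds back to a probability in [0, 1)."""
    return odds / (1.0 + odds)


# Hypothetical numbers: 100 positive pairs out of 1000, and one likelihood
# ratio per data source (structure, sequence, PSSM, co-expression).
prior = prior_odds(100, 1000)
post = posterior_odds(prior, [3.0, 2.0, 1.5, 1.0])
probability = odds_to_probability(post)
```

A likelihood ratio of 1.0 leaves the odds unchanged, which is a natural value for a data source that has no information about a given protein.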

In a specific experimental evaluation, the method of the invention (PredGO) was compared with other protein function prediction methods on two data sets. The GOA-PDB data set consists of entries newly added to the GOA database between October 2010 and November 2013; each protein has at least one non-IEA function annotation, and after redundancy removal with CD-HIT the set contains 3632 proteins from 256 species. The CAFA 2011 data set is the data set provided by the first protein function annotation challenge (http://biofunctionprediction.org/) and contains 866 proteins from 11 species. On the GOA-PDB data set, as shown in Fig. 3 and Table 1, the integrated method (PredGO), which combines multiple data sources, performs better than any single data source for both molecular function and biological process. On the CAFA data set, Fig. 4 compares the ROC curves of the invention (PredGO) with those of three other function prediction methods, BLAST, Argot2 and Str; PredGO shows better prediction performance (a higher ROC curve indicates better performance). Fig. 5 compares the maximum F value (Fmax) of the invention (PredGO) with that of other existing function prediction methods (BLAST, Jones-UCL, Argot2, ESG and Str) on the CAFA data set; the Fmax of PredGO is markedly higher (a higher Fmax indicates better performance).
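
The source does not define Fmax. The sketch below assumes the protein-centric definition commonly used in the CAFA challenge: at each score threshold, precision is averaged over proteins with at least one prediction at or above the threshold, recall is averaged over all benchmark proteins, and Fmax is the largest resulting F value. The function name, data layout and threshold grid are hypothetical, and every benchmark protein is assumed to have at least one true term.

```python
from typing import Dict, List, Optional, Set


def f_max(
    predictions: Dict[str, Dict[str, float]],  # protein -> {term: score}
    truth: Dict[str, Set[str]],                # protein -> non-empty set of true terms
    thresholds: Optional[List[float]] = None,
) -> float:
    """Maximum protein-centric F value over a grid of score thresholds."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions: List[float] = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with a prediction
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0.0:
            best = max(best, 2.0 * precision * recall / (precision + recall))
    return best


predictions = {"P1": {"A": 0.9, "B": 0.4}, "P2": {"A": 0.6}}
truth = {"P1": {"A"}, "P2": {"B"}}
value = f_max(predictions, truth)
```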

Table 1：

In summary, the protein function annotation method disclosed in the present embodiment assesses and fuses evidence from structure, sequence, PSSM and cross-species co-expression information to obtain the overall probability that a given function occurs in the query protein. This improves the performance of protein annotation, extends naturally to a likelihood value for every candidate function of the query protein, and addresses the high cost and low efficiency of biological experimental methods.

Embodiment 2

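Corresponding to the above method embodiment, the present embodiment discloses a protein function annotation system, comprising:

A first processing module, configured to search for first-level structure neighbors according to the representative structure of the query protein;

A second processing module, configured to search for homologous sequences of the query protein and to find second-level structure neighbors of the query protein according to the representative structures of the homologous sequences;

A third processing module, configured to assess a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structure neighbors, and to assess a second probability that the function occurs in the query protein according to the distribution of that function among all the homologous sequences;

A fourth processing module, configured to build an SVM prediction model that predicts the function from a PSSM, and to input the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;

A fifth processing module, configured to calculate, according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, gene co-expression scores between the corresponding orthologous genes in other species, and, according to the gene co-expression scores, to convert the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species;

A sixth processing module, configured to fuse the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.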

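Optionally, the first probability can be computed as (symbols as defined in Embodiment 1):

S(f_a) = w \cdot \sum_{i=1}^{N_s} (1 - P_i) \cdot I_{S_i}(f_a) + (1 - w) \cdot \sum_{i=1}^{N_{seq}} \max_{1 \le j \le N_{s_i}} \left( E_i \cdot (1 - P_{ij}) \cdot I_{S_{ij}}(f_a) \right)

The second probability can be computed as:

S(T_i) = \sum_{k=1}^{n} \frac{-\log(E_k) + b}{\sum_{j=1}^{n} \left( -\log(E_j) + b \right)} \cdot Ind_k(T_i)

where E_k is the alignment score (E-value) of homologous sequence H_k, b is the constant log(10), n is the number of homologous sequences, and Ind_k(T_i) is 1 if H_k has function Ti and 0 otherwise.

The gene co-expression score can be computed with the following formulas:

COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) \cdot (1 - w \cdot OS_i)

OS_i = 1 - \prod_{j=2}^{n} (1 - C(Q_{oj}, Pi_{oj}))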

In the present embodiment, the specific data processing within each module follows Embodiment 1 above and is not repeated here.

Likewise, the protein function annotation system disclosed in the present embodiment assesses and fuses evidence from structure, sequence, PSSM and cross-species co-expression information to obtain the overall probability that a given function occurs in the query protein. This improves the performance of protein annotation, extends naturally to a likelihood value for every candidate function of the query protein, and addresses the high cost and low efficiency of biological experimental methods.

The foregoing describes only preferred embodiments of the present invention and does not limit it; those skilled in the art may make various modifications and variations to the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A protein function annotation method, characterized by comprising:

Step S1: searching for first-level structure neighbors according to the representative structure of the query protein;

Step S2: searching for homologous sequences of the query protein, and finding second-level structure neighbors of the query protein according to the representative structures of the homologous sequences;

Step S3: assessing a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structure neighbors; and assessing a second probability that the function occurs in the query protein according to the distribution of that function among all the homologous sequences;

Step S4: building an SVM prediction model that predicts the function from a PSSM, and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;

Step S5: according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, calculating gene co-expression scores between the corresponding orthologous genes in other species, and, according to the gene co-expression scores, converting the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species;

Step S6: fusing the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.

2. The protein function annotation method according to claim 1, characterized in that the first probability is computed as:

S(f_a) = w \cdot \sum_{i=1}^{N_s} (1 - P_i) \cdot I_{S_i}(f_a) + (1 - w) \cdot \sum_{i=1}^{N_{seq}} \max_{1 \le j \le N_{s_i}} \left( E_i \cdot (1 - P_{ij}) \cdot I_{S_{ij}}(f_a) \right)

where P_i is the structure distance between the query protein and the i-th first-level structure neighbor S_i; I_{S_i}(f_a) is 1 if S_i has function f_a and 0 otherwise; w is a weight; N_s is the number of first-level structure neighbors; N_seq is the number of homologous sequences; E_i is the sequence similarity between the i-th homologous sequence and the query protein; P_ij is the structure distance between the i-th homologous sequence and its j-th second-level structure neighbor S_ij; N_si is the number of second-level structure neighbors of the i-th homologous sequence; and, likewise, I_{S_ij}(f_a) is 1 if S_ij has function f_a and 0 otherwise.

3. The protein function annotation method according to claim 1, characterized in that the second probability is computed as:

S(T_i) = \sum_{k=1}^{n} \frac{-\log(E_k) + b}{\sum_{j=1}^{n} \left( -\log(E_j) + b \right)} \cdot Ind_k(T_i)

where E_k is the alignment score (E-value) of homologous sequence H_k, b is the constant log(10), n is the number of homologous sequences, and Ind_k(T_i) is 1 if H_k has function Ti and 0 otherwise.

4. The protein function annotation method according to claim 1, characterized in that step S4 comprises using the auto-covariance (AC) transformation to convert the PSSM into a fixed-length feature vector of AC variables, computed as:

AC_{lg,j} = \frac{1}{L - lg} \sum_{i=1}^{L - lg} \left( X_{i,j} - \frac{1}{L} \sum_{i=1}^{L} X_{i,j} \right) \left( X_{(i+lg),j} - \frac{1}{L} \sum_{i=1}^{L} X_{i,j} \right)

where j denotes a descriptor, j = 1, 2, ..., D (D is the number of descriptors); i denotes the position in the sequence; L is the length of the amino acid sequence; lg = 1, 2, ..., LG, where LG is the maximum value of lg; and the total number of AC variables per sequence is LG*D; based on the AC features, a prediction model is trained with the support vector machine method for each function f_a to carry out function prediction.

5. The protein function annotation method according to claim 1, characterized in that calculating the gene co-expression score in step S5 comprises:

COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) \cdot (1 - w \cdot OS_i)

OS_i = 1 - \prod_{j=2}^{n} (1 - C(Q_{oj}, Pi_{oj}))

where Q is the query gene, Pi is a gene co-expressed with Q, C is the Pearson correlation coefficient of the expression of two genes, w is the weight given to orthologous gene expression, OS_i is the co-expression score between the orthologs (Q_oj, Pi_oj) of Q and Pi in species j, and n is the total number of species.

6. The protein function annotation method according to claim 1, characterized in that step S6 comprises:

defining a protein that has a given function as positive and, given the number of positive pairs among all protein-function pairs, computing the prior probability of finding a positive protein-function pair as:

P_{prior} = \frac{P(pos)}{P(neg)} = \frac{P(pos)}{1 - P(pos)}

where P(pos) is the probability that a prediction is correct (positive) and P(neg) is the probability that it is wrong (negative); correspondingly, the posterior probability is given by:

P_{post} = \frac{P(pos | f_1, ..., f_N)}{P(neg | f_1, ..., f_N)}

the likelihood ratio L is defined as:

L(f_1, ..., f_N) = \frac{P(f_1, ..., f_N | pos)}{P(f_1, ..., f_N | neg)}

and, by Bayes' rule, the prior and posterior probabilities are related by:

P_{post} = L(f_1, ..., f_N) \cdot P_{prior}

wherein the higher the posterior probability of a function, the more likely the query protein is to have that function.

7. A protein function annotation system for performing the method according to any one of claims 1 to 6, characterized by comprising:

A first processing module, configured to search for first-level structure neighbors according to the representative structure of the query protein;

A second processing module, configured to search for homologous sequences of the query protein and to find second-level structure neighbors of the query protein according to the representative structures of the homologous sequences;

A third processing module, configured to assess a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structure neighbors, and to assess a second probability that the function occurs in the query protein according to the distribution of that function among all the homologous sequences;

A fourth processing module, configured to build an SVM prediction model that predicts the function from a PSSM, and to input the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;

A fifth processing module, configured to calculate, according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, gene co-expression scores between the corresponding orthologous genes in other species, and, according to the gene co-expression scores, to convert the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species;

A sixth processing module, configured to fuse the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.

8. The protein function annotation system according to claim 7, characterized in that the first probability is computed as:

S(f_a) = w \cdot \sum_{i=1}^{N_s} (1 - P_i) \cdot I_{S_i}(f_a) + (1 - w) \cdot \sum_{i=1}^{N_{seq}} \max_{1 \le j \le N_{s_i}} \left( E_i \cdot (1 - P_{ij}) \cdot I_{S_{ij}}(f_a) \right)

9. The protein function annotation system according to claim 7, characterized in that the second probability is computed as:

S(T_i) = \sum_{k=1}^{n} \frac{-\log(E_k) + b}{\sum_{j=1}^{n} \left( -\log(E_j) + b \right)} \cdot Ind_k(T_i)

10. The protein function annotation system according to claim 7, characterized in that calculating the gene co-expression score comprises:

COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) \cdot (1 - w \cdot OS_i)

OS_i = 1 - \prod_{j=2}^{n} (1 - C(Q_{oj}, Pi_{oj}))