CN106599611A - Marking method and system for protein functions - Google Patents

Marking method and system for protein functions

Info

Publication number
CN106599611A
Authority
CN
China
Prior art keywords
protein
function
checked
probability
gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611128108.4A
Other languages
Chinese (zh)
Other versions
CN106599611B (en
Inventor
邓磊
曾丞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN201611128108.4A
Publication of CN106599611A
Application granted
Publication of CN106599611B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention relates to the technical field of bioinformatics and discloses a method and system for annotating protein functions, which improve annotation performance and address the high cost and low efficiency of experimental biological methods. The method comprises the following steps: estimating a first probability that a given function occurs in a query protein according to its first-level and second-level structural neighbors; estimating a second probability that the function occurs in the query protein according to all of its homologous sequences; inputting the PSSM of the query protein into an SVM prediction model to obtain a third probability that the function occurs in the query protein; using gene co-expression scores to convert the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species; and fusing the first, second, third and fourth probabilities to estimate the overall probability that the function occurs in the query protein.

Description

Protein function annotation method and system
Technical field
The present invention relates to the technical field of bioinformatics, and more particularly to a protein function annotation method and system.
Background
Proteins are the material basis of all life and the final effectors and direct executors of biological activity. They take part in almost every vital process in an organism, such as heredity, development, reproduction, the metabolism of matter and energy, stress response, thinking and memory. A protein is made up of 20 different kinds of amino acid residues linked by peptide bonds, and acquires its biological activity and function only after folding into a specific spatial conformation. From a physiological point of view, protein functions include enzyme catalysis, transport and storage of substances, nutrient storage, motor coordination, mechanical support, immune protection, signal reception and transduction, and the control of growth and differentiation. Interest in protein function is also driven largely by its many links to human health: the great majority of hereditary diseases discovered so far are caused by gene mutations that disrupt the function of the encoded protein. For example, recessive phenylketonuria (PKU) is caused by a deficiency of phenylalanine hydroxylase; albinism is caused by impaired melanin production resulting from a congenital lack of tyrosinase or from reduced tyrosinase activity; and hereditary cystic fibrosis (CF) is associated with loss of function of a chloride channel regulator located on the plasma membrane.
Determining the function of uncharacterized proteins is important for understanding how organisms change under physiological or pathological conditions, and for disease prevention and drug development. Experimental techniques for identifying protein function mainly include gel electrophoresis, the yeast two-hybrid method, tandem affinity purification (TAP), fluorescence resonance energy transfer (FRET), protein chip technology and immunoelectron microscopy (IEM). Although these methods can determine the function of an unknown protein accurately, their experimental design is complex, they are expensive and their turnaround is long, so they are suitable only for small-scale experiments and cannot meet the need for genome-wide protein function annotation. To date, the whole-genome sequences of more than 3000 cellular organisms have been determined, and publicly accessible databases hold more than 5 million non-redundant protein sequences. Determining the functions of these proteins experimentally would be extremely time-consuming and expensive, so experimental annotation cannot keep pace with the growth of protein sequence data. At present, only about 20%, 7%, 10% and 1% of the proteins of human, mouse, fruit fly and nematode, respectively, have experimentally verified function annotations (Gene Ontology annotations with the TAS evidence code). In view of this, scientists are increasingly turning to computational methods to help annotate the huge volume of sequence and structure data.
Existing computational techniques for protein function prediction include BLAST, ESG and Argot2, which rely mainly on sequence homology information. Transferring function on the basis of sequence homology is currently the mainstream approach to protein function prediction, but its prediction accuracy and coverage are not high, which limits its usefulness. Inferring protein function from sequence is reasonably accurate only when the sequences are highly similar; when sequence similarity falls below 30%, the accuracy of homology-based function prediction drops sharply.
Summary of the invention
The object of the present invention is to disclose a protein function annotation method and system that improve the performance of protein annotation and address the high cost and low efficiency of experimental biological methods.
To achieve the above object, the present invention discloses a protein function annotation method, comprising:
Step S1: searching for first-level structural neighbors according to the representative structure of the query protein;
Step S2: searching for homologous sequences of the query protein, and searching for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences;
Step S3: assessing a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structural neighbors; and assessing a second probability that the function occurs in the query protein according to the distribution of that function among all of the homologous sequences;
Step S4: building an SVM prediction model that predicts the function from a PSSM, and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;
Step S5: according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, calculating gene co-expression scores between the corresponding orthologs in other species, and, according to the gene co-expression scores, converting the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species;
Step S6: fusing the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.
Corresponding to the above method, the invention also discloses a protein function annotation system, comprising:
a first processing module, for searching for first-level structural neighbors according to the representative structure of the query protein;
a second processing module, for searching for homologous sequences of the query protein, and searching for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences;
a third processing module, for assessing a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structural neighbors, and for assessing a second probability that the function occurs in the query protein according to the distribution of that function among all of the homologous sequences;
a fourth processing module, for building an SVM prediction model that predicts the function from a PSSM, and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;
a fifth processing module, for calculating, according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, gene co-expression scores between the corresponding orthologs in other species, and, according to the gene co-expression scores, converting the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species;
a sixth processing module, for fusing the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.
The invention has the following advantages:
Structure, sequence, PSSM and cross-species co-expression information are each assessed and then fused to obtain the overall probability that a given function occurs in the query protein. This improves the performance of protein annotation, can be extended to obtain a likelihood value for each function of the query protein, and addresses the high cost and low efficiency of experimental biological methods.
The present invention is described in further detail below with reference to the accompanying drawings.
Description of the drawings
The accompanying drawings, which form part of this application, provide a further understanding of the present invention. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of the protein function annotation method disclosed in an embodiment of the present invention;
Fig. 2 is a flow chart of the structure alignment algorithm used to search for structural neighbors, as disclosed in an embodiment of the present invention;
Fig. 3 shows the ROC curves on the GOA-PDB data set of the present invention (PredGO) and of 8 other methods based on different data sources. The 8 data sources are structural information (Str), co-expression information (Coexpression), phylogenetic profiles (Phylogenetics), position-specific scoring matrices (PSSM), sequence trigram information (Trigram), protein-protein interaction information (PPI), functional domain information (Interpro) and ortholog information (Othology); the higher the ROC curve, the better the prediction performance;
Fig. 4 compares the ROC curves on the CAFA data set of the present invention (PredGO) and of three other function prediction methods, BLAST, Argot2 and Str; the higher the ROC curve, the better the prediction performance;
Fig. 5 compares the maximum F-measure (Fmax) on the CAFA data set of the present invention (PredGO) and of other existing function prediction methods (BLAST, Jones-UCL, Argot2, ESG and Str); a higher Fmax indicates better prediction performance.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings, but the invention can be implemented in many different ways as defined and covered by the claims.
Embodiment 1
This embodiment discloses a protein function annotation method which, as shown in Fig. 1, comprises:
Step S1: searching for first-level structural neighbors according to the representative structure of the query protein.
In this step, the function of a protein can be inferred from structural similarity between proteins, in particular from remote three-dimensional structural similarity. For a given query protein Q, a representative structure M is obtained from the PDB or from a homology model database, and a structural neighbor search algorithm is used to find all structural neighbors of M in the protein structure library; these constitute the first-level structural neighbors (N1, N2, ...).
Step S2: searching for homologous sequences of the query protein, and searching for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences.
In this step, the sequence alignment method PSI-BLAST (one iteration) can be used to find all homologous sequences (H1, H2, ...) in the PDB. For each homologous sequence Hi, its structural neighbors are found in the same way as in step S1, and these constitute the second-level structural neighbors of the query protein.
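The bookkeeping around the homolog search of step S2 could be scripted roughly as follows. This is only a sketch: it parses tabular hit output of the kind BLAST+ programs write with `-outfmt 6` (twelve tab-separated columns, with the subject ID in column 2 and the E-value in column 11). The function name `parse_hits`, the E-value cutoff of 1e-3 and the example hit lines are assumptions made for illustration; the document does not specify them.

```python
from typing import List, Tuple


def parse_hits(tabular: str, max_evalue: float = 1e-3) -> List[Tuple[str, float]]:
    """Return (subject_id, best_evalue) pairs, sorted from best to worst E-value.

    `tabular` is assumed to be 12-column tab-separated hit output.
    The cutoff `max_evalue` is a hypothetical choice, not a value from the patent.
    """
    best = {}
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # not a complete hit line
        subject, evalue = cols[1], float(cols[10])
        # keep only the best (smallest) E-value seen for each subject
        if evalue <= max_evalue and evalue < best.get(subject, float("inf")):
            best[subject] = evalue
    return sorted(best.items(), key=lambda item: item[1])


# Made-up hit lines for a query "Q" (not real search results).
example = (
    "Q\t1ABC_A\t45.0\t200\t110\t0\t1\t200\t1\t200\t1e-50\t180\n"
    "Q\t2XYZ_B\t28.0\t150\t108\t0\t5\t154\t3\t152\t0.5\t30\n"
    "Q\t1ABC_A\t40.0\t100\t60\t0\t1\t100\t1\t100\t1e-10\t70\n"
)
hits = parse_hits(example)
```

Each subject that survives the cutoff would then be mapped to its representative structure and searched for structural neighbors in the same way as in step S1.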
Step S3: assessing a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structural neighbors; and assessing a second probability that the function occurs in the query protein according to the distribution of that function among all of the homologous sequences.
In this step, because the first-level structural neighbors are determined directly from the query protein's own representative structure, they reflect the probability that a given function occurs in the query protein better than the second-level structural neighbors do. For this reason, this embodiment can combine the distribution of the function among the first-level and second-level structural neighbors to assess the probability that the function occurs in the query protein, using the following formula (as recited in claim 2):
$$S(f_a) = w \cdot \sum_{i=1}^{N_s} (1 - P_i) \cdot I_{S_i}(f_a) + (1 - w) \cdot \sum_{i=1}^{N_{seq}} \max_{1 \le j \le N_{si}} \left( E_i \cdot (1 - P_{ij}) \cdot I_{S_{ij}}(f_a) \right)$$
where $P_i$ is the structural distance between the query protein and the i-th first-level structural neighbor $S_i$; $I_{S_i}(f_a)$ is 1 if $S_i$ has function $f_a$ and 0 otherwise; $w$ is a weight; $N_s$ is the number of first-level structural neighbors; $N_{seq}$ is the number of homologous sequences; $E_i$ is the sequence similarity between the i-th homologous sequence and the query protein; $P_{ij}$ is the structural distance between the i-th homologous sequence and its j-th second-level structural neighbor $S_{ij}$; $N_{si}$ is the number of second-level structural neighbors of the i-th homologous sequence; and, likewise, $I_{S_{ij}}(f_a)$ is 1 if $S_{ij}$ has function $f_a$ and 0 otherwise.
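The first-probability score can be sketched in a few lines, following the formula stated in claim 2: a weighted sum over first-level neighbors plus, for each homologous sequence, the best-scoring second-level neighbor. The function name, the data layout and the weight `w = 0.7` are assumptions for illustration; the document does not give a value for the weight.

```python
from typing import Sequence, Tuple

Neighbor = Tuple[float, bool]  # (structural distance, has function f_a?)


def first_probability(
    level1: Sequence[Neighbor],
    level2: Sequence[Tuple[float, Sequence[Neighbor]]],
    w: float = 0.7,  # hypothetical weight, not specified in the document
) -> float:
    """Score S(f_a) for one function f_a.

    level1: (P_i, indicator) for each first-level structural neighbor.
    level2: (E_i, [(P_ij, indicator), ...]) for each homologous sequence.
    """
    direct = sum((1.0 - p_i) * (1.0 if has_f else 0.0) for p_i, has_f in level1)
    indirect = 0.0
    for e_i, neighbors in level2:
        # max over the second-level neighbors of this homologous sequence
        indirect += max(
            (e_i * (1.0 - p_ij) * (1.0 if has_f else 0.0) for p_ij, has_f in neighbors),
            default=0.0,
        )
    return w * direct + (1.0 - w) * indirect


# Toy inputs (made up): three first-level neighbors, two homologous sequences.
level1 = [(0.2, True), (0.5, False), (0.4, True)]
level2 = [(0.9, [(0.1, True), (0.3, True)]), (0.5, [(0.2, False)])]
score = first_probability(level1, level2)
```

Because the first term sums over all first-level neighbors, the raw score is not confined to [0, 1]; any rescaling before fusion would be a further design choice that this text does not describe.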
Optionally, for a given representative structure, the structural neighbor set can be built by a two-stage structure alignment, as shown in Fig. 2. The first stage uses a double dynamic programming algorithm to compare the relative distances and directions of the secondary structure elements (SSEs) of the proteins; because a protein contains relatively few SSEs, this stage is fast. Based on a given structural similarity threshold, potentially similar proteins are selected to enter the second stage. The second stage performs a topological alignment of the related residue pairs by optimizing an objective function. The optimization goal is to maximize the number of equivalent residue pairs while minimizing the Cα RMSD of the superimposed structures. The residue-residue alignment is optimized using iterative dynamic programming and rigid-body superposition. The second stage is more computationally expensive than the first, but because the first stage has filtered out most structures, the overall cost is reduced. Because the structure database (PDB structures with function annotations) contains a very large number of proteins, the proteins in the structure library are clustered by sequence similarity using CD-HIT to reduce the amount of computation, with proteins whose similarity exceeds 60% placed in the same cluster. When searching for structural neighbors by structure alignment, the query structure is compared only with the representative structure of each cluster; if the query structure is sufficiently similar to a representative structure (for example, PSD < 0.6), all proteins in that representative's cluster are taken to be structural neighbors of the query protein. This is justified because proteins with high sequence similarity (> 60%) usually have very similar structures.
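The cluster-representative shortcut described above can be sketched as follows. The structure alignment itself is abstracted into a caller-supplied `psd` function, since the two-stage alignment is not something this sketch implements; the function name, the cluster layout and the toy distance table are assumptions for illustration, while the 0.6 threshold comes from the example in the text.

```python
from typing import Callable, Dict, List


def structural_neighbors(
    query: str,
    clusters: Dict[str, List[str]],
    psd: Callable[[str, str], float],
    threshold: float = 0.6,
) -> List[str]:
    """Compare the query only with each cluster representative.

    clusters maps a representative structure ID to all members of its
    sequence-similarity cluster (representative included). If the query is
    close enough to the representative, every member becomes a neighbor.
    """
    neighbors: List[str] = []
    for representative, members in clusters.items():
        if psd(query, representative) < threshold:
            neighbors.extend(members)
    return neighbors


# Toy data (made up): two clusters and a lookup table standing in for the
# real structure alignment score.
clusters = {"repA": ["repA", "a1", "a2"], "repB": ["repB", "b1"]}
toy_psd = {("Q", "repA"): 0.35, ("Q", "repB"): 0.82}
found = structural_neighbors("Q", clusters, lambda q, r: toy_psd[(q, r)])
```

The saving is that the expensive alignment runs once per cluster rather than once per structure, at the cost of assuming that cluster members share the representative's fold.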
In this embodiment, second-level structural neighbors are used to uncover more remote functional relationships, particularly when first-level functional relationships are lacking. Finally, the function annotations of the two levels of structural neighbors are combined by a scoring function to predict the function of the query protein.
On the other hand, in this step, for the query protein Q, PSI-BLAST can be used to search the UniProtKB/Swiss-Prot database for homologous sequences. For each homologous sequence Hk, its function annotations (Gene Ontology) are scored according to the sequence alignment E-value. The probability score that a function Ti annotates the query protein, i.e. the second probability, can be computed as:
where E is the E-value of homologous sequence Hk, b is the constant log(10), n is the number of homologous sequences, and Indk(Ti) is 1 if Hk has function Ti and 0 otherwise.
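The formula for the second probability is not reproduced in this text, so the following is only one plausible reading of the surrounding definitions, not the patent's own formula. It assumes that each homologous sequence votes with weight -log10(E) + b, with b = log10(10) = 1, and that the votes for a function are normalized by the total weight of all homologs. The function name, the normalization and the E-value floor are all assumptions.

```python
import math
from typing import Sequence, Tuple


def second_probability(
    homologs: Sequence[Tuple[float, bool]],
    b: float = math.log10(10),  # the constant log(10) named in the text
    evalue_floor: float = 1e-180,  # hypothetical guard against E-values of 0
) -> float:
    """Assumed E-value-weighted vote for one function Ti.

    homologs: (E-value of Hk, whether Hk is annotated with Ti).
    Returns a value in [0, 1]; 0.0 if no homolog carries positive weight.
    """
    total = 0.0
    voted = 0.0
    for evalue, has_function in homologs:
        weight = max(0.0, -math.log10(max(evalue, evalue_floor)) + b)
        total += weight
        if has_function:
            voted += weight
    return voted / total if total > 0 else 0.0


# Toy hits (made up): weights are about 51, 6 and 3, so the score is 54/60.
toy_homologs = [(1e-50, True), (1e-5, False), (1e-2, True)]
p2 = second_probability(toy_homologs)
```

Under this reading, highly significant hits dominate the score, and hits with E-values above 10 contribute nothing.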
Step S4: building an SVM prediction model that predicts the function from a PSSM (position-specific scoring matrix), and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein.
In this step, a sample set consisting of positive and negative samples is constructed, the PSSM is used as the input feature of the SVM, and the SVM prediction model is built and evaluated by drawing a training set and an independent test set. This technique is readily implemented by those skilled in the art and is not described further here. Preferably, this embodiment can use the auto-covariance (AC) transformation to convert the PSSM into a fixed-length feature vector. The AC variables are computed as:
where j denotes a descriptor, j = 1, 2, ..., D (D is the number of descriptors); i denotes the position in the sequence; L is the length of the amino acid sequence; lg is the lag (lg = 1, 2, ..., LG), with LG the maximum lag; and the total number of AC variables for each sequence is LG*D. Based on the AC features, a prediction model is trained for each function fa with the support vector machine method to perform function prediction.
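The AC formula itself is not reproduced in this text. The sketch below uses the commonly used auto-covariance definition, AC(lg, j) = 1/(L - lg) * sum over i of (X[i][j] - mean_j) * (X[i+lg][j] - mean_j), which matches the symbols defined above (L, lg, LG, D) and yields LG*D features. Treat it as an assumption about what the patent intends; the function name and the toy matrix are illustrative.

```python
from typing import List, Sequence


def ac_features(pssm: Sequence[Sequence[float]], max_lag: int) -> List[float]:
    """Auto-covariance transform of an L x D matrix into max_lag * D features.

    pssm[i][j] is the score of descriptor (column) j at sequence position i.
    Features are ordered by lag first, then by descriptor.
    """
    length = len(pssm)
    n_desc = len(pssm[0])
    if not 0 < max_lag < length:
        raise ValueError("max_lag must be between 1 and L - 1")
    # per-descriptor mean over the whole sequence
    means = [sum(row[j] for row in pssm) / length for j in range(n_desc)]
    features: List[float] = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 4 x 2 matrix (a real PSSM would have 20 columns, one per amino acid).
toy_pssm = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
feats = ac_features(toy_pssm, max_lag=2)
```

The output length depends only on `max_lag` and the number of columns, so proteins of different lengths map to vectors of the same size, which is what a fixed-input classifier such as an SVM needs.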
Step S5: according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, calculating gene co-expression scores between the corresponding orthologs in other species, and, according to the gene co-expression scores, converting the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species.
This step annotates function based on cross-species co-expression. Co-expression data for 11 species (human, nematode, dog, fruit fly, zebrafish, chicken, rhesus monkey, mouse, rat, budding yeast and fission yeast) can be obtained from the COXPRESdb and ArrayExpress databases. The co-expression between any two genes in each species is precomputed using the Pearson correlation coefficient. For a query gene (protein) Q in species 1, P1, P2, ..., Pi are genes with expression similar to that of Q, and Qoj and Pioj are the orthologous proteins of Q and Pi in another species j (species 2, ..., species n). Fusing the co-expression information of the orthologs in other species improves the reliability and coverage of the gene co-expression relationships. A naive Bayes (NB) method can be used to compute the cross-species gene co-expression score (COXS), which fuses the co-expression relationship between the genes in the target species with the co-expression relationships between the corresponding orthologs in other species. Specifically, the gene co-expression score is calculated as:
COXS(Q, Pi) = 1 - (1 - C(Q, Pi)) * (1 - w * OSi)
where Q is the query gene, Pi is a gene co-expressed with Q, C is the Pearson correlation coefficient of the two genes' expression, w is the weight of ortholog expression, OSi is the co-expression score between the orthologs (Qoj, Pioj) of Q and Pi in species j, and n is the total number of species.
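The co-expression score can be sketched as below. `pearson` is the standard Pearson correlation coefficient; `coxs` applies the printed formula, and when several ortholog scores are supplied it multiplies one (1 - w * OS) factor per species. That product over species is an assumption suggested by the mention of species j and n, not something the printed formula states; with a single ortholog score the function reduces exactly to the formula above. The weight `w = 0.5` and the example values are made up, and all scores are assumed to lie in [0, 1].

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def coxs(c_target: float, ortholog_scores: Sequence[float], w: float = 0.5) -> float:
    """Cross-species co-expression score for one gene pair (Q, Pi).

    c_target: co-expression C(Q, Pi) in the target species.
    ortholog_scores: co-expression of the ortholog pair in each other species
    (the multi-species product is an assumed extension of the printed formula).
    """
    not_coexpressed = 1.0 - c_target
    for os_j in ortholog_scores:
        not_coexpressed *= 1.0 - w * os_j
    return 1.0 - not_coexpressed


r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
single = coxs(0.6, [0.5])        # the printed formula with one ortholog pair
several = coxs(0.6, [0.5, 0.8])  # assumed extension to two other species
```

The noisy-OR form means that support from orthologs can only raise the score above the target-species correlation, never lower it.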
For a given function that the query protein may have, the steps above compute the first, second, third and fourth probabilities of that function.
Step S6: fusing the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.
This step can use a Bayesian network to fuse the structure, sequence, position-specific scoring matrix and cross-species co-expression information, building an automatic protein function annotation method (PredGO). A protein that has a given function is defined as positive. Given the number of positive protein-function pairs and the total number of pairs, the prior probability (prior) of finding a positive protein-function pair is computed as:
where P(pos) is the probability that a prediction is correct and P(neg) is the probability that it is wrong. Correspondingly, the posterior probability is given by the following formula:
where f1, ..., fN are the values of the given data sources, here the four data sources given by the first to fourth probabilities.
The likelihood ratio L is defined as:
According to Bayes' rule, the prior and posterior probabilities are related by:
Ppost=L (f1..., fN)Pprior
Thus, the higher the posterior probability of a given function, the more likely the query protein is to have that function.
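The fusion of step S6 can be illustrated with a small sketch. Since the prior, posterior and likelihood-ratio formulas are not reproduced in this text, the sketch makes two assumptions: the relation Ppost = L(f1, ..., fN) Pprior is read in odds form (prior odds P(pos)/P(neg) multiplied by the likelihood ratio), and the four data sources are treated as conditionally independent, so that the joint likelihood ratio is the product of per-source ratios. The second assumption is a naive Bayes simplification and may be cruder than the Bayesian network the document refers to. The function name and all numbers are made up.

```python
from typing import Sequence


def posterior_probability(prior_pos: float, likelihood_ratios: Sequence[float]) -> float:
    """Fuse per-source likelihood ratios into a posterior probability.

    prior_pos: prior probability that a protein-function pair is positive.
    likelihood_ratios: P(f_k | pos) / P(f_k | neg) for each data source k,
    assumed to have been estimated beforehand (for example from binned
    scores on training data; that estimation is not shown here).
    """
    if not 0.0 < prior_pos < 1.0:
        raise ValueError("prior_pos must lie strictly between 0 and 1")
    odds = prior_pos / (1.0 - prior_pos)  # prior odds
    for ratio in likelihood_ratios:
        odds *= ratio  # independence assumption: ratios multiply
    return odds / (1.0 + odds)  # convert posterior odds back to a probability


# Made-up ratios for the structure, sequence, PSSM and co-expression sources.
fused = posterior_probability(0.01, [10.0, 5.0, 2.0, 1.0])
```

A ratio of 1 leaves the odds unchanged, so a data source that is uninformative for a given protein simply drops out of the fusion.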
In the experimental validation, the method of the invention (PredGO) was compared with other protein function prediction methods on two data sets. The GOA-PDB data set consists of data newly added to the GOA database between 201010 and 201311; each protein carries at least one non-IEA function annotation, and after redundancy removal with CD-HIT, 3632 proteins from 256 species were obtained. The CAFA 2011 data set is the data set provided by the first protein function annotation challenge (http://Biofunctionprediction.org/) and comprises 866 proteins from 11 species. On the GOA-PDB data set, as shown in Fig. 3 and Table 1, the integrated method (PredGO), which combines multiple data sources, outperforms every single data source in both molecular function and biological process. On the CAFA data set, Fig. 4 compares the ROC curves of the present invention (PredGO) with those of three other function prediction methods, BLAST, Argot2 and Str; PredGO shows better prediction performance (the higher the ROC curve, the better the prediction performance). Fig. 5 compares the maximum F-measure (Fmax) of the present invention (PredGO) with that of other existing function prediction methods (BLAST, Jones-UCL, Argot2, ESG and Str) on the CAFA data set; the Fmax of PredGO is markedly higher (a higher Fmax indicates better prediction performance).
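For readers unfamiliar with Fmax, the sketch below shows one common protein-centric way of computing it: for each score threshold, precision is averaged over proteins with at least one prediction at or above the threshold, recall is averaged over all benchmark proteins, and Fmax is the largest harmonic mean over thresholds. The document does not spell out its evaluation protocol, so this definition, the function name and the toy data are assumptions.

```python
from typing import Dict, Set


def fmax(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    steps: int = 100,
) -> float:
    """Maximum F-measure over thresholds 1/steps, 2/steps, ..., 1.

    predictions: protein -> {function term: score in [0, 1]}.
    truth: protein -> set of true function terms (assumed non-empty).
    """
    best = 0.0
    for k in range(1, steps + 1):
        threshold = k / steps
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= threshold}
            correct = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(correct / len(predicted))
            recalls.append(correct / len(true_terms))
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy benchmark (made up): two proteins, two hypothetical terms.
toy_predictions = {"p1": {"GO:A": 0.9, "GO:B": 0.4}, "p2": {"GO:A": 0.6}}
toy_truth = {"p1": {"GO:A"}, "p2": {"GO:B"}}
toy_fmax = fmax(toy_predictions, toy_truth)
```

In this toy case the best trade-off occurs at thresholds between 0.6 and 0.9, where only the correct prediction for p1 remains.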
Table 1:
In summary, the protein function annotation method disclosed in this embodiment assesses structure, sequence, PSSM and cross-species co-expression information and fuses them to obtain the overall probability that a given function occurs in the query protein. This improves the performance of protein annotation, can be extended to obtain a likelihood value for each function of the query protein, and addresses the high cost and low efficiency of experimental biological methods.
Embodiment 2
Corresponding to the method embodiment above, this embodiment discloses a protein function annotation system, comprising:
a first processing module, for searching for first-level structural neighbors according to the representative structure of the query protein;
a second processing module, for searching for homologous sequences of the query protein, and searching for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences;
a third processing module, for assessing a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structural neighbors, and for assessing a second probability that the function occurs in the query protein according to the distribution of that function among all of the homologous sequences;
a fourth processing module, for building an SVM prediction model that predicts the function from a PSSM, and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;
a fifth processing module, for calculating, according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, gene co-expression scores between the corresponding orthologs in other species, and, according to the gene co-expression scores, converting the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species;
a sixth processing module, for fusing the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.
Optionally, the first probability can be computed by the following formula (as recited in claim 2):
$$S(f_a) = w \cdot \sum_{i=1}^{N_s} (1 - P_i) \cdot I_{S_i}(f_a) + (1 - w) \cdot \sum_{i=1}^{N_{seq}} \max_{1 \le j \le N_{si}} \left( E_i \cdot (1 - P_{ij}) \cdot I_{S_{ij}}(f_a) \right)$$
where $P_i$ is the structural distance between the query protein and the i-th first-level structural neighbor $S_i$; $I_{S_i}(f_a)$ is 1 if $S_i$ has function $f_a$ and 0 otherwise; $w$ is a weight; $N_s$ is the number of first-level structural neighbors; $N_{seq}$ is the number of homologous sequences; $E_i$ is the sequence similarity between the i-th homologous sequence and the query protein; $P_{ij}$ is the structural distance between the i-th homologous sequence and its j-th second-level structural neighbor $S_{ij}$; $N_{si}$ is the number of second-level structural neighbors of the i-th homologous sequence; and, likewise, $I_{S_{ij}}(f_a)$ is 1 if $S_{ij}$ has function $f_a$ and 0 otherwise.
The second probability can be computed as:
where Ek is the alignment score of homologous sequence Hk, b is the constant log(10), n is the number of homologous sequences, and Indk(Ti) is 1 if Hk has function Ti and 0 otherwise.
The gene co-expression score can be calculated by the following formula:
COXS(Q, Pi) = 1 - (1 - C(Q, Pi)) * (1 - w * OSi)
where Q is the query gene, Pi is a gene co-expressed with Q, C is the Pearson correlation coefficient of the two genes' expression, w is the weight of ortholog expression, OSi is the co-expression score between the orthologs (Qoj, Pioj) of Q and Pi in species j, and n is the total number of species.
In this embodiment, the internal data processing of each module can follow Embodiment 1 above and is not repeated here.
Likewise, the protein function annotation system disclosed in this embodiment assesses structure, sequence, PSSM and cross-species co-expression information and fuses them to obtain the overall probability that a given function occurs in the query protein. This improves the performance of protein annotation, can be extended to obtain a likelihood value for each function of the query protein, and addresses the high cost and low efficiency of experimental biological methods.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. For those skilled in the art, the invention admits various modifications and variations. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A protein function annotation method, characterized by comprising:
Step S1: searching for first-level structural neighbors according to the representative structure of the query protein;
Step S2: searching for homologous sequences of the query protein, and searching for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences;
Step S3: assessing a first probability that a given function occurs in the query protein according to the distribution of that function among the first-level and second-level structural neighbors; and assessing a second probability that the function occurs in the query protein according to the distribution of that function among all of the homologous sequences;
Step S4: building an SVM prediction model that predicts the function from a PSSM, and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;
Step S5: according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, calculating gene co-expression scores between the corresponding orthologs in other species, and, according to the gene co-expression scores, converting the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species;
Step S6: fusing the first, second, third and fourth probabilities to assess the overall probability that the function occurs in the query protein.
2. The protein function annotation method according to claim 1, characterized in that the first probability is calculated as:
$$S(f_a) = w \cdot \sum_{i=1}^{N_s} (1 - P_i) \cdot I_{S_i}(f_a) + (1 - w) \cdot \sum_{i=1}^{N_{seq}} \max_{1 \le j \le N_{s_i}} \left( E_i \cdot (1 - P_{ij}) \cdot I_{S_{ij}}(f_a) \right)$$
where $P_i$ is the structural distance between the query protein and the $i$-th first-level structural neighbor $S_i$; $I_{S_i}(f_a)$ is 1 if $S_i$ has function $f_a$ and 0 otherwise; $w$ is a weight; $N_s$ is the number of first-level structural neighbors; $N_{seq}$ is the number of homologous sequences; $E_i$ is the sequence similarity between the $i$-th homologous sequence and the query protein; $P_{ij}$ is the structural distance between the $i$-th homologous sequence and the $j$-th second-level structural neighbor $S_{ij}$; $N_{s_i}$ is the number of second-level structural neighbors; likewise, $I_{S_{ij}}(f_a)$ is 1 if $S_{ij}$ has function $f_a$ and 0 otherwise.
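The formula of claim 2 can be illustrated with a minimal Python sketch. The data layout (tuples of distance and function set), the function name `structure_score`, the default weight `w=0.5`, and the assumption that structural distances and sequence similarities are normalized to [0, 1] are all hypothetical choices made for illustration; the claim does not specify them.

```python
def structure_score(first_level, homologs, fa, w=0.5):
    """Sketch of S(f_a) from claim 2.

    first_level: list of (P_i, functions) for first-level structural neighbors.
    homologs:    list of (E_i, neighbors), where neighbors is a list of
                 (P_ij, functions) for the second-level structural neighbors
                 found through the i-th homologous sequence.
    fa:          the function label being scored.
    w:           weight between the two terms (0.5 is an assumed default).
    """
    # First term: sum of (1 - P_i) over first-level neighbors that have f_a.
    direct = sum(1.0 - p for p, funcs in first_level if fa in funcs)

    # Second term: for each homolog, keep the best second-level neighbor.
    indirect = 0.0
    for e, neighbors in homologs:
        indirect += max(
            (e * (1.0 - p) * (1.0 if fa in funcs else 0.0) for p, funcs in neighbors),
            default=0.0,  # a homolog with no second-level neighbors adds nothing
        )
    return w * direct + (1.0 - w) * indirect


first_level = [(0.2, {"GO:1"}), (0.5, {"GO:2"})]
homologs = [
    (0.9, [(0.1, {"GO:1"}), (0.3, {"GO:1"})]),
    (0.8, [(0.4, {"GO:2"})]),
]
score = structure_score(first_level, homologs, "GO:1")
```

In this toy input only one first-level neighbor and the neighbors of the first homolog carry the function `"GO:1"`, so the second homolog contributes zero to the sum.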
3. The protein function annotation method according to claim 1, characterized in that the second probability is calculated as:
$$S(T_i) = \sum_{k=1}^{n} \frac{-\log(E_k) + b}{\sum_{j=1}^{n} \{-\log(E_j) + b\}} \cdot \mathrm{Ind}_k(T_i)$$
where $E_k$ is the alignment score value of homologous sequence $H_k$, $b$ is the constant $\log(10)$, $n$ is the number of homologous sequences, and $\mathrm{Ind}_k(T_i)$ is 1 if $H_k$ has function $T_i$ and 0 otherwise.
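A small Python sketch of the formula in claim 3 follows. It assumes natural logarithms (the claim does not state the base), strictly positive alignment score values, and a hypothetical input layout in which each hit is a pair of score value and function set; the name `homology_score` is likewise illustrative.

```python
import math


def homology_score(hits, ti, b=math.log(10)):
    """Sketch of S(T_i) from claim 3.

    hits: list of (E_k, functions) for the homologous sequences H_k;
          every E_k must be > 0 so that log(E_k) is defined.
    ti:   the function label being scored.
    b:    the constant log(10); natural log is an assumption here.
    """
    # Per-hit weight: -log(E_k) + b (larger for smaller score values).
    weights = [-math.log(e) + b for e, _ in hits]
    total = sum(weights)
    if total == 0:
        return 0.0  # no hits, or weights cancel out
    # Normalized weight mass of the hits that carry function T_i.
    return sum(wk for wk, (_, funcs) in zip(weights, hits) if ti in funcs) / total


hits = [(1e-10, {"A"}), (1e-5, {"B"})]
score_a = homology_score(hits, "A")
score_b = homology_score(hits, "B")
```

Because the weights are normalized by their sum, the scores of mutually exclusive functions over the same hit list add up to 1 in this example.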
4. The protein function annotation method according to claim 1, characterized in that step S4 comprises converting the PSSM matrix into features of fixed length using the auto covariance (AC) transformation, the AC variables being calculated as:
$$AC_{lg,j} = \frac{1}{L - lg} \sum_{i=1}^{L - lg} \left( X_{i,j} - \frac{1}{L} \sum_{i=1}^{L} X_{i,j} \right) \left( X_{(i+lg),j} - \frac{1}{L} \sum_{i=1}^{L} X_{i,j} \right)$$
where $j$ denotes a descriptor, $j = 1, 2, \ldots, D$ ($D$ being the number of descriptors); $i$ denotes the position in the sequence; $L$ is the length of the amino acid sequence; $LG$ is the maximum value of the lag $lg$ ($lg = 1, 2, \ldots, LG$); the total number of AC variables for each sequence is $LG \cdot D$; and, based on the AC features, a prediction model is trained for each function $f_a$ with the support vector machine method to carry out function prediction.
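The AC transformation of claim 4 can be sketched in plain Python as below. The function name `auto_covariance`, the row-per-residue list layout, and the ordering of the output vector (lag-major, then descriptor) are assumptions made for illustration; the subsequent SVM training step is not shown.

```python
def auto_covariance(pssm, max_lag):
    """Sketch of the AC variables AC_{lg,j} from claim 4.

    pssm:    list of L rows (one per residue), each with D descriptor values.
    max_lag: LG, the largest lag; must be smaller than L.
    Returns a flat list of LG * D features, ordered by lag and then descriptor.
    """
    length = len(pssm)
    n_desc = len(pssm[0])
    if not 0 < max_lag < length:
        raise ValueError("max_lag must satisfy 0 < max_lag < sequence length")

    # Column means: (1/L) * sum_i X_{i,j}
    means = [sum(row[j] for row in pssm) / length for j in range(n_desc)]

    features = []
    for lg in range(1, max_lag + 1):
        for j in range(n_desc):
            acc = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lg][j] - means[j])
                for i in range(length - lg)
            )
            features.append(acc / (length - lg))
    return features


toy_pssm = [[1, 0], [2, 0], [3, 0], [4, 0]]  # L = 4 residues, D = 2 descriptors
feats = auto_covariance(toy_pssm, max_lag=2)
```

Whatever the sequence length, the output has exactly LG * D entries, which is what lets proteins of different lengths share one fixed-size feature space.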
5. The protein function annotation method according to claim 1, characterized in that calculating the gene co-expression score in step S5 comprises:
$$\mathrm{COXS}(Q, P_i) = 1 - (1 - C(Q, P_i)) \cdot (1 - w \cdot \mathrm{OS}_i)$$
$$\mathrm{OS}_i = 1 - \prod_{j=2}^{n} \left( 1 - C(Q_{o_j}, {P_i}_{o_j}) \right)$$
where $Q$ is the query gene, $P_i$ is a gene co-expressed with $Q$, $C$ is the Pearson correlation coefficient of the expression of two genes, $w$ is the weight of ortholog expression, $\mathrm{OS}_i$ is the co-expression score between the orthologs $(Q_{o_j}, {P_i}_{o_j})$ of $Q$ and $P_i$ in species $j$, and $n$ is the total number of species.
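The two formulas of claim 5 can be sketched in Python as follows. The sketch takes the Pearson correlation coefficients as precomputed inputs and assumes they lie in [0, 1] (for example after clipping negative values), which the claim does not state; the function names and the example weight are hypothetical.

```python
from math import prod


def ortholog_score(ortholog_corrs):
    """OS_i = 1 - prod_j (1 - C(Q_oj, Pi_oj)) over the other species j = 2..n."""
    return 1.0 - prod(1.0 - c for c in ortholog_corrs)


def coexpression_score(c_target, ortholog_corrs, w):
    """COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) * (1 - w * OS_i).

    c_target:       correlation of Q and P_i in the target species.
    ortholog_corrs: correlations of their ortholog pairs in the other species.
    w:              weight of ortholog expression.
    """
    os_i = ortholog_score(ortholog_corrs)
    return 1.0 - (1.0 - c_target) * (1.0 - w * os_i)


# Target-species correlation 0.5, two other species with 0.5 and 0.2.
coxs = coexpression_score(0.5, [0.5, 0.2], w=0.5)
```

With no ortholog evidence the product is empty, `OS_i` is 0, and the score reduces to the target-species correlation itself.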
6. The protein function annotation method according to claim 1, characterized in that step S6 comprises:
A protein having a given function is defined as positive. Given the number of positive pairs among all protein-function pairs, the prior probability (in odds form) of finding a positive protein-function pair is calculated as:
$$P_{prior} = \frac{P(pos)}{P(neg)} = \frac{P(pos)}{1 - P(pos)};$$
where $P(pos)$ is the probability of a correct prediction and $P(neg)$ is the probability of an incorrect prediction. Correspondingly, the posterior probability (in odds form) is given by:
$$P_{post} = \frac{P(pos \mid f_1, \ldots, f_N)}{P(neg \mid f_1, \ldots, f_N)};$$
where $f_1, \ldots, f_N$ are the values from the given data sources, here the four data sources given by the first to fourth probabilities;
The likelihood ratio $L$ is defined as:
$$L(f_1, \ldots, f_N) = \frac{P(f_1, \ldots, f_N \mid pos)}{P(f_1, \ldots, f_N \mid neg)};$$
Relating the prior and posterior probabilities by Bayes' rule gives:
$$P_{post} = L(f_1, \ldots, f_N) \cdot P_{prior}$$
The higher the posterior probability of a given function, the more likely the query protein is to have that function.
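The Bayesian fusion of claim 6 can be sketched in Python as below. The claim defines a joint likelihood ratio over all data sources; this sketch additionally assumes the sources are conditionally independent (a naive Bayes simplification that is not stated in the claim), so that the joint ratio becomes a product of per-source ratios. The function names and the example numbers are hypothetical.

```python
from math import prod


def prior_odds(n_pos, n_total):
    """P_prior = P(pos) / (1 - P(pos)), with P(pos) estimated as n_pos / n_total."""
    p_pos = n_pos / n_total
    return p_pos / (1.0 - p_pos)


def posterior_odds(likelihood_ratios, prior):
    """P_post = L(f_1..f_N) * P_prior.

    Assumption: L is approximated by the product of per-source likelihood
    ratios, which holds only if the sources are conditionally independent.
    """
    return prod(likelihood_ratios) * prior


def odds_to_probability(odds):
    """Convert odds back to a probability in [0, 1)."""
    return odds / (1.0 + odds)


# 100 positive pairs out of 1100; four sources with assumed ratios 2, 5, 1, 1.
prior = prior_odds(100, 1100)
post = posterior_odds([2.0, 5.0, 1.0, 1.0], prior)
prob = odds_to_probability(post)
```

A source with likelihood ratio 1 is uninformative and leaves the posterior unchanged, so only the first two sources move the result in this example.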
7. A protein function annotation system for performing the method according to any one of claims 1 to 6, characterized by comprising:
a first processing module, configured to search for first-level structural neighbors according to the representative structure of the query protein;
a second processing module, configured to search for homologous sequences of the query protein, and to search for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences;
a third processing module, configured to assess, according to the distribution of a given function among the first-level structural neighbors and the second-level structural neighbors, a first probability that the function occurs in the query protein; and to assess, according to the distribution of the function among all the homologous sequences, a second probability that the function occurs in the query protein;
a fourth processing module, configured to build an SVM prediction model that predicts the function from a PSSM matrix, and to input the PSSM matrix of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;
a fifth processing module, configured to calculate, according to the query gene corresponding to the query protein and the genes co-expressed with the query gene, gene co-expression scores between the corresponding orthologs in other species, and, according to the gene co-expression scores, to convert the distribution of the function in the other species into a fourth probability that the function occurs in the query protein in the target species;
a sixth processing module, configured to fuse the first, second, third and fourth probabilities to assess the comprehensive probability that the function occurs in the query protein.
8. The protein function annotation system according to claim 7, characterized in that the first probability is calculated as:
$$S(f_a) = w \cdot \sum_{i=1}^{N_s} (1 - P_i) \cdot I_{S_i}(f_a) + (1 - w) \cdot \sum_{i=1}^{N_{seq}} \max_{1 \le j \le N_{s_i}} \left( E_i \cdot (1 - P_{ij}) \cdot I_{S_{ij}}(f_a) \right)$$
where $P_i$ is the structural distance between the query protein and the $i$-th first-level structural neighbor $S_i$; $I_{S_i}(f_a)$ is 1 if $S_i$ has function $f_a$ and 0 otherwise; $w$ is a weight; $N_s$ is the number of first-level structural neighbors; $N_{seq}$ is the number of homologous sequences; $E_i$ is the sequence similarity between the $i$-th homologous sequence and the query protein; $P_{ij}$ is the structural distance between the $i$-th homologous sequence and the $j$-th second-level structural neighbor $S_{ij}$; $N_{s_i}$ is the number of second-level structural neighbors; likewise, $I_{S_{ij}}(f_a)$ is 1 if $S_{ij}$ has function $f_a$ and 0 otherwise.
9. The protein function annotation system according to claim 7, characterized in that the second probability is calculated as:
$$S(T_i) = \sum_{k=1}^{n} \frac{-\log(E_k) + b}{\sum_{j=1}^{n} \{-\log(E_j) + b\}} \cdot \mathrm{Ind}_k(T_i)$$
where $E_k$ is the alignment score value of homologous sequence $H_k$, $b$ is the constant $\log(10)$, $n$ is the number of homologous sequences, and $\mathrm{Ind}_k(T_i)$ is 1 if $H_k$ has function $T_i$ and 0 otherwise.
10. The protein function annotation system according to claim 7, characterized in that calculating the gene co-expression score comprises:
$$\mathrm{COXS}(Q, P_i) = 1 - (1 - C(Q, P_i)) \cdot (1 - w \cdot \mathrm{OS}_i)$$
$$\mathrm{OS}_i = 1 - \prod_{j=2}^{n} \left( 1 - C(Q_{o_j}, {P_i}_{o_j}) \right)$$
where $Q$ is the query gene, $P_i$ is a gene co-expressed with $Q$, $C$ is the Pearson correlation coefficient of the expression of two genes, $w$ is the weight of ortholog expression, $\mathrm{OS}_i$ is the co-expression score between the orthologs $(Q_{o_j}, {P_i}_{o_j})$ of $Q$ and $P_i$ in species $j$, and $n$ is the total number of species.
CN201611128108.4A 2016-12-09 2016-12-09 Protein function mask method and system Active CN106599611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611128108.4A CN106599611B (en) 2016-12-09 2016-12-09 Protein function mask method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611128108.4A CN106599611B (en) 2016-12-09 2016-12-09 Protein function mask method and system

Publications (2)

Publication Number Publication Date
CN106599611A true CN106599611A (en) 2017-04-26
CN106599611B CN106599611B (en) 2019-04-30

Family

ID=58598271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611128108.4A Active CN106599611B (en) 2016-12-09 2016-12-09 Protein function mask method and system

Country Status (1)

Country Link
CN (1) CN106599611B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170329914A1 (en) * 2016-05-11 2017-11-16 International Business Machines Corporation Predicting Personalized Cancer Metastasis Routes, Biological Mediators of Metastasis and Metastasis Blocking Therapies

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1328601A (en) * 1998-08-25 2001-12-26 斯克利普斯研究院 Methods and systems for predicting protein function
CN103473483A (en) * 2013-10-07 2013-12-25 谢华林 Online predicting method for structure and function of protein
CN106068273A (en) * 2013-12-31 2016-11-02 丹尼斯科美国公司 The protein expression strengthened

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEI DENG ET AL.: "An Integrated Framework for Functional Annotation of Protein Structural Domains", 《IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977548B (en) * 2017-12-05 2020-04-07 东软集团股份有限公司 Method, device, medium, and electronic device for predicting protein-protein interaction
CN107977548A (en) * 2017-12-05 2018-05-01 东软集团股份有限公司 Method, apparatus, medium and the electronic equipment of anticipating interaction between proteins
CN108334746A (en) * 2018-01-15 2018-07-27 浙江工业大学 A kind of Advances in protein structure prediction based on secondary structure similarity
CN108334746B (en) * 2018-01-15 2021-06-18 浙江工业大学 Protein structure prediction method based on secondary structure similarity
CN109101785A (en) * 2018-07-12 2018-12-28 浙江工业大学 A kind of Advances in protein structure prediction based on secondary structure similarity selection strategy
CN109101785B (en) * 2018-07-12 2021-06-18 浙江工业大学 Protein structure prediction method based on secondary structure similarity selection strategy
CN109817275B (en) * 2018-12-26 2020-12-01 东软集团股份有限公司 Protein function prediction model generation method, protein function prediction device, and computer readable medium
CN109817275A (en) * 2018-12-26 2019-05-28 东软集团股份有限公司 The generation of protein function prediction model, protein function prediction technique and device
CN109801675B (en) * 2018-12-26 2021-01-05 东软集团股份有限公司 Method, device and equipment for determining protein lipid function
CN109801675A (en) * 2018-12-26 2019-05-24 东软集团股份有限公司 A kind of method, apparatus and equipment of determining protein liposomal function
CN109785901A (en) * 2018-12-26 2019-05-21 东软集团股份有限公司 A kind of protein function prediction technique and device
CN109785901B (en) * 2018-12-26 2021-07-30 东软集团股份有限公司 Protein function prediction method and device
CN110277136A (en) * 2019-07-05 2019-09-24 湖南大学 Protein sequence database parallel search identification method and device
CN114627963A (en) * 2022-05-16 2022-06-14 北京肿瘤医院(北京大学肿瘤医院) Protein data filling method, system, computer device and readable storage medium
CN115497555A (en) * 2022-08-16 2022-12-20 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-species protein function prediction method, device, equipment and storage medium
CN115497555B (en) * 2022-08-16 2024-01-05 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-species protein function prediction method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN106599611B (en) 2019-04-30

Similar Documents

Publication Publication Date Title
CN106599611A (en) Marking method and system for protein functions
Cho et al. Diffusion component analysis: unraveling functional topology in biological networks
CN103413067B (en) A kind of protein structure prediction method based on abstract convex Lower Bound Estimation
O'Donoghue et al. Evolutionary profiles derived from the QR factorization of multiple structural alignments gives an economy of information
Zhang et al. Multi-head attention-based probabilistic CNN-BiLSTM for day-ahead wind speed forecasting
Mittal et al. Recruiting machine learning methods for molecular simulations of proteins
CN109785901B (en) Protein function prediction method and device
CN107491664B (en) Protein structure de novo prediction method based on information entropy
Wang et al. Machine learning-based methods for prediction of linear B-cell epitopes
Zhang et al. Predicting linear B-cell epitopes by using sequence-derived structural and physicochemical features
Tian et al. Pairwise alignment of interaction networks by fast identification of maximal conserved patterns
CN114202123A (en) Service data prediction method and device, electronic equipment and storage medium
Yu et al. SOMPNN: an efficient non-parametric model for predicting transmembrane helices
Wang et al. A brief review of machine learning methods for RNA methylation sites prediction
CN102760209A (en) Transmembrane helix predicting method for nonparametric membrane protein
Sun et al. Smolign: a spatial motifs-based protein multiple structural alignment method
Zhou et al. Enhanced hybrid search algorithm for protein structure prediction using the 3D-HP lattice model
Yan et al. A systematic review of state-of-the-art strategies for machine learning-based protein function prediction
CN109378034B (en) Protein prediction method based on distance distribution estimation
Wu et al. Unified deep learning architecture for modeling biology sequence
Bidargaddi et al. Combining segmental semi-Markov models with neural networks for protein secondary structure prediction
CN108920894B (en) Protein conformation space optimization method based on brief abstract convex estimation
CN102831332A (en) Interpretation prediction method of transmembrane helix of membrane protein
Zhang et al. PathEmb: Random walk based document embedding for global pathway similarity search
Li et al. Deep contextual representation learning for identifying essential proteins via integrating multisource protein features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant