CN106599611A - Marking method and system for protein functions - Google Patents
Marking method and system for protein functions
- Publication number
- CN106599611A (application number CN201611128108.4A)
- Authority
- CN
- China
- Prior art keywords
- protein
- function
- checked
- probability
- gene
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Abstract
The invention relates to the technical field of bioinformatics and discloses a protein function annotation (marking) method and system that improve annotation performance and address the high cost and low efficiency of biological experimental methods. The method comprises the following steps: estimating a first probability that a given function occurs in a query protein from its first-level and second-level structural neighbors; estimating a second probability that the function occurs in the query protein from all of its homologous sequences; inputting the PSSM of the query protein into an SVM prediction model to obtain a third probability that the function occurs in the query protein; converting, according to gene co-expression scores, the distribution of the function in other species into a fourth probability that the function occurs in the query protein in the target species; and fusing the first, second, third and fourth probabilities to estimate the overall probability that the function occurs in the query protein.
Description
Technical field
The present invention relates to the technical field of bioinformatics, and in particular to a protein function annotation method and system.
Background
Proteins are the material basis of all life and the final effectors and direct executors of life activities. They take part in almost all life processes in an organism, such as heredity, development, reproduction, the metabolism of matter and energy, stress response, thinking and memory. A protein is built from 20 different amino acid residues linked by peptide bonds, and it acquires its biological activity and function only after folding into a specific spatial conformation. From a physiological point of view, protein function (Protein Function) includes enzyme catalysis, transport and storage of substances, nutrient storage, coordination of movement, mechanical support, immune protection, signal reception and transduction, and control of growth and differentiation. Interest in protein function is largely due to the countless links between proteins and human health: the vast majority of hereditary diseases discovered so far are caused by gene mutations that lead to dysfunction of the encoded protein. For example, the recessive hereditary disease phenylketonuria (PKU) is caused by a deficiency of phenylalanine hydroxylase; albinism is caused by a congenital lack of tyrosinase, or by reduced tyrosinase activity, which impairs melanin production; and hereditary cystic fibrosis (CF) is related to loss of function of a chloride channel regulator located on the plasma membrane.
Determining the function of unknown proteins is important for understanding how organisms change under physiological or pathological conditions, and for disease prevention and drug development. Experimental techniques for identifying protein function mainly include gel electrophoresis, the yeast two-hybrid method, tandem affinity purification (TAP), fluorescence resonance energy transfer (FRET), protein chip technology and immunoelectron microscopy (IEM). Although these methods can determine the function of an unknown protein accurately, their experimental design is complex, their cost is high and their cycle is long, so they are suitable only for small-scale experiments and cannot meet the need for genome-wide protein function annotation. To date, the whole genome sequences of more than 3,000 cellular organisms have been determined, and publicly accessible databases hold more than 5 million non-redundant protein sequences. Determining the function of these proteins by biological experiments would be very time-consuming and expensive, so experimental annotation cannot keep up with the growth of protein sequence data. At present, only about 20%, 7%, 10% and 1% of the proteins of human, mouse, fruit fly and nematode, respectively, have experimentally annotated functions (TAS annotations in Gene Ontology). Scientists have therefore increasingly turned to computational methods to help annotate the huge volume of sequence and structure data.
Existing computational techniques for protein function prediction, such as BLAST, ESG and Argot2, rely mainly on sequence homology. Function transfer based on sequence homology is currently the mainstream approach to protein function prediction, but its prediction accuracy (Accuracy) and coverage (Coverage) are limited. Deriving protein function from sequence is reasonably accurate only when the sequences are highly similar; when sequence similarity falls below 30%, the accuracy of homology-based function prediction drops sharply.
Summary of the invention
The object of the present invention is to disclose a protein function annotation method and system that improve the performance of protein annotation and address the high cost and low efficiency of biological experimental methods.
To achieve the above object, the present invention discloses a protein function annotation method, comprising:
Step S1: searching for first-level structural neighbors according to the representative structure of the query protein;
Step S2: searching for homologous sequences of the query protein, and searching for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences;
Step S3: estimating, from the distribution of a given function among the first-level and second-level structural neighbors, a first probability that the function occurs in the query protein; and estimating, from the distribution of the function among all the homologous sequences, a second probability that the function occurs in the query protein;
Step S4: building an SVM prediction model that predicts the function from a PSSM, and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;
Step S5: computing, from the query gene corresponding to the query protein and the co-expressed genes of the query gene, gene co-expression scores between the corresponding orthologs in other species, and converting, according to the gene co-expression scores, the distribution of the function in the other species into a fourth probability that the function occurs in the query protein in the target species;
Step S6: fusing the first, second, third and fourth probabilities to estimate the overall probability that the function occurs in the query protein.
Corresponding to the above method, the invention also discloses a protein function annotation system, comprising:
a first processing module for searching for first-level structural neighbors according to the representative structure of the query protein;
a second processing module for searching for homologous sequences of the query protein and for searching for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences;
a third processing module for estimating, from the distribution of a given function among the first-level and second-level structural neighbors, a first probability that the function occurs in the query protein, and for estimating, from the distribution of the function among all the homologous sequences, a second probability that the function occurs in the query protein;
a fourth processing module for building an SVM prediction model that predicts the function from a PSSM, and for inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;
a fifth processing module for computing, from the query gene corresponding to the query protein and the co-expressed genes of the query gene, gene co-expression scores between the corresponding orthologs in other species, and for converting, according to the gene co-expression scores, the distribution of the function in the other species into a fourth probability that the function occurs in the query protein in the target species;
a sixth processing module for fusing the first, second, third and fourth probabilities to estimate the overall probability that the function occurs in the query protein.
The invention has the following beneficial effects:
By evaluating and fusing information from structure, sequence, PSSM and cross-species co-expression, the invention obtains the overall probability that a given function occurs in the query protein, which improves annotation performance. The approach extends to computing a probability value for every candidate function of the query protein, and it addresses the high cost and low efficiency of biological experimental methods.
The invention is described in further detail below with reference to the accompanying drawings.
Description of the drawings
The accompanying drawings, which form part of this application, provide a further understanding of the invention. The illustrative embodiments and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of the protein function annotation method disclosed in an embodiment of the invention;
Fig. 2 is a flow chart of the structure alignment algorithm used to search for structural neighbors in an embodiment of the invention;
Fig. 3 shows ROC curves on the GOA-PDB data set for the present invention (PredGO) and eight other methods based on different data sources: structural information (Str), co-expression information (Coexpression), phylogenetic profiles (Phylogenetics), position-specific scoring matrices (PSSM), sequence trigram information (Trigram), interaction information (PPI), functional domain information (Interpro) and orthology information (Orthology); the higher the ROC curve, the better the predictive performance;
Fig. 4 compares ROC curves on the CAFA data set for the present invention (PredGO) and three other function prediction methods, BLAST, Argot2 and Str; the higher the ROC curve, the better the predictive performance;
Fig. 5 compares the maximum F value (Fmax) on the CAFA data set for the present invention (PredGO) and other existing function prediction methods (BLAST, Jones-UCL, Argot2, ESG and Str); a higher maximum F value indicates better predictive performance.
Detailed description of the embodiments
Embodiments of the invention are described in detail below with reference to the accompanying drawings, but the invention can be implemented in many different ways as defined and covered by the claims.
Embodiment 1
This embodiment discloses a protein function annotation method which, as shown in Fig. 1, comprises:
Step S1: searching for first-level structural neighbors according to the representative structure of the query protein.
In this step, protein function can be inferred from structural similarity, in particular from remote three-dimensional structural similarity. For a given query protein Q, a representative structure M is obtained from the PDB or from a homology-model database, and a structure-neighbor search algorithm finds all structural neighbors of M in the protein structure library. These form the first-level structural neighbors (N1, N2, ...).
Step S2: searching for homologous sequences of the query protein, and searching for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences.
In this step, the sequence alignment method PSI-BLAST (one iteration) can be used to find all homologous sequences (H1, H2, ...) in the PDB. For each homologous sequence Hi, its structural neighbors are found in the same way as in step S1; these form the second-level structural neighbors of the query protein.
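To make the two neighbor levels of steps S1 and S2 concrete, the following minimal Python sketch collects both sets. It is an illustration only: `find_neighbors` stands in for whatever structure-neighbor search is used, and all identifiers (`collect_neighbors`, `M`, `H1`, `N1` and so on in the toy data) are hypothetical placeholders, not names from the patent.

```python
from typing import Callable, Dict, List, Tuple

def collect_neighbors(
    representative: str,
    homolog_structures: List[str],
    find_neighbors: Callable[[str], List[str]],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Gather first-level and second-level structural neighbors.

    find_neighbors is an assumed callable that returns the structural
    neighbors of one structure; the real search algorithm is not shown.
    """
    # First level: neighbors of the query protein's own representative structure.
    first_level = find_neighbors(representative)
    # Second level: neighbors of each homologous sequence's representative structure.
    second_level = {h: find_neighbors(h) for h in homolog_structures}
    return first_level, second_level


# Toy lookup table standing in for a real structure search (hypothetical IDs).
toy_index = {"M": ["N1", "N2"], "H1": ["S11"], "H2": ["S21", "S22"]}
first, second = collect_neighbors("M", ["H1", "H2"], lambda s: toy_index.get(s, []))
```

Keeping the second level keyed by homolog preserves the pairing between each homologous sequence and its own neighbors, which the first-probability score needs.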
Step S3: estimating, from the distribution of a given function among the first-level and second-level structural neighbors, a first probability that the function occurs in the query protein; and estimating, from the distribution of the function among all the homologous sequences, a second probability that the function occurs in the query protein.
In this step, because the first-level structural neighbors are determined directly from the representative structure of the query protein itself, they reflect the probability that a given function occurs in the query protein better than the second-level structural neighbors do. For this reason, this embodiment combines the distribution of the function among the first-level and second-level structural neighbors in a single formula to estimate the probability that the function occurs in the query protein. The formula is given in terms of the following quantities:
P_i is the structural distance between the query protein and the i-th first-level structural neighbor S_i; I_Si(f_a) is 1 if S_i has function f_a and 0 otherwise; w is a weight; N_s is the number of first-level structural neighbors; N_seq is the number of homologous sequences; E_i is the sequence similarity between the i-th homologous sequence and the query protein; P_ij is the structural distance between the i-th homologous sequence and its j-th second-level structural neighbor S_ij; N_si is the number of second-level structural neighbors; likewise, I_Sij(f_a) is 1 if S_ij has function f_a and 0 otherwise.
Optionally, for a given representative structure, the set of structural neighbors can be built by the two-stage structure alignment shown in Fig. 2. The first stage uses a double dynamic programming algorithm to compare the relative distances and directions of secondary structure elements (SSEs). Because a protein contains relatively few SSEs, this computation is fast. Based on a given structural similarity threshold, potentially similar proteins are selected for the second stage. The second stage performs a topological alignment of the related residue pairs by optimizing an objective function. The optimization aims to maximize the number of equivalent residue pairs while minimizing the Cα RMSD of the superimposed structures. The residue-residue alignment is optimized with an iterative dynamic programming algorithm and a rigid-body superposition algorithm. The second stage is computationally heavier than the first, but because the first stage has already filtered out most structures, the overall computing cost is reduced. Because the structure database (PDB structures with function annotations) contains a very large number of proteins, the proteins in the structure library are clustered by sequence similarity with CD-HIT to reduce computation, and proteins with more than 60% similarity are placed in the same cluster. When the structure alignment method is used to search for structural neighbors, the query structure is compared only with the representative structure of each cluster. If the query structure is sufficiently similar to a representative structure (for example, PSD < 0.6), all proteins in that representative's cluster are taken as structural neighbors of the query protein. This is because proteins with high sequence similarity (> 60%) usually have very similar structures.
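The cluster-representative shortcut described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `psd` is an assumed callable that returns the structural distance between two structures, the cluster dictionary stands in for CD-HIT output, and the identifiers in the toy data are hypothetical.

```python
from typing import Callable, Dict, List

def structure_neighbors(
    query: str,
    clusters: Dict[str, List[str]],
    psd: Callable[[str, str], float],
    threshold: float = 0.6,
) -> List[str]:
    """Return every member of each cluster whose representative is close to the query."""
    neighbors: List[str] = []
    for representative, members in clusters.items():
        # Only the representative is aligned against the query structure.
        if psd(query, representative) < threshold:
            # A close representative admits its whole cluster as neighbors.
            neighbors.extend(members)
    return neighbors


# Toy data: each representative maps to the members of its cluster (hypothetical IDs).
toy_clusters = {"repA": ["repA", "a1", "a2"], "repB": ["repB", "b1"]}
toy_distances = {("q", "repA"): 0.35, ("q", "repB"): 0.92}

def toy_psd(a: str, b: str) -> float:
    # Stand-in for a real structure alignment score.
    return toy_distances[(a, b)]

found = structure_neighbors("q", toy_clusters, toy_psd)
```

With this layout the number of expensive alignments equals the number of clusters rather than the number of structures in the library.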
In this embodiment, the second-level structural neighbors are used to uncover more remote functional relationships, particularly when first-level functional relationships are lacking. Finally, the function annotations of the two levels of structural neighbors are combined by the scoring function to predict the function of the query protein.
In addition, in this step, PSI-BLAST can be used to search the UniProtKB/Swiss-Prot database for homologous sequences of the query protein Q. For each homologous sequence H_k, the corresponding function annotations (Gene Ontology) are scored by the E-value of the sequence alignment. The probability score with which a function T_i is annotated to the query protein, that is, the second probability, is computed from the following quantities:
E is the E-value of the homologous sequence H_k; b is the constant log(10); n is the number of homologous sequences; Ind_k(T_i) is 1 if H_k has function T_i and 0 otherwise.
Step S4: building an SVM prediction model that predicts the function from a PSSM (position-specific scoring matrix), and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein.
In this step, a sample set of positive and negative samples is constructed, the PSSM is used as the input feature of the SVM, an SVM prediction model is built from a training set and an independent test set drawn from the samples, and the model is then used for prediction and evaluated. This technique is readily implemented by those skilled in the art and is not described further here. Preferably, this embodiment uses the auto-covariance (AC) transformation to convert the PSSM into AC variables, which form a feature vector of fixed length. The AC variables are defined in terms of the following quantities:
j denotes a descriptor, j = 1, 2, ..., D (D is the number of descriptors); i denotes the position in the sequence; L is the length of the amino acid sequence; lg is the lag, lg = 1, 2, ..., LG, where LG is the maximum value of lg. The total number of AC variables for each sequence is LG*D. Based on the AC features, a prediction model is trained for each function f_a with the support vector machine method and used for function prediction.
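The sketch below shows how an auto-covariance transformation can turn a variable-length PSSM into LG*D features. The patent's own AC formula is not reproduced in this text, so the code assumes the commonly used definition, AC(lg, j) = 1/(L - lg) * sum over i of (P[i][j] - mean_j) * (P[i+lg][j] - mean_j). The function name and the toy matrix are hypothetical; a real PSSM would typically have D = 20 columns, and the SVM training itself is not shown.

```python
from typing import List

def auto_covariance(pssm: List[List[float]], max_lag: int) -> List[float]:
    """Turn an L x D PSSM into a fixed-length vector of max_lag * D AC values."""
    length = len(pssm)
    n_desc = len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    # Column means: average score of each descriptor over the whole sequence.
    means = [sum(row[j] for row in pssm) / length for j in range(n_desc)]
    features: List[float] = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):
            # Covariance of descriptor j between positions i and i + lag.
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 4 x 2 matrix (L = 4 positions, D = 2 descriptors), not a real PSSM.
toy_pssm = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
ac = auto_covariance(toy_pssm, max_lag=2)  # 2 lags * 2 descriptors = 4 features
```

Because the output length depends only on the maximum lag and the number of descriptors, proteins of different lengths map to vectors of the same size, which is what a fixed-input classifier such as an SVM requires.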
Step S5: computing, from the query gene corresponding to the query protein and the co-expressed genes of the query gene, gene co-expression scores between the corresponding orthologs in other species, and converting, according to the gene co-expression scores, the distribution of the function in the other species into a fourth probability that the function occurs in the query protein in the target species.
This step annotates function on the basis of cross-species co-expression. Co-expression data for 11 species (human, nematode, dog, fruit fly, zebrafish, chicken, rhesus macaque, mouse, rat, budding yeast and fission yeast) can be obtained from the COXPRESdb and ArrayExpress databases. The co-expression between any two genes in each species is precomputed with the Pearson correlation coefficient. For a query gene (protein) Q in species 1, the genes P1, P2, ..., Pi have expression similar to that of Q, and Q_oj and Pi_oj are the orthologous proteins of Q and Pi in the other species (species 2, ..., species n). Fusing the co-expression information of orthologs in other species improves the reliability and coverage of gene co-expression relationships. A naive Bayes (NB) method can be used to compute the cross-species gene co-expression score (COXS), which fuses the co-expression relationship between genes in the target species with the co-expression relationships between the corresponding orthologs in other species. The gene co-expression score is computed as:
COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) * (1 - w * OS_i)
where Q is the query gene, P_i is a co-expressed gene of Q, C is the Pearson correlation coefficient of the expression of the two genes, w is the weight of ortholog expression, OS_i is the co-expression score between (Q_oj, Pi_oj), the orthologs of Q and P_i in species j, and n is the total number of species.
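As a worked illustration of the score above, the following sketch computes a Pearson correlation from two expression profiles and then applies the COXS formula exactly as written in the text, with a single ortholog score. The function names and all numeric values are made up for the example, and the sketch assumes C and OS_i lie in [0, 1] so that the result stays in that range.

```python
import math
from typing import Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def coxs(c_target: float, os_ortholog: float, w: float) -> float:
    """COXS(Q, Pi) = 1 - (1 - C(Q, Pi)) * (1 - w * OS_i), as given in the text."""
    return 1.0 - (1.0 - c_target) * (1.0 - w * os_ortholog)


# Made-up expression profiles of a query gene and a candidate co-expressed gene.
expr_q = [1.0, 2.0, 3.0, 4.0]
expr_p = [2.0, 4.1, 5.9, 8.0]
c = pearson(expr_q, expr_p)

# Illustrative values: C = 0.6 in the target species, OS_i = 0.5, weight w = 0.8.
score = coxs(0.6, 0.5, 0.8)  # 1 - 0.4 * 0.6 = 0.76
```

The form of the formula means that ortholog evidence can only raise the score above the target-species correlation: with w = 0 or OS_i = 0 the score reduces to C(Q, Pi).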
For a given function that the query protein may have, the above steps compute the first, second, third and fourth probabilities of that function, four probabilities in total.
Step S6: fusing the first, second, third and fourth probabilities to estimate the overall probability that the function occurs in the query protein.
This step can use a Bayesian network to fuse the structure, sequence, position-specific scoring matrix and cross-species co-expression information and so build the automatic protein function annotation method (PredGO). A protein that has a given function is defined as a positive (Positive). Given the number of positives among all protein-function pairs, the prior probability (Prior) of finding a positive protein-function pair is computed from the following quantities:
P(pos) is the probability that a pair is positive and P(neg) is the probability that it is negative. Correspondingly, the posterior probability is defined in terms of f_1, ..., f_N, the values of the given data sources, which here are the first to fourth probabilities, four data sources in total.
A likelihood ratio L is also defined. By Bayes' rule, the prior and the posterior are related by:
P_post = L(f_1, ..., f_N) * P_prior
Accordingly, the higher the posterior probability of a function, the more likely the query protein is to have that function.
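The relation between prior, likelihood ratio and posterior can be sketched numerically as below. Two assumptions go beyond the text and are labeled as such: the prior and posterior are treated as odds, P(pos)/P(neg), which is the usual reading of a posterior = L * prior relation, and the joint likelihood ratio is factorized into one ratio per data source, which holds only if the sources are conditionally independent (a naive Bayes simplification of a general Bayesian network). All function names and numbers are illustrative.

```python
from typing import Sequence

def prior_odds(n_positive: int, n_total: int) -> float:
    """Prior odds P(pos) / P(neg) from the count of positive protein-function pairs."""
    p_pos = n_positive / n_total
    return p_pos / (1.0 - p_pos)

def posterior_odds(prior: float, likelihood_ratios: Sequence[float]) -> float:
    """Posterior odds = L(f_1, ..., f_N) * prior odds.

    Assumption: L factorizes into a product of per-source ratios
    P(f_k | pos) / P(f_k | neg), i.e. the sources are conditionally independent.
    """
    combined = 1.0
    for ratio in likelihood_ratios:
        combined *= ratio
    return combined * prior

def odds_to_probability(odds: float) -> float:
    """Convert odds back to a probability in [0, 1)."""
    return odds / (1.0 + odds)


# Illustrative numbers: 100 positive pairs out of 10100, so prior odds are 0.01.
prior = prior_odds(n_positive=100, n_total=10100)
# One made-up likelihood ratio per data source (structure, sequence, PSSM, co-expression).
post = posterior_odds(prior, [4.0, 5.0, 2.0, 2.5])
prob = odds_to_probability(post)
```

In practice each per-source ratio would be estimated from training data, for example by binning the source's score and comparing how often positives and negatives fall into each bin.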
In the experimental validation, the method of the invention (PredGO) was compared with other protein function prediction methods on two data sets. The GOA-PDB data set consists of data newly added to the GOA database between 2010-10 and 2013-11; each protein has at least one non-IEA function annotation, and after redundancy removal with CD-HIT the set contains 3632 proteins from 256 species. The CAFA 2011 data set is the data set provided by the first protein function annotation challenge (http://biofunctionprediction.org/) and contains 866 proteins from 11 species. On the GOA-PDB data set, as shown in Fig. 3 and Table 1, the integrated method that combines multiple data sources (PredGO) performs better than any single data source for both molecular function and biological process. On the CAFA data set, Fig. 4 compares the ROC curves of the present invention (PredGO) and three other function prediction methods, BLAST, Argot2 and Str; the predictive performance of the invention (PredGO) is better (the higher the ROC curve, the better the predictive performance). Fig. 5 compares the maximum F value (Fmax) on the CAFA data set for the present invention (PredGO) and other existing function prediction methods (BLAST, Jones-UCL, Argot2, ESG and Str); the maximum F value of the invention (PredGO) is clearly higher (a higher maximum F value indicates better predictive performance).
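For readers unfamiliar with the maximum F value, the following sketch computes a protein-centric Fmax as it is commonly defined in CAFA-style evaluations. That definition is an assumption here, since the text does not spell it out: precision is averaged over proteins with at least one prediction at the threshold, recall over all benchmark proteins. The function name, GO term labels and scores are invented for the example and are not results from the patent.

```python
from typing import Dict, Set

def f_max(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    steps: int = 100,
) -> float:
    """Maximum over score thresholds of the harmonic mean of precision and recall."""
    best = 0.0
    for k in range(1, steps + 1):
        threshold = k / steps
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            if not true_terms:
                continue  # proteins without annotations cannot be scored
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision counts only proteins that have a prediction at this threshold.
                precisions.append(hits / len(predicted))
            # Recall counts every benchmark protein.
            recalls.append(hits / len(true_terms))
        if not precisions or not recalls:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Invented toy benchmark: two proteins with known terms and scored predictions.
toy_truth = {"p1": {"GO:1", "GO:2"}, "p2": {"GO:3"}}
toy_predictions = {
    "p1": {"GO:1": 0.9, "GO:2": 0.4, "GO:4": 0.5},
    "p2": {"GO:3": 0.8, "GO:5": 0.2},
}
toy_fmax = f_max(toy_predictions, toy_truth)
```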
Table 1:
In summary, the protein function annotation method disclosed in this embodiment evaluates and fuses information from structure, sequence, PSSM and cross-species co-expression to obtain the overall probability that a given function occurs in the query protein. This improves annotation performance, extends to computing a probability value for every candidate function of the query protein, and addresses the high cost and low efficiency of biological experimental methods.
Embodiment 2
Corresponding to the above method embodiment, this embodiment discloses a protein function annotation system, comprising:
a first processing module for searching for first-level structural neighbors according to the representative structure of the query protein;
a second processing module for searching for homologous sequences of the query protein and for searching for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences;
a third processing module for estimating, from the distribution of a given function among the first-level and second-level structural neighbors, a first probability that the function occurs in the query protein, and for estimating, from the distribution of the function among all the homologous sequences, a second probability that the function occurs in the query protein;
a fourth processing module for building an SVM prediction model that predicts the function from a PSSM, and for inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;
a fifth processing module for computing, from the query gene corresponding to the query protein and the co-expressed genes of the query gene, gene co-expression scores between the corresponding orthologs in other species, and for converting, according to the gene co-expression scores, the distribution of the function in the other species into a fourth probability that the function occurs in the query protein in the target species;
a sixth processing module for fusing the first, second, third and fourth probabilities to estimate the overall probability that the function occurs in the query protein.
Optionally, the first probability is computed by a formula defined in terms of the following quantities:
P_i is the structural distance between the query protein and the i-th first-level structural neighbor S_i; I_Si(f_a) is 1 if S_i has function f_a and 0 otherwise; w is a weight; N_s is the number of first-level structural neighbors; N_seq is the number of homologous sequences; E_i is the sequence similarity between the i-th homologous sequence and the query protein; P_ij is the structural distance between the i-th homologous sequence and its j-th second-level structural neighbor S_ij; N_si is the number of second-level structural neighbors; likewise, I_Sij(f_a) is 1 if S_ij has function f_a and 0 otherwise.
The second probability is computed by a formula defined in terms of the following quantities:
E_k is the sequence alignment score of the homologous sequence H_k; b is the constant log(10); n is the number of homologous sequences; Ind_k(T_i) is 1 if H_k has function T_i and 0 otherwise.
The gene co-expression score can be computed as:
COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) * (1 - w * OS_i)
where Q is the query gene, P_i is a co-expressed gene of Q, C is the Pearson correlation coefficient of the expression of the two genes, w is the weight of ortholog expression, OS_i is the co-expression score between (Q_oj, Pi_oj), the orthologs of Q and P_i in species j, and n is the total number of species.
In this embodiment, the data processing inside each module follows Embodiment 1 above and is not repeated here.
Likewise, the protein function annotation system disclosed in this embodiment evaluates and fuses information from structure, sequence, PSSM and cross-species co-expression to obtain the overall probability that a given function occurs in the query protein. This improves annotation performance, extends to computing a probability value for every candidate function of the query protein, and addresses the high cost and low efficiency of biological experimental methods.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Those skilled in the art may make various modifications and variations to the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (10)
1. A protein function annotation method, characterized by comprising:
Step S1: searching for first-level structural neighbors according to the representative structure of the query protein;
Step S2: searching for homologous sequences of the query protein, and searching for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences;
Step S3: estimating, from the distribution of a given function among the first-level and second-level structural neighbors, a first probability that the function occurs in the query protein; and estimating, from the distribution of the function among all the homologous sequences, a second probability that the function occurs in the query protein;
Step S4: building an SVM prediction model that predicts the function from a PSSM, and inputting the PSSM of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;
Step S5: computing, from the query gene corresponding to the query protein and the co-expressed genes of the query gene, gene co-expression scores between the corresponding orthologs in other species, and converting, according to the gene co-expression scores, the distribution of the function in the other species into a fourth probability that the function occurs in the query protein in the target species;
Step S6: fusing the first, second, third and fourth probabilities to estimate the overall probability that the function occurs in the query protein.
2. The protein function annotation method according to claim 1, characterized in that the first probability is calculated by the formula:
where P_i is the structural distance between the query protein and the i-th first-level structural neighbor S_i; I_{S_i}(f_a) is 1 if S_i has function f_a and 0 otherwise; w is a weight; N_s is the number of first-level structural neighbors; N_seq is the number of homologous sequences; E_i is the sequence similarity between the i-th homologous sequence and the query protein; P_ij is the structural distance between the i-th homologous sequence and its j-th second-level structural neighbor S_ij; N_si is the number of second-level structural neighbors; likewise, I_{S_ij}(f_a) is 1 if S_ij has function f_a and 0 otherwise.
3. The protein function annotation method according to claim 1, characterized in that the second probability is calculated by the formula:
where E_k is the alignment score of homologous sequence H_k; b is the constant log(10); n is the number of homologous sequences; and Ind_k(T_i) is 1 if H_k has function T_i and 0 otherwise.
4. The protein function annotation method according to claim 1, characterized in that step S4 comprises using the auto-covariance (AC) transformation to convert the PSSM matrix into features of fixed length, the AC variable being calculated by the formula:
where j denotes a descriptor, j = 1, 2, ..., D (D is the number of descriptors); i denotes the position in the sequence; L is the length of the amino acid sequence; lg is the lag, lg = 1, 2, ..., LG, where LG is the maximum value of lg; and the total number of AC variables for each sequence is LG*D. Based on the AC features, for each function f_a, a prediction model is trained with the support vector machine method to predict the function.
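As an illustrative sketch only, the fixed-length feature extraction described in claim 4 might look as follows. The AC formula itself is not reproduced in this text, so the code assumes the commonly used auto-covariance definition (for each lag lg and descriptor j, the mean over positions i of the product of the mean-centered values at i and i + lg). The function name `auto_covariance` and the list-of-rows PSSM layout are hypothetical choices, not taken from the patent.

```python
def auto_covariance(pssm, max_lag):
    """Auto-covariance (AC) features of a PSSM (assumed standard definition).

    pssm: list of L rows (one per sequence position), each a list of D scores.
    max_lag: LG, the largest lag; must satisfy 1 <= max_lag < L.
    Returns a flat list of max_lag * D values, ordered by lag, then descriptor.
    """
    length = len(pssm)
    if length == 0:
        raise ValueError("PSSM must have at least one row")
    n_desc = len(pssm[0])
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must satisfy 1 <= max_lag < sequence length")

    # Mean score of each descriptor (column) over the whole sequence.
    means = [sum(row[j] for row in pssm) / length for j in range(n_desc)]

    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):
            # Sum of products of mean-centered values that are `lag` positions apart.
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 4-position, 2-descriptor matrix; a real PSSM typically has 20 columns.
toy_pssm = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
toy_features = auto_covariance(toy_pssm, max_lag=2)
```

Whatever the sequence length, the output has LG*D entries, which is what allows one classifier per function f_a to be trained on proteins of different lengths.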
5. The protein function annotation method according to claim 1, characterized in that calculating the gene co-expression score in step S5 comprises:
COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) * (1 - w * OS_i)
where Q is the query gene; P_i is a co-expressed gene of Q; C is the Pearson correlation coefficient of the expression of the two genes; w is the weight of orthologous gene expression; OS_i is the co-expression score between (Q_oj, P_ioj), the corresponding orthologs of Q and P_i in species j; and n is the total number of species.
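A minimal sketch of the COXS formula of claim 5 is given below for illustration. The helper names `pearson` and `coxs` are hypothetical. How OS_i is aggregated over the n species is not stated in this text, so the sketch simply takes OS_i as an input value.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient C of two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (norm_x * norm_y)


def coxs(c, ortholog_score, w):
    """COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) * (1 - w * OS_i)."""
    return 1.0 - (1.0 - c) * (1.0 - w * ortholog_score)


# Hypothetical expression profiles of the query gene Q and a co-expressed gene P_i.
c_value = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
score = coxs(c=0.5, ortholog_score=0.5, w=0.5)
```

With OS_i = 0 the score reduces to C(Q, P_i); a positive OS_i raises it, so co-expression that is conserved across species strengthens the link between Q and P_i.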
6. The protein function annotation method according to claim 1, characterized in that step S6 comprises:
defining a protein that has a given function as positive and, given the number of positives and the total number of protein-function pairs, finding the prior probability of a positive protein-function pair, calculated by the formula:
where P(pos) is the probability that a prediction is correct and P(neg) is the probability that a prediction is wrong; correspondingly, the posterior probability is given by the following formula:
where f_1, ..., f_N are the values of the given data sources, comprising the four data sources of the first to fourth probabilities;
the likelihood ratio L is defined as:
According to Bayes' rule, which relates the prior and posterior probabilities:
P_post = L(f_1, ..., f_N) * P_prior;
The higher the posterior probability of a function, the more likely the query protein is to have that function.
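The fusion step of claim 6 can be sketched as follows, under stated assumptions. The prior, posterior and likelihood-ratio formulas are not reproduced in this text; the sketch assumes that P_prior and P_post are odds, P(pos)/P(neg), since that is the form in which P_post = L * P_prior follows directly from Bayes' rule. It further assumes, purely for illustration, that the joint likelihood ratio factorizes into one ratio per data source (conditional independence), which the claim does not state. All function names are hypothetical.

```python
def prior_odds(n_positive, n_total):
    """Prior odds P(pos) / P(neg) from counts of protein-function pairs."""
    n_negative = n_total - n_positive
    if n_positive <= 0 or n_negative <= 0:
        raise ValueError("need at least one positive and one negative pair")
    return n_positive / n_negative


def likelihood_ratio(per_source_ratios):
    """Joint ratio L(f_1, ..., f_N), assuming independent data sources."""
    product = 1.0
    for ratio in per_source_ratios:
        product *= ratio
    return product


def posterior_odds(prior, ratio):
    """Bayes' rule in odds form: P_post = L * P_prior."""
    return ratio * prior


def odds_to_probability(odds):
    """Convert odds back to a probability in [0, 1)."""
    return odds / (1.0 + odds)


# Hypothetical numbers: 100 positive pairs out of 1100, and four per-source
# likelihood ratios for the first to fourth probabilities.
prior = prior_odds(100, 1100)
joint = likelihood_ratio([2.0, 5.0, 1.0, 3.0])
post = posterior_odds(prior, joint)
probability = odds_to_probability(post)
```

Ranking candidate functions by `post` or by `probability` gives the same order, since the conversion from odds to probability is monotonic.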
7. A protein function annotation system for performing the method according to any one of claims 1 to 6, characterized by comprising:
a first processing module, configured to search for first-level structural neighbors according to the representative structure of a query protein;
a second processing module, configured to search for homologous sequences of the query protein, and to search for second-level structural neighbors of the query protein according to the representative structures of the homologous sequences;
a third processing module, configured to assess, according to the distribution of a given function among the first-level structural neighbors and the second-level structural neighbors, a first probability that the function occurs in the query protein; and to assess, according to the distribution of the function among all the homologous sequences, a second probability that the function occurs in the query protein;
a fourth processing module, configured to build an SVM prediction model that predicts the function from a PSSM matrix, and to input the PSSM matrix of the query protein into the SVM prediction model to obtain a third probability that the function occurs in the query protein;
a fifth processing module, configured to calculate, according to the query gene corresponding to the query protein and the co-expressed genes of the query gene, the gene co-expression scores between the corresponding orthologous genes in other species, and to convert, according to the gene co-expression scores, the distribution of the function in the other species into a fourth probability that the function occurs in the query protein in the target species;
a sixth processing module, configured to fuse the first, second, third and fourth probabilities to assess the combined probability that the function occurs in the query protein.
8. The protein function annotation system according to claim 7, characterized in that the first probability is calculated by the formula:
where P_i is the structural distance between the query protein and the i-th first-level structural neighbor S_i; I_{S_i}(f_a) is 1 if S_i has function f_a and 0 otherwise; w is a weight; N_s is the number of first-level structural neighbors; N_seq is the number of homologous sequences; E_i is the sequence similarity between the i-th homologous sequence and the query protein; P_ij is the structural distance between the i-th homologous sequence and its j-th second-level structural neighbor S_ij; N_si is the number of second-level structural neighbors; likewise, I_{S_ij}(f_a) is 1 if S_ij has function f_a and 0 otherwise.
9. The protein function annotation system according to claim 7, characterized in that the second probability is calculated by the formula:
where E_k is the alignment score of homologous sequence H_k; b is the constant log(10); n is the number of homologous sequences; and Ind_k(T_i) is 1 if H_k has function T_i and 0 otherwise.
10. The protein function annotation system according to claim 7, characterized in that calculating the gene co-expression score comprises:
COXS(Q, P_i) = 1 - (1 - C(Q, P_i)) * (1 - w * OS_i)
where Q is the query gene; P_i is a co-expressed gene of Q; C is the Pearson correlation coefficient of the expression of the two genes; w is the weight of orthologous gene expression; OS_i is the co-expression score between (Q_oj, P_ioj), the corresponding orthologs of Q and P_i in species j; and n is the total number of species.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611128108.4A CN106599611B (en) | 2016-12-09 | 2016-12-09 | Protein function annotation method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611128108.4A CN106599611B (en) | 2016-12-09 | 2016-12-09 | Protein function annotation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599611A (en) | 2017-04-26 |
CN106599611B (en) | 2019-04-30 |
Family
ID=58598271
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611128108.4A (Active) CN106599611B (en) | 2016-12-09 | 2016-12-09 | Protein function annotation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599611B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107977548A (en) * | 2017-12-05 | 2018-05-01 | 东软集团股份有限公司 | Method, apparatus, medium and electronic device for predicting protein-protein interaction |
CN108334746A (en) * | 2018-01-15 | 2018-07-27 | 浙江工业大学 | Protein structure prediction method based on secondary structure similarity |
CN109101785A (en) * | 2018-07-12 | 2018-12-28 | 浙江工业大学 | Protein structure prediction method based on secondary structure similarity selection strategy |
CN109785901A (en) * | 2018-12-26 | 2019-05-21 | 东软集团股份有限公司 | Protein function prediction method and device |
CN109801675A (en) * | 2018-12-26 | 2019-05-24 | 东软集团股份有限公司 | Method, device and equipment for determining protein lipid function |
CN109817275A (en) * | 2018-12-26 | 2019-05-28 | 东软集团股份有限公司 | Protein function prediction model generation method, protein function prediction method and device |
CN110277136A (en) * | 2019-07-05 | 2019-09-24 | 湖南大学 | Protein sequence database parallel search identification method and device |
CN114627963A (en) * | 2022-05-16 | 2022-06-14 | 北京肿瘤医院(北京大学肿瘤医院) | Protein data filling method, system, computer device and readable storage medium |
CN115497555A (en) * | 2022-08-16 | 2022-12-20 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-species protein function prediction method, device, equipment and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170329914A1 (en) * | 2016-05-11 | 2017-11-16 | International Business Machines Corporation | Predicting Personalized Cancer Metastasis Routes, Biological Mediators of Metastasis and Metastasis Blocking Therapies |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1328601A (en) * | 1998-08-25 | 2001-12-26 | 斯克利普斯研究院 | Methods and systems for predicting protein function |
CN103473483A (en) * | 2013-10-07 | 2013-12-25 | 谢华林 | Online predicting method for structure and function of protein |
CN106068273A (en) * | 2013-12-31 | 2016-11-02 | 丹尼斯科美国公司 | Enhanced protein expression |
- 2016-12-09: CN application CN201611128108.4A, patent CN106599611B (en), status Active
Non-Patent Citations (1)
Title |
---|
LEI DENG ET AL.: "An Integrated Framework for Functional Annotation of Protein Structural Domains", 《IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS》 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107977548B (en) * | 2017-12-05 | 2020-04-07 | 东软集团股份有限公司 | Method, device, medium, and electronic device for predicting protein-protein interaction |
CN107977548A (en) * | 2017-12-05 | 2018-05-01 | 东软集团股份有限公司 | Method, apparatus, medium and electronic device for predicting protein-protein interaction |
CN108334746A (en) * | 2018-01-15 | 2018-07-27 | 浙江工业大学 | Protein structure prediction method based on secondary structure similarity |
CN108334746B (en) * | 2018-01-15 | 2021-06-18 | 浙江工业大学 | Protein structure prediction method based on secondary structure similarity |
CN109101785A (en) * | 2018-07-12 | 2018-12-28 | 浙江工业大学 | Protein structure prediction method based on secondary structure similarity selection strategy |
CN109101785B (en) * | 2018-07-12 | 2021-06-18 | 浙江工业大学 | Protein structure prediction method based on secondary structure similarity selection strategy |
CN109817275B (en) * | 2018-12-26 | 2020-12-01 | 东软集团股份有限公司 | Protein function prediction model generation method, protein function prediction device, and computer readable medium |
CN109817275A (en) * | 2018-12-26 | 2019-05-28 | 东软集团股份有限公司 | Protein function prediction model generation method, protein function prediction method and device |
CN109801675B (en) * | 2018-12-26 | 2021-01-05 | 东软集团股份有限公司 | Method, device and equipment for determining protein lipid function |
CN109801675A (en) * | 2018-12-26 | 2019-05-24 | 东软集团股份有限公司 | Method, device and equipment for determining protein lipid function |
CN109785901A (en) * | 2018-12-26 | 2019-05-21 | 东软集团股份有限公司 | Protein function prediction method and device |
CN109785901B (en) * | 2018-12-26 | 2021-07-30 | 东软集团股份有限公司 | Protein function prediction method and device |
CN110277136A (en) * | 2019-07-05 | 2019-09-24 | 湖南大学 | Protein sequence database parallel search identification method and device |
CN114627963A (en) * | 2022-05-16 | 2022-06-14 | 北京肿瘤医院(北京大学肿瘤医院) | Protein data filling method, system, computer device and readable storage medium |
CN115497555A (en) * | 2022-08-16 | 2022-12-20 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-species protein function prediction method, device, equipment and storage medium |
CN115497555B (en) * | 2022-08-16 | 2024-01-05 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-species protein function prediction method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106599611B (en) | 2019-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106599611A (en) | Marking method and system for protein functions | |
Cho et al. | Diffusion component analysis: unraveling functional topology in biological networks | |
CN103413067B (en) | A kind of protein structure prediction method based on abstract convex Lower Bound Estimation | |
O'Donoghue et al. | Evolutionary profiles derived from the QR factorization of multiple structural alignments gives an economy of information | |
Zhang et al. | Multi-head attention-based probabilistic CNN-BiLSTM for day-ahead wind speed forecasting | |
Mittal et al. | Recruiting machine learning methods for molecular simulations of proteins | |
CN109785901B (en) | Protein function prediction method and device | |
CN107491664B (en) | Protein structure de novo prediction method based on information entropy | |
Wang et al. | Machine learning-based methods for prediction of linear B-cell epitopes | |
Zhang et al. | Predicting linear B-cell epitopes by using sequence-derived structural and physicochemical features | |
Tian et al. | Pairwise alignment of interaction networks by fast identification of maximal conserved patterns | |
CN114202123A (en) | Service data prediction method and device, electronic equipment and storage medium | |
Yu et al. | SOMPNN: an efficient non-parametric model for predicting transmembrane helices | |
Wang et al. | A brief review of machine learning methods for RNA methylation sites prediction | |
CN102760209A (en) | Transmembrane helix predicting method for nonparametric membrane protein | |
Sun et al. | Smolign: a spatial motifs-based protein multiple structural alignment method | |
Zhou et al. | Enhanced hybrid search algorithm for protein structure prediction using the 3D-HP lattice model | |
Yan et al. | A systematic review of state-of-the-art strategies for machine learning-based protein function prediction | |
CN109378034B (en) | Protein prediction method based on distance distribution estimation | |
Wu et al. | Unified deep learning architecture for modeling biology sequence | |
Bidargaddi et al. | Combining segmental semi-Markov models with neural networks for protein secondary structure prediction | |
CN108920894B (en) | Protein conformation space optimization method based on brief abstract convex estimation | |
CN102831332A (en) | Interpretation prediction method of transmembrane helix of membrane protein | |
Zhang et al. | PathEmb: Random walk based document embedding for global pathway similarity search | |
Li et al. | Deep contextual representation learning for identifying essential proteins via integrating multisource protein features |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |