CN106845149A

CN106845149A - A new protein sequence representation method based on gene ontology information

Info

Publication number: CN106845149A
Application number: CN201710071092.6A
Authority: CN
Inventors: 肖绚; 程翔
Original assignee: Jingdezhen Ceramic Institute
Current assignee: Shanghai simudi Medical Information Technology Co.,Ltd.
Priority date: 2017-02-09
Filing date: 2017-02-09
Publication date: 2017-06-13
Anticipated expiration: 2037-02-09
Also published as: CN106845149B

Abstract

The present invention relates to a new protein sequence representation method based on gene ontology (GO) information. First, the BLAST program is used to search the Swiss-Prot database for all sequences similar to a protein sequence P, and every protein in the training dataset is submitted to the GO database to look up the GO information it carries. Next, the annotated gene ontology information of protein P is looked up in the gene ontology database. Finally, given the M labels of the prediction problem, protein P is defined as a discrete vector of M elements. By fusing the GO information of proteins in the sequence set into the vector description of the new protein P, the method greatly reduces the dimensionality of GO-based representations. Applied to multi-label prediction of protein subcellular localization and of antimicrobial peptide function, it can significantly improve the prediction success rate of the corresponding predictors and has broad application prospects.

Description

A new protein sequence representation method based on gene ontology information

Technical field

The present invention relates to the fields of bioinformatics, pseudo amino acid composition of proteins, and traditional protein sequence analysis, and more particularly to a new protein sequence representation method based on gene ontology information.

Background art

With the progress of sequencing technology over the past two decades, bioinformatics has fully entered the post-genome era. Current research focuses on how to analyze hundreds of millions of genome sequences and answer a series of questions: in which subcellular location a protein works, what function it has, what secondary, tertiary, and quaternary structure it adopts, how these genes make an organism function, and which proteins may be potential drug targets.

Because answering these questions with biological experiments is time-consuming and laborious, bioinformatics has developed rapidly in recent years and a series of online predictors has emerged. Although the results of these predictors still need to be verified experimentally, they are very helpful to biologists, for example by narrowing the scope of experiments and assisting genomic drug design.

Some of these predictors are based on sequence information, some on structural information, and some on the latest sequencing information. Predictors based on sequence information generally perform worse than those based on structural information, but the information they need is available for most proteins, so they have been developed extensively. Most sequence-based predictors describe a protein sequence by its pseudo amino acid composition, using components such as dipeptide composition, tripeptide composition, grey-theory factors, and complexity factors. Some of these components describe local amino acid order information well and others describe global amino acid order information well; all have contributed to sequence-based prediction of protein structure and function classes.

In recent years Gene Ontology has become a particularly important method and tool in bioinformatics and has greatly deepened the integration and use of biological data. Predicting protein structure and function with gene ontology (Gene Ontology, GO) information performs better than other approaches such as functional domains and pseudo amino acid composition. GO divides the vocabulary of genes and gene products into three categories covering three aspects of biology: 1) cellular component; 2) molecular function; 3) biological process. The number of terms in the GO database has grown from a few thousand to more than 50,000. GO is structured as a directed acyclic graph, and the current GO uses three relations: is_a, part_of, and regulates. The most common GO-based prediction approach uses a 0-1 discrete vector: for each GO term, the corresponding element is 1 if the protein is annotated with that term and 0 otherwise. This captures only presence or absence. Some researchers improved on it by counting how many times a specific GO term occurs for a protein, turning the 0-1 vector into an integer vector that adds frequency information. Because the GO vocabulary keeps growing, all of these methods suffer from the curse of dimensionality. To address this, some researchers use only the part of the GO vocabulary that is relevant to the prediction problem instead of the whole vocabulary, which reduces the dimension of the discrete vector and removes some irrelevant information.
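The 0-1 and integer encodings described in this paragraph can be illustrated with a minimal Python sketch. The vocabulary and annotation list are made-up placeholders, not data from this document, and the function names are hypothetical. A real GO vocabulary has tens of thousands of terms, which is where the dimensionality problem comes from.

```python
from collections import Counter

def binary_go_vector(annotations, vocabulary):
    """0-1 encoding: 1 if the protein carries the GO term, else 0."""
    present = set(annotations)
    return [1 if term in present else 0 for term in vocabulary]

def count_go_vector(annotations, vocabulary):
    """Integer encoding: how many times each GO term is annotated."""
    counts = Counter(annotations)
    return [counts[term] for term in vocabulary]

# Hypothetical toy vocabulary and annotation list (illustration only).
vocabulary = ["GO:0001669", "GO:0016021", "GO:0022857", "GO:0030054"]
annotations = ["GO:0016021", "GO:0001669", "GO:0016021"]

print(binary_go_vector(annotations, vocabulary))  # [1, 1, 0, 0]
print(count_go_vector(annotations, vocabulary))   # [1, 2, 0, 0]
```

In both encodings the vector length equals the vocabulary size, regardless of how many classes the prediction problem has.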

Besides discrete vectors, there are also semantic similarity algorithms based on gene ontology, mainly algorithms for computing the similarity of terms within the same GO branch and algorithms for computing the similarity of terms across GO branches. These are of great significance to popular research areas such as gene function analysis, comparison, and prediction. However, as the number of GO terms has increased sharply, the complexity and computation time of these algorithms have increased as well.

The methods above all rely on simple summation statistics over gene ontology terms or on similarity computation. However, not every protein has relevant information in the GO database, which is a weakness of GO-based methods. The present invention therefore fuses a protein's GO information with the GO information of other similar proteins and, by tying the representation to the number of classes in the prediction problem, reduces the dimension of the GO description vector. The resulting GO-based protein sequence description method supports sequence-based prediction of protein function, structure class, and so on.

Summary of the invention

The technical problem to be solved by the present invention is to provide a new protein sequence representation method based on gene ontology information that fuses the GO information of other proteins into the vector description of a new protein P, in order to address the relatively low success rate of multi-label prediction of protein subcellular localization.

To solve the above technical problem, the technical solution of the present invention is a new protein sequence representation method based on gene ontology information, characterized by comprising the following steps:

(1) Search the Swiss-Prot database with the BLAST program to find all sequences similar to protein sequence P;

(2) Submit all proteins in the training dataset to the GO database and look up the GO information of each protein; the GO database website is http://www.geneontology.org/;

(3) Look up the annotated gene ontology information of protein P in the gene ontology database. If protein P has no relevant information, look up the GO information of the similar protein sequences in descending order of similarity to P until at least one GO term is found, and use it as the GO information of protein P;

(4) Suppose the protein function or other prediction problem for P has M labels, denoted A₁, A₂, …, A_M. Protein P is then defined as a discrete vector of M elements, as shown in the following formula:

P = [δ₁, δ₂, …, δ_M]

δ₁ represents the probability that protein P belongs to the first label, δ₂ the probability that it belongs to the second label, and so on; δ_M represents the probability that it belongs to the M-th label. All are initialized to 0;

δ_i (i = 1, 2, …, M) is computed as follows:

For each GO term of protein P in turn, find the proteins in the training dataset that carry that term. Suppose n training proteins carry the term, denoted P₁, P₂, …, P_n. If the labels of P₁ are A_i and A_j, then δ_i and δ_j are each increased by 1; if the labels of P₂ are A_r, A_t, and A_y, then δ_r, δ_t, and δ_y are each increased by 1; and so on, until every GO term of protein P has been processed in this way. The result is the new GO-based description of the protein.
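Steps (3) and (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: the function names, the dictionary layouts, and all identifiers in the toy data are hypothetical, and each training protein is counted once per GO term.

```python
def resolve_go_terms(protein, ranked_homologs, go_db):
    """Step (3): GO terms of `protein`, else of the most similar homolog that has any."""
    for candidate in [protein, *ranked_homologs]:
        terms = go_db.get(candidate, [])
        if terms:
            return terms
    return []

def label_count_vector(go_terms, training_set, num_labels):
    """Step (4): M-element vector, incremented once per matching training label."""
    delta = [0] * num_labels                  # all elements start at 0
    for term in go_terms:
        for train_terms, labels in training_set.values():
            if term in train_terms:
                for label in labels:          # labels are 1-based: A_1 .. A_M
                    delta[label - 1] += 1
    return delta

# Hypothetical toy data: P and H1 have no GO terms, H2 does.
go_db = {"P": [], "H1": [], "H2": ["GO:a", "GO:b"]}
training_set = {
    "T1": ({"GO:a"}, [1]),
    "T2": ({"GO:a", "GO:b"}, [2, 3]),
    "T3": ({"GO:c"}, [1, 3]),
}

terms = resolve_go_terms("P", ["H1", "H2"], go_db)
print(terms)                                                   # ['GO:a', 'GO:b']
print(label_count_vector(terms, training_set, num_labels=3))   # [1, 2, 2]
```

The output vector has M elements however large the GO vocabulary is, which is the dimensionality reduction the method aims for.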

When the method is used for multi-label prediction of protein subcellular localization, the absolute success rate of the corresponding predictor improves by 5 to 10%.

Compared with existing GO-based methods, the proposed method greatly reduces dimensionality. Existing methods reach dimensions of more than ten thousand, whereas with this method the dimension equals the number of labels to be predicted, typically only a few dozen. If the protein to be predicted has no GO information, the GO information of its most similar protein is used instead, which widens the range of proteins to which GO-based methods apply. Applied to multi-label prediction of protein subcellular localization and of antimicrobial peptide function, the method can significantly improve the prediction success rate of the corresponding predictors and has broad application prospects.

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment. It should be understood that the specific embodiment described here only explains the invention and does not limit it. The example used here is a multi-label prediction algorithm for the subcellular localization of animal proteins.

The new protein sequence representation method based on gene ontology information of the present invention comprises the following specific steps:

(1) Search the Swiss-Prot database with the BLAST program to find all sequences similar to protein sequence P.

Protein P can be entered directly on the BLAST tool web page of the Swiss-Prot database at http://www.uniprot.org/blast/, with the BLAST parameters left at their defaults. Alternatively, BLAST can be downloaded from NCBI and configured locally (the local version used here is blast-2.2.28+), with all protein sequences downloaded from the Swiss-Prot protein database. For example, entering protein Q63564 yields a series of similar proteins ranked by similarity: Q8BG39, A0A091DVS5, HOVBF0, …
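For a local BLAST+ run, the ranked list of similar proteins could be recovered from tabular output, as in the Python sketch below. It assumes the search was run with `-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. Ranking by bit score is an assumption, since the text does not say which similarity measure is used. The function name, the identifiers, and the numbers in the sample are invented for illustration.

```python
def rank_blast_hits(tabular_text, query_id):
    """Return subject IDs for `query_id`, most similar first (by best bit score)."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                               # skip blank and comment lines
        fields = line.split("\t")
        qseqid, sseqid, bitscore = fields[0], fields[1], float(fields[11])
        if qseqid != query_id or sseqid == query_id:
            continue                               # skip other queries and the self-hit
        best[sseqid] = max(bitscore, best.get(sseqid, 0.0))
    return sorted(best, key=best.get, reverse=True)

# Hypothetical outfmt 6 rows; HIT_A appears twice (two alignments to one subject).
sample = "\n".join([
    "QUERY\tQUERY\t100.0\t200\t0\t0\t1\t200\t1\t200\t1e-150\t410.0",
    "QUERY\tHIT_B\t45.0\t180\t99\t2\t5\t184\t3\t180\t2e-40\t150.0",
    "QUERY\tHIT_A\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-110\t330.0",
    "QUERY\tHIT_A\t60.0\t50\t20\t0\t10\t59\t10\t59\t1e-10\t55.0",
])

print(rank_blast_hits(sample, "QUERY"))  # ['HIT_A', 'HIT_B']
```

The resulting list is the order in which similar proteins would be consulted when the query protein itself has no GO information.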

(2) Submit all proteins in the training dataset to the GO database and look up the GO information of each protein; the GO database website is http://www.geneontology.org/.

For example, the GO information of protein Q63564 is (GO:0001669, GO:0016021, GO:0022857, GO:0030054, GO:0030672, GO:0043195, GO:0055085).

(3) Look up the annotated gene ontology information of protein P in the gene ontology database. If protein P has no relevant information, look up the GO information of the similar protein sequences in descending order of similarity to P until at least one GO term is found, and use it as the GO information of protein P.

Because Q63564 has gene ontology information in the database, that information is used directly. If none were available, the gene ontology information of Q8BG39, A0A091DVS5, HOVBF0, … would be looked up in turn, in the descending order of similarity obtained in step (1), and used as the ontology information of Q63564.

(4) In the existing dataset for multi-label prediction of animal protein subcellular localization there are 20 subcellular locations, so the subcellular localization of protein P has 20 labels, denoted A₁, A₂, …, A₂₀. Protein P is defined as a discrete vector of 20 elements, as shown in the following formula:

P = [δ₁, δ₂, …, δ₂₀]

δ₁ represents the probability that protein P belongs to the first label, δ₂ the probability that it belongs to the second label, and so on; δ₂₀ represents the probability that it belongs to the 20th label. All are initialized to 0;

δ_i (i = 1, 2, …, 20) is computed as follows:

For each GO term of protein Q63564 in turn (GO:0001669, GO:0016021, GO:0022857, GO:0030054, GO:0030672, GO:0043195, GO:0055085), find the proteins in the training dataset that carry that term. For example, the training-set proteins carrying GO:0001669 are Q29108, Q32PB3, Q6AXZ6, Q29016, Q63053, A0JN61, P79136, Q63053, P79136, Q29016, Q6AXZ6, Q32PB3, Q29108, Q63053, denoted P₁, P₂, …, P₁₄. P₁ (Q29108) has label 1, so δ₁ is increased by 1. P₂ (Q32PB3) has labels 1, 2, and 18, so δ₁, δ₂, and δ₁₈ are each increased by 1. P₃ (Q6AXZ6) has label 1, so δ₁ is increased by 1. P₄ (Q29016) has label 1, so δ₁ is increased by 1. P₅ (Q63053) has labels 2, 5, 6, 7, 9, 18, and 20, so δ₂, δ₅, δ₆, δ₇, δ₉, δ₁₈, and δ₂₀ are each increased by 1. P₆ (A0JN61) has labels 2 and 18, so δ₂ and δ₁₈ are each increased by 1. This continues until every GO term of protein Q63564 has been processed in this way, which yields the new GO-based description of protein Q63564.
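The increments spelled out above for P₁ through P₆ can be tallied with a short Python sketch. It covers only those six proteins, because the labels of the remaining entries for GO:0001669 and the matches for the other six GO terms are not all stated here, so the result is an intermediate state and not the final vector for Q63564. The variable names are hypothetical.

```python
from collections import Counter

# Labels of the first six training proteins listed for GO:0001669, as given in the text.
labels_by_protein = [
    ("Q29108", [1]),
    ("Q32PB3", [1, 2, 18]),
    ("Q6AXZ6", [1]),
    ("Q29016", [1]),
    ("Q63053", [2, 5, 6, 7, 9, 18, 20]),
    ("A0JN61", [2, 18]),
]

# Count how many times each label is incremented.
tally = Counter(label for _, labels in labels_by_protein for label in labels)

# 20-element vector; delta[0] holds δ₁ and delta[19] holds δ₂₀.
delta = [tally[i] for i in range(1, 21)]
print(delta)
# [4, 3, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1]
```

After these six proteins, δ₁ = 4, δ₂ = 3, δ₁₈ = 3, and δ₅, δ₆, δ₇, δ₉, δ₂₀ are each 1.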

When the method is used for multi-label prediction of protein subcellular localization, the absolute success rate of the corresponding predictor improves by 8%.

The above is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A new protein sequence representation method based on gene ontology information, characterized by comprising the following steps:

δ₁ represents the probability that protein P belongs to the first label, δ₂ the probability that it belongs to the second label, and so on; δ_M represents the probability that it belongs to the M-th label. All are initialized to 0;

δ_i (i = 1, 2, …, M) is computed as follows:

2. The protein sequence representation method based on gene ontology information according to claim 1, characterized in that the method is used for multi-label prediction of protein subcellular localization, and the absolute success rate of the corresponding predictor improves by 5 to 10%.