CN1566365A - Microbe gene prediction method based on polynary entropy distance method - Google Patents

Microbe gene prediction method based on polynary entropy distance method

Publication number
CN1566365A
Authority
CN
China
Prior art keywords
orf
sequence
gene
dna
polynary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 03147763
Other languages
Chinese (zh)
Inventor
佘振苏
朱怀球
欧阳正清
姚新秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN 03147763
Publication of CN1566365A
Legal status: Pending

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to microbial genome sequence analysis, microbial gene identification, and microbial species identification. The method comprises the steps of: (1) setting a number of known coding ORFs and non-coding ORFs as the cluster center points of the initial state; (2) reading the microbial DNA sequence; (3) finding the longest ORFs in the sequence; (4) analyzing and classifying the ORFs of the microbial DNA sequence; (5) adding the newly classified sequences to the cluster center points and repeating step 4; (6) determining the candidate genes classified as coding to be protein-coding genes.

Description

Microbial gene prediction method based on the multivariate entropy distance method
Technical field
The present invention relates to bioinformatics technology, and in particular to bioinformatics technologies such as microbial genome sequence analysis, microbial gene identification, and microbial species identification.
Technical background
The 21st century is the era of life science and also the era of information science. As the tasks of the Human Genome Project near completion, sequence and structure data for nucleic acids and proteins are growing exponentially. Faced with such large and complex data, it is imperative to use computers to manage data, control errors, and accelerate analysis. Bioinformatics has therefore become one of the major frontiers of the life sciences and natural sciences, and one of the core fields of natural science in the 21st century. The importance of bioinformatics technology, which is driven by the results of bioinformatics research, is becoming ever more prominent. In recent years, the development of computer and Internet technology has provided the hardware basis and the practical means for transmitting biological information, which has greatly promoted the development of bioinformatics technology. In April 2001, following the Human Genome Project, the U.S. Department of Energy proposed a new ten-year blueprint, "Genome To Life", aimed at understanding the secrets of life. The preface of that plan states: "The most important task of biology in the 21st century is to understand the secrets of life at the genome level. Beyond doubt, reaching this goal will depend on a new round of revolution in bioinformatics technology built on the combination of systems biology and bioinformatics." Bioinformatics technology, which integrates information technology and biotechnology, has thus become a focus of the current technological revolution. It is a necessary guide and driving force for nearly all future biological and medical research and development, and a pillar of future economic development.
For many years, experimental techniques were the main way to find new genes in genomic DNA sequences. With the rapid development of bioinformatics technology, however, gene prediction by bioinformatics methods, that is, by theoretical methods implemented with computer technology, has increasingly become an important way to solve this class of problem. Gene prediction means using computer technology and theoretical methods to locate the many genes and their regulatory regions in a genomic DNA sequence. With bioinformatics technology, important information such as reliable gene positions and functional site positions can be obtained at lower cost and in less time. Gene prediction methods are a necessary tool for analyzing and exploiting genomic information, an important means of discovering new genes in the future, and one of the basic problems of bioinformatics research. Genomic data are currently growing explosively, and the management, analysis, and integrated application of genomic information and DNA sequence data are all becoming more complex. Specialized services led by information technology, especially gene prediction services based on computer and information technology, will therefore become a core technology in the development of biomedical and pharmaceutical technology in the post-genome era.
Microorganisms (including bacteria, actinomycetes, fungi, viruses, rickettsiae, mycoplasmas, chlamydiae, and some lower unicellular animals and plants) are good material for modern molecular genetics research and an indispensable route for studying human genetics. As the reactors of microbial genetic engineering, microorganisms can also be applied directly to the production of modern genetic engineering products such as interferon, human insulin, growth hormone, and hepatitis B vaccine, and they are used widely in agriculture, industry, and biopharmaceutical engineering. Research on the genetic information of microbial genomes is therefore of great significance to the development of modern life science and genetic engineering, and its economic value is inestimable. The genetic characteristics of certain microorganisms (for example Escherichia coli) have been studied fairly thoroughly. However, the number of microbial species is estimated at 2 to 3 million, of which fewer than 0.5% have been identified by scientists. As of early 2003, only a little over 100 species had had their complete genome DNA sequences sequenced and their genes located, and most of these gene locations were obtained by computation with existing microbial gene prediction software systems. As research into the secrets of life at the level of known microbial genomes deepens, and as research on and biotechnological use of more unknown microbial species advances, gene prediction software systems can be expected to show ever greater technical and economic value.
The best-known microbial gene prediction software systems at present are the GeneMark software systems (including the recently released GeneMarkS) developed by M. Borodovsky et al. at the Georgia Institute of Technology, USA, and the GLIMMER software system released by S. L. Salzberg et al. at Johns Hopkins University, USA. They predict genes mainly by extracting information about certain local features of the DNA sequence, such as promoter signals and the correlation structure of neighboring bases, and accordingly use high-order Markov chains or hidden Markov models. These two software systems are currently the most accurate in the world and can be accessed over the Internet. The address of GeneMark and GeneMarkS is http://opal.biology.gatech.edu/GeneMark. The user must submit the unknown DNA sequence to be analyzed to the server hosting the program over the Internet. The gene information obtained by the system's analysis, which includes the positions of the start and stop sites of each gene, the transcription direction of each gene, and the gene length, is finally sent to the user by e-mail. The address of GLIMMER is http://www.tigr.org/software/glimmer or http://www.cs.jhu.edu/labs/compbio/glimmer.html. This system comprises two related executable programs, which the user can download from the above websites and run on a local computer. The program output gives the gene location information in the unknown DNA sequence (the position, length, and transcription direction of each gene) and saves this information as a text file.
Since 1998, GeneMarkS and GLIMMER have been adopted by many sequencing centers around the world, alone or together with other software, to identify genes in newly sequenced sequences. There is growing evidence, however, that the microbial gene annotations produced by existing gene prediction systems contain far more errors than expected, and that these errors stem mainly from the gene prediction methods those systems use. Developing new gene prediction methods and designing more efficient and more accurate microbial gene prediction systems is therefore an urgent need in the development of bioinformatics technology.
Summary of the invention
The purpose of the invention is to provide an advanced microbial gene prediction method that can analyze microbial genome sequences conveniently and accurately.
For this reason, the present invention adopts following scheme:
A microbial gene prediction method based on the multivariate entropy distance method, characterized in that it comprises the following steps:
a. setting known coding ORFs and non-coding ORFs, and mapping them one by one into the EDP phase space as the cluster center points of the initial state;
b. reading the microbial DNA sequence to be analyzed;
c. finding all the longest ORFs in the DNA sequence and recording their positions in the sequence, each ORF being mapped to a point in the EDP phase space, with all ORFs initially uncertain;
d. using the cluster center points of the initial state, analyzing and classifying all uncertain ORFs according to the Euclidean distance defined on the EDP phase space, dividing them into three classes in the EDP phase space: coding ORFs, non-coding ORFs, and uncertain ORFs;
e. adding the ORFs newly classified as coding or non-coding to the cluster center points, and repeating step d until every uncertain ORF has been assigned to the coding or non-coding class;
f. determining the ORFs classified as coding to be protein-coding genes.
In step b, the microbial DNA sequence may be either a whole genome sequence or a continuous segment of a genome sequence.
In step d, the following criterion is used:
$$D_c / D_{nc} < \mathrm{coef}$$
where $D_c$ and $D_{nc}$ are the distances from the ORF under test to the mean center point of the known coding ORFs and to the mean center point of the known non-coding ORFs, respectively, and coef is an adjustable coefficient.
When the inequality holds, the DNA sequence is a coding sequence; when it does not hold, the DNA sequence is a non-coding sequence.
The value of coef is 1.
An ORF is a continuous stretch of nucleotide triplets in the DNA sequence that begins with the translation start codon ATG and, read along the transcription direction, ends at the nearest translation stop codon TAA, TGA, or TAG.
The microbial gene prediction method based on the multivariate entropy distance method further comprises, in step f, a step of writing the information of the genes determined to be protein-coding into a text file and outputting it.
With the prediction method of the invention, protein-coding genes and their positions in a genome sequence can be predicted conveniently and accurately. The prediction accuracy is at least on a par with the best existing techniques internationally, the prediction speed is markedly higher, and the method is simple to use.
Description of drawings
Fig. 1 is a flow chart of the prediction process of the invention;
Fig. 2 is a schematic view of a program interface of a specific embodiment of the invention;
Fig. 3 is a schematic view of another program interface of a specific embodiment of the invention.
Embodiment
Specific embodiments of the invention are described below with reference to the accompanying drawings.
After sequencing and sequence assembly, one obtains long DNA sequences of a microorganism, or the DNA sequence of its whole genome. Finding the unknown genes contained in these sequences by bioinformatics methods is one of the most important problems of bioinformatics technology, and it is the goal that a microbial gene prediction system sets out to achieve.
The invention starts from the idea, originally proposed by the applicants, of describing microbial genomic DNA sequences by a multivariate entropy distance (Multivariate Entropy Distance, MED). Based on the structural characteristics of microbial genomes, it provides a new method, the MED method, for systematically searching for gene regions on a DNA sequence. Using the MED method, the applicants have also independently developed MED version 1.0, a software system for microbial gene prediction. The system is easy to use and is an automatic prediction system that does not depend on a training set. It can analyze either the complete genome DNA sequence of a microorganism or a long continuous fragment of a genome sequence, and it automatically predicts all complete gene regions on both strands of the DNA sequence (the 5'-3' direction and the 3'-5' direction). The output file lists the position of every gene in the sequence (the positions of both ends and the gene length) and its translation direction.
The multivariate entropy distance (MED) method characterizes DNA sequences with the methods of statistical linguistics. C. E. Shannon, the founder of information theory and communication theory, pointed out when discussing artificial languages that the best characterization of a passage or a language is the frequency of occurrence of its basic vocabulary. How, then, should the basic vocabulary of the genetic language of a genomic DNA sequence be chosen? According to the central dogma of molecular genetics, a DNA sequence with a coding function is translated by the universal genetic code into an amino acid sequence with biological meaning, and the amino acid sequence folds in a specific spatial arrangement into a biologically active protein molecule that functions in life processes. Taking the 20 amino acids as the basic vocabulary for understanding biological DNA sequences is therefore a very natural choice. Any segment of DNA can be "translated" by the universal genetic code into an amino acid sequence, which we call a pseudo-amino-acid sequence. We hold that the amino acid sequence corresponding to a biologically meaningful DNA sequence that can encode a protein differs in a definite way from the pseudo-amino-acid sequence corresponding to a non-coding DNA sequence. To characterize this difference, we first introduce a multivariate parameter, the entropy density profile (EDP).
Suppose the amino acid sequence of a given DNA sequence has length $L$ (in amino acids) and the $i$-th amino acid (ordered by its letter abbreviation) occurs $L_i$ times. The usage frequency (or abundance) of the $i$-th amino acid is then
$$p_i = \frac{L_i}{L}.$$
According to the definition of the Shannon entropy
$$H = -\sum_{i=1}^{20} p_i \log p_i,$$
the entropy density profile (EDP) of this DNA sequence can be constructed as
$$S_i = -\frac{1}{H}\, p_i \log p_i, \qquad i = 1, \ldots, 20.$$
In this way, for a DNA sequence of any finite length we can construct its multivariate parameter EDP, that is, $\{S_i\}$ ($i = 1, \ldots, 20$), which corresponds to a point in the 20-dimensional EDP phase space.
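As an illustration of this construction, the following Python sketch translates a DNA fragment with the standard genetic code and computes its EDP. The function names `translate` and `edp`, and the decision to skip stop codons and unrecognized codons during translation, are assumptions made for this sketch; the text does not specify them. The base of the logarithm cancels in $S_i$, so the natural logarithm is used.

```python
import math
from collections import Counter

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: CODE[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
AMINO_ACIDS = sorted(set(CODE) - {"*"})  # 20 one-letter codes in alphabetical order


def translate(dna: str) -> str:
    """Translate codon by codon from position 0 (assumption: stop and
    unrecognized codons are skipped rather than terminating translation)."""
    dna = dna.upper()
    residues = []
    for pos in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[pos:pos + 3])
        if aa is not None and aa != "*":
            residues.append(aa)
    return "".join(residues)


def edp(dna: str) -> list[float]:
    """Return the 20 components S_i = -(1/H) * p_i * log(p_i)."""
    protein = translate(dna)
    if not protein:
        raise ValueError("no translatable codons")
    counts = Counter(protein)
    total = len(protein)
    terms = []
    for aa in AMINO_ACIDS:
        p = counts[aa] / total
        terms.append(-p * math.log(p) if p > 0 else 0.0)  # 0 * log 0 taken as 0
    entropy = sum(terms)  # Shannon entropy H
    if entropy == 0:
        raise ValueError("entropy is zero (one amino acid type only); EDP undefined")
    return [t / entropy for t in terms]
```

By construction the 20 components sum to 1, so every sequence maps to a point on the same simplex of the phase space regardless of its length.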
We then use the Euclidean distance $D$ between any two points $\{S_i\}$ and $\{S'_i\}$ in the EDP phase space (the EDPs of two DNA segments, respectively) to characterize the difference between the two DNA segments:
$$D^2 = \sum_{i=1}^{20} \left( S_i - S'_i \right)^2 .$$
By computing the distances between the EDP of an unknown sequence and a series of known EDP points, we can readily classify the unknown sequence. This is the main idea of the MED method.
We first examined 12 bacterial genomes and found that, for each genome, the population-average EDP of its coding sequences differs greatly from that of its non-coding regions. In the phase space, the points of the two kinds of sequence gather into clusters around their respective centers. In other words, the EDPs of DNA sequences show a very clear clustering property in the phase space, so that DNA sequences with entirely different coding character have their "identity" determined by clustering in the phase space. For each sequence, the Euclidean distances to the coding center and to the non-coding center can be computed, and the comparison of these two distances can serve as the criterion for separating coding from non-coding sequences. On this basis we can design a gene prediction algorithm for whole microbial genome sequences based on the recognition of open reading frames (ORFs), which we call the multivariate entropy distance method. (For most microorganisms, an ORF is a continuous stretch of nucleotide triplets in the DNA sequence that begins with the translation start codon ATG and, read along the transcription direction, ends at the nearest translation stop codon TAA, TGA, or TAG.) We further found that a statistical average over a very small number of sample sequences (for example, 10 to 20 coding and 10 to 20 non-coding sequences) already gives a good approximation of the population-average EDP. As the number of samples increases, the sample-average EDP rapidly approaches the population-average EDP. This rapid convergence of the running-average EDP makes it possible to apply the method to gene recognition in complete genomes.
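The ORF definition above can be sketched in Python as follows. This is a minimal illustration, not the patented program: for each in-frame stop codon it keeps the ORF starting at the most upstream ATG since the previous stop, on both strands. The function names, the 1-based inclusive coordinates, and the inclusion of the stop codon in the reported span are assumptions of this sketch.

```python
STOPS = {"TAA", "TGA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def _longest_orfs_one_strand(dna: str) -> list[tuple[int, int]]:
    """Return 0-based half-open (start, end) spans, stop codon included."""
    spans = []
    for frame in range(3):
        start = None  # most upstream ATG seen since the last stop in this frame
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if codon == "ATG" and start is None:
                start = pos
            elif codon in STOPS:
                if start is not None:
                    spans.append((start, pos + 3))
                start = None
    return spans


def longest_orfs(dna: str) -> list[tuple[str, int, int]]:
    """Return (strand, left_end, right_end) with 1-based inclusive coordinates
    on the forward strand, sorted by left end."""
    dna = dna.upper()
    n = len(dna)
    found = [("+", s + 1, e) for s, e in _longest_orfs_one_strand(dna)]
    for s, e in _longest_orfs_one_strand(reverse_complement(dna)):
        # Map a span on the reverse strand back to forward-strand coordinates.
        found.append(("-", n - e + 1, n - s))
    return sorted(found, key=lambda orf: orf[1])
```

For example, `longest_orfs("CCATGAAATAGCC")` finds the single forward-strand ORF `ATGAAATAG` at positions 3 to 11.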
In microbial genomes, and in bacterial genomes in particular, genes can be found by examining ORFs. However, protein-coding ORF sequences are mixed in among a large number of non-coding ORF sequences, and this is especially true for short genes. The core of the problem is therefore how, given a DNA sequence (including a whole genome), to pick out the truly coding genes from a dense noise background. With the MED method, only a very small training set is needed to construct the average EDPs of coding ORFs and non-coding ORFs. The distances $D_c$ and $D_{nc}$ from the sequence under test to the two centers are then compared. A coefficient coef is introduced, and if $D_c / D_{nc} < \mathrm{coef}$ the sequence is judged to be a coding sequence. This coefficient may be an empirical value or may be derived theoretically. In most cases we use coef = 1 as the criterion.
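The criterion above can be sketched in Python as follows, assuming the EDP vectors have already been computed. The function names are hypothetical. The ratio test is written as `d_c < coef * d_nc`, which is equivalent to $D_c / D_{nc} < \mathrm{coef}$ whenever $D_{nc} > 0$ and avoids a division by zero when the sequence coincides with the non-coding center.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors (the average EDP)."""
    return [sum(column) / len(points) for column in zip(*points)]


def distance(a, b):
    """Euclidean distance between two EDP vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_coding(edp_vector, coding_samples, noncoding_samples, coef=1.0):
    """Apply D_c / D_nc < coef using centers averaged from small sample sets."""
    d_c = distance(edp_vector, centroid(coding_samples))
    d_nc = distance(edp_vector, centroid(noncoding_samples))
    return d_c < coef * d_nc
```

With coef = 1 the rule reduces to assigning the sequence to whichever of the two centers is nearer; a smaller coef makes the coding call stricter.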
We have introduced a multi-level, multi-center approach into the MED method. Under the multi-center framework, coding sequences and non-coding sequences can each be divided into several subclasses by some clustering scheme. The center of each subclass is obtained by averaging the EDPs of the sequences it contains, so that each subclass is represented by a single point in the phase space. The EDP of a single sequence is the description at the most basic level, while the population-average EDP of the coding sequences or of the non-coding sequences is the description at the highest level. At any level, sequences belonging to the same class share a certain similarity, which is characterized by the relative distances between their EDPs. Given a new sequence, its distance to the nearest subclass at a given level is uniquely determined. The simplest way to determine the membership of a sequence is to merge it into the subclass whose center is nearest to its EDP point. In gene recognition, if the sequence is merged into a class belonging to the coding sequences, it is judged to be a coding sequence. In practice, the newly merged sequences change the topology of the boundaries between the coding and non-coding subclasses to some extent. In particular, when the number of initial sequences is small, an iterative procedure is used so that the boundaries stabilize.
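The merging rule described above (assign a sequence to the nearest subclass, whose center then shifts) can be sketched in Python as follows. The representation of a subclass as a dictionary with `label` and `members` keys and the function names are assumptions of this sketch; how the subclasses themselves are formed is not specified in the text and is not covered here.

```python
import math


def centroid(points):
    """Component-wise mean of the member EDP vectors of one subclass."""
    return [sum(column) / len(points) for column in zip(*points)]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_into_nearest(edp_vector, subclasses):
    """Merge the vector into the subclass with the nearest center.

    `subclasses` is a list of {"label": str, "members": [vector, ...]} and is
    modified in place, so later calls see the shifted center.
    Returns the label ("coding" or "noncoding") of the chosen subclass.
    """
    nearest = min(
        subclasses,
        key=lambda sub: distance(edp_vector, centroid(sub["members"])),
    )
    nearest["members"].append(edp_vector)
    return nearest["label"]
```

Because the chosen subclass gains a member, its center moves after every merge, which is the boundary drift the paragraph above refers to.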
Following the above principle, for any unknown DNA sequence to be analyzed we introduce a very small number of sample points, namely some known coding ORFs and non-coding ORFs, and use them as the multiple centers of the initial state of the sequence analysis. Our analysis shows that, for continuous DNA sequences of microbial genomes and especially for whole genome sequences, a few suitably chosen sample points are applicable to different microbial species and are highly universal. Starting from these universal sample points, the MED method analyzes all ORFs in the genome sequence under test and performs a first round of classification. It then learns from the result of the first round and carries the knowledge gained into a second round of classification. The iteration continues until the classification reaches a stable state, that is, until every unknown ORF has been classified as coding or non-coding. In this way, the prediction system we have designed does not depend on a training set and predicts genes in unknown genomic DNA sequences automatically.
The software MED version 1.0, designed according to this scheme, runs directly under operating systems such as Windows 9x/ME/NT/2000/XP and requires no adjustable parameters as input. It is suitable for any personal computer with one of these operating systems installed and places no restrictions on computer hardware.
Fig. 1 shows the flow of a specific application of the invention. As the figure shows, the invention comprises the following steps:
a. A small number of known coding ORFs and non-coding ORFs are set and mapped one by one into the EDP phase space as the cluster center points of the initial state.
b. Input of sequence data.
The submitted microbial DNA sequence is read as input data. The submitted sequence may be either a whole genome sequence or a continuous segment of DNA sequence.
c. All the longest ORFs are found in the input DNA sequence and their positions in the sequence are recorded. Each ORF is mapped to a point in the EDP phase space, and all ORFs are initially uncertain. This step corresponds to step a.
d. Using the cluster center points of the initial state, all uncertain ORFs are analyzed and classified according to the Euclidean distance defined on the EDP phase space, and are divided into three classes in the EDP phase space: coding ORFs, non-coding ORFs, and uncertain ORFs.
The criterion is $D_c / D_{nc} < \mathrm{coef}$, where $D_c$ and $D_{nc}$ are the distances from the sequence under test to the center of the known coding ORFs and to the center of the known non-coding ORFs, respectively, and coef is an adjustable coefficient. In practice, coef is usually set to 1.
When the inequality holds, the DNA sequence is a coding sequence; when it does not hold, the DNA sequence is a non-coding sequence.
e. ORF classification and iterative growth of the cluster center points. The ORFs newly classified as coding or non-coding are added to the cluster center points, and step d is repeated until every uncertain ORF has been assigned to the coding or non-coding class.
f. The ORFs classified as coding are determined to be protein-coding genes. Their information (the gene length, the start and stop sites of the gene, and the transcription direction of the gene) and information about the system's running environment are recorded and written to a result file in text format.
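Steps d and e can be sketched in Python as the following iterative loop. The text does not say how an ORF is judged "uncertain", so this sketch assumes a hypothetical `margin` parameter: an ORF is decided only when one distance beats the other by that factor, and the margin is relaxed to 1 (the plain $D_c / D_{nc} < \mathrm{coef}$ rule) once a round decides nothing. The function names and this margin rule are illustrative assumptions, not the patented procedure.

```python
import math


def centroid(points):
    return [sum(column) / len(points) for column in zip(*points)]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def med_classify(orf_points, coding_seeds, noncoding_seeds, coef=1.0, margin=1.5):
    """Label each ORF point "coding" or "noncoding" by iterating steps d and e."""
    if margin < 1.0:
        raise ValueError("margin must be >= 1")
    coding, noncoding = list(coding_seeds), list(noncoding_seeds)
    labels = [None] * len(orf_points)
    uncertain = set(range(len(orf_points)))
    current_margin = margin
    while uncertain:
        # Step d: classify against the current cluster centers.
        c_center, n_center = centroid(coding), centroid(noncoding)
        decided = {}
        for idx in sorted(uncertain):
            d_c = distance(orf_points[idx], c_center)
            d_nc = distance(orf_points[idx], n_center)
            if d_c * current_margin < coef * d_nc:
                decided[idx] = "coding"
            elif current_margin == 1.0 or d_c > coef * d_nc * current_margin:
                decided[idx] = "noncoding"
        if not decided:
            current_margin = 1.0  # assumed fallback: force a decision next round
            continue
        # Step e: newly decided ORFs join the cluster they were assigned to.
        for idx, label in decided.items():
            labels[idx] = label
            uncertain.discard(idx)
            (coding if label == "coding" else noncoding).append(orf_points[idx])
    return labels
```

The loop always terminates: every round either decides at least one ORF or drops the margin to 1, at which point every remaining ORF is decided.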
Figs. 2 and 3 show interfaces of the MED version 1.0 system, designed according to the technical scheme of the invention, after a run. Its result file is a text file with the suffix txt, as shown in Table 1.
Table 1 Output file format of the MED version 1.0 prediction results
Gene Identification by MED Method for PROKARYOTIC (Version 1.0)
Working mode: Genome
Found          839 ORFs
Identified     556 genes
Predicted genes:
  Gene    Strand    LeftEnd    RightEnd    GeneLength
  1       +         197        2083        1887
  2       -         2769       2924        156
  3       +         3139       3378        240
  4       +         3560       3982        423
  5       +         3982       4515        534
  6       +         4530       6068        1539
  7       +         6170       6973        804
  8       +         6997       8394        1398
  9       +         8421       8837        417
  ……
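A result file in the format of Table 1 could be read back with a short Python sketch like the following. It assumes that the gene rows are whitespace-separated with the five columns Gene, Strand, LeftEnd, RightEnd, and GeneLength, as the table suggests; the names `PredictedGene` and `parse_med_output` are hypothetical and are not part of the MED software.

```python
from typing import NamedTuple


class PredictedGene(NamedTuple):
    index: int
    strand: str
    left_end: int
    right_end: int
    length: int


def parse_med_output(text: str) -> list[PredictedGene]:
    """Collect the gene rows; header lines and the trailing ellipsis are skipped."""
    genes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[1] not in ("+", "-"):
            continue
        if not all(f.isdigit() for f in (fields[0], fields[2], fields[3], fields[4])):
            continue
        genes.append(PredictedGene(int(fields[0]), fields[1],
                                   int(fields[2]), int(fields[3]), int(fields[4])))
    return genes
```

In the rows shown in Table 1, GeneLength equals RightEnd minus LeftEnd plus 1, so the coordinates appear to be inclusive at both ends.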
Examples of prediction results obtained with the invention, and comparisons:
A gene prediction method is usually assessed by two parameters: sensitivity and specificity. Suppose that X sequences in the sequence under test are genes, and that a gene prediction system predicts Y gene sequences in total, of which $Y_1$ are true genes and the remaining $Y_2$ ($Y_2 = Y - Y_1$) are not genes. Sensitivity is then defined as $Y_1 / X$ and expresses the predictive power of the system. Specificity is defined as $Y_1 / Y$ and expresses the reliability of the system's predictions. Sensitivity and specificity often pull in opposite directions.
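The two measures can be written as a short Python sketch. The function names are hypothetical, and the counts in the usage note are the MED v1.0 figures for E. coli K12 quoted in this document. Note that specificity as defined here, $Y_1 / Y$, is the quantity that many other fields call precision.

```python
def sensitivity(true_genes_found: int, annotated_genes: int) -> float:
    """Y1 / X: share of the annotated genes that the system predicted."""
    if annotated_genes <= 0:
        raise ValueError("annotated_genes must be positive")
    return true_genes_found / annotated_genes


def specificity(true_genes_found: int, predicted_genes: int) -> float:
    """Y1 / Y: share of the predictions that are true genes."""
    if predicted_genes <= 0:
        raise ValueError("predicted_genes must be positive")
    return true_genes_found / predicted_genes
```

For instance, `sensitivity(4092, 4289)` is about 0.954, the 95.4% reported for MED v1.0 against the GenBank annotation.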
Taking the complete genome of Escherichia coli (E. coli K12) as a first example, the predictions of the invention are compared, using commonly accepted measures, with those of GeneMarkS and GLIMMER 2.02, the best microbial gene prediction systems in the world, as shown in Table 2. Prediction accuracy is examined in two situations. In the first, the evaluation uses the 4289 genes annotated in GenBank (a nucleic acid sequence database and one of the three core biomolecular databases), so X = 4289. A large part of these annotations were themselves predicted by bioinformatics gene prediction systems and have not yet been verified experimentally. In the second, the evaluation uses the 1851 genes that have so far been verified and confirmed experimentally, so X = 1851. By the commonly used standard, a prediction is "entirely accurate" when both the start position and the end position of the gene sequence are predicted correctly, and "accurate" when the end position of the gene sequence is predicted correctly. The comparison shows that the sensitivity of MED v1.0 is higher than that of GeneMarkS and its specificity is higher than that of GLIMMER 2.02. Taking both measures together, MED v1.0 is at least at the same level of accuracy as GeneMarkS and GLIMMER 2.02.
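The distinction between "accurate" and "entirely accurate" predictions can be sketched in Python as follows. Genes are represented as `(strand, left_end, right_end)` tuples, and the sketch assumes that the end of a gene is its right end on the `+` strand and its left end on the `-` strand. That coordinate convention and the function names are assumptions of this sketch.

```python
def end_key(gene):
    """Strand plus the coordinate at which the gene ends (assumed convention)."""
    strand, left_end, right_end = gene
    return (strand, right_end if strand == "+" else left_end)


def score_predictions(predicted, annotated):
    """Return (accurate, entirely_accurate) counts for the predicted genes.

    accurate: the end position matches an annotated gene on the same strand.
    entirely accurate: both ends match an annotated gene on the same strand.
    """
    annotated_full = set(annotated)
    annotated_ends = {end_key(g) for g in annotated}
    accurate = sum(1 for g in predicted if end_key(g) in annotated_ends)
    entirely_accurate = sum(1 for g in predicted if g in annotated_full)
    return accurate, entirely_accurate
```

Every entirely accurate prediction is also accurate, so the first count can never be smaller than the second.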
Table 2 Comparison of prediction performance with GeneMarkS and GLIMMER 2.02
Each count is followed in parentheses by its share of the evaluated gene set (4289 or 1851); the share of accurately predicted genes is equivalent to the sensitivity.
                Evaluated against the 4289 genes annotated in GenBank      Evaluated against the 1851 experimentally confirmed genes
System          Accurate          Entirely accurate   Specificity          Accurate          Entirely accurate
GeneMarkS       3956 (92.2%)      2873 (67.0%)        97.2%                1784 (96.4%)      1474 (79.6%)
GLIMMER 2.02    4156 (96.9%)      3160 (73.7%)        81.4%                1822 (98.4%)      1296 (70.0%)
MED v1.0        4092 (95.4%)      3187 (74.3%)        88.0%                1800 (97.2%)      1401 (75.7%)
We also applied the technical scheme of the invention to the whole genome sequences of the 87 bacteria currently in GenBank, taking the GenBank gene annotation as the standard. The test results are shown in Table 3.
Table 3 Prediction performance on the complete genome DNA sequences of 87 bacteria
Sensitivity is the share of genes predicted accurately.
Bacterium                 Sensitivity (%)  Specificity (%)  Entirely accurate (%)
A.aeolicus                98.2             86.0             81.0
A.fulgidus                95.5             86.3             72.3
A.pernix                  80.6             85.3             35.8
A.tumefaciens_C58         92.8             87.4             56.3
A.tumefaciens_C58_UWash   89.2             86.2             56.7
B.aphidicola_Sg           90.5             86.9             77.8
B.burgdorferi             83.8             96.7             60.5
B.halodurans              94.3             90.7             58.0
B.melitensis              91.7             86.6             46.0
B.sp                      91.7             92.7             75.4
B.subtilis                94.5             91.3             60.3
C.acetobutylicum          93.1             96.0             66.7
C.crescentus              90.3             91.4             55.7
C.glutamicum              94.7             78.2             63.1
C.jejuni                  95.6             96.1             81.8
C.muridarum               90.8             84.4             70.5
C.perfringens             94.5             97.6             73.9
C.pneumoniae_AR39         85.6             83.5             66.1
C.pneumoniae_CWL029       90.7             83.1             71.6
C.pneumoniae_J138         90.0             84.1             70.3
C.tepidum_TLS             82.9             81.0             57.9
C.trachomatis             91.8             86.1             79.0
D.radiodurans             91.2             91.7             53.6
E.coli_K12                95.4             88.0             74.3
E.coli_O157_H7            92.7             88.3             70.8
E.coli_O157_H7_EDL933     93.0             87.3             70.8
F.nucleatum               92.7             98.1             62.6
H.influenzae              97.0             91.2             83.8
H.pylori_26995            93.5             92.6             73.4
H.pylori_J99              96.0             92.7             73.4
H.sp                      93.8             94.5             83.7
L.innocua                 97.4             95.2             68.1
L.lactis                  95.9             92.1             79.7
L.monocytogenes           97.5             95.5             66.9
M.acetivorans             89.6             82.3             57.2
M.genitalium              91.3             84.7             78.3
M.jannaschii              97.2             94.8             67.6
M.kandleri                89.3             88.2             25.6
M.leprae                  86.9             58.1             42.6
M.loti                    91.0             87.0             59.4
M.mazei                   93.8             88.6             59.5
M.pneumoniae              92.7             84.7             86.4
M.pulmonis                85.4             96.8             68.8
M.thermoautotrophicum     95.2             72.3             57.1
M.tuberculosis_CDC1551    81.3             89.1             41.8
M.tuberculosis_H37Rv      88.5             90.4             45.7
N.meningitidis_MC58       89.6             85.0             71.0
N.meningitidis_Z2491      90.5             88.0             68.0
N.sp                      93.6             89.5             76.6
P.abyssi                  97.2             87.8             67.5
P.aeruginosa              93.7             91.5             65.5
P.aerophilum              86.5             90.6             53.8
P.furiosus                94.9             92.5             75.3
P.horikoshii              93.6             88.3             67.4
P.multocida               97.9             90.8             82.0
R.conorii                 80.3             94.0             64.7
R.prowazekii              87.9             91.3             81.9
R.solanacearum            88.2             90.0             53.6
S.agalactiae              93.0             90.4             75.1
S.aureus_Mu50             92.4             96.2             65.5
S.aureus_MW2              90.7             96.2             65.6
S.aureus_N315             93.1             96.7             66.1
S.coelicolor              86.0             92.9             46.5
S.meliloti                93.5             89.9             62.7
S.PCC6803                 96.4             84.5             81.0
S.pneumoniae_R6           91.9             89.2             70.3
S.pneumoniae_TIGR4        88.3             83.1             71.8
S.pyogenes                95.8             88.4             79.0
S.pyogenes_MGAS315        95.8             92.6             73.4
S.pyogenes_MGAS8232       94.3             90.7             75.4
S.solfataricus            89.2             87.7             58.9
S.tokodaii                88.1             89.9             70.5
S.typhi                   94.7             82.6             72.0
S.typhimurium_LT2         95.5             87.0             74.0
T.acidophilum             96.8             83.7             57.8
T.elongates_BP-1          93.6             68.0             68.8
T.maritima                94.3             92.6             57.5
T.pallidum                88.7             77.3             48.2
T.tengcongensis           94.3             91.6             61.6
T.volcanium               94.5             86.2             71.2
U.urealyticum             91.7             94.0             83.4
V.cholerae                90.3             85.6             63.4
X.campestris              91.2             90.9             50.8
X.citri                   89.6             91.1             46.9
X.fastidiosa              75.4             76.6             46.9
Y.pestis                  94.4             82.1             69.9
Y.pestis_KIM              89.8             83.3             61.2
These results show that the method of the present invention is simple and practical, that the data it yields are of relatively high accuracy, and that it runs on widely available system platforms, which makes it convenient for users.
The above is merely a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims (8)

1. A microbial gene prediction method based on the polynary entropy distance method, characterized in that it comprises the following steps:
a. setting known coding ORFs and known non-coding ORFs, mapping each of them to the EDP phase space, and using them as the cluster centre points of the initial state;
b. reading the microbial DNA sequence to be analysed;
c. finding all longest ORFs in the DNA sequence, recording their positions in the sequence, and mapping each ORF to a point in the EDP phase space, the initial state of every such ORF being uncertain;
d. using the cluster centre points of the initial state of the system, analysing and discriminating all uncertain ORFs in the EDP phase space according to the Euclidean distance defined on the EDP phase space, and dividing them into three classes: coding ORFs, non-coding ORFs and uncertain ORFs;
e. adding the ORFs newly judged to be coding or non-coding to the cluster centre points, and repeating step d until every uncertain ORF has been assigned to the coding ORFs or the non-coding ORFs;
f. determining the ORFs discriminated as belonging to the coding class to be protein-coding genes.
2. The microbial gene prediction method based on the polynary entropy distance method as claimed in claim 1, characterized in that in step b the microbial DNA sequence may be either a whole genome sequence or a contiguous segment of genomic sequence.
3. The microbial gene prediction method based on the polynary entropy distance method as claimed in claim 1, characterized in that in steps a and c a 20-dimensional EDP phase space is constructed, and any DNA sequence of finite length is mapped to a point in that EDP phase space.
4. The microbial gene prediction method based on the polynary entropy distance method as claimed in claim 1, characterized in that in step d the discrimination is made as follows:
$D_c / D_{nc} < \mathrm{coef}$
where $D_c$ and $D_{nc}$ are the distances from the sequence under test to the centre of the known coding ORFs and to the centre of the known non-coding ORFs, respectively, and coef is an adjustable coefficient;
when the inequality holds, the DNA sequence is a coding sequence; when it does not hold, the DNA sequence is a non-coding sequence.
5. The microbial gene prediction method based on the polynary entropy distance method as claimed in claim 3, characterized in that the value of coef is 1.
6. The microbial gene prediction method based on the polynary entropy distance method as claimed in claim 1, characterized in that an ORF is a contiguous run of nucleotide triplets in the DNA sequence that begins with the translation start codon ATG and, read along the direction of transcription, ends at the nearest translation stop codon TAA, TGA or TAG.
7. The microbial gene prediction method based on the polynary entropy distance method as claimed in claim 1, characterized in that step f further comprises a step of writing the information on the genes determined to be protein-coding to a text file and outputting it.
8. The microbial gene prediction method based on the polynary entropy distance method as claimed in claim 6, characterized in that the content of the text file comprises the positions of the two ends of each gene and the length of the gene.
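The ORF definition in claim 6 and the distance-ratio test in claim 4 can be illustrated with a minimal Python sketch. It is not the applicants' implementation: the function names `find_orfs` and `is_coding` are hypothetical, only the forward strand is scanned, each cluster centre is taken to be the arithmetic mean of its known points, and the mapping of an ORF to its 20-dimensional EDP point is not reproduced here, so points are passed in as plain lists of numbers.

```python
import math

STOP_CODONS = {"TAA", "TGA", "TAG"}


def find_orfs(seq):
    """Return (start, end) pairs (0-based, end exclusive) of forward-strand ORFs.

    An ORF starts at an ATG and runs in frame to the nearest stop codon.
    Only the first ATG before each stop is kept, which gives the longest
    ORF ending at that stop. Reverse-strand scanning is omitted (assumption).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def centroid(points):
    """Arithmetic mean of equal-length vectors (assumed form of the cluster centre)."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def is_coding(point, coding_points, noncoding_points, coef=1.0):
    """Apply the test D_c / D_nc < coef using Euclidean distances.

    The inequality is evaluated as D_c < coef * D_nc so that D_nc == 0
    does not cause a division by zero.
    """
    d_c = math.dist(point, centroid(coding_points))
    d_nc = math.dist(point, centroid(noncoding_points))
    return d_c < coef * d_nc


if __name__ == "__main__":
    print(find_orfs("CCATGAAAATGCCCTAAGG"))
    # Toy 2-D points stand in for 20-D EDP points.
    coding = [[0.0, 0.0], [2.0, 0.0]]
    noncoding = [[9.0, 0.0], [11.0, 0.0]]
    print(is_coding([2.0, 0.0], coding, noncoding))
```

With coef set to 1, as in claim 5, the test reduces to assigning a point to whichever centre is nearer. The iteration of steps d and e would wrap `is_coding` in a loop that appends newly classified points to `coding_points` or `noncoding_points`; the rule for leaving an ORF in the uncertain class is not stated in the claims, so it is left out of this sketch.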
CN 03147763 2003-06-24 2003-06-24 Microbe gene prediction method based on polynary entropy distance method Pending CN1566365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 03147763 CN1566365A (en) 2003-06-24 2003-06-24 Microbe gene prediction method based on polynary entropy distance method

Publications (1)

Publication Number Publication Date
CN1566365A true CN1566365A (en) 2005-01-19

Family

ID=34472039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 03147763 Pending CN1566365A (en) 2003-06-24 2003-06-24 Microbe gene prediction method based on polynary entropy distance method

Country Status (1)

Country Link
CN (1) CN1566365A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101655847B (en) * 2008-08-22 2011-12-28 山东省计算中心 Expansive entropy information bottleneck principle based clustering method
CN101794351A (en) * 2010-03-09 2010-08-04 哈尔滨工业大学 Protein secondary structure engineering prediction method based on large margin nearest central point
CN101794351B (en) * 2010-03-09 2012-08-15 哈尔滨工业大学 Protein secondary structure engineering prediction method based on large interval nearest central point
CN101840467A (en) * 2010-04-20 2010-09-22 中国科学院研究生院 Method for filtering, evolving and classifying proteome and system thereof
CN101840467B (en) * 2010-04-20 2012-07-04 中国科学院研究生院 Method for filtering, evolving and classifying proteome and system thereof
CN104298892A (en) * 2014-09-18 2015-01-21 天津诺禾致源生物信息科技有限公司 Detection device and method for gene fusion
CN104298892B (en) * 2014-09-18 2017-05-10 天津诺禾致源生物信息科技有限公司 Detection device and method for gene fusion
CN107663549A (en) * 2017-10-18 2018-02-06 中国科学院昆明植物研究所 A kind of method of the cytoplasmic male sterile gene prediction based on plant mitochondria genome signature
CN111190138A (en) * 2020-02-12 2020-05-22 南京邮电大学 Tool car indoor and outdoor combined positioning method and device based on Internet of things

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication