CN103559427B - Method for digitally identifying biological sequences and inferring the phylogenetic relationships of species

Info

Publication number: CN103559427B
Application number: CN201310557139.1A
Authority: CN
Inventors: 高扬; 罗辽复
Original assignee: Individual
Current assignee: Individual
Priority date: 2013-11-12
Filing date: 2013-11-12
Publication date: 2017-10-31
Anticipated expiration: 2033-11-12
Also published as: CN103559427A

Abstract

The invention provides a biological sequence identification code based on the base-correlation features of an organism's nucleotide sequence, together with a complete procedure for using the code to identify biological sequences and infer the phylogenetic relationships of species, and a system for evaluating the results. Inferring phylogenetic relationships is a stringent test of the validity of the identification code. The mammalian evolutionary tree and the parvovirus relationships inferred by the invention agree with the classification knowledge of biologists, which shows that the method is effective and that the identification code has high resolution. The identification code has strong recognition capability and a small data volume: a huge genome sequence can be identified with a small set of numbers, which simplifies the identification, comparison and analysis of biological sequences and is of great practical value.

Description

Method for digitally identifying biological sequences and inferring the phylogenetic relationships of species

Technical field

The invention uses bioinformatics methods to mine and integrate the information-correlation features of sequences, and then uses numbers to identify biological sequences and species and to infer their phylogenetic relationships. It belongs to the application of informatics in the field of biology.

Background art

Biological sequences include amino acid sequences and nucleotide sequences; nucleotide sequences are further divided into deoxyribonucleic acid (DNA) sequences and ribonucleic acid (RNA) sequences. A DNA sequence is a polymer of four nucleotide monomers, adenylate (A), cytidylate (C), guanylate (G) and thymidylate (T), and is usually represented as a string over a four-letter alphabet. Similarly, an RNA sequence can be represented as a string over the four letters A, C, G and U, where U (uridylate) replaces T. The whole-genome sequences of sequenced species range in length from thousands to millions, or even billions, of letters.

Researchers have tried to extract data from biological sequences to identify them, and have applied the frequency features of genome-sequence oligomers (K-mers) in phylogenetic methods. For example, the composition vector method (CVTree)^[1] of Hao Bailin, an academician of the Chinese Academy of Sciences, uses 20⁵ data values to infer the evolutionary relationships between species, and the feature frequency profile method (FFP)^[2] of the American scientist Kim et al. uses as many as 20⁸ data values (far more than the amount of genomic data itself) to study evolution. These methods are easily limited by high dimensionality and small sample size, are unsuitable for small genomes or short sequences such as parvoviruses^[3], and cannot identify an organism (sequence) with a small amount of data.

To make identifying organisms (sequences) and inferring their phylogenetic relationships more practical, we have tried a new approach. Unlike methods based on K-mer frequency statistics, we study the information-correlation features of a sequence (DNA or RNA) from the standpoint of information theory. We propose identifying genomes with the information correlation (IC) and the partial information correlation (PIC), and then using these quantities to infer the phylogenetic relationships of organisms (sequences).

Summary of the invention

The invention provides a method for identifying organisms (sequences) with numbers and illustrates its application to inferring the phylogenetic relationships of organisms (sequences). Below we introduce the calculation of the information correlation (IC) and the partial information correlation (PIC), the construction of the biological (sequence) identification code, and its application to the study of phylogenetic relationships.

A sequence in the present invention may be the complete genome sequence of an organism or a fragment of an organism's genome sequence, and may be a DNA sequence or an RNA sequence. The sequence data used in the invention are public resources that can be freely obtained and used from global public databases such as the National Center for Biotechnology Information (NCBI) database in the United States, the European Molecular Biology Laboratory (EMBL) database and the DNA Data Bank of Japan (DDBJ).

To achieve the above object, the invention provides the following technical solution:

I. Information correlation and partial information correlation

Taking a given DNA sequence as an example, the elements of the sequence are the bases A, G, C and T. Statistically, let $p_i$ be the probability that base i (i = A, G, C, T) occurs, $p_j$ the probability that base j (j = A, G, C, T) occurs, and $p_{i(k)j}$ the joint probability that base i and base j occur at two positions a distance k apart. From information theory, the base-correlation information of the whole sequence is

$$D_{k+2} = -2\sum_{i} p_i \log_2 p_i + \sum_{ij} p_{i(k)j} \log_2 p_{i(k)j} \qquad (k = 0, 1, 2, \ldots) \tag{1}$$

We call $D_{k+2}$ the information correlation (IC).
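As a reading aid, formula (1) can be rearranged into a form that makes its meaning easier to see. The step below is a sketch that assumes the single-base probabilities $p_i$ are the marginals of the joint probabilities $p_{i(k)j}$ (that is, $\sum_j p_{i(k)j} = p_i$ and $\sum_i p_{i(k)j} = p_j$). Under that assumption $\sum_{ij} p_{i(k)j}\log_2(p_i p_j) = 2\sum_i p_i \log_2 p_i$, so

$$
D_{k+2} = -2\sum_i p_i \log_2 p_i + \sum_{ij} p_{i(k)j}\log_2 p_{i(k)j}
        = \sum_{ij} p_{i(k)j}\,\log_2\frac{p_{i(k)j}}{p_i\,p_j}.
$$

In this form $D_{k+2}$ is the mutual information between the bases at the two positions: it is never negative, and it is zero exactly when $p_{i(k)j} = p_i p_j$ for every pair of bases, that is, when the two positions are statistically independent.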

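When $p_{i(k)j}$ is calculated for an actual sequence, a periodic boundary condition can be introduced to avoid the edge effect caused by the finite length N: the first k+1 bases of the sequence are appended to its tail, forming a sequence of length N+k+1. The number $N_{i(k)j}$ of base pairs ij at distance k is then counted, giving $p_{i(k)j} = N_{i(k)j}/N$. By Taylor expansion, formula (1) can then be written as

$$
\begin{aligned}
D_{k+2} &= -2\sum_{i} p_i \log_2 p_i + \sum_{ij} \frac{N_{i(k)j}}{N} \log_2 \frac{N_{i(k)j}}{N} \\
        &= \frac{1}{2\ln 2} \sum_{ij} p_i p_j \left( x_{i(k)j}^{2} - \frac{1}{3} x_{i(k)j}^{3} + \frac{1}{6} x_{i(k)j}^{4} - \cdots \right)
\end{aligned}
$$

where $x_{i(k)j} = (N_{i(k)j} - N p_i p_j)/(N p_i p_j)$. When N is large, only the leading term needs to be kept, and the information correlation can be written as

$$D_{k+2} \cong \frac{1}{2\ln 2} \sum_{ij} \frac{\left(p_{i(k)j} - p_i p_j\right)^{2}}{p_i p_j}$$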

The sum in $D_{k+2}$ runs over all 16 base correlations. To describe the correlation of a particular pair of bases, we introduce the partial information correlation (PIC):

$$F_{i(k)j} = \left(p_{i(k)j} - p_i p_j\right)^{2} \tag{2}$$
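The calculation described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the inventors' implementation: the function name `base_correlations` is hypothetical, the sequence is assumed to contain only A, C, G and T, and "distance k" is read as an offset of k+1 positions (so that k = 0 pairs adjacent bases), which is the reading suggested by appending k+1 bases for the periodic boundary.

```python
import math
from collections import Counter

BASES = "ACGT"

def base_correlations(seq, k):
    """Return (D, F) for gap k.

    D is the information correlation D_{k+2} of formula (1);
    F maps each base pair (i, j) to the partial information
    correlation F_{i(k)j} of formula (2).
    """
    seq = seq.upper()
    n = len(seq)
    step = k + 1                      # assumed offset between the two bases
    if step > n:
        raise ValueError("k + 1 must not exceed the sequence length")
    ext = seq + seq[:step]            # periodic boundary: append the first k+1 bases
    singles = Counter(seq)
    pairs = Counter((ext[t], ext[t + step]) for t in range(n))  # exactly n pairs
    p = {b: singles[b] / n for b in BASES}

    # First term of formula (1): -2 * sum_i p_i log2 p_i
    D = -2 * sum(p[b] * math.log2(p[b]) for b in BASES if p[b] > 0)
    F = {}
    for i in BASES:
        for j in BASES:
            pij = pairs[(i, j)] / n
            if pij > 0:               # 0 * log2(0) is taken as 0
                D += pij * math.log2(pij)
            F[(i, j)] = (pij - p[i] * p[j]) ** 2
    return D, F

if __name__ == "__main__":
    D, F = base_correlations("ACGT" * 5, 0)
    print(D, F[("A", "C")])
```

For the perfectly periodic toy sequence `"ACGT" * 5`, each base fully determines its neighbour, so the information correlation reaches 2 bits, the largest value possible for four equally frequent bases.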

II. Biological (sequence) identification code

The biological (sequence) identification code can be expressed as a matrix, a vector or in other forms, for example:

$$[F_{A(k_0)A},\ F_{A(k_0)T},\ \ldots,\ F_{G(k_0)G},\ D_{k_0+2},\ F_{A(k_0+1)A},\ \ldots,\ F_{G(k_0+1)G},\ D_{k_0+3},\ \ldots,\ D_{k_0+d+1}] \tag{5}$$

$$[F_{A(k_0)A}/D_{k_0+2},\ \ldots,\ F_{G(k_0)G}/D_{k_0+2},\ D_{k_0+2},\ F_{A(k_0+1)A}/D_{k_0+3},\ \ldots,\ D_{k_0+3},\ \ldots,\ D_{k_0+d+1}] \tag{6}$$

Whatever the representation, the core elements of the biological (sequence) identification code are the information correlation and the 16 partial information correlations. The parameter k₀ = 0 and the range of the parameter d can be chosen as needed, giving d × 17 values in total. For ease of understanding, the biological (sequence) identification code is usually represented or described in matrix form.
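To make the d × 17 layout concrete, the sketch below builds the code as a list of d rows, one row per gap k = k₀, k₀+1, …, k₀+d−1, each holding 16 partial information correlations followed by the information correlation. The function names, the column order (AA, AT, AC, AG, TA, …, GG, chosen to match the first and last entries of formula (5)) and the reading of "distance k" as an offset of k+1 positions are assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ATCG"  # assumed column order: AA, AT, AC, AG, TA, ..., GG
PAIRS = [(i, j) for i in BASES for j in BASES]

def code_row(seq, k):
    """Return the 17 values for gap k: 16 F_{i(k)j} followed by D_{k+2}."""
    n, step = len(seq), k + 1
    ext = seq + seq[:step]                     # periodic boundary condition
    p = {b: seq.count(b) / n for b in BASES}
    joint = Counter(zip(ext, ext[step:]))      # exactly n base pairs
    d_k = -2 * sum(q * math.log2(q) for q in p.values() if q > 0)
    row = []
    for i, j in PAIRS:
        pij = joint[(i, j)] / n
        if pij > 0:
            d_k += pij * math.log2(pij)
        row.append((pij - p[i] * p[j]) ** 2)
    return row + [d_k]

def identification_code(seq, d, k0=0):
    """Return the identification code as a d x 17 matrix (list of rows)."""
    if k0 + d > len(seq):
        raise ValueError("k0 + d must not exceed the sequence length")
    return [code_row(seq, k0 + r) for r in range(d)]

if __name__ == "__main__":
    code = identification_code("ACGT" * 5, d=3)
    print(len(code), len(code[0]))
```

Flattening the rows one after another gives the vector form of formula (5); the matrix form is used here only because it is easier to index.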

III. Reconstructing biological (sequence) phylogenetic relationships

Reconstructing the phylogenetic relationships of organisms (sequences) is both an application of the identification code and a stringent test of its validity. To test the validity of the proposed biological (sequence) identification code, we use it to build a species evolutionary tree in the following steps:

(1) Screening the information correlations

Partial information correlations with low recognition capability carry a weak phylogenetic signal and may even distort the evolutionary relationships. Therefore only the partial information correlations with stronger recognition capability are combined with the information correlation as the parameters for calculating the evolutionary distance between species. First, construct a vector X of the form $[D_{k_0+2}, D_{k_0+3}, \ldots, D_{k_0+d+1}]$ or $[F_{\alpha(k_0)\beta}, F_{\alpha(k_0+1)\beta}, \ldots, F_{\alpha(k_0+d-1)\beta}]$ (α, β ∈ {A, G, C, T}). Perform analysis of variance (ANOVA) and a multiple comparison test (MCT) on the elements of X. In the multiple comparison, for a given pair of species, X is considered to distinguish the pair as long as any one element of X passes the test. The proportion of species pairs that cannot be distinguished among all species pairs, normalized to 100, is called the failure score of X and is denoted $W_X(k_0, d)$. A vector X with a low failure score corresponds to an information parameter with strong species specificity.
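The bookkeeping behind the failure score can be sketched as follows. The text specifies ANOVA followed by a multiple comparison test but not a particular procedure, so this sketch deliberately swaps in a much simpler stand-in: a Welch-type t statistic compared with a fixed threshold. The function names, the threshold of 3.0 and the data layout are all hypothetical; only the final ratio (undistinguished pairs over all pairs, scaled to 100) follows the definition above.

```python
from itertools import combinations
from statistics import mean, variance

def differs(a, b, threshold=3.0):
    """Stand-in pairwise test (an assumption, not the patent's ANOVA/MCT):
    Welch-type t statistic above a fixed threshold."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    if se == 0:
        return mean(a) != mean(b)
    return abs(mean(a) - mean(b)) / se > threshold

def failure_score(samples, test=differs):
    """samples maps each species to a list of vectors X, one per sample sequence.

    Returns 100 * (species pairs that no element of X separates) / (all pairs).
    """
    pairs = list(combinations(samples, 2))
    failed = 0
    for s, t in pairs:
        dims = range(len(samples[s][0]))
        # A pair counts as distinguished if ANY element of X passes the test.
        separated = any(
            test([v[e] for v in samples[s]], [v[e] for v in samples[t]])
            for e in dims
        )
        failed += not separated
    return 100.0 * failed / len(pairs)

if __name__ == "__main__":
    toy = {
        "a": [[1.0, 5.0], [1.1, 5.1], [0.9, 4.9]],
        "b": [[1.0, 9.0], [1.1, 9.1], [0.9, 8.9]],
        "c": [[1.0, 5.0], [1.05, 5.1], [0.95, 4.9]],
    }
    print(failure_score(toy))
```

In the toy data, species a and c cannot be told apart on either element, while b differs from both on the second element, so one pair out of three fails. A real analysis would replace `differs` with a proper multiple comparison procedure that controls the error rate across all pairs.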

(2) Evaluating phylogenetic relationships

There are many ways to evaluate the relatedness of species (sequences), for example the probabilistic methods used to compare sample sequences in paternity testing, and methods that build an evolutionary tree from a distance matrix.

Taking the construction of an evolutionary tree as an example, the evolutionary distance D must satisfy the following three axioms:

(i) $D_{x,y} \ge 0$, and $D_{x,y} = 0$ if and only if x = y;

(ii) $D_{x,y} = D_{y,x}$ (symmetry);

(iii) for any species x, y and z, $D_{x,z} \le D_{x,y} + D_{y,z}$ always holds (triangle inequality).
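The three axioms are easy to verify numerically for any candidate distance matrix. The helper below is a minimal sketch; the name `check_distance_axioms` and the tolerance parameter are assumptions added for illustration.

```python
def check_distance_axioms(D, tol=1e-12):
    """Return True if the square matrix D satisfies axioms (i)-(iii)."""
    n = len(D)
    for x in range(n):
        if abs(D[x][x]) > tol:                      # (i): D[x][x] must be 0
            return False
        for y in range(n):
            if D[x][y] < -tol:                      # (i): non-negativity
                return False
            if x != y and D[x][y] <= tol:           # (i): zero only when x == y
                return False
            if abs(D[x][y] - D[y][x]) > tol:        # (ii): symmetry
                return False
            for z in range(n):
                if D[x][z] > D[x][y] + D[y][z] + tol:   # (iii): triangle inequality
                    return False
    return True

if __name__ == "__main__":
    print(check_distance_axioms([[0, 1, 2], [1, 0, 1], [2, 1, 0]]))
    print(check_distance_axioms([[0, 1, 5], [1, 0, 1], [5, 1, 0]]))
```

The second matrix fails because the direct distance 5 exceeds the detour 1 + 1. Note that the strict reading of axiom (i) also rejects two distinct species whose identification codes happen to be identical.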

The distance between base-correlation matrices can be calculated with the Mahalanobis distance, the Euclidean distance or similar measures, and taken as the evolutionary distance between species (sequences). For ease of understanding, take the Euclidean distance as an example:

$$D_{x,y} = \sqrt{\sum_{j=1}^{f} \sum_{i=1}^{d} \left( M_x(i,j) - M_y(i,j) \right)^{2}} \tag{7}$$

where M is the base-correlation matrix, f is the number of column vectors of the matrix, and d is the number of elements in each column vector. Once the distances between all pairs of species have been determined, the distance matrix is obtained and can be used to draw the evolutionary tree.
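Formula (7) is the ordinary Euclidean (Frobenius) distance between two matrices of the same shape. A minimal sketch, with hypothetical function names and with each base-correlation matrix stored as a list of rows:

```python
import math

def matrix_distance(Mx, My):
    """Euclidean distance of formula (7) between two equally shaped matrices."""
    if len(Mx) != len(My) or any(len(a) != len(b) for a, b in zip(Mx, My)):
        raise ValueError("matrices must have the same shape")
    return math.sqrt(
        sum((a - b) ** 2 for rx, ry in zip(Mx, My) for a, b in zip(rx, ry))
    )

def distance_matrix(codes):
    """codes maps a species name to its matrix; returns (names, pairwise distances)."""
    names = list(codes)
    return names, [
        [matrix_distance(codes[a], codes[b]) for b in names] for a in names
    ]

if __name__ == "__main__":
    names, D = distance_matrix({"x": [[0, 0], [0, 0]], "y": [[3, 0], [0, 4]]})
    print(names, D)
```

Because the sum runs over every entry, the order of the two summations in formula (7) does not affect the result.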

When calculating the evolutionary distance, we recommend using the ratio of the partial information correlation to the information correlation. From formulas (1) and (2), the information correlation and the partial information correlation are of the order of ~10⁻³ and ~10⁻⁶ respectively, so using the ratio as the parameter brings all column vectors to comparable orders of magnitude.
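The rescaling recommended here can be sketched as a small helper that divides every partial information correlation in a row by that row's information correlation and leaves the information correlation itself unchanged. The function name and the convention that the information correlation is the last entry of each row are assumptions.

```python
def to_ratio_form(code):
    """Divide the PIC entries of each row by the row's IC (assumed last entry)."""
    out = []
    for row in code:
        *pics, ic = row
        if ic == 0:
            raise ValueError("information correlation is zero; ratio undefined")
        out.append([f / ic for f in pics] + [ic])
    return out

if __name__ == "__main__":
    # Two PIC values of order 1e-6 and an IC of order 1e-3 (toy numbers).
    print(to_ratio_form([[2e-6, 4e-6, 1e-3]]))
```

With the toy numbers above, all three entries end up of order 10⁻³, so no single column dominates a Euclidean distance computed afterwards.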

(3) Statistical test of the evolutionary tree

The phylogenetic tree can be constructed with common methods such as neighbor-joining (NJ) and the unweighted pair-group method with arithmetic mean (UPGMA). To test the robustness of the resulting tree, we propose a new method that plays the role of a reverse bootstrap or jackknife test: build a base-correlation matrix with Z rows and let d be a variable parameter running from 0 to Z−1. Each value of d yields one tree. The optimal range of d is determined by comparing the resulting trees with species classification information. Finally, the trees within the optimal range of d are integrated into a consensus tree, which is the final phylogenetic tree. Every branch of this tree carries a statistical value, called the bootstrap value. The wider the range of d, the more stable the resulting phylogenetic tree.
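The idea of scanning d and counting how often each branch recurs can be sketched as follows. This is an illustration under assumptions, not the inventors' software: it uses a plain-Python UPGMA (one of the two tree methods named above), it takes a ready-made distance matrix for each d, and it only counts clade support rather than assembling the consensus tree. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def upgma_clades(names, D):
    """Cluster with UPGMA; return the non-trivial clades as frozensets of names."""
    clusters = {i: frozenset([n]) for i, n in enumerate(names)}
    dist = {frozenset((i, j)): D[i][j]
            for i, j in combinations(range(len(names)), 2)}
    clades, next_id = set(), len(names)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        wa, wb = len(clusters[a]), len(clusters[b])
        for c in clusters:
            if c not in (a, b):
                # Size-weighted average keeps the distance an unweighted mean
                # over all original members (the UPGMA update rule).
                dist[frozenset((c, next_id))] = (
                    wa * dist[frozenset((a, c))] + wb * dist[frozenset((b, c))]
                ) / (wa + wb)
        merged = clusters[a] | clusters[b]
        del clusters[a], clusters[b]
        clusters[next_id] = merged
        if len(clusters) > 1:          # skip the root, which holds every taxon
            clades.add(merged)
        next_id += 1
    return clades

def branch_support(names, matrices_by_d):
    """Count, for each clade, the number of d values whose tree contains it."""
    support = Counter()
    for D in matrices_by_d.values():
        support.update(upgma_clades(names, D))
    return support

if __name__ == "__main__":
    names = ["A", "B", "C", "D"]
    by_d = {
        9: [[0, 1, 10, 10], [1, 0, 10, 10], [10, 10, 0, 1], [10, 10, 1, 0]],
        19: [[0, 1, 4, 9], [1, 0, 4, 9], [4, 4, 0, 9], [9, 9, 9, 0]],
    }
    for clade, count in branch_support(names, by_d).items():
        print(sorted(clade), count)
```

A clade that appears for every scanned d reaches the maximum support, which equals the number of d values used. Building the consensus tree itself would require an additional step (for example majority-rule consensus) that is not shown here.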

Effects of the invention: The biological (sequence) identification code proposed by the invention is based on base-correlation information and has d × 17 parameters (the range of d is chosen as needed). The data volume is very small (typically less than 200 × 17), which also makes visualization easier. The phylogenetic trees obtained by inferring the evolutionary relationships of organisms (sequences) from such data agree with existing biological classification, which shows that the proposed identification code has strong recognition capability. The invention is thus a simpler and practically valuable method for identifying sequences and classifying species.

Brief description of the drawings

Fig. 1 shows the sequence of the AIDS virus type 1 (HIV-1) with ID gi|2745742|.

Fig. 2 shows the identification code of the AIDS virus type 1 (HIV-1) with ID gi|2745742|.

Fig. 3 shows the evolutionary tree obtained by identifying and classifying the 36 mammalian sample species selected in the invention. d increases from 9 to 249 in steps of 10, so the maximum statistical support a branch can obtain is 25. The evolutionary distance between species is based on the differences in $F_{A(k)T}/D_{k+2}$, $F_{T(k)A}/D_{k+2}$, $F_{T(k)T}/D_{k+2}$, $F_{T(k)G}/D_{k+2}$, $F_{G(k)T}/D_{k+2}$ and $D_{k+2}$.

Fig. 4 shows the evolutionary tree obtained by identifying and classifying the 32 parvoviruses selected in the invention. d increases from 9 to 199 in steps of 10, so the maximum statistical support a branch can obtain is 20. The evolutionary distance is based on the differences in $F_{A(k)G}/D_{k+2}$, $F_{G(k)A}/D_{k+2}$, $F_{T(k)C}/D_{k+2}$, $F_{C(k)T}/D_{k+2}$, $F_{T(k)G}/D_{k+2}$, $F_{G(k)T}/D_{k+2}$ and $D_{k+2}$.
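The maximum support values quoted for Figs. 3 and 4 follow directly from the number of d values scanned, as this short check illustrates (the variable names are hypothetical):

```python
fig3_d = list(range(9, 250, 10))   # 9, 19, ..., 249
fig4_d = list(range(9, 200, 10))   # 9, 19, ..., 199

# One tree per d value, so a branch can be supported at most len(...) times.
print(len(fig3_d), len(fig4_d))    # prints "25 20"
```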

Embodiments

Embodiments of the invention are described in detail below with reference to the drawings and examples, to illustrate the validity of the biological (sequence) identification code proposed by the invention.

Embodiment 1

As an example, an identification code is compiled for the AIDS virus type 1 (HIV-1) with ID gi|2745742|. The viral genome sequence consists of 9290 bases. Because the data volume is large, only part of the genome (980 bases) is shown in Fig. 1 to give an intuitive impression.

Following the identification method of the invention, an identification code of 20 rows by 17 columns is compiled for the virus, as shown in Fig. 2; the code is represented in the form of formula (4).

Embodiment 2: Reconstructing the well-known mammalian evolutionary tree

The partial information correlations with strong species specificity are screened with statistical tools: (i) taking 36 mammals as sample species, 100 sequences of length 1 kb are randomly selected from the genome of each sample species as sample sequences; (ii) the information correlation and partial information correlations of the sample sequences and of the mitochondrial genome sequences are calculated for k from 0 to 248; (iii) vectors with 50 different starting points k₀ and maximum dimension d = 8 are built, and analysis of variance and multiple comparisons are performed.
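Step (i) can be sketched as below. The text does not say whether fragments may overlap or repeat, so this sketch assumes independent uniform start positions (sampling with replacement); the function name and the `seed` parameter are hypothetical additions for reproducibility.

```python
import random

def sample_fragments(genome, count=100, length=1000, seed=None):
    """Draw `count` fragments of `length` bases at random start positions."""
    if len(genome) < length:
        raise ValueError("genome is shorter than the requested fragment length")
    rng = random.Random(seed)
    # Assumption: start positions are independent, so fragments may overlap.
    starts = [rng.randrange(len(genome) - length + 1) for _ in range(count)]
    return [genome[s:s + length] for s in starts]

if __name__ == "__main__":
    toy_genome = "ACGT" * 1000            # 4000-base toy sequence
    fragments = sample_fragments(toy_genome, count=100, length=1000, seed=1)
    print(len(fragments), len(fragments[0]))
```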

Table 1 lists the results of these statistics: the average failure score of vector X for d = 2, 4, 6 and 8, where X stands for the information correlation or a partial information correlation. The average failure score is the mean of the failure scores $W_X(k_0, d)$ of the d-dimensional vectors over 50 random k₀. It is normalized to 100 and represents the proportion of species pairs that X cannot distinguish in the multiple comparison among all species pairs; its variance is given in parentheses.

Table 1 shows that the average failure score of vector X decreases as d increases, and that the partial information correlations are more discriminating than the averaged quantity (the information correlation). For d ≥ 6, the partial information correlations $F_{A(k)T}$, $F_{T(k)A}$, $F_{T(k)T}$, $F_{T(k)G}$ and $F_{G(k)T}$ have better recognition capability than the information correlation ($D_{k+2}$). The matrix formed from these five partial information correlations together with $D_{k+2}$ is therefore used to calculate the evolutionary distance between species.
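Selecting the retained quantities amounts to keeping six of the 17 columns of each identification code. A minimal sketch follows; the column labels, their order (AA, AT, AC, AG, TA, …, GG, then D) and the function name are assumptions made for illustration.

```python
BASES = "ATCG"  # assumed column order of the identification code
COLUMNS = [i + j for i in BASES for j in BASES] + ["D"]

# The five partial information correlations and the information correlation
# retained for the mammal data set in this embodiment.
MAMMAL_COLUMNS = ["AT", "TA", "TT", "TG", "GT", "D"]

def select_columns(code, keep):
    """Return a copy of the code matrix restricted to the named columns."""
    idx = [COLUMNS.index(name) for name in keep]
    return [[row[c] for c in idx] for row in code]

if __name__ == "__main__":
    toy_row = list(range(17))             # stand-in for one row of a real code
    print(select_columns([toy_row], MAMMAL_COLUMNS))
```

The reduced matrices can then be passed to whatever distance function is used to compare species.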

The evolutionary distances between species are then calculated with formula (7) and fed into tree-building software. The resulting evolutionary tree is shown in Fig. 3. It agrees well with known biological classification in the following respects. First, the primates (Primate) form a monophyletic branch. Second, the murids (mouse and rat) and the non-murids (squirrel, rabbit, guinea pig and dormouse) each form a monophyletic branch; this result lends new support to the hypothesis that the rodents (Rodent) are not a monophyletic group. Third, the Ferungulata (Ferungulate) form a monophyletic branch whose detailed branching agrees with the conclusions of biologists. Finally, the monotremes (Monotreme: echidna and platypus) and the marsupials (Marsupialia: opossum and wallaroo) each form a pair, and the two pairs lie close to each other.

Table 1. Multiple comparison test results for vector X in mammals

Embodiment 3: Building the parvovirus (Parvoviruses) evolutionary tree

The species specificity of the partial information correlations is tested with statistical tools: (i) all 32 viral genomes are taken as sample genomes, and 50 sequences of length 1 kb are randomly selected from each sample genome as sample sequences; (ii) the information correlation and partial information correlations of the sample sequences and of the complete genomes are calculated for k from 0 to 198; (iii) vectors X with 50 different starting points k₀ and maximum dimension d = 10 are built for analysis of variance and multiple comparisons.

The results of the multiple comparison are shown in Table 2, which lists the average failure score of vector X for d = 4, 6, 8 and 10, where X stands for the information correlation or a partial information correlation. The average failure score is the mean of the failure scores $W_X(k_0, d)$ of the d-dimensional identification codes over 50 random k₀. It is normalized to 100 and represents the proportion of species pairs that X cannot distinguish in the multiple comparison among all species pairs; its variance is given in parentheses.

Table 2 shows that the average failure score of vector X decreases as d increases: at d = 4 the average failure score of every identification code is greater than 6, whereas at d = 10 the average failure score of most vectors is below 2. The average failure scores of the partial information correlations $F_{A(k)G}$, $F_{T(k)C}$, $F_{T(k)G}$, $F_{C(k)T}$, $F_{G(k)A}$ and $F_{G(k)T}$ are markedly smaller (< 0.9). These results show that these six partial information correlations have stronger genome specificity and are suitable for calculating the evolutionary distance between species.

Table 2. Statistical test results for the parvovirus base-correlation vector X

Next, the evolutionary distances are calculated and the evolutionary tree is drawn. The resulting tree is shown in Fig. 4. It agrees with the classification knowledge of biologists: the subfamily Parvovirinae, which infects vertebrates, is completely separated from the subfamily Densovirinae, which infects invertebrates. Further, each genus that contains several species clusters into its own evolutionary branch with high bootstrap values, which shows that these branches are stable. The Aleutian disease virus genus (Amdovirus) and the bocavirus genus (Bocavirus) cluster together; each contains only one virus. The genus Pefudensovirus is close to the genus Densovirus in evolutionary distance. The Dependovirus branch shows host-virus linkage: AAAVa, GPV, AAAVd and MlDPV, whose hosts are birds, cluster together; the viruses BAAV and BPV-2, which infect bovids, form one cluster; and the parvoviruses that infect primates cluster together. This branching structure reflects the host-virus coevolution history of the dependoviruses^[4], whereas the dynamical language (DL) model method of Yu et al. cannot reflect the host-virus branching structure of the dependoviruses. Both overall and in detail, the tree we built agrees with the classification knowledge of biologists, which shows that our method has high resolution.

In the above embodiments, both the major and the minor branches of the evolutionary trees agree with the understanding of biologists. This shows that the organism identification code (digital signature) proposed by the invention and its method of application are effective and can be used to identify genomes and infer the phylogenetic relationships of species.

Claims

1. A method for digitally identifying biological sequences and inferring the phylogenetic relationships of species, the method specifically comprising:

I. Information correlation and partial information correlation

Taking a given DNA sequence as an example, the elements of the sequence are the bases A, G, C and T. Statistically, the probability that base i occurs is $p_i$, where i = A, G, C, T; the probability that base j occurs is $p_j$, where j = A, G, C, T; the joint probability that the base pair i(k)j at distance k occurs is denoted $p_{i(k)j}$. From information theory, the base-correlation information of the whole sequence is:

$$D_{k+2} = -2\sum_{i} p_i \log_2 p_i + \sum_{ij} p_{i(k)j} \log_2 p_{i(k)j} \qquad (k = 0, 1, 2, \ldots) \tag{1}$$

$D_{k+2}$ is called the information correlation;

When $p_{i(k)j}$ is calculated, a periodic boundary condition can be introduced to avoid the edge effect caused by the finite length N: the first k+1 bases of the sequence are appended to its tail to form a sequence of length N+k+1, the number of occurrences of the base pair i(k)j at distance k is denoted $N_{i(k)j}$, and the joint probability is obtained as $p_{i(k)j} = N_{i(k)j}/N$; by Taylor expansion, formula (1) can then be written as

$$
\begin{aligned}
D_{k+2} &= -2\sum_{i} p_i \log_2 p_i + \sum_{ij} \frac{N_{i(k)j}}{N} \log_2 \frac{N_{i(k)j}}{N} \\
        &= \frac{1}{2\ln 2} \sum_{ij} p_i p_j \left( x_{i(k)j}^{2} - \frac{1}{3} x_{i(k)j}^{3} + \frac{1}{6} x_{i(k)j}^{4} - \cdots \right)
\end{aligned}
$$

where $x_{i(k)j} = (N_{i(k)j} - N p_i p_j)/(N p_i p_j)$; when N is large, only the leading term of the expansion needs to be kept, and the information correlation can be written as

$$D_{k+2} \cong \frac{1}{2\ln 2} \sum_{ij} \frac{\left(p_{i(k)j} - p_i p_j\right)^{2}}{p_i p_j}$$

The sum runs over all 16 base correlations; to describe the correlation of a particular pair of bases, the partial information correlation (PIC) is introduced:

$$F_{i(k)j} = \left(p_{i(k)j} - p_i p_j\right)^{2} \tag{2}$$

II. Biological sequence identification code

The biological sequence identification code can be expressed as a matrix or a vector:

$$[F_{A(k_0)A},\ F_{A(k_0)T},\ \ldots,\ F_{G(k_0)G},\ D_{k_0+2},\ F_{A(k_0+1)A},\ \ldots,\ F_{G(k_0+1)G},\ D_{k_0+3},\ \ldots,\ D_{k_0+d+1}] \tag{5}$$

$$[F_{A(k_0)A}/D_{k_0+2},\ \ldots,\ F_{G(k_0)G}/D_{k_0+2},\ D_{k_0+2},\ F_{A(k_0+1)A}/D_{k_0+3},\ \ldots,\ D_{k_0+3},\ \ldots,\ D_{k_0+d+1}] \tag{6}$$

Whatever the representation, the core elements of the biological sequence identification code are the information correlation and the 16 partial information correlations; the parameter k₀ = 0 and the range of the parameter d can be chosen as needed, giving d × 17 values in total; the biological sequence identification code is represented or described in matrix form;

III. Reconstructing the phylogenetic relationships of biological sequences

A species evolutionary tree is built in the following steps:

(1) Screening the information correlations

Construct a vector X of the form $[D_{k_0+2}, D_{k_0+3}, \ldots, D_{k_0+d+1}]$ or $[F_{\alpha(k_0)\beta}, F_{\alpha(k_0+1)\beta}, \ldots, F_{\alpha(k_0+d-1)\beta}]$, where α, β ∈ {A, G, C, T}; perform analysis of variance (ANOVA) and a multiple comparison test (MCT) on the elements of X; in the multiple comparison, for a given pair of species, X is considered to distinguish the pair as long as any one element of X passes the test; the proportion of species pairs that cannot be distinguished among all species pairs, normalized to 100, is called the failure score of X and is denoted $W_X(k_0, d)$; a vector X with a low failure score corresponds to an information parameter with strong species specificity;

(2) Evaluating phylogenetic relationships

The evolutionary distance D must satisfy the following three axioms:

(i) $D_{x,y} \ge 0$, and $D_{x,y} = 0$ if and only if x = y;

(ii) $D_{x,y} = D_{y,x}$;

(iii) for any species x, y and z, $D_{x,z} \le D_{x,y} + D_{y,z}$ always holds;

The distance between base-correlation matrices can be calculated with the Mahalanobis distance or the Euclidean distance and taken as the evolutionary distance between species sequences; with the Euclidean distance formula

$$D_{x,y} = \sqrt{\sum_{j=1}^{f} \sum_{i=1}^{d} \left( M_x(i,j) - M_y(i,j) \right)^{2}} \tag{7}$$

where M is the base-correlation matrix, f is the number of column vectors of the matrix, and d is the number of elements in each column vector; once the distances between all pairs of species have been determined, the distance matrix is obtained and is used to draw the evolutionary tree;

When calculating the evolutionary distance, the ratio of the partial information correlation to the information correlation is used: from formulas (1) and (2), the information correlation and the partial information correlation are of the order of ~10⁻³ and ~10⁻⁶ respectively, so using the ratio as the parameter brings all column vectors to comparable orders of magnitude;

(3) Statistical test of the evolutionary tree

The phylogenetic tree can be constructed with common methods such as neighbor-joining (NJ) and the unweighted pair-group method with arithmetic mean (UPGMA); to test the robustness of the resulting tree, a new test method that plays the role of a reverse bootstrap or jackknife test is proposed: build a base-correlation matrix with Z rows and let d be a variable parameter running from 0 to Z−1; each value of d yields one tree; the optimal range of d is determined by comparing the resulting trees with species classification information; finally, the trees within the optimal range of d are integrated into a consensus tree, which is the final phylogenetic tree; every branch of this tree carries a statistical value, called the bootstrap value; the wider the range of d, the more stable the resulting phylogenetic tree.

2. The method for digitally identifying biological sequences and inferring the phylogenetic relationships of species according to claim 1, wherein the sequence may be the complete genome sequence of an organism or a fragment of an organism's genome sequence.

3. The method for digitally identifying biological sequences and inferring the phylogenetic relationships of species according to claim 1 or 2, wherein the sequence is selected from public sequence resources.

4. The method for digitally identifying biological sequences and inferring the phylogenetic relationships of species according to claim 3, wherein the public sequence resources are selected from the National Center for Biotechnology Information (NCBI) database in the United States, the European Molecular Biology Laboratory (EMBL) database, the DNA Data Bank of Japan (DDBJ), or any public database from which the sequences of biological species can be obtained.

5. The method for digitally identifying biological sequences and inferring the phylogenetic relationships of species according to claim 1, wherein the scope of the biological sequences extends to non-public database resources.