CN103559427B - A method for digitally identifying biological sequences and inferring phylogenetic relationships among species - Google Patents


Info

Publication number
CN103559427B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310557139.1A
Other languages
Chinese (zh)
Other versions
CN103559427A (en
Inventor
高扬
罗辽复
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention provides a biological sequence identification code based on the base-correlation features of an organism's nucleotide sequence, together with a complete implementation method and evaluation system for using the code to identify biological sequences and infer phylogenetic relationships among species. Inferring species relationships is a stringent test of the code's validity. The mammalian phylogenetic tree and the parvovirus relationships inferred with the invention agree with the classifications accepted by biologists, which shows that the method is effective and that the identification code has high resolution. The identification code has strong discriminating power and a small data volume: a small set of numbers can identify a very large genome sequence, which simplifies the identification, comparison, and analysis of biological sequences and gives the method considerable practical value.

Description

A method for digitally identifying biological sequences and inferring phylogenetic relationships among species
Technical field
The present invention uses bioinformatics methods to mine and integrate the information-correlation features of sequences, and then uses numbers to identify biological sequences and species and to infer their phylogenetic relationships. It belongs to the application of informatics in the field of biology.
Background art
Biological sequences include amino acid sequences and nucleotide sequences, and nucleotide sequences are further divided into deoxyribonucleic acid (DNA) sequences and ribonucleic acid (RNA) sequences. A DNA sequence is a polymer of four nucleotide monomers, adenylate (A), cytidylate (C), guanylate (G), and thymidylate (T), and is usually written as a string over these four letters. Similarly, an RNA sequence can be written as a string over the four letters A, C, G, and U, where U (uridylate) replaces T. The whole-genome sequences of sequenced species range in length from thousands to millions, or even billions, of letters.
Researchers have tried to extract data from biological sequences in order to identify them, and have applied the frequencies of genomic oligomers (K-mers) to phylogenetic studies. For example, the composition vector method (CVTree)[1] of Hao Bailin, an academician of the Chinese Academy of Sciences, uses 20^5 values to infer species relationships, and the feature frequency profile method (FFP)[2] of the American scientist Kim et al. uses as many as 20^8 values (far more than the amount of genomic data) to study evolution. These methods are easily limited by high dimensionality and small sample size, are unsuitable for small genomes or short sequences such as parvoviruses[3], and cannot identify an organism (sequence) with a small amount of data.
To make identifying organisms (sequences) and inferring their relationships more practical, we tried a new approach. Unlike methods based on K-mer frequency statistics, we study the information-correlation features of a sequence (DNA or RNA) from the viewpoint of information theory, propose identifying genomes with the information correlation (IC) and the partial information correlation (PIC), and then use these quantities to infer the phylogenetic relationships among organisms (sequences).
Summary of the invention
The invention provides a method for identifying organisms (sequences) with numbers and illustrates its application to inferring the phylogenetic relationships among organisms (sequences). Below we introduce the calculation of the information correlation and the partial information correlation, the construction of the organism (sequence) identification code, and its application to the study of phylogenetic relationships.
The sequence referred to in the invention may be the complete genome sequence of an organism or a fragment of it, and may be a DNA sequence or an RNA sequence. The sequence data used in the invention are public resources, freely available from global public databases such as the database of the National Center for Biotechnology Information (NCBI) in the United States, the European Molecular Biology Laboratory (EMBL) database, and the DNA Data Bank of Japan (DDBJ).
To achieve the above object, the invention provides the following technical solution:
I. Information correlation and partial information correlation
Take a given DNA sequence as an example. The elements of the sequence are the bases A, G, C, and T. Statistically, let $p_i$ be the probability that base $i$ ($i$ = A, G, C, T) occurs, let $p_j$ be the probability that base $j$ ($j$ = A, G, C, T) occurs, and let $p_{i(k)j}$ be the joint probability that bases $i$ and $j$ occupy two positions at distance $k$ (that is, with $k$ intervening bases, so that $k = 0$ means adjacent bases). Information theory then gives the base-correlation information of the whole sequence:
$$D_{k+2} = -2\sum_{i} p_i \log_2 p_i + \sum_{ij} p_{i(k)j} \log_2 p_{i(k)j}, \qquad k = 0, 1, 2, \ldots \tag{1}$$
We call $D_{k+2}$ the information correlation (IC).
When $p_{i(k)j}$ is calculated for a real sequence, periodic boundary conditions can be introduced to avoid the edge effect caused by the finite length $N$: the first $k+1$ bases of the sequence are appended to its tail to form a sequence of length $N+k+1$, the number of occurrences $N_{i(k)j}$ of the base pair $ij$ at distance $k$ is counted, and $p_{i(k)j} = N_{i(k)j}/N$. By a Taylor expansion, formula (1) can be written as
$$\begin{aligned} D_{k+2} &= -2\sum_{i} p_i \log_2 p_i + \sum_{ij} \frac{N_{i(k)j}}{N} \log_2 \frac{N_{i(k)j}}{N} \\ &= \frac{1}{2\ln 2}\sum_{ij} p_i p_j \left( x_{i(k)j}^2 - \frac{1}{3} x_{i(k)j}^3 + \frac{1}{6} x_{i(k)j}^4 - \ldots \right) \end{aligned}$$
where $x_{i(k)j} = (N_{i(k)j} - N p_i p_j)/(N p_i p_j)$. When $N$ is large, only the leading term matters and the information correlation can be written as
$$D_{k+2} \cong \frac{1}{2\ln 2}\sum_{ij} \frac{(p_{i(k)j} - p_i p_j)^2}{p_i p_j}$$
The sum runs over the 16 kinds of base correlation. To describe the correlation of a particular base pair, we introduce the partial information correlation (PIC)
$$F_{i(k)j} = (p_{i(k)j} - p_i p_j)^2 \tag{2}$$
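The calculation of formulas (1) and (2) can be sketched as follows. This is a minimal illustration rather than the inventors' implementation. The function names are hypothetical, the sequence is assumed to contain only the uppercase letters A, G, C, and T and to be longer than k+1, and terms with zero probability are taken to contribute zero to the sums.

```python
from collections import Counter
from math import log2

BASES = "AGCT"

def pair_probabilities(seq, k):
    """Return (p, pk): single-base probabilities and joint probabilities of
    base pairs with k intervening bases, under the periodic boundary."""
    n = len(seq)
    extended = seq + seq[:k + 1]            # append the first k+1 bases to the tail
    singles = Counter(seq)
    pairs = Counter((extended[t], extended[t + k + 1]) for t in range(n))
    p = {b: singles[b] / n for b in BASES}
    pk = {(i, j): pairs[(i, j)] / n for i in BASES for j in BASES}
    return p, pk

def information_correlation(seq, k):
    """D_{k+2} of formula (1); zero-probability terms are skipped."""
    p, pk = pair_probabilities(seq, k)
    h1 = sum(v * log2(v) for v in p.values() if v > 0)
    h2 = sum(v * log2(v) for v in pk.values() if v > 0)
    return -2 * h1 + h2

def partial_information_correlation(seq, k):
    """F_{i(k)j} of formula (2) for all 16 base pairs."""
    p, pk = pair_probabilities(seq, k)
    return {(i, j): (pk[(i, j)] - p[i] * p[j]) ** 2
            for i in BASES for j in BASES}
```

For the strictly periodic toy sequence `"ACGT" * 10` every base fully determines its neighbour, so `information_correlation(seq, 0)` reaches 2 bits, while a sequence of a single repeated base gives 0.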
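II. Organism (sequence) identification code
The organism (sequence) identification code can be expressed as a matrix, a vector, or in other forms, for example:
$$\begin{bmatrix}
F_{A(k_0)A} & F_{A(k_0)T} & \cdots & F_{G(k_0)G} & D_{k_0+2} \\
F_{A(k_0+1)A} & F_{A(k_0+1)T} & \cdots & F_{G(k_0+1)G} & D_{k_0+3} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
F_{A(k_0+d-1)A} & F_{A(k_0+d-1)T} & \cdots & F_{G(k_0+d-1)G} & D_{k_0+d+1}
\end{bmatrix} \tag{3}$$
$$\begin{bmatrix}
F_{A(k_0)A}/D_{k_0+2} & F_{A(k_0)T}/D_{k_0+2} & \cdots & F_{G(k_0)G}/D_{k_0+2} & D_{k_0+2} \\
F_{A(k_0+1)A}/D_{k_0+3} & F_{A(k_0+1)T}/D_{k_0+3} & \cdots & F_{G(k_0+1)G}/D_{k_0+3} & D_{k_0+3} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
F_{A(k_0+d-1)A}/D_{k_0+d+1} & F_{A(k_0+d-1)T}/D_{k_0+d+1} & \cdots & F_{G(k_0+d-1)G}/D_{k_0+d+1} & D_{k_0+d+1}
\end{bmatrix} \tag{4}$$
$$[F_{A(k_0)A}, F_{A(k_0)T}, \ldots, F_{G(k_0)G}, D_{k_0+2}, F_{A(k_0+1)A}, \ldots, F_{G(k_0+1)G}, D_{k_0+3}, \ldots, D_{k_0+d+1}] \tag{5}$$
$$[F_{A(k_0)A}/D_{k_0+2}, \ldots, F_{G(k_0)G}/D_{k_0+2}, D_{k_0+2}, F_{A(k_0+1)A}/D_{k_0+3}, \ldots, D_{k_0+3}, \ldots, D_{k_0+d+1}] \tag{6}$$
Whatever the representation, the core elements of the organism (sequence) identification code are the information correlation and the 16 partial information correlations. The parameter $k_0 = 0$, and the range of the parameter $d$ can be chosen as needed, giving $d \times 17$ values in total. For ease of understanding, the organism (sequence) identification code is usually represented or stated in matrix form.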
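Assembling the d × 17 identification code from these quantities might look like the sketch below. The function names and the `ratio` flag are hypothetical. The text fixes only the first columns (AA, AT), the last partial-correlation column (GG), and the final column D, so the base order A, T, C, G used for the 16 columns in between is an assumption. The sequence is again assumed to contain only A, G, C, and T and to be longer than k+1, and the ratio form assumes that no D value is zero.

```python
from collections import Counter
from itertools import product
from math import log2

ORDER = "ATCG"  # assumed column order: AA, AT, AC, AG, TA, ..., GG

def code_row(seq, k):
    """One row of the matrix form: 16 PIC values followed by D_{k+2}."""
    n = len(seq)
    ext = seq + seq[:k + 1]                          # periodic boundary
    p = {b: c / n for b, c in Counter(seq).items()}
    pk = Counter((ext[t], ext[t + k + 1]) for t in range(n))
    row = []
    d_val = -2 * sum(v * log2(v) for v in p.values())
    for i, j in product(ORDER, repeat=2):
        pij = pk[(i, j)] / n
        row.append((pij - p.get(i, 0.0) * p.get(j, 0.0)) ** 2)
        if pij > 0:                                  # skip zero-probability terms
            d_val += pij * log2(pij)
    return row + [d_val]

def identification_code(seq, d, k0=0, ratio=False):
    """d x 17 matrix; with ratio=True each PIC is divided by the D of its row."""
    rows = [code_row(seq, k) for k in range(k0, k0 + d)]
    if ratio:
        rows = [[f / r[-1] for f in r[:-1]] + [r[-1]] for r in rows]
    return rows
```

Calling `identification_code(seq, 20)` would give a 20 × 17 table of the kind described for Fig. 2, and flattening the rows in order gives the vector forms.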
III. Reconstructing the phylogenetic relationships among organisms (sequences)
Reconstructing the phylogenetic relationships among organisms (sequences) is both an application of the identification code and a stringent test of its validity. To test the validity of the proposed organism (sequence) identification code, we use it to build a species phylogenetic tree in the following steps:
1. Screening the information correlations
Partial information correlations with weak discriminating power carry a weak phylogenetic signal and may even confound the evolutionary relationships. Therefore only the partial information correlations with stronger discriminating power are combined with the information correlation to form the parameters used to calculate the evolutionary distance between species. First build a vector X of the form $[D_{k_0+2}, D_{k_0+3}, \ldots, D_{k_0+d+1}]$ or $[F_{\alpha(k_0)\beta}, F_{\alpha(k_0+1)\beta}, \ldots, F_{\alpha(k_0+d-1)\beta}]$ ($\alpha, \beta \in \{A, G, C, T\}$). Perform analysis of variance (ANOVA) and a multiple comparison test (MCT) on the elements of X. In the multiple comparison, a given species pair is considered successfully distinguished by X as long as any element of X passes the test for that pair. The proportion of indistinguishable species pairs among all species pairs, normalized to a scale of 100, is called the failure score of X and is denoted $W_X(k_0, d)$. A vector X with a low failure score corresponds to an information parameter with strong species specificity.
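The bookkeeping behind the failure score can be sketched as follows. The text does not say which ANOVA or multiple comparison procedure is used, so this sketch swaps in a plain Welch t statistic with a fixed cut-off as a stand-in for one pairwise comparison. That choice, the threshold value, and all function names are assumptions made only for illustration. Each species needs at least two sample vectors.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Absolute Welch t statistic (stand-in for one pairwise comparison)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if mean(a) == mean(b) else float("inf")
    return abs(mean(a) - mean(b)) / se

def failure_score(samples, threshold=3.0):
    """samples: {species: [vector X from each sample sequence]}.
    Returns the share of species pairs that no element of X separates,
    on a scale of 0 to 100."""
    pairs = list(combinations(sorted(samples), 2))
    failed = 0
    for s, t in pairs:
        dims = range(len(samples[s][0]))
        # a pair counts as distinguished if ANY element of X separates it
        separated = any(
            welch_t([x[e] for x in samples[s]],
                    [x[e] for x in samples[t]]) > threshold
            for e in dims)
        if not separated:
            failed += 1
    return 100.0 * failed / len(pairs)
```

Averaging this score over many starting points k0 for a fixed d would give an average failure score of the kind reported in the tables of the embodiments.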
2. Evaluating phylogenetic relationships
There are many ways to evaluate the relationships among species (sequences), such as the probabilistic methods used to compare sample sequences in paternity testing and the construction of a phylogenetic tree from a distance matrix.
Taking the construction of a phylogenetic tree as an example, the evolutionary distance $D$ must satisfy the following three axioms:
(i) $D_{x,y} \ge 0$, with $D_{x,y} = 0$ if and only if $x = y$;
(ii) $D_{x,y} = D_{y,x}$ (symmetry);
(iii) $D_{x,z} \le D_{x,y} + D_{y,z}$ holds for any species $x$, $y$, and $z$ (triangle inequality).
The distance between base-correlation matrices can be calculated with algorithms such as the Mahalanobis distance or the Euclidean distance and taken as the evolutionary distance between species (sequences). For ease of understanding, take the Euclidean distance formula as an example:
$$D_{x,y} = \sqrt{\sum_{j=1}^{f} \sum_{i=1}^{d} \left( M_x(i,j) - M_y(i,j) \right)^2} \tag{7}$$
where $M$ denotes the base-correlation matrix, $f$ is the number of column vectors of the matrix, and $d$ is the number of elements in each column vector. Once the distances between all pairs of species have been determined, the distance matrix is obtained and used to draw the phylogenetic tree.
When calculating the evolutionary distance, we recommend using the ratio of the partial information correlation to the information correlation. Formulas (1) and (2) show that the information correlation and the partial information correlation are of order ~10^-3 and ~10^-6 respectively, so using the ratio as the parameter brings the column vectors to comparable orders of magnitude.
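The Euclidean distance of formula (7) and the resulting distance matrix can be sketched in a few lines. The function names are hypothetical, and the matrices are assumed to be equally shaped lists of rows, for example the ratio form of the identification code restricted to the selected columns.

```python
from math import sqrt

def evolutionary_distance(mx, my):
    """Formula (7): Euclidean distance between two equally shaped matrices."""
    return sqrt(sum((a - b) ** 2
                    for row_x, row_y in zip(mx, my)
                    for a, b in zip(row_x, row_y)))

def distance_matrix(codes):
    """codes: {species: matrix}. Returns (names, symmetric distance table)."""
    names = sorted(codes)
    table = [[evolutionary_distance(codes[a], codes[b]) for b in names]
             for a in names]
    return names, table
```

Because the Euclidean distance is a metric, the three axioms listed above hold whenever different species have different matrices.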
3. Statistical test of the phylogenetic tree
The phylogenetic tree can be constructed with the general neighbor-joining method (Neighbor-Joining, NJ) or the unweighted pair-group method with arithmetic mean (Unweighted Pair-Group Method with Arithmetic Mean, UPGMA). To test the robustness of the resulting tree, we propose a new method comparable to an inverse bootstrap or jackknife test: build a base-correlation matrix with Z rows and let d be a variable parameter ranging from 0 to Z-1. Each value of d yields one tree. The optimal range of d is determined by comparing the generated trees with species taxonomy information. Finally, the trees within the optimal range of d are combined into a consensus tree, which is the final phylogenetic tree. Each branch of the resulting tree carries a statistical value, called the bootstrap value. The wider the range of d, the more stable the resulting phylogenetic tree.
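One way to picture this test is to build a UPGMA tree for each value of d and count how many of the trees contain each branch. The sketch below does that with a compact UPGMA written for illustration. It is an assumption about how the branch statistics could be obtained, not the tree-building software used by the inventors, and the function names are hypothetical.

```python
from collections import Counter

def upgma_clades(names, dist):
    """Cluster with UPGMA and return the clades (frozensets of names)
    created by the merges, single leaves excluded."""
    clusters = {i: frozenset([n]) for i, n in enumerate(names)}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    clades, next_id = set(), len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)                     # closest pair of clusters
        ni, nj = len(clusters[i]), len(clusters[j])
        new_d = {}
        for m in clusters:
            if m in (i, j):
                continue
            dim = d[(min(i, m), max(i, m))]
            djm = d[(min(j, m), max(j, m))]
            # size-weighted mean keeps the arithmetic average over all leaves
            new_d[(m, next_id)] = (ni * dim + nj * djm) / (ni + nj)
        d = {key: v for key, v in d.items() if i not in key and j not in key}
        d.update(new_d)
        merged = clusters.pop(i) | clusters.pop(j)
        clusters[next_id] = merged
        clades.add(merged)
        next_id += 1
    return clades

def branch_support(names, dist_by_d):
    """dist_by_d: {d: distance table}. For every clade, count the number of
    d values whose tree contains it."""
    support = Counter()
    for dist in dist_by_d.values():
        support.update(upgma_clades(names, dist))
    return support
```

With d running from 9 to 249 in steps of 10 there are 25 trees, so a branch present in every tree would reach a support of 25. The clade that contains all species always receives full support.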
Effect of the invention: the organism (sequence) identification code proposed by the invention is based on base-correlation information and has d × 17 parameters (the range of d is chosen as needed). The data volume is very small (typically less than 200 × 17), which also makes visualization easier. When the evolutionary relationships among organisms (sequences) are inferred from such data, the resulting phylogenetic tree agrees with existing biological classification, which shows that the proposed identification code has strong discriminating power. It is a simpler and practically valuable method for identifying sequences and classifying species.
Brief description of the drawings
Fig. 1 shows the sequence of the AIDS type 1 virus (HIV-1) with ID gi|2745742|.
Fig. 2 shows the identification code of the AIDS type 1 virus (HIV-1) with ID gi|2745742|.
Fig. 3 shows the phylogenetic tree obtained by identifying and classifying the 36 mammalian sample species selected for the invention. Here d increases from 9 to 249 in steps of 10, and the maximum statistical support a branch can obtain is 25. The evolutionary distance between species is based on the differences in $F_{A(k)T}/D_{k+2}$, $F_{T(k)A}/D_{k+2}$, $F_{T(k)T}/D_{k+2}$, $F_{T(k)G}/D_{k+2}$, $F_{G(k)T}/D_{k+2}$, and $D_{k+2}$.
Fig. 4 shows the phylogenetic tree obtained by identifying and classifying the 32 parvoviruses selected for the invention. Here d increases from 9 to 199 in steps of 10, and the maximum statistical support a branch can obtain is 20. The evolutionary distance is based on the differences in $F_{A(k)G}/D_{k+2}$, $F_{G(k)A}/D_{k+2}$, $F_{T(k)C}/D_{k+2}$, $F_{C(k)T}/D_{k+2}$, $F_{T(k)G}/D_{k+2}$, $F_{G(k)T}/D_{k+2}$, and $D_{k+2}$.
Detailed description of the embodiments
Embodiments of the invention are described in detail below with reference to the drawings and examples, to illustrate the validity of the proposed organism (sequence) identification code.
Embodiment 1
As an example, we build the identification code of the AIDS type 1 virus (HIV-1) with ID gi|2745742|. The genome sequence of this virus consists of 9290 bases. Because the data volume is large, we show only part of the genome (980 bases) to give an intuitive impression, as in Fig. 1.
Following the identification method of the invention, an identification code of 20 rows and 17 columns was built for the virus, as shown in Fig. 2. The code is represented in the form of formula (4).
Embodiment 2: Reconstructing the well-known mammalian phylogenetic tree
The partial information correlations with strong species specificity are screened with statistical tools: (i) taking 36 mammals as sample species, 100 sequences of length 1 kb are randomly selected from the genome of each sample species as sample sequences; (ii) the information correlation and partial information correlations of the sample sequences and of the mitochondrial genome sequences are calculated for k from 0 to 248; (iii) vectors with 50 different starting points $k_0$ and a maximum dimension d = 8 are built, and analysis of variance and multiple comparison are performed.
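Step (i), drawing random fixed-length fragments from a genome, can be sketched as below. The function name, the uniform choice of start positions, and the fixed seed are assumptions made for illustration. The text does not say how the fragments are drawn or whether they may overlap.

```python
import random

def sample_fragments(genome, n_samples=100, length=1000, seed=0):
    """Draw n_samples fragments of the given length at uniformly random
    start positions (fragments may overlap)."""
    last_start = len(genome) - length
    if last_start < 0:
        raise ValueError("genome is shorter than the fragment length")
    rng = random.Random(seed)                 # fixed seed for reproducibility
    starts = [rng.randint(0, last_start) for _ in range(n_samples)]
    return [genome[s:s + length] for s in starts]
```

Each fragment would then be turned into the vectors X of step (iii), one per starting point $k_0$.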
Table 1 shows the results of these statistics. The table lists the average failure score of vector X for d = 2, 4, 6, and 8, where X stands for the information correlation or a partial information correlation. The average failure score is the mean of the failure scores $W_X(k_0, d)$ of the d-dimensional vectors corresponding to 50 random values of $k_0$. It is normalized to a scale of 100 and represents the proportion of species pairs that vector X fails to distinguish among all species pairs in the multiple comparison; its variance is given in parentheses.
Table 1 shows that the average failure score of vector X decreases as d increases, and that, relative to the averaged quantity (the information correlation), the partial information correlations are more clearly differentiated. When d ≥ 6, the partial information correlations $F_{A(k)T}$, $F_{T(k)A}$, $F_{T(k)T}$, $F_{T(k)G}$, and $F_{G(k)T}$ discriminate better than the information correlation ($D_{k+2}$). These 5 partial information correlations and $D_{k+2}$ are therefore combined into the matrix used to calculate the evolutionary distance between species.
The evolutionary distance between species is then calculated with formula (7) and fed into tree-building software. The resulting tree, shown in Fig. 3, agrees well with known biological classification in the following respects. First, the primates (Primate) form a monophyletic branch. Second, the murids (mouse and rat) and the non-murids (squirrel, rabbit, guinea pig, and dormouse) each form a monophyletic branch; this result gives new support to the hypothesis that the rodents (Rodent) are not monophyletic. Third, the ferungulates (Ferungulate) form a monophyletic branch whose detailed branching agrees with biologists' conclusions. Finally, the monotremes (Monotreme: echidna and platypus) and the marsupials (Marsupialia: opossum and wallaroo) each cluster as a pair, and the two pairs lie close to each other.
Table 1. Multiple comparison test results for vector X in mammals
Embodiment 3: Building the parvovirus (Parvoviruses) phylogenetic tree
The species specificity of the partial information correlations is tested with statistical tools: (i) all 32 viral genomes are used as sample genomes, and 50 sequences of length 1 kb are randomly selected from each sample genome as sample sequences; (ii) the information correlation and partial information correlations of the sample sequences and of the whole genomes are calculated for k from 0 to 198; (iii) vectors X with 50 different starting points $k_0$ and a maximum dimension d = 10 are built for analysis of variance and multiple comparison.
The results of the multiple comparison are shown in Table 2. The table lists the average failure score of vector X for d = 4, 6, 8, and 10, where X stands for the information correlation or a partial information correlation. The average failure score is the mean of the failure scores $W_X(k_0, d)$ of the d-dimensional identification codes corresponding to 50 random values of $k_0$. It is normalized to a scale of 100 and represents the proportion of species pairs that vector X fails to distinguish among all species pairs in the multiple comparison; its variance is given in parentheses.
Table 2 shows that the average failure score of vector X decreases as d increases. When d = 4, the average failure scores of all identification codes are greater than 6; when d = 10, the average failure scores of most vectors are less than 2. Among them, the partial information correlations $F_{A(k)G}$, $F_{T(k)C}$, $F_{T(k)G}$, $F_{C(k)T}$, $F_{G(k)A}$, and $F_{G(k)T}$ have markedly smaller average failure scores (< 0.9). These results show that these 6 partial information correlations are more genome-specific and are suitable for calculating the evolutionary distance between species.
Table 2. Statistical test results for the parvovirus base-correlation vector X
Next the evolutionary distances are calculated and the phylogenetic tree is drawn. The resulting tree, shown in Fig. 4, agrees with biologists' classification: the subfamily Parvovirinae, which infects vertebrates, and the subfamily Densovirinae, which infects invertebrates, are completely separated. Furthermore, every genus that contains several species clusters into its own clade with a high bootstrap value, which shows that these branches are stable. The Aleutian disease virus genus (Amdovirus) and the genus Bocavirus cluster together; each of them contains only one virus. The genus Pefudensovirus is close to the genus Densovirus in evolutionary distance. The Dependovirus branch shows host-virus linkage: AAAVa, GPV, AAAVd, and MlDPV cluster together, and their hosts are birds; the viruses BAAV and BPV-2, which infect bovids, cluster into one group; and the parvoviruses that infect primates cluster together. This branch structure reflects the history of host-virus coevolution in Dependovirus[4], whereas the dynamical language model method (DL) of Yu et al. cannot reflect the host-virus branch structure of Dependovirus. Both overall and in detail, the tree we built agrees with biologists' classification, which shows that our method has high resolution.
In the above embodiments, both the major and the minor branches of the resulting trees agree with biologists' understanding. This shows that the organism identification code (digital signature) proposed by the invention and its method of application are effective and can be used to identify genomes and infer the relationships among species.

Claims (5)

1. A method for digitally identifying biological sequences and inferring phylogenetic relationships among species, the method specifically comprising:
I. Information correlation and partial information correlation
Taking a given DNA sequence as an example, the elements of the sequence are the bases A, G, C, and T; statistically, the probability that base $i$ occurs is $p_i$, where $i$ = A, G, C, T; the probability that base $j$ occurs is $p_j$, where $j$ = A, G, C, T; the joint probability that the base pair $i(k)j$ at distance $k$ occurs is denoted $p_{i(k)j}$; information theory then gives the base-correlation information of the whole sequence:
$$D_{k+2} = -2\sum_{i} p_i \log_2 p_i + \sum_{ij} p_{i(k)j} \log_2 p_{i(k)j}, \qquad k = 0, 1, 2, \ldots \tag{1}$$
$D_{k+2}$ is called the information correlation;
When $p_{i(k)j}$ is calculated, periodic boundary conditions can be introduced to avoid the edge effect caused by the finite length $N$: the first $k+1$ bases of the sequence are appended to its tail to form a sequence of length $N+k+1$, the number of occurrences of the base pair $i(k)j$ at distance $k$ is denoted $N_{i(k)j}$, and the joint probability $p_{i(k)j}$ is obtained, where $p_{i(k)j} = N_{i(k)j}/N$; by a Taylor expansion, formula (1) can be written as
$$\begin{aligned} D_{k+2} &= -2\sum_{i} p_i \log_2 p_i + \sum_{ij} \frac{N_{i(k)j}}{N} \log_2 \frac{N_{i(k)j}}{N} \\ &= \frac{1}{2\ln 2}\sum_{ij} p_i p_j \left( x_{i(k)j}^2 - \frac{1}{3} x_{i(k)j}^3 + \frac{1}{6} x_{i(k)j}^4 - \ldots \right) \end{aligned}$$
where $x_{i(k)j} = (N_{i(k)j} - N p_i p_j)/(N p_i p_j)$; when $N$ is large, only the leading term of the expansion matters and the information correlation can be written as
$$D_{k+2} \cong \frac{1}{2\ln 2}\sum_{ij} \frac{(p_{i(k)j} - p_i p_j)^2}{p_i p_j}$$
The sum runs over the 16 kinds of base correlation; to describe the correlation of a particular base pair, the partial information correlation (PIC) is introduced
$$F_{i(k)j} = (p_{i(k)j} - p_i p_j)^2 \tag{2}$$
II. Biological sequence identification code
The biological sequence identification code can be expressed as a matrix or a vector:
$$\begin{bmatrix}
F_{A(k_0)A} & F_{A(k_0)T} & \cdots & F_{G(k_0)G} & D_{k_0+2} \\
F_{A(k_0+1)A} & F_{A(k_0+1)T} & \cdots & F_{G(k_0+1)G} & D_{k_0+3} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
F_{A(k_0+d-1)A} & F_{A(k_0+d-1)T} & \cdots & F_{G(k_0+d-1)G} & D_{k_0+d+1}
\end{bmatrix} \tag{3}$$
$$\begin{bmatrix}
F_{A(k_0)A}/D_{k_0+2} & F_{A(k_0)T}/D_{k_0+2} & \cdots & F_{G(k_0)G}/D_{k_0+2} & D_{k_0+2} \\
F_{A(k_0+1)A}/D_{k_0+3} & F_{A(k_0+1)T}/D_{k_0+3} & \cdots & F_{G(k_0+1)G}/D_{k_0+3} & D_{k_0+3} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
F_{A(k_0+d-1)A}/D_{k_0+d+1} & F_{A(k_0+d-1)T}/D_{k_0+d+1} & \cdots & F_{G(k_0+d-1)G}/D_{k_0+d+1} & D_{k_0+d+1}
\end{bmatrix} \tag{4}$$
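$$[F_{A(k_0)A}, F_{A(k_0)T}, \ldots, F_{G(k_0)G}, D_{k_0+2}, F_{A(k_0+1)A}, \ldots, F_{G(k_0+1)G}, D_{k_0+3}, \ldots, D_{k_0+d+1}] \tag{5}$$
$$[F_{A(k_0)A}/D_{k_0+2}, \ldots, F_{G(k_0)G}/D_{k_0+2}, D_{k_0+2}, F_{A(k_0+1)A}/D_{k_0+3}, \ldots, D_{k_0+3}, \ldots, D_{k_0+d+1}] \tag{6}$$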
Whatever the form of representation, the core elements of the biological sequence identification code are always the information association $D$ and the 16 kinds of partial information association $F$; the parameter $k_0 = 0$, and the range of the parameter $d$ can be determined as needed, giving $d \times 17$ data values in total; the biological sequence identification code is represented or stated in matrix form;
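The layouts in forms (4) to (6) can be illustrated with a minimal sketch. It assumes the values of $F$ and $D$ have already been computed by formulas (1) and (2); the function names `build_code_matrix` and `flatten_code` and the toy numbers are hypothetical and are not part of the claimed method.

```python
from typing import List, Sequence


def build_code_matrix(F: Sequence[Sequence[float]],
                      D: Sequence[float],
                      use_ratio: bool = True) -> List[List[float]]:
    """Assemble a d x 17 identification code matrix.

    F: d rows, each holding the 16 partial information association values.
    D: d information association values, one per row.
    With use_ratio=True each F value is divided by the D of its row,
    which is the layout of form (4).
    """
    if len(F) != len(D):
        raise ValueError("F and D must have the same number of rows")
    matrix = []
    for f_row, d_k in zip(F, D):
        if len(f_row) != 16:
            raise ValueError("each row needs 16 partial association values")
        cols = [f / d_k for f in f_row] if use_ratio else list(f_row)
        matrix.append(cols + [d_k])  # D is stored as the 17th column
    return matrix


def flatten_code(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Row-major flattening, i.e. the vector layouts of forms (5) and (6)."""
    return [value for row in matrix for value in row]


# Toy input (made-up numbers): d = 3 rows.
F = [[1e-6 * (j + 1) for j in range(16)] for _ in range(3)]
D = [1e-3, 2e-3, 4e-3]
code = build_code_matrix(F, D)   # ratio form, as in (4)
vec = flatten_code(code)         # ratio vector, as in (6)
```

With `use_ratio=False` the same flattening gives the raw vector of form (5). In every case the code holds $d \times 17$ numbers.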
III. Reconstructing the affiliation (kinship) of biological sequences
A species phylogenetic tree is built by the following steps:
(1) Screening the information association parameters
Construct a vector X of the form $[D_{k_0+2}, D_{k_0+3}, \ldots, D_{k_0+d+1}]$ or $[F_{\alpha(k_0)\beta}, F_{\alpha(k_0+1)\beta}, \ldots, F_{\alpha(k_0+d-1)\beta}]$, where $\alpha, \beta \in \{A, G, C, T\}$; perform analysis of variance (ANOVA) and a multiple comparison test (MCT) on the elements of X; in the multiple comparison, for a given species pair, vector X is considered to distinguish the pair successfully as long as any one element of X passes the multiple comparison test; the ratio of the number of indistinguishable species pairs to the total number of species pairs, normalized to 100, is called the failure score of vector X and is denoted $W_X(k_0, d)$; a vector X with a low failure score corresponds to an information parameter with strong species specificity;
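The failure score can be illustrated with a short sketch. It assumes the ANOVA and multiple comparison test have already been run elsewhere and reduced to one pass/fail flag per element of X for each species pair, and it reads "normalized to 100" as a percentage. The function name `failure_score`, the species names, and the outcomes are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def failure_score(passes: Dict[FrozenSet[str], List[bool]]) -> float:
    """Failure score of a vector X on a 0-100 scale.

    passes[pair] holds one flag per element of X: True if that element
    passed the multiple comparison test for the species pair.
    A pair counts as distinguished if any element passed.
    """
    total = len(passes)
    if total == 0:
        raise ValueError("no species pairs given")
    unresolved = sum(1 for flags in passes.values() if not any(flags))
    return 100.0 * unresolved / total


species = ["human", "mouse", "dog"]
pairs = [frozenset(p) for p in combinations(species, 2)]

# Hypothetical test outcomes for a vector X with three elements.
passes = {
    pairs[0]: [False, True, False],   # distinguished by the second element
    pairs[1]: [False, False, False],  # not distinguished
    pairs[2]: [True, True, False],    # distinguished
}
score = failure_score(passes)  # one of three pairs is unresolved
```

A lower score means that the chosen parameter separates more species pairs.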
(2) Evaluating affiliation
The evolutionary distance D must satisfy the following three axioms:
(i) $D_{x,y} \ge 0$, and $D_{x,y} = 0$ if and only if $x = y$;
(ii) $D_{x,y} = D_{y,x}$;
(iii) for any species x, y and z, $D_{x,z} \le D_{x,y} + D_{y,z}$ always holds;
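The three axioms can be checked numerically on a candidate distance matrix. The sketch below is only an illustration: the helper name `is_valid_distance` and the tolerance `tol` are assumptions, and each index is taken to stand for a distinct species.

```python
from typing import Sequence


def is_valid_distance(D: Sequence[Sequence[float]], tol: float = 1e-12) -> bool:
    """Return True if the square matrix D satisfies axioms (i) to (iii)."""
    n = len(D)
    for x in range(n):
        if abs(D[x][x]) > tol:               # (i) zero distance to itself
            return False
        for y in range(n):
            if D[x][y] < -tol:               # (i) non-negativity
                return False
            if x != y and D[x][y] <= tol:    # (i) distinct species differ
                return False
            if abs(D[x][y] - D[y][x]) > tol: # (ii) symmetry
                return False
            for z in range(n):               # (iii) triangle inequality
                if D[x][z] > D[x][y] + D[y][z] + tol:
                    return False
    return True


good = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.5],
        [2.0, 1.5, 0.0]]
bad = [[0.0, 1.0, 5.0],   # 5 > 1 + 1 violates the triangle inequality
       [1.0, 0.0, 1.0],
       [5.0, 1.0, 0.0]]
good_ok = is_valid_distance(good)
bad_ok = is_valid_distance(bad)
```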
The distance between base correlation matrices can be calculated with algorithms such as the Mahalanobis distance or the Euclidean distance, and is taken as the evolutionary distance between the species sequences; with the Euclidean distance formula:
$$D_{x,y} = \sqrt{\sum_{j=1}^{f} \sum_{i=1}^{d} \left( M_x(i,j) - M_y(i,j) \right)^2} \tag{7}$$
where M denotes the base correlation matrix, f is the number of column vectors of the matrix, and d is the number of elements in each column vector; once the distances between all pairs of species have been determined, the distance matrix is obtained and is then used to draw the phylogenetic tree;
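Formula (7) and the resulting distance matrix can be sketched in a few lines. The names `euclidean_distance` and `distance_matrix` and the toy matrices are illustrative assumptions. Real base correlation matrices would have 17 columns rather than 2.

```python
import math
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def euclidean_distance(Mx: Matrix, My: Matrix) -> float:
    """Formula (7): root of the summed squared element-wise differences."""
    if len(Mx) != len(My) or any(len(a) != len(b) for a, b in zip(Mx, My)):
        raise ValueError("matrices must have the same shape")
    return math.sqrt(sum((a - b) ** 2
                         for row_x, row_y in zip(Mx, My)
                         for a, b in zip(row_x, row_y)))


def distance_matrix(codes: Dict[str, Matrix]) -> Tuple[List[str], List[List[float]]]:
    """Pairwise distances between all species, in sorted name order."""
    names = sorted(codes)
    dist = [[euclidean_distance(codes[a], codes[b]) for b in names]
            for a in names]
    return names, dist


# Toy 2 x 2 matrices standing in for base correlation matrices.
codes = {
    "sp_a": [[0.0, 0.0], [0.0, 0.0]],
    "sp_b": [[3.0, 0.0], [0.0, 4.0]],
    "sp_c": [[0.0, 0.0], [0.0, 1.0]],
}
names, dist = distance_matrix(codes)
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed directly to a tree-building method.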
The ratio of the partial information association to the information association is used when calculating the evolutionary distance because, as formulas (1) and (2) show, the orders of magnitude of the information association and the partial information association are about $10^{-3}$ and $10^{-6}$ respectively; using this ratio as the parameter makes the column vectors comparable in order of magnitude;
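As a rough order-of-magnitude check, using only the two magnitudes quoted above:

$$
\frac{F}{D} \sim \frac{10^{-6}}{10^{-3}} = 10^{-3} \sim D
$$

so the 16 ratio columns and the $D$ column contribute on a similar scale to the sum in formula (7). If the raw $F$ values were used, their squared differences would be roughly six orders of magnitude smaller than those of $D$ and would barely affect the distance.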
(3) Statistical test of the phylogenetic tree
The phylogenetic tree can generally be constructed with the neighbor-joining method (Neighbor-Joining, abbreviated NJ) or the unweighted pair-group method with arithmetic mean (abbreviated UPGMA); to test the robustness of the generated tree, a new test equivalent to a reverse bootstrap or jackknife test is proposed: a base correlation matrix with Z rows is built, and d is treated as a variable parameter ranging from 0 to Z-1; each given value of d yields one tree; the optimal range of d values is determined by comparing the generated trees with species taxonomy information; finally, the trees within the optimal range of d values are combined into a consensus tree, which is the final phylogenetic tree; each branch of the phylogenetic tree obtained in this way has a statistical value, called the bootstrap value; the larger the range of d values, the more stable the resulting phylogenetic tree.
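The sweep over d can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure itself: it uses a plain UPGMA implementation, truncates each matrix to its first d rows with d running from 1 to Z, and reports the fraction of trees containing each clade as a stand-in for the branch statistic. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def euclid(Mx, My, d):
    """Formula (7) restricted to the first d rows of each matrix."""
    return math.sqrt(sum((a - b) ** 2
                         for rx, ry in zip(Mx[:d], My[:d])
                         for a, b in zip(rx, ry)))


def upgma_clades(names, leaf_dist):
    """Run UPGMA and return the non-trivial clades as frozensets of names."""
    clusters = [frozenset([n]) for n in names]
    clades = set()

    def avg(c1, c2):
        # UPGMA distance: mean over all leaf pairs across the two clusters.
        return sum(leaf_dist[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        if len(clusters) > 1:  # skip the root, which contains every leaf
            clades.add(c1 | c2)
    return clades


def clade_support(codes, d_values):
    """Fraction of the trees (one per d) in which each clade appears."""
    names = sorted(codes)
    counts = Counter()
    for d in d_values:
        leaf_dist = {a: {b: euclid(codes[a], codes[b], d) for b in names}
                     for a in names}
        counts.update(upgma_clades(names, leaf_dist))
    return {clade: n / len(d_values) for clade, n in counts.items()}


# Toy matrices with Z = 3 rows and 2 columns (made-up numbers).
codes = {
    "a": [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    "b": [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]],
    "c": [[1.0, 1.0], [1.0, 1.0], [5.0, 0.0]],
    "d": [[1.1, 1.0], [1.0, 1.2], [0.0, 0.0]],
}
support = clade_support(codes, [1, 2, 3])
```

In this toy run the clade {a, b} appears for every d, whereas {c, d} is lost once the third row is included. One way to form a consensus tree, again as an assumption, is to keep only clades whose support exceeds one half.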
2. The method for digitally identifying biological sequences and inferring species affiliation according to claim 1, wherein the sequence may be the complete genome sequence of an organism or a fragment of an organism's genome sequence.
3. The method for digitally identifying biological sequences and inferring species affiliation according to claim 1 or 2, wherein the sequence is selected from public sequence resources.
4. The method for digitally identifying biological sequences and inferring species affiliation according to claim 3, wherein the public sequence resources are selected from any public database from which the sequences of living species can be obtained, such as the US National Center for Biotechnology Information (NCBI) database, the European Molecular Biology Laboratory (EMBL) database and the DNA Data Bank of Japan (DDBJ).
5. The method for digitally identifying biological sequences and inferring species affiliation according to claim 1, wherein the scope of the biological sequences is extended to non-public database resources.
CN201310557139.1A 2013-11-12 2013-11-12 A kind of use Digital ID biological sequence and the method for inferring species affiliation Active CN103559427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310557139.1A CN103559427B (en) 2013-11-12 2013-11-12 A kind of use Digital ID biological sequence and the method for inferring species affiliation

Publications (2)

Publication Number Publication Date
CN103559427A CN103559427A (en) 2014-02-05
CN103559427B true CN103559427B (en) 2017-10-31

Family

ID=50013673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310557139.1A Active CN103559427B (en) 2013-11-12 2013-11-12 A kind of use Digital ID biological sequence and the method for inferring species affiliation

Country Status (1)

Country Link
CN (1) CN103559427B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512512B (en) * 2015-11-24 2019-03-29 潍坊医学院 The method that amino acid carries out species taxonomy apart from polymorphism comparison protein sequence
CN105447341B (en) * 2015-11-24 2018-10-16 潍坊医学院 Mononucleotide compares the method that nucleic acid sequence carries out species taxonomy apart from polymorphism
CN109937426A (en) * 2016-04-11 2019-06-25 量子生物有限公司 System and method for biological data management
CN109273046B (en) * 2018-10-19 2022-04-22 江苏东南证据科学研究院有限公司 Biological whole sibling identification method based on probability statistical model
WO2022087839A1 (en) * 2020-10-27 2022-05-05 深圳华大基因股份有限公司 Non-invasive prenatal genetic testing data-based kinship determining method and apparatus

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1810972A (en) * 2005-12-15 2006-08-02 中国水产科学研究院黄海水产研究所 Lefteye flounder disease resistance related MHC gene marker and subsidiary breeding method
CN101392293A (en) * 2008-09-25 2009-03-25 上海交通大学 Molecular marker method of turnip mosaic virus resistance gene in non-heading Chinese cabbage
CN101812520A (en) * 2010-03-30 2010-08-25 浙江大学 Molecular marker method based on microRNA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130068185A (en) * 2011-12-14 2013-06-26 한국전자통신연구원 Genome sequence mapping device and genome sequence mapping method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
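"Application of the base correlation matrix method in the study of kinship among DNA viruses" (碱基关联矩阵法在DNA病毒亲缘关系研究中的应用); Gao Yang (高扬); Wanfang Dissertations (《万方学位论文》); 2012-11-30; main text pp. 6-30 *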

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant