CN103559427B - A method for identifying biological sequences with numbers and inferring species relationships - Google Patents
- Publication number
- CN103559427B (application CN201310557139.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
Abstract
The invention provides a biological sequence identification code based on the base correlation features of an organism's nucleotide sequence, and proposes a complete procedure for using the code to identify biological sequences and infer species relationships, together with a system for evaluating the result. Inferring species relationships is a stringent test of the validity of the identification code. The mammalian phylogenetic tree and the parvovirus relationships inferred with the invention agree with biologists' taxonomic knowledge, which shows that the method is effective and that the identification code has high resolution. The identification code has strong discriminating power and a small data volume: a small set of numbers can identify a huge genome sequence, which simplifies the identification, comparison and analysis of biological sequences and is of great practical value.
Description
Technical field
The present invention uses bioinformatics methods to mine and integrate the information correlation features of sequences, and then uses numbers to identify biological sequences and species and to infer their relationships. It belongs to the application of informatics in the field of biology.
Background art
Biological sequences include amino acid sequences and nucleotide sequences; nucleotide sequences are further divided into deoxyribonucleic acid (DNA) sequences and ribonucleic acid (RNA) sequences. A DNA sequence is a polymer of four nucleotide monomers, adenylate (A), cytidylate (C), guanylate (G) and thymidylate (T), and is usually written as a string over these four letters. Similarly, an RNA sequence can be written as a string over the four letters A, C, G and U, where U (uridylate) replaces T. The whole-genome sequences of the species sequenced so far range in length from thousands to millions or even billions of letters.
Researchers have tried to extract data from biological sequences in order to identify them, and have applied the frequency features of genome-sequence oligomers (K-mers) to phylogenetic studies. For example, the composition vector method (CVTree)[1] of Hao Bailin, an academician of the Chinese Academy of Sciences, infers phylogenetic relationships from $20^5$ values, and the feature frequency profile method (FFP)[2] of the American scientist Kim et al. uses as many as $20^8$ values (far more than the amount of data in the genome itself) for evolutionary studies. These methods are easily limited by high dimensionality combined with a lack of samples, are unsuitable for small genomes or short sequences such as parvoviruses[3], and cannot identify an organism (sequence) with a small amount of data.
To make identifying organisms (sequences) and inferring their relationships more practical, we made a new attempt. Unlike methods based on K-mer frequency statistics, we study the information correlation features of a sequence (DNA or RNA) from the viewpoint of information theory, propose identifying a genome by its information correlation (IC) and partial information correlation (PIC), and further use them to infer the relationships of organisms (sequences).
Summary of the invention
The invention provides a method for identifying organisms (sequences) with numbers and illustrates its application to inferring the relationships of organisms (sequences). Below we introduce the calculation of the information correlation and the partial information correlation, the construction of the organism (sequence) identification code, and its application to the study of relationships.
A sequence in the present invention may be the complete genome sequence of an organism or a fragment of its genome sequence, and it may be a DNA sequence or an RNA sequence. The sequence data used in the invention are public resources that can be freely obtained and used through global public databases such as the US National Center for Biotechnology Information (NCBI) database, the European Molecular Biology Laboratory (EMBL) database and the DNA Data Bank of Japan (DDBJ).
To achieve the above purpose, the invention provides the following technical solution:
I. Information correlation and partial information correlation
Take a given DNA sequence as an example. The elements of the sequence are the bases A, G, C and T. Statistically, let $p_i$ be the probability that base $i$ ($i$ = A, G, C, T) occurs, $p_j$ the probability that base $j$ ($j$ = A, G, C, T) occurs, and $p_{i(k)j}$ the joint probability that bases $i$ and $j$ occur at two positions a distance $k$ apart. According to information theory, the base correlation information of the whole sequence is

$$D_{k+2} = -2\sum_{i} p_i \log_2 p_i + \sum_{ij} p_{i(k)j} \log_2 p_{i(k)j}, \qquad k = 0, 1, 2, \ldots \tag{1}$$

We call $D_{k+2}$ the information correlation (IC).
When $p_{i(k)j}$ is calculated for a real sequence, a periodic boundary condition can be introduced to avoid the edge effect caused by the finite length $N$: the first $k+1$ bases of the sequence are appended to its tail to form a sequence of length $N+k+1$, and the number of base pairs $i(k)j$ a distance $k$ apart, $N_{i(k)j}$, is then counted, which gives $p_{i(k)j} = N_{i(k)j}/N$. By Taylor expansion, formula (1) can be written as

$$\begin{aligned} D_{k+2} &= -2\sum_{i} p_i \log_2 p_i + \sum_{ij} \frac{N_{i(k)j}}{N} \log_2 \frac{N_{i(k)j}}{N} \\ &= \frac{1}{2\ln 2} \sum_{ij} p_i p_j \left( x_{i(k)j}^2 - \frac{1}{3} x_{i(k)j}^3 + \frac{1}{6} x_{i(k)j}^4 - \ldots \right) \end{aligned}$$

where $x_{i(k)j} = (N_{i(k)j} - N p_i p_j)/(N p_i p_j)$. When $N$ is large, the information correlation can be written as

$$D_{k+2} \cong \frac{1}{\ln 2} \sum_{ij} \frac{(p_{i(k)j} - p_i p_j)^2}{p_i p_j}$$

The summation runs over the 16 kinds of base correlation. To describe the correlation of a particular pair of bases, we introduce the partial information correlation (PIC):

$$F_{i(k)j} = (p_{i(k)j} - p_i p_j)^2 \tag{2}$$
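The calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `ic_and_pic` is hypothetical, the sequence is assumed to contain only the letters A, C, G and T, a pair "at distance k" is read as two positions with k bases between them (consistent with appending k+1 bases to the tail), and the periodic boundary is implemented with a modulo index, which is equivalent to appending the leading bases. The IC is computed directly from the entropy form of formula (1), and the PIC from formula (2).

```python
import math
from collections import Counter

BASES = "ACGT"

def ic_and_pic(seq, k):
    """Return (D_{k+2}, {pair: F_{i(k)j}}) for one DNA sequence.

    Assumption: the pair i(k)j consists of base i at position m and
    base j at position m + k + 1, i.e. k bases lie between them.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0 or set(seq) - set(BASES):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")

    # Single-base probabilities p_i.
    counts = Counter(seq)
    p = {b: counts[b] / n for b in BASES}

    # Periodic boundary: wrapping the index is equivalent to appending
    # the first k + 1 bases to the tail, so exactly N pairs are counted.
    pairs = Counter((seq[m], seq[(m + k + 1) % n]) for m in range(n))

    # Formula (1): D = -2 * sum p_i log2 p_i + sum p_ij log2 p_ij.
    ic = -2 * sum(q * math.log2(q) for q in p.values() if q > 0)
    pic = {}
    for i in BASES:
        for j in BASES:
            p_ij = pairs[(i, j)] / n
            if p_ij > 0:
                ic += p_ij * math.log2(p_ij)
            # Formula (2): F = (p_ij - p_i p_j)^2.
            pic[i + j] = (p_ij - p[i] * p[j]) ** 2
    return ic, pic

if __name__ == "__main__":
    ic, pic = ic_and_pic("ACGT" * 5, k=0)
    print(ic, pic["AC"], pic["AA"])
```

For the toy periodic sequence in the usage lines, each base fully determines its neighbour, so the IC equals the single-base entropy of 2 bits.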
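II. Organism (sequence) identification code
The organism (sequence) identification code can be expressed as a matrix, a vector or in other forms, for example:

$$\begin{bmatrix} F_{A(k_0)A} & F_{A(k_0)T} & \cdots & F_{G(k_0)G} & D_{k_0+2} \\ F_{A(k_0+1)A} & F_{A(k_0+1)T} & \cdots & F_{G(k_0+1)G} & D_{k_0+3} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ F_{A(k_0+d-1)A} & F_{A(k_0+d-1)T} & \cdots & F_{G(k_0+d-1)G} & D_{k_0+d+1} \end{bmatrix} \tag{3}$$

$$\begin{bmatrix} F_{A(k_0)A}/D_{k_0+2} & F_{A(k_0)T}/D_{k_0+2} & \cdots & F_{G(k_0)G}/D_{k_0+2} & D_{k_0+2} \\ F_{A(k_0+1)A}/D_{k_0+3} & F_{A(k_0+1)T}/D_{k_0+3} & \cdots & F_{G(k_0+1)G}/D_{k_0+3} & D_{k_0+3} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ F_{A(k_0+d-1)A}/D_{k_0+d+1} & F_{A(k_0+d-1)T}/D_{k_0+d+1} & \cdots & F_{G(k_0+d-1)G}/D_{k_0+d+1} & D_{k_0+d+1} \end{bmatrix} \tag{4}$$

$$[F_{A(k_0)A}, F_{A(k_0)T}, \ldots, F_{G(k_0)G}, D_{k_0+2}, F_{A(k_0+1)A}, \ldots, F_{G(k_0+1)G}, D_{k_0+3}, \ldots, D_{k_0+d+1}] \tag{5}$$

$$[F_{A(k_0)A}/D_{k_0+2}, \ldots, F_{G(k_0)G}/D_{k_0+2}, D_{k_0+2}, F_{A(k_0+1)A}/D_{k_0+3}, \ldots, D_{k_0+3}, \ldots, D_{k_0+d+1}] \tag{6}$$

Whatever the representation, the core elements of the organism (sequence) identification code are the information correlation and the 16 kinds of partial information correlation data. The parameter $k_0 = 0$, and the range of the parameter $d$ can be determined as needed, giving $d \times 17$ values in total. For ease of understanding, the organism (sequence) identification code is usually represented or stated in matrix form.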
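A code with d rows and 17 columns (16 partial information correlations followed by the information correlation, one row per value of k) could be assembled as in the sketch below. The names `code_row`, `identification_code` and `flatten` are hypothetical. The column order A, T, C, G is an assumption: the text only shows that a row starts with the pairs AA and AT and ends with GG. The option `normalize=True` divides each partial information correlation by the information correlation of the same row, which is one reading of the ratio form of the code.

```python
import math
from collections import Counter

BASE_ORDER = "ATCG"  # assumed column order: AA, AT, AC, AG, TA, ..., GG

def code_row(seq, k, normalize=False):
    """One row of the code: 16 PIC values followed by the IC D_{k+2}."""
    n = len(seq)
    p = {b: seq.count(b) / n for b in BASE_ORDER}
    # Periodic boundary via modulo indexing; k bases lie between the pair.
    pairs = Counter((seq[m], seq[(m + k + 1) % n]) for m in range(n))
    ic = -2 * sum(q * math.log2(q) for q in p.values() if q > 0)
    ic += sum((c / n) * math.log2(c / n) for c in pairs.values())
    pics = [(pairs[(i, j)] / n - p[i] * p[j]) ** 2
            for i in BASE_ORDER for j in BASE_ORDER]
    if normalize:
        if ic == 0:
            raise ValueError("IC is zero; the ratio F/D is undefined")
        pics = [f / ic for f in pics]
    return pics + [ic]

def identification_code(seq, d, k0=0, normalize=False):
    """Return the code as a list of d rows with 17 values each."""
    seq = seq.upper()
    if not seq or set(seq) - set(BASE_ORDER):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    return [code_row(seq, k, normalize) for k in range(k0, k0 + d)]

def flatten(code):
    """Vector form of the code: the rows concatenated in order."""
    return [value for row in code for value in row]

if __name__ == "__main__":
    code = identification_code("ACGT" * 5, d=20)
    print(len(code), len(code[0]), len(flatten(code)))
```

With d = 20 the sketch produces a matrix of 20 rows and 17 columns, i.e. 340 values in vector form.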
III. Reconstructing the relationships of organisms (sequences)
Reconstructing the relationships of organisms (sequences) is both an application of the identification code and a stringent test of its validity. To test the validity of the proposed organism (sequence) identification code, we use it to build a species phylogenetic tree in the following steps:
(1) Screening the information correlations
A partial information correlation with low discriminating power carries a weak phylogenetic signal and may even confound the evolutionary relationships. Therefore only the partial information correlations with stronger discriminating power are selected and combined with the information correlation as the parameters for calculating phylogenetic distances. First, build a vector X of the form $[D_{k_0+2}, D_{k_0+3}, \ldots, D_{k_0+d+1}]$ or $[F_{\alpha(k_0)\beta}, F_{\alpha(k_0+1)\beta}, \ldots, F_{\alpha(k_0+d-1)\beta}]$ ($\alpha, \beta \in \{A, G, C, T\}$). Perform analysis of variance (ANOVA) and a multiple comparison test (MCT) on the elements of X. In the multiple comparison, for a given species pair, X is considered to distinguish the pair successfully as long as any one of its elements passes the test. The ratio of the number of indistinguishable species pairs to the total number of species pairs, normalized to 100, is called the failure score of X and is denoted $W_X(k_0, d)$. A vector X with a low failure score corresponds to an information parameter with strong species specificity.
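The failure score itself is simple bookkeeping once the statistical tests have been run. The sketch below assumes that the ANOVA and multiple comparison are performed elsewhere and that their outcome is available as a mapping (hypothetical name `pairwise_results`) from each species pair to one boolean per element of X, true when that element separates the pair. The function name `failure_score` is likewise an illustrative assumption.

```python
def failure_score(pairwise_results):
    """Failure score of a vector X on a 0-100 scale.

    pairwise_results maps a species pair to a list of booleans, one per
    element of X; True means that element passed the multiple comparison
    for this pair. A pair counts as distinguished if any element passed.
    """
    total = len(pairwise_results)
    if total == 0:
        raise ValueError("at least one species pair is required")
    failed = sum(1 for passed in pairwise_results.values() if not any(passed))
    return 100.0 * failed / total

if __name__ == "__main__":
    results = {
        ("human", "chimp"): [False, False],  # no element separates the pair
        ("human", "mouse"): [True, False],
        ("chimp", "mouse"): [False, True],
        ("mouse", "rat"): [False, False],
    }
    print(failure_score(results))
```

In this made-up example two of the four pairs cannot be distinguished, so the score is 50.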
(2) Evaluating the relationships
There are many methods for evaluating the relationships of species (sequences), for example the probabilistic methods used to compare sample sequences in paternity testing, and methods that build a phylogenetic tree from a distance matrix.
Taking the tree-building approach as an example, the evolutionary distance D must satisfy the following three axioms:
(i) $D_{x,y} \ge 0$, with $D_{x,y} = 0$ if and only if $x = y$;
(ii) $D_{x,y} = D_{y,x}$ (symmetry);
(iii) $D_{x,z} \le D_{x,y} + D_{y,z}$ holds for any species x, y and z (triangle inequality).
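The three axioms can be checked numerically on a finished distance matrix. The sketch below is illustrative only: the function name `satisfies_distance_axioms` and the tolerance `tol` are assumptions, and the matrix is taken to be a square list of lists whose rows and columns refer to distinct species in the same order.

```python
def satisfies_distance_axioms(dist, tol=1e-12):
    """Check non-negativity/identity, symmetry and the triangle inequality."""
    n = len(dist)
    for x in range(n):
        # Axiom (i): the distance of a species to itself is zero.
        if abs(dist[x][x]) > tol:
            return False
        for y in range(n):
            # Axiom (i): distinct species must be a positive distance apart.
            if x != y and dist[x][y] <= tol:
                return False
            # Axiom (ii): symmetry.
            if abs(dist[x][y] - dist[y][x]) > tol:
                return False
            # Axiom (iii): triangle inequality through every third species.
            for z in range(n):
                if dist[x][z] > dist[x][y] + dist[y][z] + tol:
                    return False
    return True

if __name__ == "__main__":
    print(satisfies_distance_axioms([[0, 1, 2], [1, 0, 1], [2, 1, 0]]))
    print(satisfies_distance_axioms([[0, 1, 3], [1, 0, 1], [3, 1, 0]]))
```

The second matrix in the usage lines violates the triangle inequality, because 3 is greater than 1 + 1.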
The distance between base correlation matrices can be calculated with the Mahalanobis distance, the Euclidean distance or similar measures, and is taken as the evolutionary distance between species (sequences). For ease of understanding, take the Euclidean distance formula as an example, where M denotes a base correlation matrix, F a column vector of the matrix and D an element of a column vector. Once the distances between all pairs of species have been determined, the distance matrix is obtained and can then be used to draw the phylogenetic tree.
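As a sketch of this step, the Euclidean distance between two base correlation matrices can be taken element by element over all columns and rows. This particular form is an assumption, since the exact expression of the distance formula is not reproduced in the text. The names `euclidean_distance` and `distance_matrix` are hypothetical, and each matrix is a list of rows of equal shape.

```python
import math

def euclidean_distance(m_a, m_b):
    """Element-wise Euclidean distance between two equally shaped matrices."""
    if len(m_a) != len(m_b) or any(len(ra) != len(rb) for ra, rb in zip(m_a, m_b)):
        raise ValueError("matrices must have the same shape")
    return math.sqrt(sum((a - b) ** 2
                         for ra, rb in zip(m_a, m_b)
                         for a, b in zip(ra, rb)))

def distance_matrix(codes):
    """Pairwise distances for a dict mapping species name to its matrix.

    Returns the sorted species names and a square matrix in that order.
    """
    names = sorted(codes)
    dist = [[euclidean_distance(codes[x], codes[y]) for y in names] for x in names]
    return names, dist

if __name__ == "__main__":
    toy = {
        "species_a": [[0.0, 0.0], [0.0, 0.0]],
        "species_b": [[3.0, 0.0], [0.0, 4.0]],
    }
    names, dist = distance_matrix(toy)
    print(names, dist)
```

The resulting square matrix is the kind of input that tree-building programs based on distance methods expect.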
When calculating the evolutionary distance, we recommend using the ratio of the partial information correlation to the information correlation. Formulas (1) and (2) show that the information correlation and the partial information correlation are of the order of $10^{-3}$ and $10^{-6}$ respectively, so using the ratio as the parameter brings the column vectors to comparable orders of magnitude.
(3) Statistical test of the phylogenetic tree
A phylogenetic tree can generally be constructed with the neighbor-joining method (NJ) or the unweighted pair-group method with arithmetic mean (UPGMA). To test the robustness of the resulting tree, we propose a new test that is in effect the reverse of the bootstrap or jackknife test: build a base correlation matrix with Z rows and let d be a variable parameter ranging from 0 to Z-1. Each value of d yields one tree. The optimal range of d is determined by comparing the generated trees with the taxonomic information of the species. Finally, the trees within the optimal range of d are merged into a consensus tree, which is the final phylogenetic tree. Every branch of the resulting phylogenetic tree carries a statistical value, called the bootstrap value. The wider the range of d values, the more stable the resulting phylogenetic tree.
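The branch statistic described above can be read as a count of how many of the trees built for different values of d contain a given branch. The sketch below illustrates only that counting step and does not build trees or a consensus tree. It assumes each tree has already been reduced to the set of its clades, each clade being a `frozenset` of leaf names; the name `clade_support` and the toy species are hypothetical.

```python
from collections import Counter

def clade_support(trees):
    """Count in how many trees each clade (set of leaves) appears."""
    support = Counter()
    for clades in trees:
        # set() guards against a clade being listed twice in one tree.
        support.update(set(clades))
    return support

if __name__ == "__main__":
    # One tree per value of d; here three toy trees.
    trees = [
        {frozenset({"human", "chimp"}), frozenset({"mouse", "rat"})},
        {frozenset({"human", "chimp"})},
        {frozenset({"human", "mouse"})},
    ]
    support = clade_support(trees)
    print(support[frozenset({"human", "chimp"})])
    # With d stepping from 9 to 249 in steps of 10 there are 25 trees,
    # so no clade can be supported more than 25 times.
    print(len(range(9, 250, 10)))
```

The maximum possible support equals the number of d values tried, which is why a scan from 9 to 249 in steps of 10 gives at most 25.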
Effect of the invention: the organism (sequence) identification code proposed by the invention is based on base correlation information and has d × 17 parameters (the range of d is determined as needed). The data volume is very small (typically less than 200 × 17), which also makes visualization easier. The evolutionary relationships of organisms (sequences) inferred from these data yield phylogenetic trees that are consistent with existing biological classification, which shows that the proposed identification code has strong discriminating power. It is a simpler and practically valuable method for identifying sequences and classifying species.
Brief description of the drawings
Fig. 1 is the sequence of the type 1 AIDS virus with ID gi|2745742|.
Fig. 2 is the identification code of the type 1 AIDS virus with ID gi|2745742|.
Fig. 3 is the phylogenetic tree obtained by identifying and classifying the 36 mammalian sample species selected in the invention. Here d increases from 9 to 249 in steps of 10, so the maximum statistical support a branch can obtain is 25. The evolutionary distances between species are based on the differences in $F_{A(k)T}/D_{k+2}$, $F_{T(k)A}/D_{k+2}$, $F_{T(k)T}/D_{k+2}$, $F_{T(k)G}/D_{k+2}$, $F_{G(k)T}/D_{k+2}$ and $D_{k+2}$.
Fig. 4 is the phylogenetic tree obtained by identifying and classifying the 32 parvoviruses selected in the invention. Here d increases from 9 to 199 in steps of 10, so the maximum statistical support a branch can obtain is 20. The evolutionary distances are based on the differences in $F_{A(k)G}/D_{k+2}$, $F_{G(k)A}/D_{k+2}$, $F_{T(k)C}/D_{k+2}$, $F_{C(k)T}/D_{k+2}$, $F_{T(k)G}/D_{k+2}$, $F_{G(k)T}/D_{k+2}$ and $D_{k+2}$.
Embodiment
Embodiments of the invention are described in detail below with reference to the drawings and examples, to illustrate the validity of the proposed organism (sequence) identification code.
Embodiment 1
As an example, an identification code is compiled for the type 1 AIDS virus with ID gi|2745742|. The genome sequence of the virus consists of 9290 bases. Because this is a large amount of data, only part of the genome (980 bases) is shown in Fig. 1 to give an intuitive impression.
An identification code with 20 rows and 17 columns is compiled for the type 1 AIDS virus according to the identification method of the invention, as shown in Fig. 2; the code is represented in the form of formula (4).
Embodiment 2: Reconstructing the well-known mammalian phylogenetic tree
The partial information correlations with strong species specificity are screened with statistical tools: (i) taking 36 mammals as sample species, 100 sequences of length 1 kb are randomly selected from the genome of each sample species as sample sequences; (ii) the information correlation and partial information correlation of the sample sequences and of the mitochondrial genome sequences are calculated for k from 0 to 248; (iii) vectors with 50 different starting points $k_0$ and maximum dimension d = 8 are built, and analysis of variance and multiple comparison are performed.
Table 1 presents the results of the above statistics. It lists the average failure score of vector X for d = 2, 4, 6 and 8, where X stands for the information correlation or a partial information correlation. The average is taken over the failure scores $W_X(k_0, d)$ of the d-dimensional vectors corresponding to 50 random $k_0$. The score is normalized to 100 and represents the proportion of all species pairs that vector X cannot distinguish in the multiple comparison; its variance is given in brackets.
Table 1 shows that the average failure score of vector X decreases as d increases, and that the partial information correlations differentiate more markedly from the averaged effect (the information correlation). When d ≥ 6, the partial information correlations $F_{A(k)T}$, $F_{T(k)A}$, $F_{T(k)T}$, $F_{T(k)G}$ and $F_{G(k)T}$ have better discriminating power than the information correlation ($D_{k+2}$). Therefore the matrix formed by these five partial information correlations together with $D_{k+2}$ is used to calculate phylogenetic distances.
The phylogenetic distances are then calculated with formula (7) and fed into tree-building software. The resulting tree, shown in Fig. 3, agrees well with known biological classification in the following respects. First, the primates (Primate) form a monophyletic branch. Second, the Muridae (mouse and rat) and the non-Muridae (squirrel, rabbit, guinea pig and dormouse) each form a monophyletic branch; this result provides new support for the hypothesis that the rodents (Rodent) are not monophyletic. Third, the Ferungulata form a monophyletic branch whose detailed branching is consistent with biologists' conclusions. Finally, the monotremes (echidna and platypus) and the marsupials (opossum and wallaroo) each cluster as a pair, and the two pairs lie close to each other.
Table 1. Multiple comparison test results of vector X for mammals
Embodiment 3: Building the parvovirus (Parvoviruses) phylogenetic tree
The species specificity of the partial information correlations is tested with statistical tools: (i) all 32 viral genomes are taken as sample genomes, and 50 sequences of length 1 kb are randomly selected from each sample genome as sample sequences; (ii) the information correlation and partial information correlation of the sample sequences and of the whole genomes are calculated for k from 0 to 198; (iii) vectors X with 50 different starting points $k_0$ and maximum dimension d = 10 are built for analysis of variance and multiple comparison.
The results of the multiple comparison are shown in Table 2, which lists the average failure score of vector X for d = 4, 6, 8 and 10, where X stands for the information correlation or a partial information correlation. The average is taken over the failure scores $W_X(k_0, d)$ of the d-dimensional identification codes corresponding to 50 random $k_0$. The score is normalized to 100 and represents the proportion of all species pairs that vector X cannot distinguish in the multiple comparison; its variance is given in brackets.
Table 2 shows that the average failure score of vector X decreases as d increases. When d = 4, the average failure scores of all identification codes are greater than 6; when d = 10, the average failure scores of most vectors are less than 2. Among them, the partial information correlations $F_{A(k)G}$, $F_{T(k)C}$, $F_{T(k)G}$, $F_{C(k)T}$, $F_{G(k)A}$ and $F_{G(k)T}$ have markedly lower average failure scores (< 0.9). These results show that these six partial information correlations have stronger genome specificity and are suitable for calculating evolutionary distances between species.
Table 2. Statistical test results of the parvovirus base correlation vector X
Next, the evolutionary distances are calculated and the phylogenetic tree is drawn. The resulting tree, shown in Fig. 4, agrees with biologists' classification: the subfamily Parvovirinae, which infects vertebrates, and the subfamily Densovirinae, which infects invertebrates, are completely separated. Furthermore, each genus that contains multiple species clusters into its own clade with high bootstrap values, which indicates that the corresponding branching is stable. The genera Amdovirus (Aleutian disease virus) and Bocavirus cluster together; each of them contains only one virus. The genus Pefudensovirus is close to the genus Densovirus in evolutionary distance. The Dependovirus (dependoviruses) branch shows host-virus linkage: AAAVa, GPV, AAAVd and MlDPV cluster together, and their hosts are birds; BAAV and BPV-2, which infect bovids, cluster together; and the parvoviruses that infect primates cluster together. This branching structure reflects the host-virus coevolution history of the dependoviruses[4], whereas the dynamical language model method (DL) of Yu et al. cannot reflect the host-virus branching structure of the dependoviruses. The tree we built agrees with biologists' taxonomic knowledge both overall and in detail, which shows that our method has high resolution.
In the above embodiments, the major and minor branches of the phylogenetic trees all agree with biologists' understanding. This shows that the organism identification code (digital signature) proposed by the invention and its method of application are effective and can be used to identify genomes and infer species relationships.
Claims (5)
1. A method for identifying biological sequences with numbers and inferring species relationships, the method specifically comprising:
I. Information correlation and partial information correlation
Taking a given DNA sequence as an example, the elements of the sequence are the bases A, G, C and T; statistically, the probability that base i occurs is $p_i$, where i = A, G, C, T; the probability that base j occurs is $p_j$, where j = A, G, C, T; the joint probability that the base pair i(k)j at distance k occurs is denoted $p_{i(k)j}$; according to information theory, the base correlation information of the whole sequence is:

$$D_{k+2} = -2\sum_{i} p_i \log_2 p_i + \sum_{ij} p_{i(k)j} \log_2 p_{i(k)j}, \qquad (k = 0, 1, 2, \ldots) \tag{1}$$
$D_{k+2}$ is called the information correlation;
When $p_{i(k)j}$ is calculated, a periodic boundary condition can be introduced to avoid the edge effect caused by the finite length N: the first k+1 bases of the sequence are appended to the tail of the sequence to form a sequence of length N+k+1, and the number of occurrences of the base pair i(k)j at distance k is denoted $N_{i(k)j}$, which gives the joint probability $p_{i(k)j}$, where $p_{i(k)j} = N_{i(k)j}/N$; by Taylor expansion, formula (1) can be written as

$$\begin{aligned} D_{k+2} &= -2\sum_{i} p_i \log_2 p_i + \sum_{ij} \frac{N_{i(k)j}}{N} \log_2 \frac{N_{i(k)j}}{N} \\ &= \frac{1}{2\ln 2} \sum_{ij} p_i p_j \left( x_{i(k)j}^2 - \frac{1}{3} x_{i(k)j}^3 + \frac{1}{6} x_{i(k)j}^4 - \ldots \right) \end{aligned}$$
where $x_{i(k)j} = (N_{i(k)j} - N p_i p_j)/(N p_i p_j)$; when N is large, the information correlation can be written as

$$D_{k+2} \cong \frac{1}{\ln 2} \sum_{ij} \frac{(p_{i(k)j} - p_i p_j)^2}{p_i p_j}$$
Summation, for description particular bases association, introduces inclined information association (PIC) across 16 kinds of Correlation of Bases
Fi(k)j=(pi(k)j-pipj)2 (2)
II biological sequence identification codes
Biological sequence identification code can be expressed as matrix, vector:
<mrow>
<mfenced open = "[" close = "]">
<mtable>
<mtr>
<mtd>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>)</mo>
</mrow>
<mi>A</mi>
</mrow>
</msub>
</mtd>
<mtd>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>)</mo>
</mrow>
<mi>T</mi>
</mrow>
</msub>
</mtd>
<mtd>
<mo>...</mo>
</mtd>
<mtd>
<msub>
<mi>F</mi>
<mrow>
<mi>G</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>)</mo>
</mrow>
<mi>G</mi>
</mrow>
</msub>
</mtd>
<mtd>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>2</mn>
</mrow>
</msub>
</mtd>
</mtr>
<mtr>
<mtd>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>A</mi>
</mrow>
</msub>
</mtd>
<mtd>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>T</mi>
</mrow>
</msub>
</mtd>
<mtd>
<mo>...</mo>
</mtd>
<mtd>
<msub>
<mi>F</mi>
<mrow>
<mi>G</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>G</mi>
</mrow>
</msub>
</mtd>
<mtd>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>3</mn>
</mrow>
</msub>
</mtd>
</mtr>
<mtr>
<mtd>
<mtable>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
</mtable>
</mtd>
<mtd>
<mtable>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
</mtable>
</mtd>
<mtd>
<mtable>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
</mtable>
</mtd>
<mtd>
<mtable>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
</mtable>
</mtd>
<mtd>
<mtable>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
</mtable>
</mtd>
</mtr>
<mtr>
<mtd>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>A</mi>
</mrow>
</msub>
</mtd>
<mtd>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>T</mi>
</mrow>
</msub>
</mtd>
<mtd>
<mo>...</mo>
</mtd>
<mtd>
<msub>
<mi>F</mi>
<mrow>
<mi>G</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>G</mi>
</mrow>
</msub>
</mtd>
<mtd>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>+</mo>
<mn>1</mn>
</mrow>
</msub>
</mtd>
</mtr>
</mtable>
</mfenced>
<mo>...</mo>
<mo>...</mo>
<mrow>
<mo>(</mo>
<mn>3</mn>
<mo>)</mo>
</mrow>
</mrow>
<mrow>
<mfenced open = "[" close = "]">
<mtable>
<mtr>
<mtd>
<mrow>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>)</mo>
</mrow>
<mi>A</mi>
</mrow>
</msub>
<mo>/</mo>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>2</mn>
</mrow>
</msub>
</mrow>
</mtd>
<mtd>
<mrow>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>)</mo>
</mrow>
<mi>T</mi>
</mrow>
</msub>
<mo>/</mo>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>2</mn>
</mrow>
</msub>
</mrow>
</mtd>
<mtd>
<mo>...</mo>
</mtd>
<mtd>
<mrow>
<msub>
<mi>F</mi>
<mrow>
<mi>G</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>)</mo>
</mrow>
<mi>G</mi>
</mrow>
</msub>
<mo>/</mo>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>2</mn>
</mrow>
</msub>
</mrow>
</mtd>
<mtd>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>2</mn>
</mrow>
</msub>
</mtd>
</mtr>
<mtr>
<mtd>
<mrow>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>A</mi>
</mrow>
</msub>
<mo>/</mo>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>3</mn>
</mrow>
</msub>
</mrow>
</mtd>
<mtd>
<mrow>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>T</mi>
</mrow>
</msub>
<mo>/</mo>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>3</mn>
</mrow>
</msub>
</mrow>
</mtd>
<mtd>
<mo>...</mo>
</mtd>
<mtd>
<mrow>
<msub>
<mi>F</mi>
<mrow>
<mi>G</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>G</mi>
</mrow>
</msub>
<mo>/</mo>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>3</mn>
</mrow>
</msub>
</mrow>
</mtd>
<mtd>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mn>3</mn>
</mrow>
</msub>
</mtd>
</mtr>
<mtr>
<mtd>
<mtable>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
</mtable>
</mtd>
<mtd>
<mtable>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
</mtable>
</mtd>
<mtd>
<mtable>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
</mtable>
</mtd>
<mtd>
<mtable>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
</mtable>
</mtd>
<mtd>
<mtable>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
<mtr>
<mtd>
<mo>.</mo>
</mtd>
</mtr>
</mtable>
</mtd>
</mtr>
<mtr>
<mtd>
<mrow>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>A</mi>
</mrow>
</msub>
<mo>/</mo>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>+</mo>
<mn>1</mn>
</mrow>
</msub>
</mrow>
</mtd>
<mtd>
<mrow>
<msub>
<mi>F</mi>
<mrow>
<mi>A</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>T</mi>
</mrow>
</msub>
<mo>/</mo>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>+</mo>
<mn>1</mn>
</mrow>
</msub>
</mrow>
</mtd>
<mtd>
<mo>...</mo>
</mtd>
<mtd>
<mrow>
<msub>
<mi>F</mi>
<mrow>
<mi>G</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>-</mo>
<mn>1</mn>
<mo>)</mo>
</mrow>
<mi>G</mi>
</mrow>
</msub>
<mo>/</mo>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>+</mo>
<mn>1</mn>
</mrow>
</msub>
</mrow>
</mtd>
<mtd>
<msub>
<mi>D</mi>
<mrow>
<msub>
<mi>k</mi>
<mn>0</mn>
</msub>
<mo>+</mo>
<mi>d</mi>
<mo>+</mo>
<mn>1</mn>
</mrow>
</msub>
</mtd>
</mtr>
</mtable>
</mfenced>
<mo>...</mo>
<mo>...</mo>
<mrow>
<mo>(</mo>
<mn>4</mn>
<mo>)</mo>
</mrow>
</mrow>
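$$[F_{A(k_0)A}, F_{A(k_0)T}, \ldots, F_{G(k_0)G}, D_{k_0+2}, F_{A(k_0+1)A}, \ldots, F_{G(k_0+1)G}, D_{k_0+3}, \ldots, D_{k_0+d+1}] \qquad (5)$$
$$[F_{A(k_0)A}/D_{k_0+2}, \ldots, F_{G(k_0)G}/D_{k_0+2}, D_{k_0+2}, F_{A(k_0+1)A}/D_{k_0+3}, \ldots, D_{k_0+3}, \ldots, D_{k_0+d+1}] \qquad (6)$$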
Whatever the form of representation, the core elements of the biological sequence identification code are the information correlation and the 16 kinds of partial information correlation data. The parameter $k_0 = 0$, and the range of the parameter d can be chosen as needed, giving d × 17 data values in total. The biological sequence identification code is represented or stated in matrix form;
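The matrix and vector layouts described above can be sketched as follows. This is a minimal illustration only: the function names `build_code_matrix` and `flatten_code` and the input layout (16 partial information correlation values plus one information correlation value per row) are assumptions, and the computation of F and D themselves, which follows formulas (1) and (2), is not shown.

```python
def build_code_matrix(F, D):
    """Return a d x 17 matrix: 16 ratios F/D followed by D in each row.

    F: list of d rows, each holding 16 partial information correlation
       values, one per ordered base pair (hypothetical input layout).
    D: list of d information correlation values, one per row.
    """
    if len(F) != len(D):
        raise ValueError("F and D must have the same number of rows")
    matrix = []
    for f_row, d_val in zip(F, D):
        if len(f_row) != 16:
            raise ValueError("each row of F needs 16 values")
        matrix.append([f / d_val for f in f_row] + [d_val])
    return matrix


def flatten_code(matrix):
    """Concatenate the rows of the matrix into a single vector, as in form (6)."""
    return [value for row in matrix for value in row]


# Toy numbers chosen only to show the shapes; not real sequence data.
F = [[1e-6 * (i + 1)] * 16 for i in range(3)]
D = [1e-3, 2e-3, 4e-3]
code = build_code_matrix(F, D)
vector = flatten_code(code)
print(len(code), len(code[0]), len(vector))
```

With d = 3 rows the sketch yields 3 × 17 = 51 values, matching the d × 17 count stated in the text.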
III. Reconstructing the phylogenetic relationships of biological sequences
A species phylogenetic tree is built in the following steps:
(1) Screening the information correlations
Build a vector X of the form $[D_{k_0+2}, D_{k_0+3}, \ldots, D_{k_0+d+1}]$ or $[F_{\alpha(k_0)\beta}, F_{\alpha(k_0+1)\beta}, \ldots, F_{\alpha(k_0+d-1)\beta}]$, where α, β ∈ {A, G, C, T}. Perform analysis of variance (ANOVA) and a multiple comparison test (MCT) on the elements of X. In the multiple comparison, for a given species pair, vector X is considered to distinguish that pair successfully as long as any one element of X passes the multiple comparison test. The proportion of indistinguishable species pairs among all species pairs, normalized to 100, is called the failure score of vector X and is denoted $W_X(k_0, d)$. A vector X with a low failure score corresponds to an information parameter with strong species specificity;
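The failure score defined above can be sketched as below. The sketch assumes the ANOVA and multiple comparison test have already been run elsewhere; the function name `failure_score`, the input dictionary, and the species labels are hypothetical and serve only to illustrate the counting rule.

```python
def failure_score(passed):
    """Share of species pairs that X fails to distinguish, scaled to 100.

    passed: dict mapping a species pair to a list of booleans, one per
            element of X, True where that element passed the multiple
            comparison test (hypothetical input; the test is not run here).
    """
    if not passed:
        raise ValueError("need at least one species pair")
    # A pair counts as distinguished if any element of X passes the test.
    failures = sum(1 for flags in passed.values() if not any(flags))
    return 100.0 * failures / len(passed)


# Toy example: three species pairs, a three-element vector X.
results = {
    ("human", "mouse"): [False, True, False],
    ("human", "chimp"): [False, False, False],
    ("mouse", "chimp"): [True, True, False],
}
score = failure_score(results)
print(score)
```

Here one pair out of three is not distinguished by any element, so the score is one third of 100.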
(2) Evaluating phylogenetic relationships
The evolutionary distance D must satisfy the following three axioms:
(i) $D_{x,y} \ge 0$, and $D_{x,y} = 0$ if and only if x = y;
(ii) $D_{x,y} = D_{y,x}$;
(iii) for any species x, y and z, $D_{x,z} \le D_{x,y} + D_{y,z}$ always holds;
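The three axioms can be checked numerically on a candidate distance matrix. The following is a minimal sketch; the function name `satisfies_distance_axioms` and the tolerance parameter are assumptions, and distinct indices are assumed to stand for distinct species.

```python
from itertools import permutations


def satisfies_distance_axioms(dist, tol=1e-12):
    """Check axioms (i)-(iii) on a square distance matrix given as nested lists.

    Distinct indices are assumed to be distinct species, so an off-diagonal
    zero (or negative) entry is treated as a violation of axiom (i).
    """
    n = len(dist)
    for x in range(n):
        if abs(dist[x][x]) > tol:                   # D(x, x) must be 0
            return False
        for y in range(n):
            if x != y and dist[x][y] <= tol:        # D(x, y) > 0 for x != y
                return False
            if abs(dist[x][y] - dist[y][x]) > tol:  # symmetry
                return False
    for x, y, z in permutations(range(n), 3):       # triangle inequality
        if dist[x][z] > dist[x][y] + dist[y][z] + tol:
            return False
    return True


good = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]   # 5 > 1 + 1 breaks axiom (iii)
print(satisfies_distance_axioms(good), satisfies_distance_axioms(bad))
```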
The distance between base correlation matrices can be calculated with the Mahalanobis distance or Euclidean distance algorithm and is taken as the evolutionary distance between species sequences. Using the Euclidean distance formula
$$D_{x,y} = \sqrt{\sum_{j=1}^{f} \sum_{i=1}^{d} \left( M_x(i,j) - M_y(i,j) \right)^2} \qquad (7)$$
where M denotes the base correlation matrix, f is the number of column vectors of the matrix, and d is the number of elements in each column vector. Once the distances between all pairs of species have been determined, the distance matrix is obtained and is then used to draw the phylogenetic tree;
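The Euclidean distance of formula (7) and the resulting pairwise distance matrix can be sketched as follows. The function names `matrix_distance` and `distance_matrix`, the dictionary input, and the toy matrices are assumptions made for illustration; real input would be the base correlation matrices of the species under study.

```python
import math


def matrix_distance(mx, my):
    """Euclidean distance of formula (7) between two equally shaped matrices."""
    if len(mx) != len(my) or any(len(a) != len(b) for a, b in zip(mx, my)):
        raise ValueError("matrices must have the same shape")
    total = 0.0
    for row_x, row_y in zip(mx, my):
        for a, b in zip(row_x, row_y):
            total += (a - b) ** 2
    return math.sqrt(total)


def distance_matrix(codes):
    """Pairwise distances for a dict {species name: matrix}.

    Returns the sorted species names and a symmetric matrix of distances.
    """
    names = sorted(codes)
    dist = [[0.0] * len(names) for _ in names]
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            dist[i][j] = dist[j][i] = matrix_distance(codes[a], codes[names[j]])
    return names, dist


# Toy matrices only; not derived from real sequences.
codes = {
    "sp1": [[0.0, 0.0], [0.0, 0.0]],
    "sp2": [[3.0, 0.0], [0.0, 4.0]],
    "sp3": [[3.0, 0.0], [0.0, 0.0]],
}
names, dist = distance_matrix(codes)
print(names, dist)
```

The symmetric matrix returned here is the kind of input that NJ or UPGMA tree-building routines expect.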
The ratio of partial information correlation to information correlation is used when calculating the evolutionary distance because, as formulas (1) and (2) show, the orders of magnitude of the information correlation and the partial information correlation are ~10^-3 and ~10^-6 respectively. Using the ratio of partial information correlation to information correlation as the parameter makes the orders of magnitude of the column vectors comparable;
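As a worked order-of-magnitude check, using only the magnitudes quoted above (the index k is a generic row index introduced for illustration):

```latex
\frac{F_{\alpha(k)\beta}}{D_{k+2}} \sim \frac{10^{-6}}{10^{-3}} = 10^{-3},
\qquad D_{k+2} \sim 10^{-3}
```

Under this estimate the 16 ratio columns and the D column are all of order 10^-3, so no single column is expected to dominate the squared differences summed in the Euclidean distance.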
(3) Statistical testing of the phylogenetic tree
The phylogenetic tree can be constructed with the commonly used neighbor-joining method (Neighbor-Joining, abbreviated NJ) or the unweighted pair-group method with arithmetic mean (abbreviated UPGMA). To test the robustness of the generated tree, a new test equivalent to a reverse bootstrap or jackknife method is proposed: build a base correlation matrix of Z rows and let d be a variable parameter from 0 to Z-1. Each given value of d yields one tree. The optimal range of d values is determined by comparing the generated trees with species taxonomic information. Finally, the trees within the optimal d range are integrated into a consensus tree, which is the final phylogenetic tree. Each branch of the phylogenetic tree thus obtained has a statistical value, called the bootstrap value. The larger the range of d values, the more stable the resulting phylogenetic tree.
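The robustness test described above can be sketched as below, under several loudly stated assumptions: the code for a given d is taken to be the first d rows of each species' matrix, UPGMA is used rather than NJ, and the branch statistic is computed as the percentage of per-d trees that contain a given clade. The function names, the toy matrices, and the choice of d values are hypothetical; the comparison with taxonomic information that selects the optimal d range is not shown.

```python
import math
from collections import Counter


def euclid(mx, my):
    """Euclidean distance between two equally shaped matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for rx, ry in zip(mx, my)
                         for a, b in zip(rx, ry)))


def upgma_clades(names, dist):
    """Run UPGMA; return the non-trivial clades as frozensets of names."""
    clusters = {i: frozenset([n]) for i, n in enumerate(names)}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    clades, next_id = set(), len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)                  # closest pair of clusters
        merged = clusters[i] | clusters[j]
        size_i, size_j = len(clusters[i]), len(clusters[j])
        new_d = {}
        for k in clusters:
            if k in (i, j):
                continue
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # Size-weighted mean gives the unweighted average over members.
            new_d[(k, next_id)] = (size_i * dik + size_j * djk) / (size_i + size_j)
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        d.update(new_d)
        del clusters[i], clusters[j]
        clusters[next_id] = merged
        if len(merged) < len(names):              # skip the root clade
            clades.add(merged)
        next_id += 1
    return clades


def branch_support(codes, d_values):
    """Percentage of the per-d trees that contain each clade."""
    names = sorted(codes)
    counts = Counter()
    for d_rows in d_values:
        # Assumption: the code for a given d uses the first d rows.
        cut = {n: codes[n][:d_rows] for n in names}
        dist = [[euclid(cut[a], cut[b]) for b in names] for a in names]
        counts.update(upgma_clades(names, dist))
    return {clade: 100.0 * c / len(d_values) for clade, c in counts.items()}


# Toy matrices (3 rows, 1 column); not derived from real sequences.
codes = {
    "sp1": [[0.0], [0.0], [0.0]],
    "sp2": [[0.1], [1.0], [1.0]],
    "sp3": [[0.5], [0.0], [0.1]],
}
support = branch_support(codes, [1, 2, 3])
for clade, pct in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), round(pct, 1))
```

In this toy case the grouping of sp1 with sp3 appears in two of the three trees and the grouping of sp1 with sp2 in one, which illustrates how a branch that persists across many d values receives a higher statistic.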
2. The method for identifying biological sequences with digital codes and inferring species phylogenetic relationships according to claim 1, wherein the sequence may be the complete genome sequence of an organism or a fragment of an organism's genome sequence.
3. The method for identifying biological sequences with digital codes and inferring species phylogenetic relationships according to claim 1 or 2, wherein the sequence is selected from public sequence resources.
4. The method for identifying biological sequences with digital codes and inferring species phylogenetic relationships according to claim 3, wherein the public sequence resource is selected from the US National Center for Biotechnology Information (NCBI) database, the European Molecular Biology Laboratory (EMBL) database, the DNA Data Bank of Japan (DDBJ), or any public database from which the sequences of biological species can be obtained.
5. The method for identifying biological sequences with digital codes and inferring species phylogenetic relationships according to claim 1, wherein the scope of biological sequences is extended to non-public database resources.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310557139.1A CN103559427B (en) | 2013-11-12 | 2013-11-12 | A kind of use Digital ID biological sequence and the method for inferring species affiliation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103559427A (en) | 2014-02-05 |
CN103559427B (en) | 2017-10-31 |
Family
ID=50013673
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201310557139.1A | Active | CN103559427B (en) | 2013-11-12 | 2013-11-12 | A kind of use Digital ID biological sequence and the method for inferring species affiliation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103559427B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105512512B (en) * | 2015-11-24 | 2019-03-29 | 潍坊医学院 | The method that amino acid carries out species taxonomy apart from polymorphism comparison protein sequence |
CN105447341B (en) * | 2015-11-24 | 2018-10-16 | 潍坊医学院 | Mononucleotide compares the method that nucleic acid sequence carries out species taxonomy apart from polymorphism |
CN109937426A (en) * | 2016-04-11 | 2019-06-25 | 量子生物有限公司 | System and method for biological data management |
CN109273046B (en) * | 2018-10-19 | 2022-04-22 | 江苏东南证据科学研究院有限公司 | Biological whole sibling identification method based on probability statistical model |
WO2022087839A1 (en) * | 2020-10-27 | 2022-05-05 | 深圳华大基因股份有限公司 | Non-invasive prenatal genetic testing data-based kinship determining method and apparatus |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1810972A (en) * | 2005-12-15 | 2006-08-02 | 中国水产科学研究院黄海水产研究所 | Lefteye flounder disease resistance related MHC gene marker and subsidiary breeding method |
CN101392293A (en) * | 2008-09-25 | 2009-03-25 | 上海交通大学 | Molecular marker method of turnip mosaic virus resistance gene in non-heading Chinese cabbage |
CN101812520A (en) * | 2010-03-30 | 2010-08-25 | 浙江大学 | Molecular marker method based on microRNA |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20130068185A (en) * | 2011-12-14 | 2013-06-26 | 한국전자통신연구원 | Genome sequence mapping device and genome sequence mapping method thereof |
Non-Patent Citations (1)
Title |
---|
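Application of the base correlation matrix method in the study of phylogenetic relationships of DNA viruses (碱基关联矩阵法在DNA病毒亲缘关系研究中的应用); Gao Yang (高扬); Wanfang Dissertations (《万方学位论文》); 2012-11-30; main text, pp. 6-30 * |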
Also Published As
Publication number | Publication date |
---|---|
CN103559427A (en) | 2014-02-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103559427B (en) | A kind of use Digital ID biological sequence and the method for inferring species affiliation | |
Whelan et al. | Molecular phylogenetics: state-of-the-art methods for looking into the past | |
Comin et al. | Alignment-free phylogeny of whole genomes using underlying subwords | |
Liu et al. | miRNA-dis: microRNA precursor identification based on distance structure status pairs | |
Saha et al. | Computational approaches and tools used in identification of dispersed repetitive DNA sequences | |
Brierley et al. | Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning | |
Lebatteux et al. | Toward an alignment-free method for feature extraction and accurate classification of viral sequences | |
Kojima et al. | Virus-like insertions with sequence signatures similar to those of endogenous nonretroviral RNA viruses in the human genome | |
Fiscon et al. | MISSEL: a method to identify a large number of small species-specific genomic subsequences and its application to viruses classification | |
CN113936737B (en) | Method for comparing RNA structures based on RNA motif vectors, family clustering method, method for evaluating allosteric effect, method for functional annotation, system and equipment | |
Kucherov et al. | Estimating seed sensitivity on homogeneous alignments | |
Liu et al. | Diversity of sweepoviruses infecting sweet potato in China | |
Nourani et al. | Computational prediction of virus–human protein–protein interactions using embedding kernelized heterogeneous data | |
Wei et al. | DBH: a de Bruijn graph-based heuristic method for clustering large-scale 16S rRNA sequences into OTUs | |
Abouelhoda et al. | String mining in bioinformatics | |
Muflikhah et al. | Profiling DNA sequence of SARS-Cov-2 virus using machine learning algorithm | |
Wang et al. | Effect of k-tuple length on sample-comparison with high-throughput sequencing data | |
Wang et al. | MRPGA: motif detecting by modified random projection strategy and genetic algorithm | |
Niu et al. | SgRNA-RF: identification of SgRNA on-target activity with imbalanced datasets | |
Zhang et al. | A heuristic cluster-based em algorithm for the planted (l, d) problem | |
Sharma et al. | An experimental comparison of PMSprune and other algorithms for motif search | |
Siswantining et al. | Collaboration and implementation of self organizing maps (SOM) partitioning algorithm in HOPACH clustering method | |
Dai et al. | Study of LZ-word distribution and its application for sequence comparison | |
Xu et al. | m5U-GEPred: prediction of RNA 5-methyluridine sites based on sequence-derived and graph embedding features | |
Schliep et al. | Decoding non-unique oligonucleotide hybridization experiments of targets related by a phylogenetic tree |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||