CN106778078A

CN106778078A - DNA sequence similarity comparison method based on the Kendall correlation coefficient

Info

Publication number: CN106778078A
Application number: CN201611186639.9A
Authority: CN
Inventors: 林劼; 林丽玉; 江育娥
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2016-12-20
Filing date: 2016-12-20
Publication date: 2017-05-31
Anticipated expiration: 2036-12-20
Also published as: CN106778078B

Abstract

The present invention discloses a DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps: 1) obtain N DNA sequences to be compared; 2) choose a word length k, extract the k-words of each DNA sequence with a sliding window, and combine them into a corresponding vector; 3) using the k-words obtained in step 2), count the number of times each k-word occurs in a DNA sequence to obtain the frequency of each k-word, denoted x_i; the k-word frequencies of a DNA sequence form the vector X = {x_i}; 4) combine the k-word vectors of the N DNA sequences in pairs, giving N(N-1)/2 combinations, and denote the k-word frequency vectors of each combination x, y; 5) for the k-word frequency vectors x, y of each combination, calculate the corresponding Kendall correlation coefficient; 6) build the N×N similarity coefficient matrix of the N DNA sequences to obtain the similarity and the evolutionary relationship graph of the DNA sequences. The present invention improves the quality of DNA sequence similarity comparison, reduces computational complexity, and shortens running time.

Description

DNA sequence similarity comparison method based on the Kendall correlation coefficient

Technical field

The present invention relates to the field of computing and bioinformatics processing, and more particularly to a DNA sequence similarity comparison method based on the Kendall correlation coefficient.

Background art

The central task of bioinformatics is to extract conceptual knowledge from the vast ocean of DNA sequence data. The task facing bioinformaticians is not only to devise efficient means of data storage but also to develop effective data analysis tools. Only with new, effective data analysis tools can DNA sequence information be converted into biological knowledge, the structural and functional information it contains be worked out, and the biological significance it represents be fully understood.

The theoretical basis of DNA sequence comparison is the theory of evolution: if two DNA sequences are sufficiently similar, it is inferred that they may share a common evolutionary ancestor and have diverged through genetic variation processes such as residue substitution, deletion of residues or sequence fragments, and sequence recombination. DNA sequence similarity and DNA sequence homology are different concepts: the degree of similarity between DNA sequences is a quantifiable parameter, whereas whether DNA sequences are homologous must be verified against evolutionary facts. In practice, DNA sequence comparison uses a specific mathematical model or algorithm to find the maximum number of matching bases between two or more DNA sequences.

Huang Yumei, Wang Tianming et al. used the frequency and position information of the k-words in a DNA sequence to construct a probability distribution and used these distributions to represent the distance between two vectors; the smaller the value, the closer the species. Vinga and Almeida proposed DNA sequence comparison methods based on word frequency: counting, with a sliding window, the number of occurrences of all words of length k yields a k-word count or frequency vector, so that a DNA sequence is mapped to a vector in a high-dimensional Euclidean space and the similarity comparison between DNA sequences is converted into a comparison between vectors.

Pairwise DNA sequence comparison compares two DNA sequences with a specific algorithm in order to obtain the match of maximum similarity between them. The Kendall correlation coefficient is widely used for correlation prediction in time series, hydrological series, water quality series and the like, but it has not been used for DNA sequence similarity matching.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art by providing a DNA sequence similarity comparison method based on the Kendall correlation coefficient, which builds an N×N similarity coefficient matrix over N DNA sequences to obtain the evolutionary relationship of the N DNA sequences, while improving both the quality of DNA sequence similarity comparison and the computational efficiency.

The technical solution adopted by the present invention is:

A DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps:

1) Obtain N DNA sequences to be compared;

2) Choose a word length k, extract the k-words of each DNA sequence with a sliding window, and combine them into a corresponding vector;

3) Using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence, that is, compute the frequency vector of the k-words in the DNA sequence, whose entries are denoted x_i;

4) Combine the k-word vectors of the N DNA sequences in pairs, giving N(N-1)/2 combinations; the two vectors of each combination are denoted X = {x_i}, Y = {y_i};

5) For the k-word frequency vectors X = {x_i}, Y = {y_i} of each combination, calculate the corresponding Kendall correlation coefficient;

6) Build the N×N correlation coefficient matrix of the N DNA sequences to obtain the similarity information and the evolutionary relationship graph of the DNA sequences.

Further, in step 2), the k-word frequency vector for word length k is taken for each DNA sequence.
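
The sliding-window extraction in steps 2) and 3) can be sketched as follows. This is a minimal Python illustration, not the patent's implementation: the function names `kword_counts` and `kword_vector` are hypothetical, the window is assumed to advance one base at a time, and the k-words are listed in the order produced by `itertools.product`, which differs from the order used in the embodiment (any fixed order works as long as every sequence uses the same one).

```python
from collections import Counter
from itertools import product


def kword_counts(seq, k):
    """Count each length-k word seen by a window sliding one base at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kword_vector(seq, k, alphabet="ATGC"):
    """Return (words, counts) over all len(alphabet)**k possible k-words."""
    counts = kword_counts(seq, k)
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    # Counter returns 0 for k-words that never occur in seq.
    return words, [counts[w] for w in words]


# Fragment A from the embodiment, with k = 2.
words, x = kword_vector("ATAACTA", 2)
print({w: c for w, c in zip(words, x) if c})
# AT=1, TA=2, AA=1, AC=1, CT=1; the other 11 two-letter words have count 0
```

A sequence of length n yields n - k + 1 windows, so the counts for fragment A (length 7, k = 2) sum to 6.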

Further, in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences can be obtained as follows:

a) The k-words of a DNA sequence A to be compared, where the length of A is n, are obtained by the following formula:

F_{k}^{A} = (f(W_{k,1}^{A}), f(W_{k,2}^{A}), \ldots, f(W_{k,n}^{A}))

b) The frequency of each k-word is calculated as x_i = {number of times the i-th k-word W_{k,i}^{A} occurs in DNA sequence A};

c) For the combined vectors X, Y, the Kendall correlation coefficient τ is calculated from t_x, t_y and T, where t_x is the number of concordant pairs in {x_i}, {y_i}, t_y is the number of discordant pairs in {x_i, y_i}, and T is the total number of distinct k-words in {x_i, y_i};

d) t_x and t_y in step c) are obtained as follows: for two k-words i and j, if (x_i - x_j) × (y_i - y_j) is positive (the two differences have the same sign), the pair is counted as a concordant pair in t_x; if it is negative (the two differences have opposite signs), the pair is counted as a discordant pair in t_y.
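
The counting of concordant and discordant pairs can be sketched in Python. The formula for τ itself is not reproduced in this text, so the sketch assumes the standard Kendall tau-a form, τ = (t_x - t_y) / (T(T-1)/2), with the usual sign test on (x_i - x_j)(y_i - y_j); the function name `kendall_tau` is hypothetical, and tied pairs are assumed to count toward neither total.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Return (concordant, discordant, tau) using the tau-a definition."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1      # both counts move in the same direction
        elif sign < 0:
            discordant += 1      # the counts move in opposite directions
        # sign == 0 is a tie and is counted in neither total
    total_pairs = len(x) * (len(x) - 1) // 2
    return concordant, discordant, (concordant - discordant) / total_pairs


# Fragments A = ATAACTA and B = ACAACTTA with k = 2, restricted to the
# 7 k-words that occur in A or B, in the order AT, TA, AA, AC, CT, CA, TT.
x = [1, 2, 1, 1, 1, 0, 0]
y = [0, 1, 1, 2, 1, 1, 1]
print(kendall_tau(x, y))  # (3, 3, 0.0)
```

k-word count vectors contain many equal entries, so ties are common; whether a tie-corrected variant such as tau-b is intended is not stated in the source.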

The Kendall correlation coefficient τ obtained takes a value in [-1, 1]. The closer τ is to 1, the stronger the correlation between the two DNA sequences; a value of -1 indicates that the two DNA sequences are negatively correlated; a value close to 0 indicates that the two DNA sequences are not correlated.

An N×N Kendall correlation coefficient matrix is built. This matrix is symmetric with every diagonal value equal to 1; it gives the pairwise affinity information of the N DNA sequences, from which the evolutionary relationship of the N DNA sequences is constructed.
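
Because the matrix is symmetric with a fixed diagonal, only the N(N-1)/2 upper-triangle entries need to be computed. The following Python sketch illustrates this; `similarity_matrix` is a hypothetical helper that accepts any pairwise scoring function, and `toy_similarity` is only a placeholder standing in for the Kendall coefficient so that the block runs on its own.

```python
from itertools import combinations


def similarity_matrix(vectors, similarity):
    """Fill a symmetric N x N matrix from N(N-1)/2 pairwise evaluations."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # diagonal is fixed at 1, as in the text
    for i, j in combinations(range(n), 2):
        # Compute each upper-triangle entry once and mirror it.
        matrix[i][j] = matrix[j][i] = similarity(vectors[i], vectors[j])
    return matrix


def toy_similarity(x, y):
    """Placeholder score: fraction of positions where the counts agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


vectors = [[1, 2, 1], [1, 2, 0], [0, 0, 1]]  # made-up frequency vectors
m = similarity_matrix(vectors, toy_similarity)
for row in m:
    print(row)
```

For the 20 sequences of the embodiment this amounts to 190 pairwise evaluations instead of 400.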

The DNA sequence similarity comparison method of the present invention based on the Kendall correlation coefficient obtains the k-word frequency vectors of the DNA sequences to be analyzed with a sliding window, combines the k-word vectors of the N DNA sequences in pairs, and computes the Kendall correlation coefficient of the k-word frequency vectors of each pair, so that similarity detection can be performed on multiple DNA sequences and the results effectively reflect the evolutionary relationships between the DNA sequences. The method is concise: only one symmetric matrix needs to be built, whose main diagonal (top left to bottom right) is all 1s, which reduces computational complexity and improves computational efficiency. Used as the feature value describing DNA sequence similarity, the Kendall coefficient achieves good accuracy.

Brief description of the drawings

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a schematic flow chart of the DNA sequence similarity comparison method of the present invention based on the Kendall correlation coefficient;

Fig. 2 is the evolutionary relationship graph of the DNA sequences obtained with the DNA sequence similarity comparison method of the present invention based on the Kendall correlation coefficient.

Detailed description of the embodiments

As shown in Fig. 1 and Fig. 2, the method of the present invention is further elaborated below using the DNA coding sequences of 20 species as the object of analysis. As shown in Fig. 1, the DNA sequence similarity comparison method of this embodiment based on the Kendall correlation coefficient comprises the following steps:

1) The DNA coding sequences of 20 species are selected as the initial DNA sequences; the names and lengths of the DNA sequences of the 20 species are shown in Table 1;

Species name	DNA sequence length (bp)
baboon	16522
bluewhale	16403
cat	17010
common_chimpanzee	16564
cow	16339
fin_whale	16399
gibbon	16473
gorilla	16365
grayseal	16798
harborseal	16827
horse	16661
human	16570
mouse	16296
opossum	17085
orangutan	16390
pigmy_chimpanzee	16555
platypus	17020
rat	16301
wallaroo	16897
whiterhinoceros	16833

Table 1: Species DNA sequence information

2) The k-words of the initial DNA sequences from step 1) are obtained and combined to give the k-word frequency vector of each initial DNA sequence (see Vinga, S., Almeida, J.S. Alignment-free sequence comparison: a review [J]. Bioinformatics, 2003: 513-523). The characteristic of this approach is that a sliding window is used to find how often each short DNA sequence of length k appears in the DNA sequence under test. For the 4 DNA bases {A, T, G, C}, taking k = 2 gives 4² = 16 possible k-words, and k = 3 gives 4³ = 64. For example, for the DNA sequence fragment under test A = ATAACTA, with the k-words W₂ = {AT, TA, AA, TT, AG, GA, AC, CA, CT, ...}, the frequency vector is {1, 2, 1, 0, 0, 0, 1, 0, 1, 0, ...}; for the DNA sequence fragment under test B = ACAACTTA, the k-word frequency vector is {0, 1, 1, 1, 0, 0, 2, 1, 1, 0, ...};

3) For N DNA sequences, N k-word frequency vectors are obtained; combining them in pairs gives N(N-1)/2 combinations, and the frequency vectors of each combination are denoted X, Y;

4) The Kendall correlation coefficient τ is calculated from t_x, t_y and T, where t_x is the number of concordant pairs between {x_i, y_i} and the other k-word frequencies, t_y is the number of discordant pairs between {x_i, y_i} and the other k-word frequencies, and T is the total number of distinct k-words in {x_i, y_i}; for the DNA sequence fragments A and B in step 2), the total number of k-words is T = 7;

5) t_x and t_y in step 4) are obtained as follows: for two k-words i and j, if (x_i - x_j) × (y_i - y_j) is positive (the two differences have the same sign), the pair is counted as a concordant pair in t_x; if it is negative (the two differences have opposite signs), the pair is counted as a discordant pair in t_y;

6) The N×N Kendall correlation coefficient matrix is built. This matrix is symmetric with diagonal values of 1, so it can be reduced to an upper triangular matrix. Because similarity and distance are negatively related, before building the evolutionary relationship graph we negate the similarity values to convert them into distances, and build the evolutionary relationship graph from these distances; see Fig. 2.
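
The conversion in step 6) and a possible way of turning the distances into a tree can be sketched as follows. The text does not say which tree-building algorithm produced Fig. 2, so the average-linkage (UPGMA-style) clustering here is an assumption, as are the function names `to_distance` and `average_linkage`; the three species names and similarity values in the example are made up for illustration.

```python
def to_distance(similarity):
    """Negate every similarity so that closer pairs get smaller values."""
    return [[-s for s in row] for row in similarity]


def average_linkage(names, dist):
    """Repeatedly merge the two clusters with the smallest mean distance."""
    clusters = {i: (name, [i]) for i, name in enumerate(names)}
    next_id = len(names)
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                members_a, members_b = clusters[a][1], clusters[b][1]
                d = sum(dist[i][j] for i in members_a for j in members_b)
                d /= len(members_a) * len(members_b)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        # The merged cluster is a nested tuple of the two subtrees.
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


names = ["human", "chimp", "mouse"]       # illustrative labels
similarity = [[1.0, 0.9, 0.2],            # illustrative values only
              [0.9, 1.0, 0.3],
              [0.2, 0.3, 1.0]]
print(average_linkage(names, to_distance(similarity)))
# ('mouse', ('human', 'chimp'))
```

Negation yields non-positive "distances"; that is harmless for a method that only compares values, but an algorithm that requires non-negative distances would need a shifted form such as 1 - τ, which is not what the text describes.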

Analysis of results: by computing the Pearson correlation coefficient between the computed similarity and the edit distance, we found that the correlation between the DNA sequence similarity calculated with the Kendall coefficient and the edit distance is -0.94. This shows that the DNA sequence similarity calculated with the method of the present invention is highly accurate and fast to compute, and that the method can serve as a very effective alternative to the edit distance.
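
The validation described above can be sketched in Python. The sketch assumes the edit distance is the Levenshtein distance and that the Pearson coefficient is taken over the list of pairwise values; the function names `edit_distance` and `pearson` are hypothetical, and no data from the experiment is reproduced here.

```python
from math import sqrt


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


# Fragments A and B from the embodiment: one substitution plus one insertion.
print(edit_distance("ATAACTA", "ACAACTTA"))  # 2
```

A strongly negative coefficient is the expected direction, since a higher similarity should go with a smaller edit distance. The dynamic program costs time proportional to the product of the two sequence lengths, which is what makes edit distance expensive for sequences of roughly 16,000 to 17,000 bases such as those in Table 1.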

The foregoing is only an embodiment of the present invention and does not thereby limit the scope of the claims of the present invention; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect use in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A DNA sequence similarity comparison method based on the Kendall correlation coefficient, characterized in that it comprises the following steps:

1) obtain N DNA sequences to be compared;

2) choose a word length k, extract the k-words of each DNA sequence with a sliding window, and combine them into a corresponding vector;

3) using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence, that is, compute the frequency vector of the k-words in the DNA sequence, whose entries are denoted x_i;

4) combine the k-word vectors of the N DNA sequences in pairs, giving N(N-1)/2 combinations; the two vectors of each combination are denoted X = {x_i}, Y = {y_i};

5) for the k-word frequency vectors X = {x_i}, Y = {y_i} of each combination, calculate the corresponding Kendall correlation coefficient;

6) build the N×N correlation coefficient matrix of the N DNA sequences to obtain the similarity information and the evolutionary relationship graph of the DNA sequences.

2. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that in step 2), the k-word frequency vector for word length k is taken for each DNA sequence.

3. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences is obtained as follows:

a) the k-words of a DNA sequence A to be compared, where the length of A is n, are obtained by the following formula:

F_{k}^{A} = (f(W_{k,1}^{A}), f(W_{k,2}^{A}), \ldots, f(W_{k,n}^{A}))

b) the frequency of each k-word is calculated as x_i = {number of times the i-th k-word W_{k,i}^{A} occurs in DNA sequence A};

c) for the combined vectors X, Y, the Kendall correlation coefficient τ is calculated from t_x, t_y and T, where t_x is the number of concordant pairs in {x_i}, {y_i}, t_y is the number of discordant pairs in {x_i, y_i}, and T is the total number of distinct k-words in {x_i, y_i};

d) t_x and t_y in step c) are obtained as follows: for two k-words i and j, if (x_i - x_j) × (y_i - y_j) is positive (the two differences have the same sign), the pair is counted as a concordant pair in t_x; if it is negative (the two differences have opposite signs), the pair is counted as a discordant pair in t_y.

4. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that the Kendall correlation coefficient τ obtained takes a value in [-1, 1]; the closer τ is to 1, the stronger the correlation between the two DNA sequences; a value of -1 indicates that the two DNA sequences are negatively correlated; a value close to 0 indicates that the two DNA sequences are not correlated.

5. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that in step 6) an N×N Kendall correlation coefficient matrix is built; this matrix is symmetric with every diagonal value equal to 1, and it gives the pairwise affinity information of the N DNA sequences, from which the evolutionary relationship of the N DNA sequences is constructed.