CN106778078A - DNA sequence similarity comparison method based on the Kendall correlation coefficient - Google Patents

DNA sequence similarity comparison method based on the Kendall correlation coefficient

Info

Publication number
CN106778078A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611186639.9A
Other languages
Chinese (zh)
Other versions
CN106778078B (en)
Inventor
林劼
林丽玉
江育娥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Normal University
Priority to CN201611186639.9A
Publication of CN106778078A
Application granted
Publication of CN106778078B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention discloses a DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps: 1) obtain N DNA sequences to be compared; 2) choose a word length k, extract the k-words of each DNA sequence with a sliding window, and assemble them into a corresponding vector; 3) using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence to obtain the k-word frequency vector, where the frequency of the i-th k-word is denoted $x_i$ and the set of all k-word frequencies of a DNA sequence is denoted $X = \{x_i\}$; 4) combine the k-word vectors of the N DNA sequences in pairs, giving $C_N^2 = N(N-1)/2$ combinations, with the two k-word frequency vectors of each combination denoted x, y; 5) for the k-word frequency vectors x, y of each combination, calculate the corresponding Kendall correlation coefficient; 6) build the N×N similarity coefficient matrix of the N DNA sequences to obtain the similarity and the evolutionary relationship graph of the DNA sequences. The invention improves the quality of DNA sequence similarity comparison, reduces computational complexity, and shortens running time.

Description

DNA sequence similarity comparison method based on the Kendall correlation coefficient
Technical field
The present invention relates to the field of computing and bioinformatics processing, and in particular to a DNA sequence similarity comparison method based on the Kendall correlation coefficient.
Background
The central task of bioinformatics is to extract meaningful knowledge from the vast body of DNA sequence data. Bioinformaticians must not only provide efficient means of data storage but also develop effective data analysis tools. Only with new, effective data analysis tools can DNA sequence information be converted into biological knowledge, the structural and functional information it contains be worked out, and the biological significance it represents be fully understood.
The theoretical basis of DNA sequence comparison is the theory of evolution: if two DNA sequences are sufficiently similar, it is inferred that they may share a common evolutionary ancestor and have diverged through processes of genetic variation such as residue substitution, deletion of residues or sequence fragments, and sequence recombination. Sequence similarity and sequence homology are different concepts: the degree of similarity between DNA sequences is a quantifiable parameter, whereas whether DNA sequences are homologous must be verified by evolutionary facts. In practice, DNA sequence comparison uses a specific mathematical model or algorithm to find the maximum number of matching bases between two or more DNA sequences.
Huang Yujuan, Wang Tianming et al. used the frequency and position information of the k-words in a DNA sequence to construct a probability distribution; this distribution is used to express the distance between two vectors, and the smaller the value, the closer the species. Vinga and Almeida proposed a DNA sequence comparison method based on word frequencies: a sliding window is used to count the occurrences of all words of length k, giving a k-word count or frequency vector, so that a DNA sequence is mapped to a vector in a high-dimensional Euclidean space and similarity comparison between DNA sequences is converted into comparison between vectors.
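As a rough illustration of the sliding-window word counting described above, the following Python sketch maps a sequence to a k-word count vector. The function name `kword_counts`, the alphabet order "ATGC" and the example input are assumptions made for illustration; they are not taken from the cited work.

```python
from collections import Counter
from itertools import product


def kword_counts(seq, k, alphabet="ATGC"):
    """Return (words, counts) for all len(alphabet)**k possible k-words.

    A window of width k slides over seq one base at a time, so a sequence
    of length n contributes n - k + 1 overlapping words.
    """
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed word order (an assumption) so that vectors of different
    # sequences are comparable position by position.
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    return words, [observed[w] for w in words]


words, counts = kword_counts("ATAACTA", 2)
print(len(words))                 # number of possible 2-words
print(counts[words.index("TA")])  # occurrences of "TA"
```

Dividing each count by the number of windows would give relative frequencies instead of raw counts; either form yields a fixed-length vector per sequence.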
Pairwise DNA sequence comparison compares two DNA sequences with a specific algorithm to obtain the match of maximum similarity between them. The Kendall correlation coefficient is widely used for correlation prediction on time series, hydrological series, water quality series and the like, but it has not been used for DNA sequence similarity matching.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a DNA sequence similarity comparison method based on the Kendall correlation coefficient, which builds an N×N similarity coefficient matrix for N DNA sequences to obtain their evolutionary relationships, while improving both the quality of DNA sequence similarity comparison and the computational efficiency.
The technical solution adopted by the present invention is:
A DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps:
1) obtain N DNA sequences to be compared;
2) choose a word length k, extract the k-words of each DNA sequence with a sliding window, and assemble them into a corresponding vector;
3) using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence, i.e. calculate the frequency vector of the k-words in the DNA sequence, with entries denoted $x_i$;
4) combine the k-word vectors of the N DNA sequences in pairs, giving $C_N^2 = N(N-1)/2$ combinations, with the two vectors of each combination denoted $X = \{x_i\}$, $Y = \{y_i\}$;
5) for the k-word frequency vectors $x_i$, $y_i$ of each combination, calculate the corresponding Kendall correlation coefficient;
6) build the N×N correlation coefficient matrix of the N DNA sequences to obtain the similarity information and the evolutionary relationship graph of the DNA sequences.
Further, in step 2), the frequency vector of the words of length k is taken for each DNA sequence.
Further, in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences can be obtained as follows:
a) obtain the k-words of the DNA sequence A to be compared by the following formula, where the length of DNA sequence A is n: $F_K^A = (f(W_{k,1}^A), f(W_{k,2}^A), \ldots, f(W_{k,n}^A))$
b) calculate the frequency with which each k-word occurs: $x_i$ = the number of times the i-th k-word $W_{k,i}^A$ occurs in DNA sequence A;
c) for each combined pair of vectors X, Y, calculate the Kendall correlation coefficient $\tau$ by the following formula (the formula is missing from the source text), where $t_x$ is the number of concordant pairs in $\{x_i\}$, $\{y_i\}$, $t_y$ is the number of discordant pairs in $\{x_i, y_i\}$, and T is the total number of distinct k-words in $\{x_i, y_i\}$.
d) $t_x$ and $t_y$ in step c) are obtained as follows: for two positions i and j, if the differences $(x_i - x_j)$ and $(y_i - y_j)$ have the same sign, i.e. $(x_i - x_j)(y_i - y_j) > 0$, the pair is called concordant and is counted in $t_x$; if they have opposite signs, i.e. $(x_i - x_j)(y_i - y_j) < 0$, the pair is called discordant and is counted in $t_y$.
The Kendall correlation coefficient $\tau$ obtained takes values in [-1, 1]. The closer $\tau$ is to 1, the stronger the correlation between the two DNA sequences; $\tau = -1$ indicates that the two DNA sequences are negatively correlated; $\tau$ close to 0 indicates that the two DNA sequences are not correlated.
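A minimal Python sketch of the concordant/discordant pair counting is given below. It assumes the standard Kendall tau-a normalization, (concordant - discordant) / (n(n-1)/2), which may differ from the exact formula of the patent; the function name `kendall_tau_a` is hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a of two equal-length vectors (assumed normalization).

    A pair of positions (i, j) is concordant when (x_i - x_j) and
    (y_i - y_j) have the same sign and discordant when the signs differ.
    Tied pairs (product equal to zero) count as neither.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


print(kendall_tau_a([1, 2, 3, 4], [1, 2, 3, 4]))  # identical ranking
print(kendall_tau_a([1, 2, 3, 4], [4, 3, 2, 1]))  # reversed ranking
```

Because k-word frequency vectors usually contain many repeated values, a tie-corrected variant (tau-b) may be preferable in practice; tau-a is used here only because it is the simplest form of the pair counting.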
An N×N Kendall correlation coefficient matrix is built. The matrix is symmetric, with 1 on the diagonal, and gives the pairwise relatedness of the N DNA sequences, from which their evolutionary relationships are constructed.
In the DNA sequence similarity comparison method of the present invention, the k-word frequency vector of each DNA sequence to be analyzed is obtained with a sliding window, the k-word vectors of the N DNA sequences are combined in pairs, and the Kendall correlation coefficient is computed on the k-word frequency vectors of each pair, so that similarity can be detected across multiple DNA sequences and the results effectively reflect the evolutionary relationships between them. The method is concise: only one symmetric matrix needs to be built, whose main diagonal (upper left to lower right) is 1, which reduces computational complexity and improves computational efficiency. Used as a feature describing DNA sequence similarity, the Kendall coefficient achieves good accuracy.
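The construction of the symmetric matrix can be sketched as follows. This is an illustration under assumptions, not the patented implementation: it uses the tie-corrected Kendall tau-b so that a vector compared with itself gives exactly 1, and the names `kendall_tau_b` and `similarity_matrix` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Tie-corrected Kendall coefficient (tau-b) of two equal-length vectors."""
    conc = disc = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        ties_x += dx == 0
        ties_y += dy == 0
        if dx * dy > 0:
            conc += 1
        elif dx * dy < 0:
            disc += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # Returning 0.0 for a constant vector is a convention chosen here.
    return (conc - disc) / denom if denom else 0.0


def similarity_matrix(vectors):
    """N x N symmetric matrix; only the N(N-1)/2 upper entries are computed."""
    n = len(vectors)
    matrix = [[1.0] * n for _ in range(n)]  # diagonal stays at 1
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = kendall_tau_b(vectors[i], vectors[j])
    return matrix


vectors = [[1, 2, 1, 0], [0, 1, 1, 1], [2, 0, 1, 3]]  # toy frequency vectors
for row in similarity_matrix(vectors):
    print(row)
```

Computing only the pairs with i < j and mirroring them is what makes the symmetric structure cheaper than filling all N×N entries independently.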
Brief description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments;
Fig. 1 is a flow chart of the DNA sequence similarity comparison method of the present invention based on the Kendall correlation coefficient;
Fig. 2 is the evolutionary relationship graph of the DNA sequences obtained with the DNA sequence similarity comparison method of the present invention based on the Kendall correlation coefficient.
Detailed description of the embodiments
With reference to Fig. 1 and Fig. 2, the method of the present invention is further described using the DNA sequences of 20 species as the objects of analysis. As shown in Fig. 1, the DNA sequence similarity comparison method of this embodiment based on the Kendall correlation coefficient comprises the following steps:
1) the DNA sequences of 20 species are selected as the initial DNA sequences; their names and lengths are listed in Table 1;
Species name        DNA sequence length
baboon              16522
bluewhale           16403
cat                 17010
common_chimpanzee   16564
cow                 16339
fin_whale           16399
gibbon              16473
gorilla             16365
grayseal            16798
harborseal          16827
horse               16661
human               16570
mouse               16296
opossum             17085
orangutan           16390
pigmy_chimpanzee    16555
platypus            17020
rat                 16301
wallaroo            16897
whiterhinoceros     16833
Table 1: Species DNA sequence information
2) for each initial DNA sequence of step 1), obtain its k-words and combine them to obtain the k-word frequency vector of the sequence (see Vinga, S., Almeida, J.S. Alignment-free sequence comparison - a review [J]. Bioinformatics, 2003: 513-523). The method uses a sliding window to find how often each short sequence of length k appears in the DNA sequence under test. For the 4 DNA bases {A, T, G, C}, k = 2 gives $4^2 = 16$ possible k-words and k = 3 gives $4^3 = 64$. For example, for the DNA sequence fragment A = ATAACTA with 2-words $W_2$ = {AT, TA, AA, TT, AG, GA, AC, CA, CT, ...}, the frequency vector is {1, 2, 1, 0, 0, 0, 1, 0, 1, 0, ...}; for the DNA sequence fragment B = ACAACTTA, the k-word frequency vector is {0, 1, 1, 1, 0, 0, 2, 1, 1, 0, ...};
3) for N DNA sequences, N k-word frequency vectors are obtained; combining them in pairs gives $C_N^2 = N(N-1)/2$ combinations, with the frequency vectors of each combination denoted X, Y;
4) calculate the Kendall correlation coefficient $\tau$ by the following formula (the formula is missing from the source text), where $t_x$ is the number of concordant pairs between $\{x_i, y_i\}$ and the other k-word frequencies, $t_y$ is the number of discordant pairs between $\{x_i, y_i\}$ and the other k-word frequencies, and T is the total number of distinct k-words in $\{x_i, y_i\}$; for the DNA sequence fragments A and B of step 2), the total number of k-words is T = 7;
5) $t_x$ and $t_y$ in step 4) are obtained as follows: for two positions i and j, if $(x_i - x_j)(y_i - y_j) > 0$ (the two differences have the same sign), the pair is called concordant and is counted in $t_x$; if $(x_i - x_j)(y_i - y_j) < 0$ (opposite signs), the pair is called discordant and is counted in $t_y$;
6) build the N×N Kendall correlation coefficient matrix. The matrix is symmetric with diagonal values of 1, so it can be reduced to an upper triangular matrix. Because similarity and distance are negatively related, the similarity values are negated to convert them into distances before the evolutionary relationship graph is built from them; see Fig. 2.
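The worked example of step 2) and the similarity-to-distance conversion of step 6) can be checked with a short Python sketch. The word order follows the nine 2-words listed in the example; the helper names `count_words` and `to_distance` are hypothetical, and the plain negation is one reading of "taking the opposite number".

```python
# Word order as listed in the example of step 2); the remaining seven
# 2-words do not occur in either fragment.
WORDS = ["AT", "TA", "AA", "TT", "AG", "GA", "AC", "CA", "CT"]


def count_words(seq, words, k=2):
    """Count each word in the overlapping length-k windows of seq."""
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [windows.count(w) for w in words]


def to_distance(similarity):
    """Negate every similarity value so that larger means farther apart."""
    return [[-s for s in row] for row in similarity]


x = count_words("ATAACTA", WORDS)
y = count_words("ACAACTTA", WORDS)
print(x)  # [1, 2, 1, 0, 0, 0, 1, 0, 1]
print(y)  # [0, 1, 1, 1, 0, 0, 2, 1, 1]

# Number of distinct 2-words that occur in at least one fragment.
distinct = sum(1 for a, b in zip(x, y) if a or b)
print(distinct)  # 7
```

Tree-building tools often expect non-negative distances; in that case a shifted form such as 1 - similarity could be substituted, which is an assumption beyond the text.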
Analysis of results: the Pearson correlation coefficient between the computed similarity and the edit distance was calculated, and the correlation between the DNA sequence similarity computed with the Kendall coefficient and the edit distance was found to be -0.94. This shows that the DNA sequence similarity computed by the method of the invention is highly accurate and fast to compute, making it a very effective alternative to edit distance.
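A validation of this kind could be sketched as below: compute the edit (Levenshtein) distance for each sequence pair and correlate it with the corresponding similarity values. The function names and the three similarity and distance values are purely hypothetical toy numbers, not data from the patent.

```python
from math import sqrt


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


# Hypothetical values for three sequence pairs, for illustration only.
similarities = [0.9, 0.5, 0.1]
distances = [1, 4, 8]
print(pearson(similarities, distances))  # negative: similar pairs are close
```

A strongly negative coefficient is the expected outcome of such a check, since a high similarity should go together with a small edit distance.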
The foregoing is only an embodiment of the present invention and does not limit the scope of its claims; any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect use in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims (5)

1. A DNA sequence similarity comparison method based on the Kendall correlation coefficient, characterized in that it comprises the following steps:
1) obtain N DNA sequences to be compared;
2) choose a word length k, extract the k-words of each DNA sequence with a sliding window, and assemble them into a corresponding vector;
3) using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence, i.e. calculate the frequency vector of the k-words in the DNA sequence, with entries denoted $x_i$;
4) combine the k-word vectors of the N DNA sequences in pairs, giving $C_N^2 = N(N-1)/2$ combinations, with the two vectors of each combination denoted $X = \{x_i\}$, $Y = \{y_i\}$;
5) for the k-word frequency vectors $x_i$, $y_i$ of each combination, calculate the corresponding Kendall correlation coefficient;
6) build the N×N correlation coefficient matrix of the N DNA sequences to obtain the similarity information and the evolutionary relationship graph of the DNA sequences.
2. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: in step 2), the frequency vector of the words of length k is taken for each DNA sequence.
3. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences is obtained as follows:
a) obtain the k-words of the DNA sequence A to be compared by the following formula, where the length of DNA sequence A is n:
$F_K^A = (f(W_{k,1}^A), f(W_{k,2}^A), \ldots, f(W_{k,n}^A))$
b) calculate the frequency with which each k-word occurs: $x_i$ = the number of times the i-th k-word $W_{k,i}^A$ occurs in DNA sequence A;
c) for each combined pair of vectors X, Y, calculate the Kendall correlation coefficient $\tau$ by the following formula (the formula is missing from the source text), where $t_x$ is the number of concordant pairs in $\{x_i\}$, $\{y_i\}$, $t_y$ is the number of discordant pairs in $\{x_i, y_i\}$, and T is the total number of distinct k-words in $\{x_i, y_i\}$;
d) $t_x$ and $t_y$ in step c) are obtained as follows: for two positions i and j, if $(x_i - x_j)(y_i - y_j) > 0$ (the two differences have the same sign), the pair is called concordant and is counted in $t_x$; if $(x_i - x_j)(y_i - y_j) < 0$ (opposite signs), the pair is called discordant and is counted in $t_y$.
4. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: the Kendall correlation coefficient $\tau$ obtained takes values in [-1, 1]; the closer $\tau$ is to 1, the stronger the correlation between the two DNA sequences; $\tau = -1$ indicates that the two DNA sequences are negatively correlated; $\tau$ close to 0 indicates that the two DNA sequences are not correlated.
5. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: in step 6), an N×N Kendall correlation coefficient matrix is built; the matrix is symmetric, with 1 on the diagonal, and gives the pairwise relatedness of the N DNA sequences, from which their evolutionary relationships are constructed.
CN201611186639.9A 2016-12-20 2016-12-20 DNA sequence similarity comparison method based on the Kendall correlation coefficient Expired - Fee Related CN106778078B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611186639.9A CN106778078B (en) 2016-12-20 2016-12-20 DNA sequence similarity comparison method based on the Kendall correlation coefficient

Publications (2)

Publication Number Publication Date
CN106778078A (en) 2017-05-31
CN106778078B (en) 2019-04-09

Family

ID=58896076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611186639.9A Expired - Fee Related CN106778078B (en) 2016-12-20 2016-12-20 DNA sequence similarity comparison method based on the Kendall correlation coefficient

Country Status (1)

Country Link
CN (1) CN106778078B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846262A (en) * 2018-05-31 2018-11-20 广西大学 Method for constructing a phylogenetic tree from DFT-based RNA secondary structure distances

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040101846A1 (en) * 2002-11-22 2004-05-27 Collins Patrick J. Methods for identifying suitable nucleic acid probe sequences for use in nucleic acid arrays
CN102732609A (en) * 2011-04-08 2012-10-17 博奥生物有限公司 Method for detecting similarity of oligonucleotide and target genome
WO2014019164A1 (en) * 2012-08-01 2014-02-06 深圳华大基因研究院 Method and device for analyzing microbial community composition
CN104395900A (en) * 2013-03-15 2015-03-04 北京未名博思生物智能科技开发有限公司 Spatial arithmetic method of sequence alignment
CN104657628A (en) * 2015-01-08 2015-05-27 深圳华大基因科技服务有限公司 Proton-based transcriptome sequencing data comparison and analysis method and system
WO2016058089A1 (en) * 2014-10-17 2016-04-21 The Hospital For Sick Children Dna methylation markers for overgrowth syndromes
EP3081257A1 (en) * 2015-04-17 2016-10-19 Sorin CRM SAS Active implantable medical device for cardiac stimulation comprising means for detecting a remodelling or reverse remodelling phenomenon of the patient
CN106203471A (en) * 2016-06-22 2016-12-07 南京航空航天大学 Spectral clustering method based on a fused Kendall Tau distance metric

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
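Huang Yujuan (黄玉娟): "Model research and application of DNA sequence analysis based on k-words" (基于k词的DNA序列分析的模型研究及应用), China Doctoral Dissertations Full-text Database (Basic Sciences) *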

Also Published As

Publication number Publication date
CN106778078B (en) 2019-04-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190409