CN106778078B - DNA sequence similarity comparison method based on the Kendall correlation coefficient - Google Patents

DNA sequence similarity comparison method based on the Kendall correlation coefficient

Info

Publication number
CN106778078B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201611186639.9A
Other languages
Chinese (zh)
Other versions
CN106778078A (en)
Inventor
林劼
林丽玉
江育娥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Normal University
Priority to CN201611186639.9A
Publication of CN106778078A
Application granted
Publication of CN106778078B
Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Abstract

The present invention discloses a DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps: 1) obtain N DNA sequences to be compared; 2) choose a word length k and extract the k-words of each DNA sequence with a sliding window, combining them into the corresponding vector; 3) using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence to obtain its k-word frequency vector, with each entry denoted x_i and the whole vector denoted X = {x_i}; 4) pair the k-word vectors of the N DNA sequences two by two, giving $\binom{N}{2} = N(N-1)/2$ combinations, with the two k-word frequency vectors of each combination denoted x and y; 5) for the k-word frequency vectors x and y of each combination, calculate the corresponding Kendall correlation coefficient; 6) build the N × N similarity coefficient matrix of the N DNA sequences to obtain the similarity and the evolutionary relationship diagram of the DNA sequences. The invention improves the quality of DNA sequence similarity comparison, reduces computational complexity, and shortens computation time.

Description

DNA sequence similarity comparison method based on the Kendall correlation coefficient
Technical field
The present invention relates to the fields of computing and bioinformatics processing, and more particularly to a DNA sequence similarity comparison method based on the Kendall correlation coefficient.
Background art
The central task of bioinformatics is to extract conceptual knowledge from the vast sea of DNA sequence data. Bioinformaticians must not only devise efficient means of data storage but also develop effective data analysis tools, because only with new and effective analysis tools can DNA sequence information be converted into biological knowledge, the structural and functional information it contains be worked out, and the biological significance it represents be thoroughly understood.
The theoretical basis of DNA sequence comparison is the theory of evolution. If two DNA sequences are sufficiently similar, it is inferred that they may share a common evolutionary ancestor and have diverged through genetic variation processes such as residue substitution, deletion of residues or sequence fragments, and sequence recombination. DNA sequence similarity and DNA sequence homology are different concepts: the degree of similarity between DNA sequences is a quantifiable parameter, whereas whether DNA sequences are homologous must be verified against evolutionary facts. In practice, DNA sequence comparison uses a specific mathematical model or algorithm to find the maximum number of matching bases between two or more DNA sequences.
Huang Yujuan, Wang Tianming et al. used the frequency and position information of the k-words in a DNA sequence to construct a probability distribution and used it to express the distance between two vectors; the smaller the value, the closer the species. Vinga and Almeida described a DNA sequence comparison method based on word frequency: a sliding window counts the occurrences of every word of length k to give a k-word count or frequency vector, so that a DNA sequence is mapped to a vector in a high-dimensional Euclidean space and the similarity comparison between DNA sequences becomes a comparison between vectors.
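The word-count mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `kword_counts`, the alphabet order "ACGT", and the choice to skip windows containing other symbols are not taken from the patent.

```python
from itertools import product

def kword_counts(seq, k, alphabet="ACGT"):
    """Count every length-k word of `seq` with a sliding window of step 1.

    Returns the list of all len(alphabet)**k possible words (in a fixed,
    assumed order) and the matching list of counts.
    """
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if w in counts:  # assumption: ignore windows with symbols such as N
            counts[w] += 1
    return words, [counts[w] for w in words]

words, vec = kword_counts("ATAACTA", 2)
print(dict(zip(words, vec)))
```

Because every sequence is mapped onto the same ordered list of words, the resulting vectors have equal length and can be compared entry by entry.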
Pairwise DNA sequence comparison applies a specific algorithm to two DNA sequences to find the maximum similarity match between them. The Kendall correlation coefficient is widely used for correlation prediction in time series and in hydrological and water-quality series, but it has not been used for DNA sequence similarity matching.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a DNA sequence similarity comparison method based on the Kendall correlation coefficient that builds an N × N similarity coefficient matrix for N DNA sequences, derives the evolutionary relationships among the N DNA sequences, and improves both the effectiveness of DNA sequence similarity comparison and the computational efficiency.
The technical solution adopted by the invention is as follows:
A DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps:
1) obtain N DNA sequences to be compared;
2) choose a word length k, extract the k-words of each DNA sequence with a sliding window, and combine them into the corresponding vector;
3) using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence, that is, calculate the frequency vector of the k-words in the DNA sequence, with entries denoted x_i;
4) pair the k-word vectors of the N DNA sequences two by two, giving $\binom{N}{2} = N(N-1)/2$ combinations, with the two vectors of each combination denoted X = {x_i} and Y = {y_i};
5) for the k-word frequency vectors x_i and y_i of each combination, calculate the corresponding Kendall correlation coefficient;
6) build the N × N Kendall correlation coefficient matrix of the N DNA sequences to obtain the similarity information and the evolutionary relationship diagram of the DNA sequences.
Further, in step 2), a word frequency vector with word length k is taken for each DNA sequence.
Further, in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences is obtained as follows:
a) obtain the k-words of the DNA sequence A to be compared, where the length of A is n; a sliding window of width k advanced one base at a time yields n - k + 1 k-words;
b) calculate the frequency with which each k-word occurs: x_i = the number of times the i-th k-word occurs in DNA sequence A;
c) for the combined vectors X and Y, calculate the Kendall correlation coefficient from t_x, t_y, and T, where t_x is the number of concordant pairs in {x_i}, {y_i}, t_y is the number of discordant pairs in {x_i, y_i}, and T is the total number of distinct k-words in {x_i, y_i};
d) t_x and t_y in step c) are obtained as follows: for two k-words i and j, if (x_i - x_j) and (y_i - y_j) have the same sign, the pair is concordant and is counted in t_x; if they have opposite signs, the pair is discordant and is counted in t_y.
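The pair counting in steps c) and d) can be sketched as follows. The normalisation is an assumption: the sketch uses the textbook tau-a form, dividing the difference between concordant and discordant pairs by the total number of pairs n(n-1)/2, which may differ from the T-based formula of the patent. The function name `kendall_tau` is likewise only illustrative.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Return (tau, concordant, discordant) for equal-length vectors x and y.

    Assumption: tau-a normalisation, (concordant - discordant) / (n*(n-1)/2).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:        # differences have the same sign
            concordant += 1
        elif s < 0:      # differences have opposite signs
            discordant += 1
        # s == 0 is a tie and is counted in neither total
    n_pairs = len(x) * (len(x) - 1) // 2
    tau = (concordant - discordant) / n_pairs if n_pairs else 0.0
    return tau, concordant, discordant

print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))
```

k-word count vectors usually contain many tied entries (for example, several words that never occur), so under this normalisation the value for two identical vectors can fall below 1.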
The Kendall correlation coefficient obtained is denoted τ and takes values in [-1, 1]. The closer τ is to 1, the stronger the positive correlation between the two DNA sequences; the closer τ is to -1, the stronger the negative correlation between them; a value of τ close to 0 indicates that the two DNA sequences are uncorrelated.
An N × N Kendall correlation coefficient matrix is then constructed. The matrix is symmetric and the values on its diagonal are 1. It gives the pairwise affinity of the N DNA sequences, from which their evolutionary relationships are constructed.
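A minimal sketch of the matrix construction, assuming each sequence has already been turned into a k-word count vector of the same length (at least two entries). The helper `tau_a` uses the textbook tau-a normalisation, which is an assumption rather than the formula of the patent, and the diagonal is set to 1 directly as the text describes.

```python
from itertools import combinations

def tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / len(pairs)

def similarity_matrix(vectors):
    """Symmetric N x N matrix of pairwise coefficients with a unit diagonal."""
    n = len(vectors)
    m = [[1.0] * n for _ in range(n)]
    # Only the N*(N-1)/2 pairs above the diagonal are computed, then mirrored.
    for a, b in combinations(range(n), 2):
        m[a][b] = m[b][a] = tau_a(vectors[a], vectors[b])
    return m

for row in similarity_matrix([[1, 2, 3], [3, 2, 1], [1, 3, 2]]):
    print(row)
```

Computing only the upper triangle is what keeps the work at N(N-1)/2 coefficient evaluations rather than N².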
The DNA sequence similarity comparison method of the invention uses a sliding window to obtain the k-word frequency vector of each DNA sequence to be analysed, pairs the k-word vectors of the N DNA sequences two by two, and calculates the Kendall correlation coefficient of the k-word frequency vectors of each pair. Similarity can thus be detected across multiple DNA sequences, and the results effectively reflect the evolutionary relationships between them. The method is concise: only one symmetric matrix, whose diagonal from top left to bottom right is 1, needs to be constructed, which reduces computational complexity and improves computational efficiency. The Kendall coefficient can serve as a feature value for predicting DNA sequence similarity and achieves good accuracy.
Description of the drawings
The invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow diagram of the DNA sequence similarity comparison method based on the Kendall correlation coefficient of the invention;
Fig. 2 is the evolutionary relationship diagram of the DNA sequences obtained with the DNA sequence similarity comparison method based on the Kendall correlation coefficient of the invention.
Specific embodiment
As shown in Fig. 1 and Fig. 2, the method of the invention is further explained using the DNA sequences of 20 species as the object of analysis. As shown in Fig. 1, the DNA sequence similarity comparison method based on the Kendall correlation coefficient of this embodiment comprises the following steps:
1) select the DNA sequences of 20 species as the initial DNA sequences; their names and lengths are listed in Table 1;
Species name          DNA sequence length
baboon                16522
bluewhale             16403
cat                   17010
common_chimpanzee     16564
cow                   16339
fin_whale             16399
gibbon                16473
gorilla               16365
grayseal              16798
harborseal            16827
horse                 16661
human                 16570
mouse                 16296
opossum               17085
orangutan             16390
pigmy_chimpanzee      16555
platypus              17020
rat                   16301
wallaroo              16897
whiterhinoceros       16833
Table 1: DNA sequence information for the 20 species
2) extract the k-words of each initial DNA sequence from step 1) and combine them to obtain the k-word frequency vector of that sequence (see Vinga, S., Almeida, J.S. Alignment-free sequence comparison - a review [J]. Bioinformatics, 513-523, 2003). The method uses a sliding window to count how often each short sequence of length k appears in the DNA sequence under test. DNA has 4 bases {A, T, G, C}, so for k = 2 there are 4^2 = 16 possible k-words and for k = 3 there are 4^3 = 64. For example, for the DNA sequence A = ATAACTA with 2-words W_2 = {AT, TA, AA, TT, AG, GA, AC, CA, CT, ...}, the frequency vector is {1, 2, 1, 0, 0, 0, 1, 0, 1, 0, ...}; for the DNA sequence fragment B = ACAACTTA, the k-word frequency vector is {0, 1, 1, 1, 0, 0, 2, 1, 1, 0, ...};
3) for the N DNA sequences, N k-word frequency vectors are obtained and paired two by two, giving $\binom{N}{2} = N(N-1)/2$ combinations; the two frequency vectors of each combination are denoted X and Y;
4) calculate the Kendall correlation coefficient from t_x, t_y, and T, where t_x is the number of concordant pairs formed by {x_i, y_i} with the other k-word frequencies, t_y is the number of discordant pairs formed by {x_i, y_i} with the other k-word frequencies, and T is the total number of distinct k-words in {x_i, y_i}; for the DNA sequence fragments A and B in step 2), the total number of k-words is T = 7;
5) t_x and t_y in step 4) are obtained as follows: for two k-words i and j, if (x_i - x_j) and (y_i - y_j) have the same sign, the pair is concordant and is counted in t_x; if they have opposite signs, the pair is discordant and is counted in t_y;
6) construct the N × N Kendall correlation coefficient matrix. The matrix is symmetric with diagonal values of 1, so it can usually be reduced to an upper triangular matrix. Because similarity and distance are negatively related, the similarity values are negated to convert them to distances before the evolutionary relationship diagram is constructed from them; see Fig. 2.
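The worked example in step 2) and the negation in step 6) can be checked with a short script. It is a sketch under assumptions: the 16 two-letter words are enumerated in an arbitrary fixed order over "ATGC" (not the order listed in the text), T is taken to be the number of words that occur in at least one of the two sequences, and `tau_a` uses the textbook tau-a normalisation rather than the formula of the patent.

```python
from itertools import combinations, product

def kword_vector(seq, k=2, alphabet="ATGC"):
    """Sliding-window counts of all len(alphabet)**k words, in a fixed order."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = [sum(seq[i:i + k] == w for i in range(len(seq) - k + 1))
              for w in words]
    return words, counts

def tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)
    return score / len(pairs)

words, x = kword_vector("ATAACTA")    # sequence A from step 2)
_, y = kword_vector("ACAACTTA")       # sequence B from step 2)

# Words present in A or B (assumed meaning of T in step 4).
T = sum(1 for xi, yi in zip(x, y) if xi or yi)
tau = tau_a(x, y)
distance = -tau                       # step 6): negate similarity to get a distance

print("T =", T)
print("tau =", tau, "distance =", distance)
```

Under that reading of T, the script reproduces T = 7 for the two fragments. Negating τ yields values in [-1, 1]; a tree-building tool that requires non-negative distances would need a further shift such as 1 - τ, which is a design choice outside the text.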
Analysis of results: by computing the Pearson correlation coefficient between the Kendall-based DNA sequence similarity and the edit distance, we found a correlation of -0.94. This indicates that the method of the invention computes DNA sequence similarity with high accuracy and, because it is fast to compute, can be a very effective alternative to the edit distance.
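The two ingredients of this evaluation, the edit distance between two sequences and the Pearson correlation between two lists of scores, can be sketched as below. The function names are illustrative and the inputs are toy values; the script does not reproduce the -0.94 figure, which depends on the 20 sequences of Table 1.

```python
from math import sqrt

def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

print(edit_distance("ATAACTA", "ACAACTTA"))
print(pearson([1, 2, 3], [6, 4, 2]))  # toy values, perfectly anti-correlated
```

A strongly negative Pearson value is the expected direction here, since a higher similarity should go with a smaller edit distance.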
The above is only an embodiment of the invention and does not limit its scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the invention.

Claims (4)

1. A DNA sequence similarity comparison method based on the Kendall correlation coefficient, characterized in that it comprises the following steps:
1) obtain N DNA sequences to be compared;
2) choose a word length k, extract the k-words of each DNA sequence with a sliding window, and combine them into the corresponding vector;
3) using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence, that is, calculate the frequency vector of the k-words in the DNA sequence, with entries denoted x_i;
4) pair the k-word vectors of the N DNA sequences two by two, giving $\binom{N}{2} = N(N-1)/2$ combinations, with the two vectors of each combination denoted X = {x_i} and Y = {y_i};
5) for the k-word frequency vectors x_i and y_i of each combination, calculate the corresponding Kendall correlation coefficient;
in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences is obtained as follows:
a) obtain the k-words of the DNA sequence A to be compared, where the length of A is n;
b) calculate the frequency with which each k-word occurs: x_i = the number of times the i-th k-word occurs in DNA sequence A;
c) for the combined vectors X and Y, calculate the Kendall correlation coefficient from t_x, t_y, and T, where t_x is the number of concordant pairs in {x_i}, {y_i}, t_y is the number of discordant pairs in {x_i, y_i}, and T is the total number of distinct k-words in {x_i, y_i};
d) t_x and t_y in step c) are obtained as follows: for two k-words i and j, if (x_i - x_j) and (y_i - y_j) have the same sign, the pair is concordant and is counted in t_x; if they have opposite signs, the pair is discordant and is counted in t_y;
6) build the N × N Kendall correlation coefficient matrix of the N DNA sequences to obtain the similarity information and the evolutionary relationship diagram of the DNA sequences.
2. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: in step 2), a word frequency vector with word length k is taken for each DNA sequence.
3. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: the Kendall correlation coefficient obtained is denoted τ and takes values in [-1, 1]; the closer τ is to 1, the stronger the positive correlation between the two DNA sequences; the closer τ is to -1, the stronger the negative correlation between them; a value of τ close to 0 indicates that the two DNA sequences are uncorrelated.
4. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: the N × N Kendall correlation coefficient matrix constructed in step 6) is symmetric with values of 1 on its diagonal; it gives the pairwise affinity of the N DNA sequences, from which their evolutionary relationships are constructed.
CN201611186639.9A 2016-12-20 2016-12-20 DNA sequence similarity comparison method based on the Kendall correlation coefficient Expired - Fee Related CN106778078B (en)

Publications (2)

Publication Number Publication Date
CN106778078A CN106778078A (en) 2017-05-31
CN106778078B true CN106778078B (en) 2019-04-09

Family

ID=58896076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611186639.9A Expired - Fee Related CN106778078B (en) 2016-12-20 2016-12-20 DNA sequence similarity comparison method based on the Kendall correlation coefficient

Country Status (1)

Country Link
CN (1) CN106778078B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846262A (en) * 2018-05-31 2018-11-20 Guangxi University Method for constructing a phylogenetic tree by calculating DFT-based RNA secondary structure distance

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102732609A (en) * 2011-04-08 2012-10-17 博奥生物有限公司 Method for detecting similarity of oligonucleotide and target genome
WO2014019164A1 (en) * 2012-08-01 2014-02-06 深圳华大基因研究院 Method and device for analyzing microbial community composition
CN104395900A (en) * 2013-03-15 2015-03-04 北京未名博思生物智能科技开发有限公司 Spatial arithmetic method of sequence alignment
CN104657628A (en) * 2015-01-08 2015-05-27 深圳华大基因科技服务有限公司 Proton-based transcriptome sequencing data comparison and analysis method and system
WO2016058089A1 (en) * 2014-10-17 2016-04-21 The Hospital For Sick Children Dna methylation markers for overgrowth syndromes
EP3081257A1 (en) * 2015-04-17 2016-10-19 Sorin CRM SAS Active implantable medical device for cardiac stimulation comprising means for detecting a remodelling or reverse remodelling phenomenon of the patient
CN106203471A (en) * 2016-06-22 2016-12-07 南京航空航天大学 A kind of based on the Spectral Clustering merging Kendall Tau distance metric

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040101846A1 (en) * 2002-11-22 2004-05-27 Collins Patrick J. Methods for identifying suitable nucleic acid probe sequences for use in nucleic acid arrays

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
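Research and application of models for k-word-based DNA sequence analysis; Huang Yujuan; China Doctoral Dissertations Full-text Database (Basic Sciences); 20120915 (No. 09); pp. A006-9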

Also Published As

Publication number Publication date
CN106778078A (en) 2017-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190409