CN105512512B - Method for species classification by comparing protein sequences based on amino acid distance polymorphism - Google Patents

Method for species classification by comparing protein sequences based on amino acid distance polymorphism

Info

Publication number
CN105512512B
Authority
CN
China
Prior art keywords
amino acid
distance
protein sequence
same race
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510829185.1A
Other languages
Chinese (zh)
Other versions
CN105512512A (en)
Inventor
孔登
陈永
王晓红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weifang Medical University
Original Assignee
Weifang Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weifang Medical University
Priority to CN201510829185.1A
Publication of CN105512512A
Application granted
Publication of CN105512512B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Peptides Or Proteins (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention proposes a method for species classification by comparing protein sequences based on amino acid distance polymorphism, comprising the following steps. S10: number each amino acid in the protein sequence. S20: calculate the distances between adjacent amino acids of the same kind in the protein sequence. S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same kind occurs. S40: compare the sequences pairwise using the counts from S30, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species. The method converts differences between the amino acids of two sequences into differences in the distances between amino acids. It takes gaps into account without inserting any spacing, is simple, and greatly reduces the amount of computation.

Description

Method for species classification by comparing protein sequences based on amino acid distance polymorphism
Technical field
The invention belongs to the field of species identification, and in particular relates to a method for species classification by comparing protein sequences based on amino acid distance polymorphism.
Background
According to the principles of evolutionary theory, two protein sequences that derive from a common ancestor share a certain degree of homology, and the more closely related two species are, the higher the homology. Species can therefore be classified according to the order of the amino acids in their protein sequences, and a molecular phylogenetic tree can be established. The Clustal algorithm, proposed by Higgins and Sharp in 1988, is now widely used: multiple sequences are first compared pairwise to construct a distance matrix that reflects the pairwise relationships between the sequences, and a phylogenetic tree is then computed from the distance matrix. When two sequences are compared, the simplest approach ignores gaps and only chooses the starting point of the comparison, but its error is large and it rarely reflects the true situation. The most common approach is alignment, in which sequences of different lengths are aligned by inserting gaps. Because gaps can be inserted in many different ways, the comparison becomes complex and the amount of computation increases greatly.
Therefore, in a spirit of continual improvement, and drawing on professional knowledge, experience, and repeated deliberation and testing, the present invention was created. It provides a method for species classification by comparing protein sequences based on amino acid distance polymorphism, which converts differences between the amino acids of two sequences into differences in the distances between amino acids. The method takes gaps into account without inserting any spacing, and greatly reduces the complexity of the comparison.
Summary of the invention
The present invention proposes a method for species classification by comparing protein sequences based on amino acid distance polymorphism. It converts differences between the amino acids of two sequences into differences in the distances between amino acids, takes gaps into account without inserting any spacing, and is simple to compute.
The technical solution of the present invention is realized as follows: a method for species classification by comparing protein sequences based on amino acid distance polymorphism, comprising the following steps:
S10: number each amino acid in the protein sequence;
S20: calculate the distances between adjacent amino acids of the same kind in the protein sequence;
S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same kind occurs;
S40: according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, compare the sequences pairwise, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species.
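Steps S10 to S30 can be sketched in a few lines of Python. This is only an illustrative sketch, not the patentees' implementation: the function name `gap_profile` and the toy sequence are assumptions introduced here, and "distance" is taken to be the difference between the position numbers of two adjacent residues of the same kind, as the description's examples suggest.

```python
from collections import Counter, defaultdict


def gap_profile(sequence):
    """Return {residue: Counter({distance: occurrences})} for one sequence.

    Hypothetical helper name; the source describes the steps but no code.
    """
    positions = defaultdict(list)
    # S10: number every residue, starting from 1.
    for number, residue in enumerate(sequence, start=1):
        positions[residue].append(number)

    profile = {}
    for residue, numbers in positions.items():
        # S20: distance between each pair of adjacent identical residues.
        gaps = [later - earlier for earlier, later in zip(numbers, numbers[1:])]
        # S30: how many times each distinct distance occurs.
        profile[residue] = Counter(gaps)
    return profile


# Toy sequence (made up for illustration): G at 1, 3, 4, 7 and A at 2, 5, 6.
profile = gap_profile("GAGGAAG")
print(profile["G"])  # distances 2, 1 and 3, each seen once
print(profile["A"])  # distances 3 and 1, each seen once
```

The resulting per-residue counters are the statistics that step S40 compares between pairs of sequences.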
In a preferred embodiment, the kinds of amino acid include any one or more of: alanine, leucine, arginine, lysine, asparagine, methionine, phenylalanine, cysteine, proline, glutamine, serine, glutamic acid, threonine, glycine, tryptophan, histidine, tyrosine, isoleucine, valine, and aspartic acid.
In a preferred embodiment, the distances between adjacent amino acids of the same kind are calculated in step S20 by extracting the numbers assigned to each kind of amino acid in the sequence separately and then calculating the distance between each pair of adjacent amino acids of the same kind.
In a preferred embodiment, step S40 analyzes the polymorphism of the distances between amino acids of the same kind in the proteins according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, and classifies the species by constructing a distance matrix and computing a phylogenetic tree.
In a preferred embodiment, the polymorphism analysis in step S40 satisfies the formulas $F = 2n_{xy}/(n_x + n_y)$ and $P = -\ln F$, where, for the two protein sequences being compared, $n_x$ is the number of times a given distance between adjacent amino acids of the same kind occurs in the first sequence, $n_y$ is the number of times the same distance occurs in the second sequence, $n_{xy}$ is the number of occurrences of that distance common to both sequences, i.e. the smaller of $n_x$ and $n_y$, and $P$ is the diversity value of the distance between adjacent amino acids of the same kind for the two sequences.
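The formula can be written as a small Python function. This is a sketch under stated assumptions: the name `diversity` is hypothetical, and the source does not say what happens when a distance is absent from one of the two sequences (then $n_{xy} = 0$, $F = 0$, and the logarithm is undefined). Returning infinity in that case is purely an assumption of this sketch.

```python
import math


def diversity(n_x, n_y):
    """Diversity value P = -ln F with F = 2*n_xy / (n_x + n_y).

    n_x, n_y: occurrences of one distance in the first and second sequence.
    n_xy is the smaller of the two counts, as the text defines it.
    """
    n_xy = min(n_x, n_y)
    if n_xy == 0:
        # Assumption (not in the source): F = 0 has no finite logarithm,
        # so report an infinite diversity value.
        return math.inf
    f = 2 * n_xy / (n_x + n_y)
    return -math.log(f)


print(diversity(3, 3))  # equal counts give F = 1, so P is zero
print(diversity(3, 2))  # F = 0.8, so P is about 0.223
```

Equal counts always give $F = 1$ and $P = 0$; the more the two counts differ, the larger $P$ becomes.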
In a preferred embodiment, to calculate the polymorphism of the distances between adjacent amino acids of the same kind in step S40, all protein sequences are compared pairwise, the diversity values of all distances of all amino acids are calculated and averaged to construct the distance matrix, and an evolutionary tree is produced from the distance matrix.
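The averaging step can be sketched as follows. The helper names (`gap_profile`, `mean_diversity`, `distance_matrix`) and the toy sequences are assumptions made for illustration. One point the source leaves open is how to treat a distance that occurs in only one of the two sequences, where $F = 0$ and $P$ is undefined; this sketch simply skips such distances and averages over the shared ones, which is an assumption and may differ from what the patentees did.

```python
import math
from collections import Counter, defaultdict


def gap_profile(sequence):
    """{residue: Counter({distance: occurrences})} for one sequence."""
    positions = defaultdict(list)
    for number, residue in enumerate(sequence, start=1):
        positions[residue].append(number)
    return {
        residue: Counter(b - a for a, b in zip(numbers, numbers[1:]))
        for residue, numbers in positions.items()
    }


def mean_diversity(profile_x, profile_y):
    """Average P = -ln(2*n_xy/(n_x+n_y)) over residues and distances.

    Assumption: only distances present in BOTH sequences are averaged.
    """
    values = []
    for residue in profile_x.keys() & profile_y.keys():
        counts_x, counts_y = profile_x[residue], profile_y[residue]
        for gap in counts_x.keys() & counts_y.keys():
            n_x, n_y = counts_x[gap], counts_y[gap]
            f = 2 * min(n_x, n_y) / (n_x + n_y)
            values.append(-math.log(f))
    # Assumption: no shared distance at all is reported as infinity.
    return sum(values) / len(values) if values else math.inf


def distance_matrix(sequences):
    """Symmetric matrix of mean diversity values for all sequence pairs."""
    profiles = [gap_profile(s) for s in sequences]
    size = len(profiles)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            matrix[i][j] = matrix[j][i] = mean_diversity(profiles[i], profiles[j])
    return matrix


# Toy sequences, made up for illustration only.
matrix = distance_matrix(["GAGAGG", "GAGAGG", "GGAGAAG"])
for row in matrix:
    print([round(value, 4) for value in row])
```

Identical sequences get a distance of zero, and the matrix is symmetric, so it can be passed directly to a distance-based tree-building method.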
With the above technical solution, the beneficial effects of the present invention are as follows: sequences are compared according to the differences in the distances between adjacent amino acids of the same kind in the protein sequences, a distance matrix is constructed, and a phylogenetic tree is computed from the distance matrix. The method converts differences between the amino acids of two sequences into differences in the distances between amino acids, takes gaps into account without inserting any spacing, is simple to compute, and meets basic requirements.
Brief description of the drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the present invention;
Fig. 2 is the phylogenetic tree constructed by the present invention;
Fig. 3 is the phylogenetic tree constructed from sequences aligned with MEGA 6.0.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
As shown in Fig. 1, the method of the present invention for species classification by comparing protein sequences based on amino acid distance polymorphism comprises the following steps:
S10: number each amino acid in the protein sequence;
S20: calculate the distances between adjacent amino acids of the same kind in the protein sequence;
S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same kind occurs;
S40: according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, compare the sequences pairwise, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species.
The kinds of amino acid include any one or more of: alanine, leucine, arginine, lysine, asparagine, methionine, phenylalanine, cysteine, proline, glutamine, serine, glutamic acid, threonine, glycine, tryptophan, histidine, tyrosine, isoleucine, valine, and aspartic acid.
In step S20, the distances between adjacent amino acids of the same kind are calculated by extracting the numbers assigned to each kind of amino acid in the sequence separately and then calculating the distance between each pair of adjacent amino acids of the same kind.
In step S40, the polymorphism of the distances between amino acids of the same kind in the proteins is analyzed according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, and the species are classified by constructing a distance matrix and computing a phylogenetic tree.
The polymorphism analysis in step S40 satisfies the formulas $F = 2n_{xy}/(n_x + n_y)$ and $P = -\ln F$, where, for the two protein sequences being compared, $n_x$ is the number of times a given distance between adjacent amino acids of the same kind occurs in the first sequence, $n_y$ is the number of times the same distance occurs in the second sequence, $n_{xy}$ is the number of occurrences of that distance common to both sequences, i.e. the smaller of $n_x$ and $n_y$, and $P$ is the diversity value of the distance between adjacent amino acids of the same kind for the two sequences.
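As a worked illustration of the formula (the counts are made up and do not come from the patent's data), suppose a given distance occurs 4 times in the first sequence and 2 times in the second:

```latex
\[
n_x = 4,\qquad n_y = 2,\qquad n_{xy} = \min(n_x, n_y) = 2
\]
\[
F = \frac{2\,n_{xy}}{n_x + n_y} = \frac{2 \cdot 2}{4 + 2} = \frac{2}{3},
\qquad
P = -\ln F = \ln\frac{3}{2} \approx 0.405
\]
```

When the two counts are equal, $F = 1$ and $P = 0$, so a distance that occurs equally often in both sequences contributes no divergence.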
To calculate the polymorphism of the distances between adjacent amino acids of the same kind in step S40, all protein sequences are compared pairwise, the diversity values of all distances of all amino acids are calculated and averaged to construct the distance matrix, and an evolutionary tree is produced from the distance matrix.
Taking one amino acid sequence as an example, each amino acid is numbered, and the distances between adjacent G residues are as follows:
It can be seen that the distances between adjacent G residues are 5, 17, 1 and 5. The distances between adjacent residues of any other amino acid can be obtained in the same way.
If a mutation occurs, for example the C at position 18 becomes G, the distances between adjacent G residues become 5, 11, 6, 1 and 5. Compared with the sequence above, three numbers (5, 1 and 5) are identical; that is, only the distance between the two adjacent G residues spanning the mutation site is affected, and the remaining numbers are unchanged. The distances of C are affected as well, while the distances of the other amino acids are unchanged, as follows:
If an insertion occurs, for example a G is inserted between the Q at position 17 and the C at position 18, the distances between adjacent G residues become 5, 11, 7, 1 and 5. Compared with the first sequence above, three numbers (5, 1 and 5) are identical; that is, only the distance between the two adjacent G residues spanning the insertion site is affected, and the remaining numbers are unchanged.
The distances between adjacent residues of the other amino acids are affected in some cases and unchanged in others, as shown in table one below:
If a deletion occurs, its effect is similar to that of an insertion. For multiple amino acid sequences, the distances between adjacent amino acids of the same kind can all be counted and the resulting data compared. According to the principles of evolutionary theory, the closer the relationship between two species, the more similar the arrangement of their amino acids, and the more similar the distance data. Conversely, the more distant the relationship, the greater the variation, and the less similar the distance data. Different species can thus be classified.
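The mutation and insertion examples above can be reproduced with a short Python sketch. The example sequence of the patent is not reproduced in this text, so the 30-residue string below is hypothetical: it was made up so that G sits at positions 2, 7, 24, 25 and 30, Q at position 17 and C at position 18, which yields the distances quoted in the description. The function name `gaps` is also an assumption.

```python
from collections import Counter


def gaps(sequence, residue):
    """Distances between adjacent occurrences of one residue (1-based numbering)."""
    numbers = [i for i, r in enumerate(sequence, start=1) if r == residue]
    return [b - a for a, b in zip(numbers, numbers[1:])]


# Hypothetical sequence: G at 2, 7, 24, 25, 30; Q at 17; C at 18.
original = "AGKLVEGDSTPAKLVEQCDSTPAGGKLVEG"
# Mutation: the C at position 18 becomes G.
mutated = original[:17] + "G" + original[18:]
# Insertion: a G is inserted between Q (17) and C (18).
inserted = original[:17] + "G" + original[17:]

print(gaps(original, "G"))  # [5, 17, 1, 5]
print(gaps(mutated, "G"))   # [5, 11, 6, 1, 5]
print(gaps(inserted, "G"))  # [5, 11, 7, 1, 5]

# Distances shared with the original (multiset intersection): 5, 5 and 1.
shared = Counter(gaps(original, "G")) & Counter(gaps(mutated, "G"))
print(sum(shared.values()))  # 3
```

In both cases only the distance that spans the changed site is split into two new values, while every other distance is preserved, which is why the method tolerates gaps without alignment.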
The cytochrome c sequences of 16 species are taken as an example below; the sequence information is given in table two:
No.  Species                 Accession number   Length
1    Human                   NP_061820.1        105
2    Macaque                 EHH17434.1         105
3    Ox                      NP_001039526.1     105
4    California gray whale   P68100.2           105
5    Chicken                 NP_001072946.1     105
6    Penguin                 P00017.2           105
7    Lizard                  P21665.2           105
8    Snake                   JAB54399.1         105
9    Salmon                  ACI70114.1         104
10   Gadus                   ACQ58603.1         104
11   Longicorn               JAB67143.1         108
12   Diamond-back moth       NP_001292408.1     108
13   Arabidopsis             AAA32747.1         112
14   Aspergillus niger       P56205.1           111
15   Burkholderia            EJ062937.1         119
16   Xanthomonas campestris  KLD80214.1         117
The species classification process is explained further below. First, the 16 sequences above are each numbered starting from 1. The numbers of all amino acids of the same kind are then extracted from each sequence, the distances between adjacent amino acids of the same kind are calculated, and the number of times each distance occurs is counted. Table three below gives the statistics for alanine (A), where N is the distance between two adjacent A residues.
The statistics for the other amino acids are similar to those for A and are not listed for reasons of space.
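A table of this kind (one row per distance N, one column per sequence) could be assembled as in the sketch below. The function names `gap_counts` and `tabulate` and the two short input strings are hypothetical; the real input would be the 16 cytochrome c sequences, which are not reproduced here.

```python
from collections import Counter


def gap_counts(sequence, residue):
    """Count how often each distance between adjacent `residue` positions occurs."""
    numbers = [i for i, r in enumerate(sequence, start=1) if r == residue]
    return Counter(b - a for a, b in zip(numbers, numbers[1:]))


def tabulate(sequences, residue):
    """Map each distance N to a list of counts, one per sequence, in input order."""
    counts = {name: gap_counts(seq, residue) for name, seq in sequences.items()}
    all_gaps = sorted(set().union(*counts.values()))
    return {n: [counts[name][n] for name in sequences] for n in all_gaps}


# Toy sequences, made up for illustration only.
toy = {"seq1": "ACAAGA", "seq2": "AAGCA"}
table = tabulate(toy, "A")
for n, row in table.items():
    print(n, row)
```

A `Counter` returns 0 for a distance it has never seen, so distances missing from one sequence appear as zero entries in that sequence's column.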
Based on the distance statistics for adjacent amino acids of the same kind obtained above, the 16 protein sequences are substituted pairwise, row by row, into the formulas $F = 2n_{xy}/(n_x + n_y)$ and $P = -\ln F$ to obtain the diversity values. The P values of all distances of all amino acids are then averaged, and the distance matrix is constructed as shown in table four below:
The phylogenetic tree constructed from the distance matrix of table four with the neighbor-joining method is shown in Fig. 2. The phylogenetic tree constructed by aligning the sequences with MEGA 6.0 and selecting the neighbor-joining method is shown in Fig. 3. In Fig. 2, human and macaque are grouped together first; ox and California gray whale are grouped together; chicken and penguin are grouped together; salmon and gadus are grouped together; longicorn and diamond-back moth are grouped together; and Burkholderia and Xanthomonas campestris are grouped together. This is basically consistent with Fig. 3, in which the tree was constructed from the protein sequences aligned with MEGA 6.0. The snake is closely related to the other vertebrates and to the lizard, but it is not grouped with either. This is because a single kind of protein sequence can hardly reflect the true course of evolution, a limitation that no sequence analysis method can avoid. An accurate evolutionary result can only be obtained by comprehensive analysis that combines other sequence analyses with traditional classification methods such as morphology.
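For readers unfamiliar with the neighbor-joining method mentioned above, the following is a simplified textbook-style sketch. It is not MEGA's implementation and not the patentees' code: it returns only the tree topology as nested tuples, omits branch lengths, and breaks ties by input order. The function name and the 4x4 example matrix are assumptions made for illustration; the matrix is not table four.

```python
def neighbor_joining(names, matrix):
    """Return an unrooted tree topology as nested tuples (no branch lengths)."""
    nodes = list(names)
    dist = {
        (a, b): matrix[i][j]
        for i, a in enumerate(names)
        for j, b in enumerate(names)
        if i != j
    }
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other current nodes.
        total = {a: sum(dist[a, b] for b in nodes if b != a) for a in nodes}
        # Choose the pair that minimises Q(a, b) = (n-2)*d(a,b) - r_a - r_b.
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * dist[p] - total[p[0]] - total[p[1]],
        )
        merged = (a, b)
        # Distance from the new internal node to every remaining node.
        for c in nodes:
            if c != a and c != b:
                d = 0.5 * (dist[a, c] + dist[b, c] - dist[a, b])
                dist[merged, c] = dist[c, merged] = d
        nodes = [c for c in nodes if c != a and c != b] + [merged]
    return tuple(nodes)


# Made-up additive distances for a tree in which A pairs with B and C with D.
example = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 3],
    [6, 7, 3, 0],
]
tree = neighbor_joining(["A", "B", "C", "D"], example)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

In practice an established phylogenetics package would normally be used for this step; the sketch is only meant to show how a distance matrix alone determines the grouping.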
The working principle of the method for species classification by comparing protein sequences based on amino acid distance polymorphism is as follows: sequences are compared according to the differences in the distances between adjacent amino acids of the same kind in the protein sequences, a distance matrix is constructed, and a phylogenetic tree is computed from the distance matrix. The method converts differences between the amino acids of two sequences into differences in the distances between amino acids, takes gaps into account without inserting any spacing, is simple to compute, and meets basic requirements.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (4)

1. A method for species classification by comparing protein sequences based on amino acid distance polymorphism, characterized by comprising the following steps:
S10: number each amino acid in the protein sequence;
S20: calculate the distances between adjacent amino acids of the same kind in the protein sequence;
S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same kind occurs;
S40: according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, compare the sequences pairwise, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species;
wherein, in step S40, the polymorphism of the distances between amino acids of the same kind in the proteins is analyzed according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, and the species are classified by constructing a distance matrix and computing a phylogenetic tree;
wherein the polymorphism analysis in step S40 satisfies the formulas $F = 2n_{xy}/(n_x + n_y)$ and $P = -\ln F$, where, for the two protein sequences being compared, $n_x$ is the number of times a given distance between adjacent amino acids of the same kind occurs in the first sequence, $n_y$ is the number of times the same distance occurs in the second sequence, $n_{xy}$ is the number of occurrences of that distance common to both sequences, i.e. the smaller of $n_x$ and $n_y$, and $P$ is the diversity value of the distance between adjacent amino acids of the same kind for the two sequences.
2. The method for species classification by comparing protein sequences based on amino acid distance polymorphism according to claim 1, characterized in that the kinds of amino acid include any one or more of: alanine, leucine, arginine, lysine, asparagine, methionine, phenylalanine, cysteine, proline, glutamine, serine, glutamic acid, threonine, glycine, tryptophan, histidine, tyrosine, isoleucine, valine, and aspartic acid.
3. The method for species classification by comparing protein sequences based on amino acid distance polymorphism according to claim 1, characterized in that, in step S20, the distances between adjacent amino acids of the same kind are calculated by extracting the numbers assigned to each kind of amino acid in the sequence separately and then calculating the distance between each pair of adjacent amino acids of the same kind.
4. The method for species classification by comparing protein sequences based on amino acid distance polymorphism according to claim 1, characterized in that, to calculate the polymorphism of the distances between adjacent amino acids of the same kind in step S40, all protein sequences are compared pairwise, the diversity values of all distances of all amino acids are calculated and averaged to construct the distance matrix, and an evolutionary tree is produced from the distance matrix.
CN201510829185.1A 2015-11-24 2015-11-24 Method for species classification by comparing protein sequences based on amino acid distance polymorphism Active CN105512512B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510829185.1A CN105512512B (en) 2015-11-24 2015-11-24 Method for species classification by comparing protein sequences based on amino acid distance polymorphism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510829185.1A CN105512512B (en) 2015-11-24 2015-11-24 Method for species classification by comparing protein sequences based on amino acid distance polymorphism

Publications (2)

Publication Number Publication Date
CN105512512A CN105512512A (en) 2016-04-20
CN105512512B true CN105512512B (en) 2019-03-29

Family

ID=55720489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510829185.1A Active CN105512512B (en) 2015-11-24 2015-11-24 Method for species classification by comparing protein sequences based on amino acid distance polymorphism

Country Status (1)

Country Link
CN (1) CN105512512B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant