CN105512512B - Method for species classification by comparing protein sequences using amino acid distance polymorphism
- Publication number
- CN105512512B (application number CN201510829185.1A)
- Authority
- CN
- China
- Prior art keywords
- amino acid
- distance
- protein sequence
- same type
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Analytical Chemistry (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Peptides Or Proteins (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention provides a method for species classification by comparing protein sequences using amino acid distance polymorphism, comprising the following steps: S10: number each amino acid in the protein sequence; S20: calculate the distances between adjacent amino acids of the same type in the protein sequence; S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same type occurs; S40: compare the sequences pairwise using the statistics from S30, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species. The method converts differences in the amino acids of a sequence into differences in the distances between amino acids, which accounts for gaps without requiring gap insertion. The method is simple and greatly reduces the amount of computation.
Description
Technical field
The invention belongs to the field of species identification, and in particular relates to a method for species classification by comparing protein sequences using amino acid distance polymorphism.
Background
According to the principles of evolutionary theory, two protein sequences that derive from a common ancestor share a degree of homology, and the more closely related two species are, the higher that homology is. Species can therefore be classified according to the order of the amino acids in their protein sequences, and a molecular phylogenetic tree can be established. The approach now in wide use is the Clustal algorithm proposed by Higgins and Sharp in 1988, which first compares multiple sequences pairwise to build a distance matrix reflecting the pairwise relationships between the sequences, and then computes a phylogenetic tree from the distance matrix. When two sequences are compared, the simplest case ignores gaps and only selects the starting point of the comparison, but this approach has a large error and rarely reflects the true situation. The most common method is alignment, in which sequences of different lengths are brought into register by inserting gaps. Because there are many ways to insert gaps, the comparison becomes complex and the amount of computation increases greatly.
Therefore, in a spirit of continual improvement, drawing on professional knowledge and experience, and after much deliberation and experimentation, the present invention was created. It provides a method for species classification by comparing protein sequences using amino acid distance polymorphism, which converts differences in the amino acids of a sequence into differences in the distances between amino acids, accounting for gaps without requiring gap insertion and greatly reducing the complexity of the comparison.
Summary of the invention
The present invention proposes a method for species classification by comparing protein sequences using amino acid distance polymorphism. The method converts differences in the amino acids of a sequence into differences in the distances between amino acids, accounts for gaps without requiring gap insertion, and is computationally simple.
The technical solution of the invention is realized as follows: a method for species classification by comparing protein sequences using amino acid distance polymorphism, comprising the following steps:
S10: number each amino acid in the protein sequence;
S20: calculate the distances between adjacent amino acids of the same type in the protein sequence;
S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same type occurs;
S40: compare the sequences pairwise according to the counted number of times each distance of each amino acid occurs in each protein sequence, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species.
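As a non-limiting illustration, steps S10 to S30 can be sketched in Python as follows. The function name `distance_counts` and the toy sequence `GAGGA` are assumptions introduced for illustration and are not part of the original disclosure; the sketch assumes that a sequence is given as a string of one-letter amino acid codes.

```python
from collections import Counter, defaultdict

def distance_counts(sequence):
    """Return {amino_acid: Counter({distance: occurrences})} for one sequence."""
    positions = defaultdict(list)
    # S10: number the residues from 1 and collect the numbers of each amino acid type.
    for number, residue in enumerate(sequence, start=1):
        positions[residue].append(number)
    counts = {}
    for residue, numbers in positions.items():
        # S20: distance between each pair of adjacent residues of the same type.
        gaps = [b - a for a, b in zip(numbers, numbers[1:])]
        # S30: number of times each distance occurs.
        counts[residue] = Counter(gaps)
    return counts

example = distance_counts("GAGGA")  # toy sequence, hypothetical
print(dict(example["G"]))  # G at 1, 3, 4 gives distances 2 and 1
print(dict(example["A"]))  # A at 2, 5 gives distance 3
```

An amino acid that occurs only once in a sequence yields an empty counter, since it has no adjacent residue of the same type.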
In a preferred embodiment, the amino acid types include any one or more of: alanine, leucine, arginine, lysine, asparagine, methionine, phenylalanine, cysteine, proline, glutamine, serine, glutamic acid, threonine, glycine, tryptophan, histidine, tyrosine, isoleucine, valine, and aspartic acid.
In a preferred embodiment, in step S20 the distances between adjacent amino acids of the same type are calculated by extracting the numbers corresponding to each type of amino acid in the sequence and then computing the distance between adjacent amino acids of the same type.
In a preferred embodiment, in step S40 the polymorphism of same-type amino acid distances in the proteins is analyzed according to the counted number of times each distance of each amino acid occurs in each protein sequence, and species classification is carried out by constructing a distance matrix and computing a phylogenetic tree.
In a preferred embodiment, the analysis of the polymorphism of same-type amino acid distances in step S40 satisfies the formulas $F = 2n_{xy}/(n_x + n_y)$ and $P = -\ln F$, where $n_x$ is the number of times a given distance between adjacent amino acids of the same type occurs in the first of the two protein sequences, $n_y$ is the number of times that distance occurs in the second of the two protein sequences, $n_{xy}$ is the number of occurrences of that distance common to both sequences, i.e. the smaller of $n_x$ and $n_y$, and $P$ is the polymorphism value of the adjacent same-type amino acid distances of the two protein sequences.
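By way of illustration only, the two formulas can be evaluated for a single distance of a single amino acid type as in the following Python sketch. The function names `similarity_f` and `polymorphism_p` are hypothetical. The text does not state how the case $n_{xy} = 0$ is handled, where $F = 0$ and $-\ln F$ is undefined, so raising an error in that case is an assumption of this sketch rather than part of the disclosed method.

```python
import math

def similarity_f(n_x, n_y):
    """F = 2 * n_xy / (n_x + n_y), with n_xy taken as the smaller of n_x and n_y."""
    if n_x + n_y == 0:
        raise ValueError("the distance occurs in neither sequence")
    n_xy = min(n_x, n_y)
    return 2 * n_xy / (n_x + n_y)

def polymorphism_p(n_x, n_y):
    """P = -ln F for one distance of one amino acid type."""
    f = similarity_f(n_x, n_y)
    if f == 0:
        # Assumption: the source does not define P when F = 0.
        raise ValueError("P is undefined when the distance is absent from one sequence")
    return -math.log(f)

# A distance seen 3 times in the first sequence and once in the second:
print(similarity_f(3, 1))    # 0.5
print(polymorphism_p(3, 1))  # ln 2, about 0.693
```

When the two counts are equal, $F = 1$ and $P = 0$, so identical distance statistics contribute no distance between the two sequences.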
In a preferred embodiment, to calculate the polymorphism of adjacent same-type amino acid distances in step S40, all protein sequences are compared pairwise, the polymorphism values of all distances of all amino acids are computed and averaged to construct the distance matrix, and the evolutionary tree is built from the distance matrix.
With the above technical solution, the beneficial effects of the invention are as follows: the sequences are compared according to differences in the distances between adjacent amino acids of the same type, a distance matrix is constructed, and a phylogenetic tree is computed from the distance matrix. The method converts differences in the amino acids of a sequence into differences in the distances between amino acids, accounts for gaps without requiring gap insertion, is computationally simple, and meets basic requirements.
Brief description of the drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the invention;
Fig. 2 is the phylogenetic tree constructed by the invention;
Fig. 3 is the phylogenetic tree constructed from sequences aligned with MEGA 6.0 software.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
As shown in Fig. 1, the method of the invention for species classification by comparing protein sequences using amino acid distance polymorphism comprises the following steps:
S10: number each amino acid in the protein sequence;
S20: calculate the distances between adjacent amino acids of the same type in the protein sequence;
S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same type occurs;
S40: compare the sequences pairwise according to the counted number of times each distance of each amino acid occurs in each protein sequence, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species.
The amino acid types include any one or more of: alanine, leucine, arginine, lysine, asparagine, methionine, phenylalanine, cysteine, proline, glutamine, serine, glutamic acid, threonine, glycine, tryptophan, histidine, tyrosine, isoleucine, valine, and aspartic acid.
In step S20, the distances between adjacent amino acids of the same type are calculated by extracting the numbers corresponding to each type of amino acid in the sequence and then computing the distance between adjacent amino acids of the same type.
In step S40, the polymorphism of same-type amino acid distances in the proteins is analyzed according to the counted number of times each distance of each amino acid occurs in each protein sequence, and species classification is carried out by constructing a distance matrix and computing a phylogenetic tree.
The analysis of the polymorphism of same-type amino acid distances in step S40 satisfies the formulas $F = 2n_{xy}/(n_x + n_y)$ and $P = -\ln F$, where $n_x$ is the number of times a given distance between adjacent amino acids of the same type occurs in the first of the two protein sequences, $n_y$ is the number of times that distance occurs in the second of the two protein sequences, $n_{xy}$ is the number of occurrences of that distance common to both sequences, i.e. the smaller of $n_x$ and $n_y$, and $P$ is the polymorphism value of the adjacent same-type amino acid distances of the two protein sequences.
To calculate the polymorphism of adjacent same-type amino acid distances in step S40, all protein sequences are compared pairwise, the polymorphism values of all distances of all amino acids are computed and averaged to construct the distance matrix, and the evolutionary tree is built from the distance matrix.
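The pairwise comparison and averaging described above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, and because the text does not say which distances enter the average or how $F = 0$ is treated, the sketch averages only over (amino acid, distance) pairs that occur in both sequences and returns infinity when no such pair exists.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def distance_counts(sequence):
    """Steps S10 to S30: {amino_acid: Counter({distance: occurrences})}."""
    positions = defaultdict(list)
    for number, residue in enumerate(sequence, start=1):
        positions[residue].append(number)
    return {r: Counter(b - a for a, b in zip(nums, nums[1:]))
            for r, nums in positions.items()}

def mean_polymorphism(counts_x, counts_y):
    """Average P = -ln F over the distances shared by both sequences."""
    p_values = []
    for residue in counts_x.keys() & counts_y.keys():
        shared = counts_x[residue].keys() & counts_y[residue].keys()
        for distance in shared:
            n_x = counts_x[residue][distance]
            n_y = counts_y[residue][distance]
            f = 2 * min(n_x, n_y) / (n_x + n_y)
            p_values.append(-math.log(f))
    # Assumption: sequences sharing no distance are treated as infinitely far apart.
    return sum(p_values) / len(p_values) if p_values else math.inf

def distance_matrix(sequences):
    """Step S40: symmetric matrix of averaged polymorphism values."""
    counts = [distance_counts(s) for s in sequences]
    size = len(sequences)
    matrix = [[0.0] * size for _ in range(size)]
    for i, j in combinations(range(size), 2):
        matrix[i][j] = matrix[j][i] = mean_polymorphism(counts[i], counts[j])
    return matrix

matrix = distance_matrix(["GAGGA", "GAGAGGA", "GAGGA"])  # toy sequences, hypothetical
```

In this toy input the first and third sequences are identical, so their entry is zero, while the first and second differ only in how often the G distance of 2 occurs.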
Take an amino acid sequence as an example and number each amino acid; the distances between adjacent G residues are then as follows:
The distances between adjacent G residues are 5, 17, 1, 5. The adjacent distances of any other amino acid can be obtained in the same way.
If a mutation occurs, for example the C at position 18 becomes G, the distances between adjacent G residues become 5, 11, 6, 1, 5. Compared with the sequence above, three values (5, 1, 5) are the same; that is, only the distance between the two adjacent G residues flanking the mutation is affected, and the remaining values are unchanged. The distances of C are affected as well, while the distances of the other amino acids are unchanged, as follows:
If an insertion occurs, for example a G is inserted between the Q at position 17 and the C at position 18, the distances between adjacent G residues become 5, 11, 7, 1, 5. Compared with the first sequence above, three values (5, 1, 5) are the same; that is, only the distance between the two adjacent G residues flanking the insertion is affected, and the remaining values are unchanged.
The adjacent distances of some of the other amino acids are affected and others are not, as shown in Table 1 below:
If a deletion occurs, its effect is similar to that of an insertion. For multiple amino acid sequences, the adjacent distances of all same-type amino acids can be counted and the resulting data compared. According to the principles of evolutionary theory, the closer the relationship, the more similar the arrangement of amino acids and the more similar the distance data; conversely, the more distant the relationship, the greater the variation and the less similar the distance data. Species can thereby be classified.
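The effect of a mutation or an insertion on the G distances described above can be reproduced with the following Python sketch. The sequence used here is hypothetical: it is filled with alanine and only places G at positions 2, 7, 24, 25 and 30, Q at position 17 and C at position 18, which is enough to give the distances 5, 17, 1, 5 stated in the text. It is not the example sequence of the original disclosure.

```python
def gaps(sequence, residue):
    """Distances between adjacent occurrences of one amino acid type."""
    numbers = [i for i, r in enumerate(sequence, start=1) if r == residue]
    return [b - a for a, b in zip(numbers, numbers[1:])]

# Hypothetical 30-residue sequence built to match the distances in the text.
residues = ["A"] * 30
for number in (2, 7, 24, 25, 30):
    residues[number - 1] = "G"
residues[16] = "Q"  # position 17
residues[17] = "C"  # position 18
original = "".join(residues)

mutated = original[:17] + "G" + original[18:]   # C at position 18 becomes G
inserted = original[:17] + "G" + original[17:]  # G inserted between 17 and 18

print(gaps(original, "G"))  # [5, 17, 1, 5]
print(gaps(mutated, "G"))   # [5, 11, 6, 1, 5]
print(gaps(inserted, "G"))  # [5, 11, 7, 1, 5]
```

In both cases the single distance of 17 is split into two new values, while the distances 5, 1 and 5 on either side are unchanged, which is why no gap needs to be inserted before the sequences are compared.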
The cytochrome c sequences of 16 species are taken as an example; the sequence information is given in Table 2 below:
No. | Species | Accession number | Length |
---|---|---|---|
1 | Human | NP_061820.1 | 105 |
2 | Macaque | EHH17434.1 | 105 |
3 | Cattle | NP_001039526.1 | 105 |
4 | California gray whale | P68100.2 | 105 |
5 | Chicken | NP_001072946.1 | 105 |
6 | Penguin | P00017.2 | 105 |
7 | Lizard | P21665.2 | 105 |
8 | Snake | JAB54399.1 | 105 |
9 | Salmon | ACI70114.1 | 104 |
10 | Cod | ACQ58603.1 | 104 |
11 | Longhorn beetle | JAB67143.1 | 108 |
12 | Diamondback moth | NP_001292408.1 | 108 |
13 | Arabidopsis | AAA32747.1 | 112 |
14 | Aspergillus niger | P56205.1 | 111 |
15 | Burkholderia | EJ062937.1 | 119 |
16 | Xanthomonas campestris | KLD80214.1 | 117 |
The species classification procedure is described further below. First, the 16 sequences above are each numbered starting from 1; the numbers of all amino acids of the same type are then extracted from each sequence, the distances between adjacent amino acids of the same type are calculated, and the number of times each distance occurs is counted. Table 3 below gives the statistics for alanine (A), where N is the distance between two adjacent A residues. The statistics for the other amino acids are not listed for reasons of space.
The statistics for the other amino acids are similar to those for A and are not repeated here.
Using the distance statistics above for adjacent amino acids of the same type, the 16 protein sequences are substituted pairwise into the formulas $F = 2n_{xy}/(n_x + n_y)$ and $P = -\ln F$ to obtain the polymorphism values; the P values of all distances of all amino acids are averaged, and the distance matrix is constructed as shown in Table 4 below:
The phylogenetic tree constructed from the distance matrix of Table 4 by the neighbor-joining method is shown in Fig. 2; the phylogenetic tree constructed by aligning the sequences with MEGA 6.0 software and selecting the neighbor-joining method is shown in Fig. 3. In Fig. 2, human and macaque are grouped together first; cattle and California gray whale are grouped together; chicken and penguin are grouped together; salmon and cod are grouped together; longhorn beetle and diamondback moth form one group; and Burkholderia and Xanthomonas campestris form one group. This is broadly consistent with the result in Fig. 3, where the tree was built from protein sequences aligned with MEGA 6.0. The snake is a vertebrate and is closely related to the lizard, but the two were not grouped together. This is because a single protein sequence rarely reflects the true course of evolution, a limitation that no sequence analysis method can avoid. An accurate evolutionary result requires comprehensive analysis that combines other sequence analyses with traditional classification methods such as morphology.
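For orientation, the tree-building step can be sketched as a plain neighbor-joining procedure in Python. This is an illustrative, topology-only sketch (no branch lengths) of the general neighbor-joining technique; it is not the MEGA 6.0 implementation, and the function name `neighbor_joining` and the four-taxon matrix are hypothetical rather than taken from Table 4.

```python
def neighbor_joining(names, matrix):
    """Return a nested-tuple tree topology from a symmetric distance matrix."""
    nodes = list(names)
    d = [row[:] for row in matrix]
    while len(nodes) > 2:
        n = len(nodes)
        totals = [sum(row) for row in d]
        best = None
        # Pick the pair (i, j) that minimises Q = (n - 2) * d[i][j] - r_i - r_j.
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - totals[i] - totals[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance from the new internal node to every remaining node.
        new_row = [0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in keep]
        d = [[d[a][b] for b in keep] + [new_row[idx]]
             for idx, a in enumerate(keep)]
        d.append(new_row + [0.0])
        nodes = [nodes[k] for k in keep] + [(nodes[i], nodes[j])]
    return tuple(nodes)

# Hypothetical matrix in which A is close to B and C is close to D.
toy = [[0, 2, 6, 6],
       [2, 0, 6, 6],
       [6, 6, 0, 2],
       [6, 6, 2, 0]]
print(neighbor_joining(["A", "B", "C", "D"], toy))  # (('A', 'B'), ('C', 'D'))
```

Ties in the selection criterion are broken by taking the first pair encountered, so the nesting of the output can depend on the input order even when the unrooted topology is the same.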
The working principle of the method for species classification by comparing protein sequences using amino acid distance polymorphism is as follows: the sequences are compared according to differences in the distances between adjacent amino acids of the same type, a distance matrix is constructed, and a phylogenetic tree is computed from the distance matrix. The method converts differences in the amino acids of a sequence into differences in the distances between amino acids, accounts for gaps without requiring gap insertion, is computationally simple, and meets basic requirements.
The foregoing describes only preferred embodiments of the invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (4)
1. A method for species classification by comparing protein sequences using amino acid distance polymorphism, characterized by comprising the following steps:
S10: number each amino acid in the protein sequence;
S20: calculate the distances between adjacent amino acids of the same type in the protein sequence;
S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same type occurs;
S40: compare the sequences pairwise according to the counted number of times each distance of each amino acid occurs in each protein sequence, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species;
in step S40, the polymorphism of same-type amino acid distances in the proteins is analyzed according to the counted number of times each distance of each amino acid occurs in each protein sequence, and species classification is carried out by constructing a distance matrix and computing a phylogenetic tree;
the analysis of the polymorphism of same-type amino acid distances in step S40 satisfies the formulas $F = 2n_{xy}/(n_x + n_y)$ and $P = -\ln F$, where $n_x$ is the number of times a given distance between adjacent amino acids of the same type occurs in the first of the two protein sequences, $n_y$ is the number of times that distance occurs in the second of the two protein sequences, $n_{xy}$ is the number of occurrences of that distance common to both sequences, i.e. the smaller of $n_x$ and $n_y$, and $P$ is the polymorphism value of the adjacent same-type amino acid distances of the two protein sequences.
2. The method for species classification by comparing protein sequences using amino acid distance polymorphism according to claim 1, characterized in that the amino acid types include any one or more of: alanine, leucine, arginine, lysine, asparagine, methionine, phenylalanine, cysteine, proline, glutamine, serine, glutamic acid, threonine, glycine, tryptophan, histidine, tyrosine, isoleucine, valine, and aspartic acid.
3. The method for species classification by comparing protein sequences using amino acid distance polymorphism according to claim 1, characterized in that in step S20 the distances between adjacent amino acids of the same type are calculated by extracting the numbers corresponding to each type of amino acid in the sequence and then computing the distance between adjacent amino acids of the same type.
4. The method for species classification by comparing protein sequences using amino acid distance polymorphism according to claim 1, characterized in that, to calculate the polymorphism of adjacent same-type amino acid distances in step S40, all protein sequences are compared pairwise, the polymorphism values of all distances of all amino acids are computed and averaged to construct the distance matrix, and the evolutionary tree is built from the distance matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510829185.1A CN105512512B (en) | 2015-11-24 | 2015-11-24 | Method for species classification by comparing protein sequences using amino acid distance polymorphism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510829185.1A CN105512512B (en) | 2015-11-24 | 2015-11-24 | Method for species classification by comparing protein sequences using amino acid distance polymorphism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105512512A CN105512512A (en) | 2016-04-20 |
CN105512512B true CN105512512B (en) | 2019-03-29 |
Family
ID=55720489
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510829185.1A Active CN105512512B (en) | Method for species classification by comparing protein sequences using amino acid distance polymorphism | 2015-11-24 | 2015-11-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105512512B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11456057B2 (en) | 2018-03-29 | 2022-09-27 | International Business Machines Corporation | Biological sequence distance explorer system providing user visualization of genomic distance between a set of genomes in a dynamic zoomable fashion |
CN108846262A (en) * | 2018-05-31 | 2018-11-20 | 广西大学 | The method that RNA secondary structure distance based on DFT calculates phylogenetic tree construction |
CN111341387B (en) * | 2020-02-19 | 2023-06-30 | 吉林大学 | Unidirectional coding unsupervised classification method based on basic component sequence vector |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102332064A (en) * | 2011-10-07 | 2012-01-25 | 吉林大学 | Biological species identification method based on genetic barcode |
CN103559427A (en) * | 2013-11-12 | 2014-02-05 | 高扬 | Method for identifying biological sequence and deducing species genetic relationship through digitals |
Non-Patent Citations (3)
Title |
---|
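A new evolutionary distance for DNA sequences and its application; Liang Liping, et al.; Progress in Biochemistry and Biophysics; 20111231; Vol. 38 (No. 8); full text |
Single nucleotide polymorphism analysis and mapping of the wheat fructan synthase gene 6-SFT-A; Yue Aiqin, et al.; Scientia Agricultura Sinica; 20111231; Vol. 44 (No. 11); full text |
Research on analysis methods for biological sequences and their evolutionary models; Xie Xiaoli; China Doctoral Dissertations Full-text Database, Basic Sciences; 20121115 (No. 11); pp. A006-30 |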
Also Published As
Publication number | Publication date |
---|---|
CN105512512A (en) | 2016-04-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |