CN105512512B

CN105512512B - Method for species classification by comparing protein sequences using amino acid distance polymorphism

Info

Publication number: CN105512512B
Application number: CN201510829185.1A
Authority: CN
Inventors: 孔登; 陈永; 王晓红
Original assignee: Weifang Medical University
Current assignee: Weifang Medical University
Priority date: 2015-11-24
Filing date: 2015-11-24
Publication date: 2019-03-29
Anticipated expiration: 2035-11-24
Also published as: CN105512512A

Abstract

The invention provides a method for species classification by comparing protein sequences using amino acid distance polymorphism, comprising the following steps: S10: number each amino acid in the protein sequence; S20: calculate the distances between adjacent amino acids of the same kind in the protein sequence; S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same kind occurs; S40: compare the sequences pairwise using the statistics from S30, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species. The method converts differences between the amino acids of sequences into differences in the distances between amino acids. It accounts for gaps without requiring gap insertion, is simple, and greatly reduces the amount of computation.

Description

Method for species classification by comparing protein sequences using amino acid distance polymorphism

Technical field

The invention belongs to the field of species identification, and in particular relates to a method for species classification by comparing protein sequences using amino acid distance polymorphism.

Background technique

According to the principles of evolutionary theory, two protein sequences that derive from a common ancestor share a degree of homology, and the more closely related two species are, the higher the homology. Species can therefore be classified according to the order of the amino acids in their protein sequences, and a molecular phylogenetic tree can be established. The most widely used approach at present is the Clustal algorithm proposed by Higgins and Sharp in 1988, which first compares multiple sequences pairwise to construct a distance matrix reflecting the pairwise relationships between sequences, and then computes a phylogenetic tree from that matrix. When two sequences are compared, the simplest case ignores gaps and only selects the starting point of the comparison, but this approach has large errors and rarely reflects the true situation. The most common approach is alignment, in which sequences of different lengths are lined up by inserting gaps. Because gaps can be inserted in many ways, however, the comparison becomes complex and the amount of computation increases greatly.

Therefore, drawing on professional knowledge and experience, and after repeated deliberation and experiment, the present invention was created. It provides a method for species classification by comparing protein sequences using amino acid distance polymorphism, which converts differences between the amino acids of sequences into differences in the distances between amino acids, accounts for gaps without requiring gap insertion, and greatly reduces the complexity of the comparison.

Summary of the invention

The present invention provides a method for species classification by comparing protein sequences using amino acid distance polymorphism. It converts differences between the amino acids of sequences into differences in the distances between amino acids, accounts for gaps without requiring gap insertion, and uses a simple calculation.

The technical solution of the invention is realized as follows: a method for species classification by comparing protein sequences using amino acid distance polymorphism, comprising the following steps:

S10: number each amino acid in the protein sequence;

S20: calculate the distances between adjacent amino acids of the same kind in the protein sequence;

S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same kind occurs;

S40: compare the sequences pairwise according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species.
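Steps S10 to S30 can be sketched for a single sequence as follows. This is a minimal illustration, not the patent's own implementation: the function name `same_residue_distance_counts` and the toy sequence are assumptions made for the example.

```python
from collections import Counter, defaultdict

def same_residue_distance_counts(sequence):
    """Return {amino acid: Counter({distance: occurrences})} for one sequence."""
    last_position = {}              # most recent 1-based position of each residue
    counts = defaultdict(Counter)
    # S10: number the residues starting from 1.
    for position, residue in enumerate(sequence, start=1):
        if residue in last_position:
            # S20: distance to the previous occurrence of the same residue.
            distance = position - last_position[residue]
            # S30: count how often this distance occurs for this residue.
            counts[residue][distance] += 1
        last_position[residue] = position
    return dict(counts)

# Toy sequence (hypothetical): G at positions 1, 3, 4 and A at positions 2, 5.
result = same_residue_distance_counts("GAGGA")
print(result["G"])  # Counter({2: 1, 1: 1})
print(result["A"])  # Counter({3: 1})
```

A single pass that remembers only the last position of each residue is enough, because only distances between adjacent occurrences are needed.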

As a preferred embodiment, the kinds of amino acid include any one or more of: alanine, leucine, arginine, lysine, asparagine, methionine, phenylalanine, cysteine, proline, glutamine, serine, glutamic acid, threonine, glycine, tryptophan, histidine, tyrosine, isoleucine, valine and aspartic acid.

As a preferred embodiment, in step S20 the distances between adjacent amino acids of the same kind in the protein sequence are calculated by extracting, for each kind of amino acid, the numbers assigned to its occurrences in the sequence, and then calculating the distance between each pair of adjacent occurrences.

As a preferred embodiment, in step S40 the polymorphism of same-kind amino acid distances in the proteins is analyzed according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, and species are classified by constructing a distance matrix and computing a phylogenetic tree.

As a preferred embodiment, the analysis in step S40 of the polymorphism of same-kind amino acid distances in the proteins satisfies the formulas F = 2n_xy/(n_x + n_y) and P = -ln F, where, for the two protein sequences being compared, n_x is the number of times a given distance between adjacent amino acids of the same kind occurs in the first sequence, n_y is the number of times that distance occurs in the second sequence, n_xy is the number of occurrences of that distance common to both sequences, i.e. the smaller of n_x and n_y, and P is the diversity value of the adjacent same-kind amino acid distances of the two sequences.
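As a worked illustration with assumed counts (not taken from the patent's data), suppose a given distance of a given amino acid occurs 3 times in the first sequence and 2 times in the second, so that the shared count is the smaller value, 2:

```latex
\[
n_x = 3,\quad n_y = 2,\quad n_{xy} = \min(n_x, n_y) = 2
\]
\[
F = \frac{2 n_{xy}}{n_x + n_y} = \frac{2 \cdot 2}{3 + 2} = 0.8,
\qquad
P = -\ln F = -\ln 0.8 \approx 0.223
\]
```

When the two counts are equal, F = 1 and P = 0. When the distance is absent from one of the sequences, n_xy = 0, so F = 0 and P is undefined; the text does not state how that case is treated.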

As a preferred embodiment, in step S40 the polymorphism of adjacent same-kind amino acid distances is calculated as follows: all protein sequences are compared pairwise, the diversity values for all distances of all amino acids are calculated, their mean is taken to construct the distance matrix, and an evolutionary tree is built from the distance matrix.

With the above technical solution, the beneficial effects of the invention are as follows: sequences are compared according to differences in the distances between adjacent amino acids of the same kind, a distance matrix is constructed, and a phylogenetic tree is computed from the distance matrix. The method converts differences between the amino acids of sequences into differences in the distances between amino acids; it accounts for gaps without requiring gap insertion, the calculation is simple, and it meets basic requirements.

Brief description of the drawings

To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of the invention;

Fig. 2 is the phylogenetic tree constructed by the invention;

Fig. 3 is the phylogenetic tree constructed from sequences aligned with MEGA 6.0 software.

Specific embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

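As shown in Fig. 1, the method of the invention for species classification by comparing protein sequences using amino acid distance polymorphism comprises the following steps:

S10: number each amino acid in the protein sequence;

S20: calculate the distances between adjacent amino acids of the same kind in the protein sequence;

S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same kind occurs;

S40: compare the sequences pairwise according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species.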

The kinds of amino acid include any one or more of: alanine, leucine, arginine, lysine, asparagine, methionine, phenylalanine, cysteine, proline, glutamine, serine, glutamic acid, threonine, glycine, tryptophan, histidine, tyrosine, isoleucine, valine and aspartic acid.

In step S20, the distances between adjacent amino acids of the same kind in the protein sequence are calculated by extracting, for each kind of amino acid, the numbers assigned to its occurrences in the sequence, and then calculating the distance between each pair of adjacent occurrences.

In step S40, the polymorphism of same-kind amino acid distances in the proteins is analyzed according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, and species are classified by constructing a distance matrix and computing a phylogenetic tree.

The analysis in step S40 of the polymorphism of same-kind amino acid distances in the proteins satisfies the formulas F = 2n_xy/(n_x + n_y) and P = -ln F, where, for the two protein sequences being compared, n_x is the number of times a given distance between adjacent amino acids of the same kind occurs in the first sequence, n_y is the number of times that distance occurs in the second sequence, n_xy is the number of occurrences of that distance common to both sequences, i.e. the smaller of n_x and n_y, and P is the diversity value of the adjacent same-kind amino acid distances of the two sequences.

In step S40, the polymorphism of adjacent same-kind amino acid distances is calculated as follows: all protein sequences are compared pairwise, the diversity values for all distances of all amino acids are calculated, their mean is taken to construct the distance matrix, and an evolutionary tree is built from the distance matrix.

Taking a segment of an amino acid sequence as an example, each amino acid is numbered and the positions of the G (glycine) residues are noted.

The distances between adjacent G residues are 5, 17, 1 and 5. The adjacent distances of any other amino acid can be obtained in the same way.
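The distance calculation for G can be reproduced with a short sketch. The 30-residue fragment below is an assumption: the example sequence itself is not reproduced in this text, and this fragment was chosen because it has G at positions 2, 7, 24, 25 and 30, Q at position 17 and C at position 18, which matches every number quoted in the example. The helper name `adjacent_distances` is likewise hypothetical.

```python
def adjacent_distances(sequence, residue):
    """Distances between consecutive occurrences of one residue (1-based numbering)."""
    positions = [i for i, aa in enumerate(sequence, start=1) if aa == residue]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]

# Assumed example fragment (not taken verbatim from the patent text).
fragment = "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTG"

print(adjacent_distances(fragment, "G"))  # [5, 17, 1, 5]
print(adjacent_distances(fragment, "C"))  # [3]
```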

If a mutation occurs, for example the C at position 18 becomes G, the distances between adjacent G residues become 5, 11, 6, 1, 5. Compared with the sequence above, three values (5, 1, 5) are unchanged; that is, only the distance between the two G residues flanking the mutation site is affected (17 is split into 11 and 6), and the remaining values are unchanged. The distances for C are affected as well, while the distances of the other amino acids are unchanged.

If an insertion occurs, for example a G is inserted between the Q at position 17 and the C at position 18, the distances between adjacent G residues become 5, 11, 7, 1, 5. Compared with the first sequence above, three values (5, 1, 5) are identical; that is, only the distance between the two G residues flanking the insertion site is affected, and the remaining values are unchanged.

The adjacent distances of the other amino acids are affected in some cases and unchanged in others, as shown in Table 1.
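The mutation and insertion cases above can be checked with the following sketch. The fragment is an assumed sequence, chosen only because it reproduces the G distances 5, 17, 1, 5 and has Q at position 17 and C at position 18; the helper names are hypothetical.

```python
def adjacent_distances(sequence, residue):
    """Distances between consecutive occurrences of one residue (1-based numbering)."""
    positions = [i for i, aa in enumerate(sequence, start=1) if aa == residue]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]

def changed_residues(before, after):
    """Residues whose list of adjacent distances differs between two sequences."""
    return {aa for aa in set(before) | set(after)
            if adjacent_distances(before, aa) != adjacent_distances(after, aa)}

original = "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTG"        # assumed example fragment
mutated = original[:17] + "G" + original[18:]      # C at position 18 replaced by G
inserted = original[:17] + "G" + original[17:]     # G inserted between positions 17 and 18

print(adjacent_distances(mutated, "G"))    # [5, 11, 6, 1, 5]
print(adjacent_distances(inserted, "G"))   # [5, 11, 7, 1, 5]
print(sorted(changed_residues(original, mutated)))   # ['C', 'G']
print(sorted(changed_residues(original, inserted)))  # ['C', 'E', 'G', 'K', 'V']
```

In this assumed fragment the substitution changes only the distance lists of G and C, whereas the insertion also changes every residue that has one occurrence before the insertion point and its next occurrence after it (here C, E, K and V), since only those spanning distances grow by one.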

If a deletion occurs, its effect is similar to that of an insertion. For multiple amino acid sequences, the adjacent distances of all amino acids of the same kind can be counted and the resulting data compared. According to evolutionary principles, the closer the relationship, the more similar the amino acid arrangement and hence the more similar the distance data; conversely, the more distant the relationship, the greater the variation and the less similar the distance data. Different species can thus be classified.

The cytochrome c sequences of 16 species are taken as an example; the sequence information is given in Table 2:

No.	Species	Accession number	Length
1	Human	NP_061820.1	105
2	Macaque	EHH17434.1	105
3	Ox	NP_001039526.1	105
4	California gray whale	P68100.2	105
5	Chicken	NP_001072946.1	105
6	Penguin	P00017.2	105
7	Lizard	P21665.2	105
8	Snake	JAB54399.1	105
9	Salmon	ACI70114.1	104
10	Gadus	ACQ58603.1	104
11	Longicorn	JAB67143.1	108
12	Diamondback moth	NP_001292408.1	108
13	Arabidopsis	AAA32747.1	112
14	Aspergillus niger	P56205.1	111
15	Burkholderia	EJ062937.1	119
16	Xanthomonas campestris	KLD80214.1	117

The species classification procedure is described further below. First, the residues of each of the 16 sequences above are numbered starting from 1. The numbers of all amino acids of the same kind in each sequence are then extracted, the distance between each pair of adjacent same-kind amino acids is calculated, and the number of times each distance occurs is counted. Table 3 gives the statistics for alanine (A), where N is the distance between two adjacent A residues.

The statistics for the other amino acids are analogous to those for alanine (A) and, for reasons of space, are not listed here.
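The layout of such a per-residue statistics table (one row per distance N, one column per sequence) can be sketched as follows. The two toy sequences and the names `distance_histogram` and `tabulate` are assumptions for illustration; they are not the cytochrome c data of Table 3.

```python
from collections import Counter

def distance_histogram(sequence, residue):
    """Count how often each adjacent distance occurs for one residue."""
    positions = [i for i, aa in enumerate(sequence, start=1) if aa == residue]
    return Counter(later - earlier for earlier, later in zip(positions, positions[1:]))

def tabulate(sequences, residue):
    """Rows of (N, [count in each sequence]) for every distance N that occurs."""
    histograms = {name: distance_histogram(seq, residue)
                  for name, seq in sequences.items()}
    all_distances = sorted(set().union(*histograms.values()))
    return [(n, [histograms[name][n] for name in sequences]) for n in all_distances]

# Toy input (hypothetical sequences).
toy = {"seq1": "AGAAGA", "seq2": "AAGGA"}
for n, row in tabulate(toy, "A"):
    print(n, row)
# 1 [1, 1]
# 2 [2, 0]
# 3 [0, 1]
```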

Based on the distance statistics for adjacent same-kind amino acids above, the 16 protein sequences are substituted pairwise into the formulas F = 2n_xy/(n_x + n_y) and P = -ln F to obtain the diversity values. The P values for all distances of all amino acids are then averaged, and the distance matrix is constructed as shown in Table 4:
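The pairwise calculation described above can be sketched as follows. The function names and toy sequences are assumptions. One further assumption matters: when a distance occurs in only one of the two sequences, n_xy = 0, F = 0 and P = -ln F is undefined, and the text does not say how this is handled. This sketch simply skips such distances, which is a choice made for illustration and may differ from the procedure actually used.

```python
import math
from collections import Counter, defaultdict

def distance_counts(sequence):
    """{residue: Counter({distance: occurrences})} for one sequence."""
    last, counts = {}, defaultdict(Counter)
    for pos, aa in enumerate(sequence, start=1):
        if aa in last:
            counts[aa][pos - last[aa]] += 1
        last[aa] = pos
    return counts

def mean_p(seq_x, seq_y):
    """Mean of P = -ln F over all residues and distances shared by both sequences."""
    cx, cy = distance_counts(seq_x), distance_counts(seq_y)
    p_values = []
    for aa in sorted(set(cx) | set(cy)):
        hx, hy = cx.get(aa, Counter()), cy.get(aa, Counter())
        for d in sorted(set(hx) | set(hy)):
            n_x, n_y = hx[d], hy[d]
            n_xy = min(n_x, n_y)
            if n_xy == 0:
                continue  # ASSUMPTION: skip distances where P would be undefined
            p_values.append(-math.log(2 * n_xy / (n_x + n_y)))
    return sum(p_values) / len(p_values) if p_values else float("nan")

def distance_matrix(sequences):
    """Square matrix of mean P values, in the order of the input mapping."""
    names = list(sequences)
    return [[0.0 if a == b else mean_p(sequences[a], sequences[b]) for b in names]
            for a in names]

toy = {"x": "GAGAGA", "y": "GAGA"}   # hypothetical sequences
matrix = distance_matrix(toy)
print(round(matrix[0][1], 4))  # 0.4055
```

In the toy case each residue has distance 2 occurring twice in x and once in y, so F = 2/3 and P = ln 1.5 for both residues, and the mean is the same value.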

The phylogenetic tree constructed from the distance matrix of Table 4 by the neighbor-joining method is shown in Fig. 2, and the tree constructed by aligning the sequences with MEGA 6.0 software and selecting the neighbor-joining method is shown in Fig. 3. In Fig. 2, human and macaque are grouped together first; ox and California gray whale are grouped together; chicken and penguin are grouped together; salmon and Gadus are grouped together; longicorn and diamondback moth are grouped together; and Burkholderia and Xanthomonas campestris are grouped together. This is essentially consistent with the result in Fig. 3, obtained by aligning the protein sequences with MEGA 6.0 software. The snake is a vertebrate and is closely related to the lizard, but the two are not grouped together. This is because a single protein sequence alone can hardly reflect the true course of evolution, a limitation that no sequence analysis method can avoid. Accurate evolutionary results can be obtained only by comprehensive analysis in combination with other sequence analyses and traditional classification methods such as morphology.
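For readers unfamiliar with the neighbor-joining step, a compact textbook version is sketched below. It is not the software used to produce Fig. 2, which the text does not name, and the 5-taxon matrix is a standard toy example rather than the data of Table 4. The function name `neighbor_joining` and its input format are assumptions.

```python
from itertools import combinations

def neighbor_joining(labels, matrix):
    """Textbook neighbor-joining; returns an unrooted tree as a Newick string.

    labels: taxon names (at least three, all distinct).
    matrix: symmetric square matrix of pairwise distances, same order as labels.
    """
    # Pairwise distances keyed by an unordered pair of node names.
    d = {frozenset((a, b)): matrix[i][j]
         for (i, a), (j, b) in combinations(enumerate(labels), 2)}

    def dist(a, b):
        return d[frozenset((a, b))]

    nodes = list(labels)
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of every node from all others.
        r = {a: sum(dist(a, b) for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * dist(*p) - r[p[0]] - r[p[1]])
        d_ij = dist(i, j)
        len_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d_ij - len_i
        parent = f"({i}:{len_i:g},{j}:{len_j:g})"
        rest = [k for k in nodes if k not in (i, j)]
        for k in rest:
            # Distance from the new internal node to every remaining node.
            d[frozenset((parent, k))] = (dist(i, k) + dist(j, k) - d_ij) / 2
        nodes = rest + [parent]

    # Three nodes remain: connect them at a single central node.
    x, y, z = nodes
    len_x = (dist(x, y) + dist(x, z) - dist(y, z)) / 2
    len_y = (dist(x, y) + dist(y, z) - dist(x, z)) / 2
    len_z = (dist(x, z) + dist(y, z) - dist(x, y)) / 2
    return f"({x}:{len_x:g},{y}:{len_y:g},{z}:{len_z:g});"

# Toy 5-taxon distance matrix (hypothetical, not Table 4).
toy_labels = ["a", "b", "c", "d", "e"]
toy_matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbor_joining(toy_labels, toy_matrix)
print(tree)  # (d:2,e:1,(c:4,(a:2,b:3):3):2);
```

The returned Newick string can be opened in common tree viewers. Ties in Q are broken by input order here, so different tools may draw the same unrooted topology with a different arrangement.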

The working principle of the method for species classification by comparing protein sequences using amino acid distance polymorphism is as follows: sequences are compared according to differences in the distances between adjacent amino acids of the same kind, a distance matrix is constructed, and a phylogenetic tree is computed from the distance matrix. The method converts differences between the amino acids of sequences into differences in the distances between amino acids; it accounts for gaps without requiring gap insertion, the calculation is simple, and it meets basic requirements.

The above are merely preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for species classification by comparing protein sequences using amino acid distance polymorphism, characterized in that it comprises the following steps:

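S10: number each amino acid in the protein sequence;

S20: calculate the distances between adjacent amino acids of the same kind in the protein sequence;

S30: for each protein sequence, count the number of times each distinct distance between adjacent amino acids of the same kind occurs;

S40: compare the sequences pairwise according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, construct a distance matrix, compute a phylogenetic tree from the distance matrix, and classify the species;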

in step S40, the polymorphism of same-kind amino acid distances in the proteins is analyzed according to the counted number of times each distinct distance of each kind of amino acid occurs in each protein sequence, and species are classified by constructing a distance matrix and computing a phylogenetic tree;

the analysis in step S40 of the polymorphism of same-kind amino acid distances in the proteins satisfies the formulas F = 2n_xy/(n_x + n_y) and P = -ln F, where, for the two protein sequences being compared, n_x is the number of times a given distance between adjacent amino acids of the same kind occurs in the first sequence, n_y is the number of times that distance occurs in the second sequence, n_xy is the number of occurrences of that distance common to both sequences, i.e. the smaller of n_x and n_y, and P is the diversity value of the adjacent same-kind amino acid distances of the two sequences.

2. The method for species classification by comparing protein sequences using amino acid distance polymorphism according to claim 1, characterized in that the kinds of amino acid include any one or more of: alanine, leucine, arginine, lysine, asparagine, methionine, phenylalanine, cysteine, proline, glutamine, serine, glutamic acid, threonine, glycine, tryptophan, histidine, tyrosine, isoleucine, valine and aspartic acid.

3. The method for species classification by comparing protein sequences using amino acid distance polymorphism according to claim 1, characterized in that, in step S20, the distances between adjacent amino acids of the same kind in the protein sequence are calculated by extracting, for each kind of amino acid, the numbers assigned to its occurrences in the sequence, and then calculating the distance between each pair of adjacent occurrences.

4. The method for species classification by comparing protein sequences using amino acid distance polymorphism according to claim 1, characterized in that, in step S40, the polymorphism of adjacent same-kind amino acid distances is calculated as follows: all protein sequences are compared pairwise, the diversity values for all distances of all amino acids are calculated, their mean is taken to construct the distance matrix, and an evolutionary tree is built from the distance matrix.