CN111462812A - Multi-target phylogenetic tree construction method based on feature hierarchy - Google Patents

Multi-target phylogenetic tree construction method based on feature hierarchy Download PDF

Info

Publication number
CN111462812A
CN111462812A CN202010168038.5A CN202010168038A CN111462812A CN 111462812 A CN111462812 A CN 111462812A CN 202010168038 A CN202010168038 A CN 202010168038A CN 111462812 A CN111462812 A CN 111462812A
Authority
CN
China
Prior art keywords
value
population
tree
feature
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010168038.5A
Other languages
Chinese (zh)
Other versions
CN111462812B (en
Inventor
冯筠
刘泽云
刘蒙
侯刚
冯宏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern University
Original Assignee
Northwestern University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern University filed Critical Northwestern University
Priority to CN202010168038.5A priority Critical patent/CN111462812B/en
Publication of CN111462812A publication Critical patent/CN111462812A/en
Application granted granted Critical
Publication of CN111462812B publication Critical patent/CN111462812B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/126Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Physiology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-target phylogenetic tree construction method based on feature hierarchy, which combines the feature description of morphological data, firstly generates corresponding feature hierarchy relation, and estimates missing values by combining the dependency relation between features, thereby obtaining a complete data set; and then, the maximum reduction principle and the maximum likelihood principle based on the inapplicable Fitch algorithm are combined to construct the parallel multi-target phylogenetic tree, so that the problem caused by the inapplicable data can be well measured, and the limitation of tree construction under a single principle is avoided. Compared with the traditional phylogenetic analysis method, the method can well solve the problem of uncertainty of the tree structure caused by missing data and inapplicable data, and provides a basis for biologists to research species evolution through various tree building principles.

Description

Multi-target phylogenetic tree construction method based on feature hierarchy
Technical Field
The invention belongs to the field of biological information, relates to development tree analysis and construction in phylogenetic research, and particularly relates to a multi-target phylogenetic tree construction method based on feature hierarchy.
Background
Phylogenetic studies are directed to evolutionary relationships, such as evolutionary history and relationships between species or populations, to understand when and when speciation events may occur, and thus to study species origin. Phylogenetic analysis is to infer and evaluate these evolutionary relationships, with the goal of finding phylogenetic trees that fit species evolution. For a clearer understanding of the study, it can be seen from fig. 1: the species characteristic matrix can be obtained by performing morphological data arrangement on the collected fossil; and performing phylogenetic analysis on the morphological data to obtain a phylogenetic tree capable of reflecting the species evolution relationship.
Morphological phylogeny morphological characteristics are limited compared to molecular phylogeny, and morphological-based phylogeny analysis is slow. For the phylogenetic analysis of early ancient organisms, due to the reasons of long age, change of storage environment, unstable molecular data, limitation of fossil storage and observation means and the like, only morphological data are available as materials for phylogenetic analysis, and the morphological data is an important basis for carrying out phylogenetic analysis of the ancient organisms. Due to the loss of fossil records and the existence of a species characteristic hierarchical relationship, morphological data often contain a large amount of missing data and inapplicable data (note that the inapplicable data appears when the existence of a certain characteristic in a species depends on the nonexistent characteristic), and the existing method cannot effectively process 'problem' data, so that the construction of a phylogenetic tree is inaccurate, and troubles are caused for a biological scientist to research species evolution.
Meanwhile, the commonly used phylogenetic analysis method comprises a distance-based method and an optimal principle-based method, wherein the distance-based method mainly comprises an adjacency method and a UPGMA; the latter mainly includes maximum reduction method, maximum likelihood method and Bayes inference method. Whereas the existing mainstream tools are based on a single optimization principle. Typically, multiple methods are performed on a single modality data set. However, the results of multiple phylogenetic inferences may be conflicting or consistent. When the goals conflict with each other, there is generally no way to satisfy all of the goals.
Disclosure of Invention
Aiming at the technical problem that the construction of a phylogenetic tree is unstable due to the existence of missing data and inapplicable data in morphological data in the prior art, the invention aims to provide a multi-target phylogenetic tree construction method based on feature hierarchy, which combines prior knowledge to generate a feature hierarchical relationship, analyzes the morphological data distribution condition under the feature dependency relationship by taking the feature hierarchical relationship as a support, and further estimates the missing data so as to obtain a complete morphological data set; starting from complete morphological data, the inapplicable Fitch algorithm proposed in Martin D.Brazeau (2017) is adopted to measure the inapplicable data, and the maximum simplification principle and the maximum likelihood principle are combined to construct a multi-target phylogenetic tree, so that a Pareto optimal tree group which simultaneously meets the maximum simplification principle and the maximum likelihood principle is generated, and a basis is provided for a biological scientist to research species evolution.
In order to realize the task, the invention adopts the following technical solutions:
a multi-target phylogenetic tree construction method based on feature hierarchy is characterized by comprising the following steps:
step one, constructing and formalizing a characteristic hierarchical relationship
Analyzing the dependency relationship among the features by combining the feature description, the prior knowledge and other related information, thereby constructing a feature hierarchical relationship; formally expressing the constructed characteristic hierarchical relationship to convert the hierarchical relationship into a data matrix which can be processed by a computer;
step two, estimating the form data missing value based on the Markov Monte Carlo algorithm of the feature level
Step 2.1, to incomplete form matrix DincomInitializing missing values, and performing random interpolation on missing data according to column values;
step 2.2, column analysis is carried out by combining the characteristic hierarchical relationship, and the parameter distribution of each column is analyzed so as to estimate the posterior distribution of the parameters, which mainly comprises the following conditions:
(1) the master feature is present, but no slave feature:
calculating the value distribution of the subordinate characteristics according to different values of the main characteristics;
(2) dependent features are present, but no master features are present:
calculating the value distribution of the main characteristic according to different values of the subordinate characteristic;
(3) there are dependent features, and there are main features:
respectively calculating value distribution of the main characteristic and the subordinate characteristic;
(4) neither the dependent nor the main feature is present:
independently calculating the value distribution of the characteristics;
step 2.3, recording the interpolation current value xtAnd randomly sampling to determine candidate value xt+1I.e. the probability density distribution Q (x) obtained according to step 2.2t+1|xt) Generating candidate samples xt+1(ii) a Will present value xtAnd a candidate value xt+1Separately for each substitution in the column analysis, x is calculated separatelytAnd xt+1And the probability density function values under the existing observation column distribution are marked as pOld and pNew. Extracting a random number U from U (0,1), if U<pNew/pOld, the candidate x is acceptedt+1Newly sampling missing values; otherwise, the current missing interpolation value x is maintainedtThe change is not changed;
step 2.4, iteratively executing the step 2.2-2.3 until the iteration times k are met, and outputting the final interpolated form data
Figure BDA0002408166610000041
Is marked as Dcom
Step three, constructing a multi-target phylogenetic tree
Step 3.1, according to the complete form matrix DcomBuilding an initialization Tree group P1The tree group size is N. Respectively calculating a reduced value score and a likelihood value score of each individual in the tree group:
given a phylogenetic tree structure τ where the set of nodes is V (τ) and the set of edges is E (τ), the reduced value score can be expressed as:
Figure BDA0002408166610000042
wherein, wjWeight, v, representing feature jjAnd ujRespectively representing the characteristic state values of the nodes v and u at the bit point j, C (v)j,uj) Representing slave state v as a cost matrixjTransition to state ujThe cost of (d); the specific feature variation calculation is described in martind. brazeau (2017), et al, which proposes a non-applicable pitch algorithm.
The likelihood L (τ) of the phylogenetic tree τ is the probability that the observation D and evolution model M that produced the tree — L (τ) ═ P (D | τ, M.) the likelihood function L (τ) of the tree τ given D and M can be written as:
Figure BDA0002408166610000043
wherein, P (D)iand/T, M) is the likelihood value of the ith position.
Each individual is encoded with a chromosome structure { tree structure, reduction score, likelihood score }.
Step 3.2, for the initial population P1And (3) performing rapid non-dominated sorting and congestion degree calculation, wherein the specific optimization process is as follows:
aiming at each generation of evolutionary population, finding out the current optimal individual, wherein the current optimal solution called as an evolutionary population is a non-dominant solution; the set of all non-dominant solutions is called a non-dominating set of the current evolutionary population, and the non-dominating set is made to continuously approach a real optimal solution set to finally reach the optimal.
The rapid non-dominant sorting is to layer the population according to individual non-inferior solutions, so that the search is carried out towards the direction of a Pareto optimal solution set; the specific process is as follows:
setting the evolutionary population as P, and simultaneously setting a non-dominating set P of the evolutionary population*(ii) a First, the first individual is placed in a non-dominating set P*In turn, the individuals in the evolved population P
Figure BDA0002408166610000051
Taking out and putting in P*Simultaneously taking out P and P in sequence*Comparing all individuals in (1), and deleting P*All individuals in (1) are dominated by P if P is dominated by P*Is subject to, P is taken from P*Is deleted.
To enable selective ordering within individuals having the same non-dominant order set, an individual crowding distance is used for the metric; the specific calculation process is as follows:
let P [ i ]]distanceIs the aggregation distance of individual i, P [ i]M is the function value of the individual i on the sub-target m, and the calculation formula is as follows:
P[i]distance=(P[i+1].ps-P[i-1].ps)+(P[i+1].l-P[i-1].l)
wherein ps and l are the simple score and the likelihood score of the tree individual respectively.
And 3.3, carrying out chromosome selection, crossing and mutation operations on the current population, and generating a new population.
Step 3.4, merging the parent population and the child population, performing rapid non-dominated sorting, simultaneously performing crowding degree calculation on the individuals in each non-dominated layer, and selecting proper individuals according to the non-dominated relationship and the crowding degree of the individuals to form a new parent population;
and 3.5, repeating the operation until the condition of ending the program is met.
The multi-target phylogenetic tree construction method based on the feature hierarchy can well measure the problem caused by the inapplicable data, avoids the limitation of tree construction under a single principle, and provides a basis for biologists to research species evolution. Compared with the prior art, the method has the following technical innovation:
1. in the aspect of morphological data missing value estimation, compared with the traditional missing value processing method (such as feature quantity increase, missing value interpolation or non-processing), the method analyzes feature distribution caused by feature dependence by combining a hierarchical structure of features, fuses the feature distribution and missing value estimation, and reduces the influence of missing data on the instability of phylogenetic analysis. The shape data is subjected to missing estimation through a Markov Monte Carlo missing interpolation algorithm, so that the interpolation value can well accord with the distribution of the original data.
2. In the aspect of morphological inapplicable data, an inapplicable Fitch algorithm (Martin D. Brazeau, 2017) is adopted, inapplicable data and missing data can be correctly distinguished and measured, and guarantee is provided for accurately constructing a phylogenetic tree.
3. In the aspect of phylogenetic tree construction, compared with the traditional phylogenetic tree construction tool with a single target, the method adopts a multi-target phylogenetic tree construction method and combines a maximum reduction principle and a maximum likelihood principle to construct the phylogenetic tree, so that a phylogenetic tree group capable of meeting various principles is obtained, and the problem that results of executing various methods on the same morphological data set are possibly conflicted is avoided.
Drawings
FIG. 1 is an exemplary diagram of fossil, morphological data and phylogenetic tree, wherein FIGS. A-F are fossil pictures; panel G is the species signature matrix and panel H is the phylogenetic tree.
FIG. 2 is a feature description and feature hierarchy relationship diagram.
FIG. 3 is a schematic diagram of accuracy evaluation of interpolation results based on MCMC morphological data; wherein panel (a) is canidae, panel (b) is terroris canidae, panel (c) is cambrian poda, and panel (d) is Hawaiian platynin.
FIG. 4 is a schematic diagram illustrating evaluation of tree building accuracy based on interpolation results of MCMC morphological data; wherein panel (a) is canidae, panel (b) is terroris canidae, panel (c) is cambrian poda, and panel (d) is Hawaiian platynin.
FIG. 5 is a schematic diagram of the generation of new population by non-dominated sorting and congestion calculation in step three;
FIG. 6 is a schematic diagram of the construction of the multi-objective phylogenetic tree in step three;
FIG. 7 is a schematic diagram of the mutation operation of the multi-objective phylogenetic tree construction in step three;
FIG. 8 is a diagram of a Pateto treelet set generated based on a multi-objective phylogenetic tree;
FIG. 9 is a set of trees generated based on a multi-objective phylogenetic tree with TNT and Mrbayes;
FIG. 10 is a graph of distance between a Pateto tree set and a standard tree generated based on deletion interpolation and multi-objective phylogenetic trees;
FIG. 11 is an overall flowchart of the method for constructing a multi-objective phylogenetic tree based on feature hierarchy according to the present invention.
The present invention will be described in further detail with reference to the following drawings and examples.
Detailed Description
As shown in fig. 1 to fig. 11, the embodiment provides a method for constructing a multi-objective phylogenetic tree based on feature hierarchy, and the detailed steps are as follows:
step one, constructing and formalizing a hierarchical relationship of expression characteristics
Step 1.1, finding the dependency relationship among the characteristics by combining the prior knowledge and the corresponding characteristic description, and determining the main characteristics, the subordinate characteristics and the logic constraint between the main characteristic value and the subordinate characteristic value. For better understanding, fig. 2 is taken as an example for illustration. According to the characteristic description, the following steps are shown: the values of the characteristic "tail" include "absence (0)", "presence (1)"; the value of the feature "tail color" includes "blue (0)", "red (1)" and "inapplicable (-)", and the value of the feature "tail" is "inapplicable (-)" if and only if the value of the feature "tail" is "absent (0)"; the value of the characteristic "tail length" includes "tail short (0)", "tail long (1)", and "inapplicable (-)", and the value of the characteristic "tail" is "inapplicable (-)" if and only if the value of the characteristic "tail" is "absent (0)"; the values of the feature "eye" include "absence (0)", "presence (1)". Thus, it can be concluded that the feature "tail" is the main feature of "tail color" and "tail length". The following can be obtained by analysis: the main characteristic is 'tail', the corresponding subordinate characteristics are 'tail color' and 'tail length', and the 'eye' has no subordinate characteristic;
and step 1.2, formally defining the characteristic hierarchical relationship obtained in the step 1.1, so that the characteristic hierarchical relationship can be read and analyzed by a computer.
Assume the species signature matrix is D { X }1,…,XnIn which Xi(i∈[1,n]) Representing the characteristics of the ith species, wherein n is the number of the species; species XiHas a characteristic distribution of Xi(xi1,xi2,…,xim) M is a characteristic number, where xip(p∈[1,m]) Represents species XiP characteristic state. The feature set is denoted as { y1,y2,…,ym}. thus, the feature hierarchy relationship matrix may be represented by a matrix C of dimension m × m, where C isijIs the vertex yiAnd vertex yjThe weight value in between.
If the feature yi(0<i ≦ m) logically depends on feature yj(0<j ≦ m), i.e., feature yiAnd feature yjWhen a value dependent relationship exists, the characteristic y is assumedjWhen the value is v, it is the feature yiMain characteristic of (1), then CijThe value is v. In other words, when the main feature yjWhen the value is v, the dependent characteristic yiIs in an applicable state; when the main feature yjDependent feature y when the value is not viIs in an inapplicable state. If when the feature yiAnd feature yjNo value dependency, then CijThe value is ∞. In summary, the feature level relationship matrix calculation formula is as follows:
Figure BDA0002408166610000081
by referring to the calculation method of the hierarchical relationship matrix of features, the hierarchical relationship of features shown in fig. 2 is still used for analysis, and a corresponding hierarchical relationship matrix (shown in table) can be obtained. As can be seen from the first column of the table, when the tail characteristic value is 1, the characteristic tail is the main characteristic of the tail color and the tail length; otherwise, the tail color and tail length values are in an inapplicable state. Columns 2-4 in the table show that the characteristic tail color, tail length, eye are not sufficient to prove to be the main characteristic of the other characteristics.
Table 1: feature hierarchical relationship matrix
Figure BDA0002408166610000091
If the shape data has missing data, the data set D can be divided into observation data YobsAnd missing data YmissExpressed as D ═ Yobs,YmissFourthly, performing morphological data missing estimation in the second step; if the shape data has no missing data, step two can be skipped and step three can be directly executed.
Step two, estimating the form data missing value based on the Markov Monte Carlo algorithm of the feature level
Step 2.1, according to the value range of each row of the form data, carrying out random interpolation on the missing data according to the row values;
and 2.2, performing column analysis on the form data by combining the characteristic hierarchical relation matrix obtained in the step 1.2, and analyzing the parameter distribution of each column so as to estimate the posterior distribution of the parameters. According to observable data YobsAnd missing data debit value
Figure BDA0002408166610000092
Obtaining probability density distribution theta under current datat. The specific calculation formula is as follows (where Pr is a specific probability density function):
Figure BDA0002408166610000093
step 2.3, recording the interpolation current value xtAccording to YobsAnd a predetermined YmissDistributed random sampling to successively determine candidate value xt+1. Candidate sample xt+1Satisfies the probability density distribution theta obtained in step 2.2(t). At this point, the missing data is already included in the sample as a distribution parameter. The specific calculation formula is as follows:
Figure BDA0002408166610000101
will present value xtAnd a candidate value xt+1Separately for each substitution in the column analysis, x is calculated separatelytAnd xt+1And the probability density function values under the existing observation column distribution are marked as pOld and pNew. Extracting a random number U from U (0,1), if U<pNew/pOld, the candidate x is acceptedt+1Newly sampling missing values; otherwise, the current missing interpolation value x is maintainedtAnd is not changed.
Step 2.4, iteratively executing the step 2.2-2.3 until the iteration times k are met, and outputting the final interpolated form data
Figure BDA0002408166610000102
Is marked as Dcom. According to the operation, the Markov chain can be obtained through multiple iterations
Figure BDA0002408166610000103
The result after deletion estimation can be obtained from the Markov chain
Figure BDA0002408166610000104
When k is sufficiently large, it is preferable that,
Figure BDA0002408166610000105
will converge to the same distribution, i.e. a smooth distribution.
Further, the specific process of the second step comprises:
the species signature matrix dataset D containing missing data may be represented as D ═ Yobs,YmissIn which Y isobsRepresenting observed data, YmissIndicating missing data.
Figure BDA0002408166610000106
Figure BDA0002408166610000107
The morphological data missing estimation algorithm comprises the following specific steps:
i, setting initial value
Figure BDA0002408166610000108
And ordert is 0, wherein p is the number of deletions in the morphology dataset;
II, calculating probability density distribution under the t sampling
Figure BDA0002408166610000111
III, sampling for the t +1 th time, and specifically, the following steps:
a) extracting from the distribution of the column characteristic values of the 1 st missing value
Figure BDA0002408166610000112
Make it satisfy distribution
Figure BDA0002408166610000113
b) Extracting from the distribution of the column characteristic values of the 2 nd missing value
Figure BDA0002408166610000114
Make it satisfy distribution
Figure BDA0002408166610000115
c) Extracting from the distribution of the characteristic values of the column where the ith missing value is located
Figure BDA0002408166610000116
Make it satisfy distribution
Figure BDA0002408166610000117
……
d) Extracting from the column characteristic value distribution of the p-th missing value
Figure BDA0002408166610000118
Make it satisfy distribution
Figure BDA0002408166610000119
I, record
Figure BDA00024081666100001110
To Y'missIf the relative change amount exceeds
Figure BDA00024081666100001111
Greater than the random number U in U (0,1), then accept
Figure BDA00024081666100001112
Newly sampling missing values; otherwise, the current sampling value is maintained
Figure BDA00024081666100001113
Is not changed, i.e.
Figure BDA00024081666100001114
Can be written as
Figure BDA00024081666100001115
And II, repeating the step k times, and when k is large enough, the Markov chain is gradually converged, and the state distribution tends to be stable, so that a plurality of deletion estimated values can be obtained simultaneously.
In order to verify the effectiveness of the feature hierarchy-based multi-objective phylogenetic tree construction method in missing value estimation, experiments are respectively carried out on 4 morphological data sets, wherein the data sets comprise canidae, ciconidae, cambrian quadruped and Hawaiian Platynini. The complete morphological data set is obtained by carrying out random deletion on the morphological data set, wherein the deletion ratio is 5% -95%, and then carrying out deletion value estimation on the data set subjected to random deletion by adopting an MCMC deletion estimation algorithm.
In this embodiment, the deletion estimation performance is estimated based on the proximity of the interpolation value and the true value, the data set before random deletion is compared with the data set after MCMC deletion estimation, and compared with mic packet interpolation, random forest interpolation, fixed value interpolation, and random interpolation, and the quality of interpolation is estimated using Root Mean Square Error (RMSE), and the calculation formula is as follows:
Figure BDA0002408166610000121
the results of the experiment are shown in FIG. 3. Experimental results show that the multi-target phylogenetic tree construction method based on the feature hierarchy has a lower root mean square error and a maximum fixed value interpolation error compared with other methods. With the gradual increase of the deletion ratio, the root mean square error after interpolation by several methods is increased continuously. However, because the method provided in the embodiment finds the constraint of the logic value between the features through the hierarchical relationship of the features, a relatively low root mean square error can be obtained under the condition of high loss. For the data set after deletion estimation, the invention defines the topological difference rate for measuring the accuracy of the data set after deletion estimation on phylogenetic analysis, and the topological difference rate is specifically defined as a real tree T1With a single tree T randomly selected among the shortest trees for a given search2The symmetric difference rate between the two is calculated according to the following formula:
Figure BDA0002408166610000122
wherein n is the number of species, | split (tree) | represents the number of "partition" tree sets of tree, | split (T)1)∩split(T2) I represents T1And T2The same number of sets of partition trees. The experimental result is shown in fig. 4, and the result shows that, compared with several other methods, the method for constructing the multi-target phylogenetic tree based on the feature hierarchy provided in this embodiment has a smaller difference between the tree structure constructed by the interpolated data set and the standard tree, and the next time is random forest interpolation.
As can be seen from fig. 3 and 4, there is no direct relationship between the interpolation accuracy and the symmetric difference rate, that is, the symmetric difference rate is not necessarily low when the phylogenetic tree is constructed on the data set with high interpolation accuracy. The multi-target phylogenetic tree construction method based on the feature hierarchy provided by the embodiment can have a relatively low root mean square error by combining the feature hierarchy relationship, and the tree construction result is closer to a standard tree structure.
Step three, constructing a multi-target phylogenetic tree
Step 3.1, according to the complete form matrix DcomBuilding an initialization Tree group P1The tree group size is N. A reduced value score and a likelihood value score are calculated for each individual in the tree group separately.
Each individual is encoded with a chromosome structure { tree structure, reduction score, likelihood score }.
Step 3.2, for the initial population P1And performing fast non-dominated sorting and congestion degree calculation. And the non-dominant solution quick sequencing is to layer the population according to individual non-inferior solutions so as to enable the search to be carried out towards the Pareto optimal solution set direction. Aiming at each generation of evolutionary population, finding out the current optimal individual, namely a non-dominant solution (non-dominant solution) of the evolutionary population; the set of all non-dominated solutions is called a non-dominated solution set (NDSet) of the current evolutionary population, and the non-dominated solution set is continuously approximated to a real optimal solution set to finally reach the optimal value.
Defining: let P and q be any two different individuals in the evolutionary population P, if P dominates q, then it is noted that P > q:
i, p is no worse than q for all sub-targets. I.e. fk(p)≤fk(q)(k=1,2,…r);
II, at least one sub-target exists, and p is better than q. Namely, it is
Figure BDA0002408166610000131
So that fl(p)<fl(q)。
Setting the evolutionary population as P, and simultaneously setting a non-dominating set P of the evolutionary population*The specific optimization process is as follows:
first, the first individual is placed in a non-dominating set P*In turn, the individuals in the evolved population P
Figure BDA0002408166610000141
Taking out and putting in P*Simultaneously taking out P and P in sequence*Is compared with all of the individuals in (a),deleting P*All individuals in (1) are dominated by P if P is dominated by P*Is subject to, P is taken from P*Is deleted.
For a better understanding, the non-dominated building process is briefly described:
first, find the non-dominant solution in the population, which is marked as the first non-dominant layer F1Assigning a non-dominant ranking i to all of its individualsrank1 (wherein i)rankIs the non-dominant rank value of individual i) and the individuals of this non-dominant layer are removed from the entire population; then, the non-dominant solution set in the rest population is continuously found out and is marked as F2All of them are assigned a non-dominant rank value i rank2; and (4) circulating the operation until the whole population is completely layered. Individuals within the same hierarchy have the same non-dominant order value irank
To be able to have the same irankAnd selective ordering is carried out in the individuals, and the individual crowding distance is adopted for measurement. The crowding distance of the individual i is the distance between two individuals i +1 and i-1 adjacent to the individual i in the target space, and the specific calculation steps are as follows:
and I, initializing the distance of the same layer. Let P [ i]distance0 (wherein P [ i ]]distanceIndicates the crowding distance of any individual i);
II, arranging the individuals on the same layer in an ascending order according to the m function values;
III, making the individuals on the sorted edges have a selective advantage. Given a large number W, let P [0 ]]distance=P[i]distance=W;
IV, calculating the crowdedness P [ i ] of the individuals i in the middle of the sequence]distance
P[i]distance=(P[i+1].f1-P[i-1].f1)+(P[i+1].f2-P[i-1].f2)
Wherein f is1,f2Respectively, the simple score and the likelihood score of the tree individual, P [ i]M is the function value of the individual i on the sub-target m.
Repeating the steps II to IV for different target functionsOperation is performed to obtain the crowding distance P [ i ] of the individual i]distanceBy preferentially selecting individuals with large crowding distances, the calculation results can be distributed relatively uniformly in the target space to maintain the diversity of the population.
For a clearer understanding of the non-dominated sorting and the congestion calculation and sorting process, fig. 5 is taken as an example for illustration. Firstly, the t generation parent generation group PtAnd the progeny population Q produced in the t generationtThe new population R with the scale of 2N is generated by combinationt. Then, for the new population RtFast sorting of the non-dominant solution is carried out, and a plurality of non-dominant layers F are generated1,F2,F3…, wherein the non-dominant layer priorities are decremented from layer to layer. Then, the crowdedness of the individuals in each non-dominant layer is calculated and ranked, and the first N individuals are selected to form a new population Pt+1. In FIG. 5, there is a non-dominant layer F1And F2And the non-dominant layer F3A part of individuals are selected into a new group Pt+1
And 3.3, carrying out chromosome selection, crossing and mutation operations on the current population, and generating a new population.
1) Selecting operation: tournament selection is used to pick individuals from parents to generate offspring populations.
The method comprises the following specific steps:
i, determining the individual quantity pool/road selected each time;
and II, randomly selecting individuals from the population to form a group, and selecting the individual with the best fitness value to enter a filial generation population according to the fitness value of each individual.
And (5) repeating the step II N times, wherein the obtained individuals form a new generation of population.
2) And (3) cross operation: the original superior genes are inherited to next generation individuals and new individuals containing more complex gene structures are generated.
Firstly, randomly taking out a pair of individuals to be mated from a mating pool; according to the cross probability Pc(0<Pc1) to carry out the crossover operation, thereby forming a new pair of individuals.
The specific interleaving steps are shown in fig. 6: given two trees T1And T2From tree T1Pruning a part of subtrees s; then, the tree T is removed2All leaf nodes appear in s; by replanting subtrees to T2Randomly selecting branches to obtain a descendant subtree T'1
3) Mutation operation: since the phylogenetic tree needs to be mutated, the present embodiment adopts a branch exchange algorithm, which is commonly used in phylogenetic analysis, to implement nearest neighbor exchange (NNI).
The specific process is shown in fig. 7: first, an internal branch is selected, two branches are respectively connected to two sides of the internal branch, and the two branches are neighbors of each other. Interchanging one branch on one side of the internal branch with one branch on the other side, thereby enabling interchanging phylogenetic trees of possible nearest neighbors resulting from a phylogenetic tree mutation.
Step 3.4, merging the parent population and the offspring population, performing rapid non-dominated sorting and congestion degree calculation, and selecting proper individuals to form a new parent population according to the non-dominated relationship and the congestion degree of the individuals;
and 3.5, repeating the operation until the condition of ending the program is met.
The inventors performed experiments on 4 data sets (canidae, terroris canidae, cambrian cladophora, Hawaiian platynii) and fig. 8 is a graph of Pareto treegand generated from experiments performed on canidae data sets. The existing phylogenetic analysis software TNT and mrboyes are compared, 500 phylogenetic tree sets are generated at the same time, the intersection set of the sets is calculated to calculate the average, the method is used for evaluating the effectiveness of the feature level-based multi-target phylogenetic tree construction method provided by the embodiment, and the experimental result is shown in fig. 9. The experimental result shows that the generated tree structure of the multi-target phylogenetic tree construction method based on the feature hierarchy can generate more intersections with the existing tools, and the 65.4% of the tree structures generated by the method can be consistent with the phylogenetic tree generated by the existing software.
Meanwhile, the symmetry difference measurement is performed on the multi-target phylogenetic tree construction method based on the feature hierarchy, the tree set constructed by TNT and Mrbayes and the standard tree, and the experimental result is shown in fig. 10. Experimental results show that the symmetry differences of the multi-target phylogenetic tree construction method based on the feature hierarchy are respectively 13.2, 16.5, 37.8 and 53.1, which is entirely lower than the existing methods, that is, the multi-target phylogenetic tree construction method based on the feature hierarchy provided in the embodiment can generate results more similar to standard trees.
In summary, the multi-target phylogenetic tree construction method based on the feature hierarchy of the embodiment can better process missing and unapplicable data in morphological data, can generate a phylogenetic tree set satisfying a maximum reduction method and a maximum likelihood method at the same time, and has a certain application prospect.

Claims (2)

1. A multi-target phylogenetic tree construction method based on feature hierarchy is characterized by comprising the following steps:
step one, constructing and formalizing a characteristic hierarchical relationship
Analyzing the dependency relationship among the features by combining the feature description, the prior knowledge and other related information, thereby constructing a feature hierarchical relationship; formally expressing the constructed characteristic hierarchical relationship to convert the hierarchical relationship into a data matrix which can be processed by a computer;
step two, estimating a form data missing value based on a Markov Monte Carlo algorithm of a feature level, wherein the specific method is as follows;
step 2.1, to incomplete form matrix DincomInitializing missing values, and performing random interpolation on missing data according to column values;
step 2.2, column analysis is carried out by combining the characteristic hierarchical relationship, and the parameter distribution of each column is analyzed so as to estimate the posterior distribution of the parameters, which mainly comprises the following conditions:
(1) the master feature is present, but no slave feature:
calculating the value distribution of the subordinate characteristics according to different values of the main characteristics;
(2) dependent features are present, but no master features are present:
calculating the value distribution of the main characteristic according to different values of the subordinate characteristic;
(3) there are dependent features, and there are main features:
respectively calculating value distribution of the main characteristic and the subordinate characteristic;
(4) neither the dependent nor the main feature is present:
independently calculating the value distribution of the characteristics;
step 2.3, recording the interpolation current value xtAnd randomly sampling to determine candidate value xt+1I.e. the probability density distribution Q (x) obtained according to step 2.2t+1|xt) Generating candidate samples xt+1(ii) a Will present value xtAnd a candidate value xt+1Separately for each substitution in the column analysis, x is calculated separatelytAnd xt+1Probability density function values under the existing observation column distribution are marked as pOld and pNew;
extracting a random number U from U (0,1), and accepting a candidate value x if U < pNew/pOldt+1Newly sampling missing values; otherwise, the current missing interpolation value x is maintainedtThe change is not changed;
step 2.4, iteratively executing the step 2.2-2.3 until the iteration times k are met, and outputting the final interpolated form data
Figure FDA0002408166600000021
Is marked as Dcom
Step three, constructing a multi-target phylogenetic tree
Step 3.1, according to the complete form matrix DcomBuilding an initialization Tree group P1And the size of the tree group is N, and the reduction value score and the likelihood value score of each individual in the tree group are respectively calculated as follows:
given a phylogenetic tree structure τ where the set of nodes is V (τ) and the set of edges is E (τ), the reduced value score can be expressed as:
Figure FDA0002408166600000022
wherein, wjWeight, v, representing feature jjAnd ujRespectively representing the characteristic state values of the nodes v and u at the bit point j, C (v)j,uj) Representing slave state v as a cost matrixjTransition to state ujThe cost of (d);
the likelihood L (τ) of the phylogenetic tree τ is the probability that the observed data D and the evolutionary model M that produced the tree, i.e., L (τ) ═ P (D | τ, M), the likelihood function L (τ) of the tree τ given D and M can be written as:
Figure FDA0002408166600000023
wherein, P (D)ithe/T, M) is the likelihood value of the ith position;
encoding each individual, wherein the chromosome structure is { tree structure, reduction score and likelihood score };
step 3.2, for the initial population P1Performing rapid non-dominated sorting and congestion degree calculation;
step 3.3, carrying out chromosome selection, cross operation and mutation operation on the current population, and generating a new population;
step 3.4, merging the parent population and the child population, performing rapid non-dominated sorting, simultaneously performing crowding degree calculation on the individuals in each non-dominated layer, and selecting proper individuals according to the non-dominated relationship and the crowding degree of the individuals to form a new parent population;
and 3.5, repeating the operation until the condition of ending the program is met.
2. The method of claim 1, wherein step 3.2 is performed on the starting population P1And (3) performing rapid non-dominated sorting and congestion degree calculation, wherein the specific optimization process is as follows:
aiming at each generation of evolutionary population, finding out the current optimal individual, wherein the current optimal solution called as an evolutionary population is a non-dominant solution; the set of all non-dominating solutions is called as a non-dominating set of the current evolutionary population, and the non-dominating set is made to continuously approach a real optimal solution set to finally reach the optimal;
the rapid non-dominant sorting is to layer the population according to individual non-inferior solutions, so that the search is carried out towards the direction of a Pareto optimal solution set; the specific process is as follows:
setting the evolutionary population as P, and simultaneously setting a non-dominating set P of the evolutionary population*(ii) a First, the first individual is placed in a non-dominating set P*In turn, the individuals in the evolved population P
Figure FDA0002408166600000031
Taking out and putting in P*Simultaneously taking out P and P in sequence*Comparing all individuals in (1), and deleting P*All individuals in (1) are dominated by P if P is dominated by P*Is subject to, P is taken from P*Deleting;
in order to selectively sort in individuals with the same non-dominant order set, the individual crowding distance is used for measurement, and the specific calculation process is as follows:
let P [ i ]]distanceIs the aggregation distance of individual i, P [ i]M is the function value of the individual i on the sub-target m, and the calculation formula is as follows:
P[i]distance=(P[i+1].ps-P[i-1].ps)+(P[i+1].l-P[i-1].l)
wherein ps and l are the simple score and the likelihood score of the tree individual respectively.
CN202010168038.5A 2020-03-11 2020-03-11 Multi-target phylogenetic tree construction method based on feature hierarchy Active CN111462812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010168038.5A CN111462812B (en) 2020-03-11 2020-03-11 Multi-target phylogenetic tree construction method based on feature hierarchy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010168038.5A CN111462812B (en) 2020-03-11 2020-03-11 Multi-target phylogenetic tree construction method based on feature hierarchy

Publications (2)

Publication Number Publication Date
CN111462812A true CN111462812A (en) 2020-07-28
CN111462812B CN111462812B (en) 2023-03-24

Family

ID=71680798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010168038.5A Active CN111462812B (en) 2020-03-11 2020-03-11 Multi-target phylogenetic tree construction method based on feature hierarchy

Country Status (1)

Country Link
CN (1) CN111462812B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112817959A (en) * 2021-02-25 2021-05-18 西北大学 Construction method of ancient biomorphic phylogenetic tree based on multi-metric index weight
CN114613426A (en) * 2022-01-26 2022-06-10 西北大学 Phylogenetic tree construction method based on dynamic multi-objective optimization

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004057511A2 (en) * 2002-12-23 2004-07-08 Universität Karlsruhe Methods for the analysis, classification and/or tree construction of sequences using correlation analysis
US20070225918A1 (en) * 2003-06-19 2007-09-27 Khalid Sayood System and Method for Sequence Distance Measure for Phylogenetic Tree Construction
CN106326689A (en) * 2015-06-25 2017-01-11 深圳华大基因科技服务有限公司 Method and device for determining site subject to selection in colony
US20170212980A1 (en) * 2016-01-25 2017-07-27 Shenzhen University Construction method for heuristic metabolic co-expression network and the system thereof
GB201717125D0 (en) * 2016-11-28 2017-11-29 National Univ Of Defense Technology Differential evolution method oriented to agile satellite multi-target task planning
CN108509764A (en) * 2018-02-27 2018-09-07 西北大学 A kind of extinct plants and animal pedigree evolution analysis method based on genetic property yojan
CN109326328A (en) * 2018-11-02 2019-02-12 西北大学 A kind of extinct plants and animal pedigree evolution analysis method based on pedigree cluster

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004057511A2 (en) * 2002-12-23 2004-07-08 Universität Karlsruhe Methods for the analysis, classification and/or tree construction of sequences using correlation analysis
US20070225918A1 (en) * 2003-06-19 2007-09-27 Khalid Sayood System and Method for Sequence Distance Measure for Phylogenetic Tree Construction
CN106326689A (en) * 2015-06-25 2017-01-11 深圳华大基因科技服务有限公司 Method and device for determining site subject to selection in colony
US20170212980A1 (en) * 2016-01-25 2017-07-27 Shenzhen University Construction method for heuristic metabolic co-expression network and the system thereof
GB201717125D0 (en) * 2016-11-28 2017-11-29 National Univ Of Defense Technology Differential evolution method oriented to agile satellite multi-target task planning
CN108509764A (en) * 2018-02-27 2018-09-07 西北大学 A kind of extinct plants and animal pedigree evolution analysis method based on genetic property yojan
CN109326328A (en) * 2018-11-02 2019-02-12 西北大学 A kind of extinct plants and animal pedigree evolution analysis method based on pedigree cluster

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
冯思玲: "系统发育树构建方法研究", 《信息技术》 *
张丽娜等: "常用系统发育树构建算法和软件鸟瞰", 《动物学研究》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112817959A (en) * 2021-02-25 2021-05-18 西北大学 Construction method of ancient biomorphic phylogenetic tree based on multi-metric index weight
CN112817959B (en) * 2021-02-25 2023-03-24 西北大学 Construction method of ancient biomorphic phylogenetic tree based on multi-metric index weight
CN114613426A (en) * 2022-01-26 2022-06-10 西北大学 Phylogenetic tree construction method based on dynamic multi-objective optimization
CN114613426B (en) * 2022-01-26 2023-10-31 西北大学 System development tree construction method based on dynamic multi-objective optimization

Also Published As

Publication number Publication date
CN111462812B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
Excoffier et al. Robust demographic inference from genomic and SNP data
US9058564B2 (en) Controlling quarantining and biasing in cataclysms for optimization simulations
US8700548B2 (en) Optimization technique using evolutionary algorithms
CN112181867B (en) On-chip network memory controller layout method based on multi-target genetic algorithm
CN111462812B (en) Multi-target phylogenetic tree construction method based on feature hierarchy
CN103838820B (en) Evolutionary multi-objective optimization community detection method based on affinity propagation
JP2007200302A (en) Combining model-based and genetics-based offspring generation for multi-objective optimization using convergence criterion
JP6451735B2 (en) Energy amount estimation device, energy amount estimation method, and energy amount estimation program
CN113903395A (en) BP neural network copy number variation detection method and system for improving particle swarm optimization
CN105590039B (en) A kind of protein complex recognizing method based on BSO optimizations
Lamiable et al. An algorithmic game-theory approach for coarse-grain prediction of RNA 3D structure
García-Pareja et al. Exact simulation of coupled Wright–Fisher diffusions
CN113673695B (en) Crowd behavior rule automatic extraction method based on novel feature automatic construction
CN112837739B (en) Hierarchical feature phylogenetic model based on self-encoder and Monte Carlo tree
Belkhir Per instance algorithm configuration for continuous black box optimization
CN114996278A (en) Road network shortest path distance calculation method based on reinforcement learning
Denysiuk et al. A new hybrid evolutionary multiobjective algorithm guided by descent directions
Cancino et al. A multi-objective evolutionary approach for phylogenetic inference
CN113837474A (en) Regional soil heavy metal pollution index prediction method and device
CN113704570A (en) Large-scale complex network community detection method based on self-supervision learning type evolution
Schwefel Evolutionary computation-a study on collective learning
Lou An intelligent teaching test paper generation system based on ant colony hybrid genetic algorithms
CN105978816B (en) Multicast tree optimization method based on genetic framework
Khoder et al. Community tracking in time evolving networks: an evolutionary multi-objective approach
Moradi et al. An Improved Multi-objective Genetic Algorithm for Revealing Community Structures of Complex Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant