CN111462812A

CN111462812A - Multi-target phylogenetic tree construction method based on feature hierarchy

Info

Publication number: CN111462812A
Application number: CN202010168038.5A
Authority: CN
Inventors: 冯筠; 刘泽云; 刘蒙; 侯刚; 冯宏伟
Original assignee: Northwestern University
Current assignee: Northwestern University
Priority date: 2020-03-11
Filing date: 2020-03-11
Publication date: 2020-07-28
Anticipated expiration: 2040-03-11
Also published as: CN111462812B

Abstract

The invention discloses a multi-target phylogenetic tree construction method based on feature hierarchy, which combines the feature description of morphological data, firstly generates corresponding feature hierarchy relation, and estimates missing values by combining the dependency relation between features, thereby obtaining a complete data set; and then, the maximum reduction principle and the maximum likelihood principle based on the inapplicable Fitch algorithm are combined to construct the parallel multi-target phylogenetic tree, so that the problem caused by the inapplicable data can be well measured, and the limitation of tree construction under a single principle is avoided. Compared with the traditional phylogenetic analysis method, the method can well solve the problem of uncertainty of the tree structure caused by missing data and inapplicable data, and provides a basis for biologists to research species evolution through various tree building principles.

Description

Multi-target phylogenetic tree construction method based on feature hierarchy

Technical Field

The invention belongs to the field of biological information, relates to development tree analysis and construction in phylogenetic research, and particularly relates to a multi-target phylogenetic tree construction method based on feature hierarchy.

Background

Phylogenetic studies are directed to evolutionary relationships, such as evolutionary history and relationships between species or populations, to understand when and when speciation events may occur, and thus to study species origin. Phylogenetic analysis is to infer and evaluate these evolutionary relationships, with the goal of finding phylogenetic trees that fit species evolution. For a clearer understanding of the study, it can be seen from fig. 1: the species characteristic matrix can be obtained by performing morphological data arrangement on the collected fossil; and performing phylogenetic analysis on the morphological data to obtain a phylogenetic tree capable of reflecting the species evolution relationship.

Morphological phylogeny morphological characteristics are limited compared to molecular phylogeny, and morphological-based phylogeny analysis is slow. For the phylogenetic analysis of early ancient organisms, due to the reasons of long age, change of storage environment, unstable molecular data, limitation of fossil storage and observation means and the like, only morphological data are available as materials for phylogenetic analysis, and the morphological data is an important basis for carrying out phylogenetic analysis of the ancient organisms. Due to the loss of fossil records and the existence of a species characteristic hierarchical relationship, morphological data often contain a large amount of missing data and inapplicable data (note that the inapplicable data appears when the existence of a certain characteristic in a species depends on the nonexistent characteristic), and the existing method cannot effectively process 'problem' data, so that the construction of a phylogenetic tree is inaccurate, and troubles are caused for a biological scientist to research species evolution.

Meanwhile, the commonly used phylogenetic analysis method comprises a distance-based method and an optimal principle-based method, wherein the distance-based method mainly comprises an adjacency method and a UPGMA; the latter mainly includes maximum reduction method, maximum likelihood method and Bayes inference method. Whereas the existing mainstream tools are based on a single optimization principle. Typically, multiple methods are performed on a single modality data set. However, the results of multiple phylogenetic inferences may be conflicting or consistent. When the goals conflict with each other, there is generally no way to satisfy all of the goals.

Disclosure of Invention

Aiming at the technical problem that the construction of a phylogenetic tree is unstable due to the existence of missing data and inapplicable data in morphological data in the prior art, the invention aims to provide a multi-target phylogenetic tree construction method based on feature hierarchy, which combines prior knowledge to generate a feature hierarchical relationship, analyzes the morphological data distribution condition under the feature dependency relationship by taking the feature hierarchical relationship as a support, and further estimates the missing data so as to obtain a complete morphological data set; starting from complete morphological data, the inapplicable Fitch algorithm proposed in Martin D.Brazeau (2017) is adopted to measure the inapplicable data, and the maximum simplification principle and the maximum likelihood principle are combined to construct a multi-target phylogenetic tree, so that a Pareto optimal tree group which simultaneously meets the maximum simplification principle and the maximum likelihood principle is generated, and a basis is provided for a biological scientist to research species evolution.

In order to realize the task, the invention adopts the following technical solutions:

a multi-target phylogenetic tree construction method based on feature hierarchy is characterized by comprising the following steps:

step one, constructing and formalizing a characteristic hierarchical relationship

Analyzing the dependency relationship among the features by combining the feature description, the prior knowledge and other related information, thereby constructing a feature hierarchical relationship; formally expressing the constructed characteristic hierarchical relationship to convert the hierarchical relationship into a data matrix which can be processed by a computer;

step two, estimating the form data missing value based on the Markov Monte Carlo algorithm of the feature level

Step 2.1, to incomplete form matrix D_incomInitializing missing values, and performing random interpolation on missing data according to column values;

step 2.2, column analysis is carried out by combining the characteristic hierarchical relationship, and the parameter distribution of each column is analyzed so as to estimate the posterior distribution of the parameters, which mainly comprises the following conditions:

(1) the master feature is present, but no slave feature:

calculating the value distribution of the subordinate characteristics according to different values of the main characteristics;

(2) dependent features are present, but no master features are present:

calculating the value distribution of the main characteristic according to different values of the subordinate characteristic;

(3) there are dependent features, and there are main features:

respectively calculating value distribution of the main characteristic and the subordinate characteristic;

(4) neither the dependent nor the main feature is present:

independently calculating the value distribution of the characteristics;

step 2.3, recording the interpolation current value x_tAnd randomly sampling to determine candidate value x_t+1I.e. the probability density distribution Q (x) obtained according to step 2.2_t+1|x_t) Generating candidate samples x_t+1(ii) a Will present value x_tAnd a candidate value x_t+1Separately for each substitution in the column analysis, x is calculated separately_tAnd x_t+1And the probability density function values under the existing observation column distribution are marked as pOld and pNew. Extracting a random number U from U (0,1), if U<pNew/pOld, the candidate x is accepted_t+1Newly sampling missing values; otherwise, the current missing interpolation value x is maintained_tThe change is not changed;

step 2.4, iteratively executing the step 2.2-2.3 until the iteration times k are met, and outputting the final interpolated form data

Is marked as D_com。

Step three, constructing a multi-target phylogenetic tree

Step 3.1, according to the complete form matrix D_comBuilding an initialization Tree group P₁The tree group size is N. Respectively calculating a reduced value score and a likelihood value score of each individual in the tree group:

given a phylogenetic tree structure τ where the set of nodes is V (τ) and the set of edges is E (τ), the reduced value score can be expressed as:

wherein, w_jWeight, v, representing feature j_jAnd u_jRespectively representing the characteristic state values of the nodes v and u at the bit point j, C (v)_j,u_j) Representing slave state v as a cost matrix_jTransition to state u_jThe cost of (d); the specific feature variation calculation is described in martind. brazeau (2017), et al, which proposes a non-applicable pitch algorithm.

The likelihood L (τ) of the phylogenetic tree τ is the probability that the observation D and evolution model M that produced the tree — L (τ) ═ P (D | τ, M.) the likelihood function L (τ) of the tree τ given D and M can be written as:

wherein, P (D)_iand/T, M) is the likelihood value of the ith position.

Each individual is encoded with a chromosome structure { tree structure, reduction score, likelihood score }.

Step 3.2, for the initial population P₁And (3) performing rapid non-dominated sorting and congestion degree calculation, wherein the specific optimization process is as follows:

aiming at each generation of evolutionary population, finding out the current optimal individual, wherein the current optimal solution called as an evolutionary population is a non-dominant solution; the set of all non-dominant solutions is called a non-dominating set of the current evolutionary population, and the non-dominating set is made to continuously approach a real optimal solution set to finally reach the optimal.

The rapid non-dominant sorting is to layer the population according to individual non-inferior solutions, so that the search is carried out towards the direction of a Pareto optimal solution set; the specific process is as follows:

setting the evolutionary population as P, and simultaneously setting a non-dominating set P of the evolutionary population^*(ii) a First, the first individual is placed in a non-dominating set P^*In turn, the individuals in the evolved population P

Taking out and putting in P^*Simultaneously taking out P and P in sequence^*Comparing all individuals in (1), and deleting P^*All individuals in (1) are dominated by P if P is dominated by P^*Is subject to, P is taken from P^*Is deleted.

To enable selective ordering within individuals having the same non-dominant order set, an individual crowding distance is used for the metric; the specific calculation process is as follows:

let P [ i ]]_distanceIs the aggregation distance of individual i, P [ i]M is the function value of the individual i on the sub-target m, and the calculation formula is as follows:

P[i]_distance＝(P[i+1].ps-P[i-1].ps)+(P[i+1].l-P[i-1].l)

wherein ps and l are the simple score and the likelihood score of the tree individual respectively.

And 3.3, carrying out chromosome selection, crossing and mutation operations on the current population, and generating a new population.

Step 3.4, merging the parent population and the child population, performing rapid non-dominated sorting, simultaneously performing crowding degree calculation on the individuals in each non-dominated layer, and selecting proper individuals according to the non-dominated relationship and the crowding degree of the individuals to form a new parent population;

and 3.5, repeating the operation until the condition of ending the program is met.

The multi-target phylogenetic tree construction method based on the feature hierarchy can well measure the problem caused by the inapplicable data, avoids the limitation of tree construction under a single principle, and provides a basis for biologists to research species evolution. Compared with the prior art, the method has the following technical innovation:

1. in the aspect of morphological data missing value estimation, compared with the traditional missing value processing method (such as feature quantity increase, missing value interpolation or non-processing), the method analyzes feature distribution caused by feature dependence by combining a hierarchical structure of features, fuses the feature distribution and missing value estimation, and reduces the influence of missing data on the instability of phylogenetic analysis. The shape data is subjected to missing estimation through a Markov Monte Carlo missing interpolation algorithm, so that the interpolation value can well accord with the distribution of the original data.

2. In the aspect of morphological inapplicable data, an inapplicable Fitch algorithm (Martin D. Brazeau, 2017) is adopted, inapplicable data and missing data can be correctly distinguished and measured, and guarantee is provided for accurately constructing a phylogenetic tree.

3. In the aspect of phylogenetic tree construction, compared with the traditional phylogenetic tree construction tool with a single target, the method adopts a multi-target phylogenetic tree construction method and combines a maximum reduction principle and a maximum likelihood principle to construct the phylogenetic tree, so that a phylogenetic tree group capable of meeting various principles is obtained, and the problem that results of executing various methods on the same morphological data set are possibly conflicted is avoided.

Drawings

FIG. 1 is an exemplary diagram of fossil, morphological data and phylogenetic tree, wherein FIGS. A-F are fossil pictures; panel G is the species signature matrix and panel H is the phylogenetic tree.

FIG. 2 is a feature description and feature hierarchy relationship diagram.

FIG. 3 is a schematic diagram of accuracy evaluation of interpolation results based on MCMC morphological data; wherein panel (a) is canidae, panel (b) is terroris canidae, panel (c) is cambrian poda, and panel (d) is Hawaiian platynin.

FIG. 4 is a schematic diagram illustrating evaluation of tree building accuracy based on interpolation results of MCMC morphological data; wherein panel (a) is canidae, panel (b) is terroris canidae, panel (c) is cambrian poda, and panel (d) is Hawaiian platynin.

FIG. 5 is a schematic diagram of the generation of new population by non-dominated sorting and congestion calculation in step three;

FIG. 6 is a schematic diagram of the construction of the multi-objective phylogenetic tree in step three;

FIG. 7 is a schematic diagram of the mutation operation of the multi-objective phylogenetic tree construction in step three;

FIG. 8 is a diagram of a Pateto treelet set generated based on a multi-objective phylogenetic tree;

FIG. 9 is a set of trees generated based on a multi-objective phylogenetic tree with TNT and Mrbayes;

FIG. 10 is a graph of distance between a Pateto tree set and a standard tree generated based on deletion interpolation and multi-objective phylogenetic trees;

FIG. 11 is an overall flowchart of the method for constructing a multi-objective phylogenetic tree based on feature hierarchy according to the present invention.

The present invention will be described in further detail with reference to the following drawings and examples.

Detailed Description

As shown in fig. 1 to fig. 11, the embodiment provides a method for constructing a multi-objective phylogenetic tree based on feature hierarchy, and the detailed steps are as follows:

step one, constructing and formalizing a hierarchical relationship of expression characteristics

Step 1.1, finding the dependency relationship among the characteristics by combining the prior knowledge and the corresponding characteristic description, and determining the main characteristics, the subordinate characteristics and the logic constraint between the main characteristic value and the subordinate characteristic value. For better understanding, fig. 2 is taken as an example for illustration. According to the characteristic description, the following steps are shown: the values of the characteristic "tail" include "absence (0)", "presence (1)"; the value of the feature "tail color" includes "blue (0)", "red (1)" and "inapplicable (-)", and the value of the feature "tail" is "inapplicable (-)" if and only if the value of the feature "tail" is "absent (0)"; the value of the characteristic "tail length" includes "tail short (0)", "tail long (1)", and "inapplicable (-)", and the value of the characteristic "tail" is "inapplicable (-)" if and only if the value of the characteristic "tail" is "absent (0)"; the values of the feature "eye" include "absence (0)", "presence (1)". Thus, it can be concluded that the feature "tail" is the main feature of "tail color" and "tail length". The following can be obtained by analysis: the main characteristic is 'tail', the corresponding subordinate characteristics are 'tail color' and 'tail length', and the 'eye' has no subordinate characteristic;

and step 1.2, formally defining the characteristic hierarchical relationship obtained in the step 1.1, so that the characteristic hierarchical relationship can be read and analyzed by a computer.

Assume the species signature matrix is D { X }₁,…,X_nIn which X_i(i∈[1,n]) Representing the characteristics of the ith species, wherein n is the number of the species; species X_iHas a characteristic distribution of X_i(x_i1,x_i2,…,x_im) M is a characteristic number, where x_ip(p∈[1,m]) Represents species X_iP characteristic state. The feature set is denoted as { y₁,y₂,…,y_m}. thus, the feature hierarchy relationship matrix may be represented by a matrix C of dimension m × m, where C is_ijIs the vertex y_iAnd vertex y_jThe weight value in between.

If the feature y_i(0<i ≦ m) logically depends on feature y_j(0<j ≦ m), i.e., feature y_iAnd feature y_jWhen a value dependent relationship exists, the characteristic y is assumed_jWhen the value is v, it is the feature y_iMain characteristic of (1), then C_ijThe value is v. In other words, when the main feature y_jWhen the value is v, the dependent characteristic y_iIs in an applicable state; when the main feature y_jDependent feature y when the value is not v_iIs in an inapplicable state. If when the feature y_iAnd feature y_jNo value dependency, then C_ijThe value is ∞. In summary, the feature level relationship matrix calculation formula is as follows:

by referring to the calculation method of the hierarchical relationship matrix of features, the hierarchical relationship of features shown in fig. 2 is still used for analysis, and a corresponding hierarchical relationship matrix (shown in table) can be obtained. As can be seen from the first column of the table, when the tail characteristic value is 1, the characteristic tail is the main characteristic of the tail color and the tail length; otherwise, the tail color and tail length values are in an inapplicable state. Columns 2-4 in the table show that the characteristic tail color, tail length, eye are not sufficient to prove to be the main characteristic of the other characteristics.

Table 1: feature hierarchical relationship matrix

If the shape data has missing data, the data set D can be divided into observation data Y_obsAnd missing data Y_missExpressed as D ═ Y_obs,Y_missFourthly, performing morphological data missing estimation in the second step; if the shape data has no missing data, step two can be skipped and step three can be directly executed.

Step 2.1, according to the value range of each row of the form data, carrying out random interpolation on the missing data according to the row values;

and 2.2, performing column analysis on the form data by combining the characteristic hierarchical relation matrix obtained in the step 1.2, and analyzing the parameter distribution of each column so as to estimate the posterior distribution of the parameters. According to observable data Y_obsAnd missing data debit value

Obtaining probability density distribution theta under current data^t. The specific calculation formula is as follows (where Pr is a specific probability density function):

step 2.3, recording the interpolation current value x_tAccording to Y_obsAnd a predetermined Y_missDistributed random sampling to successively determine candidate value x_t+1. Candidate sample x_t+1Satisfies the probability density distribution theta obtained in step 2.2^(t). At this point, the missing data is already included in the sample as a distribution parameter. The specific calculation formula is as follows:

will present value x_tAnd a candidate value x_t+1Separately for each substitution in the column analysis, x is calculated separately_tAnd x_t+1And the probability density function values under the existing observation column distribution are marked as pOld and pNew. Extracting a random number U from U (0,1), if U<pNew/pOld, the candidate x is accepted_t+1Newly sampling missing values; otherwise, the current missing interpolation value x is maintained_tAnd is not changed.

Is marked as D_com. According to the operation, the Markov chain can be obtained through multiple iterations

The result after deletion estimation can be obtained from the Markov chain

When k is sufficiently large, it is preferable that,

will converge to the same distribution, i.e. a smooth distribution.

Further, the specific process of the second step comprises:

the species signature matrix dataset D containing missing data may be represented as D ═ Y_obs,Y_missIn which Y is_obsRepresenting observed data, Y_missIndicating missing data.

The morphological data missing estimation algorithm comprises the following specific steps:

i, setting initial value

And ordert is 0, wherein p is the number of deletions in the morphology dataset;

II, calculating probability density distribution under the t sampling

III, sampling for the t +1 th time, and specifically, the following steps:

a) extracting from the distribution of the column characteristic values of the 1 st missing value

Make it satisfy distribution

b) Extracting from the distribution of the column characteristic values of the 2 nd missing value

Make it satisfy distribution

c) Extracting from the distribution of the characteristic values of the column where the ith missing value is located

Make it satisfy distribution

……

d) Extracting from the column characteristic value distribution of the p-th missing value

Make it satisfy distribution

I, record

To Y'_missIf the relative change amount exceeds

Greater than the random number U in U (0,1), then accept

Newly sampling missing values; otherwise, the current sampling value is maintained

Is not changed, i.e.

Can be written as

And II, repeating the step k times, and when k is large enough, the Markov chain is gradually converged, and the state distribution tends to be stable, so that a plurality of deletion estimated values can be obtained simultaneously.

In order to verify the effectiveness of the feature hierarchy-based multi-objective phylogenetic tree construction method in missing value estimation, experiments are respectively carried out on 4 morphological data sets, wherein the data sets comprise canidae, ciconidae, cambrian quadruped and Hawaiian Platynini. The complete morphological data set is obtained by carrying out random deletion on the morphological data set, wherein the deletion ratio is 5% -95%, and then carrying out deletion value estimation on the data set subjected to random deletion by adopting an MCMC deletion estimation algorithm.

In this embodiment, the deletion estimation performance is estimated based on the proximity of the interpolation value and the true value, the data set before random deletion is compared with the data set after MCMC deletion estimation, and compared with mic packet interpolation, random forest interpolation, fixed value interpolation, and random interpolation, and the quality of interpolation is estimated using Root Mean Square Error (RMSE), and the calculation formula is as follows:

the results of the experiment are shown in FIG. 3. Experimental results show that the multi-target phylogenetic tree construction method based on the feature hierarchy has a lower root mean square error and a maximum fixed value interpolation error compared with other methods. With the gradual increase of the deletion ratio, the root mean square error after interpolation by several methods is increased continuously. However, because the method provided in the embodiment finds the constraint of the logic value between the features through the hierarchical relationship of the features, a relatively low root mean square error can be obtained under the condition of high loss. For the data set after deletion estimation, the invention defines the topological difference rate for measuring the accuracy of the data set after deletion estimation on phylogenetic analysis, and the topological difference rate is specifically defined as a real tree T₁With a single tree T randomly selected among the shortest trees for a given search₂The symmetric difference rate between the two is calculated according to the following formula:

wherein n is the number of species, | split (tree) | represents the number of "partition" tree sets of tree, | split (T)₁)∩split(T₂) I represents T₁And T₂The same number of sets of partition trees. The experimental result is shown in fig. 4, and the result shows that, compared with several other methods, the method for constructing the multi-target phylogenetic tree based on the feature hierarchy provided in this embodiment has a smaller difference between the tree structure constructed by the interpolated data set and the standard tree, and the next time is random forest interpolation.

As can be seen from fig. 3 and 4, there is no direct relationship between the interpolation accuracy and the symmetric difference rate, that is, the symmetric difference rate is not necessarily low when the phylogenetic tree is constructed on the data set with high interpolation accuracy. The multi-target phylogenetic tree construction method based on the feature hierarchy provided by the embodiment can have a relatively low root mean square error by combining the feature hierarchy relationship, and the tree construction result is closer to a standard tree structure.

Step three, constructing a multi-target phylogenetic tree

Step 3.1, according to the complete form matrix D_comBuilding an initialization Tree group P₁The tree group size is N. A reduced value score and a likelihood value score are calculated for each individual in the tree group separately.

Step 3.2, for the initial population P₁And performing fast non-dominated sorting and congestion degree calculation. And the non-dominant solution quick sequencing is to layer the population according to individual non-inferior solutions so as to enable the search to be carried out towards the Pareto optimal solution set direction. Aiming at each generation of evolutionary population, finding out the current optimal individual, namely a non-dominant solution (non-dominant solution) of the evolutionary population; the set of all non-dominated solutions is called a non-dominated solution set (NDSet) of the current evolutionary population, and the non-dominated solution set is continuously approximated to a real optimal solution set to finally reach the optimal value.

Defining: let P and q be any two different individuals in the evolutionary population P, if P dominates q, then it is noted that P > q:

i, p is no worse than q for all sub-targets. I.e. f_k(p)≤f_k(q)(k＝1,2,…r)；

II, at least one sub-target exists, and p is better than q. Namely, it is

So that f_l(p)<f_l(q)。

Setting the evolutionary population as P, and simultaneously setting a non-dominating set P of the evolutionary population^*The specific optimization process is as follows:

first, the first individual is placed in a non-dominating set P^*In turn, the individuals in the evolved population P

Taking out and putting in P^*Simultaneously taking out P and P in sequence^*Is compared with all of the individuals in (a),deleting P^*All individuals in (1) are dominated by P if P is dominated by P^*Is subject to, P is taken from P^*Is deleted.

For a better understanding, the non-dominated building process is briefly described:

first, find the non-dominant solution in the population, which is marked as the first non-dominant layer F₁Assigning a non-dominant ranking i to all of its individuals_rank1 (wherein i)_rankIs the non-dominant rank value of individual i) and the individuals of this non-dominant layer are removed from the entire population; then, the non-dominant solution set in the rest population is continuously found out and is marked as F₂All of them are assigned a non-dominant rank value i _rank2; and (4) circulating the operation until the whole population is completely layered. Individuals within the same hierarchy have the same non-dominant order value i_rank。

To be able to have the same i_rankAnd selective ordering is carried out in the individuals, and the individual crowding distance is adopted for measurement. The crowding distance of the individual i is the distance between two individuals i +1 and i-1 adjacent to the individual i in the target space, and the specific calculation steps are as follows:

and I, initializing the distance of the same layer. Let P [ i]_distance0 (wherein P [ i ]]_distanceIndicates the crowding distance of any individual i);

II, arranging the individuals on the same layer in an ascending order according to the m function values;

III, making the individuals on the sorted edges have a selective advantage. Given a large number W, let P [0 ]]_distance＝P[i]_distance＝W；

IV, calculating the crowdedness P [ i ] of the individuals i in the middle of the sequence]_distance：

P[i]_distance＝(P[i+1].f₁-P[i-1].f₁)+(P[i+1].f₂-P[i-1].f₂)

Wherein f is₁，f₂Respectively, the simple score and the likelihood score of the tree individual, P [ i]M is the function value of the individual i on the sub-target m.

Repeating the steps II to IV for different target functionsOperation is performed to obtain the crowding distance P [ i ] of the individual i]_distanceBy preferentially selecting individuals with large crowding distances, the calculation results can be distributed relatively uniformly in the target space to maintain the diversity of the population.

For a clearer understanding of the non-dominated sorting and the congestion calculation and sorting process, fig. 5 is taken as an example for illustration. Firstly, the t generation parent generation group P_tAnd the progeny population Q produced in the t generation_tThe new population R with the scale of 2N is generated by combination_t. Then, for the new population R_tFast sorting of the non-dominant solution is carried out, and a plurality of non-dominant layers F are generated₁,F₂,F₃…, wherein the non-dominant layer priorities are decremented from layer to layer. Then, the crowdedness of the individuals in each non-dominant layer is calculated and ranked, and the first N individuals are selected to form a new population P_t+1. In FIG. 5, there is a non-dominant layer F₁And F₂And the non-dominant layer F₃A part of individuals are selected into a new group P_t+1。

1) Selecting operation: tournament selection is used to pick individuals from parents to generate offspring populations.

The method comprises the following specific steps:

i, determining the individual quantity pool/road selected each time;

and II, randomly selecting individuals from the population to form a group, and selecting the individual with the best fitness value to enter a filial generation population according to the fitness value of each individual.

And (5) repeating the step II N times, wherein the obtained individuals form a new generation of population.

2) And (3) cross operation: the original superior genes are inherited to next generation individuals and new individuals containing more complex gene structures are generated.

Firstly, randomly taking out a pair of individuals to be mated from a mating pool; according to the cross probability P_c(0<P_c1) to carry out the crossover operation, thereby forming a new pair of individuals.

The specific interleaving steps are shown in fig. 6: given two trees T₁And T₂From tree T₁Pruning a part of subtrees s; then, the tree T is removed₂All leaf nodes appear in s; by replanting subtrees to T₂Randomly selecting branches to obtain a descendant subtree T'₁。

3) Mutation operation: since the phylogenetic tree needs to be mutated, the present embodiment adopts a branch exchange algorithm, which is commonly used in phylogenetic analysis, to implement nearest neighbor exchange (NNI).

The specific process is shown in fig. 7: first, an internal branch is selected, two branches are respectively connected to two sides of the internal branch, and the two branches are neighbors of each other. Interchanging one branch on one side of the internal branch with one branch on the other side, thereby enabling interchanging phylogenetic trees of possible nearest neighbors resulting from a phylogenetic tree mutation.

Step 3.4, merging the parent population and the offspring population, performing rapid non-dominated sorting and congestion degree calculation, and selecting proper individuals to form a new parent population according to the non-dominated relationship and the congestion degree of the individuals;

The inventors performed experiments on 4 data sets (canidae, terroris canidae, cambrian cladophora, Hawaiian platynii) and fig. 8 is a graph of Pareto treegand generated from experiments performed on canidae data sets. The existing phylogenetic analysis software TNT and mrboyes are compared, 500 phylogenetic tree sets are generated at the same time, the intersection set of the sets is calculated to calculate the average, the method is used for evaluating the effectiveness of the feature level-based multi-target phylogenetic tree construction method provided by the embodiment, and the experimental result is shown in fig. 9. The experimental result shows that the generated tree structure of the multi-target phylogenetic tree construction method based on the feature hierarchy can generate more intersections with the existing tools, and the 65.4% of the tree structures generated by the method can be consistent with the phylogenetic tree generated by the existing software.

Meanwhile, the symmetry difference measurement is performed on the multi-target phylogenetic tree construction method based on the feature hierarchy, the tree set constructed by TNT and Mrbayes and the standard tree, and the experimental result is shown in fig. 10. Experimental results show that the symmetry differences of the multi-target phylogenetic tree construction method based on the feature hierarchy are respectively 13.2, 16.5, 37.8 and 53.1, which is entirely lower than the existing methods, that is, the multi-target phylogenetic tree construction method based on the feature hierarchy provided in the embodiment can generate results more similar to standard trees.

In summary, the multi-target phylogenetic tree construction method based on the feature hierarchy of the embodiment can better process missing and unapplicable data in morphological data, can generate a phylogenetic tree set satisfying a maximum reduction method and a maximum likelihood method at the same time, and has a certain application prospect.

Claims

1. A multi-target phylogenetic tree construction method based on feature hierarchy is characterized by comprising the following steps:

step two, estimating a form data missing value based on a Markov Monte Carlo algorithm of a feature level, wherein the specific method is as follows;

(1) the master feature is present, but no slave feature:

(2) dependent features are present, but no master features are present:

(3) there are dependent features, and there are main features:

(4) neither the dependent nor the main feature is present:

independently calculating the value distribution of the characteristics;

step 2.3, recording the interpolation current value x_tAnd randomly sampling to determine candidate value x_t+1I.e. the probability density distribution Q (x) obtained according to step 2.2_t+1|x_t) Generating candidate samples x_t+1(ii) a Will present value x_tAnd a candidate value x_t+1Separately for each substitution in the column analysis, x is calculated separately_tAnd x_t+1Probability density function values under the existing observation column distribution are marked as pOld and pNew;

extracting a random number U from U (0,1), and accepting a candidate value x if U < pNew/pOld_t+1Newly sampling missing values; otherwise, the current missing interpolation value x is maintained_tThe change is not changed;

Is marked as D_com；

Step three, constructing a multi-target phylogenetic tree

Step 3.1, according to the complete form matrix D_comBuilding an initialization Tree group P₁And the size of the tree group is N, and the reduction value score and the likelihood value score of each individual in the tree group are respectively calculated as follows:

wherein, w_jWeight, v, representing feature j_jAnd u_jRespectively representing the characteristic state values of the nodes v and u at the bit point j, C (v)_j，u_j) Representing slave state v as a cost matrix_jTransition to state u_jThe cost of (d);

the likelihood L (τ) of the phylogenetic tree τ is the probability that the observed data D and the evolutionary model M that produced the tree, i.e., L (τ) ═ P (D | τ, M), the likelihood function L (τ) of the tree τ given D and M can be written as:

wherein, P (D)_ithe/T, M) is the likelihood value of the ith position;

encoding each individual, wherein the chromosome structure is { tree structure, reduction score and likelihood score };

step 3.2, for the initial population P₁Performing rapid non-dominated sorting and congestion degree calculation;

step 3.3, carrying out chromosome selection, cross operation and mutation operation on the current population, and generating a new population;

2. The method of claim 1, wherein step 3.2 is performed on the starting population P₁And (3) performing rapid non-dominated sorting and congestion degree calculation, wherein the specific optimization process is as follows:

aiming at each generation of evolutionary population, finding out the current optimal individual, wherein the current optimal solution called as an evolutionary population is a non-dominant solution; the set of all non-dominating solutions is called as a non-dominating set of the current evolutionary population, and the non-dominating set is made to continuously approach a real optimal solution set to finally reach the optimal;

Taking out and putting in P^*Simultaneously taking out P and P in sequence^*Comparing all individuals in (1), and deleting P^*All individuals in (1) are dominated by P if P is dominated by P^*Is subject to, P is taken from P^*Deleting;

in order to selectively sort in individuals with the same non-dominant order set, the individual crowding distance is used for measurement, and the specific calculation process is as follows:

P[i]_distance＝(P[i+1].ps-P[i-1].ps)+(P[i+1].l-P[i-1].l)