CN111462812A - Multi-target phylogenetic tree construction method based on feature hierarchy - Google Patents
Multi-target phylogenetic tree construction method based on feature hierarchy Download PDFInfo
- Publication number
- CN111462812A CN111462812A CN202010168038.5A CN202010168038A CN111462812A CN 111462812 A CN111462812 A CN 111462812A CN 202010168038 A CN202010168038 A CN 202010168038A CN 111462812 A CN111462812 A CN 111462812A
- Authority
- CN
- China
- Prior art keywords
- value
- population
- tree
- feature
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Theoretical Computer Science (AREA)
- Physiology (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a multi-target phylogenetic tree construction method based on feature hierarchy, which combines the feature description of morphological data, firstly generates corresponding feature hierarchy relation, and estimates missing values by combining the dependency relation between features, thereby obtaining a complete data set; and then, the maximum reduction principle and the maximum likelihood principle based on the inapplicable Fitch algorithm are combined to construct the parallel multi-target phylogenetic tree, so that the problem caused by the inapplicable data can be well measured, and the limitation of tree construction under a single principle is avoided. Compared with the traditional phylogenetic analysis method, the method can well solve the problem of uncertainty of the tree structure caused by missing data and inapplicable data, and provides a basis for biologists to research species evolution through various tree building principles.
Description
Technical Field
The invention belongs to the field of biological information, relates to development tree analysis and construction in phylogenetic research, and particularly relates to a multi-target phylogenetic tree construction method based on feature hierarchy.
Background
Phylogenetic studies are directed to evolutionary relationships, such as evolutionary history and relationships between species or populations, to understand when and when speciation events may occur, and thus to study species origin. Phylogenetic analysis is to infer and evaluate these evolutionary relationships, with the goal of finding phylogenetic trees that fit species evolution. For a clearer understanding of the study, it can be seen from fig. 1: the species characteristic matrix can be obtained by performing morphological data arrangement on the collected fossil; and performing phylogenetic analysis on the morphological data to obtain a phylogenetic tree capable of reflecting the species evolution relationship.
Morphological phylogeny morphological characteristics are limited compared to molecular phylogeny, and morphological-based phylogeny analysis is slow. For the phylogenetic analysis of early ancient organisms, due to the reasons of long age, change of storage environment, unstable molecular data, limitation of fossil storage and observation means and the like, only morphological data are available as materials for phylogenetic analysis, and the morphological data is an important basis for carrying out phylogenetic analysis of the ancient organisms. Due to the loss of fossil records and the existence of a species characteristic hierarchical relationship, morphological data often contain a large amount of missing data and inapplicable data (note that the inapplicable data appears when the existence of a certain characteristic in a species depends on the nonexistent characteristic), and the existing method cannot effectively process 'problem' data, so that the construction of a phylogenetic tree is inaccurate, and troubles are caused for a biological scientist to research species evolution.
Meanwhile, the commonly used phylogenetic analysis method comprises a distance-based method and an optimal principle-based method, wherein the distance-based method mainly comprises an adjacency method and a UPGMA; the latter mainly includes maximum reduction method, maximum likelihood method and Bayes inference method. Whereas the existing mainstream tools are based on a single optimization principle. Typically, multiple methods are performed on a single modality data set. However, the results of multiple phylogenetic inferences may be conflicting or consistent. When the goals conflict with each other, there is generally no way to satisfy all of the goals.
Disclosure of Invention
Aiming at the technical problem that the construction of a phylogenetic tree is unstable due to the existence of missing data and inapplicable data in morphological data in the prior art, the invention aims to provide a multi-target phylogenetic tree construction method based on feature hierarchy, which combines prior knowledge to generate a feature hierarchical relationship, analyzes the morphological data distribution condition under the feature dependency relationship by taking the feature hierarchical relationship as a support, and further estimates the missing data so as to obtain a complete morphological data set; starting from complete morphological data, the inapplicable Fitch algorithm proposed in Martin D.Brazeau (2017) is adopted to measure the inapplicable data, and the maximum simplification principle and the maximum likelihood principle are combined to construct a multi-target phylogenetic tree, so that a Pareto optimal tree group which simultaneously meets the maximum simplification principle and the maximum likelihood principle is generated, and a basis is provided for a biological scientist to research species evolution.
In order to realize the task, the invention adopts the following technical solutions:
a multi-target phylogenetic tree construction method based on feature hierarchy is characterized by comprising the following steps:
step one, constructing and formalizing a characteristic hierarchical relationship
Analyzing the dependency relationship among the features by combining the feature description, the prior knowledge and other related information, thereby constructing a feature hierarchical relationship; formally expressing the constructed characteristic hierarchical relationship to convert the hierarchical relationship into a data matrix which can be processed by a computer;
step two, estimating the form data missing value based on the Markov Monte Carlo algorithm of the feature level
Step 2.1, to incomplete form matrix DincomInitializing missing values, and performing random interpolation on missing data according to column values;
step 2.2, column analysis is carried out by combining the characteristic hierarchical relationship, and the parameter distribution of each column is analyzed so as to estimate the posterior distribution of the parameters, which mainly comprises the following conditions:
(1) the master feature is present, but no slave feature:
calculating the value distribution of the subordinate characteristics according to different values of the main characteristics;
(2) dependent features are present, but no master features are present:
calculating the value distribution of the main characteristic according to different values of the subordinate characteristic;
(3) there are dependent features, and there are main features:
respectively calculating value distribution of the main characteristic and the subordinate characteristic;
(4) neither the dependent nor the main feature is present:
independently calculating the value distribution of the characteristics;
step 2.3, recording the interpolation current value xtAnd randomly sampling to determine candidate value xt+1I.e. the probability density distribution Q (x) obtained according to step 2.2t+1|xt) Generating candidate samples xt+1(ii) a Will present value xtAnd a candidate value xt+1Separately for each substitution in the column analysis, x is calculated separatelytAnd xt+1And the probability density function values under the existing observation column distribution are marked as pOld and pNew. Extracting a random number U from U (0,1), if U<pNew/pOld, the candidate x is acceptedt+1Newly sampling missing values; otherwise, the current missing interpolation value x is maintainedtThe change is not changed;
step 2.4, iteratively executing the step 2.2-2.3 until the iteration times k are met, and outputting the final interpolated form dataIs marked as Dcom。
Step three, constructing a multi-target phylogenetic tree
Step 3.1, according to the complete form matrix DcomBuilding an initialization Tree group P1The tree group size is N. Respectively calculating a reduced value score and a likelihood value score of each individual in the tree group:
given a phylogenetic tree structure τ where the set of nodes is V (τ) and the set of edges is E (τ), the reduced value score can be expressed as:
wherein, wjWeight, v, representing feature jjAnd ujRespectively representing the characteristic state values of the nodes v and u at the bit point j, C (v)j,uj) Representing slave state v as a cost matrixjTransition to state ujThe cost of (d); the specific feature variation calculation is described in martind. brazeau (2017), et al, which proposes a non-applicable pitch algorithm.
The likelihood L (τ) of the phylogenetic tree τ is the probability that the observation D and evolution model M that produced the tree — L (τ) ═ P (D | τ, M.) the likelihood function L (τ) of the tree τ given D and M can be written as:
wherein, P (D)iand/T, M) is the likelihood value of the ith position.
Each individual is encoded with a chromosome structure { tree structure, reduction score, likelihood score }.
Step 3.2, for the initial population P1And (3) performing rapid non-dominated sorting and congestion degree calculation, wherein the specific optimization process is as follows:
aiming at each generation of evolutionary population, finding out the current optimal individual, wherein the current optimal solution called as an evolutionary population is a non-dominant solution; the set of all non-dominant solutions is called a non-dominating set of the current evolutionary population, and the non-dominating set is made to continuously approach a real optimal solution set to finally reach the optimal.
The rapid non-dominant sorting is to layer the population according to individual non-inferior solutions, so that the search is carried out towards the direction of a Pareto optimal solution set; the specific process is as follows:
setting the evolutionary population as P, and simultaneously setting a non-dominating set P of the evolutionary population*(ii) a First, the first individual is placed in a non-dominating set P*In turn, the individuals in the evolved population PTaking out and putting in P*Simultaneously taking out P and P in sequence*Comparing all individuals in (1), and deleting P*All individuals in (1) are dominated by P if P is dominated by P*Is subject to, P is taken from P*Is deleted.
To enable selective ordering within individuals having the same non-dominant order set, an individual crowding distance is used for the metric; the specific calculation process is as follows:
let P [ i ]]distanceIs the aggregation distance of individual i, P [ i]M is the function value of the individual i on the sub-target m, and the calculation formula is as follows:
P[i]distance=(P[i+1].ps-P[i-1].ps)+(P[i+1].l-P[i-1].l)
wherein ps and l are the simple score and the likelihood score of the tree individual respectively.
And 3.3, carrying out chromosome selection, crossing and mutation operations on the current population, and generating a new population.
Step 3.4, merging the parent population and the child population, performing rapid non-dominated sorting, simultaneously performing crowding degree calculation on the individuals in each non-dominated layer, and selecting proper individuals according to the non-dominated relationship and the crowding degree of the individuals to form a new parent population;
and 3.5, repeating the operation until the condition of ending the program is met.
The multi-target phylogenetic tree construction method based on the feature hierarchy can well measure the problem caused by the inapplicable data, avoids the limitation of tree construction under a single principle, and provides a basis for biologists to research species evolution. Compared with the prior art, the method has the following technical innovation:
1. in the aspect of morphological data missing value estimation, compared with the traditional missing value processing method (such as feature quantity increase, missing value interpolation or non-processing), the method analyzes feature distribution caused by feature dependence by combining a hierarchical structure of features, fuses the feature distribution and missing value estimation, and reduces the influence of missing data on the instability of phylogenetic analysis. The shape data is subjected to missing estimation through a Markov Monte Carlo missing interpolation algorithm, so that the interpolation value can well accord with the distribution of the original data.
2. In the aspect of morphological inapplicable data, an inapplicable Fitch algorithm (Martin D. Brazeau, 2017) is adopted, inapplicable data and missing data can be correctly distinguished and measured, and guarantee is provided for accurately constructing a phylogenetic tree.
3. In the aspect of phylogenetic tree construction, compared with the traditional phylogenetic tree construction tool with a single target, the method adopts a multi-target phylogenetic tree construction method and combines a maximum reduction principle and a maximum likelihood principle to construct the phylogenetic tree, so that a phylogenetic tree group capable of meeting various principles is obtained, and the problem that results of executing various methods on the same morphological data set are possibly conflicted is avoided.
Drawings
FIG. 1 is an exemplary diagram of fossil, morphological data and phylogenetic tree, wherein FIGS. A-F are fossil pictures; panel G is the species signature matrix and panel H is the phylogenetic tree.
FIG. 2 is a feature description and feature hierarchy relationship diagram.
FIG. 3 is a schematic diagram of accuracy evaluation of interpolation results based on MCMC morphological data; wherein panel (a) is canidae, panel (b) is terroris canidae, panel (c) is cambrian poda, and panel (d) is Hawaiian platynin.
FIG. 4 is a schematic diagram illustrating evaluation of tree building accuracy based on interpolation results of MCMC morphological data; wherein panel (a) is canidae, panel (b) is terroris canidae, panel (c) is cambrian poda, and panel (d) is Hawaiian platynin.
FIG. 5 is a schematic diagram of the generation of new population by non-dominated sorting and congestion calculation in step three;
FIG. 6 is a schematic diagram of the construction of the multi-objective phylogenetic tree in step three;
FIG. 7 is a schematic diagram of the mutation operation of the multi-objective phylogenetic tree construction in step three;
FIG. 8 is a diagram of a Pateto treelet set generated based on a multi-objective phylogenetic tree;
FIG. 9 is a set of trees generated based on a multi-objective phylogenetic tree with TNT and Mrbayes;
FIG. 10 is a graph of distance between a Pateto tree set and a standard tree generated based on deletion interpolation and multi-objective phylogenetic trees;
FIG. 11 is an overall flowchart of the method for constructing a multi-objective phylogenetic tree based on feature hierarchy according to the present invention.
The present invention will be described in further detail with reference to the following drawings and examples.
Detailed Description
As shown in fig. 1 to fig. 11, the embodiment provides a method for constructing a multi-objective phylogenetic tree based on feature hierarchy, and the detailed steps are as follows:
step one, constructing and formalizing a hierarchical relationship of expression characteristics
Step 1.1, finding the dependency relationship among the characteristics by combining the prior knowledge and the corresponding characteristic description, and determining the main characteristics, the subordinate characteristics and the logic constraint between the main characteristic value and the subordinate characteristic value. For better understanding, fig. 2 is taken as an example for illustration. According to the characteristic description, the following steps are shown: the values of the characteristic "tail" include "absence (0)", "presence (1)"; the value of the feature "tail color" includes "blue (0)", "red (1)" and "inapplicable (-)", and the value of the feature "tail" is "inapplicable (-)" if and only if the value of the feature "tail" is "absent (0)"; the value of the characteristic "tail length" includes "tail short (0)", "tail long (1)", and "inapplicable (-)", and the value of the characteristic "tail" is "inapplicable (-)" if and only if the value of the characteristic "tail" is "absent (0)"; the values of the feature "eye" include "absence (0)", "presence (1)". Thus, it can be concluded that the feature "tail" is the main feature of "tail color" and "tail length". The following can be obtained by analysis: the main characteristic is 'tail', the corresponding subordinate characteristics are 'tail color' and 'tail length', and the 'eye' has no subordinate characteristic;
and step 1.2, formally defining the characteristic hierarchical relationship obtained in the step 1.1, so that the characteristic hierarchical relationship can be read and analyzed by a computer.
Assume the species signature matrix is D { X }1,…,XnIn which Xi(i∈[1,n]) Representing the characteristics of the ith species, wherein n is the number of the species; species XiHas a characteristic distribution of Xi(xi1,xi2,…,xim) M is a characteristic number, where xip(p∈[1,m]) Represents species XiP characteristic state. The feature set is denoted as { y1,y2,…,ym}. thus, the feature hierarchy relationship matrix may be represented by a matrix C of dimension m × m, where C isijIs the vertex yiAnd vertex yjThe weight value in between.
If the feature yi(0<i ≦ m) logically depends on feature yj(0<j ≦ m), i.e., feature yiAnd feature yjWhen a value dependent relationship exists, the characteristic y is assumedjWhen the value is v, it is the feature yiMain characteristic of (1), then CijThe value is v. In other words, when the main feature yjWhen the value is v, the dependent characteristic yiIs in an applicable state; when the main feature yjDependent feature y when the value is not viIs in an inapplicable state. If when the feature yiAnd feature yjNo value dependency, then CijThe value is ∞. In summary, the feature level relationship matrix calculation formula is as follows:
by referring to the calculation method of the hierarchical relationship matrix of features, the hierarchical relationship of features shown in fig. 2 is still used for analysis, and a corresponding hierarchical relationship matrix (shown in table) can be obtained. As can be seen from the first column of the table, when the tail characteristic value is 1, the characteristic tail is the main characteristic of the tail color and the tail length; otherwise, the tail color and tail length values are in an inapplicable state. Columns 2-4 in the table show that the characteristic tail color, tail length, eye are not sufficient to prove to be the main characteristic of the other characteristics.
Table 1: feature hierarchical relationship matrix
If the shape data has missing data, the data set D can be divided into observation data YobsAnd missing data YmissExpressed as D ═ Yobs,YmissFourthly, performing morphological data missing estimation in the second step; if the shape data has no missing data, step two can be skipped and step three can be directly executed.
Step two, estimating the form data missing value based on the Markov Monte Carlo algorithm of the feature level
Step 2.1, according to the value range of each row of the form data, carrying out random interpolation on the missing data according to the row values;
and 2.2, performing column analysis on the form data by combining the characteristic hierarchical relation matrix obtained in the step 1.2, and analyzing the parameter distribution of each column so as to estimate the posterior distribution of the parameters. According to observable data YobsAnd missing data debit valueObtaining probability density distribution theta under current datat. The specific calculation formula is as follows (where Pr is a specific probability density function):
step 2.3, recording the interpolation current value xtAccording to YobsAnd a predetermined YmissDistributed random sampling to successively determine candidate value xt+1. Candidate sample xt+1Satisfies the probability density distribution theta obtained in step 2.2(t). At this point, the missing data is already included in the sample as a distribution parameter. The specific calculation formula is as follows:
will present value xtAnd a candidate value xt+1Separately for each substitution in the column analysis, x is calculated separatelytAnd xt+1And the probability density function values under the existing observation column distribution are marked as pOld and pNew. Extracting a random number U from U (0,1), if U<pNew/pOld, the candidate x is acceptedt+1Newly sampling missing values; otherwise, the current missing interpolation value x is maintainedtAnd is not changed.
Step 2.4, iteratively executing the step 2.2-2.3 until the iteration times k are met, and outputting the final interpolated form dataIs marked as Dcom. According to the operation, the Markov chain can be obtained through multiple iterationsThe result after deletion estimation can be obtained from the Markov chainWhen k is sufficiently large, it is preferable that,will converge to the same distribution, i.e. a smooth distribution.
Further, the specific process of the second step comprises:
the species signature matrix dataset D containing missing data may be represented as D ═ Yobs,YmissIn which Y isobsRepresenting observed data, YmissIndicating missing data.
The morphological data missing estimation algorithm comprises the following specific steps:
i, setting initial valueAnd ordert is 0, wherein p is the number of deletions in the morphology dataset;
II, calculating probability density distribution under the t sampling
III, sampling for the t +1 th time, and specifically, the following steps:
a) extracting from the distribution of the column characteristic values of the 1 st missing valueMake it satisfy distribution
b) Extracting from the distribution of the column characteristic values of the 2 nd missing valueMake it satisfy distribution
c) Extracting from the distribution of the characteristic values of the column where the ith missing value is locatedMake it satisfy distribution……
d) Extracting from the column characteristic value distribution of the p-th missing valueMake it satisfy distribution
I, recordTo Y'missIf the relative change amount exceedsGreater than the random number U in U (0,1), then acceptNewly sampling missing values; otherwise, the current sampling value is maintainedIs not changed, i.e.Can be written as
And II, repeating the step k times, and when k is large enough, the Markov chain is gradually converged, and the state distribution tends to be stable, so that a plurality of deletion estimated values can be obtained simultaneously.
In order to verify the effectiveness of the feature hierarchy-based multi-objective phylogenetic tree construction method in missing value estimation, experiments are respectively carried out on 4 morphological data sets, wherein the data sets comprise canidae, ciconidae, cambrian quadruped and Hawaiian Platynini. The complete morphological data set is obtained by carrying out random deletion on the morphological data set, wherein the deletion ratio is 5% -95%, and then carrying out deletion value estimation on the data set subjected to random deletion by adopting an MCMC deletion estimation algorithm.
In this embodiment, the deletion estimation performance is estimated based on the proximity of the interpolation value and the true value, the data set before random deletion is compared with the data set after MCMC deletion estimation, and compared with mic packet interpolation, random forest interpolation, fixed value interpolation, and random interpolation, and the quality of interpolation is estimated using Root Mean Square Error (RMSE), and the calculation formula is as follows:
the results of the experiment are shown in FIG. 3. Experimental results show that the multi-target phylogenetic tree construction method based on the feature hierarchy has a lower root mean square error and a maximum fixed value interpolation error compared with other methods. With the gradual increase of the deletion ratio, the root mean square error after interpolation by several methods is increased continuously. However, because the method provided in the embodiment finds the constraint of the logic value between the features through the hierarchical relationship of the features, a relatively low root mean square error can be obtained under the condition of high loss. For the data set after deletion estimation, the invention defines the topological difference rate for measuring the accuracy of the data set after deletion estimation on phylogenetic analysis, and the topological difference rate is specifically defined as a real tree T1With a single tree T randomly selected among the shortest trees for a given search2The symmetric difference rate between the two is calculated according to the following formula:
wherein n is the number of species, | split (tree) | represents the number of "partition" tree sets of tree, | split (T)1)∩split(T2) I represents T1And T2The same number of sets of partition trees. The experimental result is shown in fig. 4, and the result shows that, compared with several other methods, the method for constructing the multi-target phylogenetic tree based on the feature hierarchy provided in this embodiment has a smaller difference between the tree structure constructed by the interpolated data set and the standard tree, and the next time is random forest interpolation.
As can be seen from fig. 3 and 4, there is no direct relationship between the interpolation accuracy and the symmetric difference rate, that is, the symmetric difference rate is not necessarily low when the phylogenetic tree is constructed on the data set with high interpolation accuracy. The multi-target phylogenetic tree construction method based on the feature hierarchy provided by the embodiment can have a relatively low root mean square error by combining the feature hierarchy relationship, and the tree construction result is closer to a standard tree structure.
Step three, constructing a multi-target phylogenetic tree
Step 3.1, according to the complete form matrix DcomBuilding an initialization Tree group P1The tree group size is N. A reduced value score and a likelihood value score are calculated for each individual in the tree group separately.
Each individual is encoded with a chromosome structure { tree structure, reduction score, likelihood score }.
Step 3.2, for the initial population P1And performing fast non-dominated sorting and congestion degree calculation. And the non-dominant solution quick sequencing is to layer the population according to individual non-inferior solutions so as to enable the search to be carried out towards the Pareto optimal solution set direction. Aiming at each generation of evolutionary population, finding out the current optimal individual, namely a non-dominant solution (non-dominant solution) of the evolutionary population; the set of all non-dominated solutions is called a non-dominated solution set (NDSet) of the current evolutionary population, and the non-dominated solution set is continuously approximated to a real optimal solution set to finally reach the optimal value.
Defining: let P and q be any two different individuals in the evolutionary population P, if P dominates q, then it is noted that P > q:
i, p is no worse than q for all sub-targets. I.e. fk(p)≤fk(q)(k=1,2,…r);
Setting the evolutionary population as P, and simultaneously setting a non-dominating set P of the evolutionary population*The specific optimization process is as follows:
first, the first individual is placed in a non-dominating set P*In turn, the individuals in the evolved population PTaking out and putting in P*Simultaneously taking out P and P in sequence*Is compared with all of the individuals in (a),deleting P*All individuals in (1) are dominated by P if P is dominated by P*Is subject to, P is taken from P*Is deleted.
For a better understanding, the non-dominated building process is briefly described:
first, find the non-dominant solution in the population, which is marked as the first non-dominant layer F1Assigning a non-dominant ranking i to all of its individualsrank1 (wherein i)rankIs the non-dominant rank value of individual i) and the individuals of this non-dominant layer are removed from the entire population; then, the non-dominant solution set in the rest population is continuously found out and is marked as F2All of them are assigned a non-dominant rank value i rank2; and (4) circulating the operation until the whole population is completely layered. Individuals within the same hierarchy have the same non-dominant order value irank。
To be able to have the same irankAnd selective ordering is carried out in the individuals, and the individual crowding distance is adopted for measurement. The crowding distance of the individual i is the distance between two individuals i +1 and i-1 adjacent to the individual i in the target space, and the specific calculation steps are as follows:
and I, initializing the distance of the same layer. Let P [ i]distance0 (wherein P [ i ]]distanceIndicates the crowding distance of any individual i);
II, arranging the individuals on the same layer in an ascending order according to the m function values;
III, making the individuals on the sorted edges have a selective advantage. Given a large number W, let P [0 ]]distance=P[i]distance=W;
IV, calculating the crowdedness P [ i ] of the individuals i in the middle of the sequence]distance:
P[i]distance=(P[i+1].f1-P[i-1].f1)+(P[i+1].f2-P[i-1].f2)
Wherein f is1,f2Respectively, the simple score and the likelihood score of the tree individual, P [ i]M is the function value of the individual i on the sub-target m.
Repeating the steps II to IV for different target functionsOperation is performed to obtain the crowding distance P [ i ] of the individual i]distanceBy preferentially selecting individuals with large crowding distances, the calculation results can be distributed relatively uniformly in the target space to maintain the diversity of the population.
For a clearer understanding of the non-dominated sorting and the congestion calculation and sorting process, fig. 5 is taken as an example for illustration. Firstly, the t generation parent generation group PtAnd the progeny population Q produced in the t generationtThe new population R with the scale of 2N is generated by combinationt. Then, for the new population RtFast sorting of the non-dominant solution is carried out, and a plurality of non-dominant layers F are generated1,F2,F3…, wherein the non-dominant layer priorities are decremented from layer to layer. Then, the crowdedness of the individuals in each non-dominant layer is calculated and ranked, and the first N individuals are selected to form a new population Pt+1. In FIG. 5, there is a non-dominant layer F1And F2And the non-dominant layer F3A part of individuals are selected into a new group Pt+1。
And 3.3, carrying out chromosome selection, crossing and mutation operations on the current population, and generating a new population.
1) Selecting operation: tournament selection is used to pick individuals from parents to generate offspring populations.
The method comprises the following specific steps:
i, determining the individual quantity pool/road selected each time;
and II, randomly selecting individuals from the population to form a group, and selecting the individual with the best fitness value to enter a filial generation population according to the fitness value of each individual.
And (5) repeating the step II N times, wherein the obtained individuals form a new generation of population.
2) And (3) cross operation: the original superior genes are inherited to next generation individuals and new individuals containing more complex gene structures are generated.
Firstly, randomly taking out a pair of individuals to be mated from a mating pool; according to the cross probability Pc(0<Pc1) to carry out the crossover operation, thereby forming a new pair of individuals.
The specific interleaving steps are shown in fig. 6: given two trees T1And T2From tree T1Pruning a part of subtrees s; then, the tree T is removed2All leaf nodes appear in s; by replanting subtrees to T2Randomly selecting branches to obtain a descendant subtree T'1。
3) Mutation operation: since the phylogenetic tree needs to be mutated, the present embodiment adopts a branch exchange algorithm, which is commonly used in phylogenetic analysis, to implement nearest neighbor exchange (NNI).
The specific process is shown in fig. 7: first, an internal branch is selected, two branches are respectively connected to two sides of the internal branch, and the two branches are neighbors of each other. Interchanging one branch on one side of the internal branch with one branch on the other side, thereby enabling interchanging phylogenetic trees of possible nearest neighbors resulting from a phylogenetic tree mutation.
Step 3.4, merging the parent population and the offspring population, performing rapid non-dominated sorting and congestion degree calculation, and selecting proper individuals to form a new parent population according to the non-dominated relationship and the congestion degree of the individuals;
and 3.5, repeating the operation until the condition of ending the program is met.
The inventors performed experiments on 4 data sets (canidae, terroris canidae, cambrian cladophora, Hawaiian platynii) and fig. 8 is a graph of Pareto treegand generated from experiments performed on canidae data sets. The existing phylogenetic analysis software TNT and mrboyes are compared, 500 phylogenetic tree sets are generated at the same time, the intersection set of the sets is calculated to calculate the average, the method is used for evaluating the effectiveness of the feature level-based multi-target phylogenetic tree construction method provided by the embodiment, and the experimental result is shown in fig. 9. The experimental result shows that the generated tree structure of the multi-target phylogenetic tree construction method based on the feature hierarchy can generate more intersections with the existing tools, and the 65.4% of the tree structures generated by the method can be consistent with the phylogenetic tree generated by the existing software.
Meanwhile, the symmetry difference measurement is performed on the multi-target phylogenetic tree construction method based on the feature hierarchy, the tree set constructed by TNT and Mrbayes and the standard tree, and the experimental result is shown in fig. 10. Experimental results show that the symmetry differences of the multi-target phylogenetic tree construction method based on the feature hierarchy are respectively 13.2, 16.5, 37.8 and 53.1, which is entirely lower than the existing methods, that is, the multi-target phylogenetic tree construction method based on the feature hierarchy provided in the embodiment can generate results more similar to standard trees.
In summary, the multi-target phylogenetic tree construction method based on the feature hierarchy of the embodiment can better process missing and unapplicable data in morphological data, can generate a phylogenetic tree set satisfying a maximum reduction method and a maximum likelihood method at the same time, and has a certain application prospect.
Claims (2)
1. A multi-target phylogenetic tree construction method based on feature hierarchy is characterized by comprising the following steps:
step one, constructing and formalizing a characteristic hierarchical relationship
Analyzing the dependency relationship among the features by combining the feature description, the prior knowledge and other related information, thereby constructing a feature hierarchical relationship; formally expressing the constructed characteristic hierarchical relationship to convert the hierarchical relationship into a data matrix which can be processed by a computer;
step two, estimating a form data missing value based on a Markov Monte Carlo algorithm of a feature level, wherein the specific method is as follows;
step 2.1, to incomplete form matrix DincomInitializing missing values, and performing random interpolation on missing data according to column values;
step 2.2, column analysis is carried out by combining the characteristic hierarchical relationship, and the parameter distribution of each column is analyzed so as to estimate the posterior distribution of the parameters, which mainly comprises the following conditions:
(1) the master feature is present, but no slave feature:
calculating the value distribution of the subordinate characteristics according to different values of the main characteristics;
(2) dependent features are present, but no master features are present:
calculating the value distribution of the main characteristic according to different values of the subordinate characteristic;
(3) there are dependent features, and there are main features:
respectively calculating value distribution of the main characteristic and the subordinate characteristic;
(4) neither the dependent nor the main feature is present:
independently calculating the value distribution of the characteristics;
step 2.3, recording the interpolation current value xtAnd randomly sampling to determine candidate value xt+1I.e. the probability density distribution Q (x) obtained according to step 2.2t+1|xt) Generating candidate samples xt+1(ii) a Will present value xtAnd a candidate value xt+1Separately for each substitution in the column analysis, x is calculated separatelytAnd xt+1Probability density function values under the existing observation column distribution are marked as pOld and pNew;
extracting a random number U from U (0,1), and accepting a candidate value x if U < pNew/pOldt+1Newly sampling missing values; otherwise, the current missing interpolation value x is maintainedtThe change is not changed;
step 2.4, iteratively executing the step 2.2-2.3 until the iteration times k are met, and outputting the final interpolated form dataIs marked as Dcom;
Step three, constructing a multi-target phylogenetic tree
Step 3.1, according to the complete form matrix DcomBuilding an initialization Tree group P1And the size of the tree group is N, and the reduction value score and the likelihood value score of each individual in the tree group are respectively calculated as follows:
given a phylogenetic tree structure τ where the set of nodes is V (τ) and the set of edges is E (τ), the reduced value score can be expressed as:
wherein, wjWeight, v, representing feature jjAnd ujRespectively representing the characteristic state values of the nodes v and u at the bit point j, C (v)j,uj) Representing slave state v as a cost matrixjTransition to state ujThe cost of (d);
the likelihood L (τ) of the phylogenetic tree τ is the probability that the observed data D and the evolutionary model M that produced the tree, i.e., L (τ) ═ P (D | τ, M), the likelihood function L (τ) of the tree τ given D and M can be written as:
wherein, P (D)ithe/T, M) is the likelihood value of the ith position;
encoding each individual, wherein the chromosome structure is { tree structure, reduction score and likelihood score };
step 3.2, for the initial population P1Performing rapid non-dominated sorting and congestion degree calculation;
step 3.3, carrying out chromosome selection, cross operation and mutation operation on the current population, and generating a new population;
step 3.4, merging the parent population and the child population, performing rapid non-dominated sorting, simultaneously performing crowding degree calculation on the individuals in each non-dominated layer, and selecting proper individuals according to the non-dominated relationship and the crowding degree of the individuals to form a new parent population;
and 3.5, repeating the operation until the condition of ending the program is met.
2. The method of claim 1, wherein step 3.2 is performed on the starting population P1And (3) performing rapid non-dominated sorting and congestion degree calculation, wherein the specific optimization process is as follows:
aiming at each generation of evolutionary population, finding out the current optimal individual, wherein the current optimal solution called as an evolutionary population is a non-dominant solution; the set of all non-dominating solutions is called as a non-dominating set of the current evolutionary population, and the non-dominating set is made to continuously approach a real optimal solution set to finally reach the optimal;
the rapid non-dominant sorting is to layer the population according to individual non-inferior solutions, so that the search is carried out towards the direction of a Pareto optimal solution set; the specific process is as follows:
setting the evolutionary population as P, and simultaneously setting a non-dominating set P of the evolutionary population*(ii) a First, the first individual is placed in a non-dominating set P*In turn, the individuals in the evolved population PTaking out and putting in P*Simultaneously taking out P and P in sequence*Comparing all individuals in (1), and deleting P*All individuals in (1) are dominated by P if P is dominated by P*Is subject to, P is taken from P*Deleting;
in order to selectively sort in individuals with the same non-dominant order set, the individual crowding distance is used for measurement, and the specific calculation process is as follows:
let P [ i ]]distanceIs the aggregation distance of individual i, P [ i]M is the function value of the individual i on the sub-target m, and the calculation formula is as follows:
P[i]distance=(P[i+1].ps-P[i-1].ps)+(P[i+1].l-P[i-1].l)
wherein ps and l are the simple score and the likelihood score of the tree individual respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010168038.5A CN111462812B (en) | 2020-03-11 | 2020-03-11 | Multi-target phylogenetic tree construction method based on feature hierarchy |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010168038.5A CN111462812B (en) | 2020-03-11 | 2020-03-11 | Multi-target phylogenetic tree construction method based on feature hierarchy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111462812A true CN111462812A (en) | 2020-07-28 |
CN111462812B CN111462812B (en) | 2023-03-24 |
Family
ID=71680798
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010168038.5A Active CN111462812B (en) | 2020-03-11 | 2020-03-11 | Multi-target phylogenetic tree construction method based on feature hierarchy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111462812B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112817959A (en) * | 2021-02-25 | 2021-05-18 | 西北大学 | Construction method of ancient biomorphic phylogenetic tree based on multi-metric index weight |
CN114613426A (en) * | 2022-01-26 | 2022-06-10 | 西北大学 | Phylogenetic tree construction method based on dynamic multi-objective optimization |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004057511A2 (en) * | 2002-12-23 | 2004-07-08 | Universität Karlsruhe | Methods for the analysis, classification and/or tree construction of sequences using correlation analysis |
US20070225918A1 (en) * | 2003-06-19 | 2007-09-27 | Khalid Sayood | System and Method for Sequence Distance Measure for Phylogenetic Tree Construction |
CN106326689A (en) * | 2015-06-25 | 2017-01-11 | 深圳华大基因科技服务有限公司 | Method and device for determining site subject to selection in colony |
US20170212980A1 (en) * | 2016-01-25 | 2017-07-27 | Shenzhen University | Construction method for heuristic metabolic co-expression network and the system thereof |
GB201717125D0 (en) * | 2016-11-28 | 2017-11-29 | National Univ Of Defense Technology | Differential evolution method oriented to agile satellite multi-target task planning |
CN108509764A (en) * | 2018-02-27 | 2018-09-07 | 西北大学 | A kind of extinct plants and animal pedigree evolution analysis method based on genetic property yojan |
CN109326328A (en) * | 2018-11-02 | 2019-02-12 | 西北大学 | A kind of extinct plants and animal pedigree evolution analysis method based on pedigree cluster |
-
2020
- 2020-03-11 CN CN202010168038.5A patent/CN111462812B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004057511A2 (en) * | 2002-12-23 | 2004-07-08 | Universität Karlsruhe | Methods for the analysis, classification and/or tree construction of sequences using correlation analysis |
US20070225918A1 (en) * | 2003-06-19 | 2007-09-27 | Khalid Sayood | System and Method for Sequence Distance Measure for Phylogenetic Tree Construction |
CN106326689A (en) * | 2015-06-25 | 2017-01-11 | 深圳华大基因科技服务有限公司 | Method and device for determining site subject to selection in colony |
US20170212980A1 (en) * | 2016-01-25 | 2017-07-27 | Shenzhen University | Construction method for heuristic metabolic co-expression network and the system thereof |
GB201717125D0 (en) * | 2016-11-28 | 2017-11-29 | National Univ Of Defense Technology | Differential evolution method oriented to agile satellite multi-target task planning |
CN108509764A (en) * | 2018-02-27 | 2018-09-07 | 西北大学 | A kind of extinct plants and animal pedigree evolution analysis method based on genetic property yojan |
CN109326328A (en) * | 2018-11-02 | 2019-02-12 | 西北大学 | A kind of extinct plants and animal pedigree evolution analysis method based on pedigree cluster |
Non-Patent Citations (2)
Title |
---|
冯思玲: "系统发育树构建方法研究", 《信息技术》 * |
张丽娜等: "常用系统发育树构建算法和软件鸟瞰", 《动物学研究》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112817959A (en) * | 2021-02-25 | 2021-05-18 | 西北大学 | Construction method of ancient biomorphic phylogenetic tree based on multi-metric index weight |
CN112817959B (en) * | 2021-02-25 | 2023-03-24 | 西北大学 | Construction method of ancient biomorphic phylogenetic tree based on multi-metric index weight |
CN114613426A (en) * | 2022-01-26 | 2022-06-10 | 西北大学 | Phylogenetic tree construction method based on dynamic multi-objective optimization |
CN114613426B (en) * | 2022-01-26 | 2023-10-31 | 西北大学 | System development tree construction method based on dynamic multi-objective optimization |
Also Published As
Publication number | Publication date |
---|---|
CN111462812B (en) | 2023-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Excoffier et al. | Robust demographic inference from genomic and SNP data | |
US9058564B2 (en) | Controlling quarantining and biasing in cataclysms for optimization simulations | |
US8700548B2 (en) | Optimization technique using evolutionary algorithms | |
CN112181867B (en) | On-chip network memory controller layout method based on multi-target genetic algorithm | |
CN111462812B (en) | Multi-target phylogenetic tree construction method based on feature hierarchy | |
CN103838820B (en) | Evolutionary multi-objective optimization community detection method based on affinity propagation | |
JP2007200302A (en) | Combining model-based and genetics-based offspring generation for multi-objective optimization using convergence criterion | |
JP6451735B2 (en) | Energy amount estimation device, energy amount estimation method, and energy amount estimation program | |
CN113903395A (en) | BP neural network copy number variation detection method and system for improving particle swarm optimization | |
CN105590039B (en) | A kind of protein complex recognizing method based on BSO optimizations | |
Lamiable et al. | An algorithmic game-theory approach for coarse-grain prediction of RNA 3D structure | |
García-Pareja et al. | Exact simulation of coupled Wright–Fisher diffusions | |
CN113673695B (en) | Crowd behavior rule automatic extraction method based on novel feature automatic construction | |
CN112837739B (en) | Hierarchical feature phylogenetic model based on self-encoder and Monte Carlo tree | |
Belkhir | Per instance algorithm configuration for continuous black box optimization | |
CN114996278A (en) | Road network shortest path distance calculation method based on reinforcement learning | |
Denysiuk et al. | A new hybrid evolutionary multiobjective algorithm guided by descent directions | |
Cancino et al. | A multi-objective evolutionary approach for phylogenetic inference | |
CN113837474A (en) | Regional soil heavy metal pollution index prediction method and device | |
CN113704570A (en) | Large-scale complex network community detection method based on self-supervision learning type evolution | |
Schwefel | Evolutionary computation-a study on collective learning | |
Lou | An intelligent teaching test paper generation system based on ant colony hybrid genetic algorithms | |
CN105978816B (en) | Multicast tree optimization method based on genetic framework | |
Khoder et al. | Community tracking in time evolving networks: an evolutionary multi-objective approach | |
Moradi et al. | An Improved Multi-objective Genetic Algorithm for Revealing Community Structures of Complex Networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |