CN114613426B

CN114613426B - Phylogenetic tree construction method based on dynamic multi-objective optimization

Info

Publication number: CN114613426B
Application number: CN202210094499.1A
Authority: CN
Inventors: 冯宏伟; 王蓓; 李燕; 刘建妮; 冯筠
Original assignee: NORTHWEST UNIVERSITY
Current assignee: NORTHWEST UNIVERSITY
Priority date: 2022-01-26
Filing date: 2022-01-26
Publication date: 2023-10-31
Anticipated expiration: 2042-01-26
Also published as: CN114613426A

Abstract

The application discloses a phylogenetic tree construction method based on dynamic multi-objective optimization. Where single-objective tree construction cannot resolve conflicts between trees, the method dynamically selects several of the best-performing single objectives according to the data type and fuses them. It then applies a multi-objective optimization algorithm that combines non-dominated sorting with a genetic algorithm to construct phylogenetic trees, finally obtaining an optimal phylogenetic tree set under the multiple objectives for data containing missing and inapplicable entries. Compared with phylogenetic tree construction methods based on single-objective optimization, the method effectively addresses missing and inapplicable data in phylogenetic tree construction from paleobiological morphological data, as well as conflicts among multiple evaluation indexes, and improves the accuracy and stability of species phylogenetic analysis.

Description

Phylogenetic tree construction method based on dynamic multi-objective optimization

Technical Field

The application belongs to the technical field of bioinformatics, and relates to a phylogenetic tree construction method based on dynamic multi-objective optimization.

Background

Phylogenetic tree construction is an important way of expressing the conclusions of phylogenetic studies, and phylogenetics is an important branch of computational biology. A phylogenetic tree shows the relatedness and evolutionary relationships among species intuitively and vividly, and provides an important basis for human research into the origin, development, and future of life. Currently, phylogenetic analysis is divided into two main branches: molecular phylogenetic analysis, which studies the protein molecules and nucleotide sequences of present-day organisms, and morphological phylogenetic analysis, which studies paleobiological morphological data. The present application is directed to the latter, i.e., phylogenetic analysis based on morphological data.

Molecular phylogenetic analysis works on protein and nucleotide sequences extracted from organisms. For ancient organisms that lived hundreds of millions of years ago, however, the molecular data in fossils are very unstable or no longer exist, owing to changes in the burial environment and natural processes such as weathering and desertification, so researchers can only study these organisms through the morphological data extracted by paleontologists.

Precisely because fossils are subject to this series of uncertain factors, the morphological data under study contain missing and inapplicable entries (note: a sub-feature of a species is inapplicable when the feature it depends on is absent in that species). Such problematic data can bias the constructed phylogenetic tree, because too little information is available during construction. Common ways of handling missing data include increasing the number of features, deleting the missing data directly, imputing the missing values, and treating missing data as a new state.

In practice, paleontologists sometimes simply ignore missing data in phylogenetic analyses. In addition, the number of possible tree topologies grows exponentially as the number of species increases, which makes phylogenetic tree construction difficult.
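To make this growth concrete, the following minimal sketch counts the distinct unrooted binary topologies on n labelled taxa using the standard closed form (2n − 5)!!. The function name is illustrative and is not part of the application.

```python
def num_unrooted_binary_trees(n_taxa: int) -> int:
    """Number of unrooted binary tree topologies on n labelled taxa: (2n - 5)!!."""
    if n_taxa < 3:
        return 1  # with fewer than three taxa there is only one topology
    count = 1
    # Multiply the odd numbers 3 * 5 * ... * (2n - 5).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20, 40):
    print(n, num_unrooted_binary_trees(n))
```

Even at 10 taxa there are already 2,027,025 topologies, which is why exhaustive search is replaced by heuristic search in practice.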

At present, methods for constructing phylogenetic trees fall into two types: distance-based methods and optimality-criterion-based methods. Distance-based methods include the neighbor-joining method, UPGMA, WPGMA, and the like; optimality-criterion-based methods include maximum parsimony, maximum likelihood, and Bayesian inference. However, all of these build trees under a single optimization criterion. In actual tree construction several criteria often need to be considered, yet for the same phylogenetic tree different criteria can conflict, which makes tree construction difficult.

Disclosure of Invention

In view of the defects or shortcomings of the prior art, the application aims to provide a multi-objective phylogenetic tree construction method based on objective selection, which addresses missing and inapplicable entries in paleobiological morphological data while avoiding conflicts between phylogenetic trees built under different tree construction criteria.

In order to achieve the above task, the present application adopts the following technical solutions:

A phylogenetic tree construction method based on dynamic multi-objective optimization, characterized by comprising the following specific steps:

step one, constructing trees for a plurality of groups of data with known standard trees by using different single-objective phylogenetic tree construction methods, and ranking the objectives according to the RF distances between the tree sets generated under the different optimization objectives and the standard trees;

step two, matching the real paleobiological morphological data with the simulated data and selecting the optimization objectives of the matched simulated data, thereby dynamically obtaining the optimization objectives for multi-objective tree construction on the real morphological data;

step three, constructing a multi-objective phylogenetic tree through the dynamically selected optimization objectives, non-dominated sorting, and a genetic algorithm, and selecting an optimal phylogenetic tree result set.

Specifically, the implementation method of the first step is as follows:

according to the characteristics of paleobiological phylogenetic trees, a binary phylogenetic tree is simulated in which all leaf nodes are the positions of the classification units (taxa); biological evolution is then simulated according to a Markov model in which the transition probabilities among the feature states are the same, generating feature matrices with different numbers of features and species:

$$D = \{X_1, X_2, \ldots, X_i, \ldots, X_n\}$$

where $X_i$ represents the i-th species; the state of the p-th morphological feature of species $X_i$ is denoted $X_{ip}$; the number of species used to construct the phylogenetic tree is denoted n, and the number of features is denoted m; the feature vector of species $X_i$ is expressed as follows:

$$X_i = (X_{i1}, X_{i2}, \ldots, X_{ip}, \ldots, X_{iq}, \ldots, X_{im});$$

then, entries of the original feature matrix are randomly set to missing at different missing rates; the data containing missing entries are denoted $D_{missing}$, and the corresponding feature matrix is:

$$D_{missing} = \{X'_1, X'_2, \ldots, X'_i, \ldots, X'_n\}$$

missing proportions of 10%, 25%, 50%, and 75% are applied to the simulated data, with the missing entries placed at randomly selected sites in the feature matrix;

phylogenetic trees based on single-objective optimization are constructed for each of the 3 × 4 × 4 = 48 groups of data (three different species numbers, four different feature numbers, and four different missing rates), using in turn the parsimony value, likelihood value, Fitch value, CI value, and RI value used in phylogenetic tree construction, and an optimal phylogenetic tree set is finally selected for each;

the RF distances between the tree sets obtained under the different optimization objectives and the standard tree are calculated, and the optimization objectives are ranked from small to large RF distance, giving the correspondence between the data and the optimization objectives.

Further, the implementation method of the second step is as follows:

the real paleobiological morphological data are dynamically matched with the simulated data: the number of species in the real data is first matched with the number of species in the simulated data, then the number of features is matched, and finally the simulated data set closest in size to the real data set is found according to the missing rate;

according to the data matching result, the top N objectives for the matched simulated data are used as the objectives of the multi-objective optimization and serve as the N criteria for the subsequent tree-space search; the objectives are applied in order, sorting first by the first objective and so on, so that objectives ranked earlier have more influence on the final tree construction result.

Preferably, the specific implementation steps of the third step are as follows:

step 3.1: initializing a tree set

Missing data are encoded as a new state, and inapplicable data are treated as missing data; an initial tree set is generated by alternating between the maximum parsimony method and the maximum likelihood method;

step 3.2: iterative updating by the genetic algorithm

The topologies of the initial tree set are changed through the crossover and mutation operators of the genetic algorithm in order to search the tree space; branch exchange on the phylogenetic tree deletes, grafts, and repositions classification units on the tree;

step 3.3: non-dominated sorting

Objective values are calculated under the different optimization objectives, and the trees are sorted by non-domination; the n phylogenetic trees are sorted into different non-dominated layers, the first non-dominated layer being the optimal solution set, namely the Pareto front, the second layer the next best, and so on; the first m trees in the non-dominated ordering are selected for the next iterative update;

step 3.4: steps 3.2 to 3.3 are repeated until the iteration termination condition is reached, finally yielding the optimal phylogenetic tree set.

Compared with the prior art, the phylogenetic tree construction method based on dynamic multi-objective optimization provided by the application brings the following technical innovations:

1. Compared with phylogenetic tree construction methods based on single-objective optimization, the method effectively addresses missing and inapplicable data in phylogenetic tree construction from paleobiological morphological data, as well as conflicts among multiple evaluation indexes, and improves the accuracy and stability of species phylogenetic analysis.

2. The relation between the data and the optimization objectives is explored using simulated data, yielding general conclusions about phylogenetic tree construction from morphological data and providing a reference for such construction.

Drawings

FIG. 1 is a simulated standard tree, number of species (taxa) = 40;

FIG. 2 is a simulated morphological feature matrix;

FIG. 3 is a morphological feature matrix containing deletions;

FIG. 4 is a graph showing comparison of different optimization objectives for the same species number, wherein:

FIG. 4 (a) is a comparison of single-objective tree-construction RF distances (100 × 100), RF distance: likelihood < Fitch < parsimony < RI < CI;

FIG. 4 (b) is a comparison of single-objective tree-construction RF distances (100 × 75), RF distance: likelihood < parsimony < Fitch < RI < CI;

FIG. 4 (c) is a comparison of single-objective tree-construction RF distances (100 × 25), RF distance: likelihood < parsimony < RI < CI < Fitch;

FIG. 5 is a graph showing different optimization objectives when feature numbers are the same, wherein:

FIG. 5 (a) is a comparison of single-objective tree-construction RF distances (100 × 100), RF distance: likelihood < Fitch < parsimony < RI < CI;

FIG. 5 (b) is a comparison of single-objective tree-construction RF distances (40 × 100), RF distance: Fitch < parsimony < likelihood < RI < CI;

FIG. 5 (c) is a comparison of single-objective tree-construction RF distances (20 × 100), RF distance: likelihood < parsimony < Fitch < CI < RI;

FIG. 6 shows the effect of different species numbers on the tree construction results when the number of features = 100, wherein:

FIG. 6 (a) is a comparison of RF distances for tree construction with the parsimony value as the objective;

FIG. 6 (b) is a comparison of RF distances for tree construction with the likelihood value as the objective;

FIG. 6 (c) is a comparison of RF distances for tree construction with the Fitch value as the objective;

FIG. 7 shows the effect of different feature numbers on the tree construction results when the number of species = 40, wherein:

FIG. 7 (a) is a comparison of RF distances for tree construction with the parsimony value as the objective, for feature numbers of 25, 50, 75, and 100;

FIG. 7 (b) is a comparison of RF distances for tree construction with the likelihood value as the objective, for feature numbers of 25, 50, 75, and 100;

FIG. 7 (c) is a comparison of RF distances for tree construction with the Fitch value as the objective, for feature numbers of 25, 50, 75, and 100;

FIG. 8 is a fossil image with missing parts;

FIG. 9 shows inapplicable data.

For the tail feature in this figure, there are three states: present, absent, and missing. When the tail is absent, the two features describing tail color and tail length are inapplicable, denoted by 'N'.

FIG. 10 is a phylogenetic tree construction method commonly used at present;

FIG. 11 is a representation of multi-objective phylogenetic tree construction data;

FIG. 12 is a schematic diagram of phylogenetic tree construction;

FIG. 13 is a flow chart of a phylogenetic tree construction method based on dynamic multi-objective optimization according to the present application.

The application is explained in further detail below with reference to the drawings and examples.

Detailed Description

The embodiment provides a phylogenetic tree construction method based on dynamic multi-objective optimization, which comprises the following steps:

Step one, ranking the optimization objectives on data with known standard trees:

According to the characteristics of paleobiological phylogenetic trees, a binary phylogenetic tree is simulated in which all leaf nodes are the positions of the classification units. The simulated binary phylogenetic tree is randomly generated using the rcoal() function in the R language, as shown in FIG. 1: a simulated phylogenetic tree containing 40 species, in which all leaf nodes $t_1$ to $t_{40}$ are the positions of taxa. Biological evolution is then simulated according to a Markov model using the rTraitDisc() function, with the same transition probability between feature states. The features generated by each simulation are mutually independent, so feature matrices with different feature numbers are generated by repeating the simulation m times in a loop.
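The simulation described above relies on R functions; purely as an illustration, the sketch below reproduces the idea in Python under simplifying assumptions that are not taken from the application: the tree is a nested tuple built by joining random pairs of lineages (no branch lengths), and each two-state feature flips with the same hypothetical probability `flip_prob` along every edge.

```python
import random


def random_binary_tree(taxa, rng):
    """Join two random lineages at a time until one rooted binary tree remains."""
    nodes = list(taxa)
    while len(nodes) > 1:
        a = nodes.pop(rng.randrange(len(nodes)))
        b = nodes.pop(rng.randrange(len(nodes)))
        nodes.append((a, b))
    return nodes[0]


def evolve_feature(tree, state, flip_prob, rng, out):
    """Propagate one two-state feature (states 1 and 2) from the root to the leaves."""
    if not isinstance(tree, tuple):        # leaf: record the observed state
        out[tree] = state
        return
    for child in tree:
        child_state = state
        if rng.random() < flip_prob:       # symmetric change 1 <-> 2 along this edge
            child_state = 3 - state
        evolve_feature(child, child_state, flip_prob, rng, out)


def simulate_matrix(n_species, n_features, flip_prob=0.1, seed=0):
    """Return (tree, matrix); matrix maps each taxon to its list of feature states."""
    rng = random.Random(seed)
    taxa = [f"t{i}" for i in range(1, n_species + 1)]
    tree = random_binary_tree(taxa, rng)
    matrix = {t: [] for t in taxa}
    for _ in range(n_features):            # features are simulated independently
        states = {}
        evolve_feature(tree, rng.choice([1, 2]), flip_prob, rng, states)
        for t in taxa:
            matrix[t].append(states[t])
    return tree, matrix


tree, matrix = simulate_matrix(n_species=40, n_features=25)
```

The returned tree plays the role of the standard tree, and the matrix has the same shape as the 40 × 25 example discussed in the text.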

FIG. 2 shows a feature matrix of simulated morphological data with 40 species and 25 features, in which each feature takes the value '1' or '2', representing two different feature states.

For example, if a feature is the tail, then '1' represents that the species has a tail and '2' that it has none. Feature matrices with different numbers of features and species are generated as $D = \{X_1, X_2, \ldots, X_i, \ldots, X_n\}$, where $X_i$ represents the i-th species; the state of the p-th morphological feature of species $X_i$ is denoted $X_{ip}$; the number of species used to construct the phylogenetic tree is denoted n, and the number of features is denoted m; the feature vector of species $X_i$ is then denoted $X_i = (X_{i1}, X_{i2}, \ldots, X_{ip}, \ldots, X_{iq}, \ldots, X_{im})$.

Then missing entries are introduced into the original feature matrix at different missing rates, and the data containing missing entries are denoted $D_{missing}$.

Missing-data mechanisms include Missing Completely At Random (MCAR), Missing Not At Random (MNAR), and Missing At Random (MAR).

In this embodiment, missing entries are introduced under MCAR, i.e., whether an entry is missing is random and independent of any incomplete or complete variable. The corresponding feature matrix is $D_{missing} = \{X'_1, X'_2, \ldots, X'_i, \ldots, X'_n\}$. Missing proportions of 10%, 25%, 50%, and 75% are applied to the simulated data, with the missing entries placed at randomly selected sites in the feature matrix.

FIG. 3 shows a simulated morphological feature matrix with 40 species, 25 features, and 25% missing entries, in which '?' denotes a missing feature value.
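A minimal sketch of MCAR masking follows. It assumes the feature matrix is a list of rows and that exactly a fixed fraction of cells is masked; the function name `apply_mcar` and the seed parameter are illustrative, not taken from the application.

```python
import random


def apply_mcar(matrix, missing_rate, seed=0):
    """Return a copy of `matrix` with a fraction of cells replaced by '?'.

    Cells are chosen uniformly at random, independent of their values,
    which is what makes the mechanism "completely at random".
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_missing = round(missing_rate * len(cells))
    masked = [list(row) for row in matrix]      # leave the input untouched
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = "?"
    return masked


# A 40 x 25 matrix with 25% missing entries, the shape discussed in the text.
full = [[1] * 25 for _ in range(40)]
with_missing = apply_mcar(full, missing_rate=0.25)
```

Sampling cell coordinates without replacement guarantees the requested missing rate exactly, rather than only in expectation.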

For the generated simulated data, different single-objective phylogenetic tree construction methods are used in turn, the objectives being the parsimony value, likelihood value, Fitch value, CI, and RI.

Given a tree structure τ with node set V(τ) and edge set E(τ), the parsimony score can be expressed as:

$$PS(\tau) = \sum_{j} \sum_{(v,u) \in E(\tau)} w_j \, C(v_j, u_j)$$

where $w_j$ is the weight of feature j, $v_j$ and $u_j$ are the feature state values of nodes v and u at site j, and $C(v_j, u_j)$ is the entry of the cost matrix giving the cost of a transition from state $v_j$ to state $u_j$.

Based on the data set D and the evolution model M, the likelihood value $L(\tau) = P(D \mid \tau, M)$ of each tree τ is calculated as its tree score, and the tree structure with the largest likelihood value is selected as the final analysis result. The likelihood function L(τ) of each tree τ can be written as:

$$L(\tau) = \prod_{i} P(D_i \mid \tau, M)$$

where $P(D_i \mid \tau, M)$ is the likelihood value of the i-th site.
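Assuming the tree likelihood is the product of independent per-site likelihoods (the usual form), implementations normally work with the sum of logarithms instead, because the raw product underflows for even moderate numbers of sites. The sketch below illustrates this with hypothetical per-site values; it does not compute site likelihoods from a tree.

```python
import math


def log_likelihood(site_likelihoods):
    """Log of the product of per-site likelihoods, computed as a sum of logs."""
    return sum(math.log(p) for p in site_likelihoods)


site_likelihoods = [1e-4] * 100                  # hypothetical per-site values
direct_product = math.prod(site_likelihoods)     # underflows to 0.0 in double precision
log_score = log_likelihood(site_likelihoods)     # stays finite, about -921.03
```

Because the logarithm is monotonic, ranking trees by log-likelihood gives the same order as ranking them by likelihood.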

For the Fitch value, given the tree structure τ, the features are optimized on the tree working from the tips toward the root. This is done by assigning a feature set to each node of the tree in turn, from the tips down to the root. Starting at the tips, a pair of terminal taxa connected by a node is selected, and the feature set assigned to that node is calculated. The feature set of a node is taken as the intersection of the feature sets of the two terminal taxa (or two nodes, or one taxon and one node) that it connects.

If the intersection of the two taxon (or node) feature sets connected by a node is empty, the node under study is assigned the minimum closed set formed by selecting one feature state from each of the two feature sets. At such a node, the number of feature changes is the difference spanned by this minimum closed set.

If the intersection of the two descendant feature sets connected by the node is not empty, the intersection is assigned to the node as its feature set. At such a node, the number of feature changes is zero. This procedure continues down to the root node of the cladogram.

Finally, the root taxon is examined to see whether its feature state is included in the feature set of the node above it, the root node. If it is, the length of the tree is not increased at the root node; if not, the difference between the feature state of the root taxon and the nearest state in the root node's feature set is calculated, and this is the amount added to the tree length at the root node.
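The set-assignment pass described above can be sketched as follows. This is a simplified variant, not the exact rule in the text: it treats features as unordered, so an empty intersection falls back to the union of the two sets and costs exactly one change, and it omits the final root-taxon check. Trees are nested 2-tuples of taxon names and the function names are illustrative.

```python
def fitch_length(tree, states):
    """Return (state_set, changes) for one unordered feature on a rooted binary tree."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0                 # a tip carries its observed state
    left_set, left_cost = fitch_length(tree[0], states)
    right_set, right_cost = fitch_length(tree[1], states)
    common = left_set & right_set
    if common:                                   # non-empty intersection: no change here
        return common, left_cost + right_cost
    # Empty intersection: keep both possibilities and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, matrix, weights=None):
    """Weighted sum of per-feature change counts; matrix maps taxon -> state list."""
    n_features = len(next(iter(matrix.values())))
    weights = weights or [1] * n_features
    total = 0
    for j in range(n_features):
        states = {taxon: row[j] for taxon, row in matrix.items()}
        total += weights[j] * fitch_length(tree, states)[1]
    return total


tree = (("A", "B"), ("C", "D"))
matrix = {"A": [1, 1], "B": [1, 2], "C": [2, 1], "D": [2, 2]}
length = tree_length(tree, matrix)   # feature 1 needs 1 change, feature 2 needs 2
```

The first feature groups A with B and C with D, matching the tree, so it costs one change; the second conflicts with the tree and costs two.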

The consistency index (CI) measures how well a particular tree topology fits a data set.

The consistency index (CI) is defined as:

$$CI = \frac{m}{s}$$

where s is the actual number of state changes of a character transformation series on the phylogenetic tree (for an integer-coded discrete transformation series, this is the amount it adds to the length of the cladogram), and m is the minimum number of changes of that transformation series on any cladogram (the number of changes that would occur in the ideal case without homoplasy).

If there is no homoplasy on a phylogenetic tree, CI equals 1; as the amount of homoplasy on the tree increases, CI tends toward 0. Because CI decreases as the tree length L increases, the most parsimonious cladogram for a given data set, being the shortest, has the largest CI.

The retention index (RI) is defined as:

$$RI = \frac{g - s}{g - m}$$

where g is the maximum possible number of changes of a character transformation series on any tree, and s and m are as defined for CI.
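As a worked illustration for a single unordered feature, the sketch below takes the standard definitions CI = m/s and RI = (g − s)/(g − m). It assumes m is the number of observed states minus one and g is the number of taxa minus the count of the most frequent state; the handling of the degenerate cases (s = 0 or g = m) is a convention chosen here, not something stated in the application.

```python
from collections import Counter


def ci_ri(observed_states, s):
    """CI and RI of one unordered feature, given its change count s on a tree."""
    counts = Counter(observed_states)
    m = len(counts) - 1                                  # minimum possible changes
    g = len(observed_states) - max(counts.values())      # maximum possible changes
    ci = m / s if s else 1.0              # assumed convention for constant features
    ri = (g - s) / (g - m) if g != m else 0.0   # assumed convention when undefined
    return ci, ri


# Feature that fits the tree perfectly: one change where one is the minimum.
fits = ci_ri([1, 1, 2, 2], s=1)        # (1.0, 1.0)
# Feature with homoplasy: two changes where one would suffice.
conflicts = ci_ri([1, 2, 1, 2], s=2)   # (0.5, 0.0)
```

The change count s would come from a parsimony pass over the tree; it is passed in directly here to keep the example short.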

A genetic algorithm is used to perform the optimization search on the simulated data, with each of the 5 objectives above used in turn as the objective function; when the iteration termination condition is reached, the resulting phylogenetic tree set is taken as the optimal phylogenetic tree result set. The RF (Robinson-Foulds) distance between the phylogenetic tree set obtained under each optimization objective and the standard tree generated at the beginning of step one is then calculated, which gives the correspondence between the data and the optimization objectives. The RF distance, in its standard form, is calculated as follows:

$$RF(T_1, T_2) = |split(T_1)| + |split(T_2)| - 2\,|split(T_1) \cap split(T_2)|$$

where split(T) is the set of bipartitions (splits) of tree T; $|split(T_1) \cap split(T_2)|$ is the number of bipartitions shared by $T_1$ and $T_2$; and n is the number of classification units, a fully resolved unrooted tree on n units having n − 3 non-trivial bipartitions.
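A minimal sketch of the split-based RF computation follows, using the unnormalized form |split(T1)| + |split(T2)| − 2|split(T1) ∩ split(T2)|. It assumes trees are nested tuples of taxon names, treats them as unrooted, and ignores trivial splits; the function names are illustrative.

```python
def leaf_set(tree):
    """All taxon names under a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions, each stored as the side that excludes one fixed taxon."""
    all_taxa = leaf_set(tree)
    anchor = min(all_taxa)                 # fixed taxon used to pick a canonical side
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - below if anchor in below else below
        if 1 < len(side) < len(all_taxa) - 1:    # skip trivial and empty splits
            found.add(side)
        return below

    visit(tree)
    return found


def rf_distance(t1, t2):
    """Number of splits present in exactly one of the two trees."""
    s1, s2 = splits(t1), splits(t2)
    return len(s1) + len(s2) - 2 * len(s1 & s2)


standard = ((("A", "B"), "C"), ("D", "E"))
candidate = ((("A", "C"), "B"), ("D", "E"))
distance = rf_distance(standard, candidate)    # the trees share one of two splits: 2
```

A distance of 0 means the two trees have the same unrooted topology; larger values mean more splits differ.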

Some of the experimental results are shown in FIGS. 4 to 6: FIG. 4 compares different optimization objectives when the species number is the same; FIG. 5 compares different optimization objectives when the feature number is the same; FIG. 6 shows the effect of different species numbers on the tree construction results when the number of features = 100. The three sets of experiments in the figures show that:

(1) data sets of different sizes have different optimal tree construction objectives;

(2) the number of species in the data set is the main factor influencing the tree construction result, and the distance between the tree result set and the standard tree grows as the number of species increases;

(3) when the number of species is the same, the RF distance decreases as the number of features increases, but this trend becomes less apparent as the number of species increases.

Step two, dynamic selection of optimization objectives:

The obtained real paleobiological morphological data are dynamically matched with the simulated data: the number of species in the real data is first matched with the number of species in the simulated data, then the number of features is matched, and finally the simulated data set closest in size to the real data set is found according to the missing rate.

According to the data matching result of the previous step, the top N objectives for the matched simulated data are used as the objectives of the multi-objective optimization (three in this embodiment) and serve as the criteria for the subsequent multi-objective tree-space search. The objectives are applied in order, sorting first by the first objective and so on, so that objectives ranked earlier have more influence on the final tree construction result.
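The matching and selection step can be sketched as a nearest-configuration lookup. This is a minimal illustration under assumptions: objective rankings are stored in a dictionary keyed by (species number, feature number, missing rate), matching proceeds one dimension at a time by absolute difference, and all names are hypothetical. The orderings below are taken from the captions of FIG. 5, but the 0.25 missing rate attached to them and the 0.2 missing rate of the query are placeholders.

```python
def rank_objectives(rf_by_objective):
    """Order objective names from smallest to largest RF distance."""
    return sorted(rf_by_objective, key=rf_by_objective.get)


def nearest(value, grid):
    """Closest grid value to `value`; the sort makes tie-breaking deterministic."""
    return min(sorted(grid), key=lambda g: abs(g - value))


def select_objectives(real_shape, rankings, top_n):
    """Match species number, then feature number, then missing rate; return top N."""
    n_species, n_features, missing_rate = real_shape
    species = nearest(n_species, {k[0] for k in rankings})
    features = nearest(n_features, {k[1] for k in rankings if k[0] == species})
    rate = nearest(
        missing_rate,
        {k[2] for k in rankings if k[:2] == (species, features)},
    )
    return rankings[(species, features, rate)][:top_n]


rankings = {
    (100, 100, 0.25): ["likelihood", "Fitch", "parsimony", "RI", "CI"],
    (40, 100, 0.25): ["Fitch", "parsimony", "likelihood", "RI", "CI"],
    (20, 100, 0.25): ["likelihood", "parsimony", "Fitch", "CI", "RI"],
}
# A 28 x 38 real data set is closest to the 20-species configuration.
chosen = select_objectives((28, 38, 0.2), rankings, top_n=3)
```

Because the returned list keeps the ranking order, it can be used directly as the priority order of objectives in the later search.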

Step three, multi-objective phylogenetic tree construction. The specific implementation steps comprise:

Step 3.1: initializing a tree set

Missing data are encoded as a new state, and inapplicable data are treated as missing data. The initial tree set is generated by alternating between the maximum parsimony method and the maximum likelihood method.

Step 3.2: iterative updating by the genetic algorithm

The objective values of the initial tree set are calculated under the different optimization objectives. The data structure formed by a tree and its objective values is represented as shown in FIG. 2. The generated initial trees are then iteratively updated by the genetic algorithm, which searches the tree space with crossover and mutation operators: four crossover operators (PDG, PDGm, BE, and CS) and three mutation operators (NNI, SPR, and TBR). Two trees are selected from the generated initial tree set as parents. The crossover operation selects a subtree of one parent, deletes that subtree from the other tree, and then selects a branch of that tree onto which the subtree is grafted. The mutation operation randomly selects a branch of a tree and rearranges it. Both recombination methods search the tree space starting from the initial trees.
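Of the mutation operators named above, nearest-neighbor interchange (NNI) is the simplest to illustrate. The sketch below enumerates rooted NNI moves on trees stored as nested 2-tuples; the representation and function names are assumptions made for this example, and SPR, TBR, and the crossover operators are not covered.

```python
import random


def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def nni_neighbors(tree):
    """Yield every tree that is one rooted NNI move away from `tree`."""
    if not isinstance(tree, tuple):
        return
    left, right = tree
    if isinstance(left, tuple):              # swap `right` with a child of `left`
        a, b = left
        yield ((right, b), a)
        yield ((a, right), b)
    if isinstance(right, tuple):             # swap `left` with a child of `right`
        c, d = right
        yield (c, (left, d))
        yield (d, (c, left))
    for new_left in nni_neighbors(left):     # moves deeper in the left subtree
        yield (new_left, right)
    for new_right in nni_neighbors(right):   # moves deeper in the right subtree
        yield (left, new_right)


def mutate(tree, rng):
    """Pick one NNI neighbour uniformly at random (the tree needs at least 3 leaves)."""
    return rng.choice(list(nni_neighbors(tree)))


parent = (("A", "B"), ("C", "D"))
child = mutate(parent, random.Random(0))
```

Every neighbour keeps the same set of taxa and differs from the parent only around one internal edge, which makes NNI a small, local step in tree space.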

Step 3.3: non-dominated sorting

Non-dominated sorting of the tree set sorts the n phylogenetic trees into different non-dominated layers, the first of which is the optimal solution set, i.e. the Pareto front. The first m trees in the non-dominated ordering are selected for the next iterative update. One tree dominates another if it is no worse on every objective and strictly better on at least one. A population of size n is layered by the non-dominated sorting algorithm as follows:

(1) let i = 1;

(2) for all j = 1, 2, …, n with j ≠ i, compare individual $Tree_i$ with individual $Tree_j$ to determine the dominance relation between them;

(3) if no individual $Tree_j$ dominates $Tree_i$, mark $Tree_i$ as a non-dominated individual;

(4) let i = i + 1 and go to step (2), until all non-dominated individuals have been found.

The set of non-dominated individuals obtained by the above steps is the first non-dominated layer of the population. The marked individuals are then ignored (i.e. they take no part in the next round of comparison), and steps (1) to (4) are repeated to obtain the second non-dominated layer, and so on until the entire population is layered. The layers give an ordering of all trees; selection then proceeds layer by layer, taking trees from each layer in turn until the m-th tree has been selected.
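The layering procedure above can be sketched directly. The example assumes every objective is to be minimized, so objectives that are maximized (such as likelihood, CI, or RI) would be negated first; score vectors are plain tuples and the function names are illustrative.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_layers(scores):
    """Split indices of `scores` into successive non-dominated layers."""
    remaining = set(range(len(scores)))
    layers = []
    while remaining:
        # An individual belongs to the current layer if nothing remaining dominates it.
        front = [
            i for i in sorted(remaining)
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        ]
        layers.append(front)
        remaining -= set(front)          # marked individuals skip the next round
    return layers


def select_top(scores, m):
    """Take individuals layer by layer until m have been selected."""
    chosen = []
    for layer in non_dominated_layers(scores):
        for i in layer:
            if len(chosen) == m:
                return chosen
            chosen.append(i)
    return chosen


# Five trees scored on two objectives (lower is better on both).
scores = [(1, 5), (2, 2), (5, 1), (3, 3), (6, 6)]
layers = non_dominated_layers(scores)    # [[0, 1, 2], [3], [4]]
survivors = select_top(scores, m=4)      # [0, 1, 2, 3]
```

This quadratic-time version mirrors the steps in the text; faster bookkeeping schemes exist but produce the same layers.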

Step 3.4: steps 3.2 to 3.3 are repeated until the iteration termination condition is reached, finally yielding the optimal phylogenetic tree set.

The experimental results are shown in Tables 1 and 2 below:

Table 1: RF distance comparison between the simulated-data tree result set and the standard tree (20 x 25% missing)

Table 2: RF distance comparison between the real-data tree result set and the standard tree (28 x 38)

The experimental results show that the phylogenetic tree construction method based on dynamic multi-objective optimization provided by this embodiment achieves higher accuracy on both the simulated data and the matched real morphological data.

In summary, compared with phylogenetic tree construction methods based on single-objective optimization, the phylogenetic tree construction method based on dynamic multi-objective optimization provided by this embodiment effectively addresses missing and inapplicable data in phylogenetic tree construction from paleobiological morphological data, as well as conflicts among multiple evaluation indexes, and improves the accuracy and stability of species phylogenetic analysis. In addition, the relation between the data and the optimization objectives is explored using simulated data, yielding general conclusions about phylogenetic tree construction from morphological data and providing a reference for such construction.

Claims

1. A phylogenetic tree construction method based on dynamic multi-objective optimization is characterized by comprising the following specific steps:

the specific implementation method is as follows:

according to the characteristics of paleobiological phylogenetic trees, a binary phylogenetic tree is simulated in which all leaf nodes are the positions of the classification units; biological evolution is then simulated according to a Markov model in which the transition probabilities among the feature states are the same, generating feature matrices with different numbers of features and species:

$$D = \{X_1, X_2, \ldots, X_i, \ldots, X_n\}$$

$$X_i = (X_{i1}, X_{i2}, \ldots, X_{ip}, \ldots, X_{iq}, \ldots, X_{im});$$

then, entries of the original feature matrix are randomly set to missing at different missing rates; the data containing missing entries are denoted $D_{missing}$, and the corresponding feature matrix is:

$$D_{missing} = \{X'_1, X'_2, \ldots, X'_i, \ldots, X'_n\}$$

phylogenetic trees based on single-objective optimization are constructed for each of the 3 × 4 × 4 = 48 groups of data (three different species numbers, four different feature numbers, and four different missing rates), using in turn the parsimony value, likelihood value, Fitch value, CI value, and RI value used in phylogenetic tree construction; an optimal phylogenetic tree set is finally selected for each;

the RF distances between the tree sets obtained under the different optimization objectives and the standard tree are calculated, and the optimization objectives are ranked from small to large RF distance, giving the correspondence between the data and the optimization objectives;

step three, constructing a multi-objective phylogenetic tree through the dynamically selected optimization objectives, non-dominated sorting, and a genetic algorithm, and selecting an optimal phylogenetic tree result set;

the specific implementation steps are as follows:

step 3.1: initializing a tree set

step 3.2: iterative updating by the genetic algorithm

step 3.3: non-dominated sorting

2. The method of claim 1, wherein the implementation method of the second step is:

the real paleobiological morphological data are dynamically matched with the simulated data: the number of species in the real data is first matched with the number of species in the simulated data, then the number of features is matched, and finally the simulated data set closest in size to the real data set is found according to the missing rate;

according to the data matching result, the top N objectives for the matched simulated data are used as the objectives of the multi-objective optimization and serve as the criteria for the subsequent multi-objective tree-space search; the objectives are applied in order, sorting first by the first objective and so on, so that objectives ranked earlier have more influence on the final tree construction result.