CN111477281A

CN111477281A - Pan-genome construction method and construction device based on phylogenetic tree

Info

Publication number: CN111477281A
Application number: CN202010260748.0A
Authority: CN
Inventors: 张锦波; 鲍冠辉; 李季
Original assignee: Beijing Nuohe Zhiyuan Biotechnology Co ltd; Nanjing Novogene Biotechnology Co ltd; Tianjin Novogene Biological Information Technology Co ltd; Tianjin Nuohe Medical Examination Institute Co ltd; Beijing Novogene Technology Co ltd
Current assignee: Beijing Novogene Technology Co ltd; Beijing Nuohe Zhiyuan Biotechnology Co ltd; Tianjin Novogene Biological Information Technology Co ltd; Tianjin Nuohe Medical Examination Institute Co ltd
Priority date: 2020-04-03
Filing date: 2020-04-03
Publication date: 2020-07-31
Anticipated expiration: 2040-04-03
Also published as: CN111477281B

Abstract

The invention provides a pan-genome construction method and a pan-genome construction device based on a phylogenetic tree. The construction method comprises the following steps: constructing a phylogenetic tree for the different strains; then, according to the positions of the strains on the phylogenetic tree, clustering them pairwise from the bottom layer to the top layer and constructing the pan-genome layer by layer to obtain the total pan-genome of the strains. Because the pan-genome is built from the bottom layer to the top layer, the method addresses the inaccurate alignment results of existing methods and also yields a pan-genome for every evolutionary node, so that presence/absence variation can be analyzed at a finer evolutionary resolution.

Description

Pan-genome construction method and construction device based on phylogenetic tree

Technical Field

The invention relates to the field of pan-genome sequencing data analysis, and in particular to a pan-genome construction method and a pan-genome construction device based on a phylogenetic tree.

Background

In 2005, Tettelin et al. proposed the concept of the bacterial pan-genome in a study of group B Streptococcus. In 2009, Li et al. used the SOAPdenovo assembler to assemble multiple human genomes for the first time, found a large number of DNA sequences and functional genes unique to individuals of different populations, and proposed the concept of the "human pan-genome", i.e., the sum of the gene sequences of the human population. Since then, pan-genome analysis methods have been widely used in animal and plant research. It is worth noting that the individuals involved in these pan-genome constructions differ from one another by only about 1%.

Bacterial genomes evolve much faster than animal and plant genomes, so bacterial strains show abundant genetic polymorphism. For example, as early as 2005, scientists at a genome research institute in the United States analyzed the genome sequences of 8 Streptococcus strains and estimated that about 30 new strain-specific genes are discovered each time the genome of a new strain is completed; they noted that "the bacterial pan-genome may be unlimited". In addition, it has been shown that in some fast-evolving bacteria, no less than 20% of the DNA sequence of a strain's genome is unique to that strain.

Although bacterial pan-genomes have richer genetic diversity than animal and plant pan-genomes, bacterial pan-genome research still mainly uses the analysis method developed for animal and plant pan-genomes: all genomes under comparison are aligned to a single, higher-quality reference genome, and the core genome and the variable genome are then distinguished by filtering at the commonly used 90% similarity. However, bacterial polymorphism is significantly higher than that of animals and plants, so 90% similarity filtering combined with a single reference genome produces many false positives and makes the analysis of different lineages unreliable. Some strains may differ from the reference genome by an amount close to that typical of animals and plants, but for strains whose polymorphism is significantly higher, a 90% threshold and a single reference genome do not distinguish the core genome from the variable genome well.

Therefore, there is currently no effective solution for constructing the pan-genome of species whose genetic polymorphism among individuals exceeds 1%.

Disclosure of Invention

The main object of the invention is to provide a method and a device for constructing a pan-genome based on a phylogenetic tree, so as to solve the problem that the prior art is inaccurate when constructing the pan-genome of species whose genetic polymorphism among individuals exceeds 1%.

In order to achieve the above object, according to one aspect of the present invention, there is provided a pan-genome construction method based on a phylogenetic tree, the method comprising: constructing a phylogenetic tree for the different strains; and, according to the positions of the strains on the phylogenetic tree, clustering them pairwise from the bottom layer to the top layer and constructing the pan-genome layer by layer to obtain the total pan-genome of the different strains.

Further, before the phylogenetic tree is constructed for the different strains, the construction method further comprises: acquiring genome sequence information and annotation information of the different strains; selecting, according to the genome sequence information and annotation information of each strain, conserved protein sequences present in all strains and performing multiple sequence alignment to obtain an alignment result; and constructing the phylogenetic tree according to the alignment result. Preferably, MUSCLE software is used to perform multiple sequence alignment on the conserved protein sequences present in all strains to obtain a preliminary alignment result, and Gblocks software is used to filter the preliminary alignment result to obtain the alignment result. Preferably, ProtTest software is used to construct the phylogenetic tree according to the alignment result. Preferably, the different strains are strains of viruses, bacteria, fungi, parasites, spirochetes, mycoplasma, chlamydia or rickettsia, or are animals or plants with a genetic polymorphism of 2-30%, preferably 5-25%, more preferably 10-20%.

Further, the manner of obtaining the genome sequence information and annotation information of the different strains includes at least one of: (1) obtaining it from known genome sequence information and annotation information; (2) obtaining it by sequencing, assembling and annotating the strain in sequence.

Further, obtaining the genome sequence information and annotation information by sequencing, assembling and annotating the strain comprises: sequencing the strains to be compared to obtain sequencing reads; assembling the reads of each strain based on the overlaps between sequencing reads to obtain a genome sequence; and annotating repeat sequences and gene structures on the genome sequence to obtain the genome annotation result of the strain. Preferably, the strain is sequenced by PacBio sequencing or nanopore sequencing; preferably, the assembly based on read overlaps is performed with Falcon, Canu or WTDBG software; preferably, the repeat sequence and gene structure annotation is performed with RepeatMasker, Augustus, GlimmerHMM, GeneWise or EVM software.

Further, clustering the different strains pairwise from the bottom layer to the top layer according to their positions on the phylogenetic tree, and constructing the pan-genome layer by layer to obtain the total pan-genome, comprises: according to the positions of the strains on the phylogenetic tree, first aligning the strains at the bottom layer pairwise and dividing the core genome and the non-core genome according to a sequence similarity threshold to obtain a bottom-layer pan-genome; then performing pan-genome construction again between the bottom-layer pan-genome and the genome or pan-genome at the next layer up, and repeating this step until the top layer is reached, thereby obtaining the total pan-genome of the different strains. Preferably, the sequence similarity threshold is 90%: sequences with similarity greater than or equal to 90% are assigned to the core genome, and sequences with similarity below 90% are assigned to the non-core genome.

Further, after the total pan-genome of the different strains is obtained, the construction method further comprises: counting the sequence length and the number of genes of the relevant regions of each evolutionary node and of the total pan-genome, so as to obtain the presence/absence variation of each strain at each evolutionary node.

According to a second aspect of the present application, there is provided a pan-genome construction apparatus based on a phylogenetic tree, comprising an evolutionary tree construction module and a pan-genome construction module, wherein the evolutionary tree construction module is used for constructing a phylogenetic tree for the different strains; and the pan-genome construction module is used for clustering the strains pairwise from the bottom layer to the top layer according to their positions on the phylogenetic tree and constructing the pan-genome layer by layer to obtain the total pan-genome of the different strains.

Further, the evolutionary tree construction module further comprises: an information acquisition module for acquiring genome sequence information and annotation information of the different strains; a multiple sequence alignment module for selecting, according to the genome sequence information and annotation information of each strain, conserved protein sequences present in all strains and performing multiple sequence alignment to obtain an alignment result; and an evolutionary tree construction submodule for constructing the phylogenetic tree according to the alignment result. Preferably, the multiple sequence alignment module comprises: a MUSCLE module for performing multiple sequence alignment on the conserved protein sequences present in all strains to obtain a preliminary alignment result; and a Gblocks module for filtering the preliminary alignment result to obtain the alignment result. Preferably, the evolutionary tree construction submodule is a ProtTest module.

Further, the information acquisition module comprises at least one of the following: (1) a first acquisition module for obtaining the information from known genome sequence information and annotation information; (2) a second acquisition module for obtaining the information by sequencing, assembling and annotating the strain in sequence.

Further, the second acquisition module includes: a sequencing module for sequencing the strains to be compared to obtain sequencing reads; an assembly module for assembling the reads of each strain based on the overlaps between sequencing reads to obtain a genome sequence; and an annotation module for annotating repeat sequences and gene structures on the genome sequence to obtain the genome annotation result of the strain. Preferably, the sequencing module is a PacBio sequencing module; preferably, the assembly module is a Falcon, Canu or WTDBG module; preferably, the annotation module is a RepeatMasker, Augustus, GlimmerHMM, GeneWise or EVM module.

Further, the pan-genome construction module comprises: a bottom-layer pan-genome construction module for aligning the strains at the bottom layer pairwise according to the positions of the strains on the phylogenetic tree, and dividing the core genome and the non-core genome according to a sequence similarity threshold to obtain a bottom-layer pan-genome; and a layer-by-layer pan-genome construction module for performing pan-genome construction again between the bottom-layer pan-genome and the genome or pan-genome at the next layer up, and repeating this step until the top layer is reached, so as to obtain the total pan-genome of the different strains. Preferably, the sequence similarity threshold is 90%: sequences with similarity greater than or equal to 90% are assigned to the core genome, and sequences with similarity below 90% are assigned to the non-core genome.

Further, the construction apparatus further includes: a statistical module for counting the sequence length and the number of genes of the relevant regions of each evolutionary node and of the total pan-genome, so as to obtain the presence/absence variation of each strain at each evolutionary node.

According to a third aspect of the present application, there is also provided a storage medium comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform any of the above-mentioned phylogenetic tree-based pan-genome construction methods.

According to a fourth aspect of the present application, there is also provided a processor for executing a program, wherein the program, when run, executes any one of the above-mentioned phylogenetic tree-based pan-genome construction methods.

By applying the technical scheme of the invention, the phylogenetic tree is constructed first, the alignments are then targeted according to the tree, and the pan-genome is built step by step from the bottom layer to the top layer of the phylogenetic tree according to the evolutionary distance between the strains, so that the total pan-genome is obtained. Because the pan-genome is built from the bottom layer to the top layer, the method addresses the inaccurate alignment results of existing methods and also yields a pan-genome for every evolutionary node, so that presence/absence variation can be analyzed at a finer evolutionary resolution.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 shows a schematic flow diagram of a pan-genome construction method based on a phylogenetic tree according to example 1 of the present invention;

FIG. 2 shows a detailed flow diagram of the phylogenetic tree-based pan-genome construction method according to example 2 of the present invention;

FIG. 3 shows a phylogenetic tree built on the basis of conserved protein sequences according to example 3 of the present invention;

FIG. 4 shows a PCA plot based on SNP data according to example 3 of the invention;

FIG. 5 shows the numbers of genes of the core genome and the non-core genome of each node on the evolutionary tree according to example 3 of the present invention, light color represents the number of genes of the core genome, and dark color represents the number of genes of the non-core genome; and

FIG. 6 shows a schematic structural diagram of a pan-genome construction apparatus based on a phylogenetic tree according to example 4 of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.

As mentioned in the background, the prior art generally follows the analysis method for animal and plant pan-genomes: each individual is assembled independently, the best-quality assembly among all individuals under comparison (including newly assembled and already published assemblies) is selected as the reference genome, the genomes of the other individuals are each aligned to this reference genome, and the core genome and the non-core genome are distinguished by a fixed threshold (e.g., the commonly used 90% similarity). Because bacterial genomes evolve faster than animal and plant genomes, the inventors found that applying the animal and plant pan-genome analysis method to bacterial genomes has the following disadvantages: (1) Alignment to a single reference genome easily causes alignment errors. In animals and plants, individuals differ little from the reference genome (similarity is generally about 99%), but in bacteria about 20% of the reference genome may be strain-specific, so aligning other strains against the genome of one particular strain easily leads to inaccurate results. (2) A single filtering threshold. For example, if in the same region A vs. Ref is 91%, B vs. Ref is 89%, and A vs. B is 98%, the conventional method assigns the region of A to the core genome and the region of B to the non-core genome, even though A and B are 98% similar in that region; alignments in the gray zone near 90% similarity are therefore easily misclassified. (3) The prior art only gives the presence/absence variation (PAV) of each strain relative to the total pan-genome, while the PAV at each evolutionary node, or at the node of the corresponding strain on the evolutionary tree, remains unknown.
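The gray-zone problem in disadvantage (2) can be illustrated with a minimal Python sketch. The identity values are the ones quoted in the text; the function name `classify_against_reference` is a hypothetical helper introduced only for this illustration and is not part of the described method.

```python
def classify_against_reference(identity_to_ref, threshold=0.90):
    """Label each strain's region from its identity to a single reference."""
    return {
        strain: "core" if identity >= threshold else "non-core"
        for strain, identity in identity_to_ref.items()
    }


# Identities for one region, taken from the example in the text.
identity_to_ref = {"A": 0.91, "B": 0.89}
identity_a_vs_b = 0.98

labels = classify_against_reference(identity_to_ref)

# A and B end up with different labels although they are 98% similar
# to each other, which is the misclassification described above.
split_despite_similarity = (
    labels["A"] != labels["B"] and identity_a_vs_b >= 0.90
)
print(labels, split_despite_similarity)
```

Comparing A and B directly, as the tree-guided method does for neighboring strains, would place the region of both strains on the same side of the threshold.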

Example 1

To address the above drawbacks of the prior art, a preferred embodiment of the present application provides a pan-genome construction method based on a phylogenetic tree. As shown in FIG. 1, the construction method comprises:

S101, constructing a phylogenetic tree for the different strains to obtain a phylogenetic tree;

S103, according to the positions of the different strains on the phylogenetic tree, clustering them pairwise from the bottom layer to the top layer and constructing the pan-genome layer by layer to obtain the total pan-genome of the different strains.

In this pan-genome construction method, the phylogenetic tree is constructed first, the alignments are then targeted according to the tree, and the pan-genome is built step by step from the bottom layer to the top layer of the phylogenetic tree according to the evolutionary distance between the strains, so that the total pan-genome is obtained. Because the pan-genome is built from the bottom layer to the top layer, the method addresses the inaccurate alignment results of existing methods and also yields a pan-genome for every evolutionary node, so that presence/absence variation can be analyzed at a finer evolutionary resolution. The total pan-genome here refers to the pan-genome constructed at the topmost layer.

The phylogenetic tree can be constructed with an existing phylogenetic tree construction method according to the sequence information and annotation information of the different strains. In a preferred embodiment, before the phylogenetic tree is constructed for the different strains, the construction method further comprises: acquiring genome sequence information and annotation information of the different strains; selecting, according to the genome sequence information and annotation information of each strain, conserved protein sequences present in all strains and performing multiple sequence alignment to obtain an alignment result; and constructing the phylogenetic tree according to the alignment result.

The phylogenetic tree can be constructed with existing software. For example, MUSCLE software can be used to perform multiple sequence alignment on the conserved protein sequences present in all strains to obtain a preliminary alignment result, Gblocks software can then be used to filter the preliminary alignment result to obtain the alignment result, and finally ProtTest software can be used to construct the phylogenetic tree according to the alignment result.

The pan-genome construction method is applicable to species having a large difference (at least more than 1%) in genetic polymorphism among individuals, including but not limited to viruses, bacteria, fungi, parasites, spirochetes, mycoplasma, chlamydia, rickettsia, and the like, and may be animals or plants having a genetic polymorphism of 2% to 30%, preferably 5% to 25%, and more preferably 10% to 20%.

Depending on how complete the genome information of the strains of each species is, the genome sequence information and annotation information of each strain are acquired before the phylogenetic tree is constructed, so that conserved protein sequences present in all strains can be selected for multiple sequence alignment and tree construction. Some strains are already known and others are not, and the way their genome sequence information and annotation information are obtained differs accordingly. In a preferred embodiment, the manner of obtaining the genome sequence information and annotation information of the different strains includes at least one of: (1) obtaining it from known genome sequence information and annotation information; (2) obtaining it by sequencing, assembling and annotating the strain in sequence. If the sequence is unknown, the strain is sequenced and assembled by shotgun sequencing, and the assembly is finally annotated to obtain the missing information.

The specific sequencing, assembly and annotation steps do not differ from existing practice. In a preferred embodiment, obtaining the genome sequence information and annotation information by sequencing, assembling and annotating the strain comprises: sequencing the strains to be compared to obtain sequencing reads; assembling the reads of each strain based on the overlaps between sequencing reads to obtain a genome sequence; and annotating repeat sequences and gene structures on the genome sequence to obtain the genome annotation result of the strain.

The strains described above can be sequenced by PacBio sequencing or nanopore sequencing. When the reads obtained by PacBio or nanopore sequencing are assembled based on the overlaps between them, a variety of assembly software can be used, including but not limited to Falcon, Canu or WTDBG. The repeat sequence and gene structure annotation of the genome sequence is preferably performed with RepeatMasker, Augustus, GlimmerHMM, GeneWise or EVM software.

The main improvement of the pan-genome construction of the present application is that a phylogenetic tree is first constructed according to the evolutionary distance between the strains; then, following the evolutionary nodes of the tree, pan-genomes are constructed pairwise between the strains at the lowest layer, and further pan-genomes are built layer by layer in the same way until the total pan-genome is obtained. When a pan-genome is constructed at a given layer, whether between two genomes, between a genome and a pan-genome, or between two pan-genomes, the same principle as in existing methods can be used. For example, the core genome and the non-core genome are distinguished according to whether the similarity meets a given threshold.

In a preferred embodiment, clustering the different strains pairwise from the bottom layer to the top layer according to their positions on the phylogenetic tree, and constructing the pan-genome layer by layer to obtain the total pan-genome, comprises: according to the positions of the strains on the phylogenetic tree, first aligning the strains at the bottom layer pairwise and dividing the core genome and the non-core genome according to a sequence similarity threshold to obtain a bottom-layer pan-genome; then performing pan-genome construction again between the bottom-layer pan-genome and the genome or pan-genome at the next layer up, and repeating this step until the top layer is reached, thereby obtaining the total pan-genome of the different strains.

Preferably, the sequence similarity threshold is 80%, more preferably 90%: sequences whose similarity is at or above the threshold (80% or more, more preferably 90% or more) are assigned to the core genome, and sequences below the threshold are assigned to the non-core genome.
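A single pairwise construction step can be sketched as follows. This is a toy illustration under stated assumptions, not the alignment procedure of the application: `difflib.SequenceMatcher` from the Python standard library stands in for a real genome aligner, genomes are represented as hypothetical `{region_id: sequence}` dictionaries with unique region IDs, and `merge_pair` is a name introduced here for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Toy similarity score; a stand-in for a real alignment identity."""
    return SequenceMatcher(None, a, b).ratio()


def merge_pair(left, right, threshold=0.90):
    """Merge two genomes or pan-genomes given as {region_id: sequence}.

    Regions of `right` whose best match in `left` reaches the threshold are
    treated as shared (core); all other regions are non-core. The returned
    pan-genome holds every region of `left` plus the unmatched regions of
    `right`.
    """
    pan = dict(left)
    core, noncore = set(), set()
    for rid, seq in right.items():
        best_id, best = None, 0.0
        for lid, lseq in left.items():
            score = similarity(seq, lseq)
            if score > best:
                best_id, best = lid, score
        if best_id is not None and best >= threshold:
            core.add(best_id)          # shared region, keep the left copy
        else:
            pan[rid] = seq             # new sequence joins the pan-genome
            noncore.add(rid)
    noncore |= set(left) - core        # left regions without a partner
    return pan, core, noncore


# Hypothetical 20 bp regions, invented for the example.
strain_a1 = {"A1_g1": "ATGGCTAGCTAGGATCCGTA", "A1_g2": "TTGACCGGTAACGTTAGCAA"}
strain_a2 = {"A2_g1": "ATGGCTAGCTAGGATCCGTT", "A2_g3": "CCCCGGGGAAAATTTTCCGG"}

pan, core, noncore = merge_pair(strain_a1, strain_a2)
print(sorted(pan), sorted(core), sorted(noncore))
```

Because the function accepts and returns the same dictionary shape, its output can be fed into the next merge one layer up, which mirrors the layer-by-layer repetition described above.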

In order to show the presence/absence variation of each strain at the different evolutionary nodes and at the level of the total pan-genome more clearly and intuitively, in a preferred embodiment, after the total pan-genome of the different strains is obtained, the construction method further comprises: counting the sequence length and the number of genes of the relevant regions of each evolutionary node and of the total pan-genome, so as to obtain the presence/absence variation of each strain at each evolutionary node.

As a specific example of the sequence length statistics, suppose node A contains strains 1, 2 and 3. The statistics then include the total length of the core genome shared by strains 1, 2 and 3 (the intersection) and the total length of their pan-genome (the union), from which the length of sequence unique to each of strains 1, 2 and 3, or the length of sequence each strain lacks relative to the pan-genome, is obtained.
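The intersection/union statistics for one node can be sketched with Python sets. The region IDs and lengths below are invented for illustration, and `node_statistics` is a hypothetical helper; it assumes that regions have already been assigned shared IDs by the pan-genome construction.

```python
def node_statistics(strain_regions, region_length):
    """Core/pan lengths for a node plus unique and missing length per strain."""
    def total(regions):
        return sum(region_length[r] for r in regions)

    sets = {strain: set(regions) for strain, regions in strain_regions.items()}
    core = set.intersection(*sets.values())   # regions shared by all strains
    pan = set().union(*sets.values())         # regions present in any strain
    per_strain = {}
    for strain, regions in sets.items():
        others = set().union(*(r for s, r in sets.items() if s != strain))
        per_strain[strain] = {
            "unique_length": total(regions - others),
            "missing_length": total(pan - regions),
        }
    return {
        "core_length": total(core),
        "pan_length": total(pan),
        "per_strain": per_strain,
    }


# Hypothetical node A with strains 1, 2 and 3.
strain_regions = {
    "1": {"r1", "r2", "r3"},
    "2": {"r1", "r2", "r4"},
    "3": {"r1", "r5"},
}
region_length = {"r1": 1000, "r2": 800, "r3": 300, "r4": 200, "r5": 500}

stats = node_statistics(strain_regions, region_length)
print(stats)
```

Counting genes instead of base pairs only requires replacing the length lookup with a per-region gene count.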

Example 2:

In this example, the pan-genome was constructed according to the scheme shown in FIG. 2, as follows:

1) DNA extraction, library construction and PacBio sequencing were performed for each strain.

2) The genome of each strain was assembled individually from read overlaps using Falcon, Canu or WTDBG software.

3) Repeat sequences and gene structures of the genome were annotated using RepeatMasker, Augustus, GlimmerHMM, GeneWise and EVM software.

4) From the genome annotation results of each strain to be compared, the protein sequences of the conserved genes were selected, subjected to multiple sequence alignment using MUSCLE software, and then filtered using Gblocks software.

5) A phylogenetic tree was constructed from the multiple sequence alignment result using ProtTest software.

6) According to the positions of the strains in the phylogenetic tree, pan-genomes were constructed pairwise starting from the bottom: the two genomes were aligned with MUMmer software, and a 90% similarity threshold was used to divide the core genome and the non-core genome, giving a new pan-genome (the pan-genome of the two bottom nodes); pan-genome construction was then repeated with the pan-genome or genome at the next layer up.

7) The sequence length and the number of genes of the relevant regions of each node and of the final pan-genome were counted to clarify the evolutionary status of each strain at the corresponding node.

In this example, where the genome sequence information and annotation information of some strains are unknown, the strains to be compared are first subjected to PacBio sequencing, assembly and annotation. These steps are consistent with existing methods and serve to obtain the genomes of the strains. Protein sequences of conserved genes are then selected from the genomes of all strains to be compared (including new assemblies and published genomes), multiple sequence alignment is performed with MUSCLE software, and the phylogenetic tree is constructed with ProtTest software. Next, using the positions of all strains on the phylogenetic tree and working from the bottom to the top, a pan-genome is constructed for the strains at the bottom layer, each constructed pan-genome is combined with the pan-genome or genome at the next layer up to form a new pan-genome, and so on until the total pan-genome is complete; the concept of a reference genome is eliminated in this process. Finally, the presence/absence variation (PAV) of each strain at each node and its PAV relative to the total pan-genome are summarized.

Example 3:

This embodiment was tested with simulated data, as follows:

1) Data simulation:

The simulated data are based on an Escherichia coli reference genome. The reference genome on NCBI is 4,614,635 bp in size and contains 4,413 genes, while the number of genes in different E. coli strains ranges from 4,000 to 5,500. Starting from the reference genome, another 8 strains with different degrees of divergence were simulated by introducing SNP (single nucleotide polymorphism) substitutions, indel insertions (<50 bp), large-fragment PAVs (>500 bp) and SV (structural variation) sequence changes, as detailed below:

Table 1:

denotes the sample as a reference genome.

The simulation stands in for the sequencing, assembly and annotation steps and directly provides the annotated information of the different strains.
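The SNP part of such a simulation can be sketched as below. This is a minimal illustration only: it covers substitutions but not indels, PAVs or SVs, the 1,000 bp reference string and the 5% rate are invented, and `simulate_snps` and `identity` are hypothetical helper names rather than the tooling used in this example.

```python
import random


def simulate_snps(reference, rate, seed=0):
    """Return a copy of `reference` with round(len * rate) substituted sites."""
    rng = random.Random(seed)                    # fixed seed for repeatability
    n_sites = round(len(reference) * rate)
    positions = rng.sample(range(len(reference)), n_sites)
    seq = list(reference)
    for pos in positions:
        # Always pick a base different from the original one.
        alternatives = [base for base in "ACGT" if base != seq[pos]]
        seq[pos] = rng.choice(alternatives)
    return "".join(seq)


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


reference = "ACGT" * 250                         # toy 1,000 bp reference
strain = simulate_snps(reference, rate=0.05)
print(len(strain), identity(reference, strain))
```

Because every sampled site is forced to change, the divergence of the simulated strain from the reference equals the requested rate exactly, which makes strains with graded levels of difference easy to generate.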

2) Building phylogenetic trees

Protein families were clustered across all strains, and single-copy gene families, i.e., families in which each strain has exactly one gene, were selected. The genes of all such families were combined, multiple sequence alignment was performed with MUSCLE, the alignment result was filtered with Gblocks, and an evolutionary tree was constructed with ProtTest software. The result is shown in FIG. 3, which shows the evolutionary tree based on protein sequences.
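The single-copy selection rule can be sketched in a few lines of Python. The family table, strain names and gene IDs are invented for illustration, and `single_copy_families` is a hypothetical helper; it assumes the protein family clustering has already been done by an external tool.

```python
from collections import Counter


def single_copy_families(families, strains):
    """Keep families in which every strain has exactly one member gene.

    `families` maps a family ID to a list of (strain, gene_id) pairs.
    """
    selected = {}
    for family_id, members in families.items():
        counts = Counter(strain for strain, _ in members)
        present_in_all = set(counts) == set(strains)
        if present_in_all and all(n == 1 for n in counts.values()):
            selected[family_id] = dict(members)   # {strain: gene_id}
    return selected


strains = ["A1", "A2", "B1"]
families = {
    "fam1": [("A1", "A1_g1"), ("A2", "A2_g1"), ("B1", "B1_g1")],
    # Two copies in A1, so not single-copy.
    "fam2": [("A1", "A1_g2"), ("A1", "A1_g3"), ("A2", "A2_g2"), ("B1", "B1_g2")],
    # Absent from A2, so not conserved in all strains.
    "fam3": [("A1", "A1_g4"), ("B1", "B1_g3")],
}

selected = single_copy_families(families, strains)
print(selected)
```

Only the genes of the selected families would then be passed on to multiple sequence alignment.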

We further verified the accuracy of the simulated data using a multi-dimensional matrix based on SNPs. The principal component analysis (PCA) in FIG. 4 (a PCA plot based on SNP data) shows that, consistent with the evolutionary tree, the strains also fall roughly into three groups.

Based on the evolutionary tree, we used MUMmer and a 90% similarity threshold to construct the pan-genomes of the different nodes. The specific process is shown in the following table:

Table 2:

Tree level	Samples involved in pan-genome construction	New pan-genome sequence
1	Ecoli-A1 vs Ecoli-A2	Ecoli-pan-A12
1	Ecoli-B2 vs Ecoli-B3	Ecoli-pan-B23
1	Ecoli-C2 vs Ecoli-C3	Ecoli-pan-C23
2	Ecoli-A3 vs Ecoli-pan-A12	Ecoli-pan-A123
2	Ecoli-B1 vs Ecoli-pan-B23	Ecoli-pan-B231
3	Ecoli-pan-A123 vs Ecoli-pan-B231	Ecoli-pan-A123-B231
3	Ecoli-C1 vs Ecoli-pan-C23	Ecoli-pan-C231
4	Ecoli-pan-C231 vs Ecoli-pan-A123-B231	Ecoli-pan-A123-B231-C231
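The order of the pairwise merges in Table 2 can be derived from the tree itself. The sketch below encodes the topology implied by the table as nested Python tuples and walks it in post-order; the tuple encoding, the `pan(...)` labels and the `merge_schedule` helper are assumptions made for illustration and do not reproduce the naming or level numbering of the table.

```python
def merge_schedule(tree):
    """List the pairwise merges for a binary tree given as nested 2-tuples.

    A leaf is a strain name; an internal node is a (left, right) tuple.
    Returns a list of (left_input, right_input, new_pan_label) steps in
    post-order, so both inputs of a step always exist before it runs.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return node, [node]
        left_label, left_leaves = visit(node[0])
        right_label, right_leaves = visit(node[1])
        leaves = left_leaves + right_leaves
        label = "pan(" + "+".join(leaves) + ")"
        steps.append((left_label, right_label, label))
        return label, leaves

    visit(tree)
    return steps


# Topology read off Table 2 (assumed encoding).
tree = (
    (
        (("Ecoli-A1", "Ecoli-A2"), "Ecoli-A3"),
        (("Ecoli-B2", "Ecoli-B3"), "Ecoli-B1"),
    ),
    (("Ecoli-C2", "Ecoli-C3"), "Ecoli-C1"),
)

steps = merge_schedule(tree)
for left, right, label in steps:
    print(f"{left} vs {right} -> {label}")
```

The nine strains give eight merges, matching the eight rows of the table; the post-order traversal lists them clade by clade rather than level by level, but every merge still combines exactly the same pair of inputs.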

Based on the pan-genome constructed at each node, the numbers of genes in the core genome and the non-core genome can be mapped back onto the evolutionary tree. FIG. 5 shows the number of genes of the core genome and the non-core genome at each node of the evolutionary tree; light color represents the number of genes in the core genome, and dark color the number of genes in the non-core genome.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Corresponding to the above method, the present application also provides a pan-genome construction apparatus based on a phylogenetic tree, which is used for implementing the above embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

This is further illustrated below in connection with alternative embodiments.

Example 4:

This embodiment provides a pan-genome construction apparatus based on a phylogenetic tree. As shown in fig. 6, the construction apparatus includes: an evolutionary tree building module 20 and a pan-genome construction module 40, wherein,

the evolutionary tree building module 20 is used for performing phylogenetic tree building on strains of different strains to obtain a phylogenetic tree;

and the pan-genome construction module 40 is used for clustering every two strains according to different positions of strains of different strains on the phylogenetic tree from the bottom layer to the top layer, and performing pan-genome construction layer by layer to obtain the total pan-genome of different strains.

In this pan-genome construction apparatus, the phylogenetic tree is first constructed by the evolutionary tree building module 20, so that comparisons are made purposefully according to the tree; the pan-genome construction module 40 then constructs the pan-genome step by step from the bottom layer to the top layer of the phylogenetic tree according to the evolutionary distances between strains, thereby obtaining the total pan-genome. Because the apparatus constructs the pan-genome from the bottom layer to the top layer, it addresses the inaccurate comparison results of existing methods and also yields the pan-genome of each evolutionary node, so that presence/absence variation (PAV) results can be analyzed from a more detailed evolutionary perspective. The total pan-genome here refers to the pan-genome constructed at the topmost layer.

In a preferred embodiment, the evolutionary tree building module further comprises: the information acquisition module is used for acquiring genome sequence information and annotation information of strains of different strains; the multi-sequence comparison module is used for selecting the conserved protein sequences in all strains to carry out multi-sequence comparison according to the genome sequence information and the annotation information of each strain to obtain a comparison result; and the evolutionary tree construction submodule is used for constructing the phylogenetic tree according to the comparison result.

Preferably, the multiple sequence alignment module comprises: the MUSCLE module is used for carrying out multi-sequence comparison on conserved protein sequences in all strains to obtain an initial comparison result; and the Gblocks module is used for filtering the initial comparison result to obtain the comparison result.

Preferably, the evolutionary tree building submodule is a ProtTest module.

In a preferred embodiment, the information acquisition module includes at least one of: (1) a first obtaining module for obtaining from known genomic sequence information and annotation information; (2) a second obtaining module for obtaining by sequencing, assembling and annotating the strain in sequence.

In a preferred embodiment, the second obtaining module includes: the sequencing module is used for sequencing the strains to be compared to obtain sequencing reads; the assembly module is used for carrying out sequence splicing assembly on the strains based on the overlapping property of sequencing reads to obtain a genome sequence; and the annotation module is used for carrying out repeated sequence and gene structure annotation on the genome sequence to obtain a genome annotation result of the strain.

Preferably, the sequencing module is a PacBio sequencing module.

Preferably, the assembly module is a Falcon, Canu or WTDBG module.

Preferably, the annotation module is a RepeatMasker, Augustus, GlimmerHMM, GeneWise, or EVM module.

In a preferred embodiment, the pan-genome construction module comprises: the bottom pan-genome building module is used for comparing every two strains of the bottom layer according to different positions of strains of different strains on the phylogenetic tree, and dividing a core genome and a non-core genome according to a sequence similarity threshold value to obtain a bottom pan-genome; and the pan-genome layer-by-layer construction module is used for carrying out pan-genome construction again on the pan-genome at the bottom layer and the genome or pan-genome at the layer above the bottom layer, and so on until the top layer, so that the total pan-genome of different strains is obtained.

Preferably, the sequence similarity threshold is 90%: sequences with a similarity of 90% or more are assigned to the core genome, and sequences with a similarity of less than 90% are assigned to the non-core genome.
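The 90% rule can be illustrated with a short sketch. It assumes, hypothetically, that each sequence region has already been assigned a percent identity from the pairwise alignment (for example, parsed from an aligner's coordinate output); the function name and the input layout are not part of the embodiment.

```python
def split_core_noncore(identities, threshold=90.0):
    """Divide regions into core and non-core sets by percent identity.

    identities: dict mapping region id -> percent identity of its best
                alignment to the other genome; regions with no alignment
                are assumed to be recorded as 0.0.
    Regions at or above the threshold are core; all others are non-core.
    """
    core = {region for region, ident in identities.items() if ident >= threshold}
    noncore = set(identities) - core
    return core, noncore


# Toy identities (illustrative only); note that exactly 90.0 counts as core.
identities = {"r1": 99.2, "r2": 90.0, "r3": 89.9, "r4": 0.0}
core, noncore = split_core_noncore(identities)
print(sorted(core), sorted(noncore))
```

Keeping the comparison as `>=` makes the boundary case explicit: a region with exactly 90% similarity is assigned to the core genome, as stated above.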

In a preferred embodiment, the construction apparatus further comprises: a statistical module for counting the sequence length and the number of genes of the regions related to each evolutionary node and to the total pan-genome, so as to obtain the presence/absence variation of each strain at each evolutionary node.
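The kind of per-node statistics such a module produces can be sketched as below. This is a simplified illustration under stated assumptions: presence/absence is evaluated at the level of gene identifiers rather than aligned sequence regions, a gene is counted as core when every member strain of the node carries it, and all names and data are hypothetical.

```python
def node_summary(pan_genes, gene_lengths, members, strain_genes):
    """Summarise one evolutionary node of the tree.

    pan_genes:    set of gene ids in the node's pan-genome.
    gene_lengths: dict mapping gene id -> sequence length in bp.
    members:      list of strains below this node.
    strain_genes: dict mapping strain -> set of gene ids it carries.
    """
    # Presence (1) / absence (0) of each pan-genome gene in each member strain.
    pav = {
        gene: {strain: int(gene in strain_genes[strain]) for strain in members}
        for gene in sorted(pan_genes)
    }
    core = [gene for gene, row in pav.items() if all(row.values())]
    return {
        "n_genes": len(pan_genes),
        "total_length": sum(gene_lengths[g] for g in pan_genes),
        "n_core": len(core),
        "n_noncore": len(pan_genes) - len(core),
        "pav": pav,
    }


# Toy node with two strains (illustrative only).
strain_genes = {"A1": {"g1", "g2"}, "A2": {"g1", "g3"}}
gene_lengths = {"g1": 900, "g2": 1200, "g3": 300}
summary = node_summary({"g1", "g2", "g3"}, gene_lengths, ["A1", "A2"], strain_genes)
print(summary["n_core"], summary["n_noncore"], summary["total_length"])
```

Running the same summary for every node of the tree gives the core and non-core gene counts that are mapped onto the tree, together with a PAV row for each gene and strain.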

Example 5

This embodiment provides a storage medium comprising a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to execute the above pan-genome construction method based on a phylogenetic tree.

This embodiment also provides a processor for running a program, wherein the program, when run, executes the above pan-genome construction method based on a phylogenetic tree.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

From the above description, it can be seen that the above-described embodiments of the present invention achieve the following technical effects:

(1) In traditional methods and devices, the other strains are all compared against a single reference genome, and a threshold is then selected to divide the core genome and the non-core genome. This approach is not suitable for bacterial genomes with large genetic polymorphism and therefore produces wrong classification results. The method and device of the present application perform bacterial pan-genomic analysis based on phylogenetic relationships: introducing the phylogenetic tree makes the comparisons purposeful, and the pan-genome is constructed step by step from the bottom to the top of the phylogenetic tree according to the evolutionary distances between strains.

(2) Traditional methods and devices do not introduce an evolutionary tree and ultimately yield only the PAV status of the other strains relative to the total pan-genome. With the method and device provided by the present application, the pan-genome of each evolutionary node is obtained through pan-genome construction from the bottom to the top, so that the PAV results can be analyzed from a more detailed evolutionary perspective.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A pan-genome construction method based on a phylogenetic tree is characterized by comprising the following steps:

carrying out phylogenetic tree construction on strains of different strains to obtain a phylogenetic tree;

clustering every two strains in the direction from the bottom layer to the top layer according to different positions of the strains of different strains on the phylogenetic tree, and performing pan-genome construction layer by layer to obtain the total pan-genome of different strains.

2. The construction method according to claim 1, wherein before performing phylogenetic tree construction on strains of different strains, the construction method further comprises:

acquiring genome sequence information and annotation information of the strains of different strains;

selecting conserved protein sequences in all strains to carry out multi-sequence comparison according to the genome sequence information and annotation information of each strain to obtain comparison results;

constructing the phylogenetic tree according to the comparison result;

preferably, the multiple sequence alignment is carried out on the conserved protein sequences in all strains by adopting MUSCLE software to obtain an initial comparison result;

filtering the initial comparison result by adopting Gblocks software to obtain the comparison result;

preferably, ProtTest software is adopted to construct the phylogenetic tree according to the comparison result;

preferably, the strains of different strains are viruses, bacteria, fungi, parasites, spirochetes, mycoplasma, chlamydia and rickettsia, or animals or plants with genetic polymorphism of 2-30%, preferably 5-25%, more preferably 10-20%.

3. The construction method according to claim 2, wherein the manner of obtaining genomic sequence information and annotation information of the strains of different strains includes at least one of:

(1) obtaining from known genomic sequence information and annotation information;

(2) obtaining by sequencing, assembling and annotating the strain in sequence.

4. The construction method according to claim 3, wherein obtaining genome sequence information and annotation information of the strain by sequencing, assembling and annotating the strain comprises:

sequencing the strains to be compared to obtain sequencing reads;

performing sequence splicing assembly on the strain based on the overlapping property of the sequencing reads to obtain a genome sequence;

performing repeated sequence and gene structure annotation on the genome sequence to obtain a genome annotation result of the strain;

preferably, the strain is subjected to PacBio sequencing or nanopore sequencing;

preferably, the strains are subjected to sequence splicing assembly based on the overlapping property of the sequencing reads by adopting Falcon, Canu or WTDBG software;

preferably, the genomic sequence is annotated with repetitive sequences and gene structure using RepeatMasker, Augustus, GlimmerHMM, GeneWise or EVM software.

5. The construction method according to any one of claims 1 to 4, wherein performing pan-genome construction layer by layer according to pairwise clustering of the strains of different strains in different positions on the phylogenetic tree in a direction from a bottom layer to a top layer to obtain a total pan-genome of the different strains comprises:

according to different positions of the strains of different strains on the phylogenetic tree, firstly, pairwise comparison is carried out on the strains at the bottom layer, and a core genome and a non-core genome are divided according to a sequence similarity threshold value to obtain a pan-genome at the bottom layer;

performing pan-genome construction again on the pan-genome at the bottom layer and the genome or pan-genome at the layer above the bottom layer, and repeating the steps until reaching the top layer, thereby obtaining the total pan-genome of different strains;

preferably, the sequence similarity threshold is 90%, sequences with a similarity of 90% or more are assigned to the core genome, and sequences with a similarity of less than 90% are assigned to the non-core genome.

6. The construction method according to claim 1, wherein after obtaining the total pan-genome of the different strains, the construction method further comprises:

counting the sequence length and the number of genes of the regions related to each evolutionary node and to the total pan-genome, so as to obtain the presence/absence variation of each strain at each evolutionary node.

7. A pan-genomic construction apparatus based on phylogenetic trees, the construction apparatus comprising:

the phylogenetic tree building module is used for carrying out phylogenetic tree building on strains of different strains to obtain a phylogenetic tree;

and the pan-genome construction module is used for clustering every two strains from the bottom layer to the top layer according to different positions of the strains of different strains on the phylogenetic tree, and performing pan-genome construction layer by layer to obtain the total pan-genome of different strains.

8. The building apparatus of claim 7, wherein the evolutionary tree building module further comprises:

the information acquisition module is used for acquiring genome sequence information and annotation information of the strains of different strains;

the multi-sequence comparison module is used for selecting the conserved protein sequences in all strains to carry out multi-sequence comparison according to the genome sequence information and the annotation information of each strain to obtain comparison results;

the evolutionary tree construction submodule is used for constructing the systematic evolutionary tree according to the comparison result;

preferably, the multiple sequence alignment module comprises: the MUSCLE module is used for carrying out the multi-sequence comparison on the conserved protein sequences in all strains to obtain an initial comparison result; the Gblocks module is used for filtering the initial comparison result to obtain the comparison result;

preferably, the evolutionary tree building submodule is a ProtTest module.

9. The building apparatus according to claim 8, wherein the information acquisition module includes at least one of:

(1) a first obtaining module for obtaining from known genomic sequence information and annotation information;

(2) a second obtaining module for obtaining by sequencing, assembling and annotating the strain in sequence.

10. The building apparatus according to claim 9, wherein the second obtaining module comprises:

the sequencing module is used for sequencing the strains to be compared to obtain sequencing reads;

the assembly module is used for carrying out sequence splicing assembly on the strains based on the overlapping property of the sequencing reads to obtain a genome sequence;

the annotation module is used for carrying out repeated sequence and gene structure annotation on the genome sequence to obtain a genome annotation result of the strain;

preferably, the sequencing module is a PacBio sequencing module;

preferably, the assembly module is a Falcon, Canu or WTDBG module;

preferably, the annotation module is a RepeatMasker, Augustus, GlimmerHMM, GeneWise or EVM module.

11. The construction apparatus according to any one of claims 7 to 10, wherein the pan-genome construction module comprises:

the bottom pan-genome construction module is used for comparing every two strains of the bottom layer according to different positions of the strains of different strains on the phylogenetic tree, and dividing a core genome and a non-core genome according to a sequence similarity threshold value to obtain a bottom pan-genome;

a pan-genome layer-by-layer construction module for performing pan-genome construction again on the pan-genome of the bottom layer and the genome or pan-genome of the layer above the bottom layer, and repeating the steps until the top layer, so as to obtain the total pan-genome of different strains.

12. The construction apparatus of claim 7, further comprising:

a statistical module for counting the sequence length and the number of genes of the regions related to each evolutionary node and to the total pan-genome, so as to obtain the presence/absence variation of each strain at each evolutionary node.

13. A storage medium comprising a stored program, wherein the program, when run, controls a device on which the storage medium is located to execute the pan-genome construction method based on a phylogenetic tree according to any one of claims 1 to 6.

14. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, executes the pan-genome construction method based on a phylogenetic tree according to any one of claims 1 to 6.