CN116705156A

CN116705156A - Method for finding decisive sites for viral genome classification based on a decision tree algorithm

Info

Publication number: CN116705156A
Application number: CN202210167610.5A
Authority: CN
Inventors: 郝沛; 徐心恬; 宋诗阳
Original assignee: Institut Pasteur of Shanghai of CAS
Current assignee: Institut Pasteur of Shanghai of CAS
Priority date: 2022-02-23
Filing date: 2022-02-23
Publication date: 2023-09-05

Abstract

The application provides a phylogenetic analysis method for viral genomes. Specifically, the application provides a method for performing phylogenetic analysis of viral genomes by determining the decisive sites for viral genome classification based on a decision tree algorithm. The method can efficiently and rapidly analyze tens of thousands of viral genome sequences, which greatly broadens the applicability of phylogenetic analysis. In addition, by means of the identified classification-decisive sites, the application tracks the variation of key sites during viral transmission and lays a solid foundation for epidemiological research and for the prevention and control of infectious diseases.

Description

Method for finding decisive sites for viral genome classification based on a decision tree algorithm

Technical Field

The application relates to the fields of virology, bioinformatics and phylogenetic analysis, and in particular to a method for finding the decisive sites for viral genome classification based on a decision tree algorithm.

Background

Phylogenetics studies the evolutionary relationships between species; its basic idea is to compare the characteristics of species and to regard species with similar characteristics as genetically close. The results of phylogenetic studies are often presented in the form of a phylogenetic tree (also known as an evolutionary tree). A phylogenetic tree uses a tree-like branching pattern to summarize the relatedness between species and consists of a series of nodes and branches, where each node represents a taxonomic unit (a species or a sequence) and the links between nodes represent the evolutionary relationships between species. Topology and branch length are the two important properties of a phylogenetic tree. Generally, phylogenetic trees are constructed by distance-based methods or character-based methods. The former include the minimum evolution method (ME) ^[1] and the neighbor-joining method (NJ) ^[2]; the latter are represented by the maximum parsimony method (MP) ^[3] and the maximum likelihood method (ML) ^[4].

With the rapid development of bioinformatics and the arrival of the big data era, phylogenetics and phylogenetic trees are increasingly widely applied in microbiology and virology. Notably, viral genomes are more prone to mutation than those of other organisms, so viruses tend to produce large numbers of variant strains within a short period of transmission and evolution. How to resolve, within a short time, the various branches that a virus forms during evolution as the pathogen of an emerging infectious disease is currently a major challenge in virology.

Traditional phylogenetic tree construction requires the following steps: data collection, multiple sequence alignment, mathematical model selection, tree construction, and testing and evaluation. Its defects are as follows. First, constructing a phylogenetic tree requires multiple sequence alignment of the viral genomes. When the number of viral genomes to be analyzed exceeds a certain upper limit, the time software takes to perform the multiple sequence alignment increases greatly, so this step limits the number of sequences that a conventional phylogenetic tree can analyze. Second, although a conventional phylogenetic tree shows the genetic relatedness of viral genomes well, it cannot intuitively explain why each branch of the tree formed; that is, the decisive sites of each branch cannot be given at each step of constructing the tree.

Therefore, there is an urgent need in the art to develop a method for constructing phylogenetic trees of viruses that is efficient, simple and more intuitive.

Disclosure of Invention

The application aims to provide a method for constructing a phylogenetic tree of viruses efficiently, simply, conveniently and intuitively.

In a first aspect of the present application, there is provided a method for phylogenetic analysis of viral genomes, the method comprising the steps of:

(S1) providing a sample to be analyzed;

(S2) sequencing the sample to be analyzed, thereby obtaining a genomic sequence of the sample;

(S3) aligning the genome sequences with a reference genome, using an alignment method chosen according to the number N of genome sequences obtained in step (S2), thereby obtaining a variation site matrix of size m × n, wherein m is the number of sequences and n is the number of variation sites;

(S4) calculating the information entropy of each column in the variation site matrix, thereby obtaining the information entropy of the n variation sites;

(S5) analyzing the information entropy of the variation sites using a decision tree algorithm, thereby obtaining the decisive sites for the genome classification; and

(S6) visualizing the decisive sites, thereby obtaining a phylogenetic tree of the viral genome.

In another preferred embodiment, the calculation formula of the decision tree algorithm is shown in formula I:

wherein col is the decisive site, C is the set of all variation sites, V_col is the set of all nucleotide classes present at the col position, M_j is the set of all sequences with nucleotide j at the col position, and Ent(M_j) is the information entropy.

In another preferred embodiment, the information entropy is calculated according to formula II:

wherein Ent(M_j) is the information entropy; k is one of the variation sites other than the site whose information entropy is being calculated; C' is the set of all variation sites other than the site whose information entropy is being calculated; C is the set of all variation sites; p is one of the nucleotide classes occurring at position k in the sequence set M_j, other than the class with the highest occurrence probability; V'_k is the set of all nucleotide classes occurring at position k in the sequence set M_j, other than the class with the highest occurrence probability; V_k is the set of all nucleotide classes occurring at position k; and M_j is the set of all sequences with nucleotide j at the k position. The class excluded from V'_k is the nucleotide class with the highest occurrence probability at position k in the sequence set M_j.

In another preferred embodiment, the sample is from a virus to be analyzed that has been identified as being from the same species as the reference genome in terms of virus taxonomy.

In another preferred embodiment, the virus comprises: DNA viruses, RNA viruses, or combinations thereof.

In another preferred embodiment, the virus comprises a plant virus, an animal virus, a bacterial virus (phage), or a combination thereof.

In another preferred embodiment, the virus comprises a single stranded RNA virus, a double stranded RNA virus, a single stranded DNA virus, a double stranded DNA virus, or a combination thereof.

In another preferred embodiment, the type of sample is selected from the group consisting of: a DNA sample, an RNA sample, or a combination thereof.

In another preferred embodiment, the reference genome and the test sample are derived from the same species.

In another preferred embodiment, the reference genome refers to all, a portion, or a combination thereof, of the genome of the species.

In another preferred embodiment, the reference genome comprises a whole genome or a partial genome.

In another preferred embodiment, the reference genome may be continuous or discontinuous.

In another preferred embodiment, the reference genome is all, a portion, or a combination thereof of all nucleic acids (DNA and/or RNA) of the species (e.g., virus).

In another preferred embodiment, the coverage of the reference genome is more than 50%, preferably more than 60%, more preferably more than 70%, more preferably more than 80%, most preferably more than 95% of the whole genome.

In another preferred embodiment, the sequencing is selected from the group consisting of: single-end sequencing, paired-end sequencing, or a combination thereof.

In another preferred embodiment, the sequence alignment is performed using sequence alignment software selected from the group consisting of: MEGA, Clustal Omega, MAFFT, ClustalW, NCBI BLAST, or combinations thereof.

In another preferred embodiment, in step (S3), when the number N of genome sequences is less than 10,000, sequence alignment is performed by method A, wherein method A is as follows: all of the sequenced genome sequences are subjected to multiple sequence alignment together with the reference genome.

In another preferred embodiment, in step (S3), when the number N of genome sequences is more than 10,000, sequence alignment is performed by method B, wherein method B is as follows: each sequenced genome sequence is individually aligned with the reference genome (single-sequence alignment).

In another preferred embodiment, the single-sequence alignment is performed as follows: sequence alignment software is used to align each genome sequence individually with the reference genome, and the single-sequence alignment scripts are run in parallel using the Linux command xargs -P 30.

In another preferred embodiment, the visualizing is performed by software selected from the group consisting of: Cytoscape, Gephi, igraph, or a combination thereof.

In another preferred embodiment, the visualization is achieved by a package/module.

In a second aspect of the application, there is provided a system for phylogenetic analysis of viral genomes, the system comprising:

(M1) a sequencing unit for nucleic acid sequencing a sample to be analyzed, thereby obtaining a genomic sequence of the sample;

(M2) a comparison unit, connected to the sequencing unit, for comparing the obtained genomic sequence of the sample with a reference genome, thereby obtaining positional information of the genomic sequence on the reference genome, and obtaining a mutation site matrix;

(M3) a calculation unit, which is connected with the comparison unit and is used for calculating the information entropy of the mutation site matrix and analyzing the information entropy by utilizing a decision tree algorithm so as to obtain the decisive sites of genome classification; and

(M4) a visualization processing unit, connected to the calculation unit, for visualizing the determinant site of the obtained genome classification, thereby obtaining a phylogenetic tree of the viral genome.

In a third aspect of the application, there is provided the use of a method of phylogenetic analysis of viral genomes for: (i) building a virus phylogenetic tree; (ii) epidemiological analysis of infectious diseases; and (iii) prevention and control of infectious diseases.

In another preferred embodiment, the infectious disease is an infectious disease caused by a virus.

It is understood that, within the scope of the present application, the above-described technical features of the present application and the technical features specifically described below (e.g., in the examples) may be combined with each other to constitute new or preferred technical solutions. Owing to space limitations, these are not described one by one here.

Drawings

FIG. 1 is a schematic flow chart of the two methods for obtaining the viral genome variation site matrix, used when the number of sequences is suitable or unsuitable for multiple sequence alignment, respectively. Schematic (1) shows that the viral genomes are subjected to multiple nucleic acid sequence alignment, the head and tail fragments of the genomes are cleaned, and the viral genome variation site matrix is obtained by identification. Schematic (2) shows that each sequence is separately aligned with the reference genome, with parallel computation used to shorten the alignment time, and the variant nucleotides of each sequence relative to the reference genome are merged into the (overall) viral genome variation site matrix.

FIG. 2 shows an example of a text file obtained by implementing a decision tree algorithm for finding decisive sites for viral genome classification in the R language.

FIG. 3 shows the visualization of the decisive sites for viral genome classification using Cytoscape software. The top 10 levels of branches in the classification of more than 3000 SARS-CoV-2 viruses are shown. The circular nodes in the figure represent the decisive sites of each branch of the network graph, and the size of a circle represents the number of viral sequences contained in that branch. For example, 17747 (C: 72, T: 17) indicates that the downward division of this node into two branches is caused by a variation at nucleotide 17747: 72 viral sequences have cytosine (C) at position 17747 and 17 sequences have thymine (T) at position 17747.

Detailed Description

Through extensive and intensive research, the inventors have for the first time developed a technical scheme that uses a decision tree algorithm to classify viral genome sequences and to find the decisive sites. Compared with a traditional evolutionary tree, this scheme does not depend on multiple sequence alignment in upstream data processing, so there is no upper limit on the number of sequences that can be analyzed, and tens of thousands of viral genome sequences can be analyzed efficiently and rapidly. While displaying the genetic relationships among the viral genomes, the scheme also explains, through the identified classification-decisive sites, why each branch formed during viral evolution, so that the variation of key sites can be tracked during viral transmission. The present application has been completed on this basis.

Terminology

As used herein, the term "variation site matrix" refers to a matrix composed of the nucleotides of all sequences to be analyzed at all variation sites. Each row gives the nucleotides of one sequence to be analyzed at all variation sites of the reference sequence, and each column gives the nucleotides of all sequences to be analyzed at one variation site of the reference genome. A variation site is a site on the reference sequence at which, after sequence alignment, the nucleotide of any one of the sequences to be analyzed differs from the nucleotide of the reference sequence.
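The definition above can be illustrated with a minimal Python sketch. It assumes the sequences are already aligned to the reference and have the same length as it; the function name `variation_site_matrix` and the toy sequences are hypothetical, and gap handling is left out.

```python
def variation_site_matrix(reference, sequences):
    """Return (positions, matrix) for sequences already aligned to the reference.

    positions: 1-based reference positions where at least one sequence differs.
    matrix:    one row per sequence, one column per variation site.
    """
    positions = [
        i + 1
        for i, ref_nt in enumerate(reference)
        if any(seq[i] != ref_nt for seq in sequences)
    ]
    matrix = [[seq[p - 1] for p in positions] for seq in sequences]
    return positions, matrix


# Toy data (hypothetical): three aligned sequences of length 6.
reference = "ACGTAC"
sequences = ["ACGTAC", "ACTTAC", "ACGTAT"]
positions, matrix = variation_site_matrix(reference, sequences)
print(positions)  # 1-based variation sites
print(matrix)     # m rows (sequences) by n columns (variation sites)
```

Here m is 3 and n is 2, since only reference positions 3 and 6 differ in at least one sequence.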

As used herein, the term "decision tree algorithm" refers to a supervised learning algorithm for classification problems in the field of machine learning, which uses a tree structure to reach a final classification by layer-by-layer reasoning. When a computer uses the algorithm for classification prediction, an attribute value is tested at each internal node of the tree, and the corresponding branch node is entered according to the test result until a leaf node is reached, which gives the classification result. In this patent, the decision tree algorithm uses each column of the variation site matrix as a feature and calculates the information entropy for each feature. By selecting the site with the minimum information entropy as the classification-decisive site, the algorithm maximizes the information gain, i.e., the degree to which the classification uncertainty of the variation site matrix is reduced.

As used herein, the term "information entropy" refers to an indicator used in decision tree algorithms to measure the purity of a sample set. In general, the greater the information entropy, the lower the purity of the sample set; the smaller the information entropy, the higher the purity. Unlike conventional information entropy algorithms, the sequence sets in this patent carry no labels, so the information entropy is defined using the purity of the nucleotide classes at the other sites. The specific formula is as follows:

wherein k is one of the variation sites other than the site whose information entropy is being calculated; p is one of the nucleotide classes occurring at position k in the sequence set M_j, other than the class with the highest occurrence probability; V'_k is the set of all nucleotide classes occurring at position k in the sequence set M_j, other than the class with the highest occurrence probability; V_k is the set of all nucleotide classes occurring at position k; M_j is the set of all sequences with nucleotide j at the k position; C' is the set of all variation sites other than the site whose information entropy is being calculated; and C is the set of all variation sites. When the nucleotide class with the highest occurrence probability at position k is removed from M_j, the number of sequences that remain is the information entropy of M_j on column k. The greater the information entropy, the less pure the sample set is at position k.
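Because the formula itself is not reproduced in this text, the following Python sketch shows only one reading of the verbal definition: for each other column k, count the sequences that do not carry the most frequent nucleotide, then sum over the columns in C'. The function name `subset_entropy`, the zero-based column index `col`, and the choice to take M_j as the sequences sharing a nucleotide at the site under evaluation are assumptions.

```python
from collections import Counter


def subset_entropy(rows, col):
    """Impurity of a sequence set M_j, summed over every column except `col`.

    For each other column k, the contribution is the number of sequences left
    after removing the most frequent nucleotide class at k (an assumed reading
    of the definition in the text, not the patent's formula).
    """
    total = 0
    for k in range(len(rows[0])):
        if k == col:
            continue  # skip the site whose entropy is being calculated
        counts = Counter(row[k] for row in rows)
        total += len(rows) - max(counts.values())
    return total


# Toy M_j (hypothetical): all four sequences carry "A" at column 0.
mixed = ["AAC", "AAC", "ATC", "AAG"]
pure = ["AAC", "AAC"]
print(subset_entropy(mixed, 0))  # one minority sequence in each other column
print(subset_entropy(pure, 0))   # identical sequences give zero impurity
```

Under this reading, a value of zero means the subset is identical at every other variation site, which matches the statement that smaller entropy means higher purity.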

As used herein, the term "decisive site" (determinant site) refers to a specific single nucleotide variation (SNV) site capable of classifying the viral sequences to be analyzed into genotypes having multiple representative characteristics (e.g., epidemiological significance). In this patent, the site with the smallest information entropy is selected as the decisive site for sequence classification, i.e., the decisive site obtained can be the optimal site for separating multiple viral genotypes.

As used herein, the terms "system," "device" and "apparatus" are synonymous.

Reference genome

In the present application, the reference genome may be a whole genome or a partial genome, for example of a virus. The reference genome may also be continuous or discontinuous. When the reference genome is a partial genome, the total coverage (F) of the reference genome is 50% or more, preferably 60% or more, more preferably 70% or more, more preferably 80% or more, most preferably 95% or more of the whole genome, wherein the total coverage (F) refers to the percentage of the whole genome that the reference genome represents.

In a preferred embodiment, the reference genome is a whole genome.

Sequencing

In the present application, sequencing can be performed using conventional sequencing techniques and platforms. The sequencing platform is not particularly limited. Second-generation sequencing platforms include (but are not limited to): GA, GAII, GAIIx, HiSeq1000/2000/2500/3000/4000, X Ten, X Five, NextSeq500/550, MiSeq, MiSeqDx, MiSeq FGx and MiniSeq of Illumina; SOLiD of Applied Biosystems; 454 FLX of Roche; Ion Torrent, Ion PGM and Ion Proton I/II of Thermo Fisher Scientific (Life Technologies); BGISEQ1000, BGISEQ500 and BGISEQ100 of BGI; BioelectronSeq 4000 of CapitalBio Group; DA8600 of Daan Gene Co., Ltd. of Sun Yat-sen University; NextSeq CN500 of Berry Genomics; BIGIS of Zhongke Zixin, a subsidiary of Zixin Pharmaceutical; and HYK-PSTAR-IIA of HYK Gene.

Third-generation single-molecule sequencing platforms include (but are not limited to): the HeliScope system of Helicos BioSciences, the SMRT system of Pacific Biosciences, and GridION and MinION of Oxford Nanopore Technologies. The sequencing type may be single-end sequencing or paired-end sequencing. The read length may be 30 bp, 40 bp, 50 bp, 100 bp, 300 bp, or any other length greater than 30 bp, and the sequencing depth may be 0.01, 0.02, 0.1, 1, 5, 10 or 30 times the genome, or any other multiple greater than 0.01.

Data processing

In the present application, the data processing generally includes the steps of:

(i) Sequencing a sample to be analyzed, thereby obtaining a genomic sequence of the sample;

(ii) Obtaining a variation site matrix by aligning the genome sequence with a reference genome according to the number of genome sequences obtained in (i);

(iii) Calculating the information entropy of the variation locus matrix, and utilizing a decision tree algorithm to obtain the decisive loci of genome classification; and

(iv) Visualizing the decisive sites, thereby obtaining a phylogenetic tree of the organism's genome.

In step (i), the method specifically further comprises: processing the sample to be analyzed and extracting the nucleic acids (DNA and/or RNA) from the sample. Methods for extracting nucleic acid from the sample to be tested include (but are not limited to) column extraction and magnetic bead extraction. A library is then constructed from the sample and sequenced on a high-throughput sequencing platform, thereby obtaining the full-length or partial nucleic acid sequence of the viral genome in the sample.

In step (ii), the method specifically further comprises: before sequence alignment, quality control is performed on the genome sequences obtained by sequencing, and low-quality sequences are removed, where low-quality sequences are sequences whose length is less than 90% of the length of the reference sequence. In addition, in this step the reference genome sequence must be placed first among the sequences to be aligned, the head and tail fragments of the aligned sequences must be cleaned, and fragments showing abnormalities inside the aligned sequences must be cleaned.
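The length-based quality control described above could be sketched as follows. This is a minimal Python illustration; the function name `filter_by_length`, the dictionary layout (sequence name to sequence string) and the toy records are assumptions, while the 90% threshold comes from the text.

```python
def filter_by_length(records, ref_len, min_fraction=0.9):
    """Keep sequences whose length is at least min_fraction of the reference length."""
    threshold = min_fraction * ref_len
    return {name: seq for name, seq in records.items() if len(seq) >= threshold}


# Toy records (hypothetical) checked against a reference of length 20.
records = {
    "seq_full": "A" * 20,
    "seq_nearly_full": "A" * 19,
    "seq_short": "A" * 15,
}
kept = filter_by_length(records, ref_len=20)
print(sorted(kept))  # names of sequences that pass quality control
```

Keeping the threshold as a parameter makes it easy to apply a stricter or looser cutoff for other datasets.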

For sequence alignment, different methods are selected according to the number of genome sequences obtained. When the number N of genome sequences is less than 10,000, method A is used: all of the sequenced genome sequences are subjected to multiple sequence alignment together with the reference genome. When the number N of genome sequences is more than 10,000, method B is used: each sequenced genome sequence is individually aligned with the reference genome (single-sequence alignment). The single-sequence alignment is performed as follows: sequence alignment software is used to align each genome sequence individually with the reference genome, and the single-sequence alignment scripts are run in parallel using the Linux command xargs -P 30.
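The text runs the single-sequence alignment scripts in parallel with xargs -P 30. As a rough Python analogue, and not the command used in the application, the sketch below fans out one job per sequence with `concurrent.futures.ThreadPoolExecutor`. The function `pairwise_variants` is a hypothetical stand-in for a real aligner call: it simply compares equal-length strings position by position.

```python
from concurrent.futures import ThreadPoolExecutor


def pairwise_variants(reference, sequence):
    """Stand-in for one single-sequence alignment job (hypothetical).

    Returns {1-based position: nucleotide} for positions differing from the
    reference; a real pipeline would call alignment software here.
    """
    return {
        i + 1: nt
        for i, (ref_nt, nt) in enumerate(zip(reference, sequence))
        if nt != ref_nt
    }


def variants_in_parallel(reference, sequences, workers=30):
    """Run one job per sequence on up to `workers` threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: pairwise_variants(reference, s), sequences))


reference = "ACGTAC"
sequences = ["ACGTAC", "ACTTAC", "ACGTAT"]
per_sequence = variants_in_parallel(reference, sequences)
print(per_sequence)  # one dict of variant nucleotides per input sequence
```

Threads help mainly when each job waits on an external aligner process; for pure-Python work a process pool would be the more natural choice.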

Variation sites in the genome sequences of the samples to be tested are extracted through sequence alignment, yielding a variation site matrix of size m × n for the sample genomes, where m is the number of sequences and n is the number of variation sites.

For the alignment result of method A, the m × n variation site matrix can be extracted directly. For the alignment results of method B, the 1 × x variation site submatrices obtained from each single-sequence alignment (x being the number of variation sites of that single sequence) are merged to finally obtain the variation site matrix of size m × n.
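The merging step for method B might look like the following Python sketch. It assumes each 1 × x submatrix is held as a dictionary from 1-based reference position to variant nucleotide, and that a sequence with no recorded variant at a site carries the reference nucleotide there; both the representation and the name `merge_variant_rows` are assumptions.

```python
def merge_variant_rows(reference, per_sequence):
    """Merge per-sequence variant dicts into one m x n variation site matrix.

    n is the size of the union of variant positions; a sequence without a
    recorded variant at a position is filled with the reference nucleotide.
    """
    positions = sorted({pos for variants in per_sequence for pos in variants})
    matrix = [
        [variants.get(pos, reference[pos - 1]) for pos in positions]
        for variants in per_sequence
    ]
    return positions, matrix


# Toy submatrices (hypothetical): x is 0, 1 and 1 for the three sequences.
reference = "ACGTAC"
per_sequence = [{}, {3: "T"}, {6: "T"}]
positions, matrix = merge_variant_rows(reference, per_sequence)
print(positions)
print(matrix)
```

The union of positions gives n, so n can be larger than any single x.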

In step (iii), the method specifically further comprises: calculating the information entropy of each column (i.e., of each variation site) in the m × n variation site matrix obtained in step (ii), sorting the information entropies, and selecting the variation site with the minimum information entropy as the first branch point of the decision tree (i.e., the first-level classification-decisive site). The variation site matrix is divided into two or more subsets according to the first-level classification-decisive site, the information entropy of each column is calculated again within each subset, and the variation site with the minimum information entropy is selected as the second-level branch point of the decision tree (i.e., the second-level classification-decisive site). The procedure continues recursively until the termination condition is reached, which is that all sequences contained in the current subset are identical.
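The recursion described above can be sketched in Python under stated assumptions. Since the formulas are not reproduced here, the site score is taken as the plain sum, over the subsets produced by splitting at that site, of the number of non-majority sequences in every other column. Restricting candidates to columns that still vary in the current subset, and breaking ties by the earliest reference position, are also assumptions made so that the toy version terminates. All names are hypothetical.

```python
from collections import Counter


def subset_entropy(rows, col):
    """Non-majority sequence count summed over all columns except `col`."""
    return sum(
        len(rows) - max(Counter(r[k] for r in rows).values())
        for k in range(len(rows[0]))
        if k != col
    )


def split_by(rows, col):
    """Group sequences by their nucleotide at column `col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row)
    return groups


def site_entropy(rows, col):
    """Assumed site score: unweighted sum of subset entropies after the split."""
    return sum(subset_entropy(g, col) for g in split_by(rows, col).values())


def build_tree(rows, positions):
    """Recursively pick the minimum-entropy site until all rows are identical."""
    if all(row == rows[0] for row in rows):
        return {"size": len(rows)}  # leaf: termination condition from the text
    candidates = [c for c in range(len(positions)) if len({r[c] for r in rows}) > 1]
    best = min(candidates, key=lambda c: (site_entropy(rows, c), positions[c]))
    children = {nt: build_tree(g, positions) for nt, g in split_by(rows, best).items()}
    return {"site": positions[best], "size": len(rows), "children": children}


# Toy matrix (hypothetical): 4 sequences, variation sites at 100, 200, 300.
tree = build_tree(["AAC", "AAC", "ATC", "GTT"], [100, 200, 300])
print(tree)
```

In this toy matrix, sites 100 and 300 tie for the minimum score, and the earlier site is taken as the first-level branch point.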

The calculation formula of the information entropy is as follows:

wherein Ent(M_j) is the information entropy; k is one of the variation sites other than the site whose information entropy is being calculated; p is one of the nucleotide classes occurring at position k in the sequence set M_j, other than the class with the highest occurrence probability; V'_k is the set of all nucleotide classes occurring at position k in the sequence set M_j, other than the class with the highest occurrence probability; V_k is the set of all nucleotide classes occurring at position k; M_j is the set of all sequences with nucleotide j at the k position; C' is the set of all variation sites other than the site whose information entropy is being calculated; and C is the set of all variation sites. The class excluded from V'_k is the nucleotide class with the highest occurrence probability at position k in the sequence set M_j.

The determination method of the classification decisive sites is a decision tree algorithm, and the formula is as follows:

The algorithm can be run in computer software (such as R language) to obtain the classification of the viral sample genome sequence and the text file of each classification decisive site.

In addition, if at any sorting step several variation sites share the same, minimum information entropy, all of them are identified as classification-decisive sites and retained in the text file of classification-decisive sites. In the subsequent visualization, the decisive site located earliest on the reference sequence is selected for classification.

In step (iv), the method specifically further comprises: visualizing, in Cytoscape software, the text file of classification-decisive sites obtained in step (iii). Information such as the geographical location and collection time of the viral sequences may be added to the input text file to enrich the visualization.

During visualization, if the input text file contains several classification-decisive sites at the same level, the decisive site located earliest on the reference sequence is selected for classification.
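The layout of the text file passed to Cytoscape is not specified in this text. As one possible arrangement, the Python sketch below flattens a nested tree into a tab-separated edge table with source and target columns, which is a kind of delimited file that network tools can generally import. The tree layout, column names and node identifiers are all hypothetical; the toy values reuse the node 17747 (C: 72, T: 17) described for FIG. 3.

```python
def tree_to_edges(node, node_id="root"):
    """Flatten a nested tree into (source, target, site, nucleotide, size) rows."""
    rows = []
    for nt, child in node.get("children", {}).items():
        child_id = f"{node_id}/{node['site']}{nt}"  # path-style id keeps nodes unique
        rows.append((node_id, child_id, node["site"], nt, child["size"]))
        rows.extend(tree_to_edges(child, child_id))
    return rows


def edges_to_tsv(rows):
    """Render the edge rows as tab-separated text with a header line."""
    header = "source\ttarget\tsite\tnucleotide\tsize"
    return "\n".join([header] + ["\t".join(map(str, r)) for r in rows])


# Toy node (hypothetical structure) mirroring the FIG. 3 example.
tree = {"site": 17747, "size": 89, "children": {"C": {"size": 72}, "T": {"size": 17}}}
edges = tree_to_edges(tree)
tsv = edges_to_tsv(edges)
print(tsv)
```

Extra columns such as geographical location or collection time could be appended per row in the same way to enrich the visualization.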

Genome phylogenetic analysis method

The application provides a method for phylogenetic analysis of genomes that is efficient, accurate and widely applicable, comprising the following steps:

(S1) providing a sample to be analyzed;

In another preferred embodiment, in step (S3), when the number N of genome sequences is more than 10,000, sequence alignment is performed by method B, wherein method B is as follows: each sequenced genome sequence is individually aligned with the reference genome (single-sequence alignment). The single-sequence alignment is performed as follows: sequence alignment software is used to align each genome sequence individually with the reference genome, and the single-sequence alignment scripts are run in parallel using the Linux command xargs -P 30.

In a preferred embodiment of the application, the method comprises the steps of:

(i) Sequencing the genome of the virus sample, thereby obtaining a genomic sequence of the virus;

(ii) According to the number of genome sequences obtained in step (i), performing sequence alignment with a reference genome by method A, thereby obtaining a variation site matrix;

(iii) Calculating the information entropy of the variation site matrix, and using a decision tree algorithm to obtain the decisive sites for the viral genome classification; and

(iv) Visualizing the decisive sites, thereby obtaining a phylogenetic tree of the viral genome.

(ii) According to the number of genome sequences obtained in step (i), performing sequence alignment with a reference genome by method B, thereby obtaining a variation site matrix;

Genome phylogenetic analysis system (apparatus)

In the present application, there is also provided a viral genome phylogenetic analysis system (apparatus) comprising:

(M3) a calculation unit, which is connected with the comparison unit and is used for calculating the information entropy of the mutation site matrix and obtaining the decisive sites of genome classification by utilizing a decision tree algorithm; and

The main advantages of the application are:

(1) Compared with a traditional phylogenetic tree, the method does not depend on multiple sequence alignment by conventional software in upstream data processing, so there is no upper limit on the number of sequences analyzed. The method can efficiently and rapidly analyze tens of thousands of viral genome sequences, greatly broadens the applicability of phylogenetic analysis, and is better suited to the massive sequencing output of the big data era.

(2) While displaying the genetic relationships among the viral genomes, the method explains, through the identified classification-decisive sites, why each branch formed during viral evolution, thereby tracking the variation of key sites during viral transmission and laying a solid foundation for epidemiological research and subsequent experimental verification.

(3) The method is suitable not only for finding the decisive sites for viral genome classification, but also for the genomes of other organisms such as bacteria and fungi.

The application is further illustrated below in conjunction with specific examples. It should be understood that these examples are intended to illustrate the application and not to limit its scope. Experimental procedures for which detailed conditions are not given in the following examples generally follow conventional conditions, such as those described in Sambrook et al., Molecular Cloning: A Laboratory Manual (New York: Cold Spring Harbor Laboratory Press, 1989), or those recommended by the manufacturer. Percentages and parts are by weight unless otherwise indicated.

Unless otherwise specified, the materials used in the examples are all commercially available products.

The technical scheme of the application mainly comprises the following three technical steps:

Technical step 1: according to the number of viral genomes to be studied, select a suitable method to obtain the variation site matrix.

Technical step 2: calculate the information entropy of the variation site matrix obtained in technical step 1, and obtain the decisive sites of each viral genome classification using a decision tree algorithm.

Technical step 3: visualize the decisive sites of viral genome classification using Cytoscape software.

Example 1

Selecting a suitable method to obtain the variation site matrix according to the number of viral genomes to be studied

1.1 Multiple sequence alignment using existing software.

This method is suitable when the number of sequences is not large (usually fewer than 10,000), so that multiple sequence alignment is feasible. It comprises the following steps:

(1) The nucleotide sequence of the full length or fragment of the viral genome is obtained.

(2) Quality control is performed on the sequences to remove low-quality sequences, typically those shorter than 90% of the length of the reference sequence. For example: of 323 full-length novel coronavirus genome sequences in total, since the novel coronavirus reference sequence is 29903 bp long, sequences shorter than 26912 bp are removed, leaving 317 sequences to be analyzed.

(3) Multiple sequence alignment is performed on the sequences to be analyzed. The nucleotide sequences are aligned with sequence alignment software such as MEGA ^[1] or Clustal Omega ^[2] using default parameters. In this step, the reference genome sequence must be placed first among the sequences to be aligned so that mutation sites can be located conveniently afterwards (see the schematic in FIG. 1 (1)).

(4) The head and tail fragments of the aligned sequences are trimmed so that, after cleaning, no sequence starts or ends with "-" (see the schematic in FIG. 1 (1)).

(5) Fragments with internal anomalies are cleaned after alignment. Aligned sites containing "N" are deleted from all sequences to be analyzed, giving an m×a matrix (see the schematic in FIG. 1 (1)), where m is the number of sequences and a is the length of the genome fragment remaining after cleaning. For example: after the 317 quality-controlled novel coronavirus sequences, each about 30000 bp long, are aligned and cleaned by removing the head, tail and abnormal fragments, the remaining genome fragment is 28990 bp long, giving a 317×28990 matrix. The row names are the names of the sequences to be analyzed, and the column names are the corresponding nucleotide positions in the reference genome.

(6) The mutation site matrix is extracted. Mutation sites are extracted from the m×a matrix using MEGA software ^[1]. A mutation site is a position in the reference genome sequence at which the nucleotides in the column are not all identical (see the schematic in FIG. 1 (1)). If there are n mutation sites in the genome fragment of length a, the resulting mutation site matrix has size m×n. For example: if there are 174 mutation sites among the 28990 nucleotide sites of the 317 sequences to be analyzed, these 174 columns are extracted and combined into a new mutation site matrix of size 317×174.
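Steps (2) and (4) to (6) of section 1.1 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the procedure of the application (which relies on MEGA): the alignment is assumed to be already available as equal-length strings, all function names are hypothetical, and columns are labeled by 1-based alignment column index rather than by reference-genome position.

```python
def passes_length_qc(seq, ref_len):
    """Step (2): keep a sequence whose length is at least 90% of the reference
    length (integer threshold, e.g. 26912 for a 29903 bp reference)."""
    return len(seq) >= ref_len * 9 // 10


def clean_columns(aligned):
    """Steps (4)-(5): trim head/tail columns containing '-' and drop columns
    containing 'N'. Returns (1-based column labels, cleaned rows)."""
    n_cols = len(aligned[0])
    has_gap = [any(row[c] == "-" for row in aligned) for c in range(n_cols)]
    start = 0
    while start < n_cols and has_gap[start]:
        start += 1
    end = n_cols
    while end > start and has_gap[end - 1]:
        end -= 1
    kept = [c for c in range(start, end)
            if all(row[c] != "N" for row in aligned)]
    rows = ["".join(row[c] for c in kept) for row in aligned]
    return [c + 1 for c in kept], rows


def variable_sites(positions, rows):
    """Step (6): keep only columns whose nucleotides are not all identical."""
    keep = [i for i in range(len(positions))
            if len({row[i] for row in rows}) > 1]
    return ([positions[i] for i in keep],
            ["".join(row[i] for i in keep) for row in rows])


# Toy alignment (hypothetical data); the first row plays the role of the reference.
aligned = ["ACGTACG", "-CGTACG", "ACGTTCG", "ACNTAC-"]
positions, rows = clean_columns(aligned)
sites, matrix = variable_sites(positions, rows)
# sites == [5]; matrix == ["A", "A", "T", "A"]
```

In the toy alignment, columns 1 and 7 are trimmed because of "-", column 3 is dropped because of "N", and only column 5 varies, so the 4×7 alignment reduces to a 4×1 mutation site matrix.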

1.2 Pairwise sequence alignment and merging of mutation site matrices using scripts

This method is suitable when the number of sequences is too large (for example, more than 10,000) for multiple sequence alignment to be practical; in that case, multiple sequence alignment with sequence alignment software takes more than one day. Using scripts combined with multi-process parallel computation on a Linux system greatly increases the analysis speed.

(2) Quality control is performed on the sequences, removing low quality sequences, typically sequences with a length < 90% of the length of the reference sequence.

(3) Each sequence to be analyzed is aligned pairwise with the reference sequence. Using sequence alignment software such as Clustal Omega ^[2] together with a script, each sequence is aligned pairwise with the reference sequence, and multi-process parallel computation on the Linux system shortens the alignment time. For example: aligning the 49219 quality-controlled novel coronavirus sequences pairwise with the reference sequence one after another takes 17 days; running the pairwise alignment script in parallel with the Linux command xargs -P 30 shortens the time by a factor of 30, so that the step completes within 14 hours.

(4) The mutation site sub-matrices are obtained. A script extracts a mutation site sub-matrix from the alignment result of each sequence. For example: if there are x mutation sites between a novel coronavirus sequence and the reference sequence, the sub-matrix has size 1×x; its row contains the nucleotides of the sequence at the x mutation sites, and its column names are the positions of the mutation sites on the reference sequence.

(5) The mutation site sub-matrices are merged to obtain the mutation site matrix. A script merges the sub-matrices obtained by aligning each sequence with the reference sequence (see the schematic in FIG. 1 (2)); that is, the union of all sub-matrices is taken according to the positions of the mutation sites on the reference sequence (the column names of the sub-matrices), so that the nucleotide in row a, column b of the mutation site matrix is the nucleotide of the a-th sequence to be analyzed at the mutation site of column b.

(6) The mutation site matrix is cleaned. Starting from the first column and from the last column, columns containing "-" are removed until neither the first nor the last column of the cleaned matrix contains "-". Columns containing "N" are then deleted. This yields the final m×n mutation site matrix, where m is the number of sequences and n is the number of mutation sites.
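Steps (4) and (5) of section 1.2 can be sketched in Python as follows. This is a hedged illustration, not the script of the application: the function names are hypothetical, each sequence is assumed to be already aligned to the reference in the same coordinates (equal length, no insertions relative to the reference), and a sequence with no recorded difference at a merged position is assumed to carry the reference nucleotide there.

```python
def diff_sites(ref, seq):
    """Step (4): 1×x sub-matrix for one sequence, stored as
    {1-based reference position: nucleotide} for positions where seq differs."""
    return {i + 1: s for i, (r, s) in enumerate(zip(ref, seq)) if s != r}


def merge_submatrices(ref, submatrices):
    """Step (5): take the union of all mutation positions and fill each row.

    `submatrices` maps sequence name -> sub-matrix from diff_sites().
    Assumption: where a sequence has no recorded difference, it matches
    the reference, so the reference nucleotide is filled in.
    """
    positions = sorted(set().union(*(sub.keys() for sub in submatrices.values())))
    matrix = {name: [sub.get(p, ref[p - 1]) for p in positions]
              for name, sub in submatrices.items()}
    return positions, matrix


# Toy data (hypothetical): three sequences compared with a 6 bp reference.
ref = "ACGTAC"
seqs = {"s1": "ATGTAC", "s2": "ACGTCC", "s3": "ATGTCC"}
subs = {name: diff_sites(ref, s) for name, s in seqs.items()}
positions, matrix = merge_submatrices(ref, subs)
# positions == [2, 5]
# matrix == {"s1": ["T", "A"], "s2": ["C", "C"], "s3": ["T", "C"]}
```

Because each sub-matrix depends on only one sequence and the reference, the per-sequence step is independent and can be distributed over parallel processes, which is what makes the pairwise approach scale.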

Example 2

Calculating information entropy and utilizing decision tree algorithm to obtain decisive sites for virus genome classification

This step is derived from the ID3 algorithm ^[3], a commonly used decision tree algorithm. Each column is treated as a feature, and an information entropy is calculated for each feature; the site with the minimum information entropy is then selected as the classification-decisive site, which maximizes the information gain, that is, the degree to which the classification uncertainty of the mutation site matrix is reduced.

The method comprises the following steps:

(1) The classified information entropy is calculated for each column of the mutation site matrix. Unlike conventional information entropy calculations, the sequences to be analyzed carry no labels, so the information entropy is defined by the purity of the nucleotide classes at the other sites. The specific formula is as follows:

$$\mathrm{Ent}(M_j)=\sum_{k\in C,\ k\neq c} M_j^{k},\qquad M_j^{k}=|M_j|-\bigl|\{s\in M_j : s_k=p\}\bigr|$$

where c is the site whose information entropy is to be calculated, C is the set of all mutation sites, M_j is the set of all sequences with nucleotide j at site c, s_k is the nucleotide of sequence s at site k, and p is the nucleotide class with the highest probability of occurrence at site k within M_j. M_j^k, the number of sequences remaining in M_j after the sequences carrying p at site k are removed, is the information entropy of M_j at site k; the larger it is, the less pure the sample set is at site k. Summing these values over all sites other than c gives the information entropy of the sequence set with nucleotide j at site c, namely Ent(M_j).

Examples are as follows:


For site 1, the sequences can be split into two sets, M_A and M_T.

M_A is:

Site	1	2	3	4	5
Seq1	A	G	G	T	T
Seq2	A	G	C	T	A

M_T is:

Site	1	2	3	4	5
Seq3	T	G	G	T	A
Seq4	T	G	G	A	A
Seq5	T	C	C	A	A
Seq6	T	C	G	A	T

$$\mathrm{Ent}(M_A)=M_A^{2}+M_A^{3}+M_A^{4}+M_A^{5}=0+1+0+1=2$$

$$\mathrm{Ent}(M_T)=M_T^{2}+M_T^{3}+M_T^{4}+M_T^{5}=2+1+1+1=5$$

M_T^3 = 1 means that, in the sequence set M_T, the number of sequences at site 3 other than those carrying the most probable nucleotide class (G in this case) is 1 (namely Seq5). When two classes are equally probable, as G and C are at site 2 in the case of M_T^2, either G or C may be taken as the most probable nucleotide class of M_j at that site, which has no influence on the calculation result.
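The worked example above can be checked with a short Python sketch. The function names are hypothetical and the code is only one straightforward reading of the definition (count the sequences that do not carry the most common nucleotide at each other site, then sum); the application itself states that its implementation is in R.

```python
from collections import Counter


def site_impurity(rows, k):
    """M_j^k: number of sequences in `rows` that do not carry the most
    common nucleotide at 0-based column k."""
    counts = Counter(row[k] for row in rows)
    return len(rows) - max(counts.values())


def subset_entropy(rows, col):
    """Ent(M_j): sum of site_impurity over every column except `col`,
    the 0-based column that was used to form the subset."""
    return sum(site_impurity(rows, k)
               for k in range(len(rows[0])) if k != col)


# Subsets obtained by splitting the six example sequences on site 1 (column 0).
M_A = ["AGGTT", "AGCTA"]                      # Seq1, Seq2
M_T = ["TGGTA", "TGGAA", "TCCAA", "TCGAT"]    # Seq3 .. Seq6

ent_A = subset_entropy(M_A, 0)   # 0 + 1 + 0 + 1 = 2
ent_T = subset_entropy(M_T, 0)   # 2 + 1 + 1 + 1 = 5
```

Taking `max(counts.values())` makes the tie at site 2 of M_T (two G, two C) irrelevant, in line with the remark that either class may be treated as the most probable one.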

(2) The decisive site for sequence classification is determined. The formula is as follows:

$$col=\underset{c\in C}{\arg\min}\ \sum_{j\in V_{c}}\mathrm{Ent}(M_j)$$

where C is the set of all mutation sites, V_c is the set of all nucleotide classes present at site c, the site whose information entropy is to be calculated, and M_j is the set of all sequences whose nucleotide at site c is j.

The classified information entropy is calculated for each column of the mutation site matrix, and the information entropies of all classes are summed to give the information entropy of the site currently under calculation. The site with the minimum information entropy is selected as the decisive site (determinant site) for sequence classification; on the decisive-site network, it is the optimal site for separating the branches of the evolutionary tree.

In the example of (1), the sum of the information entropies of site 1 after classification by A and T (i.e., the information entropy of site 1) is Ent(M_A) + Ent(M_T) = 2 + 5 = 7. Computed in the same way, the information entropies of sites 2, 3, 4 and 5 are 7, 9, 6 and 9, respectively. Since 6 is the minimum information entropy, site 4 is the classification-decisive site of these 6 sequences, and the sequence matrix is divided by site 4 into the following two subsets:

Site	1	2	3	4	5
Seq1	A	G	G	T	T
Seq2	A	G	C	T	A
Seq3	T	G	G	T	A

Site	1	2	3	4	5
Seq4	T	G	G	A	A
Seq5	T	C	C	A	A
Seq6	T	C	G	A	T
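The selection of site 4 in this example can be reproduced with the following Python sketch. The function names are hypothetical, and breaking ties by taking the lowest-numbered site is an assumption made here for illustration only.

```python
from collections import Counter, defaultdict


def site_entropy(rows, col):
    """Information entropy of 0-based column `col`: split the sequences by
    their nucleotide at `col`, then sum, over every subset and every other
    column, the number of sequences not carrying the most common nucleotide."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row)
    total = 0
    for members in groups.values():
        for k in range(len(rows[0])):
            if k != col:
                counts = Counter(m[k] for m in members)
                total += len(members) - max(counts.values())
    return total


def decisive_site(rows):
    """Return (1-based site with minimum entropy, list of all site entropies).
    Ties go to the lowest-numbered site (an assumption of this sketch)."""
    scores = [site_entropy(rows, c) for c in range(len(rows[0]))]
    return scores.index(min(scores)) + 1, scores


# The six example sequences Seq1 .. Seq6.
rows = ["AGGTT", "AGCTA", "TGGTA", "TGGAA", "TCCAA", "TCGAT"]
best, scores = decisive_site(rows)
# scores == [7, 7, 9, 6, 9]; best == 4
```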

(3) The information entropy of each column is then calculated within each subset obtained by the classification, and the classification-decisive site of each subset is obtained; that is, steps (1) and (2) are performed recursively.

(4) Step (3) is repeated until the recursive return condition is reached, namely that all sequences contained in the current subset are identical.

(5) The above algorithm is implemented in the R language ^[4], yielding a text file consisting of the virus sequence classification and the decisive sites on which the classification is based. The decisive site is the mutation site col of the formula.
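The recursion of steps (1) to (4) can be sketched in Python as follows. This is an illustrative sketch, not the R implementation referred to in the application: the function names and the nested-dictionary output are hypothetical, candidate sites are restricted to columns that still vary within the current subset (so every split makes progress), and ties are broken by the lowest-numbered site.

```python
from collections import Counter, defaultdict


def split_by(seqs, col):
    """Group {name: sequence} by the nucleotide at 0-based column `col`."""
    groups = defaultdict(dict)
    for name, s in seqs.items():
        groups[s[col]][name] = s
    return groups


def site_entropy(seqs, col):
    """Sum over subsets and other columns of the non-majority sequence counts."""
    width = len(next(iter(seqs.values())))
    total = 0
    for members in split_by(seqs, col).values():
        for k in range(width):
            if k != col:
                counts = Counter(s[k] for s in members.values())
                total += len(members) - max(counts.values())
    return total


def build_tree(seqs):
    """Recursively split on the minimum-entropy site until all sequences
    in the current subset are identical (the recursive return condition)."""
    if len(set(seqs.values())) == 1:
        return {"names": sorted(seqs)}
    width = len(next(iter(seqs.values())))
    candidates = [c for c in range(width)
                  if len({s[c] for s in seqs.values()}) > 1]
    best = min(candidates, key=lambda c: site_entropy(seqs, c))
    return {"site": best + 1,
            "children": {nuc: build_tree(sub)
                         for nuc, sub in split_by(seqs, best).items()}}


def leaves(tree):
    """Collect the name lists stored at the leaves of the tree."""
    if "names" in tree:
        return [tree["names"]]
    return [leaf for child in tree["children"].values() for leaf in leaves(child)]


seqs = {"Seq1": "AGGTT", "Seq2": "AGCTA", "Seq3": "TGGTA",
        "Seq4": "TGGAA", "Seq5": "TCCAA", "Seq6": "TCGAT"}
tree = build_tree(seqs)
# tree["site"] == 4: the root split uses site 4, as in the worked example.
```

Since the six example sequences are pairwise distinct, the recursion ends with six leaves of one sequence each; identical sequences would end up together in a single leaf.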

Example 3

Visualization of determining sites for viral genome classification using Cytoscape software

The resulting text file, consisting of the viral genome classification and the decisive sites on which the classification is based, is imported into Cytoscape software for network visualization (see the Cytoscape GitHub page: https://github.com/Cytoscape/Cytoscape-tutorials/wiki), as shown in FIG. 3. Information such as the geographic location and collection time of the virus sequences can be added to the input text file to enrich the visualization.

If several candidate classification-decisive sites have the same information entropy in the information entropy calculation step, all of them are listed in the output text file. When drawing the network with Cytoscape, the decisive site located earliest on the reference sequence is selected for classification.
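The application does not specify the layout of the text file passed to Cytoscape. As one hypothetical possibility, a classification tree could be flattened into a tab-separated edge table, which Cytoscape can import as a network once the source and target columns are assigned. In the sketch below, the nested-dictionary tree format, the column names and the node naming scheme are all assumptions made for illustration.

```python
import csv
import io


def tree_to_edges(tree, parent="root"):
    """Yield (source, target, site, nucleotide) rows for a tree of the form
    {"site": int, "children": {nucleotide: subtree}} with leaves
    {"names": [sequence names]} (a hypothetical format)."""
    if "names" in tree:  # the whole tree is a single leaf
        for name in tree["names"]:
            yield (parent, name, "", "")
        return
    for nuc, child in tree["children"].items():
        if "names" in child:
            for name in child["names"]:
                yield (parent, name, tree["site"], nuc)
        else:
            node = f"{parent}|{tree['site']}{nuc}"  # internal node label
            yield (parent, node, tree["site"], nuc)
            yield from tree_to_edges(child, node)


def edges_to_tsv(rows):
    """Serialize edge rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target", "site", "nucleotide"])
    writer.writerows(rows)
    return buf.getvalue()


# Small hypothetical tree: split on site 4, then on site 2 within branch A.
example = {"site": 4, "children": {
    "T": {"names": ["Seq1"]},
    "A": {"site": 2, "children": {"G": {"names": ["Seq4"]},
                                  "C": {"names": ["Seq5"]}}}}}
edge_rows = list(tree_to_edges(example))
tsv_text = edges_to_tsv(edge_rows)
```

Each edge carries the decisive site and the nucleotide that define the branch, so these values can be mapped to edge labels; further columns such as geographic location or collection time could be appended per sequence in the same table.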

References:

[1] Huson, D. H., Nettles, S. M., & Warnow, T. J. (1999). Disk-covering, a fast-converging method for phylogenetic tree reconstruction. Journal of computational biology: a journal of computational molecular cell biology, 6(3-4), 369–386.

[2] Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4), 406–425.

[3] Goodman, M., Pedwaydon, J., Czelusniak, J., Suzuki, T., Gotoh, T., Moens, L., Shishikura, F., Walz, D., & Vinogradov, S. (1988). An evolutionary tree for invertebrate globin sequences. Journal of molecular evolution, 27(3), 236–249.

[4] Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of molecular evolution, 17(6), 368–376.

All documents mentioned in this application are incorporated by reference as if each were individually incorporated by reference. Further, it will be appreciated that those skilled in the art may make various changes and modifications after reading the above teachings, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

Claims

1. A method for phylogenetic analysis of viral genomes, the method comprising the steps of:

(S1) providing a sample to be analyzed;

2. The method of claim 1, wherein the decision tree algorithm has a calculation formula as shown in formula I:

3. The method of claim 2, wherein the information entropy is calculated according to formula II:

wherein,

4. The method of claim 1, wherein the sample is from a virus to be analyzed that has been identified as being from the same species as the reference genome in terms of virus taxonomy.

5. The method of claim 1, wherein the sequence alignment is performed using sequence alignment software selected from the group consisting of: MEGA, Clustal Omega, MAFFT, ClustalW, NCBI BLAST, or combinations thereof.

6. The method of claim 1, wherein in step (S3), when the number N of genomic sequences is less than 10,000, sequence alignment is performed using method A, wherein method A is: performing multiple sequence alignment of all the determined genomic sequences together with a reference genome.

7. The method of claim 1, wherein in step (S3), when the number N of genomic sequences is more than 10,000, sequence alignment is performed using method B, wherein method B is: aligning each determined genomic sequence individually with the reference genome (single sequence alignment).

8. The method of claim 7, wherein the single sequence alignment is performed as follows: each genomic sequence is aligned individually with the reference genome using sequence alignment software, and the single sequence alignment script is run in parallel using the Linux command xargs -P 30.

9. A system for phylogenetic analysis of viral genomes, comprising:

10. Use of a method for phylogenetic analysis of viral genomes, characterized in that it comprises: (i) building a virus phylogenetic tree; (ii) epidemiological analysis of infectious diseases; and (iii) prevention and control of infectious diseases.