CN116312796B

CN116312796B - Metagenome abundance estimation method and system based on expectation maximization algorithm

Info

Publication number: CN116312796B
Application number: CN202310103910.1A
Authority: CN
Inventors: 马昀然; 刘俊锋; 郭昊; 李诗濛; 任用
Original assignee: Beijing Xiansheng Medical Examination Laboratory Co ltd; Jiangsu Xiansheng Diagnostic Technology Co ltd; Jiangsu Xiansheng Medical Diagnosis Co ltd
Current assignee: Beijing Xiansheng Medical Examination Laboratory Co ltd; Jiangsu Xiansheng Diagnostic Technology Co ltd; Jiangsu Xiansheng Medical Diagnosis Co ltd
Priority date: 2022-12-27
Filing date: 2023-02-07
Publication date: 2023-11-14
Anticipated expiration: 2043-02-07
Also published as: CN116312796A

Abstract

The application belongs to the technical field of bioinformatics, and particularly relates to a metagenome abundance estimation method based on the expectation maximization (EM) algorithm. Based on the alignment information of metagenome sequencing data, the method introduces the proportion of unique alignment positions on each reference genome to account for the influence of similarity between species reference genomes, and constructs a statistical model that characterizes the occurrence probability of the observed unique alignments and multiple alignments. The constructed statistical model is then solved with the EM algorithm to estimate species abundance in the metagenome. Because the application quantifies the influence of species reference genome similarity, it can improve the accuracy of metagenome abundance estimation at the species level and improve the sensitivity and specificity of metagenome species identification.

Description

Metagenome abundance estimation method and system based on expectation maximization algorithm

Technical Field

The application belongs to the technical field of bioinformatics, and particularly relates to a metagenome abundance estimation method and system based on an expectation maximization algorithm.

Background Art

Metagenomic next-generation sequencing (mNGS) technology has broad application prospects in clinical microbiology, and a variety of computational methods based on metagenomic sequencing data have been developed for rapid identification of pathogens in clinical samples. Among them, the metagenome classification algorithm Centrifuge achieves rapid classification of metagenome sequencing data based on the BWT (Burrows-Wheeler transform) and the FM (Ferragina-Manzini) index, and uses a smaller index space. Centrifuge also uses the EM (Expectation-Maximization) algorithm to estimate species abundance in metagenomic sequencing data. In addition, Centrifuge has a wide range of application scenarios, since it can analyze both short-read and long-read sequences. Although Centrifuge performs well in species identification, it is less effective at identifying low-abundance species. Compared with mapping-based methods, the read matching precision of Centrifuge is low, which affects the abundance estimates and leads to a high false positive rate.

Based on the alignment results of metagenome sequencing data, metagenome classifiers mainly adopt two strategies to assign sequencing reads that have multiple alignment results. The first strategy assigns reads directly according to the number of uniquely aligned reads of each species: when read i aligns to both species A and species B, and species A has more uniquely aligned reads than species B, read i is assigned to species A with a greater probability, or directly to species A. The second strategy constructs a probability model to characterize the alignment results of the metagenome sequencing data, and estimates the sequence abundance or species abundance of each species by solving the probability model; the representative algorithms are Centrifuge and Bracken. Compared with the first strategy, the second strategy can give a probabilistic interpretation of the assignment results. However, both Centrifuge and Bracken are deficient in how they construct the probability model. The shortcomings of the probability model constructed by Centrifuge are: (1) uniquely aligned and multiply aligned sequencing data are not treated differently for different species; (2) the effect of genome similarity on uniquely aligned sequencing data is not characterized. The shortcomings of the probability model constructed by Bracken are: (1) it only redistributes the alignment results of Kraken; (2) model solving is not based entirely on the observed sample; (3) the effect of genome similarity on uniquely aligned sequencing data is not directly characterized.
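To make the first strategy concrete, the following is a minimal Python sketch of its deterministic variant, in which a multiply aligned read goes to the candidate species with the most uniquely aligned reads. The function name `assign_by_unique_counts`, the input layout (one set of species per read), and the alphabetical tie-break are illustrative assumptions, not taken from any particular classifier.

```python
from collections import Counter

def assign_by_unique_counts(reads):
    """Assign each read to one species (hypothetical helper).

    reads: list of sets; each set holds the species a read aligns to.
    Returns a Counter mapping species -> number of reads assigned.
    """
    # Count reads that align to exactly one species.
    unique_counts = Counter(next(iter(hits)) for hits in reads if len(hits) == 1)
    assigned = Counter(unique_counts)
    for hits in reads:
        if len(hits) > 1:
            # Give the multiply aligned read to the candidate with the most
            # unique reads; ties are broken alphabetically (an assumption).
            best = max(sorted(hits), key=lambda s: unique_counts[s])
            assigned[best] += 1
    return assigned

example = [{"A"}, {"A"}, {"B"}, {"A", "B"}]
print(assign_by_unique_counts(example))
```

In this toy input the read shared by A and B is given to A, because A has two uniquely aligned reads and B has one. The assignment carries no probabilistic interpretation, which is the limitation the second strategy addresses.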

In view of this, the present application has been proposed.

Disclosure of Invention

To solve the above technical problems, the application, based on the alignment information of metagenome sequencing data, introduces the proportion of unique alignment positions on the reference genome to quantify the influence of similarity between species reference genomes on the alignment information, and constructs a statistical model that characterizes the occurrence probability of the observed unique alignments and multiple alignments. The proportion of unique alignment positions of each reference genome is calculated from the alignment information of the metagenome sequencing data. The constructed statistical model is solved using the Expectation-Maximization (EM) algorithm to estimate species abundance in the metagenome, with the reference genome length introduced in the expectation step (E-step).

Therefore, the core objective of the application is to provide a metagenomic abundance assessment method and system.

In order to achieve the above purpose, the present application proposes the following technical scheme:

The application first provides a metagenome abundance assessment method, which comprises the following steps:

1) Obtaining the proportion of unique alignment positions of a reference genome;

2) Obtaining the occurrence probability of unique alignment and multiple alignment of sequencing data;

3) Assessing metagenomic species abundance using an expectation maximization algorithm.

Further, step 1) is carried out as follows: based on the alignment information of the metagenome sequencing data, the number of uniquely aligned sequencing reads and the number of multiply aligned sequencing reads on the reference genome are counted, and the ratio of the number of sequencing reads uniquely aligned to the reference genome to the number of all sequencing reads aligned to the reference genome is calculated.

Further, step 2) is carried out as follows: based on the proportion of unique alignment positions of the reference genome, the influence of similarity between species reference genomes on the alignment information is quantified, and a statistical model is constructed to characterize the occurrence probability of the observed unique alignments and multiple alignments.

Further, the statistical model is:

wherein,

$R$ is the number of metagenomic sequencing reads,

$S$ is the number of species in the metagenome,

$\alpha_j$ and $\alpha_k$ are the abundances of species j and k, respectively, which are the parameters to be estimated,

$l_j$ and $l_k$ are the average genome lengths of species j and k, respectively,

$C_{ij}$ is the probability that sequencing read i is aligned to species j: when sequencing read i is aligned only to species j, the probability is equal to $P_j$, where $P_j$ is the proportion of unique alignment positions on reference genome j; when sequencing read i is multiply aligned to species j, the probability is equal to $1-P_j$; when sequencing read i is not aligned to species j, the probability is equal to 0.
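The model formula itself is not reproduced in this text. As a hedged reconstruction only: the listed symbols match those of the likelihood used by Centrifuge, so, assuming the model keeps that form with $C_{ij}$ redefined as above (the symbol $\alpha$ for abundance is likewise an assumption), it would read:

```latex
L(\alpha \mid C) = \prod_{i=1}^{R} \sum_{j=1}^{S}
  \frac{\alpha_j \, l_j}{\sum_{k=1}^{S} \alpha_k \, l_k} \, C_{ij}
```

Under this reading, the fraction is the share of sequence contributed by species j, and $C_{ij}$ weights it by how informative the alignment of read i to species j is.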

Further, step 3) specifically includes:

solving the statistical model constructed in step 2) with the expectation maximization algorithm to estimate species abundance in the metagenome.

Further, the solving specifically includes: introducing the reference genome length in the expectation step (E-step) to calculate the number $n_j$ of sequencing reads from species j, by the formula:

then updating the abundance of species j based on $n_j$, by the formula:
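The two formulas are not reproduced in this text. As a hedged sketch only: if the model is read as a mixture in which a read comes from species j with weight $\alpha_j l_j / \sum_k \alpha_k l_k$ (an assumption modeled on the Centrifuge likelihood; the symbols $\alpha_j$ and $\alpha'_j$ are also assumed), the standard EM updates take the following form, with the genome length appearing in the expectation step as described above:

```latex
n_j = \sum_{i=1}^{R}
  \frac{\alpha_j \, l_j \, C_{ij}}{\sum_{k=1}^{S} \alpha_k \, l_k \, C_{ik}},
\qquad
\alpha'_j = \frac{n_j / l_j}{\sum_{k=1}^{S} n_k / l_k}
```

Here $n_j$ is the expected number of reads attributed to species j, and dividing by $l_j$ in the update converts read counts into genome-copy proportions.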

Further, the metagenome in the above method is an infectious metagenome.

The application also provides a metagenome abundance estimation system, which comprises the following components:

Component 1): a component for calculating the proportion of unique alignment positions of the reference genome;

Component 2): a component for computing the occurrence probability of unique alignment and multiple alignment of sequencing data;

Component 3): a component for assessing metagenomic species abundance.

Further, component 1) operates as follows: based on the alignment information of the metagenome sequencing data, the number of uniquely aligned sequencing reads and the number of multiply aligned sequencing reads on the reference genome are counted, and the ratio of the number of sequencing reads uniquely aligned to the reference genome to the number of all sequencing reads aligned to the reference genome is calculated.

Further, component 2) operates as follows: based on the proportion of unique alignment positions of the reference genome, the influence of similarity between species reference genomes on the alignment information is quantified, and a statistical model is constructed to characterize the occurrence probability of the observed unique alignments and multiple alignments.

Further, the statistical model is:

wherein,

$R$ is the number of metagenomic sequencing reads,

$S$ is the number of species in the metagenome,

Further, component 3) specifically:

estimates species abundance in the metagenome by solving the statistical model constructed by component 2) with the expectation maximization algorithm.

The abundance of species j is then updated based on $n_j$, by the formula:

Further, the metagenome in the above system is an infectious metagenome.

The present application also provides an electronic device including: a processor and a memory; the processor is connected to a memory, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program to perform the method according to any of the preceding claims.

The present application also provides a computer storage medium storing a computer program comprising program instructions which, when executed by a processor, perform a method as claimed in any one of the preceding claims.

The application has the beneficial technical effects that:

The application quantifies the influence of species reference genome similarity on the alignment information, and treats uniquely aligned and multiply aligned sequencing data differently for different species, thereby improving the accuracy of metagenome abundance estimation at the species level and providing necessary technical support for improving the sensitivity and specificity of metagenome species identification.

Drawings

FIG. 1, flow chart of the metagenome abundance estimation method based on the expectation maximization algorithm of the present application;

FIG. 2, correlation analysis of abundance estimates and true values for different methods;

FIG. 3, correlation analysis of abundance estimates and true values for different methods (5% outlier removal).

Detailed Description

Embodiments of the present application will be described in detail below with reference to examples, but those skilled in the art will understand that the following examples only illustrate the present application and should not be construed as limiting its scope. Where specific conditions are not noted in the examples, the procedures are carried out under conventional conditions or the conditions recommended by the manufacturer. Reagents or apparatus for which no manufacturer is indicated are conventional, commercially available products.

Definitions of some terms: unless defined otherwise below, all technical and scientific terms used in the detailed description of the application are intended to have the same meaning as commonly understood by one of ordinary skill in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present application.

The term "about" in the present application means a range of accuracy that one skilled in the art can understand while still guaranteeing the technical effect of the features in question. The term generally means a deviation of ±10%, preferably ±5%, from the indicated value.

As used herein, the terms "comprising," "including," "having," "containing," or "involving" are inclusive or open-ended and do not exclude additional unrecited elements or method steps. The term "consisting of …" is considered to be a preferred embodiment of the term "comprising". If a certain group is defined below to contain at least a certain number of embodiments, this should also be understood to disclose a group that preferably consists of only these embodiments.

Furthermore, the terms first, second, third, (a), (b), (c), and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the application described herein are capable of operation in other sequences than described or illustrated herein.

The application is illustrated below in connection with specific embodiments.

Example 1 Establishment of the estimation method

As shown in FIG. 1, the metagenome abundance estimation method based on the expectation maximization algorithm provided by the application comprises the following steps:

S1, sequencing data alignment: select a suitable alignment tool and alignment database, perform alignment analysis on the metagenome sequencing data, and output the alignment information of each sequencing read. The alignment information must at least contain the identifier of each reference genome that the read aligns to, and must distinguish unique alignments from multiple alignments. For multiply aligned sequencing reads, the identifiers of all aligned reference genomes should be included.

S2, calculating the proportion of unique alignment positions of the reference genome: according to the alignment information of the metagenome sequencing data, count the number of uniquely aligned sequencing reads and the number of multiply aligned sequencing reads on reference genome j (j = 1, 2, ..., n, where n is the number of reference genomes), and then calculate the proportion of unique alignment positions on reference genome j, namely the ratio of the number of sequencing reads uniquely aligned to reference genome j to the number of all sequencing reads aligned to reference genome j.
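Steps S1 and S2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the alignment information is taken to be already parsed into a mapping from read identifier to the set of reference genome identifiers it aligns to, and the function name `unique_position_ratio` and the identifiers `gA`, `gB` are hypothetical.

```python
from collections import defaultdict

def unique_position_ratio(alignments):
    """Compute P_j for every reference genome (hypothetical helper).

    alignments: dict mapping read id -> set of reference genome ids hit.
    Returns dict mapping genome id -> uniquely aligned reads / all aligned reads.
    """
    unique = defaultdict(int)   # reads aligned to this genome only
    total = defaultdict(int)    # all reads aligned to this genome
    for hits in alignments.values():
        for genome in hits:
            total[genome] += 1
            if len(hits) == 1:
                unique[genome] += 1
    return {genome: unique[genome] / total[genome] for genome in total}

alignments = {
    "read1": {"gA"},
    "read2": {"gA"},
    "read3": {"gA", "gB"},   # multiply aligned read
    "read4": {"gB"},
}
ratios = unique_position_ratio(alignments)
print(ratios)
```

For this toy input, two of the three reads hitting `gA` are unique, and one of the two reads hitting `gB` is unique, so a genome that shares more sequence with its neighbours ends up with a lower ratio.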

S3, calculating the probability of unique alignment and multiple alignment of the sequencing data:

Based on the proportion of unique alignment positions of the reference genome, the influence of similarity between species reference genomes on the alignment information is quantified, and a statistical model is constructed to characterize the occurrence probability of the observed unique alignments and multiple alignments.

The statistical model is as follows:

$R$ is the number of metagenomic sequencing reads,

$S$ is the number of species in the metagenome,

$C_{ij}$ is the probability that sequencing read i is aligned to species j: if sequencing read i is aligned only to reference genome j, the probability is $P_j$; if it is multiply aligned to reference genome j, the probability is $1-P_j$. $P_j$ is the proportion of unique alignment positions of reference genome j (see step S2 for its calculation).
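The definition of $C_{ij}$ translates directly into a lookup. The sketch below assumes each read is represented by the set of reference genomes it aligns to and that the proportions $P_j$ are available as a dictionary; the names `alignment_probability`, `build_c_matrix`, and the example values are hypothetical.

```python
def alignment_probability(hits, genome, p_unique):
    """Return C_ij for one read and one genome (hypothetical helper).

    hits: set of genome ids the read aligns to.
    p_unique: dict mapping genome id -> P_j.
    """
    if genome not in hits:
        return 0.0                      # read not aligned to this genome
    if len(hits) == 1:
        return p_unique[genome]         # unique alignment: P_j
    return 1.0 - p_unique[genome]       # multiple alignment: 1 - P_j

def build_c_matrix(reads, genomes, p_unique):
    """Build the R x S matrix of C_ij values as nested lists."""
    return [[alignment_probability(hits, g, p_unique) for g in genomes]
            for hits in reads]

p_unique = {"gA": 0.8, "gB": 0.5}       # illustrative values only
reads = [{"gA"}, {"gA", "gB"}, {"gB"}]
C = build_c_matrix(reads, ["gA", "gB"], p_unique)
print(C)
```

A multiply aligned read therefore contributes less evidence to a genome whose alignments are mostly unique (high $P_j$) than to one whose alignments are mostly shared.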

S4, estimating metagenome species abundance with the EM algorithm: based on the alignment information of the metagenome sequencing data, species abundance is estimated as follows:

1) Initialization step (I-step): the initial value of the abundance of species j is:

$S$: the number of species in the metagenome

2) Expectation step (E-step):

$n_j$: the number of sequencing reads from species j;

$C_{ij}$: the probability that sequencing read i is from species j; if read i is not aligned to species j, the probability is 0; if read i is uniquely or multiply aligned to species j, the probability is as given in step S3;

3) Maximization step (M-step): update the abundance of species j:

the updated abundance of species j

The EM algorithm stops iterating when the difference between the species abundance estimates before and after the update satisfies the following condition:

After the species abundance has been estimated, the estimated abundance of species j can be converted to the sequence abundance of that species by the following formula:
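The formulas of step S4 are not reproduced in this text, so the following Python sketch should be read as one plausible instance of the loop, not as the application's exact procedure. It assumes a uniform start of 1/S, an E-step weighted by abundance times genome length, an M-step that divides expected read counts by genome length, a stopping rule on the summed absolute change (threshold `tol` chosen arbitrarily), and a length-weighted conversion to sequence abundance. The function name `em_abundance` and all parameter names are hypothetical.

```python
import numpy as np

def em_abundance(C, lengths, tol=1e-10, max_iter=1000):
    """Estimate species abundance by EM (illustrative sketch).

    C: R x S matrix of alignment probabilities C_ij.
    lengths: average genome length l_j for each of the S species.
    Assumes at least one read aligns to at least one species.
    Returns (species_abundance, sequence_abundance).
    """
    C = np.asarray(C, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    n_species = C.shape[1]
    alpha = np.full(n_species, 1.0 / n_species)    # I-step: uniform start (assumed)
    for _ in range(max_iter):
        # E-step: expected number of reads n_j from each species,
        # with the genome length entering the weights.
        weights = alpha * lengths * C               # shape (R, S)
        row_sums = weights.sum(axis=1, keepdims=True)
        aligned = row_sums[:, 0] > 0                # skip reads with no alignment
        n = (weights[aligned] / row_sums[aligned]).sum(axis=0)
        # M-step: convert read counts to genome-copy proportions.
        new_alpha = n / lengths
        new_alpha /= new_alpha.sum()
        converged = np.abs(new_alpha - alpha).sum() < tol
        alpha = new_alpha
        if converged:
            break
    # Convert species abundance to sequence abundance (length-weighted).
    seq_abundance = alpha * lengths / (alpha * lengths).sum()
    return alpha, seq_abundance

C = [[0.8, 0.0], [0.8, 0.0], [0.0, 0.5]]            # three uniquely aligned reads
alpha, seq = em_abundance(C, lengths=[2.0, 1.0])
print(alpha, seq)
```

In this toy case the first genome is twice as long and receives twice as many reads, so the two species come out equally abundant while the first accounts for two thirds of the sequence. That gap is the difference between species abundance and sequence abundance that the final conversion step describes.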

Example 2 Evaluation of effect

This example compares the merits of the methods of the present application with conventional methods in terms of metagenomic abundance estimation.

1) Generating simulated metagenome data: based on the reference genomes of 4078 bacteria and 200 archaea, the simulator Mason was used to generate 20 million sequences of 100 bp in length, from which 10000 sequences were randomly extracted for abundance estimation.

2) Analysis of the simulated metagenome data: abundance estimation was performed on the above simulated data using the widely used Centrifuge and the metagenome abundance estimation method of the present application.

3) Correlation analysis of the abundance estimates of the two methods against the true abundance values (FIG. 2) shows that the Pearson correlation coefficient of Centrifuge is only 0.26, whereas that of the metagenome abundance estimation method of the present application is 0.6. After the 5% of estimates that are outliers are removed for both methods, the Pearson correlation coefficient of Centrifuge increases only to 0.46, whereas that of the method of the present application increases to 0.95 (FIG. 3).
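The comparison in step 3) can be reproduced in outline as follows. The text does not say how outliers were identified, so this sketch assumes they are the 5% of species with the largest absolute error between estimate and truth; the function name `pearson_after_trimming` and the toy data are hypothetical and unrelated to the reported figures.

```python
import numpy as np

def pearson_after_trimming(estimated, truth, trim_fraction=0.05):
    """Pearson correlation after dropping the worst estimates (sketch).

    Outliers are assumed to be the entries with the largest absolute
    error; trim_fraction=0 gives the plain Pearson correlation.
    """
    estimated = np.asarray(estimated, dtype=float)
    truth = np.asarray(truth, dtype=float)
    n_drop = int(np.floor(trim_fraction * len(truth)))
    order = np.argsort(np.abs(estimated - truth))   # smallest error first
    keep = order[: len(truth) - n_drop]
    return np.corrcoef(estimated[keep], truth[keep])[0, 1]

truth = np.arange(1.0, 21.0)        # 20 toy "true" abundances
estimated = truth.copy()
estimated[0] = 100.0                # one badly wrong estimate
print(pearson_after_trimming(estimated, truth, 0.0))
print(pearson_after_trimming(estimated, truth, 0.05))
```

A single extreme estimate is enough to depress the untrimmed coefficient, which is why the two figures report the correlation both with and without outlier removal.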

As can be seen from Example 2, compared with the widely used Centrifuge, the metagenome abundance estimation method provided by the application clearly improves the accuracy of metagenome abundance estimation, which in turn helps to improve the sensitivity and specificity of metagenome species identification.

Finally, it should be noted that the above embodiments are merely illustrative of the technical solution of the present application, and not limiting thereof; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims

1. A method for assessing metagenomic abundance, comprising the steps of:

1) Obtaining a unique alignment position proportion of a reference genome;

2) Obtaining the occurrence probability of unique alignment and multiple alignment of sequencing data;

3) Assessing metagenomic species abundance using an expectation maximization algorithm;

the step 2) is obtained by: quantifying the influence of species reference genome similarity alignment information based on the reference genome unique alignment position proportion, and constructing a statistical model to characterize the occurrence probability of the observed unique alignment and multiple alignment;

the statistical model is as follows:

wherein,

$R$ is the number of metagenomic sequencing data,

$S$ is the number of species in the metagenome,

$C_{ij}$ is the probability that sequencing data i is aligned to species j: when sequencing data i is aligned only to species j, the probability is equal to $P_j$, where $P_j$ is the proportion of unique alignment positions on reference genome j; when sequencing data i is multiply aligned to species j, the probability is equal to $1-P_j$; when sequencing data i is not aligned to species j, the probability is equal to 0;

the evaluation of step 3) is as follows:

solving the statistical model constructed in step 2) by adopting an expectation maximization algorithm, and estimating species abundance in the metagenome;

the solving is as follows: introducing the reference genome length in the expectation step to calculate the number $n_j$ of sequencing data from species j, by the formula:

then updating the abundance of species j based on $n_j$, by the formula:

2. The assessment method according to claim 1, wherein,

the step 1) is obtained by: based on the alignment information of the metagenome sequencing data, the ratio of the number of sequencing sequences uniquely aligned to the reference genome to the number of sequencing sequences all aligned to the reference genome is calculated.

3. The assessment method according to any one of claims 1-2, wherein said metagenome is an infectious metagenome.

4. An electronic device, comprising: a processor and a memory; the processor is connected to a memory, wherein the memory is adapted to store a computer program, the processor being adapted to invoke the computer program to perform the method of any of claims 1-3.

5. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method of any of claims 1-3.