Method for judging background introduction microorganism sequence and application thereof
Technical Field
The invention relates to the technical field of biological information analysis, in particular to a method for judging a background introduced microorganism sequence and application thereof.
Background
Pathogenic macro genomics (mNGS) is a high-throughput sequencing technology that directly extracts nucleic acids from clinical specimens and detects pathogens, independent of culture. Compared with the traditional clinical laboratory detection method, the pathogenic mNGS is used for detecting the sequence based on the nucleic acid level, can break through the limitations of different pathogen types, comprehensively covers thousands of pathogens without bias, simultaneously identifies various pathogenic microorganisms such as bacteria, fungi, viruses and parasites, and gradually becomes an important tool in the field of clinical microorganism identification.
However, the accuracy of the mNGS can be affected by contamination-DNA sequences are detected that are not actually present in the sample. There are two main types of contaminants in the mNGS experiment, which are caused by different sources, external and internal contamination respectively. External contamination is caused by the outside of the sample being tested, potential sources of contamination include the body of the subject or researcher, the laboratory environment, etc.; the most important of them are internal contamination from sample collection tools, assay consumables required for nucleic acid extraction, reagents required for library construction, and the like.
Although, contamination can be reduced by laboratory techniques such as ultraviolet radiation, "ultra-purification" and/or enzymatic treatment of reagents and separation of pre-PCR and post-PCR regions. However, even the best laboratory methods do not completely eliminate DNA contamination. Engineering bacteria are required to be used for producing the experimental consumables or reagents, nucleic acid residues of the engineering bacteria are inevitable, only partial pollution can be eliminated by treating the consumables or strictly controlling the laboratory, the pollution level is controlled, and the removal effect is limited.
The most common method in practice is to remove the contamination by a computational method, which is broadly divided into two types:
one is a method of directly removing background microorganisms, such as simultaneously setting a batch of negative blank controls to remove contamination by removing sequences of microorganisms commonly found in the blank controls; or removing potential microbes by using the characteristic that the proportion of background polluted nucleic acid is inversely proportional to the proportion of sample nucleic acid according to the characteristic of high host rate of the mNGS. This type of process only allows a list of potential background contaminations to be obtained, using a knife-cut method to directly remove these microorganisms. However, common clinical pathogens, such as acinetobacter baumannii, pseudomonas aeruginosa, klebsiella pneumoniae, stenotrophomonas maltophilia, escherichia coli, enterococcus faecalis, serratia marcescens and the like, are microorganisms existing in the background of nucleic acid extraction or the background of reagents at the same time, and whether the microorganisms in a sample are background pollution or originated from the sample cannot be judged.
The other method is to add an internal reference for quantification and remove microorganisms with relative abundance below a threshold value, and the method has high requirements on the internal reference, needs an internal reference set with accurate quantification and high complexity and is complex to operate. In practice, however, we find that different sample types have different effects on the final quantitative result when the internal parameters are added due to different pretreatment steps, and different sample types need to calculate the corresponding threshold values independently.
Disclosure of Invention
In view of the above, it is necessary to provide a method for determining background-introduced microorganism sequences, which can accurately determine whether the microorganism sequences obtained by sequencing are background-introduced, and can reduce the batch false positive or false negative problem.
A method of determining background incorporation of a microbial sequence, comprising the steps of:
determination of background microorganism list: taking a plurality of pathogenic microorganism negative control samples, respectively extracting nucleic acid, determining the total amount D of the nucleic acid of the samples, sequencing, obtaining microorganism sequences in the samples through comparison, and obtaining a quantitative value A of the abundance of each microorganism according to the number of the compared sequences; respectively carrying out correlation test on the quantitative value A of the abundance of each microorganism and the total nucleic acid amount D, and judging as a background pathogenic microorganism when the quantitative value A of the abundance of the microorganism and the total nucleic acid amount D of the sample are in negative correlation, thereby obtaining a background pathogenic microorganism list;
establishing a judgment model: respectively taking a negative control sample set and a positive sample set of background pathogenic microorganisms, obtaining gene sequence data of the background pathogenic microorganisms, comparing the gene sequence data with a characteristic sequence region of the microorganisms, calculating the abundance of each characteristic sequence region of the background microorganisms according to the compared sequence number, respectively obtaining the characteristic reads distribution of the microorganisms in the negative sample and the characteristic reads distribution of the microorganisms in the positive sample, and establishing a judgment model according to the different characteristic reads distributions of the background polluted pathogenic microorganisms in the negative sample and the positive sample;
quantitative analysis of microorganism-specific regions: acquiring background pathogenic microorganism gene sequence data in a sample to be detected, comparing the background pathogenic microorganism gene sequence data to a characteristic sequence area of the microorganism, calculating the abundance of each characteristic area of the microorganism according to the compared sequence number to obtain characteristic reads distribution of the microorganism of the sample to be detected, judging by adopting the judging model, judging as a positive sample if the characteristic reads distribution of the background pathogenic microorganism in the sample to be detected accords with the characteristics of the positive sample, and judging as a background to introduce the microorganism sample if the characteristic reads distribution of the background polluted pathogenic microorganism in the sample to be detected accords with the characteristics of the negative sample.
The present inventors found in previous studies that even microorganisms of the same species, such as microorganisms introduced in a background environment (defined as background microorganisms), are distinguished from clinically occurring microorganisms of this species (defined as pathogenic microorganisms) in their strains. Taking escherichia coli as an example, the escherichia coli is a common reagent engineering strain, and we first judge whether there is a difference between a clinically infected escherichia coli strain and a reagent engineering strain to judge the feasibility of the hypothesis. The inventor carries out analysis by a large sample amount, namely a large number of samples of clinically judged escherichia coli and samples of clinical negativity, downloads all reference sequences of escherichia coli strains with complete genomes from NCBI, and carries out quantification of escherichia coli on the samples, and finds that the clinical negativity samples are compared with strains with the highest abundance, such as BL21(DE3), C41(DE3), BL21(TaKaRa), C43(DE3) and the like, while the samples of clinically judged pathogenic escherichia coli are other strains and carry a drug-resistant gene, such as CTX-M and the like. Thus, it was confirmed that clinical pathogenic strains were distinct from reagent engineered strains and could be analyzed for their corresponding characteristics in the mNGS sequencing data.
Based on the above research, the present inventors propose the above method for determining background-introduced microorganism sequences, which obtains a determination model through big data training to determine whether each sample is more consistent with the existence of pathogens or only the existence of background, integrates quantification and strain characteristics of strains, obtains distribution characteristics of pathogen strains and background strains through large sample amount, and then determines whether the distribution of comparison reads is significantly higher than that of individual background strains by using a log-likelihood ratio model according to the distribution of comparison reads, thereby having the advantage of accurate determination.
In one embodiment, in the step of determining background microorganisms, the correlation test is a pearson correlation test and a spearman correlation test. It can be understood that the background microorganism list can be obtained by the correlation test in negative correlation based on the rule that the background pollution is inversely proportional to the total amount of the sample nucleic acid, and the judgment standard that the pearson correlation test and the spearman correlation test are simultaneously significant and negative correlated is the background microorganism has higher accuracy.
In one embodiment, in the step of determining the background microorganisms, the background pathogenic microorganisms further satisfy the following condition: the frequency of the microorganism in all negative control samples is more than or equal to 25 percent. For clinically insignificant microorganisms, which are not of interest, the present case focuses on clinically common pathogenic microorganisms with background contamination and is therefore defined as microorganisms that are stable in the control sample.
In one embodiment, the characteristic sequence region of the microorganism is obtained by:
obtaining a reference sequence of all complete strains of microorganisms, splitting by sliding cutting, carrying out cluster analysis on obtained split fragments, selecting one fragment as a non-redundant fragment if a plurality of fragments with similar clusters exist, taking the fragment as the non-redundant fragment if no similar fragment exists in a certain fragment cluster, analyzing all split fragments to obtain a non-redundant fragment set, comparing the non-redundant fragment set to a reference gene database, and removing the fragments which can be compared to other species to obtain a characteristic sequence region of the microorganisms.
It can be understood that the reference sequences of all the complete strains of the microorganisms can be downloaded from databases such as NCBI, the specific method of the cluster analysis can adopt any cluster analysis method applicable in statistics, and the cd-hit software is specifically adopted for clustering in the embodiment of the invention.
In one embodiment, the sliding cut is performed by 500 + -200 bp per sliding, and the window length is 1000 + -400 bp. Similarly, the specific sliding distance and window length can be adjusted according to the size of the gene segment to be obtained, so as to facilitate statistical analysis.
In one embodiment, in the step of establishing the judgment model, the distribution of the characteristic reads of the microorganism in the negative sample and the distribution of the characteristic reads of the microorganism in the positive sample are calculated by the following method:
wherein: p (IDi | H0) represents the conditional probability of each feature sequence region in the negative sample, and p (IDi | H1) represents the conditional probability of each feature sequence region in the positive sample;
∑SEH0read _ of _ ID _ i _ in _ Sample _ s represents the number of sequence fragments reads of the characteristic sequence region with ID i in the negative Sample,
the total number of sequence fragments reads representing all the characteristic sequence regions of all negative samples,
∑SEH1read _ of _ ID _ i _ in _ Sample _ s represents the number of sequence fragments reads of the characteristic sequence region with ID i in the positive Sample,
total sequence fragment reads number representing all signature sequence regions of all positive samples.
In one embodiment, the judgment model is a log-likelihood ratio model.
In one embodiment, the judgment model is established by the following method:
wherein:
LLR is a log-likelihood ratio;
L(H1) Obtaining the likelihood ratio of the current reads distribution condition under the positive condition;
L(H0) To obtain the current reads distribution under negative conditionsA likelihood of the situation;
ΠiP(IDi|H1)read_of_ID_imultiplying the conditional probabilities of comparing each reads to each characteristic sequence region under the positive condition;
ΠiP(IDi|H0)read_of_ID_ithe conditional probabilities of alignment of reads to each signature region under negative conditions were multiplied.
Since each reads is independent, the likelihood (L (H) of the current reads distribution is obtained under the positive condition1) Equal to the probability of the regions aligned by the reads; similarly, since each reads is independent, the likelihood probability of obtaining the current reads distribution under the negative condition is equal to the probability of the regions aligned by each reads multiplied.
The invention also discloses the application of the method for judging the background introduced microorganism sequence in the detection and analysis of pathogenic microorganisms.
The method is applied to the detection and analysis of pathogenic microorganisms, and can accurately judge whether each sample is more consistent with the existence of a pathogen or only exists in the background, so that the detection and analysis accuracy is improved.
The invention also discloses a pathogenic microorganism detection system, which comprises:
the detection module is used for carrying out gene detection on a sample to be detected to obtain microbial gene data;
the analysis module is used for acquiring the microbial gene data, analyzing according to the method and judging whether the detected microbial gene data is introduced as a background;
and the output module outputs the judgment result.
The pathogenic microorganism detection system has the advantages of high detection accuracy and low false negative and false positive by adopting the analysis method.
Compared with the prior art, the invention has the following beneficial effects:
the method for judging the background introduced microorganism sequence is to judge whether the background introduced microorganism sequence is obviously higher than the background distribution or not according to the comparison distribution of the characteristic reads in each sample, and can make accurate statistical judgment on each sample.
And no internal reference is needed, so that the method has the advantage of simple and convenient operation. The method also integrates the quantitative and strain characteristics of the strains, obtains the distribution characteristics of the pathogenic strains and the background strains through a large sample amount, and judges whether the distribution of the comparison reads is obviously higher than that of the independent background strains or not by using a log-likelihood ratio model according to the distribution of the comparison reads, thereby having the characteristic of internal correction of the sample independent of other samples.
In addition, the judgment model established by the method can correct the problem of alignment quantitative errors caused by the homology of the reference sequence to a certain extent.
Drawings
FIG. 1 is a schematic diagram of E.coli characteristic reads distribution;
FIG. 2 is a schematic view of the ROC curve in example 1;
FIG. 3 is a schematic diagram of the ROC curve in example 2.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The reagents used in the following examples, unless otherwise specified, are all conventionally available commercially; the methods used in the following examples are all carried out according to conventional methods unless otherwise specified.
Example 1
A method of determining background incorporation of a microbial sequence, comprising the steps of:
firstly, determining a background microorganism list.
1. And (5) sequencing.
1000 negative control plasma samples were obtained, nucleic acids were extracted and the total amount of nucleic acids D was quantified, followed by library sequencing.
And (3) carrying out mNGS sequencing off-machine data, and carrying out microbial quantification: firstly, removing a joint and a sequence with low quality or length less than 35bp by using fastp software; then, bwa software is used for aligning to a human reference genome, and the sequence aligned to the human reference gene is removed; the obtained non-human genomic sequence (unHost Reads) was then aligned with bwa to a microbial database (containing reference sequences for 18562 microorganisms); and finally, obtaining a quantitative value A of the abundance of the microorganisms according to the comparison sequence number.
2. Background microorganisms were determined.
By the rule that background contamination is inversely proportional to the total amount of nucleic acid in a sample, we calculated a pearson correlation test and a spearman correlation test of the quantitative value a of microbial abundance to the total amount of nucleic acid D, which were both significantly and negatively correlated. And microorganisms having an appearance frequency of not less than 25% in the sample are background microorganisms.
In this example, Escherichia coli Escherichia _ coli was judged as a background pathogenic microorganism in accordance with the above rule.
And secondly, establishing a judgment model.
1. E.coli-specific region characteristics (characteristic sequence region) database was constructed.
Reference sequences of Escherichia coli with complete genomes were downloaded from NCBI database https:// www.ncbi.nlm.nih.gov/assembly, and then the quality of assembly was evaluated to obtain 956 high quality E.coli reference sequences.
Then moving 500bp each time for sliding, splitting according to the window length of 1000bp, splitting the obtained fragments, clustering by using cd-hit software, and only reserving one of the fragments with high similarity to obtain a non-redundant fragment of the escherichia coli; then, megablast is used to align the non-redundant fragments to an Nt database (reference gene database), the fragments which can be aligned to non-Escherichia coli are removed, and finally, a characteristic region of the Escherichia coli-specific non-redundant fragments is obtained and is used as a representative sequence of the Escherichia coli in the database IDuniqV1.0.
It will be appreciated that if a database of specific regional characteristics is to be constructed for other microorganisms, a similar analysis strategy will be used to obtain the IDuniqV1.0 database.
2. And quantifying the characteristic region of the microorganism.
2.1 retrospective collection of clinical samples of mNGS sequencing and quantification of E.coli signature regions.
16536 clinical samples were collected as a training set in this example, and included alveolar lavage fluid, sputum, cerebrospinal fluid, pharyngeal swab, blood, tissue, and various other sample types.
First, for the mNGS sequencing data of 16536 samples, the low-quality and linker sequences were removed by fastp; bwa is used for aligning the human reference genome, and the sequence aligned to the human reference genome is removed; the sequences obtained were then aligned to the database of microorganism-specific regional characteristics, iduniqv1.0, and the number of sequences aligned to each characteristic region of each microorganism was counted for each sample.
As shown in FIG. 1, assuming that the E.coli genome has only 3 characteristic regions in the characteristic database, ID1, ID2 and ID3, then by quantifying the sample S, the comparison reads numbers of the sample S in the 3 characteristic regions are respectively 30, 10 and 5, which are the characteristic reads distribution of E.coli in the sample S.
2.2 model features are calculated.
We analyzed 16536 clinical samples from the training set, 304 of which were clinical judged E.coli samples (H1 sample set, with reagent background and E.coli pathogen) and the remainder were judged to be background strain only (H0 sample set, with reagent background), and the characteristic probability distribution of background strain H0 and the probability characteristic distribution of strain containing pathogen H1 were obtained according to the following two equations:
the above formula shows the probability of falling into the signature ID region for 1 sequence aligned to the signature region of the large intestine under the H1/H0 distribution, and each sequence is independent. This feature weight probability will be used as the profile idprob.
2.3 the log-likelihood ratio model judges whether the pathogenic strains exist in the sample.
Establishing a log-likelihood ratio model according to the following formula:
the log-likelihood ratio LLR is calculated by taking the ratio of the probability of obtaining the characteristic reads distribution under the H1 assumption to the probability of obtaining the characteristic reads distribution under the H0 assumption, and taking a log of the ratio.
Since H1 assumes that there is a pathogenic sequence in the presence of the background sequence, there is one more parameter than H0 (background sequence only). Therefore, when the number of compared reads is sufficiently large, the double log-likelihood ratio 2 LLR approximates the chi-square distribution with the degree of freedom of 1, and as the number of compared reads is higher, the power of the discrimination is higher, the statistic of the hypothesis test is higher; so when judged significantly above the H0 distribution, the presence of the pathogenic microorganism, not just the background sequence, in the sample can be judged.
In practical operation, the threshold optimized for each different microorganism can be obtained by training the sample for judgment of the subsequent clinical sample, and in this embodiment, the judgment threshold of the training set model is finally obtained as 43.237.
2.4 obtaining the threshold value of the log likelihood ratio LLR statistic and evaluating the classification effect of the model.
For 16536 samples in the training set, the probability of H0 or H1 in each feature region sequence is calculated according to the above formulas 1 and 2, and then the log likelihood ratio statistic of each sample is calculated according to the above formula 3.
Because the sequencing of the mNGS is influenced by the host ratio of a sample, a plurality of clinical samples have no specific sequence of the Escherichia coli, even a part of samples have no sequence of the Escherichia coli. In total 3725 samples in the training set have specific sequence coverage, and 304 samples with pathogenic escherichia coli are clinically judged to be samples satisfying specific sequences.
Therefore, in order to evaluate the true classification effect of LLR and obtain more accurate specificity (1 ten thousand samples without specific sequence were escherichia coli negative samples and were not included), this example was evaluated only for these 3725 samples, and the ROC curve is shown in fig. 2, where the AUC value is 0.972, the specificity is 0.963, and the sensitivity is 0.888.
The results show that the method for judging the introduction of the background into the microbial sequence can accurately judge whether the microbial sequence obtained by sequencing is introduced as the background, and the problem of batch false positive or false negative does not exist.
Example 2
1464 clinical samples were collected as a validation set, and 275 samples in which E.coli sequences were detected (out of these, 22 samples clinically judged to be E.coli positive) were analyzed by the method of example 1 to determine the background-introduced microbial sequences.
Sequencing was performed according to the method of example 1, wherein E.coli gene sequence data was obtained, substituted into the above judgment model, judgment was performed with the log likelihood ratio LLR statistic in example 1, the threshold was 40.684, and the model classification effect was evaluated.
The ROC curve is shown in fig. 3, with an AUC of 0.996, a specificity of 0.956, and a sensitivity of 1.000.
The results show that the method for judging the introduction of the microbial sequences into the background can achieve a good distinguishing effect on samples from different sources and batches, and has the advantages of high judgment stability and strong practicability.
It is understood that the log-likelihood ratio model LLR in the above embodiment is applicable to a wide range of sample types, and is not limited to the sample type, and may be used for DNA or RNA, and the model is not limited to the above escherichia coli, and may be applied to other microorganisms.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.