CN113270145B - Method for judging background introduction microorganism sequence and application thereof - Google Patents

Method for judging background introduction microorganism sequence and application thereof Download PDF

Info

Publication number
CN113270145B
CN113270145B CN202110466665.1A CN202110466665A CN113270145B CN 113270145 B CN113270145 B CN 113270145B CN 202110466665 A CN202110466665 A CN 202110466665A CN 113270145 B CN113270145 B CN 113270145B
Authority
CN
China
Prior art keywords
microorganism
background
sample
sequence
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110466665.1A
Other languages
Chinese (zh)
Other versions
CN113270145A (en
Inventor
许腾
何福生
李晓蕾
谢淑媚
王小锐
李永军
苏杭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Vision Gene Technology Co ltd
Guangzhou Weiyuan Medical Equipment Co ltd
Guangzhou Weiyuan Medical Laboratory Co ltd
Shenzhen Weiyuan Medical Technology Co ltd
Original Assignee
Guangzhou Weiyuan Medical Equipment Co ltd
Guangzhou Weiyuan Medical Laboratory Co ltd
Shenzhen Weiyuan Medical Technology Co ltd
Weiyuan Shenzhen Medical Research Center Co ltd
Guangzhou Vision Gene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Weiyuan Medical Equipment Co ltd, Guangzhou Weiyuan Medical Laboratory Co ltd, Shenzhen Weiyuan Medical Technology Co ltd, Weiyuan Shenzhen Medical Research Center Co ltd, Guangzhou Vision Gene Technology Co ltd filed Critical Guangzhou Weiyuan Medical Equipment Co ltd
Priority to CN202110466665.1A priority Critical patent/CN113270145B/en
Publication of CN113270145A publication Critical patent/CN113270145A/en
Application granted granted Critical
Publication of CN113270145B publication Critical patent/CN113270145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30Unsupervised data analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a method for judging a background introduced microorganism sequence and application thereof, belonging to the technical field of biological information analysis. The method comprises the following steps: determination of background microorganism list: taking a plurality of pathogenic microorganism negative samples, carrying out correlation test on the quantitative value A of the abundance of each microorganism and the total amount D of nucleic acid, and judging as the background pathogenic microorganism if the quantitative value A and the total amount D of the nucleic acid are in negative correlation; establishing a judgment model: respectively taking a negative control sample set and a positive sample set of the background pathogenic microorganism, comparing the gene sequence data of the background pathogenic microorganism to a characteristic sequence area of the microorganism, calculating abundance and establishing a judgment model; quantitative analysis of microorganism-specific regions: and acquiring background pathogenic microorganism gene sequence data in a sample to be detected, and judging by adopting the judging model. The method can accurately judge whether the microbial sequence obtained by sequencing is introduced as a background, can reduce the problem of false positive or false negative in batches, and can be widely applied to data analysis of pathogenic microorganism detection.

Description

Method for judging background introduction microorganism sequence and application thereof
Technical Field
The invention relates to the technical field of biological information analysis, in particular to a method for judging a background introduced microorganism sequence and application thereof.
Background
Pathogenic macro genomics (mNGS) is a high-throughput sequencing technology that directly extracts nucleic acids from clinical specimens and detects pathogens, independent of culture. Compared with the traditional clinical laboratory detection method, the pathogenic mNGS is used for detecting the sequence based on the nucleic acid level, can break through the limitations of different pathogen types, comprehensively covers thousands of pathogens without bias, simultaneously identifies various pathogenic microorganisms such as bacteria, fungi, viruses and parasites, and gradually becomes an important tool in the field of clinical microorganism identification.
However, the accuracy of the mNGS can be affected by contamination-DNA sequences are detected that are not actually present in the sample. There are two main types of contaminants in the mNGS experiment, which are caused by different sources, external and internal contamination respectively. External contamination is caused by the outside of the sample being tested, potential sources of contamination include the body of the subject or researcher, the laboratory environment, etc.; the most important of them are internal contamination from sample collection tools, assay consumables required for nucleic acid extraction, reagents required for library construction, and the like.
Although, contamination can be reduced by laboratory techniques such as ultraviolet radiation, "ultra-purification" and/or enzymatic treatment of reagents and separation of pre-PCR and post-PCR regions. However, even the best laboratory methods do not completely eliminate DNA contamination. Engineering bacteria are required to be used for producing the experimental consumables or reagents, nucleic acid residues of the engineering bacteria are inevitable, only partial pollution can be eliminated by treating the consumables or strictly controlling the laboratory, the pollution level is controlled, and the removal effect is limited.
The most common method in practice is to remove the contamination by a computational method, which is broadly divided into two types:
one is a method of directly removing background microorganisms, such as simultaneously setting a batch of negative blank controls to remove contamination by removing sequences of microorganisms commonly found in the blank controls; or removing potential microbes by using the characteristic that the proportion of background polluted nucleic acid is inversely proportional to the proportion of sample nucleic acid according to the characteristic of high host rate of the mNGS. This type of process only allows a list of potential background contaminations to be obtained, using a knife-cut method to directly remove these microorganisms. However, common clinical pathogens, such as acinetobacter baumannii, pseudomonas aeruginosa, klebsiella pneumoniae, stenotrophomonas maltophilia, escherichia coli, enterococcus faecalis, serratia marcescens and the like, are microorganisms existing in the background of nucleic acid extraction or the background of reagents at the same time, and whether the microorganisms in a sample are background pollution or originated from the sample cannot be judged.
The other method is to add an internal reference for quantification and remove microorganisms with relative abundance below a threshold value, and the method has high requirements on the internal reference, needs an internal reference set with accurate quantification and high complexity and is complex to operate. In practice, however, we find that different sample types have different effects on the final quantitative result when the internal parameters are added due to different pretreatment steps, and different sample types need to calculate the corresponding threshold values independently.
Disclosure of Invention
In view of the above, it is necessary to provide a method for determining background-introduced microorganism sequences, which can accurately determine whether the microorganism sequences obtained by sequencing are background-introduced, and can reduce the batch false positive or false negative problem.
A method of determining background incorporation of a microbial sequence, comprising the steps of:
determination of background microorganism list: taking a plurality of pathogenic microorganism negative control samples, respectively extracting nucleic acid, determining the total amount D of the nucleic acid of the samples, sequencing, obtaining microorganism sequences in the samples through comparison, and obtaining a quantitative value A of the abundance of each microorganism according to the number of the compared sequences; respectively carrying out correlation test on the quantitative value A of the abundance of each microorganism and the total nucleic acid amount D, and judging as a background pathogenic microorganism when the quantitative value A of the abundance of the microorganism and the total nucleic acid amount D of the sample are in negative correlation, thereby obtaining a background pathogenic microorganism list;
establishing a judgment model: respectively taking a negative control sample set and a positive sample set of background pathogenic microorganisms, obtaining gene sequence data of the background pathogenic microorganisms, comparing the gene sequence data with a characteristic sequence region of the microorganisms, calculating the abundance of each characteristic sequence region of the background microorganisms according to the compared sequence number, respectively obtaining the characteristic reads distribution of the microorganisms in the negative sample and the characteristic reads distribution of the microorganisms in the positive sample, and establishing a judgment model according to the different characteristic reads distributions of the background polluted pathogenic microorganisms in the negative sample and the positive sample;
quantitative analysis of microorganism-specific regions: acquiring background pathogenic microorganism gene sequence data in a sample to be detected, comparing the background pathogenic microorganism gene sequence data to a characteristic sequence area of the microorganism, calculating the abundance of each characteristic area of the microorganism according to the compared sequence number to obtain characteristic reads distribution of the microorganism of the sample to be detected, judging by adopting the judging model, judging as a positive sample if the characteristic reads distribution of the background pathogenic microorganism in the sample to be detected accords with the characteristics of the positive sample, and judging as a background to introduce the microorganism sample if the characteristic reads distribution of the background polluted pathogenic microorganism in the sample to be detected accords with the characteristics of the negative sample.
The present inventors found in previous studies that even microorganisms of the same species, such as microorganisms introduced in a background environment (defined as background microorganisms), are distinguished from clinically occurring microorganisms of this species (defined as pathogenic microorganisms) in their strains. Taking escherichia coli as an example, the escherichia coli is a common reagent engineering strain, and we first judge whether there is a difference between a clinically infected escherichia coli strain and a reagent engineering strain to judge the feasibility of the hypothesis. The inventor carries out analysis by a large sample amount, namely a large number of samples of clinically judged escherichia coli and samples of clinical negativity, downloads all reference sequences of escherichia coli strains with complete genomes from NCBI, and carries out quantification of escherichia coli on the samples, and finds that the clinical negativity samples are compared with strains with the highest abundance, such as BL21(DE3), C41(DE3), BL21(TaKaRa), C43(DE3) and the like, while the samples of clinically judged pathogenic escherichia coli are other strains and carry a drug-resistant gene, such as CTX-M and the like. Thus, it was confirmed that clinical pathogenic strains were distinct from reagent engineered strains and could be analyzed for their corresponding characteristics in the mNGS sequencing data.
Based on the above research, the present inventors propose the above method for determining background-introduced microorganism sequences, which obtains a determination model through big data training to determine whether each sample is more consistent with the existence of pathogens or only the existence of background, integrates quantification and strain characteristics of strains, obtains distribution characteristics of pathogen strains and background strains through large sample amount, and then determines whether the distribution of comparison reads is significantly higher than that of individual background strains by using a log-likelihood ratio model according to the distribution of comparison reads, thereby having the advantage of accurate determination.
In one embodiment, in the step of determining background microorganisms, the correlation test is a pearson correlation test and a spearman correlation test. It can be understood that the background microorganism list can be obtained by the correlation test in negative correlation based on the rule that the background pollution is inversely proportional to the total amount of the sample nucleic acid, and the judgment standard that the pearson correlation test and the spearman correlation test are simultaneously significant and negative correlated is the background microorganism has higher accuracy.
In one embodiment, in the step of determining the background microorganisms, the background pathogenic microorganisms further satisfy the following condition: the frequency of the microorganism in all negative control samples is more than or equal to 25 percent. For clinically insignificant microorganisms, which are not of interest, the present case focuses on clinically common pathogenic microorganisms with background contamination and is therefore defined as microorganisms that are stable in the control sample.
In one embodiment, the characteristic sequence region of the microorganism is obtained by:
obtaining a reference sequence of all complete strains of microorganisms, splitting by sliding cutting, carrying out cluster analysis on obtained split fragments, selecting one fragment as a non-redundant fragment if a plurality of fragments with similar clusters exist, taking the fragment as the non-redundant fragment if no similar fragment exists in a certain fragment cluster, analyzing all split fragments to obtain a non-redundant fragment set, comparing the non-redundant fragment set to a reference gene database, and removing the fragments which can be compared to other species to obtain a characteristic sequence region of the microorganisms.
It can be understood that the reference sequences of all the complete strains of the microorganisms can be downloaded from databases such as NCBI, the specific method of the cluster analysis can adopt any cluster analysis method applicable in statistics, and the cd-hit software is specifically adopted for clustering in the embodiment of the invention.
In one embodiment, the sliding cut is performed by 500 + -200 bp per sliding, and the window length is 1000 + -400 bp. Similarly, the specific sliding distance and window length can be adjusted according to the size of the gene segment to be obtained, so as to facilitate statistical analysis.
In one embodiment, in the step of establishing the judgment model, the distribution of the characteristic reads of the microorganism in the negative sample and the distribution of the characteristic reads of the microorganism in the positive sample are calculated by the following method:
Figure BDA0003044346800000031
Figure BDA0003044346800000032
wherein: p (IDi | H0) represents the conditional probability of each feature sequence region in the negative sample, and p (IDi | H1) represents the conditional probability of each feature sequence region in the positive sample;
SEH0read _ of _ ID _ i _ in _ Sample _ s represents the number of sequence fragments reads of the characteristic sequence region with ID i in the negative Sample,
Figure BDA0003044346800000033
the total number of sequence fragments reads representing all the characteristic sequence regions of all negative samples,
SEH1read _ of _ ID _ i _ in _ Sample _ s represents the number of sequence fragments reads of the characteristic sequence region with ID i in the positive Sample,
Figure BDA0003044346800000034
total sequence fragment reads number representing all signature sequence regions of all positive samples.
In one embodiment, the judgment model is a log-likelihood ratio model.
In one embodiment, the judgment model is established by the following method:
Figure BDA0003044346800000041
wherein:
LLR is a log-likelihood ratio;
L(H1) Obtaining the likelihood ratio of the current reads distribution condition under the positive condition;
L(H0) To obtain the current reads distribution under negative conditionsA likelihood of the situation;
ΠiP(IDi|H1)read_of_ID_imultiplying the conditional probabilities of comparing each reads to each characteristic sequence region under the positive condition;
ΠiP(IDi|H0)read_of_ID_ithe conditional probabilities of alignment of reads to each signature region under negative conditions were multiplied.
Since each reads is independent, the likelihood (L (H) of the current reads distribution is obtained under the positive condition1) Equal to the probability of the regions aligned by the reads; similarly, since each reads is independent, the likelihood probability of obtaining the current reads distribution under the negative condition is equal to the probability of the regions aligned by each reads multiplied.
The invention also discloses the application of the method for judging the background introduced microorganism sequence in the detection and analysis of pathogenic microorganisms.
The method is applied to the detection and analysis of pathogenic microorganisms, and can accurately judge whether each sample is more consistent with the existence of a pathogen or only exists in the background, so that the detection and analysis accuracy is improved.
The invention also discloses a pathogenic microorganism detection system, which comprises:
the detection module is used for carrying out gene detection on a sample to be detected to obtain microbial gene data;
the analysis module is used for acquiring the microbial gene data, analyzing according to the method and judging whether the detected microbial gene data is introduced as a background;
and the output module outputs the judgment result.
The pathogenic microorganism detection system has the advantages of high detection accuracy and low false negative and false positive by adopting the analysis method.
Compared with the prior art, the invention has the following beneficial effects:
the method for judging the background introduced microorganism sequence is to judge whether the background introduced microorganism sequence is obviously higher than the background distribution or not according to the comparison distribution of the characteristic reads in each sample, and can make accurate statistical judgment on each sample.
And no internal reference is needed, so that the method has the advantage of simple and convenient operation. The method also integrates the quantitative and strain characteristics of the strains, obtains the distribution characteristics of the pathogenic strains and the background strains through a large sample amount, and judges whether the distribution of the comparison reads is obviously higher than that of the independent background strains or not by using a log-likelihood ratio model according to the distribution of the comparison reads, thereby having the characteristic of internal correction of the sample independent of other samples.
In addition, the judgment model established by the method can correct the problem of alignment quantitative errors caused by the homology of the reference sequence to a certain extent.
Drawings
FIG. 1 is a schematic diagram of E.coli characteristic reads distribution;
FIG. 2 is a schematic view of the ROC curve in example 1;
FIG. 3 is a schematic diagram of the ROC curve in example 2.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The reagents used in the following examples, unless otherwise specified, are all conventionally available commercially; the methods used in the following examples are all carried out according to conventional methods unless otherwise specified.
Example 1
A method of determining background incorporation of a microbial sequence, comprising the steps of:
firstly, determining a background microorganism list.
1. And (5) sequencing.
1000 negative control plasma samples were obtained, nucleic acids were extracted and the total amount of nucleic acids D was quantified, followed by library sequencing.
And (3) carrying out mNGS sequencing off-machine data, and carrying out microbial quantification: firstly, removing a joint and a sequence with low quality or length less than 35bp by using fastp software; then, bwa software is used for aligning to a human reference genome, and the sequence aligned to the human reference gene is removed; the obtained non-human genomic sequence (unHost Reads) was then aligned with bwa to a microbial database (containing reference sequences for 18562 microorganisms); and finally, obtaining a quantitative value A of the abundance of the microorganisms according to the comparison sequence number.
2. Background microorganisms were determined.
By the rule that background contamination is inversely proportional to the total amount of nucleic acid in a sample, we calculated a pearson correlation test and a spearman correlation test of the quantitative value a of microbial abundance to the total amount of nucleic acid D, which were both significantly and negatively correlated. And microorganisms having an appearance frequency of not less than 25% in the sample are background microorganisms.
In this example, Escherichia coli Escherichia _ coli was judged as a background pathogenic microorganism in accordance with the above rule.
And secondly, establishing a judgment model.
1. E.coli-specific region characteristics (characteristic sequence region) database was constructed.
Reference sequences of Escherichia coli with complete genomes were downloaded from NCBI database https:// www.ncbi.nlm.nih.gov/assembly, and then the quality of assembly was evaluated to obtain 956 high quality E.coli reference sequences.
Then moving 500bp each time for sliding, splitting according to the window length of 1000bp, splitting the obtained fragments, clustering by using cd-hit software, and only reserving one of the fragments with high similarity to obtain a non-redundant fragment of the escherichia coli; then, megablast is used to align the non-redundant fragments to an Nt database (reference gene database), the fragments which can be aligned to non-Escherichia coli are removed, and finally, a characteristic region of the Escherichia coli-specific non-redundant fragments is obtained and is used as a representative sequence of the Escherichia coli in the database IDuniqV1.0.
It will be appreciated that if a database of specific regional characteristics is to be constructed for other microorganisms, a similar analysis strategy will be used to obtain the IDuniqV1.0 database.
2. And quantifying the characteristic region of the microorganism.
2.1 retrospective collection of clinical samples of mNGS sequencing and quantification of E.coli signature regions.
16536 clinical samples were collected as a training set in this example, and included alveolar lavage fluid, sputum, cerebrospinal fluid, pharyngeal swab, blood, tissue, and various other sample types.
First, for the mNGS sequencing data of 16536 samples, the low-quality and linker sequences were removed by fastp; bwa is used for aligning the human reference genome, and the sequence aligned to the human reference genome is removed; the sequences obtained were then aligned to the database of microorganism-specific regional characteristics, iduniqv1.0, and the number of sequences aligned to each characteristic region of each microorganism was counted for each sample.
As shown in FIG. 1, assuming that the E.coli genome has only 3 characteristic regions in the characteristic database, ID1, ID2 and ID3, then by quantifying the sample S, the comparison reads numbers of the sample S in the 3 characteristic regions are respectively 30, 10 and 5, which are the characteristic reads distribution of E.coli in the sample S.
2.2 model features are calculated.
We analyzed 16536 clinical samples from the training set, 304 of which were clinical judged E.coli samples (H1 sample set, with reagent background and E.coli pathogen) and the remainder were judged to be background strain only (H0 sample set, with reagent background), and the characteristic probability distribution of background strain H0 and the probability characteristic distribution of strain containing pathogen H1 were obtained according to the following two equations:
Figure BDA0003044346800000061
Figure BDA0003044346800000062
the above formula shows the probability of falling into the signature ID region for 1 sequence aligned to the signature region of the large intestine under the H1/H0 distribution, and each sequence is independent. This feature weight probability will be used as the profile idprob.
2.3 the log-likelihood ratio model judges whether the pathogenic strains exist in the sample.
Establishing a log-likelihood ratio model according to the following formula:
Figure BDA0003044346800000071
the log-likelihood ratio LLR is calculated by taking the ratio of the probability of obtaining the characteristic reads distribution under the H1 assumption to the probability of obtaining the characteristic reads distribution under the H0 assumption, and taking a log of the ratio.
Since H1 assumes that there is a pathogenic sequence in the presence of the background sequence, there is one more parameter than H0 (background sequence only). Therefore, when the number of compared reads is sufficiently large, the double log-likelihood ratio 2 LLR approximates the chi-square distribution with the degree of freedom of 1, and as the number of compared reads is higher, the power of the discrimination is higher, the statistic of the hypothesis test is higher; so when judged significantly above the H0 distribution, the presence of the pathogenic microorganism, not just the background sequence, in the sample can be judged.
In practical operation, the threshold optimized for each different microorganism can be obtained by training the sample for judgment of the subsequent clinical sample, and in this embodiment, the judgment threshold of the training set model is finally obtained as 43.237.
2.4 obtaining the threshold value of the log likelihood ratio LLR statistic and evaluating the classification effect of the model.
For 16536 samples in the training set, the probability of H0 or H1 in each feature region sequence is calculated according to the above formulas 1 and 2, and then the log likelihood ratio statistic of each sample is calculated according to the above formula 3.
Because the sequencing of the mNGS is influenced by the host ratio of a sample, a plurality of clinical samples have no specific sequence of the Escherichia coli, even a part of samples have no sequence of the Escherichia coli. In total 3725 samples in the training set have specific sequence coverage, and 304 samples with pathogenic escherichia coli are clinically judged to be samples satisfying specific sequences.
Therefore, in order to evaluate the true classification effect of LLR and obtain more accurate specificity (1 ten thousand samples without specific sequence were escherichia coli negative samples and were not included), this example was evaluated only for these 3725 samples, and the ROC curve is shown in fig. 2, where the AUC value is 0.972, the specificity is 0.963, and the sensitivity is 0.888.
The results show that the method for judging the introduction of the background into the microbial sequence can accurately judge whether the microbial sequence obtained by sequencing is introduced as the background, and the problem of batch false positive or false negative does not exist.
Example 2
1464 clinical samples were collected as a validation set, and 275 samples in which E.coli sequences were detected (out of these, 22 samples clinically judged to be E.coli positive) were analyzed by the method of example 1 to determine the background-introduced microbial sequences.
Sequencing was performed according to the method of example 1, wherein E.coli gene sequence data was obtained, substituted into the above judgment model, judgment was performed with the log likelihood ratio LLR statistic in example 1, the threshold was 40.684, and the model classification effect was evaluated.
The ROC curve is shown in fig. 3, with an AUC of 0.996, a specificity of 0.956, and a sensitivity of 1.000.
The results show that the method for judging the introduction of the microbial sequences into the background can achieve a good distinguishing effect on samples from different sources and batches, and has the advantages of high judgment stability and strong practicability.
It is understood that the log-likelihood ratio model LLR in the above embodiment is applicable to a wide range of sample types, and is not limited to the sample type, and may be used for DNA or RNA, and the model is not limited to the above escherichia coli, and may be applied to other microorganisms.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for determining background-induced microbial sequences, comprising the steps of:
determination of background microorganism list: taking a plurality of pathogenic microorganism negative control samples, respectively extracting nucleic acid, determining the total amount D of the nucleic acid of the samples, sequencing, obtaining microorganism sequences in the samples through comparison, and obtaining a quantitative value A of the abundance of each microorganism according to the number of the compared sequences; respectively carrying out correlation test on the quantitative value A of the abundance of each microorganism and the total nucleic acid amount D, and judging as a background pathogenic microorganism when the quantitative value A of the abundance of the microorganism and the total nucleic acid amount D of the sample are in negative correlation, thereby obtaining a background pathogenic microorganism list;
establishing a judgment model: respectively taking a negative control sample set and a positive sample set of background pathogenic microorganisms, obtaining gene sequence data of the background pathogenic microorganisms, comparing the gene sequence data with a characteristic sequence region of the microorganisms, calculating the abundance of each characteristic sequence region of the background microorganisms according to the compared sequence number, respectively obtaining the characteristic reads distribution of the microorganisms in the negative sample and the characteristic reads distribution of the microorganisms in the positive sample, and establishing a judgment model according to the different characteristic reads distributions of the background polluted pathogenic microorganisms in the negative sample and the positive sample;
and (3) quantitatively analyzing a microorganism characteristic sequence region: acquiring background pathogenic microorganism gene sequence data in a sample to be detected, comparing the background pathogenic microorganism gene sequence data to a characteristic sequence area of the microorganism, calculating the abundance of each characteristic area of the microorganism according to the compared sequence number to obtain characteristic reads distribution of the microorganism of the sample to be detected, judging by adopting the judging model, judging as a positive sample if the characteristic reads distribution of the background pathogenic microorganism in the sample to be detected accords with the characteristics of the positive sample, and judging as a background to introduce the microorganism sample if the characteristic reads distribution of the background polluted pathogenic microorganism in the sample to be detected accords with the characteristics of the negative sample.
2. The method for determining a background-introduced microorganism sequence according to claim 1, wherein in the step of determining a background microorganism list, the correlation test is a pearson correlation test and a spearman correlation test.
3. The method of claim 1, wherein the background pathogenic microorganisms satisfy the following conditions in the step of determining the background microorganism list: the frequency of the microorganism in all negative control samples is more than or equal to 25 percent.
4. The method for determining background-introduced microorganism sequences according to claim 1, wherein the characteristic sequence region of the microorganism is obtained by:
obtaining a reference sequence of all complete strains of microorganisms, splitting by sliding cutting, carrying out cluster analysis on obtained split fragments, selecting one fragment as a non-redundant fragment if a plurality of fragments with similar clusters exist, taking the fragment as the non-redundant fragment if no similar fragment exists in a certain fragment cluster, analyzing all split fragments to obtain a non-redundant fragment set, comparing the non-redundant fragment set to a reference gene database, and removing the fragments which can be compared to other species to obtain a characteristic sequence region of the microorganisms.
5. The method for determining background-introduced microbial sequences according to claim 4, wherein said sliding cut is performed at 500 ± 200bp per sliding and the window length is 1000 ± 400 bp.
6. The method for determining background-introduced microorganism sequences according to claim 1, wherein in the step of establishing the determination model, the distribution of the characteristic reads of the microorganism in the negative sample and the distribution of the characteristic reads of the microorganism in the positive sample are calculated by:
Figure FDA0003498108180000021
Figure FDA0003498108180000022
wherein: p (IDi | H0) represents the conditional probability of each feature sequence region in the negative sample, P (IDi | H1) represents the conditional probability of each feature sequence region in the positive sample;
s∈H0read _ of _ ID _ i _ in _ Sample _ s represents the number of sequence fragments reads of the characteristic sequence region with ID i in the negative Sample,
Figure FDA0003498108180000023
the total number of sequence fragments reads representing all the characteristic sequence regions of all negative samples,
s∈H1read _ of _ ID _ i _ in _ Sample _ s represents the number of sequence fragments reads of the characteristic sequence region with ID i in the positive Sample,
Figure FDA0003498108180000024
total number of sequence fragments reads representing all characteristic sequence regions of all positive samples.
7. The method for determining background-introduced microorganism sequences according to claim 6, wherein the determination model is a log-likelihood ratio model.
8. The method for determining background-introduced microorganism sequences according to claim 7, wherein the determination model is established by:
Figure FDA0003498108180000025
wherein:
LLR is a log-likelihood ratio;
L(H1) Obtaining the likelihood ratio of the current reads distribution condition under the positive condition;
L(H0) Obtaining the likelihood ratio of the current reads distribution condition under the negative condition;
ΠiP(IDi|H1)read_of_ID_imultiplying the conditional probabilities of comparing each reads to each characteristic sequence region under the positive condition;
ПiP(IDi|H0)read_of_ID_imultiplying the conditional probability of aligning each reads to each signature sequence region under negative conditions.
9. Use of the method for determining background-introduced microorganism sequences according to any of claims 1 to 8 in pathogenic microorganism detection assays.
10. A pathogenic microorganism detection system, comprising:
the detection module is used for carrying out gene detection on a sample to be detected to obtain microbial gene data;
an analysis module for acquiring the microbial gene data, analyzing according to the method of any one of claims 1 to 7, and judging whether the detected microbial gene data is introduced as a background;
and the output module outputs the judgment result.
CN202110466665.1A 2021-04-28 2021-04-28 Method for judging background introduction microorganism sequence and application thereof Active CN113270145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110466665.1A CN113270145B (en) 2021-04-28 2021-04-28 Method for judging background introduction microorganism sequence and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110466665.1A CN113270145B (en) 2021-04-28 2021-04-28 Method for judging background introduction microorganism sequence and application thereof

Publications (2)

Publication Number Publication Date
CN113270145A CN113270145A (en) 2021-08-17
CN113270145B true CN113270145B (en) 2022-05-06

Family

ID=77229774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110466665.1A Active CN113270145B (en) 2021-04-28 2021-04-28 Method for judging background introduction microorganism sequence and application thereof

Country Status (1)

Country Link
CN (1) CN113270145B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114038501B (en) * 2021-12-21 2022-05-27 广州金匙医学检验有限公司 Background bacterium judgment method based on machine learning
CN117051129B (en) * 2023-10-10 2024-03-22 瑞因迈拓科技(广州)有限公司 Method for setting background bacteria threshold value for microorganism detection and application thereof

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349629A (en) * 2019-06-20 2019-10-18 广州赛哲生物科技股份有限公司 Analysis method for detecting microorganisms by using metagenome or macrotranscriptome

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105671150A (en) * 2016-01-29 2016-06-15 江汉大学 Method for qualitative and quantitative detection of microorganisms in human body
CN113689912A (en) * 2020-12-14 2021-11-23 广东美格基因科技有限公司 Method and system for correcting microbial contrast result based on metagenome sequencing
AU2021100216A4 (en) * 2021-01-13 2021-04-08 Qingdao Agricultural University An xMAP Assay for Detection of Porcine Reproductive and Respiratory Syndrome Virus Antibodies

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349629A (en) * 2019-06-20 2019-10-18 广州赛哲生物科技股份有限公司 Analysis method for detecting microorganisms by using metagenome or macrotranscriptome

Also Published As

Publication number Publication date
CN113270145A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN110349630B (en) Analysis method and device for blood metagenome sequencing data and application thereof
JP7164125B2 (en) Quality control templates to ensure validity of sequencing-based assays
EP3143537B1 (en) Rare variant calls in ultra-deep sequencing
CN113270145B (en) Method for judging background introduction microorganism sequence and application thereof
CN113160882B (en) Pathogenic microorganism metagenome detection method based on third generation sequencing
CN111187813B (en) Full-process quality control pathogenic microorganism high-throughput sequencing detection method
CN113066533B (en) mNGS pathogen data analysis method
CN111041089A (en) Application of host marker for COVID-19 infection
Sadasivan et al. Rapid real-time squiggle classification for read until using rawmap
CN113284560A (en) Pathogenic detection background microorganism judgment method and application
US20130045878A1 (en) Process for identification of pathogens
CN116189763A (en) Single sample copy number variation detection method based on second generation sequencing
KR102124193B1 (en) Method for screening makers for predicting depressive disorder or suicide risk using machine learning, markers for predicting depressive disorder or suicide risk, method for predicting depressive disorder or suicide risk
EP3975190A1 (en) Method for discovering marker for predicting risk of depression or suicide using multi-omics analysis, marker for predicting risk of depression or suicide, and method for predicting risk of depression or suicide using multi-omics analysis
CN115719616A (en) Method and system for screening specific sequences of pathogenic species
CN114078568B (en) Metagenome sequencing data processing system and processing method based on IIB type restriction endonuclease characteristics
EP3612644B1 (en) Use of off-target sequences for dna analysis
Vranckx et al. Analysis of MALDI‐TOF MS Spectra using the BioNumerics Software
US20240011105A1 (en) Analysis of microbial fragments in plasma
CN116426696B (en) Plasma virus detection and analysis method based on sequencing technology
CN117854593A (en) Target sequencing data analysis method, system and computer storage medium
CN118006818A (en) High-sensitivity detection method for Chlamydia psittaci
CN112735530A (en) Method for tracing sample based on flora structure
CN116497104A (en) Quality control evaluation method and system for laboratory standard strain
CN116026914A (en) Microorganism detection method based on metabolic mass spectrum

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230828

Address after: Room 303, G10, South China new material innovation park, building 3, No. 31, Kefeng Road, Guangzhou hi tech Industrial Development Zone, Guangdong 510130

Patentee after: GUANGZHOU VISION GENE TECHNOLOGY Co.,Ltd.

Patentee after: Guangzhou Weiyuan Medical Equipment Co.,Ltd.

Patentee after: Guangzhou Weiyuan medical laboratory Co.,Ltd.

Patentee after: Shenzhen Weiyuan Medical Technology Co.,Ltd.

Address before: Room 303, G10, South China new material innovation park, building 3, No. 31, Kefeng Road, Guangzhou hi tech Industrial Development Zone, Guangdong 510130

Patentee before: GUANGZHOU VISION GENE TECHNOLOGY Co.,Ltd.

Patentee before: Guangzhou Weiyuan Medical Equipment Co.,Ltd.

Patentee before: Guangzhou Weiyuan medical laboratory Co.,Ltd.

Patentee before: Shenzhen Weiyuan Medical Technology Co.,Ltd.

Patentee before: Weiyuan (Shenzhen) Medical Research Center Co.,Ltd.

TR01 Transfer of patent right