WO2022028624A1 - 通过测序获取微生物物种及相关信息的方法、装置、计算机可读存储介质和电子设备 - Google Patents

通过测序获取微生物物种及相关信息的方法、装置、计算机可读存储介质和电子设备 Download PDF

Info

Publication number
WO2022028624A1
WO2022028624A1 PCT/CN2021/115705 CN2021115705W WO2022028624A1 WO 2022028624 A1 WO2022028624 A1 WO 2022028624A1 CN 2021115705 W CN2021115705 W CN 2021115705W WO 2022028624 A1 WO2022028624 A1 WO 2022028624A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
reads
seed
sequences
species
Prior art date
Application number
PCT/CN2021/115705
Other languages
English (en)
French (fr)
Inventor
江山
杨玉梅
庞白冰
约翰逊·克里斯托弗·雷
Original Assignee
西安中科茵康莱医学检验有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西安中科茵康莱医学检验有限公司 filed Critical 西安中科茵康莱医学检验有限公司
Publication of WO2022028624A1 publication Critical patent/WO2022028624A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing

Definitions

  • the present invention relates to the field of microbial identification, in particular, to a method, device, computer-readable storage medium and electronic device for obtaining microbial species and related information through sequencing.
  • Microorganisms are divided into the following eight categories: bacteria, viruses, fungi, actinomycetes, rickettsia, mycoplasma, chlamydia, and spirochetes.
  • Next-generation sequencing (NGS) (mNGS) technology is an effective method for identifying microbial species present in a sample.
  • rRNA There are three types of rRNA in prokaryotes: 23S, 16S, and 5S rRNA.
  • the gene encoding 16S rRNA has good evolutionary conservation, suitable length for analysis (about 1540bp), and good variability matched with evolutionary distance, so it has become the standard identification sequence for bacterial molecular identification.
  • the 16S rRNA gene is not only suitable for bacteria, but also for the classification of prokaryotes such as mycoplasma, chlamydia, rickettsia, spirochetes, etc.
  • the sequence of 16S rRNA contains 9 or 10 hypervariable regions and 10 or 11 conserved regions.
  • the sequences of the conserved regions reflect the genetic relationship between biological species, while the sequences of the hypervariable regions reflect the differences between species.
  • the NGS-targeted sequencing strategy targets the hypervariable region sequences of the 16S rRNA gene. PCR-amplified sequences of 100-several hundred bp were used for NGS sequencing, and the obtained sequence information was compared with the 16S rRNA gene sequence database to identify the microorganisms present in the sample.
  • mNGS next-generation sequencing
  • the NGS technology based on 16S/18S/ITS amplicons has a limited read length, and the length of the sequenced fragments is between 50 and 400 bp depending on different sequencing platforms.
  • the length of the 16S rRNA gene is about 1500 bp.
  • the nucleic acid molecule of the gene In order to obtain the full-length sequence information of the gene, the nucleic acid molecule of the gene must be broken into short fragments suitable for NGS sequencing.
  • the full length of the 16S rRNA gene sequence was assembled after the sequence stacking. However, since ribosomal gene sequences are highly conserved in the evolution of species, the sequence similarity of species with close evolutionary relationship (such as species within the same genus) is high.
  • next-generation sequencing performs targeted amplification on the hypervariable region of the 16S rRNA gene, and then performs NGS sequencing on the amplicon. Since the nucleotide sequences of 9 or 10 hypervariable regions contained in the sequence of 16S rRNA reflect the differences between species, the NGS sequencing of the hypervariable regions and the sequence information obtained by sequencing were compared with The 16S rRNA gene hypervariable region sequence database alignment can obtain the ability to identify some microorganisms at the "species" level.
  • nucleotide sequence diversity carried by a single or several hypervariable regions is not sufficient to distinguish all prokaryotic microorganisms.
  • Johnson, J.S. et al. (2019) showed that only the full-length 16S rRNA gene nucleotide sequence contains enough information to distinguish all prokaryotic microorganisms at the "species" level. Therefore, the current 16S/18S/ITS amplicon next-generation sequencing (NGS) technology cannot achieve the discriminative ability to detect microorganisms in clinical samples at the level of "species”.
  • NGS next-generation sequencing
  • the present invention relates to a method for obtaining microbial species and related information by sequencing, comprising:
  • sequencing data is obtained by sequencing the amplification product through next-generation sequencing technology after targeting and enriching the characteristic nucleic acid sequences of microorganisms;
  • the feature sequence database performs clustering processing in advance according to the similarity of the reference sequences containing the feature sequences to obtain one or more levels of clusters, each level cluster has at least one sub-seed, and the bottommost cluster has a sub-seed or several seeds as reference sequences;
  • the comparative analysis includes:
  • step e merging the tertiary screening seed sequences, aligning the reads with it, and stepwise iterative screening to obtain a reference sequence that meets the requirements, wherein the threshold used in the stepwise iterative screening is more stringent than step d);
  • step f) According to the reference sequence obtained in step e) and the number of reads thereof, calculate the content of reads at the species level and its proportion; when calculating, add the number of reads of a plurality of reference sequences belonging to the same species to obtain the reads of the species
  • the proportion of each species in the sample is calculated by dividing the number of reads per species by the sum of the number of reads of the species contained in the sample.
  • the present invention also relates to a device for obtaining microbial species and related information by sequencing, the device comprising:
  • a sequencing data acquisition module which is used for acquiring sequencing data, the sequencing data is obtained by sequencing the amplification product through the next generation sequencing technology after targeting and enriching the characteristic nucleic acid sequences of microorganisms;
  • a feature sequence database building module which is used to perform clustering processing on the sequencing data to obtain a feature sequence database
  • the feature sequence database includes one or more levels of clusters, and each level cluster has at least one sub-seed, and the bottom layer has at least one sub-seed. There is one or more seeds in the cluster that are used as reference sequences;
  • the alignment analysis module is used for performing alignment analysis on the sequencing data and the characteristic sequence database to identify the composition of microorganisms in the detected sample, and the alignment analysis is the alignment in the method as described above. Defined for analysis.
  • the present invention also relates to a computer-readable storage medium for storing computer instructions, programs, code sets or instruction sets which, when run on a computer, cause the computer to execute Step ii) in the method as described above.
  • the present invention also relates to an electronic device, comprising:
  • a storage device that stores one or more programs
  • the one or more processors When the one or more programs are executed by the one or more processors, the one or more processors cause the one or more processors to implement step ii) in the method as described above.
  • the present invention also relates to the use of the method as described above, or the device as described above, or the computer-readable storage medium as described above, or the electronic device as described above, in identifying microorganisms.
  • the existing technology still cannot satisfactorily solve the problem of using short-read NGS technology to complete the identification of microbial species based on evolutionarily highly conserved long-segment sequence information such as the 16S rRNA gene sequence.
  • the present invention solves this problem well.
  • the detection of laboratory and clinical samples confirms that the present invention realizes the accurate identification of highly similar long fragment sequences such as 16S rRNA gene sequences by using short-read NGS technology, and overcomes the fact that targeted sequencing can only be used to detect short sequences.
  • the difficulty of sequencing has enabled the identification of microorganisms at the species level or higher resolution based on short fragment sequencing.
  • the present invention can correctly identify the microbial species present in the sample, and measure the relative ratio in quantity among the species, and has higher accuracy and sensitivity than the prior art. Take bacteria as an example: the lower limit of detection for a single species can be as low as 10 CFU.
  • the present invention can simultaneously and correctly detect all microorganisms in a mixed sample of multiple (such as 5 or more) species. Even if the content of the two species differs by a factor of 16 or more, the present invention can correctly detect all of them at the same time.
  • the average sequencing data volume of all samples in Examples 3 to 9 is 55,663 reads, which is much lower than the data volume (10,000,000-100,000,000 reads) required by the current mNGS method.
  • the amount of valid data available for microbial identification in all samples is more than 90%, while the valid data of mNGS methods in existing reports usually accounts for 0.00001-0.01%.
  • the present invention exhibits extremely high data efficiency.
  • the detection cost of the present invention is much lower than that of the current technology mNGS.
  • the present invention not only ensures low cost, but also meets the actual needs of detection accuracy and sensitivity. Therefore, the present invention has higher detection sensitivity and accuracy while maintaining the advantages of the mNGS method for detecting a wide range of targets and less affected factors.
  • FIG. 1 is a schematic diagram of a detection flow taking the detection of microorganisms based on prokaryotic 16S rRNA gene sequencing as an example in an embodiment of the present invention
  • Fig. 2 is the schematic diagram of the main operation flow based on the construction of 16S rDNA reference sequence database in one embodiment of the present invention
  • FIG. 3 is a schematic diagram of a database comprising a two-layer clustering hierarchical structure constructed in an embodiment of the present invention
  • FIG. 4 is a schematic diagram of a seed clustering hierarchical relationship of a reference sequence database in an embodiment of the present invention
  • FIG. 5 is a schematic flowchart of an algorithm for identifying microbial species using sequencing data according to an embodiment of the present invention.
  • the present invention relates to a method for obtaining microbial species and related information by sequencing, comprising:
  • sequencing data is obtained by sequencing the amplification product through next-generation sequencing technology after targeting and enriching the characteristic nucleic acid sequences of microorganisms;
  • the feature sequence database performs clustering processing in advance according to the similarity of the reference sequences containing the feature sequences to obtain one or more levels of clusters, each level cluster has at least one sub-seed, and the bottommost cluster has a sub-seed or several seeds as reference sequences.
  • the microorganisms include bacteria, archaea, fungi, mycoplasma, chlamydia, leektria, spirochetes, and viruses, wherein the characteristic nucleic acid sequences of RNA viruses can be obtained by reverse transcribing their RNA genomes to generate cDNAs.
  • NGS Next Generation Sequencing, next-generation sequencing/second-generation sequencing, also known as high-throughput sequencing;
  • mNGS metagenomics next generation sequencing, metagenomics next generation sequencing;
  • ITS Internal Transcribed Spacer, an internal transcribed spacer, is a nucleic acid sequence located between the large and small subunit rRNAs in the fungal ribosomal RNA (rRNA) gene transcription region or the corresponding polycistronic rRNA precursor.
  • rRNA fungal ribosomal RNA
  • sequencing reads that is, a sequence of the detection object generated during high-throughput sequencing
  • NRMSE normalized root mean square error, standardized root mean square error
  • Adapter the adapter sequence used in sequencing
  • Cluster class, cluster
  • Seed seed, that is, the class center; according to the different levels of the cluster, the seed is divided into a sub-seed and a seed as a reference sequence, but the two can be summarized by "seed";
  • Bowtie2 an alignment software that aligns short sequences to large genomes
  • Mean depth average sequencing depth
  • Gap blank, breakpoint, here means no coverage
  • End gap no coverage at the end
  • Middle gap no coverage in the middle segment
  • Overlap graphs a graph used to characterize the overlapping relationship between multiple nucleic acid sequences in sequence coding
  • paired-end reads double-ended sequence, the sequence generated from the forward and reverse sequencing of the current fragment
  • de novo assembly de novo assembly, a method of assembling small fragments into longer fragments
  • the reference sequence refers to the characteristic sequence that can indicate the microbial species, which is usually conservative, and the reference sequence generally includes 16S rRNA gene, 18S rRNA gene, ITS nucleic acid sequence, RNA virus
  • the RNA is the full-length sequence of one or more of the RNA polymerase gene (RdRp) of the template, the viral capsid protein coding gene, the pol gene of the retrovirus, etc. or other nucleic acid sequences that can reflect the characteristics of the microorganism species.
  • the detected object can be from an organism (microbial host) or from an environmental sample containing microorganisms.
  • the sample to be detected is a sample from a microbial host or an environmental sample containing microorganisms: the sample from a microbial host preferably includes but is not limited to: feces, intestinal contents, skin, At least one of tissue, sputum, blood, saliva, dental plaque, urine, vaginal secretions, bile, bronchoalveolar lavage, cerebrospinal fluid, pleural fluid, ascites, pelvic fluid, pus, and rumen; in some
  • the environmental samples from microorganisms preferably include: internal and external surfaces of objects, domestic water, medical water, industrial water, food, beverages, fertilizers, wastewater, volcanic ash, permafrost, silt, soil, compost, pollution At least one of river aquaculture water and air.
  • the host is an animal; further choices include humans and all livestock (eg, domestic animals and pets) and wild animals and birds including, without limitation, cattle, horses, dairy cattle, pigs, sheep, Goats, rats, mice, dogs, cats, rabbits, camels, donkeys, deer, minks, chickens, ducks, geese, turkeys, fighting cocks, etc.
  • livestock eg, domestic animals and pets
  • wild animals and birds including, without limitation, cattle, horses, dairy cattle, pigs, sheep, Goats, rats, mice, dogs, cats, rabbits, camels, donkeys, deer, minks, chickens, ducks, geese, turkeys, fighting cocks, etc.
  • Preprocessing of samples For samples of different types and sources, it may be necessary to preprocess the samples to meet the needs of nucleic acid extraction. Treatment methods include but are not limited to washing the samples with sterile water, ddH 2 O (double distilled water), sterile physiological saline, sterile PBS (phosphate buffered saline) and other liquids, and concentrating the samples by filtration, centrifugation, etc. , use gradient centrifugation and other methods to separate some components in the sample, or use some kits that meet the experimental needs to separate some components in the sample, or remove or enrich a certain part of the nucleic acid in the sample.
  • Treatment methods include but are not limited to washing the samples with sterile water, ddH 2 O (double distilled water), sterile physiological saline, sterile PBS (phosphate buffered saline) and other liquids, and concentrating the samples by filtration, centrifugation, etc. , use gradient centrifugation and other methods to separate some components
  • Extraction of nucleic acid Use a nucleic acid extraction kit to extract all nucleic acid substances contained in the pretreated sample.
  • the nucleic acid extraction kit used is not limited to a certain manufacturer, nor is it limited to a certain method, as long as it can obtain nucleic acid substances that meet the quality requirements required by the experiment.
  • the extracted nucleic acid includes DNA, RNA, or both.
  • nucleic acid of known sequence Before the start of this step, a certain amount of nucleic acid of known sequence can be added to the sample, and the nucleic acid sequence meets the following conditions: 1) It can be amplified in the reaction system prepared in the next step; 2) It can be added by the next step 3) The entire sequence is known; 4) The sequence used will not interfere with the analysis of any target sequence that may be present in the sample; 5) The nucleic acid sequence may exist alone or It depends on the existence of vectors such as plasmids, viruses, cells, etc.; 6) The added nucleic acid sequence can be obtained by the operation of this step, and exists in the finally extracted nucleic acid material.
  • adding a nucleic acid with a known sequence can help to better judge the contamination introduced by sampling, experiment and other links in the detection result, but it is not necessary in this technical solution. Not adding nucleic acid of known sequence will not affect the integrity of this technical solution, and adding nucleic acid of known sequence does not constitute an innovation to this technical solution.
  • the addition of nucleic acids of known sequence is not limited to this step.
  • Targeted enrichment of specific nucleic acid sequences Use certain methods to enrich nucleic acid sequences that can provide microbial taxonomic information, so that the nucleic acids of these sequences occupy a higher proportion of the total nucleic acid sequences of the sample, and the enriched products are analyzed. Purification and quantification. Enrichment methods include, but are not limited to, PCR, hybridization capture, and the like. The purification of enriched products includes but is not limited to adsorption column purification, magnetic bead purification, etc. The purpose is to remove enzymes, primers, probes, salts, metal ions and other components remaining in the sample during the enrichment process, so as to obtain pure and relatively high-quality products.
  • Nucleic acid sequences of long fragments are of long fragments (greater than 20 bp). Quantification is to determine the concentration of nucleic acid in the obtained sample, and then calculate the nucleic acid content in the sample according to the volume. Quantitative methods include ultraviolet spectrophotometry, dye-binding method and other methods. The enriched target sequence is currently commonly used for species classification of microorganisms.
  • RNA RNA-dependent RNA polymerase
  • N Nucleocapsid genes in the coronavirus genome.
  • viruses which may be evolutionarily conserved and species-specific nucleic acid sequences in its genome, such as the Pol (RNA-dependent RNA polymerase) and N (Nucleocapsid) genes in the coronavirus genome.
  • Pol RNA-dependent RNA polymerase
  • N Nucleocapsid
  • Nucleic acid sequences can exist alone or rely on vectors such as plasmids, viruses, and cells.
  • one or more steps of quality inspection procedures are included after nucleic acid enrichment, the purpose of which is to detect the effect of enrichment on target nucleic acid sequences.
  • the means of quality control include detection of nucleic acid content, detection of nucleic acid purity, and detection of fragment length of enriched nucleic acid sequences.
  • sequencing detection efficiency usually means that the content of microorganisms in the samples is more abundant.
  • the enrichment efficiency the content level of microorganisms in the sample can be predicted. For samples whose enrichment quality does not meet expectations, the enrichment operation can be re-enriched, but not all of them must be re-enriched. For samples whose enrichment quality does not meet expectations, subsequent experiments can also be continued.
  • Sequencing library construction The purpose is to convert the enriched nucleic acids into short nucleic acid fragments that can be detected by the NGS platform.
  • the main step is to break the long nucleic acid sequence into a length that can be read by the NGS platform, and at the same time add corresponding sequencing primers to both ends of the fragment, so that the sequencer can detect the nucleic acid sequence. If the added sequencing primer contains a barcode (barcode/index), the source of the sample can also be distinguished.
  • a certain amount of nucleic acid with a known sequence can be added to the nucleic acid obtained in the previous step, and the nucleic acid sequence satisfies the following conditions:
  • the nucleic acid sequence can exist alone, or it can exist depending on vectors such as plasmids, viruses, cells, etc.
  • adding a nucleic acid of known sequence can help to better judge the contamination introduced by sampling, experiment and other links in the detection result, but it is not necessary in the present invention. Not adding nucleic acid of known sequence will not affect the integrity of this technical solution, and adding nucleic acid of known sequence does not constitute an innovation to this technical solution.
  • the addition of nucleic acids of known sequence is not limited to this step.
  • the specific experiment of sequencing library construction includes the following steps:
  • the nucleic acid as the manipulation object is double-stranded DNA (dsDNA).
  • dsDNA double-stranded DNA
  • the two ends of the short fragment dsDNA generated by breaking the long fragment dsDNA will not be very neat, and usually a chain will have a few bases overhang, forming a sticky end.
  • the broken DNA needs to be repaired in different ways. For example, if using Thermo Fisher's Ion torrent sequencing platform, the ends need to be repaired to a fully flush form, and if using the illumina sequencing platform, they need to be end repaired to a form where one of the strands has an extra adenine (A).
  • A extra adenine
  • Fragment screening magnetic beads are used to screen nucleic acid fragments in the sample, and only nucleic acid fragments of suitable length are retained, and nucleic acid fragments that are too long or too short are discarded.
  • the lengths of nucleic acid fragments vary according to the selected sequencing platform, sequencing reagents, and sequencing conditions.
  • Sequencing adapters are two pieces of dsDNA with specific sequences. In a sequencing instrument, the sequencing reaction needs to start from these specific sequences, and the sequencing primers may or may not contain barcode/index sequences. The barcode/index sequences can be used to distinguish sequences from different sample sources in the same sequencing experiment. Using an enzymatic tool such as T4 ligase, the two sequencing primers were ligated to the ends of the end-repaired short dsDNA. Only dsDNA with a sequencing primer attached to each end can be sequenced.
  • Library enrichment All the nucleic acid sequences to be tested correctly linked with sequencing primers in a sample are called sequencing libraries.
  • Library enrichment is to use a certain method, usually PCR, to amplify the number of nucleic acid sequences correctly connected with sequencing primers, increase their copy number, and facilitate subsequent work. An enrichment step is not always necessary in an experimental workflow.
  • the construction of the sequencing library further includes a quality control step, the purpose of which is to detect whether the constructed sequencing library meets the sequencing requirements.
  • the means of quality control include detection of nucleic acid content, detection of nucleic acid purity, and detection of fragment length of enriched nucleic acid sequences. Only libraries with fragment lengths that meet the requirements of the sequencing instrument, sufficient content, and qualified purity can be used for subsequent sequencing. This quality control is a link in the experimental process to ensure the quality of the experiment. The control parameters are related to the selected sequencing platform, but it is not a necessary link required by this technical solution.
  • On-machine sequencing Carry out the experiment according to the instructions of the manufacturer, model and reagent of the selected sequencer.
  • the NGS sequencer manufacturers compatible with this technology include but are not limited to all the instruments and reagents currently on the market by mainstream manufacturers such as Thermo Fisher, illumina, and BGI.
  • the logic of the data analysis process in the present invention is to align (map) the short sequence fragments (reads) obtained by NGS sequencing to all reference sequences in the microbial genome characteristic sequence database; calculate the statistics of the actual coverage of the reads on each reference sequence Characteristic parameters such as CV, etc., and compare the actual reads coverage on each reference sequence with the theoretically predicted reads coverage (if the microorganism represented by the reference sequence is included in the sample), which can be based on mathematical models such as Probabilistic and statistical models, or empirical/experimental data construction.
  • the results of the comparison are reflected in a number of different comparative characterization parameters (such as NRMSE, etc.); according to whether the statistical characteristic parameters and the comparative characterization parameters meet the required standards, the reference sequences that do not meet the standards will be eliminated.
  • the screening operation is repeated cyclically until the statistical characteristic parameters and characterization parameters of all the remaining reference sequences meet the predetermined final criteria, then the microbial species represented by these reference sequences is the result of species detection. Calculate the proportion of the number of reads in each alignment of these reference sequences to the total number of reads of the reference sequence in the sample alignment, that is, the proportion of the microbial species represented by each reference sequence in the total number of sample microorganisms.
  • the above-mentioned reference sequence screening method based on the statistical feature parameters actually covered by the reads on each reference sequence and the comparative characterization parameters is only one of the screening methods that can be used, and other suitable screening methods can also be used. For example, only using statistical feature parameters Or compare the characterization parameters for screening; or use the method based on Fisher's exact test enrichment analysis; or the EM algorithm based on Bayesian probability, and so on.
  • the present invention performs clustering on all reference sequences, and selects a representative reference sequence, namely seeds, for each cluster; And select a representative reference sequence, namely seed ID, for each generated cluster; further clustering operations and selection of seed ID can be performed on the basis of seed ID as needed. This operation can be performed as many times as necessary.
  • a hierarchical clustering tree structure is constructed (the tree structure may or may not have a root), in which the leaves of the tree are composed of the reference sequences of the microbial genome feature sequence database, constituting a tree structure.
  • the bottom layer, the first layer node (node) above the leaf is the seed of the first layer of the cluster, and the nodes of the other layers are the seed IDs of the cluster formed by all the nodes connected below it.
  • the screening of database reference sequences can start from the top node of the clustering tree and proceed down layer by layer.
  • the comparative analysis includes:
  • step e merging the tertiary screening seed sequences, aligning the reads with it, and stepwise iterative screening to obtain a reference sequence that meets the requirements, wherein the threshold used in the stepwise iterative screening is more stringent than step d);
  • step f) According to the reference sequence obtained in step e) and the number of reads thereof, calculate the content of reads at the species level and its proportion; when calculating, add the number of reads of a plurality of reference sequences belonging to the same species to obtain the reads of the species
  • the proportion of each species in the sample is calculated by dividing the number of reads per species by the sum of the number of reads of the species contained in the sample.
  • the sample to be detected is a sample from a microbial host, and step a) further includes: removing interference from nucleic acid sequencing data of the host in the sample.
  • step d) after the step-by-step iterative removal of the seed sequence whose read coverage index does not meet the requirements, it also includes a screening step of the cluster internal reference sequence:
  • step g) is further included after step f): eliminating the interference of nucleic acid sequencing data of background contaminants in the experimental environment.
  • step b) the statistical independence test is Fisher's exact test, which specifically includes:
  • the reference sequence with the number of reads in each alignment greater than a certain number is recorded as "with read alignment", otherwise it is recorded as "without read alignment”; according to the clustering hierarchy relationship of seeds in the reference sequence database, the clustering is statistically tested Whether the seed IDs with "read alignment” are significantly enriched in the leaf nodes under each seed in the tree, and the seeds that meet the requirements are screened out step by step.
  • the construction method of the characteristic sequence database comprises:
  • Clustering is performed on the reference sequences according to their similarity in the second database.
  • the sequence of both ends of the reference sequence outside the amplification primer and the primer sequence in the database are removed.
  • the second database when constructing the second database, it also includes:
  • the most representative species taxonomy is selected among the species annotations of the set of matched reference sequences, and this taxonomic information is used to correct the species annotations for the referenced sequences.
  • the clustering process includes a first clustering process:
  • the clustering process further includes a second clustering process:
  • the clustering process further includes a third clustering process:
  • hierarchical clustering is performed on the seed reference sequence of the cluster obtained by the second clustering process to create a hierarchical nested tree.
  • the following describes the data analysis process in detail by taking the microbial detection based on prokaryotic 16S rRNA gene NGS sequencing as an example.
  • the data analysis process mainly includes two parts, 1) the construction method of the 16S rRNA gene (ie rDNA) reference sequence database used in the analysis process; 2) the algorithm process of data analysis using sequencing data.
  • Download databases include but are not limited to NCBI and SILVA databases.
  • the types and quantities of reference sequences contained in the database can be selected according to specific needs. For example, it can contain 100, 250, 500, 1000, 2000, 10,000, or even all existing types, etc. quantity.
  • the purpose of this step is to obtain a reference sequence that does not contain primers. You can use the Smith-Waterman local alignment algorithm or other methods of aligning and locating short sequences to match the amplification used in enrichment according to a certain similarity (such as but not limited to 80%, 85%, 90%, 95%, etc.).
  • the information of the amplification primers (referred to as amplification primers in this step) and the enrichment amplification primers are located in the reference sequence, and then the sequence is cut according to the matching positions of the primers to remove the amplification primers and the sequences at both ends of the primers. In addition, only the sequences at both ends other than the primers may be removed.
  • the certainty level of the species annotation includes: certain, limited certainty, rare Annotation (sorted by certainty from high to low) can also be classified according to the actual certainty of more/less levels, or even no classification.
  • Some reference sequences in the database have bases in partial positions that are ambiguous or contain degenerate bases. In order to identify the most representative bases at these positions, we calibrated according to the rules. The main steps are: use MUSCLE to perform multiple sequence alignment for reference sequences that are more than 97% similar within the same species, and use MUSCLE for ambiguous base positions. The most representative bases are replaced. After this step, if there are still ambiguous bases, a reference sequence with 97% similarity in the same genus is selected for multiple sequence alignment, and the most representative base is used to replace the ambiguous base position. Ambiguous bases that still exist after processing are retained.
  • This step is mainly to remove redundant reference sequences that are 100% similar, and at the same time to ensure that the required reference sequences are not removed by mistake. Therefore, this technical solution requires that the reference sequence to be removed needs to be included by other reference sequences, and the certainty level of species annotation is low. At the same time, it is also judged based on the matching of primers at both ends of the sequence, for example: primers at both ends of a reference sequence Complete, but completely contained by another reference sequence, in which case the previous reference sequence must be retained; alternatively, the shorter reference sequence may be retained if it has a higher level of species certainty. There are also some more complicated cases such as the processing of a reference sequence containing a primer end by another longer reference sequence.
  • clustering The purpose of clustering is to, on the one hand, disperse the computational pressure of subsequent analysis, and on the other hand reduce the competition between alignment sequences, so that suitable reference sequences can be selected step by step more accurately. All non-redundant reference sequences within each species were clustered according to the standard of 99.5% similarity (the similarity index could be adjusted appropriately according to the actual database reference sequences). After clustering, the representative sequence in each reference sequence class (cluster) is used as the seed of the class, and other sequences clustered in the same cluster are used as its subsequences.
  • All seeds obtained in 6) are clustered according to the similarity of 99% (which can be adjusted according to the similarity index of seed clustering in 6)), and the seeds and their subsequences of different species that are clustered in the same class are clustered. Merge, and then perform reclustering according to 99.5% similarity, and replace the old cluster participating in the reclustering calculation with the newly obtained cluster.
  • this step can be combined with 6) to directly use all the sequences for clustering, instead of clustering within species first and then merging similar classes between species as described above. However, if there are too many species with close distances, the direct clustering method may not achieve a good analysis effect.
  • the cluster When there are too many subsequences of seed in the cluster, split the cluster according to higher similarity (such as 99.6%, 99.8%, 99.9%, etc.) and replace the split with the new cluster formed after splitting previous cluster. So far, a database containing a two-layer cluster hierarchy has been constructed: the first layer consists of the seed reference sequences of the cluster, and the second layer contains the sub-reference sequences of all clusters. The schematic diagram of its structure is shown in Figure 3.
  • All seed reference sequences are clustered according to the similarity of 97%, 98%, and 99% (represented by these three clustering similarity indicators, as long as the effect of layer-by-layer clustering can be achieved).
  • the main steps are: first, cluster all the seed reference sequences according to 97% similarity to obtain the cluster of the first layer (including seeds and their subsequences); then for each cluster, its internal reference sequences are 98% similar to each other. Then, cluster each cluster in the second layer according to the 99% similarity of its internal reference sequences to obtain the cluster of the third layer. Finally, an index hierarchy including all seed reference sequences in the database is constructed, so that each seed reference sequence obtains seed IDs corresponding to three different sequence similarity clusters. So far, the index information of all reference sequences constitutes a file consisting of four columns of information.
  • a schematic diagram of the seed clustering hierarchy of the reference sequence database is shown in Figure 4.
  • the sequencing short sequence (read) data in Fastq format entering the analysis process is screened, and reads with low quality and adapter sequences are filtered out.
  • bowtie2 or other short sequence alignment tools
  • bowtie2 or other short sequence alignment tools
  • the alignment tool is not limited, it can efficiently achieve the purpose of accurately matching the reads and the reference sequence.
  • the primary screening of seeds is carried out using the principle of enrichment analysis. The specific process is as follows:
  • All read sequences were aligned with all seed sequences in the database using hisat2 software.
  • the alignment results are filtered with the read mismatch rate lower than a certain threshold (such as 0.5%, 1%, 1.5%, 2%, etc.) as the condition, and the reads generated by PCR duplication in the alignment are removed.
  • the reference sequence with the number of reads in each alignment greater than a certain number is marked as "with read alignment", otherwise it is marked as "without read alignment”.
  • Fisher's exact test is used for enrichment analysis to statistically test whether the leaf nodes under each seed in the clustering tree are significantly enriched.
  • the seed ID of "read alignment” and filter out the seeds that meet the requirements step by step.
  • the table is a count of seed IDs that satisfy both the row and column criteria.
  • a certain threshold such as 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, etc.
  • the alignment tool is not limited, it can achieve the purpose of accurately matching the reads and the reference sequence, such as other sequence alignment tools MAQ, SOAP, BWA, NovoAlign, etc.
  • the read coverage of each seed sequence is calculated, the index for evaluating the seed sequence coverage is calculated, and the seed sequence is subjected to secondary screening.
  • the specific calculation process is as follows:
  • Use bowtie2 to align the read sequence with the primary screened seed sequence (mainly to obtain alignment that meets certain requirements for alignment similarity, and refs do not compete for reads), according to thresholds (such as 0.5%, 1%, 1.5%, The reads mismatch rate of 2%, etc.) filters the alignment and removes the reads generated by PCR duplication in the alignment. Calculate the read coverage of each seed sequence.
  • the calculation indicators include CV (Coefficient of variation), coverage (coverage), mean depth (average sequencing depth), end gap (coverage gaps at both ends of the sequence), and middle gap (the middle of the sequence). cover gap) etc.
  • the seed sequence is screened according to the relatively loose index, and the seed sequence that meets the index passes the secondary screening.
  • the seed sequences that passed the secondary screening were randomized.
  • bowtie2 or other sequence alignment tools such as MAQ, SOAP, BWA, NovoAlign, etc. are used to align the read sequence with the seed sequence, and the seeds compete with each other for reads during the alignment process.
  • the filtered seeds were randomly grouped, and bowtie2 was used to align the read sequence with the seed sequence within each group.
  • the seed sequences compete for reads, and reads are randomly assigned among the seed sequences with the highest alignment score.
  • the alignment is filtered according to the reads mismatch rate of a certain threshold, and the reads generated by PCR duplication in the alignment are removed.
  • Calculate the coverage index of alignment including Cor (Pearson's correlation coefficient, calculate the consistency between the read coverage model of the expected seed sequence and the actual coverage obtained by alignment), NRMSE (Normalized root mean square error, between the expected model and the actual coverage) difference), coverage, mean depth, end gap, middle gap, etc.
  • the coverage index of alignment meets the set parameters, such as mean depth ⁇ 15 or 20, end gap ⁇ 30 or 40, NRMSE ⁇ 0.4 or 0.35, middle gap ⁇ 0 or 5 or 10 or 15, etc., where mean
  • the depth is calculated from the level of the entire species, that is, the sum of the depths of multiple seeds belonging to the same species can satisfy the parameters.
  • the stricter the parameters the fewer seeds may be entered into the subsequent analysis. If the parameters are stricter, the seeds with good coverage are more likely to be the final candidate reference sequences. Filter out the seed sequences that do not meet the requirements until all the remaining seed sequences meet the requirements. Merge the seed sequences screened in each group and enter the next step.
  • the first steps of the process are filtered at the seed layer of the database.
  • this step the subsequence of the cluster to which each seed belongs to obtained by the tertiary screening is screened.
  • this step of screening can also be adjusted to: if the coverage of the seed sequence screened in 5) is good enough, it can directly enter the final step candidate. Only the seeds with insufficient reads coverage are used for intra-cluster alignment screening. The former method is taken as an example for detailed description below.
  • the alignment tool is not limited, it can achieve the purpose of accurately matching the reads and the reference sequence
  • the reads coverage of each reference sequence is counted, filtered according to the reads coverage index, and the reference sequences with poor reads coverage are iteratively removed round by round.
  • the specific calculation process is as follows:
  • the alignment is filtered according to the reads mismatch rate with a stricter threshold than the previous steps.
  • reads that may be mismatched are removed according to the proportion of dominant bases. Reads due to PCR duplication in the alignment were removed.
  • Calculate the read coverage indicators including Cor (calculate the consistency between the expected reads coverage model and the actual reads coverage), NRMSE (calculate the difference between the expected reads coverage model and the actual reads coverage), coverage, mean depth, end gap , middle gap, etc. According to whether the read coverage index meets the set parameters, the reference sequences that do not meet the requirements are filtered out.
  • Each round of iteration filters out at most 1%, or 5%, or 10%, or 15%, or 30% of the total number of reference sequences (the proportion can be adjusted according to the desired convergence speed), until the remaining reference sequences are all satisfied Require. Then, multiple reference sequences belonging to the same species were deduplicated. Use MUSCLE (or other sequence alignment tools such as ClustalW, T-coffee, MAFFT, etc.) to align the related reference sequences in pairs, and remove the one with poor read coverage among the two reference sequences whose alignment positions are completely identical. Finally, the reference sequence obtained from the internal screening of the cluster to which each seed belongs is used as a candidate reference sequence to enter the next process.
  • MUSCLE or other sequence alignment tools such as ClustalW, T-coffee, MAFFT, etc.
  • the reference sequences obtained from the internal screening of the cluster were merged, and the reads were aligned with the reference sequences using bowtie2.
  • the alignment, screening, and iterative procedures are the same as 6), but the coverage screening parameters are more stringent than 6).
  • the reference sequences are finally filtered according to the number of unique reads owned by each reference sequence. If the alignment positions of a reference sequence and another reference sequence are completely consistent within the specified regions (the end sequences are not considered), the number of unique reads will be compared according to the sequence similarity and the number of unique reads owned by each of them. Less (the quantity difference needs to meet a certain range, for example, the difference multiple exceeds 1.5, 2, 2.5, etc.) and the side is removed.
  • the content and proportion of reads at the species level were calculated.
  • the number of reads of multiple reference sequences belonging to the same species is summed to obtain the number of reads of the species, and then the proportion of each species in the sample is calculated, that is, the number of reads of each species divided by the sum of the number of reads of the species contained in the sample.
  • the logistic regression model was used to calculate the reliability of each reference sequence, and the training data of the model came from the experimental analysis results of multiple batches.
  • the method for removing background contamination in this technical solution includes two steps: the first step is to calculate the proportion of species and determine whether the species is in the list of important clinical pathogenic species, that is, the proportion is very low and not in the list of important clinical pathogenic species Species that appear in the negative control samples will be excluded as false positives; in the second step, the species present in the negative control samples will be filtered. But since clinical samples may indeed contain certain microbial species present in the environment, they cannot simply be ruled out directly.
  • the main method for removing background contamination species is to calculate the normalized content of the species reads in each clinical sample and control sample respectively, and then calculate that the content of this species in the clinical samples is one from the species (or detected in all control samples).
  • species statistical distributions (such as normal distribution, Poisson distribution, Weibull distribution, or other known theoretical distributions, or empirical distributions based on data resampling (such as bootstrapping, Jackknife, etc.) ) sample probability. If the probability is greater than a certain threshold (such as 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, etc.), the species is considered to be background contamination and removed from the detection results, otherwise it is retained.
  • a certain threshold such as 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, etc.
  • the screening process of the reference sequence database based on the sequencing reads data has obtained the reference sequence that finally meets the requirements.
  • the species taxonomic annotation information of the obtained reference sequence the species composition of the microorganisms contained in the sample can be obtained.
  • Microorganism sequencing data analysis device computer readable storage medium and electronic device
  • the present invention also relates to a microbial sequencing data analysis device, the device comprising:
  • a sequencing data acquisition module which is used for acquiring sequencing data, the sequencing data is obtained by sequencing the amplification product through the next generation sequencing technology after targeting and enriching the characteristic nucleic acid sequences of microorganisms;
  • a characteristic sequence database building module which is used to perform clustering processing on the sequencing data to obtain a characteristic sequence database, and the clustering processing is defined by the clustering processing in the above method;
  • a comparison analysis module which is used for performing comparison analysis on the sequencing data and the characteristic sequence database to identify the microbial composition in the sample to be detected, and the comparison analysis is the comparison in the above method. Defined for analysis.
  • the apparatus includes:
  • a sequencing data acquisition module which is used for acquiring sequencing data, the sequencing data is obtained by sequencing the amplification product through the next-generation sequencing technology after the primers carry out targeted enrichment of microbial characteristic sequences;
  • a feature sequence database building module which is used to perform clustering processing on the sequencing data to obtain a feature sequence database
  • the feature sequence database includes one or more levels of clusters, and each level cluster has at least one sub-seed, and the bottom layer has at least one sub-seed. There are several seeds in the cluster that are used as reference sequences;
  • a comparison analysis module which is used for performing comparison analysis between the sequencing data and the characteristic sequence database to identify the microbial composition in the sample to be detected, and the comparison analysis module includes:
  • the second module which is used to compare the reads sequence with the seeds sequence of the characteristic sequence database, remove the reads generated due to PCR repetition, and do statistical independence test to the reads sequence and the seeds sequence, select Get the seeds sequence related to the reads sequence to obtain the primary screening seed sequence;
  • a third module which is used to align the read sequence with the primary screening seed sequence, wherein the primary screening seed sequences do not compete for reads, calculate the read coverage of each seed sequence, and calculate and evaluate the seed sequence Covered indicators, and secondary screening of the seed sequence based on this;
  • the fourth module which is used for aligning the read sequence with the seed sequence obtained by the secondary screening, competing for reads during the alignment of the seed sequence, calculating the read coverage index of the seed sequence in the alignment, and iteratively removing the read step by step.
  • the third-level screening seed sequence is obtained;
  • the fifth module which is used for merging the tertiary screening seed sequences, aligning the reads with it, and iteratively screening to obtain a reference sequence that meets the requirements, wherein the threshold used in the step-by-step iterative screening is more stringent than step d). ;
  • the sixth module which is used to calculate the content of reads at the species level and its proportion according to the reference sequence obtained in step e) and the number of reads thereof; when calculating, the number of reads of multiple reference sequences belonging to one species Add up to get the number of reads for that species, and then calculate the proportion of each species in the sample by dividing the number of reads for each species by the sum of the number of reads for the species contained in the sample.
  • the present invention also relates to a computer readable storage medium for storing computer instructions, programs, code sets or instruction sets which, when run on a computer, cause the computer to perform the steps in the method as described above ii).
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples (a non-exhaustive list) of computer readable storage media include: electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), Erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory, optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a propagated data signal in baseband or as part of a carrier wave, with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including object-oriented programming languages—such as Java, Smalltalk, C++, but also conventional procedural languages, or a combination thereof.
  • Programming Language such as "C" language or similar programming language.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (eg, using an Internet service provider through Internet connection).
  • LAN local area network
  • WAN wide area network
  • an electronic device comprising:
  • a storage device that stores one or more programs
  • the one or more processors When the one or more programs are executed by the one or more processors, the one or more processors cause the one or more processors to implement step ii) in the method as described above.
  • the electronic device may further include a transceiver.
  • the processor and the transceiver are connected, eg, via a bus.
  • the transceiver is not limited to one, and the structure of the electronic device does not constitute a limitation on the embodiments of the present application.
  • the processor may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA, or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
  • a processor can also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
  • the bus may include a path to transfer information between the aforementioned components.
  • the bus can be a PCI bus or an EISA bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the memory 802 can be ROM or other types of static storage devices that can store static information and instructions, RAM or other types of dynamic storage devices that can store information and instructions, or EEPROM, CD-ROM or other optical disk storage, optical disk storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or capable of carrying or storing desired program code in the form of instructions or data structures and capable of being executed by a computer Access any other medium without limitation.
  • 16s rDNA is used as an example to illustrate the construction process of the reference sequence database.
  • Download databases include but are not limited to NCBI and SILVA databases.
  • the types and quantities of reference sequences contained in the database can be selected according to specific needs. For example, it can contain 100, 250, 500, 1000, 2000, 10,000, or even all existing types, etc. quantity.
  • the purpose of this step is to obtain a reference sequence that does not contain primers. You can use the Smith-Waterman local alignment algorithm or other methods of aligning and locating short sequences to match the amplification used in enrichment according to a certain similarity (such as but not limited to 80%, 85%, 90%, 95%, etc.).
  • the information of the amplification primers (referred to as amplification primers in this step) and the enrichment amplification primers are located in the reference sequence, and then the sequence is cut according to the matching positions of the primers to remove the amplification primers and the sequences at both ends of the primers. In addition, only the sequences at both ends other than the primers may be removed.
  • the certainty level of the species annotation includes: certain, limited certainty, rare Annotation (sorted by certainty from high to low) can also be classified according to the actual certainty of more/less levels, or even no classification.
  • Some reference sequences in the database have bases in partial positions that are ambiguous or contain degenerate bases. In order to identify the most representative bases at these positions, we make corrections according to the rules. The main steps are: the use of reference sequences whose similarity is greater than 97%, or 97.5%, or 98%, or 98.5%, etc. within the same species. MUSCLE performs multiple sequence alignments and replaces the most representative bases for ambiguous base positions. After this step, if there are still ambiguous bases, select the reference sequences with 97%, or 97.5%, or 98%, or 98.5% similarity in the same genus for multiple sequence alignment, and use the most ambiguous base position for the sequence alignment. Replacement of representative bases. Ambiguous bases that still exist after processing are retained.
  • This step is mainly to remove redundant reference sequences that are 100% similar, and at the same time to ensure that the required reference sequences are not removed by mistake. Therefore, the technical solution requires that the removed reference sequence should be included by other reference sequences, and the certainty level of species annotation is low.
  • All seeds obtained in 6) are clustered according to the similarity of 98%, or 98.5%, or 99% (which can be adjusted according to the similarity index of seed clustering in 6)), and will be clustered in the same class.
  • the seeds of different species and their subsequences are merged, and then reclustered according to 99.5% similarity, and the old cluster involved in the reclustering calculation is replaced with the newly obtained cluster.
  • this step can be combined with 6) to directly use all the sequences for clustering, instead of clustering within species first and then merging similar classes between species as described above. However, if there are too many species with close distances, the direct clustering method may not achieve a good analysis effect.
  • the cluster When there are too many subsequences of seed in the cluster, split the cluster according to higher similarity (such as 99.6%, 99.8%, 99.9%, etc.) and replace the split with the new cluster formed after splitting previous cluster. So far, a database containing a two-layer cluster hierarchy has been constructed: the first layer consists of the seed reference sequences of the cluster, and the second layer contains the sub-reference sequences of all clusters.
  • All seed reference sequences are clustered according to the similarity of 97%, 98%, and 99% (represented by these three clustering similarity indicators, as long as the effect of layer-by-layer clustering can be achieved).
  • the main steps are: first, cluster all the seed reference sequences according to 97% similarity to obtain the cluster of the first layer (including seeds and their subsequences); then for each cluster, its internal reference sequences are 98% similar to each other. Then, cluster each cluster in the second layer according to the 99% similarity of its internal reference sequences to obtain the cluster of the third layer. Finally, an index hierarchy including all seed reference sequences in the database is constructed, so that each seed reference sequence obtains seed IDs corresponding to three different sequence similarity clusters. So far, the index information of all reference sequences constitutes a file consisting of four columns of information.
  • the seed clustering hierarchy of the reference sequence database is shown in the figure below.
  • Mock samples are generated by artificially selecting multiple reference sequences in single or different combinations from the database.
  • the reference sequence is randomly interrupted to generate reads whose length distribution conforms to a normal distribution, and a set of reads with a specified sequencing depth is randomly generated to simulate the sequencing result.
  • a total of 83 samples composed of equal mixtures of different species were generated by simulation, including 24 single-species samples and 59 multi-species mixtures (the number of species varied from 2 to 12) samples.
  • the average depth of reference sequence reads ranged from 20X to 800X range.
  • This batch of data analysis uses a database containing 2,119 pathogenic bacteria, and at the same time builds a hierarchical relationship of reference sequences according to 97%, 98%, and 99% similarity, including a total of 34,025 primary representative sequences, and the total number of sequences. 83,886 (the reference sequence in this version is the number of reference sequences with redundancy removed).
  • the alignment parameter is --min-score L,-1,-0.1-a, and the reference sequences do not compete for reads. Then filter the alignment according to 0.5% mismatch reads rate, and remove the reads generated by PCR repetition in the alignment, calculate the coverage of each reference sequence and filter according to the parameters CV ⁇ 0.55&gap ⁇ 40&mean depth ⁇ 20, pass the threshold The screened reference sequences are used as candidate sequences for the next step.
  • the iterative screening steps within the secondary cluster are the same as 3) and 4), where the screening threshold parameters in the same step 4) are CV ⁇ 0.55&Cor ⁇ 0.6&mean depth ⁇ 20&gap ⁇ 25, and the reference sequences that meet the conditions enter the next step of analysis.
  • Gram-positive bacteria are used as experimental objects, which are simple bacteria species classified as known species cultivated in liquid medium. The purpose is to investigate whether the technical solution can correctly identify the microorganisms in the sample, and to investigate the lower sensitivity limit of the detection ability of the technical solution.
  • the samples of single species of Staphylococcus epidermidis were cultured in the laboratory and counted by agar plate culture to confirm the number of bacterial CFU added to the experimental samples.
  • the input amount is 2560CFU, and the sample numbers are INQ19M0101 and INQ19M0102.
  • the input amount is 640CFU, and the sample numbers are INQ19M0103 and INQ19M0104.
  • the input amount is 160CFU, and the sample numbers are INQ19M0105 and INQ19M0106.
  • the input amount is 40CFU, and the sample numbers are INQ19M0107 and INQ19M0108.
  • the input amount is 10CFU, and the sample numbers are INQ19M0109 and INQ19M0110.
  • the selected sample is bacteria
  • the extraction protocol is to extract DNA contained in the sample.
  • the extracted nucleic acid may also be RNA, or DNA and RNA are extracted at the same time; the nucleic acid extraction kit used is not limited to the manufacturer or the product number.
  • PCR amplification primers are: F: 5'AGAGTTTGATCMTGGCTCAG 3', or 5'AGAGTTTGATCCTGGCTCAG 3', or 5'CTCCTACGGGAGGCAGCAG 3', or 5'GTGCCAGCMGCCGCGG 3', or 5'AAACTYAAAKGAATTGACGG 3', or 5'GCAACGAGCGCAACCC 3', or 5'AGAGTTTGATCATGGCTCAG 3', or 5'AACTGAAGAGTTTGATCCTGGCTC 3'; R: 5', GGTTACCTTGTTACGACTT 3', or 5'GGYTACCTTGTTACGACTT 3', or 5'CTGCTGCSYCCCGTAG 3', or 5'GWATTACCGCGGCKGCTG 3', or 5'CCGTCAATTCMTTTRAGTTT 3', Or 5'GGGTTGCGCTCGTTG 3', or 5'AAGGAGGTGWTCCARCC 3', or 5'TA
  • the reagents used were Ex Taq HS, Baori Doctor Biotechnology (Beijing) Co., Ltd., product number RR006A.
  • the method used to enrich the target sequence is to enrich the 16S rRNA gene sequence by PCR.
  • the enrichment method used may also be hybrid capture;
  • the enzyme used in PCR is not limited to the company's product number, it may be other product numbers or other companies' Taq enzymes, or other types of enzymes, as long as
  • the system and amplification conditions are only suitable reaction conditions for the currently used Taq enzyme and used primers, and are not intended to limit the technical solution.
  • the reaction system is:
  • reaction conditions are:
  • DNA sorting magnetic beads are used to purify the PCR product.
  • the purification method is not limited to the use of magnetic beads for purification, but may also be other methods that can purify PCR products such as adsorption column purification; magnetic bead purification is not limited to DNA sorting magnetic beads, nor is it limited to this
  • the company's products, magnetic beads that can purify PCR products can be used; even, whether purification is required, or the choice of purification form, depends on the enrichment method of the target sequence in the previous step, as well as the reagents and reaction conditions used in the next reaction. Selection may or may not require purification, and may not purify the "PCR product".
  • Step (3) is repeated once.
  • PCR product quantification use Qubit 3.0 reagent (Qubit dsDNA BR Assay Kit, Thermo Fisher Scientific Q32850) to quantify the purified PCR product.
  • Qubit 3.0 reagent Qubit dsDNA BR Assay Kit, Thermo Fisher Scientific Q32850
  • a fluorescent dye method was used for the quantification of PCR products.
  • the quantitative method is not limited to the fluorescent dye method, and may also be other dye methods or non-dye methods, such as ultraviolet spectrophotometer method, fluorescence quantitative PCR method, capillary electrophoresis or Microfluidic electrophoresis combined with nucleic acid dye fluorescence imaging method, etc.; for the fluorescent dye method, it is not limited to the reagents of the company or the product number; the method used can accurately reflect the quality of nucleic acid in the sample (refer to the substance content, not the quality). ; If UV spectrophotometry is used, information on nucleic acid quality can also be obtained, which also belongs to the scope of quality control of this technical solution and does not constitute an innovation to this technical solution.
  • DNase I deoxyribonuclease I
  • the nucleic acid in the sample is interrupted by using the nuclease DNase I.
  • the method used is not limited to the use of biological methods such as enzymes, but may also be physical methods such as ultrasound, or chemical methods, or other types of methods, which can controllably break long fragments of nucleic acids into short films.
  • 5-100ng is not a limiting condition, it is just a more suitable condition, more or less nucleic acid can also be used in this technical solution, 0.01ng or less nucleic acid can also be used in this technical solution get the correct result.
  • the amount of DNase I used is only a more suitable amount under the current conditions, and is not a limiting factor.
  • the description of the purification method is the same as before.
  • Step (3) is repeated once.
  • DNA end repair using reagents Pfu enzyme (Tiangen Biochemical Technology (Beijing) Co., Ltd., EP101), dNTP Mixture (Tiangen Biochemical Technology (Beijing) Co., Ltd., CD111), DNA sorting magnetic beads (Wuxi Baimai) Grid Biotechnology Co., Ltd., BMSX).
  • the operation steps are as follows: In this embodiment, the dsDNA after end repair is blunt-ended, that is, the double-stranded ends of the DNA are flush. In this technical solution, the end repair of small fragments varies according to the selected sequencing platform.
  • the illumina sequencing platform when the illumina sequencing platform is selected, there will be a protruding adenine (A) at the end of one strand of the repaired dsDNA.
  • the repair method It does not constitute a restriction of this technical solution; the enzyme selected for end repair is not limited to pfu enzyme, but may also be Taq enzyme or other enzymes; the selection of pfu enzyme is not limited to the company's product number.
  • nucleic acid fragment length screening was performed.
  • fragment screening is not limited to the use of DNA fragments to separate magnetic beads, and other methods are also possible.
  • nucleic acids of desired length fragments are selected and recovered, and the nucleic acid fragments can be selected;
  • the DNA fragment sorting magnetic beads used are not limited to the company or the product number; the length of the fragments retained after screening is related to the selected sequencing instrument, sequencing reagents, and sequencing parameter settings, and does not constitute a restriction on this technical solution; the length of nucleic acid fragments
  • the screening is not limited to after the end repair, it may also be before the end repair, or after the next step or the next step, this adjustment does not constitute a limitation to this technical solution. In this technical solution, the description of the purification method is the same as before.
  • Step (c) is repeated once.
  • Sequencing adapter ligation The reagents used were T4 DNA ligase (Thermo Fisher Scientific, EL0011), DNA sorting magnetic beads (Wuxi Biomag Biotechnology Co., Ltd., BMSX), and the operation steps were as follows:
  • the sequencing primers were the sequencing primers of the Ion Torrent sequencing platform of Thermo Fisher Company.
  • the selection of sequencing primers is affected by sequencing instruments and sequencing reagents, and does not constitute a restriction on this technical solution;
  • the ligase used is not limited to T4 ligase, an enzyme that can connect two nucleic acid fragments or Other technical methods can be used;
  • the selection of T4 ligase is not limited to the company or the product number;
  • the ratio of the reaction system and the reaction conditions are only suitable conditions at present, and do not constitute restrictions on the technical solution.
  • the description of the purification method is the same as before.
  • Step (c) is repeated once.
  • Sequencing library enrichment The reagents used were HiFi high-fidelity Taq enzyme (KAPA biosystems, KK2602), DNA sorting magnetic beads (Wuxi Baimaige Biotechnology Co., Ltd., BMSX), and the operation steps were as follows: In the present embodiment, high-fidelity Taq enzyme was used in PCR mode Perform sequencing library enrichment. In this technical solution, library enrichment is not necessary for this technical solution, and omitting this step does not constitute an innovation to this technical solution; the enrichment method is not limited to the PCR method, but can increase the ratio or content of the available sequencing library in the sample.
  • the PCR method is not limited to the selection of Taq enzyme, and other enzymes capable of nucleic acid amplification may be used in this solution; Taq enzyme is not limited to the company or the product number; the selection of PCR amplification primers is subject to sequencing instruments, sequencing The influence of the reagents does not constitute a limitation on the technical solution; the ratio of the reaction system and the reaction conditions are only suitable conditions at present, and do not constitute a limitation on the technical solution. In this technical solution, the description of the purification method is the same as before.
  • Step (c) is repeated once.
  • Quantify the sequencing library use Qubit 3.0 reagent (Qubit dsDNA BR Assay Kit, Thermo Fisher Scientific Q32850) to quantify the purified PCR products.
  • Qubit 3.0 reagent Qubit dsDNA BR Assay Kit, Thermo Fisher Scientific Q32850
  • a fluorescent dye method was used for the quantification of the sequencing library.
  • the quantitative method is not limited to the fluorescent dye method, but may also be other dye methods or non-dye methods, such as UV spectrophotometer method, fluorescence quantitative PCR method, capillary electrophoresis or microfluidic method Controlled electrophoresis combined with nucleic acid dye fluorescent imaging methods, etc.;
  • the fluorescent dye method is not limited to the reagents of the company or the product number; the method used can accurately reflect the quality of the nucleic acid in the sample (refer to the substance content, not the quality); if Using ultraviolet spectrophotometry, it is also possible to obtain information on the quality of nucleic acid, which also belongs to the scope of quality control of this technical solution, and does not constitute an innovation to this technical solution.
  • the sequencing libraries of different samples sequenced in the same batch are mixed. According to the quantitative results of Qubit, equal amounts are added to different samples to make a mixed library. In this embodiment, the sequencing libraries of different samples are mixed in equal amounts. In this technical solution, the mixing of different samples can also be unequal; the number of samples mixed each time can be flexibly adjusted according to the sequencing equipment, sequencing reagents, sequencing methods, actual experimental needs, etc., which does not constitute a limitation to this technical solution.
  • the quantitative method can also be other dye methods or non-dye methods, such as Qubit, UV spectrophotometer, Agilent 2100 bioanalyzer, etc.; the choice of fluorescence quantitative PCR method is not limited to the company or the product number.
  • manufacturers of NGS sequencers that are compatible with the experimental process include, but are not limited to, Thermo Fisher, Illumina, BGI and other mainstream manufacturers that currently sell instruments, reagents, and sequencing methods in the market, which do not constitute a restriction on this technical solution.
  • the off-machine data is filtered to filter out low-quality reads, and the remaining high-quality clean data can be used for later analysis.
  • the analysis process is as follows:
  • Database Contains 252 important clinical pathogenic microorganisms, including 2396 representative sequences. The selection of representative sequences is similar to the first-level seed selection method in the aforementioned database construction process. The difference is that all sequences are directly clustered according to 99.5% similarity. The seeds of each class are selected as representatives to form the representative set.
  • the average number of detected sequences for 10 samples was 55,931, and S. epidermidis was correctly detected in all samples with a sensitivity of 100%.
  • Gram-negative bacteria are used as experimental objects, which are simple bacteria species classified by known species cultivated in liquid medium. The purpose is to investigate whether this technical solution can correctly identify microorganisms in the sample, and to investigate the lower limit of sensitivity of the detection ability of this technical solution.
  • a single species sample (Serratia marcescens) was counted using agar plate culture to confirm the number of bacterial CFU added.
  • the input amount is 2560CFU, and the sample number is INQ19M0111-INQ19M0112.
  • the input amount is 640CFU, and the sample number is INQ19M0113-INQ19M0114.
  • the input amount is 160CFU, and the sample number is INQ19M0115-INQ19M0116.
  • the input amount is 40CFU, and the sample number is INQ19M0117-INQ19M0118.
  • the input amount is 10CFU, and the sample numbers are INQ19M0119-INQ19M0120.
  • the average number of detected sequences in the 10 samples was 54,703. Except for the sample INQ19M0119, Serratia marcescens could be correctly detected in the rest of the samples.
  • the average number of detected sequences in the two samples was 62,970, and in the two samples, the added 5 species could be correctly detected.
  • the purpose of this example is to examine whether this technical solution can correctly identify species with a low proportion of samples containing multiple species.
  • Two laboratory-grown microbial strains (S. epidermidis, Serratia marcescens) were counted on agar plates to confirm the number of bacterial CFUs.
  • the samples were formed by 16-fold dilution gradient mixing (the mixed amount of Serratia marcescens was 3200 CFU, the mixed amount of Staphylococcus epidermidis was 200 CFU), and the sample numbers were INQ19M0143 and INQ19M0144.
  • the average number of detected sequences in the two samples was 46,026, and in the two samples, two species with a 16-fold difference in content could be correctly detected.
  • the database contains 252 important clinical pathogenic microorganisms, including 1920 primary representative sequences and 143009 total sequences (the reference sequences in this version are not de-redundant).
  • step 5 Perform internal iterative screening of the candidate sequences screened in step 5.
  • the screening method is the same as steps 3, 4, and 2.
  • the iterative screening parameters of the same step 3 are CV ⁇ 0.5&gap ⁇ 15&mean depth ⁇ 20, and the iterative screening parameters of the same step 4 are CV ⁇ 0.5&Cor ⁇ 0.7&mean depth ⁇ 20&gap ⁇ 25. Combine all candidate seeds that meet the requirements as the final candidate reference sequence.
  • step 7 Align the reads with the final candidate reference sequence obtained in step 6, filter the alignment according to the mismatch reads rate of 0.5%, and remove the reads generated by PCR repetition in the alignment. Calculate the CV, gap, Cor, CV/Cor, mean depth indicators of each reference sequence, and then perform step-by-step iterative screening according to the threshold CV ⁇ 0.5&Cor ⁇ 0.7&mean depth ⁇ 20&gap ⁇ 25. All the seeds that meet the requirements are used as the final target sequence, and the corresponding species are the detected species. At the same time, the read count of each species and its proportion in the sample are calculated.
  • the number of detected sequences in the samples was 55,115, and human sequences accounted for 0.020% of all detected sequences.
  • the detection results of this technology show that the samples contain Escherichia coli and Enterococcus faecium, which are consistent with the results obtained by microbial culture and other methods.
  • the number of detected sequences in the samples was 67,797, and human sequences accounted for 9.50% of all detected sequences.
  • the detection results of this technology show that the samples contain Escherichia coli, which is consistent with the results obtained by microbial culture and other methods.
  • Sample types include: bile, pleural effusion, joint effusion, cerebrospinal fluid, urine, pus, pericardial effusion, drainage fluid, sputum, etc., including the main sample types for clinical microbiological testing.
  • Database Contains more than 18,000 species (including all known bacteria, mycoplasma, chlamydia, rickettsia, spirochetes). The hierarchical relationship of the reference sequences was constructed according to the similarity of 97%, 98% and 99%, including 30,816 primary representative sequences and 154,392 total sequences (the reference sequence in this version is the number of reference sequences with redundancy removed).
  • the seed IDs of significantly enriched reads were screened according to the relatively loose enrichment p value of 0.1, and the corresponding seeds were extracted as the candidate reference sequences for the next step.
  • step 5 Align the reads with the reference sequence screened in step 4, and use bowtie2 software with the same parameters as step 4. The alignments were then filtered at a mismatch reads rate of 0.5%, and reads in the alignments due to PCR duplication were removed. For sites with a dominant base ratio of less than 0.95 in the reference sequence alignment site, only the reads that support the reference sequence are kept. Calculate the CV, gap, Cor, and mean depth indicators of each reference sequence, follow the indicators of CV ⁇ 0.6&Cor ⁇ 0.65&mean depth ⁇ 20&gap ⁇ 25, and perform step-by-step iterative screening until all reference sequences meet the requirements.
  • the reference sequence screened in step 5 directly enters the final iteration, otherwise, it needs to enter the secondary cluster of the reference sequence for competitive iterative screening.
  • the iterative screening steps within the cluster are the same as 4 and 5.
  • the threshold parameters for screening are CV ⁇ 0.6&Cor ⁇ 0.7&mean depth ⁇ 20&gap ⁇ 25.
  • step 7 Merge all the reference sequences that meet the conditions screened in step 6, perform final iterative screening, and use bowtie2 for alignment, and the parameters are the same as in step 4. The alignments were then filtered using a mismatch reads rate of 0.5%, and reads in the alignments due to PCR duplication were removed. Calculate the coverage of each reference sequence, and perform step-by-step iterative screening according to the threshold CV ⁇ 0.6&Cor ⁇ 0.7&mean depth ⁇ 20&gap ⁇ 25. The reference sequence that finally meets the requirements is used as the final target sequence, and the corresponding species is the target species. At the same time, the number of reads of each species and the proportion in the sample are calculated.
  • the positive rate of all samples using this technical solution was 98.9% (88/89).
  • the agreement between the technical solution and the bacterial culture identification test results was 90.6% (48/53).
  • This example is used to illustrate that in the technical solution, after adding a certain amount of nucleic acid with a known sequence in a certain link, the ability to identify potential contamination results in the detection results can be improved.
  • the drainage fluid samples collected from the hospital for the identification of pathogenic microorganisms are divided into two parts after the sample is collected, and the identification is carried out by using this technical solution or the method of bacterial culture respectively.
  • the added part includes: step 2, when the target gene is amplified, a certain amount of plasmid vector containing nucleic acid of known sequence is added to the reaction solution of the PCR reaction, and the nucleic acid extraction product obtained by the negative control during sample extraction is also added. The same amount of nucleic acid of known sequence is reacted simultaneously. The length of the added nucleic acid sequence is similar to that of the 16S rRNA gene. In this example, a nucleic acid with a known sequence is selected to be added in this step.
  • nucleic acid of known sequence is not limited to being added in this step, and can also be added before nucleic acid extraction , or added after this step, or added at any step deemed appropriate in this technical solution; the added nucleic acid sequence can exist alone, or a plasmid, virus, cell, etc. can be used as a carrier; the amount of added nucleic acid sequence can be from 1 copy to an infinite number of copies.
  • Database Contains 2202 clinical pathogenic bacteria species. At the same time, the hierarchical relationship of reference sequences was constructed according to the similarity of 97%, 98% and 99%, including 40,035 primary representative sequences and 92,385 total sequences (the reference sequence is the number of reference sequences with redundancy removed).
  • the reference sequences obtained by 3. screening are randomly grouped by means of permutation and combination, and then each group is compared and iteratively screened. Align the reads with the reference sequences screened in each group, using bowtie2 software, the parameters are the same as 3. The alignment was filtered at a mismatch reads rate of 1% to remove repetitive sequences in the alignment due to PCR. Calculate the NRMSE, end gap, middle gap, Cor, mean depth and other indicators of each reference sequence, and perform step-by-step iterative screening according to the indicators of NRMSE ⁇ 0.35&Cor ⁇ 0.6&mean depth ⁇ 20&end gap ⁇ 40&middle gap ⁇ 10 until all reference sequences fulfil requirements.
  • All the reference sequences screened in 4. are respectively entered into the secondary cluster of the reference sequence for competitive iterative screening.
  • the internal iterative screening steps of the secondary cluster are the same as 3 and 4, but the screening threshold is stricter: the comparison parameter is --min-score L,-1,-0.04-a; the two-step reads mismatch rate only allows 0.5 %; the same as step 3, the screening parameters are end gap ⁇ 25&middle gap ⁇ 1&CV ⁇ 0.55&mean depth ⁇ 20; the same as step 4, the parameters are NRMSE ⁇ 0.3&Cor ⁇ 0.6&mean depth ⁇ 20&end gap ⁇ 25&middle gap ⁇ 1.
  • the reference sequences that meet the conditions go to the next step of analysis.
  • nucleic acid of known sequence IQ internal standard
  • Propionibacterium acnes Propionibacterium acnes
  • Enterococcus faecalis Relative content 0.5781
  • the average sequencing data of each sample is 55,663 short sequence data, the effective data volume in this example is more than 90%, and the coverage of the target sequence is close to 100% .
  • Example 7 the number of detected sequences in the sample was 55,115 short sequences, and human sequences accounted for 0.020% of all detected sequences.
  • a total of 7,707 short sequences were aligned to the Enterococcus faecium (Enterococcus faecium) species, accounting for 13.98%, the target gene sequence sequencing coverage of the species was 99.9%, and the average sequencing depth was 349.04 ⁇ .
  • a total of 2,761 short sequences were aligned to Escherichia coli (Escherichia coli) species, accounting for 5.01%, the target gene sequence sequencing coverage of the species was 99.8%, and the average sequencing depth was 140.85 ⁇ .
  • Example 8 the number of detected sequences in the sample was 67,797, and the human-derived sequences accounted for 9.50% of all detected sequences. A total of 3,136 short sequences were aligned to Escherichia coli (Escherichia coli) species, accounting for 4.63%, the target gene sequence sequencing coverage of the species was 100%, and the average sequencing depth was 397.54 ⁇ .
  • mNGS metagenomic approach
  • Example 9 of the present invention 89 clinical samples were used for detection.
  • the detection positive rate of this technical solution is 98.9% (88/89), which is higher than that of other non-NGS methods (bacterial culture combined with mass spectrometry identification method) (59.6%, 53/89).
  • the consistency rate between the detection results of the technical solution and the former was 90.6% (48/53).
  • the positive rate of patient samples detected by metagenomic method was 38.2% (195/511).
  • the positive rate of NGS method was 27.0% (138/511).
  • the concordance rate between the mNGS protocol and the former was 60.9% (84/138). It can be seen that the technical solution has a higher positive rate than the prior art in clinical sample detection.
  • the technical solution has a higher positive consistency rate of traditional bacterial culture detection.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

本发明涉及微生物鉴定领域,具体而言,涉及一种通过测序获取微生物物种及相关信息的方法。该方法包括:i)获取测序数据,所述测序数据由引物对微生物特征序列进行扩增后经下一代测序技术对扩增产物进行测序得到;ii)将所述测序数据与特征序列数据库进行比对分析以鉴定所述待检测样本中的微生物组成;其中所述特征序列数据库预先根据含有所述特征序列的参考序列的相似性做聚类处理,得到一层或多层级的cluster,每个层级cluster中有至少一个子seed,最底层的cluster中有若干作为参考序列的seed。

Description

通过测序获取微生物物种及相关信息的方法、装置、计算机可读存储介质和电子设备 技术领域
本发明涉及微生物鉴定领域,具体而言,涉及一种通过测序获取微生物物种及相关信息的方法、装置、计算机可读存储介质和电子设备。
背景技术
微生物划分为以下8大类:细菌、病毒、真菌、放线菌、立克次氏体、支原体、衣原体、螺旋体。二代测序(NGS)(mNGS)技术是鉴定样本中存在的微生物种属类别的有效方法。
在微生物鉴定方面,NGS的应用主要有两种方法:
一种是使用宏基因组策略。检测样本中所有的核酸序列,通过将检测到的序列与微生物基因组序列数据库比对,识别样本中存在的微生物。
另外一种是靶向测序策略。特异性的捕获或者富集样本中的某些特征性序列,然后再进行序列检测,并将获得的序列信息与相应的微生物特征性序列数据库比对,从而鉴定样本中存在的微生物。原核生物的rRNA有3种类型:23S、16S、5SrRNA。编码16S rRNA的基因具有良好的进化保守性,适宜分析的长度(约为1540bp),以及与进化距离相匹配的良好变异性,所以成为细菌分子鉴定的标准标识序列。16S rRNA基因不仅适用于细菌,也适用于支原体、衣原体、立克次氏体、螺旋体等原核生物的分类,是目前针对原核生物分类时,得到广泛接受、数据库最齐全的特征性序列。16S rRNA的序列包含9或10个高变区(variable region)和10或11个保守区(conserved region)。保守区序列反映了生物物种间的亲缘关系,而高变区序列则体现物种间的差异。NGS靶向测序策略是针对16S rRNA基因的高变区序列。PCR扩增100-几百bp的序列被用于NGS测序,获得的序列信息经与16S rRNA基因序列数据库比对后,可鉴定样本中存在的微生物。
但是,当宏基因组二代测序(mNGS)技术应用于微生物鉴定、特别是临床病原微生物鉴定时,mNGS对样本中的核酸不加区分的全部测序。由于样本中存在大量非微生物的宿主核酸如大量来自人体细胞的人类基因组核酸,而一个人体细胞的核酸量大约是微生物的1,000-100,000倍,且微生物中具有种属特异性的序列仅有自身基因组的大约1/100,加之临床样本中病原微生物的数量相比宿主细胞的数量而言非常稀少,因而,被检测的样本核酸中仅有1/1,000,000-100,000,000的量来自病原微生物,这就导致绝大多数的测序数据都与微生物的鉴定无关,属于无效数据。测序数据的浪费,一方面导致检测成本高昂,另一方面由于有效数据过少,降低了检测的灵敏度与可靠性。
而基于16S/18S/ITS扩增子的NGS技术则读长有限,取决于不同的测序平台,测序片段的长度介于50~400bp之间。16S rRNA基因的长度约1500bp。为了获得该基因全长的序列信息,必须将该基因核酸分子打断成适合NGS测序的短片段,并在测序完成后根据不同短片段相互间在末端序列上的重叠,将短片段按基因序列的顺序叠放后组装出16S rRNA基因序列的全长。然而,由于核糖体基因序列在物种进化过程中高度保守,进化关系相近物种(如相同属(genus)内的物种)的序列相似度很高。因此,在含有一个以上物种的临床样本中,由于分属不同物种的短片段间的序列相似性太高,使得从短片段正确地组装出各物种的全长16S rRNA基因序列而不出现跨物种混合体(chimera)是难以实现的。
为避免上述困难,当前流行的扩增子二代测序(NGS)技术对16S rRNA基因的高变区进行靶向扩增,而后对扩增子(amplicon)进行NGS测序。由于16S rRNA的序列所包含的9或10个高变区(variable region)的核苷酸序列体现了物种间的差异,因此,通过对高变区的NGS测序、并将测序得到的序列信息与16S rRNA基因高变区序列数据库比对,可以获得对部分微生物在“种”(species)的水平上的鉴别能力。
但是,单个或若干个高变区所携带的核苷酸序列多样性并不足以用来区分所有的原核微生物。Johnson,J.S.et al.(2019)的研究表明,只有全长的16S rRNA基因核苷酸序列才包含足够的信息以在“种”的水平上区分所有的原核微生物。因此,当前的16S/18S/ITS扩增子二代测序(NGS)技术无法达到在“种”(species)的水平上检测临床样本中微生物的鉴别能力。
综上所述,上述方案应用于临床微生物检测时均存在各自无法克服的缺陷和不足。
发明内容
本发明涉及一种通过测序获取微生物物种及相关信息的方法,包括:
i)获取测序数据,所述测序数据由靶向富集微生物特征核酸序列后经下一代测序技术对扩增产物进行测序得到;
ii)将所述测序数据与特征序列数据库进行比对分析以鉴定待检测样本中的微生物组成;
其中所述特征序列数据库预先根据含有所述特征序列的参考序列的相似性做聚类处理,得到一层或多层级的cluster,每个层级cluster中有至少一个子seed,最底层的cluster中有一或若干作为参考序列的seed;
所述比对分析包括:
a)去除所述测序数据中的低质量和含有测序接头序列的reads;
b)将reads序列与所述特征序列数据库的seeds序列进行比对,将因PCR扩增而产生的完全重复reads去除,并对事件“有reads序列比对上”与seeds序列做统计学独立性检验,选出与事件“有reads序列比对上”相关的seeds序列,得到初级筛选seed序列;
c)将read序列与所述初级筛选seed序列进行比对,其中所述初级筛选seed序列之间不竞争reads,计算每个seed序列的reads覆盖情况,计算评估seed序列覆盖的指标,并据此对seed序列进行次级筛选;
d)将read序列与次级筛选得到的seed序列比对,seed序列比对过程中相互间竞争reads,计算比对中seed序列的read覆盖度指标,逐步迭代去除read覆盖度指标不满足要求的seed序列,得到叁级筛选seed序列;
e)合并所述叁级筛选seed序列,并将reads与之进行比对,逐步迭代筛选得到满足要求的参考序列,其中逐步迭代筛选所用阈值相对于步骤d)更加严格;
f)根据步骤e)得到的所述参考序列及其reads数量,计算物种水平的reads含量及其占比;计算时将同属于一个物种的多个参考序列的reads数目加和得到该物种的reads数,然后根据每个物种reads数目除以样本所含物种的reads数量的总和计算样本中每个物种的占比即丰度。
本发明还涉及一种通过测序获取微生物物种及相关信息的装置,所述装置包括:
测序数据获取模块,其用于获取测序数据,所述测序数据由靶向富集微生物特征核酸序列后经下一代测序技术对扩增产物进行测序得到;
特征序列数据库构建模块,其用于对所述测序数据进行聚类处理得到特征序列数据库,所述特征序列数据库包含一层或多层级的cluster,每个层级cluster中有至少一个子seed,最底层的cluster中有一或若干作为参考序列的seed;
比对分析模块,其用于将所述测序数据与所述特征序列数据库进行比对分析以鉴定所述被检测样本中的微生物组成,所述比对分析为如上所述方法中的所述比对分析所定义。
根据本发明的再一方面,本发明还涉及一种计算机可读存储介质,所述计算机存储介质用于存储计算机指令、程序、代码集或指令集,当其在计算机上运行时,使得计算机执行如上所述的方法中的步骤ii)。
根据本发明的再一方面,本发明还涉及一种电子设备,包括:
一个或多个处理器;以及
存储装置,存储一个或多个程序,
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如上所述的方法中的步骤ii)。
本发明还涉及如上所述的方法、或如上所述的装置、或如上所述的计算机可读存储介质、或如上所述的电子设备在鉴定微生物中的应用。
1.尽管经过长期努力,现有技术仍然无法满意地解决应用短读长NGS技术完成基于诸如16S rRNA基因序列等进化上高度保守的长片段序列信息进行微生物物种识别的问题。本发明很好地解决了该问题。对实验室和临床样本的检测证实,本发明实现了运用短读长NGS技术完成对诸如16S rRNA基因序列等高度相似的长片段序列的准确鉴别,克服了以往靶向测序仅能够用于检测短序列的困难,实现了基于短片段测序在物种水平或更高分辨率下的微生物鉴定。
2.本发明可以正确地鉴定样本中存在的微生物物种、并测量各物种间数量上的相对比例,具有比现有技术更高的准确性与灵敏度。以细菌为例:对于单个物种的检测下限可以低至10CFU。本发明能够同时 正确地检出多个(如5个或更多)物种混合样本中的全部微生物。即使两个物种间含量相差16倍或更多,本发明也能够同时正确地全部检出。
3.在对临床样本的检测中,实施例三至九中所有样本的平均测序数据量为55,663条reads,远低于目前mNGS方法检测所需的数据量(10,000,000-100,000,000条reads)。全部样本中可用于微生物鉴定的有效数据量均在90%以上,而现有报道中mNGS方法的有效数据通常占比为0.00001-0.01%。相比mNGS方法,本发明展现了极高的数据效率。因而,本发明的检测成本远低于当前技术mNGS的检测成本。
4.较高的测序覆盖深度才能保证检测的准确性。本发明检测中对靶序列覆盖率近100%、测序深度达到20×以上的微生物物种鉴定结果才能予以确认。但目前的公开报道中,mNGS对受检微生物基因组的覆盖率要求可低至10%以内,平均测序深度自然低于1×。因此,本发明应用于微生物检测的灵敏度和特异性均高于现有技术。
5.通过对人工模拟样本与临床样本的检测证实,本发明在保证低成本的同时,检测的准确性和灵敏度亦满足实际需要。因而,在保持mNGS方法检测靶标广泛、受影响因素较少等优势的同时,本发明具备更高的检测灵敏度与准确性。
附图说明
为了更清楚地说明本发明具体实施方式或现有技术中的技术方案,下面将对具体实施方式或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施方式,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本发明一个实施例中以基于原核生物16S rRNA基因测序的微生物检测为例的检测流程示意图;
图2为本发明一个实施例中以基于16S rDNA参考序列数据库的构建的主要操作流程的示意图;
图3为本发明一个实施例中构建了包含两层聚类层级结构的数据库的示意图;
图4为本发明一个实施例中参考序列数据库的seed聚类层级关系的示意图;
图5为本发明一个实施例中利用测序数据进行微生物物种鉴定的算法流程示意图。
具体实施方式
现将详细地提供本发明实施方式的参考,其一个或多个实例描述于下文。提供每一实例作为解释而非限制本发明。实际上,对本领域技术人员而言,显而易见的是,可以对本发明进行多种修改和变化而不背离本发明的范围或精神。例如,作为一个实施方式的部分而说明或描述的特征可以用于另一实施方式中,来产生更进一步的实施方式。
因此,旨在本发明覆盖落入所附权利要求的范围及其等同范围中的此类修改和变化。本发明的其它对象、特征和方面公开于以下详细描述中或从中是显而易见的。本领域普通技术人员应理解本讨论仅是示例性实施方式的描述,而非意在限制本发明更广阔的方面。
本发明涉及一种通过测序获取微生物物种及相关信息的方法,包括:
i)获取测序数据,所述测序数据由靶向富集微生物特征核酸序列后经下一代测序技术对扩增产物进行测序得到;
ii)将所述测序数据与特征序列数据库进行比对分析以鉴定所述待检测样本中的微生物组成;
其中所述特征序列数据库预先根据含有所述特征序列的参考序列的相似性做聚类处理,得到一层或多层级的cluster,每个层级cluster中有至少一个子seed,最底层的cluster中有一或若干作为参考序列的seed。
所述微生物包括细菌、古菌、真菌、支原体、衣原体、力克次氏体、螺旋体、和病毒,其中RNA病毒的特征核酸序列可通过逆转录其RNA基因组生成cDNA后获得。
技术术语
本发明中涉及的技术术语包括如下:
NGS:Next Generation Sequencing,下一代测序/第二代测序,又名高通量测序;
mNGS:metagenomics next generation sequencing,宏基因组下一代测序;
ITS:Internal Transcribed Spacer,内转录间隔区,是位于真菌核糖体RNA(rRNA)基因转录区或对应多顺反子rRNA前体中大、小亚基rRNA之间的核酸序列。
Reads:测序读段,即高通量测序时产生的检测对象的一段序列;
Cor:Pearson’s correlation coefficient,皮尔逊相关系数;
NRMSE:normalized root mean square error,标准化的均方根误差;
CV:coefficient of variation,变异系数;
Alignment:序列比对;
Reads mismatch rate:序列比对错误率;
Fastq:存储序列及其质量值的四行的文本文件格式;
Adapter:测序中使用的接头序列;
Cluster:类、簇;
Seed:种子,也即类中心;依据所在cluster的层级不同,seed又分为子seed与作为参考序列的seed,但二者可为“seed”所概括;
Bowtie2:一个将短序列比对到大基因组的比对软件;
Mean depth:平均测序深度;
Gap:空白、断点,此处代表无覆盖;
End gap:末端的无覆盖;
Middle gap:中间段无覆盖;
EM:Expectation Maximization,期望最大化;
overlap graphs:用于表征多个核酸序列间在序列编码上相互重叠关系的图;
paired-end reads:双端序列,从目前片段的正反向分别测序产生的序列;
de novo组装:从头组装,将小片段组装成较长片段的方法;
heuristic:经验性的。
参考序列:在本发明中,若无特殊说明,参考序列是指能够表明微生物种属的特征序列,其通常是保守的,参考序列一般包括16S rRNA基因、18S rRNA基因、ITS核酸序列、RNA病毒的RNA为模板的RNA聚合酶基因(RdRp)、病毒衣壳蛋白编码基因、逆转录病毒的pol基因等或者其他可以反映微生物种属特征的核酸序列中的一种或多种的全长序列。
检测的样本及其预处理,测序文库的建立
在本发明中,检测的对象可以来自于生物(微生物宿主)或来自于含有微生物的环境样本。在一些实施方式中,所述待检测样本为来自于微生物宿主的样本或来自于含有微生物的环境样本:所述来自于微生物宿主的样本优选包括但是不限于:粪便、肠道内容物、皮肤、组织、痰液、血液、唾液、牙菌斑、尿液、阴道分泌物、胆汁、支气管肺泡灌洗液、脑脊液、胸水、腹水、盆腔积液、脓液以及瘤胃中的至少一种;在一些实施方式中,所述来自于含有微生物的环境样本优选包括:物体内外表面、生活用水、医疗用水、工业用水、食品、饮料、肥料、废水、火山灰、冻土层、淤泥、土壤、堆肥、污染河流养殖水体以及空气中的至少一种。
在一些实施方式中,所述宿主是动物;进一步可以选择包括人类及所有畜养(如家畜和宠物)和野生的动物及禽鸟,其非限制性地包括牛、马、乳牛、猪、绵羊、山羊、大鼠、小鼠、狗、猫、兔、骆驼、驴、鹿、貂、鸡、鸭、鹅、火鸡、斗鸡等。
以基于原核生物16S rRNA基因测序的微生物检测为例,样本的检测流程示意图如图1所示,详述如下:
样本的预处理:针对不同类型、来源的样本,可能需要对样本进行预处理,以适应核酸提取的需要。处理方式包括但是不限于使用无菌水、ddH 2O(双蒸馏水)、无菌生理盐水、无菌PBS(磷酸缓冲盐溶液)等液体对样本进行洗涤,使用过滤、离心等方式对样本进行浓缩,使用梯度离心等方式对样本中的某些成分进行分离,或者使用某些满足实验需求的试剂盒对样本中的某些成分进行分离、或者对样本中某一部分的核酸进行去除或者富集。
核酸的提取:使用核酸提取试剂盒,提取经过预处理后样本中包含的全部核酸物质。使用的核酸提取试剂盒不局限于某一个厂家,也不局限于某一种方法,能获得满足实验需要的质量要求的核酸物质即可。所提取的核酸包括DNA、RNA,或者两者同时提取。在此步骤开始之前,可以向样本中加入一定量的已 知序列的核酸,该核酸序列满足以下条件:1)可以在下一步骤配制的反应体系下被扩增;2)可以被下一步骤加入的已有引物或者单独准备的引物相匹配;3)全部序列已知;4)所使用的序列不会干扰对样本中可能存在的任何目的序列的分析;5)核酸序列可以单独存在,也可以依赖于质粒、病毒、细胞等载体存在;6)所加入的核酸序列可以被此步骤的操作所获得,并且存在于最终提取到的核酸物质中。在本技术方案中,加入已知序列的核酸可以帮助更好的判断检测结果中由于采样、实验等环节引入的污染,但是并不是本技术方案的必须。不加入已知序列的核酸不会影响本技术方案的完整性,加入已知序列的核酸也不构成对本技术方案的创新。加入已知序列的核酸也不局限于在此步骤进行。
特定核酸序列的靶向富集:使用某些方法对能够提供微生物分类信息的核酸序列进行富集,使这些序列的核酸在样本的全部核酸序列中占据更高的比例,并且对富集产物进行纯化和定量。富集的方法包括但是不限于PCR、杂交捕获等方式。对富集产物的纯化包括但是不限于吸附柱纯化、磁珠纯化等方式,目的是去除富集处理过程在样本中残留的酶、引物、探针、盐、金属离子等成分,获得纯净的较长片段(大于20bp)的核酸序列。定量是为了确定得到样本中核酸的浓度,进而根据体积计算样本中的核酸含量。定量方式包括紫外分光光度法、染料结合法等方法。富集的目标序列,是目前常用的对微生物进行物种分类的序列,对于原核生物,可能是核糖体16s rRNA基因序列;对于真核生物,可能是核糖体18s rRNA基因序列或者ITS序列;对于病毒,可能是其基因组中同时具备进化保守性和物种特异性的核酸序列,如冠状病毒基因组中的Pol(RNA-dependent RNA polymerase)和N(Nucleocapsid)基因。这类序列通常较长,并且可能具有不同物种之间高度相似的部分,使用现有的短读长NGS测序和数据分析方法不能正确区分物种来源。对于获取的RNA,还可能需要先进行反转录,将RNA反转录为cDNA,然后再进行特定序列的靶向富集。在此步骤,配制反应体系时还可以加入一定量的已知序列的核酸(作为阳性对照),该核酸序列满足以下条件:
1)可以在本步骤配制的反应体系下被扩增;
2)可以被本步骤加入的已有引物或者单独准备的引物相匹配;
3)全部序列已知;
4)所使用的序列不会干扰对样本中可能存在的任何目的序列的分析;
5)核酸序列可以单独存在,也可以依赖于质粒、病毒、细胞等载体存在。
在一些实施方式中,在核酸富集之后还包括一步或多步质检流程,其目的是检测对目标核酸序列富集的效果。质量控制的手段包括对核酸含量的检测、核酸纯度的检测、以及富集后的核酸序列的片段长度检测等。经过有效富集的样本,会有更高的测序检测效率,通常也意味着样本中微生物的含量更加丰富。通过对富集效率的判断,可以预测样本中微生物的含量水平。对于富集质量不符合预期的样本,可以重新进行富集操作,但并不都是必须重新富集,富集质量不符合预期的样本也可以继续进行后续实验。
测序文库构建:目的是将富集得到的核酸,转换为NGS平台可以检测的核酸短片段。主要步骤是将长的核酸序列打断为NGS平台可以读取的长度,同时在片段两端加入相应的测序引物,使得测序仪可以对核酸序列进行检测。如果加入的测序引物含有标签(barcode/index),还可以对样本来源进行区分。在此步骤开始之前,在上一步获得的核酸中可以加入一定量的已知序列的核酸,该核酸序列满足以下条件:
1)全部序列已知;
2)所使用的序列不会干扰对样本中可能存在的任何目的序列的分析;
3)核酸序列可以单独存在,也可以依赖于质粒、病毒、细胞等载体存在。
在本发明中,加入已知序列的核酸可以帮助更好的判断检测结果中由于采样、实验等环节引入的污染,但是并不是本发明的必须。不加入已知序列的核酸不会影响本技术方案的完整性,加入已知序列的核酸也不构成对本技术方案的创新。加入已知序列的核酸也不局限于在此步骤进行。
测序文库构建具体实验包括以下步骤:
1)将长的核酸序列片段打断为较短的核酸序列片段。由于提取到或者富集之后的核酸序列仍然很长,大大超过NGS仪器的读取能力,需要将长片段打断为NGS仪器可以读取的长度的短片段,然后同时进行检测,才能够获得全部的核酸序列信息。打断的方法包括但是不限于物理方法(例如使用超声设备使得核酸断裂)、生物学方法(例如使用核酸酶、转座酶等方法将核酸序列切断)等。
2)进行末端修复。在文库构建的整个环节中,作为操作对象的核酸都是双链DNA(dsDNA)。将长 片段dsDNA打断之后生成的短片段dsDNA的两个末端不会很整齐,通常一条链会有几个碱基的突出,形成粘性末端。根据不同测序平台的设计,需要对打断后的DNA进行不同方式的修复。例如,如果使用Thermo Fisher的Ion torrent测序平台,需要将末端修复为完全齐平的形式,如果使用illumina测序平台,需要将末端修复为其中一条链有一个额外的腺嘌呤(A)的形式。
3)片段筛选。使用片段筛选磁珠,对样本中的核酸片段进行筛选,仅保留适宜长度的核酸片段,过长或者过短的核酸片段被丢弃。核酸片段的长度根据所选用的测序平台、测序试剂、测序条件的不同而不同。
4)连接测序接头。测序接头是两段具有特定序列的dsDNA。在测序仪器中,测序反应需要从这些特定序列开始工作,测序引物可以含有或者不含barcode/index序列,barcode/index序列可以用于区分同一次测序实验中不同样本来源的序列。使用T4连接酶等酶类工具,将两个测序引物分别连接在经过末端修复的短片段dsDNA的两端。只有两端分别连接了一端一个测序引物的dsDNA才能够被测序。
5)文库富集。一个样本中所有的正确连接了测序引物的待测核酸序列称为测序文库。文库富集是使用一定方法,通常是PCR,将正确连接了测序引物的核酸序列数量放大,增加其拷贝数,方便后续工作。富集步骤在实验流程中并不总是必须的。
6)在本环节的每一步反应之后,都有相应的纯化操作,目的是将核酸与反应试剂分离,获得纯净的核酸进入下一步反应。随着试剂与反应条件的调整,并不是每一次的纯化操作都是必须的,这样的调整不超出本技术方案的范围,纯化所用的试剂与方法也不构成对本技术方案的限制。
在一些实施方式中,测序文库构建后还包括质控步骤,其目的是检测构建的测序文库是否符合测序要求。质量控制的手段包括对核酸含量的检测、核酸纯度的检测、以及富集后的核酸序列的片段长度检测等。只有片段长度符合测序仪器要求、含量足够、纯度合格的文库才能用于后续测序。本次质控是实验流程中保障实验质量的一个环节,控制参数与所选用的测序平台有关,但是并不是本技术方案所要求的必须环节。
上机测序:根据所选用的测序仪厂家、型号、试剂的说明书进行实验。本技术所兼容的NGS测序仪厂家包括但不限于Thermo Fisher,illumina,BGI等主流厂家所有目前在市场销售的仪器与试剂。
数据分析流程
本发明中数据分析流程的逻辑是,将NGS测序获得的短序列片段(reads)比对(map)到微生物基因组特征序列数据库中的所有参考序列上;计算每个参考序列上reads实际覆盖的统计特征参数如CV等,并将每个参考序列上reads实际覆盖与(如果该参考序列代表的微生物包含在样本中时)理论预测的reads覆盖相比较,该理论预测的reads覆盖可根据数学模型如概率和统计模型、或经验/实验数据构造。比较的结果体现为多个不同的比较表征参数(如NRMSE等);根据统计特征参数和比较表征参数是否符合要求的标准,将达不到标准的参考序列淘汰。循环重复该筛选操作,直至所有剩余的参考序列的统计特征参数和表征参数均满足既定的最终标准,则这些参考序列所代表的微生物物种即为物种检测的结果。计算这些参考序列各自比对上的reads数量在该样本比对上参考序列的总reads数量中所占的比例,即为各参考序列所代表的微生物物种在样本微生物总数量中的占比。
上述基于每个参考序列上reads实际覆盖的统计特征参数和比较表征参数的参考序列筛选方法仅是其中可以使用的筛选方法之一,其他适合的筛选方法也可采用,例如,仅使用统计特征参数或比较表征参数进行筛选;或使用基于Fisher’s exact test富集分析的方法进行筛选;或基于贝叶斯概率(Bayesian probability)的EM算法,等等。
本发明根据数据库所含参考序列之间的序列相似性,对所有参考序列进行聚类,并为每个类(cluster)挑选出具代表性的参考序列即seed;对所有的seed进行聚类操作,并为生成的每个cluster选出代表性的参考序列即seed ID;可根据需要在seed ID的基础上做进一步的聚类操作和选取seed ID。该操作可根据需要进行多次。最终构建出一个多层级聚类树(hierarchical clustering tree)结构(该树结构可以有或没有根(root)),其中树的叶子(leaf)由微生物基因组特征序列数据库的参考序列组成、构成树结构的最底层,叶子上方的第一层节点(node)为第一层cluster的seed,其余各层的节点分别为代表其下方连接的所有节点构成的cluster的seed ID。对数据库参考序列的筛选可以从聚类树的最上层节点开始,逐层向下进行。
在一些实施方式中,所述比对分析包括:
a)去除所述测序数据中的低质量和含有测序接头序列的reads;
b)将reads序列与所述特征序列数据库的seeds序列进行比对,将因PCR扩增而产生的完全重复reads去除,并对事件“有reads序列比对上”与seeds序列做统计学独立性检验,选出与事件“有reads序列比对上”相关的seeds序列,得到初级筛选seed序列;
c)将read序列与所述初级筛选seed序列进行比对,其中所述初级筛选seed序列之间不竞争reads,计算每个seed序列的reads覆盖情况,计算评估seed序列覆盖的指标,并据此对seed序列进行次级筛选;
d)将read序列与次级筛选得到的seed序列比对,seed序列比对过程中相互间竞争reads,计算比对中seed序列的read覆盖度指标,逐步迭代去除read覆盖度指标不满足要求的seed序列,得到叁级筛选seed序列;
e)合并所述叁级筛选seed序列,并将reads与之进行比对,逐步迭代筛选得到满足要求的参考序列,其中逐步迭代筛选所用阈值相对于步骤d)更加严格;
f)根据步骤e)得到的所述参考序列及其reads数量,计算物种水平的reads含量及其占比;计算时将同属于一个物种的多个参考序列的reads数目加和得到该物种的reads数,然后根据每个物种reads数目除以样本所含物种的reads数量的总和计算样本中每个物种的占比即丰度。
在一些实施方式中,所述待检测样本为来自于微生物宿主的样本,步骤a)还包括:去除所述样本中所述宿主的核酸测序数据干扰。
在一些实施方式中,在步骤d)中,所述逐步迭代去除read覆盖度指标不满足要求的seed序列之后还包括cluster内部参考序列的筛选步骤:
将上步获得的每一个seed所属cluster的参考序列与reads进行比对,相同cluster内部的参考序列间竞争reads;统计每个参考序列的reads覆盖情况,根据reads覆盖指标进行过滤,逐轮迭代去除reads覆盖不佳的seed序列,其中逐轮迭代筛选所用阈值相对于步骤d)前面所述步骤更加严格。
在一些实施方式中,在步骤f)后还包括步骤g):排除实验环境中的背景污染物种的核酸测序数据干扰。
在一些实施方式中,在步骤b)中,所述统计学独立性检验为费舍尔精确检验,其具体包括:
将每个比对上的reads数量大于一定数量的参考序列记为“有read比对”,否则记为“无read比对”;根据参考序列数据库中seeds的聚类层级关系,统计检验聚类树中每个seed下属的叶子节点中是否显著富集“有read比对”的seed ID,逐级筛选出满足要求的seeds。
在一些实施方式中,所述特征序列数据库的构建方法包括:
获取含有所述特征序列的参考序列的公共数据库,去除所述数据库中所述参考序列在所述扩增引物之外的两端序列,得到第一数据库;
根据物种内相似性对所述第一数据库中存在模糊碱基的所述参考序列进行矫正处理,并根据物种注释以及相似性去除100%相似的冗余参考序列,得到第二数据库;
根据所述第二数据库中所述参考序列的相似性对其做聚类处理。
在一些实施方式中,在构建所述第一数据库时,去除所述数据库中所述参考序列在所述扩增引物之外的两端序列和引物序列。
在一些实施方式中,在构建所述第二数据库时,还包括:
将每条参考序列与NCBI NT/NR库进行blast比对,按照序列相似性和/或覆盖度的规则从NCBI NT/NR库中筛选出匹配的参考序列集;
在所述匹配的参考序列集的物种注释中选择最具代表性的物种分类,并使用该分类信息校正所针对的参考序列的物种注释。
在一些实施方式中,所述聚类处理包括第一聚类处理:
将所有非冗余的参考序列按照相似性进行聚类;
或;
1)按照物种内非冗余的参考序列的相似性进行聚类;
2)对1)中得到的seed按照相似性进行聚类,将聚在同一个类里的不同物种的seed及其子序列合并,然后按照99.5%相似性进行重聚类,用新获得的cluster替换参与重聚类计算的旧cluster。
在一些实施方式中,所述聚类处理还包括第二聚类处理:
在cluster中seed的子序列太多的情况下,按照比所述第一聚类处理更高的相似性标准对所述第一聚类处理得到cluster进行拆分,用拆分后形成的新cluster替换掉拆分前的cluster。
在一些实施方式中,所述聚类处理还包括第三聚类处理:
按照不同的相似度阈值,对所述第二聚类处理得到的cluster的seed参考序列进行层级聚类,创建一个有层次的嵌套的树。
下面以基于原核生物16S rRNA基因NGS测序的微生物检测为例,对数据分析流程加以详细说明。
数据分析流程主要包括两大部分的内容,1)分析流程中使用的16S rRNA基因(即rDNA)参考序列数据库构建方法;2)利用测序数据进行数据分析的算法流程。
1. 16S rDNA参考序列数据库的构建,主要操作流程的示意图如图2所示。
1)下载公共数据库中最新的16S rDNA参考序列数据,保证数据库序列来源的完备性。下载数据库包括但不限于NCBI及SILVA数据库,数据库包含的参考序列种类和数量可根据具体需要选择,例如,可包含100、250、500、1000、2000、10,000种、甚至全部已有种类等等不同的数量。
2)去除参考序列中的富集扩增引物以及两端引物之外的序列
本步骤的目的是获取不包含引物的参考序列。可以使用基于Smith-Waterman局部比对算法或者其他比对定位短序列的方法,根据一定的相似性(例如但不限于80%、85%、90%、95%等)匹配富集时使用的扩增引物(本步骤中简称扩增引物)信息及在参考序列中定位富集扩增引物,然后根据引物匹配的位置进行序列剪切,去除扩增引物以及引物之外的两端序列。另外,也可以仅仅去除引物以外的两端序列。
3)对16S rDNA序列进行物种注释校正
由于数据库中部分参考序列的物种注释信息可能存在误差,为了确保物种注释信息能够最大程度上反映其正确的物种归属,本技术方案对物种注释进行校正。但如果获取的参考序列物种注释信息足够确信,亦可不做此校正。校正流程:将每条参考序列与NCBI NT/NR库进行blast比对,按照序列相似性99.5%、覆盖度(coverage)99%(适当提高或者降低相似性值也在此技术范围内)的规则从NCBI NT/NR库中筛选出匹配的参考序列集。在匹配的参考序列集的物种注释中选择最具代表性的物种分类,并使用该分类信息校正所针对的参考序列的物种注释。同时根据该代表性物种分类在匹配的参考序列集的物种注释中的唯一性程度对所针对的参考序列的物种注释的确定性进行分级,物种注释的确定性级别包括:certain,limited certainty,rare annotation(按确定性由高到低排序),也可以根据实际确定性情况进行更多/少级别的的分类,甚至是不分类。
4)对16S rDNA序列中的模糊碱基进行校正
数据库中某些参考序列的部分位置的碱基不明确或包含简并碱基。为了明确这些位置最具代表性的碱基,我们按照规则进行校正,主要步骤是:对在同一个物种内相似性大于97%的参考序列使用MUSCLE进行多重序列比对,对于模糊碱基位置使用最具代表性的碱基进行替换。此步骤之后若还存在模糊碱基,则选择其同属中具有97%相似性的参考序列进行多重序列比对,对于模糊碱基位置使用最具代表性的碱基进行替换。处理之后依然存在的模糊碱基予以保留。
5)根据物种注释以及相似性对数据库进行去冗余
本步骤主要是为了去除100%相似的冗余参考序列,同时确保需要的参考序列不被误去除。因此,本技术方案要求被去除的参考序列需被其他参考序列所包含、以及物种注释的确定性等级较低,同时,还结合序列两端的引物匹配情况进行判断,比如:一个参考序列两端引物完整,但是却被另外一个参考序列完全包含,这种情况下前一个参考序列需保留;又或者,如果较短的参考序列的物种确定性等级更高,那么短的参考序列也可以保留。还有一些更复杂的情况比如参考序列某包含引物的端被另一个更长的参考序列包含的处理等。
6)对物种内非冗余的参考序列按照指定阈值的相似性进行聚类
聚类的目的是,一方面分散后续分析的计算压力,另一方面减少比对时序列之间的竞争,从而能更准确的逐步挑选出合适的参考序列。按照相似性99.5%(相似性指标可根据实际数据库参考序列的情况进行适当调整)的标准,对每个物种内所有的非冗余参考序列进行聚类。聚类后,每个参考序列类(cluster)中的代表序列作为该类的seed,与它聚在同一个cluster里的其他序列作为其子序列。
7)对物种间相似性较高的类进行合并后按照指定阈值的相似性重聚类
对6)中得到的所有seed按照99%(可根据6)中seed聚类的相似性指标作出适当调整)相似性进行聚类,将聚在同一个类里的不同物种的seed及其子序列合并,然后按照99.5%相似性进行重聚类,用新获得的cluster替换参与重聚类计算的旧cluster。另外,本步骤可以和6)合并,直接使用所有的序列进行聚类,而非按照前面描述的先物种内聚类、再合并物种间相似的类。但如果距离较近的物种太多,则直接做聚类的方式并不一定能达到很好的分析效果。
8)将聚类后较大的cluster按照更高相似性进行拆分
在cluster中seed的子序列太多的情况下,按照更高相似性(例如99.6%、99.8%、99.9%等)标准对该cluster进行拆分,用拆分后形成的新cluster替换掉拆分前的cluster。至此,构建了包含两层聚类层级结构的数据库:第一层由cluster的seed参考序列组成,第二层包含所有cluster的子参考序列。其结构示意图如图3所示。
9)对构建的cluster的seed参考序列进行层级聚类
依次按照97%、98%、99%(仅以此3种聚类相似性指标为代表,只要能达到逐层聚类的效果都可以)的相似性对所有seed参考序列进行聚类。主要步骤为:首先对所有的seed参考序列按照97%相似性聚类,得到第一层的cluster(包含seed及其子序列);然后针对每个cluster分别对其内部的参考序列按照98%相似性进行聚类,得到第二层的cluster;之后对第二层的每个cluster分别对其内部的参考序列按照99%相似性进行聚类,得到第三层的cluster。最终构建出包含数据库中所有seed参考序列的索引层级结构,使得每个seed参考序列获得分别对应于三种不同序列相似性聚类的seed ID。至此,所有参考序列的索引信息构成了一个由四列信息组成的文件。参考序列数据库的seed聚类层级关系的示意图如图4所示。
2.利用测序数据进行微生物物种鉴定的算法流程,具体流程如图5所示。
1)测序数据的质量控制
对进入分析流程的Fastq格式的测序短序列(read)数据进行筛选,过滤掉低质量和含有adapter序列的reads。
2)去除宿主(人)的核酸序列污染
使用bowtie2(或其他短序列比对工具)按默认参数将所有read序列与人类参考基因组hg19或hg38序列比对,过滤掉比对上人基因组序列的read序列;
3)数据库seed序列的初级筛选
使用hisat2(比对工具不限,能高效地实现reads与参考序列较准确匹配的目的即可)将read序列与数据库的seed序列进行比对。根据层级聚类信息,利用富集分析原理对seed进行初级筛选。具体流程如下:
使用hisat2软件,将所有reads序列与数据库所有seed序列进行比对。以低于某个阈值(如0.5%、1%、1.5%,2%等)的read mismatch rate为条件过滤比对结果,并将alignment中因PCR重复而产生的reads去除。将每个比对上的reads数量大于一定数量(例如5、10、15等条)的参考序列记为“有read比对”,否则记为“无read比对”。根据参考序列数据库中seeds的聚类层级关系即聚类树(见图4),利用Fisher’s exact test进行富集分析,统计检验聚类树中每个seed下属的叶子节点中是否显著富集“有read比对”的seed ID,逐级筛选出满足要求的seeds。
Fisher’s exact test富集分析说明如下:
根据每个seed的情况,为其构建下表,并按表下所列公式计算Fisher’s exact test检验的概率值。表格中为同时满足行和列条件的seed ID计数。当P值小于或等于某个阈值(如0.001,0.01,0.05,0.1,0.2,0.3,0.4,0.5等)时,该被检验的seed下属的聚类中显著地富集“有read比对”,且P值越小富集越显著。
Figure PCTCN2021115705-appb-000001
Figure PCTCN2021115705-appb-000002
4)数据库seed序列的次级筛选
使用bowtie2(比对工具不限,能达到reads与参考序列较准确匹配的目的即可,例如其他序列比对工具MAQ、SOAP、BWA、NovoAlign等)将read序列与初级筛选出的seed序列进行比对(seed序列间不竞争reads),计算每个seed序列的reads覆盖情况,计算评估seed序列覆盖的指标,并据此对seed序列进行次级筛选。具体计算流程如下:
使用bowtie2将read序列与初级筛选出的seed序列进行比对(主要是能得到比对相似度满足一定要求的alignment,ref间不竞争reads),按照阈值(如0.5%、1%、1.5%,2%等)的reads mismatch rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除。计算每个seed序列的reads覆盖情况,计算指标包括CV(Coefficient of variation),coverage(覆盖度),mean depth(平均测序深度),end gap(序列两端的覆盖缺口),middle gap(序列中间的覆盖缺口)等。根据相对宽松的指标对seed序列进行筛选,满足指标的seed序列通过次级筛选。
5)数据库seed序列的叁级筛选
将通过次级筛选的seed序列进行随机分组。在各组内使用bowtie2或其他序列比对工具例如MAQ、SOAP、BWA、NovoAlign等,将read序列与seed序列比对,seed序列比对过程中相互间竞争reads。计算alignment中seed序列的read覆盖度指标,逐步迭代去除read覆盖度指标不满足要求的seed序列。具体计算流程如下:
将过滤后的seed进行随机分组,每组内使用bowtie2将read序列与seed序列比对。比对过程中,seed序列之间竞争reads,reads在比对得分排名靠前的seed序列间随机分配。然后按照某个阈值的reads mismatch rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除。计算alignment的覆盖度指标,包括Cor(Pearson’s correlation coefficient,计算预期的seed序列的reads覆盖模型与alignment获得的实际覆盖之间的一致度)、NRMSE(Normalized root mean square error,预期模型与实际覆盖间的差异度)、coverage、mean depth、end gap、middle gap等。依据alignment的覆盖度指标是否满足设定的参数,例如mean depth≥15或20、end gap≤30或40、NRMSE≤0.4或0.35、middle gap≤0或5或10或15,等等,其中mean depth计算的时候从整个物种水平进行计算的,即属于同一个物种的多个seed深度之和满足参数即可。参数越严格进入后续分析的seed可能会越少,如果参数较严格的情况下覆盖尚且良好的seed能作为最终候选参考序列的可能性也较大。将不满足要求的seed序列过滤掉,直到剩余的seed序列全部满足要求为止。将各组筛选得到的seed序列合并,进入下一步流程。
6)cluster内部参考序列的筛选
前几步流程在数据库的seed层进行筛选。为了得到更准确的符合要求的参考序列,本步骤对叁级筛选得到的每个seed所属的cluster的子序列进行筛选。另外,此步筛选也可调整为:如果5)筛选出的seed序列的覆盖足够好,可直接进入最终的步骤候选。仅对reads覆盖不够好的seed进行cluster内部比对筛选。下面以前一种方法为例进行详细说明。
使用bowtie2(比对工具不限,能达到reads与参考序列较准确匹配的目的即可)将reads序列与叁级筛选获得的每一个seed所属cluster的参考序列(包含seed序列及其子序列)进行比对。参考序列间竞争reads。统计每个参考序列的reads覆盖情况,根据reads覆盖指标进行过滤,逐轮迭代去除reads覆盖不佳的参考序列。具体计算流程如下:
分别对筛选出的每一个seed序列所属的cluster的参考序列使用bowtie2与reads序列进行比对。采用较前面几步更严格的比对参数,并允许参考序列间竞争reads。初始迭代时,如reads比对上多个参考序列且得分都是排名最高(top1),则reads在这些参考序列间随机分配。后续迭代过程中,比对得分top1的参考序列共享的reads的分配将根据各参考序列在前一轮比对之后得到的read count和覆盖度指标值按比例进行。read count越多和覆盖度指标越好,该参考序列获得reads的概率就越大。比对完成后,按照相比前几步更严格阈值的reads mismatch rate对alignment进行过滤。对alignment中有mismatch/indel的位点,根据优势碱基占比情况去除可能是错误匹配的reads。将alignment中因PCR重复而产生的reads去除。计算reads覆盖度指标,包括Cor(计算预期的reads覆盖模型与实际reads覆盖间的一致度)、NRMSE(计算预期的reads覆盖模型与实际reads覆盖间的差异度)、coverage、mean depth、end gap、middle gap等。依据reads覆盖度指标是否满足设定的参数,将不满足要求的参考序列过滤掉。每一轮迭代最多过滤掉参考序列总数 的1%、或5%、或10%、或15%、或30%等(比例可按照所需的收敛速度进行调整),直到剩余的参考序列全部满足要求。而后对同属一个物种的多个参考序列进行去重复计算。使用MUSCLE(或其他序列比对工具例如ClustalW、T-coffee、MAFFT等)对相关参考序列进行两两比对,将比对位置完全一致的两个参考序列中reads覆盖较差的一方去除。最后将每个seed所属cluster内部筛选得到的参考序列作为候选参考序列进入下一步流程。
7)参考序列的终极筛选
合并cluster内部筛选获得的参考序列,并使用bowtie2(或其他序列比对工具例如MAQ、SOAP、BWA、NovoAlign等进行合理参数设置)将reads与参考序列进行比对。使用更加严格的参数,逐步迭代筛选得到最终满足要求的参考序列。具体计算流程如下:
合并cluster内部筛选获得的参考序列,使用bowtie2将reads与参考序列进行比对。比对、筛选、以及迭代过程均与6)相同,但覆盖筛选参数较6)更严格。完成6)的全部操作后,根据每个参考序列拥有的独有reads的数量情况,对参考序列进行最后的过滤。如果一个参考序列与另一参考序列间的两两指定区域内(末端序列不考虑)比对位置完全一致,则根据序列相似性和各自拥有的独有reads的数量情况,将独有reads数量较少(数量差距需要满足一定的范围,例如差异倍数超过1.5、2、2.5等)一方去除。
8)计算样本所含微生物各物种间的数量比例
根据7)得到的样本中所有的参考序列及其reads数量,计算物种水平的reads含量及其占比。将同属于一个物种的多个参考序列的reads数目加和得到该物种的reads数,然后计算样本中每个物种的占比即每个物种reads数目除以样本所含物种的reads数量的总和。根据每个参考序列的各统计指标(包括mean depth、NRMSE、Cor、gap),使用logistic回归模型计算每个参考序列的可信度,其中模型的训练数据来源于多批次的实验分析结果。
9)背景污染物种的排除
由于采样或实验环境(包括实验使用的试剂和耗材)中存在各种类型不一的环境微生物,从而构成微生物检测结果中的背景污染。本技术方案去除背景污染的方法包括两个步骤:第一步,计算物种占比以及判断该物种是否在重要的临床致病物种列表中,即占比很低且不在重要临床致病物种列表中的物种将被当作假阳性而排除;第二步,过滤在阴性对照样本中出现的物种。但由于临床样本也可能确实包含环境中存在的某些微生物物种,所以不能简单地直接排除。为此,实验中设置了多个阴性对照样本,同时每个样本中(包括被测样本和对照样本)引入一种内部参照品(internal control),用于对同一样本中物种的定量进行归一化。去除背景污染物种的主要方法是:分别计算各临床样本和对照样本中该物种reads的归一化含量,然后计算该物种在临床样本中的含量是一个来自该物种(或者所有对照样本中检测到的物种)在阴性对照样本中含量的统计分布(如正态分布、Poisson分布、weibull分布、或其他已知的理论分布、或基于数据的重抽样(如bootstrapping、Jackknife等方法)获得的经验分布)的样本的概率。如概率大于某个阈值(如0.001,0.01,0.05,0.1,0.2,0.3,0.4,0.5等),则认为该物种是背景污染并从检测结果中去除,否则予以保留。
至此,基于测序reads数据对参考序列数据库的筛选流程获得了最终满足要求的参考序列。根据所获参考序列的物种分类注释信息,可以获得样本所含微生物的物种构成。
微生物测序数据分析装置、计算机可读存储介质及电子设备
本发明还涉及一种微生物测序数据分析装置,所述装置包括:
测序数据获取模块,其用于获取测序数据,所述测序数据由靶向富集微生物特征核酸序列后经下一代测序技术对扩增产物进行测序得到;
特征序列数据库构建模块,其用于对所述测序数据进行聚类处理得到特征序列数据库,所述聚类处理为如上所述方法中的所述聚类处理所定义;
比对分析模块,其用于将所述测序数据与所述特征序列数据库进行比对分析以鉴定所述待检测样本中的微生物组成,所述比对分析为如上所述方法中的所述比对分析所定义。
在一个实施方式中,所述装置包括:
测序数据获取模块,其用于获取测序数据,所述测序数据由引物对微生物特征序列进行靶向富集后经下一代测序技术对扩增产物进行测序得到;
特征序列数据库构建模块,其用于对所述测序数据进行聚类处理得到特征序列数据库,所述特征序列数据库包含一层或多层级的cluster,每个层级cluster中有至少一个子seed,最底层的cluster中有若干作为参考序列的seed;
比对分析模块,其用于将所述测序数据与所述特征序列数据库进行比对分析以鉴定所述待检测样本中的微生物组成,所述比对分析模块包括:
a)第一模块,其用于去除所述测序数据中的低质量和含有测序接头序列的reads;
b)第二模块,其用于将reads序列与所述特征序列数据库的seeds序列进行比对,将因PCR重复而产生的reads去除,并对reads序列与seeds序列做统计学独立性检验,选出与reads序列相关的seeds序列,得到初级筛选seed序列;
c)第三模块,其用于将read序列与所述初级筛选seed序列进行比对,其中所述初级筛选seed序列之间不竞争reads,计算每个seed序列的reads覆盖情况,计算评估seed序列覆盖的指标,并据此对seed序列进行次级筛选;
d)第四模块,其用于将read序列与次级筛选得到的seed序列比对,seed序列比对过程中相互间竞争reads,计算比对中seed序列的read覆盖度指标,逐步迭代去除read覆盖度指标不满足要求的seed序列,得到叁级筛选seed序列;
e)第五模块,其用于合并所述叁级筛选seed序列,并将reads与之进行比对,逐步迭代筛选得到满足要求的参考序列,其中逐步迭代筛选所用阈值相对于步骤d)更加严格;
f)第六模块,其用于根据步骤e)得到的所述参考序列及其reads数量,计算物种水平的reads含量及其占比;计算时将同属于一个物种的多个参考序列的reads数目加和得到该物种的reads数,然后根据每个物种reads数目除以样本所含物种的reads数量的总和计算样本中每个物种的占比。
本发明还涉及一种计算机可读存储介质,所述计算机存储介质用于存储计算机指令、程序、代码集或指令集,当其在计算机上运行时,使得计算机执行如上所述的方法中的步骤ii)。
可以采用一个或多个计算机可读的介质的任意组合。计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器、光存储器件、磁存储器件、或者上述的任意合适的组合。在本文件中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。
计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括——但不限于——电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。
计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括——但不限于——无线、电线、光缆、RF等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开操作的计算机程序代码,程序设计语言包括面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如”C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
根据本发明的再一方面,还涉及一种电子设备,包括:
一个或多个处理器;以及
存储装置,存储一个或多个程序,
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如上所述的方法中的步骤ii)。
可选的,电子设备还可以包括收发器。处理器和收发器相连,如通过总线相连。需要说明的是,实际应用中收发器不限于一个,该电子设备的结构并不构成对本申请实施例的限定。
其中,处理器可以是CPU,通用处理器,DSP,ASIC,FPGA或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。其可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框,模块和电路。处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等。
总线可包括一通路,在上述组件之间传送信息。总线可以是PCI总线或EISA总线等。总线可以分为地址总线、数据总线、控制总线等。存储器802可以是ROM或可存储静态信息和指令的其他类型的静态存储设备,RAM或者可存储信息和指令的其他类型的动态存储设备,也可以是EEPROM、CD-ROM或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。
下面将结合实施例对本发明的实施方案进行详细描述。
实施例1构建数据库
本实施例以16s rDNA为例说明参考序列数据库的构建过程。
1)下载公共数据库中最新的16S rDNA参考序列数据,保证数据库序列来源的完备性。下载数据库包括但不限于NCBI及SILVA数据库,数据库包含的参考序列种类和数量可根据具体需要选择,例如,可包含100、250、500、1000、2000、10,000种、甚至全部已有种类等等不同的数量。
2)去除参考序列中的富集扩增引物以及两端引物之外的序列
本步骤的目的是获取不包含引物的参考序列。可以使用基于Smith-Waterman局部比对算法或者其他比对定位短序列的方法,根据一定的相似性(例如但不限于80%、85%、90%、95%等)匹配富集时使用的扩增引物(本步骤中简称扩增引物)信息及在参考序列中定位富集扩增引物,然后根据引物匹配的位置进行序列剪切,去除扩增引物以及引物之外的两端序列。另外,也可以仅仅去除引物以外的两端序列。
3)对16S rDNA序列进行物种注释校正
由于数据库中部分参考序列的物种注释信息可能存在误差,为了确保物种注释信息能够最大程度上反映其正确的物种归属,本技术方案对物种注释进行校正。但如果获取的参考序列物种注释信息足够确信,亦可不做此校正。校正流程:将每条参考序列与NCBI NT/NR库进行blast比对,按照序列相似性99.5%、覆盖度(coverage)99%(适当提高或者降低相似性值也在此技术范围内)的规则从NCBI NT/NR库中筛选出匹配的参考序列集。在匹配的参考序列集的物种注释中选择最具代表性的物种分类,并使用该分类信息校正所针对的参考序列的物种注释。同时根据该代表性物种分类在匹配的参考序列集的物种注释中的唯一性程度对所针对的参考序列的物种注释的确定性进行分级,物种注释的确定性级别包括:certain,limited certainty,rare annotation(按确定性由高到低排序),也可以根据实际确定性情况进行更多/少级别的的分类,甚至是不分类。
4)对16S rDNA序列中的模糊碱基进行校正
数据库中某些参考序列的部分位置的碱基不明确或包含简并碱基。为了明确这些位置最具代表性的碱基,我们按照规则进行校正,主要步骤是:对在同一个物种内相似性大于97%、或97.5%、或98%、或98.5%等的参考序列使用MUSCLE进行多重序列比对,对于模糊碱基位置使用最具代表性的碱基进行替换。此步骤之后若还存在模糊碱基,则选择其同属中具有97%、或97.5%、或98%、或98.5%等相似性的参考序列进行多重序列比对,对于模糊碱基位置使用最具代表性的碱基进行替换。处理之后依然存在的模糊碱基予以保留。
5)根据物种注释以及相似性对数据库进行去冗余
本步骤主要是为了去除100%相似的冗余参考序列,同时确保需要的参考序列不被误去除。因此,本技术方案要求被去除的参考序列需被其他参考序列所包含、以及物种注释的确定性等级较低。
6)对物种内非冗余的参考序列按照指定阈值的相似性进行聚类
按照相似性98%、或98.5%、或99%、或99.5%等(相似性指标可根据实际数据库参考序列的情况进行适当调整)的标准,对每个物种内所有的非冗余参考序列进行聚类。聚类后,每个参考序列类(cluster)中的代表序列作为该类的seed,与它聚在同一个cluster里的其他序列作为其子序列。
7)对物种间相似性较高的类进行合并后按照指定阈值的相似性重聚类
对6)中得到的所有seed按照98%、或98.5%、或99%等(可根据6)中seed聚类的相似性指标作出适当调整)相似性进行聚类,将聚在同一个类里的不同物种的seed及其子序列合并,然后按照99.5%相似性进行重聚类,用新获得的cluster替换参与重聚类计算的旧cluster。另外,本步骤可以和6)合并,直接使用所有的序列进行聚类,而非按照前面描述的先物种内聚类、再合并物种间相似的类。但如果距离较近的物种太多,则直接做聚类的方式并不一定能达到很好的分析效果。
8)将聚类后较大的cluster按照更高相似性进行拆分
在cluster中seed的子序列太多的情况下,按照更高相似性(例如99.6%、99.8%、99.9%等)标准对该cluster进行拆分,用拆分后形成的新cluster替换掉拆分前的cluster。至此,构建了包含两层聚类层级结构的数据库:第一层由cluster的seed参考序列组成,第二层包含所有cluster的子参考序列。
9)对构建的cluster的seed参考序列进行层级聚类
依次按照97%、98%、99%(仅以此3种聚类相似性指标为代表,只要能达到逐层聚类的效果都可以)的相似性对所有seed参考序列进行聚类。主要步骤为:首先对所有的seed参考序列按照97%相似性聚类,得到第一层的cluster(包含seed及其子序列);然后针对每个cluster分别对其内部的参考序列按照98%相似性进行聚类,得到第二层的cluster;之后对第二层的每个cluster分别对其内部的参考序列按照99%相似性进行聚类,得到第三层的cluster。最终构建出包含数据库中所有seed参考序列的索引层级结构,使得每个seed参考序列获得分别对应于三种不同序列相似性聚类的seed ID。至此,所有参考序列的索引信息构成了一个由四列信息组成的文件。参考序列数据库的seed聚类层级关系如下图所示。
实施例二:模拟实验及数据分析
模拟数据的生成
从数据库中人为选取单个或不同组合的多个参考序列生成模拟样本。将参考序列随机打断以生成长度分布符合正态分布的reads,随机生成指定测序深度的reads集合以模拟测序结果。通过模拟共产生了83个由不同物种等量混合构成的样本,其中包括24个单物种样本和59个多物种混合(物种数2-12不等)样本,参考序列reads的平均深度从20X到800X不等。
分析流程
数据库:该批次数据分析使用的是包含2119种致病细菌的数据库,同时按照97%、98%、99%相似性构建参考序列的层级关系,共包含一级代表序列34,025个,总序列数83,886条(本版本参考序列为去除冗余的参考序列数量)。
由于本数据库seed层数量较大,接下来使用hisat2软件按照参数--score-min L,-1,-0.08--no-spliced-alignment--no-templatelen-adjustment--secondary-k 100000--mm--no-softclip进行比对,然后按照1%mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除,然后统计每个参考序列比对的reads数目,将比对reads数大于10条的记为“有read比对”,否则为“无read比对”,然后在按照聚类层级关系使用富集分析的方法,按照较宽松的富集p值0.1筛选出显著富集reads的seed ID,提取对应的seed作为下一步的候选参考序列。
使用bowtie2将reads与执行步骤2)中得到的候选参考序列比对,比对参数为--min-score L,-1,-0.1–a,参考序列间不竞争reads。然后按照0.5%mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除,计算每个参考序列的覆盖度并根据参数CV≤0.55&gap≤40&mean depth≥20进行过滤,通过阈值筛选的参考序列作为下一步的候选序列。
将reads与步骤3)中筛选得到的参考序列进行比对,使用bowtie2软件,参数同3),然后按照0.5%的mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除,计算每个参考序列的CV、gap、Cor、CV/Cor、mean depth指标,按照CV≤0.55&Cor≥0.6&mean depth≥20&gap≤40的指标,并进行逐步迭代筛选,直到所有参考序列满足要求。
对4)筛选出的参考序列进行筛选。如果覆盖度参数满足CV≤0.4&Cor≥0.9&gap==0,则直接进入最终的迭代,否则需要进入参考序列的二级cluster内部进行竞争迭代筛选。二级cluster内部迭代筛选步骤同3)和4),其中同4)步骤的筛选阈值参数为CV≤0.55&Cor≥0.6&mean depth≥20&gap≤25,满足条件的参考序列进入下一步分析。
将5)筛选出的所有满足条件的参考序列合并,进行终极迭代筛选。使用bowtie2比对,参数同4),使用0.5%的mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除。对于参考序列比对位点中优势碱基占比小于0.95的位点仅保留支持参考序列的reads比对,计算每个参考序列的覆盖度,根据阈值CV≤0.55&Cor≥0.6&mean depth≥20&gap≤25进行逐步迭代筛选,逐一末位淘汰进行迭代筛选。直到所有参考序列都满足阈值之后,再对筛选得到的每个物种内的参考序列两两进行多重序列比对。最终满足要求的参考序列作为最终的目标序列,对应的物种为目标物种,同时计算每个物种的reads数目以及在样本中的占比。
分析结果
模拟实验数据的分析表明,样本物种检测的平均灵敏度(sensitivity)达98.65%,平均精确性(precision)达98.79%,其中,单物种样本的灵敏度和精确性均为100%。
实施例三
本实施例以革兰氏阳性菌作为实验对象,为液体培养基培养的已知物种分类的单纯菌种。目的是考察本技术方案对样本中的微生物是否能够正确鉴定、以及考察本技术方案检测能力的灵敏度下限。
(一)样本准备
实验室培养表皮葡萄球菌单菌种样本,经琼脂平板培养计数,确认实验样本中加入的细菌CFU数量。投入量为2560CFU,样本编号为INQ19M0101、INQ19M0102。投入量为640CFU,样本编号为INQ19M0103、INQ19M0104。投入量为160CFU,样本编号为INQ19M0105、INQ19M0106。投入量为40CFU,样本编号为INQ19M0107、INQ19M0108。投入量为10CFU,样本编号为INQ19M0109、INQ19M0110。
(二)实验过程
1.样本DNA提取,使用Ezup柱式细菌基因组DNA抽提试剂盒,生工生物工程(上海)股份有限公司,货号B518255-0100。按照试剂盒说明书提取样本中的DNA。
在本实施例中,选取的样本为细菌,提取方案为提取样本中包含的DNA。在本技术方案中,提取的核酸也可能为RNA,或者DNA和RNA同时提取;所使用的核酸提取试剂盒也不限于该厂家或者该货号。
2.目的基因扩增。PCR扩增引物为:F:5'AGAGTTTGATCMTGGCTCAG 3'、或者5'AGAGTTTGATCCTGGCTCAG 3'、或者5'CTCCTACGGGAGGCAGCAG 3'、或者5'GTGCCAGCMGCCGCGG 3'、或者5'AAACTYAAAKGAATTGACGG 3'、或者5'GCAACGAGCGCAACCC 3'、或者5'AGAGTTTGATCATGGCTCAG 3'、或者5’AACTGAAGAGTTTGATCCTGGCTC 3’;R:5’、GGTTACCTTGTTACGACTT 3’、或者5'GGYTACCTTGTTACGACTT 3'、或者5'CTGCTGCSYCCCGTAG 3'、或者5'GWATTACCGCGGCKGCTG 3'、或者5'CCGTCAATTCMTTTRAGTTT 3'、或者5'GGGTTGCGCTCGTTG 3'、或者5'AAGGAGGTGWTCCARCC 3'、或者5'TAGGGTTACCTTGTTACGACTT 3'、或者5’TACGGTTACCTTGTTACGACTT 3’。所用的试剂为Ex Taq HS,宝日医生物技术(北京)有限公司,货号RR006A。在本实施例中,所用的富集靶向序列的方法为PCR富集16S rRNA基因序列。在本技术方案中,所用富集方法还可能为杂交捕获;PCR所用的酶也不局限于该公司的该货号,可能是其他货号或者其他公司的Taq酶,也可能是其他类型的酶,只要能够扩增DNA序列即可;对于细菌等原核生物,本实施例选取了16S rRNA基因的序列作为靶向序列,但是靶向序列存在多种可能性,包括但是不限于16S/18S/ITS等区域的序列,根据测序目的的不同可以任意选择和组合,即使针对原核生物,靶向序列也不局限于16S,扩增引物也不局限于实施例中所列举的序列;实施例中所列举的反应体系与扩增条件,仅仅是针对当前所用的Taq酶和所用引物的适宜反应条件,不作为对本技术方案的限制。
反应体系为:
组分 体积(μl)
10*Buffer 7.5
dNTPs 6
Ex Taq 0.375
F 3
R 3
模板量 10
ddH 2O 45.125
反应条件为:
Figure PCTCN2021115705-appb-000003
3.产物纯化,使用的主要试剂为DNA分选磁珠(无锡百迈格生物科技有限公司,BMSX)。操作步骤如下:本实施例中,使用DNA分选磁珠对PCR产物进行纯化。在本技术方案中,纯化方式不局限于使用磁珠进行纯化,也可能是吸附柱纯化等其他可以纯化PCR产物的方式;磁珠纯化也不局限于DNA分选磁珠,也不局限于该公司的产品,能够纯化PCR产物的磁珠均可;甚至,是否需要进行纯化,或者纯化形式的选择,要根据上一步对靶向序列的富集方式、以及下一步反应所用的试剂和反应条件进行选择,也可能不需要进行纯化,也可能不是对“PCR产物”进行纯化。
a)将已室温平衡30min的磁珠涡旋震荡3min,充分混匀,按照相对应编号向96孔板的对应孔中加入37.5μL磁珠。
b)待PCR反应结束后,取下PCR管,离心10s,将PCR产物转移至上步对应好编号的96孔板中,封板膜密封,涡旋震荡30s,静置5min,瞬时离心,放置磁力板上静置5min,小心吸走上清液。
c)加入180μL新配制的75%乙醇,手持磁力板底部,将96孔板水平移动至相邻板槽中反复3-5次,静置5s,待液体澄清后吸走上清。
d)步骤(3)重复一次。
e)将96孔板从磁力板上取下,瞬时离心,重新放回磁力板,用100μL排枪将剩余乙醇吸走,敞开盖晾干至磁珠出现细微裂痕。
f)加入22μL ddH 2O,封板膜密封,涡旋震荡30s,静置5min,瞬时离心。将96孔板放置于磁力板上静置至澄清,将上清液全部转移至新的1.5mL低吸附离心管中,即为纯化产物。
4.PCR产物定量,使用Qubit 3.0试剂(Qubit dsDNA BR Assay Kit,Thermo Fisher Scientific Q32850),对纯化好的PCR产物进行定量。按照试剂与仪器的说明书进行操作。在本实施例中,PCR产物的定量使用了荧光染料法。在本技术方案中,对于靶向富集后的核酸,定量方式不局限于荧光染料法,也可能是其他染料法或者非染料法,比如紫外分光光度计法、荧光定量PCR法、毛细管电泳或微流控电泳结合核酸染料荧光成像法等;对于荧光染料法,也不局限于该公司或者该货号的试剂;所用方法可以准确反应样本中的核酸的质量(指物质含量,并非品质)即可;如果使用紫外分光光度法,还能够获得核酸品质的信息,这些信息也属于本技术方案质量控制的范畴,不构成对本技术方案的创新。
5.目的基因片段化。本实验使用脱氧核糖核酸酶I(DNase I)对待测核酸进行片段化处理,DNase I来自于New England Biolabs,货号为M0303S。操作步骤如下:本实施例中,使用核酸酶DNase I对样本中的核酸进行打断。本技术方案中,所用的方法不局限于使用酶等生物学方法,也可能是超声等物理学方法,或者化学方法,或者其他类型的方法,能够可控的将长片段的核酸打断为短片段,同时不引入会影响核酸序列识别的其他改变即可;即使用生物学方法,也不局限于DNase I这种酶,还可能是其他能够切断DNA的酶,例如其他DNA内切酶、DNA转座酶(如Tn5转座酶)、CRISPR等;使用DNase I这种酶,也不局限于该公司或者该货号;反应条件仅为配合本实施例中所用的试剂,随着试剂与方法的不同,反应条件有可能改变,不构成对本技术方案的限制。在本实施例中,从上一步得到的核酸中取出5-100ng加入本反应。本技术方案中,5-100ng不是一个限制条件,仅仅是一个较为合适的条件,更多或者更少的核酸也可以用于本技术方案,0.01ng或者更少的核酸,使用本技术方案也能得到正确的结果。本技术方案中,所用的 DNase I的量仅为当前条件下一个较为合适的量,并不是限制性因素。本技术方案中,纯化方式的说明同前。
a)将样本的16S rRNA纯化产物、酶缓冲液、EDTA及MnCl 2从冰箱中拿出,融化,震荡,瞬时离心,备用。
b)取出与样本相对应数量的PCR管,放置冰盒上,加入60ng的16S rRNA纯化产物XμL,同时补水(30-X)μL。EDTA按1:4的比例稀释配置后备用(终浓度1.25mmol/L)。
c)原浓度酶(2U/μL)稀释为0.1U/μL:取1个0.2ml的PCR管,放置冰盒上,加入19μL1*DNase Ⅰ buffer(10*buffer稀释),然后从冰箱中取出DNase Ⅰ酶,瞬时离心后,取1μL的酶加入到PCR管中,混合点震荡30-40次,瞬时离心后立即放置冰盒上。
d)配制酶、酶缓冲液以及MnCl 2MIX:用0.01U酶切(30ng DNA):
i.取1个1.5ml的离心管,放置冰盒上,依次往里面加入10*DNase Ⅰ缓冲液((n+2)*2)μL、MnCl 2((n+2)*2)μL以及稀释后的DNase Ⅰ酶(0.1U/μL)((n+2)*0.1)μL,加入完成后,点震荡40-50次,瞬时离心后立即放置冰盒上。
ii.依次取4.1μL MIX加入到八连排中,然后用排枪取15μL步骤2中所配溶液按照对应顺序加入到八联排中,改好管盖,点震荡5-10次,瞬时离心后立即放置冰盒上。
e)将八联排管放入PCR仪上,20℃15min反应。
f)反应后,取出八连排,加入5μL稀释后的EDTA终止反应,加入后,震荡,瞬时离心,得到的产物为弥散片段。立即进入下一步的磁珠纯化步骤。
6.片段化产物纯化。使用的主要试剂为DNA分选磁珠(无锡百迈格生物科技有限公司,BMSX)。操作步骤如下:
a)将已室温平衡30min的磁珠涡旋震荡3min,充分混匀,按照相对应编号向96孔板的对应孔中加入50μL磁珠。
b)将酶切产物转移至上步对应好编号的96孔板中,封板膜密封,涡旋震荡30s,静置5min,瞬时离心,放置磁力板上静置5min,小心吸走上清液。
c)加入180μL新配制的75%乙醇,手持磁力板底部,将96孔板水平移动至相邻板槽中反复3-5次,静置5s,待液体澄清后吸走上清。
d)步骤(3)重复一次。
e)将96孔板从磁力板上取下,瞬时离心,重新放回磁力板,用100μL排枪将剩余乙醇吸走,敞开盖至磁珠出现细微裂痕。
f)加入42μL TE,封板膜密封,涡旋震荡30s,静置5min,瞬时离心。将96孔板放置于磁力板上静置至澄清,吸取40.2μL上清液转移至新的96孔板中,即为纯化产物。
7.DNA末端修复,使用试剂为Pfu酶(天根生化科技(北京)有限公司,EP101)、dNTP Mixture(天根生化科技(北京)有限公司,CD111),DNA分选磁珠(无锡百迈格生物科技有限公司,BMSX)。操作步骤如下:本实施例中,末端修复后的dsDNA为平末端,即DNA的双链末端齐平。本技术方案中,对小片段的末端修复根据所选用的测序平台的不同而不同,例如选用illumina测序平台时,修复后的dsDNA的一条链末端会有一个突出的腺嘌呤(A),修复方式不构成本技术方案的限制条件;末端修复选用的酶也不局限于pfu酶,也可能是Taq酶或者其他酶类;选用pfu酶,也不局限于该公司的该货号。在本实施例中,在末端修复步骤之后,进行核酸片段长度筛选。在本技术方案中,片段筛选不局限于使用DNA片段分选磁珠,也可能是其他方式,例如将核酸电泳之后选取所需长度片段的核酸并回收,能够对核酸片段进行选择即可;所使用的DNA片段分选磁珠也不限于该公司或者该货号;筛选后保留的片段长度与所选用的测序仪器、测序试剂、测序参数设置有关,不构成对本技术方案的限制;对核酸片段长度的筛选,也不局限于在末端修复之后,也可能在末端修复之前,或者在下一步、下下一步实验之后,这一调整不构成对本技术方案的限制。本技术方案中,纯化方式的说明同前。
a)在装有40μL DNA的低吸附管中依次加入试剂,并使用移液器充分吹打,震荡混匀,瞬时离心,置于PCR仪上72℃反应15min。
组分 反应体积(μL)
DNA 40.2
Pfu酶10*buffer 5μL
dNTPs(2.5mM) 4μL
Pfu酶 0.8μL
总体积 50μL
b)将反应后的DNA产物加入装有32.5μL磁珠的96孔板中,封板膜密封,涡旋震荡30sec,静置5min,瞬时离心,放置磁力板上静置5min,将上清对应转入另外新的装有10μL磁珠的96孔板中,封板膜密封,涡旋震荡30sec,静置5min,瞬时离心,放置磁力板上静置5min,小心吸走上清液。
c)加入180μL新配制的75%乙醇,手持磁力板底部,将96孔板水平移动至相邻板槽中反复3-5次,静置5sec,待液体澄清后吸走上清。
d)步骤(c)重复一次。
e)将96孔板从磁力板上取下,瞬时离心,重新放回磁力板,用100μL排枪将剩余乙醇吸走,敞开盖晾干至磁珠出现细微裂痕。
f)加入27μL TE,封板膜密封,涡旋震荡30sec,静置5min,瞬时离心。将96孔板放置于磁力板上静置至澄清,吸取25μL上清液转移至新的96孔板中,即为纯化产物。
8.测序接头连接。所用试剂为T4 DNA连接酶(Thermo Fisher Scientific,EL0011),DNA分选磁珠(无锡百迈格生物科技有限公司,BMSX),操作步骤如下:本实施例中,测序接头指测序引物,所用的测序引物为Thermo Fisher公司Ion Torrent测序平台的测序引物。在本技术方案中,测序引物的选择受到测序仪器、测序试剂的影响,不构成对本技术方案的限制;所用的连接酶也不局限于T4连接酶,能够将两段核酸片段连接起来的酶或者其他技术方法均可以使用;选用T4连接酶,也不局限于该公司或者该货号;反应体系的比例以及反应条件,仅为当前较为合适的条件,不构成对被技术方案的限制。本技术方案中,纯化方式的说明同前。
a)根据下表在96孔PCR板中依次加入试剂,涡旋振荡5sec,瞬时离心,将管子置于PCR仪上,25℃反应30min。
组分 反应体积(μL)
平末端DNA 25
无核酸酶水 10.0
10*连接缓冲液 5
50%PEG4000 5
DNA连接酶 1
P1接头 2
标签序列X 2
反应总体系 50
b)将反应后的DNA产物加入装有40μL磁珠的96孔板中,封板膜密封,涡旋震荡30sec,静置5min,瞬时离心,放置磁力板上静置5min,小心吸走上清液。
c)加入180μL新配制的75%乙醇,手持磁力板底部,将96孔板水平移动至相邻板槽中反复3-5次,静置5sec,待液体澄清后吸走上清。
d)步骤(c)重复一次。
e)将96孔板从磁力板上取下,瞬时离心,重新放回磁力板,用100μL排枪将剩余乙醇吸走,敞开盖晾干至磁珠出现细微裂痕。
f)加入23μL TE,封板膜密封,涡旋震荡30sec,静置5min,瞬时离心。将96孔板放置于磁力板上静置至澄清,吸取20μL上清液转移至新的96孔板中,即为纯化产物。
9.测序文库富集。所用试剂为HiFi高保真Taq酶(KAPA biosystems,KK2602),DNA分选磁珠(无锡百迈格生物科技有限公司,BMSX),操作步骤如下:本实施例中,使用高保真Taq酶以PCR方式进行测序文库富集。在本技术方案中,文库富集并不是本技术方案所必须,省略该步骤不构成对本技术方案的创新;富集方式不局限于PCR方法,能够提高样本中可用测序文库的比例或者含量的方法均可;PCR方法不局限于选用Taq酶,其他能够进行核酸扩增的酶均可能用于本方案;Taq酶也不局限于该公司或者该货号;PCR扩增引物的选择受到测序仪器、测序试剂的影响,不构成对本技术方案的限制;反应体系的比 例以及反应条件,仅为当前较为合适的条件,不构成对本技术方案的限制。本技术方案中,纯化方式的说明同前。
a)根据下表在96孔PCR板中配置PCR反应体系:
组分 反应体积(μL)
DNA 20
KAPA HiFi高保真酶混合物 25
PGM-A(10μM) 2.5
PGM-P1(10μM) 2.5
反应总体系 50
反应程序:
Figure PCTCN2021115705-appb-000004
b)将反应后的DNA产物加入装有40μL磁珠的96孔板中,封板膜密封,涡旋震荡30sec,静置5min,瞬时离心,放置磁力板上静置5min,小心吸走上清液。
c)加入180μL新配制的75%乙醇,手持磁力板底部,将96孔板水平移动至相邻板槽中反复3-5次,静置5s,待液体澄清后吸走上清。
d)步骤(c)重复一次。
e)将96孔板从磁力板上取下,瞬时离心,重新放回磁力板,用100μL排枪将剩余乙醇吸走,敞开盖晾干至磁珠出现细微裂痕。
f)加入52μL TE,封板膜密封,涡旋震荡30sec,静置5min,瞬时离心。将96孔板放置于磁力板上静置至澄清,吸取50μL上清液转移至新的96孔板中,再次进行纯化,最后加入40μL TE进行洗脱,即为最终的测序文库。
10.测序文库定量,使用Qubit 3.0试剂(Qubit dsDNA BR Assay Kit,Thermo Fisher Scientific Q32850),对纯化好的PCR产物进行定量。按照试剂与仪器的说明书进行操作。在本实施例中,测序文库的定量使用了荧光染料法。在本技术方案中,对于构建完成的测序文库,定量方式不局限于荧光染料法,也可能是其他染料法或者非染料法,比如紫外分光光度计法、荧光定量PCR法、毛细管电泳或微流控电泳结合核酸染料荧光成像方法等;对于荧光染料法,也不局限于该公司或者该货号的试剂;所用方法可以准确反应样本中的核酸的质量(指物质含量,并非品质)即可;如果使用紫外分光光度法,还能够获得核酸品质的信息,这些信息也属于本技术方案质量控制的范畴,不构成对本技术方案的创新。
11.混合测序文库。将同一批次测序的不同样本的测序文库进行混合,根据Qubit定量的结果,不同样本都加入相等的量,制作混合文库。本实施例中,不同样本的测序文库为等量混合。在本技术方案中,不同样本的混合也可以是不等量;每次混合的样本数量可以根据测序仪器、测序试剂、测序方式、实验实际需求等情况灵活调整,不构成对本技术方案的限制。
12.混合文库定量。使用Ion Library TaqMan Quantitation Kit(Thermo Fisher Scientific,4468802),按照试剂说明书进行操作。本实施例中,使用荧光定量PCR对混合后的文库进行定量。在本技术方案中,定量方式也可以是其他染料法或者非染料法,例如Qubit、紫外分光光度计、Agilent 2100生物分析仪等;选用荧光定量PCR法,也不局限于该公司或者该货号的试剂;甚至,此步骤的定量不是必须进行的,省略此步骤也不构成对本技术方案的限制;如果使用紫外分光光度法或者Agilent 2100等仪器,还能够获得核酸品质的信息,这些信息也属于本技术方案质量控制的范畴,不构成对本技术方案的创新。
13.进行测序,使用Ion One Touch 2测序准备平台和Ion Torrent PGM测序平台,使用试剂为Ion PGM Hi-Q View OT2 Kit(Thermo Fisher Scientific,A29900),Ion PGM Hi-Q View Sequencing Kit(Thermo Fisher  Scientific,A30044),Ion316 Chip Kit v2 BC(Thermo Fisher Scientific,4488149)按照相应仪器与试剂说明书进行操作。在本实施例中,选取了Thermo Fisher的Ion Torrent PGM测序仪和Ion 316测序芯片以及配套的测序试剂、测序方法进行实验。在本技术方案中,实验流程兼容的NGS测序仪厂家包括但不限于Thermo Fisher,Illumina,BGI等主流厂家所有目前在市场销售的仪器与试剂以及测序方式,这些均不构成对本技术方案的限制。
(三)数据分析
下机数据经过数据过滤,滤除低质量的reads,剩余高质量的Clean data方可用于后期分析。分析流程如下:
1.数据库:包含252种重要临床致病微生物,包含代表序列2396个,代表序列选取与前述数据库构建流程中的一级seed选取方法类似,区别是直接对所有序列按照99.5%相似性聚类,选择每个类的seed作为代表组成了代表集。
2.将高质量的测序reads与人类基因组序列hg19版本序列进行比对,使用bowtie2默认参数,将比对上人基因组的序列去除。
3.直接进行代表集序列不竞争reads的比对及覆盖统计,主要方法:使用bowtie2(参数–a;或者将reads分别与每个参考序列比对,使用默认参数)进行比对,然后按照0.5%mismatch reads rate进行过滤,并将alignment中因PCR重复而产生的reads去除,计算每个参考序列的覆盖度并根据参数CV≤0.5&gap≤15&mean depth≥20进行过滤。
4.将3)过滤后的参考序列按照98%相似性进行聚类,筛选出每个cluster的代表序列进入下一步分析。
5.将reads与4)筛选得到的参考序列进行比对,使用bowtie2软件的默认参数,然后按照0.5%的mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除,计算每个参考序列的CV、gap、Cor、CV/Cor、mean depth指标,按照CV≤0.4&Cor≥0.7&mean depth≥20&gap≤15的指标进行逐步迭代过滤,,直到所有参考序列满足要求。然后计算每个物种的reads数目及其占比情况。
(四)实验结果
10个样本的平均检测序列数量为55,931条,在所有样本中,表皮葡萄球菌都可以正确检出,灵敏度为100%。
Figure PCTCN2021115705-appb-000005
Figure PCTCN2021115705-appb-000006
实施例四
本实施例以革兰氏阴性菌作为实验对象,为使用液体培养基培养的已知物种分类的单纯菌种。目的是考察本技术方案对样本中微生物是否能够正确鉴定,以及考察本技术方案检测能力的灵敏度下限。
(一)样本准备
单菌种样本(粘质沙雷菌),使用琼脂平板培养计数,确认加入的细菌CFU数。投入量为2560CFU,样本编号为INQ19M0111-INQ19M0112。投入量为640CFU,样本编号为INQ19M0113-INQ19M0114。投入量为160CFU,样本编号为INQ19M0115-INQ19M0116。投入量为40CFU,样本编号为INQ19M0117-INQ19M0118。投入量为10CFU,样本编号为INQ19M0119-INQ19M0120。
(二)实验过程
同实施例三。
(三)数据分析
同实施例三。
(四)实验结果
10个样本的平均检测序列数量为54,703条,除样本INQ19M0119外,其余样本中粘质沙雷菌都可以正确检出。
Figure PCTCN2021115705-appb-000007
实施例五
使用液体培养基培养的已知物种分类的单纯菌种,考察本技术方案是否可以正确鉴定多个微生物物种混合的样本。
(一)样本准备
对5种微生物(金黄色葡萄球菌、表皮葡萄球菌、粘质沙雷菌、绿脓杆菌、大肠埃希菌)使用琼脂平板培养计数,确认细菌CFU数。等量混合所有菌株(200CFU),样本编号为INQ19M0133、INQ19M0134。
(二)实验过程
同实施例三。
(三)数据分析
同实施例三。
(四)实验结果
2个样本的平均检测序列数量为62,970条,两个样本中,加入的5个物种都可以正确检出。
Figure PCTCN2021115705-appb-000008
实施例六
使用液体培养基培养的已知物种分类的单纯菌种,本实施例的目的是考察本技术方案能否正确鉴定含有多个物种的样本中占比很低的物种。
(一)样本准备
2种实验室培养的微生物菌株(表皮葡萄球菌、粘质沙雷菌)用琼脂平板培养计数,确认细菌CFU数。按16倍稀释梯度混合(粘质沙雷菌混合量3200CFU,表皮葡萄球菌混合量200CFU)构成样本,样本编号为INQ19M0143、INQ19M0144。
(二)实验过程
同实施例三。
(三)分析方法
同实施例三。
(四)实验结果
2个样本的平均检测序列数量为46,026条,两个样本中,含量相差16倍的两个物种都能够正确检出。
Figure PCTCN2021115705-appb-000009
Figure PCTCN2021115705-appb-000010
实施例七
考察本技术方案检测临床样本的能力。
(一)样本准备
来自医院采集的用于病原微生物鉴定的胆汁样本,编号INQ19M0322。
(二)实验过程
同实施例三。
(三)数据分析
1.数据库:包含252种重要临床致病微生物的数据库,包含一级代表序列1920个,总序列数143009条(本版本参考序列未去冗余)。
2.将高质量的测序reads与人类基因组序列hg19版本序列进行比对,使用bowtie2默认参数,将比对上人基因组的序列去除。
3.直接将reads与一级seed参考序列进行不竞争reads的比对、覆盖统计及筛选。主要方法为:使用bowtie2(参数--min-score L,-0.6,-0.15-a)进行序列比对,再按照0.5%mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除,然后计算每个参考序列的覆盖度并根据参数CV≤0.5&gap≤25&mean depth≥20进行过滤,通过阈值的参考序列作为下一步的候选。
4.将3、过滤后的参考序列按照98%相似性进行聚类,筛选出每个cluster的代表序列seed进入下一步分析。
5.进行参考序列间竞争reads的比对并根据覆盖情况逐步迭代筛选出满足要求的参考序列。具体步骤是:将reads与4、筛选得到的参考序列进行比对,使用bowtie2软件的默认参数,然后按照0.5%的mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除。对于参考序列比对位点中优势碱基占比小于0.95的位点仅保留支持参考序列的reads比对。计算每个参考序列的CV、gap、Cor、CV/Cor、mean depth指标,然后按照阈值CV≤0.5&Cor≥0.65&mean depth≥20&gap≤25的指标进行逐步迭代筛选,直到所有参考序列满足要求。满足要求的参考序列作为下一步的候选。
6.将步骤5筛选出的候选序列进行seed内部迭代筛选。主要步骤是:首先对这些seed按照覆盖情况进行分类,满足阈值CV≤0.4&Cor≥0.9&gap==0的参考序列直接进入最终迭代筛选;不满足的参考序列,分别对其进行seed内部的迭代筛选,筛选方法与3、4、两步一样,其中同步骤3的迭代筛选参数为CV≤0.5&gap≤15&mean depth≥20,同步骤4的迭代筛选参数为CV≤0.5&Cor≥0.7&mean depth≥20&gap≤25,将所有满足要求的候选seed合并,作为终极候选参考序列。
7.将reads与步骤6得到的终极候选参考序列比对,按照0.5%的mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除。计算每个参考序列的CV、gap、Cor、CV/Cor、mean depth指标,然后按照阈值CV≤0.5&Cor≥0.7&mean depth≥20&gap≤25进行逐步迭代筛选。所有满足要求的seed作为最终的目标序列,对应的物种为检测的物种,同时计算每个物种的read count以及在样本中的占比。
(四)实验结果
样本的检测序列数量为55,115条,人源序列占所有检测序列的0.020%。本技术的检测结果显示样本中含有大肠埃希菌、屎肠球菌,此结果与使用微生物培养等方法得到的结果一致。
Figure PCTCN2021115705-appb-000011
Figure PCTCN2021115705-appb-000012
实施例八
考察本技术方案检测临床样本的能力。
(一)样本准备
来自医院采集的用于病原微生物鉴定的胸水样本,编号INQ19M0324。
(二)实验过程
同实施例三。
(三)数据分析
同实施例七。
(四)实验结果
样本的检测序列数量为67,797条,人源序列占所有检测序列的9.50%。本技术的检测结果显示,样本中含有大肠埃希菌,此结果与使用微生物培养等方法得到的结果一致。
Figure PCTCN2021115705-appb-000013
实施例九
比较本技术方案与当前最常用的微生物鉴定方法的检测能力,对比指标包括检测阳性率、阳性结果一致性,考察本技术方案检测临床样本的能力。
(一)样本准备
来自医院采集的用于病原微生物鉴定的一批样本,收集时间2019.08-2019.08,有效样本共89例。样本采集后分为两份,分别使用本技术方案或者细菌培养的方式进行鉴定。样本类型包括:胆汁、胸腹腔积液、关节腔积液、脑脊液、尿液、脓、心包积液、引流液、痰等,囊括了临床微生物检验的主要样本类型。
(二)实验过程
同实施例三。
(三)数据分析
1.数据库:包含1.8万多个物种(含所有已知的细菌、支原体、衣原体、立克次体、螺旋体)。按照97%、98%、99%相似性构建参考序列的层级关系,共包含一级代表序列30,816个,总序列数154,392条(本版本参考序列为去除冗余的参考序列数量)。
2.将高质量的测序reads与人类基因组参考序列hg38版本进行比对,使用bowtie2默认参数,将比对上人基因组的序列去除。
3.使用hisat2软件按照参数--score-min L,-1,-0.08--no-spliced-alignment--no-templatelen-adjustment--secondary-k 100000--mm--no-softclip进行比对,然后按照1%mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除。统计每个参考序列比对的reads数目,将比对reads数大于10条的记为“有read比对”,否则为“无read比对”。按聚类层级关系,根据富集分析的方法,按照较宽松的富集p值0.1筛选出显著富集reads的seed ID,提取对应的seed作为下一步的候选参考序列。
4.使用bowtie2将reads与步骤3得到的候选参考序列比对,比对参数为--min-score L,-1,-0.1–a,参考序列间不竞争reads。然后按照0.5%mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除。计算每个参考序列的覆盖度并根据参数CV≤0.6&gap≤50&mean depth≥20 进行过滤,通过阈值的参考序列作为下一步的候选序列。
5.将reads与步骤4筛选得到的参考序列进行比对,使用bowtie2软件,参数同步骤4。然后按照0.5%的mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除。对于参考序列比对位点中优势碱基占比小于0.95的位点仅保留支持参考序列的reads比对。计算每个参考序列的CV、gap、Cor、mean depth指标,按照CV≤0.6&Cor≥0.65&mean depth≥20&gap≤25的指标,并进行逐步迭代筛选,直到所有参考序列满足要求。
6.如果覆盖度参数满足CV≤0.4&Cor≥0.9&gap==0,则步骤5筛选出的参考序列直接进入最终的迭代,否则,需要进入参考序列的二级cluster内部进行竞争迭代筛选,二级cluster内部迭代筛选步骤同4、和5,筛选的阈值参数为CV≤0.6&Cor≥0.7&mean depth≥20&gap≤25。
7.将步骤6筛选出的所有满足条件的参考序列合并,进行终极迭代筛选,使用bowtie2比对,参数同步骤4。然后使用0.5%的mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除。计算每个参考序列的覆盖度,根据阈值CV≤0.6&Cor≥0.7&mean depth≥20&gap≤25进行逐步迭代筛选。最终满足要求的参考序列作为最终的目标序列,对应的物种为目标物种,同时计算每个物种的reads数目以及在样本中的占比。
(四)实验结果
全部样本使用本技术方案的检测阳性率为98.9%(88/89)。在使用细菌培养鉴定方法检测出的阳性样本中,本技术方案与细菌培养鉴定检测结果的符合度为90.6%(48/53)。
Figure PCTCN2021115705-appb-000014
Figure PCTCN2021115705-appb-000015
Figure PCTCN2021115705-appb-000016
Figure PCTCN2021115705-appb-000017
Figure PCTCN2021115705-appb-000018
Figure PCTCN2021115705-appb-000019
Figure PCTCN2021115705-appb-000020
Figure PCTCN2021115705-appb-000021
Figure PCTCN2021115705-appb-000022
Figure PCTCN2021115705-appb-000023
Figure PCTCN2021115705-appb-000024
实施例十
本实施例用于说明本技术方案中,在某一环节加入一定量已知序列的核酸之后,可以提高对检测结果中潜在的污染结果鉴别的能力。
(一)样本准备
来自医院采集的用于病原微生物鉴定的引流液样本,样本采集后分为两份,分别使用本技术方案或者细菌培养的方式进行鉴定。
(二)实验流程
同实施例三。增加部分包括:步骤2,目的基因扩增时,在配制PCR反应的反应液中加入一定量的含有已知序列的核酸的质粒载体,样本提取时的阴性对照得到的核酸提取产物也使用加入了相同量的已知序列的核酸,同时进行反应。加入的核酸序列长度与16S rRNA基因的序列长度近似。本实施例选择在此步骤加入已知序列的核酸。在本技术方案中,加入已知序列的核酸并不是必须的,不加入并不会影响本技术方案的检测能力;已知序列的核酸不局限于在此步骤加入,也可以在核酸提取前加入,或者在本步骤之后加入,或者在本技术方案任意认为合适的步骤加入;加入的核酸序列可以单独存在,也可以使用质粒、病毒、细胞等作为载体;加入的核酸序列量可以从1个copy至无限多个copy的范围。
(三)分析流程
1.数据库:包含2202个临床致病菌物种。同时按照97%、98%、99%相似性构建参考序列的层级关系,共包含一级代表序列40,035个,总序列数92,385条(参考序列为去除冗余的参考序列数量)。
2.使用hisat2软件按照参数--score-min L,-1,-0.08--no-spliced-alignment--no-templatelen-adjustment--secondary-k 100000--mm--no-softclip进行比对,然后按照1%mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除。然后统计每个参考序列比对的reads数目,将比对reads数大于10条的记为“有read比对”,否则为“无read比对”。按聚类层级关系根据富集分析的方法,按照较宽松的富集p值0.1筛选出显著富集reads的seed ID,提取对应的seed作为下一步的候选参考序列。
3.使用bowtie2将reads与2、得到的候选参考序列比对,比对参数为--min-score L,-1,-0.08–a,参考序列间不竞争reads,然后按照1%mismatch reads rate对alignment进行过滤,去除alignment中由于PCR导致的重复序列。计算每个参考序列的覆盖度并根据参数CV≤0.6&end gap≤40&middle gap≤10&mean depth≥20进行过滤,通过阈值的参考序列作为下一步的候选序列。
4.将3、筛选得到的参考序列利用排列组合的方式进行随机分组,然后对每组进行比对迭代筛选。将reads分别与每组筛选得到的参考序列进行比对,使用bowtie2软件,参数同3、。按照1%的mismatch reads rate对alignment进行过滤,去除alignment中由于PCR导致的重复序列。计算每个参考序列的NRMSE、end gap、middle gap、Cor、mean depth等指标,按照NRMSE≤0.35&Cor≥0.6&mean depth≥20&end gap≤40&middle gap≤10的指标,进行逐步迭代筛选,直到所有参考序列满足要求。
5.将4、筛选出的所有参考序列,分别进入参考序列的二级cluster内部进行竞争迭代筛选。二级cluster内部迭代筛选步骤同3、和4、,但是筛选的阈值更严格一些:比对参数为--min-score L,-1,-0.04–a;两步的reads mismatch rate仅允许0.5%;同步骤3、的筛选参数为end gap≤25&middle gap<1&CV≤0.55&mean depth≥20;同步骤4、的参数为NRMSE≤0.3&Cor≥0.6&mean depth≥20&end gap≤25&middle gap<1。满足条件的参考序列进入下一步分析。
6.将5、筛选出的所有满足条件的参考序列合并,进行终极迭代筛选。使用bowtie2比对,参数同5、,使用0.5%的mismatch reads rate对alignment进行过滤,并将alignment中因PCR重复而产生的reads去除,对于参考序列比对位点中优势碱基占比小于0.95的位点仅保留支持参考序列的reads比对。计算每个参考序列的覆盖度,根据阈值为NRMSE≤0.3&Cor≥0.65&mean depth≥20&end gap≤25&middle gap<1进行逐步迭代筛选,最终得到满足要求的参考序列作为最终的目标序列,对应的物种为目标物种,同时计算每个物种的reads数目以及在样本中的占比。
7.对上述筛选得到的参考序列进行背景污染筛选。一是如果参考序列不在重要临床致病物种列表且占比小于5%的物种直接去除;二是对于检测到内标序列的样本中的物种,如果这些物种在对照样本中也有检测到,计算临床样本中该物种与内标的比值与所有对照样本中该物种与内标比值符合同一个正态分布的概率。如果概率大于0.05,则认为该物种是背景污染并从检测结果中去除,否则予以保留。然后重新计算所有物种在样本中的占比,得到最终的分析结果。
(四)实验结果
使用本技术方案对原始样本进行检测,原始检测结果中包括已知序列的核酸(INQ internal standard)(相对含量1)、疮疱丙酸杆菌(相对含量0.064)、粪肠球菌(相对含量0.5781)。通过与同批次的阴性对照样本的检测结果对比,以与已知序列核酸的相对含量为计算依据,可以判定疮疱丙酸杆菌为污染,最终报告结果为粪肠球菌,与使用微生物培养等方法得到的结果一致。
Figure PCTCN2021115705-appb-000025
实施例和对比例的产品的表征数据以及效果数据
本发明实施例三至实施例八中的全部样本,平均每个样本的测序数据为55,663条短序列数据,有效数据量在本实施例中为90%以上,对靶序列的覆盖度接近100%。
实施例七中,样本的检测序列数量为55,115条短序列,人源序列占所有检测序列的0.020%。共有7,707条短序列比对至Enterococcus faecium(屎肠球菌)物种,占比13.98%,物种目标基因序列测序覆盖率99.9%,平均测序深度349.04×。共有2,761条短序列比对至Escherichia coli(大肠埃希菌)物种,占比5.01%,物种目标基因序列测序覆盖率99.8%,平均测序深度140.85×。
实施例八中,样本的检测序列数量为67,797条,人源序列占所有检测序列的9.50%。共有3,136条短序列比对至Escherichia coli(大肠埃希菌)物种,占比4.63%,物种目标基因序列测序覆盖率100%,平均测 序深度397.54×。
作为对比,New England Journal of Medicine,2014,370(25):2408-2417的案例报道中,采用宏基因组方法(mNGS)检测患者样本。样本的测序数据为8,187,737条短序列,其中98.7%的数据为人类基因组序列,总共有475条短序列比对至Leptospiraceae科,在总数据量中占比为0.0006%(万分之0.6),物种基因组测序覆盖率2.2%-3.7%,基因组平均测序深度为0.02-0.04×。案例对比显示,mNGS技术检测微生物能够获得的有效数据量、有效数据占比、以及对目标基因组序列的覆盖度及测序深度均远不及本技术方案。
本发明实施例九中,89个临床样本用于检测。本技术方案的检测阳性率为98.9%(88/89),高于其他非NGS方法(细菌培养结合质谱鉴定方法)的阳性率(59.6%,53/89)。在非本技术方案检测阳性的样本中,本技术方案检测结果与前者的一致率为90.6%(48/53)。
同时,与另一案例进行比较,Clinical Infectious Diseases(2018)67(S2):S231–40的报道中,采用宏基因组方法检测患者样本的阳性率为38.2%(195/511),同组样本非NGS方法的阳性率为27.0%(138/511)。在使用非mNGS方案检测阳性的样本中,mNGS方案检测结果与前者的一致率为60.9%(84/138)。由此可见,本技术方案在临床样本检测中具有比现有技术更高的阳性率。同时,与相关技术mNGS相比,本技术方案具有更高的传统细菌培养检测阳性一致率。
以上所述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本发明的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本发明构思的前提下,还可以做出若干变形和改进,这些都属于本发明的保护范围。因此,本发明专利的保护范围应以所附权利要求为准。

Claims (19)

  1. 通过测序获取微生物物种及相关信息的方法,包括:
    i)获取测序数据,所述测序数据由靶向富集微生物特征核酸序列后经下一代测序技术对扩增产物进行测序得到;
    ii)将所述测序数据与特征序列数据库进行比对分析以鉴定待检测样本中的微生物组成;
    其中所述特征序列数据库预先根据含有所述特征序列的参考序列的相似性做聚类处理,得到一层或多层级的cluster,每个层级cluster中有至少一个子seed,最底层的cluster中有一或若干作为参考序列的seed;
    所述比对分析包括:
    a)去除所述测序数据中的低质量和含有测序接头序列的reads;
    b)将reads序列与所述特征序列数据库的seeds序列进行比对,将因PCR扩增而产生的完全重复reads去除,并对事件“有reads序列比对上”与seeds序列做统计学独立性检验,选出与事件“有reads序列比对上”相关的seeds序列,得到初级筛选seed序列;
    c)将read序列与所述初级筛选seed序列进行比对,其中所述初级筛选seed序列之间不竞争reads,计算每个seed序列的reads覆盖情况,计算评估seed序列覆盖的指标,并据此对seed序列进行次级筛选;
    d)将read序列与次级筛选得到的seed序列比对,seed序列比对过程中相互间竞争reads,计算比对中seed序列的read覆盖度指标,逐步迭代去除read覆盖度指标不满足要求的seed序列,得到叁级筛选seed序列;
    e)合并所述叁级筛选seed序列,并将reads与之进行比对,逐步迭代筛选得到满足要求的参考序列,其中逐步迭代筛选所用阈值相对于步骤d)更加严格;
    f)根据步骤e)得到的所述参考序列及其reads数量,计算物种水平的reads含量及其占比;计算时将同属于一个物种的多个参考序列的reads数目加和得到该物种的reads数,然后根据每个物种reads数目除以样本所含物种的reads数量的总和计算样本中每个物种的占比即丰度。
  2. 根据权利要求1所述的方法,所述微生物包括细菌、古菌、真菌、支原体、衣原体、力克次氏体、螺旋体、和病毒,其中RNA病毒的特征核酸序列通过逆转录其RNA基因组生成cDNA后获得;
    优选地,所述待检测样本为来自于微生物宿主的样本或来自于含有微生物的环境样本:
    所述来自于微生物宿主的样本包括但是不限于:粪便、肠道内容物、皮肤、痰液、血液、唾液、牙菌斑、尿液、阴道分泌物、胆汁、支气管肺泡灌洗液、脑脊液、胸水、腹水、盆腔积液、脓液以及瘤胃中的至少一种;
    所述来自于含有微生物的环境样本包括但是不限于:物体内外表面、生活用水、医疗用水、工业用水、食品、饮料、肥料、废水、火山灰、冻土层、淤泥、土壤、堆肥、污染河流养殖水体以及空气中的至少一种。
  3. 根据权利要求1所述的方法,步骤i)中所述的靶向富集微生物特征核酸序列的富集方法包括PCR、核酸探针杂交捕获、生物素标记捕获、地高辛标记捕获、同位素标记捕获、磁珠捕获、抗体捕获、CRISPR/Cas技术中的一种或者多种,其反应方式选择在液体中、固体表面或者两种形式结合使用。
  4. 根据权利要求1所述的方法,所述待检测样本为来自于微生物宿主的样本,步骤a)还包括:去除所述样本中所述宿主的核酸测序数据干扰。
  5. 根据权利要求2或4所述的方法,所述宿主为人。
  6. 根据权利要求1所述的方法,在步骤d)中,所述逐步迭代去除read覆盖度指标不满足要求的seed序列之后还包括cluster内部参考序列的筛选步骤:
    将上步获得的每一个seed所属cluster的参考序列与reads进行比对,相同cluster内部的参考序列间竞争reads;统计每个参考序列的reads覆盖情况,根据reads覆盖指标进行过滤,逐轮迭代去除reads覆盖不佳的seed序列,其中逐轮迭代筛选所用阈值相对于步骤d)前面所述步骤更加严格。
  7. 根据权利要求1所述的方法,在步骤f)后还包括步骤g):排除实验环境中的背景污染物种的核酸测序数据干扰。
  8. 根据权利要求1所述的方法,在步骤b)中,所述统计学独立性检验为费舍尔精确检验,其包括:
    将每个比对上的reads数量大于一定数量的参考序列记为“有read比对”,否则记为“无read比对”;根 据参考序列数据库中seeds的聚类层级关系,统计检验聚类树中每个seed下属的叶子节点中是否显著富集“有read比对”的参考序列,逐级筛选出满足要求的seeds。
  9. 根据权利要求1~4、6~8任一项所述的方法,所述特征序列数据库的构建方法包括:
    获取含有所述特征序列的参考序列的公共数据库,去除所述数据库中所述参考序列在所述扩增引物之外的两端序列,得到第一数据库;
    根据物种内相似性对所述第一数据库中存在模糊碱基的所述参考序列进行矫正处理,并根据物种注释以及相似性去除100%相似的冗余参考序列,得到第二数据库;
    根据所述第二数据库中所述参考序列的相似性对其做聚类处理。
  10. 根据权利要求9所述的方法,在构建所述第一数据库时,去除所述数据库中所述参考序列在所述扩增引物之外的两端序列和引物序列。
  11. 根据权利要求9所述的方法,在构建所述第二数据库时,还包括:
    将每条参考序列与NCBI NT/NR库进行blast比对,按照序列相似性和/或覆盖度的规则从NCBI NT/NR库中筛选出匹配的参考序列集;
    在所述匹配的参考序列集的物种注释中选择最具代表性的物种分类,并使用该分类信息校正所针对的参考序列的物种注释。
  12. 根据权利要求9所述的方法,所述聚类处理包括第一聚类处理:
    将所有非冗余的参考序列按照相似性进行聚类;
    或;
    1)按照物种内非冗余的参考序列的相似性进行聚类;
    2)对1)中得到的seed按照相似性进行聚类,将能够聚在同一个类里的属于不同物种的seed及其子序列合并,然后按照99.5%相似性进行重聚类,用新获得的cluster替换参与重聚类计算的旧cluster。
  13. 根据权利要求12所述的方法,所述聚类处理还包括第二聚类处理:
    在cluster中seed的子序列太多的情况下,按照比所述第一聚类处理更高的相似性标准对所述第一聚类处理得到cluster进行拆分,用拆分后形成的新cluster替换掉拆分前的cluster。
  14. 根据权利要求13所述的方法,所述聚类处理还包括第三聚类处理:
    按照不同的相似度阈值,对所述第二聚类处理得到的cluster的seed参考序列进行层级聚类,创建一个有层次的嵌套的树。
  15. 根据权利要求1~4、6~8、10~14任一项所述的方法,所述微生物特征序列包括16S rRNA基因、18S rRNA基因、ITS核酸序列、RNA病毒的RNA为模板的RNA聚合酶基因(RdRp)、病毒衣壳蛋白编码基因、逆转录病毒的pol基因或者其他能够反映微生物种属特征的核酸序列中的一种或多种的全长序列。
  16. 通过测序获取微生物物种及相关信息的装置,所述装置包括:
    测序数据获取模块,其用于获取测序数据,所述测序数据由靶向富集微生物特征核酸序列后经下一代测序技术对扩增产物进行测序得到;
    特征序列数据库构建模块,其用于对所述测序数据进行聚类处理得到特征序列数据库,所述特征序列数据库包含一层或多层级的cluster,每个层级cluster中有至少一个子seed,最底层的cluster中有一或若干作为参考序列的seed;
    比对分析模块,其用于将所述测序数据与所述特征序列数据库进行比对分析以鉴定所述被检测样本中的微生物组成,所述比对分析模块包括:
    a)第一模块,其用于去除所述测序数据中的低质量和含有测序接头序列的reads;
    b)第二模块,其用于将reads序列与所述特征序列数据库的seeds序列进行比对,将因PCR扩增而产生的完全重复reads去除,并对事件“有reads序列比对上”与seeds序列做统计学独立性检验,选出与事件“有reads序列比对上”相关的seeds序列,得到初级筛选seed序列;
    c)第三模块,其用于将read序列与所述初级筛选seed序列进行比对,其中所述初级筛选seed序列之间不竞争reads,计算每个seed序列的reads覆盖情况,计算评估seed序列覆盖的指标,并据此对seed序列进行次级筛选;
    d)第四模块,其用于将read序列与次级筛选得到的seed序列比对,seed序列比对过程中相互间竞争 reads,计算比对中seed序列的read覆盖度指标,逐步迭代去除read覆盖度指标不满足要求的seed序列,得到叁级筛选seed序列;
    e)第五模块,其用于合并所述叁级筛选seed序列,并将reads与之进行比对,逐步迭代筛选得到满足要求的参考序列,其中逐步迭代筛选所用阈值相对于步骤d)更加严格;
    f)第六模块,其用于根据步骤e)得到的所述参考序列及其reads数量,计算物种水平的reads含量及其占比;计算时将同属于一个物种的多个参考序列的reads数目加和得到该物种的reads数,然后根据每个物种reads数目除以样本所含物种的reads数量的总和计算样本中每个物种的占比即丰度。
  17. 一种计算机可读存储介质,所述计算机存储介质用于存储计算机指令、程序、代码集或指令集,当其在计算机上运行时,使得计算机执行如权利要求1~15任一项所述的方法。
  18. 一种电子设备,包括:
    一个或多个处理器;以及
    存储装置,存储一个或多个程序,
    当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如权利要求1~15任一项所述的方法。
  19. 权利要求1~15任一项所述的方法、或权利要求16所述的装置、或权利要求17所述的计算机可读存储介质、或权利要求18所述的电子设备在鉴定微生物中的应用。
PCT/CN2021/115705 2020-08-07 2021-08-31 通过测序获取微生物物种及相关信息的方法、装置、计算机可读存储介质和电子设备 WO2022028624A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010790095.7 2020-08-07
CN202010790095.7A CN114067911B (zh) 2020-08-07 2020-08-07 获取微生物物种及相关信息的方法和装置

Publications (1)

Publication Number Publication Date
WO2022028624A1 true WO2022028624A1 (zh) 2022-02-10

Family

ID=80117092

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115705 WO2022028624A1 (zh) 2020-08-07 2021-08-31 通过测序获取微生物物种及相关信息的方法、装置、计算机可读存储介质和电子设备

Country Status (2)

Country Link
CN (1) CN114067911B (zh)
WO (1) WO2022028624A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117050867A (zh) * 2023-08-20 2023-11-14 浙江深华生物科技有限公司 一种评估肿瘤dna高通量定量检测系统
CN117690483A (zh) * 2023-11-30 2024-03-12 洛兮医疗科技(河北)有限公司 一种基于病原宏基因二代测序的耐药基因检测方法
CN117708569A (zh) * 2024-02-05 2024-03-15 中国医学科学院北京协和医院 一种病原微生物信息的识别方法、装置、终端及存储介质
WO2024077568A1 (zh) * 2022-10-13 2024-04-18 深圳华大智造科技股份有限公司 参考序列的构建方法、宏基因组数据压缩方法和电子设备
WO2024096149A1 (ko) * 2022-11-01 2024-05-10 엘지전자 주식회사 차세대 시퀀싱 방법을 이용한 미생물 분석 시스템 및 미생물 분석 방법
WO2024101492A1 (ko) * 2022-11-11 2024-05-16 엘지전자 주식회사 차세대 시퀀싱 방법을 이용한 미생물 분석 시스템 및 미생물 분석 방법

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114496089B (zh) * 2022-04-02 2022-07-15 北京大学人民医院 一种病原微生物鉴定方法
CN115572771A (zh) * 2022-09-20 2023-01-06 中国科学院广州地球化学研究所 一种微生物物种高通量鉴定解析方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101363056A (zh) * 2008-09-11 2009-02-11 浙江大学 一种高通量微生物鉴定方法
US20110045992A1 (en) * 2007-11-16 2011-02-24 Hyk Gene Technology Co., Ltd. Dna sequencing method and system
CN104039982A (zh) * 2012-08-01 2014-09-10 深圳华大基因研究院 一种分析微生物群落组成的方法和装置
CN107849606A (zh) * 2015-04-20 2018-03-27 尼欧基因组学实验室股份有限公司 提高下一代测序灵敏度的方法
CN109923217A (zh) * 2016-10-13 2019-06-21 生物梅里埃公司 宏基因组样品中病原体的鉴定和抗生素表征
CN111471676A (zh) * 2020-03-13 2020-07-31 广州市达瑞生物技术股份有限公司 一种宏基因组二代测序的建库样本的制备方法

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105986013A (zh) * 2015-02-02 2016-10-05 广州华大基因医学检验所有限公司 确定微生物种类的方法和装置
CN106701914A (zh) * 2016-11-09 2017-05-24 上海市食品药品检验所 一种基于dna条形码的细菌核酸测序鉴定方法
WO2019213624A1 (en) * 2018-05-04 2019-11-07 The Regents Of The University Of California Spiked primers for enrichment of pathogen nucleic acids among background of nucleic acids
CN111009286B (zh) * 2018-10-08 2023-04-28 深圳华大因源医药科技有限公司 对宿主样本进行微生物分析的方法和装置
CN110349629B (zh) * 2019-06-20 2021-08-06 湖南赛哲医学检验所有限公司 一种利用宏基因组或宏转录组检测微生物的分析方法
CN111462819A (zh) * 2020-02-26 2020-07-28 康美华大基因技术有限公司 肠道微生物检测数据分析方法、自动化解读系统及介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110045992A1 (en) * 2007-11-16 2011-02-24 Hyk Gene Technology Co., Ltd. Dna sequencing method and system
CN101363056A (zh) * 2008-09-11 2009-02-11 浙江大学 一种高通量微生物鉴定方法
CN104039982A (zh) * 2012-08-01 2014-09-10 深圳华大基因研究院 一种分析微生物群落组成的方法和装置
CN107849606A (zh) * 2015-04-20 2018-03-27 尼欧基因组学实验室股份有限公司 提高下一代测序灵敏度的方法
CN109923217A (zh) * 2016-10-13 2019-06-21 生物梅里埃公司 宏基因组样品中病原体的鉴定和抗生素表征
CN111471676A (zh) * 2020-03-13 2020-07-31 广州市达瑞生物技术股份有限公司 一种宏基因组二代测序的建库样本的制备方法

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024077568A1 (zh) * 2022-10-13 2024-04-18 深圳华大智造科技股份有限公司 参考序列的构建方法、宏基因组数据压缩方法和电子设备
WO2024096149A1 (ko) * 2022-11-01 2024-05-10 엘지전자 주식회사 차세대 시퀀싱 방법을 이용한 미생물 분석 시스템 및 미생물 분석 방법
WO2024101492A1 (ko) * 2022-11-11 2024-05-16 엘지전자 주식회사 차세대 시퀀싱 방법을 이용한 미생물 분석 시스템 및 미생물 분석 방법
CN117050867A (zh) * 2023-08-20 2023-11-14 浙江深华生物科技有限公司 一种评估肿瘤dna高通量定量检测系统
CN117690483A (zh) * 2023-11-30 2024-03-12 洛兮医疗科技(河北)有限公司 一种基于病原宏基因二代测序的耐药基因检测方法
CN117708569A (zh) * 2024-02-05 2024-03-15 中国医学科学院北京协和医院 一种病原微生物信息的识别方法、装置、终端及存储介质
CN117708569B (zh) * 2024-02-05 2024-04-05 中国医学科学院北京协和医院 一种病原微生物信息的识别方法、装置、终端及存储介质

Also Published As

Publication number Publication date
CN114067911B (zh) 2024-02-06
CN114067911A (zh) 2022-02-18

Similar Documents

Publication Publication Date Title
WO2022028624A1 (zh) 通过测序获取微生物物种及相关信息的方法、装置、计算机可读存储介质和电子设备
US11761035B2 (en) Methods and systems for generation and error-correction of unique molecular index sets with heterogeneous molecular lengths
US11898198B2 (en) Universal short adapters with variable length non-random unique molecular identifiers
CN110349629B (zh) 一种利用宏基因组或宏转录组检测微生物的分析方法
JP2021007039A (ja) 核酸配列アセンブリ
US20180137243A1 (en) Therapeutic Methods Using Metagenomic Data From Microbial Communities
US20200294628A1 (en) Creation or use of anchor-based data structures for sample-derived characteristic determination
US20200234793A1 (en) Systems and methods for metagenomic analysis
KR20200010464A (ko) 기지 또는 미지의 유전자형의 다수의 기여자로부터 dna 혼합물을 분해 및 정량하기 위한 방법 및 시스템
Cuenca et al. Mitochondrial sequencing of missing persons DNA casework by implementing Thermo Fisher’s precision ID mtDNA whole genome assay
US20190048393A1 (en) Method for qualitative and quantitative detection of microorganism in human body
Pei et al. Targeted sequencing approach and its clinical applications for the molecular diagnosis of human diseases
CN113066533A (zh) 一种mNGS病原体数据分析方法
CN115662516A (zh) 一种基于二代测序技术的高通量预测噬菌体宿主的分析方法
Levin et al. Optimization for sequencing and analysis of degraded FFPE-RNA samples
CN112331268B (zh) 目标物种特有序列的获取方法及目标物种检测方法
Bovo et al. A viral metagenomic approach on a non-metagenomic experiment: mining next generation sequencing datasets from pig DNA identified several porcine parvoviruses for a retrospective evaluation of viral infections
Xu et al. High-quality Japanese flounder genome aids in identifying stress-related genes using gene coexpression network
González-Recio et al. Improving phenotypic prediction in dairy cattle breeding using the metagenome
DeWitte LOTUS: A Web-Based Computational Tool for the Preliminary Investigation of a Novel MST Method Utilizing a Library of 16s rRNA Bacteroides OTUs
Kloda Gene expression analysis on a subgene level
Charrier Developing long-read Oxford Nanopore nemabiome metabarcoding for ovine gastrointestinal nematode community analysis and diagnostics
CN115044703A (zh) 一种人冠状病毒HCoV-OC43的MNP标记位点、引物组合物、试剂盒及其应用
Hunter Applications of High Throughput Technologies in Modern Genomics
CN117106872A (zh) 一种高饲料效率奶牛分子网络标记的筛选方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21853736

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21853736

Country of ref document: EP

Kind code of ref document: A1