CN114023389B - Analysis method of metagenome data - Google Patents

Publication number
CN114023389B
Authority
CN
China
Prior art keywords
sequence
read
data
mer
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210002820.9A
Other languages
Chinese (zh)
Other versions
CN114023389A (en
Inventor
郎继东
孙继国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Qitan Technology Ltd
Original Assignee
Chengdu Qitan Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Qitan Technology Ltd
Priority to CN202210002820.9A
Publication of CN114023389A
Application granted
Publication of CN114023389B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for analyzing metagenome data, comprising the following steps: 1) preprocessing raw data to obtain a data set of the expected quality; 2) performing K-mer sliding extraction N times on each read sequence in the data set of step 1) to obtain N K-mer sequences; 3) grouping the K-mer sequences obtained by the same sliding extraction round across all read sequences into one K-mer sequence subset, giving N K-mer sequence subsets; 4) performing metagenomic species analysis on each K-mer sequence subset obtained in step 3) to obtain N data analysis results; 5) combining the N data analysis results obtained in step 4) and analyzing them to obtain information on the microorganisms in the metagenome. The method of the invention has ultra-high sensitivity and can effectively control false positive results.

Description

Analysis method of metagenome data
Technical Field
The invention belongs to the technical field of metagenome analysis, particularly relates to a metagenome data analysis method, and more particularly relates to a metagenome data analysis method based on third-generation sequencing.
Background
The metagenome, also known as the microbial environmental genome, is the sum of all microbial genetic material in an environment; at present the term mainly refers to the sum of the bacterial and fungal genomes in an environmental sample. Metagenomics is an approach to studying microorganisms that takes the microbial genomes in an environmental sample as its object, uses functional gene screening and/or sequencing analysis as its means, and aims to characterize microbial diversity, population structure, evolutionary relationships, functional activity, cooperative interactions, and relationships with the environment. It generally involves extracting genomic DNA from an environmental sample and performing high-throughput sequencing analysis.
In recent years, metagenomics has been studied more and more widely, especially human metagenomics, for example the metagenomics of the intestinal flora and of tumors. Researchers have not only sequenced and classified metagenomic sequences but have also analyzed important associations between the metagenome and human diseases. For example, a 2017 study published in Science showed how microorganisms invade most pancreatic cancer patients and break down the main chemotherapeutic drug used by these patients (Geller LT, Barzily-Rokni M, Danino T, et al. Potential role of intratumor bacteria in mediating tumor resistance to the chemotherapeutic drug gemcitabine. Science. 2017;357(6356):1156-1160. DOI: 10.1126/science.aah5043); in 2020, a study in Nature established for the first time a direct link between microorganisms in the human body and the genetic variation that drives cancer development, confirming that colorectal cancer gene mutations can be caused by a toxin released by the gut flora (Pleguezuelos-Manzano C, Puschhof J, Rosendahl Huber A, et al. Mutational signature in colorectal cancer caused by genotoxic pks+ E. coli. Nature. 2020;580(7802):269-273. DOI: 10.1038/s41586-020-2080-8); in the same year, another research team published new findings on the cancer microbiome in Nature (Poore GD, Kopylova E, Zhu Q, et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature. 2020;579(7800):567-574. DOI: 10.1038/s41586-020-).
Third-generation sequencing, also called single-molecule real-time DNA sequencing, is a technology that sequences each DNA molecule individually without PCR amplification. At present, third-generation sequencing falls into two main types: single-molecule fluorescence sequencing, represented by the SMRT technology of PacBio, and nanopore sequencing, represented by the nanopore electrophoresis technology of Oxford Nanopore and Qitan Technology. Its main technical characteristics are, first, that it exploits the intrinsic reaction speed of DNA polymerase, sequencing about 10 bases per second, which is 20,000 times the speed of chemical sequencing; and second, that it exploits the intrinsic processivity of DNA polymerase, so that a very long sequence can be read in a single reaction. Second-generation sequencing reads hundreds of bases, whereas third-generation sequencing reads thousands of bases.
Metagenomic sequencing enables the simultaneous detection of almost all known pathogens in a clinical sample. At present, most metagenome studies use a second-generation (Illumina) sequencing platform, for which the sequencing run exceeds 16 hours and the total time from sample to report is 48 to 72 hours. In contrast, with third-generation (nanopore) sequencing, after effective removal of the host DNA background (e.g., using saponin; Charalampous T, Kay GL, Richardson H, et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat Biotechnol. 2019;37(7):783-792. DOI: 10.1038/s41587-019-0156-5), microorganisms can be detected within about 50 minutes by real-time computational analysis, and the whole detection cycle can be kept within 6 hours (Gu W, Deng X, Lee M, et al. Rapid pathogen detection by metagenomic next-generation sequencing of infected body fluids. Nat Med. 2021;27(1):115-124. DOI: 10.1038/s41591-020-1105-z). Third-generation sequencing can also detect emerging viruses and currently circulating pathogenic bacteria, and has good prospects for clinical application, such as fixed-point real-time monitoring of epidemics.
However, current methods for detecting the metagenome based on third-generation sequencing still face considerable challenges. At the experimental level, the main problem is how to efficiently and stably remove the host genome or enrich metagenomic DNA fragments. An unsuitable experimental method leaves a low proportion of metagenomic DNA in the sequenced DNA, so that too little usable metagenomic data reaches downstream analysis; compensating with a larger volume of sequencing data raises the sequencing cost and makes the approach far less cost-effective. At the data analysis level, the method must detect microbial species with high sensitivity while effectively controlling false positives, and must also address the low data utilization caused by characteristics specific to third-generation data, such as random insertions and deletions (indels) in nanopore reads. Existing methods for detecting the metagenome from third-generation sequencing data, such as Kraken2, Bracken, and deep learning methods such as CNNs, do not solve these problems at the same time.
Therefore, it is of great significance to further improve existing metagenome data analysis methods so as to increase detection sensitivity and solve the problem of low data utilization.
Disclosure of Invention
The present invention therefore aims to overcome the defects of the prior art by providing a method for analyzing metagenomic data that addresses the data analysis problems described above. By exploiting the characteristics of the data, the method avoids the low data utilization caused by random insertions and deletions (indels). It has ultra-high sensitivity, can accurately detect the species and composition of microorganisms even against an extremely high host background, and effectively controls false positive results, so that the analysis is more accurate and efficient. The method also eases the demands on the experimental procedure: it does not require an excessive volume of sequencing data, which lowers the sequencing cost, and it makes full use of the advantages of third-generation sequencing data. In addition, the method is compatible with current analysis methods for second-generation sequencing data and extends the means available for analyzing metagenome data obtained by second-generation sequencing.
The purpose of the invention is realized by the following technical scheme.
In one aspect, the present invention provides a method for analyzing metagenomic data, the method comprising the steps of:
1) preprocessing raw data to obtain a data set with expected quality, wherein the data set with expected quality comprises a plurality of read sequences;
2) performing N times of K-mer sliding extraction on each read sequence in the plurality of read sequences in the data set of the step 1), and obtaining N K-mer sequences aiming at each read sequence;
adding constraint conditions in the N sliding extractions, so that the initial positions of the K-mer sequences extracted by the sliding extraction are randomly and uniformly distributed on a target sequence read, the K-mer sequences extracted by the sliding extraction have base coincidence, and finally, all the K-mer sequences extracted by the N times cover at least 80% of the area of the target sequence read;
wherein N is an integer;
3) classifying K-mer sequences obtained by the same K-mer sliding extraction in all read sequences into a K-mer sequence subset to obtain N K-mer sequence subsets;
4) performing metagenomic species analysis on each K-mer sequence subset obtained in the step 3) respectively to finally obtain N data analysis results;
5) combining the N data analysis results obtained in the step 4), and analyzing and obtaining information of various microorganisms in the metagenome.
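By way of illustration only, the grouping of step 3) can be sketched as follows. The sketch assumes that step 2) has already produced, for each read, a list of exactly N entries in extraction order, with None marking an extraction that was filtered out. The function name group_by_round and this data layout are hypothetical and are not taken from the invention.

```python
from typing import Dict, List, Optional


def group_by_round(per_read: Dict[str, List[Optional[str]]], n: int) -> List[List[str]]:
    """Build N subsets: subset i collects the i-th extracted K-mer of every read."""
    subsets: List[List[str]] = [[] for _ in range(n)]
    for read_id, kmers in per_read.items():
        if len(kmers) != n:
            raise ValueError(f"read {read_id} has {len(kmers)} entries, expected {n}")
        for round_index, kmer in enumerate(kmers):
            # None marks an extraction that was filtered out (assumed convention).
            if kmer is not None:
                subsets[round_index].append(kmer)
    return subsets
```

Each of the N subsets produced this way can then be passed separately to the species analysis of step 4).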
According to the method for analyzing metagenomic data of the present invention, in step 1), the raw data is raw third-generation sequencing data or long-read sequencing data.
Here, long read length refers to sequencing data obtained by third-generation sequencing, such as nanopore sequencing, in which the length of a raw sequencing read can reach 10 kb, 20 kb, 150 kb, 160 kb, 200 kb, 300 kb, 1 Mb, or 2 Mb. Those skilled in the art know that, as sequencing technology develops, still longer reads are likely to become available. It will be appreciated, however, that the method of the invention is applicable to sequencing data of any length, and that longer reads are more advantageous for the method of the invention.
According to the method for analyzing metagenomic data of the present invention, in step 1), the raw data is long-read data obtained by nanopore sequencing.
According to the method for analyzing metagenomic data of the present invention, in step 1), preprocessing the raw data comprises removing adapter sequences and barcode sequences, and filtering out low-quality read sequences and read sequences that are too short.
In an optional embodiment, the low-quality threshold may be Q5 or Q7, which are thresholds well known to those skilled in the art. Here Q denotes the average quality value of a sequencing read, that is, the average of the per-base accuracy over the read. The threshold can be adjusted according to actual conditions; the adjustment parameters are detailed at http:// maq.
In an optional embodiment, the length threshold for a read sequence that is too short includes, but is not limited to, 100 bp; for example, the threshold may be 50 bp, 200 bp, 300 bp, or the like. Those skilled in the art can adjust it as appropriate; the threshold adjustment parameter is described in detail at http:// maq.
In an optional embodiment, the preprocessing of step 1) uses Porechop software and/or NanoFilt software. These are data processing tools commonly used by those skilled in the art, whose usage and technical parameters are well known and are not described in detail here.
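As a rough illustration of the quality and length filtering only (not of adapter or barcode removal, and not a reimplementation of Porechop or NanoFilt), the following sketch keeps reads whose mean quality and length reach the thresholds. It assumes Phred+33 quality strings and assumes that the mean quality is obtained by averaging the per-base error probabilities and converting the average back to a Phred score; the names mean_phred and filter_reads are hypothetical.

```python
import math
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (read_id, sequence, Phred+33 quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean read quality: average the per-base error probabilities, then convert back."""
    if not quality:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def filter_reads(reads: Iterable[Read], min_quality: float = 7.0,
                 min_length: int = 100) -> Iterator[Read]:
    """Yield only reads that are long enough and of sufficient mean quality."""
    for read_id, seq, qual in reads:
        if len(seq) >= min_length and mean_phred(qual) >= min_quality:
            yield read_id, seq, qual
```

The defaults mirror the Q7 and 100 bp thresholds mentioned above; passing min_quality=5.0 would correspond to the Q5 threshold.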
As known to those skilled in the art, a K-mer is a string of K consecutive bases taken from a read; for example, the sequence ATCG is a 4-mer and the sequence ATCGATCG is an 8-mer.
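For readers less familiar with the term, the following minimal sketch lists every K-mer contained in a sequence by sliding a window one base at a time. It illustrates the general notion of a K-mer only; the method of the invention extracts N selected K-mers per read rather than all of them. The function name all_kmers is hypothetical.

```python
from typing import List


def all_kmers(sequence: str, k: int) -> List[str]:
    """Return every substring of length k, in order of start position."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

For example, the 8-mer ATCGATCG contains five 4-mers: ATCG, TCGA, CGAT, GATC, and ATCG.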
According to the method for analyzing metagenomic data of the present invention, step 2) is achieved by a method comprising the following steps:
first, performing sliding extraction N times on each read sequence at length K without splitting the read sequence, wherein K is any integer, and discarding an extracted sequence if its length after sliding extraction is smaller than a preset length;
second, adding constraint conditions, wherein the constraint conditions comprise:
the position of each extracted K-mer sequence on the target read sequence is random;
the extracted K-mer sequences are distributed over the upstream, midstream, and downstream regions of the target read sequence;
the start positions of the extracted K-mer sequences are uniformly distributed along the target read sequence; and
after the N extractions, all extracted K-mer sequences together cover at least 80% of the target read sequence, with overlapping bases between K-mer sequences;
third, obtaining N K-mer sequences for each read sequence.
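One possible way to satisfy these constraints, given here purely as a sketch under stated assumptions, is stratified random sampling: the valid start positions are divided into N equal bins and one start is drawn at random from each bin, so that starts are random yet spread over the upstream, midstream, and downstream regions. The coverage is then measured and the draw is repeated until the 80% threshold is met or an attempt limit is reached. The stratification scheme, the retry limit, and the names extract_kmers and covered_fraction are assumptions for illustration and are not stated in the invention; overlap between K-mers is not enforced explicitly, and the 80% threshold can only be reached when N × K is large enough relative to the read length, so the achieved coverage is returned for the caller to check.

```python
import random
from typing import List, Optional, Tuple


def covered_fraction(read_length: int, starts: List[int], k: int) -> float:
    """Fraction of read positions covered by at least one extracted K-mer."""
    covered = [False] * read_length
    for start in starts:
        for pos in range(start, min(start + k, read_length)):
            covered[pos] = True
    return sum(covered) / read_length


def extract_kmers(read: str, k: int, n: int, min_coverage: float = 0.8,
                  max_attempts: int = 100,
                  rng: Optional[random.Random] = None) -> Tuple[List[Tuple[int, str]], float]:
    """Draw n K-mers with one random start per equal-width bin of valid starts.

    Returns the (start, K-mer) pairs of the best attempt and its coverage.
    """
    if k <= 0 or n <= 0:
        raise ValueError("k and n must be positive")
    if len(read) < k:
        return [], 0.0  # read too short to yield a full-length K-mer
    rng = rng or random.Random()
    n_starts = len(read) - k + 1  # number of valid start positions
    best_starts: List[int] = []
    best_cov = -1.0
    for _ in range(max_attempts):
        starts = []
        for i in range(n):
            low = (i * n_starts) // n
            high = max(low, ((i + 1) * n_starts) // n - 1)
            starts.append(rng.randint(low, high))
        cov = covered_fraction(len(read), starts, k)
        if cov > best_cov:
            best_starts, best_cov = starts, cov
        if cov >= min_coverage:
            break
    return [(s, read[s:s + k]) for s in best_starts], best_cov
```

With a 1,000-base read, K = 100, and N = 20, adjacent bins are about 45 bases wide, so neighbouring K-mers necessarily overlap and the coverage threshold is met on the first attempt.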
In an optional embodiment, 75 bp ≤ K ≤ 500 bp, or 100 bp ≤ K ≤ 200 bp, or K = 200 bp. As known to those skilled in the art, K may be set as required according to the quality characteristics of the data or as conditions permit. Without being bound by any theory, K may be any integer value; for example, K may be 20 bp or more, 50 bp or more, 75 bp or more, 100 bp or more, 150 bp or more, 200 bp or more, 300 bp or more, 400 bp or more, or the like.
According to the method for analyzing metagenomic data of the present invention, in step 2), 10 ≤ N ≤ 50, or 20 ≤ N ≤ 25. Without being bound by any theory, N may be any integer value; for example, N may be any integer from 5 to 50, or 100, 200, 300, etc.
According to the method for analyzing metagenomic data of the present invention, in step 4), the metagenomic species analysis may use Centrifuge, MetaPhlAn, or the like. These are software packages known to and used by those skilled in the art, and are conventional tools for metagenomic analysis.
According to the method for analyzing metagenomic data of the present invention, step 5) is implemented by a method comprising the following steps:
merging the N data analysis results obtained in step 4) to obtain a result matrix of the detected species and their relative abundances, and calculating the mean, variance, and standard deviation of the relative abundance of each detected species; choosing a confidence level, such as but not limited to 90%, 95%, or 99%, and calculating the confidence interval of the relative abundance of each species at that level; taking the confidence interval as the bounds, averaging the relative abundance values from the N analysis results that fall within the interval, and taking that average as the final detected relative abundance of the species, wherein the relative abundances of all detected species sum to 100%.
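The merging of step 5) might be sketched as follows. Several details are assumptions made for illustration and are not specified above: the confidence interval is taken as the normal-approximation interval for the mean (mean ± z × sd / √N, with z = 1.96 for 95%), a species absent from one of the N results is counted as abundance 0 in that result, the plain mean is used if no value falls inside the interval, and the final values are rescaled so that they sum to 100%. The name merge_abundances is hypothetical.

```python
import math
import statistics
from typing import Dict, List


def merge_abundances(runs: List[Dict[str, float]], z: float = 1.96) -> Dict[str, float]:
    """Combine N per-subset abundance tables into one table summing to 100."""
    n = len(runs)
    species = sorted({name for run in runs for name in run})
    merged: Dict[str, float] = {}
    for name in species:
        # One row of the species-by-run result matrix (missing species count as 0).
        values = [float(run.get(name, 0.0)) for run in runs]
        mean = statistics.mean(values)
        sd = statistics.stdev(values) if n > 1 else 0.0
        half_width = z * sd / math.sqrt(n)  # assumed normal-approximation interval
        inside = [v for v in values if mean - half_width <= v <= mean + half_width]
        merged[name] = statistics.mean(inside) if inside else mean
    total = sum(merged.values())
    if total <= 0:
        return merged
    return {name: 100.0 * value / total for name, value in merged.items()}
```

Values far from the mean of the N results, such as a single outlying run, fall outside the interval and therefore do not contribute to the final abundance.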
In a particular embodiment, the method of the invention comprises the steps of:
1) preprocessing the raw third-generation sequencing data: removing the adapter and barcode sequences added during library construction using Porechop software and NanoFilt software, and filtering out low-quality read sequences and read sequences that are too short, wherein the low-quality threshold is Q7 and the length threshold for too-short sequences is 100 bp, to obtain a data set;
2) performing sliding extraction N times on each read sequence of the data set of step 1) at length K without splitting the read, with 100 bp ≤ K ≤ 200 bp and 20 ≤ N ≤ 25, and discarding an extracted sequence if its length after sliding extraction is smaller than the preset length,
adding constraints to the sliding extraction, including:
the position of each extracted K-mer sequence on the target read sequence is random;
the extracted K-mer sequences are randomly distributed over the upstream, midstream, and downstream regions of the target read;
the start sites of the extracted K-mer sequences are uniformly distributed along the target read, with no positional preference; and
after the N extractions, all K-mer sequences together cover at least 80% of the target read, with overlapping bases between K-mer sequences;
obtaining N K-mer sequences for each read sequence;
3) grouping the K-mer sequences obtained by the same sliding extraction round across all read sequences into one K-mer sequence subset, to obtain N K-mer sequence subsets;
N fixed-length K-mer sequences are obtained for each read sequence, and the K-mer sequences obtained from different read sequences in the same extraction round belong to the same K-mer sequence subset, so that the raw long-read sequencing data is decomposed into short-read sequencing data sets extracted by fixed-length sliding;
4) performing metagenomic species analysis separately on each K-mer sequence subset of read length K obtained in step 3), wherein the species include bacteria, fungi, parasites, and/or viruses, to obtain N data analysis results;
5) merging the N data analysis results obtained in step 4) to obtain a result matrix of the detected species and their relative abundances, and calculating the mean, variance, and standard deviation of the relative abundance of each detected species; choosing a confidence level of 95% and calculating the confidence interval of the relative abundance of each species at that level; taking the confidence interval as the bounds, averaging the relative abundance values from the N analysis results that fall within the interval, and taking that average as the final detected relative abundance of the species, wherein the relative abundances of all detected species sum to 100%.
The method further comprises merging the N data analysis results obtained in step 4) and performing the following analyses:
statistical analysis of sample diversity based on the species abundance matrix, such as but not limited to alpha diversity, the Shannon index, and the Simpson index;
statistical analysis of species that differ significantly between sample groups based on the species abundance matrix, such as but not limited to LDA Effect Size (LEfSe) analysis or analysis of variance.
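As a brief illustration of the diversity statistics mentioned above, the following sketch computes the Shannon index and the Simpson index from the abundances of one sample. It assumes the natural-logarithm form of the Shannon index, H = -Σ p ln p, and the Gini-Simpson form of the Simpson index, 1 - Σ p²; other conventions exist, and the invention does not state which one is used. The function names are hypothetical.

```python
import math
from typing import Sequence


def shannon_index(abundances: Sequence[float]) -> float:
    """Shannon diversity H = -sum(p * ln p) over species with non-zero abundance."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


def simpson_index(abundances: Sequence[float]) -> float:
    """Gini-Simpson diversity 1 - sum(p^2)."""
    total = sum(abundances)
    return 1.0 - sum((a / total) ** 2 for a in abundances)
```

Because both functions normalize by the total, they accept either raw counts or relative abundances in percent.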
The present invention also provides an apparatus for metagenomic data analysis, wherein the apparatus comprises:
a data preprocessing module for preprocessing the sequences to obtain a data set of the desired quality;
a data processing module for processing the data set into N K-mer sequence subsets;
a microorganism identification module for obtaining microbial species classification data from the data set based on a microbial reference sequence database;
a data result integration module for merging the species classification data analysis results; and
a reporting module for outputting the results based on the data.
The data preprocessing module comprises Porechop software and/or NanoFilt software;
the data processing module processes the data set into N K-mer sequence subsets using the method of the invention;
the microorganism identification module comprises MetaPhlAn software, Centrifuge software, and/or a metagenomic database.
The invention provides a method and an apparatus for detecting the metagenome based on third-generation sequencing data. Specifically, the inventors found that performing sliding extraction on the original long read sequences, thereby converting them into short-read sequencing sequences, allows the long-read advantage of third-generation sequencing data to be exploited more fully.
Although conventional metagenome analysis methods do not alter the structure or data characteristics of the raw sequencing reads, they suffer from false negatives caused by insufficient detection sensitivity.
The method for analyzing metagenome data of the invention therefore avoids, by exploiting the characteristics of the data, the low data utilization caused by random insertions and deletions (indels); it has ultra-high sensitivity, can accurately detect the species and composition of microorganisms even against an extremely high host background, and effectively controls false positive results. It eases the demands on the experimental procedure, does not require an excessive volume of sequencing data, lowers the sequencing cost, and makes full use of the advantages of third-generation sequencing data. The method is compatible with current analysis methods for second-generation sequencing data and extends the means available for analyzing metagenome data obtained by second-generation sequencing; at the same time, as a metagenome analysis method for third-generation sequencing data, it makes the analysis more accurate and efficient.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow diagram of a metagenomic data analysis method in one embodiment of the present invention;
FIG. 2 is a block diagram showing the structure of a metagenomic data analysis apparatus according to an embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The embodiments will be described in detail below with reference to the accompanying drawings.
With reference to FIG. 1 and FIG. 2, the present invention provides a method for metagenomic data analysis based on third-generation sequencing data, the method comprising the following steps:
S1: preprocessing the raw third-generation sequencing data: removing the adapter and barcode sequences added during library construction using Porechop software and NanoFilt software, and filtering out low-quality read sequences and read sequences that are too short, wherein the low-quality threshold is Q7 and the length threshold for too-short sequences is 100 bp, to obtain a data set;
S2: performing sliding extraction N times on each read sequence of the data set of step S1 at length K without splitting the read, with 100 bp ≤ K ≤ 200 bp and 20 ≤ N ≤ 25, and discarding an extracted sequence if its length after sliding extraction is smaller than the preset length,
adding constraints to the sliding extraction, including:
the position of each extracted K-mer sequence on the target read sequence is random;
the extracted K-mer sequences are randomly distributed over the upstream, midstream, and downstream regions of the target read;
the start sites of the extracted K-mer sequences are uniformly distributed along the target read, with no positional preference; and
after the N extractions, all K-mer sequences together cover at least 80% of the target read, with overlapping bases between K-mer sequences;
obtaining N K-mer sequences for each read sequence;
S3: grouping the K-mer sequences obtained by the same sliding extraction round across all read sequences into one K-mer sequence subset, to obtain N K-mer sequence subsets;
N fixed-length K-mer sequences are obtained for each read sequence, and the K-mer sequences obtained from different read sequences in the same extraction round belong to the same K-mer sequence subset, so that the raw long-read sequencing data is decomposed into short-read sequencing data sets extracted by fixed-length sliding;
S4: performing metagenomic species analysis separately on each K-mer sequence subset of read length K obtained in step S3, wherein the species include bacteria, fungi, parasites, and/or viruses, to obtain N data analysis results;
S5: merging the N data analysis results obtained in step S4 to obtain a result matrix of the detected species and their relative abundances, and calculating the mean, variance, and standard deviation of the relative abundance of each detected species; choosing a confidence level of 95%, calculating the confidence interval of the relative abundance of each species at that level, and taking the mean of the relative abundance values that fall within the confidence interval as the final detected relative abundance of the species, wherein the relative abundances of all detected species sum to 100%.
The inventors found that performing sliding extraction on the original long read sequences, thereby converting them into short-read sequencing sequences, allows the long-read advantage of third-generation sequencing data to be exploited more fully.
Further, as shown in FIG. 2, one embodiment of the present invention provides an apparatus for metagenomic data analysis, wherein the apparatus comprises: a data preprocessing module 101, which preprocesses the sequences using, for example, Porechop software and/or NanoFilt software, to obtain a data set of the desired quality; a data processing module 102, which processes the data set into N K-mer sequence subsets using the method of the present invention; a microorganism identification module 103, which obtains microbial species classification data from the data set using, for example, MetaPhlAn software or Centrifuge software, based on a microbial reference sequence database and/or a metagenomic database; a data result integration module 104, which merges the species classification data analysis results; and a reporting module 105, which outputs the results based on the data.
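The relationship between modules 101 to 105 can be pictured, in a purely schematic way, as a chain of five interchangeable stages. In the sketch below each module is represented by a caller-supplied function; the class name MetagenomePipeline, the field names, and the simplified data types (reads as plain strings, results as species-to-abundance dictionaries) are assumptions for illustration and do not describe the actual apparatus.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Reads = List[str]
Abundance = Dict[str, float]


@dataclass
class MetagenomePipeline:
    preprocess: Callable[[Reads], Reads]               # module 101
    make_subsets: Callable[[Reads], List[Reads]]       # module 102
    classify: Callable[[Reads], Abundance]             # module 103
    integrate: Callable[[List[Abundance]], Abundance]  # module 104
    report: Callable[[Abundance], str]                 # module 105

    def run(self, raw_reads: Reads) -> str:
        dataset = self.preprocess(raw_reads)
        subsets = self.make_subsets(dataset)
        # Each of the N subsets is analyzed independently before integration.
        results = [self.classify(subset) for subset in subsets]
        return self.report(self.integrate(results))
```

Keeping each stage as a separate callable reflects the modular description above: the identification stage, for instance, could wrap one classification tool or another without changing the rest of the chain.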
Example 1: Data analysis using the method of the invention
1. Preparing four groups of metagenomic standards with human host proportions of 0%, 30%, 50% and 90%, wherein each standard contains Enterococcus faecalis, Escherichia coli, Lactobacillus fermentum, Listeria monocytogenes, Salmonella enterica and Staphylococcus aureus, and the expected abundance of each bacterium in each group of standards is known; constructing libraries and sequencing them on a third-generation nanopore sequencer (model QNome-9604) to obtain raw long-read sequencing data with a maximum read length of 16 kb and an average read length of 7.8 kb; removing the adapter and barcode sequences added during library construction using Porechop software and NanoFilt software, and filtering out reads with quality below Q5 and reads shorter than 100 bp;
2. performing 20 rounds (N = 20) of sliding extraction at length K = 100 bp on each read in the data set obtained in step 1, without splitting the read, and discarding any extracted short-read sequence shorter than 100 bp;
adding constraints, namely: the positions on the target read of the short-read sequences obtained from the 20 sliding extractions are random; their start sites are evenly distributed over the upstream, midstream and downstream regions of the target read; and the set of 20 slide-extracted short-read sequences covers 85% of the target read;
3. assigning the K-mer sequences obtained in each extraction round to one K-mer sequence subset, finally obtaining 20 subsets of slide-extracted K-mer sequences of fixed length (K = 100 bp); that is, decomposing the original long-read sequencing data into short-read sequencing data sets extracted by sliding at the fixed length K = 100 bp;
4. performing metagenomic species analysis with MetaPhlAn 3 separately on each of the 20 fixed-length (K = 100 bp) short-read data sets obtained in step 3, to obtain 20 data analysis results;
5. merging the 20 data analysis results to obtain a result matrix of the detected species and their relative abundances; calculating the mean, variance and standard deviation of the relative abundance of each detected species; at a given confidence level of 95%, calculating the confidence interval of the relative abundance of each species; and taking the mean of the relative abundance values that fall within the confidence interval as the detected relative abundance of that species, wherein the relative abundances of all detected species sum to 100%.
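Steps 2 and 3 above can be sketched in Python as follows. This is one possible reading, not the implementation used in the example: the function names sliding_extract, coverage_fraction and build_subsets are hypothetical, and the way the constraints are met (one random start site drawn from each of N equal bins along the read, then shuffled so that an extraction round is not tied to a region of the read) is an assumption. Because N K-mers of length K can cover at most N·K bases, the sketch reports the achieved coverage rather than enforcing a threshold.

```python
import random


def sliding_extract(read, k=100, n=20, rng=None):
    """Return n (start, k-mer) pairs with stratified-random start sites on the read."""
    rng = rng or random.Random()
    n_starts = len(read) - k + 1  # number of valid start sites
    if n_starts <= 0:
        return []  # read shorter than K: nothing to extract
    starts = []
    for i in range(n):
        # Bin i spans an equal share of the valid start sites.
        lo = (i * n_starts) // n
        hi = max(((i + 1) * n_starts) // n - 1, lo)
        starts.append(rng.randint(lo, hi))
    rng.shuffle(starts)  # decouple the extraction round from the position on the read
    # Every start is at most len(read) - k, so each k-mer has exactly length k.
    return [(s, read[s:s + k]) for s in starts]


def coverage_fraction(read_len, starts, k):
    """Fraction of read positions covered by at least one extracted k-mer."""
    covered = [False] * read_len
    for s in starts:
        for pos in range(s, min(s + k, read_len)):
            covered[pos] = True
    return sum(covered) / read_len


def build_subsets(reads, k=100, n=20, seed=0):
    """Group the j-th k-mer of every read into subset j, giving n subsets in total."""
    rng = random.Random(seed)
    subsets = [[] for _ in range(n)]
    for read_id, seq in reads.items():
        for j, (_, kmer) in enumerate(sliding_extract(seq, k, n, rng)):
            subsets[j].append((f"{read_id}/{j}", kmer))
    return subsets
```

Each of the resulting subsets can then be written out as a separate short-read file and passed to the species analysis of step 4.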
As shown in Table 1, the results show that the method detects each strain with high sensitivity, both in data from a pure bacterial background and in data with an ultra-high host background, and produces no false-positive results; the method is therefore particularly suitable for metagenomic analysis of third-generation sequencing data.
Table 1. Detection results for each species and its relative abundance obtained with the method of the invention
[Table 1 is provided as an image in the original document.]
Note: MetaPhlAn 3 does not take the human host into account and calculates relative abundance only for the aligned species; no result is therefore available for the human host, which is indicated by "-".
In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.
It should be understood that in the present embodiment, "B corresponding to A" means that B is associated with A and that B can be determined from A. It should also be understood that determining B from A does not mean that B is determined from A alone; B may also be determined from A and/or other information.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A method for analyzing metagenomic data, comprising the steps of:
1) preprocessing raw data to obtain a data set with expected quality, wherein the data set with expected quality comprises a plurality of read sequences;
2) performing N times of K-mer sliding extraction on each read sequence in the plurality of read sequences in the data set of the step 1), and obtaining N K-mer sequences aiming at each read sequence;
adding constraint conditions in the N sliding extractions, so that the initial positions of the K-mer sequences extracted by the sliding extraction are randomly and uniformly distributed on a target sequence read, the K-mer sequences extracted by the sliding extraction have base coincidence, and finally, all the K-mer sequences extracted by the N times cover at least 80% of the area of the target sequence read;
wherein N is an integer;
3) classifying K-mer sequences obtained by the same K-mer sliding extraction in all read sequences into a K-mer sequence subset to obtain N K-mer sequence subsets;
4) performing metagenomic species analysis on each K-mer sequence subset obtained in the step 3) respectively to finally obtain N data analysis results;
5) merging the N data analysis results obtained in the step 4), and analyzing and obtaining information of various microorganisms in the metagenome;
the step 5) specifically comprises the following steps:
merging the N data analysis results obtained in the step 4) to obtain a result matrix of the detected species and the relative abundance of the detected species, and calculating the average value, the variance and the standard deviation of the relative abundance of each detected species; giving a confidence level, calculating a confidence interval of relative abundance values of the species at the confidence level, taking the range of the confidence interval as a boundary, averaging the relative abundance values meeting the confidence interval in the N times of analysis results, taking the calculated value as the final relative abundance detected by the species, and taking the sum of the relative abundance values of all the detected species as 100%.
2. The method for analyzing metagenomic data according to claim 1, wherein in step 1), said raw data is raw third generation sequencing data or long read length sequencing data.
3. The method for analyzing metagenomic data according to claim 1, wherein in step 1), said raw data is long read length data obtained by nanopore sequencing.
4. The method for analyzing metagenomic data according to claim 1, wherein in step 1), the preprocessing of the raw data comprises removing the adapter sequence and barcode sequence, and filtering out read sequences with quality below Q7 and read sequences shorter than 100 bp.
5. The method for analyzing metagenomic data according to claim 1, wherein said step 2) comprises the steps of:
first, performing N rounds of sliding extraction at length K on each read sequence without splitting the read sequence, wherein the length K is any integer, and discarding any extracted sequence whose length is smaller than a preset length;
second, adding constraints, wherein the constraints comprise:
the positions of the slide-extracted K-mer sequences on the target read sequence are random;
the slide-extracted K-mer sequences are distributed over the upstream, midstream and downstream sites of the target read sequence;
the start positions of the slide-extracted K-mer sequences are uniformly distributed on the target read sequence; and
all K-mer sequences from the N extractions together cover, with base overlap, at least 80% of the region of the target read sequence;
and third, obtaining N K-mer sequences for each read sequence.
6. The method for analyzing metagenomic data according to claim 5, wherein K is in a range of 75bp or more and 500bp or less.
7. The method for analyzing metagenomic data according to claim 5, wherein K is in a range of 100bp or more and 200bp or less.
8. The method for analyzing metagenomic data according to claim 1, wherein in step 2), the value range of N is 10 ≤ N ≤ 50.
9. The method for analyzing metagenomic data as set forth in claim 1, wherein in step 2), the value range of N is 20 ≤ N ≤ 25.
10. A method for analyzing metagenomic data, comprising the steps of:
1) preprocessing the raw third-generation sequencing data by removing the adapter and barcode sequences added during library construction using Porechop software and NanoFilt software, and filtering out low-quality and excessively short read sequences, wherein the low-quality threshold is Q7 and the length threshold for excessively short sequences is 100 bp, to obtain a data set;
2) performing N rounds of sliding extraction at length K on each read sequence of the data set of step 1) without splitting the read, wherein 100 bp ≤ K ≤ 200 bp and 20 ≤ N ≤ 25, and discarding any extracted sequence whose length is smaller than a preset length;
adding constraints in the sliding extraction, including:
the positions of the slide-extracted K-mer sequences on the target read sequence are random;
the slide-extracted K-mer sequences are randomly distributed over the upstream, midstream and downstream regions of the target read;
the start sites of the slide-extracted K-mer sequences are uniformly distributed on the target read, without positional bias; and
after the N extractions, all K-mer sequences together cover, with base overlap, at least 80% of the region of the target read;
obtaining N K-mer sequences for each read sequence;
3) classifying K-mer sequences obtained by the same K-mer sliding extraction in all read sequences into a K-mer sequence subset to obtain N K-mer sequence subsets;
N fixed-length K-mer sequences are obtained for each read sequence, and the K-mer sequences obtained from different read sequences in the same extraction round belong to the same K-mer sequence subset, so that the original long-read sequencing data are decomposed into short-read sequencing data sets extracted by fixed-length sliding;
4) respectively carrying out metagenomic species analysis on the K-mer sequence subsets with the final length of K obtained in the step 3), wherein the species comprise bacteria, fungi, parasites and/or viruses, and finally obtaining N data analysis results;
5) merging the N data analysis results obtained in the step 4) to obtain a result matrix of the detected species and the relative abundance of the detected species, and calculating the average value, the variance and the standard deviation of the relative abundance of each detected species; giving a confidence level of 95%, calculating a confidence interval of relative abundance values of the species at the confidence level, taking the range of the confidence interval as a boundary, averaging the relative abundance values meeting the confidence interval in the N times of analysis results, taking the calculated value as the final detected relative abundance of the species, and taking the sum of the relative abundance values of all the detected species as 100%.
11. The method for analyzing metagenomic data according to claim 10, further comprising merging the N data analysis results obtained in step 4) and performing the following analysis:
performing statistical analysis of sample diversity based on the species abundance matrix; and
performing statistical analysis of significantly different species between sample groups based on the species abundance matrix.
CN202210002820.9A 2022-01-05 2022-01-05 Analysis method of metagenome data Active CN114023389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210002820.9A CN114023389B (en) 2022-01-05 2022-01-05 Analysis method of metagenome data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210002820.9A CN114023389B (en) 2022-01-05 2022-01-05 Analysis method of metagenome data

Publications (2)

Publication Number Publication Date
CN114023389A CN114023389A (en) 2022-02-08
CN114023389B true CN114023389B (en) 2022-03-25

Family

ID=80069300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210002820.9A Active CN114023389B (en) 2022-01-05 2022-01-05 Analysis method of metagenome data

Country Status (1)

Country Link
CN (1) CN114023389B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107034279A (en) * 2017-05-05 2017-08-11 中山大学 Application of the tuberculosis microbial markers in the reagent of diagnosis of tuberculosis is prepared
CN113862382A (en) * 2020-06-30 2021-12-31 北京大学人民医院 Application of biomarker of intestinal flora in preparation of product for diagnosing adult immune thrombocytopenia

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228035B (en) * 2016-07-07 2019-03-01 清华大学 Efficient clustering method based on local sensitivity Hash and imparametrization bayes method
KR20190047023A (en) * 2016-09-15 2019-05-07 썬 제노믹스 인코포레이티드 A universal method of extracting nucleic acid molecules from a population of one or more types of microorganisms in a sample
CN108334750B (en) * 2018-04-19 2019-02-12 江苏先声医学诊断有限公司 A kind of macro genomic data analysis method and system
US20210249102A1 (en) * 2018-05-31 2021-08-12 Arizona Board Of Regents On Behalf Of The University Of Arizona Methods for comparative metagenomic analysis
US11830581B2 (en) * 2019-03-07 2023-11-28 International Business Machines Corporation Methods of optimizing genome assembly parameters
CN111933218B (en) * 2020-07-01 2022-03-29 广州基迪奥生物科技有限公司 Optimized metagenome binding method for analyzing microbial community
CN113160882B (en) * 2021-05-24 2022-11-15 成都博欣医学检验实验室有限公司 Pathogenic microorganism metagenome detection method based on third generation sequencing

Also Published As

Publication number Publication date
CN114023389A (en) 2022-02-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40061506

Country of ref document: HK