CN114023389B

CN114023389B - Analysis method of metagenome data

Info

Publication number: CN114023389B
Application number: CN202210002820.9A
Authority: CN
Inventors: 郎继东; 孙继国
Original assignee: Chengdu Qitan Technology Ltd
Current assignee: Chengdu Qitan Technology Ltd
Priority date: 2022-01-05
Filing date: 2022-01-05
Publication date: 2022-03-25
Anticipated expiration: 2042-01-05
Also published as: CN114023389A

Abstract

The invention provides a metagenome data analysis method comprising the following steps: 1) preprocessing raw data to obtain a data set of the expected quality; 2) performing K-mer sliding extraction N times on each read sequence in the data set of step 1) to obtain N K-mer sequences; 3) grouping the K-mer sequences obtained by the same sliding extraction across all read sequences into one K-mer sequence subset, giving N K-mer sequence subsets; 4) performing metagenomic species analysis on each K-mer sequence subset obtained in step 3) to obtain N data analysis results; 5) merging the N data analysis results obtained in step 4) and analyzing them to obtain information on the various microorganisms in the metagenome. The method of the invention has ultra-high sensitivity and can effectively control false positive results.

Description

Analysis method of metagenome data

Technical Field

The invention belongs to the technical field of metagenome analysis, particularly relates to a metagenome data analysis method, and more particularly relates to a metagenome data analysis method based on third-generation sequencing.

Background

The metagenome, also known as the microbial environmental genome, is the sum of all microbial genetic material in an environment; at present the term mainly refers to the sum of the bacterial and fungal genomes in an environmental sample. Metagenomics is a new approach to studying microorganisms that takes the microbial genomes in environmental samples as its research object, uses functional gene screening and/or sequencing analysis as its research means, and aims to study microbial diversity, population structure, evolutionary relationships, functional activity, cooperative relationships, and relationships with the environment. It generally involves extracting genomic DNA from environmental samples for high-throughput sequencing analysis.

In recent years, metagenomics has been studied more and more widely, especially human metagenomics, such as studies of the intestinal flora and of tumor metagenomes. Researchers have not only sequenced and classified metagenomic sequences but also analyzed important correlations between the metagenome and human diseases. For example, a 2017 study published in the journal Science showed how microorganisms invade most pancreatic cancer patients and break down the main chemotherapeutic drug used by these patients (Geller LT, Barzily-Rokni M, Danino T, et al. Potential role of intratumor bacteria in mediating tumor resistance to the chemotherapeutic drug gemcitabine. Science. 2017;357(6356):1156-1160. DOI: 10.1126/science.aah5043); in 2020, a study in the journal Nature established for the first time a direct link between microorganisms in the human body and the genetic variation that drives cancer development, confirming that colorectal cancer gene mutations can be caused by a toxin released by the gut flora (Pleguezuelos-Manzano C, Puschhof J, Rosendahl Huber A, et al. Mutational signature in colorectal cancer caused by genotoxic pks+ E. coli. Nature. 2020;580(7802):269-273. DOI: 10.1038/s41586-020-2080-8); in the same year, another research team published the latest work on the cancer microbiome in the journal Nature (Poore GD, Kopylova E, Zhu Q, et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature. 2020;579(7800):567-574. DOI: 10.1038/s41586-020-).

Third-generation sequencing, also called single-molecule real-time DNA sequencing, is a technology that sequences each DNA molecule individually, without PCR amplification. At present, third-generation sequencing principles fall mainly into two categories: single-molecule fluorescence sequencing, represented by the SMRT technology of PacBio, and nanopore sequencing, represented by the nanopore electrophoresis technology of Oxford Nanopore and Qitan Technology. The main technical characteristics of third-generation sequencing are, first, that it exploits the intrinsic reaction speed of DNA polymerase, so that 10 bases can be sequenced per second, 20,000 times faster than chemical sequencing; and second, that it exploits the intrinsic processivity of DNA polymerase, so that a very long sequence can be read in a single reaction: second-generation sequencing reads hundreds of bases, whereas third-generation sequencing reads thousands of bases.

Metagenomic sequencing enables the simultaneous detection of almost all known pathogens in clinical samples. At present, most metagenome studies use a second-generation (Illumina) sequencing platform, for which the sequencing run time exceeds 16 hours and the total sample-to-report time is 48-72 hours. In contrast, third-generation (Nanopore) sequencing, after effective removal of the host DNA background (e.g., using saponin; Charalampous T, Kay GL, Richardson H, et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat Biotechnol. 2019;37(7):783-792. DOI: 10.1038/s41587-019-0156-5), allows microorganisms to be detected within about 50 minutes by real-time computational analysis, and the entire detection cycle can be kept within 6 hours (Gu W, Deng X, Lee M, et al. Rapid pathogen detection by metagenomic next-generation sequencing of infected body fluids. Nat Med. 2021;27(1):115-124. DOI: 10.1038/s41591-020-1105-z). Emerging pathogens and viruses can also be detected, so the approach has good prospects for clinical application, for example on-site real-time monitoring of epidemics.

However, current methods for metagenome detection based on third-generation sequencing still face considerable challenges. At the experimental level, how to remove the host genome, or enrich metagenomic DNA fragments, efficiently and stably is a major problem. An improper experimental method leaves the relative proportion of metagenomic DNA in the sequenced DNA low, so that too little effective metagenomic data reaches the subsequent data analysis; compensating with a larger volume of sequencing data raises the sequencing cost and gives very poor cost-effectiveness. At the data analysis level, one must consider not only how to detect microbial species with high sensitivity while effectively controlling false positive results, but also how to overcome the low data utilization caused by characteristics specific to third-generation data, such as random indels in nanopore reads. Existing methods for metagenome detection based on third-generation sequencing data, such as Kraken2, Bracken, and deep learning methods such as CNNs, do not solve these problems simultaneously.

Therefore, further improving existing metagenome data analysis methods so as to raise detection sensitivity and solve the problem of low data utilization is of great significance.

Disclosure of Invention

The present invention therefore aims to overcome the defects of the prior art by providing a method for analyzing metagenomic data that solves the data analysis problems described above. In terms of data characteristics, the method effectively avoids the low data utilization caused by random insertions and deletions (indels); it has ultra-high sensitivity, can accurately detect the species and composition of microorganisms even under very high host background noise, and effectively controls false positive results, so that the analysis is more accurate and efficient. At the same time, the method eases the demands on the experimental procedure, does not require an excessive volume of sequencing data, reduces sequencing cost, and makes full use of the advantages of third-generation sequencing data. The method is also compatible with current second-generation sequencing data analysis methods and broadens the means of analyzing metagenome data with second-generation sequencing approaches.

The purpose of the invention is realized by the following technical scheme.

In one aspect, the present invention provides a method for analyzing metagenomic data, the method comprising the steps of:

1) preprocessing raw data to obtain a data set with expected quality, wherein the data set with expected quality comprises a plurality of read sequences;

2) performing K-mer sliding extraction N times on each of the plurality of read sequences in the data set of step 1), obtaining N K-mer sequences for each read sequence;

constraint conditions are added to the N sliding extractions so that the start positions of the extracted K-mer sequences are randomly and uniformly distributed over the target read sequence, the extracted K-mer sequences overlap one another in bases, and all K-mer sequences from the N extractions together cover at least 80% of the target read sequence;

wherein N is an integer;

3) classifying K-mer sequences obtained by the same K-mer sliding extraction in all read sequences into a K-mer sequence subset to obtain N K-mer sequence subsets;

4) performing metagenomic species analysis on each K-mer sequence subset obtained in the step 3) respectively to finally obtain N data analysis results;

5) combining the N data analysis results obtained in the step 4), and analyzing and obtaining information of various microorganisms in the metagenome.
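
The grouping in step 3) can be pictured as a transposition: the i-th extraction from every read goes into subset i. The following Python sketch illustrates that layout only. The function name group_into_subsets and the use of None to mark a K-mer that was filtered out are assumptions made for illustration, not part of the described method.

```python
from typing import List, Optional


def group_into_subsets(kmers_per_read: List[List[Optional[str]]]) -> List[List[str]]:
    """Regroup per-read K-mers into per-extraction subsets.

    kmers_per_read[r][i] is the K-mer taken from read r in extraction i,
    or None if that extraction was filtered out (an assumed convention).
    """
    n = max((len(kmers) for kmers in kmers_per_read), default=0)
    subsets: List[List[str]] = [[] for _ in range(n)]
    for kmers in kmers_per_read:
        for i, kmer in enumerate(kmers):
            if kmer is not None:
                subsets[i].append(kmer)
    return subsets


# Two reads, N = 3 extractions each; one extraction from the second read was filtered.
example = [["ACGT", "CGTA", "GTAC"], ["TTGA", None, "GACC"]]
subsets = group_into_subsets(example)
```

Each of the N subsets produced this way would then be analyzed independently in step 4).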

According to the analysis method of the metagenomic data, in the step 1), the original data is original third-generation sequencing data or long-read-length sequencing data.

Here, long read length refers to sequencing data obtained by third-generation sequencing, such as nanopore sequencing; the length of the raw reads can reach 10 kb, 20 kb, 150 kb, 160 kb, 200 kb, 300 kb, 1 Mb, or 2 Mb. Those skilled in the art know that, as sequencing technology develops, still longer read lengths may become possible. It will be appreciated, however, that the method of the invention is applicable to sequencing data of any length, and that the longer the data, the greater the advantage of the method.

According to the method for analyzing metagenomic data, in the step 1), the original data is long read length data obtained by nanopore sequencing.

The method for analyzing metagenomic data according to the present invention, wherein, in the step 1), the preprocessing of the raw data comprises removing the adapter (linker) sequences and barcode sequences, and filtering out low-quality read sequences and read sequences that are too short.

In an optional embodiment, the low-quality threshold may be Q5 or Q7, thresholds well known to those skilled in the art. Here Q denotes the average quality value of a sequencing read, i.e., the average of the per-base quality (accuracy) values of the read. The threshold can be adjusted according to actual conditions; the specific adjustment parameters are detailed in http:// maq.

In an optional embodiment, the sequence length threshold for too short a read sequence includes, but is not limited to, 100 bp; for example, the threshold may be 50bp, 200bp, 300bp, or the like. Those skilled in the art can make adjustments as appropriate, and the threshold adjustment parameter is described in detail in http:// maq.

In an optional embodiment, the preprocessing of step 1) uses Porechop software and/or NanoFilt software. These are data processing tools commonly used by those skilled in the art, whose usage and technical parameters are well known, and they are not described in detail here.
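
As an illustration of the length and quality filter, here is a minimal Python sketch. It is not Porechop or NanoFilt code. It assumes Phred+33 encoded quality strings and assumes that the read-level Q is obtained by averaging per-base error probabilities and converting the mean back to the Phred scale, which is one common convention. The function names and the default thresholds (Q7, 100 bp) are illustrative.

```python
import math


def mean_read_quality(quality_string: str, offset: int = 33) -> float:
    """Read-level Q: average the per-base error probabilities, then convert back to Phred."""
    if not quality_string:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


def passes_filters(sequence: str, quality_string: str,
                   min_quality: float = 7.0, min_length: int = 100) -> bool:
    """Keep a read only if it is long enough and its mean quality reaches the threshold."""
    if len(sequence) < min_length:
        return False
    return mean_read_quality(quality_string) >= min_quality


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
good_read = passes_filters("A" * 150, "I" * 150)
short_read = passes_filters("A" * 50, "I" * 50)
noisy_read = passes_filters("A" * 150, "#" * 150)
```

Lowering min_quality to 5.0 would correspond to the Q5 threshold mentioned above.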

As known to those skilled in the art, K-mers are the strings of K bases into which a read is divided; e.g., the sequence ATCG is a 4-mer and the sequence ATCGATCG is an 8-mer.

The method for analyzing metagenomic data according to the present invention, wherein the step 2) is achieved by a method comprising the steps of:

firstly, performing sliding extraction N times on each read sequence with a length K, without splitting the read sequence, wherein K is any integer, and discarding an extracted sequence if its length after sliding extraction is smaller than a preset length;

secondly, adding constraint conditions, wherein the constraint conditions comprise:

the positions of the extracted K-mer sequences on the target read sequence are random;

the extracted K-mer sequences are distributed over the upstream, midstream and downstream regions of the target read sequence;

the start positions of the extracted K-mer sequences are uniformly distributed over the target read sequence; and

finally, all K-mer sequences from the N extractions, which overlap one another in bases, together cover at least 80% of the target read sequence;

and thirdly, obtaining N K-mer sequences for each read sequence.
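
One way to approach the constraints above is stratified random sampling of start positions: the range of valid starts is divided into N equal strata and one start is drawn at random from each. This is an assumed strategy for illustration; the source does not specify how the constraints are enforced. The names sliding_extract and coverage_fraction are hypothetical. Because N extractions of length K contain at most N × K bases, 80% coverage is only attainable when N × K is at least 0.8 times the read length, so the sketch reports coverage rather than enforcing it.

```python
import random
from typing import List, Optional, Tuple


def sliding_extract(read: str, k: int, n: int,
                    rng: Optional[random.Random] = None) -> List[Tuple[int, str]]:
    """Draw n K-mers of length k from read, one random start per stratum."""
    rng = rng or random.Random()
    if k <= 0 or n <= 0 or len(read) < k:
        return []  # no full-length K-mer can be extracted
    n_starts = len(read) - k + 1  # number of valid start positions
    starts = []
    for i in range(n):
        lo = i * n_starts // n
        hi = max(lo, (i + 1) * n_starts // n - 1)
        starts.append(rng.randint(lo, hi))
    rng.shuffle(starts)  # so the extraction index is not tied to a region of the read
    return [(s, read[s:s + k]) for s in starts]


def coverage_fraction(read_length: int, starts: List[int], k: int) -> float:
    """Fraction of read positions covered by at least one extracted K-mer."""
    covered = [False] * read_length
    for s in starts:
        for pos in range(s, min(s + k, read_length)):
            covered[pos] = True
    return sum(covered) / read_length if read_length else 0.0


rng = random.Random(0)
read = "".join(rng.choice("ACGT") for _ in range(1000))
kmers = sliding_extract(read, k=100, n=20, rng=rng)
coverage = coverage_fraction(len(read), [s for s, _ in kmers], k=100)
```

In this toy case (a 1,000-base read, K = 100, N = 20) each stratum is narrower than K, so neighbouring K-mers necessarily overlap and the covered fraction exceeds 80%.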

In an optional embodiment, 75 bp ≤ K ≤ 500 bp, or 100 bp ≤ K ≤ 200 bp, or K = 200 bp. As known to those skilled in the art, K may be set as desired according to the quality characteristics of the data or as conditions permit. Without being bound by any theory, K may be any integer value; for example, K may be 20 bp or more, 50 bp or more, 75 bp or more, 100 bp or more, 150 bp or more, 200 bp or more, 300 bp or more, 400 bp or more, and so on.

According to the method of the invention for analyzing metagenomic data, in the step 2), N is in the range 10 ≤ N ≤ 50, or 20 ≤ N ≤ 25. Without being bound by any theory, N may be any integer value, e.g., N may be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 or 100, or 200, or 300, etc.

The method for analyzing metagenomic data according to the present invention, wherein, in the step 4), the metagenomic species analysis may use Centrifuge, MetaPhlAn, or the like. These are software tools known to and used by those skilled in the art and are conventional for metagenomic analysis.

The method for analyzing metagenomic data according to the present invention, wherein the step 5) is implemented by a method comprising the steps of:

merging the N data analysis results obtained in the step 4) to obtain a result matrix of the detected species and their relative abundances, and calculating the mean, variance and standard deviation of the relative abundance of each detected species; choosing a confidence level, such as but not limited to 90%, 95% or 99%, and calculating the confidence interval of the relative abundance of each species at that confidence level; taking the confidence interval as the bounds, averaging the relative abundance values from the N analysis results that fall within the interval, and taking this average as the final detected relative abundance of the species, wherein the relative abundances of all detected species sum to 100%.
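
The merging step can be sketched in Python as follows. The source does not give the interval formula, so this sketch assumes a normal-approximation range of mean ± z·s, where s is the sample standard deviation across the N runs and z is the two-sided normal quantile for the chosen confidence level. It also assumes that a species missing from a run has abundance 0 in that run. The function name merge_abundances and the example numbers are hypothetical.

```python
from statistics import NormalDist, mean, stdev
from typing import Dict, List


def merge_abundances(results: List[Dict[str, float]],
                     confidence: float = 0.95) -> Dict[str, float]:
    """Merge N per-subset abundance tables (species -> percent) into one table."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    species = sorted({name for run in results for name in run})
    merged = {}
    for name in species:
        values = [run.get(name, 0.0) for run in results]  # assumed: absent means 0
        centre = mean(values)
        spread = stdev(values) if len(values) > 1 else 0.0
        low, high = centre - z * spread, centre + z * spread
        kept = [v for v in values if low <= v <= high]
        merged[name] = mean(kept) if kept else centre
    total = sum(merged.values())
    if total == 0:
        return merged
    # Rescale so that the final relative abundances sum to 100%.
    return {name: 100.0 * value / total for name, value in merged.items()}


# 19 consistent runs plus one outlier run (made-up numbers for illustration).
runs = [{"A": 50.0, "B": 50.0}] * 19 + [{"A": 90.0, "B": 10.0}]
final = merge_abundances(runs)
```

Under these assumptions the outlier run falls outside the interval for both species, so it is excluded from the averages before rescaling.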

In a particular embodiment, the method of the invention comprises the steps of:

1) preprocessing the raw third-generation sequencing data: using Porechop software and NanoFilt software to remove the adapter and barcode sequences added during library construction and to filter out low-quality and too-short read sequences, wherein the low-quality threshold is Q7 and the length threshold for too-short sequences is 100 bp, so as to obtain a data set;

2) performing sliding extraction N times on each read sequence of the data set of step 1) with a length K, without splitting the read, wherein 100 bp ≤ K ≤ 200 bp and 20 ≤ N ≤ 25, and discarding an extracted sequence if its length after sliding extraction is less than the preset length,

adding constraints to the sliding extraction, including:

the positions of the extracted K-mer sequences on the target read sequence are random;

the extracted K-mer sequences are randomly distributed over the upstream, midstream and downstream regions of the target read;

the start sites of the extracted K-mer sequences are uniformly distributed over the target read, with no preference; and

finally, after the N extractions, all K-mer sequences, which overlap one another in bases, together cover at least 80% of the target read sequence;

obtaining N K-mer sequences for each read sequence;

3) N fixed-length K-mer sequences are obtained for each read sequence, and the K-mer sequences obtained from different read sequences by the same extraction belong to the same K-mer sequence subset, so that the original long-read sequencing data is decomposed into short-read sequencing data sets extracted by fixed-length sliding;

4) respectively carrying out metagenomic species analysis on the K-mer sequence subsets with the final length of K obtained in the step 3), wherein the species comprise bacteria, fungi, parasites and/or viruses, and finally obtaining N data analysis results;

5) merging the N data analysis results obtained in the step 4) to obtain a result matrix of the detected species and the relative abundance of the detected species, and calculating the average value, the variance and the standard deviation of the relative abundance of each detected species; giving a confidence level of 95%, calculating a confidence interval of relative abundance values of the species at the confidence level, taking the range of the confidence interval as a boundary, averaging the relative abundance values meeting the confidence interval in the N times of analysis results, taking the calculated value as the final detected relative abundance of the species, and taking the sum of the relative abundance values of all the detected species as 100%.

The method also comprises the step of merging the N data analysis results obtained in the step 4), and carrying out the following analysis:

performing statistical analysis on sample diversity based on the species abundance matrix, such as but not limited to alpha diversity, Shannon index and Simpson index;

performing statistical analysis of species that differ significantly between sample groups based on the species abundance matrix, such as but not limited to LDA Effect Size (LEfSe) analysis or analysis of variance.
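
For reference, the two diversity indices named above can be computed from one sample's abundance vector as in this Python sketch. It uses the natural logarithm for the Shannon index and the 1 − Σp² form of the Simpson index; both are common conventions but are assumptions here, since the source does not specify them. The function names are illustrative.

```python
import math
from typing import List, Sequence


def _proportions(abundances: Sequence[float]) -> List[float]:
    """Convert abundances to proportions, dropping species with zero abundance."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return [a / total for a in abundances if a > 0]


def shannon_index(abundances: Sequence[float]) -> float:
    """Shannon index H = -sum(p * ln p)."""
    return -sum(p * math.log(p) for p in _proportions(abundances))


def simpson_index(abundances: Sequence[float]) -> float:
    """Simpson index in the form 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in _proportions(abundances))


# Four equally abundant species versus a single species.
even_sample = [25.0, 25.0, 25.0, 25.0]
single_species = [100.0]
h_even = shannon_index(even_sample)
d_even = simpson_index(even_sample)
```

Both indices are zero for a sample containing a single species and grow as the abundances become more even.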

The present invention also provides an apparatus for metagenomic data analysis, wherein the apparatus comprises:

a data preprocessing module for preprocessing the sequences to obtain a data set of the desired quality;

a data processing module for processing the data set into N K-mer sequence subsets;

a microorganism identification module for analyzing microbial species classification data from the data set of the desired quality based on a microbial reference sequence database;

a data result integration module for merging the species classification data analysis results; and

a reporting module for outputting the results according to the data.

The data preprocessing module comprises Porechop software and/or NanoFilt software;

the data processing module uses the method of the invention to process a data set into N K-mer sequence subsets;

the microorganism identification module comprises MetaPhlAn software, Centrifuge software, and/or metagenomic databases.

The invention provides a method and a device for detecting a metagenome based on third-generation sequencing data. Specifically, the inventors of the present invention found that the long read length advantage of the third generation sequencing data can be more fully exerted by performing sliding extraction on the original long read sequence and converting the original long read sequence into a short read length sequencing sequence.

Although the structure and data characteristics of an original sequencing read sequence are not changed in the conventional metagenome analysis method, the problem of false negative caused by insufficient detection sensitivity exists.

Therefore, the method for analyzing the metagenome data can effectively avoid the problem of low data utilization rate caused by random insertion and deletion (indel) from the data characteristics, has ultrahigh sensitivity, can accurately detect the species and the composition of microorganisms even under ultrahigh host (host) background noise, and effectively controls the false positive result; the pressure is reduced for the experimental technical level, excessive sequencing data volume is not needed, the sequencing cost is reduced, and the advantages of the third-generation sequencing data are fully exerted. The method of the invention can be well compatible with the current second-generation sequencing data analysis method, and enlarges the means of analyzing the metagenome data by the second-generation sequencing; meanwhile, the method is used as a metagenome analysis method for analyzing third-generation sequencing data, so that the analysis result is more accurate and efficient.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a flow diagram of a metagenomic data analysis method in one embodiment of the present invention;

FIG. 2 is a block diagram showing the structure of a metagenomic data analysis apparatus according to an embodiment of the present invention.

Detailed Description

Features and exemplary embodiments of various aspects of the present invention will be described in detail below. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The embodiments will be described in detail below with reference to the accompanying drawings.

With reference to FIG. 1 and FIG. 2, the present invention provides a method for analyzing metagenomic data based on third-generation sequencing data, the method comprising the steps of:

S1: preprocessing the raw third-generation sequencing data: using Porechop software and NanoFilt software to remove the adapter and barcode sequences added during library construction and to filter out low-quality and too-short read sequences, wherein the low-quality threshold is Q7 and the length threshold for too-short sequences is 100 bp, so as to obtain a data set;

S2: performing sliding extraction N times on each read sequence of the data set of step S1 with a length K, without splitting the read, wherein 100 bp ≤ K ≤ 200 bp and 20 ≤ N ≤ 25, and discarding an extracted sequence if its length after sliding extraction is less than the preset length,

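adding constraints to the sliding extraction, including:

the positions of the extracted K-mer sequences on the target read sequence are random;

the extracted K-mer sequences are randomly distributed over the upstream, midstream and downstream regions of the target read;

the start sites of the extracted K-mer sequences are uniformly distributed over the target read, with no preference; and

finally, after the N extractions, all K-mer sequences, which overlap one another in bases, together cover at least 80% of the target read sequence;

obtaining N K-mer sequences for each read sequence;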

S3: grouping the K-mer sequences obtained by the same sliding extraction across all read sequences into one K-mer sequence subset, to obtain N K-mer sequence subsets;

S4: performing metagenomic species analysis separately on each of the K-mer sequence subsets of read length K obtained in step S3, wherein the species include bacteria, fungi, parasites and/or viruses, to finally obtain N data analysis results;

S5: merging the N data analysis results obtained in step S4 to obtain a result matrix of the detected species and their relative abundances, and calculating the mean, variance and standard deviation of the relative abundance of each detected species; choosing a confidence level of 95%, calculating the confidence interval of the relative abundance of each species at that confidence level, and finally taking the mean of the relative abundance values that fall within the confidence interval as the final detected relative abundance of the species, wherein the relative abundances of all detected species sum to 100%.

The inventor finds that the original long-read sequence is subjected to sliding extraction and is converted into a short-read sequencing sequence, and the advantage of long-read of the third-generation sequencing data can be more fully exerted.

Further, as shown in FIG. 2, one embodiment of the present invention provides an apparatus for metagenomic data analysis, wherein the apparatus comprises: a data preprocessing module 101, which preprocesses the sequences using, for example, Porechop software and/or NanoFilt software to obtain a data set of the desired quality; a data processing module 102, which processes the data set into N K-mer sequence subsets using the method of the present invention; a microorganism identification module 103, which analyzes microbial species classification data from the data set using, for example, MetaPhlAn software or Centrifuge software, based on a microbial reference sequence database and/or metagenomic database; a data result integration module 104, which merges the species classification data analysis results; and a reporting module 105, which outputs the results according to the data.
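
The data flow between modules 101 to 105 can be sketched as a chain of functions. The Python below is only a structural illustration: run_pipeline and its callable parameters are hypothetical names, and the toy stand-ins in the usage lines are placeholders, not the behavior of Porechop, NanoFilt, MetaPhlAn or Centrifuge.

```python
from typing import Callable, Dict, List

Reads = List[str]
Abundance = Dict[str, float]


def run_pipeline(raw_reads: Reads,
                 preprocess: Callable[[Reads], Reads],
                 split_into_subsets: Callable[[Reads], List[Reads]],
                 classify: Callable[[Reads], Abundance],
                 integrate: Callable[[List[Abundance]], Abundance],
                 report: Callable[[Abundance], str]) -> str:
    """Chain the five modules in the order described for the apparatus."""
    dataset = preprocess(raw_reads)                        # module 101
    subsets = split_into_subsets(dataset)                  # module 102
    per_subset = [classify(subset) for subset in subsets]  # module 103
    merged = integrate(per_subset)                         # module 104
    return report(merged)                                  # module 105


# Toy stand-ins so the sketch runs end to end (placeholders only).
output = run_pipeline(
    ["ACGTACGT", "AC", "GGGTTTAA"],
    preprocess=lambda reads: [r for r in reads if len(r) >= 4],
    split_into_subsets=lambda reads: [[r[:4] for r in reads], [r[4:] for r in reads]],
    classify=lambda subset: {"species_X": 100.0} if subset else {},
    integrate=lambda tables: {name: sum(t.get(name, 0.0) for t in tables) / len(tables)
                              for t in tables for name in t},
    report=lambda table: "\n".join(f"{name}\t{value:.1f}"
                                   for name, value in sorted(table.items())),
)
```

Passing each stage in as a callable keeps the sketch independent of any particular tool, which matches the "for example" wording used for the software in each module.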

Example 1: Data analysis using the method of the invention

1. Preparing 4 groups of metagenomic standards containing human host proportions of 0%, 30%, 50% and 90%, wherein each standard comprises Enterococcus faecalis, Escherichia coli, Lactobacillus fermentum, Listeria monocytogenes, Salmonella enterica and Staphylococcus aureus, each with an expected abundance in each group of standards; constructing libraries and sequencing them on a third-generation nanopore sequencer, model QNome-9604, to obtain raw long-read sequencing data with a maximum read length of 16 kb and an average read length of 7.8 kb; using Porechop software and NanoFilt software to remove the adapter and barcode sequences added during library construction, and filtering out read sequences with quality below Q5 and read sequences shorter than 100 bp;

2. performing sliding extraction 20 times (N = 20) on each read sequence in the data set obtained in step 1 with a length K = 100 bp, without splitting the read, and discarding an extracted short-read sequence if its length after sliding extraction is less than 100 bp;

adding constraint conditions: the positions on the target read of the short-read sequences obtained by the 20 sliding extractions are random, their start sites are evenly distributed over the upstream, midstream and downstream regions of the target read, and the set of 20 extracted short-read sequences covers 85% of the target read sequence;

3. grouping the K-mer sequences obtained in each extraction into one K-mer sequence subset, finally obtaining 20 subsets of fixed-length (K = 100 bp) sliding-extracted K-mer sequences, i.e., decomposing the original long-read sequencing data into short-read sequencing data sets extracted by sliding with the fixed length K = 100 bp;

4. performing metagenomic species analysis with MetaPhlAn 3 on each of the 20 fixed-length (K = 100 bp) short-read sequence data sets obtained in step 3, to obtain 20 data analysis results;

5. combining the 20 data analysis results to obtain a result matrix of the detected species and the relative abundance thereof, respectively calculating the average value, the variance and the standard deviation of the relative abundance of each detected species, giving a 95% confidence level, calculating the confidence interval of the relative abundance value of the species under the 95% confidence level, and finally taking the average value of the relative abundance values falling in the confidence interval as the detected relative abundance value of the species, wherein the sum of the relative abundance values of all the detected species is 100%.

As shown in Table 1, the statistics show that the method detects the various strains with high sensitivity both in data from a pure bacterial background and in data with very high host background noise, with no false positive results, so the method is particularly suitable for metagenomic analysis of third-generation sequencing data.

TABLE 1 statistics of the results of the detection of various species and their relative abundance by the method of the invention

Note: because MetaPhlAn 3 does not take the human host into account and only calculates the relative abundance of aligned species, there is no result for the human host, which is indicated by "-".

In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

It should be understood that in the present embodiment, "B corresponding to A" means that B is associated with A and that B can be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for analyzing metagenomic data, comprising the steps of:

wherein N is an integer;

5) merging the N data analysis results obtained in the step 4), and analyzing and obtaining information of various microorganisms in the metagenome;

the step 5) specifically comprises the following steps:

merging the N data analysis results obtained in the step 4) to obtain a result matrix of the detected species and the relative abundance of the detected species, and calculating the average value, the variance and the standard deviation of the relative abundance of each detected species; giving a confidence level, calculating a confidence interval of relative abundance values of the species at the confidence level, taking the range of the confidence interval as a boundary, averaging the relative abundance values meeting the confidence interval in the N times of analysis results, taking the calculated value as the final relative abundance detected by the species, and taking the sum of the relative abundance values of all the detected species as 100%.

2. The method for analyzing metagenomic data according to claim 1, wherein in step 1), said raw data is raw third generation sequencing data or long read length sequencing data.

3. The method for analyzing metagenomic data according to claim 1, wherein in step 1), said raw data is long read length data obtained by nanopore sequencing.

4. The method of claim 1, wherein in step 1), the pre-processing of the raw data comprises removing the linker sequence and barcode sequence, and filtering the read sequence with quality less than Q7 and the read sequence with length less than 100 bp.

5. The method for analyzing metagenomic data according to claim 1, wherein said step 2) comprises the steps of:

adding constraint conditions, wherein the constraint conditions comprise:

and thirdly, obtaining N K-mer sequences aiming at each read sequence.

6. The method for analyzing metagenomic data according to claim 5, wherein K is in a range of 75bp or more and 500bp or less.

7. The method for analyzing metagenomic data according to claim 5, wherein K is in a range of 100bp or more and 200bp or less.

8. The method for analyzing metagenomic data according to claim 1, wherein in step 2), the value range of N is 10 ≤ N ≤ 50.

9. The method for analyzing metagenomic data as set forth in claim 1, wherein in step 2), the value range of N is 20 ≤ N ≤ 25.

10. A method for analyzing metagenomic data, comprising the steps of:

2) performing sliding extraction on each read sequence of the data set in the step 1) for N times according to the length K without dividing the read, setting the length K to be more than or equal to 100bp and less than or equal to 200bp, setting the length N to be more than or equal to 20 and less than or equal to 25, and filtering the extracted sequence if the length after the sliding extraction is less than the preset length;

adding constraints in the sliding extraction, including:

obtaining N K-mer sequences for each read sequence;

11. The method for analyzing metagenomic data according to claim 10, further comprising merging the N data analysis results obtained in step 4) and performing the following analysis:

performing statistical analysis on the sample diversity based on the species abundance matrix;

performing statistical analysis of species that differ significantly between sample groups based on the species abundance matrix.