WO2023283967A1

WO2023283967A1 - Optimized kraken2 algorithm and application thereof in second-generation sequencing

Info

Publication number: WO2023283967A1
Application number: PCT/CN2021/106970
Authority: WO
Inventors: 张岩; 李振中; 任用; 李诗濛; 郭昊; 梁相志; 陈莉; 戴岩; 李珊; 顾菊
Original assignee: 江苏先声医学诊断有限公司; 江苏先声医疗器械有限公司; 南京先声诊断技术有限公司
Priority date: 2021-07-14
Filing date: 2021-07-17
Publication date: 2023-01-19
Also published as: CN113539369A; CN113539369B

Abstract

A biological information analysis method based on a kraken2 single-sequence kmer score and overall taxonomy structure statistics. By means of the method, false positives in biological information analysis can be reduced, and the species detection accuracy can be improved. The method is applicable to second-generation metagenome sequencing analysis.

Description

An optimized kraken2 algorithm and its application in next-generation sequencing

This application claims the priority of the Chinese patent application submitted to the China Patent Office on July 17, 2021, with the application number 202110804351.8 and the invention title "An optimized kraken2 algorithm and its application in next-generation sequencing", the entire contents of which are incorporated by reference in this application.

Technical field

The invention relates to the field of bioinformatics, in particular to an optimized kraken2 algorithm and its application in next-generation sequencing.

Background art

Metagenomic communities are complex and large, so a large amount of DNA needs to be sequenced. Illumina next-generation sequencing is a massively parallel sequencing technology with high throughput, high sequencing accuracy and short turnaround time, which matches the needs of metagenomics well. This has led to the widespread application of metagenomics in infection detection.

Species detection of microbial communities after sequencing is the most important task in metagenomics research. Only by identifying the microbial community accurately and reliably can metagenomics be linked to the research question, such as whether a patient's disease is caused by a particular microbial infection (if malaria is suspected, the presence of Plasmodium in the blood must be accurately detected to give a definitive diagnosis). Metagenomic analysis is a fast, accurate and advanced detection technology that currently plays an important role in the auxiliary diagnosis of infectious diseases.

Kraken2, applied to Illumina second-generation metagenomic sequencing, offers fast analysis and high sensitivity, but its specificity is low and it often reports many false positive results. This follows from how the kraken2 algorithm works. Based on the relationship between taxid and seqid, it quickly constructs fixed-length kmers from the selected reference genome sequences (the default kmer length is 35 bp) and gives priority to assigning each kmer to the most specific level. For Streptococcus pneumoniae, for example, kraken2 first constructs the kmers specific to that species; if a kmer occurs in several species of the genus Streptococcus, the kmer is placed at the genus Streptococcus; by the same principle, if a kmer occurs in several genera of the family Streptococcaceae, the kmer is placed at the family Streptococcaceae. Given this design, for a microorganism represented by a large number of DNA sequences, a certain probability of misalignment basically does not interfere with the detection of that species. However, because next-generation sequencing reads are short, a sequence is prone to being misaligned or to not being aligned precisely (for example, a sequence from Streptococcus pneumoniae may be misaligned to Streptococcus mitis, or aligned only at the level of the genus Streptococcus), and this is the most important factor affecting the accuracy of species detection.
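The placement of shared kmers described above amounts to a lowest-common-ancestor (LCA) lookup in the taxonomy. The following is a minimal Python sketch under stated assumptions: the `PARENT` table is a toy subset, the helper names `lineage` and `lca` are hypothetical, and the family taxid 1300 is assumed for illustration; it is not kraken2's own code.

```python
# Toy taxonomy: child taxid -> parent taxid (illustrative subset, assumed).
PARENT = {
    1313: 1301,   # Streptococcus pneumoniae -> genus Streptococcus
    28037: 1301,  # Streptococcus mitis -> genus Streptococcus
    1301: 1300,   # genus Streptococcus -> family (taxid assumed)
    1300: 1,      # family -> root
}


def lineage(taxid):
    """Return the path from a taxid up to the root, starting with itself."""
    path = [taxid]
    while taxid in PARENT:
        taxid = PARENT[taxid]
        path.append(taxid)
    return path


def lca(taxids):
    """Lowest common ancestor of the taxids whose genomes share a kmer."""
    taxids = list(taxids)
    common = set(lineage(taxids[0]))
    for t in taxids[1:]:
        common &= set(lineage(t))
    # Walking up from the first taxid, the first shared node is the lowest one.
    return next(node for node in lineage(taxids[0]) if node in common)
```

With this table, a kmer found only in taxid 1313 stays at the species, while a kmer shared by taxids 1313 and 28037 is placed at the genus taxid 1301.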

In addition, because the database contains many sequences such as plasmids and vectors, alignments to these sequences are also output; such results are essentially meaningless and can also be counted as false positive detections.

In view of this, the present invention is proposed.

Summary of the invention

The purpose of the present invention is to seek a bioinformatics analysis method that can reduce false positives in sequencing analysis, improve the accuracy of species detection, and is suitable for Illumina second-generation metagenomic sequencing.

To achieve the above object, the present invention proposes the following technical solutions:

The present invention firstly provides a bioinformatics analysis method based on kraken2 single sequence kmer score and overall taxonomy structure statistics, said method comprising the following steps:

1) NGS sequencing data is compared using kraken2 to obtain the taxid-kmer result of each sequence;

2) Establish the taxid hierarchy based on the taxonomy database, obtain the taxids from the taxid-kmer result of step 1), associate them with their taxonomy levels, and then relocate the taxid according to the positioning rules;

3) Calculate the kmer score of each sequence according to the taxid located in step 2) of each sequence and the taxid-kmer comparison result of step 1);

4) Perform an overall calculation on the comparison results according to the kmer score and the taxonomy level;

Further, the method also includes:

5) Carry out species-level detection based on the overall calculation results of 4).

Further, the hierarchical relationship in step 2) includes one or more hierarchical relationships of serotype/subtype, species, genus and/or family.

Further, the positioning rules in the step 2) include the following:

Normally, the taxid location given by kraken2 is accepted, except in the following cases:

If a sequence obtains a unique taxid from the taxid-kmer result and that taxid is below the species level, the sequence is located at the species-level taxid to which the taxid belongs;

When a sequence obtains two or more taxids from the taxid-kmer result, there are 3 cases:

If only one species appears among all the taxids, and the other taxids belong to the serotype/subtype, genus or family of that species, the sequence is located at the species-level taxid;

If the taxids are associated with two or more species that belong to the same genus, the sequence is finally located at the genus-level taxid;

If the taxids are associated with two or more genera (including cases without a genus classification) that belong to the same family, the sequence is finally located at the family-level taxid;
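The positioning rules can be sketched in Python as follows. This is one possible reading, not the applicant's implementation: the `TAXONOMY` table is a toy subset, the serotype taxid 900001 is hypothetical, the family taxid 1300 and the second genus (1357, 1358) are assumed for illustration, and the treatment of a missing genus (mapped to `None`) is an assumption.

```python
# Toy taxonomy: taxid -> (rank, parent taxid). Illustrative values only.
TAXONOMY = {
    900001: ("serotype", 1313),  # hypothetical serotype under the species
    1313: ("species", 1301),     # Streptococcus pneumoniae
    28037: ("species", 1301),    # Streptococcus mitis
    1301: ("genus", 1300),       # Streptococcus
    1358: ("species", 1357),     # assumed species in a second genus
    1357: ("genus", 1300),       # assumed second genus of the same family
    1300: ("family", 1),         # assumed family taxid
    1: ("root", None),
}


def ancestor_at(taxid, rank):
    """Return the taxid itself or its ancestor at `rank`, or None if absent."""
    while taxid is not None:
        node_rank, parent = TAXONOMY[taxid]
        if node_rank == rank:
            return taxid
        taxid = parent
    return None


def relocate(kraken_taxid, hit_taxids):
    """Relocate one read from kraken2's taxid and the taxids hit by its kmers."""
    hits = {t for t in hit_taxids if t != 0}  # 0 marks an unclassified kmer
    if not hits:
        return kraken_taxid
    families = {ancestor_at(t, "family") for t in hits}
    if None in families or len(families) != 1:
        return kraken_taxid  # hits not confined to one family: keep kraken2's call
    species = {ancestor_at(t, "species") for t in hits} - {None}
    # Genus of every hit below family rank; None marks "no genus classification".
    genera = {ancestor_at(t, "genus") for t in hits if TAXONOMY[t][0] != "family"}
    if len(species) == 1 and len(genera) == 1:
        return next(iter(species))   # one species (also lifts a lone serotype hit)
    if len(species) >= 2 and len(genera) == 1 and None not in genera:
        return next(iter(genera))    # several species of the same genus
    if len(species) >= 2 or len(genera) >= 2:
        return next(iter(families))  # several genera of the same family
    return kraken_taxid
```

For instance, kmer hits on taxids 1313 and 1301 keep the read at species 1313, whereas hits on 1313, 28037 and 1301 move it to genus 1301.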

Further, the calculation rules in the step 3) include:

For a sequence finally located at or below the family-level taxid, kmer score = (family taxid kmers + genus taxid kmers + species taxid kmers + subtype/serotype taxid kmers) / total kmers;

For a sequence finally located above the family-level taxid, the kmer score is set to 0.
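A small Python sketch of this calculation rule is given below. It assumes that the kmers counted in the numerator are those whose taxid lies in the same family as the located taxid; the `TAXONOMY` table, the family taxid 1300 and the function names are illustrative assumptions.

```python
# Toy taxonomy: taxid -> (rank, parent taxid); family taxid 1300 is assumed.
TAXONOMY = {
    1313: ("species", 1301),
    28037: ("species", 1301),
    1301: ("genus", 1300),
    1300: ("family", 1),
    1: ("root", None),
}


def family_of(taxid):
    """Family-level ancestor of a taxid, or None if it sits above family level."""
    while taxid is not None and taxid in TAXONOMY:
        rank, parent = TAXONOMY[taxid]
        if rank == "family":
            return taxid
        taxid = parent
    return None


def kmer_score(located_taxid, kmer_hits):
    """kmer_hits maps taxid -> number of kmers; taxid 0 means unclassified."""
    total = sum(kmer_hits.values())
    family = family_of(located_taxid)
    if family is None or total == 0:
        return 0.0  # located above the family level: score is set to 0
    # Count kmers at family, genus, species or sub-species level of that family.
    supported = sum(n for t, n in kmer_hits.items() if family_of(t) == family)
    return supported / total
```

For a read with taxid-kmer result 0:10, 1313:20, 1301:12 that is located at taxid 1313, this gives (12 + 20) / 42.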

Further, in the step 4), the overall calculation includes:

a. Set a filter cutoff threshold, and filter each sequence according to the kmer score;

b. For the sequences retained after filtering in a, count the reads of each taxid;

The reads of a taxid is the total number of sequences assigned to that taxid in a sample;

c. Set a filtering threshold (threshold); for the taxids located at the species level in b, calculate the genus relative ratio and exclude the species-level taxids whose ratio is lower than the threshold;

The genus relative ratio is the ratio of the reads of a given species-level taxid to the reads of the species-level taxid with the most reads in the same genus;

Further, it also includes:

d. If a species-level taxid retained after filtering by c lacks a genus classification, calculate its family relative ratio and exclude the species-level taxids whose ratio is lower than the filtering threshold (threshold);

The family relative ratio is the ratio of the reads of a given species-level taxid to the reads of the species-level taxid with the most reads in the same family;
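The relative-ratio filter of steps c and d can be illustrated with the following Python sketch. The function name, the read counts and the threshold value 0.05 are hypothetical; the source does not state a value for threshold.

```python
def filter_by_relative_ratio(species_reads, group_of, threshold):
    """Split species into kept and rejected by their ratio to the top species.

    species_reads: species taxid -> read count (positive integers)
    group_of: species taxid -> genus taxid (or family taxid for step d)
    """
    # Highest species read count within each genus (or family).
    top = {}
    for sp, n in species_reads.items():
        g = group_of[sp]
        top[g] = max(top.get(g, 0), n)
    kept, rejected = {}, {}
    for sp, n in species_reads.items():
        ratio = n / top[group_of[sp]]
        (kept if ratio >= threshold else rejected)[sp] = n
    return kept, rejected


# Hypothetical counts: two species of genus 1301.
kept, rejected = filter_by_relative_ratio(
    {1313: 1000, 28037: 20}, {1313: 1301, 28037: 1301}, threshold=0.05
)
```

Here the species with 20 reads has a ratio of 0.02 to the top species of its genus and is rejected, while the top species has a ratio of 1 and is kept.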

Further, it also includes:

e. Correction 1 of the species-level taxid reads retained after filtering by c and d: calculate the genus relative ratio, and use it to calculate, from the genus-level taxid reads, the genus-corrected reads of each species-level taxid;

The genus relative ratio here is obtained by summing the reads of the species-level taxids of the same genus retained after filtering by c and d, and then calculating the ratio of the reads of each species-level taxid to that sum;

The genus-level taxid reads include the genus-level taxid reads in b and the reads of the species-level taxids associated with that genus that did not pass the filtering threshold (threshold) in c, which are merged into the genus;

f. Correction 2 of the species-level taxid reads retained after filtering by c and d: calculate the family relative ratio, and use it to calculate, from the family-level taxid reads, the family-corrected reads of each species-level taxid;

The family relative ratio here is obtained by summing the reads of the species-level taxids of the same family retained after filtering by c and d, and then calculating the ratio of the reads of each species-level taxid to that sum;

The family-level taxid reads include the family-level taxid reads in b and the reads of the species-level taxids associated with that family that did not pass the filtering threshold (threshold) in d, which are merged into the family.
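Corrections 1 and 2 both share out the reads held at a higher level among the retained species in proportion to their own reads. A minimal Python sketch, with a hypothetical function name and hypothetical counts, could look like this:

```python
def redistribute(group_reads, kept_species_reads):
    """Share reads held at a genus (or family) among its retained species.

    group_reads: reads finally located at the genus (or family) level
    kept_species_reads: retained species taxid -> read count
    Returns species taxid -> corrected reads to add (may be fractional).
    """
    total = sum(kept_species_reads.values())
    if total == 0:
        return {sp: 0.0 for sp in kept_species_reads}
    return {sp: group_reads * n / total for sp, n in kept_species_reads.items()}


# Hypothetical example: 100 reads originally at the genus plus 20 reads from a
# rejected species, shared between two retained species of that genus.
corrected = redistribute(100 + 20, {1313: 900, 28037: 100})
```

The species holding 90% of the retained species reads receives 90% of the 120 genus-level reads, and the other species receives the remaining 10%.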

Further, the species-level detection in step 5) is equivalent to species-level taxid detection: the species-level taxids obtained from c and d are the final species taxids, the sum of the species-level taxid reads obtained from b, e and f gives the final species taxid reads, and the relative abundance is calculated from the sum of reads.

Further, the database used for comparison in step 1) is nt, refseq or genbank database; preferably, the database is nt database.

The present invention also provides the application of the method based on the kraken2 single-sequence kmer score and overall taxonomy structure statistics in next-generation sequencing bioinformatics analysis.

The present invention also provides a computer-readable medium that stores a computer program; when the computer program is executed by a processor, any one of the methods described above is implemented.

The present invention also provides an electronic device comprising a processor and a memory, wherein one or more readable instructions are stored in the memory, and when the one or more readable instructions are executed by the processor, any one of the methods described above is implemented.

Beneficial technical effect of the present invention:

1) The bioinformatics method of the present invention can reduce false positives in bioinformatics analysis, improve the accuracy of species detection, and is suitable for second-generation metagenomic sequencing, including single-end and double-end sequencing.

2) The present invention ensures the overall sensitivity through precise positioning of a single sequence and overall systematic optimization.

3) By introducing taxonomy, the present invention excludes partial comparison results of plasmids/vectors, etc., effectively reducing the situation of meaningless detection.

Description of drawings

In order to illustrate more clearly the specific implementations of the present invention or the technical solutions in the prior art, the accompanying drawings needed for describing the specific implementations or the prior art are briefly introduced below. Obviously, the drawings described below show some implementations of the present invention, and those skilled in the art can obtain other drawings from them without creative work.

Fig. 1 Schematic diagram of the system of the present invention;

Fig. 2 Schematic diagram of taxid positioning and kmer score calculation for a single sequence;

Fig. 3 Comparison of false positive species detected in the DNA libraries of 9 spike-in samples; opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5 + bracken process, kraken represents the kraken2 + bracken process, and S1-S9 represent the 9 samples;

Fig. 4 Comparison of false positive species detected in the RNA libraries of 9 spike-in samples; opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5 + bracken process, kraken represents the kraken2 + bracken process, and S1-S9 represent the 9 samples;

Fig. 5 Sensitivity for 12 simulated samples and 9 spike-in samples; opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5 + bracken process, kraken represents the kraken2 + bracken process, simulated represents the 12 simulated samples, and spike-in represents the 9 spike-in samples.

Detailed description

Embodiments of the present invention will be described in detail below in conjunction with examples, but those skilled in the art will understand that the following examples only illustrate the present invention and should not be considered as limiting its scope; the described examples are some, but not all, embodiments of the invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

Definition of some terms

Unless otherwise defined hereinafter, all technical and scientific terms used in the detailed description of the present invention have the same meanings as commonly understood by those skilled in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present invention.

As used herein, the terms "comprising", "including", "having", "containing" or "involving" are inclusive or open-ended and do not exclude other unrecited elements or method steps. The term "consisting of" is considered a preferred embodiment of the term "comprising". If in the following a certain group is defined as comprising at least a certain number of embodiments, this is also to be understood as disclosing a group which preferably consists only of these embodiments.

The terms "about" and "approximately" in the present invention represent the range of accuracy that can be understood by those skilled in the art and still guarantee the technical effect of the mentioned feature. The term generally means ±10%, preferably ±5%, of the indicated value.

The use of an indefinite or definite article when referring to a noun in the singular, e.g. "a", "an" or "the", includes the plural of that noun.

In addition, the terms first, second, third, (a), (b), (c) and the like in the specification and claims are used to distinguish similar elements and do not necessarily describe a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments described herein are capable of operation in other sequences than described or illustrated herein.

"read" or "each read" or "single read" in the present invention refers to a nucleic acid sequence generated by a high-throughput sequencing platform.

The term "alignment result" in the present invention: "alignment" in English, refers to the corresponding result between a sequencing readout sequence and a reference sequence, and a sequencing readout sequence can have multiple alignment results at the same time.

The "kmer" in the present invention refers to a substring of k bases obtained by sliding along a sequence one base at a time. For example, if the length of a read is L and the length of the k-mer is set to k, the number of k-mers generated is L-k+1; as another example, if k is set to 3, the sequence AACTGACT is divided into the 6 k-mers AAC, ACT, CTG, TGA, GAC and ACT.
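The kmer definition can be checked with a few lines of Python; the function name `kmers` is chosen here for illustration only.

```python
def kmers(seq, k):
    """Return all k-mers of seq, sliding one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


example = kmers("AACTGACT", 3)
```

For the 8-base sequence above and k = 3, the list holds 8 - 3 + 1 = 6 kmers, and ACT appears twice because kmers are not deduplicated.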

The "kraken2" in the present invention refers to a high-precision metagenomic sequence classification software based on the kmer algorithm in the field, which can quickly classify sequencing reads into species.

The "kraken2 optimization algorithm" described in the present invention refers to the optimization system developed by the present invention for microbial species-level detection based on the comparison results of kraken2, which aims to improve accuracy and reduce false positive detection.

The "nt comparison database" of the present invention refers to the kraken2 index database built from the NCBI nt database.

The "taxid" or "taxonomy_id" mentioned in the present invention refers to the id number in the taxonomy database.

The bioinformatics analysis method of the present invention, based on the kraken2 single-sequence kmer score and overall taxonomy structure statistics, generally includes the following steps:

1. Use kraken2 to align the next-generation sequencing data against a comparison database such as nt;

2. Organize the taxonomy database, establish the serotype/subtype-species-genus-family hierarchy (or any one or more of these hierarchical relationships) based on taxid, and assign the corresponding taxonomy level to each taxid in the taxid-kmer comparison result of each sequence from kraken2;

3. Carry out taxid relocation and kmer score calculation for each read;

4. Perform an overall calculation on the comparison results according to the kmer score and the taxonomy hierarchy;

5. Carry out microbial detection according to the calculation results.

Exemplarily, the following steps may be specifically included:

1. Use kraken2 for sequence alignment;

2. Introduce the taxonomy level, and relocate the level and corresponding taxid of each single sequence according to the positioning rules, based on its taxid-kmer comparison result;

The positioning rules may include:

In principle, accept the taxid positioning given by kraken2, except for the following cases:

3. Calculate the kmer score of each sequence according to the calculation rules, using the taxid located in 2 and the above-mentioned taxid-kmer comparison result of that sequence;

Calculation rules can be:

For a sequence finally located at or below the family level, kmer score = (family taxid kmers + genus taxid kmers + species taxid kmers + subtype/serotype taxid kmers) / total kmers;

For a sequence finally located above the family level, the kmer score is set to 0.

4. Mark all sequences mapped to phage/plasmid/vector as unaligned;

5. Set a filter cutoff threshold (represented by score_cutoff in the following), and filter each sequence according to the kmer score;

6. For the sequences retained after filtering in 5, count the reads of each taxid;

7. Set a filtering threshold (threshold); for the taxids located at the species level in 6, calculate the genus relative ratio and exclude the species-level taxids below the threshold;

8. If a species-level taxid retained after filtering by 7 lacks a genus classification, calculate its family relative ratio and exclude the species-level taxids below the filtering threshold (threshold);

9. Correction 1 of the species-level taxid reads retained after filtering by 7 and 8: calculate the genus relative ratio, and use it to calculate, from the genus-level taxid reads, the genus-corrected reads of each species-level taxid;

The genus relative ratio here is obtained by summing the reads of the species-level taxids of the same genus retained after filtering by 7 and 8, and then calculating the ratio of the reads of each species-level taxid to that sum;

The genus-level taxid reads include the genus-level taxid reads in 6 and the reads of the species-level taxids associated with that genus that did not pass the filtering threshold (threshold) in 7;

10. Correction 2 of the species-level taxid reads retained after filtering by 7 and 8: calculate the family relative ratio, and use it to calculate, from the family-level taxid reads, the family-corrected reads of each species-level taxid;

The family relative ratio here is obtained by summing the reads of the species-level taxids of the same family retained after filtering by 7 and 8, and then calculating the ratio of the reads of each species-level taxid to that sum;

The family-level taxid reads include the family-level taxid reads in 6 and the reads of the species-level taxids associated with that family that did not pass the filtering threshold (threshold) in 8.

11. Microbial detection, which is equivalent to species-level taxid detection: the species-level taxids obtained from 7 and 8 are the final species taxids, the sum of the species-level taxid reads obtained from 6, 9 and 10 gives the final species taxid reads, and the relative abundance is calculated from the sum of reads.

The following description is provided only to help the understanding of the present invention. These descriptions should not be read with a scope less than that understood by those skilled in the art.

Embodiment 1 Design optimization of the method system

The problem to be solved in this embodiment is how to ensure, as far as possible, the accuracy of the kraken2 comparison results through data analysis.

1. First, the problem can be divided into two parts: how to reduce the detection of false positive species in order to improve specificity, and how to ensure the detection of real species in order to obtain higher sensitivity. Achieving the best balance between the two gives the best accuracy. Analysis of how kraken2 produces false positive species results, and of the cases in which sensitivity is reduced, identifies the following causes:

a) Database redundancy: whether the database is refseq, genbank or nt, it contains a large amount of redundant reference genomes, which is an important cause of false positive detection; the resulting misalignments also reduce sensitivity;

b) Sequence similarity: typically, the sequence similarity between Escherichia coli and Shigella exceeds 99%. Such similarity is especially common among species of the same genus; it is an important cause of false positive detection and also interferes with the detection of real species, reducing sensitivity;

c) Host sequence (usually human) interference: since metagenomic samples are currently almost always collected from a host, a large number of host sequences are inevitably present, which affects the detection of microorganisms to some extent;

d) mNGS sequences are short, so each sequence yields few kmers and is prone to misalignment; this is most obvious in 75 bp single-end sequencing.

2. Secondly, define which of the above situations can be solved through data analysis of the sample sequencing results alone.

The so-called data analysis of the sample sequencing results alone refers to: the information obtained is only one sample sequence file (FASTQ), and no other information is known; through some algorithms, the species detection results are output.

For "a) database redundancy", on the one hand, the database can be curated to remove redundancy, but excessive simplification of the database causes false negatives; so in addition to database optimization, the effect of redundancy can also be reduced to a certain extent by algorithms, and this algorithmic part is what this patent addresses;

For "b) sequence similarity", on the one hand, it can be addressed through updates to standard genomes and progress in species taxonomy; on the other hand, it can be optimized to a certain extent through algorithms, which is part of what this patent addresses;

For "c) host sequence (usually human-derived sequence) interference", on the one hand, host sequences can be removed by accurately constructing the host genome sequence and using an alignment algorithm, but this does not guarantee that host removal is accurate and thorough (incomplete removal leaves some residue, and excessive removal reduces the sensitivity of microbial detection); on the other hand, the algorithm can precisely analyze the comparison result of each single sequence and set thresholds for species detection as a whole, which achieves a certain degree of optimization; this part is what this patent addresses;

As for "d) mNGS sequences are short", this is a hard technical limitation that can only be improved through progress in sequencing technology.

3. Specific optimization scheme of kraken2 algorithm

For the three optimization points in 2, combined with the comparison principle of kraken2, the following optimization system was established:

3.1 Processing of a single sequence

a) Introduce the taxonomy level, and relocate the level and corresponding taxid of each single sequence according to the positioning rules, based on its taxid-kmer comparison result;

The positioning rules include: accept the taxid positioning given by kraken2 in principle, except in the following cases:

If a sequence is aligned to a unique taxid and that taxid is below the species level, the sequence is located at the species-level taxid to which the taxid belongs (for example, a sequence of length 76 bp with a kmer length of 35 bp yields 42 kmers; if the taxid-kmer comparison result is 0:10, 1313:32, then apart from the 10 kmers that cannot be aligned, the rest are aligned to taxid 1313, and the taxonomy hierarchy locates the sequence at taxid 1313, Streptococcus pneumoniae);

When a sequence is aligned to two or more taxids, there are 3 cases:

All taxids in the comparison result are associated with only one species at the species level, and the other taxids belong to the serotype/subtype, genus or family of that species; the sequence is then located at the taxid of that species (for example, a sequence of length 76 bp with a kmer length of 35 bp yields 42 kmers; if the taxid-kmer comparison result is 0:10, 1313:20, 1301:12, then apart from the 10 kmers that cannot be aligned, 20 kmers are aligned to taxid 1313, which the taxonomy hierarchy identifies as Streptococcus pneumoniae, and the other 12 kmers are aligned to the genus taxid 1301, Streptococcus; since Streptococcus pneumoniae belongs to the genus Streptococcus, the sequence is located at the species taxid 1313, Streptococcus pneumoniae);

All taxids in the comparison result are associated with different species of the same genus; the sequence is then finally located at the genus taxid (for example, a sequence of length 76 bp with a kmer length of 35 bp yields 42 kmers; if the taxid-kmer comparison result is 0:10, 1313:20, 28037:5, 1301:7, then apart from the 10 kmers that cannot be aligned, 20 kmers are aligned to taxid 1313, which the taxonomy hierarchy identifies as Streptococcus pneumoniae, 5 kmers are aligned to taxid 28037, Streptococcus mitis, and 7 kmers are aligned to taxid 1301, Streptococcus; since 2 species of the genus Streptococcus are involved, the sequence is located at taxid 1301, Streptococcus);

All taxids in the comparison result are associated with different genera of the same family; the sequence is then finally located at the family taxid;
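The arithmetic in the examples above (76 - 35 + 1 = 42 kmers, split across taxids) can be verified with a short Python sketch. The function names are hypothetical, and the parser handles only numeric `taxid:count` pairs as written in this document; other tokens that kraken2 may emit are outside its scope.

```python
def parse_taxid_kmer(result):
    """Parse 'taxid:count' pairs separated by commas and/or spaces into a dict."""
    hits = {}
    for pair in result.replace(",", " ").split():
        taxid, count = pair.split(":")
        hits[int(taxid)] = hits.get(int(taxid), 0) + int(count)
    return hits


def expected_kmers(read_length, k):
    """Number of kmers of length k in a read of the given length."""
    return read_length - k + 1


hits = parse_taxid_kmer("0:10, 1313:20, 28037:5, 1301:7")
# The per-taxid counts should add up to the number of kmers in the read.
consistent = sum(hits.values()) == expected_kmers(76, 35)
```

The same check holds for the other two examples, 0:10, 1313:32 and 0:10, 1313:20, 1301:12, each of which also sums to 42.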

b) Calculate the kmer score of each sequence according to the calculation rules, using the taxid located in a) and the taxid-kmer comparison result of that sequence;

The calculation rules are:

3.2 Overall processing

c) Set a score_cutoff and judge whether each result is credible according to the kmer score of the sequence; a sequence whose kmer score does not exceed score_cutoff is excluded as an unreliable result;

d) Sequences that are finally mapped to phage/plasmid/vector are all marked as unaligned;

e) For the sequences remaining after c) and d), count the reads of each taxid;

f) For the taxids located at the species level in e), associate their genera and families, and calculate the genus relative ratio of every species under each genus (the ratio of the number of reads of the species to the number of reads of the species with the most reads in that genus); set a filtering threshold (threshold), and relocate the reads of species whose genus relative ratio is below the threshold to the genus level;

g) Similarly to f), for species that lack genus information but have clear family information, calculate the family relative ratio (the ratio of the reads of each species in the family to the reads of the species with the most reads in that family); if the family relative ratio is below the filtering threshold (threshold) in f), relocate these reads to the family level;

h) For the reads finally located at the genus level (the reads originally aligned to the genus level plus the reads of species under the genus that did not pass the filtering threshold (threshold) in step f)), calculate the sum of the reads of the remaining species of the genus after filtering by f) and the share of each remaining species in that sum, and then assign the genus-level reads to the remaining species in proportion to their shares;

i) For the reads finally located at the family level (the reads originally aligned to the family level plus the reads of species under the family that did not pass the filtering threshold (threshold) in step g)), calculate the sum of the reads of the remaining species of the family after filtering by g) and the share of each remaining species in that sum, and then assign the family-level reads to the remaining species in proportion to their shares;

j) When reporting the final species reads (that is, the species reads after processing by h) and i)) and relative abundance, set a reads filtering threshold reads_cutoff; species below this threshold are not counted or reported.
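The reporting step j) can be sketched in Python as follows. The function name and read counts are hypothetical, and two points are assumptions because the text does not fix them: a species is kept when its reads are at least reads_cutoff, and relative abundance is computed over the reported species only.

```python
def report(species_reads, reads_cutoff):
    """Drop species below reads_cutoff and compute relative abundance of the rest.

    Returns species taxid -> (reads, relative abundance among reported species).
    """
    kept = {sp: n for sp, n in species_reads.items() if n >= reads_cutoff}
    total = sum(kept.values())
    return {sp: (n, n / total) for sp, n in kept.items()}


# Hypothetical final species reads for three taxids.
final = report({1313: 96, 28037: 2, 727: 4}, reads_cutoff=4)
```

In this example the species with 2 reads is not reported, and the remaining two species account for 96% and 4% of the reported reads.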

4. Kraken2's simulated data test and establishment of corresponding thresholds (score_cutoff, threshold, reads_cutoff)

Create a test dataset, test the optimization method, and initially establish the thresholds involved in the method

Selected 4 kinds of bacteria (Haemophilus influenzae, Streptococcus pneumoniae, Staphylococcus aureus, Klebsiella pneumoniae), 2 kinds of fungi (Candida albicans, Aspergillus fumigatus), 4 kinds of viruses (human herpesvirus, human papillary Tumor virus, influenza A virus, HIV virus), according to a certain reads ratio, added to the human host reads (GRC38.p13), the sequence length is set to 75bp, the total data volume is set to 10M, a total of 4 groups, Each group consists of 3 identical samples. The reference genomes used by the simulated samples are shown in the table below:

The specific simulation samples are shown in the following table (there are 3 repeated samples in order, such as samples 1, 2, and 3 are completely consistent samples):

The statistical results of the samples under different score_cutoff thresholds are summarized below. Samples 7-12 receive particular attention because their total microbial count is low and their error rate is higher than the overall level. The statistics show that setting a threshold above this error rate resolves the misalignments, as shown in the following table:

Because score_cutoff reaches the same level of error rate at 0.5 and 0.4, the false positives of each sample were divided into list false positive species (species detected through misalignment that belong to the same family as one of the 10 simulated species) and non-list false positive species (species detected through misalignment that do not belong to the same family as any of the 10 simulated species), and the species with the highest reads were tallied. Since replicate samples gave identical results, only the results of representative samples are listed, as shown in the following table:

Since the optimization method already includes correction within the same genus/family, the final reads_cutoff is set to handle species detected in other families. Comparing the different values of score_cutoff, and accepting a slightly lower alignment rate, score_cutoff is set to 0.5 and reads_cutoff is set to greater than 3 to eliminate the detection of non-list false positive species (the lower the value, the better the sensitivity for true positive species detected by few reads).
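How score_cutoff acts on a single sequence can be illustrated with a minimal Python sketch. It follows the kmer score rule as worded in the claims: for a sequence located at the family level or below, the score is the number of k-mers assigned to the family, genus, species and subtype/serotype taxids divided by the total number of k-mers, and it is 0 for a sequence located above the family level. The function names, the rank labels and the dictionary input are assumptions made for illustration; they are not part of kraken2 or of the method as filed.

```python
# Ranks whose k-mers count toward the score (family level and below).
COUNTED_RANKS = ("family", "genus", "species", "subtype")


def kmer_score(located_rank, kmers_by_rank, total_kmers):
    """Score one sequence from its per-rank k-mer counts.

    located_rank:  rank of the taxid the sequence was finally located at.
    kmers_by_rank: dict mapping a rank label to the number of k-mers of the
                   sequence assigned to a taxid of that rank (hypothetical input).
    total_kmers:   total number of k-mers of the sequence.
    """
    if located_rank not in COUNTED_RANKS or total_kmers <= 0:
        return 0.0  # located above the family level: score is 0
    supporting = sum(kmers_by_rank.get(rank, 0) for rank in COUNTED_RANKS)
    return supporting / total_kmers


def is_credible(score, score_cutoff=0.5):
    """A result is kept only if its score exceeds score_cutoff."""
    return score > score_cutoff


# Toy k-mer counts, made up for illustration only.
read_kmers = {"species": 12, "genus": 10, "family": 2, "order": 6}
score = kmer_score("species", read_kmers, total_kmers=40)
print(score, is_credible(score))
```

In this sketch a score exactly equal to score_cutoff is rejected, matching the wording that a kmer score which "does not exceed" score_cutoff is unreliable.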

In summary, the technical solution of the present invention is finally determined as follows:

1) Start from the alignment results of kraken2;

2) Introduce the taxonomy level, and relocate the level and corresponding taxid of each single sequence according to the taxid-kmer alignment rules;

The relocation rules include:

If a sequence is aligned to a unique taxid and the taxid is below the species level, the sequence is located at the taxid of the species to which that taxid belongs;

A sequence aligned to two or more taxids falls into one of 3 cases:

If all taxids in the alignment result are associated with only one species at the species level, and the other taxids belong to the serotype/subtype, genus or family level of that species, the sequence is located at the taxid of this species;

If the taxids in the alignment result are associated with different species of the same genus, the sequence is finally located at the genus taxid;

If the taxids in the alignment result are associated with different genera of the same family, the sequence is finally located at the family taxid;

3) Calculate the kmer score of each sequence from the taxid located in 2) and the taxid-kmer alignment result of that sequence, according to the calculation rules;

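The calculation rules are: for a sequence finally located at a taxid at the family level or below, kmer score=(family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers; for a sequence finally located at a taxid above the family level, the kmer score is 0;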

4) Set score_cutoff to 0.5 and judge whether each result is credible from the kmer score of the sequence. If the kmer score does not exceed score_cutoff, the result is regarded as unreliable and excluded;

5) Sequences that are finally located to phage/plasmid/vector are all marked as unaligned;

6) For the sequences remaining after the filtering in 4) and 5), count the reads of each taxid;

7) For the taxids located at the species level in 6), look up the associated genus and family, and calculate the genus relative ratio of every species under each genus (the ratio of the reads of each species in the genus to the reads of the species with the highest reads in that genus). Set a filtering threshold, threshold, to 0.1, and relocate to the genus level the reads of the species whose genus relative ratio is lower than threshold;

8) Similarly to 7), for species that lack genus information but have clear family information, calculate the family relative ratio (the ratio of the reads of each species in the family to the reads of the species with the highest reads in that family), and relocate to the family level the reads of the species whose family relative ratio is lower than the filtering threshold threshold in 7);

9) For the reads finally located at the genus level (the reads of the genus-level taxids in 6) plus the reads of the species of that genus that did not pass the filtering threshold threshold in 7)), calculate the sum of the reads of the species of the genus remaining after the filtering in 7), calculate the ratio of each remaining species to that sum, and then assign the genus-level reads to the remaining species according to these ratios;

10) For the reads finally located at the family level (the reads of the family-level taxids in 6) plus the reads of the species of that family that did not pass the filtering threshold threshold in 8)), calculate the sum of the reads of the species of the family remaining after the filtering in 8), calculate the ratio of each remaining species to that sum, and then assign the family-level reads to the remaining species according to these ratios;

11) When reporting the final species reads and relative abundance, set a reads filtering threshold reads_cutoff to 3; species whose reads are lower than reads_cutoff are not counted or reported. Finally, the species-level reads and relative abundance are reported.
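The genus-level correction in steps 7) and 9) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the method: the function name `genus_correction`, the three dictionary inputs and the toy read counts are hypothetical, and corrected reads are left as fractional values with rounding left to the caller. The family-level steps 8) and 10) would follow the same pattern with family membership in place of genus membership.

```python
from collections import defaultdict


def genus_correction(species_reads, genus_of, genus_reads, threshold=0.1):
    """Return corrected species reads after the genus-level steps 7) and 9).

    species_reads: dict species -> reads located at the species level.
    genus_of:      dict species -> genus (hypothetical lookup table).
    genus_reads:   dict genus -> reads located directly at the genus level.
    """
    # Group the species read counts by genus.
    members_by_genus = defaultdict(dict)
    for species, reads in species_reads.items():
        members_by_genus[genus_of[species]][species] = reads

    corrected = {}
    for genus, members in members_by_genus.items():
        top = max(members.values())
        if top <= 0:
            continue  # no reads in this genus, nothing to correct
        # Step 7): keep species whose genus relative ratio reaches the threshold.
        kept = {s: n for s, n in members.items() if n / top >= threshold}
        relocated = sum(n for s, n in members.items() if s not in kept)
        # Genus-level pool: reads located at the genus plus relocated species reads.
        pool = genus_reads.get(genus, 0) + relocated
        # Step 9): split the pool among kept species in proportion to their reads.
        kept_total = sum(kept.values())
        for s, n in kept.items():
            corrected[s] = n + pool * n / kept_total
    return corrected


# Toy counts, made up for illustration only.
species_reads = {
    "Streptococcus pneumoniae": 1000,
    "Streptococcus mitis": 20,
    "Klebsiella pneumoniae": 300,
    "Klebsiella oxytoca": 100,
}
genus_of = {
    "Streptococcus pneumoniae": "Streptococcus",
    "Streptococcus mitis": "Streptococcus",
    "Klebsiella pneumoniae": "Klebsiella",
    "Klebsiella oxytoca": "Klebsiella",
}
genus_reads = {"Streptococcus": 80, "Klebsiella": 40}
corrected = genus_correction(species_reads, genus_of, genus_reads)
print(corrected)
```

In the toy input, Streptococcus mitis has a genus relative ratio of 0.02 and is relocated to the genus level, so its reads and the genus-level reads are all assigned to Streptococcus pneumoniae; both Klebsiella species pass the threshold and share the genus-level reads 3:1. The total number of reads is preserved by the correction.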

Embodiment 2 Comparison of effects with the traditional kraken2 method

1. The false positive and false negative detections in the simulated sample results finally reported by the optimized method are summarized as follows:

Statistical indicators:

Sensitivity is 117/120 (total number of non-human species)=97.5%;

Three false positive species were detected.

2. The false positive and false negative detections in the results of the kraken2 confidence 0.5+bracken process are summarized in the following table:

sample	taxid	species	reads	relative abundance	result
sample 1	340412	Aspergillus novofumigatus	1	0.00011	false positive
sample 1	984962	Heterobasidion irregulare	1	0.00011	false positive
sample 1	145522	Nannochloropsis oceanica	4	0.00044	false positive
sample 1	28037	Streptococcus mitis	2	0.00022	false positive
sample 1	2656787	Venustampulla echinocandica	1	0.00011	false positive
sample 10	86049	Cladophialophora carrionii	1	0.00011	false positive
sample 10	10376	Human gammaherpesvirus 4	1	0.00011	false positive
sample 10	1873960	Pseudocercospora fijiensis	2	0.00022	false positive
sample 10	2656787	Venustampulla echinocandica	1	0.00011	false positive
sample 10	727	Haemophilus influenzae	0	0	false negative
sample 10	573	Klebsiella pneumoniae	0	0	false negative
sample 11	727	Haemophilus influenzae	0	0	false negative
sample 11	573	Klebsiella pneumoniae	0	0	false negative
sample 12	727	Haemophilus influenzae	0	0	false negative
sample 12	573	Klebsiella pneumoniae	0	0	false negative
sample 4	145522	Nannochloropsis oceanica	2	0.00022	false positive
sample 4	99802	Spirometra erinaceieuropaei	2	0.00022	false positive
sample 7	1220188	Aspergillus tanneri	1	0.00011	false positive
sample 7	45219	Guanarito mammarenavirus	1	0.00011	false positive
sample 7	145522	Nannochloropsis oceanica	3	0.00033	false positive

Statistical indicators:

The sensitivity is 114/120=95%;

43 cases of false positive species were detected (3 cases with reads greater than 3).

3. In the detection results of kraken2+bracken without a confidence threshold, there are 5517 cases of false positive species (1188 cases with reads greater than 3), and the false negative species is Klebsiella pneumoniae, which was missed in samples 10-12, 3 cases in total.

Conclusion: Without a confidence threshold, kraken2 has the same sensitivity as the method of this patent but detects too many false positive species. With the confidence threshold set, the detection of false positive species is greatly reduced, but the sensitivity also falls. Overall, the method of this patent reduces the detection of false positive species while preserving sensitivity, and performs better.

Embodiment 3 Actual sample detection experiment

Nine spike-in samples were used to build DNA libraries and RNA libraries, which were then sequenced. The specific samples and positive species are shown in the table below:

sample	library type	original number	species	taxid
S1	DNA	sample 1	Alphapapillomavirus 7	337042
S1	DNA	sample 1	Cryptococcus gattii VGII	1859096
S1	DNA	sample 1	Cupriavidus gilardii	82541
S1	DNA	sample 1	Escherichia coli	562
S1	DNA	sample 1	Human alphaherpesvirus 1	10298
S1	DNA	sample 1	Human betaherpesvirus 5	10359
S1	DNA	sample 1	Human betaherpesvirus 6B	32604
S1	DNA	sample 1	Mycoplasma hyorhinis	2100
S1	DNA	sample 1	Streptococcus pneumoniae	1313
S1	RNA	sample 1	Alphapapillomavirus 7	337042
S1	RNA	sample 1	Cryptococcus gattii VGII	1859096
S1	RNA	sample 1	Cupriavidus gilardii	82541
S1	RNA	sample 1	Escherichia coli	562
S1	RNA	sample 1	Human alphaherpesvirus 1	10298
S1	RNA	sample 1	Human betaherpesvirus 5	10359
S1	RNA	sample 1	Human betaherpesvirus 6B	32604
S1	RNA	sample 1	Mycoplasma hyorhinis	2100
S1	RNA	sample 1	Streptococcus pneumoniae	1313
S2	DNA	sample 2	Alphapapillomavirus 7	337042
S2	DNA	sample 2	Cryptococcus gattii VGII	1859096
S2	DNA	sample 2	Cupriavidus gilardii	82541
S2	DNA	sample 2	Escherichia coli	562
S2	DNA	sample 2	Human alphaherpesvirus 1	10298
S2	DNA	sample 2	Human betaherpesvirus 5	10359
S2	DNA	sample 2	Human betaherpesvirus 6B	32604
S2	DNA	sample 2	Mycoplasma hyorhinis	2100
S2	DNA	sample 2	Streptococcus pneumoniae	1313
S2	RNA	sample 2	Alphapapillomavirus 7	337042
S2	RNA	sample 2	Cryptococcus gattii VGII	1859096
S2	RNA	sample 2	Cupriavidus gilardii	82541
S2	RNA	sample 2	Escherichia coli	562
S2	RNA	sample 2	Human alphaherpesvirus 1	10298
S2	RNA	sample 2	Human betaherpesvirus 5	10359
S2	RNA	sample 2	Human betaherpesvirus 6B	32604
S2	RNA	sample 2	Mycoplasma hyorhinis	2100
S2	RNA	sample 2	Streptococcus pneumoniae	1313
S3	DNA	sample 3	Alphapapillomavirus 7	337042
S3	DNA	sample 3	Cupriavidus gilardii	82541
S3	DNA	sample 3	Haemophilus influenzae	727
S3	DNA	sample 3	Human alphaherpesvirus 2	10310
S3	DNA	sample 3	Listeria monocytogenes	1639
S3	DNA	sample 3	Mycoplasma hyorhinis	2100
S3	DNA	sample 3	Neisseria meningitidis	487
S3	DNA	sample 3	Streptococcus agalactiae	1311
S3	RNA	sample 3	Alphapapillomavirus 7	337042
S3	RNA	sample 3	Cupriavidus gilardii	82541
S3	RNA	sample 3	Haemophilus influenzae	727
S3	RNA	sample 3	Human alphaherpesvirus 2	10310
S3	RNA	sample 3	Listeria monocytogenes	1639
S3	RNA	sample 3	Mycoplasma hyorhinis	2100
S3	RNA	sample 3	Neisseria meningitidis	487
S3	RNA	sample 3	Parechovirus A	1803956
S4	DNA	sample 4	Alphapapillomavirus 7	337042
S4	DNA	sample 4	Cupriavidus gilardii	82541
S4	DNA	sample 4	Haemophilus influenzae	727
S4	DNA	sample 4	Human alphaherpesvirus 2	10310
S4	DNA	sample 4	Listeria monocytogenes	1639
S4	DNA	sample 4	Mycoplasma hyorhinis	2100
S4	DNA	sample 4	Neisseria meningitidis	487
S4	DNA	sample 4	Streptococcus agalactiae	1311
S4	RNA	sample 4	Alphapapillomavirus 7	337042
S4	RNA	sample 4	Cupriavidus gilardii	82541
S4	RNA	sample 4	Haemophilus influenzae	727
S4	RNA	sample 4	Human alphaherpesvirus 2	10310
S4	RNA	sample 4	Listeria monocytogenes	1639
S4	RNA	sample 4	Mycoplasma hyorhinis	2100
S4	RNA	sample 4	Neisseria meningitidis	487
S4	RNA	sample 4	Parechovirus A	1803956
S5	DNA	sample 5	Alphapapillomavirus 7	337042
S5	DNA	sample 5	Candida albicans	5476
S5	DNA	sample 5	Cupriavidus gilardii	82541
S5	DNA	sample 5	Enterococcus faecalis	1351
S5	DNA	sample 5	Human mastadenovirus C	129951
S5	DNA	sample 5	Klebsiella oxytoca	571
S5	DNA	sample 5	Mycoplasma hyorhinis	2100
S5	DNA	sample 5	Streptococcus pyogenes	1314
S5	RNA	sample 5	Alphapapillomavirus 7	337042
S5	RNA	sample 5	Candida albicans	5476
S5	RNA	sample 5	Cupriavidus gilardii	82541
S5	RNA	sample 5	Enterococcus faecalis	1351
S5	RNA	sample 5	Human mastadenovirus C	129951
S5	RNA	sample 5	Human orthopneumovirus	11250
S5	RNA	sample 5	Human respirovirus 1	12730
S5	RNA	sample 5	Klebsiella oxytoca	571
S5	RNA	sample 5	Mycoplasma hyorhinis	2100
S5	RNA	sample 5	Streptococcus pyogenes	1314
S6	DNA	sample 6	Alphapapillomavirus 7	337042
S6	DNA	sample 6	Candida albicans	5476
S6	DNA	sample 6	Cupriavidus gilardii	82541
S6	DNA	sample 6	Enterococcus faecalis	1351
S6	DNA	sample 6	Human mastadenovirus C	129951
S6	DNA	sample 6	Klebsiella oxytoca	571
S6	DNA	sample 6	Mycoplasma hyorhinis	2100
S6	DNA	sample 6	Streptococcus pyogenes	1314
S6	RNA	sample 6	Alphapapillomavirus 7	337042
S6	RNA	sample 6	Candida albicans	5476
S6	RNA	sample 6	Cupriavidus gilardii	82541
S6	RNA	sample 6	Enterococcus faecalis	1351
S6	RNA	sample 6	Human mastadenovirus C	129951
S6	RNA	sample 6	Human orthopneumovirus	11250
S6	RNA	sample 6	Human respirovirus 1	12730
S6	RNA	sample 6	Klebsiella oxytoca	571
S6	RNA	sample 6	Mycoplasma hyorhinis	2100
S6	RNA	sample 6	Streptococcus pyogenes	1314
S7	DNA	sample 7	Aeromonas hydrophila	644
S7	DNA	sample 7	Alphapapillomavirus 7	337042
S7	DNA	sample 7	Clavispora lusitaniae	36911
S7	DNA	sample 7	Cupriavidus gilardii	82541
S7	DNA	sample 7	Human mastadenovirus E	130308
S7	DNA	sample 7	Legionella pneumophila	446
S7	DNA	sample 7	Mycoplasma hyorhinis	2100
S7	DNA	sample 7	Neisseria sicca	490
S7	DNA	sample 7	Pseudomonas fluorescens	294
S7	RNA	sample 7	Aeromonas hydrophila	644
S7	RNA	sample 7	Alphapapillomavirus 7	337042
S7	RNA	sample 7	Cupriavidus gilardii	82541
S7	RNA	sample 7	Human mastadenovirus E	130308
S7	RNA	sample 7	Influenza A virus	11320
S7	RNA	sample 7	Influenza B virus	11520
S7	RNA	sample 7	Legionella pneumophila	446
S7	RNA	sample 7	Mycoplasma hyorhinis	2100
S7	RNA	sample 7	Neisseria sicca	490
S7	RNA	sample 7	Pseudomonas fluorescens	294
S8	DNA	sample 8	Aeromonas hydrophila	644
S8	DNA	sample 8	Alphapapillomavirus 7	337042
S8	DNA	sample 8	Clavispora lusitaniae	36911
S8	DNA	sample 8	Cupriavidus gilardii	82541
S8	DNA	sample 8	Human mastadenovirus E	130308
S8	DNA	sample 8	Legionella pneumophila	446
S8	DNA	sample 8	Mycoplasma hyorhinis	2100
S8	DNA	sample 8	Neisseria sicca	490
S8	DNA	sample 8	Pseudomonas fluorescens	294
S8	RNA	sample 8	Aeromonas hydrophila	644
S8	RNA	sample 8	Alphapapillomavirus 7	337042
S8	RNA	sample 8	Cupriavidus gilardii	82541
S8	RNA	sample 8	Human mastadenovirus E	130308
S8	RNA	sample 8	Influenza A virus	11320
S8	RNA	sample 8	Influenza B virus	11520
S8	RNA	sample 8	Legionella pneumophila	446
S8	RNA	sample 8	Mycoplasma hyorhinis	2100
S8	RNA	sample 8	Neisseria sicca	490
S8	RNA	sample 8	Pseudomonas fluorescens	294
S9	DNA	sample 9	Alphapapillomavirus 7	337042
S9	DNA	sample 9	Cupriavidus gilardii	82541
S9	DNA	sample 9	Mycoplasma hyorhinis	2100
S9	RNA	sample 9	Alphapapillomavirus 7	337042
S9	RNA	sample 9	Cupriavidus gilardii	82541
S9	RNA	sample 9	Mycoplasma hyorhinis	2100

The positive species missed by the process of the present invention, by the kraken2 confidence 0.5+bracken process and by the kraken2+bracken process are summarized in the table below (reads_opt and abundance_opt give the positive species detection of the process of the present invention; reads_confidence and abundance_confidence give that of the kraken2 confidence 0.5+bracken process; reads_kraken and abundance_kraken give that of the kraken2+bracken process):

The total number of positive species is 148. The kraken2 confidence 0.5+bracken process missed 2 species (sensitivity 98.6%), while the process of the present invention and kraken2+bracken each missed 1 species (sensitivity 99.3%). The performance is similar to that on the simulated data: the sensitivity of the process of the present invention equals that of kraken2+bracken and is slightly higher than that of the kraken2 confidence 0.5+bracken process.
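The sensitivity figures quoted for the simulated data and the spike-in samples can be re-derived from the counts given in the text. The short Python sketch below does only that arithmetic; the function name `sensitivity` and the dictionary layout are illustrative, and the labels opt, confidence and kraken stand for the process of the present invention, kraken2 confidence 0.5+bracken and kraken2+bracken respectively.

```python
def sensitivity(total_positives, missed):
    """Fraction of expected positive species that were detected."""
    return (total_positives - missed) / total_positives


# (total positive species, missed species) per process, taken from the text.
simulated = {"opt": (120, 3), "confidence": (120, 6), "kraken": (120, 3)}
spike_in = {"opt": (148, 1), "confidence": (148, 2), "kraken": (148, 1)}

for name, dataset in (("simulated", simulated), ("spike-in", spike_in)):
    for process, (total, missed) in dataset.items():
        print(f"{name:10s} {process:11s} {sensitivity(total, missed):.1%}")
```

Run as written, this reproduces the percentages stated for both datasets, which confirms that the three processes keep the same ranking on simulated and spike-in samples.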

The false positive species detected by each process in the DNA libraries are summarized as follows (corresponding to the results in Figure 3, where opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5+bracken process and corresponds to the fourth column in the table, and kraken represents the kraken2+bracken process):

The false positive species detected by each process in the RNA libraries are summarized as follows (corresponding to the results in Figure 4, where opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5+bracken process and corresponds to the fourth column in the table, and kraken represents the kraken2+bracken process):

The sensitivity results for the simulated samples and the spike-in samples are summarized as follows (corresponding to the results in Figure 5, where opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5+bracken process, and kraken represents the kraken2+bracken process):

The above statistics show that the process of the present invention detects far fewer false positive species than the kraken2+bracken process and, while keeping its sensitivity higher than that of kraken2 confidence 0.5+bracken, also detects fewer false positives than the latter (even when only species with reads>3 are reported, the detection of false positive species is still reduced by about 1/3).

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

A bioinformatics analysis method, characterized in that it comprises the following steps:

1) Sequence alignment: use kraken2 to align the NGS sequencing data, and obtain the taxid-kmer result of each sequence;

2) Establish the taxid hierarchical relationship based on the taxonomy database: associate the taxonomy level with the taxid-kmer result of step 1), and relocate the taxid according to the relocation rules;

3) Calculate the kmer score of each sequence: calculate the kmer score of each sequence according to the taxid relocated in step 2) and the taxid-kmer result of step 1);

4) Overall calculation of the alignment results: perform the overall calculation based on the kmer score and the taxonomy level.
The bioinformatics analysis method of claim 1, wherein the method further comprises:

5) Species taxid detection: perform species taxid detection according to the overall calculation result in 4).
The bioinformatics analysis method according to any one of claims 1-2, wherein the hierarchical relationship in step 2) includes one or more of serotype/subtype, species, genus, and family.
The bioinformatics analysis method according to any one of claims 1-3, characterized in that the kmer score calculation rule in step 3) is as follows:

For a sequence finally located at a taxid at the family level or below, kmer score=(family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;

For a sequence finally located at a taxid above the family level, the kmer score is 0.
The bioinformatics analysis method according to any one of claims 1-4, wherein the relocation rules in step 2) include:

The taxid location given by kraken2 is normally accepted, and relocation is performed in the following cases:

If a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is below the species level, the sequence is located at the species-level taxid to which that taxid belongs;

When a sequence obtains two or more taxids according to the taxid-kmer result, there are 3 cases:

If, among all taxids, only one species appears at the species level and the other taxids belong to the serotype/subtype, genus or family level of that species, the sequence is located at the species-level taxid;

If the taxids are associated with two or more species and belong to the same genus, the sequence is finally located at the genus-level taxid;

If the taxids are associated with two or more genera and belong to the same family, the sequence is finally located at the family-level taxid.
The bioinformatics analysis method according to any one of claims 1-5, wherein the overall calculation in step 4) includes:

a. Set a filtering threshold cutoff, and filter each sequence according to its kmer score;

b. For the sequences that pass the filtering in a, count the reads of each taxid;

The reads of a taxid is the total number of sequences located at that taxid in a sample;

c. Set a filtering threshold threshold; for the taxids located at the species level in b, calculate the genus relative ratio and exclude the species-level taxids whose ratio is lower than threshold;

The genus relative ratio is the ratio of the reads of a species-level taxid to the reads of the species-level taxid with the highest reads in the same genus;

Preferably, it also includes:

d. If a species-level taxid filtered in c lacks a genus classification, calculate the family relative ratio and exclude the species-level taxids whose ratio is lower than the filtering threshold threshold;

The family relative ratio is the ratio of the reads of a species-level taxid to the reads of the species-level taxid with the highest reads in the same family.
The bioinformatics analysis method according to any one of claims 1-6, wherein the overall calculation in step 4) further comprises:

e. Correction 1 of the reads of the species-level taxids retained after the filtering in c and d: calculate the genus relative ratio, and use it with the genus-level taxid reads to calculate the genus-corrected reads of each species-level taxid;

The genus relative ratio is obtained by summing the reads of the species-level taxids of the same genus retained after the filtering in c and d, and then calculating the ratio of the reads of each species-level taxid to that sum;

The genus-level taxid reads include the genus-level taxid reads in b and the reads of the species-level taxids associated with that genus that did not pass the filtering threshold threshold in c;

f. Correction 2 of the reads of the species-level taxids retained after the filtering in c and d: calculate the family relative ratio, and use it with the family-level taxid reads to calculate the family-corrected reads of each species-level taxid;

The family relative ratio is obtained by summing the reads of the species-level taxids of the same family retained after the filtering in c and d, and then calculating the ratio of the reads of each species-level taxid to that sum;

The family-level taxid reads include the family-level taxid reads in b and the reads of the species-level taxids associated with that family that did not pass the filtering threshold threshold in d.
The bioinformatics analysis method according to any one of claims 1-7, characterized in that the database used for comparison in the step 1) is an nt, refseq or genbank database; preferably, the database is an nt database.
A computer-readable medium, which stores a computer program, and when the computer program is executed by a processor, implements the method according to any one of claims 1-8.
An electronic device, characterized in that it includes a processor and a memory, the memory stores one or more readable instructions, and when the one or more readable instructions are executed by the processor, the method according to any one of claims 1-8 is implemented.