WO2023283967A1 - Optimized kraken2 algorithm and application thereof in second-generation sequencing - Google Patents

Optimized kraken2 algorithm and application thereof in second-generation sequencing Download PDF

Info

Publication number
WO2023283967A1
WO2023283967A1 PCT/CN2021/106970 CN2021106970W WO2023283967A1 WO 2023283967 A1 WO2023283967 A1 WO 2023283967A1 CN 2021106970 W CN2021106970 W CN 2021106970W WO 2023283967 A1 WO2023283967 A1 WO 2023283967A1
Authority
WO
WIPO (PCT)
Prior art keywords
taxid
level
species
reads
family
Prior art date
Application number
PCT/CN2021/106970
Other languages
French (fr)
Chinese (zh)
Inventor
张岩
李振中
任用
李诗濛
郭昊
梁相志
陈莉
戴岩
李珊
顾菊
Original Assignee
江苏先声医学诊断有限公司
江苏先声医疗器械有限公司
南京先声诊断技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 江苏先声医学诊断有限公司, 江苏先声医疗器械有限公司, 南京先声诊断技术有限公司 filed Critical 江苏先声医学诊断有限公司
Publication of WO2023283967A1 publication Critical patent/WO2023283967A1/en

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A50/00TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
    • Y02A50/30Against vector-borne diseases, e.g. mosquito-borne, fly-borne, tick-borne or waterborne diseases whose impact is exacerbated by climate change

Definitions

  • the invention relates to the field of bioinformatics, in particular to an optimized kraken2 algorithm and its application in next-generation sequencing.
  • next-generation sequencing technology is a massively parallel sequencing technology, which has the characteristics of high throughput, high sequencing accuracy, and short timeliness, which perfectly matches the metagenome.
  • the demand for metagenomics has led to the widespread application of metagenomics in infection detection.
  • Species detection of microbial communities after sequencing is the most important work in metagenomics research. Only by accurately and reliably locating microbial communities can we associate metagenomics with research, such as studying whether a patient's disease is caused by a certain microbial infection (If a person suspects malaria, it is necessary to accurately detect the presence of Plasmodium in the blood to give a definitive diagnosis), metagenomic analysis is a fast, accurate and advanced detection technology, currently used in the auxiliary diagnosis of infectious diseases played an important role in.
  • Kraken2 is applied to Illumina second-generation metagenomic sequencing, which has the characteristics of fast analysis speed and high sensitivity, but the specificity is low, and many false positive results are often detected, which is due to the characteristics of the kraken2 algorithm.
  • the default read length is 35bp
  • kraken2 will give priority to constructing the specific kmers of this species Specific kmers, and there is a certain kmer in multiple species of Streptococcus genus Streptococcus, then locate the kmer under the genus Streptococcus, the same principle, if a certain kmer exists in multiple genera under Streptococcus family Streptococcus, then the kmer Positioned under Streptococcus family.
  • the database contains a lot of sequences, such as plasmids/vectors, etc.
  • the results of this part of the comparison will also be output, and this part of the results is basically meaningless (it can also be counted as false positive detection).
  • the purpose of the present invention is to seek a bioinformatics analysis method that can reduce false positives in sequencing analysis, improve the accuracy of species detection, and is suitable for Illumina second-generation metagenomic sequencing.
  • the present invention proposes the following technical solutions:
  • the present invention firstly provides a bioinformatics analysis method based on kraken2 single sequence kmer score and overall taxonomy structure statistics, said method comprising the following steps:
  • NGS sequencing data is compared using kraken2 to obtain the taxid-kmer result of each sequence
  • the hierarchical relationship in step 2) includes one or more hierarchical relationships of serotype/subtype, species, genus and/or family.
  • the positioning rules in the step 2) include the following:
  • a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is lower than the species level, it is positioned as the species level taxid to which the taxid belongs;
  • calculation rules in the step 3) include:
  • kmer score (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
  • the overall calculation includes:
  • the reads of the taxid is the total number of taxid sequences that appear in a sample
  • the genus relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same genus reads
  • the family relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same family reads
  • d filter and retain the species-level taxid reads correction 1, calculate the genus relative ratio, and calculate the species-level taxid genus correction reads with the genus-level taxid reads according to the genus relative ratio;
  • the genus relative ratio is the sum of the species-level taxid reads of the same genus after filtering by c and d, and then calculates the ratio of taxid reads at various levels to the sum;
  • the genus-level taxid reads include the genus-level taxid reads in b and the reads incorporated into the genus-associated species-level taxid that have not passed the filtering threshold threshold in c;
  • the relative ratio of the family is the sum of the species-level taxid reads of the same family after filtering by c and d, and then calculate the ratio of the taxid reads at various levels to the sum;
  • the family-level taxid reads include the family-level taxid reads in b and the reads incorporated into the family-related species-level taxid in d that do not pass the filtering threshold threshold.
  • said step 5 species-level detection, which is equivalent to the species-level taxid detection, according to c, d obtains the species-level taxid which is the final species taxid, gets b, e, and the sum of the species-level taxid reads obtained by f obtains The final species taxid reads, and calculate the relative abundance based on the sum of reads.
  • the database used for comparison in step 1) is nt, refseq or genbank database; preferably, the database is nt database.
  • the present invention also provides the application of the method based on kraken2 single sequence kmer score and overall taxonomy structure statistics in next-generation sequencing bioinformatics analysis.
  • the present invention also provides a computer-readable medium, which stores a computer program, and when the computer program is executed by a processor, the method described in any one of the above claims is realized.
  • the present invention also provides an electronic device, which is characterized in that it includes a processor and a memory, and one or more readable instructions are stored in the memory, and when the one or more readable instructions are executed by the processor, the Claim any one of the above methods.
  • the bioinformatics method of the present invention can reduce false positives in bioinformatics analysis, improve the accuracy of species detection, and is suitable for second-generation metagenomic sequencing, including single-end and double-end sequencing.
  • the present invention ensures the overall sensitivity through precise positioning of a single sequence and overall systematic optimization.
  • the present invention excludes partial comparison results of plasmids/vectors, etc., effectively reducing the situation of meaningless detection.
  • FIG. 1 schematic diagram of the system of the present invention
  • Figure 2 Schematic diagram of taxid positioning and kmer score scoring for a single sequence
  • Fig. 3 Comparison chart of false positive species detected in 9 cases of spike-in sample DNA library, opt represents the process of the present invention, confidence represents the process of kraken2 confidence 0.5+bracken, kraken represents the process of kraken2+bracken, S1-S9 represents 9 samples;
  • Fig. 4 Comparison diagram of false positive species detected in 9 cases of spike-in sample RNA library, opt represents the process of the present invention, confidence represents the process of kraken2 confidence 0.5+bracken, kraken represents the process of kraken2+bracken, and S1-S9 represents 9 samples;
  • Fig. 5 Sensitivity chart of 12 simulated samples and 9 spike-in samples, opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5+bracken process, kraken represents the kraken2+bracken process, simulated represents 12 simulated samples, spike-in is 9 spike-in samples.
  • the terms “comprising”, “comprising”, “having”, “containing” or “involving” are inclusive or open-ended and do not exclude other unrecited elements or method steps .
  • the term “consisting of” is considered as a preferred embodiment of the term “comprising”. If in the following a certain group is defined as comprising at least a certain number of embodiments, this is also to be understood as revealing a group which preferably consists only of these embodiments.
  • read or “each read” or “single read” in the present invention refers to a nucleic acid sequence generated by a high-throughput sequencing platform.
  • alignment result in the present invention: “alignment” in English, refers to the corresponding result between a sequencing readout sequence and a reference sequence, and a sequencing readout sequence can have multiple alignment results at the same time.
  • the "kmer” in the present invention refers to continuously cutting a sequence and scratching bases one by one to obtain a substring of k bases.
  • the length of reads is L
  • the length of k-mer is set to k
  • the number of mers is: L-k+1; another example is the sequence AACTGACT, if k is set to 3, it can be divided into 6 k-mers of AAC, ACT, CTG, TGA, GAC, and ACT.
  • the "kraken2" in the present invention refers to a high-precision metagenomic sequence classification software based on the kmer algorithm in the field, which can quickly classify sequencing reads into species.
  • the "kraken2 optimization algorithm" described in the present invention refers to an optimization system for microbial species-level detection based on the comparison results of kraken2 developed by the present invention, aiming at improving accuracy and reducing false positive detection.
  • nt comparison database of the present invention is the kraken2 comparison index database based on NCBI nt database establishment.
  • the "taxid” or “taxonomy_id” mentioned in the present invention refers to the id number in the taxonomy database.
  • the single sequence kmer score of kraken2 and the overall bioinformatics analysis method based on taxonomy structure statistics of the present invention generally include the following steps:
  • the positioning rules may include:
  • a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is lower than the species level, it is positioned as the species level taxid to which the taxid belongs;
  • Calculation rules can be:
  • kmer score (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
  • the genus relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same genus reads
  • the family relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same family reads
  • the species-level taxid reads retained by filtering in 7 and 8 are corrected 1, and the relative ratio is calculated, and the taxid reads at the species level are calculated according to the relative ratio of the species-level taxid and genus correction reads;
  • the genus relative ratio is the sum of the species-level taxid reads of the same genus after filtering by 7 and 8, and then calculates the ratio of taxid reads at various levels to the sum;
  • the genus-level taxid reads include 6 genus-level taxid reads and 7 reads that do not pass through the genus-related species level taxid of the genus-level taxid;
  • the species-level taxid reads filtered by 7 and 8 are corrected 2, and the family-level relative ratio is calculated, and the family-level taxid reads are calculated according to the family-level relative ratio to calculate the species-level taxid family-corrected reads;
  • the relative ratio of the family is the sum of the species-level taxid reads of the same family after filtering by 7 and 8, and then calculate the ratio of the taxid reads at various levels to the sum;
  • the family-level taxid reads include 6 family-level taxid reads and 8 family-related taxid-incorporated reads that do not pass the filtering threshold threshold.
  • Microbial detection which is equivalent to the species-level taxid detection
  • the species-level taxid obtained according to 7 and 8 is the final species taxid
  • the sum of the species-level taxid reads obtained by 6, 9, and 10 is used to obtain the final species taxid reads, and Relative abundance is calculated based on the sum of reads.
  • the problem to be solved in this embodiment is how to ensure the accuracy of the kraken2 comparison results as much as possible through the data analysis method.
  • Sequence similarity typically, the sequence degree of Escherichia coli and Shigella exceeds 99%, which is especially common among species under the same genus, which is an important reason for false positive detection and also interferes with real species detected, reducing the sensitivity;
  • the so-called data analysis of the sample sequencing results alone refers to: the information obtained is only one sample sequence file (FASTQ), and no other information is known; through some algorithms, the species detection results are output.
  • sequence similarity on the one hand, it can be solved through standard genome updates and the progress of species taxonomy, and on the other hand, it can be optimized to a certain extent through algorithms, which is part of the consideration of this patent;
  • the host genome can be removed by accurately constructing the host genome sequence and using an alignment algorithm, but it does not guarantee whether the host genome removal is accurate and thorough (removal is not possible) There will still be some residues completely, and excessive removal will reduce the sensitivity of microbial detection), in addition, it is possible to accurately analyze the results of single sequence comparisons in the algorithm, and set the threshold for species detection as a whole, Get a certain degree of optimization, this part is the consideration point of this patent;
  • the positioning rules include: accept the taxid positioning given by kraken2 in principle, except for the following cases:
  • a sequence is compared to a unique taxid and the taxid is at a lower species level, it will be positioned as the species-level taxid to which the taxid belongs (for example, if a sequence is 76 bp in length and 35 bp in kmer length, 42 kmers will be obtained, and the taxid-kmer comparison result is: 0:10, 1313:32, except for the 10 kmers that cannot be compared, the rest are compared to the taxid 1313, and the taxonomy hierarchy structure is used to locate the taxid 1313 Streptococcus pneumoniae Streptococcus pneumoniae);
  • a sequence alignment with more than 2 taxids can be divided into 3 types:
  • All taxids in the comparison result are associated with only one species at the species level, and other taxids belong to the serotype/subtype of the species. At the genus and family levels, they are located at the taxid of the species (for example, a sequence length of 76bp, 35bp kmer length will get 42 kmers, the taxid-kmer comparison results are: 0:10, 1313:20, 1301:12, except for the 10 kmers that cannot be compared, 20 kmers are compared to the taxid 1313 , corresponding to Streptococcus pneumoniae Streptococcus pneumoniae through the taxonomy hierarchy structure, and the other 12 kmers were compared to the genus 1301 Streptococcus Streptococcus pneumoniae, since Streptococcus pneumoniae is under the genus Streptococcus, the sequence was mapped to the species taxid 1313 Streptococcus pneumoniae Streptococcus pneumoniae);
  • All the taxids in the comparison results are related to the results of the same genus and different species, and finally locate the genus taxid (for example, if a sequence length is 76bp, 35bp kmer length will get 42 kmers, the taxid-kmer comparison result is: 0: 10, 1313:20, 28037:5, 1301:7, in addition to the 10 kmers that cannot be compared, 20 kmers are compared to the taxid 1313, corresponding to Streptococcus pneumoniae Streptococcus pneumoniae through the taxonomy hierarchy structure, and the other 5 The kmer was compared to 28037 Streptococcus mitis light-weight Streptococcus, and 7 kmers were compared to 1301 Streptococcus Streptococcus. Since there were 2 species under Streptococcus, the sequence was mapped to taxid 1301 Streptococcus Streptococcus);
  • the calculation rules are:
  • kmer score (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
  • Selected 4 kinds of bacteria Haemophilus influenzae, Streptococcus pneumoniae, Staphylococcus aureus, Klebsiella pneumoniae
  • 2 kinds of fungi Candida albicans, Aspergillus fumigatus
  • 4 kinds of viruses human herpesvirus, human papillary Tumor virus, influenza A virus, HIV virus
  • GRC38.p13 the sequence length is set to 75bp
  • the total data volume is set to 10M
  • a total of 4 groups Each group consists of 3 identical samples.
  • the reference genomes used by the simulated samples are shown in the table below:
  • Score_cutoff summarizes the statistical results of statistical samples under different thresholds. Samples 7-12 are focused on due to the low total microbial count, and their error rate is higher than that of the overall level. From the statistical results, setting a threshold higher than this error rate is It can solve the occurrence of wrong comparison, as shown in the following table:
  • each sample is divided into list species false positive species (that is, the species detected by the error comparison is in the same family of 10 simulated species), non-list species false positive Species (the species detected by the wrong comparison are not in the same family as the 10 simulated species), and the species with the highest reads are counted. Since the repeated samples are detected to be completely consistent, only the representative sample results are listed, as shown in the following table:
  • the final reads_cutoff is set to correct for species detected in different families. From the comparison of different values of score_cutoff, a slightly lower comparison rate is allowed Under the premise of , score_cutoff is set to 0.5, and reads_cutoff is set to greater than 3 to eliminate the detection of non-list false positive species (the lower the value, the better the sensitivity of true positive species detected by fewer reads).
  • Targeting rules include:
  • a sequence is aligned with a unique taxid and the taxid is lower than the species level, it will be positioned as the taxid of the species to which the taxid belongs;
  • a sequence alignment with more than 2 taxids can be divided into 3 types:
  • the calculation rules are:
  • kmer score (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
  • the false positive and false negative detections are sorted out as follows:
  • sample taxi species reads relative abundance result sample 1 340412 Aspergillus novofumigatus 1 0.00011 false positive sample 1 984962 Heterobasidion irregular 1 0.00011 false positive sample 1 145522 Nannochloropsis oceanica 4 0.00044 false positive sample 1 28037 Streptococcus mitis 2 0.00022 false positive sample 1 2656787 Venustapulla echinocandica 1 0.00011 false positive sample 10 86049 Cladophialophora carrionii 1 0.00011 false positive Sample 10 10376 Human gammaherpes virus 4 1 0.00011 false positive sample 10 1873960 Pseudocercospora fijiensis 2 0.00022 false positive Sample 10 2656787 Venustapulla echinocandica 1 0.00011 false positive Sample 10 727 Haemophilus influenzae 0 0 false negative Sample 10 573 Klebsiella pneumoniae 0 0 false negative Sample 11 727 Haemophilus influenzae 0 0 false negative Sample 11 573 Kle
  • Embodiment 3 actual sample detection experiment
  • the total number of positive species is 148, kraken2 confidence 0.5+bracken has 2 species not detected (sensitivity is 98.6%), the process of the present invention and kraken2+bracken have one species not detected (sensitivity is 99.3%), the performance is similar to the simulated data, In terms of sensitivity, the process of the present invention is the same as kraken2+bracken, slightly higher than the process of kraken2 confidence 0.5+bracken.
  • the summary statistics of the false positive species detected by each process of the RNA library are as follows (corresponding to the results in Figure 4, where opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process and corresponds to the fourth column in the table, and kraken represents Kraken2+ bracket process):
  • the sensitivity results of the summary simulation samples and spike-in sample statistics are as follows (corresponding to the results in Figure 5, where opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process, and kraken represents the Kraken2+bracken process):
  • the detection of false positive species in the process of the present invention will be far lower than that of kraken2+bracken process, and on the basis of ensuring that the sensitivity is higher than kraken2 confidence 0.5+bracken, the detection of false positives will be lower than the latter ( Even when the reads>3 are reported, the detection of false positive species can still be reduced by about 1/3).

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A biological information analysis method based on a kraken2 single-sequence kmer score and overall taxonomy structure statistics. By means of the method, false positives in biological information analysis can be reduced, and the species detection accuracy can be improved. The method is applicable to second-generation metagenome sequencing analysis.

Description

一种优化的kraken2算法及其在二代测序中的应用An optimized kraken2 algorithm and its application in next-generation sequencing
本申请要求于2021年07月17日提交中国专利局、申请号为202110804351.8、发明名称为“一种优化的kraken2算法及其在二代测序中的应用”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application submitted to the China Patent Office on July 17, 2021, with the application number 202110804351.8, and the title of the invention is "an optimized kraken2 algorithm and its application in next-generation sequencing", all of which The contents are incorporated by reference in this application.
技术领域technical field
本发明涉及生物信息学领域,特别是涉及一种优化的kraken2算法及其在二代测序中的应用。The invention relates to the field of bioinformatics, in particular to an optimized kraken2 algorithm and its application in next-generation sequencing.
背景技术Background technique
宏基因组群落复杂庞大,需要对大量的DNA进行测序,Illumina二代测序技术是一种大规模平行测序技术,具有通量高,测序准确度高,时效短等特点,正好完美的匹配了宏基因组学的需求,成就了宏基因组学在感染检测的广泛应用。The metagenomic community is complex and huge, and a large amount of DNA needs to be sequenced. Illumina next-generation sequencing technology is a massively parallel sequencing technology, which has the characteristics of high throughput, high sequencing accuracy, and short timeliness, which perfectly matches the metagenome. The demand for metagenomics has led to the widespread application of metagenomics in infection detection.
测序之后微生物群落的物种检测,是宏基因组学研究中的最为重要工作,只有准确可靠地对微生物群落进行精确定位,才能关联宏基因组与研究的关联,比如研究患者的发病是否是某种微生物感染(如某个人怀疑是疟疾,那么需要准确地检测出其血液中存在疟原虫才能最终给出明确诊断),宏基因组分析是一种快速,准确,先进的检测技术,目前在感染类疾病辅助诊断中发挥了重要作用。Species detection of microbial communities after sequencing is the most important work in metagenomics research. Only by accurately and reliably locating microbial communities can we associate metagenomics with research, such as studying whether a patient's disease is caused by a certain microbial infection (If a person suspects malaria, it is necessary to accurately detect the presence of Plasmodium in the blood to give a definitive diagnosis), metagenomic analysis is a fast, accurate and advanced detection technology, currently used in the auxiliary diagnosis of infectious diseases played an important role in.
Kraken2应用于Illumina二代宏基因组测序,具有分析速度快,灵敏度高的特点,但是特异度较低,往往会检测出很多假阳性结果,这是因为kraken2算法的特点。根据taxid与seqid关系,对选取的参考基因组序列快速构建固定长度的kmers(默认为35bp的读长),优先构建某个层级的特异kmers,比如肺炎链球菌Streptococcus pneumoniae,kraken2会优先构建该物种的特异kmers,而链球菌属Streptococcus多个物种也存在某个kmer,则将该kmer定位到链球菌属下,同样的原理,某个kmer存在于链球菌科Streptococcaceae下多个属,则将该kmer定位在链球菌科下。鉴于kraken2的算法,对于某种DNA序列较高的微生物,虽然会有一定的概率会发生错误比对,但是基本上不会干扰该物种的检出。由于二代测序具有读长短的特点,因此很容易出现序列发生错误比对,或者无法精确比对(比如某条来自肺炎链球菌的序列,错误比对到Streptococcus mitis,或者只能比对到Streptococcus属层级),这是影响物种检测准确度的最重要因素。Kraken2 is applied to Illumina second-generation metagenomic sequencing, which has the characteristics of fast analysis speed and high sensitivity, but the specificity is low, and many false positive results are often detected, which is due to the characteristics of the kraken2 algorithm. According to the relationship between taxid and seqid, quickly construct fixed-length kmers for the selected reference genome sequence (the default read length is 35bp), and give priority to constructing a certain level of specific kmers, such as Streptococcus pneumoniae, kraken2 will give priority to constructing the specific kmers of this species Specific kmers, and there is a certain kmer in multiple species of Streptococcus genus Streptococcus, then locate the kmer under the genus Streptococcus, the same principle, if a certain kmer exists in multiple genera under Streptococcus family Streptococcus, then the kmer Positioned under Streptococcus family. In view of the kraken2 algorithm, although there is a certain probability of misalignment for a certain microorganism with a high DNA sequence, it will basically not interfere with the detection of this species. Due to the characteristics of the next-generation sequencing with short read length, it is prone to sequence misalignment, or cannot be accurately aligned (such as a sequence from Streptococcus pneumoniae, which is misaligned to Streptococcus mitis, or can only be aligned to Streptococcus genus level), which is the most important factor affecting the accuracy of species detection.
除此之外,由于数据库包含的序列很多,比如质粒/载体等也在内,因此比对这部分比对的结果也会给出输出,这部分结果基本上是无意义的(也可以算作假阳性检出)。In addition, because the database contains a lot of sequences, such as plasmids/vectors, etc., the results of this part of the comparison will also be output, and this part of the results is basically meaningless (it can also be counted as false positive detection).
鉴于此,提出本发明。In view of this, the present invention is proposed.
发明内容Contents of the invention
本发明的目的是寻求一种能够降低测序分析假阳性,能够提高物种检测的准确度,适用于Illumina二代宏基因组测序的生信分析方法。The purpose of the present invention is to seek a bioinformatics analysis method that can reduce false positives in sequencing analysis, improve the accuracy of species detection, and is suitable for Illumina second-generation metagenomic sequencing.
为实现上述目的,本发明提出如下技术方案:To achieve the above object, the present invention proposes the following technical solutions:
本发明首先提供了一种基于kraken2单条序列kmer评分和整体基于taxonomy结构统计的生信分析方法,所述方法包括如下步骤:The present invention firstly provides a bioinformatics analysis method based on kraken2 single sequence kmer score and overall taxonomy structure statistics, said method comprising the following steps:
1)NGS测序数据使用kraken2进行序列比对,得到每条序列的taxid-kmer结果;1) NGS sequencing data is compared using kraken2 to obtain the taxid-kmer result of each sequence;
2)基于taxonomy数据库建立taxid的层级关系,根据步骤1)taxid-kmer结果获得taxid,并关联taxonomy层级,再根据定位规则重定位taxid;2) Establish the hierarchical relationship of taxid based on the taxonomy database, obtain the taxid according to the result of the taxid-kmer in step 1, and associate it with the taxonomy level, and then relocate the taxid according to the positioning rules;
3)根据每条序列经过步骤2)定位的taxid和步骤1)的taxid-kmer比对结果,计算每条序列的kmer score;3) Calculate the kmer score of each sequence according to the taxid located in step 2) of each sequence and the taxid-kmer comparison result of step 1);
4)根据kmer score和taxonomy层级,对比对结果进行整体计算;4) According to kmer score and taxonomy level, compare and compare the results for overall calculation;
进一步的,还包括Further, it also includes
5)根据4)的整体计算结果进行物种层级检测。5) Carry out species-level detection based on the overall calculation results of 4).
进一步的,所述步骤2)中层级关系包括血清型/亚型、种、属和/或科的一种或多种层级关系。Further, the hierarchical relationship in step 2) includes one or more hierarchical relationships of serotype/subtype, species, genus and/or family.
进一步的,所述步骤2)中定位规则包括如下:Further, the positioning rules in the step 2) include the following:
通常情况下接受kraken2给出的taxid定位,以下情况除外:Normally, the taxid location given by kraken2 is accepted, except for the following cases:
某条序列根据taxid-kmer结果获得唯一taxid且taxid低于种层级,则定位为该taxid所属的种层级taxid;If a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is lower than the species level, it is positioned as the species level taxid to which the taxid belongs;
某条序列根据taxid-kmer结果获得超过2个taxid时,分3种情况:When a sequence obtains more than 2 taxids according to the taxid-kmer results, there are 3 cases:
所有taxid,关联到种层级上只出现1个,其他taxid属于该种的血清型/亚型、属、科层级,则定位到该种层级taxid;For all taxids, only one appears at the species level, and other taxids belong to the serotype/subtype, genus, and family level of the species, and then locate the taxid at the species level;
所有taxid,关联到种层级超过2个且属于同一属,则最终定位到属层级taxid;All taxids that are associated with more than 2 species levels and belong to the same genus will be finally located at the genus level taxid;
所有taxid,关联到属层级超过2个(属层级没有分类也在内)且属于同一科,则最终定位到科层级taxid;All taxids that are associated with more than 2 genus levels (including genus levels without classification) and belong to the same family will be finally located at the family level taxid;
进一步的,所述步骤3)中所述计算的规则包括:Further, the calculation rules in the step 3) include:
最终定位到科层级taxid以下的序列,其kmer score=(科taxid kmers+属taxid kmers+种taxid kmers+亚型/血清型taxid kmers)/总kmers;Finally locate the sequence below the taxid at the family level, its kmer score = (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
最终定位到科层级taxid以上的序列,kmer score设定为0。Finally, the sequence above the taxid at the family level is located, and the kmer score is set to 0.
进一步的,所述步骤4)中整体计算包括:Further, in the step 4), the overall calculation includes:
a、设定一个过滤cutoff阈值,对每条序列根据kmer score进行过滤;a. Set a filter cutoff threshold, and filter each sequence according to the kmer score;
b、对a中经过过滤的序列,统计taxid的reads;b. For the filtered sequence in a, count the reads of taxid;
所述taxid的reads是一个样本出现的taxid的序列总数;The reads of the taxid is the total number of taxid sequences that appear in a sample;
c、设定一个过滤阈值threshold,对b中定位到种层级的taxid进行过滤,计算其属相对比值,排除低于阈值的种层级taxid;c. Set a filtering threshold threshold, filter the taxids located at the species level in b, calculate their relative ratio, and exclude the species level taxids lower than the threshold;
所述属相对比值为某个种层级taxid reads相对于同属reads最高的种层级taxid reads的比值;The genus relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same genus reads;
进一步的,还包括:Further, it also includes:
d、经c过滤的种层级taxid,若缺乏属分类,则计算科相对比值,排除低于过滤阈值threshold种层级taxid;d. If the taxid at the species level filtered by c lacks genus classification, calculate the family relative ratio and exclude the taxid at the species level lower than the filtering threshold threshold;
所述科相对比值为某个种层级taxid reads相对于同科reads最高的种层级taxid reads的比值;The family relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same family reads;
更进一步的,还包括:Further, it also includes:
e、经c,d过滤保留的种层级taxid reads校正1,计算属相对比值,将属层级taxid reads按照属相对比值计算种层级taxid属校正reads;e. After c, d filter and retain the species-level taxid reads correction 1, calculate the genus relative ratio, and calculate the species-level taxid genus correction reads with the genus-level taxid reads according to the genus relative ratio;
所述属相对比值为经c,d过滤后同属的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The genus relative ratio is the sum of the species-level taxid reads of the same genus after filtering by c and d, and then calculates the ratio of taxid reads at various levels to the sum;
所述属层级taxid reads包括b中属层级taxid reads和c中未通过过滤阈值threshold的该属关联种层级taxid并入的reads;The genus-level taxid reads include the genus-level taxid reads in b and the reads incorporated into the genus-associated species-level taxid that have not passed the filtering threshold threshold in c;
f、经c,d过滤的种层级taxid reads校正2,计算科相对比值,将科层级taxid reads按照科相对比值计算种层级taxid科校正reads;f, after c, the kind-level taxid reads filtered by d are corrected 2, and the family-level relative ratio is calculated, and the family-level taxid reads are calculated according to the family-level taxid family-corrected reads;
所述科相对比值为经c,d过滤后同科的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The relative ratio of the family is the sum of the species-level taxid reads of the same family after filtering by c and d, and then calculate the ratio of the taxid reads at various levels to the sum;
所述科层级taxid的reads包含b中科层级taxid reads和d中未通过过滤阈值threshold的该科关联种层级taxid并入的reads。The family-level taxid reads include the family-level taxid reads in b and the reads incorporated into the family-related species-level taxid in d that do not pass the filtering threshold threshold.
进一步的,所述步骤5)物种层级检测,即等同于种层级taxid检测,根据c,d得到种层级taxid即为最终的物种taxid,取b,e,f得到的种层级taxid reads之和得到最终的物种taxid reads,并根据reads之和计算相对丰度。Further, said step 5) species-level detection, which is equivalent to the species-level taxid detection, according to c, d obtains the species-level taxid which is the final species taxid, gets b, e, and the sum of the species-level taxid reads obtained by f obtains The final species taxid reads, and calculate the relative abundance based on the sum of reads.
进一步的,所述步骤1)中比对采用的数据库为nt、refseq或genbank数据库;优选的,所述数据库为nt数据库。Further, the database used for comparison in step 1) is nt, refseq or genbank database; preferably, the database is nt database.
本发明首先还提供了基于kraken2单条序列kmer评分和整体基于taxonomy结构统计的方法在二代测序生信分析中的应用。Firstly, the present invention also provides the application of the method based on kraken2 single sequence kmer score and overall taxonomy structure statistics in next-generation sequencing bioinformatics analysis.
本发明还提供一种计算机可读介质,其存储有计算机程序,所述计算机程序被处理器执行时,实现权利要求上述任一项所述方法。The present invention also provides a computer-readable medium, which stores a computer program, and when the computer program is executed by a processor, the method described in any one of the above claims is realized.
本发明还提供一种电子设备,其特征在于,包括处理器以及存储器,所述存储器上存储一条或多条可读指令,所述一条或多条可读指令被所述处理器执行时,实现权利要求上述任一项所述方法。The present invention also provides an electronic device, which is characterized in that it includes a processor and a memory, and one or more readable instructions are stored in the memory, and when the one or more readable instructions are executed by the processor, the Claim any one of the above methods.
本发明有益的技术效果:Beneficial technical effect of the present invention:
1)本发明所述生信方法能够降低生信分析的检出假阳性,能够提高物种检测的准确度,适用于二代宏基因组测序,包括单端和双端测序等。1) The bioinformatics method of the present invention can reduce false positives in bioinformatics analysis, improve the accuracy of species detection, and is suitable for second-generation metagenomic sequencing, including single-end and double-end sequencing.
2)本发明通过对单条序列精确定位和整体的系统性优化,保证了整体的灵敏度。2) The present invention ensures the overall sensitivity through precise positioning of a single sequence and overall systematic optimization.
3)本发明通过引入taxonomy,排除了质粒/载体部分比对结果等,有效降低了无意义检出的情况。3) By introducing taxonomy, the present invention excludes partial comparison results of plasmids/vectors, etc., effectively reducing the situation of meaningless detection.
附图说明Description of drawings
为了更清楚地说明本发明具体实施方式或现有技术中的技术方案,下面将对具体实施方式或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施方式,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the specific implementation of the present invention or the technical solutions in the prior art, the following will briefly introduce the accompanying drawings that need to be used in the specific implementation or description of the prior art. Obviously, the accompanying drawings in the following description The drawings show some implementations of the present invention, and those skilled in the art can obtain other drawings based on these drawings without any creative work.
图1本发明体系简图;Fig. 1 schematic diagram of the system of the present invention;
图2单条序列进行taxid定位和kmer score评分示意图;Figure 2 Schematic diagram of taxid positioning and kmer score scoring for a single sequence;
图3 9例spike-in样本DNA文库检出的假阳性物种比较图,opt代表本发明流程,confidence代表kraken2 confidence 0.5+bracken流程,kraken代表kraken2+bracken流程,S1-S9代表9个样本;Fig. 3 Comparison chart of false positive species detected in 9 cases of spike-in sample DNA library, opt represents the process of the present invention, confidence represents the process of kraken2 confidence 0.5+bracken, kraken represents the process of kraken2+bracken, S1-S9 represents 9 samples;
图4 9例spike-in样本RNA文库检出的假阳性物种比较图,opt代表本发明流程,confidence代表kraken2 confidence 0.5+bracken流程,kraken代表kraken2+bracken流程,S1-S9代表9个样本;Fig. 4 Comparison diagram of false positive species detected in 9 cases of spike-in sample RNA library, opt represents the process of the present invention, confidence represents the process of kraken2 confidence 0.5+bracken, kraken represents the process of kraken2+bracken, and S1-S9 represents 9 samples;
图5 12例模拟样本和9例spike-in样本灵敏度统计图,opt代表本发明流程,confidence代表kraken2 confidence 0.5+bracken流程,kraken代表kraken2+bracken流程,simulated代表12例模拟样本,spike-in即9例spike-in样本。Fig. 5 Sensitivity chart of 12 simulated samples and 9 spike-in samples, opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5+bracken process, kraken represents the kraken2+bracken process, simulated represents 12 simulated samples, spike-in is 9 spike-in samples.
具体实施方式detailed description
下面将结合实施例对本发明的实施方案进行详细描述,但是本领域技术人员将会理解,下列实施例仅用于说明本发明,而不应视为限制本发明的范围,并且所述实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。Embodiments of the present invention will be described in detail below in conjunction with examples, but those skilled in the art will understand that following examples are only used to illustrate the present invention, and should not be considered as limiting the scope of the invention, and described examples are Some, but not all, embodiments of the invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.
部分术语定义Definition of some terms
除非在下文中另有定义,本发明具体实施方式中所用的所有技术术语和科学术语的含义意图与本领域技术人员通常所理解的相同。虽然相信以下术语对于本领域技术人员很好理解,但仍然阐述以下定义以更好地解释本发明。Unless otherwise defined hereinafter, all technical and scientific terms used in the detailed description of the present invention have the same meanings as commonly understood by those skilled in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present invention.
如本发明中所使用,术语“包括”、“包含”、“具有”、“含有”或“涉及”为包含性的(inclusive)或开放式的,且不排除其它未列举的元素或方法步骤。术语“由…组成”被认为是术语“包含”的优选实施方案。如果在下文中某一组被定义为包含至少一定数目的实施方案,这也应被理解为揭示了一个优选地仅由这些实施方案组成的组。As used herein, the terms "comprising", "comprising", "having", "containing" or "involving" are inclusive or open-ended and do not exclude other unrecited elements or method steps . The term "consisting of" is considered as a preferred embodiment of the term "comprising". If in the following a certain group is defined as comprising at least a certain number of embodiments, this is also to be understood as revealing a group which preferably consists only of these embodiments.
本发明中的术语“大约”、“大体”表示本领域技术人员能够理解的仍可保证论及特征的技术效果的准确度区间。该术语通常表示偏离指示数值的±10%,优选±5%。The terms "about" and "approximately" in the present invention represent the range of accuracy that can be understood by those skilled in the art and still guarantee the technical effect of the mentioned feature. The term generally means ±10%, preferably ±5%, of the indicated value.
在提及单数形式名词时使用的不定冠词或定冠词例如“一个”或“一种”,“所述”,包括该名词的复数形式。The use of an indefinite or definite article when referring to a noun in the singular eg "a" or "an", "the", includes a plural of that noun.
此外,说明书和权利要求书中的术语第一、第二、第三、(a)、(b)、(c)以及诸如此类,是用于区分相似的元素,不是描述顺序或时间次序必须的。应理解,如此应用的术语在适当的环境下可互换,并且本发明描述的实施方案能以不同于本发明描述或举例说明的其它顺序实施。In addition, the terms first, second, third, (a), (b), (c) and the like in the specification and claims are used to distinguish similar elements and are not necessary to describe the order or time order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments described herein are capable of operation in other sequences than described or illustrated herein.
本发明中的“read”或“每条reads”或“单条reads”,指高通量测序平台产生的一条核酸序列。"read" or "each read" or "single read" in the present invention refers to a nucleic acid sequence generated by a high-throughput sequencing platform.
本发明中的术语“比对结果”:英文为“alignment”,指一条测序读出序列与一条参考序列之间的对应结果,一条测序读出序列可以同时具有多个比对结果。The term "alignment result" in the present invention: "alignment" in English, refers to the corresponding result between a sequencing readout sequence and a reference sequence, and a sequencing readout sequence can have multiple alignment results at the same time.
本发明所述的“kmer”是指将一条序列连续切割,逐个碱基划动得到k个碱基的子字符串,例如reads长度为L,k-mer长度设为k,则产生的k-mers数目为:L-k+1;再例如序列AACTGACT,设置k为3,则可以将其分割为AAC、ACT、CTG、TGA、GAC、ACT共6个k-mers。The "kmer" in the present invention refers to continuously cutting a sequence and scratching bases one by one to obtain a substring of k bases. For example, the length of reads is L, and the length of k-mer is set to k, then the generated k- The number of mers is: L-k+1; another example is the sequence AACTGACT, if k is set to 3, it can be divided into 6 k-mers of AAC, ACT, CTG, TGA, GAC, and ACT.
本发明所述的“kraken2”是指本领域的一个基于kmer算法的高精度宏基因组序列分类软件,能够快速的将测序reads进行物种分类。The "kraken2" in the present invention refers to a high-precision metagenomic sequence classification software based on the kmer algorithm in the field, which can quickly classify sequencing reads into species.
本发明所述的“kraken2优化算法”是指本发明开发的一种基于kraken2比对结果进行微生物物种层级检测的优化体系,旨在提高准确度和降低假阳性检出。The "kraken2 optimization algorithm" described in the present invention refers to an optimization system for microbial species-level detection based on the comparison results of kraken2 developed by the present invention, aiming at improving accuracy and reducing false positive detection.
本发明所述的“nt比对数据库”是基于NCBI nt数据库建立的kraken2比对索引数据库。"nt comparison database" of the present invention is the kraken2 comparison index database based on NCBI nt database establishment.
本发明所述的“taxid”或“taxonomy_id”是指taxonomy数据库中的id号。The "taxid" or "taxonomy_id" mentioned in the present invention refers to the id number in the taxonomy database.
本发明所述的于kraken2单条序列kmer评分和整体基于taxonomy结构统计的生信分析方法,大体包括如下步骤:The single sequence kmer score of kraken2 and the overall bioinformatics analysis method based on taxonomy structure statistics of the present invention generally include the following steps:
1.使用kraken2进行二代测序数据的序列比对,数据库使用,比如nt等比对数据 库;1. Use kraken2 for sequence comparison of next-generation sequencing data, using databases, such as nt and other comparison databases;
2.整理taxonomy数据库,建立基于taxid的血清型/亚型-种-属-科的层级关系或者任意一种或多种层级关系,并根据kraken2每条序列taxid-kmer比对结果的taxid引入相应的taxonomy层级;2. Organize the taxonomy database, establish the hierarchical relationship of serotype/subtype-species-genus-family or any one or more hierarchical relationships based on taxid, and import the corresponding taxid according to the taxid-kmer comparison result of each sequence of kraken2 The taxonomy level;
3.每条reads进行taxid重新定位和kmer score计算;3. Carry out taxid relocation and kmer score calculation for each reads;
4.根据kmer score和taxonomy层级结构,对比对结果进行整体的计算;4. According to kmer score and taxonomy hierarchical structure, compare and compare the results for overall calculation;
5.根据计算结果进行微生物检测。5. Carry out microbial detection according to the calculation results.
示例性的,可以具体包括如下步骤:Exemplarily, the following steps may be specifically included:
1.使用kraken2进行序列比对;1. Use kraken2 for sequence alignment;
2.引入taxonomy层级,根据单条序列taxid-kmer比对结果依据定位规则重新定位其层级和对应的taxid;2. Introduce the taxonomy level, and relocate its level and corresponding taxid according to the positioning rules according to the single sequence taxid-kmer comparison result;
所述定位规则可以包括:The positioning rules may include:
原则上接受kraken2给出的taxid定位,以下情况除外:In principle, accept the taxid positioning given by kraken2, except for the following cases:
某条序列根据taxid-kmer结果获得唯一taxid且taxid低于种层级,则定位为该taxid所属的种层级taxid;If a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is lower than the species level, it is positioned as the species level taxid to which the taxid belongs;
某条序列根据taxid-kmer结果获得超过2个taxid时,分3种情况:When a sequence obtains more than 2 taxids according to the taxid-kmer results, there are 3 cases:
所有taxid,关联到种层级上只出现1个,其他taxid属于该种的血清型/亚型、属、科层级,则定位到该种层级taxid;For all taxids, only one appears at the species level, and other taxids belong to the serotype/subtype, genus, and family level of the species, and then locate the taxid at the species level;
所有taxid,关联到种层级超过2个且属于同一属,则最终定位到属层级taxid;All taxids that are associated with more than 2 species levels and belong to the same genus will be finally located at the genus level taxid;
所有taxid,关联到属层级超过2个(属层级没有分类也在内)且属于同一科,则最终定位到科层级taxid;All taxids that are associated with more than 2 genus levels (including genus levels without classification) and belong to the same family will be finally located at the family level taxid;
3.根据每条序列经过2定位的taxid,以及上述taxid-kmer比对结果,根据计算规则计算每条序列的kmer score评分;3. Calculate the kmer score of each sequence according to the calculation rules according to the taxid of each sequence after 2 positioning and the above-mentioned taxid-kmer comparison results;
计算规则可以为:Calculation rules can be:
最终定位到科以下的序列,其kmer score=(科taxid kmers+属taxid kmers+种taxid kmers+亚型/血清型taxid kmers)/总kmers;Finally locate the sequence below the family, its kmer score = (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
最终定位到科以上的序列,kmer score设定为0。Finally, the sequence above the family is located, and the kmer score is set to 0.
4.将定位到噬菌体/质粒/载体的序列,全部标记为未比对;4. Mark all sequences mapped to phage/plasmid/vector as unaligned;
5.设定一个过滤cutoff阈值(后文全部以score_cutoff表示),对每条序列根据kmer score得分进行过滤;5. Set a filter cutoff threshold (represented by score_cutoff in the following), and filter each sequence according to the kmer score;
6.对5中经过过滤的序列,统计taxid的reads;6. For the filtered sequence in 5, count the reads of taxid;
7.设定一个过滤阈值threshold,对6中定位到种层级的taxid进行过滤,计算其属相对比值,排除低于阈值的种层级taxid;7. Set a filtering threshold, filter the taxids located at the species level in 6, calculate their relative ratio, and exclude the species-level taxids below the threshold;
所述属相对比值为某个种层级taxid reads相对于同属reads最高的种层级taxid reads的比值;The genus relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same genus reads;
8.经7过滤的种层级taxid,若缺乏属分类,则计算科相对比值,排除低于过滤阈值threshold种层级taxid;8. If the taxid at the species level filtered by 7 lacks genus classification, calculate the family relative ratio and exclude the taxid at the species level lower than the filtering threshold threshold;
所述科相对比值为某个种层级taxid reads相对于同科reads最高的种层级taxid reads的比值;The family relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same family reads;
9.经7,8过滤保留的种层级taxid reads校正1,计算属相对比值,将属层级taxid reads按照属相对比值计算种层级taxid属校正reads;9. The species-level taxid reads retained by filtering in 7 and 8 are corrected 1, and the relative ratio is calculated, and the taxid reads at the species level are calculated according to the relative ratio of the species-level taxid and genus correction reads;
所述属相对比值为经7,8过滤后同属的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The genus relative ratio is the sum of the species-level taxid reads of the same genus after filtering by 7 and 8, and then calculates the ratio of taxid reads at various levels to the sum;
所述属层级taxid reads包括6中属层级taxid reads和7中未通过过滤阈值threshold的该属关联种层级taxid并入的reads;The genus-level taxid reads include 6 genus-level taxid reads and 7 reads that do not pass through the genus-related species level taxid of the genus-level taxid;
10.经7,8过滤的种层级taxid reads校正2,计算科相对比值,将科层级taxid reads按照科相对比值计算种层级taxid科校正reads;10. The species-level taxid reads filtered by 7 and 8 are corrected 2, and the family-level relative ratio is calculated, and the family-level taxid reads are calculated according to the family-level relative ratio to calculate the species-level taxid family-corrected reads;
所述科相对比值为经7,8过滤后同科的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The relative ratio of the family is the sum of the species-level taxid reads of the same family after filtering by 7 and 8, and then calculate the ratio of the taxid reads at various levels to the sum;
所述科层级taxid的reads包含6中科层级taxid reads和8中未通过过滤阈值threshold的该科关联种层级taxid并入的reads。The family-level taxid reads include 6 family-level taxid reads and 8 family-related taxid-incorporated reads that do not pass the filtering threshold threshold.
11.微生物检测,即等同于种层级taxid检测,根据7,8得到种层级taxid即为最终的物种taxid,取6,9,10得到的种层级taxid reads之和得到最终的物种taxid reads,并根据reads之和计算相对丰度。11. Microbial detection, which is equivalent to the species-level taxid detection, the species-level taxid obtained according to 7 and 8 is the final species taxid, and the sum of the species-level taxid reads obtained by 6, 9, and 10 is used to obtain the final species taxid reads, and Relative abundance is calculated based on the sum of reads.
以下描述仅是为了帮助理解本发明而提供。这些描述不应被理解为具有小于本领域技术人员所理解的范围。The following description is provided only to help the understanding of the present invention. These descriptions should not be read with a scope less than that understood by those skilled in the art.
实施例1方法体系的设计优化The design optimization of embodiment 1 method system
本实施例要解决的问题是如何通过数据分析方法,尽量保证kraken2比对结果的准确度。The problem to be solved in this embodiment is how to ensure the accuracy of the kraken2 comparison results as much as possible through the data analysis method.
1.首先,对于该问题,可以拆分为2个问题,如何降低假阳性物种检出以提高特异度,以及保证真实物种检出以获得较高的灵敏度,两者达到最佳平衡,即得到最好的准确度。通过分析kraken2是如何产生假阳性物种结果,以及灵敏度会降低的情况。1. First of all, for this problem, it can be divided into two problems, how to reduce the detection of false positive species to improve specificity, and ensure the detection of real species to obtain higher sensitivity. The two achieve the best balance, that is, get best accuracy. By analyzing how kraken2 produces false positive species results, and the cases where the sensitivity will be reduced.
a)数据库的冗余,无论是refseq,genbank还是nt,都存在着大量的参考基因组冗 余,这是造成假阳性检出的重要原因,错误的比对也同样会降低灵敏度;a) The redundancy of the database, whether it is refseq, genbank or nt, there is a large amount of reference genome redundancy, which is an important reason for false positive detection, and wrong comparison will also reduce the sensitivity;
b)序列相似性,典型的如大肠杆菌与志贺杆菌的序列度超过99%,这种情况在同属下的物种间尤为常见,这是造成假阳性检出的重要原因,也会干扰真实物种检出,降低了灵敏度;b) Sequence similarity, typically, the sequence degree of Escherichia coli and Shigella exceeds 99%, which is especially common among species under the same genus, which is an important reason for false positive detection and also interferes with real species detected, reducing the sensitivity;
c)宿主序列(通常指人源宿主)干扰,由于目前的宏基因组基本上来自于宿主样本采集,不可避免的会存在大量的宿主序列,这部分序列会一定程度上影响微生物的检测;c) host sequence (usually refers to human host) interference, since the current metagenomic basically comes from host sample collection, there will inevitably be a large number of host sequences, which will affect the detection of microorganisms to a certain extent;
d)mNGS序列较短,序列较短造成了每条序列得到的kmer较少,这在75bp单端测序上更为明显,容易发生错误比对。d) The mNGS sequence is short, which results in fewer kmers for each sequence, which is more obvious in 75bp single-end sequencing, and misalignment is prone to occur.
2.其次,界定以上哪些情况是可以单独通过对样本测序结果的数据分析来解决的。2. Secondly, define which of the above situations can be solved through data analysis of the sample sequencing results alone.
所谓单独通过对样本测序结果的数据分析,是指:所获得的信息只有一个样本序列文件(FASTQ),而不知道其他信息;通过一些算法,输出物种检测结果。The so-called data analysis of the sample sequencing results alone refers to: the information obtained is only one sample sequence file (FASTQ), and no other information is known; through some algorithms, the species detection results are output.
对于“a)数据库的冗余”,一方面可以通过在数据库整理,去冗余中来实现,但是数据库过度精简会造成假阴性的发生,所以除了数据库优化之外,还可以通过算法进行一定程度的降低,算法部分是本专利的考虑点;For "a) database redundancy", on the one hand, it can be achieved by sorting out the database and removing redundancy, but excessive simplification of the database will cause false negatives, so in addition to database optimization, algorithms can also be used to a certain extent The reduction of the algorithm part is the consideration point of this patent;
对于“b)序列相似性”,一方面可以通过标准基因组更新和物种分类学的进步来解决,另外可以通过算法来进行一定程度的优化,这部分是本专利的考虑点;For "b) sequence similarity", on the one hand, it can be solved through standard genome updates and the progress of species taxonomy, and on the other hand, it can be optimized to a certain extent through algorithms, which is part of the consideration of this patent;
对于“c)宿主序列(通常指人源序列)干扰”,这个一方面可以通过精准构建宿主的基因组序列和使用比对算法去除宿主基因组,但是并不能保证宿主基因组去除的是否准确彻底(去除不彻底仍会有一定的残留,另外去除过度会降低微生物检出的灵敏度),除此之外可以通过在算法上对单条序列比对结果进行精确分析,和整体上设定物种检出的阈值,得到一定程度的优化,这部分是本专利的考虑点;For "c) host sequence (usually human-derived sequence) interference", on the one hand, the host genome can be removed by accurately constructing the host genome sequence and using an alignment algorithm, but it does not guarantee whether the host genome removal is accurate and thorough (removal is not possible) There will still be some residues completely, and excessive removal will reduce the sensitivity of microbial detection), in addition, it is possible to accurately analyze the results of single sequence comparisons in the algorithm, and set the threshold for species detection as a whole, Get a certain degree of optimization, this part is the consideration point of this patent;
对于“d)mNGS序列较短”,这部分是技术硬性问题,需要通过技术的进步来提升。As for "d) the mNGS sequence is short", this part is a technically hard problem, which needs to be improved through technological progress.
3.kraken2算法的具体优化方案3. Specific optimization scheme of kraken2 algorithm
针对2中的3个优化点,结合kraken2的比对原则,建立了如下的优化体系:For the three optimization points in 2, combined with the comparison principle of kraken2, the following optimization system was established:
3.1单条序列的处理3.1 Processing of a single sequence
a)引入taxonomy层级,根据单条序列taxid-kmer比对结果依据定位规则重新定位其层级和对应的taxid;a) Introduce the taxonomy level, and relocate its level and corresponding taxid according to the positioning rules according to the single sequence taxid-kmer comparison result;
定位规则包括:原则上接受kraken2给出的taxid定位,以下情况除外:The positioning rules include: accept the taxid positioning given by kraken2 in principle, except for the following cases:
某条序列比对到唯一taxid且taxid低种层级,则定位为该taxid所属的种层级taxid(举例,某条序列长度76bp,35bp kmer长度则得到42个kmer,taxid-kmer比对结果为:0:10,1313:32,除了不能比对的10个kmer之外,其余都比对到1313这个taxid上,通过taxonomy层级结构定位到taxid 1313 Streptococcus pneumoniae肺炎链球菌);If a sequence is compared to a unique taxid and the taxid is at a lower species level, it will be positioned as the species-level taxid to which the taxid belongs (for example, if a sequence is 76 bp in length and 35 bp in kmer length, 42 kmers will be obtained, and the taxid-kmer comparison result is: 0:10, 1313:32, except for the 10 kmers that cannot be compared, the rest are compared to the taxid 1313, and the taxonomy hierarchy structure is used to locate the taxid 1313 Streptococcus pneumoniae Streptococcus pneumoniae);
某条序列比对超过2个taxid,分为3种:A sequence alignment with more than 2 taxids can be divided into 3 types:
比对结果中的所有taxid,关联到种层级上只有1个物种,其他taxid属于该种的血 清型/亚型,属、科层级,则定位到该种taxid(举例,某条序列长度76bp,35bp kmer长度则得到42个kmer,taxid-kmer比对结果为:0:10,1313:20,1301:12,除了不能比对的10个kmer之外,20个kmer比对到1313这个taxid上,通过taxonomy层级结构对应Streptococcus pneumoniae肺炎链球菌,另外12个kmer比对到1301Streptococcus链球菌属,由于肺炎链球菌在链球菌属下,则该序列定位到种taxid 1313 Streptococcus pneumoniae肺炎链球菌);All taxids in the comparison result are associated with only one species at the species level, and other taxids belong to the serotype/subtype of the species. At the genus and family levels, they are located at the taxid of the species (for example, a sequence length of 76bp, 35bp kmer length will get 42 kmers, the taxid-kmer comparison results are: 0:10, 1313:20, 1301:12, except for the 10 kmers that cannot be compared, 20 kmers are compared to the taxid 1313 , corresponding to Streptococcus pneumoniae Streptococcus pneumoniae through the taxonomy hierarchy structure, and the other 12 kmers were compared to the genus 1301 Streptococcus Streptococcus pneumoniae, since Streptococcus pneumoniae is under the genus Streptococcus, the sequence was mapped to the species taxid 1313 Streptococcus pneumoniae Streptococcus pneumoniae);
比对结果中的所有taxid,关联出同属不同种的结果,则最终定位到属taxid(举例,某条序列长度76bp,35bp kmer长度则得到42个kmer,taxid-kmer比对结果为:0:10,1313:20,28037:5,1301:7,除了不能比对的10个kmer之外,20个kmer比对到1313这个taxid上,通过taxonomy层级结构对应Streptococcus pneumoniae肺炎链球菌,另外5个kmer比对到28037 Streptococcus mitis轻型链球菌,7个kmer比对到1301 Streptococcus链球菌属,由于链球菌属下出现了2个种,则该序列定位到taxid 1301 Streptococcus链球菌属);All the taxids in the comparison results are related to the results of the same genus and different species, and finally locate the genus taxid (for example, if a sequence length is 76bp, 35bp kmer length will get 42 kmers, the taxid-kmer comparison result is: 0: 10, 1313:20, 28037:5, 1301:7, in addition to the 10 kmers that cannot be compared, 20 kmers are compared to the taxid 1313, corresponding to Streptococcus pneumoniae Streptococcus pneumoniae through the taxonomy hierarchy structure, and the other 5 The kmer was compared to 28037 Streptococcus mitis light-weight Streptococcus, and 7 kmers were compared to 1301 Streptococcus Streptococcus. Since there were 2 species under Streptococcus, the sequence was mapped to taxid 1301 Streptococcus Streptococcus);
比对结果中的所有taxid,关联出同科不同属的结果,则最终定位到科taxid;Comparing all taxids in the results, and correlating the results of different genus in the same family, the taxid of the family is finally located;
b)根据每条序列经过a)定位的taxid和taxid-kmer比对结果,根据计算规则计算每条序列的kmer score;b) Calculate the kmer score of each sequence according to the calculation rules according to the alignment results of taxid and taxid-kmer located in a) for each sequence;
计算规则为:The calculation rules are:
最终定位到科以下的序列,其kmer score=(科taxid kmers+属taxid kmers+种taxid kmers+亚型/血清型taxid kmers)/总kmers;Finally locate the sequence below the family, its kmer score = (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
最终定位到科以上的序列,kmer score设定为0。Finally, the sequence above the family is located, and the kmer score is set to 0.
3.2整体的处理3.2 Overall processing
c)设定一个score_cutoff,根据每条序列的kmer score判定其结果是否可信,未过score_cutoff视为不可信结果予以排除;c) Set a score_cutoff, judge whether the result is credible according to the kmer score of each sequence, and exclude it as an unreliable result if the score_cutoff is not exceeded;
d)对于最终定位到噬菌体/质粒/载体的序列,全部标记为未比对结果;d) For the sequences that are finally mapped to phage/plasmid/vector, all are marked as unaligned results;
e)对c)d)中经过过滤的序列,统计taxid的reads;e) For the filtered sequence in c) and d), count the reads of taxid;
f)对e)中定位到种层级的taxid关联属和科,计算每个属下所有种的属相对比值(属内各种reads数相对于属内reads最高种reads数的比值),设定一个threshold,将属相对比值低于过滤阈值的这部分种reads重新定位到属层级;f) For the taxid-associated genera and families located at the species level in e), calculate the genus relative ratio of all species under each genus (the ratio of the number of reads in the genus to the number of reads in the highest species of reads in the genus), set A threshold to relocate the reads of the species whose genus relative ratio is lower than the filtering threshold to the genus level;
g)类同于f),对缺乏属信息但是有明确科信息的种,计算科相对比值(同科下每个种相对于该科内reads最高种的比值),将科相对比值低于f)中过滤阈值threshold这部分种reads重新定位到科层级;g) Similar to f), for species that lack genus information but have clear family information, calculate the family relative ratio (the ratio of each species under the same family to the species with the highest reads in the family), and make the family relative ratio lower than f ) in the filtering threshold threshold, this part of the reads is relocated to the branch level;
h)对于最终比对到属层级的reads(包括原始比对到属层级的reads和步骤f)未通过过滤阈值threshold的该属下种reads之和),计算该属经f)过滤剩余种的reads总和,并计算该属剩余种相对总和的比值,之后根据各剩余种的比值,将最终比对到属 层级的reads分配到各剩余种上;h) For the reads that are finally compared to the genus level (including the reads that were originally compared to the genus level and the sum of the reads of the species under the genus that did not pass the filtering threshold threshold in step f), calculate the remaining species of the genus that have been filtered by f) The sum of reads, and calculate the ratio of the relative sum of the remaining species of the genus, and then assign the reads that are finally compared to the genus level to each remaining species according to the ratio of each remaining species;
i)对于最终比对到科层级的reads(包含原始比对到科层级的reads和g)步骤未通过过滤阈值threshold的该科下种reads之和),计算该科经过g)过滤剩余种的reads总和,并计算剩余种相对总和的比值,之后根据各剩余种的比值,将最终比对到科层级的reads分配到各剩余种上;i) For the reads that are finally compared to the family level (including the reads that were originally compared to the family level and the sum of the reads of the family that did not pass the filtering threshold threshold in the g) step), calculate the remaining species of the family that have been filtered by g) The sum of the reads, and calculate the ratio of the relative sum of the remaining species, and then according to the ratio of each remaining species, the reads that are finally compared to the family level are assigned to the remaining species;
j)报出最终物种reads(即经过h),i)处理的种reads)和相对丰度时,设定一个reads过滤阈值reads_cutoff,低于某个阈值的物种,不予统计和报出。j) When reporting the final species reads (that is, species reads processed by h), i) and relative abundance, set a reads filtering threshold reads_cutoff, and species below a certain threshold will not be counted and reported.
4、kraken2的模拟数据测试和建立相应的各个阈值(score_cutoff,threshold,reads_cutoff)4. Kraken2's simulated data test and establishment of corresponding thresholds (score_cutoff, threshold, reads_cutoff)
建立测试数据集,测试优化方法,和初步建立方法中涉及的阈值Create a test dataset, test the optimization method, and initially establish the thresholds involved in the method
以选取的4种细菌(流感嗜血杆菌,肺炎链球菌,金黄色葡萄球菌,肺炎克雷伯菌),2种真菌(白色念球菌,烟曲霉),4种病毒(人类疱疹病毒,人乳头瘤病毒,甲型流感病毒,HIV病毒),按照一定的reads比例,加入到人源宿主reads(GRC38.p13)中,序列长度设定为75bp,总数据量设定为10M,一共4组,每组由3个完全一致的样本组成。模拟样本使用的参考基因组如下表所示:Selected 4 kinds of bacteria (Haemophilus influenzae, Streptococcus pneumoniae, Staphylococcus aureus, Klebsiella pneumoniae), 2 kinds of fungi (Candida albicans, Aspergillus fumigatus), 4 kinds of viruses (human herpesvirus, human papillary Tumor virus, influenza A virus, HIV virus), according to a certain reads ratio, added to the human host reads (GRC38.p13), the sequence length is set to 75bp, the total data volume is set to 10M, a total of 4 groups, Each group consists of 3 identical samples. The reference genomes used by the simulated samples are shown in the table below:
Figure PCTCN2021106970-appb-000001
Figure PCTCN2021106970-appb-000001
具体的模拟样本如下表所示(其中按照顺序3个为重复的样本,比如样本1,2,3为完全一致的样本):The specific simulation samples are shown in the following table (there are 3 repeated samples in order, such as samples 1, 2, and 3 are completely consistent samples):
Figure PCTCN2021106970-appb-000002
Figure PCTCN2021106970-appb-000002
Figure PCTCN2021106970-appb-000003
Figure PCTCN2021106970-appb-000003
Score_cutoff在不同阈值下统计样本的统计结果汇总,样本7-12由于微生物总量较低,重点关注,其错误率要高于整体层级,从统计结果来看设定高于此错误率的threshold就可以解决错误比对的发生,具体如下表所示:Score_cutoff summarizes the statistical results of statistical samples under different thresholds. Samples 7-12 are focused on due to the low total microbial count, and their error rate is higher than that of the overall level. From the statistical results, setting a threshold higher than this error rate is It can solve the occurrence of wrong comparison, as shown in the following table:
Figure PCTCN2021106970-appb-000004
Figure PCTCN2021106970-appb-000004
鉴于score_cutoff在0.5和0.4两个数值达到错误率相同层级,对每个样本分为list 物种假阳性物种(即错误比对检出的物种在10个模拟物种的同科),非list物种假阳性物种(错误比对检出的物种不在10个模拟物种的同科),统计其各自reads最高的物种,由于重复性样本检出完全一致,只列出代表样本结果,具体如下表所示:In view of the fact that score_cutoff reaches the same level of error rate at 0.5 and 0.4, each sample is divided into list species false positive species (that is, the species detected by the error comparison is in the same family of 10 simulated species), non-list species false positive Species (the species detected by the wrong comparison are not in the same family as the 10 simulated species), and the species with the highest reads are counted. Since the repeated samples are detected to be completely consistent, only the representative sample results are listed, as shown in the following table:
Figure PCTCN2021106970-appb-000005
Figure PCTCN2021106970-appb-000005
由于优化方法中包含了对同属/科的校正,最终的reads_cutoff则是设定解决非同科检出的物种进行校正的,从score_cutoff不同值的比较来看,在允许稍低一点的比对率的前提下,score_cutoff设定为0.5,reads_cutoff设定到大于3即可消除非list假阳性物种检出(该值越低越有利于保证reads偏少检出的真实阳性物种的灵敏度)。Since the optimization method includes the correction of the same genus/families, the final reads_cutoff is set to correct for species detected in different families. From the comparison of different values of score_cutoff, a slightly lower comparison rate is allowed Under the premise of , score_cutoff is set to 0.5, and reads_cutoff is set to greater than 3 to eliminate the detection of non-list false positive species (the lower the value, the better the sensitivity of true positive species detected by fewer reads).
综上,最终确定本发明的技术方案如下:In summary, finally determine the technical solution of the present invention as follows:
1)基于kraken2比对结果进行;1) Based on the comparison results of kraken2;
2)引入taxonomy层级,单条序列根据taxid-kmer比对结果,依据定位规则重新定位其层级和对应的taxid;2) Introduce the taxonomy level, and relocate the level and corresponding taxid of a single sequence according to the alignment rules of taxid-kmer;
定位规则包括:Targeting rules include:
原则上接受kraken2给出的taxid定位,以下情况除外:In principle, accept the taxid positioning given by kraken2, except for the following cases:
某条序列比对到唯一taxid且taxid低于物种层级,则定位为该taxid所属的物种taxid;If a sequence is aligned with a unique taxid and the taxid is lower than the species level, it will be positioned as the taxid of the species to which the taxid belongs;
某条序列比对超过2个taxid,分为3种:A sequence alignment with more than 2 taxids can be divided into 3 types:
比对结果中的所有taxid,关联到物种层级上只有1个物种,其他taxid属于该物种的血清型/亚型,属、科层级,则定位到该物种taxid;All taxids in the comparison result are associated with only one species at the species level, and other taxids belong to the serotype/subtype of the species. At the genus and family levels, the taxid of this species is located;
比对结果中的所有taxid,关联出同属不同物种的结果,则最终定位到属taxid;Comparing all taxids in the results, and correlating the results of the same genus to different species, the genus taxid is finally located;
比对结果中的所有taxid,关联出同科不同属的结果,则最终定位到科taxid;Comparing all taxids in the results, and correlating the results of different genus in the same family, the taxid of the family is finally located;
3)根据每条序列经过2)定位的taxid和taxid-kmer比对结果,根据计算规则计算 每条序列的kmer score评分;3) Calculate the kmer score of each sequence according to the calculation rules according to the alignment results of taxid and taxid-kmer located in 2) for each sequence;
计算规则为:The calculation rules are:
最终定位到科以下的序列,其kmer score=(科taxid kmers+属taxid kmers+物种taxid kmers+亚型/血清型taxid kmers)/总kmers;Finally locate the sequence below the family, its kmer score = (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
最终定位到科以上的序列,kmer score设定为0。Finally, the sequence above the family is located, and the kmer score is set to 0.
4)设定score_cutoff为0.5,根据每条序列的kmer score判定其结果是否可信,kmer score未过score_cutoff视为不可信结果予以排除;4) Set score_cutoff to 0.5, and determine whether the result is credible according to the kmer score of each sequence. If the kmer score does not exceed score_cutoff, it is regarded as an unreliable result and excluded;
5)对于最终定位到噬菌体/质粒/载体的序列,全部标记为未比对结果;5) For the sequences that are finally mapped to phage/plasmid/vector, all are marked as unaligned results;
6)对4)5)中经过过滤的序列,统计taxid的reads;6) For the sequence filtered in 4) and 5), count the reads of taxid;
7)对6)中定位到物种层级的taxid关联属和科,计算每个属下所有物种的属相对比值(属内各物种reads数相对于属内reads最高物种reads数的比值),设定一个threshold,其值设定为0.1,将属相对比值低于过滤阈值的这部分物种reads重新定位到属层级;7) For the taxid-associated genus and family located at the species level in 6), calculate the genus relative ratio of all species under each genus (the ratio of the number of reads of each species in the genus to the number of reads in the highest species of reads in the genus), set A threshold, whose value is set to 0.1, relocates the reads of the species whose genus relative ratio is lower than the filtering threshold to the genus level;
8)类同于7),对缺乏属信息但是有明确科信息的物种,计算科相对比值(同科下每个物种相对于该科内reads最高物种的比值),将科相对比值低于7)中过滤阈值threshold这部分物种reads重新定位到科层级;8) Similar to 7), for species that lack genus information but have clear family information, calculate the family relative ratio (the ratio of each species under the same family to the species with the highest reads in the family), and make the family relative ratio lower than 7 ) in the filtering threshold threshold, the reads of this part of the species are relocated to the family level;
9)对于最终比对到属层级的reads(包括6)属层级taxid的reads和步骤7)未通过过滤阈值threshold的该属物种reads之和),计算该属经7)过滤剩余物种的reads总和,并计算该属剩余物种相对总和的比值,之后根据各剩余物种的比值,将最终比对到属层级的reads分配到各剩余物种上;9) For the reads that are finally compared to the genus level (including 6) the reads of the genus level taxid and step 7) the sum of the reads of the genus that did not pass the filtering threshold threshold), calculate the sum of the reads of the remaining species of the genus after 7) filtering , and calculate the ratio of the relative sum of the remaining species of the genus, and then assign the reads that are finally compared to the genus level to each remaining species according to the ratio of each remaining species;
10)对于最终比对到科层级的reads(包含6)科层级taxid的reads和8)步骤未通过过滤阈值threshold的该科物种reads之和),计算该科经过8)过滤剩余物种的reads总和,并计算剩余物种相对总和的比值,之后根据各剩余物种的比值,将最终比对到科层级的reads分配到各剩余物种上;10) For the final comparison to the family-level reads (including 6) family-level taxid reads and 8) the sum of reads of the family species that did not pass the filtering threshold threshold in step 8), calculate the sum of the reads of the family after 8) filtering the remaining species , and calculate the ratio of the relative sum of the remaining species, and then assign the reads that are finally compared to the family level to each remaining species according to the ratio of each remaining species;
11)报出最终物种reads和相对丰度时,设定一个reads过滤阈值reads_cutoff,该值设定为3,reads低于reads_cutoff的物种,不予统计和报出。最后报出物种层级的reads和相对丰度。11) When reporting the final species reads and relative abundance, set a reads filtering threshold reads_cutoff, which is set to 3, and species whose reads are lower than reads_cutoff will not be counted and reported. Finally, the species-level reads and relative abundance are reported.
实施例2与传统kraken2方法的效果比较Comparison of the effects of Example 2 and the traditional kraken2 method
1、根据优化方法最终检测的模拟样本结果整理假阳性与假阴性检出如下:1. According to the simulation sample results of the final detection of the optimization method, the false positive and false negative detections are sorted out as follows:
Figure PCTCN2021106970-appb-000006
Figure PCTCN2021106970-appb-000006
Figure PCTCN2021106970-appb-000007
Figure PCTCN2021106970-appb-000007
统计指标:Statistical indicators:
灵敏度为117/120(非人源物种总数)=97.5%;Sensitivity is 117/120 (total number of non-human species)=97.5%;
假阳性物种检出3例。Three false positive species were detected.
2.2根据kraken2 confidence 0.5+braken流程检测的结果整理假阳性检出和假阴性检出如下表所示:2.2 According to the results of kraken2 confidence 0.5+braken process detection, the false positive detection and false negative detection are sorted out as shown in the following table:
样本sample taxidtaxi 物种species readsreads 相对丰度relative abundance 结果result
样本1sample 1 340412340412 Aspergillus novofumigatus Aspergillus novofumigatus 11 0.000110.00011 假阳性false positive
样本1sample 1 984962984962 Heterobasidion irregulareHeterobasidion irregular 11 0.000110.00011 假阳性false positive
样本1sample 1 145522145522 Nannochloropsis oceanicaNannochloropsis oceanica 44 0.000440.00044 假阳性false positive
样本1sample 1 2803728037 Streptococcus mitis Streptococcus mitis 22 0.000220.00022 假阳性false positive
样本1sample 1 26567872656787 Venustampulla echinocandicaVenustapulla echinocandica 11 0.000110.00011 假阳性false positive
样本10sample 10 8604986049 Cladophialophora carrioniiCladophialophora carrionii 11 0.000110.00011 假阳性false positive
样本10Sample 10 1037610376 Human gammaherpesvirus 4Human gammaherpes virus 4 11 0.000110.00011 假阳性false positive
样本10sample 10 18739601873960 Pseudocercospora fijiensisPseudocercospora fijiensis 22 0.000220.00022 假阳性false positive
样本10Sample 10 26567872656787 Venustampulla echinocandica Venustapulla echinocandica 11 0.000110.00011 假阳性false positive
样本10Sample 10 727727 Haemophilus influenzae Haemophilus influenzae 00 00 假阴性false negative
样本10Sample 10 573573 Klebsiella pneumoniae Klebsiella pneumoniae 00 00 假阴性false negative
样本11Sample 11 727727 Haemophilus influenzae Haemophilus influenzae 00 00 假阴性false negative
样本11Sample 11 573573 Klebsiella pneumoniae Klebsiella pneumoniae 00 00 假阴性false negative
样本12sample 12 727727 Haemophilus influenzae Haemophilus influenzae 00 00 假阴性false negative
样本12sample 12 573573 Klebsiella pneumoniae Klebsiella pneumoniae 00 00 假阴性false negative
样本4sample 4 145522145522 Nannochloropsis oceanicaNannochloropsis oceanica 22 0.000220.00022 假阳性false positive
样本4Sample 4 9980299802 Spirometra erinaceieuropaeiSpirometra erinaceieuropaei 22 0.000220.00022 假阳性false positive
样本7Sample 7 12201881220188 Aspergillus tanneriAspergillus tanneri 11 0.000110.00011 假阳性false positive
样本7Sample 7 4521945219 Guanarito mammarenavirus Guanarito mammarenavirus 11 0.000110.00011 假阳性false positive
样本7Sample 7 145522145522 Nannochloropsis oceanicaNannochloropsis oceanica 33 0.000330.00033 假阳性false positive
统计指标:Statistical indicators:
灵敏度为114/120=95%;The sensitivity is 114/120=95%;
假阳性物种检出43例(其中reads大于3的为3例)。43 cases of false positive species were detected (3 cases with reads greater than 3).
3、kraken2+bracken即不引入confidence阈值的检测结果中,统计其中的假阳性物种统计为5517例(其中reads大于3的为1188例),假阴性物种为出现在样本10-12中的Klebsiella pneumoniae共3例。3. In the detection results of kraken2+bracken that does not introduce the confidence threshold, there are 5517 cases of false positive species (1188 cases with reads greater than 3), and the false negative species is Klebsiella pneumoniae that appeared in samples 10-12 A total of 3 cases.
结论:kraken2不设置confidence阈值灵敏度与本专利方法一致,但是假阳性物种检出过多,而设置confidence阈值之后虽然假阳性物种检出会减少很多,但是灵敏度会降低。综合比较,本专利方法在保证了灵敏度基础上,降低了假阳性物种检出,效果更优。Conclusion: The sensitivity of kraken2 without setting the confidence threshold is consistent with this patent method, but too many false positive species are detected. After setting the confidence threshold, although the detection of false positive species will be greatly reduced, the sensitivity will be reduced. In comprehensive comparison, the patented method reduces the detection of false positive species on the basis of ensuring the sensitivity, and the effect is better.
实施例3实际样本检测实验Embodiment 3 actual sample detection experiment
使用了9例spike-in样本,各自建立DNA文库和RNA文库,进行上机测序,具体的样本及阳性物种见下表:Nine spike-in samples were used to establish DNA libraries and RNA libraries for sequencing on the machine. The specific samples and positive species are shown in the table below:
样本sample 文库类型library type 原始编号original number 物种species taxidtaxi
S1S1 DNAdna 样本1sample 1 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S1S1 DNAdna 样本1sample 1 Cryptococcus gattii VGIICryptococcus gattii VGII 18590961859096
S1S1 DNAdna 样本1sample 1 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S1S1 DNAdna 样本1sample 1 Escherichia coliEscherichia coli 562562
S1S1 DNAdna 样本1sample 1 Human alphaherpesvirus 1Human alphaherpes virus 1 1029810298
S1S1 DNAdna 样本1sample 1 Human betaherpesvirus 5Human betaherpes virus 5 1035910359
S1S1 DNAdna 样本1sample 1 Human betaherpesvirus 6BHuman betaherpesvirus 6B 3260432604
S1S1 DNAdna 样本1sample 1 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S1S1 DNAdna 样本1sample 1 Streptococcus pneumoniaeStreptococcus pneumoniae 13131313
S1S1 RNARNA 样本1sample 1 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S1S1 RNARNA 样本1sample 1 Cryptococcus gattii VGIICryptococcus gattii VGII 18590961859096
S1S1 RNARNA 样本1sample 1 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S1S1 RNARNA 样本1sample 1 Escherichia coliEscherichia coli 562562
S1S1 RNARNA 样本1sample 1 Human alphaherpesvirus 1Human alphaherpes virus 1 1029810298
S1S1 RNARNA 样本1sample 1 Human betaherpesvirus 5Human betaherpes virus 5 1035910359
S1S1 RNARNA 样本1sample 1 Human betaherpesvirus 6BHuman betaherpesvirus 6B 3260432604
S1S1 RNARNA 样本1sample 1 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S1S1 RNARNA 样本1sample 1 Streptococcus pneumoniaeStreptococcus pneumoniae 13131313
S2S2 DNAdna 样本2sample 2 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S2S2 DNAdna 样本2sample 2 Cryptococcus gattii VGIICryptococcus gattii VGII 18590961859096
S2S2 DNAdna 样本2sample 2 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S2S2 DNAdna 样本2sample 2 Escherichia coliEscherichia coli 562562
S2S2 DNAdna 样本2sample 2 Human alphaherpesvirus 1Human alphaherpes virus 1 1029810298
S2S2 DNAdna 样本2sample 2 Human betaherpesvirus 5Human betaherpes virus 5 1035910359
S2S2 DNAdna 样本2sample 2 Human betaherpesvirus 6BHuman betaherpesvirus 6B 3260432604
S2S2 DNAdna 样本2sample 2 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S2S2 DNAdna 样本2sample 2 Streptococcus pneumoniaeStreptococcus pneumoniae 13131313
S2S2 RNARNA 样本2sample 2 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S2S2 RNARNA 样本2sample 2 Cryptococcus gattii VGIICryptococcus gattii VGII 18590961859096
S2S2 RNARNA 样本2sample 2 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S2S2 RNARNA 样本2sample 2 Escherichia coliEscherichia coli 562562
S2S2 RNARNA 样本2sample 2 Human alphaherpesvirus 1Human alphaherpes virus 1 1029810298
S2S2 RNARNA 样本2sample 2 Human betaherpesvirus 5Human betaherpes virus 5 1035910359
S2S2 RNARNA 样本2sample 2 Human betaherpesvirus 6BHuman betaherpesvirus 6B 3260432604
S2S2 RNARNA 样本2sample 2 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S2S2 RNARNA 样本2sample 2 Streptococcus pneumoniaeStreptococcus pneumoniae 13131313
S3S3 DNAdna 样本3sample 3 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S3S3 DNAdna 样本3sample 3 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S3S3 DNAdna 样本3sample 3 Haemophilus influenzaeHaemophilus influenzae 727727
S3S3 DNAdna 样本3sample 3 Human alphaherpesvirus 2Human alphaherpes virus 2 1031010310
S3S3 DNAdna 样本3sample 3 Listeria monocytogenesListeria monocytogenes 16391639
S3S3 DNAdna 样本3sample 3 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S3S3 DNAdna 样本3sample 3 Neisseria meningitidisNeisseria meningitidis 487487
S3S3 DNAdna 样本3sample 3 Streptococcus agalactiaeStreptococcus agalactiae 13111311
S3S3 RNARNA 样本3sample 3 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S3S3 RNARNA 样本3sample 3 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S3S3 RNARNA 样本3sample 3 Haemophilus influenzaeHaemophilus influenzae 727727
S3S3 RNARNA 样本3sample 3 Human alphaherpesvirus 2Human alphaherpes virus 2 1031010310
S3S3 RNARNA 样本3sample 3 Listeria monocytogenesListeria monocytogenes 16391639
S3S3 RNARNA 样本3sample 3 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S3S3 RNARNA 样本3sample 3 Neisseria meningitidisNeisseria meningitidis 487487
S3S3 RNARNA 样本3sample 3 Parechovirus AParechovirus A 18039561803956
S4S4 DNAdna 样本4Sample 4 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S4S4 DNAdna 样本4sample 4 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S4S4 DNAdna 样本4Sample 4 Haemophilus influenzaeHaemophilus influenzae 727727
S4S4 DNAdna 样本4sample 4 Human alphaherpesvirus 2Human alphaherpes virus 2 1031010310
S4S4 DNAdna 样本4sample 4 Listeria monocytogenesListeria monocytogenes 16391639
S4S4 DNAdna 样本4Sample 4 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S4S4 DNAdna 样本4sample 4 Neisseria meningitidisNeisseria meningitidis 487487
S4S4 DNAdna 样本4sample 4 Streptococcus agalactiaeStreptococcus agalactiae 13111311
S4S4 RNARNA 样本4sample 4 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S4S4 RNARNA 样本4sample 4 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S4S4 RNARNA 样本4Sample 4 Haemophilus influenzaeHaemophilus influenzae 727727
S4S4 RNARNA 样本4Sample 4 Human alphaherpesvirus 2Human alphaherpes virus 2 1031010310
S4S4 RNARNA 样本4Sample 4 Listeria monocytogenesListeria monocytogenes 16391639
S4S4 RNARNA 样本4Sample 4 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S4S4 RNARNA 样本4Sample 4 Neisseria meningitidisNeisseria meningitidis 487487
S4S4 RNARNA 样本4Sample 4 Parechovirus AParechovirus A 18039561803956
S5S5 DNAdna 样本5Sample 5 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S5S5 DNAdna 样本5Sample 5 Candida albicansCandida albicans 54765476
S5S5 DNAdna 样本5Sample 5 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S5S5 DNAdna 样本5Sample 5 Enterococcus faecalisEnterococcus faecalis 13511351
S5S5 DNAdna 样本5Sample 5 Human mastadenovirus CHuman mastadenovirus C 129951129951
S5S5 DNAdna 样本5Sample 5 Klebsiella oxytocaKlebsiella oxytoca 571571
S5S5 DNAdna 样本5Sample 5 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S5S5 DNAdna 样本5Sample 5 Streptococcus pyogenesStreptococcus pyogenes 13141314
S5S5 RNARNA 样本5Sample 5 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S5S5 RNARNA 样本5Sample 5 Candida albicansCandida albicans 54765476
S5S5 RNARNA 样本5Sample 5 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S5S5 RNARNA 样本5Sample 5 Enterococcus faecalisEnterococcus faecalis 13511351
S5S5 RNARNA 样本5Sample 5 Human mastadenovirus CHuman mastadenovirus C 129951129951
S5S5 RNARNA 样本5Sample 5 Human orthopneumovirusHuman orthopneumovirus 1125011250
S5S5 RNARNA 样本5Sample 5 Human respirovirus 1Human respirovirus 1 1273012730
S5S5 RNARNA 样本5Sample 5 Klebsiella oxytocaKlebsiella oxytoca 571571
S5S5 RNARNA 样本5Sample 5 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S5S5 RNARNA 样本5Sample 5 Streptococcus pyogenesStreptococcus pyogenes 13141314
S6S6 DNAdna 样本6Sample 6 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S6S6 DNAdna 样本6Sample 6 Candida albicansCandida albicans 54765476
S6S6 DNAdna 样本6Sample 6 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S6S6 DNAdna 样本6Sample 6 Enterococcus faecalisEnterococcus faecalis 13511351
S6S6 DNAdna 样本6Sample 6 Human mastadenovirus CHuman mastadenovirus C 129951129951
S6S6 DNAdna 样本6Sample 6 Klebsiella oxytocaKlebsiella oxytoca 571571
S6S6 DNAdna 样本6Sample 6 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S6S6 DNAdna 样本6Sample 6 Streptococcus pyogenesStreptococcus pyogenes 13141314
S6S6 RNARNA 样本6Sample 6 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S6S6 RNARNA 样本6Sample 6 Candida albicansCandida albicans 54765476
S6S6 RNARNA 样本6Sample 6 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S6S6 RNARNA 样本6Sample 6 Enterococcus faecalisEnterococcus faecalis 13511351
S6S6 RNARNA 样本6Sample 6 Human mastadenovirus CHuman mastadenovirus C 129951129951
S6S6 RNARNA 样本6Sample 6 Human orthopneumovirusHuman orthopneumovirus 1125011250
S6S6 RNARNA 样本6Sample 6 Human respirovirus 1Human respirovirus 1 1273012730
S6S6 RNARNA 样本6Sample 6 Klebsiella oxytocaKlebsiella oxytoca 571571
S6S6 RNARNA 样本6Sample 6 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S6S6 RNARNA 样本6Sample 6 Streptococcus pyogenesStreptococcus pyogenes 13141314
S7S7 DNAdna 样本7Sample 7 Aeromonas hydrophilaAeromonas hydrophila 644644
S7S7 DNAdna 样本7Sample 7 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S7S7 DNAdna 样本7Sample 7 Clavispora lusitaniaeClavispora lusitaniae 3691136911
S7S7 DNAdna 样本7Sample 7 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S7S7 DNAdna 样本7Sample 7 Human mastadenovirus EHuman mastadenovirus E 130308130308
S7S7 DNAdna 样本7Sample 7 Legionella pneumophilaLegionella pneumophila 446446
S7S7 DNAdna 样本7Sample 7 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S7S7 DNAdna 样本7Sample 7 Neisseria siccaNeisseria sicca 490490
S7S7 DNAdna 样本7Sample 7 Pseudomonas fluorescensPseudomonas fluorescens 294294
S7S7 RNARNA 样本7Sample 7 Aeromonas hydrophilaAeromonas hydrophila 644644
S7S7 RNARNA 样本7Sample 7 Alphapapillomavirus 7Alphapapillomavirus 7 337042337042
S7S7 RNARNA 样本7Sample 7 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S7S7 RNARNA 样本7Sample 7 Human mastadenovirus EHuman mastadenovirus E 130308130308
S7S7 RNARNA 样本7Sample 7 Influenza A virusInfluenza A virus 1132011320
S7S7 RNARNA 样本7Sample 7 Influenza B virusInfluenza B virus 1152011520
S7S7 RNARNA 样本7Sample 7 Legionella pneumophilaLegionella pneumophila 446446
S7S7 RNARNA 样本7Sample 7 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S7S7 RNARNA 样本7Sample 7 Neisseria siccaNeisseria sicca 490490
S7S7 RNARNA 样本7Sample 7 Pseudomonas fluorescensPseudomonas fluorescens 294294
S8 S8 DNAdna 样本8Sample 8 Aeromonas hydrophilaAeromonas hydrophila 644644
S8 S8 DNAdna 样本8Sample 8 Alphapapillomavirus 7 Alphapapillomavirus 7 337042337042
S8 S8 DNAdna 样本8Sample 8 Clavispora lusitaniaeClavispora lusitaniae 3691136911
S8 S8 DNAdna 样本8Sample 8 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S8 S8 DNAdna 样本8Sample 8 Human mastadenovirus EHuman mastadenovirus E 130308130308
S8 S8 DNAdna 样本8Sample 8 Legionella pneumophilaLegionella pneumophila 446446
S8 S8 DNAdna 样本8Sample 8 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S8 S8 DNAdna 样本8Sample 8 Neisseria siccaNeisseria sicca 490490
S8 S8 DNAdna 样本8Sample 8 Pseudomonas fluorescensPseudomonas fluorescens 294294
S8 S8 RNARNA 样本8Sample 8 Aeromonas hydrophilaAeromonas hydrophila 644644
S8 S8 RNARNA 样本8Sample 8 Alphapapillomavirus 7 Alphapapillomavirus 7 337042337042
S8 S8 RNARNA 样本8Sample 8 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S8 S8 RNARNA 样本8Sample 8 Human mastadenovirus EHuman mastadenovirus E 130308130308
S8 S8 RNARNA 样本8Sample 8 Influenza A virusInfluenza A virus 1132011320
S8 S8 RNARNA 样本8Sample 8 Influenza B virusInfluenza B virus 1152011520
S8 S8 RNARNA 样本8Sample 8 Legionella pneumophilaLegionella pneumophila 446446
S8 S8 RNARNA 样本8Sample 8 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S8 S8 RNARNA 样本8Sample 8 Neisseria siccaNeisseria sicca 490490
S8 S8 RNARNA 样本8Sample 8 Pseudomonas fluorescensPseudomonas fluorescens 294294
S9 S9 DNAdna 样本9Sample 9 Alphapapillomavirus 7 Alphapapillomavirus 7 337042337042
S9 S9 DNAdna 样本9Sample 9 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S9 S9 DNAdna 样本9Sample 9 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
S9 S9 RNARNA 样本9Sample 9 Alphapapillomavirus 7 Alphapapillomavirus 7 337042337042
S9 S9 RNARNA 样本9Sample 9 Cupriavidus gilardiiCupriavidus gilardii 8254182541
S9 S9 RNARNA 样本9Sample 9 Mycoplasma hyorhinisMycoplasma hyorhinis 21002100
本发明的流程未检出的阳性物种,kraken2 confidence 0.5+bracken流程未检出的阳性物种,kraken2+bracken流程未检出的阳性物种统计如下表所示(其中reads_opt,abundance_opt代表本发明流程的阳性物种检出,reads_confidence,abundance_confidence代表kraken2 confidence 0.5+bracken流程的阳性物种检出,reads_kraken,abundance_kraken代表kraken2+bracken流程的阳性检出):The positive species not detected by the process of the present invention, the positive species not detected by the kraken2 confidence 0.5+bracken process, the statistics of the positive species not detected by the kraken2+bracken process are shown in the table below (wherein reads_opt, abundance_opt represent the positives of the process of the present invention Species detection, reads_confidence, abundance_confidence represent positive species detection of kraken2 confidence 0.5+bracken process, reads_kraken, abundance_kraken represent positive detection of kraken2+bracken process):
Figure PCTCN2021106970-appb-000008
Figure PCTCN2021106970-appb-000008
阳性物种合计148,kraken2 confidence 0.5+bracken有2个物种未检出(灵敏度为98.6%),本发明流程和kraken2+bracken有一个物种未检出(灵敏度为99.3%),表现与模拟数据类似,灵敏度上本发明流程,和kraken2+bracken相同,稍高于kraken2  confidence 0.5+bracken的流程。The total number of positive species is 148, kraken2 confidence 0.5+bracken has 2 species not detected (sensitivity is 98.6%), the process of the present invention and kraken2+bracken have one species not detected (sensitivity is 99.3%), the performance is similar to the simulated data, In terms of sensitivity, the process of the present invention is the same as kraken2+bracken, slightly higher than the process of kraken2 confidence 0.5+bracken.
DNA文库各个流程检出的假阳性物种汇总统计如下表(对应图3结果,其中图片中的opt代表本发明流程,confidence代表Kraken2 confidence 0.5+bracken流程且对应表格中的第4列,kraken代表Kraken2+bracken流程):The summary statistics of the false positive species detected by each process of the DNA library are as follows (corresponding to the results in Figure 3, wherein opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process and corresponds to the fourth column in the table, kraken represents Kraken2 +bracken process):
Figure PCTCN2021106970-appb-000009
Figure PCTCN2021106970-appb-000009
RNA文库各个流程检出的假阳性物种汇总统计如下表(对应图4结果,其中图片中的opt代表本发明流程,confidence代表Kraken2 confidence 0.5+bracken流程且对应表格中第4列,kraken代表Kraken2+bracken流程):The summary statistics of the false positive species detected by each process of the RNA library are as follows (corresponding to the results in Figure 4, where opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process and corresponds to the fourth column in the table, and kraken represents Kraken2+ bracket process):
Figure PCTCN2021106970-appb-000010
Figure PCTCN2021106970-appb-000010
汇总模拟样本和spike-in样本统计的灵敏度结果如下表(对应图5结果,其中图片中的opt代表本发明流程,confidence代表Kraken2 confidence 0.5+bracken流程,kraken代表Kraken2+bracken流程):The sensitivity results of the summary simulation samples and spike-in sample statistics are as follows (corresponding to the results in Figure 5, where opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process, and kraken represents the Kraken2+bracken process):
Figure PCTCN2021106970-appb-000011
Figure PCTCN2021106970-appb-000011
Figure PCTCN2021106970-appb-000012
Figure PCTCN2021106970-appb-000012
从以上的统计结果来看,本发明流程假阳性物种检出要远低于kraken2+bracken流程,且在保证了灵敏度高于kraken2 confidence 0.5+bracken基础上,假阳性检出要低于后者(即便是在reads>3才报出的情况下,仍然能降低大约1/3的假阳性物种检出)。From the above statistical results, the detection of false positive species in the process of the present invention will be far lower than that of kraken2+bracken process, and on the basis of ensuring that the sensitivity is higher than kraken2 confidence 0.5+bracken, the detection of false positives will be lower than the latter ( Even when the reads>3 are reported, the detection of false positive species can still be reduced by about 1/3).
最后应说明的是:以上各实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述各实施例对本发明进行了详细的说明,但本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分或者全部技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的范围。It should be noted that at last: above each embodiment is only in order to illustrate technical scheme of the present invention, and is not intended to limit; Although the present invention has been described in detail with reference to foregoing each embodiment, those of ordinary skill in the art should understand that: It is still possible to modify the technical solutions described in the foregoing embodiments, or perform equivalent replacements for some or all of the technical features; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the technical solutions of the various embodiments of the present invention. range.

Claims (10)

  1. 一种生信分析方法,其特征在于,包括如下步骤:A biometric analysis method, characterized in that it comprises the following steps:
    1)序列比对:NGS测序数据使用kraken2进行序列比对,获得每条序列taxid-kmer结果;1) Sequence comparison: use kraken2 for sequence comparison of NGS sequencing data, and obtain the taxid-kmer result of each sequence;
    2)基于taxonomy数据库建立taxid层级关系:根据步骤1)taxid-kmer结果关联taxonomy层级,根据定位规则重定位taxid;2) Establish taxid hierarchical relationship based on taxonomy database: according to step 1) taxonomy level is associated with the taxid-kmer result, and taxid is relocated according to the positioning rules;
    3)计算每条序列kmer score:根据每条序列经过步骤2)重定位的taxid和步骤1)的taxid-kmer结果计算每条序列kmer score;3) Calculate the kmer score of each sequence: calculate the kmer score of each sequence according to the taxid relocated in step 2) and the taxid-kmer result of step 1) for each sequence;
    4)对比对结果进行整体计算:根据kmer score和taxonomy层级进行整体计算。4) Overall calculation of comparison results: overall calculation based on kmer score and taxonomy level.
  2. 权利要求1所述的生信分析方法,其特征在于,所述方法进一步包括:The bioinformatics analysis method of claim 1, wherein the method further comprises:
    5)物种taxid检测:根据4)的整体计算结果进行物种taxid检测。5) Species taxid detection: perform species taxid detection according to the overall calculation result in 4).
  3. 权利要求1-2任一所述的生信分析方法,其特征在于,所述步骤2)中层级关系包括血清型/亚型、种、属、科的一种或多种。The bioinformatics analysis method according to any one of claims 1-2, wherein the hierarchical relationship in step 2) includes one or more of serotype/subtype, species, genus, and family.
  4. 权利要求1-3任一所述的生信分析方法,其特征在于,所述步骤3)中所述kmer score计算规则如下:The arbitrary described biometric analysis method of claim 1-3, is characterized in that, described step 3) described kmerscore calculating rule is as follows:
    最终定位到科层级taxid以下的序列,kmer score=(科taxid kmers+属taxid kmers+种taxid kmers+亚型/血清型taxid kmers)/总kmers;Finally locate the sequence below the taxid at the family level, kmer score=(family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
    最终定位到科层级taxid以上的序列,kmer score为0。Finally, the sequence above the taxid at the family level is located, and the kmer score is 0.
  5. 权利要求1-4任一所述的生信分析方法,其特征在于,所述步骤2)中重定位规则包括:The biometric analysis method according to any one of claims 1-4, wherein the relocation rules in said step 2) include:
    通常情况下接受kraken2给出的taxid定位,以下情况进行重定位:Usually accept the taxid positioning given by kraken2, and relocate in the following cases:
    某条序列根据taxid-kmer结果获得唯一taxid且taxid低于种层级,则定位为该taxid所属的种层级taxid;If a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is lower than the species level, it is positioned as the species level taxid to which the taxid belongs;
    某条序列根据taxid-kmer结果获得超过2个taxid时,分3种情况:When a sequence obtains more than 2 taxids according to the taxid-kmer results, there are 3 cases:
    所有taxid,关联到种层级上只出现1个,其他taxid属于该种的血清型/亚型、属、或科层级,则定位到该种层级taxid;For all taxids, only one appears at the species level, and other taxids belong to the serotype/subtype, genus, or family level of the species, then locate the taxid at the species level;
    所有taxid,关联到种层级超过2个且属于同一属,则最终定位到属层级taxid;All taxids that are associated with more than 2 species levels and belong to the same genus will be finally located at the genus level taxid;
    所有taxid,关联到属层级超过2个且属于同一科,则最终定位到科层级taxid。All taxids that are associated with more than 2 genus levels and belong to the same family will be finally located at the family level taxid.
  6. 权利要求1-5任一所述的生信分析方法,其特征在于,所述步骤4)中整体计算包括:The bioinformatics analysis method described in any one of claims 1-5, wherein the overall calculation in said step 4) includes:
    a、设定一个过滤cutoff阈值,对每条序列根据kmer score进行过滤;a. Set a filter cutoff threshold, and filter each sequence according to the kmer score;
    b、对a中经过过滤的序列,统计taxid的reads;b. For the filtered sequence in a, count the reads of taxid;
    所述taxid的reads是一个样本出现的taxid的序列总数;The reads of the taxid is the total number of taxid sequences that appear in a sample;
    c、设定一个过滤阈值threshold,对b中定位到种层级的taxid进行过滤,计算其属相对比值,排除低于阈值的种层级taxid;c. Set a filtering threshold threshold, filter the taxids located at the species level in b, calculate their relative ratio, and exclude the species level taxids lower than the threshold;
    所述属相对比值为某个种层级taxid reads相对于同属reads最高的种层级taxid reads的比值;The genus relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same genus reads;
    优选的,还包括:Preferably, it also includes:
    d、经c过滤的种层级taxid,若缺乏属分类,则计算科相对比值,排除低于过滤阈值threshold种层级taxid;d. If the taxid at the species level filtered by c lacks genus classification, calculate the family relative ratio and exclude the taxid at the species level lower than the filtering threshold threshold;
    所述科相对比值为某个种层级taxid reads相对于同科reads最高的种层级taxid reads的比值。The family relative ratio is the ratio of a species-level taxid reads relative to the species-level taxid reads with the highest reads of the same family.
  7. 权利要求1-6任一所述的生信分析方法,其特征在于,所述步骤4)中所述整体计算进一步包括:The bioinformatics analysis method according to any one of claims 1-6, wherein the overall calculation in the step 4) further comprises:
    e、经c,d过滤保留的种层级taxid reads校正1,计算属相对比值,将属层级taxid reads按照属相对比值计算种层级taxid属校正reads;e. After c, d filter and retain the species-level taxid reads correction 1, calculate the genus relative ratio, and calculate the species-level taxid genus correction reads with the genus-level taxid reads according to the genus relative ratio;
    所述属相对比值为经c,d过滤后同属的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The genus relative ratio is the sum of the species-level taxid reads of the same genus after filtering by c and d, and then calculates the ratio of taxid reads at various levels to the sum;
    所述属层级taxid reads包括b中属层级taxid reads和c中未通过过滤阈值threshold的该属关联种层级taxid并入的reads;The genus-level taxid reads include the genus-level taxid reads in b and the reads incorporated into the genus-associated species-level taxid that have not passed the filtering threshold threshold in c;
    f、经c,d过滤的种层级taxid reads校正2,计算科相对比值,将科层级taxid reads按照科相对比值计算种层级taxid科校正reads;f, after c, the kind-level taxid reads filtered by d are corrected 2, and the family-level relative ratio is calculated, and the family-level taxid reads are calculated according to the family-level taxid family-corrected reads;
    所述科相对比值为经c,d过滤后同科的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The relative ratio of the family is the sum of the species-level taxid reads of the same family after filtering by c and d, and then calculate the ratio of the taxid reads at various levels to the sum;
    所述科层级taxid的reads包含b中科层级taxid reads和d中未通过过滤阈值threshold的该科关联种层级taxid并入的reads。The family-level taxid reads include the family-level taxid reads in b and the reads incorporated into the family-related species-level taxid in d that do not pass the filtering threshold threshold.
  8. 权利要求1-7任一所述的生信分析方法,其特征在于,所述步骤1)中比对采用的数据库为nt、refseq或genbank数据库;优选的,所述数据库为nt数据库。The bioinformatics analysis method according to any one of claims 1-7, characterized in that the database used for comparison in the step 1) is an nt, refseq or genbank database; preferably, the database is an nt database.
  9. 一种计算机可读介质,其存储有计算机程序,所述计算机程序被处理器执行时,实现权利要求1-8中任一项所述方法。A computer-readable medium, which stores a computer program, and when the computer program is executed by a processor, implements the method according to any one of claims 1-8.
  10. 一种电子设备,其特征在于,包括处理器以及存储器,所述存储器上存储一条或多条可读指令,所述一条或多条可读指令被所述处理器执行时,实现权利要求1-8中任一项所述方法。An electronic device, characterized in that it includes a processor and a memory, and the memory stores one or more readable instructions, and when the one or more readable instructions are executed by the processor, claim 1- The method described in any one of 8.
PCT/CN2021/106970 2021-07-14 2021-07-17 Optimized kraken2 algorithm and application thereof in second-generation sequencing WO2023283967A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110804351.8A CN113539369B (en) 2021-07-14 2021-07-14 Optimized kraken2 algorithm and application thereof in second-generation sequencing
CN202110804351.8 2021-07-14

Publications (1)

Publication Number Publication Date
WO2023283967A1 true WO2023283967A1 (en) 2023-01-19

Family

ID=78128300

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/106970 WO2023283967A1 (en) 2021-07-14 2021-07-17 Optimized kraken2 algorithm and application thereof in second-generation sequencing

Country Status (2)

Country Link
CN (1) CN113539369B (en)
WO (1) WO2023283967A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539369B (en) * 2021-07-14 2022-03-25 江苏先声医学诊断有限公司 Optimized kraken2 algorithm and application thereof in second-generation sequencing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681704A (en) * 2020-04-21 2020-09-18 华中科技大学鄂州工业技术研究院 Construction method of matK gene-based unknown plant species identification database and database
CN112071366A (en) * 2020-10-13 2020-12-11 南开大学 Metagenome data analysis method based on second-generation sequencing technology
US20210141833A1 (en) * 2019-11-07 2021-05-13 International Business Machines Corporation Optimizing k-mer databases by k-mer subtraction
CN113096737A (en) * 2021-03-26 2021-07-09 北京源生康泰基因科技有限公司 Method and system for automatically analyzing pathogen types
CN113539369A (en) * 2021-07-14 2021-10-22 江苏先声医学诊断有限公司 Optimized kraken2 algorithm and application thereof in second-generation sequencing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462821B (en) * 2020-04-10 2022-02-22 广州微远医疗器械有限公司 Pathogenic microorganism analysis and identification system and application
CN111710365B (en) * 2020-06-10 2022-04-08 山东省计算中心(国家超级计算济南中心) Ontology-based protein/gene synonym table construction method
CN112599198A (en) * 2020-12-29 2021-04-02 上海派森诺生物科技股份有限公司 Microorganism species and functional composition analysis method for metagenome sequencing data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210141833A1 (en) * 2019-11-07 2021-05-13 International Business Machines Corporation Optimizing k-mer databases by k-mer subtraction
CN111681704A (en) * 2020-04-21 2020-09-18 华中科技大学鄂州工业技术研究院 Construction method of matK gene-based unknown plant species identification database and database
CN112071366A (en) * 2020-10-13 2020-12-11 南开大学 Metagenome data analysis method based on second-generation sequencing technology
CN113096737A (en) * 2021-03-26 2021-07-09 北京源生康泰基因科技有限公司 Method and system for automatically analyzing pathogen types
CN113539369A (en) * 2021-07-14 2021-10-22 江苏先声医学诊断有限公司 Optimized kraken2 algorithm and application thereof in second-generation sequencing

Also Published As

Publication number Publication date
CN113539369A (en) 2021-10-22
CN113539369B (en) 2022-03-25

Similar Documents

Publication Publication Date Title
US20230366046A1 (en) Systems and methods for analyzing viral nucleic acids
Marchant et al. The C-Fern (Ceratopteris richardii) genome: insights into plant genome evolution with the first partial homosporous fern genome assembly
CN111462821B (en) Pathogenic microorganism analysis and identification system and application
US20200294628A1 (en) Creation or use of anchor-based data structures for sample-derived characteristic determination
US11830580B2 (en) K-mer database for organism identification
WO2018218788A1 (en) Third-generation sequencing sequence alignment method based on global seed scoring optimization
WO2019213811A1 (en) Method, apparatus, and system for detecting chromosomal aneuploidy
US11809498B2 (en) Optimizing k-mer databases by k-mer subtraction
CN112599198A (en) Microorganism species and functional composition analysis method for metagenome sequencing data
WO2023283967A1 (en) Optimized kraken2 algorithm and application thereof in second-generation sequencing
CN112992277A (en) Construction method and application of microbial genome database
WO2020155623A1 (en) Sequence alignment filtering processing method, system and device, and readable storage medium
CN115631789A (en) Pangenome-based group joint variation detection method
CN108595912B (en) Method, device and system for detecting chromosome aneuploidy
WO2017000859A1 (en) Leaping search algorithm for similar sub-sequences in character sequence and application thereof in searching in biological sequence database
CN115064215A (en) Method for tracing strain and identifying attribute through similarity
US11688489B2 (en) Systems and methods for grouping and collapsing sequencing reads
WO2020213736A1 (en) Information processing device, information processing method, program and storage medium
Cai et al. Concod: an effective integration framework of consensus-based calling deletions from next-generation sequencing data
CN114334004B (en) Rapid comparison and identification method for pathogenic microorganisms and application thereof
CN110600083B (en) Calcium acetate-acinetobacter baumannii complex group identification method based on splicing-free assembly WGS data
Chu et al. Improving on hash-based probabilistic sequence classification using multiple spaced seeds and multi-index Bloom filters
CN118197436A (en) Construction method of pathogenic microorganism metagenome database
CN116682496A (en) Pathogenic microorganism genome database and construction method and application thereof
CN114582523A (en) Novel coronavirus genome feature similarity measurement method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21949747

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE