WO2023283967A1 - Optimized kraken2 algorithm and application thereof in second-generation sequencing - Google Patents
Optimized kraken2 algorithm and application thereof in second-generation sequencing Download PDFInfo
- Publication number
- WO2023283967A1 WO2023283967A1 PCT/CN2021/106970 CN2021106970W WO2023283967A1 WO 2023283967 A1 WO2023283967 A1 WO 2023283967A1 CN 2021106970 W CN2021106970 W CN 2021106970W WO 2023283967 A1 WO2023283967 A1 WO 2023283967A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- taxid
- level
- species
- reads
- family
- Prior art date
Links
- 238000012163 sequencing technique Methods 0.000 title abstract description 17
- 238000004422 calculation algorithm Methods 0.000 title description 14
- 238000000034 method Methods 0.000 claims abstract description 60
- 238000001514 detection method Methods 0.000 claims abstract description 44
- 238000004458 analytical method Methods 0.000 claims abstract description 9
- 238000001914 filtration Methods 0.000 claims description 30
- 238000004364 calculation method Methods 0.000 claims description 18
- 238000003766 bioinformatics method Methods 0.000 claims description 11
- 238000007481 next generation sequencing Methods 0.000 claims description 8
- 238000012937 correction Methods 0.000 claims description 6
- 238000004590 computer program Methods 0.000 claims description 4
- 241000894007 species Species 0.000 description 167
- 239000000523 sample Substances 0.000 description 107
- 108020004414 DNA Proteins 0.000 description 75
- 239000013614 RNA sample Substances 0.000 description 75
- 230000008569 process Effects 0.000 description 32
- 240000005893 Pteridium aquilinum Species 0.000 description 22
- 235000009936 Pteridium aquilinum Nutrition 0.000 description 22
- 230000035945 sensitivity Effects 0.000 description 19
- 241000388169 Alphapapillomavirus 7 Species 0.000 description 18
- 241001634906 Cupriavidus gilardii Species 0.000 description 18
- 241000202938 Mycoplasma hyorhinis Species 0.000 description 18
- 241000193998 Streptococcus pneumoniae Species 0.000 description 13
- 241000194017 Streptococcus Species 0.000 description 12
- 238000005457 optimization Methods 0.000 description 12
- 241000606768 Haemophilus influenzae Species 0.000 description 8
- 229940047650 haemophilus influenzae Drugs 0.000 description 8
- 229940031000 streptococcus pneumoniae Drugs 0.000 description 8
- 230000000813 microbial effect Effects 0.000 description 7
- 241001386813 Kraken Species 0.000 description 6
- 241000222122 Candida albicans Species 0.000 description 5
- 241000588724 Escherichia coli Species 0.000 description 5
- 241000588747 Klebsiella pneumoniae Species 0.000 description 5
- 229940095731 candida albicans Drugs 0.000 description 5
- 239000013612 plasmid Substances 0.000 description 5
- 239000013598 vector Substances 0.000 description 5
- 241000607528 Aeromonas hydrophila Species 0.000 description 4
- 241001096540 Cryptococcus gattii VGII Species 0.000 description 4
- 241000194032 Enterococcus faecalis Species 0.000 description 4
- 241000700588 Human alphaherpesvirus 1 Species 0.000 description 4
- 241000701074 Human alphaherpesvirus 2 Species 0.000 description 4
- 241000701024 Human betaherpesvirus 5 Species 0.000 description 4
- 241001455656 Human betaherpesvirus 6B Species 0.000 description 4
- 241000620147 Human mastadenovirus C Species 0.000 description 4
- 241000886703 Human mastadenovirus E Species 0.000 description 4
- 241000588749 Klebsiella oxytoca Species 0.000 description 4
- 241000589242 Legionella pneumophila Species 0.000 description 4
- 241000186779 Listeria monocytogenes Species 0.000 description 4
- 241000588650 Neisseria meningitidis Species 0.000 description 4
- 241000588645 Neisseria sicca Species 0.000 description 4
- 241000589540 Pseudomonas fluorescens Species 0.000 description 4
- 241000193996 Streptococcus pyogenes Species 0.000 description 4
- 230000000694 effects Effects 0.000 description 4
- 229940032049 enterococcus faecalis Drugs 0.000 description 4
- 229940115932 legionella pneumophila Drugs 0.000 description 4
- 241000712431 Influenza A virus Species 0.000 description 3
- 241001300629 Nannochloropsis oceanica Species 0.000 description 3
- 241001134658 Streptococcus mitis Species 0.000 description 3
- 238000007405 data analysis Methods 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 244000005700 microbiome Species 0.000 description 3
- 238000002864 sequence alignment Methods 0.000 description 3
- 238000004088 simulation Methods 0.000 description 3
- 238000012360 testing method Methods 0.000 description 3
- 241001508813 Clavispora lusitaniae Species 0.000 description 2
- 241000711920 Human orthopneumovirus Species 0.000 description 2
- 241000726041 Human respirovirus 1 Species 0.000 description 2
- 241000713196 Influenza B virus Species 0.000 description 2
- 108091028043 Nucleic acid sequence Proteins 0.000 description 2
- 241000873939 Parechovirus A Species 0.000 description 2
- 241000193985 Streptococcus agalactiae Species 0.000 description 2
- 241000700605 Viruses Species 0.000 description 2
- 238000003745 diagnosis Methods 0.000 description 2
- 208000015181 infectious disease Diseases 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 241001225321 Aspergillus fumigatus Species 0.000 description 1
- 241000668755 Aspergillus novofumigatus Species 0.000 description 1
- 241001626696 Aspergillus tanneri Species 0.000 description 1
- 241000894006 Bacteria Species 0.000 description 1
- FGUUSXIOTUKUDN-IBGZPJMESA-N C1(=CC=CC=C1)N1C2=C(NC([C@H](C1)NC=1OC(=NN=1)C1=CC=CC=C1)=O)C=CC=C2 Chemical compound C1(=CC=CC=C1)N1C2=C(NC([C@H](C1)NC=1OC(=NN=1)C1=CC=CC=C1)=O)C=CC=C2 FGUUSXIOTUKUDN-IBGZPJMESA-N 0.000 description 1
- 241001668502 Cladophialophora carrionii Species 0.000 description 1
- 208000035473 Communicable disease Diseases 0.000 description 1
- 241000233866 Fungi Species 0.000 description 1
- 241000190708 Guanarito mammarenavirus Species 0.000 description 1
- 241000735452 Heterobasidion Species 0.000 description 1
- 241000701044 Human gammaherpesvirus 4 Species 0.000 description 1
- 241000725303 Human immunodeficiency virus Species 0.000 description 1
- 206010028980 Neoplasm Diseases 0.000 description 1
- 241000224016 Plasmodium Species 0.000 description 1
- 241000087479 Pseudocercospora fijiensis Species 0.000 description 1
- 241000607768 Shigella Species 0.000 description 1
- 241000422838 Spirometra erinaceieuropaei Species 0.000 description 1
- 241000191967 Staphylococcus aureus Species 0.000 description 1
- 229940091771 aspergillus fumigatus Drugs 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000008280 blood Substances 0.000 description 1
- 210000004369 blood Anatomy 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 201000010099 disease Diseases 0.000 description 1
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000012165 high-throughput sequencing Methods 0.000 description 1
- 230000008676 import Effects 0.000 description 1
- 230000001788 irregular Effects 0.000 description 1
- 201000004792 malaria Diseases 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 150000007523 nucleic acids Chemical group 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 230000000717 retained effect Effects 0.000 description 1
- 238000006748 scratching Methods 0.000 description 1
- 230000002393 scratching effect Effects 0.000 description 1
- 230000009897 systematic effect Effects 0.000 description 1
- 230000008685 targeting Effects 0.000 description 1
- 241001529453 unidentified herpesvirus Species 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A50/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
- Y02A50/30—Against vector-borne diseases, e.g. mosquito-borne, fly-borne, tick-borne or waterborne diseases whose impact is exacerbated by climate change
Definitions
- the invention relates to the field of bioinformatics, in particular to an optimized kraken2 algorithm and its application in next-generation sequencing.
- next-generation sequencing technology is a massively parallel sequencing technology, which has the characteristics of high throughput, high sequencing accuracy, and short timeliness, which perfectly matches the metagenome.
- the demand for metagenomics has led to the widespread application of metagenomics in infection detection.
- Species detection of microbial communities after sequencing is the most important work in metagenomics research. Only by accurately and reliably locating microbial communities can we associate metagenomics with research, such as studying whether a patient's disease is caused by a certain microbial infection (If a person suspects malaria, it is necessary to accurately detect the presence of Plasmodium in the blood to give a definitive diagnosis), metagenomic analysis is a fast, accurate and advanced detection technology, currently used in the auxiliary diagnosis of infectious diseases played an important role in.
- Kraken2 is applied to Illumina second-generation metagenomic sequencing, which has the characteristics of fast analysis speed and high sensitivity, but the specificity is low, and many false positive results are often detected, which is due to the characteristics of the kraken2 algorithm.
- the default read length is 35bp
- kraken2 will give priority to constructing the specific kmers of this species Specific kmers, and there is a certain kmer in multiple species of Streptococcus genus Streptococcus, then locate the kmer under the genus Streptococcus, the same principle, if a certain kmer exists in multiple genera under Streptococcus family Streptococcus, then the kmer Positioned under Streptococcus family.
- the database contains a lot of sequences, such as plasmids/vectors, etc.
- the results of this part of the comparison will also be output, and this part of the results is basically meaningless (it can also be counted as false positive detection).
- the purpose of the present invention is to seek a bioinformatics analysis method that can reduce false positives in sequencing analysis, improve the accuracy of species detection, and is suitable for Illumina second-generation metagenomic sequencing.
- the present invention proposes the following technical solutions:
- the present invention firstly provides a bioinformatics analysis method based on kraken2 single sequence kmer score and overall taxonomy structure statistics, said method comprising the following steps:
- NGS sequencing data is compared using kraken2 to obtain the taxid-kmer result of each sequence
- the hierarchical relationship in step 2) includes one or more hierarchical relationships of serotype/subtype, species, genus and/or family.
- the positioning rules in the step 2) include the following:
- a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is lower than the species level, it is positioned as the species level taxid to which the taxid belongs;
- calculation rules in the step 3) include:
- kmer score (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
- the overall calculation includes:
- the reads of the taxid is the total number of taxid sequences that appear in a sample
- the genus relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same genus reads
- the family relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same family reads
- d filter and retain the species-level taxid reads correction 1, calculate the genus relative ratio, and calculate the species-level taxid genus correction reads with the genus-level taxid reads according to the genus relative ratio;
- the genus relative ratio is the sum of the species-level taxid reads of the same genus after filtering by c and d, and then calculates the ratio of taxid reads at various levels to the sum;
- the genus-level taxid reads include the genus-level taxid reads in b and the reads incorporated into the genus-associated species-level taxid that have not passed the filtering threshold threshold in c;
- the relative ratio of the family is the sum of the species-level taxid reads of the same family after filtering by c and d, and then calculate the ratio of the taxid reads at various levels to the sum;
- the family-level taxid reads include the family-level taxid reads in b and the reads incorporated into the family-related species-level taxid in d that do not pass the filtering threshold threshold.
- said step 5 species-level detection, which is equivalent to the species-level taxid detection, according to c, d obtains the species-level taxid which is the final species taxid, gets b, e, and the sum of the species-level taxid reads obtained by f obtains The final species taxid reads, and calculate the relative abundance based on the sum of reads.
- the database used for comparison in step 1) is nt, refseq or genbank database; preferably, the database is nt database.
- the present invention also provides the application of the method based on kraken2 single sequence kmer score and overall taxonomy structure statistics in next-generation sequencing bioinformatics analysis.
- the present invention also provides a computer-readable medium, which stores a computer program, and when the computer program is executed by a processor, the method described in any one of the above claims is realized.
- the present invention also provides an electronic device, which is characterized in that it includes a processor and a memory, and one or more readable instructions are stored in the memory, and when the one or more readable instructions are executed by the processor, the Claim any one of the above methods.
- the bioinformatics method of the present invention can reduce false positives in bioinformatics analysis, improve the accuracy of species detection, and is suitable for second-generation metagenomic sequencing, including single-end and double-end sequencing.
- the present invention ensures the overall sensitivity through precise positioning of a single sequence and overall systematic optimization.
- the present invention excludes partial comparison results of plasmids/vectors, etc., effectively reducing the situation of meaningless detection.
- FIG. 1 schematic diagram of the system of the present invention
- Figure 2 Schematic diagram of taxid positioning and kmer score scoring for a single sequence
- Fig. 3 Comparison chart of false positive species detected in 9 cases of spike-in sample DNA library, opt represents the process of the present invention, confidence represents the process of kraken2 confidence 0.5+bracken, kraken represents the process of kraken2+bracken, S1-S9 represents 9 samples;
- Fig. 4 Comparison diagram of false positive species detected in 9 cases of spike-in sample RNA library, opt represents the process of the present invention, confidence represents the process of kraken2 confidence 0.5+bracken, kraken represents the process of kraken2+bracken, and S1-S9 represents 9 samples;
- Fig. 5 Sensitivity chart of 12 simulated samples and 9 spike-in samples, opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5+bracken process, kraken represents the kraken2+bracken process, simulated represents 12 simulated samples, spike-in is 9 spike-in samples.
- the terms “comprising”, “comprising”, “having”, “containing” or “involving” are inclusive or open-ended and do not exclude other unrecited elements or method steps .
- the term “consisting of” is considered as a preferred embodiment of the term “comprising”. If in the following a certain group is defined as comprising at least a certain number of embodiments, this is also to be understood as revealing a group which preferably consists only of these embodiments.
- read or “each read” or “single read” in the present invention refers to a nucleic acid sequence generated by a high-throughput sequencing platform.
- alignment result in the present invention: “alignment” in English, refers to the corresponding result between a sequencing readout sequence and a reference sequence, and a sequencing readout sequence can have multiple alignment results at the same time.
- the "kmer” in the present invention refers to continuously cutting a sequence and scratching bases one by one to obtain a substring of k bases.
- the length of reads is L
- the length of k-mer is set to k
- the number of mers is: L-k+1; another example is the sequence AACTGACT, if k is set to 3, it can be divided into 6 k-mers of AAC, ACT, CTG, TGA, GAC, and ACT.
- the "kraken2" in the present invention refers to a high-precision metagenomic sequence classification software based on the kmer algorithm in the field, which can quickly classify sequencing reads into species.
- the "kraken2 optimization algorithm" described in the present invention refers to an optimization system for microbial species-level detection based on the comparison results of kraken2 developed by the present invention, aiming at improving accuracy and reducing false positive detection.
- nt comparison database of the present invention is the kraken2 comparison index database based on NCBI nt database establishment.
- the "taxid” or “taxonomy_id” mentioned in the present invention refers to the id number in the taxonomy database.
- the single sequence kmer score of kraken2 and the overall bioinformatics analysis method based on taxonomy structure statistics of the present invention generally include the following steps:
- the positioning rules may include:
- a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is lower than the species level, it is positioned as the species level taxid to which the taxid belongs;
- Calculation rules can be:
- kmer score (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
- the genus relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same genus reads
- the family relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same family reads
- the species-level taxid reads retained by filtering in 7 and 8 are corrected 1, and the relative ratio is calculated, and the taxid reads at the species level are calculated according to the relative ratio of the species-level taxid and genus correction reads;
- the genus relative ratio is the sum of the species-level taxid reads of the same genus after filtering by 7 and 8, and then calculates the ratio of taxid reads at various levels to the sum;
- the genus-level taxid reads include 6 genus-level taxid reads and 7 reads that do not pass through the genus-related species level taxid of the genus-level taxid;
- the species-level taxid reads filtered by 7 and 8 are corrected 2, and the family-level relative ratio is calculated, and the family-level taxid reads are calculated according to the family-level relative ratio to calculate the species-level taxid family-corrected reads;
- the relative ratio of the family is the sum of the species-level taxid reads of the same family after filtering by 7 and 8, and then calculate the ratio of the taxid reads at various levels to the sum;
- the family-level taxid reads include 6 family-level taxid reads and 8 family-related taxid-incorporated reads that do not pass the filtering threshold threshold.
- Microbial detection which is equivalent to the species-level taxid detection
- the species-level taxid obtained according to 7 and 8 is the final species taxid
- the sum of the species-level taxid reads obtained by 6, 9, and 10 is used to obtain the final species taxid reads, and Relative abundance is calculated based on the sum of reads.
- the problem to be solved in this embodiment is how to ensure the accuracy of the kraken2 comparison results as much as possible through the data analysis method.
- Sequence similarity typically, the sequence degree of Escherichia coli and Shigella exceeds 99%, which is especially common among species under the same genus, which is an important reason for false positive detection and also interferes with real species detected, reducing the sensitivity;
- the so-called data analysis of the sample sequencing results alone refers to: the information obtained is only one sample sequence file (FASTQ), and no other information is known; through some algorithms, the species detection results are output.
- sequence similarity on the one hand, it can be solved through standard genome updates and the progress of species taxonomy, and on the other hand, it can be optimized to a certain extent through algorithms, which is part of the consideration of this patent;
- the host genome can be removed by accurately constructing the host genome sequence and using an alignment algorithm, but it does not guarantee whether the host genome removal is accurate and thorough (removal is not possible) There will still be some residues completely, and excessive removal will reduce the sensitivity of microbial detection), in addition, it is possible to accurately analyze the results of single sequence comparisons in the algorithm, and set the threshold for species detection as a whole, Get a certain degree of optimization, this part is the consideration point of this patent;
- the positioning rules include: accept the taxid positioning given by kraken2 in principle, except for the following cases:
- a sequence is compared to a unique taxid and the taxid is at a lower species level, it will be positioned as the species-level taxid to which the taxid belongs (for example, if a sequence is 76 bp in length and 35 bp in kmer length, 42 kmers will be obtained, and the taxid-kmer comparison result is: 0:10, 1313:32, except for the 10 kmers that cannot be compared, the rest are compared to the taxid 1313, and the taxonomy hierarchy structure is used to locate the taxid 1313 Streptococcus pneumoniae Streptococcus pneumoniae);
- a sequence alignment with more than 2 taxids can be divided into 3 types:
- All taxids in the comparison result are associated with only one species at the species level, and other taxids belong to the serotype/subtype of the species. At the genus and family levels, they are located at the taxid of the species (for example, a sequence length of 76bp, 35bp kmer length will get 42 kmers, the taxid-kmer comparison results are: 0:10, 1313:20, 1301:12, except for the 10 kmers that cannot be compared, 20 kmers are compared to the taxid 1313 , corresponding to Streptococcus pneumoniae Streptococcus pneumoniae through the taxonomy hierarchy structure, and the other 12 kmers were compared to the genus 1301 Streptococcus Streptococcus pneumoniae, since Streptococcus pneumoniae is under the genus Streptococcus, the sequence was mapped to the species taxid 1313 Streptococcus pneumoniae Streptococcus pneumoniae);
- All the taxids in the comparison results are related to the results of the same genus and different species, and finally locate the genus taxid (for example, if a sequence length is 76bp, 35bp kmer length will get 42 kmers, the taxid-kmer comparison result is: 0: 10, 1313:20, 28037:5, 1301:7, in addition to the 10 kmers that cannot be compared, 20 kmers are compared to the taxid 1313, corresponding to Streptococcus pneumoniae Streptococcus pneumoniae through the taxonomy hierarchy structure, and the other 5 The kmer was compared to 28037 Streptococcus mitis light-weight Streptococcus, and 7 kmers were compared to 1301 Streptococcus Streptococcus. Since there were 2 species under Streptococcus, the sequence was mapped to taxid 1301 Streptococcus Streptococcus);
- the calculation rules are:
- kmer score (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
- Selected 4 kinds of bacteria Haemophilus influenzae, Streptococcus pneumoniae, Staphylococcus aureus, Klebsiella pneumoniae
- 2 kinds of fungi Candida albicans, Aspergillus fumigatus
- 4 kinds of viruses human herpesvirus, human papillary Tumor virus, influenza A virus, HIV virus
- GRC38.p13 the sequence length is set to 75bp
- the total data volume is set to 10M
- a total of 4 groups Each group consists of 3 identical samples.
- the reference genomes used by the simulated samples are shown in the table below:
- Score_cutoff summarizes the statistical results of statistical samples under different thresholds. Samples 7-12 are focused on due to the low total microbial count, and their error rate is higher than that of the overall level. From the statistical results, setting a threshold higher than this error rate is It can solve the occurrence of wrong comparison, as shown in the following table:
- each sample is divided into list species false positive species (that is, the species detected by the error comparison is in the same family of 10 simulated species), non-list species false positive Species (the species detected by the wrong comparison are not in the same family as the 10 simulated species), and the species with the highest reads are counted. Since the repeated samples are detected to be completely consistent, only the representative sample results are listed, as shown in the following table:
- the final reads_cutoff is set to correct for species detected in different families. From the comparison of different values of score_cutoff, a slightly lower comparison rate is allowed Under the premise of , score_cutoff is set to 0.5, and reads_cutoff is set to greater than 3 to eliminate the detection of non-list false positive species (the lower the value, the better the sensitivity of true positive species detected by fewer reads).
- Targeting rules include:
- a sequence is aligned with a unique taxid and the taxid is lower than the species level, it will be positioned as the taxid of the species to which the taxid belongs;
- a sequence alignment with more than 2 taxids can be divided into 3 types:
- the calculation rules are:
- kmer score (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
- the false positive and false negative detections are sorted out as follows:
- sample taxi species reads relative abundance result sample 1 340412 Aspergillus novofumigatus 1 0.00011 false positive sample 1 984962 Heterobasidion irregular 1 0.00011 false positive sample 1 145522 Nannochloropsis oceanica 4 0.00044 false positive sample 1 28037 Streptococcus mitis 2 0.00022 false positive sample 1 2656787 Venustapulla echinocandica 1 0.00011 false positive sample 10 86049 Cladophialophora carrionii 1 0.00011 false positive Sample 10 10376 Human gammaherpes virus 4 1 0.00011 false positive sample 10 1873960 Pseudocercospora fijiensis 2 0.00022 false positive Sample 10 2656787 Venustapulla echinocandica 1 0.00011 false positive Sample 10 727 Haemophilus influenzae 0 0 false negative Sample 10 573 Klebsiella pneumoniae 0 0 false negative Sample 11 727 Haemophilus influenzae 0 0 false negative Sample 11 573 Kle
- Embodiment 3 actual sample detection experiment
- the total number of positive species is 148, kraken2 confidence 0.5+bracken has 2 species not detected (sensitivity is 98.6%), the process of the present invention and kraken2+bracken have one species not detected (sensitivity is 99.3%), the performance is similar to the simulated data, In terms of sensitivity, the process of the present invention is the same as kraken2+bracken, slightly higher than the process of kraken2 confidence 0.5+bracken.
- the summary statistics of the false positive species detected by each process of the RNA library are as follows (corresponding to the results in Figure 4, where opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process and corresponds to the fourth column in the table, and kraken represents Kraken2+ bracket process):
- the sensitivity results of the summary simulation samples and spike-in sample statistics are as follows (corresponding to the results in Figure 5, where opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process, and kraken represents the Kraken2+bracken process):
- the detection of false positive species in the process of the present invention will be far lower than that of kraken2+bracken process, and on the basis of ensuring that the sensitivity is higher than kraken2 confidence 0.5+bracken, the detection of false positives will be lower than the latter ( Even when the reads>3 are reported, the detection of false positive species can still be reduced by about 1/3).
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
A biological information analysis method based on a kraken2 single-sequence kmer score and overall taxonomy structure statistics. By means of the method, false positives in biological information analysis can be reduced, and the species detection accuracy can be improved. The method is applicable to second-generation metagenome sequencing analysis.
Description
本申请要求于2021年07月17日提交中国专利局、申请号为202110804351.8、发明名称为“一种优化的kraken2算法及其在二代测序中的应用”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application submitted to the China Patent Office on July 17, 2021, with the application number 202110804351.8, and the title of the invention is "an optimized kraken2 algorithm and its application in next-generation sequencing", all of which The contents are incorporated by reference in this application.
本发明涉及生物信息学领域,特别是涉及一种优化的kraken2算法及其在二代测序中的应用。The invention relates to the field of bioinformatics, in particular to an optimized kraken2 algorithm and its application in next-generation sequencing.
宏基因组群落复杂庞大,需要对大量的DNA进行测序,Illumina二代测序技术是一种大规模平行测序技术,具有通量高,测序准确度高,时效短等特点,正好完美的匹配了宏基因组学的需求,成就了宏基因组学在感染检测的广泛应用。The metagenomic community is complex and huge, and a large amount of DNA needs to be sequenced. Illumina next-generation sequencing technology is a massively parallel sequencing technology, which has the characteristics of high throughput, high sequencing accuracy, and short timeliness, which perfectly matches the metagenome. The demand for metagenomics has led to the widespread application of metagenomics in infection detection.
测序之后微生物群落的物种检测,是宏基因组学研究中的最为重要工作,只有准确可靠地对微生物群落进行精确定位,才能关联宏基因组与研究的关联,比如研究患者的发病是否是某种微生物感染(如某个人怀疑是疟疾,那么需要准确地检测出其血液中存在疟原虫才能最终给出明确诊断),宏基因组分析是一种快速,准确,先进的检测技术,目前在感染类疾病辅助诊断中发挥了重要作用。Species detection of microbial communities after sequencing is the most important work in metagenomics research. Only by accurately and reliably locating microbial communities can we associate metagenomics with research, such as studying whether a patient's disease is caused by a certain microbial infection (If a person suspects malaria, it is necessary to accurately detect the presence of Plasmodium in the blood to give a definitive diagnosis), metagenomic analysis is a fast, accurate and advanced detection technology, currently used in the auxiliary diagnosis of infectious diseases played an important role in.
Kraken2应用于Illumina二代宏基因组测序,具有分析速度快,灵敏度高的特点,但是特异度较低,往往会检测出很多假阳性结果,这是因为kraken2算法的特点。根据taxid与seqid关系,对选取的参考基因组序列快速构建固定长度的kmers(默认为35bp的读长),优先构建某个层级的特异kmers,比如肺炎链球菌Streptococcus pneumoniae,kraken2会优先构建该物种的特异kmers,而链球菌属Streptococcus多个物种也存在某个kmer,则将该kmer定位到链球菌属下,同样的原理,某个kmer存在于链球菌科Streptococcaceae下多个属,则将该kmer定位在链球菌科下。鉴于kraken2的算法,对于某种DNA序列较高的微生物,虽然会有一定的概率会发生错误比对,但是基本上不会干扰该物种的检出。由于二代测序具有读长短的特点,因此很容易出现序列发生错误比对,或者无法精确比对(比如某条来自肺炎链球菌的序列,错误比对到Streptococcus mitis,或者只能比对到Streptococcus属层级),这是影响物种检测准确度的最重要因素。Kraken2 is applied to Illumina second-generation metagenomic sequencing, which has the characteristics of fast analysis speed and high sensitivity, but the specificity is low, and many false positive results are often detected, which is due to the characteristics of the kraken2 algorithm. According to the relationship between taxid and seqid, quickly construct fixed-length kmers for the selected reference genome sequence (the default read length is 35bp), and give priority to constructing a certain level of specific kmers, such as Streptococcus pneumoniae, kraken2 will give priority to constructing the specific kmers of this species Specific kmers, and there is a certain kmer in multiple species of Streptococcus genus Streptococcus, then locate the kmer under the genus Streptococcus, the same principle, if a certain kmer exists in multiple genera under Streptococcus family Streptococcus, then the kmer Positioned under Streptococcus family. In view of the kraken2 algorithm, although there is a certain probability of misalignment for a certain microorganism with a high DNA sequence, it will basically not interfere with the detection of this species. Due to the characteristics of the next-generation sequencing with short read length, it is prone to sequence misalignment, or cannot be accurately aligned (such as a sequence from Streptococcus pneumoniae, which is misaligned to Streptococcus mitis, or can only be aligned to Streptococcus genus level), which is the most important factor affecting the accuracy of species detection.
除此之外,由于数据库包含的序列很多,比如质粒/载体等也在内,因此比对这部分比对的结果也会给出输出,这部分结果基本上是无意义的(也可以算作假阳性检出)。In addition, because the database contains a lot of sequences, such as plasmids/vectors, etc., the results of this part of the comparison will also be output, and this part of the results is basically meaningless (it can also be counted as false positive detection).
鉴于此,提出本发明。In view of this, the present invention is proposed.
发明内容Contents of the invention
本发明的目的是寻求一种能够降低测序分析假阳性,能够提高物种检测的准确度,适用于Illumina二代宏基因组测序的生信分析方法。The purpose of the present invention is to seek a bioinformatics analysis method that can reduce false positives in sequencing analysis, improve the accuracy of species detection, and is suitable for Illumina second-generation metagenomic sequencing.
为实现上述目的,本发明提出如下技术方案:To achieve the above object, the present invention proposes the following technical solutions:
本发明首先提供了一种基于kraken2单条序列kmer评分和整体基于taxonomy结构统计的生信分析方法,所述方法包括如下步骤:The present invention firstly provides a bioinformatics analysis method based on kraken2 single sequence kmer score and overall taxonomy structure statistics, said method comprising the following steps:
1)NGS测序数据使用kraken2进行序列比对,得到每条序列的taxid-kmer结果;1) NGS sequencing data is compared using kraken2 to obtain the taxid-kmer result of each sequence;
2)基于taxonomy数据库建立taxid的层级关系,根据步骤1)taxid-kmer结果获得taxid,并关联taxonomy层级,再根据定位规则重定位taxid;2) Establish the hierarchical relationship of taxid based on the taxonomy database, obtain the taxid according to the result of the taxid-kmer in step 1, and associate it with the taxonomy level, and then relocate the taxid according to the positioning rules;
3)根据每条序列经过步骤2)定位的taxid和步骤1)的taxid-kmer比对结果,计算每条序列的kmer score;3) Calculate the kmer score of each sequence according to the taxid located in step 2) of each sequence and the taxid-kmer comparison result of step 1);
4)根据kmer score和taxonomy层级,对比对结果进行整体计算;4) According to kmer score and taxonomy level, compare and compare the results for overall calculation;
进一步的,还包括Further, it also includes
5)根据4)的整体计算结果进行物种层级检测。5) Carry out species-level detection based on the overall calculation results of 4).
进一步的,所述步骤2)中层级关系包括血清型/亚型、种、属和/或科的一种或多种层级关系。Further, the hierarchical relationship in step 2) includes one or more hierarchical relationships of serotype/subtype, species, genus and/or family.
进一步的,所述步骤2)中定位规则包括如下:Further, the positioning rules in the step 2) include the following:
通常情况下接受kraken2给出的taxid定位,以下情况除外:Normally, the taxid location given by kraken2 is accepted, except for the following cases:
某条序列根据taxid-kmer结果获得唯一taxid且taxid低于种层级,则定位为该taxid所属的种层级taxid;If a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is lower than the species level, it is positioned as the species level taxid to which the taxid belongs;
某条序列根据taxid-kmer结果获得超过2个taxid时,分3种情况:When a sequence obtains more than 2 taxids according to the taxid-kmer results, there are 3 cases:
所有taxid,关联到种层级上只出现1个,其他taxid属于该种的血清型/亚型、属、科层级,则定位到该种层级taxid;For all taxids, only one appears at the species level, and other taxids belong to the serotype/subtype, genus, and family level of the species, and then locate the taxid at the species level;
所有taxid,关联到种层级超过2个且属于同一属,则最终定位到属层级taxid;All taxids that are associated with more than 2 species levels and belong to the same genus will be finally located at the genus level taxid;
所有taxid,关联到属层级超过2个(属层级没有分类也在内)且属于同一科,则最终定位到科层级taxid;All taxids that are associated with more than 2 genus levels (including genus levels without classification) and belong to the same family will be finally located at the family level taxid;
进一步的,所述步骤3)中所述计算的规则包括:Further, the calculation rules in the step 3) include:
最终定位到科层级taxid以下的序列,其kmer score=(科taxid kmers+属taxid kmers+种taxid kmers+亚型/血清型taxid kmers)/总kmers;Finally locate the sequence below the taxid at the family level, its kmer score = (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
最终定位到科层级taxid以上的序列,kmer score设定为0。Finally, the sequence above the taxid at the family level is located, and the kmer score is set to 0.
进一步的,所述步骤4)中整体计算包括:Further, in the step 4), the overall calculation includes:
a、设定一个过滤cutoff阈值,对每条序列根据kmer score进行过滤;a. Set a filter cutoff threshold, and filter each sequence according to the kmer score;
b、对a中经过过滤的序列,统计taxid的reads;b. For the filtered sequence in a, count the reads of taxid;
所述taxid的reads是一个样本出现的taxid的序列总数;The reads of the taxid is the total number of taxid sequences that appear in a sample;
c、设定一个过滤阈值threshold,对b中定位到种层级的taxid进行过滤,计算其属相对比值,排除低于阈值的种层级taxid;c. Set a filtering threshold threshold, filter the taxids located at the species level in b, calculate their relative ratio, and exclude the species level taxids lower than the threshold;
所述属相对比值为某个种层级taxid reads相对于同属reads最高的种层级taxid reads的比值;The genus relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same genus reads;
进一步的,还包括:Further, it also includes:
d、经c过滤的种层级taxid,若缺乏属分类,则计算科相对比值,排除低于过滤阈值threshold种层级taxid;d. If the taxid at the species level filtered by c lacks genus classification, calculate the family relative ratio and exclude the taxid at the species level lower than the filtering threshold threshold;
所述科相对比值为某个种层级taxid reads相对于同科reads最高的种层级taxid reads的比值;The family relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same family reads;
更进一步的,还包括:Further, it also includes:
e、经c,d过滤保留的种层级taxid reads校正1,计算属相对比值,将属层级taxid reads按照属相对比值计算种层级taxid属校正reads;e. After c, d filter and retain the species-level taxid reads correction 1, calculate the genus relative ratio, and calculate the species-level taxid genus correction reads with the genus-level taxid reads according to the genus relative ratio;
所述属相对比值为经c,d过滤后同属的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The genus relative ratio is the sum of the species-level taxid reads of the same genus after filtering by c and d, and then calculates the ratio of taxid reads at various levels to the sum;
所述属层级taxid reads包括b中属层级taxid reads和c中未通过过滤阈值threshold的该属关联种层级taxid并入的reads;The genus-level taxid reads include the genus-level taxid reads in b and the reads incorporated into the genus-associated species-level taxid that have not passed the filtering threshold threshold in c;
f、经c,d过滤的种层级taxid reads校正2,计算科相对比值,将科层级taxid reads按照科相对比值计算种层级taxid科校正reads;f, after c, the kind-level taxid reads filtered by d are corrected 2, and the family-level relative ratio is calculated, and the family-level taxid reads are calculated according to the family-level taxid family-corrected reads;
所述科相对比值为经c,d过滤后同科的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The relative ratio of the family is the sum of the species-level taxid reads of the same family after filtering by c and d, and then calculate the ratio of the taxid reads at various levels to the sum;
所述科层级taxid的reads包含b中科层级taxid reads和d中未通过过滤阈值threshold的该科关联种层级taxid并入的reads。The family-level taxid reads include the family-level taxid reads in b and the reads incorporated into the family-related species-level taxid in d that do not pass the filtering threshold threshold.
进一步的,所述步骤5)物种层级检测,即等同于种层级taxid检测,根据c,d得到种层级taxid即为最终的物种taxid,取b,e,f得到的种层级taxid reads之和得到最终的物种taxid reads,并根据reads之和计算相对丰度。Further, said step 5) species-level detection, which is equivalent to the species-level taxid detection, according to c, d obtains the species-level taxid which is the final species taxid, gets b, e, and the sum of the species-level taxid reads obtained by f obtains The final species taxid reads, and calculate the relative abundance based on the sum of reads.
进一步的,所述步骤1)中比对采用的数据库为nt、refseq或genbank数据库;优选的,所述数据库为nt数据库。Further, the database used for comparison in step 1) is nt, refseq or genbank database; preferably, the database is nt database.
本发明首先还提供了基于kraken2单条序列kmer评分和整体基于taxonomy结构统计的方法在二代测序生信分析中的应用。Firstly, the present invention also provides the application of the method based on kraken2 single sequence kmer score and overall taxonomy structure statistics in next-generation sequencing bioinformatics analysis.
本发明还提供一种计算机可读介质,其存储有计算机程序,所述计算机程序被处理器执行时,实现权利要求上述任一项所述方法。The present invention also provides a computer-readable medium, which stores a computer program, and when the computer program is executed by a processor, the method described in any one of the above claims is realized.
本发明还提供一种电子设备,其特征在于,包括处理器以及存储器,所述存储器上存储一条或多条可读指令,所述一条或多条可读指令被所述处理器执行时,实现权利要求上述任一项所述方法。The present invention also provides an electronic device, which is characterized in that it includes a processor and a memory, and one or more readable instructions are stored in the memory, and when the one or more readable instructions are executed by the processor, the Claim any one of the above methods.
本发明有益的技术效果:Beneficial technical effect of the present invention:
1)本发明所述生信方法能够降低生信分析的检出假阳性,能够提高物种检测的准确度,适用于二代宏基因组测序,包括单端和双端测序等。1) The bioinformatics method of the present invention can reduce false positives in bioinformatics analysis, improve the accuracy of species detection, and is suitable for second-generation metagenomic sequencing, including single-end and double-end sequencing.
2)本发明通过对单条序列精确定位和整体的系统性优化,保证了整体的灵敏度。2) The present invention ensures the overall sensitivity through precise positioning of a single sequence and overall systematic optimization.
3)本发明通过引入taxonomy,排除了质粒/载体部分比对结果等,有效降低了无意义检出的情况。3) By introducing taxonomy, the present invention excludes partial comparison results of plasmids/vectors, etc., effectively reducing the situation of meaningless detection.
为了更清楚地说明本发明具体实施方式或现有技术中的技术方案,下面将对具体实施方式或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施方式,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the specific implementation of the present invention or the technical solutions in the prior art, the following will briefly introduce the accompanying drawings that need to be used in the specific implementation or description of the prior art. Obviously, the accompanying drawings in the following description The drawings show some implementations of the present invention, and those skilled in the art can obtain other drawings based on these drawings without any creative work.
图1本发明体系简图;Fig. 1 schematic diagram of the system of the present invention;
图2单条序列进行taxid定位和kmer score评分示意图;Figure 2 Schematic diagram of taxid positioning and kmer score scoring for a single sequence;
图3 9例spike-in样本DNA文库检出的假阳性物种比较图,opt代表本发明流程,confidence代表kraken2 confidence 0.5+bracken流程,kraken代表kraken2+bracken流程,S1-S9代表9个样本;Fig. 3 Comparison chart of false positive species detected in 9 cases of spike-in sample DNA library, opt represents the process of the present invention, confidence represents the process of kraken2 confidence 0.5+bracken, kraken represents the process of kraken2+bracken, S1-S9 represents 9 samples;
图4 9例spike-in样本RNA文库检出的假阳性物种比较图,opt代表本发明流程,confidence代表kraken2 confidence 0.5+bracken流程,kraken代表kraken2+bracken流程,S1-S9代表9个样本;Fig. 4 Comparison diagram of false positive species detected in 9 cases of spike-in sample RNA library, opt represents the process of the present invention, confidence represents the process of kraken2 confidence 0.5+bracken, kraken represents the process of kraken2+bracken, and S1-S9 represents 9 samples;
图5 12例模拟样本和9例spike-in样本灵敏度统计图,opt代表本发明流程,confidence代表kraken2 confidence 0.5+bracken流程,kraken代表kraken2+bracken流程,simulated代表12例模拟样本,spike-in即9例spike-in样本。Fig. 5 Sensitivity chart of 12 simulated samples and 9 spike-in samples, opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5+bracken process, kraken represents the kraken2+bracken process, simulated represents 12 simulated samples, spike-in is 9 spike-in samples.
下面将结合实施例对本发明的实施方案进行详细描述,但是本领域技术人员将会理解,下列实施例仅用于说明本发明,而不应视为限制本发明的范围,并且所述实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。Embodiments of the present invention will be described in detail below in conjunction with examples, but those skilled in the art will understand that following examples are only used to illustrate the present invention, and should not be considered as limiting the scope of the invention, and described examples are Some, but not all, embodiments of the invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.
部分术语定义Definition of some terms
除非在下文中另有定义,本发明具体实施方式中所用的所有技术术语和科学术语的含义意图与本领域技术人员通常所理解的相同。虽然相信以下术语对于本领域技术人员很好理解,但仍然阐述以下定义以更好地解释本发明。Unless otherwise defined hereinafter, all technical and scientific terms used in the detailed description of the present invention have the same meanings as commonly understood by those skilled in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present invention.
如本发明中所使用,术语“包括”、“包含”、“具有”、“含有”或“涉及”为包含性的(inclusive)或开放式的,且不排除其它未列举的元素或方法步骤。术语“由…组成”被认为是术语“包含”的优选实施方案。如果在下文中某一组被定义为包含至少一定数目的实施方案,这也应被理解为揭示了一个优选地仅由这些实施方案组成的组。As used herein, the terms "comprising", "comprising", "having", "containing" or "involving" are inclusive or open-ended and do not exclude other unrecited elements or method steps . The term "consisting of" is considered as a preferred embodiment of the term "comprising". If in the following a certain group is defined as comprising at least a certain number of embodiments, this is also to be understood as revealing a group which preferably consists only of these embodiments.
本发明中的术语“大约”、“大体”表示本领域技术人员能够理解的仍可保证论及特征的技术效果的准确度区间。该术语通常表示偏离指示数值的±10%,优选±5%。The terms "about" and "approximately" in the present invention represent the range of accuracy that can be understood by those skilled in the art and still guarantee the technical effect of the mentioned feature. The term generally means ±10%, preferably ±5%, of the indicated value.
在提及单数形式名词时使用的不定冠词或定冠词例如“一个”或“一种”,“所述”,包括该名词的复数形式。The use of an indefinite or definite article when referring to a noun in the singular eg "a" or "an", "the", includes a plural of that noun.
此外,说明书和权利要求书中的术语第一、第二、第三、(a)、(b)、(c)以及诸如此类,是用于区分相似的元素,不是描述顺序或时间次序必须的。应理解,如此应用的术语在适当的环境下可互换,并且本发明描述的实施方案能以不同于本发明描述或举例说明的其它顺序实施。In addition, the terms first, second, third, (a), (b), (c) and the like in the specification and claims are used to distinguish similar elements and are not necessary to describe the order or time order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments described herein are capable of operation in other sequences than described or illustrated herein.
本发明中的“read”或“每条reads”或“单条reads”,指高通量测序平台产生的一条核酸序列。"read" or "each read" or "single read" in the present invention refers to a nucleic acid sequence generated by a high-throughput sequencing platform.
本发明中的术语“比对结果”:英文为“alignment”,指一条测序读出序列与一条参考序列之间的对应结果,一条测序读出序列可以同时具有多个比对结果。The term "alignment result" in the present invention: "alignment" in English, refers to the corresponding result between a sequencing readout sequence and a reference sequence, and a sequencing readout sequence can have multiple alignment results at the same time.
本发明所述的“kmer”是指将一条序列连续切割,逐个碱基划动得到k个碱基的子字符串,例如reads长度为L,k-mer长度设为k,则产生的k-mers数目为:L-k+1;再例如序列AACTGACT,设置k为3,则可以将其分割为AAC、ACT、CTG、TGA、GAC、ACT共6个k-mers。The "kmer" in the present invention refers to continuously cutting a sequence and scratching bases one by one to obtain a substring of k bases. For example, the length of reads is L, and the length of k-mer is set to k, then the generated k- The number of mers is: L-k+1; another example is the sequence AACTGACT, if k is set to 3, it can be divided into 6 k-mers of AAC, ACT, CTG, TGA, GAC, and ACT.
本发明所述的“kraken2”是指本领域的一个基于kmer算法的高精度宏基因组序列分类软件,能够快速的将测序reads进行物种分类。The "kraken2" in the present invention refers to a high-precision metagenomic sequence classification software based on the kmer algorithm in the field, which can quickly classify sequencing reads into species.
本发明所述的“kraken2优化算法”是指本发明开发的一种基于kraken2比对结果进行微生物物种层级检测的优化体系,旨在提高准确度和降低假阳性检出。The "kraken2 optimization algorithm" described in the present invention refers to an optimization system for microbial species-level detection based on the comparison results of kraken2 developed by the present invention, aiming at improving accuracy and reducing false positive detection.
本发明所述的“nt比对数据库”是基于NCBI nt数据库建立的kraken2比对索引数据库。"nt comparison database" of the present invention is the kraken2 comparison index database based on NCBI nt database establishment.
本发明所述的“taxid”或“taxonomy_id”是指taxonomy数据库中的id号。The "taxid" or "taxonomy_id" mentioned in the present invention refers to the id number in the taxonomy database.
本发明所述的于kraken2单条序列kmer评分和整体基于taxonomy结构统计的生信分析方法,大体包括如下步骤:The single sequence kmer score of kraken2 and the overall bioinformatics analysis method based on taxonomy structure statistics of the present invention generally include the following steps:
1.使用kraken2进行二代测序数据的序列比对,数据库使用,比如nt等比对数据 库;1. Use kraken2 for sequence comparison of next-generation sequencing data, using databases, such as nt and other comparison databases;
2.整理taxonomy数据库,建立基于taxid的血清型/亚型-种-属-科的层级关系或者任意一种或多种层级关系,并根据kraken2每条序列taxid-kmer比对结果的taxid引入相应的taxonomy层级;2. Organize the taxonomy database, establish the hierarchical relationship of serotype/subtype-species-genus-family or any one or more hierarchical relationships based on taxid, and import the corresponding taxid according to the taxid-kmer comparison result of each sequence of kraken2 The taxonomy level;
3.每条reads进行taxid重新定位和kmer score计算;3. Carry out taxid relocation and kmer score calculation for each reads;
4.根据kmer score和taxonomy层级结构,对比对结果进行整体的计算;4. According to kmer score and taxonomy hierarchical structure, compare and compare the results for overall calculation;
5.根据计算结果进行微生物检测。5. Carry out microbial detection according to the calculation results.
示例性的,可以具体包括如下步骤:Exemplarily, the following steps may be specifically included:
1.使用kraken2进行序列比对;1. Use kraken2 for sequence alignment;
2.引入taxonomy层级,根据单条序列taxid-kmer比对结果依据定位规则重新定位其层级和对应的taxid;2. Introduce the taxonomy level, and relocate its level and corresponding taxid according to the positioning rules according to the single sequence taxid-kmer comparison result;
所述定位规则可以包括:The positioning rules may include:
原则上接受kraken2给出的taxid定位,以下情况除外:In principle, accept the taxid positioning given by kraken2, except for the following cases:
某条序列根据taxid-kmer结果获得唯一taxid且taxid低于种层级,则定位为该taxid所属的种层级taxid;If a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is lower than the species level, it is positioned as the species level taxid to which the taxid belongs;
某条序列根据taxid-kmer结果获得超过2个taxid时,分3种情况:When a sequence obtains more than 2 taxids according to the taxid-kmer results, there are 3 cases:
所有taxid,关联到种层级上只出现1个,其他taxid属于该种的血清型/亚型、属、科层级,则定位到该种层级taxid;For all taxids, only one appears at the species level, and other taxids belong to the serotype/subtype, genus, and family level of the species, and then locate the taxid at the species level;
所有taxid,关联到种层级超过2个且属于同一属,则最终定位到属层级taxid;All taxids that are associated with more than 2 species levels and belong to the same genus will be finally located at the genus level taxid;
所有taxid,关联到属层级超过2个(属层级没有分类也在内)且属于同一科,则最终定位到科层级taxid;All taxids that are associated with more than 2 genus levels (including genus levels without classification) and belong to the same family will be finally located at the family level taxid;
3.根据每条序列经过2定位的taxid,以及上述taxid-kmer比对结果,根据计算规则计算每条序列的kmer score评分;3. Calculate the kmer score of each sequence according to the calculation rules according to the taxid of each sequence after 2 positioning and the above-mentioned taxid-kmer comparison results;
计算规则可以为:Calculation rules can be:
最终定位到科以下的序列,其kmer score=(科taxid kmers+属taxid kmers+种taxid kmers+亚型/血清型taxid kmers)/总kmers;Finally locate the sequence below the family, its kmer score = (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
最终定位到科以上的序列,kmer score设定为0。Finally, the sequence above the family is located, and the kmer score is set to 0.
4.将定位到噬菌体/质粒/载体的序列,全部标记为未比对;4. Mark all sequences mapped to phage/plasmid/vector as unaligned;
5.设定一个过滤cutoff阈值(后文全部以score_cutoff表示),对每条序列根据kmer score得分进行过滤;5. Set a filter cutoff threshold (represented by score_cutoff in the following), and filter each sequence according to the kmer score;
6.对5中经过过滤的序列,统计taxid的reads;6. For the filtered sequence in 5, count the reads of taxid;
7.设定一个过滤阈值threshold,对6中定位到种层级的taxid进行过滤,计算其属相对比值,排除低于阈值的种层级taxid;7. Set a filtering threshold, filter the taxids located at the species level in 6, calculate their relative ratio, and exclude the species-level taxids below the threshold;
所述属相对比值为某个种层级taxid reads相对于同属reads最高的种层级taxid reads的比值;The genus relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same genus reads;
8.经7过滤的种层级taxid,若缺乏属分类,则计算科相对比值,排除低于过滤阈值threshold种层级taxid;8. If the taxid at the species level filtered by 7 lacks genus classification, calculate the family relative ratio and exclude the taxid at the species level lower than the filtering threshold threshold;
所述科相对比值为某个种层级taxid reads相对于同科reads最高的种层级taxid reads的比值;The family relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same family reads;
9.经7,8过滤保留的种层级taxid reads校正1,计算属相对比值,将属层级taxid reads按照属相对比值计算种层级taxid属校正reads;9. The species-level taxid reads retained by filtering in 7 and 8 are corrected 1, and the relative ratio is calculated, and the taxid reads at the species level are calculated according to the relative ratio of the species-level taxid and genus correction reads;
所述属相对比值为经7,8过滤后同属的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The genus relative ratio is the sum of the species-level taxid reads of the same genus after filtering by 7 and 8, and then calculates the ratio of taxid reads at various levels to the sum;
所述属层级taxid reads包括6中属层级taxid reads和7中未通过过滤阈值threshold的该属关联种层级taxid并入的reads;The genus-level taxid reads include 6 genus-level taxid reads and 7 reads that do not pass through the genus-related species level taxid of the genus-level taxid;
10.经7,8过滤的种层级taxid reads校正2,计算科相对比值,将科层级taxid reads按照科相对比值计算种层级taxid科校正reads;10. The species-level taxid reads filtered by 7 and 8 are corrected 2, and the family-level relative ratio is calculated, and the family-level taxid reads are calculated according to the family-level relative ratio to calculate the species-level taxid family-corrected reads;
所述科相对比值为经7,8过滤后同科的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The relative ratio of the family is the sum of the species-level taxid reads of the same family after filtering by 7 and 8, and then calculate the ratio of the taxid reads at various levels to the sum;
所述科层级taxid的reads包含6中科层级taxid reads和8中未通过过滤阈值threshold的该科关联种层级taxid并入的reads。The family-level taxid reads include 6 family-level taxid reads and 8 family-related taxid-incorporated reads that do not pass the filtering threshold threshold.
11.微生物检测,即等同于种层级taxid检测,根据7,8得到种层级taxid即为最终的物种taxid,取6,9,10得到的种层级taxid reads之和得到最终的物种taxid reads,并根据reads之和计算相对丰度。11. Microbial detection, which is equivalent to the species-level taxid detection, the species-level taxid obtained according to 7 and 8 is the final species taxid, and the sum of the species-level taxid reads obtained by 6, 9, and 10 is used to obtain the final species taxid reads, and Relative abundance is calculated based on the sum of reads.
以下描述仅是为了帮助理解本发明而提供。这些描述不应被理解为具有小于本领域技术人员所理解的范围。The following description is provided only to help the understanding of the present invention. These descriptions should not be read with a scope less than that understood by those skilled in the art.
实施例1方法体系的设计优化The design optimization of embodiment 1 method system
本实施例要解决的问题是如何通过数据分析方法,尽量保证kraken2比对结果的准确度。The problem to be solved in this embodiment is how to ensure the accuracy of the kraken2 comparison results as much as possible through the data analysis method.
1.首先,对于该问题,可以拆分为2个问题,如何降低假阳性物种检出以提高特异度,以及保证真实物种检出以获得较高的灵敏度,两者达到最佳平衡,即得到最好的准确度。通过分析kraken2是如何产生假阳性物种结果,以及灵敏度会降低的情况。1. First of all, for this problem, it can be divided into two problems, how to reduce the detection of false positive species to improve specificity, and ensure the detection of real species to obtain higher sensitivity. The two achieve the best balance, that is, get best accuracy. By analyzing how kraken2 produces false positive species results, and the cases where the sensitivity will be reduced.
a)数据库的冗余,无论是refseq,genbank还是nt,都存在着大量的参考基因组冗 余,这是造成假阳性检出的重要原因,错误的比对也同样会降低灵敏度;a) The redundancy of the database, whether it is refseq, genbank or nt, there is a large amount of reference genome redundancy, which is an important reason for false positive detection, and wrong comparison will also reduce the sensitivity;
b)序列相似性,典型的如大肠杆菌与志贺杆菌的序列度超过99%,这种情况在同属下的物种间尤为常见,这是造成假阳性检出的重要原因,也会干扰真实物种检出,降低了灵敏度;b) Sequence similarity, typically, the sequence degree of Escherichia coli and Shigella exceeds 99%, which is especially common among species under the same genus, which is an important reason for false positive detection and also interferes with real species detected, reducing the sensitivity;
c)宿主序列(通常指人源宿主)干扰,由于目前的宏基因组基本上来自于宿主样本采集,不可避免的会存在大量的宿主序列,这部分序列会一定程度上影响微生物的检测;c) host sequence (usually refers to human host) interference, since the current metagenomic basically comes from host sample collection, there will inevitably be a large number of host sequences, which will affect the detection of microorganisms to a certain extent;
d)mNGS序列较短,序列较短造成了每条序列得到的kmer较少,这在75bp单端测序上更为明显,容易发生错误比对。d) The mNGS sequence is short, which results in fewer kmers for each sequence, which is more obvious in 75bp single-end sequencing, and misalignment is prone to occur.
2.其次,界定以上哪些情况是可以单独通过对样本测序结果的数据分析来解决的。2. Secondly, define which of the above situations can be solved through data analysis of the sample sequencing results alone.
所谓单独通过对样本测序结果的数据分析,是指:所获得的信息只有一个样本序列文件(FASTQ),而不知道其他信息;通过一些算法,输出物种检测结果。The so-called data analysis of the sample sequencing results alone refers to: the information obtained is only one sample sequence file (FASTQ), and no other information is known; through some algorithms, the species detection results are output.
对于“a)数据库的冗余”,一方面可以通过在数据库整理,去冗余中来实现,但是数据库过度精简会造成假阴性的发生,所以除了数据库优化之外,还可以通过算法进行一定程度的降低,算法部分是本专利的考虑点;For "a) database redundancy", on the one hand, it can be achieved by sorting out the database and removing redundancy, but excessive simplification of the database will cause false negatives, so in addition to database optimization, algorithms can also be used to a certain extent The reduction of the algorithm part is the consideration point of this patent;
对于“b)序列相似性”,一方面可以通过标准基因组更新和物种分类学的进步来解决,另外可以通过算法来进行一定程度的优化,这部分是本专利的考虑点;For "b) sequence similarity", on the one hand, it can be solved through standard genome updates and the progress of species taxonomy, and on the other hand, it can be optimized to a certain extent through algorithms, which is part of the consideration of this patent;
对于“c)宿主序列(通常指人源序列)干扰”,这个一方面可以通过精准构建宿主的基因组序列和使用比对算法去除宿主基因组,但是并不能保证宿主基因组去除的是否准确彻底(去除不彻底仍会有一定的残留,另外去除过度会降低微生物检出的灵敏度),除此之外可以通过在算法上对单条序列比对结果进行精确分析,和整体上设定物种检出的阈值,得到一定程度的优化,这部分是本专利的考虑点;For "c) host sequence (usually human-derived sequence) interference", on the one hand, the host genome can be removed by accurately constructing the host genome sequence and using an alignment algorithm, but it does not guarantee whether the host genome removal is accurate and thorough (removal is not possible) There will still be some residues completely, and excessive removal will reduce the sensitivity of microbial detection), in addition, it is possible to accurately analyze the results of single sequence comparisons in the algorithm, and set the threshold for species detection as a whole, Get a certain degree of optimization, this part is the consideration point of this patent;
对于“d)mNGS序列较短”,这部分是技术硬性问题,需要通过技术的进步来提升。As for "d) the mNGS sequence is short", this part is a technically hard problem, which needs to be improved through technological progress.
3.kraken2算法的具体优化方案3. Specific optimization scheme of kraken2 algorithm
针对2中的3个优化点,结合kraken2的比对原则,建立了如下的优化体系:For the three optimization points in 2, combined with the comparison principle of kraken2, the following optimization system was established:
3.1单条序列的处理3.1 Processing of a single sequence
a)引入taxonomy层级,根据单条序列taxid-kmer比对结果依据定位规则重新定位其层级和对应的taxid;a) Introduce the taxonomy level, and relocate its level and corresponding taxid according to the positioning rules according to the single sequence taxid-kmer comparison result;
定位规则包括:原则上接受kraken2给出的taxid定位,以下情况除外:The positioning rules include: accept the taxid positioning given by kraken2 in principle, except for the following cases:
某条序列比对到唯一taxid且taxid低种层级,则定位为该taxid所属的种层级taxid(举例,某条序列长度76bp,35bp kmer长度则得到42个kmer,taxid-kmer比对结果为:0:10,1313:32,除了不能比对的10个kmer之外,其余都比对到1313这个taxid上,通过taxonomy层级结构定位到taxid 1313 Streptococcus pneumoniae肺炎链球菌);If a sequence is compared to a unique taxid and the taxid is at a lower species level, it will be positioned as the species-level taxid to which the taxid belongs (for example, if a sequence is 76 bp in length and 35 bp in kmer length, 42 kmers will be obtained, and the taxid-kmer comparison result is: 0:10, 1313:32, except for the 10 kmers that cannot be compared, the rest are compared to the taxid 1313, and the taxonomy hierarchy structure is used to locate the taxid 1313 Streptococcus pneumoniae Streptococcus pneumoniae);
某条序列比对超过2个taxid,分为3种:A sequence alignment with more than 2 taxids can be divided into 3 types:
比对结果中的所有taxid,关联到种层级上只有1个物种,其他taxid属于该种的血 清型/亚型,属、科层级,则定位到该种taxid(举例,某条序列长度76bp,35bp kmer长度则得到42个kmer,taxid-kmer比对结果为:0:10,1313:20,1301:12,除了不能比对的10个kmer之外,20个kmer比对到1313这个taxid上,通过taxonomy层级结构对应Streptococcus pneumoniae肺炎链球菌,另外12个kmer比对到1301Streptococcus链球菌属,由于肺炎链球菌在链球菌属下,则该序列定位到种taxid 1313 Streptococcus pneumoniae肺炎链球菌);All taxids in the comparison result are associated with only one species at the species level, and other taxids belong to the serotype/subtype of the species. At the genus and family levels, they are located at the taxid of the species (for example, a sequence length of 76bp, 35bp kmer length will get 42 kmers, the taxid-kmer comparison results are: 0:10, 1313:20, 1301:12, except for the 10 kmers that cannot be compared, 20 kmers are compared to the taxid 1313 , corresponding to Streptococcus pneumoniae Streptococcus pneumoniae through the taxonomy hierarchy structure, and the other 12 kmers were compared to the genus 1301 Streptococcus Streptococcus pneumoniae, since Streptococcus pneumoniae is under the genus Streptococcus, the sequence was mapped to the species taxid 1313 Streptococcus pneumoniae Streptococcus pneumoniae);
比对结果中的所有taxid,关联出同属不同种的结果,则最终定位到属taxid(举例,某条序列长度76bp,35bp kmer长度则得到42个kmer,taxid-kmer比对结果为:0:10,1313:20,28037:5,1301:7,除了不能比对的10个kmer之外,20个kmer比对到1313这个taxid上,通过taxonomy层级结构对应Streptococcus pneumoniae肺炎链球菌,另外5个kmer比对到28037 Streptococcus mitis轻型链球菌,7个kmer比对到1301 Streptococcus链球菌属,由于链球菌属下出现了2个种,则该序列定位到taxid 1301 Streptococcus链球菌属);All the taxids in the comparison results are related to the results of the same genus and different species, and finally locate the genus taxid (for example, if a sequence length is 76bp, 35bp kmer length will get 42 kmers, the taxid-kmer comparison result is: 0: 10, 1313:20, 28037:5, 1301:7, in addition to the 10 kmers that cannot be compared, 20 kmers are compared to the taxid 1313, corresponding to Streptococcus pneumoniae Streptococcus pneumoniae through the taxonomy hierarchy structure, and the other 5 The kmer was compared to 28037 Streptococcus mitis light-weight Streptococcus, and 7 kmers were compared to 1301 Streptococcus Streptococcus. Since there were 2 species under Streptococcus, the sequence was mapped to taxid 1301 Streptococcus Streptococcus);
比对结果中的所有taxid,关联出同科不同属的结果,则最终定位到科taxid;Comparing all taxids in the results, and correlating the results of different genus in the same family, the taxid of the family is finally located;
b)根据每条序列经过a)定位的taxid和taxid-kmer比对结果,根据计算规则计算每条序列的kmer score;b) Calculate the kmer score of each sequence according to the calculation rules according to the alignment results of taxid and taxid-kmer located in a) for each sequence;
计算规则为:The calculation rules are:
最终定位到科以下的序列,其kmer score=(科taxid kmers+属taxid kmers+种taxid kmers+亚型/血清型taxid kmers)/总kmers;Finally locate the sequence below the family, its kmer score = (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
最终定位到科以上的序列,kmer score设定为0。Finally, the sequence above the family is located, and the kmer score is set to 0.
3.2整体的处理3.2 Overall processing
c)设定一个score_cutoff,根据每条序列的kmer score判定其结果是否可信,未过score_cutoff视为不可信结果予以排除;c) Set a score_cutoff, judge whether the result is credible according to the kmer score of each sequence, and exclude it as an unreliable result if the score_cutoff is not exceeded;
d)对于最终定位到噬菌体/质粒/载体的序列,全部标记为未比对结果;d) For the sequences that are finally mapped to phage/plasmid/vector, all are marked as unaligned results;
e)对c)d)中经过过滤的序列,统计taxid的reads;e) For the filtered sequence in c) and d), count the reads of taxid;
f)对e)中定位到种层级的taxid关联属和科,计算每个属下所有种的属相对比值(属内各种reads数相对于属内reads最高种reads数的比值),设定一个threshold,将属相对比值低于过滤阈值的这部分种reads重新定位到属层级;f) For the taxid-associated genera and families located at the species level in e), calculate the genus relative ratio of all species under each genus (the ratio of the number of reads in the genus to the number of reads in the highest species of reads in the genus), set A threshold to relocate the reads of the species whose genus relative ratio is lower than the filtering threshold to the genus level;
g)类同于f),对缺乏属信息但是有明确科信息的种,计算科相对比值(同科下每个种相对于该科内reads最高种的比值),将科相对比值低于f)中过滤阈值threshold这部分种reads重新定位到科层级;g) Similar to f), for species that lack genus information but have clear family information, calculate the family relative ratio (the ratio of each species under the same family to the species with the highest reads in the family), and make the family relative ratio lower than f ) in the filtering threshold threshold, this part of the reads is relocated to the branch level;
h)对于最终比对到属层级的reads(包括原始比对到属层级的reads和步骤f)未通过过滤阈值threshold的该属下种reads之和),计算该属经f)过滤剩余种的reads总和,并计算该属剩余种相对总和的比值,之后根据各剩余种的比值,将最终比对到属 层级的reads分配到各剩余种上;h) For the reads that are finally compared to the genus level (including the reads that were originally compared to the genus level and the sum of the reads of the species under the genus that did not pass the filtering threshold threshold in step f), calculate the remaining species of the genus that have been filtered by f) The sum of reads, and calculate the ratio of the relative sum of the remaining species of the genus, and then assign the reads that are finally compared to the genus level to each remaining species according to the ratio of each remaining species;
i)对于最终比对到科层级的reads(包含原始比对到科层级的reads和g)步骤未通过过滤阈值threshold的该科下种reads之和),计算该科经过g)过滤剩余种的reads总和,并计算剩余种相对总和的比值,之后根据各剩余种的比值,将最终比对到科层级的reads分配到各剩余种上;i) For the reads that are finally compared to the family level (including the reads that were originally compared to the family level and the sum of the reads of the family that did not pass the filtering threshold threshold in the g) step), calculate the remaining species of the family that have been filtered by g) The sum of the reads, and calculate the ratio of the relative sum of the remaining species, and then according to the ratio of each remaining species, the reads that are finally compared to the family level are assigned to the remaining species;
j)报出最终物种reads(即经过h),i)处理的种reads)和相对丰度时,设定一个reads过滤阈值reads_cutoff,低于某个阈值的物种,不予统计和报出。j) When reporting the final species reads (that is, species reads processed by h), i) and relative abundance, set a reads filtering threshold reads_cutoff, and species below a certain threshold will not be counted and reported.
4、kraken2的模拟数据测试和建立相应的各个阈值(score_cutoff,threshold,reads_cutoff)4. Kraken2's simulated data test and establishment of corresponding thresholds (score_cutoff, threshold, reads_cutoff)
建立测试数据集,测试优化方法,和初步建立方法中涉及的阈值Create a test dataset, test the optimization method, and initially establish the thresholds involved in the method
以选取的4种细菌(流感嗜血杆菌,肺炎链球菌,金黄色葡萄球菌,肺炎克雷伯菌),2种真菌(白色念球菌,烟曲霉),4种病毒(人类疱疹病毒,人乳头瘤病毒,甲型流感病毒,HIV病毒),按照一定的reads比例,加入到人源宿主reads(GRC38.p13)中,序列长度设定为75bp,总数据量设定为10M,一共4组,每组由3个完全一致的样本组成。模拟样本使用的参考基因组如下表所示:Selected 4 kinds of bacteria (Haemophilus influenzae, Streptococcus pneumoniae, Staphylococcus aureus, Klebsiella pneumoniae), 2 kinds of fungi (Candida albicans, Aspergillus fumigatus), 4 kinds of viruses (human herpesvirus, human papillary Tumor virus, influenza A virus, HIV virus), according to a certain reads ratio, added to the human host reads (GRC38.p13), the sequence length is set to 75bp, the total data volume is set to 10M, a total of 4 groups, Each group consists of 3 identical samples. The reference genomes used by the simulated samples are shown in the table below:
具体的模拟样本如下表所示(其中按照顺序3个为重复的样本,比如样本1,2,3为完全一致的样本):The specific simulation samples are shown in the following table (there are 3 repeated samples in order, such as samples 1, 2, and 3 are completely consistent samples):
Score_cutoff在不同阈值下统计样本的统计结果汇总,样本7-12由于微生物总量较低,重点关注,其错误率要高于整体层级,从统计结果来看设定高于此错误率的threshold就可以解决错误比对的发生,具体如下表所示:Score_cutoff summarizes the statistical results of statistical samples under different thresholds. Samples 7-12 are focused on due to the low total microbial count, and their error rate is higher than that of the overall level. From the statistical results, setting a threshold higher than this error rate is It can solve the occurrence of wrong comparison, as shown in the following table:
鉴于score_cutoff在0.5和0.4两个数值达到错误率相同层级,对每个样本分为list 物种假阳性物种(即错误比对检出的物种在10个模拟物种的同科),非list物种假阳性物种(错误比对检出的物种不在10个模拟物种的同科),统计其各自reads最高的物种,由于重复性样本检出完全一致,只列出代表样本结果,具体如下表所示:In view of the fact that score_cutoff reaches the same level of error rate at 0.5 and 0.4, each sample is divided into list species false positive species (that is, the species detected by the error comparison is in the same family of 10 simulated species), non-list species false positive Species (the species detected by the wrong comparison are not in the same family as the 10 simulated species), and the species with the highest reads are counted. Since the repeated samples are detected to be completely consistent, only the representative sample results are listed, as shown in the following table:
由于优化方法中包含了对同属/科的校正,最终的reads_cutoff则是设定解决非同科检出的物种进行校正的,从score_cutoff不同值的比较来看,在允许稍低一点的比对率的前提下,score_cutoff设定为0.5,reads_cutoff设定到大于3即可消除非list假阳性物种检出(该值越低越有利于保证reads偏少检出的真实阳性物种的灵敏度)。Since the optimization method includes the correction of the same genus/families, the final reads_cutoff is set to correct for species detected in different families. From the comparison of different values of score_cutoff, a slightly lower comparison rate is allowed Under the premise of , score_cutoff is set to 0.5, and reads_cutoff is set to greater than 3 to eliminate the detection of non-list false positive species (the lower the value, the better the sensitivity of true positive species detected by fewer reads).
综上,最终确定本发明的技术方案如下:In summary, finally determine the technical solution of the present invention as follows:
1)基于kraken2比对结果进行;1) Based on the comparison results of kraken2;
2)引入taxonomy层级,单条序列根据taxid-kmer比对结果,依据定位规则重新定位其层级和对应的taxid;2) Introduce the taxonomy level, and relocate the level and corresponding taxid of a single sequence according to the alignment rules of taxid-kmer;
定位规则包括:Targeting rules include:
原则上接受kraken2给出的taxid定位,以下情况除外:In principle, accept the taxid positioning given by kraken2, except for the following cases:
某条序列比对到唯一taxid且taxid低于物种层级,则定位为该taxid所属的物种taxid;If a sequence is aligned with a unique taxid and the taxid is lower than the species level, it will be positioned as the taxid of the species to which the taxid belongs;
某条序列比对超过2个taxid,分为3种:A sequence alignment with more than 2 taxids can be divided into 3 types:
比对结果中的所有taxid,关联到物种层级上只有1个物种,其他taxid属于该物种的血清型/亚型,属、科层级,则定位到该物种taxid;All taxids in the comparison result are associated with only one species at the species level, and other taxids belong to the serotype/subtype of the species. At the genus and family levels, the taxid of this species is located;
比对结果中的所有taxid,关联出同属不同物种的结果,则最终定位到属taxid;Comparing all taxids in the results, and correlating the results of the same genus to different species, the genus taxid is finally located;
比对结果中的所有taxid,关联出同科不同属的结果,则最终定位到科taxid;Comparing all taxids in the results, and correlating the results of different genus in the same family, the taxid of the family is finally located;
3)根据每条序列经过2)定位的taxid和taxid-kmer比对结果,根据计算规则计算 每条序列的kmer score评分;3) Calculate the kmer score of each sequence according to the calculation rules according to the alignment results of taxid and taxid-kmer located in 2) for each sequence;
计算规则为:The calculation rules are:
最终定位到科以下的序列,其kmer score=(科taxid kmers+属taxid kmers+物种taxid kmers+亚型/血清型taxid kmers)/总kmers;Finally locate the sequence below the family, its kmer score = (family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;
最终定位到科以上的序列,kmer score设定为0。Finally, the sequence above the family is located, and the kmer score is set to 0.
4)设定score_cutoff为0.5,根据每条序列的kmer score判定其结果是否可信,kmer score未过score_cutoff视为不可信结果予以排除;4) Set score_cutoff to 0.5, and determine whether the result is credible according to the kmer score of each sequence. If the kmer score does not exceed score_cutoff, it is regarded as an unreliable result and excluded;
5)对于最终定位到噬菌体/质粒/载体的序列,全部标记为未比对结果;5) For the sequences that are finally mapped to phage/plasmid/vector, all are marked as unaligned results;
6)对4)5)中经过过滤的序列,统计taxid的reads;6) For the sequence filtered in 4) and 5), count the reads of taxid;
7)对6)中定位到物种层级的taxid关联属和科,计算每个属下所有物种的属相对比值(属内各物种reads数相对于属内reads最高物种reads数的比值),设定一个threshold,其值设定为0.1,将属相对比值低于过滤阈值的这部分物种reads重新定位到属层级;7) For the taxid-associated genus and family located at the species level in 6), calculate the genus relative ratio of all species under each genus (the ratio of the number of reads of each species in the genus to the number of reads in the highest species of reads in the genus), set A threshold, whose value is set to 0.1, relocates the reads of the species whose genus relative ratio is lower than the filtering threshold to the genus level;
8)类同于7),对缺乏属信息但是有明确科信息的物种,计算科相对比值(同科下每个物种相对于该科内reads最高物种的比值),将科相对比值低于7)中过滤阈值threshold这部分物种reads重新定位到科层级;8) Similar to 7), for species that lack genus information but have clear family information, calculate the family relative ratio (the ratio of each species under the same family to the species with the highest reads in the family), and make the family relative ratio lower than 7 ) in the filtering threshold threshold, the reads of this part of the species are relocated to the family level;
9)对于最终比对到属层级的reads(包括6)属层级taxid的reads和步骤7)未通过过滤阈值threshold的该属物种reads之和),计算该属经7)过滤剩余物种的reads总和,并计算该属剩余物种相对总和的比值,之后根据各剩余物种的比值,将最终比对到属层级的reads分配到各剩余物种上;9) For the reads that are finally compared to the genus level (including 6) the reads of the genus level taxid and step 7) the sum of the reads of the genus that did not pass the filtering threshold threshold), calculate the sum of the reads of the remaining species of the genus after 7) filtering , and calculate the ratio of the relative sum of the remaining species of the genus, and then assign the reads that are finally compared to the genus level to each remaining species according to the ratio of each remaining species;
10)对于最终比对到科层级的reads(包含6)科层级taxid的reads和8)步骤未通过过滤阈值threshold的该科物种reads之和),计算该科经过8)过滤剩余物种的reads总和,并计算剩余物种相对总和的比值,之后根据各剩余物种的比值,将最终比对到科层级的reads分配到各剩余物种上;10) For the final comparison to the family-level reads (including 6) family-level taxid reads and 8) the sum of reads of the family species that did not pass the filtering threshold threshold in step 8), calculate the sum of the reads of the family after 8) filtering the remaining species , and calculate the ratio of the relative sum of the remaining species, and then assign the reads that are finally compared to the family level to each remaining species according to the ratio of each remaining species;
11)报出最终物种reads和相对丰度时,设定一个reads过滤阈值reads_cutoff,该值设定为3,reads低于reads_cutoff的物种,不予统计和报出。最后报出物种层级的reads和相对丰度。11) When reporting the final species reads and relative abundance, set a reads filtering threshold reads_cutoff, which is set to 3, and species whose reads are lower than reads_cutoff will not be counted and reported. Finally, the species-level reads and relative abundance are reported.
实施例2与传统kraken2方法的效果比较Comparison of the effects of Example 2 and the traditional kraken2 method
1、根据优化方法最终检测的模拟样本结果整理假阳性与假阴性检出如下:1. According to the simulation sample results of the final detection of the optimization method, the false positive and false negative detections are sorted out as follows:
统计指标:Statistical indicators:
灵敏度为117/120(非人源物种总数)=97.5%;Sensitivity is 117/120 (total number of non-human species)=97.5%;
假阳性物种检出3例。Three false positive species were detected.
2.2根据kraken2 confidence 0.5+braken流程检测的结果整理假阳性检出和假阴性检出如下表所示:2.2 According to the results of kraken2 confidence 0.5+braken process detection, the false positive detection and false negative detection are sorted out as shown in the following table:
样本sample | taxidtaxi | 物种species | readsreads |
相对丰度relative | 结果result |
样本1sample 1 | 340412340412 |
Aspergillus novofumigatus |
11 | 0.000110.00011 |
假阳性false |
样本1sample 1 | 984962984962 | Heterobasidion irregulareHeterobasidion irregular | 11 | 0.000110.00011 |
假阳性false |
样本1sample 1 | 145522145522 | Nannochloropsis oceanicaNannochloropsis oceanica | 44 | 0.000440.00044 |
假阳性false |
样本1sample 1 | 2803728037 |
Streptococcus mitis |
22 | 0.000220.00022 |
假阳性false |
样本1sample 1 | 26567872656787 | Venustampulla echinocandicaVenustapulla echinocandica | 11 | 0.000110.00011 | 假阳性false positive |
样本10sample 10 | 8604986049 | Cladophialophora carrioniiCladophialophora carrionii | 11 | 0.000110.00011 | 假阳性false positive |
样本10Sample 10 | 1037610376 |
Human gammaherpesvirus 4Human |
11 | 0.000110.00011 | 假阳性false positive |
样本10sample 10 | 18739601873960 | Pseudocercospora fijiensisPseudocercospora fijiensis | 22 | 0.000220.00022 | 假阳性false positive |
样本10Sample 10 | 26567872656787 |
Venustampulla echinocandica |
11 | 0.000110.00011 | 假阳性false positive |
样本10Sample 10 | 727727 |
Haemophilus influenzae |
00 | 00 | 假阴性false negative |
样本10Sample 10 | 573573 |
Klebsiella pneumoniae |
00 | 00 | 假阴性false negative |
样本11Sample 11 | 727727 |
Haemophilus influenzae |
00 | 00 | 假阴性false negative |
样本11Sample 11 | 573573 |
Klebsiella pneumoniae |
00 | 00 | 假阴性false negative |
样本12sample 12 | 727727 |
Haemophilus influenzae |
00 | 00 | 假阴性false negative |
样本12sample 12 | 573573 |
Klebsiella pneumoniae |
00 | 00 |
假阴性false |
样本4sample 4 | 145522145522 | Nannochloropsis oceanicaNannochloropsis oceanica | 22 | 0.000220.00022 |
假阳性false |
样本4Sample 4 | 9980299802 | Spirometra erinaceieuropaeiSpirometra erinaceieuropaei | 22 | 0.000220.00022 |
假阳性false |
样本7Sample 7 | 12201881220188 | Aspergillus tanneriAspergillus tanneri | 11 | 0.000110.00011 |
假阳性false |
样本7Sample 7 | 4521945219 |
Guanarito mammarenavirus |
11 | 0.000110.00011 |
假阳性false |
样本7Sample 7 | 145522145522 | Nannochloropsis oceanicaNannochloropsis oceanica | 33 | 0.000330.00033 | 假阳性false positive |
统计指标:Statistical indicators:
灵敏度为114/120=95%;The sensitivity is 114/120=95%;
假阳性物种检出43例(其中reads大于3的为3例)。43 cases of false positive species were detected (3 cases with reads greater than 3).
3、kraken2+bracken即不引入confidence阈值的检测结果中,统计其中的假阳性物种统计为5517例(其中reads大于3的为1188例),假阴性物种为出现在样本10-12中的Klebsiella pneumoniae共3例。3. In the detection results of kraken2+bracken that does not introduce the confidence threshold, there are 5517 cases of false positive species (1188 cases with reads greater than 3), and the false negative species is Klebsiella pneumoniae that appeared in samples 10-12 A total of 3 cases.
结论:kraken2不设置confidence阈值灵敏度与本专利方法一致,但是假阳性物种检出过多,而设置confidence阈值之后虽然假阳性物种检出会减少很多,但是灵敏度会降低。综合比较,本专利方法在保证了灵敏度基础上,降低了假阳性物种检出,效果更优。Conclusion: The sensitivity of kraken2 without setting the confidence threshold is consistent with this patent method, but too many false positive species are detected. After setting the confidence threshold, although the detection of false positive species will be greatly reduced, the sensitivity will be reduced. In comprehensive comparison, the patented method reduces the detection of false positive species on the basis of ensuring the sensitivity, and the effect is better.
实施例3实际样本检测实验Embodiment 3 actual sample detection experiment
使用了9例spike-in样本,各自建立DNA文库和RNA文库,进行上机测序,具体的样本及阳性物种见下表:Nine spike-in samples were used to establish DNA libraries and RNA libraries for sequencing on the machine. The specific samples and positive species are shown in the table below:
样本sample | 文库类型library type | 原始编号original number | 物种species | taxidtaxi |
S1S1 | DNAdna | 样本1sample 1 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S1S1 | DNAdna | 样本1sample 1 | Cryptococcus gattii VGIICryptococcus gattii VGII | 18590961859096 |
S1S1 | DNAdna | 样本1sample 1 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S1S1 | DNAdna | 样本1sample 1 | Escherichia coliEscherichia coli | 562562 |
S1S1 | DNAdna | 样本1sample 1 | Human alphaherpesvirus 1Human alphaherpes virus 1 | 1029810298 |
S1S1 | DNAdna | 样本1sample 1 | Human betaherpesvirus 5Human betaherpes virus 5 | 1035910359 |
S1S1 | DNAdna | 样本1sample 1 | Human betaherpesvirus 6BHuman betaherpesvirus 6B | 3260432604 |
S1S1 | DNAdna | 样本1sample 1 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S1S1 | DNAdna | 样本1sample 1 | Streptococcus pneumoniaeStreptococcus pneumoniae | 13131313 |
S1S1 | RNARNA | 样本1sample 1 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S1S1 | RNARNA | 样本1sample 1 | Cryptococcus gattii VGIICryptococcus gattii VGII | 18590961859096 |
S1S1 | RNARNA | 样本1sample 1 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S1S1 | RNARNA | 样本1sample 1 | Escherichia coliEscherichia coli | 562562 |
S1S1 | RNARNA | 样本1sample 1 | Human alphaherpesvirus 1Human alphaherpes virus 1 | 1029810298 |
S1S1 | RNARNA | 样本1sample 1 | Human betaherpesvirus 5Human betaherpes virus 5 | 1035910359 |
S1S1 | RNARNA | 样本1sample 1 | Human betaherpesvirus 6BHuman betaherpesvirus 6B | 3260432604 |
S1S1 | RNARNA | 样本1sample 1 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S1S1 | RNARNA | 样本1sample 1 | Streptococcus pneumoniaeStreptococcus pneumoniae | 13131313 |
S2S2 | DNAdna | 样本2sample 2 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S2S2 | DNAdna | 样本2sample 2 | Cryptococcus gattii VGIICryptococcus gattii VGII | 18590961859096 |
S2S2 | DNAdna | 样本2sample 2 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S2S2 | DNAdna | 样本2sample 2 | Escherichia coliEscherichia coli | 562562 |
S2S2 | DNAdna | 样本2sample 2 | Human alphaherpesvirus 1Human alphaherpes virus 1 | 1029810298 |
S2S2 | DNAdna | 样本2sample 2 | Human betaherpesvirus 5Human betaherpes virus 5 | 1035910359 |
S2S2 | DNAdna | 样本2sample 2 | Human betaherpesvirus 6BHuman betaherpesvirus 6B | 3260432604 |
S2S2 | DNAdna | 样本2sample 2 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S2S2 | DNAdna | 样本2sample 2 | Streptococcus pneumoniaeStreptococcus pneumoniae | 13131313 |
S2S2 | RNARNA | 样本2sample 2 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S2S2 | RNARNA | 样本2sample 2 | Cryptococcus gattii VGIICryptococcus gattii VGII | 18590961859096 |
S2S2 | RNARNA | 样本2sample 2 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S2S2 | RNARNA | 样本2sample 2 | Escherichia coliEscherichia coli | 562562 |
S2S2 | RNARNA | 样本2sample 2 | Human alphaherpesvirus 1Human alphaherpes virus 1 | 1029810298 |
S2S2 | RNARNA | 样本2sample 2 | Human betaherpesvirus 5Human betaherpes virus 5 | 1035910359 |
S2S2 | RNARNA | 样本2sample 2 | Human betaherpesvirus 6BHuman betaherpesvirus 6B | 3260432604 |
S2S2 | RNARNA | 样本2sample 2 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S2S2 | RNARNA | 样本2sample 2 | Streptococcus pneumoniaeStreptococcus pneumoniae | 13131313 |
S3S3 | DNAdna | 样本3sample 3 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S3S3 | DNAdna | 样本3sample 3 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S3S3 | DNAdna | 样本3sample 3 | Haemophilus influenzaeHaemophilus influenzae | 727727 |
S3S3 | DNAdna | 样本3sample 3 | Human alphaherpesvirus 2Human alphaherpes virus 2 | 1031010310 |
S3S3 | DNAdna | 样本3sample 3 | Listeria monocytogenesListeria monocytogenes | 16391639 |
S3S3 | DNAdna | 样本3sample 3 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S3S3 | DNAdna | 样本3sample 3 | Neisseria meningitidisNeisseria meningitidis | 487487 |
S3S3 | DNAdna | 样本3sample 3 | Streptococcus agalactiaeStreptococcus agalactiae | 13111311 |
S3S3 | RNARNA | 样本3sample 3 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S3S3 | RNARNA | 样本3sample 3 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S3S3 | RNARNA | 样本3sample 3 | Haemophilus influenzaeHaemophilus influenzae | 727727 |
S3S3 | RNARNA | 样本3sample 3 | Human alphaherpesvirus 2Human alphaherpes virus 2 | 1031010310 |
S3S3 | RNARNA | 样本3sample 3 | Listeria monocytogenesListeria monocytogenes | 16391639 |
S3S3 | RNARNA | 样本3sample 3 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S3S3 | RNARNA | 样本3sample 3 | Neisseria meningitidisNeisseria meningitidis | 487487 |
S3S3 | RNARNA | 样本3sample 3 | Parechovirus AParechovirus A | 18039561803956 |
S4S4 | DNAdna | 样本4Sample 4 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S4S4 | DNAdna | 样本4sample 4 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S4S4 | DNAdna | 样本4Sample 4 | Haemophilus influenzaeHaemophilus influenzae | 727727 |
S4S4 | DNAdna | 样本4sample 4 | Human alphaherpesvirus 2Human alphaherpes virus 2 | 1031010310 |
S4S4 | DNAdna | 样本4sample 4 | Listeria monocytogenesListeria monocytogenes | 16391639 |
S4S4 | DNAdna | 样本4Sample 4 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S4S4 | DNAdna | 样本4sample 4 | Neisseria meningitidisNeisseria meningitidis | 487487 |
S4S4 | DNAdna | 样本4sample 4 | Streptococcus agalactiaeStreptococcus agalactiae | 13111311 |
S4S4 | RNARNA | 样本4sample 4 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S4S4 | RNARNA | 样本4sample 4 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S4S4 | RNARNA | 样本4Sample 4 | Haemophilus influenzaeHaemophilus influenzae | 727727 |
S4S4 | RNARNA | 样本4Sample 4 | Human alphaherpesvirus 2Human alphaherpes virus 2 | 1031010310 |
S4S4 | RNARNA | 样本4Sample 4 | Listeria monocytogenesListeria monocytogenes | 16391639 |
S4S4 | RNARNA | 样本4Sample 4 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S4S4 | RNARNA | 样本4Sample 4 | Neisseria meningitidisNeisseria meningitidis | 487487 |
S4S4 | RNARNA | 样本4Sample 4 | Parechovirus AParechovirus A | 18039561803956 |
S5S5 | DNAdna | 样本5Sample 5 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S5S5 | DNAdna | 样本5Sample 5 | Candida albicansCandida albicans | 54765476 |
S5S5 | DNAdna | 样本5Sample 5 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S5S5 | DNAdna | 样本5Sample 5 | Enterococcus faecalisEnterococcus faecalis | 13511351 |
S5S5 | DNAdna | 样本5Sample 5 | Human mastadenovirus CHuman mastadenovirus C | 129951129951 |
S5S5 | DNAdna | 样本5Sample 5 | Klebsiella oxytocaKlebsiella oxytoca | 571571 |
S5S5 | DNAdna | 样本5Sample 5 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S5S5 | DNAdna | 样本5Sample 5 | Streptococcus pyogenesStreptococcus pyogenes | 13141314 |
S5S5 | RNARNA | 样本5Sample 5 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S5S5 | RNARNA | 样本5Sample 5 | Candida albicansCandida albicans | 54765476 |
S5S5 | RNARNA | 样本5Sample 5 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S5S5 | RNARNA | 样本5Sample 5 | Enterococcus faecalisEnterococcus faecalis | 13511351 |
S5S5 | RNARNA | 样本5Sample 5 | Human mastadenovirus CHuman mastadenovirus C | 129951129951 |
S5S5 | RNARNA | 样本5Sample 5 | Human orthopneumovirusHuman orthopneumovirus | 1125011250 |
S5S5 | RNARNA | 样本5Sample 5 | Human respirovirus 1Human respirovirus 1 | 1273012730 |
S5S5 | RNARNA | 样本5Sample 5 | Klebsiella oxytocaKlebsiella oxytoca | 571571 |
S5S5 | RNARNA | 样本5Sample 5 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S5S5 | RNARNA | 样本5Sample 5 | Streptococcus pyogenesStreptococcus pyogenes | 13141314 |
S6S6 | DNAdna | 样本6Sample 6 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S6S6 | DNAdna | 样本6Sample 6 | Candida albicansCandida albicans | 54765476 |
S6S6 | DNAdna | 样本6Sample 6 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S6S6 | DNAdna | 样本6Sample 6 | Enterococcus faecalisEnterococcus faecalis | 13511351 |
S6S6 | DNAdna | 样本6Sample 6 | Human mastadenovirus CHuman mastadenovirus C | 129951129951 |
S6S6 | DNAdna | 样本6Sample 6 | Klebsiella oxytocaKlebsiella oxytoca | 571571 |
S6S6 | DNAdna | 样本6Sample 6 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S6S6 | DNAdna | 样本6Sample 6 | Streptococcus pyogenesStreptococcus pyogenes | 13141314 |
S6S6 | RNARNA | 样本6Sample 6 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S6S6 | RNARNA | 样本6Sample 6 | Candida albicansCandida albicans | 54765476 |
S6S6 | RNARNA | 样本6Sample 6 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S6S6 | RNARNA | 样本6Sample 6 | Enterococcus faecalisEnterococcus faecalis | 13511351 |
S6S6 | RNARNA | 样本6Sample 6 | Human mastadenovirus CHuman mastadenovirus C | 129951129951 |
S6S6 | RNARNA | 样本6Sample 6 | Human orthopneumovirusHuman orthopneumovirus | 1125011250 |
S6S6 | RNARNA | 样本6Sample 6 | Human respirovirus 1Human respirovirus 1 | 1273012730 |
S6S6 | RNARNA | 样本6Sample 6 | Klebsiella oxytocaKlebsiella oxytoca | 571571 |
S6S6 | RNARNA | 样本6Sample 6 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S6S6 | RNARNA | 样本6Sample 6 | Streptococcus pyogenesStreptococcus pyogenes | 13141314 |
S7S7 | DNAdna | 样本7Sample 7 | Aeromonas hydrophilaAeromonas hydrophila | 644644 |
S7S7 | DNAdna | 样本7Sample 7 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S7S7 | DNAdna | 样本7Sample 7 | Clavispora lusitaniaeClavispora lusitaniae | 3691136911 |
S7S7 | DNAdna | 样本7Sample 7 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S7S7 | DNAdna | 样本7Sample 7 | Human mastadenovirus EHuman mastadenovirus E | 130308130308 |
S7S7 | DNAdna | 样本7Sample 7 | Legionella pneumophilaLegionella pneumophila | 446446 |
S7S7 | DNAdna | 样本7Sample 7 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S7S7 | DNAdna | 样本7Sample 7 | Neisseria siccaNeisseria sicca | 490490 |
S7S7 | DNAdna | 样本7Sample 7 | Pseudomonas fluorescensPseudomonas fluorescens | 294294 |
S7S7 | RNARNA | 样本7Sample 7 | Aeromonas hydrophilaAeromonas hydrophila | 644644 |
S7S7 | RNARNA | 样本7Sample 7 | Alphapapillomavirus 7Alphapapillomavirus 7 | 337042337042 |
S7S7 | RNARNA | 样本7Sample 7 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 |
S7S7 | RNARNA | 样本7Sample 7 | Human mastadenovirus EHuman mastadenovirus E | 130308130308 |
S7S7 | RNARNA | 样本7Sample 7 | Influenza A virusInfluenza A virus | 1132011320 |
S7S7 | RNARNA | 样本7Sample 7 | Influenza B virusInfluenza B virus | 1152011520 |
S7S7 | RNARNA | 样本7Sample 7 | Legionella pneumophilaLegionella pneumophila | 446446 |
S7S7 | RNARNA | 样本7Sample 7 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
S7S7 | RNARNA | 样本7Sample 7 | Neisseria siccaNeisseria sicca | 490490 |
S7S7 | RNARNA | 样本7Sample 7 | Pseudomonas fluorescensPseudomonas fluorescens | 294294 |
S8 | DNAdna | 样本8Sample 8 | Aeromonas hydrophilaAeromonas hydrophila | 644644 | |
S8 | DNAdna | 样本8Sample 8 |
Alphapapillomavirus 7 |
337042337042 | |
S8 | DNAdna | 样本8Sample 8 | Clavispora lusitaniaeClavispora lusitaniae | 3691136911 | |
S8 | DNAdna | 样本8Sample 8 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 | |
S8 | DNAdna | 样本8Sample 8 | Human mastadenovirus EHuman mastadenovirus E | 130308130308 | |
S8 | DNAdna | 样本8Sample 8 | Legionella pneumophilaLegionella pneumophila | 446446 | |
S8 | DNAdna | 样本8Sample 8 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 | |
S8 | DNAdna | 样本8Sample 8 | Neisseria siccaNeisseria sicca | 490490 | |
S8 | DNAdna | 样本8Sample 8 | Pseudomonas fluorescensPseudomonas fluorescens | 294294 | |
S8 | RNARNA | 样本8Sample 8 | Aeromonas hydrophilaAeromonas hydrophila | 644644 | |
S8 | RNARNA | 样本8Sample 8 |
Alphapapillomavirus 7 |
337042337042 | |
S8 | RNARNA | 样本8Sample 8 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 | |
S8 | RNARNA | 样本8Sample 8 | Human mastadenovirus EHuman mastadenovirus E | 130308130308 | |
S8 | RNARNA | 样本8Sample 8 | Influenza A virusInfluenza A virus | 1132011320 | |
S8 | RNARNA | 样本8Sample 8 | Influenza B virusInfluenza B virus | 1152011520 | |
S8 | RNARNA | 样本8Sample 8 | Legionella pneumophilaLegionella pneumophila | 446446 | |
S8 | RNARNA | 样本8Sample 8 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 | |
S8 | RNARNA | 样本8Sample 8 | Neisseria siccaNeisseria sicca | 490490 | |
S8 | RNARNA | 样本8Sample 8 | Pseudomonas fluorescensPseudomonas fluorescens | 294294 | |
S9 | DNAdna | 样本9Sample 9 |
Alphapapillomavirus 7 |
337042337042 | |
S9 | DNAdna | 样本9Sample 9 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 | |
S9 | DNAdna | 样本9Sample 9 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 | |
S9 | RNARNA | 样本9Sample 9 |
Alphapapillomavirus 7 |
337042337042 | |
S9 | RNARNA | 样本9Sample 9 | Cupriavidus gilardiiCupriavidus gilardii | 8254182541 | |
S9 | RNARNA | 样本9Sample 9 | Mycoplasma hyorhinisMycoplasma hyorhinis | 21002100 |
本发明的流程未检出的阳性物种,kraken2 confidence 0.5+bracken流程未检出的阳性物种,kraken2+bracken流程未检出的阳性物种统计如下表所示(其中reads_opt,abundance_opt代表本发明流程的阳性物种检出,reads_confidence,abundance_confidence代表kraken2 confidence 0.5+bracken流程的阳性物种检出,reads_kraken,abundance_kraken代表kraken2+bracken流程的阳性检出):The positive species not detected by the process of the present invention, the positive species not detected by the kraken2 confidence 0.5+bracken process, the statistics of the positive species not detected by the kraken2+bracken process are shown in the table below (wherein reads_opt, abundance_opt represent the positives of the process of the present invention Species detection, reads_confidence, abundance_confidence represent positive species detection of kraken2 confidence 0.5+bracken process, reads_kraken, abundance_kraken represent positive detection of kraken2+bracken process):
阳性物种合计148,kraken2 confidence 0.5+bracken有2个物种未检出(灵敏度为98.6%),本发明流程和kraken2+bracken有一个物种未检出(灵敏度为99.3%),表现与模拟数据类似,灵敏度上本发明流程,和kraken2+bracken相同,稍高于kraken2 confidence 0.5+bracken的流程。The total number of positive species is 148, kraken2 confidence 0.5+bracken has 2 species not detected (sensitivity is 98.6%), the process of the present invention and kraken2+bracken have one species not detected (sensitivity is 99.3%), the performance is similar to the simulated data, In terms of sensitivity, the process of the present invention is the same as kraken2+bracken, slightly higher than the process of kraken2 confidence 0.5+bracken.
DNA文库各个流程检出的假阳性物种汇总统计如下表(对应图3结果,其中图片中的opt代表本发明流程,confidence代表Kraken2 confidence 0.5+bracken流程且对应表格中的第4列,kraken代表Kraken2+bracken流程):The summary statistics of the false positive species detected by each process of the DNA library are as follows (corresponding to the results in Figure 3, wherein opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process and corresponds to the fourth column in the table, kraken represents Kraken2 +bracken process):
RNA文库各个流程检出的假阳性物种汇总统计如下表(对应图4结果,其中图片中的opt代表本发明流程,confidence代表Kraken2 confidence 0.5+bracken流程且对应表格中第4列,kraken代表Kraken2+bracken流程):The summary statistics of the false positive species detected by each process of the RNA library are as follows (corresponding to the results in Figure 4, where opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process and corresponds to the fourth column in the table, and kraken represents Kraken2+ bracket process):
汇总模拟样本和spike-in样本统计的灵敏度结果如下表(对应图5结果,其中图片中的opt代表本发明流程,confidence代表Kraken2 confidence 0.5+bracken流程,kraken代表Kraken2+bracken流程):The sensitivity results of the summary simulation samples and spike-in sample statistics are as follows (corresponding to the results in Figure 5, where opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process, and kraken represents the Kraken2+bracken process):
从以上的统计结果来看,本发明流程假阳性物种检出要远低于kraken2+bracken流程,且在保证了灵敏度高于kraken2 confidence 0.5+bracken基础上,假阳性检出要低于后者(即便是在reads>3才报出的情况下,仍然能降低大约1/3的假阳性物种检出)。From the above statistical results, the detection of false positive species in the process of the present invention will be far lower than that of kraken2+bracken process, and on the basis of ensuring that the sensitivity is higher than kraken2 confidence 0.5+bracken, the detection of false positives will be lower than the latter ( Even when the reads>3 are reported, the detection of false positive species can still be reduced by about 1/3).
最后应说明的是:以上各实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述各实施例对本发明进行了详细的说明,但本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分或者全部技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的范围。It should be noted that at last: above each embodiment is only in order to illustrate technical scheme of the present invention, and is not intended to limit; Although the present invention has been described in detail with reference to foregoing each embodiment, those of ordinary skill in the art should understand that: It is still possible to modify the technical solutions described in the foregoing embodiments, or perform equivalent replacements for some or all of the technical features; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the technical solutions of the various embodiments of the present invention. range.
Claims (10)
- 一种生信分析方法,其特征在于,包括如下步骤:A biometric analysis method, characterized in that it comprises the following steps:1)序列比对:NGS测序数据使用kraken2进行序列比对,获得每条序列taxid-kmer结果;1) Sequence comparison: use kraken2 for sequence comparison of NGS sequencing data, and obtain the taxid-kmer result of each sequence;2)基于taxonomy数据库建立taxid层级关系:根据步骤1)taxid-kmer结果关联taxonomy层级,根据定位规则重定位taxid;2) Establish taxid hierarchical relationship based on taxonomy database: according to step 1) taxonomy level is associated with the taxid-kmer result, and taxid is relocated according to the positioning rules;3)计算每条序列kmer score:根据每条序列经过步骤2)重定位的taxid和步骤1)的taxid-kmer结果计算每条序列kmer score;3) Calculate the kmer score of each sequence: calculate the kmer score of each sequence according to the taxid relocated in step 2) and the taxid-kmer result of step 1) for each sequence;4)对比对结果进行整体计算:根据kmer score和taxonomy层级进行整体计算。4) Overall calculation of comparison results: overall calculation based on kmer score and taxonomy level.
- 权利要求1所述的生信分析方法,其特征在于,所述方法进一步包括:The bioinformatics analysis method of claim 1, wherein the method further comprises:5)物种taxid检测:根据4)的整体计算结果进行物种taxid检测。5) Species taxid detection: perform species taxid detection according to the overall calculation result in 4).
- 权利要求1-2任一所述的生信分析方法,其特征在于,所述步骤2)中层级关系包括血清型/亚型、种、属、科的一种或多种。The bioinformatics analysis method according to any one of claims 1-2, wherein the hierarchical relationship in step 2) includes one or more of serotype/subtype, species, genus, and family.
- 权利要求1-3任一所述的生信分析方法,其特征在于,所述步骤3)中所述kmer score计算规则如下:The arbitrary described biometric analysis method of claim 1-3, is characterized in that, described step 3) described kmerscore calculating rule is as follows:最终定位到科层级taxid以下的序列,kmer score=(科taxid kmers+属taxid kmers+种taxid kmers+亚型/血清型taxid kmers)/总kmers;Finally locate the sequence below the taxid at the family level, kmer score=(family taxid kmers+genus taxid kmers+species taxid kmers+subtype/serotype taxid kmers)/total kmers;最终定位到科层级taxid以上的序列,kmer score为0。Finally, the sequence above the taxid at the family level is located, and the kmer score is 0.
- 权利要求1-4任一所述的生信分析方法,其特征在于,所述步骤2)中重定位规则包括:The biometric analysis method according to any one of claims 1-4, wherein the relocation rules in said step 2) include:通常情况下接受kraken2给出的taxid定位,以下情况进行重定位:Usually accept the taxid positioning given by kraken2, and relocate in the following cases:某条序列根据taxid-kmer结果获得唯一taxid且taxid低于种层级,则定位为该taxid所属的种层级taxid;If a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is lower than the species level, it is positioned as the species level taxid to which the taxid belongs;某条序列根据taxid-kmer结果获得超过2个taxid时,分3种情况:When a sequence obtains more than 2 taxids according to the taxid-kmer results, there are 3 cases:所有taxid,关联到种层级上只出现1个,其他taxid属于该种的血清型/亚型、属、或科层级,则定位到该种层级taxid;For all taxids, only one appears at the species level, and other taxids belong to the serotype/subtype, genus, or family level of the species, then locate the taxid at the species level;所有taxid,关联到种层级超过2个且属于同一属,则最终定位到属层级taxid;All taxids that are associated with more than 2 species levels and belong to the same genus will be finally located at the genus level taxid;所有taxid,关联到属层级超过2个且属于同一科,则最终定位到科层级taxid。All taxids that are associated with more than 2 genus levels and belong to the same family will be finally located at the family level taxid.
- 权利要求1-5任一所述的生信分析方法,其特征在于,所述步骤4)中整体计算包括:The bioinformatics analysis method described in any one of claims 1-5, wherein the overall calculation in said step 4) includes:a、设定一个过滤cutoff阈值,对每条序列根据kmer score进行过滤;a. Set a filter cutoff threshold, and filter each sequence according to the kmer score;b、对a中经过过滤的序列,统计taxid的reads;b. For the filtered sequence in a, count the reads of taxid;所述taxid的reads是一个样本出现的taxid的序列总数;The reads of the taxid is the total number of taxid sequences that appear in a sample;c、设定一个过滤阈值threshold,对b中定位到种层级的taxid进行过滤,计算其属相对比值,排除低于阈值的种层级taxid;c. Set a filtering threshold threshold, filter the taxids located at the species level in b, calculate their relative ratio, and exclude the species level taxids lower than the threshold;所述属相对比值为某个种层级taxid reads相对于同属reads最高的种层级taxid reads的比值;The genus relative ratio is the ratio of certain species-level taxid reads to the highest species-level taxid reads of the same genus reads;优选的,还包括:Preferably, it also includes:d、经c过滤的种层级taxid,若缺乏属分类,则计算科相对比值,排除低于过滤阈值threshold种层级taxid;d. If the taxid at the species level filtered by c lacks genus classification, calculate the family relative ratio and exclude the taxid at the species level lower than the filtering threshold threshold;所述科相对比值为某个种层级taxid reads相对于同科reads最高的种层级taxid reads的比值。The family relative ratio is the ratio of a species-level taxid reads relative to the species-level taxid reads with the highest reads of the same family.
- 权利要求1-6任一所述的生信分析方法,其特征在于,所述步骤4)中所述整体计算进一步包括:The bioinformatics analysis method according to any one of claims 1-6, wherein the overall calculation in the step 4) further comprises:e、经c,d过滤保留的种层级taxid reads校正1,计算属相对比值,将属层级taxid reads按照属相对比值计算种层级taxid属校正reads;e. After c, d filter and retain the species-level taxid reads correction 1, calculate the genus relative ratio, and calculate the species-level taxid genus correction reads with the genus-level taxid reads according to the genus relative ratio;所述属相对比值为经c,d过滤后同属的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The genus relative ratio is the sum of the species-level taxid reads of the same genus after filtering by c and d, and then calculates the ratio of taxid reads at various levels to the sum;所述属层级taxid reads包括b中属层级taxid reads和c中未通过过滤阈值threshold的该属关联种层级taxid并入的reads;The genus-level taxid reads include the genus-level taxid reads in b and the reads incorporated into the genus-associated species-level taxid that have not passed the filtering threshold threshold in c;f、经c,d过滤的种层级taxid reads校正2,计算科相对比值,将科层级taxid reads按照科相对比值计算种层级taxid科校正reads;f, after c, the kind-level taxid reads filtered by d are corrected 2, and the family-level relative ratio is calculated, and the family-level taxid reads are calculated according to the family-level taxid family-corrected reads;所述科相对比值为经c,d过滤后同科的种层级taxid reads总和,之后计算各种层级taxid reads相对于总和的比例;The relative ratio of the family is the sum of the species-level taxid reads of the same family after filtering by c and d, and then calculate the ratio of the taxid reads at various levels to the sum;所述科层级taxid的reads包含b中科层级taxid reads和d中未通过过滤阈值threshold的该科关联种层级taxid并入的reads。The family-level taxid reads include the family-level taxid reads in b and the reads incorporated into the family-related species-level taxid in d that do not pass the filtering threshold threshold.
- 权利要求1-7任一所述的生信分析方法,其特征在于,所述步骤1)中比对采用的数据库为nt、refseq或genbank数据库;优选的,所述数据库为nt数据库。The bioinformatics analysis method according to any one of claims 1-7, characterized in that the database used for comparison in the step 1) is an nt, refseq or genbank database; preferably, the database is an nt database.
- 一种计算机可读介质,其存储有计算机程序,所述计算机程序被处理器执行时,实现权利要求1-8中任一项所述方法。A computer-readable medium, which stores a computer program, and when the computer program is executed by a processor, implements the method according to any one of claims 1-8.
- 一种电子设备,其特征在于,包括处理器以及存储器,所述存储器上存储一条或多条可读指令,所述一条或多条可读指令被所述处理器执行时,实现权利要求1-8中任一项所述方法。An electronic device, characterized in that it includes a processor and a memory, and the memory stores one or more readable instructions, and when the one or more readable instructions are executed by the processor, claim 1- The method described in any one of 8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110804351.8A CN113539369B (en) | 2021-07-14 | 2021-07-14 | Optimized kraken2 algorithm and application thereof in second-generation sequencing |
CN202110804351.8 | 2021-07-14 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023283967A1 true WO2023283967A1 (en) | 2023-01-19 |
Family
ID=78128300
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/106970 WO2023283967A1 (en) | 2021-07-14 | 2021-07-17 | Optimized kraken2 algorithm and application thereof in second-generation sequencing |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113539369B (en) |
WO (1) | WO2023283967A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113539369B (en) * | 2021-07-14 | 2022-03-25 | 江苏先声医学诊断有限公司 | Optimized kraken2 algorithm and application thereof in second-generation sequencing |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111681704A (en) * | 2020-04-21 | 2020-09-18 | 华中科技大学鄂州工业技术研究院 | Construction method of matK gene-based unknown plant species identification database and database |
CN112071366A (en) * | 2020-10-13 | 2020-12-11 | 南开大学 | Metagenome data analysis method based on second-generation sequencing technology |
US20210141833A1 (en) * | 2019-11-07 | 2021-05-13 | International Business Machines Corporation | Optimizing k-mer databases by k-mer subtraction |
CN113096737A (en) * | 2021-03-26 | 2021-07-09 | 北京源生康泰基因科技有限公司 | Method and system for automatically analyzing pathogen types |
CN113539369A (en) * | 2021-07-14 | 2021-10-22 | 江苏先声医学诊断有限公司 | Optimized kraken2 algorithm and application thereof in second-generation sequencing |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111462821B (en) * | 2020-04-10 | 2022-02-22 | 广州微远医疗器械有限公司 | Pathogenic microorganism analysis and identification system and application |
CN111710365B (en) * | 2020-06-10 | 2022-04-08 | 山东省计算中心(国家超级计算济南中心) | Ontology-based protein/gene synonym table construction method |
CN112599198A (en) * | 2020-12-29 | 2021-04-02 | 上海派森诺生物科技股份有限公司 | Microorganism species and functional composition analysis method for metagenome sequencing data |
-
2021
- 2021-07-14 CN CN202110804351.8A patent/CN113539369B/en active Active
- 2021-07-17 WO PCT/CN2021/106970 patent/WO2023283967A1/en unknown
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210141833A1 (en) * | 2019-11-07 | 2021-05-13 | International Business Machines Corporation | Optimizing k-mer databases by k-mer subtraction |
CN111681704A (en) * | 2020-04-21 | 2020-09-18 | 华中科技大学鄂州工业技术研究院 | Construction method of matK gene-based unknown plant species identification database and database |
CN112071366A (en) * | 2020-10-13 | 2020-12-11 | 南开大学 | Metagenome data analysis method based on second-generation sequencing technology |
CN113096737A (en) * | 2021-03-26 | 2021-07-09 | 北京源生康泰基因科技有限公司 | Method and system for automatically analyzing pathogen types |
CN113539369A (en) * | 2021-07-14 | 2021-10-22 | 江苏先声医学诊断有限公司 | Optimized kraken2 algorithm and application thereof in second-generation sequencing |
Also Published As
Publication number | Publication date |
---|---|
CN113539369A (en) | 2021-10-22 |
CN113539369B (en) | 2022-03-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230366046A1 (en) | Systems and methods for analyzing viral nucleic acids | |
Marchant et al. | The C-Fern (Ceratopteris richardii) genome: insights into plant genome evolution with the first partial homosporous fern genome assembly | |
CN111462821B (en) | Pathogenic microorganism analysis and identification system and application | |
US20200294628A1 (en) | Creation or use of anchor-based data structures for sample-derived characteristic determination | |
US11830580B2 (en) | K-mer database for organism identification | |
WO2018218788A1 (en) | Third-generation sequencing sequence alignment method based on global seed scoring optimization | |
WO2019213811A1 (en) | Method, apparatus, and system for detecting chromosomal aneuploidy | |
US11809498B2 (en) | Optimizing k-mer databases by k-mer subtraction | |
CN112599198A (en) | Microorganism species and functional composition analysis method for metagenome sequencing data | |
WO2023283967A1 (en) | Optimized kraken2 algorithm and application thereof in second-generation sequencing | |
CN112992277A (en) | Construction method and application of microbial genome database | |
WO2020155623A1 (en) | Sequence alignment filtering processing method, system and device, and readable storage medium | |
CN115631789A (en) | Pangenome-based group joint variation detection method | |
CN108595912B (en) | Method, device and system for detecting chromosome aneuploidy | |
WO2017000859A1 (en) | Leaping search algorithm for similar sub-sequences in character sequence and application thereof in searching in biological sequence database | |
CN115064215A (en) | Method for tracing strain and identifying attribute through similarity | |
US11688489B2 (en) | Systems and methods for grouping and collapsing sequencing reads | |
WO2020213736A1 (en) | Information processing device, information processing method, program and storage medium | |
Cai et al. | Concod: an effective integration framework of consensus-based calling deletions from next-generation sequencing data | |
CN114334004B (en) | Rapid comparison and identification method for pathogenic microorganisms and application thereof | |
CN110600083B (en) | Calcium acetate-acinetobacter baumannii complex group identification method based on splicing-free assembly WGS data | |
Chu et al. | Improving on hash-based probabilistic sequence classification using multiple spaced seeds and multi-index Bloom filters | |
CN118197436A (en) | Construction method of pathogenic microorganism metagenome database | |
CN116682496A (en) | Pathogenic microorganism genome database and construction method and application thereof | |
CN114582523A (en) | Novel coronavirus genome feature similarity measurement method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21949747 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |