WO2023283967A1 - Optimized kraken2 algorithm and its application in second-generation sequencing - Google Patents
- Publication number
- WO2023283967A1 (application PCT/CN2021/106970)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- taxid
- level
- species
- reads
- family
- Prior art date
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A50/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
- Y02A50/30—Against vector-borne diseases, e.g. mosquito-borne, fly-borne, tick-borne or waterborne diseases whose impact is exacerbated by climate change
Definitions
- the invention relates to the field of bioinformatics, in particular to an optimized kraken2 algorithm and its application in next-generation sequencing.
- next-generation sequencing is a massively parallel sequencing technology characterized by high throughput, high sequencing accuracy, and short turnaround time, which matches the needs of metagenomics well.
- this has led to the widespread application of metagenomics in infection detection.
- species detection of microbial communities after sequencing is the most important task in metagenomics research. Only by identifying the microorganisms present accurately and reliably can metagenomics be linked to research questions, such as whether a patient's disease is caused by a particular microbial infection (if malaria is suspected, for example, Plasmodium must be accurately detected in the blood to give a definitive diagnosis). Metagenomic analysis is a fast, accurate and advanced detection technology that currently plays an important role in the auxiliary diagnosis of infectious diseases.
- kraken2 applied to Illumina second-generation metagenomic sequencing offers fast analysis and high sensitivity, but its specificity is low and it often reports many false positives, which follows from the characteristics of the kraken2 algorithm.
- the default k-mer length is 35 bp
- when building its database, kraken2 gives priority to constructing k-mers specific to a species. If a k-mer occurs in multiple species of the genus Streptococcus, it is assigned to the genus Streptococcus; by the same principle, if a k-mer occurs in multiple genera of the family Streptococcaceae, it is assigned to the family Streptococcaceae.
- the database also contains many other sequences, such as plasmids and vectors.
- reads that match these sequences are also output, and these results are essentially meaningless (they can also be counted as false positive detections).
- the purpose of the present invention is to seek a bioinformatics analysis method that can reduce false positives in sequencing analysis, improve the accuracy of species detection, and is suitable for Illumina second-generation metagenomic sequencing.
- the present invention proposes the following technical solutions:
- the present invention firstly provides a bioinformatics analysis method based on kraken2 single sequence kmer score and overall taxonomy structure statistics, said method comprising the following steps:
- NGS sequencing data are classified with kraken2 to obtain the taxid-kmer result of each sequence
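As an illustration of what a taxid-kmer result looks like in practice, the sketch below parses a string of `taxid:count` pairs such as `0:10 1313:32` into per-taxid k-mer counts. It assumes the standard kraken2 per-read output, where the pairs are space-separated, taxid 0 marks unclassified k-mers, and non-numeric tokens such as `|:|` (mate separator) or `A:n` (ambiguous k-mers) may appear; the function name `parse_taxid_kmers` is hypothetical and is not taken from the patent.

```python
def parse_taxid_kmers(field: str) -> dict[int, int]:
    """Sum k-mer counts per taxid from a string such as '0:10 1313:32'."""
    counts: dict[int, int] = {}
    # The patent text separates pairs with commas; kraken2 itself uses spaces.
    for token in field.replace(",", " ").split():
        taxid, _, n = token.partition(":")
        if not taxid.isdigit():
            # Skip non-taxid tokens such as '|:|' or 'A:4'.
            continue
        counts[int(taxid)] = counts.get(int(taxid), 0) + int(n)
    return counts


print(parse_taxid_kmers("0:10 1313:20 1301:12"))
```

The same taxid can appear in several runs along a read, which is why the counts are accumulated rather than overwritten.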
- the hierarchical relationship in step 2) includes one or more hierarchical relationships of serotype/subtype, species, genus and/or family.
- the positioning rules in the step 2) include the following:
- if a sequence obtains a unique taxid according to the taxid-kmer result and that taxid is below the species level, the sequence is positioned at the species-level taxid to which it belongs;
- calculation rules in the step 3) include:
- kmer score = (family taxid kmers + genus taxid kmers + species taxid kmers + subtype/serotype taxid kmers) / total kmers;
- the overall calculation includes:
- the reads of the taxid is the total number of taxid sequences that appear in a sample
- the genus relative ratio is the ratio of the reads of a given species-level taxid to the reads of the species-level taxid with the most reads in the same genus
- the family relative ratio is the ratio of the reads of a given species-level taxid to the reads of the species-level taxid with the most reads in the same family
- correction 1 of the species-level taxid reads retained after filtering in c and d: calculate the genus relative ratio, then use it with the genus-level taxid reads to calculate the genus-corrected reads of each species-level taxid;
- for this correction, the genus relative ratio is obtained by summing the species-level taxid reads of the same genus retained after filtering in c and d, and then calculating the ratio of each species-level taxid's reads to that sum;
- the genus-level taxid reads include the genus-level taxid reads in b plus the reads of species-level taxids of that genus that did not pass the filtering threshold in c, which are merged into the genus;
- for the family correction, the family relative ratio is obtained by summing the species-level taxid reads of the same family retained after filtering in c and d, and then calculating the ratio of each species-level taxid's reads to that sum;
- the family-level taxid reads include the family-level taxid reads in b plus the reads of species-level taxids of that family that did not pass the filtering threshold in d, which are merged into the family.
- in step 5), species-level detection is equivalent to species-level taxid detection: the species-level taxids obtained from c and d are the final species taxids; the sum of the species-level taxid reads obtained in b, e and f gives the final species taxid reads; and relative abundance is calculated from that sum of reads.
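The relative ratio to the highest species in a group can be sketched as follows. This is a minimal illustration, assuming that the filtering in c and d compares each species' ratio against a threshold; the threshold value, the function name `filter_by_relative_ratio`, and the read counts are all hypothetical, since this text does not state them.

```python
def filter_by_relative_ratio(species_reads, group_of, min_ratio):
    """Split species into kept/dropped by their ratio to the top species of the group.

    species_reads: species name -> read count (assumed to be positive)
    group_of:      species name -> genus (or family) name
    min_ratio:     hypothetical threshold, not taken from the patent
    """
    # Highest read count observed in each genus (or family).
    top = {}
    for sp, n in species_reads.items():
        group = group_of[sp]
        top[group] = max(top.get(group, 0), n)

    kept, dropped = {}, {}
    for sp, n in species_reads.items():
        ratio = n / top[group_of[sp]]
        (kept if ratio >= min_ratio else dropped)[sp] = n
    return kept, dropped


reads = {"Streptococcus pneumoniae": 200, "Streptococcus mitis": 2, "Haemophilus influenzae": 5}
genus = {"Streptococcus pneumoniae": "Streptococcus", "Streptococcus mitis": "Streptococcus",
         "Haemophilus influenzae": "Haemophilus"}
kept, dropped = filter_by_relative_ratio(reads, genus, min_ratio=0.05)
print(kept, dropped)
```

Calling the same function with a species-to-family mapping gives the family relative ratio filter. The dropped reads are returned rather than discarded because the text merges them into the genus-level or family-level reads.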
- the database used for comparison in step 1) is nt, refseq or genbank database; preferably, the database is nt database.
- the present invention also provides the application of the method based on kraken2 single sequence kmer score and overall taxonomy structure statistics in next-generation sequencing bioinformatics analysis.
- the present invention also provides a computer-readable medium, which stores a computer program, and when the computer program is executed by a processor, the method described in any one of the above claims is realized.
- the present invention also provides an electronic device comprising a processor and a memory, wherein one or more readable instructions are stored in the memory, and when the one or more readable instructions are executed by the processor, any one of the above methods is implemented.
- the bioinformatics method of the present invention can reduce false positives in bioinformatics analysis, improve the accuracy of species detection, and is suitable for second-generation metagenomic sequencing, including single-end and double-end sequencing.
- the present invention ensures the overall sensitivity through precise positioning of a single sequence and overall systematic optimization.
- the present invention excludes alignment results against plasmids, vectors and similar sequences, effectively reducing meaningless detections.
- FIG. 1 schematic diagram of the system of the present invention
- Figure 2 Schematic diagram of taxid positioning and kmer score scoring for a single sequence
- Fig. 3 Comparison chart of false positive species detected in 9 cases of spike-in sample DNA library, opt represents the process of the present invention, confidence represents the process of kraken2 confidence 0.5+bracken, kraken represents the process of kraken2+bracken, S1-S9 represents 9 samples;
- Fig. 4 Comparison diagram of false positive species detected in 9 cases of spike-in sample RNA library, opt represents the process of the present invention, confidence represents the process of kraken2 confidence 0.5+bracken, kraken represents the process of kraken2+bracken, and S1-S9 represents 9 samples;
- Fig. 5 Sensitivity chart of 12 simulated samples and 9 spike-in samples, opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5+bracken process, kraken represents the kraken2+bracken process, simulated represents 12 simulated samples, spike-in is 9 spike-in samples.
- the terms “comprising”, “including”, “having”, “containing” or “involving” are inclusive or open-ended and do not exclude other unrecited elements or method steps.
- the term “consisting of” is considered as a preferred embodiment of the term “comprising”. If in the following a certain group is defined as comprising at least a certain number of embodiments, this is also to be understood as revealing a group which preferably consists only of these embodiments.
- “read”, “each read” or “single read” in the present invention refers to a nucleic acid sequence generated by a high-throughput sequencing platform.
- “alignment result” in the present invention refers to the correspondence between a sequencing read and a reference sequence; one sequencing read can have multiple alignment results at the same time.
- the “kmer” in the present invention refers to a substring of k bases obtained by sliding along a sequence one base at a time and cutting it continuously.
- if the length of a read is L and the length of the k-mer is set to k, the number of k-mers is L-k+1; for example, the sequence AACTGACT with k set to 3 can be divided into 6 k-mers: AAC, ACT, CTG, TGA, GAC, and ACT.
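The k-mer definition can be checked with a few lines of code. The sketch below enumerates every k-mer of a sequence with a sliding window; the function name `kmers` is only illustrative.

```python
def kmers(seq: str, k: int) -> list[str]:
    """Return all L-k+1 substrings of length k, sliding one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(kmers("AACTGACT", 3))   # the six 3-mers of the example sequence
print(len(kmers("A" * 76, 35)))  # a 76 bp read with k = 35 yields 76 - 35 + 1 k-mers
```

Note that ACT appears twice in the example: k-mers are counted by position, so repeated substrings are not collapsed.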
- the "kraken2" in the present invention refers to a high-precision metagenomic sequence classification software based on the kmer algorithm in the field, which can quickly classify sequencing reads into species.
- the "kraken2 optimization algorithm" described in the present invention refers to an optimization system for microbial species-level detection based on the comparison results of kraken2 developed by the present invention, aiming at improving accuracy and reducing false positive detection.
- the nt alignment database of the present invention is the kraken2 index database built from the NCBI nt database.
- the "taxid” or “taxonomy_id” mentioned in the present invention refers to the id number in the taxonomy database.
- the bioinformatics analysis method of the present invention, based on the kraken2 single-sequence kmer score and overall taxonomy structure statistics, generally includes the following steps:
- the positioning rules may include:
- if a sequence obtains a unique taxid according to the taxid-kmer result and that taxid is below the species level, the sequence is positioned at the species-level taxid to which it belongs;
- Calculation rules can be:
- kmer score = (family taxid kmers + genus taxid kmers + species taxid kmers + subtype/serotype taxid kmers) / total kmers;
- the genus relative ratio is the ratio of the reads of a given species-level taxid to the reads of the species-level taxid with the most reads in the same genus
- the family relative ratio is the ratio of the reads of a given species-level taxid to the reads of the species-level taxid with the most reads in the same family
- correction 1 of the species-level taxid reads retained after filtering in 7 and 8: the genus relative ratio is calculated, and the genus-corrected reads of each species-level taxid are calculated from the genus-level taxid reads according to that ratio;
- for this correction, the genus relative ratio is obtained by summing the species-level taxid reads of the same genus retained after filtering in 7 and 8, and then calculating the ratio of each species-level taxid's reads to that sum;
- the genus-level taxid reads include the genus-level taxid reads in 6 plus the reads of species-level taxids of that genus that did not pass the filtering threshold in 7;
- correction 2 of the species-level taxid reads retained after filtering in 7 and 8: the family relative ratio is calculated, and the family-corrected reads of each species-level taxid are calculated from the family-level taxid reads according to that ratio;
- for this correction, the family relative ratio is obtained by summing the species-level taxid reads of the same family retained after filtering in 7 and 8, and then calculating the ratio of each species-level taxid's reads to that sum;
- the family-level taxid reads include the family-level taxid reads in 6 plus the reads of species-level taxids of that family that did not pass the filtering threshold in 8.
- microbial detection, which is equivalent to species-level taxid detection
- the species-level taxids obtained from 7 and 8 are the final species taxids
- the sum of the species-level taxid reads obtained in 6, 9, and 10 gives the final species taxid reads, and relative abundance is calculated from that sum of reads.
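The two corrections and the final abundance calculation can be sketched as a proportional redistribution. This is an interpretation under stated assumptions: the genus-level (or family-level) read pool is shared among the retained species in proportion to their own reads, and relative abundance is each species' final reads divided by the sum over all final species. The function names and the numbers are hypothetical.

```python
def apportion(kept_reads, pool):
    """Share a genus- or family-level read pool among retained species.

    Each species receives pool * (its reads / sum of retained reads in the group).
    """
    total = sum(kept_reads.values())
    return {sp: pool * n / total for sp, n in kept_reads.items()}


def relative_abundance(final_reads):
    """Divide each species' final reads by the sum of all final reads (assumed denominator)."""
    total = sum(final_reads.values())
    return {sp: n / total for sp, n in final_reads.items()}


# Hypothetical genus: two retained species, 20 genus-level reads,
# plus 1 read from a species that failed the filter and was merged into the genus.
kept = {"species A": 90, "species B": 10}
genus_correction = apportion(kept, pool=20 + 1)
final = {sp: kept[sp] + genus_correction[sp] for sp in kept}
print(final)
print(relative_abundance(final))
```

A family-level correction would call `apportion` again with the family pool and add the result to `final`, which mirrors summing the reads from 6, 9, and 10. Redistributing in proportion keeps the total read count of the group unchanged.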
- the problem to be solved in this embodiment is how to ensure the accuracy of the kraken2 comparison results as much as possible through the data analysis method.
- sequence similarity: typically, the sequence similarity of Escherichia coli and Shigella exceeds 99%; such similarity is especially common among species of the same genus, is an important cause of false positive detections, and also interferes with the detection of the species actually present, reducing sensitivity;
- the so-called data analysis of the sample sequencing results alone refers to: the information obtained is only one sample sequence file (FASTQ), and no other information is known; through some algorithms, the species detection results are output.
- sequence similarity: on the one hand, this can be addressed through updates to reference genomes and progress in species taxonomy; on the other hand, it can be optimized to a certain extent algorithmically, which is one of the considerations of this patent;
- the host genome can be removed by accurately constructing the host genome sequence and using an alignment algorithm, but this does not guarantee that host removal is accurate and thorough (incomplete removal leaves some residual host sequences, while excessive removal reduces the sensitivity of microbial detection). In addition, the results of single-sequence alignments can be analyzed precisely within the algorithm, and thresholds for species detection can be set at the overall level, to obtain a certain degree of optimization; this is also a consideration of this patent;
- the positioning rules include: accept the taxid positioning given by kraken2 in principle, except for the following cases:
- if a sequence aligns to a unique taxid and that taxid is below the species level, it is positioned at the species-level taxid to which the taxid belongs (for example, a 76 bp sequence with a 35 bp k-mer length yields 42 k-mers; if the taxid-kmer result is 0:10, 1313:32, then apart from the 10 k-mers that cannot be aligned, the rest align to taxid 1313, which the taxonomy hierarchy places at Streptococcus pneumoniae);
- a sequence that aligns to two or more taxids falls into one of 3 types:
- all taxids in the result are associated with only one species at the species level, and the other taxids are serotypes/subtypes of that species or its genus or family; the sequence is positioned at the taxid of that species (for example, a 76 bp sequence with a 35 bp k-mer length yields 42 k-mers; if the taxid-kmer result is 0:10, 1313:20, 1301:12, then apart from the 10 k-mers that cannot be aligned, 20 k-mers align to taxid 1313, which the taxonomy hierarchy places at Streptococcus pneumoniae, and the other 12 k-mers align to the genus taxid 1301 Streptococcus; since Streptococcus pneumoniae belongs to the genus Streptococcus, the sequence is positioned at the species taxid 1313 Streptococcus pneumoniae);
- all taxids in the result are associated with different species of the same genus; the sequence is positioned at the genus taxid (for example, a 76 bp sequence with a 35 bp k-mer length yields 42 k-mers; if the taxid-kmer result is 0:10, 1313:20, 28037:5, 1301:7, then apart from the 10 k-mers that cannot be aligned, 20 k-mers align to taxid 1313 Streptococcus pneumoniae, 5 k-mers align to 28037 Streptococcus mitis, and 7 k-mers align to 1301 Streptococcus; since 2 species of Streptococcus are involved, the sequence is positioned at taxid 1301 Streptococcus);
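The positioning cases listed above can be sketched against a toy taxonomy. This is a minimal illustration rather than the patent's implementation: the `TAXONOMY` table is hand-written for the example (taxid 9000001 is a made-up serotype node), only the cases described in this text are handled, and any other combination returns `None`, which here stands for falling back to kraken2's own assignment.

```python
# Toy taxonomy: taxid -> (rank, parent taxid). 9000001 is a hypothetical serotype.
TAXONOMY = {
    1300: ("family", None),      # Streptococcaceae
    1301: ("genus", 1300),       # Streptococcus
    1313: ("species", 1301),     # Streptococcus pneumoniae
    28037: ("species", 1301),    # Streptococcus mitis
    9000001: ("serotype", 1313),
}


def lineage(taxid):
    """Return the taxid followed by all of its ancestors."""
    path = []
    while taxid is not None:
        path.append(taxid)
        taxid = TAXONOMY[taxid][1]
    return path


def ancestor_at(taxid, rank):
    """Return the node of the given rank on the taxid's lineage, or None."""
    for node in lineage(taxid):
        if TAXONOMY[node][0] == rank:
            return node
    return None


def position(counts):
    """Position one read from its {taxid: kmer count} result (taxid 0 = unaligned)."""
    taxids = [t for t in counts if t != 0]
    species = {ancestor_at(t, "species") for t in taxids} - {None}
    if len(species) == 1:
        sp = next(iter(species))
        # Every taxid must be the species, one of its subtypes, or one of its ancestors.
        if all(t in lineage(sp) or ancestor_at(t, "species") == sp for t in taxids):
            return sp
    elif len(species) > 1:
        genera = {ancestor_at(s, "genus") for s in species}
        if len(genera) == 1 and None not in genera:
            genus = next(iter(genera))
            # Every taxid must sit inside that genus or above it.
            if all(ancestor_at(t, "genus") == genus or t in lineage(genus) for t in taxids):
                return genus
    return None


print(position({0: 10, 1313: 32}))
print(position({0: 10, 1313: 20, 1301: 12}))
print(position({0: 10, 1313: 20, 28037: 5, 1301: 7}))
```

In a real pipeline the lineage lookup would come from the NCBI taxonomy dump rather than a hand-written dictionary.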
- the calculation rules are:
- kmer score = (family taxid kmers + genus taxid kmers + species taxid kmers + subtype/serotype taxid kmers) / total kmers;
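A worked instance of the kmer score may help. The sketch below assumes that "total kmers" counts every k-mer of the read, including the unaligned ones reported under taxid 0, and that each taxid's rank is available from a lookup; both the function name `kmer_score` and the `rank_of` mapping are hypothetical.

```python
COUNTED_RANKS = {"family", "genus", "species", "subtype", "serotype"}


def kmer_score(counts, rank_of):
    """Fraction of a read's k-mers assigned to family/genus/species/subtype taxids.

    counts:  taxid -> number of k-mers (taxid 0 = unaligned k-mers)
    rank_of: taxid -> rank name; taxids missing from it are not counted
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for taxid, n in counts.items() if rank_of.get(taxid) in COUNTED_RANKS)
    return hits / total


ranks = {1313: "species", 1301: "genus"}
# Example from the text: 0:10, 1313:20, 1301:12 on a read with 42 k-mers.
print(round(kmer_score({0: 10, 1313: 20, 1301: 12}, ranks), 3))
```

For that read the score is (12 + 20) / 42, so the 10 unaligned k-mers are what pull it below 1.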
- 4 bacteria were selected: Haemophilus influenzae, Streptococcus pneumoniae, Staphylococcus aureus, and Klebsiella pneumoniae
- 2 fungi: Candida albicans and Aspergillus fumigatus
- 4 viruses: human herpesvirus, human papillomavirus, influenza A virus, and HIV
- GRCh38.p13; the sequence length is set to 75 bp
- the total data volume is set to 10M
- there are 4 groups in total, each consisting of 3 identical samples.
- the reference genomes used by the simulated samples are shown in the table below:
- score_cutoff: the statistics of the samples under different thresholds are summarized. Samples 7-12 are examined in particular because their total microbial content is low and their error rate is higher than the overall level. From the statistical results, setting a threshold higher than this error rate can eliminate mis-aligned reads, as shown in the following table:
- for each sample, the false positive species are divided into list false positive species (species detected through mis-alignment that are in the same family as one of the 10 simulated species) and non-list false positive species (species detected through mis-alignment that are not in the same family as any of the 10 simulated species), and the species with the highest reads are counted. Since replicate samples gave completely consistent detections, only the results of a representative sample are listed, as shown in the following table:
- the final reads_cutoff is set in order to correct for species detected in other families. From the comparison of different score_cutoff values, and accepting a slightly lower alignment rate, score_cutoff is set to 0.5 and reads_cutoff is set to greater than 3 to eliminate the detection of non-list false positive species (the lower the value, the better the sensitivity for true positive species detected with few reads).
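How the two cutoffs act together can be sketched as below. This is a simplified illustration: it assumes a read is kept when its kmer score is at least score_cutoff (the text does not say whether the comparison is strict), it reports a species only when its reads exceed reads_cutoff, and it skips the genus and family corrections. The function name `report_species` and the input data are hypothetical.

```python
from collections import Counter


def report_species(read_calls, score_cutoff=0.5, reads_cutoff=3):
    """Count reads per species after score filtering, then apply the reads cutoff.

    read_calls: iterable of (species, kmer_score) pairs, one per read
    """
    counts = Counter(sp for sp, score in read_calls if score >= score_cutoff)
    return {sp: n for sp, n in counts.items() if n > reads_cutoff}


calls = (
    [("Streptococcus pneumoniae", 0.9)] * 4   # 4 confident reads: reported
    + [("Streptococcus mitis", 0.9)] * 3      # only 3 reads: below reads_cutoff
    + [("Nannochloropsis oceanica", 0.4)] * 10  # low scores: removed by score_cutoff
)
print(report_species(calls))
```

Lowering `reads_cutoff` would report the 3-read species as well, which is the sensitivity trade-off the text describes.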
- the positioning rules include:
- if a sequence aligns to a unique taxid and that taxid is below the species level, it is positioned at the taxid of the species to which the taxid belongs;
- a sequence that aligns to two or more taxids falls into one of 3 types:
- the calculation rules are:
- kmer score = (family taxid kmers + genus taxid kmers + species taxid kmers + subtype/serotype taxid kmers) / total kmers;
- the false positive and false negative detections are sorted out as follows:
sample | taxid | species | reads | relative abundance | result |
---|---|---|---|---|---|
sample 1 | 340412 | Aspergillus novofumigatus | 1 | 0.00011 | false positive |
sample 1 | 984962 | Heterobasidion irregulare | 1 | 0.00011 | false positive |
sample 1 | 145522 | Nannochloropsis oceanica | 4 | 0.00044 | false positive |
sample 1 | 28037 | Streptococcus mitis | 2 | 0.00022 | false positive |
sample 1 | 2656787 | Venustapulla echinocandica | 1 | 0.00011 | false positive |
sample 10 | 86049 | Cladophialophora carrionii | 1 | 0.00011 | false positive |
sample 10 | 10376 | Human gammaherpesvirus 4 | 1 | 0.00011 | false positive |
sample 10 | 1873960 | Pseudocercospora fijiensis | 2 | 0.00022 | false positive |
sample 10 | 2656787 | Venustapulla echinocandica | 1 | 0.00011 | false positive |
sample 10 | 727 | Haemophilus influenzae | 0 | 0 | false negative |
sample 10 | 573 | Klebsiella pneumoniae | 0 | 0 | false negative |
sample 11 | 727 | Haemophilus influenzae | 0 | 0 | false negative |
sample 11 | 573 | Kle
- Embodiment 3 actual sample detection experiment
- the total number of positive species is 148; kraken2 confidence 0.5+bracken fails to detect 2 species (sensitivity 98.6%), while the process of the present invention and kraken2+bracken each fail to detect one species (sensitivity 99.3%). The performance is similar to that on the simulated data: in terms of sensitivity, the process of the present invention equals kraken2+bracken and is slightly higher than kraken2 confidence 0.5+bracken.
- the summary statistics of the false positive species detected by each process for the RNA library are as follows (corresponding to the results in Figure 4, where opt represents the process of the present invention, confidence represents the kraken2 confidence 0.5+bracken process and corresponds to the fourth column in the table, and kraken represents the kraken2+bracken process):
- the sensitivity results of the summary simulation samples and spike-in sample statistics are as follows (corresponding to the results in Figure 5, where opt in the picture represents the process of the present invention, confidence represents the Kraken2 confidence 0.5+bracken process, and kraken represents the Kraken2+bracken process):
- the process of the present invention detects far fewer false positive species than the kraken2+bracken process, and, while keeping sensitivity higher than kraken2 confidence 0.5+bracken, it also detects fewer false positives than the latter (even when only reads>3 are reported, the detection of false positive species can still be reduced by about 1/3).
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
A biological information analysis method based on the kraken2 single-sequence k-mer score and overall taxonomy structure statistics. The method can reduce false positives in biological information analysis and improve the accuracy of species detection. The method is applicable to second-generation metagenomic sequencing analysis.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110804351.8A CN113539369B (zh) | 2021-07-14 | 2021-07-14 | An optimized kraken2 algorithm and its application in second-generation sequencing |
CN202110804351.8 | 2021-07-14 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023283967A1 true WO2023283967A1 (fr) | 2023-01-19 |
Family
ID=78128300
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/106970 WO2023283967A1 (fr) | Optimized kraken2 algorithm and its application in second-generation sequencing | 2021-07-14 | 2021-07-17 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113539369B (fr) |
WO (1) | WO2023283967A1 (fr) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113539369B (zh) * | 2021-07-14 | 2022-03-25 | 江苏先声医学诊断有限公司 | An optimized kraken2 algorithm and its application in second-generation sequencing |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111681704A (zh) * | 2020-04-21 | 2020-09-18 | 华中科技大学鄂州工业技术研究院 | Method for constructing a database for identifying unknown plant species based on the matK gene, and the database |
CN112071366A (zh) * | 2020-10-13 | 2020-12-11 | 南开大学 | Metagenomic data analysis method based on second-generation sequencing technology |
US20210141833A1 (en) * | 2019-11-07 | 2021-05-13 | International Business Machines Corporation | Optimizing k-mer databases by k-mer subtraction |
CN113096737A (zh) * | 2021-03-26 | 2021-07-09 | 北京源生康泰基因科技有限公司 | Method and system for automatically analyzing pathogen types |
CN113539369A (zh) * | 2021-07-14 | 2021-10-22 | 江苏先声医学诊断有限公司 | Optimized kraken2 algorithm and its application in second-generation sequencing |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
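CN111462821B (zh) * | 2020-04-10 | 2022-02-22 | 广州微远医疗器械有限公司 | Pathogenic microorganism analysis and identification system and application thereof |
CN111710365B (zh) * | 2020-06-10 | 2022-04-08 | 山东省计算中心(国家超级计算济南中心) | Ontology-based method for constructing a protein/gene synonym table |
CN112599198A (zh) * | 2020-12-29 | 2021-04-02 | 上海派森诺生物科技股份有限公司 | Method for analyzing microbial species and functional composition in metagenomic sequencing data |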
2021
- 2021-07-14: CN application CN202110804351.8A, patent CN113539369B (zh), status: Active
- 2021-07-17: WO application PCT/CN2021/106970, publication WO2023283967A1 (fr), status: Unknown
Also Published As
Publication number | Publication date |
---|---|
CN113539369A (zh) | 2021-10-22 |
CN113539369B (zh) | 2022-03-25 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US20230366046A1 (en) | Systems and methods for analyzing viral nucleic acids | |
Zielezinski et al. | Alignment-free sequence comparison: benefits, applications, and tools | |
Marchant et al. | The C-Fern (Ceratopteris richardii) genome: insights into plant genome evolution with the first partial homosporous fern genome assembly | |
CN111462821B (zh) | Pathogenic microorganism analysis and identification system and application thereof | |
JP2016502162A (ja) | Database-driven primary analysis of raw sequencing data | |
US20200294628A1 (en) | Creation or use of anchor-based data structures for sample-derived characteristic determination | |
WO2018218788A1 (fr) | Third-generation sequencing sequence alignment method based on global seed scoring optimization | |
US11809498B2 (en) | Optimizing k-mer databases by k-mer subtraction | |
CN115631789B (zh) | Pan-genome-based population joint variant detection method | |
WO2023283967A1 (fr) | Optimized kraken2 algorithm and its application in second-generation sequencing | |
CN112992277A (zh) | Method for constructing a microbial genome database and application thereof | |
CN112599198A (zh) | Method for analyzing microbial species and functional composition in metagenomic sequencing data | |
CN115083521B (zh) | Method and system for identifying tumor cell populations in single-cell transcriptome sequencing data | |
WO2020155623A1 (fr) | Sequence alignment filtering processing method, system and device, and readable storage medium | |
US20230282309A1 (en) | Systems and methods for grouping and collapsing sequencing reads | |
CN108595912B (zh) | Method, device and system for detecting chromosomal aneuploidy | |
WO2017000859A1 (fr) | Skip search algorithm for similar subsequences in a character sequence, and its application in searching biological sequence databases | |
WO2020213736A1 (fr) | Information processing device, information processing method, program, and storage medium | |
EP3114596B1 (fr) | Methods and electronic systems for the characterization of microorganisms | |
Cai et al. | Concod: an effective integration framework of consensus-based calling deletions from next-generation sequencing data | |
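CN114334004B (zh) | Rapid alignment-based identification method for pathogenic microorganisms and application thereof | |
CN112800245B (zh) | Maximum-diversity clustering method for constructing a pathogenic microorganism reference knowledge base | |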
Namiki et al. | Fast dna sequence clustering based on longest common subsequence | |
Xu et al. | MetaQuad: Shared Informative Variants Discovery in Metagenomic Samples | |
CN116682496A (zh) | Pathogenic microorganism genome database, construction method and application thereof | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21949747; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |