WO2023283967A1 - Optimized kraken2 algorithm and its application in second-generation sequencing - Google Patents

Info

Publication number
WO2023283967A1
WO2023283967A1 PCT/CN2021/106970 CN2021106970W WO2023283967A1 WO 2023283967 A1 WO2023283967 A1 WO 2023283967A1 CN 2021106970 W CN2021106970 W CN 2021106970W WO 2023283967 A1 WO2023283967 A1 WO 2023283967A1
Authority
WO
WIPO (PCT)
Prior art keywords
taxid
level
species
reads
family
Prior art date
Application number
PCT/CN2021/106970
Other languages
English (en)
Chinese (zh)
Inventor
张岩
李振中
任用
李诗濛
郭昊
梁相志
陈莉
戴岩
李珊
顾菊
Original Assignee
江苏先声医学诊断有限公司
江苏先声医疗器械有限公司
南京先声诊断技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 江苏先声医学诊断有限公司, 江苏先声医疗器械有限公司, 南京先声诊断技术有限公司
Publication of WO2023283967A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A50/00 TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
    • Y02A50/30 Against vector-borne diseases, e.g. mosquito-borne, fly-borne, tick-borne or waterborne diseases whose impact is exacerbated by climate change

Definitions

  • The invention relates to the field of bioinformatics, in particular to an optimized kraken2 algorithm and its application in next-generation sequencing.
  • Next-generation sequencing is a massively parallel sequencing technology characterized by high throughput, high sequencing accuracy and short turnaround time, which makes it well suited to metagenomics.
  • The demand for metagenomics has led to its widespread application in infection detection.
  • Species detection in microbial communities after sequencing is the most important task in metagenomics research. Only by accurately and reliably identifying the members of a microbial community can metagenomics be linked to research questions, such as whether a patient's disease is caused by a particular microbial infection (for example, if malaria is suspected, Plasmodium must be accurately detected in the blood before a definitive diagnosis can be given). Metagenomic analysis is a fast, accurate and advanced detection technology that currently plays an important role in the auxiliary diagnosis of infectious diseases.
  • Kraken2, applied to Illumina second-generation metagenomic sequencing, offers fast analysis and high sensitivity, but its specificity is low and it often reports many false positives, which is a consequence of how the kraken2 algorithm works.
  • the default kmer length is 35 bp
  • kraken2 first builds the kmers that are specific to a species. If a kmer occurs in multiple species of the genus Streptococcus, the kmer is assigned to the genus Streptococcus; by the same principle, if a kmer occurs in multiple genera of the family Streptococcaceae, the kmer is assigned to that family.
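The assignment of a shared kmer to the lowest rank that covers all of its source species can be illustrated with a lowest-common-ancestor lookup. The following is only a minimal sketch, not kraken2's actual implementation: the `PARENT` table is a toy taxonomy, taxids 1313, 28037 and 1301 are taken from the examples in this document, and the family and second-genus entries are illustrative assumptions.

```python
# Toy taxonomy: child taxid -> parent taxid (intermediate ranks omitted).
# 1313, 28037 and 1301 appear in the text; 1300 and 1357 are illustrative.
PARENT = {
    1313: 1301,   # Streptococcus pneumoniae -> genus Streptococcus
    28037: 1301,  # Streptococcus mitis -> genus Streptococcus
    1301: 1300,   # genus Streptococcus -> family (illustrative taxid)
    1357: 1300,   # a second genus in the same family (illustrative taxid)
    1300: 1,      # family -> root
}


def lineage(taxid):
    """Return the path from taxid up to the root, deepest node first."""
    path = [taxid]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxids):
    """Return the deepest node shared by the lineages of all given taxids."""
    taxids = list(taxids)
    common = lineage(taxids[0])
    for taxid in taxids[1:]:
        other = set(lineage(taxid))
        common = [node for node in common if node in other]
    return common[0]


# A kmer found in two species of one genus is assigned to the genus;
# a kmer found in two genera of one family is assigned to the family.
print(lowest_common_ancestor([1313, 28037]))
print(lowest_common_ancestor([1313, 1357]))
```

Because every lineage ends at the root, the intersection is never empty for a non-empty input.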
  • The database contains many sequences such as plasmids and vectors.
  • Alignments to these sequences are also output, and these results are essentially meaningless (they can also be counted as false positive detections).
  • The purpose of the present invention is to provide a bioinformatics analysis method that reduces false positives in sequencing analysis, improves the accuracy of species detection, and is suitable for Illumina second-generation metagenomic sequencing.
  • The present invention proposes the following technical solutions:
  • The present invention first provides a bioinformatics analysis method based on the kraken2 single-sequence kmer score and overall taxonomy structure statistics, said method comprising the following steps:
  • NGS sequencing data are compared using kraken2 to obtain the taxid-kmer result of each sequence;
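As an illustration of what a per-sequence taxid-kmer result looks like in practice, the sketch below parses one line of kraken2's standard per-read output, assuming the default tab-separated layout (classification flag, read ID, taxid, read length, and a space-delimited list of `taxid:count` pairs). The function name `parse_kraken2_line` and the returned dictionary layout are hypothetical choices made for this example.

```python
from collections import Counter


def parse_kraken2_line(line):
    """Parse one line of kraken2 per-read output into a small record.

    Assumes default output (numeric taxid column). Tokens whose key is not
    numeric, such as the ambiguous-base marker or the paired-end separator,
    are skipped.
    """
    status, read_id, taxid, length, lca = line.rstrip("\n").split("\t")[:5]
    kmer_hits = Counter()
    for token in lca.split():
        key, _, count = token.partition(":")
        if key.isdigit() and count.isdigit():
            kmer_hits[int(key)] += int(count)
    return {
        "classified": status == "C",
        "read_id": read_id,
        "taxid": int(taxid),
        "length": length,  # kept as text: paired reads report two lengths
        "kmer_hits": dict(kmer_hits),
    }


record = parse_kraken2_line("C\tread_1\t1313\t76\t0:10 1313:20 1301:12\n")
print(record["kmer_hits"])
```

Taxid 0 collects the kmers that could not be matched, which corresponds to the `0:10` entries in the worked examples of this document.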
  • The hierarchical relationship in step 2) includes one or more of the serotype/subtype, species, genus and family levels.
  • The positioning rules in step 2) include the following:
  • if a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is below the species level, the sequence is positioned at the species-level taxid to which that taxid belongs;
  • The calculation rules in step 3) include:
  • kmer score = (family taxid kmers + genus taxid kmers + species taxid kmers + subtype/serotype taxid kmers) / total kmers;
  • The overall calculation includes:
  • the reads of a taxid are the total number of sequences positioned at that taxid in a sample;
  • the genus relative ratio is the ratio of the reads of a given species-level taxid to the reads of the species-level taxid with the most reads in the same genus;
  • the family relative ratio is the ratio of the reads of a given species-level taxid to the reads of the species-level taxid with the most reads in the same family;
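The two relative ratios defined above share one computation: divide each species' read count by the largest species read count in its group (genus or family). A minimal sketch follows; the function name `relative_ratios`, the read counts and the grouping table are hypothetical, and read counts are assumed to be positive.

```python
def relative_ratios(species_reads, group_of):
    """Ratio of each species' reads to the top species' reads in its group.

    species_reads maps species taxid -> read count (assumed > 0);
    group_of maps species taxid -> genus (or family) taxid.
    """
    top = {}
    for taxid, reads in species_reads.items():
        group = group_of[taxid]
        top[group] = max(top.get(group, 0), reads)
    return {t: reads / top[group_of[t]] for t, reads in species_reads.items()}


# Hypothetical counts: two Streptococcus species in one sample.
species_reads = {1313: 200, 28037: 4}
genus_of = {1313: 1301, 28037: 1301}
print(relative_ratios(species_reads, genus_of))
```

Passing a species-to-family table instead of `genus_of` gives the family relative ratio with the same function. A species with a very small ratio is a candidate false positive caused by reads that really come from the dominant species of the group; the filtering threshold itself is not stated in this excerpt, so none is hard-coded here.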
  • the species-level taxid reads retained by the filtering in c and d undergo correction 1: the genus relative ratio is calculated, and the genus-level taxid reads are apportioned according to it to give the genus-corrected reads of each species-level taxid;
  • here the genus relative ratio is obtained by summing the species-level taxid reads of the same genus that remain after the filtering in c and d, and then calculating the ratio of each taxid's reads to that sum;
  • the genus-level taxid reads comprise the genus-level taxid reads in b and the reads of the genus's species-level taxids that did not pass the filtering threshold in c, which are merged into the genus;
  • the family relative ratio is likewise obtained by summing the species-level taxid reads of the same family that remain after the filtering in c and d, and then calculating the ratio of each taxid's reads to that sum;
  • the family-level taxid reads comprise the family-level taxid reads in b and the reads of the family's species-level taxids that did not pass the filtering threshold in d, which are merged into the family.
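One way to read the correction described above is as a proportional redistribution: the reads held at the genus (or family) level are shared among the retained species in proportion to each species' share of the retained species reads. The sketch below follows that interpretation, which is an assumption; the function name `apportion` and all counts are hypothetical.

```python
def apportion(parent_reads, kept_species_reads):
    """Split parent-level reads among retained species by their read share.

    parent_reads: reads held at the genus (or family) level, including reads
    merged in from species that failed the filter.
    kept_species_reads: species taxid -> reads, for retained species only.
    Returns species taxid -> corrected (possibly fractional) reads.
    """
    total = sum(kept_species_reads.values())
    if total == 0:
        return {taxid: 0.0 for taxid in kept_species_reads}
    return {
        taxid: parent_reads * reads / total
        for taxid, reads in kept_species_reads.items()
    }


# Hypothetical example: 40 genus-level reads shared between two species.
print(apportion(40, {1313: 150, 28037: 50}))
```

The same function can be called once with genus-level reads and once with family-level reads, which mirrors the two corrections in the text.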
  • Step 5), species-level detection, is equivalent to species-level taxid detection: the species-level taxids retained after c and d are the final species taxids; the species-level taxid reads obtained in b, e and f are summed to give the final species taxid reads, and relative abundance is calculated from the total reads.
  • The database used for comparison in step 1) is the nt, RefSeq or GenBank database; preferably, the database is the nt database.
  • The present invention also provides the application of the method based on the kraken2 single-sequence kmer score and overall taxonomy structure statistics in next-generation sequencing bioinformatics analysis.
  • The present invention also provides a computer-readable medium storing a computer program which, when executed by a processor, implements any one of the methods described above.
  • The present invention also provides an electronic device comprising a processor and a memory, the memory storing one or more readable instructions which, when executed by the processor, implement any one of the methods described above.
  • The bioinformatics method of the present invention reduces false positives in bioinformatics analysis, improves the accuracy of species detection, and is suitable for second-generation metagenomic sequencing, including single-end and paired-end sequencing.
  • The present invention maintains overall sensitivity through precise positioning of single sequences and overall systematic optimization.
  • The present invention excludes comparison results against plasmids, vectors and similar sequences, effectively reducing meaningless detections.
  • Figure 1: Schematic diagram of the system of the present invention.
  • Figure 2: Schematic diagram of taxid positioning and kmer scoring for a single sequence.
  • Figure 3: Comparison of false positive species detected in the DNA libraries of 9 spike-in samples; opt denotes the process of the present invention, confidence denotes the kraken2 confidence 0.5+bracken process, kraken denotes the kraken2+bracken process, and S1-S9 denote the 9 samples.
  • Figure 4: Comparison of false positive species detected in the RNA libraries of 9 spike-in samples; opt denotes the process of the present invention, confidence denotes the kraken2 confidence 0.5+bracken process, kraken denotes the kraken2+bracken process, and S1-S9 denote the 9 samples.
  • Figure 5: Sensitivity for 12 simulated samples and 9 spike-in samples; opt denotes the process of the present invention, confidence denotes the kraken2 confidence 0.5+bracken process, kraken denotes the kraken2+bracken process, simulated denotes the 12 simulated samples, and spike-in denotes the 9 spike-in samples.
  • The terms “comprising”, “including”, “having”, “containing” or “involving” are inclusive or open-ended and do not exclude other unrecited elements or method steps.
  • The term “consisting of” is considered a preferred embodiment of the term “comprising”. If in the following a group is defined as comprising at least a certain number of embodiments, this is also to be understood as disclosing a group which preferably consists only of these embodiments.
  • “read”, “each read” or “single read” in the present invention refers to a nucleic acid sequence generated by a high-throughput sequencing platform.
  • “alignment result” (“alignment”) in the present invention refers to the correspondence between a sequencing read and a reference sequence; a single sequencing read can have multiple alignment results at the same time.
  • “kmer” in the present invention refers to a substring of k bases obtained by sliding a window along a sequence one base at a time.
  • If the length of a read is L and the k-mer length is set to k, the number of k-mers is L-k+1. For example, with k set to 3, the sequence AACTGACT is divided into 6 k-mers: AAC, ACT, CTG, TGA, GAC and ACT.
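The sliding-window definition and the L-k+1 count can be checked with a few lines of code. This is a minimal sketch; the function name `kmers` is chosen for this example only.

```python
def kmers(seq, k):
    """Return all substrings of length k, moving one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# The example from the text: 8 bases with k = 3 give 8 - 3 + 1 = 6 k-mers.
print(kmers("AACTGACT", 3))

# A 76 bp read with a 35 bp k-mer length gives 76 - 35 + 1 = 42 k-mers.
print(len(kmers("A" * 76, 35)))
```

A sequence shorter than k yields an empty list, which matches the formula once negative counts are treated as zero.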
  • “kraken2” in the present invention refers to a high-precision metagenomic sequence classification software based on a kmer algorithm, which can quickly classify sequencing reads to species.
  • “kraken2 optimization algorithm” in the present invention refers to an optimization system, developed by the present invention, for microbial species-level detection based on the comparison results of kraken2, aimed at improving accuracy and reducing false positive detection.
  • The “nt comparison database” of the present invention is the kraken2 comparison index database built from the NCBI nt database.
  • “taxid” or “taxonomy_id” in the present invention refers to the ID number in the taxonomy database.
  • The bioinformatics analysis method of the present invention, based on the kraken2 single-sequence kmer score and overall taxonomy structure statistics, generally includes the following steps:
  • The positioning rules may include:
  • if a sequence obtains a unique taxid according to the taxid-kmer result and the taxid is below the species level, the sequence is positioned at the species-level taxid to which that taxid belongs;
  • The calculation rules can be:
  • kmer score = (family taxid kmers + genus taxid kmers + species taxid kmers + subtype/serotype taxid kmers) / total kmers;
  • the genus relative ratio is the ratio of the reads of a given species-level taxid to the reads of the species-level taxid with the most reads in the same genus;
  • the family relative ratio is the ratio of the reads of a given species-level taxid to the reads of the species-level taxid with the most reads in the same family;
  • species-level taxid reads correction 1: for the species-level taxids retained by the filtering in 7 and 8, the genus relative ratio is calculated, and the genus-level taxid reads are apportioned according to it to give the genus-corrected reads of each species-level taxid;
  • here the genus relative ratio is obtained by summing the species-level taxid reads of the same genus that remain after the filtering in 7 and 8, and then calculating the ratio of each taxid's reads to that sum;
  • the genus-level taxid reads comprise the genus-level taxid reads from 6 and the reads from 7 of the genus's species-level taxids that did not pass the filtering threshold;
  • species-level taxid reads correction 2: for the species-level taxids retained by the filtering in 7 and 8, the family relative ratio is calculated, and the family-level taxid reads are apportioned according to it to give the family-corrected reads of each species-level taxid;
  • the family relative ratio is likewise obtained by summing the species-level taxid reads of the same family that remain after the filtering in 7 and 8, and then calculating the ratio of each taxid's reads to that sum;
  • the family-level taxid reads comprise the family-level taxid reads from 6 and the reads from 8 of the family's species-level taxids that did not pass the filtering threshold.
  • Microbial detection, which is equivalent to species-level taxid detection:
  • the species-level taxids obtained according to 7 and 8 are the final species taxids;
  • the species-level taxid reads obtained in 6, 9 and 10 are summed to give the final species taxid reads, and relative abundance is calculated from the total reads.
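The final summation can be sketched as follows. This is an illustration under stated assumptions: the three inputs stand for the directly positioned species reads, the genus-corrected reads and the family-corrected reads; relative abundance is taken relative to the sum of the final species reads (the text does not say which total is used); and the function name and all numbers are hypothetical.

```python
def final_species_table(retained, direct, genus_corrected, family_corrected):
    """Sum the three read sources per retained species and add abundance.

    Returns species taxid -> (final reads, relative abundance).
    """
    totals = {
        taxid: direct.get(taxid, 0)
        + genus_corrected.get(taxid, 0)
        + family_corrected.get(taxid, 0)
        for taxid in retained
    }
    grand_total = sum(totals.values())
    return {
        taxid: (reads, reads / grand_total if grand_total else 0.0)
        for taxid, reads in totals.items()
    }


table = final_species_table(
    retained={1313, 28037},
    direct={1313: 150, 28037: 50},
    genus_corrected={1313: 30.0, 28037: 10.0},
    family_corrected={1313: 7.5, 28037: 2.5},
)
print(table)
```

Species that were filtered out are simply absent from `retained`, so their reads contribute only through the genus- or family-level corrections.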
  • The problem to be solved in this embodiment is how to make the kraken2 comparison results as accurate as possible through the data analysis method.
  • Sequence similarity: typically, the sequence similarity of Escherichia coli and Shigella exceeds 99%. Such similarity is especially common among species of the same genus; it is an important cause of false positive detections and also interferes with the detection of the species truly present, reducing sensitivity;
  • Data analysis of the sample sequencing results alone means that the only information available is a single sample sequence file (FASTQ), with nothing else known; the species detection results are output by algorithms alone.
  • Sequence similarity: on the one hand, it can be addressed through reference genome updates and progress in species taxonomy; on the other hand, it can be optimized to a certain extent through algorithms, which is part of what this patent addresses;
  • The host genome can be removed by accurately constructing the host genome sequence and using an alignment algorithm, but this does not guarantee that removal is accurate and thorough (incomplete removal leaves some residue, while excessive removal reduces the sensitivity of microbial detection). In addition, a degree of optimization can be obtained by accurately analysing the single-sequence comparison results within the algorithm and setting an overall threshold for species detection, which is what this patent addresses;
  • The positioning rules include: accept the taxid positioning given by kraken2 in principle, except in the following cases:
  • if a sequence is compared to a unique taxid and the taxid is below the species level, it is positioned at the species-level taxid to which the taxid belongs (for example, a sequence 76 bp long with a kmer length of 35 bp yields 42 kmers; if the taxid-kmer comparison result is 0:10, 1313:32, then apart from the 10 kmers that cannot be compared, the rest are compared to taxid 1313, and the taxonomy hierarchy is used to position the sequence at taxid 1313, Streptococcus pneumoniae);
  • if a sequence is compared to two or more taxids, three cases can be distinguished:
  • All taxids in the comparison result are associated with only one species at the species level, and the other taxids are serotypes/subtypes of that species or the genus or family to which it belongs: the sequence is positioned at the taxid of the species (for example, a sequence 76 bp long with a kmer length of 35 bp yields 42 kmers; the taxid-kmer comparison result is 0:10, 1313:20, 1301:12; apart from the 10 kmers that cannot be compared, 20 kmers are compared to taxid 1313, which the taxonomy hierarchy identifies as Streptococcus pneumoniae, and the other 12 kmers are compared to taxid 1301, the genus Streptococcus; since Streptococcus pneumoniae belongs to the genus Streptococcus, the sequence is positioned at the species taxid 1313, Streptococcus pneumoniae);
  • All taxids in the comparison result are associated with different species of the same genus: the sequence is positioned at the genus taxid (for example, a sequence 76 bp long with a kmer length of 35 bp yields 42 kmers; the taxid-kmer comparison result is 0:10, 1313:20, 28037:5, 1301:7; apart from the 10 kmers that cannot be compared, 20 kmers are compared to taxid 1313, which the taxonomy hierarchy identifies as Streptococcus pneumoniae, 5 kmers are compared to taxid 28037, Streptococcus mitis, and 7 kmers are compared to taxid 1301, the genus Streptococcus; since two species of Streptococcus are involved, the sequence is positioned at taxid 1301, Streptococcus);
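The positioning cases listed above can be sketched as a small rule function over a toy taxonomy. This is an illustrative reading, not the patented implementation: the names `RANK`, `PARENT`, `ancestor_at` and `position_read` are hypothetical, the family and subtype taxids are placeholders, and the third multi-taxid case is not given in this excerpt, so the sketch falls back to the kraken2 call for anything it does not recognise.

```python
# Toy taxonomy. Taxids 1313, 28037 and 1301 come from the examples in the
# text; 1300 (family) and 170187 (a subtype entry) are illustrative.
RANK = {170187: "subtype", 1313: "species", 28037: "species",
        1301: "genus", 1300: "family"}
PARENT = {170187: 1313, 1313: 1301, 28037: 1301, 1301: 1300}


def ancestor_at(taxid, rank):
    """Walk up from taxid and return the first node of that rank, or None."""
    while taxid is not None:
        if RANK.get(taxid) == rank:
            return taxid
        taxid = PARENT.get(taxid)
    return None


def position_read(kmer_hits, kraken_taxid):
    """Return the taxid at which a read is positioned.

    kmer_hits maps taxid -> kmer count; taxid 0 holds unmatched kmers.
    kraken_taxid is the kraken2 call, kept when no listed rule applies.
    """
    taxids = [t for t in kmer_hits if t != 0]
    species = {ancestor_at(t, "species") for t in taxids} - {None}
    genera = {ancestor_at(t, "genus") for t in taxids} - {None}
    if len(species) == 1 and len(genera) == 1:
        return next(iter(species))   # one species (plus its subtypes/genus/family)
    if len(species) > 1 and len(genera) == 1:
        return next(iter(genera))    # several species of a single genus
    return kraken_taxid              # cases not described in this excerpt


print(position_read({0: 10, 1313: 32}, 1313))
print(position_read({0: 10, 1313: 20, 1301: 12}, 1301))
print(position_read({0: 10, 1313: 20, 28037: 5, 1301: 7}, 1313))
```

The three calls reproduce the worked examples: the first two reads are positioned at species taxid 1313 and the third at genus taxid 1301.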
  • The calculation rules are:
  • kmer score = (family taxid kmers + genus taxid kmers + species taxid kmers + subtype/serotype taxid kmers) / total kmers;
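Applied to the worked example 0:10, 1313:20, 1301:12, the formula gives (12 + 20) / 42. The sketch below computes this under two assumptions that the excerpt does not settle: that "total kmers" includes the unmatched kmers reported under taxid 0, and that every hit at family, genus, species or subtype/serotype rank counts towards the numerator. The rank table and function name are hypothetical.

```python
SCORED_RANKS = {"family", "genus", "species", "subtype"}


def kmer_score(kmer_hits, rank_of):
    """Fraction of a read's kmers that hit family/genus/species/subtype taxids.

    kmer_hits maps taxid -> kmer count (taxid 0 = unmatched kmers);
    rank_of maps taxid -> rank name.
    """
    total = sum(kmer_hits.values())
    if total == 0:
        return 0.0
    scored = sum(
        count for taxid, count in kmer_hits.items()
        if rank_of.get(taxid) in SCORED_RANKS
    )
    return scored / total


rank_of = {1313: "species", 28037: "species", 1301: "genus"}
print(round(kmer_score({0: 10, 1313: 20, 1301: 12}, rank_of), 3))
```

For the 76 bp example read this yields 32 of 42 kmers, a score of roughly 0.76.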
  • Four bacteria were selected: Haemophilus influenzae, Streptococcus pneumoniae, Staphylococcus aureus and Klebsiella pneumoniae
  • 2 fungi: Candida albicans and Aspergillus fumigatus
  • 4 viruses: human herpesvirus, human papillomavirus, influenza A virus and HIV
  • GRCh38.p13; the sequence length is set to 75 bp
  • the total data volume is set to 10M
  • There are 4 groups in total, each consisting of 3 identical samples.
  • the reference genomes used by the simulated samples are shown in the table below:
  • Score_cutoff: the statistical results of the samples under different thresholds are summarized. Samples 7-12 receive particular attention because their total microbial count is low and their error rate is higher than the overall level. The statistics show that setting a threshold higher than this error rate resolves the erroneous comparisons, as shown in the following table:
  • For each sample, false positive species are divided into list-species false positives (the species detected by erroneous comparison belongs to the same family as one of the 10 simulated species) and non-list-species false positives (the species detected by erroneous comparison does not belong to the same family as any of the 10 simulated species), and the species with the highest reads are counted. Since the replicate samples gave completely consistent detections, only the results of representative samples are listed, as shown in the following table:
  • The final reads_cutoff is set to correct for species detected in different families. Comparing different values of score_cutoff, and accepting a slightly lower comparison rate, score_cutoff is set to 0.5 and reads_cutoff is set to greater than 3 to eliminate the detection of non-list false positive species (the lower the value, the better the sensitivity for true positive species detected by only a few reads).
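The two thresholds can be sketched as a two-stage filter: drop reads whose kmer score is below score_cutoff, then report only species supported by more than reads_cutoff reads. This is a simplified illustration that skips the genus and family corrections; treating a score of exactly 0.5 as passing is an assumption, and the function name and input layout are hypothetical.

```python
from collections import Counter

SCORE_CUTOFF = 0.5   # value from the text
READS_CUTOFF = 3     # a species must have more than 3 reads


def detect_species(positioned_reads):
    """Apply the score and read-count cutoffs.

    positioned_reads: iterable of (species_taxid, kmer_score) pairs, one per
    read. Returns species taxid -> number of reads passing both cutoffs.
    """
    counts = Counter(
        taxid for taxid, score in positioned_reads if score >= SCORE_CUTOFF
    )
    return {taxid: n for taxid, n in counts.items() if n > READS_CUTOFF}


# Hypothetical reads: 5 confident reads for 1313, 3 for 28037,
# and 10 low-scoring reads for 573.
reads = [(1313, 0.9)] * 5 + [(28037, 0.8)] * 3 + [(573, 0.3)] * 10
print(detect_species(reads))
```

In this example only taxid 1313 is reported: 28037 has too few reads and every read for 573 falls below the score cutoff.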
  • Positioning rules include:
  • if a sequence is compared to a unique taxid and the taxid is below the species level, it is positioned at the species-level taxid to which the taxid belongs;
  • if a sequence is compared to two or more taxids, three cases can be distinguished:
  • The calculation rules are:
  • kmer score = (family taxid kmers + genus taxid kmers + species taxid kmers + subtype/serotype taxid kmers) / total kmers;
  • The false positive and false negative detections are sorted out as follows:

sample     taxid    species                      reads  relative abundance  result
Sample 1   340412   Aspergillus novofumigatus    1      0.00011             false positive
Sample 1   984962   Heterobasidion irregulare    1      0.00011             false positive
Sample 1   145522   Nannochloropsis oceanica     4      0.00044             false positive
Sample 1   28037    Streptococcus mitis          2      0.00022             false positive
Sample 1   2656787  Venustampulla echinocandica  1      0.00011             false positive
Sample 10  86049    Cladophialophora carrionii   1      0.00011             false positive
Sample 10  10376    Human gammaherpesvirus 4     1      0.00011             false positive
Sample 10  1873960  Pseudocercospora fijiensis   2      0.00022             false positive
Sample 10  2656787  Venustampulla echinocandica  1      0.00011             false positive
Sample 10  727      Haemophilus influenzae       0      0                   false negative
Sample 10  573      Klebsiella pneumoniae        0      0                   false negative
Sample 11  727      Haemophilus influenzae       0      0                   false negative
Sample 11  573      Kle
  • Embodiment 3: actual sample detection experiment
  • The total number of positive species is 148. kraken2 confidence 0.5+bracken failed to detect 2 species (sensitivity 98.6%), while the process of the present invention and kraken2+bracken each failed to detect one species (sensitivity 99.3%). The performance is similar to that on the simulated data: in terms of sensitivity, the process of the present invention equals kraken2+bracken and is slightly higher than kraken2 confidence 0.5+bracken.
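The sensitivity figures follow directly from the counts given: sensitivity is the number of positive species detected divided by the total number of positive species. A short check, with the function name chosen for this example only:

```python
def sensitivity(total_positive, missed):
    """Fraction of truly present species that were detected."""
    return (total_positive - missed) / total_positive


# 148 positive species; 2 missed versus 1 missed.
print(round(100 * sensitivity(148, 2), 1))
print(round(100 * sensitivity(148, 1), 1))
```

These evaluate to 146/148 and 147/148, which round to the 98.6% and 99.3% quoted above.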
  • The summary statistics of the false positive species detected by each process in the RNA libraries are as follows (corresponding to the results in Figure 4, where opt denotes the process of the present invention, confidence denotes the kraken2 confidence 0.5+bracken process and corresponds to the fourth column in the table, and kraken denotes the kraken2+bracken process):
  • The sensitivity results for the simulated samples and spike-in samples are summarized as follows (corresponding to the results in Figure 5, where opt denotes the process of the present invention, confidence denotes the kraken2 confidence 0.5+bracken process, and kraken denotes the kraken2+bracken process):
  • The process of the present invention detects far fewer false positive species than the kraken2+bracken process and, while keeping sensitivity higher than kraken2 confidence 0.5+bracken, also detects fewer false positives than the latter (even when only species with reads>3 are reported, the detection of false positive species is still reduced by about 1/3).

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A bioinformatics analysis method based on the kraken2 single-sequence k-mer score and overall taxonomy structure statistics. The method can reduce false positives in bioinformatics analysis and improve the accuracy of species detection. The method is applicable to second-generation metagenomic sequencing analysis.
PCT/CN2021/106970 2021-07-14 2021-07-17 Optimized kraken2 algorithm and its application in second-generation sequencing WO2023283967A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110804351.8A CN113539369B (zh) 2021-07-14 2021-07-14 An optimized kraken2 algorithm and its application in second-generation sequencing
CN202110804351.8 2021-07-14

Publications (1)

Publication Number Publication Date
WO2023283967A1 (fr) 2023-01-19

Family

ID=78128300

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/106970 WO2023283967A1 (fr) 2021-07-14 2021-07-17 Optimized kraken2 algorithm and its application in second-generation sequencing

Country Status (2)

Country Link
CN (1) CN113539369B (fr)
WO (1) WO2023283967A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539369B (zh) * 2021-07-14 2022-03-25 江苏先声医学诊断有限公司 An optimized kraken2 algorithm and its application in second-generation sequencing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
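CN111681704A (zh) * 2020-04-21 2020-09-18 华中科技大学鄂州工业技术研究院 Construction method of a matK-gene-based database for identifying unknown plant species, and the database
CN112071366A (zh) * 2020-10-13 2020-12-11 南开大学 A metagenomic data analysis method based on second-generation sequencing technology
US20210141833A1 (en) * 2019-11-07 2021-05-13 International Business Machines Corporation Optimizing k-mer databases by k-mer subtraction
CN113096737A (zh) * 2021-03-26 2021-07-09 北京源生康泰基因科技有限公司 Method and system for automatic analysis of pathogen types
CN113539369A (zh) * 2021-07-14 2021-10-22 江苏先声医学诊断有限公司 An optimized kraken2 algorithm and its application in second-generation sequencing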

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
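CN111462821B (zh) * 2020-04-10 2022-02-22 广州微远医疗器械有限公司 Pathogenic microorganism analysis and identification system and its application
CN111710365B (zh) * 2020-06-10 2022-04-08 山东省计算中心(国家超级计算济南中心) An ontology-based method for constructing a protein/gene synonym table
CN112599198A (zh) * 2020-12-29 2021-04-02 上海派森诺生物科技股份有限公司 A method for analysing microbial species and functional composition from metagenomic sequencing data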

Also Published As

Publication number Publication date
CN113539369A (zh) 2021-10-22
CN113539369B (zh) 2022-03-25

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21949747

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE