WO2022246062A1 - Umi collapsing - Google Patents
- Publication number
- WO2022246062A1 (application PCT/US2022/030023)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- umi
- families
- sequence
- merging
- family
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
- C12N15/1034—Isolating an individual clone by screening libraries
- C12N15/1065—Preparation or screening of tagged libraries, e.g. tagged microorganisms by STM-mutagenesis, tagged polynucleotides, gene tags
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Definitions
- the present disclosure relates generally to the field of processing sequence reads, for example, grouping sequence reads.
- a method of grouping sequence reads can include grouping sequence reads into families of sequence reads; and merging (or collapsing) families of sequence reads.
- a method for grouping sequence reads is under control of a processor (e.g., a hardware processor or a virtual processor) and comprises: receiving a plurality of sequence reads each comprising a fragment sequence and a unique molecular identifier (UMI) sequence (or an identifier sequence).
- the method can comprise: aligning sequence reads of the plurality of sequence reads to a reference sequence (e.g., a reference genome sequence) using the fragment sequences of the sequence reads.
- the method can comprise: grouping sequence reads of the plurality of sequence reads into a plurality of families of sequence reads based on the UMI sequences and positions of the fragment sequences of the sequence reads aligned to the reference sequence.
- the method can comprise: performing UMI statistic estimation of the plurality of families.
- the method can comprise: performing probability-based merging of families of the plurality of families.
- Performing probability-based merging can comprise: performing probability-based merging of families of the plurality of families using the results of UMI statistic estimation.
- performing UMI statistic estimation comprises: determining fragment (or fragment insert) size frequency, UMI jumping rate, and/or UMI frequency.
- Performing probability-based merging can comprise performing probability-based merging of families of the plurality of families using fragment size frequency, UMI jumping rate, and/or UMI frequency.
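The statistic estimation step above can be illustrated with a minimal Python sketch. It computes only the empirical fragment size frequency and UMI frequency; the UMI jumping rate is omitted because the text here gives no formula for it. The function name `estimate_frequencies` and the `(umi, start, end)` tuple layout are assumptions made for illustration, not part of the disclosure.

```python
from collections import Counter


def estimate_frequencies(families):
    """Return (fragment_size_freq, umi_freq) as relative frequencies.

    `families` is an iterable of (umi, start, end) tuples, one per family.
    This tuple layout is a hypothetical representation for illustration.
    """
    families = list(families)
    if not families:
        return {}, {}
    n = len(families)
    # Fragment size is taken as end - start of the aligned fragment.
    size_counts = Counter(end - start for _, start, end in families)
    umi_counts = Counter(umi for umi, _, _ in families)
    size_freq = {size: count / n for size, count in size_counts.items()}
    umi_freq = {umi: count / n for umi, count in umi_counts.items()}
    return size_freq, umi_freq


example = [
    ("ACGT", 100, 250),
    ("ACGT", 300, 450),
    ("TTGA", 100, 300),
    ("CCAT", 500, 650),
]
size_freq, umi_freq = estimate_frequencies(example)
```

In practice such statistics could be computed on a subset of families, and then reused as priors when scoring pairs of families for merging.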
- performing probability-based merging comprises: determining a relative likelihood (or probability) that two families of the plurality of families are derived from (or originate from) the same original nucleic acid (e.g., DNA) molecule using the fragment size frequency, the UMI jumping rate, and/or the UMI frequency.
- Performing probability-based merging can comprise: determining that the relative likelihood is above a merging threshold (e.g., 1).
- Performing probability-based merging can comprise: merging the two families of the plurality of families.
- merging the two families comprises: merging a smaller family (e.g., with fewer sequence reads) of the two families into a larger family (e.g., with more sequence reads) of the two families.
- determining the relative likelihood that the two families are derived from the same original nucleic acid molecule comprises: determining a likelihood ratio of unique molecule (or family) over non-unique molecule (or family) given fragment positions. Determining the relative likelihood that the two families are derived from the same original nucleic acid molecule can comprise: determining a likelihood ratio of UMI transition (a result of UMI jumping or sequencing error) for unique molecule over non-unique molecule.
- the relative likelihood is a product (e.g., multiplication product) of (i) the likelihood (or probability) ratio of unique molecule over non-unique molecule given fragment positions and (ii) the likelihood (or probability) ratio of UMI transition for unique molecule over non-unique molecule.
- determining the relative likelihood that the two families are derived from the same original nucleic acid molecule comprises: determining the relative likelihood using a sequencing error rate (e.g., 0.001) and/or a mismatch probability (e.g., 0.25).
- the sequencing error rate can be predetermined.
- the mismatch probability can be predetermined.
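The product described above can be written compactly as follows. The symbols $R$, $\Lambda_{\text{pos}}$, $\Lambda_{\text{UMI}}$, $F_1$, $F_2$, and $T$ are notation introduced here for illustration only; the text does not name them.

```latex
R(F_1, F_2) \;=\; \Lambda_{\text{pos}}(F_1, F_2) \times \Lambda_{\text{UMI}}(F_1, F_2),
\qquad
\text{merge } F_1 \text{ and } F_2 \text{ if } R(F_1, F_2) > T
```

Here $\Lambda_{\text{pos}}$ stands for the likelihood ratio given fragment positions, $\Lambda_{\text{UMI}}$ for the likelihood ratio of the UMI transition, and $T$ for the merging threshold (e.g., 1).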
- performing probability-based merging comprises: family identification and merging (or collapsing).
- Performing probability-based merging can comprise: duplex identification and merging (or collapsing).
- performing probability-based merging comprises: performing probability-based merging of families of the plurality of families using a probability map.
- performing probability-based merging comprises: (i) for one, one or more, or each pair of families of the plurality of families, determining a relative likelihood (or probability) that the families of the pair are derived from the same original nucleic acid molecule.
- Performing probability-based merging can comprise: (ii) for the pair of families with the highest relative likelihood (or probability), if the relative likelihood that the families in the pair are derived from the same original nucleic acid molecule is above a merging threshold (e.g., 1), then merging the families.
- performing probability-based merging further comprises: (iii) repeating (i) and (ii) until the relative likelihood of the families in the pair with the highest relative likelihood (or probability) is not above the merging threshold.
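Steps (i) to (iii) above amount to a greedy loop, sketched below in Python. The scoring function is passed in as a callable because the actual likelihood model is not specified in this passage; `toy_likelihood` is a made-up placeholder, and the names `greedy_merge`, `families`, and the `(umi, position)` read layout are assumptions for illustration.

```python
from itertools import combinations


def greedy_merge(families, relative_likelihood, threshold=1.0):
    """Repeatedly merge the best-scoring pair while its score exceeds threshold.

    `families` maps a family id to a list of reads. `relative_likelihood` is a
    caller-supplied callable (reads_a, reads_b) -> float; it stands in for the
    probability model, which is not reproduced here.
    """
    families = {fid: list(reads) for fid, reads in families.items()}
    while len(families) > 1:
        # (i) score every pair of families and keep the best one
        best_score, (a, b) = max(
            (
                (relative_likelihood(families[x], families[y]), (x, y))
                for x, y in combinations(sorted(families), 2)
            ),
            key=lambda item: item[0],
        )
        # (iii) stop once the best pair is not above the merging threshold
        if best_score <= threshold:
            break
        # (ii) merge the smaller family into the larger one
        small, large = (a, b) if len(families[a]) < len(families[b]) else (b, a)
        families[large].extend(families.pop(small))
    return families


def toy_likelihood(reads_a, reads_b):
    """Placeholder score: high if same position and UMIs differ by <= 1 base."""
    (umi_a, pos_a), (umi_b, pos_b) = reads_a[0], reads_b[0]
    mismatches = sum(x != y for x, y in zip(umi_a, umi_b))
    return 10.0 if pos_a == pos_b and mismatches <= 1 else 0.1


merged = greedy_merge(
    {
        "f1": [("ACGT", 100)] * 3,
        "f2": [("ACGA", 100)],
        "f3": [("TTTT", 500)] * 2,
    },
    toy_likelihood,
)
```

Rescoring every pair on each pass is quadratic in the number of families; a practical implementation would likely restrict comparisons to families at nearby positions.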
- performing UMI statistic estimation comprises: performing UMI statistic estimation on a subset of families of the plurality of families.
- the subset of families can comprise at least 50,000 families of the plurality of families.
- the subset of families can comprise at least 10% of families of the plurality of families.
- the plurality of families (e.g., before probability-based merging or after probability-based merging) comprises at least 500,000 families.
- the plurality of families before probability-based merging is performed can comprise at least 10% more families than the plurality of families after probability-based merging is performed.
- one, one or more, or each family of the plurality of families before (or after) merging comprises at least 1 sequence read (e.g., at least 5 sequence reads) of the plurality of sequence reads.
- one, one or more, or each of the plurality of sequence reads comprises a second UMI sequence.
- the UMI sequence can be 5’ to the fragment sequence.
- the second UMI sequence can be 3’ to the fragment sequence.
- the UMI sequence can be 3’ to the fragment sequence.
- the second UMI sequence can be 5’ to the fragment sequence.
- the UMI sequence is 4-20 bases in length.
- the second UMI sequence can be 4-20 bases in length.
- the UMI sequence and the second UMI sequence can have different lengths.
- the UMI sequence and the second UMI sequence can have an identical length.
- the UMI sequence and the second UMI sequence can be different.
- the UMI sequence and the second UMI sequence can be identical.
- the UMI sequences can be random.
- the UMI sequences can be non-random.
- the method comprises: subsequent to performing probability-based merging, for one, one or more, or each of the plurality of families, determining a consensus fragment sequence of the family, a position of the consensus fragment sequence aligned to the reference sequence, and/or a consensus UMI sequence of the family.
- the method can comprise: aligning the consensus fragment sequence to the reference sequence.
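As one way to picture consensus determination for a family, the sketch below uses a per-position majority vote. The text does not state how the consensus is computed, so majority voting, the tie rule (emit `N`), and the function name `consensus_sequence` are all assumptions for illustration.

```python
from collections import Counter


def consensus_sequence(sequences):
    """Per-position majority vote over equal-length sequences.

    A tie for the most common base at a position yields 'N'. Real pipelines
    typically also weigh base qualities and handle indels; this sketch does not.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    consensus = []
    for column in zip(*sequences):
        ranked = Counter(column).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            consensus.append("N")  # no single most common base
        else:
            consensus.append(ranked[0][0])
    return "".join(consensus)


family_reads = ["ACGT", "ACGA", "ACGT"]
family_consensus = consensus_sequence(family_reads)
```

The same function could be applied to the UMI sequences of a family to obtain a consensus UMI.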
- the method comprises: determining a fragment sequence and/or a UMI sequence of the original nucleic acid molecule from which the sequence reads of the family are derived.
- the method can comprise: aligning the fragment sequence to the reference sequence.
- the method comprises: creating a file or a report and/or generating a user interface (UI) comprising a UI element representing or comprising, for one, one or more, or each of the plurality of families, (i) the family.
- the file or report and/or the UI element can represent or comprise (ii) sequence reads of the family, fragment sequences of the family, and/or UMI sequences of the family.
- the file or report and/or the UI element can represent or comprise (iii) a consensus fragment sequence of the family, a position of the consensus fragment sequence aligned to the reference sequence, and/or a consensus UMI sequence of the family.
- the plurality of sequence reads comprises fragment sequences that are about 50 base pairs to about 1000 base pairs in length each.
- the plurality of sequence reads can comprise paired-end sequence reads and/or single-end sequence reads.
- the plurality of sequence reads can be generated by whole genome sequencing (WGS), e.g., clinical WGS (cWGS).
- the plurality of sequence reads is generated from a sample.
- the sample can be obtained from a subject.
- the sample can be generated from another sample obtained from a subject.
- the other sample can be obtained directly from the subject.
- a system of grouping sequence reads comprises: non-transitory memory configured to store executable instructions.
- the non-transitory memory can be configured to store a plurality of sequence reads each comprising a fragment sequence and a unique molecular identifier (UMI) sequence (or an identifier sequence).
- the system can comprise: a processor (e.g., a hardware processor or a virtual processor) in communication with the non-transitory memory.
- the hardware processor can be programmed by the executable instructions to perform: aligning sequence reads of the plurality of sequence reads to a reference genome sequence using the fragment sequences of the sequence reads.
- the hardware processor can be programmed by the executable instructions to perform: grouping sequence reads of the plurality of sequence reads into a plurality of families of sequence reads based on the UMI sequences and positions of the fragment sequences of the sequence reads aligned to the reference genome sequence.
- the hardware processor can be programmed by the executable instructions to perform: performing probability-based merging of families of the plurality of families.
- performing probability-based merging comprises: performing UMI statistic estimation of the plurality of families.
- performing UMI statistic estimation comprises: determining fragment (or fragment insert) size frequency, UMI jumping rate, and/or UMI frequency.
- Performing probability-based merging can comprise: performing probability-based merging of families of the plurality of families using fragment size frequency, UMI jumping rate, and/or UMI frequency.
- performing probability-based merging comprises: determining a relative likelihood (or probability) that two families of the plurality of families are derived from the same original nucleic acid molecule using the fragment size frequency, the UMI jumping rate, and/or the UMI frequency.
- Performing probability-based merging can comprise: determining that the relative likelihood is above a merging threshold.
- Performing probability-based merging can comprise: merging the two families of the plurality of families. Merging the two families comprises: merging a smaller family (e.g., with fewer sequence reads) of the two families into a larger family (with more sequence reads) of the two families.
- the relative likelihood that the two families are derived from the same original nucleic acid molecule is a product (e.g., a multiplication product) of (i) a likelihood ratio of unique molecule (or family) over non-unique molecule (or family) given fragment positions and (ii) a likelihood ratio of UMI transition (a result of UMI jumping or sequencing error) for unique molecule over non-unique molecule.
- the hardware processor can be programmed by the executable instructions to perform: determining the likelihood ratio of unique molecule over non-unique molecule given fragment positions.
- the hardware processor can be programmed by the executable instructions to perform: determining the likelihood ratio of UMI transition for unique molecule over non-unique molecule.
- determining the relative likelihood that the two families are derived from the same original nucleic acid molecule comprises: determining the relative likelihood using a sequencing error rate (e.g., 0.001) and/or a mismatch probability (e.g., 0.25).
- the sequencing error rate can be predetermined.
- the mismatch probability can be predetermined.
- performing probability-based merging comprises: family identification and merging (or collapsing).
- Performing probability-based merging can comprise: duplex identification and merging (or collapsing).
- performing probability-based merging comprises: performing probability-based merging of families of the plurality of families using a probability map.
- performing probability-based merging comprises: (i) for one, one or more, or each pair of families of the plurality of families, determining a relative likelihood (or probability) that the families of the pair are derived from the same original nucleic acid molecule.
- Performing probability-based merging can comprise: (ii) for the pair of families with the highest relative likelihood (or probability), if the relative likelihood (or probability) that the families in the pair are derived from the same original nucleic acid molecule is above a merging threshold, then merging the families.
- Performing probability-based merging can further comprise: (iii) repeating (i) and (ii) until the relative likelihood (or probability) of the families in the pair with the highest relative likelihood is not above the merging threshold.
- performing UMI statistic estimation comprises: performing UMI statistic estimation on a subset of families of the plurality of families.
- the subset of families can comprise at least 50,000 families of the plurality of families.
- the subset of families can comprise at least 10% of families of the plurality of families.
- the plurality of families (e.g., before probability-based merging or after probability-based merging) comprises at least 500,000 families.
- the plurality of families before probability-based merging is performed can comprise at least 10% more families than the plurality of families after probability-based merging is performed.
- one, one or more, or each family of the plurality of families before (or after) merging comprises at least 1 sequence read (e.g., at least 5 sequence reads) of the plurality of sequence reads.
- one, one or more, or each of the plurality of sequence reads comprises a second UMI sequence.
- the UMI sequence can be 5’ to the fragment sequence.
- the second UMI sequence can be 3’ to the fragment sequence.
- the UMI sequence can be 3’ to the fragment sequence.
- the second UMI sequence can be 5’ to the fragment sequence.
- the UMI sequence is 4-20 bases in length.
- the second UMI sequence can be 4-20 bases in length.
- the UMI sequence and the second UMI sequence can have different lengths.
- the UMI sequence and the second UMI sequence can have an identical length.
- the UMI sequence and the second UMI sequence can be different.
- the UMI sequence and the second UMI sequence can be identical.
- the UMI sequences can be random.
- the UMI sequences can be non-random.
- the hardware processor is programmed by the executable instructions to perform: subsequent to performing probability-based merging, for one, one or more, or each of the plurality of families, determining a fragment sequence (or a consensus fragment sequence) of the family, a position of the fragment sequence aligned to the reference genome sequence, and/or a UMI sequence of the family.
- the hardware processor can be programmed by the executable instructions to perform: aligning the fragment sequence of the family to the reference sequence.
- the hardware processor can be programmed by the executable instructions to perform: determining a fragment sequence and/or a UMI sequence of the original nucleic acid molecule from which the sequence reads of the family are derived.
- the method can comprise: aligning the consensus fragment sequence to the reference sequence.
- the hardware processor is programmed by the executable instructions to perform: creating a file or a report and/or generating a user interface (UI) comprising a UI element representing or comprising, for one, one or more, or each of the plurality of families, (i) the family, (ii) sequence reads of the family, fragment sequences of the family, and/or UMI sequences of the family, and/or (iii) a fragment sequence of the family, a position of the fragment sequence aligned to the reference genome sequence, and/or a UMI sequence of the family.
- the plurality of sequence reads comprises fragment sequences that are about 50 base pairs to about 1000 base pairs in length each.
- the plurality of sequence reads can comprise paired-end sequence reads and/or single-end sequence reads.
- the plurality of sequence reads can be generated by whole genome sequencing (WGS), e.g., clinical WGS (cWGS).
- the plurality of sequence reads is generated from a sample.
- the sample can be obtained from a subject.
- the sample can be generated from another sample obtained from a subject. The sample can comprise cells, cell-free DNA, cell-free fetal DNA, circulating tumor DNA, amniotic fluid, a blood sample, a biopsy sample, or a combination thereof.
- Also disclosed herein is a non-transitory computer-readable medium storing executable instructions that, when executed by a system (e.g., a computing system), cause the system to perform any method or one or more steps of a method disclosed herein.
- FIG. 1A shows a schematic illustration of collapsing sequence reads.
- FIG. 1B depicts an exemplary illustration of the basic concept of collapsing, e.g., grouping and output (or emit) of consensus.
- FIG. 2 shows non-limiting exemplary embodiments of the general process for library preparation, sequencing, and UMI collapsing.
- FIG. 3 depicts data related to error correction performance from different sample types.
- FIG. 4 shows an exemplary illustration of a genomic locus error.
- FIG. 5 depicts an exemplary illustration of current family identification methods, in which error in UMIs is assumed to be caused by sequencing, and their caveats.
- FIG. 6 depicts a UMI jumping example with an Agilent SureSelect dataset.
- FIG. 7 depicts a UMI jumping example in TSO500 fusion calling.
- FIG. 8 shows an illustration of current family identification methods for read collapsing and caveats.
- FIG. 9 shows an illustration of dual UMI.
- FIG. 10 depicts an exemplary illustration of labeled fragments for a probabilistic framework for duplicate grouping.
- FIG. 11 depicts an exemplary workflow of the disclosed probabilistic framework for duplicate grouping.
- FIG. 12 depicts an exemplary model of the merging process using the disclosed probabilistic framework.
- FIG. 13 relates to estimation of unique molecule by position.
- FIG. 14 shows exemplary probability model stat estimation of UMI jumping in dual or single UMI embodiments.
- FIG. 15 shows data related to the validation of UMI jumping estimation with a single UMI.
- FIG. 16 depicts models of estimation of unique molecule by UMI for random (top) or non-random (bottom) UMI types. See also Table 1.
- FIG. 17 depicts exemplary illustration of duplex collapsing with single UMI.
- FIG. 18 depicts data related to enhanced performance of error correction using the presently disclosed methods (DRAGEN vs. Fulcrum Genomics tools (Fgbio)).
- FIG. 19 depicts data related to Pileup vs variant caller (VC) Sensitivity.
- FIG. 20 shows a histogram of Truth Challenge Benchmark Data variant mutant support using DRAGEN vs Fgbio.
- FIG. 21 depicts a receiver operating characteristic (ROC) curve of the impact on SNP variant calling: DRAGEN UMI + DRAGEN VC. Shown are results using the positional and probability-based models disclosed herein.
- FIG. 22 depicts an ROC curve of the impact on non-SNP variant calling: DRAGEN UMI + DRAGEN VC. Shown are results using the positional and probability-based models disclosed herein.
- FIG. 23 depicts an ROC curve of the impact on SNP variant calling: DRAGEN UMI + DRAGEN VC. Shown here are the results from probability models only.
- FIG. 24 depicts an ROC curve of the impact on non-SNP variant calling: DRAGEN UMI + DRAGEN VC. Shown here are the results from probability models only.
- FIG. 25 depicts an ROC curve of the impact on SNP variant calling: DRAGEN UMI + CG VC (LQ only). Shown are results using probability based models.
- FIG. 26 depicts an ROC curve of the impact on non-SNP variant calling: DRAGEN UMI + CG VC (LQ only). Shown are results using probability based models.
- FIG. 27 depicts data related to insertion/deletion (indel) error rate.
- FIG. 28 depicts a flow diagram of an exemplary embodiment of the UMI calling methods disclosed herein.
- FIG. 29 depicts a flow diagram of an exemplary embodiment of methods for identifying collapsible regions.
- FIG. 30 depicts a flow diagram of exemplary embodiments of methods for generating consensus reads.
- FIG. 31 relates to collapsible regions.
- FIG. 32 shows an illustration of sequences collapsing using positional and UMI information.
- FIG. 33 depicts a diagram related to UMI metrics for read pairs with duplex UMI. See also Table 6.
- FIG. 34 shows a diagram related to UMI error corrections. See also Table 6.
- FIG. 35 shows a diagram of UMI metrics related to UMI collapsible regions. See also Table 6.
- FIG. 36 is a flow diagram showing an exemplary method of grouping sequence reads.
- Grouping sequence reads can include grouping sequence reads, based on UMI sequences in the sequence reads, into families.
- Grouping sequence reads can include merging families using a probabilistic model. Merging families of sequence reads is also referred to herein as read or UMI collapsing.
- FIG. 37 is a block diagram of an illustrative computing system configured for grouping sequence reads.
- UMIs are a type of molecular barcode: short sequences used to uniquely tag each molecule in a sample library. UMIs are used for a wide range of sequencing applications, many of them concerned with PCR duplicates in DNA and cDNA. UMI deduplication is also useful for RNA-seq gene expression analysis and other quantitative sequencing methods. Sequencing with UMIs can reduce the rate of false-positive variant calls and increase the sensitivity of variant detection.
- UMIs incorporate a unique barcode onto each molecule within a given sample library. By incorporating individual barcodes on each original DNA fragment, variant alleles present in the original sample (true variants) can be distinguished from errors introduced during library preparation, target enrichment, or sequencing.
- UMI collapsing has mainly relied on the UMI sequence similarity and fragment position.
- Current algorithms only assume sequencing error has occurred if there is a difference in UMI sequence. However, this assumption does not hold, for example, for the artifact of UMI jumping.
- this problem can be solved by first estimating the UMI jumping rate using a small portion of data and then applying this prior knowledge on the full data to evaluate how reads should be grouped using a probability framework.
- the UMI sequence, UMI jumping rate, fragment size and coverage distribution are leveraged to assess the likelihood of merging reads with different UMI or different positions.
- the problem of UMI jumping is resolved and can be applied universally on any UMI design. In addition, based on positional information, fragment greatly reducing the DNA error.
- reads are grouped by fragment alignment position.
- a small fuzzy window at each position e.g., 1, 2, 3, 4, or 5
- the reads are grouped first by exact UMI sequence, which forms a family.
- UMI jumping or hopping probability is estimated through insert size distribution and number of distinct UMI at certain positions.
- pair-wise likelihood ratio is calculated to assess if two families with different UMI sequences and genomic positions are derived from the same original molecule. Families with likelihood lower than threshold are merged.
- the default threshold is 1, for example.
- a method of grouping sequence reads can include grouping sequence reads into families of sequence reads; and merging (or collapsing) families of sequence reads.
- a method for grouping sequence reads is under control of a processor (e.g., a hardware processor or a virtual processor) and comprises: receiving a plurality of sequence reads each comprising a fragment sequence and a unique molecular identifier (UMI) sequence (or an identifier sequence).
- the method can comprise: aligning sequence reads of the plurality of sequence reads to a reference sequence (e.g., a reference genome sequence) using the fragment sequences of the sequence reads.
- the method can comprise: grouping sequence reads of the plurality of sequence reads into a plurality of families of sequence reads based on the UMI sequences and the positions of the fragment sequences of the sequence reads aligned to the reference sequence.
- the method can comprise: performing UMI statistic estimation of the plurality of families.
- the method can comprise: performing probability-based merging of families of the plurality of families.
- Performing probability-based merging can comprise: performing probability-based merging of families of the plurality of families using the results of UMI statistic estimation.
- Read collapsing is a computational method that identifies nucleotide sequence reads as originating from the same source nucleic acid (e.g., DNA) molecule, and subsequently uses statistical methods to reduce spurious errors found in these sets of reads.
- read collapsing may include grouping those reads 104+rl, 104+r2, 104-rl, 104-r2 together.
- Read collapsing may include reducing spurious errors, such as with simplex collapsing to determine the nucleotide sequence of a nucleotide strand, such as the sequence of the plus strand 108a of a DNA molecule 108. Read confidence, such as with duplex collapsing to determine the nucleotide sequence of a DNA molecule 108 from both the sequence of the plus strand 108a and the sequence of the minus strand 108b.
- the systems and methods disclosed herein may utilize a probabilistic model for grouping sequence reads (which can include merging families of sequence reads, referred to herein as read or UMI collapsing).
- Read or UMI collapsing may produce high-quality reads.
- Read or UMI collapsing may require that a sample be sequenced with identifier sequences (e.g., unique identifier sequences (UMIs)) 112a, 112b’, 112a’, 112b.
- identifier sequences 112a, 112b’, 112a’, 112b can enable increased resolution when distinguishing reads and molecules that may appear very similar otherwise, though read collapsing may be performed without such identifier sequences under specific circumstances.
- Read collapsing may result in in silico error reduction. Such error reduction may be useful for many applications within next generation sequencing (NGS).
- the source nucleic acid molecules (or template) are tagged with dual UMIs as illustrated in FIG. 1A and FIG. 8, left. In some embodiments, the source nucleic acid molecules (or template) are tagged with single UMIs as illustrated in FIG. 8, right.
- because read collapsing effectively combines all the duplicate observations of a DNA fragment, such as PCR duplicates of a DNA fragment, into a single representative, it has the benefit of significantly reducing the amount of data that needs to be processed downstream. Removing duplicate observations, or reads, may result in a ten-fold, or more, decrease in data size.
- UMI helps to improve grouping accuracy.
- the same problems of false positives (FP) and false negatives (FN) exist with UMIs.
- read or UMI collapsing can enable error correction on single strand to remove random sequencing and PCR error, and duplex error correction can be used to remove in vitro DNA damage error (duplex collapsing).
- a nucleic acid or a template can be tagged with UMIs during library preparation.
- the resulting plus nucleic acid can have two UMIs (a on the 5’ of the nucleic acid and b’ on the 3’ of the nucleic acid).
- the resulting minus strand can have two UMIs (e.g., b on the 5’ of the nucleic acid and a’ on the 3’ of the nucleic acid).
- a nucleic acid can be tagged with one UMI.
- the tagged nucleic acid can have a fragment sequence.
- the tagged nucleic acid can have a UMI sequence. The tagged nucleic acid can also have sequences added during library preparation, such as sequences for attachment to a flow cell for sequencing (e.g., P5 and P7 sequences).
- Read or UMI collapsing can result in error rate reduction by 10e-6- 10e-5, enabling ultra-sensitive variant detection.
- Shown in FIG. 3 is error correction performance from simplex and duplex collapsing on circulating free DNA (cfDNA), nucleosome, and pipDNA samples. The total error rate of cfDNA was down to 10e-5, and the duplex correction yielded error rate down to 10e-6.
- among UMI errors, there can be a sequencing error (e.g., the UMI carries sequencing errors).
- there is a UMI jumping error (FIG. 6-FIG. 7), in which, e.g., the UMI sequence is replaced by other sequence during PCR.
- UMI correction methods can be based on or comprise heuristic rules.
- corrected UMIs have the same start/end position and hamming distance ⁇ 2 (e.g., fgbio correction).
- Heuristic rules can be hard to generalize. For example, correct UMI if unique correction is nearest. If not, then, (1) Identify families where both are valid, (2) Identify families where one of UMI is invalid, only allow nearest and second-nearest, or (3) no family is identified where both are invalid, only allow nearest correction.
- read pairs can have similar fragment alignment location ( ⁇ 3bp) and can share 1 same UMI and at least 1 same alignment.
- Current methods can require dual UMI for duplex collapsing (FIG. 8).
- a missing single end UMI can disable grouping of duplex sequence.
- FIG. 9 shows an illustration of dual UMI.
- UMI transition probabilities can be calculated, and UMI correction and merging into families can be based on a likelihood ratio.
- Equations 1 and 2 below describe a probabilistic framework for duplicate grouping.
- $L_{pos}$: likelihood ratio of unique molecule over non-unique molecule given fragment positions.
- $L_{umi}$: likelihood ratio of UMI transition for unique molecule over non-unique molecule. Assumptions include that the UMI transition is caused by jumping or sequencing error, and that only a larger family can jump into a smaller family (C1 is larger than C2, FIG. 10).
- As shown in FIG. 12, initial grouping can comprise grouping reads by a UMI plus position key and ordering by family size and UMI sequence. For pair-wise probability calculation and merging, the pair-wise probability is computed. Only a larger family can jump into a smaller family, and said pair is prioritized. The pair with the largest probability (likelihood) is identified and compared with the threshold. If the merge is successful, the probability map is recomputed, and this repeats until the largest pair no longer passes the threshold.
- the probabilities methods disclosed herein can advantageously leverage all reads in a region rather than reads with the same start and end.
- $L_{pos} = L_{pos} \times \text{indel error rate}$.
- the indel error rate can be, for example, 0.001, 0.0001, or 0.00001.
- Shown in FIG. 14 are exemplary embodiments of probabilistic methods for estimating UMI jumping.
- duplex rate can be estimated first.
- the probability-based UMI collapsing method disclosed herein can be accurate under different UMI settings: random single UMI, nonrandom single UMI, random dual UMI, and/or nonrandom dual UMI.
- parameters for the probability model can be fine-tuned. For example, additional statistics, such as shift probability and mismatch rate, can be computed and used. An optimal threshold for the merging probability can be determined and used.
- consensus sequence generation can be improved. Error rate can be estimated from raw reads and applied as a prior for consensus read generation (e.g., estimate error rate from homopolymer regions to improve indel error rate; e.g., estimate error rate from simplex reads to improve duplex collapsing). As shown in FIG. 18, the error correction performance on duplex reads is improved using the presently disclosed methods in DRAGEN as compared to, e.g., fulcrum (fgbio).
- Sensitivity of DRAGEN UMI is similar to, e.g., Fgbio-duplex as shown in FIG. 19.
- Truth variants each with at least one duplex support are shown in FIG. 19. Missed cases may be due to alignment difference and end-masking.
- DRAGEN calls more real support for variants than Fgbio as shown in FIG. 20. Without being bound by any particular theory, remaining cases of missed support are due to consensus generation, not read grouping.
- FIGS. 21-26 show the results of UMI collapsing with a probability model of the present disclosure from data using Agilent random single UMI.
- current majority voting yields up to 80% incorrect genotypes in long repeat units (FIG. 27).
- transition probability between different genotypes can be estimated and applied during consensus generation.
- Library preparation methods can provide the ability to attach unique identifiers (UMIs) to molecules before PCR and sequencing. This makes it possible to take post-sequenced reads, group them by UMI, and thus aggregate the evidence for what the pre-PCR fragment was. Described herein is the design of software pipeline (e.g., on Illumina DRAGEN) logic that accomplishes these tasks.
- the general design for DRAGEN’s UMI processing is as follows: (1) Group alignments by their original source fragment, (2) Generate a single consensus read (or pair) for each source fragment, and (3) Align the consensus read and feed it into the downstream analysis pipeline (e.g., sort, variant callers).
- processing a full input sample through a single hashtable can run slowly. Therefore, a method was developed to identify genomic regions that may be processed independently of the other regions, and are processed in parallel.
- FIG. 28 shows units of the software described in this section.
- the design is based on the following constraints: if the inputs are FASTQ files, the UMI tags must be contained in the read name field or provided in a separate FASTQ file; if the inputs are BAM files, the UMI tags must be contained in the read name field or in the UMI BAM tag; and the input FASTQ/BAM files are from a paired-end run.
- the software can only support the following conditions: single UMIs that are less than or equal to 15 base pairs and dual UMIs that are less than or equal to 8 + 8 base pairs.
- a single original DNA fragment can, in some embodiments, lead to multiple input reads, differing from each other by sequencing errors. Described herein are methods to gather reads into groups where all of the members of the group have matching UMI, and sequences for all reads are close to identical.
- the method for detecting sequence similarity is to use the aligner; any reads that align to the same genomic location must have a similar sequence. Thus reads can be grouped by of reads can be built using this key.
- the first stage of UMI processing is to do a normal aligner run, and to partition and sort by clip-adjusted mate coordinates.
- This uses a typical sort-partitioning data structure, the Binner.
- At the conclusion of the first alignment run, all reads have been partitioned in this Binner data structure; later, partitions of reads can be loaded, sorted by coordinate, and independent regions for parallel processing can be identified.
- FIG. 29 depicts a flow diagram of an exemplary embodiment of methods for identifying collapsible regions.
- a group of related reads, also known as a family, can be identified as having very close alignment positions (within a “fuzzy window” of a few base pairs) and very similar UMIs. As coverage varies across the genome, there are many positions where it can be safely concluded that no families may be merged across that position; that is, there are natural “break points” where family assembly can be processed independently.
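The idea of natural break points can be illustrated with a minimal Python sketch. It is a simplification made for illustration only: it looks at a single start coordinate per fragment, whereas the text describes pair positions, and the function name `split_into_regions` and its inputs are hypothetical.

```python
def split_into_regions(positions, fuzzy_window):
    """Split fragment start positions into independently processable regions.

    Assumption for this sketch: two fragments can only belong to the same
    family if their start positions differ by at most `fuzzy_window`.
    """
    regions = []
    current = []
    for pos in sorted(positions):
        # A gap larger than the fuzzy window is a natural break point:
        # no family can span it, so a new region starts here.
        if current and pos - current[-1] > fuzzy_window:
            regions.append(current)
            current = []
        current.append(pos)
    if current:
        regions.append(current)
    return regions


regions = split_into_regions([100, 101, 103, 200, 202, 500], fuzzy_window=3)
print(regions)
```

Each returned region could then be handed to a separate worker, since no merge decision in one region depends on reads in another.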
- sort partitions can be read back into memory, sorted, and scanned for “collapsible regions”.
- Each “CollapsibleRegion” is assigned to a separate “RegionCollapserThread” to generate an independent set of consensus reads, in a “CollapsedRegion” data structure.
- the “CollapsedRegions” are put back into their intended order by a “RegionSerializerThread”, which pumps the consensus reads directly back into the DRAGEN aligner.
- FIG. 30 depicts a flow diagram of exemplary embodiments of methods for generating consensus reads.
- the workunit for this phase of the UMI processing is the “CollapsibleRegion”. It is the job of the “RegionCollapserThread” to receive a “CollapsibleRegion”, feed all of that region’s reads into a “FamilyHashtable”, and use that hashtable to generate a set of consensus reads. Details of these read-collapsing methods, including UMI matching/correction, are described below.
- FIG. 31 depicts a flow diagram of exemplary embodiments for scanning for collapsible regions.
- the “RegionFinder” tracks the last N (fuzzy_window+1) pair positions covered by at least some reads. As a new pair is scanned, it is checked whether it can be shift-merged with any of the recent families (same left-most position, and right-most position difference within the fuzzy window). If so, this new position is considered to be a match, and a note is made not to split at that position.
- a Family is a grouping of sequencing data from reads that ostensibly originate from copies of the same source molecule.
- a Family is defined by the following information: (1) UMI, single or dual UMI noted by “+”; (2) Clip-adjusted pair coordinates, the alignment position of each mate is taken and adjusted outward beyond the 5’ end by the total amount of CIGAR soft clips; and (3) Orientation, each Family’s orientation is set based on the strand direction of read1 and read2, in that order. For example, if read1 is mapped to the forward strand and read2 is mapped to the reverse strand, the orientation of the family is Forward-Reverse. During the initial scan of a “CollapsibleRegion”, reads are grouped into Families based on an exact match of these criteria.
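A family key of this kind might be sketched in Python as follows. This is an illustrative sketch, not the pipeline's implementation: the names `five_prime_position`, `family_key`, and `FamilyKey` are hypothetical, positions are treated as 1-based, hard clips are assumed absent, and only the soft clip at the 5’ end of each mate is applied.

```python
import re
from collections import namedtuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance the reference

FamilyKey = namedtuple("FamilyKey", "umi pos1 pos2 orientation")


def five_prime_position(pos, cigar, is_reverse):
    """Return the alignment position extended outward past the 5' soft clip."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not is_reverse:
        # Forward read: the 5' end is the left end, so subtract leading clips.
        leading = ops[0][0] if ops and ops[0][1] == "S" else 0
        return pos - leading
    # Reverse read: the 5' end is the right end, so add trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trailing = ops[-1][0] if ops and ops[-1][1] == "S" else 0
    return pos + ref_len - 1 + trailing


def family_key(umi, read1, read2):
    """Build an exact-match grouping key; each read is (pos, cigar, is_reverse)."""
    def strand(is_reverse):
        return "Reverse" if is_reverse else "Forward"

    orientation = f"{strand(read1[2])}-{strand(read2[2])}"
    return FamilyKey(umi, five_prime_position(*read1),
                     five_prime_position(*read2), orientation)


key = family_key("AAGGATG+TCGGAGA", (100, "5S45M", False), (200, "45M5S", True))
print(key)
```

Reads that produce identical keys would fall into the same initial family; the merging steps then relax the exact-match requirement.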
- Both implementations can apply the following three types of family merging: (1) UMI correction, in which two families with exactly the same position, but tolerably close UMI sequences, are combined; (2) Shift-merge, in which two families with a small (within the fuzzy window) difference in clip-adjusted pair coordinates are merged; and (3) Duplex merging, in which two families with complementary orientations and matching coordinates and UMIs are combined, because they can originate from two strands of the template molecule.
- the UMI correction merges families with the same start-end but mismatch in UMI sequence. If the UMI code has a unique correction defined by the correction table to be the ‘true’ UMI, the corrected UMI will be assigned. For remaining families that do not have uniquely corrected UMIs, the process can work as described below. families with UMI1 and UMI2 combinations where both sequences are true codes are identified, and these are used as targets for correction.
- ReCo shall loop through the target families and merge the candidate family to the target if the orientations match and any of the following apply (this is a greedy algorithm in which the first target to satisfy any of the following is taken): (1) Candidate UMI1 is the same as target UMI1, and target UMI2 is either a nearest code or second nearest code of candidate UMI2; (2) Candidate UMI2 is the same as target UMI2, and target UMI1 is either a nearest code or second nearest code of candidate UMI1; or (3) Neither candidate UMI matches the target UMIs; however, both target UMIs are nearest codes for their respective candidate UMIs. A second nearest code is not allowed.
- Shift correction corrects for PCR errors that result in alignment shifts. This can cause one true PCR family to informatically be viewed as multiple families with differing positions. In some embodiments, this is done according to the following steps: (1) For each family, search for other families with start and end positions within the “--umi-fuzzy-window-size” parameter; the candidate family cannot have been shift-merged before (for example, if a family's start and end positions are {10, 20} and the window size is 3, then the following families are all likely candidates for correction: {13, 20}, {7, 23}); and (2) If two families are within a fuzzy window, determine if they can be merged.
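The candidate search in step (1) can be sketched in a few lines of Python. The function name `shift_merge_candidates` is hypothetical, and the inclusive comparison is an assumption inferred from the worked example, where a difference of exactly 3 counts as within a window of size 3. The bookkeeping for families that were already shift-merged is omitted.

```python
def shift_merge_candidates(family, others, window):
    """Return families whose start and end both lie within `window` of `family`."""
    start, end = family
    return [
        (s, e)
        for (s, e) in others
        if (s, e) != (start, end)          # skip the family itself
        and abs(s - start) <= window       # start within the fuzzy window
        and abs(e - end) <= window         # end within the fuzzy window
    ]


candidates = shift_merge_candidates(
    (10, 20), [(13, 20), (7, 23), (14, 20), (10, 24)], window=3
)
print(candidates)
```

Here {13, 20} and {7, 23} are returned, while {14, 20} and {10, 24} fall outside the window and are ignored.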
- UMIs are tags added to the original double stranded molecule during library preparation, and are thus propagated through the PCR family.
- DRAGEN UMI is able to further collapse the two consensus reads for the single strands into one consensus for the double strand via cross-family collapsing. This is possible for non-random UMIs where the UMI is in the PCR product and therefore complementary across strands.
- the random single UMI correction only applies UMI correction and shift-merge.
- the UMI are not used for collapsing and reads can be collapsed based only on position.
- the frequency of insert size of the test sample can be roughly estimated.
- Low MAPQ described herein can be, for example, <100, <75, <50, <40, <30, <20, or <10.
- total family can be set as total number of families after first round of grouping with the same start-end or mismatch ⁇ 1, UMI sequence, and strand.
- Read-pairs with low MAPQ (e.g., <60), non-properly paired, or UMI with N base are excluded.
- the user can pick the first read-pairs and calculate the soft-clip adjusted start-end and strand as group key.
- the UMI jumping can look like:
- only positional information is used to determine whether a family is associated with UMI jumping.
- the caveat is that it can be potentially different molecules with the same start-end which leads to overestimation of UMI jumping.
- the number of families and insert size can be used to down-select the regions with only one unique molecule.
- UMI are random UMI
- for each pair of UMIs s1, s2 from family1 and family2, calculate the Hamming distance as: Dis, the total number of non-matching bases after excluding N; nN, the total number of N bases in either of the items.
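The two quantities Dis and nN can be computed as in the minimal Python sketch below. The function name `umi_distance` is hypothetical, and nN is interpreted here as the number of positions at which either UMI carries an N, which is one reading of the definition above.

```python
def umi_distance(s1, s2):
    """Return (Dis, nN) for two UMIs of equal length.

    Dis counts mismatching positions, skipping any position with an N.
    nN counts positions where at least one of the two UMIs has an N.
    """
    if len(s1) != len(s2):
        raise ValueError("UMIs must have the same length")
    dis = 0
    n_count = 0
    for a, b in zip(s1, s2):
        if a == "N" or b == "N":
            n_count += 1      # uninformative position, excluded from Dis
        elif a != b:
            dis += 1
    return dis, n_count


print(umi_distance("AAGGATG", "AAGCATG"))
print(umi_distance("AANGATG", "AAGGTTG"))
```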
- when the UMIs are non-random UMIs, for each pair of UMIs s1, s2 from family1 and family2: if s1 and s2 are designed UMIs, correct them as s1’ and s2’ (if the distance between observed and corrected UMI is >1, discard the read-pair).
- the final likelihood is $L = L_{umi} \cdot L_{pos}$; if $L$ is above the pre-defined threshold, merge Families A and B.
- duplex collapsing can be performed. For each candidate family, loop through all families within the fuzzy window range. For pairs of families with reverse strand information, the pair-wise likelihood can be computed to find the most likely candidate to merge.
- duplex UMI a pair of family that forms a duplex can look like:
- the final likelihood is $L = L_{umi} \cdot L_{pos}$; if $L$ is above the pre-defined threshold, merge Family A and B.
- a list of collapsers are employed to process families, combining the multiple input read pair information they contain into consensus read pairs.
- two types of collapsing can be done: simplex collapsing, where accumulated read pileups are combined into consensus reads on one strand; and cross-family collapsing, where consensus reads are those whose UMIs, orientations, and positions indicate that they are from the same dual-stranded source molecule.
- the simplex collapsing can proceed using the following steps: (1) Group reads by CIGAR string; (2) Produce pileups for each read group; (3) Order pileups descending by read count, ascending by indel distance; (4) Create a consensus read from the first group e.g., the group with largest read count and lowest distance to reference; (5) save a second candidate if read count of second candidate > read count of first candidate * minRatio (default 0.5).
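Steps (1), (3), and (5) of this procedure can be sketched in Python as shown below. This is an illustration under stated assumptions: the names `indel_distance` and `rank_read_groups` are hypothetical, “indel distance” is taken to be the number of inserted plus deleted bases in the CIGAR string, the input is assumed to be non-empty, and pileup construction and consensus calling are left out.

```python
import re
from collections import Counter


def indel_distance(cigar):
    """Count inserted and deleted bases in a CIGAR string (assumed metric)."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "ID")


def rank_read_groups(cigars, min_ratio=0.5):
    """Group reads by CIGAR and pick the first and optional second candidate."""
    counts = Counter(cigars)
    # Descending by read count, then ascending by indel distance.
    ranked = sorted(counts.items(),
                    key=lambda kv: (-kv[1], indel_distance(kv[0])))
    first = ranked[0]
    second = None
    # Keep the runner-up only if it has enough support relative to the winner.
    if len(ranked) > 1 and ranked[1][1] > first[1] * min_ratio:
        second = ranked[1]
    return first, second


cigars = ["50M"] * 5 + ["20M1D30M"] * 3 + ["10M2I38M"]
print(rank_read_groups(cigars))
```

With five reads supporting `50M` and three supporting `20M1D30M`, the second group is retained because 3 exceeds 5 × 0.5.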
- cross-family collapsing can work according to the following steps: (1) Obtain the read group candidates from each strand; (2) Compare the two mates of both strands to find a best matching read group (e.g., compare readl of positive strand string with less difference from reference; (3) Output one consensus read based on the best matching read group; (4) If no matched CIGAR hypothesis from two strands, the two strands can be reported as two separate simplex families.
- the consensus base can be set according to the following rules.
- the software can calculate the most frequently observed base and the second most frequently observed base. If there are no bases observed, the consensus base can be set to ‘N’. If only 1 base is observed, the consensus base can be set to the observed base. If there are two or more bases observed, the consensus base can be set to the most frequently observed base. If the top two bases have an equal frequency, the consensus base can be set to the one with the higher condensed qscore. If the second most frequent base’s count multiplied by the “Majority Ratio” (default 4/3) is greater than or equal to the winner’s count, the consensus can be set to ‘N’. For cross-family merging, only two pileups (e.g., read1 from one strand and read2 from the opposite strand) are compared to generate the consensus base.
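A minimal Python sketch of these voting rules follows. Several points are assumptions made for illustration: the function name `consensus_base` is hypothetical, the “condensed qscore” is approximated by the sum of the quality scores of each base, and the rules are applied in the order listed, with the Majority Ratio check last.

```python
from collections import defaultdict
from fractions import Fraction


def consensus_base(observations, majority_ratio=Fraction(4, 3)):
    """Pick a consensus base from (base, qscore) observations at one position."""
    counts = defaultdict(int)
    quals = defaultdict(int)
    for base, q in observations:
        counts[base] += 1
        quals[base] += q          # stand-in for the condensed qscore
    if not counts:
        return "N"                # nothing observed
    # Most frequent first; ties broken by the higher summed quality.
    ranked = sorted(counts, key=lambda b: (-counts[b], -quals[b]))
    if len(ranked) == 1:
        return ranked[0]
    winner, runner_up = ranked[0], ranked[1]
    # Too little separation between the top two bases: report 'N'.
    if counts[runner_up] * majority_ratio >= counts[winner]:
        return "N"
    return winner


print(consensus_base([("A", 30)] * 5 + [("C", 30)] * 3))
print(consensus_base([("A", 30)] * 4 + [("C", 30)] * 3))
```

Under this ordering and a ratio of at least 1, an exact tie between the top two bases always yields ‘N’, so the quality tie-break only changes the result when a smaller ratio is configured.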
- To compute a new quality score of consensus base, Fisher’s method can be applied to represent higher quality score post collapsing.
- the Fisher score accumulates a sum of the natural log of the basecall likelihoods, whereas a Max score simply keeps the largest score encountered. The detailed steps are described below.
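The difference between the two scores can be illustrated with a short Python sketch. It is not the pipeline's procedure: it assumes Phred-scaled inputs, treats each basecall error probability as a p-value, and combines them with the textbook form of Fisher's method, using the closed-form chi-square tail for an even number of degrees of freedom. The names `fisher_quality` and `max_quality` are hypothetical, the input is assumed to be non-empty, and the series is only suitable for small families.

```python
import math


def fisher_quality(qscores):
    """Combine Phred scores with Fisher's method and return a Phred score."""
    k = len(qscores)
    # ln(p) for each error probability p = 10^(-Q/10), accumulated as a sum.
    log_sum = sum(-q / 10.0 * math.log(10.0) for q in qscores)
    half_x = -log_sum                  # X / 2, where X = -2 * sum(ln p)
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-X/2) * sum_{j<k} (X/2)^j / j!, evaluated in log space.
    series = sum(half_x ** j / math.factorial(j) for j in range(k))
    ln_tail = -half_x + math.log(series)
    return -10.0 * ln_tail / math.log(10.0)


def max_quality(qscores):
    """Keep only the largest score encountered."""
    return max(qscores)


print(round(fisher_quality([30, 30]), 2))
print(max_quality([30, 30]))
```

Two agreeing Q30 observations combine to a score well above 30 under the Fisher approach, while the Max score stays at 30.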
- Each of the collapsed reads can be generated based on the following convention: “consensus_read_refID1_pos1_refID2_pos2_orientation”. refID1, reference ID of read1; pos1, genomic position of read1; refID2, reference ID of read2; pos2, genomic position of read2; orientation, orientation of read1 and read2.
- “ReadCollapserThreads” feed a series of “CollapsedRegions” into the “RegionSerializerThread”, which puts the output reads into the expected order and pushes them downstream into the DRAGEN aligner, and from there into the rest of the DRAGEN pipeline.
- speed was, in some embodiments, limited by the performance of the memory allocator.
- the “FamilyHashtable” and “ReadCollapser” logic both hammered the allocator to build data structures and to construct output reads.
- the “RegionSerializerThread” hammered the allocator with millions of calls to free memory.
- the DRAGEN pipeline can process data from whole genome and hybrid-capture assays with unique molecular identifiers (UMI).
- UMIs are molecular tags added to DNA fragments before amplification to determine the original input DNA molecule of the amplified fragments. UMIs help reduce errors and biases introduced by DNA damage such as deamination before library prep, PCR error, or sequencing errors.
- the input reads files must be from a paired-end run.
- Input can be pairs of FASTQ files or aligned/unaligned BAM input.
- DRAGEN can support the following UMI types: dual, nonrandom UMIs, such as TruSight Oncology (TSO) UMI Reagents or IDT xGen Prism; dual, random UMIs, such as Agilent SureSelect XT HS2 molecular barcodes (MBC) or IDT xGen Duplex Seq Adapters; and single-ended, random UMIs, such as IDT xGen dual index UMI Adapters.
- DRAGEN uses the UMI sequence to group the read pairs by their original input fragment and generates a consensus read pair for each such group, or family.
- the consensus reduces error rates to detect rare and low frequency somatic variants in DNA samples with high accuracy.
- the DRAGEN pipeline can generate a consensus as follows: (1) Aligns reads; (2) Groups reads into groups with matching UMI and pair alignments (these groups are referred to as families); (3) Generates a single consensus read pair for each read family. These generated reads have higher quality scores than the input reads and reflect the increased confidence gained by combining multiple observations into each base call.
- the UMI workflow is only compatible with small variant calling and SV in DRAGEN.
- UMIs can be entered in any one of the following formats: (1) Read name: The UMI sequence is located in the eighth colon-delimited field of the read name (QNAME), for example, “NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA”; (2) BAM tag: The UMI is present as an RX tag in a pre-aligned or aligned BAM file (standard SAM format); or (3) FASTQ file: The UMI is located in a third FASTQ file using the same read order as the read pairs.
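Extracting the UMI from the read name format can be sketched in Python as follows. The helper name `umi_from_qname` is hypothetical; the sketch only relies on the eighth colon-delimited field and the “+” separator described above.

```python
def umi_from_qname(qname):
    """Return the UMI parts stored in the eighth colon-delimited QNAME field."""
    fields = qname.split(":")
    if len(fields) < 8:
        raise ValueError("read name has no eighth colon-delimited field")
    # A dual UMI is written as two parts joined by '+'; a single UMI has one.
    return tuple(fields[7].split("+"))


parts = umi_from_qname("NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA")
print(parts)
```

A single UMI yields a one-element tuple, so downstream code can distinguish single from dual designs by the tuple length.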
- DRAGEN supports UMIs with two parts each with a maximum of 8 bp and separated by +, or a single UMI with a maximum of 15 bp.
- the UMI workflow must be executed using a set of reads that correspond to a unique set of Read Group Sample Name (RGSM)/Read Group Library (RGLB).
- DRAGEN supports multiple lanes if all lanes correspond to the same RGSM/RGLB set.
- DRAGEN UMI does not support a tumor-normal analysis, because a tumor-normal run corresponds to two different RGSM. In a tumor-normal run, one sample name can be used for tumor and one sample name can be used for normal. In some embodiments, DRAGEN UMI supports one sample in a run.
- the input can contain multiple samples.
- DRAGEN checks if only one sample is included in the run and if the sample uses only a single, unique RGLB library. DRAGEN also accepts a library that was spread across multiple lanes. If there is a single sample and single library, DRAGEN processes with an error.
- the user can provide a predefined UMI correction table or a list of valid UMI sequences as input.
- To create the UMI correction table, use a tab-delimited file, include a header, and add the fields shown in Table 5.
- DRAGEN uses the default table for TruSight Oncology (TSO) UMI Reagents located at src/config/umi_correction_table.txt.
- the user can provide a file for whitelisted nonrandom UMI with valid UMI sequence, one per line.
- DRAGEN then autogenerates a UMI correction table with hamming distance of one.
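How a correction table with Hamming distance one might be derived from a list of valid UMIs is sketched below in Python. This is an assumption-laden illustration, not the DRAGEN procedure: the function name `build_correction_table` is hypothetical, the table is represented as a plain dictionary rather than the tab-delimited file format, only substitutions among A, C, G, and T are considered, and sequences that are one mismatch away from more than one valid UMI are left out as ambiguous.

```python
def build_correction_table(valid_umis):
    """Map every sequence within one mismatch of a unique valid UMI to that UMI."""
    valid = set(valid_umis)
    table = {umi: umi for umi in valid}   # valid UMIs correct to themselves
    ambiguous = set()
    for umi in valid:
        for i, original in enumerate(umi):
            for base in "ACGT":
                if base == original:
                    continue
                variant = umi[:i] + base + umi[i + 1:]
                if variant in valid or variant in ambiguous:
                    continue
                if variant in table and table[variant] != umi:
                    # Reachable from two valid UMIs: no unique correction.
                    del table[variant]
                    ambiguous.add(variant)
                else:
                    table[variant] = umi
    return table


table = build_correction_table(["AAA", "CCC"])
print(table["AAC"], table["ACC"])
```

This also illustrates why lookup correction works best when the valid UMIs are well separated: closely spaced codes leave many observed sequences without a unique correction.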
- the user can set the batch option for different UMI corrections.
- Three batch modes are available that optimize collapsing configurations for different UMI types. Use one of the following modes: random-duplex
- the user can set the “--umi-enable” option to “true”. In some embodiments, this option is not compatible with “--enable-duplicate-marking” because the UMI pipeline generates a consensus read from a set of candidate input reads, rather than choosing the best nonduplicate read. If using the “--umi-library-type” option, “--umi-enable” is not required.
- with the “--umi-emit-multiplicity” option, the user can set the consensus sequence type to output.
- DRAGEN UMI allows users to collapse duplex sequences from the two strands of the original molecules.
- duplex sequence is typically approximately 20-60% of total library, depending on library kit, input material, and sequencing depth.
- the user can enter one of the following consensus sequence types: both
- the user can specify the input type for the UMI sequence.
- the following are valid values: qname, bamtag, and fastq. If using “--umi-source fastq”, the UMI sequence can be provided from a FASTQ file using “--umi-fastq”.
- the user can enter the path to a customized correction table.
- Local Run Manager uses lookup correction with a built-in table for the Illumina TruSight Oncology and Illumina for IDT UMI Index Anchor kits.
- the user can enter the path for a customized, valid UMI sequence.
- DRAGEN processes UMIs by grouping reads by UMI and alignment position. If there are sequencing errors in the UMIs, DRAGEN can detect and correct small sequencing errors by using a lookup table or by using sequence similarity and read counts. The user can specify the type of correction with the “--umi-library-type” or “--umi-correction-scheme” option using the values “lookup”, “random”, or “none”.
- a lookup table can be created that specifies which sequences can be corrected and how to correct them. In some embodiments, this correction scheme works best on UMI sets where sequences have a minimum hamming/edit distance between them.
- DRAGEN uses lookup correction with a built-in correction table for the Illumina TruSight Oncology and Illumina for IDT UMI Index Anchor kits. The user can specify the path of their correction file using the “--umi-correction-table” option. In some embodiments, the user can employ a different set of nonrandom UMIs.
- the DRAGEN pipeline, in some embodiments, must infer which UMIs at a given position are likely to be errors relative to other UMIs observed at the same position.
- the error modes include small UMI errors, such as one mismatch, or UMI jumping or hopping artifact from library prep. DRAGEN accomplishes this as described below.
- Reads are grouped by fragment alignment position. Within a small fuzzy window at each position (e.g., 1, 2, 3, 4, or 5), the reads are grouped first by exact UMI sequence, which forms a family. UMI jumping or hopping probability is estimated through insert size distribution and number of distinct UMI at certain positions. Within a fuzzy window, pair-wise likelihood ratio is calculated to assess if two families with different UMI sequences and genomic positions are derived from the same original molecule. Families with likelihood lower than threshold are merged. The default threshold is 1, for example.
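The first grouping step, by alignment position and exact UMI sequence, can be sketched in Python as shown below. The sketch stops before the likelihood calculation, which depends on the jumping-rate estimate and is not reproduced here. The function name `group_into_families` and the `(start, end, umi)` tuple layout are hypothetical.

```python
from collections import defaultdict


def group_into_families(reads):
    """Group reads into families keyed by (start, end, umi).

    Returns a list of (key, family_size) pairs ordered by family size
    (largest first) and then by UMI sequence.
    """
    families = defaultdict(int)
    for start, end, umi in reads:
        families[(start, end, umi)] += 1
    # Larger families come first so they can serve as merge targets.
    return sorted(families.items(), key=lambda kv: (-kv[1], kv[0][2]))


reads = [
    (100, 250, "AAGG"), (100, 250, "AAGG"), (100, 250, "AAGG"),
    (100, 250, "AAGT"),
    (101, 250, "AAGG"),
]
for key, size in group_into_families(reads):
    print(key, size)
```

In this toy input, the singleton families at the same or a neighboring position would be the ones evaluated pair-wise against the family of size three.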
- Duplex UMI adapters simultaneously tag both strands of double-stranded DNA fragments. It is then possible to identify reads resulting from amplification of each strand of the original fragment.
- DRAGEN considers two collapsed read pairs to be the sequence of two strands of the same original fragment of DNA if they have the same alignment position (within a fuzzy window), complementary orientations, and their UMIs are swapped from Read 1 and Read 2. If there is only single-ended UMI, DRAGEN compares the start-end position of families from two strands and computes pair-wise likelihood to determine if they form a duplex. By default, DRAGEN outputs both simplex and duplex consensus sequences.
- If the user enables BAM output, DRAGEN generates an “<output_prefix>.bam” file that includes all UMI consensus reads. The QNAMEs for the reads are generated based on the following convention: consensus_read_refID1_pos1_refID2_pos2_orientation
- refID1: the reference ID of Read 1
- pos1: the genomic position of Read 1
- refID2: the reference ID of Read 2
- pos2: the genomic position of Read 2
- orientation: the orientation of Read 1 and Read 2.
- Orientation can be one of the following values (Position refers to the outermost aligned position of the read and is adjusted for soft clips): 1, Read 1 is forward and Read 2 is reverse, the starting position for Read 1 is less than or equal to the Read 2 end position; 2, Read 1 is reverse and Read 2 is forward, the starting position for Read 2 is greater than or equal to the Read 1 end position; 3, Read 1 is forward and Read 2 is reverse, the starting position for Read 1 is greater than the Read 2 end position; 4, Read 1 is reverse and Read 2 is forward, the starting position for Read 2 is greater than the Read 1 end position; 5, Read 1 and Read 2 are forward; and 6, Read 1 and Read 2 are reverse.
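A consumer of the consensus BAM could split such a QNAME back into its five fields. The sketch below assumes that the fields are underscore-separated integers and that the prefix is literally `consensus_read`; the source gives the naming convention but not the field types, and the names `ConsensusQname` and `parse_consensus_qname` are hypothetical.

```python
from typing import NamedTuple


class ConsensusQname(NamedTuple):
    ref_id1: int
    pos1: int
    ref_id2: int
    pos2: int
    orientation: int  # 1 to 6, as enumerated above


def parse_consensus_qname(qname: str) -> ConsensusQname:
    """Split a consensus-read QNAME into its five trailing fields."""
    # Split from the right so the underscore inside the prefix is preserved.
    prefix, *fields = qname.rsplit("_", 5)
    if prefix != "consensus_read" or len(fields) != 5:
        raise ValueError(f"not a consensus read name: {qname!r}")
    ref_id1, pos1, ref_id2, pos2, orientation = (int(f) for f in fields)
    if orientation not in range(1, 7):
        raise ValueError(f"orientation out of range: {orientation}")
    return ConsensusQname(ref_id1, pos1, ref_id2, pos2, orientation)


parsed = parse_consensus_qname("consensus_read_0_1000_0_1180_1")
```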
- DRAGEN outputs an “<output_prefix>.umi_metrics.csv” file that describes the statistics for UMI collapsing. This file summarizes statistics on input reads, how they were grouped into families, how UMIs were corrected, and how families generated consensus reads. The metrics described below can be useful when tuning the pipeline for an application.
- Families can be combined in various ways. The number of such corrections can be reported as follows: (1) Families shifted, where families with fragment alignment coordinates up to the distance specified by umi-fuzzy-window-size are merged; (2) Families contextually corrected, where families with exactly the same fragment alignment coordinates and compatible UMIs are merged; or (3) Duplex families, where families with close alignment coordinates and complementary UMIs are merged.
- When the user specifies a valid path for “umi-metrics-interval-file”, DRAGEN outputs a separate set of on-target UMI statistics that contains only families within the specified BED file.
- the histogram of unique UMIs per fragment position metric may be helpful. It is a zero-based histogram, where the index indicates a count of unique UMIs at a particular fragment position and the value represents the number of positions with that count.
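The definition of this histogram (index = number of unique UMIs at a fragment position, value = number of positions with that count) can be made concrete with a short sketch. The input format, an iterable of `(fragment_position, umi)` tuples, and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def unique_umi_histogram(observations: Iterable[Tuple[Hashable, str]]) -> List[int]:
    """Build the zero-based histogram of unique UMIs per fragment position.

    hist[k] is the number of fragment positions at which exactly k distinct
    UMIs were observed; hist[0] is 0 for any non-empty input.
    """
    umis_at: Dict[Hashable, Set[str]] = defaultdict(set)
    for position, umi in observations:
        umis_at[position].add(umi)
    max_count = max((len(umis) for umis in umis_at.values()), default=0)
    hist = [0] * (max_count + 1)
    for umis in umis_at.values():
        hist[len(umis)] += 1
    return hist


hist = unique_umi_histogram([
    (("chr1", 100, 250), "AAA"),
    (("chr1", 100, 250), "AAC"),
    (("chr1", 100, 250), "AAA"),   # repeated UMI, counted once
    (("chr1", 300, 420), "GGG"),
])
```

In this toy input one position carries two distinct UMIs and one position carries a single UMI, so the histogram has one entry at index 1 and one at index 2.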
- Table 6 below and FIG. 33 to FIG. 35 describe non-limiting examples of available UMI metrics.
- FIG. 36 is a flow diagram showing an exemplary method 3600 of grouping sequence reads.
- a method of grouping sequence reads can include grouping sequence reads into families of sequence reads.
- reads are grouped by fragment alignment position. Within a small fuzzy window at each position (e.g., 1, 2, 3, 4, or 5), the reads are first grouped by exact UMI sequence, and each such group forms a family. The UMI jumping or hopping probability is estimated from the insert size distribution and the number of distinct UMIs at certain positions. Within a fuzzy window, a pair-wise likelihood ratio is calculated to assess whether two families with different UMI sequences and genomic positions are derived from the same original molecule. Families with a likelihood lower than the threshold are merged. The default threshold is 1, for example.
- the method 3600 may be embodied in a set of executable program instructions stored on a computer-readable medium, such as one or more disk drives, of a computing system.
- the computing system 3700 shown in FIG. 37 and described in greater detail below can execute a set of executable program instructions to implement the method 3600.
- the executable program instructions can be loaded into memory, such as RAM, and executed by one or more processors of the computing system 3700.
- although the method 3600 is described with respect to the computing system 3700 shown in FIG. 37, the description is illustrative only and is not intended to be limiting. In some embodiments, the method 3600 or portions thereof may be performed serially or in parallel by multiple computing systems.
- a computing system receives a plurality of sequence reads each comprising a fragment sequence and a unique molecular identifier (UMI) sequence (or an identifier sequence).
- the plurality of sequence reads can be generated from a sample.
- the sample can be obtained from a subject.
- the sample can be generated from another sample obtained from a subject.
- the other sample can be obtained directly from the subject.
- the sample can comprise cells, cell-free DNA, cell-free fetal DNA, circulating tumor DNA, amniotic fluid, a blood sample, a biopsy sample, or a combination thereof.
- the computing system can load the plurality of sequence reads into its memory.
- Sequence reads can be generated by techniques such as sequencing by synthesis, sequencing by binding, or sequencing by ligation. Sequence reads can be generated using instruments such as MINISEQ, MISEQ, NEXTSEQ, HISEQ, and NOVASEQ sequencing instruments from Illumina, Inc. (San Diego, CA).
- a sequence read can be, for example, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1250, 1500, 1750, 2000, or more base pairs (bps) in length.
- a sequence read can be about 50 base pairs to … in length. The sequence reads can comprise single-end sequence reads.
- the sequence reads can be generated by whole genome sequencing (WGS).
- the WGS can be clinical WGS (cWGS).
- the sequence reads can comprise single-end sequence reads.
- the plurality of sequence reads can be generated by whole genome sequencing (WGS), e.g., clinical WGS (cWGS).
- the sequence reads can be generated by targeted sequencing, such as sequencing of 5, 10, 20, 30, 40, 50, 100, 200, or more genes.
- the sample can comprise cells, cell-free DNA, cell-free fetal DNA, amniotic fluid, a blood sample, a biopsy sample, or a combination thereof.
- a sequence read can include one UMI sequence.
- a sequence read can comprise two UMI sequences (e.g., a first UMI sequence and a second UMI sequence).
- the first UMI sequence can be 5’ to the fragment sequence.
- the second UMI sequence can be 3’ to the fragment sequence.
- the first UMI sequence can be 3’ to the fragment sequence.
- the second UMI sequence can be 5’ to the fragment sequence.
- the first UMI sequence and the second UMI sequence can have different lengths.
- the first UMI sequence and the second UMI sequence can have an identical length.
- the first UMI sequence and the second UMI sequence can be different.
- the first UMI sequence and the second UMI sequence can be identical.
- a UMI sequence can be, for example, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 40, 50 or more or less bases in length.
- the UMI sequences can be random.
- the UMI sequences can be non-random.
- the method 3600 proceeds from block 3608 to block 3612, where the computing system aligns sequence reads of the plurality of sequence reads to a reference sequence using the fragment sequences of the sequence reads.
- the reference sequence can be a reference genome sequence (e.g., hg38 or hg19, or a portion thereof).
- the computing system can align sequence reads to the reference sequence using an aligner or an alignment method such as Burrows-Wheeler Aligner (BWA), iSAAC, BarraCUDA, BFAST, BLASTN, BLAT, Bowtie, CASHX, Cloudburst, CUDA-EC, CUSHAW, CUSHAW2, CUSHAW2-GPU, drFAST, ELAND, ERNE, GNUMAP, GEM, GensearchNGS, GMAP and GSNAP, Geneious Assembler, LAST, MAQ, mrFAST and mrsFAST, MOM, MOSAIK, MPscan, Novoalign & NovoalignCS, NextGENe, Omixon, PALMapper, Partek, PASS, PerM, PRIMEX, QPalma, RazerS, REAL, cREAL, RMAP, rNA, RT Investigator, Segemehl, SeqMap, Shrec, SHRiMP, SLI
- the method 3600 proceeds from block 3612 to block 3616, where the computing system groups sequence reads of the plurality of sequence reads into a plurality of families of sequence reads based on the UMI sequences and/or the positions of the fragment sequences.
- a family can comprise at least 2 sequence reads (e.g., at least 2, 3, 4, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 250, 500, 1000, 2000, or more or less sequence reads) of the plurality of sequence reads.
- a family can comprise sequence reads with an identical UMI sequence, an identical alignment position (referred to herein as exact same start-end), and an identical strand (referred to herein as same strand, e.g., plus strand or minus strand).
- a family can comprise two sequence reads with an identical UMI sequence, alignment positions that differ within a fuzzy window (e.g., alignment positions can differ by one position (referred to herein as mismatch ≤ 1)), and an identical strand orientation (referred to herein as same strand, e.g., plus strand or minus strand).
- a fuzzy window can be, for example, 1, 2, 3, 4, or 5.
- the plurality of families can comprise, for example, at least 100,000, 200,000, 300,000, 400,000, 500,000, 1,000,000, 2,000,000, 3,000,000, 4,000,000, 5,000,000, 10,000,000, or more or less families.
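The family definitions above (identical UMI, identical strand, and start-end positions that are either exact or within a fuzzy window) can be sketched as a simple grouping routine. This is an illustrative sketch under stated assumptions: the `Read` record and its fields are hypothetical, and anchoring each family on its first read is one possible way to apply the fuzzy window; the source does not specify how the window is applied.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Read:
    """Hypothetical aligned read summary."""
    name: str
    start: int    # aligned fragment start
    end: int      # aligned fragment end
    strand: str   # "+" or "-"
    umi: str


def group_into_families(reads: Iterable[Read],
                        fuzzy_window: int = 0) -> List[List[Read]]:
    """Group reads by exact UMI and strand, then by start-end position.

    fuzzy_window=0 reproduces the "exact same start-end" rule; a value of 1
    lets start and end each differ by one position from the family anchor.
    """
    by_key: Dict[Tuple[str, str], List[List[Read]]] = {}
    ordered = sorted(reads, key=lambda r: (r.umi, r.strand, r.start, r.end, r.name))
    for read in ordered:
        families = by_key.setdefault((read.umi, read.strand), [])
        for family in families:
            anchor = family[0]  # first read placed in this family
            if (abs(read.start - anchor.start) <= fuzzy_window
                    and abs(read.end - anchor.end) <= fuzzy_window):
                family.append(read)
                break
        else:
            families.append([read])  # no compatible family: start a new one
    return [family for families in by_key.values() for family in families]


reads = [
    Read("r1", 100, 250, "+", "ACGT"),
    Read("r2", 100, 250, "+", "ACGT"),
    Read("r3", 101, 250, "+", "ACGT"),
    Read("r4", 100, 250, "+", "TTTT"),
    Read("r5", 100, 250, "-", "ACGT"),
]
exact_sizes = sorted(len(f) for f in group_into_families(reads, 0))
fuzzy_sizes = sorted(len(f) for f in group_into_families(reads, 1))
```

With exact matching, `r3` forms its own family because its start differs by one position; with a fuzzy window of 1 it joins `r1` and `r2`.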
- the method 3600 proceeds from block 3616 to block 3620, where the computing system performs UMI statistic estimation of the plurality of families.
- the computing system can determine fragment (or fragment insert) size frequency, UMI jumping rate, and/or UMI frequency. See section 2.8 above for an illustration.
- the computing system can perform UMI statistic estimation on a subset of families of the plurality of families.
- the subset of families can comprise at least 5,000, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, 100,000, or more or less, families of the plurality of families.
- the subset of families can comprise at least 0.1%, 0.5%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 20%, or more or less, of families of the plurality of families.
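Two of the statistics named above, fragment size frequency and UMI frequency, are empirical frequencies and can be estimated from a random subset of families as sketched below. The input format (one `(umi, fragment_size)` tuple per family), the function name, and the fixed random seed are assumptions for illustration. The UMI jumping rate is deliberately left out, because this passage does not give its formula.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def estimate_umi_statistics(
    families: List[Tuple[str, int]],
    subset_size: int = 10000,
    seed: int = 0,
) -> Tuple[Dict[str, float], Dict[int, float]]:
    """Estimate UMI frequency and fragment size frequency from a subset.

    Each family is summarized as (umi, fragment_size). If there are more
    families than subset_size, a random sample of that size is used.
    """
    rng = random.Random(seed)
    if len(families) > subset_size:
        subset = rng.sample(families, subset_size)
    else:
        subset = list(families)
    total = len(subset)
    umi_counts = Counter(umi for umi, _ in subset)
    size_counts = Counter(size for _, size in subset)
    umi_frequency = {umi: n / total for umi, n in umi_counts.items()}
    size_frequency = {size: n / total for size, n in size_counts.items()}
    return umi_frequency, size_frequency


umi_freq, size_freq = estimate_umi_statistics(
    [("AAA", 150), ("AAA", 150), ("CCC", 200), ("GGG", 150)]
)
```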
- the method 3600 proceeds from block 3620 to block 3624, where the computing system performs probability-based merging of families of the plurality of families (also referred to herein as read or UMI grouping or collapsing). See section 2.9 above for an illustration.
- the computing system can perform family identification and merging (or collapsing).
- the computing system can perform duplex identification and merging (or collapsing). See FIG. 2 and accompanying description.
- the computing system can perform probability-based merging of families of the plurality of families using a probability map (see FIG. 12 and the accompanying description for an illustration).
- the plurality of families can comprise, for example, at least …
- the plurality of families before probability-based merging is performed can comprise at least 0.5%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, … probability-based merging is performed.
- a family after probability-based merging can comprise one sequence read.
- a family after probability-based merging can comprise at least 2 sequence reads (e.g., at least 2, 3, 4, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 250, 500, 1000, 2000, or more or less sequence reads) of the plurality of sequence reads.
- the computing system can perform probability-based merging of families of the plurality of families using the results of UMI statistic estimation (e.g., fragment size frequency, UMI jumping rate, and/or UMI frequency).
- the computing system can perform probability-based merging of families of the plurality of families using fragment size frequency, UMI jumping rate, and/or UMI frequency.
- the computing system can perform probability-based merging of families of the plurality of families using a sequencing error rate (e.g., 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009, 0.001, 0.002, 0.003, 0.004, 0.005, or more or less) and/or a mismatch probability (e.g., 0.15, 0.17, 0.2, 0.23, 0.24, 0.25, 0.26, 0.27, 0.3, 0.33, 0.35, or more or less).
- the sequencing error rate can be predetermined.
- the mismatch probability can be predetermined.
- the computing system can determine a relative likelihood (or probability) (also referred to herein as L) that two families are derived from (or originate from) the same original nucleic acid (e.g., DNA) molecule.
- the computing system can determine the relative likelihood that the two families are derived from the same original nucleic acid molecule using one or more of equations 4 to 11.
- the computing system can determine the relative likelihood that the two families are derived from the same original nucleic acid molecule using the fragment size frequency, the UMI jumping rate, and/or the UMI frequency.
- the computing system can determine that the relative likelihood is above a merging threshold (e.g., 1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more or less).
- the computing system can merge the two families of the plurality of families.
- the computing system can merge a smaller family (e.g., with fewer sequence reads) of the two families into a larger family (e.g., with more sequence reads) of the two families.
- the computing system can determine a likelihood ratio of unique molecule (or family) over non-unique molecule (or family) given fragment positions (also referred to herein as L_pos).
- the computing system can determine a likelihood ratio of UMI transition for unique molecule (or family) over non-unique molecule (or family). This likelihood ratio of UMI transition is referred to herein as L_umi.
- UMI transition can be a result of UMI …
- the computing system can determine the relative likelihood that the two families are derived from the same original nucleic acid molecule as a product (e.g., multiplication product) of (i) the likelihood (or probability) ratio of unique molecule over non-unique molecule given fragment positions and (ii) the likelihood (or probability) ratio of UMI transition for unique molecule over non-unique molecule.
- the computing system can determine the relative likelihood that the two families are derived from the same original nucleic acid molecule using a sequencing error rate and/or a mismatch probability.
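Written out, the product form described above, together with the merging rule stated in terms of "above a merging threshold", reads as follows. The symbol T for the merging threshold is introduced here for illustration only; the expressions for the two factors are given by equations 4 to 11 of the source and are not reproduced here.

```latex
L = L_{\mathrm{pos}} \times L_{\mathrm{umi}},
\qquad
\text{merge the two families if } L > T \quad (\text{e.g., } T = 1).
```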
- the computing system can (i) for one, one or more, or each pair of families of the plurality of families, determine a relative likelihood (or probability) that the families of the pair are derived from the same original nucleic acid molecule.
- the computing system can (ii) for the pair of families with the highest relative likelihood (or probability), merge the families of the pair if the relative likelihood that they are derived from the same original nucleic acid molecule is above a merging threshold (e.g., 1).
- the computing system can (iii) repeat (i) and (ii) until the relative likelihood of the families in the pair with the highest relative likelihood (or probability) is not above the merging threshold.
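Steps (i) to (iii) describe a greedy loop: score every pair, merge the best pair if it clears the threshold, and repeat. The sketch below shows that control flow only. The `relative_likelihood` callable stands in for the likelihood computation (equations 4 to 11 in the source, not reproduced here), and the toy scoring function, the representation of a family as a list of read labels, and all names are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence

Family = List[str]  # illustrative: a family as a list of read labels


def greedy_merge(
    families: Sequence[Family],
    relative_likelihood: Callable[[Family, Family], float],
    threshold: float = 1.0,
) -> List[Family]:
    """Repeatedly merge the highest-scoring pair while it exceeds threshold."""
    fams = [list(f) for f in families]
    while len(fams) > 1:
        # (i) score every pair of current families
        best_score, i, j = max(
            (relative_likelihood(fams[a], fams[b]), a, b)
            for a, b in combinations(range(len(fams)), 2)
        )
        # (iii) stop once the best pair is not above the merging threshold
        if best_score <= threshold:
            break
        # (ii) merge the smaller family into the larger one
        small, large = sorted((i, j), key=lambda k: len(fams[k]))
        fams[large].extend(fams[small])
        del fams[small]
    return fams


def toy_likelihood(a: Family, b: Family) -> float:
    """Placeholder score: high if the leading UMIs differ by at most one base."""
    umi_a, umi_b = a[0].split(":")[0], b[0].split(":")[0]
    mismatches = sum(x != y for x, y in zip(umi_a, umi_b))
    return 5.0 if mismatches <= 1 else 0.1


merged = greedy_merge(
    [["ACGT:r1", "ACGT:r2", "ACGT:r3"], ["ACGA:r4"], ["TTTT:r5"]],
    toy_likelihood,
)
```

Rescoring all pairs after every merge is quadratic per iteration; it is written this way to mirror the three steps literally rather than for efficiency.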
- the computing system can align the consensus fragment sequence to the reference sequence.
- the computing system can determine a fragment sequence and/or a UMI sequence of the original nucleic acid molecule from which the sequence reads of the family are derived.
- the fragment sequence of the original nucleic acid molecule from which the sequence reads of the family are derived can be a consensus fragment sequence of the family.
- the UMI sequence of the original nucleic acid molecule from which the sequence reads of the family are derived can be a consensus UMI sequence of the family.
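The source does not state how a consensus fragment or UMI sequence is computed for a family. One common choice, assumed here purely for illustration, is a per-position majority vote over equal-length sequences; the function name and the alphabetical tie-break are likewise assumptions.

```python
from collections import Counter
from typing import Sequence


def consensus_sequence(sequences: Sequence[str]) -> str:
    """Return the per-position majority base over equal-length sequences."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must have equal length")
    consensus = []
    for column in zip(*sequences):
        counts = Counter(column)
        top = max(counts.values())
        # Break ties deterministically with the alphabetically smallest base.
        consensus.append(min(base for base, n in counts.items() if n == top))
    return "".join(consensus)


fragment_consensus = consensus_sequence(["ACGT", "ACGT", "ACTT"])
umi_consensus = consensus_sequence(["AAC", "AAC", "AAT"])
```

A production implementation would typically also weigh base qualities; that is omitted to keep the sketch short.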
- the computing system can align the fragment sequence to the reference sequence.
- the computing system can create a file or a report and/or generate a user interface (UI) comprising a UI element representing or comprising, for one, one or more, or each of the plurality of families, (i) the family.
- the file or report and/or the UI element can represent or comprise (ii) sequence reads of the family, fragment sequences of the family, and/or UMI sequences of the family.
- the file or report and/or the UI element can represent or comprise (iii) a consensus fragment sequence of the family, a position of the consensus fragment sequence aligned to the reference sequence, and/or a consensus UMI sequence of the family.
- a UI element can be a window (e.g., a container window, browser window, text terminal, child window, or message window), a menu (e.g., a menu bar, context menu, or menu extra), an icon, or a tab.
- a UI element can be for input control (e.g., a checkbox, radio button, dropdown list, list box, button, toggle, text field, or date field).
- a UI element can be informational (e.g., a tooltip, icon, progress bar, notification, message box, or modal window).
- a UI element can be a container (e.g., an accordion).
- the method 3600 ends at block 3628.
- FIG. 37 depicts a general architecture of an example computing device 3700 configured to execute the processes and implement the features described herein.
- the general architecture of the computing device 3700 depicted in FIG. 37 includes an arrangement of computer hardware and software components.
- the computing device 3700 may include many more (or fewer) elements than those shown in FIG. 37. It is not necessary, however, that all of these generally conventional elements be shown in order to provide an enabling disclosure.
- the computing device 3700 includes a processing unit 3710, a network interface 3720, a computer readable medium drive 3730, an input/output device interface 3740, a display 3750, and an input device 3760, all of which may communicate with one another by way of a communication bus.
- the network interface 3720 may provide connectivity to one or more networks or computing systems.
- the processing unit 3710 may thus receive information and instructions from other computing systems or services via a network.
- the processing unit 3710 may also communicate to and from memory 3770 and further provide output information for an optional display 3750 via the input/output device interface 3740.
- the input/output device interface 3740 may also accept input from the optional input device 3760, such as a keyboard, mouse, digital pen, microphone, touch screen, gesture recognition system, voice recognition system, gamepad, accelerometer, gyroscope, or other input device.
- the memory 3770 may contain computer program instructions (grouped as modules or components in some embodiments) that the processing unit 3710 executes in order to implement one or more embodiments.
- the memory 3770 generally includes RAM, ROM and/or other persistent, auxiliary or non-transitory computer-readable media.
- the memory 3770 may store an operating system 3772 that provides computer program instructions for use by the processing unit 3710 in the general administration and operation of the computing device 3700.
- the memory 3770 may further include computer program instructions and other information for implementing aspects of the present disclosure.
- the memory 3770 includes a sequence reads grouping module 3774 for grouping sequence reads (which can include merging or collapsing families of sequence reads).
- the sequence reads grouping module 3774 can perform one or more actions of the method 3600 described with reference to FIG. 36.
- data stores that store sequence reads or data being processed and results (e.g., intermediate results or final results) of grouping sequence reads.
- a processor configured to carry out recitations A, B and C can include a first processor configured to carry out recitation A and working in conjunction with a second processor configured to carry out recitations B and C.
- Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.
- All of the processes described herein may be embodied in, and fully automated via, software code modules executed by a computing system that includes one or more computers or processors.
- the code modules may be stored in any type of non-transitory computer-readable medium or other computer storage device. Some or all the methods may be embodied in specialized computer hardware.
- a processor can be a microprocessor, but in the alternative, the processor can be a controller, microcontroller, or state machine, combinations of the same, or the like.
- a processor can include electrical circuitry configured to process computer-executable instructions.
- a processor includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions.
- a processor can also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- a processor may also include primarily analog components.
- some or all of the signal processing algorithms described herein may be implemented in analog circuitry or mixed analog and digital circuitry.
- a computing environment can include any type of computer system, including, but not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computational engine within an appliance, to name a few.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2022277902A AU2022277902A1 (en) | 2021-05-19 | 2022-05-19 | Umi collapsing |
CA3219179A CA3219179A1 (en) | 2021-05-19 | 2022-05-19 | Umi collapsing |
CN202280041976.6A CN117597739A (en) | 2021-05-19 | 2022-05-19 | UMI collapse |
EP22735259.8A EP4341940A1 (en) | 2021-05-19 | 2022-05-19 | Umi collapsing |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163190716P | 2021-05-19 | 2021-05-19 | |
US63/190,716 | 2021-05-19 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2022246062A1 (en) | 2022-11-24 |
WO2022246062A9 (en) | 2024-02-01 |
Family
ID=82319831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2022/030023 WO2022246062A1 (en) | 2021-05-19 | 2022-05-19 | Umi collapsing |
Country Status (6)
Country | Link |
---|---|
US (1) | US20220392575A1 (en) |
EP (1) | EP4341940A1 (en) |
CN (1) | CN117597739A (en) |
AU (1) | AU2022277902A1 (en) |
CA (1) | CA3219179A1 (en) |
WO (1) | WO2022246062A1 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160319345A1 (en) * | 2015-04-28 | 2016-11-03 | Illumina, Inc. | Error suppression in sequenced dna fragments using redundant reads with unique molecular indices (umis) |
US20200135298A1 (en) | 2018-10-31 | 2020-04-30 | Illumina, Inc. | Systems and methods for grouping and collapsing sequencing reads |
-
2022
- 2022-05-19 CA CA3219179A patent/CA3219179A1/en active Pending
- 2022-05-19 US US17/748,455 patent/US20220392575A1/en active Pending
- 2022-05-19 CN CN202280041976.6A patent/CN117597739A/en active Pending
- 2022-05-19 EP EP22735259.8A patent/EP4341940A1/en active Pending
- 2022-05-19 WO PCT/US2022/030023 patent/WO2022246062A1/en active Application Filing
- 2022-05-19 AU AU2022277902A patent/AU2022277902A1/en active Pending
Non-Patent Citations (2)
Title |
---|
GUIDE USER: "Illumina DRAGEN Bio-IT Platform v3.8", 26 March 2021 (2021-03-26), XP055952873, Retrieved from the Internet <URL:https://support-docs.illumina.com/SW/DRAGEN_v38/dragen-platform-v3.8-guide-1000000158551-01.pdf> [retrieved on 20220818] * |
TOM SMITH ET AL: "UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy", GENOME RESEARCH, vol. 27, no. 3, 18 January 2017 (2017-01-18), US, pages 491 - 499, XP055517852, ISSN: 1088-9051, DOI: 10.1101/gr.209601.116 * |
Also Published As
Publication number | Publication date |
---|---|
EP4341940A1 (en) | 2024-03-27 |
AU2022277902A1 (en) | 2023-12-14 |
CA3219179A1 (en) | 2022-11-24 |
WO2022246062A9 (en) | 2024-02-01 |
AU2022277902A9 (en) | 2024-01-11 |
CN117597739A (en) | 2024-02-23 |
US20220392575A1 (en) | 2022-12-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22735259 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 3219179 Country of ref document: CA |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2022277902 Country of ref document: AU Ref document number: AU2022277902 Country of ref document: AU |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202280041976.6 Country of ref document: CN |
|
ENP | Entry into the national phase |
Ref document number: 2022277902 Country of ref document: AU Date of ref document: 20220519 Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2022735259 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2022735259 Country of ref document: EP Effective date: 20231219 |