US20230410945A1 - System and method for secondary analysis of nucleotide sequencing data - Google Patents
- Publication number
- US20230410945A1
- Authority
- US
- United States
- Prior art keywords
- nucleotide subsequence
- sequencing
- reference sequence
- nucleotide
- variant calling
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
Definitions
- The Sequence Listing is provided as a file entitled “Sequence_Listing_ILLINC_346C1.xml”, created Jun. 22, 2023, which is approximately 2.0 kb in size.
- the information in the electronic format of the sequence listing is incorporated herein by reference in its entirety.
- The present disclosure relates generally to the field of DNA sequencing, and more particularly to systems and methods for performing real-time secondary analyses for next-generation sequencing applications.
- Genetic mutation can be identified by identifying variants—relative to reference sequences—in sequence reads.
- identifying variants includes distinct steps that are performed sequentially and can be time consuming to perform after the conclusion of the sequencing process.
- the system comprises: a memory comprising a reference nucleotide sequence; a processor configured to execute instructions that perform a method comprising: receiving a first nucleotide subsequence of a read from a sequencing system; processing the first nucleotide subsequence using a first alignment path to determine a first plurality of candidate locations of the read on the reference sequence; determining whether the first nucleotide subsequence aligns to the reference sequence based on the determined candidate locations; receiving a second nucleotide subsequence from the sequencing system; processing the second nucleotide subsequence to determine a second plurality of candidate locations of the read that align to the reference sequence using: a second alignment path if the read is aligned to the reference sequence, and the first alignment path if otherwise, wherein the second alignment path is more computationally efficient than the first alignment path to determine the second plurality of candidate locations of the read
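The path selection described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `ReadState`, `full_search`, `extend_candidates`, and `process_subsequence` are hypothetical, the one-mismatch tolerance is an assumption, and a plain Hamming-distance scan stands in for whatever alignment method an embodiment actually uses. The reference is SEQ ID NO: 1 from the disclosure.

```python
from dataclasses import dataclass, field

REFERENCE = "GATTACATAAGATTCTTTCATCG"  # SEQ ID NO: 1 from the disclosure


@dataclass
class ReadState:
    """Tracks one read as its bases arrive from the sequencing system."""
    bases: str = ""
    candidates: list = field(default_factory=list)  # 0-based reference offsets
    aligned: bool = False


def count_mismatches(read, window):
    """Hamming distance between a read and an equal-length reference window."""
    return sum(a != b for a, b in zip(read, window))


def full_search(read, reference, max_mismatches=1):
    """First (more expensive) path: test every offset on the reference."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        if count_mismatches(read, reference[pos:pos + len(read)]) <= max_mismatches:
            hits.append(pos)
    return hits


def extend_candidates(read, reference, candidates, max_mismatches=1):
    """Second (cheaper) path: re-test only the offsets found earlier."""
    hits = []
    for pos in candidates:
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the longer read would run past the end of the reference
        if count_mismatches(read, window) <= max_mismatches:
            hits.append(pos)
    return hits


def process_subsequence(state, new_bases, reference=REFERENCE):
    """Append newly received bases, then choose the path by alignment status."""
    state.bases += new_bases
    if state.aligned:
        state.candidates = extend_candidates(state.bases, reference, state.candidates)
    else:
        state.candidates = full_search(state.bases, reference)
    state.aligned = bool(state.candidates)
    return state
```

In this sketch the second path is cheaper because it visits only the stored candidate offsets rather than the whole reference; a read that loses all of its candidates falls back to the first path on the next subsequence.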
- the method comprises: receiving a first nucleotide subsequence from a sequencing system during a sequencing run; and performing a secondary analysis of the first nucleotide subsequence of a read based on a reference sequence using a first analysis path or a second analysis path, wherein the second analysis path is more computationally efficient than the first analysis path in performing the secondary analysis.
- FIG. 1 is a schematic illustration showing an example sequencing system for performing real-time analyses.
- FIG. 2 shows a functional block diagram of an example computer system for performing real-time analyses.
- FIG. 3 is a flowchart of an example method for sequencing by synthesis.
- FIG. 4 is a flowchart of an example method for performing base calling.
- FIGS. 5 A and 5 B show example iterative alignments and variant calling.
- FIG. 6 is a flowchart of an example method for performing a real-time secondary sequence analysis.
- FIGS. 7 A and 7 B are schematic illustrations comparing a traditional method of secondary analysis ( FIG. 7 A ) to an iterative method of secondary analysis ( FIG. 7 B ).
- FIG. 9 A is a flowchart of an example method for performing a real-time secondary analysis.
- FIG. 9 B is a predicted line graph showing data processed per K-Mer.
- FIG. 9 C is a bar chart showing run times.
- FIGS. 11 A and 11 B compare an existing variant caller ( FIG. 11 A ) to a variant caller that uses high confidence, low processing path as described herein ( FIG. 11 B ).
- the system and method can determine preliminary variant calls iteratively in real-time (or with zero or low latency).
- Final results of variant determinations may be available soon after (or immediately after) the end of a sequencing run.
- a sequencing run can be terminated early if variant calls are available with sufficient confidence during the run.
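As a rough sketch of the early-termination idea, the check below scans per-cycle confidences and reports the first cycle at which every tracked variant call meets a threshold. The function name `first_confident_cycle`, the 0.99 threshold, and the dictionary layout are all assumptions made for illustration; the disclosure does not specify how confidence is measured or what level counts as sufficient.

```python
def first_confident_cycle(confidence_by_cycle, min_confidence=0.99):
    """Return the 1-based cycle at which the run could stop, or None.

    confidence_by_cycle: one dict per sequencing cycle, mapping a variant
    position to the (hypothetical) confidence of its call after that cycle.
    """
    for cycle, confidences in enumerate(confidence_by_cycle, start=1):
        # Stop only when there is at least one call and all calls are confident.
        if confidences and min(confidences.values()) >= min_confidence:
            return cycle
    return None


# Illustrative, made-up confidences for two variant positions over three cycles.
history = [
    {5: 0.60, 9: 0.70},
    {5: 0.95, 9: 0.992},
    {5: 0.995, 9: 0.999},
]
stop_cycle = first_confident_cycle(history)
```

A stricter threshold simply delays (or prevents) termination, so the run completes normally when no cycle qualifies.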
- only information related to variant determinations (e.g., variant calls)
- only variant information may be sent to a computing system (e.g., a cloud computing system) for further processing.
- sequencing runs may be terminated prior to completion of an entire sequencing process.
- outputs and intermediate results of the system can include histograms of duplicates, exact matches, single and double SNPs, and single and double indels.
- the polynucleotides can be attached to the one or more fluidic channels of the flowcell 114 .
- the flowcell 114 can include a plurality of beads, wherein each bead can include multiple copies of a polynucleotide to be sequenced.
- the mounting stage 116 can be configured to allow proper alignment and movement of the flowcell 114 in relation to the other components of the optics system 102 . In one embodiment, the mounting stage 116 can be used to align the flowcell 114 with a lens 118 .
- the optics system 102 can include multiple lasers 120 configured to generate light at predetermined wavelengths.
- the light generated by the lasers 120 can pass through a fiber optic cable 122 to excite fluorescent labels in the flowcell 114 .
- The lens 118, mounted on a focuser 124, can move along the z-axis.
- the focused fluorescent emissions can be detected by a detector 126 , for example a charge-coupled device (CCD) sensor or a complementary metal oxide semiconductor (CMOS) sensor.
- a filter assembly 128 of the optics system 102 can be configured to filter the fluorescent emissions of the fluorescent labels in the flowcell 114 .
- the filter assembly 128 can include a first filter and a second filter. Each filter can be a longpass filter, a shortpass filter, or a bandpass filter, depending on the types of fluorescent molecules being used in the system.
- the first filter can be configured to detect the fluorescent emissions of the first fluorescent labels by the detector 126 .
- the second filter can be configured to detect the fluorescent emissions of the second fluorescent labels by the detector 126 . With two filters in the filter assembly 128 , the detector 126 can detect two different wavelengths of fluorescent emissions.
- a sample having a polynucleotide to be sequenced is loaded into the flowcell 114 and placed in the mounting stage 116 .
- the computer system 106 then activates the fluidics system 104 to begin a sequencing cycle.
- the computer system 106 instructs the fluidics system 104 , through the communication interface 108 b , to supply reagents, for example nucleotide analogs, to the flowcell 114 .
- the processor 202 can be configured to execute instructions that cause the fluidics system 104 to supply reagents to the flowcell 114 during sequencing reactions.
- the processor 202 can execute instructions that control the lasers 120 of the optics system 102 to generate light at predetermined wavelengths.
- the processor 202 can execute instructions that control the detector 126 of the optics system 102 and receive data from the detector 126 .
- the processor 202 can execute instructions to process data, for example fluorescent images, received from the detector 126 and to determine the nucleotide sequence of polynucleotides based on the data received from the detector 126 .
- the computer system 106 can include a nucleic base determiner 216 configured to determine the nucleotide sequence of polynucleotides using the data received from the detector 126 .
- the nucleic base determiner 216 can generate a template of the locations of polynucleotide clusters in the flowcell 114 using the fluorescent images captured by the detector 126 .
- the nucleic base determiner 216 can register the locations of polynucleotide clusters in the flowcell 114 in the fluorescent images captured by the detector 126 based on the location template generated.
- the nucleic base determiner 216 can extract intensities of the fluorescent emissions from the fluorescent images to generate extracted intensities.
- the nucleic base determiner 216 can determine the bases of the polynucleotide from the extracted intensities.
- the nucleic base determiner 216 can determine quality scores of the bases of the polynucleotides determined.
- the computer system 106 can include an iterative aligner 218 and a variant caller 220 , such as the Strelka variant caller (sites.google.com/site/strelkasomaticvariantcaller/home/faq).
- the iterative aligner 218 can align sequence reads determined by the nucleic base determiner 216 to a reference sequence.
- the aligned sequence reads can have associated scores.
- the scores can be probabilities (e.g., mismatch percentages) that the sequence reads have been correctly aligned to the reference sequence.
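One simple way to read "mismatch percentage" is the share of read bases that differ from the reference at a given alignment offset. The helper below is a sketch under that assumption; the name `mismatch_percentage` and the 0-based `position` argument are illustrative and do not come from the disclosure, which does not define how the score is computed.

```python
def mismatch_percentage(read, reference, position):
    """Percentage of read bases that differ from the reference at `position`."""
    if not read:
        raise ValueError("read must contain at least one base")
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit on the reference at this position")
    mismatches = sum(r != w for r, w in zip(read, window))
    return 100.0 * mismatches / len(read)


reference = "GATTACATAAGATTCTTTCATCG"  # SEQ ID NO: 1 from the disclosure
exact = mismatch_percentage("TTAC", reference, 2)
one_off = mismatch_percentage("TTAG", reference, 2)
```

A lower percentage indicates a better fit at that offset; converting it into a probability of correct alignment would need an error model that this sketch does not attempt.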
- the lengths of fragmented double-stranded polynucleotide fragments can range from 200 bases to 1000 bases.
- the method 300 proceeds to block 315 , where the double-stranded polynucleotide fragments are bridge-amplified into clusters of polynucleotide fragments attached to the inside surface of one or more channels of a flowcell, for example the flowcell 114 .
- the inside surface of the one or more channels of the flowcell can include two types of primers, for example a first primer type (P1) and a second primer type (P2) and the DNA fragments can be amplified by well-known methods.
- the first type of nucleotide can be an analog of deoxyguanosine triphosphate (dGTP) not conjugated with any fluorescent label.
- the second type of nucleotide can be an analog of deoxythymidine triphosphate (dTTP) conjugated with the first type of fluorescent label via a linker.
- the third type of nucleotide can be an analog of deoxycytidine triphosphate (dCTP) conjugated with the second type fluorescent label via a linker.
- the fourth type of nucleotide can be an analog of deoxyadenosine triphosphate (dATP) conjugated with both the first type of fluorescent label and the second type of fluorescent label via one or more linkers.
- the linkers may include one or more cleavage groups.
- Prior to the subsequent sequencing cycle, the fluorescent labels can be removed from the nucleotide analogs.
- a linker attaching a fluorescent label to a nucleotide analog can include an azide and/or an alkoxy group, for example on the same carbon, such that the linker may be cleaved after each incorporation cycle by a phosphine reagent, thereby releasing the fluorescent label from subsequent sequencing cycles.
- the nucleotide triphosphates can be reversibly blocked at the 3′ position so that sequencing is controlled and no more than a single nucleotide analog can be added onto each extending primer-polynucleotide in each cycle.
- the 3′ ribose position of a nucleotide analog can include both alkoxy and azido functionalities which can be removable by cleavage with a phosphine reagent, thereby creating a nucleotide that can be further extended.
- the fluidics system 104 can wash the one or more channels of the flowcell 114 in order to remove any unincorporated nucleoside analogs and enzyme. Prior to the subsequent sequencing cycle, the reversible 3′ blocks can be removed so that another nucleotide analog can be added onto each extending primer-polynucleotide.
- the fluorescent images comprising the fluorescent signals detected can be processed at block 335 , and the bases of the nucleotides incorporated can be determined. For each nucleotide base determined, a quality score can be determined at block 340 . A determination can be made at decision block 345 whether to detect more nucleotides based on, for example, the quality of the signal or after a predetermined number of bases. If more nucleotides are to be detected, then nucleotide determination of the next sequencing cycle can be performed at block 320 . In some embodiments, the labeled nucleotides may be added to one end of the DNA strand corresponding to a cluster.
- the labeled nucleotides may also be added to the other end of the DNA strand corresponding to the cluster.
- Reads on one end of the DNA strand are often referred to as the Read 1 set, and those reads on the other end of a DNA strand are often referred to as the Read 2 set.
- the sequencing technique that allows the determination of two or more reads of sequence from two places on a single polynucleotide duplex is known as paired-end (PE) sequencing.
- The two or more reads of sequence from the two places on the single polynucleotide duplex are referred to as the Read 1 set, the Read 2 set, etc. Paired-end sequencing has been described in U.S.
- the fluorescent labels can be removed from the nucleotide analogs, and the reversible 3′ blocks can be removed so that another nucleotide analog can be added onto each extending primer-polynucleotide.
- the method 300 can terminate at block 350 .
- Base calling can refer to the process of determining bases of the nucleotides incorporated into the clusters of growing primer-polynucleotides being sequenced to be guanine (G), thymine (T), cytosine (C), or adenine (A).
- FIG. 4 is a flowchart of an example method 400 for performing base calling utilizing the sequencing system 100 . Processing detected signals at block 335 illustrated in FIG. 3 can include performing base calling of the method 400 . After beginning at block 405 , light of predetermined wavelengths can be generated using lasers. The light generated can shine onto nucleotide analogs at block 410 . For example, the computer system 106 , through its optics system interface 212 and the communication channel 108 a , can cause the lasers 120 to generate light at the predetermined wavelength.
- the laser-generated light can shine onto nucleotide analogs incorporated into growing primer-polynucleotides attached to the inside surface of one or more channels of a flowcell, for example, the flowcell 114 .
- the primer-polynucleotides can include clusters of single-stranded polynucleotide fragments hybridized to sequencing primers.
- the nucleotide analogs each can include zero, one, or two fluorescent labels.
- the two fluorescent labels can be a first fluorescent label and a second fluorescent label.
- The fluorescent labels, after being excited by the laser-generated light, can emit fluorescent emissions.
- the first fluorescent label can produce fluorescent emissions at the first wavelength which can be captured in, for example, a first fluorescent image.
- the second fluorescent label can produce fluorescent emissions at the second wavelength which can be captured in, for example, a second fluorescent image.
- the nucleotide analogs can include a first type of nucleotide, a second type of nucleotide, a third type of nucleotide, and a fourth type of nucleotide.
- The first type of nucleotide, for example an analog of deoxyguanosine triphosphate (dGTP), can be free of any fluorescent label.
- The second type of nucleotide, for example an analog of deoxythymidine triphosphate (dTTP), can be conjugated with the first type of fluorescent label, and not the second type of fluorescent label.
- fluorescent emissions of the nucleotide analogs at the first wavelength and the second wavelength can be detected using at least one detector.
- the detector 126 can capture two fluorescent images, a first fluorescent image at the first wavelength and a second fluorescent image at the second wavelength.
- the nucleic base determiner 216 can determine the presence or the absence of fluorescent emissions in the two fluorescent images.
- the first type of nucleotide can produce no, or minimal, fluorescent emission at the first wavelength or at the second wavelength.
- the nucleotide can be determined to be the first type of nucleotide, for example dGTP. If any or more than minimal fluorescent emission is detected, the method 400 can proceed to decision block 425 .
- Because the second type of nucleotide is conjugated with the first type of fluorescent label, and not the second type of fluorescent label, the second type of nucleotide can produce fluorescent emissions at the first wavelength and no, or minimal, fluorescent emission at the second wavelength.
- the nucleotide can be determined to be the second type of nucleotide, for example dTTP. If fluorescent emissions are detected at the second wavelength, the method 400 can proceed to decision block 430 .
- Because the fourth type of nucleotide is conjugated with both the first type of fluorescent label and the second type of fluorescent label, the fourth type of nucleotide can produce fluorescent emissions at both the first wavelength and the second wavelength.
- the nucleotide can be determined to be the fourth type of nucleotide, for example dATP.
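The two-channel decision logic above can be sketched as a small lookup on whether each wavelength shows more than minimal emission. The function name `call_base`, the normalized intensity inputs, and the 0.5 threshold are assumptions for illustration only; the label assignments (dGTP unlabeled, dTTP with the first label, dCTP with the second label, dATP with both) follow the example given in the text.

```python
def call_base(intensity_1, intensity_2, threshold=0.5):
    """Call a base from emissions at the first and second wavelengths.

    intensity_1, intensity_2: extracted intensities for one cluster in the
    first and second fluorescent images (assumed normalized to 0..1).
    threshold: hypothetical cutoff separating minimal emission from signal.
    """
    on_1 = intensity_1 > threshold
    on_2 = intensity_2 > threshold
    if not on_1 and not on_2:
        return "G"  # no label: no, or minimal, emission at either wavelength
    if on_1 and not on_2:
        return "T"  # first label only
    if on_2 and not on_1:
        return "C"  # second label only
    return "A"      # both labels: emission at both wavelengths


# One cluster per base, using made-up intensity pairs.
calls = [call_base(a, b) for a, b in [(0.05, 0.02), (0.9, 0.1), (0.1, 0.8), (0.85, 0.9)]]
```

A fixed threshold is the simplest possible choice; a production base caller would also need the cluster registration and intensity extraction steps described for the nucleic base determiner 216.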
- the flowcell 114 can include clusters of growing primer-polynucleotides to be sequenced.
- At decision block 435 , if there is at least one more cluster with fluorescent emissions to be processed for a given sequencing cycle, the method 400 can continue at block 410 . If no more clusters of single-stranded polynucleotides are to be processed, the method 400 can end at block 440 .
- a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery.
- more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
- FIGS. 5 A and 5 B show an example iterative alignment and variant calling process according to one embodiment. After a minimum number of sequencing cycles have been imaged, real-time primary analyses can be performed in order to determine the base calls and quality scores for each unaligned read. In FIG. 5 A , the minimum number of sequencing cycles shown is three. In some embodiments, the minimum number of sequencing cycles can be 16, 32, or more. Base calling and quality score determination are illustrated above with reference to FIG. 3 . Each read can be aligned to the reference sequence, with the most likely alignment being chosen; the reads can then be stacked in a pile-up and variant calling can be performed.
- the primary analyses includes determining unaligned sequence reads, such as CCA 504 a , TTA 504 d , and TAG 504 k , from 16 clusters shown on the flowcell. Under the Primary Analysis heading, each cluster is represented as a row of letters, with each letter representing a sequenced polynucleotide.
- secondary analyses can include aligning the 16 sequence reads to a reference sequence (GATTACATAAGATTCTTTCATCG 508 (SEQ ID NO: 1)) shown under the Secondary Analysis heading in FIG. 5A.
- the sequences aligned under the reference sequence constitute a pile-up of polynucleotides.
- the alignment probabilities can be refined, and the read alignments may shift to a new most-likely alignment. This shift would trigger new variant calling to be performed in the affected regions.
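- The pile-up and the re-calling of affected regions can be sketched as follows. This is a minimal illustration only: the function names, the `(start, read)` representation of an alignment, and the 0-based offsets are assumptions made for the example and are not taken from the disclosure.

```python
from collections import Counter, defaultdict


def build_pileup(alignments):
    """Stack aligned reads into per-position base counts.

    alignments: iterable of (start, read) pairs, where start is a 0-based
    offset into the reference (an assumed representation).
    """
    pileup = defaultdict(Counter)
    for start, read in alignments:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    return pileup


def affected_positions(old_start, new_start, read_length):
    """Reference positions whose pile-up column changes when a read's
    most-likely alignment shifts from old_start to new_start."""
    old = set(range(old_start, old_start + read_length))
    new = set(range(new_start, new_start + read_length))
    return old | new
```

Only the columns returned by `affected_positions` would need to be passed to the variant caller again after a shift; all other columns keep their earlier calls.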
- the sequencing reads CCA 504 a , TTA 504 d , and TAG 504 k from the third sequencing cycle become CCAT 504 a ′ (row 1 under the “Primary Analysis” heading), TTAC 504 d ′ (row 4), and TAGG 504 k ′ (row 11) respectively.
- the variant called after the 32nd sequencing cycle or during the 33rd sequencing cycle may be an initial variant called. During subsequent sequencing cycles, the variant called may be refined (including that a variant previously called for a particular nucleotide position is no longer called and drops off).
- a variant for the fourth position of TTACAT was called to be a G after the third cycle, while no variant for the position was called after the fourth cycle.
- FIGS. 7A and 7B are schematic illustrations comparing a traditional method of secondary analysis (FIG. 7A) to the secondary analysis of an embodiment of the present disclosure (FIG. 7B).
- FIG. 7A illustrates that for a traditional method of secondary analysis, the alignment does not proceed until the full set of bases in the read are sequenced.
- the alignment process can include multiple alignment processing steps.
- the first alignment processing step waits for the full set of sequenced bases in the read to be available.
- the variant caller process, which includes multiple variant caller processing steps, can begin.
- the first variant caller processing step waits for the full set of alignment data to be available.
- alignment can occur at intervals of 16 bases as illustrated.
- Variant calling can occur at intervals of 16 bases after alignment is complete.
- a sequencing system for real-time secondary analysis may output 16 bases of sequence reads every 1.3 hours.
- the total time required for performing alignment and variant calling should be within 1.3 hours such that a user can have access to the variant calls before the next 16 bases of sequence reads are available.
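- The timing constraint can be expressed as a simple budget check. The sketch below is illustrative only; the function names are hypothetical, and the defaults (16 bases per interval, 1.3 hours per interval) are the example figures given in the text.

```python
def hours_per_cycle(interval_hours=1.3, bases_per_interval=16):
    """Average sequencing time per base (cycle) within one output interval."""
    return interval_hours / bases_per_interval


def fits_real_time_budget(align_hours, variant_call_hours, interval_hours=1.3):
    """True if alignment plus variant calling finish before the next
    batch of bases becomes available."""
    return align_hours + variant_call_hours <= interval_hours
```

With the example figures, each cycle takes about 4.9 minutes, and any combination of alignment and variant calling that sums to 1.3 hours or less keeps the secondary analysis in step with the sequencer.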
- processing can occur continuously as fast as possible on the available computer resources, with no fixed iteration steps.
- the analysis can self-adjust and will be as close to the sequencing progress as possible. Alignments and variant calling results can be generated on demand at any time.
- FIG. 9A is a flowchart of an example method 900 for performing a real-time secondary analysis.
- the method 900 includes two paths: a low confidence, high computation processing path of a traditional secondary analysis method and a high confidence, low computation processing path according to one embodiment of the present disclosure.
- the low confidence, high processing path and the high confidence, low processing path are referred to herein as the blue path and the yellow path respectively.
- the isAligned variable is set to 0 or False.
- the MapQ (MAPping Quality) score can be calculated.
- the MapQ score can be equal to -10 × log10(Pr{mapping position is wrong}), rounded to the nearest integer. So if the probability of correctly mapping some random read was 0.99, then the MapQ score should be 20 (i.e., log10 of 0.01, multiplied by -10). If the probability of a correct match increased to 0.999, the MapQ score would increase to 30. Conversely, as the probability of a correct match tends towards zero, so does the MapQ score.
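- The MapQ calculation can be written out as a short function. This is a minimal sketch; the function name and the choice to reject probabilities outside (0, 1] are assumptions made for the example.

```python
import math


def mapq_score(p_wrong):
    """Phred-scaled mapping quality: -10 * log10(p_wrong), rounded to the
    nearest integer. p_wrong is the probability that the mapping position
    is wrong."""
    if not 0.0 < p_wrong <= 1.0:
        raise ValueError("p_wrong must be in the interval (0, 1]")
    return int(round(-10.0 * math.log10(p_wrong)))
```

For example, a read mapped correctly with probability 0.99 has `p_wrong = 0.01` and a score of 20, and a read mapped correctly with probability 0.999 has a score of 30.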
- FIG. 11B shows an embodiment of the variant caller as disclosed in the current invention.
- a metric is generated to determine if a polynucleotide at a genomic position can be determined with high confidence. For example, a high confidence decision could be generated if all polynucleotides at a given genomic position are the same. Alternatively, a high confidence decision could be generated if the number of polynucleotides of the same type at a genomic position is higher than a threshold. Alternative metrics for determining high confidence can also be implemented. If the polynucleotide can be determined with high confidence, then the formulation of the probabilities may be skipped and a simple variant calling step may be executed. For example, a simple variant caller may call any variant that is detected with high confidence.
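- The high-confidence shortcut can be sketched as follows. The two tests mirror the examples in the text (all bases identical, or the count of one base exceeding a threshold); the function names, the return labels, and the `count_threshold` parameter are hypothetical, and the probabilistic fallback is represented only by the "uncertain" label.

```python
from collections import Counter


def high_confidence_base(column, count_threshold=None):
    """Return the dominant base of a pile-up column if it passes either
    high-confidence test, otherwise None.

    count_threshold should exceed half the column depth so that a tied
    column is never treated as confident.
    """
    counts = Counter(column)
    if not counts:
        return None
    base, count = counts.most_common(1)[0]
    if count == len(column):  # every read agrees at this position
        return base
    if count_threshold is not None and count > count_threshold:
        return base
    return None


def simple_variant_call(column, reference_base, count_threshold=None):
    """Skip the probability calculation when the column is high confidence."""
    base = high_confidence_base(column, count_threshold)
    if base is None:
        return ("uncertain", None)  # would go to the full probabilistic path
    if base == reference_base:
        return ("reference", base)
    return ("variant", base)
```

Columns that return "uncertain" are the only ones that would require the more expensive probability formulation.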
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Biotechnology (AREA)
- Medical Informatics (AREA)
- Analytical Chemistry (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Organic Chemistry (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Zoology (AREA)
- Wood Science & Technology (AREA)
- Data Mining & Analysis (AREA)
- Immunology (AREA)
- Biochemistry (AREA)
- General Engineering & Computer Science (AREA)
- Microbiology (AREA)
- Artificial Intelligence (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Abstract
Disclosed herein are systems and methods for performing secondary analyses of nucleotide sequencing data in a time-efficient manner. Some embodiments include performing a secondary analysis iteratively while sequence reads are generated by a sequencing system. Secondary analyses can encompass both alignment of sequence reads to a reference sequence (e.g., the human reference genome sequence) and utilization of this alignment to detect differences between a sample and the reference. Secondary analysis can enable detection of genetic differences, variant detection and genotyping, identification of single nucleotide polymorphisms (SNPs), small insertions and deletions (indels) and structural changes in the DNA, such as copy number variants (CNVs) and chromosomal rearrangements.
Description
- The present application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application No. 62/405,824, filed Oct. 7, 2016; the content of which is incorporated herein by reference in its entirety.
- The present application is being filed with a Sequence Listing in electronic format. The Sequence Listing is provided as a file entitled “Sequence_Listing_ILLINC_346C1.xml”, created Jun. 22, 2023, which is approximately 2.0 kb in size. The information in the electronic format of the sequence listing is incorporated herein by reference in its entirety.
- The present disclosure relates generally to the field of DNA sequencing, and more particularly relates to systems and methods for performing real-time secondary analyses for next generation sequencing applications.
- Genetic mutations can be identified by identifying variants, relative to reference sequences, in sequence reads. To identify a variant, a sample from a subject may be completely sequenced using a sequencing instrument to obtain sequence reads. After obtaining sequence reads, the sequence reads can be assembled or aligned prior to variant calling. Thus, identifying variants includes distinct steps that are performed sequentially and can be time consuming to perform after the conclusion of the sequencing process.
- Disclosed herein are systems and methods for sequencing polynucleotides. In one embodiment, the system comprises: a memory comprising a reference nucleotide sequence; a processor configured to execute instructions that perform a method comprising: receiving a first nucleotide subsequence of a read from a sequencing system; processing the first nucleotide subsequence using a first alignment path to determine a first plurality of candidate locations of the read on the reference sequence; determining whether the first nucleotide subsequence aligns to the reference sequence based on the determined candidate locations; receiving a second nucleotide subsequence from the sequencing system; processing the second nucleotide subsequence to determine a second plurality of candidate locations of the read that align to the reference sequence using: a second alignment path if the read is aligned to the reference sequence, and the first alignment path if otherwise, wherein the second alignment path is more computationally efficient than the first alignment path to determine the second plurality of candidate locations of the read.
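- The choice between the two alignment paths can be illustrated with a deliberately simplified sketch. Here the first path scans every offset of the reference, while the second path only re-scores the candidate locations found earlier. The mismatch-count scoring, the `max_mismatches` parameter, the function names, and the rule that a read counts as aligned when at least one candidate remains are all assumptions made for the example; the reference string is SEQ ID NO: 1 from the description, and the read is modeled as a growing prefix.

```python
REFERENCE = "GATTACATAAGATTCTTTCATCG"  # SEQ ID NO: 1


def mismatches(read, reference, start):
    """Count positions where the read differs from the reference window."""
    window = reference[start:start + len(read)]
    return sum(1 for a, b in zip(read, window) if a != b)


def full_scan(read, reference, max_mismatches=1):
    """First path: test every possible offset (more computation)."""
    last = len(reference) - len(read)
    return [s for s in range(last + 1)
            if mismatches(read, reference, s) <= max_mismatches]


def recheck(read, reference, candidates, max_mismatches=1):
    """Second path: re-score only previously found candidates."""
    last = len(reference) - len(read)
    return [s for s in candidates
            if s <= last and mismatches(read, reference, s) <= max_mismatches]


def align_step(read, reference, candidates, is_aligned, max_mismatches=1):
    """Pick the path based on whether the read is already aligned."""
    if is_aligned:
        found = recheck(read, reference, candidates, max_mismatches)
    else:
        found = full_scan(read, reference, max_mismatches)
    return found, bool(found)
```

Starting from the three-base read TTA and then extending it to TTAC, the second call inspects only the seven candidates left by the first call rather than all twenty offsets, and it arrives at the same candidate set as a full scan would.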
- In one embodiment, the method comprises: receiving a first nucleotide subsequence from a sequencing system during a sequencing run; and performing a secondary analysis of the first nucleotide subsequence of a read based on a reference sequence using a first analysis path or a second analysis path, wherein the second analysis path is more computationally efficient than the first analysis path in performing the secondary analysis.
- FIG. 1 is a schematic illustration showing an example sequencing system for performing real-time analyses.
- FIG. 2 shows a functional block diagram of an example computer system for performing real-time analyses.
- FIG. 3 is a flowchart of an example method for sequencing by synthesis.
- FIG. 4 is a flowchart of an example method for performing base calling.
- FIGS. 5A and 5B show example iterative alignments and variant calling.
- FIG. 6 is a flowchart of an example method for performing a real-time secondary sequence analysis.
- FIGS. 7A and 7B are schematic illustrations comparing a traditional method of secondary analysis (FIG. 7A) to an iterative method of secondary analysis (FIG. 7B).
- FIG. 8 is a schematic illustration of read generation at 16-base intervals.
- FIG. 9A is a flowchart of an example method for performing a real-time secondary analysis. FIG. 9B is a predicted line graph showing data processed per K-Mer. FIG. 9C is a bar chart showing run times.
- FIG. 10 is another flowchart of an example method for performing a real-time secondary analysis.
- FIGS. 11A and 11B compare an existing variant caller (FIG. 11A) to a variant caller that uses high confidence, low processing path as described herein (FIG. 11B).
- In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the Figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.
- Disclosed herein are systems and methods for performing secondary analyses of nucleotide sequencing data in a time-efficient manner. In some embodiments, the method comprises performing a secondary analysis iteratively while sequence reads are generated by a sequencing system. Secondary analyses can encompass both alignment of sequence reads to a reference sequence (e.g., the human reference genome sequence) and utilization of this alignment to detect differences between a sample and the reference. Secondary analyses can enable detection of genetic differences, variant detection and genotyping, identification of single nucleotide polymorphisms (SNPs), small insertions and deletions (indels) and structural changes in the DNA, such as copy number variants (CNVs) and chromosomal rearrangements.
- By performing secondary analyses while sequence reads are generated, the system and method can determine preliminary variant calls iteratively in real-time (or with zero or low latency). Final results of variant determinations may be available soon after (or immediately after) the end of a sequencing run. Alternatively, a sequencing run can be terminated early if variant calls are available with sufficient confidence during the run. In some embodiments, only information related to variant determinations (e.g., variant calls) is transferred off the sequencing system. This can decrease, or minimize, the data bandwidth required in comparison to performing the variant determinations in a system that is external. In addition, only variant information may be sent to a computing system (e.g., a cloud computing system) for further processing. In this embodiment, sequencing runs may be terminated prior to completion of an entire sequencing process. For example, if the identity of a pathogen of interest is determined after a number of sequencing cycles of a sequencing run, the sequencing run can be terminated. Thus, the time to a particular answer (e.g., pathogen identification) may be decreased. In one embodiment, outputs and intermediate results of the system can include histograms of duplicates, exact matches, single and double SNPs, and single and double indels.
- Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs. See, e.g., Singleton et al., Dictionary of Microbiology and Molecular Biology 2nd ed., J. Wiley & Sons (New York, NY 1994); Sambrook et al., Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Press (Cold Spring Harbor, NY 1989). For purposes of the present disclosure, the following terms are defined below.
- Disclosed herein are systems and methods for performing secondary analyses iteratively in a time and/or computing resource efficient manner. Secondary analyses can encompass both alignment of sequence reads to a reference sequence (e.g., the human reference genome sequence) and utilization of this alignment to detect differences between a sample and the reference. Secondary analyses can enable detection of genetic differences, variant detection and genotyping, identification of single nucleotide polymorphisms (SNPs), small insertions and deletions (indels) and structural changes in the DNA, such as copy number variants (CNVs) and chromosomal rearrangements. Secondary analyses may be performed for one sequencing cycle while sequencing data is being generated for the next sequencing cycle.
- FIG. 1 is a schematic illustration showing an example sequencing system 100 for performing real-time secondary analyses. Non-limiting examples of the sequencing method utilized by the sequencing system 100 can include sequencing by synthesis and Heliscope single molecule sequencing. The sequencing system 100 can include an optics system 102 configured to generate raw sequencing data using sequencing reagents supplied by a fluidics system 104 that is part of the sequencing system 100. The raw sequencing data can include fluorescent images captured by the optics system 102. A computer system 106 that is part of the sequencing system 100 can be configured to control the optics system 102 and the fluidics system 104 via communication channels 108 a and 108 b. A computer interface 110 of the optics system 102 can be configured to communicate with the computer system 106 through the communication channel 108 a.
- During sequencing reactions, the fluidics system 104 can direct the flow of reagents through one or more reagent tubes 112 to and from a flowcell 114 positioned on a mounting stage 116. The reagents can be, for example, fluorescently labeled nucleotides, buffers, enzymes, and cleavage reagents. The flowcell 114 can include at least one fluidic channel. The flowcell 114 can be a patterned array flowcell or a random array flowcell. The flowcell 114 can include multiple clusters of single-stranded polynucleotides to be sequenced in the at least one fluidic channel. The lengths of the polynucleotides can vary, ranging, for example, from 200 bases to 1000 bases. The polynucleotides can be attached to the one or more fluidic channels of the flowcell 114. In some embodiments, the flowcell 114 can include a plurality of beads, wherein each bead can include multiple copies of a polynucleotide to be sequenced. The mounting stage 116 can be configured to allow proper alignment and movement of the flowcell 114 in relation to the other components of the optics system 102. In one embodiment, the mounting stage 116 can be used to align the flowcell 114 with a lens 118.
- The optics system 102 can include multiple lasers 120 configured to generate light at predetermined wavelengths. The light generated by the lasers 120 can pass through a fiber optic cable 122 to excite fluorescent labels in the flowcell 114. The lens 118, mounted on a focuser 124, can move along the z-axis. The focused fluorescent emissions can be detected by a detector 126, for example a charge-coupled device (CCD) sensor or a complementary metal oxide semiconductor (CMOS) sensor.
- A filter assembly 128 of the optics system 102 can be configured to filter the fluorescent emissions of the fluorescent labels in the flowcell 114. The filter assembly 128 can include a first filter and a second filter. Each filter can be a longpass filter, a shortpass filter, or a bandpass filter, depending on the types of fluorescent molecules being used in the system. The first filter can be configured to detect the fluorescent emissions of the first fluorescent labels by the detector 126. The second filter can be configured to detect the fluorescent emissions of the second fluorescent labels by the detector 126. With two filters in the filter assembly 128, the detector 126 can detect two different wavelengths of fluorescent emissions.
- In some embodiments, the optics system 102 can include a dichroic configured to split the fluorescent emissions. The optics system 102 can include two detectors, a first detector coupled with a first filter for detecting fluorescent emissions at a first wavelength and a second detector coupled with a second filter for detecting fluorescent emissions at a second wavelength.
- In use, a sample having a polynucleotide to be sequenced is loaded into the flowcell 114 and placed in the mounting stage 116. The computer system 106 then activates the fluidics system 104 to begin a sequencing cycle. During sequencing reactions, the computer system 106 instructs the fluidics system 104, through the communication interface 108 b, to supply reagents, for example nucleotide analogs, to the flowcell 114. Through the communication interface 108 a and the computer interface 110, the computer system 106 is configured to control the lasers 120 of the optics system 102 to generate light at a predetermined wavelength and shine onto nucleotide analogs coupled with fluorescent labels incorporated into growing primers hybridized to polynucleotides being sequenced. The computer system 106 controls the detector 126 of the optics system 102 to capture the emission spectra of the nucleotide analogs in fluorescent images. The computer system 106 receives the fluorescent images from the detector 126 and processes the fluorescent images received to determine the nucleotide sequence of the polynucleotides being sequenced.
- The computer system 106 of the sequencing system 100 can be configured to control the optics system 102 and the fluidics system 104 as discussed above. While many configurations are possible for the computer system 106, one embodiment is illustrated in FIG. 2. As shown in FIG. 2, the computer system 106 can include a processor 202 that is in electrical communication with a memory 204, a storage 206, and a communication interface 208. In one embodiment, the computer system 106 includes a field-programmable gate array (FPGA), graphics processing unit (GPU), and/or vector central processing unit (CPU) to perform sequence alignments and generate variant calls.
- The processor 202 can be configured to execute instructions that cause the fluidics system 104 to supply reagents to the flowcell 114 during sequencing reactions. The processor 202 can execute instructions that control the lasers 120 of the optics system 102 to generate light at predetermined wavelengths. The processor 202 can execute instructions that control the detector 126 of the optics system 102 and receive data from the detector 126. The processor 202 can execute instructions to process data, for example fluorescent images, received from the detector 126 and to determine the nucleotide sequence of polynucleotides based on the data received from the detector 126.
- The memory 204 can be configured to store instructions for configuring the processor 202 to perform the functions of the computer system 106 when the sequencing system 100 is powered on. When the sequencing system 100 is powered off, the storage 206 can store the instructions for configuring the processor 202 to perform the functions of the computer system 106. The communication interface 208 can be configured to facilitate the communications between the computer system 106, the optics system 102, and the fluidics system 104.
- The computer system 106 can include a user interface 210 configured to communicate with a display device (not shown) for displaying the sequencing results (including results of secondary analyses such as variant callings) of the sequencing system 100. The user interface 210 can be configured to receive inputs from users of the sequencing system 100. An optics system interface 212 and a fluidics system interface 214 of the computer system 106 can be configured to control the optics system 102 and the fluidics system 104 through the communication links 108 a and 108 b illustrated in FIG. 1. For example, the optics system interface 212 can communicate with the computer interface 110 of the optics system 102 through the communication link 108 a.
- The computer system 106 can include a nucleic base determiner 216 configured to determine the nucleotide sequence of polynucleotides using the data received from the detector 126. The nucleic base determiner 216 can generate a template of the locations of polynucleotide clusters in the flowcell 114 using the fluorescent images captured by the detector 126. The nucleic base determiner 216 can register the locations of polynucleotide clusters in the flowcell 114 in the fluorescent images captured by the detector 126 based on the location template generated. The nucleic base determiner 216 can extract intensities of the fluorescent emissions from the fluorescent images to generate extracted intensities. The nucleic base determiner 216 can determine the bases of the polynucleotide from the extracted intensities. The nucleic base determiner 216 can determine quality scores of the bases of the polynucleotides determined.
- The computer system 106 can include an iterative aligner 218 and a variant caller 220, such as the Strelka variant caller (sites.google.com/site/strelkasomaticvariantcaller/home/faq). During a sequencing cycle, the iterative aligner 218 can align sequence reads determined by the nucleic base determiner 216 to a reference sequence. The aligned sequence reads can have associated scores. The scores can be probabilities (e.g., mismatch percentages) that the sequence reads have been correctly aligned to the reference sequence. In some implementations, the computer system 106 can include hardware, such as a field-programmable gate array (FPGA) or a graphics processing unit (GPU), for aligning sequence reads to the reference sequence and for determining variant calls. In some embodiments, the iterative aligner 218 and the variant caller 220 can be implemented by a computer system distinct from the computer system 106. In some embodiments, the computer system 106 may be an integrated component of the sequencing system 100. In some embodiments, the optics system 102, the fluidics system 104, and/or the computer system 106 can be integrated into one machine.
- FIG. 3 is a flowchart of an example method 300 for sequencing by synthesis utilizing the sequencing system 100. After the method 300 begins at block 305, a flowcell 114 including fragmented double-stranded polynucleotide fragments is received at block 310. The fragmented double-stranded polynucleotide fragments can be generated from a deoxyribonucleic acid (DNA) sample. The DNA sample can be from various sources, for example, a biological sample, a cell sample, an environmental sample, or any combination thereof. The DNA sample can include one or more of a biological fluid, a tissue, and cells from a patient. For example, the DNA sample can be taken from, or include, blood, urine, cerebrospinal fluid, pleural fluid, amniotic fluid, semen, saliva, bone marrow, a biopsy sample, or any combination thereof.
- The DNA sample can include DNA from cells of interest. The cells of interest can vary and in some embodiments express a malignant phenotype. In some embodiments, the cells of interest can include tumor cells, bone marrow cells, cancer cells, stem cells, endothelial cells, virally infected cells, pathogenic or parasitic organism cells, or any combination thereof.
- The lengths of fragmented double-stranded polynucleotide fragments can range from 200 bases to 1000 bases. Once the flowcell 114 including fragmented double-stranded polynucleotide fragments is received at block 310, the method 300 proceeds to block 315, where the double-stranded polynucleotide fragments are bridge-amplified into clusters of polynucleotide fragments attached to the inside surface of one or more channels of a flowcell, for example the flowcell 114. The inside surface of the one or more channels of the flowcell can include two types of primers, for example a first primer type (P1) and a second primer type (P2), and the DNA fragments can be amplified by well-known methods.
- After generating clusters within the flowcell 114, the method 300 can begin a Sequencing by Synthesis process. The Sequencing by Synthesis process can include determining the nucleotide sequence of clusters of single-stranded polynucleotide fragments. To determine the sequence of a cluster of single-stranded polynucleotide fragments with the sequence 5′-P1-F-A2R-3′, primers with the sequence A2F, which are complementary to the sequence A2R, can be added and extended at block 320 with nucleotide analogs with zero, one, or two labels by a DNA polymerase to form growing primer-polynucleotides.
- During each sequencing cycle, four types of nucleotide analogs can be added and incorporated onto the growing primer-polynucleotides. The four types of nucleotide analogs can have different modifications. For example, the first type of nucleotide can be an analog of deoxyguanosine triphosphate (dGTP) not conjugated with any fluorescent label. The second type of nucleotide can be an analog of deoxythymidine triphosphate (dTTP) conjugated with the first type of fluorescent label via a linker. The third type of nucleotide can be an analog of deoxycytidine triphosphate (dCTP) conjugated with the second type of fluorescent label via a linker. The fourth type of nucleotide can be an analog of deoxyadenosine triphosphate (dATP) conjugated with both the first type of fluorescent label and the second type of fluorescent label via one or more linkers. The linkers may include one or more cleavage groups. Prior to the subsequent sequencing cycle, the fluorescent labels can be removed from the nucleotide analogs. For example, a linker attaching a fluorescent label to a nucleotide analog can include an azide and/or an alkoxy group, for example on the same carbon, such that the linker may be cleaved after each incorporation cycle by a phosphine reagent, thereby releasing the fluorescent label from subsequent sequencing cycles.
- The nucleotide triphosphates can be reversibly blocked at the 3′ position so that sequencing is controlled and no more than a single nucleotide analog can be added onto each extending primer-polynucleotide in each cycle. For example, the 3′ ribose position of a nucleotide analog can include both alkoxy and azido functionalities which can be removable by cleavage with a phosphine reagent, thereby creating a nucleotide that can be further extended. After the incorporation of nucleotide analogs, the
fluidics system 104 can wash the one or more channels of theflowcell 114 in order to remove any unincorporated nucleoside analogs and enzyme. Prior to the subsequent sequencing cycle, the reversible 3′ blocks can be removed so that another nucleotide analog can be added onto each extending primer-polynucleotide. - At
block 325, lasers such as thelasers 120 can excite the two fluorescent labels at predetermined wavelengths. Atblock 330, signals from the fluorescent labels can be detected. Detecting the fluorescent labels can include capturing fluorescent emissions in two fluorescent images at a first wavelength and a second wavelength by, for example, thedetector 126 using two filters. The fluorescent emissions of the first fluorescent label can be at, or around, the first wavelength, and the fluorescent emissions of the second fluorescent label can be at, or around, the second wavelength. The fluorescent images can be stored for later processing offline. In some embodiments, the fluorescent images can be processed to determine the sequence of the growing primer-polynucleotides in each cluster in real-time. - In online real-time fluorescent imaging processing, the fluorescent images comprising the fluorescent signals detected can be processed at
block 335, and the bases of the nucleotides incorporated can be determined. For each nucleotide base determined, a quality score can be determined atblock 340. A determination can be made atdecision block 345 whether to detect more nucleotides based on, for example, the quality of the signal or after a predetermined number of bases. If more nucleotides are to be detected, then nucleotide determination of the next sequencing cycle can be performed atblock 320. In some embodiments, the labeled nucleotides may be added to one end of the DNA strand corresponding to a cluster. The labeled nucleotides may also be added to the other end of the DNA strand corresponding to the cluster. Reads on one end of the DNA strand are often referred to as theRead 1 set, and those reads on the other end of a DNA strand are often referred to as theRead 2 set. The sequencing technique that allows the determination of two or more reads of sequence from two places on a single polynucleotide duplex is known as paired-end (PE) sequencing. The two or more reads of sequences from the two places on the single polynucleotide duplex is referred to asRead 1 set,Read 2 set, etc. Paired-end sequencing has been described in U.S. patent application Ser. No. 14/683,580; the content of which is incorporated herein by reference in its entirety. The advantage of the paired-end approach is that there is significantly more information to be gained from sequencing two stretches from a single template than from sequencing each of two independent templates in a random fashion. - Prior to the next sequencing cycle, the fluorescent labels can be removed from the nucleotide analogs, and the reversible 3′ blocks can be removed so that another nucleotide analog can be added onto each extending primer-polynucleotide. After all the fluorescent images are processed, the
method 300 can terminate at block 350. - Base calling can refer to the process of determining bases of the nucleotides incorporated into the clusters of growing primer-polynucleotides being sequenced to be guanine (G), thymine (T), cytosine (C), or adenine (A).
FIG. 4 is a flowchart of an example method 400 for performing base calling utilizing the sequencing system 100. Processing detected signals at block 335 illustrated in FIG. 3 can include performing base calling of the method 400. After beginning at block 405, light of predetermined wavelengths can be generated using lasers. The light generated can shine onto nucleotide analogs at block 410. For example, the computer system 106, through its optics system interface 212 and the communication channel 108a, can cause the lasers 120 to generate light at the predetermined wavelength. - The laser-generated light can shine onto nucleotide analogs incorporated into growing primer-polynucleotides attached to an inside surface of one or more channels of a flow cell, for example, the
flowcell 114. The primer-polynucleotides can include clusters of single-stranded polynucleotide fragments hybridized to sequencing primers. The nucleotide analogs each can include zero, one, or two fluorescent labels. The two fluorescent labels can be a first fluorescent label and a second fluorescent label. The fluorescent labels, after being excited by the laser-generated light, can emit fluorescent emissions. For example, the first fluorescent label can produce fluorescent emissions at the first wavelength which can be captured in, for example, a first fluorescent image. The second fluorescent label can produce fluorescent emissions at the second wavelength which can be captured in, for example, a second fluorescent image. - The nucleotide analogs can include a first type of nucleotide, a second type of nucleotide, a third type of nucleotide, and a fourth type of nucleotide. The first type of nucleotide, for example an analog of deoxyguanosine triphosphate (dGTP), is not conjugated to the first fluorescent label or the second fluorescent label. The second type of nucleotide, for example an analog of deoxythymidine triphosphate (dTTP), can be conjugated with the first type of fluorescent label, and not the second type of fluorescent label. The third type of nucleotide, for example an analog of deoxycytidine triphosphate (dCTP), can be conjugated with the second type of fluorescent label, and not the first type of fluorescent label. The fourth type of nucleotide, for example an analog of deoxyadenosine triphosphate (dATP), can be conjugated with both the first type of fluorescent label and the second type of fluorescent label.
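The labeling scheme above amounts to a four-entry lookup from the presence or absence of emission at each wavelength to a base. The following is a minimal sketch of that lookup, not the disclosed implementation; the function name `call_base`, the intensity inputs, and the `minimal` cutoff value are assumptions introduced for illustration.

```python
# Hypothetical lookup for the two-label scheme described above.
# Keys are (emission at first wavelength, emission at second wavelength).
BASE_BY_SIGNAL = {
    (False, False): "G",  # dGTP analog: neither label
    (True, False): "T",   # dTTP analog: first label only
    (False, True): "C",   # dCTP analog: second label only
    (True, True): "A",    # dATP analog: both labels
}


def call_base(intensity_1, intensity_2, minimal=0.1):
    """Return the base for one cluster in one sequencing cycle.

    `minimal` is an assumed cutoff separating "no, or minimal" emission
    from a detected emission; real systems would calibrate this.
    """
    return BASE_BY_SIGNAL[(intensity_1 > minimal, intensity_2 > minimal)]


print(call_base(0.0, 0.05))  # G
print(call_base(0.9, 0.0))   # T
print(call_base(0.0, 0.8))   # C
print(call_base(0.7, 0.9))   # A
```

A table lookup is used here only because the scheme has exactly four signal combinations; the flowchart of FIG. 4 reaches the same result through a sequence of decision blocks.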
- At
block 415, fluorescent emissions of the nucleotide analogs at the first wavelength and the second wavelength can be detected using at least one detector. For example, the detector 126 can capture two fluorescent images, a first fluorescent image at the first wavelength and a second fluorescent image at the second wavelength. After receiving the two fluorescent images from the optics system 102, the nucleic base determiner 216 can determine the presence or the absence of fluorescent emissions in the two fluorescent images. - Because the first type of nucleotide is not conjugated to the first fluorescent label or the second fluorescent label, the first type of nucleotide can produce no, or minimal, fluorescent emission at the first wavelength or at the second wavelength. At
decision block 420, if no fluorescent emission is detected, the nucleotide can be determined to be the first type of nucleotide, for example dGTP. If more than minimal fluorescent emission is detected, the method 400 can proceed to decision block 425. - Because the second type of nucleotide is conjugated with the first type of fluorescent label, and not the second type of fluorescent label, the second type of nucleotide can produce fluorescent emissions at the first wavelength and no, or minimal, fluorescent emission at the second wavelength. At
decision block 425, if no fluorescent emission at the second wavelength is detected in the second fluorescent image, and from decision block 420, fluorescent emissions at the first wavelength are detected in the first fluorescent image, then the nucleotide can be determined to be the second type of nucleotide, for example dTTP. If fluorescent emissions are detected at the second wavelength, the method 400 can proceed to decision block 430. - Because the third type of nucleotide is conjugated with the second type of fluorescent label, and not the first type of fluorescent label, the third type of nucleotide can produce fluorescent emissions at the second wavelength and no, or minimal, fluorescent emission at the first wavelength. At
decision block 430, if no fluorescent emission at the first wavelength is detected in the first fluorescent image, and from decision block 425, fluorescent emissions at the second wavelength are detected in the second fluorescent image, then the nucleotide can be determined to be the third type of nucleotide, for example dCTP. - Because the fourth type of nucleotide is conjugated with both the first type of fluorescent label and the second type of fluorescent label, the fourth type of nucleotide can produce fluorescent emissions at both the first wavelength and the second wavelength. At
decision block 430, if fluorescent emissions are detected at the first wavelength in the first fluorescent image, and from decision block 425, fluorescent emissions are detected at the second wavelength in the second fluorescent image, then the nucleotide can be determined to be the fourth type of nucleotide, for example dATP. - The
flowcell 114 can include clusters of growing primer-polynucleotides to be sequenced. At decision block 435, if there is at least one more cluster with fluorescent emissions to be processed for a given sequencing cycle, the method 400 can continue at block 410. If no more clusters of single-stranded polynucleotides are to be processed, the method 400 can end at block 440. - The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleotide base type from another are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid can be an automated process. Preferred embodiments include sequencing-by-synthesis (“SBS”) techniques.
- “Sequencing-by-synthesis (“SBS”) techniques” generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
-
FIGS. 5A and 5B show an example iterative alignment and variant calling process according to one embodiment. After a minimum number of sequencing cycles have been imaged, real-time primary analyses can be performed in order to determine the base calls and quality scores for each unaligned read. In FIG. 5A, the minimum number of sequencing cycles shown is three. In some embodiments, the minimum number of sequencing cycles can be 16, 32, or more. Base calling and quality score determination are illustrated above with reference to FIG. 3. Each read can be aligned to the reference sequence with the most likely alignment being chosen, and then the reads can be stacked in a pile-up and variant calling can be performed. - In
FIG. 5A, the primary analyses include determining unaligned sequence reads, such as CCA 504a, TTA 504d, and TAG 504k, from 16 clusters shown on the flowcell. Under the Primary Analysis heading, each cluster is represented as a row of letters, with each letter representing a sequenced nucleotide. Once the minimum number of cycles have been sequenced, e.g., 3 cycles, secondary analyses can include aligning the 16 sequence reads to a reference sequence (GATTACATAAGATTCTTTCATCG 508 (SEQ ID NO: 1)) shown under the Secondary Analysis heading in FIG. 5A. In the Secondary Analysis diagram, the sequences aligned under the reference sequence constitute a pile-up of polynucleotides. As an example, the sequence reads CCA 504a (row 1 under the “Primary Analysis” heading), TTA 504d (row 4), and TAG 504k (row 11) can be aligned to sequences ACA, TTA, and TAC, respectively, within the TTACAT 512 subsequence of the reference sequence 508, with one, zero, and one mismatch, respectively. Thus, the third position of the TTACAT 512 subsequence can be determined to be a C 516a instead of an A in the reference sequence 508 with some probability of correctness, and the fourth position of the TTACAT 512 subsequence can be determined to be a G 516b instead of a C in the reference sequence with some probability of correctness. Other variants of the reference sequence can be similarly determined. - As new sequencing cycles are performed and base calls determined, the alignment probabilities can be refined, and the read alignments may shift to a new most-likely alignment. This shift would trigger new variant calling to be performed in the affected regions. In
FIG. 5B, after a fourth sequencing cycle, the sequencing reads CCA 504a, TTA 504d, and TAG 504k from the third sequencing cycle become CCAT 504a′ (row 1 under the “Primary Analysis” heading), TTAC 504d′ (row 4), and TAGG 504k′ (row 11), respectively. The sequence reads CCAT 504a′ and TTAC 504d′ can still be aligned to the TTACAT 512 subsequence of the reference sequence 508, with one and zero mismatches, respectively. For sequence reads CCAT 504a′ and TTAC 504d′, the alignment position does not change between the iteration shown in FIG. 5A and the iteration shown in FIG. 5B; the third position of the TTACAT 512 subsequence can be determined to be a C 516a instead of an A in the reference sequence. Aligning the read TAGG 504k′ to the TTACAT 512 subsequence requires two mismatches. However, the sequence read TAGG 504k′ can be aligned to TAAG 520 of the reference sequence 508 with higher probability, since this alignment has only one mismatch. The examples of FIG. 5A and FIG. 5B show that alignment positions may shift as the sequencing run proceeds, and variant calling may improve. - In some embodiments, aligning sequence reads to a reference sequence comprises keeping a list of most likely alignments as leaves on a node for each sequence read. Each leaf can have an associated probability. Leaves with probabilities that drop below some threshold can be trimmed.
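The mismatch counts quoted for FIGS. 5A and 5B can be checked with a short script. This is an illustrative sketch only: it scores ungapped placements by counting mismatches, which stands in for the alignment probabilities mentioned in the text, and the helper names `mismatches` and `best_offsets` and the 0-based offsets are assumptions introduced here.

```python
REFERENCE = "GATTACATAAGATTCTTTCATCG"  # SEQ ID NO: 1 from the text


def mismatches(read, reference, offset):
    """Count positions where the read differs from the reference at a 0-based offset."""
    window = reference[offset:offset + len(read)]
    if len(window) < len(read):
        raise ValueError("read runs past the end of the reference")
    return sum(r != w for r, w in zip(read, window))


def best_offsets(read, reference):
    """Return (fewest mismatches, every offset achieving it) over all ungapped placements."""
    scores = [mismatches(read, reference, i)
              for i in range(len(reference) - len(read) + 1)]
    best = min(scores)
    return best, [i for i, s in enumerate(scores) if s == best]


# Third cycle: TAG placed on the TAC inside TTACAT (offset 3) has one mismatch.
print(mismatches("TAG", REFERENCE, 3))    # 1
# Fourth cycle: TAGG needs two mismatches there, but only one against TAAG (offset 7).
print(mismatches("TAGG", REFERENCE, 3))   # 2
print(best_offsets("TAGG", REFERENCE))    # (1, [7])
```

Under this simple scoring, the three-base read TAG ties at one mismatch for several placements along the reference, while the four-base read TAGG has a single best placement. That is one way to see why an alignment can move once another base becomes available.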
-
FIG. 6 is a flowchart of an example method 600 for performing a real-time secondary sequence analysis. After the method 600 starts at block 605, imaging data of a sequencing cycle can be received at block 610. For example, the computer system 106 can receive the imaging data from the detector 126. At block 615, bases can be determined and quality scores of the bases can be determined. Generating imaging data and determining bases and their quality scores are illustrated above with reference to FIGS. 3-4. After each sequencing cycle, the sequencing reads become one nucleotide longer. For example, after the 31st sequencing cycle, the sequencing reads are 31 nucleotides in length, and after the 32nd sequencing cycle, the sequencing reads are 32 nucleotides in length. - At
decision block 620, it can be determined whether a minimum number of sequencing cycles has been performed. The minimum number of sequencing cycles can be 16, 32, or more. If the number of sequencing cycles performed is lower than the minimum required, the method 600 proceeds to block 610. If the number of sequencing cycles performed is at least the minimum required, the method 600 proceeds to block 625. - At
block 625, the sequence reads determined can be aligned to a reference sequence. The method 600 can utilize different alignment methods in different implementations. Non-limiting examples of alignment methods include global alignments (such as the Needleman-Wunsch algorithm), local alignments, dynamic programming (such as the Smith-Waterman algorithm), heuristic algorithms or probabilistic methods, progressive methods, iterative methods, motif finding or profile analysis, genetic algorithms, simulated annealing, pairwise alignments, and multiple sequence alignments. - At
block 630, variants can be determined. An initial variant may be called only after a predetermined variant threshold is reached. The variant threshold may be important due to possible PCR or sequencing errors. The variant threshold can be based on the number of bases aligned to a position of the reference sequence that differ from the base at that position of the reference sequence. - In
FIG. 5A, the variant threshold is one observation. Thus, the third position of TTACAT can be determined to be a C instead of an A in the reference sequence. If the variant threshold is two or more, the C variant will not be called at block 630 at the particular sequencing cycle. In FIG. 5B, the third position of TTACAT can be determined to be a C instead of an A in the reference sequence if the variant threshold is at most two observations. In some embodiments, the variant threshold can be a percentage of all bases aligned to a particular position of the reference sequence, such as 1%, 5%, 10%, 25%, 50%, or more. As described in further detail below, most likely alignments can be stored as leaves on a node for each sequence read. Each leaf can have an associated probability. Leaves with probabilities that drop below some threshold can be trimmed. Thus, variants called for a nucleotide position on the reference sequence may be refined or may drop off during subsequent cycles. - A determination can be made at
decision block 635 whether there are more nucleotides to be read or all sequencing cycles are complete. This determination can be based on, for example, the quality of the signal or after a predetermined number of bases. If there are more nucleotides to be read and not all sequencing cycles are complete, then the method 600 proceeds to block 610, where sequencing data can be generated for the next sequencing cycle. If there are no more nucleotides to be read and all sequencing cycles are complete, the method 600 ends at block 650. - In some embodiments, blocks 625 and 630 and blocks 610 and 615 can be performed in parallel after the minimum number of sequencing cycles have been performed. For example, after 32 sequencing cycles are performed, the method can proceed to block 625 to perform alignment of sequence reads that are 32 nucleotides in length. While the
method 600 performs alignment at block 625 and variant calling at block 630, the next sequencing cycle (i.e., the 33rd sequencing cycle) can be performed. Thus, variants can be determined at block 630 prior to completion of the 33rd sequencing cycle. The method 600 can therefore enable alignment and variant calling in real time (or with zero or low latency) while sequencing cycles are performed. Furthermore, the variants called during earlier sequencing cycles may be refined during subsequent cycles. Thus, variant calling illustrated in FIG. 6 can be an iterative process. For example, the variant called after the 32nd sequencing cycle or during the 33rd sequencing cycle may be an initial variant called. During subsequent sequencing cycles, the variant called may be refined (including that a variant previously called for a particular nucleotide position is no longer called and drops off). As another example, as shown in FIGS. 5A and 5B, a variant for the fourth position of TTACAT was called to be a G after the third cycle, while no variant for the position was called after the fourth cycle. - In another embodiment, the sequencing process may be terminated prior to the time that all sequencing cycles are complete. For example, if a particular target variant is identified prior to the completion of all the sequencing cycles, the sequencing process may terminate. This allows the system to save costs on reagents and provide the desired result earlier than systems that need to complete all cycles before a target variant call is made.
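The variant threshold used at block 630, and the way a call can drop off in a later cycle, can be illustrated with a small pile-up counter. This is a sketch under stated assumptions rather than the disclosed variant caller: it considers only the three reads discussed in the text (not all 16 clusters of FIGS. 5A and 5B), handles mismatches only, and the names `call_variants`, `min_observations`, and `min_fraction` are hypothetical.

```python
from collections import Counter

REFERENCE = "GATTACATAAGATTCTTTCATCG"  # SEQ ID NO: 1 from the text


def call_variants(reference, placed_reads, min_observations=1, min_fraction=0.0):
    """Call mismatching bases from a pile-up of (0-based offset, read) placements.

    A non-reference base is called at a position only if it is observed at least
    `min_observations` times and in at least `min_fraction` of the covering reads.
    """
    pileup = {}  # reference position -> Counter of observed bases
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            pileup.setdefault(offset + i, Counter())[base] += 1
    calls = {}
    for pos, counts in sorted(pileup.items()):
        depth = sum(counts.values())
        for base, n in counts.items():
            if base != reference[pos] and n >= min_observations and n / depth >= min_fraction:
                calls.setdefault(pos, []).append(base)
    return calls


# Placements after the third cycle (CCA, TTA, TAG) and the fourth cycle (CCAT, TTAC, TAGG).
cycle3 = [(4, "CCA"), (2, "TTA"), (3, "TAG")]
cycle4 = [(4, "CCAT"), (2, "TTAC"), (7, "TAGG")]

print(call_variants(REFERENCE, cycle3))                      # {4: ['C'], 5: ['G']}
print(call_variants(REFERENCE, cycle3, min_observations=2))  # {}
print(call_variants(REFERENCE, cycle4))                      # {4: ['C'], 9: ['G']}
```

Positions 4 and 5 are the third and fourth positions of TTACAT, which starts at offset 2. After the fourth cycle the G at position 5 is no longer called because TAGG has moved to offset 7, matching the drop-off described above.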
- In some embodiments, the alignment may not be performed at
block 625, and variants may not be called at block 630, every sequencing cycle. For example, alignments may be performed and variants called every nth sequencing cycle, where n is 1, 2, 3, 4, 5, 10, 20, or more. In some embodiments, the frequency of the alignment performed at block 625 and variants called at block 630 may be based on the number of variants called in the previous sequencing cycle. For example, if a large number of variants are called in one sequencing cycle, alignments and variant calling may be performed more frequently (e.g., the next cycle) or less frequently. As another example, if no variant or no new variant has been called in one sequencing cycle, alignments and variant calling may be performed more frequently or less frequently (e.g., not the next cycle). - In some embodiments, the variant calling at
block 630 may be performed selectively for a region of the reference sequence. The portion of the reference sequence being aligned may be different in different implementations. For example, variant calling may be performed selectively for a region of the reference sequence where the alignment of the sequence reads to the reference sequence has changed during a previous sequencing cycle (e.g., the immediate prior sequencing cycle). As another example, the region of the reference sequence being aligned may be determined based on known single nucleotide polymorphism (SNP) locations. - In some embodiments, the
method 600 for performing a real-time secondary sequence analysis can be based on a tree structure for each read. The root of the tree can be labelled with a “$”, indicating the start of the sequence. The child nodes of the root correspond to the four possible base calls: ‘A’, ‘C’, ‘G’ and ‘T’. Each node in the tree can have three variables associated with it: the total number of differences between the sequence of the current branch leading from the root to that node (referred to as sequence S) and the bases from the current read (referred to as sequence W), and the start and stop indices in the Burrows-Wheeler Transform (BWT) of the reference sequence for all positions in the reference that match the sequence S. An important property of the BWT is that all rows that have a common starting sequence are guaranteed to be consecutive in the transform, and so rather than keeping a list of individual indices into the reference that match the sequence S, it is sufficient to track the start and stop indices. This is valuable in the case of mapping reads to the human reference genome because the genome contains many repetitive regions. - Each child node of the root would then also have four children of its own, also corresponding to the four possible bases ‘A’, ‘C’, ‘G’ and ‘T’. Again, the number of differences with the sequence of the current read, W, can be tracked. For example, if the reads of the first two cycles were ‘C’ and then ‘T’, the read can have a path through the tree defined by Root->C->T. Thus, the total accumulated differences would be zero for the last T node. In contrast, for the path defined by Root->A->G, the total accumulated differences at the G node would be 2, because neither the A nor the G matches the corresponding cycle in the current read.
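The two quantities tracked at each node can be illustrated as follows. This is a sketch and not the disclosed data structure: a plain sorted list of suffixes stands in for a real BWT or FM-index (it is far too slow and memory-hungry for a genome), the stop index is treated as exclusive, and the function names and the reuse of SEQ ID NO: 1 as the reference are assumptions made for illustration.

```python
REFERENCE = "GATTACATAAGATTCTTTCATCG"  # SEQ ID NO: 1, reused here as a toy reference


def path_differences(path, read):
    """Total differences between a root-to-node path S and the read W, cycle by cycle."""
    return sum(s != w for s, w in zip(path, read))


def suffix_interval(reference, s):
    """Return (start, stop) of the consecutive sorted-suffix rows that begin with s.

    Returns None if s does not occur. The stop index is exclusive.
    """
    text = reference + "$"  # "$" sorts before A, C, G, T
    suffixes = sorted(text[i:] for i in range(len(text)))
    rows = [i for i, suffix in enumerate(suffixes) if suffix.startswith(s)]
    if not rows:
        return None
    # Rows sharing a common starting sequence are contiguous, so two indices suffice.
    assert rows == list(range(rows[0], rows[-1] + 1))
    return rows[0], rows[-1] + 1


print(path_differences("CT", "CT"))  # 0, the Root->C->T path for a read of C then T
print(path_differences("AG", "CT"))  # 2, the Root->A->G path for the same read

start, stop = suffix_interval(REFERENCE, "TT")
print(stop - start)                  # 4 occurrences of TT, stored as one interval
```

Storing only `(start, stop)` keeps each node at a fixed size regardless of how many reference positions match S, which is the benefit the text attributes to the BWT for repetitive genomes.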
- In some embodiments, a limit on the number of differences with the reference that is acceptable can be defined. Once that limit is reached, that branch is dead and will no longer be analyzed in subsequent cycles. The BWT, with appropriate indices, can be used to perform the calculations necessary at each node in constant, O(1), time. The amount of memory required for the computation, and the number of nodes in the tree, are influenced by the threshold on the total number of allowable errors. In some embodiments, support for small insertions and deletions can be implemented.
- In some embodiments, more complex rearrangements can be handled through multiple seeds. That is, if a particular read is found to not match anywhere, the process may start again at some later cycle, with the expectation that the other part of the read would map somewhere. All of these reads can be tracked, and a more complex analysis (e.g., a dynamic programming method like the Smith-Waterman algorithm) can be performed when computing resources are available.
- Additional embodiments are systems and methods for secondary analysis that include iterative processing of sequencing reads. Secondary analyses can encompass both alignment of sequence reads to a reference sequence (e.g., the human reference genome sequence) and utilization of this alignment to detect differences between a sample and the reference, such as variant detection and calling. In one implementation, alignment and variant call results can be obtained before the sequencer has finished running. For example, these results may be provided at time intervals dependent on the available computing resources. This can be accomplished by extending intermediate alignment results from a prior iteration with alignment results from the current iteration. The alignment results from the current iteration are generated by comparing the newly sequenced bases of the current iteration with the bases from the reference sequence at the previously aligned position. The results of the comparison are combined with the alignment results from the prior iteration, and the combined output is stored for the next iteration.
-
FIGS. 7A and 7B are schematic illustrations comparing a traditional method of secondary analysis (FIG. 7A) to the secondary analysis of an embodiment of the present disclosure (FIG. 7B). FIG. 7A illustrates that for a traditional method of secondary analysis, the alignment does not proceed until the full set of bases in the read are sequenced. The alignment process can include multiple alignment processing steps. The first alignment processing step waits for the full set of sequenced bases in the read to be available. After the alignment process is complete, the variant caller process, which includes multiple variant caller processing steps, can begin. The first variant caller processing step waits for the full set of alignment data to be available. -
FIG. 7B illustrates an iterative method of secondary analysis according to one embodiment of the present disclosure. As shown, alignment and variant calling run in real time and generate interim results. Processing can be scheduled at fixed intervals. The fixed intervals can include the arrival of a subsequence of N bases, where N is a positive integer, such as 16. For example, processing can occur at intervals of 16 bases. As another example, processing can occur at intervals of 1, 2, 4, 8, 16, 32, 64, 128, 151, or more bases. In one implementation, processing can occur at intervals of any number between 1 and 152, most preferably at intervals of 16 ± 8. In one embodiment, the intervals can change from one iteration to another iteration. A sequencing system, such as the sequencing system 100 in FIG. 1, can generate sequence reads at intervals of 16 bases as illustrated in FIG. 8. Alternatively, the number of bases in each processing interval may be different. For example, the first interval can be processed after 16 bases are sequenced, and the second interval can be processed after 18 bases are sequenced. The number of bases in the iteration may be as low as 1 or as high as the number of bases in the read. - The process described in
FIG. 7B can be applied to the Read 1 set or the Read 2 set when a paired-end sequencing technique is used. Additionally, information captured when processing the Read 1 set may be applied to the Read 2 set. For example, it would be possible to execute the alignment step using conventional methods during or after the Read 1 set is sequenced, and this information can be used to process the Read 2 set as the Read 2 polynucleotides are sequenced. - Referring now to
FIG. 8, multiple reads 804a-804d of single-stranded polynucleotides may be generated from a sequencing instrument. These single-stranded polynucleotides can be 151 bases in length, referred to as base 0 to base 150. The sequences of these single-stranded polynucleotides can be determined with sequencing by synthesis described above. After iteration 0 (the first iteration) of 16 sequencing cycles, 16 bases of sequence reads are determined by a sequencing system. For example, sequence reads of Base 0 to Base 15 are generated for Read 0 (804a) and sequence reads of Base 0 to Base 15 are determined for Read 1 (804b), etc. After iteration 1 (the second iteration) of another 16 sequencing cycles, 16 additional bases of sequences are determined for each read. For example, Base 16 to Base 31 are generated for Read 0 (804a). The sequencing system can continue to generate reads at 16-base intervals until the sequence reads of Base 128 to Base 143 of each cluster are generated at iteration 8. The sequencing system can generate reads of Base 144 to Base 150 of each cluster at iteration 9 (the last iteration). In an alternative embodiment, the number of bases generated at each iteration may be different, with the number of bases per iteration determined by the available computing resources. For example, the first processing interval may consist of 16 bases, while the second processing interval may consist of 18 bases. The smallest number of bases in a processing interval is one, and the largest number of bases in a processing interval is equal to the length of the read. - Referring to
FIG. 7B, alignment can occur at intervals of 16 bases as illustrated. Variant calling can occur at intervals of 16 bases after alignment is complete. For example, a sequencing system for real-time secondary analysis may output 16 bases of sequence reads every 1.3 hours. For real-time secondary analysis, the total time required for performing alignment and variant calling should be within 1.3 hours such that a user can have access to the variant calls before the next 16 bases of sequence reads are available. - In one embodiment, processing can occur continuously as fast as possible on the available computer resources, with no fixed iteration steps. The analysis can self-adjust and will be as close to the sequencing progress as possible. Alignments and variant calling results can be generated on demand at any time.
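For the fixed-interval case, the iteration boundaries described for FIG. 8 can be enumerated directly. The sketch below assumes a 151-base read indexed from base 0 and a constant 16-base interval; the function name `iteration_ranges` is hypothetical, and the variable-interval embodiments are not covered.

```python
def iteration_ranges(read_length=151, interval=16):
    """Yield (iteration, first_base, last_base) using 0-based, inclusive base indices."""
    for iteration, first in enumerate(range(0, read_length, interval)):
        last = min(first + interval, read_length) - 1
        yield iteration, first, last


ranges = list(iteration_ranges())
print(ranges[0])    # (0, 0, 15)
print(ranges[1])    # (1, 16, 31)
print(ranges[8])    # (8, 128, 143)
print(ranges[-1])   # (9, 144, 150), a shorter final iteration of 7 bases
print(len(ranges))  # 10
```

With these assumptions a 151-base read takes ten iterations, nine of 16 bases and a final one of 7 bases.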
-
FIG. 9A is a flowchart of an example method 900 for performing a real-time secondary analysis. The method 900 includes two paths: a low confidence, high computation processing path of a traditional secondary analysis method and a high confidence, low computation processing path according to one embodiment of the present disclosure. The low confidence, high computation processing path and the high confidence, low computation processing path are referred to herein as the blue path and the yellow path, respectively. - The low confidence, high computation processing path can include the sequence alignment of each read to a reference sequence. For this path, all bases from the available iterations of the read are used to align the read to the reference sequence. For example, if
iteration 0 and iteration 1 each consist of 16 bases, then 32 bases will be processed by the aligner. One of a number of conventional alignment techniques may be used for the low confidence, high computation path. Once sequence alignment is complete, the mapping and alignment positions can be stored and scored. After all reads are aligned, variants can be called. - The
method 900 improves upon the traditional method of secondary analysis by adding a high confidence, low computation processing path. At iteration 0, the method 900 waits for a number of sequencing cycles to be complete to generate a number of bases of each read. For example, the method 900 can wait for 16 cycles of sequencing to complete to generate 16 bases of each read. During iteration 0, the 16 bases of each read are analyzed and processed following the low confidence, high computation processing path. The traditional method is referred to herein as the blue path. During iteration 1 and any subsequent iteration, the next 16 bases of each read are analyzed following either the low confidence, high computation processing path or the high confidence, low computation processing path. If the read was aligned with sufficient confidence in the immediate prior iteration, the 16 bases of the current iteration are analyzed following the high confidence, low computation processing path. Otherwise, the 16 bases of the current iteration are analyzed following the low confidence, high computation processing path. - If the read was aligned with sufficient confidence in the immediate prior iteration, the 16 bases of the current iteration are aligned to the next 16 bases of the reference sequence. This alignment is referred to herein as simple alignment, which requires less processing compared to conventional sequence alignment. Instead of sequence alignment to the entire reference sequence, the number of mismatches between the 16 bases of the current iteration and the next 16 bases of the reference sequence can be determined. If the number of mismatches is above a threshold, processing of the 16 bases can return to the low confidence, high computation processing path. An isAligned variable can be set to 0 or false upon returning to the low confidence, high computation processing path. 
The number of mismatches can be determined with respect to the 16 bases of the current iteration or all bases of the current iteration and prior iteration(s).
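The choice between the two paths can be sketched as a per-read state update. This is an illustration under several assumptions, not the method 900 itself: a brute-force ungapped search stands in for the conventional aligner, mismatches are counted over the new bases only, "sufficient confidence" is approximated by a mismatch limit, each read is assumed to fit within the reference, and the names `process_iteration`, `full_align`, and `max_mismatches` are hypothetical. The example uses 4-base iterations on a short reference to keep it readable.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def full_align(read, reference):
    """Stand-in for the conventional aligner: first best ungapped placement by mismatches."""
    offsets = range(len(reference) - len(read) + 1)
    return min(offsets, key=lambda i: hamming(read, reference[i:i + len(read)]))


def process_iteration(state, new_bases, reference, max_mismatches=1):
    """Extend one read by one iteration and report which path handled it."""
    state["read"] += new_bases
    if state["is_aligned"]:
        # Simple alignment: compare only the new bases with the next reference bases.
        start = state["offset"] + len(state["read"]) - len(new_bases)
        expected = reference[start:start + len(new_bases)]
        if len(expected) == len(new_bases) and hamming(new_bases, expected) <= max_mismatches:
            return "simple"  # high confidence, low computation path
        state["is_aligned"] = False  # mirrors setting isAligned to 0 or false
    # Low confidence, high computation path: realign all bases seen so far.
    offset = full_align(state["read"], reference)
    placed = reference[offset:offset + len(state["read"])]
    state["offset"] = offset
    state["is_aligned"] = hamming(state["read"], placed) <= max_mismatches
    return "full"


reference = "GATTACATAAGATTCTTTCATCG"
state = {"read": "", "offset": None, "is_aligned": False}
print(process_iteration(state, "TTAC", reference))  # full   (iteration 0)
print(process_iteration(state, "ATAA", reference))  # simple (next bases match)
print(process_iteration(state, "CCCC", reference))  # full   (too many mismatches)
```

The simple path touches only the new bases and one reference window, so its cost does not grow with the size of the reference, which is the saving the text attributes to simple alignment.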
- If the number of mismatches is below a threshold, processing of the 16 bases can stay in the high confidence, low computation processing path, and the alignment result of the particular read can be stored. Alternative metrics may be formulated to determine whether the isAligned variable is set to 0 or false. For example, if the number of mismatches is below the threshold, the mapping quality (MapQ) score can be calculated. The MapQ score can equal −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. So if the probability of correctly mapping some random read is 0.99, then the MapQ score is 20 (i.e., −10 × log10 of 0.01). If the probability of a correct match increases to 0.999, the MapQ score increases to 30. Conversely, as the probability of a correct match tends towards zero, so does the MapQ score.
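The MapQ arithmetic above can be reproduced in a few lines. This is a minimal sketch of the stated formula only; the function name `mapq` is an assumption, and how the error probability itself is estimated is outside its scope.

```python
import math


def mapq(p_wrong):
    """Mapping quality: -10 * log10(Pr{mapping position is wrong}), rounded to an integer.

    p_wrong must be greater than zero; a probability of exactly zero has no finite score.
    """
    return round(-10 * math.log10(p_wrong))


print(mapq(1 - 0.99))   # 20
print(mapq(1 - 0.999))  # 30
print(mapq(1.0))        # 0, the mapping is certainly wrong
```

Each tenfold reduction in the error probability adds 10 to the score, so a threshold on MapQ is equivalent to a threshold on the probability that the mapping position is wrong.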
- When the processing of the 16 bases stays in the high confidence, low computation processing path, the read can contribute to the pile-up (when multiple reads are aligned to similar locations of the reference sequence such that these reads “pile up” on top of one another on the reference sequence). When the processing of the 16 bases returns to the low confidence, high computation processing path, the read can be removed from the pile-up. In one embodiment, a read is processed in the low confidence, high computation processing path only if the number of candidates, that is, the total number of sequence alignment locations, is lower than a threshold, such as 1000. The result of alignment when a read is processed is stored.
-
FIG. 9B is a conceptual plot of the amount of data processed by the two processing paths using method 900 shown in FIG. 9A. After 16 sequencing cycles, 16 bases of each read are generated by a sequencing system. The reads are all processed in the low confidence, high computation processing path during iteration 0. After 32 sequencing cycles, around 75% of the candidates are considered aligned after iteration 1. These candidates are processed in the high confidence, low computation processing path during iteration 2. After iteration 2, around 90% of the candidates are considered aligned and are processed in the high confidence, low computation processing path during iteration 3. Less computing and processing are required when reads are processed in the high confidence, low computation processing path because only simple alignments are required. Because much of the data is processed in the high confidence, low computation processing path and less processing is required in this path, the total time required is lower than if the reads are only processed in the low confidence, high computation processing path. Thus, alignment and variant call results can be obtained before the sequencer has finished running. These results can be provided to a user at time intervals dependent on the available computing resources. Accordingly, the method 900 can perform secondary analyses in a time-efficient manner to enable real-time secondary analyses. -
FIG. 9C shows the predicted run-time improvement of the aligner described in FIG. 10. The “Base” data is generated using only the “Existing Processing” (conventional or blue path) in FIG. 10. The “Load Read 1” data shows the reduced processing cycles when data from the Read 1 set is aligned, pre-stored, and then utilized to accelerate processing of the data in the Read 2 set. The method 900 can implement one of two types of simple aligners for the high confidence, low computation processing path: a simple aligner that skips exact matches or a simple aligner that skips single mismatches. A simple aligner that skips single mismatches allows zero or one mismatch. The “Skip Exact Matches” data shows the reduced processing cycles when the conventional (blue) path is skipped if the 16 bases of the current iteration exactly match the 16 bases of the reference sequence at the previously determined reference position. The “Skip Single Mismatches” data shows the reduced processing cycles when the conventional (blue) path is skipped if the 16 bases of the current iteration align to the 16 bases of the reference sequence at the previously determined reference position with at most one mismatch. FIG. 9C shows that, compared to the baseline, when the method 900 utilizes the simple aligner that skips conventional processing when at most a single mismatch is detected in the high confidence, low computation processing path, the runtime is reduced by a factor of three. Note that these numbers were generated by a prototype processor that does not include all processing steps and are therefore a projection of expectations.
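A minimal sketch of the two simple aligners described above is given below, assuming an ungapped, base-by-base comparison at the position expected from the previous iteration. The function name `simple_align` and its parameters are hypothetical; setting `max_mismatches=0` corresponds to the “Skip Exact Matches” behavior and `max_mismatches=1` to the “Skip Single Mismatches” behavior.

```python
def simple_align(new_bases, reference, expected_start, max_mismatches=1):
    """Compare newly sequenced bases with the reference at the expected position.

    Returns the mismatch count if the conventional path can be skipped,
    or None if the read should return to the conventional (full) aligner.
    """
    if expected_start < 0:
        return None
    window = reference[expected_start:expected_start + len(new_bases)]
    if len(window) != len(new_bases):
        return None  # the expected position runs past the end of the reference
    mismatches = sum(a != b for a, b in zip(new_bases, window))
    return mismatches if mismatches <= max_mismatches else None


# Usage: the first 16 bases were aligned at position 0 in the previous iteration,
# so the next 16 bases are expected at position 16.
REFERENCE = "ACGTACGTACGTACGT" + "TTGACCAGTCAGGATC"
exact = simple_align("TTGACCAGTCAGGATC", REFERENCE, 16, max_mismatches=0)
single = simple_align("TTGACCAGTCAGGATG", REFERENCE, 16, max_mismatches=1)
print(exact, single)
```

A return value of `None` signals that the cheap comparison was not sufficient, in which case the read would be handed back to the conventional path.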
FIG. 10 is another flowchart of an example method 1000 for performing a real-time secondary analysis. The method 1000 and the method 900 shown in FIG. 9A can implement the same low confidence, high computation processing path and different high confidence, low computation processing paths. The high confidence, low computation processing path of the method 1000 generates the MapQ score after simple alignment and uses the MapQ score to determine whether to continue processing in the high confidence, low computation processing path or to return to the low confidence, high computation processing path.
- A high percentage of the runtime is spent on a small percentage of the reads. In some embodiments, the low confidence, high computation processing path of the
method -
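The MapQ-based routing decision of the method 1000 can be sketched as follows. This is an illustration only: the disclosure does not state how the MapQ score is computed or which cutoff is used, so the sketch assumes the common Phred-scaled convention (MapQ = -10 log10 of the probability that the mapping position is wrong), a hypothetical cap of 60, and a hypothetical cutoff of 30. The names `mapq_from_error_prob` and `next_path` are likewise assumptions.

```python
import math


def mapq_from_error_prob(p_wrong, cap=60):
    """Phred-scaled mapping quality, capped at a maximum value (assumed convention)."""
    if p_wrong <= 0:
        return cap
    return max(0, min(cap, round(-10 * math.log10(p_wrong))))


def next_path(mapq, threshold=30):
    """Stay on the simple path while MapQ is high enough; otherwise return to the full path."""
    return "simple" if mapq >= threshold else "full"


# Usage: a confident mapping stays on the simple path, an ambiguous one does not.
for p_wrong in (0.001, 0.5):
    score = mapq_from_error_prob(p_wrong)
    print(p_wrong, score, next_path(score))
```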
FIGS. 11A and 11B show a simplified flow diagram of an existing variant calling method, the Strelka small variant caller (FIG. 11A), and a variant calling method of the present disclosure (FIG. 11B). FIG. 11A shows that the small variant caller uses the pile-up information generated from the aligner as an input. From the pile-up, the small variant caller identifies regions of sequence variation known as active regions. Next, de novo re-assembly may be applied to the active regions. At each genomic position, probabilities are generated to determine the likelihood that a sequenced polynucleotide at a genomic position is an A, C, T, or G. From these probabilities, a variant may be detected.
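To make the notion of an active region concrete, the following sketch flags reference positions where a pile-up column disagrees with the reference and merges nearby flagged positions into intervals. This is a simplified illustration and is not the procedure used by Strelka; the mismatch-fraction criterion, the `min_fraction` and `pad` parameters, and the function names are all assumptions introduced here.

```python
def mismatch_fraction(column, ref_base):
    """Fraction of bases in one pile-up column that differ from the reference base."""
    if not column:
        return 0.0
    return sum(base != ref_base for base in column) / len(column)


def active_regions(pileup, reference, min_fraction=0.2, pad=1):
    """Return merged half-open intervals (start, end) around positions with enough mismatches."""
    regions = []
    for pos, column in enumerate(pileup):
        if mismatch_fraction(column, reference[pos]) < min_fraction:
            continue
        start = max(0, pos - pad)
        end = min(len(reference), pos + pad + 1)
        if regions and start <= regions[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# Usage: one pile-up column (string of observed bases) per reference position.
REFERENCE = "ACGTAC"
PILEUP = ["AAAA", "CCCC", "GGTT", "TTTT", "AAAA", "CCCA"]
print(active_regions(PILEUP, REFERENCE, pad=0))
print(active_regions(PILEUP, REFERENCE, pad=1))
```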
FIG. 11B shows an embodiment of the variant caller as disclosed in the current invention. In this embodiment, a metric is generated to determine if a polynucleotide at a genomic position can be determined with high confidence. For example, a high confidence decision could be generated if all polynucleotides at a given genomic position are the same. Alternatively, a high confidence decision could be generated if the number of polynucleotides of the same type at a genomic position is higher than a threshold. Alternative metrics for determining high confidence can also be implemented. If the polynucleotide can be determined with high confidence, then the formulation of the probabilities may be skipped and a simple variant calling step may be executed. For example, a simple variant caller may call any variant that is detected with high confidence.
- The generation of probabilities step and the variant calling step of an existing variant calling method can together require up to 40% of the computing and processing of the variant caller.
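The two example high confidence metrics described for FIG. 11B (all bases at a position agree, or the count of one base type exceeds a threshold) can be sketched as a per-position dispatch. The function names, the dictionary returned, and the `min_count` parameter are hypothetical, and `placeholder_full_caller` merely stands in for the probability-based path of an existing variant caller, which is not reproduced here.

```python
from collections import Counter


def placeholder_full_caller(column, ref_base):
    """Stand-in for the probability-based (low confidence, high computation) path."""
    return {"base": None, "is_variant": None, "path": "full"}


def call_position(column, ref_base, full_caller=placeholder_full_caller, min_count=None):
    """Call one genomic position, using the simple path when the column is high confidence."""
    if not column:
        return full_caller(column, ref_base)
    counts = Counter(column)
    base, count = counts.most_common(1)[0]
    unanimous = len(counts) == 1
    over_threshold = min_count is not None and count > min_count
    if unanimous or over_threshold:
        # High confidence: skip probability generation and call directly.
        return {"base": base, "is_variant": base != ref_base, "path": "simple"}
    return full_caller(column, ref_base)


# Usage: a unanimous column that differs from the reference is called on the simple path.
print(call_position("TTTT", "C"))
print(call_position("CCCT", "C"))
print(call_position("CCCT", "C", min_count=2))
```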
FIG. 11B shows a variant calling method 1100 that implements both the low confidence, high computation processing path of an existing variant calling method and a high confidence, low computation processing path. By adding the high confidence, low computation processing path, the Strelka variant caller was optimized and processing was reduced by nearly 40%. A high confidence, low computation processing path can be added to alternative variant callers.
- As shown in
FIG. 7B, a variant caller may be executed within the iterative processing window. The variant caller of FIG. 11A or FIG. 11B may be executed iteratively within the iterative processing window. Additionally, more than one type of variant caller may be executed within the iterative processing window. For example, a small variant caller, such as Strelka, and alternative variant callers, such as structural variant callers or copy number variant callers, may be executed within the iterative processing window.
- In at least some of the previously described embodiments, one or more elements used in an embodiment can interchangeably be used in another embodiment unless such a replacement is not technically feasible. It will be appreciated by those skilled in the art that various other omissions, additions and modifications may be made to the methods and structures described above without departing from the scope of the claimed subject matter. All such modifications and changes are intended to fall within the scope of the subject matter, as defined by the appended claims.
- With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.
- It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc.). It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases “at least one” and “one or more” to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim recitation to embodiments containing only one such recitation, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an” (e.g., “a” and/or “an” should be interpreted to mean “at least one” or “one or more”); the same holds true for the use of definite articles used to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should be interpreted to mean at least the recited number (e.g., the bare recitation of “two recitations,” without other modifiers, means at least two recitations, or two or more recitations). 
Furthermore, in those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). In those instances where a convention analogous to “at least one of A, B, or C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, or C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
- In addition, where features or aspects of the disclosure are described in terms of Markush groups, those skilled in the art will recognize that the disclosure is also thereby described in terms of any individual member or subgroup of members of the Markush group.
- As will be understood by one skilled in the art, for any and all purposes, such as in terms of providing a written description, all ranges disclosed herein also encompass any and all possible sub-ranges and combinations of sub-ranges thereof. Any listed range can be easily recognized as sufficiently describing and enabling the same range being broken down into at least equal halves, thirds, quarters, fifths, tenths, etc. As a non-limiting example, each range discussed herein can be readily broken down into a lower third, middle third and upper third, etc. As will also be understood by one skilled in the art all language such as “up to,” “at least,” “greater than,” “less than,” and the like include the number recited and refer to ranges which can be subsequently broken down into sub-ranges as discussed above. Finally, as will be understood by one skilled in the art, a range includes each individual member. Thus, for example, a group having 1-3 articles refers to groups having 1, 2, or 3 articles. Similarly, a group having 1-5 articles refers to groups having 1, 2, 3, 4, or 5 articles, and so forth.
- While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
Claims (30)
1.-27. (canceled)
28. A system for sequencing polynucleotides, comprising:
a sequencing apparatus configured to determine the nucleotide sequence of a polynucleotide; and
a processor configured to control the sequencing apparatus and to execute instructions that perform a method comprising:
receiving a first nucleotide subsequence of the polynucleotide;
determining whether the first nucleotide subsequence aligns to a reference sequence at a first plurality of candidate locations beyond a threshold confidence level using a first process;
receiving a second nucleotide subsequence of the polynucleotide from the sequencing apparatus, wherein the second nucleotide subsequence comprises the first nucleotide subsequence plus one or more additional nucleotides;
comparing the one or more additional nucleotides in the second nucleotide subsequence to the reference sequence based in part on the first plurality of candidate locations, if the first nucleotide subsequence is aligned to the reference sequence beyond the threshold confidence level, or
repeating the first process by aligning the entire second nucleotide subsequence to the reference sequence if the first nucleotide subsequence is not aligned to the reference sequence beyond the threshold confidence level.
29. The system of claim 28, wherein the threshold confidence level depends on a number of mismatches or a probability of a correct match.
30. The system of claim 28, wherein the first nucleotide subsequence is one or more nucleotides in length.
31. The system of claim 28, wherein the second nucleotide subsequence is one or more nucleotides in length.
32. The system of claim 28, wherein comparing the one or more additional nucleotides in the second nucleotide subsequence to the reference sequence comprises a simple alignment process, the simple alignment process being more computationally efficient than the first process in memory usage or the number of computation operations.
33. The system of claim 32, wherein the processor is further configured to determine a simple alignment score based on the simple alignment process.
34. The system of claim 28, wherein the processor is further configured to store data corresponding to at least one of the first plurality of candidate locations if the first nucleotide subsequence is aligned to the reference sequence.
35. The system of claim 28, wherein the processor is further configured to store data corresponding to at least one of a second plurality of candidate locations resulting from comparing the second nucleotide subsequence to the reference sequence.
36. The system of claim 28, wherein comparing the one or more additional nucleotides in the second nucleotide subsequence to the reference sequence comprises comparing the second nucleotide subsequence with corresponding sequences of the second nucleotide subsequence on the reference sequence based on the first plurality of candidate locations.
37. The system of claim 36, wherein the processor is further configured to determine a mapping quality (MapQ) score for each of the second plurality of candidate locations.
38. The system of claim 28, wherein determining whether the first nucleotide subsequence aligns to the reference sequence is initiated before the sequencing reactions are completed.
39. The system of claim 28, wherein the processor is further configured to perform variant calling for the first nucleotide subsequence or the second nucleotide subsequence.
40. The system of claim 39, wherein performing the variant calling comprises:
performing variant calling using a first variant calling process or a second, simple, variant calling process, wherein the second variant calling process is more computationally efficient than the first variant calling process in variant calling of the second nucleotide subsequence.
41. The system of claim 39, wherein the variant calling is performed using the output of the first process or the process used to compare the one or more additional nucleotides in the second nucleotide subsequence to the reference sequence, based on a variant calling metric.
42. The system of claim 41, wherein the variant calling metric is determined based on a number of different base types called at a position of the reference sequence.
43. The system of claim 28, wherein comparing the one or more additional nucleotides in the second nucleotide subsequence to the reference sequence is initiated before the sequencing reactions are completed.
44. The system of claim 28, wherein the sequencing apparatus implements sequencing-by-synthesis.
45. A computer-implemented method for efficient sequencing of polynucleotides, comprising:
receiving a first nucleotide subsequence of a read from a sequencing apparatus during a sequencing run of the first nucleotide subsequence;
performing a secondary analysis of the first nucleotide subsequence of the read based on a reference sequence using a first process or a second process, wherein the first nucleotide subsequence comprises one or more additional nucleotides compared to a previous iteration, wherein the second process is more computationally efficient than the first process in performing the secondary analysis, wherein the first process aligns the entire first nucleotide subsequence to the reference sequence, wherein the second process aligns the one or more additional nucleotides to the reference sequence based in part on results from the previous iteration, and wherein the secondary analysis comprises:
comparing the first nucleotide subsequence to the reference sequence to determine a first subsequence of the reference sequence that has a high degree of similarity to the first nucleotide subsequence; and
determining if the sequencing apparatus should generate additional nucleotide reads.
46. The method of claim 45, wherein performing the secondary analysis comprises processing the first nucleotide subsequence to determine a first plurality of candidate locations of the read that align to the reference sequence using:
the first process if the read is not aligned to the reference sequence in the previous iteration,
the second process if otherwise,
wherein the second process is more computationally efficient than the first process to determine the first plurality of candidate locations of the read.
47. The method of claim 46, wherein performing the secondary analysis of the first nucleotide subsequence using the second process comprises performing a simple alignment to determine a simple alignment score.
48. The method of claim 46, wherein results of the secondary analysis comprises output of the first process, or output of the second process.
49. The method of claim 45, wherein performing the secondary analysis comprises performing variant calling of the first nucleotide subsequence, comprising:
performing variant calling on the output of the first process or the second process using a first variant calling process or a second variant calling process, wherein the second variant calling process is more computationally efficient than the first variant calling process in variant calling of the first nucleotide subsequence.
50. The method of claim 49, wherein results of the secondary analysis comprises output of the first variant calling process, output of the second variant calling process.
51. The method of claim 45, further comprising providing a user with results of the secondary analysis during the sequencing run.
52. The method of claim 51, wherein the results of the secondary analysis are provided to the user at fixed intervals.
53. The method of claim 51, wherein the results of the secondary analysis are provided to the user at request of the user.
54. The method of claim 45, wherein performing the secondary analysis is based on whether the first nucleotide subsequence aligns to the reference sequence beyond a threshold confidence in the previous iteration.
55. A computer readable recording medium having recorded a program for implementing in a computer the functions of a system according to claim 28.
56. A computer readable recording medium having recorded a program that causes a computer to execute a method according to claim 45.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/300,343 US20230410945A1 (en) | 2016-10-07 | 2023-04-13 | System and method for secondary analysis of nucleotide sequencing data |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662405824P | 2016-10-07 | 2016-10-07 | |
PCT/US2017/055653 WO2018068014A1 (en) | 2016-10-07 | 2017-10-06 | System and method for secondary analysis of nucleotide sequencing data |
US201816311141A | 2018-12-18 | 2018-12-18 | |
US18/300,343 US20230410945A1 (en) | 2016-10-07 | 2023-04-13 | System and method for secondary analysis of nucleotide sequencing data |
Related Parent Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2017/055653 Continuation WO2018068014A1 (en) | 2016-10-07 | 2017-10-06 | System and method for secondary analysis of nucleotide sequencing data |
US16/311,141 Continuation US11646102B2 (en) | 2016-10-07 | 2017-10-06 | System and method for secondary analysis of nucleotide sequencing data |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230410945A1 true US20230410945A1 (en) | 2023-12-21 |
Family
ID=60480359
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/311,141 Active 2040-06-14 US11646102B2 (en) | 2016-10-07 | 2017-10-06 | System and method for secondary analysis of nucleotide sequencing data |
US18/300,343 Pending US20230410945A1 (en) | 2016-10-07 | 2023-04-13 | System and method for secondary analysis of nucleotide sequencing data |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/311,141 Active 2040-06-14 US11646102B2 (en) | 2016-10-07 | 2017-10-06 | System and method for secondary analysis of nucleotide sequencing data |
Country Status (15)
Country | Link |
---|---|
US (2) | US11646102B2 (en) |
EP (1) | EP3458993A1 (en) |
JP (3) | JP6898441B2 (en) |
KR (3) | KR102384832B1 (en) |
CN (2) | CN115810396A (en) |
AU (3) | AU2017341069A1 (en) |
BR (2) | BR122023004154A2 (en) |
CA (1) | CA3027179C (en) |
IL (2) | IL300135B2 (en) |
MX (2) | MX2018015412A (en) |
MY (1) | MY193917A (en) |
RU (1) | RU2741807C2 (en) |
SG (2) | SG11201810924WA (en) |
WO (1) | WO2018068014A1 (en) |
ZA (2) | ZA201808277B (en) |
-
2017
- 2017-10-06 KR KR1020187038172A patent/KR102384832B1/en active IP Right Grant
- 2017-10-06 SG SG11201810924WA patent/SG11201810924WA/en unknown
- 2017-10-06 CA CA3027179A patent/CA3027179C/en active Active
- 2017-10-06 IL IL300135A patent/IL300135B2/en unknown
- 2017-10-06 EP EP17804976.3A patent/EP3458993A1/en active Pending
- 2017-10-06 US US16/311,141 patent/US11646102B2/en active Active
- 2017-10-06 SG SG10201911912XA patent/SG10201911912XA/en unknown
- 2017-10-06 KR KR1020227011278A patent/KR102515638B1/en active IP Right Grant
- 2017-10-06 WO PCT/US2017/055653 patent/WO2018068014A1/en active Application Filing
- 2017-10-06 MX MX2018015412A patent/MX2018015412A/en unknown
- 2017-10-06 JP JP2019519631A patent/JP6898441B2/en active Active
- 2017-10-06 KR KR1020237010257A patent/KR20230044335A/en not_active Application Discontinuation
- 2017-10-06 CN CN202211557451.6A patent/CN115810396A/en active Pending
- 2017-10-06 CN CN201780040788.0A patent/CN109416927B/en active Active
- 2017-10-06 AU AU2017341069A patent/AU2017341069A1/en not_active Abandoned
- 2017-10-06 MY MYPI2018002632A patent/MY193917A/en unknown
- 2017-10-06 BR BR122023004154-2A patent/BR122023004154A2/en unknown
- 2017-10-06 IL IL263512A patent/IL263512B2/en unknown
- 2017-10-06 RU RU2018143972A patent/RU2741807C2/en active
- 2017-10-06 BR BR112018076983A patent/BR112018076983A8/en active Search and Examination
-
2018
- 2018-12-07 ZA ZA2018/08277A patent/ZA201808277B/en unknown
- 2018-12-11 MX MX2022011757A patent/MX2022011757A/en unknown
-
2020
- 2020-05-27 JP JP2020091991A patent/JP7051937B2/en active Active
- 2020-07-22 AU AU2020207826A patent/AU2020207826B2/en active Active
-
2021
- 2021-03-15 ZA ZA2021/01720A patent/ZA202101720B/en unknown
- 2021-12-01 AU AU2021277671A patent/AU2021277671B2/en active Active
-
2022
- 2022-02-22 JP JP2022025557A patent/JP7387777B2/en active Active
-
2023
- 2023-04-13 US US18/300,343 patent/US20230410945A1/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
AS | Assignment |
Owner name: ILLUMINA, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GARCIA, FRANCISCO JOSE;RACZY, COME;DAY, AARON;AND OTHERS;SIGNING DATES FROM 20181127 TO 20181203;REEL/FRAME:064998/0088 |