US20230410946A1 - Systems and methods for sequence data alignment quality assessment - Google Patents
Systems and methods for sequence data alignment quality assessment
- Publication number
- US20230410946A1 (application US 18/338,488)
- Authority
- US
- United States
- Prior art keywords
- read
- paired
- alignment
- potential
- reads
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Definitions
- Sequence assembly can generally be divided into two broad categories: de novo assembly and reference genome mapping assembly.
- In de novo assembly, sequence reads are assembled together so that they form a new and previously unknown sequence.
- In reference genome mapping, sequence reads are assembled against an existing backbone sequence (e.g., reference sequence, etc.) to build a sequence that is similar but not necessarily identical to the backbone sequence.
- a computer-implemented method for classifying alignments of paired nucleic acid sequence reads is disclosed.
- a plurality of paired nucleic acid sequence reads is received, wherein each read is comprised of a first tag and a second tag separated by an insert region.
- Potential alignments for the first and second tags of each paired nucleic acid sequence read to a reference sequence are determined, wherein the potential alignments satisfy a minimum threshold mismatch constraint.
- Potential paired alignments of the first and second tags of each read are identified, wherein a distance between the first and second tags of each potential paired alignment is within an estimated insert size range.
- An alignment score is calculated for each potential paired alignment based on a distance between the first and second tags and a total number of mismatches for each tag.
- a computer-implemented method for determining possible alignments for sequencing reads is disclosed.
- a sample can be interrogated to produce a plurality of read sequences from the sample. Alignments are performed for the read sequences from the sequencer. A quality value for each alignment is determined. Each alignment with its associated quality value is outputted.
- FIG. 3 is an exemplary flowchart showing a method for classifying alignment quality of paired reads, in accordance with various embodiments.
- FIG. 4 is a depiction of how PQV can be calculated for gapped alignments, in accordance with various embodiments.
- a “system” denotes a set of components, real or abstract, comprising a whole where each component interacts with or is related to at least one other component within the whole.
- a “biomolecule” is any molecule that is produced by a biological organism, including large polymeric molecules such as proteins, polysaccharides, lipids, and nucleic acids as well as small molecules such as primary metabolites, secondary metabolites, and other natural products.
- next generation sequencing refers to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary electrophoresis-based approaches, for example with the ability to generate hundreds of thousands of relatively small sequence reads at a time.
- next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization. More specifically, the SOLiD Sequencing System of Life Technologies Corp. provides massively parallel sequencing with enhanced accuracy. The SOLiD System and associated workflows, protocols, chemistries, etc. are described in more detail in PCT Publication No. WO 2006/084132, entitled “Reagents, Methods, and Libraries for Bead-Based Sequencing,” international filing date Feb.
- sequencing run refers to any step or portion of a sequencing experiment performed to determine some information relating to at least one biomolecule (e.g., nucleic acid molecule).
- DNA deoxyribonucleic acid
- A adenine
- T thymine
- C cytosine
- G guanine
- RNA ribonucleic acid
- adenine (A) pairs with thymine (T); in the case of RNA, however, adenine (A) pairs with uracil (U)
- cytosine (C) pairs with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides that are complementary to those in the first strand, the two strands bind to form a double strand.
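The pairing rules above can be sketched as a simple lookup table. This is a minimal illustration only; the function name `complement` and its `rna` flag are hypothetical and do not come from the disclosure.

```python
# Base pairing: A-T (A-U in RNA) and C-G.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_PAIRS = {"A": "U", "U": "A", "C": "G", "G": "C"}


def complement(strand: str, rna: bool = False) -> str:
    """Return the base-by-base complement of a strand (hypothetical helper)."""
    pairs = RNA_PAIRS if rna else DNA_PAIRS
    return "".join(pairs[base] for base in strand.upper())
```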
- nucleic acid sequencing data denotes any information or data that is indicative of the order of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine/uracil) in a molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA.
- color space refers to a nucleic acid sequence data schema where nucleic acid sequence information is represented by a set of colors (e.g., color calls, color signals, etc.) each carrying details about the identity and/or positional sequence of bases that comprise the nucleic acid sequence.
- the nucleic acid sequence “ATCGA” can be represented in color space by various combinations of colors that are measured as the nucleic acid sequence is interrogated using optical detection-based (e.g., dye-based, etc.) sequencing techniques such as those employed by the SOLiD System.
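As an illustration of a color space representation, the sketch below assumes the di-base encoding commonly described for the SOLiD System, in which each color (0 to 3) encodes the transition between two adjacent bases and the first base is kept as an anchor. The passage above does not specify the encoding, so the mapping and the function name `to_color_space` should be read as assumptions.

```python
# Assumed 2-bit base codes; under this choice the color of an adjacent
# base pair is the XOR of the two codes (e.g. AA -> 0, AC -> 1, AG -> 2, AT -> 3).
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_color_space(bases: str) -> str:
    """Encode a non-empty base-space sequence as a leading base plus one color per adjacent pair."""
    bases = bases.upper()
    colors = [str(_CODE[a] ^ _CODE[b]) for a, b in zip(bases, bases[1:])]
    return bases[0] + "".join(colors)
```

Under this assumed encoding, `to_color_space("ATCGA")` gives `A3232`.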
- a “polynucleotide”, “nucleic acid”, or “oligonucleotide” refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleosidic linkages.
- a polynucleotide comprises at least three nucleosides.
- oligonucleotides range in size from a few monomeric units, e.g. 3-4, to several hundreds of monomeric units.
- computer system 100 can further include a read only memory (ROM) 108 or other static storage device coupled to bus 102 for storing static information and instructions for processor 104 .
- ROM read only memory
- a storage device 110 such as a magnetic disk or optical disk, can be provided and coupled to bus 102 for storing information and instructions.
- non-volatile media can include, but are not limited to, optical or magnetic disks, such as storage device 110 .
- volatile media can include, but are not limited to, dynamic memory, such as memory 106 .
- transmission media can include, but are not limited to, coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 102 .
- Computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.
- Various forms of computer readable media can be involved in carrying one or more sequences of one or more instructions to processor 104 for execution.
- the instructions can initially be carried on the magnetic disk of a remote computer.
- the remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
- a modem local to computer system 100 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
- An infra-red detector coupled to bus 102 can receive the data carried in the infra-red signal and place the data on bus 102 .
- Bus 102 can carry the data to memory 106 , from which processor 104 retrieves and executes the instructions.
- the instructions received by memory 106 may optionally be stored on storage device 110 either before or after execution by processor 104 .
- Nucleic acid sequence data can be generated using various techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, etc.
- nucleic acid sequencing platforms can include components as displayed in the block diagram of FIG. 2 .
- sequencing instrument 200 can include a fluidic delivery and control unit 202 , a sample processing unit 204 , a signal detection unit 206 , and a data acquisition, analysis and control unit 208 .
- instrumentation, reagents, libraries and methods used for next generation sequencing are described in U.S. Patent Application Publication No. US20090062129 (ASN 11/737308) and U.S. Patent Application Publication No. US20080003571 (ASN 11/345,979) to McKernan, et al., which applications are incorporated herein by reference.
- Various embodiments of instrument 200 can provide for automated sequencing that can be used to gather sequence information from a plurality of sequences in parallel, i.e., substantially simultaneously.
- the fluidics delivery and control unit 202 can include a reagent delivery system.
- the reagent delivery system can include a reagent reservoir for the storage of various reagents.
- the reagents can include RNA-based primers, forward/reverse DNA primers, oligonucleotide mixtures for ligation sequencing, nucleotide mixtures for sequencing-by-synthesis, optional ECC oligonucleotide mixtures, buffers, wash reagents, blocking reagent, stripping reagents, and the like.
- the reagent delivery system can include a pipetting system or a continuous flow system which connects the sample processing unit with the reagent reservoir.
- the sample processing unit 204 can include a sample chamber, such as a flow cell, a substrate, a micro-array, a multi-well tray, or the like.
- the sample processing unit 204 can include multiple lanes, multiple channels, multiple wells, or other means of processing multiple sample sets substantially simultaneously.
- the sample processing unit can include multiple sample chambers to enable processing of multiple runs simultaneously.
- the system can perform signal detection on one sample chamber while substantially simultaneously processing another sample chamber.
- the sample processing unit can include an automation system for moving or manipulating the sample chamber.
- the signal detection unit 206 can include an imaging or detection sensor.
- the imaging or detection sensor can include a CCD, a CMOS, an ion sensor, such as an ion sensitive layer overlying a CMOS, a current detector, or the like.
- the signal detection unit 206 can include an excitation system to cause a probe, such as a fluorescent dye, to emit a signal.
- the excitation system can include an illumination source, such as an arc lamp, a laser, a light emitting diode (LED), or the like.
- the signal detection unit 206 can include optics for the transmission of light from an illumination source to the sample or from the sample to the imaging or detection sensor.
- the signal detection unit 206 may not include an illumination source, such as for example, when a signal is produced spontaneously as a result of a sequencing reaction.
- a signal can be produced by the interaction of a released moiety, such as a released ion interacting with an ion sensitive layer, or a pyrophosphate reacting with an enzyme or other catalyst to produce a chemiluminescent signal.
- changes in an electrical current can be detected as a nucleic acid passes through a nanopore without the need for an illumination source.
- data acquisition analysis and control unit 208 can monitor various system parameters.
- the system parameters can include temperature of various portions of instrument 200 , such as sample processing unit or reagent reservoirs, volumes of various reagents, the status of various system subcomponents, such as a manipulator, a stepper motor, a pump, or the like, or any combination thereof.
- instrument 200 can be used to practice a variety of sequencing methods including ligation-based methods, sequencing by synthesis, single molecule methods, nanopore sequencing, and other sequencing techniques.
- Ligation sequencing can include single ligation techniques, or chain ligation techniques where multiple ligations are performed in sequence on a single primer. Sequencing by synthesis can include the incorporation of dye labeled nucleotides, chain termination, ion/proton sequencing, pyrophosphate sequencing, or the like.
- Single molecule techniques can include continuous sequencing, where the identity of the nucleotide is determined during incorporation without the need to pause or delay the sequencing reaction, or staggered sequencing, where the sequencing reaction is paused to determine the identity of the incorporated nucleotide.
- the sequencing instrument 200 can determine the sequence of a nucleic acid, such as a polynucleotide or an oligonucleotide.
- the nucleic acid can include DNA or RNA, and can be single stranded, such as ssDNA and RNA, or double stranded, such as dsDNA or an RNA/cDNA pair.
- the nucleic acid can include or be derived from a fragment library, a mate pair library, a ChIP fragment, or the like.
- the sequencing instrument 200 can obtain the sequence information from a single nucleic acid molecule or from a group of substantially identical nucleic acid molecules.
- sequencing instrument 200 can output nucleic acid sequencing read data in a variety of different output data file types/formats, including, but not limited to: *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv.
- FIG. 3 is an exemplary flowchart showing a method for classifying alignments of paired nucleic acid sequence reads, in accordance with various embodiments.
- the sequence read alignment classification scores can be a factor in determining the pairing quality value (PQV).
- method 300 begins with step 302 where a plurality of paired nucleic acid sequence reads is received.
- Each paired nucleic acid sequence read is comprised of a first tag (e.g., F3/R3 read) and a second tag (e.g., F3/R3 read) separated by an insert region.
- the paired nucleic acid sequence reads are mate-pair reads.
- the paired nucleic acid sequence reads are paired-end reads.
- the paired nucleic acid sequence reads are a combination of mate-pair and paired-end reads.
- In step 304, the potential alignments for the first and second tags of each paired nucleic acid sequence read to a reference sequence are determined, wherein all the potential alignments satisfy a minimum threshold mismatch constraint. That is, each read tag that is aligned to the reference sequence cannot exceed a certain number of mismatches (i.e., the minimum threshold mismatch constraint).
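A minimal sketch of the mismatch constraint, assuming ungapped base-space alignment by a naive scan of the reference. A practical aligner would use an index; the names `TagAlignment`, `candidate_alignments`, and `max_mismatches` are hypothetical and are not taken from the disclosure.

```python
from typing import NamedTuple


class TagAlignment(NamedTuple):
    position: int    # 0-based start of the tag on the reference
    mismatches: int  # number of mismatching bases at that position


def candidate_alignments(tag: str, reference: str, max_mismatches: int) -> list[TagAlignment]:
    """Keep every ungapped placement of the tag with at most max_mismatches mismatches."""
    hits = []
    for pos in range(len(reference) - len(tag) + 1):
        window = reference[pos:pos + len(tag)]
        mismatches = sum(1 for a, b in zip(tag, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(TagAlignment(pos, mismatches))
    return hits
```

Each tag of a pair would be passed through such a filter separately, and the surviving placements become the inputs to the pairing step.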
- In step 306, potential paired alignments of the first and second tags of each paired nucleic acid sequence read are identified, wherein a distance between the first and second tags of each potential paired alignment is within an estimated insert size range.
- the estimated insert size range can be determined by: 1. mapping all the tags to a reference sequence, 2. determining a distribution of pairing distance for all uniquely mapped pairs of tags, and 3. calculating a mean and standard deviation value from the pairing distance distribution data to estimate a range of insert size (e.g., range values that cover 95% of the distributed distances of the observed pairs, range values derived from a certain number of standard deviations from the mean, etc.).
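The estimation and pairing steps can be sketched as follows, assuming the pairing distances of uniquely mapped pairs are already available as a list of numbers. The default of two standard deviations (roughly 95% coverage if the distances are approximately normal) and the names `estimate_insert_range` and `pairs_in_range` are illustrative assumptions.

```python
from statistics import mean, stdev


def estimate_insert_range(unique_pair_distances, n_sd=2.0):
    """Estimate (low, high) insert size bounds as mean +/- n_sd standard deviations."""
    mu = mean(unique_pair_distances)
    sd = stdev(unique_pair_distances)  # sample standard deviation; needs >= 2 values
    return mu - n_sd * sd, mu + n_sd * sd


def pairs_in_range(first_positions, second_positions, insert_range):
    """Return (first, second, distance) for each tag pairing whose distance is within range."""
    low, high = insert_range
    return [
        (p1, p2, abs(p2 - p1))
        for p1 in first_positions
        for p2 in second_positions
        if low <= abs(p2 - p1) <= high
    ]
```

For example, distances of 900, 1000, and 1100 give a mean of 1000 and a sample standard deviation of 100, so the two-standard-deviation range is 800 to 1200.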
- an alignment score is calculated for each potential paired alignment based on the distance between the first and second tags and a total number of mismatches for each tag.
- the alignment score calculation is also a function of read alignment length (i.e., read length of the tags).
- the alignment score calculation is also a function of the total number of possible alignments for each paired nucleic acid sequence read.
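The passages above name the inputs to the alignment score (tag distance, mismatches per tag, read length, and the number of possible alignments) but do not give a formula. The sketch below is therefore only one assumed way to combine them: a Gaussian insert size term, a fixed per-base error rate for mismatches, and normalization over all candidate pairings followed by phred scaling. The model, the default `error_rate`, the `max_pqv` cap, and all function names are hypothetical.

```python
import math


def pair_likelihood(distance, mismatches_first, mismatches_second,
                    read_length, insert_mean, insert_sd, error_rate=0.01):
    """Unnormalized likelihood of one paired alignment under an assumed model."""
    # Insert size term: Gaussian density of the observed tag-to-tag distance.
    z = (distance - insert_mean) / insert_sd
    insert_term = math.exp(-0.5 * z * z) / (insert_sd * math.sqrt(2 * math.pi))

    # Mismatch term: each tag has `mm` erroneous bases and the rest correct.
    def tag_term(mm):
        return (error_rate ** mm) * ((1 - error_rate) ** (read_length - mm))

    return insert_term * tag_term(mismatches_first) * tag_term(mismatches_second)


def pairing_quality_values(likelihoods, max_pqv=100.0):
    """Phred-scaled posterior of each candidate, normalized over all candidates."""
    total = sum(likelihoods)
    if total == 0:
        return [0.0] * len(likelihoods)
    pqvs = []
    for lk in likelihoods:
        p_wrong = 1.0 - lk / total
        pqvs.append(max_pqv if p_wrong <= 0 else min(max_pqv, -10 * math.log10(p_wrong)))
    return pqvs
```

In this sketch, a read pair with a single candidate pairing receives the capped maximum value, while each additional plausible candidate lowers the value of the others through the shared normalization.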
- the method 300 can be performed using color space nucleic acid sequence data. In various embodiments, the method 300 can be performed using base space nucleic acid sequence data. It should be understood, however, that the method 300 disclosed herein can be performed using any schema or format of nucleic acid sequence information as long as the schema or format can convey the base identity and position.
- the system and methods of the present teachings may introduce a Bayesian inference based statistical approach to calculating mapping quality values for different library types such as single fragment and paired reads (e.g., mate-pair, paired-end reads, etc.).
- These approaches can make use of mate-pair/paired-end read information including insert size distribution between the read pairs (e.g., pairs of tags), read orientation, strand ID annotations, gene ID annotations, etc.
- non-uniform prior probabilities for different alignment types and alignments that correspond to inversions e.g., mate-pair reads mapping to opposite strands, etc.
- gapped alignments e.g., insertion/deletion within a read
- mate-pair/paired-end reads capable of being mapped to exons from the same gene can be assigned a uniform prior probability regardless of the genomic distance between the exons.
- mate-pair/paired-end reads that map to exons from different genes (corresponding to gene fusions) can be assigned a lower prior probability.
- sequence analytics tools and applications, such as SOLiD LIFESCOPE genetic analysis software (Life Technologies Corporation; Carlsbad, CA), can be used for mapping and variant detection using sequencing reads such as those obtained from an NGS sequencing instrument.
- the accuracy and predictive value of the mapping/pairing quality score computed using these methods can be demonstrated using both simulated datasets (for example from a human reference chromosome 0) and actual genome datasets (for example from a HuRef sample generated using an NGS instrument). Evaluating the resulting mapping quality values and comparing them to phred-scale values for probability of misalignment demonstrates that the methods of the present teachings provide more accurate mapping quality when compared against conventional approaches and may be better suited to represent phred-scale alignment probability for a multiplicity of different library types.
- mapping quality methods described herein demonstrate highly accurate and comprehensive functionality in terms of computing quality of different alignment types including gapped alignments and whole-transcriptomes.
- the predictive value of a mapping quality value can improve the efficiency of generating variant calls and gene fusion calls made using various tools and sequencing analytics software (such as the SOLiD LIFESCOPE sequence analysis toolset). Together with the base quality values of individual bases in a read, mapping quality values can be used to improve the efficiency of rare-allele detection in cancer genomics research.
- PQV: mapping/pairing quality value
- the PQV can be generally associated with a phred-scaled quantitative measure of the confidence of aligning a read to the correct location in the reference genome.
- the PQV may further be represented as the negative log odds of misaligning a read ($-10 \log_{10}[\text{prob of error}]$).
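As a minimal sketch of the phred-scaled relationship stated above, the following Python function converts a misalignment probability into a quality value. The function name `phred_quality` and the `cap` parameter (used to avoid an infinite value when the error probability is zero) are assumptions made for illustration.

```python
import math


def phred_quality(p_error, cap=100.0):
    """Return -10 * log10(p_error), limited to a hypothetical maximum `cap`."""
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability in [0, 1]")
    if p_error == 0.0:
        return cap  # log10(0) is undefined; report the cap instead
    return min(cap, -10.0 * math.log10(p_error))


# A 1-in-1000 chance of misalignment corresponds to a quality value near 30.
print(round(phred_quality(0.001), 2))
```

On this scale, each increase of 10 in the quality value corresponds to a tenfold decrease in the probability of misalignment.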
- the posterior probability of correctly aligning a read pair to a reference sequence can be calculated using (for example) the total alignment length of the mate pair reads, total number of mismatches to reference, complete mate-pair information such as insert size and gene ID annotations (in the case of whole transcriptome).
- the calculated mapping/pairing quality values can further represent the probability of aligning sequenced reads to the reference sequence (e.g., reference genome, etc.).
- the quality of any given alignment for a pair of reads $r_1$, $r_2$ mapped to positions $x_1$ and $x_2$ in the reference sequence can be represented by Equation 1:
- $A(r_1, r_2, x_1, x_2)$ represents the event when reads $r_1$ and $r_2$ are sequenced from locations $x_1$ and $x_2$, respectively, and P(A
- the probability $P(r_1, r_2)$ of observing reads $r_1$ and $r_2$ can then be a function of the complexity of the genome sequenced.
- One exemplary probability determination can be calculated as Equation 3:
- Equation 4:
- r 1 , r 2 ) can result in 1 thereby obscuring the relative quality of the alignment compared to those of other read pairs.
- P(B) can represent the probability of finding an alignment to the reference sequence with M+1 mismatches, where M is the maximum allowed mismatches set in the pairing.ini file, as shown in Equation 6:
- For uniquely paired reads, the posterior probability can be given by Equation 7:
- the PQV may be computed as the negative log odds of misaligning the pair of reads, as shown in Equation 9:
- $\mathrm{PQV} = \dfrac{\mathrm{PQV}}{\mathrm{PQV}_{\max}} \times 100$
- PQV max can reflect an exemplary maximum possible pairing quality value when the pair of reads map uniquely to the reference with zero mismatches.
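The role of an alternative-alignment term such as P(B) can be illustrated with a minimal Python sketch. This is not a reconstruction of the numbered equations; it only shows the general Bayesian normalization under stated assumptions. Each candidate paired alignment is assumed to carry a non-negative weight (prior times likelihood), and `background` is a hypothetical stand-in for the probability mass of an alignment that was not found, for example one with M+1 mismatches.

```python
def posterior_probabilities(weights, background=0.0):
    """Normalize candidate weights into posterior probabilities.

    weights: non-negative prior-times-likelihood value per candidate alignment.
    background: hypothetical mass assigned to alignments that were not found.
    """
    total = sum(weights) + background
    if total <= 0.0:
        raise ValueError("total weight must be positive")
    return [w / total for w in weights]


# Without a background term, a uniquely paired read always gets posterior 1.
print(posterior_probabilities([0.9]))
# With a background term, the posterior reflects the quality of the unique hit.
print(posterior_probabilities([0.9], background=0.1))
```

The design point is that a background term keeps uniquely mapped pairs from all receiving the same maximal posterior, so their relative alignment quality remains visible.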
- a pairing method can be devised to search for gapped alignments (i.e., InDels) when one of the tags (F3/R3/F5-P2) maps to a reference sequence and another tag does not map to the reference sequence within a selected insert-size range.
- the PQV for gapped alignments can be approximately zero.
- an alternative hypothesis can be tested as the probability of finding the partial un-gapped alignments.
- the read with the gapped alignment can be treated as two partial reads on either side of an InDel start point where the half with the greatest length is used as the partial alignment length for an alternate hypothesis.
- Such an approach can be used to help ensure that gapped alignments with an InDel starting point at the middle of the read, with a significant length of alignment on either side of the InDel starting point, will be assigned a higher PQV compared to gapped alignments with an InDel starting point close to either end of the read, as shown in Equation 11:
- the alignment with the highest PQV can be selected as the primary alignment for the read and reported to the *.BAM file.
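The longer-half rule described above can be sketched in Python as follows. The sketch illustrates only the choice of partial alignment length, not Equation 11 itself; the function name and the convention that `indel_start` is a 0-based offset into the read are assumptions.

```python
def partial_alignment_length(read_length, indel_start):
    """Return the length of the longer un-gapped half of a gapped read.

    The read is split at the InDel start point, and the longer side is used
    as the partial alignment length for the alternative hypothesis.
    """
    if not 0 <= indel_start <= read_length:
        raise ValueError("indel_start must lie within the read")
    return max(indel_start, read_length - indel_start)


# InDel in the middle of a 50-base read: the longer half is only 25 bases.
print(partial_alignment_length(50, 25))
# InDel near the end: a 45-base un-gapped partial alignment remains.
print(partial_alignment_length(50, 5))
```

A longer un-gapped half makes the alternative hypothesis (a partial un-gapped alignment) more plausible, which is consistent with assigning a lower PQV to gapped alignments whose InDel lies near a read end.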
- the primary alignment can be selected at random from among the alignments with the same PQV.
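Primary alignment selection with a random tie-break can be sketched as follows. The dictionary layout with `pos` and `pqv` keys and the optional `rng` argument are illustrative assumptions; writing the result to a BAM file is outside the scope of the sketch.

```python
import random


def select_primary(alignments, rng=None):
    """Pick the alignment with the highest PQV, breaking ties at random.

    alignments: non-empty list of dicts, each with a numeric "pqv" entry.
    rng: optional random.Random instance, useful for reproducible runs.
    """
    if not alignments:
        raise ValueError("no alignments to choose from")
    rng = rng or random.Random()
    best = max(a["pqv"] for a in alignments)
    tied = [a for a in alignments if a["pqv"] == best]
    return rng.choice(tied)


candidates = [
    {"pos": 1200, "pqv": 30},
    {"pos": 88000, "pqv": 30},
    {"pos": 5400, "pqv": 5},
]
primary = select_primary(candidates, random.Random(0))
print(primary["pos"])
```

Passing a seeded generator keeps the tie-break reproducible between runs, which can help when comparing outputs.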
- the embodiments described herein can be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers and the like.
- the embodiments can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a network.
- any of the operations that form part of the embodiments described herein are useful machine operations.
- the embodiments described herein also relate to a device or an apparatus for performing these operations.
- the systems and methods described herein can be specially constructed for the required purposes, or they may be implemented on a general purpose computer selectively activated or configured by a computer program stored in the computer.
- various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
Abstract
A computer-implemented method for classifying alignments of paired nucleic acid sequence reads is disclosed. A plurality of paired nucleic acid sequence reads is received, wherein each read is comprised of a first tag and a second tag separated by an insert region. Potential alignments for the first and second tags of each read to a reference sequence are determined, wherein the potential alignments satisfy a minimum threshold mismatch constraint. Potential paired alignments of the first and second tags of each read are identified, wherein a distance between the first and second tags of each potential paired alignment is within an estimated insert size range. An alignment score is calculated for each potential paired alignment based on a distance between the first and second tags and a total number of mismatches for each tag.
Description
- This application is a continuation of U.S. application Ser. No. 15/001,389 filed Jan. 20, 2016, which is a continuation of U.S. application Ser. No. 13/177,267 filed Jul. 6, 2011, now U.S. Pat. No. 9,268,903, which claims priority to U.S. application No. 61/361,879 filed Jul. 6, 2010, which disclosures are herein incorporated by reference in their entirety.
- The present disclosure generally relates to the field of nucleic acid sequencing including systems and methods for mapping or aligning fragment sequence reads to a reference sequence.
- Upon completion of the Human Genome Project, one focus of the sequencing industry has shifted to finding higher throughput and/or lower cost nucleic acid sequencing technologies, sometimes referred to as “next generation” sequencing (NGS) technologies. In making sequencing higher throughput and/or less expensive, the goal is to make the technology more accessible for sequencing. These goals can be reached through the use of sequencing platforms and methods that provide sample preparation for larger quantities of samples of significant complexity, sequencing larger numbers of complex samples, and/or a high volume of information generation and analysis in a short period of time. Various methods, such as, for example, sequencing by synthesis, sequencing by hybridization, and sequencing by ligation are evolving to meet these challenges.
- Research into fast and efficient nucleic acid (e.g., genome, exome, etc.) sequence assembly methods is vital to the sequencing industry as NGS technologies can provide ultra-high throughput nucleic acid sequencing. As such, sequencing systems incorporating NGS technologies can produce a large number of short sequence reads in a relatively short amount of time. Sequence assembly methods must be able to assemble and/or map a large number of reads quickly and efficiently (i.e., minimize use of computational resources). For example, the sequencing of a human size genome can result in tens or hundreds of millions of reads that need to be assembled before they can be further analyzed to determine their biological, diagnostic and/or therapeutic relevance.
- Sequence assembly can generally be divided into two broad categories: de novo assembly and reference genome mapping assembly. In de novo assembly, sequence reads are assembled together so that they form a new and previously unknown sequence. In reference genome mapping, by contrast, sequence reads are assembled against an existing backbone sequence (e.g., reference sequence, etc.) to build a sequence that is similar but not necessarily identical to the backbone sequence.
- Conventional mapping tools (e.g., MAQ, BFAST, SHRiMP, BWA, etc.) used to align sequence reads tend to incorrectly estimate alignment quality compared to phred-scaled quality scores, as these tools typically do not support quality value determination that differentiates between read fragment types (e.g., single, mate-pair, paired-end, etc.).
- Systems, methods, software and computer-usable media for determining alignment quality of biomolecule-related sequence reads aligned to a reference sequence are disclosed. Biomolecule-related sequences can relate to proteins, peptides, nucleic acids, and the like, and can include structural and functional information such as secondary or tertiary structures, amino acid or nucleotide sequences, sequence motifs, binding properties, genetic mutations and variants, and the like.
- In various embodiments, nucleic acid sequence read data can be generated using various techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, etc.
- In one aspect, a computer-implemented method for classifying alignments of paired nucleic acid sequence reads is disclosed. A plurality of paired nucleic acid sequence reads is received, wherein each read is comprised of a first tag and a second tag separated by an insert region. Potential alignments for the first and second tags of each paired nucleic acid sequence read to a reference sequence are determined, wherein the potential alignments satisfy a minimum threshold mismatch constraint. Potential paired alignments of the first and second tags of each read are identified, wherein a distance between the first and second tags of each potential paired alignment is within an estimated insert size range. An alignment score is calculated for each potential paired alignment based on a distance between the first and second tags and a total number of mismatches for each tag.
- In another aspect, a system for identifying potential alignments for sequencing reads is disclosed. The system includes a nucleic acid sequencer and a processor in communication with the sequencer. The nucleic acid sequencer can be configured to interrogate a sample and produce a plurality of read sequences from the sample. The processor can be configured to obtain the read sequences from the sequencer, perform alignments of the read sequences from the sequencer to a reference sample, calculate a quality value for each alignment and output each alignment with its associated quality value.
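The pairing step of this method can be sketched in Python under simplifying assumptions. Each tag's potential alignments are assumed to be given as (position, mismatches) tuples that already satisfy the mismatch constraint, and the distance between tags is taken as the absolute difference of their start positions. The function name, the tuple layout, and the distance definition are illustrative assumptions, and no particular scoring formula is implied.

```python
def pair_alignments(first_hits, second_hits, insert_range):
    """List potential paired alignments whose distance fits the insert range.

    first_hits, second_hits: lists of (position, mismatches) per tag.
    insert_range: (low, high) estimated insert size range, inclusive.
    """
    low, high = insert_range
    pairs = []
    for pos1, mm1 in first_hits:
        for pos2, mm2 in second_hits:
            distance = abs(pos2 - pos1)
            if low <= distance <= high:
                pairs.append({
                    "pos1": pos1,
                    "pos2": pos2,
                    "distance": distance,
                    "mismatches": mm1 + mm2,  # inputs to a later scoring step
                })
    return pairs


first = [(1000, 0), (50000, 1)]
second = [(2500, 1), (90000, 0)]
for pair in pair_alignments(first, second, (1400, 1600)):
    print(pair)
```

The distance and total mismatch count retained for each pair are the two quantities the alignment score is described as depending on.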
- In still another aspect, a computer-implemented method for determining possible alignments for sequencing reads is disclosed. A sample can be interrogated to produce a plurality of read sequences from the sample. Alignments are performed for the read sequences from the sequencer. A quality value for each alignment is determined. Each alignment with its associated quality value is outputted.
- These and other features are provided herein.
- For a more complete understanding of the principles disclosed herein, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
- FIG. 1 is a block diagram that illustrates a computer system, in accordance with various embodiments.
- FIG. 2 is a schematic diagram of a system for reconstructing a nucleic acid sequence, in accordance with various embodiments.
- FIG. 3 is an exemplary flowchart showing a method for classifying alignment quality of paired reads, in accordance with various embodiments.
- FIG. 4 is a depiction of how PQV can be calculated for gapped alignments, in accordance with various embodiments.
- It is to be understood that the figures are not necessarily drawn to scale, nor are the objects in the figures necessarily drawn to scale in relationship to one another. The figures are depictions that are intended to bring clarity and understanding to various embodiments of apparatuses, systems, and methods disclosed herein. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Moreover, it should be appreciated that the drawings are not intended to limit the scope of the present teachings in any way.
- Embodiments of systems and methods for determining sequence alignment quality are described herein.
- The section headings used herein are for organizational purposes only and are not to be construed as limiting the described subject matter in any way.
- In this detailed description of the various embodiments, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of the embodiments disclosed. One skilled in the art will appreciate, however, that these various embodiments may be practiced with or without these specific details. In other instances, structures and devices are shown in block diagram form. Furthermore, one skilled in the art can readily appreciate that the specific sequences in which methods are presented and performed are illustrative and it is contemplated that the sequences can be varied and still remain within the spirit and scope of the various embodiments disclosed herein.
- All literature and similar materials cited in this application, including but not limited to, patents, patent applications, articles, books, treatises, and internet web pages are expressly incorporated by reference in their entirety for any purpose. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of ordinary skill in the art to which the various embodiments described herein belong. When definitions of terms in incorporated references appear to differ from the definitions provided in the present teachings, the definition provided in the present teachings shall control.
- It will be appreciated that there is an implied “about” prior to the temperatures, concentrations, times, etc. discussed in the present teachings, such that slight and insubstantial deviations are within the scope of the present teachings. In this application, the use of the singular includes the plural unless specifically stated otherwise. Also, the use of “comprise”, “comprises”, “comprising”, “contain”, “contains”, “containing”, “include”, “includes”, and “including” are not intended to be limiting. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present teachings.
- Further, unless otherwise required by context, singular terms shall include pluralities and plural terms shall include the singular. Generally, nomenclatures utilized in connection with, and techniques of, cell and tissue culture, molecular biology, and protein and oligo- or polynucleotide chemistry and hybridization described herein are those well known and commonly used in the art. Standard techniques are used, for example, for nucleic acid purification and preparation, chemical analysis, recombinant nucleic acid, and oligonucleotide synthesis. Enzymatic reactions and purification techniques are performed according to manufacturer's specifications or as commonly accomplished in the art or as described herein. The techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the instant specification. See, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000). The nomenclatures utilized in connection with, and the laboratory procedures and techniques described herein are those well known and commonly used in the art.
- As used herein, “a” or “an” means “at least one” or “one or more.”
- A “system” denotes a set of components, real or abstract, comprising a whole where each component interacts with or is related to at least one other component within the whole.
- A “biomolecule” is any molecule that is produced by a biological organism, including large polymeric molecules such as proteins, polysaccharides, lipids, and nucleic acids as well as small molecules such as primary metabolites, secondary metabolites, and other natural products.
- The phrase “next generation sequencing” or NGS refers to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary electrophoresis-based approaches, for example with the ability to generate hundreds of thousands of relatively small sequence reads at a time. Some examples of next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization. More specifically, the SOLiD Sequencing System of Life Technologies Corp. provides massively parallel sequencing with enhanced accuracy. The SOLiD System and associated workflows, protocols, chemistries, etc. are described in more detail in PCT Publication No. WO 2006/084132, entitled “Reagents, Methods, and Libraries for Bead-Based Sequencing,” international filing date Feb. 1, 2006, U.S. patent application Ser. No. 12/873,190, entitled “Low-Volume Sequencing System and Method of Use,” filed on Aug. 31, 2010, and U.S. patent application Ser. No. 12/873,132, entitled “Fast-Indexing Filter Wheel and Method of Use,” filed on Aug. 31, 2010, the entirety of each of these applications being incorporated herein by reference thereto.
- The phrase “sequencing run” refers to any step or portion of a sequencing experiment performed to determine some information relating to at least one biomolecule (e.g., nucleic acid molecule).
- It is well known that DNA (deoxyribonucleic acid) is a chain of nucleotides consisting of 4 types of nucleotides; A (adenine), T (thymine), C (cytosine), and G (guanine), and that RNA (ribonucleic acid) is comprised of 4 types of nucleotides; A, U (uracil), G, and C. It is also known that certain pairs of nucleotides specifically bind to one another in a complementary fashion (called complementary base pairing). That is, adenine (A) pairs with thymine (T) (in the case of RNA, however, adenine (A) pairs with uracil (U)), and cytosine (C) pairs with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides that are complementary to those in the first strand, the two strands bind to form a double strand. As used herein, “nucleic acid sequencing data,” “nucleic acid sequencing information,” “nucleic acid sequence,” “genomic sequence,” “genetic sequence,” or “fragment sequence,” or “nucleic acid sequencing read” denotes any information or data that is indicative of the order of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine/uracil) in a molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA. It should be understood that the present teachings contemplate sequence information obtained using all available varieties of techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, etc.
- The phrase “ligation cycle” refers to a step in a sequence-by-ligation process where a probe sequence is ligated to a primer or another probe sequence.
- The phrase “color call” refers to an observed dye color resulting from the detection of a probe sequence after a ligation cycle of a sequencing run.
- The phrase “color space” refers to a nucleic acid sequence data schema where nucleic acid sequence information is represented by a set of colors (e.g., color calls, color signals, etc.) each carrying details about the identity and/or positional sequence of bases that comprise the nucleic acid sequence. For example, the nucleic acid sequence “ATCGA” can be represented in color space by various combinations of colors that are measured as the nucleic acid sequence is interrogated using optical detection-based (e.g., dye-based, etc.) sequencing techniques such as those employed by the SOLiD System. That is, in various embodiments, the SOLiD System can employ a schema that represents a nucleic acid fragment sequence as an initial base followed by a sequence of overlapping dimers (adjacent pairs of bases). The system can encode each dimer with one of four colors using a coding scheme that results in a sequence of color calls that represent a nucleotide sequence.
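The initial-base-plus-dimer schema described above can be illustrated with a short Python sketch. The specific color assignment used here is an assumption: it follows the commonly documented two-base encoding in which each base is given a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dimer is the XOR of its two codes. The function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_color_space(sequence):
    """Encode a base-space string as its first base followed by dimer colors."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    colors = [
        str(BASE_CODE[a] ^ BASE_CODE[b])  # one color per overlapping dimer
        for a, b in zip(sequence, sequence[1:])
    ]
    return sequence[0] + "".join(colors)


def decode_color_space(encoded):
    """Recover the base-space string from an initial base and color calls."""
    bases = [encoded[0].upper()]
    for color in encoded[1:]:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


print(encode_color_space("ATCGA"))  # A3232
print(decode_color_space("A3232"))  # ATCGA
```

Because every color depends on two adjacent bases, decoding needs the initial base, and a single wrong color call changes every base decoded after it.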
- The phrase “base space” refers to a nucleic acid sequence data schema where nucleic acid sequence information is represented by the actual nucleotide base composition of the nucleic acid sequence. For example, the nucleic acid sequence “ATCGA” is represented in base space by the actual nucleotide base identities (e.g., A, T/or U, C, G) of the nucleic acid sequence.
- A “polynucleotide”, “nucleic acid”, or “oligonucleotide” refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleosidic linkages. Typically, a polynucleotide comprises at least three nucleosides. Usually oligonucleotides range in size from a few monomeric units, e.g. 3-4, to several hundreds of monomeric units. Whenever a polynucleotide such as an oligonucleotide is represented by a sequence of letters, such as “ATGCCTG,” it will be understood that the nucleotides are in 5′->3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine, unless otherwise noted. The letters A, C, G, and T may be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases, as is standard in the art.
- The techniques of “paired-end,” “pairwise,” “paired tag,” or “mate pair” sequencing are generally known in the art of molecular biology (Siegel A. F. et al., Genomics. 2000, 68: 237-246; Roach J. C. et al., Genomics. 1995, 26: 345-353). These sequencing techniques can allow the determination of multiple “reads” of sequence, each from a different place on a single polynucleotide. Typically, the distance (i.e., insert region) between the two reads or other information regarding a relationship between the reads is known. In some situations, these sequencing techniques provide more information than does sequencing two stretches of nucleic acid sequences in a random fashion. With the use of appropriate software tools for the assembly of sequence information (e.g., Mullikin J. C. et al., Genome Res. 2003, 13: 81-90; Kent, W. I. et al., Genome Res. 2001, 11: 1541-8) it is possible to make use of the knowledge that the “paired-end,” “pairwise,” “paired tag” or “mate pair” sequences are not completely random, but are known to occur a known distance apart and/or to have some other relationship, and are therefore linked or paired in the genome. This information can aid in the assembly of whole nucleic acid sequences into a consensus sequence.
- FIG. 1 is a block diagram that illustrates a computer system 100, upon which embodiments of the present teachings may be implemented. In various embodiments, computer system 100 can include a bus 102 or other communication mechanism for communicating information, and a processor 104 coupled with bus 102 for processing information. In various embodiments, computer system 100 can also include a memory 106, which can be a random access memory (RAM) or other dynamic storage device, coupled to bus 102 for determining base calls, and instructions to be executed by processor 104. Memory 106 also can be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 104. In various embodiments, computer system 100 can further include a read only memory (ROM) 108 or other static storage device coupled to bus 102 for storing static information and instructions for processor 104. A storage device 110, such as a magnetic disk or optical disk, can be provided and coupled to bus 102 for storing information and instructions.
- In various embodiments, computer system 100 can be coupled via bus 102 to a display 112, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. An input device 114, including alphanumeric and other keys, can be coupled to bus 102 for communicating information and command selections to processor 104. Another type of user input device is a cursor control 116, such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to processor 104 and for controlling cursor movement on display 112. This input device typically has two degrees of freedom in two axes, a first axis (i.e., x) and a second axis (i.e., y), that allows the device to specify positions in a plane.
- A computer system 100 can perform the present teachings. Consistent with certain implementations of the present teachings, results can be provided by computer system 100 in response to processor 104 executing one or more sequences of one or more instructions contained in memory 106. Such instructions can be read into memory 106 from another computer-readable medium, such as storage device 110. Execution of the sequences of instructions contained in memory 106 can cause processor 104 to perform the processes described herein. Alternatively hard-wired circuitry can be used in place of or in combination with software instructions to implement the present teachings. Thus implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.
- The term “computer-readable medium” as used herein refers to any media that participates in providing instructions to processor 104 for execution. Such a medium can take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Examples of non-volatile media can include, but are not limited to, optical or magnetic disks, such as storage device 110. Examples of volatile media can include, but are not limited to, dynamic memory, such as memory 106. Examples of transmission media can include, but are not limited to, coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 102.
- Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.
- Various forms of computer readable media can be involved in carrying one or more sequences of one or more instructions to processor 104 for execution. For example, the instructions can initially be carried on the magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 100 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector coupled to bus 102 can receive the data carried in the infra-red signal and place the data on bus 102. Bus 102 can carry the data to memory 106, from which processor 104 retrieves and executes the instructions. The instructions received by memory 106 may optionally be stored on storage device 110 either before or after execution by processor 104.
- In accordance with various embodiments, instructions configured to be executed by a processor to perform a method are stored on a computer-readable medium. The computer-readable medium can be a device that stores digital information. For example, a computer-readable medium includes a compact disc read-only memory (CD-ROM) as is known in the art for storing software. The computer-readable medium is accessed by a processor suitable for executing instructions configured to be executed.
- Nucleic acid sequence data can be generated using various techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, etc.
- Various embodiments of nucleic acid sequencing platforms (i.e., nucleic acid sequencer) can include components as displayed in the block diagram of
FIG. 2 . According to various embodiments, sequencinginstrument 200 can include a fluidic delivery andcontrol unit 202, asample processing unit 204, asignal detection unit 206, and a data acquisition, analysis andcontrol unit 208. Various embodiments of instrumentation, reagents, libraries and methods used for next generation sequencing are described in U.S. Patent Application Publication No. US20090062129 (ASN 11/737308) and U.S. Patent Application Publication No. US20080003571 (ASN 11/345,979) to McKernan, et al., which applications are incorporated herein by reference. Various embodiments ofinstrument 200 can provide for automated sequencing that can be used to gather sequence information from a plurality of sequences in parallel, i.e., substantially simultaneously. - In various embodiments, the fluidics delivery and
control unit 202 can include reagent delivery system. The reagent delivery system can include a reagent reservoir for the storage of various reagents. The reagents can include RNA-based primers, forward/reverse DNA primers, oligonucleotide mixtures for ligation sequencing, nucleotide mixtures for sequencing-by-synthesis, optional ECC oligonucleotide mixtures, buffers, wash reagents, blocking reagent, stripping reagents, and the like. Additionally, the reagent delivery system can include a pipetting system or a continuous flow system which connects the sample processing unit with the reagent reservoir. - In various embodiments, the
sample processing unit 204 can include a sample chamber, such as a flow cell, a substrate, a micro-array, a multi-well tray, or the like. The sample processing unit 204 can include multiple lanes, multiple channels, multiple wells, or other means of processing multiple sample sets substantially simultaneously. Additionally, the sample processing unit can include multiple sample chambers to enable processing of multiple runs simultaneously. In particular embodiments, the system can perform signal detection on one sample chamber while substantially simultaneously processing another sample chamber. Additionally, the sample processing unit can include an automation system for moving or manipulating the sample chamber.
- In various embodiments, the
signal detection unit 206 can include an imaging or detection sensor. For example, the imaging or detection sensor can include a CCD, a CMOS, an ion sensor, such as an ion sensitive layer overlying a CMOS, a current detector, or the like. The signal detection unit 206 can include an excitation system to cause a probe, such as a fluorescent dye, to emit a signal. The excitation system can include an illumination source, such as an arc lamp, a laser, a light emitting diode (LED), or the like. In particular embodiments, the signal detection unit 206 can include optics for the transmission of light from an illumination source to the sample or from the sample to the imaging or detection sensor. Alternatively, the signal detection unit 206 may not include an illumination source, such as, for example, when a signal is produced spontaneously as a result of a sequencing reaction. For example, a signal can be produced by the interaction of a released moiety, such as a released ion interacting with an ion sensitive layer, or a pyrophosphate reacting with an enzyme or other catalyst to produce a chemiluminescent signal. In another example, changes in an electrical current can be detected as a nucleic acid passes through a nanopore without the need for an illumination source.
- In various embodiments, data acquisition analysis and
control unit 208 can monitor various system parameters. The system parameters can include temperature of various portions of instrument 200, such as the sample processing unit or reagent reservoirs, volumes of various reagents, the status of various system subcomponents, such as a manipulator, a stepper motor, a pump, or the like, or any combination thereof.
- It will be appreciated by one skilled in the art that various embodiments of
instrument 200 can be used to practice a variety of sequencing methods including ligation-based methods, sequencing by synthesis, single molecule methods, nanopore sequencing, and other sequencing techniques. Ligation sequencing can include single ligation techniques, or chain ligation techniques where multiple ligations are performed in sequence on a single primer. Sequencing by synthesis can include the incorporation of dye labeled nucleotides, chain termination, ion/proton sequencing, pyrophosphate sequencing, or the like. Single molecule techniques can include continuous sequencing, where the identity of the nucleotide is determined during incorporation without the need to pause or delay the sequencing reaction, or staggered sequencing, where the sequencing reaction is paused to determine the identity of the incorporated nucleotide.
- In various embodiments, the
sequencing instrument 200 can determine the sequence of a nucleic acid, such as a polynucleotide or an oligonucleotide. The nucleic acid can include DNA or RNA, and can be single stranded, such as ssDNA and RNA, or double stranded, such as dsDNA or an RNA/cDNA pair. In various embodiments, the nucleic acid can include or be derived from a fragment library, a mate pair library, a ChIP fragment, or the like. In particular embodiments, the sequencing instrument 200 can obtain the sequence information from a single nucleic acid molecule or from a group of substantially identical nucleic acid molecules.
- In various embodiments, sequencing
instrument 200 can output nucleic acid sequencing read data in a variety of different output data file types/formats, including, but not limited to: *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv. -
FIG. 3 is an exemplary flowchart showing a method for classifying alignments of paired nucleic acid sequence reads, in accordance with various embodiments. In various embodiments, the sequence read alignment classification scores can be a factor in determining the pairing quality value (PQV).
- As depicted herein,
method 300 begins with step 302, where a plurality of paired nucleic acid sequence reads is received. Each paired nucleic acid sequence read is comprised of a first tag (e.g., F3/R3 read) and a second tag (e.g., F3/R3 read) separated by an insert region. In various embodiments, the paired nucleic acid sequence reads are mate-pair reads. In various embodiments, the paired nucleic acid sequence reads are paired-end reads. In various embodiments, the paired nucleic acid sequence reads are a combination of mate-pair and paired-end reads.
- In
step 304, the potential alignments for the first and second tags of each paired nucleic acid sequence read to a reference sequence are determined, wherein all the potential alignments satisfy a minimum threshold mismatch constraint. That is, each read tag that is aligned to the reference sequence cannot exceed a certain number of mismatches (i.e., the minimum threshold mismatch constraint).
- In
step 306, potential paired alignments of the first and second tags of each paired nucleic acid sequence read are identified, wherein a distance between the first and second tags of each potential paired alignment is within an estimated insert size range. In various embodiments, the estimated insert size range can be determined by: 1. mapping all the tags to a reference sequence, 2. determining a distribution of pairing distance for all uniquely mapped pairs of tags, and 3. calculating a mean and standard deviation value from the pairing distance distribution to estimate a range of insert size (e.g., range values that cover 95% of the distributed distances of the observed pairs, range values within a certain number of standard deviations from the mean, etc.).
- In
step 308, an alignment score is calculated for each potential paired alignment based on the distance between the first and second tags and a total number of mismatches for each tag. In various embodiments, the alignment score calculation is also a function of read alignment length (i.e., read length of the tags). In various embodiments, the alignment score calculation is also a function of the total number of possible alignments for each paired nucleic acid sequence read.
- In various embodiments, the
method 300 can be performed using color space nucleic acid sequence data. In various embodiments, the method 300 can be performed using base space nucleic acid sequence data. It should be understood, however, that the method 300 disclosed herein can be performed using any schema or format of nucleic acid sequence information as long as the schema or format can convey the base identity and position.
- According to various embodiments, the system and methods of the present teachings may introduce a Bayesian inference based statistical approach to calculating mapping quality values for different library types such as single fragment and paired reads (e.g., mate-pair, paired-end reads, etc.). These approaches can make use of mate-pair/paired-end read information including insert size distribution between the read pairs (e.g., pairs of tags), read orientation, strand ID annotations, gene ID annotations, etc. Using this approach, non-uniform prior probabilities can be assigned to different alignment types, such as alignments that correspond to inversions (e.g., mate-pair reads mapping to opposite strands, etc.) and gapped alignments (e.g., insertion/deletion within a read), which can be useful to assess the probability of observing such mutations in a particular genome.
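The insert size estimation described for step 306 of method 300 can be sketched as follows. This is a minimal illustration, assuming the range is taken as the mean plus or minus a chosen number of standard deviations; the function names, the default of two standard deviations, and the sample distances are hypothetical and do not come from the present teachings.

```python
import statistics


def estimate_insert_size_range(distances, num_sd=2.0):
    """Return (low, high) as mean +/- num_sd sample standard deviations.

    `distances` are pairing distances observed for uniquely mapped tag pairs.
    The choice of num_sd is an assumption made for this illustration.
    """
    mean = statistics.mean(distances)
    sd = statistics.stdev(distances)
    return mean - num_sd * sd, mean + num_sd * sd


def pair_within_range(pos_first, pos_second, size_range):
    """Check whether the distance between two tag positions is in the range."""
    low, high = size_range
    return low <= abs(pos_second - pos_first) <= high


# Hypothetical pairing distances from uniquely mapped pairs.
unique_pair_distances = [1450, 1500, 1520, 1480, 1550, 1500]
size_range = estimate_insert_size_range(unique_pair_distances)

# Hypothetical candidate paired alignments as (first tag, second tag) positions.
candidates = [(10_000, 11_490), (10_000, 52_000)]
kept = [c for c in candidates if pair_within_range(c[0], c[1], size_range)]
```

Only the first candidate survives the filter here, because the second lies far outside the estimated range.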
- In various embodiments, for the case of whole-transcriptome sequencing, mate-pair/paired-end reads capable of being mapped to exons from the same gene can be assigned a uniform prior probability regardless of the genomic distance between the exons. In various embodiments, mate-pair/paired-end reads that map to exons from different genes (corresponding to gene fusions) can be assigned a lower prior probability. In various embodiments, such an approach can be implemented in sequence analytics tools and applications, such as, for example, SOLiD LIFESCOPE genetic analysis software (Life Technologies Corporation; Carlsbad, CA), and can be used for mapping and variant detection using sequencing reads such as those obtained from a NGS sequencing instrument.
- In various embodiments, the accuracy and predictive value of the mapping/pairing quality score computed using these methods can be demonstrated using both simulated datasets (for example from a human reference chromosome 0) and actual genome datasets (for example from a HuRef sample generated using a NGS instrument). Evaluating the resulting mapping quality values and comparing them to phred-scale values for the probability of misalignment demonstrates that the methods of the present teachings provide more accurate mapping quality than conventional approaches and may be better suited to represent phred-scale alignment probability for a multiplicity of different library types.
- According to various embodiments, the mapping quality methods described herein demonstrate highly accurate and comprehensive functionality in terms of computing quality of different alignment types including gapped alignments and whole-transcriptomes. In one aspect, the predictive value of a mapping quality value can improve the efficiency of generating variant calls and gene fusion calls made using various tools and sequencing analytics software (such as the SOLiD LIFESCOPE sequence analysis toolset). Together with the base quality values of individual bases in a read, mapping quality values can be used to improve the efficiency of rare-allele detection in cancer genomics research.
- In various embodiments, methods for determining Mapping/Pairing quality value (PQV) are provided. The PQV can be generally associated with a phred-scaled quantitative measure of the confidence of aligning a read to the correct location in the reference genome. The PQV may further be represented as the negative log odds of misaligning a read (−10 log10[prob of error]).
- In various embodiments, the posterior probability of correctly aligning a read pair to a reference sequence can be calculated using (for example) the total alignment length of the mate pair reads, total number of mismatches to reference, complete mate-pair information such as insert size and gene ID annotations (in the case of whole transcriptome). The calculated mapping/pairing quality values can further represent the probability of aligning sequenced reads to the reference sequence (e.g., reference genome, etc.).
- According to various embodiments, a method is provided which can be implemented in a software tool or application which computes mapping/pairing quality values that better represent phred-scale quality scores (for example the probability of misaligning the reads). This method can make use of read pair information to compute quality values for mate-pair and paired-end library types. Mapping/pairing quality values computed by the methods of the present teachings can be accurate and predictive in terms of being able to improve the accuracy of small variant detection.
- In various embodiments, the pairing algorithm of the present teachings can be configured to report multiple sets of possible alignments for any given pair of reads (for example F3/R3 tags for a Mate-pair run and F3/F5-P2 tags for a Paired-end run obtained using a NGS sequencer). The pairing quality method and algorithm can implement a Bayesian approach to calculate the quality of a given alignment for a pair of reads (i.e., pair of tags) and the alignment with the highest PQV can be selected as the primary alignment for the pair of reads. In various embodiments, the PQVs may be used to represent a Phred-Scaled quality score. Such an approach can be useful for downstream variant detection tools such as DiBayes, Small-InDels, Large-InDels and CNV.
- In various aspects, the quality of any given alignment for a pair of reads r1, r2 mapped to positions x1 and x2 in the reference sequence can be represented by Equation 1:
$$Q(r_1, r_2, x_1, x_2) = P(A(r_1, r_2, x_1, x_2) \mid r_1, r_2),$$
- where A(r1,r2,x1,x2) represents the event when reads r1 & r2 are sequenced from locations x1 & x2 respectively and P(A|r1,r2) is the probability of the event A occurring given the pair of reads r1 and r2.
- Applying a Bayesian-type approach, the posterior probability P(A|r1,r2) may be represented as Equation 2:
$$P(A \mid r_1, r_2) = \frac{P(r_1, r_2 \mid A) \times P(A)}{P(r_1, r_2)}$$
- The probability P(r1,r2), of observing reads r1 and r2 can then be a function of the complexity of the genome sequenced. One exemplary probability determination can be calculated as Equation 3:
$$P(r_1, r_2) = \sum_{i,j \in M} P(r_1, r_2 \mid A(r_1, r_2, i, j)) \times P(A(r_1, r_2, i, j))$$
- where M is the set of possible alignments to the reference sequence for reads r1 and r2. Using this relationship to represent P(r1,r2) in the previous equation, one obtains Equation 4:
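$$Q(r_1, r_2, x_1, x_2) = \frac{P(r_1, r_2 \mid A(r_1, r_2, x_1, x_2)) \times P(A(r_1, r_2, x_1, x_2))}{\sum_{i,j \in M} P(r_1, r_2 \mid A(r_1, r_2, i, j)) \times P(A(r_1, r_2, i, j))}$$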
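The Bayesian normalization described for Equations 2 through 4 can be illustrated with a short sketch. It assumes that a likelihood and a prior are already available for every candidate alignment in the set M; the function name and the numeric values are hypothetical and chosen only to show the arithmetic.

```python
def posterior_for_alignments(likelihoods, priors):
    """Return the posterior of each candidate alignment.

    Each posterior is likelihood * prior for that candidate, divided by the
    sum of likelihood * prior over all candidates (the set M in the text).
    """
    joint = [lk * pr for lk, pr in zip(likelihoods, priors)]
    total = sum(joint)
    if total == 0:
        raise ValueError("all candidate alignments have zero probability")
    return [j / total for j in joint]


# Hypothetical values for three candidate alignments of one read pair.
likelihoods = [0.9, 0.09, 0.01]
priors = [1.0, 1.0, 1e-4]
posteriors = posterior_for_alignments(likelihoods, priors)
```

Note that a read pair with a single candidate alignment always receives a posterior of 1 under this normalization, which is the situation the background probability discussed later in the text is meant to address.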
- The prior probability P(A) of the event A can further be given by Equation 5:
$$P(A(r_1, r_2, x_1, x_2)) = P(A(r_2, x_2) \mid B) \times P(B(r_1, x_1)),$$
- where B(r1,x1) is the event that read r1 is sequenced from location x1 in the genome and P(A|B) is the conditional probability of finding the event A where read r2 is sequenced from location x2, given that read r1 was sequenced from location x1.
- The probability P(B) can be a constant for any given read r1, and the conditional probability P(A|B) can follow the insert-size distribution. As indicated below, the following prior probabilities can be used in pairing quality calculations. In various embodiments, P(A(r1,r2,i,j)) can be the alignment score calculated for each potential sequence pair alignment (as discussed above with respect to FIG. 3).
- P(A|B)=1, for all ‘AAA’ pairs.
- P(A|B)=1/10,000, for all ‘non-AAA’ pairs (including Small & Large Indels).
- P(A|B)=1/10,000, when one of the reads in the pair cannot be mapped to the reference sequence.
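The prior probabilities listed above can be expressed as a small lookup. This is a sketch only: the function name, the argument names, and the use of a string label for the pair category are assumptions made for illustration, while the two numeric values are taken from the list above.

```python
AAA_PRIOR = 1.0
NON_AAA_PRIOR = 1.0 / 10_000


def conditional_prior(pair_category, both_reads_mapped=True):
    """Return P(A|B) for a candidate paired alignment.

    pair_category is a label such as "AAA"; any other label is treated as a
    'non-AAA' pair. A pair in which one read cannot be mapped receives the
    same low prior as a 'non-AAA' pair.
    """
    if not both_reads_mapped:
        return NON_AAA_PRIOR
    return AAA_PRIOR if pair_category == "AAA" else NON_AAA_PRIOR
```

For example, `conditional_prior("AAA")` returns 1.0, while any other category label, or a pair with an unmapped read, returns 0.0001.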
- In various embodiments, where a pair of reads has a unique set of alignments to a reference sequence, the posterior probability P(A|r1,r2) can result in 1, thereby obscuring the relative quality of the alignment compared to those of other read pairs. This can be addressed by calculating a background probability P_B, which can represent the probability of finding an alignment to the reference sequence with M+1 mismatches, where M is the maximum allowed mismatches set in the pairing.ini file, as shown in Equation 6:
$$P_B = P(r_1 \mid A(r_1, x_1)) \times P(r_2 \mid B, M+1 \text{ mismatches}), \quad r_1 > r_2 \; (k_1 > k_2 \text{ if } r_1 = r_2)$$
- For uniquely paired reads, the posterior probability can be given by Equation 7:
-
- For mapping using a local alignment method, the likelihood function P(r1,r2|A) can be given by Equation 8:
-
- where,
-
- L1 & L2 are the read lengths for reads r1 and r2 respectively (e.g., F3=50 and R3=50),
- k1 & k2 are the alignment lengths (k1≤L1 and k2≤L2),
- m1 & m2 are the number of mismatches, and
- e is the error rate.
- Being consistent with a phred-type quality score (−10*log10[prob(error)]), the PQV may be computed as the negative log odds of misaligning the pair of reads, as shown in Equation 9:
$$PQV = -10 \times \log_{10}\left[1 - Q(r_1, r_2, x_1, x_2)\right]$$
- The resulting pairing quality values can be normalized by a maximum value to help ensure that the pairing quality values are within a desired range [0,100], as shown in Equation 10:
-
- PQVmax can reflect an exemplary maximum possible pairing quality value when the pair of reads map uniquely to the reference with zero mismatches.
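The phred-style conversion of Equation 9 and a normalization into [0,100] can be sketched as follows. Equation 10 is not reproduced in this text, so the linear rescaling by PQVmax used here is an assumption, as are the function names and the small error floor that keeps the logarithm finite when Q equals 1.

```python
import math


def raw_pqv(q, min_error=1e-10):
    """Equation 9: -10 * log10(1 - Q), with an assumed floor on the error."""
    error = max(1.0 - q, min_error)
    return -10.0 * math.log10(error)


def normalized_pqv(q, pqv_max, min_error=1e-10):
    """Rescale the raw value by pqv_max and clamp to [0, 100].

    The linear rescaling is an assumption made for this sketch.
    """
    scaled = 100.0 * raw_pqv(q, min_error) / pqv_max
    return min(max(scaled, 0.0), 100.0)


# Assumption: take the maximum as the raw value for a posterior of exactly 1.
pqv_max = raw_pqv(1.0)
example = normalized_pqv(0.999, pqv_max)
```

With these assumptions, a posterior of 0.999 (an error probability of about one in a thousand) corresponds to a value of about 30.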
- In various embodiments, a pairing method can be devised to search for gapped alignments (i.e., InDels) when one of the tags (F3/R3/F5-P2) maps to a reference sequence and another tag does not map to the reference sequence within a selected insert-size range. For this exemplary approach, where both an un-gapped and a gapped alignment are found for a given read, then, due to the low prior probability of 10^−4 assigned to the gapped alignments, the PQV for gapped alignments can be approximately zero.
- Thus, as shown in FIG. 4, in calculating the PQV for gapped alignments, an alternative hypothesis can be tested as the probability of finding the partial un-gapped alignments. The read with the gapped alignment can be treated as two partial reads on either side of an InDel start point, where the half with the greatest length is used as the partial alignment length for the alternate hypothesis. Such an approach can be used to help ensure that gapped alignments with an InDel starting point at the middle of the read, with significant length of alignment on either side of the InDel starting point, will be assigned a higher PQV compared to gapped alignments with an InDel starting point close to either end of the read, as shown in Equation 11:
- In various embodiments, for reads with multiple un-gapped alignments, the alignment with the highest PQV can be selected as the primary alignment for the read and is reported to the *.BAM file. In cases where there are multiple alignments with the same PQV, the primary alignment can be selected at random from among the alignments with the same PQV.
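The primary alignment selection with random tie-breaking can be sketched as below. The dictionary layout, the field names, and the example positions are hypothetical; only the rule itself (highest PQV wins, ties broken at random) comes from the description above.

```python
import random


def select_primary(alignments, rng=random):
    """Pick the alignment with the highest PQV, breaking ties at random.

    `alignments` is a non-empty list of dicts, each with a "pqv" entry.
    `rng` can be a seeded random.Random instance for reproducible runs.
    """
    best = max(a["pqv"] for a in alignments)
    tied = [a for a in alignments if a["pqv"] == best]
    return rng.choice(tied)


# Hypothetical candidate alignments for one read.
candidates = [
    {"position": 1_000, "pqv": 42},
    {"position": 8_500, "pqv": 57},
    {"position": 9_900, "pqv": 57},
]
primary = select_primary(candidates, rng=random.Random(0))
```

Passing a seeded generator keeps the tie-break reproducible between runs, which can be convenient when comparing outputs.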
- While the present teachings are described in conjunction with various embodiments, it is not intended that the present teachings be limited to such embodiments. On the contrary, the present teachings encompass various alternatives, modifications, and equivalents, as will be appreciated by those of skill in the art.
- Further, in describing various embodiments, the specification may have presented a method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. As one of ordinary skill in the art would appreciate, other sequences of steps may be possible. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. In addition, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the various embodiments.
- The embodiments described herein can be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers and the like. The embodiments can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a network.
- It should also be understood that the embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. These operations are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Further, the manipulations performed are often referred to in terms, such as producing, identifying, determining, or comparing.
- Any of the operations that form part of the embodiments described herein are useful machine operations. The embodiments, described herein, also relate to a device or an apparatus for performing these operations. The systems and methods described herein can be specially constructed for the required purposes or it may be a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
- Certain embodiments can also be embodied as computer readable code on a computer readable medium. The computer readable medium is any data storage device that can store data, which can thereafter be read by a computer system. Examples of the computer readable medium include hard drives, network attached storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
Claims (19)
1-20. (canceled)
21. A method for classifying alignments of paired nucleic acid sequence reads, comprising:
receiving, by a computing device comprising a processor and memory, a plurality of paired nucleic acid sequence reads, wherein each paired nucleic acid sequence read comprises a first read of a first tag derived from a first region of a polynucleotide and a second read of a second tag derived from a second region of the polynucleotide, the first and second reads produced by next generation sequencing, wherein the first tag and the second tag are separated by an insert region;
mapping, by the computing device, the first and second reads of each paired nucleic acid sequence read to a reference genome to form potential alignments, wherein one or more potential alignments are produced for each paired nucleic acid sequence read, wherein each potential alignment satisfies a minimum threshold mismatch constraint;
identifying, by the computing device, potential paired alignments of the first and second reads of each paired nucleic acid sequence read, wherein a distance between the first and second reads of each potential paired alignment is within an estimated insert size range;
calculating, by the computing device, an alignment score for each potential paired alignment based on:
a distance between the first and second reads, and
a total number of mismatches for the first and second reads; and
selecting, by the computing device, the potential paired alignment having a highest alignment score as a primary alignment for each paired nucleic acid sequence read.
22. The method, as recited in claim 21 , wherein the paired nucleic acid sequence read is a mate-pair read.
23. The method, as recited in claim 21 , wherein the paired nucleic acid sequence read is a paired-end read.
24. The method, as recited in claim 21 , wherein the estimated insert size range is based on a standard deviation value derived from a distribution of estimated sizes of insert region for the plurality of paired nucleic acid sequence reads.
25. The method, as recited in claim 21 , wherein the calculated alignment score is a function of read alignment length.
26. The method, as recited in claim 21 , wherein the calculated alignment score is a function of a total number of possible alignments for each read.
27. A system for sequence alignment quality assessment, comprising:
a next generation sequencing instrument configured to interrogate a sample to produce a plurality of read sequences from the sample; and
a processor in communication with the next generation sequencing instrument, the processor configured to,
obtain the read sequences from the next generation sequencing instrument,
map the read sequences to a reference genome to produce potential alignments, wherein one or more potential alignments are produced for a given read sequence, wherein each potential alignment satisfies a minimum threshold mismatch constraint,
calculate a quality value for each potential alignment,
output the potential alignments and associated quality values, and
select the potential alignment having a highest quality value as a primary alignment for the given read sequence.
28. The system as recited in claim 27 , wherein the quality value is calculated for the potential alignment of the read sequence corresponding to a single fragment library type.
29. The system as recited in claim 27 , wherein the quality value is calculated for the potential alignment of the read sequence corresponding to a paired read library type.
30. The system as recited in claim 29 , wherein for the paired read library type, the mapping produces aligned paired reads, wherein the aligned paired reads must have insert region sizes that fall within an estimated insert size range for the aligned paired reads, wherein the aligned paired reads are separated by an insert region.
31. The system as recited in claim 30 , wherein the estimated insert size range is based on a standard deviation value derived from a distribution of estimated insert region sizes of the aligned paired reads.
32. A method for sequence alignment quality assessment, comprising:
interrogating a sample, by a next generation sequencing instrument, to produce a plurality of read sequences from the sample;
obtaining, by a computing device comprising a processor and a memory, the plurality of read sequences from the next generation sequencing instrument;
mapping, by the computing device, the read sequences to a reference genome to produce potential alignments, wherein one or more potential alignments are produced for a given read sequence, wherein each potential alignment satisfies a minimum threshold mismatch constraint;
calculating, by the computing device, a quality value for each potential alignment;
outputting, by the computing device, the potential alignments and associated quality values; and
selecting, by the computing device, the potential alignment having a highest quality value as a primary alignment.
33. The method, as recited in claim 32 , wherein the quality value is calculated for the potential alignment of the read sequence corresponding to a single fragment library type.
34. The method, as recited in claim 32 , wherein the quality value is calculated for the potential alignment of the read sequence corresponding to a paired read library type.
35. The method, as recited in claim 32 , wherein the calculated quality value for each potential alignment is a function of read alignment length.
36. The method, as recited in claim 32 , wherein the calculated quality value for each potential alignment is a function of number of read mismatches.
37. The method, as recited in claim 34 , wherein for the paired read library type, the mapping produces aligned paired reads, wherein the aligned paired reads must have insert region sizes that fall within an estimated insert size range for the aligned paired reads, wherein the aligned paired reads are separated by an insert region.
38. The method, as recited in claim 37 , wherein the estimated insert size range is based on a standard deviation value derived from a distribution of estimated insert region sizes of the aligned paired reads.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/338,488 US20230410946A1 (en) | 2010-07-06 | 2023-06-21 | Systems and methods for sequence data alignment quality assessment |
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US36187910P | 2010-07-06 | 2010-07-06 | |
US13/177,267 US9268903B2 (en) | 2010-07-06 | 2011-07-06 | Systems and methods for sequence data alignment quality assessment |
US15/001,389 US20160140291A1 (en) | 2010-07-06 | 2016-01-20 | Systems and Methods for Sequence Data Alignment Quality Assessment |
US16/421,653 US20190348153A1 (en) | 2010-07-06 | 2019-05-24 | Systems and methods for sequence data alignment quality assessment |
US18/338,488 US20230410946A1 (en) | 2010-07-06 | 2023-06-21 | Systems and methods for sequence data alignment quality assessment |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/421,653 Continuation US20190348153A1 (en) | 2010-07-06 | 2019-05-24 | Systems and methods for sequence data alignment quality assessment |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230410946A1 true US20230410946A1 (en) | 2023-12-21 |
Family
ID=45439304
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/177,267 Active 2034-05-14 US9268903B2 (en) | 2010-07-06 | 2011-07-06 | Systems and methods for sequence data alignment quality assessment |
US15/001,389 Abandoned US20160140291A1 (en) | 2010-07-06 | 2016-01-20 | Systems and Methods for Sequence Data Alignment Quality Assessment |
US16/421,653 Abandoned US20190348153A1 (en) | 2010-07-06 | 2019-05-24 | Systems and methods for sequence data alignment quality assessment |
US18/338,488 Pending US20230410946A1 (en) | 2010-07-06 | 2023-06-21 | Systems and methods for sequence data alignment quality assessment |
Country Status (1)
Country | Link |
---|---|
US (4) | US9268903B2 (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2118797A2 (en) * | 2007-02-05 | 2009-11-18 | Applied Biosystems, LLC | System and methods for indel identification using short read sequencing |
US9268903B2 (en) * | 2010-07-06 | 2016-02-23 | Life Technologies Corporation | Systems and methods for sequence data alignment quality assessment |
EP2753715A4 (en) | 2011-09-09 | 2015-05-20 | Univ Leland Stanford Junior | Methods for obtaining a sequence |
US9600625B2 (en) | 2012-04-23 | 2017-03-21 | Bina Technologies, Inc. | Systems and methods for processing nucleic acid sequence data |
US9857328B2 (en) | 2014-12-18 | 2018-01-02 | Agilome, Inc. | Chemically-sensitive field effect transistors, systems and methods for manufacturing and using the same |
EP3235010A4 (en) | 2014-12-18 | 2018-08-29 | Agilome, Inc. | Chemically-sensitive field effect transistor |
US10020300B2 (en) | 2014-12-18 | 2018-07-10 | Agilome, Inc. | Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids |
US9618474B2 (en) | 2014-12-18 | 2017-04-11 | Edico Genome, Inc. | Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids |
US10006910B2 (en) | 2014-12-18 | 2018-06-26 | Agilome, Inc. | Chemically-sensitive field effect transistors, systems, and methods for manufacturing and using the same |
US9859394B2 (en) | 2014-12-18 | 2018-01-02 | Agilome, Inc. | Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids |
US10811539B2 (en) | 2016-05-16 | 2020-10-20 | Nanomedical Diagnostics, Inc. | Graphene FET devices, systems, and methods of using the same for sequencing nucleic acids |
EP3469100B1 (en) | 2016-06-10 | 2021-01-27 | Seegene, Inc. | Methods for preparing tagging oligonucleotides |
Family Cites Families (138)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5168499A (en) | 1990-05-02 | 1992-12-01 | California Institute Of Technology | Fault detection and bypass in a sequence information signal processor |
US5856928A (en) | 1992-03-13 | 1999-01-05 | Yan; Johnson F. | Gene and protein representation, characterization and interpretation process |
US5430886A (en) | 1992-06-15 | 1995-07-04 | Furtek; Frederick C. | Method and apparatus for motion estimation |
DE69333422T2 (en) | 1992-07-31 | 2004-12-16 | International Business Machines Corp. | Finding strings in a database of strings |
JPH0793370A (en) | 1993-09-27 | 1995-04-07 | Hitachi Device Eng Co Ltd | Gene data base retrieval system |
GB2283840B (en) | 1993-11-12 | 1998-07-22 | Fujitsu Ltd | Genetic motif extracting method and apparatus |
US5671090A (en) | 1994-10-13 | 1997-09-23 | Northrop Grumman Corporation | Methods and systems for analyzing data |
US5601982A (en) | 1995-02-07 | 1997-02-11 | Sargent; Jeannine P. | Method and apparatus for determining the sequence of polynucleotides |
US6001562A (en) | 1995-05-10 | 1999-12-14 | The University Of Chicago | DNA sequence similarity recognition by hybridization to short oligomers |
US5604100A (en) | 1995-07-19 | 1997-02-18 | Perlin; Mark W. | Method and system for sequencing genomes |
JP4286332B2 (en) | 1995-07-27 | 2009-06-24 | 富士通株式会社 | Method and apparatus for automatic removal of vector part contained in DNA base sequence |
JP3675521B2 (en) | 1995-07-27 | 2005-07-27 | 富士通株式会社 | Fragment waveform display method and apparatus when determining DNA base sequence |
US5866330A (en) | 1995-09-12 | 1999-02-02 | The Johns Hopkins University School Of Medicine | Method for serial analysis of gene expression |
US6119120A (en) | 1996-06-28 | 2000-09-12 | Microsoft Corporation | Computer implemented methods for constructing a compressed data structure from a data string and for using the data structure to find data patterns in the data string |
US6189013B1 (en) | 1996-12-12 | 2001-02-13 | Incyte Genomics, Inc. | Project-based full length biomolecular sequence database |
DE69739981D1 (en) | 1996-10-31 | 2010-10-14 | Human Genome Sciences Inc | Streptococcus pneumoniae antigens and vaccines |
US5873052A (en) | 1996-11-06 | 1999-02-16 | The Perkin-Elmer Corporation | Alignment-based similarity scoring methods for quantifying the differences between related biopolymer sequences |
US6117634A (en) | 1997-03-05 | 2000-09-12 | The Regents Of The University Of Michigan | Nucleic acid sequencing and mapping |
US5966711A (en) | 1997-04-15 | 1999-10-12 | Alpha Gene, Inc. | Autonomous intelligent agents for the annotation of genomic databases |
AU737710B2 (en) | 1997-05-14 | 2001-08-30 | Human Genome Sciences, Inc. | Antimicrobial peptide |
US6518023B1 (en) | 1997-06-27 | 2003-02-11 | Lynx Therapeutics, Inc. | Method of mapping restriction sites in polynucleotides |
US6905837B2 (en) | 1997-09-02 | 2005-06-14 | New England Biolabs, Inc. | Method for screening restriction endonucleases |
US7099777B1 (en) | 1997-09-05 | 2006-08-29 | Affymetrix, Inc. | Techniques for identifying confirming mapping and categorizing nucleic acids |
US6223175B1 (en) | 1997-10-17 | 2001-04-24 | California Institute Of Technology | Method and apparatus for high-speed approximate sub-string searches |
US6054276A (en) | 1998-02-23 | 2000-04-25 | Macevicz; Stephen C. | DNA restriction site mapping |
US6505126B1 (en) | 1998-03-25 | 2003-01-07 | Schering-Plough Corporation | Method to identify fungal genes useful as antifungal targets |
EP1068351A2 (en) | 1998-04-02 | 2001-01-17 | Tellus Genetic Resources, Inc. | A method for obtaining a plant with a genetic lesion in a gene sequence |
US6223186B1 (en) | 1998-05-04 | 2001-04-24 | Incyte Pharmaceuticals, Inc. | System and method for a precompiled database for biomolecular sequence information |
CA2321821A1 (en) | 1998-06-26 | 2000-01-06 | Visible Genetics Inc. | Method for sequencing nucleic acids with reduced errors |
US6223128B1 (en) | 1998-06-29 | 2001-04-24 | Dnstar, Inc. | DNA sequence assembly system |
JP4040764B2 (en) | 1998-08-19 | 2008-01-30 | 富士通株式会社 | Gene motif extraction processing apparatus, gene motif extraction processing method, and recording medium storing gene motif extraction processing program |
US6607888B2 (en) | 1998-10-20 | 2003-08-19 | Wisconsin Alumni Research Foundation | Method for analyzing nucleic acid reactions |
US6961664B2 (en) | 1999-01-19 | 2005-11-01 | Maxygen | Methods of populating data structures for use in evolutionary simulations |
US7024312B1 (en) | 1999-01-19 | 2006-04-04 | Maxygen, Inc. | Methods for making character strings, polynucleotides and polypeptides having desired characteristics |
WO2000052178A1 (en) | 1999-03-01 | 2000-09-08 | Insight Strategy & Marketing Ltd. | Polynucleotide encoding a polypeptide having heparanase activity and expression of same in genetically modified cells |
US6528260B1 (en) | 1999-03-25 | 2003-03-04 | Genset, S.A. | Biallelic markers related to genes involved in drug metabolism |
US6287773B1 (en) | 1999-05-19 | 2001-09-11 | Hoechst-Ariad Genomics Center | Profile searching in nucleic acid sequences using the fast fourier transformation |
US6421613B1 (en) | 1999-07-09 | 2002-07-16 | Pioneer Hi-Bred International, Inc. | Data processing of the maize prolifera genetic sequence |
AU6611900A (en) | 1999-07-30 | 2001-03-13 | Agy Therapeutics, Inc. | Techniques for facilitating identification of candidate genes |
US6990238B1 (en) | 1999-09-30 | 2006-01-24 | Battelle Memorial Institute | Data processing, analysis, and visualization system for use with disparate data types |
US6898530B1 (en) | 1999-09-30 | 2005-05-24 | Battelle Memorial Institute | Method and apparatus for extracting attributes from sequence strings and biopolymer material |
GB2356401A (en) | 1999-11-19 | 2001-05-23 | Proteom Ltd | Method for manipulating protein or DNA sequence data |
EP1103911A1 (en) | 1999-11-25 | 2001-05-30 | Applied Research Systems ARS Holding N.V. | Automated method for identifying related biomolecular sequences |
US6571230B1 (en) | 2000-01-06 | 2003-05-27 | International Business Machines Corporation | Methods and apparatus for performing pattern discovery and generation with respect to data sequences |
US6635423B2 (en) | 2000-01-14 | 2003-10-21 | Integriderm, Inc. | Informative nucleic acid arrays and methods for making same |
US6775622B1 (en) | 2000-01-31 | 2004-08-10 | Zymogenetics, Inc. | Method and system for detecting near identities in large DNA databases |
US6714874B1 (en) | 2000-03-15 | 2004-03-30 | Applera Corporation | Method and system for the assembly of a whole genome using a shot-gun data set |
JP3581291B2 (en) | 2000-03-14 | 2004-10-27 | 日立ソフトウエアエンジニアリング株式会社 | How to display the results of hybridization experiments |
US6760668B1 (en) | 2000-03-24 | 2004-07-06 | Bayer Healthcare Llc | Method for alignment of DNA sequences with enhanced accuracy and read length |
US6711558B1 (en) | 2000-04-07 | 2004-03-23 | Washington University | Associative database scanning and information retrieval |
EP1276860B1 (en) | 2000-04-28 | 2007-09-26 | Sangamo Biosciences Inc. | Databases of regulatory sequences; methods of making and using same |
US7801591B1 (en) | 2000-05-30 | 2010-09-21 | Vladimir Shusterman | Digital healthcare information management |
WO2001092991A2 (en) | 2000-05-30 | 2001-12-06 | Kosan Biosciences, Inc. | Design of polyketide synthase genes |
US6785614B1 (en) | 2000-05-31 | 2004-08-31 | The Regents Of The University Of California | End sequence profiling |
AU2001266948A1 (en) | 2000-06-14 | 2001-12-24 | Douglas M. Blair | Apparatus and method for providing sequence database comparison |
JP3431135B2 (en) | 2000-07-14 | 2003-07-28 | 独立行政法人農業技術研究機構 | Gene affinity search method and gene affinity search system |
US6713257B2 (en) | 2000-08-25 | 2004-03-30 | Rosetta Inpharmatics Llc | Gene discovery using microarrays |
WO2002022876A1 (en) | 2000-09-11 | 2002-03-21 | University Of Rochester | Method of identifying putative antibiotic resistance genes |
EP1328805A4 (en) | 2000-09-28 | 2007-10-03 | Wisconsin Alumni Res Found | System and process for validating, aligning and reordering one or more genetic sequence maps using at least one ordered restriction map |
US7054755B2 (en) | 2000-10-12 | 2006-05-30 | Iconix Pharmaceuticals, Inc. | Interactive correlation of compound information and genomic information |
US7444243B2 (en) | 2000-10-30 | 2008-10-28 | Monsanto Technology Llc | Probabilistic method for determining nucleic acid coding features |
US6763148B1 (en) | 2000-11-13 | 2004-07-13 | Visual Key, Inc. | Image recognition methods |
US20020177138A1 (en) | 2000-11-15 | 2002-11-28 | The United States Of America , Represented By The Secretary, Department Of Health And Human Services | Methods for the indentification of textual and physical structured query fragments for the analysis of textual and biopolymer information |
AU2002228834A1 (en) | 2000-12-01 | 2002-06-11 | Sri International | Data relationship model |
US7078504B2 (en) | 2000-12-01 | 2006-07-18 | Diversa Corporation | Enzymes having dehalogenase activity and methods of use thereof |
EP1350431B1 (en) | 2001-01-09 | 2011-05-25 | Kurita Water Industries Ltd. | Method of selecting antimicrobial agent and method of using the same |
US7614036B2 (en) | 2001-03-22 | 2009-11-03 | Robert D Bjornson | Method and system for dataflow creation and execution |
US6691109B2 (en) | 2001-03-22 | 2004-02-10 | Turbo Worx, Inc. | Method and apparatus for high-performance sequence comparison |
US6963865B2 (en) | 2001-04-05 | 2005-11-08 | International Business Machines Corporation | Method system and program product for data searching |
US6996477B2 (en) | 2001-04-19 | 2006-02-07 | Dana Farber Cancer Institute, Inc. | Computational subtraction method |
US7809509B2 (en) | 2001-05-08 | 2010-10-05 | Ip Genesis, Inc. | Comparative mapping and assembly of nucleic acid sequences |
IL158487A0 (en) | 2001-05-18 | 2004-05-12 | Wisconsin Alumni Res Found | Method for the synthesis of dna sequences |
US7065451B2 (en) | 2001-05-24 | 2006-06-20 | Board Of Regents, The University Of Texas System | Computer-based method for creating collections of sequences from a dataset of sequence identifiers corresponding to natural complex biopolymer sequences and linked to corresponding annotations |
CA2357263A1 (en) | 2001-09-07 | 2003-03-07 | Bioinformatics Solutions Inc. | New methods for faster and more sensitive homology search in dna sequences |
JP3530842B2 (en) | 2001-11-19 | 2004-05-24 | 株式会社日立製作所 | Nucleic acid base sequence assembling apparatus and operation method thereof |
US7058634B2 (en) | 2002-02-06 | 2006-06-06 | United Devices, Inc. | Distributed blast processing architecture and associated systems and methods |
AU2003215216A1 (en) | 2002-02-15 | 2003-09-09 | Applera Corporation | Methods for searching polynucleotide probe targets in databases |
US7809510B2 (en) | 2002-02-27 | 2010-10-05 | Ip Genesis, Inc. | Positional hashing method for performing DNA sequence similarity search |
CA2478964A1 (en) | 2002-03-11 | 2003-09-25 | Athenix Corporation | Integrated system for high throughput capture of genetic diversity |
AU2003224897A1 (en) | 2002-04-09 | 2003-10-27 | Kenneth L. Beattie | Oligonucleotide probes for genosensor chips |
US20040049354A1 (en) | 2002-04-26 | 2004-03-11 | Affymetrix, Inc. | Method, system and computer software providing a genomic web portal for functional analysis of alternative splice variants |
US7400980B2 (en) | 2002-05-30 | 2008-07-15 | Chan Sheng Liu | Methods of detecting DNA variation in sequence data |
US7254489B2 (en) | 2002-05-31 | 2007-08-07 | Microsoft Corporation | Systems, methods and apparatus for reconstructing phylogentic trees |
US6747643B2 (en) | 2002-06-03 | 2004-06-08 | Omnigon Technologies Ltd. | Method of detecting, interpreting, recognizing, identifying and comparing n-dimensional shapes, partial shapes, embedded shapes and shape collages using multidimensional attractor tokens |
US7061491B2 (en) | 2002-06-03 | 2006-06-13 | Omnigon Technologies Ltd. | Method for solving frequency, frequency distribution and sequence-matching problems using multidimensional attractor tokens |
US7007001B2 (en) | 2002-06-26 | 2006-02-28 | Microsoft Corporation | Maximizing mutual information between observations and hidden states to minimize classification errors |
US7016884B2 (en) | 2002-06-27 | 2006-03-21 | Microsoft Corporation | Probability estimate for K-nearest neighbor |
US7043371B2 (en) | 2002-08-16 | 2006-05-09 | American Museum Of Natural History | Method for search based character optimization |
US6983274B2 (en) | 2002-09-23 | 2006-01-03 | Aaron Thomas Patzer | Multiple alignment genome sequence matching processor |
US7606403B2 (en) | 2002-10-17 | 2009-10-20 | Intel Corporation | Model-based fusion of scanning probe microscopic images for detection and identification of molecular structures |
AU2002343175A1 (en) | 2002-11-28 | 2004-06-18 | Nokia Corporation | Method and device for determining and outputting the similarity between two data strings |
US7158889B2 (en) | 2002-12-20 | 2007-01-02 | International Business Machines Corporation | Gene finding using ordered sets |
DE10260805A1 (en) | 2002-12-23 | 2004-07-22 | Geneart Gmbh | Method and device for optimizing a nucleotide sequence for expression of a protein |
US7512498B2 (en) | 2002-12-31 | 2009-03-31 | Intel Corporation | Streaming processing of biological sequence matching |
US7396646B2 (en) | 2003-01-22 | 2008-07-08 | Modular Genetics, Inc. | Alien sequences |
US6988039B2 (en) | 2003-02-14 | 2006-01-17 | Eidogen, Inc. | Method for determining sequence alignment significance |
JP2004259119A (en) | 2003-02-27 | 2004-09-16 | Internatl Business Mach Corp <Ibm> | Computer system for screening base sequence, method for it, program for executing method in computer, and computer readable record medium storing program |
US20050026173A1 (en) | 2003-02-27 | 2005-02-03 | Methexis Genomics, N.V. | Genetic diagnosis using multiple sequence variant analysis combined with mass spectrometry |
WO2004079631A2 (en) | 2003-03-03 | 2004-09-16 | Koninklijke Philips Electronics N.V. | Method and arrangement for searching for strings |
CA2518046C (en) | 2003-03-04 | 2013-12-10 | Suntory Limited | Screening method for genes of brewing yeast |
US7041455B2 (en) | 2003-03-07 | 2006-05-09 | Illumigen Biosciences, Inc. | Method and apparatus for pattern identification in diploid DNA sequence data |
US7424369B2 (en) | 2003-04-04 | 2008-09-09 | Board Of Regents, The University Of Texas System | Physical-chemical property based sequence motifs and methods regarding same |
US7881873B2 (en) | 2003-04-29 | 2011-02-01 | The Jackson Laboratory | Systems and methods for statistical genomic DNA based analysis and evaluation |
US7711491B2 (en) | 2003-05-05 | 2010-05-04 | Lawrence Livermore National Security, Llc | Computational method and system for modeling, analyzing, and optimizing DNA amplification and synthesis |
US7205111B2 (en) | 2003-07-24 | 2007-04-17 | Marshfield Clinic | Rapid identification of bacteria from positive blood cultures |
US7475087B1 (en) | 2003-08-29 | 2009-01-06 | The United States Of America As Represented By The Secretary Of Agriculture | Computer display tool for visualizing relationships between and among data |
WO2005029280A2 (en) | 2003-09-19 | 2005-03-31 | Netezza Corporation | Performing sequence analysis as a multipart plan storing intermediate results as a relation |
US20070118296A1 (en) | 2003-11-07 | 2007-05-24 | Dna Software Inc. | System and methods for three dimensional molecular structural analysis |
US7328111B2 (en) | 2003-11-07 | 2008-02-05 | Mitsubishi Electric Research Laboratories, Inc. | Method for determining similarities between data sequences using cross-correlation matrices and deformation functions |
US7792894B1 (en) | 2004-04-05 | 2010-09-07 | Microsoft Corporation | Group algebra techniques for operating on matrices |
EP1740719B1 (en) | 2004-04-09 | 2011-06-22 | Trustees of Boston University | Method for de novo detection of sequences in nucleic acids:target sequencing by fragmentation |
US7325013B2 (en) | 2004-04-15 | 2008-01-29 | Id3Man, Inc. | Database with efficient fuzzy matching |
US7313555B2 (en) | 2004-04-30 | 2007-12-25 | Anácapa North | Method for computing the minimum edit distance with fine granularity suitably quickly |
US7599802B2 (en) | 2004-06-10 | 2009-10-06 | Evan Harwood | V-life matching and mating system |
US7747641B2 (en) | 2004-07-09 | 2010-06-29 | Microsoft Corporation | Modeling sequence and time series data in predictive analytics |
US7181373B2 (en) | 2004-08-13 | 2007-02-20 | Agilent Technologies, Inc. | System and methods for navigating and visualizing multi-dimensional biological data |
US7386523B2 (en) | 2004-09-29 | 2008-06-10 | Intel Corporation | K-means clustering using t-test computation |
US7627537B2 (en) | 2004-10-28 | 2009-12-01 | Intel Corporation | Score result reuse for Bayesian network structure learning |
WO2006062684A2 (en) | 2004-11-10 | 2006-06-15 | Attagene, Inc. | Populations of reporter sequences and methods of their use |
US7590291B2 (en) | 2004-12-06 | 2009-09-15 | Intel Corporation | Method and apparatus for non-parametric hierarchical clustering |
US7788043B2 (en) | 2004-12-14 | 2010-08-31 | New York University | Methods, software arrangements and systems for aligning sequences which utilizes non-affine gap penalty procedure |
US7424371B2 (en) | 2004-12-21 | 2008-09-09 | Helicos Biosciences Corporation | Nucleic acid analysis |
WO2006084132A2 (en) | 2005-02-01 | 2006-08-10 | Agencourt Bioscience Corp. | Reagents, methods, and libraries for bead-based squencing |
KR101138864B1 (en) | 2005-03-08 | 2012-05-14 | 삼성전자주식회사 | Method for designing primer and probe set, primer and probe set designed by the method, kit comprising the set, computer readable medium recorded thereon a program to execute the method, and method for identifying target sequence using the set |
KR20070115964A (en) | 2005-03-18 | 2007-12-06 | 바이오인포르마티카 엘엘씨 | System, method and computer program for non-binary sequence comparison |
US7962291B2 (en) | 2005-09-30 | 2011-06-14 | Affymetrix, Inc. | Methods and computer software for detecting splice variants |
JP5329968B2 (en) | 2005-11-10 | 2013-10-30 | サウンドハウンド インコーポレイテッド | How to store and retrieve non-text based information |
US20070134676A1 (en) | 2005-12-08 | 2007-06-14 | Barrett Michael T | Methods and compositions for performing sample heterogeneity corrected comparative genomic hybridization (CGH) |
US20070134692A1 (en) | 2005-12-09 | 2007-06-14 | Affymetrix, Inc. | Method, system and, computer software for efficient update of probe array annotation data |
US20070141612A1 (en) | 2005-12-16 | 2007-06-21 | New England Biolabs, Inc. | Systematic in silico selection method for identifying drug targets in pathogens |
KR100707213B1 (en) | 2006-03-21 | 2007-04-13 | 삼성전자주식회사 | Method and apparatus for choosing nucleic acid probes for microarrays |
US20090062129A1 (en) | 2006-04-19 | 2009-03-05 | Agencourt Personal Genomics, Inc. | Reagents, methods, and libraries for gel-free bead-based sequencing |
US7822782B2 (en) | 2006-09-21 | 2010-10-26 | The University Of Houston System | Application package to automatically identify some single stranded RNA viruses from characteristic residues of capsid protein or nucleotide sequences |
US7805460B2 (en) | 2006-10-26 | 2010-09-28 | Polytechnic Institute Of New York University | Generating a hierarchical data structure associated with a plurality of known arbitrary-length bit strings used for detecting whether an arbitrary-length bit string input matches one of a plurality of known arbitrary-length bit string |
CA2667374A1 (en) | 2006-10-26 | 2008-06-12 | Integrated Dna Technologies, Inc. | Fingerprint analysis for a plurality of oligonucleotides |
US7979215B2 (en) | 2007-07-30 | 2011-07-12 | Agilent Technologies, Inc. | Methods and systems for evaluating CGH candidate probe nucleic acid sequences |
WO2011026136A1 (en) | 2009-08-31 | 2011-03-03 | Life Technologies Corporation | Low-volume sequencing system and method of use |
US9268903B2 (en) * | 2010-07-06 | 2016-02-23 | Life Technologies Corporation | Systems and methods for sequence data alignment quality assessment |
- 2011-07-06: US application 13/177,267, published as US9268903B2 (status: Active)
- 2016-01-20: US application 15/001,389, published as US20160140291A1 (status: Abandoned)
- 2019-05-24: US application 16/421,653, published as US20190348153A1 (status: Abandoned)
- 2023-06-21: US application 18/338,488, published as US20230410946A1 (status: Pending)
Also Published As
Publication number | Publication date |
---|---|
US20160140291A1 (en) | 2016-05-19 |
US20190348153A1 (en) | 2019-11-14 |
US9268903B2 (en) | 2016-02-23 |
US20120011086A1 (en) | 2012-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230410946A1 (en) | Systems and methods for sequence data alignment quality assessment | |
US20240021272A1 (en) | Systems and methods for identifying sequence variation | |
US20210108264A1 (en) | Systems and methods for identifying sequence variation | |
US20210292831A1 (en) | Systems and methods to detect copy number variation | |
US20210217491A1 (en) | Systems and methods for detecting homopolymer insertions/deletions | |
EP3052651B1 (en) | Systems and methods for detecting structural variants | |
US20210210164A1 (en) | Systems and methods for mapping sequence reads | |
US20110270533A1 (en) | Systems and methods for analyzing nucleic acid sequences | |
US20120330559A1 (en) | Systems and methods for hybrid assembly of nucleic acid sequences | |
US20230083827A1 (en) | Systems and methods for identifying somatic mutations | |
US20140274733A1 (en) | Methods and Systems for Local Sequence Alignment | |
US11021734B2 (en) | Systems and methods for validation of sequencing results | |
US20170206313A1 (en) | Using Flow Space Alignment to Distinguish Duplicate Reads | |
US20230340586A1 (en) | Systems and methods for paired end sequencing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | Owner name: LIFE TECHNOLOGIES CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: ZHANG, ZHENG; UTIRAMERUR, SOWMI; HYLAND, FIONA; REEL/FRAME: 064856/0935. Effective date: 20110714 |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |