WO2014145503A2 - Sequence alignment using divide and conquer maximum oligonucleotide mapping (DCMOM), apparatus, system and method related thereto - Google Patents

Sequence alignment using divide and conquer maximum oligonucleotide mapping (DCMOM), apparatus, system and method related thereto

Info

Publication number
WO2014145503A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
reference sequence
elements
query sequence
query
Prior art date
Application number
PCT/US2014/030288
Other languages
English (en)
Other versions
WO2014145503A3 (fr)
Inventor
Yuan Gao
Hun Ki LIM
Joo Heon SHIN
Original Assignee
Lieber Institute For Brain Development
Priority date
Filing date
Publication date
Application filed by Lieber Institute For Brain Development
Publication of WO2014145503A2
Publication of WO2014145503A3

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B30/20 Sequence assembly

Definitions

  • the determining further comprises recursively repeating steps a-d with additional query sequences until the remaining unmatched portion of the first query sequence is smaller than 3 elements.
  • FIG. 3 depicts the example of FIG. 2 farther along in the recursive mapping process.
  • FIG. 10 depicts DCMOM's detection of insertions corresponding to the location of known variations.
  • the reference sequence may be divided into three subsequences: the Matched Reference Sequence 14.450 corresponding to the area of the reference sequence matched by the Maximally Extended Match 14.420, and those portions to the sides of the Matched Reference Sequence, the Left-of-Match Reference Sequence 14.440 and the Right-of-Match Reference Sequence 14.460.
  • the Left-of-Match and Right-of-Match Reference Sequences may be used in a recursive instance of the method as the reference sequence.
  • An IMT is the output data from an instance of DCMOM, without complete node merging, represented visually as a tree-like structure.
  • the Assembled Result 14.900 may represent a Final Junction Tree ("FJT"), if node merging has already been performed on the Right Side Result 14.630 and/or Left Side Result 14.730.
  • the FJT is, in certain embodiments, a visual representation of the output data from an instance of DCMOM, similar to an IMT, but where some or all of the nodes have been merged together by classifying the joints between them.
  • the Assembled Result 14.900 may in some embodiments be further processed 14.940 by the current instance of DCMOM, including by additional node merging. In some embodiments, further processing or node merging will not be performed on the Assembled Result 14.900 and it will simply be returned 14.950.
  • the intron len defines the minimum number of elements that must be found between two adjacent mapped nodes for the joint between the two nodes to be classified as an intron.
  • the intron len may, in some embodiments, be between about 2 and about 10,000 elements or nodes. In some embodiments, intron len may be the same as min frag len. Alternatively, intron len could be between about 3 and about 32 and, in certain embodiments, may be 4, 5, 6, 7, 8, 9, 12, 15, 20 or 30.
  • the entire transcriptome of an organism can be aligned and assembled by the invention.
  • the invention can detect and identify the presence of and location of introns and exons within the transcriptome of an organism, including the junctions between introns and exons, between introns and introns, and between exons and exons, as well as splice junctions, and can identify splicing and alternative splicing events.
  • the systems and methods of the invention can also detect and identify known or predicted combinations of exons, and unexpected exon pairs that occur through exon skipping, cryptic splicing, gene fusions, or by any other means.
  • Sequencing instruments for nucleic acids can include, for example, any of the high throughput sequencing machines from 454 Life Sciences/Roche, Illumina/Solexa, Applied Biosystems/Life Technologies (SOLiD), Helicos Biosciences, Complete Genomics, and Ion Torrent Systems (which generate "next generation sequencing" or "NGS" data), as well as the more traditional machines such as the Sanger sequencing machines.
  • BED Tools was used to generate exon coverage statistics for reads aligned within the genomic regions annotated as exons in Ref-seq.
  • the breadth of coverage of exons is compared with that of other methods over depths of coverage up to 100, as displayed in FIG. 12 and FIG. 13.
  • DCMOM is superior to TopHat (without annotation).
  • DCMOM's overall performance compares favorably to RUM, GSNAP or TopHat with annotation; however, DCMOM does not require the annotation that these other methods require.
  • Not requiring annotation is advantageous at least because systems that require annotation are limited to identifying already discovered exons, junctions and variants.
  • Because DCMOM, a system that does not use annotation, has comparable performance to those systems that do use annotation, it is apparent that DCMOM provides a way to discover novel exons, junctions and variants, especially mini exons, not previously reported or marked in annotations.
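
The excerpts above describe splitting the reference around a maximally extended match and classifying the joint between adjacent mapped nodes, with intron len as the minimum gap for an intron. The following is a minimal Python sketch of those two steps, not the patented implementation. The function names `split_reference` and `classify_joint`, the default `intron_len=4`, and every classification rule other than the intron threshold are assumptions made for illustration; the source names the joint categories but does not give the rules for telling them apart.

```python
def split_reference(ref: str, match_start: int, match_len: int):
    """Split ref into (left-of-match, matched, right-of-match) subsequences."""
    end = match_start + match_len
    return ref[:match_start], ref[match_start:end], ref[end:]


def classify_joint(q_gap: int, r_gap: int, intron_len: int = 4) -> str:
    """Classify the joint between two adjacent mapped nodes.

    q_gap: unmatched query elements between the two nodes.
    r_gap: reference elements between the two nodes.
    intron_len: minimum reference gap for an intron (value is an assumption).
    All rules below are illustrative assumptions, not taken from the source.
    """
    if q_gap == 0 and r_gap == 0:
        return "contiguous"
    if q_gap == 0 and r_gap >= intron_len:
        return "intron"          # long reference gap, nothing skipped in the query
    if q_gap == 0:
        return "deletion"        # short reference gap, nothing skipped in the query
    if r_gap == 0:
        return "insertion"       # extra query elements with no reference gap
    if q_gap == r_gap == 1:
        return "SNP"             # single mismatched element
    if q_gap == r_gap:
        return "substitution"    # equal-length mismatched run
    return "indel"               # unequal gaps on both sides


left, matched, right = split_reference("CCACGTTGCAGG", 2, 8)
print(left, matched, right)
print(classify_joint(0, 10), classify_joint(0, 2), classify_joint(1, 1))
```

In a recursive instance of the method, `left` and `right` would each serve as the reference sequence for the next level, as the excerpt on the Left-of-Match and Right-of-Match Reference Sequences states.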

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A novel apparatus, system and method for aligning sequences of arbitrary string data, including nucleotide sequences, are disclosed. A root node is selected from a query sequence and located in the reference sequence. The root node is extended to encompass adjacent elements that are present in both the query and reference sequences, forming an extended root node match. The area to the left and/or right of the extended root node match is then searched recursively using the same method to identify additional matches. When the search is complete, the joints between the identified nodes are classified according to their characteristics as, e.g., SNPs, deletions, substitutions, insertions, indels and/or introns. The apparatus may be included in a DNA sequencing machine, or it may be a stand-alone machine. Non-biological applications are also disclosed.
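
The recursive procedure summarized in the abstract can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the method as claimed: the names `Node` and `map_recursive`, the fixed seed length `seed_len`, the choice to try seeds outward from the middle of the query window, and the use of exact string search are all hypothetical. The stopping threshold `min_len=3` follows the excerpt that recursion stops once the unmatched portion is smaller than 3 elements.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    q_start: int  # start of the match in the query (inclusive)
    r_start: int  # start of the match in the reference (inclusive)
    length: int   # number of matched elements


def map_recursive(query: str, ref: str, q_lo: int, q_hi: int,
                  r_lo: int, r_hi: int,
                  seed_len: int = 4, min_len: int = 3) -> List[Node]:
    """Map query[q_lo:q_hi] onto ref[r_lo:r_hi]; nodes are returned left to right."""
    if q_hi - q_lo < min_len or r_hi - r_lo < min_len:
        return []
    k = min(seed_len, q_hi - q_lo)
    # Assumption: try candidate root nodes outward from the middle of the window.
    mid = q_lo + (q_hi - q_lo - k) // 2
    for s in sorted(range(q_lo, q_hi - k + 1), key=lambda i: abs(i - mid)):
        r = ref.find(query[s:s + k], r_lo, r_hi)  # locate the root node
        if r == -1:
            continue
        # Extend the root node leftwards while elements agree.
        qs, rs = s, r
        while qs > q_lo and rs > r_lo and query[qs - 1] == ref[rs - 1]:
            qs, rs = qs - 1, rs - 1
        # Extend the root node rightwards while elements agree.
        qe, re_ = s + k, r + k
        while qe < q_hi and re_ < r_hi and query[qe] == ref[re_]:
            qe, re_ = qe + 1, re_ + 1
        # Divide and conquer: recurse on the regions left and right of the match.
        left = map_recursive(query, ref, q_lo, qs, r_lo, rs, seed_len, min_len)
        right = map_recursive(query, ref, qe, q_hi, re_, r_hi, seed_len, min_len)
        return left + [Node(qs, rs, qe - qs)] + right
    return []


# Toy data: two 8-element "exons" separated by a 10-element gap in the reference.
ref = "CC" + "ACGTTGCA" + "GGGGGGGGGG" + "TTCAGGAC" + "AA"
query = "ACGTTGCA" + "TTCAGGAC"
nodes = map_recursive(query, ref, 0, len(query), 0, len(ref))
print(nodes)
```

On this toy input the two query halves map to reference positions 2 and 20. The 10-element reference gap between the nodes, with no unmatched query elements, is the kind of joint that a later classification step could label as an intron.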
PCT/US2014/030288 2013-03-15 2014-03-17 Sequence alignment using divide and conquer maximum oligonucleotide mapping (DCMOM), apparatus, system and method related thereto WO2014145503A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361791948P 2013-03-15 2013-03-15
US61/791,948 2013-03-15

Publications (2)

Publication Number Publication Date
WO2014145503A2 (fr) 2014-09-18
WO2014145503A3 (fr) 2014-11-06

Family

ID=51538483

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/030288 WO2014145503A2 (fr) 2013-03-15 2014-03-17 Sequence alignment using divide and conquer maximum oligonucleotide mapping (DCMOM), apparatus, system and method related thereto

Country Status (1)

Country Link
WO (1) WO2014145503A2 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170116370A1 (en) * 2015-10-21 2017-04-27 Coherent Logix, Incorporated DNA Alignment using a Hierarchical Inverted Index Table
WO2018071054A1 (fr) * 2016-10-11 2018-04-19 Genomsys Sa Method and system for selective access of stored or transmitted bioinformatics data
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
US11763918B2 (en) 2016-10-11 2023-09-19 Genomsys Sa Method and apparatus for the access to bioinformatics data structured in access units
US12071669B2 (en) 2016-02-12 2024-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for detection of abnormal karyotypes

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090024618A1 (en) * 2003-11-26 2009-01-22 Wei Fan System and method for indexing weighted-sequences in large databases
US20090150313A1 (en) * 2007-12-06 2009-06-11 Andre Heilper Vectorization of dynamic-time-warping computation using data reshaping
US20120041977A1 (en) * 2009-04-13 2012-02-16 Hitachi, Ltd. Pair character string retrieval system
WO2012158621A1 (fr) * 2011-05-13 2012-11-22 Indiana University Research And Technology Corporation Secure and scalable mapping of human sequencing reads on hybrid clouds

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
US11568957B2 (en) 2015-05-18 2023-01-31 Regeneron Pharmaceuticals Inc. Methods and systems for copy number variant detection
CN108140071A (zh) * 2015-10-21 2018-06-08 Coherent Logix, Incorporated DNA alignment using a hierarchical inverted index table
US20170116370A1 (en) * 2015-10-21 2017-04-27 Coherent Logix, Incorporated DNA Alignment using a Hierarchical Inverted Index Table
JP2018535484A (ja) * 2015-10-21 2018-11-29 Coherent Logix, Incorporated DNA alignment using a hierarchical inverted index table
CN108140071B (zh) * 2015-10-21 2022-04-29 Coherent Logix, Incorporated DNA alignment using a hierarchical inverted index table
WO2017070514A1 (fr) * 2015-10-21 2017-04-27 Coherent Logix, Incorporated DNA alignment using a hierarchical inverted index table
US11594301B2 (en) 2015-10-21 2023-02-28 Coherent Logix, Incorporated DNA alignment using a hierarchical inverted index table
US12087403B2 (en) 2015-10-21 2024-09-10 Coherent Logix, Incorporated DNA alignment using a hierarchical inverted index table
US12071669B2 (en) 2016-02-12 2024-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for detection of abnormal karyotypes
WO2018071054A1 (fr) * 2016-10-11 2018-04-19 Genomsys Sa Method and system for selective access of stored or transmitted bioinformatics data
CN110603595A (zh) * 2016-10-11 2019-12-20 Genomsys Sa Method and system for reconstructing a genomic reference sequence from compressed genomic sequence reads
US11404143B2 (en) 2016-10-11 2022-08-02 Genomsys Sa Method and systems for the indexing of bioinformatics data
CN110603595B (zh) * 2016-10-11 2023-08-08 Genomsys Sa Method and system for reconstructing a genomic reference sequence from compressed genomic sequence reads
US11763918B2 (en) 2016-10-11 2023-09-19 Genomsys Sa Method and apparatus for the access to bioinformatics data structured in access units

Also Published As

Publication number Publication date
WO2014145503A3 (fr) 2014-11-06

Similar Documents

Publication Publication Date Title
US11697835B2 (en) Systems and methods for epigenetic analysis
US20220238180A1 (en) Methods and systems for genome analysis
Liao et al. Current challenges and solutions of de novo assembly
Chatterjee et al. Comparison of alignment software for genome-wide bisulphite sequence data
US20200165683A1 (en) Systems and methods for analyzing circulating tumor dna
Schubert et al. Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX
Siragusa et al. Fast and accurate read mapping with approximate seeds and multiple backtracking
Shajii et al. Fast genotyping of known SNPs through approximate k-mer matching
Tripathi et al. Next-generation sequencing revolution through big data analytics
JP2021525104A (ja) Systems and methods for the analysis of alternative splicing
Coonrod et al. Developing genome and exome sequencing for candidate gene identification in inherited disorders: an integrated technical and bioinformatics approach
JP2017500004A (ja) Methods and systems for genotyping genetic samples
Knowles et al. Grape RNA-Seq analysis pipeline environment
WO2014145503A2 (fr) Sequence alignment using divide and conquer maximum oligonucleotide mapping (DCMOM), apparatus, system and method related thereto
Li et al. An NGS workflow blueprint for DNA sequencing data and its application in individualized molecular oncology
Wu et al. SOAPfusion: a robust and effective computational fusion discovery tool for RNA-seq reads
Chen et al. Recent advances in sequence assembly: principles and applications
Sater et al. UMI-VarCal: a new UMI-based variant caller that efficiently improves low-frequency variant detection in paired-end sequencing NGS libraries
US20130253839A1 (en) Surprisal data reduction of genetic data for transmission, storage, and analysis
US20170076047A1 (en) Systems and methods for genetic testing
Molinari et al. Transcriptome analysis using RNA-Seq from experiments with and without biological replicates: a review
Moraga et al. BrumiR: A toolkit for de novo discovery of microRNAs from sRNA-seq data
CN111542616A (zh) Correction of sequence errors caused by deamination
Deshpande et al. RNA-seq data science: From raw data to effective interpretation
Costa-Silva et al. Computational methods for differentially expressed gene analysis from RNA-Seq: an overview

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14765762

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 14765762

Country of ref document: EP

Kind code of ref document: A2