CN114783523A - DNA alignment using a hierarchical inverted index table - Google Patents

DNA alignment using a hierarchical inverted index table

Info

Publication number
CN114783523A
Authority
CN
China
Prior art keywords
index table
entry
length
reference data
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210452629.4A
Other languages
English (en)
Chinese (zh)
Inventor
M·B·多尔
J·D·加玛尼
S·V·伍德
D·G·阿拉斯塔斯
M·A·亨特
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Coherent Logix Inc
Original Assignee
Coherent Logix Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Coherent Logix Inc
Publication of CN114783523A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
CN202210452629.4A 2015-10-21 2016-10-21 DNA alignment using a hierarchical inverted index table Pending CN114783523A (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201562244541P 2015-10-21 2015-10-21
US62/244,541 2015-10-21
CN201680061446.2A CN108140071B (zh) 2015-10-21 2016-10-21 DNA alignment using a hierarchical inverted index table
PCT/US2016/058183 WO2017070514A1 (en) 2015-10-21 2016-10-21 Dna alignment using a hierarchical inverted index table

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201680061446.2A Division CN108140071B (zh) 2015-10-21 2016-10-21 DNA alignment using a hierarchical inverted index table

Publications (1)

Publication Number Publication Date
CN114783523A (zh) 2022-07-22

Family

ID=58557902

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210452629.4A Pending CN114783523A (zh) 2015-10-21 2016-10-21 DNA alignment using a hierarchical inverted index table
CN201680061446.2A Active CN108140071B (zh) 2015-10-21 2016-10-21 DNA alignment using a hierarchical inverted index table

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201680061446.2A Active CN108140071B (zh) 2015-10-21 2016-10-21 DNA alignment using a hierarchical inverted index table

Country Status (7)

Country Link
US (3) US11594301B2
EP (1) EP3365821B1
JP (1) JP6884143B2
KR (1) KR20180072684A
CN (2) CN114783523A
BR (1) BR112018007092B1
WO (1) WO2017070514A1

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8705623B2 (en) * 2009-10-02 2014-04-22 Texas Instruments Incorporated Line-based compression for digital image data
US12125559B2 (en) * 2019-05-14 2024-10-22 Samsung Electronics Co., Ltd. Parallelizable sequence alignment systems and methods
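CN112948446B (zh) * 2019-11-26 2024-08-16 Beijing Jingdong Zhenshi Information Technology Co., Ltd. Method and device for matching product documents
CN111402959A (zh) * 2020-03-13 2020-07-10 Suzhou Inspur Intelligent Technology Co., Ltd. Sequence alignment method, system, device, and readable storage medium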
IL281960B2 (en) 2021-04-01 2025-12-01 Zimmerman Israel System and method for rapid statistical pattern discovery
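CN114329135B (zh) * 2021-12-08 2025-06-03 Tencent Technology (Shenzhen) Co., Ltd. Index point offline sorting method, apparatus, device, and storage medium
CN116010427B (zh) * 2023-02-13 2025-11-14 ChangXin Memory Technologies, Inc. Number allocation method, apparatus, electronic device, and storage medium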

Citations (5)

Publication number Priority date Publication date Assignee Title
US20040153255A1 (en) * 2003-02-03 2004-08-05 Ahn Tae-Jin Apparatus and method for encoding DNA sequence, and computer readable medium
US20120089338A1 (en) * 2009-03-13 2012-04-12 Life Technologies Corporation Computer implemented method for indexing reference genome
CN103336916A (zh) * 2013-07-05 2013-10-02 Academy of Mathematics and Systems Science, Chinese Academy of Sciences Sequencing read mapping method and system
WO2014145503A2 (en) * 2013-03-15 2014-09-18 Lieber Institute For Brain Development Sequence alignment using divide and conquer maximum oligonucleotide mapping (dcmom), apparatus, system and method related thereto
CN104217134A (zh) * 2013-05-29 2014-12-17 Noblis, Inc. Systems and methods for SNP analysis and genome sequencing

Family Cites Families (13)

Publication number Priority date Publication date Assignee Title
JPH08272824A (ja) * 1995-03-31 1996-10-18 Hitachi Software Eng Co Ltd Method for automatically searching gene sequence data
WO2005096208A1 (ja) * 2004-03-31 2005-10-13 Bio-Think Tank Co., Ltd. Base sequence search device and base sequence search method
US7702640B1 (en) * 2005-12-29 2010-04-20 Amazon Technologies, Inc. Stratified unbalanced trees for indexing of data items within a computer system
WO2007137225A2 (en) * 2006-05-19 2007-11-29 The University Of Chicago Method for indexing nucleic acid sequences for computer based searching
US8271206B2 (en) * 2008-04-21 2012-09-18 Softgenetics Llc DNA sequence assembly methods of short reads
CN101984445B (zh) * 2010-03-04 2012-03-14 BGI Shenzhen Co., Ltd. Method and system for typing based on sequencing reads of polymerase chain reaction products
US20140163900A1 (en) * 2012-06-02 2014-06-12 Whitehead Institute For Biomedical Research Analyzing short tandem repeats from high throughput sequencing data for genetic applications
US9679104B2 (en) * 2013-01-17 2017-06-13 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US9792405B2 (en) * 2013-01-17 2017-10-17 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US10381106B2 (en) * 2013-01-28 2019-08-13 Hasso-Plattner-Institut Fuer Softwaresystemtechnik Gmbh Efficient genomic read alignment in an in-memory database
NL2011817C2 (en) 2013-11-19 2015-05-26 Genalice B V A method of generating a reference index data structure and method for finding a position of a data pattern in a reference data structure.
US9886561B2 (en) * 2014-02-19 2018-02-06 The Regents Of The University Of California Efficient encoding and storage and retrieval of genomic data
NL2013120B1 (en) * 2014-07-03 2016-09-20 Genalice B V A method for finding associated positions of bases of a read on a reference genome.

Non-Patent Citations (1)

Title
CHEN, S. et al.: "SEME: A FAST MAPPER OF ILLUMINA SEQUENCING READS WITH STATISTICAL EVALUATION", JOURNAL OF COMPUTATIONAL BIOLOGY, 20 November 2013 (2013-11-20), pages 847 - 860 *

Also Published As

Publication number Publication date
EP3365821A1 (en) 2018-08-29
EP3365821B1 (en) 2022-06-29
JP2018535484A (ja) 2018-11-29
JP6884143B2 (ja) 2021-06-09
KR20180072684A (ko) 2018-06-29
US12087403B2 (en) 2024-09-10
US20250174304A1 (en) 2025-05-29
WO2017070514A1 (en) 2017-04-27
US20240203527A1 (en) 2024-06-20
CN108140071A (zh) 2018-06-08
US20170116370A1 (en) 2017-04-27
CN108140071B (zh) 2022-04-29
US11594301B2 (en) 2023-02-28
BR112018007092B1 (pt) 2024-02-20
EP3365821A4 (en) 2019-06-26
BR112018007092A2 (pt) 2018-10-23

Similar Documents

Publication Publication Date Title
US12087403B2 (en) DNA alignment using a hierarchical inverted index table
Ndiaye et al. When less is more: sketching with minimizers in genomics
US20200243162A1 (en) Method, system, and computing device for optimizing computing operations of gene sequencing system
Huo et al. CS2A: A compressed suffix array-based method for short read alignment
Vezzi Next generation sequencing revolution challenges: Search, assemble, and validate genomes
US12387818B2 (en) Memory allocation to optimize computer operations of seeding for Burrows-Wheeler alignment
US12412641B2 (en) Merging alignment and sorting to optimize computer operations for gene sequencing pipeline
JP7352985B2 (ja) Handling of biological sequence information
Zeng et al. Improved parallel processing of massive de Bruijn graph for genome assembly
Minkin Applications of the Compacted de Bruijn Graph in Comparative Genomics
US20210174904A1 (en) Merging Duplicate Marking to Optimize Computer Operations for Gene Sequencing Pipeline
Ferrolho Optimization of a Genomic Data Compressor Using Metameric Genetic Algorithms
Luo Clustering for DNA Storage
Chu Improving sequence analysis with probabilistic data structures and algorithms
CN119046711A (zh) Method for efficient clustering of transcriptome long-read RNA-seq data
Mohamadi Parallel algorithms and software tools for high-throughput sequencing data
Kuosmanen Third-generation RNA-sequencing analysis: graph alignment and transcript assembly with long reads.
HK40045105B (zh) Method for determining a set of small nucleic acid sequences and application thereof
Vezzi et al. Algorithms and Data Structures for Next‐Generation Sequences
Palopoli et al. Discovering frequent structured patterns from string databases: an application to biological sequences
Nadalin Paired is better: local assembly algorithms for NGS paired reads and applications to RNA-Seq
Pahadia Conservative Error Correction of Sequencing Data
Liu et al. CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read Alignment
Boža et al. Fishing in Read Collections: Memory Efficient Indexing for Sequence Assembly
Vicedomini Alignment and reconciliation strategies for large-scale de novo assembly

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination