CN114930724A - Method and device for creating gene mutation dictionary, and method and device for compressing genomic data using the dictionary - Google Patents

Method and device for creating gene mutation dictionary, and method and device for compressing genomic data using the dictionary

Info

Publication number
CN114930724A
Authority
CN
China
Prior art keywords
data
dictionary
mutant
mutation
genome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980102589.7A
Other languages
English (en)
Inventor
徐崇钧
周玉君
邓梓晴
龚梅花
蒋慧
徐讯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MGI Tech Co Ltd
Original Assignee
MGI Tech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MGI Tech Co Ltd
Publication of CN114930724A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • H03M7/4031 Fixed length to variable length coding
    • H03M7/4037 Prefix coding
    • H03M7/4043 Adaptive prefix coding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70 Type of the data to be coded, other than image and sound

Abstract

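A method for creating a gene mutation dictionary, and a method and device for compressing genomic data using the gene mutation dictionary. The method for creating the gene mutation dictionary comprises: obtaining genome sequence data of multiple individuals of a species and reference genome data of that species; aligning the genome sequence data of each individual to the reference genome data to obtain, for each individual, the mutation results of its genome sequence data relative to the reference genome data; dividing the genome of the species into a number of biologically meaningful unit partitions; and, based on the mutation results, compiling statistics on the mutants of each unit partition separately, generating all mutant types that each unit partition exhibits across the multiple individuals, and numbering the mutant types to obtain the gene mutation dictionary. The invention addresses the problem of genomic data compression, markedly reducing the storage volume and greatly lowering the cost of storing the data.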
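The dictionary-building steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the alignment step has already produced, for each individual, a list of `(position, ref, alt)` variants, and that each unit partition is a named half-open coordinate range. The function names, the data layout, the choice of ID 0 for "no mutation in this partition", and the `compress`/`decompress` helpers are all hypothetical choices made for illustration; the abstract does not specify them.

```python
def _mutant_in(variants, start, end):
    """Return the mutant type of one partition: the sorted tuple of
    variants whose position falls in the half-open range [start, end)."""
    return tuple(sorted(v for v in variants if start <= v[0] < end))


def build_mutation_dictionary(individual_variants, partitions):
    """Number every distinct mutant type seen in each unit partition.

    individual_variants: {individual_id: [(pos, ref, alt), ...]}  (assumed layout)
    partitions: [(name, start, end), ...]                         (assumed layout)
    Returns {partition_name: {mutant_type: id}}.
    """
    # Assumption: ID 0 is reserved for "identical to the reference".
    dictionary = {name: {(): 0} for name, _start, _end in partitions}
    for variants in individual_variants.values():
        for name, start, end in partitions:
            mutant = _mutant_in(variants, start, end)
            ids = dictionary[name]
            if mutant not in ids:
                ids[mutant] = len(ids)  # next unused number
    return dictionary


def compress(variants, partitions, dictionary):
    """Encode one individual as one mutant-type ID per partition."""
    return [dictionary[name][_mutant_in(variants, start, end)]
            for name, start, end in partitions]


def decompress(codes, partitions, dictionary):
    """Recover the variant list from the per-partition IDs."""
    variants = []
    for code, (name, _start, _end) in zip(codes, partitions):
        inverse = {i: mutant for mutant, i in dictionary[name].items()}
        variants.extend(inverse[code])
    return sorted(variants)
```

Under these assumptions, an individual is stored as a short list of integers, one per partition, and individuals that share a mutant type in a partition share the same dictionary entry. An individual whose mutant type is missing from the dictionary raises `KeyError` in `compress`; how such cases are handled is not stated in the abstract.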

Description

PCT national-phase application; the description has been published.

Claims (16)

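  1. PCT national-phase application; the claims have been published.
CN201980102589.7A 2019-12-31 2019-12-31 Method and device for creating gene mutation dictionary, and method and device for compressing genomic data using the dictionary Pending CN114930724A (zh)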

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130731 WO2021134574A1 (zh) 2019-12-31 2019-12-31 Method and device for creating gene mutation dictionary, and method and device for compressing genomic data using the dictionary

Publications (1)

Publication Number Publication Date
CN114930724A (zh) 2022-08-19

Family

ID=76687198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980102589.7A Pending CN114930724A (zh) 2019-12-31 2019-12-31 Method and device for creating gene mutation dictionary, and method and device for compressing genomic data using the dictionary

Country Status (4)

Country Link
US (1) US20220383987A1 (zh)
EP (1) EP4087139A4 (zh)
CN (1) CN114930724A (zh)
WO (1) WO2021134574A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
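CN116246715A (zh) * 2023-04-27 2023-06-09 倍科为(天津)生物技术有限公司 Multi-sample gene mutation data storage method, apparatus, device, and medium
WO2024077568A1 (zh) * 2022-10-13 2024-04-18 深圳华大智造科技股份有限公司 Reference sequence construction method, metagenomic data compression method, and electronic device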

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003188735A (ja) * 2001-12-13 2003-07-04 Ntt Data Corp Data compression device, method, and program
US20040153255A1 (en) * 2003-02-03 2004-08-05 Ahn Tae-Jin Apparatus and method for encoding DNA sequence, and computer readable medium
EP2612271A4 (en) * 2010-08-31 2017-07-19 Annai Systems Inc. Method and systems for processing polymeric sequence data and related information
US8937564B2 (en) * 2013-01-10 2015-01-20 Infinidat Ltd. System, method and non-transitory computer readable medium for compressing genetic information
CN103546162B (zh) * 2013-09-22 2016-08-17 上海交通大学 Gene compression method based on non-contiguous context modeling and the maximum entropy principle
JP6198659B2 (ja) * 2014-04-03 2017-09-20 株式会社日立ハイテクノロジーズ Sequence data analysis device, DNA analysis system, and sequence data analysis method
CA3191504A1 (en) * 2014-05-30 2015-12-03 Sequenom, Inc. Chromosome representation determinations
KR101832373B1 (ko) * 2015-08-28 2018-02-26 주식회사 케이티 Apparatus and method for reducing and transmitting genetic variation information
CN110914911B (zh) * 2017-05-16 2023-09-22 生命科技股份有限公司 Method for compressing molecularly tagged nucleic acid sequence data
CN109450452B (zh) * 2018-11-27 2020-07-10 中国科学院计算技术研究所 Compression method and system for a sampled trie index for genetic data

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
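WO2024077568A1 (zh) * 2022-10-13 2024-04-18 深圳华大智造科技股份有限公司 Reference sequence construction method, metagenomic data compression method, and electronic device
CN116246715A (zh) * 2023-04-27 2023-06-09 倍科为(天津)生物技术有限公司 Multi-sample gene mutation data storage method, apparatus, device, and medium
CN116246715B (zh) * 2023-04-27 2024-04-16 倍科为(天津)生物技术有限公司 Multi-sample gene mutation data storage method, apparatus, device, and medium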

Also Published As

Publication number Publication date
WO2021134574A1 (zh) 2021-07-08
US20220383987A1 (en) 2022-12-01
EP4087139A1 (en) 2022-11-09
EP4087139A4 (en) 2023-01-18

Similar Documents

Publication Publication Date Title
Berger et al. Levenshtein distance, sequence comparison and biological database search
US20220383987A1 (en) Method and device for creating gene mutation dictionary, and method and device for compressing genomic data using the dictionary
Solis et al. Optimized representations and maximal information in proteins
Gong et al. Predicting clinical outcomes across changing electronic health record systems
US20050114377A1 (en) Computerized method, system and program product for generating a data mining model
US9378271B2 (en) Database system for analysis of longitudinal data sets
Janin et al. BEETL-fastq: a searchable compressed archive for DNA reads
CN113342750A (zh) File data comparison method, apparatus, device, and storage medium
Amiri et al. Clustering categorical data via ensembling dissimilarity matrices
Kumar et al. Fast and memory efficient approach for mapping NGS reads to a reference genome
Banerjee et al. Design and development of bioinformatics feature based DNA sequence data compression algorithm
Cánovas et al. Csam: Compressed sam format
Wang et al. Syllable-PBWT for space-efficient haplotype long-match query
Alanko et al. A framework for space-efficient read clustering in metagenomic samples
Kundaje et al. Combining sequence and time series expression data to learn transcriptional modules
JP2004535612A (ja) 遺伝子発現データの管理システムおよび方法
Baldi et al. BLASTing small molecules—statistics and extreme statistics of chemical similarity scores
Korotkov et al. Enlarged similarity of nucleic acid sequences
Zhang et al. Efficient Search over Genomic Short Read Data
Matos et al. MAFCO: a compression tool for MAF files
CN115295135B (zh) Medical data quality improvement method, apparatus, and storage medium based on a divide-and-conquer algorithm
Çakırgöz et al. Organization of Variation-Based Personal Genetic Data with Document-Based No-Sql Database
Alatabbi et al. On the repetitive collection indexing problem
CN116564414B (zh) Molecular sequence alignment method, apparatus, electronic device, storage medium, and product
Chen et al. Compression for population genetic data through finite-state entropy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination