CN114930724A - Method and device for creating a gene mutation dictionary, and method and device for compressing genomic data using the gene mutation dictionary - Google Patents
- Publication number
- CN114930724A (application number CN201980102589.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- dictionary
- mutant
- mutation
- genome
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/50—Compression of genetic data
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
- H03M7/4031—Fixed length to variable length coding
- H03M7/4037—Prefix coding
- H03M7/4043—Adaptive prefix coding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/70—Type of the data to be coded, other than image and sound
Abstract
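A method for creating a gene mutation dictionary, and a method and device for compressing genomic data using the gene mutation dictionary. The method for creating the gene mutation dictionary comprises: acquiring genome sequence data of multiple individuals of a species and reference genome data of that species; aligning the genome sequence data of each individual to the reference genome data to obtain the mutation results of each individual's genome sequence data relative to the reference genome data; dividing the genome of the species into a number of biologically meaningful unit partitions; and, according to the mutation results, compiling statistics on the mutants of each unit partition separately, generating all mutant types of each unit partition across the multiple individuals, and numbering the mutant types to obtain the gene mutation dictionary. The invention addresses the difficulty of compressing genomic data, markedly reducing its storage footprint and greatly lowering the cost of storing the data.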
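The dictionary-building steps in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the page does not disclose data formats or the numbering scheme, so the sketch assumes that alignment has already produced per-individual variant calls as `(position, ref, alt)` tuples, that unit partitions are half-open coordinate intervals, and that ID 0 is reserved for "identical to the reference". All function and variable names (`mutant_in_unit`, `build_mutation_dictionary`, `encode_individual`, `decode_individual`, `units`, `cohort`) are hypothetical.

```python
def mutant_in_unit(variants, start, end):
    """Return the mutant type of one unit: the sorted tuple of variants inside it."""
    return tuple(sorted(v for v in variants if start <= v[0] < end))


def build_mutation_dictionary(variants_by_individual, units):
    """Number every distinct mutant type seen in each unit partition.

    variants_by_individual: {individual: [(position, ref, alt), ...]}
    units: {unit_name: (start, end)} with half-open intervals (an assumption).
    """
    # Assumption: ID 0 always means "no difference from the reference".
    dictionary = {name: {(): 0} for name in units}
    for variants in variants_by_individual.values():
        for name, (start, end) in units.items():
            mutant = mutant_in_unit(variants, start, end)
            ids = dictionary[name]
            if mutant not in ids:
                ids[mutant] = len(ids)  # next unused ID for this unit
    return dictionary


def encode_individual(variants, units, dictionary):
    """Replace an individual's variants with one dictionary ID per unit."""
    return {
        name: dictionary[name][mutant_in_unit(variants, start, end)]
        for name, (start, end) in units.items()
    }


def decode_individual(codes, dictionary):
    """Recover the variant list from per-unit dictionary IDs."""
    variants = []
    for name, code in codes.items():
        inverse = {i: mutant for mutant, i in dictionary[name].items()}
        variants.extend(inverse[code])
    return sorted(variants)


# Toy cohort: two units and three individuals (illustrative data only).
units = {"geneA": (0, 100), "geneB": (100, 200)}
cohort = {
    "ind1": [(10, "A", "G"), (150, "C", "T")],
    "ind2": [(10, "A", "G")],
    "ind3": [],
}
dictionary = build_mutation_dictionary(cohort, units)
codes = encode_individual(cohort["ind1"], units, dictionary)
restored = decode_individual(codes, dictionary)
```

In this sketch an individual is stored as one small integer per unit instead of a full variant list, which is where the storage saving would come from when many individuals share the same mutant types. An individual whose mutant type is absent from the dictionary raises `KeyError` here; how the patented method handles that case is not stated on this page.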
Description
PCT national-phase application; the description has been published.
Claims (16)
- PCT national-phase application; the claims have been published.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
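PCT/CN2019/130731 WO2021134574A1 (zh) | 2019-12-31 | 2019-12-31 | Method and device for creating a gene mutation dictionary, and method and device for compressing genomic data using the gene mutation dictionary |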
Publications (1)
Publication Number | Publication Date |
---|---|
CN114930724A (zh) | 2022-08-19 |
Family
ID=76687198
Family Applications (1)
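Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201980102589.7A | Pending | CN114930724A (zh) | 2019-12-31 | 2019-12-31 | Method and device for creating a gene mutation dictionary, and method and device for compressing genomic data using the gene mutation dictionary |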
Country Status (4)
Country | Link |
---|---|
US (1) | US20220383987A1 (zh) |
EP (1) | EP4087139A4 (zh) |
CN (1) | CN114930724A (zh) |
WO (1) | WO2021134574A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
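CN116246715A (zh) * | 2023-04-27 | 2023-06-09 | 倍科为(天津)生物技术有限公司 | Multi-sample gene mutation data storage method, apparatus, device, and medium |
WO2024077568A1 (zh) * | 2022-10-13 | 2024-04-18 | 深圳华大智造科技股份有限公司 | Reference sequence construction method, metagenomic data compression method, and electronic device |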
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003188735A (ja) * | 2001-12-13 | 2003-07-04 | Ntt Data Corp | Data compression device, method, and program |
US20040153255A1 (en) * | 2003-02-03 | 2004-08-05 | Ahn Tae-Jin | Apparatus and method for encoding DNA sequence, and computer readable medium |
EP2612271A4 (en) * | 2010-08-31 | 2017-07-19 | Annai Systems Inc. | Method and systems for processing polymeric sequence data and related information |
US8937564B2 (en) * | 2013-01-10 | 2015-01-20 | Infinidat Ltd. | System, method and non-transitory computer readable medium for compressing genetic information |
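CN103546162B (zh) * | 2013-09-22 | 2016-08-17 | 上海交通大学 | Gene compression method based on non-contiguous context modeling and the maximum entropy principle |
JP6198659B2 (ja) * | 2014-04-03 | 2017-09-20 | 株式会社日立ハイテクノロジーズ | Sequence data analysis device, DNA analysis system, and sequence data analysis method |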
CA3191504A1 (en) * | 2014-05-30 | 2015-12-03 | Sequenom, Inc. | Chromosome representation determinations |
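KR101832373B1 (ko) * | 2015-08-28 | 2018-02-26 | 주식회사 케이티 | Device and method for reducing and transmitting genetic variation information |
CN110914911B (zh) * | 2017-05-16 | 2023-09-22 | 生命科技股份有限公司 | Method for compressing molecularly tagged nucleic acid sequence data |
CN109450452B (zh) * | 2018-11-27 | 2020-07-10 | 中国科学院计算技术研究所 | Compression method and system for a sampled trie index for genetic data |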
2019
- 2019-12-31 CN application CN201980102589.7A, published as CN114930724A (zh), status: active, Pending
- 2019-12-31 EP application EP19958459.0A, published as EP4087139A4 (en), status: active, Pending
- 2019-12-31 WO application PCT/CN2019/130731, published as WO2021134574A1 (zh), status: unknown
2022
- 2022-06-29 US application US17/809,892, published as US20220383987A1 (en), status: active, Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
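WO2024077568A1 (zh) * | 2022-10-13 | 2024-04-18 | 深圳华大智造科技股份有限公司 | Reference sequence construction method, metagenomic data compression method, and electronic device |
CN116246715A (zh) * | 2023-04-27 | 2023-06-09 | 倍科为(天津)生物技术有限公司 | Multi-sample gene mutation data storage method, apparatus, device, and medium |
CN116246715B (zh) * | 2023-04-27 | 2024-04-16 | 倍科为(天津)生物技术有限公司 | Multi-sample gene mutation data storage method, apparatus, device, and medium |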
Also Published As
Publication number | Publication date |
---|---|
WO2021134574A1 (zh) | 2021-07-08 |
US20220383987A1 (en) | 2022-12-01 |
EP4087139A1 (en) | 2022-11-09 |
EP4087139A4 (en) | 2023-01-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Berger et al. | Levenshtein distance, sequence comparison and biological database search | |
US20220383987A1 (en) | Method and device for creating gene mutation dictionary, and method and device for compressing genomic data using the dictionary | |
Solis et al. | Optimized representations and maximal information in proteins | |
Gong et al. | Predicting clinical outcomes across changing electronic health record systems | |
US20050114377A1 (en) | Computerized method, system and program product for generating a data mining model | |
US9378271B2 (en) | Database system for analysis of longitudinal data sets | |
Janin et al. | BEETL-fastq: a searchable compressed archive for DNA reads | |
CN113342750A (zh) | File data comparison method, apparatus, device, and storage medium | |
Amiri et al. | Clustering categorical data via ensembling dissimilarity matrices | |
Kumar et al. | Fast and memory efficient approach for mapping NGS reads to a reference genome | |
Banerjee et al. | Design and development of bioinformatics feature based DNA sequence data compression algorithm | |
Cánovas et al. | Csam: Compressed sam format | |
Wang et al. | Syllable-PBWT for space-efficient haplotype long-match query | |
Alanko et al. | A framework for space-efficient read clustering in metagenomic samples | |
Kundaje et al. | Combining sequence and time series expression data to learn transcriptional modules | |
JP2004535612A (ja) | System and method for managing gene expression data | |
Baldi et al. | BLASTing small molecules—statistics and extreme statistics of chemical similarity scores | |
Korotkov et al. | Enlarged similarity of nucleic acid sequences | |
Zhang et al. | Efficient Search over Genomic Short Read Data | |
Matos et al. | MAFCO: a compression tool for MAF files | |
CN115295135B (zh) | Medical data quality improvement method, apparatus, and storage medium based on a divide-and-conquer algorithm | |
Çakırgöz et al. | Organization of Variation-Based Personal Genetic Data with Document-Based No-Sql Database | |
Alatabbi et al. | On the repetitive collection indexing problem | |
CN116564414B (zh) | Molecular sequence alignment method, apparatus, electronic device, storage medium, and product | |
Chen et al. | Compression for population genetic data through finite-state entropy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||