WO2004070029A1 - Method for encoding a DNA sequence and method for compressing a DNA sequence - Google Patents
Method for encoding a DNA sequence and method for compressing a DNA sequence
- Publication number
- WO2004070029A1 (application PCT/KR2003/001093)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- dna sequence
- bases
- encoding
- encoded
- byte
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- E—FIXED CONSTRUCTIONS
- E04—BUILDING
- E04G—SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
- E04G1/00—Scaffolds primarily resting on the ground
- E04G1/28—Scaffolds primarily resting on the ground designed to provide support only at a low height
- E04G1/32—Other free-standing supports, e.g. using trestles
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3084—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
-
- E—FIXED CONSTRUCTIONS
- E04—BUILDING
- E04G—SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
- E04G1/00—Scaffolds primarily resting on the ground
- E04G1/15—Scaffolds primarily resting on the ground essentially comprising special means for supporting or forming platforms; Platforms
-
- E—FIXED CONSTRUCTIONS
- E04—BUILDING
- E04G—SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
- E04G1/00—Scaffolds primarily resting on the ground
- E04G1/15—Scaffolds primarily resting on the ground essentially comprising special means for supporting or forming platforms; Platforms
- E04G2001/155—Platforms with an access hatch for getting through from one level to another
-
- E—FIXED CONSTRUCTIONS
- E04—BUILDING
- E04G—SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
- E04G1/00—Scaffolds primarily resting on the ground
- E04G1/28—Scaffolds primarily resting on the ground designed to provide support only at a low height
- E04G1/30—Ladder scaffolds
- E04G2001/302—Ladder scaffolds with ladders supporting the platform
- E04G2001/305—The ladders being vertical and perpendicular to the platform
Definitions
- The present invention relates to a method for encoding a DNA sequence and a method for compressing a DNA sequence, and in particular to a method for encoding a DNA sequence by expressing each of the 4 types of DNA bases, adenine (A), guanine (G), cytosine (C) and thymine (T), in 2 bits, and a method for compressing the encoded DNA sequence using a common data compression method to increase compression efficiency.
- A adenine
- G guanine
- C cytosine
- T thymine
- DNA sequences of various organisms are being analyzed, and research on methods to express and compress DNA sequences effectively is in progress.
- It takes at least 2 bits per base when a DNA base sequence is compressed using common text compression software such as WinZip and Arj.
- The present invention has been made in view of the foregoing problems. Considering that a DNA sequence comprises 4 types of bases, adenine (A), guanine (G), cytosine (C) and thymine (T), it is an object of the present invention to provide a method for encoding a DNA sequence by expressing each base of the DNA sequence as a 2-bit unit, and a method for compressing the encoded DNA sequence to improve compression efficiency and compression rate.
- The present invention provides a method for encoding a DNA sequence comprising the steps of: encoding each base of the DNA sequence, which comprises adenine (A), guanine (G), cytosine (C) and thymine (T), into 2 bits; forming one byte from a predetermined number of the encoded bases; and forming a DNA sequence in byte units.
- The present invention also provides a method for compressing a DNA sequence comprising the steps of: encoding each DNA base, namely adenine (A), guanine (G), cytosine (C) and thymine (T), into 2 bits; forming one byte from a predetermined number of the encoded bases; forming a DNA sequence in byte units; and compressing the DNA sequence using a data compression method.
- Fig. 1 is a view schematically showing the encoding of DNA bases.
- Fig. 2 shows an embodiment of the method for encoding a DNA sequence according to the present invention.
- Fig. 3 shows an embodiment of the method for compressing a DNA sequence according to the present invention.
- Each base of a DNA sequence can be encoded into 2 bits. That is, each base is one of 4 characters, adenine (A), guanine (G), cytosine (C) or thymine (T), which are expressed as the 2-bit values 00, 01, 10 and 11, respectively. This particular assignment is only an example.
- The bases may be assigned any 2-bit values, provided the values differ from each other (e.g., 01, 11, 00, 10).
- In every case, each base is encoded into 2 bits.
- Bases of the DNA sequence to be encoded (hereinafter referred to as the "target DNA sequence") are gathered in a predetermined number to form one byte.
- The encoded final DNA sequence is expressed in byte units.
- The number of bases included in one byte may be 1, 2, 3 or 4.
- When a byte holds fewer than 4 bases, the remaining bits are filled with a predetermined value.
- The respective procedures are explained for the case where the target sequence 10 is "CACGACGTTGTA" and 4 bases form one byte.
- The first four bases (CACG) of the target sequence 10 are each encoded into 2 bits to form one byte (S21, S22).
- The encoded byte is added to a temporary DNA sequence (S23). With the example assignment A=00, G=01, C=10, T=11, the temporary DNA sequence is then "10001001".
- The next four bases (ACGT) are each encoded into 2 bits to form one byte (S22).
- The encoded byte is added to the temporary DNA sequence (S23).
- The temporary DNA sequence is then "1000100100100111".
- The target DNA sequence still contains bases to be encoded (S24), so the process returns to step S21.
- The last four bases (TGTA) are each encoded into 2 bits to form one byte (S22).
- The encoded byte is added to the temporary DNA sequence (S23). The temporary DNA sequence then becomes "100010010010011111011100". All the bases of the target DNA sequence have now been encoded, and the process ends (S24). At this point, the temporary DNA sequence is the encoded final DNA sequence 20.
- In the compression method, steps S21 to S24 are the same as the procedures described above and shown in Fig. 2.
- The encoded DNA sequence (the temporary DNA sequence in the example) is finally compressed by a compression method (S25).
- The compression methods usable in the present invention include any text compression method that has already been developed and is in use.
- The method for encoding a DNA sequence and the method for compressing a DNA sequence are preferably performed in the form of a computer program. The present invention therefore includes a computer-readable recording medium on which computer programs are recorded, the programs carrying out the respective steps of the method for encoding a DNA sequence and the method for compressing a DNA sequence. While the present invention has been described with reference to particular illustrative embodiments, it is not to be restricted by the embodiments but only by the appended claims. Those skilled in the art can change or modify the embodiments without departing from the scope and spirit of the present invention.
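
The encoding and compression steps described above can be sketched as a short Python program. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names (`encode_dna`, `decode_dna`, `compress_dna`) are hypothetical, the mapping A=00, G=01, C=10, T=11 is the example assignment from the description, the first base of each group is placed in the high-order bits (as the worked example implies), a trailing partial byte is padded with an assumed fill value of 00, and `zlib` (DEFLATE) stands in for "a data compression method". Decoding needs the original base count, which is assumed to be stored separately.

```python
import zlib

# Example assignment from the description; any four distinct 2-bit values work.
BASE_TO_BITS = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}
BASES_PER_BYTE = 4   # the description allows 1, 2, 3 or 4; 4 is used here
PAD_VALUE = 0b00     # assumed "predetermined value" for unused 2-bit slots


def encode_dna(sequence: str) -> bytes:
    """Encode each base into 2 bits and pack four bases per byte (S21 to S24)."""
    out = bytearray()
    for start in range(0, len(sequence), BASES_PER_BYTE):
        group = sequence[start:start + BASES_PER_BYTE]
        byte = 0
        for base in group:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Fill unused slots of a trailing partial group with the pad value.
        for _ in range(BASES_PER_BYTE - len(group)):
            byte = (byte << 2) | PAD_VALUE
        out.append(byte)  # add the byte to the temporary DNA sequence
    return bytes(out)


def decode_dna(encoded: bytes, length: int) -> str:
    """Reverse encode_dna; `length` is the original number of bases."""
    bases = []
    for byte in encoded:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def compress_dna(sequence: str) -> bytes:
    """Encode, then compress with a general-purpose compressor (S25)."""
    return zlib.compress(encode_dna(sequence))


target = "CACGACGTTGTA"
encoded = encode_dna(target)
bit_string = "".join(f"{b:08b}" for b in encoded)
print(bit_string)  # 100010010010011111011100
```

The printed bit string matches the encoded final DNA sequence 20 in the walkthrough. For a sequence this short, the compressed output is larger than the 3 encoded bytes because of the compressor's fixed overhead; any gain from the compression step would only appear on much longer inputs.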
Landscapes
- Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Architecture (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Medical Informatics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Structural Engineering (AREA)
- Mechanical Engineering (AREA)
- Civil Engineering (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2003232661A AU2003232661A1 (en) | 2003-02-07 | 2003-06-04 | Method to encode a dna sequence and to compress a dna sequence |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2003-0007920 | 2003-02-07 | ||
KR1020030007920A KR20040071993A (ko) | 2003-02-07 | 2003-02-07 | Method for encoding a DNA sequence and method for compressing a DNA sequence |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2004070029A1 (fr) | 2004-08-19 |
Family
ID=32844797
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2003/001093 WO2004070029A1 (fr) | 2003-02-07 | 2003-06-04 | Method for encoding a DNA sequence and method for compressing a DNA sequence |
Country Status (3)
Country | Link |
---|---|
KR (1) | KR20040071993A (fr) |
AU (1) | AU2003232661A1 (fr) |
WO (1) | WO2004070029A1 (fr) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010108929A3 (fr) * | 2009-03-23 | 2010-11-25 | Intresco B.V. | Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual |
CN105550535A (zh) * | 2015-12-03 | 2016-05-04 | 人和未来生物科技(长沙)有限公司 | Encoding method for rapidly encoding a gene character sequence into a binary sequence |
US10902937B2 (en) | 2014-02-12 | 2021-01-26 | International Business Machines Corporation | Lossless compression of DNA sequences |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
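KR101253700B1 (ko) * | 2010-11-26 | 2013-04-12 | 가천대학교 산학협력단 | Apparatus and method for high-speed compression of NGS data |
KR101922129B1 (ko) | 2011-12-05 | 2018-11-26 | 삼성전자주식회사 | Method and apparatus for compressing and decompressing genetic information obtained using next-generation sequencing |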
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08123498A (ja) * | 1994-10-21 | 1996-05-17 | Nippon Telegr & Teleph Corp <Ntt> | Waveform data compression method |
US5651099A (en) * | 1995-01-26 | 1997-07-22 | Hewlett-Packard Company | Use of a genetic algorithm to optimize memory space |
US5706498A (en) * | 1993-09-27 | 1998-01-06 | Hitachi Device Engineering Co., Ltd. | Gene database retrieval system where a key sequence is compared to database sequences by a dynamic programming device |
US5727130A (en) * | 1995-08-31 | 1998-03-10 | Motorola, Inc. | Genetic algorithm for constructing and tuning fuzzy logic system |
US5838964A (en) * | 1995-06-26 | 1998-11-17 | Gubser; David R. | Dynamic numeric compression methods |
KR20020040406A (ko) * | 2000-11-24 | 2002-05-30 | 김응수 | Method for compressing and storing information using a genetic code |
2003
- 2003-02-07: KR application KR1020030007920A, published as KR20040071993A (ko); status: not active (Application Discontinuation)
- 2003-06-04: WO application PCT/KR2003/001093, published as WO2004070029A1 (fr); status: not active (Application Discontinuation)
- 2003-06-04: AU application AU2003232661A, published as AU2003232661A1 (en); status: not active (Abandoned)
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010108929A3 (fr) * | 2009-03-23 | 2010-11-25 | Intresco B.V. | Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual |
US9607127B2 (en) | 2009-03-23 | 2017-03-28 | Jan Jaap Nietfeld | Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual |
NL2003311C2 (en) * | 2009-07-30 | 2011-02-02 | Intresco B V | Method for producing a biological pin code. |
US10902937B2 (en) | 2014-02-12 | 2021-01-26 | International Business Machines Corporation | Lossless compression of DNA sequences |
CN105550535A (zh) * | 2015-12-03 | 2016-05-04 | 人和未来生物科技(长沙)有限公司 | Encoding method for rapidly encoding a gene character sequence into a binary sequence |
Also Published As
Publication number | Publication date |
---|---|
AU2003232661A1 (en) | 2004-08-30 |
KR20040071993A (ko) | 2004-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110945595B (zh) | DNA-based data storage and retrieval | |
JP7179008B2 (ja) | Nucleic acid-based data storage | |
US11379729B2 (en) | Nucleic acid-based data storage | |
CN109830263B (zh) | DNA storage method based on oligonucleotide sequence encoding and storage | |
AU2019270159A1 (en) | Compositions and methods for nucleic acid-based data storage | |
US9774351B2 (en) | Method and apparatus for encoding information units in code word sequences avoiding reverse complementarity | |
KR100537523B1 (ko) | Apparatus and method for encoding DNA sequences | |
WO2004070029A1 (fr) | Method for encoding a DNA sequence and method for compressing a DNA sequence | |
CN107633158B (zh) | Method and apparatus for compressing and decompressing gene sequences | |
Goel | A compression algorithm for DNA that uses ASCII values | |
CN111279422A (zh) | Encoding/decoding method, encoder/decoder, and storage method and device | |
KR101953663B1 (ko) | Method for producing an oligonucleotide pool using a single oligonucleotide | |
Pathak et al. | RETRACTED: LFQC: a lossless compression algorithm for FASTQ files | |
TWI770247B (zh) | Method for using nucleic acids for data storage, and non-transitory computer-readable storage medium, system and electronic device thereof | |
Venugopal et al. | Probabilistic Approach for DNA Compression | |
Wang et al. | DNA Digital Data Storage based on Distributed Method | |
최영재 | High Information Capacity and Low Cost DNA-based Data Storage through Additional Encoding Characters | |
Rani | M.: A new referential method for compressing genomes | |
AU2022245140A1 (en) | Fixed point number representation and computation circuits | |
WO2023177864A1 (fr) | Combinatorial enumeration and search for nucleic acid-based data storage | |
KR20210056822A (ko) | Method for compressing and transmitting genomic data in FASTQ format | |
Bandyopadhyay | Data hiding using DNA sequence compression |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
| | 122 | EP: PCT application non-entry in European phase | |
| | NENP | Non-entry into the national phase | Ref country code: JP |
| | WWW | WIPO information: withdrawn in national office | Country of ref document: JP |