WO2004070029A1 - Method for encoding a DNA sequence and method for compressing a DNA sequence - Google Patents

Method for encoding a DNA sequence and method for compressing a DNA sequence

Info

Publication number
WO2004070029A1
WO2004070029A1 PCT/KR2003/001093 KR0301093W WO2004070029A1 WO 2004070029 A1 WO2004070029 A1 WO 2004070029A1 KR 0301093 W KR0301093 W KR 0301093W WO 2004070029 A1 WO2004070029 A1 WO 2004070029A1
Authority
WO
WIPO (PCT)
Prior art keywords
dna sequence
bases
encoding
encoded
byte
Prior art date
Application number
PCT/KR2003/001093
Other languages
English (en)
Inventor
Hyoung Do Kim
Seung Wha Yoo
Kyoung Hee Choi
Original Assignee
Ajou University Industry Cooperation Foundation
Priority date
Filing date
Publication date
Application filed by Ajou University Industry Cooperation Foundation
Priority to AU2003232661A
Publication of WO2004070029A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • E FIXED CONSTRUCTIONS
    • E04 BUILDING
    • E04G SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
    • E04G1/00 Scaffolds primarily resting on the ground
    • E04G1/28 Scaffolds primarily resting on the ground designed to provide support only at a low height
    • E04G1/32 Other free-standing supports, e.g. using trestles
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • E FIXED CONSTRUCTIONS
    • E04 BUILDING
    • E04G SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
    • E04G1/00 Scaffolds primarily resting on the ground
    • E04G1/15 Scaffolds primarily resting on the ground essentially comprising special means for supporting or forming platforms; Platforms
    • E FIXED CONSTRUCTIONS
    • E04 BUILDING
    • E04G SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
    • E04G1/00 Scaffolds primarily resting on the ground
    • E04G1/15 Scaffolds primarily resting on the ground essentially comprising special means for supporting or forming platforms; Platforms
    • E04G2001/155 Platforms with an access hatch for getting through from one level to another
    • E FIXED CONSTRUCTIONS
    • E04 BUILDING
    • E04G SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
    • E04G1/00 Scaffolds primarily resting on the ground
    • E04G1/28 Scaffolds primarily resting on the ground designed to provide support only at a low height
    • E04G1/30 Ladder scaffolds
    • E04G2001/302 Ladder scaffolds with ladders supporting the platform
    • E04G2001/305 The ladders being vertical and perpendicular to the platform

Definitions

  • The present invention relates to a method for encoding a DNA sequence and a method for compressing a DNA sequence, and in particular to a method for encoding a DNA sequence by expressing each of the 4 types of DNA bases, adenine (A), guanine (G), cytosine (C) and thymine (T), in 2 bits, and a method for compressing the encoded DNA sequence with a common data compression method to increase compression efficiency.
  • DNA sequences of various organisms are being analyzed, and research on methods to represent and compress DNA sequences effectively is in progress.
  • When a DNA base sequence is compressed using common text compression software such as WinZip or Arj, it takes at least 2 bits per base.
  • The present invention has been made in view of the foregoing problems. Considering that a DNA sequence comprises 4 types of bases, adenine (A), guanine (G), cytosine (C) and thymine (T), it is an object of the present invention to provide a method for encoding a DNA sequence by expressing each base of the DNA sequence in a 2-bit unit, and a method for compressing the encoded DNA sequence, so as to improve compression efficiency and compression rate.
  • The present invention provides a method for encoding a DNA sequence comprising the steps of: encoding the bases of the DNA sequence, comprising adenine (A), guanine (G), cytosine (C) and thymine (T), into 2 bits each; forming one byte with a predetermined number of the encoded bases; and forming a DNA sequence in byte units.
  • Also provided is a method for compressing a DNA sequence comprising the steps of: encoding the DNA bases, comprising adenine (A), guanine (G), cytosine (C) and thymine (T), into 2 bits each; forming one byte with a predetermined number of the encoded bases; forming a DNA sequence in byte units; and compressing the DNA sequence using a data compression method.
  • Fig. 1 is a view schematically showing the encoding of DNA bases.
  • Fig. 2 shows an embodiment of the method for encoding a DNA sequence according to the present invention.
  • Fig. 3 shows an embodiment of the method for compressing a DNA sequence according to the present invention.
  • Each base of a DNA sequence can be encoded into 2 bits. That is, each base is one of 4 characters, adenine (A), guanine (G), cytosine (C) or thymine (T), and these are expressed as the 2-bit values 00, 01, 10 and 11, respectively. Assigning 00, 01, 10 and 11 to adenine (A), guanine (G), cytosine (C) and thymine (T) is just an example.
  • The codes assigned to the bases may be any values that differ from each other (e.g., 01, 11, 00, 10).
  • Each base is set to be encoded into 2 bits.
  • Bases of the DNA sequence to be encoded (hereinafter referred to as the "target DNA sequence") are gathered in a predetermined number to form one byte.
  • The encoded final DNA sequence is expressed in byte units.
  • The number of bases included in one byte may be 1, 2, 3 or 4.
  • When fewer than 4 bases are placed in a byte, the remaining bits are filled with a predetermined value.
  • The respective procedures are explained for the case where the target sequence 10 is "CACGACGTTGTA" and 4 bases form one byte.
  • The first four bases (CACG) of the target sequence 10 are encoded into 2 bits each to form one byte (S21, S22).
  • The encoded byte is added to a temporary DNA sequence (S23).
  • The next four bases (ACGT) are encoded into 2 bits each to form one byte (S22).
  • The encoded byte is added to the temporary DNA sequence (S23).
  • The temporary DNA sequence is then "1000100100100111".
  • The target DNA sequence still contains bases to be encoded (S24), so the process returns to step S21.
  • The last four bases (TGTA) are encoded into 2 bits each to form one byte (S22).
  • The encoded byte is added to the temporary DNA sequence (S23). The temporary DNA sequence then becomes "100010010010011111011100". All the bases of the target DNA sequence have now been encoded, so the process ends (S24). Here, the temporary DNA sequence is the encoded final DNA sequence 20.
  • In the method for compressing a DNA sequence, the steps S21 to S24 are the same as the procedures described above and shown in Fig. 2.
  • The encoded DNA sequence (the temporary DNA sequence in the example) is finally compressed by a compression method (S25).
  • The compression method used in the present invention may be any text compression method that has already been developed and is in use.
  • The method for encoding a DNA sequence and the method for compressing a DNA sequence are preferably performed in the form of a computer program. Therefore, the present invention includes a computer-readable recording medium on which are recorded computer programs that carry out the respective steps of the method for encoding a DNA sequence and the method for compressing a DNA sequence. While the present invention has been described with reference to particular illustrative embodiments, it is not to be restricted by the embodiments but only by the appended claims. It is to be appreciated that those skilled in the art can change or modify the embodiments without departing from the scope and spirit of the present invention.
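
The encoding procedure described above (steps S21 to S24) can be sketched in Python. This is a minimal illustration under stated assumptions, not the applicants' implementation: the names `encode_dna` and `to_bitstring`, the default padding value of zero, and the choice to fill each byte from its most significant bits downward are all hypothetical. The code table follows the example mapping in the description (A=00, G=01, C=10, T=11).

```python
# Example 2-bit code table from the description (any four distinct values would work).
CODES = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}


def encode_dna(sequence, bases_per_byte=4, pad_bit=0):
    """Pack a DNA string into bytes, `bases_per_byte` bases (1 to 4) per byte.

    Assumption: bases fill each byte from the most significant bits down,
    and any unused low bits are filled with `pad_bit`.
    """
    if not 1 <= bases_per_byte <= 4:
        raise ValueError("bases_per_byte must be 1, 2, 3 or 4")
    encoded = bytearray()  # plays the role of the temporary DNA sequence
    for start in range(0, len(sequence), bases_per_byte):  # S21: take the next group
        group = sequence[start:start + bases_per_byte]
        byte = 0
        for base in group:  # S22: 2 bits per base
            byte = (byte << 2) | CODES[base]
        unused = 8 - 2 * len(group)
        # Move the codes to the top of the byte and fill the remaining bits.
        byte = (byte << unused) | ((1 << unused) - 1 if pad_bit else 0)
        encoded.append(byte)  # S23: add the byte to the temporary sequence
    return bytes(encoded)  # S24: no bases left to encode


def to_bitstring(data):
    """Render bytes as a string of 0s and 1s for inspection."""
    return "".join(f"{b:08b}" for b in data)


print(to_bitstring(encode_dna("CACGACGTTGTA")))
# prints 100010010010011111011100
```

With four bases per byte no padding is needed for this 12-base example, and the printed bit string matches the final temporary DNA sequence given in the description. A decoder would additionally need to know the original sequence length and the padding convention, which the description does not specify.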

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Architecture (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Medical Informatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Structural Engineering (AREA)
  • Mechanical Engineering (AREA)
  • Civil Engineering (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to a method for encoding a DNA sequence and a method for compressing a DNA sequence. The method for encoding a DNA sequence comprises the steps of: encoding the bases of the DNA sequence, namely adenine (A), guanine (G), cytosine (C) and thymine (T), into 2 bits; forming one byte with a predetermined number of the encoded bases; and forming a DNA sequence in byte units. The method for compressing a DNA sequence according to the present invention further comprises a step of compressing the encoded DNA sequence using a data compression method. According to the present invention, it is possible to encode a DNA sequence efficiently and to improve the compression rate and compression speed of DNA sequence information.
PCT/KR2003/001093 2003-02-07 2003-06-04 Method for encoding a DNA sequence and method for compressing a DNA sequence WO2004070029A1 (fr)
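
As a rough sketch of the two-stage method summarized in the abstract, the bases can be packed four per byte and the packed bytes then passed to a general-purpose compressor. The source does not name a specific algorithm beyond common compression software, so the use of Python's `zlib` here, the helper name `pack_bases`, the sample input, and the restriction to sequences whose length is a multiple of four are all assumptions made for illustration.

```python
import zlib

# Example mapping from the description; any four distinct 2-bit values would work.
CODES = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}


def pack_bases(sequence):
    """Pack four bases per byte (hypothetical helper; length must be a multiple of 4)."""
    if len(sequence) % 4:
        raise ValueError("this sketch only handles lengths divisible by 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[i:i + 4]:
            byte = (byte << 2) | CODES[base]
        out.append(byte)
    return bytes(out)


sequence = "CACGACGTTGTA" * 1000        # 12,000 bases of repetitive sample data
packed = pack_bases(sequence)           # encoding step: 2 bits per base
compressed = zlib.compress(packed, 9)   # compression step applied to the byte sequence

print(len(sequence), len(packed), len(compressed))
assert zlib.decompress(compressed) == packed  # the compression step is lossless
```

The sample input is highly repetitive, so the compressed size it yields says nothing about real genomic data; the sketch only shows the order of the two steps.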

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003232661A AU2003232661A1 (en) 2003-02-07 2003-06-04 Method to encode a dna sequence and to compress a dna sequence

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2003-0007920 2003-02-07
KR1020030007920A KR20040071993A (ko) 2003-02-07 2003-02-07 DNA sequence encoding method and DNA sequence compression method

Publications (1)

Publication Number Publication Date
WO2004070029A1 (fr) 2004-08-19

Family

ID=32844797

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2003/001093 WO2004070029A1 (fr) 2003-02-07 2003-06-04 Method for encoding a DNA sequence and method for compressing a DNA sequence

Country Status (3)

Country Link
KR (1) KR20040071993A (fr)
AU (1) AU2003232661A1 (fr)
WO (1) WO2004070029A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010108929A3 (fr) * 2009-03-23 2010-11-25 Intresco B.V. Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual
CN105550535A (zh) * 2015-12-03 2016-05-04 人和未来生物科技(长沙)有限公司 Encoding method for rapidly encoding a gene character sequence into a binary sequence
US10902937B2 (en) 2014-02-12 2021-01-26 International Business Machines Corporation Lossless compression of DNA sequences

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
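KR101253700B1 (ko) * 2010-11-26 2013-04-12 가천대학교 산학협력단 High-speed compression apparatus for NGS data and method thereof
KR101922129B1 (ko) 2011-12-05 2018-11-26 삼성전자주식회사 Method and apparatus for compressing and decompressing genetic information obtained using next-generation sequencing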

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08123498A (ja) * 1994-10-21 1996-05-17 Nippon Telegr & Teleph Corp <Ntt> Waveform data compression method
US5651099A (en) * 1995-01-26 1997-07-22 Hewlett-Packard Company Use of a genetic algorithm to optimize memory space
US5706498A (en) * 1993-09-27 1998-01-06 Hitachi Device Engineering Co., Ltd. Gene database retrieval system where a key sequence is compared to database sequences by a dynamic programming device
US5727130A (en) * 1995-08-31 1998-03-10 Motorola, Inc. Genetic algorithm for constructing and tuning fuzzy logic system
US5838964A (en) * 1995-06-26 1998-11-17 Gubser; David R. Dynamic numeric compression methods
KR20020040406A (ko) * 2000-11-24 2002-05-30 김응수 Method for compressing and storing information using genetic code

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010108929A3 (fr) * 2009-03-23 2010-11-25 Intresco B.V. Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual
US9607127B2 (en) 2009-03-23 2017-03-28 Jan Jaap Nietfeld Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual
NL2003311C2 (en) * 2009-07-30 2011-02-02 Intresco B V Method for producing a biological pin code.
US10902937B2 (en) 2014-02-12 2021-01-26 International Business Machines Corporation Lossless compression of DNA sequences
CN105550535A (zh) * 2015-12-03 2016-05-04 人和未来生物科技(长沙)有限公司 Encoding method for rapidly encoding a gene character sequence into a binary sequence

Also Published As

Publication number Publication date
AU2003232661A1 (en) 2004-08-30
KR20040071993A (ko) 2004-08-16

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP