WO1999001940A1 - Biological data - Google Patents

Info

Publication number
WO1999001940A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence data
bits
monomer
datatype
biological sequence
Prior art date
Application number
PCT/GB1998/001937
Other languages
French (fr)
Inventor
Michael James Gilchrist
Original Assignee
Hexagen Technology Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hexagen Technology Limited
Priority to EP98932338A (EP0995271A1)
Priority to AU82278/98A (AU8227898A)
Priority to JP50665999A (JP2002508130A)
Publication of WO1999001940A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Using a whole byte to represent a monomer in a biological sequence is not the most efficient means of permanent storage. The invention relates to the compression of biological sequence data for electronic storage by utilising a sub-byte datatype for the storage or manipulation of biological sequence data in a programming language or a database. For nucleotide sequences, for example, 2 bits can be used to represent each monomer.

Description

BIOLOGICAL DATA
This invention relates to the compression of biological sequence data for electronic storage.
Biological sequence data (e.g. DNA and protein sequences) is, by its nature, well suited to electronic storage. Not only does the sheer volume of data necessitate large-scale storage, but electronic storage allows rapid and efficient searching of the data, e.g. for homologous sequences. Since the advent of initiatives such as whole genome sequencing, the amount of storage required has increased significantly. The storage requirement for the yeast genome, for instance, is huge. Whilst large capacity storage systems continue to fall in price, one of the rate-limiting steps when dealing with sequence data is the transfer from storage medium into memory (e.g. hard drive into RAM), and any developments which significantly reduce the size of sequence data files would be welcomed.
Biological sequence data is typically represented in an alphabetic manner, rather than by chemical formula, with each letter representing a monomer unit in a biological polymer (Table I). For instance, DNA sequences are represented as strings of letters chosen from a simple four-letter "alphabet". Each A, C, G or T represents a monomer unit (nucleotide) in a DNA polymer. Similarly, proteins are made up of twenty different monomer units (amino acids), which have each been assigned single letter codes.
Because of its alphabetic nature, biological sequence data is naturally suited to electronic storage in alphabetic text form. Alphabetic text-based computer information is generally stored and manipulated using the char datatype, using 8 bits (1 byte) and a conventional file of biological sequence data is made up of a string of characters of datatype char. A conventional file of sequence data uses a single byte to represent each monomer, so the amino acid sequence of the glycogen synthase protein, for example, requires 737 bytes of storage using the one-letter amino acid code, and the corresponding DNA sequence requires 2211 bytes.
The char datatype, however, was designed for representing a full character set, including upper and lower case letters plus numbers, punctuation, and other characters, and each 8-bit char can represent 256 (2^8) different values. Using the char datatype and alphanumeric characters to store DNA sequences therefore fails to utilise 252 of the available values. Similarly, protein sequences waste 236 values. Other datatypes which are in common usage for data storage include int (16 bits), long (32 bits) and float (32 bits), although this may vary from machine to machine. The DNA and RNA alphabets each consist of 4 letters and, rather than storing these sequences in alphanumeric form using strings of char datatype, using a sub-byte datatype would enable a significant storage saving. Degenerate nucleic acid sequence information (which can be represented using a 16 letter alphabet) and protein sequences could also be treated in this way. It would therefore be useful to define a sub-byte datatype in order to take advantage of the small size of the biological alphabet.
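The arithmetic in the paragraph above can be checked with a short sketch. The function names below are illustrative assumptions, not part of the specification; the alphabet sizes (4, 16 and 20) are those given in the text.

```python
def bits_per_monomer(alphabet_size):
    """Smallest whole number of bits that can distinguish alphabet_size symbols."""
    return max(1, (alphabet_size - 1).bit_length())


def unused_char_values(alphabet_size):
    """How many of the 256 values of an 8-bit char are left unused."""
    return 256 - alphabet_size


# Alphabet sizes taken from the description above.
for name, size in [("DNA/RNA", 4), ("degenerate nucleic acid", 16), ("protein", 20)]:
    print(name, bits_per_monomer(size), "bits;", unused_char_values(size), "char values unused")
```

The 5 bits reported for the protein alphabet is only the theoretical minimum; a wider representation may still be more convenient in practice.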
The commonly used BLAST sequence comparison program converts single byte char data into a half-byte working space whilst manipulating data. This is a temporary measure, however, and data is not stored in this manner using a specific sub-byte datatype.
The invention is based upon the realisation that using a whole byte to represent a monomer in a biological sequence is not the most efficient means of permanent storage.
According to the invention, there is defined a sub-byte datatype for the storage or manipulation of biological sequence data in a programming language or a database.
The invention also provides a programming language or a database which utilises a sub-byte datatype for the storage or manipulation of biological sequence data.
According to a further aspect of the invention, there is provided the use of a sub-byte datatype in the storage or manipulation of biological sequence data.
By "sub-byte" it is meant fewer than 8 bits.
The datatype may be intrinsic to a program or programming language, or it may be user-defined. The invention is not limited, however, to situations where a formal datatype must be defined.
According to a further aspect of the invention, there is provided a computer program which stores biological sequence data using fewer than 8 bits to represent each monomer in said sequence data.
The invention also provides a file containing biological sequence data, wherein each monomer in said sequence data is represented using fewer than 8 bits.
According to a further aspect of the invention, there is provided a method for compressing biological sequence data, comprising representing each monomer in said sequence data by using fewer than 8 bits.
The invention also provides a method for reducing the size of a file in which biological sequence data is represented using 8 or more bits per monomer, comprising replacing the representation of each monomer with a representation using fewer than 8 bits.
According to a further aspect of the invention, there is provided a computer programmed to store biological sequence data by using fewer than 8 bits to represent each monomer in said sequence data.
According to a further aspect of the invention, there is provided a computer comprising means for alphabetic entry of biological sequence data, means to convert said sequence data into a format wherein each monomer unit is represented using fewer than 8 bits and, preferably, means to store said data.
According to a further aspect of the invention, there is provided a storage medium holding biological sequence data, wherein said sequence data is stored using fewer than 8 bits to represent each monomer in said sequence data.
The storage medium may be in any appropriate form, such as a floppy disk, a CD-ROM, or a fixed disk drive.
According to a further aspect of the invention, there is provided a method for transmitting biological sequence data, comprising compressing the data by representing each monomer in said sequence data by using fewer than 8 bits before transmission, for instance over a network.
According to a further aspect of the invention, there is provided biological sequence data which has been electronically stored using less than 8 bits to represent each monomer in said sequence data.
The biological sequence data may be of any suitable kind, such as DNA sequence, RNA sequence, and protein or polypeptide sequence.
It will be apparent that nucleic acid sequences can be represented using 2 bits to represent each monomer (nucleotides A, C, G, or T/U). Accordingly, a 2 bit datatype may be defined according to the invention for the storage or manipulation of nucleic acid sequences. Such a datatype is referred to herein as base.
By representing each nucleotide in a nucleic acid sequence by using only 2 bits, 4 nucleotides can be stored in a single byte. This represents a 75% compression compared with the conventional representation of each nucleotide using a single byte.
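A minimal sketch of the 2-bit packing described above, written in Python for illustration. The particular assignment of bit patterns to letters (A=00, C=01, G=10, T=11), the bit order within a byte and the function names are assumptions; the text only requires that each nucleotide occupies 2 bits.

```python
# Assumed code assignment; any one-to-one mapping onto the four 2-bit values would do.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE_LETTERS = "ACGT"


def pack_bases(seq):
    """Pack a string of A/C/G/T into bytes, four nucleotides per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, letter in enumerate(seq):
        shift = 6 - 2 * (i % 4)  # first nucleotide goes in the two highest bits
        out[i // 4] |= BASE_CODES[letter] << shift
    return bytes(out)


def unpack_bases(packed, length):
    """Recover the first `length` nucleotides from packed bytes."""
    return "".join(
        BASE_LETTERS[(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
        for i in range(length)
    )


packed = pack_bases("ACGTAC")
print(len(packed), unpack_bases(packed, 6))
```

Under these assumptions the 2211-nucleotide glycogen synthase coding sequence mentioned earlier would occupy 553 bytes rather than 2211. The sequence length has to be kept separately, because the final byte may be only partly filled.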
Where a nucleic acid sequence is not definite, more than 2 bits are required to represent each nucleotide. For instance, where a nucleotide has not been unequivocally determined, the symbol "N" is used according to IUPAC convention. The alphabet of this IUPAC convention (Table I) has 16 members. This can be conveniently represented using 4 bits per member. Accordingly, a 4 bit datatype may be defined according to the invention for the storage or manipulation of degenerate or uncertain nucleic acid sequences. Such a datatype is referred to herein as longbase.
By representing each nucleotide in a sequence by using 4 bits, 2 nucleotides can be stored in a single byte. This represents a 50% compression compared with the conventional representation of each nucleotide using a single byte.
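The 4-bit representation can be sketched in the same way. The ordering of the 16 letters below follows the order in which Table I lists them, and is an assumption made for illustration; the specification does not fix which 4-bit value stands for which letter.

```python
# The 16 nucleotide codes of Table I, in an assumed order; the index is the 4-bit value.
LONGBASE_LETTERS = "ACGTUNRYKMSWBDHV"


def pack_longbases(seq):
    """Pack IUPAC nucleotide letters into bytes, two nucleotides per byte."""
    out = bytearray((len(seq) + 1) // 2)
    for i, letter in enumerate(seq):
        code = LONGBASE_LETTERS.index(letter)
        shift = 4 if i % 2 == 0 else 0  # even positions use the high half-byte
        out[i // 2] |= code << shift
    return bytes(out)


def unpack_longbases(packed, length):
    """Recover the first `length` nucleotides from packed bytes."""
    return "".join(
        LONGBASE_LETTERS[(packed[i // 2] >> (4 if i % 2 == 0 else 0)) & 0xF]
        for i in range(length)
    )


packed = pack_longbases("ACGNR")
print(len(packed), unpack_longbases(packed, 5))
```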
As an alternative to using 4 bits to represent degenerate or uncertain nucleic acid sequences, under certain circumstances these features may be accommodated where 2 bits are used, as in base. For instance, where a DNA sequence is stored in a data file using 2 bits per nucleotide, parallel files could be utilised which contain "modifying" data to qualify details in the sequence file. For instance, the second file may contain an indication that whilst nucleotide 221 is given as guanine in the sequence file, in fact it may be any purine. Obviously, the choice of using such a "modifying" file or using more than 2 bits to represent the sequence depends on the particular situation, but the choice is routine.
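One possible reading of the "modifying" file idea is sketched below. The sequence file holds only definite bases, and a separate map records, by 1-based position, which entries are really degenerate. The choice of stand-in base for each degenerate code, the partial table and the function names are all hypothetical; the text does not specify a format for the second file.

```python
# Hypothetical stand-ins: the definite base written to the sequence file
# when the true entry is a degenerate code (only a few codes shown).
STAND_IN = {"R": "G", "Y": "T", "N": "A"}


def split_sequence(seq):
    """Split a degenerate sequence into a definite sequence plus a modifier map."""
    plain = []
    modifiers = {}
    for position, letter in enumerate(seq, start=1):
        if letter in "ACGT":
            plain.append(letter)
        else:
            plain.append(STAND_IN[letter])
            modifiers[position] = letter  # e.g. {221: "R"} for "any purine"
    return "".join(plain), modifiers


def merge_sequence(plain, modifiers):
    """Re-apply the modifier map to recover the original degenerate sequence."""
    letters = list(plain)
    for position, code in modifiers.items():
        letters[position - 1] = code
    return "".join(letters)


plain, modifiers = split_sequence("ACRT")
print(plain, modifiers, merge_sequence(plain, modifiers))
```

This approach favours sequences with few uncertain positions, since each one costs a full position-and-code entry in the second file.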
It will further be apparent that protein sequences require at least 5 bits to represent each monomer (20 amino acids) since 2^4 = 16 and 2^5 = 32. Whilst this is encompassed within the invention, 5 bits is an awkward length, being an odd number. 6 bits is more convenient and, furthermore, this allows a degree of degeneracy to be incorporated into the sequence (2^6 = 64). Accordingly, a 6 bit datatype may be defined according to the invention for the storage or manipulation of protein sequences. Such a datatype is referred to herein as aminoacid.
By representing each amino acid in a protein sequence by using 6 bits, 4 amino acids can be stored in 3 bytes. This represents a 25% compression compared with the conventional representation of each amino acid using a single byte.
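A sketch of 6-bit packing, in which four amino acids share three bytes. The letter ordering is an assumption (the 20 standard codes followed by the additional symbols of Table I), as are the function names and the zero-bit padding of the final byte.

```python
# Assumed ordering: 20 amino acid codes, then B, Z, U, X, "*" and "-" from Table I.
AMINO_LETTERS = "ACDEFGHIKLMNPQRSTVWYBZUX*-"


def pack_aminoacids(seq):
    """Pack one-letter amino acid codes at 6 bits each (4 residues per 3 bytes)."""
    bits = 0
    for letter in seq:
        bits = (bits << 6) | AMINO_LETTERS.index(letter)
    nbits = 6 * len(seq)
    nbytes = (nbits + 7) // 8
    bits <<= nbytes * 8 - nbits  # left-align and pad the last byte with zero bits
    return bits.to_bytes(nbytes, "big")


def unpack_aminoacids(packed, length):
    """Recover the first `length` residues from packed bytes."""
    bits = int.from_bytes(packed, "big")
    total = len(packed) * 8
    return "".join(
        AMINO_LETTERS[(bits >> (total - 6 * (i + 1))) & 0x3F]
        for i in range(length)
    )


packed = pack_aminoacids("MKVLA")
print(len(packed), unpack_aminoacids(packed, 5))
```

With this layout the 737-residue glycogen synthase sequence mentioned earlier would need 553 bytes instead of 737, in line with the 25% figure.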
The degree of degeneracy incorporated into a 6-bit representation or datatype also allows an amino acid to be represented in terms of codons, of which there are 64. A datatype used in this way is referred to herein as codon. Each single codon value represents a single codon, which inherently also defines an amino acid. In effect, the codon datatype represents three base entries, just as a codon is made up of three nucleotides. By using 6 bits to represent each codon, 4 codons can be represented in 3 bytes. This represents a 75% compression compared with the conventional representation of each codon using 3 bytes. It will also be appreciated that a full byte could be used to represent each codon, which would allow a degree of degeneracy and would represent a 67% compression compared with using 3 bytes to represent each codon.
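The statement that a codon value "represents three base entries" can be illustrated by concatenating three 2-bit base codes into one 6-bit value. The base-to-bits mapping and the function names are assumptions carried over for illustration only.

```python
# Assumed 2-bit code assignment for each base.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE_LETTERS = "ACGT"


def codon_to_value(codon):
    """Combine three 2-bit base codes into a single 6-bit codon value (0 to 63)."""
    first, second, third = (BASE_CODES[b] for b in codon)
    return (first << 4) | (second << 2) | third


def value_to_codon(value):
    """Expand a 6-bit codon value back into its three nucleotides."""
    return "".join(BASE_LETTERS[(value >> shift) & 0b11] for shift in (4, 2, 0))


print(codon_to_value("ATG"), value_to_codon(codon_to_value("ATG")))
```

Because every one of the 64 values corresponds to a real codon, the amino acid can be recovered by a lookup in a genetic code table, which is not reproduced here.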
It should be borne in mind that the various datatypes and compressions described above may not be suitable in all circumstances. For example, the programming language C requires a string to have a NULL terminator. This is not possible with the base datatype, for instance, because all of the 4 possible values (permutations of 2 bits) are used to represent information, which does not allow a terminator to be represented.
Similar caveats apply to longbase. The IUPAC convention uses 15 representations for a DNA or RNA sequence, which does allow the sixteenth permutation to represent a terminator. In certain circumstances, however, a value may be needed to represent a gap (representing an unknown sequence of unknown length) which would remove the possibility of having a terminator. The codon datatype is also "full" since each of the 64 available values represents a codon.
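Where no value is free to act as a terminator, one common workaround is to store the number of monomers alongside the packed data. This is not something the specification describes; the 4-byte big-endian length header and the function names below are assumptions made purely for illustration.

```python
import struct


def write_record(packed, monomer_count):
    """Prefix packed sequence bytes with a 4-byte big-endian monomer count."""
    return struct.pack(">I", monomer_count) + packed


def read_record(record):
    """Split a record into its monomer count and the packed sequence bytes."""
    (monomer_count,) = struct.unpack(">I", record[:4])
    return monomer_count, record[4:]


record = write_record(b"\x1b", 4)  # e.g. four 2-bit bases held in one byte
print(read_record(record))
```

An explicit count also resolves the ambiguity of a partly filled final byte, which a terminator value alone would not.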
Whilst these datatypes may not be universally applicable, however, they are not without utility since not all programming languages or databases have such a terminator requirement. A further problem in using the datatypes of the invention in languages such as C is the international ANSI standard which does not recognise these datatypes. However, new languages, such as Java which is still in early development, currently have less strict standards and may be amenable to the introduction of new datatypes at this stage.

TABLE I
IUB/IUPAC standard biological sequence codes

Single letter nucleotide codes
A  Adenine
C  Cytosine
G  Guanine
T  Thymine
U  Uracil

Degenerate nucleotide codes
In addition to the five above codes:
N  any (A/C/G/T)
R  puRine (G/A)
Y  pYrimidine (T/C)
K  Keto (G/T)
M  aMino (A/C)
S  Strong (G/C)
W  Weak (A/T)
B  not A (C/G/T)
D  not C (A/G/T)
H  not G (A/C/T)
V  not T (A/C/G)

Single letter amino acid codes
A  Alanine
C  Cysteine
D  Aspartate
E  Glutamate
F  Phenylalanine
G  Glycine
H  Histidine
I  Isoleucine
K  Lysine
L  Leucine
M  Methionine
N  Asparagine
P  Proline
Q  Glutamine
R  Arginine
S  Serine
T  Threonine
V  Valine
W  Tryptophan
Y  Tyrosine

In addition:
B  represents asparagine or aspartate, i.e. N or D
Z  represents glutamine or glutamate, i.e. Q or E
U  represents selenocysteine
X  represents "any amino acid" or "unknown"
*  represents a translation stop
-  represents a gap of indeterminate length

Claims

CLAIMS
1. A sub-byte datatype for the storage or manipulation of biological sequence data in a programming language or a database.
2. A programming language or a database which utilises a sub-byte datatype for the storage or manipulation of biological sequence data.
3. The use of a sub-byte datatype in the storage or manipulation of biological sequence data.
4. A file containing biological sequence data, wherein each monomer in said sequence data is represented using fewer than 8 bits.
5. A method for compressing biological sequence data, comprising representing each monomer in said sequence data by using fewer than 8 bits.
6. A method for reducing the size of a file in which biological sequence data is represented using 8 or more bits per monomer, comprising replacing the representation of each monomer with a representation using fewer than 8 bits.
7. A computer programmed to store biological sequence data by using fewer than 8 bits to represent each monomer in said sequence data.
8. A storage medium holding biological sequence data, wherein said sequence data is stored using fewer than 8 bits to represent each monomer in said sequence data.
9. Biological sequence data which has been electronically stored using less than 8 bits to represent each monomer in said sequence data.
PCT/GB1998/001937 1997-07-01 1998-07-01 Biological data WO1999001940A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP98932338A EP0995271A1 (en) 1997-07-01 1998-07-01 Biological data
AU82278/98A AU8227898A (en) 1997-07-01 1998-07-01 Biological data
JP50665999A JP2002508130A (en) 1997-07-01 1998-07-01 Biological data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB9713921.6 1997-07-01
GBGB9713921.6A GB9713921D0 (en) 1997-07-01 1997-07-01 Biological data

Publications (1)

Publication Number Publication Date
WO1999001940A1 (en) 1999-01-14

Family

ID=10815230

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1998/001937 WO1999001940A1 (en) 1997-07-01 1998-07-01 Biological data

Country Status (5)

Country Link
EP (1) EP0995271A1 (en)
JP (1) JP2002508130A (en)
AU (1) AU8227898A (en)
GB (1) GB9713921D0 (en)
WO (1) WO1999001940A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4701744A (en) * 1986-03-27 1987-10-20 Rca Corporation Method and apparatus for compacting and de-compacting text characters
WO1997031327A1 (en) * 1996-02-26 1997-08-28 Motorola Inc. Personal human genome card and methods and systems for producing same

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1313225A1 (en) * 2000-04-19 2003-05-21 Satoshi Omori Nucleotide sequence information, and method and device for recording information on sequence of amino acid
EP1313225A4 (en) * 2000-04-19 2003-05-21 Satoshi Omori Nucleotide sequence information, and method and device for recording information on sequence of amino acid
US6912469B1 (en) * 2000-05-05 2005-06-28 Kenneth J. Cool Electronic hybridization assay and sequence analysis
EP1443449A2 (en) * 2003-02-03 2004-08-04 Samsung Electronics Co., Ltd. Apparatus, method and computer readable medium for encoding a DNA sequence
EP1443449A3 (en) * 2003-02-03 2006-02-22 Samsung Electronics Co., Ltd. Apparatus, method and computer readable medium for encoding a DNA sequence
US20090298702A1 (en) * 2008-06-02 2009-12-03 Xing Su Nucleic acid sequencing using a compacted coding technique
US8498824B2 (en) * 2008-06-02 2013-07-30 Intel Corporation Nucleic acid sequencing using a compacted coding technique
US10790044B2 (en) * 2016-05-19 2020-09-29 Seven Bridges Genomics Inc. Systems and methods for sequence encoding, storage, and compression
US20210050074A1 (en) * 2016-05-19 2021-02-18 Vladimir Semenyuk Systems and methods for sequence encoding, storage, and compression

Also Published As

Publication number Publication date
JP2002508130A (en) 2002-03-12
EP0995271A1 (en) 2000-04-26
AU8227898A (en) 1999-01-25
GB9713921D0 (en) 1997-09-03

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM GW HR HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: KR

WWE Wipo information: entry into national phase

Ref document number: 1998932338

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1998932338

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWE Wipo information: entry into national phase

Ref document number: 09462112

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: CA

WWW Wipo information: withdrawn in national office

Ref document number: 1998932338

Country of ref document: EP