EP1442292A2 - Integrated system and method for analysis of genomic sequence data

Integrated system and method for analysis of genomic sequence data

Info

Publication number
EP1442292A2
EP1442292A2 EP02800693A EP02800693A EP1442292A2 EP 1442292 A2 EP1442292 A2 EP 1442292A2 EP 02800693 A EP02800693 A EP 02800693A EP 02800693 A EP02800693 A EP 02800693A EP 1442292 A2 EP1442292 A2 EP 1442292A2
Authority
EP
European Patent Office
Prior art keywords
genomic
compressed
genomic sequence
uncompressed
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP02800693A
Other languages
German (de)
French (fr)
Inventor
Isaac Bentwich
Yitzhak Mouyal
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rosetta Genomics Ltd
Original Assignee
Rosetta Genomics Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/976,911 (US20060129330A1)
Application filed by Rosetta Genomics Ltd
Publication of EP1442292A2

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the present invention relates to database storage and retrieval in general and more particularly to storage and retrieval in computers which utilize multi-phase bit technology and to analysis and representation of genomic sequence data in general, and to pattern analysis of genomic data in particular.
  • Genomic sequence data is typically represented as alphanumeric strings.
  • the present invention seeks to provide a system and method for analysis of genomic sequence data, and particularly pattern analysis of genomic motifs appearing in genomic sequence data and their possible functional significance.
  • the present invention comprises three sub-systems: a first sub-system and method for analysis of genomic data further described in co-pending U.S. Provisional Patent Application 60/329,114; a second sub-system and method for storage and retrieval of genomic data in compressed data space also described in co-pending U.S. Provisional Patent Application 60/329,112; and a third sub-system and method for genomic sequence similarity comparison in compressed data space also described in co-pending U.S. Provisional Patent Application 60/329,115.
  • These capabilities and preferably other capabilities are preferably provided using a computer software application or a computer database program.
  • a method for analysis of genomic sequence data in compressed data space including: obtaining genomic data, preprocessing genomic data into preprocessed genomic data, compressing at least part of the preprocessed genomic data, storing compressed preprocessed genomic data, indexing compressed preprocessed genomic data, and analyzing genomic data, based at least in part on the indexing.
  • the obtaining includes obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by the genomic sequence data
  • the preprocessing includes calculating and storing a plurality of genomic region sequences, based at least in part on the obtaining, and determining for each of the plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith
  • the compressing includes compressing each of the plurality of uncompressed short genomic segments contained in each of the plurality of genomic region sequences into one of a plurality of compressed short genomic segments
  • the storing includes storing the plurality of compressed short genomic segments
  • the indexing includes indexing the plurality of compressed short genomic segments
  • the analyzing includes: receiving a user query containing at least one logical condition relating to at least one of the following: one of the genomic region sequences, and one of the uncompressed short genomic segments, and retrieving results to the user query, the retrieving including at least one of the following
  • a method for storage and retrieval of compressed genomic sequence data and similarity assessment of genomic sequence data in compressed data space including: receiving uncompressed genomic sequence data, compressing the uncompressed genomic sequence data into compressed genomic sequence data, storing the compressed genomic sequence data, indexing the compressed genomic sequence data, retrieving at least part of the compressed genomic sequence data representing uncompressed genomic sequence data similar to an uncompressed genomic target sequence, based at least in part on the indexing, and decompressing the at least part of the compressed genomic sequence data.
  • the retrieving includes: receiving a target genomic sequence, a first plurality of compressed genomic sequences, representing respectively in compressed form a first plurality of genomic sequences, and at least one similarity criterion, and producing a second plurality of compressed genomic sequences, representing respectively in compressed form a second plurality of genomic sequences, the second plurality of genomic sequences being a subset of the first plurality of genomic sequences, each of the second plurality of genomic sequences being similar to the target genomic sequence, according to the at least one similarity criterion.
  • a method for analysis of genomic sequence data utilizing genomic sequence similarity assessment in compressed data space including: obtaining genomic data, preprocessing genomic data into preprocessed genomic data, compressing at least part of the preprocessed genomic data, storing compressed preprocessed genomic data, indexing compressed preprocessed genomic data, and analyzing genomic data, based at least in part on the indexing, the analyzing also including assessing genomic sequence similarity, based at least in part on the indexing
  • the obtaining includes obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by the genomic sequence data
  • the preprocessing includes calculating and storing a plurality of genomic region sequences, based at least in part on the obtaining, and determining for each of the plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith
  • the compressing includes compressing each of the plurality of uncompressed short genomic segments contained in each of the plurality of genomic region sequences into one of a plurality of compressed short genomic segments
  • the storing includes storing the plurality of compressed short genomic segments
  • the indexing includes indexing the plurality of compressed short genomic segments
  • the analyzing includes: receiving a user query containing at least one logical condition relating to one of the plurality of uncompressed short genomic segments and at least one similarity criterion, extracting a subset of the plurality of uncompressed short genomic segments, each of the subset
  • the plurality of genomic region sequences includes a plurality of protein coding regions.
  • each of the plurality of protein coding regions is normalized.
  • the plurality of genomic region sequences includes a plurality of regions adjacent to protein coding regions.
  • the plurality of regions adjacent to protein coding regions includes a plurality of regions upstream to protein coding regions.
  • the plurality of regions adjacent to protein coding regions includes a plurality of regions downstream to protein coding regions.
  • each of the plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of the plurality of protein coding regions adjacent thereto.
  • the plurality of proteins known to be encoded by the genomic sequence data includes a majority of proteins known to be encoded by the genomic sequence data.
  • the plurality of short genomic segments contained in each of the plurality of genomic region sequences includes a majority of short genomic segments of a given length contained in each of the plurality of genomic region sequences.
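The preprocessing described above (deriving coding, upstream and downstream region sequences, then listing the short segments of a given length in each) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the 0-based coordinates, the `flank` parameter and the reading of "normalized" as "expressed in the coding direction" are all assumptions.

```python
# Complement table for the four nucleotides plus N (unknown nucleotide).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]


def region_sequences(genome, start, end, strand, flank):
    """Return coding, upstream and downstream sequences for one coding region.

    Assumed conventions: start/end are 0-based forward-strand coordinates with
    end exclusive; strand is '+' or '-'; flank is the adjacent-region length.
    Every returned sequence is oriented along the coding direction.
    """
    left = genome[max(0, start - flank):start]
    coding = genome[start:end]
    right = genome[end:end + flank]
    if strand == "+":
        return {"coding": coding, "upstream": left, "downstream": right}
    # On the reverse strand, upstream lies to the right on the forward strand.
    return {
        "coding": reverse_complement(coding),
        "upstream": reverse_complement(right),
        "downstream": reverse_complement(left),
    }


def short_segments(region, length):
    """List every overlapping segment of the given length in a region."""
    return [region[i:i + length] for i in range(len(region) - length + 1)]


if __name__ == "__main__":
    genome = "AAACCCGGGTTT"
    print(region_sequences(genome, 3, 6, "-", 3))
    print(short_segments("GATTACA", 4))
```

Orienting every region along the coding direction means that a segment found upstream of one gene can be compared directly with a segment upstream of another gene on the opposite strand.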
  • genomic sequence data includes: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than the first organism.
  • the method also includes storing for each one of the plurality of proteins at least one of the following protein properties: an organism of expression, a tissue of expression and a function.
  • the at least one logical condition includes a degree of uniqueness of one of the plurality of short genomic sequences relative to at least one of the plurality of genomic sequence regions.
  • the at least one logical condition includes a degree of commonality of one of the plurality of short genomic sequences relative to at least two of the plurality of genomic sequence regions.
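One way to make the uniqueness and commonality conditions concrete is to count, for each short segment, how many region sequences contain it. The sketch below assumes that this count is an acceptable stand-in for the "degree" mentioned above; the function names and the region labels are hypothetical.

```python
from collections import defaultdict


def regions_per_segment(regions, length):
    """Map each short segment to the set of region names that contain it.

    `regions` is a dict of region name -> region sequence (assumed layout).
    """
    hits = defaultdict(set)
    for name, seq in regions.items():
        for i in range(len(seq) - length + 1):
            hits[seq[i:i + length]].add(name)
    return dict(hits)


def unique_to(hits, region):
    """Segments that occur in the given region and in no other region."""
    return sorted(seg for seg, names in hits.items() if names == {region})


def common_to(hits, min_regions=2):
    """Segments that occur in at least `min_regions` different regions."""
    return sorted(seg for seg, names in hits.items() if len(names) >= min_regions)


if __name__ == "__main__":
    regions = {"r1": "GATTACA", "r2": "TTACAGG", "r3": "CCCCGAT"}
    hits = regions_per_segment(regions, 3)
    print(unique_to(hits, "r1"))
    print(common_to(hits))
```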
  • the method also includes: storing, based on user input, a plurality of criteria, determining and marking, each of the plurality of short genomic segments which complies with each one of the criteria, and the user query is based at least in part on at least one of the plurality of criteria.
  • each of the plurality of criteria includes at least one of the at least one logical condition.
  • the determining and storing also includes determining and storing a relationship between at least two of the plurality of short genomic segments, and the logical condition references the relationship. Still further in accordance with a preferred embodiment of the present invention the relationship also includes a relation between a location of a first one of the plurality of short genomic sequences relative to one of the plurality of genomic region sequences, and a second one of the plurality of short genomic sequences relative to the one of the plurality of genomic region sequences.
  • the relationship also includes a similarity between a first one of the plurality of short genomic sequences and a second one of the plurality of short genomic sequences.
  • the retrieving includes: receiving a query, the query including a query condition and uncompressed query data to which the query condition relates, compressing the uncompressed query data into compressed query data, and extracting the at least part of the compressed genomic sequence data, based at least in part on the compressed query data.
  • the retrieving does not require storing the uncompressed genomic sequence data.
  • the retrieving does not require accessing the uncompressed genomic sequence data.
  • the retrieving does not require retrieving the uncompressed genomic sequence data.
  • the retrieving includes sorting the uncompressed genomic sequence data, based at least in part on the indexing.
  • the sorting is alphabetical sorting.
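The retrieval flow described above (compress the query, search the index of compressed data, decompress only the results) can be sketched as below. The encoding is an assumption chosen for illustration: three characters from the alphabet A, C, G, N, T are packed into one byte as a base-6 number, with 0 reserved for padding, so that byte order of the compressed strings matches alphabetical order of the originals. `CompressedIndex` and its methods are hypothetical names, and a sorted Python list stands in for a database index.

```python
import bisect

ALPHABET = "ACGNT"  # alphabetical order; N marks an unknown nucleotide
BASE = len(ALPHABET) + 1  # code 0 is padding, so 3 symbols fit a byte (6**3 = 216)
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def compress(seq):
    """Pack each group of three characters into one byte."""
    out = bytearray()
    for i in range(0, len(seq), 3):
        chunk = seq[i:i + 3]
        value = 0
        for j in range(3):
            value = value * BASE + (CODE[chunk[j]] if j < len(chunk) else 0)
        out.append(value)
    return bytes(out)


def decompress(data):
    """Invert compress(), dropping padding codes."""
    chars = []
    for value in data:
        codes = (value // BASE ** 2, value // BASE % BASE, value % BASE)
        chars.extend(ALPHABET[c - 1] for c in codes if c)
    return "".join(chars)


class CompressedIndex:
    """Sorted compressed strings; the uncompressed data is never stored."""

    def __init__(self, sequences):
        self._keys = sorted(compress(s) for s in sequences)

    def contains(self, query):
        key = compress(query)  # the query is compressed, not the data decompressed
        pos = bisect.bisect_left(self._keys, key)
        return pos < len(self._keys) and self._keys[pos] == key

    def between(self, low, high):
        """Alphabetical range query, evaluated entirely on compressed keys."""
        lo = bisect.bisect_left(self._keys, compress(low))
        hi = bisect.bisect_right(self._keys, compress(high))
        return [decompress(k) for k in self._keys[lo:hi]]


if __name__ == "__main__":
    index = CompressedIndex(["TTGACA", "ACGT", "GATTACA", "CCC", "ACG"])
    print(index.contains("GATTACA"), index.between("ACG", "CCC"))
```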
  • the uncompressed genomic sequence data includes a plurality of uncompressed strings
  • the compressed genomic sequence data includes a plurality of compressed strings, each of the plurality of uncompressed strings being compressed into a single corresponding one of the plurality of compressed strings.
  • each of the plurality of uncompressed strings is an alphanumeric string representing a genomic sequence
  • each alphanumeric string includes a plurality of characters
  • each of the plurality of characters represents one of the following items: a nucleotide in the genomic sequence, and an unknown nucleotide in the genomic sequence
  • each of the plurality of uncompressed strings includes a plurality of uncompressed characters
  • each of the plurality of compressed strings includes a plurality of compressed characters, at least two of the plurality of uncompressed characters being compressed into one of the plurality of compressed characters.
  • each one of the plurality of uncompressed characters is compressed into one of the plurality of compressed characters.
  • the at least two of the plurality of uncompressed characters includes at least three of the plurality of uncompressed characters.
  • the at least two of the plurality of uncompressed characters includes at least four of the plurality of uncompressed characters.
  • At least three of the plurality of uncompressed characters are compressed into each one of a majority of the plurality of compressed characters.
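A small sketch of packing several uncompressed characters into one compressed character follows. It assumes a five-symbol alphabet (the four nucleotides plus N for an unknown nucleotide) and a base-6 code with 0 as padding, which lets three characters share one ordinary byte; the source does not specify this encoding, and the names are hypothetical. The last lines check a property that makes the scheme useful for indexing: sorting the compressed strings gives the same order as sorting the originals alphabetically.

```python
ALPHABET = "ACGNT"  # alphabetical order; N marks an unknown nucleotide
BASE = len(ALPHABET) + 1  # 6 codes including padding; 6**3 = 216 fits in a byte
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def pack(seq):
    """Compress a string: every three characters become one byte."""
    out = bytearray()
    for i in range(0, len(seq), 3):
        chunk = seq[i:i + 3]
        value = 0
        for j in range(3):
            # Missing positions in the last chunk are encoded as padding (0).
            value = value * BASE + (CODE[chunk[j]] if j < len(chunk) else 0)
        out.append(value)
    return bytes(out)


def unpack(data):
    """Recover the original string from its packed form."""
    chars = []
    for value in data:
        codes = (value // BASE ** 2, value // BASE % BASE, value % BASE)
        chars.extend(ALPHABET[c - 1] for c in codes if c)
    return "".join(chars)


if __name__ == "__main__":
    seqs = ["TTGACA", "ACGT", "ACG", "NNA", "A", "GATTACA"]
    packed = sorted(pack(s) for s in seqs)
    # Byte order of the packed strings equals alphabetical order of the originals.
    print([unpack(p) for p in packed] == sorted(seqs))
    print(len("GATTACA"), "characters ->", len(pack("GATTACA")), "bytes")
```

Packing four characters per byte is possible with the same approach only if the alphabet is restricted to the four nucleotides, since 4**4 = 256 while 5**4 = 625 exceeds one byte.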
  • the plurality of compressed strings is stored in a field, the field being part of a table and the table being part of a database.
  • the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing are performed internally by the database. Moreover in accordance with a preferred embodiment of the present invention the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing, do not require a program external to the database.
  • the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing do not require programming.
  • each of the plurality of uncompressed strings is an alphanumeric string, including a plurality of alphanumeric characters.
  • the determining does not require comparing the first genomic sequence with the second genomic sequence.
  • the determining does not include any of the following: decompressing the first compressed genomic sequence, and decompressing the second compressed genomic sequence.
  • the method also includes decompressing each of the second plurality of compressed genomic sequences.
  • the producing does not require comparing the genomic sequence with any of the first plurality of genomic sequences.
  • the producing does not require decompressing any of the first plurality of compressed genomic sequences.
  • genomic data analyzer does not require comparing the first genomic sequence with the second genomic sequence.
  • genomic compressed sequence similarity assessment system also includes a genomic decompressor operative to decompress each of the second plurality of compressed genomic sequences.
  • genomic data extractor does not require comparing the genomic sequence with any of the first plurality of genomic sequences.
  • genomic data extractor does not require decompressing any of the first plurality of compressed genomic sequences.
  • the present invention also seeks to provide an improved method for storage, sorting and retrieval of data in a database.
  • the present invention seeks to provide the capability to store, index, and retrieve multiple alphanumeric strings, in compressed form, in a database and to assess string similarity of strings in their compressed form.
  • These capabilities and preferably other capabilities are preferably provided using a computer software application or a computer database program.
  • a method for storage and retrieval of compressed data and similarity assessment of data in compressed data space including: receiving uncompressed data, compressing the uncompressed data into compressed data, storing the compressed data, indexing the compressed data, retrieving at least part of the compressed data representing uncompressed data similar to an uncompressed target data item, based at least in part on the indexing, and decompressing the at least part of the compressed data.
  • the retrieving includes: receiving a target string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and producing a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of strings being a subset of the first plurality of strings, each of the second plurality of strings being similar to the target string, according to the at least one similarity criterion.
  • a method for storage and retrieval of compressed data including: receiving uncompressed data, compressing the uncompressed data into compressed data, storing the compressed data, indexing the compressed data, retrieving at least part of the compressed data, based at least in part on the indexing, and decompressing the at least part of the compressed data.
  • a method for comparing compressed strings including: receiving two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, comparing the first compressed string with the second compressed string, and determining degree of similarity between the first string and the second string, based at least in part on the comparing.
  • a method for assessing similarity of strings including: receiving the following items: a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and producing a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of strings being a subset of the first plurality of strings, each of the second plurality of strings being similar to the string, according to the at least one similarity criterion.
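The two similarity methods above (scoring a pair of compressed strings, and selecting the similar subset of a set of compressed strings) can be illustrated with a lookup table indexed by compressed byte values. This is only one possible approach and is not stated in the source: it assumes the three-characters-per-byte base-6 encoding used for illustration, a mismatch count as the similarity criterion, and hypothetical function names. The stored strings are never decompressed; only the target is compressed.

```python
ALPHABET = "ACGNT"
BASE = len(ALPHABET) + 1  # code 0 is padding
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
LIMIT = BASE ** 3  # number of distinct compressed byte values (216)


def compress(seq):
    """Pack each group of three characters into one byte."""
    out = bytearray()
    for i in range(0, len(seq), 3):
        chunk = seq[i:i + 3]
        value = 0
        for j in range(3):
            value = value * BASE + (CODE[chunk[j]] if j < len(chunk) else 0)
        out.append(value)
    return bytes(out)


def _codes(value):
    """Split one compressed byte value into its three symbol codes."""
    return (value // BASE ** 2, value // BASE % BASE, value % BASE)


# MISMATCH[a][b]: positions at which the triplets packed in bytes a and b differ.
# Built once from byte values alone, independent of any stored sequence.
MISMATCH = [
    [sum(x != y for x, y in zip(_codes(a), _codes(b))) for b in range(LIMIT)]
    for a in range(LIMIT)
]


def compressed_distance(a, b):
    """Mismatch count between two compressed strings of equal compressed length.

    For originals of equal length this is their Hamming distance; a padding
    position compared with a real character counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("compressed strings must have the same length")
    return sum(MISMATCH[x][y] for x, y in zip(a, b))


def similar_subset(target, candidates, max_mismatches):
    """Return the compressed candidates within max_mismatches of the target."""
    key = compress(target)
    return [
        c for c in candidates
        if len(c) == len(key) and compressed_distance(key, c) <= max_mismatches
    ]


if __name__ == "__main__":
    stored = [compress(s) for s in ["GATTACA", "GACTACA", "CCCCCCC", "ACG"]]
    print(len(similar_subset("GATTACA", stored, 1)))
```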
  • a compressed data storage and retrieval system including: a data compressor operative to receive uncompressed data and to compress the uncompressed data into compressed data, a compressed data indexer operative to store the compressed data and to index the compressed data, and a data extractor employing the compressed data indexer, and operative to retrieve at least part of the compressed data and to decompress the at least part of the compressed data.
  • a compressed string comparison system including: a compressed string evaluator operative to receive two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, and to compare the first compressed string with the second compressed string, and a compressed string analyzer employing the compressed string evaluator, and operative to determine degree of similarity between the first string and the second string.
  • a compressed string similarity assessment system including: a compressed string evaluator operative to receive a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and a compressed string extractor operative to produce a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of strings being a subset of the first plurality of strings, each of the second plurality of strings being similar to the string, according to the at least one similarity criterion.
  • a computer-readable medium including a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving uncompressed data, compressing the uncompressed data into compressed data, storing the compressed data, indexing the compressed data, retrieving at least part of the compressed data, based at least in part on the indexing, and decompressing the at least part of the compressed data.
  • a computer-readable medium including a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, the second string being different from the first string, comparing the first compressed string with the second compressed string, and determining degree of similarity between the first string and the second string, based at least in part on the comparing.
  • a computer-readable medium including a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving the following items: a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and producing a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of strings being a subset of the first plurality of strings, each of the second plurality of strings being similar to the string, according to the at least one similarity criterion.
  • the retrieving includes: receiving a query, the query including a query condition and uncompressed query data to which the query condition relates, compressing the uncompressed query data into compressed query data, and extracting the at least part of the compressed data, based at least in part on the compressed query data.
  • the retrieving does not require storing the uncompressed data.
  • the retrieving does not require accessing the uncompressed data.
  • the retrieving does not require retrieving the uncompressed data.
  • the retrieving includes sorting the uncompressed data, based at least in part on the indexing.
  • the sorting is alphabetical sorting.
  • the uncompressed data includes a plurality of uncompressed strings
  • the compressed data includes a plurality of compressed strings, each of the plurality of uncompressed strings being compressed into a single corresponding one of the plurality of compressed strings.
  • each of the plurality of uncompressed strings is an alphanumeric string, including a plurality of alphanumeric characters.
  • each of the plurality of uncompressed strings includes a plurality of uncompressed characters
  • each of the plurality of compressed strings includes a plurality of compressed characters, at least two of the plurality of uncompressed characters being compressed into one of the plurality of compressed characters.
  • each one of the plurality of uncompressed characters is compressed into one of the plurality of compressed characters.
  • the at least two of the plurality of uncompressed characters includes at least three of the plurality of uncompressed characters.
  • the at least two of the plurality of uncompressed characters includes at least four of the plurality of uncompressed characters.
  • At least three of the plurality of uncompressed characters are compressed into each one of a majority of the plurality of compressed characters.
  • the plurality of compressed strings is stored in a field, the field being part of a table and the table being part of a database.
  • the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing are performed internally by the database.
  • the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing do not require a program external to the database. Additionally in accordance with a preferred embodiment of the present invention the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing, do not require programming.
  • each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
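The effect of bits that hold more than two values on how many characters fit in one byte can be worked out with simple arithmetic: a byte of 8 bits with v values each has v**8 states, and k characters from an alphabet of size s fit when s**k does not exceed that. The helper below is a hypothetical illustration of that calculation; the three-valued bit used in the example is an assumption, not a figure from the source.

```python
def chars_per_byte(alphabet_size, values_per_bit=2, bits=8):
    """Largest k such that alphabet_size**k <= values_per_bit**bits."""
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    capacity = values_per_bit ** bits  # distinct states of one byte
    k = 0
    while alphabet_size ** (k + 1) <= capacity:
        k += 1
    return k


if __name__ == "__main__":
    # Nucleotides plus an unknown symbol (5), and alphanumerics (36).
    for size in (4, 5, 36):
        print(size, chars_per_byte(size), chars_per_byte(size, values_per_bit=3))
```

Under these assumptions a conventional byte holds three characters of a five-symbol alphabet, while a byte of three-valued bits would hold five.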
  • the determining does not include comparing the first string with the second string.
  • the determining does not include any of the following: decompressing the first compressed string, and decompressing the second compressed string.
  • the first string and the second string are alphanumeric strings.
  • each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
  • the method also includes decompressing each of the second plurality of compressed strings.
  • the producing does not require decompressing any of the first plurality of compressed strings.
  • each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
  • the data extractor provides the following functionality: receiving a query including a query condition and uncompressed query data to which the query condition relates, compressing the uncompressed query data into compressed query data, and extracting the at least part of the compressed data, based at least in part on the compressed query data.
  • the functionality of the data extractor does not require storing the uncompressed data.
  • the functionality of the data extractor does not require accessing the uncompressed data.
  • the functionality of the data extractor does not require retrieving the uncompressed data.
  • the data extractor employing the compressed data indexer is operative to sort the uncompressed data.
  • the data extractor employing the compressed data indexer is operative to alphabetically sort the uncompressed data.
  • the uncompressed data includes a plurality of uncompressed strings
  • the compressed data includes a plurality of compressed strings, each of the plurality of uncompressed strings being compressed into a single corresponding one of the plurality of compressed strings.
  • each of the plurality of uncompressed strings is an alphanumeric string including a plurality of alphanumeric characters.
  • each of the plurality of uncompressed strings includes a plurality of uncompressed characters, each of the plurality of compressed strings includes a plurality of compressed characters, at least two of the plurality of uncompressed characters being compressed into one of the plurality of compressed characters.
  • each one of the plurality of uncompressed characters is compressed into one of the plurality of compressed characters.
  • the at least two of the plurality of uncompressed characters includes at least three of the plurality of uncompressed characters.
  • the at least two of the plurality of uncompressed characters includes at least four of the plurality of uncompressed characters.
  • At least three of the plurality of uncompressed characters are compressed into each one of a majority of the plurality of compressed characters.
  • the plurality of compressed strings is stored in a field, the field being part of a table and the table being part of a database.
  • functionality of the data compressor, the compressed data indexer, and the data extractor is performed internally by a database.
  • functionality of the data compressor, the compressed data indexer, and the data extractor does not require a program external to the database.
  • functionality of the data compressor, the compressed data indexer, and the data extractor does not require programming.
  • each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
  • functionality of the compressed string analyzer does not require comparing the first string with the second string.
  • functionality of the compressed string analyzer does not require any of the following: decompressing the first compressed string, and decompressing the second compressed string.
  • the first string and the second string are alphanumeric strings.
  • each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
  • the compressed string similarity assessment system also includes a compressed string decompressor operative to decompress each of the second plurality of compressed strings.
  • functionality of the compressed string extractor does not require comparing the string with any of the first plurality of strings.
  • functionality of the compressed string extractor does not require decompressing any of the first plurality of compressed strings.
  • each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
  • the present invention also seeks to provide an improved method for presentation of genomic sequence data.
  • the present invention seeks to increase the ease with which genomic motifs and their inverse-reversed sequences may be visually distinguished from each other.
  • the present invention enhances the ease with which a viewer can visually distinguish purine nucleotides from pyrimidine nucleotides and can visually distinguish one set of complementary nucleotides, i.e. adenine-thymine, from another set of complementary nucleotides, i.e. guanine-cytosine.
  • These and other enhanced visual distinctions are preferably provided by employing a novel type of genomic computer font. Different colors may also be applied to different nucleotides.
  • a method for displaying genomic sequence data including receiving an alphanumeric string representing genomic sequence data, the alphanumeric string including a plurality of characters, each of the characters representing a nucleotide in the genomic sequence; and expressing the alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, the second genomic attribute being different from the first genomic attribute
  • a method for graphically displaying genomic sequence information including: receiving a first alphanumeric string representing a first genomic sequence, and a second alphanumeric string representing a second genomic sequence, the second genomic sequence being a reversed-inversed genomic sequence of the first genomic sequence; and graphically displaying the first alphanumeric string and the second alphanumeric string, such that a graphical display of the second alphanumeric string is a horizontal and vertical mirror image of a graphical display of the first alphanumeric string.
  • a genomic display system comprising: a receiving apparatus operative to receive an alphanumeric string representing genomic sequence data, said alphanumeric string comprising a plurality of characters, each of said characters representing a nucleotide in said genomic sequence; and an expressing apparatus operative to express said alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute.
  • a system for graphically displaying genomic sequence information comprising: a genomic sequence expressor, receiving a first alphanumeric string representing a first genomic sequence and a second alphanumeric string representing a second genomic sequence, said second genomic sequence being a reversed-inversed genomic sequence of said first genomic sequence; and expressing said first alphanumeric string and said second alphanumeric string, such that a graphical display of said second alphanumeric string is a horizontal and vertical mirror image of a graphical display of said first alphanumeric string; and a display operative to receive an output from said genomic sequence expressor and to provide a visually sensible display of an expression of said graphical display of said first alphanumeric string and said graphical display of said second alphanumeric string.
  • a computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving an alphanumeric string representing genomic sequence data, said alphanumeric string comprising a plurality of characters, each of said characters representing a nucleotide in said genomic sequence; and expressing said alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute.
  • the first plurality of nucleotides are represented by at least one first representing attribute
  • the second plurality of nucleotides are represented by at least one second representing attribute, the second representing attribute being different from the first representing attribute
  • a computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving a first alphanumeric string representing a first genomic sequence and a second alphanumeric string representing a second genomic sequence, said second genomic sequence being a reversed-inversed genomic sequence of said first genomic sequence; and graphically displaying said first alphanumeric string and said second alphanumeric string, such that a graphical display of said second alphanumeric string is a horizontal and vertical mirror image of a graphical display of said first alphanumeric string.
  • the representation comprises a human sensible representation.
  • the at least one first representing attribute and the at least one second representing attribute are graphical attributes.
  • the graphical attributes are shapes.
  • the graphical attributes are positions.
  • the positions are vertical positions.
  • the graphical attributes are orientations.
  • the orientations are vertical orientations.
  • the graphical attributes are colors.
  • the representation also includes representing each of four nucleotides: adenine, thymine, cytosine, and guanine, by a different color.
  • the human sensible representation includes one of the following: a shape with a letter and a shape without a letter.
  • the human sensible representation is produced using a computer font.
  • the computer font is a TRUETYPE® font.
  • the representation comprises a machine sensible representation.
  • the at least one first representing attribute and the at least one second representing attribute are machine sensible attributes.
  • the first plurality of nucleotides are purine nucleotides
  • the second plurality of nucleotides are pyrimidine nucleotides.
  • the first plurality of nucleotides consists of adenine and thymine nucleotides
  • the second plurality of nucleotides consists of guanine and cytosine nucleotides.
  • the representation also distinguishes a third plurality of nucleotides, sharing in common a third genomic attribute, from a fourth plurality of nucleotides, sharing in common a fourth genomic attribute, the fourth genomic attribute being different from the third genomic attribute.
  • the third plurality of nucleotides are represented by at least one third representing attribute
  • the fourth plurality of nucleotides are represented by at least one fourth representing attribute, the at least one third representing attribute being different from the at least one fourth representing attribute.
  • the first plurality of nucleotides are purine nucleotides
  • the second plurality of nucleotides are pyrimidine nucleotides
  • the third plurality of nucleotides are adenine and thymine nucleotides
  • the fourth plurality of nucleotides are guanine and cytosine nucleotides.
  • the method also includes expressing the first alphanumeric string and the second alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, the second genomic attribute being different from the first genomic attribute.
  • the genomic sequence expressor is also operative to receive an alphanumeric string which represents genomic sequence data, the alphanumeric string including a plurality of characters, each of the plurality of characters representing a nucleotide in the genomic sequence, and to express the alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, the second genomic attribute being different from the first genomic attribute, and the display is also operative to receive an output from the expressor and to display the genomic sequence using the representation.
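The attribute groupings named above (purine versus pyrimidine, adenine/thymine versus guanine/cytosine) and the reversed-inversed sequence can be sketched in a few lines of Python. This is an illustrative sketch only: the names `classify` and `reverse_complement` are hypothetical, and it assumes that "reversed-inversed" means the reverse complement of the sequence. The graphical mirror-image display itself is not modeled.

```python
# Groupings named in the text. Purines are A and G; pyrimidines are C and T.
PURINES = {"A", "G"}
AT_GROUP = {"A", "T"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify(nucleotide):
    """Return the two attributes a display could use to distinguish a nucleotide."""
    if nucleotide not in COMPLEMENT:
        raise ValueError(f"not a known nucleotide: {nucleotide!r}")
    ring = "purine" if nucleotide in PURINES else "pyrimidine"
    pairing = "AT" if nucleotide in AT_GROUP else "GC"
    return ring, pairing


def reverse_complement(sequence):
    """Assumed reading of 'reversed-inversed': reverse the string, then complement it."""
    return "".join(COMPLEMENT[n] for n in reversed(sequence))


print(classify("A"))               # ('purine', 'AT')
print(reverse_complement("ATGC"))  # GCAT
```

A display layer could map the first attribute to, say, vertical position and the second to shape or color; that mapping is a design choice left open here.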
  • Fig. 1 is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 2 is a simplified block diagram illustrating a genomic data compression mechanism, which is a preferred implementation of a compression mechanism and a decompression mechanism constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 3A is a simplified illustration of a preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention
  • Fig. 3B is a simplified illustration of an alternative preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention.
  • Fig. 4A is a table illustrating preferred values assignable to 'header' bits in accordance with a preferred embodiment of the present invention
  • Fig. 4B is a table illustrating preferred values assignable to 'nucleotide-representing' bits in accordance with a preferred embodiment of the present invention
  • Fig. 4C is a table illustrating preferred values assignable to bits when encoding one or more uncommon characters
  • Fig. 5 is a simplified flowchart illustrating operation of a genomic data compression engine constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 6 is a simplified flowchart illustrating a methodology for generating translation tables used in accordance with a preferred embodiment of the present invention
  • Fig. 7A is a simplified illustration of a compression table employed in accordance with a preferred embodiment of the present invention.
  • Fig. 7B is a simplified illustration of a decompression table employed in accordance with a preferred embodiment of the present invention.
  • Fig. 8 is a simplified flowchart illustrating operation of a genomic data decompression engine constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 9A is a simplified illustration of an example of compression of an uncompressed genomic string representing genomic sequence data into a compressed genomic string
  • Fig. 9B is a simplified illustration of an example of compression of an uncompressed genomic string representing genomic sequence data, and containing an unknown nucleotide, into a compressed genomic string;
  • Fig. 10 is a simplified block diagram illustrating shifted genomic sequences utilized by a compressed genomic sequence similarity search module constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 11 is a simplified flowchart illustrating operation of the compressed genomic sequence similarity search module constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 12A is a simplified illustration of an example of identifying a genomic sequence having one nucleotide replacement relative to a target genomic sequence
  • Fig. 12B is a simplified illustration of an example of identifying a genomic sequence having two nucleotide additions relative to a target genomic sequence
  • Fig. 12C is a simplified illustration of an example of identifying a genomic sequence having one nucleotide deletion relative to a target genomic sequence
  • Fig. 13 is a simplified functional diagram of a computer database application constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 14 is a simplified block diagram illustrating a genomic pattern analysis database constructed and operative in accordance with a preferred embodiment of the present invention.
  • Fig. 15 is a flowchart diagram illustrating operation of a genomic preprocessing unit constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 16 is a flowchart diagram illustrating operation of a genomic query processing unit constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 17 is a simplified illustration of an example of genomic pattern analysis performed by a preferred embodiment of the present invention.
  • Fig. 18 is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention.
  • Fig. 19 is a simplified block diagram illustrating a mechanism for data compression and decompression, which is a preferred implementation of a compression mechanism and a decompression mechanism constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 20A is a simplified illustration of a preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention
  • Fig. 20B is a simplified illustration of an alternative preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention
  • Fig. 21A is a table illustrating preferred values assignable to 'header' bits in accordance with a preferred embodiment of the present invention
  • Fig. 21B is a table illustrating preferred values assignable to character-representing bits in accordance with a preferred embodiment of the present invention
  • Fig. 21C is a table illustrating preferred values assignable to bits when encoding one or more uncommon characters
  • Fig. 22 is a simplified flowchart illustrating operation of a compression engine constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 23 is a simplified flowchart illustrating a methodology for generating translation tables used in accordance with a preferred embodiment of the present invention.
  • Fig. 24A is a simplified illustration of a compression table employed in accordance with a preferred embodiment of the present invention.
  • Fig. 24B is a simplified illustration of a decompression table employed in accordance with a preferred embodiment of the present invention.
  • Fig. 25 is a simplified flowchart illustrating operation of a decompression engine constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 26A is a simplified illustration of an example of compression of an uncompressed string into a compressed string
  • Fig. 26B is a simplified illustration of an example of compression of an uncompressed string, and containing a rare character, into a compressed string;
  • Fig. 27 is a simplified block diagram illustrating shifted compressed strings utilized by a compressed string similarity search module constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 28 is a simplified flowchart illustrating operation of the compressed string similarity search module constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 29A is a simplified illustration of an example of identifying a character string having one character replacement relative to a target string
  • Fig. 29B is a simplified illustration of an example of identifying a character string having two character additions relative to a target string
  • Fig. 29C is a simplified illustration of an example of identifying a character string having one character deletion relative to a target string
  • Fig. 30 is a simplified illustration of a triphase-bit compressed character used in accordance with a preferred embodiment of the present invention.
  • Fig. 31 is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 32 is a simplified flowchart illustrating preferred operation of a genomic graphic representation engine, constructed and operative in accordance with a preferred embodiment of the present invention
  • Fig. 33 is a simplified illustration of an example demonstrating conversion of alphanumeric genomic representation into graphic genomic representation
  • Fig. 34 is a simplified illustration of an example demonstrating an advantage of a graphic genomic representation in comparing a genomic motif sequence with the inverse-reversed sequence of this motif;
  • Fig. 35 is a simplified illustration of an example demonstrating an advantage of a graphic genomic representation in visually distinguishing adenine-thymine-rich sequences from cytosine-guanine-rich sequences;
  • Fig. 36 is a simplified illustration of an example demonstrating an advantage of a graphic genomic representation in visually distinguishing purine nucleotides from pyrimidine nucleotides.
  • FIG. 1 is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention.
  • Each of a plurality of genomic sequences 100 is compressed by a compression mechanism 102, collectively yielding a respective plurality of compressed genomic sequences 104, each typically represented as a compressed alphanumeric string.
  • the compressed genomic sequences 104 are stored in a plurality of records in a table in a database 106, and a compressed genomic sequences index 108 is constructed, which indexes the compressed genomic sequences 104.
  • a target genomic sequence 110, one or more similarity criteria 111 and a query condition 112 relating to the target genomic sequence 110 are provided by a user of the database 106, in order to find all genomic sequences 100 in the database 106 which comply with the provided query condition 112 as it is applied to genomic sequences similar to the target genomic sequence 110, to a degree specified by the similarity criteria 111.
  • the target genomic sequence 110 is compressed by a compression mechanism 114, which may be similar to compression mechanism 102, into a compressed target genomic sequence 116.
  • the compressed target genomic sequence 116 and the similarity criteria 111 are passed on as input to a compressed genomic sequence similarity search module 118.
  • the compressed genomic sequence similarity search module 118, in conjunction with the compressed genomic sequence index 108, is operative to query the database 106, retrieving, in compressed form, a plurality of compressed genomic sequences which comply with the query condition 112, as applied to genomic sequences which are similar to the target genomic sequence 110, to a degree defined by similarity criteria 111. These results are designated compressed similar to target query results 120.
  • the compressed genomic sequence similarity search module 118 is further described below with reference to Figs. 10, 11, 12A, 12B and 12C.
  • Each of the compressed genomic sequences similar to target 120 is decompressed by a decompression mechanism 122, collectively yielding a respective plurality of genomic sequences similar to target 124.
  • Preferred embodiments of the compression mechanisms 102 and 114 and of the decompression mechanism 122 which preferably reverses the process of these compression mechanisms, are further described below with reference to Fig. 2. It is appreciated that while the end result is retrieval of genomic sequences which are similar to the target genomic sequence 110, the actions of sequence similarity comparison and retrieval are preferably performed on compressed genomic sequences: comparing compressed target genomic sequence 116 with compressed genomic sequences 104, using the compressed genomic sequence index 108.
  • An important aspect of the present invention is that it allows determining the level of similarity between genomic sequences, by comparing compressed genomic sequences to which these genomic sequences correspond.
  • Fig. 2 is a simplified block diagram illustrating a mechanism for compression and decompression of genomic data. This mechanism is a preferred implementation of the compression mechanisms 102 and 114 and of the decompression mechanism 122 described hereinabove with reference to Fig. 1.
  • An uncompressed genomic string 200 is given as an example of one of the genomic sequences 100 or of the target genomic sequence 110 described hereinabove with reference to Fig. 1.
  • the uncompressed genomic string 200 represents genomic sequence data.
  • genomic sequence data is typically represented as an alphanumeric string comprising only five letters: A, T, C, and G, representing the four nucleotides, which comprise the genome: Adenine, Thymine, Cytosine, and Guanine respectively, and the letter N or the minus sign, representing locations in the sequence in which the nucleotide is currently not known.
  • the letter N typically appears in genomic sequence data much less frequently than do the other four letters. Further, when N's do appear in a genomic sequence they are frequently found contiguously rather than separately since it is frequently the case that a contiguous group of nucleotides in the sequence, rather than just one nucleotide, are unknown.
  • the uncompressed genomic string 200 comprises a plurality of bytes, each storing a character - A, T, C, or G - representing a nucleotide.
  • the uncompressed genomic string 200 shown in Fig. 2 comprises only three bytes: BYTE I, BYTE II and BYTE III, each storing a character representing a corresponding nucleotide: NUC 1, NUC 2 and NUC 3, respectively.
  • the uncompressed genomic string 200 is compressed by a genomic data compression engine 202 into a compressed genomic string 204. Operation of a preferred embodiment of the genomic data compression engine 202 is further described below with reference to Fig. 5.
  • the genomic data compression engine 202 employs a compression table 206 in compression of uncompressed genomic string 200 into compressed genomic string 204.
  • the compression table 206 is preferably a translation table and holds a list of all possible or reasonable 3-alphanumeric-character combinations and, for each such combination, the byte or bytes into which it may be compressed.
  • a preferred embodiment of the compression table 206 is further described below with reference to Fig. 7A.
  • the compressed genomic string 204 preferably comprises one or more compressed bytes 208.
  • the compressed genomic string 204 shown in Fig. 2 comprises only one compressed byte 208, which represents in compressed form all three nucleotides which are represented in the uncompressed genomic string 200, NUC 1, NUC 2 and NUC 3, which nucleotides require three bytes of storage, BYTE I, BYTE II and BYTE III, in their original uncompressed form.
  • compressed genomic string 204, which comprises a plurality of compressed bytes 208, may alternatively be stored as one or more integers.
  • an Integer is typically defined as a data-type comprising 4 bytes. It is therefore possible to compress an uncompressed genomic string 200 comprising 12 nucleotides into 4 compressed bytes 208, and then to store all 4 resulting compressed bytes 208 as one Integer. Longer genomic strings may be compressed into longer Integers, such as the BigInt data type in MS SQL-2000®, which comprises 8 bytes, or as a plurality of Integers or BigInts. For example, a genomic string comprising 48 nucleotides may be compressed into 2 BigInts.
  • a TinyInt datatype, which comprises one byte of memory
  • genomic strings which are longer than 3 or 4 characters, i.e. are compressed into more than one TinyInt
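The integer-storage arithmetic described above can be checked with a short Python sketch. It is not the patent's implementation: the function names are hypothetical, big-endian byte order and unsigned integers are assumptions (a database Integer column would normally be signed), and the input is required to be a whole number of integers so that no padding rule has to be invented.

```python
import math

NUCLEOTIDES_PER_BYTE = 3  # one compressed byte 208 holds up to three nucleotides


def integers_needed(nucleotide_count, integer_bytes):
    """How many integers of the given width are needed for a sequence of this length."""
    compressed_bytes = math.ceil(nucleotide_count / NUCLEOTIDES_PER_BYTE)
    return math.ceil(compressed_bytes / integer_bytes)


def pack_as_integers(compressed, integer_bytes=4):
    """Store compressed bytes as a list of unsigned integers (big-endian, assumed)."""
    if len(compressed) % integer_bytes:
        raise ValueError("length must be a multiple of the integer width")
    return [
        int.from_bytes(compressed[i:i + integer_bytes], "big")
        for i in range(0, len(compressed), integer_bytes)
    ]


def unpack_integers(values, integer_bytes=4):
    """Recover the compressed bytes from the stored integers."""
    return b"".join(value.to_bytes(integer_bytes, "big") for value in values)


print(integers_needed(12, 4))  # 1: twelve nucleotides fit one 4-byte Integer
print(integers_needed(48, 8))  # 2: forty-eight nucleotides fit two 8-byte BigInts
```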
  • the compressed byte 208 is further described below with reference to Figs. 3, 4A, 4B, and 4C.
  • the compressed genomic string 204 may be decompressed by a genomic data decompression engine 210 back to uncompressed genomic string 200, preferably by reversal of the methodology of the genomic data compression engine 202. Operation of the genomic data decompression engine 210 is further described below with reference to Fig. 8.
  • Genomic data decompression engine 210 employs a decompression table 212 in decompressing compressed genomic string 204 into uncompressed genomic string 200.
  • the decompression table 212 is a translation table and holds a list of the bit-values for each possible or reasonable compressed byte 208 and, for each such compressed byte, the alphanumeric string, containing up to three characters, into which it may be decompressed.
  • the decompression table 212 is further described below with reference to Fig. 7B.
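A decompression table of this kind can be sketched in Python as a dictionary from byte value to decoded string. The sketch borrows the bit layout described with reference to Figs. 3A and 4A to 4C and rests on several assumptions that are not confirmed by the text: BIT I is taken to be the most significant bit, the bit-pair '10' is taken to mean 'G' by elimination, and the N-count codes '01' and '11' are taken to mean one and three N's (only '10' for two N's is stated). All names are hypothetical.

```python
# Bit-pair values: 00, 01 and 11 follow the text; 10 -> 'G' is inferred by elimination.
PAIR_TO_NUCLEOTIDE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}


def decode_byte(byte):
    """Decode one compressed byte into a string of up to three characters."""
    header = (byte >> 6) & 0b11  # BIT I and BIT II (assumed most significant)
    pairs = [(byte >> shift) & 0b11 for shift in (4, 2, 0)]
    if header == 0b00:
        # Byte holds only unknown nucleotides; BIT-PAIR I carries the count (assumed coding).
        if pairs[0] == 0:
            raise ValueError("no N count encoded")
        return "N" * pairs[0]
    # Header 01, 10, 11 means one, two or three nucleotides; remaining bits are ignored.
    return "".join(PAIR_TO_NUCLEOTIDE[p] for p in pairs[:header])


# Build the table once, skipping byte values that have no meaning in this scheme.
DECOMPRESSION_TABLE = {}
for value in range(256):
    try:
        DECOMPRESSION_TABLE[value] = decode_byte(value)
    except ValueError:
        pass


def decompress(data):
    """Look up each compressed byte and concatenate the results."""
    return "".join(DECOMPRESSION_TABLE[b] for b in data)


print(decompress(bytes([0b11001110, 0b10001100])))  # ATGAT
```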
  • FIG. 3A is a simplified illustration of a preferred implementation of a byte bitmap preferably employed in generating the compressed byte 208 of Fig. 2.
  • the compressed byte 208 of Fig. 2 preferably comprises 8 bits: BIT I, BIT II, BIT III, BIT IV, BIT V, BIT VI, BIT VII & BIT VIII. Preferably, these bits are divided into four groups, each containing 2 bits.
  • a HEADER typically contains BIT I and BIT II
  • a BIT-PAIR I typically contains BIT III and BIT IV
  • a BIT-PAIR II typically contains BIT V and BIT VI
  • a BIT-PAIR III typically contains BIT VII and BIT VIII
  • the HEADER preferably stores information about what the other bits in the compressed byte, BIT III - BIT VIII, represent
  • the compression of genomic data is such that the compressed byte may either store up to three nucleotides, or may store up to three unknown nucleotides, i.e. 'N's, but preferably not a combination of nucleotides and N's.
  • the HEADER stores information which indicates how many nucleotides the compressed byte 208 represents, one, two or three, or alternatively if the entire compressed-byte represents one or more 'N's
  • the values assignable to bits of the HEADER are further described below with reference to Fig. 4A. BIT-PAIR I, BIT-PAIR II and BIT-PAIR III each contain 2 bits which are capable, when taken together, of representing one of four possible nucleotides, A, T, C and G.
  • values assigned to the two bits of BIT-PAIR I determine whether the compressed byte 208 represents one, two or three N's. Values assignable to bits of BIT-PAIR I, determining the number of N's that the compressed byte 208 represents, are further described below with reference to Fig. 4C.
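The four 2-bit groups of the byte bitmap can be made concrete with a small Python sketch. The class name `CompressedByte` and its methods are hypothetical, and the sketch assumes BIT I is the most significant bit of the byte, which the text does not state.

```python
from typing import NamedTuple


class CompressedByte(NamedTuple):
    """The four 2-bit groups of the Fig. 3A bitmap, each held as an integer 0..3."""
    header: int      # BIT I and BIT II
    bit_pair_1: int  # BIT III and BIT IV
    bit_pair_2: int  # BIT V and BIT VI
    bit_pair_3: int  # BIT VII and BIT VIII

    @classmethod
    def from_int(cls, value):
        """Split an 8-bit value into its four groups (BIT I assumed most significant)."""
        if not 0 <= value <= 0xFF:
            raise ValueError("a compressed byte must fit in 8 bits")
        return cls((value >> 6) & 0b11, (value >> 4) & 0b11,
                   (value >> 2) & 0b11, value & 0b11)

    def to_int(self):
        """Reassemble the four groups into a single byte value."""
        return ((self.header << 6) | (self.bit_pair_1 << 4)
                | (self.bit_pair_2 << 2) | self.bit_pair_3)


fields = CompressedByte.from_int(0b11001110)
print(fields)  # CompressedByte(header=3, bit_pair_1=0, bit_pair_2=3, bit_pair_3=2)
```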
  • Fig. 3B is a simplified illustration of an alternative preferred implementation of a byte bitmap preferably employed in generating the compressed byte 208 of Fig. 2.
  • Alternative compressed byte 300 is an alternative byte bitmap which may be used for compression of genomic data, instead of the byte bitmap of compressed byte 208, depicted in Fig. 3A.
  • Alternative compressed byte 300 comprises four bit-pairs, BIT-PAIR I, BIT-PAIR II, BIT-PAIR III and BIT-PAIR IV, rather than only three bit-pairs in compressed byte 208 of Fig. 3A. Unlike compressed byte 208 of Fig. 3A, in which BIT I and BIT II function as a HEADER, alternative compressed byte 300 does not comprise any such header. All 8 bits of alternative compressed byte 300 function as one of four bit-pairs, each of said bit-pairs representing a nucleotide. Alternative compressed byte 300 is therefore capable of representing 4 nucleotides in compressed form, as opposed to compressed byte 208 of Fig. 3A, which is capable of representing 3 nucleotides.
  • alternative compressed byte 300 may be useful when compressing genomic sequences which do not include unknown nucleotides, and are of a fixed length. If the length of the uncompressed genomic string 200 is known, then it is possible to ignore the possible trailing zeros at the right end of the alternative compressed byte 300, which do not represent a nucleotide, but rather represent a blank.
  • an uncompressed genomic string 200 which is known to be 7 nucleotides long, may be compressed into 2 alternative compressed bytes 300: the first containing in compressed form 4 nucleotides, and the second containing 3 nucleotides.
  • BIT VII and BIT VIII of BIT-PAIR IV contain zeros which are ignored because the uncompressed genomic string is known to be 7 nucleotides long, despite the absence of a 'header' which would explicitly instruct to ignore these bits.
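The headerless, four-nucleotides-per-byte alternative can be sketched as follows. The functions `pack4` and `unpack4` are hypothetical names. The bit-pair values are borrowed from the Fig. 4B description ('00' for A, '01' for C, '11' for T), with '10' for G inferred by elimination, and the first nucleotide is assumed to occupy the most significant bit-pair.

```python
# Assumed bit-pair values; '10' -> 'G' is an inference, not stated in the text.
NUCLEOTIDE_TO_PAIR = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
PAIR_TO_NUCLEOTIDE = {pair: n for n, pair in NUCLEOTIDE_TO_PAIR.items()}


def pack4(sequence):
    """Pack four nucleotides per byte, with no header bits."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        value = 0
        for nucleotide in chunk:
            value = (value << 2) | NUCLEOTIDE_TO_PAIR[nucleotide]
        # Unused bit-pairs at the right end are left as zeros (blanks).
        value <<= 2 * (4 - len(chunk))
        out.append(value)
    return bytes(out)


def unpack4(data, length):
    """Unpack, using the known sequence length to discard trailing blank bit-pairs."""
    nucleotides = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            nucleotides.append(PAIR_TO_NUCLEOTIDE[(byte >> shift) & 0b11])
    return "".join(nucleotides[:length])


packed = pack4("ATGCATG")          # 7 nucleotides -> 2 bytes
print(len(packed))                 # 2
print(unpack4(packed, 7))          # ATGCATG
```

Without the known length, the trailing zero bit-pair would be indistinguishable from an 'A' under this assumed coding, which is why the text restricts the alternative layout to fixed-length sequences.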
  • Fig. 4A is a table illustrating preferred values assignable to BIT I and BIT II, both belonging to the HEADER of compressed byte 208 shown in Fig. 3A.
  • Assigning the value '01' to the bits of the HEADER, i.e. assigning '0' to BIT I and '1' to BIT II of compressed byte 208 shown in Fig. 3A, signifies that the compressed byte 208 represents only one nucleotide, as represented by the values in BIT III and BIT IV, both belonging to BIT-PAIR I of compressed byte 208.
  • the remaining four bits of the compressed byte 208, BIT V, BIT VI, BIT VII and BIT VIII, are to be ignored and do not represent any additional nucleotide.
  • the remaining two bits of the compressed byte 208, BIT VII & BIT VIII, are to be ignored and do not represent any additional nucleotide.
  • Fig. 4B is a table illustrating the preferred values assignable to the nucleotide-representing bits: BIT III, BIT IV, BIT V, BIT VI, BIT VII, & BIT VIII of compressed byte 208 shown in Fig. 3A.
  • each of BIT-PAIR I, BIT-PAIR II and BIT-PAIR III in compressed byte 208 comprises a pair of bits: BIT III & BIT IV, BIT V & BIT VI, and BIT VII & BIT VIII respectively.
  • the values presented in Fig. 4B are values which may be assigned to each of the above mentioned pairs of bits so as to allow each of these pairs of bits to represent one of the four possible genomic nucleotides A, T, C or G.
  • Assigning the value of '00' to any of the three bit-pairs representing one of the three nucleotides, i.e. assigning '0' to BIT III and '0' to BIT IV, or assigning '0' to BIT V and '0' to BIT VI, or assigning '0' to BIT VII and '0' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, of the compressed byte 208 of Fig. 3A, represents the nucleotide 'A'.
  • Assigning the value of '01' to any of the three bit-pairs representing one of the three nucleotides, i.e. assigning '0' to BIT III & '1' to BIT IV, or assigning '0' to BIT V & '1' to BIT VI, or assigning '0' to BIT VII & '1' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the nucleotide 'C'.
  • Assigning the value of '11' to any of the three bit-pairs representing one of the three nucleotides, i.e. assigning '1' to BIT III & '1' to BIT IV, or assigning '1' to BIT V & '1' to BIT VI, or assigning '1' to BIT VII & '1' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the nucleotide 'T'.
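Combining the header values with the bit-pair values gives a direct way to compute a compressed byte for one to three known nucleotides. The Python sketch below is illustrative only: `encode_known` is a hypothetical name, BIT I is assumed to be the most significant bit, the header is taken to equal the nucleotide count ('01', '10', '11' for one, two, three), and '10' for 'G' is inferred by elimination since only the values for A, C and T appear above.

```python
# '00' -> A, '01' -> C and '11' -> T follow the text; '10' -> G is inferred.
NUCLEOTIDE_TO_PAIR = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_known(substring):
    """Encode one to three known nucleotides into a single compressed byte."""
    if not 1 <= len(substring) <= 3:
        raise ValueError("expected one to three nucleotides")
    byte = len(substring) << 6  # header: 01, 10 or 11 equals the nucleotide count
    for position, nucleotide in enumerate(substring):
        # BIT-PAIR I sits at bits 5-4, BIT-PAIR II at 3-2, BIT-PAIR III at 1-0.
        byte |= NUCLEOTIDE_TO_PAIR[nucleotide] << (4 - 2 * position)
    return byte  # unused bit-pairs stay zero and are ignored on decoding


print(format(encode_known("ATG"), "08b"))  # 11001110
print(format(encode_known("AT"), "08b"))   # 10001100
print(format(encode_known("C"), "08b"))    # 01010000
```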
  • Fig. 4C is a table illustrating preferred values assignable to BIT III and BIT IV of Fig. 3A respectively, when encoding one or more unknown nucleotides.
  • a value '00' assigned to the bits of the HEADER of Fig. 3A signifies that the compressed byte 208 represents one or more unknown nucleotides and does not represent any known nucleotides.
  • BIT III and BIT IV may be used to signify if the entire byte represents one, two or three 'N's.
  • Assigning the value '10' to BIT III & BIT IV, i.e. assigning '1' to BIT III and '0' to BIT IV, signifies that the entire compressed byte 208 represents two unknown nucleotides.
  • It is possible to use each compressed byte 208 to encode more or fewer than three 'N's.
  • It is possible to use BIT III through BIT VIII of Fig. 3A to signify up to 64 N's represented by a compressed byte 208.
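Both ways of encoding unknown nucleotides can be sketched in Python. The function names are hypothetical. For the three-N form, only the value '10' for two N's is given above; the values '01' for one and '11' for three are assumptions. For the 64-N form, storing the count minus one in the six non-header bits is also an assumption, chosen because six bits hold 64 distinct values.

```python
def encode_unknown(count):
    """Header '00', with BIT III and BIT IV holding the number of N's (1 to 3)."""
    if count not in (1, 2, 3):
        raise ValueError("this form encodes one, two or three N's")
    return count << 4  # header bits stay 00; remaining four bits are unused


def encode_unknown_run(count):
    """Header '00', with BIT III through BIT VIII holding count - 1 (assumed coding)."""
    if not 1 <= count <= 64:
        raise ValueError("six bits can signify between 1 and 64 N's")
    return count - 1


def decode_unknown_run(byte):
    """Inverse of encode_unknown_run: read the six low bits and add one."""
    return (byte & 0b111111) + 1


print(format(encode_unknown(2), "08b"))       # 00100000
print(format(encode_unknown_run(64), "08b"))  # 00111111
```

The two forms are alternatives; a given table would use one or the other, since both claim the same header value.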
  • Fig. 5 is a simplified flowchart illustrating operation of the genomic data compression engine 202 of Fig. 2, constructed and operative in accordance with a preferred embodiment of the present invention.
  • the compression table 206 stores, for each possible combination of up to three nucleotides, e.g. 'ATG', 'GGC', 'AT', bit-values which represent this combination in compressed form in one compressed byte 208, preferably according to the values suggested in Figs. 4A, 4B, and 4C.
  • Figs. 7A and 7B present examples of preferred compression and decompression tables 206 and 212, respectively.
  • an iterative process of compression of multiple strings takes place.
  • An uncompressed genomic string such as uncompressed genomic string 200 of Fig. 2 is received.
  • the uncompressed genomic string 200 is a genomic sequence represented by a string 'ATGAT'. This example is followed through the following steps of Fig. 5.
  • the uncompressed genomic string 200 preferably is parsed into substrings, each having up to three nucleotides (3-nucleotide-substrings) by parsing the uncompressed genomic string 200 from left to right. It should be noted that one or more of the nucleotides in a 3-nucleotide-substring may in fact be unknown, i.e. an 'N'. In the given example the string 'ATGAT' is parsed from left to right, into 3-nucleotide-substrings, yielding 'ATG' and 'AT'.
  • a recursive operation is initiated, which looks up each 3-nucleotide-substring in the compression table 206, and based on the contents of the compression table, assigns appropriate bit values to the bits in one or more compressed bytes 208.
  • the compressed bytes 208 are combined to yield a compressed genomic string 204.
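The parse, look-up and combine steps can be followed through for the 'ATGAT' example with a short Python sketch. It is a simplification under stated assumptions: the table covers known nucleotides only (N's are left out for brevity), a plain loop replaces the recursive operation mentioned above, BIT I is assumed most significant, and '10' for 'G' is inferred. The names `build_compression_table`, `parse` and `compress` are hypothetical.

```python
from itertools import product

# '00' -> A, '01' -> C, '11' -> T per the text; '10' -> G inferred by elimination.
PAIR = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def build_compression_table():
    """Map every 1-, 2- and 3-nucleotide string of A, T, C, G to one byte value."""
    table = {}
    for length in (1, 2, 3):
        for combo in product("ATCG", repeat=length):
            byte = length << 6  # header holds the nucleotide count
            for position, nucleotide in enumerate(combo):
                byte |= PAIR[nucleotide] << (4 - 2 * position)
            table["".join(combo)] = byte
    return table


COMPRESSION_TABLE = build_compression_table()


def parse(sequence):
    """Cut the string from left to right into substrings of up to three characters."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


def compress(sequence):
    """Look up each substring and combine the resulting bytes."""
    return bytes(COMPRESSION_TABLE[sub] for sub in parse(sequence))


print(parse("ATGAT"))                                 # ['ATG', 'AT']
print([format(b, "08b") for b in compress("ATGAT")])  # ['11001110', '10001100']
```

Five uncompressed bytes become two compressed bytes in this example.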
  • Fig. 6 is a simplified flowchart illustrating a preferred functionality for generating translation tables, including compression table 206 and decompression table 212 of Fig. 2.
  • 3-nucleotide-substrings, i.e. 1-nucleotide, 2-nucleotide and 3-nucleotide combinations of A, T, C, G and N.
  • these combinations may include ATN, ATA and ATC.
  • N's are encoded in compressed bytes 208 which do not also represent known nucleotides. Values '00' are assigned to the headers of such compressed bytes and all known nucleotides in the 3-nucleotide-substring are encoded in other bytes.
  • the HEADER is assigned the value '01' and the single nucleotide is represented by BIT III & BIT IV of the compressed byte 208.
  • the HEADER is assigned the value '10' and the two nucleotides are represented by BITS III - VI of the compressed byte 208, with the first nucleotide represented by BIT III & BIT IV and the second nucleotide represented by BIT V & BIT VI.
  • the HEADER is assigned the value '11' and the three nucleotides are represented by BITS III - VIII of the compressed byte 208, with the first nucleotide represented by BIT III & BIT IV, the second nucleotide represented by BIT V & BIT VI and the third nucleotide represented by BIT VII & BIT VIII.
  • Each 3-nucleotide-substring and its corresponding one or more compressed bytes 208 are stored in translation tables including compression table 206 and decompression table 212.
  • N's unknown nucleotides
  • the present invention utilizes this fact to achieve an optimized compression suited specifically for genomic sequence data: three-nucleotide combinations which contain only nucleotides and no N's, as well as those containing only N's and no nucleotides, are both compressed into a single byte. Most of the rare cases of 3-nucleotide mixtures of nucleotides and N's are compressed into two bytes. Only a minority of extremely rare combinations of nucleotides and N's require three bytes and therefore are in fact not compressed.
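The table generation of Fig. 6 can be sketched under a few labeled assumptions. The 2-bit nucleotide codes (A=00, C=01, G=10, T=11) and the single-N byte '00010000' are taken from the worked examples in the text. The assumptions are that two or three consecutive N's are counted in binary in BIT-PAIR I, that unused trailing bits are zero, and that each run of N's or of known nucleotides gets its own byte in original order. The function names are hypothetical.

```python
from collections import Counter
from itertools import groupby, product

# 2-bit nucleotide codes as used in the worked examples of the text.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode_substring(substring):
    """Return the list of 8-bit strings encoding a 1- to 3-character substring."""
    encoded = []
    # Split into runs of N's and runs of known nucleotides, preserving order.
    for is_unknown, run in groupby(substring, key=lambda ch: ch == "N"):
        run = "".join(run)
        if is_unknown:
            # Header '00'; BIT-PAIR I holds the number of N's (assumed binary count).
            bits = "00" + format(len(run), "02b")
        else:
            # Header '01', '10' or '11' is the number of nucleotides that follow.
            bits = format(len(run), "02b") + "".join(CODE[ch] for ch in run)
        encoded.append(bits.ljust(8, "0"))  # unused trailing bits are left as zeros
    return encoded


def build_compression_table():
    """Enumerate every 1-, 2- and 3-character combination of A, T, C, G and N."""
    table = {}
    for length in (1, 2, 3):
        for combo in product("ATCGN", repeat=length):
            substring = "".join(combo)
            table[substring] = encode_substring(substring)
    return table


table = build_compression_table()
sizes = Counter(len(encoded) for encoded in table.values())
print(table["ATN"])           # ['10001100', '00010000']
print(sorted(sizes.items()))  # [(1, 87), (2, 48), (3, 20)]
```

Under these assumptions, 87 of the 155 combinations fit in one byte, 48 need two bytes and 20 need three, which is consistent with the distribution described above.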
  • FIG. 7A is a simplified illustration of a preferred implementation of compression table 206 of Fig. 2, employed in accordance with a preferred embodiment of the present invention.
  • The goal of compression table 206 is to provide a translation-table, also referred to as a 'lookup table', which provides the bit-values of the one or more compressed bytes 208 required to represent in compressed form every possible 1-nucleotide, 2-nucleotide and 3-nucleotide sub-string of uncompressed genomic string 200.
  • the compression table 206 is described here logically, as a database table comprising fields into each of which multiple values are stored in respective multiple records. It is appreciated by those skilled in the art that the description of the compression table 206 in terms of a table comprising fields is meant for clarity and not meant to be limiting, and that the compression table 206 may equally be implemented as 'CASE' or 'IF-THEN' programming code in any suitable computer language, as is well known in the art.
  • computer code can be written, which comprises a plurality of 'IF-THEN' or 'CASE' arguments, each one of the arguments providing bit-values of the one or more compressed bytes 208 representing in compressed form one 3-nucleotide-substring of uncompressed genomic string 200.
  • Compression table 206 preferably comprises multiple records each containing 4 fields: uncompressed nucleotide-combination 700, compressed byte I 702, compressed byte II 704 and compressed byte III 706. For clarity, an example is given for the content which may be stored in each of these fields.
  • the uncompressed nucleotide combination 700 is a field which stores all possible 3-nucleotide substrings, i.e. 1-nucleotide, 2-nucleotide and 3-nucleotide combinations, including combinations of nucleotides only and combinations which include N's.
  • uncompressed nucleotide combination 700 stores a 3-nucleotide combination 'ATN'.
  • Compressed byte I 702, compressed byte II 704 and compressed byte III 706 respectively are fields which store for each uncompressed nucleotide-combination 700 the bit-values for each of the one or more compressed bytes 208 required for encoding it.
  • compressed byte I 702 stores '10001100', which represents the nucleotide-combination 'AT'.
  • compressed byte II 704 stores '00010000', which represents 'N'.
  • Compressed byte III 706, in the given example, stores null, since only two compressed bytes are required to represent the nucleotide combination 'ATN'.
  • most 3-nucleotide substrings may be compressed into one compressed byte 208, some rare combinations may be compressed into two compressed bytes, and some 20 very rare combinations may require 3 compressed bytes, and therefore may not be compressed. Therefore, notwithstanding that compression table 206 comprises three compressed bytes fields 702-706, one compressed byte field, such as compressed byte I 702, is sufficient to translate a vast majority of 3-nucleotide combinations to be typically found in a genomic sequence.
  • Fig. 7B is a simplified illustration of a preferred implementation of decompression table 212 of Fig. 2, employed in accordance with a preferred embodiment of the present invention.
  • The goal of decompression table 212 is to provide a translation-table, also referred to as a 'lookup table', which provides the 1-nucleotide, 2-nucleotide or 3-nucleotide uncompressed genomic string 200 preferably corresponding to every possible compressed byte 208.
  • decompression table 212 in terms of table comprising fields is meant for clarity and not meant to be limiting, and that the decompression table 212 may equally be implemented as a 'CASE' code in any computer language, as is well known in the art.
  • the decompression table 212 preferably comprises multiple records each containing 2 fields: compressed byte 708 and decompressed nucleotide-combination 710.
  • Compressed byte 708 is a field which preferably stores bit-values of every possible compressed byte 208.
  • Decompressed nucleotide combination 710 is a field which stores for each compressed byte 708 the 1-nucleotide, 2-nucleotide or 3-nucleotide uncompressed genomic string 200 which it encodes.
  • the field compressed byte 708 may contain the compressed byte 208 bit-value '10001100' and the respective field decompressed nucleotide combination 710 may contain the 2-nucleotide combination 'AT' which this bit value represents in compressed form.
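A decompression table covering every possible byte value can be sketched as a dictionary keyed by bit string. The header and bit-pair layout follows the text. Two things are assumptions: that BIT-PAIR I of an N-dedicated byte holds the number of N's in binary, and that a count of zero decodes to an empty string. The names are hypothetical.

```python
# Inverse of the 2-bit nucleotide codes used in the worked examples.
NUCLEOTIDE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def decode_byte(bits):
    """Return the 0- to 3-character string encoded by one 8-bit string."""
    header = bits[0:2]
    pairs = [bits[2:4], bits[4:6], bits[6:8]]
    count = int(header, 2)
    if count == 0:
        # Header '00': byte dedicated to N's; BIT-PAIR I is taken as their count.
        return "N" * int(pairs[0], 2)
    # Otherwise the header gives the number of bit pairs that carry nucleotides.
    return "".join(NUCLEOTIDE[pair] for pair in pairs[:count])


# One record per possible compressed byte: compressed byte -> decompressed combination.
DECOMPRESSION_TABLE = {
    format(value, "08b"): decode_byte(format(value, "08b")) for value in range(256)
}

print(DECOMPRESSION_TABLE["10001100"])  # AT
print(DECOMPRESSION_TABLE["11001110"])  # ATG
```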
  • genomic data decompression engine 210 of Fig. 2 performs a reverse action of that of genomic data compression engine 202 of Fig. 2, which was further described hereinabove with reference to Fig. 5.
  • a compressed genomic string 204 is received in order to be decompressed.
  • Genomic data decompression engine 210 of Fig. 2 gets the compressed genomic string 204 of Fig. 2 to be decompressed.
  • the process shown in Fig. 8 is explained with reference to an example wherein the compressed genomic string 204 comprises two compressed bytes 208, the bit-values of which are '11001110' and '10001100' respectively. This example is followed through the following steps of Fig.
  • a recursive operation is initiated, which parses the received compressed genomic string 204 into the compressed bytes 208 included in this compressed genomic string.
  • Each compressed byte 208 is looked up in the decompression table 212 to determine, based on the contents of that table, the 3-nucleotide substring which this compressed byte represents.
  • the 3-nucleotide substrings are combined to yield an uncompressed genomic string 200.
  • the second compressed byte in the compressed genomic string has bit-values of '10001100', which when looked up in the decompression table is found to represent the nucleotide combination 'AT'.
  • 'ATG' represented by compressed byte '11001110'
  • 'AT' represented by compressed byte '10001100'
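The example of Fig. 8 can be traced with a short sketch. The table below contains only the two records quoted in the text, and the function name is hypothetical.

```python
# Decompression records quoted in the text: compressed byte -> nucleotide combination.
DECOMPRESSION_TABLE = {
    "11001110": "ATG",
    "10001100": "AT",
}


def decompress(compressed_bytes, table=DECOMPRESSION_TABLE):
    """Look up each compressed byte and join the resulting substrings in order."""
    return "".join(table[byte] for byte in compressed_bytes)


print(decompress(["11001110", "10001100"]))  # ATGAT
```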
  • FIG. 9A is a simplified illustration of an example of compressing an uncompressed genomic string 200 into a compressed genomic string 204, both shown in Fig. 2.
  • Uncompressed genomic string 'ATGAT' 900 is an uncompressed genomic string 200 comprising the nucleotides: 'ATGAT'.
  • nucleotide-triplet-1 'ATG' 902
  • Compressed byte-1 906 encodes three nucleotides: 'ATG'. Therefore a value of '11' is assigned to the two bits of the HEADER of compressed byte-1 906, as indicated by reference numeral 908, signifying that this compressed byte 208 encodes 3 nucleotides.
  • Value '00' is set to the two bits of BIT-PAIR I, of compressed byte-1 906, as indicated by reference numeral 910, signifying that the first nucleotide represented by this byte is 'A'.
  • Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-1 906, as indicated by reference numeral 912, signifying that the second nucleotide represented by this byte is 'T'.
  • Value '10' is assigned to the two bits of BIT-PAIR III of compressed byte-1 906, as indicated by reference numeral 914, signifying that the third nucleotide represented by this byte is 'G'.
  • Compressed byte-2 908 encodes two nucleotides: 'AT'. Therefore a value of '10' is assigned to the two bits of the HEADER of compressed byte-2 908, as indicated by reference numeral 916, signifying that this compressed byte 208 encodes 2 nucleotides, and that therefore the two bits of BIT-PAIR III are to be ignored. Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte-2 908, as indicated by reference numeral 918, signifying that the first nucleotide represented by this byte is 'A'.
  • Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-2 908, as indicated by reference numeral 920, signifying that the second nucleotide represented by this byte is 'T'.
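The bit assignments of Fig. 9A can be reproduced with integer shifts. This is a sketch that assumes the HEADER occupies the two most significant bits of the byte, which matches the left-to-right order in which the bit strings are written in the text. The function name is hypothetical.

```python
# 2-bit nucleotide codes as used in the worked example.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_byte(nucleotides):
    """Pack 1 to 3 known nucleotides into one integer byte value."""
    value = len(nucleotides) << 6          # HEADER: number of nucleotides encoded
    for index, nucleotide in enumerate(nucleotides):
        shift = 4 - 2 * index              # BIT-PAIR I, II, III sit at shifts 4, 2, 0
        value |= CODE[nucleotide] << shift
    return value


byte_1 = pack_byte("ATG")
byte_2 = pack_byte("AT")
print(format(byte_1, "08b"), format(byte_2, "08b"))  # 11001110 10001100
```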
  • FIG. 9B is a simplified illustration of another example of compression of an uncompressed genomic string 200 of Fig. 2, containing an unknown nucleotide, into a compressed genomic string 204 of Fig. 2.
  • Uncompressed genomic string 'ATNCG' 950 is an uncompressed genomic string 200 comprising the characters: 'ATNCG'.
  • nucleotide-triplet-1 'ATN' 952 nucleotide-triplet-2 'CG_' 954. It is appreciated that the nucleotide-triplet-2 'CG_' 954 actually contains not a triplet but only 2 nucleotides: C and G.
  • Since nucleotide-triplet-1 'ATN' 952 contains an 'N', it is preferably represented by two compressed bytes 208 rather than one: the first, compressed byte-1 956, encodes 'AT', and the second, compressed byte-2 958, encodes 'N'.
  • Compressed byte-1 956 encodes two nucleotides, 'AT', therefore '10' is assigned to the two bits of the HEADER of compressed byte-1 956, as indicated by reference numeral 960, signifying that this compressed byte 208 encodes 2 nucleotides.
  • Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte-1 956, as indicated by reference numeral 962, signifying that the first nucleotide represented by this byte is 'A'.
  • Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-1 956, as indicated by reference numeral 964, signifying that the second nucleotide represented by this byte is 'T'.
  • Value '00' assigned to the bits of BIT-PAIR III of compressed byte-1 956, as indicated by reference numeral 966, is ignored, and does not represent an 'A' since the HEADER specified that this byte encodes only 2 nucleotides.
  • Compressed byte-2 958 is dedicated to encoding 'N's, in this case only one 'N', which is derived from the nucleotide-triplet-1 'ATN' 952. Therefore '00' is assigned to the two bits of the HEADER of compressed byte-2 958, as indicated by reference numeral 968, signifying that this compressed byte 208 is dedicated to encoding one or more N's.
  • the value '01' is assigned to the two bits of BIT-PAIR I of compressed byte-2 958, as indicated by reference numeral 970, signifying that this byte, which is dedicated to encoding N's, encodes only one N. Accordingly, the zeros in the two bits of BIT-PAIR II and the two bits of BIT-PAIR III, indicated by reference numerals 972 & 974, are ignored.
  • Compressed byte-3 990 encodes two nucleotides: 'CG'. Therefore a value of '10' is assigned to the two bits of the HEADER of compressed byte-3 990, as indicated by reference numeral 976, signifying that this compressed byte 208 encodes only 2 nucleotides.
  • Value '01' is assigned to the two bits of BIT-PAIR I of compressed byte-3 990, as indicated by reference numeral 978, signifying that the first nucleotide represented by this byte is 'C'.
  • Value '10' is assigned to the two bits of BIT-PAIR II of compressed byte-3 990, as indicated by reference numeral 980, signifying that the second nucleotide represented by this byte is 'G'.
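The three bytes of Fig. 9B can be assembled field by field from the values listed above. This is a sketch: the helper name is hypothetical, and BIT-PAIR III of compressed byte-3 is assumed to be '00', since the text does not state it.

```python
def assemble(header, pair_1, pair_2="00", pair_3="00"):
    """Concatenate HEADER and BIT-PAIRs I to III into one 8-bit string."""
    return header + pair_1 + pair_2 + pair_3


compressed = [
    assemble("10", "00", "11", "00"),  # byte-1: two nucleotides, 'A' then 'T'
    assemble("00", "01", "00", "00"),  # byte-2: dedicated to N's, one N
    assemble("10", "01", "10"),        # byte-3: two nucleotides, 'C' then 'G'
]
print(compressed)  # ['10001100', '00010000', '10011000']
```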
  • Fig. 10 is a simplified block diagram illustrating shifted genomic sequences utilized by the compressed genomic sequence similarity search module 118 of Fig. 1 constructed and operative in accordance with a preferred embodiment of the present invention.
  • each compressed byte represents in compressed form 3 or 4 nucleotides. It is therefore very easy to compare entire 'triplets' of nucleotides, but if an addition or deletion of a single nucleotide occurs, then all triplets 'downstream' of the one modified will seem to have changed completely, whereas in fact they have only been 'shifted' to the right or to the left by one location.
  • the basic concept of the compressed genomic sequence similarity search module 118 of Fig. 1 is therefore to calculate all possible 'shifted' variations of the compressed genomic sequence 110 of Fig. 1, and to use them to search for compressed genomic sequences similar to target 120 of Fig. 1.
  • target genomic sequence 1000 which is a compressed genomic sequence comprising 12 nucleotides, N1 through N12, which are represented in compressed form by 4 compressed bytes 208, BYTE 1, BYTE 2, BYTE 3 and BYTE 4.
  • shifted genomic sequences 1002 are generated: 'minus one' shifted genomic sequence 1004, 'minus two' shifted genomic sequence 1006, 'plus one' shifted genomic sequence 1008, 'plus two' shifted genomic sequence 1010.
  • the first nucleotide in target genomic sequence 1000, N1, has been removed in the 'minus one' shifted genomic sequence 1004, and therefore 'minus one' shifted genomic sequence 1004 begins with N2.
  • the nucleotides compressed into each of the four compressed bytes 208 of 'minus one' shifted genomic sequence 1004 are therefore 'shifted to the left' by one location.
  • sequence of nucleotides compressed into each of the four compressed bytes 208 of 'minus two' shifted genomic sequence 1006 is shifted to the left by two locations; that of 'plus one' shifted genomic sequence 1008 is shifted to the right by one location; and that of 'plus two' shifted genomic sequence 1010 is shifted to the right by two locations.
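The shifted sequences of Fig. 10 can be sketched at the level of nucleotide triplets, one triplet per compressed byte. Two details are assumptions not stated in the text: a 'plus' shift is modeled by prepending a blank placeholder '_', and each shifted sequence is truncated to the same number of bytes as the target. The function names are hypothetical.

```python
def triplets(nucleotides):
    """Group a list of nucleotides into consecutive tuples of up to three."""
    return [tuple(nucleotides[i:i + 3]) for i in range(0, len(nucleotides), 3)]


def shifted_sequences(nucleotides, max_shift=2, blank="_"):
    """Return the 'minus k' and 'plus k' shifted triplet sequences for k = 1..max_shift."""
    n_groups = len(triplets(nucleotides))
    shifted = {}
    for k in range(1, max_shift + 1):
        # 'minus k': the first k nucleotides are removed (shift to the left).
        shifted[f"minus {k}"] = triplets(nucleotides[k:])[:n_groups]
        # 'plus k': k blank positions are prepended (shift to the right, assumed padding).
        shifted[f"plus {k}"] = triplets([blank] * k + nucleotides)[:n_groups]
    return shifted


target = [f"N{i}" for i in range(1, 13)]  # N1 through N12
variants = shifted_sequences(target)
print(variants["minus 1"][0])  # ('N2', 'N3', 'N4')
print(variants["plus 1"][0])   # ('_', 'N1', 'N2')
```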
  • Fig. 11 is a simplified flowchart illustrating operation of the compressed genomic sequence similarity search module 118 of Fig. 1 constructed and operative in accordance with a preferred embodiment of the present invention.
  • Operation of the compressed genomic sequence similarity search module 118 begins by getting a target compressed genomic sequence 110 of Fig. 1.
  • shifted compressed genomic sequences are generated: 'minus one' shifted genomic sequence 1004, 'minus two' shifted genomic sequence 1006, 'plus one' shifted genomic sequence 1008, 'plus two' shifted genomic sequence 1010, as described hereinabove with reference to Fig. 10.
  • all compressed genomic sequences 104 having at least one compressed byte which matches that of the compressed target genomic sequence 1000 or of one of the four shifted genomic sequences 1004-1010, are retrieved. It is important to note that a match is looked for only between bytes occupying the same location in the compressed genomic string: the first compressed byte in a compressed genomic sequence 104 is compared to the first compressed byte of the compressed target genomic sequence and to the first compressed byte of each of the four compressed shifted genomic sequences. It is not compared to any other, e.g. second, third or fourth, compressed bytes within these genomic sequences. All compressed genomic sequences having at least one match are considered potentially similar, and are passed on to the next step.
  • Compressed genomic sequences having fewer mismatching compressed bytes with the target or one of the shifted genomic sequences than a certain user defined 'threshold', are considered potentially very similar, and are passed on to the next step.
  • mismatching compressed byte/s are further analyzed to determine the exact nature of the mistake, in order to further fine-tune the similarity comparison.
  • the resulting compressed genomic sequences similar to target 120 of Fig. 1 are considered to represent in compressed form genomic sequences which are similar to the target genomic sequence represented in compressed form by the compressed target genomic sequence 110, and are delivered.
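The two filtering steps of Fig. 11 can be sketched as follows. For readability, 3-character groups stand in for compressed bytes, and only the target and its 'minus one' shift are used as references. A byte is counted as mismatching when it matches none of the references at its own position. This reading of the mismatch count is an assumption, as are the function names and the example sequences.

```python
def groups(sequence):
    """Split a nucleotide string into consecutive 3-character groups (one per byte)."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


def matches_any(candidate, references, position):
    """True if the candidate group equals the group at the same position in any reference."""
    return any(
        position < len(reference) and reference[position] == candidate[position]
        for reference in references
    )


def similar_sequences(candidates, references, threshold):
    """Return {name: mismatch count} for candidates passing both filtering steps."""
    results = {}
    for name, candidate in candidates.items():
        matched = [matches_any(candidate, references, i) for i in range(len(candidate))]
        if not any(matched):
            continue  # step 1: no byte matches at any position
        mismatching = matched.count(False)
        if mismatching < threshold:  # step 2: fewer mismatches than the threshold
            results[name] = mismatching
    return results


target = "ATGCTAGTCAAC"
references = [groups(target), groups(target[1:])]  # target and its 'minus one' shift
candidates = {
    "replacement": groups("ATGCGAGTCAAC"),
    "deletion": groups("ATGCAGTCAAC"),
    "unrelated": groups("GGGGGGGGGGGG"),
}
print(similar_sequences(candidates, references, threshold=2))
# {'replacement': 1, 'deletion': 1}
```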
  • Fig. 12A is a simplified illustration of an example of identifying a genomic sequence having one nucleotide replacement relative to a target genomic sequence.
  • Fig. 12A shows a genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (1 REPLACEMENT), in which a nucleotide designated N13 shown in broken line format, in the compressed byte designated 1R-BYTE II, has replaced nucleotide N5 in that same spot in the original TARGET GENOMIC SEQUENCE.
  • Fig. 12B is a simplified illustration of an example of identifying a genomic sequence having two nucleotide additions relative to a target genomic sequence.
  • Fig. 12B shows a genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (2 ADDITIONS), in which two nucleotides designated N13 and N14 shown in broken line format, in compressed byte designated 2A-BYTE II, have been added to the genomic sequence relative to the original TARGET GENOMIC SEQUENCE, 'pushing' nucleotides N5 and N6 to the next compressed byte designated 2A-BYTE III, and shifting all the following nucleotides by two positions.
  • Fig. 12C is a simplified illustration of an example of identifying a genomic sequence having one nucleotide deletion relative to a target genomic sequence.
  • Fig. 12C shows a genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (1 DELETION), in which one nucleotide designated N5 of TARGET GENOMIC SEQUENCE has been deleted, shifting all nucleotides from N6 onwards one position to the left.
  • the missing N5 in byte 1D-BYTE II is represented by a small blank broken line box between N4 and N6.
  • the two genomic sequences may therefore be deduced as being similar, differing by a one deletion mistake in the mismatched compressed byte 1D-BYTE II.
  • While FIGs. 10 and 12A, 12B and 12C demonstrate detection of up to 2 addition or deletion mistakes, the same concept may be utilized to detect a wider spectrum of mistakes, by generating more 'shifted sequences' accordingly, e.g. 'plus three' shifted sequence, etc.
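The deletion case of Fig. 12C can be illustrated by recording, for each byte position, which reference it matches. In this sketch 3-character groups stand in for compressed bytes, and the sequences and function names are hypothetical. A switch from matching the target to matching the 'minus one' shifted sequence, around a single mismatched byte, points to a one-nucleotide deletion in that byte.

```python
def groups(sequence):
    """Split a nucleotide string into consecutive 3-character groups (one per byte)."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


def matching_references(candidate, references):
    """For each position, name the references whose group at that position matches."""
    return [
        [name for name, ref in references.items() if i < len(ref) and ref[i] == group]
        for i, group in enumerate(candidate)
    ]


target = "ATGCTAGTCAAC"
references = {"target": groups(target), "minus one": groups(target[1:])}
candidate = groups("ATGCAGTCAAC")  # the target with its fifth nucleotide deleted
print(matching_references(candidate, references))
# [['target'], [], ['minus one'], ['minus one']]
```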
  • Fig. 13 is a simplified block diagram of a computer database application constructed and operative in accordance with a preferred embodiment of the present invention. It is appreciated that the computer database application may be implemented in any appropriate programmed computer system, for example, an appropriate personal computer, or a personal computer server, and may use any appropriate database system, for example MICROSOFT SQL SERVER 2000®.
  • the embodiment of Fig. 13 comprises a mechanism for efficient pattern analysis of genomic sequence data.
  • the general idea of the present invention is to view the task of genomic pattern analysis in a manner similar to an attempt to understand a book in a totally foreign and unknown language, but where some clues do exist as for the meaning of a few specific words, or the general significance of several chapters.
  • the approach in such a case would be to break up the book into meaningful sections, such as chapters, and within each such chapter, to make a list of all the words appearing in that chapter. This then allows one to find correlations between words and other words, or between words and the chapters they are found in.
  • genomic sequence data may be approached in much a similar manner.
  • First the 'book' i.e. the DNA sequence, is divided into meaningful 'chapters' such as protein coding regions, and regions upstream and downstream of these regions.
  • regions adjacent to protein coding regions are often involved in inhibiting or enhancing the production of these proteins.
  • Additional 'sub-chapters' may be created as well, such as regions within a protein coding region, which are known or suspected to have a specific function.
  • each of these 'chapters' i.e. genomic protein related regions
  • genomic sequence data we do not know what the 'words' are
  • the approach taken by the present invention is to parse each 'chapter' into 'words' of arbitrary length/s, such as all lengths between 10 and 30 characters. This approach generates a very long list of 'potential words', knowing that most of these are non-sense, and only a small fraction are genuine 'words'.
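Parsing a 'chapter' into 'potential words' amounts to enumerating every substring whose length lies in the chosen range. A minimal sketch follows, with a hypothetical function name and a short made-up sequence:

```python
def potential_words(chapter, min_length=10, max_length=30):
    """Yield every substring of the chapter whose length lies in the given range."""
    for length in range(min_length, max_length + 1):
        for start in range(len(chapter) - length + 1):
            yield chapter[start:start + length]


chapter = "ATGCGTACGTTAGC"  # 14 characters, for illustration only
words = list(potential_words(chapter, min_length=10, max_length=12))
print(len(words))  # 12
print(words[0])    # ATGCGTACGT
```

The count grows roughly with the chapter length multiplied by the number of lengths requested, which is why the list of 'potential words' becomes very long for real genomic regions.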
  • genomic data begins by obtaining genomic data to be analyzed, and other definitions and preferences required for the genomic data analysis.
  • Primary required data is raw genomic data 1100, including sequenced DNA data 1102 and protein location information 1104.
  • Protein location information 1104 comprises relative offset of the protein coding regions of proteins known to be encoded by the sequenced DNA 1102, as well as the orientation of these protein coding areas where available. Protein location 1104 is typically part of basic genomic annotation data which is made available as part of the genomic sequencing effort.
  • Other genomic data 1106 may also contribute to the process of genomic pattern analysis, and may include various properties of known proteins, such as tissue-specific expression of proteins, the organism in which each protein is expressed (when analyzing genomes of multiple organisms), and a protein-specific biological function. This information may also include additional research-derived information about proteins encoded by the sequenced DNA, such as grouping specific proteins into groups of proteins which are of particular interest. Other genomic data 1106 may also include information about specific sites or locations in a protein-coding region, such as various protein binding sites, or regions upstream of the coding area of a protein, which are of specific interest. The significance and use of such additional data is elaborated hereinbelow.
  • User defined criteria 1108 may be entered, defining various parameters by which the genomic sequence data analysis is performed, as explained hereinbelow.
  • the raw genomic data 1100, other genomic data 1106 and user defined criteria 1108 are entered into a genomic pattern analysis engine 1110.
  • the genomic pattern analysis engine 1110 is a computer based program, preferably built around a database program, such as MICROSOFT SQL-SERVER 2000®, and comprising a genomic pre-processing unit 1112, a genomic pattern analysis database 1114 and a genomic query processing unit 1116.
  • the genomic pre-processing unit 1112 is a computer program operative in conjunction with a database program, which receives the raw genomic data 1100 and other genomic data 1106 entered to the genomic pattern analysis engine 1110, pre-processes it and stores the pre-processed genomic data to the genomic pattern analysis database. Operation of the genomic pre-processing unit 1112 is further described below with reference to Fig. 15.
  • the genomic pattern analysis database 1114 is a database storing the preprocessed genomic data.
  • the data structure of the genomic pattern analysis database 1114 is designed so as to be conducive to pattern analysis of genomic sequence data, and to include the following major data elements:
  • Proteins 1118 is a list of proteins known to be encoded by the sequenced DNA 1102;
  • Protein related regions 1120 are regions in the sequenced DNA, which are related to each of the proteins 1118, such as protein coding regions, and regions upstream of protein coding regions;
  • Short genomic segments in regions 1122 are all short genomic segments of a length defined by the user defined criteria 1108, found in each of the protein related regions; and
  • SGSR-to-SGSR relationships 1124 document various relationships between two or more SGSRs, such as the distance between them.
  • genomic pattern analysis database is further described below with reference to Fig. 14.
  • the genomic pattern analysis database 1114 may be queried by the genomic query processing unit 1116, allowing a user 1126 to analyze the raw genomic data 1100, by using the preprocessed data derived therefrom and stored in the genomic pattern analysis database 1114. Operation of the genomic query processing unit is further described below with reference to Fig. 16.
  • a basic concept of the genomic pattern analysis engine is to perform as much pre-processing and storing of useful intermediate results as possible before the actual process of pattern analysis, so as to be able afterwards to produce very fast results in response to relatively complicated pattern analysis queries.
  • This approach allows performing complicated genomic data analysis tasks, frequently carried out only by mainframe computers or super-computers, on relatively inexpensive, and easily scalable hardware, such as PC server computers. While this approach potentially requires very large databases, with up to billions of records in some cases, and may require significant pre-processing time, it still offers a dramatically more cost-effective alternative to the traditional extremely expensive parallel processing alternatives.
  • Fig. 14 is a block diagram illustrating a preferred embodiment of the genomic pattern analysis database 1114 of Fig. 13.
  • the genomic pattern analysis database preferably comprises four major data elements: proteins 1118, protein related regions 1120, short genomic segments in region (SGSR) 1122 and SGSR-to-SGSR relationships 1124.
  • each of these data elements is stored in a table in the database, related to the other tables, as described below.
  • Proteins 1118 is a plurality of proteins known to be encoded by the raw genomic data 1100. For each of these proteins, multiple properties relevant to the genomic pattern analysis may be stored, for example an organism 1200 it belongs to, a biological or other function 1202 it is known to perform, and its expression 1204 in a specific organ or tissue.
  • Each of the proteins 1118 is related to one or more protein related regions 1120. These include a protein coding region 1206, which must be obtained or calculated based on the protein location data 1104. In addition, regions adjacent to the protein may be calculated and stored: pre protein 1208 is a region upstream of the protein coding region 1206, and post protein 1210 is a region downstream of the protein coding region. The coding direction of the protein is required in order to calculate the protein-adjacent regions. Finally, other regions 1212 may also be defined, such as regions of special interest within a specific protein, e.g. regions within the protein coding region known or suspected to have a certain functional significance, such as regions correlating to an amino-acid sequence which is responsible for specific biological activity in the final protein.
  • regions may be selected, which are not related to a specific protein, in order to analyze genomic patterns within them as well.
  • some of the definitions used to create the protein related regions 1120 may be semi-arbitrary, and therefore may be defined by the user, as part of the user defined criteria 1108. For example, when analyzing the regions upstream of a protein coding region, a user defined criterion 1108 may define whether this region should extend all the way until the next protein upstream, or if it should be considered only a maximal fixed distance from the protein.
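The user defined choice for the upstream region can be sketched as a small function. The offsets, the half-open (start, end) convention and the function name are assumptions made for illustration, and the coding region is assumed to have already been normalized to read left to right.

```python
def upstream_region(coding_start, previous_protein_end, max_distance=None):
    """Return (start, end) offsets of the region upstream of a coding region.

    With max_distance=None the region extends back to the previous protein;
    otherwise it covers at most max_distance positions before the coding region.
    """
    if max_distance is None:
        start = previous_protein_end
    else:
        start = max(0, coding_start - max_distance)
    return (start, coding_start)


print(upstream_region(5000, 3200))                     # (3200, 5000)
print(upstream_region(5000, 3200, max_distance=1000))  # (4000, 5000)
```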
  • Each of the protein related regions 1120 is related to a plurality of short genomic segments in region 1122, which are a key element in the present invention.
  • Short genomic segments in region 1122 is a plurality of preferably all, or most, of the short genomic segments of a given length or range of lengths, as determined by the user defined criteria 1108, which are found in each of the plurality of protein related regions 1120.
  • depending on the number of protein related regions 1120 created and the number of lengths of short genomic segments desired, the number of records may be billions or tens of billions.
  • records may preferably be stored in partitioned tables, under a single view, using a database such as MICROSOFT SQL SERVER 2000®.
  • Location 1214 is the location, or offset, of the SGSR 1122 relative to the protein related region 1120 in which it is found.
  • Uniqueness 1216 stores a link to a reference indicating the degree of uniqueness of this SGSR relative to the protein related region 1120 in which it is found, or relative to multiple protein related regions 1120.
  • one SGSR 1122 may be unique in the protein related region 1120 it appears in, i.e. it appears in that region only once.
  • Another SGSR 1122 may appear in a specific protein coding region 1206, and may be unique relative to all protein coding regions 1206.
  • Yet another SGSR 1122 may appear only 3 times in the pre protein regions 1208 of all proteins 1118 which have a similar expression 1204, such as proteins expressing in nerve cells - this may still be considered significant by a user 1126 of the system, and may be queried.
  • commonality 1218 stores an indication, or a link to an indication, as to the commonality of a SGSR 1122 relative to two or more protein related regions 1120. For example, it may be of interest to find out and mark all SGSRs 1122 which appear in a pre protein region of proteins which have a similar function, as a starting point to attempting to assess which short genomic segments may be active as 'triggers' in controlling expression of these proteins.
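Location, uniqueness and commonality can all be derived from one index that maps each short segment to the places where it occurs. The sketch below uses two made-up pre protein regions and a segment length of 4. The region names, sequences and function name are hypothetical.

```python
from collections import defaultdict


def index_segments(regions, length):
    """Map each short segment to the list of (region name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, sequence in regions.items():
        for offset in range(len(sequence) - length + 1):
            index[sequence[offset:offset + length]].append((name, offset))
    return index


regions = {"pre_protein_1": "ACGTACGA", "pre_protein_2": "TTACGATT"}
index = index_segments(regions, length=4)

# Unique: the segment occurs exactly once across all indexed regions.
unique = [segment for segment, hits in index.items() if len(hits) == 1]
# Common: the segment occurs in every indexed region.
common = [
    segment for segment, hits in index.items()
    if len({name for name, _ in hits}) == len(regions)
]
print(common)  # ['TACG', 'ACGA']
```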
  • Each SGSR may be associated with one or more, possibly many, criteria flags 1220, each of which stores an indication or a link to an indication of any compound condition which this SGSR meets.
  • a criteria flag 1220 may be used in order to 'mark' all SGSRs meeting a certain query, so that they can later be retrieved quickly and easily.
  • a criteria flag 1220 may be created to indicate any SGSR which appears in the post protein region 1210 of all proteins 1118 having a first function 1202, and does not appear in the post protein region 1210 of any proteins 1118 having a second function 1202. Since each short-genomic-segment record may be associated with multiple criteria flags, preferably each criterion is a record in a criteria table (not shown in Fig. 14), which is linked to multiple criteria-flag records 1220, each of which is linked to one short-genomic-segment record 1122.
  • the 'criteria flag' mechanism therefore allows extremely fast retrieval of all short-genomic-segments which comply with any combination of complex queries, which have been applied at the pre-processing phase.
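The compound condition given as an example above can be evaluated once at pre-processing time and its result stored as a flag. The sketch below returns the set of segments to be flagged. The protein names, function labels and sequences are made up, and the function name is hypothetical.

```python
def flag_segments(post_regions, functions, length, wanted, excluded):
    """Return segments present in the post protein region of every protein with the
    `wanted` function and absent from that of every protein with the `excluded` function."""
    def segments(sequence):
        return {sequence[i:i + length] for i in range(len(sequence) - length + 1)}

    wanted_sets = [segments(s) for p, s in post_regions.items() if functions[p] == wanted]
    excluded_sets = [segments(s) for p, s in post_regions.items() if functions[p] == excluded]
    if not wanted_sets:
        return set()
    # Segments shared by all 'wanted' proteins, minus any seen in an 'excluded' protein.
    return set.intersection(*wanted_sets).difference(*excluded_sets)


post_regions = {"p1": "GATTACA", "p2": "TTACAGG", "p3": "CTACAGT"}
functions = {"p1": "first", "p2": "first", "p3": "second"}
flagged = flag_segments(post_regions, functions, length=4, wanted="first", excluded="second")
print(flagged)  # {'TTAC'}
```

Later queries can then retrieve the flagged segments directly instead of re-evaluating the condition.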
  • each short genomic segment in region 1122 may be associated with one or more SGSR-to-SGSR relationships 1124.
  • Each SGSR-to-SGSR relationship 1124 stores information on the relation between two or more SGSRs.
  • SGSR-to-SGSR relationships 1124 may document the proximity 1222 of two SGSRs from each other, i.e. the difference between their respective location 1214 values; or their nucleotide sequence similarity 1224, or any other 1226 parameter by which they may be compared.
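The four related tables of Fig. 14 can be sketched with Python's built-in sqlite3 module, used here purely as a stand-in for the database system named in the text. All table names, column names and rows are hypothetical, and only a subset of the described fields is shown.

```python
import sqlite3

SCHEMA = """
CREATE TABLE protein (
    id INTEGER PRIMARY KEY,
    name TEXT, organism TEXT, function TEXT, expression TEXT
);
CREATE TABLE protein_related_region (
    id INTEGER PRIMARY KEY,
    protein_id INTEGER REFERENCES protein(id),
    kind TEXT,            -- 'coding', 'pre', 'post' or 'other'
    sequence TEXT
);
CREATE TABLE sgsr (
    id INTEGER PRIMARY KEY,
    region_id INTEGER REFERENCES protein_related_region(id),
    segment TEXT,
    location INTEGER      -- offset within the region
);
CREATE TABLE sgsr_relationship (
    sgsr_a INTEGER REFERENCES sgsr(id),
    sgsr_b INTEGER REFERENCES sgsr(id),
    proximity INTEGER     -- difference between the two location values
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO protein VALUES (1, 'example', 'organism_x', 'function_x', 'tissue_x')")
conn.execute("INSERT INTO protein_related_region VALUES (1, 1, 'pre', 'ACGTACGTAC')")
conn.executemany("INSERT INTO sgsr VALUES (?, 1, ?, ?)", [(1, "ACGT", 0), (2, "GTAC", 2)])
# Proximity is pre-computed once and stored, so later queries need no recalculation.
conn.execute(
    "INSERT INTO sgsr_relationship "
    "SELECT a.id, b.id, b.location - a.location FROM sgsr a JOIN sgsr b ON a.id < b.id"
)
proximity = conn.execute("SELECT proximity FROM sgsr_relationship").fetchone()[0]
print(proximity)  # 2
```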
  • Fig. 15 is a flowchart illustrating operation of the genomic pre-processing unit 1112 of Fig. 13 in accordance with a preferred embodiment of the present invention.
  • the genomic pre-processing unit 1112 of Fig. 13 processes the raw genomic data 1100, in several steps described below, and populates the genomic pattern analysis database 1114.
  • Preprocessing begins by acquiring the genomic data to be processed, including raw genomic data 1100, other genomic data 1106 and user defined criteria 1108 of Fig. 13.
  • Proteins known to be encoded by the raw genomic data 1100 are stored in the genomic pattern analysis database 1114, and are classified according to various attributes which are deemed relevant to the genomic data analysis process, such as the organism 1200 they belong to, their biological or other function 1202, and their organ- or tissue-specific expression 1204 of Fig. 14.
  • the protein coding region 1206 of Fig. 14 is calculated based on the protein location data 1104 of Fig. 13.
  • the protein coding regions are also normalized, i.e. if the direction of the protein is known to be right to left, it is reversed, so as to be read from left to right, and inverted, replacing every A with a T and every C with a G.
  • some proteins are coded on the positive strand of the DNA, and are therefore 'read' from left to right from the sequenced DNA 1102, whereas some are encoded on the negative strand, and therefore are 'read' from right to left, and appear 'inverted' in the sequenced DNA 1102, i.e. each 'A' should be replaced with a 'T', each 'C' replaced with a 'G', and vice-versa.
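  • As a minimal sketch of the normalization described above, a region coded on the negative strand can be complemented and reversed in one step. The function name `normalize_region` and the '+'/'-' strand labels are illustrative assumptions, not part of the described system; unknown nucleotides ('N') are assumed to stay unchanged.

```python
# Complement map for the four nucleotides; 'N' (unknown) maps to itself (assumption).
COMPLEMENT = str.maketrans("ATCGN", "TAGCN")

def normalize_region(sequence, strand):
    """Return the region as read left to right on its coding strand."""
    if strand == "-":
        # Replace A<->T and C<->G, then reverse the reading direction.
        return sequence.translate(COMPLEMENT)[::-1]
    return sequence

print(normalize_region("ATGC", "+"))  # ATGC
print(normalize_region("ATGC", "-"))  # GCAT
```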
  • regions upstream and downstream of each protein may also be calculated, normalized and stored, as may also other regions 1212 of interest.
  • Protein related regions 1120 are stored in the database, and are each linked to the protein related thereto.
  • Each of the protein related regions 1120 is then parsed in order to find preferably all short genomic segments of a given length located in that region.
  • the results are stored as short genomic segments in region 1122 of Fig. 14.
  • the length/s of the short genomic segments to be analyzed is determined as part of the user defined criteria.
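  • The parsing step above can be sketched as a sliding window over the region. This is an illustration under stated assumptions: overlapping windows are taken at every offset, the offset is recorded so that it could serve as the location 1214, and the name `short_segments` is hypothetical.

```python
from collections import defaultdict

def short_segments(region, length):
    """Map each segment of the given length to the offsets at which it occurs."""
    offsets = defaultdict(list)
    for start in range(len(region) - length + 1):
        offsets[region[start:start + length]].append(start)
    return dict(offsets)

segments = short_segments("ATGATG", 3)
print(segments)  # {'ATG': [0, 3], 'TGA': [1], 'GAT': [2]}
```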
  • various queries may be performed, in order to further determine properties of short genomic segments in region 1122, which are deemed material to the desired direction of genomic sequence data analysis.
  • SGSRs answering a queried criterion may be 'flagged' for future use, using the uniqueness 1216, commonality 1218 or other criteria flags 1220 of Fig. 14.
  • SGSR-to-SGSR relationships 1124 between two or more SGSRs are determined, such as their proximity 1222, similarity 1224, or any other attribute 1226 by which they may be compared.
  • Fig. 16 is a block diagram illustrating operation of the genomic query processing unit 1116 of Fig. 13, constructed and operative in accordance with a preferred embodiment of the present invention.
  • the genomic query processing unit 1116 allows a user 1126 of Fig. 13 to perform complex pattern analysis of genomic sequence data, based on the preprocessed data stored in the genomic pattern analysis database 1114.
  • qualifying properties to be used in analysis of genomic data are obtained from the user 1126, as indicated by reference numeral 1400. These may include short segment properties 1402, short segment in region properties 1404, protein properties 1406 and other pre-defined criteria 1408.
  • genomic pattern analysis database 1114 is queried according to the qualifying properties obtained by the previous step, as indicated by reference numeral 1410.
  • this step of querying the database according to qualifying properties may also comprise a short genomic segment similarity comparison mechanism, as indicated by reference numeral 1412 marked by broken line.
  • a short genomic segment similarity comparison mechanism may enable querying the database for short genomic segments which are similar but not identical to the short segments specified by any of the qualifying properties indicated by reference numerals 1402-1408.
  • Such a mechanism may be advantageous since, as is well known in the art, it is frequently the case that a genomic motif may appear in slight variations while still maintaining its biological functionality. Accordingly, various algorithms are known in the art for identifying genomic string similarity, and may therefore be used here.
  • a 'translation-table' may be created, in which for each short genomic segment, all of the possible variants, e.g. of up to a small number of mistakes such as 2 mistakes, are listed.
  • Such a mechanism may be very efficient especially for short segments, such as 3-7 nucleotides long, since the number of possible permutations is not very large.
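  • A 'translation-table' of this kind might be built as in the following sketch. It assumes that a 'mistake' means a single-character substitution (insertions and deletions are not considered) and that segments contain only A, C, G and T; the names `variants` and `translation_table` are hypothetical.

```python
from itertools import combinations, product

ALPHABET = "ACGT"

def variants(segment, max_mismatches=2):
    """Return every string within max_mismatches substitutions of segment."""
    found = {segment}
    for count in range(1, max_mismatches + 1):
        for positions in combinations(range(len(segment)), count):
            for letters in product(ALPHABET, repeat=count):
                candidate = list(segment)
                for position, letter in zip(positions, letters):
                    candidate[position] = letter
                found.add("".join(candidate))
    return found

# One table entry per short genomic segment, listing all of its variants.
translation_table = {segment: variants(segment) for segment in ("ATG", "GGC")}
print(len(translation_table["ATG"]))  # 37
```

  • For a 3-character segment the table holds 1 exact match, 9 single-substitution variants and 27 double-substitution variants, which illustrates why the approach stays small for short segments.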
  • results which fit the qualifying properties 1402-1408 are retrieved from the database and are delivered, as indicated by reference numeral 1414. These include qualifying short genomic segments 1416, qualifying regions 1418 and qualifying proteins 1420.
  • FIG. 17 is a simplified illustration of an example of genomic pattern analysis performed by a preferred embodiment of the present invention.
  • Fig. 17 provides a genomic analysis example which may be conducive to a better understanding of the usefulness and operation of the present invention.
  • protein A, protein B, protein C, protein D, protein E and protein F are six proteins known to be coded by the raw genomic data 1100 entered into genomic pattern analysis engine 1110 of Fig. 13.
  • proteins A and C are known to have a specific biological function, which the user 1126 considers desirable, and protein B is known not to have this specific desired biological function.
  • the biological function of proteins D, E and F is unknown.
  • the initial goal of the genomic pattern analysis is to find a genomic sequence pattern common to the coding regions of proteins A and C, and which is not found in the coding region of protein B. If such a genomic pattern is found, then the final goal is to find other proteins, the function of which is at present unknown, such as proteins D, E and F in the given example, which display a genomic pattern similar to that found in the initial step.
  • the rationale is that the genomic pattern common to proteins known to have a desired function, may serve as a predictor for finding other proteins, the function of which is at present unknown, and which might be expected to perhaps have the desired function.
  • the genomic preprocessing unit 1112 preprocesses the raw genomic data 1100, and using the sequenced DNA 1102 and the protein location 1104 of these six proteins relative to the sequenced DNA 1102, and their coding direction (left-to-right or right-to-left), calculates and preferably normalizes the protein coding region 1206 of Fig. 14.
  • Protein A coding region 1500, protein B coding region 1502, protein C coding region 1504, protein D coding region 1506, protein E coding region 1508 and protein F coding region 1510 are illustrated in simplified form in Fig. 17. Coding regions of proteins D, E and F, designated by reference numerals 1506, 1508 and 1510, the biological function of which is unknown, are shown in broken line format. Preferably all protein related regions for preferably all other proteins known to be encoded by raw genomic data 1100 are processed in a similar manner. For clarity of explanation, the given example now focuses on these six proteins alone.
  • the genomic pre-processing unit further processes the protein coding regions of proteins A, B, C, D, E and F designated by reference numerals 1500-1510, in order to find and store preferably all short genomic segments, of a given length, e.g. 10 nucleotides long, in each of these protein coding regions.
  • the length of SGSR to be used is a matter of user preference, and is determined by user defined criteria 1108 of Fig. 13.
  • Fig. 17 illustrates only six short genomic segments found in the protein coding regions of these proteins: SGSR-I, SGSR-II, SGSR-III, SGSR-IV, SGSR-V and SGSR-VI. It is appreciated that in reality there is a very large number, such as tens of thousands, of short genomic segments found in each protein related genomic region. This number depends upon the size of the region and the number of different SGSR lengths which the user 1126 decides to use.
  • SGSR-I, SGSR-II and SGSR-III are found in protein A coding region 1500; SGSR-IV, SGSR-III, SGSR-II and SGSR-V are found in protein B coding region 1502; SGSR-VI, SGSR-I, SGSR-II and SGSR-III are found in protein C coding region 1504; none of the six SGSR are found in protein D coding region 1506; SGSR-III, SGSR-II and SGSR-V are found in protein E coding region 1508; and SGSR-I, SGSR-II and SGSR-III are found in protein F coding region 1510.
  • the initial step of the genomic pattern analysis searches for a pattern common to protein A and C, and not to protein B.
  • SGSR-I, SGSR-II and SGSR-III form a pattern which appears in protein A coding region 1500 and protein C coding region 1504, but not in protein B coding region 1502.
  • the commonality property of SGSR may be used to 'flag' all short genomic segments in regions 1122, which are common to the protein coding regions 1206 of all proteins 1118 sharing the same desired function 1202, i.e. proteins A and C in the given example. All SGSRs common to coding regions of all proteins which share the lack of that desired function may be 'flagged' accordingly as well, using a different commonality 1218 'flag'. In the given example only protein B is shown as a representative of that group of proteins. It is then easy to find all SGSRs which are common to A and C but not to B.
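  • The commonality query can be sketched with plain set operations on the SGSR lists given for Fig. 17. The dictionary layout and variable names are assumptions made for illustration; the described system stores this information as flags in the database rather than in memory.

```python
# SGSR content of each coding region, as listed for Fig. 17.
regions = {
    "A": ["I", "II", "III"],
    "B": ["IV", "III", "II", "V"],
    "C": ["VI", "I", "II", "III"],
}
with_function = ["A", "C"]     # proteins having the desired function
without_function = ["B"]       # proteins known to lack it

# SGSRs present in every protein that has the function.
common = set.intersection(*(set(regions[p]) for p in with_function))
# SGSRs present in any protein that lacks the function.
excluded = set.union(*(set(regions[p]) for p in without_function))

print(sorted(common))             # ['I', 'II', 'III']
print(sorted(common - excluded))  # ['I']
```

  • Only SGSR-I separates the two groups by presence alone, since SGSR-II and SGSR-III also occur in protein B; distinguishing those requires the positional information discussed next.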
  • a more elaborate pattern may be sought, for example by analyzing the location of the SGSRs which seem to be potentially significant relative to each other, or to the region in which they are found.
  • although SGSR-III and SGSR-II also appear in protein B coding region 1502, they do not appear in the same pattern, e.g. the location of SGSR-III relative to SGSR-II is different in protein B coding region than it is in the coding regions of proteins A and C. This may easily be queried from the genomic pattern analysis database, using the location property of SGSR, designated by reference numeral 1214 of Fig. 14, which preferably stores the location, i.e. offset, of the SGSR relative to the region in which it is found.
  • the SGSR-to-SGSR relationships 1124 of Fig. 14 may be very useful.
  • multiple such relationship records may be formed, which document the proximity 1222 of Fig. 14 of each of the pairs of SGSRs which are suspected as being potentially significant: SGSR-I-to-SGSR-II, SGSR-II-to-SGSR-III and SGSR-I-to-SGSR-III. This provides an efficient means of finding all instances in which several SGSRs are not only found in proximity, but form a pattern relative to one another.
  • the criteria flags 1220 of Fig. 14 may be used to flag all SGSRs which comply with a very complex query.
  • a criteria flag may be formed to 'flag' all SGSRs which are: (a) common to a first group of proteins, (b) do not appear in a second group, (c) have a location property indicating they appear close together, and (d) appear as part of a certain SGSR-to-SGSR relationship with a certain proximity value.
  • Protein coding regions of proteins D, E and F are thus examined.
  • Protein F coding region 1510, shown marked in bold broken line, is the only one which displays a similar pattern of SGSRs to that observed in the coding regions of proteins A and C.
  • Protein D coding region shows none of the three SGSRs suspected as significant, and protein E coding region shows two of the significant three, but not in the same pattern.
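  • The pattern check over all six proteins can be sketched as follows. Actual offsets are not given in the example, so the order in which the SGSRs are listed for each region is used here as a stand-in for the location property 1214; that substitution, and the function name `shows_pattern`, are assumptions.

```python
# SGSRs per coding region, in the order they are listed for Fig. 17.
regions = {
    "A": ["I", "II", "III"],
    "B": ["IV", "III", "II", "V"],
    "C": ["VI", "I", "II", "III"],
    "D": [],
    "E": ["III", "II", "V"],
    "F": ["I", "II", "III"],
}
pattern = ["I", "II", "III"]

def shows_pattern(segments, pattern):
    """True if the pattern occurs as a consecutive run, in order, within segments."""
    width = len(pattern)
    return any(segments[i:i + width] == pattern
               for i in range(len(segments) - width + 1))

matches = [name for name, segs in regions.items() if shows_pattern(segs, pattern)]
print(matches)  # ['A', 'C', 'F']
```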
  • genomic sequence analysis database 1114 may be beneficially utilized in performing complex genomic pattern analysis tasks, and of the usefulness of such analysis. It is further appreciated that genomic pattern analysis is often a highly complex task, often requiring a long, iterative, and somewhat creative process of trial-and-error.
  • FIG. 18 is a simplified block diagram illustrating a computer application constructed and operative in accordance with another preferred embodiment of the present invention.
  • Each of a plurality of strings 1800 is compressed by a compression mechanism 1802, collectively yielding a respective plurality of compressed strings 1804.
  • the compressed strings 1804 are stored in a plurality of records in a table in a database 1806, and a compressed strings index 1808 is constructed, which indexes the compressed strings 1804.
  • a target string 1810, one or more similarity criteria 1811, and a query condition 1812 relating to the target string 1810 are provided by a user of the database 1806, in order to find all strings 1800 in the database 1806 which comply with the provided query condition 1812, as it is applied to strings similar to the target string 1810, to a degree specified by the similarity criteria 1811.
  • the target string 1810 is compressed by a compression mechanism 1814, which may be similar to compression mechanism 1802, into a compressed target string 1816.
  • the compressed target string 1816 and the similarity criteria 1811 are passed on as input to a compressed string similarity search module 1818.
  • the compressed string similarity search module in conjunction with the compressed string index 1808 is operative to query the database 1806, retrieving a plurality of compressed strings which comply with the query condition 1812, as applied to strings which are similar to the target string 1810, to a degree defined by similarity criteria 1811, in compressed form. These results are designated compressed similar to target query results 1820.
  • the compressed string similarity search module 1818 is further described below with reference to Figs. 27, 28, 29A, 29B and 29C.
  • Each of the compressed similar to target query results 1820 is decompressed by a decompression mechanism 1822, collectively yielding a respective plurality of strings similar to target 1824.
  • the actions of string similarity comparison and retrieval are preferably performed on compressed strings: comparing compressed target string 1816 with compressed strings 1804, using the compressed string index 1808.
  • An important aspect of the present invention is that it allows determining the level of similarity between strings, by comparing compressed strings to which these strings correspond.
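  • To illustrate how similarity can be judged without decompressing, the sketch below packs equal-length strings at two bits per character and counts mismatches with a bitwise XOR. This is a generic technique swapped in for illustration only; it is not the compressed string similarity search module 1818, which is described with reference to later figures. The bit codes, the headerless packing and all names are assumptions.

```python
# Two-bit codes per nucleotide (assumed for this illustration).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(sequence):
    """Pack a string over A/C/G/T into an integer, two bits per character."""
    value = 0
    for character in sequence:
        value = (value << 2) | CODE[character]
    return value

def mismatches(packed_a, packed_b, length):
    """Count differing characters between two packed strings of equal length."""
    difference = packed_a ^ packed_b
    count = 0
    for _ in range(length):
        if difference & 0b11:   # any set bit in the pair means the characters differ
            count += 1
        difference >>= 2
    return count

stored = {pack(s): s for s in ("ATGATG", "ATGTTG", "CCCCCC")}
target = pack("ATGATG")
similar = [s for p, s in stored.items() if mismatches(p, target, 6) <= 1]
print(similar)  # ['ATGATG', 'ATGTTG']
```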
  • Fig. 19 is a simplified block diagram illustrating a mechanism for compression and decompression of data. This mechanism is a preferred implementation of the compression mechanisms 1802 and 1814 and of the decompression mechanism 1822 described hereinabove with reference to Fig. 18.
  • An uncompressed string 1900 is given as an example of one of the strings 1800 or of the target string 1810 described hereinabove with reference to Fig. 18.
  • Genomic sequence data is typically represented as alphanumeric strings, each character representing one of four nucleotides, A, T, C, and G, and unknown nucleotides represented by N.
  • the letter N typically appears in genomic sequence data much less frequently than do the other four letters. Therefore, in the context of this example, 'N' is an example of a 'rare character'.
  • the uncompressed string 1900 comprises a plurality of bytes, each storing a character: A, T, C, or G.
  • the uncompressed string 1900 shown in Fig. 19 comprises only three bytes: BYTE I, BYTE II and BYTE III, each storing a character: CHR 1, CHR 2 and CHR 3, respectively.
  • the uncompressed string 1900 is compressed by a compression engine 1902 into a compressed string 1904. Operation of a preferred embodiment of the compression engine 1902 is further described below with reference to Fig. 22.
  • the compression engine 1902 employs a compression table 1906 in compression of uncompressed string 1900 into compressed string 1904.
  • the compression table 1906 is preferably a translation table and holds a list of all possible or reasonable 3-alphanumeric-character combinations and, for each such combination, the byte or bytes into which it may be compressed.
  • a preferred embodiment of the compression table 1906 is further described below with reference to Fig. 24A.
  • the compressed string 1904 preferably comprises one or more compressed bytes 1908.
  • the compressed string 1904 shown in Fig. 19 comprises only one compressed byte 1908, which represents in compressed form all three characters which are represented in the uncompressed string 1900, CHR 1, CHR 2 and CHR 3, which characters require three bytes of storage, BYTE I, BYTE II and BYTE III, in their original uncompressed form.
  • compressed string 1904 which comprises a plurality of compressed bytes 1908, may alternatively be stored as one or more integers.
  • an Integer is typically defined as a data-type comprising 4 bytes.
  • it is thus possible to compress an uncompressed string 1900 comprising 12 characters into 4 compressed bytes 1908, and then to store all 4 resulting compressed bytes 1908 as one Integer.
  • Longer strings may be compressed into longer Integers, such as the BigInt data type in MS SQL-2000®, which comprises 8 bytes, or as a plurality of Integers or BigInts.
  • a string comprising 48 characters may be compressed into 2 BigInts.
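  • The arithmetic above (12 characters, 3 per compressed byte, 4 bytes per Integer) can be sketched as follows. The bit values follow the byte layout described with reference to Figs. 20A, 21A and 21B; the code '01' for 'C' is inferred by elimination, and the function names, big-endian byte order and unsigned interpretation are assumptions.

```python
# Two-bit codes: 'A' and 'G' as stated for Fig. 21B, 'T' from the 'AT' example,
# 'C' inferred as the one remaining value.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def compress_triplet(triplet):
    """Encode three known characters in one byte: header '11' plus three bit-pairs."""
    byte = 0b11
    for character in triplet:
        byte = (byte << 2) | CODE[character]
    return byte

def compress_to_integer(sequence):
    """Compress 12 known characters into 4 bytes and read them as one integer."""
    if len(sequence) != 12:
        raise ValueError("expected exactly 12 characters")
    compressed = bytes(compress_triplet(sequence[i:i + 3]) for i in range(0, 12, 3))
    return int.from_bytes(compressed, byteorder="big")

value = compress_to_integer("ATGATGATGATG")
print(hex(value))  # 0xcececece
```

  • A signed 4-byte database column would hold the same bit pattern; because the header '11' sets the top bit, the stored number would then appear negative.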
  • TinyInt data type, which comprises one byte of memory, may be used to store compressed byte 1908.
  • strings which are longer than 3 or 4 characters, i.e. are compressed into more than one TinyInt
  • TinyInt type fields each representing in compressed form 3 or 4 uncompressed characters. It is then possible to create an indexed View which indexes all these fields together, as one.
  • the compressed byte 1908 is further described below with reference to Figs. 20, 21A, 21B, and 21C.
  • the compressed string 1904 may be decompressed by a decompression engine 1910 back to uncompressed string 1900, preferably by reversal of the methodology of the compression engine 1902. Operation of the decompression engine 1910 is further described below with reference to Fig. 25.
  • Decompression engine 1910 employs a decompression table 1912 in decompressing compressed string 1904 into uncompressed string 1900.
  • the decompression table 1912 is a translation table and holds a list of the bit-values for each possible or reasonable compressed byte 1908 and for each such compressed byte, the alphanumeric string, containing up to three characters, into which it may be decompressed.
  • the decompression table 1912 is further described below with reference to Fig. 24B.
  • Fig. 20A is a simplified illustration of a preferred implementation of a byte bitmap preferably employed in generating the compressed byte 1908 of Fig. 19.
  • the compressed byte 1908 of Fig. 19 preferably comprises 8 bits: BIT I, BIT II, BIT III, BIT IV, BIT V, BIT VI, BIT VII & BIT VIII.
  • these bits are divided into four groups, each containing 2 bits.
  • a HEADER typically contains BIT I and BIT II
  • a BIT-PAIR I typically contains BIT III and BIT IV
  • a BIT-PAIR II typically contains BIT V and BIT VI
  • a BIT-PAIR III typically contains BIT VII and BIT VIII.
  • the HEADER preferably stores information about what the other bits in the compressed byte, BIT III - BIT VIII, represent.
  • the compression of data is such that the compressed byte may either store up to three characters, or may store up to three rare characters, e.g. 'N's in the genomic example, but preferably not a combination of characters and N's.
  • the HEADER stores information which indicates how many characters the compressed byte 1908 represents, one, two or three, or alternatively if the entire compressed-byte represents one or more 'N's.
  • the values assignable to bits of the HEADER are further described below with reference to Fig. 21A.
  • BIT-PAIR I, BIT-PAIR II, and BIT-PAIR III each contain 2 bits which are capable, when taken together, of representing one of four possible characters, A, T, C, and G.
  • the values assignable to the bits of each of the characters - BIT-PAIR I, BIT-PAIR II, and BIT-PAIR III - are further described below with reference to Fig. 21B.
  • values assigned to the two bits of BIT-PAIR I determine whether the compressed byte 1908 represents one, two or three N's. Values assignable to bits of BIT-PAIR I, determining the number of N's that the compressed byte 1908 represents, are further described below with reference to Fig. 21C.
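  • The byte layout of Fig. 20A can be illustrated by splitting a byte into its four 2-bit groups. The sketch assumes that BIT I is the most significant bit, which matches the way bit values are written out in the examples; the function name is hypothetical.

```python
def split_compressed_byte(byte):
    """Return (HEADER, BIT-PAIR I, BIT-PAIR II, BIT-PAIR III) as 2-bit integers."""
    return (byte >> 6) & 0b11, (byte >> 4) & 0b11, (byte >> 2) & 0b11, byte & 0b11

header, pair_1, pair_2, pair_3 = split_compressed_byte(0b11001110)
print(" ".join(format(group, "02b") for group in (header, pair_1, pair_2, pair_3)))
# 11 00 11 10
```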
  • Fig. 20B is a simplified illustration of an alternative preferred implementation of a byte bitmap preferably employed in generating the compressed byte 1908 of Fig. 19.
  • Alternative compressed byte 2000 is an alternative byte bitmap which may be used for compression of data, instead of the byte bitmap of compressed byte 1908 depicted in Fig. 20A.
  • Alternative compressed byte 2000 comprises four bit-pairs, BIT-PAIR I, BIT-PAIR II, BIT-PAIR III and BIT-PAIR IV, rather than only three bit-pairs in compressed-byte 1908 of Fig. 20A.
  • alternative compressed byte 2000 does not comprise any such header. All 8 bits of alternative compressed byte 2000 function as one of four bit-pairs, each of said bit-pairs representing a character.
  • Alternative compressed byte 2000 is therefore capable of representing 4 characters in compressed form, as opposed to compressed byte 1908 of Fig. 20A, which is capable of representing 3 characters.
  • alternative compressed byte 2000 may be useful when compressing strings which do not include rare characters, and are of a fixed length. If the length of the uncompressed string 1900 is known, then it is possible to ignore the possible trailing zeros at the right end of the alternative compressed byte 2000, which do not represent a character, but rather represent a blank.
  • an uncompressed string 1900 which is known to be 7 characters long, may be compressed into 2 alternative compressed bytes 2000: the first containing in compressed form 4 characters, and the second containing 3 characters.
  • in this second alternative compressed byte 2000, BIT VII and BIT VIII of BIT-PAIR IV contain zeros which are ignored because the uncompressed string is known to be 7 characters long, despite the absence of a 'header' which would explicitly instruct to ignore these bits.
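  • A minimal sketch of the headerless layout of Fig. 20B follows, using the 7-character case described above. It assumes the same 2-bit character codes as the header-based layout (with '01' for 'C' inferred) and hypothetical function names. Because 'A' and padding are both '00', the known length is what makes the trailing bits unambiguous.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
LETTER = {value: key for key, value in CODE.items()}

def compress_fixed(sequence):
    """Pack four characters per byte; the last byte is zero-padded on the right."""
    out = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for character in chunk:
            byte = (byte << 2) | CODE[character]
        byte <<= 2 * (4 - len(chunk))  # trailing zero bits stand for blanks
        out.append(byte)
    return bytes(out)

def decompress_fixed(data, length):
    """Unpack all bit-pairs and keep only the first `length` characters."""
    letters = [LETTER[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)]
    return "".join(letters[:length])

packed = compress_fixed("ATGATGA")
print([format(b, "08b") for b in packed])  # ['00111000', '11100000']
print(decompress_fixed(packed, 7))         # ATGATGA
```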
  • Fig. 21A is a table illustrating preferred values assignable to BIT I and BIT II, both belonging to the HEADER of compressed byte 1908 shown in Fig. 20A.
  • Assigning the value '00' to the bits of the HEADER, i.e. assigning '0' to BIT I and assigning '0' to BIT II of compressed byte 1908 shown in Fig. 20A, signifies that the entire compressed byte 1908 represents only one or more rare characters, i.e. 'N', and does not represent any known characters. In the non genomic embodiment of the present invention, it is possible to specify a plurality of different rare characters, such as up to 64, which would be represented by an entire byte, when the value of the header is '00'. Assigning the value '01' to the bits of the HEADER, i.e.
  • Assigning the value '10' to the bits of the HEADER, i.e. assigning '1' to BIT I and '0' to BIT II of compressed byte 1908 shown in Fig. 20A, signifies that the compressed byte 1908 represents two characters; the first character being represented by the values in BIT III & BIT IV, both belonging to BIT-PAIR I, and the second character being represented by values in BIT V & BIT VI, both belonging to BIT-PAIR II of compressed byte 1908.
  • the remaining two bits of the compressed byte 1908, BIT VII & BIT VIII are to be ignored and do not represent any additional character.
  • Fig. 21B is a table illustrating the preferred values assignable to the character-representing bits: BIT III, BIT IV, BIT V, BIT VI, BIT VII, & BIT VIII of compressed byte 1908 shown in Fig. 20A.
  • each of BIT-PAIR I, BIT-PAIR II and BIT-PAIR III in compressed byte 1908 comprises a pair of bits: BIT III & BIT IV, BIT V & BIT VI, and BIT VII & BIT VIII respectively.
  • the values presented in Fig. 21B are values which may be assigned to each of the above mentioned pairs of bits so as to allow each of these pairs of bits to represent one of the four possible genomic characters: A, T, C, or G.
  • Assigning the value '00' to any of the three bit-pairs representing one of the three characters, i.e. assigning '0' to BIT III and '0' to BIT IV, or assigning '0' to BIT V and '0' to BIT VI, or assigning '0' to BIT VII and '0' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, of the compressed byte 1908 of Fig. 20A, represents the character 'A'.
  • Assigning the value of '10' to any of the three bit-pairs representing one of the three characters, i.e. assigning '1' to BIT III & '0' to BIT IV, or assigning '1' to BIT V & '0' to BIT VI, or assigning '1' to BIT VII & '0' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the character 'G'.
  • Fig. 21C is a table illustrating preferred values assignable to BIT III and BIT IV of Fig. 20A respectively, when encoding one or more rare characters.
  • a value '00' assigned to the bits of the HEADER of Fig. 20A signifies that the compressed byte 1908 represents one or more rare characters and does not represent any known characters.
  • BIT III and BIT IV may be used to signify if the entire byte represents one, two or three 'N's. Assigning the value '01' to BIT III & BIT IV, i.e. assigning '0' to BIT III and '1' to BIT IV, signifies that the entire compressed byte 1908 represents one rare character.
  • each compressed byte 1908 to encode more or less than three 'N's
  • BIT III through BIT VIII of Fig. 20A to signify up to 64 'N's represented by a compressed byte 1908.
  • Fig. 22 is a simplified flowchart illustrating operation of the compression engine 1902 of Fig. 19, constructed and operative in accordance with a preferred embodiment of the present invention.
  • compression table 1906 and decompression table 1912 are initially generated.
  • the compression table 1906 stores for each possible combination of up to three characters, e.g. 'ATG', 'GGC', 'AT', bit-values which represent this combination in compressed form in one compressed byte 1908, preferably according to the values suggested in Figs. 21A, 21B, and 21C.
  • Figs. 24A and 24B present examples of preferred compression and decompression tables 1906 and 1912 respectively.
  • an iterative process of compression of multiple strings takes place.
  • An uncompressed string such as uncompressed string 1900 of Fig. 19, is received.
  • the uncompressed string 1900 is represented by the string 'ATGAT'.
  • This example is followed through the following steps of Fig. 22.
  • the uncompressed string 1900 preferably is parsed into sub-strings, each having up to three characters (3-character-substrings), by parsing the uncompressed string 1900 from left to right. It should be noted that one or more of the characters in a 3-character-substring may in fact be unknown, i.e. an 'N'. In the given example the string 'ATGAT' is parsed from left to right into 3-character-substrings, yielding 'ATG' and 'AT'.
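  • The left-to-right parsing step can be sketched in a few lines; the function name `parse_substrings` is hypothetical.

```python
def parse_substrings(uncompressed, width=3):
    """Split a string from left to right into substrings of up to `width` characters."""
    return [uncompressed[i:i + width] for i in range(0, len(uncompressed), width)]

print(parse_substrings("ATGAT"))  # ['ATG', 'AT']
```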
  • Fig. 23 is a simplified flowchart illustrating a preferred functionality for generating translation tables, including compression table 1906 and decompression table 1912 of Fig. 19.
  • N's are encoded in compressed bytes 1908 which do not also represent known characters. Values '00' are assigned to the headers of such compressed bytes and all known characters in the 3-character-substring are encoded in other bytes.
  • the HEADER is assigned the value '10' and the two characters are represented by BITS III - VI of the compressed byte 1908, with the first character represented by BIT III & BIT IV and the second character represented by BIT V & BIT VI.
  • the HEADER is assigned the value '11' and the three characters are represented by BITS III - VIII of the compressed byte 1908, with the first character represented by BIT III & BIT IV, the second character represented by BIT V & BIT VI and the third character represented by BIT VII & BIT VIII.
  • Each 3-character-substring and its corresponding one or more compressed bytes 1908 are stored in translation tables including compression table 1906 and decompression table 1912.
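  • Table generation for combinations of known characters can be sketched as below. Two values are inferred rather than stated in the text available here: the code '01' for 'C' and the header '01' for a single character, each being the one value left over. Combinations containing 'N' are left out of this sketch, and the names are hypothetical.

```python
from itertools import product

CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}  # '01' for 'C' is inferred
HEADER = {1: "01", 2: "10", 3: "11"}                 # '01' for one character is inferred

def build_tables():
    """Build compression and decompression tables for 1 to 3 known characters."""
    compression = {}
    for length in (1, 2, 3):
        for letters in product("ACGT", repeat=length):
            bits = HEADER[length] + "".join(CODE[c] for c in letters)
            compression["".join(letters)] = bits.ljust(8, "0")  # unused bits set to 0
    decompression = {bits: combo for combo, bits in compression.items()}
    return compression, decompression

compression_table, decompression_table = build_tables()
print(compression_table["ATG"])  # 11001110
print(compression_table["AT"])   # 10001100
print(len(compression_table))    # 84
```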
  • rare characters are typically very rare in typical strings, and that furthermore when 'N's appear in a string they tend to appear contiguously, signifying a 'gap' in the sequenced genome. Instances of isolated single or double N's are typically less frequent than instances of contiguous 'N's.
  • the present invention utilizes this fact to achieve an optimized compression suited specifically for genomic sequence data: three-character combinations which contain only characters and no N's, as well as those containing only N's and no characters, are both compressed into a single byte. Most of the rare cases of 3-character mixtures of characters and N's are compressed into two bytes. Only a minority of extremely rare combinations of characters and N's require three bytes and therefore are in fact not compressed.
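  • The split into one-byte, two-byte and three-byte cases can be checked by counting. The sketch assumes one compressed byte per maximal run of known characters or of 'N's, which is how the text describes keeping 'N's apart from known characters; the function name is hypothetical.

```python
from collections import Counter
from itertools import groupby, product

def bytes_needed(substring):
    """One byte per maximal run of known characters or of 'N's."""
    return sum(1 for _ in groupby(substring, key=lambda c: c == "N"))

# Tally all 125 three-character combinations over A, C, G, T and N.
tally = Counter(bytes_needed("".join(t)) for t in product("ACGTN", repeat=3))
print(sorted(tally.items()))  # [(1, 65), (2, 40), (3, 20)]
```

  • Under this assumption only 20 of the 125 combinations, those in which an 'N' separates two known characters or a known character separates two 'N's, need three bytes.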
  • FIG. 24A is a simplified illustration of a preferred implementation of compression table 1906 of Fig. 19, employed in accordance with a preferred embodiment of the present invention.
  • the purpose of compression table 1906 is to provide a translation-table, also referred to as a 'lookup table', which provides the bit-values of the one or more compressed bytes 1908 required to represent in compressed form every possible 1-character, 2-character and 3-character sub-string of uncompressed string 1900.
  • the compression table 1906 is described here logically, as a database table comprising fields into each of which multiple values are stored in respective multiple records. It is appreciated by those skilled in the art that the description of the compression table 1906 in terms of a table comprising fields is meant for clarity and not meant to be limiting and that the compression table 1906 may equally be implemented as a 'CASE', or 'IF-THEN' programming code in any suitable computer language, as is well known in the art. For example, computer code can be written, which comprises a plurality of 'IF-THEN' or 'CASE' arguments, each one of the arguments providing bit-values of the one or more compressed bytes 1908 representing in compressed form one 3-character-substring of uncompressed string 1900.
  • Compression table 1906 preferably comprises multiple records each containing 4 fields: uncompressed character-combination 2400, compressed byte I 2402, compressed byte II 2404 and compressed byte III 2406. For clarity, an example is given of the content which may be stored in each of these fields.
  • the uncompressed character combination 2400 is a field which stores all possible 3-character substrings, i.e. 1-character, 2-character and 3-character combinations, including combinations of characters only and combinations which include N's.
  • uncompressed character combination 2400 stores a 3-character combination 'ATN'.
  • Compressed byte I 2402, compressed byte II 2404 and compressed byte III 2406, respectively, are fields which store for each uncompressed character-combination 2400 the bit-values for each of the one or more compressed bytes 1908 required for encoding it.
  • compressed byte I 2402 stores '10001100', which represents the character-combination 'AT'.
  • compressed byte II 2404 stores '00010000', which represents 'N'. Compressed byte III 2406, in the given example, stores null, since only two compressed bytes are required to represent the character combination 'ATN'.
  • most 3-character substrings may be compressed into one compressed byte 1908, some rare combinations may be compressed into two compressed bytes, and some 20 very rare combinations may require 3 compressed bytes, and therefore may not be compressed. Therefore, notwithstanding that compression table 1906 comprises three compressed bytes fields 2402-2406, one compressed byte field, such as compressed byte I 2402, is sufficient to translate a vast majority of 3-character combinations to be typically found in a string.
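  • A single record of the compression table, including the 'ATN' example, might be produced as in this sketch. Only the 'N' count value '01' (one 'N') is confirmed by the text; the values '10' and '11' for two and three 'N's, the code '01' for 'C', the header '01' for one character, and the function name `table_row` are assumptions.

```python
from itertools import groupby

CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}  # '01' for 'C' is inferred
HEADER = {1: "01", 2: "10", 3: "11"}                 # '01' for one character is inferred
N_COUNT = {1: "01", 2: "10", 3: "11"}                # only '01' is confirmed by the text

def table_row(combination):
    """Return (compressed byte I, II, III) as bit strings, padded with None."""
    encoded = []
    for is_rare, run in groupby(combination, key=lambda c: c == "N"):
        run = list(run)
        if is_rare:
            bits = "00" + N_COUNT[len(run)]          # header '00' marks a rare-character byte
        else:
            bits = HEADER[len(run)] + "".join(CODE[c] for c in run)
        encoded.append(bits.ljust(8, "0"))           # unused bits set to 0
    return tuple(encoded + [None] * (3 - len(encoded)))

print(table_row("ATN"))  # ('10001100', '00010000', None)
```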
  • Fig. 24B is a simplified illustration of a preferred implementation of decompression table 1912 of Fig. 19, employed in accordance with a preferred embodiment of the present invention.
  • the purpose of decompression table 1912 is to provide a translation-table, also referred to as a 'lookup table', which provides the 1-character, 2-character or 3-character uncompressed string 1900 preferably corresponding to every possible compressed byte 1908.
  • it is appreciated that the description of decompression table 1912 in terms of a table comprising fields is meant for clarity and not meant to be limiting, and that the decompression table 1912 may equally be implemented as a 'CASE' code in any computer language, as is well known in the art.
  • the decompression table 1912 preferably comprises multiple records each containing 2 fields: compressed byte 2408 and decompressed character-combination 2410.
  • Compressed byte 2408 is a field which preferably stores bit-values of every possible compressed byte 1908.
  • Decompressed character combination 2410 is a field which stores for each compressed byte 2408 the 1-character, 2-character or 3-character uncompressed string 1900 which it encodes.
  • the field compressed byte 2408 may contain the compressed byte 1908 bit-value '10001100' and the respective field decompressed character combination 2410 may contain the 2-character combination 'AT' which this bit-value represents in compressed form.
  • Fig. 25 is a simplified flowchart illustrating operation of decompression engine 1910 of Fig. 19 constructed and operative in accordance with a preferred embodiment of the present invention.
  • decompression engine 1910 of Fig. 19 performs a reverse action of that of compression engine 1902 of Fig. 19, which was further described hereinabove with reference to Fig. 22.
  • a compressed string 1904 is received in order to be decompressed.
  • Decompression engine 1910 of Fig. 19 gets the compressed string 1904 of Fig. 19 to be decompressed.
  • the process shown in Fig. 25 is explained with reference to an example wherein the compressed string 1904 comprises two compressed bytes 1908, the bit-values of which are '11001110' and '10001100' respectively. This example is followed through the following steps of Fig. 25.
  • a recursive operation is initiated, which parses the received compressed string 1904 into the compressed bytes 1908 included in this compressed string.
  • Each compressed byte 1908 is looked up in the decompression table 1912 to find, based on the contents of that table, the substring of up to 3 characters which this compressed byte represents.
  • the substrings are combined to yield an uncompressed string 1900.
  • the second compressed byte in the compressed string has bit-values of '10001100', which when looked up in the decompression table is found to represent the character combination 'AT'.
  • 'ATG' represented by compressed byte '11001110'
  • 'AT' represented by compressed byte '10001100'
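The lookup and decoding steps described for Figs. 24B and 25 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `CODE_TO_BASE`, `decode_byte`, `DECOMPRESSION_TABLE` and `decompress` are hypothetical. The 2-bit character codes and the HEADER values '10', '11' and '00' follow the worked examples in this document, whereas treating HEADER '01' as a one-character byte and reading the number of N's from BIT-PAIR I alone are assumptions.

```python
# 2-bit character codes as used in the worked examples: A=00, C=01, G=10, T=11.
CODE_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}


def decode_byte(byte: int) -> str:
    """Decode one compressed byte laid out as HEADER | PAIR I | PAIR II | PAIR III."""
    header = (byte >> 6) & 0b11
    pairs = [(byte >> shift) & 0b11 for shift in (4, 2, 0)]
    if header == 0b00:
        # Byte dedicated to N's. Assumption: BIT-PAIR I alone holds the count.
        return "N" * pairs[0]
    # HEADER 10 -> 2 characters, 11 -> 3 characters (from the examples);
    # HEADER 01 -> 1 character is an assumption. Unused bit pairs are ignored.
    return "".join(CODE_TO_BASE[p] for p in pairs[:header])


# Lookup table with one record per possible compressed byte value.
DECOMPRESSION_TABLE = {value: decode_byte(value) for value in range(256)}


def decompress(compressed: bytes) -> str:
    """Look up every compressed byte and concatenate the resulting substrings."""
    return "".join(DECOMPRESSION_TABLE[b] for b in compressed)


print(decompress(bytes([0b11001110, 0b10001100])))  # prints ATGAT
```

Precomputing all 256 records mirrors the table form of the description; calling `decode_byte` directly would correspond to the 'CASE' code alternative.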
  • FIG. 26A is a simplified illustration of an example of compressing an uncompressed string 1900 into a compressed string 1904, both shown in Fig. 19.
  • Uncompressed string 'ATGAT' 2600 is an uncompressed string 1900 comprising the characters: 'ATGAT'.
  • the result is two 'character-triplets': character-triplet-1 'ATG' 2602 and character-triplet-2 'AT_' 2604. It is appreciated that the character-triplet-2 'AT_' 2604 actually contains only two characters: A and T.
  • Compressed byte-1 2606 encodes three characters: 'ATG'. Therefore a value of '11' is assigned to the two bits of the HEADER of compressed byte-1 2606, as indicated by reference numeral 2608, signifying that this compressed byte 1908 encodes 3 characters.
  • Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte-1 2606, as indicated by reference numeral 2610, signifying that the first character represented by this byte is 'A'.
  • Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-1 2606, as indicated by reference numeral 2612, signifying that the second character represented by this byte is 'T'.
  • Value '10' is assigned to the two bits of BIT-PAIR III of compressed byte-1 2606, as indicated by reference numeral 2614, signifying that the third character represented by this byte is 'G'.
  • Compressed byte-2 2607 encodes two characters: 'AT'. Therefore a value of '10' is assigned to the two bits of the HEADER of compressed byte-2 2607, as indicated by reference numeral 2616, signifying that this compressed byte 1908 encodes 2 characters, and that therefore the two bits of BIT-PAIR III are to be ignored.
  • Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte-2 2607, as indicated by reference numeral 2618, signifying that the first character represented by this byte is 'A'.
  • Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-2 2607, as indicated by reference numeral 2620, signifying that the second character represented by this byte is 'T'.
  • Value '00' stored in the bits of BIT-PAIR III of compressed byte-2 2607, as indicated by reference numeral 2622, is ignored and does not represent an 'A', since the HEADER specified that this byte encodes only 2 characters.
  • FIG. 26B is a simplified illustration of another example of compression of an uncompressed string 1900 of Fig. 19, containing a rare character, into a compressed string 1904 of Fig. 19.
  • Uncompressed string 'ATNCG' 2650 is an uncompressed string 1900 comprising the characters: 'ATNCG'.
  • since character-triplet-1 'ATN' 2652 contains an 'N', it is preferably represented by two compressed bytes 1908 rather than one: the first, compressed byte-1 2656, encodes 'AT', and the second, compressed byte-2 2658, encodes 'N'.
  • Compressed byte-1 2656 encodes two characters, 'AT', therefore '10' is assigned to the two bits of the HEADER of compressed byte-1 2656, as indicated by reference numeral 2660, signifying that this compressed byte 1908 encodes 2 characters.
  • Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte-1 2656, as indicated by reference numeral 2662, signifying that the first character represented by this byte is 'A'.
  • Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-1 2656, as indicated by reference numeral 2664, signifying that the second character represented by this byte is 'T'.
  • Compressed byte-2 2658 is dedicated to encoding 'N's, in this case only one 'N', which is derived from the character-triplet-1 'ATN' 2652. Therefore '00' is assigned to the two bits of the HEADER of compressed byte-2 2658, as indicated by reference numeral 2668, signifying that this compressed byte 1908 is dedicated to encoding one or more N's.
  • the value '01' is assigned to the two bits of BIT-PAIR I of compressed byte-2 2658, as indicated by reference numeral 2670, signifying that this byte, which is dedicated to encoding N's, encodes only one N. Accordingly, the zeros in the two bits of BIT-PAIR II and the two bits of BIT-PAIR III, indicated by reference numerals 2672 & 2674, are ignored.
  • Compressed byte-3 2690 encodes two characters: 'CG'. Therefore a value of '10' is assigned to the two bits of the HEADER of compressed byte-3 2690, as indicated by reference numeral 2676, signifying that this compressed byte 1908 encodes only 2 characters.
  • Value '01' is assigned to the two bits of BIT-PAIR I of compressed byte-3 2690, as indicated by reference numeral 2678, signifying that the first character represented by this byte is 'C'.
  • Value '10' is assigned to the two bits of BIT-PAIR II of compressed byte-3 2690, as indicated by reference numeral 2680, signifying that the second character represented by this byte is 'G'.
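The two compression examples of Figs. 26A and 26B may be reproduced with a short Python sketch. The function names `pack` and `compress` are hypothetical, and two details are assumptions rather than statements of the source: a single ordinary character is given HEADER '01', and the number of N's in a dedicated byte is stored in BIT-PAIR I only. The sketch splits the input into character-triplets and, within each triplet, into runs of N's and runs of ordinary characters.

```python
import re

# 2-bit character codes as used in the worked examples.
BASE_TO_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(header: int, pairs: list) -> int:
    """Pack a 2-bit HEADER and up to three 2-bit values into one byte."""
    p1, p2, p3 = (list(pairs) + [0, 0, 0])[:3]  # unused bit pairs stay 00
    return (header << 6) | (p1 << 4) | (p2 << 2) | p3


def compress(sequence: str) -> list:
    """Compress a string over A, C, G, T and N into a list of byte values."""
    if set(sequence) - set("ACGTN"):
        raise ValueError("only A, C, G, T and N are supported in this sketch")
    out = []
    for start in range(0, len(sequence), 3):
        triplet = sequence[start:start + 3]
        # Split the triplet into runs of N's and runs of ordinary characters.
        for run in re.findall(r"N+|[ACGT]+", triplet):
            if run[0] == "N":
                # HEADER 00: byte dedicated to N's, count stored in BIT-PAIR I.
                out.append(pack(0b00, [len(run)]))
            else:
                # HEADER equals the number of characters (01 is an assumption).
                out.append(pack(len(run), [BASE_TO_CODE[c] for c in run]))
    return out


for text in ("ATGAT", "ATNCG"):
    print(text, [format(b, "08b") for b in compress(text)])
```

Under these assumptions 'ATGAT' yields the bytes '11001110' and '10001100', and 'ATNCG' yields '10001100', '00010000' and '10011000', matching the bit assignments listed above.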
  • Fig. 27 is a simplified block diagram illustrating shifted compressed strings utilized by the compressed string similarity search module 1818 of Fig. 18 constructed and operative in accordance with a preferred embodiment of the present invention.
  • each compressed byte represents in compressed form 3 or 4 characters. It is therefore very easy to compare entire 'triplets' of characters, but if an addition or deletion of a single character occurs, then all triplets 'downstream' of the one modified will seem to have changed completely, whereas in fact they have only been 'shifted' to the right or to the left by one location.
  • the basic concept of the compressed character string similarity search module 1818 of Fig. 18 is therefore to calculate all possible 'shifted' variations of the compressed character string 1810 of Fig. 18, and to use them to search for compressed character strings similar to target 1820 of Fig. 18.
  • target string 2700 which is a compressed character string comprising 12 characters, N1 through N12, which are represented in compressed form by 4 compressed bytes 1908, BYTE 1, BYTE 2, BYTE 3 and BYTE 4.
  • shifted strings 2702 are generated: 'minus one' shifted string 2704, 'minus two' shifted string 2706, 'plus one' shifted string 2708 and 'plus two' shifted string 2710.
  • the first character in target string 2700, N1, has been removed in the 'minus one' shifted string 2704, therefore 'minus one' shifted string 2704 begins with N2.
  • the characters compressed into each of the four compressed bytes 1908 of 'minus one' shifted string 2704 are therefore 'shifted to the left' by one location.
  • sequence of characters compressed into each of the four compressed bytes 1908 of 'minus two' shifted string 2706 is shifted to the left by two locations; that of 'plus one' shifted string 2708 is shifted to the right by one location; and that of 'plus two' shifted string 2710 is shifted to the right by two locations.
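The generation of shifted strings can be illustrated at the character level, before compression, with the following Python sketch. The names `PAD`, `shifted_triplets` and `all_shifted` are hypothetical, and the use of a placeholder character to fill the positions opened up by a shift to the right is an assumption, since the description above does not state how those positions are filled.

```python
PAD = "_"  # hypothetical placeholder for positions opened up by a right shift


def shifted_triplets(chars: str, offset: int) -> list:
    """Regroup characters into 3-character units after shifting by `offset`.

    A negative offset drops leading characters (shift to the left);
    a positive offset prepends placeholders (shift to the right).
    """
    shifted = chars[-offset:] if offset < 0 else PAD * offset + chars
    return [shifted[i:i + 3] for i in range(0, len(shifted), 3)]


def all_shifted(chars: str, max_shift: int = 2) -> dict:
    """Return the shifted variants keyed by offset: -2, -1, +1, +2 by default."""
    return {
        offset: shifted_triplets(chars, offset)
        for offset in range(-max_shift, max_shift + 1)
        if offset != 0
    }


target = "ACGTACGTACGT"  # 12 characters, i.e. 4 triplets when unshifted
for offset, units in all_shifted(target).items():
    print(offset, units)
```

Each 3-character unit would then be compressed into a byte, so that a candidate string with one deleted or added character lines up byte-for-byte with one of the shifted variants downstream of the change.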
  • Fig. 28 is a simplified flowchart illustrating operation of the compressed string similarity search module 1818 of Fig. 18 constructed and operative in accordance with a preferred embodiment of the present invention.
  • Operation of the compressed string similarity search module 1818 begins by getting a target compressed string 1810 of Fig. 18.
  • all compressed strings 1804 having at least one compressed byte which matches that of the compressed target string 2700 or of one of the four shifted strings 2704-2710 are retrieved. It is important to note that a match is looked for only between bytes occupying the same location in the compressed genomic string: the first compressed byte in a compressed string 1804 is compared to the first compressed byte of the compressed target string 1816 and to the first compressed byte of each of the four compressed shifted strings. It is not compared to any other, e.g. second, third or fourth, compressed bytes within these strings. All compressed strings having at least one match are considered potentially similar, and are passed on to the next step.
  • Compressed strings having fewer mismatching compressed bytes with the target or one of the shifted strings than a certain user-defined 'threshold' are considered potentially very similar, and are passed on to the next step.
  • mismatching compressed byte/s are further analyzed to determine the exact nature of the mistake, in order to further fine-tune the similarity comparison.
  • the resulting compressed strings similar to target 1816 of Fig. 18 are considered to represent in compressed form genomic sequences which are similar to the target genomic sequence represented in compressed form by the compressed target string 1816, and are delivered.
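The first two filtering steps of Fig. 28 can be sketched in Python as follows. The names `matching_positions` and `find_similar` are hypothetical, and counting mismatches as the number of candidate bytes left unmatched against the best-matching reference is an assumption. The final step, in which mismatching bytes are analyzed to determine the nature of the mistake, is not shown.

```python
def matching_positions(candidate: bytes, reference: bytes) -> int:
    """Count bytes that are equal at the same position in both strings."""
    return sum(a == b for a, b in zip(candidate, reference))


def find_similar(candidates, target: bytes, shifted, threshold: int) -> list:
    """Return (candidate, mismatches) pairs that pass both filtering steps."""
    references = [target, *shifted]
    results = []
    for candidate in candidates:
        best = max(matching_positions(candidate, ref) for ref in references)
        if best == 0:
            continue  # step 1: at least one byte must match at the same position
        mismatches = len(candidate) - best
        if mismatches < threshold:  # step 2: fewer mismatches than the threshold
            results.append((candidate, mismatches))
    return results


target = bytes([0xCE, 0x8C, 0xD2, 0xE7])
one_replacement = bytes([0xCE, 0x80, 0xD2, 0xE7])  # differs in the second byte only
unrelated = bytes([0x01, 0x02, 0x03, 0x04])
print(find_similar([one_replacement, unrelated], target, [], threshold=2))
```

Because only bytes at the same position are compared, the comparison operates on the compressed form directly and never decompresses a candidate.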
  • Fig. 29A is a simplified illustration of an example of identifying a string having one character replacement relative to a target string.
  • Fig. 29A shows a string designated SIMILAR TO TARGET STRING (1 REPLACEMENT), in which a character, designated N13 shown in broken line format, in the compressed byte designated 1R-BYTEII, has replaced character N5 in that same spot in the original TARGET STRING.
  • in Fig. 29A, 3 of the 4 compressed bytes of SIMILAR TO TARGET GENOMIC SEQUENCE (1 REPLACEMENT), shown in bold line format, still match those in TARGET GENOMIC SEQUENCE, and so the two genomic sequences can be deduced as being similar, by comparison of their compressed format, without decompressing them.
  • Fig. 29B is a simplified illustration of an example of identifying a string having two character additions relative to a target string.
  • Fig. 29B shows a string designated SIMILAR TO TARGET STRING (2 ADDITIONS), in which two characters, designated N13 and N14, shown in broken line format, in compressed byte designated 2A-BYTEII, have been added to the string relative to the original TARGET STRING, 'pushing' characters N5 and N6 to the next compressed byte designated 2A-BYTE III, and shifting all the following characters by two positions.
  • the two strings may therefore be deduced as being similar, differing by a two addition mistake in the mismatched compressed byte 2A-BYTE II.
  • Fig. 29C is a simplified illustration of an example of identifying a character string having one character deletion relative to a target string.
  • Fig. 29C shows a string designated SIMILAR TO TARGET STRING (1 DELETION), in which one character, designated N5, of TARGET STRING has been deleted, shifting all characters from N6 onwards one position to the left.
  • N5 in byte 1D-BYTE II is represented by a small blank broken line box between N4 and N6.
  • the two strings may therefore be deduced as being similar, differing by a one deletion mistake in the mismatched compressed byte 1D-BYTE II.
  • while Figs. 27, 29A, 29B and 29C demonstrate detection of up to 2 addition or deletion mistakes, the same concept may be utilized to detect a wider spectrum of mistakes, by generating more 'shifted sequences' accordingly, e.g. a 'plus three' shifted sequence etc.
  • Fig. 30 is a simplified illustration of a triphase-bit compressed character used in accordance with a preferred embodiment of the present invention.
  • Fig. 30 is a preferred implementation of the present invention in a tri-phase bit environment.
  • a compressed triphase-bit compressed byte 3000 is a preferred embodiment of a compressed string 1804 of Fig. 18, when implemented in a triphase-bit computer environment.
  • Triphase-bit compressed byte 3000 preferably comprises the following three elements: a HEADER comprising two bits: BIT I and BIT II, CHAR I comprising three bits BIT III, BIT IV & BIT V, and CHAR II comprising three bits BIT VI, BIT VII & BIT VIII.
  • each of the two alphanumeric characters which may be represented by the triphase compressed character 3000 may be one of the following three common-character-sets: uppercase English letters, lowercase English letters, or numerals and seventeen other common symbols.
  • the HEADER may indicate that the entire byte represents one rare character, one of seven hundred twenty nine options, as is explained below.
  • the following values may be utilized by the two bits of the HEADER:
  • the values '01' in the two bits of the HEADER signify that the entire byte represents only one character, which is a lowercase letter.
  • the values '02' in the two bits of the HEADER signify that the entire byte represents only one character, which is an uppercase letter.
  • the values '10' in the two bits of the HEADER signify that the entire byte represents only one character, which is a numeral or another common symbol. Up to 17 common symbols may be represented in this way, in addition to the ten numerals, since CHAR I and CHAR II each contain 3 bits and may therefore represent 27 options.
  • the values '11' in the two bits of the HEADER signify that the byte represents two characters, which are both lowercase letters.
  • the values '12' in the two bits of the HEADER signify that the byte represents two characters, which are both uppercase letters.
  • the values '20' in the two bits of the HEADER signify that the byte represents two characters, which are both numerals or common symbols.
  • the values '21' in the two bits of the HEADER signify that the byte represents two characters, the first of which is a lowercase letter and the second of which is an uppercase letter
  • the values '22' in the two bits of the HEADER signify that the byte represents two characters, the first of which is an uppercase letter and the second of which is a lowercase letter
  • while Fig. 30 depicts a configuration where 2 bits are utilized as a HEADER of the triphase-bit compressed character 3000, this is not necessary. It is possible, for example, to encode two characters in the triphase-bit compressed character 3000, each comprising four, rather than three, bits, and therefore supporting 81 options, one of these options being a null.
  • Fig. 31 is a simplified block diagram illustrating a computer application constructed and operative in accordance with another preferred embodiment of the present invention. It is appreciated that the computer application may be implemented in any appropriately programmed computer system such as, for example, a suitable personal computer including an operating system having a suitable graphical user interface.
  • Fig. 31 comprises a mechanism for conversion of a standard alphanumeric representation of genomic sequence information into a more informative and intuitive graphical representation, conducive to visual pattern analysis of genomic sequence information.
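The option counts quoted for the triphase-bit compressed byte follow from each triphase bit taking one of three values. The short Python check below recomputes them; the variable names are hypothetical and the snippet is only an arithmetic illustration, not part of the described system.

```python
from itertools import product

TRIT_STATES = "012"  # each triphase bit takes one of three values

# All values that a two-bit triphase HEADER can take.
header_values = ["".join(p) for p in product(TRIT_STATES, repeat=2)]

options_per_char = len(TRIT_STATES) ** 3          # CHAR I or CHAR II: three triphase bits
symbols_besides_numerals = options_per_char - 10  # options left after the ten numerals
rare_char_options = len(TRIT_STATES) ** 6         # all six non-HEADER triphase bits
four_bit_options = len(TRIT_STATES) ** 4          # alternative layout: four bits per character

print(header_values)
print(options_per_char, symbols_besides_numerals, rare_char_options, four_bit_options)
# prints 27 17 729 81 on the second line
```

The nine possible HEADER values are '00' through '22', of which '01' through '22' are the eight enumerated above.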
  • Sequenced DNA 3100 is biological information relating to a sequence of nucleotides - adenine, thymine, guanine, and cytosine - of a given DNA molecule, or genome. Determining the sequence of nucleotides of a genome is achieved by various 'wet-lab' sequencing methodologies and techniques, as is well known in the art.
  • genomic alphanumeric representation 3110 is an alphanumeric string used both for computer storage of sequenced DNA data and for its presentation.
  • the genomic alphanumeric representation 3110 comprises primarily four letters: 'A' representing adenine, 'T' representing thymine, 'G' representing guanine and 'C' representing cytosine.
  • nucleotide adenine represented by 'A' is a 'counterpart' of nucleotide thymine represented by 'T': wherever the DNA sequence on one DNA strand contains adenine, the opposite strand at that exact location contains thymine, and vice-versa.
  • nucleotide guanine represented by 'G' is a 'counterpart' of nucleotide cytosine represented by 'C'. Wherever the DNA sequence on one DNA strand contains guanine, the opposite strand at that exact location contains cytosine, and vice-versa.
  • adenine represented by 'A' and guanine represented by 'G' are purine nucleotides, whereas thymine represented by 'T' and cytosine represented by 'C' are pyrimidine nucleotides.
  • a genomic graphic representation engine 3120 receives the genomic alphanumeric representation 3110 and converts it into a genomic graphical representation 3130.
  • a preferred embodiment of a genomic graphic representation engine 3120 is further described below with reference to Fig. 32.
  • the genomic graphic representation 3130 produced by the genomic graphic representation engine 3120 preferably represents each of the four letters - 'A', 'T', 'G' and 'C' - in the original genomic alphanumeric representation 3110, using a plurality of graphic parameters.
  • the alphanumeric representation 3110 is represented by a combination of a graphic shape, a vertical orientation, and a color specific to that letter, as is further described below.
  • the genomic sequence data may also include unknown nucleotides, i.e. nucleotides in the genomic sequence which the sequencing process was unable to identify. Unknown nucleotides are typically represented by 'N' or '-'. These may also be represented in the graphic representation 3130 by a designated shape, color, and letter, as per user preference.
  • a preferred embodiment of the genomic graphic representation 3130 may include a genomic font with embedded letters 3140, in which a letter representing each nucleotide is embedded on a shape that represents it graphically.
  • the letter 'A' is represented by an upward oriented half-oval with an embedded letter 'A', as illustrated by reference numeral 3141.
  • the letter 'T' is represented by a downward oriented half-oval with an embedded letter 'T', as illustrated by reference numeral 3142.
  • the letter 'G' is represented by an upward oriented half-square with an embedded letter 'G', as illustrated by reference numeral 3143.
  • the letter 'C' is represented by a downward oriented half-square with an embedded letter 'C', as illustrated by reference numeral 3144.
  • the genomic graphic representation 3130 may also include a genomic font without letters 3150, in which only a shape is used to graphically represent each letter, without any letter embedded on the shape.
  • an upward oriented half-oval 3151 represents 'A'
  • a downward oriented half-oval 3152 represents 'T'
  • an upward oriented half-square 3153 represents 'G'
  • a downward oriented half-square 3154 represents 'C'.
  • each of the four shapes without letters 3151, 3152, 3153 & 3154, or alternatively each of the four shapes with embedded letters 3141, 3142, 3143 & 3144, representing the four letters 'A', 'T', 'G' and 'C', respectively, may be displayed in a different color, according to the user's preference.
  • a preferred embodiment of the current invention displays the above mentioned shapes in red, blue, brown, and green, respectively.
  • genomic graphic representation 3130 provides enhanced visual discrimination between adenine-thymine counterparts, and guanine-cytosine counterparts.
  • 'A' and 'T' are represented by two vertical 'complementary' halves of one shape.
  • 'A' and 'T' are represented by two halves of an oval 3151 and 3152 respectively.
  • 'G' and 'C' are also represented by two vertical complementary halves of a different shape.
  • 'G' and 'C' are represented by two halves of a square 3153 and 3154 respectively.
  • 'AT-rich' DNA segments i.e. segments in which there is a higher incidence of adenine and thymine nucleotides
  • 'CG-rich' DNA segments i.e. segments in which there is a higher incidence of cytosine and guanine nucleotides.
  • 'AT-rich' DNA segments, in which the oval shapes are more dominant, can be discerned visually with enhanced ease from 'CG-rich' segments, in which square shapes are more dominant. Different shapes may be utilized other than the ones described here, e.g. a triangle may be used instead of a half-oval.
  • Enhanced visual discernment of AT-rich from CG-rich DNA sequences is further described below with reference to Fig. 35.
  • both purine nucleotides adenine ('A') and guanine ('G') are graphically represented by shapes that have an upward orientation: an upward oriented half-oval 3151 and an upward oriented half-square 3153, respectively.
  • both pyrimidine nucleotides thymine ('T') and cytosine ('C') are graphically represented by shapes that have a downward orientation: a downward oriented half-oval 3152 and a downward oriented half-square 3154, respectively.
  • An example of the usefulness of the enhanced ease of visually distinguishing purine nucleotides from pyrimidine nucleotides is the enhanced ease of visually discerning the similarity between two genomic motifs. When comparing two genomic motifs, one ending with adenine ('A') and the other ending with guanine ('G'), the similarity between the two motifs is made more visually apparent, since adenine and guanine are both purine nucleotides and are both graphically represented by upward oriented shapes. Visually distinguishing purine nucleotides from pyrimidine nucleotides is further described below with reference to Fig. 36. It is yet further appreciated that the genomic graphic representation 3130 described above may also provide enhanced visual discrimination between the four different nucleotides, based on their different colors.
  • Fig. 31 illustrates a human sensible, graphical representation of genomic sequence data in order to represent one or more genomic attributes of each of the four nucleotides
  • another implementation of the present invention may use machine sensible representation in order to represent these attributes.
  • Fig. 32 is a simplified flowchart illustrating preferred operation of the genomic graphic representation engine 3120 of Fig. 31, constructed and operative in accordance with a preferred embodiment of the present invention.
  • a genomic font is produced, preferably using conventional font-creation software, such as 'FONT CREATOR PROGRAM'.
  • preferred shapes are assigned to each of the four letters 'A', 'T', 'G' and 'C', such as the shapes indicated by reference numerals 3151-3154 of Fig. 31, respectively.
  • the genomic font is employed in two forms: the first comprising shapes with embedded letters, as designated by reference numeral 3140, and the other comprising shapes without embedded letters, as designated by reference numeral 3150, both of Fig. 31.
  • the shapes representing each of the four letters 'A', 'T', 'G' & 'C' are preferably those designated by reference numerals 3141, 3142, 3143 & 3144 and by reference numerals 3151, 3152, 3153 & 3154 respectively in Fig. 31.
  • the process of generating a genomic font preferably is a one-time process, and hence is connected to the next step by a broken line. It is typically carried out once, before an iterative process of converting the representation of multiple genomic sequences, from a genomic alphanumeric representation 3110 into a genomic graphic representation 3130.
  • Once a genomic font has been created, the process of graphically representing genomic sequence data may be very simple: an alphanumeric string representing genomic sequence data is received and a genomic font, generated by the previous step, is applied to this alphanumeric string.
  • Different colors may be applied to different letters, typically by using standard 'search-and-replace' commands, as is known in the art.
  • the colors applied to the letters 'A', 'T', 'G' & 'C' are red, blue, brown & green respectively.
  • other colors may be used, according to user preferences. It should be noted, that applying different colors to different letters is an optional step: the user may or may not want to view the different letters in different colors, or may want to view a group of letters, for example purine nucleotides or pyrimidine nucleotides, or A-T or C-G, or some other grouping in a certain color.
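The assignment of a shape, a vertical orientation and a color to each letter can be sketched as a simple lookup in Python. This is an illustrative data model only: the names `Glyph`, `GENOMIC_FONT`, `UNKNOWN` and `to_graphic` are hypothetical, no actual font file is produced, and the default chosen for unknown nucleotides is an assumption, since the text leaves it to user preference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    shape: str        # 'half-oval' or 'half-square'
    orientation: str  # 'up' or 'down'
    color: str


# Shapes, orientations and preferred colors as described for Fig. 31.
GENOMIC_FONT = {
    "A": Glyph("half-oval", "up", "red"),
    "T": Glyph("half-oval", "down", "blue"),
    "G": Glyph("half-square", "up", "brown"),
    "C": Glyph("half-square", "down", "green"),
}

# Hypothetical default for unknown nucleotides ('N' or '-').
UNKNOWN = Glyph("blank", "none", "grey")


def to_graphic(sequence: str, embed_letters: bool = True) -> list:
    """Map each letter to a (glyph, embedded letter or None) pair."""
    return [
        (GENOMIC_FONT.get(ch, UNKNOWN), ch if embed_letters else None)
        for ch in sequence
    ]


for glyph, letter in to_graphic("GATAA"):
    print(letter, glyph.shape, glyph.orientation, glyph.color)
```

In this model the shape distinguishes the A-T pair from the G-C pair, while the orientation distinguishes purines ('up') from pyrimidines ('down').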
  • Fig. 33 is a simplified illustration depicting an example of conversion of a typical alphanumeric genomic representation, of the type indicated by reference numeral 3110 of Fig. 31, into a typical graphic genomic representation, of the type indicated by reference numeral 3130 of Fig. 31.
  • Reference numeral 3300 designates an example of a short genomic sequence, conventionally represented by an alphanumeric string 'ACTTTTGATAATTATTGTAACTGTAAAAGAT'.
  • the short genomic sequence 3300 may be displayed using a genomic graphic representation as designated by reference numeral 3310, either employing genomic font with embedded letters 3140 of Fig. 31, as designated by reference numeral 3320 or employing genomic font without embedded letters 3150 of Fig. 31, as designated by reference numeral 3330.
  • it is easier to visually discern patterns in the genomic sequence when it is displayed as a genomic graphic representation 3310 than when it is displayed as genomic alphanumeric representation 3300.
  • the segment 'ATAATTAT', the 8th-15th characters in the string from its left end, surrounded by a broken-line border, may not immediately stand out as having any special significance.
  • a visual pattern is apparent: the first four characters in this segment, 'ATAA', are a vertical and horizontal mirror image of the last four characters of the segment, 'TTAT'.
  • a genomic sequence in which the second half of the sequence is a reversed-inversed sequence relative to the first half of the sequence, such as 'ATAATTAT' in the given example designated by reference numeral 3300, is known in the art as a 'hair-pin structure'.
  • 'Hair-pin' sequences, such as the sequence 'ATAATTAT', are genomic patterns which may indeed be biologically significant.
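The 'hair-pin' property described above, in which the second half of a segment is the reversed-inversed form of its first half, can also be checked programmatically. The following Python sketch uses hypothetical function names (`inverse_reverse`, `is_hairpin`, `find_hairpins`) and handles only the four letters A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def inverse_reverse(sequence: str) -> str:
    """Replace each letter with its counterpart and read right to left."""
    return sequence.translate(COMPLEMENT)[::-1]


def is_hairpin(segment: str) -> bool:
    """True if the second half is the inverse-reverse of the first half."""
    half, odd = divmod(len(segment), 2)
    return odd == 0 and half > 0 and segment[half:] == inverse_reverse(segment[:half])


def find_hairpins(sequence: str, length: int) -> list:
    """Return 1-based start positions of hair-pin segments of the given length."""
    return [
        i + 1
        for i in range(len(sequence) - length + 1)
        if is_hairpin(sequence[i:i + length])
    ]


sequence = "ACTTTTGATAATTATTGTAACTGTAAAAGAT"
print(is_hairpin("ATAATTAT"))       # True
print(find_hairpins(sequence, 8))   # [8]
```

For the example string, the only 8-character hair-pin segment starts at the 8th character, which is the segment 'ATAATTAT' discussed above.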
  • Fig. 34 is a simplified illustration of an example demonstrating an advantage of the graphic genomic representation 3130 of Fig. 31, in comparing a genomic motif sequence with the inverse-reverse thereof.
  • Genomic motifs are short genomic sequences, which may have a specific biological significance or action. Genomic motifs may be compared to 'words', insofar as a word is a combination of English letters and has a specific meaning, and a genomic motif is a combination of genomic nucleotides, and may have a specific action.
  • An example of a well known genomic motif is the genomic sequence 'GATAA'.
  • genomic sequence data is typically provided for a sequence of nucleotides on a positive strand of the DNA.
  • some segments of biologically significant genomic data are actually 'coded' on the negative, i.e. opposite, strand of the DNA.
  • To inverse-reverse means to read the sequence from right to left, and replace each A with a T, and each C with a G and vice-versa.
  • 'TTATC' is the inverse-reverse of the genomic motif 'GATAA'.
  • genomic motif 'GATAA' may appear in the genomic sequence either as 'GATAA', or as its inverse-reverse 'TTATC'.
  • the genomic graphic representation 3130 of Fig. 31 provides the user with enhanced ease of visually discerning genomic motifs from their inversed-reversed sequences, inasmuch as the inversed-reversed sequence presents a horizontal and vertical mirror image of the original motif. This is due to the fact that the complementary nucleotide pairs adenine-thymine and cytosine-guanine are graphically represented by complementary vertical halves of the same shape, as described with reference to Fig. 31.
  • Fig. 34 enables comparison between the genomic sequence 'GATAA', which is a well known genomic motif, and the genomic sequence 'TTATC', which is the inverse-reverse of this genomic motif. It is seen that the graphical representation of the inversed-reversed genomic motif 'TTATC', as designated by reference numeral 3440, presents a vertical and horizontal mirror image of the genomic motif 'GATAA', as designated by reference numeral 3450. This provides the user with enhanced ease of visually discerning the similarity of these motifs. The same is true for the graphic representation with embedded letters, as depicted by reference numerals 3420 and 3430 respectively.
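Since a motif may appear either directly or as its inverse-reverse, a textual search needs to look for both forms. The Python sketch below illustrates this; `inverse_reverse` and `find_motif` are hypothetical names, and the input string in the usage line is made up for illustration rather than taken from the figures.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def inverse_reverse(sequence: str) -> str:
    """Read the sequence right to left and swap A with T and C with G."""
    return sequence.translate(COMPLEMENT)[::-1]


def find_motif(sequence: str, motif: str) -> list:
    """Return (0-based position, strand) hits for the motif and its inverse-reverse.

    '+' marks the motif as given, '-' marks its inverse-reverse. A motif that
    equals its own inverse-reverse is reported once per strand.
    """
    hits = []
    for strand, pattern in (("+", motif), ("-", inverse_reverse(motif))):
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = sequence.find(pattern, start + 1)
    return sorted(hits)


print(inverse_reverse("GATAA"))                # TTATC
print(find_motif("GATAACCTTATC", "GATAA"))     # [(0, '+'), (7, '-')]
```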
  • Fig. 35 is a simplified illustration of an example demonstrating an advantage of graphic genomic representation 3130 of Fig. 31, in visually distinguishing adenine-thymine-rich sequences from cytosine-guanine-rich sequences.
  • genomic sequence "CCCGCTCCAGG", which is a GC-rich sequence
  • genomic sequence "TTTATTATCTA", which is an AT-rich sequence.
  • Reference numerals 3500 and 3510 respectively designate these sequences in standard alphanumeric form
  • reference numerals 3520 & 3530 and 3540 & 3550, respectively, designate these sequences graphically, with embedded letters and without embedded letters respectively.
  • genomic graphic representation 3130 of Fig. 31 provides the user with enhanced ease of visually discerning GC-rich sequences, depicted by reference numerals 3520 and 3540, in which the predominant shapes are squares, from AT-rich sequences, depicted by reference numerals 3530 and 3550, in which the predominant shapes are ovals. This may be particularly useful, since AT-rich sequences and GC-rich sequences may have different genomic significance, as is well known in the art.
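The distinction between AT-rich and GC-rich sequences that the graphic representation makes visible can be quantified as the fraction of adenine and thymine among the identified nucleotides. In the Python sketch below the function names and the 0.5 cutoff are assumptions chosen for illustration; the text does not define a numeric threshold.

```python
def at_fraction(sequence: str) -> float:
    """Fraction of A and T among the identified nucleotides (N and '-' are skipped)."""
    known = [ch for ch in sequence.upper() if ch in "ACGT"]
    if not known:
        raise ValueError("sequence contains no identified nucleotides")
    return sum(ch in "AT" for ch in known) / len(known)


def classify(sequence: str, cutoff: float = 0.5) -> str:
    """Label a sequence 'AT-rich' or 'GC-rich' using a hypothetical cutoff."""
    return "AT-rich" if at_fraction(sequence) > cutoff else "GC-rich"


for seq in ("CCCGCTCCAGG", "TTTATTATCTA"):
    print(seq, round(at_fraction(seq), 2), classify(seq))
```

For the two sequences of Fig. 35 the A/T fractions are 2/11 and 10/11 respectively, so they fall on opposite sides of any reasonable cutoff.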
  • Fig. 36 is a simplified illustration of an example which demonstrates an advantage of graphic genomic representation 3130 of Fig. 31, in visually distinguishing purine nucleotides from pyrimidine nucleotides.
  • meaningful genomic motifs often appear in a genome with slight variations, while still maintaining their biological function and significance.
  • a motif in which variations are known to happen is typically described in terms of a 'consensus-sequence' which is a description of the location and frequency of acceptable 'mistakes', notwithstanding which the biological function of the motif is maintained.
  • a consensus-sequence may be compared to an English word, for which several slightly different spellings may be considered acceptable, e.g. Haematology and Hematology.
  • the consensus sequence may be related to a biochemical type of nucleotides.
  • the consensus-sequence definition for the well known genomic motif 'GATA box' is WGATAR, where W stands for an adenine or thymine nucleotide, and R stands for a purine nucleotide: either adenine or guanine.
  • W stands for an adenine or thymine nucleotide
  • R stands for a purine nucleotide: either adenine or guanine.
  • the consensus-sequence in this example states that both 'AGATAA' and 'AGATAG' may have the same biological function, despite the difference in the last nucleotide, since both adenine and guanine are purine nucleotides.
  • the present invention provides the user with enhanced ease of visually discerning purine nucleotides from pyrimidine nucleotides, thereby making it easier to visually identify genomic consensus-sequence motifs in which the consensus-sequence definition contains a purine or a pyrimidine.
  • Fig. 36 provides an example of the two genomic sequences 'AGATAA' and 'AGATAG' mentioned above, both being variants of the same consensus-sequence motif WGATAR mentioned above.
  • Reference numeral 3600 designates a genomic alphanumeric representation of an adenine ending GATA box, 'AGATAA', and reference numeral 3610 designates a genomic graphic representation of a guanine ending GATA box, 'AGATAG'.
  • the purine nucleotide ending the GATA box, adenine in reference numeral 3600 and guanine in reference numeral 3610 is surrounded by a broken-line border.
  • Reference numerals 3640 and 3650 designate genomic graphic representations of these two variants of the WGATAR motif: 'AGATAA' and 'AGATAG' respectively. It is appreciated that the genomic graphic representation 3130 of Fig. 31 makes it easier to visually identify the similarity between these two variants of the same 'GATA box' consensus sequence, since adenine and guanine, which are both purine nucleotides, are graphically represented by upward oriented shapes: upward oriented half-oval 151 and upward oriented half-square 3153 respectively. The same is clearly true of the graphic representation with embedded letters, as depicted by reference numerals 3620 and 3630 respectively.
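The sequence properties discussed for Figs. 34 to 36 can also be checked programmatically. The following is a minimal Python sketch, not part of the original disclosure. It assumes that "inverse-reverse" means the reverse complement (consistent with 'GATAA' and 'TTATC'), and it uses a subset of the standard IUPAC nucleotide codes. The function names are illustrative.

```python
# Subset of the standard IUPAC nucleotide codes used in consensus sequences.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT",    # adenine or thymine
    "S": "CG",    # cytosine or guanine
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any nucleotide
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every nucleotide, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    """Fraction of nucleotides that are guanine or cytosine."""
    return sum(base in "GC" for base in seq) / len(seq)


def matches_consensus(seq: str, consensus: str) -> bool:
    """True if each nucleotide is allowed by the code at the same position."""
    return len(seq) == len(consensus) and all(
        base in IUPAC[code] for base, code in zip(seq, consensus)
    )


print(reverse_complement("GATAA"))
print(gc_fraction("CCCGCTCCAGG"), gc_fraction("TTTATTATCTA"))
print(matches_consensus("AGATAA", "WGATAR"), matches_consensus("AGATAG", "WGATAR"))
```

Under these assumptions both 'AGATAA' and 'AGATAG' satisfy WGATAR, while a variant ending in a pyrimidine, such as 'AGATAC', does not.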

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

A system and method for analysis of genomic sequence data utilizing genomic sequence similarity assessment in compressed data space, the method including: obtaining genomic data, preprocessing genomic data into preprocessed genomic data, compressing at least part of the preprocessed genomic data, storing compressed preprocessed genomic data, indexing compressed preprocessed genomic data, and analyzing genomic data, based at least in part on the indexing, the analyzing also including assessing genomic sequence similarity, based at least in part on the indexing.

Description

INTEGRATED SYSTEM AND METHOD FOR ANALYSIS OF GENOMIC SEQUENCE DATA
REFERENCE TO CO-PENDING APPLICATIONS
Applicant hereby claims priority of U.S. Provisional Patent Application Serial No. 60/329,110, filed on October 12, 2001, entitled "Integrated System and Method for Analysis of Genomic Sequence Data", U.S. Provisional Patent Application Serial No. 60/329,111, filed on October 12, 2001, entitled "System and Method for Storage Retrieval and Comparison in Compressed Data Space", U.S. Provisional Patent Application Serial No. 60/329,112, filed on October 12, 2001, entitled "System and Method for Storage and Retrieval of Genomic Sequence Data in Compressed Data Space", U.S. Provisional Patent Application Serial No. 60/329,114, filed on October 12, 2001, entitled "System and Method for Pattern Analysis of Genomic Sequence Data", U.S. Provisional Patent Application Serial No. 60/329,115, filed on October 12, 2001, entitled "System and Method for Genomic Sequence Similarity Comparison in Compressed Data Space", and U.S. Patent Application Serial No. 09/976,911, filed on October 12, 2001, entitled "System and Method for Graphically Representing Genomic Sequence Data".
FIELD OF THE INVENTION
The present invention relates to database storage and retrieval in general and more particularly to storage and retrieval in computers which utilize multi-phase bit technology and to analysis and representation of genomic sequence data in general, and to pattern analysis of genomic data in particular.
BACKGROUND OF THE INVENTION
Recent developments in genomics technology have enabled sequencing of genomes of multiple organisms on a massive scale, yielding growing volumes of genomic sequence data, stored digitally in computer databases. This information is currently typically stored and indexed, represented as standard alphanumeric data, in computer databases. Memory of computer systems has traditionally been based on a system which consists of bytes, where each byte comprises eight bits, and each bit has two phases: zero and one. Recent technological advancements make it possible, or are expected to make it possible in the near future, to create computers in which each bit may have more than two phases.
Alongside the ongoing progress in sequencing of genomes of multiple organisms, a major focus of increasing importance is the analysis of this genomic sequence data. Genomic sequence data is typically represented as alphanumeric strings.
The following US patents are believed to represent the state of the art: 6,226,412; 5,973,731; 5,781,773; 5,668,897; 5,966,712; 5,966,711; 5,853,989; and 5,811,235.
SUMMARY OF THE INVENTION
The present invention seeks to provide a system and method for analysis of genomic sequence data, and particularly pattern analysis of genomic motifs appearing in genomic sequence data and their possible functional significance.
The present invention comprises three sub-systems: a first sub-system and method for analysis of genomic data further described in co-pending U.S. Provisional Patent Application 60/329,114; a second sub-system and method for storage and retrieval of genomic data in compressed data space also described in co-pending U.S. Provisional Patent Application 60/329,112; and a third sub-system and method for genomic sequence similarity comparison in compressed data space also described in co-pending U.S. Provisional Patent Application 60/329,115.
These capabilities and preferably other capabilities are preferably provided using a computer software application or a computer database program.
There is therefore provided in accordance with a preferred embodiment of the present invention a method for analysis of genomic sequence data in compressed data space, the method including: obtaining genomic data, preprocessing genomic data into preprocessed genomic data, compressing at least part of the preprocessed genomic data, storing compressed preprocessed genomic data, indexing compressed preprocessed genomic data, and analyzing genomic data, based at least in part on the indexing.
Further in accordance with a preferred embodiment of the present invention the obtaining includes obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by the genomic sequence data, the preprocessing includes calculating and storing a plurality of genomic region sequences, based at least in part on the obtaining, and determining for each of the plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith, the compressing includes compressing each of the plurality of uncompressed short genomic segments contained in each of the plurality of genomic region sequences into one of a plurality of compressed short genomic segments, the storing includes storing the plurality of compressed short genomic segments, the indexing includes indexing the plurality of compressed short genomic segments, and the analyzing includes: receiving a user query containing at least one logical condition relating to at least one of the following: one of the genomic region sequences, and one of the uncompressed short genomic segments, and retrieving results to the user query, the retrieving including at least one of the following: retrieving one of the plurality of proteins, retrieving one of the plurality of genomic region sequences, and retrieving and decompressing one of the plurality of compressed short genomic segments, based at least in part on the indexing.
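The sequence of steps above (determining short segments, compressing, storing, indexing, and answering a query) may be illustrated by the following minimal Python sketch. It is an assumption-laden illustration and not the disclosed implementation: the 2-bit encoding, the in-memory dictionary used as an index, the region names, and all function names are hypothetical, and only the characters A, C, G and T are handled.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per nucleotide
BASES = "ACGT"


def compress(segment: str) -> int:
    """Pack a short segment into an integer, 2 bits per nucleotide."""
    value = 0
    for base in segment:
        value = (value << 2) | CODE[base]
    return value


def decompress(value: int, length: int) -> str:
    """Recover a segment of known length from its packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def build_index(regions: dict, k: int) -> dict:
    """Map each compressed segment of length k to the regions containing it."""
    index = defaultdict(set)
    for name, seq in regions.items():
        for i in range(len(seq) - k + 1):
            index[compress(seq[i:i + k])].add(name)
    return index


def regions_containing(index: dict, segment: str) -> set:
    """Answer a query by compressing the query segment and looking it up."""
    return index.get(compress(segment), set())


# Hypothetical region sequences keyed by hypothetical names.
regions = {
    "geneA_upstream": "TTGATAAGC",
    "geneB_upstream": "CCGATAACC",
    "geneC_upstream": "ACGTACGTA",
}
index = build_index(regions, k=5)
print(sorted(regions_containing(index, "GATAA")))
```

In this sketch the query never touches the uncompressed region sequences: the query segment is compressed and matched against the index keys directly.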
There is additionally provided in accordance with another preferred embodiment of the present invention a method for storage and retrieval of compressed genomic sequence data and similarity assessment of genomic sequence data in compressed data space, the method including: receiving uncompressed genomic sequence data, compressing the uncompressed genomic sequence data into compressed genomic sequence data, storing the compressed genomic sequence data, indexing the compressed genomic sequence data, retrieving at least part of the compressed genomic sequence data representing uncompressed genomic sequence data similar to an uncompressed genomic target sequence, based at least in part on the indexing, and decompressing the at least part of the compressed genomic sequence data.
Still further in accordance with a preferred embodiment of the present invention the retrieving includes: receiving a target genomic sequence, a first plurality of compressed genomic sequences, representing respectively in compressed form a first plurality of genomic sequences, and at least one similarity criterion, and producing a second plurality of compressed genomic sequences, representing respectively in compressed form a second plurality of genomic sequences, the second plurality of genomic sequences being a subset of the first plurality of genomic sequences, each of the second plurality of genomic sequences being similar to the target genomic sequence, according to the at least one similarity criterion.
There is further provided in accordance with another preferred embodiment of the present invention a method for analysis of genomic sequence data utilizing genomic sequence similarity assessment in compressed data space, the method including: obtaining genomic data, preprocessing genomic data into preprocessed genomic data, compressing at least part of the preprocessed genomic data, storing compressed preprocessed genomic data, indexing compressed preprocessed genomic data, and analyzing genomic data, based at least in part on the indexing, the analyzing also including assessing genomic sequence similarity, based at least in part on the indexing.
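One way to picture producing a similar subset without decompressing the stored sequences is sketched below in Python. The text does not specify a similarity criterion, so this sketch assumes a maximum number of mismatching positions between equal-length sequences; the 2-bit packing and all names are likewise assumptions.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per nucleotide


def pack(seq: str) -> int:
    """Pack a sequence into an integer, 2 bits per nucleotide."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(a: int, b: int, length: int) -> int:
    """Count differing positions by XOR-ing two packed sequences."""
    diff = a ^ b
    count = 0
    for _ in range(length):
        if diff & 3:  # a nonzero 2-bit group marks a mismatching position
            count += 1
        diff >>= 2
    return count


def similar_subset(target: str, compressed: list, length: int, max_mismatches: int) -> list:
    """Return the compressed sequences within max_mismatches of the target."""
    packed_target = pack(target)
    return [c for c in compressed if mismatches(packed_target, c, length) <= max_mismatches]


stored = [pack(s) for s in ["AGATAA", "AGATAG", "CCCGCT", "AGTTAA"]]
subset = similar_subset("AGATAA", stored, length=6, max_mismatches=1)
print(len(subset))
```

Only the target is compressed at query time; the stored values are compared in packed form, and the result is itself a list of compressed sequences that can be decompressed afterwards if required.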
Additionally in accordance with a preferred embodiment of the present invention the obtaining includes obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by the genomic sequence data, the preprocessing includes calculating and storing a plurality of genomic region sequences, based at least in part on the obtaining, and determining for each of the plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith, the compressing includes compressing each of the plurality of uncompressed short genomic segments contained in each of the plurality of genomic region sequences into one of a plurality of compressed short genomic segments, the storing includes storing the plurality of compressed short genomic segments, the indexing includes indexing the plurality of compressed short genomic segments, and the analyzing includes: receiving a user query containing at least one logical condition relating to one of the plurality of uncompressed short genomic segments and at least one similarity criterion, extracting a subset of the plurality of uncompressed short genomic segments, each of the subset being similar to the one of the uncompressed short genomic segments according to the at least one similarity criterion, and retrieving results to the user query, based at least in part on the extracting, the retrieving including at least one of the following: retrieving one of the plurality of proteins, retrieving one of the plurality of genomic region sequences, and retrieving and decompressing one of the plurality of compressed short genomic segments, based at least in part on the indexing.
Further in accordance with a preferred embodiment of the present invention the plurality of genomic region sequences includes a plurality of protein coding regions.
Still further in accordance with a preferred embodiment of the present invention each of the plurality of protein coding regions is normalized.
Additionally in accordance with a preferred embodiment of the present invention the plurality of genomic region sequences includes a plurality of regions adjacent to protein coding regions.
Moreover in accordance with a preferred embodiment of the present invention the plurality of regions adjacent to protein coding regions includes a plurality of regions upstream to protein coding regions.
Still further in accordance with a preferred embodiment of the present invention the plurality of regions adjacent to protein coding regions includes a plurality of regions downstream to protein coding regions.
Still further in accordance with a preferred embodiment of the present invention each of the plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of the plurality of protein coding regions adjacent thereto.
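The following Python sketch shows one possible reading of normalization according to coding direction: for a protein coded on the reverse strand, the coding region and its flanks are reverse-complemented, and the upstream and downstream flanks swap roles. The coordinate convention (0-based, half-open, on the forward strand), the strand labels, and the function names are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every nucleotide, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def normalized_regions(genome: str, start: int, end: int, strand: str, flank: int) -> dict:
    """Return upstream, coding and downstream regions in coding direction.

    start and end are 0-based, half-open forward-strand coordinates of the
    coding region; strand is '+' or '-'; flank is the flank length.
    """
    left = genome[max(0, start - flank):start]
    coding = genome[start:end]
    right = genome[end:end + flank]
    if strand == "+":
        return {"upstream": left, "coding": coding, "downstream": right}
    # Reverse strand: flip every region and swap the two flanks.
    return {
        "upstream": reverse_complement(right),
        "coding": reverse_complement(coding),
        "downstream": reverse_complement(left),
    }


genome = "TTGACATGAAATAGCCGG"  # hypothetical sequence
print(normalized_regions(genome, 5, 14, "+", 3))
print(normalized_regions(genome, 5, 14, "-", 3))
```

After this step, every region reads in the direction of transcription of its adjacent coding region, so a motif found upstream of one protein can be compared directly with one found upstream of another.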
Additionally in accordance with a preferred embodiment of the present invention the plurality of proteins known to be encoded by the genomic sequence data includes a majority of proteins known to be encoded by the genomic sequence data.
Moreover in accordance with a preferred embodiment of the present invention the plurality of short genomic segments contained in each of the plurality of genomic region sequences includes a majority of short genomic segments of a given length contained in each of the plurality of genomic region sequences.
Additionally in accordance with a preferred embodiment of the present invention the given length is user specified.
Still further in accordance with a preferred embodiment of the present invention genomic sequence data includes: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than the first organism.
Additionally in accordance with a preferred embodiment of the present invention the method also includes storing for each one of the plurality of proteins at least one of the following protein properties: an organism of expression, a tissue of expression and a function.
Moreover in accordance with a preferred embodiment of the present invention the at least one logical condition includes a degree of uniqueness of one of the plurality of short genomic sequences relative to at least one of the plurality of genomic sequence regions.
Moreover in accordance with a preferred embodiment of the present invention the at least one logical condition includes a degree of commonality of one of the plurality of short genomic sequences relative to at least two of the plurality of genomic sequence regions.
Still further in accordance with a preferred embodiment of the present invention the at least one logical condition includes exclusion of one of the plurality of short genomic sequences relative to at least one of the plurality of genomic sequence regions.
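The uniqueness, commonality and exclusion conditions can be pictured as set operations over the short segments of each region. The Python sketch below is a simplified illustration under that assumption; it works on uncompressed strings for readability, and the region names and function names are hypothetical.

```python
def segments(seq: str, k: int) -> set:
    """All short segments of length k contained in a region sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_to(name: str, regions: dict, k: int) -> set:
    """Segments found in the named region and in no other region."""
    others = set()
    for other, seq in regions.items():
        if other != name:
            others |= segments(seq, k)
    return segments(regions[name], k) - others


def common_to(names: list, regions: dict, k: int) -> set:
    """Segments found in every one of the named regions."""
    return set.intersection(*(segments(regions[n], k) for n in names))


def excluding(include: str, exclude: str, regions: dict, k: int) -> set:
    """Segments found in one region but absent from another."""
    return segments(regions[include], k) - segments(regions[exclude], k)


def commonality_degree(segment: str, regions: dict) -> float:
    """Fraction of regions that contain the segment."""
    return sum(segment in seq for seq in regions.values()) / len(regions)


regions = {"r1": "GATAAC", "r2": "TGATAA", "r3": "CCCGCT"}
print(common_to(["r1", "r2"], regions, 5))
print(unique_to("r1", regions, 5))
print(excluding("r2", "r1", regions, 5))
print(commonality_degree("GATAA", regions))
```

A "degree" of uniqueness or commonality is modelled here as a simple fraction of regions; other measures are equally compatible with the text.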
Additionally in accordance with a preferred embodiment of the present invention the method also includes: storing, based on user input, a plurality of criteria, determining and marking, each of the plurality of short genomic segments which complies with each one of the criteria, and the user query is based at least in part on at least one of the plurality of criteria.
Moreover in accordance with a preferred embodiment of the present invention each of the plurality of criteria includes at least one of the at least one logical condition.
Further in accordance with a preferred embodiment of the present invention the determining and storing also includes determining and storing a relationship between at least two of the plurality of short genomic segments, and the logical condition references the relationship.
Still further in accordance with a preferred embodiment of the present invention the relationship also includes a relation between a location of a first one of the plurality of short genomic sequences relative to one of the plurality of genomic region sequences, and a second one of the plurality of short genomic sequences relative to the one of the plurality of genomic region sequences.
Additionally in accordance with a preferred embodiment of the present invention the relationship also includes a similarity between a first one of the plurality of short genomic sequences and a second one of the plurality of short genomic sequences.
Further in accordance with a preferred embodiment of the present invention the retrieving includes: receiving a query, the query including a query condition and uncompressed query data to which the query condition relates, compressing the uncompressed query data into compressed query data, and extracting the at least part of the compressed genomic sequence data, based at least in part on the compressed query data.
Still further in accordance with a preferred embodiment of the present invention the retrieving does not require storing the uncompressed genomic sequence data.
Additionally in accordance with a preferred embodiment of the present invention the retrieving does not require accessing the uncompressed genomic sequence data.
Moreover in accordance with a preferred embodiment of the present invention the retrieving does not require retrieving the uncompressed genomic sequence data.
Further in accordance with a preferred embodiment of the present invention the retrieving includes sorting the uncompressed genomic sequence data, based at least in part on the indexing.
Still further in accordance with a preferred embodiment of the present invention the sorting is alphabetical sorting.
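A compressed representation can support alphabetical sorting and query lookup directly if the encoding preserves order. The Python sketch below illustrates this under stated assumptions: sequences have equal length, contain only A, C, G and T, and are packed at 2 bits per nucleotide with codes assigned in alphabetical order. The function names are illustrative.

```python
import bisect

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # codes follow alphabetical order


def pack(seq: str) -> int:
    """Pack a sequence into an integer, 2 bits per nucleotide."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def contains(packed_sorted: list, query: str) -> bool:
    """Compress the query, then binary-search the sorted compressed values."""
    key = pack(query)
    i = bisect.bisect_left(packed_sorted, key)
    return i < len(packed_sorted) and packed_sorted[i] == key


segments = ["TTATC", "GATAA", "CCCGC", "AGATA"]
packed_sorted = sorted(pack(s) for s in segments)  # sorting in compressed space

# For equal-length sequences, numeric order of the packed values
# coincides with alphabetical order of the original strings.
print(sorted(segments, key=pack) == sorted(segments))
print(contains(packed_sorted, "GATAA"), contains(packed_sorted, "GATAC"))
```

Because the lookup compresses the query rather than decompressing the stored data, the uncompressed sequences need not be stored, accessed or retrieved to answer it.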
Additionally in accordance with a preferred embodiment of the present invention the uncompressed genomic sequence data includes a plurality of uncompressed strings, and the compressed genomic sequence data includes a plurality of compressed strings, each of the plurality of uncompressed strings being compressed into a single corresponding one of the plurality of compressed strings.
Moreover in accordance with a preferred embodiment of the present invention each of the plurality of uncompressed strings is an alphanumeric string representing a genomic sequence, each alphanumeric string includes a plurality of characters, and each of the plurality of characters represents one of the following items: a nucleotide in the genomic sequence, and an unknown nucleotide in the genomic sequence.
Further in accordance with a preferred embodiment of the present invention each of the plurality of uncompressed strings includes a plurality of uncompressed characters, and each of the plurality of compressed strings includes a plurality of compressed characters, at least two of the plurality of uncompressed characters being compressed into one of the plurality of compressed characters.
Still further in accordance with a preferred embodiment of the present invention each one of the plurality of uncompressed characters is compressed into one of the plurality of compressed characters.
Additionally in accordance with a preferred embodiment of the present invention the at least two of the plurality of uncompressed characters includes at least three of the plurality of uncompressed characters.
Moreover in accordance with a preferred embodiment of the present invention the at least two of the plurality of uncompressed characters includes at least four of the plurality of uncompressed characters.
Further in accordance with a preferred embodiment of the present invention at least three of the plurality of uncompressed characters are compressed into each one of a majority of the plurality of compressed characters.
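Compressing several uncompressed characters into one compressed character can be sketched as follows in Python. This illustration assumes a five-symbol alphabet (A, C, G, T and N for an unknown nucleotide) and an 8-bit compressed character; since 5 x 5 x 5 = 125 does not exceed 256, three symbols fit in one byte. With a four-symbol alphabet, four symbols would fit at 2 bits each. The alphabet order, padding rule and function names are assumptions.

```python
ALPHABET = "ACGNT"  # assumed alphabet; N denotes an unknown nucleotide
GROUP = 3           # 5 ** 3 = 125 <= 256, so three symbols fit in one byte


def compress(seq: str) -> bytes:
    """Encode each group of three symbols as one base-5 number in a byte."""
    out = bytearray()
    for i in range(0, len(seq), GROUP):
        chunk = seq[i:i + GROUP]
        value = 0
        for ch in chunk:
            value = value * 5 + ALPHABET.index(ch)
        # Pad a short final group with zero digits on the right.
        value *= 5 ** (GROUP - len(chunk))
        out.append(value)
    return bytes(out)


def decompress(data: bytes, length: int) -> str:
    """Decode each byte into three symbols, then trim to the original length."""
    chars = []
    for value in data:
        chunk = []
        for _ in range(GROUP):
            value, digit = divmod(value, 5)
            chunk.append(ALPHABET[digit])
        chars.extend(reversed(chunk))
    return "".join(chars[:length])


packed = compress("GATAANC")
print(len(packed), decompress(packed, 7))
```

The original length is kept alongside the compressed string so that padding in the final group can be discarded on decompression.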
Still further in accordance with a preferred embodiment of the present invention the plurality of compressed strings is stored in a field, the field being part of a table and the table being part of a database.
Additionally in accordance with a preferred embodiment of the present invention the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing are performed internally by the database.
Moreover in accordance with a preferred embodiment of the present invention the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing, do not require a program external to the database.
Further in accordance with a preferred embodiment of the present invention the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing, do not require programming.
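Storing compressed strings in an indexed field of a database table may be pictured with the Python sketch below, which uses the standard sqlite3 module. It departs from the text in one respect that should be noted: the compression routine is a Python function registered with the database, standing in for a routine that the text describes as internal to the database. The table name, column names and sample rows are hypothetical, and the integer column limits this sketch to segments of at most 31 nucleotides.

```python
import sqlite3

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per nucleotide


def pack(seq: str) -> int:
    """Pack a sequence into an integer, 2 bits per nucleotide."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


conn = sqlite3.connect(":memory:")
# Register the compressor so that SQL statements can call pack(...).
conn.create_function("pack", 1, pack)

conn.execute("CREATE TABLE segments (region TEXT, packed INTEGER)")
conn.execute("CREATE INDEX idx_packed ON segments (packed)")

rows = [("geneA", "GATAA"), ("geneB", "GATAA"), ("geneC", "CCCGC")]
# Compression happens inside the INSERT statement itself.
conn.executemany("INSERT INTO segments (region, packed) VALUES (?, pack(?))", rows)

# The query compresses its argument and matches on the indexed compressed field.
hits = conn.execute(
    "SELECT region FROM segments WHERE packed = pack(?) ORDER BY region",
    ("GATAA",),
).fetchall()
print(hits)
```

From the point of view of the caller, both storing and querying are expressed purely as SQL statements over the compressed field.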
Still further in accordance with a preferred embodiment of the present invention each of the plurality of uncompressed strings is an alphanumeric string, including a plurality of alphanumeric characters.
Further in accordance with a preferred embodiment of the present invention the determining does not require comparing the first genomic sequence with the second genomic sequence.
Still further in accordance with a preferred embodiment of the present invention the determining does not include any of the following: decompressing the first compressed genomic sequence, and decompressing the second compressed genomic sequence.
Additionally in accordance with a preferred embodiment of the present invention the method also includes decompressing each of the second plurality of compressed genomic sequences.
Moreover in accordance with a preferred embodiment of the present invention the producing does not require comparing the genomic sequence with any of the first plurality of genomic sequences.
Further in accordance with a preferred embodiment of the present invention the producing does not require decompressing any of the first plurality of compressed genomic sequences.
Still further in accordance with a preferred embodiment of the present invention functionality of the genomic data analyzer does not require comparing the first genomic sequence with the second genomic sequence.
Additionally in accordance with a preferred embodiment of the present invention functionality of the genomic data analyzer does not require any of the following: decompressing the first compressed genomic sequence, and decompressing the second compressed genomic sequence.
Moreover in accordance with a preferred embodiment of the present invention the genomic compressed sequence similarity assessment system also includes a genomic decompressor operative to decompress each of the second plurality of compressed genomic sequences.
Further in accordance with a preferred embodiment of the present invention functionality of the genomic data extractor does not require comparing the genomic sequence with any of the first plurality of genomic sequences.
Still further in accordance with a preferred embodiment of the present invention functionality of the genomic data extractor does not require decompressing any of the first plurality of compressed genomic sequences.
The present invention also seeks to provide an improved method for storage, sorting and retrieval of data in a database. In various preferred embodiments, the present invention seeks to provide the capability to store, index, and retrieve multiple alphanumeric strings, in compressed form, in a database and to assess string similarity of strings in their compressed form. These capabilities and preferably other capabilities are preferably provided using a computer software application or a computer database program.
There is thus provided in accordance with a preferred embodiment of the present invention a method for storage and retrieval of compressed data and similarity assessment of data in compressed data space, the method including: receiving uncompressed data, compressing the uncompressed data into compressed data, storing the compressed data, indexing the compressed data, retrieving at least part of the compressed data representing uncompressed data similar to an uncompressed target data item, based at least in part on the indexing, and decompressing the at least part of the compressed data.
Still further in accordance with a preferred embodiment of the present invention the retrieving includes: receiving a target string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and producing a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of strings being a subset of the first plurality of strings, each of the second plurality of strings being similar to the target string, according to the at least one similarity criterion.
There is thus provided in accordance with a preferred embodiment of the present invention a method for storage and retrieval of compressed data, the method including: receiving uncompressed data, compressing the uncompressed data into compressed data, storing the compressed data, indexing the compressed data, retrieving at least part of the compressed data, based at least in part on the indexing, and decompressing the at least part of the compressed data.
There is further provided in accordance with another preferred embodiment of the present invention a method for comparing compressed strings, the method including: receiving two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, comparing the first compressed string with the second compressed string, and determining degree of similarity between the first string and the second string, based at least in part on the comparing.
There is still further provided in accordance with another preferred embodiment of the present invention a method for assessing similarity of strings, the method including: receiving the following items: a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and producing a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of strings being a subset of the first plurality of strings, each of the second plurality of strings being similar to the string, according to the at least one similarity criterion.
There is additionally provided in accordance with another preferred embodiment of the present invention a compressed data storage and retrieval system including: a data compressor operative to receive uncompressed data and to compress the uncompressed data into compressed data, a compressed data indexer operative to store the compressed data and to index the compressed data, and a data extractor employing the compressed data indexer, and operative to retrieve at least part of the compressed data and to decompress the at least part of the compressed data.
There is moreover provided in accordance with another preferred embodiment of the present invention a compressed string comparison system including: a compressed string evaluator operative to receive two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, and to compare the first compressed string with the second compressed string, and a compressed string analyzer employing the compressed string evaluator, and operative to determine degree of similarity between the first string and the second string.
There is further provided in accordance with another preferred embodiment of the present invention a compressed string similarity assessment system including: a compressed string evaluator operative to receive a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and a compressed string extractor operative to produce a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of strings being a subset of the first plurality of strings, each of the second plurality of strings being similar to the string, according to the at least one similarity criterion.
There is still further provided in accordance with another preferred embodiment of the present invention a computer-readable medium including a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving uncompressed data, compressing the uncompressed data into compressed data, storing the compressed data, indexing the compressed data, retrieving at least part of the compressed data, based at least in part on the indexing, and decompressing the at least part of the compressed data.
There is additionally provided in accordance with another preferred embodiment of the present invention a computer-readable medium including a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, the second string being different from the first string, comparing the first compressed string with the second compressed string, and determining degree of similarity between the first string and the second string, based at least in part on the comparing.
There is moreover provided in accordance with another preferred embodiment of the present invention a computer-readable medium including a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving the following items: a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and producing a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of strings being a subset of the first plurality of strings, each of the second plurality of strings being similar to the string, according to the at least one similarity criterion.
Further in accordance with a preferred embodiment of the present invention the retrieving includes: receiving a query, the query including a query condition and uncompressed query data to which the query condition relates, compressing the uncompressed query data into compressed query data, and extracting the at least part of the compressed data, based at least in part on the compressed query data.
Still further in accordance with a preferred embodiment of the present invention the retrieving does not require storing the uncompressed data.
Additionally in accordance with a preferred embodiment of the present invention the retrieving does not require accessing the uncompressed data.
Moreover in accordance with a preferred embodiment of the present invention the retrieving does not require retrieving the uncompressed data.
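The query-driven retrieval described above can be illustrated with a minimal sketch. It assumes a hypothetical fixed 2-bit code per nucleotide and a plain in-memory dictionary standing in for the index; the names `compress`, `build_index` and `query_equal` are illustrative only and are not taken from the specification. The point shown is that the query data is compressed first, and the lookup then touches only compressed keys.

```python
# Hypothetical 2-bit code per nucleotide (an assumption for illustration).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def compress(seq):
    """Return a compressed key: (length, integer holding 2 bits per nucleotide)."""
    value = 0
    for ch in seq:
        value = (value << 2) | CODE[ch]
    return len(seq), value

def build_index(sequences):
    """Map each compressed key to the record numbers that hold it.

    Only compressed keys are kept; the uncompressed strings are not stored.
    """
    index = {}
    for record_id, seq in enumerate(sequences):
        index.setdefault(compress(seq), []).append(record_id)
    return index

def query_equal(index, uncompressed_query):
    """Compress the query data, then look it up among the compressed keys."""
    return index.get(compress(uncompressed_query), [])

index = build_index(["ACGT", "GGCA", "ACGT", "TTTT"])
print(query_equal(index, "ACGT"))  # record numbers whose sequence equals the query
```

Only an equality condition is sketched here; other query conditions would need a different lookup.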
Further in accordance with a preferred embodiment of the present invention the retrieving includes sorting the uncompressed data, based at least in part on the indexing.
Still further in accordance with a preferred embodiment of the present invention the sorting is alphabetical sorting.
Additionally in accordance with a preferred embodiment of the present invention: the uncompressed data includes a plurality of uncompressed strings, and the compressed data includes a plurality of compressed strings, each of the plurality of uncompressed strings being compressed into a single corresponding one of the plurality of compressed strings.
Moreover in accordance with a preferred embodiment of the present invention: each of the plurality of uncompressed strings is an alphanumeric string, including a plurality of alphanumeric characters.
Further in accordance with a preferred embodiment of the present invention: each of the plurality of uncompressed strings includes a plurality of uncompressed characters, and each of the plurality of compressed strings includes a plurality of compressed characters, at least two of the plurality of uncompressed characters being compressed into one of the plurality of compressed characters.
Still further in accordance with a preferred embodiment of the present invention each one of the plurality of uncompressed characters is compressed into one of the plurality of compressed characters.
Additionally in accordance with a preferred embodiment of the present invention the at least two of the plurality of uncompressed characters includes at least three of the plurality of uncompressed characters.
Moreover in accordance with a preferred embodiment of the present invention the at least two of the plurality of uncompressed characters includes at least four of the plurality of uncompressed characters.
Additionally in accordance with a preferred embodiment of the present invention at least three of the plurality of uncompressed characters are compressed into each one of a majority of the plurality of compressed characters.
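As a minimal sketch of several uncompressed characters being compressed into one compressed character, the following packs four nucleotides into each byte using a fixed 2-bit code. This is a simplifying assumption for illustration: it does not reproduce the header-bit layout of Figs. 3A to 4C, and the leading padding byte is a hypothetical convention introduced here so that the round trip is exact.

```python
# Hypothetical fixed 2-bit code; not the header-bit layout of Figs. 3A-4C.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def compress(seq):
    """Pack four nucleotides per byte; byte 0 records how many pad slots were added."""
    pad = (-len(seq)) % 4
    padded = seq + "A" * pad
    out = bytearray([pad])
    for i in range(0, len(padded), 4):
        byte = 0
        for ch in padded[i:i + 4]:
            byte = (byte << 2) | CODE[ch]
        out.append(byte)
    return bytes(out)

def decompress(data):
    """Reverse compress(): unpack every byte, then drop the pad slots."""
    pad = data[0]
    chars = []
    for byte in data[1:]:
        for shift in (6, 4, 2, 0):
            chars.append(BASES[(byte >> shift) & 3])
    return "".join(chars[:len(chars) - pad])

packed = compress("ACGTAC")
print(list(packed), decompress(packed))
```

With this layout every data byte except possibly the last carries four nucleotides, which is one way in which at least three uncompressed characters can be compressed into each of a majority of the compressed characters.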
Moreover in accordance with a preferred embodiment of the present invention the plurality of compressed strings is stored in a field, the field being part of a table and the table being part of a database.
Further in accordance with a preferred embodiment of the present invention the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing are performed internally by the database.
Still further in accordance with a preferred embodiment of the present invention the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing, do not require a program external to the database.
Additionally in accordance with a preferred embodiment of the present invention the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing, do not require programming.
Moreover in accordance with a preferred embodiment of the present invention each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
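The effect of a bit that stores one of more than two possible values can be seen from a simple counting argument, sketched below. The function name and its parameters are hypothetical, and the sketch computes only an upper bound on how many four-letter symbols fit in eight such positions; it makes no claim about how the values are actually allocated in the embodiment of Fig. 30.

```python
def max_symbols_per_byte(values_per_bit, alphabet_size=4, positions=8):
    """Largest n such that alphabet_size**n distinct strings fit in the byte's states."""
    states = values_per_bit ** positions
    count = 0
    while alphabet_size ** (count + 1) <= states:
        count += 1
    return count

print(max_symbols_per_byte(2))  # ordinary two-valued bits: 2**8 = 256 states
print(max_symbols_per_byte(3))  # three-valued bits: 3**8 = 6561 states
```

With two-valued bits the bound is four nucleotides per byte, and with three-valued bits it rises to six, since 4**6 = 4096 does not exceed 6561 while 4**7 = 16384 does.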
Further in accordance with a preferred embodiment of the present invention the determining does not include comparing the first string with the second string.
Still further in accordance with a preferred embodiment of the present invention the determining does not include any of the following: decompressing the first compressed string, and decompressing the second compressed string.
Additionally in accordance with a preferred embodiment of the present invention the first string and the second string are alphanumeric strings.
Moreover in accordance with a preferred embodiment of the present invention each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
Further in accordance with a preferred embodiment of the present invention the method also includes decompressing each of the second plurality of compressed strings.
Still further in accordance with a preferred embodiment of the present invention the producing does not require comparing the string with any of the first plurality of strings.
Additionally in accordance with a preferred embodiment of the present invention the producing does not require decompressing any of the first plurality of compressed strings.
Moreover in accordance with a preferred embodiment of the present invention each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
Further in accordance with a preferred embodiment of the present invention the data extractor provides the following functionality: receiving a query including a query condition and uncompressed query data to which the query condition relates, compressing the uncompressed query data into compressed query data, and extracting the at least part of the compressed data, based at least in part on the compressed query data.
Still further in accordance with a preferred embodiment of the present invention the functionality of the data extractor does not require storing the uncompressed data.
Additionally in accordance with a preferred embodiment of the present invention the functionality of the data extractor does not require accessing the uncompressed data.
Moreover in accordance with a preferred embodiment of the present invention the functionality of the data extractor does not require retrieving the uncompressed data.
Further in accordance with a preferred embodiment of the present invention the data extractor employing the compressed data indexer is operative to sort the uncompressed data.
Still further in accordance with a preferred embodiment of the present invention the data extractor employing the compressed data indexer is operative to alphabetically sort the uncompressed data.
Additionally in accordance with a preferred embodiment of the present invention: the uncompressed data includes a plurality of uncompressed strings, and the compressed data includes a plurality of compressed strings, each of the plurality of uncompressed strings being compressed into a single corresponding one of the plurality of compressed strings.
Moreover in accordance with a preferred embodiment of the present invention each of the plurality of uncompressed strings is an alphanumeric string including a plurality of alphanumeric characters.
Further in accordance with a preferred embodiment of the present invention each of the plurality of uncompressed strings includes a plurality of uncompressed characters, each of the plurality of compressed strings includes a plurality of compressed characters, at least two of the plurality of uncompressed characters being compressed into one of the plurality of compressed characters.
Still further in accordance with a preferred embodiment of the present invention each one of the plurality of uncompressed characters is compressed into one of the plurality of compressed characters.
Additionally in accordance with a preferred embodiment of the present invention the at least two of the plurality of uncompressed characters includes at least three of the plurality of uncompressed characters.
Moreover in accordance with a preferred embodiment of the present invention the at least two of the plurality of uncompressed characters includes at least four of the plurality of uncompressed characters.
Further in accordance with a preferred embodiment of the present invention at least three of the plurality of uncompressed characters are compressed into each one of a majority of the plurality of compressed characters.
Still further in accordance with a preferred embodiment of the present invention the plurality of compressed strings is stored in a field, the field being part of a table and the table being part of a database.
Additionally in accordance with a preferred embodiment of the present invention functionality of the data compressor, the compressed data indexer, and the data extractor is performed internally by a database.
Moreover in accordance with a preferred embodiment of the present invention functionality of the data compressor, the compressed data indexer, and the data extractor does not require a program external to the database.
Further in accordance with a preferred embodiment of the present invention functionality of the data compressor, the compressed data indexer, and the data extractor does not require programming.
Still further in accordance with a preferred embodiment of the present invention each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
Moreover in accordance with a preferred embodiment of the present invention functionality of the compressed string analyzer does not require comparing the first string with the second string.
Additionally in accordance with a preferred embodiment of the present invention functionality of the compressed string analyzer does not require any of the following: decompressing the first compressed string, and decompressing the second compressed string.
Further in accordance with a preferred embodiment of the present invention the first string and the second string are alphanumeric strings.
Still further in accordance with a preferred embodiment of the present invention each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
Additionally in accordance with a preferred embodiment of the present invention the compressed string similarity assessment system also includes a compressed string decompressor operative to decompress each of the second plurality of compressed strings.
Moreover in accordance with a preferred embodiment of the present invention functionality of the compressed string extractor does not require comparing the string with any of the first plurality of strings.
Further in accordance with a preferred embodiment of the present invention functionality of the compressed string extractor does not require decompressing any of the first plurality of compressed strings.
Still further in accordance with a preferred embodiment of the present invention each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
The present invention also seeks to provide an improved method for presentation of genomic sequence data. In various preferred embodiments, the present invention seeks to increase the ease with which genomic motifs and their inverse-reversed sequences may be visually distinguished from each other.
Preferably, the present invention enhances the ease with which a viewer can visually distinguish purine nucleotides from pyrimidine nucleotides and can visually distinguish one set of complementary nucleotides, i.e. adenine-thymine, from another set of complementary nucleotides, i.e. guanine-cytosine. These and other enhanced visual distinctions are preferably provided by employing a novel type of genomic computer font. Different colors may also be applied to different nucleotides.
There is thus provided in accordance with a preferred embodiment of the present invention a method for displaying genomic sequence data, the method including receiving an alphanumeric string representing genomic sequence data, the alphanumeric string including a plurality of characters, each of the characters representing a nucleotide in the genomic sequence; and expressing the alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, the second genomic attribute being different from the first genomic attribute.
There is further provided in accordance with another preferred embodiment of the present invention a method for graphically displaying genomic sequence information, the method including: receiving a first alphanumeric string representing a first genomic sequence, and a second alphanumeric string representing a second genomic sequence, the second genomic sequence being a reversed-inversed genomic sequence of the first genomic sequence; and graphically displaying the first alphanumeric string and the second alphanumeric string, such that a graphical display of the second alphanumeric string is a horizontal and vertical mirror image of a graphical display of the first alphanumeric string.
There is still further provided in accordance with another preferred embodiment of the present invention a genomic display system comprising: a receiving apparatus operative to receive an alphanumeric string representing genomic sequence data, said alphanumeric string comprising a plurality of characters, each of said characters representing a nucleotide in said genomic sequence; and an expressing apparatus operative to express said alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute.
There is additionally provided in accordance with another preferred embodiment of the present invention a system for graphically displaying genomic sequence information, the system comprising: a genomic sequence expressor, receiving a first alphanumeric string representing a first genomic sequence and a second alphanumeric string representing a second genomic sequence, said second genomic sequence being a reversed-inversed genomic sequence of said first genomic sequence; and expressing said first alphanumeric string and said second alphanumeric string, such that a graphical display of said second alphanumeric string is a horizontal and vertical mirror image of a graphical display of said first alphanumeric string; and a display operative to receive an output from said genomic sequence expressor and to provide a visually sensible display of an expression of said graphical display of said first alphanumeric string and said graphical display of said second alphanumeric string.
There is also provided in accordance with another preferred embodiment of the present invention a computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving an alphanumeric string representing genomic sequence data, said alphanumeric string comprising a plurality of characters, each of said characters representing a nucleotide in said genomic sequence; and expressing said alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute.
In accordance with a preferred embodiment of the present invention, the first plurality of nucleotides are represented by at least one first representing attribute, and the second plurality of nucleotides are represented by at least one second representing attribute, the second representing attribute being different from the first representing attribute.
There is further provided in accordance with another preferred embodiment of the present invention a computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving a first alphanumeric string representing a first genomic sequence and a second alphanumeric string representing a second genomic sequence, said second genomic sequence being a reversed-inversed genomic sequence of said first genomic sequence; and graphically displaying said first alphanumeric string and said second alphanumeric string, such that a graphical display of said second alphanumeric string is a horizontal and vertical mirror image of a graphical display of said first alphanumeric string.
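The mirror-image property described above can be illustrated with a toy sketch. It assumes a hypothetical glyph assignment in which the glyph of each nucleotide, turned through 180 degrees (that is, mirrored both horizontally and vertically), becomes the glyph of its complement; the letters b, q, d and p are used here only because they have this property and are not the characters of the genomic font of the invention.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hypothetical glyphs: rotating "b" by 180 degrees gives "q", and "d" gives "p".
GLYPH = {"A": "b", "T": "q", "G": "d", "C": "p"}
ROTATED = {"b": "q", "q": "b", "d": "p", "p": "d"}

def reverse_complement(seq):
    """Reversed-inversed sequence: complement every nucleotide, then reverse."""
    return "".join(COMPLEMENT[ch] for ch in reversed(seq))

def render(seq):
    """Express a sequence as a row of glyphs."""
    return "".join(GLYPH[ch] for ch in seq)

def rotate_180(row):
    """Mirror a glyph row horizontally and vertically."""
    return "".join(ROTATED[g] for g in reversed(row))

motif = "AACG"
print(render(motif), render(reverse_complement(motif)), rotate_180(render(motif)))
```

Under this assumed glyph assignment, rendering the reversed-inversed sequence gives the same row as rotating the rendering of the original sequence, which is what makes a motif and its inverse-reversed counterpart easy to match by eye.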
Further in accordance with a preferred embodiment of the present invention the representation comprises a human sensible representation.
Still further in accordance with a preferred embodiment of the present invention the at least one first representing attribute and the at least one second representing attribute are graphical attributes.
Additionally in accordance with a preferred embodiment of the present invention the graphical attributes are shapes.
Moreover in accordance with a preferred embodiment of the present invention the graphical attributes are positions.
Further in accordance with a preferred embodiment of the present invention the positions are vertical positions.
Still further in accordance with a preferred embodiment of the present invention the graphical attributes are orientations.
Additionally in accordance with a preferred embodiment of the present invention the orientations are vertical orientations.
Moreover in accordance with a preferred embodiment of the present invention the graphical attributes are colors.
Further in accordance with a preferred embodiment of the present invention using the representation also includes representing each of four nucleotides: adenine, thymine, cytosine, and guanine, by a different color.
Still further in accordance with a preferred embodiment of the present invention the human sensible representation includes one of the following: a shape with a letter and a shape without a letter.
Additionally in accordance with a preferred embodiment of the present invention the human sensible representation is produced using a computer font.
Moreover in accordance with a preferred embodiment of the present invention the computer font is a TRUETYPE® font.
Further in accordance with a preferred embodiment of the present invention the representation comprises a machine sensible representation.
Still further in accordance with a preferred embodiment of the present invention the at least one first representing attribute and the at least one second representing attribute are machine sensible attributes.
Additionally in accordance with a preferred embodiment of the present invention the first plurality of nucleotides are purine nucleotides, and the second plurality of nucleotides are pyrimidine nucleotides.
Moreover in accordance with a preferred embodiment of the present invention the first plurality of nucleotides consists of adenine and thymine nucleotides, and the second plurality of nucleotides consists of guanine and cytosine nucleotides.
Further in accordance with a preferred embodiment of the present invention the representation also distinguishes a third plurality of nucleotides, sharing in common a third genomic attribute, from a fourth plurality of nucleotides, sharing in common a fourth genomic attribute, the fourth genomic attribute being different from the third genomic attribute.
Still further in accordance with a preferred embodiment of the present invention the third plurality of nucleotides are represented by at least one third representing attribute, and the fourth plurality of nucleotides are represented by at least one fourth representing attribute, the at least one third representing attribute being different from the at least one fourth representing attribute.
Additionally in accordance with a preferred embodiment of the present invention the first plurality of nucleotides are purine nucleotides, the second plurality of nucleotides are pyrimidine nucleotides, the third plurality of nucleotides are adenine and thymine nucleotides, and the fourth plurality of nucleotides are guanine and cytosine nucleotides.
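A minimal sketch of a representation that distinguishes both pairs of groups at once is given below. It assumes, purely for illustration, that vertical position encodes purine versus pyrimidine and that shape encodes adenine-thymine versus guanine-cytosine; the two-row text output and the characters "#" and "o" are hypothetical stand-ins for the graphical attributes named above.

```python
PURINES = {"A", "G"}            # pyrimidines are C and T
ADENINE_THYMINE = {"A", "T"}    # the other complementary pair is G and C

def render(seq):
    """Return two text rows: purines on the upper row, pyrimidines on the lower.

    Shape "#" marks adenine or thymine, shape "o" marks guanine or cytosine.
    """
    upper, lower = [], []
    for ch in seq:
        shape = "#" if ch in ADENINE_THYMINE else "o"
        if ch in PURINES:
            upper.append(shape)
            lower.append(" ")
        else:
            upper.append(" ")
            lower.append(shape)
    return "".join(upper) + "\n" + "".join(lower)

print(render("AGCT"))
```

Each nucleotide is then identified uniquely by the combination of row and shape, so no letter is needed to tell the four nucleotides apart.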
Moreover in accordance with a preferred embodiment of the present invention the method also includes expressing the first alphanumeric string and the second alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, the second genomic attribute being different from the first genomic attribute.
Further in accordance with a preferred embodiment of the present invention the genomic sequence expressor is also operative to receive an alphanumeric string which represents genomic sequence data, the alphanumeric string including a plurality of characters, each of the plurality of characters representing a nucleotide in the genomic sequence, and to express the alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, the second genomic attribute being different from the first genomic attribute, and the display is also operative to receive an output from the expressor and to display the genomic sequence using the representation.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:
Fig. 1 is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 2 is a simplified block diagram illustrating a genomic data compression mechanism, which is a preferred implementation of a compression mechanism and a decompression mechanism constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 3A is a simplified illustration of a preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention;
Fig. 3B is a simplified illustration of an alternative preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention;
Fig. 4A is a table illustrating preferred values assignable to 'header' bits in accordance with a preferred embodiment of the present invention;
Fig. 4B is a table illustrating preferred values assignable to 'nucleotide-representing' bits in accordance with a preferred embodiment of the present invention;
Fig. 4C is a table illustrating preferred values assignable to bits when encoding one or more uncommon characters;
Fig. 5 is a simplified flowchart illustrating operation of a genomic data compression engine constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 6 is a simplified flowchart illustrating a methodology for generating translation tables used in accordance with a preferred embodiment of the present invention;
Fig. 7A is a simplified illustration of a compression table employed in accordance with a preferred embodiment of the present invention;
Fig. 7B is a simplified illustration of a decompression table employed in accordance with a preferred embodiment of the present invention;
Fig. 8 is a simplified flowchart illustrating operation of a genomic data decompression engine constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 9A is a simplified illustration of an example of compression of an uncompressed genomic string representing genomic sequence data into a compressed genomic string;
Fig. 9B is a simplified illustration of an example of compression of an uncompressed genomic string representing genomic sequence data, and containing an unknown nucleotide, into a compressed genomic string;
Fig. 10 is a simplified block diagram illustrating shifted genomic sequences utilized by a compressed genomic sequence similarity search module constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 11 is a simplified flowchart illustrating operation of the compressed genomic sequence similarity search module constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 12A is a simplified illustration of an example of identifying a genomic sequence having one nucleotide replacement relative to a target genomic sequence;
Fig. 12B is a simplified illustration of an example of identifying a genomic sequence having two nucleotide additions relative to a target genomic sequence;
Fig. 12C is a simplified illustration of an example of identifying a genomic sequence having one nucleotide deletion relative to a target genomic sequence;
Fig. 13 is a simplified functional diagram of a computer database application constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 14 is a simplified block diagram illustrating a genomic pattern analysis database constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 15 is a flowchart diagram illustrating operation of a genomic preprocessing unit constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 16 is a flowchart diagram illustrating operation of a genomic query processing unit constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 17 is a simplified illustration of an example of genomic pattern analysis performed by a preferred embodiment of the present invention;
Fig. 18 is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 19 is a simplified block diagram illustrating a mechanism for data compression and decompression, which is a preferred implementation of a compression mechanism and a decompression mechanism constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 20A is a simplified illustration of a preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention;
Fig. 20B is a simplified illustration of an alternative preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention;
Fig. 21A is a table illustrating preferred values assignable to 'header' bits in accordance with a preferred embodiment of the present invention;
Fig. 21B is a table illustrating preferred values assignable to character-representing bits in accordance with a preferred embodiment of the present invention;
Fig. 21C is a table illustrating preferred values assignable to bits when encoding one or more uncommon characters;
Fig. 22 is a simplified flowchart illustrating operation of a compression engine constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 23 is a simplified flowchart illustrating a methodology for generating translation tables used in accordance with a preferred embodiment of the present invention;
Fig. 24A is a simplified illustration of a compression table employed in accordance with a preferred embodiment of the present invention;
Fig. 24B is a simplified illustration of a decompression table employed in accordance with a preferred embodiment of the present invention;
Fig. 25 is a simplified flowchart illustrating operation of a decompression engine constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 26A is a simplified illustration of an example of compression of an uncompressed string into a compressed string;
Fig. 26B is a simplified illustration of an example of compression of an uncompressed string containing a rare character into a compressed string;
Fig. 27 is a simplified block diagram illustrating shifted compressed strings utilized by a compressed string similarity search module constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 28 is a simplified flowchart illustrating operation of the compressed string similarity search module constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 29A is a simplified illustration of an example of identifying a character string having one character replacement relative to a target string;
Fig. 29B is a simplified illustration of an example of identifying a character string having two character additions relative to a target string;
Fig. 29C is a simplified illustration of an example of identifying a character string having one character deletion relative to a target string;
Fig. 30 is a simplified illustration of a triphase-bit compressed character used in accordance with a preferred embodiment of the present invention;
Fig. 31 is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 32 is a simplified flowchart illustrating preferred operation of a genomic graphic representation engine, constructed and operative in accordance with a preferred embodiment of the present invention;
Fig. 33 is a simplified illustration of an example demonstrating conversion of alphanumeric genomic representation into graphic genomic representation;
Fig. 34 is a simplified illustration of an example demonstrating an advantage of a graphic genomic representation in comparing a genomic motif sequence with the inverse-reversed sequence of this motif;
Fig. 35 is a simplified illustration of an example demonstrating an advantage of a graphic genomic representation, in visually distinguishing adenine-thymine-rich sequences, from cytosine-guanine-rich sequences; and
Fig. 36 is a simplified illustration of an example demonstrating an advantage of a graphic genomic representation, in visually distinguishing purine nucleotides from pyrimidine nucleotides.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Reference is now made to Fig. 1, which is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention. Each of a plurality of genomic sequences 100, typically represented by an alphanumeric string, is compressed by a compression mechanism 102, collectively yielding a respective plurality of compressed genomic sequences 104, each typically represented as a compressed alphanumeric string.
The compressed genomic sequences 104 are stored in a plurality of records in a table in a database 106, and a compressed genomic sequences index 108 is constructed, which indexes the compressed genomic sequences 104.
A target genomic sequence 110, one or more similarity criteria 111 and a query condition 112 relating to the target genomic sequence 110 are provided by a user of the database 106, in order to find all genomic sequences 100 in the database 106 which comply with the provided query condition 112 as it is applied to genomic sequences similar to the target genomic sequence 110, to a degree specified by the similarity criteria 111.
The target genomic sequence 110 is compressed by a compression mechanism 114, which may be similar to compression mechanism 102, into a compressed target genomic sequence 116.
The compressed target genomic sequence 116 and the similarity criteria 111 are passed on as input to a compressed genomic sequence similarity search module 118. The compressed genomic sequence similarity search module 118, in conjunction with the compressed genomic sequence index 108, is operative to query the database 106, retrieving a plurality of compressed genomic sequences which comply with the query condition 112, as applied to genomic sequences which are similar to the target genomic sequence 110, to a degree defined by similarity criteria 111, in compressed form. These results are designated compressed similar to target query results 120. The compressed genomic sequence similarity search module 118 is further described below with reference to Figs. 10, 11, 12A, 12B and 12C.
Each of the compressed genomic sequences similar to target 120 is decompressed by a decompression mechanism 122, collectively yielding a respective plurality of genomic sequences similar to target 124.
Preferred embodiments of the compression mechanisms 102 and 114 and of the decompression mechanism 122, which preferably reverses the process of these compression mechanisms, are further described below with reference to Fig. 2. It is appreciated that while the end result is retrieval of genomic sequences which are similar to the target genomic sequence 110, the actions of sequence similarity comparison and retrieval are preferably performed on compressed genomic sequences: comparing compressed target genomic sequence 116 with compressed genomic sequences 104, using the compressed genomic sequence index 108. An important aspect of the present invention is that it allows determining the level of similarity between genomic sequences, by comparing compressed genomic sequences to which these genomic sequences correspond.
Reference is now made to Fig. 2, which is a simplified block diagram illustrating a mechanism for compression and decompression of genomic data. This mechanism is a preferred implementation of the compression mechanisms 102 and 114 and of the decompression mechanism 122 described hereinabove with reference to Fig. 1.
An uncompressed genomic string 200 is given as an example of one of the genomic sequences 100 or of the target genomic sequence 110 described hereinabove with reference to Fig. 1.
In a preferred embodiment of the present invention, the uncompressed genomic string 200 represents genomic sequence data. As is well known in the art, genomic sequence data is typically represented as an alphanumeric string comprising only five letters: A, T, C, and G, representing the four nucleotides, which comprise the genome: Adenine, Thymine, Cytosine, and Guanine respectively, and the letter N or the minus sign, representing locations in the sequence in which the nucleotide is currently not known. The letter N typically appears in genomic sequence data much less frequently than do the other four letters. Further, when N's do appear in a genomic sequence they are frequently found contiguously rather than separately since it is frequently the case that a contiguous group of nucleotides in the sequence, rather than just one nucleotide, are unknown.
The uncompressed genomic string 200 comprises a plurality of bytes, each storing a character - A, T, C, or G - representing a nucleotide. For simplicity of explanation, the uncompressed genomic string 200 shown in Fig. 2 comprises only three bytes: BYTE I, BYTE II and BYTE III, each storing a character representing a corresponding nucleotide: NUC 1, NUC 2 and NUC 3, respectively. The uncompressed genomic string 200 is compressed by a genomic data compression engine 202 into a compressed genomic string 204. Operation of a preferred embodiment of the genomic data compression engine 202 is further described below with reference to Fig. 5.
The genomic data compression engine 202 employs a compression table 206 in compression of uncompressed genomic string 200 into compressed genomic string 204. The compression table 206 is preferably a translation table and holds a list of all possible or reasonable 3-alphanumeric-character combinations and, for each such combination, the byte or bytes into which it may be compressed. A preferred embodiment of the compression table 206 is further described below with reference to Fig. 7A.
The compressed genomic string 204 preferably comprises one or more compressed bytes 208. For simplicity of explanation, the compressed genomic string 204 shown in Fig. 2 comprises only one compressed byte 208, which represents in compressed form all three nucleotides which are represented in the uncompressed genomic string 200, NUC 1, NUC 2 and NUC 3, which nucleotides require three bytes of storage, BYTE I, BYTE II and BYTE III, in their original uncompressed form.
It is appreciated that compressed genomic string 204, which comprises a plurality of compressed bytes 208, may alternatively be stored as one or more integers. For example, an Integer is typically defined as a data-type comprising 4 bytes. It is therefore possible to compress an uncompressed genomic string 200 comprising 12 nucleotides into 4 compressed bytes 208, and then to store all 4 resulting compressed bytes 208 as one Integer. Longer genomic strings may be compressed into longer Integers, such as the BigInt data type in MS SQL-2000®, which comprises 8 bytes, or as a plurality of Integers or BigInts. For example, a genomic string comprising 48 nucleotides may be compressed into 2 BigInts.
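By way of non-limiting illustration, the grouping of compressed bytes 208 into Integers may be sketched in Python as follows. The function names, the big-endian byte order and the use of unsigned values are assumptions made for this sketch only and are not taken from the drawings; database integer types are commonly signed, which an actual implementation would need to take into account.

```python
def bytes_to_integers(compressed: bytes, width: int = 4) -> list[int]:
    """Group compressed bytes into unsigned integers of `width` bytes each.

    Hypothetical helper: width=4 models an Integer, width=8 a BigInt.
    """
    if len(compressed) % width != 0:
        raise ValueError("length must be a multiple of the integer width")
    return [
        int.from_bytes(compressed[i:i + width], "big")
        for i in range(0, len(compressed), width)
    ]


def integers_to_bytes(values: list[int], width: int = 4) -> bytes:
    """Reverse of bytes_to_integers."""
    return b"".join(v.to_bytes(width, "big") for v in values)


# 12 nucleotides -> 4 compressed bytes -> one 4-byte Integer.
# Each byte here is '11001110', i.e. 'ATG' under the values of Figs. 4A and 4B.
four_bytes = bytes([0b11001110] * 4)
one_integer = bytes_to_integers(four_bytes, width=4)

# 48 nucleotides -> 16 compressed bytes -> two 8-byte BigInts.
two_bigints = bytes_to_integers(bytes([0b11001110] * 16), width=8)
```

Storing several compressed bytes in one integer field allows a single field, and a single index entry, to cover 12 or 24 nucleotides.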
It is further appreciated that a TinyInt datatype, which comprises one byte of memory, may be used for compressed byte 208. In this configuration, when storing genomic strings which are longer than 3 or 4 characters, i.e. are compressed into more than one TinyInt, it is possible to store a plurality of TinyInt type fields, each representing in compressed form 3 or 4 uncompressed characters. It is then possible to create an indexed View which indexes all these fields together, as one. The compressed byte 208 is further described below with reference to Figs. 3, 4A, 4B, and 4C.
The compressed genomic string 204 may be decompressed by a genomic data decompression engine 210 back to uncompressed genomic string 200, preferably by reversal of the methodology of the genomic data compression engine 202. Operation of the genomic data decompression engine 210 is further described below with reference to Fig. 8.
Genomic data decompression engine 210 employs a decompression table 212 in decompressing compressed genomic string 204 into uncompressed genomic string 200. The decompression table 212 is a translation table and holds a list of the bit-values for each possible or reasonable compressed byte 208 and, for each such compressed byte, the alphanumeric string, containing up to three characters, into which it may be decompressed. The decompression table 212 is further described below with reference to Fig. 7B.
Reference is now made to Fig. 3A, which is a simplified illustration of a preferred implementation of a byte bitmap preferably employed in generating the compressed byte 208 of Fig. 2.
The compressed byte 208 of Fig. 2 preferably comprises 8 bits: BIT I, BIT II, BIT III, BIT IV, BIT V, BIT VI, BIT VII & BIT VIII. Preferably, these bits are divided into four groups, each containing 2 bits. A HEADER typically contains BIT I and BIT II, while a BIT-PAIR I typically contains BIT III and BIT IV, a BIT-PAIR II typically contains BIT V and BIT VI and a BIT-PAIR III typically contains BIT VII and BIT VIII.
The HEADER preferably stores information about what the other bits in the compressed byte, BIT III - BIT VIII, represent. In a preferred embodiment of the present invention the compression of genomic data is such that the compressed byte may either store up to three nucleotides, or may store up to three unknown nucleotides, i.e. 'N's, but preferably not a combination of nucleotides and N's. The HEADER stores information which indicates how many nucleotides the compressed byte 208 represents, one, two or three, or alternatively if the entire compressed byte represents one or more 'N's. The values assignable to bits of the HEADER are further described below with reference to Fig. 4A. BIT-PAIR I, BIT-PAIR II and BIT-PAIR III each contain 2 bits which are capable, when taken together, of representing one of four possible nucleotides, A, T, C and G. The values assignable to the bits of each of the nucleotides - BIT-PAIR I, BIT-PAIR II and BIT-PAIR III - are further described below with reference to Fig. 4B.
When the entire compressed byte 208 represents one or more N's, and does not represent any nucleotides, values assigned to the two bits of BIT-PAIR I determine whether the compressed byte 208 represents one, two or three N's. Values assignable to bits of BIT-PAIR I, determining the number of N's that the compressed byte 208 represents, are further described below with reference to Fig. 4C.
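The division of compressed byte 208 into a HEADER and three bit-pairs may be sketched in Python as follows. The sketch assumes that BIT I is the most significant bit of the byte, which is consistent with the way bit-values are written from left to right in this description but is not stated explicitly; the function names are hypothetical.

```python
def split_fields(byte: int) -> tuple[int, int, int, int]:
    """Return (HEADER, BIT-PAIR I, BIT-PAIR II, BIT-PAIR III) of one byte.

    Assumes BIT I is the most significant bit and BIT VIII the least.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    header = (byte >> 6) & 0b11   # BIT I, BIT II
    pair_1 = (byte >> 4) & 0b11   # BIT III, BIT IV
    pair_2 = (byte >> 2) & 0b11   # BIT V, BIT VI
    pair_3 = byte & 0b11          # BIT VII, BIT VIII
    return header, pair_1, pair_2, pair_3


def join_fields(header: int, pair_1: int, pair_2: int, pair_3: int) -> int:
    """Assemble one byte from four 2-bit fields (reverse of split_fields)."""
    for field in (header, pair_1, pair_2, pair_3):
        if not 0 <= field <= 0b11:
            raise ValueError("each field must fit in 2 bits")
    return (header << 6) | (pair_1 << 4) | (pair_2 << 2) | pair_3
```

For instance, the byte written as '11001110' splits into a HEADER of '11' and bit-pairs '00', '11' and '10'.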
Reference is now made to Fig. 3B, which is a simplified illustration of an alternative preferred implementation of a byte bitmap preferably employed in generating the compressed byte 208 of Fig. 2.
Alternative compressed byte 300 is an alternative byte bitmap which may be used for compression of genomic data, instead of the byte bitmap of compressed byte 208, depicted in Fig. 3A.
Alternative compressed byte 300 comprises four bit-pairs, BIT-PAIR I, BIT-PAIR II, BIT-PAIR III and BIT-PAIR IV, rather than only three bit-pairs as in compressed byte 208 of Fig. 3A. Unlike compressed byte 208 of Fig. 3A, in which BIT I and BIT II function as a HEADER, alternative compressed byte 300 does not comprise any such header. All 8 bits of alternative compressed byte 300 function as one of four bit-pairs, each of said bit-pairs representing a nucleotide. Alternative compressed byte 300 is therefore capable of representing 4 nucleotides in compressed form, as opposed to compressed byte 208 of Fig. 3A, which is capable of representing 3 nucleotides.
It is appreciated that the absence of a 'header' in alternative compressed byte 300 does not allow alternative compressed byte 300 to specify the number of nucleotides it represents, or whether it represents one or more unknown nucleotides. Alternative compressed byte 300 may be useful when compressing genomic sequences which do not include unknown nucleotides, and are of a fixed length. If the length of the uncompressed genomic string 200 is known, then it is possible to ignore the possible trailing zeros at the right end of the alternative compressed byte 300, which do not represent a nucleotide, but rather represent a blank. For example, an uncompressed genomic string 200 which is known to be 7 nucleotides long may be compressed into 2 alternative compressed bytes 300: the first containing in compressed form 4 nucleotides, and the second containing 3 nucleotides. In this second alternative compressed byte 300, BIT VII and BIT VIII of BIT-PAIR IV contain zeros which are ignored because the uncompressed genomic string is known to be 7 nucleotides long, despite the absence of a 'header' which would explicitly instruct to ignore these bits.
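A minimal Python sketch of the header-less alternative compressed byte 300 follows. It assumes that the bit-pair values of Fig. 4B (A='00', C='01', G='10', T='11') also apply to alternative compressed byte 300 and that BIT I is the most significant bit; neither assumption is stated for Fig. 3B, and the function names are hypothetical.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed: values of Fig. 4B
BASES = "ACGT"  # index in this string equals the 2-bit code


def pack4(sequence: str) -> bytes:
    """Pack four nucleotides per byte; no header, no 'N' support."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for nucleotide in chunk:
            byte = (byte << 2) | CODE[nucleotide]
        # Unused bit-pairs at the right end are left as zeros (blanks).
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack4(data: bytes, length: int) -> str:
    """Unpack, using the externally known length to drop trailing blanks."""
    bases = [BASES[(b >> shift) & 0b11] for b in data for shift in (6, 4, 2, 0)]
    return "".join(bases[:length])


packed = pack4("ATGCATG")          # 7 nucleotides -> 2 bytes
restored = unpack4(packed, 7)      # the length must be supplied by the caller
```

Without the known length, the trailing '00' of the second byte would be indistinguishable from an 'A', which is why this layout suits fixed-length sequences.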
Reference is now made to Fig. 4A, which is a table illustrating preferred values assignable to BIT I and BIT II, both belonging to the HEADER of compressed byte 208 shown in Fig. 3A.
Assigning the value '00' to the bits of the HEADER, i.e. assigning '0' to BIT I and assigning '0' to BIT II of compressed byte 208 shown in Fig. 3A, signifies that the entire compressed byte 208 represents only one or more unknown nucleotides, i.e. 'N's, and does not represent any known nucleotides.
Assigning the value '01' to the bits of the HEADER, i.e. assigning '0' to BIT I and '1' to BIT II of compressed byte 208 shown in Fig. 3A, signifies that the compressed byte 208 represents only one nucleotide, as represented by the values in BIT III and BIT IV, both belonging to BIT-PAIR I of compressed byte 208. The remaining four bits of the compressed byte 208, BIT V, BIT VI, BIT VII and BIT VIII, are to be ignored and do not represent any additional nucleotide.
Assigning the value '10' to the bits of the HEADER, i.e. assigning '1' to BIT I and '0' to BIT II of compressed byte 208 shown in Fig. 3A, signifies that the compressed byte 208 represents two nucleotides; the first nucleotide being represented by the values in BIT III & BIT IV, both belonging to BIT-PAIR I, and the second nucleotide being represented by values in BIT V & BIT VI, both belonging to BIT-PAIR II of compressed byte 208. The remaining two bits of the compressed byte 208, BIT VII & BIT VIII, are to be ignored and do not represent any additional nucleotide.
Assigning the value of '11' to the bits of the HEADER, i.e. assigning '1' to BIT I and '1' to BIT II of compressed byte 208, signifies that the compressed byte 208 represents three nucleotides; the first nucleotide being represented by values in BIT III & BIT IV, both belonging to BIT-PAIR I, the second nucleotide being represented by values in BIT V & BIT VI, both belonging to BIT-PAIR II, and the third nucleotide being represented by values in BIT VII & BIT VIII, both belonging to BIT-PAIR III.
Reference is now made to Fig. 4B, which is a table illustrating the preferred values assignable to the nucleotide-representing bits: BIT III, BIT IV, BIT V, BIT VI, BIT VII, & BIT VIII of compressed byte 208 shown in Fig. 3A.
As mentioned above with reference to Fig. 3A, each of BIT-PAIR I, BIT-PAIR II and BIT-PAIR III in compressed byte 208 comprises a pair of bits: BIT III & BIT IV, BIT V & BIT VI, and BIT VII & BIT VIII respectively. The values presented in Fig. 4B are values which may be assigned to each of the above mentioned pairs of bits so as to allow each of these pairs of bits to represent one of the four possible genomic nucleotides A, T, C or G.
Assigning the value '00' to any of the three bit-pairs representing one of the three nucleotides, i.e. assigning '0' to BIT III and '0' to BIT IV, or assigning '0' to BIT V and '0' to BIT VI, or assigning '0' to BIT VII and '0' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, of the compressed byte 208 of Fig. 3A, represents the nucleotide 'A'.
Assigning the value of '01' to any of the three bit-pairs representing one of the three nucleotides, i.e. assigning '0' to BIT III & '1' to BIT IV, or assigning '0' to BIT V & '1' to BIT VI, or assigning '0' to BIT VII & '1' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the nucleotide 'C'.
Assigning the value of '10' to any of the three bit-pairs representing one of the three nucleotides, i.e. assigning '1' to BIT III & '0' to BIT IV, or assigning '1' to BIT V & '0' to BIT VI, or assigning '1' to BIT VII & '0' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the nucleotide 'G'.
Assigning the value of '11' to any of the three bit-pairs representing one of the three nucleotides, i.e. assigning '1' to BIT III & '1' to BIT IV, or assigning '1' to BIT V & '1' to BIT VI, or assigning '1' to BIT VII & '1' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the nucleotide 'T'.
It is appreciated that notwithstanding the above-stated significance of the above-mentioned assignable values, the significance of values assigned to any of the six bits potentially representing nucleotides, BIT III - BIT VIII, always depends on the values assigned to the bits of the HEADER, as explained above with reference to Fig. 4A. For example the value '00' in BIT VII and BIT VIII respectively, signifies an 'A' in BIT-PAIR III only if the value assigned to the HEADER bits is '11', signifying that the byte represents three nucleotides. Otherwise, the values assigned to BIT VII and BIT VIII are ignored.
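The values of Figs. 4A and 4B, and the dependence of the bit-pairs on the HEADER, may be illustrated by the following Python sketch for bytes representing known nucleotides only. The function names are hypothetical, and BIT I is assumed to be the most significant bit of the byte.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # values of Fig. 4B
BASES = "ACGT"  # index in this string equals the 2-bit code


def encode_known(substring: str) -> int:
    """Encode one, two or three known nucleotides into a single byte."""
    count = len(substring)
    if not 1 <= count <= 3:
        raise ValueError("expected one to three nucleotides")
    byte = count << 6  # HEADER '01', '10' or '11' equals the nucleotide count
    for position, nucleotide in enumerate(substring):
        byte |= CODE[nucleotide] << (4 - 2 * position)
    return byte  # unused bit-pairs stay '00'


def decode_known(byte: int) -> str:
    """Decode a byte, reading only as many bit-pairs as the HEADER allows."""
    count = (byte >> 6) & 0b11
    if count == 0:
        raise ValueError("HEADER '00' marks a byte of unknown nucleotides")
    return "".join(BASES[(byte >> (4 - 2 * i)) & 0b11] for i in range(count))
```

Here `format(encode_known("ATG"), "08b")` gives '11001110' and `format(encode_known("AT"), "08b")` gives '10001100'; decoding '10001111' still yields 'AT', because a HEADER of '10' causes BIT VII and BIT VIII to be ignored.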
Reference is now made to Fig. 4C, which is a table illustrating preferred values assignable to BIT III and BIT IV of Fig. 3A respectively, when encoding one or more unknown nucleotides.
As mentioned above with reference to Fig. 4A, a value '00' assigned to the bits of the HEADER of Fig. 3A signifies that the compressed byte 208 represents one or more unknown nucleotides and does not represent any known nucleotides. In such case, in accordance with a preferred embodiment of the present invention, BIT III and BIT IV may be used to signify if the entire byte represents one, two or three 'N's.
Assigning the value '01' to BIT III & BIT IV, i.e. assigning '0' to BIT III and '1' to BIT IV, signifies that the entire compressed byte 208 represents one unknown nucleotide.
Assigning the value '10' to BIT III & BIT IV, i.e. assigning '1' to BIT III and '0' to BIT IV, signifies that the entire compressed byte 208 represents two unknown nucleotides.
Assigning the value '11' to BIT III & BIT IV, i.e. assigning '1' to BIT III and '1' to BIT IV, signifies that the entire compressed byte 208 represents three unknown nucleotides.
It is appreciated that notwithstanding that a preferred embodiment of the present invention demonstrates encoding of up to three 'N's in one compressed byte 208, it is possible to use each compressed byte 208 to encode more or fewer than three 'N's. For example, it is possible to use BIT III through BIT VIII of Fig. 3A to signify up to 64 N's represented by a compressed byte 208.
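The encoding of unknown nucleotides according to Fig. 4C may be sketched in Python as follows. Only the preferred scheme of up to three 'N's per byte is shown; the function names are hypothetical, BIT I is assumed to be the most significant bit, and the remaining bits BIT V through BIT VIII are assumed to be left as zeros.

```python
def encode_unknown(count: int) -> int:
    """Encode a run of one, two or three 'N's into a single byte."""
    if not 1 <= count <= 3:
        raise ValueError("the preferred scheme holds one to three N's per byte")
    # HEADER stays '00'; BIT III and BIT IV hold the count ('01', '10', '11').
    return count << 4


def decode_unknown(byte: int) -> str:
    """Decode a byte whose HEADER is '00' into a string of 'N's."""
    if (byte >> 6) & 0b11 != 0:
        raise ValueError("HEADER is not '00'; byte holds known nucleotides")
    return "N" * ((byte >> 4) & 0b11)
```

A single 'N' is thus written as '00010000', two as '00100000' and three as '00110000'.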
Reference is now made to Fig. 5, which is a simplified flowchart illustrating operation of the genomic data compression engine 202 of Fig. 2, constructed and operative in accordance with a preferred embodiment of the present invention.
Preferably two translation tables, compression table 206 and decompression table 212, required for operation of genomic data compression engine 202 and genomic data decompression engine 210 of Fig. 2, respectively, are initially generated. Generally, the compression table 206 stores for each possible combination of up to three nucleotides, e.g. 'ATG', 'GGC', 'AT', bit-values which represent this combination in compressed form in one compressed byte 208, preferably according to the values suggested in Figs. 4A, 4B, and 4C. A preferred implementation of this step, which is preferably carried out only once, is further described below with reference to Fig. 6. Figs. 7A and 7B present examples of preferred compression and decompression tables 206 and 212, respectively.
Following generation of the translation tables, an iterative process of compression of multiple strings takes place. An uncompressed genomic string, such as uncompressed genomic string 200 of Fig. 2, is received. For clarity, this iterative process is explained with reference to an example wherein the uncompressed genomic string 200 is a genomic sequence represented by a string 'ATGAT'. This example is followed through the following steps of Fig. 5.
The uncompressed genomic string 200 preferably is parsed into substrings, each having up to three nucleotides (3-nucleotide-substrings) by parsing the uncompressed genomic string 200 from left to right. It should be noted that one or more of the nucleotides in a 3-nucleotide-substring may in fact be unknown, i.e. an 'N'. In the given example the string 'ATGAT' is parsed from left to right, into 3-nucleotide-substrings, yielding 'ATG' and 'AT'.
Following parsing of the uncompressed genomic string 200 into 3-nucleotide-substrings a recursive operation is initiated, which looks up each 3-nucleotide-substring in the compression table 206, and based on the contents of the compression table, assigns appropriate bit values to the bits in one or more compressed bytes 208. The compressed bytes 208 are combined to yield a compressed genomic string 204.
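The flow of Fig. 5 may be sketched in Python as follows. For brevity, this sketch computes the compressed bytes of each 3-nucleotide-substring directly rather than looking them up in a pre-generated compression table 206; it is assumed that each run of known nucleotides, and each run of 'N's, within a substring occupies one compressed byte, with BIT I as the most significant bit. The function names are hypothetical.

```python
from itertools import groupby

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # values of Fig. 4B


def encode_run(run: str) -> int:
    """Encode a run of 1-3 known nucleotides, or of 1-3 'N's, as one byte."""
    if run[0] == "N":
        return len(run) << 4          # HEADER '00', count in BIT-PAIR I
    byte = len(run) << 6              # HEADER holds the nucleotide count
    for position, nucleotide in enumerate(run):
        byte |= CODE[nucleotide] << (4 - 2 * position)
    return byte


def compress(sequence: str) -> bytes:
    """Parse left to right into 3-nucleotide-substrings and encode each."""
    out = bytearray()
    for i in range(0, len(sequence), 3):
        substring = sequence[i:i + 3]
        # N's and known nucleotides never share a byte, so split into runs.
        for _, group in groupby(substring, key=lambda ch: ch == "N"):
            out.append(encode_run("".join(group)))
    return bytes(out)


example = compress("ATGAT")   # substrings 'ATG' and 'AT'
```

For the example string 'ATGAT' the sketch yields two bytes, '11001110' and '10001100'.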
Reference is now made to Fig. 6, which is a simplified flowchart illustrating a preferred functionality for generating translation tables, including compression table 206 and decompression table 212 of Fig. 2.
Firstly, preferably all possible 3-nucleotide-substrings, i.e. 1-nucleotide, 2-nucleotide and 3-nucleotide combinations of A, T, C, G and N, are generated. Examples of these combinations may include ATN, ATA and ATC.
For each 3-nucleotide-substring, the following procedure applies: A determination is made as to whether the 3-nucleotide-substring contains one or more N's.
If so, the N's are encoded in compressed bytes 208 which do not also represent known nucleotides. Values '00' are assigned to the headers of such compressed bytes and all known nucleotides in the 3-nucleotide-substring are encoded in other bytes.
If no N's are present in a 3-nucleotide-substring, all nucleotides in the 3-nucleotide-substring are encoded in a single compressed byte 208.
For all 3-nucleotide-substrings containing known nucleotides, a determination is made as to the number of nucleotides in that current 3-nucleotide-substring.
If a single nucleotide is encoded into a byte, the HEADER is assigned the value '01' and the single nucleotide is represented by BIT III & BIT IV of the compressed byte 208.
If two nucleotides are encoded into a byte, the HEADER is assigned the value '10' and the two nucleotides are represented by BITS III - VI of the compressed byte 208, with the first nucleotide represented by BIT III & BIT IV and the second nucleotide represented by BIT V & BIT VI.
If three nucleotides are encoded into a byte, the HEADER is assigned the value '11' and the three nucleotides are represented by BITS III - VIII of the compressed byte 208, with the first nucleotide represented by BIT III & BIT IV, the second nucleotide represented by BIT V & BIT VI and the third nucleotide represented by BIT VII & BIT VIII.
Each 3-nucleotide-substring and its corresponding one or more compressed bytes 208 are stored in translation tables including compression table 206 and decompression table 212.
It is appreciated by those skilled in the art that unknown nucleotides (N's) are very rare in typical genomic sequences, and that furthermore when 'N's appear in a genomic sequence they tend to appear contiguously, signifying a 'gap' in the sequenced genome. Instances of isolated single or double N's are typically less frequent than instances of contiguous 'N's. The present invention utilizes this fact to achieve an optimized compression suited specifically for genomic sequence data: three-nucleotide combinations which contain only nucleotides and no N's, as well as those containing only N's and no nucleotides, are both compressed into a single byte. Most of the rare cases of 3-nucleotide mixtures of nucleotides and N's are compressed into two bytes. Only a minority of extremely rare combinations of nucleotides and N's require three bytes and therefore are in fact not compressed.
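The table generation of Fig. 6 may be sketched in Python as follows. The representation of the tables as dictionaries, the function names, and the rule that each run of known nucleotides or of 'N's occupies one byte are assumptions of this sketch; only substrings which compress into a single byte are entered into the decompression table here.

```python
from itertools import groupby, product

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # values of Fig. 4B


def encode_run(run: str) -> int:
    """Encode a run of 1-3 known nucleotides, or of 1-3 'N's, as one byte."""
    if run[0] == "N":
        return len(run) << 4          # HEADER '00', count in BIT-PAIR I
    byte = len(run) << 6              # HEADER '01', '10' or '11'
    for position, nucleotide in enumerate(run):
        byte |= CODE[nucleotide] << (4 - 2 * position)
    return byte


def build_tables() -> tuple[dict[str, tuple[int, ...]], dict[int, str]]:
    """Generate a compression table and a decompression table."""
    compression: dict[str, tuple[int, ...]] = {}
    decompression: dict[int, str] = {}
    for length in (1, 2, 3):
        for combination in product("ATCGN", repeat=length):
            substring = "".join(combination)
            runs = ["".join(group) for _, group in
                    groupby(substring, key=lambda ch: ch == "N")]
            encoded = tuple(encode_run(run) for run in runs)
            compression[substring] = encoded
            if len(encoded) == 1:
                decompression[encoded[0]] = substring
    return compression, decompression


compression_table, decompression_table = build_tables()
```

Under these assumptions the compression table holds 155 entries (5 + 25 + 125 substrings), of which 87 map to a single byte.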
Reference is now made to Fig. 7A, which is a simplified illustration of a preferred implementation of compression table 206 of Fig. 2, employed in accordance with a preferred embodiment of the present invention.
The goal of compression table 206 is to provide a translation-table, also referred to as a 'lookup table', which provides the bit-values of the one or more compressed bytes 208 required to represent in compressed form every possible 1-nucleotide, 2-nucleotide and 3-nucleotide sub-string of uncompressed genomic string 200.
For simplicity of explanation, the compression table 206 is described here logically, as a database table comprising fields into each of which multiple values are stored in respective multiple records. It is appreciated by those skilled in the art that the description of the compression table 206 in terms of a table comprising fields is meant for clarity and not meant to be limiting, and that the compression table 206 may equally be implemented as 'CASE' or 'IF-THEN' programming code in any suitable computer language, as is well known in the art. For example, computer code can be written which comprises a plurality of 'IF-THEN' or 'CASE' statements, each one of the statements providing bit-values of the one or more compressed bytes 208 representing in compressed form one 3-nucleotide-substring of uncompressed genomic string 200.
Compression table 206 preferably comprises multiple records each containing 4 fields: uncompressed nucleotide-combination 700, compressed byte I 702, compressed byte II 704 and compressed byte III 706. For clarity, an example is given for the content which may be stored in each of these fields.
The uncompressed nucleotide combination 700 is a field which stores all possible 3-nucleotide substrings, i.e. 1-nucleotide, 2-nucleotide and 3-nucleotide combinations, including combinations of nucleotides only and combinations which include N's. In the given example uncompressed nucleotide combination 700 stores a 3-nucleotide combination 'ATN'.
Compressed byte I 702, compressed byte II 704 and compressed byte III 706 respectively are fields which store for each uncompressed nucleotide-combination 700 the bit-values for each of the one or more compressed bytes 208 required for encoding it. In the given example, compressed byte I 702 stores '10001100', which represents the nucleotide-combination 'AT', and compressed byte II 704 stores '00010000', which represents 'N'. Compressed byte III 706, in the given example, stores null, since only two compressed bytes are required to represent the nucleotide combination 'ATN'.
As described above with reference to Fig. 6, in accordance with a preferred embodiment of the present invention, most 3-nucleotide substrings may be compressed into one compressed byte 208, some rare combinations may be compressed into two compressed bytes, and some 20 very rare combinations may require 3 compressed bytes, and therefore may not be compressed. Therefore, notwithstanding that compression table 206 comprises three compressed bytes fields 702-706, one compressed byte field, such as compressed byte I 702, is sufficient to translate a vast majority of 3-nucleotide combinations to be typically found in a genomic sequence.
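The distribution of 3-nucleotide combinations among one, two and three compressed bytes may be checked with the following Python sketch, which assumes that each run of known nucleotides, and each run of 'N's, within a substring occupies one compressed byte. The function names are hypothetical.

```python
from collections import Counter
from itertools import groupby, product


def bytes_needed(substring: str) -> int:
    """One compressed byte per run of known nucleotides or of 'N's."""
    return sum(1 for _ in groupby(substring, key=lambda ch: ch == "N"))


def size_histogram(length: int) -> Counter:
    """Count substrings of the given length by the bytes they require."""
    return Counter(
        bytes_needed("".join(combination))
        for combination in product("ATCGN", repeat=length)
    )


histogram = size_histogram(3)
three_byte_cases = sorted(
    "".join(c) for c in product("ATCGN", repeat=3)
    if bytes_needed("".join(c)) == 3
)
```

Under this assumption, of the 125 possible 3-character combinations, 65 require one byte, 40 require two bytes and 20 require three bytes. The 20 three-byte cases are those in which a single 'N' lies between two known nucleotides (16 cases, e.g. 'ANT') or a single known nucleotide lies between two 'N's (4 cases, e.g. 'NAN').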
Reference is now made to Fig. 7B, which is a simplified illustration of a preferred implementation of decompression table 212 of Fig. 2, employed in accordance with a preferred embodiment of the present invention.
The goal of decompression table 212 is to provide a translation-table, also referred to as a 'lookup table', which provides the 1-nucleotide, 2-nucleotide or 3-nucleotide uncompressed genomic string 200 preferably corresponding to every possible compressed byte 208.
It is appreciated that the description of the decompression table 212 in terms of table comprising fields is meant for clarity and not meant to be limiting, and that the decompression table 212 may equally be implemented as a 'CASE' code in any computer language, as is well known in the art.
The decompression table 212 preferably comprises multiple records each containing 2 fields: compressed byte 708 and decompressed nucleotide-combination 710. Compressed byte 708 is a field which preferably stores bit-values of every possible compressed byte 208.
Decompressed nucleotide combination 710 is a field which stores for each compressed byte 708 the 1-nucleotide, 2-nucleotide or 3-nucleotide uncompressed genomic string 200 which it encodes.
For example, the field compressed byte 708 may contain the compressed byte 208 bit-value '10001100' and the respective field decompressed nucleotide combination 710 may contain the 2-nucleotide combination 'AT' which this bit value represents in compressed form.
Reference is now made to Fig. 8, which is a simplified flowchart illustrating operation of genomic data decompression engine 210 of Fig. 2 constructed and operative in accordance with a preferred embodiment of the present invention. Generally, genomic data decompression engine 210 of Fig. 2 performs a reverse action of that of genomic data compression engine 202 of Fig. 2, which was further described hereinabove with reference to Fig. 5.
A compressed genomic string 204 is received by genomic data decompression engine 210 of Fig. 2 in order to be decompressed. For clarity, the process shown in Fig. 8 is explained with reference to an example wherein the compressed genomic string 204 comprises two compressed bytes 208, the bit-values of which are '11001110' and '10001100' respectively. This example is followed through the following steps of Fig. 8.
A recursive operation is initiated, which parses the received compressed genomic string 204 into the compressed bytes 208 included in this compressed genomic string. Each compressed byte 208 is looked up in the decompression table 212 and, based on the contents of the decompression table, the 3-nucleotide substring which this compressed byte represents is determined. The 3-nucleotide substrings are combined to yield an uncompressed genomic string 200.
In the given example, the second compressed byte in the compressed genomic string has bit-values of '10001100', which when looked up in the decompression table is found to represent the nucleotide combination 'AT'. Combining the two 3-nucleotide-substrings, 'ATG' represented by compressed byte '11001110' and 'AT' represented by compressed byte '10001100', yields the uncompressed genomic string 'ATGAT'.
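The flow of Fig. 8 may be sketched in Python as follows. For brevity, this sketch interprets the HEADER of each compressed byte directly rather than looking the byte up in a pre-generated decompression table 212; the two approaches are intended to give the same result. The function name is hypothetical and BIT I is assumed to be the most significant bit.

```python
BASES = "ACGT"  # index equals the 2-bit value of Fig. 4B


def decompress(compressed: bytes) -> str:
    """Decode each compressed byte in turn and join the substrings."""
    parts = []
    for byte in compressed:
        header = (byte >> 6) & 0b11
        if header == 0:
            # HEADER '00': BIT III and BIT IV give the number of 'N's.
            parts.append("N" * ((byte >> 4) & 0b11))
        else:
            # HEADER gives the number of known nucleotides; extra bits ignored.
            parts.append("".join(
                BASES[(byte >> (4 - 2 * i)) & 0b11] for i in range(header)
            ))
    return "".join(parts)


restored = decompress(bytes([0b11001110, 0b10001100]))
```

Applied to the two bytes of the example, '11001110' and '10001100', the sketch returns 'ATGAT'.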
Reference is now made to Fig. 9A, which is a simplified illustration of an example of compressing an uncompressed genomic string 200 into a compressed genomic string 204, both shown in Fig. 2.
Uncompressed genomic string 'ATGAT' 900 is an uncompressed genomic string 200 comprising the nucleotides: 'ATGAT'.
When parsed into 3-nucleotide substrings, beginning from the left side of the string, as shown in Fig. 5, the result is two 'nucleotide-triplets': nucleotide-triplet-1 'ATG' 902 and nucleotide-triplet-2 'AT_' 904. It is appreciated that the nucleotide-triplet-2 'AT_' 904 actually contains only two nucleotides: A and T.
Since neither of these 3-nucleotide sub-strings contains an N, each one of them is compressed directly into one compressed byte 208: compressed byte-1 906 and compressed byte-2 908, respectively.
Compressed byte-1 906 encodes three nucleotides: 'ATG'. Therefore a value of '11' is assigned to the two bits of the HEADER of compressed byte-1 906, as indicated by reference numeral 908, signifying that this compressed byte 208 encodes 3 nucleotides.
Value '00' is set to the two bits of BIT-PAIR I, of compressed byte-1 906, as indicated by reference numeral 910, signifying that the first nucleotide represented by this byte is 'A'.
Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-1 906, as indicated by reference numeral 912, signifying that the second nucleotide represented by this byte is 'T'.
Value '10' is assigned to the two bits of BIT-PAIR III of compressed byte-1 906, as indicated by reference numeral 914, signifying that the third nucleotide represented by this byte is 'G'.
Compressed byte-2 908 encodes two nucleotides: 'AT'. Therefore a value of '10' is assigned to the two bits of the HEADER of compressed byte-2 908, as indicated by reference numeral 916, signifying that this compressed byte 208 encodes 2 nucleotides, and that therefore the two bits of BIT-PAIR III are to be ignored. Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte-2 908, as indicated by reference numeral 918, signifying that the first nucleotide represented by this byte is 'A'.
Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-2 908, as indicated by reference numeral 920, signifying that the second nucleotide represented by this byte is 'T'.
Value '00' stored in the bits of BIT-PAIR III of compressed byte-2 908, as indicated by reference numeral 922, is ignored and does not represent an 'A', since the HEADER specified that this byte encodes only 2 nucleotides.
Reference is now made to Fig. 9B, which is a simplified illustration of another example of compression of an uncompressed genomic string 200 of Fig. 2, containing an unknown nucleotide, into a compressed genomic string 204 of Fig. 2.
Uncompressed genomic string 'ATNCG' 950 is an uncompressed genomic string 200 comprising the characters: 'ATNCG'.
When parsed into 3-nucleotide sub-strings, beginning from the left side of the string, as shown in Fig. 5, the result is two 'nucleotide-triplets': nucleotide-triplet-1 'ATN' 952, and nucleotide-triplet-2 'CG_' 954. It is appreciated that nucleotide-triplet-2 'CG_' 954 actually contains not a triplet but only 2 nucleotides: C and G.
Since nucleotide-triplet-1 'ATN' 952 contains an 'N', it is preferably represented by two compressed bytes 208 rather than one: the first, compressed byte-1 956, encodes 'AT', and the second, compressed byte-2 958, encodes 'N'.
Compressed byte-1 956 encodes two nucleotides, 'AT'. Therefore '10' is assigned to the two bits of the HEADER of compressed byte-1 956, as indicated by reference numeral 960, signifying that this compressed byte 208 encodes 2 nucleotides.
Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte-1 956, as indicated by reference numeral 962, signifying that the first nucleotide represented by this byte is 'A'.
Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-1 956, as indicated by reference numeral 964, signifying that the second nucleotide represented by this byte is 'T'. Value '00' assigned to the bits of BIT-PAIR III of compressed byte-1 956, as indicated by reference numeral 966, is ignored, and does not represent an 'A', since the HEADER specified that this byte encodes only 2 nucleotides.
Compressed byte-2 958 is dedicated to encoding 'N's, in this case only one 'N', which is derived from the nucleotide-triplet-1 'ATN' 952. Therefore '00' is assigned to the two bits of the HEADER of compressed byte-2 958, as indicated by reference numeral 968, signifying that this compressed byte 208 is dedicated to encoding one or more N's.
The value '01' is assigned to the two bits of BIT-PAIR I of compressed byte-2 958, as indicated by reference numeral 970, signifying that this byte, which is dedicated to encoding N's, encodes only one N. Accordingly, the zeros in the two bits of BIT-PAIR II and the two bits of BIT-PAIR III, indicated by reference numerals 972 & 974, are ignored.
Compressed byte-3 990 encodes two nucleotides: 'CG'. Therefore a value of '10' is assigned to the two bits of the HEADER of compressed byte-3 990, as indicated by reference numeral 976, signifying that this compressed byte 208 encodes only 2 nucleotides.
Value '01' is assigned to the two bits of BIT-PAIR I of compressed byte-3 990, as indicated by reference numeral 978, signifying that the first nucleotide represented by this byte is 'C'.
Value '10' is assigned to the two bits of BIT-PAIR II of compressed byte-3 990, as indicated by reference numeral 980, signifying that the second nucleotide represented by this byte is 'G'.
The value '00' in the bits of BIT-PAIR III of compressed byte-3 990, as indicated by reference numeral 982 is ignored and does not represent an 'A' since the HEADER specified that this byte encodes only 2 nucleotides.
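The handling of an unknown nucleotide in this example may be sketched as follows. The sketch rests on assumptions which are not stated in the example itself: the names `pack_run` and `compress_with_n` are hypothetical, the HEADER is assumed to occupy the two most significant bits, each triplet is assumed to be split into consecutive runs of known nucleotides and of N's, and the N count is assumed to be held in BIT-PAIR I, as it is for the single N above.

```python
from itertools import groupby

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_run(run: str) -> int:
    """Pack one run (all N's, or 1 to 3 known nucleotides) into a byte."""
    if run[0] == "N":
        # HEADER '00' marks a byte dedicated to N's;
        # BIT-PAIR I is assumed to hold the number of N's (1 to 3).
        return len(run) << 4
    byte = len(run) << 6                     # HEADER: number of known nucleotides
    for i, n in enumerate(run):
        byte |= CODE[n] << (4 - 2 * i)       # BIT-PAIR I, then II, then III
    return byte

def compress_with_n(seq: str) -> list[int]:
    """Parse into triplets; split any triplet containing 'N' into separate bytes."""
    out = []
    for start in range(0, len(seq), 3):
        triplet = seq[start:start + 3]
        for _, group in groupby(triplet, key=lambda c: c == "N"):
            out.append(pack_run("".join(group)))
    return out

for b in compress_with_n("ATNCG"):
    print(format(b, "08b"))
# prints 10001100, 00010000, 10011000
```

The three bytes printed correspond to compressed byte-1 956, compressed byte-2 958 and compressed byte-3 990 of the example.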
Reference is now made to Fig. 10, which is a simplified block diagram illustrating shifted genomic sequences utilized by the compressed genomic sequence similarity search module 118 of Fig. 1 constructed and operative in accordance with a preferred embodiment of the present invention.
It is appreciated that the difficulty in assessing similarity of genomic sequences in compressed form is that the compression mechanism described above with reference to Figs. 2-8 compresses more than one nucleotide into one byte. In a preferred embodiment of the present invention, each compressed byte represents in compressed form 3 or 4 nucleotides. It is therefore very easy to compare entire 'triplets' of nucleotides, but if an addition or deletion of a single nucleotide occurs, then all triplets 'downstream' of the modified one will appear to have changed completely, whereas in fact they have only been 'shifted' to the right or to the left by one location.
The basic concept of the compressed genomic sequence similarity search module 118 of Fig. 1 is therefore to calculate all possible 'shifted' variations of the compressed genomic sequence 110 of Fig. 1, and to use them to search for compressed genomic sequences similar to target 120 of Fig. 1.
An example is provided of target genomic sequence 1000, which is a compressed genomic sequence comprising 12 nucleotides, N1 through N12, which are represented in compressed form by 4 compressed bytes 208: BYTE 1, BYTE 2, BYTE 3 and BYTE 4.
Based on this target genomic sequence 1000, four shifted genomic sequences 1002 are generated: 'minus one' shifted genomic sequence 1004, 'minus two' shifted genomic sequence 1006, 'plus one' shifted genomic sequence 1008, 'plus two' shifted genomic sequence 1010.
The first nucleotide in target genomic sequence 1000, N1, has been removed in the 'minus one' shifted genomic sequence 1004, and therefore 'minus one' shifted genomic sequence 1004 begins with N2. The nucleotides compressed into each of the four compressed bytes 208 of 'minus one' shifted genomic sequence 1004 are therefore 'shifted to the left' by one location.
Similarly, the sequence of nucleotides compressed into each of the four compressed bytes 208 of 'minus two' shifted genomic sequence 1006 is shifted to the left by two locations; that of 'plus one' shifted genomic sequence 1008 is shifted to the right by one location; and that of 'plus two' shifted genomic sequence 1010 is shifted to the right by two locations.
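The generation of the four shifted sequences may be sketched as follows, operating on nucleotide labels grouped three to a byte rather than on actual compressed bytes. The names `to_triplets` and `shifted_sequences` are hypothetical, and two details are assumptions made for illustration: a 'plus' shift is modelled by prepending blank positions (`None`), and each shifted sequence is truncated to the same number of bytes as the target.

```python
def to_triplets(nucs, n_bytes):
    """Group nucleotides three at a time, keeping at most n_bytes groups."""
    return [tuple(nucs[i:i + 3]) for i in range(0, len(nucs), 3)][:n_bytes]

def shifted_sequences(nucs, max_shift=2, blank=None):
    """Return {shift: triplets} for shifts -max_shift..+max_shift, excluding 0.

    A negative shift drops leading nucleotides (shift to the left); a positive
    shift prepends blank positions (shift to the right). The blank placeholder
    is an assumption of this sketch.
    """
    n_bytes = (len(nucs) + 2) // 3
    variants = {}
    for k in range(-max_shift, max_shift + 1):
        if k == 0:
            continue
        shifted = nucs[-k:] if k < 0 else [blank] * k + nucs
        variants[k] = to_triplets(shifted, n_bytes)
    return variants

target = [f"N{i}" for i in range(1, 13)]     # N1 through N12
for shift, triplets in sorted(shifted_sequences(target).items()):
    print(shift, triplets)
```

With the 12-nucleotide target, shift -1 begins with the triplet (N2, N3, N4) and shift +2 has (N5, N6, N7) as its third triplet.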
Reference is now made to Fig. 11, which is a simplified flowchart illustrating operation of the compressed genomic sequence similarity search module 118 of Fig. 1 constructed and operative in accordance with a preferred embodiment of the present invention.
Operation of the compressed genomic sequence similarity search module 118 begins by getting a target compressed genomic sequence 110 of Fig. 1.
Based on this compressed target genomic sequence, four shifted compressed genomic sequences are generated: 'minus one' shifted genomic sequence 1004, 'minus two' shifted genomic sequence 1006, 'plus one' shifted genomic sequence 1008, 'plus two' shifted genomic sequence 1010, as described hereinabove with reference to Fig. 10.
Using the compressed genomic sequence index 104 of Fig. 1, all compressed genomic sequences 104, having at least one compressed byte which matches that of the compressed target genomic sequence 1000 or of one of the four shifted genomic sequences 1004-1010, are retrieved. It is important to note that a match is looked for only between bytes occupying the same location in the compressed genomic string: the first compressed byte in a compressed genomic sequence 104 is compared to the first compressed byte of the compressed target genomic sequence and to the first compressed byte of each of the four compressed shifted genomic sequences. It is not compared to any other, e.g. second, third or fourth, compressed bytes within these genomic sequences. All compressed genomic sequences having at least one match are considered potentially similar, and are passed on to the next step.
Next, all of the compressed genomic sequences having at least one match are assessed, and the number of mismatching compressed bytes for each of them is counted.
Compressed genomic sequences having fewer mismatching compressed bytes with the target, or with one of the shifted genomic sequences, than a certain user defined 'threshold' are considered potentially very similar, and are passed on to the next step.
Optionally, the mismatching compressed byte/s are further analyzed to determine the exact nature of the mistake, in order to further fine-tune the similarity comparison.
The resulting compressed genomic sequences similar to target 120 of Fig. 1 are considered to represent, in compressed form, genomic sequences which are similar to the target genomic sequence represented in compressed form by the compressed target genomic sequence 110, and are delivered.
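The steps of Fig. 11 may be sketched as follows. This is an illustration only: the names `mismatches` and `similar_to_target` are hypothetical, a simple linear scan stands in for the compressed genomic sequence index, mismatches are assumed to be counted against each variant separately with the lowest count retained, and the byte values are made up for the example.

```python
def mismatches(a, b):
    """Count same-position bytes that differ; a length difference counts too."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def similar_to_target(variants, candidates, threshold):
    """variants: the target and its shifted sequences, each a list of bytes.
    candidates: mapping of name -> list of bytes.
    Returns {name: lowest mismatch count} for candidates under the threshold.
    """
    results = {}
    for name, cand in candidates.items():
        # Step 1: keep only candidates with at least one same-position match.
        if not any(x == y for v in variants for x, y in zip(v, cand)):
            continue
        # Step 2: count mismatching bytes against each variant, keep the best.
        best = min(mismatches(v, cand) for v in variants)
        # Step 3: apply the user defined threshold.
        if best < threshold:
            results[name] = best
    return results

# Hypothetical byte values: a target and its 'minus one' shifted sequence.
variants = [[1, 2, 3, 4], [5, 6, 7, 8]]
candidates = {
    "replacement": [1, 9, 3, 4],   # one byte differs from the target
    "deletion": [1, 9, 7, 8],      # tail matches the shifted sequence
    "unrelated": [9, 9, 9, 9],     # no same-position match at all
}
print(similar_to_target(variants, candidates, threshold=3))
# prints {'replacement': 1, 'deletion': 2}
```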
Reference is now made to Fig. 12A, which is a simplified illustration of an example of identifying a genomic sequence having one nucleotide replacement relative to a target genomic sequence.
Fig. 12A shows a genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (1 REPLACEMENT), in which a nucleotide designated N13, shown in broken line format, in the compressed byte designated 1R-BYTE II, has replaced nucleotide N5 in that same spot in the original TARGET GENOMIC SEQUENCE.
As is apparent from Fig. 12A, 3 of the 4 compressed bytes of SIMILAR TO TARGET GENOMIC SEQUENCE (1 REPLACEMENT), shown in bold line format, still match those in TARGET GENOMIC SEQUENCE, and so the two genomic sequences can be deduced as being similar, by comparison of their compressed format, without decompressing them.
Reference is now made to Fig. 12B, which is a simplified illustration of an example of identifying a genomic sequence having two nucleotide additions relative to a target genomic sequence.
Fig. 12B shows a genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (2 ADDITIONS), in which two nucleotides designated N13 and N14, shown in broken line format, in the compressed byte designated 2A-BYTE II, have been added to the genomic sequence relative to the original TARGET GENOMIC SEQUENCE, 'pushing' nucleotides N5 and N6 to the next compressed byte designated 2A-BYTE III, and shifting all the following nucleotides by two positions.
As is apparent from Fig. 12B, only the first compressed byte designated 2A-BYTE I, in genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (2 ADDITIONS), matches the original BYTE I of TARGET GENOMIC SEQUENCE. However, the third and fourth bytes of SIMILAR TO TARGET GENOMIC SEQUENCE (2 ADDITIONS), designated 2A-BYTE III and 2A-BYTE IV respectively, do match those of 'PLUS TWO' SHIFTED GENOMIC SEQUENCE, designated P2-BYTE III and P2-BYTE IV respectively. All matching bytes are shown in bold line format. The two genomic sequences may therefore be deduced as being similar, differing by a two addition mistake in the mismatched compressed byte 2A-BYTE II.
Reference is now made to Fig. 12C, which is a simplified illustration of an example of identifying a genomic sequence having one nucleotide deletion relative to a target genomic sequence.
Fig. 12C shows a genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (1 DELETION), in which one nucleotide designated N5 of TARGET GENOMIC SEQUENCE has been deleted, shifting all nucleotides from N6 onwards one position to the left. The missing N5 in byte 1D-BYTE II is represented by a small blank broken line box between N4 and N6.
As is apparent from Fig. 12C, only the first compressed byte designated 1D-BYTE I, in genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (1 DELETION), matches the original BYTE I of TARGET GENOMIC SEQUENCE. However, the third and fourth bytes of SIMILAR TO TARGET GENOMIC SEQUENCE (1 DELETION), designated 1D-BYTE III and 1D-BYTE IV respectively, do match those of 'MINUS ONE' SHIFTED GENOMIC SEQUENCE, designated M1-BYTE III and M1-BYTE IV respectively. All matching bytes are shown in bold line format.
The two genomic sequences may therefore be deduced as being similar, differing by a one deletion mistake in the mismatched compressed byte 1D-BYTE II.
It is therefore appreciated that replacement, deletion and addition types of differences between a target genomic sequence and other genomic sequences, can be detected in compressed form, without decompression.
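The deletion and addition examples of Figs. 12B and 12C may be reproduced with the following sketch, which reports, for each shift, the byte positions at which a candidate sequence matches the correspondingly shifted target. Nucleotide labels grouped three to a byte stand in for compressed bytes, the helper names are hypothetical, blank positions are modelled as `None` by assumption, and positions are zero-based, so that positions 2 and 3 correspond to BYTE III and BYTE IV.

```python
def triplets(nucs):
    """Group nucleotide labels three to a 'byte'."""
    return [tuple(nucs[i:i + 3]) for i in range(0, len(nucs), 3)]

def shifted(nucs, k):
    """k < 0 drops |k| leading nucleotides; k > 0 prepends k blanks (None)."""
    return triplets(nucs[-k:] if k < 0 else [None] * k + nucs)

def matching_positions(target, candidate, max_shift=2):
    """Map each shift to the byte positions where the candidate matches it."""
    cand = triplets(candidate)
    report = {}
    for k in range(-max_shift, max_shift + 1):
        variant = shifted(target, k)
        report[k] = [i for i, (a, b) in enumerate(zip(variant, cand)) if a == b]
    return report

target = [f"N{i}" for i in range(1, 13)]
one_deletion = [n for n in target if n != "N5"]            # as in Fig. 12C
two_additions = target[:4] + ["N13", "N14"] + target[4:]   # as in Fig. 12B

print(matching_positions(target, one_deletion))
print(matching_positions(target, two_additions))
```

In the deletion case the only matches are position 0 at shift 0 and positions 2 and 3 at shift -1, which points to a one-nucleotide deletion in the second byte. In the addition case the third and fourth bytes match at shift +2.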
It is also appreciated that while Figs. 10 and 12A, 12B and 12C demonstrate detection of up to 2 addition or deletion mistakes, the same concept may be utilized to detect a wider spectrum of mistakes, by generating more 'shifted sequences' accordingly, e.g. 'plus three' shifted sequence, etc.
Reference is now made to Fig. 13, which is a simplified block diagram of a computer database application constructed and operative in accordance with a preferred embodiment of the present invention. It is appreciated that the computer database application may be implemented in any appropriate programmed computer system, for example, an appropriate personal computer, or a personal computer server, and may use any appropriate database system, for example MICROSOFT SQL SERVER 2000®. The embodiment of Fig. 13 comprises a mechanism for efficient pattern analysis of genomic sequence data.
The general idea of the present invention is to view the task of genomic pattern analysis in a manner similar to an attempt to understand a book in a totally foreign and unknown language, but where some clues do exist as for the meaning of a few specific words, or the general significance of several chapters. The approach in such a case would be to break up the book into meaningful sections, such as chapters, and within each such chapter, to make a list of all the words appearing in that chapter. This then allows one to find correlations between words and other words, or between words and the chapters they are found in.
For example, frequently finding the word 'stand' immediately next to the word 'by' leads to the assumption that this combination has some meaning. Similarly, if the word 'chocolate' is found frequently in chapters which are somehow known to be 'cooking recipes', and is never found in chapters which are definitely known to be 'car maintenance manuals', then this supports the suspicion that 'chocolate' is 'food' related rather than 'garage' related.
Analysis of genomic sequence data may be approached in much the same manner. First the 'book', i.e. the DNA sequence, is divided into meaningful 'chapters' such as protein coding regions, and regions upstream and downstream of these regions. As is well known in the art, regions adjacent to protein coding regions are often involved in inhibiting or enhancing the production of these proteins. Additional 'sub-chapters' may be created as well, such as regions within a protein coding region which are known or suspected to have a specific function.
Next, each of these 'chapters', i.e. genomic protein related regions, is parsed to reveal preferably all of the 'words' they contain. Since in the case of genomic sequence data we do not know what the 'words' are, the approach taken by the present invention is to parse each 'chapter' into 'words' of arbitrary length/s, such as all lengths between 10 and 30 characters. This approach generates a very long list of 'potential words', with the understanding that most of these are nonsense and only a small fraction are genuine 'words'. All occurrences of such 'words' in each of the 'chapters', referred to in the present invention as 'short genomic segment in region' or 'SGSR', are thus found, stored and indexed. This serves as the basis for pattern analysis of genomic sequence data, and a means for elucidating functional meanings hidden in the genomic sequence 'text'. Fig. 13 gives an overview of this process, which is described as follows:
Analysis of genomic data begins by obtaining genomic data to be analyzed, and other definitions and preferences required for the genomic data analysis. Primary required data is raw genomic data 1100, including sequenced DNA data 1102 and protein location information 1104. Protein location information 1104 comprises the relative offset of the protein coding regions of proteins known to be encoded by the sequenced DNA 1102, as well as the orientation of these protein coding areas where available. Protein location 1104 is typically part of basic genomic annotation data which is made available as part of the genomic sequencing effort.
In addition, other genomic data 1106 may also contribute to the process of genomic pattern analysis, and may include various properties of known proteins, such as tissue-specific expression of proteins, the organism in which each protein is expressed (when analyzing genomes of multiple organisms), or a protein-specific biological function. This information may also include additional research-derived information about proteins encoded by the sequenced DNA, such as grouping specific proteins into groups of proteins which are of particular interest. Other genomic data 1106 may also include information about specific sites or locations in a protein-coding region, such as various protein binding sites, or regions upstream of the coding area of a protein, which are of specific interest. The significance and use of such additional data is elaborated hereinbelow.
User defined criteria 1108 may be entered, defining various parameters by which the genomic sequence data analysis is performed, as explained hereinbelow.
The raw genomic data 1100, other genomic data 1106 and user defined criteria 1108 are entered into a genomic pattern analysis engine 1110.
The genomic pattern analysis engine 1110 is a computer based program, preferably built around a database program, such as MICROSOFT SQL SERVER 2000®, and comprising a genomic pre-processing unit 1112, a genomic pattern analysis database 1114 and a genomic query processing unit 1116. The genomic pre-processing unit 1112 is a computer program operative in conjunction with a database program, which receives the raw genomic data 1100 and other genomic data 1106 entered to the genomic pattern analysis engine 1110, pre-processes it and stores the pre-processed genomic data to the genomic pattern analysis database. Operation of the genomic pre-processing unit 1112 is further described below with reference to Fig. 15.
The genomic pattern analysis database 1114 is a database storing the preprocessed genomic data. The data structure of the genomic pattern analysis database 1114 is designed so as to be conducive to pattern analysis of genomic sequence data, and to include the following major data elements:
Proteins 1118 is a list of proteins known to be encoded by the sequenced DNA 1102;
Protein related regions 1120 are regions in the sequenced DNA, which are related to each of the proteins 1118, such as protein coding regions, and regions upstream of protein coding regions;
Short genomic segments in regions 1122 (SGSR) are all short genomic segments of a length defined by the user defined criteria 1108, found in each of the protein related regions; and
SGSR-to-SGSR relationships 1124 document various relationships between two or more SGSRs, such as the distance between them.
The genomic pattern analysis database is further described below with reference to Fig. 14.
The genomic pattern analysis database 1114 may be queried by the genomic query processing unit 1116, allowing a user 1126 to analyze the raw genomic data 1100, by using the preprocessed data derived therefrom and stored in the genomic pattern analysis database 1114. Operation of the genomic query processing unit is further described below with reference to Fig. 16.
It is appreciated that a basic concept of the genomic pattern analysis engine is to perform as much pre-processing and storing of useful intermediate results as possible before the actual process of pattern analysis, so as to be able afterwards to produce very fast results in response to relatively complicated pattern analysis queries. This approach allows performing complicated genomic data analysis tasks, frequently carried out only by mainframe computers or super-computers, on relatively inexpensive and easily scalable hardware, such as PC server computers. While this approach potentially requires very large databases, with up to billions of records in some cases, and may require significant pre-processing time, it still offers a dramatically more cost-effective alternative to the traditional, extremely expensive parallel processing alternatives.
Reference is now made to Fig. 14, which is a block diagram illustrating a preferred embodiment of the genomic pattern analysis database 1114 of Fig. 13.
As mentioned above, the genomic pattern analysis database preferably comprises four major data elements: proteins 1118, protein related regions 1120, short genomic segments in region (SGSR) 1122 and SGSR-to-SGSR relationships 1124. Preferably, each of these data elements is stored in a table in the database, related to the other tables, as described below. Each of these data elements with its related properties is now described.
Proteins 1118 is a plurality of proteins known to be encoded by the raw genomic data 1100. For each of these proteins, multiple properties relevant to the genomic pattern analysis may be stored, for example the organism 1200 it belongs to, a biological or other function 1202 it is known to perform, and its expression 1204 in a specific organ or tissue.
Each of the proteins 1118 is related to one or more protein related regions 1120. These include a protein coding region 1206, which must be obtained or calculated based on the protein location data 1104. In addition, regions adjacent to the protein may be calculated and stored: pre protein 1208 is a region upstream of the protein coding region 1206, and post protein 1210 is a region downstream of the protein coding region. The coding direction of the protein is required in order to calculate the protein-adjacent regions. Finally, other regions 1212 may also be defined, such as regions of special interest within a specific protein, e.g. regions within the protein coding region which are known or suspected to have a certain functional significance, such as regions correlating to an amino acid sequence which is responsible for specific biological activity in the final protein. Similarly, other regions may be selected, which are not related to a specific protein, in order to analyze genomic patterns within them as well. It is appreciated that some of the definitions used to create the protein related regions 1120 may be semi-arbitrary, and therefore may be defined by the user, as part of the user defined criteria 1108. For example, when analyzing the regions upstream of a protein coding region, a user defined criterion 1108 may define whether this region should extend all the way to the next protein upstream, or whether it should extend only up to a maximal fixed distance from the protein.
Each of the protein related regions 1120 is related to a plurality of short genomic segments in region 1122, which are a key element in the present invention. Short genomic segments in region 1122 is a plurality of preferably all, or most, of the short genomic segments of a given length or range of lengths, as determined by the user defined criteria 1108, which are found in each of the plurality of protein related regions 1120.
It is appreciated that depending on the organism analyzed, the number of protein related regions 1120 created and the number of lengths of short genomic segments desired, the number of short genomic segments in regions 1122 may be billions or tens of billions. As would be clear to one knowledgeable in the art, due to the huge required database size, which may require supporting many billions of records, and in order to achieve good database response, records may preferably be stored in partitioned tables, under a single view, using a database such as MICROSOFT SQL SERVER 2000®.
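The enumeration of short genomic segments in a region may be sketched as follows. The function name `short_segments_in_region` and the toy region are hypothetical; the length range stands in for the user defined criteria, and the offset returned with each segment corresponds to its location relative to the region.

```python
def short_segments_in_region(region: str, min_len: int, max_len: int):
    """Yield (offset, segment) for every sub-string of each requested length."""
    for length in range(min_len, max_len + 1):
        for offset in range(len(region) - length + 1):
            yield offset, region[offset:offset + length]

region = "ATGCGT"                            # toy region, for illustration only
sgsrs = list(short_segments_in_region(region, 3, 4))
for offset, segment in sgsrs:
    print(offset, segment)
```

A region of n nucleotides yields n - L + 1 segments for each length L, which is why parsing every region at every length between, say, 10 and 30 quickly leads to the record counts mentioned above.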
For each short genomic segment in region 1122, several properties are calculated and stored:
Location 1214 is the location, or offset, of the SGSR 1122 relative to the protein related region 1120 in which it is found.
Uniqueness 1216 stores a link to a reference indicating the degree of uniqueness of this SGSR relative to the protein related region 1120 in which it is found, or relative to multiple protein related regions 1120. For example, one SGSR 1122 may be unique in the protein related region 1120 it appears in, i.e. it appears in that region only once. Another SGSR 1122 may appear in a specific protein coding region 1206, and may be unique relative to all protein coding regions 1206. Yet another SGSR 1122 may appear only 3 times in the pre protein regions 1208 of all proteins 1118 which have a similar expression 1204, such as proteins expressing in nerve cells - this may still be considered significant by a user 1126 of the system, and may be queried.
Similarly, commonality 1218 stores an indication, or a link to an indication, as to the commonality of an SGSR 1122 relative to two or more protein related regions 1120. For example, it may be of interest to find out and mark all SGSRs 1122 which appear in a pre protein region of proteins which have a similar function, as a starting point for attempting to assess which short genomic segments may be active as 'triggers' in controlling expression of these proteins.
Each SGSR may be associated with one or more, possibly many, criteria flags 1220, each of which stores an indication or a link to an indication of any compound condition which this SGSR meets. A criteria flag 1220 may be used in order to 'mark' all SGSRs meeting a certain query, so that they can later be retrieved quickly and easily. For example, a criteria flag 1220 may be created to indicate any SGSR which appears in the post protein region 1210 of all proteins 1118 having a first function 1202, and does not appear in the post protein region 1210 of any proteins 1118 having a second function 1202. Since each short-genomic-segment record may be associated with multiple criteria flags, preferably each criterion is a record in a criteria table (not shown in Fig. 14), which is linked to multiple criteria-flag records 1220, each of which is linked to one short-genomic-segment record 1122.
The 'criteria flag' mechanism therefore allows extremely fast retrieval of all short-genomic-segments which comply with any combination of complex queries, which have been applied at the pre-processing phase.
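The retrieval enabled by criteria flags may be sketched as set operations over flags computed in advance. The in-memory dictionary, the function names `flag` and `retrieve`, and the criterion names are all hypothetical; in the described embodiment the flags are records in database tables rather than Python sets.

```python
from collections import defaultdict

# criterion name -> set of SGSR record ids flagged during pre-processing
flags = defaultdict(set)

def flag(criterion, sgsr_ids):
    """Mark the given SGSR records as meeting a criterion."""
    flags[criterion].update(sgsr_ids)

def retrieve(all_of, none_of=()):
    """SGSR ids carrying every flag in all_of and none of the flags in none_of."""
    if not all_of:
        return set()
    result = set.intersection(*(flags.get(c, set()) for c in all_of))
    for c in none_of:
        result = result - flags.get(c, set())
    return result

# Hypothetical flags set once, at pre-processing time.
flag("in_post_region_of_first_function", {1, 2, 3})
flag("in_post_region_of_second_function", {3, 4})
flag("unique_in_region", {2, 3, 5})

print(retrieve(
    all_of=["in_post_region_of_first_function", "unique_in_region"],
    none_of=["in_post_region_of_second_function"],
))
# prints {2}
```

Since each flag is computed once, a query combining several criteria reduces to intersecting and subtracting sets of record ids that are already stored.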
Finally, each short genomic segment in region 1122 may be associated with one or more SGSR-to-SGSR relationships 1124. Each SGSR-to-SGSR relationship 1124 stores information on the relation between two or more SGSRs.
SGSR-to-SGSR relationships 1124 may document the proximity 1222 of two SGSRs from each other, i.e. the difference between their respective location 1214 values; or their nucleotide sequence similarity 1224, or any other 1226 parameter by which they may be compared.
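One possible table layout for the data elements of Fig. 14 is sketched below. It is an assumption-laden illustration: SQLite, through Python's standard `sqlite3` module, is used here only because it is self-contained, in place of the database system named in the text, and all table and column names, as well as the sample rows, are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT, organism TEXT,
                      biological_function TEXT, expression TEXT);
CREATE TABLE region (id INTEGER PRIMARY KEY,
                     protein_id INTEGER REFERENCES protein(id),
                     kind TEXT);  -- e.g. coding, pre_protein, post_protein
CREATE TABLE sgsr (id INTEGER PRIMARY KEY,
                   region_id INTEGER REFERENCES region(id),
                   location INTEGER, segment TEXT);
CREATE TABLE criterion (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE criteria_flag (criterion_id INTEGER REFERENCES criterion(id),
                            sgsr_id INTEGER REFERENCES sgsr(id));
CREATE TABLE sgsr_relationship (sgsr_a INTEGER REFERENCES sgsr(id),
                                sgsr_b INTEGER REFERENCES sgsr(id),
                                proximity INTEGER);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

# Hypothetical sample data: one protein, one region, two SGSRs.
db.execute("INSERT INTO protein VALUES (1, 'protein A', 'organism X', 'function Y', 'nerve')")
db.execute("INSERT INTO region VALUES (1, 1, 'pre_protein')")
db.executemany("INSERT INTO sgsr VALUES (?, ?, ?, ?)",
               [(1, 1, 0, "ATGC"), (2, 1, 7, "GGTA")])

# Proximity: difference between the locations of two SGSRs in the same region.
db.execute("""
    INSERT INTO sgsr_relationship
    SELECT a.id, b.id, b.location - a.location
    FROM sgsr a JOIN sgsr b ON a.region_id = b.region_id AND a.id < b.id
""")
rows = db.execute("SELECT sgsr_a, sgsr_b, proximity FROM sgsr_relationship").fetchall()
print(rows)   # prints [(1, 2, 7)]
```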
Reference is now made to Fig. 15, which is a flowchart illustrating operation of the genomic pre-processing unit 1112 of Fig. 13 in accordance with a preferred embodiment of the present invention. The genomic pre-processing unit 1112 of Fig. 13 processes the raw genomic data 1100, in several steps described below, and populates the genomic pattern analysis database 1114.
Preprocessing begins by acquiring the genomic data to be processed, including raw genomic data 1100, other genomic data 1106 and user defined criteria 1108 of Fig. 13.
Proteins known to be encoded by the raw genomic data 1100 are stored in the genomic pattern analysis database 1114, and are classified according to various attributes which are deemed relevant to the genomic data analysis process, such as the organism 1200 they belong to, their biological or other function 1202, and their organ or tissue specific expression 1204 of Fig. 14.
For each of the proteins 1118, the protein coding region 1206 of Fig. 14 is calculated based on the protein location data 1104 of Fig. 13.
In a preferred embodiment of the present invention, the protein coding regions are also normalized, i.e. if the direction of the protein is known to be right to left, the region is reversed, so as to be read from left to right, and inverted, replacing every 'A' with a 'T', every 'C' with a 'G', and vice versa. As is known in the art, some proteins are coded on the positive strand of the DNA, and are therefore 'read' from left to right from the sequenced DNA 1102, whereas others are encoded on the negative strand, and are therefore 'read' from right to left and appear 'inverted' in the sequenced DNA 1102.
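The normalization step may be sketched as follows. The function name `normalize` and its `right_to_left` parameter are hypothetical, and leaving an unknown nucleotide 'N' unchanged is an assumption of this sketch.

```python
# A <-> T and C <-> G; 'N' is assumed to stay 'N'.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def normalize(region: str, right_to_left: bool) -> str:
    """Return the region as read left to right on its coding strand."""
    if not right_to_left:
        return region                        # positive strand: already normalized
    return region[::-1].translate(COMPLEMENT)   # reverse, then invert

print(normalize("ATGC", right_to_left=True))    # prints GCAT
```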
Depending on the goal of the genomic analysis requested, regions upstream and downstream of each protein, designated by reference numerals 1208 and 1210 of Fig. 14 respectively, may also be calculated, normalized and stored, as may also other regions 1212 of interest. Protein related regions 1120 are stored in the database, and are each linked to the protein related thereto.
Each of the protein related regions 1120 is then parsed in order to find preferably all short genomic segments of a given length located in that region. The results are stored as short genomic segments in region 1122 of Fig. 14. The length/s of the short genomic segments to be analyzed is determined as part of the user defined criteria. Next, various queries may be performed, in order to further determine properties of short genomic segments in region 1122, which are deemed material to the desired direction of genomic sequence data analysis. SGSRs meeting a queried criterion may be 'flagged' for future use, using the uniqueness 1216, commonality 1218 or other criteria flags 1220 of Fig. 14.
Finally, SGSR-to-SGSR relationships 1124 between two or more SGSRs are determined, such as their proximity 1222, similarity 1224, or any other attribute 1226 by which they may be compared.
Reference is now made to Fig. 16, which is a block diagram illustrating operation of the genomic query processing unit 1116 of Fig. 13, constructed and operative in accordance with a preferred embodiment of the present invention.
The genomic query processing unit 1116 allows a user 1126 of Fig. 13 to perform complex pattern analysis of genomic sequence data, based on the preprocessed data stored in the genomic pattern analysis database 1114.
First, qualifying properties to be used in analysis of genomic data are obtained from the user 1126, as indicated by reference numeral 1400. These may include short segment properties 1402, short segment in region properties 1404, protein properties 1406 and other pre-defined criteria 1408.
Next, the genomic pattern analysis database 1114 is queried according to the qualifying properties obtained by the previous step, as indicated by reference numeral 1410.
Optionally, this step of querying the database according to qualifying properties may also comprise a short genomic segment similarity comparison mechanism, as indicated by reference numeral 1412 marked by broken line. Such a mechanism may enable querying the database for short genomic segments which are similar but not identical to the short segments specified by any of the qualifying properties indicated by reference numerals 1402-1408. Such a mechanism may be advantageous since, as is well known in the art, it is frequently the case that a genomic motif may appear in slight variations while still maintaining its biological functionality. Accordingly, various algorithms are known in the art for identifying genomic string similarity, and may therefore be used here. Alternatively, a 'translation-table' may be created, in which for each short genomic segment, all of the possible variants, e.g. of up to a small number of mistakes such as 2 mistakes, are listed. Such a mechanism may be very efficient especially for short segments, such as 3-7 nucleotides long, since the number of possible permutations is not very large.
A system and method for genomic sequence similarity comparison which is deemed especially suited to perform the function of short genomic segment similarity comparison mechanism 1412 is described in a co-pending US Provisional Patent application 60/329,115.
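The 'translation-table' alternative may be sketched as follows. The sketch assumes that a 'mistake' is a single-nucleotide replacement only, so that additions and deletions are not covered, and the function name `variants` is hypothetical.

```python
from itertools import combinations, product

def variants(segment: str, max_mistakes: int = 2, alphabet: str = "ACGT") -> set:
    """All segments differing from `segment` by at most max_mistakes replacements."""
    out = {segment}
    for k in range(1, max_mistakes + 1):
        for positions in combinations(range(len(segment)), k):
            # At each chosen position, try every nucleotide except the original.
            choices = [[c for c in alphabet if c != segment[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(segment)
                for p, c in zip(positions, replacement):
                    chars[p] = c
                out.add("".join(chars))
    return out

# A translation-table maps each short segment to its listed variants.
translation_table = {seg: variants(seg) for seg in ["ATG", "CCG"]}
print(len(translation_table["ATG"]))   # prints 37
```

For a segment of length n and up to 2 replacements, the table holds 1 + 3n + 9·n(n-1)/2 entries per segment, i.e. 37 for n = 3 and 211 for n = 7, which supports the observation that the approach suits short segments.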
Finally, results which fit the qualifying properties 1402-1408 are retrieved from the database and are delivered, as indicated by reference numeral 1414. These include qualifying short genomic segments 1416, qualifying regions 1418 and qualifying proteins 1420.
Reference is now made to Fig. 17, which is a simplified illustration of an example of genomic pattern analysis performed by a preferred embodiment of the present invention. Fig. 17 provides a genomic analysis example which may be conducive to a better understanding of the usefulness and operation of the present invention.
In the given example, protein A, protein B, protein C, protein D, protein E and protein F are six proteins known to be coded by the raw genomic data 1100 entered into genomic pattern analysis engine 1110 of Fig. 13. Of these proteins, proteins A and C are known to have a specific biological function, which the user 1126 considers desirable, and protein B is known not to have this specific desired biological function. The biological function of proteins D, E and F is unknown.
In the given example, the initial goal of the genomic pattern analysis is to find a genomic sequence pattern common to the coding regions of proteins A and C, and which is not found in the coding region of protein B. If such a genomic pattern is found, then the final goal is to find other proteins, the function of which is at present unknown, such as proteins D, E and F in the given example, which display a genomic pattern similar to that found in the initial step. The rationale is that the genomic pattern common to proteins known to have a desired function may serve as a predictor for finding other proteins, the function of which is at present unknown, and which might be expected to have the desired function. The genomic preprocessing unit 1112 preprocesses the raw genomic data 1100, and using the sequenced DNA 1102 and the protein location 1104 of these six proteins relative to the sequenced DNA 1102, and their coding direction (left-to-right or right-to-left), calculates and preferably normalizes the protein coding region 1206 of Fig. 14.
The resulting protein coding regions, Protein A coding region 1500, protein B coding region 1502, protein C coding region 1504, protein D coding region 1506, protein E coding region 1508 and protein F coding region 1510 are illustrated in simplified form in Fig. 17. Coding regions of proteins D, E and F, designated by reference numerals 1506, 1508 and 1510, the biological function of which is unknown, are shown in broken line format. Preferably all protein related regions for preferably all other proteins known to be encoded by raw genomic data 1100 are processed in a similar manner. For clarity of explanation, the given example now focuses on these six proteins alone.
The genomic pre-processing unit further processes the protein coding regions of proteins A, B, C, D, E and F designated by reference numerals 1500-1510, in order to find and store preferably all short genomic segments, of a given length, e.g. 10 nucleotides long, in each of these protein coding regions. As mentioned above, the length of SGSR to be used is a matter of user preference, and is determined by user defined criteria 1108 of Fig. 13.
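The step of finding all short genomic segments of a given length in a coding region may be sketched as a sliding window over the region. This is an illustrative assumption about how such segments could be enumerated; the function name `short_segments` and the toy region are hypothetical.

```python
from collections import defaultdict

def short_segments(region, length=10):
    """Map each distinct segment of `length` nucleotides to the list of
    0-based offsets at which it occurs in `region`."""
    offsets = defaultdict(list)
    for start in range(len(region) - length + 1):
        offsets[region[start:start + length]].append(start)
    return dict(offsets)

# Toy coding region, made up for illustration, with segment length 3.
index = short_segments("ATGATGC", length=3)
```

Storing the offsets alongside each segment corresponds loosely to a location property, so that later queries can reason about where each segment lies within its region.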
For clarity of explanation, Fig. 17 illustrates only six short genomic segments found in the protein coding regions of these proteins: SGSR-I, SGSR-II, SGSR-III, SGSR-IV, SGSR-V and SGSR-VI. It is appreciated that in reality there is a very large number, such as tens of thousands, of short genomic segments found in each protein related genomic region. This number depends upon the size of the region and the number of different SGSR lengths which the user 1126 decides to use.
As illustrated by Fig. 17, SGSR-I, SGSR-II and SGSR-III are found in protein A coding region 1500; SGSR-IV, SGSR-III, SGSR-II and SGSR-V are found in protein B coding region 1502; SGSR-VI, SGSR-I, SGSR-II and SGSR-III are found in protein C coding region 1504; none of the six SGSRs are found in protein D coding region 1506; SGSR-III, SGSR-II and SGSR-V are found in protein E coding region 1508; and SGSR-I, SGSR-II and SGSR-III are found in protein F coding region 1510. As mentioned above, the initial step of the genomic pattern analysis searches for a pattern common to proteins A and C, and not to protein B. As illustrated by Fig. 17, SGSR-I, SGSR-II and SGSR-III form a pattern which appears in protein A coding region 1500 and protein C coding region 1504, but not in protein B coding region 1502.
Technically, the commonality property of SGSR, designated by reference numeral 1218 of Fig. 14, may be used to 'flag' all short genomic segments in regions 1122 which are common to the protein coding regions 1206 of all proteins 1118 sharing the same desired function 1202, i.e. proteins A and C in the given example. All SGSRs common to coding regions of all proteins which share the lack of that desired function may be 'flagged' accordingly as well, using a different commonality 1218 'flag'. In the given example only protein B is shown as a representative of that group of proteins. It is then easy to find all SGSRs which are common to A and C but not to B.
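The commonality query may be sketched with plain set operations, assuming each coding region is reduced to the set of SGSRs found in it. The data are those listed for Fig. 17; the function name `discriminating_segments` is hypothetical and a database implementation would differ.

```python
# SGSRs found in each coding region, as listed for Fig. 17.
regions = {
    "A": {"I", "II", "III"},
    "B": {"IV", "III", "II", "V"},
    "C": {"VI", "I", "II", "III"},
}

def discriminating_segments(regions, with_function, without_function):
    """SGSRs common to every region in `with_function` and absent from
    every region in `without_function`."""
    common = set.intersection(*(regions[p] for p in with_function))
    excluded = set().union(*(regions[p] for p in without_function))
    return common - excluded

result = discriminating_segments(regions, ["A", "C"], ["B"])
```

On this data only SGSR-I passes the presence/absence test, since SGSR-II and SGSR-III also occur in protein B coding region 1502; distinguishing those requires their relative locations.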
Next, a more elaborate pattern may be sought, for example by analyzing the location of the SGSRs which seem to be potentially significant relative to each other, or to the region in which they are found.
In the given example, although SGSR-III and SGSR-II do appear as well in protein B coding region 1502, they do not appear in the same pattern, e.g. the location of SGSR-III relative to SGSR-II is different in protein B coding region than it is in coding regions of proteins A and C. This may easily be queried from the genomic pattern analysis database 1114, using the location property of SGSR, designated by reference numeral 1214 of Fig. 14, which preferably stores the location, i.e. offset, of the SGSR relative to the region in which it is found.
Further, in complex pattern analysis tasks, the SGSR-to-SGSR relationships 1124 of Fig. 14 may be very useful. In the given example, multiple such relationship records may be formed, which document the proximity 1222 of Fig. 14 of each of the pairs of SGSRs which are suspected as being potentially significant: SGSR-I-to-SGSR-II, SGSR-II-to-SGSR-III and SGSR-I-to-SGSR-III. This provides an efficient means of finding all instances in which several SGSRs are not only found in proximity, but form a pattern relative to one another.
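A relationship record of this kind may be sketched as a tuple holding a pair of SGSRs and the distance between their offsets. The offsets below are made up for illustration, and the record layout and the name `proximity_records` are assumptions rather than the structure of Fig. 14.

```python
from itertools import combinations

def proximity_records(offsets):
    """Given {sgsr_name: offset_in_region}, return one record per pair:
    (first, second, signed distance from first to second)."""
    records = []
    for a, b in combinations(sorted(offsets), 2):
        records.append((a, b, offsets[b] - offsets[a]))
    return records

# Made-up offsets for SGSR-I, SGSR-II and SGSR-III in one coding region.
records = proximity_records({"I": 40, "II": 55, "III": 90})
```

Keeping the distance signed preserves the relative order of the two segments, which matters when the same pair appears in a different arrangement in another region.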
Finally, the criteria flags 1220 of Fig. 14 may be used to flag all SGSRs which comply with a very complex query. In the given example a criteria flag may be formed to 'flag' all SGSRs which are: (a) common to a first group of proteins, (b) do not appear in a second group, (c) have a location property indicating they appear close together, and (d) appear as part of a certain SGSR-to-SGSR relationship with a certain proximity value.
Once a potentially meaningful pattern has been identified, additional protein coding regions are examined to find ones in which a similar pattern is found. In the given example, protein coding regions of proteins D, E and F are thus examined. Protein F coding region 1510, shown marked in bold broken line, is the only one which displays a similar pattern of SGSRs to that observed in the coding regions of proteins A and C. Protein D coding region shows none of the three SGSRs suspected as significant, and protein E coding region shows two of the significant three, but not in the same pattern.
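The final screening step may be sketched as follows, under the assumption that the order in which the SGSRs are listed for Fig. 17 reflects their left-to-right order in each coding region, and that 'the same pattern' means the three SGSRs occurring in that order. Both are simplifying assumptions, and `has_pattern` is a hypothetical name.

```python
# SGSRs of each coding region in the order listed for Fig. 17
# (assumed here to be their left-to-right order).
regions = {
    "A": ["I", "II", "III"],
    "B": ["IV", "III", "II", "V"],
    "C": ["VI", "I", "II", "III"],
    "D": [],
    "E": ["III", "II", "V"],
    "F": ["I", "II", "III"],
}

def has_pattern(segments, pattern):
    """True if `pattern` occurs in `segments` as an ordered subsequence."""
    remaining = iter(segments)
    # `name in remaining` advances the iterator, so order is enforced.
    return all(name in remaining for name in pattern)

pattern = ["I", "II", "III"]
matches = [p for p in ("D", "E", "F") if has_pattern(regions[p], pattern)]
```

Under these assumptions the pattern holds for proteins A and C, fails for protein B, and among the proteins of unknown function is found only in protein F.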
It is appreciated that the example provided in Fig. 17 is a much simplified one, providing only a general illustration of how the architecture of the genomic sequence analysis database 1114 may be beneficially utilized in performing complex genomic pattern analysis tasks, and of the usefulness of such analysis. It is further appreciated that genomic pattern analysis is often a highly complex task, often requiring a long, iterative, and somewhat creative process of trial-and-error.
Reference is now made to Fig. 18, which is a simplified block diagram illustrating a computer application constructed and operative in accordance with another preferred embodiment of the present invention.
Each of a plurality of strings 1800 is compressed by a compression mechanism 1802, collectively yielding a respective plurality of compressed strings 1804.
The compressed strings 1804 are stored in a plurality of records in a table in a database 1806, and a compressed strings index 1808 is constructed, which indexes the compressed strings 1804.
A target string 1810, one or more similarity criteria 1811, and a query condition 1812 relating to the target string 1810 are provided by a user of the database 1806, in order to find all strings 1800 in the database 1806 which comply with the provided query condition 1812, as it is applied to strings similar to the target string 1810, to a degree specified by the similarity criteria 1811. The target string 1810 is compressed by a compression mechanism 1814, which may be similar to compression mechanism 1802, into a compressed target string 1816.
The compressed target string 1816 and the similarity criteria 1811 are passed on as input to a compressed string similarity search module 1818. The compressed string similarity search module, in conjunction with the compressed string index 1808, is operative to query the database 1806, retrieving a plurality of compressed strings which comply with the query condition 1812, as applied to strings which are similar to the target string 1810, to a degree defined by similarity criteria 1811, in compressed form. These results are designated compressed similar to target query results 1820. The compressed string similarity search module 1818 is further described below with reference to Figs. 27, 28, 29A, 29B and 29C.
Each of the compressed similar to target query results 1820 is decompressed by a decompression mechanism 1822, collectively yielding a respective plurality of strings similar to target 1824.
Preferred embodiments of the compression mechanisms 1802 and 1814 and of the decompression mechanism 1822, which preferably reverses the process of these compression mechanisms, are further described below with reference to Fig. 19.
It is appreciated that while the end result is retrieval of strings which are similar to the target string 1810, the actions of string similarity comparison and retrieval are preferably performed on compressed strings: comparing compressed target string 1816 with compressed strings 1804, using the compressed string index 1808. An important aspect of the present invention is that it allows determining the level of similarity between strings, by comparing compressed strings to which these strings correspond.
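The idea of measuring similarity directly on compressed data can be illustrated by a minimal sketch. It assumes a simple packing of two bits per nucleotide for equal-length strings containing only A, C, G and T, and counts mismatching positions by XOR. This is one possible technique chosen for illustration, not necessarily the method of the compressed string similarity search module 1818; the names `pack` and `mismatches` are hypothetical.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a string of A/C/G/T into an integer, two bits per character."""
    value = 0
    for ch in seq:
        value = (value << 2) | CODES[ch]
    return value

def mismatches(packed_a, packed_b, length):
    """Count differing positions between two packed strings of equal
    length without unpacking them."""
    diff = packed_a ^ packed_b
    count = 0
    for _ in range(length):
        if diff & 0b11:      # any differing bit in this pair is a mismatch
            count += 1
        diff >>= 2
    return count
```

Because identical characters produce identical bit-pairs, equal positions cancel under XOR and only the differing positions remain to be counted.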
Reference is now made to Fig. 19, which is a simplified block diagram illustrating a mechanism for compression and decompression of data. This mechanism is a preferred implementation of the compression mechanisms 1802 and 1814 and of the decompression mechanism 1822 described hereinabove with reference to Fig. 18.
An uncompressed string 1900 is given as an example of one of the strings 1800 or of the target string 1810 described hereinabove with reference to Fig. 18. In order to better explain the usefulness of the present invention, an example is used throughout the description of the present invention, of an application of the present invention to storage and retrieval, compression and string similarity of genomic sequence data. Genomic sequence data is typically represented as alphanumeric strings, each character representing one of four nucleotides, A, T, C, and G, and unknown nucleotides represented by N. The letter N typically appears in genomic sequence data much less frequently than do the other four letters. Therefore, in the context of this example, 'N' is an example of a 'rare character'. The general idea is that all frequently used characters will be compressed to the maximum, whereas there is a provision for one or more rare characters to be represented, even if not compressed, or less compressed. In the genomic example, there is only one rare character, but it is realized that the present invention is not limited by this example, and more rare characters may be represented, as described hereinbelow with reference to Fig. 30.
The uncompressed string 1900 comprises a plurality of bytes, each storing a character: A, T, C, or G. For simplicity of explanation, the uncompressed string 1900 shown in Fig. 19 comprises only three bytes: BYTE I, BYTE II and BYTE III, each storing a character: CHR 1, CHR 2 and CHR 3, respectively.
The uncompressed string 1900 is compressed by a compression engine 1902 into a compressed string 1904. Operation of a preferred embodiment of the compression engine 1902 is further described below with reference to Fig. 22.
The compression engine 1902 employs a compression table 1906 in compression of uncompressed string 1900 into compressed string 1904. The compression table 1906 is preferably a translation table and holds a list of all possible or reasonable 3-alphanumeric-character combinations and, for each such combination, the byte or bytes into which it may be compressed. A preferred embodiment of the compression table 1906 is further described below with reference to Fig. 24A.
The compressed string 1904 preferably comprises one or more compressed bytes 1908. For simplicity of explanation, the compressed string 1904 shown in Fig. 19 comprises only one compressed byte 1908, which represents in compressed form all three characters which are represented in the uncompressed string 1900, CHR 1, CHR 2 and CHR 3, which characters require three bytes of storage, BYTE I, BYTE II and BYTE III, in their original uncompressed form. It is appreciated that compressed string 1904, which comprises a plurality of compressed bytes 1908, may alternatively be stored as one or more integers. For example, an Integer is typically defined as a data-type comprising 4 bytes. It is therefore possible to compress an uncompressed string 1900 comprising 12 characters into 4 compressed bytes 1908, and then to store all 4 resulting compressed bytes 1908 as one Integer. Longer strings may be compressed into longer Integers, such as the BigInt data type in MS SQL-2000®, which comprises 8 bytes, or as a plurality of Integers or BigInts. For example, a string comprising 48 characters may be compressed into 2 BigInts.
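The 12-character case may be sketched as follows, assuming the header and bit-pair values of Figs. 21A and 21B (header '11' for three characters; A='00', C='01', G='10', T='11') and assuming BIT I is the most significant bit. The function names are hypothetical.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def compress_triplet(triplet):
    """Encode exactly three known characters into one byte, header '11'."""
    a, b, c = (CODES[ch] for ch in triplet)
    return (0b11 << 6) | (a << 4) | (b << 2) | c

def compress_to_int(seq):
    """Compress a 12-character string of A/C/G/T into one 4-byte integer."""
    if len(seq) != 12:
        raise ValueError("this sketch expects exactly 12 characters")
    data = bytes(compress_triplet(seq[i:i + 3]) for i in range(0, 12, 3))
    return int.from_bytes(data, "big")   # unsigned, big-endian

value = compress_to_int("ATGATGATGATG")
```

The result is treated here as an unsigned 32-bit value. Since the header '11' sets the top bit of every byte, storing it in a signed 4-byte column would require reinterpreting the bit pattern as a negative number.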
It is further appreciated that a TinyInt data type, which comprises one byte of memory, may be used to store compressed byte 1908. In this configuration, when storing strings which are longer than 3 or 4 characters, i.e. are compressed into more than one TinyInt, it is possible to store a plurality of TinyInt type fields, each representing in compressed form 3 or 4 uncompressed characters. It is then possible to create an indexed View which indexes all these fields together, as one.
The compressed byte 1908 is further described below with reference to Figs. 20, 21A, 21B, and 21C.
The compressed string 1904 may be decompressed by a decompression engine 1910 back to uncompressed string 1900, preferably by reversal of the methodology of the compression engine 1902. Operation of the decompression engine 1910 is further described below with reference to Fig. 25.
Decompression engine 1910 employs a decompression table 1912 in decompressing compressed string 1904 into uncompressed string 1900. The decompression table 1912 is a translation table and holds a list of the bit-values for each possible or reasonable compressed byte 1908 and for each such compressed byte, the alphanumeric string, containing up to three characters, into which it may be decompressed. The decompression table 1912 is further described below with reference to Fig. 24B.
Reference is now made to Fig. 20A, which is a simplified illustration of a preferred implementation of a byte bitmap preferably employed in generating the compressed byte 1908 of Fig. 19. The compressed byte 1908 of Fig. 19 preferably comprises 8 bits: BIT I, BIT II, BIT III, BIT IV, BIT V, BIT VI, BIT VII & BIT VIII. Preferably, these bits are divided into four groups, each containing 2 bits. A HEADER typically contains BIT I and BIT II, while a BIT-PAIR I typically contains BIT III and BIT IV, a BIT-PAIR II typically contains BIT V and BIT VI and a BIT-PAIR III typically contains BIT VII and BIT VIII.
The HEADER preferably stores information about what the other bits in the compressed byte, BIT III - BIT VIII, represent. In a preferred embodiment of the present invention the compression of data is such that the compressed byte may either store up to three characters, or may store up to three rare characters, e.g. 'N's in the genomic example, but preferably not a combination of characters and N's. The HEADER stores information which indicates how many characters the compressed byte 1908 represents, one, two or three, or alternatively if the entire compressed-byte represents one or more 'N's. The values assignable to bits of the HEADER are further described below with reference to Fig. 21A.
BIT-PAIR I, BIT-PAIR II, and BIT-PAIR III each contain 2 bits which are capable, when taken together, of representing one of four possible characters, A, T, C, and G. The values assignable to the bits of each of the characters, BIT-PAIR I, BIT-PAIR II, and BIT-PAIR III, are further described below with reference to Fig. 21B.
When the entire compressed byte 1908 represents one or more N's, and does not represent any characters, values assigned to the two bits of BIT-PAIR I, determine whether the compressed byte 1908 represents one, two or three N's. Values assignable to bits of BIT-PAIR I, determining the number of N's that the compressed byte 1908 represents, are further described below with reference to Fig. 21C.
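The byte layout may be sketched by splitting an 8-bit value into its four 2-bit groups. The sketch assumes that BIT I is the most significant bit of the byte, which the specification does not state explicitly; `split_byte` is a hypothetical name.

```python
def split_byte(byte):
    """Split an 8-bit value into (HEADER, BIT-PAIR I, BIT-PAIR II,
    BIT-PAIR III), taking BIT I as the most significant bit."""
    header = (byte >> 6) & 0b11   # BIT I and BIT II
    pair1 = (byte >> 4) & 0b11    # BIT III and BIT IV
    pair2 = (byte >> 2) & 0b11    # BIT V and BIT VI
    pair3 = byte & 0b11           # BIT VII and BIT VIII
    return header, pair1, pair2, pair3
```

How the three bit-pairs are interpreted then depends entirely on the header value.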
Reference is now made to Fig. 20B, which is a simplified illustration of an alternative preferred implementation of a byte bitmap preferably employed in generating the compressed byte 1908 of Fig. 19.
Alternative compressed byte 2000 is an alternative byte bitmap which may be used for compression of data, instead of the byte bitmap of compressed byte 1908 depicted in Fig. 20A.
Alternative compressed byte 2000 comprises four bit-pairs, BIT-PAIR I, BIT-PAIR II, BIT-PAIR III and BIT-PAIR IV, rather than only three bit-pairs as in compressed byte 1908 of Fig. 20A. Unlike compressed byte 1908 of Fig. 20A, in which BIT I and BIT II function as a HEADER, alternative compressed byte 2000 does not comprise any such header. All 8 bits of alternative compressed byte 2000 function as one of four bit-pairs, each of said bit-pairs representing a character. Alternative compressed byte 2000 is therefore capable of representing 4 characters in compressed form, as opposed to compressed byte 1908 of Fig. 20A, which is capable of representing 3 characters.
It is appreciated that the absence of a 'header' in alternative compressed byte 2000 does not allow alternative compressed byte 2000 to specify the number of characters it represents, or whether it represents one or more rare characters. Alternative compressed byte 2000 may be useful when compressing strings which do not include rare characters, and are of a fixed length. If the length of the uncompressed string 1900 is known, then it is possible to ignore the possible trailing zeros at the right end of the alternative compressed byte 2000, which do not represent a character, but rather represent a blank.
For example, an uncompressed string 1900 which is known to be 7 characters long may be compressed into 2 alternative compressed bytes 2000: the first containing in compressed form 4 characters, and the second containing 3 characters. In this second alternative compressed byte 2000, BIT VII and BIT VIII of BIT-PAIR IV contain zeros which are ignored because the uncompressed string is known to be 7 characters long, despite the absence of a 'header' which would explicitly instruct to ignore these bits.
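The header-less layout may be sketched as follows for strings of known length. It is assumed that the bit-pair values of Fig. 21B (A='00', C='01', G='10', T='11') also apply to alternative compressed byte 2000 and that BIT I is the most significant bit; the function names are hypothetical.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
LETTERS = "ACGT"   # index equals the bit-pair value

def compress_fixed(seq):
    """Pack four characters per byte; the last byte is zero-padded."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODES[ch]
        byte <<= 2 * (4 - len(chunk))   # unused trailing bit-pairs stay zero
        out.append(byte)
    return bytes(out)

def decompress_fixed(data, length):
    """Unpack, using the known `length` to discard padding bit-pairs."""
    chars = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            chars.append(LETTERS[(byte >> shift) & 0b11])
    return "".join(chars[:length])
```

Without the known length, a padding pair '00' would be indistinguishable from a trailing 'A', which is why this layout suits fixed-length strings.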
Reference is now made to Fig. 21A, which is a table illustrating preferred values assignable to BIT I and BIT II, both belonging to the HEADER of compressed byte 1908 shown in Fig. 20A.
Assigning the value '00' to the bits of the HEADER, i.e. assigning '0' to BIT I and assigning '0' to BIT II of compressed byte 1908 shown in Fig. 20A, signifies that the entire compressed byte 1908 represents only one or more rare characters, i.e. 'N', and does not represent any known characters. In a non-genomic embodiment of the present invention, it is possible to specify a plurality of different rare characters, such as up to 64, which would be represented by an entire byte, when the value of the header is '00'. Assigning the value '01' to the bits of the HEADER, i.e. assigning '0' to BIT I and '1' to BIT II of compressed byte 1908 shown in Fig. 20A, signifies that the compressed byte 1908 represents only one character, as represented by the values in BIT III and BIT IV, both belonging to BIT-PAIR I of compressed byte 1908. The remaining four bits of the compressed byte 1908, BIT V, BIT VI, BIT VII and BIT VIII, are to be ignored and do not represent any additional character.
Assigning the value '10' to the bits of the HEADER, i.e. assigning '1' to BIT I and '0' to BIT II of compressed byte 1908 shown in Fig. 20A, signifies that the compressed byte 1908 represents two characters; the first character being represented by the values in BIT III & BIT IV, both belonging to BIT-PAIR I, and the second character being represented by values in BIT V & BIT VI, both belonging to BIT-PAIR II of compressed byte 1908. The remaining two bits of the compressed byte 1908, BIT VII & BIT VIII, are to be ignored and do not represent any additional character.
Assigning the value of '11' to the bits of the HEADER, i.e. assigning '1' to BIT I and '1' to BIT II of compressed byte 1908, signifies that the compressed byte 1908 represents three characters; the first character being represented by values in BIT III & BIT IV both belonging to BIT-PAIR I, the second character being represented by values in BIT V & BIT VI both belonging to BIT-PAIR II, and the third character being represented by values in BIT VII & BIT VIII, both belonging to BIT-PAIR III.
Reference is now made to Fig. 21B, which is a table illustrating the preferred values assignable to the character-representing bits: BIT III, BIT IV, BIT V, BIT VI, BIT VII, & BIT VIII of compressed byte 1908 shown in Fig. 20A.
As mentioned above with reference to Fig. 20A, each of BIT-PAIR I, BIT-PAIR II and BIT-PAIR III in compressed byte 1908 comprises a pair of bits: BIT III & BIT IV, BIT V & BIT VI, and BIT VII & BIT VIII respectively. The values presented in Fig. 21B are values which may be assigned to each of the above mentioned pairs of bits so as to allow each of these pairs of bits to represent one of the four possible genomic characters: A, T, C, or G.
Assigning the value '00' to any of the three bit-pairs representing one of the three characters, i.e. assigning '0' to BIT III and '0' to BIT IV, or assigning '0' to BIT V and '0' to BIT VI, or assigning '0' to BIT VII and '0' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, of the compressed byte 1908 of Fig. 20A, represents the character 'A'.
Assigning the value of '01' to any of the three bit-pairs representing one of the three characters, i.e. assigning '0' to BIT III & '1' to BIT IV, or assigning '0' to BIT V & '1' to BIT VI, or assigning '0' to BIT VII & '1' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the character 'C'.
Assigning the value of '10' to any of the three bit-pairs representing one of the three characters, i.e. assigning '1' to BIT III & '0' to BIT IV, or assigning '1' to BIT V & '0' to BIT VI, or assigning '1' to BIT VII & '0' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the character 'G'.
Assigning the value of '11' to any of the three bit-pairs representing one of the three characters, i.e. assigning '1' to BIT III & '1' to BIT IV, or assigning '1' to BIT V & '1' to BIT VI, or assigning '1' to BIT VII & '1' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the character 'T'.
It is appreciated that notwithstanding the above-stated significance of the above-mentioned assignable values, the significance of values assigned to any of the six bits potentially representing characters, BIT III - BIT VIII, always depends on the values assigned to the bits of the HEADER, as explained above with reference to Fig. 21A. For example, the value '00' in BIT VII and BIT VIII respectively signifies an 'A' in BIT-PAIR III only if the value assigned to the HEADER bits is '11', signifying that the byte represents three characters. Otherwise, the values assigned to BIT VII and BIT VIII are ignored.
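The dependence of the bit-pairs on the HEADER may be sketched by a decoder for bytes that hold known characters. The sketch uses the values of Figs. 21A and 21B and assumes BIT I is the most significant bit; `decode_known` is a hypothetical name.

```python
LETTERS = "ACGT"   # index equals the bit-pair value: 00=A, 01=C, 10=G, 11=T

def decode_known(byte):
    """Decode a compressed byte whose HEADER is '01', '10' or '11'."""
    count = (byte >> 6) & 0b11            # HEADER doubles as character count
    if count == 0:
        raise ValueError("header '00' marks a rare-character byte")
    pairs = [(byte >> shift) & 0b11 for shift in (4, 2, 0)]
    # Bit-pairs beyond `count` are ignored, whatever their values.
    return "".join(LETTERS[p] for p in pairs[:count])
```

For instance, the bytes '01100000' and '01101111' both decode to the single character 'G' under this sketch, because the last four bits are disregarded when the header is '01'.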
Reference is now made to Fig. 21C, which is a table illustrating preferred values assignable to BIT III and BIT IV of Fig. 20A respectively, when encoding one or more rare characters.
As mentioned above with reference to Fig. 21A, a value '00' assigned to the bits of the HEADER of Fig. 20A signifies that the compressed byte 1908 represents one or more rare characters and does not represent any known characters. In such case, in accordance with a preferred embodiment of the present invention, BIT III and BIT IV may be used to signify if the entire byte represents one, two or three 'N's. Assigning the value '01' to BIT III & BIT IV, i.e. assigning '0' to BIT III and '1' to BIT IV, signifies that the entire compressed byte 1908 represents one rare character.
Assigning the value '10' to BIT III & BIT IV, i.e. assigning '1' to BIT III and '0' to BIT IV, signifies that the entire compressed byte 1908 represents two rare characters.
Assigning the value '11' to BIT III & BIT IV, i.e. assigning '1' to BIT III and '1' to BIT IV, signifies that the entire compressed byte 1908 represents three rare characters.
It is appreciated that notwithstanding that a preferred embodiment of the present invention demonstrates encoding of up to three 'N's in one compressed byte 1908, it is possible to use each compressed byte 1908 to encode more or fewer than three 'N's. For example, it is possible to use BIT III through BIT VIII of Fig. 20A to signify up to 64 N's represented by a compressed byte 1908. In a non-genomic embodiment of the present invention, it is possible to represent up to 64 different rare characters, each one not compressed at all, i.e. represented by the entire byte.
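The rare-character byte of the preferred embodiment may be sketched as follows, using the values of Fig. 21C. It is assumed that BIT I is the most significant bit and that BIT V through BIT VIII are left at zero; the function names are hypothetical.

```python
def encode_n_run(count):
    """Encode one, two or three consecutive N's as a single byte."""
    if count not in (1, 2, 3):
        raise ValueError("this sketch supports runs of 1 to 3 N's per byte")
    # HEADER '00', BIT III and BIT IV hold the count, remaining bits zero.
    return count << 4

def decode_n_run(byte):
    """Return the run of N's represented by a rare-character byte."""
    if (byte >> 6) & 0b11 != 0:
        raise ValueError("not a rare-character byte")
    return "N" * ((byte >> 4) & 0b11)
```

Extending the count field to cover BIT III through BIT VIII, as suggested for the variant signifying up to 64 N's, would only change the width of the field being read and written.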
Reference is now made to Fig. 22, which is a simplified flowchart illustrating operation of the compression engine 1902 of Fig. 19, constructed and operative in accordance with a preferred embodiment of the present invention.
Preferably two translation tables, compression table 1906 and decompression table 1912, required for operation of compression engine 1902 and decompression engine 1910 of Fig. 19 respectively, are initially generated. Generally, the compression table 1906 stores for each possible combination of up to three characters, e.g. 'ATG', 'GGC', 'AT', bit-values which represent this combination in compressed form in one compressed byte 1908, preferably according to the values suggested in Figs. 21A, 21B, and 21C. A preferred implementation of this step, which is preferably carried out only once, is further described below with reference to Fig. 23. Figs. 24A and 24B present examples of preferred compression and decompression tables 1906 and 1912 respectively.
Following generation of the translation tables, an iterative process of compression of multiple strings takes place. An uncompressed string, such as uncompressed string 1900 of Fig. 19, is received. For clarity, this iterative process is explained with reference to an example wherein the uncompressed string 1900 is represented by the string 'ATGAT'. This example is followed through the following steps of Fig. 22.
The uncompressed string 1900 preferably is parsed into sub-strings, each having up to three characters (3-character-substrings), by parsing the uncompressed string 1900 from left to right. It should be noted that one or more of the characters in a 3-character-substring may in fact be unknown, i.e. an 'N'. In the given example the string 'ATGAT' is parsed from left to right into 3-character-substrings, yielding 'ATG' and 'AT'.
Following parsing of the uncompressed string 1900 into 3-character-substrings, a recursive operation is initiated, which looks up each 3-character-substring in the compression table 1906, and based on the contents of the compression table, assigns appropriate bit values to the bits in one or more compressed bytes 1908. The compressed bytes 1908 are combined to yield a compressed string 1904.
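The parse-and-look-up flow may be sketched for the 'ATGAT' example. For brevity the sketch handles only the known characters A, C, G and T, builds a stand-in for compression table 1906 on the fly, and assumes BIT I is the most significant bit; the names `encode_chunk`, `TABLE` and `compress` are hypothetical.

```python
from itertools import product

CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode_chunk(chunk):
    """Encode 1 to 3 known characters into one byte (HEADER = count)."""
    byte = len(chunk) << 6
    for i, ch in enumerate(chunk):
        byte |= CODES[ch] << (4 - 2 * i)
    return byte

# Stand-in for compression table 1906, restricted to known characters.
TABLE = {
    "".join(combo): encode_chunk("".join(combo))
    for n in (1, 2, 3)
    for combo in product("ACGT", repeat=n)
}

def compress(seq):
    """Parse left to right into substrings of up to three characters and
    look each one up in the table."""
    chunks = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return bytes(TABLE[chunk] for chunk in chunks)

compressed = compress("ATGAT")
```

Here 'ATG' maps to '11001110' and 'AT' to '10001100', so the five-character string occupies two compressed bytes.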
Reference is now made to Fig. 23, which is a simplified flowchart illustrating a preferred functionality for generating translation tables, including compression table 1906 and decompression table 1912 of Fig. 19.
Firstly, preferably all possible 3-character-substrings, i.e. 1-character, 2-character and 3-character combinations of A, T, C, G and N, are generated. Examples of these combinations may include ATN, ATA and ATC.
For each 3-character-substring, the following procedure applies.
A determination is made as to whether the 3-character-substring contains one or more N's.
If so, the N's are encoded in compressed bytes 1908 which do not also represent known characters. Values '00' are assigned to the headers of such compressed bytes and all known characters in the 3-character-substring are encoded in other bytes.
If no N's are present in a 3-character-substring, all characters in the 3-character-substring are encoded in a single compressed byte 1908.
For all 3-character-substrings containing known characters, a determination is made as to the number of characters in the current 3-character-substring. If a single character is encoded into a byte, the HEADER is assigned the value '01' and the single character is represented by BIT III & BIT IV of the compressed byte 1908.
If two characters are encoded into a byte, the HEADER is assigned the value '10' and the two characters are represented by BITS III - VI of the compressed byte 1908, with the first character represented by BIT III & BIT IV and the second character represented by BIT V & BIT VI.
If three characters are encoded into a byte, the HEADER is assigned the value '11' and the three characters are represented by BITS III - VIII of the compressed byte 1908, with the first character represented by BIT III & BIT IV, the second character represented by BIT V & BIT VI and the third character represented by BIT VII & BIT VIII.
Each 3-character-substring and its corresponding one or more compressed bytes 1908 are stored in translation tables including compression table 1906 and decompression table 1912.
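The table-generation procedure may be sketched as follows. The sketch assumes that each maximal run of N's and each maximal run of known characters within a substring is encoded in its own byte, in left-to-right order, that unused bits are zero, and that BIT I is the most significant bit. The names `encode_substring`, `compression_table` and `decompression_table` are hypothetical.

```python
from itertools import groupby, product

CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode_substring(sub):
    """Return the compressed bytes (as 8-character bit strings) for a
    substring of 1 to 3 characters drawn from A, T, C, G and N."""
    out = []
    # Split into maximal runs of N's and of known characters, since a
    # single byte never mixes the two.
    for is_n, run in groupby(sub, key=lambda ch: ch == "N"):
        run = "".join(run)
        if is_n:
            byte = len(run) << 4              # HEADER '00', count in BIT III-IV
        else:
            byte = len(run) << 6              # HEADER = number of characters
            for i, ch in enumerate(run):
                byte |= CODES[ch] << (4 - 2 * i)
        out.append(format(byte, "08b"))
    return out

compression_table = {
    "".join(combo): encode_substring("".join(combo))
    for n in (1, 2, 3)
    for combo in product("ATCGN", repeat=n)
}

# Inverting the single-byte entries gives a decompression table.
decompression_table = {
    codes[0]: sub
    for sub, codes in compression_table.items()
    if len(codes) == 1
}
```

Under these assumptions the table holds 155 entries (5 one-character, 25 two-character and 125 three-character combinations), and the entry for 'ATN' consists of the two bytes '10001100' and '00010000'.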
It is appreciated by those skilled in the art that rare characters (N's) are typically very rare in typical strings, and that furthermore when 'N's appear in a string they tend to appear contiguously, signifying a 'gap' in the sequenced genome. Instances of isolated single or double N's are typically less frequent than instances of contiguous 'N's. The present invention utilizes this fact to achieve an optimized compression suited specifically for genomic sequence data: three-character combinations which contain only characters and no N's, as well as those containing only N's and no characters, are both compressed into a single byte. Most of the rare cases of 3-character mixtures of characters and N's are compressed into two bytes. Only a minority of extremely rare combinations of characters and N's require three bytes and therefore are in fact not compressed.
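The distribution of byte counts over all three-character combinations can be checked with a short enumeration. It assumes, as an interpretation of the procedure of Fig. 23, that the number of bytes needed equals the number of maximal runs of N's or of known characters in the combination; `bytes_needed` is a hypothetical name.

```python
from collections import Counter
from itertools import groupby, product

def bytes_needed(triplet):
    """Number of compressed bytes = number of maximal runs of N's or of
    known characters, since one byte never mixes the two."""
    return sum(1 for _ in groupby(triplet, key=lambda ch: ch == "N"))

sizes = Counter(bytes_needed(t) for t in product("ACGTN", repeat=3))
```

Under this assumption, 65 of the 125 possible triplets need one byte, 40 need two bytes and 20 need three bytes, the last group being combinations such as 'ANT' or 'NAN' in which N's and known characters alternate.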
Reference is now made to Fig. 24A, which is a simplified illustration of a preferred implementation of compression table 1906 of Fig. 19, employed in accordance with a preferred embodiment of the present invention.
The goal of compression table 1906 is to provide a translation-table, also referred to as a 'lookup table', which provides the bit-values of the one or more compressed bytes 1908 required to represent in compressed form every possible 1-character, 2-character and 3-character sub-string of uncompressed string 1900.
For simplicity of explanation, the compression table 1906 is described here logically, as a database table comprising fields into each of which multiple values are stored in respective multiple records. It is appreciated by those skilled in the art that the description of the compression table 1906 in terms of a table comprising fields is meant for clarity and not meant to be limiting and that the compression table 1906 may equally be implemented as a 'CASE', or 'IF-THEN' programming code in a any suitable computer language, as is well known in the art. For example, computer code can be written, which comprises a plurality of 'IF-THEN' or 'CASE' arguments, each one of the arguments providing bit-values of the one or more compressed bytes 1908 representing in compressed form one 3 -character-substring of uncompressed string 1900.
It is further possible to program a hierarchical 'tree' of IF-THEN statements, which makes the process of finding the appropriate compressed byte representing the 3-character combination. For example - find out what is the first letter in the triplet, then find out what is the second, then find what is the third. This gets to the compressed byte in fewer steps, and therefore may be advantageous, as opposed to looking it up from among a list of all possible compressed bytes.
Compression table 1906 preferably comprises multiple records each containing 4 fields: uncompressed character-combination 2400, compressed byte I 2402, compressed byte II 2404 and compressed byte III 2406. For clarity, an example is given or the content which may be stored in each of these fields.
The uncompressed character combination 2400 is a field which stores all possible 3-character substrings, i.e. 1-character, 2-character and 3-character combinations, including combinations of characters only and combinations which include N's. In the given example, uncompressed character combination 2400 stores a 3-character combination 'ATN'.
Compressed byte I 2402, compressed byte II 2404 and compressed byte III 2406, respectively, are fields which store for each uncompressed character-combination 2400 the bit-values for each of the one or more compressed bytes 1908 required for encoding it. In the given example, compressed byte I 2402 stores '10001100', which represents the character-combination 'AT', and compressed byte II 2404 stores '00010000', which represents 'N'. Compressed byte III 2406, in the given example, stores null, since only two compressed bytes are required to represent the character combination 'ATN'.
As described above with reference to Fig. 23, in accordance with a preferred embodiment of the present invention, most 3-character substrings may be compressed into one compressed byte 1908, some rare combinations may be compressed into two compressed bytes, and some 20 very rare combinations may require 3 compressed bytes, and therefore may not be compressed. Therefore, notwithstanding that compression table 1906 comprises three compressed byte fields 2402-2406, one compressed byte field, such as compressed byte I 2402, is sufficient to translate a vast majority of 3-character combinations to be typically found in a string.
Reference is now made to Fig. 24B, which is a simplified illustration of a preferred implementation of decompression table 1912 of Fig. 19, employed in accordance with a preferred embodiment of the present invention.
The goal of decompression table 1912 is to provide a translation table, also referred to as a 'lookup table', which provides the 1-character, 2-character or 3-character uncompressed string 1900 preferably corresponding to every possible compressed byte 1908.
It is appreciated that the description of the decompression table 1912 in terms of a table comprising fields is meant for clarity and not meant to be limiting, and that the decompression table 1912 may equally be implemented as 'CASE' code in any computer language, as is well known in the art.
The decompression table 1912 preferably comprises multiple records each containing 2 fields: compressed byte 2408 and decompressed character-combination 2410.
Compressed byte 2408 is a field which preferably stores bit-values of every possible compressed byte 1908.
Decompressed character combination 2410 is a field which stores for each compressed byte 2408 the 1-character, 2-character or 3-character uncompressed string 1900 which it encodes. For example, the field compressed byte 2408 may contain the compressed byte 1908 bit-value '10001100' and the respective field decompressed character combination 2410 may contain the 2-character combination 'AT' which this bit value represents in compressed form.
Reference is now made to Fig. 25, which is a simplified flowchart illustrating operation of decompression engine 1910 of Fig. 19 constructed and operative in accordance with a preferred embodiment of the present invention. Generally, decompression engine 1910 of Fig. 19 performs a reverse action of that of compression engine 1902 of Fig. 19, which was further described hereinabove with reference to Fig. 22.
Decompression engine 1910 of Fig. 19 receives the compressed string 1904 of Fig. 19 to be decompressed. For clarity, the process shown in Fig. 25 is explained with reference to an example wherein the compressed string 1904 comprises two compressed bytes 1908, the bit-values of which are '11001110' & '10001100' respectively. This example is followed through the following steps of Fig. 25.
A recursive operation is initiated, which parses the received compressed string 1904 into the compressed bytes 1908 included in this compressed string. Each compressed byte 1908 is looked up in the decompression table 1912 to find, based on the contents of the decompression table, the 3-character substring which this compressed byte represents. The 3-character substrings are combined to yield an uncompressed string 1900.
In the given example, the first compressed byte in the compressed string has bit-values of '11001110', which when looked up in the decompression table is found to represent the character combination 'ATG', and the second compressed byte has bit-values of '10001100', which is found to represent the character combination 'AT'. Combining the two 3-character substrings, 'ATG' represented by compressed byte '11001110' and 'AT' represented by compressed byte '10001100', yields the uncompressed string 'ATGAT'.
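The decompression steps above may be sketched in Python as follows. The sketch decodes each byte arithmetically rather than through a stored table, which the description permits as an equivalent implementation. The function names are hypothetical, and the convention that BIT-PAIR I of an N byte holds the number of N's is taken from the single-N example of Fig. 26B and assumed to extend to larger counts.

```python
LETTER = "ACGT"  # index equals the two-bit code: A=00, C=01, G=10, T=11


def decompress_byte(bits: str) -> str:
    """Decode one compressed byte given as an 8-character string of '0' and '1'."""
    header = int(bits[0:2], 2)
    pairs = [int(bits[i:i + 2], 2) for i in (2, 4, 6)]
    if header == 0:
        # N byte: BIT-PAIR I is read as the number of N's (assumption for counts above one).
        return "N" * pairs[0]
    # Character byte: the HEADER gives the number of bit pairs that carry characters.
    return "".join(LETTER[p] for p in pairs[:header])


def decompress(compressed: list) -> str:
    """Concatenate the substrings represented by each compressed byte."""
    return "".join(decompress_byte(b) for b in compressed)


print(decompress(["11001110", "10001100"]))
```

For the two-byte example followed through Fig. 25, this yields the uncompressed string 'ATGAT'.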
Reference is now made to Fig. 26A, which is a simplified illustration of an example of compressing an uncompressed string 1900 into a compressed string 1904, both shown in Fig. 19.
Uncompressed string 'ATGAT' 2600 is an uncompressed string 1900 comprising the characters: 'ATGAT'. When parsed into 3-character substrings, beginning from the left side of the string, as shown in Fig. 22, the result is two 'character-triplets': character-triplet-1 'ATG' 2602 and character-triplet-2 'AT_' 2604. It is appreciated that the character-triplet-2 'AT_' 2604 actually contains only two characters: A and T.
Since neither of these 3-character sub-strings contains an N, each one of them is compressed directly into one compressed byte 1908: compressed byte-1 2606 and compressed byte-2 2607, respectively.
Compressed byte-1 2606 encodes three characters: 'ATG'. Therefore a value of '11' is assigned to the two bits of the HEADER of compressed byte-1 2606, as indicated by reference numeral 2608, signifying that this compressed byte 1908 encodes 3 characters.
Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte-1 2606, as indicated by reference numeral 2610, signifying that the first character represented by this byte is 'A'.
Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-1 2606, as indicated by reference numeral 2612, signifying that the second character represented by this byte is 'T'.
Value '10' is assigned to the two bits of BIT-PAIR III of compressed byte-1 2606, as indicated by reference numeral 2614, signifying that the third character represented by this byte is 'G'.
Compressed byte-2 2607 encodes two characters: 'AT'. Therefore a value of '10' is assigned to the two bits of the HEADER of compressed byte-2 2607, as indicated by reference numeral 2616, signifying that this compressed byte 1908 encodes 2 characters, and that therefore the two bits of BIT-PAIR III are to be ignored.
Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte-2 2607, as indicated by reference numeral 2618, signifying that the first character represented by this byte is 'A'.
Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-2 2607, as indicated by reference numeral 2620, signifying that the second character represented by this byte is 'T'. Value '00' stored in the bits of BIT-PAIR III of compressed byte-2 2607, as indicated by reference numeral 2622, is ignored and does not represent an 'A', since the HEADER specified that this byte encodes only 2 characters.
Reference is now made to Fig. 26B, which is a simplified illustration of another example of compression of an uncompressed string 1900 of Fig. 19, containing a rare character, into a compressed string 1904 of Fig. 19.
Uncompressed string 'ATNCG' 2650 is an uncompressed string 1900 comprising the characters: 'ATNCG'.
When parsed into 3-character sub-strings, beginning from the left side of the string, as shown in Fig. 22, the result is two 'character-triplets': character-triplet-1 'ATN' 2652, and character-triplet-2 'CG_' 2654. It is appreciated that the character-triplet-2 'CG_' 2654 actually contains not a triplet but only 2 characters: C and G.
Since character-triplet-1 'ATN' 2652 contains an 'N' it is preferably represented by two compressed bytes 1908 rather than one: the first, compressed byte-1 2656, encodes 'AT', and the second, compressed byte-2 2658, encodes 'N'.
Compressed byte-1 2656 encodes two characters, 'AT', therefore '10' is assigned to the two bits of the HEADER of compressed byte-1 2656, as indicated by reference numeral 2660, signifying that this compressed byte 1908 encodes 2 characters.
Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte-1 2656, as indicated by reference numeral 2662, signifying that the first character represented by this byte is 'A'.
Value '11' is assigned to the two bits of BIT-PAIR II of compressed byte-1 2656, as indicated by reference numeral 2664, signifying that the second character represented by this byte is 'T'.
Value '00' assigned to the bits of BIT-PAIR III of compressed byte-1 2656, as indicated by reference numeral 2666, is ignored, and does not represent an 'A' since the HEADER specified that this byte encodes only 2 characters.
Compressed byte-2 2658 is dedicated to encoding N's, in this case only one 'N', which is derived from the character-triplet-1 'ATN' 2652. Therefore '00' is assigned to the two bits of the HEADER of compressed byte-2 2658, as indicated by reference numeral 2668, signifying that this compressed byte 1908 is dedicated to encoding one or more N's.
The value '01' is assigned to the two bits of BIT-PAIR I of compressed byte-2 2658, as indicated by reference numeral 2670, signifying that this byte, which is dedicated to encoding N's, encodes only one N. Accordingly, the zeros in the two bits of BIT-PAIR II and the two bits of BIT-PAIR III, indicated by reference numerals 2672 & 2674, are ignored.
Compressed byte-3 2690 encodes two characters: 'CG'. Therefore a value of '10' is assigned to the two bits of the HEADER of compressed byte-3 2690, as indicated by reference numeral 2676, signifying that this compressed byte 1908 encodes only 2 characters.
Value '01' is assigned to the two bits of BIT-PAIR I of compressed byte-3 2690, as indicated by reference numeral 2678, signifying that the first character represented by this byte is 'C'.
Value '10' is assigned to the two bits of BIT-PAIR II of compressed byte-3 2690, as indicated by reference numeral 2680, signifying that the second character represented by this byte is 'G'.
The value '00' in the bits of BIT-PAIR III of compressed byte-3 2690, as indicated by reference numeral 2682, is ignored and does not represent an 'A' since the HEADER specified that this byte encodes only 2 characters.
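The two worked examples of Figs. 26A and 26B may be reproduced with the following Python sketch. It is offered as an illustration under stated assumptions, not as the implementation of compression engine 1902: the function names are hypothetical, each run of N's within a triplet is assumed to be written as one N byte whose BIT-PAIR I holds the run length, and HEADER '01' is assumed for a single character.

```python
from itertools import groupby

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_run(run: str) -> str:
    """Encode a run of 1-3 N's, or of 1-3 non-N characters, as one 8-bit string."""
    if run[0] == "N":
        # HEADER '00'; the number of N's goes into BIT-PAIR I (assumed for counts above one).
        byte = len(run) << 4
    else:
        # HEADER holds the character count; bit pairs follow from left to right.
        byte = len(run) << 6
        for position, char in enumerate(run):
            byte |= CODE[char] << (4 - 2 * position)
    return format(byte, "08b")


def compress(string: str) -> list:
    """Parse the string into triplets from the left and encode each run in each triplet."""
    compressed = []
    for start in range(0, len(string), 3):
        triplet = string[start:start + 3]
        for _, run in groupby(triplet, key=lambda ch: ch == "N"):
            compressed.append(encode_run("".join(run)))
    return compressed


print(compress("ATGAT"))
print(compress("ATNCG"))
```

The first call produces the two bytes of Fig. 26A, and the second produces the three bytes of Fig. 26B, in which the triplet 'ATN' is split into a two-character byte and an N byte.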
Reference is now made to Fig. 27, which is a simplified block diagram illustrating shifted compressed strings utilized by the compressed string similarity search module 1818 of Fig. 18 constructed and operative in accordance with a preferred embodiment of the present invention.
It is appreciated that the difficulty in assessing similarity of character strings in compressed form is that the compression mechanism described above with reference to Figs. 19-25 compresses more than one character into one byte. In a preferred embodiment of the present invention, each compressed byte represents in compressed form 3 or 4 characters. It is therefore very easy to compare entire 'triplets' of characters, but if an addition or deletion of a single character occurs, then all triplets 'downstream' of the one modified will appear to have changed completely, whereas in fact they have only been 'shifted' to the right or to the left by one location. The basic concept of the compressed character string similarity search module 1818 of Fig. 18 is therefore to calculate all possible 'shifted' variations of the compressed character string 1810 of Fig. 18, and to use them to search for compressed character strings similar to target 1820 of Fig. 18.
An example is provided of target string 2700, which is a compressed character string comprising 12 characters, N1 through N12, which are represented in compressed form by 4 compressed bytes 1908, BYTE 1, BYTE 2, BYTE 3 and BYTE 4.
Based on this target string 2700, four shifted strings 2702 are generated: 'minus one' shifted string 2704, 'minus two' shifted string 2706, 'plus one' shifted string 2708 and 'plus two' shifted string 2710.
The first character in target string 2700, N1, has been removed in the 'minus one' shifted string 2704, therefore 'minus one' shifted string 2704 begins with N2. The characters compressed into each of the four compressed bytes 1908 of 'minus one' shifted string 2704 are therefore 'shifted to the left' by one location.
Similarly, the sequence of characters compressed into each of the four compressed bytes 1908 of 'minus two' shifted string 2706 is shifted to the left by two locations; that of 'plus one' shifted string 2708 is shifted to the right by one location; and that of 'plus two' shifted string 2710 is shifted to the right by two locations.
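The grouping of characters into the bytes of each shifted string may be sketched in Python as follows. The sketch works on character labels rather than on bit-values, so that the grouping is visible. The function name `shifted_triplets` and the placeholder `PAD` are hypothetical, and it is assumed that a shifted string keeps the same number of bytes as the target, so that characters pushed past the last byte by a 'plus' shift are dropped.

```python
PAD = "_"  # hypothetical placeholder for a position holding no character


def shifted_triplets(chars, shift, width=3):
    """Group characters into bytes after shifting left (negative) or right (positive)."""
    n_bytes = -(-len(chars) // width)  # ceiling division
    if shift < 0:
        seq = list(chars[-shift:])          # drop the first |shift| characters
    else:
        seq = [PAD] * shift + list(chars)   # push every character to the right
    # Pad or truncate so that the shifted string has as many bytes as the target.
    seq = (seq + [PAD] * (n_bytes * width))[: n_bytes * width]
    return [tuple(seq[i:i + width]) for i in range(0, len(seq), width)]


names = [f"N{i}" for i in range(1, 13)]  # N1 through N12, as in target string 2700
for shift in (0, -1, -2, 1, 2):
    print(shift, shifted_triplets(names, shift))
```

With a shift of minus one the first byte holds N2, N3 and N4, and with a shift of plus two the third byte holds N5, N6 and N7.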
Reference is now made to Fig. 28, which is a simplified flowchart illustrating operation of the compressed string similarity search module 1818 of Fig. 18 constructed and operative in accordance with a preferred embodiment of the present invention.
Operation of the compressed string similarity search module 1818 begins by getting a target compressed string 1810 of Fig. 18.
Based on this compressed target string, four shifted compressed strings are generated: 'minus one' shifted string 2704, 'minus two' shifted string 2706, 'plus one' shifted string 2708 and 'plus two' shifted string 2710, as described hereinabove with reference to Fig. 27.
Using the compressed string index 1808 of Fig. 18, all compressed strings 1804, having at least one compressed byte which matches that of the compressed target string 2700 or of one of the four shifted strings 2704-2710, are retrieved. It is important to note that a match is looked for only between bytes occupying the same location in the compressed genomic string: the first compressed byte in a compressed string 1804 is compared to the first compressed byte of the compressed target string 1816 and to the first compressed byte of each of the four compressed shifted strings. It is not compared to any other, e.g. second, third or fourth, compressed bytes within these strings. All compressed strings having at least one match are considered potentially similar, and are passed on to the next step.
Next, all of the compressed strings having at least one match are assessed, and the number of mismatching compressed bytes for each of them is counted.
Compressed strings having fewer mismatching compressed bytes with the target or one of the shifted strings than a certain user-defined 'threshold' are considered potentially very similar, and are passed on to the next step.
Optionally, the mismatching compressed byte/s are further analyzed to determine the exact nature of the mistake, in order to further fine-tune the similarity comparison.
The resulting compressed strings similar to target 1816 of Fig. 18 are considered to represent in compressed form genomic sequences which are similar to the target genomic sequence represented in compressed form by the compressed target string 1816, and are delivered.
Reference is now made to Fig. 29A, which is a simplified illustration of an example of identifying a string having one character replacement relative to a target string.
Fig. 29A shows a string designated SIMILAR TO TARGET STRING (1 REPLACEMENT), in which a character, designated N13 shown in broken line format, in the compressed byte designated 1R-BYTEII, has replaced character N5 in that same spot in the original TARGET STRING.
As is apparent from Fig. 29A, 3 of the 4 compressed bytes of SIMILAR TO TARGET GENOMIC SEQUENCE (1 REPLACEMENT), shown in bold line format, still match those in TARGET GENOMIC SEQUENCE, and so the two genomic sequences can be deduced as being similar, by comparison of their compressed format, without decompressing them.
Reference is now made to Fig. 29B, which is a simplified illustration of an example of identifying a string having two character additions relative to a target string.
Fig. 29B shows a string designated SIMILAR TO TARGET STRING (2 ADDITIONS), in which two characters, designated N13 and N14, shown in broken line format, in compressed byte designated 2A-BYTEII, have been added to the string relative to the original TARGET STRING, 'pushing' characters N5 and N6 to the next compressed byte designated 2A-BYTE III, and shifting all the following characters by two positions.
As is apparent from Fig. 29B, only the first compressed byte designated 2A-BYTE I, in string designated SIMILAR TO TARGET STRING (2 ADDITIONS), matches the original BYTE I of TARGET STRING. However, the third and fourth bytes of SIMILAR TO TARGET GENOMIC SEQUENCE (2 ADDITIONS), designated 2A-BYTE III and 2A-BYTE IV respectively, do match those of 'PLUS TWO' SHIFTED STRING, designated P2-BYTE III and P2-BYTE IV respectively. All matching bytes are shown in bold line format.
The two strings may therefore be deduced as being similar, differing by a two addition mistake in the mismatched compressed byte 2A-BYTE II.
Reference is now made to Fig. 29C, which is a simplified illustration of an example of identifying a character string having one character deletion relative to a target string.
Fig. 29C shows a string designated SIMILAR TO TARGET STRING (1 DELETION), in which one character, designated N5, of TARGET STRING has been deleted, shifting all characters from N6 onwards one position to the left. The missing N5 in byte 1D-BYTE II is represented by a small blank broken line box between N4 and N6.
As is apparent from Fig. 29C, only the first compressed byte designated 1D-BYTE I, in string designated SIMILAR TO TARGET STRING (1 DELETION), matches the original BYTE I of TARGET STRING. However, the third and fourth bytes of SIMILAR TO TARGET STRING (1 DELETION), designated 1D-BYTE III and 1D-BYTE IV respectively, do match those of 'MINUS ONE' SHIFTED STRING, designated M1-BYTE III and M1-BYTE IV, respectively. All matching bytes are shown in bold line format.
The two strings may therefore be deduced as being similar, differing by a one deletion mistake in the mismatched compressed byte 1D-BYTE II.
It is therefore appreciated that replacement, deletion and addition types of differences between a target string and other strings, can be detected in compressed form, without decompression.
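The three cases of Figs. 29A, 29B and 29C may be illustrated with concrete letters by the following Python sketch, which compares compressed bytes position by position against the target and its four shifted variants. The sketch rests on several assumptions that are not taken from the description: the function names are hypothetical, the 12-letter target 'ACGTAGCATGCA' is an arbitrary example, the shifted variants are derived from the uncompressed target before packing, and the incomplete first byte of a 'plus' shifted string is represented by `None` so that it never matches.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(chars):
    """Pack 1-3 non-N characters into one byte value (HEADER = character count)."""
    byte = len(chars) << 6
    for i, ch in enumerate(chars):
        byte |= CODE[ch] << (4 - 2 * i)
    return byte


def compress(s):
    return [pack(s[i:i + 3]) for i in range(0, len(s), 3)]


def shifted_variants(target):
    """Compressed target plus its 'minus' and 'plus' one and two shifted strings."""
    n_bytes = -(-len(target) // 3)
    variants = {0: compress(target)}
    for k in (1, 2):
        variants[-k] = compress(target[k:])
        # Plus shift: the first byte lacks k leading characters, so it is set to None.
        variants[k] = [None] + compress(target[3 - k:])[: n_bytes - 1]
    return variants


def matches_per_shift(candidate, target):
    """Count bytes that match at the same location, for each shifted variant."""
    cand = compress(candidate)
    return {shift: sum(a == b for a, b in zip(cand, variant))
            for shift, variant in shifted_variants(target).items()}


target = "ACGTAGCATGCA"
replacement = matches_per_shift("ACGTCGCATGCA", target)    # fifth character replaced
additions = matches_per_shift("ACGTGGAGCATGCA", target)    # two characters added after the fourth
deletion = matches_per_shift("ACGTGCATGCA", target)        # fifth character deleted
print(replacement, additions, deletion)
```

In this example the replacement leaves three of four bytes matching the unshifted target, the two additions leave the first byte matching the target and the third and fourth bytes matching the 'plus two' shifted string, and the deletion leaves the first byte matching the target and the third and fourth bytes matching the 'minus one' shifted string.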
It is also appreciated that while Figs. 27 and 29A, 29B and 29C demonstrate detection of up to 2 addition or deletion mistakes, the same concept may be utilized to detect a wider spectrum of mistakes, by generating more 'shifted sequences' accordingly, e.g. 'plus three' shifted sequence etc.
Reference is now made to Fig. 30, which is a simplified illustration of a triphase-bit compressed character used in accordance with a preferred embodiment of the present invention.
Current computer systems are typically based on byte memory units consisting of 8 bits, each bit having two phases or states: zero and one. However, as is well known in the art, it is possible, or will be in the near future, to use computer systems which are based on multi-phase bits, such as tri-phase bits, i.e. where each bit has more than two phases, such as three phases. For example, a tri-phase bit has three phases rather than two: zero, one and two.
Implementing the present invention in an environment of multi-phase bit expands the usefulness of the present invention as is demonstrated by Fig. 30, which is a preferred implementation of the present invention in a tri-phase bit environment.
A compressed triphase-bit compressed byte 3000 is a preferred embodiment of a compressed string 1804 of Fig. 18, when implemented in a triphase-bit computer environment.
Triphase-bit compressed byte 3000 preferably comprises the following three elements: a HEADER comprising two bits: BIT I and BIT II, CHAR I comprising three bits BIT III, BIT IV & BIT V, and CHAR II comprising three bits BIT VI, BIT VII & BIT VIII.
Each of the bits of compressed triphase-bit compressed byte 3000, BIT I through BIT VIII, may store one of three values: zero, one and two. Therefore, CHAR I and CHAR II, each of which contains 3 bits, may each represent 27 possibilities: 3 x 3 x 3 = 27. It is appreciated that 27 possibilities is sufficient to represent, as an example, the 26 letters of the English alphabet.
Similarly, the HEADER, which contains two triphase-bits, may represent 9 possibilities: 3 x 3 = 9. These possibilities may be utilized to extend the capacity of the triphase-bit compressed character, allowing it to represent up to two alphanumeric characters, each of which may be one of 81 common characters. For example, each of the two alphanumeric characters which may be represented by the triphase compressed character 3000 may belong to one of the following three common-character-sets: uppercase English letters, lowercase English letters, or numerals and seventeen other common symbols. Alternatively, the HEADER may indicate that the entire byte represents one rare character, one of seven hundred twenty nine options, as is explained below.
In a preferred implementation, the following values may be utilized by the two bits of the HEADER:
The values '00' in the two bits of the HEADER signify that the entire byte represents a rare character. In this case all 6 bits of CHAR I & CHAR II, BIT III - BIT VIII respectively, when taken together represent one 'rare' character, rather than two separate characters. This mechanism allows for representation of 729 symbols (3 x 3 x 3 x 3 x 3 x 3 = 729), which could not be included in the 81 common characters.
The values '01' in the two bits of the HEADER signify that the entire byte represents only one character, which is a lowercase letter.
The values '02' in the two bits of the HEADER signify that the entire byte represents only one character, which is an uppercase letter.
The values '10' in the two bits of the HEADER signify that the entire byte represents only one character, which is a numeral or other common symbol. Up to 17 common symbols may be represented in this way, in addition to the ten numerals, since CHAR I and CHAR II each contain 3 bits and may therefore represent 27 options.
The values '11' in the two bits of the HEADER signify that the byte represents two characters, which are both lowercase letters.
The values '12' in the two bits of the HEADER signify that the byte represents two characters, which are both uppercase letters.
The values '20' in the two bits of the HEADER signify that the byte represents two characters, which are both numerals or common symbols.
The values '21' in the two bits of the HEADER signify that the byte represents two characters, the first of which is a lowercase letter and the second of which is an uppercase letter.
The values '22' in the two bits of the HEADER signify that the byte represents two characters, the first of which is an uppercase letter and the second of which is a lowercase letter.
Using the above schema, the option of compressing one numeral and one letter into the triphase-bit compressed character 3000 is not supported, as it may be viewed as more rare.
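The nine HEADER values listed above, and the capacities quoted for the triphase-bit fields, may be summarized in the following Python sketch. The mapping of a letter to its three triphase-bits (its position in the alphabet written in base 3), the use of CHAR I alone for single-character headers, and the omission of the seventeen unspecified common symbols are all assumptions of the sketch, and the names `HEADER` and `decode` are hypothetical.

```python
import string

# HEADER value -> character set of each character carried by the byte.
HEADER = {
    "00": ("rare",),
    "01": ("lower",),
    "02": ("upper",),
    "10": ("symbol",),
    "11": ("lower", "lower"),
    "12": ("upper", "upper"),
    "20": ("symbol", "symbol"),
    "21": ("lower", "upper"),
    "22": ("upper", "lower"),
}

# Only the ten numerals are listed for 'symbol'; the other 17 symbols are unspecified.
SETS = {
    "lower": string.ascii_lowercase,
    "upper": string.ascii_uppercase,
    "symbol": string.digits,
}


def decode(trits: str):
    """Decode 8 triphase-bits, written as a string of the digits 0, 1 and 2."""
    kinds = HEADER[trits[:2]]
    if kinds == ("rare",):
        # All six remaining triphase-bits together select one of 729 rare characters.
        return ("rare", int(trits[2:], 3))
    fields = (trits[2:5], trits[5:8])  # CHAR I and CHAR II
    return "".join(SETS[kind][int(field, 3)] for kind, field in zip(kinds, fields))


print(3 ** 3, 3 ** 2, 3 ** 6, 3 * 27)  # 27 per CHAR, 9 headers, 729 rare, 81 common
print(decode("11000001"))
```

None of the nine HEADER values pairs a letter with a numeral, which reflects the unsupported combination noted above.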
It is appreciated that although Fig. 30 depicts a configuration where 2 bits are utilized as a HEADER of the triphase-bit compressed character 3000, this is not necessary. It is possible, for example, to encode two characters in the triphase-bit compressed character 3000, each comprising four, rather than three, bits, and therefore supporting 81 options, one of these options being a null.
It is further appreciated that although an implementation in a triphase bit environment has been demonstrated, the present invention may similarly be implemented in a multi-phase bit environment, in which any bit may have more than three phases, as well as in an environment where each byte may have more than eight bits, or fewer than eight bits.
Reference is now made to Fig. 31, which is a simplified block diagram illustrating a computer application constructed and operative in accordance with another preferred embodiment of the present invention. It is appreciated that the computer application may be implemented in any appropriately programmed computer system such as, for example, a suitable personal computer including an operating system having a suitable graphical user interface.
The embodiment of Fig. 31 comprises a mechanism for conversion of a standard alphanumeric representation of genomic sequence information into a more informative and intuitive graphical representation, conducive to visual pattern analysis of genomic sequence information.
Sequenced DNA 3100 is biological information relating to a sequence of nucleotides - adenine, thymine, guanine, and cytosine - of a given DNA molecule, or genome. Determining the sequence of nucleotides of a genome is achieved by various 'wet-lab' sequencing methodologies and techniques, as is well known in the art.
The resulting genomic sequence is typically represented in a genomic alphanumeric representation 3110, which is an alphanumeric string used both for computer storage of sequenced DNA data and for its presentation. The genomic alphanumeric representation 3110 comprises primarily four letters: 'A' representing adenine, 'T' representing thymine, 'G' representing guanine and 'C' representing cytosine.
As is well known in the art, the nucleotide adenine, represented by 'A', is a 'counterpart' of the nucleotide thymine represented by 'T': wherever the DNA sequence on one DNA strand contains adenine, the opposite strand at that exact location contains thymine, and vice-versa. Similarly, the nucleotide guanine represented by 'G' is a 'counterpart' of the nucleotide cytosine represented by 'C'. Wherever the DNA sequence on one DNA strand contains guanine, the opposite strand at that exact location contains cytosine, and vice-versa.
Also, as is well known in the art, adenine represented by 'A' and guanine represented by 'G' are purine nucleotides, whereas thymine represented by 'T' and cytosine represented by 'C' are pyrimidine nucleotides.
In accordance with a preferred embodiment of the present invention, a genomic graphic representation engine 3120 receives the genomic alphanumeric representation 3110 and converts it into a genomic graphical representation 3130. A preferred embodiment of a genomic graphic representation engine 3120 is further described below with reference to Fig. 32.
The genomic graphic representation 3130, produced by the genomic graphic representation engine 3120, preferably represents each of the four letters - 'A', 'T', 'G' and 'C' - in the original genomic alphanumeric representation 3110, using a plurality of graphic parameters. In a preferred embodiment of the invention, each letter of the alphanumeric representation 3110 is represented by a combination of a graphic shape, a vertical orientation, and a color specific to that letter, as is further described below. As is known in the art, the genomic sequence data may also include unknown nucleotides, i.e. nucleotides in the genomic sequence which the sequencing process was unable to identify. Unknown nucleotides are typically represented by 'N' or '-'. These may also be represented in the graphic representation 3130 by a designated shape, color, and letter, as per user preference.
A preferred embodiment of the genomic graphic representation 3130 may include a genomic font with embedded letters 3140, in which a letter representing each nucleotide is embedded on a shape that represents it graphically. In a preferred embodiment of the present invention the following shapes are used: The letter 'A' is represented by an upward oriented half-oval with an embedded letter 'A', as illustrated by reference numeral 3141. The letter 'T' is represented by a downward oriented half-oval with an embedded letter 'T', as illustrated by reference numeral 3142. The letter 'G' is represented by an upward oriented half-square with an embedded letter 'G', as illustrated by reference numeral 3143. The letter 'C' is represented by a downward oriented half-square with an embedded letter 'C', as illustrated by reference numeral 3144.
Alternatively, the genomic graphic representation 3130 may also include a genomic font without letters 3150, in which only a shape is used to graphically represent each letter, without any letter embedded on the shape. In a preferred embodiment of the present invention an upward oriented half-oval 3151 represents 'A', a downward oriented half-oval 3152 represents 'T', an upward oriented half-square 3153 represents 'G', and a downward oriented half-square 3154 represents 'C'.
Preferably, each of the four shapes without letters 3151, 3152, 3153 & 3154, or alternatively each of the four shapes with embedded letters 3141, 3142, 3143 & 3144, representing the four letters 'A', 'T', 'G' and 'C', respectively, may be displayed in a different color, according to the user's preference. A preferred embodiment of the current invention displays the above mentioned shapes in red, blue, brown, and green, respectively.
It is appreciated that the genomic graphic representation 3130 described above provides enhanced visual discrimination between adenine-thymine counterparts, and guanine-cytosine counterparts. Preferably, 'A' and 'T' are represented by two vertical 'complementary' halves of one shape. For example, in a preferred embodiment of the current invention, 'A' and 'T' are represented by two halves of an oval 3151 and 3152 respectively. Similarly, 'G' and 'C' are also represented by two vertical complementary halves of a different shape. For example, as in a preferred embodiment of the current invention, 'G' and 'C' are represented by two halves of a square 3153 and 3154 respectively. An example of the usefulness of enhanced ease of visual discrimination between adenine-thymine counterparts and guanine-cytosine counterparts is distinguishing 'AT-rich' DNA segments, i.e. segments in which there is a higher incidence of adenine and thymine nucleotides, from 'CG-rich' DNA segments, i.e. segments in which there is a higher incidence of cytosine and guanine nucleotides. It is appreciated that AT-rich DNA segments, in which the oval shapes are more dominant, can be discerned visually with enhanced ease from 'CG-rich' segments, in which square shapes are more dominant. Different shapes may be utilized other than the ones described here, e.g. a triangle may be used instead of a half-oval. Enhanced visual discernment of AT-rich from CG-rich DNA sequences is further described below with reference to Fig. 35.
It is further appreciated that the genomic graphic representation 3130 described above also provides the user with enhanced ease of visually distinguishing purine nucleotides from pyrimidine nucleotides. According to a preferred embodiment of the current invention, both purine nucleotides adenine ('A') and guanine ('G') are graphically represented by shapes that have an upward orientation: an upward oriented half-oval 3151 and an upward oriented half-square 3153, respectively. Similarly, both pyrimidine nucleotides thymine ('T') and cytosine ('C') are graphically represented by shapes that have a downward orientation: a downward oriented half-oval 3152 and a downward oriented half-square 3154, respectively. An example of the usefulness of the enhanced ease of visually distinguishing purine nucleotides from pyrimidine nucleotides is the enhanced ease of visually discerning the similarity between two genomic motifs: when comparing two genomic motifs, one ending with adenine ('A') and the other ending with guanine ('G'), the similarity between the two motifs is made more visually apparent, since adenine and guanine are both purine nucleotides and are both graphically represented by upward oriented shapes. Visual distinguishing of purine nucleotides from pyrimidine nucleotides is further described below with reference to Fig. 36. It is yet further appreciated that the genomic graphic representation 3130 described above may also provide enhanced visual discrimination between the four different nucleotides, based on their different colors.
It is appreciated that while Fig. 31 illustrates a human sensible, graphical representation of genomic sequence data in order to represent one or more genomic attributes of each of the four nucleotides, another implementation of the present invention may use machine sensible representation in order to represent these attributes.
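By way of non-limiting illustration only, a machine sensible representation of the attributes described above might be sketched as a small lookup table. The names Glyph, GENOMIC_FONT, to_glyphs and is_purine are hypothetical and do not appear in the specification; the shapes, orientations and colors are those of the preferred embodiment of Fig. 31.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Attributes of one nucleotide glyph (hypothetical structure)."""
    letter: str       # nucleotide letter
    shape: str        # 'oval' for the A/T pair, 'square' for the G/C pair
    orientation: str  # 'up' for purines (A, G), 'down' for pyrimidines (T, C)
    color: str        # color of the preferred embodiment


# Table mirroring reference numerals 3151-3154 of Fig. 31.
GENOMIC_FONT = {
    "A": Glyph("A", "oval", "up", "red"),
    "T": Glyph("T", "oval", "down", "blue"),
    "G": Glyph("G", "square", "up", "brown"),
    "C": Glyph("C", "square", "down", "green"),
}


def to_glyphs(sequence):
    """Convert an alphanumeric genomic string into a list of glyph records."""
    return [GENOMIC_FONT[base] for base in sequence.upper()]


def is_purine(base):
    """Purines are exactly the nucleotides drawn with an upward orientation."""
    return GENOMIC_FONT[base.upper()].orientation == "up"
```

In such a sketch, complementary nucleotides share a shape and differ only in orientation, so both groupings described above can be read directly from the table.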
Reference is now made to Fig. 32, which is a simplified flowchart illustrating preferred operation of the genomic graphic representation engine 3120 of Fig. 31, constructed and operative in accordance with a preferred embodiment of the present invention.
First, a genomic font is produced, preferably using conventional font-creation software, such as 'FONT CREATOR PROGRAM'. In this font, preferred shapes are assigned to each of the four letters 'A', 'T', 'G' and 'C', such as the shapes indicated by reference numerals 3151-3154 of Fig. 31, respectively. Preferably two variations of the genomic font are employed: the first comprising shapes with embedded letters, as designated by reference numeral 3140, and the other comprising shapes without embedded letters, as designated by reference numeral 3150, both of Fig. 31. The preferred shapes for each of the four letters, 'A', 'T', 'G' & 'C', are preferably those designated by reference numerals 3141, 3142, 3143 & 3144 and by reference numerals 3151, 3152, 3153 & 3154 respectively in Fig. 31.
The process of generating a genomic font preferably is a one-time process, and hence is connected to the next step by a broken line. It is typically carried out once, before an iterative process of converting the representation of multiple genomic sequences, from a genomic alphanumeric representation 3110 into a genomic graphic representation 3130.
Once a genomic font has been created, the process of graphically representing genomic sequence data may be very simple: an alphanumeric string representing genomic sequence data is received and a genomic font, generated by the previous step, is applied to this alphanumeric string.
Different colors may be applied to different letters, typically by using standard 'search-and-replace' commands, as is known in the art. In a preferred embodiment of the present invention, the colors applied to the letters 'A', 'T', 'G' & 'C' are red, blue, brown & green respectively. Clearly, other colors may be used, according to user preferences. It should be noted that applying different colors to different letters is an optional step: the user may or may not want to view the different letters in different colors, or may want to view a group of letters, for example purine nucleotides or pyrimidine nucleotides, or A-T or C-G, or some other grouping in a certain color.
Finally, the resulting alphanumeric string, now displayed graphically by the genomic font, may be displayed.
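The steps of Fig. 32 may be sketched, purely as an illustrative assumption, as a function that wraps each letter in HTML markup, applies an installed genomic font, and optionally colors each letter. The font family name "GenomicFont" and the function name render_html are hypothetical; the specification does not prescribe HTML or any particular display technology.

```python
# Colors of the preferred embodiment; any other mapping may be supplied.
DEFAULT_COLORS = {"A": "red", "T": "blue", "G": "brown", "C": "green"}


def render_html(sequence, font_family="GenomicFont", colors=DEFAULT_COLORS):
    """Return HTML that displays `sequence` in a genomic font.

    `font_family` is a hypothetical name for a font produced in the
    one-time font-generation step. Pass colors=None to skip the optional
    coloring step.
    """
    spans = []
    for base in sequence.upper():
        color = colors.get(base) if colors else None
        style = f' style="color:{color}"' if color else ""
        spans.append(f"<span{style}>{base}</span>")
    return f'<span style="font-family:{font_family}">' + "".join(spans) + "</span>"
```

A grouping such as purine versus pyrimidine could be colored by passing, for example, a mapping that assigns one color to 'A' and 'G' and another to 'T' and 'C'.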
Reference is now made to Fig. 33, which is a simplified illustration depicting an example of conversion of a typical alphanumeric genomic representation, of the type indicated by reference numeral 3110 of Fig. 31, into a typical graphic genomic representation, of the type indicated by reference numeral 3130 of Fig. 31.
Reference numeral 3300 designates an example of a short genomic sequence, conventionally represented by an alphanumeric string 'ACTTTTGATAATTATTGTAACTGTAAAAGAT'.
The short genomic sequence 3300 may be displayed using a genomic graphic representation as designated by reference numeral 3310, either employing genomic font with embedded letters 3140 of Fig. 31, as designated by reference numeral 3320 or employing genomic font without embedded letters 3150 of Fig. 31, as designated by reference numeral 3330.
It is appreciated that it is easier to visually discern patterns in the genomic sequence when it is displayed as a genomic graphic representation 3310, than when displayed as genomic alphanumeric representation 3300. For example, when viewing the genomic alphanumeric representation 3300, the segment 'ATAATTAT', 8th-15th characters in the string from its left end, surrounded by a broken-line border, may not immediately stand out as having any special significance. However, when examining the same segment in the genomic graphic representation without embedded letters 3330, a visual pattern is apparent: the first four characters in this segment, 'ATAA', are a vertical and horizontal mirror image of the last four characters of the segment, 'TTAT'. A genomic sequence in which the second half of the sequence is a reversed-inversed sequence relative to the first half of the sequence, such as 'ATAATTAT' in the given example designated by reference numeral 3300, is known in the art as a 'hair-pin structure'.
It is thus easier to visually discern a genomic pattern indicating that the sequence 'ATAATTAT' is a 'hair-pin' sequence. 'Hair-pin' sequences are genomic patterns which may indeed be biologically significant.
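The 'hair-pin' pattern described above, in which the second half of a segment is the reversed-inversed form of its first half, may also be checked mechanically. The following is a minimal sketch under that definition only; the function names are hypothetical and the fixed-length window scan is an assumption made for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def inverse_reverse(seq):
    """Read the sequence right to left and swap A<->T and C<->G."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_hairpin(segment):
    """True if the second half is the inverse-reverse of the first half."""
    if len(segment) % 2:
        return False  # an odd-length segment has no two equal halves
    half = len(segment) // 2
    return segment[half:] == inverse_reverse(segment[:half])


def find_hairpins(sequence, length):
    """Return 0-based start positions of all windows of `length` that are hairpins."""
    return [
        start
        for start in range(len(sequence) - length + 1)
        if is_hairpin(sequence[start:start + length])
    ]
```

Applied to the sequence designated by reference numeral 3300 with a window length of 8, such a scan reports, among its results, the 0-based position 7, i.e. the segment occupying the 8th-15th characters.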
Reference is now made to Fig. 34, which is a simplified illustration of an example demonstrating an advantage of the graphic genomic representation 3130 of Fig. 31, in comparing a genomic motif sequence with the inverse-reverse thereof.
Genomic motifs are short genomic sequences, which may have a specific biological significance or action. Genomic motifs may be compared to 'words', insofar as a word is a combination of English letters and has a specific meaning, and a genomic motif is a combination of genomic nucleotides, and may have a specific action. An example of a well known genomic motif is the genomic sequence 'GATAA'.
As is well known in the art, genomic sequence data is typically provided for a sequence of nucleotides on a positive strand of the DNA. However, some segments of biologically significant genomic data are actually 'coded' on the negative, i.e. opposite, strand of the DNA. In order to determine the 'real' sequence data for such segments it is necessary to inverse-reverse the sequence received from the positive strand, as is well known in the art. To inverse-reverse means to read the sequence from right to left, and replace each A with a T, and each C with a G and vice-versa. For example, 'TTATC' is the inverse-reverse of the genomic motif 'GATAA'. The action of inverse-reversing genomic sequences is very frequently used, especially when analyzing genomic motifs. For example, the genomic motif 'GATAA' may appear in the genomic sequence either as 'GATAA', or as its inverse-reverse 'TTATC'. However, using the standard alphanumeric presentation, it may not be easy to visually determine the similarity between such a motif and its inverse-reverse.
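The inverse-reverse operation defined above, and a search for a motif in either of its two forms, may be sketched as follows. This is an illustrative assumption only; the function names are hypothetical, and the strand labels '+' and '-' are merely a convenient notation for the motif as given and its inverse-reverse.

```python
# Translation table swapping A<->T and G<->C.
COMPLEMENT = str.maketrans("ATGC", "TACG")


def inverse_reverse(seq):
    """Replace each nucleotide by its complement, then read right to left."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_motif_both_strands(sequence, motif):
    """Return sorted (0-based position, strand) pairs for every occurrence.

    '+' marks the motif as given; '-' marks its inverse-reverse. A motif
    that equals its own inverse-reverse is reported under both labels.
    """
    sequence = sequence.upper()
    targets = {"+": motif.upper(), "-": inverse_reverse(motif)}
    hits = []
    for strand, target in targets.items():
        start = sequence.find(target)
        while start != -1:
            hits.append((start, strand))
            start = sequence.find(target, start + 1)
    return sorted(hits)
```

For the motif 'GATAA', inverse_reverse yields 'TTATC', in agreement with the example given above.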
As noted above with reference to Fig. 31, the genomic graphic representation 3130 of Fig. 31 provides the user with enhanced ease of visually discerning genomic motifs from their inversed-reversed sequences, inasmuch as the inversed-reversed sequence presents a horizontal and vertical mirror image of the original motif. This is due to the fact that complementary nucleotide pairs adenine-thymine and cytosine-guanine are graphically represented by complementary vertical halves of the same shape, as described with reference to Fig. 31.
Fig. 34 enables comparison between the genomic sequence 'GATAA', which is a well known genomic motif, and the genomic sequence 'TTATC', which is the inverse-reverse of this genomic motif. It is seen that the graphical representation of the inversed-reversed genomic motif 'TTATC', as designated by reference numeral 3440, presents a vertical and horizontal mirror image of the genomic motif 'GATAA', as designated by reference numeral 3450. This provides the user with enhanced ease of visually discerning the similarity of these motifs. The same is true for the graphic representation with embedded letters, as depicted by reference numerals 3420 and 3430 respectively.
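The mirror-image property noted above can be verified with a small model, offered here only as a sketch. It assumes that each glyph is fully described by a (shape, orientation) pair and that the half-oval and half-square are themselves left-right symmetric, so that a horizontal mirror only reverses the order of the glyphs while a vertical mirror only flips each orientation. All names are hypothetical.

```python
# (shape, orientation) per nucleotide, following the preferred embodiment.
GLYPH = {
    "A": ("oval", "up"),
    "T": ("oval", "down"),
    "G": ("square", "up"),
    "C": ("square", "down"),
}
FLIP = {"up": "down", "down": "up"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def glyph_row(seq):
    """Graphic representation of a sequence as a row of glyph descriptors."""
    return [GLYPH[base] for base in seq]


def mirror_both(row):
    """Horizontal mirror (reverse the row) plus vertical mirror (flip each glyph)."""
    return [(shape, FLIP[orientation]) for shape, orientation in reversed(row)]


def inverse_reverse(seq):
    """Read right to left and swap A<->T, C<->G."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))
```

Under these assumptions, mirror_both(glyph_row(s)) equals glyph_row(inverse_reverse(s)) for any sequence s, because complementing a nucleotide keeps its shape and flips its orientation.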
Reference is now made to Fig. 35, which is a simplified illustration of an example demonstrating an advantage of graphic genomic representation 3130 of Fig. 31, in visually distinguishing adenine-thymine-rich sequences from cytosine-guanine-rich sequences.
An example is given of a genomic sequence "CCCGCTCCAGG", which is a GC-rich sequence, and a genomic sequence "TTTATTATCTA", which is an AT-rich sequence. Reference numerals 3500 and 3510 respectively designate these sequences in standard alphanumeric form, whereas reference numerals 3520 & 3530 and 3540 & 3550, respectively, designate these sequences graphically, with embedded letters and without embedded letters respectively.
It is appreciated that the genomic graphic representation 3130 of Fig. 31 provides the user with enhanced ease of visually discerning GC-rich sequences, depicted by reference numerals 3520 and 3540, in which the predominant shapes are squares, from AT-rich sequences, depicted by reference numerals 3530 and 3550, in which the predominant shapes are ovals. This may be particularly useful, since AT-rich sequences and GC-rich sequences may have different genomic significance, as is well known in the art.
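The distinction between GC-rich and AT-rich sequences that Fig. 35 makes visually may also be expressed numerically as the fraction of 'G' and 'C' letters in a sequence. The following is a minimal sketch; the function names and the 0.5 threshold are assumptions introduced for illustration and are not taken from the specification.

```python
def gc_fraction(seq):
    """Fraction of letters in `seq` that are G or C (square shapes)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def classify(seq, threshold=0.5):
    """Label a sequence 'GC-rich' above the (assumed) threshold, else 'AT-rich'."""
    return "GC-rich" if gc_fraction(seq) > threshold else "AT-rich"
```

For the two example sequences of Fig. 35, nine of the eleven letters of "CCCGCTCCAGG" are 'G' or 'C', whereas only one of the eleven letters of "TTTATTATCTA" is.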
Reference is now made to Fig. 36, which is a simplified illustration of an example which demonstrates an advantage of graphic genomic representation 3130 of Fig. 31, in visually distinguishing purine nucleotides from pyrimidine nucleotides. As is well known in the art, meaningful genomic motifs often appear in a genome with slight variations, while still maintaining their biological function and significance. A motif in which variations are known to happen is typically described in terms of a 'consensus-sequence', which is a description of the location and frequency of acceptable 'mistakes', notwithstanding which the biological function of the motif is maintained. A consensus-sequence may be compared to an English word, for which several slightly different spellings may be considered acceptable, e.g. Haematology and Hematology.
Often, the consensus-sequence may be related to a biochemical type of nucleotides. For example, the consensus-sequence definition for the well known genomic motif 'GATA box' is WGATAR, where W stands for an adenine or thymine nucleotide, and R stands for a purine nucleotide: either adenine or guanine. The consensus-sequence in this example states that both 'AGATAA' and 'AGATAG' may have the same biological function, despite the difference in the last nucleotide, since both adenine and guanine are purine nucleotides.
The present invention provides the user with enhanced ease of visually discerning purine nucleotides from pyrimidine nucleotides, thereby making it easier to visually identify genomic consensus-sequence motifs in which the consensus-sequence definition contains a purine or a pyrimidine.
Fig. 36 provides an example of the two genomic sequences 'AGATAA' and 'AGATAG' mentioned above, both being variants of the same consensus-sequence motif WGATAR mentioned above.
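Matching a candidate sequence against a consensus-sequence such as WGATAR may be sketched as a position-by-position lookup. The table below lists only a subset of the standard single-letter nucleotide ambiguity codes (W for A or T, R for A or G, and so on); the function name is hypothetical and the sketch is offered as an illustration only.

```python
# Subset of the standard nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "W": "AT",    # adenine or thymine
    "S": "CG",    # cytosine or guanine
    "N": "ACGT",  # any nucleotide
}


def matches_consensus(seq, consensus):
    """True if every letter of `seq` is allowed at its position in `consensus`."""
    seq = seq.upper()
    consensus = consensus.upper()
    return len(seq) == len(consensus) and all(
        base in IUPAC[code] for base, code in zip(seq, consensus)
    )
```

Both 'AGATAA' and 'AGATAG' satisfy WGATAR under this check, whereas 'AGATAC' does not, since cytosine is a pyrimidine.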
Reference numeral 3600 designates a genomic alphanumeric representation of an adenine ending GATA box, 'AGATAA', and reference numeral 3610 designates a genomic alphanumeric representation of a guanine ending GATA box, 'AGATAG'. For clarity, the purine nucleotide ending the GATA box, adenine in reference numeral 3600 and guanine in reference numeral 3610, is surrounded by a broken-line border.
Reference numerals 3640 and 3650 designate genomic graphic representations of these two variants of the WGATAR motif: 'AGATAA' and 'AGATAG' respectively. It is appreciated that the genomic graphic representation 3130 of Fig. 31 makes it easier to visually identify the similarity between these two variants of the same 'GATA box' consensus sequence, since adenine and guanine, which are both purine nucleotides, are graphically represented by upward oriented shapes: upward oriented half-oval 3151 and upward oriented half-square 3153 respectively. The same is clearly true of the graphic representation with embedded letters, as depicted by reference numerals 3620 and 3630 respectively.
It is appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove as well as variations and modifications which would occur to persons skilled in the art upon reading the specifications and which are not in the prior art.

Claims

1. A method for analysis of genomic sequence data in compressed data space, the method comprising: obtaining genomic data; preprocessing genomic data into preprocessed genomic data; compressing at least part of said preprocessed genomic data; storing compressed preprocessed genomic data; indexing compressed preprocessed genomic data; and analyzing genomic data, based at least in part on said indexing.
2. A method according to claim 1 and wherein: said obtaining comprises obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said genomic sequence data; said preprocessing comprises calculating and storing a plurality of genomic region sequences, based at least in part on said obtaining, and determining for each of said plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith; said compressing comprises compressing each of said plurality of uncompressed short genomic segments contained in each of said plurality of genomic region sequences into one of a plurality of compressed short genomic segments; said storing comprises storing said plurality of compressed short genomic segments; said indexing comprises indexing said plurality of compressed short genomic segments; and said analyzing comprises: receiving a user query containing at least one logical condition relating to at least one of the following: one of said genomic region sequences, and one of said uncompressed short genomic segments, and retrieving results to said user query, said retrieving comprising at least one of the following: retrieving one of said plurality of proteins, retrieving one of said plurality of genomic region sequences, and retrieving and decompressing one of said plurality of compressed short genomic segments, based at least in part on said indexing.
3. A method according to claim 2 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
4. A method according to claim 3 and wherein each of said plurality of protein coding regions is normalized.
5. A method according to claim 2 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
6. A method according to claim 5 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
7. A method according to claim 5 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
8. A method according to claim 5 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
9. A method according to claim 2 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
10. A method according to claim 9 and wherein said given length is user specified.
11. A method according to claim 2 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
12. A method according to claim 2 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first.
13. A method according to claim 2 and wherein said determining also includes determining a relationship between at least two of said plurality of short genomic segments, and said storing also includes storing said relationship between at least two of said plurality of short genomic segments, and said at least one logical condition references said relationship.
14. A method according to claim 13 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequences relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequences relative to said one of said plurality of genomic region sequences.
15. A method according to claim 13 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
16. A method according to claim 2 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
17. A method according to claim 2 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
18. A method according to claim 2 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
19. A method according to claim 2 and wherein: said method also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
20. A method according to claim 2 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
21. A method according to claim 20 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
22. A method according to claim 20 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
23. A method according to claim 20 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
24. A method according to claim 2 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
25. A method according to claim 24 and wherein said sorting is alphabetical sorting.
26. A method according to claim 2 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
27. A method according to claim 26 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
28. A method according to claim 2 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
29. A method according to claim 28 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
30. A method according to claim 28 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
31. A method according to claim 28 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
32. A method according to claim 28 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
33. A method according to claim 28 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
34. A method according to claim 33 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
35. A method according to claim 34 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
36. A method according to claim 34 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
37. A method for analysis and comparison of genomic sequence data comprising: obtaining two compressed genomic sequences, a first compressed genomic sequence representing in compressed form a first genomic sequence, and a second compressed genomic sequence representing in compressed form a second genomic sequence, and protein coding region location data for each of a plurality of proteins known to be encoded by said first genomic sequence and said second genomic sequence, calculating and storing a plurality of genomic region sequences for each of said first and second compressed genomic sequences, based at least in part on said obtaining, determining and storing a plurality of short genomic segments contained in each of said plurality of genomic region sequences for each of said first and second compressed genomic sequences; comparing said first compressed genomic sequence with said second compressed genomic sequence, based on at least one of the following: one of said genomic region sequences, and one of said short genomic segments, and determining degree of similarity between said first compressed genomic sequence and said second compressed genomic sequence, based at least in part on said comparing.
38. A method according to claim 37 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
39. A method according to claim 38 and wherein each of said plurality of protein coding regions is normalized.
40. A method according to claim 37 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
41. A method according to claim 40 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
42. A method according to claim 40 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
43. A method according to claim 40 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
44. A method according to claim 37 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
45. A method according to claim 44 and wherein said given length is user specified.
46. A method according to claim 37 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
47. A method according to claim 37 and wherein genomic sequence data comprises a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first.
48. A method according to claim 37 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
49. A method according to claim 48 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequences relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequences relative to said one of said plurality of genomic region sequences.
50. A method according to claim 48 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
51. A method according to claim 37 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
52. A method according to claim 37 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
53. A method according to claim 37 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
54. A method according to claim 37 and wherein: said method also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
55. A method according to claim 37 and wherein said determining does not require comparing said first genomic sequence with said second genomic sequence.
56. A method according to claim 37 and wherein said determining does not include any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
57. A method for analysis of genomic sequence data utilizing genomic sequence similarity assessment, the method comprising: obtaining genomic data; preprocessing genomic data into preprocessed genomic data; storing said preprocessed genomic data; indexing said preprocessed genomic data; and analyzing genomic data, based at least in part on said indexing, said analyzing also comprising assessing genomic sequence similarity, based at least in part on said indexing.
58. A method according to claim 57 and wherein: said obtaining comprises obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said genomic sequence data; said preprocessing comprises calculating and storing a plurality of genomic region sequences, based at least in part on said obtaining, and determining for each of said plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith; said storing comprises storing said plurality of uncompressed short genomic segments; said indexing comprises indexing said plurality of uncompressed short genomic segments; and said analyzing comprises: receiving a user query containing at least one logical condition relating to one of said plurality of uncompressed short genomic segments and at least one similarity criterion; extracting a subset of said plurality of uncompressed short genomic segments, each of said subset being similar to said one of said uncompressed short genomic segments according to said at least one similarity criterion; and retrieving results to said user query, based at least in part on said extracting, said retrieving comprising at least one of the following: retrieving one of said plurality of proteins and retrieving one of said plurality of genomic region sequences, based at least in part on said indexing.
59. A method according to claim 58 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
60. A method according to claim 59 and wherein each of said plurality of protein coding regions is normalized.
61. A method according to claim 58 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
62. A method according to claim 61 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
63. A method according to claim 61 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
64. A method according to claim 61 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
65. A method according to claim 58 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
66. A method according to claim 65 and wherein said given length is user specified.
67. A method according to claim 58 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
68. A method according to claim 58 and wherein genomic sequence data comprises a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first.
69. A method according to claim 58 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
70. A method according to claim 69 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequences relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequences relative to said one of said plurality of genomic region sequences.
71. A method according to claim 69 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
72. A method according to claim 58 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
73. A method according to claim 58 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
74. A method according to claim 58 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
75. A method according to claim 58 and wherein: said method also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
76. A method according to claim 58 and wherein said method also comprises decompressing each of said second plurality of compressed genomic sequences.
77. A method according to claim 58 and wherein said producing does not require comparing said genomic sequence with any of said first plurality of genomic sequences.
78. A method according to claim 58 and wherein said producing does not require decompressing any of said first plurality of compressed genomic sequences.
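By way of non-limiting illustration only, the preprocessing and logical-condition steps recited in claims 58 to 74 may be sketched as follows. The sketch assumes that a short genomic segment is a fixed-length window over a genomic region sequence, that normalization means reading a region in its coding direction (reverse complement for the reverse strand), and that uniqueness and commonality are judged by the set of regions containing a segment. All function names are hypothetical and do not appear in the specification.

```python
from collections import defaultdict

# Assumed alphabet: A, C, G, T plus N for an unknown nucleotide.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def normalize(region: str, forward: bool) -> str:
    """Return the region read in coding direction.

    Assumption: a reverse-strand region is normalized by reverse complement.
    """
    return region if forward else region.translate(COMPLEMENT)[::-1]


def short_segments(region: str, length: int):
    """Yield (offset, segment) for every window of the given length."""
    for offset in range(len(region) - length + 1):
        yield offset, region[offset:offset + length]


def build_index(regions: dict, length: int) -> dict:
    """Map each short segment to the set of region names containing it."""
    index = defaultdict(set)
    for name, sequence in regions.items():
        for _, segment in short_segments(sequence, length):
            index[segment].add(name)
    return index


def unique_to(index: dict, region_name: str) -> list:
    """Segments found in the named region and in no other region."""
    return sorted(s for s, names in index.items() if names == {region_name})


def common_to(index: dict, min_regions: int = 2) -> list:
    """Segments shared by at least min_regions regions."""
    return sorted(s for s, names in index.items() if len(names) >= min_regions)
```

An exclusion condition of the kind recited in claim 74 could be expressed against the same index by filtering out segments whose region set contains a given region.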
79. A method for comparing uncompressed genomic sequences, the method comprising: receiving a first uncompressed genomic sequence and a second uncompressed genomic sequence; compressing said first uncompressed genomic sequence into a first compressed genomic sequence, and said second uncompressed genomic sequence into a second compressed genomic sequence, comparing said first compressed genomic sequence with said second compressed genomic sequence, and determining degree of similarity between said first genomic sequence and said second genomic sequence, based at least in part on said comparing.
80. A method according to claim 79 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
81. A method according to claim 80 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
82. A method according to claim 80 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
83. A method according to claim 80 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
84. A method according to claim 79 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
85. A method according to claim 84 and wherein said sorting is alphabetical sorting.
86. A method according to claim 79 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
87. A method according to claim 86 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
88. A method according to claim 79 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
89. A method according to claim 88 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
90. A method according to claim 88 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
91. A method according to claim 88 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
92. A method according to claim 88 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
93. A method according to claim 88 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
94. A method according to claim 93 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
95. A method according to claim 94 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
96. A method according to claim 94 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
97. A method according to claim 79 and wherein said determining does not require comparing said first genomic sequence with said second genomic sequence.
98. A method according to claim 79 and wherein said determining does not include any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
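By way of non-limiting illustration only, the character compression of claims 88 to 92 and the comparison in compressed form of claims 79, 97 and 98 may be sketched as follows. The claims do not fix an encoding; this sketch assumes a five-letter alphabet (A, C, G, T and N for an unknown nucleotide) and packs three uncompressed characters into one byte as a base-5 number. The similarity measure shown, the fraction of equal compressed characters, is likewise an assumption chosen for brevity.

```python
ALPHABET = "ACGTN"  # N stands for an unknown nucleotide (assumed alphabet)
CODE = {ch: i for i, ch in enumerate(ALPHABET)}
GROUP = 3  # 5**3 = 125 distinct values, so three characters fit in one byte


def compress(seq: str) -> bytes:
    """Pack every three characters into one compressed character (byte).

    The final group is padded with N, so the original length must be
    kept separately in order to decompress exactly.
    """
    padded = seq + "N" * (-len(seq) % GROUP)
    out = bytearray()
    for i in range(0, len(padded), GROUP):
        value = 0
        for ch in padded[i:i + GROUP]:
            value = value * 5 + CODE[ch]
        out.append(value)
    return bytes(out)


def decompress(data: bytes, length: int) -> str:
    """Invert compress(), trimming the padding to the stated length."""
    chars = []
    for value in data:
        group = []
        for _ in range(GROUP):
            value, digit = divmod(value, 5)
            group.append(ALPHABET[digit])
        chars.extend(reversed(group))
    return "".join(chars[:length])


def compressed_similarity(a: bytes, b: bytes) -> float:
    """Fraction of equal compressed characters; nothing is decompressed."""
    if len(a) != len(b):
        raise ValueError("compressed sequences must have equal length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Because one byte stands for three nucleotides, a single mismatching nucleotide makes the whole compressed character differ, so this measure is coarser than a per-nucleotide comparison.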
99. A method for assessing similarity, storage and retrieval of compressed genomic sequence data, the method comprising: receiving uncompressed genomic sequence data; compressing said uncompressed genomic sequence data into compressed genomic sequence data; storing said compressed genomic sequence data; indexing said compressed genomic sequence data; retrieving at least part of said compressed genomic sequence data representing uncompressed genomic sequence data similar to an uncompressed genomic target sequence, based at least in part on said indexing; and decompressing said at least part of said compressed genomic sequence data.
100. A method according to claim 99 and wherein said retrieving comprises: receiving a target genomic sequence, a first plurality of compressed genomic sequences, representing respectively in compressed form a first plurality of genomic sequences, and at least one similarity criterion; and producing a second plurality of compressed genomic sequences, representing respectively in compressed form a second plurality of genomic sequences, said second plurality of genomic sequences being a subset of said first plurality of genomic sequences, each of said second plurality of genomic sequences being similar to said target genomic sequence, according to said at least one similarity criterion.
101. A method according to claim 100 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
102. A method according to claim 101 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
103. A method according to claim 101 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
104. A method according to claim 101 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
105. A method according to claim 100 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
106. A method according to claim 105 and wherein said sorting is alphabetical sorting.
107. A method according to claim 100 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
108. A method according to claim 107 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
109. A method according to claim 100 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
110. A method according to claim 109 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
111. A method according to claim 109 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
112. A method according to claim 109 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
113. A method according to claim 109 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
114. A method according to claim 109 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
115. A method according to claim 114 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
116. A method according to claim 115 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
117. A method according to claim 115 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
118. A method according to claim 100 and wherein said method also comprises decompressing each of said second plurality of compressed genomic sequences.
119. A method according to claim 100 and wherein said producing does not require comparing said genomic sequence with any of said first plurality of genomic sequences.
120. A method according to claim 100 and wherein said producing does not require decompressing any of said first plurality of compressed genomic sequences.
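By way of non-limiting illustration only, the storing, indexing and query-by-compressed-data steps of claims 99 to 117 may be sketched with an in-memory SQLite database. The sketch assumes a 2-bit encoding that packs four nucleotides (A, C, G, T only, no unknown nucleotides) into one byte, stores the packed string in a BLOB field of a table, and answers an exact-match query by compressing the query data and looking it up in the index. The table, column and function names are hypothetical. Unlike claims 115 to 117, the packing here is done by an external Python program rather than inside the database.

```python
import sqlite3

BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per nucleotide


def pack(seq: str) -> bytes:
    """Pack four nucleotides per byte; the last byte is padded with A."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for ch in seq[i:i + 4].ljust(4, "A"):
            byte = (byte << 2) | BITS[ch]
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Invert pack(), trimming the padding to the stated length."""
    chars = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            chars.append("ACGT"[(byte >> shift) & 3])
    return "".join(chars[:length])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments ("
    "id INTEGER PRIMARY KEY, packed BLOB NOT NULL, length INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_segments_packed ON segments (packed, length)")


def store(seq: str) -> None:
    """Compress a sequence and store only its compressed form."""
    conn.execute(
        "INSERT INTO segments (packed, length) VALUES (?, ?)",
        (pack(seq), len(seq)),
    )


def find(query: str) -> list:
    """Compress the query, match in compressed space, then decompress hits."""
    rows = conn.execute(
        "SELECT packed, length FROM segments WHERE packed = ? AND length = ?",
        (pack(query), len(query)),
    ).fetchall()
    return [unpack(packed, length) for packed, length in rows]


def sorted_segments() -> list:
    """Return stored sequences ordered by their compressed bytes."""
    rows = conn.execute(
        "SELECT packed, length FROM segments ORDER BY packed, length"
    ).fetchall()
    return [unpack(packed, length) for packed, length in rows]
```

With this particular code (A < C < G < T, first nucleotide in the highest bits), byte order of the packed field coincides with alphabetical order for sequences of equal length, which is one way the alphabetical sorting of claim 106 could be obtained from the index without decompression.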
121. A method for analysis and comparison of genomic sequence data comprising: obtaining a first uncompressed genomic sequence and a second uncompressed genomic sequence, and protein coding region location data for each of a plurality of proteins known to be encoded by said first uncompressed genomic sequence and said second uncompressed genomic sequence; calculating and storing a plurality of genomic region sequences for each of said first and second uncompressed genomic sequences, based at least in part on said obtaining; determining and storing a plurality of short genomic segments contained in each of said plurality of genomic region sequences for each of said first and second uncompressed genomic sequences; compressing said first uncompressed genomic sequence into a first compressed genomic sequence, and said second uncompressed genomic sequence into a second compressed genomic sequence; comparing said first compressed genomic sequence with said second compressed genomic sequence, based on at least one of the following: one of said genomic region sequences, and one of said short genomic segments; and determining degree of similarity between said first compressed genomic sequence and said second compressed genomic sequence, based at least in part on said comparing.
122. A method according to claim 121 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
123. A method according to claim 122 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
124. A method according to claim 122 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
125. A method according to claim 122 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
126. A method according to claim 121 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
127. A method according to claim 126 and wherein said sorting is alphabetical sorting.
128. A method according to claim 121 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
129. A method according to claim 128 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
130. A method according to claim 121 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
131. A method according to claim 130 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
132. A method according to claim 130 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
133. A method according to claim 130 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
134. A method according to claim 130 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
135. A method according to claim 130 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
136. A method according to claim 135 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
137. A method according to claim 136 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
138. A method according to claim 136 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
139. A method according to claim 121 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
140. A method according to claim 139 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
141. A method according to claim 139 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
142. A method according to claim 139 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
143. A method according to claim 121 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
144. A method according to claim 143 and wherein said sorting is alphabetical sorting.
145. A method according to claim 121 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
146. A method according to claim 145 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
147. A method according to claim 121 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
148. A method according to claim 147 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
149. A method according to claim 147 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
150. A method according to claim 147 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
151. A method according to claim 147 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
152. A method according to claim 147 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
153. A method according to claim 152 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
154. A method according to claim 153 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
155. A method according to claim 153 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
156. A method according to claim 121 and wherein said determining does not require comparing said first genomic sequence with said second genomic sequence.
157. A method according to claim 121 and wherein said determining does not include any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
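By way of non-limiting illustration only, the comparison of two genomic sequences on the basis of their short genomic segments, as recited in claim 121, may be sketched as follows. The claim does not prescribe a similarity measure; this sketch assumes that each short segment is compressed to a single integer (2 bits per nucleotide, A, C, G and T only) and that the degree of similarity is the Jaccard index of the two sets of compressed segments. Function names are hypothetical.

```python
def encode_segment(segment: str) -> int:
    """Compress a short segment into one integer, 2 bits per nucleotide.

    A leading 1 bit is kept as a sentinel so that segments of different
    lengths never collide. Raises ValueError for characters outside ACGT.
    """
    value = 1
    for ch in segment:
        value = (value << 2) | "ACGT".index(ch)
    return value


def compressed_segments(sequence: str, length: int) -> set:
    """Set of compressed short segments of the given length."""
    return {
        encode_segment(sequence[i:i + length])
        for i in range(len(sequence) - length + 1)
    }


def similarity(first: str, second: str, length: int) -> float:
    """Jaccard index of the two compressed segment sets (assumed measure)."""
    a = compressed_segments(first, length)
    b = compressed_segments(second, length)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Once the two segment sets have been built, the comparison operates on integers only, so neither compressed sequence needs to be decompressed to obtain the similarity value.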
158. A method for analysis of genomic sequence data utilizing genomic sequence similarity assessment in compressed data space, the method comprising: obtaining genomic data; preprocessing genomic data into preprocessed genomic data; compressing at least part of said preprocessed genomic data; storing compressed preprocessed genomic data; indexing compressed preprocessed genomic data; and analyzing genomic data, based at least in part on said indexing, said analyzing also comprising assessing genomic sequence similarity, based at least in part on said indexing.
159. A method according to claim 158 and wherein: said obtaining comprises obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said genomic sequence data; said preprocessing comprises calculating and storing a plurality of genomic region sequences, based at least in part on said obtaining, and determining for each of said plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith; said compressing comprises compressing each of said plurality of uncompressed short genomic segments contained in each of said plurality of genomic region sequences into one of a plurality of compressed short genomic segments; said storing comprises storing said plurality of compressed short genomic segments; said indexing comprises indexing said plurality of compressed short genomic segments; and said analyzing comprises: receiving a user query containing at least one logical condition relating to one of said plurality of uncompressed short genomic segments and at least one similarity criterion, extracting a subset of said plurality of uncompressed short genomic segments, each of said subset being similar to said one of said uncompressed short genomic segments according to said at least one similarity criterion; and retrieving results to said user query, based at least in part on said extracting, said retrieving comprising at least one of the following: retrieving one of said plurality of proteins, retrieving one of said plurality of genomic region sequences, and retrieving and decompressing one of said plurality of compressed short genomic segments, based at least in part on said indexing.
160. A method according to claim 159 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
161. A method according to claim 160 and wherein each of said plurality of protein coding regions is normalized.
162. A method according to claim 159 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
163. A method according to claim 162 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
164. A method according to claim 162 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
165. A method according to claim 162 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
166. A method according to claim 159 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
167. A method according to claim 166 and wherein said given length is user specified.
168. A method according to claim 159 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
169. A method according to claim 159 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
170. A method according to claim 159 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
171. A method according to claim 170 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequences relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequences relative to said one of said plurality of genomic region sequences.
172. A method according to claim 170 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
173. A method according to claim 159 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
174. A method according to claim 159 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
175. A method according to claim 159 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
176. A method according to claim 159 and wherein: said method also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
177. A method according to claim 159 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
178. A method according to claim 177 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
179. A method according to claim 177 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
180. A method according to claim 177 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
181. A method according to claim 159 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
182. A method according to claim 181 and wherein said sorting is alphabetical sorting.
183. A method according to claim 159 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
184. A method according to claim 183 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
185. A method according to claim 159 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
186. A method according to claim 185 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
187. A method according to claim 185 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
188. A method according to claim 185 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
189. A method according to claim 185 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
190. A method according to claim 185 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
191. A method according to claim 190 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
192. A method according to claim 191 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
193. A method according to claim 191 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
194. A method according to claim 159 and wherein said method also comprises decompressing each of said second plurality of compressed genomic sequences.
195. A method according to claim 159 and wherein said producing does not require comparing said genomic sequence with any of said first plurality of genomic sequences.
196. A method according to claim 159 and wherein said producing does not require decompressing any of said first plurality of compressed genomic sequences.
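By way of non-limiting illustration only, the analyzing step of claim 159, in which a subset of short genomic segments similar to a query segment is extracted and the associated proteins are retrieved, may be sketched as follows. The claim leaves the similarity criterion open; this sketch assumes a maximum number of mismatching positions (Hamming distance) between segments of equal length, and assumes an index that maps each segment to the names of the proteins whose regions contain it. The function names and the index layout are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length segments differ."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    return sum(x != y for x, y in zip(a, b))


def query(segment_index: dict, target: str, max_mismatches: int) -> dict:
    """Return {protein name: sorted list of segments similar to target}.

    segment_index maps each short segment to a set of protein names.
    A segment is treated as similar when it has the same length as the
    target and differs from it in at most max_mismatches positions.
    """
    results = {}
    for segment, proteins in segment_index.items():
        if len(segment) != len(target):
            continue
        if hamming(segment, target) <= max_mismatches:
            for protein in proteins:
                results.setdefault(protein, []).append(segment)
    return {protein: sorted(segs) for protein, segs in results.items()}
```

This linear scan over the index is chosen for clarity; the claims contemplate retrieval based on indexing, which a production system would presumably use to avoid visiting every segment.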
197. A genomic sequence analysis system comprising: a genomic sequence analysis preprocessor operative to obtain uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said uncompressed genomic sequence data, to calculate and store a plurality of genomic region sequences, based at least in part on said uncompressed genomic sequence data, and protein coding region location data, and to determine a plurality of short genomic segments contained in each of said plurality of uncompressed genomic region sequences; a genomic sequence data compressor operative to compress each of said plurality of uncompressed short genomic segments into one of a plurality of compressed short genomic segments; a genomic compressed sequence data indexer operative to store said plurality of compressed short genomic segments and to index said plurality of compressed short genomic segments; a genomic sequence analysis executor employing said genomic sequence analysis preprocessor and operative to receive a user query containing at least one logical condition relating to at least one of the following: one of said genomic region sequences and one of said uncompressed short genomic segments, and to retrieve results to said user query, said results comprising at least one of the following: one of said plurality of proteins, one of said plurality of genomic region sequences, and one of said plurality of compressed short genomic segments, based at least in part on said user query and employing said genomic compressed sequence data indexer; and a genomic sequence data extractor employing said genomic compressed sequence data indexer, and operative to decompress said one of said plurality of compressed short genomic segments received from said genomic sequence analysis executor.
198. A genomic sequence analysis system according to claim 197 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
199. A genomic sequence analysis system according to claim 198 and wherein each of said plurality of protein coding regions is normalized.
200. A genomic sequence analysis system according to claim 197 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
201. A genomic sequence analysis system according to claim 200 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
202. A genomic sequence analysis system according to claim 200 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
203. A genomic sequence analysis system according to claim 200 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
204. A genomic sequence analysis system according to claim 197 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
205. A genomic sequence analysis system according to claim 204 and wherein said given length is user specified.
206. A genomic sequence analysis system according to claim 197 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
207. A genomic sequence analysis system according to claim 197 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
208. A genomic sequence analysis system according to claim 197 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
209. A genomic sequence analysis system according to claim 208 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
210. A genomic sequence analysis system according to claim 208 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
211. A genomic sequence analysis system according to claim 197 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
212. A genomic sequence analysis system according to claim 197 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
213. A genomic sequence analysis system according to claim 197 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
214. A genomic sequence analysis system according to claim 197 and wherein: said method also includes storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
215. A genomic compressed data storage and retrieval system according to claim 197 and wherein said genomic sequence data extractor provides the following functionality: receiving a query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
216. A genomic compressed data storage and retrieval system according to claim 215 and wherein said functionality of said genomic sequence data extractor does not require storing said uncompressed genomic sequence data.
217. A genomic compressed data storage and retrieval system according to claim 215 and wherein said functionality of said genomic sequence data extractor does not require accessing said uncompressed genomic sequence data.
218. A genomic compressed data storage and retrieval system according to claim 215 and wherein said functionality of said genomic sequence data extractor does not require retrieving said uncompressed genomic sequence data.
219. A genomic compressed data storage and retrieval system according to claim 197 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to sort said uncompressed genomic sequence data.
220. A genomic compressed data storage and retrieval system according to claim 219 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to alphabetically sort said uncompressed genomic sequence data.
221. A genomic compressed data storage and retrieval system according to claim 197 and wherein said uncompressed genomic sequence data comprises a plurality of uncompressed strings, and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
222. A genomic compressed data storage and retrieval system according to claim 221 and wherein each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; said alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
223. A genomic compressed data storage and retrieval system according to claim 197 and wherein each of said plurality of uncompressed strings comprises a plurality of uncompressed characters, each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
224. A genomic compressed data storage and retrieval system according to claim 223 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
225. A genomic compressed data storage and retrieval system according to claim 223 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
226. A genomic compressed data storage and retrieval system according to claim 223 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
227. A genomic compressed data storage and retrieval system according to claim 223 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
228. A genomic compressed data storage and retrieval system according to claim 223 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
229. A genomic compressed data storage and retrieval system according to claim 228 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor is performed internally by a database.
230. A genomic compressed data storage and retrieval system according to claim 229 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require a program external to said database.
231. A genomic compressed data storage and retrieval system according to claim 229 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require programming.
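By way of non-limiting illustration, the character packing of claims 222 to 227 and the compressed-query retrieval of claim 215 might be sketched as follows. The five-symbol alphabet (with N standing for an unknown nucleotide), the base-5 packing of three characters per byte, and the names `compress`, `decompress`, `build_index` and `find` are all assumptions made for this sketch; the claims do not prescribe any particular encoding.

```python
ALPHABET = "ACGNT"    # assumed symbol set in alphabetical order; N = unknown nucleotide
CODE = {ch: i for i, ch in enumerate(ALPHABET)}
BASE = len(ALPHABET)  # 5 symbols
GROUP = 3             # 5**3 = 125 distinct values fit in one byte


def compress(seq: str) -> bytes:
    """Pack each run of three characters into one byte (base-5, big-endian)."""
    out = bytearray()
    for start in range(0, len(seq), GROUP):
        chunk = seq[start:start + GROUP]
        value = 0
        for ch in chunk:
            value = value * BASE + CODE[ch]
        # A short final chunk is padded on the right with code 0 ('A').
        value *= BASE ** (GROUP - len(chunk))
        out.append(value)
    return bytes(out)


def decompress(data: bytes, length: int) -> str:
    """Unpack bytes produced by compress(); `length` drops the padding."""
    chars = []
    for value in data:
        triple = []
        for _ in range(GROUP):
            value, digit = divmod(value, BASE)
            triple.append(ALPHABET[digit])
        chars.extend(reversed(triple))
    return "".join(chars[:length])


def build_index(region: str, k: int) -> dict:
    """Map each compressed k-length segment to its start offsets in `region`."""
    index = {}
    for pos in range(len(region) - k + 1):
        index.setdefault(compress(region[pos:pos + k]), []).append(pos)
    return index


def find(index: dict, query: str) -> list:
    """Compress the query and look it up; stored segments stay compressed."""
    return index.get(compress(query), [])
```

In this sketch the query must have the same length k as the indexed segments. Because the codes follow alphabetical order, sorting equal-length segments by their compressed bytes gives the same order as sorting the uncompressed strings, which is one way the sorting of claims 219 and 220 could be approached.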
232. A genomic sequence analysis and comparison system comprising: a genomic sequence analysis preprocessor operative to receive two compressed genomic sequences, a first compressed genomic sequence representing in compressed form a first genomic sequence, and a second compressed genomic sequence representing in compressed form a second genomic sequence, and protein coding region location data for each of a plurality of proteins known to be encoded by said first and second compressed genomic sequences, to calculate and store a plurality of genomic region sequences for each of said first and second compressed genomic sequences, based at least in part on said genomic sequence data and protein coding region location data, and to determine and store a plurality of short genomic segments contained in each of said first and second compressed genomic sequences; and a genomic data evaluator operative to compare said first compressed genomic sequence with said second compressed genomic sequence, based on at least one of the following: one of said plurality of genomic region sequences and one of said plurality of short genomic segments; and a genomic data analyzer employing said genomic data evaluator and operative to determine degree of similarity between said first genomic sequence and said second genomic sequence.
233. A genomic sequence analysis system according to claim 232 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
234. A genomic sequence analysis system according to claim 233 and wherein each of said plurality of protein coding regions is normalized.
235. A genomic sequence analysis system according to claim 232 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
236. A genomic sequence analysis system according to claim 235 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
237. A genomic sequence analysis system according to claim 235 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
238. A genomic sequence analysis system according to claim 235 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
239. A genomic sequence analysis system according to claim 232 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
240. A genomic sequence analysis system according to claim 239 and wherein said given length is user specified.
241. A genomic sequence analysis system according to claim 232 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
242. A genomic sequence analysis system according to claim 232 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
243. A genomic sequence analysis system according to claim 232 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
244. A genomic sequence analysis system according to claim 243 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
245. A genomic sequence analysis system according to claim 243 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
246. A genomic sequence analysis system according to claim 232 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
247. A genomic sequence analysis system according to claim 232 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
248. A genomic sequence analysis system according to claim 232 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
249. A genomic sequence analysis system according to claim 232 and wherein: said method also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
250. A compressed genomic sequence comparison system according to claim 232 and wherein functionality of said genomic data analyzer does not require comparing said first genomic sequence with said second genomic sequence.
251. A compressed genomic sequence comparison system according to claim 232 and wherein functionality of said genomic data analyzer does not require any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
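By way of non-limiting illustration, the preprocessing recited in claims 235 to 239 (regions adjacent to a protein coding region, normalization according to coding direction, and enumeration of short segments of a given length) might be sketched as follows. The sketch assumes that "normalized" means taking the reverse complement of regions on the minus strand, that coordinates are 0-based and half-open, and that strands are marked "+" or "-"; none of these conventions is stated in the claims, and the function names are hypothetical.

```python
# Assumed complement table; N (unknown nucleotide) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def normalize(region: str, strand: str) -> str:
    """Return `region` read in coding direction (reverse complement on '-')."""
    if strand == "-":
        return region.translate(COMPLEMENT)[::-1]
    return region


def upstream(genome: str, start: int, end: int, strand: str, flank: int) -> str:
    """Region of up to `flank` bases upstream of coding region genome[start:end]."""
    if strand == "+":
        raw = genome[max(0, start - flank):start]
    else:
        # On the minus strand, "upstream" lies after the region on the plus strand.
        raw = genome[end:end + flank]
    return normalize(raw, strand)


def short_segments(region: str, k: int) -> list:
    """All overlapping segments of length k contained in `region`."""
    return [region[i:i + k] for i in range(len(region) - k + 1)]
```

The length `k` corresponds to the user-specified given length of claim 240. A downstream region would be obtained symmetrically by swapping the two branches in `upstream`.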
252. A genomic sequence analysis system utilizing genomic sequence similarity assessment comprising: a genomic sequence analysis preprocessor operative to receive a target genomic sequence, a first plurality of compressed genomic sequences, representing respectively in compressed form a first plurality of genomic sequences, at least one similarity criterion, and protein coding region location data for each of a plurality of proteins known to be encoded by said target genomic sequence and said first plurality of genomic sequences, to calculate and store a plurality of genomic region sequences, based at least in part on said target genomic sequence and said first plurality of genomic sequences and protein coding region location data, and to determine and store a plurality of short genomic segments contained in each of said plurality of genomic region sequences; and a genomic data extractor operative to produce a second plurality of compressed genomic sequences, representing respectively in compressed form a second plurality of genomic sequences, said second plurality of genomic sequences being a subset of said first plurality of genomic sequences, each of said second plurality of genomic sequences being similar to said target genomic sequence, according to said at least one similarity criterion.
253. A genomic sequence analysis system according to claim 252 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
254. A genomic sequence analysis system according to claim 253 and wherein each of said plurality of protein coding regions is normalized.
255. A genomic sequence analysis system according to claim 252 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
256. A genomic sequence analysis system according to claim 255 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
257. A genomic sequence analysis system according to claim 255 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
258. A genomic sequence analysis system according to claim 255 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
259. A genomic sequence analysis system according to claim 252 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
260. A genomic sequence analysis system according to claim 259 and wherein said given length is user specified.
261. A genomic sequence analysis system according to claim 252 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
262. A genomic sequence analysis system according to claim 252 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
263. A genomic sequence analysis system according to claim 252 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
264. A genomic sequence analysis system according to claim 263 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
265. A genomic sequence analysis system according to claim 263 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
266. A genomic sequence analysis system according to claim 252 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
267. A genomic sequence analysis system according to claim 252 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
268. A genomic sequence analysis system according to claim 252 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
269. A genomic sequence analysis system according to claim 252 and wherein: said method also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
270. A genomic compressed sequence similarity assessment system according to claim 252 and wherein said system also comprises a genomic decompressor operative to decompress each of said second plurality of compressed genomic sequences.
271. A genomic compressed sequence similarity assessment system according to claim 252 and wherein functionality of said genomic data extractor does not require comparing said genomic sequence with any of said first plurality of genomic sequences.
272. A genomic compressed sequence similarity assessment system according to claim 252 and wherein functionality of said genomic data extractor does not require decompressing any of said first plurality of compressed genomic sequences.
273. A genomic sequence compression and comparison system comprising: a genomic sequence data compressor operative to receive a first uncompressed genomic sequence and a second uncompressed genomic sequence and to compress said first uncompressed genomic sequence into a first compressed genomic sequence and said second uncompressed genomic sequence into a second compressed genomic sequence; a genomic data evaluator operative to receive said first and second compressed genomic sequences and to compare said first compressed genomic sequence with said second compressed genomic sequence; and a genomic data analyzer employing said genomic data evaluator and operative to determine degree of similarity between said first uncompressed genomic sequence and said second uncompressed genomic sequence.
274. A genomic compressed data storage and retrieval system according to claim 273 and wherein said genomic sequence data extractor provides the following functionality: receiving a query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
275. A genomic compressed data storage and retrieval system according to claim 274 and wherein said functionality of said genomic sequence data extractor does not require storing said uncompressed genomic sequence data.
276. A genomic compressed data storage and retrieval system according to claim 274 and wherein said functionality of said genomic sequence data extractor does not require accessing said uncompressed genomic sequence data.
277. A genomic compressed data storage and retrieval system according to claim 274 and wherein said functionality of said genomic sequence data extractor does not require retrieving said uncompressed genomic sequence data.
278. A genomic compressed data storage and retrieval system according to claim 273 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to sort said uncompressed genomic sequence data.
279. A genomic compressed data storage and retrieval system according to claim 278 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to alphabetically sort said uncompressed genomic sequence data.
280. A genomic compressed data storage and retrieval system according to claim 273 and wherein said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
281. A genomic compressed data storage and retrieval system according to claim 280 and wherein each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; said alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
282. A genomic compressed data storage and retrieval system according to claim 273 and wherein each of said plurality of uncompressed strings comprises a plurality of uncompressed characters, each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
283. A genomic compressed data storage and retrieval system according to claim 282 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
284. A genomic compressed data storage and retrieval system according to claim 282 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
285. A genomic compressed data storage and retrieval system according to claim 282 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
286. A genomic compressed data storage and retrieval system according to claim 282 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
287. A genomic compressed data storage and retrieval system according to claim 282 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
288. A genomic compressed data storage and retrieval system according to claim 287 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor is performed internally by a database.
289. A genomic compressed data storage and retrieval system according to claim 288 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require a program external to said database.
290. A genomic compressed data storage and retrieval system according to claim 288 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require programming.
291. A compressed genomic sequence comparison system according to claim 273 and wherein functionality of said genomic data analyzer does not require comparing said first genomic sequence with said second genomic sequence.
292. A compressed genomic sequence comparison system according to claim 273 and wherein functionality of said genomic data analyzer does not require any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
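By way of non-limiting illustration, one way a degree of similarity might be determined directly from two compressed sequences, without decompressing either (claims 273, 291 and 292), is a precomputed lookup table giving the number of differing characters for every pair of compressed bytes. The base-5, three-characters-per-byte packing, the equal-length requirement, and the names `compress`, `MISMATCH` and `similarity` are assumptions of this sketch, not features recited in the claims.

```python
BASE, GROUP = 5, 3                      # assumed packing: 3 symbols per byte
CODE = {ch: i for i, ch in enumerate("ACGNT")}  # N = unknown nucleotide
N_VALUES = BASE ** GROUP                # 125 possible compressed bytes


def compress(seq: str) -> bytes:
    """Pack each run of three characters into one byte (base-5)."""
    out = bytearray()
    for start in range(0, len(seq), GROUP):
        chunk = seq[start:start + GROUP]
        value = 0
        for ch in chunk:
            value = value * BASE + CODE[ch]
        value *= BASE ** (GROUP - len(chunk))  # right-pad a short final chunk
        out.append(value)
    return bytes(out)


def _digits(value: int) -> list:
    """Base-5 digits of one compressed byte (used only to build the table)."""
    return [(value // BASE ** p) % BASE for p in range(GROUP)]


# MISMATCH[a][b]: number of character positions at which bytes a and b differ.
MISMATCH = [
    [sum(x != y for x, y in zip(_digits(a), _digits(b))) for b in range(N_VALUES)]
    for a in range(N_VALUES)
]


def similarity(c1: bytes, c2: bytes, length: int) -> float:
    """Fraction of identical positions between two equal-length sequences,
    computed from their compressed forms only."""
    if len(c1) != len(c2) or length <= 0:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(MISMATCH[a][b] for a, b in zip(c1, c2))
    return 1.0 - mismatches / length
```

The table is built once from the encoding itself, so the comparison loop touches only compressed bytes. Padding positions are identical in both inputs and therefore contribute no mismatches.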
293. A genomic sequence similarity assessment system comprising: a genomic sequence data compressor operative to receive a first plurality of uncompressed genomic sequences and to compress said first plurality of uncompressed genomic sequences into a first plurality of compressed genomic sequences, a genomic data evaluator operative to receive a target genomic sequence, said first plurality of compressed genomic sequences, and at least one similarity criterion, and a genomic data extractor operative to produce a second plurality of compressed genomic sequences, representing respectively in compressed form a second plurality of genomic sequences, said second plurality of genomic sequences being a subset of said first plurality of genomic sequences, each of said second plurality of genomic sequences being similar to said target genomic sequence, according to said at least one similarity criterion.
294. A genomic compressed data storage and retrieval system according to claim 293 and wherein said genomic sequence data extractor provides the following functionality: receiving a query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
295. A genomic compressed data storage and retrieval system according to claim 294 and wherein said functionality of said genomic sequence data extractor does not require storing said uncompressed genomic sequence data.
296. A genomic compressed data storage and retrieval system according to claim 294 and wherein said functionality of said genomic sequence data extractor does not require accessing said uncompressed genomic sequence data.
297. A genomic compressed data storage and retrieval system according to claim 294 and wherein said functionality of said genomic sequence data extractor does not require retrieving said uncompressed genomic sequence data.
298. A genomic compressed data storage and retrieval system according to claim 293 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to sort said uncompressed genomic sequence data.
299. A genomic compressed data storage and retrieval system according to claim 298 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to alphabetically sort said uncompressed genomic sequence data.
300. A genomic compressed data storage and retrieval system according to claim 293 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
301. A genomic compressed data storage and retrieval system according to claim 300 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; said alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
302. A genomic compressed data storage and retrieval system according to claim 293 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
303. A genomic compressed data storage and retrieval system according to claim 302 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
304. A genomic compressed data storage and retrieval system according to claim 302 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
305. A genomic compressed data storage and retrieval system according to claim 302 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
306. A genomic compressed data storage and retrieval system according to claim 302 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
307. A genomic compressed data storage and retrieval system according to claim 302 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
308. A genomic compressed data storage and retrieval system according to claim 307 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor is performed internally by a database.
309. A genomic compressed data storage and retrieval system according to claim 308 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require a program external to said database.
310. A genomic compressed data storage and retrieval system according to claim 308 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require programming.
311. A genomic compressed sequence similarity assessment system according to claim 293 and wherein said system also comprises a genomic decompressor operative to decompress each of said second plurality of compressed genomic sequences.
312. A genomic compressed sequence similarity assessment system according to claim 293 and wherein functionality of said genomic data extractor does not require comparing said genomic sequence with any of said first plurality of genomic sequences.
313. A genomic compressed sequence similarity assessment system according to claim 293 and wherein functionality of said genomic data extractor does not require decompressing any of said first plurality of compressed genomic sequences.
314. A genomic sequence analysis and comparison system comprising: a genomic sequence data compressor operative to obtain a first genomic sequence and a second genomic sequence and to compress said first genomic sequence into a first compressed genomic sequence and said second genomic sequence into a second compressed genomic sequence; a genomic sequence analysis preprocessor operative to receive said first and second compressed genomic sequences and protein coding region location data for each of a plurality of proteins known to be encoded by said first and second genomic sequences, to calculate and store a plurality of genomic region sequences for each of said first and second genomic sequences, based at least in part on said genomic sequence data and protein coding region location data, and to determine and store a plurality of short genomic segments contained in each of said first and second compressed genomic sequences; a genomic data evaluator operative to compare said first compressed genomic sequence with said second compressed genomic sequence, based on at least one of the following: one of said plurality of genomic region sequences and one of said plurality of short genomic segments; and a genomic data analyzer employing said genomic data evaluator and operative to determine degree of similarity between said first genomic sequence and said second genomic sequence.
315. A genomic sequence analysis system according to claim 314 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
316. A genomic sequence analysis system according to claim 315 and wherein each of said plurality of protein coding regions is normalized.
317. A genomic sequence analysis system according to claim 314 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
318. A genomic sequence analysis system according to claim 317 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
319. A genomic sequence analysis system according to claim 317 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
320. A genomic sequence analysis system according to claim 317 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
321. A genomic sequence analysis system according to claim 314 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
322. A genomic sequence analysis system according to claim 321 and wherein said given length is user specified.
323. A genomic sequence analysis system according to claim 314 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
324. A genomic sequence analysis system according to claim 314 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
325. A genomic sequence analysis system according to claim 314 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
326. A genomic sequence analysis system according to claim 325 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
327. A genomic sequence analysis system according to claim 325 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
328. A genomic sequence analysis system according to claim 314 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
329. A genomic sequence analysis system according to claim 314 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
330. A genomic sequence analysis system according to claim 314 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
331. A genomic sequence analysis system according to claim 314 and wherein: said method also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
332. A genomic compressed data storage and retrieval system according to claim 314 and wherein said genomic sequence data extractor provides the following functionality: receiving a query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
333. A genomic compressed data storage and retrieval system according to claim 332 and wherein said functionality of said genomic sequence data extractor does not require storing said uncompressed genomic sequence data.
334. A genomic compressed data storage and retrieval system according to claim 332 and wherein said functionality of said genomic sequence data extractor does not require accessing said uncompressed genomic sequence data.
335. A genomic compressed data storage and retrieval system according to claim 332 and wherein said functionality of said genomic sequence data extractor does not require retrieving said uncompressed genomic sequence data.
336. A genomic compressed data storage and retrieval system according to claim 314 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to sort said uncompressed genomic sequence data.
337. A genomic compressed data storage and retrieval system according to claim 336 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to alphabetically sort said uncompressed genomic sequence data.
338. A genomic compressed data storage and retrieval system according to claim 314 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
339. A genomic compressed data storage and retrieval system according to claim 338 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; said alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
340. A genomic compressed data storage and retrieval system according to claim 314 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
341. A genomic compressed data storage and retrieval system according to claim 340 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
342. A genomic compressed data storage and retrieval system according to claim 340 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
343. A genomic compressed data storage and retrieval system according to claim 340 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
344. A genomic compressed data storage and retrieval system according to claim 340 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
345. A genomic compressed data storage and retrieval system according to claim 340 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
346. A genomic compressed data storage and retrieval system according to claim 345 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor is performed internally by a database.
347. A genomic compressed data storage and retrieval system according to claim 346 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require a program external to said database.
348. A genomic compressed data storage and retrieval system according to claim 346 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require programming.
349. A compressed genomic sequence comparison system according to claim 314 and wherein functionality of said genomic data analyzer does not require comparing said first genomic sequence with said second genomic sequence.
350. A compressed genomic sequence comparison system according to claim 314 and wherein functionality of said genomic data analyzer does not require any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
351. A genomic sequence analysis system comprising: a genomic sequence analysis preprocessor operative to obtain uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said uncompressed genomic sequence data, to calculate and store a plurality of genomic region sequences, based at least in part on said uncompressed genomic sequence data, and protein coding region location data, and to determine a plurality of short genomic segments contained in each of said plurality of uncompressed genomic region sequences; a genomic sequence data compressor operative to compress each of said plurality of uncompressed short genomic segments into one of a plurality of compressed short genomic segments; a genomic compressed sequence data indexer operative to store said plurality of compressed short genomic segments and to index said plurality of compressed short genomic segments; a genomic sequence analysis executor employing said genomic sequence analysis preprocessor and operative to receive a user query containing at least one logical condition relating to at least one of the following: one of said genomic region sequences and one of said uncompressed short genomic segments, and to retrieve results to said user query, said results comprising at least one of the following: one of said plurality of proteins, one of said plurality of genomic region sequences, and one of said plurality of compressed short genomic segments, based at least in part on said user query and employing said genomic compressed sequence data indexer; and a genomic sequence data extractor employing said genomic compressed sequence data indexer, and operative to decompress said one of said plurality of compressed short genomic segments received from said genomic sequence analysis executor.
352. A genomic sequence analysis system according to claim 351 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
353. A genomic sequence analysis system according to claim 352 and wherein each of said plurality of protein coding regions is normalized.
354. A genomic sequence analysis system according to claim 351 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
355. A genomic sequence analysis system according to claim 354 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
356. A genomic sequence analysis system according to claim 354 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
357. A genomic sequence analysis system according to claim 354 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
358. A genomic sequence analysis system according to claim 351 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
359. A genomic sequence analysis system according to claim 358 and wherein said given length is user specified.
360. A genomic sequence analysis system according to claim 351 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
361. A genomic sequence analysis system according to claim 351 and wherein genomic sequence data comprises a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
362. A genomic sequence analysis system according to claim 351 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
363. A genomic sequence analysis system according to claim 362 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
364. A genomic sequence analysis system according to claim 362 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
365. A genomic sequence analysis system according to claim 351 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
366. A genomic sequence analysis system according to claim 351 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
367. A genomic sequence analysis system according to claim 351 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
368. A genomic sequence analysis system according to claim 351 and wherein: said method also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
369. A genomic compressed data storage and retrieval system according to claim 351 and wherein said genomic sequence data extractor provides the following functionality: receiving a query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
370. A genomic compressed data storage and retrieval system according to claim 369 and wherein said functionality of said genomic sequence data extractor does not require storing said uncompressed genomic sequence data.
371. A genomic compressed data storage and retrieval system according to claim 369 and wherein said functionality of said genomic sequence data extractor does not require accessing said uncompressed genomic sequence data.
372. A genomic compressed data storage and retrieval system according to claim 369 and wherein said functionality of said genomic sequence data extractor does not require retrieving said uncompressed genomic sequence data.
373. A genomic compressed data storage and retrieval system according to claim 351 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to sort said uncompressed genomic sequence data.
374. A genomic compressed data storage and retrieval system according to claim 373 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to alphabetically sort said uncompressed genomic sequence data.
375. A genomic compressed data storage and retrieval system according to claim 351 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
376. A genomic compressed data storage and retrieval system according to claim 375 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; said alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
377. A genomic compressed data storage and retrieval system according to claim 351 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
378. A genomic compressed data storage and retrieval system according to claim 377 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
379. A genomic compressed data storage and retrieval system according to claim 377 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
380. A genomic compressed data storage and retrieval system according to claim 377 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
381. A genomic compressed data storage and retrieval system according to claim 377 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
382. A genomic compressed data storage and retrieval system according to claim 377 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
383. A genomic compressed data storage and retrieval system according to claim 382 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor is performed internally by a database.
384. A genomic compressed data storage and retrieval system according to claim 383 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require a program external to said database.
385. A genomic compressed data storage and retrieval system according to claim 383 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require programming.
386. A genomic compressed sequence similarity assessment system according to claim 351 and wherein said system also comprises a genomic decompressor operative to decompress each of said second plurality of compressed genomic sequences.
387. A genomic compressed sequence similarity assessment system according to claim 351 and wherein functionality of said genomic data extractor does not require comparing said genomic sequence with any of said first plurality of genomic sequences.
388. A genomic compressed sequence similarity assessment system according to claim 351 and wherein functionality of said genomic data extractor does not require decompressing any of said first plurality of compressed genomic sequences.
389. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: obtaining genomic data; preprocessing genomic data into preprocessed genomic data; compressing at least part of said preprocessed genomic data; storing compressed preprocessed genomic data; indexing compressed preprocessed genomic data; and analyzing genomic data, based at least in part on said indexing.
390. A computer-readable medium according to claim 389 and wherein: said obtaining comprises obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said genomic sequence data; said preprocessing comprises calculating and storing a plurality of genomic region sequences, based at least in part on said obtaining, and determining for each of said plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith; said compressing comprises compressing each of said plurality of uncompressed short genomic segments contained in each of said plurality of genomic region sequences into one of a plurality of compressed short genomic segments; said storing comprises storing said plurality of compressed short genomic segments; said indexing comprises indexing said plurality of compressed short genomic segments; and said analyzing comprises: receiving a user query containing at least one logical condition relating to at least one of the following: one of said genomic region sequences, and one of said uncompressed short genomic segments, and retrieving results to said user query, said retrieving comprising at least one of the following: retrieving one of said plurality of proteins, retrieving one of said plurality of genomic region sequences, and retrieving and decompressing one of said plurality of compressed short genomic segments, based at least in part on said indexing.
391. A computer-readable medium according to claim 390 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
392. A computer-readable medium according to claim 391 and wherein each of said plurality of protein coding regions is normalized.
393. A computer-readable medium according to claim 390 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
394. A computer-readable medium according to claim 393 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
395. A computer-readable medium according to claim 393 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
396. A computer-readable medium according to claim 393 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
397. A computer-readable medium according to claim 390 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
398. A computer-readable medium according to claim 397 and wherein said given length is user specified.
399. A computer-readable medium according to claim 390 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
400. A computer-readable medium according to claim 390 and wherein genomic sequence data comprises a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
401. A computer-readable medium according to claim 390 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
402. A computer-readable medium according to claim 401 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
403. A computer-readable medium according to claim 401 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
404. A computer-readable medium according to claim 390 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
405. A computer-readable medium according to claim 390 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
406. A computer-readable medium according to claim 390 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
407. A computer-readable medium according to claim 390 and wherein: said computer-readable medium also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
408. A computer-readable medium according to claim 389 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
409. A computer-readable medium according to claim 408 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
410. A computer-readable medium according to claim 408 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
411. A computer-readable medium according to claim 408 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
412. A computer-readable medium according to claim 389 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
413. A computer-readable medium according to claim 412 and wherein said sorting is alphabetical sorting.
414. A computer-readable medium according to claim 389 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
415. A computer-readable medium according to claim 414 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
416. A computer-readable medium according to claim 389 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
417. A computer-readable medium according to claim 416 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
418. A computer-readable medium according to claim 416 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
419. A computer-readable medium according to claim 416 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
420. A computer-readable medium according to claim 416 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
421. A computer-readable medium according to claim 416 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
422. A computer-readable medium according to claim 421 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
423. A computer-readable medium according to claim 422 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
424. A computer-readable medium according to claim 422 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
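As an illustration only of the character packing recited in claims 414 to 420 and the compressed-query retrieval of claims 408 to 411, the following Python sketch packs three characters of a five-letter alphabet (A, C, G, T, and N for an unknown nucleotide) into one byte. The alphabet, the group size of three, the padding rule, the sample strings, and the function names `compress`, `decompress`, and `find` are assumptions made for this example and are not taken from the claims.

```python
ALPHABET = "ACGTN"          # N stands for an unknown nucleotide (assumed symbol)
CODE = {ch: i for i, ch in enumerate(ALPHABET)}
GROUP = 3                   # 5**3 = 125 distinct values fit in one byte


def compress(seq: str) -> bytes:
    """Pack each run of GROUP uncompressed characters into one byte."""
    out = bytearray()
    for i in range(0, len(seq), GROUP):
        chunk = seq[i:i + GROUP].ljust(GROUP, "N")  # pad the final group
        value = 0
        for ch in chunk:
            value = value * len(ALPHABET) + CODE[ch]
        out.append(value)
    return bytes(out)


def decompress(data: bytes, length: int) -> str:
    """Invert compress(); length is the original character count."""
    chars = []
    for value in data:
        group = []
        for _ in range(GROUP):
            value, digit = divmod(value, len(ALPHABET))
            group.append(ALPHABET[digit])
        chars.extend(reversed(group))
    return "".join(chars[:length])


def find(index: dict, query: str) -> list:
    """Compress the query and look it up; stored data is not decompressed."""
    return index.get(compress(query), [])


# Hypothetical stored strings, indexed by their compressed form.
stored = {"seq1": "ACGTNA", "seq2": "TTGACA"}
index = {}
for name, seq in stored.items():
    index.setdefault(compress(seq), []).append(name)
```

Because the final group is padded with N, this exact-match lookup is only meaningful when the query and the stored strings have the same length; a fuller design would store the original length alongside each compressed string.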
425. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: obtaining two compressed genomic sequences, a first compressed genomic sequence representing in compressed form a first genomic sequence, and a second compressed genomic sequence representing in compressed form a second genomic sequence, and protein coding region location data for each of a plurality of proteins known to be encoded by said first genomic sequence and said second genomic sequence; calculating and storing a plurality of genomic region sequences for each of said first and second compressed genomic sequences, based at least in part on said obtaining; determining and storing a plurality of short genomic segments contained in each of said plurality of genomic region sequences for each of said first and second compressed genomic sequences; comparing said first compressed genomic sequence with said second compressed genomic sequence, based on at least one of the following: one of said genomic region sequences, and one of said short genomic segments; and determining degree of similarity between said first compressed genomic sequence and said second compressed genomic sequence, based at least in part on said comparing.
426. A computer-readable medium according to claim 425 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
427. A computer-readable medium according to claim 426 and wherein each of said plurality of protein coding regions is normalized.
428. A computer-readable medium according to claim 425 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
429. A computer-readable medium according to claim 428 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
430. A computer-readable medium according to claim 428 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
431. A computer-readable medium according to claim 428 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
432. A computer-readable medium according to claim 425 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
433. A computer-readable medium according to claim 432 and wherein said given length is user specified.
434. A computer-readable medium according to claim 425 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
435. A computer-readable medium according to claim 425 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
436. A computer-readable medium according to claim 425 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
437. A computer-readable medium according to claim 436 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
438. A computer-readable medium according to claim 436 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
439. A computer-readable medium according to claim 425 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
440. A computer-readable medium according to claim 425 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
441. A computer-readable medium according to claim 425 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
442. A computer-readable medium according to claim 425 and wherein: said computer-readable medium also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
443. A computer-readable medium according to claim 425 and wherein said determining does not include comparing said first genomic sequence with said second genomic sequence.
444. A computer-readable medium according to claim 425 and wherein said determining does not include any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
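The preprocessing recited in claims 425 to 433 (genomic region sequences derived from protein coding locations, normalization by coding direction, and short genomic segments of a given length) might be sketched as follows. This is a minimal illustration on uncompressed strings, assuming 0-based half-open coordinates, a "+" or "-" strand flag, and reverse complementation as the normalization; none of these details, nor the function names, are specified by the claims.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the sequence as read on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def coding_region(genome: str, start: int, end: int, strand: str) -> str:
    """Protein coding region, normalized so it reads in coding direction."""
    region = genome[start:end]
    return region if strand == "+" else reverse_complement(region)


def upstream_region(genome: str, start: int, end: int,
                    strand: str, length: int) -> str:
    """Region adjacent to and upstream of a coding region, normalized
    according to the coding direction of that coding region."""
    if strand == "+":
        return genome[max(0, start - length):start]
    # On the minus strand, "upstream" lies after the end coordinate.
    return reverse_complement(genome[end:end + length])


def short_segments(region: str, k: int) -> list:
    """All overlapping short segments of the given length k."""
    return [region[i:i + k] for i in range(len(region) - k + 1)]


genome = "AAACCCGGGTTT"  # hypothetical toy sequence
```

With the toy genome above, a coding region at positions 3 to 6 on the plus strand has `AAA` as its three-character upstream region, while the same coordinates on the minus strand give the reverse complement of `GGG`.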
445. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: obtaining genomic data; preprocessing genomic data into preprocessed genomic data; storing said preprocessed genomic data; indexing said preprocessed genomic data; and analyzing genomic data, based at least in part on said indexing, said analyzing also comprising assessing genomic sequence similarity, based at least in part on said indexing.
446. A computer-readable medium according to claim 445 and wherein: said obtaining comprises obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said genomic sequence data; said preprocessing comprises calculating and storing a plurality of genomic region sequences, based at least in part on said obtaining, and determining for each of said plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith; said storing comprises storing said plurality of uncompressed short genomic segments; said indexing comprises indexing said plurality of uncompressed short genomic segments; and said analyzing comprises: receiving a user query containing at least one logical condition relating to one of said plurality of uncompressed short genomic segments and at least one similarity criterion; extracting a subset of said plurality of uncompressed short genomic segments, each of said subset being similar to said one of said uncompressed short genomic segments according to said at least one similarity criterion; and retrieving results to said user query, based at least in part on said extracting, said retrieving comprising at least one of the following: retrieving one of said plurality of proteins and retrieving one of said plurality of genomic region sequences, based at least in part on said indexing.
447. A computer-readable medium according to claim 446 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
448. A computer-readable medium according to claim 447 and wherein each of said plurality of protein coding regions is normalized.
449. A computer-readable medium according to claim 446 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
450. A computer-readable medium according to claim 449 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
451. A computer-readable medium according to claim 449 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
452. A computer-readable medium according to claim 449 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
453. A computer-readable medium according to claim 446 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
454. A computer-readable medium according to claim 453 and wherein said given length is user specified.
455. A computer-readable medium according to claim 446 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
456. A computer-readable medium according to claim 446 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
457. A computer-readable medium according to claim 446 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
458. A computer-readable medium according to claim 457 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
459. A computer-readable medium according to claim 457 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
460. A computer-readable medium according to claim 446 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
461. A computer-readable medium according to claim 446 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
462. A computer-readable medium according to claim 446 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
463. A computer-readable medium according to claim 446 and wherein: said computer-readable medium also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
464. A computer-readable medium according to claim 446 and wherein said method also comprises decompressing and delivering each of said second plurality of compressed genomic sequences.
465. A computer-readable medium according to claim 446 and wherein said producing does not include comparing said genomic sequence with any of said first plurality of genomic sequences.
466. A computer-readable medium according to claim 446 and wherein said producing does not include decompressing any of said first plurality of compressed genomic sequences.
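The indexing and query steps of claims 446 and 460 to 462 can be pictured with a small inverted index from short genomic segments to the genomic region sequences that contain them. The sketch below is hypothetical: the region names, the use of Hamming distance as the similarity criterion, and the definitions of commonality (number of regions containing a segment) and uniqueness (exactly one such region) are assumptions chosen for illustration, since the claims do not fix them.

```python
from collections import defaultdict


def build_index(regions: dict, k: int) -> dict:
    """Map every short segment of length k to the regions containing it."""
    index = defaultdict(set)
    for name, seq in regions.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def commonality(index: dict, segment: str) -> int:
    """Number of regions in which the segment occurs."""
    return len(index.get(segment, ()))


def is_unique(index: dict, segment: str) -> bool:
    """True when the segment occurs in exactly one region."""
    return commonality(index, segment) == 1


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length segments."""
    return sum(x != y for x, y in zip(a, b))


def query_similar(index: dict, segment: str, max_mismatches: int) -> set:
    """Regions containing any indexed segment within max_mismatches
    of the query segment (a simple linear scan over the index keys)."""
    hits = set()
    for candidate, names in index.items():
        if (len(candidate) == len(segment)
                and hamming(candidate, segment) <= max_mismatches):
            hits |= names
    return hits


# Hypothetical region sequences keyed by protein name.
regions = {"protA": "ACGTAC", "protB": "ACGGAC", "protC": "TTTTTT"}
index = build_index(regions, k=3)
```

An exclusion condition in the sense of claim 462 could be expressed on top of this as a set difference between the full set of region names and the result of a query.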
467. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving a first uncompressed genomic sequence and a second uncompressed genomic sequence; compressing said first uncompressed genomic sequence into a first compressed genomic sequence, and said second uncompressed genomic sequence into a second compressed genomic sequence; comparing said first compressed genomic sequence with said second compressed genomic sequence; and determining degree of similarity between said first genomic sequence and said second genomic sequence, based at least in part on said comparing.
468. A computer-readable medium according to claim 467 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
469. A computer-readable medium according to claim 468 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
470. A computer-readable medium according to claim 468 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
471. A computer-readable medium according to claim 468 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
472. A computer-readable medium according to claim 467 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
473. A computer-readable medium according to claim 472 and wherein said sorting is alphabetical sorting.
474. A computer-readable medium according to claim 467 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
475. A computer-readable medium according to claim 474 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
476. A computer-readable medium according to claim 467 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
477. A computer-readable medium according to claim 476 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
478. A computer-readable medium according to claim 476 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
479. A computer-readable medium according to claim 476 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
480. A computer-readable medium according to claim 476 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
481. A computer-readable medium according to claim 476 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
482. A computer-readable medium according to claim 481 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
483. A computer-readable medium according to claim 482 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
484. A computer-readable medium according to claim 482 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
485. A computer-readable medium according to claim 467 and wherein said determining does not include comparing said first genomic sequence with said second genomic sequence.
486. A computer-readable medium according to claim 467 and wherein said determining does not include any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
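Claims 467, 485, and 486 recite determining similarity by comparing the compressed sequences without decompressing them. One way this could work, shown here purely as an assumed sketch, is to pack each nucleotide into two bits and count differing positions with an XOR. The two-bit code, the restriction to A, C, G, and T (no unknown nucleotide), the equal-length requirement, and the similarity formula are all choices made for this example rather than features stated in the claims.

```python
BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed 2-bit code


def pack(seq: str) -> int:
    """Compress a sequence into one integer, two bits per nucleotide."""
    value = 0
    for ch in seq:
        value = (value << 2) | BITS[ch]
    return value


def mismatches(a: int, b: int, length: int) -> int:
    """Count differing nucleotides directly on the packed integers."""
    diff = a ^ b
    mask = int("01" * length, 2)       # low bit of every 2-bit slot
    # A slot differs when either of its two bits is set in diff.
    return bin((diff | (diff >> 1)) & mask).count("1")


def similarity(seq1: str, seq2: str) -> float:
    """Fraction of matching positions, computed on the compressed forms."""
    if not seq1 or len(seq1) != len(seq2):
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(seq1)
    return 1 - mismatches(pack(seq1), pack(seq2), n) / n
```

For example, `similarity("ACGT", "ACGA")` compares the packed integers 27 and 24 and finds one differing slot out of four.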
487. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving uncompressed genomic sequence data; compressing said uncompressed genomic sequence data into compressed genomic sequence data; storing said compressed genomic sequence data; indexing said compressed genomic sequence data; retrieving at least part of said compressed genomic sequence data representing uncompressed genomic sequence data similar to an uncompressed genomic target sequence, based at least in part on said indexing; and decompressing said at least part of said compressed genomic sequence data.
488. A computer-readable medium according to claim 487 and wherein said retrieving comprises: receiving a target genomic sequence, a first plurality of compressed genomic sequences, representing respectively in compressed form a first plurality of genomic sequences, and at least one similarity criterion; and producing a second plurality of compressed genomic sequences, representing respectively in compressed form a second plurality of genomic sequences, said second plurality of genomic sequence being a subset of said first plurality of genomic sequences, each of said second plurality of genomic sequences being similar to said target genomic sequence, according to said at least one similarity criterion.
489. A computer-readable medium according to claim 488 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
490. A computer-readable medium according to claim 489 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
491. A computer-readable medium according to claim 489 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
492. A computer-readable medium according to claim 489 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
493. A computer-readable medium according to claim 488 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
494. A computer-readable medium according to claim 493 and wherein said sorting is alphabetical sorting.
495. A computer-readable medium according to claim 488 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
496. A computer-readable medium according to claim 495 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
497. A computer-readable medium according to claim 488 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
498. A computer-readable medium according to claim 497 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
499. A computer-readable medium according to claim 497 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
500. A computer-readable medium according to claim 497 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
501. A computer-readable medium according to claim 497 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
502. A computer-readable medium according to claim 497 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
503. A computer-readable medium according to claim 502 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
504. A computer-readable medium according to claim 503 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
505. A computer-readable medium according to claim 503 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
506. A computer-readable medium according to claim 488 and wherein said method also comprises decompressing and delivering each of said second plurality of compressed genomic sequences.
507. A computer-readable medium according to claim 488 and wherein said producing does not include comparing said genomic sequence with any of said first plurality of genomic sequences.
508. A computer-readable medium according to claim 488 and wherein said producing does not include decompressing any of said first plurality of compressed genomic sequences.
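The retrieval of claims 487 to 494 (indexing compressed data, sorting, retrieving the part similar to a target, then decompressing only that part) could look roughly like the following. The sketch assumes fixed-length segments over A, C, G, and T packed two bits per nucleotide, so that numeric order of the packed keys matches alphabetical order of the strings, and it takes "shares a prefix with the target" as the similarity criterion. Both assumptions, and all function names, are illustrative and not drawn from the claims.

```python
from bisect import bisect_left

LETTERS = "ACGT"                       # alphabetical, so codes sort the same way
BITS = {ch: i for i, ch in enumerate(LETTERS)}


def pack(seq: str) -> int:
    """Compress a segment into an integer, two bits per nucleotide."""
    value = 0
    for ch in seq:
        value = (value << 2) | BITS[ch]
    return value


def unpack(value: int, length: int) -> str:
    """Decompress a packed segment of known length."""
    chars = []
    for _ in range(length):
        chars.append(LETTERS[value & 3])
        value >>= 2
    return "".join(reversed(chars))


def build_sorted_index(segments: list) -> list:
    """Sorted list of packed keys; acts as a simple ordered index."""
    return sorted(pack(s) for s in segments)


def prefix_search(keys: list, prefix: str, k: int) -> list:
    """Packed keys of length-k segments that start with the given prefix,
    located by binary search without decompressing any stored key."""
    shift = 2 * (k - len(prefix))
    lo = pack(prefix) << shift
    hi = (pack(prefix) + 1) << shift
    return keys[bisect_left(keys, lo):bisect_left(keys, hi)]


# Hypothetical stored segments, all of length 4.
keys = build_sorted_index(["ACGT", "ACGA", "TTTT", "ACCA"])
hits = [unpack(v, 4) for v in prefix_search(keys, "ACG", 4)]
```

Only the matching keys are decompressed at the end, which mirrors the order of steps in claim 487: the search itself runs entirely on the compressed, sorted keys.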
509. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: obtaining a first uncompressed genomic sequence and a second uncompressed genomic sequence, and protein coding region location data for each of a plurality of proteins known to be encoded by said first uncompressed genomic sequence and said second uncompressed genomic sequence; calculating and storing a plurality of genomic region sequences for each of said first and second uncompressed genomic sequences, based at least in part on said obtaining; determining and storing a plurality of short genomic segments contained in each of said plurality of genomic region sequences for each of said first and second uncompressed genomic sequences; compressing said first uncompressed genomic sequence into a first compressed genomic sequence, and said second uncompressed genomic sequence into a second compressed genomic sequence; comparing said first compressed genomic sequence with said second compressed genomic sequence, based on at least one of the following: one of said genomic region sequences, and one of said short genomic segments; and determining degree of similarity between said first compressed genomic sequence and said second compressed genomic sequence, based at least in part on said comparing.
510. A computer-readable medium according to claim 509 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
511. A computer-readable medium according to claim 510 and wherein each of said plurality of protein coding regions is normalized.
512. A computer-readable medium according to claim 509 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
513. A computer-readable medium according to claim 512 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
514. A computer-readable medium according to claim 512 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
515. A computer-readable medium according to claim 512 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
516. A computer-readable medium according to claim 509 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
517. A computer-readable medium according to claim 516 and wherein said given length is user specified.
518. A computer-readable medium according to claim 509 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
519. A computer-readable medium according to claim 509 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
520. A computer-readable medium according to claim 509 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
521. A computer-readable medium according to claim 520 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
522. A computer-readable medium according to claim 520 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
523. A computer-readable medium according to claim 509 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
524. A computer-readable medium according to claim 509 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
525. A computer-readable medium according to claim 509 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
526. A computer-readable medium according to claim 509 and wherein: said computer-readable medium also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
527. A computer-readable medium according to claim 509 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
528. A computer-readable medium according to claim 527 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
529. A computer-readable medium according to claim 527 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
530. A computer-readable medium according to claim 527 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
531. A computer-readable medium according to claim 509 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
532. A computer-readable medium according to claim 531 and wherein said sorting is alphabetical sorting.
533. A computer-readable medium according to claim 509 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
534. A computer-readable medium according to claim 533 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
535. A computer-readable medium according to claim 509 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
536. A computer-readable medium according to claim 535 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
537. A computer-readable medium according to claim 535 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
538. A computer-readable medium according to claim 535 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
539. A computer-readable medium according to claim 535 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
540. A computer-readable medium according to claim 535 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
541. A computer-readable medium according to claim 540 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
542. A computer-readable medium according to claim 541 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
543. A computer-readable medium according to claim 541 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
544. A computer-readable medium according to claim 509 and wherein said determining does not include comparing said first genomic sequence with said second genomic sequence.
545. A computer-readable medium according to claim 509 and wherein said determining does not include any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
546. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: obtaining genomic data; preprocessing genomic data into preprocessed genomic data; compressing at least part of said preprocessed genomic data; storing compressed preprocessed genomic data; indexing compressed preprocessed genomic data; and analyzing genomic data, based at least in part on said indexing, said analyzing also comprising assessing genomic sequence similarity, based at least in part on said indexing.
547. A computer-readable medium according to claim 546 and wherein: said obtaining comprises obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said genomic sequence data; said preprocessing comprises calculating and storing a plurality of genomic region sequences, based at least in part on said obtaining, and determining for each of said plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith; said compressing comprises compressing each of said plurality of uncompressed short genomic segments contained in each of said plurality of genomic region sequences into one of a plurality of compressed short genomic segments; said storing comprises storing said plurality of compressed short genomic segments; said indexing comprises indexing said plurality of compressed short genomic segments; and said analyzing comprises: receiving a user query containing at least one logical condition relating to one of said plurality of uncompressed short genomic segments and at least one similarity criterion; extracting a subset of said plurality of uncompressed short genomic segments, each of said subset being similar to said one of said uncompressed short genomic segments according to said at least one similarity criterion; and retrieving results to said user query, based at least in part on said extracting, said retrieving comprising at least one of the following: retrieving one of said plurality of proteins, retrieving one of said plurality of genomic region sequences, and retrieving and decompressing one of said plurality of compressed short genomic segments, based at least in part on said indexing.
548. A computer-readable medium according to claim 547 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
549. A computer-readable medium according to claim 548 and wherein each of said plurality of protein coding regions is normalized.
550. A computer-readable medium according to claim 547 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
551. A computer-readable medium according to claim 550 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
552. A computer-readable medium according to claim 550 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
553. A computer-readable medium according to claim 550 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
554. A computer-readable medium according to claim 547 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
555. A computer-readable medium according to claim 554 and wherein said given length is user specified.
556. A computer-readable medium according to claim 547 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
557. A computer-readable medium according to claim 547 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
558. A computer-readable medium according to claim 547 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
559. A computer-readable medium according to claim 558 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
560. A computer-readable medium according to claim 558 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
561. A computer-readable medium according to claim 547 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
562. A computer-readable medium according to claim 547 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
563. A computer-readable medium according to claim 547 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
564. A computer-readable medium according to claim 547 and wherein: said computer-readable medium also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
565. A computer-readable medium according to claim 547 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
566. A computer-readable medium according to claim 565 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
567. A computer-readable medium according to claim 565 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
568. A computer-readable medium according to claim 565 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
569. A computer-readable medium according to claim 547 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
570. A computer-readable medium according to claim 569 and wherein said sorting is alphabetical sorting.
571. A computer-readable medium according to claim 547 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
572. A computer-readable medium according to claim 571 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
573. A computer-readable medium according to claim 547 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
574. A computer-readable medium according to claim 573 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
575. A computer-readable medium according to claim 573 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
576. A computer-readable medium according to claim 573 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
577. A computer-readable medium according to claim 573 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
578. A computer-readable medium according to claim 573 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
579. A computer-readable medium according to claim 578 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
580. A computer-readable medium according to claim 579 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
581. A computer-readable medium according to claim 579 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
582. A computer-readable medium according to claim 547 and wherein said method also comprises decompressing and delivering each of said second plurality of compressed genomic sequences.
583. A computer-readable medium according to claim 547 and wherein said producing does not include comparing said genomic sequence with any of said first plurality of genomic sequences.
584. A computer-readable medium according to claim 547 and wherein said producing does not include decompressing any of said first plurality of compressed genomic sequences.
585. A method for analysis of genomic sequence data comprising: obtaining genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said genomic sequence data; calculating and storing a plurality of genomic region sequences, based at least in part on said obtaining; determining and storing a plurality of short genomic segments contained in each of said plurality of genomic region sequences; receiving a user query containing at least one logical condition relating to at least one of the following: one of said genomic region sequences, and one of said short genomic segments; and producing results to said user query, comprising at least one of the following: one of said plurality of proteins, one of said plurality of genomic region sequences, and one of said plurality of short genomic segments, based at least in part on said user query and said determining and storing.
586. A method according to claim 585 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
587. A method according to claim 586 and wherein each of said plurality of protein coding regions is normalized.
588. A method according to claim 585 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
589. A method according to claim 588 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
590. A method according to claim 588 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
591. A method according to claim 588 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
592. A method according to claim 585 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
593. A method according to claim 585 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
594. A method according to claim 593 and wherein said given length is user specified.
595. A method according to claim 585 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
596. A method according to claim 585 and wherein said method also includes storing for each one of said plurality of proteins at least one of the following protein properties: an organism of expression, a tissue of expression and a function.
597. A method according to claim 585 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
598. A method according to claim 585 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
599. A method according to claim 585 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
600. A method according to claim 585 and wherein: said method also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
601. A method according to claim 600 and wherein each of said plurality of criteria comprises at least one of said at least one logical condition.
602. A method according to claim 585 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
603. A method according to claim 602 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
604. A method according to claim 602 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
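The method of claims 585 to 604 can be illustrated with a minimal sketch. It assumes that a "short genomic segment" is simply every substring of a given length within a region sequence, and that the uniqueness, commonality and exclusion conditions are evaluated against an in-memory mapping from segment to region names. The function names (`short_segments`, `build_segment_index`, `unique_to`, `common_to`, `excluded_from`) and the sample region names are hypothetical and do not come from the claims.

```python
from collections import defaultdict


def short_segments(region: str, length: int) -> set[str]:
    """Return every distinct substring of the given length in a region sequence."""
    return {region[i:i + length] for i in range(len(region) - length + 1)}


def build_segment_index(regions: dict[str, str], length: int) -> dict[str, set[str]]:
    """Map each short segment to the names of the regions that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, sequence in regions.items():
        for segment in short_segments(sequence, length):
            index[segment].add(name)
    return dict(index)


def unique_to(index: dict[str, set[str]], region_name: str) -> list[str]:
    """Uniqueness condition: segments found in the named region and nowhere else."""
    return sorted(s for s, names in index.items() if names == {region_name})


def common_to(index: dict[str, set[str]], min_regions: int) -> list[str]:
    """Commonality condition: segments found in at least min_regions regions."""
    return sorted(s for s, names in index.items() if len(names) >= min_regions)


def excluded_from(index: dict[str, set[str]], region_name: str) -> list[str]:
    """Exclusion condition: segments that never occur in the named region."""
    return sorted(s for s, names in index.items() if region_name not in names)


# Illustrative data only; the region names and sequences are made up.
regions = {
    "geneA_upstream": "ACGTAC",
    "geneB_upstream": "CGTACG",
    "geneC_upstream": "TTTACG",
}
index = build_segment_index(regions, length=4)
```

With this toy data, `unique_to(index, "geneA_upstream")` yields the one segment that occurs only upstream of gene A, and `common_to(index, 2)` yields the segments shared by at least two regions. Precomputing the index once corresponds to the "determining and storing" step, so that later user queries need not rescan the region sequences.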
605. A genomic sequence analysis system comprising: a genomic sequence analysis preprocessor operative to obtain genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said genomic sequence data, to calculate and store a plurality of genomic region sequences, based at least in part on said genomic sequence data, and protein coding region location data, and to determine and store a plurality of short genomic segments contained in each of said plurality of genomic region sequences; and a genomic sequence analysis executor employing said genomic sequence analysis preprocessor and operative to receive a user query containing at least one logical condition relating to at least one of the following: one of said genomic region sequences, and one of said short genomic segments, and to produce results to said user query, comprising at least one of the following: one of said plurality of proteins, one of said plurality of genomic region sequences, and one of said plurality of short genomic segments, based at least in part on said user query and said determining and storing.
606. A genomic sequence analysis system according to claim 605 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
607. A genomic sequence analysis system according to claim 606 and wherein each of said plurality of protein coding regions is normalized.
608. A genomic sequence analysis system according to claim 605 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
609. A genomic sequence analysis system according to claim 608 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
610. A genomic sequence analysis system according to claim 608 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
611. A genomic sequence analysis system according to claim 608 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
612. A genomic sequence analysis system according to claim 605 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
613. A genomic sequence analysis system according to claim 605 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
614. A genomic sequence analysis system according to claim 613 and wherein said given length is user specified.
615. A genomic sequence analysis system according to claim 605 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
616. A genomic sequence analysis system according to claim 605 and wherein said method also includes storing for each one of said plurality of proteins at least one of the following protein properties: an organism of expression, a tissue of expression and a function.
617. A genomic sequence analysis system according to claim 605 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
618. A genomic sequence analysis system according to claim 605 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
619. A genomic sequence analysis system according to claim 605 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
620. A genomic sequence analysis system according to claim 605 and wherein: said method also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
621. A genomic sequence analysis system according to claim 620 and wherein each of said plurality of criteria comprises at least one of said at least one logical condition.
622. A genomic sequence analysis system according to claim 605 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
623. A genomic sequence analysis system according to claim 622 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
624. A genomic sequence analysis system according to claim 622 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
625. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: obtaining genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by said genomic sequence data; calculating and storing a plurality of genomic region sequences, based at least in part on said obtaining; determining and storing a plurality of short genomic segments contained in each of said plurality of genomic region sequences; receiving a user query containing at least one logical condition relating to at least one of the following: one of said genomic region sequences, and one of said short genomic segments; and producing results to said user query, comprising at least one of the following: one of said plurality of proteins, one of said plurality of genomic region sequences, and one of said plurality of short genomic segments, based at least in part on said user query and said determining and storing.
626. A computer-readable medium according to claim 625 and wherein said plurality of genomic region sequences comprises a plurality of protein coding regions.
627. A computer-readable medium according to claim 626 and wherein each of said plurality of protein coding regions is normalized.
628. A computer-readable medium according to claim 625 and wherein said plurality of genomic region sequences comprises a plurality of regions adjacent to protein coding regions.
629. A computer-readable medium according to claim 628 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions upstream to protein coding regions.
630. A computer-readable medium according to claim 628 and wherein said plurality of regions adjacent to protein coding regions comprises a plurality of regions downstream to protein coding regions.
631. A computer-readable medium according to claim 628 and wherein each of said plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of said plurality of protein coding regions adjacent thereto.
632. A computer-readable medium according to claim 625 and wherein said plurality of proteins known to be encoded by said genomic sequence data comprises a majority of proteins known to be encoded by said genomic sequence data.
633. A computer-readable medium according to claim 625 and wherein said plurality of short genomic segments contained in each of said plurality of genomic region sequences comprises a majority of short genomic segments of a given length contained in each of said plurality of genomic region sequences.
634. A computer-readable medium according to claim 633 and wherein said given length is user specified.
635. A computer-readable medium according to claim 625 and wherein genomic sequence data comprises: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than said first organism.
636. A computer-readable medium according to claim 625 and wherein said computer-readable medium also includes storing for each one of said plurality of proteins at least one of the following protein properties: an organism of expression, a tissue of expression and a function.
637. A computer-readable medium according to claim 625 and wherein said at least one logical condition comprises a degree of uniqueness of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
638. A computer-readable medium according to claim 625 and wherein said at least one logical condition comprises a degree of commonality of one of said plurality of short genomic sequences relative to at least two of said plurality of genomic sequence regions.
639. A computer-readable medium according to claim 625 and wherein said at least one logical condition comprises exclusion of one of said plurality of short genomic sequences relative to at least one of said plurality of genomic sequence regions.
640. A computer-readable medium according to claim 625 and wherein: said computer-readable medium also includes: storing, based on user input, a plurality of criteria; determining and marking, each of said plurality of short genomic segments which complies with each one of said criteria; and said user query is based at least in part on at least one of said plurality of criteria.
641. A computer-readable medium according to claim 640 and wherein each of said plurality of criteria comprises at least one of said at least one logical condition.
642. A computer-readable medium according to claim 625 and wherein said determining and storing also includes determining and storing a relationship between at least two of said plurality of short genomic segments, and said logical condition references said relationship.
643. A computer-readable medium according to claim 642 and wherein said relationship also includes a relation between a location of a first one of said plurality of short genomic sequence relative to one of said plurality of genomic region sequences, and a second one of said plurality of short genomic sequence relative to said one of said plurality of genomic region sequences.
644. A computer-readable medium according to claim 642 and wherein said relationship also includes a similarity between a first one of said plurality of short genomic sequences and a second one of said plurality of short genomic sequences.
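Claims 627 to 631 recite regions that are "normalized" according to coding direction without defining the operation in this section. One plausible reading, offered here only as an assumption, is that a region belonging to a gene on the reverse strand is reverse-complemented so that every stored region reads in the coding direction of its gene. The sketch below follows that reading; the function names, the `+`/`-` strand notation and the 0-based half-open coordinates are all hypothetical choices, not taken from the claims.

```python
# Complement table for the four nucleotides plus N (unknown nucleotide).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a nucleotide string."""
    return sequence.translate(COMPLEMENT)[::-1]


def normalized_region(genome: str, start: int, end: int, strand: str) -> str:
    """Extract genome[start:end] and orient it in the coding direction.

    Coordinates are 0-based, half-open, on the forward strand (an assumption).
    """
    region = genome[start:end]
    if strand == "+":
        return region
    if strand == "-":
        return reverse_complement(region)
    raise ValueError(f"unknown strand: {strand!r}")


def upstream_region(genome: str, cds_start: int, cds_end: int,
                    strand: str, flank: int) -> str:
    """Return up to `flank` bases upstream of a coding region, normalized.

    For a forward-strand gene, upstream lies before cds_start; for a
    reverse-strand gene, it lies after cds_end on the forward strand.
    """
    if strand == "+":
        return normalized_region(genome, max(0, cds_start - flank), cds_start, "+")
    if strand == "-":
        return normalized_region(genome, cds_end, cds_end + flank, "-")
    raise ValueError(f"unknown strand: {strand!r}")
```

Under this reading, upstream regions of genes on opposite strands become directly comparable, which is what a segment-level query across regions would need.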
645. A method for storage and retrieval of compressed genomic sequence data, the method comprising: receiving uncompressed genomic sequence data; compressing said uncompressed genomic sequence data into compressed genomic sequence data; storing said compressed genomic sequence data; indexing said compressed genomic sequence data; retrieving at least part of said compressed genomic sequence data, based at least in part on said indexing; and decompressing said at least part of said compressed genomic sequence data.
646. A method according to claim 645 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
647. A method according to claim 646 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
648. A method according to claim 646 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
649. A method according to claim 646 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
650. A method according to claim 645 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
651. A method according to claim 650 and wherein said sorting is alphabetical sorting.
652. A method according to claim 645 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
653. A method according to claim 652 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
654. A method according to claim 645 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
655. A method according to claim 654 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
656. A method according to claim 654 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
657. A method according to claim 654 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
658. A method according to claim 654 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
659. A method according to claim 654 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
660. A method according to claim 659 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
661. A method according to claim 660 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
662. A method according to claim 660 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
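The character packing of claims 652 to 658 and the query and sorting steps of claims 646 to 651 can be illustrated with a minimal sketch. It is not the applicants' implementation. It assumes a five-letter alphabet (A, C, G, T, plus N for an unknown nucleotide), packs three uncompressed characters into one byte as base-6 digits with digit 0 reserved for padding, and uses a sorted in-memory list as the index. The names `compress`, `decompress` and `CompressedStore`, and the query conditions `"eq"` and `"ge"`, are hypothetical.

```python
import bisect

# Assumed alphabet, in alphabetical order; N stands for an unknown nucleotide.
ALPHABET = "ACGNT"
# Digit 0 is reserved for padding, so letters map to base-6 digits 1..5.
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def compress(seq: str) -> bytes:
    """Pack three characters into one byte as a three-digit base-6 number."""
    out = bytearray()
    for start in range(0, len(seq), 3):
        group = seq[start:start + 3]
        value = 0
        for pos in range(3):
            digit = CODE[group[pos]] if pos < len(group) else 0
            value = value * 6 + digit
        out.append(value)  # at most 5*36 + 5*6 + 5 = 215, so it fits in a byte
    return bytes(out)


def decompress(data: bytes) -> str:
    """Recover the original characters, skipping padding digits."""
    chars = []
    for value in data:
        for digit in (value // 36, (value // 6) % 6, value % 6):
            if digit:
                chars.append(ALPHABET[digit - 1])
    return "".join(chars)


class CompressedStore:
    """Keeps only compressed strings, in a sorted list that acts as the index."""

    def __init__(self):
        self._index = []

    def add(self, seq: str) -> None:
        bisect.insort(self._index, compress(seq))

    def query(self, condition: str, query_seq: str) -> list:
        key = compress(query_seq)  # the query data is compressed first
        if condition == "eq":
            lo = bisect.bisect_left(self._index, key)
            hi = bisect.bisect_right(self._index, key)
        elif condition == "ge":
            lo, hi = bisect.bisect_left(self._index, key), len(self._index)
        else:
            raise ValueError("unsupported condition: " + condition)
        # Only the matching compressed strings are decompressed.
        return [decompress(item) for item in self._index[lo:hi]]


store = CompressedStore()
for s in ["GATTACA", "ACGT", "TTN", "AC", "ACGTA"]:
    store.add(s)
print(store.query("ge", "ACGT"))
```

Because the padding digit is smaller than every letter code and the letter codes follow alphabetical order, comparing the packed bytes gives the same ordering as comparing the original strings in this sketch. The store can therefore sort and answer range queries without holding the uncompressed strings.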
663. A genomic compressed data storage and retrieval system comprising: a genomic sequence data compressor operative to receive uncompressed genomic sequence data and to compress said uncompressed genomic sequence data into compressed genomic sequence data; a genomic compressed sequence data indexer operative to store said compressed genomic sequence data and to index said compressed genomic sequence data; and a genomic sequence data extractor employing said genomic compressed sequence data indexer, and operative to retrieve at least part of said compressed genomic sequence data and to decompress said at least part of said compressed genomic sequence data.
664. A genomic compressed data storage and retrieval system according to claim 663 and wherein said genomic sequence data extractor provides the following functionality: receiving a query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
665. A genomic compressed data storage and retrieval system according to claim 664 and wherein said functionality of said genomic sequence data extractor does not require storing said uncompressed genomic sequence data.
666. A genomic compressed data storage and retrieval system according to claim 664 and wherein said functionality of said genomic sequence data extractor does not require accessing said uncompressed genomic sequence data.
667. A genomic compressed data storage and retrieval system according to claim 664 and wherein said functionality of said genomic sequence data extractor does not require retrieving said uncompressed genomic sequence data.
668. A genomic compressed data storage and retrieval system according to claim 663 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to sort said uncompressed genomic sequence data.
669. A genomic compressed data storage and retrieval system according to claim 668 and wherein said genomic sequence data extractor employing said genomic compressed sequence data indexer is operative to alphabetically sort said uncompressed genomic sequence data.
670. A genomic compressed data storage and retrieval system according to claim 663 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
671. A genomic compressed data storage and retrieval system according to claim 670 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; said alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
672. A genomic compressed data storage and retrieval system according to claim 663 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
673. A genomic compressed data storage and retrieval system according to claim 672 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
674. A genomic compressed data storage and retrieval system according to claim 672 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
675. A genomic compressed data storage and retrieval system according to claim 672 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
676. A genomic compressed data storage and retrieval system according to claim 672 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
677. A genomic compressed data storage and retrieval system according to claim 672 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
678. A genomic compressed data storage and retrieval system according to claim 677 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor is performed internally by a database.
679. A genomic compressed data storage and retrieval system according to claim 678 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require a program external to said database.
680. A genomic compressed data storage and retrieval system according to claim 678 and wherein functionality of said genomic sequence data compressor, said genomic compressed sequence data indexer, and said genomic sequence data extractor does not require programming.
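The arrangement of claim 677, in which compressed strings are stored in a field of a database table, can be sketched with Python's standard `sqlite3` module. This is an illustration under stated assumptions, not the system described in the application: the table name `sequences`, the column `packed`, the index `idx_packed`, the functions `pack` and `unpack`, and the 4-bit character codes are all hypothetical. The codec here is ordinary Python registered with the engine, so the sketch shows the table, field and index layout rather than the variants of claims 679 and 680 that require no external program.

```python
import sqlite3

# Assumed 4-bit codes; 0 is padding and letters follow alphabetical order.
CODE = {"A": 1, "C": 2, "G": 3, "N": 4, "T": 5}
LETTER = {v: k for k, v in CODE.items()}


def pack(seq):
    """Two characters per byte: high nibble first, low nibble padded with 0."""
    out = bytearray()
    for i in range(0, len(seq), 2):
        high = CODE[seq[i]]
        low = CODE[seq[i + 1]] if i + 1 < len(seq) else 0
        out.append((high << 4) | low)
    return bytes(out)


def unpack(blob):
    """Inverse of pack: emit one letter per non-zero nibble."""
    chars = []
    for byte in blob:
        for nibble in (byte >> 4, byte & 0x0F):
            if nibble:
                chars.append(LETTER[nibble])
    return "".join(chars)


conn = sqlite3.connect(":memory:")
# Register the codec as SQL functions so that statements can call it directly.
conn.create_function("pack", 1, pack)
conn.create_function("unpack", 1, unpack)

conn.execute(
    "CREATE TABLE sequences (id INTEGER PRIMARY KEY, packed BLOB NOT NULL)"
)
conn.execute("CREATE INDEX idx_packed ON sequences (packed)")
conn.executemany(
    "INSERT INTO sequences (packed) VALUES (pack(?))",
    [("GATTACA",), ("ACGT",), ("TTN",), ("AC",)],
)

# The query data is packed inside the statement and compared with the packed
# column; only the selected rows are unpacked.
rows = conn.execute(
    "SELECT unpack(packed) FROM sequences "
    "WHERE packed >= pack(?) ORDER BY packed",
    ("ACGT",),
).fetchall()
result = [row[0] for row in rows]
print(result)
```

Only the packed column is ever written to the table. The comparison and the ordering both operate on the stored bytes, which under the assumed codes sort in the same order as the original strings.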
681. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving uncompressed genomic sequence data; compressing said uncompressed genomic sequence data into compressed genomic sequence data; storing said compressed genomic sequence data; indexing said compressed genomic sequence data; retrieving at least part of said compressed genomic sequence data, based at least in part on said indexing; and decompressing said at least part of said compressed genomic sequence data.
682. A computer-readable medium according to claim 681 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed genomic sequence data, based at least in part on said compressed query data.
683. A computer-readable medium according to claim 682 and wherein said retrieving does not require storing said uncompressed genomic sequence data.
684. A computer-readable medium according to claim 682 and wherein said retrieving does not require accessing said uncompressed genomic sequence data.
685. A computer-readable medium according to claim 682 and wherein said retrieving does not require retrieving said uncompressed genomic sequence data.
686. A computer-readable medium according to claim 681 and wherein said retrieving includes sorting said uncompressed genomic sequence data, based at least in part on said indexing.
687. A computer-readable medium according to claim 686 and wherein said sorting is alphabetical sorting.
688. A computer-readable medium according to claim 681 and wherein: said uncompressed genomic sequence data comprises a plurality of uncompressed strings; and said compressed genomic sequence data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
689. A computer-readable medium according to claim 688 and wherein: each of said plurality of uncompressed strings is an alphanumeric string representing a genomic sequence; each alphanumeric string comprises a plurality of characters; and each of said plurality of characters represents one of the following items: a nucleotide in said genomic sequence, and an unknown nucleotide in said genomic sequence.
690. A computer-readable medium according to claim 681 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
691. A computer-readable medium according to claim 690 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
692. A computer-readable medium according to claim 690 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
693. A computer-readable medium according to claim 690 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
694. A computer-readable medium according to claim 690 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
695. A computer-readable medium according to claim 690 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
696. A computer-readable medium according to claim 695 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
697. A computer-readable medium according to claim 696 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
698. A computer-readable medium according to claim 696 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
699. A method for comparing compressed genomic sequences, the method comprising: receiving two compressed genomic sequences, a first compressed genomic sequence representing in compressed form a first genomic sequence, and a second compressed genomic sequence representing in compressed form a second genomic sequence; comparing said first compressed genomic sequence with said second compressed genomic sequence; and determining degree of similarity between said first genomic sequence and said second genomic sequence, based at least in part on said comparing.
700. A method according to claim 699 and wherein said determining does not require comparing said first genomic sequence with said second genomic sequence.
701. A method according to claim 699 and wherein said determining does not include any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
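One way to determine a degree of similarity from the compressed forms alone, as in claims 699 to 701, is sketched below. It is an assumed scheme, not one taken from the application: each nucleotide (A, C, G or T only) occupies two bits, four nucleotides share a byte, and similarity is defined as the fraction of matching positions between two sequences of equal length. The names `pack2`, `MISMATCHES` and `similarity` are hypothetical.

```python
# Assumed 2-bit codes for the four nucleotides.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack2(seq):
    """Return (length, bytes) with four nucleotides per byte, 2 bits each."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for ch in group:
            byte = (byte << 2) | CODE[ch]
        byte <<= 2 * (4 - len(group))  # pad a short final group with zero bits
        out.append(byte)
    return len(seq), bytes(out)


# MISMATCHES[x] is the number of non-zero 2-bit fields in the byte x.
MISMATCHES = [
    sum(1 for shift in (0, 2, 4, 6) if (x >> shift) & 3) for x in range(256)
]


def similarity(a, b):
    """Fraction of matching positions, computed on the packed bytes only."""
    (n, data_a), (m, data_b) = a, b
    if n != m:
        raise ValueError("this sketch compares equal-length sequences only")
    if n == 0:
        return 1.0
    # XOR leaves a non-zero 2-bit field exactly where the nucleotides differ.
    differing = sum(MISMATCHES[x ^ y] for x, y in zip(data_a, data_b))
    return (n - differing) / n


first = pack2("ACGTAC")
second = pack2("ACGTTC")
print(similarity(first, second))
```

The two sequences are never unpacked. Both are padded identically, so the padding bits cancel in the XOR and do not count as mismatches.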
702. A method for assessing similarity of genomic sequences, the method comprising: receiving the following items: a target genomic sequence, a first plurality of compressed genomic sequences, representing respectively in compressed form a first plurality of genomic sequences, and at least one similarity criterion; and producing a second plurality of compressed genomic sequences, representing respectively in compressed form a second plurality of genomic sequences, said second plurality of genomic sequences being a subset of said first plurality of genomic sequences, each of said second plurality of genomic sequences being similar to said target genomic sequence, according to said at least one similarity criterion.
703. A method according to claim 702 and wherein said method also comprises decompressing each of said second plurality of compressed genomic sequences.
704. A method according to claim 702 and wherein said producing does not require comparing said target genomic sequence with any of said first plurality of genomic sequences.
705. A method according to claim 702 and wherein said producing does not require decompressing any of said first plurality of compressed genomic sequences.
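The selection step of claims 702 to 705 can be sketched as a filter over compressed sequences. The sketch makes several assumptions that are not in the application: sequences contain only A, C, G and T, each is packed into a single Python integer at two bits per nucleotide, and the similarity criterion is a maximum number of mismatching positions between sequences of equal length. The names `pack_int`, `unpack_int`, `mismatches` and `select_similar` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def pack_int(seq):
    """Return (length, integer) with 2 bits per nucleotide, first one highest."""
    value = 0
    for ch in seq:
        value = (value << 2) | CODE[ch]
    return len(seq), value


def unpack_int(packed):
    """Inverse of pack_int."""
    n, value = packed
    return "".join(LETTERS[(value >> (2 * (n - 1 - i))) & 3] for i in range(n))


def mismatches(a, b):
    """Count differing positions using only the packed integers."""
    (n, x), (m, y) = a, b
    if n != m:
        raise ValueError("lengths differ")
    diff = x ^ y
    mask = (4 ** n - 1) // 3  # one bit set at the low end of each 2-bit field
    # Fold each field's high bit onto its low bit, then count the set low bits.
    return bin((diff | (diff >> 1)) & mask).count("1")


def select_similar(target, library, max_mismatches):
    """Return the still-compressed subset of library that meets the criterion."""
    packed_target = pack_int(target)  # the target is compressed once
    return [
        item
        for item in library
        if item[0] == packed_target[0]
        and mismatches(packed_target, item) <= max_mismatches
    ]


library = [pack_int(s) for s in ["ACGT", "ACGA", "TTTT", "ACG"]]
selected = select_similar("ACGT", library, max_mismatches=1)
print([unpack_int(item) for item in selected])
```

Only the selected subset is decompressed at the end, which corresponds to the optional step of claim 703. None of the library sequences is unpacked during the selection itself.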
706. A compressed genomic sequence comparison system comprising: a genomic data evaluator operative to receive two compressed genomic sequences, a first compressed genomic sequence representing in compressed form a first genomic sequence, and a second compressed genomic sequence representing in compressed form a second genomic sequence, and to compare said first compressed genomic sequence with said second compressed genomic sequence; and a genomic data analyzer employing said genomic data evaluator, and operative to determine degree of similarity between said first genomic sequence and said second genomic sequence.
707. A compressed genomic sequence comparison system according to claim 706 and wherein functionality of said genomic data analyzer does not require comparing said first genomic sequence with said second genomic sequence.
708. A compressed genomic sequence comparison system according to claim 706 and wherein functionality of said genomic data analyzer does not require any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
709. A genomic compressed sequence similarity assessment system comprising: a genomic data evaluator operative to receive a genomic sequence, a first plurality of compressed genomic sequences, representing respectively in compressed form a first plurality of genomic sequences, and at least one similarity criterion; and a genomic data extractor operative to produce a second plurality of compressed genomic sequences, representing respectively in compressed form a second plurality of genomic sequences, said second plurality of genomic sequences being a subset of said first plurality of genomic sequences, each of said second plurality of genomic sequences being similar to said genomic sequence, according to said at least one similarity criterion.
710. A genomic compressed sequence similarity assessment system according to claim 709 and wherein said system also comprises a genomic decompressor operative to decompress each of said second plurality of compressed genomic sequences.
711. A genomic compressed sequence similarity assessment system according to claim 709 and wherein functionality of said genomic data extractor does not require comparing said genomic sequence with any of said first plurality of genomic sequences.
712. A genomic compressed sequence similarity assessment system according to claim 709 and wherein functionality of said genomic data extractor does not require decompressing any of said first plurality of compressed genomic sequences.
713. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving two compressed genomic sequences, a first compressed genomic sequence representing in compressed form a first genomic sequence, and a second compressed genomic sequence representing in compressed form a second genomic sequence; comparing said first compressed genomic sequence with said second compressed genomic sequence; and determining degree of similarity between said first genomic sequence and said second genomic sequence, based at least in part on said comparing.
714. A computer-readable medium according to claim 713 and wherein said determining does not include comparing said first genomic sequence with said second genomic sequence.
715. A computer-readable medium according to claim 713 and wherein said determining does not include any of the following: decompressing said first compressed genomic sequence, and decompressing said second compressed genomic sequence.
716. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving the following items: a genomic sequence, a first plurality of compressed genomic sequences, representing respectively in compressed form a first plurality of genomic sequences, and at least one similarity criterion; and producing a second plurality of compressed genomic sequences, representing respectively in compressed form a second plurality of genomic sequences, said second plurality of genomic sequences being a subset of said first plurality of genomic sequences, each of said second plurality of genomic sequences being similar to said genomic sequence, according to said at least one similarity criterion.
717. A computer-readable medium according to claim 716 and wherein said method also comprises decompressing and delivering each of said second plurality of compressed genomic sequences.
718. A computer-readable medium according to claim 716 and wherein said producing does not include comparing said genomic sequence with any of said first plurality of genomic sequences.
719. A computer-readable medium according to claim 716 and wherein said producing does not include decompressing any of said first plurality of compressed genomic sequences.
720. A method for storage and retrieval of compressed data, the method comprising: receiving uncompressed data; compressing said uncompressed data into compressed data; storing said compressed data; indexing said compressed data; retrieving at least part of said compressed data, based at least in part on said indexing; and decompressing said at least part of said compressed data.
721. A method according to claim 720 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed data, based at least in part on said compressed query data.
722. A method according to claim 721 and wherein said retrieving does not require storing said uncompressed data.
723. A method according to claim 721 and wherein said retrieving does not require accessing said uncompressed data.
724. A method according to claim 721 and wherein said retrieving does not require retrieving said uncompressed data.
725. A method according to claim 720 and wherein said retrieving includes sorting said uncompressed data, based at least in part on said indexing.
726. A method according to claim 725 and wherein said sorting is alphabetical sorting.
727. A method according to claim 720 and wherein: said uncompressed data comprises a plurality of uncompressed strings; and said compressed data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
728. A method according to claim 727 and wherein: each of said plurality of uncompressed strings is an alphanumeric string, comprising a plurality of alphanumeric characters.
729. A method according to claim 720 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
730. A method according to claim 729 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
731. A method according to claim 729 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
732. A method according to claim 729 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
733. A method according to claim 729 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
734. A method according to claim 729 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
735. A method according to claim 734 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
736. A method according to claim 735 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
737. A method according to claim 735 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
738. A method according to claim 720 and wherein each of said plurality of compressed characters is stored in one byte of memory, said one byte of memory comprising a plurality of bits, each of said plurality of bits storing one of more than two possible values.
739. A method for comparing compressed strings, the method comprising: receiving two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string; comparing said first compressed string with said second compressed string; and determining degree of similarity between said first string and said second string, based at least in part on said comparing.
740. A method according to claim 739 and wherein said determining does not include comparing said first string with said second string.
741. A method according to claim 739 and wherein said determining does not include any of the following: decompressing said first compressed string, and decompressing said second compressed string.
742. A method according to claim 739 and wherein said first string and said second string are alphanumeric strings.
743. A method according to claim 739 and wherein each of said plurality of compressed characters is stored in one byte of memory, said one byte of memory comprising a plurality of bits, each of said plurality of bits storing one of more than two possible values.
744. A method for assessing similarity of strings, the method comprising: receiving the following items: a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion; and producing a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, said second plurality of strings being a subset of said first plurality of strings, each of said second plurality of strings being similar to said string, according to said at least one similarity criterion.
745. A method according to claim 744 and wherein said method also comprises decompressing each of said second plurality of compressed strings.
746. A method according to claim 744 and wherein said producing does not require comparing said string with any of said first plurality of strings.
747. A method according to claim 744 and wherein said producing does not require decompressing any of said first plurality of compressed strings.
748. A method according to claim 744 and wherein each of said plurality of compressed characters is stored in one byte of memory, said one byte of memory comprising a plurality of bits, each of said plurality of bits storing one of more than two possible values.
749. A compressed data storage and retrieval system comprising: a data compressor operative to receive uncompressed data and to compress said uncompressed data into compressed data; a compressed data indexer operative to store said compressed data and to index said compressed data; and a data extractor employing said compressed data indexer, and operative to retrieve at least part of said compressed data and to decompress said at least part of said compressed data.
750. A compressed data storage and retrieval system according to claim 749 and wherein said data extractor provides the following functionality: receiving a query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed data, based at least in part on said compressed query data.
751. A compressed data storage and retrieval system according to claim 750 and wherein said functionality of said data extractor does not require storing said uncompressed data.
752. A compressed data storage and retrieval system according to claim 750 and wherein said functionality of said data extractor does not require accessing said uncompressed data.
753. A compressed data storage and retrieval system according to claim 750 and wherein said functionality of said data extractor does not require retrieving said uncompressed data.
754. A compressed data storage and retrieval system according to claim 749 and wherein said data extractor employing said compressed data indexer is operative to sort said uncompressed data.
755. A compressed data storage and retrieval system according to claim 754 and wherein said data extractor employing said compressed data indexer is operative to alphabetically sort said uncompressed data.
756. A compressed data storage and retrieval system according to claim 749 and wherein: said uncompressed data comprises a plurality of uncompressed strings; and said compressed data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
757. A compressed data storage and retrieval system according to claim 756 and wherein each of said plurality of uncompressed strings is an alphanumeric string comprising a plurality of alphanumeric characters.
758. A compressed data storage and retrieval system according to claim 749 and wherein: each of said plurality of uncompressed strings comprises a plurality of uncompressed characters; and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
759. A compressed data storage and retrieval system according to claim 758 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
760. A compressed data storage and retrieval system according to claim 758 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
761. A compressed data storage and retrieval system according to claim 758 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
762. A compressed data storage and retrieval system according to claim 758 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
763. A compressed data storage and retrieval system according to claim 758 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
764. A compressed data storage and retrieval system according to claim 763 and wherein functionality of said data compressor, said compressed data indexer, and said data extractor is performed internally by a database.
765. A compressed data storage and retrieval system according to claim 764 and wherein functionality of said data compressor, said compressed data indexer, and said data extractor does not require a program external to said database.
766. A compressed data storage and retrieval system according to claim 764 and wherein functionality of said data compressor, said compressed data indexer, and said data extractor does not require programming.
767. A compressed data storage and retrieval system according to claim 749 and wherein each of said plurality of compressed characters is stored in one byte of memory, said one byte of memory comprising a plurality of bits, each of said plurality of bits storing one of more than two possible values.
768. A compressed string comparison system comprising: a compressed string evaluator operative to receive two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, and to compare said first compressed string with said second compressed string; and a compressed string analyzer employing said compressed string evaluator, and operative to determine degree of similarity between said first string and said second string.
769. A compressed string comparison system according to claim 768 and wherein functionality of said compressed string analyzer does not require comparing said first string with said second string.
770. A compressed string comparison system according to claim 768 and wherein functionality of said compressed string analyzer does not require any of the following: decompressing said first compressed string, and decompressing said second compressed string.
771. A compressed string comparison system according to claim 768 and wherein said first string and said second string are alphanumeric strings.
772. A compressed data storage and retrieval system according to claim 768 and wherein each of said plurality of compressed characters is stored in one byte of memory, said one byte of memory comprising a plurality of bits, each of said plurality of bits storing one of more than two possible values.
773. A compressed string similarity assessment system comprising: a compressed string evaluator operative to receive a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion; and a compressed string extractor operative to produce a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, said second plurality of strings being a subset of said first plurality of strings, each of said second plurality of strings being similar to said string, according to said at least one similarity criterion.
774. A compressed string similarity assessment system according to claim 773 and wherein said system also comprises a compressed string decompressor operative to decompress each of said second plurality of compressed strings.
775. A compressed string similarity assessment system according to claim 773 and wherein functionality of said compressed string extractor does not require comparing said string with any of said first plurality of strings.
776. A compressed string similarity assessment system according to claim 773 and wherein functionality of said compressed string extractor does not require decompressing any of said first plurality of compressed strings.
777. A compressed data storage and retrieval system according to claim 773 and wherein each of said plurality of compressed characters is stored in one byte of memory, said one byte of memory comprising a plurality of bits, each of said plurality of bits storing one of more than two possible values.
778. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving uncompressed data; compressing said uncompressed data into compressed data; storing said compressed data; indexing said compressed data; retrieving at least part of said compressed data, based at least in part on said indexing; and decompressing said at least part of said compressed data.
779. A computer-readable medium according to claim 778 and wherein said retrieving comprises: receiving a query, said query comprising a query condition and uncompressed query data to which said query condition relates; compressing said uncompressed query data into compressed query data; and extracting said at least part of said compressed data, based at least in part on said compressed query data.
780. A computer-readable medium according to claim 779 and wherein said retrieving does not require storing said uncompressed data.
781. A computer-readable medium according to claim 779 and wherein said retrieving does not require accessing said uncompressed data.
782. A computer-readable medium according to claim 779 and wherein said retrieving does not require retrieving said uncompressed data.
783. A computer-readable medium according to claim 778 and wherein said retrieving includes sorting said uncompressed data, based at least in part on said indexing.
784. A computer-readable medium according to claim 783 and wherein said sorting is alphabetical sorting.
785. A computer-readable medium according to claim 778 and wherein said uncompressed data comprises a plurality of uncompressed strings, and said compressed data comprises a plurality of compressed strings, each of said plurality of uncompressed strings being compressed into a single corresponding one of said plurality of compressed strings.
786. A computer-readable medium according to claim 785 and wherein each of said plurality of uncompressed strings is an alphanumeric string, comprising a plurality of alphanumeric characters.
787. A computer-readable medium according to claim 778 and wherein each of said plurality of uncompressed strings comprises a plurality of uncompressed characters, and each of said plurality of compressed strings comprises a plurality of compressed characters, at least two of said plurality of uncompressed characters being compressed into one of said plurality of compressed characters.
788. A computer-readable medium according to claim 787 and wherein each one of said plurality of uncompressed characters is compressed into one of said plurality of compressed characters.
789. A computer-readable medium according to claim 787 and wherein said at least two of said plurality of uncompressed characters comprises at least three of said plurality of uncompressed characters.
790. A computer-readable medium according to claim 787 and wherein said at least two of said plurality of uncompressed characters comprises at least four of said plurality of uncompressed characters.
791. A computer-readable medium according to claim 787 and wherein at least three of said plurality of uncompressed characters are compressed into each one of a majority of said plurality of compressed characters.
792. A computer-readable medium according to claim 787 and wherein said plurality of compressed strings is stored in a field, said field being part of a table and said table being part of a database.
793. A computer-readable medium according to claim 792 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing are performed internally by said database.
794. A computer-readable medium according to claim 793 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require a program external to said database.
795. A computer-readable medium according to claim 793 and wherein said receiving, said compressing, said storing, said indexing, said retrieving, and said decompressing, do not require programming.
796. A computer-readable medium according to claim 778 and wherein each of said plurality of compressed characters is stored in one byte of memory, said one byte of memory comprising a plurality of bits, each of said plurality of bits storing one of more than two possible values.
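The steps recited in claims 778 and 779 (compress, store, index, retrieve via a compressed query, decompress) can be sketched as follows. This is only an illustration under stated assumptions: a hypothetical 2-bit nucleotide code, an in-memory list standing in for the database field, and an exact-match query condition. The names `compress`, `decompress` and `CompressedStore` are not taken from the source.

```python
from collections import defaultdict

# Hypothetical 2-bit code; the claims do not prescribe an encoding.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def compress(seq):
    """Return (length, packed bytes) with four characters per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        byte <<= 2 * (4 - len(chunk))  # zero-pad a short final chunk
        out.append(byte)
    return len(seq), bytes(out)


def decompress(record):
    """Invert compress(); the stored length trims the padding."""
    length, data = record
    chars = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            chars.append(LETTERS[(byte >> shift) & 3])
    return "".join(chars[:length])


class CompressedStore:
    """Keeps only compressed records plus an index over them."""

    def __init__(self):
        self._records = []               # storing
        self._index = defaultdict(list)  # indexing: record -> row numbers

    def add(self, seq):
        record = compress(seq)           # receiving and compressing
        self._records.append(record)
        self._index[record].append(len(self._records) - 1)

    def find_equal(self, query):
        key = compress(query)            # the query data is compressed first
        rows = self._index.get(key, [])  # retrieving, based on the index
        return [decompress(self._records[i]) for i in rows]
```

In this sketch the uncompressed data is never stored or consulted during retrieval, which is the property claims 780 to 782 describe; only the matching records are decompressed at the end.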
797. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, said second string being different from said first string; comparing said first compressed string with said second compressed string, and determining degree of similarity between said first string and said second string, based at least in part on said comparing.
798. A computer-readable medium according to claim 797 and wherein said determining does not include comparing said first string with said second string.
799. A computer-readable medium according to claim 797 and wherein said determining does not include any of the following: decompressing said first compressed string, and decompressing said second compressed string.
800. A computer-readable medium according to claim 797 and wherein said first string and said second string are alphanumeric strings.
801. A computer-readable medium according to claim 797 and wherein each of said plurality of compressed characters is stored in one byte of memory, said one byte of memory comprising a plurality of bits, each of said plurality of bits storing one of more than two possible values.
802. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving the following items: a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion; and producing a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, said second plurality of strings being a subset of said first plurality of strings, each of said second plurality of strings being similar to said string, according to said at least one similarity criterion.
803. A computer-readable medium according to claim 802 and wherein said method also comprises decompressing each of said second plurality of compressed strings.
804. A computer-readable medium according to claim 802 and wherein said producing does not include comparing said string with any of said first plurality of strings.
805. A computer-readable medium according to claim 802 and wherein said producing does not include decompressing any of said first plurality of compressed strings.
806. A computer-readable medium according to claim 802 and wherein each of said plurality of compressed characters is stored in one byte of memory, said one byte of memory comprising a plurality of bits, each of said plurality of bits storing one of more than two possible values.
807. A method for displaying genomic sequence data, the method including: receiving an alphanumeric string representing genomic sequence data, said alphanumeric string comprising a plurality of characters, each of said characters representing a nucleotide in said genomic sequence; and expressing said alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute.
808. A method according to claim 807 and wherein said first plurality of nucleotides are represented by at least one first representing attribute, and said second plurality of nucleotides are represented by at least one second representing attribute, said second representing attribute being different from said first representing attribute.
809. A method according to claim 807 and wherein said representation comprises a human sensible representation.
810. A method according to claim 808 and wherein said representation comprises a human sensible representation.
811. A method according to claim 810 and wherein said at least one first representing attribute and said at least one second representing attribute are graphical attributes.
812. A method according to claim 811 and wherein said graphical attributes are shapes.
813. A method according to claim 811 and wherein said graphical attributes are positions.
814. A method according to claim 813 and wherein said positions are vertical positions.
815. A method according to claim 811 and wherein said graphical attributes are orientations.
816. A method according to claim 815 and wherein said orientations are vertical orientations.
817. A method according to claim 811 and wherein said graphical attributes are colors.
818. A method according to claim 810 and wherein using said representation also includes representing each of the following four nucleotides: adenine, thymine, cytosine, and guanine, by a different color.
819. A method according to claim 811 and wherein said human sensible representation comprises one of the following: a shape with a letter and a shape without a letter.
820. A method according to claim 811 and wherein said human sensible representation is produced using a computer font.
821. A method according to claim 820 and wherein said computer font is a TRUETYPE® font.
822. A method according to claim 807 and wherein said representation comprises a machine sensible representation.
823. A method according to claim 808 and wherein said representation comprises a machine sensible representation.
824. A method according to claim 823 and wherein said at least one first representing attribute and said at least one second representing attribute are machine sensible attributes.
825. A method according to claim 808 and wherein said first plurality of nucleotides are purine nucleotides, and said second plurality of nucleotides are pyrimidine nucleotides.
826. A method according to claim 808 and wherein said first plurality of nucleotides consists of adenine and thymine nucleotides, and said second plurality of nucleotides consists of guanine and cytosine nucleotides.
827. A method according to claim 808 and wherein said representation also distinguishes a third plurality of nucleotides, sharing in common a third genomic attribute, from a fourth plurality of nucleotides, sharing in common a fourth genomic attribute, said fourth genomic attribute being different from said third genomic attribute.
828. A method according to claim 827 and wherein said third plurality of nucleotides are represented by at least one third representing attribute, and said fourth plurality of nucleotides are represented by at least one fourth representing attribute, said at least one third representing attribute being different from said at least one fourth representing attribute.
829. A method according to claim 827 and wherein said first plurality of nucleotides are purine nucleotides, said second plurality of nucleotides are pyrimidine nucleotides, said third plurality of nucleotides are adenine and thymine nucleotides, and said fourth plurality of nucleotides are guanine and cytosine nucleotides.
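A representation of the kind recited in claims 825 to 829 can be sketched with two independent attributes per nucleotide. In this illustration, vertical position is assumed to encode purine versus pyrimidine and shape is assumed to encode adenine/thymine versus guanine/cytosine; the attribute names, the two-row text rendering, and the glyphs `o` and `#` are hypothetical choices, not taken from the claims.

```python
PURINES = {"A", "G"}      # the pyrimidines are C and T
AT_GROUP = {"A", "T"}     # the other group is G and C


def express(seq):
    """Map each nucleotide to a (vertical position, shape) pair."""
    result = []
    for n in seq:
        if n not in "ACGT":
            raise ValueError(f"unexpected character: {n!r}")
        position = "upper" if n in PURINES else "lower"
        shape = "round" if n in AT_GROUP else "square"
        result.append((position, shape))
    return result


def render(seq):
    """Draw the representation as two text rows ('o' = round, '#' = square)."""
    upper, lower = [], []
    for position, shape in express(seq):
        glyph = "o" if shape == "round" else "#"
        upper.append(glyph if position == "upper" else " ")
        lower.append(glyph if position == "lower" else " ")
    return ["".join(upper), "".join(lower)]
```

Because the two attributes are independent, each of the four nucleotides receives a distinct combination, while either grouping can still be read off on its own.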
830. A method of graphically displaying genomic sequence information, said method comprising: receiving a first alphanumeric string representing a first genomic sequence and a second alphanumeric string representing a second genomic sequence, said second genomic sequence being a reversed-inversed genomic sequence of said first genomic sequence; and graphically displaying said first alphanumeric string and said second alphanumeric string, such that a graphical display of said second alphanumeric string is a horizontal and vertical mirror image of a graphical display of said first alphanumeric string.
831. A method according to claim 830 and wherein said method also comprises expressing said first alphanumeric string and said second alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute.
832. A method according to claim 831 and wherein said first plurality of nucleotides are represented by at least one first representing attribute, and said second plurality of nucleotides are represented by at least one second representing attribute, said second representing attribute being different from said first representing attribute.
833. A method according to claim 831 and wherein said representation comprises a human sensible representation.
834. A method according to claim 832 and wherein said representation comprises a human sensible representation.
835. A method according to claim 834 and wherein said at least one first representing attribute and said at least one second representing attribute are graphical attributes.
836. A method according to claim 835 and wherein said graphical attributes are shapes.
837. A method according to claim 835 and wherein said graphical attributes are positions.
838. A method according to claim 837 and wherein said positions are vertical positions.
839. A method according to claim 835 and wherein said graphical attributes are orientations.
840. A method according to claim 839 and wherein said orientations are vertical orientations.
841. A method according to claim 835 and wherein said graphical attributes are colors.
842. A method according to claim 834 and wherein using said representation also includes representing each of the following four nucleotides: adenine, thymine, cytosine, and guanine, by a different color.
843. A method according to claim 835 and wherein said human sensible representation comprises one of the following: a shape with a letter and a shape without a letter.
844. A method according to claim 835 and wherein said human sensible representation is produced using a computer font.
845. A method according to claim 844 and wherein said computer font is a TRUETYPE® font.
846. A method according to claim 832 and wherein said first plurality of nucleotides are purine nucleotides, and said second plurality of nucleotides are pyrimidine nucleotides.
847. A method according to claim 832 and wherein said first plurality of nucleotides consists of adenine and thymine nucleotides, and said second plurality of nucleotides consists of guanine and cytosine nucleotides.
848. A method according to claim 832 and wherein said representation also distinguishes a third plurality of nucleotides, sharing in common a third genomic attribute, from a fourth plurality of nucleotides, sharing in common a fourth genomic attribute, said fourth genomic attribute being different from said third genomic attribute.
849. A method according to claim 848 and wherein said third plurality of nucleotides are represented by at least one third representing attribute, and said fourth plurality of nucleotides are represented by at least one fourth representing attribute, said at least one third representing attribute being different from said at least one fourth representing attribute.
850. A method according to claim 848 and wherein said first plurality of nucleotides are purine nucleotides, said second plurality of nucleotides are pyrimidine nucleotides, said third plurality of nucleotides are adenine and thymine nucleotides, and said fourth plurality of nucleotides are guanine and cytosine nucleotides.
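The mirror-image property of claim 830 can be checked with a small sketch. It assumes that "reversed-inversed" means the reverse complement, that purines are drawn in an upper row and pyrimidines in a lower row, and that the glyphs (`o` for adenine/thymine, `#` for guanine/cytosine) look the same after a 180-degree turn. All of these are illustrative assumptions rather than details given in the claims.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Reverse the sequence and swap each base for its complement."""
    return "".join(COMPLEMENT[n] for n in reversed(seq))


def grid(seq):
    """Two text rows: purines (A, G) in the upper row, pyrimidines (C, T) in the lower."""
    def glyph(n):
        return "o" if n in "AT" else "#"

    upper = "".join(glyph(n) if n in "AG" else " " for n in seq)
    lower = "".join(glyph(n) if n in "CT" else " " for n in seq)
    return [upper, lower]


def flip_both(rows):
    """Mirror horizontally (reverse each row) and vertically (swap the rows)."""
    return [row[::-1] for row in reversed(rows)]
```

Under these assumptions, `flip_both(grid(s))` equals `grid(reverse_complement(s))` for any sequence `s` over A, C, G and T: complementing a base moves it to the other row without changing its glyph, and reversing the sequence reverses each row.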
851. A genomic display system comprising: a genomic sequence expressor operative to receive an alphanumeric string representing genomic sequence data, said alphanumeric string comprising a plurality of characters, each of said plurality of characters representing a nucleotide in said genomic sequence, and to express said alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute; and a display operative to receive an output from said expressor and to display said genomic sequence using said representation.
852. A genomic display system according to claim 851 and wherein said genomic sequence expressor is operative to express said alphanumeric string in a manner wherein said first plurality of nucleotides are represented by at least one first representing attribute, and said second plurality of nucleotides are represented by at least one second representing attribute, said second representing attribute being different from said first representing attribute.
853. A genomic display system according to claim 851 and wherein said representation comprises a human sensible representation.
854. A genomic display system according to claim 852 and wherein said representation comprises a human sensible representation.
855. A genomic display system according to claim 854 and wherein said at least one first representing attribute and said at least one second representing attribute are graphical attributes.
856. A genomic display system according to claim 855 and wherein said graphical attributes are shapes.
857. A genomic display system according to claim 855 and wherein said graphical attributes are positions.
858. A genomic display system according to claim 857 and wherein said positions are vertical positions.
859. A genomic display system according to claim 855 and wherein said graphical attributes are orientations.
860. A genomic display system according to claim 859 and wherein said orientations are vertical orientations.
861. A genomic display system according to claim 855 and wherein said graphical attributes are colors.
862. A genomic display system according to claim 854 and wherein using said representation includes representations of each of the following four nucleotides: adenine, thymine, cytosine, and guanine, in a different color.
863. A genomic display system according to claim 855 and wherein said human sensible representation comprises one of the following: a shape with a letter and a shape without a letter.
864. A genomic display system according to claim 855 and wherein said human sensible representation employs a computer font.
865. A genomic display system according to claim 864 and wherein said computer font is a TRUETYPE® font.
866. A genomic display system according to claim 851 and wherein said representation comprises a machine sensible representation.
867. A genomic display system according to claim 852 and wherein said representation comprises a machine sensible representation.
868. A genomic display system according to claim 867 and wherein said at least one first representing attribute and said at least one second representing attribute are machine sensible attributes.
869. A genomic display system according to claim 852 and wherein said first plurality of nucleotides are purine nucleotides, and said second plurality of nucleotides are pyrimidine nucleotides.
870. A genomic display system according to claim 852 and wherein said first plurality of nucleotides consists of adenine and thymine nucleotides, and said second plurality of nucleotides consists of guanine and cytosine nucleotides.
871. A genomic display system according to claim 852 and wherein said genomic sequence expressor is also operative to express said alphanumeric string in a manner wherein a third plurality of nucleotides are represented by at least one third representing attribute, and a fourth plurality of nucleotides are represented by at least one fourth representing attribute, said fourth representing attribute being different from said third representing attribute.
872. A genomic display system according to claim 871 and wherein said third plurality of nucleotides are represented by at least one third representing attribute, and said fourth plurality of nucleotides are represented by at least one fourth representing attribute, said at least one third representing attribute being different from said at least one fourth representing attribute.
873. A genomic display system according to claim 871 and wherein said first plurality of nucleotides are purine nucleotides, said second plurality of nucleotides are pyrimidine nucleotides, said third plurality of nucleotides are adenine and thymine nucleotides, and said fourth plurality of nucleotides are guanine and cytosine nucleotides.
874. A system for graphically displaying genomic sequence information, the system comprising: a genomic sequence expressor, receiving a first alphanumeric string representing a first genomic sequence and a second alphanumeric string representing a second genomic sequence, said second genomic sequence being a reversed-inversed genomic sequence of said first genomic sequence; and expressing said first alphanumeric string and said second alphanumeric string, in a manner wherein a graphical expression of said second alphanumeric string is a horizontal and vertical mirror image of a graphical expression of said first alphanumeric string; and a display operative to receive an output from said genomic sequence expressor and to provide a visually sensible display of said graphical expression of said first alphanumeric string and said graphical expression of said second alphanumeric string.
875. A genomic display system according to claim 874 and wherein: said genomic sequence expressor is also operative to receive an alphanumeric string which represents genomic sequence data, said alphanumeric string comprising a plurality of characters, each of said plurality of characters representing a nucleotide in said genomic sequence, and to express said alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute; and said display is also operative to receive an output from said expressor and to display said genomic sequence using said representation.
876. A genomic display system according to claim 875 and wherein said genomic sequence expressor is operative to express said alphanumeric string in a manner wherein said first plurality of nucleotides are represented by at least one first representing attribute, and said second plurality of nucleotides are represented by at least one second representing attribute, said second representing attribute being different from said first representing attribute.
877. A genomic display system according to claim 875 and wherein said representation comprises a human sensible representation.
878. A genomic display system according to claim 876 and wherein said representation comprises a human sensible representation.
879. A genomic display system according to claim 878 and wherein said at least one first representing attribute and said at least one second representing attribute are graphical attributes.
880. A genomic display system according to claim 879 and wherein said graphical attributes are shapes.
881. A genomic display system according to claim 879 and wherein said graphical attributes are positions.
882. A genomic display system according to claim 881 and wherein said positions are vertical positions.
883. A genomic display system according to claim 879 and wherein said graphical attributes are orientations.
884. A genomic display system according to claim 883 and wherein said orientations are vertical orientations.
885. A genomic display system according to claim 879 and wherein said graphical attributes are colors.
886. A genomic display system according to claim 878 and wherein using said representation also includes representing each of the following four nucleotides: adenine, thymine, cytosine, and guanine, by a different color.
887. A genomic display system according to claim 879 and wherein said human sensible representation comprises one of the following: a shape with a letter and a shape without a letter.
888. A genomic display system according to claim 879 and wherein said human sensible representation is produced using a computer font.
889. A genomic display system according to claim 888 and wherein said computer font is a TRUETYPE® font.
890. A genomic display system according to claim 876 and wherein said first plurality of nucleotides are purine nucleotides, and said second plurality of nucleotides are pyrimidine nucleotides.
891. A genomic display system according to claim 876 and wherein said first plurality of nucleotides consists of adenine and thymine nucleotides, and said second plurality of nucleotides consists of guanine and cytosine nucleotides.
892. A genomic display system according to claim 876 and wherein said representation also distinguishes a third plurality of nucleotides, sharing in common a third genomic attribute, from a fourth plurality of nucleotides, sharing in common a fourth genomic attribute, said fourth genomic attribute being different from said third genomic attribute.
893. A genomic display system according to claim 892 and wherein said third plurality of nucleotides are represented by at least one third representing attribute, and said fourth plurality of nucleotides are represented by at least one fourth representing attribute, said at least one third representing attribute being different from said at least one fourth representing attribute.
894. A genomic display system according to claim 892 and wherein said first plurality of nucleotides are purine nucleotides, said second plurality of nucleotides are pyrimidine nucleotides, said third plurality of nucleotides are adenine and thymine nucleotides, and said fourth plurality of nucleotides are guanine and cytosine nucleotides.
895. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving an alphanumeric string representing genomic sequence data, said alphanumeric string comprising a plurality of characters, each of said characters representing a nucleotide in said genomic sequence; and expressing said alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute.
896. A computer-readable medium according to claim 895 and wherein said first plurality of nucleotides are represented by at least one first representing attribute, and said second plurality of nucleotides are represented by at least one second representing attribute, said second representing attribute being different from said first representing attribute.
897. A computer-readable medium according to claim 895 and wherein said representation comprises a human sensible representation.
898. A computer-readable medium according to claim 896 and wherein said representation comprises a human sensible representation.
899. A computer-readable medium according to claim 898 and wherein said at least one first representing attribute and said at least one second representing attribute are graphical attributes.
900. A computer-readable medium according to claim 899 and wherein said graphical attributes are shapes.
901. A computer-readable medium according to claim 899 and wherein said graphical attributes are positions.
902. A computer-readable medium according to claim 901 and wherein said positions are vertical positions.
903. A computer-readable medium according to claim 899 and wherein said graphical attributes are orientations.
904. A computer-readable medium according to claim 903 and wherein said orientations are vertical orientations.
905. A computer-readable medium according to claim 899 and wherein said graphical attributes are colors.
906. A computer-readable medium according to claim 898 and wherein using said representation also includes representing each of the following four nucleotides: adenine, thymine, cytosine, and guanine, by a different color.
907. A computer-readable medium according to claim 899 and wherein said human sensible representation comprises one of the following: a shape with a letter and a shape without a letter.
908. A computer-readable medium according to claim 899 and wherein said human sensible representation is produced using a computer font.
909. A computer-readable medium according to claim 908 and wherein said computer font is a TRUETYPE® font.
910. A computer-readable medium according to claim 895 and wherein said representation comprises a machine sensible representation.
911. A computer-readable medium according to claim 896 and wherein said representation comprises a machine sensible representation.
912. A computer-readable medium according to claim 911 and wherein said at least one first representing attribute and said at least one second representing attribute are machine sensible attributes.
913. A computer-readable medium according to claim 896 and wherein said first plurality of nucleotides are purine nucleotides, and said second plurality of nucleotides are pyrimidine nucleotides.
914. A computer-readable medium according to claim 896 and wherein said first plurality of nucleotides consists of adenine and thymine nucleotides, and said second plurality of nucleotides consists of guanine and cytosine nucleotides.
915. A computer-readable medium according to claim 896 and wherein said representation also distinguishes a third plurality of nucleotides, sharing in common a third genomic attribute, from a fourth plurality of nucleotides, sharing in common a fourth genomic attribute, said fourth genomic attribute being different from said third genomic attribute.
916. A computer-readable medium according to claim 915 and wherein said third plurality of nucleotides are represented by at least one third representing attribute, and said fourth plurality of nucleotides are represented by at least one fourth representing attribute, said at least one third representing attribute being different from said at least one fourth representing attribute.
917. A computer-readable medium according to claim 915 and wherein said first plurality of nucleotides are purine nucleotides, said second plurality of nucleotides are pyrimidine nucleotides, said third plurality of nucleotides are adenine and thymine nucleotides, and said fourth plurality of nucleotides are guanine and cytosine nucleotides.
918. A computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving a first alphanumeric string representing a first genomic sequence and a second alphanumeric string representing a second genomic sequence, said second genomic sequence being a reversed-inversed genomic sequence of said first genomic sequence; and graphically displaying said first alphanumeric string and said second alphanumeric string, such that a graphical display of said second alphanumeric string is a horizontal and vertical mirror image of a graphical display of said first alphanumeric string.
919. A method according to claim 918 and wherein said method also comprises expressing said first alphanumeric string and said second alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute.
920. A method according to claim 919 and wherein said first plurality of nucleotides are represented by at least one first representing attribute, and said second plurality of nucleotides are represented by at least one second representing attribute, said second representing attribute being different from said first representing attribute.
921. A method according to claim 919 and wherein said representation comprises a human sensible representation.
922. A method according to claim 920 and wherein said representation comprises a human sensible representation.
923. A method according to claim 922 and wherein said at least one first representing attribute and said at least one second representing attribute are graphical attributes.
924. A method according to claim 923 and wherein said graphical attributes are shapes.
925. A method according to claim 923 and wherein said graphical attributes are positions.
926. A method according to claim 925 and wherein said positions are vertical positions.
927. A method according to claim 923 and wherein said graphical attributes are orientations.
928. A method according to claim 927 and wherein said orientations are vertical orientations.
929. A method according to claim 923 and wherein said graphical attributes are colors.
930. A method according to claim 922 and wherein using said representation also includes representing each of the following four nucleotides: adenine, thymine, cytosine, and guanine, by a different color.
931. A method according to claim 923 and wherein said human sensible representation comprises one of the following: a shape with a letter and a shape without a letter.
932. A method according to claim 923 and wherein said human sensible representation is produced using a computer font.
933. A method according to claim 932 and wherein said computer font is a TRUETYPE® font.
934. A method according to claim 920 and wherein said first plurality of nucleotides are purine nucleotides, and said second plurality of nucleotides are pyrimidine nucleotides.
935. A method according to claim 920 and wherein said first plurality of nucleotides consists of adenine and thymine nucleotides, and said second plurality of nucleotides consists of guanine and cytosine nucleotides.
936. A method according to claim 920 and wherein said representation also distinguishes a third plurality of nucleotides, sharing in common a third genomic attribute, from a fourth plurality of nucleotides, sharing in common a fourth genomic attribute, said fourth genomic attribute being different from said third genomic attribute.
937. A method according to claim 936 and wherein said third plurality of nucleotides are represented by at least one third representing attribute, and said fourth plurality of nucleotides are represented by at least one fourth representing attribute, said at least one third representing attribute being different from said at least one fourth representing attribute.
938. A method according to claim 936 and wherein said first plurality of nucleotides are purine nucleotides, said second plurality of nucleotides are pyrimidine nucleotides, said third plurality of nucleotides are adenine and thymine nucleotides, and said fourth plurality of nucleotides are guanine and cytosine nucleotides.
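
The reversed-inversed sequence of claim 918 and the two-attribute representation of claims 919 to 938 can be illustrated with a minimal Python sketch. It is not taken from the specification: the function names (`reverse_complement`, `classify`, `encode`, `mirrored`) are hypothetical, "reversed-inversed" is assumed here to mean the reverse complement, and the purine/pyrimidine attribute is assumed to be the one mapped to a vertical position. Under those assumptions, complementing a nucleotide flips purine to pyrimidine but leaves the adenine/thymine versus guanine/cytosine grouping unchanged, which is one way a display of the second string could come out as a horizontal and vertical mirror image of the first.

```python
from typing import List, Tuple

# Watson-Crick complement table for the four DNA nucleotides.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
PURINES = frozenset("AG")          # adenine, guanine
AT_GROUP = frozenset("AT")         # adenine, thymine

def reverse_complement(seq: str) -> str:
    """Assumed meaning of 'reversed-inversed': reverse the string, then complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def classify(base: str) -> Tuple[str, str]:
    """Return (ring attribute, pairing attribute) for one nucleotide."""
    base = base.upper()
    if base not in COMPLEMENT:
        raise ValueError(f"not a DNA nucleotide: {base!r}")
    ring = "purine" if base in PURINES else "pyrimidine"
    pairing = "AT" if base in AT_GROUP else "GC"
    return ring, pairing

def encode(seq: str) -> List[Tuple[str, str]]:
    """Encode a sequence as a list of two-attribute symbols."""
    return [classify(base) for base in seq]

def mirrored(symbols: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Reverse the order (horizontal mirror) and swap the ring attribute (vertical mirror)."""
    flip = {"purine": "pyrimidine", "pyrimidine": "purine"}
    return [(flip[ring], pairing) for ring, pairing in reversed(symbols)]

if __name__ == "__main__":
    first = "AAGC"
    second = reverse_complement(first)
    print(second)
    # The encoding of the reverse complement equals the mirrored encoding.
    print(encode(second) == mirrored(encode(first)))
```

In this sketch the pairing attribute survives both mirror operations unchanged, so only the ordering and the ring attribute differ between the two displays.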
EP02800693A 2001-10-12 2002-10-10 Integrated system and method for analysis of genomic sequence data Withdrawn EP1442292A2 (en)

Applications Claiming Priority (13)

Application Number Priority Date Filing Date Title
US32911401P 2001-10-12 2001-10-12
US32911501P 2001-10-12 2001-10-12
US32911201P 2001-10-12 2001-10-12
US32911101P 2001-10-12 2001-10-12
US32911001P 2001-10-12 2001-10-12
US329112P 2001-10-12
US976911 2001-10-12
US329111P 2001-10-12
US09/976,911 US20060129330A1 (en) 2001-10-12 2001-10-12 System and method for graphically representing genomic sequence data
US329114P 2001-10-12
US329115P 2001-10-12
US329110P 2001-10-12
PCT/IL2002/000818 WO2003031565A2 (en) 2001-10-12 2002-10-10 Integrated system and method for analysis of genomic sequence data

Publications (1)

Publication Number Publication Date
EP1442292A2 (en) 2004-08-04

Family

ID=27559744

Family Applications (1)

Application Number Title Priority Date Filing Date
EP02800693A Withdrawn EP1442292A2 (en) 2001-10-12 2002-10-10 Integrated system and method for analysis of genomic sequence data

Country Status (3)

Country Link
EP (1) EP1442292A2 (en)
AU (1) AU2002334367A1 (en)
WO (1) WO2003031565A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9355108B2 (en) 2012-11-07 2016-05-31 International Business Machines Corporation Storing data files in a file system

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9519650B2 (en) * 2011-07-06 2016-12-13 President And Fellows Of Harvard College Systems and methods for genetic data compression
GB201604383D0 (en) 2016-03-15 2016-04-27 Genomics Plc A data compression method, a data decompression method, and a data processing apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO03031565A2 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9355108B2 (en) 2012-11-07 2016-05-31 International Business Machines Corporation Storing data files in a file system
US9922041B2 (en) 2012-11-07 2018-03-20 International Business Machines Corporation Storing data files in a file system
US10409777B2 (en) 2012-11-07 2019-09-10 International Business Machines Corporation Storing data in a file system
US11221992B2 (en) 2012-11-07 2022-01-11 International Business Machines Corporation Storing data files in a file system

Also Published As

Publication number Publication date
WO2003031565A3 (en) 2004-03-04
AU2002334367A1 (en) 2003-04-22
WO2003031565A2 (en) 2003-04-17

Similar Documents

Publication Publication Date Title
Kumar et al. Time-series bitmaps: a practical visualization tool for working with large time series databases
US8712977B2 (en) Computer product, information retrieval method, and information retrieval apparatus
Cox et al. Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform
US8595196B2 (en) Computer product, information retrieving apparatus, and information retrieval method
KR101188886B1 (en) System and method for managing genetic information
US8131721B2 (en) Information retrieval method, information retrieval apparatus, and computer product
JP5309570B2 (en) Information retrieval apparatus, information retrieval method, and control program
Friedberg et al. Using an alignment of fragment strings for comparing protein structures
JP6343081B1 (en) Recording medium recording code code classification search software
CN105474214A (en) Text character string search device, text character string search method, and text character string search program
EP1442292A2 (en) Integrated system and method for analysis of genomic sequence data
Deorowicz et al. AGC: Compact representation of assembled genomes
JPH0869476A (en) Retrieval system
US20220171815A1 (en) System and method for generating filters for k-mismatch search
US8571809B2 (en) Apparatus for calculating scores for chains of sequence alignments
Wei et al. A practical tool for visualizing and data mining medical time series
JP5347307B2 (en) Information retrieval apparatus, information retrieval method, and control program
JPH1185794A (en) Retrieval word input device and recording medium recording retrieval word input program
JP4334955B2 (en) Biological information lossless encoder
KR100513266B1 (en) Client/server based workbench system and method for expressed sequence tag analysis
CN114882950A (en) Method for identifying microorganism species and sequences in metagenome sequence based on software
JP7548312B2 (en) Information processing program, information processing method, and information processing device
KR102234896B1 (en) Computer-readable recording medium with data having file structure for storing and searching CHR Genetic variants data
CN117910022B (en) Data searching method, device, computer equipment, storage medium and product
Ursing et al. Exprot-a database for experimentally verified protein functions

Legal Events

Date Code Title Description
PUAI Public reference made under Article 153(3) EPC to a published international application that has entered the European phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20040512

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LI LU MC NL PT SE SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK RO SI

RIN1 Information on inventor provided before grant (corrected)

Inventor name: MOUYAL, YITZHAK

Inventor name: BENTWICH, ISAAC

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 19/00 20060101AFI20060619BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20060802