WO2009001174A1 - System and method for data compression and storage allowing fast retrieval - Google Patents

System and method for data compression and storage allowing fast retrieval

Info

Publication number
WO2009001174A1
Authority
WO
WIPO (PCT)
Prior art keywords
grammar
symbol
string
hand side
strings
Prior art date
Application number
PCT/IB2007/052508
Other languages
English (en)
Inventor
Karlis Freivalds
Paulis Kikusts
Peteris Rucevskis
Original Assignee
Smartimage Solutions, Sia
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smartimage Solutions, Sia filed Critical Smartimage Solutions, Sia
Priority to PCT/IB2007/052508 priority Critical patent/WO2009001174A1/fr
Publication of WO2009001174A1 publication Critical patent/WO2009001174A1/fr

Classifications

    • H - ELECTRICITY
    • H03 - ELECTRONIC CIRCUITRY
    • H03M - CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 - Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24553 - Query execution of query operations
    • G06F16/24561 - Intermediate data storage techniques for performance improvement
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 - Databases characterised by their database models, e.g. relational or object models

Definitions

  • the present invention relates to the field of data compression and database management systems, and more particularly to systems and methods for compressing data, storing it, and retrieving data from the compressed form.
  • data in a database is logically represented as a collection of one or more tables.
  • Each table is composed of a series of rows and columns.
  • Each row in the table represents a collection of related data.
  • Each column in the table represents a particular type of data.
  • each row is composed of a series of data values, one data value from each column.
  • Data compression is a common technique in computer systems to reduce data storage costs.
  • Commonly used compression tools such as zip or gzip compress a given file or collection of files into a smaller representation of the same data.
  • a drawback of such an approach is that access to data is provided at file granularity: the whole file has to be decompressed even if only a small part of that file is actually needed by the user.
  • in databases, access is typically needed at the granularity of individual records.
  • a database system may contain large quantities of small records, and applying compression to each record individually does not yield a significant reduction of data size.
  • an index is a search pattern which ultimately locates the data; its main disadvantage is that it requires space and maintenance.
  • the nature of the indexing technique, as well as the volume of the data held in the files, determines the number of I/O operations that are required to retrieve, insert, delete or modify a given data record.
  • a number of indexing schemes have been developed.
  • One of the most widely used indexing algorithms is the B-tree (Rudolf Bayer, Binary B-Trees for Virtual Memory, ACM-SIGFIDET Workshop 1971, San Diego, California, Session 5B, p. 219-235), in which the keys are kept in a balanced tree structure and the lowest level points at the data itself.
  • another indexing technique uses the trie data structure, which enables fast search whilst avoiding data duplication in indices.
  • the trie index file has the structure of a tree wherein the search is partitioned according to portions of the search key - a digit or a bit.
  • the trie structure affords memory space savings, since the search key is not held as a whole in interim nodes and hence duplication is avoided. At the same time, the memory space savings of the trie data structure are not significant compared to the savings of tried-and-true data compression techniques, for example the RAR or zip compression tools.
  • a method for implementing storage and retrieval of data in a computing system disclosed in WO03096230 by Potapov et al. entitled “Method and Mechanism of Storing and Accessing Data and Improving Performance of Database Query Language Statements” provides compression of stored data by reducing or eliminating duplicated values in a database block.
  • the on-disk data is configured to reference a copy of each duplicated data value stored in a symbol table. Information describing data duplication within a data block is maintained.
  • the disclosed method also provides the ability to search compressed data.
  • however, the method disclosed in WO03096230 has a disadvantage which limits its usefulness.
  • each data block is compressed individually, hence the compression ratio is significantly lower than that achieved by RAR or zip compression tools. Besides that, no index compression is disclosed in said method, hence the overall size reduction is limited.
  • Liozides et al. disclose a method for generating a multilevel compressed index for data retrieval. The index is partitioned into blocks, and care is taken to minimize the number of blocks accessed during a search. This method provides only index compression; no data compression is performed, therefore the overall data size reduction is small. Although it is possible to use index compression methods together with data compression methods, doing so is not efficient, since the data duplication between the compressed index and the compressed data makes such an approach suboptimal.
  • a data compression method comprises: parsing an input string into an irreducible grammar using a trie-type data structure that represents variables of the irreducible grammar; updating said variables of said irreducible grammar based on at least one character to be parsed in the input string; and encoding said irreducible grammar into a string of bits.
  • the method for performing data decompression comprises: decoding the input bit stream based on the irreducible grammar; and updating the irreducible grammar based on at least one character to be parsed in the input string; the decoding and updating steps being performed so as to substantially prevent any bit error in the input bit stream from adversely affecting the operation of the decoding system.
  • the method provides compression of a data stream corresponding to the input file.
  • the disadvantage of this method is that it cannot be applied to databases having a large number of small strings that have to be compressed and decompressed individually. Additionally, this method does not allow fast searching inside the compressed data, i.e. searching that would be faster than reading and processing the whole compressed file; hence this method cannot be applied to databases.
  • a context-free grammar G can be defined as a 4-tuple (Vt, Vn, P, S), where Vt is a finite set of terminals
  • Vn is a finite set of non-terminals
  • P is a finite set of production rules
  • S is an element of Vn, the distinguished starting non-terminal.
  • elements of P are of the form V -> w, where V is a non-terminal symbol and w is a string consisting of terminals and/or non-terminals, or w is an empty string denoted by ε.
  • alternatives S -> a and S -> b are written as S -> a | b.
  • Grammar production rules are used for transforming strings. To generate a string of the language, one begins with a string consisting of only the single start symbol, and then successively applies the rules, any number of times and in any order, to rewrite this string. The rewriting process terminates when the string contains only terminal symbols.
  • the language corresponding to the particular grammar consists of all the strings that can be generated in this manner. Any particular sequence of legal choices taken during this rewriting process yields one particular string in the language.
  • a grammar is called finite if it generates only a finite set of strings. The string generation process may be started from any grammar symbol, not only the starting symbol S, to generate the set of strings represented by that grammar symbol.
  • This example of a finite context-free grammar generates a language consisting of three strings: "abcaa", "acaa", "aabc".
  • grammar-based compression usually deals with only one string, making a context-free grammar whose starting symbol generates this string. It is evident to a person skilled in the art that these algorithms can work on a set of strings, producing a grammar with several start symbols S1 .. Sk, where each start symbol generates one string of the string set.
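  • as an illustration of the string-generation process described above, the following minimal Java sketch expands a small finite context-free grammar into the set of strings it generates. The grammar, class and method names in this sketch are illustrative assumptions (the patent's own example grammar is not reproduced here), and the sketch assumes the grammar is finite, otherwise the expansion would not terminate.

    import java.util.*;

    public class FiniteGrammarExample {
        // Production rules: non-terminal -> list of alternatives, each a list of symbols.
        static final Map<String, List<List<String>>> RULES = new HashMap<>();

        static boolean isNonTerminal(String sym) { return RULES.containsKey(sym); }

        // Expand the symbol sequence from position pos, appending terminals to out
        // and collecting every fully terminal string into result.
        static void generate(List<String> symbols, int pos, StringBuilder out, Set<String> result) {
            if (pos == symbols.size()) { result.add(out.toString()); return; }
            String sym = symbols.get(pos);
            if (!isNonTerminal(sym)) {
                out.append(sym);
                generate(symbols, pos + 1, out, result);
                out.setLength(out.length() - sym.length());           // backtrack
            } else {
                for (List<String> alternative : RULES.get(sym)) {     // try every alternative
                    List<String> rewritten = new ArrayList<>(alternative);
                    rewritten.addAll(symbols.subList(pos + 1, symbols.size()));
                    generate(rewritten, 0, out, result);
                }
            }
        }

        public static void main(String[] args) {
            RULES.put("S", List.of(List.of("A", "B"), List.of("a", "A")));  // S -> AB | aA
            RULES.put("A", List.of(List.of("a"), List.of("a", "b")));       // A -> a | ab
            RULES.put("B", List.of(List.of("c")));                          // B -> c
            Set<String> language = new TreeSet<>();
            generate(new ArrayList<>(List.of("S")), 0, new StringBuilder(), language);
            System.out.println(language);   // prints [aa, aab, abc, ac]
        }
    }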
  • the known methods either do not allow data compression yielding a high compression ratio, or cannot be applied to databases having a large number of small strings that have to be compressed and decompressed individually, or do not allow fast searching inside the compressed data, that is, searching which would be faster than reading and processing the whole compressed file.
  • an effective data compression method allowing fast retrieval is also required for small electronic devices such as mobile phones, PDAs (Personal Digital Assistants), palmtop computers, GPS (Global Positioning System) receivers and car navigation systems, where the amount of available memory is limited. Such devices must hold large amounts of data; for example, a car navigation system has to hold the maps of several countries, which can occupy several gigabytes of storage space. Compressing the maps would decrease the cost of navigation systems or increase the number of maps that can be stored on the same device.
  • the known methods and systems do not achieve compression ratios similar to those of the best archivers while at the same time permitting data searching and retrieval as fast as in current database systems.
  • the present invention discloses a method and system for compressing data, storing it in compressed form with a high compression ratio, and efficiently retrieving the required data directly from the compressed format.
  • the present invention utilizes a specifically developed context-free grammar as a data structure for holding compressed data, together with improved algorithms for data compression and retrieval.
  • the proposed compression method unifies data compression with index compression in a common data structure, resulting in a high compression ratio and fast retrieval.
  • a context-free grammar according to the offered method is used not only for compression but also for fast searching of data.
  • a special kind of grammar, referred to as an LM-grammar, is proposed.
  • the right-hand side of any rule of an LM-grammar is in a special form.
  • This grammar can have multiple starting non-terminals.
  • An LM-grammar G can be defined as a 4-tuple (Vt, Vn, P, S), where:
  • Vt is a finite set of terminals;
  • Vn is a finite set of non-terminals;
  • P is a finite set of production rules;
  • S is a set of one or more starting non-terminals, S ⊆ Vn;
  • the production rules of an LM-grammar satisfy the following properties:
  • (a) every production rule of the context-free grammar is of the form A -> ε or A -> BC, where B is a grammar symbol that represents exactly one non-empty string and C denotes any (possibly empty) sequence of grammar symbols; and
  • (b) for every two alternatives with the same left-hand side and both non-empty right-hand sides, the string represented by the leftmost grammar symbol of the right-hand side of one alternative is not a prefix of the string represented by the leftmost grammar symbol of the right-hand side of the other alternative; i.e. for every two alternatives with the same left-hand side, A -> BC and A -> DE, the string represented by grammar symbol B is not a prefix of the string represented by grammar symbol D, and the string represented by grammar symbol D is not a prefix of the string represented by grammar symbol B; C and E denote any (possibly empty) sequences of grammar symbols.
  • Properties (a) and (b) provide a way for fast searching.
  • the search process is recursive traversal of the grammar and these properties allow traversing only the relevant parts of the grammar.
  • Property (a) ensures that the string represented by the leftmost symbol of the right-hand side of any rule is unique.
  • Property (b) ensures the ability to distinguish between alternatives for finding the relevant ones by looking at the strings represented by the first grammar symbols of the right-hand sides.
  • in one embodiment, property (b) is replaced by property (b1), where for every two alternatives with the same left-hand side and both non-empty right-hand sides, the first symbol of the string represented by the leftmost grammar symbol of the right-hand side of one alternative is different from the first symbol of the string represented by the leftmost grammar symbol of the right-hand side of the other alternative, i.e. for every two alternatives with the same left-hand side, A -> BC and A -> DE, the first symbol of the string represented by grammar symbol B differs from the first symbol of the string represented by grammar symbol D; C and E denote any (possibly empty) sequences of grammar symbols.
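  • under an assumed rule representation (each non-terminal mapped to its alternatives, each alternative a list of symbols, terminals being one-character strings that are not rule keys), property (b1) can be checked with the following Java sketch. The helper follows the leftmost symbol down the grammar until a terminal is reached, relying on property (a), by which the leftmost symbol of a non-empty right-hand side represents exactly one non-empty string. This is an illustrative sketch, not the patent's implementation.

    import java.util.*;

    public class LmGrammarCheck {

        // First terminal character of the (single) string represented by a grammar symbol.
        static char firstTerminal(String symbol, Map<String, List<List<String>>> rules) {
            while (rules.containsKey(symbol)) {
                List<String> alt = rules.get(symbol).stream()
                        .filter(a -> !a.isEmpty()).findFirst().orElseThrow();
                symbol = alt.get(0);          // follow the leftmost symbol
            }
            return symbol.charAt(0);
        }

        // Property (b1): non-empty alternatives of one rule start with different terminals.
        static boolean satisfiesB1(String lhs, Map<String, List<List<String>>> rules) {
            Set<Character> seen = new HashSet<>();
            for (List<String> alternative : rules.get(lhs)) {
                if (alternative.isEmpty()) continue;              // A -> empty string is allowed
                char c = firstTerminal(alternative.get(0), rules);
                if (!seen.add(c)) return false;                   // two alternatives start alike
            }
            return true;
        }
    }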
  • LM-grammar is used as a primary mechanism for storing compressed data.
  • Each starting non-terminal of an LM-grammar corresponds to one database index.
  • Each starting non-terminal generates a set of strings which corresponds to the rows of the database table. If the system comprises several indices, each database row can be obtained through its corresponding string from every index.
  • Such storage organization allows searching by multiple criteria. At the same time, storage space overhead is negligible since the generation process can use common grammar rules.
  • Vt = {a, b, c}
  • Vn = {S1, S2, A, B, C, D, E}
  • starting non-terminals: {S1, S2}. Production rules:
  • the PATRICIA trie data structure is described in D. R. Morrison, "PATRICIA - Practical Algorithm To Retrieve Information Coded In Alphanumeric", Journal of the ACM, 15(4), pp. 514-534, Oct. 1968.
  • for a given set of strings, the PATRICIA trie has one node for every maximal common prefix. Leaves of the PATRICIA trie correspond to the input strings. Edges are labelled with strings such that the path from the root to any node spells out the prefix associated with that node.
  • the PATRICIA trie is used as an intermediate step for generating the prefixes of the set of strings in the LM-grammar creation process.
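  • a minimal Java sketch of the kind of compressed trie described above (one node per maximal common prefix, edges labelled with substrings) is shown below. It is an illustrative radix-trie insertion routine under assumed class and method names, not the patent's implementation.

    import java.util.*;

    public class RadixTrie {
        static class Node {
            Map<String, Node> edges = new TreeMap<>();   // edge label -> child node
            boolean isLeaf;                              // marks the end of an inserted string
        }
        final Node root = new Node();

        void insert(String s) { insert(root, s); }

        private void insert(Node node, String rest) {
            if (rest.isEmpty()) { node.isLeaf = true; return; }
            for (Map.Entry<String, Node> e : node.edges.entrySet()) {
                String label = e.getKey();
                int k = commonPrefixLength(label, rest);
                if (k == 0) continue;                    // no shared prefix with this edge
                if (k == label.length()) {               // whole edge label matches: descend
                    insert(e.getValue(), rest.substring(k));
                    return;
                }
                Node mid = new Node();                   // split the edge at the common prefix
                mid.edges.put(label.substring(k), e.getValue());
                node.edges.remove(label);
                node.edges.put(label.substring(0, k), mid);
                insert(mid, rest.substring(k));
                return;
            }
            Node leaf = new Node();                      // no edge shares a prefix: new edge
            leaf.isLeaf = true;
            node.edges.put(rest, leaf);
        }

        private static int commonPrefixLength(String a, String b) {
            int i = 0;
            while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
            return i;
        }

        public static void main(String[] args) {
            RadixTrie t = new RadixTrie();
            for (String s : new String[] { "abb", "abc", "caa" }) t.insert(s);
            // "abb" and "abc" now share a single edge labelled "ab" leading to a split node
        }
    }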
  • the present invention may be embodied on a computer-based data storage system comprising at least a microprocessor and memory.
  • the said memory can be random access memory, a hard or fixed disk, flash memory, CD-ROM, internal memory of the microprocessor, or any other device capable of holding digital data.
  • a typical system optionally comprises also an input-output controller, a keyboard, display, a pointing device.
  • the said memory is holding a digital representation of the LM-grammar.
  • the microprocessor executes instructions for creating LM-grammar or encoding LM-grammar into a stream of bits forming the compressed file or searching by using LM-grammar and retrieving of data matching a given query.
  • the present invention can also be embodied in the form of instructions for a microprocessor, or program code in some programming language, on a machine-readable storage medium such as floppy diskettes, flash memory, CD-ROM or a hard or fixed disk, wherein, when loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
  • the instructions or program code segments combine with the processor to provide a unique device for performing the disclosed methods that operates analogously to specific logic circuits.
  • Fig. 1 illustrates a database table in conventional non-compressed form, one or more indices, overview of the method of data and index compression, yielding compressed file;
  • Fig. 2 is a flowchart of the process of creation of the sets of strings
  • Fig. 3 is a flowchart of one embodiment of the method for converting a data row into a string
  • Fig. 4 is a flowchart of one embodiment of the method for building the LM-grammar from the set of strings
  • Fig. 5 illustrates an example of tries for each set of strings coming from one index, generated from the fields of the database table shown on Fig. 1.
  • Fig. 6 is a flowchart of the process of encoding of LM-grammar into compressed file;
  • Fig. 7 is a flowchart of one embodiment of the method for writing a grammar rule to file
  • Fig. 8 is a block diagram (overview) of the retrieval process
  • Fig. 9 is a flowchart of the top level process of retrieval according to an embodiment of the present invention
  • Fig. 10 is a flowchart of a recursive search process which generates only the strings that match the query;
  • Fig. 11 is a flowchart of the process of testing whether the given partial string can match the query criteria
  • Fig. 12 is a flowchart of the steps that are performed when the compressed file is opened for reading
  • Fig 13 is a flowchart of the process of reading the rule referenced by its identifier.
  • Fig. 1 illustrates a database table 101 in conventional non-compressed form, one or more indices 102, method of data and index compression (blocks 105-107), yielding a compressed file 103.
  • Database table 101 consists of data rows, each row having one or more fields.
  • This example database table 101 has three columns; fields in column 1 and column 2 are of string type and field in column 3 is of integer type.
  • each index 102 is an ordered list of table column names.
  • This example database table 101 is indexed by two indices - the first by Column 1, the second by Column 2.
  • a set of strings is created at block 104. The process of creation of a set of strings is illustrated in Fig. 2.
  • index information and table column information is written to the compressed file 103.
  • the table 101 column information contains the number of columns, the column names and data types.
  • Index information contains the number of indices and the ordered list of column names for each index 102.
  • a loop is carried out over each row and each index, and each row is converted into a string corresponding to each of the indices, as described in Fig. 3.
  • the process of converting a row into a string for a given index starts at steps 202-203 where each field of the row is converted into a string, depending on the type of the field.
  • a string type field is represented as string of bytes in ASCII, utf-8, utf-16 or any other encoding.
  • a 32-bit integer is represented as a fixed-length string of 4 bytes, highest byte first.
  • a 64-bit integer is stored as a string of 8 bytes, highest byte first.
  • Positive double values are represented as their raw bits in IEEE 754 floating-point format with the highest bit inverted.
  • Negative double values are represented as all their raw bits in IEEE 754 floating-point format inverted. Such storage of double values ensures that the lexicographic ordering of the resulting strings is the same as the ordering of the double values.
  • the next step 204 is concatenation of the individual field strings to form the row string, in the order as specified in the index, and a special delimiter symbol is inserted between the fields to delimit them.
  • the delimiter is a character that does not appear elsewhere in the data.
  • if the index does not contain all fields of the row, the missing fields are appended at the end in arbitrary order. If some field is of fixed length, then the field delimiter after this field can be omitted. For example, two sets of strings are generated from the fields of database table 101: for index 1 - {"abb#ca#005", "caa#bbb#017", "abc#bab#100"} and for index 2 - {"ca#abb#005", "bbb#caa#017", "bab#abc#100"}.
  • the field delimiter is denoted by #. For simplicity of presentation it is assumed that integers are represented in decimal notation instead of bytes, and that the maximum length of a decimal number is 3 digits.
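  • a Java sketch of the field conversions described above is given below; the method names are illustrative assumptions. The double conversion inverts the sign bit of positive values and all bits of negative values so that, as stated above, byte-wise lexicographic order of the encoded field matches the numeric order of the double values.

    import java.nio.charset.StandardCharsets;

    public class FieldEncoding {
        // 32-bit integer as a fixed-length string of 4 bytes, highest byte first.
        static byte[] encodeInt32(int v) {
            return new byte[] { (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v };
        }

        // 64-bit integer as a string of 8 bytes, highest byte first.
        static byte[] encodeInt64(long v) {
            byte[] b = new byte[8];
            for (int i = 0; i < 8; i++) b[i] = (byte) (v >>> (56 - 8 * i));
            return b;
        }

        // Positive doubles: raw IEEE 754 bits with the highest (sign) bit inverted;
        // negative doubles: all raw bits inverted.
        static byte[] encodeDouble(double d) {
            long bits = Double.doubleToLongBits(d);
            bits = (bits >= 0) ? (bits ^ Long.MIN_VALUE) : ~bits;
            return encodeInt64(bits);
        }

        // String fields as a string of bytes in some encoding, here UTF-8.
        static byte[] encodeString(String s) {
            return s.getBytes(StandardCharsets.UTF_8);
        }
    }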
  • the process transfers to block 105 (Fig. 1) where the set of strings is represented as a context-free LM-grammar 106.
  • the context-free LM-grammar 106 is encoded into a stream of bits (block 107), yielding a compressed file 103.
  • Fig. 4 illustrates a flowchart of one embodiment of the method for building the LM- grammar 106 from the set of created strings.
  • the process of forming an LM-grammar starts at block 401 by forming a separate PATRICIA trie for each collection of strings coming from the same index. Tries for each set of strings coming from one index and generated from the fields of the database table 101 shown on Fig. 1 are illustrated in Fig. 5.
  • the next step 402 of building the LM-grammar is the creation of a collection of unique strings containing the strings from all the labels of the edges of all the tries. If several edges have the same label, only one string is formed for all of them.
  • the following collection of unique strings is created from the trie edge labels shown in Fig. 5.
  • a common context-free grammar having grammar symbols U1..Uk representing the given unique strings is formed.
  • this grammar should be as short as possible.
  • Any of the known methods for performing data compression using a context-free grammar can be used for this purpose. The method described in C.G. Nevill-Manning, I.H. Witten and D.L. Maulsby, "Compression by induction of hierarchical grammars", in J.A. Storer and M. Cohn (editors), Proc. IEEE Data Compression Conference, pages 244-253, Snowbird, Utah, March 1994, IEEE Computer Society Press, Los Alamitos, California, is used in the preferred embodiment.
  • a possible common context-free grammar of these strings with starting non-terminals U1 .. Uk is as follows: U1 -> ab
  • the created tries are combined with the generated common context-free grammar to form the LM-grammar.
  • tries are processed one by one; a non-terminal symbol is formed for each non-leaf trie node.
  • a grammar rule of the form R -> AB is created for each trie edge, having the non-terminal symbol R corresponding to the source node of this edge as the left-hand side of the rule.
  • the right-hand side of the rule is the concatenation of the string of grammar symbols denoted by A and the non-terminal symbol B corresponding to the target node of this edge, where string A is either the right-hand side of the grammar rule Ui -> A of the common grammar corresponding to the label of the trie edge, if grammar symbol Ui does not occur anywhere in the right-hand side of any other rule, or A is the grammar symbol Ui itself, if Ui occurs in the right-hand side of some other grammar rule.
  • the target non-terminal B in the created rule is omitted, if the target node of the edge is a leaf in the trie.
  • the LM-grammar includes all the referenced rules of the common grammar and the newly generated rules.
  • the starting non-terminals of the LM-grammar correspond to the root nodes of the tries.
  • the obtained grammar for the database table 101 illustrated in Fig. 1 is the following LM-grammar with starting symbols S1 and S2: A -> MFH | cDaMI
  • the next step 107 of the process of data compression is illustrated in Fig. 6 in greater detail.
  • the grammar rules are divided into packets of a specified size at block 501. All the alternatives with the same left-hand side are considered as one grammar rule when storing LM-grammar into the compressed file 103. In this way each grammar rule is uniquely identified by its left-hand side grammar symbol.
  • the rules are sorted and divided into contiguous chunks according to that sorting. There are several alternative ways of sorting the rules for achieving different goals. In the preferred embodiment, to achieve the most efficient coding, rules are sorted by the reference counts. The reference count is the number of times the left-hand side symbol of this rule appears in the right-hand side of other grammar rules. Then, the required frequency model can be built and encoded efficiently.
  • alternatively, rules can be sorted topologically in the underlying reference graph or by distance from the root, so that commonly accessed rules end up together in one packet or in a few packets.
  • the goal for dividing rules into packets is to be able to read from file and decode grammar rules from each packet independently, based on the query to be performed.
  • the size of each packet is selected to balance the read performance with overhead of storing utility information for each packet.
  • the preferred size of a packet is 1 to 8 kilobytes; a size of 4 kilobytes is used in the preferred embodiment.
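  • the reference-count ordering described above can be sketched in Java as follows (the rule representation and method name are assumptions of this sketch): the reference count of a rule is the number of times its left-hand-side symbol occurs in the right-hand sides of the grammar rules, and rules are sorted by decreasing count before being divided into packets.

    import java.util.*;

    public class RuleOrdering {
        // rules: left-hand-side non-terminal -> alternatives (each a list of symbols)
        static List<String> sortByReferenceCount(Map<String, List<List<String>>> rules) {
            Map<String, Integer> refCount = new HashMap<>();
            for (String lhs : rules.keySet()) refCount.put(lhs, 0);
            for (List<List<String>> alternatives : rules.values())
                for (List<String> alternative : alternatives)
                    for (String symbol : alternative)
                        refCount.computeIfPresent(symbol, (k, v) -> v + 1);  // count uses of non-terminals
            List<String> order = new ArrayList<>(rules.keySet());
            order.sort((a, b) -> Integer.compare(refCount.get(b), refCount.get(a)));  // most referenced first
            return order;
        }
    }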
  • the next step 502 of the compression process is forming the encoding frequency model from the rule reference counts and encoding the model and writing it to the compressed file 103.
  • the model is an array of frequencies of occurrence of each grammar rule in the sorted order.
  • the frequencies can be represented exactly and written into the file as 4-byte integer numbers, or with a variable-length integer code such as a Golomb code, Elias delta code, Elias gamma code, Elias omega code, Fibonacci code or Rice code, preferably the Elias delta code.
  • alternatively, frequencies are represented approximately by quantization into a fixed number of bits.
  • frequencies can also be encoded with a wavelet transform (Charles K. …).
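  • as an illustration of one of the variable-length integer codes mentioned above, the Elias delta code can be sketched in Java as follows; a simple list of booleans stands in for a real bit stream, and this is an illustrative sketch rather than the patent's encoder.

    import java.util.*;

    public class EliasDelta {
        // Encode a positive integer n (n >= 1) as an Elias delta code.
        static void encode(long n, List<Boolean> out) {
            int len = 64 - Long.numberOfLeadingZeros(n);              // number of bits of n
            int lenOfLen = 32 - Integer.numberOfLeadingZeros(len);    // number of bits of len
            for (int i = 0; i < lenOfLen - 1; i++) out.add(false);    // unary prefix
            for (int i = lenOfLen - 1; i >= 0; i--) out.add(((len >> i) & 1) == 1);  // len in binary
            for (int i = len - 2; i >= 0; i--) out.add(((n >> i) & 1) == 1);         // n without its leading 1
        }

        static long decode(Iterator<Boolean> in) {
            int zeros = 0;
            while (!in.next()) zeros++;                               // length of the unary prefix
            int len = 1;
            for (int i = 0; i < zeros; i++) len = (len << 1) | (in.next() ? 1 : 0);
            long n = 1;
            for (int i = 0; i < len - 1; i++) n = (n << 1) | (in.next() ? 1 : 0);
            return n;
        }

        public static void main(String[] args) {
            List<Boolean> bits = new ArrayList<>();
            for (long v : new long[] { 1, 4, 17, 1000 }) encode(v, bits);
            Iterator<Boolean> it = bits.iterator();
            for (int i = 0; i < 4; i++) System.out.print(decode(it) + " ");  // 1 4 17 1000
        }
    }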
  • all the grammar rules of each packet are encoded into the compressed file 103 according to the process illustrated in Fig. 7. All alternatives of a given rule are encoded into the compressed file 103.
  • a pointer to the current alternative is allocated and set to the first alternative of the rule at block 604.
  • a pointer to the current symbol is allocated and set to the first symbol of the current alternative.
  • the current symbol is encoded and emitted to file at block 606.
  • encoding is done using arithmetic coding (I.H. Witten, R. Neal and J.G. Cleary, "Arithmetic Coding for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540, 1987).
  • alternatively, grammar symbols can be encoded using a fixed number of bits, or using Huffman coding (D. A. Huffman, "A method for the construction of minimum-redundancy codes", Proceedings of the I.R.E., Sept. 1952, pp. 1098-1102), or variable-length integer codes such as the Golomb code, Elias code, Fibonacci code or Rice code.
  • a special symbol NEXT_ALTERNATIVE is written to the compressed file 103 at block 609, indicating that the current alternative has ended and the corresponding decoder should switch to the next alternative.
  • the next step 610 is testing whether the current alternative is the last one of the rule. If it is not, the next alternative is set to current at block 611, and the process is transferred to block 605.
  • a special symbol NEXT_RULE is written to compressed file 103 at block 612 indicating that the current rule has ended and the corresponding decoder should switch to the next rule of the packet.
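  • a Java sketch of the rule-serialisation loop of Fig. 7, under one plausible reading of the flowchart: the symbols of every alternative are emitted through a symbol encoder (standing in for the arithmetic, Huffman or variable-length coder), alternatives are separated by NEXT_ALTERNATIVE, and the rule is terminated by NEXT_RULE. The SymbolEncoder interface and the sentinel identifiers are assumptions of this sketch.

    import java.util.List;

    public class RuleWriter {
        interface SymbolEncoder {
            void emit(int symbolId);                  // encode one grammar symbol or sentinel
        }

        static final int NEXT_ALTERNATIVE = -1;       // sentinel identifiers are illustrative
        static final int NEXT_RULE = -2;

        // Write all alternatives of one rule; the decoder of Fig. 13 mirrors these sentinels.
        static void writeRule(List<List<Integer>> alternatives, SymbolEncoder enc) {
            for (int a = 0; a < alternatives.size(); a++) {
                for (int symbol : alternatives.get(a)) enc.emit(symbol);
                if (a < alternatives.size() - 1) enc.emit(NEXT_ALTERNATIVE);  // more alternatives follow
            }
            enc.emit(NEXT_RULE);                                              // end of this rule
        }
    }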
  • the next step 704 of the process of encoding (Fig. 6) is forming a list of packet offsets.
  • the list consists of positions of each packet in compressed file 103.
  • the next step 705 is encoding this list of packet offsets into the compressed file 103. Each offset is written using a fixed number of bits, or with a variable-length integer code such as a Golomb code, Elias code, Fibonacci code or Rice code. Preferably an Elias code is used to keep the offset list compact.
  • the next step 706 is forming the list of packet rule counts.
  • the list consists of integer numbers representing the number of grammar rules in each packet.
  • the next step 707 is encoding this list of packet rule counts into the compressed file 103. Each integer is written using a fixed number of bits, or with a variable-length integer code. Preferably an Elias code is used.
  • Fig. 8 illustrates the overview of the data retrieval process.
  • the retrieval process 802 retrieves and decompresses the data 803 matching the query.
  • the search query consists of criteria specified for each field of the database table and is written in some query language such as the Structured Query Language (SQL). Only queries which have separate criteria on each field combined with AND are described here, but it is obvious to a person skilled in the art that the present invention can be applied to more complicated queries as well.
  • the top level process of retrieval is illustrated in Fig. 9.
  • an appropriate index is selected which allows the query to be executed in the fastest way. More than one known technique for selecting the best index may be adopted for realization of the search process in the offered method. As an example, the index which starts with the field having the strictest criterion can be selected. The correctness of the method does not depend on the selected index, only performance is affected.
  • the starting grammar symbol of the LM-grammar is selected which corresponds to the selected index.
  • the next step 902 is the creation of a list, resultList, for holding the strings that match the query criteria. Initially the list is empty. Then, at block 903 the search process described in Fig. 10 is performed, passing as a parameter a string consisting of the selected starting grammar symbol.
  • the resultList contains strings of the rows matching the query.
  • a loop is carried out over all strings. For that purpose, at block 904 a pointer to the current string is allocated and set to the first string in resultList.
  • the next step 905 is decoding the string by using the inverse of the encoding transformation performed as indicated in Fig. 3 yielding the original data rows. The decoded rows are reported to the user.
  • the next step 906 is checking if the current string is the last in the resultList. If yes, the process terminates. If no, the next string in resultList is set to current at step 907 and the process is transferred to block 905.
  • Fig. 10 illustrates a recursive search process which generates only the strings that match the query.
  • the process takes a parameter partial_string (block 1001) which initially holds one symbol that is one of the starting symbols of the LM-grammar. Partial_string is modified during the process to hold the current partially decoded string.
  • the longest prefix of partial_string containing only terminal symbols is identified, starting at the beginning of the string and ending before the first non-terminal symbol.
  • the whole partial_string is taken as a prefix, if it has no non-terminals.
  • the next step 1003 of the process is, based on the identified prefix, testing whether the current partial_string can match the query criteria. That is illustrated in Fig. 11.
  • a test 1004 is carried out on whether the partial_string contains non-terminal symbols. If not, a matching row is found, partial_string is added to the resultList at block 1005, and the process terminates. If the partial_string contains non-terminal symbols, the first non-terminal symbol, denoted by R, in partial_string is identified at block 1006. Next, in block 1007, a pointer to the current alternative is allocated and set to the first alternative of the rule with the left-hand side equal to R.
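  • the recursive search of Fig. 10 can be sketched in Java as follows; the grammar representation, the Query interface and the method names are simplifying assumptions. The longest terminal prefix of the partial string is tested against the query; a failing branch is pruned, a fully terminal string is added to resultList, and otherwise the first non-terminal is rewritten by every alternative of its rule and the search continues recursively.

    import java.util.*;

    public class GrammarSearch {
        interface Query { boolean prefixCanMatch(String terminalPrefix); }

        final Map<String, List<List<String>>> rules;     // non-terminal -> alternatives
        final List<String> resultList = new ArrayList<>();

        GrammarSearch(Map<String, List<List<String>>> rules) { this.rules = rules; }

        boolean isNonTerminal(String s) { return rules.containsKey(s); }

        void search(List<String> partialString, Query query) {
            StringBuilder prefix = new StringBuilder();  // longest prefix of terminal symbols
            int firstNonTerminal = 0;
            while (firstNonTerminal < partialString.size()
                    && !isNonTerminal(partialString.get(firstNonTerminal)))
                prefix.append(partialString.get(firstNonTerminal++));

            if (!query.prefixCanMatch(prefix.toString())) return;   // prune: cannot match the query
            if (firstNonTerminal == partialString.size()) {         // fully decoded: a matching row
                resultList.add(prefix.toString());
                return;
            }
            String r = partialString.get(firstNonTerminal);         // first non-terminal symbol R
            for (List<String> alternative : rules.get(r)) {         // rewrite R by every alternative
                List<String> rewritten = new ArrayList<>(partialString.subList(0, firstNonTerminal));
                rewritten.addAll(alternative);
                rewritten.addAll(partialString.subList(firstNonTerminal + 1, partialString.size()));
                search(rewritten, query);
            }
        }
    }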
  • the process of testing of the prefix of the partial string on whether it can satisfy the query criteria will now be described (Fig. 11).
  • the parameter prefix is taken.
  • the prefix is divided into fields that are between field delimiters. There are some fields that are fully decoded, and there is a prefix of the last decoded field.
  • the process is transferred to block 1103 where a pointer to the current field is allocated and set to the first decoded field.
  • a check is performed on whether the current field is completely decoded. If yes, the current field is tested if it satisfies the query criteria at block 1105. If criteria are not satisfied, the process ends with result FALSE at block 1106.
  • at step 1107 a test is carried out to check whether the current field is the last one. If it is the last, the process ends with result TRUE at block 1108. If the current field is not the last one, step 1109 is taken, where the next field is set to current and the process returns to step 1104. If the test at step 1104 returns No, the process is transferred to block 1110, where the interval of possible values that the field can take in order to have the given prefix is derived from the partially decoded field. Next, at step 1111, the derived interval is checked for intersection with the interval defined by the query criteria. If there is no intersection, the process terminates with result FALSE (block 1106).
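  • for string-valued fields and range criteria, the prefix test of Fig. 11 can be sketched in Java as follows; the per-field [low, high] representation of the query criteria is an assumption of this sketch. Fully decoded fields must lie inside their query interval, while for the partially decoded last field it is only checked that the set of values starting with that prefix can intersect the query interval.

    import java.util.regex.Pattern;

    public class PrefixTest {
        // prefix: partially decoded row string; low/high: inclusive per-field query bounds.
        static boolean prefixCanMatch(String prefix, char delimiter, String[] low, String[] high) {
            String[] parts = prefix.split(Pattern.quote(String.valueOf(delimiter)), -1);
            for (int i = 0; i < parts.length && i < low.length; i++) {
                boolean fullyDecoded = i < parts.length - 1;     // the last part is only a prefix
                if (fullyDecoded) {
                    if (parts[i].compareTo(low[i]) < 0 || parts[i].compareTo(high[i]) > 0)
                        return false;                            // decoded field outside the interval
                } else {
                    // values of this field all start with parts[i]; reject only if the whole
                    // query interval lies strictly before or strictly after that prefix range
                    boolean queryBelowPrefix = high[i].compareTo(parts[i]) < 0;
                    boolean queryAbovePrefix = low[i].compareTo(parts[i]) > 0
                            && !low[i].startsWith(parts[i]);
                    if (queryBelowPrefix || queryAbovePrefix) return false;
                }
            }
            return true;                                         // the partial string can still match
        }
    }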
  • FIG. 12 shows the steps that are performed when the compressed file 103 is opened for reading.
  • the process starts by reading the table column information from the compressed file 103 at block 1201.
  • Column information includes the number of columns and the column names.
  • the next step 1202 is reading the index information.
  • the index information includes the number of indices, the number of fields in each index and the ordered list of table column names for each index.
  • the arithmetic coding frequency model is read from the compressed file 103 at block 1203.
  • the frequency model reading step is omitted, if rules are not encoded by arithmetic or Huffman coding.
  • the next step 1204 is reading the packet information from the compressed file 103.
  • the packet information contains the number of packets, the list of packet offsets and the list of number of rules in each packet.
  • the process of opening the compressed file 103 and reading this information from the disk is performed only once. It is not time consuming, since the size of the read information is relatively small compared to the total size of the compressed file 103. After opening the compressed file 103, an arbitrary number of queries can be executed. To be able to execute the processes illustrated in Fig. 10 and Fig. 11 efficiently, the grammar rules must be accessible in any order. For this purpose, the division into packets is exploited, which allows any given rule to be decompressed with only a small overhead. Rules are identified by their numbers in the sorting obtained at block 501 (Fig. 6).
  • Each rule is loaded from compressed file 103 into main memory on demand as soon as its right-hand side is accessed.
  • Fig 13 illustrates the process of reading the rule referenced by its identifier.
  • at block 1301 it is tested whether the rule with the given identifier is already in memory; if it is, the rule is returned at block 1302 and the process ends. If not, the process is transferred to block 1303, where the packet in which the rule is located is identified by using binary search in the list of offsets.
  • the entire packet in which the rule is located is read from the disk and all rules of the packet are decoded.
  • the starting offset in the compressed file 103 and number of rules that have to be decoded are determined from the lists stored at compression steps 704 and 706.
  • the rule decoding process starts by allocation of a new empty rule in memory at block 1304.
  • the next step 1305 is creating an empty alternative of the rule.
  • a symbol is input from the file using an arithmetic decoder (block 1306), given the frequency model that was read at block 1203. If some other encoding method was used in encoding step 606, decoding is done using the appropriate decoder corresponding to the encoder.
  • the input symbol is tested to determine whether it is the special NEXT_RULE symbol. If the symbol is the NEXT_RULE symbol, the rule is completely decoded and the process is transferred to block 1308, where a test is carried out on whether all rules of the packet are decoded.
  • if not all rules are decoded, the process is transferred to block 1304 for decoding of the next rule. If the test in block 1307 (whether the symbol is the NEXT_RULE symbol) returns No, the process is transferred to block 1310, where a test is carried out on whether the symbol is the NEXT_ALTERNATIVE symbol. If yes, the current alternative is finished and decoding of the next alternative of the current rule is started by transferring to block 1305. If no, the symbol is recognized as an identifier of some grammar symbol (either terminal or non-terminal), the grammar symbol is appended to the current alternative of the current rule at block 1311, and the process is transferred to block 1306.
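  • on-demand rule loading (Fig. 13) can be sketched in Java as follows: the packet containing the requested rule is located by binary search over the packet tables, the whole packet is decoded, and its rules are cached so that later accesses are plain map look-ups. The field names, the Rule shape and the abstract decodePacket method are assumptions of this sketch, not the patent's exact structures.

    import java.util.*;

    public abstract class RuleStore {
        static class Rule { int id; List<List<Integer>> alternatives; }

        long[] packetOffsets;        // file position of each packet (list written at step 705)
        int[] firstRuleOfPacket;     // identifier of the first rule of each packet (derived from the counts of step 706)
        final Map<Integer, Rule> cache = new HashMap<>();

        // Decode every rule of the packet that starts at the given file offset.
        protected abstract List<Rule> decodePacket(long fileOffset, int packetIndex);

        Rule getRule(int ruleId) {
            Rule r = cache.get(ruleId);
            if (r != null) return r;                               // already decoded and in memory
            int p = findPacket(ruleId);                            // locate the packet by binary search
            for (Rule decoded : decodePacket(packetOffsets[p], p))
                cache.put(decoded.id, decoded);                    // cache the whole packet
            return cache.get(ruleId);
        }

        private int findPacket(int ruleId) {
            int lo = 0, hi = firstRuleOfPacket.length - 1;
            while (lo < hi) {                                      // last packet whose first rule id <= ruleId
                int mid = (lo + hi + 1) >>> 1;
                if (firstRuleOfPacket[mid] <= ruleId) lo = mid; else hi = mid - 1;
            }
            return lo;
        }
    }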
  • the first index is selected at block 901 (Fig. 9) and its starting non-terminal S1 is selected.
  • a partial string "S1" is created.
  • the first alternative is used first, obtaining a partial string "U1A".
  • this partial string is tested to determine whether it matches the query criteria.
  • the prefix up to the first non-terminal is empty; there is one partially decoded field with an infinitely wide interval that intersects the query interval, and it is reported that it matches the query.
  • the next step is rewriting the first non-terminal symbol U1 in the partial string "U1A".
  • the partial string "abA" is obtained.
  • Data storage system for storing data in compressed form and fast retrieval
  • the offered data storage system for storing data in compressed form and fast data retrieval comprises at least a microprocessor and memory.
  • the memory holds data in the form of the context-free LM-grammar, according to properties (a) and (b) or (b1), as described earlier.
  • an LM-grammar according to property (b1) is used in the preferred embodiment.
  • the microprocessor is executing instructions that cause the machine to perform the steps of processing a set of strings to represent them as a context-free LM- grammar and performing encoding of the context-free LM-grammar into compressed file 103.
  • the microprocessor is executing instructions that cause the machine to perform the steps of forming a PATRICIA trie for each set of strings coming from one index (block 401), creating a collection of unique strings from all the labels of the edges of all the formed PATRICIA tries (block 402), creating a common context-free grammar representing those unique strings (block 403), and combining the formed PATRICIA tries with created common context-free grammar to form an LM-grammar (block 404).
  • the microprocessor is executing instructions that cause the machine to perform the steps of dividing grammar rules into packets of a specified size (block 501), encoding each packet of grammar rules into the compressed file 103 (block 503), forming a list of packet offsets (block 704), which consists of the positions of each packet in the compressed file 103, encoding the list of packet offsets into the file (block 705), forming a list of packet rule counts (block 706) consisting of integer numbers representing the number of grammar rules in each packet, and encoding the list of packet rule counts into the compressed file 103 (block 707).
  • the microprocessor is further executing instructions that cause the machine to perform the steps of selecting an appropriate index allowing the query to be executed in the fastest way (block 901) and selecting the starting symbol of the LM- grammar according to the selected index, performing a recursive search process which generates only the strings that match the query (block 903), and decoding the strings by using the inverse of the encoding transformation performed yielding the original data rows (block 905).
  • the step of performing a recursive search process which generates only the strings that match the query comprises identifying the longest prefix of the current partial string containing only terminal symbols (block 1002), based on the identified prefix, testing if the current string can match the query (block 1003), replacing the first non- terminal symbol in the current string with one of the alternatives of the grammar rule with the left-hand side equal to the said first non-terminal symbol (block 1009), and recursively continuing the search process on the modified string (block 1010).
  • all the occurrences of the first non-terminal symbol in the current string are replaced with one of the alternatives of the grammar rule with the left-hand side equal to the said first non-terminal symbol.
  • the microprocessor is executing instructions that cause the machine to perform the steps of dividing the prefix into fields that are between delimiters (block 1102), identifying the fully decoded fields and the partially decoded field (block 1104), testing if the fully decoded fields match the query (block 1105), deriving an interval from the partially decoded field (block 1110) and testing if the said interval intersects with the interval derived from the query criteria (block 1111).
  • the offered system and method for data compression and storage allowing fast retrieval yields a high data compression ratio similar to those achieved by the best archivers, while at the same time permitting fast data searching and retrieval similar to current database systems.
  • the retrieval method, when implemented in the Java programming language, amounts to only about 30 KB of object code and uses very little memory during operation, hence it is well suited for use in mobile devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of data compression and database management systems, and more particularly to systems and methods for compressing, storing and retrieving data from the compressed form. The proposed data compression method allowing fast retrieval comprises the steps of representing a set of strings as a context-free grammar and encoding the context-free grammar into a compressed file, said context-free grammar being an LM-grammar, which has one or more starting non-terminals and in which the right-hand side of every production rule of said context-free grammar either is an empty string, or its leftmost grammar symbol represents exactly one non-empty string and, for every two alternatives having the same left-hand side and both non-empty right-hand sides, the string represented by the leftmost grammar symbol of the right-hand side of one alternative is not a prefix of the string represented by the leftmost grammar symbol of the right-hand side of the other alternative.
PCT/IB2007/052508 2007-06-28 2007-06-28 System and method for data compression and storage allowing fast retrieval WO2009001174A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2007/052508 WO2009001174A1 (fr) 2007-06-28 2007-06-28 System and method for data compression and storage allowing fast retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2007/052508 WO2009001174A1 (fr) 2007-06-28 2007-06-28 System and method for data compression and storage allowing fast retrieval

Publications (1)

Publication Number Publication Date
WO2009001174A1 true WO2009001174A1 (fr) 2008-12-31

Family

ID=38935841

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2007/052508 WO2009001174A1 (fr) 2007-06-28 2007-06-28 System and method for data compression and storage allowing fast retrieval

Country Status (1)

Country Link
WO (1) WO2009001174A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI419391B (zh) * 2009-12-25 2013-12-11 Ind Tech Res Inst Heat dissipation and thermal runaway diffusion protection structure in a battery system
US9928267B2 (en) 2014-06-13 2018-03-27 International Business Machines Corporation Hierarchical database compression and query processing
CN109995373A (zh) * 2018-01-03 2019-07-09 上海艾拉比智能科技有限公司 Hybrid packing compression method for integer arrays
CN110808738A (zh) * 2019-09-16 2020-02-18 平安科技(深圳)有限公司 Data compression method, apparatus and device, and computer-readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1122655A2 (fr) * 2000-02-04 2001-08-08 International Business Machines Corporation Data compression apparatus and method, database, communication system, storage medium and data transmission apparatus

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1122655A2 (fr) * 2000-02-04 2001-08-08 International Business Machines Corporation Data compression apparatus and method, database, communication system, storage medium and data transmission apparatus

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BOTTCHER S ET AL: "XML index compression by DTD subtraction", ICEIS 2007. PROCEEDINGS OF THE NINTH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS, VOLUME DISI INSTICC PORTUGAL, 12 June 2007 (2007-06-12), pages 86 - 94, XP009097565, ISBN: 978-972-8865-88-7 *
COOPER B F ET AL: "A fast index for semistructured data", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES, XX, XX, 2001, pages 341 - 350, XP002303292 *
LOHREY ET AL: "The complexity of tree automata and XPath on grammar-compressed trees", THEORETICAL COMPUTER SCIENCE, AMSTERDAM, NL, vol. 363, no. 2, 28 October 2006 (2006-10-28), pages 196 - 210, XP005685578, ISSN: 0304-3975 *
WOLFF J E ET AL: "Searching and browsing collections of structural information", ADVANCES IN DIGITAL LIBRARIES, 2000. PROCEEDINGS. IEEE WASHINGTON, DC, USA 22-24 MAY 2000, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 22 May 2000 (2000-05-22), pages 141 - 150, XP010501112, ISBN: 0-7695-0659-3 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI419391B (zh) * 2009-12-25 2013-12-11 Ind Tech Res Inst Heat dissipation and thermal runaway diffusion protection structure in a battery system
US9928267B2 (en) 2014-06-13 2018-03-27 International Business Machines Corporation Hierarchical database compression and query processing
CN109995373A (zh) * 2018-01-03 2019-07-09 上海艾拉比智能科技有限公司 Hybrid packing compression method for integer arrays
CN109995373B (zh) * 2018-01-03 2023-08-15 上海艾拉比智能科技有限公司 Hybrid packing compression method for integer arrays
CN110808738A (zh) * 2019-09-16 2020-02-18 平安科技(深圳)有限公司 Data compression method, apparatus and device, and computer-readable storage medium
CN110808738B (zh) * 2019-09-16 2023-10-20 平安科技(深圳)有限公司 Data compression method, apparatus and device, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
US8838551B2 (en) Multi-level database compression
US8356060B2 (en) Compression analyzer
JP6596102B2 (ja) Lossless reduction of data by deriving data from prime data elements resident in a content-associative sieve
US7917480B2 (en) Document compression system and method for use with tokenspace repository
US5870036A (en) Adaptive multiple dictionary data compression
US5561421A (en) Access method data compression with system-built generic dictionaries
US9450605B2 (en) Block compression of tables with repeated values
US5153591A (en) Method and apparatus for encoding, decoding and transmitting data in compressed form
JP3273119B2 (ja) Data compression and decompression device
US5678043A (en) Data compression and encryption system and method representing records as differences between sorted domain ordinals that represent field values
US20020152219A1 (en) Data interexchange protocol
Balkenhol et al. Universal data compression based on the Burrows-Wheeler transformation: Theory and practice
JP2020518207A (ja) Lossless reduction of data by using a prime data sieve, and performing multidimensional search and content-associative retrieval on data that has been losslessly reduced using a prime data sieve
EP0127815B1 (fr) Procédé de compression de données
JP6846426B2 (ja) Reduction of audio data and of data stored on a block-processing storage system
TW202147787A (zh) Exploiting locality of prime data for efficient retrieval of data that has been losslessly reduced using a prime data sieve
CA2770348A1 (fr) Compression of bitmaps and values
JPH05241777A (ja) Data compression system
WO2009001174A1 (fr) System and method for data compression and storage allowing fast retrieval
CN115840751B (zh) Novel encoding method for tree-structured data
KR20080026772A (ko) Compression method that improves the restoration speed of the Lempel-Ziv compression method
JP3241787B2 (ja) Data compression system
US6731229B2 (en) Method to reduce storage requirements when storing semi-redundant information in a database
Sadakane Text compression using recency rank with context and relation to context sorting, block sorting and PPM*
JPH06251070A (ja) Electronic dictionary compression method and device for word search

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07789822

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07789822

Country of ref document: EP

Kind code of ref document: A1