WO2017161589A1 - Compressed indexing method and apparatus for string sequences - Google Patents


Info

Publication number
WO2017161589A1
Authority
WO
WIPO (PCT)
Prior art keywords
string
node
index
hop table
layer
Application number
PCT/CN2016/077428
Other languages
English (en)
French (fr)
Inventor
魏建生 (WEI Jiansheng)
朱俊华 (ZHU Junhua)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN201680083999.8A (CN108780455B)
Priority to PCT/CN2016/077428
Publication of WO2017161589A1



Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M 7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M 7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present invention relates to the field of data management technologies, and in particular to a compressed indexing method and apparatus for string sequences.
  • When string data is stored, dictionary encoding may be used.
  • The CS-Prefix-Tree, a cache-aware prefix-tree order-preserving indexing mechanism proposed by Carsten Binnig et al. in 2009, is generally used to support querying a compression dictionary without decompression.
  • CS-Prefix-Tree is composed of two parts: shared leaves and an encoding index.
  • The shared leaves contain a series of fixed-length data blocks, each of which stores a set of <string, code> (value, code) dictionary entries; the entries within and between blocks are globally ordered by string, and all the data blocks together form a complete dictionary.
  • The encoding index is a tree structure composed of a series of fixed-length branch nodes. Each branch node includes the address of its first child node, the number of keys recorded in the node, and a key list.
  • The keys are the difference prefixes of adjacent child nodes. A difference prefix is the shortest prefix that distinguishes the smallest string contained in a node from the largest string contained in its predecessor node. For example, as shown in FIG. 1, if the first leaf node contains the largest string "aaf" and the second leaf node contains the smallest string "amd", the first leaf node being the predecessor of the second, then the shortest prefix distinguishing "amd" from "aaf" is "am", so the difference prefix of the two leaf nodes is "am".
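The difference prefix computation described above can be sketched in a few lines of Python (the function names are illustrative, not from the patent): the difference prefix of a string is its shared prefix with the predecessor string plus one more character.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def difference_prefix(prev_largest: str, cur_smallest: str) -> str:
    """Shortest prefix of cur_smallest that distinguishes it from
    prev_largest (e.g. the largest string of the predecessor node)."""
    k = shared_prefix_len(prev_largest, cur_smallest)
    return cur_smallest[:k + 1]
```

For the example above, `difference_prefix("aaf", "amd")` yields `"am"`.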
  • The encoding index is constructed bottom-up: all leaf nodes are built first, and the branch nodes are then constructed layer by layer.
  • FIG. 1 uses 32-byte branch nodes. In the third field, the keys are written one by one from the two ends toward the middle: the difference prefix "am" is written as a key at the end of the field, and the node offset 29 is recorded at the head of the field, forming a field of the form ([29], ..., [am]).
  • Offsets are addressed from 0, that is, the offsets within a 32-byte branch node run from 0 to 31; each character in a key occupies 1 byte, and 1 byte of white space is also required.
  • CS-Prefix-Tree has the following problem: because the difference prefix length between adjacent bottom-level leaf nodes is not controlled, when a long string sequence is processed the difference prefix length may range from a few bytes to a few hundred bytes. An excessively long difference prefix reduces the capacity of an encoding-index branch node and increases both the number of branch nodes and the search complexity.
  • An embodiment of the present invention provides a compressed indexing method and apparatus for a string sequence, to solve the problem in the existing CS-Prefix-Tree encoding index that overly long difference prefixes of bottom-level leaf nodes reduce the capacity of encoding-index branch nodes and increase the number of branch nodes and the search complexity.
  • In a first aspect, an embodiment of the present invention provides a compressed indexing method for a string sequence. The method may include:
  • acquiring an ordered string sequence, and grouping the string sequence according to the difference prefix length of each string in the sequence to obtain M string groups, so that the difference prefix length of the first string in each string group is the shortest within a preset string range;
  • sequentially storing the M string groups into N memory pages, where the index key of a memory page is the difference prefix of the first string group in the page;
  • constructing a jump table index comprising Q layers of jump tables according to the index keys of the N memory pages; the jump table index is constructed bottom-up, and the layer-1 jump table is constructed from the index keys of the N memory pages.
  • Each hop table node includes at least one index key, the number of index keys, and the addressing information of the index keys.
  • The ordered string sequence may be a sequence of strings arranged in ascending or descending dictionary order.
  • The jump table index is constructed after grouping and paging the ordered string sequence. Because the difference prefix length of each string group is the shortest within a certain string range, the index key of each page obtained by paging according to the difference prefix lengths of the string groups is also locally shortest, and the index keys in the jump table index constructed on the basis of the pages are therefore relatively short. This reduces the average index key length in the jump table index and increases the capacity of the hop table nodes, thereby reducing the number of index nodes and the complexity of index lookup.
  • The ordered string sequence can be grouped in the following manner:
  • the first string is taken as the starting string of the first string group, and the difference prefix lengths of the subsequent W max strings are calculated in turn;
  • the string with the smallest difference prefix length is taken as the starting string of the second string group, and the above process is repeated to obtain the second string group;
  • subsequent strings are grouped in the same way until all strings have been grouped.
  • The threshold on the number of strings each string group can hold may be the same for all groups or may differ between groups.
  • The M string groups may be sequentially stored into the N memory pages according to the difference prefix length of each of the M string groups, so that the difference prefix length of the first string group in each memory page is the shortest within a preset string group range.
  • A specific implementation is as follows:
  • the first string group is used as the starting string group, and following the ordering of the string groups, subsequent string groups are written into the first memory page in turn;
  • the number N more of additional string groups that the remaining storage capacity (C max - C occupied ) of the first memory page can accommodate is then calculated;
  • among these, the string group with the smallest difference prefix length is taken as the first string group of the second memory page, the intervening groups are written into the first memory page, and the above process is repeated for the second memory page and onward until all string groups have been paged.
  • Optionally, the strings in a string group other than the first string may be written to the memory page in compressed form:
  • for each such string, the shared prefix length between the string and its adjacent preceding string, together with the suffix outside the shared prefix, is written into the remaining free space of the memory page.
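The compressed in-page layout described here is essentially front coding. A sketch under that assumption (the tuple layout and function names are illustrative, not the patent's byte format):

```python
def encode_page(group):
    """Front-code a string group for in-page storage: the first string is
    stored whole; each later string is stored as (shared-prefix length
    with its predecessor, remaining suffix)."""
    out = [(0, group[0])]
    for prev, cur in zip(group, group[1:]):
        k = 0
        while k < min(len(prev), len(cur)) and prev[k] == cur[k]:
            k += 1
        out.append((k, cur[k:]))
    return out

def decode_page(encoded):
    """Invert encode_page by replaying the shared prefixes."""
    strings, prev = [], ""
    for k, suffix in encoded:
        prev = prev[:k] + suffix
        strings.append(prev)
    return strings
```

For example, `encode_page(["alpha", "alphabet", "alpine"])` stores only `"alpha"`, `(5, "bet")`, and `(3, "ine")`.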
  • The Q layers of jump tables may be a multi-layer jump table constructed layer by layer; the q-th layer jump table in the Q layers is constructed from the (q-1)-th layer.
  • The index keys of the N memory pages are written in turn into the hop table nodes of the layer-1 jump table, and each hop table node records the number of index keys it contains and the addressing information of those index keys;
  • the addressing information of an index key in a layer-1 hop table node indicates the memory page in which the index key corresponding to the addressing information is located.
  • The jump table index is constructed bottom-up until the number of layers, or the number of hop table nodes in the uppermost layer, meets a preset condition, at which point construction of the jump table index ends.
  • For simple construction, each hop table node in the jump table index can be given a fixed length, which suits memory pages whose index key lengths are fairly uniform. When fixed-length hop table nodes are used, index keys extracted from the lower layer at interval F are written in turn into the current-layer jump table; when the current hop table node is full, writing continues in the next hop table node of that layer, until all index keys extracted from the lower layer have been written into the hop table nodes of the layer.
  • Alternatively, a construction method with variable-length hop table nodes may be used, in which the index keys of the lower layer are written in turn into the upper-layer hop table nodes. A specific implementation is as follows:
  • when the difference between the occupied length L occupied of the first hop table node and L node-min is smaller than the storage overhead of the i-th index key, the number N node-more of additional index keys that the remaining length (L node-max - L occupied ) of the node can accommodate is calculated, where L node-min is the minimum length of each hop table node and L node-max is the maximum length of each hop table node.
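The bottom-up construction can be sketched as follows. This is a simplified model with fixed-capacity nodes in which every lower node's first key is promoted to the next layer (i.e. an effective sparse coefficient of one key per node); the in-memory layout, such as the `(key, address)` tuples, is an illustrative assumption, not the patent's byte format.

```python
def build_skip_index(page_keys, max_keys_per_node=4):
    """Build a layered jump table over non-empty page_keys, bottom-up.
    Returns layers[0] (layer 1, one entry per memory page) through
    layers[-1] (the single top node).  Each node is a list of
    (index key, address) pairs; at layer 1 the address is the page
    number, at higher layers it is the index of the lower-layer node."""
    entries = [(key, addr) for addr, key in enumerate(page_keys)]
    layers = []
    while True:
        # pack the current layer's entries into fixed-capacity nodes
        nodes = [entries[i:i + max_keys_per_node]
                 for i in range(0, len(entries), max_keys_per_node)]
        layers.append(nodes)
        if len(nodes) == 1:
            return layers
        # next layer indexes each node of this layer by its first key
        entries = [(node[0][0], idx) for idx, node in enumerate(nodes)]
```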
  • After the jump table index is constructed, strings associated with a string to be queried can be found by searching the established jump table index from top to bottom. A specific implementation is as follows:
  • an index key found in the s-th hop table node of the t-th layer jump table directs the search to a hop table node of the next lower layer jump table, where the index key is looked up again, and so on until the target memory page is reached.
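A sketch of such a top-down lookup, assuming each layer is a list of hop table nodes and each node a list of (index key, lower-layer address) pairs ordered by key, bottom layer first (an illustrative layout, not the patent's):

```python
def find_page(layers, query):
    """Descend the jump table index from the top layer: in the current
    node, follow the last index key <= query down one layer; at layer 1
    the followed address is the memory page number."""
    node_idx = 0
    for layer in reversed(layers):       # top layer first
        node = layer[node_idx]
        addr = node[0][1]                # default: leftmost child
        for key, a in node:
            if key <= query:
                addr = a
            else:
                break
        node_idx = addr
    return node_idx
```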
  • To support deleting a string from the string sequence, the method may further include:
  • if the number of strings in a third string group falls below the threshold after the first string is deleted, acquiring a fourth string group adjacent to the third string group, and regrouping the third string group and the fourth string group;
  • writing the regrouped string groups into the second memory page in turn; and if the sum of the data amounts of the second memory page and a memory page adjacent to it is smaller than the data amount threshold of one memory page, merging the two memory pages.
  • The grouping, the paging, and the jump table index all have a certain spatial elasticity. Inserting or deleting a string therefore generally causes only local reconstruction rather than a complete rebuild of the jump table index, so the efficiency is high.
  • In a second aspect, an embodiment of the present invention provides a compressed indexing apparatus configured to perform the method of the first aspect. The apparatus may include:
  • an acquiring unit, configured to acquire an ordered string sequence;
  • a grouping unit, configured to group the string sequence according to the difference prefix length of each string in the sequence to obtain M string groups, where the difference prefix of each string group is the difference prefix of the first string in the group, so that the difference prefix length of the first string in each string group is the shortest within a preset string range;
  • a paging unit, configured to sequentially store the M string groups obtained by the grouping unit into N memory pages, where the index key of a memory page is the difference prefix of the first string group in the page;
  • a jump table index construction unit, configured to construct a jump table index comprising Q layers of jump tables according to the index keys of the N memory pages obtained by the paging unit, where the layer-1 jump table is constructed from the index keys of the N memory pages; each layer jump table includes at least one hop table node, and each hop table node includes at least one index key, the number of index keys, and the addressing information of the index keys.
  • The ordered string sequence may be a sequence of strings arranged in ascending or descending dictionary order.
  • As in the first aspect, because the difference prefix length of each string group is the shortest within a certain string range, the index key of each page is also locally shortest, and the index keys in the jump table index constructed on the pages are relatively short. This reduces the average index key length in the jump table index, increases hop table node capacity, and thereby reduces the number of index nodes and the index lookup complexity.
  • The specific execution process of the grouping unit is the same as the grouping process described in the first aspect; the specific execution process of the paging unit is the same as the paging process described in the first aspect; and the specific execution process of the jump table index construction unit is the same as the method for constructing the jump table index in the first aspect.
  • The compressed indexing apparatus may further include a query unit, configured to query strings associated with a string to be queried in the string sequence; its specific execution process is the same as the string query process described in the first aspect.
  • The compressed indexing apparatus may further include a string insertion unit, configured to insert a new string into the string sequence; its specific execution process is the same as the process of inserting a new string in the first aspect.
  • The compressed indexing apparatus may further include a string deletion unit, configured to delete a string from the string sequence; its specific execution process is the same as the process of deleting a string described in the first aspect.
  • The foregoing compressed indexing apparatus may be disposed in any computer of a data storage system, or may exist in the data storage system independently of any device.
  • The acquiring unit in the second aspect may be a transceiver of the compressed indexing apparatus; the grouping unit, paging unit, jump table index construction unit, query unit, string insertion unit, and string deletion unit in the second aspect may each be a separately established processor, may be integrated in one of the processors of the apparatus, or may be stored in the memory of the apparatus in the form of program code that one of the processors of the apparatus calls and executes to perform the functions above.
  • The processor described here may be a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
  • In summary, embodiments of the present invention provide a compressed indexing method and apparatus for a string sequence: an ordered string sequence is acquired; the sequence is grouped according to the difference prefix length of each string to obtain M string groups, so that the difference prefix length of the first string in each string group is the shortest within a preset string range; the M string groups are sequentially stored into N memory pages; and a jump table index is constructed from the index keys of the N memory pages.
  • After grouping and paging the ordered string sequence in this way, the index key of each page is locally shortest, so the index keys in the jump table index constructed on the pages are relatively short. This reduces the average index key length in the jump table index and increases hop table node capacity, thereby reducing the number of index nodes and the index lookup complexity, and avoiding the problem that overly long difference prefixes of bottom-level leaf nodes reduce branch node capacity and increase the number of branch nodes and the search complexity.
  • FIG. 1 is a structural diagram of an existing CS-Prefix-Tree index;
  • FIG. 2 is a structural diagram of a compression indexing apparatus 10 according to an embodiment of the present invention;
  • FIG. 3 is a flowchart of a compressed indexing method for a string sequence according to an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of grouping and paging an ordered string sequence according to an embodiment of the present invention;
  • FIG. 5 is a schematic diagram of encoded in-page string storage according to an embodiment of the present invention;
  • FIG. 6 is a schematic diagram of constructing a jump table index with fixed-length nodes according to an embodiment of the present invention;
  • FIG. 7 is a schematic diagram of constructing a jump table index with variable-length nodes according to an embodiment of the present invention;
  • FIG. 8 is a structural diagram of a compression indexing apparatus 20 according to an embodiment of the present invention.
  • The core idea of the present invention is to group a plurality of ordered strings so that the difference prefix length between adjacent strings in different groups is the shortest, and then to page the plurality of string groups so that the difference prefix length between adjacent strings on different pages is also the shortest.
  • The index key of a page is defined as the difference prefix of the first string it holds.
  • A jump table index is then constructed layer by layer and is used to locate a page, and the strings in its groups, by index key.
  • It should be noted that grouping and paging do not change the order of the strings; the order between groups and between pages is the same as the order between the strings they hold.
  • FIG. 2 is a structural diagram of a compression indexing apparatus 10 according to an embodiment of the present invention, for performing the compressed indexing method provided by the present invention.
  • The compression indexing apparatus 10 may be a device capable of data storage in a database system; it may be disposed in any computer, or may exist in the data storage system independently of any device.
  • The compression indexing apparatus 10 may include a processor 1011, a transceiver 1012, a memory 1013, and at least one communication bus 1014; the communication bus 1014 is used to implement connection and communication between these components.
  • The processor 1011 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, for example one or more digital signal processors (DSPs) or one or more Field Programmable Gate Arrays (FPGAs).
  • The transceiver 1012 can be used for data interaction with external network elements.
  • The memory 1013 may be a volatile memory, such as a random-access memory (RAM); a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory.
  • The communication bus 1014 may be divided into an address bus, a data bus, a control bus, and the like, and may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • The string sequence is grouped according to the difference prefix length of each string in the sequence to obtain M string groups, so that the difference prefix length of the first string in each string group is the shortest within a preset string range;
  • a jump table index comprising Q layers of jump tables is constructed according to the index keys of the N memory pages, where the layer-1 jump table in the Q layers is constructed from the index keys of the N memory pages; each layer jump table includes at least one hop table node, and each hop table node includes at least one index key, the number of index keys, and the addressing information of the index keys, so that a string in a group within a page can be found through the index keys in the jump table index.
  • The M string groups may be sequentially stored into the N memory pages according to the difference prefix length of each of the M string groups, so that the difference prefix length of the first string group in each of the N memory pages is the shortest within a preset string group range.
  • The Q layers of jump tables may be a multi-layer jump table constructed layer by layer; the q-th layer jump table in the Q layers is constructed from the first index keys of the hop table nodes of the (q-1)-th layer jump table selected according to a sparse coefficient F, where F is an integer greater than or equal to 1 and q is an integer greater than or equal to 2.
  • After the jump table index is constructed in this way, because the difference prefix length of each string group is the shortest within the preset string range, the index key of each page is also locally shortest; the index keys in the jump table index constructed on the pages are therefore relatively short, which reduces the average index key length, increases hop table node capacity, and thereby yields the beneficial effect of reducing the number of index nodes and the index lookup complexity.
  • Embodiment 1 illustrates and describes in detail, in the form of steps, the compression process for a string sequence provided by the present invention; the steps shown may also be performed in a computer system executing a set of executable instructions. Moreover, although logical orders are shown in the figures, in some cases the steps shown or described may be performed in an order different from that described herein.
  • FIG. 3 is a flowchart of a compressed indexing method for a string sequence according to an embodiment of the present invention, executed by the compression indexing apparatus 10 shown in FIG. 2. As shown in FIG. 3, the compressed indexing method may include the following steps:
  • S101: Acquire a string sequence, where the string sequence includes more than one string arranged in order.
  • The string sequence can be read directly from an existing database.
  • The strings may be arranged in ascending dictionary order or in descending dictionary order; this is not limited in the embodiments of the present invention.
  • For ease of description, the present invention takes a string sequence arranged in ascending dictionary order as an example to describe the compressed indexing method provided by the present invention.
  • The string sequence on the left side of FIG. 3 is a sequence of strings arranged in ascending dictionary order, "A" to "Z".
  • S102: Group the string sequence according to the difference prefix length of each string in the sequence to obtain M string groups, so that the difference prefix length of the first string in each string group is the shortest within a preset string range, where M is an integer greater than or equal to 1, each string group contains at least one string, and the difference prefix of each string group is the difference prefix of the first string in the group.
  • A string si being arranged before a string sj may mean that, in ascending dictionary order, si precedes sj.
  • For example, for the two strings "abe" and "afe" arranged in sequence, the string "abe" is the predecessor string of the string "afe";
  • their shared prefix is "a", whose length is 1;
  • the difference prefix of the string "afe" is its prefix "af" of length 2.
  • The M string groups can be obtained in the following way:
  • the first string of the m-th string group is used as a starting string, and the difference prefix length of each of the subsequent W max strings is calculated in turn;
  • the k-th string is the string with the smallest difference prefix length among the subsequent W min -th to W max -th strings, where W min ≤ k ≤ W max , and it becomes the starting string of the (m+1)-th string group;
  • the (m+1)-th string group is determined in the same way, and the process is repeated until every string in the string sequence has been processed, at which point the string sequence is divided into M string groups following the order of the sequence.
  • For the first string group, the starting string is the first string in the string sequence. In addition, if more than one string among the W min -th to W max -th strings has the minimal difference prefix length, the first such string is usually taken as the starting string of the next group.
  • The minimum threshold W min of the number of strings in each string group is the minimum number of strings the group can hold, and the maximum threshold W max is the maximum number of strings the group can hold; both can be set as needed and are not limited in the embodiments of the present invention. The thresholds W min and W max may be the same for all string groups or may differ between groups.
  • For example, the first string in the string sequence, "Alabama A&M University (AL)", is taken as the starting string of the first string group, and the difference prefix lengths of the 8 strings from "Alabama A&M University (AL)" to "American University (DC)" are calculated;
  • suppose the difference prefix lengths are "1, 8, 2, 16, 9, 11, 15, 9";
  • the string with the shortest difference prefix length among the 2nd to 10th positions is "American College (PA)", which becomes the first string of the next group.
  • Repeating the above process determines the second string group and the subsequent groups, until every string in the string sequence is grouped; the 10 resulting groups are marked G1 to G10.
  • S103: Sequentially store the M string groups into N memory pages, where N is an integer greater than or equal to 1, each memory page includes at least one string group, and the index key of each memory page is the difference prefix of the first string group in the page.
  • The addresses of the N memory pages may be contiguous or non-contiguous. The size of each memory page may be an integer multiple of the computer system cache block size C block , and the sizes of the memory pages may be the same or different.
  • the M string groups may be sequentially stored into the N memory pages according to a difference prefix length of each of the M string groups. So that the difference prefix length of the first string group of each of the N memory pages is the shortest within the preset string group range;
  • the string group is stored in the nth memory page of the N memory pages, 1 ⁇ n ⁇ N, that is, the nth memory page is any one of the N memory pages, and may include:
  • the difference in the n-th memory page is occupied storage capacity occupies C min C i is smaller than the storage overhead of the string groups, the computing the n th N more string groups that can be accommodated by the storage capacity (C max - C occupied ) in the memory page, wherein the N more string groups are: N arranged in order from the i-th string group More string groups;
  • Determining, in the N more string groups, a string group having the shortest differential prefix, and setting the i-th string group and the i-th string group to a string between the shortest string groups of the prefix The group is sequentially stored in the nth memory page, and the shortest string group of the prefix is used as the first string group of the n+1th memory page.
  • the corresponding string group may be sequentially stored in the n+1th memory page according to the above manner, so that the string group may be sequentially stored in order. Go to N memory pages.
  • the first string group is the first string group in the M string groups, and in addition, the characters with the smallest difference prefix among the N more string groups are described.
  • the first string group in the string group with the smallest difference prefix in the N more string groups is usually used as the first string group of the next page.
• the minimum capacity C_min and the maximum capacity C_max of each memory page can be set according to the actual storage capacity of the memory page, which is not limited by the embodiment of the present invention; the minimum capacity C_min and the maximum capacity C_max of each memory page may be the same or different. Optionally, the minimum capacity C_min and the maximum capacity C_max are integer multiples of the computer system cache block size C_block.
• the string groups in a memory page can be renumbered within the page, and the numbering does not need to be the same as that of the string groups before paging.
• FIG. 4 is a schematic diagram of the process of paging the string groups. Starting from string group G1, the groups G1, G2, and G3 are sequentially stored into the first memory page p1. If the capacity occupied by G1, G2, and G3 in memory page p1 is close to the minimum capacity threshold C_min but the remaining capacity up to C_min is not enough to store group G4, the two groups G4 and G5 are examined in order; if it is determined that storing G4 and G5 in memory page p1 would reach the maximum capacity threshold C_max, the group with the shortest difference prefix length among G4 and G5, namely G4, is used as the first string group of the next memory page p2, and the three string groups G1 to G3 before G4 are stored in memory page p1. The above process is repeated until pages p2 and p3 are completed; the groups inside each page are sequentially addressed as g1, g2, g3, and so on, where p1 and "A" in the figure respectively represent the address and the index key of memory page p1.
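The paging rule of FIG. 4 can be sketched as follows. This is a minimal illustration, assuming each group is described by its storage overhead `cost` (C_i) and its difference prefix length `dp`; the function and field names are hypothetical, not from the patent.

```python
def page_groups(groups, c_min, c_max):
    """Assign string groups to memory pages. A page is filled until its
    occupied capacity approaches c_min; then the groups that still fit
    within c_max are examined, and the one with the shortest difference
    prefix becomes the first group of the next page."""
    pages, cur, occupied, i = [], [], 0, 0
    while i < len(groups):
        g = groups[i]
        if not cur or occupied + g["cost"] <= c_min:
            cur.append(g); occupied += g["cost"]; i += 1
            continue
        # look ahead: which further groups would still fit below c_max?
        ahead, extra, j = [], 0, i
        while j < len(groups) and occupied + extra + groups[j]["cost"] <= c_max:
            extra += groups[j]["cost"]; ahead.append(j); j += 1
        if not ahead:                      # nothing more fits: close the page
            pages.append(cur); cur, occupied = [], 0
            continue
        split = min(ahead, key=lambda k: groups[k]["dp"])
        cur.extend(groups[i:split])        # groups before the split stay here
        occupied += sum(groups[k]["cost"] for k in range(i, split))
        pages.append(cur)
        cur, occupied, i = [], 0, split    # shortest-dp group opens next page
    if cur:
        pages.append(cur)
    return pages
```

Starting each page at a locally shortest difference prefix is what keeps the page's index key short in the later hop table construction.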
• further, the strings in a string group may be stored into the memory page in compressed form; that is, for the xth string group of the M string groups to be stored into the yth memory page of the N memory pages, 1 ≤ x ≤ M, 1 ≤ y ≤ N, the xth string group is stored into the yth memory page by the following compression storage method:
• FIG. 5 shows a schematic diagram of encoding and storing memory page p2 in FIG. 4. In the first group g1 of p2, the first string "Arizona State Polytechnic Campus (AZ)" is stored in uncompressed form; the second string "Arizona State University (AZ)" shares the prefix "Arizona State" of length 14 with the first string, so it is stored as "14University (AZ)"; similarly, the third string is stored as "25West (AZ)", where "25" represents the length of the prefix "Arizona State University" it shares with the second string. After the group data of memory page p2 has been written, the number of groups "3" and the intra-page addresses g3, g2, g1 of the groups are written in reverse order into the reserved space at the end of the page.
• in this string sequence, the shortest difference prefixes, such as "A" and "B", have length 1, while the longest difference prefix, such as "Arizona State University W", has length 26; with the above grouping and paging method, long difference prefixes can be effectively avoided as index keys of memory pages, reducing the storage overhead of the subsequently built index.
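The in-page compression of FIG. 5 can be illustrated as follows. A sketch only: real pages additionally store the group count and intra-page addresses in reverse order at the page end, and the decoder below assumes suffixes never begin with a digit.

```python
import re

def encode_group(strings):
    """FIG. 5 scheme: first string uncompressed; each later string stored
    as <shared-prefix length><suffix> relative to its predecessor."""
    out = [strings[0]]
    for prev, cur in zip(strings, strings[1:]):
        shared = 0
        while shared < min(len(prev), len(cur)) and prev[shared] == cur[shared]:
            shared += 1
        out.append(f"{shared}{cur[shared:]}")
    return out

def decode_group(encoded):
    """Inverse of encode_group: each entry is expanded against the
    previously reconstructed string."""
    strings = [encoded[0]]
    for item in encoded[1:]:
        m = re.match(r"\d+", item)
        n = int(m.group())
        strings.append(strings[-1][:n] + item[m.end():])
    return strings
```

Run on group g1 of FIG. 5, `encode_group` reproduces the "14University (AZ)" and "25West (AZ)" entries from the figure, and `decode_group` restores the original strings without decompressing anything else on the page.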
  • each hop table includes at least one hop table node, and each hop table node includes at least one index key, the number of index keys, and the addressing information of the index key.
• the Q-layer hop table may be a multi-layer hop table constructed layer by layer: the qth layer hop table of the Q-layer hop table may be constructed from the first index keys of the hop table nodes extracted at intervals of a sparsity coefficient F from the (q-1)th layer hop table, where F is an integer greater than or equal to 1 and q is an integer greater than or equal to 2.
  • the sparse coefficient F can be set as needed, which is not limited in this embodiment of the present invention.
• the length L_node of each hop table node may be an integer multiple of the computer system cache line length.
• the hop table index may be constructed as follows:
• the index keys of the N memory pages are sequentially written into the hop table nodes of the layer-1 hop table, and in each hop table node, the number of index keys contained in the node and the addressing information of the index keys are recorded, wherein the addressing information of an index key in a hop table node of the layer-1 hop table is used to indicate the memory page where the index key corresponding to the addressing information is located;
• starting from the first hop table node of the (q-1)th layer hop table as the starting node, the first index keys of at least one hop table node are extracted at intervals of F and sequentially written into the hop table nodes of the qth layer hop table;
• the layer-1 hop table is thus constructed from the bottom up, and the qth layer hop table is then constructed upwards, until the number of layers Q or the number of hop table nodes contained in the Q-layer hop table satisfies the preset condition, or the uppermost hop table converges to a single hop table node, at which point the construction of the hop table index stops.
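The bottom-up, layer-by-layer construction can be sketched as follows. For simplicity each node here holds a single key, whereas real nodes pack several; the address tuples ('page', i) for layer 1 and ('node', layer, index) above it are an illustrative model, not the patent's encoding.

```python
def build_hop_index(page_keys, F=2, max_layers=8):
    """Layer 0 (the layer-1 hop table) maps each memory page's index key to
    its page; each higher layer keeps the key of every F-th node of the
    layer below, until a layer converges to a single node."""
    layers = [[(k, ("page", i)) for i, k in enumerate(page_keys)]]
    while len(layers[-1]) > 1 and len(layers) < max_layers:
        below = layers[-1]
        below_idx = len(layers) - 1
        # extract the first key of every F-th node of the layer below
        layers.append([(below[i][0], ("node", below_idx, i))
                       for i in range(0, len(below), F)])
    return layers
```

With seven page keys and F = 2 this yields four layers, matching the convergence rule above: the construction stops once the top layer holds one node.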
  • the preset condition can be set as required.
  • a fixed length jump table node may be used.
• sequentially writing the index keys extracted from the (q-1)th layer hop table into the hop table nodes of the qth layer hop table may include: writing the index keys into the hop table nodes of the qth layer hop table in order, and, each time an index key is written, recording the corresponding addressing information, updating the number of index keys in the node, and calculating the remaining free space of the node;
  • Figure 6 shows a schematic diagram of the order-preserving compression index using a fixed-length hopping table node.
• the layer-1 hop table in FIG. 6 has seven nodes, whose addresses are n1-1 to n1-7, and it records the index keys of all 39 memory pages. Taking the first hop table node of the layer-1 hop table as an example, its address is n1-1; the first field "3" represents that three index keys are recorded; the second field records the three index keys; and the third field records the offset address of each of the three index keys within the node together with the address of the corresponding memory page. For example, (o1, p1) represents that the index key "A" is recorded at offset o1 within node n1-1, and that the address of the memory page corresponding to "A" is p1. The index keys "A", "Ar", and "B" are written sequentially after the first field, while (o1, p1), (o2, p2), and (o3, p3) are written in reverse order from the end of the node, so that the free space is concentrated between the 2nd and 3rd fields to maximize the capacity of the node.
• if the sparsity coefficient of the layer-1 hop table in FIG. 6 is 2, the first index keys of the first, third, fifth, and seventh nodes can be sequentially indexed into the layer-2 hop table, which consists of two hop table nodes with addresses n2-1 and n2-2.
• taking the first hop table node of the layer-2 hop table as an example, its address is n2-1; the first field "3" represents that three index keys are recorded; the second field records the first index keys of n1-1, n1-3, and n1-5, which include "A", "C", and so on (the information of n1-5 is not explicitly listed owing to the limited picture size); and the third field records, in reverse order, the offset address of each of the three index keys within the node and the address of the corresponding lower-layer hop table node. For example, (o1, n1-1) represents that the index key "A" is recorded at offset o1 within node n2-1, and that the address of the lower-layer hop table node corresponding to "A" is n1-1.
• if the layer-2 hop table satisfies the preset condition, the construction of the hop table index stops; otherwise, the hop table index continues to be constructed in the above manner until the number of hop table nodes in the uppermost hop table or the number of hop table layers meets the preset condition.
• alternatively, a variable-length hop table node may also be used to construct the hop table index.
• when a variable-length hop table node is used to construct the hop table index, sequentially writing the first index keys of the at least one hop table node extracted from the (q-1)th layer hop table into the hop table nodes of the qth layer hop table may include:
• taking the first of the extracted first index keys as the starting index key, the first index keys of the at least one hop table node are sequentially written into the hop table nodes of the qth layer hop table;
• if the difference between the minimum length L_node-min and the length L_occupied already occupied in the hop table node being written is smaller than the length L_i of the ith index key, computing the N_node-more index keys that can be accommodated by the remaining length (L_node-max - L_occupied) of the node being written, wherein the N_node-more index keys are the N_node-more index keys arranged in order starting from the ith index key;
• determining the shortest index key among the N_node-more index keys, writing the ith index key and the index keys between the ith index key and the shortest index key into the hop table node being written, and writing the shortest index key into the next hop table node as the first index key of that node.
• the subsequent index keys may then be sequentially written into the next hop table node in the above manner, and so on, so that the index keys are sequentially stored into the hop table nodes of the qth layer hop table. It should be noted that, when calculating the remaining available length (L_node-max - L_occupied), the reserved storage overhead corresponding to the addressing information of the index keys needs to be deducted.
• wherein L_node-min is the minimum length of each hop table node and L_node-max is the maximum length of each hop table node;
• the minimum length L_node-min and the maximum length L_node-max of each hop table node can be set according to the actual length of the hop table node, which is not limited in this embodiment of the present invention; the minimum length L_node-min and the maximum length L_node-max of each hop table node may be the same or different.
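The variable-length fill rule mirrors the paging rule at node granularity. A sketch under the assumption that each key costs its character length plus a fixed 4-byte reservation for its addressing information (`addr_overhead` is illustrative, not from the patent):

```python
def fill_nodes(keys, l_min, l_max, addr_overhead=4):
    """Fill hop table nodes up to at least l_min; once a key would push a
    node past l_min, look ahead at the keys still fitting below l_max and
    start the next node at the shortest one, keeping first keys short."""
    cost = lambda k: len(k) + addr_overhead   # key bytes + reserved addressing
    nodes, cur, used, i = [], [], 0, 0
    while i < len(keys):
        if not cur or used + cost(keys[i]) <= l_min:
            cur.append(keys[i]); used += cost(keys[i]); i += 1
            continue
        ahead, extra, j = [], 0, i
        while j < len(keys) and used + extra + cost(keys[j]) <= l_max:
            extra += cost(keys[j]); ahead.append(j); j += 1
        if not ahead:
            nodes.append(cur); cur, used = [], 0
            continue
        split = min(ahead, key=lambda k: len(keys[k]))
        cur.extend(keys[i:split])
        used += sum(cost(k) for k in keys[i:split])
        nodes.append(cur)
        cur, used, i = [], 0, split           # shortest key opens next node
    if cur:
        nodes.append(cur)
    return nodes
```

Because the first key of a node is the one promoted to the layer above, starting each node at a short key directly reduces the storage overhead of the higher layers.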
  • Figure 7 shows a schematic diagram of constructing a jump table index using a variable length hop table node.
• the layer-1 hop table in FIG. 7 records the index keys and addresses of all 39 memory pages. Taking the first hop table node of the layer-1 hop table as an example, its length 2*L_line is 2 times the computer system cache line length and its address is n1-1; the second field "5" represents that 5 index keys are recorded; the third field records the 5 index keys "A", "Ar", "B", "Bo", and "Bu"; and the fourth field records the offset address of each of the five index keys within the node together with the address of the corresponding page. For example, (o1, p1) represents that the index key "A" is recorded at offset o1 within node n1-1, and that the address of the page corresponding to "A" is p1. "A", "Ar", "B", "Bo", and "Bu" are written sequentially in the third field, while (o1, p1), (o2, p2), (o3, p3), (o4, p4), and (o5, p5) are written in reverse order from the end of the node, so that the free space can be concentrated between the 3rd and 4th fields to maximize the capacity of the node.
• taking the first hop table node of the layer-2 hop table as an example, its address is n2-1 and its length 1*L_line is 1 times the computer system cache line length; the second field "5" represents that 5 index keys are recorded; the third field records the first index key of each of n1-1 to n1-5, including "A", "C", ..., "Y", and so on (part of the node information is not explicitly listed owing to the limited picture size); and the fourth field records, in reverse order, the offset address of each of the 5 index keys within the node and the address of the corresponding lower-layer hop table node. For example, (o1, n1-1) represents that the index key "A" is recorded at offset o1 within node n2-1, and that the address of the lower-layer hop table node corresponding to "A" is n1-1.
• the layer-2 hop table has one node whose address is n2-1, and the construction of the index structure is completed after the layer-2 hop table is created.
  • the storage space of each hop table node in each layer hop table may be continuously allocated or non-continuously allocated.
• when the storage space of the hop table nodes in a layer hop table is allocated continuously, any hop table node can calculate the storage addresses of the other hop table nodes in that layer; therefore, only the start address and the end address of the hop table need to be recorded, as a tuple such as <n1-start, n1-end>, to avoid out-of-bounds access during the lookup process.
• when the storage space of the hop table nodes in a layer hop table is not continuous, a linked-list structure is required: a pointer field is added in each hop table node, pointing to the adjacent next hop table node in the same layer, and an end tag is set in the last hop table node of each layer to avoid out-of-bounds access during the lookup process.
• after the hop table index is built, the corresponding memory page can be searched for from the top down according to the index keys in the hop table index; the corresponding group is then searched for within the memory page, and the strings in the group are fed back to the user.
  • the method may further include:
• if an index key matching the string to be queried is found in the sth hop table node of the tth layer hop table, the addressing information of that index key is used to locate the corresponding hop table node in the next lower layer hop table, and the search continues downward until the memory page is reached;
• the index key matching the string to be queried may be: an index key that is ordered before the string to be queried in ascending dictionary order, or an index key that has a shared prefix with the string to be queried.
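The top-down lookup can be sketched as follows, assuming an index modeled as a list of layers (bottom layer first), each layer a sorted list of (key, address) pairs with addresses of the form ('page', i) or ('node', lower_layer, index). At each layer the right-most key ordered at or before the query is followed downward; names and structure are illustrative.

```python
import bisect

def hop_lookup(layers, query):
    """Descend from the top layer: at each layer take the last index key
    ordered at or before `query` (ascending dictionary order) and follow
    its addressing information until a memory page is reached."""
    layer, start = len(layers) - 1, 0
    while True:
        keys = [k for k, _ in layers[layer]]
        j = max(bisect.bisect_right(keys, query, lo=start) - 1, start)
        addr = layers[layer][j][1]
        if addr[0] == "page":
            return keys[j], addr[1]        # matching index key and its page
        layer, start = addr[1], addr[2]    # descend into the lower layer
```

Note the index keys are never decompressed: the comparison works directly on the stored keys, which is what enables the non-decompression query the background section mentions.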
• the embodiment of the present invention may also dynamically insert a string into the string sequence, and the specific implementation is as follows:
  • the second string group may be the next string group adjacent to the first string group.
• steps S102 and S103 may be used to determine the first memory page and the first string group to which the new string belongs; in addition, if inserting the new string changes the number of memory pages or the index keys, the hop table nodes need to be updated from the bottom up until the index reconstruction is completed.
  • the embodiment of the present invention can also dynamically delete a character string in a string sequence, and the specific implementation is as follows:
• if the number of strings in the third string group is less than the threshold after the first string is deleted, acquiring a fourth string group adjacent to the third string group, and regrouping the third string group and the fourth string group;
• writing the regrouped string groups sequentially into the second memory page, and, if the sum of the data amounts of the second memory page and a memory page adjacent to it is less than the data amount threshold of one memory page, merging the two memory pages.
• the fourth string group may be the previous string group adjacent to the third string group, or may be the next string group adjacent to the third string group; the data amount threshold of the memory page may be set as needed, which is not limited in this embodiment of the present invention.
• the hop table nodes are then updated from the bottom up until the index reconstruction is completed.
• the string groups, the memory pages, and the hop table index all have a certain spatial elasticity; therefore, inserting or deleting a string generally causes only local reconstruction, and the efficiency is high.
• the embodiment of the present invention provides a compression index method for a string sequence: an ordered string sequence is obtained; the string sequence is grouped according to the difference prefix length of each string in the sequence to obtain M string groups, so that the difference prefix length of the first string of each string group is the shortest within the preset string range; the M string groups are sequentially stored into N memory pages; and a hop table index is constructed according to the index keys of the N memory pages. Thus the hop table index is constructed after the ordered string sequence has been grouped and paged. Since the index key of each page obtained by paging according to the difference prefix lengths of the string groups is also locally shortest, the index keys in the hop table index built on top of the pages are relatively short as well, which reduces the average length of the index keys in the hop table index and increases the capacity of the hop table nodes, thereby reducing the number of index nodes and the index search complexity. This avoids the problem in the existing CS-Prefix-Tree coding index that the long difference prefix lengths of the underlying leaf nodes decrease the capacity of the branch nodes and increase the number of branch nodes and the search complexity.
• the following embodiment of the present invention further provides a compression indexing device 20, which is preferably used to implement the method in the foregoing method embodiments.
  • FIG. 8 is a structural diagram of a compression indexing device 20 according to an embodiment of the present disclosure, which is used to perform the method according to the first embodiment. As shown in FIG. 8, the device may include:
  • the obtaining unit 201 is configured to obtain a sequence of strings, where the sequence of strings includes more than one character string arranged in an order.
• the grouping unit 202 is configured to perform grouping processing on the string sequence according to the difference prefix length of each string in the string sequence acquired by the obtaining unit 201, to obtain M string groups, so that the difference prefix length of the first string in each string group is the shortest within the preset string range, where M is an integer greater than or equal to 1, each string group contains at least one string, and the difference prefix of each string group is the difference prefix of the first string in the string group.
  • the paging unit 203 is configured to sequentially store the M character string groups obtained by the grouping unit 202 into N memory pages, where N is an integer greater than or equal to 1, and each memory page includes at least one character string group.
  • the index key of the memory page is: the difference prefix of the first string group in the memory page.
• the hop table index construction unit 204 is configured to construct a hop table index according to the index keys of the N memory pages obtained by the paging unit 203, where the hop table index includes a Q-layer hop table, Q is an integer greater than or equal to 1, the layer-1 hop table of the Q-layer hop table is constructed according to the index keys of the N memory pages, each hop table includes at least one hop table node, and each hop table node includes at least one index key, the number of index keys, and the addressing information of the index keys.
  • the grouping unit 202 may obtain the mth string group in the M string groups by the following method, where 1 ⁇ m ⁇ M:
• the first string of the mth string group is used as a starting string, and the difference prefix length of each of the subsequent W_max strings is sequentially calculated;
• the kth subsequent string, which is the string with the smallest difference prefix length among the subsequent W_min to W_max strings, is used as the first string of the (m+1)th string group, where W_min ≤ k ≤ W_max;
• the (m+1)th string group can be determined in the above manner, and so on; the string sequence can thus be divided into M string groups in order.
• it should be noted that the first string here is the first string of the string sequence; in addition, if more than one string among the W_min-th to W_max-th strings has the smallest difference prefix length, the first string with the smallest difference prefix length is usually used as the first string of the next group.
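The grouping rule performed by grouping unit 202 can be sketched as follows. A minimal illustration: `w_min`/`w_max` stand for W_min/W_max, and the difference prefix length of a string is taken here as one character more than the prefix it shares with its predecessor; all names are hypothetical.

```python
def diff_prefix_len(prev, cur):
    """Difference prefix length of `cur` relative to `prev`: one more
    than their shared prefix length, capped at len(cur)."""
    shared = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        shared += 1
    return min(shared + 1, len(cur))

def group_strings(strings, w_min, w_max):
    """Split a sorted string sequence into groups whose first strings have
    locally minimal difference prefix lengths: after each group's start,
    the boundary is the string with the smallest difference prefix length
    among the w_min-th to w_max-th following strings."""
    groups, start = [], 0
    while start < len(strings):
        lo, hi = start + w_min, min(start + w_max, len(strings) - 1)
        if lo >= len(strings):
            groups.append(strings[start:])
            break
        # min() returns the first minimum, matching the tie-breaking rule
        split = min(range(lo, hi + 1),
                    key=lambda k: diff_prefix_len(strings[k - 1], strings[k]))
        groups.append(strings[start:split])
        start = split
    return groups
```

Because Python's `min` returns the first of several equal minima, ties are broken toward the earliest candidate, as the note above prescribes.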
• the paging unit 203 may sequentially store the M string groups into the N memory pages according to the difference prefix length of each string group of the M string groups, so that the difference prefix length of the first string group of each of the N memory pages is the shortest within the preset string group range; specifically, the paging unit 203 storing the string groups into the nth memory page of the N memory pages, 1 ≤ n ≤ N, may include:
• if the difference between the minimum capacity C_min and the occupied storage capacity C_occupied of the nth memory page is smaller than the storage overhead C_i of the ith string group, computing the N_more string groups that can be accommodated by the remaining storage capacity (C_max - C_occupied) of the nth memory page, wherein the N_more string groups are the N_more string groups arranged in order starting from the ith string group;
• determining, among the N_more string groups, the string group whose difference prefix is the shortest, sequentially storing the ith string group and the string groups between the ith string group and the string group with the shortest difference prefix into the nth memory page, and using the string group with the shortest difference prefix as the first string group of the (n+1)th memory page.
• the subsequent string groups may then be sequentially stored into the (n+1)th memory page in the above manner, and so on, so that the M string groups are sequentially stored into the N memory pages in order.
• it should be noted that the first string group here is the first string group of the M string groups; in addition, if more than one of the N_more string groups has the smallest difference prefix length, the first string group with the smallest difference prefix length is usually used as the first string group of the next page.
• the string groups in a memory page can be renumbered within the page, and the numbering does not need to be the same as that of the string groups before paging.
• the paging unit 203 can further store the strings in a string group into the memory page in compressed form; that is, for the xth string group among the M string groups to be stored into the yth memory page of the N memory pages, 1 ≤ x ≤ M, 1 ≤ y ≤ N, the paging unit 203 can store the xth string group into the yth memory page by the following compression storage method:
• writing the first string of the xth string group without compression, and writing, for any other string, the shared prefix length between that string and its adjacent previous string together with the suffix string after the shared prefix, into the free space of the yth memory page.
• the Q-layer hop table may be a multi-layer hop table constructed layer by layer: the qth layer hop table of the Q-layer hop table may be constructed from the first index keys of the hop table nodes extracted at intervals of the sparsity coefficient F from the (q-1)th layer hop table, where F is an integer greater than or equal to 1 and q is an integer greater than or equal to 2.
  • the sparse coefficient F can be set as needed.
• the length L_node of each hop table node may be an integer multiple of the computer system cache line length.
  • the hop table index construction unit 204 is specifically configured to:
• the index keys of the N memory pages are sequentially written into the hop table nodes of the layer-1 hop table, and in each hop table node, the number of index keys contained in the node and the addressing information of the index keys are recorded, wherein the addressing information of an index key in a hop table node of the layer-1 hop table is used to indicate the memory page where the index key corresponding to the addressing information is located;
• the layer-1 hop table is thus constructed from the bottom up, and the qth layer hop table is then constructed upwards, until the number of layers Q or the number of hop table nodes contained in the Q-layer hop table satisfies the preset condition, or the uppermost hop table converges to a single hop table node, at which point the construction of the hop table index stops.
  • the preset condition can be set as required.
• a fixed-length hop table node may be used to construct the hop table index.
• when the hop table index construction unit 204 sequentially writes the first index keys of the at least one hop table node extracted from the (q-1)th layer hop table into the hop table nodes of the qth layer hop table, it is specifically configured to: sequentially write the first index keys of the at least one hop table node into the hop table nodes of the qth layer hop table, recording the corresponding addressing information each time an index key is written.
• alternatively, a variable-length hop table node may also be used to construct the hop table index.
• when the hop table index construction unit 204 sequentially writes the at least one index key extracted from the (q-1)th layer hop table into the hop table nodes of the qth layer hop table, it may specifically be configured to:
• taking the first of the extracted first index keys as the starting index key, the first index keys of the at least one hop table node are sequentially written into the hop table nodes of the qth layer hop table;
• if the difference between the minimum length L_node-min and the length L_occupied already occupied in the hop table node being written is smaller than the length L_i of the ith index key, computing the N_node-more index keys that can be accommodated by the remaining length (L_node-max - L_occupied) of the node being written, wherein the N_node-more index keys are the N_node-more index keys arranged in order starting from the ith index key, L_node-min is the minimum length of each hop table node, and L_node-max is the maximum length of each hop table node;
• determining the shortest index key among the N_node-more index keys, writing the ith index key and the index keys between the ith index key and the shortest index key into the hop table node being written, and writing the shortest index key into the next hop table node as the first index key of that node.
• the subsequent index keys may then be sequentially written into the next hop table node in the above manner, and so on, so that the index keys are sequentially stored into the hop table nodes of the qth layer hop table. It should be noted that, when calculating the remaining available length (L_node-max - L_occupied), the reserved storage overhead corresponding to the addressing information of the index keys needs to be deducted.
• the minimum length L_node-min and the maximum length L_node-max of each hop table node may be set according to the actual length of the hop table node, which is not limited in this embodiment of the present invention; the minimum length L_node-min and the maximum length L_node-max of each hop table node may be the same or different.
  • the compression indexing device 20 may further include: a query unit 205;
  • the query unit 205 is configured to: obtain a character string to be queried;
• if an index key matching the string to be queried is found in the sth hop table node of the tth layer hop table, the addressing information of that index key is used to locate the corresponding hop table node in the next lower layer hop table, and the search continues downward until the memory page is reached;
• the index key matching the string to be queried may be: an index key that is ordered before the string to be queried in ascending dictionary order, or an index key that has a shared prefix with the string to be queried.
  • the embodiment of the present invention may further dynamically insert a character string into the string sequence.
  • the apparatus 20 may further include: a string insertion unit 206;
  • the string insertion unit 206 is configured to acquire a new string, where the new string is a string that is not in the sequence of the string;
  • the second string group may be a next string group adjacent to the first string group.
  • the embodiment of the present invention may also dynamically delete the character string in the string sequence.
  • the device 20 may further include: a string deletion unit 207;
  • the character string deleting unit 207 may be configured to delete the first character string in the string sequence, where the first character string is located in the second memory page and the third character string group;
• if the number of strings in the third string group is less than the threshold after the first string is deleted, acquiring a fourth string group adjacent to the third string group, and regrouping the third string group and the fourth string group;
• writing the regrouped string groups sequentially into the second memory page, and, if the sum of the data amounts of the second memory page and a memory page adjacent to it is less than the data amount threshold of one memory page, merging the two memory pages.
• the fourth string group may be the previous string group adjacent to the third string group, or may be the next string group adjacent to the third string group; the data amount threshold of the memory page may be set as needed, which is not limited in this embodiment of the present invention.
• the hop table nodes are then updated from the bottom up until the index reconstruction is completed.
• the string groups, the memory pages, and the hop table index all have a certain spatial elasticity; therefore, inserting or deleting a string generally causes only local reconstruction, and the efficiency is high.
• the compression indexing device 20 in FIG. 8 may be disposed in any computer of the data storage system, or may be disposed in the data storage system independently of any device. The obtaining unit 201 in FIG. 8 may be implemented by the transceiver 1012 of the compression indexing device 10; the grouping unit 202, the paging unit 203, the hop table index construction unit 204, the query unit 205, the string insertion unit 206, and the string deletion unit 207 may be separately set in the processor 1011 of the compression indexing device 10, may be integrated in one of the processors of the compression indexing device 10, or may be stored in the memory 1013 of the compression indexing device 10 in the form of program code that a processor of the compression indexing device 10 invokes to perform the functions of the above units.
• the processor described herein may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
• the embodiment of the present invention provides a compression indexing device for a string sequence, which obtains an ordered string sequence, performs grouping processing on the string sequence according to the difference prefix length of each string in the sequence to obtain M string groups, so that the difference prefix length of the first string of each string group is the shortest within the preset string range, sequentially stores the M string groups into N memory pages, and constructs a hop table index according to the index keys of the N memory pages. Thus the hop table index is constructed after the ordered string sequence has been grouped and paged. Since the index key of each page obtained by paging according to the difference prefix lengths of the string groups is also locally shortest, the index keys in the hop table index built on top of the pages are relatively short as well, which reduces the average length of the index keys in the hop table index and increases the capacity of the hop table nodes, thereby reducing the number of index nodes and the index search complexity. This avoids the problem in the existing CS-Prefix-Tree coding index that the long difference prefix lengths of the underlying leaf nodes decrease the capacity of the branch nodes and increase the number of branch nodes and the search complexity.
  • the disclosed systems, devices, and methods may be implemented in other manners. The device embodiments described above are merely illustrative: the division into units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
  • the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
  • an integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. Such software functional units are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

A compression indexing method and device for a string sequence relate to the field of data management technologies and solve the problem in the existing CS-Prefix-Tree encoding index that overly long differential prefixes at the bottom leaf nodes reduce the capacity of the encoding-index branch nodes and increase the number of branch nodes and the lookup complexity. The method comprises: grouping a string sequence according to the differential prefix length of each string in the sequence to obtain M string groups, such that the differential prefix length of the first string in each group is the shortest within a preset string range (S102); storing the M string groups in order into N memory pages (S103); and constructing a skip-table index according to the index keys of the N memory pages (S104).

Description

一种字符串序列的压缩索引方法及装置 技术领域
本发明涉及数据管理技术领域,尤其涉及一种字符串序列的压缩索引方法及装置。
背景技术
随着数据库广泛应用到社会生产的各个领域,数据库记录的规模和属性也日趋复杂,在这种前提下列优先存储(简称“列存”)的优势日渐突出。其中,当采用列存时,为了减少存储开销,可以采用字典编码方式来存储数据。目前,人们通常采用Carsten Binnig等人于2009年提出的CS-Prefix Tree(缓存感知型前缀树)保序压缩索引机制,用于支持对压缩字典的不解压查询。
如图1所示,CS-Prefix-Tree由共享叶(Shared leaves)和编码索引(Encode index)两部分构成。共享叶包含一系列固定长度的数据块,每个数据块存储一组<字符串,编码>(value,code)字典项,块内及块间的字典项按“字符串”全局有序,所有数据块一起构成了完整字典。编码索引是由一系列固定长度分支节点(node)构成的树结构,每个分支节点包括:分支节点中第一个子节点的地址、分支节点记录的关键字个数、以及关键字列表。其中,关键字为相邻子节点的差异前缀,差异前缀是指区分某个节点所包含的最小字符串与其前驱节点所包含的最大字符串的最短前缀,例如,如图1所示,最下行首个叶节点所包含的最大字符串为“aaf”,第二个叶节点所包含的最小字符串为“amd”,首个叶节点为第二个叶节点的前驱节点,区分“amd”与“aaf”的最短前缀为“am”,即两个叶节点的差异前缀为“am”。
编码索引采用“自底向上”的方式构造,即先有全部叶节点,再逐层构造分支节点。例如,图1采用32字节分支节点,并在第三字段采用由两端到中间的方式逐个写入关键字,即首先将差异前缀 “am”作为关键字写入字段尾部,并记录节点偏移量29到字段首部,形成([29],…,[am])的字段形式。其中,偏移量从0开始编址,即32字节分支节点的偏移量由0顺序编址到31,关键字中的每个字符占用1个字节且需要1个字节的空白字符作为结束标记,所以“am”需要占用偏移量29~31三个字节。如此类推,将第二个差异前缀“amq”作为关键字写入偏移量为25的字段尾部,并记录偏移量到字段首部,形成([29,25],…,[amq,am])的字段形式;将第三个差异前缀“bc”作为关键字写入偏移量为22的字段尾部,并记录偏移量到字段首部,形成([29,25,22],…,[bc,amq,am])的字段形式。此时,字段可使用空间不足以容纳下一个关键字,则分配新的分支节点索引后续叶节点。在构造编码索引的过程中,若当前最上层索引有两个及以上分支节点,则需分配新的分支节点构造更上一层索引,直到索引收敛到单一的根节点。
但是,在实现本发明的过程中,发明人发现CS-Prefix-Tree存在以下问题:由于底层相邻叶节点间的差异前缀长度不受控制,在处理长字符串序列时,差异前缀长度可能在几字节到几百字节不等,此时,过长的差异前缀长度会导致编码索引分支节点的容纳能力下降,增加分支节点数量和查找复杂度。
发明内容
本发明的实施例提供一种字符串序列的压缩索引方法及装置,以解决现有CS-Prefix-Tree编码索引过程中,底层叶节点存在过长的差异前缀长度,导致编码索引分支节点的容纳能力下降,增加分支节点数量和查找复杂度的问题。
为达到上述目的,本发明的实施例采用如下技术方案:
第一方面,本发明实施例提供一种字符串序列的压缩索引方法,所述方法可以包括:
获取有序排列的字符串序列;
根据字符串序列中每个字符串的差异前缀长度,对字符串序列进行分组处理,获得M个字符串组,以使每个字符串组中首个字符 串的差异前缀长度在预设字符串范围内是最短的;
将M个字符串组依次存储到N个内存页中,内存页的索引关键字为该内存页中首个字符串组的差异前缀;
根据N个内存页的索引关键字构建包含Q层跳表的跳表索引,该跳表索引采用自下而上的方式构建,其第1层跳表可以根据N个内存页的索引关键字构建,每个跳表节点包含至少一个索引关键字、索引关键字的个数以及索引关键字的寻址信息。
需要说明的是,上述有序字符串序列可以为按照字典升序或降序排列的字符串序列。
如此,通过对有序字符串序列进行分组和分页处理,构建跳表索引。由于字符串组的差异前缀长度在一定字符串范围内是最短的,使得根据字符串组的差异前缀长度分页后的每页的索引关键字也是局部最短的,进而在页的基础上构建的跳表索引内的索引关键字的长度也是比较短的,降低了跳表索引中索引关键字的平均长度,提升了跳表节点的容纳能力,从而达到减少索引节点数量和降低索引查找复杂度的有益效果。
具体的,在第一方面的一种可实现方式中,可以采用下述方式对有序字符串序列进行分组:
设定所述M个字符串组中每个字符串组包含的字符串的个数的最小阈值Wmin和最大阈值Wmax
以第1个字符串为第1个字符串组的起始字符串,依次计算以该字符串为首的后续Wmax个字符串的差异前缀长度;
确定后续第Wmin个字符串到后续第Wmax个字符串之间的字符串中差异前缀长度最小的字符串,将第1个字符串至该差异前缀长度最小的字符串间的所有字符串分为第1个字符串组;
将差异前缀长度最小的字符串作为第2个字符串组的起始字符串,重复上述过程获得第2个字符串组;
如此,按照上述分组方式可以对后续字符串进行分组,直至将所有字符串全部分组完成。
需要说明的是,每个字符串组所能容纳的字符串的阈值可以相同,也可以不同。
具体的,在第一方面的一种可实现方式中,可以根据所述M个字符串组中每个字符串组的差异前缀长度,将M个字符串组依次存储到N个内存页中,以使每个内存页中首个字符串组的差异前缀长度在预设字符串组范围内是最短的,具体实现如下:
设定所述N个内存页中每个内存页的最小容量Cmin和最大容量Cmax
将第1个字符串组写入第1个内存页;
以第1个字符串组为起始字符串组,按照字符串组的排序,依次将其后续的至少一个字符串组写入第1个内存页;
若写入第i个字符串组时,第1个内存页中被占用的存储容量C占用与Cmin的差值小于第i个字符串组的存储开销,则计算第1个内存页中可使用的存储容量(Cmax-C占用)所能容纳的Nmore个字符串组;
确定Nmore个字符串组中差异前缀最小的字符串组,将第i个字符串组、以及第i个字符串组到差异前缀最小的字符串组之间的字符串组依次写入第1个内存页中,至此获得第1个内存页;
接下来,将差异前缀长度最小的字符串组作为第2个内存页的首个字符串组,写入第2个内存页,并按照上述方法确定第2个内存页,如此,重复上述过程,直至将所有字符串组分页完成。
由于,字符串组中各字符串间存在共享前缀,因此,为了降低字符串在存储过程中占用的内存,提供压缩效率,在将字符串组写入内存页的过程中,可以将字符串组除首个字符串之外的字符串以压缩形式写入内存页中,具体实现如下:
将字符串组的首个字符串以不压缩形式写入到内存页的可用空间;
对该字符串组中除首个字符串之外的其他任一字符串,获取其他任一字符串与其相邻的前一个字符串间的共享前缀,将其他任一 字符串与其相邻的前一个字符串间的共享前缀长度、以及其他任一字符串中在共享前缀之外的后缀字符串写入到该内存页的剩下的可用空间。
其中,为了便于后续查找字符串组,在将字符串组写入内存页之后,还需要在内存页尾部的存储空间中逆序写入:各个字符串组在内存页中所处的地址信息、以及该内存页中包含的字符串组的个数。
具体的,在第一方面的一种可实现方式中,所述Q层跳表可以为逐层构建的多层跳表,所述Q层跳表中的第q层跳表根据第q-1层跳表中间隔为稀疏系数F的跳表节点的首个索引关键字构建,所述F为大于等于1的整数,所述q为大于等于2的整数,具体的,跳表索引的构建过程如下:
将N个内存页的索引关键字依次写入第1层跳表的跳表节点,并在每个跳表节点中记录该跳表节点所包含的索引关键字的个数以及索引关键字的寻址信息,其中,第1层跳表中每个跳表节点中索引关键字的寻址信息用于:指示与该寻址信息相对应的索引关键字所处的内存页;
对于第1层跳表之上的第q层跳表,2≤q≤Q,获取第q-1层跳表中,以第q-1层跳表中的首个跳表节点为起始节点,间隔为F的至少一个跳表节点的首个索引关键字;
将所述第q-1层中至少一个跳表节点的首个索引关键字依次写入第q层跳表的跳表节点,并在第q层跳表的每个跳表节点中记录该跳表节点所包含的索引关键字的个数以及索引关键字的寻址信息,其中,第q层跳表中每个跳表节点中索引关键字的寻址信息用于:指示与该寻址信息相对应的索引关键字在第q-1层跳表中所在的跳表节点;
如此,自下而上构建跳表索引,直至当跳表索引的层数或最上层跳表索引的跳表节点数量满足预设条件时,结束构建跳表索引。
其中,为了构造简单,跳表索引中每个跳表节点的长度可设置 为固定长度,以适用于内存页的索引关键字长度较为平均的场景;当跳表节点采用固定长度时,可以将从下层提取的间隔为F的索引关键字依次写入当前层跳表中的跳表节点,若当前跳表节点存储空间已满,则写入该层的下一跳表节点,直至将从下层提取的索引关键字完全写入本层的跳表节点。
此外,当内存页的索引关键字长度差异较大时,有必要采取措施控制各跳表节点第1个索引关键字的长度,从而降低上层跳表索引的索引开销,提升索引的空间效率。具体的,可以采用可变长度跳表节点的跳表构建方法,将下层的索引关键字依次写入上层跳表节点中,具体实现方式如下:
获取第q-1层跳表中,以第q-1层跳表中的首个跳表节点为起始节点,间隔为F的至少一个跳表节点的首个索引关键字;
将至少一个跳表节点的首个索引关键字依次写入第q层第1个跳表节点;
若写入第i个索引关键字时,第1个跳表节点中被占用的长度L占用与Lnode-min的差值小于第i个索引关键字的存储开销,则计算第1个跳表节点中可使用的长度(Lnode-max-L占用)所能容纳的Nnode-more个索引关键字,其中,Lnode-min为每个跳表节点的最小长度,Lnode-max为每个跳表节点的最大长度;
确定Nnode-more个索引关键字中最短的索引关键字,将第i个索引关键字、以及第i个索引关键字到最短索引关键字之间的索引关键字写入到第1个跳表节点;
将最短索引关键字作为第q层第2个跳表节点的首个索引关键字写入第2个跳表节点,按照上述方式完成第2个跳表节点的构建;
重复上述过程,直至将从q-1层跳表提取到的至少一个索引关键字全部写入第q层跳表中。
如此,可以保证每层跳表中每个跳表节点的首个索引关键字是局部最短的。
进一步的,在第一方面的一种可实现方式中,在跳表索引构建 之后,可以根据建立的跳表索引,自上而下查找与待查询字符串相关联的一些字符串,其具体实现如下:
获取待查询字符串;
自上而下查找所述跳表索引中的每层跳表,确定所述Q层跳表中的第t层跳表第j个跳表节点存储有与所述待查询字符串相匹配的第一索引关键字,其中,所述第一索引关键字的寻址信息指示:第t-1层跳表第r个跳表节点,查找所述第t-1层跳表第r个跳表节点中的索引关键字;
确定所述第t-1层跳表第r个跳表节点中存储有与所述待查询字符串相匹配的第二索引关键字,其中,所述第二索引关键字的寻址信息指示:第t-2层跳表第s个跳表节点,查找所述第t-2层跳表第s个跳表节点中的索引关键字;
重复上述过程,直至根据第1层跳表第d个跳表节点中存储的与所述待查询字符串相匹配的第三索引关键字,查找第h个内存页中每个字符串组的差异前缀,其中,所述第三索引关键字的寻址信息指示:所述第h个内存页;
确定所述第h个内存页中第w个字符串组的差异前缀与所述待查询字符串相匹配,查找所述第w个字符串组中的匹配字符串并返回查询结果。
需要说明的是,当字符串组中的字符串采用压缩方式写入内存页时,还需要将字符串解压后作为与待查询字符串相关联的字符串。
进一步的,在第一方面的一种可实现方式中,当在原有字符串序列中插入新字符串时,所述方法还可以包括:
确定插入的新字符串所属的第一内存页和第一字符串组;
将所述新字符串插入所述第一字符串组;
若插入新字符串后,第一字符串组内的字符串数量超过阈值,则获取与第一字符串组相邻的第二字符串组,并对第一字符串组和第二字符串组重新分组;
将重新分组后的字符串组顺序写入第一内存页,若在写入字符 串组的过程中第一内存页中有字符串组溢出,则将溢出的字符串组写入与第一内存页相邻的下一内存页。
进一步的,在第一方面的一种可实现方式中,当删除原有字符串序列中的字符串时,所述方法还可以包括:
删除字符串序列中的第一字符串,第一字符串位于第二内存页和第三字符串组;
若删除第一字符串后,所述第三字符串组内的字符串数量小于阈值,则获取与所述第三字符串组相邻的第四字符串组,并对所述第三字符串组和所述第四字符串组重新分组;
将重新分组后的字符串组顺序写入第二内存页,若与第二内存页相邻的内存页、以及第二内存页的数据量之和小于一个内存页的数据量阈值,则合并这两个内存页。
需要说明的是,若对字符串序列插入或删除字符串后,引起内存页数量或者内存页的索引关键字发生变化,则需要自下而上依次更新跳表索引中的跳表节点,直至跳表索引重建完成。
由于在本发明实施例中,分组、分页和跳表索引均有一定的空间弹性,因此,插入或删除字符串一般只引起局部重构,不需完全重建跳表索引,效率较高。
第二方面,本发明实施例提供一种压缩索引装置,用于执行第一方面所述的方法,所述装置可以包括:
获取单元,用于获取有序排列的字符串序列;
分组单元,用于根据字符串序列中每个字符串的差异前缀长度,对字符串序列进行分组处理,获得M个字符串组,每个字符串组的差异前缀长度为该字符串组中首个字符串组的差异前缀长度,以使每个字符串组中首个字符串的差异前缀长度在预设字符串范围内是最短的;
分页单元,用于将分组单元获得的M个字符串组依次存储到N个内存页中,内存页的索引关键字为该内存页中首个字符串组的差异前缀;
跳表索引构建单元,用于根据所述分页单元获得的N个内存页的索引关键字构建包含Q层跳表的跳表索引,其第1层跳表可以根据N个内存页的索引关键字构建,每层跳表包含至少一个跳表节点,每个跳表节点包含至少一个索引关键字、索引关键字的个数以及索引关键字的寻址信息。
需要说明的是,上述有序字符串序列可以为按照字典升序或降序排列的字符串序列。
如此,通过对有序字符串序列进行分组和分页处理,构建跳表索引。由于字符串组的差异前缀长度在一定字符串范围内是最短的,使得根据字符串组的差异前缀长度分页后的每页的索引关键字也是局部最短的,进而在页的基础上构建的跳表索引内的索引关键字的长度也是比较短的,降低了跳表索引中索引关键字的平均长度,提升了跳表节点的容纳能力,从而达到减少索引节点数量和降低索引查找复杂度的有益效果。
其中,分组单元的具体执行过程与第一方面所述的分组过程相同,分页单元的具体执行过程与第一方面所述的分页过程相同,跳表索引构建单元的具体执行过程与第一方面所述的跳表索引构建方法相同。
进一步的,所述压缩索引装置还可以包括:查询单元,用于查询字符串序列中与待查询字符串相关联的字符串,其具体执行过程与第一方面所述的字符串查询过程相同。
进一步的,所述压缩索引装置还可以包括:字符串插入单元,用于向字符串序列中插入新字符串,其具体执行过程与第一方面所述的插入新字符串的过程相同,
进一步的,所述压缩索引装置还可以包括:字符串删除单元,用于删除字符串序列中的字符串,其具体执行过程与第一方面所述的删除字符串序列中字符串的过程相同。
需要说明的是,上述压缩索引装置可以设置在数据存储系统的任一计算机中,也可以独立于任何设备设置在数据存储系统中;第 二方面所述的获取单元可以为压缩索引装置中的收发器,第二方面中的分组单元、分页单元、跳表索引构建单元、查询单元、字符串插入单元、字符串删除单元可以为单独设立的处理器,也可以集成在压缩索引装置的某一个处理器中实现,此外,也可以以程序代码的形式存储于压缩索引装置的存储器中,由压缩索引装置的某一个处理器调用并执行以上分组单元、分页单元、跳表索引构建单元、查询单元、字符串插入单元以及字符串删除单元。这里所述的处理器可以是一个中央处理器(Central Processing Unit,CPU),或者是特定集成电路(Application Specific Integrated Circuit,ASIC),或者是被配置成实施本发明实施例的一个或多个集成电路。
由上可知,本发明实施例提供一种字符串序列的压缩索引方法及装置,获取有序排列的字符串序列,根据所述字符串序列中每个字符串的差异前缀长度对所述字符串序列进行分组处理,获得M个字符串组,每个字符串组中首个字符串的差异前缀长度在预设字符串范围内是最短的,将M个字符串组依次存储到N个内存页中,根据所述N个内存页的索引关键字构建跳表索引。如此,在对有序字符串序列进行分组和分页处理后,构建跳表索引。由于每个字符串组的差异前缀长度在局部是最短的,使得根据字符串组的差异前缀长度分页后的每页的索引关键字也是局部最短的,进而在页的基础上构建的跳表索引内的索引关键字的长度也是比较短的,降低了跳表索引中索引关键字的平均长度,提升了跳表节点的容纳能力,从而达到减少索引节点数量和降低索引查找复杂度的有益效果,避免了现有CS-Prefix-Tree编码索引过程中,底层叶节点存在过长的差异前缀长度,导致编码索引分支节点的容纳能力下降,增加分支节点数量和查找复杂度的问题。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于 本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为现有CS-Prefix-Tree索引结构图;
图2为本发明实施例提供的压缩索引装置10的结构图;
图3为本发明实施例提供的一种字符串序列的压缩索引方法的流程图;
图4为本发明实施例提供的有序字符串序列分组及分页过程示意图;
图5为本发明实施例提供的页内字符串编码存储示意图;
图6为本发明实施例提供的采用固定长度节点构建跳表索引的过程示意图;
图7为本发明实施例提供的采用可变长度节点构建跳表索引的过程示意图;
图8为本发明实施例提供的压缩索引装置20的结构图。
具体实施方式
本发明的核心思想是:对多个有序字符串进行分组处理,使各组间相邻字符串的差异前缀长度最短,再对多个字符串组进行分页处理,使页间相邻字符串的差异前缀长度最短,定义页的索引关键字为其所容纳的首个字符串的差异前缀,在页的基础上,逐层构建跳表索引,跳表索引用于通过索引关键字查找页内分组中的字符串;需要说明的是,分组及分页过程不改变字符串的有序性,各组及各页之间的顺序与其所容纳字符串之间的顺序相同。
下面结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整的描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
图2为本发明实施例提供的压缩索引装置10的结构图,用于执行本发明所提供的压缩索引方法。所述压缩索引装置10可以为数据 库系统中可进行数据存储的装置,可以设置在任一计算机中,也可以独立于任一设备存在于数据存储系统,具体的,如图2所示,所述压缩索引装置10可以包括:处理器1011、收发器1012、存储器1013、以及至少一个通信总线1014,通信总线1014用于实现这些装置之间的连接和相互通信;
处理器1011可能是一个中央处理器(Central Processing Unit,简称为CPU),也可以是特定集成电路(Application Specific Integrated Circuit,ASIC),或者是被配置成实施本发明实施例的一个或多个集成电路,例如:一个或多个微处理器(Digital Signal Processor,DSP),或,一个或者多个现场可编程门阵列(Field Programmable Gate Array,FPGA)。
收发器1012可用于与外部网元之间进行数据交互。
存储器1013,可以是易失性存储器(Volatile Memory),例如随机存取存储器(Random-Access Memory,RAM);或者非易失性存储器(Non-Volatile Memory),例如只读存储器(Read-Only Memory,ROM),快闪存储器(Flash Memory),硬盘(Hard Disk Drive,HDD)或固态硬盘(Solid-State Drive,SSD);或者上述种类的存储器的组合。
通信总线1014可以分为地址总线、数据总线、控制总线等,可以是工业标准体系结构(Industry Standard Architecture,ISA)总线、外部设备互连(Peripheral Component Interconnect,PCI)总线或扩展工业标准体系结构(Extended Industry Standard Architecture,EISA)总线等。为便于表示,图2中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
具体的,当处理器1011获取到有序字符串序列后,根据字符串序列中每个字符串的差异前缀长度,对所述字符串序列进行分组处理,获得M个字符串组,以使每个字符串组中首个字符串的差异前缀长度在预设字符串范围内是最短的;
将M个字符串组依次存储到N个内存页中,每个内存页的索引 关键字为:该内存页中首个字符串组的差异前缀;
根据N个内存页的索引关键字构建包含Q层跳表的跳表索引,所述Q层跳表中的第1层跳表根据所述N个内存页的索引关键字构建,每层跳表包含至少一个跳表节点;每个跳表节点包含至少一个索引关键字、索引关键字的个数以及索引关键字的寻址信息,以便后续通过跳表索引中的索引关键字查找某页内某组中的字符串。
需要说明的是,在将M个字符串组依次存储到N个内存页中的过程中,可以根据所述M个字符串组中每个字符串组的差异前缀长度,将所述M个字符串组依次存储到所述N个内存页中,以使所述N个内存页中的每个内存页的首个字符串组的差异前缀长度在预设字符串组范围内是最短的。
所述Q层跳表可以为逐层构建的多层跳表,所述Q层跳表中的第q层跳表根据第q-1层跳表中间隔为稀疏系数F的跳表节点的首个索引关键字构建,所述F为大于等于1的整数,所述q为大于等于2的整数。
如此,在对有序字符串序列进行分组和分页处理后,构建跳表索引。由于每个字符串组的差异前缀长度在预设字符串范围内是最短的,使得根据字符串组的差异前缀长度分页后的每页的索引关键字也是局部最短的,进而在页的基础上构建的跳表索引内的索引关键字的长度也是比较短的,降低了跳表索引中索引关键字的平均长度,提升了跳表节点的容纳能力,从而达到减少索引节点数量和降低索引查找复杂度的有益效果。
为了便于描述,以下实施例一以步骤的形式示出并详细描述了本发明提供的字符串序列的压缩过程,其中,示出的步骤也可以在一组可执行指令的计算机系统中执行。此外,虽然在图中示出了逻辑顺序,但是在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤。
实施例一
图3为本发明实施例提供的一种字符串序列的压缩索引方法的 流程图,由图2所示的压缩索引装置10执行,如图3所示,所述字符串序列的压缩索引方法可以包括以下步骤:
S101:获取字符串序列,所述字符串序列包含有序排列的一个以上字符串。
可选的,可以从列存数据库中直接读取字符串序列。
需要说明的是,有序排列的一个以上字符串可以按照字典升序排列,也可以按照字典降序排列,本发明实施例对此不进行限定,本发明仅以按照字典升序排列的字符串序列为例对本发明提供的压缩索引方法进行说明。例如,图3左侧的字符串序列就是按照”A~Z”的字典升序排列的字符串序列。
S102:根据所述字符串序列中每个字符串的差异前缀长度对所述字符串序列进行分组处理,获得M个字符串组,以使每个字符串组中首个字符串的差异前缀长度在预设字符串范围内是最短的,其中,所述M为大于等于1的整数,每个字符串组包含至少一个字符串,每个字符串组的差异前缀为该字符串组中首个字符串的差异前缀。
不失一般性,设给定的字符串序列为S={s1,s2,s3,…,sn},若该字符串序列按照字典升序排列,当字符串si排列在字符串sj前面时,规定si<sj。若si与sj相邻且si<sj,则称si是sj的前驱字符串,sj是si的后继字符串,若两者共享前缀长度为l,则后继字符串sj的差异前缀为:字符串sj中长度为l+1的前缀子串;需要说明的是,对于字符串序列中的首个字符串,规定其前驱字符串为空串,其对应的差异前缀长度为1。需要说明的是,上述字符串si排列在字符串sj前面可以指按照字典升序,字符串si先于字符串sj排列。例如,先后排列的两个字符串“abe”和“afe”,字符串“abe”是字符串“afe”的前驱字符串,二者共享前缀为“a”,共享前缀长度为1,则字符串“afe”的差异前缀为:该字符串中长度为2的前缀子串“af”。
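The differential-prefix computation defined above can be sketched in a few lines (a minimal illustration; the function name and the use of Python are ours, not the patent's):

```python
def diff_prefix(pred: str, s: str) -> str:
    """Differential prefix of s with respect to its predecessor pred:
    the shortest prefix of s that distinguishes it from pred, i.e. the
    shared prefix extended by one character.  For the first string of a
    sequence, pred is the empty string, so the length is 1."""
    shared = 0
    while shared < len(pred) and shared < len(s) and pred[shared] == s[shared]:
        shared += 1
    return s[:shared + 1]

# "abe" precedes "afe": shared prefix "a" (length 1) -> diff prefix "af"
```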
可选的,在本发明实施例中,可以通过下述方法获得M个字符 串组中的第m个字符串组,1≤m≤M,即第m个字符串组为M个字符串组中的任一字符串组:
设定M个字符串组中每个字符串组包含的字符串的个数的最小阈值Wmin、最大阈值Wmax
确定第m个字符串组的首个字符串;
以所述第m个字符串组的首个字符串为起始字符串,依次计算后续Wmax个字符串中每个字符串的差异前缀长度;
确定第k个字符串,第k个字符串为后续第Wmin个字符串到后续第Wmax个字符串中差异前缀长度最小的字符串,Wmin≤k≤Wmax
将所述第m个字符串组的首个字符串到第k-1个字符串之间的字符串的集合确定为所述第m个字符串组,并将第k个字符串作为第m+1个字符串组的首个字符串。
将第k个字符串作为第m+1个字符串组的首个字符串后,可以按照上述方式确定第m+1个字符串组,如此重复进行,直至将字符串序列中的字符串处理完成,可以将字符串序列按照字符串序列的排序分成M个字符串组。
需要说明的是,对于第1个字符串组,其首个字符串为字符串序列中的第1个字符串,此外,当第Wmin个字符串到第Wmax个字符串中差异前缀长度最小的字符串不止一个时,通常将第Wmin个字符串到第Wmax个字符串中差异前缀长度最小的字符串中排在最前面的字符串作为下一分组的首个字符串。
其中,每个字符串组所能包含的字符串的个数的最小阈值Wmin是指字符串组最少可容纳的字符串的个数;最大阈值Wmax是指字符串组最多可容纳的字符串的个数,二者可以根据需要进行设置,本发明实施例对此不进行限定,并且,各字符串组所能容纳的字符串的个数的最小阈值Wmin和最大阈值Wmax可以相同,也可以不相同。
例如,图4左侧给出了一组有序字符串序列,设每个字符串组的Wmin=2,Wmax=8,首先,将该字符串序列中的第1个字符串 “Alabama A&M University(AL)”作为第1个字符串组的首个字符串,以该字符串为起始字符串,计算“Alabama A&M University(AL)”~“American University(DC)”8个字符串的差异前缀长度“1、8、2、16、9、11、15、9”,确定第2~第10个差异前缀长度中差异前缀长度最短的字符串为“American College(PA)”,此时,可以将字符串“American College(PA)”之前到第1个字符串的字符串划分到第1个字符串组,同时,将字符串“American College(PA)”作为第2个字符串组的首个字符串,重复上述过程,确定第2个字符串组,以及后续其他分组,直至将字符串序列中每个字符串分组完成,并将分组后的10个组依次标记为G1到G10。
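The grouping of step S102 can be sketched as follows (a simplified sketch assuming ascending order and uniform Wmin/Wmax across groups; names are illustrative):

```python
def group_strings(strings, w_min, w_max):
    """Greedy grouping per S102: from each group's first string, examine
    the w_min-th through w_max-th successors and open the next group at
    the earliest one whose differential prefix length is minimal, so the
    first string of every group has a locally shortest differential
    prefix."""
    def dpl(i):  # differential prefix length of strings[i]
        pred = strings[i - 1] if i > 0 else ""
        s = strings[i]
        shared = 0
        while shared < len(pred) and shared < len(s) and pred[shared] == s[shared]:
            shared += 1
        return shared + 1

    groups, start = [], 0
    while start < len(strings):
        cand_lo = start + w_min
        cand_hi = min(start + w_max, len(strings) - 1)
        if cand_lo > cand_hi:            # too few strings left to split again
            groups.append(strings[start:])
            break
        k = min(range(cand_lo, cand_hi + 1), key=dpl)  # earliest minimum
        groups.append(strings[start:k])
        start = k                        # the k-th string opens the next group
    return groups
```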
S103:将所述M个字符串组依次存储到N个内存页中,所述N为大于等于1的整数,每个内存页包含至少一个字符串组,每个内存页的索引关键字为:该内存页中首个字符串组的差异前缀。
其中,N个内存页的地址可以是连续的,也可以是不连续的,每个内存页的大小可以为计算机系统缓存块大小Cblock的整数倍,各内存页的大小可以相同,也可以不同。
可选的,在本发明实施例中,可以根据所述M个字符串组中每个字符串组的差异前缀长度,将所述M个字符串组依次存储到所述N个内存页中,以使所述N个内存页中的每个内存页的首个字符串组的差异前缀长度在预设字符串组范围内是最短的;
具体的,将字符串组存储到N个内存页中的第n个内存页中,1≤n≤N,即第n个内存页为N个内存页中任一内存页,可以包括:
设定所述N个内存页中每个内存页的最小容量Cmin和最大容量Cmax
确定第n个内存页的首个字符串组;
以所述第n个内存页的首个字符串组为起始字符串组,依次将所述M个字符串组中至少一个字符串组存储到第n个内存页;
若存储到第i个字符串组时,所述第n个内存页中被占用的存储容量C占用与Cmin的差值小于第i个字符串组的存储开销,则计算 所述第n个内存页中可使用的存储容量(Cmax-C占用)所能容纳的Nmore个字符串组,所述Nmore个字符串组为:以所述第i个字符串组开始依次排列的Nmore个字符串组;
确定所述Nmore个字符串组中差异前缀最短的字符串组,将所述第i个字符串组、以及所述第i个字符串组到所述前缀最短字符串组之间的字符串组依次存储到所述第n个内存页中,将所述前缀最短字符串组作为第n+1个内存页的首个字符串组。
确定第n+1个内存页的首个字符串后,可以按照上述方式依次将相应的字符串组存储到第n+1个内存页中,如此重复进行,可以将字符串组按照排序依次存入到N个内存页中。
需要说明的是,对于第1个内存页,其首个字符串组为M个字符串组中的第1个字符串组,此外,当所述Nmore个字符串组中差异前缀最小的字符串组不止一个时,通常将Nmore个字符串组中差异前缀最小的字符串组中排在最前面的字符串组作为下一分页的首个字符串组。
其中,每个内存页最小容量Cmin和最大容量Cmax可以根据内存页实际可存储的存储容量进行设置,本发明实施例对此不进行限定,并且,各内存页的最小容量Cmin和最大容量Cmax可以相同,也可以不相同;可选的,最小容量Cmin和最大容量为Cmax均为计算机系统缓存块大小Cblock的整数倍。
需要说明的是,内存页中的字符串组可以在页内重新进行编号,不需要与分组后的字符串组的编号相同。其中,为了快速定位到页内各分组,在构造内存页的过程中还需要记录内存页中每个分组的起始地址,并在分组写入完成后,将所有地址逆序写入页尾部预留的索引空间。此外,还可以在页头或页尾预留固定长度字段,记录页内分组数量。为了便于查找,还需要存储每个内存页的地址信息。
例如,图4右侧为将字符串组分页的过程示意图,从字符串组G1开始,依次将G1、G2、G3存储到第1个内存页p1,若内存页p1容纳分组G1、G2和G3后,其被占用的容量接近最小容量阈值 Cmin,但到最小容量阈值Cmin的可使用容量又不够存储分组G4,则依次向前查找两个分组G4、G5,若确定内存页p1存储G4、G5达到最大容量阈值Cmax,则将分组G4、G5中最短差异前缀长度的分组G4作为下一内存页p2的起始首个字符串分组,将G4之前G1~G3的3个字符串组存储到内存页p1中,重复上述过程,直到完成p2、p3分页,每个页内部的分组用g1、g2、g3等顺序编址,其中,图中p1、“A”分别代表内存页p1的地址和索引关键字。
进一步的,由于字符串组中的字符串序列间具有共享前缀,因此,为了提高字符串存储效率,在本发明实施例中,还可以将字符串组中除首个字符串之外的字符串以压缩形式存储到内存页中,即对于待存储到N个内存页中第y个内存页的M个字符串组的第x个字符串组,1≤x≤M,1≤y≤N,可以通过下述压缩存储方式将第x个字符串组存储到第y个内存页中:
将第x个字符串组的首个字符串以不压缩形式写入到第y个内存页的可用空间;
对所述第x个字符串组中除首个字符串之外的其他任一字符串,获取所述其他任一字符串与其相邻的前一个字符串间的共享前缀,将其他任一字符串与其相邻的前一字符串间的共享前缀长度、以及所述其他任一字符串中在共享前缀之后的后缀字符串写入所述第y个内存页的可用空间。
例如,图5给出了对图4中内存页p2进行编码存储的示意图,以第1个分组g1为例,首字符串“Arizona State Polytechnic Campus(AZ)”以不压缩形式存储,原组内第2个字符串“Arizona State University(AZ)”与首个字符串的共享前缀为“Arizona State”,长度为14,则将字符串“14University(AZ)”存为p2中第1个分组g1的第2个字符串,同理,第3个字符串存储为“25West(AZ)”,其中“25”代表其与第2个字符串的共享前缀“Arizona State University”的长度,待内存页p2写入分组数据后,将分组数量“3”采用逆序方式写入页尾预留空间,将分组的页内地址g3、g2、g1写 入页尾。
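The front-coding scheme of Figure 5 can be sketched as an encode/decode pair (the group contents below are reconstructed from the figure's description; the exact shared lengths depend on the exact strings):

```python
def encode_group(group):
    """Front coding within one group: the first string is stored
    uncompressed; each later string is stored as (length of the prefix
    shared with its predecessor, remaining suffix)."""
    out = [group[0]]
    for prev, s in zip(group, group[1:]):
        shared = 0
        while shared < len(prev) and shared < len(s) and prev[shared] == s[shared]:
            shared += 1
        out.append((shared, s[shared:]))
    return out

def decode_group(encoded):
    """Inverse of encode_group: rebuild the original strings."""
    strings = [encoded[0]]
    for shared, suffix in encoded[1:]:
        strings.append(strings[-1][:shared] + suffix)
    return strings
```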
从图4左侧给出的字符串序列可以看出,该字符串序列中最短的差异前缀如“A”、“B”长度为1,最长的差异前缀如“Arizona State University W”长度为26,而通过所述分组及分页方法之后,可以有效避免较长的差异前缀成为内存页的索引关键字,降低后续构建索引的存储开销。
S104:根据所述N个内存页的索引关键字构建跳表索引,所述跳表索引包含Q层跳表,所述Q为大于等于1的整数,所述跳表索引的第1层跳表根据所述N个内存页的索引关键字构建,每层跳表包含至少一个跳表节点,每个跳表节点包含至少一个索引关键字、索引关键字的个数以及索引关键字的寻址信息。
其中,所述Q层跳表可以为逐层构建的多层跳表,所述Q层跳表中的第q层跳表可以根据第q-1层跳表中间隔为稀疏系数F的跳表节点的首个索引关键字构建,所述F为大于等于1的整数,所述q为大于等于2的整数,所述稀疏系数F可以根据需要进行设定,本发明实施例对此不进行限定,每个跳表节点的长度Lnode可以为计算机系统缓存长度的整数倍。
可选的,当跳表索引包含至少两层跳表时,可以采用下述方式构建跳表索引包括:
将N个内存页的索引关键字依次写入第1层跳表的跳表节点,并在每个跳表节点中记录该跳表节点所包含的索引关键字的个数以及索引关键字的寻址信息,其中,所述第1层跳表中每个跳表节点中索引关键字的寻址信息用于指示与该寻址信息相对应的索引关键字所处的内存页;
对于所述跳表索引中所述第1层跳表之上的第q层跳表,2≤q≤Q,获取所述跳表索引中第q-1层跳表中,以第q-1层跳表中的首个跳表节点为起始节点,间隔为F的至少一个跳表节点的首个索引关键字;
将所述至少一个跳表节点的首个索引关键字依次写入第q层跳 表的跳表节点,并在第q层跳表的每个跳表节点中记录该跳表节点所包含的索引关键字的个数以及索引关键字的寻址信息,其中,所述第q层跳表中每个跳表节点中索引关键字的寻址信息用于指示与该寻址信息相对应的索引关键字在所述第q-1层跳表中所在的跳表节点。
如此,自下而上先构建第1层跳表,再依次向上构建第q层跳表,直至构建的层数Q或第Q层跳表包含的跳表节点的个数满足预设条件或者最上层跳表收敛至一个跳表节点时,停止构建跳表索引;其中,预设条件可以根据需要进行设置,本发明实施例对此不进行限定。
其中,为了便于构造跳表索引,可以采用固定长度的跳表节点。当采用固定长度的跳表节点时,将从第q-1层跳表提取的索引关键字依次写入第q层跳表的跳表节点可以包括:
按照索引关键字排序,依次将索引关键字写入第q层跳表的跳表节点;每写入一个索引关键字,记录其对应的寻址信息,并更新跳表节点内索引关键字数量,计算跳表节点的剩余可用空间;
若当前正写入的跳表节点无法容纳下一个索引关键字,则分配新的跳表节点,按照上述方式写入索引关键字,直到所有索引关键字及寻址信息写入完成。
同理,将N个内存页的索引关键字写入第1层跳表中的跳表节点与上述过程相同,在此不再详细赘述。
例如,图6给出了采用固定长度跳表节点的保序压缩索引示意图,第1层跳表共有7个节点,地址分别为n1-1到n1-7,记录了全部39个页的索引关键字和地址,以第1层第1个跳表节点为例,其地址为n1-1,第1个字段“3”代表记录3个索引关键字,第2个字段记录3个索引关键字分别为“A”、“Ar”和“B”,第3个字段记录3个索引关键字在节点内的偏移地址和对应内存页的地址;如(o1,p1)代表索引关键字“A”记录在节点n1-1中偏移量为o1的位置,“A”所对应内存页的地址为p1。需要说明的是,索引关键字“A”、 “Ar”和“B”紧跟第1个字段顺序写入,而(o1,p1)、(o2,p2)和(o3,p3)从节点尾部开始逆序写入,从而可以使空闲空间集中在第2、3字段中间,以最大化节点的容纳能力。
设定稀疏系数F=2,则图6中第1层跳表中间隔为2的4个节点:第1、3、5、7节点可被依次索引到第2层跳表中地址分别为n2-1和n2-2的两个跳表节点中。以第2层第1个节点为例,其地址为n2-1,第1个字段“3”代表记录了3个索引关键字,第2个字段分别记录n1-1、n1-3和n1-5所容纳的第1个索引关键字,包括“A”、“C”等,其中n1-5的信息限于图片尺寸未明确列出,第3个字段以逆序方式记录3个索引关键字在节点内的偏移地址和对应下层跳表节点的地址。如(o1,n1-1),代表索引关键字“A”记录在节点n2-1中偏移量为o1的位置,“A”所对应的下层跳表节点的地址为n1-1。
此时,若第2层中跳表节点的个数或跳表层数2满足预设条件,则停止构建跳表索引,否则,继续按照上述方法构建跳表索引,直至最上层跳表中跳表节点的个数或跳表层数满足预设条件。
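The bottom-up construction with fixed-capacity nodes can be sketched as follows (a simplified sketch: capacity is counted in keys rather than bytes, and F * node_capacity > 1 is assumed so that each layer shrinks; names are illustrative):

```python
def build_skip_index(page_entries, node_capacity, F=2):
    """Bottom-up skip-table construction: layer 1 records every page's
    (index key, page address); each higher layer records the first key
    of every F-th node of the layer below, addressed by that node's
    position, until a layer fits in a single node."""
    def pack(entries):                    # entries: list of (key, address)
        return [([k for k, _ in entries[i:i + node_capacity]],
                 [a for _, a in entries[i:i + node_capacity]])
                for i in range(0, len(entries), node_capacity)]

    layers = [pack(list(page_entries))]   # layer 1 indexes the pages
    while len(layers[-1]) > 1:
        below = layers[-1]
        # first key of every F-th node below, with the node's index as address
        layers.append(pack([(below[i][0][0], i)
                            for i in range(0, len(below), F)]))
    return layers
```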
进一步的,当各内存页的索引关键字长度差异较大时,为了降低上层跳表的索引开销,提高索引的空间效率,在本发明实施例中,还可以采用可变长度的跳表节点构建跳表索引。当采用可变长度的跳表节点构建跳表索引时,所述将从第q-1层跳表提取的至少一个跳表节点的首个索引关键字依次写入第q层跳表的跳表节点可以包括:
以至少一个跳表节点的首个索引关键字中的第1个索引关键字为起始索引关键字,依次将至少一个跳表节点的首个索引关键字写入到第q层跳表的跳表节点;
若写入第i个索引关键字时,正在写入的跳表节点中被占用的长度L占用与Lnode-min的差值小于第i个索引关键字的存储开销,则计算所述正在写入的跳表节点中可使用的长度(Lnode-max-L占用)所能容纳的Nnode-more个索引关键字,所述Nnode-more个索引关键字为:以所述第i个索引关键字开始依次排列的Nnode-more个索引关键字;
确定所述Nnode-more个索引关键字中最短的索引关键字,将所述第i个索引关键字、以及所述第i个索引关键字到所述Nnode-more个索引关键字中最短的索引关键字之间的索引关键字写入到正在写入的跳表节点中,将所述最短的索引关键字作为下一跳表节点的首个索引关键字写入所述下一跳表节点。
确定下一跳表节点的首个索引关键字后,可以按照上述方式依次将相应的索引关键字写入下一跳表节点中,如此重复进行,可以将索引关键字按照排序依次存入到第q层跳表的跳表节点。需要说明的是,在计算剩余可用长度(Lnode-max-L占用)时,需扣除索引关键字对应寻址信息的预留存储开销。
其中,Lnode-min为每个跳表节点的最小长度,Lnode-max为每个跳表节点的最大长度,每个跳表节点的最小长度Lnode-min和最大长度Lnode-max可以根据跳表节点实际长度进行设置,本发明实施例对此不进行限定,并且,各跳表节点的最小长度Lnode-min和最大长度Lnode-max可以相同,也可以不相同。
例如,图7给出了采用可变长度跳表节点构建跳表索引的示意图。图7中,第1层跳表记录了全部39个内存页的索引关键字和地址,以第1层跳表的第1个跳表节点为例,其长度2*Lline为2倍计算机系统缓存线长度,地址为n1-1,第2个字段“5”代表记录了5个索引关键字,第3个字段记录5个索引关键字分别为“A”、“Ar”、“B”、“Bo”和“Bu”,第4个字段记录5个索引关键字在节点内的偏移地址和对应页的地址。如(o1,p1)代表索引关键字“A”记录在节点n1-1中偏移量为o1的位置,“A”所对应页的地址为p1。需要说明的是“A”、“Ar”、“B”、“Bo”和“Bu”紧跟第2个字段顺序写入,(o1,p1)、(o2,p2)、(o3,p3)、(o4,p4)和(o5,p5)从节点尾部开始逆序写入,从而可以使空闲空间集中在第3、4字段中间,以最大化节点的容纳能力。
设定稀疏系数F=1,因此第1层跳表中的第1至5个跳表节点可被索引到第2层跳表。以第2层第1个跳表节点为例,其地址为 n2-1,长度1*Lline为1倍计算机系统缓存线长度,第2个字段“5”代表记录了5个索引关键字;第3个字段分别记录n1-1到n1-5所容纳的第1个索引关键字,包括“A”、“C”、…、“Y”等,部分节点信息限于图片大小未明确列出;第4个字段以逆序方式记录5个索引关键字在节点内的偏移地址和对应下层跳表节点的地址。如(o1,n1-1),代表索引关键字“A”记录在节点n2-1中偏移量为o1的位置,“A”所对应的下层跳表节点的地址为n1-1。图中第2层跳表共有1个节点,地址为n2-1,且创建第2层跳表之后索引构造完成。
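The variable-length packing rule just described (fill a node to at least Lnode-min, then start the next node at the shortest key that still fits within Lnode-max) can be sketched as follows (lengths counted in characters, addressing overhead ignored; names are illustrative):

```python
def pack_variable_nodes(keys, l_min, l_max):
    """Pack index keys into variable-length skip-table nodes so that
    every node's first key is locally shortest."""
    nodes, i = [], 0
    while i < len(keys):
        node, used = [], 0
        # fill until the next key would push the node past l_min
        while i < len(keys) and used + len(keys[i]) <= l_min:
            node.append(keys[i])
            used += len(keys[i])
            i += 1
        if i < len(keys):
            # candidates: the following keys that still fit within l_max
            cands, extra, j = [], 0, i
            while j < len(keys) and used + extra + len(keys[j]) <= l_max:
                cands.append(j)
                extra += len(keys[j])
                j += 1
            if cands:
                k = min(cands, key=lambda t: len(keys[t]))  # earliest shortest
                node.extend(keys[i:k])
                i = k                    # the shortest key opens the next node
        if not node:                     # a key longer than l_min: store alone
            node.append(keys[i])
            i += 1
        nodes.append(node)
    return nodes
```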
可理解的是,在本发明实施例中,每层跳表中各跳表节点的存储空间可以是连续分配的,也可以是非连续分配的。当每层跳表中各跳表节点的存储空间可以是连续分配时,若知该层跳表中第1个跳表节点的存储地址Addr1,则可以根据Addri=Addr1+(i-1)*Lnode计算得出该层跳表中其他跳表节点的存储地址,因此,可以仅将跳表的起始地址和结束地址记录到如<n1-start,n1-end>的元组,以避免查找过程中的访问越界。
当每层跳表中各跳表节点的存储空间不连续时,则需要采用链表结构,在每个跳表节点内增加指针字段,指向同层相邻的下一个跳表节点,并在每层的最后一个跳表节点设置结束标记,以避免查找过程中的访问越界。
进一步的,作为压缩索引的逆过程,当用户需要查找数据库中存储的与一字符串有关的所有字符串时,可以从根据跳表索引中的索引关键字自上而下查找相应的内存页,在该内存页中查找相应分组,将该分组中的字符串反馈给用户;具体的,所述方法还可以包括:
获取待查询字符串;
自上而下查找所述跳表索引中的每层跳表,确定所述Q层跳表中第t层跳表第j个跳表节点存储有与所述待查询字符串相匹配的第一索引关键字,其中,所述第一索引关键字的寻址信息指示:第t-1层跳表第r个跳表节点,查找所述第t-1层跳表第r个跳表节点中的 索引关键字;
确定所述第t-1层跳表第r个跳表节点中存储有与所述待查询字符串相匹配的第二索引关键字,其中,所述第二索引关键字的寻址信息指示:第t-2层跳表第s个跳表节点,查找所述第t-2层跳表第s个跳表节点中的索引关键字;
重复上述过程,直至根据第1层跳表第d个跳表节点中存储的与所述待查询字符串相匹配的第三索引关键字,查找第h个内存页中每个字符串组的差异前缀,其中,所述第三索引关键字的寻址信息指示:所述第h个内存页;
确定所述第h个内存页中第w个字符串组的差异前缀与所述待查询字符串相匹配,查找所述第w个字符串组中的匹配字符串并返回查询结果。
需要说明的是,当字符串组中的字符串采用压缩方式写入内存页时,还需要将字符串解压后作为与待查询字符串相关联的字符串。
其中,与待查询字符串相匹配的索引关键字可以为:按照字典升序先于待查询字符串排列的索引关键字,或者与待查询字符串具有共享前缀的字符串。
下面以用户需要查找所有前缀为“Art Institute”的字符串为例,结合图6和图4对根据索引关键字查找字符串的过程进行介绍:
首先,查找跳表索引的最上层跳表,图6中为第2层节点,通过比较第2层跳表第1个跳表节点n2-1中的索引关键字“A”和“C”,得知“Art Institute”应该在第1层跳表节点的n1-1和n1-3之间查找,由于“Art Institute”小于“C”,查找范围不包括n1-3。
其次,依次比较n1-1和n1-2中的第1个关键字“A”和“Bo”,由“Art Institute”小于“Bo”可知,查找范围不包括n1-2。
再次,通过比较节点n1-1中的索引关键字“Ar”和“B”,得知“Art Institute”应该在页节点p2和p3之间查找,由于“Art Institute”小于“B”,查找范围不包括p3。
然后,在图4中内存页p2内查找,首先,读取各分组的页内地址g1、g2、g3,访问各分组的第1个非压缩的字符串,比较得知“Art Institute”大于“Art Institute of Atlanta(GA)”的差异前缀“Art”且小于“Austin College(TX)”,可知前缀为“Art Institute”的字符串位于p2页中的g2分组。
最后,根据记录的共享前缀长度依次解压缩g2分组中的字符串,并返回前缀为“Art Institute”的所有结果。
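The page-internal part of this lookup can be sketched as follows (a simplified sketch of the walk through one page: the page directory stores each group's differential prefix, each group is front-coded as an uncompressed first string followed by (shared length, suffix) pairs, and the example data is reconstructed from Figures 4 and 5):

```python
import bisect

def prefix_search_page(diff_prefixes, groups, query):
    """Locate the last group whose differential prefix is <= query
    (a predecessor search mirroring the per-layer comparisons in the
    skip table), then decompress forward and collect the strings that
    start with the queried prefix."""
    gi = max(bisect.bisect_right(diff_prefixes, query) - 1, 0)
    hits = []
    for g in groups[gi:]:
        strings = [g[0]]
        for shared, suffix in g[1:]:
            strings.append(strings[-1][:shared] + suffix)
        hits += [s for s in strings if s.startswith(query)]
        if strings[-1] > query and not strings[-1].startswith(query):
            break                        # past the query range; stop scanning
    return hits
```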
进一步的,本发明实施例还可以动态地向字符串序列中插入字符串,其具体实现如下:
获取一个新字符串,所述新字符串为不在所述字符串序列中的字符串;
确定所述新字符串所属的第一内存页和第一字符串组;
将所述新字符串插入所述第一字符串组;
若插入所述新字符串后,所述第一字符串组内的字符串数量超过阈值,则获取与所述第一字符串组相邻的第二字符串组,并对所述第一字符串组和所述第二字符串组重新分组;
将重新分组后的字符串组顺序写入所述第一内存页,若所述第一内存页中有字符串组溢出,则将溢出的字符串存入与所述第一内存页相邻的下一内存页。
其中,所述第二字符串组可以为与第一字符串组相邻的下一个字符串组。
需要说明的是,可以采用步骤S102、S103的方法确定所述新字符串所属的第一内存页和第一字符串组;此外,若插入新字符串导致内存页的数量或索引关键字发生变化,则需要自下向上依次更新跳表节点,直到索引重建完成。
相应的,本发明实施例还可以动态地删除字符串序列中的字符串,其具体实现如下:
删除所述字符串序列中的第一字符串,所述第一字符串位于第二内存页和第三字符串组;
若删除所述第一字符串后,所述第三字符串组内的字符串数量小于阈值,则获取与所述第三字符串组相邻的第四字符串组,并对所述第三字符串组和所述第四字符串组重新分组;
将重新分组后的字符串组顺序写入第二内存页,若与第二内存页相邻的内存页以及第二内存页的数据量之和小于一个内存页的数据量阈值,则合并这两个内存页。
其中,第四字符串组可以为与第三字符串组相邻的上一个字符串组,也可以为与第三字符串组相邻的下一个字符串;内存页的数据量阈值可以根据需要进行设置,本发明实施例对此不进行限定。
需要说明的是,若删除字符串后,导致内存页的数量或索引关键字发生变化,则自下向上依次更新跳表节点,直到索引重建完成。
由于在本发明实施例中,字符串组、内存页和跳表索引均有一定的空间弹性,因此,插入/删除字符串一般只引起局部重构,效率较高。
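The insert path with its local regrouping can be sketched as follows (a simplified, in-memory sketch: groups are plain sorted lists, the regrouping policy is supplied by the caller, and paging and index updates are omitted):

```python
import bisect

def insert_string(groups, s, w_max, regroup):
    """Insert s into the group it belongs to (the last group whose first
    string is <= s); if that group now exceeds w_max strings, merge it
    with its right neighbour and re-split via regroup().  The slack
    between w_min and w_max keeps most inserts local."""
    firsts = [g[0] for g in groups]
    gi = max(bisect.bisect_right(firsts, s) - 1, 0)
    bisect.insort(groups[gi], s)
    if len(groups[gi]) > w_max:
        span = 2 if gi + 1 < len(groups) else 1
        merged = [x for g in groups[gi:gi + span] for x in g]
        groups[gi:gi + span] = regroup(merged)
    return groups
```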
由上可知,本发明实施例提供一种字符串序列的压缩索引方法,获取有序排列的字符串序列,根据所述字符串序列中每个字符串的差异前缀长度对所述字符串序列进行分组处理,获得M个字符串组,以使每个字符串组中首个字符串的差异前缀长度在预设字符串范围内是最短的,将M个字符串组依次存储到N个内存页,根据N个内存页的索引关键字构建跳表索引。如此,在对有序字符串序列进行分组和分页处理后,构建跳表索引。由于每个字符串组的差异前缀长度在预设字符串范围内是最短的,使得根据字符串组的差异前缀长度分页后的每页的索引关键字也是局部最短的,进而在页的基础上构建的跳表索引内的索引关键字的长度也是比较短的,降低了跳表索引中索引关键字的平均长度,提升了跳表节点的容纳能力,从而达到减少索引节点数量和降低索引查找复杂度的有益效果,避免了现有CS-Prefix-Tree编码索引过程中,底层叶节点存在过长的差异前缀长度,导致编码索引分支节点的容纳能力下降,增加分支节点数量和查找复杂度的问题。
根据本发明实施例,本发明下述实施例还提供了一种压缩索引装置20,优选地用于实现上述方法实施例中的方法。
实施例二
图8为本发明实施例提供的一种压缩索引装置20的结构图,用于执行实施例一所述的方法,如图8所示,所述装置可以包括:
获取单元201,用于获取字符串序列,所述字符串序列包含有序排列的一个以上字符串。
分组单元202,用于根据所述获取单元201获取到的字符串序列中每个字符串的差异前缀长度,对所述字符串序列进行分组处理,获得M个字符串组,以使每个字符串组中首个字符串的差异前缀长度在预设字符串范围内是最短的,其中,所述M为大于等于1的整数,每个字符串组包含至少一个字符串,每个字符串组的差异前缀为该字符串组中首个字符串的差异前缀。
分页单元203,用于将所述分组单元202获得的M个字符串组依次存储到N个内存页中,所述N为大于等于1的整数,每个内存页包含至少一个字符串组,每个内存页的索引关键字为:该内存页中首个字符串组的差异前缀。
跳表索引构建单元204,根据所述分页单元203得到的N个内存页的索引关键字构建跳表索引,所述跳表索引包含Q层跳表,所述Q为大于等于1的整数,所述Q层跳表的第1层跳表根据所述N个内存页的索引关键字构建,每层跳表包含至少一个跳表节点,每个跳表节点包含至少一个索引关键字、索引关键字的个数以及索引关键字的寻址信息。
可选的,在本发明实施例中,分组单元202可以通过下述方法获得M个字符串组中的第m个字符串组,1≤m≤M:
设定M个字符串组中每个字符串组包含的字符串的个数的最小阈值Wmin、最大阈值Wmax
确定第m个字符串组的首个字符串;
以所述第m个字符串组的首个字符串为起始字符串,依次计算 后续Wmax个字符串中每个字符串的差异前缀长度;
确定第k个字符串,第k个字符串为后续第Wmin个字符串到后续第Wmax个字符串中差异前缀长度最小的字符串,Wmin≤k≤Wmax
将所述第m个字符串组的首个字符串到第k-1个字符串之间的字符串的集合确定为所述第m个字符串组,并将第k个字符串作为第m+1个字符串组的首个字符串。
将第k个字符串作为第m+1个字符串组的首个字符串后,可以按照上述方式确定第m+1个字符串组,如此重复进行,可以将字符串序列按照字符串序列的排序分成M个字符串组。
需要说明的是,对于第1个字符串组,其首个字符串为字符串序列中的第1个字符串,此外,当第Wmin个字符串到第Wmax个字符串中差异前缀长度最小的字符串不止一个时,通常将第Wmin个字符串到第Wmax个字符串中差异前缀长度最小的字符串中排在最前面的字符串作为下一分组的首个字符串。
可选的,在本发明实施例中,分页单元203可以根据所述M个字符串组中每个字符串组的差异前缀长度,将所述M个字符串组依次存储到所述N个内存页中,以使所述N个内存页中的每个内存页的首个字符串组的差异前缀长度在预设字符串组范围内是最短的;具体的,分页单元203将字符串组存储到N个内存页中的第n个内存页中,1≤n≤N,可以包括:
设定所述N个内存页中每个内存页的最小容量Cmin和最大容量Cmax
确定第n个内存页的首个字符串组;
以所述第n个内存页的首个字符串组为起始字符串组,依次将所述M个字符串组中至少一个字符串组存储到第n个内存页;
若存储到第i个字符串组时,所述第n个内存页中被占用的存储容量C占用与Cmin的差值小于第i个字符串组的存储开销,则计算所述第n个内存页中可使用的存储容量(Cmax-C占用)所能容纳的Nmore个字符串组,所述Nmore个字符串组为:以所述第i个字符串组开始 依次排列的Nmore个字符串组;
确定所述Nmore个字符串组中差异前缀最短的字符串组,将所述第i个字符串组、以及所述第i个字符串组到所述差异前缀最短的字符串组之间的字符串组依次存储到所述第n个内存页中,将所述差异前缀最短的字符串组作为第n+1个内存页的首个字符串组。
确定第n+1个内存页的首个字符串后,可以按照上述方式依次将相应的字符串组存储到第n+1个内存页中,如此重复进行,可以将字符串组按照排序依次存入到N个内存页中。
需要说明的是,对于第1个内存页,其首个字符串组为M个字符串组中的第1个字符串组,此外,当所述Nmore个字符串组中差异前缀最小的字符串组不止一个时,通常将Nmore个字符串组中差异前缀长度最小的字符串组中排在最前面的字符串组作为下一分页的首个字符串组。
需要说明的是,内存页中的字符串组可以在页内重新进行编号,不需要与分组后的字符串组的编号相同。其中,为了快速定位到页内各分组,在构造内存页的过程中还需要记录内存页中每个分组的起始地址,并在分组写入完成后,将所有地址逆序写入页尾部预留的索引空间。此外,还可以在页头或页尾预留固定长度字段,记录页内分组数量。为了便于查找,还需要存储每个内存页的地址信息。
进一步的,由于字符串组中的字符串序列间具有共享前缀,因此,为了提高字符串存储效率,在本发明实施例中,分页单元203还可以将字符串组中除首个字符串之外的字符串以压缩形式存储到内存页中,即对于待存储到N个内存页中第y个内存页的M个字符串组中的第x个字符串组,1≤x≤M,1≤y≤N,分页单元203可以通过下述压缩存储方式将第x个字符串组存储到第y个内存页中:
将第x个字符串组的首个字符串以不压缩形式写入到第y个内存页的可用空间;
对所述第x个字符串组中除首个字符串之外的其他任一字符串,获取所述其他任一字符串与其相邻的前一个字符串间的共享前 缀,将其他任一字符串与其相邻的前一字符串间的共享前缀长度、以及所述其他任一字符串中在所述共享前缀之后的后缀字符串写入所述第y个内存页的可用空间。
可选的,所述Q层跳表可以为逐层构建的多层跳表,所述Q层跳表中的第q层跳表可以根据第q-1层跳表中间隔为稀疏系数F的跳表节点的首个索引关键字构建,所述F为大于等于1的整数,所述q为大于等于2的整数,所述稀疏系数F可以根据需要进行设定,本发明实施例对此不进行限定,每个跳表节点的长度Lnode可以为计算机系统缓存长度的整数倍。
当跳表索引包含至少两层跳表时,所述跳表索引构建单元204具体用于:
将N个内存页的索引关键字依次写入第1层跳表的跳表节点,并在每个跳表节点中记录该跳表节点所包含的索引关键字的个数以及索引关键字的寻址信息,所述第1层跳表中每个跳表节点中索引关键字的寻址信息用于指示与该寻址信息相对应的索引关键字所处的内存页;
对于所述跳表索引中所述第1层跳表之上的第q层跳表,获取以第q-1层跳表中的首个跳表节点为起始节点,间隔为F的至少一个跳表节点的首个索引关键字;
将所述至少一个跳表节点的首个索引关键字依次写入第q层跳表的跳表节点,并在每个跳表节点中记录该跳表节点所包含的索引关键字的个数以及索引关键字的寻址信息,其中,所述第q层跳表中每个跳表节点中索引关键字的寻址信息用于指示与该寻址信息相对应的索引关键字在所述第q-1层跳表中所在的跳表节点。
如此,自下而上先构建第1层跳表,再依次向上构建第q层跳表,直至构建的层数Q或第Q层跳表包含的跳表节点的个数满足预设条件或者最上层跳表收敛至一个跳表节点时,停止构建跳表索引;其中,预设条件可以根据需要进行设置,本发明实施例对此不进行限定。
其中,为了便于构造跳表索引,可以采用固定长度的跳表节点。当采用固定长度的跳表节点时,跳表索引构建单元204将从第q-1层跳表提取的至少一个跳表节点的首个索引关键字依次写入第q层跳表的跳表节点,具体用于:
按照至少一个跳表节点的首个索引关键字的排序,依次将至少一个跳表节点的首个索引关键字写入第q层跳表的跳表节点,每写入一个索引关键字,记录其对应的寻址信息,并更新跳表节点内索引关键字数量,计算跳表节点的剩余可用空间;
若当前正写入的跳表节点无法容纳下一个索引关键字,则分配新的跳表节点,按照上述方式写入索引关键字,直到所有索引关键字及寻址信息写入完成。
同理,将N个内存页的索引关键字写入第1层跳表中的跳表节点与上述过程相同,在此不再详细赘述。
进一步的,当各内存页的索引关键字长度差异较大时,为了降低上层跳表的索引开销,提高索引的空间效率,在本发明实施例中,还可以采用可变长度的跳表节点构建跳表索引。当采用可变长度的跳表节点构建跳表索引时,所述跳表索引构建单元204将从第q-1层跳表提取的至少一个索引关键字依次写入第q层跳表的跳表节点,具体可以用于:
以至少一个跳表节点的首个索引关键字中的第1个索引关键字为起始索引关键字,依次将至少一个跳表节点的首个索引关键字写入到第q层跳表的跳表节点;
若写入第i个索引关键字时,正在写入的跳表节点中被占用的长度L占用与Lnode-min的差值小于第i个索引关键字的存储开销,则计算所述正在写入的跳表节点中可使用的长度(Lnode-max-L占用)所能容纳的Nnode-more个索引关键字,所述Nnode-more个索引关键字为:以所述第i个索引关键字开始依次排列的Nnode-more个索引关键字,其中,Lnode-min为每个跳表节点的最小长度,Lnode-max为每个跳表节点的最大长度;
确定所述Nnode-more个索引关键字中最短的索引关键字,将所述第i个索引关键字、以及所述第i个索引关键字到所述最短的索引关键字之间的索引关键字写入正在写入的跳表节点中,将所述最短的索引关键字作为下一跳表节点的首个索引关键字写入所述下一跳表节点。
确定下一跳表节点的首个索引关键字后,可以按照上述方式依次将相应的索引关键字写入下一跳表节点中,如此重复进行,可以将索引关键字按照排序依次存入到第q层跳表的跳表节点。需要说明的是,在计算可使用的长度(Lnode-max-L占用)时,需扣除索引关键字对应寻址信息的预留存储开销。
其中,每个跳表节点的最小长度Lnode-min和最大长度Lnode-max可以根据跳表节点实际长度进行设置,本发明实施例对此不进行限定,并且,各跳表节点的最小长度Lnode-min和最大长度Lnode-max可以相同,也可以不相同。
进一步的,作为压缩索引的逆过程,当用户需要查找数据库中存储的与一字符串有关的所有字符串时,可以从根据跳表索引中的索引关键字自上而下查找相应的内存页,在该内存页中查找相应分组,将该分组中的字符串反馈给用户;具体的,如图8所示,所述压缩索引装置20还可以包括:查询单元205;
所述查询单元205用于:获取待查询字符串;
自上而下查找所述跳表索引中的每层跳表,确定所述Q层跳表中的第t层跳表第j个跳表节点存储有与所述待查询字符串相匹配的第一索引关键字,其中,所述第一索引关键字的寻址信息指示:第t-1层跳表第r个跳表节点,查找所述第t-1层跳表第r个跳表节点中的索引关键字;
确定所述第t-1层跳表第r个跳表节点中存储有与所述待查询字符串相匹配的第二索引关键字,其中,所述第二索引关键字的寻址信息指示:第t-2层跳表第s个跳表节点,查找所述第t-2层跳表第s个跳表节点中的索引关键字;
重复上述过程,直至根据第1层跳表第d个跳表节点中存储的与所述待查询字符串相匹配的第三索引关键字,查找第h个内存页中每个字符串组的差异前缀,其中,所述第三索引关键字的寻址信息指示:所述第h个内存页;
确定所述第h个内存页中第w个字符串组的差异前缀与所述待查询字符串相匹配,查找所述第w个字符串组中的匹配字符串并返回查询结果。
需要说明的是,当字符串组中的字符串采用压缩方式写入内存页时,还需要将字符串解压后作为与待查询字符串相关联的字符串。
其中,与待查询字符串相匹配的索引关键字可以为:按照字典升序先于待查询字符串排列的索引关键字,或者与待查询字符串具有共享前缀的字符串。
进一步的,本发明实施例还可以动态地向字符串序列中插入字符串,具体的,如图8所示,所述装置20还可以包括:字符串插入单元206;
所述字符串插入单元206,用于获取一个新字符串,所述新字符串为不在所述字符串序列中的字符串;
确定所述新字符串所属的第一内存页和第一字符串组;
将所述新字符串插入所述第一字符串组;
若插入所述新字符串后,所述第一字符串组内的字符串数量超过阈值,则获取与所述第一字符串组相邻的第二字符串组,并对所述第一字符串组和所述第二字符串组重新分组;
将重新分组后的字符串组顺序写入所述第一内存页,若所述第一内存页中有字符串组溢出,则将溢出的字符串存入与所述第一内存页相邻的下一内存页。
其中,第二字符串组可以为与第一字符串组相邻的下一字符串组。
需要说明的是,若插入新字符串导致内存页的数量或索引关键 字发生变化,则需要自下向上依次更新跳表节点,直到索引重建完成。
相应的,本发明实施例还可以动态地删除字符串序列中的字符串,具体的,如图8所示,所述装置20还可以包括:字符串删除单元207;
所述字符串删除单元207,可以用于删除所述字符串序列中的第一字符串,所述第一字符串位于第二内存页和第三字符串组;
若删除所述第一字符串后,所述第三字符串组内的字符串数量小于阈值,则获取与所述第三字符串组相邻的第四字符串组,并对所述第三字符串组和所述第四字符串组重新分组;
将重新分组后的字符串组顺序写入第二内存页,若与第二内存页相邻的内存页以及第二内存页数据量之和小于一个内存页的数据量阈值,则合并这两个内存页。
其中,第四字符串组可以与第三字符串组相邻的上一字符串组,也可以为与第三字符串组相邻的下一字符串组;内存页的数据量阈值可以根据需要进行设定,本发明实施例对此不进行限定。
需要说明的是,若删除字符串后,导致内存页的数量或索引关键字发生变化,则自下向上依次更新跳表节点,直到索引重建完成。
由于在本发明实施例中,字符串组、内存页和跳表索引均有一定的空间弹性,因此,插入/删除字符串一般只引起局部重构,效率较高。
需要说明的是,图8中的压缩索引装置20可以设置在数据存储系统的任一计算机中,也可以独立于任何设备设置在数据存储系统中;图8中的获取单元201可以为图2所示压缩索引装置10中的收发器1012,分组单元202、分页单元203、跳表索引构建单元204、查询单元205、字符串插入单元206、字符串删除单元207可以为图2中单独设立的处理器1011,也可以集成在压缩索引装置10的某一个处理器1011中实现,此外,也可以以程序代码的形式存储于压缩索引装置10的存储器1013中,由压缩索引装置10的某一个处理器1011调用并执行以上分组单元202、分页单元203、跳表索引构建单元204、查询单元205、字符串插入单元206以及字符串删除单元207的功能。这里所述的处理器可以是一个中央处理器(Central Processing Unit,CPU),或者是特定集成电路(Application Specific Integrated Circuit,ASIC),或者是被配置成实施本发明实施例的一个或多个集成电路。
由上可知,本发明实施例提供一种字符串序列的压缩索引装置,获取有序排列的字符串序列,根据所述字符串序列中每个字符串的差异前缀长度对所述字符串序列进行分组处理,获得M个字符串组,以使每个字符串组中首个字符串的差异前缀长度在预设字符串范围内是最短的,将所述M个字符串组依次存储到N个内存页,根据所述N个内存页的索引关键字构建跳表索引。如此,在对有序字符串序列进行分组和分页处理后,构建跳表索引。由于每个字符串组的差异前缀长度在预设字符串范围内是最短的,使得根据字符串组的差异前缀长度分页后的每页的索引关键字也是局部最短的,进而在页的基础上构建的跳表索引内的索引关键字的长度也是比较短的,降低了跳表索引中索引关键字的平均长度,提升了跳表节点的容纳能力,从而达到减少索引节点数量和降低索引查找复杂度的有益效果,避免了现有CS-Prefix-Tree编码索引过程中,底层叶节点存在过长的差异前缀长度,导致编码索引分支节点的容纳能力下降,增加分支节点数量和查找复杂度的问题。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的单元和系统的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,设备和方法,可以通过其它的方式实现。例如,以上所描述的设备实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或 不执行。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。
上述以软件功能单元的形式实现的集成的单元,可以存储在一个计算机可读取存储介质中。上述软件功能单元存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,简称ROM)、随机存取存储器(Random Access Memory,简称RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件(例如处理器)来完成,该程序可以存储于一计算机可读存储介质中,存储介质可以包括:只读存储器、随机存储器、磁盘或光盘等。
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。

Claims (24)

  1. 一种字符串序列的压缩索引方法,其特征在于,包括:
    获取字符串序列,所述字符串序列包含有序排列的一个以上字符串;
    根据所述字符串序列中每个字符串的差异前缀长度,对所述字符串序列进行分组处理,获得M个字符串组,以使所述M个字符串组中的每个字符串组中首个字符串的差异前缀长度在预设字符串范围内是最短的,其中,所述M为大于等于1的整数,每个所述字符串组包含至少一个字符串,每个所述字符串组的差异前缀为该字符串组中首个字符串的差异前缀;
    将所述M个字符串组依次存储到N个内存页中,所述N为大于等于1的整数,每个所述内存页包含至少一个字符串组,每个所述内存页的索引关键字为:该内存页中首个字符串组的差异前缀;
    根据所述N个内存页的索引关键字构建跳表索引,所述跳表索引包含Q层跳表,所述Q为大于等于1的整数,所述Q层跳表中的第1层跳表根据所述N个内存页的索引关键字构建,每层跳表包含至少一个跳表节点,每个跳表节点包含至少一个索引关键字、索引关键字的个数以及索引关键字的寻址信息。
  2. 根据权利要求1所述的方法,其特征在于,每个所述字符串组包含的字符串的个数的最小阈值为Wmin、最大阈值为Wmax;对所述字符串序列进行分组处理,获得所述M个字符串组中的第m个字符串组的方法包括,1≤m≤M:
    确定所述第m个字符串组的首个字符串;
    以所述第m个字符串组的首个字符串为起始字符串,依次计算后续Wmax个字符串中每个字符串的差异前缀长度;
    确定第k个字符串,所述第k个字符串为后续第Wmin个字符串到后续第Wmax个字符串中差异前缀长度最小的字符串,Wmin≤k≤Wmax
    将所述第m个字符串组的首个字符串到第k-1个字符串之间的字 符串的集合确定为所述第m个字符串组,并将所述第k个字符串作为第m+1个字符串组的首个字符串。
  3. 根据权利要求1或2所述的方法,其特征在于,所述将所述M个字符串组依次存储到N个内存页中,包括:
    根据所述M个字符串组中每个字符串组的差异前缀长度,将所述M个字符串组依次存储到所述N个内存页中,以使所述N个内存页中的每个内存页的首个字符串组的差异前缀长度在预设字符串组范围内是最短的。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,所述Q层跳表为逐层构建的多层跳表,所述Q层跳表中的第q层跳表根据第q-1层跳表中间隔为稀疏系数F的跳表节点的首个索引关键字构建,所述F为大于等于1的整数,所述q为大于等于2的整数。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述N个内存页中每个内存页的最小容量为Cmin、最大容量为Cmax,将字符串组存储到所述N个内存页中的第n个内存页的方法包括,1≤n≤N:
    确定所述第n个内存页的首个字符串组;
    以所述第n个内存页的首个字符串组为起始字符串组,依次将所述M个字符串组中至少一个字符串组存储到所述第n个内存页;
    若存储到第i个字符串组时,所述第n个内存页中被占用的存储容量C占用与Cmin的差值小于所述第i个字符串组的存储开销,则计算所述第n个内存页中可使用的存储容量(Cmax-C占用)所能容纳的Nmore个字符串组,所述Nmore个字符串组为:以所述第i个字符串组开始依次排列的Nmore个字符串组;
    确定所述Nmore个字符串组中差异前缀长度最小的字符串组,将所述第i个字符串组、以及所述第i个字符串组到所述差异前缀长度最小的字符串组之间的字符串组存储到所述第n个内存页中,将所述差异前缀长度最小的字符串组作为第n+1个内存页的首个字符串组。
  6. 根据权利要求1-5任一项所述的方法,其特征在于,将所述M个字符串组中的第x个字符串组存入到所述N个内存页中的第y 个内存页的方法包括,1≤x≤M,1≤y≤N:
    将所述第x个字符串组的首个字符串以不压缩形式写入到所述第y个内存页的可用空间;
    对所述第x个字符串组中除首个字符串之外的其他任一字符串,获取所述其他任一字符串与其相邻的前一个字符串间的共享前缀,将所述其他任一字符串与其相邻的前一个字符串间的共享前缀长度、以及所述其他任一字符串中在所述共享前缀之后的后缀字符串写入到所述第y个内存页的可用空间。
  7. The method according to any one of claims 1 to 6, wherein the method of constructing layer 1 of the Q skip list layers comprises:
    writing the index keys of the N memory pages in turn into the skip list nodes of layer 1, and recording in each skip list node the number of index keys contained in that node and addressing information for those index keys, wherein the addressing information of each index key in each node of layer 1 indicates the memory page in which the index key corresponding to that addressing information is located.
  8. The method according to claim 4, wherein the method of constructing layer q of the Q skip list layers comprises:
    obtaining, from layer q-1 of the skip list index, the first index key of each of at least one skip list node taken at an interval F, starting from the first skip list node of layer q-1;
    writing the obtained first index keys of the at least one skip list node in turn into the skip list nodes of layer q, and recording in each skip list node of layer q the number of index keys contained in that node and addressing information for those index keys, wherein the addressing information of each index key in each node of layer q indicates the skip list node of layer q-1 in which the index key corresponding to that addressing information is located.
  9. The method according to claim 8, wherein the length of each skip list node is variable; and writing the obtained first index keys of the at least one skip list node in turn into the skip list nodes of layer q comprises:
    taking the 1st of the obtained first index keys as the starting index key, writing the first index keys of the at least one skip list node in turn into the skip list nodes of layer q;
    if, upon writing the i-th index key, the difference between the occupied length Loccupied of the node being written and Lnode-min is smaller than the storage overhead of the i-th index key, computing the Nnode-more index keys that the usable length (Lnode-max - Loccupied) of the node being written can accommodate, the Nnode-more index keys being Nnode-more index keys arranged in sequence starting from the i-th index key, wherein Lnode-min is the minimum length of each skip list node and Lnode-max is the maximum length of each skip list node;
    determining the shortest index key among the Nnode-more index keys, writing the i-th index key and the index keys between the i-th index key and the shortest index key into the node being written, and writing the shortest index key into the next skip list node as the first index key of that next node.
  10. The method according to claim 1, further comprising:
    obtaining a string to be queried;
    searching each layer of the skip list index from top to bottom, determining that the j-th skip list node of layer t of the Q layers stores a first index key matching the string to be queried, wherein the addressing information of the first index key indicates the r-th skip list node of layer t-1, and searching the index keys in the r-th skip list node of layer t-1;
    determining that the r-th skip list node of layer t-1 stores a second index key matching the string to be queried, wherein the addressing information of the second index key indicates the s-th skip list node of layer t-2, and searching the index keys in the s-th skip list node of layer t-2;
    repeating the above process until, according to a third index key matching the string to be queried stored in the d-th skip list node of layer 1, the differential prefix of each string group in the h-th memory page is searched, wherein the addressing information of the third index key indicates the h-th memory page;
    determining that the differential prefix of the w-th string group in the h-th memory page matches the string to be queried, searching for the matching string in the w-th string group and returning the query result.
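The top-down search of claim 10 can be sketched on a tiny two-layer index. The dict-based node layout, the three-keys-per-node packing, and the forward walk across nodes skipped by the sparsity sampling are all our modelling choices; "matching" is read here as "rightmost key not greater than the target":

```python
import bisect

# Memory pages; each page's index key is taken to be its first string.
pages = [["a", "b"], ["c", "d", "e"], ["f", "g"], ["h", "i"],
         ["k", "m"], ["n", "o"], ["p", "q"]]
page_keys = [p[0] for p in pages]

# Layer 1: three keys per node; addressing info points at memory pages.
layer1 = [{"keys": page_keys[i:i + 3],
           "addr": list(range(i, min(i + 3, len(page_keys))))}
          for i in range(0, len(page_keys), 3)]
# Layer 2: first key of every 2nd node (sparsity coefficient F = 2);
# addressing info points at nodes of layer 1.
top = {"keys": [layer1[i]["keys"][0] for i in range(0, len(layer1), 2)],
       "addr": list(range(0, len(layer1), 2))}
layers = [layer1, [top]]

def lookup(layers, pages, target):
    """Descend layer by layer, then scan the addressed memory page."""
    node = layers[-1][0]                      # single node at the top
    for depth in range(len(layers) - 1, 0, -1):
        pos = max(bisect.bisect_right(node["keys"], target) - 1, 0)
        lower, idx = layers[depth - 1], node["addr"][pos]
        # sampling skips nodes: walk forward to the last lower node
        # whose first key is still <= target
        while idx + 1 < len(lower) and lower[idx + 1]["keys"][0] <= target:
            idx += 1
        node = lower[idx]
    pos = max(bisect.bisect_right(node["keys"], target) - 1, 0)
    page = pages[node["addr"][pos]]
    return target if target in page else None
```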
  11. The method according to claim 1, further comprising:
    obtaining a new string, the new string being a string not in the string sequence;
    determining the first memory page and the first string group to which the new string belongs;
    inserting the new string into the first string group;
    if, after the new string is inserted, the number of strings in the first string group exceeds a threshold, obtaining a second string group adjacent to the first string group and regrouping the first string group and the second string group;
    writing the regrouped string groups in order into the first memory page, and if a string group overflows the first memory page, storing the overflowing string group into the next memory page adjacent to the first memory page.
  12. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    删除所述字符串序列中的第一字符串,所述第一字符串位于第二内存页和第三字符串组;
    若删除所述第一字符串后,所述第三字符串组内的字符串数量小于阈值,则获取与所述第三字符串组相邻的第四字符串组,并对所述第三字符串组和所述第四字符串组重新分组;
    将重新分组后的字符串组顺序写入所述第二内存页,若与所述第二内存页相邻的内存页以及所述第二内存页的数据量之和小于一个内存页的数据量阈值,则合并所述与所述第二内存页相邻的内存页以及所述第二内存页。
  13. 一种压缩索引装置,其特征在于,所述装置包括:
    获取单元,用于获取字符串序列;所述字符串序列包含有序排列的一个以上字符串;
    分组单元,用于根据所述获取单元获取到的字符串序列中每个字 符串的差异前缀长度,对所述字符串序列进行分组处理,获得M个字符串组,以使每个字符串组中的首个字符串的差异前缀长度在预设字符串范围内是最短的,所述M为大于等于1的整数,每个字符串组包含至少一个字符串,每个字符串组的差异前缀为该字符串组中首个字符串的差异前缀;
    分页单元,用于将所述分组单元获得的M个字符串组依次存储到N个内存页中,所述N为大于等于1的整数,每个所述内存页包含至少一个字符串组,每个所述内存页的索引关键字为:该内存页中首个字符串组的差异前缀;
    跳表索引构建单元,用于根据所述N个内存页的索引关键字构建跳表索引,所述跳表索引包含Q层跳表,所述Q为大于等于1的整数,第1层跳表根据所述N个内存页的索引关键字构建,每层跳表包含至少一个跳表节点,每个跳表节点包含至少一个索引关键字、索引关键字的个数以及索引关键字的寻址信息。
  14. 根据权利要求12所述的压缩索引装置,其特征在于,每个所述字符串组包含的字符串的个数的最小阈值为Wmin、最大阈值为Wmax;对于获得所述M个字符串组中的第m个字符串组,1≤m≤M,所述分组单元具体用于:
    确定所述第m个字符串组的首个字符串;
    以所述第m个字符串组的首个字符串为起始字符串,依次计算后续Wmax个字符串中每个字符串的差异前缀长度;
    确定第k个字符串,所述第k个字符串为后续第Wmin个字符串到后续第Wmax个字符串中差异前缀长度最小的字符串,Wmin≤k≤Wmax
    将所述第m个字符串组的首个字符串到第k-1个字符串之间的字符串的集合确定为所述第m个字符串组,并将所述第k个字符串作为第m+1个字符串组的首个字符串。
  15. 根据权利要求13或14所述的压缩索引装置,其特征在于,所述分页单元具体用于:
    根据所述M个字符串组中每个字符串组的差异前缀长度,将所述M个字符串组依次存储到所述N个内存页中,以使所述N个内存页中的每个内存页的首个字符串组的差异前缀长度在预设字符串组范围内是最短的。
  16. 根据权利要求13-15任一项所述的压缩索引装置,其特征在于,所述Q层跳表为逐层构建的多层跳表,所述Q层跳表中的第q层跳表根据第q-1层跳表中间隔为稀疏系数F的跳表节点的首个索引关键字构建,所述F为大于等于1的整数,所述q为大于等于2的整数。
  17. 根据权利要求13-16任一项所述的压缩索引装置,其特征在于,所述N个内存页中每个内存页的最小容量为Cmin、最大容量为Cmax,对于将字符串组存储到所述N个内存页中的第n个内存页,1≤n≤N,所述分页单元具体用于:
    确定所述第n个内存页的首个字符串组;
    以所述所述第n个内存页的首个字符串组为起始字符串组,依次将所述M个字符串组中至少一个字符串组存储到所述第n个内存页;
    若存储到第i个字符串组时,所述第n个内存页中被占用的存储容量C占用与Cmin的差值小于所述第i个字符串组的存储开销,则计算所述第n个内存页中可使用的存储容量(Cmax-C占用)所能容纳的Nmore个字符串组,所述Nmore个字符串组为:以所述第i个字符串组开始依次排列的Nmore个字符串组;
    确定所述Nmore个字符串组中差异前缀长度最小的字符串组,将所述第i个字符串组、以及所述第i个字符串组到所述差异前缀长度最小的字符串组之间的字符串组存储到所述第n个内存页中,将所述差异前缀长度最小的字符串组作为第n+1个内存页的首个字符串组。
  18. 根据权利要求13-17任一项所述的压缩索引装置,其特征在于,对于将所述M个字符串组中的第x个字符串组存入到所述N个内存页中的第y个内存页,1≤x≤M,1≤y≤N,所述分页单元具体用于:
    将所述第x个字符串组的首个字符串以不压缩形式写入到第y个内存页的可用空间;
    对所述第x个字符串组中除首个字符串之外的其他任一字符串,获取所述其他任一字符串与其相邻的前一个字符串间的共享前缀,将所述其他任一字符串与其相邻的前一字符串间的共享前缀长度、以及所述其他任一字符串中在所述共享前缀之后的后缀字符串写入到所述第y个内存页的可用空间。
  19. 根据权利要求13-18任一项所述的压缩索引装置,其特征在于,所述跳表索引构建单元具体用于:
    将N个内存页的索引关键字依次写入第1层跳表的跳表节点,并在每个跳表节点中记录该跳表节点所包含的索引关键字的个数以及索引关键字的寻址信息,其中,所述第1层跳表中每个跳表节点中索引关键字的寻址信息用于指示与该寻址信息相对应的索引关键字所处的内存页。
  20. 根据权利要求16所述的压缩索引装置,其特征在于,2≤q≤Q,所述Q为大于2的整数,所述跳表索引构建单元具体用于:
    获取所述跳表索引中所述第q-1层跳表中,以所述第q-1层跳表中的首个跳表节点为起始节点,间隔为F的至少一个跳表节点的首个索引关键字;
    将获取到的所述至少一个跳表节点的首个索引关键字依次写入所述第q层跳表的跳表节点,并在所述第q层跳表的每个跳表节点中记录该跳表节点所包含的索引关键字的个数以及索引关键字的寻址信息;其中,所述第q层跳表中每个跳表节点中索引关键字的寻址信息用于指示与该寻址信息相对应的索引关键字在所述第q-1层跳表中所在的跳表节点。
  21. 根据权利要求20所述的压缩索引装置,其特征在于,每个跳表节点的长度是可变的;所述跳表索引构建单元具体用于:
    以所述至少一个跳表节点的首个索引关键字中的第1个索引关键字为起始索引关键字,依次将所述至少一个跳表节点的首个索引关 键字写入到所述第q层跳表的跳表节点;
    若写入到第i个索引关键字时,正在写入的跳表节点中被占用的长度L占用与Lnode-min的差值小于所述第i个索引关键字的存储开销,则计算所述正在写入的跳表节点中可使用的长度(Lnode-max-L占用)所能容纳的Nnode-more个索引关键字,所述Nnode-more个索引关键字为:以所述第i个索引关键字开始依次排列的Nnode-more个索引关键字,其中,Lnode-min为每个跳表节点的最小长度,Lnode-max为每个跳表节点的最大长度;
    确定所述Nnode-more个索引关键字中最短的索引关键字,将所述第i个索引关键字、以及所述第i个索引关键字到所述最短的索引关键字之间的索引关键字写入到所述正在写入的跳表节点中,将所述最短的索引关键字作为下一跳表节点的首个索引关键字写入所述下一跳表节点。
  22. 根据权利要求13所述的压缩索引装置,其特征在于,所述压缩索引装置还包括:查询单元;
    所述查询单元,用于获取待查询字符串;
    自上而下查找所述跳表索引中的每层跳表,确定所述Q层跳表中的第t层跳表第j个跳表节点存储有与所述待查询字符串相匹配的第一索引关键字,其中,所述第一索引关键字的寻址信息指示:第t-1层跳表第r个跳表节点,查找所述第t-1层跳表第r个跳表节点中的索引关键字;
    确定所述第t-1层跳表第r个跳表节点中存储有与所述待查询字符串相匹配的第二索引关键字,其中,所述第二索引关键字的寻址信息指示:第t-2层跳表第s个跳表节点,查找所述第t-2层跳表第s个跳表节点中的索引关键字;
    重复上述过程,直至根据第1层跳表第d个跳表节点中存储的与所述待查询字符串相匹配的第三索引关键字,查找第h个内存页中每个字符串组的差异前缀,其中,所述第三索引关键字的寻址信息指示:所述第h个内存页;
    确定所述第h个内存页中第w个字符串组的差异前缀与所述待查询字符串相匹配,查找所述第w个字符串组中的匹配字符串并返回查询结果。
  23. 根据权利要求13所述的压缩索引装置,其特征在于,所述压缩索引装置还包括:字符串插入单元;
    所述字符串插入单元,用于获取一个新字符串,所述新字符串为不在所述字符串序列中的字符串;
    确定所述新字符串所属的第一内存页和第一字符串组;
    将所述新字符串插入所述第一字符串组;
    若插入所述新字符串后,所述第一字符串组内的字符串数量超过阈值,则获取与所述第一字符串组相邻的第二字符串组,并对所述第一字符串组和所述第二字符串组重新分组;
    将重新分组后的字符串组顺序写入所述第一内存页,若所述第一内存页中有字符串组溢出,则将溢出的字符串组存入与所述第一内存页相邻的下一内存页。
  24. 根据权利要求13所述的压缩索引装置,其特征在于,所述压缩索引装置还包括:字符串删除单元;
    所述字符串删除单元,用于删除所述字符串序列中的第一字符串,所述第一字符串位于第二内存页和第三字符串组;
    若删除所述第一字符串后,所述第三字符串组内的字符串数量小于阈值,则获取与所述第三字符串组相邻的第四字符串组,并对所述第三字符串组和所述第四字符串组重新分组;
    将重新分组后的字符串组顺序写入所述第二内存页,若与所述第二内存页相邻的内存页以及第二内存页的数据量之和小于一个内存页的数据量阈值,则合并所述与所述第二内存页相邻的内存页以及所述第二内存页。
PCT/CN2016/077428 2016-03-25 2016-03-25 一种字符串序列的压缩索引方法及装置 WO2017161589A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201680083999.8A CN108780455B (zh) 2016-03-25 2016-03-25 一种字符串序列的压缩索引方法及装置
PCT/CN2016/077428 WO2017161589A1 (zh) 2016-03-25 2016-03-25 一种字符串序列的压缩索引方法及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/077428 WO2017161589A1 (zh) 2016-03-25 2016-03-25 一种字符串序列的压缩索引方法及装置

Publications (1)

Publication Number Publication Date
WO2017161589A1 true WO2017161589A1 (zh) 2017-09-28

Family

ID=59899869

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/077428 WO2017161589A1 (zh) 2016-03-25 2016-03-25 一种字符串序列的压缩索引方法及装置

Country Status (2)

Country Link
CN (1) CN108780455B (zh)
WO (1) WO2017161589A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065964A (zh) * 2021-04-13 2021-07-02 上证所信息网络有限公司 一种采用可变步长跳表的数据存储系统及方法
CN113626431A (zh) * 2021-07-28 2021-11-09 浪潮云信息技术股份公司 一种基于lsm树的延迟垃圾回收的键值分离存储方法及系统
CN117194440A (zh) * 2023-11-08 2023-12-08 本原数据(北京)信息技术有限公司 数据库索引压缩方法、装置、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193941A (zh) * 2010-03-12 2011-09-21 富士通株式会社 数据处理装置和为值串形式索引值建立索引的方法
US8635195B2 (en) * 2011-05-19 2014-01-21 International Business Machines Corporation Index compression in a database system
CN103870462A (zh) * 2012-12-10 2014-06-18 腾讯科技(深圳)有限公司 一种数据处理方法及装置
CN104408192A (zh) * 2014-12-15 2015-03-11 北京国双科技有限公司 字符串类型列的压缩处理方法及装置
CN104408067A (zh) * 2014-10-29 2015-03-11 中国建设银行股份有限公司 一种多树结构的数据库设计方法及装置
CN101937448B (zh) * 2009-06-28 2016-01-20 Sap欧洲公司 用于主存储器列存储装置的基于字典的保持顺序的串压缩

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014201047A1 (en) * 2013-06-11 2014-12-18 InfiniteBio Fast, scalable dictionary construction and maintenance
CN104881503A (zh) * 2015-06-24 2015-09-02 郑州悉知信息技术有限公司 一种数据处理方法和装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937448B (zh) * 2009-06-28 2016-01-20 Sap欧洲公司 用于主存储器列存储装置的基于字典的保持顺序的串压缩
CN102193941A (zh) * 2010-03-12 2011-09-21 富士通株式会社 数据处理装置和为值串形式索引值建立索引的方法
US8635195B2 (en) * 2011-05-19 2014-01-21 International Business Machines Corporation Index compression in a database system
CN103870462A (zh) * 2012-12-10 2014-06-18 腾讯科技(深圳)有限公司 一种数据处理方法及装置
CN104408067A (zh) * 2014-10-29 2015-03-11 中国建设银行股份有限公司 一种多树结构的数据库设计方法及装置
CN104408192A (zh) * 2014-12-15 2015-03-11 北京国双科技有限公司 字符串类型列的压缩处理方法及装置

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065964A (zh) * 2021-04-13 2021-07-02 上证所信息网络有限公司 一种采用可变步长跳表的数据存储系统及方法
CN113065964B (zh) * 2021-04-13 2024-05-03 上证所信息网络有限公司 一种采用可变步长跳表的数据存储系统及方法
CN113626431A (zh) * 2021-07-28 2021-11-09 浪潮云信息技术股份公司 一种基于lsm树的延迟垃圾回收的键值分离存储方法及系统
CN117194440A (zh) * 2023-11-08 2023-12-08 本原数据(北京)信息技术有限公司 数据库索引压缩方法、装置、电子设备及存储介质
CN117194440B (zh) * 2023-11-08 2024-02-13 本原数据(北京)信息技术有限公司 数据库索引压缩方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN108780455B (zh) 2022-03-29
CN108780455A (zh) 2018-11-09

Similar Documents

Publication Publication Date Title
US10642515B2 (en) Data storage method, electronic device, and computer non-volatile storage medium
US11256696B2 (en) Data set compression within a database system
WO2020041928A1 (zh) 数据存储方法、系统及终端设备
CN105117415B (zh) 一种优化的ssd数据更新方法
CN109299113B (zh) 具有存储感知的混合索引的范围查询方法
US9535940B2 (en) Intra-block partitioning for database management
JP2012529105A (ja) 分散連想メモリベースを提供する方法、システム、及びコンピュータプログラム製品
US20120215752A1 (en) Index for hybrid database
CN105320775A (zh) 数据的存取方法和装置
CN111190904B (zh) 一种图-关系数据库混合存储的方法和装置
WO2017161589A1 (zh) 一种字符串序列的压缩索引方法及装置
US7054994B2 (en) Multiple-RAM CAM device and method therefor
US10061792B2 (en) Tiered index management
US20050187898A1 (en) Data Lookup architecture
WO2018205151A1 (zh) 数据更新方法和存储装置
CN113961514A (zh) 数据查询方法及装置
US11709835B2 (en) Re-ordered processing of read requests
CN102110171A (zh) 基于树形结构的布鲁姆过滤器的查询与更新方法
Conway et al. Optimal hashing in external memory
WO2015010508A1 (zh) 一种基于一维线性空间实现Trie树的词典存储管理方法
US20240126762A1 (en) Creating compressed data slabs that each include compressed data and compression information for storage in a database system
KR20020029843A (ko) 주기억장치 데이터베이스의 인덱스 데이터 관리방법
WO2013097115A1 (zh) 文件目录存储方法、检索方法和设备
US20140324875A1 (en) Index for fast batch updates of large data tables
CN100433009C (zh) 静态范围匹配表的管理维护方法

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16894942

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16894942

Country of ref document: EP

Kind code of ref document: A1