WO2015151444A1 - Data structure, information processing apparatus, information processing method, and program recording medium - Google Patents
- Publication number
- WO2015151444A1 (PCT/JP2015/001568)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- value
- continuous section
- bit string
- specified
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2237—Vectors, bitmaps or matrices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/22—Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3082—Vector coding
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3084—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
- H03M7/3088—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing the use of a dictionary, e.g. LZ78
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
- H03M7/4006—Conversion to or from arithmetic code
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
- H03M7/4031—Fixed length to variable length coding
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/46—Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/70—Type of the data to be coded, other than image and sound
- H03M7/707—Structured documents, e.g. XML
Definitions
- The present invention relates to a data structure, an information processing apparatus, an information processing method, and a program recording medium for realizing these, and in particular to a data structure, an information processing apparatus, an information processing method, and a program recording medium for performing efficient calculation on a bit string.
- A complete dictionary is a data structure that supports two types of operations, called rank and select, on a bit string B of length n. These operations are defined as follows. In the following, the first element of the bit string B is B[0], the last element is B[n-1], the substring from the i-th element to the j-th element of B is written B[i, j], and the substring that excludes the terminal j-th element is written B[i, j).
- rank1(B, i) is an operation that returns the number of 1s in the range B[0, i).
- rank0(B, i) is an operation that returns the number of 0s in the range B[0, i).
- select1(B, i) is an operation that returns the position at which the (i+1)-th 1 appears in the bit string B.
- select0(B, i) is an operation that returns the position at which the (i+1)-th 0 appears in the bit string B.
- A complete dictionary may also support an operation called access(B, i), which returns the value B[i] of the i-th element.
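The definitions above can be pinned down with a naive reference implementation. This is only a sketch for checking the indexing conventions (0-based positions, half-open range B[0, i)); the function names `rank`, `select`, and `access` and the use of a Python string for B are assumptions made for illustration, and the linear scans are not how a complete dictionary would actually be realized.

```python
def rank(bits: str, value: str, i: int) -> int:
    """Count occurrences of `value` ("0" or "1") in bits[0:i], i.e. B[0, i)."""
    return bits[:i].count(value)


def select(bits: str, value: str, i: int) -> int:
    """Return the 0-based position of the (i+1)-th occurrence of `value`."""
    seen = 0
    for pos, bit in enumerate(bits):
        if bit == value:
            if seen == i:
                return pos
            seen += 1
    raise IndexError("bit string has fewer than i + 1 occurrences")


def access(bits: str, i: int) -> str:
    """Return B[i]."""
    return bits[i]


B = "001110011011"
print(rank(B, "1", 5), rank(B, "0", 5))      # 3 2
print(select(B, "1", 4), select(B, "0", 2))  # 8 5
```

With these conventions, rank1(B, i) + rank0(B, i) = i for every i, a relation that is useful whenever one of the two counts has to be derived from the other.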
- Depending on the literature, complete dictionaries are also called “succinct bit vectors”, “rank/select dictionaries”, and so on, but these all mean the same thing.
- A complete dictionary is the basis for building space-saving data structures called succinct data structures.
- Succinct data structures express various data structures, such as tree structures, graph structures, and text data, in a small size, and are attracting attention as a technique for handling large-scale data.
- The size of a succinct data structure depends on the size of the underlying complete dictionary. Therefore, realizing a complete dictionary with as small a size as possible is important in handling large-scale data. This point is described below.
- the bit string may be compressed.
- a technique for realizing a complete dictionary of a bit string while compressing the bit string is known.
- Such a method has an advantage that a complete dictionary can be realized with a size smaller than that of the original bit string B.
- The method for compressing a complete dictionary differs depending on the distribution of 1s and 0s in the original bit string.
- First, consider a bit string in which 1s are rare, called a “sparse bit string”,
- such as “0000010000001000000”.
- For a complete dictionary of sparse bit strings, see Non-Patent Document 1 (“42.4 Sparse Case”).
- Let E be the bit string in which this array D is represented in unary code. The bit string E has a length of 2m, and a complete dictionary can be constructed for it. The complete dictionary of the bit string E can be realized with 2m + o(m) bits.
- A complete dictionary of a sparse bit string is the combination of an uncompressed complete dictionary of the bit string E and the bit array L. Since the array L takes m log(n/m) bits and the complete dictionary of E takes 2m + o(m) bits, the total size is m log(n/m) + 2m + o(m) bits. In the sparse case, where m ≪ n, this is smaller than the original n-bit string B.
- select1 (B, i) can be calculated by the following equation (1).
- rank1 (B, i) can be calculated as follows. First, by counting the number of “0” using the complete dictionary of the bit string E, the position t of the minimum upper bit H [t] that is equal to or larger than the upper bits of i is obtained. The value of P [t] is obtained by adding the lower bits L [t] corresponding to H [t]. Thereafter, if t is incremented by 1 and the maximum P [t] not exceeding i is obtained, then the value of t represents the number of “1” existing up to position i.
- Another method for expressing a complete dictionary of sparse bit strings is disclosed in Non-Patent Document 3. A complete dictionary of sparse bit strings can be expressed efficiently using either the method of Non-Patent Document 1 or that of Non-Patent Document 3.
- Next, consider a bit string in which 1s and 0s each appear in long consecutive stretches, such as “0000111100001111”.
- In a sparse bit string only the 0s appear in long consecutive stretches; this bit string differs in that the 1s are also long and consecutive.
- A region where 1s are consecutive is referred to as a “run”. For example, the bit string above contains two runs.
- A complete dictionary of a bit string in which 1s and 0s are consecutive can be expressed efficiently using run-length compression.
- Non-Patent Document 2 discloses a method for realizing a complete dictionary by run-length compressing a bit string in which 1s and 0s are consecutive (see “3.1 Run-Length Encoded Wavelet Tree”).
- In Non-Patent Document 2, a region where 1s are consecutive is called a “1-bit run”, and a region where 0s are consecutive is called a “0-bit run”.
- In this specification, a region in which 1s are consecutive is called a “run”,
- and a region in which 0s are consecutive is called a “blank”. Every bit string such as “0000111100001111” can therefore be regarded as an alternating arrangement of runs and blanks. The method disclosed in Non-Patent Document 2 is described below.
- Suppose that a bit string B in which 1s and 0s are consecutive is given.
- This bit string B is expressed by two sparse bit strings, B1 and Brl.
- Let b be the number of runs included in the bit string B.
- The bit string B1 is a sparse bit string that is 1 only at the start position of each run.
- The bit string B1 is as shown in Equation 2 below. In the formulas of this specification, B1 is written with a subscript, as B_1.
- Since the number of runs included in the bit string B is b, the bit string B1 includes b 1s.
- The bit string Brl is a sparse bit string obtained by concatenating the unary codes of all the run lengths.
- The bit string Brl also includes b 1s.
- Since both B1 and Brl are sparse bit strings, complete dictionaries for them can be constructed efficiently using the complete dictionary of sparse bit strings described in Non-Patent Document 1.
- In Non-Patent Document 2, the complete dictionary of sparse bit strings disclosed in Non-Patent Document 3 is used instead of the one disclosed in Non-Patent Document 1.
- However, both complete dictionaries realize rank and select, and the operation is the same whichever is used.
- The description here uses the complete dictionary of sparse bit strings described in Non-Patent Document 1.
- The bit string Brl can be regarded as storing the value of rank1(B, i) at the start position i of the r-th run. That is, in the bit string Brl, the position of the r-th 1 counted from the head represents the value of rank1(B, i) at the start position i of the r-th run. Therefore, Formula 3 below holds. In the formulas of this specification, Brl is written with a subscript, as B_rl.
- In this way, the bit string B can be expressed by the bit strings B1 and Brl. Therefore, the three operations rank1, rank0, and select1 on the bit string B can be calculated using the complete dictionaries of B1 and Brl.
- j = select1(B1, r-1).
- The value of rank1(B, i) is obtained by adding the number of 1s in the range [0, j-1] and the number of 1s in the range [j, i).
- select1(B, i) can be calculated by the following equation (7).
- r = rank1(Brl, i + 1).
- Here r indicates that, when the run lengths are summed from the head, the run at which the total reaches i + 1 is the r-th run.
- select1(B1, r) represents the start position of this run.
- i + 1 - select1(Brl, r) indicates which 1 within this run is the (i+1)-th 1 in the whole bit string B. Adding these two values and subtracting 1 gives the position of the (i+1)-th 1.
- For example, suppose that the bit string B is “001110011011”.
- Then the bit string B1 is “0010000100101”
- and the bit string Brl is “10010101”.
- Suppose that i = 4, that is, select1(B, 4) is to be obtained.
- In this way, a complete dictionary can be expressed by sparse bit strings.
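The example above can be reproduced with a short sketch. It rests on several assumptions: the helper names (`encode_runs`, `rl_select1`) are hypothetical, both B1 and Brl carry one trailing sentinel 1 (inferred only from the example strings), select is 0-based throughout (so the index arithmetic may be shifted by one relative to equation (7), which is not reproduced here), and plain linear scans stand in for the sparse complete dictionaries.

```python
def rank1(bits: str, i: int) -> int:
    """Number of 1s in bits[0:i]."""
    return bits[:i].count("1")


def select1(bits: str, i: int) -> int:
    """0-based position of the (i+1)-th 1 (linear scan, for illustration only)."""
    pos = -1
    for _ in range(i + 1):
        pos = bits.index("1", pos + 1)
    return pos


def encode_runs(bits: str):
    """Build B1 (1 at each run start) and Brl (unary run lengths), each with a sentinel 1."""
    n = len(bits)
    starts, lengths = [], []
    for pos, bit in enumerate(bits):
        if bit == "1":
            if pos == 0 or bits[pos - 1] == "0":
                starts.append(pos)
                lengths.append(0)
            lengths[-1] += 1
    b1 = ["0"] * (n + 1)
    for s in starts + [n]:
        b1[s] = "1"
    brl = "".join("1" + "0" * (length - 1) for length in lengths) + "1"
    return "".join(b1), brl


def rl_select1(b1: str, brl: str, i: int) -> int:
    r = rank1(brl, i + 1)              # the (i+1)-th 1 of B lies in the r-th run
    start = select1(b1, r - 1)         # start position of that run
    ones_before = select1(brl, r - 1)  # rank1(B, start)
    return start + (i - ones_before)


b1, brl = encode_runs("001110011011")
print(b1, brl)                 # 0010000100101 10010101
print(rl_select1(b1, brl, 4))  # 8
```

For i = 4 the sketch finds r = 2, the run starting at position 7 with three 1s before it, and returns 7 + (4 - 3) = 8, which is the position of the fifth 1 in B.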
- the size of the complete dictionary of the bit string B is considered.
- the size of the complete dictionary of the bit string B is the sum of the sizes of the bit string B1 and the bit string Brl.
- the size of the complete dictionary of sparse bit strings is m log (n / m) + 2m + o (m) bits, where n is the length of the bit string and m is the number of included 1s.
- Since the bit string B1 has length n and includes b 1s, its size is b log(n/b) + 2b + o(b) bits.
- Since the bit string Brl has length m and includes b 1s, its size is b log(m/b) + 2b + o(b) bits. The size of the complete dictionary of the bit string B is the sum of the two, namely b(log(n/b) + log(m/b) + 4) + o(b) bits.
- Thus, in the method of Non-Patent Document 2, merely by preparing two sparse bit strings, a bit string in which 1s and 0s are consecutive is run-length compressed, and rank1, rank0, and select1 can be calculated efficiently.
- However, the method of Non-Patent Document 2 has the problem that select0 cannot be calculated efficiently. This is because the method uses a bit string holding the rank1 values needed to calculate select1 efficiently, but does not use a bit string holding the rank0 values needed to calculate select0 efficiently.
- If a bit string holding the rank0 values is simply added, one more complete dictionary of b log(m/b) + 2b + o(b) bits is added, and the total size increases from b(log(n/b) + log(m/b) + 4) + o(b) bits to b(log(n/b) + 2 log(m/b) + 6) + o(b) bits. When m is close to n, log(n/b) and log(m/b) are almost the same size because they are logarithms. Writing this value as C, the size increases from 2b(C + 2) + o(b) bits to 3b(C + 2) + o(b) bits, that is, to approximately 1.5 times. This is a non-negligible increase in size when implementing a complete dictionary.
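The size comparison can be checked numerically. The following is a rough sketch that evaluates only the leading terms of the formulas above and drops the o(b) terms; the function names and the sample values of n, m, and b are arbitrary assumptions chosen so that m is close to n on a logarithmic scale.

```python
import math


def sparse_bits(length: int, ones: int) -> float:
    """Leading terms m*log2(n/m) + 2m of a sparse complete dictionary (o(m) dropped)."""
    return ones * math.log2(length / ones) + 2 * ones


def sizes(n: int, m: int, b: int):
    """Return (size with B1 and Brl, size after adding one more rank0 dictionary)."""
    two = sparse_bits(n, b) + sparse_bits(m, b)
    three = two + sparse_bits(m, b)
    return two, three


# Hypothetical sample: n = 2^20 bits, m = 2^19 ones, b = 2^10 runs.
two, three = sizes(n=2**20, m=2**19, b=2**10)
print(two, three, round(three / two, 2))
```

For this sample the ratio comes out slightly below 1.5, in line with the estimate in the text; it approaches 1.5 as log(n/b) and log(m/b) get closer.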
- select1 and select0 are both used in various data structures.
- For example, in a tree-structured data structure called a wavelet tree, both select1 and select0 of the complete dictionary are called when the tree is traversed upward. Therefore, there is a need for a complete dictionary that can calculate both select1 and select0, which are used in various data structures, and that is small in size.
- An example of the object of the present invention is to solve the above problem and to provide a data structure, an information processing apparatus, an information processing method, and a program recording medium that enable the two types of select operations using a complete dictionary for a target bit string while suppressing an increase in the size of the complete dictionary.
- An information processing apparatus according to the present invention includes a storage unit storing a data structure for expressing a bit string composed of a first value and a second value.
- The data structure has: first data that specifies the positions of all or some of the continuous sections in which one or more of the first value or the second value are consecutive on the bit string;
- second data that specifies the number of occurrences of the first value from the beginning of the bit string up to the continuous section; and
- third data that specifies the number of occurrences of the second value from the beginning of the bit string up to the continuous section.
- A data structure according to the present invention is a data structure for reproducing a bit string composed of a first value and a second value, and has: first data that specifies, on the bit string, the positions of all or some of the continuous sections in which one or more of the same value are consecutive; second data that specifies, for some of the continuous sections, the number of occurrences of the first value from the beginning of the bit string up to the continuous section; and third data that specifies, for some of the continuous sections, the number of occurrences of the second value from the beginning of the bit string up to the continuous section.
- An information processing method according to the present invention uses a data structure having: first data that specifies the positions of all or some of the continuous sections in which one or more of the first value or the second value are consecutive on a bit string composed of a first value and a second value; second data that specifies, for some of the continuous sections, the number of occurrences of the first value from the beginning of the bit string up to the continuous section; and third data that specifies, for some of the continuous sections, the number of occurrences of the second value from the beginning of the bit string up to the continuous section.
- The method includes identifying, using the first data, the second data, and the third data, a first select position, which is a position on the bit string such that the number of first values included from the head up to that position equals an input natural number,
- and identifying, using the first data, the second data, and the third data, a second select position, which is a position on the bit string such that the number of second values included from the head up to that position equals an input natural number.
- A program recording medium according to the present invention records a program that causes a computer to execute: (A) storing, in a storage device included in the computer, a data structure having first data that specifies the positions of all or some of the continuous sections in which one or more of the first value or the second value are consecutive on a bit string composed of a first value and a second value, second data that specifies, for some of the continuous sections, the number of occurrences of the first value from the beginning of the bit string up to the continuous section, and third data that specifies, for some of the continuous sections, the number of occurrences of the second value from the beginning of the bit string up to the continuous section; (B) when a natural number is input, identifying, using the first data, the second data, and the third data, a first select position, which is a position on the bit string such that the number of first values included from the head up to that position equals the natural number; and (C) when a natural number is input, a second select position
- an increase in the size of a complete dictionary can be suppressed while enabling two types of select operations using the complete dictionary for a target bit string.
- FIG. 1 is a block diagram showing a schematic configuration of an information processing apparatus according to an embodiment of the present invention.
- FIG. 2 is a block diagram specifically showing the configuration of the information processing apparatus 100 according to the embodiment of the present invention.
- FIG. 3 is a diagram illustrating an example of a target bit string and various values obtained therefrom.
- FIG. 4 is a diagram showing an example of a data structure for expressing the bit string shown in FIG.
- FIG. 5 is a flowchart showing the operation of the information processing apparatus according to the embodiment of the present invention.
- FIG. 6 is a block diagram illustrating an example of a computer that implements the information processing apparatus according to the embodiment of the present invention.
- FIG. 7 is a diagram schematically illustrating a calculation method of select1 (B, i) in the embodiment of the present invention.
- In the following, rank1(B, s_a) at the start position s_a of the a-th run is referred to as the rank1 of the run,
- and rank0(B, s_a) is referred to as the rank0 of the run.
- Theorem: For any a-th run on the bit string B, let s_a be the start position of the a-th run and s_(a+1) be the start position of the (a+1)-th run. If only the four values s_a, rank0(B, s_a), s_(a+1), and rank1(B, s_(a+1)) are known, every bit in the range B[s_a, s_(a+1)) can be identified as 1 or 0.
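The theorem can be illustrated with a small sketch. It relies on the identity rank1(B, s) + rank0(B, s) = s, which holds because every bit before position s is either 1 or 0, and on the fact that a section from one run start to the next consists of 1s followed by 0s. The function name `bits_between` and its parameter names are hypothetical.

```python
def bits_between(s_a: int, rank0_a: int, s_next: int, rank1_next: int) -> str:
    """Reconstruct B[s_a, s_next) from the four values named in the theorem."""
    rank1_a = s_a - rank0_a         # rank1 + rank0 = position
    run_len = rank1_next - rank1_a  # every 1 in [s_a, s_next) belongs to run a
    blank_len = (s_next - s_a) - run_len
    return "1" * run_len + "0" * blank_len


B = "001110011011"
# Run 1 starts at 2 with rank0 = 2; run 2 starts at 7 with rank1 = 3.
print(bits_between(2, 2, 7, 3))   # 11100
# Run 2 starts at 7 with rank0 = 4; run 3 starts at 10 with rank1 = 5.
print(bits_between(7, 4, 10, 5))  # 110
```

Both outputs match the corresponding slices B[2:7] and B[7:10] of the example bit string.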
- The method disclosed in Non-Patent Document 2 corresponds to always storing the two values (start position, rank1) for every run. Because that method retains the rank1 bit string, select1 can be calculated at high speed. Because it does not retain a rank0 bit string, however, select0 cannot be calculated at high speed.
- The method of Non-Patent Document 2 could simply be extended to store all three values (start position, rank1, rank0).
- In that case, however, the required data size increases to about 1.5 times that of storing the two values (start position, rank1).
- In contrast, the information processing apparatus need not always store the same two kinds of values as in the method of Non-Patent Document 2. For each run, it may select two of the three kinds of values and store the two selected values.
- Since only two of the three values (start position, rank1, rank0) are stored, the required data size is reduced to about two-thirds of that needed to store all three, and the storage area is saved. As described above, the information processing apparatus can dynamically calculate the remaining value from the other two.
- The embodiment of the present invention is superior in that, rather than storing rank1 for every start position, it stores rank1 for some runs and rank0 for others.
- In other words, the information processing apparatus holds samples of the rank1 and rank0 values taken at various places throughout the bit string.
- For select1, an approximate position can be obtained by searching the sampled rank1 values.
- For select0, an approximate position can be obtained by searching the sampled rank0 values.
- The information processing apparatus then dynamically calculates the rank1 or rank0 values of the surrounding runs to obtain the exact position.
- When the information processing apparatus calculates select1(B, i),
- it first identifies, among the runs for which a rank1 value is stored, a run whose rank1 value is close to i.
- Using this run as a starting point, the information processing apparatus dynamically calculates the rank1 values of the nearby runs for which no rank1 value is stored.
- The rank1 value of such a run is calculated dynamically from the other two values, namely the start position of the run and its rank0 value. By restoring the rank1 values of the surrounding runs in this way, the information processing apparatus can calculate the position at which the 1 actually appears.
- Conversely, when the information processing apparatus calculates select0(B, i), it identifies, among the runs for which a rank0 value is stored, a run whose rank0 value is close to i,
- and dynamically calculates the rank0 values of the surrounding runs for which no rank0 value is stored.
- The information processing apparatus may store the three kinds of values for each run in rotation. That is, the first run stores the two values other than the start position, the second run stores the two values other than rank1, the third run stores the two values other than rank0, and the fourth run returns to storing the two values other than the start position.
- The information processing apparatus does not hold the rank0 value at the start position of an even-numbered run or the rank1 value at the start position of an odd-numbered run, but can calculate them dynamically.
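The rotation scheme can be sketched as follows. The missing value is recovered from the relation start position = rank1 + rank0, which holds at any position because every preceding bit is either 1 or 0. The dictionary-based storage and the names `store_by_rotation` and `restore` are assumptions made only for illustration; an actual embodiment would hold these values in compressed bit strings.

```python
FIELDS = ("start", "rank1", "rank0")


def store_by_rotation(runs):
    """Drop one field per run in rotation: start, rank1, rank0, start, ..."""
    stored = []
    for k, run in enumerate(runs):
        dropped = FIELDS[k % 3]
        stored.append({f: v for f, v in zip(FIELDS, run) if f != dropped})
    return stored


def restore(entry):
    """Recover (start, rank1, rank0) using start = rank1 + rank0."""
    if "start" not in entry:
        return (entry["rank1"] + entry["rank0"], entry["rank1"], entry["rank0"])
    if "rank1" not in entry:
        return (entry["start"], entry["start"] - entry["rank0"], entry["rank0"])
    return (entry["start"], entry["rank1"], entry["start"] - entry["rank1"])


# (start, rank1, rank0) of the three runs of B = "001110011011".
runs = [(2, 0, 2), (7, 3, 4), (10, 5, 5)]
stored = store_by_rotation(runs)
print(stored)
print([restore(e) for e in stored])
```

Each stored entry holds two values, and `restore` returns the original triple for every run.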
- FIG. 1 is a block diagram showing a schematic configuration of an information processing apparatus according to an embodiment of the present invention.
- the information processing apparatus 100 includes a storage unit 10 that stores a data structure 11.
- The data structure 11 is a data structure for expressing a bit string composed of a first value and a second value, and has first data 12, second data 13, and third data 14.
- the first data is data that specifies the position of all or part of a continuous section in which one or more first values or second values are continuous on the bit string.
- the second data is data that specifies the number of appearances of the first value that appears from the beginning of the bit string to the continuous section on the bit string for a part of the continuous section.
- the third data is data that specifies the number of appearances of the second value that appears from the beginning of the bit string to the continuous section on the bit string for a part of the continuous section.
- The first value is “1”,
- the second value is “0”,
- and the continuous sections whose positions the first data specifies are sections of consecutive “1”s, that is, runs.
- the first data is data for specifying the start position of the run (hereinafter referred to as “continuous section position data”).
- the second data is data specifying the number of occurrences of “1” up to the corresponding continuous section, that is, the value of rank 1 (hereinafter referred to as “rank 1 data”).
- the third data is data that specifies the number of occurrences of “0” up to the corresponding continuous section, that is, the value of rank 0 (hereinafter referred to as “rank 0 data”).
- The storage unit 10 stores some of the three values (start position, rank1, rank0) for all or some of the continuous sections. Therefore, both types of select operations can be performed on the target bit string. Moreover, since not all three values need to be stored for every continuous section, an increase in the size of the complete dictionary (data structure 11) is suppressed.
- FIG. 2 is a block diagram specifically showing the configuration of the information processing apparatus 100 according to the embodiment of the present invention.
- The information processing apparatus 100 includes a calculation unit 20, an input reception unit 30, and an output unit 40 in addition to the storage unit 10 that stores the data structure 11.
- the information processing apparatus 100 is constructed, for example, by introducing a program to be described later into a computer. In this case, the information processing apparatus 100 can function as a part of an operating system constituting the computer.
- the input receiving unit 30 receives an external input and outputs it to the calculating unit 20.
- the output unit 40 outputs the calculation result by the calculation unit 20 to the outside.
- the calculation unit 20 includes a first select calculation unit 21, a second select calculation unit 22, a first rank calculation unit 23, and a second rank calculation unit 24.
- When a natural number is input, the first select calculation unit 21 specifies a first select position, which is a position on the bit string such that the number of “1”s (the first value) included from the head up to that position equals the natural number.
- The first select calculation unit 21 specifies the first select position using the continuous section position data, the rank1 data, and the rank0 data. That is, the first select calculation unit 21 executes select1(B, i) for the bit string B.
- When a natural number is input, the second select calculation unit 22 specifies a second select position, which is a position on the bit string such that the number of “0”s (the second value) included from the head up to that position equals the natural number.
- The second select calculation unit 22 specifies the second select position using the continuous section position data, the rank1 data, and the rank0 data. That is, the second select calculation unit 22 executes select0(B, i) for the bit string B.
- When a position on the bit string is designated, the first rank calculation unit 23 uses the continuous section position data, the rank1 data, and the rank0 data to specify the number of occurrences of “1” (the first value) up to the designated position. That is, the first rank calculation unit 23 executes rank1(B, i) for the bit string B.
- When a position on the bit string is designated, the second rank calculation unit 24 uses the continuous section position data, the rank1 data, and the rank0 data to specify the number of occurrences of “0” (the second value) up to the designated position. That is, the second rank calculation unit 24 executes rank0(B, i) for the bit string B.
- FIG. 3 is a diagram illustrating an example of a target bit string and various values obtained therefrom.
- FIG. 4 is a diagram showing an example of a data structure for expressing the bit string shown in FIG.
- The bit string B that is the target data is [001110011011], and has three continuous sections (runs) of “1”.
- The start positions of the runs are 2, 7, and 10.
- In the “start position” row, the order of appearance of each run is written at the position corresponding to its start position.
- “Pos1” represents a position.
- the subscript of the array of the bit string B starts from 0, but since the rank of the run is counted from the first, it is assumed that the 0th run does not exist.
- rank1 indicates the number of occurrences of “1” from the head at each position in the bit string B, that is, the value of rank1.
- Rank0 indicates the number of occurrences of “0” from the beginning at each position in the bit string B, that is, the value of rank0.
- Select1 indicates an input value i whose position is returned as a result of select1 (B, i).
- Select0 indicates an input value i whose position is returned as a result of select0 (B, i).
- The continuous section position data is obtained by converting the “start position” data in FIG. 3 into a bit string.
- The rank1 data is obtained by converting the “rank1” values at the start positions in FIG. 3 into a bit string.
- The rank0 data is obtained by converting the “rank0” values at the start positions in FIG. 3 into a bit string.
- The bit string constituting the continuous section position data is denoted “B1”, the bit string constituting the rank1 data “Br1”, and the bit string constituting the rank0 data “Br0”.
- The data structure 11 is built so that, for each continuous section, at least two of the three values (start position, rank1, and rank0) are specified by the continuous section position data, the rank1 data, or the rank0 data. Which two values are specified varies with the position (start position) of the continuous section.
- In the example of FIG. 4, the continuous section position data specifies the positions of all continuous sections in which one or more 1s occur in succession on the bit string B.
- The continuous sections for which the rank1 data specifies the number of occurrences of “1” coincide with the even-numbered continuous sections.
- The continuous sections for which the rank0 data specifies the number of occurrences of “0” coincide with the odd-numbered continuous sections.
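To make the three components concrete, here is a minimal Python sketch that extracts them from a bit string. It is an assumption-laden illustration: `build_structure` is a hypothetical name, and each sparse bit string (B1, Br1, Br0) is represented simply as the sorted list of positions where it would hold a 1, rather than as a compressed bit string.

```python
def build_structure(bits):
    """Return (starts, rank1_even, rank0_odd) for a string of '0'/'1' characters.

    starts      -> positions of the 1s in B1 (start of every run of 1s)
    rank1_even  -> positions of the 1s in Br1 (rank1 at even-numbered run starts)
    rank0_odd   -> positions of the 1s in Br0 (rank0 at odd-numbered run starts)
    """
    starts, rank1_even, rank0_odd = [], [], []
    ones = zeros = 0  # counts of 1s and 0s seen strictly before position i
    for i, bit in enumerate(bits):
        if bit == "1" and (i == 0 or bits[i - 1] == "0"):
            starts.append(i)
            if len(starts) % 2 == 0:   # even-numbered run: keep rank1
                rank1_even.append(ones)
            else:                      # odd-numbered run: keep rank0
                rank0_odd.append(zeros)
        if bit == "1":
            ones += 1
        else:
            zeros += 1
    return starts, rank1_even, rank0_odd


starts, rank1_even, rank0_odd = build_structure("001110011011")
# starts -> [2, 7, 10], rank1_even -> [3], rank0_odd -> [2, 5]
```

Each run contributes its start position plus exactly one of rank1 or rank0, so every run has two of its three values stored directly.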
- The data structure 11 is not limited to the example shown in FIG. 4.
- For example, the data structure may specify rank1 and rank0 for the first continuous section, the start position and rank0 for the second continuous section, and the start position and rank1 for the third continuous section, repeating this pattern thereafter.
- The first select calculation unit 21 and the second select calculation unit 22 perform their calculations as follows. Based on the number of occurrences of “1” specified by the rank1 data, the first select calculation unit 21 first estimates a continuous section for which the number of occurrences of “1” is specified in the rank1 data and in which, or near which, the select1 position to be specified lies.
- Next, for a continuous section that is adjacent to the estimated continuous section and for which the number of occurrences of “1” is not specified in the rank1 data, the first select calculation unit 21 specifies the number of occurrences of “1” based on the continuous section position data and the rank0 data. The first select calculation unit 21 then specifies select1 using the specified number of occurrences of “1”.
- Based on the number of occurrences of “0” specified by the rank0 data, the second select calculation unit 22 first estimates a continuous section for which the number of occurrences of “0” is specified in the rank0 data and in which, or near which, the select0 position to be specified lies.
- Next, for a continuous section that is adjacent to the estimated continuous section and for which the number of occurrences of “0” is not specified in the rank0 data, the second select calculation unit 22 specifies the number of occurrences of “0” based on the continuous section position data and the rank1 data. The second select calculation unit 22 then specifies select0 using the specified number of occurrences of “0”.
- The data structure 11 is preferably compressed by regarding each of the positions specified by the continuous section position data, the numbers of occurrences of “1” specified by the rank1 data, and the numbers of occurrences of “0” specified by the rank0 data as a monotonically increasing sequence. In this case, the data structure 11 is stored in the storage unit 10 in a compressed state.
- FIG. 5 is a flowchart showing the operation of the information processing apparatus according to the embodiment of the present invention.
- In the following description, FIG. 1 is referred to as appropriate.
- In the present embodiment, the information processing method is carried out by operating the information processing apparatus 100. Therefore, the description of the information processing method in the present embodiment is replaced by the following description of the operation of the information processing apparatus 100.
- The input receiving unit 30 receives a natural number and a requested operation from the outside (step A1), and outputs the received content to the calculation unit 20.
- The calculation unit 20 determines whether or not the requested operation is select1 (step A2). If the requested operation is select1, the first select calculation unit 21 acquires the continuous section position data 12, the rank1 data 13, and the rank0 data 14 from the storage unit 10. The first select calculation unit 21 then uses them to calculate select1 for the natural number received in step A1 (step A3).
- If the determination in step A2 shows that the requested operation is not select1, the calculation unit 20 determines whether or not the requested operation is select0 (step A4).
- If the determination in step A4 shows that the requested operation is select0, the second select calculation unit 22 acquires the continuous section position data 12, the rank1 data 13, and the rank0 data 14 from the storage unit 10. The second select calculation unit 22 then uses them to calculate select0 for the natural number received in step A1 (step A6).
- If the determination in step A4 shows that the requested operation is not select0, the calculation unit 20 determines whether or not the requested operation is rank1 (step A5).
- If the determination in step A5 shows that the requested operation is rank1, the first rank calculation unit 23 acquires the continuous section position data 12, the rank1 data 13, and the rank0 data 14 from the storage unit 10. The first rank calculation unit 23 then uses them to calculate rank1 for the natural number received in step A1 (step A7).
- If the determination in step A5 shows that the requested operation is not rank1, the second rank calculation unit 24 acquires the continuous section position data 12, the rank1 data 13, and the rank0 data 14 from the storage unit 10. The second rank calculation unit 24 then uses them to calculate rank0 for the natural number received in step A1 (step A8).
- When step A3, A6, A7, or A8 has been executed, the output unit 40 receives the calculation result and outputs it to the outside (step A9). By executing steps A1 to A9 in this way, the select1, select0, rank1, and rank0 operations can be performed using the data structure 11.
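The branching in steps A1 to A9 amounts to a simple dispatch on the requested operation. The sketch below mirrors only that control flow. The function `handle_request` and the operation strings are hypothetical, and the four operations are computed naively on the raw bit string rather than from the data structure 11, purely to keep the example short.

```python
def handle_request(bits, operation, number):
    """Dispatch one request in the order of the flowchart (A2 -> A4 -> A5)."""

    def rank(value, i):
        # occurrences of `value` strictly before position i
        return bits[:i].count(value)

    def select(value, k):
        # 0-based position of the k-th occurrence (k counted from 1)
        if k < 1:
            raise ValueError("k must be a natural number")
        positions = [p for p, bit in enumerate(bits) if bit == value]
        return positions[k - 1]

    if operation == "select1":       # step A2 yes -> step A3
        return select("1", number)
    if operation == "select0":       # step A4 yes -> step A6
        return select("0", number)
    if operation == "rank1":         # step A5 yes -> step A7
        return rank("1", number)
    return rank("0", number)         # step A5 no  -> step A8


result = handle_request("001110011011", "select1", 4)  # -> 7
```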
- The bit string B1 is a sparse bit string that is 1 only at the start position of each run.
- The length of the bit string Br1 is equal to the number m of 1s contained in the bit string B, and Br1 contains b/2 1s.
- The length of the bit string Br0 is equal to the number (n − m) of 0s contained in B.
- The bit string Br0 also contains b/2 1s.
- The size of the data structure 11, that is, the size of the complete dictionary, is the sum of the sizes of the complete dictionaries of the three sparse bit strings B1, Br1, and Br0.
- The size of the complete dictionary of the bit string B1 is b log(n/b) + 2b + o(b) bits, because B1 has length n and contains b 1s.
- The size of the complete dictionary of the bit string Br1 is (b/2) log(2m/b) + 2(b/2) + o(b) bits, because Br1 has length m and contains b/2 1s.
- The size of the complete dictionary of the bit string Br0 is (b/2) log(2(n − m)/b) + 2(b/2) + o(b) bits. Adding these together, the total size S of the complete dictionary (data structure 11) is given by equation (10): S = b log(n/b) + (b/2) log(2m/b) + (b/2) log(2(n − m)/b) + 4b + o(b).
- When m ≈ n/2, the values log(n/b), log(2m/b), and log(2(n − m)/b) can be regarded as substantially equal. Let C denote this common value, log(n/b).
- The size of the complete dictionary in the present embodiment is then 2b(C + 2) + o(b) bits. In other words, both select1 and select0 can be supported at almost the same size as the complete dictionary of Non-Patent Document 2, which supports only select1.
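The size estimate can be checked numerically. The following sketch sums the three per-bit-string sizes and compares the result with 2b(C + 2). It assumes base-2 logarithms and drops the o(b) terms; the function names and the sample values of n, m, and b are illustrative, not taken from the source.

```python
from math import log2


def summed_size_bits(n, m, b):
    """Sum of the three dictionary sizes, ignoring the o(b) terms."""
    size_b1 = b * log2(n / b) + 2 * b
    size_br1 = (b / 2) * log2(2 * m / b) + 2 * (b / 2)
    size_br0 = (b / 2) * log2(2 * (n - m) / b) + 2 * (b / 2)
    return size_b1 + size_br1 + size_br0


def approx_size_bits(n, b):
    """2b(C + 2) with C = log(n/b), again ignoring o(b)."""
    c = log2(n / b)
    return 2 * b * (c + 2)


# With m = n/2 the three logarithms coincide, so both expressions agree.
exact = summed_size_bits(n=1024, m=512, b=16)   # -> 256.0
approx = approx_size_bits(n=1024, b=16)         # -> 256.0
```

When m moves away from n/2, the two values drift apart, which is why the description treats the equality as an approximation.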
- The size of the complete dictionary in this embodiment is about two-thirds of the size of a complete dictionary of Non-Patent Document 2 that supports both select1 and select0.
- In the example described above, the first value is “1”, the second value is “0”, and the continuous sections whose positions the first data specifies are runs of “1”.
- The present embodiment is not limited to this example. Since the 1s and 0s in a bit string are interchangeable, they may be inverted so that the first value is “0”, the second value is “1”, and the continuous sections whose positions the first data specifies are runs of “0”. Even in this case, the same effects as described above are obtained.
- In the example described above, the continuous section position data holds the start position of each continuous section.
- Front and rear are interchangeable, so the continuous section position data may hold the end position of each continuous section instead of its start position. Even in this case, the same effects are obtained.
- In the example described above, rank1 is held for even-numbered runs and rank0 is held for odd-numbered runs.
- The roles of odd-numbered and even-numbered runs may be swapped. Even in this case, the same effects are obtained.
- The program in this embodiment may be any program that causes a computer to execute steps A1 to A9 shown in FIG. 5.
- The information processing apparatus 100 and the information processing method in the present embodiment can be realized by installing this program on a computer and executing it.
- In this case, the CPU (Central Processing Unit) of the computer functions as the input receiving unit 30, the calculation unit 20, and the output unit 40, and performs the processing.
- The storage unit 10 is realized by storing the data files that constitute it in a storage device, such as a memory or a hard disk, provided in the computer.
- FIG. 6 is a block diagram illustrating an example of a computer that implements the information processing apparatus according to the embodiment of the present invention.
- the computer 110 includes a CPU 111, a main memory 112, a storage device 113, an input interface 114, a display controller 115, a data reader / writer 116, and a communication interface 117. These units are connected to each other via a bus 121 so that data communication is possible.
- The CPU 111 performs various operations by loading the program (code) of the present embodiment stored in the storage device 113 into the main memory 112 and executing its instructions in a predetermined order.
- the main memory 112 is typically a volatile storage device such as a DRAM (Dynamic Random Access Memory).
- the program in the present embodiment is provided in a state of being stored in a computer-readable recording medium 120. Note that the program in the present embodiment may be distributed on the Internet connected via the communication interface 117.
- Specific examples of the storage device 113 include a hard disk and semiconductor storage devices such as flash memory.
- the input interface 114 mediates data transmission between the CPU 111 and an input device 118 such as a keyboard and a mouse.
- the display controller 115 is connected to the display device 119 and controls display on the display device 119.
- the data reader / writer 116 mediates data transmission between the CPU 111 and the recording medium 120, and reads a program from the recording medium 120 and writes a processing result in the computer 110 to the recording medium 120.
- the communication interface 117 mediates data transmission between the CPU 111 and another computer.
- Specific examples of the recording medium 120 include general-purpose semiconductor storage devices such as CF (Compact Flash) and SD (Secure Digital), magnetic storage media such as a flexible disk, and optical storage media such as CD-ROM (Compact Disk Read Only Memory).
- A specific example of the information processing apparatus according to the present embodiment shown in FIGS. 1 to 6 is described below. The description focuses on the fact that the data structure (complete dictionary) of the present embodiment can be used as a complete dictionary, that is, that the access, rank, and select operations can be performed.
- In the following, the bit string B shown in FIG. 3 is assumed.
- The bit string B shown in FIG. 3 has length 12 and consists of seven 1s and five 0s.
- The start positions of the runs in the bit string B are the four positions B[2], B[7], B[10], and B[12].
- The run starting at B[12] is a virtual run that does not exist in the actual bit string B.
- Let s_a be the start position of the a-th run, e_a the position where the a-th run ends and the bits become 0, and s_{a+1} the start position of the (a+1)-th run.
- s_a ≤ i ≤ s_{a+1} and s_a ≤ e_a ≤ s_{a+1} always hold.
- The range B[s_a, e_a) is all 1s, and the range B[e_a, s_{a+1}) is all 0s.
- In the formulas of this specification, s_a, s_{a+1}, e_a, B_r0, and B_r1 may also be written as sa, sa+1, ea, Br0, and Br1, respectively.
- rank0(B, e_a) = rank0(B, s_a) holds.
- rank1(B, e_a) = rank1(B, s_{a+1}) holds.
- When a is odd, the value of rank0 at s_a is recorded as the position of the ((a − 1)/2 + 1)-th 1 in Br0.
- Likewise, the value of rank1 at s_{a+1} is recorded as the position of the ((a − 1)/2 + 1)-th 1 in Br1. Therefore, summing the results of select1 computed on these two complete dictionaries yields the value of e_a.
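Because every position is either a 0 or a 1, e_a = rank0(B, e_a) + rank1(B, e_a), and the two identities above turn this into e_a = rank0(B, s_a) + rank1(B, s_{a+1}). The sketch below checks that relation by brute force on the example bit string. The helper names are hypothetical, rank is assumed to count strictly before the given position, and a virtual run start is appended at position len(B) as in the description.

```python
def rank(bits, value, i):
    """Occurrences of `value` in bits[0:i]."""
    return bits[:i].count(value)


def run_starts(bits):
    """Start positions of the runs of 1s, plus a virtual run at len(bits)."""
    real = [i for i in range(len(bits))
            if bits[i] == "1" and (i == 0 or bits[i - 1] == "0")]
    return real + [len(bits)]


def run_end(bits, a):
    """e_a for the a-th run (a counted from 1): rank0 at s_a plus rank1 at s_{a+1}."""
    starts = run_starts(bits)
    return rank(bits, "0", starts[a - 1]) + rank(bits, "1", starts[a])


B = "001110011011"
ends = [run_end(B, a) for a in (1, 2, 3)]  # -> [5, 9, 12]
```

The runs occupy B[2, 5), B[7, 9), and B[10, 12), matching the computed end positions.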
- The value of access(B, i) is then obtained by comparing e_a with i as described above: B[i] is 1 if i < e_a and 0 otherwise.
- the calculation unit 20 may divide the case in the same way. If i ⁇ ea, the desired answer can be calculated by the following equation (12).
- The calculation unit 20 then calculates the value of access(B, i) by comparing the value of e_a with the value of i, as described above.
- In this way, the calculation unit 20 can calculate access and rank1. In effect, the calculation unit 20 computes rank1 once on the bit string B1, select1 once on the bit string Br1, and select1 once on the bit string Br0. Since rank0(B, i) is equal to i − rank1(B, i), it can easily be calculated from the value of rank1.
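A compact way to see how access and rank1 fall out of the three components is sketched below. This is a sketch under stated assumptions, not the patent's implementation: sorted Python lists with `bisect` stand in for rank/select on the sparse bit strings B1, Br1, and Br0, a virtual run is appended at position n, and all function names are hypothetical.

```python
from bisect import bisect_right


def build(bits):
    """(starts, r1_even, r0_odd), with a virtual run start at len(bits)."""
    n = len(bits)
    starts = [i for i in range(n)
              if bits[i] == "1" and (i == 0 or bits[i - 1] == "0")] + [n]
    r1_even = [bits[:s].count("1") for s in starts[1::2]]  # runs 2, 4, ...
    r0_odd = [bits[:s].count("0") for s in starts[0::2]]   # runs 1, 3, ...
    return starts, r1_even, r0_odd


def rank1_at_start(d, a):
    """rank1 at the start of run a (1-based); derived from rank0 when a is odd."""
    starts, r1_even, r0_odd = d
    if a % 2 == 0:
        return r1_even[a // 2 - 1]
    return starts[a - 1] - r0_odd[(a - 1) // 2]


def run_end(d, a):
    """e_a = rank0 at s_a + rank1 at s_{a+1}."""
    rank0_here = d[0][a - 1] - rank1_at_start(d, a)
    return rank0_here + rank1_at_start(d, a + 1)


def rank1(d, i):
    """Number of 1s in B[0:i], for 0 <= i <= n."""
    starts = d[0]
    a = bisect_right(starts, i)          # number of run starts <= i
    if a == 0:
        return 0                         # before the first run
    if a == len(starts):
        return rank1_at_start(d, a)      # i == n: total number of 1s
    if i < run_end(d, a):
        return rank1_at_start(d, a) + (i - starts[a - 1])
    return rank1_at_start(d, a + 1)


def access(d, i):
    """B[i] as an integer, for 0 <= i < n."""
    a = bisect_right(d[0], i)
    return 1 if a > 0 and i < run_end(d, a) else 0


d = build("001110011011")
```

For example, `rank1(d, 8)` returns 4 and `access(d, 5)` returns 0; rank0 follows as `i - rank1(d, i)`.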
- FIG. 7 is a diagram schematically illustrating a calculation method of select1 (B, i) in the embodiment of the present invention.
- the calculation unit 20 can calculate the answer of select1 using the data structure 11.
- In some cases, a = 0 is obtained.
- Even in this case, the same calculation can be performed.
- Since the (a*2)-th run does not exist, i always lies on the (a*2+1)-th run.
- b is always 0, so i ⁇ b is not satisfied.
- At first glance, the calculation unit 20 appears to query the same data structure 11 many times, but after the first lookup it only refers to elements adjacent to the element already found, so each search in the data structure 11 is performed only once. That is, three operations are performed in total: rank1 on the bit string Br1, select1 on the bit string B1, and select1 on the bit string Br0.
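The select1 procedure described above can be sketched as follows. The assumptions are the same as for any illustration here: sorted lists and `bisect` stand in for rank/select on the sparse bit strings, a virtual run is appended at position n, select1 takes k counted from 1 and returns a 0-based position, and the names are hypothetical.

```python
from bisect import bisect_right


def build(bits):
    """(starts, r1_even, r0_odd), with a virtual run start at len(bits)."""
    n = len(bits)
    starts = [i for i in range(n)
              if bits[i] == "1" and (i == 0 or bits[i - 1] == "0")] + [n]
    r1_even = [bits[:s].count("1") for s in starts[1::2]]  # runs 2, 4, ...
    r0_odd = [bits[:s].count("0") for s in starts[0::2]]   # runs 1, 3, ...
    return starts, r1_even, r0_odd


def select1(d, k):
    """0-based position of the k-th 1 (k counted from 1)."""
    starts, r1_even, r0_odd = d
    last = len(starts)
    total_ones = r1_even[-1] if last % 2 == 0 else starts[-1] - r0_odd[-1]
    if not 1 <= k <= total_ones:
        raise ValueError("k out of range")
    t = k - 1                                # number of 1s before the target
    a = bisect_right(r1_even, t)             # rank on Br1: target is on run 2a or 2a+1
    r1_next = starts[2 * a] - r0_odd[a]      # rank1 at start of run 2a+1 (position - rank0)
    if t >= r1_next:                         # target lies on the odd run 2a+1
        return starts[2 * a] + (t - r1_next)
    return starts[2 * a - 1] + (t - r1_even[a - 1])  # target lies on the even run 2a


d = build("001110011011")
positions = [select1(d, k) for k in range(1, 8)]  # -> [2, 3, 4, 7, 8, 10, 11]
```

When a is 0, the rank1 value at the start of the first run is always 0, so the first branch is taken and the nonexistent 0th run is never consulted.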
- The a-th run recorded in Br0 is the (a*2−1)-th run among all runs, odd-numbered and even-numbered combined. Attention is therefore paid to the three runs (a*2−1), (a*2), and (a*2+1).
- Because the (a*2)-th run is even-numbered, the rank0 value at its start position is not stored directly in the complete dictionary and must be calculated from the position and the rank1 value. This value can be calculated using Equation 21 below.
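The select0 case can be sketched in the same style. The rank0 value at the start of an even-numbered run is recovered as position minus rank1, which is what the derivation above requires (the exact form of Equation 21 is not reproduced in this excerpt, so this relation is an assumption based on rank0(B, i) = i − rank1(B, i)). Lists with `bisect` again stand in for the sparse bit strings, and the names are hypothetical.

```python
from bisect import bisect_right


def build(bits):
    """(starts, r1_even, r0_odd), with a virtual run start at len(bits)."""
    n = len(bits)
    starts = [i for i in range(n)
              if bits[i] == "1" and (i == 0 or bits[i - 1] == "0")] + [n]
    r1_even = [bits[:s].count("1") for s in starts[1::2]]  # runs 2, 4, ...
    r0_odd = [bits[:s].count("0") for s in starts[0::2]]   # runs 1, 3, ...
    return starts, r1_even, r0_odd


def select0(d, k):
    """0-based position of the k-th 0 (k counted from 1)."""
    starts, r1_even, r0_odd = d
    last = len(starts)
    total_zeros = r0_odd[-1] if last % 2 == 1 else starts[-1] - r1_even[-1]
    if not 1 <= k <= total_zeros:
        raise ValueError("k out of range")
    t = k - 1                                 # number of 0s before the target
    a = bisect_right(r0_odd, t)               # rank on Br0: odd run 2a-1 starts at or before target
    if a == 0:
        return t                              # target precedes the first run
    r0_even = starts[2 * a - 1] - r1_even[a - 1]   # rank0 at start of run 2a
    if t >= r0_even:                          # gap after run 2a
        return t + (starts[2 * a] - r0_odd[a])     # t + rank1 at start of run 2a+1
    return t + r1_even[a - 1]                 # gap after run 2a-1: t + rank1 at start of run 2a


d = build("001110011011")
positions = [select0(d, k) for k in range(1, 6)]  # -> [0, 1, 5, 6, 9]
```

In both gaps the answer is the target's rank0 plus the rank1 of the next run start, since no 1s occur inside a gap.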
- The information processing apparatus 100 shown in the embodiments and examples can perform all of the access, rank1, rank0, select1, and select0 operations using the data structure (complete dictionary).
- The target value is obtained by executing rank or select a few times on the complete dictionaries constructed from the three sparse bit strings. That is, the amount of computation in this embodiment is of the same order as rank or select on a complete dictionary constructed for a sparse bit string, so processing is sufficiently fast in practice.
- In Non-Patent Document 2, there is a bit string that holds the value of rank1 but no bit string that holds the value of rank0, so select0 could not be calculated at high speed.
- In contrast, by recording rank1 and rank0 alternately in the data structure 11, both select0 and select1 can be calculated at high speed while the size of the data structure 11 remains substantially the same as that of the complete dictionary described in Non-Patent Document 2.
- The value of rank1 is stored in the data structure, but this can also be regarded as a unary encoding of the run lengths.
- Conventionally, a bit string that simply encodes each run length is used. In the above-described embodiments and examples, by contrast, two bit strings are used: one that unary-encodes the sum of the lengths of two runs at a time, and one that unary-encodes the sum of the lengths of the two gaps on either side of a run. This differs greatly from conventional run-length compression, in which the lengths of consecutive identical symbols are encoded as they are.
- According to the present invention, an increase in the size of the complete dictionary can be suppressed while enabling both types of select operation on the target bit string using the complete dictionary.
- The present invention is useful for systems that require search, particularly systems that use a wavelet tree structure.
- The present invention has been described above using the above-described embodiment as an exemplary example. However, the present invention is not limited to the above-described embodiment. That is, various aspects that can be understood by those skilled in the art can be applied within the scope of the present invention. This application claims priority based on Japanese Patent Application No. 2014-073545, filed on March 31, 2014, the entire disclosure of which is incorporated herein.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Description
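Suppose that the array L[i] holds the lower p bits of the monotonically increasing array P[i], and that the array H[i] holds the remaining upper bits of P[i]. That is, P[i] = 2^p · H[i] + L[i] holds. In this case, the array H is a monotonically non-decreasing sequence.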
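The split of a monotonically increasing array into lower and upper bits can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, and the example sequence and the choice p = 2 are arbitrary illustrations, not values from the source.

```python
def split_low_high(values, p):
    """Split each value into its lower p bits (L) and the remaining upper bits (H)."""
    mask = (1 << p) - 1
    low = [v & mask for v in values]
    high = [v >> p for v in values]
    return low, high


def combine(low, high, p):
    """Recover P[i] = 2^p * H[i] + L[i]."""
    return [(h << p) + l for l, h in zip(low, high)]


P = [2, 7, 10, 12]            # a monotonically increasing sequence
L, H = split_low_high(P, 2)   # L -> [2, 3, 2, 0], H -> [0, 1, 2, 3]
restored = combine(L, H, 2)   # -> [2, 7, 10, 12]
```

Because H is non-decreasing, it lends itself to compact encodings such as unary coding of the differences between consecutive entries, while L is stored verbatim with p bits per entry.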
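A storage unit is provided that stores a data structure for representing a bit string composed of a first value and a second value,
wherein the data structure has:
first data that specifies, on the bit string, the positions of all or some of the continuous sections in which the first value or the second value occurs one or more times in succession;
second data that specifies, for some of the continuous sections, the number of occurrences of the first value appearing on the bit string from the head of the bit string up to the continuous section; and
third data that specifies, for some of the continuous sections, the number of occurrences of the second value appearing on the bit string from the head of the bit string up to the continuous section.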
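A data structure for reproducing a bit string composed of a first value and a second value, the data structure having:
first data that specifies, on the bit string, the positions of all or some of the continuous sections in which the same value occurs one or more times in succession;
second data that specifies, for some of the continuous sections, the number of occurrences of the first value appearing on the bit string from the head of the bit string up to the continuous section; and
third data that specifies, for some of the continuous sections, the number of occurrences of the second value appearing on the bit string from the head of the bit string up to the continuous section.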
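An information processing method using a data structure that has: first data that specifies, on a bit string composed of a first value and a second value, the positions of all or some of the continuous sections in which the first value or the second value occurs one or more times in succession; second data that specifies, for some of the continuous sections, the number of occurrences of the first value appearing on the bit string from the head of the bit string up to the continuous section; and third data that specifies, for some of the continuous sections, the number of occurrences of the second value appearing on the bit string from the head of the bit string up to the continuous section, the method comprising:
(a) a step of, when a natural number is input, specifying, using the first data, the second data, and the third data, a first select position, which is a position on the bit string such that the number of occurrences of the first value contained from the head up to that position equals the natural number; and
(b) a step of, when a natural number is input, specifying, using the first data, the second data, and the third data, a second select position, which is a position on the bit string such that the number of occurrences of the second value contained from the head up to that position equals the natural number.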
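A program that causes a computer to execute:
(a) a step of storing, in a storage device provided in the computer,
a data structure that has: first data that specifies, on a bit string composed of a first value and a second value, the positions of all or some of the continuous sections in which the first value or the second value occurs one or more times in succession; second data that specifies, for some of the continuous sections, the number of occurrences of the first value appearing on the bit string from the head of the bit string up to the continuous section; and third data that specifies, for some of the continuous sections, the number of occurrences of the second value appearing on the bit string from the head of the bit string up to the continuous section;
(b) a step of, when a natural number is input, specifying, using the first data, the second data, and the third data, a first select position, which is a position on the bit string such that the number of occurrences of the first value contained from the head up to that position equals the natural number; and
(c) a step of, when a natural number is input, specifying, using the first data, the second data, and the third data, a second select position, which is a position on the bit string such that the number of occurrences of the second value contained from the head up to that position equals the natural number.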
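First, an overview of the present invention is given. In this specification, the value of rank1(B, s_a) at the start position s_a of the a-th run is referred to as the rank1 of the run, and the value of rank0(B, s_a) is referred to as the rank0 of the run.
For an arbitrary a-th run on the bit string B, let s_a be the start position of the a-th run and s_{a+1} the start position of the (a+1)-th run. Then, as long as the four values s_a, rank0(B, s_a), s_{a+1}, and rank1(B, s_{a+1}) are known, every bit in the range B[s_a, s_{a+1}) can be determined to be 1 or 0.
Let e_a be the position where the a-th run ends and the bits become 0. By the definition of a run, the range B[s_a, e_a) is all 1s and the range B[e_a, s_{a+1}) is all 0s. The value of e_a can then be calculated by Equation 8 below. Note that in the formulas of this specification, e_a is written as ea.
Applying this between all runs, if the three values <start position, rank1, rank0> are stored for every run, the original bit string B can be restored from those three values alone.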
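The restoration argument can be sketched in a few lines of Python. The sketch assumes that the end of each run is obtained as e_a = rank0(B, s_a) + rank1(B, s_{a+1}), which follows from the identities stated elsewhere in the description, and that the list of triples ends with a virtual run at position n. The function name `restore` is hypothetical.

```python
def restore(triples):
    """Rebuild the bit string from <start, rank1, rank0> triples.

    The last triple is assumed to be a virtual run whose start equals the
    length n of the bit string, so its rank1 is the total number of 1s.
    """
    n = triples[-1][0]
    bits = ["0"] * n
    for (start, _rank1, rank0), (_next_start, next_rank1, _next_rank0) in zip(
            triples, triples[1:]):
        end = rank0 + next_rank1          # e_a: first 0 after the run
        for pos in range(start, end):     # B[s_a, e_a) is all 1s
            bits[pos] = "1"
    return "".join(bits)


# Triples for B = 001110011011, including the virtual run at position 12.
triples = [(2, 0, 2), (7, 3, 4), (10, 5, 5), (12, 7, 5)]
restored = restore(triples)  # -> "001110011011"
```

Only rank0 of the current run and rank1 of the next run are used, which is why storing two of the three values per run is sufficient when the third can be derived as position minus the other rank.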
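Hereinafter, the data structure, information processing apparatus, information processing method, and program according to an embodiment of the present invention are described with reference to FIGS. 1 to 6.
First, the configuration of the information processing apparatus according to the present embodiment is described. FIG. 1 is a block diagram showing the schematic configuration of the information processing apparatus according to the embodiment of the present invention.
Next, the operation of the information processing apparatus 100 according to the embodiment of the present invention is described with reference to FIG. 5. FIG. 5 is a flowchart showing the operation of the information processing apparatus according to the embodiment of the present invention. In the following description, FIG. 1 is referred to as appropriate. In the present embodiment, the information processing method is carried out by operating the information processing apparatus 100. Therefore, the description of the information processing method in the present embodiment is replaced by the following description of the operation of the information processing apparatus 100.
Next, the effects of the present embodiment are described. As shown in FIG. 4, in the present embodiment, three sparse bit strings B1, Br1, and Br0 are prepared for the bit string B as the data structure 11. The bit string B1 is a sparse bit string that is 1 only at the start position of each run.
The bit string Br1 stores the values of rank1 at the start positions of the even-numbered runs. That is, for the start position i of each even-numbered run, Br1[rank1(B, i)] = 1 is set, and all other elements are 0. The length of this bit string is equal to the number m of 1s contained in the bit string B. This bit string contains b/2 1s.
In the example described above, the first value is “1”, the second value is “0”, and the continuous sections whose positions the first data specifies are runs of “1”, but the present embodiment is not limited to this example. Since the 1s and 0s in a bit string are interchangeable, they may be inverted so that the first value is “0”, the second value is “1”, and the continuous sections whose positions the first data specifies are runs of “0”. Even in this case, exactly the same effects as described above are obtained.
The program in the present embodiment may be any program that causes a computer to execute steps A1 to A9 shown in FIG. 5. By installing this program on a computer and executing it, the information processing apparatus 100 and the information processing method in the present embodiment can be realized. In this case, the CPU (Central Processing Unit) of the computer functions as the input receiving unit 30, the calculation unit 20, and the output unit 40, and performs the processing. The storage unit 10 is realized by storing the data files that constitute it in a storage device, such as a memory or a hard disk, provided in the computer.
The description focuses on the fact that the data structure (complete dictionary) of the present embodiment can be used as a complete dictionary, that is, that the access, rank, and select operations can be performed.
The present invention has been described above using the above-described embodiment as an exemplary example. However, the present invention is not limited to the above-described embodiment. That is, various aspects that can be understood by those skilled in the art can be applied within the scope of the present invention.
This application claims priority based on Japanese Patent Application No. 2014-073545, filed on March 31, 2014, the entire disclosure of which is incorporated herein.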
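11 Data structure
12 Continuous section position data (first data)
13 rank1 data (second data)
14 rank0 data (third data)
20 Calculation unit
21 First select calculation unit
22 Second select calculation unit
23 First rank calculation unit
24 Second rank calculation unit
30 Input receiving unit
40 Output unit
100 Information processing apparatus
110 Computer
111 CPU
112 Main memory
113 Storage device
114 Input interface
115 Display controller
116 Data reader/writer
117 Communication interface
118 Input device
119 Display device
120 Recording medium
121 Bus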
Claims (21)
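- An information processing apparatus comprising storage means that stores a data structure for representing a bit string composed of a first value and a second value,
wherein the data structure has:
first data that specifies, on the bit string, the positions of all or some of the continuous sections in which the first value or the second value occurs one or more times in succession;
second data that specifies, for some of the continuous sections, the number of occurrences of the first value appearing on the bit string from the head of the bit string up to the continuous section; and
third data that specifies, for some of the continuous sections, the number of occurrences of the second value appearing on the bit string from the head of the bit string up to the continuous section.
- The information processing apparatus according to claim 1, wherein, for each continuous section, at least two of three values, namely the position of the continuous section, the number of occurrences of the first value, and the number of occurrences of the second value, are specified by the first data, the second data, or the third data, and the at least two values that are specified vary with the position of the continuous section.
- The information processing apparatus according to claim 2, wherein the first data specifies the positions of all continuous sections in which the first value occurs one or more times in succession on the bit string,
the continuous sections for which the second data specifies the number of occurrences of the first value coincide with the even-numbered continuous sections, and
the continuous sections for which the third data specifies the number of occurrences of the second value coincide with the odd-numbered continuous sections.
- The information processing apparatus according to any one of claims 1 to 3, further comprising:
first select calculation means that, when a natural number is input, specifies, using the first data, the second data, and the third data, a first select position, which is a position on the bit string such that the number of occurrences of the first value contained from the head up to that position equals the natural number; and
second select calculation means that, when a natural number is input, specifies, using the first data, the second data, and the third data, a second select position, which is a position on the bit string such that the number of occurrences of the second value contained from the head up to that position equals the natural number.
- The information processing apparatus according to claim 4, wherein the first select calculation means:
estimates, based on the number of occurrences of the first value specified by the second data, a continuous section for which the number of occurrences of the first value is specified in the second data and in which, or near which, the first select position to be specified lies;
specifies, based on the first data and the third data, the number of occurrences of the first value for a continuous section that is adjacent to the estimated continuous section and for which the number of occurrences of the first value is not specified in the second data; and
specifies the first select position using the specified number of occurrences of the first value.
- The information processing apparatus according to claim 4 or 5, wherein the second select calculation means:
estimates, based on the number of occurrences of the second value specified by the third data, a continuous section for which the number of occurrences of the second value is specified in the third data and in which, or near which, the second select position to be specified lies;
specifies, based on the first data and the second data, the number of occurrences of the second value for a continuous section that is adjacent to the estimated continuous section and for which the number of occurrences of the second value is not specified in the third data; and
specifies the second select position using the specified number of occurrences of the second value.
- The information processing apparatus according to any one of claims 1 to 6, wherein the data structure is compressed by regarding each of the positions specified by the first data, the numbers of occurrences of the first value specified by the second data, and the numbers of occurrences of the second value specified by the third data as a monotonically increasing sequence, and is stored in the storage unit in the compressed state.
- A data structure for reproducing a bit string composed of a first value and a second value, the data structure having:
first data that specifies, on the bit string, the positions of all or some of the continuous sections in which the same value occurs one or more times in succession;
second data that specifies, for some of the continuous sections, the number of occurrences of the first value appearing on the bit string from the head of the bit string up to the continuous section; and
third data that specifies, for some of the continuous sections, the number of occurrences of the second value appearing on the bit string from the head of the bit string up to the continuous section.
- The data structure according to claim 8, wherein, among the continuous sections whose positions the first data specifies, the continuous sections for which the second data specifies the number of occurrences of the first value, and the continuous sections for which the second data specifies the number of occurrences of the second value, two coincide with each other or all differ from one another.
- The data structure according to claim 9, wherein the first data specifies the positions of all continuous sections in which the first value occurs one or more times in succession on the bit string,
the continuous sections for which the second data specifies the number of occurrences of the first value coincide with the even-numbered continuous sections, and
the continuous sections for which the third data specifies the number of occurrences of the second value coincide with the odd-numbered continuous sections.
- An information processing method using a data structure that has: first data that specifies, on a bit string composed of a first value and a second value, the positions of all or some of the continuous sections in which the first value or the second value occurs one or more times in succession; second data that specifies, for some of the continuous sections, the number of occurrences of the first value appearing on the bit string from the head of the bit string up to the continuous section; and third data that specifies, for some of the continuous sections, the number of occurrences of the second value appearing on the bit string from the head of the bit string up to the continuous section, the method comprising:
(a) a step of, when a natural number is input, specifying, using the first data, the second data, and the third data, a first select position, which is a position on the bit string such that the number of occurrences of the first value contained from the head up to that position equals the natural number; and
(b) a step of, when a natural number is input, specifying, using the first data, the second data, and the third data, a second select position, which is a position on the bit string such that the number of occurrences of the second value contained from the head up to that position equals the natural number.
- The information processing method according to claim 11, wherein, for each continuous section, at least two of three values, namely the position of the continuous section, the number of occurrences of the first value, and the number of occurrences of the second value, are specified by the first data, the second data, or the third data, and the at least two values that are specified vary with the position of the continuous section.
- The information processing apparatus according to claim 12, wherein the first data specifies the positions of all continuous sections in which the first value occurs one or more times in succession on the bit string,
the continuous sections for which the second data specifies the number of occurrences of the first value coincide with the even-numbered continuous sections, and
the continuous sections for which the third data specifies the number of occurrences of the second value coincide with the odd-numbered continuous sections.
- The information processing method according to claim 11, wherein, in step (a):
a continuous section for which the number of occurrences of the first value is specified in the second data and in which, or near which, the first select position to be specified lies is estimated based on the number of occurrences of the first value specified by the second data;
the number of occurrences of the first value is specified, based on the first data and the third data, for a continuous section that is adjacent to the estimated continuous section and for which the number of occurrences of the first value is not specified in the second data; and
the first select position is specified using the specified number of occurrences of the first value.
- The information processing method according to claim 11, wherein, in step (b):
a continuous section for which the number of occurrences of the second value is specified in the third data and in which, or near which, the second select position to be specified lies is estimated based on the number of occurrences of the second value specified by the third data;
the number of occurrences of the second value is specified, based on the first data and the second data, for a continuous section that is adjacent to the estimated continuous section and for which the number of occurrences of the second value is not specified in the third data; and
the second select position is specified using the specified number of occurrences of the second value.
- A program recording medium storing a program that causes a computer to execute:
(a) a step of storing, in a storage device provided in the computer,
a data structure that has: first data that specifies, on a bit string composed of a first value and a second value, the positions of all or some of the continuous sections in which the first value or the second value occurs one or more times in succession; second data that specifies, for some of the continuous sections, the number of occurrences of the first value appearing on the bit string from the head of the bit string up to the continuous section; and third data that specifies, for some of the continuous sections, the number of occurrences of the second value appearing on the bit string from the head of the bit string up to the continuous section;
(b) a step of, when a natural number is input, specifying, using the first data, the second data, and the third data, a first select position, which is a position on the bit string such that the number of occurrences of the first value contained from the head up to that position equals the natural number; and
(c) a step of, when a natural number is input, specifying, using the first data, the second data, and the third data, a second select position, which is a position on the bit string such that the number of occurrences of the second value contained from the head up to that position equals the natural number.
- The program recording medium according to claim 16, wherein, for each continuous section, at least two of three values, namely the position of the continuous section, the number of occurrences of the first value, and the number of occurrences of the second value, are specified by the first data, the second data, or the third data, and the at least two values that are specified vary with the position of the continuous section.
- The program recording medium according to claim 17, wherein the first data specifies the positions of all continuous sections in which the first value occurs one or more times in succession on the bit string,
the continuous sections for which the second data specifies the number of occurrences of the first value coincide with the even-numbered continuous sections, and
the continuous sections for which the third data specifies the number of occurrences of the second value coincide with the odd-numbered continuous sections.
- The program recording medium according to claim 16, wherein, in step (b):
a continuous section for which the number of occurrences of the first value is specified in the second data and in which, or near which, the first select position to be specified lies is estimated based on the number of occurrences of the first value specified by the second data;
the number of occurrences of the first value is specified, based on the first data and the third data, for a continuous section that is adjacent to the estimated continuous section and for which the number of occurrences of the first value is not specified in the second data; and
the first select position is specified using the specified number of occurrences of the first value.
- The program recording medium according to claim 16, wherein, in step (c):
a continuous section for which the number of occurrences of the second value is specified in the third data and in which, or near which, the second select position to be specified lies is estimated based on the number of occurrences of the second value specified by the third data;
the number of occurrences of the second value is specified, based on the first data and the second data, for a continuous section that is adjacent to the estimated continuous section and for which the number of occurrences of the second value is not specified in the third data; and
the second select position is specified using the specified number of occurrences of the second value.
- The program recording medium according to any one of claims 16 to 20, wherein, in step (a), the data structure is compressed by regarding each of the positions specified by the first data, the numbers of occurrences of the first value specified by the second data, and the numbers of occurrences of the second value specified by the third data as a monotonically increasing sequence, and is stored in the storage device in the compressed state.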
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP15772790.0A EP3128443A4 (en) | 2014-03-31 | 2015-03-20 | Data structure, information processing device, information processing method, and program recording medium |
JP2016511364A JP6276386B2 (ja) | 2014-03-31 | 2015-03-20 | Data structure, information processing device, information processing method, and program recording medium |
US15/127,479 US10789227B2 (en) | 2014-03-31 | 2015-03-20 | Data structure, information processing device, information processing method, and program recording medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2014073545 | 2014-03-31 | ||
JP2014-073545 | 2014-03-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015151444A1 (ja) | 2015-10-08 |
Family
ID=54239793
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2015/001568 WO2015151444A1 (ja) | Data structure, information processing device, information processing method, and program recording medium | 2014-03-31 | 2015-03-20 |
Country Status (4)
Country | Link |
---|---|
US (1) | US10789227B2 (ja) |
EP (1) | EP3128443A4 (ja) |
JP (1) | JP6276386B2 (ja) |
WO (1) | WO2015151444A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6123088B1 (ja) * | 2016-02-25 | 2017-05-10 | Rakuten, Inc. | Block encoding device, block decoding device, information processing device, program, block encoding method, and block decoding method |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9792308B2 (en) * | 1998-12-11 | 2017-10-17 | Realtime Data, Llc | Content estimation data compression |
US6208275B1 (en) * | 1999-06-01 | 2001-03-27 | William S. Lovell | Method and apparatus for digital concatenation |
US6975629B2 (en) * | 2000-03-22 | 2005-12-13 | Texas Instruments Incorporated | Processing packets based on deadline intervals |
US7784094B2 (en) * | 2005-06-30 | 2010-08-24 | Intel Corporation | Stateful packet content matching mechanisms |
US20080040345A1 (en) * | 2006-08-07 | 2008-02-14 | International Characters, Inc. | Method and Apparatus for String Search Using Parallel Bit Streams |
US8392174B2 (en) * | 2006-08-07 | 2013-03-05 | International Characters, Inc. | Method and apparatus for lexical analysis using parallel bit streams |
US7961960B2 (en) * | 2006-08-24 | 2011-06-14 | Dell Products L.P. | Methods and apparatus for reducing storage size |
US8453032B2 (en) * | 2010-04-21 | 2013-05-28 | General Electric Company | Energy and space efficient detection for data storage |
TWI432964B (zh) * | 2011-08-15 | 2014-04-01 | Phison Electronics Corp | 金鑰傳送方法、記憶體控制器與記憶體儲存裝置 |
EP2830225A4 (en) * | 2012-03-19 | 2015-12-30 | Fujitsu Ltd | PROGRAM, COMPROMISED DATA PRODUCTION PROCESS, DECOMPRESSION PROCESS, INFORMATION PROCESSING DEVICE AND RECORDING MEDIUM |
-
2015
- 2015-03-20 WO PCT/JP2015/001568 patent/WO2015151444A1/ja active Application Filing
- 2015-03-20 US US15/127,479 patent/US10789227B2/en active Active
- 2015-03-20 EP EP15772790.0A patent/EP3128443A4/en not_active Withdrawn
- 2015-03-20 JP JP2016511364A patent/JP6276386B2/ja active Active
Non-Patent Citations (5)
Title |
---|
DAISUKE OKANOHARA: "Large-Scale String Processing:Theory and Practice", IEICE TECHNICAL REPORT, vol. 110, no. 76, 7 June 2010 (2010-06-07), pages 15 - 22, XP008184390 * |
DAISUKE OKANOHARA: "Succinct Data Structure", IPSJ MAGAZINE, vol. 53, no. 5, 15 April 2012 (2012-04-15), pages 504 - 512, XP008184378 * |
KUNIHIKO SADAKANE: "Succinct Data Structures for Large-Scale Data Processing", IPSJ MAGAZINE, vol. 48, no. 8, 15 August 2007 (2007-08-15), pages 899 - 902, XP008184377 * |
See also references of EP3128443A4 * |
VELI MAKINEN ET AL.: "Storage and Retrieval of Highly Repetitive Sequence Collections", JOURNAL OF COMPUTATIONAL BIOLOGY, vol. 17, no. 3, 8 April 2010 (2010-04-08), pages 281 - 308, XP055229647, Retrieved from the Internet <URL:http://jltsiren.kapsi.fi/papers/Maekinen2010.pdf> * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
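JP6123088B1 (ja) * | 2016-02-25 | 2017-05-10 | Rakuten, Inc. | Block encoding device, block decoding device, information processing device, program, block encoding method, and block decoding method |
WO2017145317A1 (ja) * | 2016-02-25 | 2017-08-31 | Rakuten, Inc. | Block encoding device, block decoding device, information processing device, program, block encoding method, and block decoding method |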
US11212528B2 (en) | 2016-02-25 | 2021-12-28 | Rakuten Group, Inc. | Bit string block encoder device, block decoder device, information processing device, program, block encoding method and block decoding method |
Also Published As
Publication number | Publication date |
---|---|
US20170132262A1 (en) | 2017-05-11 |
US10789227B2 (en) | 2020-09-29 |
JP6276386B2 (ja) | 2018-02-07 |
JPWO2015151444A1 (ja) | 2017-04-13 |
EP3128443A1 (en) | 2017-02-08 |
EP3128443A4 (en) | 2017-11-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10025773B2 (en) | System and method for natural language processing using synthetic text | |
EP3120266B1 (en) | Ozip compression and decompression | |
US20080133565A1 (en) | Device and method for constructing inverted indexes | |
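JP2020518207A (ja) | Lossless reduction of data by using a prime data sieve, and performing multidimensional search and content-associative retrieval on data that has been losslessly reduced using a prime data sieve | |
TW202147787A (zh) | Exploiting locality of prime data for efficient retrieval of data that has been losslessly reduced using a prime data sieve | |
JP2007508753A (ja) | Data compression system and method | |
JP6048251B2 (ja) | Data compression device, data compression method, and data compression program, and data decompression device, data decompression method, and data decompression program | |
JP3714935B2 (ja) | Improved Huffman decoding method and apparatus | |
JP6846426B2 (ja) | Reduction of audio data and data stored on block processing storage systems | |
KR100495593B1 (ko) | File processing method, data processing apparatus, and storage medium | |
KR101842420B1 (ko) | Information processing apparatus and data management method | |
JP6276386B2 (ja) | Data structure, information processing device, information processing method, and program recording medium | |
CN114518841A (zh) | Processor-in-memory and method for outputting instructions using a processor-in-memory | |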
US8463759B2 (en) | Method and system for compressing data | |
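JP6805927B2 (ja) | Index generation program, data search program, index generation device, data search device, index generation method, and data search method | |
TW202030621A (zh) | Efficient retrieval of data that has been losslessly reduced using a prime data sieve | |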
US9595291B1 (en) | Columnar data storage on tape partition | |
US10037148B2 (en) | Facilitating reverse reading of sequentially stored, variable-length data | |
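JP2011033806A (ja) | Language model compression device, language model access device, language model compression method, language model access method, language model compression program, and language model access program | |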
US9059728B2 (en) | Random extraction from compressed data | |
JP5736589B2 (ja) | Sequence data search device, sequence data search method, and program | |
US10810180B1 (en) | Methods and systems for compressing data | |
JP2015159352A (ja) | Data compression device, data compression method, and program | |
JP2004213113A (ja) | Array compression method | |
US7893851B2 (en) | Encoding apparatus, method, and processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15772790 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2016511364 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 15127479 Country of ref document: US |
|
REEP | Request for entry into the european phase |
Ref document number: 2015772790 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2015772790 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |