WO2015143708A1 - Method and apparatus for constructing suffix array - Google Patents

Method and apparatus for constructing suffix array

Info

Publication number
WO2015143708A1
WO2015143708A1 PCT/CN2014/074276 CN2014074276W WO2015143708A1 WO 2015143708 A1 WO2015143708 A1 WO 2015143708A1 CN 2014074276 W CN2014074276 W CN 2014074276W WO 2015143708 A1 WO2015143708 A1 WO 2015143708A1
Authority
WO
WIPO (PCT)
Prior art keywords
suffix
array
order
unsorted
suffixes
Prior art date
Application number
PCT/CN2014/074276
Other languages
French (fr)
Chinese (zh)
Inventor
朱俊华
白戈
罗琼
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN201480000232.5A priority Critical patent/CN105264522A/en
Priority to PCT/CN2014/074276 priority patent/WO2015143708A1/en
Publication of WO2015143708A1 publication Critical patent/WO2015143708A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations

Definitions

  • The present invention relates to communication technologies and, in particular, to a method and a device for constructing the suffix array of a string.
  • Background
  • A suffix array is the array of all suffixes of a string arranged in lexicographic order. It is widely used in fields such as string matching, sequence analysis, and text compression.
  • The prefix doubling algorithm is one of the more commonly used suffix array construction algorithms.
  • Its core idea is to use the already obtained h-order rank of each suffix S_i (denoted R_h[i]) as the primary key and R_h[i+h] as the secondary key, and from these to derive the 2h-order suffix array and the corresponding rank array.
  • The prefix doubling algorithm based on the GPU (Graphics Processing Unit) takes full advantage of GPU radix sort, which makes the sorting process highly parallel and improves the performance of the algorithm.
  • However, at each iteration the prefix doubling algorithm does not distinguish suffixes whose final rank has already been found from suffixes whose final rank has not, so suffixes that already have their final rank are sorted again.
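  • As an illustration of the baseline just described, the following is a minimal CPU-only Python sketch of plain prefix doubling; it is not the GPU implementation. The function name `prefix_doubling` and the choice of the group-head position as the rank value are assumptions made for this example. Note that each iteration sorts all suffixes again, including those whose final rank is already known.

```python
def prefix_doubling(s):
    """Plain prefix doubling: every iteration re-sorts ALL suffixes."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i])    # 1-order suffix array
    rank = [0] * n
    for j in range(n):                            # 1-order ranks R_1
        same = j > 0 and s[sa[j]] == s[sa[j - 1]]
        rank[sa[j]] = rank[sa[j - 1]] if same else j
    h = 1
    while h < n:
        def key(i):
            # primary key R_h[i], secondary key R_h[i+h]
            # (-1 stands for "past the end of the string")
            return (rank[i], rank[i + h] if i + h < n else -1)
        sa.sort(key=key)                          # sorts every suffix again
        new_rank = [0] * n
        for j in range(n):
            same = j > 0 and key(sa[j]) == key(sa[j - 1])
            new_rank[sa[j]] = new_rank[sa[j - 1]] if same else j
        rank = new_rank
        h *= 2
    return sa
```

  • For the string `googol$` this sketch returns `[6, 3, 0, 5, 2, 4, 1]`, the starting positions of the suffixes in lexicographic order.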
  • The present invention provides a method and a device for constructing a suffix array, which solve the prior-art problem of repeatedly sorting suffixes whose final rank has already been found, and at the same time enable fast construction of the suffix array.
  • In a first aspect, an embodiment of the present invention provides a method for constructing a suffix array, comprising: obtaining, according to the suffix array SA_0 of a string and a first rank array R_0, the h-order suffix array SA_h and a second rank array R_h, where h is a variable with an initial value of 1; obtaining, according to the h-order suffix array SA_h, the set UG_h of unsorted suffixes in SA_h;
  • the method further includes:
  • N is a natural number greater than 2.
  • the updating the value of the variable h includes:
  • the method further includes:
  • the rank array R_0 of the string is obtained according to the starting character position of each suffix S_i in the suffix array SA_0.
  • obtaining the h-order suffix array SA_h and the second rank array R_h according to the suffix array SA_0 of the string and the first rank array R_0 includes:
  • when h is the initial value, the first character of each suffix S_i in the suffix array SA_0 is used as the comparison key, and the suffixes in SA_0 are sorted to obtain the h-order suffix array SA_h and the second rank array R_h.
  • a is the position of the starting suffix in the unsorted suffix segment
  • b is the position of the ending suffix in the unsorted suffix segment
  • the one or more unsorted suffix segments [a, b] constitute a set UG h of unsorted suffixes.
  • sorting all suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h includes:
  • the h-order rank array element R_h[i] of any suffix S_i in the set UG_h is used as the primary comparison key,
  • the h-order rank array element R_h[i+h] of the suffix S_{i+h} is used as the secondary comparison key, and
  • the suffixes in the set UG_h are sorted to obtain the 2h-order suffix array SA_2h.
  • another set UG_2h of unsorted suffixes is obtained according to the second rank array R_h and the 2h-order suffix array SA_2h;
  • one or more unsorted suffix segments are combined into the set UG_2h of unsorted suffixes.
  • the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA_0.
  • the suffix segment of the set UG h is divided into an S-type suffix segment and an L-type suffix segment according to a preset length T;
  • the S-type suffix segments are sorted by one thread block, and the L-type suffix segments are sorted by using two or more thread blocks to obtain a 2h-order suffix array SA 2h .
  • dividing the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to the preset length T includes:
  • comparing the length of each suffix segment of the set UG_h with the preset length T;
  • a suffix segment whose length is less than or equal to the preset length T is taken as an S-type suffix segment, and
  • a suffix segment whose length is greater than the preset length T is taken as an L-type suffix segment.
  • An embodiment of the present invention provides a device for constructing a suffix array, including: a first obtaining unit, configured to obtain, according to the suffix array SA_0 of a string and a first rank array R_0,
  • the h-order suffix array SA_h and a second rank array R_h, where h is a variable with an initial value of 1;
  • a second obtaining unit, configured to obtain, according to the h-order suffix array SA_h, the set UG_h of unsorted suffixes in SA_h;
  • a sorting unit, configured to sort all suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h;
  • a third obtaining unit, configured to obtain another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h; if the set UG_2h is empty, the sorted suffix array SA is obtained.
  • the third obtaining unit is further used for
  • the device further includes: a variable update unit, where the variable update unit is configured to update the value of the variable h;
  • the sorting unit is further configured to sort the set UG_2h of unsorted suffixes according to the third rank array R_2h obtained by the third obtaining unit and the value of the variable h updated by the variable update unit, in order to obtain the N-th set of unsorted suffixes, until the N-th set of unsorted suffixes finally obtained by the third obtaining unit is empty, at which point the sorted suffix array SA is obtained; N is a natural number greater than 2.
  • the variable updating unit is specifically configured to update the value of the variable h to 2h.
  • the apparatus further includes:
  • a fourth obtaining unit configured to: before the first obtaining unit acquires the h-order suffix array SA h , initialize the input string to obtain a suffix array SA Q of the string ;
  • the rank array R_0 of the string is obtained according to the starting character position of each suffix S_i in the suffix array SA_0.
  • the first obtaining unit is specifically configured to: when h is the initial value, use the first character of each suffix in the suffix array SA_0 as the comparison key, and sort the suffixes in SA_0 to obtain the h-order suffix array SA_h and the second rank array R_h.
  • the second obtaining unit is specifically configured to be used
  • the one or more unsorted suffix segments [a, b] constitute a set UG H of unsorted suffixes.
  • the sorting unit is specifically used to
  • use the h-order rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key,
  • use the h-order rank array element R_h[i+h] of the suffix S_{i+h} as the secondary comparison key, and
  • sort the suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h.
  • the third acquiring unit is specifically used to
  • combine one or more unsorted suffix segments into the set UG_2h of unsorted suffixes.
  • the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA_0.
  • the sorting unit is specifically used to
  • the suffix segment of the set UG h is divided into an S-type suffix segment and an L-type suffix segment according to a preset length T;
  • the S-type suffix segment and the L-type suffix segment are respectively sorted to obtain a 2h-order suffix array SA 2h .
  • the sorting unit is specifically used to
  • compare the length of each suffix segment of the set UG_h with the preset length T; a suffix segment whose length is less than or equal to the preset length T is taken as an S-type suffix segment, and a suffix segment whose length is greater than the preset length T is taken as an L-type suffix segment.
  • an embodiment of the present invention provides a device for constructing a suffix array, including: a processor and a memory; and the memory is configured to store an instruction;
  • the processor executes instructions stored in the memory for:
  • obtaining, according to the suffix array SA_0 of a string and a first rank array R_0, the h-order suffix array SA_h and a second rank array R_h, where h is a variable with an initial value of 1; obtaining, according to the h-order suffix array SA_h, the set UG_h of unsorted suffixes in SA_h;
  • the processor is further configured to obtain a third rank array R_2h and another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h;
  • the value of the variable h is updated, and the set UG_2h of unsorted suffixes is sorted according to the third rank array R_2h in order to obtain the N-th set of unsorted suffixes.
  • the processor is specifically configured to update the value of the variable h to 2h.
  • the processor is further used to
  • the rank array R_0 of the string is obtained according to the starting character position of each suffix S_i in the suffix array SA_0.
  • the processor is specifically configured to: when h is the initial value, use the first character of each suffix S_i in the suffix array SA_0 as the comparison key, and sort the suffixes in SA_0 to obtain the h-order suffix array SA_h and the second rank array R_h.
  • the one or more unsorted suffix segments [a, b] constitute a set UG h of unsorted suffixes.
  • the h-order rank array element R_h[i] of any suffix S_i in the set UG_h is used as the primary comparison key,
  • the h-order rank array element R_h[i+h] of the suffix S_{i+h} is used as the secondary comparison key, and
  • the suffixes in the set UG_h are sorted to obtain the 2h-order suffix array SA_2h.
  • the processor is specifically used to
  • One or more unsorted suffix segments are combined into a set UG 2h of unsorted suffixes.
  • the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA_0.
  • the processor is specifically used to
  • the suffix segment of the set UG h is divided into an S-type suffix segment and an L-type suffix segment according to a preset length T;
  • the S-type suffix segments are sorted by one thread block, and the L-type suffix segments are sorted by using two or more thread blocks to obtain a 2h-order suffix array SA 2h .
  • the processor is specifically configured to take a suffix segment whose length is less than or equal to the preset length T as an S-type suffix segment, and a suffix segment whose length is greater than the preset length T as an L-type suffix segment.
  • The method and apparatus for constructing a suffix array according to the embodiments of the present invention obtain the h-order suffix array SA_h and the second rank array R_h from the suffix array SA_0 of the string.
  • FIG. 1 is a system architecture diagram of a method for constructing a suffix array according to an embodiment of the present invention
  • FIG. 2A is a schematic flowchart of a method for constructing a suffix array according to an embodiment of the present invention
  • FIG. 2B is a schematic diagram of a suffix string table according to an embodiment of the present invention
  • FIG. 3A is a schematic flow chart of a method for constructing a suffix array according to another embodiment of the present invention.
  • FIG. 3B is a schematic diagram of scheduling an S-type suffix segment and an L-type suffix segment in a GPU according to an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of a device for constructing a suffix array according to an embodiment of the present invention
  • FIG. 5 is a schematic structural diagram of a device for constructing a suffix array according to another embodiment of the present invention
  • Schematic diagram of the construction of the suffix array
  • GPUs offer large-scale thread concurrency and high memory bandwidth, which can greatly alleviate the computational speed bottleneck of compute-intensive and data-intensive applications.
  • the embodiment of the present invention proposes a construction method of a suffix array based on a GPU (or a similar high data concurrency processor).
  • FIG. 1 is a system architecture diagram of a method for constructing a suffix array according to an embodiment of the present invention.
  • The system of this embodiment may include a CPU, a GPU, and a host memory; the CPU is connected to the host memory and to the GPU through data buses, and the host memory is connected to the GPU through a data bus.
  • The execution of the GPU in this embodiment is scheduled by the CPU.
  • The CPU selects the appropriate processing method according to the characteristics of the service. For example, a simple service is processed by the CPU itself, while a service with high data concurrency can be handed to the GPU.
  • the GPU cannot directly access the data stored in the host memory, but needs to copy the data of the host memory to the global memory of the GPU through the data bus before accessing.
  • There are multiple thread blocks within the GPU; each thread block has the same number of threads and its own shared memory.
  • Each thread has private computing resources (such as registers, local storage, etc.).
  • the GPU global memory can be accessed by all threads of all thread blocks.
  • the shared memory of the thread block can only be accessed by all threads of the thread block, while the thread's registers and local memory can only be accessed by this thread.
  • Before processing, data is stored on a disk device (HDD). Once the processing program starts, the data is first read from the disk into host memory and then dispatched by the CPU either to itself or to the GPU.
  • Character set: a character set Σ is a set on which a total order is defined; that is, any two distinct elements a and b of Σ are comparable, so either a < b or a > b.
  • The elements of a character set are called characters.
  • The character set contains a special character '$', which appears only at the end of a string and is the smallest element of the character set.
  • String: a string S of length n is an array of n characters S[0, n-1], whose last element is the terminator '$'.
  • Suffix array: the suffix array SA of a string S is the array formed by arranging all suffixes of S in lexicographic order. Each suffix is represented by its starting position; that is, i represents the suffix S_i, the suffix of S that starts at position i.
  • h-order suffix array: the h-order suffix array SA_h is the array obtained by sorting all suffixes of a string S using their first h characters as the comparison key.
  • The unsorted suffix segments described below are made up of unsorted suffixes:
  • unsorted suffixes are grouped into unsorted suffix segments according to whether their first h characters are the same.
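  • The definitions above can be made concrete with a short Python sketch. The helper name `h_order_suffix_array` is hypothetical, and breaking ties by starting position (Python's stable sort) is an assumption of this example; the definition itself leaves the order of suffixes with equal first h characters open.

```python
def h_order_suffix_array(s, h):
    """Sort suffix start positions by the first h characters only.

    Suffixes whose first h characters are equal keep their
    relative input order (stable sort); this tie-break is an
    assumption of the sketch.
    """
    return sorted(range(len(s)), key=lambda i: s[i:i + h])


s = "googol$"                               # '$' is the unique smallest terminator
sa_1 = h_order_suffix_array(s, 1)           # compare first character only
sa_2 = h_order_suffix_array(s, 2)           # compare first two characters
sa = h_order_suffix_array(s, len(s))        # full suffix array
```

  • Here `sa_1` is `[6, 0, 3, 5, 1, 2, 4]`, `sa_2` is `[6, 0, 3, 5, 2, 4, 1]`, and `sa` is `[6, 3, 0, 5, 2, 4, 1]`. In `sa_1` the suffixes starting with `g` (positions 0 and 3) and with `o` (positions 1, 2 and 4) are still unsorted.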
  • FIG. 2A is a flow chart showing a method for constructing a suffix array according to an embodiment of the present invention. As shown in FIG. 2A, the construction method of the suffix array of the embodiment of the present invention is as follows.
  • when h is the initial value, the first character of each suffix S_i in the suffix array SA_0 may be used as the comparison key,
  • and the suffixes in the suffix array SA_0 are sorted to obtain the h-order suffix array SA_h and the second rank array R_h. Specifically, if SA_h[i]
  • the third rank array R_2h is also obtained.
  • the h-order rank array element R_h[i] of any suffix S_i in the set UG_h is used as the primary comparison key,
  • the h-order rank array element R_h[i+h] of the suffix S_{i+h}
  • is used as the secondary comparison key, and
  • the suffixes in the set UG_h are sorted to obtain the 2h-order suffix array SA_2h.
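  • The two-key sort restricted to the unsorted segments can be sketched in Python as follows. This is a sequential illustration under stated assumptions, not the GPU implementation: the function name `sort_segments`, the representation of UG_h as a list of inclusive position pairs `(a, b)`, and the sentinel `-1` for positions past the end of the string are all choices made for this example.

```python
def sort_segments(sa_h, rank_h, segments, h):
    """Sort only the unsorted segments of SA_h to obtain SA_2h.

    sa_h     : h-order suffix array (list of start positions)
    rank_h   : rank_h[i] is the h-order rank R_h[i] of suffix S_i
    segments : list of inclusive (a, b) position ranges in sa_h
    """
    n = len(sa_h)

    def key(i):
        # primary key R_h[i], secondary key R_h[i+h]
        return (rank_h[i], rank_h[i + h] if i + h < n else -1)

    sa_2h = list(sa_h)                      # sorted suffixes stay in place
    for a, b in segments:
        sa_2h[a:b + 1] = sorted(sa_h[a:b + 1], key=key)
    return sa_2h


# 1-order data for "googol$": suffixes starting with 'g' and 'o' are unsorted
sa_1 = [6, 0, 3, 5, 1, 2, 4]
rank_1 = [1, 4, 4, 1, 4, 3, 0]
sa_2 = sort_segments(sa_1, rank_1, [(1, 2), (4, 6)], 1)
```

  • With these inputs `sa_2` is `[6, 0, 3, 5, 2, 4, 1]`: the three `o` suffixes are now fully ordered, while the two `g` suffixes still tie on their first two characters.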
  • The rank array elements R_h[i] and R_h[i+h] described above are elements of the second rank array R_h.
  • 204. According to the second rank array R_h and the 2h-order suffix array SA_2h, obtain another set UG_2h of unsorted suffixes;
  • if the set UG_2h is empty, the sorted suffix array SA is obtained; if the set UG_2h is not empty, update the value of the variable h and repeat the sorting of the unsorted suffix set UG_2h until the last obtained set of unsorted suffixes is empty, at which point the sorted suffix array SA is obtained.
  • That is, the unsorted suffix set UG_2h is sorted according to the third rank array R_2h in order to obtain the N-th set of unsorted suffixes, until the N-th set of unsorted suffixes finally obtained
  • is empty and the sorted suffix array SA is obtained; N is a natural number greater than 2.
  • Updating the value of the variable h may specifically be: updating the value of the variable h to 2h.
  • Step 204': according to the third rank array R_2h and the 4h-order suffix array SA_4h, obtain the unsorted suffix set UG_4h;
  • The suffix arrays and rank arrays of later iterations can reuse the space of the initial suffix array and rank array, which saves storage overhead.
  • The unsorted suffix set UG can also be stored in an array; its representation in the computing system is not limited.
  • The space occupied by the 2h-order suffix array SA_2h may be the space occupied by the suffix array SA_0, and the space occupied by the rank array R_2h may be
  • the space occupied by the rank array R_0, which saves space.
  • The h-order suffix array SA_h and the second rank array R_h are obtained from the suffix array SA_0 of the string, and then the h-order suffix array is obtained.
  • Before step 201, the method may further include the following step 200, not shown in the figure:
  • Step 202 may include the following sub-steps, not shown in the figure:
  • determining that more than one consecutive suffixes in the h-order suffix array SA_h have the same first h characters, and forming these consecutive suffixes into an unsorted suffix segment [a, b], where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment;
  • the one or more unsorted suffix segments [a, b] form the set UG_h of unsorted suffixes.
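  • The sub-steps above amount to finding runs of equal rank in SA_h. A minimal Python sketch follows; the name `unsorted_segments` is hypothetical, and the sketch assumes that two suffixes share their first h characters exactly when their entries in the rank array R_h are equal.

```python
def unsorted_segments(sa_h, rank_h):
    """Return UG_h as a list of inclusive (a, b) position ranges.

    A segment is a maximal run of more than one consecutive
    position in sa_h whose suffixes have the same h-order rank.
    """
    segments = []
    n = len(sa_h)
    a = 0                                   # start of the current run
    for j in range(1, n + 1):
        if j == n or rank_h[sa_h[j]] != rank_h[sa_h[a]]:
            if j - a > 1:                   # runs of length 1 are already sorted
                segments.append((a, j - 1))
            a = j
    return segments


sa_1 = [6, 0, 3, 5, 1, 2, 4]                # 1-order suffix array of "googol$"
rank_1 = [1, 4, 4, 1, 4, 3, 0]              # rank_1[i] = 1-order rank of S_i
ug_1 = unsorted_segments(sa_1, rank_1)
```

  • For `googol$` this gives `ug_1 == [(1, 2), (4, 6)]`: positions 1 to 2 hold the suffixes starting with `g`, and positions 4 to 6 hold those starting with `o`.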
  • For the input string T, first extract all of its suffixes (for example, T = googol$, as shown in the right column of FIG. 2B), and then sort the suffixes in a manner similar to prefix doubling. The whole sorting process consists of multiple iterations, but unlike the prior art, this embodiment introduces a set of sorted suffixes (Sorted group) and a set of unsorted suffixes (Unsorted group), which avoids the prior-art problem of repeatedly sorting suffixes whose rank has already been determined.
  • Initially, the suffixes are placed in the Unsorted group as one suffix segment.
  • In each iteration, only the suffixes in the Unsorted group are sorted, which avoids re-sorting suffixes whose order position has already been determined. After the order of all suffixes has been determined (that is, all suffixes have been added to the Sorted group), the suffix sorting process is complete, and the suffix array SA and the rank array R can be derived from the sorting result.
  • In an iteration, each suffix is sorted using only its first k characters, and after sorting, the suffixes whose global order has been determined are moved from the Unsorted group to the Sorted group. That is, for each suffix in the Unsorted group, if other suffixes are equal to it (their first k characters are the same), the order of these equal suffixes cannot be determined from their first k characters, so these equal
  • suffixes are grouped together to await the next iteration;
  • otherwise, the order position of the suffix in the current Unsorted group is determined, and the suffix is moved to the Sorted group.
  • Step 1: extract the suffixes (see the suffixes on the right side of FIG. 2B) as one suffix segment ug1, add to
  • In the first iteration, only the first character of each suffix is considered for sorting, so the result is
  • Sorted group {$, l$}
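  • The iteration described above can be traced end to end with a sequential Python sketch. It is an illustration under stated assumptions rather than the patented GPU procedure: the names `build_suffix_array` and `trace` are hypothetical, ranks are taken to be the position of the first suffix of each group, and Python's built-in sort stands in for the parallel radix sort.

```python
def build_suffix_array(s, trace=None):
    """Prefix doubling that only re-sorts the Unsorted group."""
    n = len(s)
    sa = list(range(n))
    rank = [0] * n
    unsorted = [(0, n - 1)]          # all suffixes start in one unsorted segment
    h = 0                            # h == 0: compare the first character only
    while unsorted:
        old = rank

        def key(i):
            if h == 0:
                return s[i]
            return (old[i], old[i + h] if i + h < n else -1)

        rank = list(old)             # ranks of sorted suffixes are kept
        next_unsorted = []
        for a, b in unsorted:
            sa[a:b + 1] = sorted(sa[a:b + 1], key=key)
            head = a                 # first position of the current group
            for j in range(a + 1, b + 2):
                if j == b + 1 or key(sa[j]) != key(sa[head]):
                    for p in range(head, j):
                        rank[sa[p]] = head
                    if j - head > 1:             # still tied: stays unsorted
                        next_unsorted.append((head, j - 1))
                    head = j
        unsorted = next_unsorted
        h = 1 if h == 0 else 2 * h
        if trace is not None:        # record the Sorted group after this pass
            busy = {p for a, b in unsorted for p in range(a, b + 1)}
            trace.append([s[sa[p]:] for p in range(n) if p not in busy])
    return sa


steps = []
result = build_suffix_array("googol$", steps)
```

  • For `googol$`, `steps[0]` is `['$', 'l$']`, matching the Sorted group after the first iteration, and `result` is `[6, 3, 0, 5, 2, 4, 1]` after three iterations.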
  • For GPU parallel data processing, an embodiment of the present invention proposes a method that processes the suffixes of the unsorted suffix set in parallel, thereby improving the processing speed of step 204 in FIG. 2A.
  • The aforementioned step 204 may include the following sub-steps, not shown in the figure:
  • A2041: compare the suffixes in the 2h-order suffix array SA_2h according to the neighboring element comparison rule to obtain the first auxiliary array NC_2h;
  • for an unsorted suffix segment [a, b], NC_2h[a] is set to a; every other suffix is compared with its left neighbor using the first 2h characters as the comparison key, giving 0 if they are the same and 1 if they differ.
  • The unsorted suffix set UG_2h is the union of all unsorted suffix segments.
  • Auxiliary arrays such as NC and NS can be reused to save storage overhead, so the space allocation of NC and NS is not limited.
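  • A sequential Python sketch of the neighbor comparison and prefix sum is given below. The NC and NS arrays follow the description above; how the rank array is derived from them is not fully legible in this text, so the sketch assumes that every suffix receives the position of the first suffix of its group as its new rank. The function name `refine` and the segment representation are likewise hypothetical.

```python
from itertools import accumulate


def refine(sa_2h, rank_h, segments, h):
    """Derive R_2h and UG_2h from SA_2h for the given segments [a, b]."""
    n = len(sa_2h)

    def key(i):
        # 2h-character comparison key expressed through h-order ranks
        return (rank_h[i], rank_h[i + h] if i + h < n else -1)

    rank_2h = list(rank_h)
    ug_2h = []
    for a, b in segments:
        # NC: a for the segment head, then 0 if equal to left neighbor, else 1
        nc = [a] + [0 if key(sa_2h[j]) == key(sa_2h[j - 1]) else 1
                    for j in range(a + 1, b + 1)]
        ns = list(accumulate(nc))            # NS: prefix sum of NC
        head = a
        for j in range(a, b + 2):
            # equal NS values mark suffixes that are still tied
            if j == b + 1 or ns[j - a] != ns[head - a]:
                for p in range(head, j):
                    rank_2h[sa_2h[p]] = head  # assumption: rank = group head
                if j - head > 1:
                    ug_2h.append((head, j - 1))
                head = j
    return rank_2h, ug_2h


sa_2 = [6, 0, 3, 5, 2, 4, 1]                 # 2-order suffix array of "googol$"
rank_1 = [1, 4, 4, 1, 4, 3, 0]
rank_2, ug_2 = refine(sa_2, rank_1, [(1, 2), (4, 6)], 1)
```

  • For `googol$` this yields `rank_2 == [1, 6, 4, 1, 5, 3, 0]` and `ug_2 == [(1, 2)]`: only the two suffixes beginning with `go` remain unsorted.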
  • the suffix parallel processing method of the unsorted suffix set of this embodiment is as follows.
  • the suffix segment of the set UG h is divided into an S-type suffix segment and an L-type suffix segment according to a preset length T.
  • suffix segment I [a, b] if its length
  • the length in the suffix segment refers to the number of suffixes included in the suffix segment.
  • The processing of the S-type suffix segments and the processing of the L-type suffix segments have no ordering dependency and can be executed in parallel. Because multiple thread blocks in the GPU can execute in parallel, multiple S-type suffix segments can be sorted simultaneously. Similarly, multiple L-type suffix segments can also be sorted simultaneously. This method can therefore greatly improve the utilization of the GPU's computing resources and speed up the construction of the suffix array.
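  • The division by the preset length T can be sketched as follows. The function name `classify_segments` and the `(a, b)` representation of a segment are assumptions of this example; the length of a segment is the number of suffixes it contains, b - a + 1.

```python
def classify_segments(segments, t):
    """Split unsorted segments into S-type (length <= t) and L-type (length > t)."""
    s_type = [(a, b) for a, b in segments if b - a + 1 <= t]
    l_type = [(a, b) for a, b in segments if b - a + 1 > t]
    return s_type, l_type


# hypothetical segments of lengths 2, 3 and 21, with preset length T = 3
s_type, l_type = classify_segments([(1, 2), (4, 6), (10, 30)], 3)
```

  • In a GPU setting each entry of `s_type` would go to a single thread block and each entry of `l_type` would be spread over several thread blocks; the sketch only performs the classification.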
  • Radix sort is currently the most efficient parallel sorting method. Therefore, the embodiment of the present invention takes parallel radix sort as an example to illustrate how to use the GPU to implement single-thread-block parallel sorting and multi-thread-block parallel sorting.
  • The embodiment of the present invention uses a least-significant-digit-first (LSD) radix sort.
  • The number of bits processed in each radix sort iteration is configurable, depending on the computing power and storage space of the GPU.
  • The steps of a single iteration of single-thread-block parallel radix sort are as follows.
  • The step of sorting the S-type suffix segments by using one thread block in step 302 includes:
  • for the current comparison key of each element, count the number of elements having each specific key value.
  • The first histogram H is stored in an array: the key value is used as the array subscript, and the number of elements with that key value is stored as the array element at that subscript.
  • The starting position of the array elements with key value i is M[i]. If several array elements have the same key value, they are placed in the subsequent positions.
  • In steps S01 and S02, the characteristics of the GPU can be further exploited to speed up the calculation. For example, each thread of the thread block first computes a local histogram and keeps it in registers, then copies the local histogram into the shared memory of the thread block, and then a prefix sum is performed over all the histograms.
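  • One radix sort pass of the kind described above (histogram H, prefix sum M, scatter) can be sketched sequentially in Python. The function names `radix_pass` and `lsd_radix_sort`, and the parameters `shift`, `bits` and `key_bits`, are hypothetical; a real thread block would compute the histogram and the prefix sum cooperatively.

```python
def radix_pass(keys, shift, bits):
    """One stable radix pass on the digit (key >> shift) & mask."""
    radix = 1 << bits
    mask = radix - 1
    hist = [0] * radix                       # H[v]: number of keys with digit v
    for k in keys:
        hist[(k >> shift) & mask] += 1
    start = [0] * radix                      # M[v]: first output slot for digit v
    for v in range(1, radix):
        start[v] = start[v - 1] + hist[v - 1]
    out = [None] * len(keys)
    for k in keys:                           # scatter; equal digits keep order
        v = (k >> shift) & mask
        out[start[v]] = k
        start[v] += 1
    return out


def lsd_radix_sort(keys, key_bits, bits):
    """LSD radix sort: process `bits` bits per pass, lowest bits first."""
    for shift in range(0, key_bits, bits):
        keys = radix_pass(keys, shift, bits)
    return keys


ordered = lsd_radix_sort([5, 3, 7, 0, 3, 6], key_bits=4, bits=2)
```

  • With 4-bit keys and 2 bits per pass the sort takes two passes; `ordered` is `[0, 3, 3, 5, 6, 7]`.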
  • M01: the L-type suffix segment is sliced according to the length of the array processed by each thread block, and the slices of the L-type suffix segment are allocated to the corresponding thread blocks.
  • If the length of the L-type suffix segment is l and each thread block is responsible for an array of length t, then l/t thread blocks are used.
  • M02: obtain a histogram of each slice of the L-type suffix segment, and perform a prefix sum over the histograms obtained by all thread blocks to obtain the global prefix sum result array M_g of the L-type suffix segment.
  • Each thread block first calculates a histogram for the slice of the suffix segment it is responsible for, obtains the histogram H_b of that thread block, and copies the result to the global memory of the GPU.
  • M03: according to the global prefix sum result array M_g, scatter the suffixes to their corresponding positions in the suffix segment, which completes the sorting.
  • The starting position of the array elements with key value i is M_g[i]; if several array elements have the same key value, they are placed in the subsequent positions.
  • Step M01 is executed by the CPU, and the rest of the steps are executed by the GPU, and each step is implemented by a separate kernel function.
  • Step M02 can be executed concurrently by multiple thread blocks.
  • Step M03 can only be executed after all thread blocks of step M02 have been executed.
  • Step M03 may require one thread block or multiple thread blocks to execute.
  • Each thread block is responsible for a set of independent key values.
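  • The multi-thread-block pass can be simulated sequentially in Python to show how the per-block histograms combine into global offsets. This is a sketch under assumptions: the name `multi_block_radix_pass` is hypothetical, each loop iteration over a slice stands in for one thread block, and the offsets are laid out by digit first and block second so that the pass stays stable.

```python
def multi_block_radix_pass(keys, shift, bits, t):
    """One stable radix pass where each slice of length t is one 'thread block'."""
    radix = 1 << bits
    mask = radix - 1
    # M01: slice the segment; block b handles keys[b*t : (b+1)*t]
    blocks = [keys[i:i + t] for i in range(0, len(keys), t)]
    # M02: per-block histograms H_b ...
    hists = []
    for blk in blocks:
        h = [0] * radix
        for k in blk:
            h[(k >> shift) & mask] += 1
        hists.append(h)
    # ... followed by a global exclusive prefix sum (digit-major, block-minor)
    offset = [[0] * radix for _ in blocks]
    running = 0
    for v in range(radix):
        for b in range(len(blocks)):
            offset[b][v] = running
            running += hists[b][v]
    # M03: every block scatters its keys to the global positions
    out = [None] * len(keys)
    for b, blk in enumerate(blocks):
        for k in blk:
            v = (k >> shift) & mask
            out[offset[b][v]] = k
            offset[b][v] += 1
    return out


data = [5, 3, 7, 0, 3, 6, 1, 2]
one_pass = multi_block_radix_pass(data, shift=0, bits=2, t=3)
```

  • With slices of length 3 the eight keys are handled by three simulated blocks, and `one_pass` equals a stable sort of `data` by its two lowest bits.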
  • The multi-thread-block parallel radix sort is slightly more complicated than the single-thread-block parallel radix sort.
  • The above method makes full use of the GPU's many concurrent threads and high memory bandwidth, and accelerates the suffix array construction process.
  • Each of the above steps is itself a concurrent process (that is, each step can be processed by multiple threads concurrently).
  • The suffix segments in the Unsorted group are divided into two classes according to their size.
  • The Unsorted group data in the GPU is scheduled by the CPU. The sorting of each suffix segment in the S class is performed by a single thread block (ThreadBlock) in the GPU, and the sorting of each suffix segment in the L class is performed by multiple thread blocks (as shown in FIG. 3B, where suffix segments including M_g1 and M_g2 belong to the S class, while M_g4 and M_g5 belong to the L class).
  • For example, the classification of the suffix segments in the Unsorted group.
  • each suffix segment is performed by a kernel function and is completed in three steps:
  • M01 calculating their respective histograms in units of thread blocks
  • M02 scanning a histogram of each thread block, and calculating a dispersion offset value of each thread block
  • The sorted results are aggregated in the global memory of the GPU, and the remaining steps are then completed in sequence by the thread blocks of the GPU, each step also being executed concurrently.
  • For example, adjacent suffixes in each suffix segment are compared; a prefix sum is performed over the neighbor comparison results of each suffix segment; and the result of the previous step is used to compute the rank array corresponding to the suffix array, the new Unsorted group, and so on.
  • This embodiment thus proposes parallel processing of the suffixes in the Unsorted group, which can be applied to GPU-like environments with high data parallelism.
  • The apparatus for constructing a suffix array of the present embodiment includes: a first obtaining unit 41, a second obtaining unit 42, a sorting unit 43, and a third obtaining unit 44;
  • the first obtaining unit 41 is configured to obtain the h-order suffix array SA_h and a second rank array R_h according to the suffix array SA_0 of a string and a first rank array R_0,
  • where h is a variable with an initial value of 1;
  • the second obtaining unit 42 is configured to obtain, according to the h-order suffix array SA_h, the set UG_h of unsorted suffixes in SA_h;
  • the sorting unit 43 is configured to sort all suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h;
  • the sorting unit 43 is specifically configured to use the h-order rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and
  • the h-order rank array element R_h[i+h] of the suffix S_{i+h} as the secondary comparison key, and to sort the suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h.
  • the third obtaining unit 44 is configured to obtain another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h; if the set UG_2h is empty, the sorted suffix array SA is obtained.
  • the third obtaining unit 44 is further configured to obtain the third rank array R_2h according to the second rank array R_h and the 2h-order suffix array SA_2h.
  • the foregoing suffix array constructing apparatus may further include: a variable updating unit 45, as shown in FIG. 5;
  • the variable update unit 45 is configured to update the value of the variable h; for example, the variable update unit is specifically configured to update the value of the variable h to 2h.
  • the sorting unit 43 is further configured to repeatedly sort the set UG_2h of unsorted suffixes, using the value of h updated by the variable update unit, until the set of unsorted suffixes finally obtained by the third obtaining unit is empty and the sorted suffix array SA is obtained.
  • That is, the set UG_2h of unsorted suffixes is sorted according to the third rank array R_2h obtained by the third obtaining unit and the value of the variable h updated by the variable update unit, in order to obtain the N-th set of unsorted suffixes, until the N-th set of unsorted suffixes finally obtained by the third obtaining unit is empty and the sorted suffix array SA is obtained; N is a natural number greater than 2.
  • the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA_0.
  • the foregoing apparatus for constructing a suffix array may further include a fourth obtaining unit 46 (not shown in the figure), configured to, before the first obtaining unit 41 obtains the h-order suffix array SA_h, initialize the input string to obtain the suffix array SA_0 of the string,
  • and to obtain the rank array R_0 of the string according to the starting character position of each suffix S_i in the suffix array SA_0.
  • the first obtaining unit 41 may be specifically configured to: when h is an initial value, use the first character of each suffix Si in the suffix array SA Q as a comparison key, and the suffix array SA Q Each suffix Si in the middle is sorted to obtain the h-order suffix array SA h and the second-order number group R h .
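  • the second obtaining unit 42 is specifically configured to: determine that more than one consecutive suffix in the h-order suffix array SAh has the same first character; and form the consecutive suffixes that have the same first character into one or more unsorted suffix segments [a, b], where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment;
  • the one or more unsorted suffix segments [a, b] constitute the set UGh of unsorted suffixes.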
  • the third obtaining unit 44 is configured to: compare the suffixes in the 2h-order suffix array SA2h according to a neighboring-element comparison rule to obtain a first auxiliary array NC2h; compute the prefix sum of the first auxiliary array NC2h to obtain a second auxiliary array NS2h; and, if more than one consecutive suffix has the same value in the second auxiliary array NS2h, form those consecutive suffixes into one unsorted suffix segment;
  • the one or more unsorted suffix segments are combined into the set UG2h of unsorted suffixes.
  • the foregoing sorting unit 43 is specifically configured to: divide the suffix segments of the set UGh into S-type suffix segments and L-type suffix segments according to a preset length T;
  • for example, the length of any suffix segment of the set UGh may be compared with the preset length T; a suffix segment whose length is less than or equal to the preset length T is taken as an S-type suffix segment, and a suffix segment whose length is greater than the preset length T is taken as an L-type suffix segment.
  • the S-type suffix segments and the L-type suffix segments are sorted separately to obtain the 2h-order suffix array SA2h.
  • the L-type suffix segments may be sliced according to the length of the array processed by each thread block, and the sliced L-type suffix segments are allocated to the thread blocks;
  • the prefix-sum result array M and the global prefix-sum result array Mg are obtained, from which the 2h-order suffix array SA2h is obtained.
  • the suffix array construction apparatus of this embodiment avoids repeatedly sorting suffixes whose final rank has already been found, and can therefore accelerate construction of the suffix array during suffix data processing.
  • the device in this embodiment may be used to implement the technical solution of the method embodiment shown in FIG. 2A and FIG. 3A, and the implementation principle and the technical effect are similar, and details are not described herein again.
  • FIG. 6 is a schematic structural diagram of a device for constructing a suffix array according to another embodiment of the present invention.
  • the apparatus for constructing a suffix array of the present embodiment includes: a bus 61; and a processor 62, a memory 63 and an interface 64 connected to the bus 61, wherein the memory 63 is configured to store instructions, and the processor 62 is configured to execute the instructions so as to: obtain, according to the suffix array SA0 of a character string and a first rank array R0, the h-order suffix array SAh of the suffix array SA0 and a second rank array Rh, where h is a variable with an initial value of 1; and obtain, according to the h-order suffix array SAh, the set UGh of unsorted suffixes in the h-order suffix array SAh;
  • the processor 62, executing the foregoing instructions, is further configured to obtain, according to the second rank array Rh and the 2h-order suffix array SA2h, a third rank array R2h and another set UG2h of unsorted suffixes;
  • if the set UG2h is not an empty set, the value of the variable h is updated and the set UG2h of unsorted suffixes is sorted according to the third rank array R2h, so as to obtain an N-th set of unsorted suffixes, until the N-th set of unsorted suffixes last obtained is an empty set, whereupon the sorted suffix array SA is obtained; N is a natural number greater than 2.
  • the processor 62, executing the above instructions, is specifically configured to update the value of the variable h to 2h.
  • in the present embodiment, the space occupied by the 2h-order suffix array SA2h is the space occupied by the suffix array SA0.
  • the processor 62, executing the foregoing instructions, is further configured to initialize the input character string to obtain the suffix array SA0 of the character string, and to adjust the starting character position of each suffix Si in the suffix array SA0 to obtain the rank array R0 of the character string.
  • the processor 62, executing the foregoing instructions, is specifically configured to: when h is the initial value, sort each suffix Si in the suffix array SA0 using the first character of each suffix Si as the comparison key, to obtain the 1-order suffix array SA1 and the corresponding rank array R1 (that is, the h-order suffix array SAh and the second rank array Rh).
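  • the processor 62, executing the foregoing instructions, is specifically configured to: determine that more than one consecutive suffix in the h-order suffix array SAh has the same first character; and form the consecutive suffixes that have the same first character into one or more unsorted suffix segments [a, b], where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment;
  • the one or more unsorted suffix segments [a, b] constitute the set UGh of unsorted suffixes.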
  • the processor 62, executing the foregoing instructions, is specifically configured to sort the suffixes in the set UGh, using the h-rank array element Rh[i] of each suffix Si in the set UGh as the primary comparison key and the h-rank array element Rh[i+h] of the suffix Si+h as the secondary comparison key, to obtain the 2h-order suffix array SA2h.
  • the processor 62, executing the foregoing instructions, is specifically configured to combine one or more unsorted suffix segments into the set UG2h of unsorted suffixes.
  • the processor 62 executing the foregoing instructions to sort all suffixes in the set UGh and obtain the 2h-order suffix array SA2h includes:
  • dividing the suffix segments of the set UGh into S-type suffix segments and L-type suffix segments according to a preset length T;
  • for example, the length of any suffix segment of the set UGh is compared with the preset length T; a suffix segment whose length is less than or equal to the preset length T is taken as an S-type suffix segment, and a suffix segment whose length is greater than the preset length T is taken as an L-type suffix segment.
  • each S-type suffix segment is sorted by one thread block, and each L-type suffix segment is sorted by two or more thread blocks, to obtain the 2h-order suffix array SA2h.
  • in the apparatus for constructing a suffix array of this embodiment, the memory stores the instructions and the processor executes them, which avoids repeatedly sorting suffixes whose final rank has already been found and accelerates construction of the suffix array.
  • the device in this embodiment may be used to implement the technical solution of the method embodiment shown in FIG. 2A and FIG. 3A, and the implementation principle and the technical effect are similar, and details are not described herein again.
  • the aforementioned program can be stored in a computer readable storage medium.
  • the program when executed, performs the steps including the above-described method embodiments; and the foregoing storage medium includes: a medium that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Abstract

A method and apparatus for constructing a suffix array. The method comprises: obtaining an h-order suffix array SAh of a suffix array SA0 and a second rank array Rh according to the suffix array SA0 of a character string and a first rank array R0, h being a variable with an initial value of 1; obtaining a set UGh of unsorted suffixes in the h-order suffix array SAh according to the h-order suffix array SAh; sorting all suffixes in the set UGh, so as to obtain a 2h-order suffix array SA2h; obtaining another set UG2h of unsorted suffixes according to the second rank array Rh and the 2h-order suffix array SA2h; and when the set UG2h is an empty set, obtaining a sorted suffix array SA. By using the method, construction of a suffix array can be accelerated during suffix data processing.

Description

Method and apparatus for constructing a suffix array
Technical field
The present invention relates to communication technologies, and in particular, to a method and an apparatus for constructing a suffix array of a character string.
Background
A suffix array is an array of all suffixes of a character string arranged in lexicographic order. It is widely used in fields such as string matching, sequence analysis, and text compression.
At present, the prefix doubling algorithm is a commonly used suffix array construction algorithm. It first sorts all suffixes using the first character as the comparison key, obtaining the 1-order suffix array and its corresponding rank array. Then, starting from h=1, it recursively computes the 2h-order suffix array from the previous result, until every suffix has a unique rank. The core idea is to use the already obtained h-order rank of each suffix Si (denoted Rh[i]) as the primary key and Rh[i+h] as the secondary key, and to derive from them the 2h-order suffix array and the corresponding rank array.
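The doubling step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the GPU implementation discussed in this document: the function name `prefix_doubling` is hypothetical, Python's built-in sort stands in for radix sort, and the value -1 is assumed as the secondary key when i+h runs past the end of the string.

```python
def prefix_doubling(text):
    """Classic prefix doubling: re-sort ALL suffixes in every round."""
    n = len(text)
    # 1-order suffix array: sort suffix start positions by first character.
    sa = sorted(range(n), key=lambda i: text[i])
    # 1-order rank array: equal first characters share a rank.
    rank = [0] * n
    for j in range(1, n):
        rank[sa[j]] = rank[sa[j - 1]] + (text[sa[j]] != text[sa[j - 1]])

    h = 1
    # Stop once the largest rank is n-1, i.e. every rank is unique.
    while n > 0 and rank[sa[-1]] < n - 1:
        # Primary key Rh[i], secondary key Rh[i+h] (-1 past the end: assumption).
        key = lambda i: (rank[i], rank[i + h] if i + h < n else -1)
        sa.sort(key=key)
        # Derive the 2h-order rank array from the newly sorted order.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        h *= 2
    return sa
```

For example, `prefix_doubling("banana")` returns `[5, 3, 1, 0, 4, 2]`. Note that every round sorts the whole array, including suffixes whose rank is already unique.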
The prefix doubling algorithm based on a graphics processing unit (GPU) makes full use of the advantages of GPU radix sort, so that the sorting process is highly parallelized and the performance of the algorithm is improved.
However, in each iteration the prefix doubling algorithm does not distinguish between suffixes whose final rank has been found and suffixes whose final rank has not been found, so suffixes whose final rank has already been found are sorted repeatedly.
Summary of the invention
To remedy the defects of the prior art, the present invention provides a method and an apparatus for constructing a suffix array, which solve the prior-art problem of repeatedly sorting suffixes whose final rank has already been found, and allow the suffix array to be constructed quickly.
In a first aspect, an embodiment of the present invention provides a method for constructing a suffix array, including:
obtaining, according to a suffix array SA0 of a character string and a first rank array R0, an h-order suffix array SAh of the suffix array SA0 and a second rank array Rh, where h is a variable with an initial value of 1;
obtaining, according to the h-order suffix array SAh, a set UGh of unsorted suffixes in the h-order suffix array SAh;
sorting all suffixes in the set UGh to obtain a 2h-order suffix array SA2h; and
obtaining, according to the second rank array Rh and the 2h-order suffix array SA2h, another set UG2h of unsorted suffixes; if the set UG2h is an empty set, the sorted suffix array SA is obtained.
With reference to the first aspect, in a first possible implementation,
the obtaining, according to the second rank array Rh and the 2h-order suffix array SA2h, another set UG2h of unsorted suffixes includes:
obtaining, according to the second rank array Rh and the 2h-order suffix array SA2h, a third rank array R2h and the other set UG2h of unsorted suffixes;
and the method further includes:
if the set UG2h is not an empty set, updating the value of the variable h and sorting the set UG2h of unsorted suffixes according to the third rank array R2h, so as to obtain an N-th set of unsorted suffixes, until the N-th set of unsorted suffixes last obtained is an empty set, whereupon the sorted suffix array SA is obtained; N is a natural number greater than 2.
With reference to the first aspect and the first possible implementation, in a second possible implementation, the updating the value of the variable h includes:
updating the value of the variable h to 2h.
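Taken together, the first aspect and its first two implementations describe a loop that sorts only the unsorted suffixes and doubles h each round. The following Python sketch shows one way this loop could look. It is an illustration under stated assumptions rather than the patented implementation: the names `unsorted_groups` and `build_suffix_array` are hypothetical, unsorted suffixes are detected here as runs of equal rank (which coincides with equal first characters when h is 1), -1 is assumed as the secondary key past the end of the string, and a sequential sort replaces the GPU thread blocks.

```python
def unsorted_groups(rank, sa):
    """Return inclusive position ranges [a, b] in sa whose suffixes share a rank."""
    groups, a = [], 0
    for pos in range(1, len(sa) + 1):
        if pos == len(sa) or rank[sa[pos]] != rank[sa[a]]:
            if pos - a > 1:          # a run of length 1 is already final
                groups.append((a, pos - 1))
            a = pos
    return groups


def build_suffix_array(text):
    n = len(text)
    # h = 1: sort by first character to get SAh and Rh.
    sa = sorted(range(n), key=lambda i: text[i])
    rank = [0] * n
    for j in range(1, n):
        rank[sa[j]] = rank[sa[j - 1]] + (text[sa[j]] != text[sa[j - 1]])

    h = 1
    groups = unsorted_groups(rank, sa)           # the set UGh
    while groups:                                # stop when the set is empty
        def key(i):
            return (rank[i], rank[i + h] if i + h < n else -1)

        # Sort only the unsorted segments; finished suffixes stay in place.
        for a, b in groups:
            sa[a:b + 1] = sorted(sa[a:b + 1], key=key)

        # Derive R2h from Rh and SA2h.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        h *= 2                                   # update h to 2h
        groups = unsorted_groups(rank, sa)       # the set UG2h
    return sa
```

Because the segments are sorted in place inside `sa`, the 2h-order array occupies the same storage as the array it replaces.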
With reference to the first aspect and the foregoing possible implementations of the first aspect, in a third possible implementation, before the step of obtaining the h-order suffix array SAh of the suffix array SA0 according to the suffix array SA0 of the character string and the first rank array R0, the method further includes:
initializing the input character string to obtain the suffix array SA0 of the character string; and
adjusting the starting character position of each suffix Si in the suffix array SA0 to obtain the rank array R0 of the character string.
With reference to the first aspect and the foregoing possible implementations of the first aspect, in a fourth possible implementation, the obtaining, according to the suffix array SA0 of the character string and the rank array R0, the h-order suffix array SAh of the suffix array SA0 and the second rank array Rh includes:
when h is the initial value, sorting each suffix Si in the suffix array SA0 using the first character of each suffix Si as the comparison key, to obtain the h-order suffix array SAh and the second rank array Rh.
With reference to the first aspect and the fourth possible implementation of the first aspect, in a fifth possible implementation,
the obtaining, according to the h-order suffix array SAh, the set UGh of unsorted suffixes in the h-order suffix array SAh includes:
determining that more than one consecutive suffix in the h-order suffix array SAh has the same first character;
forming the consecutive suffixes in the h-order suffix array SAh that have the same first character into one or more unsorted suffix segments [a, b], where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment; and
forming the set UGh of unsorted suffixes from the one or more unsorted suffix segments [a, b].
With reference to the first aspect and the foregoing possible implementations of the first aspect, in a sixth possible implementation, the sorting all suffixes in the set UGh to obtain the 2h-order suffix array SA2h includes:
sorting the suffixes in the set UGh, using the h-rank array element Rh[i] of each suffix Si in the set UGh as the primary comparison key and the h-rank array element Rh[i+h] of the suffix Si+h as the secondary comparison key, to obtain the 2h-order suffix array SA2h.
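The segment detection of the fifth implementation and the two-key sort of the sixth implementation can be illustrated with a small Python sketch. The function names `first_char_segments` and `sort_segment` are hypothetical, positions are 0-based, and -1 is assumed as the secondary key when i+h lies past the end of the string.

```python
def first_char_segments(text, sa):
    """Find inclusive ranges [a, b] of positions in sa whose suffixes
    start with the same character (runs longer than one suffix)."""
    segments, a = [], 0
    for pos in range(1, len(sa) + 1):
        if pos == len(sa) or text[sa[pos]] != text[sa[a]]:
            if pos - a > 1:
                segments.append((a, pos - 1))
            a = pos
    return segments


def sort_segment(sa, rank, a, b, h):
    """Sort sa[a..b] in place by (Rh[i], Rh[i+h])."""
    n = len(rank)

    def key(i):
        return (rank[i], rank[i + h] if i + h < n else -1)

    sa[a:b + 1] = sorted(sa[a:b + 1], key=key)


text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i])   # 1-order suffix array
rank = [1, 0, 2, 0, 2, 0]                              # 1-order ranks: a=0, b=1, n=2
segments = first_char_segments(text, sa)
for a, b in segments:
    sort_segment(sa, rank, a, b, 1)
```

For "banana", the segments are the three suffixes starting with "a" and the two starting with "n"; the single suffix starting with "b" already has its final position and is never touched again.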
With reference to the first aspect and the foregoing possible implementations of the first aspect, in a seventh possible implementation, the obtaining, according to the second rank array Rh and the 2h-order suffix array SA2h, another set UG2h of unsorted suffixes includes:
comparing the suffixes in the 2h-order suffix array SA2h according to a neighboring-element comparison rule to obtain a first auxiliary array NC2h;
computing the prefix sum of the first auxiliary array NC2h to obtain a second auxiliary array NS2h;
if more than one consecutive suffix has the same value in the second auxiliary array NS2h, forming those consecutive suffixes into one unsorted suffix segment; and
combining the one or more unsorted suffix segments into the set UG2h of unsorted suffixes.
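A possible reading of the two auxiliary arrays is sketched below in Python. The neighboring-element comparison rule is not spelled out in this passage, so the sketch assumes it sets a flag of 1 where a suffix's key pair (Rh[i], Rh[i+h]) differs from that of its left neighbor in SA2h, and 0 otherwise; the function names and the -1 sentinel are likewise assumptions.

```python
from itertools import accumulate


def neighbour_flags(sa, rank, h):
    """NC: 1 where a suffix's key pair differs from its left neighbour's, else 0."""
    n = len(rank)

    def key(i):
        return (rank[i], rank[i + h] if i + h < n else -1)

    return [int(j > 0 and key(sa[j]) != key(sa[j - 1])) for j in range(len(sa))]


def segments_from_sums(ns):
    """Runs of equal prefix-sum values longer than one form unsorted segments."""
    segments, a = [], 0
    for pos in range(1, len(ns) + 1):
        if pos == len(ns) or ns[pos] != ns[a]:
            if pos - a > 1:
                segments.append((a, pos - 1))
            a = pos
    return segments


# "banana" after the first doubling round (h = 1).
sa_2h = [5, 1, 3, 0, 2, 4]
rank_h = [1, 0, 2, 0, 2, 0]
nc = neighbour_flags(sa_2h, rank_h, 1)
ns = list(accumulate(nc))          # prefix sum of NC
ug_2h = segments_from_sums(ns)
```

Under this assumed rule, suffixes with equal values in NS are still tied after the 2h-order sort, so the same prefix sum could also supply the values of the third rank array R2h.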
With reference to the first aspect and the foregoing possible implementations of the first aspect, in an eighth possible implementation, the space occupied by the 2h-order suffix array SA2h is the space occupied by the suffix array SA0.
With reference to the first aspect and the foregoing possible implementations of the first aspect, in a ninth possible implementation, the sorting all suffixes in the set UGh to obtain the 2h-order suffix array SA2h includes:
dividing the suffix segments of the set UGh into S-type suffix segments and L-type suffix segments according to a preset length T; and
sorting each S-type suffix segment with one thread block and each L-type suffix segment with two or more thread blocks, to obtain the 2h-order suffix array SA2h.
With reference to the ninth possible implementation of the first aspect, in a tenth possible implementation, the dividing the suffix segments of the set UGh into S-type suffix segments and L-type suffix segments according to the preset length T includes: comparing the length of each suffix segment of the set UGh with the preset length T; a suffix segment whose length is less than or equal to the preset length T is taken as an S-type suffix segment, and a suffix segment whose length is greater than the preset length T is taken as an L-type suffix segment.
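The classification by the preset length T can be sketched in Python as below. The second helper illustrates how an L-type segment might be sliced so that each slice fits the array length handled by one thread block; the names `split_by_length` and `slice_for_blocks` and the parameter `block_len` are hypothetical, and how the sorted slices are merged afterwards is not shown here.

```python
def split_by_length(segments, t):
    """Classify inclusive segments [a, b]: length <= t is S-type, otherwise L-type."""
    s_type, l_type = [], []
    for a, b in segments:
        (s_type if b - a + 1 <= t else l_type).append((a, b))
    return s_type, l_type


def slice_for_blocks(segment, block_len):
    """Cut one L-type segment into consecutive slices of at most block_len suffixes."""
    a, b = segment
    return [(start, min(start + block_len - 1, b))
            for start in range(a, b + 1, block_len)]


s_type, l_type = split_by_length([(0, 2), (4, 5), (10, 19)], 3)
slices = slice_for_blocks(l_type[0], 4)
```

Short segments are thus handled whole by a single block, while a long segment is spread over several blocks.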
In a second aspect, an embodiment of the present invention provides an apparatus for constructing a suffix array, including:
a first obtaining unit, configured to obtain, according to a suffix array SA0 of a character string and a first rank array R0, an h-order suffix array SAh of the suffix array SA0 and a second rank array Rh, where h is a variable with an initial value of 1;
a second obtaining unit, configured to obtain, according to the h-order suffix array SAh, a set UGh of unsorted suffixes in the h-order suffix array SAh;
a sorting unit, configured to sort all suffixes in the set UGh to obtain a 2h-order suffix array SA2h; and
a third obtaining unit, configured to obtain, according to the second rank array Rh and the 2h-order suffix array SA2h, another set UG2h of unsorted suffixes; if the set UG2h is an empty set, the sorted suffix array SA is obtained.
With reference to the second aspect, in a first possible implementation,
the third obtaining unit is further configured to obtain a third rank array R2h according to the second rank array Rh and the 2h-order suffix array SA2h;
when the set UG2h is not an empty set, the apparatus further includes a variable updating unit, configured to update the value of the variable h; and
the sorting unit is further configured to sort the set UG2h of unsorted suffixes according to the third rank array R2h obtained by the third obtaining unit and the value of the variable h updated by the variable updating unit, so as to obtain an N-th set of unsorted suffixes, until the N-th set of unsorted suffixes last obtained by the third obtaining unit is an empty set, whereupon the sorted suffix array SA is obtained; N is a natural number greater than 2.
With reference to the first possible implementation of the second aspect, in a second possible implementation,
the variable updating unit is specifically configured to update the value of the variable h to 2h.
With reference to the second aspect or the first or second possible implementation of the second aspect, in a third possible implementation, the apparatus further includes:
a fourth obtaining unit, configured to, before the first obtaining unit obtains the h-order suffix array SAh, initialize the input character string to obtain the suffix array SA0 of the character string, and adjust the starting character position of each suffix Si in the suffix array SA0 to obtain the rank array R0 of the character string.
With reference to the second aspect or the foregoing possible implementations of the second aspect, in a fourth possible implementation, the first obtaining unit is specifically configured to:
when h is the initial value, sort each suffix Si in the suffix array SA0 using the first character of each suffix Si as the comparison key, to obtain the h-order suffix array SAh and the second rank array Rh.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation,
the second obtaining unit is specifically configured to:
determine that more than one consecutive suffix in the h-order suffix array SAh has the same first character;
form the consecutive suffixes in the h-order suffix array SAh that have the same first character into one or more unsorted suffix segments [a, b], where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment; and
form the set UGh of unsorted suffixes from the one or more unsorted suffix segments [a, b].
With reference to the second aspect or the foregoing possible implementations of the second aspect, in a sixth possible implementation, the sorting unit is specifically configured to:
sort the suffixes in the set UGh, using the h-rank array element Rh[i] of each suffix Si in the set UGh as the primary comparison key and the h-rank array element Rh[i+h] of the suffix Si+h as the secondary comparison key, to obtain the 2h-order suffix array SA2h.
With reference to the second aspect or the foregoing possible implementations of the second aspect, in a seventh possible implementation, the third obtaining unit is specifically configured to:
compare the suffixes in the 2h-order suffix array SA2h according to a neighboring-element comparison rule to obtain a first auxiliary array NC2h;
compute the prefix sum of the first auxiliary array NC2h to obtain a second auxiliary array NS2h;
if more than one consecutive suffix has the same value in the second auxiliary array NS2h, form those consecutive suffixes into one unsorted suffix segment; and
combine the one or more unsorted suffix segments into the set UG2h of unsorted suffixes.
With reference to the second aspect or the foregoing possible implementations of the second aspect, in an eighth possible implementation, the space occupied by the 2h-order suffix array SA2h is the space occupied by the suffix array SA0.
With reference to the second aspect or the foregoing possible implementations of the second aspect, in a ninth possible implementation, the sorting unit is specifically configured to:
divide the suffix segments of the set UGh into S-type suffix segments and L-type suffix segments according to a preset length T; and
sort the S-type suffix segments and the L-type suffix segments separately to obtain the 2h-order suffix array SA2h.
With reference to the ninth possible implementation of the second aspect, in a tenth possible implementation, the sorting unit is specifically configured to:
compare the length of each suffix segment of the set UGh with the preset length T; a suffix segment whose length is less than or equal to the preset length T is taken as an S-type suffix segment, and a suffix segment whose length is greater than the preset length T is taken as an L-type suffix segment.
In a third aspect, an embodiment of the present invention provides an apparatus for constructing a suffix array, including: a processor and a memory, where the memory is configured to store instructions;
the processor executes the instructions stored in the memory so as to:
obtain, according to a suffix array SA0 of a character string and a first rank array R0, an h-order suffix array SAh of the suffix array SA0 and a second rank array Rh, where h is a variable with an initial value of 1;
obtain, according to the h-order suffix array SAh, a set UGh of unsorted suffixes in the h-order suffix array SAh;
sort all suffixes in the set UGh to obtain a 2h-order suffix array SA2h; and
obtain, according to the second rank array Rh and the 2h-order suffix array SA2h, another set UG2h of unsorted suffixes; if the set UG2h is an empty set, the sorted suffix array SA is obtained.
With reference to the third aspect, in a first possible implementation, the processor is further configured to:
obtain, according to the second rank array Rh and the 2h-order suffix array SA2h, a third rank array R2h and the other set UG2h of unsorted suffixes; and
if the set UG2h is not an empty set, update the value of the variable h and sort the set UG2h of unsorted suffixes according to the third rank array R2h, so as to obtain an N-th set of unsorted suffixes, until the N-th set of unsorted suffixes last obtained is an empty set, whereupon the sorted suffix array SA is obtained; N is a natural number greater than 2.
With reference to the first possible implementation of the third aspect, in a second possible implementation, the processor is specifically configured to update the value of the variable h to 2h.
With reference to the third aspect or the foregoing possible implementations of the third aspect, in a third possible implementation, the processor is further configured to:
initialize the input character string to obtain the suffix array SA0 of the character string; and
adjust the starting character position of each suffix Si in the suffix array SA0 to obtain the rank array R0 of the character string.
With reference to the third aspect or the foregoing possible implementations of the third aspect, in a fourth possible implementation, the processor is specifically configured to:
when h is the initial value, sort each suffix Si in the suffix array SA0 using the first character of each suffix Si as the comparison key, to obtain the h-order suffix array SAh and the second rank array Rh.
With reference to the fourth possible implementation of the third aspect, in a fifth possible implementation, the processor is specifically configured to:
determine that more than one consecutive suffix in the h-order suffix array SAh has the same first character;
form the consecutive suffixes in the h-order suffix array SAh that have the same first character into one or more unsorted suffix segments [a, b], where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment; and
form the set UGh of unsorted suffixes from the one or more unsorted suffix segments [a, b].
With reference to the third aspect or the foregoing possible implementations of the third aspect, in a sixth possible implementation, the processor is specifically configured to:
sort the suffixes in the set UGh, using the h-rank array element Rh[i] of each suffix Si in the set UGh as the primary comparison key and the h-rank array element Rh[i+h] of the suffix Si+h as the secondary comparison key, to obtain the 2h-order suffix array SA2h.
With reference to the third aspect or the foregoing possible implementations of the third aspect, in a seventh possible implementation, the processor is specifically configured to:
compare the suffixes in the 2h-order suffix array SA_{2h} according to an adjacent-element comparison rule to obtain a first auxiliary array NC_{2h};
perform a prefix sum over the first auxiliary array NC_{2h} to obtain a second auxiliary array NS_{2h}, where, if more than one consecutive suffix in NS_{2h} has the same value, those consecutive suffixes form one unsorted suffix segment; and
form the set UG_{2h} of unsorted suffixes from the one or more unsorted suffix segments.
With reference to the third aspect or the foregoing possible implementations of the third aspect, in an eighth possible implementation, the space occupied by the 2h-order suffix array SA_{2h} is the space occupied by the suffix array SA_0.
With reference to the third aspect or the foregoing possible implementations of the third aspect, in a ninth possible implementation, the processor is specifically configured to:
divide the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to a preset length T; and
sort each S-type suffix segment using one thread block and each L-type suffix segment using two or more thread blocks, to obtain the 2h-order suffix array SA_{2h}.
With reference to the ninth possible implementation of the third aspect, in a tenth possible implementation, the processor is specifically configured to:
compare the length of each suffix segment in the set UG_h with the preset length T, treat a suffix segment whose length is less than or equal to T as an S-type suffix segment, and treat a suffix segment whose length is greater than T as an L-type suffix segment.
As can be seen from the foregoing technical solutions, in the method and apparatus for constructing a suffix array according to the embodiments of the present invention, the h-order suffix array SA_h and the second rank array R_h are obtained from the suffix array SA_0 of the string; the set UG_h of unsorted suffixes in SA_h is then obtained; all suffixes in UG_h are sorted to obtain the 2h-order suffix array SA_{2h}; another set UG_{2h} of unsorted suffixes is obtained according to R_h and SA_{2h}; and when UG_{2h} is empty, the sorted suffix array SA is obtained. This avoids the prior-art problem of repeatedly sorting suffixes whose final rank has already been found, and allows the suffix array to be constructed quickly.
Brief Description of the Drawings
FIG. 1 is a system architecture diagram for implementing a method for constructing a suffix array according to an embodiment of the present invention;
FIG. 2A is a schematic flowchart of a method for constructing a suffix array according to an embodiment of the present invention;
FIG. 2B is a schematic diagram of a suffix string table according to an embodiment of the present invention;
FIG. 3A is a schematic flowchart of a method for constructing a suffix array according to another embodiment of the present invention;
FIG. 3B is a schematic diagram of scheduling S-type suffix segments and L-type suffix segments in a GPU according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for constructing a suffix array according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for constructing a suffix array according to another embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for constructing a suffix array according to another embodiment of the present invention.
Detailed Description
As GPU computing power continues to grow, using the GPU as a coprocessor alongside the central processing unit (CPU) has become an important way to improve system computing capability and application performance. Because a GPU offers massive thread concurrency and high memory bandwidth, it can greatly relieve the computing-speed bottleneck of compute-intensive and data-intensive applications. The embodiments of the present invention therefore provide a method for constructing a suffix array on a GPU (or a similar processor with high data parallelism).
FIG. 1 shows a system architecture diagram for implementing the method for constructing a suffix array according to an embodiment of the present invention. As shown in FIG. 1, the system of this embodiment may include a CPU, a GPU, and a host memory. The CPU is connected to the host memory and to the GPU through a data bus, and the host memory is connected to the GPU through a data bus.
In this embodiment, execution on the GPU is scheduled by the CPU. The CPU selects an appropriate processing mode according to the characteristics of the task: for example, simple tasks are processed by the CPU itself, and data-parallel tasks may be processed by the GPU.
The GPU cannot directly access data held in the host memory; the data must first be copied over the data bus from the host memory into the global memory of the GPU. The GPU contains multiple thread blocks, each with the same number of threads and its own shared memory. Each thread has private computing resources (such as registers and local memory). The GPU global memory can be accessed by all threads of all thread blocks, the shared memory of a thread block can be accessed only by the threads of that thread block, and the registers and local memory of a thread can be accessed only by that thread.
Before processing, the data is stored on a disk device (HDD). Once the processing program starts, the data is first read from the disk into the host memory and is then dispatched by the CPU to itself or to the GPU for execution.
For ease of description, some symbols and terms used in the embodiments of the present invention are explained below.
Character set: a character set Σ is a set on which a total order is defined; that is, any two distinct elements a and b of Σ have a definite order, either a < b or a > b. The elements of the character set are called characters. The character set contains a special character '$', which appears only at the end of a string and is the smallest element of the character set.
String: a string of length n is an array S[0, n-1] of n characters. The last element of S, the terminator, is always '$'.
Substring: the substring K[i, j] (i < j) of a string S consists of the characters of S from position i to position j (inclusive), i.e., K[i, j] = S[i]S[i+1]...S[j].
Suffix: the suffix S_i of a string S is the substring of S from the character at position i up to the terminator '$', i.e., S_i = S[i]S[i+1]...$.
Suffix array: the suffix array SA is the array formed by arranging all suffixes of a string S in lexicographic order. Each suffix is represented by its starting position; that is, i denotes the suffix S_i.
Here, SA[i] = j means that the suffix S_j is the i-th smallest of all suffixes.
Rank array: the rank array R is also called the inverse suffix array ISA. It stores the rank of every suffix: ISA[i] = R[i] = j means that the suffix S_i is the j-th smallest of all suffixes. R, SA, and ISA are therefore related by R = ISA = SA^{-1}.
h-order suffix array: the h-order suffix array SA_h is the array obtained by sorting all suffixes of a string S using their first h characters as the comparison key.
In addition, the unsorted suffix segments described below are made up of unsorted suffixes. Unsorted suffixes are grouped into unsorted suffix segments according to whether their first h characters are the same.
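The definitions above can be illustrated with a small sketch. It is not part of the described method: it builds SA naively by comparing whole suffixes, derives R = ISA from it, and groups suffixes by their first h characters. The function names are hypothetical, and positions and ranks are 0-based here, whereas the text counts the "i-th smallest" suffix.

```python
def suffix_array_naive(s):
    """Return SA: starting positions of all suffixes of s in lexicographic order."""
    # '$' has a smaller code point than the letters, matching the terminator rule.
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_array(sa):
    """Return R = ISA, the inverse permutation of SA (R[sa[j]] = j)."""
    r = [0] * len(sa)
    for j, i in enumerate(sa):
        r[i] = j
    return r


def h_groups(s, h):
    """Group suffix start positions by their first h characters (h-order)."""
    order = sorted(range(len(s)), key=lambda i: s[i:i + h])
    groups = []
    for i in order:
        head = groups[-1][0] if groups else None
        if head is not None and s[head:head + h] == s[i:i + h]:
            groups[-1].append(i)  # same first h characters: still tied
        else:
            groups.append([i])
    return groups


S = "googol$"
SA = suffix_array_naive(S)
R = rank_array(SA)
print(SA)              # suffix start positions in sorted order
print(R)               # rank of each suffix
print(h_groups(S, 1))  # groups with more than one member are unsorted segments
```

Groups of size one are already in their final position for the given h; larger groups correspond to the unsorted suffix segments defined above.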
FIG. 2A is a schematic flowchart of a method for constructing a suffix array according to an embodiment of the present invention. As shown in FIG. 2A, the method of this embodiment is as follows.
201. According to the suffix array SA_0 and the rank array R_0 of the string, obtain the h-order suffix array SA_h of the suffix array SA_0 and the second rank array R_h, where h is a variable whose initial value is 1.
In this embodiment, when h is the initial value, every suffix S_i in the suffix array SA_0 may be sorted using the first character of each suffix S_i as the comparison key, to obtain the h-order suffix array SA_1 and the second rank array R_1. Specifically, if SA_1[i] = j, then R_1[j] = i.
202. According to the h-order suffix array SA_h, obtain the set UG_h of unsorted suffixes in SA_h.
When h is the initial value, this set of unsorted suffixes is UG_1.
203. Sort all suffixes in the set UG_h to obtain the 2h-order suffix array SA_{2h}.
Optionally, a third rank array R_{2h} may also be obtained according to the second rank array R_h and the 2h-order suffix array SA_{2h}.
For example, the suffixes in the set UG_h are sorted using, for any suffix S_i in UG_h, the h-rank array element R_h[i] as the primary comparison key and the h-rank array element R_h[i+h] of the suffix S_{i+h} as the secondary comparison key, to obtain the 2h-order suffix array SA_{2h}.
The h-rank array elements R_h[i] and R_h[i+h] are elements of the second rank array R_h.
204. According to the second rank array R_h and the 2h-order suffix array SA_{2h}, obtain another set UG_{2h} of unsorted suffixes.
205. Determine whether the set UG_{2h} is empty.
206. If the set UG_{2h} is empty, the sorted suffix array SA is obtained.
Optionally, if the set UG_{2h} is not empty, the value of the variable h is updated and the sorting of the set of unsorted suffixes is repeated until the most recently obtained set of unsorted suffixes is empty, at which point the sorted suffix array SA is obtained.
That is, the set UG_{2h} of unsorted suffixes is sorted according to the third rank array R_{2h} to obtain the N-th set of unsorted suffixes, and this continues until the N-th set of unsorted suffixes obtained is empty, yielding the sorted suffix array SA; N is a natural number greater than 2.
For example, updating the value of the variable h may specifically be: updating the value of h to 2h.
In other words, it is determined whether the new unsorted suffix set UG_{2h} is empty. If it is not empty, h is updated, i.e., h = 2 × h, and the procedure returns to step 203; that is, all suffixes in the set UG_{2h} are sorted to obtain the 4h-order suffix array SA_{4h}.
Correspondingly, in step 204', a set UG_{4h} of unsorted suffixes is obtained according to the third rank array R_{2h} and the 4h-order suffix array SA_{4h}.
It is then determined whether the new unsorted suffix set UG_{4h} is empty. If it is not empty, the foregoing process is repeated; if UG_{4h} is empty, the final suffix array, denoted SA, is obtained.
During the iterations, the suffix arrays and rank arrays can reuse the space of the initial suffix array and rank array, which saves storage. In addition, the unsorted suffix set UG may also be stored in an array; its representation in the computing system is therefore not restricted here.
That is, in the foregoing embodiment, the space occupied by the 2h-order suffix array SA_{2h} may be the space occupied by the suffix array SA_0, and the space occupied by the rank array R_{2h} may be the space occupied by the rank array R_0. The method for constructing a suffix array can thus save storage space.
In the method for constructing a suffix array of this embodiment, the h-order suffix array SA_h and the second rank array R_h are obtained from the suffix array SA_0 of the string; the set UG_h of unsorted suffixes in SA_h is then obtained; all suffixes in UG_h are sorted to obtain the 2h-order suffix array SA_{2h}; another set UG_{2h} of unsorted suffixes is obtained according to R_h and SA_{2h}; and when UG_{2h} is empty, the sorted suffix array SA is obtained. This accelerates construction of the suffix array during suffix data processing and solves the prior-art problem of repeatedly sorting suffixes whose final rank has already been obtained.
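The loop of steps 201 to 206 can be sketched sequentially as follows. This is a minimal illustration under stated assumptions, not the GPU implementation: the function name is hypothetical, every suffix in an unsorted segment [a, b] is given the shared rank a (so the primary key R_h[i] is constant inside a segment and only the secondary key R_h[i+h] needs to be compared), and a rank of -1 is assumed for positions past the end of the string.

```python
def build_suffix_array(s):
    """Prefix doubling that re-sorts only the unsorted suffix segments."""
    n = len(s)
    # Step 201 with h = 1: order the suffixes by their first character.
    sa = sorted(range(n), key=lambda i: s[i])
    rank = [0] * n
    segments = []  # UG_h as inclusive position ranges (a, b) in sa
    a = 0
    while a < n:
        b = a
        while b + 1 < n and s[sa[b + 1]] == s[sa[a]]:
            b += 1
        for p in range(a, b + 1):
            rank[sa[p]] = a          # tied suffixes share the rank a
        if b > a:
            segments.append((a, b))  # step 202: collect unsorted segments
        a = b + 1

    h = 1
    while segments:                  # steps 205/206: stop when UG is empty
        new_rank = rank[:]
        new_segments = []
        for a, b in segments:
            def key(i):
                # Secondary key R_h[i+h]; -1 past the end is an assumption.
                return rank[i + h] if i + h < n else -1

            # Step 203: sort only this unsorted segment.
            sa[a:b + 1] = sorted(sa[a:b + 1], key=key)
            # Step 204: split the segment into runs of equal keys.
            start = a
            for p in range(a + 1, b + 2):
                if p > b or key(sa[p]) != key(sa[start]):
                    for q in range(start, p):
                        new_rank[sa[q]] = start
                    if p - start > 1:
                        new_segments.append((start, p - 1))
                    start = p
        rank, segments, h = new_rank, new_segments, 2 * h
    return sa


print(build_suffix_array("googol$"))
```

Suffixes that leave the unsorted set are never touched again, which is the saving the embodiment describes; the `sa` list is updated in place, in line with the space reuse mentioned above.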
In an optional implementation scenario, before step 201, the method may further include the following step 200, which is not shown in the figure:
200. Initialize the input string S to obtain the suffix array SA_0 of the string; adjust the starting character positions of the suffixes S_i in the suffix array SA_0 to obtain the rank array R_0 of the string.
For example, the starting character position of the suffix S_i in the suffix array SA_0 is placed at the i-th position of R_0, i.e., SA_0[i] = R_0[i] = i.
Optionally, step 202 may include the following sub-steps, which are not shown in the figure:
A2021. Determine that the h-order suffix array SA_h contains more than one consecutive suffix whose first characters are the same.
A2022. Group the consecutive suffixes in SA_h that share the same first character into one or more unsorted suffix segments [a, b], where a is the position of the first suffix in an unsorted suffix segment and b is the position of the last suffix in that segment.
A2023. Form the set UG_h of unsorted suffixes from the one or more unsorted suffix segments [a, b].
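Sub-steps A2021 to A2023 can be sketched as a single scan over SA_h for h = 1. The function name and the representation of a segment as an inclusive pair (a, b) of 0-based positions are assumptions made for illustration.

```python
def first_char_segments(s, sa1):
    """Return UG_1: runs of positions in sa1 whose suffixes share a first character."""
    segments = []
    n = len(sa1)
    a = 0
    while a < n:
        b = a
        # Extend the run while the next suffix starts with the same character.
        while b + 1 < n and s[sa1[b + 1]] == s[sa1[a]]:
            b += 1
        if b > a:  # a run of length one is already sorted
            segments.append((a, b))
        a = b + 1
    return segments


S = "googol$"
SA1 = sorted(range(len(S)), key=lambda i: S[i])  # 1-order suffix array
UG1 = first_char_segments(S, SA1)
print(SA1, UG1)
```

For this input the suffixes starting with 'g' and those starting with 'o' form the two unsorted segments, while '$' and 'l$' are already in their final positions.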
An unsorted suffix array is used as an example below.
For an input string T, all of its possible suffixes are first extracted. For example, if T is googol$, the suffixes are as shown in the right-hand column of FIG. 2B. These suffixes are then sorted using a method similar to prefix doubling; that is, the whole sorting process consists of multiple iterations. Unlike the prior art, however, this embodiment classifies suffixes into a set of sorted suffixes (Sorted group) and a set of unsorted suffixes (Unsorted group), which avoids the prior-art problem of repeatedly sorting suffixes whose rank has already been settled.
Specifically, initially (that is, before the first iteration starts), all suffixes are placed in the Unsorted group as a single suffix segment. In each iteration, this embodiment sorts only the suffixes in the Unsorted group, which avoids re-sorting suffixes whose position in the order has already been determined. Once the order of all suffixes has been determined (that is, all suffixes have been added to the Sorted group), the suffix sorting process is complete. The suffix array SA and the rank array R can then be derived from the sorting result.
The sorting of the suffixes in the Unsorted group is completed over one or more iterations. In each iteration, only the first k characters of each suffix are used for sorting, and after sorting, the suffixes whose global order can be determined are moved from the Unsorted group to the Sorted group. That is, for each suffix in the Unsorted group, if other suffixes are equal to it (their first k characters are the same), the order of these equal suffixes cannot be determined from their first k characters, so the equal suffixes are placed together in one segment to await the next iteration.
If no other suffix is equal to it, the position of the suffix in the current Unsorted group can be determined, so it is moved to the Sorted group. After an iteration, if there are still suffixes whose order has not been determined, k is increased (for example, k = 2k; k here corresponds to the variable h shown in FIG. 2A) and the next iteration begins; otherwise, the whole sorting process is complete.
Example 1: processing of the input string T = googol$ (the computation of SA and ISA is not shown here).
Step 1: Extract the suffixes (the "suffixes" column on the right of FIG. 2B) as one suffix segment ug1 and add it to the Unsorted group:
Unsorted group = {ug1 = {googol$, oogol$, ogol$, gol$, ol$, l$, $}},
Sorted group = { }.
Step 2: Process the Unsorted group iteratively. Assume that k is initially 1 and is incremented by 1 after each iteration.
Iteration 1: only the first character of each suffix is used for sorting, so the result is
Sorted group = {$, l$}, Unsorted group = {ug1 = {googol$, gol$}, ug2 = {oogol$, ogol$, ol$}}
Iteration 2: now k = 2, and the first two characters of each suffix are used for sorting; the result is
Sorted group = {$, l$, ogol$, ol$, oogol$}, Unsorted group = {ug1 = {googol$, gol$}}
Iteration 3: now k = 3, and the first three characters of each suffix are used for sorting; the result is
Sorted group = {$, l$, gol$, googol$, ogol$, ol$, oogol$}, Unsorted group = { }
At this point all suffixes are sorted, and the sorted suffix array SA is obtained.
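The trace in Example 1 can be reproduced with a short sketch. It follows the example's simplification of increasing k by 1 per iteration and compares the first k characters directly rather than using rank arrays; the function name and the shape of the returned history are assumptions made for illustration.

```python
from itertools import groupby


def trace_groups(text):
    """Return [(k, sorted_group, unsorted_group), ...], one entry per iteration."""
    suffixes = [text[i:] for i in range(len(text))]
    sorted_group = []
    unsorted_group = [suffixes]  # initially one segment holding every suffix
    history = []
    k = 1
    while unsorted_group:
        next_unsorted = []
        for segment in unsorted_group:
            ordered = sorted(segment, key=lambda suf: suf[:k])
            for _, run in groupby(ordered, key=lambda suf: suf[:k]):
                run = list(run)
                if len(run) == 1:
                    sorted_group.append(run[0])  # position is now determined
                else:
                    next_unsorted.append(run)    # still tied on first k chars
        unsorted_group = next_unsorted
        history.append((k, sorted(sorted_group), [list(r) for r in unsorted_group]))
        k += 1
    return history


for k, done, pending in trace_groups("googol$"):
    print(k, done, pending)
```

Each iteration only revisits the segments that are still tied, so the work shrinks as suffixes move into the Sorted group.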
To increase the speed at which the GPU constructs the suffix array, and in view of the characteristics of GPU parallel data processing, an embodiment of the present invention provides a method for processing the suffixes of the unsorted suffix set in parallel, thereby increasing the processing speed of step 204 in FIG. 2A.
For example, assuming that the current iteration value is h, step 204 may include the following sub-steps, which are not shown in the figure:
A2041. Compare the suffixes in the 2h-order suffix array SA_{2h} according to an adjacent-element comparison rule to obtain a first auxiliary array NC_{2h}.
Specifically, for each suffix segment [a, b] in the unsorted suffix set, NC_{2h}[a] is set to a; every other suffix is compared with its left neighbor using the first 2h characters as the comparison key, and the entry is 0 if they are the same and 1 if they differ.
A2042. Perform a prefix sum over the first auxiliary array NC_{2h} to obtain a second auxiliary array NS_{2h}. Specifically, NS_{2h}[i] = NC_{2h}[0] + NC_{2h}[1] + ... + NC_{2h}[i].
A2043. If more than one consecutive suffix in the second auxiliary array NS_{2h} has the same value, those consecutive suffixes form one unsorted suffix segment.
Optionally, according to the second auxiliary array NS_{2h}, the elements of SA_{2h} are distributed to the corresponding positions of a new array to obtain the rank array R_{2h}. Specifically, if NS_{2h}[i] = j, then R_{2h}[j] = i.
A2044. Form the set UG_{2h} of unsorted suffixes from the one or more unsorted suffix segments.
Specifically, if more than one consecutive element of NS_{2h} has the same value, these elements form one unsorted suffix segment. The unsorted suffix set UG_{2h} is the collection of all unsorted suffix segments.
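Sub-steps A2041 to A2044 can be sketched for a single segment as follows. The sketch assumes that the prefix sum restarts at the head of each segment [a, b] (a segmented scan), which is one reading of setting NC_{2h}[a] = a; the function name and the `key` callback, which returns the first 2h characters of a suffix, are hypothetical.

```python
def split_segment(sa, a, b, key):
    """Return (NC, NS, sub_segments) for positions a..b of sa, already sorted by key."""
    # A2041: the segment head stores its own position; every other entry is
    # 0 if the suffix ties with its left neighbor and 1 otherwise.
    nc = [a]
    for p in range(a + 1, b + 1):
        nc.append(0 if key(sa[p]) == key(sa[p - 1]) else 1)

    # A2042: inclusive prefix sum over the segment.
    ns = []
    total = 0
    for value in nc:
        total += value
        ns.append(total)

    # A2043/A2044: runs of equal NS values longer than one stay unsorted.
    sub_segments = []
    start = 0
    for q in range(1, len(ns) + 1):
        if q == len(ns) or ns[q] != ns[start]:
            if q - start > 1:
                sub_segments.append((a + start, a + q - 1))
            start = q
    return nc, ns, sub_segments


S = "googol$"
SA2 = [6, 0, 3, 5, 2, 4, 1]          # ordered by the first 2 characters
first_two = lambda i: S[i:i + 2]      # comparison key for 2h = 2
print(split_segment(SA2, 1, 2, first_two))  # 'go' / 'go': still tied
print(split_segment(SA2, 4, 6, first_two))  # 'og' / 'ol' / 'oo': resolved
```

On a GPU the comparison and the scan would each be data-parallel operations; the sequential loops here only show what values the two auxiliary arrays hold.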
In addition, for each suffix segment [a, b] in the unsorted suffix set UG_h and for each suffix SA_h[i] (a ≤ i ≤ b), the unsorted suffix segment is sorted using the h-rank R_h[SA_h[i]+h] of the corresponding suffix SA_h[i]+h as the comparison key, to obtain SA_{2h}.
It should be noted that the auxiliary arrays such as NC and NS can be reused across iterations, which saves storage. The space allocation of NC and NS is therefore not restricted here.
To increase the speed at which the GPU constructs the suffix array, the 2h-order suffix array SA_{2h} in step 203 can be obtained faster by exploiting the multithreading and high memory bandwidth of the GPU, thereby increasing the processing speed of step 203 in FIG. 2A. As shown in FIG. 3A, the parallel processing method for the suffixes of the unsorted suffix set in this embodiment is as follows.
301. Divide the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to a preset length T.
For example, the length of each suffix segment in the set UG_h is compared with the preset length T; a suffix segment whose length is not greater than T is treated as an S-type suffix segment, and a suffix segment whose length is greater than T is treated as an L-type suffix segment. Specifically, for a suffix segment I = [a, b], if its length |I| is not greater than the preset length T, it is an S-type suffix segment; otherwise, it is an L-type suffix segment.
It should be noted that the length of a suffix segment is the number of suffixes it contains.
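Step 301 amounts to a simple partition by segment length. In the sketch below, the function name and the representation of a segment as an inclusive pair (a, b) are assumptions; the threshold rule itself is the one stated above.

```python
def classify_segments(segments, T):
    """Split unsorted segments into S-type (length <= T) and L-type (length > T)."""
    s_type, l_type = [], []
    for a, b in segments:
        length = b - a + 1  # number of suffixes in the segment
        if length <= T:
            s_type.append((a, b))  # small enough for a single thread block
        else:
            l_type.append((a, b))  # handled by two or more thread blocks
    return s_type, l_type


print(classify_segments([(1, 2), (4, 6), (10, 19)], T=3))
```

A natural choice would be to tie T to how many suffixes one thread block can sort within its shared memory, but the text does not fix a value, so that is left open here.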
302. Sort each S-type suffix segment using one thread block and each L-type suffix segment using two or more thread blocks, to obtain the 2h-order suffix array SA_{2h}.
The processing of S-type suffix segments and the processing of L-type suffix segments have no ordering dependency and can be executed in parallel. Because multiple thread blocks in the GPU can execute in parallel, multiple S-type suffix segments can be sorted at the same time. Likewise, multiple L-type suffix segments can be sorted at the same time. This approach can therefore greatly improve the computing utilization of the GPU and speed up construction of the suffix array.
对 GPU而言, 基数排序是目前效率最高的一种并行排序方法。 因此, 本发明实施例以并行基数排序为例, 阐述如何用 GPU来实现单线程块并 行排序和多线程块并行排序。  For GPUs, cardinality sorting is currently the most efficient parallel sorting method. Therefore, the embodiment of the present invention takes parallel matrix ordering as an example to illustrate how to use the GPU to implement single-thread block parallel sorting and multi-thread block parallel sorting.
对于单线程块并行排序而言, 因为线程块内的所有线程可以访问该线 程块的共享存储器, 所以实现方法比较简单。 本发明实施例采用最低有效 位基数排序方法 (least significant bit radix sort) 。 基数排序的每次迭代的 比特位可以配置, 一般视 GPU的计算能力和存储空间而定。 单线程块并 行基数排序的单次迭代的歩骤如下所示。  For single-thread block parallel sorting, the implementation is simpler because all threads within the thread block can access the shared memory of the thread block. The embodiment of the present invention uses a least significant bit radix sort. The bits of each iteration of the cardinality order can be configured, depending on the computing power and storage space of the GPU. The steps for a single iteration of a single-thread block parallel-matrix sort are as follows.
举例来说,歩骤 302中的采用一个线程块对所述 S型后缀段进行排序, 包括:  For example, the step of sorting the S-type suffix segments by using one thread block in step 302 includes:
S01、 计算所述 S型后缀段的第一直方图11。  S01. Calculate a first histogram 11 of the S-type suffix segment.
具体的, 对所有元素的当前比较键, 计算每个具体值的元素个数。 第 一直方图 H用数组保存,键值作为数组下标, 而该键值对应的元素个数作 为该下标对应的数组元素。  Specifically, for each element's current comparison key, calculate the number of elements for each specific value. The first histogram H is saved in an array, and the key value is used as an array subscript, and the number of elements corresponding to the key value is used as an array element corresponding to the subscript.
S02、 对所述第一直方图 H执行前缀求和操作, 得到所述 S型后缀段 的前缀求和结果数组 M。 g卩, M[i] = H[0]+H[l]+〜+H[i]。  S02. Perform a prefix sum operation on the first histogram H to obtain an array M of prefix summation results of the S-type suffix segments. g卩, M[i] = H[0]+H[l]+~+H[i].
S03、 根据前缀求和结果数组 M, 将后缀分布到对应的位置。  S03: According to the prefix sum result array M, distribute the suffix to the corresponding position.
具体的, 键值为 1的数组元素的起始位置是 M[i]。 如果有相同键值的 数组元素, 则依次放在后续位置。  Specifically, the starting position of the array element with a key value of 1 is M[i]. If there are array elements with the same key value, they are placed in subsequent positions.
The steps above are executed by a single, independent piece of parallel processing code (a kernel function). Steps S01 and S02 can further exploit GPU features to speed up the computation. For example, each thread of the thread block first computes a local histogram and keeps it in registers, then copies the local histogram to the shared memory of the thread block, after which the prefix-sum operation is performed over all of the histograms.
Sorting the L-type suffix segments with two or more thread blocks in step 302 includes:
M01. Partition the L-type suffix segment into slices according to the array length handled by each thread block, and assign each slice to the corresponding thread block.
For example, if the array length is l and each thread block handles an array of length t, then l/t thread blocks are used.
M02. Obtain the histogram of each slice of the L-type suffix segment, and perform a prefix-sum operation over the histograms obtained by all thread blocks to obtain the prefix-sum result array M_g of the L-type suffix segment.
That is, each thread block first computes the histogram of the slice it is responsible for, obtaining the slice histogram H_b of that thread block, and copies the result to the global memory of the GPU.
After every thread block has computed its histogram, the prefix-sum operation is performed over all thread-block histograms together to obtain a global prefix-sum result array M_g.
M03. Distribute the suffixes to their positions in the sorted suffix segment according to the global prefix-sum result array M_g.
Specifically, the starting position of the array elements with key value i is M_g[i]; array elements with the same key value are placed in the subsequent positions in turn.
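Steps M01 to M03 can be sketched sequentially in Python as follows. The sketch is hedged: `multi_block_radix_pass` and `slice_len` are hypothetical names, each "thread block" is simulated by one loop iteration, and the global prefix sum is assumed to run in key-major, block-minor order, a common choice that keeps the pass stable but is not spelled out in the text.

```python
def multi_block_radix_pass(keys, values, num_buckets, slice_len):
    """One radix-sort pass over an array split among several blocks."""
    n = len(keys)
    # M01: cut the array into slices of at most slice_len elements.
    slices = [range(s, min(s + slice_len, n)) for s in range(0, n, slice_len)]

    # M02 (part 1): each block builds the histogram H_b of its slice.
    block_hist = [[0] * num_buckets for _ in slices]
    for b, idxs in enumerate(slices):
        for i in idxs:
            block_hist[b][keys[i]] += 1

    # M02 (part 2): one global exclusive prefix sum over all block
    # histograms. Assumption: key-major, block-minor traversal.
    offset = [[0] * num_buckets for _ in slices]
    running = 0
    for k in range(num_buckets):
        for b in range(len(slices)):
            offset[b][k] = running
            running += block_hist[b][k]

    # M03: each block scatters its elements using its own offsets.
    out = [None] * n
    for b, idxs in enumerate(slices):
        for i in idxs:
            k = keys[i]
            out[offset[b][k]] = values[i]
            offset[b][k] += 1
    return out


sorted_values = multi_block_radix_pass(
    [2, 0, 1, 0, 2], ["a", "b", "c", "d", "e"], num_buckets=3, slice_len=2
)
```

Because every block owns a disjoint range of output slots for each key value, the scatter in M03 needs no synchronization between blocks once the global prefix sum is complete.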
That is, the 2h-order suffix array SA_2h is obtained according to the prefix-sum result array M and the global prefix-sum result array M_g.
This approach takes full advantage of the GPU's ability to process multiple thread blocks.
Step M01 is executed by the CPU and the remaining steps by the GPU, each step being implemented by an independent kernel function. Step M02 can be executed concurrently by multiple thread blocks. Step M03 can start only after all thread blocks of step M02 have finished. Step M03 may be executed by one thread block or by multiple thread blocks, each thread block being responsible for an independent set of key values.
Because threads in different thread blocks of a GPU can share data only through the GPU's global memory, multi-thread-block parallel radix sort is somewhat more complex than the single-thread-block variant. In a GPU environment, the above method makes full use of the GPU's many concurrent threads and high memory bandwidth to accelerate suffix array construction.
The above embodiment parallelizes the processing of the suffixes in the unsorted group, making it suitable for highly data-parallel environments such as GPUs. Each of steps S01 to S03 and M01 to M03 above is an independent concurrent process (that is, each step can be processed concurrently by multiple threads).
GPU parallel processing is illustrated below with reference to FIG. 1.
Corresponding to step 301 above, to make full use of the parallel processing capability of the GPU, the suffix segments in the unsorted group are divided into two classes according to their length: 1) class S (small groups, i.e., S-type suffix segments): S = { ug_i | size(ug_i) ≤ τ }, that is, the number of suffixes in each suffix segment of this class must be less than or equal to a predetermined threshold τ (the preset length T above); 2) otherwise, class L (large groups, i.e., L-type suffix segments).
The computation on the unsorted-group data in the GPU is scheduled by the CPU. Each suffix segment in class S is sorted by a single thread block (ThreadBlock) in the GPU, whereas each suffix segment in class L is sorted by multiple thread blocks (as shown in FIG. 3B, where suffix segments ug_1, ug_2 and ug_3 belong to class S, and ug_4 and ug_5 belong to class L).
For example, the suffix segments in the unsorted group are classified as follows.
For Unsorted group = { ug_1 = {ississippi$, issippi$, i$}, ug_2 = {ssissippi$, sissippi$, ssippi$, sippi$}, ug_3 = {ppi$, pi$} } and τ = 3, ug_1 and ug_3 are assigned to class S, and ug_2 is assigned to class L.
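The classification in this example can be reproduced with a few lines of Python. The function name `classify_segments` is hypothetical, and the segments are represented as plain lists of suffix strings for readability; an implementation would more likely store position ranges.

```python
def classify_segments(unsorted_group, tau):
    """Split suffix segments into class S (size <= tau) and class L."""
    s_class, l_class = [], []
    for segment in unsorted_group:
        if len(segment) <= tau:
            s_class.append(segment)   # small: one thread block
        else:
            l_class.append(segment)   # large: several thread blocks
    return s_class, l_class


ug1 = ["ississippi$", "issippi$", "i$"]
ug2 = ["ssissippi$", "sissippi$", "ssippi$", "sippi$"]
ug3 = ["ppi$", "pi$"]
s_class, l_class = classify_segments([ug1, ug2, ug3], tau=3)
```

With τ = 3, `s_class` holds ug1 and ug3 and `l_class` holds ug2, matching the example.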
The suffix segments in classes S and L can be sorted by the following different sub-steps.
Specifically, for the suffix segments of class S, each suffix segment is sorted by one kernel function in three steps:
S01. Compute the histogram H of the suffix segment (considering only k consecutive bits);
S02. Scan the histogram H to obtain the scatter offset of each thread;
S03. Write the sorting result of each thread to the shared memory of the thread block, which then transfers it to the global memory of the GPU.
The suffix segments of class L are also sorted in three steps, but here each step is handled by its own kernel function:
M01. Compute the histogram of each thread block;
M02. Scan the histograms of the thread blocks and compute the scatter offset of each thread block;
M03. As in the sorting of class S suffix segments, each thread block processes its own data.
In particular, the sorted results are gathered in the global memory of the GPU, after which the thread blocks of the GPU complete the remaining steps in turn, each step also being executed concurrently.
For example: globally compare each pair of adjacent suffixes within every suffix segment; perform a prefix-sum operation on the adjacent-comparison results of the suffix segments; and, from the results of the preceding steps, compute the rank array corresponding to the sorted suffix array, the new unsorted group, and so on.
The above embodiment parallelizes the processing of the suffixes in the unsorted group, making it suitable for highly data-parallel environments such as GPUs.
FIG. 4 is a schematic structural diagram of an apparatus for constructing a suffix array according to an embodiment of the present invention. As shown in FIG. 4, the apparatus of this embodiment includes a first obtaining unit 41, a second obtaining unit 42, a sorting unit 43, and a third obtaining unit 44.
The first obtaining unit 41 is configured to obtain, from the suffix array SA_0 and the first rank array R_0 of a character string, the h-order suffix array SA_h of the suffix array SA_0 and the second rank array R_h, where h is a variable with an initial value of 1;
the second obtaining unit 42 is configured to obtain, from the h-order suffix array SA_h, the set UG_h of unsorted suffixes in the h-order suffix array SA_h;
the sorting unit 43 is configured to sort all suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h;
for example, the sorting unit 43 is specifically configured to sort the suffixes in the set UG_h using, for any suffix S_i in the set UG_h, the h-rank array element R_h[i] as the primary comparison key and the h-rank array element R_h[i+h] of the suffix S_{i+h} as the secondary comparison key, to obtain the 2h-order suffix array SA_2h;
the third obtaining unit 44 is configured to obtain another set UG_2h of unsorted suffixes from the second rank array R_h and the 2h-order suffix array SA_2h; if the set UG_2h is empty, the sorted suffix array SA is obtained.
Optionally, the third obtaining unit is further configured to obtain a third rank array R_2h from the second rank array R_h and the 2h-order suffix array SA_2h.
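The primary and secondary comparison keys used by the sorting unit 43 can be sketched in Python as follows. This is a minimal illustration rather than the GPU radix sort of the embodiment: the name `sort_segment`, the use of Python's built-in `sorted`, and the convention that a position past the end of the string receives the secondary key -1 are assumptions made for the example.

```python
def sort_segment(segment, rank, h):
    """Sort suffix start positions by the key pair (R_h[i], R_h[i+h])."""
    n = len(rank)

    def key(i):
        # Assumption: a suffix shorter than h + 1 characters has no
        # R_h[i+h]; it is given -1 so that it sorts first.
        secondary = rank[i + h] if i + h < n else -1
        return (rank[i], secondary)

    return sorted(segment, key=key)


# String "banana$": ranks by first character ('$' < 'a' < 'b' < 'n'),
# each rank being the first position of that character's bucket.
rank_1 = [4, 1, 5, 1, 5, 1, 0]
a_segment = [1, 3, 5]              # the suffixes that start with 'a'
result = sort_segment(a_segment, rank_1, h=1)
```

In this example suffix 5 ("a$") moves to the front, while suffixes 1 and 3 share the same key pair and therefore stay together as an unsorted segment for the next round.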
If the set UG_2h is not empty, the apparatus may further include a variable updating unit 45, as shown in FIG. 5. The variable updating unit 45 is configured to update the value of the variable h; for example, it is specifically configured to update the value of h to 2h.
In this case, the sorting unit 43 is further configured to, together with the variable updating unit, repeatedly sort the set UG_2h of unsorted suffixes until the set of unsorted suffixes last obtained by the third obtaining unit is empty, yielding the sorted suffix array SA. That is, the set UG_2h of unsorted suffixes is sorted using the third rank array R_2h obtained by the third obtaining unit and the value of h updated by the variable updating unit, in order to obtain the N-th set of unsorted suffixes, until the N-th set of unsorted suffixes last obtained by the third obtaining unit is empty, yielding the sorted suffix array SA; N is a natural number greater than 2.
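The overall iteration (sort only the unsorted segments, regroup, double h, stop when no unsorted segment remains) can be sketched sequentially in Python. This is a hedged illustration, not the claimed implementation: the names `build_suffix_array` and `find_unsorted` are hypothetical, comparison sorting stands in for the GPU radix sort, a rank is taken to be the first position of a suffix's bucket, and a position past the end of the string is given the key -1.

```python
def find_unsorted(sa, rank):
    """Return inclusive ranges [a, b] of adjacent suffixes sharing a rank."""
    segments, a, n = [], 0, len(sa)
    for pos in range(1, n + 1):
        if pos == n or rank[sa[pos]] != rank[sa[a]]:
            if pos - a > 1:
                segments.append((a, pos - 1))
            a = pos
    return segments


def build_suffix_array(text):
    n = len(text)
    # h = 1: sort all suffixes by their first character.
    sa = sorted(range(n), key=lambda i: text[i])
    rank = [0] * n
    for pos, i in enumerate(sa):
        same = pos > 0 and text[sa[pos - 1]] == text[i]
        rank[i] = rank[sa[pos - 1]] if same else pos
    segments, h = find_unsorted(sa, rank), 1

    while segments:                       # stop when UG is empty
        def key(i):
            return (rank[i], rank[i + h] if i + h < n else -1)

        new_rank = list(rank)
        for a, b in segments:             # suffixes already placed are skipped
            sa[a:b + 1] = sorted(sa[a:b + 1], key=key)
            for pos in range(a, b + 1):
                i, prev = sa[pos], sa[pos - 1]
                same = pos > a and key(prev) == key(i)
                new_rank[i] = new_rank[prev] if same else pos
        rank = new_rank
        segments = find_unsorted(sa, rank)
        h *= 2                            # variable update: h becomes 2h
    return sa


sa = build_suffix_array("banana$")
```

The suffix array is updated in place, which mirrors the statement that SA_2h occupies the same space as SA_0.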
In this embodiment, the 2h-order suffix array SA_2h occupies the same space as the suffix array SA_0.
In an optional implementation scenario, the apparatus further includes a fourth obtaining unit 46 (not shown in the figures), configured to, before the first obtaining unit 41 obtains the h-order suffix array SA_h:
initialize the input character string to obtain the suffix array SA_0 of the string; and
adjust the starting character position of each suffix S_i in the suffix array SA_0 to obtain the rank array R_0 of the string.
For example, the first obtaining unit 41 may be specifically configured to, when h is the initial value, sort every suffix S_i in the suffix array SA_0 using the first character of each suffix S_i as the comparison key, to obtain the h-order suffix array SA_h and the second rank array R_h.
Optionally, the second obtaining unit 42 is specifically configured to: determine that more than one consecutive suffix in the h-order suffix array SA_h has the same first character;
form one or more unsorted suffix segments [a, b] from the consecutive suffixes in the h-order suffix array SA_h that have the same first character, where a is the position of the first suffix in the unsorted suffix segment and b is the position of the last suffix in the unsorted suffix segment;
the one or more unsorted suffix segments [a, b] form the set UG_h of unsorted suffixes.
Optionally, the third obtaining unit 44 is specifically configured to: compare the suffixes in the 2h-order suffix array SA_2h according to an adjacent-element comparison rule to obtain a first auxiliary array NC_2h; perform a prefix sum on the first auxiliary array NC_2h to obtain a second auxiliary array NS_2h; if more than one consecutive suffix has the same value in the second auxiliary array NS_2h, those consecutive suffixes form one unsorted suffix segment;
the one or more unsorted suffix segments form the set UG_2h of unsorted suffixes.
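The regrouping performed by the third obtaining unit can be sketched in Python as follows. The text does not define the adjacent-element comparison rule, so this sketch assumes one: NC is 1 where a suffix's key pair differs from that of its predecessor in SA_2h and 0 otherwise. The name `regroup` and the -1 key for positions past the end of the string are likewise assumptions.

```python
from itertools import accumulate


def regroup(sa, rank, h):
    """Build NC and NS for a 2h-ordered array and list unsorted segments."""
    n = len(sa)

    def key(i):
        return (rank[i], rank[i + h] if i + h < n else -1)

    # NC: assumed rule, 1 where the key differs from the predecessor's.
    nc = [1 if p > 0 and key(sa[p]) != key(sa[p - 1]) else 0
          for p in range(n)]
    # NS: inclusive prefix sum of NC; equal NS values mark equal keys.
    ns = list(accumulate(nc))

    # Runs of more than one position with the same NS value are the
    # unsorted suffix segments [a, b].
    segments, a = [], 0
    for p in range(1, n + 1):
        if p == n or ns[p] != ns[a]:
            if p - a > 1:
                segments.append((a, p - 1))
            a = p
    return nc, ns, segments


# "banana$" after one round: SA_2 and the order-1 ranks R_1.
sa_2 = [6, 5, 1, 3, 0, 2, 4]
rank_1 = [4, 1, 5, 1, 5, 1, 0]
nc, ns, segments = regroup(sa_2, rank_1, h=1)
```

Both the comparison and the prefix sum are data-parallel operations, which is why this formulation suits a GPU.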
In a second optional implementation scenario, the sorting unit 43 is specifically configured to divide the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to a preset length T.
For example, the length of each suffix segment of the set UG_h may be compared with the preset length T; suffix segments whose length is less than or equal to T are taken as S-type suffix segments, and suffix segments whose length is greater than T are taken as L-type suffix segments.
The S-type suffix segments and the L-type suffix segments are sorted separately to obtain the 2h-order suffix array SA_2h.
For example, the first histogram H of the S-type suffix segment is computed;
a prefix-sum operation is performed on the first histogram H to obtain the prefix-sum result array M of the S-type suffix segment.
In addition, at the same time, the L-type suffix segment may be partitioned into slices according to the array length handled by each thread block, and the slices assigned to the thread blocks;
the histogram of each slice of the L-type suffix segment is obtained, and a prefix-sum operation is performed over the histograms obtained by all thread blocks to obtain the prefix-sum result array M_g of the L-type suffix segment; the 2h-order suffix array SA_2h is obtained according to the prefix-sum result array M and the global prefix-sum result array M_g.
As can be seen from the above embodiment, the apparatus of this embodiment avoids repeatedly sorting suffixes whose final rank has already been found, and accelerates suffix array construction during suffix data processing.
The apparatus of this embodiment can be used to carry out the technical solutions of the method embodiments shown in FIG. 2A and FIG. 3A; its implementation principle and technical effect are similar and are not repeated here.
FIG. 6 is a schematic structural diagram of an apparatus for constructing a suffix array according to another embodiment of the present invention. As shown in FIG. 6, the apparatus of this embodiment includes a bus 61, and a processor 62, a memory 63, and an interface 64 connected to the bus 61. The memory 63 is configured to store instructions, and the processor 62 is configured to execute the instructions to: obtain, from the suffix array SA_0 and the first rank array R_0 of a character string, the h-order suffix array SA_h of the suffix array SA_0 and the second rank array R_h, where h is a variable with an initial value of 1; obtain, from the h-order suffix array SA_h, the set UG_h of unsorted suffixes in the h-order suffix array SA_h;
sort all suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h; and obtain another set UG_2h of unsorted suffixes from the second rank array R_h and the 2h-order suffix array SA_2h; if the set UG_2h is empty, the sorted suffix array SA is obtained.
The processor 62, executing the instructions, is further configured to obtain a third rank array R_2h and another set UG_2h of unsorted suffixes from the second rank array R_h and the 2h-order suffix array SA_2h;
if the set UG_2h is not empty, update the value of the variable h and sort the set UG_2h of unsorted suffixes according to the third rank array R_2h, in order to obtain the N-th set of unsorted suffixes, until the N-th set of unsorted suffixes last obtained is empty, yielding the sorted suffix array SA; N is a natural number greater than 2.
For example, the processor 62 executes the instructions to update the value of the variable h to 2h. In this embodiment, the 2h-order suffix array SA_2h occupies the same space as the suffix array SA_0.
In an optional implementation scenario, the processor 62, executing the instructions, is further configured to initialize the input character string to obtain the suffix array SA_0 of the string;
and adjust the starting character position of each suffix S_i in the suffix array SA_0 to obtain the rank array R_0 of the string.
Optionally, the processor 62, executing the instructions, is specifically configured to:
when h is the initial value, sort every suffix S_i in the suffix array SA_0 using the first character of each suffix S_i as the comparison key, to obtain the suffix array SA_1 and the second rank array (that is, the h-order suffix array SA_h and the second rank array R_h).
Optionally, the processor 62, executing the instructions, is specifically configured to:
determine that more than one consecutive suffix in the h-order suffix array SA_h has the same first character;
form one or more unsorted suffix segments [a, b] from the consecutive suffixes in the h-order suffix array SA_h that have the same first character, where a is the position of the first suffix in the unsorted suffix segment and b is the position of the last suffix in the unsorted suffix segment;
the one or more unsorted suffix segments [a, b] form the set UG_h of unsorted suffixes. Optionally, the processor 62, executing the instructions, is specifically configured to:
sort the suffixes in the set UG_h using, for any suffix S_i in the set UG_h, the h-rank array element R_h[i] as the primary comparison key and the h-rank array element R_h[i+h] of the suffix S_{i+h} as the secondary comparison key, to obtain the 2h-order suffix array SA_2h.
Optionally, the processor 62, executing the instructions, is specifically configured to:
compare the suffixes in the 2h-order suffix array SA_2h according to an adjacent-element comparison rule to obtain a first auxiliary array NC_2h;
perform a prefix sum on the first auxiliary array NC_2h to obtain a second auxiliary array NS_2h; if more than one consecutive suffix has the same value in the second auxiliary array NS_2h, those consecutive suffixes form one unsorted suffix segment;
the one or more unsorted suffix segments form the set UG_2h of unsorted suffixes.
In a specific application, the processor 62 executing the instructions to sort all suffixes in the set UG_h and obtain the 2h-order suffix array SA_2h includes:
dividing the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to a preset length T;
for example, comparing the length of each suffix segment of the set UG_h with the preset length T, taking suffix segments whose length is less than or equal to T as S-type suffix segments, and taking suffix segments whose length is greater than T as L-type suffix segments;
sorting the S-type suffix segments with one thread block and sorting the L-type suffix segments with two or more thread blocks, to obtain the 2h-order suffix array SA_2h.
For example, the first histogram H of the S-type suffix segment is computed;
a prefix-sum operation is performed on the first histogram H to obtain the prefix-sum result array M of the S-type suffix segment.
Concurrently with the processing of the S-type suffix segments, the L-type suffix segment may be partitioned into slices according to the array length handled by each thread block, and the slices assigned to the thread blocks; the histogram of each slice of the L-type suffix segment is obtained, and a prefix-sum operation is performed over the histograms obtained by all thread blocks to obtain the prefix-sum result array M_g of the L-type suffix segment; the 2h-order suffix array SA_2h is obtained according to the prefix-sum result array M and the global prefix-sum result array M_g.
Therefore, in the apparatus for constructing a suffix array according to this embodiment of the present invention, the memory stores instructions and the processor executes them, which avoids repeatedly sorting suffixes whose final rank has already been found and accelerates suffix array construction.
The apparatus of this embodiment can be used to carry out the technical solutions of the method embodiments shown in FIG. 2A and FIG. 3A; its implementation principle and technical effect are similar and are not repeated here.
A person of ordinary skill in the art will understand that all or some of the steps of the method embodiments above may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments above. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for constructing a suffix array, characterized by comprising:
obtaining, from the suffix array SA_0 and the first rank array R_0 of a character string, the h-order suffix array SA_h of the suffix array SA_0 and a second rank array R_h, where h is a variable with an initial value of 1; obtaining, from the h-order suffix array SA_h, the set UG_h of unsorted suffixes in the h-order suffix array SA_h;
sorting all suffixes in the set UG_h to obtain a 2h-order suffix array SA_2h; and obtaining another set UG_2h of unsorted suffixes from the second rank array R_h and the 2h-order suffix array SA_2h; if the set UG_2h is empty, the sorted suffix array SA is obtained.
2. The method according to claim 1, characterized in that obtaining another set UG_2h of unsorted suffixes from the second rank array R_h and the 2h-order suffix array SA_2h comprises:
obtaining a third rank array R_2h and another set UG_2h of unsorted suffixes from the second rank array R_h and the 2h-order suffix array SA_2h;
and the method further comprises:
if the set UG_2h is not empty, updating the value of the variable h and sorting the set UG_2h of unsorted suffixes according to the third rank array R_2h, in order to obtain the N-th set of unsorted suffixes, until the N-th set of unsorted suffixes last obtained is empty, yielding the sorted suffix array SA; N is a natural number greater than 2.
3. The method according to claim 2, characterized in that updating the value of the variable h comprises:
updating the value of the variable h to 2h.
4. The method according to any one of claims 1 to 3, characterized in that, before the step of obtaining the h-order suffix array SA_h of the suffix array SA_0 from the suffix array SA_0 and the first rank array R_0 of the character string, the method further comprises:
initializing the input character string to obtain the suffix array SA_0 of the string; and
adjusting the starting character position of each suffix S_i in the suffix array SA_0 to obtain the rank array R_0 of the string.
5. The method according to any one of claims 1 to 4, characterized in that obtaining the h-order suffix array SA_h of the suffix array SA_0 and the second rank array R_h from the suffix array SA_0 and the rank array R_0 of the character string comprises:
when h is the initial value, sorting every suffix S_i in the suffix array SA_0 using the first character of each suffix S_i as the comparison key, to obtain the h-order suffix array SA_h and the second rank array R_h.
6. The method according to claim 5, wherein the obtaining the set UG_h of unsorted suffixes in the h-order suffix array SA_h according to the h-order suffix array SA_h comprises:
determining that more than one consecutive suffixes in the h-order suffix array SA_h have the same first character;
forming one or more unsorted suffix segments [a, b] from the consecutive suffixes in the h-order suffix array SA_h that have the same first character, where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment;
the one or more unsorted suffix segments [a, b] forming the set UG_h of unsorted suffixes.
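As a hedged illustration of claims 5 and 6, the following Python sketch orders the suffixes by first character and then collects the segments [a, b] of consecutive suffixes that share a first character. The function name `first_char_segments` is hypothetical, and positions are assumed to be zero-based indices into the suffix array.

```python
def first_char_segments(text):
    # h = 1 ordering: suffix start positions sorted by first character only.
    sa = sorted(range(len(text)), key=lambda i: text[i])
    segments, a = [], 0
    for pos in range(1, len(sa) + 1):
        # Close the current run when the first character changes or sa ends.
        if pos == len(sa) or text[sa[pos]] != text[sa[a]]:
            if pos - a > 1:          # keep only runs of two or more suffixes
                segments.append((a, pos - 1))
            a = pos
    return sa, segments
```

For the string "banana" this yields the array `[1, 3, 5, 0, 2, 4]` and the segments `[(0, 2), (4, 5)]`: the three suffixes starting with "a" and the two starting with "n" remain unsorted, while the single suffix starting with "b" is already in its final position.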
7. The method according to any one of claims 1 to 6, wherein the sorting all suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h comprises:
sorting the suffixes in the set UG_h using the h-rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and the h-rank array element R_h[i+h] of the suffix S_(i+h) as the secondary comparison key, to obtain the 2h-order suffix array SA_2h.
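The two-key comparison of claim 7 can be sketched, purely for illustration, as an in-place sort of one unsorted segment. The helper name `sort_segment` is hypothetical, and the secondary key -1 for a suffix that has no character at offset h is an assumption of this sketch rather than something stated in the claims.

```python
def sort_segment(sa, rank, h, a, b):
    """Sort sa[a..b] in place by (R_h[i], R_h[i + h])."""
    n = len(rank)

    def key(i):
        # Assumed convention: a suffix ending before offset h sorts first.
        return (rank[i], rank[i + h] if i + h < n else -1)

    sa[a:b + 1] = sorted(sa[a:b + 1], key=key)
```

Because the sorted result is written back into the same slice, the sketch also reflects the idea that the 2h-order array can reuse the storage of the array it is derived from.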
8. The method according to any one of claims 1 to 7, wherein the obtaining another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h comprises:
comparing the suffixes in the 2h-order suffix array SA_2h according to an adjacent-element comparison rule to obtain a first auxiliary array NC_2h;
performing a prefix sum on the first auxiliary array NC_2h to obtain a second auxiliary array NS_2h; if more than one consecutive suffixes in the second auxiliary array NS_2h have the same value, the more than one consecutive suffixes form one unsorted suffix segment;
forming the set UG_2h of unsorted suffixes from the one or more unsorted suffix segments.
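The auxiliary arrays of claim 8 can be sketched as follows, for illustration only. The claims do not spell out the adjacent-element comparison rule, so this sketch assumes one: NC_2h holds 1 where a suffix's key pair (R_h[i], R_h[i+h]) differs from that of its predecessor in SA_2h, and 0 otherwise. The function name `regroup` and the -1 key for out-of-range positions are likewise assumptions.

```python
from itertools import accumulate


def regroup(sa, rank, h):
    n = len(sa)

    def key(i):
        return (rank[i], rank[i + h] if i + h < n else -1)

    # NC: 1 marks a suffix that differs from its left neighbour (assumed rule).
    nc = [0] * n
    for pos in range(1, n):
        nc[pos] = 1 if key(sa[pos]) != key(sa[pos - 1]) else 0
    # NS: prefix sum of NC; equal neighbouring values mean "still tied".
    ns = list(accumulate(nc))
    segments, a = [], 0
    for pos in range(1, n + 1):
        if pos == n or ns[pos] != ns[a]:
            if pos - a > 1:
                segments.append((a, pos - 1))
            a = pos
    return nc, ns, segments
```

Both the neighbour comparison and the prefix sum are data-parallel primitives, which is presumably why the claims phrase the regrouping this way; the sketch above runs them sequentially.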
9. The method according to any one of claims 1 to 8, wherein the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA_0.
10. The method according to any one of claims 1 to 9, wherein the sorting all suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h comprises:
dividing the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to a preset length T;
sorting the S-type suffix segments using one thread block each, and sorting the L-type suffix segments using two or more thread blocks, to obtain the 2h-order suffix array SA_2h.
11. The method according to claim 10, wherein the dividing the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to the preset length T comprises:
comparing the length of each suffix segment of the set UG_h with the preset length T, taking a suffix segment whose length is less than or equal to the preset length T as an S-type suffix segment, and taking a suffix segment whose length is greater than the preset length T as an L-type suffix segment.
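The partition of claims 10 and 11 can be illustrated with the short Python sketch below. The function name `split_by_length` is hypothetical, segments are assumed to be inclusive (a, b) position pairs, and the sketch covers only the classification step; dispatching S-type segments to one thread block and L-type segments to two or more thread blocks is not modelled here.

```python
def split_by_length(segments, t):
    """Partition inclusive segments (a, b) by length against threshold t."""
    s_type, l_type = [], []
    for a, b in segments:
        length = b - a + 1
        # Length <= T: S-type (short); length > T: L-type (long).
        (s_type if length <= t else l_type).append((a, b))
    return s_type, l_type
```

With T = 3, the segments (0, 2) and (4, 5) are classified as S-type and the segment (10, 19) as L-type.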
12. An apparatus for constructing a suffix array, comprising:
a first obtaining unit, configured to obtain, according to a suffix array SA_0 and a first rank array R_0 of a string, an h-order suffix array SA_h of the suffix array SA_0 and a second rank array R_h, where h is a variable whose initial value is 1;
a second obtaining unit, configured to obtain, according to the h-order suffix array SA_h, a set UG_h of unsorted suffixes in the h-order suffix array SA_h;
a sorting unit, configured to sort all suffixes in the set UG_h to obtain a 2h-order suffix array SA_2h;
a third obtaining unit, configured to obtain another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h; if the set UG_2h is an empty set, the sorted suffix array SA is obtained.
13. The apparatus according to claim 12, wherein the third obtaining unit is further configured to:
obtain a third rank array R_2h according to the second rank array R_h and the 2h-order suffix array SA_2h;
when the set UG_2h is not an empty set, the apparatus further comprises a variable updating unit; the variable updating unit is configured to update the value of the variable h;
the sorting unit is further configured to sort the set UG_2h of unsorted suffixes using the third rank array R_2h obtained by the third obtaining unit and the value of the variable h updated by the variable updating unit, so as to obtain an N-th set of unsorted suffixes, until the N-th set of unsorted suffixes last obtained by the third obtaining unit is an empty set, whereby the sorted suffix array SA is obtained; N is a natural number greater than 2.
14. The apparatus according to claim 13, wherein the variable updating unit is specifically configured to:
update the value of the variable h to 2h.
15. The apparatus according to any one of claims 12 to 14, wherein the apparatus further comprises a fourth obtaining unit, configured to: before the first obtaining unit obtains the h-order suffix array SA_h, initialize the input string to obtain the suffix array SA_0 of the string;
and adjust the starting character position of each suffix S_i in the suffix array SA_0 to obtain the rank array R_0 of the string.
16. The apparatus according to any one of claims 12 to 15, wherein the first obtaining unit is specifically configured to:
when h is the initial value, sort each suffix S_i in the suffix array SA_0 using the first character of each suffix S_i in the suffix array SA_0 as the comparison key, to obtain the h-order suffix array SA_h and the second rank array R_h.
17. The apparatus according to claim 16, wherein the second obtaining unit is specifically configured to:
determine that more than one consecutive suffixes in the h-order suffix array SA_h have the same first character;
form one or more unsorted suffix segments [a, b] from the consecutive suffixes in the h-order suffix array SA_h that have the same first character, where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment;
the one or more unsorted suffix segments [a, b] forming the set UG_h of unsorted suffixes.
18. The apparatus according to any one of claims 12 to 17, wherein the sorting unit is specifically configured to:
sort the suffixes in the set UG_h using the h-rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and the h-rank array element R_h[i+h] of the suffix S_(i+h) as the secondary comparison key, to obtain the 2h-order suffix array SA_2h.
19. The apparatus according to any one of claims 12 to 18, wherein the third obtaining unit is specifically configured to:
compare the suffixes in the 2h-order suffix array SA_2h according to an adjacent-element comparison rule to obtain a first auxiliary array NC_2h;
perform a prefix sum on the first auxiliary array NC_2h to obtain a second auxiliary array NS_2h; if more than one consecutive suffixes in the second auxiliary array NS_2h have the same value, the more than one consecutive suffixes form one unsorted suffix segment;
form the set UG_2h of unsorted suffixes from the one or more unsorted suffix segments.
20. The apparatus according to any one of claims 12 to 19, wherein the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA_0.
21. The apparatus according to any one of claims 12 to 20, wherein the sorting unit is specifically configured to:
divide the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to a preset length T;
sort the S-type suffix segments and the L-type suffix segments separately to obtain the 2h-order suffix array SA_2h.
22. The apparatus according to claim 21, wherein the sorting unit is specifically configured to:
compare the length of each suffix segment of the set UG_h with the preset length T, take a suffix segment whose length is less than or equal to the preset length T as an S-type suffix segment, and take a suffix segment whose length is greater than the preset length T as an L-type suffix segment.
23. An apparatus for constructing a suffix array, comprising:
a processor and a memory; the memory is configured to store instructions;
the processor executes the instructions stored in the memory, so as to:
obtain, according to a suffix array SA_0 and a first rank array R_0 of a string, an h-order suffix array SA_h of the suffix array SA_0 and a second rank array R_h, where h is a variable whose initial value is 1;
obtain, according to the h-order suffix array SA_h, a set UG_h of unsorted suffixes in the h-order suffix array SA_h;
sort all suffixes in the set UG_h to obtain a 2h-order suffix array SA_2h;
obtain another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h; if the set UG_2h is an empty set, the sorted suffix array SA is obtained.
24. The apparatus according to claim 23, wherein the processor is further configured to:
obtain a third rank array R_2h and the other set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h;
if the set UG_2h is not an empty set, update the value of the variable h, and sort the set UG_2h of unsorted suffixes according to the third rank array R_2h so as to obtain an N-th set of unsorted suffixes, until the last-obtained N-th set of unsorted suffixes is an empty set, whereby the sorted suffix array SA is obtained; N is a natural number greater than 2.
25. The apparatus according to claim 24, wherein the processor is specifically configured to:
update the value of the variable h to 2h.
26. The apparatus according to any one of claims 23 to 25, wherein the processor is further configured to:
initialize the input string to obtain the suffix array SA_0 of the string;
adjust the starting character position of each suffix S_i in the suffix array SA_0 to obtain the rank array R_0 of the string.
27. The apparatus according to any one of claims 23 to 26, wherein the processor is specifically configured to:
when h is the initial value, sort each suffix S_i in the suffix array SA_0 using the first character of each suffix S_i in the suffix array SA_0 as the comparison key, to obtain the h-order suffix array SA_h and the second rank array R_h.
28. The apparatus according to claim 27, wherein the processor is specifically configured to:
determine that more than one consecutive suffixes in the h-order suffix array SA_h have the same first character;
form one or more unsorted suffix segments [a, b] from the consecutive suffixes in the h-order suffix array SA_h that have the same first character, where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment;
the one or more unsorted suffix segments [a, b] forming the set UG_h of unsorted suffixes.
29. The apparatus according to any one of claims 23 to 28, wherein the processor is specifically configured to:
sort the suffixes in the set UG_h using the h-rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and the h-rank array element R_h[i+h] of the suffix S_(i+h) as the secondary comparison key, to obtain the 2h-order suffix array SA_2h.
30. The apparatus according to any one of claims 23 to 29, wherein the processor is specifically configured to:
compare the suffixes in the 2h-order suffix array SA_2h according to an adjacent-element comparison rule to obtain a first auxiliary array NC_2h;
perform a prefix sum on the first auxiliary array NC_2h to obtain a second auxiliary array NS_2h; if more than one consecutive suffixes in the second auxiliary array NS_2h have the same value, the more than one consecutive suffixes form one unsorted suffix segment;
form the set UG_2h of unsorted suffixes from the one or more unsorted suffix segments.
31. The apparatus according to any one of claims 23 to 30, wherein the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA_0.
32. The apparatus according to any one of claims 23 to 31, wherein the processor is specifically configured to:
divide the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to a preset length T;
sort the S-type suffix segments using one thread block each, and sort the L-type suffix segments using two or more thread blocks, to obtain the 2h-order suffix array SA_2h.
33. The apparatus according to claim 32, wherein the processor is specifically configured to:
compare the length of each suffix segment of the set UG_h with the preset length T, take a suffix segment whose length is less than or equal to the preset length T as an S-type suffix segment, and take a suffix segment whose length is greater than the preset length T as an L-type suffix segment.
PCT/CN2014/074276 2014-03-28 2014-03-28 Method and apparatus for constructing suffix array WO2015143708A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480000232.5A CN105264522A (en) 2014-03-28 2014-03-28 Method and apparatus for constructing suffix array
PCT/CN2014/074276 WO2015143708A1 (en) 2014-03-28 2014-03-28 Method and apparatus for constructing suffix array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/074276 WO2015143708A1 (en) 2014-03-28 2014-03-28 Method and apparatus for constructing suffix array

Publications (1)

Publication Number Publication Date
WO2015143708A1 true WO2015143708A1 (en) 2015-10-01

Family

ID=54193942

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/074276 WO2015143708A1 (en) 2014-03-28 2014-03-28 Method and apparatus for constructing suffix array

Country Status (2)

Country Link
CN (1) CN105264522A (en)
WO (1) WO2015143708A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804204A (en) * 2018-04-17 2018-11-13 佛山市顺德区中山大学研究院 Multi-threaded parallel constructs the method and system of Suffix array clustering
CN112765938B (en) * 2021-01-13 2024-02-09 中山大学 Method for constructing suffix array, terminal equipment and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003228571A (en) * 2001-11-28 2003-08-15 Kyoji Umemura Method of counting appearance frequency of character string, and device for using the method
CN102073740A (en) * 2011-01-27 2011-05-25 农革 String suffix array construction method on basis of radix sorting
CN102081673A (en) * 2011-01-27 2011-06-01 农革 Suffix array construction method
CN102334119A (en) * 2009-02-26 2012-01-25 国立大学法人丰桥技术科学大学 Speech search device and speech search method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105491573A (en) * 2015-12-31 2016-04-13 上海物联网有限公司 Cognitive radio interference prediction method and system
CN105491573B (en) * 2015-12-31 2021-06-22 上海物联网有限公司 Cognitive radio interference prediction method and system

Also Published As

Publication number Publication date
CN105264522A (en) 2016-01-20

Similar Documents

Publication Publication Date Title
US7801903B2 (en) Shared-memory multiprocessor system and method for processing information
US9619204B2 (en) Method and system for bin coalescing for parallel divide-and-conquer sorting algorithms
Oh et al. Fast and robust parallel SGD matrix factorization
US9953071B2 (en) Distributed storage of data
US20160098481A1 (en) Parallel data sorting
Schlag et al. Scalable edge partitioning
US10593080B2 (en) Graph generating method and apparatus
CN110837584B (en) Method and system for constructing suffix array in block parallel manner
Ferraro Petrillo et al. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics
TW202029064A (en) Multipath neural network, method to allocate resources and multipath neural network analyzer
US9137336B1 (en) Data compression techniques
CN114175640B (en) Vectorized hash table
CN103995827B (en) High-performance sort method in MapReduce Computational frames
US20210365300A9 (en) Systems and methods for dynamic partitioning in distributed environments
WO2015143708A1 (en) Method and apparatus for constructing suffix array
US9135984B2 (en) Apparatuses and methods for writing masked data to a buffer
Dünner et al. Efficient use of limited-memory accelerators for linear learning on heterogeneous systems
JP2017204161A (en) Clustering device, clustering method, and clustering program
CN106778812B (en) Clustering implementation method and device
Satish et al. Mapreduce based parallel suffix tree construction for human genome
US20110055492A1 (en) Multiple processing core data sorting
US20130173647A1 (en) String matching device based on multi-core processor and string matching method thereof
Al-Absi et al. Long read alignment with parallel MapReduce cloud platform
Song et al. Nslpa: A node similarity based label propagation algorithm for real-time community detection
Li et al. Optimizing machine learning on apache spark in HPC environments

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201480000232.5

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14887693

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14887693

Country of ref document: EP

Kind code of ref document: A1