WO2015143708A1

WO2015143708A1 - Method and apparatus for constructing suffix array

Info

Publication number: WO2015143708A1
Application number: PCT/CN2014/074276
Authority: WO
Inventors: 朱俊华; 白戈; 罗琼
Original assignee: 华为技术有限公司
Priority date: 2014-03-28
Filing date: 2014-03-28
Publication date: 2015-10-01
Also published as: CN105264522A

Abstract

A method and apparatus for constructing a suffix array. The method comprises: obtaining an h-order suffix array SA_h of a suffix array SA₀ and a second rank array R_h according to the suffix array SA₀ of a character string and a first rank array R₀, h being a variable with an initial value of 1; obtaining a set UG_h of unsorted suffixes in the h-order suffix array SA_h according to the h-order suffix array SA_h; sorting all suffixes in the set UG_h, so as to obtain a 2h-order suffix array SA_2h; obtaining another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h; and when the set UG_2h is an empty set, obtaining a sorted suffix array SA. The method accelerates construction of the suffix array during suffix data processing.

Description

Method and device for constructing suffix array

Technical field

The present invention relates to communication technologies, and in particular, to a method and a device for constructing a suffix array of a character string.

Background

A suffix array is an array of all the suffixes of a character string, sorted in lexicographic order. It is widely used in fields such as string matching, sequence analysis, and text compression.

Currently, the prefix doubling algorithm is a commonly used suffix array construction algorithm. It works as follows: all suffixes are first sorted with their first character as the comparison key, which yields the 1-order suffix array and its corresponding rank array. Then, starting from h=1, each iteration derives the 2h-order suffix array from the result of the previous iteration, until every suffix has a unique rank. The core idea is to take the h-order rank of each suffix S_i that has already been obtained (denoted R_h[i]) as the primary key and R_h[i+h] as the secondary key, and to derive from them the 2h-order suffix array and the corresponding rank array.
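The conventional prefix doubling scheme described above can be sketched as follows. This is a minimal sequential Python sketch under stated assumptions, not the GPU implementation referred to in this document: the function name prefix_doubling is hypothetical, ranks are dense integers starting at 0, and a suffix shorter than h characters takes -1 as its secondary key.

```python
def prefix_doubling(s):
    """Classic prefix doubling: every suffix is re-sorted in every round."""
    n = len(s)
    if n == 0:
        return [], []
    # 1-order suffix array: sort suffix start positions by first character.
    sa = sorted(range(n), key=lambda i: s[i])
    rank = [0] * n
    for pos in range(1, n):
        # Suffixes with the same first character share a rank.
        rank[sa[pos]] = rank[sa[pos - 1]] + (s[sa[pos]] != s[sa[pos - 1]])
    h = 1
    while rank[sa[-1]] < n - 1:  # stop once every suffix has a unique rank
        # Primary key R_h[i], secondary key R_h[i+h] (-1 past the end).
        key = lambda i: (rank[i], rank[i + h] if i + h < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for pos in range(1, n):
            new_rank[sa[pos]] = new_rank[sa[pos - 1]] + (key(sa[pos]) != key(sa[pos - 1]))
        rank = new_rank
        h *= 2
    return sa, rank
```

Note that `sa.sort` touches all n suffixes in every round, including those whose rank is already unique; this is the repeated work that the method of this document aims to avoid.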

The prefix doubling algorithm based on a graphics processing unit (GPU) takes full advantage of GPU radix sort, which makes the sorting process highly parallel and improves the performance of the algorithm.

However, in each iteration the prefix doubling algorithm does not distinguish between suffixes whose final rank has already been determined and suffixes whose final rank has not, so suffixes whose final rank is already known are sorted again.

Summary of the invention

To address this defect in the prior art, the present invention provides a method and a device for constructing a suffix array, which avoid re-sorting suffixes whose final rank has already been determined and thereby speed up construction of the suffix array.

In a first aspect, an embodiment of the present invention provides a method for constructing a suffix array, including:

obtaining, according to a suffix array SA₀ of a character string and a first rank array R₀, an h-order suffix array SA_h of the suffix array SA₀ and a second rank array R_h, where h is a variable with an initial value of 1;

obtaining, according to the h-order suffix array SA_h, a set UG_h of unsorted suffixes in the h-order suffix array SA_h;

sorting all suffixes in the set UG_h to obtain a 2h-order suffix array SA_2h;

obtaining another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h; and

if the set UG_2h is an empty set, obtaining the sorted suffix array SA.

With reference to the first aspect, in a first possible implementation manner,

the obtaining another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h includes:

obtaining a third rank array R_2h and the other set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h;

and the method further includes:

if the set UG_2h is not an empty set, updating the value of the variable h, sorting the set UG_2h of unsorted suffixes according to the third rank array R_2h to obtain an N-th set of unsorted suffixes, and repeating until the last obtained N-th set of unsorted suffixes is an empty set, whereupon the sorted suffix array SA is obtained; N is a natural number greater than 2.

With reference to the first aspect and the first possible implementation manner, in a second possible implementation manner, the updating the value of the variable h includes:

updating the value of the variable h to 2h.

With reference to the first aspect and the foregoing possible implementation manners, in a third possible implementation manner, before the obtaining, according to the suffix array SA₀ of the character string and the first rank array R₀, the h-order suffix array SA_h of the suffix array SA₀, the method further includes:

initializing the input character string to obtain the suffix array SA₀ of the character string; and

adjusting the starting character position of each suffix S_i in the suffix array SA₀ to obtain the first rank array R₀ of the character string.

With reference to the first aspect and the foregoing possible implementation manners of the first aspect, in a fourth possible implementation manner, the obtaining, according to the suffix array SA₀ of the character string and the first rank array R₀, the h-order suffix array SA_h of the suffix array SA₀ and the second rank array R_h includes:

when h is the initial value, sorting each suffix S_i in the suffix array SA₀ with the first character of the suffix as the comparison key, to obtain the h-order suffix array SA_h and the second rank array R_h.

With reference to the first aspect and the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner,

the obtaining, according to the h-order suffix array SA_h, the set UG_h of unsorted suffixes in the h-order suffix array SA_h includes:

determining that more than one consecutive suffix in the h-order suffix array SA_h has the same first character;

forming one or more unsorted suffix segments [a, b] from the consecutive suffixes in the h-order suffix array SA_h that have the same first character, where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment; and

forming the set UG_h of unsorted suffixes from the one or more unsorted suffix segments [a, b].

With reference to the first aspect and the foregoing possible implementation manners of the first aspect, in a sixth possible implementation manner, the sorting all suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h includes:

sorting the suffixes in the set UG_h with the h-order rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and the h-order rank array element R_h[i+h] of the suffix S_{i+h} as the secondary comparison key, to obtain the 2h-order suffix array SA_2h.

With reference to the first aspect and the foregoing possible implementation manners of the first aspect, in a seventh possible implementation manner, the obtaining another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h includes:

comparing the suffixes in the 2h-order suffix array SA_2h according to a neighboring element comparison rule, to obtain a first auxiliary array NC_2h;

performing a prefix sum on the first auxiliary array NC_2h to obtain a second auxiliary array NS_2h, where, if more than one consecutive suffix has the same value in the second auxiliary array NS_2h, those consecutive suffixes form an unsorted suffix segment; and

combining the one or more unsorted suffix segments into the set UG_2h of unsorted suffixes.
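The neighbor comparison followed by a prefix sum can be illustrated with the following Python sketch. The exact neighboring element comparison rule is not spelled out in this passage, so the sketch assumes one plausible rule: NC_2h holds 1 at a position whose key pair (R_h[i], R_h[i+h]) differs from that of the preceding suffix in SA_2h, and 0 otherwise. The function name unsorted_segments and the use of -1 for a missing secondary key are likewise assumptions.

```python
from itertools import accumulate

def unsorted_segments(sa_2h, rank_h, h):
    """Return (NC, NS, segments) for a 2h-order suffix array.

    Assumed rule: NC[p] = 1 if the suffix at position p has a different
    (R_h[i], R_h[i+h]) pair than its left neighbor, else 0.
    """
    n = len(sa_2h)

    def key(i):
        return (rank_h[i], rank_h[i + h] if i + h < n else -1)

    nc = [0] * n
    for p in range(1, n):
        nc[p] = int(key(sa_2h[p]) != key(sa_2h[p - 1]))
    # Prefix sum: equal neighbors end up with the same NS value.
    ns = list(accumulate(nc))
    # Runs of more than one equal NS value are unsorted segments [a, b].
    segments, start = [], 0
    for p in range(1, n + 1):
        if p == n or ns[p] != ns[start]:
            if p - start > 1:
                segments.append((start, p - 1))
            start = p
    return nc, ns, segments
```

For the string googol$ sorted on two characters (h=1), only the suffixes googol$ and gol$ still tie, so the sketch reports the single segment [1, 2]. On a GPU both the comparison and the prefix sum are data-parallel primitives, which is presumably why the segments are derived this way.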

With reference to the first aspect and the foregoing possible implementation manners of the first aspect, in an eighth possible implementation manner, the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA₀.

With reference to the first aspect and the foregoing possible implementation manners of the first aspect, in a ninth possible implementation manner, the sorting all suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h includes:

dividing the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to a preset length T; and

sorting each S-type suffix segment with one thread block and each L-type suffix segment with two or more thread blocks, to obtain the 2h-order suffix array SA_2h.

With reference to the ninth possible implementation manner of the first aspect, in a tenth possible implementation manner, the dividing the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to the preset length T includes: comparing the length of each suffix segment of the set UG_h with the preset length T, taking a suffix segment whose length is less than or equal to the preset length T as an S-type suffix segment, and taking a suffix segment whose length is greater than the preset length T as an L-type suffix segment.
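The division by the preset length T amounts to a simple partition of the segment list. The following Python sketch assumes that each segment is a pair (a, b) of inclusive positions, so its length is b - a + 1; the function name split_segments is hypothetical.

```python
def split_segments(segments, t):
    """Partition unsorted segments [a, b] by length against threshold t."""
    s_type = [(a, b) for a, b in segments if b - a + 1 <= t]  # short: one thread block each
    l_type = [(a, b) for a, b in segments if b - a + 1 > t]   # long: two or more thread blocks
    return s_type, l_type
```

For example, with T = 4 the segments (1, 2) and (3, 10) have lengths 2 and 8, so the first is S-type and the second is L-type.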

In a second aspect, an embodiment of the present invention provides a device for constructing a suffix array, including:

a first obtaining unit, configured to obtain, according to a suffix array SA₀ of a character string and a first rank array R₀, an h-order suffix array SA_h of the suffix array SA₀ and a second rank array R_h, where h is a variable with an initial value of 1;

a second obtaining unit, configured to obtain, according to the h-order suffix array SA_h, a set UG_h of unsorted suffixes in the h-order suffix array SA_h;

a sorting unit, configured to sort all suffixes in the set UG_h to obtain a 2h-order suffix array SA_2h; and

a third obtaining unit, configured to obtain another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h, and, if the set UG_2h is an empty set, obtain the sorted suffix array SA.

With reference to the second aspect, in a first possible implementation manner,

the third obtaining unit is further configured to

obtain a third rank array R_2h according to the second rank array R_h and the 2h-order suffix array SA_2h;

when the set UG_2h is not an empty set, the device further includes a variable updating unit, configured to update the value of the variable h;

the sorting unit is further configured to sort the set UG_2h of unsorted suffixes by using the third rank array R_2h obtained by the third obtaining unit and the value of the variable h updated by the variable updating unit, so as to obtain an N-th set of unsorted suffixes,

until the N-th set of unsorted suffixes finally obtained by the third obtaining unit is an empty set, whereupon the sorted suffix array SA is obtained; N is a natural number greater than 2.

With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner,

the variable updating unit is specifically configured to update the value of the variable h to 2h.

With reference to the second aspect or the first or second possible implementation manner of the second aspect, in a third possible implementation manner, the device further includes:

a fourth obtaining unit, configured to: before the first obtaining unit obtains the h-order suffix array SA_h, initialize the input character string to obtain the suffix array SA₀ of the character string.

With reference to the second aspect or the foregoing possible implementation manners of the second aspect, in a fourth possible implementation manner, the first obtaining unit is specifically configured to:

when h is the initial value, sort each suffix in the suffix array SA₀ with the first character of the suffix as the comparison key, to obtain the h-order suffix array SA_h and the second rank array R_h.

With reference to the fourth possible implementation manner of the second aspect, in a fifth possible implementation manner,

the second obtaining unit is specifically configured to:

form one or more unsorted suffix segments [a, b] from the consecutive suffixes in the h-order suffix array SA_h that have the same first character, where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment; and

form the set UG_h of unsorted suffixes from the one or more unsorted suffix segments [a, b].

With reference to the second aspect or the foregoing possible implementation manners of the second aspect, in a sixth possible implementation manner, the sorting unit is specifically configured to sort the suffixes in the set UG_h with the h-order rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and the h-order rank array element R_h[i+h] of the suffix S_{i+h} as the secondary comparison key, to obtain the 2h-order suffix array SA_2h.

With reference to the second aspect or the foregoing possible implementation manners of the second aspect, in a seventh possible implementation manner, the third obtaining unit is specifically configured to

With reference to the second aspect or the foregoing possible implementation manners of the second aspect, in an eighth possible implementation manner, the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA₀.

With reference to the second aspect or the foregoing possible implementation manners of the second aspect, in a ninth possible implementation manner, the sorting unit is specifically configured to

sort the S-type suffix segments and the L-type suffix segments separately, to obtain the 2h-order suffix array SA_2h.

With reference to the ninth possible implementation manner of the second aspect, in a tenth possible implementation manner, the sorting unit is specifically configured to

compare the length of each suffix segment of the set UG_h with the preset length T, take a suffix segment whose length is less than or equal to the preset length T as an S-type suffix segment, and take a suffix segment whose length is greater than the preset length T as an L-type suffix segment.

In a third aspect, an embodiment of the present invention provides a device for constructing a suffix array, including a processor and a memory, where the memory is configured to store instructions;

the processor executes the instructions stored in the memory for:

obtaining, according to a suffix array SA₀ of a character string and a first rank array R₀, an h-order suffix array SA_h of the suffix array SA₀ and a second rank array R_h, where h is a variable with an initial value of 1; and obtaining, according to the h-order suffix array SA_h, a set UG_h of unsorted suffixes in the h-order suffix array SA_h;

With reference to the third aspect, in a first possible implementation manner, the processor is further configured to obtain a third rank array R_2h and another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h;

if the set UG_2h is not an empty set, the value of the variable h is updated, and the set UG_2h of unsorted suffixes is sorted according to the third rank array R_2h to obtain an N-th set of unsorted suffixes, until the last obtained N-th set of unsorted suffixes is an empty set, whereupon the sorted suffix array SA is obtained; N is a natural number greater than 2.

With reference to the first possible implementation manner of the third aspect, in a second possible implementation manner, the processor is specifically configured to update the value of the variable h to 2h.

With reference to the third aspect or the foregoing possible implementation manners of the third aspect, in a third possible implementation manner, the processor is further configured to

initialize the input character string to obtain the suffix array SA₀ of the character string.

With reference to the third aspect or the foregoing possible implementation manners of the third aspect, in a fourth possible implementation manner, the processor is specifically configured to:

when h is the initial value, sort each suffix S_i in the suffix array SA₀ with the first character of the suffix as the comparison key, to obtain the h-order suffix array SA_h and the second rank array R_h.

With reference to the fourth possible implementation manner of the third aspect, in a fifth possible implementation manner, the processor is specifically configured to:

determine that more than one consecutive suffix in the h-order suffix array SA_h has the same first character;

form one or more unsorted suffix segments [a, b] from the consecutive suffixes in the h-order suffix array SA_h that have the same first character, where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment; and

form the set UG_h of unsorted suffixes from the one or more unsorted suffix segments [a, b].

With reference to the third aspect or the foregoing possible implementation manners of the third aspect, in a sixth possible implementation manner, the

With reference to the third aspect or the foregoing possible implementation manners of the third aspect, in a seventh possible implementation manner, the processor is specifically configured to

With reference to the third aspect or the foregoing possible implementation manners of the third aspect, in an eighth possible implementation manner, the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA₀.

With reference to the third aspect or the foregoing possible implementation manners of the third aspect, in a ninth possible implementation manner, the processor is specifically configured to

With reference to the ninth possible implementation manner of the third aspect, in a tenth possible implementation manner,

the processor is specifically configured to

compare the length of each suffix segment of the set UG_h with the preset length T, take a suffix segment whose length is less than or equal to the preset length T as an S-type suffix segment, and take a suffix segment whose length is greater than the preset length T as an L-type suffix segment.

According to the foregoing technical solutions, in the method and device for constructing a suffix array of the embodiments of the present invention, the h-order suffix array SA_h of the suffix array SA₀ of the character string and the second rank array R_h are obtained from the suffix array SA₀; the set UG_h of unsorted suffixes in the h-order suffix array SA_h is then obtained; all suffixes in the set UG_h are sorted to obtain the 2h-order suffix array SA_2h; another set UG_2h of unsorted suffixes is obtained according to the second rank array R_h and the 2h-order suffix array SA_2h; and when the set UG_2h is an empty set, the sorted suffix array SA is obtained. This solves the prior-art problem of re-sorting suffixes whose final rank has already been determined and speeds up construction of the suffix array.

Drawings

FIG. 1 is a system architecture diagram of a method for constructing a suffix array according to an embodiment of the present invention;

FIG. 2A is a schematic flowchart of a method for constructing a suffix array according to an embodiment of the present invention;

FIG. 2B is a schematic diagram of a suffix string table according to an embodiment of the present invention;

FIG. 3A is a schematic flowchart of a method for constructing a suffix array according to another embodiment of the present invention;

FIG. 3B is a schematic diagram of scheduling an S-type suffix segment and an L-type suffix segment in a GPU according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a device for constructing a suffix array according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a device for constructing a suffix array according to another embodiment of the present invention;

Schematic diagram of the construction of the suffix array.

Detailed description

As GPU computing power continues to increase, using GPUs as coprocessors alongside central processing units (CPUs) has become an important means of improving application performance. GPUs offer large-scale thread concurrency and high memory bandwidth, which can greatly alleviate the computing speed bottleneck of compute-intensive and data-intensive applications. The embodiments of the present invention propose a method for constructing a suffix array based on a GPU (or a similar processor with high data concurrency).

FIG. 1 is a system architecture diagram of a method for constructing a suffix array according to an embodiment of the present invention. As shown in FIG. 1, the system of this embodiment may include a CPU, a GPU, and a host memory, where the CPU is connected to the host memory and to the GPU through a data bus, and the host memory is connected to the GPU through a data bus.

In this embodiment, execution on the GPU is scheduled by the CPU. The CPU selects the appropriate processing method according to the characteristics of the service. For example, a simple service is processed by the CPU itself, whereas a service that processes data with high concurrency can be handed to the GPU.

The GPU cannot directly access data stored in the host memory; the data must first be copied from the host memory to the global memory of the GPU through the data bus. The GPU contains multiple thread blocks, each with the same number of threads and its own shared memory. Each thread has private computing resources (such as registers and local memory). The GPU global memory can be accessed by all threads of all thread blocks. The shared memory of a thread block can be accessed only by the threads of that thread block, and the registers and local memory of a thread can be accessed only by that thread.

Before processing, the data is stored on a disk device (HDD). Once the processing program starts, the data is first read from the disk into the host memory and is then dispatched by the CPU either to itself or to the GPU for processing.

For ease of description, some symbols and terms used in the embodiments of the present invention are explained below.

Character set: a character set ∑ is a set with a total order, that is, for any two different elements a and b in ∑, either a<b or a>b. The elements of a character set are called characters. The character set contains a special character '$', which appears only at the end of a string and is the smallest element in the character set.

String: a string of length n is an array S[0, n-1] of n characters, whose last element is always the terminator '$'.

Substring: the substring K[i, j] of the string S (i<j) is the substring consisting of the characters of S from position i to position j (inclusive), that is, K[i, j]=S[i]S[i+1]...S[j].

Suffix: the suffix S_i of the string S is the substring of S from the character at position i to the terminator '$', that is, S_i=S[i]S[i+1]...$.

Suffix array: the suffix array SA is the array formed by sorting all suffixes of the string S in lexicographic order. Each suffix it contains is represented by its starting position, that is, i represents the suffix S_i.

SA[j]=i indicates that the suffix S_i is the j-th smallest of all suffixes.

Rank array: the rank array R is also called the inverse suffix array ISA. It holds the rank of each suffix: ISA[i]=R[i]=j indicates that the suffix S_i is the j-th smallest of all suffixes. Therefore, R, SA and ISA have the following relationship: R=ISA=SA⁻¹.
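The inverse relationship between the suffix array and the rank array can be checked with a short Python sketch. The function name inverse_suffix_array is hypothetical, and positions and ranks are assumed to be numbered from 0.

```python
def inverse_suffix_array(sa):
    """Return R such that R[sa[j]] == j, i.e. the inverse permutation of SA."""
    rank = [0] * len(sa)
    for j, i in enumerate(sa):
        rank[i] = j  # suffix S_i is the j-th smallest suffix
    return rank

text = "googol$"
# Naive suffix array: sort start positions by the suffix they denote.
sa = sorted(range(len(text)), key=lambda i: text[i:])
rank = inverse_suffix_array(sa)
```

Here `sa` is [6, 3, 0, 5, 2, 4, 1] and `rank` is [2, 6, 4, 1, 5, 3, 0]: the suffix $ (position 6) is the smallest, and the suffix oogol$ (position 1) is the largest.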

h-order suffix array: the h-order suffix array SA_h is the array obtained by sorting all suffixes of a string S with their first h characters as the comparison key.

In addition, the unsorted suffix segments described below are composed of unsorted suffixes. Unsorted suffixes are grouped into unsorted suffix segments according to whether their first h characters are the same.

FIG. 2A is a flow chart showing a method for constructing a suffix array according to an embodiment of the present invention. As shown in FIG. 2A, the construction method of the suffix array of the embodiment of the present invention is as follows.

201. Obtain, according to the suffix array SA₀ of the character string and the first rank array R₀, the h-order suffix array SA_h of the suffix array SA₀ and the second rank array R_h, where h is a variable with an initial value of 1.

In this embodiment, when h is the initial value, the first character of each suffix S_i in the suffix array SA₀ may be used as the comparison key, and each suffix in the suffix array SA₀ is sorted to obtain the h-order suffix array SA_1 and the second rank array R_1. Specifically, if SA_1[i]=j, then R_1[j]=i.

202. Obtain, according to the h-order suffix array SA_h, the set UG_h of unsorted suffixes in the h-order suffix array SA_h.

Of course, when h is the initial value, the set of unsorted suffixes is UG_1.

203. Sort all the suffixes in the set UG _h to obtain a 2h-order suffix array SA _2h .

Optionally, a third rank array R_2h is also obtained according to the second rank array R_h and the 2h-order suffix array SA_2h.

For example, the suffixes in the set UG_h are sorted with the h-order rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and the h-order rank array element R_h[i+h] of the suffix S_{i+h} as the secondary comparison key, to obtain the 2h-order suffix array SA_2h.

The h-order rank array elements R_h[i] and R_h[i+h] above are elements of the second rank array R_h.

204. Obtain another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h.

205. Determine whether the set UG_2h is an empty set.

206. If the set UG _2h is an empty set, obtain a sorted suffix array SA.

Optionally, if the set UG_2h is not an empty set, the value of the variable h is updated, and the set UG_2h of unsorted suffixes is sorted repeatedly until the last obtained set of unsorted suffixes is an empty set, whereupon the sorted suffix array SA is obtained.

That is, the set UG_2h of unsorted suffixes is sorted according to the third rank array R_2h to obtain an N-th set of unsorted suffixes, until the last obtained N-th set of unsorted suffixes is an empty set, and the sorted suffix array SA is obtained; N is a natural number greater than 2.

For example, the updating the value of the variable h may specifically be: updating the value of the variable h to 2h.

That is, it is determined whether the new set UG_2h of unsorted suffixes is empty. If it is not empty, the value of h is updated, that is, h=2×h, and the method returns to step 203; that is, all suffixes in the set UG_2h are sorted to obtain a 4h-order suffix array SA_4h.

Correspondingly, in step 204', a set UG_4h of unsorted suffixes is obtained according to the third rank array R_2h and the 4h-order suffix array SA_4h.

It is then determined whether the new set UG_4h of unsorted suffixes is empty. If it is not empty, the above process is repeated. If UG_4h is empty, the final suffix array is obtained, denoted SA.

During the iterations, the suffix array and the rank array can reuse the space of the initial suffix array and rank array, which saves storage overhead. In addition, the set UG of unsorted suffixes may also be stored in an array; this embodiment does not limit its representation in the computing system.

That is, in the above embodiment, the space occupied by the 2h-order suffix array SA_2h may be the space occupied by the suffix array SA₀, and the space occupied by the rank array R_2h may be the space occupied by the rank array R₀. The construction method of the suffix array therefore saves space.

In the method for constructing a suffix array of this embodiment, the h-order suffix array SA_h of the suffix array SA₀ of the character string and the second rank array R_h are obtained from the suffix array SA₀; the set UG_h of unsorted suffixes in the h-order suffix array SA_h is then obtained; all suffixes in the set UG_h are sorted to obtain the 2h-order suffix array SA_2h; another set UG_2h of unsorted suffixes is obtained according to the second rank array R_h and the 2h-order suffix array SA_2h; and when the set UG_2h is an empty set, the sorted suffix array SA is obtained. This accelerates construction of the suffix array during suffix data processing and solves the prior-art problem of re-sorting suffixes whose final rank has already been determined.
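The flow of steps 201 to 206 can be sketched in Python as follows. This is a minimal sequential sketch under stated assumptions, not the GPU implementation of the embodiment: the names build_suffix_array and refine are hypothetical, each rank is stored as the first position of its tie group (the rank encoding is not fixed in this passage), a suffix shorter than h characters takes -1 as its secondary key, and the arrays start from the identity state SA_0[i] = R_0[i] = i.

```python
def build_suffix_array(s):
    """Prefix doubling that re-sorts only the still-unsorted segments."""
    n = len(s)
    sa = list(range(n))    # assumed initial state: SA_0[i] = i
    rank = list(range(n))  # assumed initial state: R_0[i] = i

    def refine(a, b, key, out_rank):
        """Rank positions a..b of sa by key; return the tie runs as segments."""
        segments, start = [], a
        for p in range(a, b + 1):
            if p > a and key(sa[p]) != key(sa[p - 1]):
                if p - start > 1:
                    segments.append((start, p - 1))
                start = p
            out_rank[sa[p]] = start  # rank = first position of the tie group
        if b + 1 - start > 1:
            segments.append((start, b))
        return segments

    # Steps 201-202: sort by first character and collect UG_1.
    sa.sort(key=lambda i: s[i])
    unsorted = refine(0, n - 1, lambda i: s[i], rank)

    h = 1
    while unsorted:  # steps 203-206: loop until the unsorted set is empty
        new_rank = rank[:]  # ranks of already sorted suffixes are kept
        # Inside a segment R_h[i] is identical, so R_h[i+h] decides the order.
        key = lambda i: rank[i + h] if i + h < n else -1
        next_unsorted = []
        for a, b in unsorted:
            sa[a:b + 1] = sorted(sa[a:b + 1], key=key)
            next_unsorted += refine(a, b, key, new_rank)
        rank, unsorted = new_rank, next_unsorted
        h *= 2  # update h to 2h
    return sa, rank
```

Calling `build_suffix_array("googol$")` sorts the two segments [1, 2] and [4, 6] in the first round and only the segment [1, 2] in the second round, after which the unsorted set is empty. Suffixes outside the listed segments are never touched again, and the same `sa` list is updated in place throughout, in line with the space reuse described above.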

In an optional implementation scenario, before step 201, the method may further include the following step 200, which is not shown in the figure:

200. Initialize the input string S to obtain the suffix array SA₀ of the string, and adjust the starting character position of each suffix S_i in the suffix array SA₀ to obtain the first rank array R₀ of the string.

For example, the starting character position of the suffix S_i in the suffix array SA₀ is placed at the i-th position of R₀, that is, SA₀[i]=R₀[i]=i.

Optionally, the foregoing step 202 may include the following sub-steps, which are not shown in the figure:

A2021. Determine that more than one consecutive suffix in the h-order suffix array SA_h has the same first character.

A2022. Form one or more unsorted suffix segments [a, b] from the consecutive suffixes in the h-order suffix array SA_h that have the same first character, where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment.

A2023. Form the set UG_h of unsorted suffixes from the one or more unsorted suffix segments [a, b].
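Sub-steps A2021 to A2023 can be sketched for the initial case h=1 as a single scan over the 1-order suffix array. In this Python sketch the function name first_char_segments is hypothetical, and positions a and b are assumed to be 0-based and inclusive.

```python
def first_char_segments(s, sa_1):
    """Collect segments [a, b] of consecutive suffixes sharing a first character."""
    segments = []
    a = 0
    for p in range(1, len(sa_1) + 1):
        # A run ends at the array end or where the first character changes.
        if p == len(sa_1) or s[sa_1[p]] != s[sa_1[a]]:
            if p - a > 1:  # only runs of more than one suffix are unsorted
                segments.append((a, p - 1))
            a = p
    return segments
```

For s = "googol$" and the 1-order suffix array [6, 0, 3, 5, 1, 2, 4], the scan yields the segments (1, 2) for the suffixes starting with g and (4, 6) for the suffixes starting with o.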

An example of an unsorted suffix array is given below.

For the input string τ, first extract all its possible suffixes, such as T is _g00g ol$, as shown in the right column of Figure 2B, and then sort the suffixes similar to the prefix multiplication method. That is, the entire sorting process is composed of a plurality of iterative processes, but different from the prior art, in this embodiment, by introducing a sorted suffixed set (Sorted group) and an unsorted suffixed set (Unsorted group), it can be avoided. In the prior art, the problem of repeating the ranking of the suffixes that have been ranked.

Specifically, at the initial time (ie, before the start of the first iteration calculation), all suffixes are placed in the Unsorted group as a suffix segment. In this embodiment, in this embodiment, only the suffix in the Unsorted group can be sorted, thereby avoiding heavy suffixes of the determined order position. Reordering. After all the suffixes are determined in order (that is, all suffixes have been added to the Sorted group), the suffix sorting process is completed. Thus, the suffix array SA and the name order group R can be derived from the sort result.

The suffixes in the Unsorted group are sorted over one or more iterations. In each iteration, only the first k characters of each suffix are used for sorting, and after sorting, the suffixes whose global order is determined are moved from the Unsorted group to the Sorted group. That is, for each suffix in the Unsorted group, if other suffixes are equal to it (that is, their first k characters are the same), the order of these equal suffixes cannot be determined from their first k characters, so they are grouped together to wait for the next iteration;

If no other suffix is equal to it, the order position of the suffix can be determined, so it is moved to the Sorted group. After an iteration, if there are still suffixes whose order has not been determined, the value of k is increased (for example k = 2k, where k corresponds to the variable h shown in Figure 2A above) and the next iteration starts; otherwise the whole sorting process is complete.

Example 1: Processing of the input string T = googol$ (the calculation of SA and ISA is not given here).

Step 1: Extract the suffixes (see the suffixes on the right side of Figure 2B) as one suffix segment ug1 and add it to the Unsorted group:

Unsorted group = {ug1 = {googol$, oogol$, ogol$, gol$, ol$, l$, $}},

Sorted group = { }.

Step 2: Iteratively process the Unsorted group. Suppose k is initially k = 1 and is incremented by 1 after each iteration.

1st iteration: k = 1, so only the first character of each suffix is used for sorting, and the result is

Sorted group = {$, l$}, Unsorted group = {ug1 = {googol$, gol$}, ug2 = {oogol$, ogol$, ol$}}

2nd iteration: k = 2, so the first two characters of each suffix are used for sorting, and the result is

Sorted group = {$, l$, ogol$, ol$, oogol$}, Unsorted group = {ug1 = {googol$, gol$}}

3rd iteration: k = 3, so the first three characters of each suffix are used for sorting, and the result is

Sorted group = {$, l$, gol$, googol$, ogol$, ol$, oogol$}, Unsorted group = { }

At this point all suffixes are sorted, and the sorted suffix array SA is obtained.
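The iterations of Example 1 can be reproduced with a short Python sketch. It is an illustration only: it compares suffix strings directly instead of using rank arrays, and the names `refine` and `trace` are hypothetical. Groups are kept in global order, so the Sorted group is listed here in its final lexicographic order rather than in the order of the sets written above.

```python
from itertools import groupby


def refine(groups, k):
    """Split every group of more than one suffix by the first k characters."""
    refined = []
    for group in groups:
        if len(group) == 1:  # already in the Sorted group, never re-sorted
            refined.append(group)
            continue
        ordered = sorted(group, key=lambda s: s[:k])
        for _, run in groupby(ordered, key=lambda s: s[:k]):
            refined.append(list(run))
    return refined


def trace(text):
    """Return a list of (k, sorted_group, unsorted_group), one per iteration."""
    groups = [[text[i:] for i in range(len(text))]]  # one initial segment
    k = 1
    history = []
    while any(len(g) > 1 for g in groups):
        groups = refine(groups, k)
        sorted_group = [g[0] for g in groups if len(g) == 1]
        unsorted_group = [g for g in groups if len(g) > 1]
        history.append((k, sorted_group, unsorted_group))
        k += 1  # Example 1 increments k by 1; k = 2k is the doubling variant
    return history


history = trace("googol$")
for k, sorted_group, unsorted_group in history:
    print(k, sorted_group, unsorted_group)
```

The loop stops as soon as every group holds a single suffix, which for googol$ happens after the third iteration.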

To speed up suffix array construction on a GPU, this embodiment of the present invention proposes a method that uses the data-parallel processing of the GPU to handle the suffixes in the unsorted suffix set in parallel, thereby improving the processing speed of step 204 in FIG. 2A.

For example, assuming that the current iteration value is h, the foregoing step 204 may include the following sub-steps, which are not shown in the figures:

A2041. Compare the suffixes in the 2h-order suffix array SA_2h according to the neighboring-element comparison rule to obtain the first auxiliary array NC_2h;

Specifically, for each suffix segment [a, b] in the unsorted suffix set, NC_2h[a] is set to a; each other suffix is compared with its left-adjacent suffix using the first 2h characters as the comparison key, and the result is 0 if they are the same and 1 if they differ.

A2042. Perform a prefix sum on the first auxiliary array NC_2h to obtain a second auxiliary array NS_2h; specifically, NS_2h[i] = NC_2h[0] + NC_2h[1] + ... + NC_2h[i].

A2043. If more than one consecutive suffix in the second auxiliary array NS_2h has the same value, these consecutive suffixes form an unsorted suffix segment.

Optionally, according to the second auxiliary array NS_2h, the elements of SA_2h are distributed to the corresponding positions of a new array to obtain the rank array R_2h; specifically, if NS_2h[i] = j, then R_2h[j] = i.

A2044. Combine the more than one unsorted suffix segments into a set UG_2h of unsorted suffixes. Specifically, if more than one consecutive element in NS_2h has the same value, these elements form an unsorted suffix segment. The unsorted suffix set UG_2h is the set of all unsorted suffix segments.
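The neighbor comparison and prefix sum of sub-steps A2041 to A2044 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it processes the whole array at once with NC[0] = 0 instead of the per-segment initial value given in the text, and it compares the first `width` characters directly as the comparison key. The names `neighbor_flags` and `unsorted_segments` are hypothetical.

```python
from itertools import accumulate


def neighbor_flags(sa, text, width):
    """NC[i] is 0 if suffix sa[i] equals its left neighbor on the first
    `width` characters, and 1 otherwise. NC[0] = 0 is a sketch assumption."""
    nc = [0] * len(sa)
    for i in range(1, len(sa)):
        left = text[sa[i - 1]: sa[i - 1] + width]
        cur = text[sa[i]: sa[i] + width]
        nc[i] = 0 if left == cur else 1
    return nc


def unsorted_segments(ns):
    """Return inclusive ranges (a, b) where more than one consecutive
    element of NS has the same value."""
    segments, a = [], 0
    for i in range(1, len(ns) + 1):
        if i == len(ns) or ns[i] != ns[a]:
            if i - a > 1:
                segments.append((a, i - 1))
            a = i
    return segments


text = "googol$"
sa2 = [6, 0, 3, 5, 2, 4, 1]  # suffixes ordered by their first 2 characters
nc = neighbor_flags(sa2, text, 2)
ns = list(accumulate(nc))    # inclusive prefix sum, as in A2042
print(nc, ns, unsorted_segments(ns))
```

On a GPU the comparison and the prefix sum would each be data-parallel operations; the sequential loops here only show what is computed.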

In addition, for each suffix segment [a, b] in the unsorted suffix set UG_h, each suffix SA_h[i] (a ≤ i ≤ b) is sorted using the h-order rank R_h[SA_h[i]+h] of the corresponding suffix at position SA_h[i]+h as the comparison key, so that the unsorted suffix segments are sorted to obtain SA_2h.
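A minimal Python sketch of this per-segment sort is given below. Within one segment all suffixes already share the same primary key, so only the secondary key R_h[SA_h[i]+h] is needed. The function name `sort_segments`, the tuple form of a segment, and the rank -1 used when SA_h[i]+h runs past the end of the string are assumptions of this sketch.

```python
def sort_segments(sa_h, rank_h, segments, h):
    """Return SA_2h: every unsorted segment (a, b) of SA_h is re-sorted
    using R_h[SA_h[i] + h] as the comparison key."""
    n = len(sa_h)
    sa_2h = list(sa_h)

    def key(pos):
        # A suffix with no character at offset h sorts first (assumption).
        return rank_h[pos + h] if pos + h < n else -1

    for a, b in segments:
        sa_2h[a:b + 1] = sorted(sa_h[a:b + 1], key=key)
    return sa_2h


# googol$ with h = 1: SA_1, R_1 (indexed by suffix start position) and UG_1
sa_1 = [6, 0, 3, 5, 1, 2, 4]
rank_1 = [1, 4, 4, 1, 4, 3, 0]
ug_1 = [(1, 2), (4, 6)]
sa_2 = sort_segments(sa_1, rank_1, ug_1, 1)
print(sa_2)
```

Positions outside the listed segments are copied unchanged, which is what avoids re-sorting suffixes that are already in their final place.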

It is worth noting that in the iterative process, auxiliary arrays such as NC, NS, etc. can be reused, saving storage overhead. Therefore, there is no restriction on the space allocation of NC and NS.

To speed up suffix array construction on the GPU, obtaining the 2h-order suffix array SA_2h in the foregoing step 203 can be accelerated by using the multi-threading and high memory bandwidth of the GPU, thereby improving the processing speed of step 203 in FIG. 2A. As shown in FIG. 3A, the method of this embodiment for processing the suffixes of the unsorted suffix set in parallel is as follows.

301. The suffix segments of the set UG_h are divided into S-type suffix segments and L-type suffix segments according to a preset length T.

For example, the length of each suffix segment of the set UG_h is compared with the preset length T; a suffix segment whose length is not greater than T is treated as an S-type suffix segment, and a suffix segment whose length is greater than T is treated as an L-type suffix segment. Specifically, a suffix segment I = [a, b] is an S-type suffix segment if its length |I| is not greater than the preset value T; otherwise, it is an L-type suffix segment.

It should be noted that the length of a suffix segment refers to the number of suffixes it contains.

302. Sort the S-type suffix segments by using one thread block, and sort the L-type suffix segments by using two or more thread blocks, to obtain a 2h-order suffix array SA_2h.

The processing of the S-type suffix segment and the processing of the L-type suffix segment have no order relationship and can be executed in parallel. Because multiple thread blocks in the GPU can be executed in parallel, multiple suffix segments of the S-type suffix segment can be sorted simultaneously. Similarly, multiple suffix segments of the L-type suffix segment can also be sorted simultaneously. Therefore, this method can greatly improve the computational utilization of the GPU, thereby speeding up the construction of the suffix array.

For GPUs, radix sort is currently the most efficient parallel sorting method. Therefore, this embodiment of the present invention takes parallel radix sort as an example to illustrate how to use the GPU to implement single-thread-block parallel sorting and multi-thread-block parallel sorting.

Single-thread-block parallel sorting is simpler to implement because all threads within the thread block can access the shared memory of the thread block. This embodiment of the present invention uses a least-significant-digit radix sort. The number of bits processed in each radix sort iteration is configurable, depending on the computing power and storage space of the GPU. The steps of a single iteration of single-thread-block parallel radix sort are as follows.

For example, the step of sorting the S-type suffix segments by using one thread block in step 302 includes:

S01. Calculate a first histogram H of the S-type suffix segment.

Specifically, for each element's current comparison key, calculate the number of elements for each specific value. The first histogram H is saved in an array, and the key value is used as an array subscript, and the number of elements corresponding to the key value is used as an array element corresponding to the subscript.

S02. Perform a prefix sum operation on the first histogram H to obtain the prefix sum result array M of the S-type suffix segment, that is, M[i] = H[0] + H[1] + ... + H[i].

S03: According to the prefix sum result array M, distribute the suffix to the corresponding position.

Specifically, the starting position of the array elements with key value i is M[i]. If there are several array elements with the same key value, they are placed in the subsequent positions.
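Steps S01 to S03 correspond to one pass of a least-significant-digit radix sort, which can be sketched sequentially in Python as below. The names `radix_pass` and `radix_sort` and the parameters `bits`, `shift` and `key_bits` are assumptions for illustration. Because M is defined above as an inclusive sum, this sketch takes the first slot for key value i to be M[i] - H[i], which is one consistent reading of the starting position; keys are assumed to be non-negative integers smaller than 2**key_bits.

```python
def radix_pass(keys, items, bits, shift):
    """One stable pass: reorder items by a `bits`-wide digit of keys."""
    buckets = 1 << bits
    mask = buckets - 1
    digits = [(k >> shift) & mask for k in keys]

    # S01: histogram H, indexed by digit value
    hist = [0] * buckets
    for d in digits:
        hist[d] += 1

    # S02: inclusive prefix sum M[i] = H[0] + ... + H[i]
    m, total = [], 0
    for count in hist:
        total += count
        m.append(total)

    # S03: scatter; first slot of digit d is M[d] - H[d] (sketch assumption)
    next_slot = [m[d] - hist[d] for d in range(buckets)]
    out_keys = [None] * len(keys)
    out_items = [None] * len(items)
    for k, item, d in zip(keys, items, digits):
        out_keys[next_slot[d]] = k
        out_items[next_slot[d]] = item
        next_slot[d] += 1
    return out_keys, out_items


def radix_sort(keys, items, bits=2, key_bits=8):
    """Sort items by keys, least significant digit first."""
    for shift in range(0, key_bits, bits):
        keys, items = radix_pass(keys, items, bits, shift)
    return keys, items


sorted_keys, sorted_items = radix_sort([4, 1, 3, 4, 0], list("abcde"))
print(sorted_keys, sorted_items)
```

Each pass is stable, so elements with equal keys keep their relative order, which is what lets the digit-by-digit passes combine into a full sort.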

The above procedure is executed by a single piece of parallel processing code (a kernel function). For step S01 and step S02, the characteristics of the GPU can be further used to speed up the calculation. For example, each thread of the thread block first computes a local histogram and keeps it in registers, then copies the local histogram into the shared memory of the thread block, and then a prefix sum operation is performed on all the histograms.

The step of sorting the L-type suffix segments by using two or more thread blocks in step 302 includes:

M01. The L-type suffix segment is fragmented according to the length of the array processed by each thread block, and the fragmented L-type suffix segment is allocated to the corresponding thread block.

For example, suppose the array length is l and each thread block is responsible for an array of length t; then l/t thread blocks are used.

M02: Obtain a histogram of each fragment of the L-type suffix segment, and perform a prefix sum operation on the histograms acquired by all the thread blocks to obtain the global prefix sum result array M_g of the L-type suffix segment.

That is to say, each thread block first calculates a histogram for the fragment it is responsible for, obtains the histogram H_b of that thread block, and copies the result to the global memory of the GPU.

After each thread block completes the histogram calculation of the thread block, all the thread block histograms are subjected to a prefix summation operation together to obtain a global prefix summation result array M _g .

M03: According to the global prefix sum result array M_g, distribute the suffixes to their corresponding positions to obtain the sorted suffix segment.

Specifically, the starting position of the array elements with key value i is M_g[i]; if there are several array elements with the same key value, they are placed in the subsequent positions.

That is, according to the prefix sum result array M and the global prefix sum result array M_g, the 2h-order suffix array SA_2h is obtained.
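Steps M01 to M03 can be simulated sequentially in Python as follows, with each list slice standing in for the fragment handled by one thread block. The name `multiblock_radix_pass` and its parameters are hypothetical, and the layout of the global prefix sum (digit-major, then block order) is an assumption chosen so that the pass stays stable; the original text does not specify the layout.

```python
def multiblock_radix_pass(keys, block_len, bits, shift):
    """One radix pass over keys, split across simulated thread blocks."""
    buckets = 1 << bits
    mask = buckets - 1

    # M01: slice the array into fragments of at most block_len elements
    blocks = [keys[i:i + block_len] for i in range(0, len(keys), block_len)]

    # M02 (part 1): every block computes its own histogram H_b
    hists = []
    for block in blocks:
        h = [0] * buckets
        for k in block:
            h[(k >> shift) & mask] += 1
        hists.append(h)

    # M02 (part 2): global prefix sum over all block histograms,
    # giving the first output slot for each (block, digit) pair
    offsets = [[0] * buckets for _ in blocks]
    running = 0
    for d in range(buckets):
        for b in range(len(blocks)):
            offsets[b][d] = running
            running += hists[b][d]

    # M03: every block scatters its own elements to the global output
    out = [None] * len(keys)
    for b, block in enumerate(blocks):
        for k in block:
            d = (k >> shift) & mask
            out[offsets[b][d]] = k
            offsets[b][d] += 1
    return out


result = multiblock_radix_pass([3, 0, 2, 1, 3, 2, 0, 1], block_len=3, bits=2, shift=0)
print(result)
```

When the array length is not a multiple of the block length, the last fragment is simply shorter, so the number of simulated blocks is the ceiling of l/t.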

The above approach can take full advantage of the GPU's multi-thread block processing.

Step M01 is executed by the CPU, and the rest of the steps are executed by the GPU, and each step is implemented by a separate kernel function. Step M02 can be executed concurrently by multiple thread blocks. Step M03 can only be executed after all thread blocks of step M02 have been executed. Step M03 may require one thread block or multiple thread blocks to execute. Each thread block is responsible for a set of independent key values.

Because the threads of different thread blocks of the GPU must share data through the GPU global memory, the multi-thread-block parallel radix sort is slightly more complicated than the single-thread-block parallel radix sort. In the GPU environment, the above method makes full use of the many concurrent threads and high memory bandwidth of the GPU, and accelerates the suffix array construction process.

In the above embodiment, parallel processing of the suffixes in the Unsorted group is proposed, which can be applied to highly data-parallel environments such as the GPU; see steps S01 to S03 and steps M01 to M03 above. Each of these steps is a separate concurrent process (that is, each step can be processed concurrently by multiple threads).

The parallel processing of the GPU will be described below with reference to FIG.

Corresponding to the foregoing step 301, in order to make full use of the parallel processing capability of the GPU, the suffix segments in the Unsorted group are divided into two classes according to the length of each segment: 1) S class (small group class, that is, S-type suffix segments): S = {ug_i | size(ug_i) ≤ τ}, that is, the number of suffixes included in each suffix segment in the class must be less than or equal to a predetermined threshold τ (the preset length T above); 2) otherwise, L class (large group class, that is, L-type suffix segments).

The processing of the Unsorted group data on the GPU is scheduled by the CPU. Each suffix segment in the S class is sorted by a single thread block (ThreadBlock) on the GPU, and each suffix segment in the L class is sorted by multiple thread blocks (as shown in FIG. 3B, where the suffix segments ug1, ug2 and ug3 belong to the S class, while ug4 and ug5 belong to the L class).

For example, consider the classification of the suffix segments in the Unsorted group.

For Unsorted group = {ug1 = {ississippi$, issippi$, i$}, ug2 = {ssissippi$, sissippi$, ssippi$, sippi$}, ug3 = {ppi$, pi$}} and τ = 3, ug1 and ug3 are assigned to the S class, and ug2 to the L class.
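The classification in this example can be sketched in Python as follows. The dictionary representation of the Unsorted group and the function name `classify_segments` are assumptions for illustration; the threshold argument plays the role of τ.

```python
def classify_segments(unsorted_group, threshold):
    """Split suffix segments into the S class (size <= threshold)
    and the L class (size > threshold)."""
    s_class = {name: seg for name, seg in unsorted_group.items()
               if len(seg) <= threshold}
    l_class = {name: seg for name, seg in unsorted_group.items()
               if len(seg) > threshold}
    return s_class, l_class


unsorted_group = {
    "ug1": ["ississippi$", "issippi$", "i$"],
    "ug2": ["ssissippi$", "sissippi$", "ssippi$", "sippi$"],
    "ug3": ["ppi$", "pi$"],
}
s_class, l_class = classify_segments(unsorted_group, threshold=3)
print(sorted(s_class), sorted(l_class))
```

Each S-class segment would then be handed to a single thread block, while each L-class segment would be split across several thread blocks.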

The ordering of the suffix segments in the S and L classes can be implemented in the following different sub-steps:

Specifically, for the sorting of the S-class suffix segments, each suffix segment is sorted by one kernel function in three steps:

S01. Calculate a histogram H of the suffix segment (only consider k consecutive bits);

S02. Scan the histogram H to obtain the scatter offset of each thread;

S03. Write the sort result of each thread to the shared memory of the thread block, and then transfer the data to the global memory of the GPU.

In addition, the sorting of the L-class suffix segments also consists of three steps, but here each step is handled by its own kernel function:

M01. Each thread block calculates its own histogram;

M02. Scan the histograms of all thread blocks and calculate the scatter offset of each thread block;

M03. As in the sorting of the S-class suffix segments, each thread block processes its own data.

Specifically, the sorted results are aggregated to the global memory of the GPU, and then the remaining steps are sequentially completed by the thread blocks of the GPU, and each step is also executed concurrently.

For example, the adjacent suffixes in each suffix segment can be compared globally; a prefix sum operation is performed on the adjacent-comparison results of each suffix segment; and the results of the previous steps are used to compute the rank array corresponding to the suffix array, the new Unsorted group, and so on.

In the above embodiment, parallel processing of the suffixes in the Unsorted group is proposed, which can be applied to highly data-parallel environments such as the GPU.

FIG. 4 is a schematic structural diagram of a device for constructing a suffix array according to an embodiment of the present invention. As shown in FIG. 4, the apparatus for constructing a suffix array of this embodiment includes: a first obtaining unit 41, a second obtaining unit 42, a sorting unit 43, and a third obtaining unit 44;

The first obtaining unit 41 is configured to obtain an h-order suffix array SA_h of the suffix array SA_0 and a second rank array R_h according to the suffix array SA_0 of the character string and the first rank array R_0, where h is a variable with an initial value of 1;

The second obtaining unit 42 is configured to obtain, according to the h-order suffix array SA_h, the set UG_h of unsorted suffixes in the h-order suffix array SA_h;

The sorting unit 43 is configured to sort all the suffixes in the set UG_h to obtain a 2h-order suffix array SA_2h;

For example, the sorting unit 43 is specifically configured to use the h-rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and the h-rank array element R_h[i+h] of the suffix S_{i+h} as the secondary comparison key, and sort the suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h.

The third obtaining unit 44 is configured to obtain another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h; if the set UG_2h is an empty set, the sorted suffix array SA is obtained.

Optionally, the third obtaining unit is further configured to obtain the third rank array R_2h according to the second rank array R_h and the 2h-order suffix array SA_2h.

If the set UG_2h is not an empty set, the foregoing suffix array constructing apparatus may further include a variable update unit 45, as shown in FIG. 5. The variable update unit 45 is configured to update the value of the variable h; for example, the variable update unit is specifically configured to update the value of the variable h to 2h.

At this time, the sorting unit 43 is further configured to repeatedly sort the set UG_2h of unsorted suffixes using the value updated by the variable update unit, until the set of unsorted suffixes finally obtained by the third obtaining unit is an empty set and the sorted suffix array SA is obtained. That is, the set UG_2h of unsorted suffixes is sorted according to the third rank array R_2h acquired by the third obtaining unit and the value of the variable h updated by the variable update unit, in order to obtain the Nth set of unsorted suffixes, until the Nth set of unsorted suffixes finally obtained by the third obtaining unit is an empty set and the sorted suffix array SA is obtained; N is a natural number greater than 2.

In this embodiment, the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA_0.

In an optional implementation scenario, the foregoing apparatus for constructing a suffix array further includes a fourth obtaining unit 46, not shown in the figure, which is configured to, before the first obtaining unit 41 obtains the h-order suffix array SA_h, initialize the input string to obtain the suffix array SA_0 of the string;

For example, the first obtaining unit 41 may be specifically configured to: when h is the initial value, use the first character of each suffix S_i in the suffix array SA_0 as the comparison key and sort each suffix S_i in the suffix array SA_0 to obtain the h-order suffix array SA_h and the second rank array R_h.

Optionally, the second obtaining unit 42 is specifically configured to: determine that more than one consecutive suffix in the h-order suffix array SA_h has the same first character;

The one or more unsorted suffix segments [a, b] constitute the set UG_h of unsorted suffixes.

Optionally, the third obtaining unit 44 is configured to: compare the suffixes in the 2h-order suffix array SA_2h according to the neighboring-element comparison rule to obtain a first auxiliary array NC_2h; perform a prefix sum on the first auxiliary array NC_2h to obtain a second auxiliary array NS_2h; and, if more than one consecutive suffix in the second auxiliary array NS_2h has the same value, treat these consecutive suffixes as an unsorted suffix segment;

In a second optional implementation scenario, the foregoing sorting unit 43 is specifically configured to: divide the suffix segments of the set UG_h into S-type suffix segments and L-type suffix segments according to a preset length T;

For example, the length of each suffix segment of the set UG_h may be compared with the preset length T; a suffix segment whose length is less than or equal to the preset length T is treated as an S-type suffix segment, and a suffix segment whose length is greater than the preset length T is treated as an L-type suffix segment.

For example, calculating a first histogram H of the S-type suffix segment;

Performing a prefix sum operation on the first histogram H to obtain a prefix sum result array M of the S-type suffix segments.

In addition, the L-type suffix segments may be segmented according to the length of the array processed by each thread block, and the L-type suffix segments after the slice are allocated to each thread block;

Obtaining a histogram of each fragment of the L-type suffix segments, and performing a prefix sum operation on the histograms obtained by all the thread blocks, to obtain the global prefix sum result array M_g of the L-type suffix segments; and obtaining the 2h-order suffix array SA_2h according to the prefix sum result array M and the global prefix sum result array M_g.

It can be seen from the above embodiment that the suffix array construction apparatus of this embodiment avoids repeatedly sorting suffixes whose final rank has already been determined, and accelerates construction of the suffix array during suffix data processing.

The device in this embodiment may be used to implement the technical solution of the method embodiment shown in FIG. 2A and FIG. 3A, and the implementation principle and the technical effect are similar, and details are not described herein again.

FIG. 6 is a schematic structural diagram of a device for constructing a suffix array according to another embodiment of the present invention. As shown in FIG. 6, the apparatus for constructing a suffix array of this embodiment includes: a bus 61, and a processor 62, a memory 63, and an interface 64 connected to the bus 61, where the memory 63 is configured to store an instruction, and the processor 62 is configured to execute the instruction in order to: obtain the h-order suffix array SA_h of the suffix array SA_0 and the second rank array R_h according to the suffix array SA_0 of the string and the first rank array R_0, where h is a variable with an initial value of 1; and obtain, according to the h-order suffix array SA_h, the set UG_h of unsorted suffixes in the h-order suffix array SA_h;

The processor 62 executes the foregoing instruction, and is further configured to acquire, according to the second rank array R_h and the 2h-order suffix array SA_2h, the third rank array R_2h and another set UG_2h of unsorted suffixes;

If the set UG_2h is not an empty set, the value of the variable h is updated, and the set UG_2h of unsorted suffixes is sorted according to the third rank array R_2h in order to obtain the Nth set of unsorted suffixes, until the Nth set of unsorted suffixes finally obtained is an empty set and the sorted suffix array SA is obtained; N is a natural number greater than 2.

For example, the processor 62 executes the above instructions to update the value of the variable h to 2h. In this embodiment, the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA_0.

In an optional implementation scenario, the processor 62 executes the foregoing instruction, and is further configured to initialize the input string to obtain the suffix array SA_0 of the string;

Optionally, the processor 62 executes the foregoing instructions specifically to:

When h is the initial value, use the first character of each suffix S_i in the suffix array SA_0 as the comparison key and sort each suffix S_i in the suffix array SA_0 to obtain the 1-order suffix array SA_1 and the corresponding rank array (that is, the h-order suffix array SA_h and the second rank array R_h).

Optionally, the processor 62 executes the foregoing instructions specifically to:

Determine that more than one consecutive suffix in the h-order suffix array SA_h has the same first character;

The one or more unsorted suffix segments [a, b] constitute the set UG_h of unsorted suffixes. Optionally, the processor 62 executes the foregoing instructions specifically to:

Perform a prefix sum on the first auxiliary array NC_2h to obtain a second auxiliary array NS_2h; if more than one consecutive suffix in the second auxiliary array NS_2h has the same value, these consecutive suffixes form an unsorted suffix segment;

In a specific application process, the processor 62 executes the foregoing instructions to sort all the suffixes in the set UG _h to obtain a 2h-order suffix array SA _2h , including:

For example, the length of each suffix segment of the set UG_h is compared with the preset length T; a suffix segment whose length is less than or equal to the preset length T is treated as an S-type suffix segment, and a suffix segment whose length is greater than the preset length T is treated as an L-type suffix segment.

For example, calculating a first histogram H of the S-type suffix segment;

Simultaneously with the processing of the S-type suffix segments, the L-type suffix segments may be sliced according to the length of the array processed by each thread block, and the sliced L-type suffix segments are assigned to the thread blocks; a histogram of each fragment of the L-type suffix segments is obtained, and a prefix sum operation is performed on the histograms obtained by all the thread blocks to obtain the global prefix sum result array M_g of the L-type suffix segments; the 2h-order suffix array SA_2h is then obtained according to the prefix sum result array M and the global prefix sum result array M_g.

Therefore, the apparatus for constructing a suffix array of this embodiment of the present invention stores the above instructions in the memory and executes them by the processor, thereby avoiding repeated sorting of suffixes whose final rank has already been determined and accelerating construction of the suffix array.

One of ordinary skill in the art will appreciate that all or a portion of the steps of the various method embodiments described above can be accomplished by hardware associated with program instructions. The aforementioned program can be stored in a computer readable storage medium. The program, when executed, performs the steps of the above-described method embodiments; and the foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.

It should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention and are not intended to be limiting. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced, and such modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present invention.

Claims


1. A method of constructing a suffix array, which is characterized by including:

According to the suffix array SA_0 and the first rank array R_0 of the string, obtain the h-order suffix array SA_h of the suffix array SA_0 and the second rank array R_h, where h is a variable with an initial value of 1;

According to the h-order suffix array SA_h, obtain the set UG_h of unsorted suffixes in the h-order suffix array SA_h;

Sort all suffixes in the set UG_h to obtain a 2h-order suffix array SA_2h; obtain another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h; if the set UG_2h is an empty set, the sorted suffix array SA is obtained.

2. The method according to claim 1, characterized in that obtaining another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h includes:

According to the second rank array R_h and the 2h-order suffix array SA_2h, obtain the third rank array R_2h and another set UG_2h of unsorted suffixes;

The method also includes:

If the set UG_2h is not an empty set, update the value of the variable h and sort the set UG_2h of unsorted suffixes according to the third rank array R_2h in order to obtain the Nth set of unsorted suffixes, until the finally obtained Nth set of unsorted suffixes is an empty set and the sorted suffix array SA is obtained; N is a natural number greater than 2.

3. The method according to claim 2, wherein the updating the value of the variable h includes:

Update the value of the variable h to 2h.

4. The method according to any one of claims 1 to 3, characterized in that, before the step of obtaining the h-order suffix array SA_h of the suffix array SA_0 according to the suffix array SA_0 and the first rank array R_0 of the string, the method also includes:

Initialize the input string and get the suffix array SA_0 of the string;

The starting character position of the suffix S_i in the suffix array SA_0 is adjusted to obtain the rank array R_0 of the string.

5. The method according to any one of claims 1 to 4, characterized in that obtaining the h-order suffix array SA_h of the suffix array SA_0 and the second rank array R_h according to the suffix array SA_0 and the rank array R_0 of the string includes:

When h is the initial value, use the first character of each suffix in the suffix array SA_0 as the comparison key, sort each suffix in the suffix array SA_0, and obtain the h-order suffix array SA_h and the second rank array R_h.

6. The method according to claim 5, characterized in that obtaining the set UG_h of unsorted suffixes in the h-order suffix array SA_h according to the h-order suffix array SA_h includes:

Determine that the first characters of more than one consecutive suffix in the h-order suffix array SA_h are the same;

Suffixes with the same first character among more than one consecutive suffix in the h-order suffix array SA_h form more than one unsorted suffix segment [a, b]; a is the position of the starting suffix in the unsorted suffix segment, and b is the position of the ending suffix in the unsorted suffix segment;

The one or more unsorted suffix segments [a, b] form the set UG_h of unsorted suffixes.

7. The method according to any one of claims 1 to 6, characterized in that sorting all suffixes in the set UG_h to obtain a 2h-order suffix array SA_2h includes: using the h-rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and the h-rank array element R_h[i+h] of the suffix S_{i+h} as the secondary comparison key, and sorting the suffixes in the set UG_h to obtain a 2h-order suffix array SA_2h.

8. The method according to any one of claims 1 to 7, characterized in that obtaining another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h includes:

Compare the suffixes in the 2h-order suffix array SA_2h according to the neighboring-element comparison rule to obtain the first auxiliary array NC_2h;

Perform a prefix sum on the first auxiliary array NC_2h to obtain the second auxiliary array NS_2h; if more than one consecutive suffix in the second auxiliary array NS_2h has the same value, these consecutive suffixes form an unsorted suffix segment;

Combine more than one unsorted suffix segment into a set UG_2h of unsorted suffixes.

9. The method according to any one of claims 1 to 8, characterized in that the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA_0.

10. The method according to any one of claims 1 to 9, characterized in that: sorting all suffixes in the set U _h to obtain a 2h-order suffix array SA _2h includes: according to the preset length T Divide the suffix segments of the set UG _h into S-shaped suffix segments and L-shaped suffix segments;

One thread block is used to sort the S-shaped suffix segments, and two or more thread blocks are used to sort the L-shaped suffix segments to obtain a 2h-order suffix array SA _2h .

11. The method according to claim 10, characterized in that dividing the suffix segments of the set UG_h into S-shaped suffix segments and L-shaped suffix segments according to the preset length T includes:

Compare the length of any suffix segment in the set UG_h with the preset length T; a suffix segment whose length is less than or equal to the preset length T is regarded as an S-shaped suffix segment, and a suffix segment whose length is greater than the preset length T is regarded as an L-shaped suffix segment.
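The classification by the preset length T can be illustrated with a short Python sketch. It covers only the division into the two kinds of segments; the assignment of thread blocks described in claim 10 is not modelled. The function name `split_by_length` and the representation of a segment as an inclusive (a, b) pair are assumptions of the sketch.

```python
def split_by_length(segments, t):
    """Split inclusive (a, b) segments into S-shaped (length <= t)
    and L-shaped (length > t) lists."""
    s_segments, l_segments = [], []
    for a, b in segments:
        length = b - a + 1
        if length <= t:
            s_segments.append((a, b))
        else:
            l_segments.append((a, b))
    return s_segments, l_segments


# Hypothetical segments and threshold.
s_part, l_part = split_by_length([(0, 2), (4, 5), (7, 20)], 3)
```

A segment whose length equals T falls on the S-shaped side, matching the "less than or equal to" wording of the claim.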

12. A suffix array construction device, characterized by including:

The first acquisition unit is used to obtain, according to the suffix array SA₀ of a string and the first rank array R₀, the h-order suffix array SA_h of the suffix array SA₀ and the second rank array R_h, where h is a variable with an initial value of 1;

The second acquisition unit is used to obtain the set UG_h of unsorted suffixes in the h-order suffix array SA_h according to the h-order suffix array SA_h;

The sorting unit is used to sort all the suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h;

The third acquisition unit is used to obtain another set UG_2h of unsorted suffixes according to the second rank array R_h and the 2h-order suffix array SA_2h; if the set UG_2h is an empty set, the sorted suffix array SA is obtained.

13. The device according to claim 12, characterized in that the third acquisition unit is also used to

According to the second rank array R_h and the 2h-order suffix array SA_2h, obtain the third rank array R_2h;

When the set UG_2h is not an empty set, the device further includes a variable update unit, which is used to update the value of the variable h;

The sorting unit is further used to sort the set UG_2h of unsorted suffixes according to the third rank array R_2h obtained by the third acquisition unit and the value of the variable h updated by the variable update unit, so as to obtain an Nth set of unsorted suffixes,

Until the Nth set of unsorted suffixes finally acquired by the third acquisition unit is an empty set, whereupon the sorted suffix array SA is obtained; N is a natural number greater than 2.

14. The device according to claim 13, characterized in that the variable update unit is specifically used to

Update the value of the variable h to 2h.
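The interplay of the units in claims 12 to 14 can be pictured as a loop that sorts only the unsorted segments, recomputes ranks, and doubles h until no unsorted segment remains. The following single-threaded Python sketch is one possible reading, not the claimed device: the dense rank numbering, the -1 auxiliary key for positions past the end of the string, and all function names are assumptions.

```python
from itertools import accumulate


def equal_runs(values):
    """Inclusive (a, b) positions of runs of more than one equal value."""
    runs, a = [], 0
    for j in range(1, len(values) + 1):
        if j == len(values) or values[j] != values[a]:
            if j - 1 > a:
                runs.append((a, j - 1))
            a = j
    return runs


def build_suffix_array(text):
    n = len(text)
    # 1-order suffix array and rank array, keyed on the first character.
    sa = sorted(range(n), key=lambda i: text[i])
    ns = list(accumulate(
        int(j > 0 and text[sa[j]] != text[sa[j - 1]]) for j in range(n)))
    rank = [0] * n
    for j in range(n):
        rank[sa[j]] = ns[j]
    segments = equal_runs(ns)      # set of unsorted suffix segments
    h = 1
    while segments:                # stop when the set is empty
        def key(i, rank=rank, h=h):
            # Primary key rank[i], auxiliary key rank[i + h] (or -1).
            return (rank[i], rank[i + h] if i + h < n else -1)

        for a, b in segments:      # sort only the unsorted segments
            sa[a:b + 1] = sorted(sa[a:b + 1], key=key)
        # Adjacent comparison, prefix sum, then the 2h-order ranks.
        nc = [int(j > 0 and key(sa[j]) != key(sa[j - 1])) for j in range(n)]
        ns = list(accumulate(nc))
        new_rank = [0] * n
        for j in range(n):
            new_rank[sa[j]] = ns[j]
        rank = new_rank
        segments = equal_runs(ns)
        h *= 2                     # update the variable h to 2h
    return sa


sa_banana = build_suffix_array("banana")
```

For checking such a sketch, the result can be compared against a naive construction that sorts the suffix strings directly, for example `sorted(range(len(text)), key=lambda i: text[i:])`.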

15. The device according to any one of claims 12 to 14, characterized in that the device further includes a fourth acquisition unit, used to initialize the input string before the first acquisition unit obtains the h-order suffix array SA_h, to obtain the suffix array SA₀ of the string.

16. The device according to any one of claims 12 to 15, characterized in that the first acquisition unit is specifically used to

When h is the initial value, use the first character of each suffix in the suffix array SA₀ as the comparison key, sort each suffix in the suffix array SA₀, and obtain the h-order suffix array SA_h and the second rank array R_h.

17. The device according to claim 16, characterized in that the second acquisition unit is specifically used to

18. The device according to any one of claims 12 to 17, characterized in that the sorting unit is specifically used for

Use the h-rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and the h-rank array element R_h[i+h] of the suffix S_{i+h} as the auxiliary comparison key, and sort the suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h.

19. The device according to any one of claims 12 to 18, characterized in that the third acquisition unit is specifically used to

20. The device according to any one of claims 12 to 19, characterized in that the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA₀.

21. The device according to any one of claims 12 to 20, characterized in that the sorting unit is specifically used to

Divide the suffix segments of the set UG_h into S-shaped suffix segments and L-shaped suffix segments according to the preset length T;

The S-shaped suffix segments and the L-shaped suffix segments are sorted separately to obtain the 2h-order suffix array SA_2h.

22. The device according to claim 21, characterized in that the sorting unit is specifically used for

23. A suffix array construction device, characterized by including:

A processor and a memory; the memory is used to store instructions;

The processor executes instructions stored in the memory for:

According to the suffix array SA₀ of a string and the first rank array R₀, obtain the h-order suffix array SA_h of the suffix array SA₀ and the second rank array R_h, where h is a variable with an initial value of 1;

According to the h-order suffix array SA_h, obtain the set UG_h of unsorted suffixes in the h-order suffix array SA_h;

Sort all suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h;

According to the second rank array R_h and the 2h-order suffix array SA_2h, obtain another set UG_2h of unsorted suffixes; if the set UG_2h is an empty set, the sorted suffix array SA is obtained.

24. The device according to claim 23, characterized in that the processor is also used to

If the set UG_2h is not an empty set, update the value of the variable h and sort the set UG_2h of unsorted suffixes according to the third rank array R_2h, so as to obtain an Nth set of unsorted suffixes, until the finally obtained Nth set of unsorted suffixes is an empty set, whereupon the sorted suffix array SA is obtained; N is a natural number greater than 2.

25. The device according to claim 24, characterized in that the processor is specifically used to

Update the value of the variable h to 2h.

26. The device according to any one of claims 23 to 25, characterized in that the processor is also used to

Initialize the input string to obtain the suffix array SA₀ of the string.

27. The device according to any one of claims 23 to 26, characterized in that the processor is specifically used to

28. The device according to claim 27, characterized in that the processor is specifically used to

In the h-order suffix array SA_h, form each run of more than one consecutive suffix having the same first character into an unsorted suffix segment [a, b], where a is the position of the starting suffix in the unsorted suffix segment and b is the position of the ending suffix in the unsorted suffix segment;

29. The device according to any one of claims 23 to 28, characterized in that the processor is specifically used to

Use the h-rank array element R_h[i] of any suffix S_i in the set UG_h as the primary comparison key and the h-rank array element R_h[i+h] of the suffix S_{i+h} as the auxiliary comparison key, and sort the suffixes in the set UG_h to obtain the 2h-order suffix array SA_2h.

30. The device according to any one of claims 23 to 29, characterized in that the processor is specifically used to

31. The device according to any one of claims 23 to 30, characterized in that the space occupied by the 2h-order suffix array SA_2h is the space occupied by the suffix array SA₀.

32. The device according to any one of claims 23 to 31, characterized in that the processor is specifically used to

33. The device according to claim 32, characterized in that the processor is specifically used to