CN112765938B - Method for constructing suffix array, terminal equipment and computer readable storage medium
- Publication number
- CN112765938B (application number CN202110042806.7A)
- Authority
- CN
- China
- Prior art keywords
- suffix array
- target
- character string
- array construction
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
Abstract
The application relates to the technical field of data processing and provides a method for constructing a suffix array, terminal equipment and a computer readable storage medium. The method for constructing the suffix array comprises the following steps: acquiring information of a target character string and information of current terminal equipment; determining a target suffix array construction mode matched with the information of the target character string and the information of the current terminal equipment; acquiring a sample set corresponding to the target suffix array construction mode; determining, from the sample set, a target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal equipment; and constructing a suffix array of the target character string by adopting the target suffix array construction algorithm. In this way, the best-performing suffix array construction algorithm matching the information of the target character string and the information of the current terminal equipment is selected automatically, so that the construction efficiency of the suffix array is improved.
Description
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a method for constructing a suffix array, terminal equipment and a computer readable storage medium.
Background
A suffix array is a data structure obtained by sorting all the suffixes of a character string; it records the starting position of each suffix in sorted order. It is often applied to full-text indexing, pattern matching and similar tasks on massive data. Efficient construction of suffix arrays is a precondition for full-text indexing or pattern matching using suffix arrays. There are various suffix array construction modes in the prior art, such as serial, parallel, and external memory, and each construction mode can include various suffix array construction algorithms.
However, each suffix array construction algorithm generally shows optimal performance only under specific conditions; under other conditions its performance becomes relatively poor, or the algorithm is not applicable at all. In other words, each suffix array construction algorithm has certain limitations. The prior art cannot automatically select the best-performing suffix array construction algorithm for the usage scenario, so the construction efficiency of the suffix array is low.
Disclosure of Invention
In view of this, the embodiments of the present application provide a method, a terminal device, and a computer readable storage medium for constructing a suffix array, so as to solve the technical problem that in the prior art, a suffix array construction algorithm with optimal performance cannot be automatically selected according to a usage scenario to construct a suffix array, resulting in lower construction efficiency of the suffix array.
In a first aspect, an embodiment of the present application provides a method for constructing a suffix array, including:
acquiring information of a target character string and information of current terminal equipment;
determining a target suffix array construction mode matched with the information of the target character string and the information of the current terminal equipment;
acquiring a sample set corresponding to the target suffix array construction mode; each sample data in the sample set consists of a sample suffix array construction condition corresponding to a sample character string and the suffix array construction algorithm with optimal performance applicable to that sample suffix array construction condition, wherein the sample suffix array construction condition comprises information of the sample character string and information of sample terminal equipment used when constructing a suffix array of the sample character string;
determining a target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal equipment from the sample set;
and constructing a suffix array of the target character string by adopting the target suffix array construction algorithm.
Optionally, the information of the target character string includes a character string size and a repetition degree of the target character string;
Accordingly, the obtaining the information of the target character string includes:
acquiring the character string size of the target character string;
sequentially performing character string segmentation operation and character string sampling operation on the target character string to obtain a plurality of sampling character strings of the target character string;
and calculating a first similarity value between every two sampling character strings by adopting a preset character string similarity algorithm, and determining the repeatability of the target character string based on the first similarity value.
Optionally, the information of the current terminal device includes the number of cores and the memory capacity of the current terminal device;
correspondingly, the determining the construction mode of the target suffix array matched with the information of the target character string and the information of the current terminal equipment comprises the following steps:
if the character string size of the target character string is larger than the memory capacity of the current terminal equipment, determining an external memory suffix array construction mode as the target suffix array construction mode;
if the character string size of the target character string is smaller than or equal to the memory capacity of the current terminal equipment and the number of the cores of the current terminal equipment is equal to 1, determining a serial suffix array construction mode as the target suffix array construction mode;
And if the character string size of the target character string is smaller than or equal to the memory capacity of the current terminal equipment and the number of the cores of the current terminal equipment is larger than 1, determining a parallel suffix array construction mode as the target suffix array construction mode.
Optionally, the determining, from the sample set, a target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal device includes:
taking the information of the target character string and the information of the current terminal equipment as a target suffix array construction condition corresponding to the target character string;
determining a sample suffix array construction condition matching the target suffix array construction condition from the sample set;
and determining the target suffix array construction algorithm based on the suffix array construction algorithm with the optimal performance, which is applicable to the sample suffix array construction condition matched with the target suffix array construction condition.
Optionally, the determining, from the sample set, a sample suffix array construction condition that matches the target suffix array construction condition includes:
calculating a second similarity value between the target suffix array construction condition and each sample suffix array construction condition in the sample set;
And determining the sample suffix array construction condition corresponding to the second similarity value meeting the preset condition as the sample suffix array construction condition matched with the target suffix array construction condition.
Optionally, the determining the target suffix array construction algorithm based on the suffix array construction algorithm with the optimal performance applicable to the sample suffix array construction condition matched with the target suffix array construction condition includes:
and determining, as the target suffix array construction algorithm, the suffix array construction algorithm that accounts for the highest proportion among the suffix array construction algorithms with optimal performance applicable to the sample suffix array construction conditions matched with the target suffix array construction condition.
Optionally, the method for constructing a suffix array further includes:
and forming a new sample data by the target suffix array construction condition and the target suffix array construction algorithm, and adding the new sample data into the sample set.
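By way of illustration only, the matching and selection steps above can be sketched as follows. The similarity metric (mean of min/max ratios over numeric features), the threshold of 0.8 used as the "preset condition", the feature order, and the sample data are all assumptions made for this sketch; the application does not fix any of them at this point.

```python
from collections import Counter

def condition_similarity(a, b):
    """Assumed metric: mean of min/max ratios over non-negative numeric features."""
    ratios = []
    for x, y in zip(a, b):
        hi = max(x, y)
        ratios.append(1.0 if hi == 0 else min(x, y) / hi)
    return sum(ratios) / len(ratios)

def choose_algorithm(target, samples, threshold=0.8):
    """Pick the most frequent best algorithm among sufficiently similar samples.

    samples is a list of (condition, best_algorithm) pairs, where each
    condition is a tuple of numeric features in the same order as target.
    Returns None when no sample meets the (assumed) threshold.
    """
    matched = [alg for cond, alg in samples
               if condition_similarity(target, cond) >= threshold]
    if not matched:
        return None
    return Counter(matched).most_common(1)[0][0]

# Hypothetical features: (string size, repetition degree, cores, memory capacity)
samples = [
    ((100, 0.50, 4, 1000), "pdss"),
    ((110, 0.50, 4, 1000), "pdss"),
    ((100, 0.45, 4, 1000), "psaca-K"),
    ((1, 0.10, 1, 10), "SA-IS"),
]
target = (100, 0.50, 4, 1000)
best = choose_algorithm(target, samples)
samples.append((target, best))  # add the new sample data to the sample set
```

The last line mirrors the optional step of adding the target condition and the chosen algorithm to the sample set as a new sample.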
In a second aspect, an embodiment of the present application provides a terminal device, including:
the first acquisition unit is used for acquiring information of the target character string and information of the current terminal equipment;
the first determining unit is used for determining a target suffix array construction mode matched with the information of the target character string and the information of the current terminal equipment;
the second acquisition unit is used for acquiring a sample set corresponding to the target suffix array construction mode; each sample data in the sample set consists of a sample suffix array construction condition corresponding to a sample character string and the suffix array construction algorithm with optimal performance applicable to that sample suffix array construction condition, wherein the sample suffix array construction condition comprises information of the sample character string and information of sample terminal equipment used when constructing a suffix array of the sample character string;
a second determining unit, configured to determine, from the sample set, a target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal device;
and the suffix array construction unit is used for constructing the suffix array of the target character string by adopting the target suffix array construction algorithm.
In a third aspect, an embodiment of the present application provides a terminal device, where the terminal device includes a processor, a memory, and a computer program stored in the memory and executable on the processor, where the processor implements the method for constructing a suffix array according to the first aspect or any of the alternatives of the first aspect when the computer program is executed.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements a method of constructing a suffix array according to the first aspect or any of the alternatives of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product, which when run on a terminal device, causes the terminal device to perform the method of constructing a suffix array according to the first aspect or any of the alternatives of the first aspect.
The method, the terminal equipment, the computer readable storage medium and the computer program product for constructing the suffix array provided by the embodiment of the application have the following beneficial effects:
According to the method for constructing the suffix array provided by the embodiment of the application, the information of the target character string and the information of the current terminal equipment are acquired; a target suffix array construction mode matched with the information of the target character string and the information of the current terminal equipment is determined; and a sample set corresponding to the target suffix array construction mode is acquired. Each sample data in the sample set consists of a sample suffix array construction condition corresponding to one sample character string and the suffix array construction algorithm with optimal performance applicable to that sample suffix array construction condition, wherein the sample suffix array construction condition comprises information of the sample character string and information of sample terminal equipment used when constructing a suffix array of the sample character string. Therefore, the target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal equipment can be determined from the sample set corresponding to the target suffix array construction mode. That is, the best-performing suffix array construction algorithm matching the information of the target character string and the information of the current terminal equipment is selected automatically to construct the suffix array of the target character string, so that the construction efficiency of the suffix array is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required for the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flowchart of a method for constructing a suffix array according to an embodiment of the present application;
FIG. 2 is a flowchart of a specific implementation of S11 in a method for constructing a suffix array according to an embodiment of the present application;
FIG. 3 is a flowchart of a specific implementation of S14 in a method for constructing a suffix array according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for constructing a suffix array according to another embodiment of the present application;
FIG. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a terminal device according to another embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
It should also be appreciated that references to "one embodiment" or "some embodiments" or the like described in this specification mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in various places throughout this specification are not necessarily all referring to the same embodiment, but mean "one or more, but not all, embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
For ease of understanding, some concepts related to the embodiments of the present application are described below.
Character string: a character string S with a length of n refers to a one-dimensional character array formed by arranging n characters according to a certain rule, and can be expressed as S = S[1]S[2]…S[n]. In a specific application, the string S typically ends with the lexicographically smallest character $, i.e., S[n] is typically the character $, e.g., the string S = banana$.
Substring of character string: can be expressed as S[i, j] = S[i]…S[j], where 1 ≤ i ≤ j ≤ n, i.e. the string S[i, j] consisting of all the characters between S[i] and S[j] (including S[i] and S[j]) is a substring of the string S.
Suffix of character string: can be expressed as S[i, n] = S[i]…S[n], where 1 ≤ i ≤ n, commonly referred to as suffix(i). For example, the suffixes of the string S = banana$ may be as shown in table 1.
TABLE 1
Suffix Array (SA) of character string: a data structure obtained by sorting all suffixes of the character string S in ascending lexicographic order and storing the starting position of each suffix in the character string S in an integer array in that sorted order. For example, sorting all suffixes of the string S = banana$ in ascending lexicographic order gives the suffixes shown in table 2.
TABLE 2
suffix(i) | i |
$ | 7 |
a$ | 6 |
ana$ | 4 |
anana$ | 2 |
banana$ | 1 |
na$ | 5 |
nana$ | 3 |
Then, the suffix array of the string S = banana$ is SA = [7 6 4 2 1 5 3], i.e. SA[1] = 7, SA[2] = 6, SA[3] = 4, SA[4] = 2, SA[5] = 1, SA[6] = 5, SA[7] = 3.
In general, the process of sorting all the suffixes of a string to obtain a suffix array of the string is the process of constructing the suffix array of the string. The prior art provides a variety of suffix array construction modes, including: a serial suffix array construction mode, a parallel suffix array construction mode, an external memory suffix array construction mode and the like. The serial suffix array construction mode means that the terminal equipment adopts a serial mode to construct a suffix array in the internal memory of the terminal equipment; the parallel suffix array construction mode means that the terminal equipment adopts a parallel mode to construct a suffix array in the internal memory of the terminal equipment; the external memory suffix array construction mode means that the terminal equipment constructs a suffix array in the external memory of the terminal equipment.
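As a minimal sketch of the definition above (and not one of the construction algorithms discussed in this application), the suffix array of a short string can be obtained by sorting the 1-based starting positions by the suffix that begins at each position. The function name is hypothetical, and the approach relies on $ comparing smaller than the letters, which holds for Python's code-point ordering.

```python
def suffix_array(s):
    """Return the 1-based suffix array of s by naively sorting its suffixes."""
    # s[i - 1:] is suffix(i) in the 1-based notation used in the text.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

sa = suffix_array("banana$")
print(sa)  # [7, 6, 4, 2, 1, 5, 3]
```

This naive approach materializes every suffix, so it is only suitable for illustrating the data structure on small inputs.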
Each of the suffix array construction methods includes a plurality of different suffix array construction algorithms.
For example, the serial suffix array construction mode includes: the suffix array induced sorting (SA-IS) algorithm, the libdivsufsort algorithm, the suffix array construction algorithm K (saca-K), and the like. Here, the libdivsufsort algorithm refers to a suffix array construction algorithm that is packaged into a library named libdivsufsort.
The parallel suffix array construction mode includes: the parallel suffix array induced sorting (psaak) algorithm, the parallel suffix array construction algorithm K (parallel suffix array scan K, psaca-K), the parallel libdivsufsort algorithm (pdss), and the like. Here, K represents K integers.
The external memory suffix array construction mode includes: the external suffix array induced sorting (esais) algorithm, the n suffix array induced sorting (nsais) algorithm, the parallel suffix array scan (psucca) algorithm, and the like.
Generally, each suffix array construction algorithm shows optimal performance only under specific conditions; under other conditions its performance becomes relatively poor, or the algorithm is not applicable at all. In other words, each suffix array construction algorithm has certain limitations. The embodiment of the application therefore provides a method for constructing a suffix array. According to the target suffix array construction condition corresponding to the target character string whose suffix array is to be constructed, the method automatically selects the best-performing suffix array construction algorithm matching that condition and uses it to construct the suffix array of the target character string, so that the construction efficiency of the suffix array is improved.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for constructing a suffix array according to an embodiment of the disclosure. In the embodiment of the present application, the execution body of the method for constructing the suffix array is a terminal device. In a specific application, the terminal device may be a single-core (i.e. the number of cores is equal to 1) terminal device, or may be a multi-core (i.e. the number of cores is greater than 1) terminal device. By way of example, and not limitation, the terminal device may be a cell phone, tablet computer, or desktop computer, among others.
As shown in fig. 1, the method for constructing the suffix array may include S11 to S15, which are described in detail as follows:
S11: Acquire information of the target character string and information of the current terminal device.
In this embodiment of the present application, the target string may be any string that needs to construct a suffix array. The current terminal device refers to a terminal device currently executing the method of constructing the suffix array.
In one possible implementation, the information of the target string includes, but is not limited to: the character string size of the target character string, the character set size of the target character string, and the repeatability of the target character string. The character string size of the target character string is used for describing the storage capacity occupied by the target character string; the character set size of the target character string is used to describe the number of kinds of characters appearing in the target character string, and typically, the maximum value of the character set size is 255.
The information of the terminal device includes, but is not limited to: the number of cores of the terminal device, the internal memory capacity of the terminal device and the external memory capacity of the terminal device.
In one embodiment of the present application, the information of the terminal device may be stored in advance in a memory local to the terminal device. Based on this, the terminal device can obtain its above information from its local memory.
In one embodiment of the present application, the step of the terminal device obtaining the information of the target string may be specifically implemented by S111 to S114 as shown in fig. 2, which is described in detail as follows:
S111: Acquire the character string size of the target character string.
In the embodiment of the application, the terminal device may call the library function in python to directly obtain the string size of the target string.
S112: Sequentially perform a character string segmentation operation and a character string sampling operation on the target character string to obtain a plurality of sampling character strings of the target character string.
In this embodiment, the terminal device performs a string splitting operation on the target string, and splits the target string into n parts, where each part is a sub-string of the target string. Wherein n is an integer greater than 1.
In one possible implementation manner, the terminal device may divide the target string into n parts by equal division, so that all sub-strings obtained by the division have the same string size. In another possible implementation manner, the terminal device may divide the target string into n parts by unequal division, so that the string sizes of different sub-strings obtained by the division may be the same or different.
After the terminal device obtains n sub-strings of the target character string, the character string sampling operation can be performed on the n sub-strings, that is, m sub-strings are sampled from the n sub-strings, and each of the m sub-strings is a sampling character string of the target character string. Here, m is an integer and 1 ≤ m ≤ n.
In one possible implementation, the terminal device may sample m sub-strings from the n sub-strings in a sampling manner with equal intervals, so that the number of sub-strings spaced between every two adjacent sampling strings obtained by sampling is equal. In another possible implementation manner, the terminal device may sample m sub-strings from n sub-strings in a sampling manner with unequal intervals, so that the number of sub-strings spaced between every two adjacent sampling strings obtained by sampling may be equal or unequal.
For example, if the string size of the target string is 10 GB and the terminal device divides the target string into 100 sub-strings by equal division, the string size of each sub-string is 0.1 GB. If the terminal device then samples 5 sub-strings from the 100 sub-strings at equal intervals as sampling strings of the target string, the 5 sampling strings may be the 20th, 40th, 60th, 80th and 100th sub-strings.
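The segmentation and sampling operations of S112 can be sketched as follows, as one possible reading of equal division and equal-interval sampling. The function names, and the choice to sample the last sub-string of each group of n // m sub-strings, are assumptions for illustration only.

```python
def split_equal(s, n):
    """Split s into n contiguous parts whose lengths differ by at most one."""
    base, extra = divmod(len(s), n)
    parts, start = [], 0
    for k in range(n):
        end = start + base + (1 if k < extra else 0)
        parts.append(s[start:end])
        start = end
    return parts

def sample_equidistant(parts, m):
    """Sample m parts at a fixed interval (assumes 1 <= m <= len(parts))."""
    step = len(parts) // m
    # Take the last part of each group of `step` parts, keeping at most m.
    return parts[step - 1::step][:m]

parts = split_equal("abcdefghij", 5)   # ['ab', 'cd', 'ef', 'gh', 'ij']
samples = sample_equidistant(parts, 2)  # ['cd', 'gh']
```

With 100 parts and m = 5 this scheme selects the parts at 1-based positions 20, 40, 60, 80 and 100.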
S113: traversing the sampling character strings, determining the character set sizes of all the sampling character strings, and determining the character set sizes of all the sampling character strings as the character set sizes of the target character strings.
In the embodiment of the present application, in order to improve the construction efficiency of the suffix array, the terminal device may determine the character set size of the sampling character string of the target character string as the character set size of the target character string.
Specifically, after obtaining the sampling character strings of the target character strings, the terminal equipment traverses the sampling character strings of the target character strings, determines the character set sizes of all the sampling character strings, and determines the character set sizes of all the sampling character strings as the character set sizes of the target character strings.
For example, if the target string s=ahahahahahahahahahahahshsbbb, the sample string of the target string includes three, respectively: aha, ahs, shb since the three sample strings include a number of kinds of character sets of 4, the character set size of the three sample strings is 4, and the terminal device determines that the character set size of the target string is 4.
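A minimal sketch of S113, assuming the sampling strings are ordinary Python strings and using a hypothetical function name, counts the distinct characters that appear across all sampling strings:

```python
def charset_size(sample_strings):
    """Count the distinct characters appearing across all sampling strings."""
    seen = set()
    for sample in sample_strings:
        seen.update(sample)  # adds each character of the sample to the set
    return len(seen)

size = charset_size(["aha", "ahs", "shb"])
print(size)  # 4
```

Because only the sampling strings are scanned, the result is an estimate: a character that occurs only outside the sampled parts of the target string is not counted.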
S114: Calculate a first similarity value between every two sampling character strings by adopting a preset character string similarity algorithm, and determine the repeatability of the target character string based on the first similarity values.
In this embodiment, after obtaining a plurality of sampling strings of a target string, the terminal device may calculate a first similarity value between every two sampling strings by using a preset string similarity algorithm, obtain a plurality of first similarity values, and determine the repetition degree of the target string based on the plurality of first similarity values.
In a specific application, the preset character string similarity algorithm includes, but is not limited to: a cosine similarity algorithm, a matrix similarity algorithm and a string edit distance algorithm.
Taking the string edit distance algorithm as an example of the preset character string similarity algorithm, the algorithm calculates the edit distance between two character strings and determines a first similarity value between the two character strings according to the edit distance. The edit distance is the minimum number of editing operations required to convert one of the two strings into the other, where the editing operations include, but are not limited to, a character replacement operation, a character addition operation, and a character deletion operation. For example, the edit distance between the first character string abc and the second character string abed is 2. The string edit distance algorithm typically uses the formula 1 - d/maxlen to calculate a first similarity value between two strings, where d is the edit distance between the two strings and maxlen is the larger of the lengths of the two strings.
For example, the length of the first string abc is 3 and the length of the second string abed is 4, and the maximum value of the lengths of the two strings is 4, so that the similarity between the first string abc and the second string abed calculated by using the string edit distance algorithm is 1-2/4=0.5.
In one possible implementation manner, after the terminal device calculates the first similarity value between every two sampling strings, the average value of the calculated first similarity values may be determined as the repeatability of the target string.
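As a non-limiting illustration, the edit distance similarity and the averaging step can be sketched in Python as follows. The function names are hypothetical. The edit distance is computed with the standard dynamic programming (Levenshtein) recurrence, and the value returned when fewer than two sampling strings are available is an assumption, since the application does not address that case.

```python
from itertools import combinations


def edit_distance(a, b):
    """Minimum number of replacements, additions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # add cb
                           prev[j - 1] + cost))  # replace ca with cb (or keep)
        prev = cur
    return prev[-1]


def similarity(a, b):
    """First similarity value: 1 - d / maxlen."""
    max_len = max(len(a), len(b))
    if max_len == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1 - edit_distance(a, b) / max_len


def repetition_degree(sampling_strings):
    """Mean of the first similarity values over every pair of sampling strings."""
    pairs = list(combinations(sampling_strings, 2))
    if not pairs:
        return 0.0  # assumption: no pairs means no measurable repetition
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


print(edit_distance("abc", "abed"))  # 2
print(similarity("abc", "abed"))     # 0.5
```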
Note that the terminal device may execute S112 and S113 simultaneously; or the terminal device may perform S112 first and then S113; or the terminal device may perform S113 first and then S112, and the order in which S112 and S113 are performed is not limited here.
S12: determining a target suffix array construction mode matched with the information of the target character string and the information of the current terminal equipment.
In the embodiment of the present application, when the terminal device constructs the suffix array of the target string, the information of the target string and the information of the current terminal device may be used as the construction condition of the target suffix array corresponding to the target string.
It should be noted that different suffix array construction conditions generally call for different suffix array construction modes. In the embodiment of the present application, the suffix array construction mode applicable to each suffix array construction condition is prestored in the terminal device.
As an example and not by way of limitation, when the size of the character string in a suffix array construction condition is larger than the memory capacity of the terminal device, the terminal device cannot complete the construction of the suffix array of the character string in its memory, so the suffix array construction mode applicable to that condition is usually an external memory suffix array construction mode. When the size of the character string is smaller than or equal to the memory capacity of the terminal device and the number of cores of the terminal device is equal to 1, the terminal device can complete the construction in its memory, but as a single-core device it has weak parallel data processing capability, so the applicable suffix array construction mode is usually a serial suffix array construction mode. When the size of the character string is smaller than or equal to the memory capacity of the terminal device and the number of cores of the terminal device is larger than 1, the terminal device can complete the construction in its memory and, as a multi-core device, has strong parallel data processing capability, so the applicable suffix array construction mode is usually a parallel suffix array construction mode.
Based on this, in one embodiment of the present application, S12 may specifically include the following steps:
if the character string size of the target character string is larger than the memory capacity of the current terminal equipment, determining an external memory suffix array construction mode as the target suffix array construction mode;
if the character string size of the target character string is smaller than or equal to the memory capacity of the current terminal equipment and the number of the cores of the current terminal equipment is equal to 1, determining a serial suffix array construction mode as the target suffix array construction mode;
and if the character string size of the target character string is smaller than or equal to the memory capacity of the current terminal equipment and the number of the cores of the current terminal equipment is larger than 1, determining a parallel suffix array construction mode as the target suffix array construction mode.
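As a non-limiting illustration, the three branches of S12 can be sketched in Python as follows. The names `Mode` and `select_mode` are hypothetical, and the sketch assumes that the character string size and the memory capacity are expressed in the same unit (for example, bytes). Rejecting a core count below 1 is also an assumption, since the application only discusses the cases "equal to 1" and "larger than 1".

```python
from enum import Enum


class Mode(Enum):
    EXTERNAL_MEMORY = "external memory"
    SERIAL = "serial"
    PARALLEL = "parallel"


def select_mode(string_size, memory_capacity, core_count):
    """Pick the suffix array construction mode from string size, memory and cores."""
    if core_count < 1:
        raise ValueError("core_count must be at least 1")  # assumption, see lead-in
    if string_size > memory_capacity:
        return Mode.EXTERNAL_MEMORY  # the string does not fit in memory
    if core_count == 1:
        return Mode.SERIAL           # fits in memory, single core
    return Mode.PARALLEL             # fits in memory, multiple cores


# Hypothetical values: a 2 GiB string on a device with 8 GiB of memory and 4 cores.
print(select_mode(2 * 2**30, 8 * 2**30, 4))  # Mode.PARALLEL
```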
S13: acquiring a sample set corresponding to the target suffix array construction mode.
In this embodiment of the present application, each suffix array construction mode is configured with a corresponding sample set.
The sample set comprises a plurality of pieces of sample data, and each piece of sample data consists of a sample suffix array construction condition corresponding to one sample character string and a suffix array construction algorithm with optimal performance, which is suitable for the sample suffix array construction condition. For example, a best performing suffix array construction algorithm may refer to the suffix array construction algorithm that is the shortest time to use in constructing a suffix array under the sample suffix array construction conditions.
In a specific application, the sample suffix array construction condition may include information of a sample character string and information of the sample terminal device used in constructing the suffix array of the sample character string. The information of the sample character string includes, but is not limited to, the character string size, the character set size and the repetition degree of the sample character string. The information of the sample terminal device includes, but is not limited to, the number of cores and the memory capacity of the sample terminal device.
Illustratively, taking the parallel suffix array construction mode as an example, a sample set corresponding to the parallel suffix array construction mode may be as shown in Table 3.
Table 3
In the embodiment of the application, after determining the construction mode of the target suffix array matched with the target character string, the terminal equipment acquires a sample set corresponding to the construction mode of the target suffix array.
S14: determining, from the sample set, a target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal equipment.
In one embodiment of the present application, S14 may be specifically implemented by S141 to S143 shown in fig. 3, which are described in detail as follows:
S141: taking the information of the target character string and the information of the current terminal equipment as a target suffix array construction condition corresponding to the target character string.
S142: determining a sample suffix array construction condition matched with the target suffix array construction condition from the sample set.
In this embodiment, the sample suffix array construction condition matched with the target suffix array construction condition may refer to a sample suffix array construction condition whose second similarity value with the target suffix array construction condition satisfies a preset condition. The preset condition may be set according to actual requirements, and is not limited herein.
As an example and not by way of limitation, the preset condition may be that the second similarity value is greater than a preset similarity value threshold, i.e., the sample suffix array construction condition that matches the target suffix array construction condition refers to the sample suffix array construction condition that has a second similarity value with the target suffix array construction condition greater than the preset similarity value threshold. The preset similarity threshold value may be set according to actual requirements, which is not limited herein.
In one possible implementation, the second similarity value between the target suffix array construction condition and the sample suffix array construction condition may be represented by the Euclidean distance, or the square of the Euclidean distance, between the target suffix array construction condition and the sample suffix array construction condition.
Based on this, S142 may specifically include the following steps:
calculating a second similarity value between the target suffix array construction condition and each sample suffix array construction condition in the sample set;
and determining the sample suffix array construction condition corresponding to the second similarity value meeting the preset condition as the sample suffix array construction condition matched with the target suffix array construction condition.
In this embodiment, when the second similarity value is represented by the Euclidean distance or the square of the Euclidean distance, the terminal device may calculate the Euclidean distance, or its square, between the target suffix array construction condition and each sample suffix array construction condition in the sample set, and use the result as the second similarity value between the target suffix array construction condition and that sample suffix array construction condition.
In a specific application, the sample suffix array construction conditions which are determined by the terminal equipment and matched with the target suffix array construction conditions can be one or a plurality of sample suffix array construction conditions.
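As a non-limiting illustration, the matching in S142 can be sketched in Python as follows. The sketch assumes that each construction condition is encoded as a numeric tuple (for example, string size, character set size, repetition degree, number of cores, memory capacity) and uses the squared Euclidean distance as the second similarity value. The function names, the tuple encoding and the sample values are hypothetical and are not the data of the application's tables. Because the preset condition is left open, it is passed in as a caller-supplied predicate, and no feature scaling is applied because none is specified.

```python
def squared_euclidean(u, v):
    """Square of the Euclidean distance between two equal-length numeric tuples."""
    if len(u) != len(v):
        raise ValueError("conditions must have the same number of features")
    return sum((x - y) ** 2 for x, y in zip(u, v))


def match_conditions(target, samples, meets_preset_condition):
    """Return (condition, algorithm, value) for every sample whose second
    similarity value satisfies the caller-supplied preset condition."""
    matched = []
    for condition, algorithm in samples:
        value = squared_euclidean(target, condition)
        if meets_preset_condition(value):
            matched.append((condition, algorithm, value))
    return matched


# Hypothetical two-feature conditions, each paired with its best-performing algorithm.
samples = [((1.0, 2.0), "algorithm 1"), ((4.0, 6.0), "algorithm 2")]
result = match_conditions((1.0, 2.0), samples, lambda value: value > 1)
print(result)  # [((4.0, 6.0), 'algorithm 2', 25.0)]
```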
Illustratively, suppose the target suffix array construction condition is as shown in Table 4.
Table 4
Suppose further that the square of the Euclidean distance (i.e., the second similarity value) calculated by the terminal device between the target suffix array construction condition and each sample suffix array construction condition in Table 3, together with the suffix array construction algorithm with optimal performance applicable to each sample suffix array construction condition, is as shown in Table 5.
Table 5
If the preset similarity value threshold is 1, the sample suffix array construction conditions whose second similarity values are greater than the preset similarity value threshold are the sample suffix array construction conditions corresponding to sample string 1, sample string 3, and sample string 4.
S143: determining the target suffix array construction algorithm based on the suffix array construction algorithm with optimal performance applicable to the sample suffix array construction condition matched with the target suffix array construction condition.
In one embodiment of the present application, when the sample suffix array construction condition matched with the target suffix array construction condition is one, the terminal device may directly determine, as the target suffix array construction algorithm, a suffix array construction algorithm with optimal performance to which the sample suffix array construction condition is applicable.
In another embodiment of the present application, when a plurality of sample suffix array construction conditions match the target suffix array construction condition, the terminal device may determine, as the target suffix array construction algorithm, the suffix array construction algorithm that accounts for the highest proportion among the suffix array construction algorithms with optimal performance applicable to those sample suffix array construction conditions.
For example, in combination with Table 5, if the sample suffix array construction conditions matched with the target suffix array construction condition are those corresponding to sample string 1, sample string 3, and sample string 4, and the suffix array construction algorithms with optimal performance applicable to these 3 sample suffix array construction conditions are algorithm 2, algorithm 2, and algorithm 3, respectively, then algorithm 2 accounts for the highest proportion, so the terminal device determines algorithm 2 as the target suffix array construction algorithm.
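As a non-limiting illustration, selecting the algorithm that accounts for the highest proportion among the matched samples can be sketched in Python with `collections.Counter`. The function name `pick_algorithm` is hypothetical. The application does not say how ties are resolved; in this sketch a tie goes to the algorithm encountered first, which is an assumption.

```python
from collections import Counter


def pick_algorithm(matched_algorithms):
    """Return the algorithm that occurs most often among the matched samples."""
    if not matched_algorithms:
        raise ValueError("no matched sample suffix array construction condition")
    counts = Counter(matched_algorithms)
    # most_common orders equal counts by first occurrence (Python 3.7+)
    return counts.most_common(1)[0][0]


# Three matched samples in which algorithm 2 appears twice and algorithm 3 once.
print(pick_algorithm(["algorithm 2", "algorithm 2", "algorithm 3"]))  # algorithm 2
```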
S15: constructing a suffix array of the target character string by using the target suffix array construction algorithm.
In the embodiment of the present application, a suffix array construction component corresponding to each suffix array construction algorithm is configured in the terminal device. After determining the target suffix array construction algorithm, the terminal device can call a suffix array construction component corresponding to the target suffix array construction algorithm to construct a suffix array of the target character string.
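As a non-limiting illustration, the mapping from each suffix array construction algorithm to its construction component can be sketched in Python as a lookup table of callables. The registry `COMPONENTS`, the key "naive" and both function names are hypothetical. The single registered component sorts the suffix start positions by comparing the suffixes directly; it stands in for a real construction algorithm only to keep the sketch runnable and is not one of the algorithms named in the application.

```python
def naive_suffix_array(text):
    """Placeholder component: sort suffix start positions by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])


# Hypothetical registry: algorithm name -> construction component.
COMPONENTS = {"naive": naive_suffix_array}


def build_suffix_array(text, algorithm_name):
    """Call the component registered for the selected construction algorithm."""
    try:
        component = COMPONENTS[algorithm_name]
    except KeyError:
        raise ValueError(f"no component registered for {algorithm_name!r}") from None
    return component(text)


print(build_suffix_array("banana", "naive"))  # [5, 3, 1, 0, 4, 2]
```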
As can be seen from the above, in the embodiment of the present application, the information of the target character string and the information of the current terminal device are acquired; a target suffix array construction mode matched with this information is determined; and a sample set corresponding to the target suffix array construction mode is acquired. Each piece of sample data in the sample set consists of a sample suffix array construction condition corresponding to one sample character string and the suffix array construction algorithm with optimal performance applicable to that condition, where the sample suffix array construction condition comprises information of the sample character string and information of the sample terminal device used when constructing the suffix array of the sample character string. The target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal device can therefore be determined from the sample set. In other words, the algorithm used to construct the suffix array of the target character string is selected automatically according to the information of the target character string and the information of the current terminal device, which improves the construction efficiency of the suffix array.
Referring to fig. 4, fig. 4 is a schematic flowchart of a method for constructing a suffix array according to another embodiment of the present application, as shown in fig. 4, after S14, the method for constructing a suffix array in this embodiment may further include S16, which is described in detail below:
S16: forming new sample data from the target suffix array construction condition and the target suffix array construction algorithm, and adding the new sample data to the sample set.
In this embodiment, in order to increase the sample size and the sample richness of the sample set, and thereby further improve the construction efficiency of the suffix array, after determining the target suffix array construction algorithm, the terminal device may form new sample data from the target suffix array construction condition corresponding to the target character string and the target suffix array construction algorithm, and add the new sample data to the sample set corresponding to the target suffix array construction mode. When the terminal device subsequently determines a target suffix array construction algorithm through the sample set, the larger and richer sample set can improve the accuracy of the determined target suffix array construction algorithm, which in turn improves the efficiency of constructing the suffix array based on that algorithm.
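As a non-limiting illustration, S16 can be sketched in Python as appending one record to the sample set of the selected construction mode. The `SampleData` class, the per-mode dictionary layout and the example values are hypothetical; the application does not prescribe how the sample set is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleData:
    condition: tuple  # encoded suffix array construction condition
    algorithm: str    # best-performing algorithm determined for that condition


def add_sample(sample_sets, mode, condition, algorithm):
    """Append a new sample to the sample set of the given construction mode."""
    sample = SampleData(tuple(condition), algorithm)
    sample_sets.setdefault(mode, []).append(sample)
    return sample


# Hypothetical usage: record the algorithm chosen for a new target condition.
sample_sets = {"parallel": []}
add_sample(sample_sets, "parallel", (1024, 4, 0.5, 8, 4096), "algorithm 2")
print(len(sample_sets["parallel"]))  # 1
```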
It should be noted that the terminal device may perform S15 and S16 simultaneously; the terminal device may execute S15 first and then S16 second; the terminal device may also execute S16 first and then S15, and the order in which S15 and S16 are executed is not limited here.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiments of the present application in any way.
Based on the method for constructing the suffix array provided by the above embodiment, the embodiment of the present invention further provides an embodiment of a terminal device for implementing the above method embodiment.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. In this embodiment, each unit included in the terminal device is configured to perform the steps in the embodiment corresponding to fig. 1 or fig. 4. For details, refer to fig. 1 or fig. 4 and the related descriptions in the corresponding embodiments. For convenience of explanation, only the portions related to the present embodiment are shown. As shown in fig. 5, the terminal device 50 includes: first acquisition unit 51, first determination unit 52, second acquisition unit 53, second determination unit 54, and suffix array construction unit 55. Wherein:
The first obtaining unit 51 is configured to obtain information of a target character string and information of a current terminal device.
The first determining unit 52 is configured to determine a target suffix array construction mode that matches the information of the target character string and the information of the current terminal device.
The second obtaining unit 53 is configured to obtain a sample set corresponding to the target suffix array construction mode; each sample data in the sample set consists of a sample suffix array construction condition corresponding to one sample character string and a suffix array construction algorithm with optimal performance suitable for the sample suffix array construction condition, wherein the sample suffix array construction condition comprises information of the sample character string and information of sample terminal equipment used in constructing a suffix array of the sample character string.
The second determining unit 54 is configured to determine, from the sample set, a target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal device.
The suffix array construction unit 55 is configured to construct a suffix array of the target character string by using the target suffix array construction algorithm.
Optionally, the information of the target character string includes a character string size and a repetition degree of the target character string; accordingly, the first acquisition unit 51 includes a character string size acquisition unit, a character string processing unit, and a repetition degree determination unit. Wherein:
The character string size acquisition unit is used for acquiring the character string size of the target character string.
And the character string processing unit is used for sequentially carrying out character string segmentation operation and character string sampling operation on the target character string to obtain a plurality of sampling character strings of the target character string.
The repeatability determining unit is used for calculating a first similarity value between every two sampling strings by adopting a preset string similarity algorithm, and determining the repeatability of the target string based on the first similarity value.
Optionally, the information of the current terminal device includes the number of cores and the memory capacity of the current terminal device; accordingly, the first determining unit 52 is specifically configured to:
if the character string size of the target character string is larger than the memory capacity of the current terminal equipment, determining an external memory suffix array construction mode as the target suffix array construction mode;
if the character string size of the target character string is smaller than or equal to the memory capacity of the current terminal equipment and the number of the cores of the current terminal equipment is equal to 1, determining a serial suffix array construction mode as the target suffix array construction mode;
and if the character string size of the target character string is smaller than or equal to the memory capacity of the current terminal equipment and the number of the cores of the current terminal equipment is larger than 1, determining a parallel suffix array construction mode as the target suffix array construction mode.
Optionally, the second determining unit 54 includes: the device comprises a target construction condition determining unit, a sample construction condition determining unit and a target algorithm determining unit. Wherein:
the target construction condition determining unit is used for taking the information of the target character string and the information of the current terminal equipment as target suffix array construction conditions corresponding to the target character string;
the sample construction condition determining unit is used for determining a sample suffix array construction condition matched with the target suffix array construction condition from the sample set;
the target algorithm determining unit is used for determining the target suffix array construction algorithm based on the suffix array construction algorithm with the optimal performance, which is applicable to the sample suffix array construction condition matched with the target suffix array construction condition.
Optionally, the sample construction condition determining unit is specifically configured to:
calculating a second similarity value between the target suffix array construction condition and each sample suffix array construction condition in the sample set;
and determining the sample suffix array construction condition corresponding to the second similarity value meeting the preset condition as the sample suffix array construction condition matched with the target suffix array construction condition.
Optionally, the target algorithm determining unit is specifically configured to:
and determining, as the target suffix array construction algorithm, the suffix array construction algorithm that accounts for the highest proportion among the suffix array construction algorithms with optimal performance applicable to the sample suffix array construction conditions matched with the target suffix array construction condition.
Optionally, the terminal device 50 further comprises a sample set processing unit.
The sample set processing unit is used for forming new sample data by the target suffix array construction condition and the target suffix array construction algorithm, and adding the new sample data into the sample set.
It should be noted that, because the information exchange and execution processes between the above modules are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiments and are not described herein again.
Fig. 6 is a schematic structural diagram of a terminal device according to another embodiment of the present application. As shown in fig. 6, the terminal device 6 provided in this embodiment includes: a processor 60, a memory 61 and a computer program 62 stored in the memory 61 and executable on the processor 60, for example a program corresponding to a method of constructing a suffix array. The steps of the above-described embodiments of the method of constructing a suffix array are implemented by the processor 60 when executing the computer program 62, such as S11-S15 shown in fig. 1. Alternatively, the processor 60 may implement the functions of the modules/units in the embodiments of the terminal device described above, such as the functions of the units 51-55 shown in fig. 5, when executing the computer program 62.
By way of example, the computer program 62 may be partitioned into one or more modules/units that are stored in the memory 61 and executed by the processor 60 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer program 62 in the terminal device 6. For example, the computer program 62 may be divided into a first acquisition unit, a first determination unit, a second acquisition unit, a second determination unit, and a suffix array construction unit; for the specific functions of the units, refer to the relevant descriptions in the embodiment corresponding to fig. 5, which are not repeated herein.
The terminal device may include, but is not limited to, a processor 60 and a memory 61. It will be appreciated by those skilled in the art that fig. 6 is merely an example of the terminal device 6 and does not constitute a limitation of the terminal device 6, which may include more or fewer components than illustrated, may combine certain components, or may use different components; e.g., the terminal device may further include an input-output device, a network access device, a bus, etc.
The processor 60 may be a central processing unit (Central Processing Unit, CPU), another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used for storing the computer program and other programs and data required by the terminal device. The memory 61 may also be used for temporarily storing data that has been output or is to be output.
Embodiments of the present application also provide a computer-readable storage medium. The computer readable storage medium stores a computer program which, when executed by a processor, implements the method for constructing a suffix array.
The embodiment of the application provides a computer program product which can realize the method for constructing the suffix array when being executed on terminal equipment.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of each functional unit and module is illustrated, and in practical application, the above-mentioned functional allocation may be performed by different functional units and modules, that is, the internal structure of the terminal device is divided into different functional units or modules, so as to perform all or part of the above-mentioned functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis. For parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.
Claims (7)
1. A method of constructing a suffix array, comprising:
acquiring information of a target character string and information of current terminal equipment;
determining a target suffix array construction mode matched with the information of the target character string and the information of the current terminal equipment;
acquiring a sample set corresponding to the target suffix array construction mode; each sample data in the sample set consists of a sample suffix array construction condition corresponding to a sample character string and a suffix array construction algorithm with optimal performance applicable to the sample suffix array construction condition, and the sample suffix array construction condition comprises information of the sample character string and information of sample terminal equipment used when constructing a suffix array of the sample character string;
determining a target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal equipment from the sample set;
constructing a suffix array of the target character string by adopting the target suffix array construction algorithm;
the information of the target character string comprises the character string size and the repeatability of the target character string;
accordingly, the acquiring of the information of the target character string comprises:
acquiring the character string size of the target character string;
sequentially performing character string segmentation operation and character string sampling operation on the target character string to obtain a plurality of sampling character strings of the target character string;
calculating a first similarity value between every two sampling character strings by adopting a preset character string similarity algorithm, and determining the repeatability of the target character string based on the first similarity value;
the information of the current terminal equipment comprises the number of cores and the memory capacity of the current terminal equipment;
correspondingly, the determining of the target suffix array construction mode matched with the information of the target character string and the information of the current terminal equipment comprises:
if the character string size of the target character string is larger than the memory capacity of the current terminal equipment, determining an external memory suffix array construction mode as the target suffix array construction mode;
if the character string size of the target character string is smaller than or equal to the memory capacity of the current terminal equipment and the number of the cores of the current terminal equipment is equal to 1, determining a serial suffix array construction mode as the target suffix array construction mode;
if the character string size of the target character string is smaller than or equal to the memory capacity of the current terminal equipment and the number of the cores of the current terminal equipment is larger than 1, determining a parallel suffix array construction mode as the target suffix array construction mode;
the determining, from the sample set, of a target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal equipment comprises:
taking the information of the target character string and the information of the current terminal equipment as a target suffix array construction condition corresponding to the target character string;
determining a sample suffix array construction condition matching the target suffix array construction condition from the sample set;
and determining the target suffix array construction algorithm based on the suffix array construction algorithm with optimal performance that is applicable to the sample suffix array construction condition matched with the target suffix array construction condition.
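The mode-selection rule and the repeatability estimate recited in claim 1 can be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not the claimed implementation: the function names `choose_mode` and `estimate_repeatability`, the mode labels, the segment count, the sample length, the use of `difflib.SequenceMatcher` as the "preset character string similarity algorithm", and the averaging of the first similarity values are all hypothetical choices that the claim does not specify.

```python
from difflib import SequenceMatcher
from itertools import combinations


def choose_mode(string_size: int, memory_capacity: int, cores: int) -> str:
    """Pick a construction mode from string size, memory capacity and core count.

    Sizes are assumed to be in the same unit (for example, bytes).
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if string_size > memory_capacity:
        return "external"  # the string does not fit in memory
    if cores == 1:
        return "serial"    # fits in memory, single core
    return "parallel"      # fits in memory, more than one core


def estimate_repeatability(text: str, segments: int = 4, sample_len: int = 8) -> float:
    """Segment the string, sample each segment, and average pairwise similarity.

    The segment count, sample length, similarity measure and the mean are
    assumptions made for illustration only.
    """
    seg_len = max(1, len(text) // segments)
    # Segmentation: cut the string into consecutive pieces.
    pieces = [text[i:i + seg_len] for i in range(0, len(text), seg_len)]
    # Sampling: keep the first sample_len characters of each piece.
    samples = [p[:sample_len] for p in pieces if p]
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    # First similarity value between every two sampled strings.
    scores = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(scores) / len(scores)
```

Under these assumptions, a string larger than the available memory always selects the external memory mode regardless of the core count, which mirrors the order of the three conditions in the claim.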
2. The method of constructing a suffix array of claim 1, wherein said determining from said sample set a sample suffix array construction condition that matches said target suffix array construction condition comprises:
calculating a second similarity value between the target suffix array construction condition and each sample suffix array construction condition in the sample set;
and determining the sample suffix array construction condition corresponding to the second similarity value meeting the preset condition as the sample suffix array construction condition matched with the target suffix array construction condition.
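The matching step of claim 2 can be illustrated with a short Python sketch. The claim fixes neither the similarity measure nor the preset condition, so everything named here is an assumption: the `Condition` record, the `similarity` function (one minus the mean relative difference of the four fields), the `matching_samples` helper, the threshold of 0.8, and the algorithm labels used in the usage example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical suffix array construction condition."""
    string_size: int      # size of the character string
    repeatability: float  # assumed to lie in [0, 1]
    cores: int            # number of cores of the terminal equipment
    memory_capacity: int  # memory capacity, same unit as string_size


def similarity(a: Condition, b: Condition) -> float:
    """Hypothetical second similarity value in [0, 1]; 1.0 means identical."""
    def rel_diff(x: float, y: float) -> float:
        top = max(abs(x), abs(y))
        return 0.0 if top == 0 else abs(x - y) / top

    diffs = [
        rel_diff(a.string_size, b.string_size),
        abs(a.repeatability - b.repeatability),
        rel_diff(a.cores, b.cores),
        rel_diff(a.memory_capacity, b.memory_capacity),
    ]
    return 1.0 - sum(diffs) / len(diffs)


def matching_samples(target, samples, threshold=0.8):
    """Return the (condition, algorithm) pairs whose similarity meets the threshold."""
    return [(cond, alg) for cond, alg in samples if similarity(target, cond) >= threshold]
```

A usage example with made-up values:

```python
target = Condition(1000, 0.5, 4, 8000)
samples = [
    (Condition(1000, 0.5, 4, 8000), "algo_a"),  # identical condition
    (Condition(10, 0.0, 1, 100), "algo_b"),     # very different condition
]
matched = matching_samples(target, samples)     # keeps only the first sample
```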
3. The method of constructing a suffix array according to claim 1, wherein said determining the target suffix array construction algorithm based on the suffix array construction algorithm with optimal performance that is applicable to the sample suffix array construction condition matched with the target suffix array construction condition comprises:
determining, as the target suffix array construction algorithm, the suffix array construction algorithm that accounts for the highest proportion among the suffix array construction algorithms with optimal performance applicable to the respective sample suffix array construction conditions matched with the target suffix array construction condition.
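The selection in claim 3 amounts to choosing the algorithm that occurs most often among the matched samples. A minimal Python sketch follows; the function name `pick_target_algorithm` and the algorithm labels are hypothetical, and the tie-breaking behaviour (the algorithm encountered first wins) is an assumption, since the claim does not address ties.

```python
from collections import Counter


def pick_target_algorithm(matched_algorithms):
    """Return the algorithm with the highest share among the matched samples."""
    if not matched_algorithms:
        raise ValueError("no matching sample conditions")
    # most_common(1) yields the (algorithm, count) pair with the largest count.
    return Counter(matched_algorithms).most_common(1)[0][0]
```

For example, if three matched samples record `"algo_a"`, `"algo_b"` and `"algo_a"` as their best-performing algorithms, the sketch selects `"algo_a"`.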
4. The method of constructing a suffix array of claim 1, wherein the method of constructing a suffix array further comprises:
forming new sample data from the target suffix array construction condition and the target suffix array construction algorithm, and adding the new sample data to the sample set.
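The feedback step of claim 4 can be sketched as appending one record to the sample set of the selected construction mode. The storage layout (a dictionary mapping each mode to a list), the function name `record_sample`, and the example values are assumptions made for illustration; the claim does not prescribe how the sample set is stored.

```python
def record_sample(sample_sets, mode, condition, algorithm):
    """Form new sample data and add it to the sample set of the given mode."""
    sample = (condition, algorithm)
    # One sample list per construction mode, created on first use.
    sample_sets.setdefault(mode, []).append(sample)
    return sample


sample_sets = {}
record_sample(sample_sets, "serial", {"string_size": 1000, "cores": 1}, "algo_a")
```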
5. A terminal device, characterized in that it is configured to implement the method for constructing suffix arrays according to any of claims 1-4, said terminal device comprising:
a first acquisition unit, configured to acquire information of a target character string and information of current terminal equipment;
a first determining unit, configured to determine a target suffix array construction mode matched with the information of the target character string and the information of the current terminal equipment;
a second acquisition unit, configured to acquire a sample set corresponding to the target suffix array construction mode; each sample data item in the sample set consists of a sample suffix array construction condition corresponding to a sample character string and the suffix array construction algorithm with optimal performance that is applicable to that sample suffix array construction condition, wherein the sample suffix array construction condition comprises information of the sample character string and information of the sample terminal equipment used when constructing the suffix array of the sample character string;
a second determining unit, configured to determine, from the sample set, a target suffix array construction algorithm with optimal performance corresponding to the information of the target character string and the information of the current terminal equipment;
and a suffix array construction unit, configured to construct the suffix array of the target character string by adopting the target suffix array construction algorithm.
6. A terminal device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of constructing a suffix array according to any of claims 1-4 when executing the computer program.
7. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method of constructing a suffix array according to any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110042806.7A CN112765938B (en) | 2021-01-13 | 2021-01-13 | Method for constructing suffix array, terminal equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112765938A CN112765938A (en) | 2021-05-07 |
CN112765938B CN112765938B (en) | 2024-02-09 |
Family
ID=75700177
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110042806.7A Active CN112765938B (en) | 2021-01-13 | 2021-01-13 | Method for constructing suffix array, terminal equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112765938B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6785677B1 (en) * | 2001-05-02 | 2004-08-31 | Unisys Corporation | Method for execution of query to search strings of characters that match pattern with a target string utilizing bit vector |
CN105264522A (en) * | 2014-03-28 | 2016-01-20 | 华为技术有限公司 | Method and apparatus for constructing suffix array |
CN105335481A (en) * | 2015-10-14 | 2016-02-17 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Large scale character string text suffix index building method and device |
KR20160087134A (en) * | 2015-01-13 | 2016-07-21 | 단국대학교 산학협력단 | Apparatus and method for matching of character string |
CN108804204A (en) * | 2018-04-17 | 2018-11-13 | 佛山市顺德区中山大学研究院 | Multi-threaded parallel constructs the method and system of Suffix array clustering |
CN110597855A (en) * | 2019-08-14 | 2019-12-20 | 中山大学 | Data storage method, terminal equipment and computer readable storage medium |
CN110837584A (en) * | 2019-10-18 | 2020-02-25 | 中山大学 | Method and system for constructing suffix array in block parallel manner |
CN110852046A (en) * | 2019-10-18 | 2020-02-28 | 中山大学 | Block induction sequencing method and system for text suffix index |
Non-Patent Citations (2)
Title |
---|
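A distributed suffix tree construction and matching algorithm; Huang Zhenglin; Zhang Bing; Journal of Huazhong University of Science and Technology (Natural Science Edition) (S1); pp. 219-224 * |
A parallel suffix array construction algorithm suitable for GPU computing; Sun Weidong; Ma Zongmin; Journal of Chinese Computer Systems (05); pp. 830-836 * |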
Also Published As
Publication number | Publication date |
---|---|
CN112765938A (en) | 2021-05-07 |
Similar Documents
Publication | Title |
---|---|
CN110597855B (en) | Data query method, terminal device and computer readable storage medium |
CN104679778B (en) | A kind of generation method and device of search result |
US8943091B2 (en) | System, method, and computer program product for performing a string search |
CN111445952B (en) | Method and system for quickly comparing similarity of super-long gene sequences |
CN108536739B (en) | Metadata sensitive information field identification method, device, equipment and storage medium |
US9330159B2 (en) | Techniques for finding a column with column partitioning |
WO2016177830A1 (en) | Method, system and computer program product for performing numeric searches |
US9069634B2 (en) | Signature representation of data with aliasing across synonyms |
CN110888981A (en) | Title-based document clustering method and device, terminal equipment and medium |
CN110019341B (en) | Data query method and device |
Haj Rachid et al. | A practical and scalable tool to find overlaps between sequences |
Engels et al. | Practical near neighbor search via group testing |
CN109857366B (en) | Insertion ordering method, system, equipment and storage medium based on external memory |
CN116560984A (en) | Test case clustering grouping method based on call dependency graph |
CN114238334A (en) | Heterogeneous data encoding method and device, heterogeneous data decoding method and device, computer equipment and storage medium |
CN109657060A (en) | safety production accident case pushing method and system |
CN112765938B (en) | Method for constructing suffix array, terminal equipment and computer readable storage medium |
CN112527950A (en) | MapReduce-based graph data deleting method and system |
US9201982B2 (en) | Priority search trees |
US20220066988A1 (en) | Hash suppression |
CN111666278A (en) | Data storage method, data retrieval method, electronic device and storage medium |
EP2784692A1 (en) | Filter regular expression |
US9830355B2 (en) | Computer-implemented method of performing a search using signatures |
US20210248142A1 (en) | Dual filter histogram optimization |
CN112100313B (en) | Data indexing method and system based on finest granularity segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||