CN108763170A - Method and system for parallel construction of suffix arrays in constant working space - Google Patents
Method and system for parallel construction of suffix arrays in constant working space
- Publication number
- CN108763170A
- Authority
- CN
- China
- Prior art keywords
- character string
- suffix
- character
- parallel
- recorded
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
Abstract
The invention discloses a method and system for parallel construction of suffix arrays in constant working space. Pointers to the first characters of all LMS substrings in a character string X are obtained and recorded in an array P1. P1 and SA are then used to induced-sort all LMS substrings of X in parallel within constant working space, yielding a character string X1. The input parameters used for SA are selected according to whether the characters of X1 are unique, and finally, through the correspondence between X1 and its suffix array SA1, the suffix array of X is induced into SA in parallel within constant working space. The invention lowers the memory requirement of the computer and runs faster, achieves optimal space-time complexity, and is suited to building suffix arrays of large-scale character strings.
Description
Technical field
The present invention relates to the field of suffix array construction for character strings, and in particular to a method and system for parallel construction of suffix arrays in constant working space.
Background
A suffix array (SA) is a space-efficient alternative to the suffix tree (ST). It is compact and occupies little space, and it can implement algorithms equivalent to those of a suffix tree in less space. Suffix arrays are commonly used to index character strings and can solve many string-processing tasks; they are widely used in applications such as full-text indexing and gene matching.
In recent years, the memory capacity of general-purpose computers has grown steadily, making it possible to process large-scale text and genomic data quickly in the in-memory model. With the explosive growth of data volumes, however, existing serial methods and systems cannot meet the demand for fast processing of large-scale string data: they run slowly and require a large amount of memory, which makes them unsuitable for computer systems with relatively little memory. In short, although they can still build the suffix array of a character string, their space-time complexity is poor.
Summary of the invention
To solve the above problems, the object of the present invention is to provide a method and system for parallel construction of suffix arrays in constant working space that lower the memory requirement and run faster, achieve optimal space-time complexity, and are suited to building suffix arrays of large-scale character strings.
To remedy the deficiencies of the prior art, the technical solution adopted by the present invention is as follows.
The method for parallel construction of suffix arrays in constant working space includes the following steps:
S1. Scan an input character string X from right to left and, following the definition of suffix types, compare each pair of adjacent characters X[i] and X[i+1] to obtain the type of every character and suffix;
During the scan, use the S+L+S type pattern of LMS substrings to find the positions of all LMS characters, thereby obtaining a pointer to the first character of every LMS substring in X, and record these pointers in array P1;
S2. Using arrays P1 and SA, induced-sort all LMS substrings of X in parallel within constant working space, and store the sorted result in SA1;
Here SA is the array that holds the suffix array of X, and SA1 is the array that holds the sorted result;
S3. Based on the sorted result, rename all LMS substrings of X in parallel within constant working space to form the character string X1;
S4. Check whether every character of X1 is unique. If so, compute the suffix array of X1 by directly sorting its suffixes and save it in SA1; otherwise, substitute X1 and SA1 for X and SA as the new input parameters and recursively invoke steps S1 and S2;
S5. From the suffix array of X1 held in SA1, induce the suffix array of X in parallel within constant working space and save it in SA.
Further, the right-to-left scan of character string X in step S1 may use blocked parallel scanning, pipelined parallel scanning, or prefix-sum parallel scanning.
Further, in step S2, induced-sorting all LMS substrings of X in parallel within constant working space using arrays P1 and SA includes the following steps:
S21. Initialize every element of SA to -1. Using a prefix-sum parallel scan over X, compute the end position in SA of the bucket of each suffix and record it in a bucket array B of size O(1). Then scan X from right to left in pipelined parallel fashion, placing each LMS suffix encountered at the current end position of its bucket in SA and moving that bucket's end position one slot to the left;
S22. Using a prefix-sum parallel scan over X, compute the start position in SA of the bucket of each suffix and record it in the bucket array B of size O(1); then process SA with a blocked parallel scan:
Within the current block, scan SA from left to right. For each scanned element SA[i] that is not -1, read the predecessor character X[SA[i]-1] of X[SA[i]]. If the predecessor is L-type, record the value SA[i]-1, as the sorted position of suffix suf(X, SA[i]-1), at the current start position of its bucket in SA, and move that bucket's start position one slot to the right;
Read the sorted results of the previous block and record them in SA;
Read all predecessor characters of the next block and record them in SA;
S23. Using a prefix-sum parallel scan over X, compute the end position in SA of the bucket of each suffix and record it in the bucket array B of size O(1); then process SA with a blocked parallel scan:
Within the current block, scan SA from right to left. For each scanned element SA[i], read the predecessor character X[SA[i]-1] of X[SA[i]]. If the predecessor is S-type, record the value SA[i]-1, as the sorted position of suffix suf(X, SA[i]-1), at the current end position of its bucket in SA, and move that bucket's end position one slot to the left;
Read the sorted results of the previous block and record them in SA;
Read all predecessor characters of the next block and record them in SA.
Further, in step S3, renaming all LMS substrings of X in parallel within constant working space based on the sorted result, to form character string X1, includes the following steps:
S31. Partition the sorted LMS substrings in SA1 into blocks. Within each block, compare adjacent pairs of LMS substrings from left to right. Name each LMS substring by the start position of its bucket in SA1, with the first bucket starting at 0; this yields the local names of the LMS substrings in each block;
S32. Use a prefix-sum method to count the global names available to each block, and replace the local names of the LMS substrings in each block with global names to form character string Z1;
S33. Scan Z1 from right to left in blocked parallel fashion. For each scanned character Z1[i], if Z1[i] is L-type, set X1[i] = Z1[i]; otherwise set X1[i] to the end position of the bucket of Z1[i] in SA1. When the scan of Z1 completes, X1 is the result of renaming every S-type character of Z1.
Further, in step S5, inducing the suffix array of X in parallel within constant working space from the suffix array of X1 held in SA1 includes the following steps:
S51. Initialize every element of SA to -1. Using a prefix-sum parallel scan over X, compute the end position in SA of the bucket of each suffix, reusing space in SA to record it. Then scan array SA1 from right to left in pipelined parallel fashion; for each scanned element SA1[i], place P1[SA1[i]] at the current end position in SA of the bucket of suffix suf(X, P1[SA1[i]]), and move that bucket's end position one slot to the left;
S52. Using a prefix-sum parallel scan over X, compute the start position in SA of the bucket of each suffix, reusing space in SA to record it; then process SA with a blocked parallel scan:
Within the current block, scan SA from left to right. For each scanned element SA[i] that is not -1, read the predecessor character X[SA[i]-1] of X[SA[i]]. If the predecessor is L-type, record the value SA[i]-1, as the sorted position of suffix suf(X, SA[i]-1), at the current start position of its bucket in SA, and move that bucket's start position one slot to the right;
Read the sorted results of the previous block and record them in SA;
Read all predecessor characters of the next block and record them in SA;
S53. Using a prefix-sum parallel scan over X, compute the end position in SA of the bucket of each suffix, reusing space in SA to record it; then process SA with a blocked parallel scan:
Within the current block, scan SA from right to left. For each scanned element SA[i], read the predecessor character X[SA[i]-1] of X[SA[i]]. If the predecessor is S-type, record the value SA[i]-1, as the sorted position of suffix suf(X, SA[i]-1), at the current end position of its bucket in SA, and move that bucket's end position one slot to the left;
Read the sorted results of the previous block and record them in SA;
Read all predecessor characters of the next block and record them in SA.
The system for parallel construction of suffix arrays in constant working space comprises a parallel induced-sorting module, a storage unit, a front-end unit and a resolution unit;
The parallel induced-sorting module induced-sorts the input substrings or suffixes in parallel within constant working space using arrays P1 and SA, and returns the result;
The storage unit stores the temporary data produced while the suffix array is generated;
The front-end unit generates character string X1 from the input character string X by parallel induced sorting and renaming in constant working space, and writes X1 to the storage unit;
The resolution unit reads character string X1 from the storage unit and, from the suffix array of X1 held in SA1, induces the suffix array of X in parallel within constant working space and saves it in SA.
Further, the front-end unit includes an array P1 computing module, an LMS substring sorting module, a character string X1 generation module and a character string X1 decision module;
The array P1 computing module reads the input character string X, uses the S+L+S type pattern of LMS substrings to find the positions of all LMS characters, thereby obtaining a pointer to the first character of every LMS substring in X, and records these pointers in array P1;
The LMS substring sorting module reads array P1 from the storage unit, calls the parallel induced-sorting module to sort all LMS substrings of X, and stores the sorted result in SA1;
The character string X1 generation module reads array SA from the storage unit and, based on the sorted result, renames each LMS substring of X in parallel to generate character string X1;
The character string X1 decision module reads character string X1 from the storage unit and judges whether every character of X1 is unique; if so, control passes to the resolution unit, and otherwise the front-end unit is invoked recursively.
Further, the resolution unit includes a suffix array computing module, a suffix array generation module and a suffix array storage unit;
The suffix array computing module reads character string X1 from the storage unit, computes the suffix array of X1 by directly sorting its suffixes, and writes it to the storage unit;
The suffix array generation module reads array SA1 from the storage unit and calls the parallel induced-sorting module to sort all LMS suffixes of X, thereby obtaining the suffix array of X;
The suffix array storage unit stores the suffix array of character string X.
The beneficial effects of the present invention are as follows. The user need only input an arbitrary character string X defined over a constant character set. The invention obtains pointers to the first characters of all LMS substrings of X and records them in array P1. It then uses P1 and SA to induced-sort all LMS substrings of X in parallel within constant working space, yielding character string X1. It selects the input parameters used for SA according to whether the characters of X1 are unique, and finally, through the correspondence between X1 and its suffix array SA1, it induces the suffix array of X into SA in parallel within constant working space. The invention therefore lowers the memory requirement of the computer and runs faster, achieves optimal space-time complexity, and is suited to building suffix arrays of large-scale character strings.
Description of the drawings
Preferred embodiments of the present invention are given below with reference to the accompanying drawings in order to describe the invention in detail.
Fig. 1 is a flow chart of the steps of the method of the present invention;
Fig. 2 is a schematic block diagram of the structure of the system of the present invention.
Detailed description
The following technical terms are used in the description of the present invention and are explained here:
Working space: the total space minus the space occupied by character string X and its suffix array.
Constant working space: the total space minus the space occupied by an arbitrary character string X defined over a constant character set and by its suffix array. The space-time complexity of a suffix array construction algorithm or system built on this basis is the theoretically attainable optimum.
Character set: a character set Σ is a totally ordered set; that is, any two distinct elements α and β of Σ can be compared, so that either α<β or α>β. The elements of Σ are called characters, and the smallest character is '$'. The size of the character set considered in the present invention may be a constant O(1) or an integer O(n).
Character string: a character string X of length n is an array X[0, n-1] formed by arranging n characters of Σ in order of position. The end marker of X is fixed as '$', and '$' appears nowhere else in X.
Substring: a substring X[i, j], i≤j, of character string X is the section of X from position i to position j, that is, the string formed by the characters X[i], X[i+1], ..., X[j].
Suffix: a suffix of character string X is the substring that runs from some position i to the end marker '$'. The suffix starting at the i-th character is written suf(X, i); that is, suf(X, i) = X[i, n-1].
Character and suffix types: the characters of X are divided into two types, L and S.
(1) '$' is S-type;
(2) X[i], i ∈ [0, n-2], is S-type if and only if suf(X, i) < suf(X, i+1), i.e. X[i] < X[i+1], or X[i] = X[i+1] and X[i+1] is S-type;
(3) X[i], i ∈ [0, n-2], is L-type if and only if suf(X, i) > suf(X, i+1), i.e. X[i] > X[i+1], or X[i] = X[i+1] and X[i+1] is L-type.
Suffix suf(X, i) is S-type if and only if character X[i] is S-type, and suffix suf(X, i) is L-type if and only if character X[i] is L-type.
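The type rules above can be illustrated with a short sequential sketch. It is only an illustration under assumptions: the function name `classify_types` is hypothetical, the input is assumed to end with a unique '$' that compares smaller than every other character, and the parallel scanning modes of the invention are not reproduced.

```python
def classify_types(x: str) -> str:
    """Return one 'L' or 'S' flag per character of x.

    Assumes x ends with the unique end marker '$', which is smaller than
    every other character (true for letters in ASCII order).
    """
    n = len(x)
    types = ["S"] * n  # rule (1): the end marker X[n-1] = '$' is S-type
    for i in range(n - 2, -1, -1):  # right-to-left scan, as in step S1
        if x[i] < x[i + 1] or (x[i] == x[i + 1] and types[i + 1] == "S"):
            types[i] = "S"  # rule (2)
        else:
            types[i] = "L"  # rule (3)
    return "".join(types)


print(classify_types("baac$"))  # LSSLS
```

A single right-to-left pass suffices because the type of X[i] depends only on X[i], X[i+1] and the type of X[i+1].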
LMS (leftmost S-type) characters and suffixes:
(1) '$' is an LMS character;
(2) X[i], i ∈ [1, n-1], is an LMS character if and only if X[i] is S-type and X[i-1] is L-type;
(3) suffix suf(X, i) is an LMS suffix if and only if character X[i] is an LMS character.
LMS substrings and their S+L+S type pattern:
(1) '$' is an LMS substring;
(2) X[i, j] is an LMS substring if and only if 1≤i<j<n, X[i] and X[j] are both LMS characters, and no other LMS character lies between X[i] and X[j].
An LMS substring therefore consists of three parts in order: one or more S-type characters, one or more L-type characters, and a single S-type character. This is called the S+L+S type pattern of LMS substrings.
Pointer array: the pointer array P1 records the positions in X of the first characters of all LMS substrings; that is, P1[i] records the position in X of the first character of the (i+1)-th LMS substring of X, counted from left to right.
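As a minimal sketch of how the pointer array P1 follows from the definitions above, the positions of the LMS characters can be collected sequentially as shown below. The function name `lms_positions` is hypothetical, the input is assumed to have length at least 2 and to end with a unique '$', and the code does not attempt the parallel scan of the invention.

```python
def lms_positions(x: str) -> list[int]:
    """Return P1: positions of LMS characters of x, from left to right.

    Assumes len(x) >= 2 and that x ends with the unique end marker '$'.
    """
    n = len(x)
    is_s = [True] * n  # '$' is S-type
    for i in range(n - 2, -1, -1):
        is_s[i] = x[i] < x[i + 1] or (x[i] == x[i + 1] and is_s[i + 1])
    # X[i] is an LMS character iff it is S-type and X[i-1] is L-type.
    return [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]


print(lms_positions("baac$"))  # [1, 4]
```

For X = baac$ the LMS substrings are therefore X[1, 4] = aac$ and the end marker '$' itself.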
String comparison: comparing two character strings means comparing them in the usual lexicographic order. For two strings u and v, compare u[i] and v[i] in sequence starting from i = 0. If u[i] = v[i], increment i and compare the next u[i] and v[i]; otherwise u < v if u[i] < v[i], and u > v if u[i] > v[i].
Suffix array: the suffix array SA is a one-dimensional array of n integers satisfying suf(X, SA[i]) < suf(X, SA[i+1]) for i ∈ [0, n-2]. That is, the n suffixes of X are sorted in ascending order, and the positions in X of the first characters of the sorted suffixes are stored in SA from left to right.
Bucket and bucket array: when all suffixes of X are sorted in SA by their first character, the suffixes that share a first character occupy a contiguous region of SA; this region is called the bucket of those suffixes. If X contains i distinct characters, SA contains i buckets, and all suffixes in a bucket have the same first character. The bucket array B records the current position of each bucket in SA.
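The bucket boundaries can be derived from character counts by a prefix sum. The sketch below is sequential and uses dictionaries for readability, so it is not the O(1)-size bucket array B of the invention; the name `bucket_bounds` is an assumption made for illustration.

```python
from collections import Counter
from itertools import accumulate


def bucket_bounds(x: str) -> tuple[dict[str, int], dict[str, int]]:
    """Return (start, end) positions in SA of the bucket of each character.

    Both positions are inclusive indices into SA.
    """
    counts = Counter(x)
    chars = sorted(counts)  # buckets appear in SA in character order
    exclusive_ends = list(accumulate(counts[c] for c in chars))  # prefix sum
    start = {c: e - counts[c] for c, e in zip(chars, exclusive_ends)}
    end = {c: e - 1 for c, e in zip(chars, exclusive_ends)}
    return start, end


start, end = bucket_bounds("baac$")
print(start)  # {'$': 0, 'a': 1, 'b': 3, 'c': 4}
print(end)    # {'$': 0, 'a': 2, 'b': 3, 'c': 4}
```

The start positions are the ones advanced to the right when L-type suffixes are placed, and the end positions are the ones moved to the left when S-type suffixes are placed.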
Using the terms above, an example of constructing the suffix array of a character string is as follows. Let X = baac$, with length n = 5. Then suf(X, 0) = baac$, suf(X, 1) = aac$, suf(X, 2) = ac$, suf(X, 3) = c$ and suf(X, 4) = $. By the definition of string comparison, suf(X, 4) < suf(X, 1) < suf(X, 2) < suf(X, 0) < suf(X, 3). By the definition of the suffix array, SA[0] = 4, SA[1] = 1, SA[2] = 2, SA[3] = 0 and SA[4] = 3, i.e. SA = [4, 1, 2, 0, 3].
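The example can be checked with a naive construction that sorts the suffixes directly. This is only a reference check, not the method of the invention: it takes far more time and space than induced sorting, and the name `naive_suffix_array` is hypothetical.

```python
def naive_suffix_array(x: str) -> list[int]:
    """Sort suffix start positions by comparing the suffixes lexicographically.

    Relies on '$' comparing smaller than the other characters of x.
    """
    return sorted(range(len(x)), key=lambda i: x[i:])


print(naive_suffix_array("baac$"))  # [4, 1, 2, 0, 3]
```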
Referring to Fig. 1, the method for parallel construction of suffix arrays in constant working space includes the following steps:
S1. Scan an input character string X from right to left and, following the definition of suffix types, compare each pair of adjacent characters X[i] and X[i+1] to obtain the type of every character and suffix;
During the scan, use the S+L+S type pattern of LMS substrings to find the positions of all LMS characters, thereby obtaining a pointer to the first character of every LMS substring in X, and record these pointers in array P1;
S2. Using arrays P1 and SA, induced-sort all LMS substrings of X in parallel within constant working space, and store the sorted result in SA1;
Here SA is an array that holds the suffix array of X, and SA1 is an array that holds the suffix array of the sorted result;
S3. Based on the sorted result, rename all LMS substrings of X in parallel within constant working space to form the character string X1;
S4. Check whether every character of X1 is unique. If so, compute the suffix array of X1 by directly sorting its suffixes and save it in SA1; otherwise, substitute X1 and SA1 for X and SA as the new input parameters and recursively invoke steps S1 and S2;
S5. From the suffix array of X1 held in SA1, induce the suffix array of X in parallel within constant working space and save it in SA.
Specifically, the user need only input an arbitrary character string X defined over a constant character set. The invention obtains pointers to the first characters of all LMS substrings of X and records them in array P1. It then uses P1 and SA to induced-sort all LMS substrings of X in parallel within constant working space, yielding character string X1. It selects the input parameters used for SA according to whether the characters of X1 are unique, and finally, through the correspondence between X1 and its suffix array SA1, it induces the suffix array of X into SA in parallel within constant working space. The invention therefore lowers the memory requirement of the computer and runs faster, achieves optimal space-time complexity, and is suited to building suffix arrays of large-scale character strings.
The right-to-left scan of character string X in step S1 may use blocked parallel scanning, pipelined parallel scanning, or prefix-sum parallel scanning. All three are common ways of scanning character strings and are not described further here.
In step S2, induced-sorting all LMS substrings of X in parallel within constant working space using arrays P1 and SA includes the following steps:
S21. Initialize every element of SA to -1. Using a prefix-sum parallel scan over X, compute the end position in SA of the bucket of each suffix and record it in a bucket array B of size O(1). Then scan X from right to left in pipelined parallel fashion, placing each LMS suffix encountered at the current end position of its bucket in SA and moving that bucket's end position one slot to the left;
S22. Using a prefix-sum parallel scan over X, compute the start position in SA of the bucket of each suffix and record it in the bucket array B of size O(1); then process SA with a blocked parallel scan:
Within the current block, scan SA from left to right. For each scanned element SA[i] that is not -1, read the predecessor character X[SA[i]-1] of X[SA[i]]. If the predecessor is L-type, record the value SA[i]-1, as the sorted position of suffix suf(X, SA[i]-1), at the current start position of its bucket in SA, and move that bucket's start position one slot to the right;
Read the sorted results of the previous block and record them in SA;
Read all predecessor characters of the next block and record them in SA;
S23. Using a prefix-sum parallel scan over X, compute the end position in SA of the bucket of each suffix and record it in the bucket array B of size O(1); then process SA with a blocked parallel scan:
Within the current block, scan SA from right to left. For each scanned element SA[i], read the predecessor character X[SA[i]-1] of X[SA[i]]. If the predecessor is S-type, record the value SA[i]-1, as the sorted position of suffix suf(X, SA[i]-1), at the current end position of its bucket in SA, and move that bucket's end position one slot to the left;
Read the sorted results of the previous block and record them in SA;
Read all predecessor characters of the next block and record them in SA.
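The three passes S21 to S23 can be sketched sequentially as follows. This is a simplified illustration under stated assumptions, not the invention's implementation: it runs on one thread with no blocking or pipelining, it keeps bucket pointers in dictionaries rather than in an O(1)-size array B, and the names `induced_sort`, `_types` and `_buckets` are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def _types(x):
    """is_s[i] is True when X[i] is S-type; x must end with a unique '$'."""
    is_s = [True] * len(x)
    for i in range(len(x) - 2, -1, -1):
        is_s[i] = x[i] < x[i + 1] or (x[i] == x[i + 1] and is_s[i + 1])
    return is_s


def _buckets(x):
    """Return fresh (start, end) inclusive bucket positions per character."""
    counts = Counter(x)
    chars = sorted(counts)
    ends = list(accumulate(counts[c] for c in chars))
    start = {c: e - counts[c] for c, e in zip(chars, ends)}
    end = {c: e - 1 for c, e in zip(chars, ends)}
    return start, end


def induced_sort(x, seeds):
    """Seed SA with the LMS positions in `seeds`, then induce L and S types."""
    n = len(x)
    is_s = _types(x)
    sa = [-1] * n
    # S21: place LMS suffixes at bucket ends, taking the seeds right to left.
    _, end = _buckets(x)
    for p in reversed(seeds):
        sa[end[x[p]]] = p
        end[x[p]] -= 1
    # S22: left-to-right pass, placing L-type predecessors at bucket starts.
    start, _ = _buckets(x)
    for i in range(n):
        j = sa[i] - 1
        if sa[i] > 0 and not is_s[j]:
            sa[start[x[j]]] = j
            start[x[j]] += 1
    # S23: right-to-left pass, placing S-type predecessors at bucket ends.
    _, end = _buckets(x)
    for i in range(n - 1, -1, -1):
        j = sa[i] - 1
        if sa[i] > 0 and is_s[j]:
            sa[end[x[j]]] = j
            end[x[j]] -= 1
    return sa


print(induced_sort("baac$", [1, 4]))  # [4, 1, 2, 0, 3]
```

For X = baac$ with P1 = [1, 4], the passes leave the LMS positions 4 and 1 in sorted order of their LMS substrings. In this small example the LMS substrings are all distinct, so the result already coincides with the suffix array of X.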
Further, in step S3, renaming all LMS substrings of X in parallel within constant working space based on the sorted result, to form character string X1, includes the following steps:
S31. Partition the sorted LMS substrings in SA1 into blocks. Within each block, compare adjacent pairs of LMS substrings from left to right. Name each substring by the start position of its bucket in SA1, with the first bucket starting at 0; this yields the local names of the LMS substrings in each block;
S32. Use a prefix-sum method to count the global names available to each block, and replace the local names of the LMS substrings in each block with global names to form character string Z1;
S33. Scan Z1 from right to left in blocked parallel fashion. For each scanned character Z1[i], if Z1[i] is L-type, set X1[i] = Z1[i]; otherwise set X1[i] to the end position of the bucket of Z1[i] in SA1. When the scan of Z1 completes, X1 is the result of renaming every S-type character of Z1.
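The naming rule of S31 and S32 can be sketched in one sequential pass: equal LMS substrings share a bucket, and each substring is named by the position at which its bucket starts in the sorted order. The sketch below is an assumption-laden simplification: it works on a single block, so local and global names coincide; it omits the S-type renaming of S33; and the function name `name_lms_substrings` and its argument `sorted_lms` are hypothetical.

```python
def name_lms_substrings(x: str, sorted_lms: list[int]) -> list[int]:
    """Name LMS substrings by the start position of their bucket.

    sorted_lms lists the LMS positions of x in sorted order of their LMS
    substrings. The result lists the names in text order (left to right).
    """
    n = len(x)
    text_order = sorted(sorted_lms)
    next_lms = dict(zip(text_order, text_order[1:]))

    def substring(p: int) -> str:
        # An LMS substring runs from one LMS character to the next one;
        # the end marker at n-1 forms an LMS substring on its own.
        return x[p:next_lms.get(p, n - 1) + 1]

    names = {}
    bucket_start = 0
    for k, p in enumerate(sorted_lms):
        if k > 0 and substring(p) != substring(sorted_lms[k - 1]):
            bucket_start = k  # a new bucket begins at this sorted position
        names[p] = bucket_start
    return [names[p] for p in text_order]


print(name_lms_substrings("baac$", [4, 1]))  # [1, 0]
print(name_lms_substrings("mmiissiissiippii$", [16, 10, 2, 6]))  # [2, 2, 1, 0]
```

In the first call every name is unique, which corresponds to the case in step S4 where the suffixes can be sorted directly. In the second call the name 2 occurs twice because the LMS substring iissi occurs twice, which corresponds to the case where step S4 recurses.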
In step S5, inducing the suffix array of X in parallel within constant working space from the suffix array of X1 held in SA1 includes the following steps:
S51. Initialize every element of SA to -1. Using a prefix-sum parallel scan over X, compute the end position in SA of the bucket of each suffix, reusing space in SA to record it. Then scan array SA1 from right to left in pipelined parallel fashion; for each scanned element SA1[i], place P1[SA1[i]] at the current end position in SA of the bucket of suffix suf(X, P1[SA1[i]]), and move that bucket's end position one slot to the left;
S52. Using a prefix-sum parallel scan over X, compute the start position in SA of the bucket of each suffix, reusing space in SA to record it; then process SA with a blocked parallel scan:
Within the current block, scan SA from left to right. For each scanned element SA[i] that is not -1, read the predecessor character X[SA[i]-1] of X[SA[i]]. If the predecessor is L-type, record the value SA[i]-1, as the sorted position of suffix suf(X, SA[i]-1), at the current start position of its bucket in SA, and move that bucket's start position one slot to the right;
Read the sorted results of the previous block and record them in SA;
Read all predecessor characters of the next block and record them in SA;
S53. Using a prefix-sum parallel scan over X, compute the end position in SA of the bucket of each suffix, reusing space in SA to record it; then process SA with a blocked parallel scan:
Within the current block, scan SA from right to left. For each scanned element SA[i], read the predecessor character X[SA[i]-1] of X[SA[i]]. If the predecessor is S-type, record the value SA[i]-1, as the sorted position of suffix suf(X, SA[i]-1), at the current end position of its bucket in SA, and move that bucket's end position one slot to the left;
Read the sorted results of the previous block and record them in SA;
Read all predecessor characters of the next block and record them in SA.
Reference Fig. 2, the system of constant working space parallel construction Suffix array clustering, including:It is parallel to conclude sorting module 8, deposit
Storage unit 1, front end units 2 and resolution unit 3;
The parallel conclusion sorting module 8, for by array P1 and SA come to input substring or suffix carry out constant
Working space concludes sequence parallel, and returns the result;
The storage unit 1, for storing the ephemeral data during generating Suffix array clustering, such as array P1, SA1 and word
Symbol string X1 etc.;
The front end units 2 are sorted and again for according to the character string X of input, being concluded parallel using constant working space
The method of name generates character string X1, and write storage unit 1;
The resolution unit 3, for reading character string X1 from storage unit 1, with after the character string X1 that is stored in SA1
Sew array, conclude the Suffix array clustering of calculating character string X parallel in constant working space and preserves into SA.
Specifically, it is front end units 2 and the module that resolution unit 3 can be called, storage unit to conclude sorting module 8 parallel
1 setting ensure that the data in building process are not lost, and be conducive to reading or the tune of front end units 2 and resolution unit 3
With.
Wherein, the front end units 2 include array P1 computing modules 4, LMS substrings sorting module 5, character string X1 generation moulds
Block 6 and character string X1 decision-making modules 7;
The array P1 computing modules 4, the character string X for reading input, according to the S of LMS substrings+L+S type-schemes are fixed
Justice finds out the position that all LMS characters occur, and to obtain the initial character pointer of all LMS substrings in character string X, and is recorded in
In array P1;
LMS substrings sorting module 5 for reading array P1 from storage unit 1, and calls parallel conclusion sorting module 8 right
All LMS substrings are ranked up in character string X, and ranking results are stored in SA1;
Character string X1 generation modules 6, for reading array SA from storage unit 1, and according to the parallel renaming of ranking results
Each LMS substrings in character string X generate character string X1;
Whether only character string X1 decision-making modules 7 judge each character of X1 for reading character string X1 from storage unit 1
One, if being then transferred to resolution unit 3, otherwise recursive call front end units 2.
The resolution unit 3 includes a suffix array computing module 9, a suffix array generation module 10 and a suffix array storage unit 11;
The suffix array computing module 9 reads the string X1 from the storage unit 1, directly sorts the suffixes of X1 to compute the suffix array of X1, and writes it to the storage unit;
The suffix array generation module 10 reads the array SA1 from the storage unit 1 and calls the parallel induced sorting module 8 to sort all LMS suffixes in X, thereby obtaining the suffix array of X;
The suffix array storage unit 11 stores the suffix array of the string X.
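The direct sort performed by the suffix array computing module can be sketched as follows. This is only an illustration under a stated assumption: the characters of X1 are taken to be a permutation of 0..m-1 (which holds when every LMS substring received a distinct rank as its name). The function name `direct_suffix_array` is hypothetical and does not come from the patent.

```python
def direct_suffix_array(X1):
    """Suffix array of X1 when all of its characters are distinct.

    Assumption (not stated in the patent text): the characters of X1
    are exactly the integers 0..m-1, each used once.
    """
    m = len(X1)
    if sorted(X1) != list(range(m)):
        raise ValueError("characters of X1 are not a permutation of 0..m-1")
    SA1 = [-1] * m
    for i, c in enumerate(X1):
        # With distinct first characters, the suffix starting at i is
        # ranked solely by X1[i], so its rank equals the character itself.
        SA1[c] = i
    return SA1


print(direct_suffix_array([1, 3, 2, 0]))  # [3, 0, 2, 1]
```

Because no two suffixes share a first character, one pass over X1 is enough and no comparison of longer prefixes is needed.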
The above describes the preferred embodiments and the basic principle of the present invention in detail, but the invention is not limited to these embodiments. Those skilled in the art should recognize that various equivalent variations and substitutions can be made without departing from the spirit of the invention, and all such equivalent variations and substitutions fall within the scope of protection claimed by the invention.
Claims (8)
1. A method for parallel construction of a suffix array in constant working space, characterized by comprising the following steps:
S1, scanning an input string X from right to left and, according to the definition of suffix types, comparing the two adjacent characters X[i] and X[i+1] currently being scanned, to obtain the type of each character and suffix;
during the scan, finding the positions at which all LMS characters occur according to the S+L+S type-pattern definition of LMS substrings, thereby obtaining the first-character pointers of all LMS substrings in the string X, and recording them in an array P1;
S2, using the arrays P1 and SA to perform parallel induced sorting of all LMS substrings in the string X in constant working space, and storing the sorting results in SA1;
wherein SA is the array that records the suffix array of the string X, and SA1 is the array that records the sorting results;
S3, renaming all LMS substrings in the string X in parallel in constant working space according to the sorting results, to form a string X1;
S4, checking whether every character in the string X1 is unique; if so, directly sorting the suffixes of X1 to compute the suffix array of X1 and saving it in SA1; otherwise, taking the string X1 and SA1 as new input parameters in place of the string X and SA respectively, and recursively calling steps S1 and S2;
S5, according to the obtained suffix array of the string X1 stored in SA1, computing the suffix array of the string X by parallel induced sorting in constant working space, and saving it in SA.
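As an illustration of step S1, the following sequential sketch classifies every suffix as S-type or L-type in one right-to-left scan and collects the LMS positions into P1. It assumes, as is common in induced-sorting descriptions but not stated in the claim, that X ends with a unique smallest sentinel character; the function name `classify` is hypothetical.

```python
def classify(X):
    """Return (types, P1) for string X.

    types[i] is 'S' or 'L'; P1 lists the start positions of all LMS
    substrings in text order. Assumes X ends with a unique smallest
    sentinel (an assumption of this sketch).
    """
    n = len(X)
    types = ['S'] * n  # the sentinel suffix is S-type by convention
    for i in range(n - 2, -1, -1):  # right-to-left scan
        # suf(X, i) is L-type if it is larger than suf(X, i+1)
        if X[i] > X[i + 1] or (X[i] == X[i + 1] and types[i + 1] == 'L'):
            types[i] = 'L'
    # An LMS character is an S-type character whose left neighbour is L-type.
    P1 = [i for i in range(1, n) if types[i] == 'S' and types[i - 1] == 'L']
    return types, P1


types, P1 = classify("banana$")
print("".join(types), P1)  # LSLSLLS [1, 3, 6]
```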
2. The method for parallel construction of a suffix array in constant working space according to claim 1, characterized in that, in step S1, the string X is scanned from right to left using scan modes that include block-parallel scan, pipelined parallel scan and prefix-sum parallel scan.
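To make the prefix-sum scan concrete, here is a minimal sketch of how bucket boundaries could be derived: each block of X is counted independently, the per-block counts are merged, and a prefix sum over the merged counts yields the start and end position of every bucket. The thread pool only stands in for whatever parallel hardware is used, and the dictionary replaces the patent's bucket array B; both are assumptions of this sketch, as are the names `bucket_bounds` and `num_blocks`.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def bucket_bounds(X, num_blocks=4):
    """Map each character to the (start, end) slot range of its bucket in SA."""
    n = len(X)
    step = max(1, -(-n // num_blocks))  # ceiling division
    blocks = [X[i:i + step] for i in range(0, n, step)]
    # Each block is counted independently of the others.
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(Counter, blocks))
    total = Counter()
    for c in partial:
        total.update(c)
    alphabet = sorted(total)
    # Inclusive prefix sums give one-past-the-end of every bucket.
    ends = list(accumulate(total[ch] for ch in alphabet))
    starts = [0] + ends[:-1]
    return {ch: (s, e - 1) for ch, s, e in zip(alphabet, starts, ends)}


print(bucket_bounds("banana$"))
# {'$': (0, 0), 'a': (1, 3), 'b': (4, 4), 'n': (5, 6)}
```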
3. The method for parallel construction of a suffix array in constant working space according to claim 1, characterized in that, in step S2, using the arrays P1 and SA to perform parallel induced sorting of all LMS substrings in the string X in constant working space comprises the following steps:
S21, initializing all elements of SA to -1, and scanning, in prefix-sum parallel mode, the end position of each bucket in SA to which the suffixes of the string X belong, recording these positions in a bucket array B of size O(1); scanning the string X from right to left in pipelined parallel mode, inserting each scanned LMS suffix at the current end position of its bucket in SA, and moving the end position of that bucket one slot to the left;
S22, scanning, in prefix-sum parallel mode, the start position of each bucket in SA to which the suffixes of the string X belong, recording these positions in the bucket array B of size O(1), and performing block-parallel scan processing on SA:
scanning SA from left to right within the current block; for each scanned element SA[i] that is not -1, reading from SA the character X[SA[i]-1] that precedes X[SA[i]]; if this preceding character is L-type, recording the value SA[i]-1 as a sorting result at the current start position of the bucket in SA to which the suffix suf(X, SA[i]-1) belongs, and moving the start position of that bucket one slot to the right;
reading the sorting results of the previous block and recording them in SA;
reading all preceding characters of the next block and recording them in SA;
S23, scanning, in prefix-sum parallel mode, the end position of each bucket in SA to which the suffixes of the string X belong, recording these positions in the bucket array B of size O(1), and performing block-parallel scan processing on SA:
scanning SA from right to left within the current block; for each scanned element SA[i], reading from SA the character X[SA[i]-1] that precedes X[SA[i]]; if this preceding character is S-type, recording the value SA[i]-1 as a sorting result at the current end position of the bucket in SA to which the suffix suf(X, SA[i]-1) belongs, and moving the end position of that bucket one slot to the left;
reading the sorting results of the previous block and recording them in SA;
reading all preceding characters of the next block and recording them in SA.
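The three passes of steps S21 to S23 can be followed in the sequential sketch below. It deliberately omits the block-parallel and pipelined machinery and keeps bucket boundaries in Python dictionaries rather than in an O(1)-size array, so it shows only the order in which entries are induced, not the claimed space or parallelism. It also assumes X ends with a unique smallest sentinel. All function and variable names are hypothetical.

```python
def sort_lms_substrings(X):
    """Return the LMS positions of X in the order left by induced sorting."""
    n = len(X)
    s_type = [True] * n  # True = S-type, False = L-type
    for i in range(n - 2, -1, -1):
        s_type[i] = X[i] < X[i + 1] or (X[i] == X[i + 1] and s_type[i + 1])

    def is_lms(i):
        return i > 0 and s_type[i] and not s_type[i - 1]

    counts = {}
    for c in X:
        counts[c] = counts.get(c, 0) + 1

    def bounds():
        # Recompute the start (head) and end (tail) slot of every bucket.
        heads, tails, total = {}, {}, 0
        for c in sorted(counts):
            heads[c] = total
            total += counts[c]
            tails[c] = total - 1
        return heads, tails

    SA = [-1] * n
    # S21: drop every LMS suffix at the current end of its bucket.
    _, tails = bounds()
    for i in range(n - 1, 0, -1):
        if is_lms(i):
            SA[tails[X[i]]] = i
            tails[X[i]] -= 1
    # S22: left-to-right pass induces L-type suffixes from bucket starts.
    heads, _ = bounds()
    for k in range(n):
        j = SA[k] - 1
        if SA[k] > 0 and not s_type[j]:
            SA[heads[X[j]]] = j
            heads[X[j]] += 1
    # S23: right-to-left pass induces S-type suffixes from bucket ends.
    _, tails = bounds()
    for k in range(n - 1, -1, -1):
        j = SA[k] - 1
        if SA[k] > 0 and s_type[j]:
            SA[tails[X[j]]] = j
            tails[X[j]] -= 1
    return [i for i in SA if is_lms(i)]


print(sort_lms_substrings("banana$"))  # [6, 3, 1]
```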
4. The method for parallel construction of a suffix array in constant working space according to claim 1, characterized in that step S3, renaming all LMS substrings in the string X in parallel in constant working space according to the sorting results to form the string X1, comprises the following steps:
S31, partitioning the sorted LMS substrings in SA1 into blocks, and within each block comparing each pair of adjacent LMS substrings in turn from left to right; naming each LMS substring with the start position of its bucket in SA1, the start position of the first bucket being 0, thereby obtaining the local names of the LMS substrings in each block;
S32, using the prefix-sum method to count the global names available to each block, and replacing the local names of the LMS substrings in each block with global names, to form a string Z1;
S33, performing a block-parallel scan of the string Z1 from right to left; for each scanned character Z1[i], if Z1[i] is L-type, setting X1[i] = Z1[i], and otherwise setting X1[i] to the end position of the bucket in SA1 to which Z1[i] belongs; once Z1 has been fully scanned, X1 is the result of renaming every S-type character in Z1.
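The naming in steps S31 and S32 can be illustrated with a simplified sketch. Instead of per-block local names followed by a prefix sum, it flags every rank where a new bucket begins (each flag depends only on two adjacent substrings, so the flags could be computed independently) and then applies a running-maximum scan to obtain the bucket start for every rank. This is a substitute technique chosen for brevity, not the block scheme of the claim, and step S33 is not shown. It assumes X ends with a unique smallest sentinel; all names are hypothetical.

```python
from itertools import accumulate


def name_lms_substrings(X, sorted_lms):
    """Return Z1: for each LMS position in text order, its bucket-start name."""
    n = len(X)
    s_type = [True] * n
    for i in range(n - 2, -1, -1):
        s_type[i] = X[i] < X[i + 1] or (X[i] == X[i + 1] and s_type[i + 1])

    def is_lms(i):
        return i > 0 and s_type[i] and not s_type[i - 1]

    def substring(i):
        # LMS substring: from LMS position i up to and including the next one.
        j = i + 1
        while j < n and not is_lms(j):
            j += 1
        return X[i:j + 1]

    subs = [substring(i) for i in sorted_lms]
    # flags[k] = k where rank k opens a new bucket, 0 elsewhere.
    flags = [k if k == 0 or subs[k] != subs[k - 1] else 0
             for k in range(len(subs))]
    # Running maximum propagates each bucket start to all ranks in the bucket.
    starts = list(accumulate(flags, max))
    name = dict(zip(sorted_lms, starts))
    return [name[i] for i in sorted(sorted_lms)]


# LMS positions of "baabaabaab$" are 1, 4, 7, 10; positions 1 and 4 start
# the same substring "aaba" and therefore share the name 2.
print(name_lms_substrings("baabaabaab$", [10, 7, 1, 4]))  # [2, 2, 1, 0]
```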
5. The method for parallel construction of a suffix array in constant working space according to claim 1, characterized in that, in step S5, computing the suffix array of the string X by parallel induced sorting in constant working space according to the obtained suffix array of the string X1 stored in SA1 comprises the following steps:
S51, initializing all elements of SA to -1, and scanning, in prefix-sum parallel mode, the end position of each bucket in SA to which the suffixes of the string X belong, reusing the space of SA to record these positions; scanning the array SA1 from right to left in pipelined parallel mode and, for each scanned element SA1[i], placing P1[SA1[i]] at the current end position of the bucket in SA to which the suffix suf(X, P1[SA1[i]]) belongs, and moving the end position of that bucket one slot to the left;
S52, scanning, in prefix-sum parallel mode, the start position of each bucket in SA to which the suffixes of the string X belong, reusing the space of SA to record these positions, and performing block-parallel scan processing on SA:
scanning SA from left to right within the current block; for each scanned element SA[i] that is not -1, reading from SA the character X[SA[i]-1] that precedes X[SA[i]]; if this preceding character is L-type, recording the value SA[i]-1 as a sorting result at the current start position of the bucket in SA to which the suffix suf(X, SA[i]-1) belongs, and moving the start position of that bucket one slot to the right;
reading the sorting results of the previous block and recording them in SA;
reading all preceding characters of the next block and recording them in SA;
S53, scanning, in prefix-sum parallel mode, the end position of each bucket in SA to which the suffixes of the string X belong, reusing the space of SA to record these positions, and performing block-parallel scan processing on SA:
scanning SA from right to left within the current block; for each scanned element SA[i], reading from SA the character X[SA[i]-1] that precedes X[SA[i]]; if this preceding character is S-type, recording the value SA[i]-1 as a sorting result at the current end position of the bucket in SA to which the suffix suf(X, SA[i]-1) belongs, and moving the end position of that bucket one slot to the left;
reading the sorting results of the previous block and recording them in SA;
reading all preceding characters of the next block and recording them in SA.
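A sequential sketch of steps S51 to S53 is given below for orientation. It takes X, the LMS pointer array P1 and the suffix array SA1 of the reduced string, seeds SA with the LMS suffixes in their final relative order, and then induces the L-type and S-type suffixes. Unlike the claim, it keeps bucket boundaries in separate dictionaries instead of reusing the space of SA and runs on a single thread; it assumes X ends with a unique smallest sentinel. All names are hypothetical.

```python
def suffix_array_from_sa1(X, P1, SA1):
    """Induce the suffix array of X from P1 and the suffix array SA1 of X1."""
    n = len(X)
    s_type = [True] * n  # True = S-type, False = L-type
    for i in range(n - 2, -1, -1):
        s_type[i] = X[i] < X[i + 1] or (X[i] == X[i + 1] and s_type[i + 1])
    counts = {}
    for c in X:
        counts[c] = counts.get(c, 0) + 1

    def bounds():
        heads, tails, total = {}, {}, 0
        for c in sorted(counts):
            heads[c] = total
            total += counts[c]
            tails[c] = total - 1
        return heads, tails

    SA = [-1] * n
    # S51: scan SA1 right to left, placing P1[SA1[i]] at its bucket end.
    _, tails = bounds()
    for r in reversed(SA1):
        p = P1[r]
        SA[tails[X[p]]] = p
        tails[X[p]] -= 1
    # S52: left-to-right pass fills L-type suffixes from bucket starts.
    heads, _ = bounds()
    for k in range(n):
        j = SA[k] - 1
        if SA[k] > 0 and not s_type[j]:
            SA[heads[X[j]]] = j
            heads[X[j]] += 1
    # S53: right-to-left pass fills S-type suffixes from bucket ends.
    _, tails = bounds()
    for k in range(n - 1, -1, -1):
        j = SA[k] - 1
        if SA[k] > 0 and s_type[j]:
            SA[tails[X[j]]] = j
            tails[X[j]] -= 1
    return SA


X = "baabaabaab$"
SA = suffix_array_from_sa1(X, P1=[1, 4, 7, 10], SA1=[3, 2, 1, 0])
print(SA)  # [10, 7, 4, 1, 8, 5, 2, 9, 6, 3, 0]
```

The result can be checked against a naive sort of all suffixes, for example `sorted(range(len(X)), key=lambda i: X[i:])`.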
6. A system based on the method for parallel construction of a suffix array in constant working space according to any one of claims 1-5, characterized by comprising: a parallel induced sorting module (8), a storage unit (1), a front-end unit (2) and a resolution unit (3);
the parallel induced sorting module (8) performs parallel induced sorting in constant working space on input substrings or suffixes using the arrays P1 and SA, and returns the result;
the storage unit (1) stores the temporary data produced while generating the suffix array;
the front-end unit (2) generates the string X1 from the input string X by parallel induced sorting and renaming in constant working space, and writes X1 to the storage unit (1);
the resolution unit (3) reads the string X1 from the storage unit (1) and, using the suffix array of X1 stored in SA1, computes the suffix array of the string X by parallel induced sorting in constant working space and saves it in SA.
7. The system for parallel construction of a suffix array in constant working space according to claim 6, characterized in that the front-end unit (2) includes an array P1 computing module (4), an LMS substring sorting module (5), a string X1 generation module (6) and a string X1 decision module (7);
the array P1 computing module (4) reads the input string X and, according to the S+L+S type-pattern definition of LMS substrings, finds the positions at which all LMS characters occur, thereby obtaining the first-character pointers of all LMS substrings in the string X, and records them in the array P1;
the LMS substring sorting module (5) reads the array P1 from the storage unit (1), calls the parallel induced sorting module (8) to sort all LMS substrings in the string X, and stores the sorting results in SA1;
the string X1 generation module (6) reads the array SA from the storage unit (1) and, according to the sorting results, renames each LMS substring in the string X in parallel to generate the string X1;
the string X1 decision module (7) reads the string X1 from the storage unit (1) and judges whether every character of X1 is unique; if so, control passes to the resolution unit (3), otherwise the front-end unit (2) is called recursively.
8. The system for parallel construction of a suffix array in constant working space according to claim 7, characterized in that the resolution unit (3) includes a suffix array computing module (9), a suffix array generation module (10) and a suffix array storage unit (11);
the suffix array computing module (9) reads the string X1 from the storage unit (1), directly sorts the suffixes of X1 to compute the suffix array of X1, and writes it to the storage unit (1);
the suffix array generation module (10) reads the array SA1 from the storage unit (1) and calls the parallel induced sorting module (8) to sort all LMS suffixes in the string X, thereby obtaining the suffix array of the string X;
the suffix array storage unit (11) stores the suffix array of the string X.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810344030.2A CN108763170A (en) | 2018-04-17 | 2018-04-17 | The method and system of constant working space parallel construction Suffix array clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108763170A true CN108763170A (en) | 2018-11-06 |
Family ID: 64010834
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810344030.2A Pending CN108763170A (en) | 2018-04-17 | 2018-04-17 | The method and system of constant working space parallel construction Suffix array clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108763170A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102073740A (en) * | 2011-01-27 | 2011-05-25 | 农革 | String suffix array construction method on basis of radix sorting |
CN102081673A (en) * | 2011-01-27 | 2011-06-01 | 农革 | Suffix array construction method |
CN102521213A (en) * | 2011-12-01 | 2012-06-27 | 农革 | Construction method of linear time suffix arrays |
CN107015951A (en) * | 2017-03-24 | 2017-08-04 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | The correctness verification method and system of a kind of Suffix array clustering |
CN107015952A (en) * | 2017-03-24 | 2017-08-04 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | The correctness verification method and system of a kind of Suffix array clustering and most long common prefix |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109614510A (en) * | 2018-11-23 | 2019-04-12 | 腾讯科技(深圳)有限公司 | A kind of image search method, device, graphics processor and storage medium |
CN109614510B (en) * | 2018-11-23 | 2021-05-07 | 腾讯科技(深圳)有限公司 | Image retrieval method, image retrieval device, image processor and storage medium |
CN110852046A (en) * | 2019-10-18 | 2020-02-28 | 中山大学 | Block induction sequencing method and system for text suffix index |
CN110852046B (en) * | 2019-10-18 | 2021-11-05 | 中山大学 | Block induction sequencing method and system for text suffix index |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181106 |