CN108763170A - Method and system for parallel construction of a suffix array in constant working space

Publication number
CN108763170A
Authority
CN
China
Legal status
Pending
Application number
CN201810344030.2A
Other languages
Chinese (zh)
Inventor
劳斌
解静仪
徐文涛
农革
Current Assignee
Sun Yat Sen University
SYSU CMU Shunde International Joint Research Institute
Research Institute of Zhongshan University Shunde District Foshan
National Sun Yat Sen University
Original Assignee
SYSU CMU Shunde International Joint Research Institute
Research Institute of Zhongshan University Shunde District Foshan
National Sun Yat Sen University
Application filed by SYSU CMU Shunde International Joint Research Institute, Research Institute of Zhongshan University Shunde District Foshan, and National Sun Yat Sen University
Priority to CN201810344030.2A
Publication of CN108763170A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding

Abstract

The invention discloses a method and system for parallel construction of a suffix array in constant working space. The first-character pointers of all LMS substrings in a character string X are obtained and recorded in an array P1; P1 and SA are then used to perform parallel induced sorting of all LMS substrings of X in constant working space, yielding a character string X1. The input parameters used to construct SA are chosen according to whether the characters of X1 are unique. Finally, using the correspondence between X1 and its suffix array SA1, the suffix array of X is computed into SA by parallel induced sorting in constant working space. The invention reduces memory requirements and runs faster, achieves optimal time and space complexity, and is suitable for constructing suffix arrays of large-scale character strings.

Description

Method and system for parallel construction of a suffix array in constant working space
Technical field
The present invention relates to the field of suffix array construction for character strings, and in particular to a method and system for parallel construction of a suffix array in constant working space.
Background
A suffix array (SA) is a space-efficient alternative data structure to the suffix tree (ST). It is compact and occupies little space, and can implement algorithms equivalent to those on suffix trees in less space. Suffix arrays are commonly used to index character strings and can solve many string-processing tasks; they are widely used in applications such as full-text indexing and gene matching.
In recent years, the memory of general-purpose computers has grown steadily, making it possible to process large-scale text and genomic data quickly under an in-memory model. With the explosive growth of data volumes, however, existing serial methods and systems cannot meet the demand for fast processing of large-scale string data: they run slowly and require a large amount of memory, so they are unsuitable for computer systems with relatively little memory. In short, they can still construct the suffix array of a character string, but their time and space complexity is poor.
Summary of the invention
To solve the above problems, the object of the present invention is to provide a method and system for parallel construction of a suffix array in constant working space that reduces memory requirements, runs faster, achieves optimal time and space complexity, and is suitable for constructing suffix arrays of large-scale character strings.
To remedy the deficiencies of the prior art, the technical solution adopted by the present invention is as follows:
A method for parallel construction of a suffix array in constant working space, comprising the following steps:
S1. Scan an input character string X from right to left, comparing the two adjacent characters X[i] and X[i+1] currently being scanned according to the definition of suffix types, to obtain the type of each character and suffix;
During the scan, find the positions of all LMS characters according to the S+L+S type pattern that defines LMS substrings, thereby obtaining the first-character pointers of all LMS substrings in X, and record them in array P1;
S2. Using arrays P1 and SA, perform parallel induced sorting of all LMS substrings of X in constant working space, and store the sorting results in SA1;
where SA is the array that records the suffix array of X, and SA1 is the array that records the sorting results;
S3. According to the sorting results, rename all LMS substrings of X in parallel in constant working space, thereby forming character string X1;
S4. Check whether every character in X1 is unique. If so, compute the suffix array of X1 by directly sorting its suffixes and save it into SA1; otherwise, take X1 and SA1 as new input parameters in place of X and SA, and recursively invoke steps S1 and S2;
S5. From the suffix array of X1 stored in SA1, compute the suffix array of X by parallel induced sorting in constant working space, and save it into SA.
Further, in step S1 the character string X is scanned from right to left using block-parallel scanning, pipelined parallel scanning and prefix-sum parallel scanning.
Further, in step S2, performing parallel induced sorting of all LMS substrings of X in constant working space using arrays P1 and SA comprises the following steps:
S21. Initialize all elements of SA to -1, and use prefix-sum parallel scanning to compute the end position in SA of the bucket to which each suffix of X belongs, recording these positions in a bucket array B of size O(1). Then scan X from right to left in pipelined parallel fashion, inserting each LMS suffix encountered at the current end position of its bucket in SA and moving that bucket's end position one slot to the left;
S22. Use prefix-sum parallel scanning to compute the start position in SA of the bucket to which each suffix of X belongs, recording these positions in a bucket array B of size O(1), and process SA by block-parallel scanning:
Scan SA from left to right within the current block. For each scanned element SA[i] that is not -1, read, via SA, the character X[SA[i]-1] that precedes X[SA[i]]. If this preceding character is L-type, record the value SA[i]-1 as a sorting result at the current start position of the bucket in SA to which suffix suf(X, SA[i]-1) belongs, and move that bucket's start position one slot to the right;
Read the sorting results of the previous block and record them in SA;
Read all preceding characters of the next block and record them in SA;
S23. Use prefix-sum parallel scanning to compute the end position in SA of the bucket to which each suffix of X belongs, recording these positions in a bucket array B of size O(1), and process SA by block-parallel scanning:
Scan SA from right to left within the current block. For each scanned element SA[i], read, via SA, the character X[SA[i]-1] that precedes X[SA[i]]. If this preceding character is S-type, record the value SA[i]-1 as a sorting result at the current end position of the bucket in SA to which suffix suf(X, SA[i]-1) belongs, and move that bucket's end position one slot to the left;
Read the sorting results of the previous block and record them in SA;
Read all preceding characters of the next block and record them in SA.
Further, in step S3, renaming all LMS substrings of X in parallel in constant working space according to the sorting results, thereby forming character string X1, comprises the following steps:
S31. Partition the sorted LMS substrings in SA1 into blocks and, within each block, compare each pair of adjacent LMS substrings from left to right. Name each LMS substring by the start position of its bucket in SA1, with the first bucket starting at 0, which gives the local names of the LMS substrings in each block;
S32. Use the prefix-sum method to count the global names available to each block, and replace the local names of the LMS substrings in each block with global names, thereby forming character string Z1;
S33. Scan Z1 from right to left in block-parallel fashion. For each scanned character Z1[i], if Z1[i] is L-type, set X1[i] = Z1[i]; otherwise set X1[i] to the end position of the bucket in SA1 to which Z1[i] belongs. When the scan of Z1 is complete, X1 is the result of renaming every S-type character of Z1.
Further, in step S5, computing the suffix array of X by parallel induced sorting in constant working space from the suffix array of X1 stored in SA1 comprises the following steps:
S51. Initialize all elements of SA to -1, and use prefix-sum parallel scanning to compute the end position in SA of the bucket to which each suffix of X belongs, reusing space in SA to record these positions. Then scan array SA1 from right to left in pipelined parallel fashion: for each scanned element SA1[i], place P1[SA1[i]] at the current end position of the bucket in SA to which suffix suf(X, P1[SA1[i]]) belongs, and move that bucket's end position one slot to the left;
S52. Use prefix-sum parallel scanning to compute the start position in SA of the bucket to which each suffix of X belongs, reusing space in SA to record these positions, and process SA by block-parallel scanning:
Scan SA from left to right within the current block. For each scanned element SA[i] that is not -1, read, via SA, the character X[SA[i]-1] that precedes X[SA[i]]. If this preceding character is L-type, record the value SA[i]-1 as a sorting result at the current start position of the bucket in SA to which suffix suf(X, SA[i]-1) belongs, and move that bucket's start position one slot to the right;
Read the sorting results of the previous block and record them in SA;
Read all preceding characters of the next block and record them in SA;
S53. Use prefix-sum parallel scanning to compute the end position in SA of the bucket to which each suffix of X belongs, reusing space in SA to record these positions, and process SA by block-parallel scanning:
Scan SA from right to left within the current block. For each scanned element SA[i], read, via SA, the character X[SA[i]-1] that precedes X[SA[i]]. If this preceding character is S-type, record the value SA[i]-1 as a sorting result at the current end position of the bucket in SA to which suffix suf(X, SA[i]-1) belongs, and move that bucket's end position one slot to the left;
Read the sorting results of the previous block and record them in SA;
Read all preceding characters of the next block and record them in SA.
A system for parallel construction of a suffix array in constant working space, comprising: a parallel induced sorting module, a storage unit, a front-end unit and a resolution unit;
The parallel induced sorting module performs parallel induced sorting in constant working space on the input substrings or suffixes by means of arrays P1 and SA, and returns the result;
The storage unit stores temporary data produced while generating the suffix array;
The front-end unit generates character string X1 from the input character string X by parallel induced sorting and renaming in constant working space, and writes it to the storage unit;
The resolution unit reads X1 from the storage unit and, using the suffix array of X1 stored in SA1, computes the suffix array of X by parallel induced sorting in constant working space and saves it into SA.
Further, the front-end unit comprises an array P1 computing module, an LMS substring sorting module, a character string X1 generation module and a character string X1 decision module;
The array P1 computing module reads the input character string X and finds the positions of all LMS characters according to the S+L+S type pattern that defines LMS substrings, thereby obtaining the first-character pointers of all LMS substrings in X, and records them in array P1;
The LMS substring sorting module reads array P1 from the storage unit, calls the parallel induced sorting module to sort all LMS substrings of X, and stores the sorting results in SA1;
The character string X1 generation module reads array SA from the storage unit and renames each LMS substring of X in parallel according to the sorting results, generating character string X1;
The character string X1 decision module reads X1 from the storage unit and judges whether every character of X1 is unique; if so, control passes to the resolution unit, otherwise the front-end unit is invoked recursively.
Further, the resolution unit comprises a suffix array computing module, a suffix array generation module and a suffix array storage unit;
The suffix array computing module reads X1 from the storage unit, computes the suffix array of X1 by directly sorting its suffixes, and writes it to the storage unit;
The suffix array generation module reads array SA1 from the storage unit and calls the parallel induced sorting module to sort all LMS suffixes of X, thereby obtaining the suffix array of X;
The suffix array storage unit stores the suffix array of X.
The beneficial effects of the invention are as follows. The user only needs to input an arbitrary character string X over a constant character set. The invention obtains the first-character pointers of all LMS substrings in X and records them in array P1; it then uses P1 and SA to perform parallel induced sorting of all LMS substrings of X in constant working space, yielding character string X1. The input parameters used to construct SA are chosen according to whether the characters of X1 are unique, and finally, using the correspondence between X1 and its suffix array SA1, the suffix array of X is computed into SA by parallel induced sorting in constant working space. The invention therefore reduces memory requirements and runs faster, achieves optimal time and space complexity, and is suitable for constructing suffix arrays of large-scale character strings.
Brief description of the drawings
Preferred embodiments of the invention are given below with reference to the accompanying drawings, to describe the implementation of the invention in detail.
Fig. 1 is a flow chart of the method steps of the invention;
Fig. 2 is a schematic block diagram of the system structure of the invention.
Detailed description
The following technical terms are used in the description of the invention and are explained here:
Working space: the total space minus the space occupied by the character string X and its suffix array.
Constant working space: the total space minus the space occupied by an arbitrary character string X over a constant character set and its suffix array. A suffix array construction algorithm or system designed on this basis attains the theoretically optimal time and space complexity.
Character set: a character set Σ is a totally ordered set, i.e. any two distinct elements α and β of Σ can be compared, with either α < β or α > β. The elements of Σ are called characters, and the smallest character is '$'. The character set size considered in the invention may be a constant O(1) or an integer O(n).
Character string: a character string X of length n is an array X[0, n-1] formed by arranging n characters of Σ in order of position. The terminator of X is fixed as '$', and '$' does not occur at any other position in X.
Substring: the substring X[i, j], i ≤ j, of X denotes the segment of X from position i to position j, i.e. the string formed by characters X[i], X[i+1], ..., X[j].
Suffix: a suffix of X is a substring running from some position i to the terminator '$'. The suffix starting at the i-th character is denoted suf(X, i), i.e. suf(X, i) = X[i, n-1].
Character and suffix types: the characters of X are divided into two types, L and S.
(1) '$' is S-type;
(2) X[i], i ∈ [0, n-2], is S-type if and only if suf(X, i) < suf(X, i+1), i.e. X[i] < X[i+1], or X[i] = X[i+1] and X[i+1] is S-type;
(3) X[i], i ∈ [0, n-2], is L-type if and only if suf(X, i) > suf(X, i+1), i.e. X[i] > X[i+1], or X[i] = X[i+1] and X[i+1] is L-type.
Suffix suf(X, i) is S-type if and only if character X[i] is S-type, and suffix suf(X, i) is L-type if and only if character X[i] is L-type.
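The type rules (1) to (3) above can be applied in a single right-to-left pass. The following is a minimal sequential sketch in Python, not the parallel scan of the invention; the function name `classify_types` is hypothetical, and the input is assumed to end with the unique smallest terminator '$'.

```python
def classify_types(x):
    """Return one 'L' or 'S' flag per character of x.

    Assumes x ends with the unique smallest terminator '$'.
    """
    n = len(x)
    types = ["S"] * n                     # rule (1): the terminator is S-type
    for i in range(n - 2, -1, -1):        # scan from right to left
        if x[i] < x[i + 1]:
            types[i] = "S"                # rule (2), first case
        elif x[i] > x[i + 1]:
            types[i] = "L"                # rule (3), first case
        else:
            types[i] = types[i + 1]       # equal characters take the type on their right
    return "".join(types)


print(classify_types("baac$"))
```

Each character needs only its right neighbour's character and type, which is why the scan runs from right to left.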
LMS (Leftmost S-type) characters and suffixes:
(1) '$' is an LMS character;
(2) X[i], i ∈ [1, n-1], is an LMS character if and only if X[i] is S-type and X[i-1] is L-type;
(3) suffix suf(X, i) is an LMS suffix if and only if character X[i] is an LMS character.
LMS substrings and their S+L+S type pattern:
(1) '$' is an LMS substring;
(2) X[i, j] is an LMS substring if and only if 1 ≤ i < j < n, X[i] and X[j] are both LMS characters, and there is no other LMS character between X[i] and X[j].
An LMS substring therefore consists of three parts in order: one or more S-type characters, one or more L-type characters, and a single S-type character. This is called the S+L+S type pattern of LMS substrings.
Pointer array: the pointer array P1 records the positions in X of the first characters of all LMS substrings, i.e. P1[i] records the position in X of the first character of the (i+1)-th LMS substring of X, counting from left to right.
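As an illustration of the LMS definitions and the pointer array P1, the sketch below collects the LMS positions of a string from left to right. It is a plain sequential sketch under the assumption that X ends with the unique smallest terminator '$'; the name `lms_pointer_array` is hypothetical and the code does not reproduce the parallel scan of the invention.

```python
def lms_pointer_array(x):
    """Return P1: the positions of all LMS characters of x, left to right.

    Assumes x ends with the unique smallest terminator '$'.
    """
    n = len(x)
    is_s = [True] * n                      # the terminator is S-type
    for i in range(n - 2, -1, -1):         # right-to-left type scan
        is_s[i] = x[i] < x[i + 1] or (x[i] == x[i + 1] and is_s[i + 1])
    # An LMS character is an S-type character whose left neighbour is L-type.
    return [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]


print(lms_pointer_array("baac$"))
print(lms_pointer_array("mmiissiissiippii$"))
```

For X = baac$ the LMS characters are X[1] and the terminator X[4], so the LMS substrings are X[1, 4] = aac$ and $.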
String comparison: comparing two character strings means the usual lexicographic comparison. For two strings u and v, let i start from 0 and compare u[i] with v[i] in turn. If u[i] = v[i], increment i and compare the next u[i] and v[i]; otherwise, if u[i] < v[i] then u < v, and if u[i] > v[i] then u > v.
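The comparison rule can be written out directly. In the sketch below the function name `compare_strings` is hypothetical, and the handling of the case where one string is a proper prefix of the other is an added assumption: the text does not cover it, and it cannot arise between two distinct suffixes of a string with a unique terminator.

```python
def compare_strings(u, v):
    """Return -1 if u < v, 1 if u > v, and 0 if u == v (lexicographic order)."""
    i = 0
    while i < len(u) and i < len(v):
        if u[i] < v[i]:
            return -1
        if u[i] > v[i]:
            return 1
        i += 1                             # characters equal, compare the next pair
    # Assumption (not in the source): a proper prefix sorts before the longer string.
    return (len(u) > len(v)) - (len(u) < len(v))


print(compare_strings("aac$", "ac$"))
```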
Suffix array: the suffix array SA is a one-dimensional array of n integers satisfying suf(X, SA[i]) < suf(X, SA[i+1]) for i ∈ [0, n-2]; that is, the n suffixes of X are sorted in ascending order, and the positions in X of the first characters of the sorted suffixes are placed into SA from left to right.
Buckets and the bucket array: when all suffixes of X are sorted by their first character in array SA, all suffixes with the same first character occupy one contiguous region of SA, which is called the bucket of those suffixes. If X contains i distinct characters, SA is divided into i buckets, and the suffixes in each bucket all share the same first character. The bucket array B records the current position of each bucket in SA.
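Bucket boundaries follow from character counts by a running sum. The sketch below is a sequential illustration; the name `bucket_bounds` and the use of dictionaries in place of the bucket array B are assumptions made for readability.

```python
from collections import Counter


def bucket_bounds(x):
    """Return (starts, ends): the first and last SA index of each character's bucket."""
    counts = Counter(x)
    starts, ends = {}, {}
    total = 0
    for c in sorted(counts):               # buckets appear in character order
        starts[c] = total
        total += counts[c]
        ends[c] = total - 1
    return starts, ends


starts, ends = bucket_bounds("baac$")
print(starts)
print(ends)
```

Induced sorting uses the start positions when filling a bucket from the left (L-type suffixes) and the end positions when filling it from the right (S-type suffixes).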
Using the terms above, an example of constructing the suffix array of a string follows. Let X = baac$ with length n = 5; then suf(X, 0) = baac$, suf(X, 1) = aac$, suf(X, 2) = ac$, suf(X, 3) = c$ and suf(X, 4) = $. By the definition of string comparison, suf(X, 4) < suf(X, 1) < suf(X, 2) < suf(X, 0) < suf(X, 3). By the definition of the suffix array, SA[0] = 4, SA[1] = 1, SA[2] = 2, SA[3] = 0, SA[4] = 3, i.e. SA = [4, 1, 2, 0, 3].
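The example can be checked with a naive construction that sorts suffix start positions by comparing the suffixes directly. This is only a reference check, not the method of the invention: it uses far more than constant working space, and the name `naive_suffix_array` is hypothetical. It relies on '$' comparing smaller than the letters, which holds for ASCII.

```python
def naive_suffix_array(x):
    """Sort suffix start positions by the suffixes themselves (reference only)."""
    return sorted(range(len(x)), key=lambda i: x[i:])


print(naive_suffix_array("baac$"))
```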
Referring to Fig. 1, the method for parallel construction of a suffix array in constant working space comprises the following steps:
S1. Scan an input character string X from right to left, comparing the two adjacent characters X[i] and X[i+1] currently being scanned according to the definition of suffix types, to obtain the type of each character and suffix;
During the scan, find the positions of all LMS characters according to the S+L+S type pattern that defines LMS substrings, thereby obtaining the first-character pointers of all LMS substrings in X, and record them in array P1;
S2. Using arrays P1 and SA, perform parallel induced sorting of all LMS substrings of X in constant working space, and store the sorting results in SA1;
where SA is an array that records the suffix array of X, and SA1 is an array that records the sorting results;
S3. According to the sorting results, rename all LMS substrings of X in parallel in constant working space, thereby forming character string X1;
S4. Check whether every character in X1 is unique. If so, compute the suffix array of X1 by directly sorting its suffixes and save it into SA1; otherwise, take X1 and SA1 as new input parameters in place of X and SA, and recursively invoke steps S1 and S2;
S5. From the suffix array of X1 stored in SA1, compute the suffix array of X by parallel induced sorting in constant working space, and save it into SA.
Specifically, the user only needs to input an arbitrary character string X over a constant character set. The invention obtains the first-character pointers of all LMS substrings in X and records them in array P1; it then uses P1 and SA to perform parallel induced sorting of all LMS substrings of X in constant working space, yielding character string X1. The input parameters used to construct SA are chosen according to whether the characters of X1 are unique, and finally, using the correspondence between X1 and its suffix array SA1, the suffix array of X is computed into SA by parallel induced sorting in constant working space. The invention therefore reduces memory requirements and runs faster, achieves optimal time and space complexity, and is suitable for constructing suffix arrays of large-scale character strings.
In step S1, the character string X is scanned from right to left using block-parallel scanning, pipelined parallel scanning and prefix-sum parallel scanning. These three scan modes are common string scanning techniques and are not described further here.
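Since the scan modes are not elaborated in the text, the following sketch shows one common form of a blocked prefix sum, offered only as an illustration of the general technique and not as the scheme used by the invention. The function name `blocked_prefix_sum`, the block size and the use of a thread pool are all assumptions. Applied to per-character counts, an inclusive prefix sum minus one gives the bucket end positions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def blocked_prefix_sum(values, block_size=4):
    """Inclusive prefix sum computed block by block (illustrative sketch)."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]

    def scan_block(args):
        block, offset = args
        # Local inclusive scan, shifted by the total of all earlier blocks.
        return [offset + s for s in accumulate(block)]

    with ThreadPoolExecutor() as pool:
        # Pass 1: each block is summed independently.
        block_totals = list(pool.map(sum, blocks))
        # Exclusive scan over the block totals gives each block's starting offset.
        offsets = [0] + list(accumulate(block_totals))[:-1]
        # Pass 2: each block is scanned independently using its offset.
        scanned = list(pool.map(scan_block, zip(blocks, offsets)))
    return [v for block in scanned for v in block]


# Character counts of "baac$" in character order: $, a, b, c.
prefix = blocked_prefix_sum([1, 2, 1, 1], block_size=2)
print([p - 1 for p in prefix])             # bucket end positions
```

The two passes over the blocks are independent of one another within each pass, which is what makes the scheme suitable for parallel execution; only the short scan over block totals is sequential.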
In step S2, performing parallel induced sorting of all LMS substrings of X in constant working space using arrays P1 and SA comprises the following steps:
S21. Initialize all elements of SA to -1, and use prefix-sum parallel scanning to compute the end position in SA of the bucket to which each suffix of X belongs, recording these positions in a bucket array B of size O(1). Then scan X from right to left in pipelined parallel fashion, inserting each LMS suffix encountered at the current end position of its bucket in SA and moving that bucket's end position one slot to the left;
S22. Use prefix-sum parallel scanning to compute the start position in SA of the bucket to which each suffix of X belongs, recording these positions in a bucket array B of size O(1), and process SA by block-parallel scanning:
Scan SA from left to right within the current block. For each scanned element SA[i] that is not -1, read, via SA, the character X[SA[i]-1] that precedes X[SA[i]]. If this preceding character is L-type, record the value SA[i]-1 as a sorting result at the current start position of the bucket in SA to which suffix suf(X, SA[i]-1) belongs, and move that bucket's start position one slot to the right;
Read the sorting results of the previous block and record them in SA;
Read all preceding characters of the next block and record them in SA;
S23. Use prefix-sum parallel scanning to compute the end position in SA of the bucket to which each suffix of X belongs, recording these positions in a bucket array B of size O(1), and process SA by block-parallel scanning:
Scan SA from right to left within the current block. For each scanned element SA[i], read, via SA, the character X[SA[i]-1] that precedes X[SA[i]]. If this preceding character is S-type, record the value SA[i]-1 as a sorting result at the current end position of the bucket in SA to which suffix suf(X, SA[i]-1) belongs, and move that bucket's end position one slot to the left;
Read the sorting results of the previous block and record them in SA;
Read all preceding characters of the next block and record them in SA.
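The three passes S21 to S23 can be sketched sequentially as follows. This is a simplified single-threaded illustration of induced sorting of LMS substrings: it keeps an explicit type array and dictionary-based bucket pointers, so it does not use constant working space, and it omits the block-parallel and pipelined scheduling. The function name `induced_sort_lms` is hypothetical.

```python
def induced_sort_lms(x):
    """Return the LMS positions of x in the order produced by passes S21 to S23."""
    n = len(x)
    is_s = [True] * n
    for i in range(n - 2, -1, -1):
        is_s[i] = x[i] < x[i + 1] or (x[i] == x[i + 1] and is_s[i + 1])

    def is_lms(i):
        return i > 0 and is_s[i] and not is_s[i - 1]

    counts = {}
    for c in x:
        counts[c] = counts.get(c, 0) + 1

    def bucket_positions(end):
        pos, total = {}, 0
        for c in sorted(counts):
            pos[c] = total + counts[c] - 1 if end else total
            total += counts[c]
        return pos

    sa = [-1] * n
    # S21: place each LMS suffix at the current end of its bucket.
    b = bucket_positions(end=True)
    for i in range(n - 1, -1, -1):
        if is_lms(i):
            sa[b[x[i]]] = i
            b[x[i]] -= 1
    # S22: left-to-right scan, inducing L-type suffixes at bucket starts.
    b = bucket_positions(end=False)
    for i in range(n):
        j = sa[i] - 1
        if sa[i] > 0 and not is_s[j]:
            sa[b[x[j]]] = j
            b[x[j]] += 1
    # S23: right-to-left scan, inducing S-type suffixes at bucket ends.
    b = bucket_positions(end=True)
    for i in range(n - 1, -1, -1):
        j = sa[i] - 1
        if sa[i] > 0 and is_s[j]:
            sa[b[x[j]]] = j
            b[x[j]] -= 1
    # The LMS positions now appear in SA in sorted order of their LMS substrings.
    return [p for p in sa if is_lms(p)]


print(induced_sort_lms("baac$"))
print(induced_sort_lms("mmiissiissiippii$"))
```

Equal LMS substrings, such as the two occurrences of iissi in the second example, end up adjacent but are not yet ordered as suffixes; resolving them is the purpose of the renaming and recursion in steps S3 and S4.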
Further, in step S3, renaming all LMS substrings of X in parallel in constant working space according to the sorting results, thereby forming character string X1, comprises the following steps:
S31. Partition the sorted LMS substrings in SA1 into blocks and, within each block, compare each pair of adjacent LMS substrings from left to right. Name each substring by the start position of its bucket in SA1, with the first bucket starting at 0, which gives the local names of the LMS substrings in each block;
S32. Use the prefix-sum method to count the global names available to each block, and replace the local names of the LMS substrings in each block with global names, thereby forming character string Z1;
S33. Scan Z1 from right to left in block-parallel fashion. For each scanned character Z1[i], if Z1[i] is L-type, set X1[i] = Z1[i]; otherwise set X1[i] to the end position of the bucket in SA1 to which Z1[i] belongs. When the scan of Z1 is complete, X1 is the result of renaming every S-type character of Z1.
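The naming in S31 to S33 can be illustrated with a sequential sketch that skips the block partitioning and assigns global names directly, so the local-to-global step of S32 collapses into a single pass. The function name `name_lms_substrings` and its parameters are hypothetical; `sorted_lms` is assumed to hold the LMS positions in sorted order of their LMS substrings, and `p1` the same positions in text order.

```python
from collections import Counter


def name_lms_substrings(x, sorted_lms, p1):
    """Return (Z1, X1) for string x, given sorted LMS positions and P1."""
    # Each LMS substring runs from its LMS character to the next one, inclusive;
    # the last LMS substring is the terminator alone.
    end_of = {p: (p1[k + 1] if k + 1 < len(p1) else p) for k, p in enumerate(p1)}

    def substring(p):
        return x[p:end_of[p] + 1]

    # S31/S32: name each substring by the start position of its bucket.
    name_of, bucket_start = {}, 0
    for rank, p in enumerate(sorted_lms):
        if rank > 0 and substring(p) != substring(sorted_lms[rank - 1]):
            bucket_start = rank
        name_of[p] = bucket_start
    z1 = [name_of[p] for p in p1]

    # S33: L-type characters keep their name, S-type characters move to the
    # end position of their bucket.
    bucket_size = Counter(z1)
    x1 = list(z1)
    s_type = True                          # the last character is S-type
    for i in range(len(z1) - 1, -1, -1):
        if i < len(z1) - 1:
            s_type = z1[i] < z1[i + 1] or (z1[i] == z1[i + 1] and s_type)
        if s_type:
            x1[i] = z1[i] + bucket_size[z1[i]] - 1
    return z1, x1


z1, x1 = name_lms_substrings("mmiissiissiippii$", [16, 10, 2, 6], [2, 6, 10, 16])
print(z1, x1)
print(len(set(x1)) == len(x1))             # uniqueness check of step S4
```

In this example the two equal substrings iissi share a name, so the characters of X1 are not all unique and step S4 would recurse.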
In step S5, computing the suffix array of X by parallel induced sorting in constant working space from the suffix array of X1 stored in SA1 comprises the following steps:
S51. Initialize all elements of SA to -1, and use prefix-sum parallel scanning to compute the end position in SA of the bucket to which each suffix of X belongs, reusing space in SA to record these positions. Then scan array SA1 from right to left in pipelined parallel fashion: for each scanned element SA1[i], place P1[SA1[i]] at the current end position of the bucket in SA to which suffix suf(X, P1[SA1[i]]) belongs, and move that bucket's end position one slot to the left;
S52. Use prefix-sum parallel scanning to compute the start position in SA of the bucket to which each suffix of X belongs, reusing space in SA to record these positions, and process SA by block-parallel scanning:
Scan SA from left to right within the current block. For each scanned element SA[i] that is not -1, read, via SA, the character X[SA[i]-1] that precedes X[SA[i]]. If this preceding character is L-type, record the value SA[i]-1 as a sorting result at the current start position of the bucket in SA to which suffix suf(X, SA[i]-1) belongs, and move that bucket's start position one slot to the right;
Read the sorting results of the previous block and record them in SA;
Read all preceding characters of the next block and record them in SA;
S53. Use prefix-sum parallel scanning to compute the end position in SA of the bucket to which each suffix of X belongs, reusing space in SA to record these positions, and process SA by block-parallel scanning:
Scan SA from right to left within the current block. For each scanned element SA[i], read, via SA, the character X[SA[i]-1] that precedes X[SA[i]]. If this preceding character is S-type, record the value SA[i]-1 as a sorting result at the current end position of the bucket in SA to which suffix suf(X, SA[i]-1) belongs, and move that bucket's end position one slot to the left;
Read the sorting results of the previous block and record them in SA;
Read all preceding characters of the next block and record them in SA.
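Steps S51 to S53 can be sketched sequentially in the same way. The sketch below takes X, the pointer array P1 and the suffix array SA1 of X1, and induces the suffix array of X. It is a single-threaded illustration with an explicit type array and dictionary-based bucket pointers, so it does not reuse SA for bucket bookkeeping and does not run in constant working space; the function name `induce_suffix_array` is hypothetical.

```python
def induce_suffix_array(x, p1, sa1):
    """Induce the suffix array of x from P1 and the suffix array SA1 of X1."""
    n = len(x)
    is_s = [True] * n
    for i in range(n - 2, -1, -1):
        is_s[i] = x[i] < x[i + 1] or (x[i] == x[i + 1] and is_s[i + 1])

    counts = {}
    for c in x:
        counts[c] = counts.get(c, 0) + 1

    def bucket_positions(end):
        pos, total = {}, 0
        for c in sorted(counts):
            pos[c] = total + counts[c] - 1 if end else total
            total += counts[c]
        return pos

    sa = [-1] * n
    # S51: scan SA1 right to left, placing sorted LMS suffixes at bucket ends.
    b = bucket_positions(end=True)
    for i in range(len(sa1) - 1, -1, -1):
        p = p1[sa1[i]]
        sa[b[x[p]]] = p
        b[x[p]] -= 1
    # S52: left-to-right scan, inducing L-type suffixes at bucket starts.
    b = bucket_positions(end=False)
    for i in range(n):
        j = sa[i] - 1
        if sa[i] > 0 and not is_s[j]:
            sa[b[x[j]]] = j
            b[x[j]] += 1
    # S53: right-to-left scan, inducing S-type suffixes at bucket ends.
    b = bucket_positions(end=True)
    for i in range(n - 1, -1, -1):
        j = sa[i] - 1
        if sa[i] > 0 and is_s[j]:
            sa[b[x[j]]] = j
            b[x[j]] -= 1
    return sa


# For X = baac$: P1 = [1, 4], X1 = [1, 0], and the suffix array of X1 is [1, 0].
print(induce_suffix_array("baac$", [1, 4], [1, 0]))
```

For X = baac$ this reproduces SA = [4, 1, 2, 0, 3] from the worked example in the definitions.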
Referring to Fig. 2, the system for parallel construction of a suffix array in constant working space comprises: a parallel induced sorting module 8, a storage unit 1, a front-end unit 2 and a resolution unit 3;
The parallel induced sorting module 8 performs parallel induced sorting in constant working space on the input substrings or suffixes by means of arrays P1 and SA, and returns the result;
The storage unit 1 stores temporary data produced while generating the suffix array, such as array P1, SA1 and character string X1;
The front-end unit 2 generates character string X1 from the input character string X by parallel induced sorting and renaming in constant working space, and writes it to storage unit 1;
The resolution unit 3 reads X1 from storage unit 1 and, using the suffix array of X1 stored in SA1, computes the suffix array of X by parallel induced sorting in constant working space and saves it into SA.
Specifically, the parallel induced sorting module 8 is a module that both the front-end unit 2 and the resolution unit 3 can call. The storage unit 1 ensures that data produced during construction is not lost and makes it easy for the front-end unit 2 and the resolution unit 3 to read or call it.
The front-end unit 2 comprises an array P1 computing module 4, an LMS substring sorting module 5, a character string X1 generation module 6 and a character string X1 decision module 7;
The array P1 computing module 4 is used to read the input character string X and find the positions of all LMS characters according to the S+L+S type-pattern definition of LMS substrings, so as to obtain the first-character pointers of all LMS substrings in character string X and record them in array P1;
The LMS substring sorting module 5 is used to read array P1 from storage unit 1 and call the parallel induced sorting module 8 to sort all LMS substrings in character string X, storing the sorting results in SA1;
The character string X1 generation module 6 is used to read array SA from storage unit 1 and rename each LMS substring in character string X in parallel according to the sorting results, thereby generating character string X1;
The character string X1 decision module 7 is used to read character string X1 from storage unit 1 and judge whether every character of X1 is unique; if so, control passes to the resolution unit 3; otherwise the front-end unit 2 is called recursively.
The resolution unit 3 comprises a suffix array computing module 9, a suffix array generation module 10 and a suffix array storage unit 11;
The suffix array computing module 9 is used to read character string X1 from storage unit 1, directly sort the suffixes of X1 to compute the suffix array of X1, and write it to the storage unit;
The suffix array generation module 10 is used to read array SA1 from storage unit 1 and call the parallel induced sorting module 8 to sort all LMS suffixes in X, thereby obtaining the suffix array of X;
The suffix array storage unit 11 is used to store the suffix array of character string X.
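The uniqueness check of decision module 7 and the direct sort of computing module 9 can be sketched as follows. This is a minimal illustration under the assumption that, once every character of X1 is unique, the names in X1 form a permutation of 0..n1-1, so each suffix is ranked by its first character alone. The function names are hypothetical.

```python
def all_characters_unique(x1):
    """True when no character of X1 occurs twice (the recursion can stop)."""
    return len(set(x1)) == len(x1)


def suffix_array_of_unique(x1):
    """Suffix array of X1 when X1 is a permutation of 0..n1-1.

    Distinct first characters fully determine the suffix order,
    so the suffix starting at i simply goes to rank x1[i].
    """
    sa1 = [0] * len(x1)
    for i, c in enumerate(x1):
        sa1[c] = i
    return sa1


x1 = [2, 0, 3, 1]
if all_characters_unique(x1):
    sa1 = suffix_array_of_unique(x1)
```

When the check fails, the front-end unit would be invoked again with X1 as its input, as described above.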
The above describes the preferred embodiments and basic principles of the present invention in detail, but the invention is not limited to these embodiments. Those skilled in the art should recognize that various equivalent variations and substitutions are possible without departing from the spirit of the invention, and all such equivalent variations and substitutions fall within the scope of protection claimed.

Claims (8)

1. the method for constant working space parallel construction Suffix array clustering, which is characterized in that include the following steps:
S1, the character string X for scanning an input from right to left, according to the definition of suffix type compare two of Current Scan it is adjacent Character X [i] and X [i+1], to obtain the type of each character and suffix;
In scanning process, according to the S of LMS substrings+L+S type-schemes, which define, finds out the position that all LMS characters occur, to The initial character pointer of all LMS substrings in character string X is obtained, and is recorded in array P1;
S2, it is arranged come the parallel conclusion carried out to LMS substrings all in character string X in constant working space by array P1 and SA Sequence, and ranking results are stored in SA1;
Wherein, SA is the Suffix array clustering for recording character string X;SA1 is the Suffix array clustering for record ordering result;
S3, according to ranking results LMS substrings all in parallel renaming character string X in constant working space, to be formed Character string X1;
S4, check whether each character in character string X1 is unique, if then each suffix of direct sequencing character string X1 calculates The Suffix array clustering of character string X1, preserve in SA1, otherwise substituted using character string X1 and SA1 as new input parameter character string X and SA, respectively recursive call to step S1 and S2;
S5, the Suffix array clustering according to the character string X1 being stored in SA1 of acquisition are concluded in constant working space and are calculated parallel The Suffix array clustering of character string X is preserved into SA.
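Step S1 of claim 1 can be illustrated with a minimal sequential sketch. It assumes the usual suffix-type definition (X[i] is S-type if X[i] < X[i+1], or if X[i] = X[i+1] and X[i+1] is S-type; otherwise L-type), an input that ends with a unique smallest sentinel, and an LMS character defined as an S-type character whose left neighbour is L-type. The function name is hypothetical, and the parallel scan modes of the claims are not modelled.

```python
def classify_and_find_lms(x):
    """Return (is_s, p1): per-position S-type flags and the LMS start pointers."""
    n = len(x)
    is_s = [False] * n
    is_s[n - 1] = True                      # the sentinel suffix is S-type by convention
    for i in range(n - 2, -1, -1):          # right-to-left scan comparing X[i] and X[i+1]
        is_s[i] = x[i] < x[i + 1] or (x[i] == x[i + 1] and is_s[i + 1])
    # An LMS character is S-type with an L-type character immediately to its left.
    p1 = [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]
    return is_s, p1


# "banana$" encoded as $=0, a=1, b=2, n=3
is_s, p1 = classify_and_find_lms([2, 1, 3, 1, 3, 1, 0])
```

For this input the types are L S L S L L S, and P1 holds the pointers 1, 3 and 6.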
2. the method for constant working space parallel construction Suffix array clustering according to claim 1, which is characterized in that the step Scan a character string X in rapid S1 from right to left, used scan mode include block parallel scanning, flowing water parallel scan with And prefix and parallel scan.
3. the method for constant working space parallel construction Suffix array clustering according to claim 1, which is characterized in that the step In rapid S2, arranged come the parallel conclusion carried out to LMS substrings all in character string X in constant working space by array P1 and SA Sequence includes the following steps:
S21, all elements for initializing SA are -1, and scan in character string X all suffix in SA with prefix and parallel mode In affiliated each bucket end position, be recorded in size be O (1) barrelage group B in;Flowing water parallel scan character string X from right to left, Each LMS suffix scanned is inserted successively the current end position of its affiliated bucket in SA, and by the end position of this barrel to Move left a lattice;
S22, the initial position that all suffix affiliated each bucket in SA in character string X is scanned with prefix and parallel mode, record Block parallel scan process is carried out in the barrelage group B that size is O (1), and to SA:
Scan SA from left to right in current block, for each of scan be not -1 element S A [i], X [SA are read from SA [i]] it is preceding after character X [SA [i] -1];If this it is preceding after character be L-type, by the value of SA [i] -1 and suf (X, SA [i] -1) Suffix be recorded in SA as ranking results the current initial position of affiliated bucket in SA, and to the right by the initial position of this barrel A mobile lattice;
It reads previous piece of ranking results and is recorded in SA;
It reads before latter piece all after character and is recorded in SA;
S23, the end position that all suffix affiliated each bucket in SA in character string X is scanned with prefix and parallel mode, record Block parallel scan process is carried out in the barrelage group B that size is O (1), and to SA:
SA is scanned from right to left in current block, for each element S A [i] scanned, before reading X [SA [i]] in SA After character X [SA [i] -1];If it after character is S types that this is preceding, the suffix of the value of SA [i] -1 and suf (X, SA [i] -1) is existed The current end position of affiliated bucket is recorded in as ranking results in SA in SA, and the end position of this barrel is moved to the left one Lattice;
It reads previous piece of ranking results and is recorded in SA;
It reads before latter piece all after character and is recorded in SA.
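Steps S21 and S22 of claim 3 can be sketched sequentially as follows. The sketch computes bucket start and end positions with a prefix sum over character counts, places the LMS suffixes at bucket ends, and then induces the L-type entries with a left-to-right scan. It is an illustration only: it runs on one thread, keeps an explicit bucket array of size `sigma`, and uses hypothetical function names and an integer alphabet with sentinel 0.

```python
def bucket_bounds(x, sigma):
    """Start and end slot of each character's bucket, from a prefix sum of counts."""
    counts = [0] * sigma
    for c in x:
        counts[c] += 1
    starts, ends, total = [], [], 0
    for c in range(sigma):
        starts.append(total)        # exclusive prefix sum gives the bucket start
        total += counts[c]
        ends.append(total - 1)      # inclusive prefix sum minus one gives the bucket end
    return starts, ends


def place_lms_and_induce_l(x, is_s, p1, sigma):
    n = len(x)
    sa = [-1] * n                   # S21: initialise every element of SA to -1
    starts, ends = bucket_bounds(x, sigma)
    for p in reversed(p1):          # S21: right-to-left, insert LMS suffixes at bucket ends
        sa[ends[x[p]]] = p
        ends[x[p]] -= 1
    for i in range(n):              # S22: left-to-right scan of SA
        j = sa[i] - 1
        if sa[i] > 0 and not is_s[j]:   # skip empty slots; keep only L-type predecessors
            sa[starts[x[j]]] = j
            starts[x[j]] += 1       # move the bucket start one slot to the right
    return sa


# "banana$" encoded as $=0, a=1, b=2, n=3, with types and LMS pointers from step S1
x = [2, 1, 3, 1, 3, 1, 0]
is_s = [False, True, False, True, False, False, True]
sa = place_lms_and_induce_l(x, is_s, [1, 3, 6], sigma=4)
```

After these two passes the L-type entries are in their final positions; the S-type entries are still provisional until the right-to-left pass of S23 rewrites them.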
4. the method for constant working space parallel construction Suffix array clustering according to claim 1, which is characterized in that the step Rapid S3, according to ranking results LMS substrings all in parallel renaming character string X in constant working space, to form word Symbol string X1, includes the following steps:S31, ordering LMS substrings in SA1 are subjected to piecemeal, in each piecemeal from left to right according to The size of secondary two more adjacent LMS substrings;For each LMS substrings, ordered with the initial position of its affiliated bucket in SA1 Name, the initial position of first bucket is since 0 to get to the local name of LMS substrings in each piecemeal;
S32, using workable global title in prefix and each piecemeal of method statistic, by the local name of LMS substrings in each piecemeal Global title is replaced with, to form character string Z1;
S33, block parallel scanning is carried out to character string Z1 from right to left, for each character Z1 [i] being scanned, if Z1 [i] is L-type, then enables X1 [i]=Z1 [i], X1 [i] is otherwise set as to the end position of Z1 [i] affiliated buckets in SA1;It is to be scanned After Z1, X1 is to the result after each S ocra font ocrs renaming in Z1.
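The local-then-global naming of S31 and S32 can be illustrated with a simplified sketch. It assigns dense ranks (0, 1, 2, ...) rather than the bucket start positions used in the claim, and it takes the sorted LMS substrings as a plain list of tuples; both are assumptions made to keep the example short. Each block numbers its substrings locally, and an exclusive prefix sum over the per-block totals turns local names into global ones.

```python
from itertools import accumulate


def name_blockwise(sorted_subs, block_size):
    """Name sorted substrings so that equal neighbours share a name."""
    n = len(sorted_subs)
    # flag[i] = 1 when substring i differs from its left neighbour (a new name starts)
    flags = [int(i > 0 and sorted_subs[i] != sorted_subs[i - 1]) for i in range(n)]
    blocks = [flags[b:b + block_size] for b in range(0, n, block_size)]
    local = [list(accumulate(blk)) for blk in blocks]     # local names inside each block
    totals = [loc[-1] for loc in local]                   # new names introduced per block
    offsets = [0] + list(accumulate(totals))[:-1]         # exclusive prefix sum over blocks
    return [off + v for off, loc in zip(offsets, local) for v in loc]


subs = [(0,), (1, 3, 1), (1, 3, 1), (2, 1), (3, 1, 0)]
names = name_blockwise(subs, block_size=2)
```

Each block's work depends only on its own flags, so in a parallel setting the blocks could be processed independently, with the prefix sum over `totals` as the single synchronisation point.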
5. the method for constant working space parallel construction Suffix array clustering according to claim 1, which is characterized in that step S5 In, according to the Suffix array clustering for the character string X1 of acquisition being stored in SA1, calculating character is concluded parallel in constant working space The Suffix array clustering of string X, includes the following steps:
S51, all elements for initializing SA are -1, and all suffix are scanned in character string X in SA with prefix and parallel mode The end position of affiliated each bucket is multiplexed the space of SA to record;Flowing water parallel scan array SA1 from right to left, to each scanning P1 [SA1 [i]] is placed on suffix suf (X, P1 [SA1 [i]]) current stop bits of affiliated bucket in SA by the element S A1 [i] arrived It sets, and the end position of this barrel is moved to the left a lattice;
S52, the end position that all suffix affiliated each bucket in SA in character string X is scanned with prefix and parallel mode, multiplexing The space of SA is recorded, and block parallel scan process is carried out to SA:
Scan SA from left to right in current block, for each of scan be not -1 element S A [i], X [SA are read from SA [i]] it is preceding after character X [SA [i] -1];If this it is preceding after character be L-type, by the value of SA [i] -1 and suf (X, SA [i] -1) Suffix be recorded in SA as ranking results the current initial position of affiliated bucket in SA, and to the right by the initial position of this barrel A mobile lattice;
It reads previous piece of ranking results and is recorded in SA;
It reads before latter piece all after character and is recorded in SA;
S53, the end position that all suffix affiliated each bucket in SA in character string X is scanned with prefix and parallel mode, multiplexing The space of SA is recorded, and block parallel scan process is carried out to SA:
SA is scanned from right to left in current block, for each element S A [i] scanned, before reading X [SA [i]] in SA After character X [SA [i] -1];If it after character is S types that this is preceding, the suffix of the value of SA [i] -1 and suf (X, SA [i] -1) is existed The current end position of affiliated bucket is recorded in as ranking results in SA in SA, and the end position of this barrel is moved to the left one Lattice;
It reads previous piece of ranking results and is recorded in SA;
It reads before latter piece all after character and is recorded in SA.
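The placement in S51 of claim 5 can be sketched as below. The bucket end positions are assumed to be precomputed and passed in as a list, which differs from the claim, where they are recorded by reusing the space of SA. The function name and the integer encoding of X are hypothetical.

```python
def place_sorted_lms(x, p1, sa1, ends):
    """Put the LMS suffixes of X into SA in the order given by SA1."""
    ends = list(ends)                   # work on a copy of the bucket end positions
    sa = [-1] * len(x)                  # initialise every element of SA to -1
    for i in range(len(sa1) - 1, -1, -1):   # scan SA1 from right to left
        p = p1[sa1[i]]                  # map a position in X1 back to a position in X
        sa[ends[x[p]]] = p              # place it at the current end of its bucket
        ends[x[p]] -= 1                 # move the bucket end one slot to the left
    return sa


# "banana$" encoded as $=0, a=1, b=2, n=3.
# P1 lists the LMS positions 1, 3, 6; SA1 orders them as suffix 6 < suffix 3 < suffix 1.
x = [2, 1, 3, 1, 3, 1, 0]
sa = place_sorted_lms(x, p1=[1, 3, 6], sa1=[2, 1, 0], ends=[0, 3, 4, 6])
```

Scanning SA1 from right to left while filling each bucket from its end keeps the LMS suffixes of a bucket in their sorted order, which is what the subsequent left-to-right and right-to-left passes rely on.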
6. A system based on the method for constructing a suffix array in parallel with constant working space according to any one of claims 1-5, characterized by comprising: a parallel induced sorting module (8), a storage unit (1), a front-end unit (2) and a resolution unit (3);
the parallel induced sorting module (8) being used to perform constant-working-space parallel induced sorting of the input substrings or suffixes by means of arrays P1 and SA, and to return the result;
the storage unit (1) being used to store temporary data produced while generating the suffix array;
the front-end unit (2) being used to generate character string X1 from the input character string X by the method of constant-working-space parallel induced sorting and renaming, and to write it to the storage unit (1);
the resolution unit (3) being used to read character string X1 from the storage unit (1) and, using the suffix array of character string X1 stored in SA1, to compute the suffix array of character string X by parallel induced sorting within constant working space and save it in SA.
7. The system for constructing a suffix array in parallel with constant working space according to claim 6, characterized in that the front-end unit (2) comprises an array P1 computing module (4), an LMS substring sorting module (5), a character string X1 generation module (6) and a character string X1 decision module (7);
the array P1 computing module (4) being used to read the input character string X and find the positions of all LMS characters according to the S+L+S type-pattern definition of LMS substrings, so as to obtain the first-character pointers of all LMS substrings in character string X and record them in array P1;
the LMS substring sorting module (5) being used to read array P1 from the storage unit (1) and call the parallel induced sorting module (8) to sort all LMS substrings in character string X, storing the sorting results in SA1;
the character string X1 generation module (6) being used to read array SA from the storage unit (1) and rename each LMS substring in character string X in parallel according to the sorting results, so as to generate character string X1;
the character string X1 decision module (7) being used to read character string X1 from the storage unit (1) and judge whether every character of X1 is unique; if so, control passes to the resolution unit (3); otherwise the front-end unit (2) is called recursively.
8. The system for constructing a suffix array in parallel with constant working space according to claim 7, characterized in that the resolution unit (3) comprises a suffix array computing module (9), a suffix array generation module (10) and a suffix array storage unit (11);
the suffix array computing module (9) being used to read character string X1 from the storage unit (1), directly sort the suffixes of character string X1 to compute the suffix array of character string X1, and write it to the storage unit (1);
the suffix array generation module (10) being used to read array SA1 from the storage unit (1) and call the parallel induced sorting module (8) to sort all LMS suffixes in character string X, thereby obtaining the suffix array of character string X;
the suffix array storage unit (11) being used to store the suffix array of character string X.
CN201810344030.2A 2018-04-17 2018-04-17 The method and system of constant working space parallel construction Suffix array clustering Pending CN108763170A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810344030.2A CN108763170A (en) 2018-04-17 2018-04-17 The method and system of constant working space parallel construction Suffix array clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810344030.2A CN108763170A (en) 2018-04-17 2018-04-17 The method and system of constant working space parallel construction Suffix array clustering

Publications (1)

Publication Number Publication Date
CN108763170A true CN108763170A (en) 2018-11-06

Family

ID=64010834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810344030.2A Pending CN108763170A (en) 2018-04-17 2018-04-17 The method and system of constant working space parallel construction Suffix array clustering

Country Status (1)

Country Link
CN (1) CN108763170A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614510A (en) * 2018-11-23 2019-04-12 腾讯科技(深圳)有限公司 A kind of image search method, device, graphics processor and storage medium
CN110852046A (en) * 2019-10-18 2020-02-28 中山大学 Block induction sequencing method and system for text suffix index

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073740A (en) * 2011-01-27 2011-05-25 农革 String suffix array construction method on basis of radix sorting
CN102081673A (en) * 2011-01-27 2011-06-01 农革 Suffix array construction method
CN102521213A (en) * 2011-12-01 2012-06-27 农革 Construction method of linear time suffix arrays
CN107015951A (en) * 2017-03-24 2017-08-04 广东顺德中山大学卡内基梅隆大学国际联合研究院 The correctness verification method and system of a kind of Suffix array clustering
CN107015952A (en) * 2017-03-24 2017-08-04 广东顺德中山大学卡内基梅隆大学国际联合研究院 The correctness verification method and system of a kind of Suffix array clustering and most long common prefix

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614510A (en) * 2018-11-23 2019-04-12 腾讯科技(深圳)有限公司 A kind of image search method, device, graphics processor and storage medium
CN109614510B (en) * 2018-11-23 2021-05-07 腾讯科技(深圳)有限公司 Image retrieval method, image retrieval device, image processor and storage medium
CN110852046A (en) * 2019-10-18 2020-02-28 中山大学 Block induction sequencing method and system for text suffix index
CN110852046B (en) * 2019-10-18 2021-11-05 中山大学 Block induction sequencing method and system for text suffix index

Similar Documents

Publication Publication Date Title
CN108804204A (en) Multi-threaded parallel constructs the method and system of Suffix array clustering
CN101398820B (en) Large scale key word matching method
US7080091B2 (en) Inverted index system and method for numeric attributes
CN101359325B (en) Multi-key-word matching method for rapidly analyzing content
US6131092A (en) System and method for identifying matches of query patterns to document text in a document textbase
CN110134714B (en) Distributed computing framework cache index method suitable for big data iterative computation
Kaukoranta et al. A fast exact GLA based on code vector activity detection
CN105335481B (en) A kind of the suffix index building method and device of extensive character string text
CN102081673A (en) Suffix array construction method
CN1890669A (en) Incremental search of keyword strings
JPH09134369A (en) Method for retrieving dictionary where retrieval is executed with lattice as key and its method
CN108399213B (en) User-oriented personal file clustering method and system
US5367677A (en) System for iterated generation from an array of records of a posting file with row segments based on column entry value ranges
Hull et al. An integrated algorithm for text recognition: comparison with a cascaded algorithm
CN102073740A (en) String suffix array construction method on basis of radix sorting
Tavakoli Modeling genome data using bidirectional LSTM
CN110083683B (en) Entity semantic annotation method based on random walk
CN108763170A (en) The method and system of constant working space parallel construction Suffix array clustering
WO2020037794A1 (en) Index building method for english geographical name, and query method and apparatus therefor
CN101251845A (en) Method for performing multi-pattern string match using improved Wu-Manber algorithm
CN106484815B (en) A kind of automatic identification optimization method based on mass data class SQL retrieval scene
US6625592B1 (en) System and method for hash scanning of shared memory interfaces
CN108628907A (en) A method of being used for the Trie tree multiple-fault diagnosis based on Aho-Corasick
CN102521213A (en) Construction method of linear time suffix arrays
CN109446293A (en) A kind of parallel higher-dimension nearest Neighbor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181106
