CN102081673A - Suffix array construction method - Google Patents

Info

Publication number
CN102081673A
CN102081673A CN2011100290142A CN201110029014A CN102081673A CN 102081673 A CN102081673 A CN 102081673A CN 2011100290142 A CN2011100290142 A CN 2011100290142A CN 201110029014 A CN201110029014 A CN 201110029014A CN 102081673 A CN102081673 A CN 102081673A
Authority
CN
China
Prior art keywords
suffix
array
lms
character
bucket
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011100290142A
Other languages
Chinese (zh)
Inventor
农革
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN2011100290142A
Publication of CN102081673A
Legal status: Pending

Abstract

The invention discloses a method for constructing a suffix array in linear time. The method comprises the following steps: 1) scanning a string S from right to left and comparing the two adjacent characters S[i] and S[i+1] currently being scanned to obtain the type of each character and suffix, the types being recorded in an array t; 2) scanning the array t from left to right to find the positions of all LMS characters, thereby obtaining the start pointers of all LMS substrings, the pointers being recorded in P1; 3) sorting all LMS substrings in S using the LMS-substring pointer array P1 and the arrays B and SA; 4) renaming each LMS substring in S according to the sorted order obtained in step 3 to form a new, shorter string S1; 5) if every character in S1 is unique, sorting the suffixes of S1 directly to compute the suffix array SA1 of S1, and otherwise recursively calling the SA-IS algorithm with S1 and SA1 as input parameters; 6) inducing the suffix array SA of S from the suffix array SA1 of S1; and 7) returning.

Description

Suffix array construction method
Technical field
The present invention relates to a method for constructing the suffix array of a string, and in particular to a method by which a computer automatically constructs the suffix array of a string in linear time.
Background art
The suffix array of a string is a space-saving alternative to the suffix tree. It was first proposed by Manber and Myers in [1, 2] and supports algorithms equivalent to those of the suffix tree in less space. Suffix arrays are widely used in applications such as data indexing and pattern matching. The present invention provides a new suffix array construction algorithm that constructs the suffix array of any given string in linear time.
The following terms are used in this description.
Character set: A character set Σ is a set on which an order relation is defined; that is, any two distinct elements α and β of Σ can be compared, so that either α<β or α>β. The elements of Σ are called characters, and the smallest character is '$'. The size of the character set considered here is assumed to be a constant, O(1).
String: A string S of length n is an array S[0..n-1] formed by n characters of Σ arranged in order of position. The end marker of S is fixed as '$', and '$' does not appear at any other position in S.
Substring: A substring S[i..j] of S, i≤j, denotes the segment of S from position i to position j, i.e. the string consisting of the characters S[i], S[i+1], ..., S[j].
Suffix: A suffix of S is a substring that starts at some position i and runs to the end marker. The suffix starting at the i-th character is denoted suf(S, i), i.e. suf(S, i)=S[i..n-1].
Character and suffix types: The characters in S are divided into two types, L-type and S-type:
1) '$' is S-type;
2) S[i], i ∈ [0, n-2], is S-type if and only if suf(S, i)<suf(S, i+1), i.e. S[i]<S[i+1], or S[i]=S[i+1] and S[i+1] is S-type;
3) S[i], i ∈ [0, n-2], is L-type if and only if suf(S, i)>suf(S, i+1), i.e. S[i]>S[i+1], or S[i]=S[i+1] and S[i+1] is L-type;
4) the suffix suf(S, i) is S-type if and only if the character S[i] is S-type, and the suffix suf(S, i) is L-type if and only if the character S[i] is L-type.
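The type rules above can be sketched in Python as a single right-to-left scan. This is an illustrative sketch rather than the patent's own code: the names `classify_types` and `type_string` are hypothetical, and the sketch assumes a non-empty input that ends in '$' and whose characters compare by code point (so '$' sorts below the lowercase letters).

```python
def classify_types(s):
    """Return t with t[i] = True for S-type and False for L-type."""
    n = len(s)
    t = [False] * n
    t[n - 1] = True  # the end marker '$' is S-type by definition
    for i in range(n - 2, -1, -1):  # scan from right to left
        if s[i] < s[i + 1]:
            t[i] = True            # smaller than its successor: S-type
        elif s[i] == s[i + 1]:
            t[i] = t[i + 1]        # equal characters share the successor's type
        # otherwise s[i] > s[i+1] and t[i] stays False (L-type)
    return t


def type_string(s):
    """Render the type array as a string of 'L' and 'S' for easy reading."""
    return "".join("S" if is_s else "L" for is_s in classify_types(s))


print(type_string("baac$"))  # LSSLS
```

Because each type depends only on S[i], S[i+1] and the already known type of S[i+1], one pass over the string is enough.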
LMS (leftmost S-type) characters and suffixes:
1) '$' is an LMS character;
2) S[i], i ∈ [1, n-1], is an LMS character if and only if S[i] is S-type and S[i-1] is L-type;
3) the suffix suf(S, i) is an LMS suffix if and only if the character S[i] is an LMS character.
LMS substrings:
1) '$' is an LMS substring;
2) S[i..j] is an LMS substring if and only if 1≤i<j<n, S[i] and S[j] are both LMS characters, and there is no other LMS character between S[i] and S[j].
Pointer array: The pointer array P1 records the start positions of all LMS substrings in S; that is, P1[i] records the position in S of the first character of the (i+1)-th LMS substring, counting from left to right.
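As a hedged illustration of these definitions, the following Python sketch computes the pointer array P1 and lists the LMS substrings it delimits. The function names are hypothetical, and the input is assumed to end in a '$' that compares below every other character.

```python
def classify_types(s):
    """t[i] is True for S-type, False for L-type (right-to-left scan)."""
    n = len(s)
    t = [False] * n
    t[-1] = True
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])
    return t


def lms_positions(s):
    """P1: positions i >= 1 where S[i] is S-type and S[i-1] is L-type."""
    t = classify_types(s)
    return [i for i in range(1, len(s)) if t[i] and not t[i - 1]]


def lms_substrings(s):
    """Each LMS substring runs from one LMS character to the next, inclusive.

    The last one is the end marker '$' on its own.
    """
    p1 = lms_positions(s)
    ends = p1[1:] + [len(s) - 1]
    return [s[a:b + 1] for a, b in zip(p1, ends)]


print(lms_positions("mmiissiissiippii$"))   # [2, 6, 10, 16]
print(lms_substrings("mmiissiissiippii$"))  # ['iissi', 'iissi', 'iippii$', '$']
```

Adjacent LMS substrings overlap by one character, since the LMS character that closes one substring also opens the next.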
String comparison: Two strings are compared in the usual lexicographic order. That is, for two strings u and v, compare u[i] and v[i] in turn for i starting from 0. If u[i]=v[i], increase i by 1 and compare the next u[i] and v[i]; otherwise, if u[i]<v[i] then u<v, and if u[i]>v[i] then u>v.
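The comparison rule can be written out directly. The sketch below is illustrative only: `compare` is a hypothetical name, and the handling of the case where one string is a prefix of the other is an assumption, since the text does not cover it (two distinct suffixes of a '$'-terminated string never reach that case).

```python
def compare(u, v):
    """Return -1 if u < v, 1 if u > v, 0 if equal, in lexicographic order."""
    i = 0
    while i < len(u) and i < len(v):
        if u[i] != v[i]:
            return -1 if u[i] < v[i] else 1  # first differing character decides
        i += 1
    # Assumption: when one string is a prefix of the other, the shorter is smaller.
    return (len(u) > len(v)) - (len(u) < len(v))


print(compare("aac$", "ac$"))  # -1, because 'a' < 'c' at index 1
```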
Suffix array: The suffix array SA of S is a one-dimensional array of n integers satisfying suf(S, SA[i])<suf(S, SA[i+1]) for i ∈ [0, n-2]. In other words, the n suffixes of S are sorted in ascending order, and the start position in S of each sorted suffix is stored in SA in turn from left to right.
Using the above terms, an example of constructing the suffix array of a string is given below.
For the string S=baac$, the length is n=5 and suf(S, 0)=baac$, suf(S, 1)=aac$, suf(S, 2)=ac$, suf(S, 3)=c$, suf(S, 4)=$. By the definition of string comparison it is easy to see that suf(S, 4)<suf(S, 1)<suf(S, 2)<suf(S, 0)<suf(S, 3). By the definition of the suffix array it then follows that SA[0]=4, SA[1]=1, SA[2]=2, SA[3]=0, SA[4]=3, i.e. SA=[4, 1, 2, 0, 3].
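The example can be checked with a brute-force construction that simply sorts the suffixes. This is not the method of the invention: it is a reference sketch that takes far more than linear time, and `naive_suffix_array` is a hypothetical name. It assumes characters compare by code point, so that '$' is the smallest character.

```python
def naive_suffix_array(s):
    """Sort the start positions 0..n-1 by the suffix that begins at each one."""
    return sorted(range(len(s)), key=lambda i: s[i:])


print(naive_suffix_array("baac$"))  # [4, 1, 2, 0, 3]
```

A slow but obviously correct construction like this is useful as a test oracle for a linear-time implementation.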
Many computer algorithms for constructing the suffix array of a string already exist; see [1-8]. By time complexity they fall into two broad classes, linear-time and superlinear-time. A linear-time algorithm is defined as follows: given a string of length n over the character set Σ, i.e. a string containing n characters of Σ, the time complexity of sorting its n suffixes is O(n). Existing linear-time suffix array construction algorithms suffer from slow practical running speed and high space complexity [3, 4, 5, 7, 8], which limits their use in practice.
References
1) U. Manber and G. Myers, “Suffix arrays: A new method for online string searches,” in Proceedings of SODA, 1990, pp. 319-327.
2) U. Manber and G. Myers, “Suffix arrays: A new method for on-line string searches,” SIAM Journal on Computing, vol. 22, no. 5, pp. 935-948, 1993.
3) D. K. Kim, J. S. Sim, H. Park, and K. Park, “Linear-time construction of suffix arrays,” in Proceedings of CPM, 2003, pp. 186-199.
4) P. Ko and S. Aluru, “Space-efficient linear time construction of suffix arrays,” Journal of Discrete Algorithms, vol. 3, no. 2-4, pp. 143-156, 2005.
5) J. Karkkainen, P. Sanders, and S. Burkhardt, “Linear work suffix array construction,” JACM, no. 6, pp. 918-936, Nov. 2006.
6) G. Manzini and P. Ferragina, “Engineering a lightweight suffix array construction algorithm,” Algorithmica, vol. 40, no. 1, pp. 33-50, Sep. 2004.
7) S. J. Puglisi, W. F. Smyth, and A. H. Turpin, “A taxonomy of suffix array construction algorithms,” ACM Comput. Surv., vol. 39, no. 2, pp. 1-31, 2007.
8) S. J. Puglisi, W. F. Smyth, and A. Turpin, “The performance of linear time suffix sorting algorithms,” in Proceedings of Data Compression Conference, Mar. 2005, pp. 358-367.
Summary of the invention
In view of the above deficiencies, the present invention proposes a novel linear-time suffix array construction method, which comprises:
Step 1) Label the type of each character and suffix in the string: scan the string S once from right to left, compare the two adjacent characters S[i] and S[i+1] currently being scanned according to the definition of suffix types, determine the type of each character and suffix, and record the types in array t;
Step 2) Scan array t once from left to right and find the positions of all LMS characters, thereby obtaining the start pointers of all LMS substrings, and record the pointer of each LMS substring in P1;
Step 3) Sort all LMS substrings in S using the LMS-substring pointer array P1 and the arrays B and SA;
Step 4) Rename each LMS substring in S according to the sorted order from step 3), forming a new, shorter string S1;
Step 5) If every character of S1 is unique, sort the suffixes of S1 directly to compute the suffix array SA1 of S1; otherwise recursively call the SA-IS algorithm with S1 and SA1 as input parameters, i.e. SA-IS(S1, SA1);
Step 6) Induce the suffix array SA of S from the suffix array SA1 of S1 obtained in step 5);
Step 7) Return.
The process of sorting all LMS substrings in S in step 3) comprises:
31) Initialize all elements of SA to -1 and find the end position of each bucket in SA; scan S once from left to right, inserting each LMS suffix encountered at the current end position of its bucket in SA and then moving that bucket's end position one cell to the left;
32) Find the start position of each bucket in SA; scan SA from left to right and, for each scanned element SA[i] that is not -1, if S[SA[i]-1] is L-type, insert the value SA[i]-1 at the current start position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's start position one cell to the right;
33) Find the end position of each bucket in SA; scan SA from right to left and, for each scanned element SA[i], if S[SA[i]-1] is S-type, insert the value SA[i]-1 at the current end position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's end position one cell to the left.
Here, when all suffixes of S are sorted in array SA by their first character, the suffixes that share the same first character occupy one contiguous region of SA; this region is called the bucket of those suffixes.
The process of computing the new string S1 in step 4) comprises:
41) Scan all the sorted LMS substrings in SA from left to right, comparing each pair of adjacent LMS substrings in turn; the LMS substrings are named by numbers starting from 0: if two LMS substrings are equal they receive the same number, otherwise the number of the latter equals the number of the former plus 1;
42) Replace each LMS substring in S with the number it obtained in step 41); the resulting new string is S1.
The process of inducing SA from SA1 in step 6) comprises:
61) Initialize all elements of SA to -1 and find the end position of each bucket in SA; scan SA1 from right to left and, for each scanned element SA1[i], place P1[SA1[i]] at the current end position of the bucket of suffix suf(S, P1[SA1[i]]) in SA, then move that bucket's end position one cell to the left;
62) Find the start position of each bucket in SA; scan SA from left to right and, for each scanned element SA[i] that is not -1, if S[SA[i]-1] is L-type, insert the value SA[i]-1 at the current start position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's start position one cell to the right;
63) Find the end position of each bucket in SA; scan SA from right to left and, for each scanned element SA[i], if S[SA[i]-1] is S-type, insert the value SA[i]-1 at the current end position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's end position one cell to the left.
Beneficial effects of the invention: the invention constructs the suffix array of a string of length n in linear time O(n). Compared with other existing linear-time suffix array construction methods, the method of the invention runs faster, uses less space, and is easier to implement.
Brief description of the drawings
Fig. 1 is a flowchart of the suffix array construction method of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, the present invention proposes a novel linear-time suffix array construction method (SA-IS) that effectively overcomes the shortcomings of existing linear-time suffix array construction algorithms. The pseudocode for each step of the flowchart is given below, where the elements of each array are stored from left to right, i.e. the first element is leftmost and the last element is rightmost.
SA-IS(S, SA)
S: input string; (n characters long, containing n1 LMS substrings)
SA: suffix array of S;
S1: integer array; (records the new string formed by renaming each LMS substring in S; length n1)
SA1: suffix array of S1;
t: Boolean array; (records the type of each character in S; length n)
P1: integer array; (records the position of each LMS substring in S; length n1)
B: integer array; (auxiliary array used during sorting; length ||Σ(S)||, i.e. the number of elements in the character set Σ)
Step 1) Label the type of each character and suffix in the string. Scan the string S once from right to left, compare the two adjacent characters S[i] and S[i+1] currently being scanned according to the definition of suffix types, determine the type of each character and suffix, and record the types in array t;
Step 2) Scan array t once from left to right and find the positions of all LMS characters, thereby obtaining the start pointers of all LMS substrings, and record the pointer of each LMS substring in P1;
Step 3) Sort all LMS substrings in S using the LMS-substring pointer array P1 and the arrays B and SA;
Step 4) Rename each LMS substring in S according to the sorted order from step 3), forming a new, shorter string S1;
Step 5) If every character of S1 is unique, sort the suffixes of S1 directly to compute the suffix array SA1 of S1; otherwise recursively call the SA-IS algorithm with S1 and SA1 as input parameters, i.e. SA-IS(S1, SA1);
Step 6) Induce the suffix array SA of S from the suffix array SA1 of S1 obtained in step 5);
Step 7) Return.
The details of steps 3), 4) and 6) are described further below. For convenience, the concept of a "bucket" is introduced first. When all suffixes of S are sorted in array SA by their first character, the suffixes that share the same first character occupy one contiguous region of SA; this region is called the bucket of those suffixes. If S contains m distinct characters, SA is divided into m buckets, and all suffixes in a bucket begin with the same character. If the suffixes in a bucket begin with the character 'y', the bucket is called the character 'y' bucket for short. In addition, putting a suffix into a cell of SA means that this cell of SA records the position of that suffix in S.
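A minimal sketch of how the bucket boundaries can be computed from character counts is given below. The counts play the role that the description assigns to the auxiliary array B; the function name `bucket_bounds` and the use of dictionaries keyed by character are assumptions made for readability.

```python
from collections import Counter


def bucket_bounds(s):
    """Return (heads, tails): the first and last SA index of every bucket."""
    counts = Counter(s)          # number of suffixes starting with each character
    heads, tails = {}, {}
    pos = 0
    for ch in sorted(counts):    # buckets appear in SA in character order
        heads[ch] = pos
        pos += counts[ch]
        tails[ch] = pos - 1
    return heads, tails


heads, tails = bucket_bounds("mmiissiissiippii$")
print(heads)  # {'$': 0, 'i': 1, 'm': 9, 'p': 11, 's': 13}
print(tails)  # {'$': 0, 'i': 8, 'm': 10, 'p': 12, 's': 16}
```

For this string the 'i' bucket spans SA[1..8], which matches the eight cells shown for that bucket in the worked example.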
The process of sorting all LMS substrings in S in step 3) is as follows:
31) Initialize all elements of SA to -1; find the end position of each bucket in SA; scan S once from left to right, inserting each LMS suffix encountered at the current end position of its bucket in SA and then moving that bucket's end position one cell to the left;
32) Find the start position of each bucket in SA; scan SA from left to right and, for each scanned element SA[i] that is not -1, if S[SA[i]-1] is L-type, insert the value SA[i]-1 at the current start position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's start position one cell to the right;
33) Find the end position of each bucket in SA; scan SA from right to left and, for each scanned element SA[i], if S[SA[i]-1] is S-type, insert the value SA[i]-1 at the current end position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's end position one cell to the left.
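The three passes can be sketched in Python as follows. All function names are hypothetical. The guard `sa[i] > 0` is an assumption added here: it skips both empty cells (-1) and position 0, which has no preceding character, a case the text does not spell out.

```python
from collections import Counter


def classify_types(s):
    n = len(s)
    t = [False] * n          # False = L-type, True = S-type
    t[-1] = True
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])
    return t


def bucket_bounds(s):
    counts, heads, tails, pos = Counter(s), {}, {}, 0
    for ch in sorted(counts):
        heads[ch] = pos
        pos += counts[ch]
        tails[ch] = pos - 1
    return heads, tails


def place_lms(s, t):
    """Sub-step 31): drop every LMS suffix at the end of its bucket."""
    _, tails = bucket_bounds(s)
    sa = [-1] * len(s)
    for i in range(1, len(s)):           # left-to-right scan of S
        if t[i] and not t[i - 1]:
            sa[tails[s[i]]] = i
            tails[s[i]] -= 1
    return sa


def induce_l(s, t, sa):
    """Sub-step 32): left-to-right scan that places L-type suffixes."""
    heads, _ = bucket_bounds(s)
    for i in range(len(sa)):
        j = sa[i] - 1
        if sa[i] > 0 and not t[j]:
            sa[heads[s[j]]] = j
            heads[s[j]] += 1


def induce_s(s, t, sa):
    """Sub-step 33): right-to-left scan that places S-type suffixes."""
    _, tails = bucket_bounds(s)
    for i in range(len(sa) - 1, -1, -1):
        j = sa[i] - 1
        if sa[i] > 0 and t[j]:
            sa[tails[s[j]]] = j
            tails[s[j]] -= 1


def sort_lms_substrings(s):
    t = classify_types(s)
    sa = place_lms(s, t)
    induce_l(s, t, sa)
    induce_s(s, t, sa)
    return sa


print(sort_lms_substrings("mmiissiissiippii$"))
# [16, 15, 14, 10, 6, 2, 11, 7, 3, 1, 0, 13, 12, 9, 5, 8, 4]
```

Both induction passes read cells of SA that earlier iterations of the same pass have just written, which is why the scan direction matters in each sub-step.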
The process of computing the new string S1 in step 4) is as follows:
41) Scan all the sorted LMS substrings in SA from left to right, comparing each pair of adjacent LMS substrings in turn; the LMS substrings are named by numbers starting from 0: if two LMS substrings are equal they receive the same number, otherwise the number of the latter equals the number of the former plus 1;
42) Replace each LMS substring in S with the number it obtained in step 41); the resulting new string is S1.
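The naming step might be sketched as below. The function name `name_lms_substrings` is hypothetical, and the sketch takes as input an SA in which the LMS substrings are already in sorted order (the value used in the usage lines is the array that the worked example reaches after step 3). Comparing the substrings as Python slices is a simplification chosen for clarity.

```python
def classify_types(s):
    n = len(s)
    t = [False] * n          # False = L-type, True = S-type
    t[-1] = True
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])
    return t


def name_lms_substrings(s, sa):
    """Return S1: the LMS substrings of s, in text order, replaced by their names."""
    t = classify_types(s)
    p1 = [i for i in range(1, len(s)) if t[i] and not t[i - 1]]
    # Each LMS substring ends at the next LMS character; the last one is '$' alone.
    ends = dict(zip(p1, p1[1:] + [len(s) - 1]))
    names, name, prev = {}, -1, None
    for pos in sa:                         # 41) left-to-right scan of SA
        if pos in ends:                    # only LMS positions are named
            cur = s[pos:ends[pos] + 1]
            if cur != prev:                # a new substring gets the next number
                name += 1
                prev = cur
            names[pos] = name
    return [names[p] for p in p1]          # 42) substitute names in text order


sorted_sa = [16, 15, 14, 10, 6, 2, 11, 7, 3, 1, 0, 13, 12, 9, 5, 8, 4]
print(name_lms_substrings("mmiissiissiippii$", sorted_sa))  # [2, 2, 1, 0]
```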
The process of inducing SA from SA1 in step 6) is as follows:
61) Initialize all elements of SA to -1; find the end position of each bucket in SA; scan SA1 from right to left and, for each scanned element SA1[i], place P1[SA1[i]] at the current end position of the bucket of suffix suf(S, P1[SA1[i]]) in SA, then move that bucket's end position one cell to the left;
62) Find the start position of each bucket in SA; scan SA from left to right and, for each scanned element SA[i] that is not -1, if S[SA[i]-1] is L-type, insert the value SA[i]-1 at the current start position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's start position one cell to the right;
63) Find the end position of each bucket in SA; scan SA from right to left and, for each scanned element SA[i], if S[SA[i]-1] is S-type, insert the value SA[i]-1 at the current end position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's end position one cell to the left.
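Putting steps 1) to 7) together, a compact Python sketch of the whole procedure could look like the following. It is an illustration under stated assumptions, not a reference implementation: the names `sa_is`, `suffix_array` and `induce` are hypothetical, the input is mapped to integers with the end marker as a unique smallest symbol 0, the list `counts` stands in for array B, and LMS substrings are compared as list slices for brevity. No attempt is made to match the space usage claimed for the invention.

```python
def sa_is(s, k):
    """Suffix array of the integer list s over 0..k-1; s must end in a unique 0."""
    n = len(s)
    if n == 1:
        return [0]
    t = [False] * n                      # step 1: False = L-type, True = S-type
    t[-1] = True
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])
    p1 = [i for i in range(1, n) if t[i] and not t[i - 1]]   # step 2
    counts = [0] * k
    for c in s:
        counts[c] += 1

    def bounds():
        heads, tails, pos = [], [], 0
        for c in counts:
            heads.append(pos)
            pos += c
            tails.append(pos - 1)
        return heads, tails

    def induce(lms_order):
        """Sub-steps 31)-33) and 61)-63): seed with LMS suffixes, then induce."""
        sa = [-1] * n
        _, tails = bounds()
        for p in reversed(lms_order):    # fill each bucket from its end
            sa[tails[s[p]]] = p
            tails[s[p]] -= 1
        heads, tails = bounds()
        for i in range(n):               # L-type suffixes, left to right
            j = sa[i] - 1
            if sa[i] > 0 and not t[j]:
                sa[heads[s[j]]] = j
                heads[s[j]] += 1
        for i in range(n - 1, -1, -1):   # S-type suffixes, right to left
            j = sa[i] - 1
            if sa[i] > 0 and t[j]:
                sa[tails[s[j]]] = j
                tails[s[j]] -= 1
        return sa

    sa = induce(p1[::-1])                # step 3 (reversed so p1[0] is placed first)
    ends = dict(zip(p1, p1[1:] + [n - 1]))
    names, name, prev = {}, -1, None
    for p in sa:                         # step 4: name the sorted LMS substrings
        if p in ends:
            cur = s[p:ends[p] + 1]
            if cur != prev:
                name, prev = name + 1, cur
            names[p] = name
    s1 = [names[p] for p in p1]
    if name + 1 == len(s1):              # step 5: all names unique, sort directly
        sa1 = [0] * len(s1)
        for idx, nm in enumerate(s1):
            sa1[nm] = idx
    else:
        sa1 = sa_is(s1, name + 1)
    return induce([p1[i] for i in sa1])  # steps 6 and 7


def suffix_array(text):
    """Wrapper: assumes text ends in '$', its unique smallest character."""
    rank = {c: r for r, c in enumerate(sorted(set(text)))}
    return sa_is([rank[c] for c in text], len(rank))


print(suffix_array("baac$"))  # [4, 1, 2, 0, 3]
```

The same `induce` helper serves step 3 and step 6; only the order in which the LMS suffixes are seeded differs between the two calls.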
Taking the string "mmiissiissiippii$" as an example, the detailed process by which the SA-IS algorithm computes the new string S1 from S is given below to help in understanding the details of the invention. The result of each step is shown first:
00         0                   1
01 index:  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
02 S:      m m i i s s i i s s i i p p i i $
03 t:      L L S S L L S S L L S S L L L L S
04 LMS:        *       *       *           *
05 Step 1:
06 bucket:  $    i                         m       p       s
07 SA:     {16} {-1 -1 -1 -1 -1 10 06 02} {-1 -1} {-1 -1} {-1 -1 -1 -1}
08 Step 2:
09 bucket:  $    i                         m       p       s
10 SA:     {16} {-1 -1 -1 -1 -1 10 06 02} {-1 -1} {-1 -1} {-1 -1 -1 -1}
11 @ ^^^^^
12         {16} {15 -1 -1 -1 -1 10 06 02} {-1 -1} {-1 -1} {-1 -1 -1 -1}
13 ^@ ^^^^
14         {16} {15 14 -1 -1 -1 10 06 02} {-1 -1} {13 -1} {-1 -1 -1 -1}
15 ^@ ^^^^
16         {16} {15 14 -1 -1 -1 10 06 02} {-1 -1} {13 -1} {09 -1 -1 -1}
17 ^^@ ^^^
18         {16} {15 14 -1 -1 -1 10 06 02} {-1 -1} {13 -1} {09 05 -1 -1}
19 ^^@ ^^^
20         {16} {15 14 -1 -1 -1 10 06 02} {01 -1} {13 -1} {09 05 -1 -1}
21 ^^@ ^^^
22         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 -1} {09 05 -1 -1}
23 ^^@ ^^^
24         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 -1 -1}
25 ^^^@ ^^
26         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 08 -1}
27 ^^^^@ ^
28         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 08 04}
29 ^^^^@ ^
30 Step 3:
31 bucket:  $    i                         m       p       s
32 SA:     {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 08 04}
33 ^^^^@^
34         {16} {15 14 -1 -1 -1 10 06 03} {01 00} {13 12} {09 05 08 04}
35 ^^^^@ ^
36         {16} {15 14 -1 -1 -1 10 07 03} {01 00} {13 12} {09 05 08 04}
37 ^^^@ ^^
38         {16} {15 14 -1 -1 -1 11 07 03} {01 00} {13 12} {09 05 08 04}
39 ^^@ ^^^
40         {16} {15 14 -1 -1 02 11 07 03} {01 00} {13 12} {09 05 08 04}
41 ^^@ ^^^
42         {16} {15 14 -1 06 02 11 07 03} {01 00} {13 12} {09 05 08 04}
43 ^^^^^
44         {16} {15 14 10 06 02 11 07 03} {01 00} {13 12} {09 05 08 04}
45 ^^@ ^^^
46 S1:     2 2 1 0
The steps above are explained as follows.
Step 1) Label the type of each character in the string. First, '$' is S-type. Then scan the string S once from right to left and, according to the definition of suffix types, compare the character S[i] currently being scanned with its successor S[i+1] to determine the type of S[i]: if S[i]>S[i+1], S[i] is L-type and t[i]=0; if S[i]<S[i+1], S[i] is S-type and t[i]=1; if S[i]=S[i+1], S[i] has the same type as S[i+1] and t[i]=t[i+1]. The resulting t is shown on line 03.
Step 2) Obtain the position of each LMS substring in S. Scan the character type array t from left to right and mark the position of each LMS substring, i.e. the positions marked with '*' on line 04. The positions of these LMS substrings are recorded in array P1.
Step 3) Sort all LMS substrings using the LMS-substring pointer array P1 and the arrays B and SA. This step has the following three sub-steps.
31) Put each LMS suffix of S into its bucket in SA. First initialize all elements of SA to -1 and find the start and end positions of each suffix bucket in SA, i.e. the five character buckets '$', 'i', 'm', 'p' and 's'. Then put the four LMS suffixes into their respective buckets, filling each bucket from its end towards its start. The first character S[2] of the first LMS suffix is 'i', so it is placed at the last position of the 'i' bucket. The first character S[6] of the second LMS suffix is also 'i', so it is placed at the second-to-last position of the 'i' bucket. Proceeding in this way gives line 07.
32) Find the start position of each suffix bucket in SA, marked with '^' on line 11. Scan SA from left to right; the element currently being scanned is marked below it with '@'. If the scanned element SA[i] is -1 it is skipped; otherwise check whether the character S[SA[i]-1] is L-type. If it is not L-type it is skipped; otherwise the value SA[i]-1 is inserted at the current start position of the bucket of the character S[SA[i]-1], and the start position of that bucket is moved one cell to the right. For example, on line 10 SA[0]=16, and looking up array t shows that S[16-1]='i' is L-type, so 16-1=15 is placed at the current start position of the 'i' bucket, and the start position of that bucket is then moved one cell to the right. When the scan is finished, SA holds the data shown on line 28.
33) Find the end position of each suffix bucket in SA, marked with '^' on line 33. Scan SA from right to left; the element currently being scanned is marked below it with '@'. For the scanned element SA[i], check whether the character S[SA[i]-1] is S-type. If it is not S-type it is skipped; otherwise the value SA[i]-1 is inserted at the current end position of the bucket of the character S[SA[i]-1], and the end position of that bucket is moved one cell to the left. Since the last element of SA is SA[n-1]=4, the type of S[4-1] is examined. Because S[4-1]='i' is S-type, 4-1=3 is placed at the current end position of the 'i' bucket, and the end position of that bucket is moved one cell to the left. When the scan is finished, SA holds the data shown on line 44.
Step 4) Rename each LMS substring in S to obtain S1. Scan SA from left to right and examine all the sorted LMS substrings. Name each LMS substring by a number starting from 0: if two adjacent LMS substrings are not equal, the number of the latter equals the number of the former plus 1; otherwise the two substrings receive the same number. For example, the first substring scanned is $, so $ is numbered 0. The next value scanned is 10 in the 'i' bucket; its LMS substring is iippii$, which is not equal to $, so it is numbered 0+1=1. The next value scanned is 6 in the 'i' bucket; its LMS substring is iissi, which is not equal to iippii$, so it is numbered 1+1=2. The next value scanned is 2 in the 'i' bucket; its LMS substring is iissi, which equals the previously scanned LMS substring, so it receives the same number, 2. Finally, each LMS substring in S is replaced with its number, giving the new string S1=2210.
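To connect this example with steps 5) and 6), the short sketch below sorts the suffixes of S1=2210 directly and maps the result back through P1. The direct sort is only a stand-in for the recursive call (S1 here contains a repeated character, so the method itself would recurse), and the variable names are illustrative.

```python
s1 = [2, 2, 1, 0]        # names of the LMS substrings, in text order
p1 = [2, 6, 10, 16]      # start positions of the LMS substrings in S

# Stand-in for step 5): sort the suffixes of S1 by brute force.
sa1 = sorted(range(len(s1)), key=lambda i: s1[i:])

# P1[SA1[i]] lists the LMS suffixes of S in sorted order, ready for sub-step 61).
sorted_lms = [p1[i] for i in sa1]

print(sa1)         # [3, 2, 1, 0]
print(sorted_lms)  # [16, 10, 6, 2]
```

The order 16, 10, 6, 2 agrees with the relative order of the LMS positions on line 44 of the trace.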
The above is only a preferred embodiment of the present invention, and the invention is not limited to this embodiment. Minor local modifications may be made in practice. Provided that such changes or modifications do not depart from the spirit and scope of the invention and fall within the scope of the claims and their technical equivalents, the invention is intended to cover them as well.

Claims (4)

1. A suffix array construction method, characterized in that it comprises:
Step 1) Label the type of each character and suffix in the string: scan the string S once from right to left, compare the two adjacent characters S[i] and S[i+1] currently being scanned according to the definition of suffix types, determine the type of each character and suffix, and record the types in array t;
Step 2) Scan array t once from left to right and find the positions of all LMS characters, thereby obtaining the start pointers of all LMS substrings, and record the pointer of each LMS substring in P1;
Step 3) Sort all LMS substrings in S using the LMS-substring pointer array P1 and the arrays B and SA;
Step 4) Rename each LMS substring in S according to the sorted order from step 3), forming a new, shorter string S1;
Step 5) If every character of S1 is unique, sort the suffixes of S1 directly to compute the suffix array SA1 of S1; otherwise recursively call the SA-IS algorithm with S1 and SA1 as input parameters, i.e. SA-IS(S1, SA1);
Step 6) Induce the suffix array SA of S from the suffix array SA1 of S1 obtained in step 5);
Step 7) Return.
2. The suffix array construction method according to claim 1, characterized in that the process of sorting all LMS substrings in S in step 3) comprises:
31) Initialize all elements of SA to -1 and find the end position of each bucket in SA; scan S once from left to right, inserting each LMS suffix encountered at the current end position of its bucket in SA and then moving that bucket's end position one cell to the left;
32) Find the start position of each bucket in SA; scan SA from left to right and, for each scanned element SA[i] that is not -1, if S[SA[i]-1] is L-type, insert the value SA[i]-1 at the current start position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's start position one cell to the right;
33) Find the end position of each bucket in SA; scan SA from right to left and, for each scanned element SA[i], if S[SA[i]-1] is S-type, insert the value SA[i]-1 at the current end position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's end position one cell to the left;
wherein, when all suffixes of S are sorted in array SA by their first character, the suffixes that share the same first character occupy one contiguous region of SA, and this region is called the bucket of those suffixes.
3. The suffix array construction method according to claim 2, characterized in that the process of computing the new string S1 in step 4) comprises:
41) Scan all the sorted LMS substrings in SA from left to right, comparing each pair of adjacent LMS substrings in turn; the LMS substrings are named by numbers starting from 0: if two LMS substrings are equal they receive the same number, otherwise the number of the latter equals the number of the former plus 1;
42) Replace each LMS substring in S with the number it obtained in step 41); the resulting new string is S1.
4. The suffix array construction method according to claim 3, characterized in that the process of inducing SA from SA1 in step 6) comprises:
61) Initialize all elements of SA to -1 and find the end position of each bucket in SA; scan SA1 from right to left and, for each scanned element SA1[i], place P1[SA1[i]] at the current end position of the bucket of suffix suf(S, P1[SA1[i]]) in SA, then move that bucket's end position one cell to the left;
62) Find the start position of each bucket in SA; scan SA from left to right and, for each scanned element SA[i] that is not -1, if S[SA[i]-1] is L-type, insert the value SA[i]-1 at the current start position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's start position one cell to the right;
63) Find the end position of each bucket in SA; scan SA from right to left and, for each scanned element SA[i], if S[SA[i]-1] is S-type, insert the value SA[i]-1 at the current end position of the bucket of suffix suf(S, SA[i]-1) in SA and then move that bucket's end position one cell to the left.
CN2011100290142A 2011-01-27 2011-01-27 Suffix array construction method Pending CN102081673A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100290142A CN102081673A (en) 2011-01-27 2011-01-27 Suffix array construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011100290142A CN102081673A (en) 2011-01-27 2011-01-27 Suffix array construction method

Publications (1)

Publication Number Publication Date
CN102081673A true CN102081673A (en) 2011-06-01

Family

ID=44087634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100290142A Pending CN102081673A (en) 2011-01-27 2011-01-27 Suffix array construction method

Country Status (1)

Country Link
CN (1) CN102081673A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521213A (en) * 2011-12-01 2012-06-27 农革 Construction method of linear time suffix arrays
WO2015143708A1 (en) * 2014-03-28 2015-10-01 华为技术有限公司 Method and apparatus for constructing suffix array
CN105335481A (en) * 2015-10-14 2016-02-17 广东顺德中山大学卡内基梅隆大学国际联合研究院 Large scale character string text suffix index building method and device
CN106953806A (en) * 2017-03-27 2017-07-14 广东顺德中山大学卡内基梅隆大学国际联合研究院 Method and system for matching IP addresses based on a suffix index
CN107015951A (en) * 2017-03-24 2017-08-04 广东顺德中山大学卡内基梅隆大学国际联合研究院 Method and system for verifying correctness of suffix array
CN107015952A (en) * 2017-03-24 2017-08-04 广东顺德中山大学卡内基梅隆大学国际联合研究院 Method and system for verifying correctness of suffix array and longest common prefix
CN107169315A (en) * 2017-03-27 2017-09-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 Mass DNA data transmission method and system
CN108664459A (en) * 2018-03-22 2018-10-16 佛山市顺德区中山大学研究院 Suffix array self-adaptive merging method and device thereof
CN108763170A (en) * 2018-04-17 2018-11-06 佛山市顺德区中山大学研究院 Method and system for constructing suffix arrays in parallel with constant working space
CN108804204A (en) * 2018-04-17 2018-11-13 佛山市顺德区中山大学研究院 Method and system for multi-threaded parallel construction of suffix arrays
CN108920483A (en) * 2018-04-28 2018-11-30 南京搜文信息技术有限公司 Fast character string matching method based on suffix array
CN109299152A (en) * 2018-08-27 2019-02-01 中山大学 Suffix array indexing method and device for real-time data stream

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521213A (en) * 2011-12-01 2012-06-27 农革 Construction method of linear time suffix arrays
WO2015143708A1 (en) * 2014-03-28 2015-10-01 华为技术有限公司 Method and apparatus for constructing suffix array
CN105335481A (en) * 2015-10-14 2016-02-17 广东顺德中山大学卡内基梅隆大学国际联合研究院 Large scale character string text suffix index building method and device
CN105335481B (en) * 2015-10-14 2019-01-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Suffix index building method and device for large-scale character string text
CN107015952B (en) * 2017-03-24 2020-08-18 广东顺德中山大学卡内基梅隆大学国际联合研究院 Method and system for verifying correctness of suffix array and longest common prefix
CN107015951A (en) * 2017-03-24 2017-08-04 广东顺德中山大学卡内基梅隆大学国际联合研究院 Method and system for verifying correctness of suffix array
CN107015952A (en) * 2017-03-24 2017-08-04 广东顺德中山大学卡内基梅隆大学国际联合研究院 Method and system for verifying correctness of suffix array and longest common prefix
CN107015951B (en) * 2017-03-24 2020-08-18 广东顺德中山大学卡内基梅隆大学国际联合研究院 Method and system for verifying correctness of suffix array
CN107169315A (en) * 2017-03-27 2017-09-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 Mass DNA data transmission method and system
CN107169315B (en) * 2017-03-27 2020-08-04 广东顺德中山大学卡内基梅隆大学国际联合研究院 Mass DNA data transmission method and system
CN106953806A (en) * 2017-03-27 2017-07-14 广东顺德中山大学卡内基梅隆大学国际联合研究院 Method and system for matching IP addresses based on a suffix index
CN108664459B (en) * 2018-03-22 2021-09-17 佛山市顺德区中山大学研究院 Suffix array self-adaptive merging method and device thereof
CN108664459A (en) * 2018-03-22 2018-10-16 佛山市顺德区中山大学研究院 Suffix array self-adaptive merging method and device thereof
CN108763170A (en) * 2018-04-17 2018-11-06 佛山市顺德区中山大学研究院 Method and system for constructing suffix arrays in parallel with constant working space
CN108804204A (en) * 2018-04-17 2018-11-13 佛山市顺德区中山大学研究院 Method and system for multi-threaded parallel construction of suffix arrays
CN108920483A (en) * 2018-04-28 2018-11-30 南京搜文信息技术有限公司 Fast character string matching method based on suffix array
CN109299152A (en) * 2018-08-27 2019-02-01 中山大学 Suffix array indexing method and device for real-time data stream
CN109299152B (en) * 2018-08-27 2021-11-30 中山大学 Suffix array indexing method and device for real-time data stream

Similar Documents

Publication Publication Date Title
CN102081673A (en) Suffix array construction method
CN102073740A (en) String suffix array construction method on basis of radix sorting
KR101276602B1 (en) System and method for searching and matching data having ideogrammatic content
CN101430714B (en) Content structuring process method and system based on model
CN108399213B (en) User-oriented personal file clustering method and system
CN109902142B (en) Character string fuzzy matching and query method based on edit distance
CN105335481B (en) A kind of the suffix index building method and device of extensive character string text
CN101158957A (en) Internet hot point topics correlativity excavation method
CN101477555B (en) Fast retrieval and generation display method for task tree based on SQL database
CN101251845B (en) Method for performing multi-pattern string match using improved Wu-Manber algorithm
CN101236550A (en) Method and system for processing tree -type structure data
CN102521213A (en) Construction method of linear time suffix arrays
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN108804204A (en) Multi-threaded parallel constructs the method and system of Suffix array clustering
CN102081649A (en) Method and system for searching computer files
CN106295252A (en) Search method for gene prod
US20100057809A1 (en) Information storing/retrieving method and device for state transition table, and program
US8549023B2 (en) Method and apparatus for resorting a sequence of sorted strings
CN101853444A (en) Method for building integrated enterprise process reference model based on model combination
CN106126618A (en) Email address based on name recommends method and system
CN101034350A (en) Device and method for quick searching computer program functional entrance
CN108763170A (en) The method and system of constant working space parallel construction Suffix array clustering
CN101211347A (en) Search engine and method for quickly establishing key phrase search relationship
CN109828785A (en) A kind of approximate Code Clones detection method accelerated using GPU
JPH1153383A (en) Plural database retrieval method and recording medium recording plural database retrieval program or the like

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110601