CN102081673A - Suffix array construction method

Info

Publication number: CN102081673A
Application number: CN2011100290142A
Authority: CN
Inventors: 农革
Original assignee: Individual
Current assignee: Individual
Priority date: 2011-01-27
Filing date: 2011-01-27
Publication date: 2011-06-01

Abstract

The invention discloses a method for constructing a suffix array in linear time. The method comprises the following steps: 1) scanning a character string S from right to left and comparing each pair of adjacent characters S[i] and S[i+1] to obtain the type of each character and suffix, which is recorded in an array t; 2) scanning the array t from left to right to find every position at which an LMS character occurs, thereby obtaining the start pointers of all LMS substrings, which are recorded in an array P1; 3) sorting all LMS substrings of S using the LMS-substring pointer array P1 and the arrays B and SA; 4) renaming each LMS substring of S according to the sorted result of step 3 to form a new, shorter string S1; 5) if every character of S1 is unique, directly sorting the suffixes of S1 to compute the suffix array SA1 of S1, and otherwise recursively calling the SA-IS algorithm with S1 and SA1 as input parameters; 6) inducing the suffix array SA of S from the suffix array SA1 of S1; and 7) returning.

Description

Suffix array construction method

Technical field

The present invention relates to a method for constructing the suffix array of a character string, and in particular to a method by which a computer automatically constructs the suffix array of a character string in linear time.

Background art

The suffix array of a character string is a space-saving alternative to the suffix tree. It was first proposed by Manber and Myers in [1, 2] and supports algorithms equivalent to those on suffix trees in less space. Suffix arrays are widely used in applications such as data indexing and pattern matching. The present invention provides a new suffix array construction algorithm that constructs the suffix array of any given character string in linear time.

The following terms are used in this description.

Character set: A character set ∑ is a set on which an order relation is defined; that is, any two distinct elements α and β of ∑ can be compared, so that either α < β or α > β. The elements of ∑ are called characters, and the smallest character is '$'. The size of the character set considered here is assumed to be a constant, O(1).

Character string: A character string S of length n is an array S[0..n-1] formed by arranging n characters of ∑ in order of position. The end marker of S is fixed as '$', and '$' does not occur at any other position in S.

Substring: A substring S[i..j] of S, i ≤ j, denotes the section of S from position i to position j, that is, the string formed by the characters S[i], S[i+1], ..., S[j].

Suffix: A suffix of S is a substring that starts at some position i and runs to the end marker. The suffix starting at position i is denoted suf(S, i), that is, suf(S, i) = S[i..n-1].

Character and suffix types: The characters of S are divided into two types, L and S:

1) '$' is S type;

2) S[i], i ∈ [0, n-2], is S type if and only if suf(S, i) < suf(S, i+1), i.e. S[i] < S[i+1], or S[i] = S[i+1] and S[i+1] is S type;

3) S[i], i ∈ [0, n-2], is L type if and only if suf(S, i) > suf(S, i+1), i.e. S[i] > S[i+1], or S[i] = S[i+1] and S[i+1] is L type;

4) the suffix suf(S, i) is S type if and only if the character S[i] is S type; the suffix suf(S, i) is L type if and only if the character S[i] is L type.
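The type rules above can be sketched in a few lines of Python. This is only an illustration, not the patent's pseudocode: the function name `classify_types` is hypothetical, and the sketch assumes the input ends with a unique '$' that compares smaller than every other character (true for ASCII lowercase text). The test string is the one used in the worked example later in this description.

```python
def classify_types(s: str) -> list[bool]:
    """Return t, where t[i] is True for S type and False for L type."""
    n = len(s)
    t = [False] * n
    t[n - 1] = True  # rule 1: the end marker '$' is S type
    for i in range(n - 2, -1, -1):  # one right-to-left scan
        if s[i] < s[i + 1]:
            t[i] = True            # rule 2: smaller than successor -> S type
        elif s[i] > s[i + 1]:
            t[i] = False           # rule 3: larger than successor -> L type
        else:
            t[i] = t[i + 1]        # equal characters inherit the successor's type
    return t


t = classify_types("mmiissiissiippii$")
print("".join("S" if is_s else "L" for is_s in t))  # LLSSLLSSLLSSLLLLS
```

Scanning from the right means the type of S[i+1] is always known when S[i] is examined, so no suffix is ever compared in full.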

LMS (leftmost S-type) characters and suffixes:

1) '$' is an LMS character;

2) S[i], i ∈ [1, n-1], is an LMS character if and only if S[i] is S type and S[i-1] is L type;

3) the suffix suf(S, i) is an LMS suffix if and only if the character S[i] is an LMS character.

LMS substrings:

1) '$' is an LMS substring;

2) S[i..j] is an LMS substring if and only if 1 ≤ i < j < n, S[i] and S[j] are both LMS characters, and there is no other LMS character between S[i] and S[j].

Pointer array: The pointer array P1 records the start positions of all LMS substrings in S; that is, P1[i] records the position in S of the first character of the (i+1)-th LMS substring, counting from left to right.
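As a small illustration of the LMS definition and the pointer array, the sketch below derives P1 from a type string. The function name `lms_positions` and the encoding of types as a string over 'L' and 'S' are assumptions made for this example only.

```python
def lms_positions(types: str) -> list[int]:
    """Return P1: every position i whose type is S and whose left neighbour is L."""
    return [
        i
        for i in range(1, len(types))
        if types[i] == "S" and types[i - 1] == "L"
    ]


# Types of "mmiissiissiippii$", the string used in the worked example.
print(lms_positions("LLSSLLSSLLSSLLLLS"))  # [2, 6, 10, 16]
```

Because the character before the end marker is always L type, the last entry of P1 is always the position of '$'.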

String comparison: Comparing two strings means the usual lexicographic comparison. For two strings u and v, compare u[i] and v[i] in turn starting from i = 0. If u[i] = v[i], increase i by 1 and compare the next u[i] and v[i]; otherwise u < v if u[i] < v[i], and u > v if u[i] > v[i].

Suffix array: The suffix array SA of S is a one-dimensional array of n integers that satisfies suf(S, SA[i]) < suf(S, SA[i+1]) for i ∈ [0, n-2]. In other words, after the n suffixes of S are sorted in ascending order, the start position in S of each suffix is placed into SA in turn from left to right.

Using the above terms, an example of constructing the suffix array of a character string is given below.

Let S = baac$, of length n = 5. Then suf(S, 0) = baac$, suf(S, 1) = aac$, suf(S, 2) = ac$, suf(S, 3) = c$ and suf(S, 4) = $. By the definition of string comparison, suf(S, 4) < suf(S, 1) < suf(S, 2) < suf(S, 0) < suf(S, 3). By the definition of the suffix array, it follows that SA[0] = 4, SA[1] = 1, SA[2] = 2, SA[3] = 0 and SA[4] = 3, i.e. SA = [4, 1, 2, 0, 3].
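The definition can be checked directly with a naive reference implementation that sorts the suffixes themselves. This sketch is not the method of the invention and is not linear time; the name `naive_suffix_array` is hypothetical, and the code assumes '$' compares smaller than every other character, as it does in ASCII for letters.

```python
def naive_suffix_array(s: str) -> list[int]:
    """Sort the start positions 0..n-1 by the suffix that begins at each one."""
    # Python compares strings lexicographically, which matches the
    # string comparison defined above.
    return sorted(range(len(s)), key=lambda i: s[i:])


print(naive_suffix_array("baac$"))  # [4, 1, 2, 0, 3]
```

A reference of this kind is mainly useful for testing a faster construction on small inputs.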

Many computer algorithms for constructing the suffix array of a character string exist; see [1-8]. By time complexity they fall into two broad classes, linear time and superlinear time. A linear-time algorithm is defined as follows: given a character string of length n over a character set ∑, that is, a string of n characters belonging to ∑, the time complexity of sorting the n suffixes of the string is O(n). Existing linear-time suffix array construction algorithms [3, 4, 5, 7, 8] suffer from slow practical running speed and high space complexity, which limits their use in practice.

References

1) U. Manber and G. Myers, "Suffix arrays: A new method for online string searches," in Proceedings of SODA, 1990, pp. 319-327.

2) U. Manber and G. Myers, "Suffix arrays: A new method for on-line string searches," SIAM Journal on Computing, vol. 22, no. 5, pp. 935-948, 1993.

3) D. K. Kim, J. S. Sim, H. Park, and K. Park, "Linear-time construction of suffix arrays," in Proceedings of CPM, 2003, pp. 186-199.

4) P. Ko and S. Aluru, "Space-efficient linear time construction of suffix arrays," Journal of Discrete Algorithms, vol. 3, no. 2-4, pp. 143-156, 2005.

5) J. Karkkainen, P. Sanders, and S. Burkhardt, "Linear work suffix array construction," JACM, no. 6, pp. 918-936, Nov. 2006.

6) G. Manzini and P. Ferragina, "Engineering a lightweight suffix array construction algorithm," Algorithmica, vol. 40, no. 1, pp. 33-50, Sep. 2004.

7) S. J. Puglisi, W. F. Smyth, and A. H. Turpin, "A taxonomy of suffix array construction algorithms," ACM Comput. Surv., vol. 39, no. 2, pp. 1-31, 2007.

8) S. J. Puglisi, W. F. Smyth, and A. Turpin, "The performance of linear time suffix sorting algorithms," in Proceedings of Data Compression Conference, Mar. 2005, pp. 358-367.

Summary of the invention

In view of the above deficiencies, the present invention proposes a novel linear-time suffix array construction method, comprising:

Step 1) Mark the type of each character and suffix in the string: scan the character string S once from right to left and, following the definition of suffix types, compare the two adjacent characters S[i] and S[i+1] currently being scanned to obtain the type of each character and suffix, which is recorded in the array t;

Step 2) Scan the array t once from left to right to find every position at which an LMS character occurs, thereby obtaining the start pointers of all LMS substrings; record the pointer of each LMS substring in P1;

Step 3) Sort all LMS substrings of S using the LMS-substring pointer array P1 and the arrays B and SA;

Step 4) Rename each LMS substring of S according to the sorted result of step 3), forming a new, shorter string S1;

Step 5) If every character of S1 is unique, directly sort the suffixes of S1 to compute the suffix array SA1 of S1; otherwise recursively call the SA-IS algorithm with S1 and SA1 as input parameters, i.e. SA-IS(S1, SA1);

Step 6) Induce the suffix array SA of S from the suffix array SA1 of S1 obtained in step 5);

Step 7) Return.

The process of sorting all LMS substrings of S in step 3) comprises:

31) Initialize all elements of SA to -1; find the end position in SA of the bucket of each suffix of S; scan S once from left to right, inserting each LMS suffix encountered at the current end position of its bucket in SA and then moving the end position of that bucket one position to the left;

32) Find the start position in SA of the bucket of each suffix of S; scan SA from left to right and, for each scanned element SA[i] that is not -1, if S[SA[i]-1] is L type, insert the value SA[i]-1 at the current start position of the bucket of the suffix suf(S, SA[i]-1) in SA and then move the start position of that bucket one position to the right;

33) Find the end position in SA of the bucket of each suffix of S; scan SA from right to left and, for each scanned element SA[i], if S[SA[i]-1] is S type, insert the value SA[i]-1 at the current end position of the bucket of the suffix suf(S, SA[i]-1) in SA and then move the end position of that bucket one position to the left.

Here, when all suffixes of S are sorted in the array SA by their first character, the suffixes sharing a first character occupy one contiguous region of SA; this region is called the bucket of those suffixes.

The process of computing the new string S1 in step 4) comprises:

41) Scan all sorted LMS substrings in SA from left to right, comparing each pair of adjacent LMS substrings in turn. Number the compared LMS substrings starting from 0: if two LMS substrings are equal they receive the same number; otherwise the latter's number is the former's number plus 1;

42) Replace each LMS substring in S with the number it obtained in step 41); the resulting string is S1.

The process of inducing SA from SA1 in step 6) comprises:

61) Initialize all elements of SA to -1; find the end position in SA of the bucket of each suffix of S; scan SA1 from right to left and, for each scanned element SA1[i], place P1[SA1[i]] at the current end position of the bucket of the suffix suf(S, P1[SA1[i]]) in SA and then move the end position of that bucket one position to the left;

62) Find the start position in SA of the bucket of each suffix of S; scan SA from left to right and, for each scanned element SA[i] that is not -1, if S[SA[i]-1] is L type, insert the value SA[i]-1 at the current start position of the bucket of the suffix suf(S, SA[i]-1) in SA and then move the start position of that bucket one position to the right;

63) Find the end position in SA of the bucket of each suffix of S; scan SA from right to left and, for each scanned element SA[i], if S[SA[i]-1] is S type, insert the value SA[i]-1 at the current end position of the bucket of the suffix suf(S, SA[i]-1) in SA and then move the end position of that bucket one position to the left.
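Steps 1) to 7) and their sub-steps can be put together in a compact Python sketch. It is a reading of the description under stated assumptions, not the patent's own pseudocode: the names `sa_is`, `suffix_array`, `bounds` and `induce` are hypothetical, the input to `sa_is` is a list of non-negative integers whose last element is the unique smallest symbol, and clarity is preferred over the constant factors a tuned implementation would care about.

```python
def sa_is(s: list[int]) -> list[int]:
    """Suffix array of s, where s[-1] is the unique smallest symbol."""
    n = len(s)
    if n == 1:
        return [0]
    # Step 1: character types, True = S type, False = L type.
    t = [True] * n
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])
    # Step 2: start positions of the LMS substrings (array P1).
    p1 = [i for i in range(1, n) if t[i] and not t[i - 1]]
    # Bucket sizes per symbol (plays the role of the auxiliary array B).
    counts = [0] * (max(s) + 1)
    for c in s:
        counts[c] += 1

    def bounds(want_end: bool) -> list[int]:
        """Start or end index of every bucket in SA."""
        out, pos = [], 0
        for size in counts:
            out.append(pos + size - 1 if want_end else pos)
            pos += size
        return out

    def induce(seeds: list[int]) -> list[int]:
        sa = [-1] * n
        end = bounds(True)
        for p in seeds:                      # sub-steps 31) / 61)
            sa[end[s[p]]] = p
            end[s[p]] -= 1
        start = bounds(False)
        for i in range(n):                   # sub-steps 32) / 62): L type
            j = sa[i] - 1
            if sa[i] > 0 and not t[j]:
                sa[start[s[j]]] = j
                start[s[j]] += 1
        end = bounds(True)
        for i in range(n - 1, -1, -1):       # sub-steps 33) / 63): S type
            j = sa[i] - 1
            if sa[i] > 0 and t[j]:
                sa[end[s[j]]] = j
                end[s[j]] -= 1
        return sa

    # Step 3: sort the LMS substrings.
    sa = induce(p1)
    # Step 4: name the LMS substrings in sorted order to build S1.
    last = {p: (p1[k + 1] if k + 1 < len(p1) else p) for k, p in enumerate(p1)}
    names, name, prev = {}, -1, None
    for p in sa:
        if p in last:
            sub = s[p:last[p] + 1]
            if sub != prev:
                name, prev = name + 1, sub
            names[p] = name
    s1 = [names[p] for p in p1]
    # Step 5: sort S1 directly if all names are unique, else recurse.
    if name + 1 == len(s1):
        sa1 = [0] * len(s1)
        for k, c in enumerate(s1):
            sa1[c] = k
    else:
        sa1 = sa_is(s1)
    # Step 6: induce SA from SA1, scanning SA1 from right to left.
    return induce([p1[k] for k in reversed(sa1)])


def suffix_array(text: str) -> list[int]:
    """Map characters to integer ranks, then run the sketch above."""
    rank = {c: r for r, c in enumerate(sorted(set(text)))}
    return sa_is([rank[c] for c in text])


print(suffix_array("baac$"))  # [4, 1, 2, 0, 3]
```

The same `induce` helper serves step 3) and step 6); only the order in which the LMS positions are seeded differs, which mirrors how sub-steps 32), 33) and 62), 63) are worded identically in the description.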

Beneficial effects of the present invention: the invention constructs the suffix array of a character string of length n in linear time O(n). Compared with other existing linear-time suffix array construction methods, the method of the invention runs fast, uses little space, and is easy to implement.

Description of drawings

Fig. 1 is a flowchart of the suffix array construction method of the present invention.

Embodiment

The present invention is further described below with reference to the accompanying drawing.

As shown in Fig. 1, the present invention proposes a novel linear-time suffix array construction method (SA-IS) that effectively overcomes the shortcomings of existing linear-time suffix array construction algorithms. Pseudocode for each step of the flowchart is given below. The elements of each array are stored from left to right: the first element is leftmost and the last element is rightmost.

SA-IS(S, SA)

S: input character string (n characters long, containing n1 LMS substrings);

SA: suffix array of S;

S1: integer array (the new string formed by renaming each LMS substring of S; length n1);

SA1: suffix array of S1;

t: boolean array (the type of each character of S; length n);

P1: integer array (the position at which each LMS substring occurs in S; length n1);

B: integer array (auxiliary array used during sorting; length ||∑(S)||, the number of elements in the character set ∑).

Step 1) Mark the type of each character and suffix in the string. Scan the character string S once from right to left and, following the definition of suffix types, compare the two adjacent characters S[i] and S[i+1] currently being scanned to obtain the type of each character and suffix, which is recorded in the array t;

Step 7) Return.

Steps 3), 4) and 6) are described in further detail below. For convenience, the notion of a "bucket" is introduced first. When all suffixes of S are sorted in the array SA by their first character, the suffixes sharing a first character occupy one contiguous region of SA; this region is called the bucket of those suffixes. If S contains m distinct characters, SA contains m buckets, and the suffixes within each bucket all have the same first character. If the first character of the suffixes in a bucket is 'y', the bucket is called the 'y' bucket for short. In addition, putting a suffix into a cell of SA means that the cell records the position of that suffix in S.
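The bucket boundaries follow directly from the character counts. The sketch below shows one way to compute them; the function name `bucket_bounds` and the dictionary layout are assumptions for illustration and do not reflect how the array B is organized in the patent.

```python
from collections import Counter


def bucket_bounds(s: str) -> dict[str, tuple[int, int]]:
    """Map each character to the (start, end) indices of its bucket in SA."""
    counts = Counter(s)
    bounds = {}
    start = 0
    for ch in sorted(counts):  # buckets appear in character order
        bounds[ch] = (start, start + counts[ch] - 1)
        start += counts[ch]
    return bounds


print(bucket_bounds("mmiissiissiippii$"))
# {'$': (0, 0), 'i': (1, 8), 'm': (9, 10), 'p': (11, 12), 's': (13, 16)}
```

For the string of the worked example this gives five buckets of sizes 1, 8, 2, 2 and 4, which is the grouping shown with braces in the trace.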

The process of sorting all LMS substrings of S in step 3) is as follows:

31) Initialize all elements of SA to -1; find the end position in SA of the bucket of each suffix of S; scan S once from left to right, inserting each LMS suffix encountered at the current end position of its bucket in SA and then moving the end position of that bucket one position to the left;

32) Find the start position in SA of the bucket of each suffix of S; scan SA from left to right and, for each scanned element SA[i] that is not -1, if S[SA[i]-1] is L type, insert the value SA[i]-1 at the current start position of the bucket of the suffix suf(S, SA[i]-1) in SA and then move the start position of that bucket one position to the right;

33) Find the end position in SA of the bucket of each suffix of S; scan SA from right to left and, for each scanned element SA[i], if S[SA[i]-1] is S type, insert the value SA[i]-1 at the current end position of the bucket of the suffix suf(S, SA[i]-1) in SA and then move the end position of that bucket one position to the left.
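Sub-steps 31) to 33) can be sketched as three passes over SA. The code is an illustrative reading of the text, with hypothetical names (`sort_lms_substrings`, `starts`, `ends`), and it assumes the input ends with a unique '$' that compares smaller than all other characters. The input is the string used in the worked example of this description.

```python
def sort_lms_substrings(s: str) -> list[int]:
    n = len(s)
    # Character types: True = S type, False = L type.
    t = [True] * n
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])

    alphabet = sorted(set(s))

    def starts() -> dict[str, int]:
        pos, out = 0, {}
        for c in alphabet:
            out[c] = pos
            pos += s.count(c)
        return out

    def ends() -> dict[str, int]:
        pos, out = 0, {}
        for c in alphabet:
            pos += s.count(c)
            out[c] = pos - 1
        return out

    sa = [-1] * n
    # 31) LMS suffixes go to the ends of their buckets, scanning S left to right.
    end = ends()
    for i in range(1, n):
        if t[i] and not t[i - 1]:
            sa[end[s[i]]] = i
            end[s[i]] -= 1
    # 32) Left-to-right scan: place L-type predecessors at bucket starts.
    start = starts()
    for i in range(n):
        j = sa[i] - 1
        if sa[i] > 0 and not t[j]:
            sa[start[s[j]]] = j
            start[s[j]] += 1
    # 33) Right-to-left scan: place S-type predecessors at bucket ends.
    end = ends()
    for i in range(n - 1, -1, -1):
        j = sa[i] - 1
        if sa[i] > 0 and t[j]:
            sa[end[s[j]]] = j
            end[s[j]] -= 1
    return sa


print(sort_lms_substrings("mmiissiissiippii$"))
# [16, 15, 14, 10, 6, 2, 11, 7, 3, 1, 0, 13, 12, 9, 5, 8, 4]
```

The guard `sa[i] > 0` skips both empty cells (-1) and position 0, which has no predecessor in S.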

The process of computing the new string S1 in step 4) is as follows:

41) Scan all sorted LMS substrings in SA from left to right, comparing each pair of adjacent LMS substrings in turn. Number the compared LMS substrings starting from 0: if two LMS substrings are equal they receive the same number; otherwise the latter's number is the former's number plus 1;

42) Replace each LMS substring in S with the number it obtained in step 41); the resulting string is S1.
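The naming in sub-steps 41) and 42) can be sketched as follows. The function name `name_lms_substrings` is hypothetical, and the sketch assumes that `sa` already holds the result of step 3) for the same string; here that array is written out literally for the string of the worked example.

```python
def name_lms_substrings(s: str, sa: list[int]) -> list[int]:
    n = len(s)
    # Recompute the types to locate the LMS positions (array P1).
    t = [True] * n
    for i in range(n - 2, -1, -1):
        t[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and t[i + 1])
    p1 = [i for i in range(1, n) if t[i] and not t[i - 1]]
    # Each LMS substring runs to the next LMS position; '$' stands alone.
    last = {p: (p1[k + 1] if k + 1 < len(p1) else p) for k, p in enumerate(p1)}

    names = {}
    name, prev = -1, None
    for p in sa:                      # 41) left-to-right scan of SA
        if p in last:                 # only LMS positions are named
            sub = s[p:last[p] + 1]
            if sub != prev:           # a new substring gets the next number
                name += 1
                prev = sub
            names[p] = name
    return [names[p] for p in p1]     # 42) replace substrings in text order


sa = [16, 15, 14, 10, 6, 2, 11, 7, 3, 1, 0, 13, 12, 9, 5, 8, 4]
print(name_lms_substrings("mmiissiissiippii$", sa))  # [2, 2, 1, 0]
```

Because S1 contains the number 2 twice, its characters are not all unique, so step 5) would take the recursive branch for this input.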

The process of inducing SA from SA1 in step 6) is as follows:

61) Initialize all elements of SA to -1; find the end position in SA of the bucket of each suffix of S; scan SA1 from right to left and, for each scanned element SA1[i], place P1[SA1[i]] at the current end position of the bucket of the suffix suf(S, P1[SA1[i]]) in SA and then move the end position of that bucket one position to the left;

Taking the character string "mmiissiissiippii$" as an example, the detailed process of computing the new string S1 from S in the SA-IS algorithm is given below to aid understanding of the invention. The result of each step is shown first:

00         0                   1
01 Index:  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
02 S:      m m i i s s i i s s i i p p i i $
03 t:      L L S S L L S S L L S S L L L L S
04 LMS:        *       *       *           *
05 Step 1:
06 Bucket:  $    i                         m       p       s
07 SA:     {16} {-1 -1 -1 -1 -1 10 06 02} {-1 -1} {-1 -1} {-1 -1 -1 -1}
08 Step 2:
09 Bucket:  $    i                         m       p       s
10 SA:     {16} {-1 -1 -1 -1 -1 10 06 02} {-1 -1} {-1 -1} {-1 -1 -1 -1}
11 @ ^^^^^
12         {16} {15 -1 -1 -1 -1 10 06 02} {-1 -1} {-1 -1} {-1 -1 -1 -1}
13 ^@ ^^^^
14         {16} {15 14 -1 -1 -1 10 06 02} {-1 -1} {13 -1} {-1 -1 -1 -1}
15 ^@ ^^^^
16         {16} {15 14 -1 -1 -1 10 06 02} {-1 -1} {13 -1} {09 -1 -1 -1}
17 ^^@ ^^^
18         {16} {15 14 -1 -1 -1 10 06 02} {-1 -1} {13 -1} {09 05 -1 -1}
19 ^^@ ^^^
20         {16} {15 14 -1 -1 -1 10 06 02} {01 -1} {13 -1} {09 05 -1 -1}
21 ^^@ ^^^
22         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 -1} {09 05 -1 -1}
23 ^^@ ^^^
24         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 -1 -1}
25 ^^^@ ^^
26         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 08 -1}
27 ^^^^@ ^
28         {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 08 04}
29 ^^^^@ ^
30 Step 3:
31 Bucket:  $    i                         m       p       s
32 SA:     {16} {15 14 -1 -1 -1 10 06 02} {01 00} {13 12} {09 05 08 04}
33 ^^^^@^
34         {16} {15 14 -1 -1 -1 10 06 03} {01 00} {13 12} {09 05 08 04}
35 ^^^^@ ^
36         {16} {15 14 -1 -1 -1 10 07 03} {01 00} {13 12} {09 05 08 04}
37 ^^^@ ^^
38         {16} {15 14 -1 -1 -1 11 07 03} {01 00} {13 12} {09 05 08 04}
39 ^^@ ^^^
40         {16} {15 14 -1 -1 02 11 07 03} {01 00} {13 12} {09 05 08 04}
41 ^^@ ^^^
42         {16} {15 14 -1 06 02 11 07 03} {01 00} {13 12} {09 05 08 04}
43 ^^^^^
44         {16} {15 14 10 06 02 11 07 03} {01 00} {13 12} {09 05 08 04}
45 ^^@ ^^^
46 S1:     2 2 1 0

The steps above are explained as follows.

Step 1) Mark the type of each character in the string. First, '$' is S type. Then scan S once from right to left and, following the definition of suffix types, compare the currently scanned character S[i] with its successor S[i+1] to obtain the type of S[i]: if S[i] > S[i+1], S[i] is L type and t[i] = 0; if S[i] < S[i+1], S[i] is S type and t[i] = 1; if S[i] = S[i+1], S[i] has the same type as S[i+1] and t[i] = t[i+1]. The resulting t is shown in row 03.

Step 2) Obtain the position of each LMS substring in S. Scan the character type array t from left to right and mark the position of each LMS substring; these positions are marked with '*' in row 04. The positions of the LMS substrings are recorded in the array P1.

Step 3) Sort all LMS substrings using the LMS-substring pointer array P1 and the arrays B and SA, in the following three sub-steps.

31) Put each LMS suffix of S into its bucket in SA. First initialize all elements of SA to -1 and find the start and end positions of each suffix bucket in SA, namely the five character buckets for '$', 'i', 'm', 'p' and 's'. Then put the four LMS suffixes into their respective buckets, filling each bucket from its end towards its front. The first character S[2] of the first LMS suffix is 'i', so that suffix is placed at the last position of the 'i' bucket. The first character S[6] of the second LMS suffix is 'i', so it is placed at the second-to-last position of the 'i' bucket. Proceeding in this way yields row 07.

32) Find the start position of each suffix bucket in SA, marked with '^' in row 11. Scan SA from left to right; the element currently being scanned is marked below it with '@'. For a scanned element SA[i], skip it if it is -1; otherwise check whether the character S[SA[i]-1] is L type. If it is not L type, skip it; otherwise insert the value SA[i]-1 at the current start position of the bucket of the character S[SA[i]-1] and move the start position of that bucket one position to the right. For example, in row 10, SA[0] = 16, and looking up the array t shows that S[16-1] = 'i' is L type, so 16-1 = 15 is placed at the current start position of the 'i' bucket and the start position of that bucket is moved one position to the right. When the scan finishes, the data in SA are as shown in row 28.

33) Find the end position of each suffix bucket in SA, marked with '^' in row 33. Scan SA from right to left; the element currently being scanned is marked below it with '@'. For a scanned element SA[i], check whether the character S[SA[i]-1] is S type. If it is not S type, skip it; otherwise insert the value SA[i]-1 at the current end position of the bucket of the character S[SA[i]-1] and move the end position of that bucket one position to the left. Since the last element of SA is SA[n-1] = 4, the type of S[4-1] is examined. Because S[4-1] = 'i' is S type, 4-1 = 3 is placed at the current end position of the 'i' bucket and the end position of that bucket is moved one position to the left. When the scan finishes, the data in SA are as shown in row 44.

Step 4) Rename the LMS substrings of S to obtain S1. Scan SA from left to right to examine all sorted LMS substrings. Number the LMS substrings starting from 0: if two adjacent LMS substrings are unequal, the latter's number is the former's number plus 1; otherwise the two substrings receive the same number. For example, the first one scanned is $, so $ is numbered 0. Next, the value 10 in the 'i' bucket is scanned; its LMS substring is iippii$, which is not equal to $, so it is numbered 0+1 = 1. Next, the value 6 in the 'i' bucket is scanned; its LMS substring is iissi, which is not equal to iippii$, so it is numbered 1+1 = 2. Next, the value 2 in the 'i' bucket is scanned; its LMS substring is iissi, which equals the previously scanned LMS substring, so both are numbered 2. Finally, each LMS substring in S is replaced with its number, giving the new string S1 = 2210.

The above is only a preferred embodiment of the present invention, and the present invention is not limited to this embodiment. Minor local modifications may arise in implementation. Changes or modifications that do not depart from the spirit and scope of the present invention and that fall within the scope of the claims and their technical equivalents are also intended to be covered by the present invention.

Claims

1. A suffix array construction method, characterized in that it comprises:

Step 7) Return.

2. The suffix array construction method according to claim 1, characterized in that the process of sorting all LMS substrings of S in step 3) comprises:

3. The suffix array construction method according to claim 2, characterized in that the process of computing the new string S1 in step 4) comprises:

4. The suffix array construction method according to claim 3, characterized in that the process of inducing SA from SA1 in step 6) comprises: