JP2002269096A

JP2002269096A - Method and device for character string restoration and recording medium

Info

Publication number: JP2002269096A
Application number: JP2001064405A
Authority: JP
Inventors: Hideo Ito; 秀夫伊東
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-03-08
Filing date: 2001-03-08
Publication date: 2002-09-20

Abstract

PROBLEM TO BE SOLVED: To reduce the storage capacity needed for character string retrieval using a suffix array. SOLUTION: A character string is restored by using the suffix array of the character string and a character table that stores, for each character in the character string, a position in the suffix array. Basically, the total storage capacity is reduced by storing the suffix array and the character table, instead of the suffix array and the character string itself, and restoring the character string from the suffix array and the character table when character string retrieval is performed. For this purpose, the device is equipped with a character string buffer which stores the character string, a suffix array buffer which stores the suffix array of the character string, a character table buffer which stores, for each character in the character string, its position in the suffix array, and a character string restoration part which refers to the character table buffer and the suffix array buffer and stores characters at specific positions in the character string buffer.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention]

The present invention relates to a character string restoring method, a device therefor, and a recording medium, in which a character string is restored using a suffix array of the character string and a character table that stores, for each character in the character string, a position in the suffix array.

[0002]

[Description of the Related Art]

Japanese Patent Application Laid-Open No. H10-260980 describes a technique whose basic operation, when searching a text that remains in the form of a compressed code string, is to match a prefix of the search key against a suffix in the text. In addition, U. Manber and G. Myers, "Suffix arrays: a new method for on-line string searches", SIAM Journal on Computing, 22(5), pp. 935-948, 1993, proposes the suffix array and presents algorithms for constructing and using it.

[0003] Given a text T and an input pattern P, the problem of finding the positions at which P occurs in T is called string matching. When T is known in advance, fast string matching can be achieved by building an index on T beforehand and using it.

[0004] The suffix array has been proposed as an example of such an index (U. Manber and G. Myers, 1993). An example of a suffix array is shown below for the text T = BANANA.

[0005]

[0006] A symbol $ indicating the end of the text is appended to the text. A suffix is the character string extending from a given character in the text to the end of the text, so N suffixes can be defined for a text of N characters. Each suffix corresponds one-to-one to the position in the text at which its first character occurs; this position is called a pointer. The suffix array is the sequence of pointers obtained by arranging all the pointers so that the suffixes they designate are in dictionary order. Here, the dictionary order of suffixes is defined on the basis of character values assigned in advance to the characters making up each suffix (called the constituent characters of the suffix). In this description, the character values are ordered as follows.

[0007] $ < A < B < N

The following suffix array is therefore obtained from the above example.

[0008] Suffix array: 5 3 1 0 4 2

For example, the suffix corresponding to the first pointer, 5, is A$, which precedes ANA$, the suffix corresponding to pointer 3, in dictionary order.
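The construction described in paragraphs [0006] to [0008] can be illustrated with a minimal sketch. It is not the construction algorithm of Manber and Myers; it simply sorts the pointers by comparing whole suffixes, which is adequate only for short texts. The function name `build_suffix_array` and the use of Python's built-in character ordering (in which `$` sorts before the letters A, B, and N) are assumptions made for this illustration.

```python
def build_suffix_array(text: str, end_mark: str = "$") -> list[int]:
    """Return the pointers of `text` sorted by the dictionary order of their suffixes.

    Naive illustration: every suffix is materialised and compared directly.
    The end mark is appended so that a shorter suffix such as "A$" sorts
    before a longer one such as "ANA$".
    """
    padded = text + end_mark
    # One pointer per character of the original text (the suffix "$" itself is not listed).
    return sorted(range(len(text)), key=lambda pointer: padded[pointer:])


if __name__ == "__main__":
    print(build_suffix_array("BANANA"))  # [5, 3, 1, 0, 4, 2]
```

Pointers are counted from 0 here, which agrees with the example in which the suffix BANANA$ has pointer 0.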

[0009] The positions at which P occurs in T can be found by a binary search on the suffix array. That is, the pointers to the suffixes that have P as a prefix are exactly the positions at which P occurs in T, and because these pointers are stored contiguously in the suffix array, it suffices to find that contiguous range by binary search. Although it does not concern suffix arrays, a search of this kind also appears in Japanese Patent Application Laid-Open No. H10-260980.
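The binary search of paragraph [0009] might be sketched as follows. The sketch finds the lower and upper ends of the contiguous range by two separate binary searches over suffix prefixes truncated to the length of the pattern. The names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive suffix array construction is included only to keep the example self-contained.

```python
def build_suffix_array(text: str, end_mark: str = "$") -> list[int]:
    """Naive suffix array construction, used here only to supply test data."""
    padded = text + end_mark
    return sorted(range(len(text)), key=lambda pointer: padded[pointer:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str,
                     end_mark: str = "$") -> list[int]:
    """Return the positions in `text` at which `pattern` occurs, in ascending order."""
    padded = text + end_mark
    m = len(pattern)

    def prefix(index: int) -> str:
        # First m characters of the suffix stored at this suffix array index.
        pointer = suffix_array[index]
        return padded[pointer:pointer + m]

    # Lower end: first index whose suffix prefix is not smaller than the pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper end: first index whose suffix prefix is greater than the pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # The pointers in suffix_array[start:lo] are the occurrence positions.
    return sorted(suffix_array[start:lo])


if __name__ == "__main__":
    sa = build_suffix_array("BANANA")
    print(find_occurrences("BANANA", sa, "ANA"))  # [1, 3]
```

Note that the search compares the pattern against characters of the text itself, which is why the character string must be available, either stored or restored, at search time.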

[0010]

[Problems to Be Solved by the Invention]

Conventionally, a character string search using a suffix array requires that two items be stored: (A) the suffix array of the character string and (B) the character string itself. As the number of target character strings grows, the cost of storing these becomes a practical problem.

[0011] An object of the present invention is to reduce the amount of storage required for character string search using a suffix array.

[0012]

[Means for Solving the Problems]

Basically, instead of (A) and (B) above, the following (A) and (C) are stored, and (B) is restored from (A) and (C) when a character string search is executed, thereby reducing the total amount of storage.

(A) The suffix array of the character string
(C) The character table

[0013] Specifically, the means for achieving the object comprises a character string buffer for storing a character string, a suffix array buffer for storing the suffix array of the character string, a character table buffer for storing, for each character in the character string, a position in the suffix array, and a character string restoring unit that refers to the character table buffer and the suffix array buffer and stores characters at predetermined positions in the character string buffer.

[0014]

[Embodiments of the Invention]

An embodiment of the present invention will be described with reference to the drawings. FIG. 1 shows an example of the configuration of the device, and FIG. 2 shows an example of its software.

[0015] For the example above, the suffix array for the character string "BANANA" is as shown in FIG. 3. The upper row of FIG. 3 makes explicit the element positions (array indices) of the suffix array, and the lower row gives the value of each element (pointer) of the suffix array.

[0016] Assume that the character table shown in FIG. 4 has been prepared for the suffix array of FIG. 3. In FIG. 4, the left column (the character code column) stores the character codes of the characters that occur in the character string, in the dictionary order of the characters. The final "#" row is provided to mark the last row of the character table.

[0017] The right column (the array index column) stores, for each character code, the array index in the suffix array of the element that holds the pointer to the first suffix in dictionary order among the suffixes beginning with that character (that is, pointer 5 for "A", pointer 0 for "B", and pointer 4 for "N").
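As an illustration of paragraphs [0016] and [0017], the character table could be derived from the text and its suffix array as sketched below. FIG. 4 is not reproduced in this text, so two points are assumptions of the sketch rather than statements of the figure: array indices are counted from 0, and the final "#" row stores the length of the suffix array so that it can act as an end marker. The name `build_char_table` is likewise hypothetical.

```python
def build_char_table(text: str, suffix_array: list[int]) -> list[tuple[str, int]]:
    """Return rows (character code, array index) in the dictionary order of the characters.

    Each row records the suffix array index of the first suffix that begins
    with that character. A final ("#", len(suffix_array)) row closes the table
    (assumed sentinel value).
    """
    table: list[tuple[str, int]] = []
    previous = None
    for index, pointer in enumerate(suffix_array):
        first_char = text[pointer]  # first character of the suffix at this index
        if first_char != previous:
            table.append((first_char, index))
            previous = first_char
    table.append(("#", len(suffix_array)))
    return table


if __name__ == "__main__":
    suffix_array = [5, 3, 1, 0, 4, 2]  # suffix array of "BANANA" from paragraph [0008]
    print(build_char_table("BANANA", suffix_array))
    # [('A', 0), ('B', 3), ('N', 4), ('#', 6)]
```

Under these assumptions the rows for "A", "B", and "N" point at the suffix array elements holding pointers 5, 0, and 4 respectively, in agreement with the values given in the description.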

[0018] The suffix array of FIG. 3 and the character table of FIG. 4 are assumed to be stored in the suffix array buffer and the character table buffer, respectively.

[0019] The character string restoring unit restores the character string "BANANA" from the character table and the suffix array according to the flowchart shown in FIG. 5, where:

- The character code in row I of the character table is denoted table[I].code, and its array index is denoted table[I].index.
- The element of the suffix array at array index P is denoted A[P].
- The element of the character string buffer at array index P is denoted S[P].
- When this processing ends, the character string has been restored in the character string buffer S[..].
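The flowchart of FIG. 5 is not reproduced in this text, so the following is only a plausible sketch of the restoration, written with the notation of paragraph [0019] (table[I].code, table[I].index, A[P], S[P]). It assumes 0-based array indices and assumes that the final "#" row holds the length of the suffix array, so that row I covers the suffix array indices from table[I].index up to, but not including, table[I+1].index. The names `Row` and `restore_string` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    code: str   # character code of this row, table[I].code
    index: int  # suffix array index of the first suffix starting with `code`, table[I].index


def restore_string(A: list[int], table: list[Row]) -> str:
    """Restore the character string from suffix array A and the character table."""
    S = [""] * len(A)  # character string buffer
    # Visit every row except the final "#" row, which only marks the end.
    for I in range(len(table) - 1):
        # All suffixes in this index range begin with table[I].code.
        for P in range(table[I].index, table[I + 1].index):
            S[A[P]] = table[I].code  # the pointer A[P] is the position of that character
    return "".join(S)


if __name__ == "__main__":
    A = [5, 3, 1, 0, 4, 2]
    table = [Row("A", 0), Row("B", 3), Row("N", 4), Row("#", 6)]
    print(restore_string(A, table))  # BANANA
```

Each element of the suffix array is read exactly once, so the work is a single pass over the suffix array.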

[0020]

[Effects of the Invention]

The character table requires only an amount of storage proportional to the number of distinct characters, regardless of the size of the target character string. Therefore, when the target character string is large, storing the character table instead of the target character string reduces the total amount of storage.
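The storage argument can be made concrete with a small counting sketch. It counts items only (characters of the text versus rows of the character table) and says nothing about bytes per item, which would depend on the encoding chosen. It assumes one table row per distinct character plus the final "#" row; the function name `stored_item_counts` is hypothetical.

```python
def stored_item_counts(text: str) -> tuple[int, int]:
    """Return (characters needed to store the text, rows needed for the character table)."""
    text_chars = len(text)
    # One row per distinct character, plus the closing "#" row.
    table_rows = len(set(text)) + 1
    return text_chars, table_rows


if __name__ == "__main__":
    print(stored_item_counts("BANANA"))         # (6, 4)
    print(stored_item_counts("BANANA" * 1000))  # (6000, 4)
```

The text grows with its length while the table stays at the same number of rows as long as no new characters appear.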

[0021] Moreover, when a character string search becomes necessary, the character string is restored by the character string restoring unit; this processing requires only a single scan of the suffix array and can therefore be performed very quickly.

[Brief description of the drawings]

FIG. 1 is a block diagram illustrating an example of a hardware configuration according to an embodiment of the present invention.

FIG. 2 is a block diagram showing an example of the software.

FIG. 3 is an array diagram showing an example of a suffix array.

FIG. 4 is an array diagram showing a character table for the suffix array.

FIG. 5 is a flowchart illustrating the processing of the character string restoring unit.

Claims

1. A character string restoring method comprising restoring a character string by using a suffix array of the character string and a character table that stores, for each character in the character string, a position in the suffix array.

2. A character string restoring device comprising: a character string buffer for storing a character string; a suffix array buffer for storing a suffix array of the character string; a character table buffer for storing, for each character in the character string, a position in the suffix array; and a character string restoring unit that refers to the character table buffer and the suffix array buffer and stores characters at predetermined positions in the character string buffer.

3. A computer-readable recording medium storing a program for restoring a character string by using a suffix array of the character string and a character table that stores, for each character in the character string, a position in the suffix array.