JP2002269096A - Method and device for character string restoration and recording medium - Google Patents

Method and device for character string restoration and recording medium

Info

Publication number
JP2002269096A
Authority
JP
Japan
Prior art keywords
character string
character
suffix array
suffix
buffer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2001064405A
Other languages
Japanese (ja)
Inventor
Hideo Ito
秀夫 伊東
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2001-03-08
Filing date
2001-03-08
Publication date
2002-09-20
Application filed by Ricoh Co Ltd
Priority to JP2001064405A
Publication of JP2002269096A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: To reduce the storage capacity needed for character string retrieval using a suffix array. SOLUTION: A character string is restored from its suffix array and a character table that holds, for each character in the string, a position in the suffix array. In essence, total storage is reduced by storing the suffix array and the character table instead of the suffix array and the character string itself, and by restoring the character string from the suffix array and the character table when a search is performed. To this end, the device comprises a character string buffer that stores the character string, a suffix array buffer that stores the suffix array of the character string, a character table buffer that stores, for each character in the character string, a position in the suffix array, and a character string restoration part that refers to the character table buffer and the suffix array buffer and stores characters at specific positions in the character string buffer.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a character string restoration method, a device therefor, and a recording medium, in which a character string is restored using the suffix array of the character string and a character table that stores, for each character in the character string, a position in the suffix array.

[0002]

[Prior Art] Japanese Patent Application Laid-Open No. H10-260980 describes a technique whose basic operation, when searching text that remains in the form of a compressed code sequence, is to match a prefix of the search key against suffixes in the text. In addition, U. Manber and G. Myers, "Suffix arrays: a new method for on-line string searches", SIAM Journal on Computing, 22(5), pp. 935-948, 1993, proposes the suffix array and presents algorithms for constructing and using it.

[0003] Given a text T and an input pattern P, the problem of finding the positions at which P occurs in T is called string matching. When T is known in advance, fast string matching can be achieved by building an index over T beforehand and using it.

[0004] The suffix array has been proposed as one such index (U. Manber and G. Myers, 1993). An example of a suffix array is shown below for the text T = BANANA.

[0005]

[0006] A symbol $ marking the end of the text is appended to the text. A suffix is the character string that runs from a given character of the text to the end of the text, so N suffixes can be defined for a text of N characters. Each suffix corresponds one-to-one with the position in the text of its first character; this position is called a pointer. A suffix array is the sequence of pointers obtained by arranging all pointers so that the suffixes they point to are in dictionary order. The dictionary order of suffixes is defined in terms of character values assigned in advance to the characters that make up each suffix (its constituent characters). In this description, the character values are defined as follows.

[0007] $ < A < B < N
The example above therefore yields the following suffix array.

[0008] Suffix array: 5 3 1 0 4 2
For example, the suffix corresponding to the first pointer, 5, is A$, which precedes in dictionary order the suffix ANA$ corresponding to pointer 3.
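The definition above can be reproduced with a naive sort of the suffixes. The following is a minimal Python sketch, not the construction algorithm of Manber and Myers; the function name `build_suffix_array` is hypothetical, and the sketch assumes that Python's default code-point ordering agrees with the character values $ < A < B < N, which holds for these ASCII characters.

```python
def build_suffix_array(text: str, terminator: str = "$") -> list[int]:
    """Return the pointers (0-based start positions) sorted by suffix order.

    The terminator is appended only for comparison, so a text of N
    characters yields N pointers, as in the description.
    """
    padded = text + terminator
    # Naive approach: sort positions by the suffix that starts there.
    return sorted(range(len(text)), key=lambda i: padded[i:])


sa = build_suffix_array("BANANA")
print(sa)  # [5, 3, 1, 0, 4, 2]
```

The naive sort compares whole suffixes and is therefore slow for long texts; it is used here only because it mirrors the definition directly.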

[0009] The positions at which P occurs in T can be found by binary search on the suffix array. A pointer to any suffix that has P as a prefix is a position at which P occurs in T, and such pointers occupy a contiguous range of the suffix array, so it suffices to find that range by binary search. Although it does not concern suffix arrays, a search of this kind also appears in Japanese Patent Application Laid-Open No. H10-260980.
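The binary search described here might look roughly like the sketch below. It is an illustration under stated assumptions rather than the procedure of either cited work: the name `find_occurrences` is hypothetical, pointers are taken to be 0-based, and the comparison relies on Python's rule that a string sorts before any longer string it prefixes, which matches a terminator smaller than every character.

```python
def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all positions of `pattern` in `text`, using suffix array `sa`."""
    m = len(pattern)

    def prefix(k: int) -> str:
        # First m characters of the suffix stored at suffix array position k.
        return text[sa[k]:sa[k] + m]

    # Lower bound: first position whose suffix prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first position whose suffix prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # The contiguous range sa[start:lo] holds every matching pointer.
    return sorted(sa[start:lo])


print(find_occurrences("BANANA", [5, 3, 1, 0, 4, 2], "ANA"))  # [1, 3]
```

Note that each comparison reads characters from the text itself, so the search needs access to the text as well as to the suffix array.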

[0010]

[Problems to Be Solved by the Invention] Conventionally, a character string search that uses a suffix array requires two items to be stored:
(A) the suffix array of the character string
(B) the character string
As the number of target character strings grows, the cost of storing both becomes a practical problem.

[0011] An object of the present invention is to reduce the storage required for character string search using a suffix array.

[0012]

[Means for Solving the Problems] In essence, the following (A) and (C) are stored in place of (A) and (B) above, and (B) is restored from (A) and (C) when a character string search is executed, which reduces the total storage.
(A) the suffix array of the character string
(C) the character table

[0013] Specifically, the means for achieving the object comprises a character string buffer for storing a character string, a suffix array buffer for storing the suffix array of the character string, a character table buffer that stores, for each character in the character string, a position in the suffix array, and a character string restoration unit that refers to the character table buffer and the suffix array buffer and stores characters at predetermined positions in the character string buffer.

[0014]

[Embodiments of the Invention] An embodiment of the present invention is described with reference to the drawings. FIG. 1 shows an example configuration of the device, and FIG. 2 shows an example of its software.

[0015] The suffix array for the character string "BANANA" in the example above is shown in FIG. 3. The upper row of FIG. 3 gives the element positions (array subscripts) of the suffix array, and the lower row gives the value of each element (pointer).

[0016] Assume that the character table shown in FIG. 4 has been prepared for the suffix array of FIG. 3. In FIG. 4, the left column (character codes) stores the character codes of the characters that occur in the character string, in dictionary order of the characters. The final "#" row is provided to mark the last row of the character table.

[0017] The right column (array subscripts) stores, for each character code, the array subscript in the suffix array of the element that holds the pointer to the first suffix in dictionary order among the suffixes beginning with that character (that is, pointer 5 for "A", pointer 0 for "B", and pointer 4 for "N").
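To make the two columns concrete, the sketch below derives a character table from the text and its suffix array. FIG. 4 is not reproduced in this text, so several details are assumptions: array subscripts are taken to be 0-based, the final "#" row is assumed to hold the length of the suffix array as its subscript, and the names `Row` and `build_char_table` are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    code: str   # character code (left column)
    index: int  # array subscript into the suffix array (right column)


def build_char_table(text: str, sa: list[int]) -> list[Row]:
    """Record, for each distinct character, where its run of suffixes begins."""
    table: list[Row] = []
    for pos, ptr in enumerate(sa):
        ch = text[ptr]  # first character of the suffix at this position
        if not table or table[-1].code != ch:
            table.append(Row(ch, pos))
    # Assumed sentinel row: marks the end of the table and of the last run.
    table.append(Row("#", len(sa)))
    return table


table = build_char_table("BANANA", [5, 3, 1, 0, 4, 2])
for row in table:
    print(row.code, row.index)
```

Under these assumptions the rows for BANANA are A at subscript 0 (pointer 5), B at subscript 3 (pointer 0), and N at subscript 4 (pointer 4), followed by the "#" row.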

[0018] Assume that the suffix array of FIG. 3 and the character table of FIG. 4 are stored in the suffix array buffer and the character table buffer, respectively.

[0019] The character string restoration unit restores the character string "BANANA" from the character table and the suffix array according to the flowchart shown in FIG. 5, where:
  • The character code in row I of the character table is written table[I].code, and its array subscript is written table[I].index.
  • The element of the suffix array at array subscript P is written A[P].
  • The element of the character string buffer at array subscript P is written S[P].
  • When this processing ends, the character string has been restored in the character string buffer S[..].
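The flowchart of FIG. 5 is not reproduced in this text, so the following Python sketch is only one plausible reading of the restoration step, written with the notation table[I].code, table[I].index, A[P], and S[P] introduced above. It assumes 0-based subscripts and that the final "#" row stores the length of the suffix array; the names `Row` and `restore_string` are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    code: str   # table[I].code
    index: int  # table[I].index


def restore_string(table: list[Row], A: list[int]) -> str:
    """Rebuild the character string from the character table and suffix array."""
    S = [""] * len(A)  # character string buffer
    # Every row except the final "#" row describes one run of the suffix array.
    for I in range(len(table) - 1):
        # Suffixes at subscripts table[I].index .. table[I+1].index - 1
        # all begin with table[I].code.
        for P in range(table[I].index, table[I + 1].index):
            S[A[P]] = table[I].code  # A[P] is the text position of that character
    return "".join(S)


table = [Row("A", 0), Row("B", 3), Row("N", 4), Row("#", 6)]
print(restore_string(table, [5, 3, 1, 0, 4, 2]))  # BANANA
```

Each suffix array element is visited exactly once, so the sketch performs a single pass over the suffix array.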

[0020]

[Effects of the Invention] The character table requires storage only in proportion to the number of distinct characters, regardless of the size of the target character string. Therefore, when the target character string is large, storing the character table instead of the target character string reduces the total storage.
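The size argument can be illustrated by counting entries. The sketch below compares the number of characters in the string buffer with the number of rows in the character table; the function name `entry_counts` is hypothetical, the extra row is the assumed final "#" row, and entry counts are used rather than bytes because the source does not state element sizes.

```python
def entry_counts(text: str) -> tuple[int, int]:
    """Return (characters in the string, rows in the character table)."""
    string_entries = len(text)
    table_rows = len(set(text)) + 1  # one row per distinct character, plus "#"
    return string_entries, table_rows


print(entry_counts("BANANA"))         # (6, 4)
print(entry_counts("BANANA" * 1000))  # (6000, 4)
```

The table size stays fixed while the string grows, which is the property this paragraph relies on.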

[0021] When a character string search is required, the character string restoration unit restores the character string; this processing requires only a single scan of the suffix array and can therefore be performed very quickly.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an example of a hardware configuration according to an embodiment of the present invention.

FIG. 2 is a block diagram showing an example of the software.

FIG. 3 is an array diagram showing an example of a suffix array.

FIG. 4 is an array diagram showing the character table for the suffix array.

FIG. 5 is a flowchart showing the processing of the character string restoration unit.

Claims (3)

[Claims]
1. A character string restoration method that restores a character string using the suffix array of the character string and a character table storing, for each character in the character string, a position in the suffix array.
2. A character string restoration device comprising: a character string buffer for storing a character string; a suffix array buffer for storing the suffix array of the character string; a character table buffer that stores, for each character in the character string, a position in the suffix array; and a character string restoration unit that refers to the character table buffer and the suffix array buffer and stores characters at predetermined positions in the character string buffer.
3. A computer-readable storage medium on which is recorded a program that restores a character string using the suffix array of the character string and a character table storing, for each character in the character string, a position in the suffix array.
JP2001064405A 2001-03-08 2001-03-08 Method and device for character string restoration and recording medium Pending JP2002269096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2001064405A JP2002269096A (en) 2001-03-08 2001-03-08 Method and device for character string restoration and recording medium

Publications (1)

Publication Number Publication Date
JP2002269096A true JP2002269096A (en) 2002-09-20

Family

ID=18923228

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2001064405A Pending JP2002269096A (en) 2001-03-08 2001-03-08 Method and device for character string restoration and recording medium

Country Status (1)

Country Link
JP (1) JP2002269096A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000259646A (en) * 1999-03-05 2000-09-22 Ricoh Co Ltd Information indexing device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012166190A1 (en) * 2011-06-03 2012-12-06 Microsoft Corporation Compression match enumeration
US8493249B2 (en) 2011-06-03 2013-07-23 Microsoft Corporation Compression match enumeration
US9065469B2 (en) 2011-06-03 2015-06-23 Microsoft Technology Licensing, Llc Compression match enumeration

Legal Events

Date Code Title Description
RD04 Notification of resignation of power of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7424

Effective date: 20040930

RD01 Notification of change of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7421

Effective date: 20051021

RD01 Notification of change of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7421

Effective date: 20060811

A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20080229

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20100723

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20100803

A02 Decision of refusal

Free format text: JAPANESE INTERMEDIATE CODE: A02

Effective date: 20101130