JP2002269096A - Method and device for character string restoration and recording medium - Google Patents
Method and device for character string restoration and recording medium
- Publication number
- JP2002269096A
- Authority
- JP
- Japan
- Prior art keywords
- character string
- character
- suffix array
- suffix
- buffer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
[0001]
[Technical field of the invention] The present invention relates to a character string restoration method, a device therefor, and a recording medium, in which a character string is restored using the suffix array of the character string and a character table that stores, for each character in the string, a position in the suffix array.
[0002]
[Prior art] Japanese Unexamined Patent Application Publication No. H10-260980 describes a technique for searching a text while it remains a compressed code string, whose basic operation is matching a prefix of the search key against a suffix in the text. In addition, U. Manber and G. Myers, "Suffix arrays: a new method for on-line string searches", SIAM Journal on Computing, 22(5), pp. 935-948, 1993, proposes the suffix array and presents algorithms for constructing and using it.
[0003] Given a text T and an input pattern P, the problem of finding the positions at which P occurs in T is called string matching. When T is known in advance, fast string matching can be achieved by building an index over T beforehand and using it.
[0004] The suffix array has been proposed as one such index (U. Manber and G. Myers, 1993). An example of a suffix array is shown below for the text T = BANANA.
[0005]
[0006] A symbol $ marking the end of the text is appended to the text. A suffix is the character string extending from a given character in the text to the end of the text; a text of N characters has N suffixes. Each suffix corresponds one-to-one to the position in the text at which its first character occurs. This position is called a pointer. The suffix array is the sequence of all pointers arranged so that the suffixes they refer to are in dictionary order. Here, the dictionary order of suffixes is defined from character values assigned in advance to the characters that make up each suffix (called the constituent characters of the suffix). In this description, the character values are ordered as follows.
[0007] $ < A < B < N
From the example above, the following suffix array is therefore obtained.
[0008] Suffix array: 5 3 1 0 4 2
For example, the suffix corresponding to the first pointer, 5, is A$, which precedes in dictionary order the suffix ANA$ corresponding to pointer 3.
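The construction described above can be illustrated with a minimal Python sketch. It is not the construction algorithm of Manber and Myers; it simply sorts the pointers by their terminated suffixes, which is enough for a short example. The function name `build_suffix_array` is hypothetical, and the sketch assumes that the built-in character ordering places `$` before the letters, which holds for ASCII.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the pointers 0..N-1 ordered by the dictionary order of their suffixes."""
    # Append the end-of-text symbol; in ASCII "$" sorts before "A", "B", "N".
    terminated = text + "$"
    # One pointer per character of the original text (N suffixes for N characters).
    return sorted(range(len(text)), key=lambda pointer: terminated[pointer:])


print(build_suffix_array("BANANA"))  # [5, 3, 1, 0, 4, 2]
```

The sorted suffixes are A$, ANA$, ANANA$, BANANA$, NA$, NANA$, which gives the pointer sequence 5 3 1 0 4 2 stated in the text.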
[0009] The positions at which P occurs in T can be found by binary search on the suffix array. That is, a pointer to any suffix that has P as a prefix is an occurrence position of P in T, and because these pointers are contiguous in the suffix array, it suffices to find that contiguous range by binary search. Although it does not concern suffix arrays, a search of this kind also appears in Japanese Unexamined Patent Application Publication No. H10-260980.
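As a rough illustration of this search, the following Python sketch finds the contiguous range with two binary searches, one for each end of the range. The function name `find_occurrences` is hypothetical, and the sketch assumes the pattern does not contain the end symbol `$` and that the text is available for comparison.

```python
from typing import List


def find_occurrences(text: str, suffix_array: List[int], pattern: str) -> List[int]:
    """Return the positions in text at which pattern occurs, in ascending order."""
    terminated = text + "$"
    m = len(pattern)

    def prefix(k: int) -> str:
        # First m characters of the k-th smallest suffix.
        start = suffix_array[k]
        return terminated[start:start + m]

    # Lower end: first entry whose truncated suffix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper end: first entry whose truncated suffix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


print(find_occurrences("BANANA", [5, 3, 1, 0, 4, 2], "ANA"))  # [1, 3]
```

Truncating every suffix to the pattern length keeps the sequence in non-decreasing order, which is why both searches remain valid binary searches.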
[0010]
[Problem to be solved by the invention] Conventionally, a character string search using a suffix array requires storing two things:
(A) the suffix array of the character string
(B) the character string
As the number of target character strings grows, the cost of storing them becomes a practical problem.
[0011] An object of the present invention is to reduce the amount of storage required for character string search using a suffix array.
[0012]
[Means for solving the problem] Basically, instead of (A) and (B) above, the following (A) and (C) are stored, and (B) is restored from (A) and (C) when a character string search is executed, thereby reducing the total storage.
(A) the suffix array of the character string
(C) the character table
[0013] Specifically, the means for achieving the object comprise a character string buffer for storing a character string, a suffix array buffer for storing the suffix array of the character string, a character table buffer for storing, for each character in the string, a position in the suffix array, and a character string restoration unit that refers to the character table buffer and the suffix array buffer and stores characters at the appropriate positions in the character string buffer.
[0014]
[Embodiments of the invention] An embodiment of the present invention is described with reference to the drawings. FIG. 1 shows an example of the device configuration, and FIG. 2 shows an example of its software.
[0015] For the earlier example, the suffix array of the character string "BANANA" is as shown in FIG. 3. The upper row of FIG. 3 shows the element positions (array indices) of the suffix array, and the lower row shows the value of each element (pointer).
[0016] Assume that the character table shown in FIG. 4 has been prepared for the suffix array of FIG. 3. In FIG. 4, the left column (the character code column) holds the character codes of the characters that occur in the string, in dictionary order. The final "#" row is provided to mark the last row of the character table.
[0017] The right column (the array index column) holds, for each character code, the array index in the suffix array at which the pointer to the lexicographically smallest suffix beginning with that character is stored (that pointer is 5 for "A", 0 for "B", and 4 for "N").
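A character table of this kind could be built while the original text is still available, as in the following Python sketch. The names `Row` and `build_char_table` are hypothetical. FIG. 4 is not reproduced here, so the array index stored in the final "#" row is an assumption: the sketch uses the length of the suffix array, so that each row's range ends where the next row begins.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    code: str   # character code (left column)
    index: int  # array index in the suffix array (right column)


def build_char_table(text: str, suffix_array: List[int]) -> List[Row]:
    """Record, for each distinct character, where its suffixes start in the suffix array."""
    table: List[Row] = []
    for index, pointer in enumerate(suffix_array):
        code = text[pointer]
        # The suffix array is sorted, so a new first character starts a new row.
        if not table or table[-1].code != code:
            table.append(Row(code, index))
    # Sentinel row marking the end of the table (index value is an assumption).
    table.append(Row("#", len(suffix_array)))
    return table


for row in build_char_table("BANANA", [5, 3, 1, 0, 4, 2]):
    print(row.code, row.index)
```

For "BANANA" this yields the rows A 0, B 3, N 4 and # 6: the pointers 5, 0 and 4 named in the text sit at array indices 0, 3 and 4 of the suffix array.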
[0018] The suffix array of FIG. 3 and the character table of FIG. 4 are assumed to be stored in the suffix array buffer and the character table buffer, respectively.
[0019] The character string restoration unit restores the character string "BANANA" from the character table and the suffix array according to the flowchart shown in FIG. 5, where:
- The character code in row I of the character table is written table[I].code, and its array index table[I].index.
- The element of the suffix array at array index P is written A[P].
- The element of the character string buffer at array index P is written S[P].
- When this processing finishes, the character string has been restored in the character string buffer S[..].
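The flowchart of FIG. 5 is not reproduced here, so the following Python sketch is only one plausible reading of the restoration step, written with the notation above. It assumes that the final "#" row of the table stores the length of the suffix array as its array index, so that row I covers the array indices from table[I].index up to, but not including, table[I+1].index. The names `Row` and `restore_string` are hypothetical.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    code: str   # table[I].code
    index: int  # table[I].index


def restore_string(table: List[Row], A: List[int]) -> str:
    """Rebuild the text from the character table and the suffix array A."""
    S = [""] * len(A)  # character string buffer
    # Every row except the final "#" sentinel describes one character.
    for I in range(len(table) - 1):
        # All suffixes in this range of A begin with table[I].code.
        for P in range(table[I].index, table[I + 1].index):
            S[A[P]] = table[I].code
    return "".join(S)


table = [Row("A", 0), Row("B", 3), Row("N", 4), Row("#", 6)]
print(restore_string(table, [5, 3, 1, 0, 4, 2]))  # BANANA
```

Each element of the suffix array is visited exactly once, which matches the single scan of the suffix array that the description attributes to the restoration unit.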
[0020]
[Effect of the invention] The character table requires storage proportional only to the number of distinct characters, regardless of the size of the target character string. Therefore, when the target character string is large, storing the character table instead of the target character string reduces the total storage.
[0021] Furthermore, the character string is restored by the character string restoration unit when a character string search becomes necessary, but this requires only a single scan of the suffix array and can therefore be done very quickly.
[FIG. 1] Block diagram showing an example of the hardware configuration in an embodiment of the present invention.
[FIG. 2] Block diagram showing an example of its software.
[FIG. 3] Array diagram showing an example of a suffix array.
[FIG. 4] Array diagram showing the character table for the suffix array.
[FIG. 5] Flowchart showing the processing of the character string restoration unit.
Claims (3)
1. A character string restoration method that restores a character string using the suffix array of the character string and a character table storing, for each character in the character string, a position in the suffix array.
2. A character string restoration device comprising: a character string buffer for storing a character string; a suffix array buffer for storing the suffix array of the character string; a character table buffer for storing, for each character in the character string, a position in the suffix array; and a character string restoration unit that refers to the character table buffer and the suffix array buffer and stores characters at the appropriate positions in the character string buffer.
3. A computer-readable recording medium on which is recorded a program that restores a character string using the suffix array of the character string and a character table storing, for each character in the character string, a position in the suffix array.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2001064405A JP2002269096A (en) | 2001-03-08 | 2001-03-08 | Method and device for character string restoration and recording medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2001064405A JP2002269096A (en) | 2001-03-08 | 2001-03-08 | Method and device for character string restoration and recording medium |
Publications (1)
Publication Number | Publication Date |
---|---|
JP2002269096A true JP2002269096A (en) | 2002-09-20 |
Family
ID=18923228
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP2001064405A Pending JP2002269096A (en) | 2001-03-08 | 2001-03-08 | Method and device for character string restoration and recording medium |
Country Status (1)
Country | Link |
---|---|
JP (1) | JP2002269096A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012166190A1 (en) * | 2011-06-03 | 2012-12-06 | Microsoft Corporation | Compression match enumeration |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000259646A (en) * | 1999-03-05 | 2000-09-22 | Ricoh Co Ltd | Information indexing device |
-
2001
- 2001-03-08 JP JP2001064405A patent/JP2002269096A/en active Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000259646A (en) * | 1999-03-05 | 2000-09-22 | Ricoh Co Ltd | Information indexing device |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012166190A1 (en) * | 2011-06-03 | 2012-12-06 | Microsoft Corporation | Compression match enumeration |
US8493249B2 (en) | 2011-06-03 | 2013-07-23 | Microsoft Corporation | Compression match enumeration |
US9065469B2 (en) | 2011-06-03 | 2015-06-23 | Microsoft Technology Licensing, Llc | Compression match enumeration |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8095526B2 (en) | Efficient retrieval of variable-length character string data | |
US10095755B2 (en) | Fast identification of complex strings in a data stream | |
JP3672242B2 (en) | PATTERN SEARCH METHOD, PATTERN SEARCH DEVICE, COMPUTER PROGRAM, AND STORAGE MEDIUM | |
JP2683870B2 (en) | Character string search system and method | |
US5754847A (en) | Word/number and number/word mapping | |
JPWO2004062110A1 (en) | Data compression method, program and apparatus | |
JP4077409B2 (en) | Fast longest match search method and apparatus | |
JP2009512099A (en) | Method and apparatus for restartable hashing in a try | |
US5553283A (en) | Stored mapping data with information for skipping branches while keeping count of suffix endings | |
Haubold et al. | Exact matching | |
US20010032073A1 (en) | Coding and storage of phonetical characteristics of strings | |
JP2002269096A (en) | Method and device for character string restoration and recording medium | |
JP4208326B2 (en) | Information indexing device | |
JP2013197850A (en) | Encoding method, encoding device, and computer program | |
JP3534471B2 (en) | Merge sort method and merge sort device | |
JPH06274701A (en) | Word collating device | |
JPS59100939A (en) | Japanese word input device | |
JP2002049645A (en) | Hash value computing method, hash value computing device, retrieval method, retrieval device and recording medium | |
JP3350070B2 (en) | Kana-Kanji conversion device | |
JPH07121665A (en) | Compiling method and retrieving method for character recognition dictionary | |
JP2001117929A (en) | Data retrieving method, data aligning method and data retrieving device | |
JPH03127254A (en) | Word retrieving device | |
JPH0546663A (en) | Key word retrieval system | |
CN1547135A (en) | Method for searching Chinese name by tone of Chinese phonetic alphabet | |
JPH1166076A (en) | Data derivation device/method and storage medium storing data derivation program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
RD04 | Notification of resignation of power of attorney |
Free format text: JAPANESE INTERMEDIATE CODE: A7424 Effective date: 20040930 |
|
RD01 | Notification of change of attorney |
Free format text: JAPANESE INTERMEDIATE CODE: A7421 Effective date: 20051021 |
|
RD01 | Notification of change of attorney |
Free format text: JAPANESE INTERMEDIATE CODE: A7421 Effective date: 20060811 |
|
A621 | Written request for application examination |
Free format text: JAPANESE INTERMEDIATE CODE: A621 Effective date: 20080229 |
|
A977 | Report on retrieval |
Free format text: JAPANESE INTERMEDIATE CODE: A971007 Effective date: 20100723 |
|
A131 | Notification of reasons for refusal |
Free format text: JAPANESE INTERMEDIATE CODE: A131 Effective date: 20100803 |
|
A02 | Decision of refusal |
Free format text: JAPANESE INTERMEDIATE CODE: A02 Effective date: 20101130 |