JP3328334B2 - Full-text database search device

Info

Publication number: JP3328334B2
Application number: JP29702192A
Authority: JP
Inventors: 文人西野; 正利塩内; 尚美杉本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1992-11-06
Filing date: 1992-11-06
Publication date: 2002-09-24
Anticipated expiration: 2017-09-24
Also published as: JPH06149882A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Industrial Field of Application] With recent advances in computer technology, large volumes of documents are being digitized, and the demand for searching them is growing. One approach to database search is to assign keywords in advance, but assigning keywords to a large amount of full-text data is very laborious.

[0002] In contrast, there is also the full-text search approach, in which no secondary information is prepared in advance and the body text is referenced directly and matched against keywords that the searcher specifies freely. The problem with this approach is the execution time needed to scan the full text. As a middle ground between the two, there have been attempts to treat every character of the full text as a kind of keyword, build an index over every character in the text, and search at high speed without keywords.

[0003] The present invention relates to a full-text database search device that holds address information for each character as described above and performs full-text search using this address information as an index. In particular, it relates to a technique for creating that index.

[0004]

[Prior Art] In a full-text database search device that holds address information for each character and performs full-text search based on it, as described above, the address information conventionally held for each character was a pair consisting of the next character appearing in the text and its position.

[0005] However, with this conventional method, next-character information and position information must be held for every character in the full text, so the index required at least twice the storage capacity of the text itself. In addition, a search had to examine the characters following each character one after another, which made it difficult to speed up the search.

[0006]

[Problem to Be Solved by the Invention] The present invention was made in view of the above problems of the prior art. Its object is to provide a full-text database search device that can access a full-text database at high speed while requiring less storage for the index than conventional devices.

[0007]

[Means for Solving the Problem] FIG. 1 is a block diagram of the principle of the present invention. In the figure, 1 is a full-text storage unit storing full text 1a; 2 is an index area unit storing the position in full-text storage unit 1 of each character of full text 1a; 3 is a search unit that, for a given key, searches full-text storage unit 1 using the index information in index area unit 2; 4-1 through 4-n are direct arrays, one per character code, each recording character appearance position information at the position corresponding to the code of the character that follows that character (circles indicate cells holding data); and 2a is a compressed array obtained by shifting direct arrays 4-1 through 4-n right or left by suitable amounts and superimposing them.

[0008] The invention of claim 1, as shown in FIG. 1, is a full-text database search device for searching the contents of digitized documents, comprising a full-text storage unit 1 holding full text 1a, an index area unit 2 holding the appearance position of each individual character of full text 1a, and a search unit 3 that performs full-text search by referring to the index in index area unit 2, the index holding, in addition to each character's appearance position, the next-character information in the full-text database. Based on the list of pairs of next-character information and character appearance position information given for each character code, direct arrays 4-1, ..., 4-n are generated in which the appearance position information is recorded at the position corresponding to the character code of the next character. The direct arrays 4-1, ..., 4-n obtained from the character codes are shifted right or left by suitable amounts and superimposed so that elements with defined values do not overlap, yielding a single compressed array 2a from the set of direct arrays, which is stored in index area unit 2. The requested character string is then searched for in the full-text database using compressed array 2a stored in index area unit 2.

[0009] In the invention of claim 2, in the invention of claim 1, compressed array 2a does not explicitly hold information about the character that precedes the next character. Instead, that character is looked up in full text 1a based on the character appearance position information in compressed array 2a, and the result is used to confirm that the appearance position information retrieved from compressed array 2a is correct.

[0010] In the invention of claim 3, in the invention of claim 1, when the same two-character string appears multiple times in full text 1a, so that compressed array 2a would contain multiple items of position information for a particular character code, those items are held in an area separate from compressed array 2a. In the invention of claim 4, in the invention of claim 1, when the same two-character string appears multiple times in full text 1a, so that compressed array 2a would contain multiple items of position information for a particular character code, those items are held in a contiguous region of compressed array 2a.

[0011] In the invention of claim 5, in the invention of claim 4, a pointer flag is held in compressed array 2a so that the multiple items of position information can be held in divided contiguous regions of compressed array 2a.

[0012]

[Operation] For each character of full text 1a stored in full-text storage unit 1, direct arrays 4-1, ..., 4-n are created that record the character appearance position at the position corresponding to the code of the next character, based on the pairs of following character and character appearance position. The direct arrays are then shifted right or left by suitable amounts and superimposed so that their data do not overlap, yielding compressed array 2a, which is stored in index area unit 2.

[0013] At search time, search unit 3 retrieves the information corresponding to the requested key from the compressed array stored in index area unit 2, and uses it to access full-text storage unit 1 and retrieve the requested content. In the invention of claim 1, as described above, compressed array 2a is obtained from direct arrays 4-1, ..., 4-n and used to search full text 1a, which makes high-speed search possible. Moreover, superimposing direct arrays 4-1, ..., 4-n compresses the sparse arrays.

[0014] In the invention of claim 2, in the invention of claim 1, the character to which each element of compressed array 2a belongs is determined by referring to full text 1a, so index area unit 2 can be made compact. In the invention of claim 3, in the invention of claim 1, when compressed array 2a would contain multiple items of position information for a particular character code, those items are held in an area separate from compressed array 2a, so the case where the same two-character string appears multiple times in full text 1a can also be handled.

[0015] In the inventions of claims 4 and 5, in the invention of claim 1, when the same two-character string appears multiple times in full text 1a, so that compressed array 2a would contain multiple items of position information for a particular character code, those items are held in a contiguous region of compressed array 2a, or split across several contiguous regions of compressed array 2a, so index area unit 2 can be made compact.

[0016]

[Embodiment] FIG. 2 shows the basic configuration of the full-text database search device of the present invention. In the figure, 11 is a full-text storage unit that stores the entire text to be searched. 12 is an index area unit, which stores the position of each character in the full-text storage unit, as described later. 13 is an index creation unit, which is invoked when part or all of the text in full-text storage unit 11 is created or updated, and which builds the index of the text. 14 is a search unit, which searches full-text storage unit 11 for a given key using the index information stored in index area unit 12. 15 is a user interface unit, which accepts search requests from the user and outputs search results.

[0017] In the figure, when the text in full-text storage unit 11 is created or updated, index creation unit 13 is invoked. Index creation unit 13 builds an index, as described later, for each character of the full text stored in full-text storage unit 11, and the index is stored in index area unit 12. When the user issues a search request through user interface unit 15, search unit 14 retrieves the information corresponding to the requested key from index area unit 12, and uses it to access full-text storage unit 11 and retrieve the requested content.

[0018] The retrieved result, or a notice that the search failed, is then output to the user through user interface unit 15. Next, an embodiment of index creation and search processing in the full-text database search device of FIG. 2 is described.

(1) Index creation

FIG. 3 is a flowchart of the index creation process in index creation unit 13. Index creation in this embodiment is described below with reference to FIG. 3, using the examples of FIGS. 4 through 8.

[0019] In step S1 of FIG. 3, the full-text file stored in full-text storage unit 11 is opened. In step S2, it is determined whether the end of the file has been reached; if not, one character is read from full-text storage unit 11 in step S3, and processing proceeds to step S4. In step S4, for the character just read, a pair consisting of its position in the full-text file and the next character is output to a work file. These steps are repeated until the end of the file, so that every character of the full text is processed, and processing then proceeds to step S5.

[0020] FIG. 4 shows an example of a full text to be indexed and the work file created in step S4. FIG. 4(a) shows an example of the full text, and FIG. 4(b) shows the resulting work file. In (a), the numbers along the vertical and horizontal axes are addresses indicating the position of each character.

[0021] In (b), the leftmost column holds the heading character, and the next column holds the character that follows the heading character together with the address indicating its position in the full text. As shown in the figure, a character that appears multiple times in the full text is recorded only once in the heading-character column, and the multiple characters that follow it are recorded in the next-character column. For example, the character 「日」 is used in the full text in 「日本」, 「日々」, and 「日頃」. 「日」 appears only once in the heading-character column, while 「本」, 「々」, and 「頃」 are recorded in the next-character column together with the addresses of their positions in the full text.
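The work file of steps S1 through S4 can be sketched roughly as follows. This is only an illustration: the function name `build_work_file`, the in-memory dictionary standing in for the work file, and the 0-based addresses are assumptions, not details given in the description.

```python
from collections import defaultdict


def build_work_file(text):
    """Group (next character, address) pairs under each heading character.

    Assumption: addresses are 0-based and denote the position of the
    *next* character, matching the 「日本」 example in the description.
    """
    work = defaultdict(list)
    # The last character has no successor, so it produces no pair.
    for pos in range(len(text) - 1):
        heading, following = text[pos], text[pos + 1]
        work[heading].append((following, pos + 1))
    return dict(work)


work_file = build_work_file("日本の日々")
for heading, pairs in work_file.items():
    print(heading, pairs)
```

As in FIG. 4(b), 「日」 appears once as a heading character and carries one entry per character that follows it.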

[0022] Once the work file has been created for the full text as described above, processing proceeds to step S5 of FIG. 3, where a direct array is created for each character code. FIG. 5 shows an example of a direct array. In the figure, 51 is the heading character and 52 is the corresponding direct array. As an example, the figure shows the array for the case where the heading character is 「日」 and the characters following it, 「本」, 「々」, and 「頃」, appear at addresses 4, 8, and 24 of the full text, respectively.

[0023] In the figure, array 52 has as many cells as there are character codes (the character set) that can be used in the full text. The cells are arranged in character code order, and each cell position corresponds to one character code. In other words, identifying a position in array 52 identifies a character code. For example, if the character 「本」 appears at address 4 of the full text, "4" is recorded at the position of the character code for 「本」 in array 52, as shown in the figure. Likewise, if the character 「々」 appears at address 8 of the full text, "8" is recorded at the position of the character code for 「々」 in array 52.

[0024] Cells other than those in which an address is recorded hold "nil" (meaning no data). Accordingly, for the heading characters 「々」, 「し」, 「を」, and 「日」, arrays such as those shown in FIG. 6 are created, and the character (for example, 「の」, 「い」, or 「日」) can be determined from the position at which its address (for example, "9", "2", or "7") is recorded.

[0025] That is, for each character code i in the full text, an array A_i with as many elements as the character set is prepared. When the character with code j occurs as the character following the character with code i, A_i[j] stores a pointer (address) to a list holding information that indicates the position of that character in the full-text database. When the character with code j does not occur as the character following the character with code i, "nil" is stored.
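A minimal sketch of the direct arrays A_i follows. It assumes a tiny character set built from the text itself (a real system would use the codes of its character encoding), uses `None` in place of "nil", and stores a Python list of positions where the description speaks of a pointer to a list. The name `build_direct_arrays` is hypothetical.

```python
def build_direct_arrays(text):
    """Return (code, arrays), where arrays[i][j] lists the addresses at
    which the character with code j directly follows the character with
    code i, or is None ("nil") when that pair never occurs."""
    charset = sorted(set(text))                  # assumed toy character set
    code = {ch: k for k, ch in enumerate(charset)}
    arrays = {}
    for pos in range(len(text) - 1):
        i, j = code[text[pos]], code[text[pos + 1]]
        row = arrays.setdefault(i, [None] * len(charset))
        if row[j] is None:
            row[j] = []
        row[j].append(pos + 1)                   # address of the next character
    return code, arrays


code, arrays = build_direct_arrays("日本の日々")
row = arrays[code["日"]]
print(row[code["本"]], row[code["々"]], row[code["の"]])
```

Each row is mostly nil, which is the sparseness that the superimposition in step S6 is meant to remove.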

[0026] In the above example, for convenience of explanation, a pointer (address) to a list holding the position of code j in the full-text database is stored. However, a pointer (address) to a list holding the position of code i may be stored instead. After a direct array has been created for each character of the full text in this way, processing proceeds to step S6 of the flowchart in FIG. 3, where the direct arrays are superimposed, shifting each one right or left as appropriate so that non-nil entries do not collide in the compressed array, to produce the compressed array.

[0027] FIG. 7 illustrates the array superimposition in step S6. In the figure, 71, 72, and 73 are the arrays corresponding to the first through third character codes, and 74 is the compressed array obtained by superimposing arrays 71, 72, and 73. In arrays 71, 72, and 73, a blank cell denotes nil and a circled cell denotes non-nil (one of the addresses described above is recorded there).

[0028] To create compressed array 74 from arrays 71, 72, and 73 shown in FIG. 7(a), the arrays are first shifted so that the circled positions do not coincide, for example array 71 two positions to the left and arrays 72 and 73 one position to the right, as shown in FIG. 7(b). Arrays 71, 72, and 73 are then superimposed to produce compressed array 74, as shown in FIG. 7(c). In addition, to show which character code each position of the compressed array belongs to, information identifying that character code is recorded in an area corresponding to each position of the compressed array, as shown in FIG. 7(c). For example, "1" is recorded in the area corresponding to the leftmost cell of compressed array 74 to show that this cell belongs to the first character code, that of array 71 in FIG. 7(a). Similarly, "3" is recorded in the area corresponding to the rightmost cell of compressed array 74 to show that this cell belongs to the third character code, that of array 73.
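The superimposition can be sketched as below. The description does not say how the shift amounts are chosen, so the first-fit search used here is an assumption, as is the restriction to non-negative (rightward) shifts, which keeps the example short. The name `overlay` and the sample rows are hypothetical.

```python
def overlay(rows):
    """Superimpose sparse rows into one array without collisions.

    rows maps an owner (character code) to a list in which None means nil.
    Returns (shifts, values, owners): the shift chosen for each owner, the
    compressed array, and the parallel array naming the owner of each cell.
    """
    values, owners, shifts = [], [], {}
    for owner, row in rows.items():
        filled = [j for j, v in enumerate(row) if v is not None]
        shift = 0
        # First fit: move right until no non-nil cell lands on an occupied one.
        while any(shift + j < len(values) and values[shift + j] is not None
                  for j in filled):
            shift += 1
        needed = shift + len(row)
        if needed > len(values):
            values.extend([None] * (needed - len(values)))
            owners.extend([None] * (needed - len(owners)))
        for j in filled:
            values[shift + j] = row[j]
            owners[shift + j] = owner
        shifts[owner] = shift
    return shifts, values, owners


rows = {1: [7, None, None, 3], 2: [None, 5, None, None], 3: [9, None, None, None]}
shifts, values, owners = overlay(rows)
print(shifts)   # {1: 0, 2: 0, 3: 2}
print(values)   # [7, 5, 9, 3, None, None]
print(owners)   # [1, 2, 3, 1, None, None]
```

The `owners` array plays the role of the per-position character code information of FIG. 7(c), and `shifts` corresponds to the shift amount table.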

[0029] Furthermore, the shift amounts of arrays 71, 72, 73, and so on are saved as a shift amount table. FIG. 8 is a schematic diagram of the shift amount table and the compressed index array: FIG. 8(a) shows an example of the shift amount table, and FIG. 8(b) shows an example of the compressed index array.

[0030] The shift amount table records the heading characters and the shift amount for each. As shown in FIG. 8(a), for example, the shift amount of 「々」 is recorded as "+12" and that of 「を」 as "-29". The compressed index array has the structure shown in FIG. 8(b). As shown there, for example, address 24 of the character 「頃」 in the string 「日頃」 is recorded in the compressed array at the position offset by "-125" from the original position of 「頃」 in the character set (because the shift amount of the character 「日」 is "-125").

(2) Search

When the user issues a search request through user interface unit 15, search unit 14 retrieves the information corresponding to the requested key from the compressed index array stored in index area unit 12, and uses it to access full-text storage unit 11 and retrieve the requested content.

[0031] To find a string in which code i is followed by code j, the shift table of FIG. 8(a) is first consulted to obtain the shift amount T_i for character code i. Next, address L in the full text held in full-text storage unit 11 is read from the (T_i + j)-th element of the index area shown in FIG. 8(b). The information indicating which character code the area for the (T_i + j)-th element belongs to is also read (see FIG. 7(c)). If that character code is i, address L is taken to be valid. If it is not i, no string exists in which character codes i and j are consecutive.

[0032] For example, to search FIG. 8 for a string in which 「日」 is followed by 「本」, the shift table of FIG. 8(a) is first consulted to obtain the shift amount "-125" for 「日」, and "j - 125" is computed from the character code j of 「本」 and that shift amount. Address 4, recorded at position "j - 125", is then read from the index area of FIG. 8(b). The character code to which the area at "j - 125" belongs is also read. If it is the character code for 「日」, the string in which 「日」 is followed by 「本」 is located at address 4 of the full text.
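The lookup just described can be sketched as follows. The shift amounts -125 and -29 and the addresses 4, 24, and 7 come from the description. The character codes in `CODE`, the array length, and the function name `find_pair` are made up for illustration.

```python
def find_pair(i, j, shifts, values, owners):
    """Return the address recorded for code i followed by code j, or None."""
    if i not in shifts:
        return None
    slot = shifts[i] + j                      # the (T_i + j)-th element
    if 0 <= slot < len(values) and owners[slot] == i:
        return values[slot]                   # cell belongs to i: address is valid
    return None                               # out of range, nil, or another owner


CODE = {"日": 100, "本": 129, "頃": 133, "を": 60, "し": 33}  # hypothetical codes
shifts = {CODE["日"]: -125, CODE["を"]: -29}
values = [None] * 80
owners = [None] * 80
values[4], owners[4] = 4, CODE["日"]      # 「日本」: 129 - 125 = 4
values[8], owners[8] = 24, CODE["日"]     # 「日頃」: 133 - 125 = 8
values[71], owners[71] = 7, CODE["を"]    # 「を日」: 100 - 29 = 71

print(find_pair(CODE["日"], CODE["本"], shifts, values, owners))  # 4
print(find_pair(CODE["を"], CODE["し"], shifts, values, owners))  # None
```

The second query lands on slot 4, which is occupied but owned by 「日」, so the owner check rejects it. This is the case the per-position character code information exists to catch.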

[0033] In the above embodiment, information indicating which character code each position of the compressed array belongs to is recorded in an area corresponding to that position, as shown in FIG. 7(c), and is consulted at search time to confirm that a string with consecutive character codes i and j exists. However, the existence of the consecutive string can also be confirmed without holding such information.

[0034] That is, because the full text in full-text storage unit 11 holds each character at its address, once address L in the full text has been read from the (T_i + j)-th element of the index area, the position at address L can be looked up in the full text in full-text storage unit 11. If the character code immediately before that position (or the character code at that position, when address L denotes the position of the preceding character) is i, it is confirmed that character codes i and j are consecutive.

(3) Index creation when the same two-character string appears multiple times in the full text

[0035] Section (1) above described the case where each two-character string appears only once in the full text, but the same two-character string may appear multiple times. In that case, the index must hold multiple items of position information. FIG. 9(a) shows the work file for a case where the same two-character string appears multiple times in the full text: the string consisting of heading character β followed by character code E appears four times, at addresses 8, 59, 64, and 96.

[0036] FIG. 9(b) shows an embodiment for the case where the index must hold multiple items of position information as described above. In this embodiment, the multiple items of position information are not placed in the compressed index array but are stored elsewhere. One way to do this is, for example, to store in the compressed array the head address of a list holding the items of position information, and to read them out via that head address.

[0037] Alternatively, an array H separate from the compressed array is prepared, the position information is stored in a contiguous region of array H, and either a count of the data items or an end-of-data mark is provided so that the number of items or the end of the data can be determined. The compressed array then stores the index of the first data item in array H, and the multiple items of position information are read out by following that index.
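A minimal sketch of the array H variant, using the count option rather than an end mark. The layout (a count cell followed by the positions) and the names `pack_positions` and `read_positions` are assumptions. The addresses 8, 59, 64, and 96 are those of the FIG. 9(a) example.

```python
def pack_positions(groups):
    """Lay out each group of positions contiguously in H, count first.

    groups maps a compressed-array slot to its list of positions.
    Returns (start, H), where start[slot] is the index into H that the
    compressed array would store for that slot.
    """
    H, start = [], {}
    for slot, positions in groups.items():
        start[slot] = len(H)
        H.append(len(positions))     # number of items in this run
        H.extend(positions)
    return start, H


def read_positions(start_index, H):
    """Read back one run of positions from H."""
    count = H[start_index]
    return H[start_index + 1:start_index + 1 + count]


start, H = pack_positions({5: [8, 59, 64, 96], 9: [17]})
print(H)                               # [4, 8, 59, 64, 96, 1, 17]
print(read_positions(start[5], H))     # [8, 59, 64, 96]
```

Storing a count costs one extra cell per run. An end mark would cost the same but requires a value that can never be a valid address.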

[0038] FIG. 9(c) shows another embodiment for the case where the index must hold multiple items of position information. Here the items are stored directly in the compressed array rather than in a separate area. If a given two-character sequence has at most one item of position information, the actual position is simply stored in the compressed array. If there are multiple items, a 1-bit contiguous-region mark bit field is introduced into the compressed array in advance, and, as shown in FIG. 9(c), the position information is stored in successive cells of the compressed array with the mark bit set for as long as the data continues.

[0039] FIG. 9(d) shows how the data is stored when multiple pieces of position information are stored directly in the compressed array but a sufficiently long contiguous area cannot be secured. In this embodiment, when a long contiguous area cannot be secured, the data is divided and stored in several areas. That is, as shown in FIG. 9(d), a pointer bit is provided in addition to the mark bit described above. When the pointer bit is on, the element holds not actual data but the index in the compressed array at which the following data is stored.
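The divided storage can be sketched as follows. The bit layout (bit 0 as mark bit, bit 1 as pointer bit, the remaining bits as payload) and the way free areas are handed in as `(start, length)` pairs are assumptions made for illustration; `store_segments` and `read_segments` are hypothetical names.

```python
MARK, POINTER = 1, 2  # assumed layout: bit 0 = mark bit, bit 1 = pointer bit

def store_segments(array, segments, positions):
    """Spread positions over several free areas given as (start, length).

    Every area except the last keeps its final element for a pointer to
    the start of the next area.
    """
    remaining = list(positions)
    for k, (start, length) in enumerate(segments):
        last = k == len(segments) - 1
        room = length if last else length - 1
        chunk, remaining = remaining[:room], remaining[room:]
        for i, pos in enumerate(chunk):
            more = bool(remaining) or i < len(chunk) - 1
            array[start + i] = (pos << 2) | (MARK if more else 0)
        if not remaining:
            return
        if last:
            raise ValueError("free areas are too small for the data")
        # Pointer element: payload is the index of the next area.
        array[start + len(chunk)] = (segments[k + 1][0] << 2) | POINTER | MARK

def read_segments(array, start):
    """Collect positions, jumping whenever the pointer bit is on."""
    out = []
    i = start
    while True:
        cell = array[i]
        if cell & POINTER:
            i = cell >> 2          # payload is an index, not a position
            continue
        out.append(cell >> 2)
        if not cell & MARK:
            return out
        i += 1
```

Reserving one element per area for the pointer costs a little space, but it lets a long position list fit into whatever short gaps remain after the direct arrays have been overlaid.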

【００４０】[0040]

[Effects of the Invention] As is clear from the above description, in the present invention a direct array is created for each character of the full text, based on the pairs of the character that appears next and the character appearance position, in which the character appearance position is recorded at the position corresponding to the code of the next character. The direct arrays are shifted left or right by appropriate amounts and superimposed so that their data does not collide, yielding a compressed array, and the full text is accessed using this compressed array as the index. The position in the full text of a character being searched for can therefore be obtained immediately from the shift amount of its direct array and the character code of the next character, which speeds up access. In addition, the storage area required for the index can be made smaller than in the prior art.
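The overall scheme summarized above can be sketched in a few functions. This is an illustration under stated assumptions, not the patented implementation: positions are 0-based, only the first occurrence of each two-character sequence is kept, shifts are chosen by a simple first-fit search, and the names `build_direct_arrays`, `compress`, and `lookup` are hypothetical. The final check against the text follows the verification idea stated in claim 2, since the compressed array does not record which character a slot belongs to.

```python
def build_direct_arrays(text):
    """One sparse direct array per character: next-char code -> position."""
    direct = {}
    for pos in range(len(text) - 1):
        row = direct.setdefault(text[pos], {})
        row.setdefault(ord(text[pos + 1]), pos)  # keep the first occurrence only
    return direct

def compress(direct):
    """Overlay the direct arrays, shifting each so no defined elements collide."""
    shifts, packed = {}, []
    for ch, row in direct.items():
        shift = -min(row)  # leftmost shift that keeps every slot >= 0
        while any(code + shift < len(packed) and packed[code + shift] is not None
                  for code in row):
            shift += 1
        need = max(row) + shift + 1
        packed.extend([None] * (need - len(packed)))
        for code, pos in row.items():
            packed[code + shift] = pos
        shifts[ch] = shift
    return shifts, packed

def lookup(text, shifts, packed, pair):
    """Return the position of a two-character sequence, or None if absent."""
    first, second = pair
    if first not in shifts:
        return None
    slot = shifts[first] + ord(second)
    if not 0 <= slot < len(packed) or packed[slot] is None:
        return None
    pos = packed[slot]
    # The slot may belong to another character's array, so confirm in the text.
    return pos if text[pos:pos + 2] == pair else None
```

For the text `"abcabd"`, this sketch packs the three direct arrays into four elements; `lookup` returns 4 for `"bd"` and `None` for `"ac"`, the latter being rejected by the check against the text.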

[Brief description of the drawings]

FIG. 1 is a block diagram illustrating the principle of the present invention.

FIG. 2 is a diagram showing the basic configuration of the full-text database search device of the present invention.

FIG. 3 is a flowchart of index creation.

FIG. 4 is a diagram showing an example of the full text and the work file.

FIG. 5 is a diagram illustrating the structure of a direct array.

FIG. 6 is a diagram showing the state of the index arrays before compression.

FIG. 7 is a diagram illustrating the superimposition of arrays.

FIG. 8 is a diagram showing an example of the shift amount table and the index array after compression.

FIG. 9 is a diagram showing how position information is stored in the compressed array.

[Explanation of symbols]

1, 11: full-text storage unit
1a: full text
2, 12: index area unit
2a: compressed array
3, 14: search unit
4-1, ..., 4-n: direct arrays
13: index creation unit
15: user interface unit

Continuation of front page

(56) References
JP-A-4-242864 (JP, A)
JP-A-4-215181 (JP, A)
JP-A-4-205560 (JP, A)
Yasushi Ogawa (小川泰嗣), "Text Database System Based on Short-Unit Keywords," IPSJ SIG Technical Report 92-DBS-90, Information Processing Society of Japan, September 11, 1992, Vol. 92, No. 71, pp. 45-54

(58) Fields searched (Int. Cl.⁷, DB name)
G06F 17/30
JICST file (JOIS)

Claims

(57) [Claims]

1. A full-text database search device comprising: a full-text storage unit (1) that holds a full text (1a) in order to search the contents of digitized documents; an index area unit (2) that holds the appearance position of each character in the full text (1a); and a search unit (3) that performs a full-text search by referring to the index in the index area unit (2), the index holding, in addition to the appearance position of each character, information on the next character in the full text, characterized in that: from the pairs of next-character information and character appearance position information given for each character code, direct arrays (4-1, ..., 4-n) are created in which the character appearance position information is recorded at the position corresponding to the character code of the next-character information; the direct arrays (4-1, ..., 4-n) obtained for the respective character codes are shifted right or left by appropriate amounts and superimposed so that elements whose values are defined do not overlap, thereby obtaining a single compressed array (2a), which is stored in the index area unit (2); and a requested character string is retrieved from the full text using the compressed array (2a) stored in the index area unit (2).

2. The full-text database search device according to claim 1, wherein the compressed array (2a) does not explicitly hold information on the character positioned before the next-character information, and each such character is looked up in the full text (1a) on the basis of the character appearance position information in the compressed array (2a), thereby confirming that the character appearance position information extracted from the compressed array (2a) is correct.

3. The full-text database search device according to claim 1, wherein, when the same two-character sequence appears multiple times in the full text (1a) and there are therefore multiple pieces of position information for a specific character code in the compressed array (2a), the multiple pieces of information are held in an area separate from the compressed array (2a).

4. The full-text database search device according to claim 1, wherein, when the same two-character sequence appears multiple times in the full text (1a) and there are therefore multiple pieces of position information for a specific character code in the compressed array (2a), the multiple pieces of information are stored in a contiguous area of the compressed array (2a).

5. The full-text database search device according to claim 4, wherein a pointer flag is held in the compressed array (2a) so that the multiple pieces of position information are stored in divided contiguous areas of the compressed array (2a).