JP2003006231A

JP2003006231A - Method and system for creation of index and retrieval of computer character information

Info

Publication number: JP2003006231A
Application number: JP2002100490A
Authority: JP
Inventors: Qin Yong; Li Hong
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2001-04-02
Filing date: 2002-04-02
Publication date: 2003-01-10
Anticipated expiration: 2022-04-02
Also published as: CN1378157A; CN1326073C; JP3728264B2

Abstract

PROBLEM TO BE SOLVED: To provide a method and a system for creating an index of computer character information and/or retrieving that information. SOLUTION: The position of each character in a group of documents to be searched is determined according to the order of all characters in all the documents. The position data of each character is stored sequentially in one or more database blocks corresponding to that character. Each database block consists of a plurality of miniblocks, each having a storage space of a plurality of bytes. Retrieval is carried out on the basis of this index structure. For a retrieved position of a search word, the document in which the search word occurs is determined by a document information sharing conversion unit.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to the indexing and retrieval of information in computer information processing, and in particular to a method and system for indexing and retrieving computer character information.

[0002]

[Prior Art] Current full-text search of computer text information uses one of two indexing methods: the character list method and the word list method. The character list method builds the index using the characters in a document as the unit of search, and therefore requires a large storage space. The word list method builds the index using the words in a document as the unit of search; it uses less storage space and improves search speed, but index creation is slow and the rate of missed matches is high.

[0003] Japanese Unexamined Patent Publications Nos. 8-235212, 8-101848 and 10-307841 disclose full-text search systems that create an index using the character list method and store the index information of documents through the file system. For each character in a character string, the system creates a corresponding file to store the positions of that character in each document. To save storage space for character position data, when creating a character index the system stores the first occurrence position of each character in the corresponding index file. Starting from that first position, a difference algorithm (defined in the explanation of terms in this specification) forms a difference value from each subsequent position and the position preceding it, and the difference values are stored sequentially after the first position. When a full-text search is performed, the second position is restored from the first position of each character stored in the index file and the first difference value that follows it. The third position is then restored from the restored second position and the next difference value. This restoration is repeated until a matching position for the character of the search word is found. For a frequently occurring character such as "の", matching against a character position far toward the end requires restoring every position of the character one by one, from the first position up to the position to be matched. For example, if a character appears 1000 times in a document, obtaining the 999th position requires performing the restoration 998 times, starting from the first difference value. Consequently, the full-text search systems described above spend a great deal of time restoring difference values to the positions of each character in the document.
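The cost of this sequential restoration can be illustrated with a minimal Python sketch. It is not taken from the cited publications: the function name `restore_nth_position`, the plain-integer difference list and the sample data are assumptions made only for illustration.

```python
def restore_nth_position(first_position, differences, n):
    """Return the n-th position (1-based) and the number of restoration steps used."""
    position = first_position
    steps = 0
    for diff in differences[: n - 1]:
        position += diff  # each position depends on the one restored just before it
        steps += 1
    return position, steps


# Hypothetical data: a character that occurs 1000 times, once every 7 characters.
first = 5
diffs = [7] * 999
position, steps = restore_nth_position(first, diffs, 999)
```

With 1000 occurrences, reaching the 999th position takes 998 additions, which is the linear cost the paragraph above describes.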

[0004]

[Problem to Be Solved by the Invention] With the rapid development of computer network technology, conventional full-text search systems cannot meet the constantly growing demand for data retrieval.

[0005] Accordingly, an object of the present invention is to provide a novel method and system for indexing and retrieving character information that creates an index of computer character information, supports high-speed storage and retrieval of large volumes of data, and supports sharing of data among multiple users.

[0006] One object of the present invention is to provide a document indexing method that reduces the space required to store a document's index while ensuring high-speed full-text search.

[0007] Another object of the present invention is to provide a method for full-text search using the index described above.

[0008] Another object of the present invention is to provide a method for quickly locating, among a large number of documents, the document that corresponds to an arbitrary position of a character belonging to that document.

[0009]

[Means for Solving the Problem] To achieve the above objects, the inventors have developed a novel method and system for full-text recording and full-text retrieval of character information. The method and system are developed in client/server mode. The features of an SQL server relational database are used to create the index of character information, so that large volumes of data can be stored and the shareability, consistency and integrity of the data can be improved. As a result, web servers on the Internet and on intranets can offer large-capacity, high-speed full-text search, and comprehensive sharing of information sources can be realized.

[0010] The present invention provides a method of creating an index of character information, comprising the steps of: determining the position of each character in a group of documents to be searched according to the order of all characters across all the documents; sequentially storing the position data of a character in one or more database blocks corresponding to that character, and obtaining the maximum position and the minimum position stored in each database block; and dividing each database block into a plurality of miniblocks, each having a storage space of a plurality of bytes, and obtaining the minimum position stored in each miniblock.
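The three steps of this indexing method can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of the invention: positions are kept as plain integers rather than encoded bytes, block and miniblock sizes are counted in positions rather than bytes, and the names `build_index`, `BLOCK_CAPACITY` and `MINI_CAPACITY` are hypothetical.

```python
from collections import defaultdict

BLOCK_CAPACITY = 4  # positions per database block (hypothetical, tiny for illustration)
MINI_CAPACITY = 2   # positions per miniblock (hypothetical)


def build_index(documents):
    """Index a list of document strings by absolute character position."""
    # Step 1: number every character in the order of all characters in all documents.
    positions = defaultdict(list)
    position = 0
    for text in documents:
        for ch in text:
            position += 1
            positions[ch].append(position)

    index = {}
    for ch, plist in positions.items():
        blocks = []
        # Step 2: store the positions in one or more blocks; record min and max per block.
        for i in range(0, len(plist), BLOCK_CAPACITY):
            chunk = plist[i:i + BLOCK_CAPACITY]
            # Step 3: split each block into miniblocks; record the min per miniblock.
            minis = [chunk[j:j + MINI_CAPACITY]
                     for j in range(0, len(chunk), MINI_CAPACITY)]
            blocks.append({
                "min": chunk[0],
                "max": chunk[-1],
                "miniblocks": [{"min": m[0], "positions": m} for m in minis],
            })
        index[ch] = blocks
    return index


index = build_index(["abab", "aaba"])
```

In this example the character "a" occurs at positions 1, 3, 5, 6 and 8, so it occupies two blocks: the first spans positions 1 to 6 and holds two miniblocks, the second holds position 8 only.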

[0011] The present invention also provides a method of searching character information based on an index of character information created according to claim 1, comprising the steps of: obtaining the relative positional relationship between the characters of a search word in order to search the index of each character of the search word; determining, for each database block of a character, whether the block may contain a position that matches the relative positional relationship; for a database block that may contain such a position, determining, for each miniblock of that database block, whether the miniblock may contain a position that matches the relative positional relationship; and, for a miniblock that may contain such a position, determining whether each position in that miniblock matches the relative positional relationship.
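The staged narrowing described above (database block, then miniblock, then individual position) can be sketched as follows. This is an illustrative reading, not the patented procedure itself: the dictionary layout, the helper names `make_block`, `contains_position` and `find_word`, and the use of the next miniblock's minimum as an upper bound are all assumptions.

```python
def make_block(*miniblocks):
    """Build one database block from sorted, non-empty lists of positions."""
    return {
        "min": miniblocks[0][0],
        "max": miniblocks[-1][-1],
        "miniblocks": [{"min": m[0], "positions": list(m)} for m in miniblocks],
    }


def contains_position(blocks, target):
    """Three-stage check: database block, then miniblock, then single positions."""
    for block in blocks:
        if not block["min"] <= target <= block["max"]:
            continue  # stage 1: this block cannot hold the target
        minis = block["miniblocks"]
        for i, mini in enumerate(minis):
            upper = minis[i + 1]["min"] if i + 1 < len(minis) else block["max"] + 1
            if mini["min"] <= target < upper:  # stage 2: only this miniblock qualifies
                return target in mini["positions"]  # stage 3: exact comparison
    return False


def find_word(index, word):
    """Return the position of the first character of every occurrence of word."""
    if not word or word[0] not in index:
        return []
    matches = []
    for block in index[word[0]]:
        for mini in block["miniblocks"]:
            for start in mini["positions"]:
                # Relative positional relationship: the k-th character sits at start + k.
                if all(contains_position(index.get(ch, []), start + k)
                       for k, ch in enumerate(word[1:], start=1)):
                    matches.append(start)
    return matches


# Hypothetical index for the text "abcabd" (positions 1 to 6).
index = {
    "a": [make_block([1], [4])],
    "b": [make_block([2, 5])],
    "c": [make_block([3])],
    "d": [make_block([6])],
}
```

Blocks and miniblocks whose stored bounds rule out the target are skipped without their positions being examined, which is the point of keeping the minimum and maximum values alongside the data.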

[0012] The present invention further provides a method of indexing and searching character information, characterized in that a document information sharing conversion unit is provided which reserves a shared memory of a certain capacity and reads part of each field of the document information of a group of documents to be searched, stored in a database, from the database into the shared memory, and in that, when a full-text search is performed, the relevant document information is obtained directly from the shared memory.

[0013] The present invention also provides a system for indexing and searching character information, comprising: index generation means for determining the position of each character in a group of documents to be searched according to the order of all characters across all the documents, sequentially storing the position data of a character in one or more database blocks corresponding to that character, obtaining the maximum position and the minimum position stored in each database block, dividing each database block into a plurality of miniblocks each having a storage space of a plurality of bytes, and obtaining the minimum position stored in each miniblock; full-text search means for obtaining the relative positional relationship between the characters of a search word in order to search the index of each character of the search word, determining for each database block of a character whether the block may contain a position that matches the relative positional relationship, determining, for a database block that may contain such a position, whether each miniblock of that block may contain a position that matches the relative positional relationship, and determining, for a miniblock that may contain such a position, whether each position in that miniblock matches the relative positional relationship; and storage means for storing documents and their index information.

[0014] Furthermore, the present invention provides a system for indexing and searching character information, characterized by comprising document information sharing conversion means that reserves a shared memory of a certain capacity, reads part of each field of the document information of a group of documents to be searched, stored in a database, from the database into the shared memory, and, when a full-text search is performed, obtains the relevant document information directly from the shared memory.

[0015] [Explanation of Terms] The following description uses several common database terms such as list, record and field. A list is a component of the structure of a database and consists of a large number of record items. Each record item consists of a plurality of fields. The following terms are used in this specification and are explained below.

[0016] Document category: Documents to be recorded are classified into a plurality of document categories according to their content, author, publication time, recording operator, the main computer used for recording, or other factors. Each document category consists of a plurality of documents.

[0017] Document: An editorial, a novel, a news article, a patent specification, and the like.

[0018] Character: The characters referred to in this specification include letters (including one-byte or two-byte characters such as alphabetic letters, kanji, Japanese characters, various other scripts, hiragana and katakana), punctuation marks, digits, special characters, tabs and the like.

[0019] Internal code of a character: Different operating systems use different code standards for two-byte characters. For example, the Japanese code standard on the WINDOWS (registered trademark) platform is Shift JIS (the 8-bit Japanese code standard used on Macintosh and DOS-V). The Japanese code standard on the UNIX platform is EUC (Extended UNIX (registered trademark) Code). To record documents from different platforms, the system of the present invention uses an internal code scheme that converts the codes of the same character under the various standards, such as JIS (Japanese Industrial Standards), Shift JIS or EUC, into a single corresponding internal code.
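The idea of a single internal code can be sketched with Python's built-in codecs. The specification does not say what the internal code actually is; using the Unicode code point as the internal code, and the function name `to_internal_code`, are assumptions made purely for illustration.

```python
def to_internal_code(raw: bytes, encoding: str) -> int:
    """Map one encoded character to a single platform-independent code."""
    text = raw.decode(encoding)
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)  # assumption: the Unicode code point serves as the internal code


# The same character under three Japanese code standards.
sjis = "日".encode("shift_jis")
euc = "日".encode("euc_jp")
jis = "日".encode("iso2022_jp")
codes = {
    to_internal_code(sjis, "shift_jis"),
    to_internal_code(euc, "euc_jp"),
    to_internal_code(jis, "iso2022_jp"),
}
```

The three byte sequences differ, yet `codes` contains a single value, so an index keyed by the internal code treats documents from all three platforms uniformly.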

[0020] Character position: When documents are recorded according to the recording order of the documents in a document category and the order of the characters within each document, each character in a document is assigned an absolute position within the document category. For example, the character position of the first character in the document category is 1, and the character positions of the subsequent characters are 2, 3, and so on.
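A minimal sketch of this numbering follows, assuming the documents of one category are available as Python strings in recording order. The function name `assign_positions` and the sample documents are hypothetical.

```python
def assign_positions(documents):
    """Return per-document (start, end) ranges and a {character: [positions]} map."""
    ranges = []
    by_char = {}
    next_position = 1  # the first character of the category has position 1
    for text in documents:  # documents in recording order
        start = next_position
        for ch in text:  # characters in document order
            by_char.setdefault(ch, []).append(next_position)
            next_position += 1
        ranges.append((start, next_position - 1))
    return ranges, by_char


ranges, by_char = assign_positions(["abc", "ca"])
```

Positions continue across document boundaries: the second document here starts at position 4, immediately after the last character of the first.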

[0021] Difference value: Within a document category, the difference value between two character positions is calculated by the difference algorithm from the current character position of a character and its previous character position.

[0022] Difference algorithm: Within a document category, the difference between the current position of a character and its previous position is converted from a number system with a smaller radix to one with a larger radix, for example from decimal to base 127. So that individual difference values can be told apart, the difference value for each character position is obtained by setting to 1 the most significant bit of every digit of the base-127 difference except the final (units) digit. For example, the decimal value 2048383 is converted to base 127 as follows.

[0023] (2048383)_10 = (0x01 0x00 0x00 0x00)_127. The resulting base-127 value has four digits, each written in hexadecimal. The most significant bit of each of the three digits other than the final digit 0 is set to 1 (that is, each is ORed with 0x80). The resulting difference value is therefore (81808000)_16.

[0024] Because the positions of characters in documents vary, the difference values take various values, and the number of bytes in each difference value differs as well. In a difference value, the most significant bit of the final digit (the eighth bit of the one-byte final digit) is 0, while the most significant bits of all other digits are set to 1, so each difference value can be identified. For example, suppose two consecutive positions of the character "日" in a document category are 1390 and 1450. The difference between the two positions is ((1450 - 1390) % 127)_10 = (60)_10 = (3C)_16. Only one byte is needed to store this difference value. As another example, suppose two consecutive positions of the character "好" in a document category are 1308 and 9054. The difference value for this character is calculated as follows:

Step 1: Computing ((9054 - 1308) / 127)_10 gives a quotient of (60)_10 and a remainder of (126)_10 = (7E)_16.

[0025] Step 2: Computing (60 / 127)_10 gives a quotient of (0)_10 and a remainder of (60)_10 = (3C)_16.

[0026] In the base-127 value obtained in steps 1 and 2, the most significant bit of the units digit (7E)_16 is left unchanged. The most significant bit of the other digit, (3C)_16, is set to 1, so that (3C)_16 becomes (BC)_16. The resulting difference value (BC7E)_16 requires two bytes of storage.
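The difference algorithm and the three worked values above can be reproduced with a short Python sketch. The function name `encode_difference` is hypothetical; the arithmetic follows the steps given in the text (repeated division by 127, then setting the most significant bit of every digit except the last).

```python
def encode_difference(diff: int) -> bytes:
    """Encode a non-negative difference as base-127 digits, one digit per byte."""
    if diff < 0:
        raise ValueError("difference must be non-negative")
    digits = [diff % 127]  # units digit first
    diff //= 127
    while diff > 0:
        digits.append(diff % 127)
        diff //= 127
    digits.reverse()  # most significant digit first
    # Every digit is at most 126, so bit 7 is free to mark "more digits follow".
    return bytes([d | 0x80 for d in digits[:-1]] + [digits[-1]])


example_large = encode_difference(2048383)      # the value from paragraph [0023]
example_one_byte = encode_difference(1450 - 1390)
example_two_bytes = encode_difference(9054 - 1308)
```

Because only the final byte of each difference value has its most significant bit clear, consecutive difference values can be stored back to back without separators.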

[0027] Restoration: When a full-text search is performed, the difference values must be restored to obtain the character position of each character in the corresponding document category. The algorithm for restoring a character position from its difference value is the inverse of the difference algorithm.
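A sketch of the inverse step follows, assuming the difference values are stored as consecutive bytes in the format described under "Difference algorithm". The function name `decode_positions` and the error handling for truncated data are assumptions.

```python
def decode_positions(first_position: int, data: bytes) -> list:
    """Restore absolute positions from a first position and encoded difference values."""
    positions = [first_position]
    value = 0
    in_progress = False
    for byte in data:
        value = value * 127 + (byte & 0x7F)  # accumulate one base-127 digit
        if byte & 0x80:
            in_progress = True  # more digits of this difference value follow
        else:
            positions.append(positions[-1] + value)  # final digit: difference complete
            value = 0
            in_progress = False
    if in_progress:
        raise ValueError("data ends in the middle of a difference value")
    return positions


restored = decode_positions(1308, bytes.fromhex("bc7e"))
chained = decode_positions(10, bytes.fromhex("3c" "bc7e" "81808000"))
```

Each restored position is the previous position plus the decoded difference, so positions can only be recovered in order from the start of the stored run.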

[0028] Miniblock: A miniblock is a structure defined by the inventors for storing data. It consists of a data item for storing a character position and an array for storing a plurality of difference values of character positions. For example, to store many position data for one character, a miniblock has an item of long-integer type and an array of BYTE type. The long-integer item stores one character position of the character, and the BYTE array stores the difference values of the character positions that follow the position stored in the long-integer item.
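One possible in-memory model of such a miniblock is sketched below. The class name `MiniBlock`, its field names, and the extra `last_position` bookkeeping field are assumptions; the specification only names a long-integer item and a BYTE array.

```python
from dataclasses import dataclass, field


@dataclass
class MiniBlock:
    first_position: int  # the long-integer item
    differences: bytearray = field(default_factory=bytearray)  # the BYTE array
    last_position: int = field(init=False, default=0)  # bookkeeping, not in the text

    def __post_init__(self):
        self.last_position = self.first_position

    def append(self, position: int) -> None:
        """Store a later position as a base-127 difference value."""
        diff = position - self.last_position
        if diff <= 0:
            raise ValueError("positions must be strictly increasing")
        digits = []
        while True:
            digits.append(diff % 127)
            diff //= 127
            if diff == 0:
                break
        digits.reverse()
        self.differences += bytes([d | 0x80 for d in digits[:-1]] + [digits[-1]])
        self.last_position = position

    def positions(self) -> list:
        """Restore every position held in this miniblock."""
        result = [self.first_position]
        value = 0
        for byte in self.differences:
            value = value * 127 + (byte & 0x7F)
            if (byte & 0x80) == 0:
                result.append(result[-1] + value)
                value = 0
        return result


mini = MiniBlock(1308)
mini.append(9054)
mini.append(9114)
```

Here three positions occupy one long-integer item plus three bytes of differences, and restoration never has to start earlier than the miniblock's own first position.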

[0029] Database block: A database block is a physical area in the database for storing data and is one field in the list structure. A database block consists of a plurality of miniblocks and can store a plurality of position values of a character. Each record item of the list consists of one database block and two fields that store, respectively, the minimum and maximum positions of the character stored in that database block. With these two fields, a search can quickly determine whether the database block of the current record item may contain a character position that matches the other characters of the search word. If it is determined that a matching character position may exist in the current record item, further restoration processing is performed on that database block. If it is determined that no matching character position exists in the current record item, the next record item of the character is examined and restored.

[0030] Search word: A search word is a character string consisting of one or more characters. It is the character string to be searched for and is specified by the operator.

[0031] Matching position: Because the index of each document is stored with each character as the unit, the position of one character in the search word is designated as the position of the whole search word, that is, the matching position, and matching is performed against the character position information stored in the database. For example, suppose that in the search word "米国アメリカ" the character position of "米" is 10001, that of "国" is 10002, that of "ア" is 10003, that of "メ" is 10004, that of "リ" is 10005, and that of "カ" is 10006. If the character position of the first character of the search word is used as the matching position of the search word, the matching position of the search word "米国アメリカ" is 10001, the character position of "米".

[0032] Displacement: The displacement is the offset of each character of the search word relative to the matching position. For example, when the first character of the search word "米国アメリカ" is set as the starting point, the displacement of "米" is 0, that of "国" is -1, that of "ア" is -2, that of "メ" is -3, that of "リ" is -4, and that of "カ" is -5. The last character of the search word can also be set as the starting point; in that case the displacement of "米" is 0, that of "国" is 1, that of "ア" is 2, that of "メ" is 3, that of "リ" is 4, and that of "カ" is 5.
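The relationship between matching position and displacement can be sketched for the case where the first character is the starting point. Under that convention, adding a character's displacement to one of its positions gives the matching position that occurrence would imply (10002 + (-1) = 10001). The function names and the extra sample positions (20000 and 500) are hypothetical.

```python
def displacements(word):
    """Displacement of each character when the first character is the starting point."""
    return [-i for i in range(len(word))]


def match_positions(word, positions_by_char):
    """Return every matching position at which all characters of word line up."""
    if not word:
        return []
    candidates = None
    for ch, disp in zip(word, displacements(word)):
        # Matching position implied by each occurrence of this character.
        implied = {p + disp for p in positions_by_char.get(ch, [])}
        candidates = implied if candidates is None else candidates & implied
        if not candidates:
            return []
    return sorted(candidates)


WORD = "米国アメリカ"
POSITIONS = {
    "米": [10001, 20000],  # 20000 is a stray occurrence that does not start the word
    "国": [10002],
    "ア": [500, 10003],
    "メ": [10004],
    "リ": [10005],
    "カ": [10006],
}
result = match_positions(WORD, POSITIONS)
```

Only the occurrence starting at 10001 survives the intersection, because it is the only matching position implied by all six characters.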

[0033] As described above, the present invention records and stores all documents in a plurality of document categories. Each document category consists of a plurality of documents. The characters in the documents are stored according to the recording order of the documents.

[0034] From the recorded documents, document category information, document information, character information and character position information are extracted and stored in the database. These four kinds of information are described below.

[0035] Document category information: According to their content, documents can be classified into various document categories such as politics, economy, sports and travel. Recorded documents can also be classified into other document categories according to the recording operator or other factors. For each document category, the present invention assigns a document category number that uniquely identifies the category. Each document category also has a document category name and the final character position in the document category. The final character position of a document category is the character position of the last character in the document most recently recorded in that category. Document category information therefore includes the document category number, the document category name and the final character position.

[0036] Document information: Because each document category contains a plurality of documents, each recorded document is assigned a document number. A document number is unique within a document category; the numbers may be long-integer values assigned consecutively from 1, or they may be non-consecutive integers.

[0037] Accordingly, document information includes the unique document number of each document, the number of the document category to which the document belongs, and the character positions of the first and last characters in the document (in the following description, these two positions are called the start position and the end position of the document).
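Given start and end positions per document, the document containing a matched character position can be found quickly. The sketch below uses a binary search over the start positions; this lookup strategy, the function name `find_document` and the sample records are assumptions, since the specification at this point only lists the fields.

```python
from bisect import bisect_right

# Hypothetical document information for one category:
# (document number, start position, end position), sorted by start position.
DOCUMENTS = [(1, 1, 120), (2, 121, 450), (5, 451, 980)]
STARTS = [start for _, start, _ in DOCUMENTS]


def find_document(position):
    """Return the number of the document containing position, or None."""
    i = bisect_right(STARTS, position) - 1  # last document starting at or before position
    if i < 0:
        return None
    number, start, end = DOCUMENTS[i]
    return number if start <= position <= end else None


hit = find_document(300)
```

Document numbers need not be consecutive (document 5 follows document 2 here), which is consistent with the description of document numbers above.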

[0038] Character position information: Each document category contains a plurality of documents, and each document contains a plurality of characters. All the distinct characters together form a character set. Each character in the character set has a corresponding internal code. Because each character may appear multiple times in the documents of a document category, all character positions of all characters in each document category are stored in the database. Character position information includes the number of the document category to which the character belongs, the internal code corresponding to the character, a database block for storing character position data (comprising a character position and difference values), and the first and last character positions stored in that database block (hereinafter called the minimum position and the maximum position of the database block).

[0039] Character information: To speed up searching, the present invention stores character information for each character of each document category in the database. Character information includes the number of the document category to which the character belongs, the corresponding internal code, the maximum character position of the character in that document category, the minimum character position in the character's last database block, and the length of the data stored in the character's last database block.

[0040] In the present invention, an index of the character positions of all characters in a document category is created with each character as the unit. That is, every character position of every character of the character set, in every document of the document category, is stored in the database.

[0041] Characters such as the Chinese character "的" and the Japanese character "の" are used frequently. To reduce the space needed to store the character position information of a character, the method of the present invention represents the position information of each character in the documents of a document category as its first character position in the category followed by its subsequent character positions. Each subsequent position is further represented as the difference value between two consecutive character positions. In terms of physical structure, what is actually stored in a database block is a first character position and the difference values of the subsequent character positions. Because the difference value of a position uses fewer bytes than the actual character position value, the space used by the index of each character is reduced accordingly.

[0042] Representing each position that follows the first character position as a difference value reduces the space used, but it lowers search speed. A character that appears very frequently in a document category has a large number of difference values. For example, if a character appears 100,000 times in a document category, restoring the 90,000th character position from the first character position and the difference values of the subsequent positions takes a great deal of time.

[0043] To solve this problem, the present invention improves on JetSearch, a full-text search system that stores character position data in system files. In the present invention, the position data of a character is stored not as a single file but as multiple parts. In the database 104, each record item has a database block field for storing position data of a character within the document category to which it belongs. A single character in a document category may have multiple database blocks to store its position data.

【００４４】When the position data of a frequently used character is stored, a large amount of position data exists for that character. During a search, the system must determine and retrieve the relevant positions quickly and reliably in order to find the character positions that satisfy the match condition for a character that appears many times. For example, when searching the sports news document category for documents containing "NBA", if the search has established that the first N is the 10,000th character of the category, then any B or A located before the 10,000th character cannot satisfy the match condition. With database blocks that store difference values contiguously, a search would have to restore character positions one at a time, from the first difference value of each database block up to the position that satisfies the match condition for each character of the search word. This wastes system time and is not a good approach.
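The match condition in the "NBA" example can be illustrated with a minimal in-memory sketch. It assumes each character's positions are already available as a Python list keyed by character; the `index` dictionary and the `phrase_positions` function are hypothetical names, and the sketch ignores the block structure that the patent introduces to make this lookup fast.

```python
def phrase_positions(index, phrase):
    """Return the start positions at which the characters of `phrase`
    occur at consecutive positions in the document category."""
    if not phrase:
        return []
    # Position sets for the second and later characters of the phrase.
    rest = [set(index.get(ch, ())) for ch in phrase[1:]]
    return [
        p
        for p in index.get(phrase[0], ())
        # Character k+1 of the phrase must sit exactly k+1 places after p.
        if all(p + k + 1 in s for k, s in enumerate(rest))
    ]
```

With an N at position 10,000, only a B at 10,001 and an A at 10,002 can complete the match, so every earlier B or A can be skipped.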

【００４５】The inventors therefore divide a database block into many miniblocks, each of which has a fixed number of bytes for storing position data. The start bytes of each miniblock store a character position of the character in the document category. This character position (called the minimum position of the miniblock in the following description) is not a difference value; it is determined by the order of all the characters in the document category, and it is distinct from the maximum and minimum positions of the database block. The bytes that follow the start bytes of the miniblock store the difference values between consecutive character positions. The character positions of a character all differ from one another, the difference values also vary, and so does the number of bytes used for each difference value. If the remaining bytes of a miniblock are not sufficient for a new difference value, the system fills the remaining bytes with 0x00 and starts a new miniblock, storing the pending character position in its start bytes as the minimum position of the new miniblock. Each subsequent character position is then stored in the new miniblock as a difference value. In other words, after a certain number of positions of a character have been expressed as difference values, a new miniblock is needed, a minimum position is defined for it, and the following positions are again expressed as difference values. When every miniblock in a database block is full of position data for the character, a new record item must be created for that character, and storage of the position data continues in the database block of the new record item.
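The packing rule above can be sketched as follows. The patent text here does not specify how a difference value is serialized, so the 7-bits-per-byte variable-length encoding in `encode_varint`, the big-endian 4-byte head, and the names `encode_varint` and `pack_miniblocks` are all assumptions made for illustration. One useful property of the assumed encoding is that a difference of 1 or more never produces a 0x00 byte, so the 0x00 padding stays unambiguous.

```python
import struct


def encode_varint(n):
    """Assumed encoding: 7 payload bits per byte, high bit set if more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def pack_miniblocks(positions, block_size):
    """Pack strictly increasing positions into fixed-size miniblocks."""
    blocks = []
    current = None  # miniblock being filled
    prev = None     # last position written
    for pos in positions:
        if current is not None:
            delta = encode_varint(pos - prev)
            if len(current) + len(delta) <= block_size:
                current += delta
                prev = pos
                continue
            # Not enough room: pad with 0x00 and close this miniblock.
            current += b"\x00" * (block_size - len(current))
            blocks.append(bytes(current))
        # Start a new miniblock; its head holds the absolute (minimum) position.
        current = bytearray(struct.pack(">I", pos))
        prev = pos
    if current is not None:
        current += b"\x00" * (block_size - len(current))
        blocks.append(bytes(current))
    return blocks
```

A very small `block_size` is used in testing only to force a split; the text recommends miniblocks a few hundred bytes long.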

【００４６】If the miniblock length is set too long, restoring positions takes a great deal of time. If it is set too short, a relatively large amount of space is used. A miniblock is therefore preferably a few hundred bytes long. In other words, it is appropriate to start a new miniblock for every 100 to 200 occurrences of a character.

【００４７】However, a commonly used character may appear more than 1,000,000 times in a single document category. The present invention is therefore configured to create a new record item, and store the subsequent position data of the character in it, every 400,000 to 800,000 occurrences of the character in a document category (for example, whenever 4,000 miniblocks become full). Each record item should also contain the minimum position and the maximum position of its database block.

【００４８】As described above, when the database of character position data is built, each record item that stores character position data has two redundant fields in order to shorten the time spent searching for a character. These fields hold the minimum and maximum character positions, within the document category, of the character stored in the database block of that record item, that is, the minimum position and the maximum position of the database block. When a full-text search retrieves from the database all the records that store position data for all the characters of the search word, there will certainly be many record items, each with a database block of character position data. Each database block consists of thousands of miniblocks, each with a capacity of a few hundred bytes for storing character position data (the minimum position of the miniblock and the difference values). If the record items did not contain the minimum and maximum positions of their database blocks, it would take a very long time to compare the minimum positions of the miniblocks one by one, in every database block of the many record items, until the matching position of each character of the search word was found. With the two fields added to each record item, however, once the position of one character of the search word in the document category is known, that position can be compared with the minimum and maximum positions of each database block that holds position data for the other characters of the search word. This determines whether a given database block can contain the character position being sought. Once the database block has been identified, the minimum positions of its miniblocks quickly show which miniblock can contain the character position being sought. Compared with restoring the positions of each character one by one, this reduces the time required.
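The two-stage narrowing described above can be sketched as below. The dictionary keys `min_pos` and `max_pos` and both function names are hypothetical, and the use of binary search over the miniblock minimum positions is an assumption: the text only says the miniblock is determined quickly from those minimum positions.

```python
from bisect import bisect_right


def candidate_records(records, target):
    """Keep only the record items whose database block range can contain `target`."""
    return [r for r in records if r["min_pos"] <= target <= r["max_pos"]]


def candidate_miniblock(miniblock_minima, target):
    """Return the index of the only miniblock that can contain `target`,
    given the sorted minimum positions of the miniblocks, or None."""
    i = bisect_right(miniblock_minima, target) - 1
    return i if i >= 0 else None
```

Only the selected miniblock then needs its difference values restored, rather than every position of the character.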

【００４９】In summary, the index of each character in the present invention is organized in three levels.

【００５０】First, each character may have multiple record items. Each record item contains a database block for storing character position data, together with the minimum and maximum positions of the character stored in that database block.

【００５１】Second, each database block contains several thousand miniblocks for storing character position data. Each miniblock consists of two parts: the first part stores the first character position held in the miniblock, that is, the minimum character position of the miniblock, and the second part stores the subsequent character positions as difference values.

【００５２】Third, each subsequent character position of a character in the document category is expressed as the difference between that position and the preceding position.

【００５３】[0053]

Embodiments of the Invention

Embodiments of the present invention are described below with reference to the drawings. The spirit of the present invention is clearly not limited to these embodiments. In the following embodiments, a client/server network structure is used as an example of a hardware platform for implementing the method of the present invention. That is, the client is an example of terminal operating means, and the server is an example of information storage/processing means. The network described in the following embodiments should likewise be understood as an example of connecting means. Given the following teachings, it will be apparent to those skilled in the art that the method of the present invention can also be implemented on other computer systems, such as a single computer or a browser/browser server arrangement.

【００５４】The client (terminal operating means) 101 is the platform used by the operator; requests to record, update, or search documents are sent from it to the server 103.

【００５５】The network (connecting means) 102 carries information between the client and the server.

【００５６】The server (information storage/processing means) 103 receives the requests to record, update, or search documents that the client 101 sends over the network 102. During processing by the document index generation unit 105, the position information of all the characters in all the documents is stored in the database 104. The full-text search engine 106 retrieves the document information that satisfies the search request specified by the operator.

【００５７】FIG. 1B shows a hardware block diagram of the server 103. In FIG. 1B, the CPU 1 executes the program stored in the RAM 3 and performs each control operation for the server. The ROM 2 stores the program that implements the processing shown in the flowcharts, and this program is executed by the CPU 1. The RAM 3 provides the working space in which the CPU 1 executes programs. The CRT 4 displays output under the control of the CPU 1. The keyboard 5 is used to enter information. The external storage device 6 is a hard disk or floppy disk that stores the documents to be searched and the character index information generated from them. The bus 7 connects these components and carries data between them.

【００５８】The database 104 stores the full-text index data of all document categories as well as other kinds of document information data, and resides on the external storage device 6.

【００５９】The document index generation unit 105 records documents in the database 104 according to the character list method, and is implemented by the CPU 1.

【００６０】The full-text search engine 106 performs full-text searches, and is implemented by the CPU 1.

【００６１】The document information sharing conversion unit 107 provides a block of shared memory in the RAM 3 of the server for holding the information of recorded documents, and is implemented by the CPU 1. When a document index is generated, the document index generation unit 105 updates the document information in the shared memory in a timely manner. When a full-text search is performed, the full-text search engine 106 obtains the document-related information directly from the shared memory.

【００６２】In the client/server network structure, the system of the present invention runs on the server and mainly comprises the document index generation unit 105, the full-text search engine 106, the document information sharing conversion unit 107, and the database 104.

【００６３】The database 104 is used to store a plurality of document categories. The document index generation unit 105 can record one or more documents in a specified document category and can create or update the full-text index of that category. When a document is recorded, the document index generation unit 105 converts each character in the document into its corresponding internal code, and the position information of each character within the document category to which it belongs is stored.

【００６４】When a full-text search is performed according to a search condition and a fuzzy value (0 means an exact search and a value greater than 0 means a fuzzy search; a higher fuzzy value means lower match precision and more search results), the full-text search engine 106 retrieves the relevant record items from the database 104, compares the positions of the characters of the search word, and finds the character strings that match the search condition. It then returns the numbers of the documents that contain the search word and the position of the search word in each document.

【００６５】Because the database 104 stores a large amount of text information, a full-text search requires a very long database processing time. To reduce input/output on the database 104, shorten the search time, and improve system performance, the inventors designed the document information sharing conversion unit 107, which reserves a block of shared memory in the RAM 3. The list of document information in the database 104 is then queried: the data items of all the documents in the specified document category that describe the document number and the position range of each document (document number, document start and end positions, and deletion flag) are read from the database 104 into the shared memory in record order and remain resident there. During a full-text search, a binary search converts the document position determined from a match position of the search word into the corresponding document number. When a new document is recorded, its information is added to the shared memory. When a document is deleted, its deletion flag is set to 1 (that is, deleted).
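The position-to-document conversion described above can be sketched with Python's standard `bisect` module. The `DocInfo` tuple and `DocumentTable` class are hypothetical names, an in-process list stands in for the shared memory block, and the sketch assumes that document position ranges do not overlap and that newly recorded documents always start after the existing ones.

```python
from bisect import bisect_right
from typing import NamedTuple


class DocInfo(NamedTuple):
    number: int
    start: int   # position of the first character of the document
    end: int     # position of the last character of the document
    deleted: bool = False


class DocumentTable:
    """In-memory stand-in for the shared document information."""

    def __init__(self, docs):
        self.docs = sorted(docs, key=lambda d: d.start)
        self.starts = [d.start for d in self.docs]

    def find(self, position):
        """Return the number of the live document containing `position`, or None."""
        i = bisect_right(self.starts, position) - 1  # binary search on start positions
        if i < 0:
            return None
        doc = self.docs[i]
        if position > doc.end or doc.deleted:
            return None
        return doc.number

    def add(self, doc):
        """Record a new document (assumed to start after all existing ones)."""
        self.docs.append(doc)
        self.starts.append(doc.start)

    def mark_deleted(self, number):
        """Set the deletion flag instead of removing the entry."""
        for i, doc in enumerate(self.docs):
            if doc.number == number:
                self.docs[i] = doc._replace(deleted=True)
```

Keeping the entries sorted by start position is what allows each lookup to take logarithmic rather than linear time.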

【００６６】The features of the present invention lie mainly in the following three points.

【００６７】The first point is the structure of the index of character position information in the database 104. Because a character may appear many times in a document category, all of its positions in the category must be stored. If each position were stored in the database as a field of integer or long integer type (4 bytes per position), an enormous amount of storage space would be needed. To reduce wasted storage space while still allowing full-text searches at the desired speed, the inventors use a difference algorithm to calculate the difference between the current position and the previous position of a character in a document category. Each subsequent position of that character in the document category is expressed as a difference value and stored in the database as a binary type (image type) field.

【００６８】In the present invention, the field of each record that stores the positions of a character in the document category to which it belongs is called a database block. Each database block is divided into 4,000 miniblocks, and each miniblock has 260 bytes for storing character position data. As shown in FIG. 2, the first 4 bytes of each miniblock store a character position of the character in the document category. Each byte from the fifth byte onward stores the difference values calculated by the difference algorithm between two consecutive positions of the character (the current position and the previous position). If the remaining bytes of a miniblock are not sufficient for a new difference value, the system fills the remaining bytes with 0x00. The system then uses a new miniblock, stores the current position of the character in the document category in its first 4 bytes as the minimum position of this miniblock, and expresses each subsequent position of the character as a difference value. In other words, after the positions of a character have been expressed as difference values a certain number of times (for example, 100 to 200 times), a new miniblock is needed for storage and is given a minimum position. Each subsequent position of the character is then expressed as a difference value and stored after the minimum position in the new miniblock. When all 4,000 miniblocks of a database block have been filled with position data for a character, a new record item must be created for that character.
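Restoring the positions held in one 260-byte miniblock can be sketched as follows. The 4-byte head and the 0x00 padding come from the description above, but the big-endian byte order and the 7-bits-per-byte variable-length encoding of each difference value are assumptions, since this passage does not specify the serialization. `MINIBLOCK_SIZE` and `decode_miniblock` are hypothetical names.

```python
import struct

MINIBLOCK_SIZE = 260  # 4-byte minimum position + 256 bytes of difference values


def decode_miniblock(block):
    """Restore the absolute character positions stored in one miniblock."""
    (position,) = struct.unpack(">I", block[:4])  # minimum position of the miniblock
    positions = [position]
    i = 4
    # A 0x00 byte where a difference value would start marks the padding.
    while i < len(block) and block[i] != 0x00:
        delta = 0
        shift = 0
        while True:
            byte = block[i]
            i += 1
            delta |= (byte & 0x7F) << shift  # low 7 bits carry the payload
            shift += 7
            if not byte & 0x80:              # high bit clear: last byte of this value
                break
        position += delta
        positions.append(position)
    return positions
```

Because each miniblock starts from an absolute position, decoding never has to look at any earlier miniblock.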

【００６９】The second point is as follows. To obtain full-text search results quickly and accurately from the character position information stored in the character index structure of the database 104, the inventors designed a method for searching and matching character positions. Taking advantage of the characteristics of the character index structure, the method uses a query language to locate the positions of matching words quickly. The document information sharing conversion unit then obtains the document number of the document in which each retrieved result is located.

【００７０】Third, when a large amount of data is stored, database search operations cause frequent database input/output, degraded database performance, longer database access times, and slower full-text searches. To reduce the load on the database, shorten database search time, and increase full-text search speed, the inventors designed a method for quickly converting character position information into document information. Taking advantage of the fast access of cache memory, the document information data items for the specified document category in the database 104, which describe the document number and the position range of each document and include the document number, document start position, document end position, and deletion flag, are read into the RAM 3 in a single pass. For any character position, a binary search over the position ranges of the documents quickly finds the matching document data in memory and yields the document number. When documents are recorded or deleted, the document information sharing conversion unit also provides a corresponding interface for updating the related document information in a timely manner.

【００７１】Description of Each Embodiment
First, the procedure by which the system of the present invention runs on the network 102 is described by way of example.

【００７２】While the full-text search system on the server 103 processes a character search request from a user via the client 101, the following steps are performed:
1. The user enters the character or word to be searched via the keyboard of the client 101. If two or more characters or words are to be searched, a logical relationship between them, such as OR, AND, or NOT, should be given.

【００７３】2. The given search words and logical relationships are sent over the network 102 to the full-text search system running on the server 103.

【００７４】3. On the server 103, the full-text search engine 106 of the full-text search system handles the received search by searching the document index for each search word, and obtains all the relevant match positions from the database 104.

【００７５】4. To obtain the document number and deletion flag of the document in which each match position is located, the document information sharing conversion unit 107 compares each match position with the document start position and document end position in the document information (document number, document start position, document end position, deletion flag, and so on). The valid document numbers are determined from the deletion flag of each document, and these document numbers form the result set that is the output of the full-text search system.

【００７６】5. The full-text search system sends this search result set over the network 102 to the client 101, which displays it on its screen.

【００７７】Document Index Creation Process
The present invention provides a method for creating a document index in which difference values are calculated by a difference algorithm from consecutive positions of a character in the document category to which it belongs and are stored in an image-type database block of the database 104. Such a database block consists of, for example, 4,000 miniblocks, each made up of a long integer value (4 bytes) followed by 256 bytes. See FIG. 2 for the structure of the database block. A database block that stores character position data therefore has 1,040,000 bytes ((256 + 4) * 4000 = 1,040,000).
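The size arithmetic above, together with the figure of 100 to 200 occurrences per miniblock given earlier in the description, can be checked with a short calculation. The function names and default parameters below are illustrative only; the description itself notes that both the number of miniblocks and their size can be changed.

```python
def database_block_bytes(miniblocks=4000, head_bytes=4, delta_bytes=256):
    """Total size of one database block in bytes."""
    return (head_bytes + delta_bytes) * miniblocks


def occurrences_per_block(miniblocks=4000, per_miniblock=(100, 200)):
    """Rough range of character occurrences one database block can hold."""
    low, high = per_miniblock
    return miniblocks * low, miniblocks * high
```

With the default values this gives 1,040,000 bytes per database block and between 400,000 and 800,000 occurrences per record item, matching the figures in the description.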

【００７８】When a database block is defined, the number of miniblocks it contains can be changed, and so can the size of each miniblock. The process by which the document index generation unit 105 creates a document index for recorded documents is described below with reference to FIG. 3. This process is carried out by the document index generation unit 105 shown in FIG. 1 when the database 104 is first created, when a document is added to the database 104, or when a document is deleted from the database 104.

【００７９】First, the documents to be recorded are classified in advance into the corresponding document categories according to the content of each document, the operator who records it, or other factors. Each document category is given a document category name. In step 402, for example, a dialog box with an input box is displayed to prompt the operator for input. The operator enters the name of the document category to which the documents to be recorded belong.

【００８０】In step 404, the database 104 is searched for the relevant document category information according to the document category name specified in step 402. If the specified document category name does not exist in the database 104, the system assigns a document category number to the category and sets an initial value for the last position of all the characters in the category (for example, a document category number of 1 and a last position of 0). The information for this document category is then inserted into the database 104, and the document category number and the last position are returned. If the specified document category name exists in the database 104, the document category number and the last position of all the characters in the category are obtained from the database 104. Using the obtained document category number, the character information of each character in the database 104 is looked up, namely the maximum position of the character in the document category, the minimum position of the character's last database block, and the length of the character position data stored in that last database block.
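The lookup-or-create behaviour of step 404 can be sketched as below. An ordinary dictionary stands in for the database 104, and the function name `get_or_create_category`, the field names, and the rule of numbering categories sequentially from 1 are assumptions made for illustration.

```python
def get_or_create_category(categories, name):
    """Return (category_number, last_position) for `name`, creating the
    category with initial values if it does not exist yet.

    `categories` maps a category name to a dict with the hypothetical
    keys "number" and "last_position".
    """
    info = categories.get(name)
    if info is None:
        # New category: assign the next number and start positions at 0.
        info = {"number": len(categories) + 1, "last_position": 0}
        categories[name] = info
    return info["number"], info["last_position"]
```

The returned last position is the point from which the positions of newly recorded characters would continue.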

[0081] In step 406, the document information sharing conversion unit 107 is started in order to take advantage of the fast access offered by cache memory. It is determined whether the document information of the designated document category exists in the shared memory of the RAM 3. If it does not, a certain amount of shared memory in the RAM 3 is used, and the data items describing the document number and position range of each document recorded in the designated document category (the document number, the start and end positions of the document within the document category, a deletion flag, and so on) are read from the database 104 into the shared memory. Multiple users can then use this shared memory simultaneously to search the designated document category.
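
The per-document position ranges held in shared memory make it possible to map a character position back to the document that contains it. The sketch below illustrates one way to do that with a binary search. It is an assumption about how such a table could be queried, not the method claimed in the text; `DocRange` and `find_document` are hypothetical names, and the two ranges are the ones used in Examples 1 and 2.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class DocRange(NamedTuple):
    doc_number: int
    start: int             # character position of the first character
    end: int               # character position of the last character
    deleted: bool = False  # deletion flag


def find_document(ranges: List[DocRange], position: int) -> Optional[int]:
    """Return the number of the document containing `position`, if any.

    `ranges` must be sorted by start position and must not overlap.
    """
    starts = [r.start for r in ranges]
    i = bisect_right(starts, position) - 1
    if i < 0:
        return None
    r = ranges[i]
    if position > r.end or r.deleted:
        return None
    return r.doc_number


ranges = [DocRange(1, 1, 13), DocRange(130011, 30492, 30504)]
```

With this table, `find_document(ranges, 30503)` yields document number 130011, and a position that falls in no listed range yields `None`.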

[0082] In step 408, a reserved memory space in the RAM 3 is requested from the running system and initialized.

[0083] In step 410, it is determined whether a document to be recorded exists. If one exists, the process proceeds to step 412. If none exists, steps 420, 430 and 432 are performed and the related information in the database 104 is updated.

[0084] In step 412, the information of the document to be recorded is read, and the document information in the database 104 is searched by document number. If the document number is not found in the database 104, the document information is stored in the document information of the database 104. The document information includes the number of the document category to which the document belongs, the document number, and the character positions of the first and last characters of the document (the start and end positions of the document). If the document number is found in the document information of the database 104, the document has already been stored there, and an error code is returned.

[0085] In step 414, it is checked whether any unprocessed characters remain in the document being recorded. If so, the process of step 416 is performed and the character is indexed. Once the last character of the document has been processed, step 426 is performed, and the end position of the document stored in the document information of the database 104 is updated with the character position of that last character.

[0086] In step 416, the characters in the document are read sequentially and converted to the corresponding internal codes (for example, the Shift JIS code used on WINDOWS systems is converted to the system's internal code; when "アメリカ" is converted, the internal codes of "ア", "メ", "リ" and "カ" are 283, 341, 347 and 288, respectively). The character position of the character is obtained from the final position of the document category to which it belongs. The difference value between the current character position and the previous character position is then calculated by the difference algorithm.
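
The idea of storing a first position followed by gaps can be sketched as follows. This shows only the arithmetic of taking differences, not the byte-level encoding of the difference algorithm; `to_deltas` and `from_deltas` are hypothetical helper names, and the positions 3 and 7 are the ones used for "ア" in Example 1.

```python
from typing import List, Tuple


def to_deltas(positions: List[int]) -> Tuple[int, List[int]]:
    """Split ascending positions into (first position, gaps between neighbours)."""
    if not positions:
        raise ValueError("at least one position is required")
    deltas = [cur - prev for prev, cur in zip(positions, positions[1:])]
    return positions[0], deltas


def from_deltas(first: int, deltas: List[int]) -> List[int]:
    """Rebuild the absolute positions from the first position and the gaps."""
    positions = [first]
    for d in deltas:
        positions.append(positions[-1] + d)
    return positions
```

For the positions `[3, 7]`, `to_deltas` returns the first position 3 and the single gap 4, and `from_deltas(3, [4])` restores the original list.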

[0087] In step 418, it is checked whether the remaining part of the reserved space in the RAM 3 is sufficient to store the difference value from step 416. If the reserved space is full, steps 420, 422 and 424 are executed, so that all character data in the reserved space is written to the character position information of the database 104. If the reserved space is not full, the flow proceeds to step 424.

[0088] In step 420, all character position information held in the reserved space of the RAM 3 is stored in the character position information of the database 104. This includes the number of the document category to which the character belongs, the corresponding internal code, the database block that stores the difference values, and the minimum and maximum positions stored in the character's database block.

[0089] In step 422, once all character position information in the reserved space of the RAM 3 has been successfully stored in the character position information of the database 104, the reserved space is reinitialized so that the position information of each character in the document being recorded can continue to be stored.

[0090] In step 424, the character position or difference value obtained in step 416 is stored in the reserved space of the RAM 3. The process then returns to step 414 to fetch the next character in the document being recorded.
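
The interplay of steps 418 to 424 (check for room, flush to the database when full, reinitialize, then store) can be sketched as a small staging buffer. This is a simplified illustration: the class `ReservedSpace`, the byte-count capacity, and the `flush` callback that stands in for the write to the database 104 are all assumptions, and a payload larger than the whole buffer is not handled.

```python
from typing import Callable, Dict


class ReservedSpace:
    """Fixed-capacity staging buffer for position data (steps 418-424)."""

    def __init__(self, capacity: int, flush: Callable[[Dict[int, bytearray]], None]):
        self.capacity = capacity  # size of the reserved space in bytes
        self.flush = flush        # hypothetical stand-in for writing to database 104
        self.data: Dict[int, bytearray] = {}  # internal code -> pending bytes
        self.used = 0

    def store(self, code: int, payload: bytes) -> None:
        # Step 418: is the remaining space sufficient for this payload?
        if self.used + len(payload) > self.capacity:
            self.flush(dict(self.data))  # step 420: write everything out
            self.data.clear()            # step 422: reinitialize the space
            self.used = 0
        # Step 424: keep the position or difference value in the reserved space.
        self.data.setdefault(code, bytearray()).extend(payload)
        self.used += len(payload)


flushed = []
space = ReservedSpace(capacity=4, flush=flushed.append)
space.store(283, b"\x00\x03")
space.store(341, b"\x00\x04")
space.store(347, b"\x00\x05")  # does not fit, so the first two are flushed
```

After the third call, `flushed` holds one batch containing the data for internal codes 283 and 341, and the buffer holds only the data for 347.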

[0091] In step 426, when all characters in the document being recorded have been processed, the character position of the last character in the document is stored in the document information of the database 104. That is, the end position of the document in the document information of the database 104 is updated with the character position of the document's last character.

[0092] In step 428, the document information sharing conversion unit 107 is used to store the information of the recorded document in the shared memory.

[0093] In step 430, when recording of the new document is finished, the character information in the database 104 is updated with the new character information of each character: the number of the document category to which the character belongs, the corresponding internal code, the maximum character position of the character in that document category, the minimum position of the final database block storing the character's position data, and the number of bytes of character position data stored in that final database block.

[0094] In step 432, the final position of the document category stored in the document category information of the database 104 is updated using the character position of the last character in the last document just recorded.

[0095] Example 1: On the WINDOWS platform, the document index generation unit 105 is used to record the document (text file) America1.txt in the database 104 and to create an index for each character in the document. The content of the document is as follows:

米国アメリカ アメリカ合衆国

This document consists of 14 characters: five kanji (four distinct), eight katakana (four distinct), and one blank. The characters "米", "合" and "衆" and the blank each appear once in the document. "国", "ア", "メ", "リ" and "カ" each appear twice.

[0096] Assume that the document category to which this document belongs is named the news category, that its document category number is 1, and that it is a new document category in which no document has yet been recorded. In this document category, the document number of the above document is 1, the document name is America1, and the publication date is 1999.8.10. This document is supplied to the document index generation unit 105 and recorded as the first document in the news document category.

[0097] 1. The entire content of the document, comprising the 14 characters "米国アメリカ アメリカ合衆国", is read into the RAM 3. The document index generation unit 105 stores document category number 1, document number 1, document name America1, publication date 1999.8.10, and the start character position of this document in the news document category in the document information of the news document category in the database 104.

[0098] 2. The eight distinct characters in the document America1.txt are converted, one at a time, from Shift JIS code to the system's internal code. For blanks, the document index generation unit 105 decides according to a predetermined parameter whether to process them. With the parameter's default value, no index is created for blanks. In America1.txt, the seventh character is a blank. In Example 1 the parameter is assumed to be set to its default value, so the document index generation unit does not process the blank, and the character position of the eighth and each subsequent character is reduced by one (see Table 1).

3. The start position of each document category is 1, and positions advance in steps of 1. The character position of each character is determined by the order of the characters in the document category. Table 1 shows the character positions of all characters of the document America1.txt in the news document category. Table 2 shows the correspondence between each character and the first four bytes of its miniblock after processing by the document index generation unit 105.

4. When a character appears twice in the news document category, the difference value between the current character position and the previous character position is calculated by the difference algorithm and stored sequentially in the fifth and subsequent bytes of that character's miniblock. For example, in America1.txt the eighth, ninth, tenth and eleventh characters, "アメリカ", are repeats. The character "ア" is taken as an example below. "ア" appears twice in the document, at character positions 3 and 7 of the news document category. As column 5 of Table 3 shows, the position data stored in the miniblock is 0x0000000304:

Byte order in miniblock:  1     2     3     4     5
Position data:            0x00  0x00  0x00  0x03  0x04

The position data stored in the first four bytes of the miniblock is 0x00000003 in hexadecimal, which is the first character position of this character in the news document category: $(03)_{10} = (03)_{16}$. The fifth byte of the miniblock is 0x04, which is the difference value between the second and first character positions of "ア": $(07 - 03)_{10} = (4)_{16}$. The difference values of the three characters "メ", "リ" and "カ" are calculated in the same way, and for each of them the difference between the second and first character positions is also 4. The first and second character positions of the kanji "国" are 2 and 13, so the difference value is $(13 - 2)_{10} = (11)_{10} = (0B)_{16}$. The difference values of the four katakana "アメリカ" and the kanji "国" are shown in column 5 of Table 3. Bold type marks the differences between Table 3 and Table 2.
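
The position assignment and miniblock contents of Example 1 can be reproduced with a short sketch. It rests on several assumptions: the function names are hypothetical, characters are keyed by themselves rather than by internal code, the four-byte minimum position is written most significant byte first (as the value 0x00000003 above suggests), and every gap is small enough to fit in a single byte, which holds for this document.

```python
from collections import defaultdict
from typing import Dict, List


def assign_positions(text: str, start: int = 1,
                     index_blanks: bool = False) -> Dict[str, List[int]]:
    """Give each character a running position, skipping blanks by default."""
    positions: Dict[str, List[int]] = defaultdict(list)
    pos = start
    for ch in text:
        if ch.isspace() and not index_blanks:
            continue  # default parameter value: blanks are not indexed
        positions[ch].append(pos)
        pos += 1
    return positions


def miniblock_bytes(pos_list: List[int]) -> bytes:
    """First position in 4 bytes, then one byte per gap (small gaps only)."""
    first = pos_list[0].to_bytes(4, "big")
    gaps = bytes(b - a for a, b in zip(pos_list, pos_list[1:]))
    return first + gaps


positions = assign_positions("米国アメリカ アメリカ合衆国")
a_block = miniblock_bytes(positions["ア"])
kuni_block = miniblock_bytes(positions["国"])
```

Here `positions["ア"]` is `[3, 7]` and `a_block.hex()` is `0000000304`, matching the position data quoted for "ア". For "国", the positions are `[2, 13]` and the fifth byte is 0x0B.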

[0099] If a character appears frequently in the documents being recorded, the size of its position data may exceed the size of a miniblock, in which case several miniblocks are needed to store it. Because the document in this example has few characters, the document index generation unit 105 allocates only one miniblock to each of the characters "米", "国", "ア", "メ", "リ", "カ", "合" and "衆" to store its position data (the miniblock's minimum position and difference values).

5. After the above characters have been processed, the related information of all characters is written to the character information and the character position information of the database 104, respectively. The database blocks in the database 104 store the character positions of each character in the news document category. Data is stored in the miniblocks according to their order within the database block. When a miniblock is full, the next miniblock is used, and this continues until all 4,000 miniblocks in the database block are filled with data.

[0100] Taking the character "ア" as an example, its document category number 1, internal code 283, the position data 0x0000000304 stored in the miniblock, and the minimum position 3 and maximum position 7 of the position data stored in the database block are stored in the character position information of the database 104. The document category number 1, internal code 283, final position 7 in the news document category, the minimum position 3 and maximum position 7 of the position data stored in the database block, and the number of bytes occupied in the miniblock by the stored position data are stored in the character information of the database 104.

[0101] 6. The final character position in the document category information and the end position of the recorded document in the document information are updated in the database 104. In Example 1, the final character position in the document category information is updated to 13, and the end position in the document information of America1.txt becomes 13.

[0102] 7. The related information of the news document category is stored in the document category information of the database 104. The related information of the document America1.txt is stored in the document information. The data in columns 2, 3, 4 and 5 of Table 3 is stored in the character position information. The data in columns 2, 3 and 4 of Table 3, together with the length (in bytes) of the column 5 position data stored in the database block, is stored in the character information.

[0103] Example 2: This example assumes that a document is recorded after the document of Example 1 and a number of other documents have been recorded.

[0104] On the WINDOWS platform, in this example the document index generation unit 105 is used to record the document (text file) America2.txt in the database 104 and to create an index for each character in the document. The content of the document is as follows:

アメリカ合衆国 米国アメリカ

This document consists of 14 characters: five kanji (four distinct), eight katakana (four distinct), and one blank. The characters "米", "合" and "衆" and the blank each appear once in the document. "国", "ア", "メ", "リ" and "カ" each appear twice.

[0105] Assume that the document category to which this document belongs is named the news category, that its document category number is 1, and that a number of documents are already recorded in it. Assume also that the final position of characters in this document category is 30491, the document number of the above document is 130011, the document name is America2, and the publication date is 1999.8.11. This document is supplied to the document index generation unit 105 as the first document in the news document category and recorded. In the database 104, the maximum character position of the katakana "リ" in the news document category is 8237, and the final database block storing this character's position data holds 258 bytes of position data. The maximum character position of the character "衆" in the news document category is 1320, and the final database block storing its position data holds 25 bytes of position data. The document is then supplied to the document index generation unit 105 and recorded.

[0106] 1. To find the current final position of the news document category and the maximum character position of each character in that category, the document index generation unit 105 searches the document category information in the database 104. In Example 2, the search shows that the final position of characters in the news document category is 30491 and that the maximum character position of the katakana "リ" is 8237. The database block of the final record entry storing the position data of "リ" holds 258 bytes of position data. The maximum character position of "衆" in the news document category is 1320, and the database block of the final record entry storing the position data of "衆" holds 25 bytes of position data.

[0107] 2. The entire content of the document is read into the RAM 3. The document index generation unit 105 stores the document number, document name, author and publication date in the document information of the database 104. In this example, the system reads the 14 characters "アメリカ合衆国 米国アメリカ". Document number 130011, document name America2, publication date 1999.8.11, and the start character position 30492 of this document in the news document category are stored by the document index generation unit 105 in the document information of the news document category in the database 104.

[0108] 3. The 13 characters in the document America2.txt are converted, one at a time, from Shift JIS code to the corresponding internal code of the system. For blanks, the document index generation unit 105 decides according to a predetermined parameter whether to process them. As in Example 1, the parameter is assumed to be set to its default value, so no index is created for blanks. In America2.txt, the eighth character is a blank. The document index generation unit does not process the blank, so the character position of the ninth and each subsequent character is reduced by one (see Table 4).

4. The final position of the document category plus 1 is used as the start position of the new document to be recorded, and positions advance in steps of 1. The character position of each character is determined by the order of the characters in the document category. In this example, the start position of America2.txt is 30492. Table 4 shows the character positions of all characters in America2.txt.

[0109] 5. The difference value of a character is calculated by the difference algorithm from the character's maximum position in the document category before the new document was recorded (see item 1 of this example) and the character's current character position. If the current difference value of a character requires several bytes and the bytes remaining in the current miniblock of 260 bytes (4 + 256 bytes) are not sufficient for it, the system fills the remaining bytes of the current miniblock with 0x00 and starts a new miniblock. The current position of the character in the document category is stored in the first four bytes of the new miniblock as its minimum position. If the character appears again, the difference value between each character position and the preceding one is stored in the fifth and subsequent bytes. The current character position is stored as the character's maximum position in the document category. When a miniblock is filled with position data, a new miniblock is used and the character's current character position is stored in it as its minimum position. This process is repeated until all miniblocks in the database block are filled with data.

In Example 2, the katakana "リ" and the kanji "衆" are taken as examples. As stated above, in the database 104 the maximum position of "リ" in the news document category is 8237, and the final database block storing this katakana's position data holds 258 bytes of position data. As Table 4 shows, when America2.txt is recorded, the character position of the first "リ" in this document is 30494, while the maximum character position of "リ" in the news document category stored in the character position information of the database 104 is 8237. According to the difference algorithm, the difference value between 30494 and 8237 is 0x81B020, which requires three bytes to store. To maintain the integrity of each miniblock in the database block, the 259th and 260th bytes of the first miniblock are filled with 0x00, and the position data of the character occurrences in this document is stored in the second miniblock. The character position 30494 (0x771E) of "リ" in the news document category is stored in the first to fourth bytes of the second miniblock as its minimum position. The second occurrence of "リ" in the document is at character position 30503. The corresponding difference value is 0x09, which is stored in the fifth byte of the second miniblock. See the bold type in the "リ" row of Table 5.

Likewise, as stated above, the maximum position of "衆" in the news document category is 1320, and the first miniblock storing this character's position data holds 25 bytes of position data. "衆" appears once in this document, at character position 30497 in the news document category. From the current character position 30497 and the maximum character position 1320 retrieved from the character position information of the database 104, the difference value is calculated as 0x81E65E, which requires three bytes to store. This difference value is stored in the 26th, 27th and 28th bytes of the miniblock of "衆". See the "衆" row of Table 5. The difference values of the other characters are calculated and stored in the same way as for "リ" and "衆". For the position information of the eight distinct characters in Example 2, see Table 5.

6. After the above characters have been processed, the database 104 is searched for the record of each character in the order of the characters' internal codes. If the record of a given character is found, the character information and character position information in the database 104 are updated with the information of the character just recorded. If it is not found, the information of the character just recorded is stored in the character information and character position information of the database 104. In Example 2, the characters "リ" and "衆" in America2.txt are taken as examples. The character information of "リ" and "衆" is located in the database 104, and the corresponding character information and character position information are then updated. For "リ", when updating the character position information, the document index generation unit 105 updates the last two bytes of the first miniblock and the first to fifth bytes of the second miniblock of the final database block in the database 104, and in the character information of the database 104 the maximum character position of the character in the news document category is updated to 30503. For "衆", when writing the position information to the database 104, the document index generation unit 105 updates the 26th, 27th and 28th bytes of the field storing character positions in the character position information of the database 104, and in the character information the maximum character position of "衆" in the news document category is updated to 30497. The other six characters are updated in the same way as "リ" and "衆".

7. In the database 104, the final position in the document category information and the end position in the document information of the recorded document are updated. In Example 2, in the document category information of the news document category, the final position of all characters in the news document category is updated to 30504, and in the document information the end position of America2.txt is 30504.

【０１１０】Example 3: In this example, a document is recorded after the document of Example 2 and a number of other documents have been recorded.

【０１１１】In this example, on the WINDOWS platform, the document index generation unit 105 records the document (text file) America3.txt in the database 104 and creates an index for each character in the document, which contains 7 characters. The content of the document is: アメリカ合衆国 (United States of America). The document belongs to the document category named news, whose document category number is 1, and multiple documents are already recorded in this category. In this document category, the final character position is 303842975; the document number of this document is 290370, its document name is America3, and its publication time is 2000.5.1. The document is supplied to the document index generation unit 105 and recorded through the steps below. In this example, the character "リ" is taken as an example. In the database 104, the maximum character position of the katakana "リ" in the news document category is 1016947. The position data of this character is stored in multiple database blocks, and the final database block storing it holds 1039997 bytes of position data.

【０１１２】1. The document index generation unit 105 searches the document category information of the news document category in the database 104 to find the current final position in the category, the maximum character position of each character in the category, and the length of the position data stored in each character's final database block. In Example 3, the final character position in the news document category is 303842975, and the maximum character position of the katakana "リ" in the news document category is 1016947.

【０１１３】2. The entire content of the document is read into the memory RAM3. The document index generation unit 105 stores the document number, document name, author and publication time in the document information of the database 104. In this example, the system reads the 7 characters "アメリカ合衆国". The document number 290370, the document name America3, the publication time 2000.5.11, and the start character position 303842976 of this document in the news document category are stored by the document index generation unit 105 in the document information of the news document category in the database 104.

【０１１４】3. The 7 characters in the document America3.txt are converted one by one from Shift JIS code to the corresponding internal code of the system (see columns 2 and 3 of Table 7).
4. The final position of the news document category plus 1 is used as the start position of the new document to be recorded, and positions increase in steps of 1. The character position of each character is determined by the order of the characters in the document category. In this example, the start position of the document America3.txt is 303842976. The character positions of all characters in America3.txt are shown in Table 7.
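The position assignment in step 4 can be illustrated with a minimal sketch. The function name `assign_positions` is hypothetical and is not part of the described system; the input values are those given for Example 3.

```python
def assign_positions(text, category_last_position):
    """Return (character, position) pairs for a newly recorded document.

    The first character receives the category's final position plus 1,
    and each following character receives the next position (step of 1).
    """
    start = category_last_position + 1
    return [(ch, start + offset) for offset, ch in enumerate(text)]


# Values from Example 3: the category's final position is 303842975.
positions = assign_positions("アメリカ合衆国", 303842975)
start_position = positions[0][1]    # start position of America3.txt
end_position = positions[-1][1]     # end position of America3.txt
ri_position = dict(positions)["リ"]  # position of the example character
```

Under these assumptions the start position, the position of "リ" and the end position agree with the values 303842976, 303842978 and 303842982 stated for Example 3.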

【０１１５】5. The difference value of each character is calculated by the difference algorithm from the character's maximum position in the document category before the new document is recorded (see item 1 of this example) and the character's current position. If the current difference value of a character requires multiple bytes and the bytes remaining in the current 260-byte (4 + 256 bytes) mini-block are not sufficient for it, the system fills the remaining bytes of the current mini-block with 0x00 and starts a new mini-block. The character's current position in the document category is stored in the first 4 bytes of the new mini-block as its minimum position. If the character appears again later, the difference value between each character position and the preceding character position is stored in the fifth and subsequent bytes of the new mini-block. When a mini-block is full of position data, a new mini-block is used and the character's current position is stored in it as its minimum position. This process is repeated until all mini-blocks in the database block are filled with data. In Example 3, the katakana "リ" is taken as an example. As stated above, in the database 104 the maximum position of the katakana "リ" in the news document category is 1016947, and the final database block storing its position data holds 1039997 bytes of position data. Calculating 1039997 / 260 shows that this katakana fills 3999 mini-blocks of the final database block, and calculating 1039997 % 260 shows that the 4000th mini-block of the database block holds 257 bytes of the katakana's position data, leaving 3 bytes. In the document America3.txt, the character position of "リ" in the news document category is 303842978, and the maximum character position of the katakana "リ" in the news document category stored in the character position information of the database 104 is 1016947. According to the difference algorithm, the difference value between 303842978 and 1016947 is 0x8194EA9F77, which requires 5 bytes of storage. To preserve the integrity of each mini-block in the database block, the 258th to 260th bytes of the 4000th mini-block are filled with 0x00, and a new database block is then used. The character position 0x121C46A2 of "リ" in the news document category is stored in the first mini-block of the new database block.
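The mini-block arithmetic of this step can be checked with a short sketch. The helper names `miniblock_usage` and `padding_needed` are hypothetical. The sketch only reproduces the byte counting and the padding decision; it does not implement the difference algorithm itself, whose encoding is not specified in this passage.

```python
MINIBLOCK_SIZE = 4 + 256  # 4-byte minimum position plus 256 bytes of difference values


def miniblock_usage(stored_bytes):
    """Return (full mini-blocks, bytes used in the next one, bytes left in it)."""
    full_miniblocks, used_in_last = divmod(stored_bytes, MINIBLOCK_SIZE)
    free_in_last = (MINIBLOCK_SIZE - used_in_last) % MINIBLOCK_SIZE
    return full_miniblocks, used_in_last, free_in_last


def padding_needed(free_bytes, difference_size):
    """Number of 0x00 bytes written when the difference value does not fit."""
    return free_bytes if difference_size > free_bytes else 0


# Values from Example 3: 1039997 bytes already stored, 5-byte difference value.
full, used, free = miniblock_usage(1039997)
pad = padding_needed(free, 5)

# The minimum position written to the new block, as given in hexadecimal.
new_minimum = 0x121C46A2
```

With the figures from Example 3 this gives 3999 full mini-blocks, 257 bytes used and 3 bytes free in the 4000th mini-block, so the 3 remaining bytes are padded. The hexadecimal value 0x121C46A2 equals the decimal position 303842978.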

【０１１６】6. After all characters in the document have been processed, the database 104 is searched for the record of each character in the order of the characters' internal codes. If the record of a given character is found, the character information and character position information in the database 104 are updated with the information of the character just recorded. If the record is not found, the information of the character just recorded is stored in the character information and character position information of the database 104. In Example 3, the character "リ" in the document America3.txt is taken as an example. The character information of "リ" is found in the database 104, and its character information and character position information in the database 104 are then updated. Before the document America3.txt is recorded, 1039997 bytes of position data for the character "リ" are stored in the database block of the character's last record item in the database 104. When the character position information is updated for "リ", the document index generation unit 105 updates the database block of the last record item of the character's character position information in the database 104. The new database block is then stored in the database 104 as a new record item in the character's character position information, and this new record item becomes the last record item in the character position information of the character in the database 104.
7. In the database 104, the final position in the document category information and the end position in the document information of the recorded document are updated. In Example 3, in the document category information of the news document category in the database 104, the final position of all characters in the news document category is updated to 303842982. In the document information, the end position of the document America3.txt is 303842982.

【０１１７】Full-text search processing
The present invention also provides a full-text search method. In this method, a full-text search for a search word specified by the operator is carried out using the character position information of the document index created according to the present invention.

【０１１８】The full-text search processing using the created document index is described below with reference to the flowcharts of FIGS. 4A and 4B.

【０１１９】In step 502, a document category name is input. For example, a dialog box with an input box is displayed, prompting the operator to enter the document category name.

【０１２０】In step 504, the full-text search engine 106 searches the document category information in the database 104 according to the input document category name. When it finds the document category information of the document category specified by the operator, the full-text search engine 106 obtains the document category number of that category from the document category information.

【０１２１】In step 506, a search word is input. For example, a dialog box with an input box is displayed, prompting the operator to enter the search word.

【０１２２】In step 508, the data initialization process that precedes the full-text search is performed. This process includes: defining the search word match position, for example by setting its initial value to 1, that is, taking the position of the first occurrence of the first character of the search word in the specified document category to be 1; obtaining the number of characters in the search word; converting each character of the search word into internal code; assigning each character a displacement according to its order in the search word, for example, with the first character of the search word "米国アメリカ" as the starting point, the displacement of "米" is 0, that of "国" is -1, that of "ア" is -2, that of "メ" is -3, that of "リ" is -4 and that of "カ" is -5; constructing a database query statement; and initializing the result set.
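The displacement assignment in step 508 amounts to negating each character's zero-based offset within the search word. A minimal sketch, in which the function name `displacements` is hypothetical:

```python
def displacements(search_word):
    """Pair each character with its displacement relative to the first character.

    The first character has displacement 0, the second -1, and so on, so that
    (character position + displacement) gives the position where the whole
    search word would have to start.
    """
    return [(ch, -index) for index, ch in enumerate(search_word)]


result = displacements("米国アメリカ")
```

Adding a character's displacement to one of its positions maps that position back to a candidate start position of the search word, which is what the later comparison steps rely on.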

【０１２３】In step 510, the database query statement constructed in step 506 is submitted to the database 104 and a database search is performed. All record items of the position information of each character of the search word are retrieved. These records form a record set, which contains the position information records of each character in the database 104. Each record item includes a field holding the database block that stores the character position data.

【０１２４】In step 512, it is determined whether the record set contains a record for every character of the search word. If any character has no record in the record set, the search ends. Otherwise, the process proceeds to step 514.

【０１２５】In step 514, the retrieved records of each character of the search word are sorted, character by character, by the minimum position of their database blocks. Then, for each character, the database block of the first record item, the first mini-block in that database block, and the first character position in that mini-block are set as the current database block, the current mini-block and the current character position, respectively.

【０１２６】In step 516, a counter (I) is set, indicating that the restoration/matching process is being performed for the I-th character of the search word. In step 518, the counter serves as the loop control variable governing the restoration/matching of the character position of the I-th character of the search word. The initial value of I is 1, which means that the restoration/matching process starts from the first character of the search word.

【０１２７】In step 518, if I is less than or equal to the number of characters in the search word, the process proceeds to step 520 and the restoration/matching process is performed for the I-th character. If I exceeds the number of characters, a search result has been obtained and stored, and the process proceeds to step 544.

【０１２８】In step 520, the search word match position is compared with the sum of the maximum position of the database block of the current record item of character I of the search word and the displacement of character I.
1. If the search word match position is larger, the database block contains no character position that matches it, and the process proceeds to step 538, where it is determined whether character I has further records in the record set. If it does, the process proceeds to step 540: the next record item of character I is obtained, its database block is set as the current database block, the first mini-block in it as the current mini-block, and the minimum position of that first mini-block as the current character position. If there are no further records, the search for the current search word ends.
2. If the search word match position is smaller, a mini-block in the database block of the current record item may hold a matching character position.
If the search word match position equals the sum, the maximum position of the current database block is a matching character position, and the process proceeds to step 542, where 1 is added to I and it is determined whether the next character matches the current match position.

【０１２９】Before step 522 is described, some explanation is given. In the present invention, a database block may contain multiple mini-blocks. Each mini-block contains a minimum position and multiple difference values, and the position data is stored in ascending order. Consequently, for two consecutive mini-blocks, the minimum position of the second mini-block can be regarded as the maximum position of the first. For example, suppose that in the database block of the character "日" the minimum position of the fifth mini-block is 1000 and the minimum position of the sixth mini-block is 1500. It can then be determined that the maximum character position stored in the fourth mini-block of the database block is less than 1000, and that the maximum character position stored in the fifth mini-block is less than 1500. For the last mini-block in a database block, its maximum position can be taken to be the maximum position of the database block.
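The bound derivation described here can be sketched as follows. The function name `miniblock_upper_bounds` and the tuple layout are assumptions made for illustration, and the block maximum of 2000 is a made-up value; only the minimum positions 1000 and 1500 come from the text.

```python
def miniblock_upper_bounds(minimum_positions, block_maximum):
    """Return one (bound, inclusive) pair per mini-block.

    For every mini-block except the last, the bound is the minimum position of
    the following mini-block and is exclusive (stored positions are smaller).
    For the last mini-block the bound is the database block's maximum position
    and is inclusive.
    """
    bounds = []
    for index in range(len(minimum_positions)):
        if index + 1 < len(minimum_positions):
            bounds.append((minimum_positions[index + 1], False))
        else:
            bounds.append((block_maximum, True))
    return bounds


# Minimum positions of two consecutive mini-blocks taken from the text;
# the block maximum 2000 is hypothetical.
bounds = miniblock_upper_bounds([1000, 1500], 2000)
```

Because only the minimum positions need to be read, a search can rule out a whole mini-block without restoring any of its difference values.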

【０１３０】In step 522, the search word match position is compared with the sum of the maximum position of the current mini-block of character I of the search word and the displacement of character I.
1. If the search word match position is larger, the current mini-block contains no character position that matches it, and the process proceeds to step 534, where it is determined whether the current record item has a next mini-block. If it does, the process proceeds to step 536: the next mini-block is obtained and set as the current mini-block, and its minimum position is set as the current character position. If there is none, the process proceeds to step 538.
2. If the search word match position is smaller, the current mini-block may hold a matching character position, and the process proceeds to step 524, where the position data of the current mini-block is examined.
If the search word match position equals the sum, the maximum position of the current mini-block is a matching character position, and the process proceeds to step 542, where 1 is added to I and it is determined whether the next character matches the current match position.

【０１３１】In step 524, the search word match position is compared with the sum of the current restored character position of character I of the search word and the displacement of character I.
1. If the search word match position is larger, the process proceeds to step 530, where it is determined whether the current mini-block has a next difference value. If it does, the process proceeds to step 532 and the next difference value is obtained. The difference algorithm is applied to this difference value and the character's current character position to obtain the character's new current character position in the document category. If there is no further difference value, the process proceeds to step 534.
2. If the search word match position is smaller, the process proceeds to step 526, where the search word match position is reset to the sum of the current restored character position of character I and the displacement of character I, and then to step 528. In step 528, I is set to 1 and the process returns to step 518, where character positions matching the new search word match position are searched for starting from the first character of the search word.
If the search word match position equals the sum, the current restored character position is a matching character position, and the process proceeds to step 542, where 1 is added to I and it is determined whether the next character matches the current search word match position.

【０１３２】In step 544, when step 518 has determined that I exceeds the number of characters in the search word, the current restored character position of every character of the search word agrees with the current search word match position; that is, an occurrence of the search word has been found in the current document category. The document information sharing conversion unit then determines which document contains the current search word match position. The deletion flag is also checked to determine whether that document has been deleted. If the document has been deleted, an occurrence of the search word in the deleted document must not be a search result. If the document has not been deleted, its document number is obtained.
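One way to resolve a match position to a document is a binary search over the documents' start positions, followed by a check of the end position and the deletion flag. The sketch below is an assumption about how this could be done, not the described conversion unit itself; the tuple layout and the function name `document_for_position` are hypothetical. The first document entry is made up for illustration, while the second uses the start position, end position and document number given for America3.txt.

```python
from bisect import bisect_right


def document_for_position(documents, position):
    """Return the document number containing position, or None.

    documents is a list of (start, end, number, deleted) tuples sorted by
    start position. Deleted documents never yield a result.
    """
    starts = [doc[0] for doc in documents]
    index = bisect_right(starts, position) - 1  # last document starting at or before position
    if index < 0:
        return None
    start, end, number, deleted = documents[index]
    if position > end or deleted:
        return None
    return number


documents = [
    (1, 100, 1, True),                          # hypothetical deleted document
    (303842976, 303842982, 290370, False),      # America3.txt (values from Example 3)
]
found = document_for_position(documents, 303842978)
in_deleted = document_for_position(documents, 50)
```

Here `found` is the document number of America3.txt, and `in_deleted` is `None` because the position falls inside a document whose deletion flag is set.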

【０１３３】In step 546, the obtained document number is stored in the search result set.

【０１３４】In step 548, the search word match position is updated: 1 is added to the current search word match position to give the new search word match position. The process returns to step 516, where I is set to 1, and then proceeds to step 518, where character positions matching the new search word match position are searched for starting from the first character of the search word.
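The control flow of steps 516 to 548 can be sketched in simplified form. The sketch assumes that each character's positions are already restored into a plain ascending list, so the database block and mini-block levels (steps 520 and 522) and the difference decoding are omitted. The names `build_positions` and `find_matches` are hypothetical, and the counter is zero-based here rather than starting at 1.

```python
def build_positions(text, start=1):
    """Map each character to the ascending list of its positions in text."""
    index = {}
    for position, ch in enumerate(text, start):
        index.setdefault(ch, []).append(position)
    return index


def find_matches(search_word, positions_by_char):
    """Return every position at which search_word starts."""
    if any(ch not in positions_by_char for ch in search_word):
        return []                      # step 512: a character has no record
    cursors = [0] * len(search_word)   # current index into each character's list
    match_pos = 1                      # step 508: initial match position
    matches = []
    i = 0                              # step 516: character counter
    while True:
        if i == len(search_word):      # step 544: every character agrees
            matches.append(match_pos)
            match_pos += 1             # step 548: move past this match
            i = 0
            continue
        plist = positions_by_char[search_word[i]]
        # Steps 530/532: advance while position + displacement < match position.
        while cursors[i] < len(plist) and plist[cursors[i]] - i < match_pos:
            cursors[i] += 1
        if cursors[i] == len(plist):
            return matches             # no positions left: the search ends
        candidate = plist[cursors[i]] - i
        if candidate == match_pos:
            i += 1                     # step 542: check the next character
        else:
            match_pos = candidate      # step 526: reset the match position
            i = 0                      # step 528: restart from the first character


text = "米国とアメリカ、米国アメリカ。米国アメリカ"
matches = find_matches("米国アメリカ", build_positions(text))
```

For this made-up text the search word starts at positions 9 and 16. Because the match position only ever increases and each cursor only moves forward, every position list is scanned at most once.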

【０１３５】Example 4: The full-text search engine 106 performs a full-text search on the documents stored in the database 104.

【０１３６】In step 502, a document category name is input. For example, a dialog box with an input box is displayed, prompting the operator to enter the document category name "news".

【０１３７】In step 504, the full-text search engine 106 searches the document category information in the database 104 according to the input document category name. When it finds the document category information of the news document category, the full-text search engine 106 obtains the document category number 1 of the news document category from the document category information.

【０１３８】In step 506, a search word is input. For example, a dialog box with an input box is displayed, prompting the operator to enter the search word "米国アメリカ".

【０１３９】In step 508, the data initialization process that precedes the full-text search is performed. This process includes: defining the initial value of the search word match position as 1, that is, taking the position of the first occurrence of the first character of the search word in the news document category to be 1; obtaining the number of characters, 6, of the search word "米国アメリカ"; converting each character of the search word into internal code, for example converting the 6 characters from Shift JIS code to the corresponding system internal code (see columns 1 and 2 of Table 9); assigning each character a displacement according to its order in the search word, where the first character of the search word "米国アメリカ" is set as the starting point and the 6 characters are given the displacements 0 for "米", -1 for "国", -2 for "ア", -3 for "メ", -4 for "リ" and -5 for "カ"; writing the input document category name "news" and the internal codes of the 6 characters into a database SQL query statement; and emptying the result set.

【０１４０】In step 510, the database query statement constructed in step 506 is submitted to the database 104 and a database search is performed. All record items of the position information of each character of the search word are retrieved. These records form a record set. Each record item contains multiple fields, one of which is the database block that stores the character position data; that is, one record item corresponds to one database block. The record set contains the position information of each character of the search word stored in the database 104. Table 9 shows some record items of the character position information of the 6 characters "米国アメリカ" in the news document category in the database 104.
In step 512, it is determined whether the record set contains a record for every character of the search word. If any character has no record in the record set, the search ends. Otherwise, the process proceeds to step 514.

【０１４１】In step 514, the retrieved records of each character of the search word are sorted, character by character, by the minimum position of their database blocks. For example, the minimum positions of the database blocks of the three record items of "リ" are 5, 30494 and 303842978, respectively.

【０１４２】In step 516, the initial value of the counter (I) is set to 1. This means that the restoration/matching process starts from the first character of the search word, "米".

【０１４３】In step 518, since I = 1 < 6, the process proceeds to step 520. The search word match position, initialized to 1, is compared with the sum of the maximum position 30499 of the database block of the character "米" and the displacement 0 of "米" (see columns 1, 4 and 5 of Table 10). Since 1 < 30499 + 0, the database block may contain a character position of "米" that matches the current search word match position 1, that is, a character position whose sum with the displacement 0 equals the current search word match position. Let X be the minimum position of the second mini-block in the database block. The search word match position 1 is compared with the sum of the maximum position X of the first mini-block of the database block (the minimum position of the second mini-block) and the displacement 0 of "米", and it is determined that a character position of "米" equal to the current search word match position 1 may exist in the first mini-block. Next, the current search word match position 1 is compared with the sum of the minimum position 1 of the first mini-block and the displacement 0. The two are equal, so the counter is incremented by 1, giving I = 2. The process returns to step 518, and it is determined whether the database block of the character "国" has a character position whose sum with the displacement of "国" equals the current search word match position 1.

【０１４４】Now I = 2, and the displacement of "国" is -1. The maximum position of the database block of the first record item of this character is 303842982 (see columns 1, 4 and 5 of Table 9). The minimum position of the first mini-block is 2. Let Y be the maximum position of the first mini-block. First, the sum of the maximum position of the database block of the record item and the displacement of the character is calculated as 303842981 (303842982 - 1 = 303842981). This sum is compared with the current search word matching position 1, and it is determined that the database block may contain a character position that matches the current search word matching position. The search word matching position 1 is then compared with the sum (Y - 1) of the maximum position of the first mini-block of the database block (the minimum position of the second mini-block) and the displacement, and it is determined that a character position matching the current search word matching position may exist in the first mini-block of the database block. Finally, the search word matching position 1 is compared with the sum 1 (2 - 1 = 1) of the minimum position 2 of the first mini-block and the displacement -1. The two are equal, which means that a character position of the second character "国" of the search word matching the current search word matching position 1 has been found. The counter is incremented by 1.
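
The comparisons walked through for "米" and "国" can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names `may_contain` and `is_match` are hypothetical, and the range test is assumed to be "matching position ≤ upper bound + displacement".

```python
def may_contain(match_pos: int, upper_bound: int, displacement: int) -> bool:
    """Cheap range test applied before a block or mini-block is scanned.

    upper_bound is the largest position the block (or mini-block) can hold.
    Assumption: the block is worth scanning when
    match_pos <= upper_bound + displacement.
    """
    return match_pos <= upper_bound + displacement


def is_match(match_pos: int, char_pos: int, displacement: int) -> bool:
    """A character position matches when position + displacement == match_pos."""
    return char_pos + displacement == match_pos


# Figures taken from the walkthrough for the first two characters.
print(may_contain(1, 30499, 0))       # database block of the first character
print(may_contain(1, 303842982, -1))  # database block of the second character
print(is_match(1, 1, 0))              # first character stored at position 1
print(is_match(1, 2, -1))             # second character stored at position 2
```

The same two helpers apply at every level: once with the database block's maximum, once with the mini-block's maximum, and finally with an actual character position.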

【０１４５】The matching process for the remaining four characters "アメリカ" is the same as for the characters "米国"; only the displacements differ. When the matching process for "カ" is finished, the counter is 7. The counter value is therefore greater than the number of characters in the search word, 6, which means that the first search result for the search word has been found at matching position 1. The process then proceeds to step 544.

【０１４６】In steps 544 to 548, the document information sharing conversion unit 107 uses the matching position 1 to find the corresponding document number 1. The search result is stored in the result set. The search word matching position is then incremented by 1, giving the new matching position 2 as the current search word matching position. The counter is reset to 1. The search for a result matching the new search word matching position starts from the current character position in the current mini-block of the current database block of each character.

【０１４７】When the search for the next search result starts, the difference value 0x826E is first obtained from the fifth and sixth bytes of the first mini-block of the current database block of the character "米". The most significant bit of each digit other than the units digit is restored to 0; that is, 0x826E is restored to 0x026E, which gives (016E)16 = (366)10. The sum of the previous character position 1 of "米" and 366 is then calculated, yielding the restored character position 367 (1 + 366 = 367). The sum of the restored character position 367 and the displacement 0 is greater than the current search word matching position, so this restored character position does not match the matching position 2. The search word matching position is reset to 367, the sum of the current character position 367 and the displacement 0, and the counter is reset to 1. The search for a result matching the new search word matching position 367 starts from the current character position in the current mini-block of the current database block of each character. This process continues until no unprocessed character position information remains for the characters of the search word, at which point the search for the word "米国アメリカ" ends.
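
The restoration step above (0x826E yields 366, then 1 + 366 = 367) is consistent with a variable-length difference code in which each byte carries seven bits of the value and the top bit is set on every byte except the last (the "units digit"): 0x02 × 128 + 0x6E = 366. The following Python sketch decodes differences under that assumption. The function names `decode_delta` and `restore_positions` are hypothetical, and the exact byte layout of a mini-block is not reproduced here.

```python
from typing import List, Tuple


def decode_delta(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one difference value starting at data[offset].

    Assumed format: 7 payload bits per byte, most significant digit first;
    the top bit is 1 on every byte except the final (units) byte.
    Returns (value, offset of the next unread byte).
    """
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)  # clear the flag bit, append 7 bits
        if byte & 0x80 == 0:                  # units digit reached
            return value, offset


def restore_positions(first_pos: int, deltas: bytes) -> List[int]:
    """Rebuild absolute positions from a first position and packed differences."""
    positions = [first_pos]
    offset = 0
    while offset < len(deltas):
        delta, offset = decode_delta(deltas, offset)
        positions.append(positions[-1] + delta)
    return positions


print(restore_positions(1, bytes([0x82, 0x6E])))  # [1, 367]
```

Storing differences instead of absolute positions keeps most entries to one or two bytes, at the cost of having to decode a mini-block sequentially from its first position.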

【０１４８】Embodiment 5: The full-text search engine 106 performs a full-text search on the documents stored in the database 104. Suppose that the word "アメリカ" is to be searched for in the news document category.

【０１４９】In step 502, a document category name is input. For example, a dialog box with an input box is displayed, prompting the operator to enter the document category name "News".

【０１５０】In step 504, the full-text search engine 106 searches the document category information in the database 104 according to the input document category name. On finding the document category information of the news document category, the full-text search engine 106 obtains from it the document category number 1 of the news document category.

【０１５１】In step 506, a search word is input. For example, a dialog box with an input box is displayed, prompting the operator to enter the search word "アメリカ".

【０１５２】In step 508, data initialization is performed before the full-text search. This process includes: defining the initial value of the search word matching position as 1, that is, taking 1 as the position of the first occurrence of the first character of the search word in the news document category; obtaining the number of characters, 4, of the search word "アメリカ"; converting each character of the search word into an internal code, for example converting the four characters from Shift JIS code into the corresponding system internal codes (see columns 1 and 2 of Table 10); assigning a displacement to each character according to its order in the search word, with the first character of "アメリカ" taken as the starting point, so that the displacement of "ア" is 0, that of "メ" is -1, that of "リ" is -2 and that of "カ" is -3; writing the input document category name "News" and the internal codes of the four characters into a database SQL query statement; and emptying the result set.
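
The displacement assignment in step 508 amounts to numbering the characters of the search word from zero and negating the index. A minimal Python sketch, with the hypothetical helper name `assign_displacements`:

```python
from typing import List, Tuple


def assign_displacements(search_word: str) -> List[Tuple[str, int]]:
    """Pair each character with its displacement relative to the first character.

    The first character gets 0, the second -1, the third -2, and so on, so that
    (position of character) + (displacement) is the position where the whole
    word would have to start.
    """
    return [(ch, -index) for index, ch in enumerate(search_word)]


for ch, disp in assign_displacements("アメリカ"):
    print(ch, disp)
```

With this convention every character of one occurrence of the word maps to the same number, the position of the word's first character, which is what the later comparisons test for.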

【０１５３】In step 510, the database query statement constructed in step 506 is given to the database 104, and a database search is performed. All record items holding position information for each character of the search word are retrieved. These records form a record set. Each record item includes multiple fields, and the database block that stores the character position data is one of those fields; that is, one record item corresponds to one database block. The record set thus contains the position information of each character of the search word stored in the database 104. Table 10 shows some record items of the character position information of the four characters "アメリカ" in the news document category in the database 104. In step 512, it is determined whether the record set contains a record for every character of the search word. If any character has no record in the record set, the search ends. Otherwise, the process proceeds to step 514.

【０１５４】In step 514, the records found for each character of the search word are sorted, character by character, by the minimum position of their database blocks. For example, the minimum positions of the database blocks of the three record items of "リ" are 1005, 30494 and 303842978, respectively.
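
The per-character ordering in step 514 can be sketched as a group-and-sort over the record set. The record layout used here (a dict with `char` and `min_pos` keys) and the function name `sort_records` are assumptions made for illustration; the actual record fields are those of Table 10.

```python
from collections import defaultdict
from typing import Dict, List


def sort_records(records: List[dict]) -> Dict[str, List[dict]]:
    """Group record items by character, then order each group by the
    minimum position of its database block."""
    by_char: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        by_char[record["char"]].append(record)
    for group in by_char.values():
        group.sort(key=lambda r: r["min_pos"])
    return dict(by_char)


# The three minimum positions quoted for the third character, deliberately shuffled.
records = [
    {"char": "リ", "min_pos": 303842978},
    {"char": "リ", "min_pos": 1005},
    {"char": "リ", "min_pos": 30494},
]
print([r["min_pos"] for r in sort_records(records)["リ"]])
```

Ordering the blocks by minimum position lets the matching loop walk each character's positions strictly in ascending order.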

【０１５５】In step 516, the initial value of the counter (I) is set to 1. This indicates that the restoration/matching process starts from the first character "ア" of the search word.

【０１５６】In step 518, since I = 1 < 4, the process proceeds to step 520. The search word matching position, initialized to 1, is compared with the sum of the maximum position 303842976 of the database block of the character "ア" and the displacement 0 of "ア" (see columns 1, 4 and 5 of Table 10). Since 1 < 303842976 + 0, the database block may contain a character position of "ア" that matches the current search word matching position 1, that is, a character position whose sum with the displacement 0 equals the current search word matching position 1. Let X1 be the minimum position of the second mini-block in the database block. The search word matching position 1 is compared with the sum of the maximum position X1 of the first mini-block of the database block (the minimum position of the second mini-block) and the displacement 0 of "ア". It is determined that a character position of "ア" equal to the current search word matching position 1 may exist in the first mini-block. Next, the current search word matching position 1 is compared with the sum of the minimum position 1003 of the first mini-block and the displacement 0. This sum is greater than the current search word matching position 1. At this point the current character position of "ア" is 1003. In step 526, the search word matching position is set to the sum of the current character position 1003 of "ア" and the displacement 0, and the counter is set to 1. The process returns to step 518, and the matching process is performed again on the current character position of the first character "ア" using the new search word matching position 1003. The search word matching position 1003 is found to equal the sum of the current character position 1003 of the character and its displacement. I is incremented by 1.

【０１５７】Now I = 2, and the displacement of "メ" is -1. The maximum position of the database block of the first record item of this character is 303842977 (see columns 1, 4 and 5 of Table 10). The minimum position of the first mini-block is 1004. Let Y2 be the maximum position of the first mini-block. First, the sum of the maximum position of the database block of the record item and the displacement -1 of the character is calculated as 303842976 (303842977 - 1 = 303842976). This sum is compared with the current search word matching position 1003, and it is determined that the database block may contain a character position that matches the current search word matching position. The search word matching position 1003 is then compared with the sum (Y2 - 1) of the maximum position of the first mini-block of the database block (the minimum position of the second mini-block) and the displacement, and it is determined that a character position matching the current search word matching position 1003 may exist in the first mini-block of the database block. Finally, the search word matching position 1003 is compared with the sum 1003 (1004 - 1 = 1003) of the minimum position 1004 of the first mini-block and the displacement -1. The two are equal, which means that a character position of the second character "メ" of the search word matching the current search word matching position 1003 has been found. The counter is incremented by 1.

【０１５８】Now I = 3. A character position of the third character "リ" that matches the search word matching position 1003 is searched for. The displacement of "リ" is -2. The maximum position of the current database block is 1320 (see columns 1, 4 and 5 of Table 10). The minimum position of the first mini-block is 1005. Let X3 be the maximum position of the first mini-block. First, the sum of the maximum position of the database block of the record item and the displacement -2 of the character is calculated as 1318 (1320 - 2 = 1318). This sum is compared with the current search word matching position 1003, and it is determined that the database block may contain a character position that matches the current search word matching position. The search word matching position 1003 is then compared with the sum (X3 - 2) of the maximum position of the first mini-block of the database block (the minimum position of the second mini-block) and the displacement -2, and it is determined that a character position matching the current search word matching position 1003 may exist in the first mini-block of the database block. Finally, the search word matching position 1003 is compared with the sum 1003 (1005 - 2 = 1003) of the minimum position 1005 of the first mini-block and the displacement -2. The two are equal, which means that a character position of the third character "リ" of the search word matching the current search word matching position 1003 has been found. The counter is incremented by 1.

【０１５９】Now I = 4. A character position of the fourth character "カ" that matches the search word matching position 1003 is searched for. The displacement of "カ" is -3. The maximum position of the current database block is 303842979 (see columns 1, 4 and 5 of Table 10). The minimum position of the first mini-block is 1006. Let X4 be the maximum position of the first mini-block. First, the sum of the maximum position of the database block of the record item and the displacement -3 of the character is calculated as 303842976 (303842979 - 3 = 303842976). This sum is compared with the current search word matching position 1003, and it is determined that the database block may contain a character position that matches the current search word matching position 1003. The search word matching position 1003 is then compared with the sum (X4 - 3) of the maximum position of the first mini-block of the database block (the minimum position of the second mini-block) and the displacement -3, and it is determined that a character position matching the current search word matching position 1003 may exist in the first mini-block of the database block. Finally, the search word matching position 1003 is compared with the sum 1003 (1006 - 3 = 1003) of the minimum position 1006 of the first mini-block and the displacement -3. The two are equal, which means that a character position of the fourth character "カ" of the search word matching the current search word matching position 1003 has been found. The counter is incremented by 1.

【０１６０】Now I = 5. In step 518, the counter value is greater than the number of characters in the search word, so it is determined that a search result has been found at the matching position 1003. In steps 544 to 548, the document information sharing conversion unit 107 uses the matching position 1003 to find the corresponding document number. The search result is stored in the result set. The search word matching position 1003 is then incremented by 1, giving the new matching position 1004 as the current search word matching position. The counter is reset to 1. The current character positions of the characters of the word "アメリカ" are 1003, 1004, 1005 and 1006, respectively. The search for a new search result matching the new search word matching position starts from the current character position in the current mini-block of the current database block of each character.

【０１６１】When searching for the next search result, the difference value 0x04 is first obtained from the fifth byte of the first mini-block of the current database block of the character "ア". Since this difference value has only one digit, it is added directly to the previous character position 1003, yielding the restored character position 1007 (1003 + 4 = 1007). The sum of the restored character position 1007 of "ア" and the displacement 0 is greater than the current search word matching position 1004. The search word matching position is therefore reset to 1007, the sum of the current character position 1007 and the displacement 0, and the counter is reset to 1. The search for a matching character position of "ア" is restarted. The search word matching position 1007 now equals the current character position of "ア", and I changes from 1 to 2.

【０１６２】The character positions of the remaining three characters "メリカ" that match the current search word matching position 1007 are then searched for in turn. If all characters match the matching position, then in steps 544 to 548 the document information sharing conversion unit 107 uses the obtained matching position to find the corresponding document number, and the new search result is stored in the result set. The above process is then repeated from step 516. When no unprocessed character position information remains for the characters of the search word, the search for the word "アメリカ" ends.
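
The overall matching loop of Embodiment 5 can be sketched in Python as below. This is a simplified model under stated assumptions, not the patented implementation: each character's positions are held as a plain sorted list rather than as database blocks and mini-blocks, the block-level range checks and difference decoding are omitted, and the name `find_matches` is hypothetical. The positions for "メ", "リ" and "カ" beyond the first occurrence are made up for the demonstration.

```python
from typing import Dict, List


def find_matches(word: str, positions: Dict[str, List[int]]) -> List[int]:
    """Return every position at which `word` starts.

    positions maps each character to its ascending list of positions.
    """
    lists = [positions.get(ch, []) for ch in word]
    if not word or any(not lst for lst in lists):
        return []                      # step 512: some character has no record
    disp = [-i for i in range(len(word))]
    cursor = [0] * len(word)           # current character position per character
    results: List[int] = []
    match_pos = 1                      # search word matching position
    i = 0                              # counter I, zero-based here
    while True:
        if i == len(word):             # all characters agreed on match_pos
            results.append(match_pos)
            match_pos += 1
            i = 0
            continue
        lst = lists[i]
        # Skip positions that are already too small to match.
        while cursor[i] < len(lst) and lst[cursor[i]] + disp[i] < match_pos:
            cursor[i] += 1
        if cursor[i] == len(lst):      # no unprocessed positions remain
            return results
        candidate = lst[cursor[i]] + disp[i]
        if candidate == match_pos:
            i += 1                     # this character matches; try the next one
        else:
            match_pos = candidate      # step 526: jump ahead and restart at I = 1
            i = 0


index = {
    "ア": [1003, 1007],
    "メ": [1004, 1008],
    "リ": [1005, 1009],
    "カ": [1006, 1010],
}
print(find_matches("アメリカ", index))  # [1003, 1007]
```

Because the matching position only ever moves forward and each cursor only advances, every stored position is visited at most once per character of the search word.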

【０１６３】Document Information Sharing Conversion Unit
The present invention also provides a method for quickly obtaining the corresponding document information from a specified character position. Taking advantage of the fast access offered by cache memory, the method stores a backup copy of part of the document information of the database 104 in shared memory. When a full-text search is executed, a bisection (binary search) algorithm quickly and accurately obtains the corresponding document numbers from the one or more matching positions found during the full-text search. Each time a document is recorded or deleted, the document information sharing conversion unit 107 is invoked so that the backup copy of the document information is updated promptly.

【０１６４】The process of the document information sharing conversion unit is described below with reference to FIG. 5.

【０１６５】1. According to the input document category number, the document information sharing conversion unit 107 first checks whether the document information of the specified document category is stored in the shared memory 604. If it is not, a block of shared memory is requested from the system, and the document information in the database 104 is searched. The data items of the document information that indicate the document number and position range of each document (document number, start and end positions of the document, deletion flag, and so on) are read into the shared memory 604 for all documents of the specified document category in the database 104. They remain resident in the shared memory, in the order in which the documents were recorded, so that multiple users can use them to search the specified document category. The other data items of the document information, such as author, title, recording date and time, and deletion date and time, are stored in the database 104 as list-format information. Once the shared memory 604 holds the document information of the specified document category, that information can be read directly from the shared memory 604 without accessing the database 104. This reduces the input/output frequency of the database 104, shortens database access time, and speeds up database queries.

【０１６６】2. When a full-text search is executed, the position information 606, a one-dimensional array storing one or more matching positions, is given to the document information sharing conversion unit 107 as an input parameter. Using the bisection algorithm, the document information sharing conversion unit 107 compares each matching position with the range of each document in the shared memory 604 (the start and end positions of the document) to determine the document that contains the matching position. The deletion flag of that document is then checked. If the document has been deleted, the deletion flag is 1 and the value returned for this document is -1. If the document has not been deleted, the document information sharing conversion unit 107 outputs the corresponding document number that was found. Finally, all document numbers converted from the matching positions are stored in the document information 608, a one-dimensional array storing one or more document numbers. The order in which document numbers are stored in the document information 608 corresponds exactly to the order of the matching positions in the position information 606.

【０１６７】3. When a new document is recorded and its character index is created, the document index generation unit 105 provides an interface for promptly adding the information of the new document to the shared memory 604.

【０１６８】4. When a document is deleted, the document index generation unit 105 provides an interface for promptly setting the deletion flag of the deleted document in the shared memory 604 to 1 (indicating that the document has been deleted).
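
The caching behaviour described in items 1, 3 and 4 can be sketched as a small in-process cache. This is a hedged illustration only: a Python dictionary stands in for the shared memory 604, the class and method names (`DocumentInfoCache`, `get`, `add_document`, `mark_deleted`) are hypothetical, and it is assumed that newly recorded documents always receive positions after all existing ones, so that appending preserves the order.

```python
from typing import Callable, Dict, List, NamedTuple


class DocInfo(NamedTuple):
    doc_no: int
    start: int
    end: int
    deleted: bool


class DocumentInfoCache:
    """In-process stand-in for the shared memory 604 (hypothetical design)."""

    def __init__(self, load_from_db: Callable[[int], List[DocInfo]]):
        self._load_from_db = load_from_db
        self._by_category: Dict[int, List[DocInfo]] = {}

    def get(self, category_no: int) -> List[DocInfo]:
        # Item 1: read a category from the database only on first use.
        if category_no not in self._by_category:
            docs = sorted(self._load_from_db(category_no), key=lambda d: d.start)
            self._by_category[category_no] = docs
        return self._by_category[category_no]

    def add_document(self, category_no: int, doc: DocInfo) -> None:
        # Item 3: append the new document (assumed to follow all existing ones).
        self.get(category_no).append(doc)

    def mark_deleted(self, category_no: int, doc_no: int) -> None:
        # Item 4: set the deletion flag instead of removing the entry.
        docs = self.get(category_no)
        for i, doc in enumerate(docs):
            if doc.doc_no == doc_no:
                docs[i] = doc._replace(deleted=True)
```

Keeping deleted documents in place, flagged rather than removed, leaves the position ranges contiguous, so a position that falls inside a deleted document can still be recognised as such.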

【０１６９】Embodiment 7: In this embodiment, the data in the one-dimensional array storing the matching positions is converted into document information so that the document numbers corresponding to the matching positions can be obtained.

【０１７０】1. According to the input 602 of document category number 1, the document information sharing conversion unit 107 first checks whether the document information of document category number 1 is stored in the shared memory 604. If the document information of the specified document category is not in the shared memory 604, a block of the shared memory 604 is requested from the system. The document information of all documents of document category number 1 is then retrieved from the document information of the database 104. The document number, start position and end position of each document are obtained and read into the shared memory 604, where they remain resident, in the order in which the documents were recorded, so that multiple users can use them to search the specified document category. For details, see FIG. 5.

【０１７１】2. The position information 606, a one-dimensional array storing a plurality of matching positions, is given to the document information sharing conversion unit 107.

【０１７２】3. Using the bisection algorithm, the document information sharing conversion unit 107 compares each matching position, starting from the first matching position in the array, with the range of each document in the shared memory 604 (the start and end positions of the document) to determine the document that contains the matching position. The deletion flag of that document is checked. If the document has been deleted, the deletion flag is 1 and the value returned for this document is -1. If the document has not been deleted, the document information sharing conversion unit 107 outputs the corresponding document number that was found. In Embodiment 7, for the first matching position 1001 in the position information 606, the bisection algorithm finds in the shared memory 604 the corresponding document, whose start position is 998, end position is 1100, document number is 21, and deletion flag is 0, indicating that the document has not been deleted. The document information sharing conversion unit 107 stores the document number 2 of the document corresponding to the matching position 1001 in the document information 608 as the first number. This completes the process of converting one matching position into the corresponding document number. When the matching position 3001 in the position information 606 is processed during search and comparison, the document information sharing conversion unit 107 determines that the matching position 3001 lies in the document whose start position is 2890, end position is 3005, and document number is 41. The deletion flag is then checked and found to be 1, which means that the document has been deleted. The document information sharing conversion unit 107 therefore returns the value -1, meaning that there is no corresponding document for the matching position 3001. The other matching positions in the position information 606 are converted in the same way.

[0173] 4. The document numbers converted from the search word positions in the position information 606 are stored in the document information 608 by the document information sharing conversion unit 107. The document numbers in the document information 608 are stored in the same order as the corresponding matching positions in the position information 606.
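Steps 1 to 4 above can be illustrated with the following minimal sketch. It is not the implementation of the document information sharing conversion unit 107: the names `DocInfo`, `DATABASE`, `SHARED`, `load_category`, and `positions_to_doc_numbers` are hypothetical, an in-process dictionary merely stands in for the shared memory 604, and returning -1 for a position that falls outside every document range is an assumption. Only the two sample documents (range 998 to 1100 with number 21 and deletion flag 0; range 2890 to 3005 with number 41 and deletion flag 1) are taken from the description.

```python
from bisect import bisect_right
from typing import NamedTuple


class DocInfo(NamedTuple):
    number: int   # document number
    start: int    # start position of the document
    end: int      # end position of the document
    deleted: int  # deletion flag: 1 = deleted, 0 = not deleted


# Hypothetical stand-in for the document information in the database 104,
# keyed by document category number.
DATABASE = {1: [DocInfo(21, 998, 1100, 0), DocInfo(41, 2890, 3005, 1)]}

# Hypothetical stand-in for the shared memory 604.
SHARED = {}


def load_category(category):
    """Step 1: read a category into the shared store only if it is absent."""
    if category not in SHARED:
        SHARED[category] = sorted(DATABASE[category], key=lambda d: d.start)
    return SHARED[category]


def positions_to_doc_numbers(category, positions):
    """Steps 2 to 4: convert matching positions to document numbers."""
    docs = load_category(category)
    starts = [d.start for d in docs]
    result = []
    for pos in positions:
        # Binary search for the last document starting at or before pos.
        i = bisect_right(starts, pos) - 1
        # Assumption: a position outside every range also yields -1.
        if i < 0 or pos > docs[i].end or docs[i].deleted == 1:
            result.append(-1)
        else:
            result.append(docs[i].number)
    # The output order follows the order of the input positions (step 4).
    return result
```

Calling `positions_to_doc_numbers(1, [1001, 3001])` looks up position 1001 in the range 998 to 1100 and position 3001 in the range 2890 to 3005, whose document is flagged as deleted.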

[Brief description of drawings]

FIG. 1A is a diagram showing an example of the structure of an index creation and full-text search system.

FIG. 1B is a hardware block diagram of the server of FIG. 1A.

FIG. 2 is a diagram showing the structure of the database blocks and mini blocks that store character position data.

FIG. 3 is a flowchart of the processing of the document index generation unit.

FIG. 4A is a flowchart of the processing of the full-text search engine.

FIG. 4B is a flowchart of the processing of the full-text search engine.

FIG. 5 is a diagram showing the processing of the document information sharing conversion unit.

[Explanation of symbols]

104 ... database, 105 ... document index generation unit, 106 ... full-text search engine, 107 ... document information sharing conversion unit, 604 ... shared memory, 608 ... document information

Continuation of front page
(72) Inventor: Li Hong, c/o Beijing Pecan Information System Corporation Limited, Room 1307, Beijing Resource Hotel, No. 1 Summer Palace Road, Haidian District, Beijing 100080, People's Republic of China
F-term (reference): 5B075 ND02 NK49

Claims

[Claims]

1. A method of creating an index of character information, comprising: a step of determining the position of each character in a group of documents to be searched according to the order of all characters in all the documents; a step of sequentially storing the position data of each character in one or more database blocks corresponding to that character and obtaining the maximum position and the minimum position stored in each database block; and a step of dividing each database block into a plurality of mini blocks, each having a storage space of a plurality of bytes, and obtaining the minimum position stored in each mini block.

2. The method, wherein a difference value is obtained between each two consecutive positions among all the positions of the character, and the difference value is stored as position data in a database block.

3. The method, wherein the difference value is calculated by a difference algorithm that converts the difference between the current position and the previous position of the character in the group of documents into a 127-ary number; each digit of the 127-ary number is stored as an 8-bit byte whose most significant bit is 0; and, in order to delimit the individual difference values of the character, the most significant bit of the byte of every digit except the units digit is set to 1, whereby the difference value corresponding to the current position of the character is obtained.

4. The method of claim 1, wherein, when the characters of the group of documents are indexed and the remaining bytes of the current mini block of the character are not sufficient for a new difference value, each of the remaining bytes is filled with 0x00, a new mini block is used, and the current position of the character in the group of documents is stored in the first byte of the new mini block as the minimum position of the new mini block.
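The difference encoding recited in claims 2 to 4 can be illustrated with the following minimal sketch. Several details are assumptions, because the claims do not fix them: digits are taken to range from 0 to 126, the most significant digit is written first, and a mini block is modelled simply as its minimum position plus a byte payload. The function names `encode_diff` and `decode_positions` are hypothetical.

```python
def encode_diff(diff):
    """Encode a positive difference as 127-ary digits, one digit per byte.

    Assumptions: digits range over 0..126 and the most significant digit
    comes first. Every byte except the units digit has its top bit set,
    so the units digit (top bit 0) marks the end of one difference value.
    """
    if diff <= 0:
        raise ValueError("consecutive positions must strictly increase")
    digits = []
    while True:
        diff, digit = divmod(diff, 127)
        digits.append(digit)
        if diff == 0:
            break
    digits.reverse()
    return bytes([d | 0x80 for d in digits[:-1]] + [digits[-1]])


def decode_positions(min_position, payload):
    """Rebuild absolute positions from a mini block's minimum and payload."""
    positions = [min_position]
    value = 0
    in_value = False  # True while inside a multi-byte difference value
    for byte in payload:
        if byte & 0x80:
            # Higher-order digit: accumulate and keep reading.
            value = value * 127 + (byte & 0x7F)
            in_value = True
        elif byte == 0 and not in_value:
            # A lone 0x00 cannot be a real difference, so it is padding.
            break
        else:
            # Units digit: the difference value is complete.
            value = value * 127 + byte
            positions.append(positions[-1] + value)
            value = 0
            in_value = False
    return positions
```

Under these assumptions a 0x00 byte at the start of a value is unambiguous as padding, because two occurrences of the same character never share a position and a difference of zero therefore never has to be stored.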

5. A storage medium storing a program for executing: a process of determining the position of each character in a group of documents to be searched according to the order of all characters in all the documents; a process of sequentially storing the position data of each character in one or more database blocks corresponding to that character and obtaining the maximum position and the minimum position stored in each database block; and a process of dividing each database block into a plurality of mini blocks, each having a storage space of a plurality of bytes, and obtaining the minimum position stored in each mini block.

6. A computer program executed by a computer to carry out: a process of determining the position of each character in a group of documents to be searched according to the order of all characters in all the documents; a process of sequentially storing the position data of each character in one or more database blocks corresponding to that character and obtaining the maximum position and the minimum position stored in each database block; and a process of dividing each database block into a plurality of mini blocks, each having a storage space of a plurality of bytes, and obtaining the minimum position stored in each mini block.

7. A method for searching character information based on the index of character information created according to claim 1, comprising: a step of obtaining the relative positional relationship between the characters of a search word in order to search the index of each character of the search word; a step of determining whether each database block of each character may contain a position that matches the relative positional relationship; a step of determining, for each database block that may contain a position that matches the relative positional relationship, whether each mini block of that database block may contain a position that matches the relative positional relationship; and a step of determining, for each mini block that may contain a position that matches the relative positional relationship, whether each position in the mini block matches the relative positional relationship.

8. The method according to claim 7, wherein whether a database block may contain a position that matches the relative positional relationship is determined based on the maximum position and the minimum position of that database block.

9. The method according to claim 7, wherein the minimum position of the second of two consecutive mini blocks is regarded as the maximum position of the first mini block, the maximum position of the last mini block of a database block is taken to be the maximum position of that database block, and whether a mini block may contain a position that matches the relative positional relationship is determined based on the maximum position and the minimum position of that mini block.

10. The method according to claim 7, wherein the relative positional relationship between the characters of the search word is set with the first character of the search word as the starting point, and the characters are sequentially given the displacements 0, -1, -2, -3, ..., -(N-1), where N is the number of characters in the search word.
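The search recited in claims 7 to 10 can be illustrated with the following simplified sketch. It is an assumption-laden illustration rather than the claimed procedure: positions are counted from 0, each database block is modelled as a plain sorted list of positions, the mini block level is omitted, and the names `build_index` and `find_word` are hypothetical. The sketch keeps two ideas from the claims: the i-th character of the search word is given the displacement -i, and a block is skipped when its minimum and maximum positions show that it cannot contain a match.

```python
def build_index(text, block_size=2):
    """Map each character to a list of blocks of ascending positions."""
    index = {}
    for pos, ch in enumerate(text):
        blocks = index.setdefault(ch, [[]])
        if len(blocks[-1]) == block_size:
            blocks.append([])  # current block is full, start a new one
        blocks[-1].append(pos)
    return index


def find_word(index, word):
    """Return the start positions at which `word` occurs."""
    if not word:
        return []
    # Candidate starts: every position of the first character
    # (displacement 0).
    candidates = sorted(p for block in index.get(word[0], []) for p in block)
    for i, ch in enumerate(word[1:], start=1):
        if not candidates:
            return []
        displacement = -i  # 0, -1, -2, ..., -(N-1) as in claim 10
        lo, hi = candidates[0], candidates[-1]
        shifted = set()
        for block in index.get(ch, []):
            block_min, block_max = block[0], block[-1]
            # Block-level test using only the minimum and maximum position.
            if block_max + displacement < lo or block_min + displacement > hi:
                continue
            shifted.update(p + displacement for p in block)
        # Keep only the starts confirmed by this character.
        candidates = [c for c in candidates if c in shifted]
    return candidates
```

For example, with `index = build_index("abcabd")`, the call `find_word(index, "abd")` narrows the candidate starts of "a" first by the shifted positions of "b" and then by those of "d".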

11. A storage medium storing a program for executing: a process of obtaining the relative positional relationship between the characters of a search word in order to search the index of each character of the search word; a process of determining whether each database block of each character may contain a position that matches the relative positional relationship; a process of determining, for each database block that may contain a position that matches the relative positional relationship, whether each mini block of that database block may contain a position that matches the relative positional relationship; and a process of determining, for each mini block that may contain a position that matches the relative positional relationship, whether each position in the mini block matches the relative positional relationship.

12. A computer program executed by a computer to carry out: a process of obtaining the relative positional relationship between the characters of a search word in order to search the index of each character of the search word; a process of determining whether each database block of each character may contain a position that matches the relative positional relationship; a process of determining, for each database block that may contain a position that matches the relative positional relationship, whether each mini block of that database block may contain a position that matches the relative positional relationship; and a process of determining, for each mini block that may contain a position that matches the relative positional relationship, whether each position in the mini block matches the relative positional relationship.

13. A method for indexing and searching character information, wherein a shared memory of a certain capacity is reserved; a document information sharing conversion unit is provided that reads, from a database into the shared memory, a part of the fields of the document information of a group of documents to be searched that is stored in the database; and, when a full-text search is performed, the related document information is obtained directly from the shared memory.

14. The method according to claim 13, wherein the read document information includes at least a field indicating a document number and a field indicating a document range.

15. The method according to claim 13, wherein, when a full-text search is performed, the document in which the search word is located is obtained, for one or more search word positions to be searched, by a bisection algorithm according to the position range of each document.

16. The method according to claim 13, wherein the document information in the shared memory is updated in a timely manner when a document index is created.

17. The method according to claim 13, wherein the document information sharing conversion unit provides a corresponding interface for updating the relevant document information in the shared memory in a timely manner when each document is recorded or deleted.

18. A storage medium storing a program for executing: a process of reserving a shared memory of a certain capacity and copying the document information of a group of documents to be searched, which is stored in a database, from the database to the shared memory; and a process of obtaining related document information directly from the shared memory when a full-text search is performed.

19. A computer program executed by a computer to carry out: a process of reserving a shared memory of a certain capacity and copying the document information of a group of documents to be searched, which is stored in a database, from the database to the shared memory; and a process of obtaining related document information directly from the shared memory when a full-text search is performed.

20. A system for indexing and searching character information, comprising: a unit that determines the position of each character in a group of documents to be searched according to the order of all characters in all the documents, sequentially stores the position data of each character in one or more database blocks corresponding to that character, obtains the maximum position and the minimum position stored in each database block, divides each database block into a plurality of mini blocks, each having a storage space of a plurality of bytes, and obtains the minimum position stored in each mini block; a full-text search unit that obtains the relative positional relationship between the characters of a search word in order to search the index of each character of the search word, determines whether each database block of each character may contain a position that matches the relative positional relationship, determines, for each database block that may contain a position that matches the relative positional relationship, whether each mini block of that database block may contain a position that matches the relative positional relationship, and determines, for each mini block that may contain a position that matches the relative positional relationship, whether each position in the mini block matches the relative positional relationship; and a storage unit that stores the index information.

21. A system for indexing and searching character information, comprising document information sharing conversion means that reserves a shared memory of a certain capacity, reads a part of the fields of the document information of a group of documents to be searched, which is stored in a database, from the database into the shared memory, and obtains related document information directly from the shared memory when a full-text search is performed.