JP3475009B2

JP3475009B2 - Information retrieval device

Info

Publication number: JP3475009B2
Application number: JP13019496A
Authority: JP
Inventors: 功難波
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1996-05-24
Filing date: 1996-05-24
Publication date: 2003-12-08
Anticipated expiration: 2016-05-24
Also published as: JPH09311868A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to an information retrieval system, such as a document filing system, that retrieves the information corresponding to a designated key from an enormous amount of information. With the recent development of information processing equipment, instead of storing a vast and diverse body of information as paper documents, large-scale document filing systems are being built that accumulate enormous numbers of documents in a database as digital information. In such databases, the index tables created in advance for keywords and the like have also grown large, and index tables that support full-text search in particular have become very large. As a result, the seek time required to search the index table is no longer negligible, and a technique for efficiently retrieving the relevant data part from the index table is needed in order to shorten the processing time of the search as a whole.

[0002]

[Prior Art] FIG. 7 shows a configuration example of a conventional information retrieval device. In FIG. 7, the index creation unit 410 creates, for each document registered in the document database 401, an index table 402 corresponding to the various search methods, and supplies it to the search processing unit 403.

[0003] The index table 402 consists of a data area 404, which stores the data part corresponding to each key item, and an index unit 405, which stores, for each key item, pointer information indicating where the corresponding data part is stored. The search processing unit 403 can therefore obtain the data part corresponding to the key item designated as the search key by reading that data part from the data area 404 on the basis of the pointer information in the index unit 405.

[0004] For example, when an index table for full-text search is created, the key detection unit 411 of the index creation unit 410 detects key items sequentially from the beginning of the input text data, and the registration processing unit 412 registers them in the index unit 405 and the data area 404 according to the detection result. Methods by which the key detection unit 411 detects key items include detecting words (free terms) in the input text on the basis of a dictionary prepared in advance, detecting keywords specified by the user, and a method called N-gram, which mechanically extracts runs of consecutive characters up to N characters long.
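The N-gram method mentioned above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `extract_ngrams` is hypothetical, and the sketch applies no filtering of whitespace or punctuation, which the source does not specify.

```python
def extract_ngrams(text: str, n: int) -> list[str]:
    """Return every substring of length 1..n that starts at each character.

    Hypothetical helper: the patent only states that runs of up to N
    consecutive characters are extracted mechanically.
    """
    keys = []
    for start in range(len(text)):
        for length in range(1, n + 1):
            if start + length > len(text):
                break  # the run would extend past the end of the text
            keys.append(text[start:start + length])
    return keys


example_keys = extract_ngrams("検索装置", 2)
print(example_keys)  # ['検', '検索', '索', '索装', '装', '装置', '置']
```

Because every character position produces up to N key items, the number of key items grows quickly with the size of the text, which is one reason the resulting index table becomes large.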

[0005] When a key item is input, the registration processing unit 412 first refers to the index unit 405. If the key item is already registered, the unit simply adds the document number of the input text to the data part indicated by the pointer information. If, on the other hand, a key item not yet registered in the index unit 405 is detected, the registration processing unit 412 registers it in the index unit 405 as a new key item, creates a new data part in the data area 404, and stores the document number of the input text there.
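The registration procedure described above can be sketched in Python as follows. This is a minimal in-memory model, not the device itself: the class `IndexTable` and its method names are hypothetical, a list position stands in for the pointer information, and documents are assumed to be registered one at a time in increasing document-number order.

```python
class IndexTable:
    """Toy model of index table 402 (all names here are illustrative)."""

    def __init__(self):
        self.index = {}      # stands in for index unit 405: key item -> pointer
        self.data_area = []  # stands in for data area 404: one data part per key item

    def register(self, key: str, doc_no: int) -> None:
        pointer = self.index.get(key)
        if pointer is None:
            # Unregistered key item: add it and create a new data part.
            self.index[key] = len(self.data_area)
            self.data_area.append([doc_no])
        elif self.data_area[pointer][-1] != doc_no:
            # Registered key item: append the document number once per document
            # (relies on the assumed in-order registration).
            self.data_area[pointer].append(doc_no)

    def lookup(self, key: str) -> list[int]:
        pointer = self.index.get(key)
        return [] if pointer is None else self.data_area[pointer]


table = IndexTable()
documents = [["disk", "seek"], ["seek", "index", "seek"]]
for doc_no, words in enumerate(documents, start=1):
    for word in words:
        table.register(word, doc_no)
```

In this model the data parts end up in the data area in the order in which their key items first appeared, which is the arrangement the following paragraph describes.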

[0006] In this case, the data parts, each consisting of the numbers of the documents that contain the key item, are placed in the data area 404 in the order in which the key items first appear. After the key items and data parts have been registered in the index table 402 in this way, the index unit 405 and the data area 404 are in some cases re-sorted into dictionary order.

[0007] For example, in an index table for a document database of English text, the data parts of the key items are arranged in dictionary order in the data area 404 of the index table 402, to allow for cases in which similar key items, such as words with the same stem and different endings, are designated in succession.

[0008]

[Problems to Be Solved by the Invention] As described above, in the conventional information retrieval device the data parts are arranged in the data area 404 in order of key-item appearance or in dictionary order, regardless of how likely each key item is to be designated as a search key.

[0009] When the number of documents registered in the database is enormous, the document numbers that identify individual documents become very large, so the data part for each key item is long. The data part for a frequently occurring key item is especially long, because it lists the numbers of the many documents that contain that key item.

[0010] For this reason the data area 404 becomes very large, particularly in an index table for full-text search, and reading the data part for a designated search key from the data area 404 requires a long seek by the disk device because the range to be searched is extremely wide.

[0011] In particular, when the data parts are arranged in the order in which their key items appear in the documents, the data parts to be retrieved are spread widely over the whole data area 404, so the range searched for a data part is large and the average seek time is long. Because the search keys specified by users bear no relation to the order of appearance at registration, the seek range for reading the corresponding data part is the entire data area 404, and the processing capacity of the information retrieval device falls markedly as the data area 404 grows.

[0012] When the data parts are instead arranged in dictionary order, the data parts for similar key items can be expected to lie close together in the data area 404. As long as similar words are designated as search keys in succession, the seek range can be confined to part of the data area 404 and the seek time is short. If a completely different search key is designated, however, the whole data area 404 becomes the seek range, just as with arrangement by order of appearance, and a seek time commensurate with the capacity of the data area 404 is required.

[0013] An object of the present invention is to provide an information retrieval device that can limit the range searched for the data part corresponding to a search key.

[0014]

[Means for Solving the Problems] FIG. 1 is a block diagram showing the principle of the information retrieval devices of claims 1 and 2.

[0015] The invention of claim 1 is a full-text search system that, in response to the input of a search key, retrieves the documents containing the search key from all the documents stored in a document database 101, characterized by comprising: word detection means 111 that takes every word that may be used as a search key as a key item, detects the documents containing each key item in the document database 101, and outputs information identifying the detected documents as the document information corresponding to each key item; document information holding means 112 that holds the document information for each key item; counting means 113 that counts, on the basis of the document information held for each key item, the number of documents containing that key item and outputs the count as the indicator for the key item; classification means 114 that classifies the key items into a plurality of groups on the basis of the indicator for each key item; arrangement means 115 that, on the basis of the classification result, arranges the document information held in the document information holding means 112 on the storage medium of the document information holding means 112, group by group according to the group to which the corresponding key item belongs; and search means 116 that, in response to the input of a search key, retrieves and outputs the corresponding document information from the document information holding means 112.

[0016] In the invention of claim 1, the operation of the word detection means 111 and the counting means 113 yields, as a statistical indicator for each key item, the number of documents among all those stored in the document database 101 in which the key item appears. When the classification means 114 classifies the key items according to this indicator, the key items can be divided into a plurality of groups according to, for example, the number of documents in which they appear. The arrangement means 115 then places the document information, on the storage medium (for example a magnetic disk) that stores the document information holding means 112, group by group according to the group of the corresponding key item. This limits the range over which the document information for key items that appear in a similar number of documents is distributed on that storage medium.

[0017] Key items that are contained in almost every document, and key items that are very rare, are seldom designated as search keys; normally, key items of moderate frequency are designated. By controlling the distribution range of the document information according to frequency of appearance as described above, a region can be formed in which the key items of moderate frequency are concentrated, and the range that the search means 116 must search in the document information holding means 112 for the desired document information can be limited.

[0018] The invention of claim 2 is a full-text search system that, in response to the input of a search key, retrieves the documents containing the search key from all the documents stored in a document database 101, characterized by comprising: character string extraction means 121 that extracts consecutive character strings as key items from each document stored in the document database 101 by N-gram and outputs information identifying the document from which each was extracted as the document information corresponding to each key item; document information holding means 112 that holds the document information for each key item; discrimination means 122 that determines the characteristics of the combination of character types making up each key item obtained by the character string extraction means 121 and outputs information indicating the determined characteristics as the indicator for the key item; classification means 114 that classifies the key items into a plurality of groups on the basis of the indicator for each key item; arrangement means 115 that, on the basis of the classification result, arranges the document information held in the document information holding means 112 on the storage medium of the document information holding means 112, group by group according to the group to which the corresponding key item belongs; and search means 116 that, in response to the input of a search key, retrieves and outputs the corresponding document information from the document information holding means 112.

[0019] In the invention of claim 2, the feature extraction means 122 extracts the characteristics of the character-type composition of each key item extracted by the character string extraction means 121, so that these characteristics are obtained as the indicator for each key item. When the classification means 114 classifies the key items on the basis of this indicator, the key items can be divided into a plurality of groups according to the characteristics of their character-type combinations. Supplying this classification result to the arrangement means 115 makes it possible to control, for example, the range over which the document information for key items with similar character-type combinations is distributed on the storage medium that stores the document information holding means 112.

[0020] In this case, if the tendency of the character-type combinations in the search keys that users enter is known, the key items with the relevant characteristics can be placed together in the document information holding means 112, which limits the range that the search means 116 must search in the document information holding means 112 for the desired document information. FIG. 2 is a block diagram showing the principle of the information retrieval device of claim 3.

[0021] The invention of claim 3 is a kana-kanji conversion system in which, in response to the input of the reading of a word, dictionary search means 102 retrieves and provides conversion candidates registered in advance in a dictionary 103, characterized by comprising: reading extraction means 131 that receives a predetermined standard document and extracts the distinct readings contained in it as key items; frequency counting means 132 that counts the number of times each key item appears in the standard document and outputs the count as the indicator for the key item; classification means 114 that classifies the key items into a plurality of groups on the basis of the indicator for each key item; and dictionary arrangement means 133 that, on the basis of the classification result, arranges the conversion candidates held in the dictionary 103 on the storage medium that stores the dictionary 103, group by group according to the group to which the corresponding key item belongs, for use by the dictionary search means 102.

[0022] In the invention of claim 3, the classification means 114 operates with the frequency of each word reading in the standard document, obtained by the reading extraction means 131 and the frequency counting means 132, as the indicator, so that the word readings, which correspond to key items, can be divided into a plurality of groups according to their frequency. On the basis of this classification result, the dictionary arrangement means 133 arranges, on the storage medium in which the dictionary 103 is stored, the conversion candidates for the readings belonging to each group, group by group. The frequency of the readings in the standard document can thereby be reflected in the arrangement of the conversion candidates in the dictionary 103.

[0023] If a sufficiently large document is input to the reading extraction means 131 as the standard document, the frequency counting means 132 yields counts that approximate the frequency with which each word reading appears in an arbitrary document. The arrangement performed by the dictionary arrangement means 133 can therefore place the conversion candidates for frequently occurring readings together in the dictionary 103, which reduces the average range that the dictionary search means 102 must search in the dictionary 103 for conversion candidates.
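The dictionary arrangement of claim 3 can be sketched roughly as below. The sketch rests on several assumptions that are not in the source: the readings of the standard document are taken as already extracted into a list, the key items are split into just two groups by a hypothetical `threshold`, and the frequent group is ordered by descending count. The function name `arrange_dictionary` is likewise illustrative.

```python
from collections import Counter


def arrange_dictionary(dictionary, standard_readings, threshold):
    """Return (reading, candidates) pairs in their new storage order.

    dictionary: reading -> list of conversion candidates (dictionary 103).
    standard_readings: readings extracted from the standard document.
    threshold: assumed cut-off between the frequent and the rare group.
    """
    counts = Counter(standard_readings)  # role of frequency counting means 132
    frequent = [r for r in dictionary if counts[r] >= threshold]
    rare = [r for r in dictionary if counts[r] < threshold]
    frequent.sort(key=lambda r: -counts[r])  # assumed order within the group
    return [(r, dictionary[r]) for r in frequent + rare]


dictionary = {
    "かんじ": ["漢字", "感じ", "幹事"],
    "しーく": ["シーク"],
    "けんさく": ["検索"],
}
standard_readings = ["けんさく", "かんじ", "けんさく", "けんさく", "かんじ"]
layout = arrange_dictionary(dictionary, standard_readings, threshold=2)
```

With this layout, a lookup for a frequently typed reading only has to search the leading part of the stored dictionary, which is the effect the paragraph above attributes to the dictionary arrangement means 133.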

[0024]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 3 is a configuration diagram of a document filing system to which the information retrieval device of the invention of claim 1 is applied. In FIG. 3, the key detection unit 411 and the registration processing unit 412 correspond to the word detection means 111 of claim 1; as in the conventional device, they create the index table 402 for full-text search when documents to be searched are entered into the document database 401, and supply it to the arrangement changing unit 210.

[0025] In the arrangement changing unit 210 shown in FIG. 3, the frequency counting unit 211 corresponds to the counting means 113 of claim 1. It refers to the data area 404, which corresponds to the document information holding means 112, counts the number of document numbers contained in the data part for each key item, and supplies the counts to the classification processing unit 213 through the frequency table 212.

[0026] The frequency table 212 stores, against the key number that identifies each key item, the number of documents containing the key item as that key item's frequency of appearance. In FIG. 3, the classification processing unit 213 corresponds to the classification means 114 of claim 1. It classifies the key items into two groups according to the result of comparing the frequency of appearance stored in the frequency table 212 with a predetermined threshold, and adds to each key item's entry in the frequency table 212 a group flag indicating the group to which it belongs.

[0027] In FIG. 3, the sort processing unit 214 extracts the key items belonging to each group on the basis of the group flag, sorts them by frequency of appearance within each group, and supplies the result to the data arrangement unit 215. The data arrangement unit 215 changes the storage locations in the data area 404 of the data parts for the key items in each group, in accordance with the sort result received from the sort processing unit 214.

[0028] In accordance with the arrangement produced by the data arrangement unit 215, the index update unit 216 updates the contents of the index unit 405 so that the pointer information for each key item reflects the new arrangement. The data arrangement unit 215, operating on the sort result from the sort processing unit 214, thus realizes the function of the arrangement means 115, and the data parts for key items that appear with similar frequency can be placed together in the data area 404.
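The flow through the arrangement changing unit 210 (count, classify, sort, relocate, update pointers) can be sketched as follows. This is an illustrative in-memory model under stated assumptions: the function name `rearrange` is hypothetical, list positions stand in for pointer information, and descending frequency is assumed as the sort order within both groups.

```python
def rearrange(index, data_area, threshold):
    """Relocate data parts group by group and rebuild the pointers.

    index: key item -> position of its data part (index unit 405).
    data_area: list of data parts, each a list of document numbers (data area 404).
    """
    # Frequency counting unit 211: number of documents per key item.
    freq = {key: len(data_area[pos]) for key, pos in index.items()}
    # Classification processing unit 213: two groups around the threshold.
    high = [k for k in index if freq[k] > threshold]
    low = [k for k in index if freq[k] <= threshold]
    # Sort processing unit 214: order by frequency within each group.
    order = sorted(high, key=lambda k: -freq[k]) + sorted(low, key=lambda k: -freq[k])
    # Data arrangement unit 215 and index update unit 216.
    new_data_area = [data_area[index[k]] for k in order]
    new_index = {k: pos for pos, k in enumerate(order)}
    return new_index, new_data_area


index = {"the": 0, "disk": 1, "fujitsu": 2, "seek": 3}
data_area = [[1, 2, 3, 4], [2, 3], [4], [1, 2, 3]]
new_index, new_data_area = rearrange(index, data_area, threshold=2)
```

Every key item still resolves to the same list of document numbers after the move; only the physical order of the data parts, and therefore the pointers, changes.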

[0029] For example, if the predetermined threshold is set to a value of about half the total number of documents n, and the classification processing unit 213 classifies the key items into a group whose frequency of appearance exceeds the threshold and a group whose frequency is at or below it, the key items contained in the majority of the documents registered in the document database 401 can be separated out.

[0030] The frequently occurring key items contained in the majority of the registered documents are, in the case of free terms, for example particles and verbs, demonstratives, and conjunctions, and these are hardly ever designated as search keys. As described above, however, the data part for a key item is a list of the numbers of the documents containing that key item (here, a word), so the more frequent the key item, the longer its data part. The wide distribution of such huge data parts across the data area 404 is what enlarges the seek range when the data part to be retrieved is searched for.

[0031] Classifying the key items by frequency of appearance as described above and placing the groups separately in the data area 404 therefore makes it possible to reduce greatly the seek range when the search processing unit 403, which corresponds to the search means 116, searches for a data part. Furthermore, if the data arrangement unit 215 places the data parts for the key items in the group at or below the threshold in the data area 404 in descending order of frequency, then, as shown in FIG. 4(a), the key items that appear in documents relatively often and those that appear very rarely are loosely separated within the data area 404.

[0032] Words that are likely to be designated as search keys, such as common nouns, do not appear as often as the extremely frequent words mentioned above, such as demonstratives, but they appear more often than proper nouns and the like. Separating out the very rare key items as well as the extremely frequent ones, and concentrating the key items that appear in documents with moderate frequency in the data area 404, therefore further limits the seek range when a data part is searched for and increases the reduction in average seek time.

[0033] This greatly shortens the time required for full-text search and improves the usability of the document filing system. The classification processing unit 213 may alternatively be configured to manipulate the sign bit of the numerical data representing the frequency of appearance in the frequency table 212, according to the result of comparing the frequency with the predetermined threshold.

[0034] For example, when the frequency of appearance is at or above the predetermined threshold, the sign bit of the corresponding numerical data in the frequency table 212 is manipulated so that the value representing the frequency becomes negative. The sort processing unit 214 can then sort all the key items in a single pass on the basis of the frequencies stored in the frequency table 212 and obtain a sort result in which the key items of extremely high or low frequency are separated out, just as when the groups are formed first and sorted individually as described above.
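The sign-bit variation can be sketched in Python as below. Two points are assumptions rather than statements from the source: Python integers have no fixed sign bit, so plain negation stands in for the bit operation, and the single sort is taken to be in descending order of the signed value. The function name `signed_sort` is hypothetical.

```python
def signed_sort(freq, threshold):
    """Order key items with one sort after negating the high frequencies.

    freq: key item -> number of documents containing it (frequency table 212).
    """
    # Frequencies at or above the threshold are stored as negative numbers.
    signed = {k: (-f if f >= threshold else f) for k, f in freq.items()}
    # One descending sort: moderate items first, rare items next,
    # and the very frequent (negative) items last.
    return sorted(signed, key=lambda k: signed[k], reverse=True)


freq = {"the": 98, "of": 95, "disk": 40, "seek": 25, "fujitsu": 2}
order = signed_sort(freq, threshold=50)
print(order)  # ['disk', 'seek', 'fujitsu', 'of', 'the']
```

Under these assumptions the single sort gives the same order as sorting the below-threshold group in descending frequency and appending the above-threshold group in ascending frequency, so no separate grouping pass is needed.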

[0035] N-gram, one of the techniques for extracting key items from a document, exhaustively extracts the character strings of up to N characters that start at each character of the document, so it can extract key items regardless of the nature of the document. With this technique, however, key items that are clearly of little use as search keys, such as a single katakana character or a combination of katakana and alphanumeric characters, are extracted as well, and each such key item and its data part are registered in the index table 402.

[0036] A method of arranging the data area with attention to the character types of the key items is described next. FIG. 5 shows the configuration of a document filing system to which the information retrieval device of claim 2 is applied. In FIG. 5, the key detection unit 411 operates as the character string extraction means 121 of claim 2 and detects key items in each document in the document database 401 using N-gram as the extraction technique, and the registration processing unit 412 registers each key item in the index table 402 in the order in which the key detection unit 411 obtained it.

【００３７】In the arrangement changing unit 210 shown in FIG. 5, the feature extraction unit 221 extracts, for each key item registered in the index unit 405, the features of the combination of character types that make up the key item, and supplies feature information indicating those features to the matching processing unit 222. The matching processing unit 222 compares the feature information for each key item with the extraction conditions held in the condition holding unit 223 and, according to the result, attaches a conformance flag indicating whether the key item satisfies the extraction conditions to the corresponding key item entry in the frequency table 212, which is then supplied to the sort processing unit 214.

【００３８】In this way, the matching processing unit 222 operates on the feature information obtained by the feature extraction unit 221, thereby realizing the functions of the discrimination means 122 and the classification means 114 described in claim 2, and each key item obtained by applying N-gram can be classified into one of two groups based on the features of the character types that make it up.

【００３９】This classification result is supplied to the data arrangement unit 215 via the sort processing unit 214, and the data arrangement unit 215 and the index updating unit 216 operate accordingly, thereby realizing the function of the arrangement means 115 described in claim 2. The data part corresponding to each key item can thus be placed in the data area 404 according to the features of the character types that make up the key item.

【００４０】Here, for example, the condition holding unit 223 may store, as extraction conditions, information indicating combinations of character types that are likely to be specified as search keys, such as two kanji characters, two katakana characters, or one kanji character followed by one kana character.
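A rough sketch of the feature extraction and matching steps, under stated assumptions, might look like this. The Unicode block ranges are standard, but the names `char_type`, `EXTRACTION_CONDITIONS`, and `conformance_flag` are hypothetical, and "kana" in the third condition is assumed here to mean hiragana.

```python
def char_type(ch: str) -> str:
    """Classify one character by Unicode block (simplified, illustrative)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block (includes the long-vowel mark)
        return "katakana"
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if ch.isascii() and ch.isalnum():
        return "alnum"
    return "other"


# Assumed contents of the condition holding unit: the three example
# combinations named in the text.
EXTRACTION_CONDITIONS = {
    ("kanji", "kanji"),
    ("katakana", "katakana"),
    ("kanji", "hiragana"),
}


def conformance_flag(key_item: str) -> bool:
    """Return True when the key item's character-type sequence matches a condition."""
    feature = tuple(char_type(ch) for ch in key_item)
    return feature in EXTRACTION_CONDITIONS
```

Representing the feature as a tuple of character types keeps the matching step a simple set lookup; a real system would likely need a richer condition language than exact sequences.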

【００４１】In this case, the units operate as described above to classify the key items into a group of character strings considered likely to be specified as search keys and a group of all other character strings, and the corresponding data parts can be placed separately in the data area 404. As shown in FIG. 4(b), the key items that satisfy the extraction conditions can thus be concentrated at the head of the data area 404, so the range that the search processing unit 403 must search for a data part in response to a specified search key is greatly reduced and the average seek time can be shortened.
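The separated placement can be pictured with a small sketch. It assumes, purely for illustration, that the data area is a contiguous byte string and that the index maps each key item to an (offset, length) pair; `layout_data_area` and its return values are hypothetical names, not part of the patent.

```python
def layout_data_area(data_parts: dict[str, bytes], conforms: dict[str, bool]):
    """Place conforming key items' data parts first, then the rest.

    Returns the packed data area, an index of key item -> (offset, length),
    and the offset at which the head region (conforming items) ends.
    """
    ordered = [k for k in data_parts if conforms[k]]
    ordered += [k for k in data_parts if not conforms[k]]

    data_area = bytearray()
    index = {}
    head_end = 0
    for key in ordered:
        index[key] = (len(data_area), len(data_parts[key]))  # index update
        data_area += data_parts[key]                          # data placement
        if conforms[key]:
            head_end = len(data_area)
    return bytes(data_area), index, head_end


parts = {"ア1": b"xx", "検索": b"abc", "キー": b"de"}
flags = {"ア1": False, "検索": True, "キー": True}
area, index, head_end = layout_data_area(parts, flags)
```

In this toy layout, lookups for likely search keys only ever touch offsets below `head_end`, which is the sense in which the search range is reduced.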

【００４２】Furthermore, if the frequency counting unit 211 counts appearance frequencies in parallel with the discrimination processing described above, a change of placement according to the appearance frequency of the key items can be applied at the same time. For example, if the classification processing unit 213 is configured to manipulate the sign of the numerical data indicating the appearance frequency in the frequency table 212 according to the result of comparing the appearance frequency with a threshold, the sort processing unit 214 need only sort the key items of the frequency table 212 in descending order of appearance frequency within each group indicated by the conformance flag in the frequency table 212 and supply the result to the data arrangement unit 215.

【００４３】In this case, the data arrangement unit 215 first places the data parts corresponding to the key items in the group judged useful as search keys in the data area 404 in descending order of the key items' appearance frequency, and then places the data parts corresponding to the key items in the group judged to be of little use as search keys in the data area 404 in the same manner (see FIG. 4(b)).
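The two-level ordering described here (group first, then descending frequency within each group) can be sketched with a composite sort key. The `Entry` record and `placement_order` function are hypothetical names introduced for the example, and the sign manipulation mentioned in the text is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    key_item: str
    frequency: int   # appearance frequency from the frequency table
    conforms: bool   # conformance flag from the matching step


def placement_order(entries: list[Entry]) -> list[Entry]:
    """Conforming group first; within each group, descending frequency."""
    return sorted(entries, key=lambda e: (not e.conforms, -e.frequency))


table = [
    Entry("ア1", 40, False),
    Entry("検索", 12, True),
    Entry("キー", 30, True),
    Entry("キ", 90, False),
]
ordered = [e.key_item for e in placement_order(table)]
```

Sorting on `(not conforms, -frequency)` keeps the whole operation a single pass over the frequency table, similar in spirit to the single collective sort the text describes.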

【００４４】This further limits the range searched for a data part during search processing and shortens the average seek time. The placement method described above, which takes the statistical properties of key items into account, may also be applied to a dictionary for kana-kanji conversion.

【００４５】FIG. 6 shows the configuration of a kana-kanji conversion system to which the information retrieval device of claim 3 is applied. In FIG. 6, the word dictionary 231 corresponds to the dictionary 103 described in claim 3 and consists of a data area 232, in which data parts storing the conversion candidates for each reading are arranged in dictionary order, and an index unit 233, which indicates the storage location of the data part for each reading.

【００４６】The conversion processing unit 234 corresponds to the dictionary search means 102 described in claim 3; it receives a reading entered by the user via the input/output control unit 235, searches the word dictionary 231 for the corresponding conversion candidates, and returns the candidates obtained. In the counting processing unit 240 shown in FIG. 6, the word analysis unit 241 corresponds to the reading extraction means 131; it receives a standard document via the input/output control unit 235, analyzes the standard document with reference to the analysis dictionary 242 to extract individual words, and sends the readings of the extracted words one after another to the frequency counting unit 243.

【００４７】In response, the frequency counting unit 243 operates as the frequency counting means 132 and counts the number of occurrences of each reading by incrementing the count value for the corresponding reading in the frequency table 244. If a sufficiently long document is input as the standard document, the appearance frequency of each reading in general documents can be predicted from its number of occurrences in the standard document.
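The counting step can be sketched as below. The sketch assumes the word analysis has already produced (surface form, reading) pairs, since morphological analysis itself is outside its scope; `count_readings` is a hypothetical name.

```python
from collections import Counter


def count_readings(analyzed_words: list[tuple[str, str]]) -> Counter:
    """Increment the count for each reading as it arrives, one word at a time."""
    frequency_table = Counter()
    for _surface, reading in analyzed_words:
        frequency_table[reading] += 1
    return frequency_table


# Assumed output of the word analysis step for a tiny "standard document".
analyzed = [("検索", "けんさく"), ("情報", "じょうほう"), ("検索", "けんさく")]
table = count_readings(analyzed)
```

With a standard document this small the counts say little about general usage; the text's point is that a sufficiently long document makes them a usable estimate.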

【００４８】Accordingly, the frequency table 244 obtained as described above is supplied to the classification processing unit 213, and the sort processing unit 214 and the data arrangement unit 215 operate according to the classification result from the classification processing unit 213, so that the corresponding data parts can be arranged in the data area 232 of the word dictionary 231 according to the frequency with which the reading of each word appears in documents.

【００４９】In this way, the data parts corresponding to words expected to appear frequently in documents are separated from those corresponding to words expected to appear only rarely, and the conversion candidates for the readings of frequently used words are placed together in the data area 232. This limits the search range when searching for kana-kanji conversion candidates and shortens the average seek time, so kana-kanji conversion processing can be made faster.
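As an illustration of this arrangement, the following sketch splits readings into a frequent and a rare group using a threshold and places the frequent group first. The threshold value, the function name `arrange_dictionary`, and the decision to keep the rare group in code-point order (a stand-in for dictionary order) are assumptions, not details given in the text.

```python
def arrange_dictionary(
    candidates: dict[str, list[str]],
    frequency: dict[str, int],
    threshold: int,
) -> list[tuple[str, list[str]]]:
    """Order dictionary entries: frequent readings first, then rare ones."""
    frequent = sorted(
        (r for r in candidates if frequency.get(r, 0) >= threshold),
        key=lambda r: -frequency.get(r, 0),   # descending estimated frequency
    )
    rare = sorted(r for r in candidates if frequency.get(r, 0) < threshold)
    return [(r, candidates[r]) for r in frequent + rare]


candidates = {"あい": ["愛", "藍"], "けんさく": ["検索"], "じょうほう": ["情報"]}
frequency = {"けんさく": 5, "じょうほう": 9, "あい": 1}
arranged = arrange_dictionary(candidates, frequency, threshold=3)
```

Readings that never occur in the standard document default to a count of zero and therefore fall into the rare group.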

【００５０】[0050]

[Effects of the Invention] As described above, according to the invention of claim 1, the data part corresponding to each key item is placed according to its appearance frequency in documents, which limits the range searched for the relevant data part during full-text search processing; the average seek time is therefore shortened and search processing can be made faster.

【００５１】According to the invention of claim 2, the data part corresponding to each key item is placed according to the features of the combination of character types that make up the extracted key item. In particular, when key items are extracted by N-gram, this limits the range searched for the relevant data part during full-text search processing, so the average seek time is shortened and search processing can be made faster.

【００５２】According to the invention of claim 3, the appearance frequency of each word in general documents is estimated from the number of occurrences of the word in the standard document, and the data part consisting of the conversion candidates for the reading of each word is placed according to that estimate. This limits the search range when searching for conversion candidates, so the average seek time is shortened and kana-kanji conversion processing can be made faster.

[Brief description of drawings]

【図１】FIG. 1 is a block diagram showing the principle of the information retrieval device of claims 1 and 2.

【図２】FIG. 2 is a block diagram showing the principle of the information retrieval device of claim 3.

【図３】FIG. 3 is a configuration diagram of a document filing system to which the information retrieval device of claim 1 is applied.

【図４】FIG. 4 is a diagram illustrating the arrangement of data parts in the data area.

【図５】FIG. 5 is a configuration diagram of a document filing system to which the information retrieval device of claim 2 is applied.

【図６】FIG. 6 is a configuration diagram of a kana-kanji conversion system to which the information retrieval device of claim 3 is applied.

【図７】FIG. 7 is a diagram showing a configuration example of a conventional information retrieval device.

[Explanation of symbols]

101, 401 Document database
102 Dictionary search means
103 Dictionary
111 Word detection means
112 Document information holding means
113 Counting means
114 Classification means
115 Arrangement means
116 Search means
121 Character string extraction means
122 Discrimination means
131 Reading extraction means
132 Frequency counting means
133 Dictionary arrangement means
210 Arrangement changing unit
211 Frequency counting unit
212, 244 Frequency table
213 Classification processing unit
214 Sort processing unit
215 Data arrangement unit
216 Index updating unit
221 Feature extraction unit
222 Matching processing unit
223 Condition holding unit
231 Word dictionary
234 Conversion processing unit
235 Input/output control unit
240 Frequency counting unit
241 Word analysis unit
242 Analysis dictionary
243 Frequency counting unit
402 Index table
403 Search processing unit
404, 232 Data area
405, 233 Index unit
410 Index creation unit
411 Key detection unit
412 Registration processing unit

Continuation of the front page

(56) References
JP-A-6-68159 (JP, A)
JP-A-8-194719 (JP, A)
JP-A-4-205560 (JP, A)
菊池忠一 et al., "A method for speeding up full-text search by encoding that includes constituent-character attributes and character positions," IEICE Technical Report (DE90-24), December 14, 1990, Vol. 90, No. 362, pp. 1-7
菊池忠一, "A method for speeding up kana character search in full-text search using string matching," IPSJ Research Report (91-DBS-83), May 24, 1991, Vol. 91, No. 46, pp. 1-10
菊池忠一, "A method of high-speed full-text search for Japanese documents," IPSJ Research Report (92-FI-25), May 12, 1992, Vol. 92, No. 32, pp. 9-16

(58) Fields searched (Int. Cl.7, DB name)
G06F 17/30
G06F 17/22

Claims

(57) [Claims]

1. In a full-text search system that searches all documents stored in a document database for documents containing a search key in response to input of the search key, an information retrieval device comprising: word detection means that takes every word that may serve as a search key as a key item, detects documents containing the key item, and outputs information specifying the detected documents as document information corresponding to each key item; document information holding means that holds the document information corresponding to each key item; counting means that counts the number of documents containing each key item based on the document information held for the key item and outputs the count as an index corresponding to the key item; classification means that classifies the key items into a plurality of groups based on the index corresponding to each key item; arrangement means that, based on the classification result obtained, arranges the pieces of document information held in the document information holding means in the storage medium storing the document information holding means, group by group according to the group to which the corresponding key item belongs; and search means that retrieves and outputs the corresponding document information from the document information holding means in response to input of a search key.

2. In a full-text search system that searches all documents stored in a document database for documents containing a search key in response to input of the search key, an information retrieval device comprising: character string extraction means that extracts consecutive character strings as key items by N-gram and outputs information specifying the document from which each key item was extracted as document information corresponding to the key item; document information holding means that holds the document information corresponding to each key item; discrimination means that determines the features of the combination of character types making up each key item obtained by the character string extraction means and outputs information indicating the determined features as an index corresponding to the key item; classification means that classifies the key items into a plurality of groups based on the index corresponding to each key item; arrangement means that, based on the classification result obtained, arranges the pieces of document information held in the document information holding means in the storage medium storing the document information holding means, group by group according to the group to which the corresponding key item belongs; and search means that retrieves and outputs the corresponding document information from the document information holding means in response to input of a search key.

3. In a kana-kanji conversion system in which dictionary search means searches for and provides conversion candidates registered in advance in a dictionary in response to input of a word reading, an information retrieval device comprising: reading extraction means that receives a predetermined standard document and extracts the different readings contained in the standard document as key items; frequency counting means that counts the number of occurrences of each key item in the standard document and outputs the count as an index corresponding to the key item; classification means that classifies the key items into a plurality of groups based on the index corresponding to each key item; and dictionary arrangement means that, based on the classification result obtained, arranges the conversion candidates held in the dictionary in the storage medium storing the dictionary, group by group according to the group to which the corresponding key item belongs, and makes them available for processing by the dictionary search means.