JPH0668159A - Retrieval device

Info

Publication number: JPH0668159A
Application number: JP4216752A
Authority: JP
Inventors: Kenji Hashimoto (橋本 賢治); Katsumi Murai (村井 克己)
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1992-08-14
Filing date: 1992-08-14
Publication date: 1994-03-11

Abstract

PURPOSE: In a retrieval device based on full-text search, which retrieves requested documents from a large volume of document data without pre-assigned index information, to reduce the size of the table used for search preprocessing without lowering narrowing efficiency, by building that table from two-character concatenations selected according to their appearance frequency. CONSTITUTION: The device has a two-character concatenation extraction means 1 that extracts two-character concatenations from the documents to be searched, a means 2 that extracts the per-file appearance frequency of each two-character concatenation, and a table creating means 4 that creates a table from the two-character concatenations selected on the basis of that frequency. Two-character concatenations are extracted from a requested search string in the same way and selected on the basis of appearance frequency. The resulting search entry and search content 6 are passed to a table search means 7 to obtain candidate documents 8, and a detailed search means 9 then searches the candidate documents 8 to produce the search result.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a retrieval device for information processing equipment, based on a full-text search method that retrieves requested document data from a large volume of document data without using index information assigned manually in advance.

[0002]

[Prior Art] In recent years, advances in computer peripheral technology have made word processors and personal computers commonplace, and large volumes of document data are now used in workplaces and homes. To organize this data and use it effectively, large-capacity data storage devices and high-speed search machines have been researched and developed. Conventional search machines, however, require index information to be attached to the original data in advance, and as data volumes grow, indexing by hand has come to demand enormous labor. There have been attempts to have a computer extract the index automatically, but the natural language processing technology underlying the extraction of index keywords from documents is still immature, and full automation has not been achieved. In response, full-text search methods, which search without attaching index information, have been researched and developed. The drawback of full-text search is that it is slower than search using index information. Methods addressing this drawback include, for example, a text search machine for full-text search (IEICE Technical Report, Data Engineering 89-38) and a technique for speeding up full-text search by an encoding that includes the attributes and positions of constituent characters (IEICE Technical Report, Data Engineering 90-24). In addition, a method has been proposed that speeds up search by using two-character concatenations to narrow down the documents to be searched, as preprocessing before the body text is searched.

[0003]

[Problems to Be Solved by the Invention] However, when every two-character concatenation occurring in the text is used, creating table entries for concatenations that appear in the documents with very high frequency increases the size of the table data without improving the efficiency of narrowing down the search targets. For example, if two-character concatenations that are used frequently in Japanese documents, such as 「ある」 (aru), 「する」 (suru), 「る。」 (ru followed by a full stop), and 「これ」 (kore), are present in most of the document files to be searched, including them in the table only increases the size of the table itself and contributes nothing to narrowing.

[0004] An object of the present invention is to provide a retrieval device that analyzes the appearance frequency of two-character concatenations on a per-document basis and creates the table from two-character concatenations selected according to the result, thereby reducing the size of the table data without greatly lowering the narrowing rate.

[0005]

[Means for Solving the Problems] To solve the above problems, the retrieval device of the present invention is configured as follows. It extracts two-character concatenations from the Japanese character code strings of the documents to be searched and obtains frequency information on how often each two-character concatenation appears in the document files. For each two-character concatenation selected on the basis of that frequency information, it creates a table whose entry is the first character and whose content is the remaining character together with recording position outline information for the searched document from which the concatenation was extracted. At search time, it selects two-character concatenations from the input search string on the basis of the appearance frequency, takes the first character of each selected concatenation as the search entry and the second character as the search content, looks up the search content in the created table using the search entry, and obtains the corresponding recording position outline information. After narrowing down the documents to be searched in this way, it performs a detailed search on them and outputs the document data that satisfies the search request.
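The table layout described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `build_table` and `max_doc_freq` are hypothetical, a set of document ids stands in for the "recording position outline information", and the frequency is taken to be the number of files containing the concatenation.

```python
from collections import defaultdict

def build_table(docs, max_doc_freq):
    """docs maps a document id to its text (hypothetical input format).
    Returns {first_char: {second_char: set_of_document_ids}}."""
    # For every two-character concatenation, record which documents contain it.
    containing = defaultdict(set)
    for doc_id, text in docs.items():
        for i in range(len(text) - 1):
            containing[text[i:i + 2]].add(doc_id)

    # Entry = first character; content = second character plus document ids.
    table = defaultdict(dict)
    for bigram, doc_ids in containing.items():
        if len(doc_ids) <= max_doc_freq:  # keep only sufficiently rare concatenations
            table[bigram[0]][bigram[1]] = doc_ids
    return table

docs = {"a": "東京の首都", "b": "京都の古都", "c": "東京と京都"}
table = build_table(docs, max_doc_freq=2)
```

With `max_doc_freq=1`, the concatenations 「東京」 and 「京都」, which each occur in two of the three sample documents, would be left out of the table.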

[0006]

[Operation] According to the present invention, as described above, two-character concatenations are extracted from the document data to be searched, frequency information on the appearance of each two-character concatenation in the document files is obtained, and the table is created from two-character concatenations selected on the basis of that frequency information. The table is therefore smaller than one created for every two-character concatenation that appears, and as long as it is the high-frequency concatenations that the selection leaves out, the narrowing efficiency is not lowered. A smaller table shortens the table search time and leaves more storage capacity for the document data to be searched.

[0007] The storage medium used for full-text search can therefore be used effectively without lowering search efficiency.

[0008]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings.

[0009] FIG. 1 is a functional block diagram of the search method according to the present invention. In FIG. 1, the two-character concatenation extraction means 1 sequentially extracts two-character concatenations from the Japanese character code string of the documents to be searched, and the frequency information extraction means 2 extracts the frequency with which each two-character concatenation appears in the document files. Two-character concatenations are selected on the basis of this appearance frequency, and concatenations duplicated within the same document are removed. For each remaining concatenation, the table creating means 4 produces a table from one character as the entry, the remaining character, and the recording position outline information 5 for the document from which the concatenation was extracted. For a requested search string, the same two-character concatenation extraction means 1 sequentially extracts two-character concatenations, and concatenations are selected on the basis of the per-file frequency information obtained from the searched documents by the frequency information extraction means 2, yielding the search entry and search content 6. The table search means 7 searches the table with the search entry and search content and obtains the recording position outline information of the corresponding candidate documents. The detailed search means 9 then searches the candidate documents for the requested search string to produce the search result.
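The query-side flow of FIG. 1 (table search, then detailed search) might look like the Python sketch below. It makes several assumptions that are not in the source: the table is a nested dict of document-id sets, a query concatenation missing from the table is simply treated as "not selected", and a plain substring test stands in for the detailed search means. The function names are hypothetical.

```python
def narrow(query, table, all_ids):
    """Table search: intersect the document sets of the query's concatenations."""
    candidates = set(all_ids)
    for i in range(len(query) - 1):
        entry, content = query[i], query[i + 1]   # search entry, search content
        doc_ids = table.get(entry, {}).get(content)
        if doc_ids is not None:                   # concatenation is in the table
            candidates &= doc_ids
    return candidates

def search(query, table, docs):
    """Detailed search: scan only the candidate documents for the full string."""
    return sorted(d for d in narrow(query, table, docs) if query in docs[d])

docs = {"a": "東京の首都", "b": "京都の古都", "c": "東京と京都"}
# A small hand-written table holding only two concatenations.
table = {"東": {"京": {"a", "c"}}, "京": {"都": {"b", "c"}}}

hits = search("東京と", table, docs)
```

Here the table narrows the query 「東京と」 to documents `a` and `c`, and only those two are scanned in full.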

[0010] FIG. 2 shows a specific example of table creation using frequency information according to the present invention. From a Japanese character string such as 「…、日本の首都東京と古都である京都を比較すると、…」 ("... comparing Tokyo, the capital of Japan, with Kyoto, the ancient capital, ..."), the two-character concatenation extraction means obtains a group of two-character concatenations such as 「日本」, 「本の」, 「の首」, ..., 「であ」, 「ある」, and so on. The frequency information extraction means obtains, as frequency information, the number of files among all the target document files in which each two-character concatenation appears. On the basis of this frequency information, only concatenations whose appearance frequency is below a set threshold are selected, and their entries and contents are used to create the table. The higher the appearance frequency of a two-character concatenation, the more files it is spread across, so the narrowing effect of its file information diminishes in proportion to the appearance frequency while the table space it occupies grows. In particular, a two-character concatenation present in every file has no narrowing effect at all and only takes up a large share of the table. Creating table entries only for two-character concatenations whose appearance frequency is at or below the threshold therefore makes the table smaller without greatly lowering the narrowing rate, and also shortens the table search time.
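As a rough illustration of the extraction and selection steps, the following Python sketch counts, for each two-character concatenation, the number of files containing it and keeps those at or below a threshold. The function names and the sample documents are assumptions made for this example, and the comparison uses "does not exceed" as in claim 1.

```python
from collections import Counter

def bigrams(text):
    """All two-character concatenations of text, in order of appearance."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def document_frequency(docs):
    """Number of files in which each two-character concatenation appears."""
    df = Counter()
    for text in docs:
        df.update(set(bigrams(text)))  # count each file at most once
    return df

def select(df, threshold):
    """Concatenations whose file count does not exceed the threshold."""
    return {b for b, n in df.items() if n <= threshold}

docs = ["日本の首都である", "京都は古都である", "東京である"]
df = document_frequency(docs)
selected = select(df, threshold=2)
```

In this sample, 「であ」 and 「ある」 occur in all three files and are dropped, while 「首都」 (one file) and 「都で」 (two files) are kept.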

[0011] FIG. 3 is a specific example graphing the appearance frequency of two-character concatenations contained in patent documents. The data set consists of 190 patent document files (average file size 52 Kbytes, total size 10 Mbytes). The horizontal axis is the number of files, and the vertical axis is the number of distinct two-character concatenations that appear in that many files. The number of distinct concatenations falls off exponentially as the file count increases, but rises again near the total number of files. This tendency is almost the same regardless of the document data used. The data show that leaving out the two-character concatenations near the total file count, that is, those with high appearance frequency, makes it possible to cut the table by 10% to 20% with almost no loss in the narrowing rate.
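The kind of histogram plotted in FIG. 3, and the resulting table saving, could be computed along the lines of the Python sketch below. The sample documents are invented for illustration, the function names are hypothetical, and table size is approximated here as the number of (concatenation, file) pairs, which is an assumption rather than the patent's own measure.

```python
from collections import Counter

def file_counts(docs):
    """Number of files containing each two-character concatenation."""
    df = Counter()
    for text in docs:
        df.update({text[i:i + 2] for i in range(len(text) - 1)})
    return df

def df_histogram(docs):
    """Maps a file count to the number of distinct concatenations with that count."""
    return Counter(file_counts(docs).values())

def table_reduction(docs, threshold):
    """Fraction of (concatenation, file) pairs dropped by the threshold."""
    df = file_counts(docs)
    total = sum(df.values())
    dropped = sum(n for n in df.values() if n > threshold)
    return dropped / total

docs = ["日本の首都である", "京都は古都である", "東京である"]
hist = df_histogram(docs)
saving = table_reduction(docs, threshold=2)
```

On a corpus this small the numbers are not meaningful in themselves; the sketch only shows how the horizontal and vertical axes of such a graph relate to the per-file counts.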

[0012] If the invention is configured simply not to select high-frequency concatenations, a two-character concatenation that does not actually occur anywhere cannot be distinguished from one that was left out because its appearance frequency is high. To tell these apart, each two-character concatenation that occurs but is not selected on the basis of the frequency information can be registered in the table once, under a special code. Then, if the search string contains a concatenation that does not actually occur, it is known before the detailed search stage that no candidate document file exists, and the search result can be displayed without spending any time on detailed search.
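One way to picture the special code is as a sentinel value stored once per excluded concatenation. The Python sketch below is a minimal illustration under assumptions: the sentinel `EXCLUDED`, the flat table keyed by the whole concatenation, and the function names are all hypothetical choices for this example.

```python
EXCLUDED = object()  # hypothetical stand-in for the "special code"

def build_table(docs, threshold):
    containing = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - 1):
            containing.setdefault(text[i:i + 2], set()).add(doc_id)
    # Frequent concatenations are registered once, with the marker only.
    return {b: (ids if len(ids) <= threshold else EXCLUDED)
            for b, ids in containing.items()}

def candidates(query, table, all_ids):
    result = set(all_ids)
    for i in range(len(query) - 1):
        b = query[i:i + 2]
        if b not in table:
            return set()              # occurs nowhere: no detailed search needed
        if table[b] is not EXCLUDED:  # excluded entries give no narrowing
            result &= table[b]
    return result

docs = {"a": "東京である", "b": "京都である", "c": "大阪である"}
table = build_table(docs, threshold=2)
```

A query such as 「奈良」 returns an empty candidate set immediately, whereas 「である」 matches only excluded entries and leaves all three documents as candidates.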

[0013] Two-character concatenations with a high appearance frequency fall into two groups: those that depend on the target document data and those that are used frequently in Japanese text in general. For example, if the document data are patents, document-dependent concatenations include words used as section items, such as 「発明」 (invention), 「装置」 (device), 「方法」 (method), and 「効果」 (effect). Those with a high frequency of use in Japanese in general include particles and auxiliary verbs such as 「する」 (suru), 「なる」 (naru), 「もの」 (mono), and 「ある」 (aru). When a frequency table is prepared in advance, analysis results covering only the latter group are used.

[0014] As a characteristic of Japanese, special symbols, beginning with the punctuation marks 「、」 and 「。」, and words written only in hiragana are unlikely to be search targets. Punctuation marks are mostly preceded by the hiragana of particles, auxiliary verbs, verbs, and the like, and since they carry no meaning in themselves they are rarely part of a requested search string in full-text search. A requested search string consisting only of hiragana is in most cases a phonetic reading, so if the system converts it to kanji before searching there is almost no problem. Accordingly, a configuration is also possible in which two-character concatenations that contain punctuation marks, or that begin with hiragana, are not selected from the character strings of the documents to be searched. Since hiragana account for roughly 40% of the characters in ordinary documents, not selecting two-character concatenations that begin with hiragana removes roughly 40% of the table, a substantial reduction in size. In situations where search terms containing hiragana are used frequently, search time is affected, so it is better to treat hiragana like any other character. The criteria for table creation are easily changed to suit the usage, and such variations are not excluded from the scope of the present invention.
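The character-class rule described here can be sketched in Python as below. The sketch assumes Unicode text and tests hiragana by the code-point range U+3041 to U+3096, whereas the device in the source would operate on the Japanese character codes of its time; the function names are hypothetical.

```python
PUNCTUATION = {"、", "。"}

def is_hiragana(ch):
    # Hiragana letters ぁ (U+3041) through ゖ (U+3096); an assumption for this sketch.
    return "\u3041" <= ch <= "\u3096"

def keep(bigram):
    """True if the concatenation may be used for table creation."""
    if bigram[0] in PUNCTUATION or bigram[1] in PUNCTUATION:
        return False                  # contains a punctuation mark
    return not is_hiragana(bigram[0])  # must not begin with hiragana

def table_bigrams(text):
    pairs = (text[i:i + 2] for i in range(len(text) - 1))
    return [b for b in pairs if keep(b)]

kept = table_bigrams("京都である。")
```

For the sample string, 「であ」, 「ある」, and 「る。」 are filtered out and only 「京都」 and 「都で」 remain as table candidates.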

[0015]

[Effects of the Invention] As described above, the present invention provides the following effect.

[0016] If, using the per-file appearance frequency information of two-character concatenations, only concatenations that do not exceed a preset threshold are selected when the table is created, then only entries that contribute little to narrowing are removed from the table. Because the removed portion is large, the table size can be reduced substantially with almost no reduction in the narrowing rate at search time.

[Brief description of drawings]

FIG. 1 is a functional block diagram of the search method according to the present invention.

FIG. 2 is a diagram showing a specific example of table creation using frequency information according to the present invention.

FIG. 3 is a graph showing a specific example of the appearance frequency of two-character concatenations in the present invention.

[Explanation of symbols]

1 Two-character concatenation extraction means
2 Per-file appearance frequency information extraction means
3 Two-character concatenation selection means
4 Table creating means
5 Recording position outline information
6 Search entry and search content
7 Table search means
8 Search candidate documents
9 Detailed search means

Claims

1. A retrieval device comprising: two-character concatenation extraction means for extracting arbitrary two-character concatenations from a Japanese character code string of a document to be searched in a storage medium; frequency information extraction means for obtaining the frequency with which each two-character concatenation appears in the document files; means for selecting, from among the two-character concatenations, those whose appearance frequency does not exceed a preset threshold; and table creating means for creating a table in which the first character of each selected two-character concatenation is the entry and the content is the remaining character and recording position outline information of the searched document from which the two-character concatenation was extracted; wherein two-character concatenations are selected, on the basis of the appearance frequency, from a search string input at the time of a search request, the first character of each selected two-character concatenation is taken as a search entry and the second character as search content, the search content is looked up in the created table using the search entry to obtain the corresponding recording position outline information, and, after the documents to be searched have been narrowed down, a detailed search is performed on them.

2. The retrieval device according to claim 1, wherein character combinations forming a two-character concatenation that includes the punctuation mark 「。」 or 「、」 are not selected when the table is created.

3. The retrieval device according to claim 1, wherein character combinations forming a two-character concatenation that begins with hiragana are not selected when the table is created.