JPH09101951A

JPH09101951A - Document retrieving device

Publication number: JPH09101951A
Application number: JP7260097A
Authority: JP
Inventors: Junichi Fukumoto; 淳一福本
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1995-10-06
Filing date: 1995-10-06
Publication date: 1997-04-15

Abstract

PROBLEM TO BE SOLVED: To save the labor of assigning indexes by having the user select some of the documents retrieved by a keyword and, when narrowing those documents down to the target documents, extracting the documents strongly related to the selected document by calculating document similarity. SOLUTION: A document retrieving part 4 retrieves documents from a document database 3 using the keyword. In addition, broader and narrower terms of the retrieval keyword are extracted using a thesaurus table 2 and are also used for retrieval, and the results are held in a retrieved result holding part 5. Next, to narrow down the retrieved results, the user selects a document in the desired field through a user interface 1. A word extracting part 6 then extracts the words contained in each document in the retrieved result holding part 5. A similarity degree calculating part 7 outputs the degree of similarity between the selected document and each document in the retrieved result holding part 5 by comparing the words extracted by the word extracting part 6. A document selecting part 8 selects documents from the holding part 5 in descending order of similarity, thereby narrowing them down, and displays the result to the user.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document retrieval device that searches a document database using a specified character string, and in particular to a function for searching the retrieved documents in finer detail.

[0002]

[Prior Art] One conventional document retrieval device is disclosed in Japanese Unexamined Patent Application Publication No. H4-10062. In conventional document database search, a character string is specified at search time, and the documents that have that character string as an index are output as the search result. In the above publication, moreover, dictionary data registered in advance that holds broader-narrower relationships is used to obtain character strings related to the one specified at search time, and those character strings can also be used for the search. As a result, a search was possible even when the keyword specified at search time differed from the indexes assigned in advance to the document database being searched.

[0003]

[Problems to Be Solved by the Invention] In general, for a conventional document database search to retrieve the target documents by keyword specification alone, an effective search keyword must be chosen. If such a keyword cannot be supplied in a single attempt, further keywords have to be added to refine the documents already retrieved with an initial keyword. The problem is that the searcher must personally judge which keywords are effective for narrowing the document database down to the target documents.

[0004] For example, suppose the document database is first searched with the keyword "car" and the result contains documents on car-related topics such as "car development", "car accidents", and "car trade issues". At this initial stage, finding out which topics the retrieved documents cover requires examining all of the retrieved documents. In that situation, to obtain documents on the topic that is the goal of the search, such as "car development", an appropriate keyword must be chosen from among the keywords assigned to those documents.

[0005] Furthermore, for the above search to be effective, a search index must be assigned to every document in the large document database being searched, and assigning indexes to such a large number of documents requires a great deal of labor.

[0006]

[Means for Solving the Problems] To solve the above problems, the present invention provides a document retrieval device that retrieves documents matching a user's purpose from a plurality of documents retrieved using a specified character string, the device having means for temporarily holding the plurality of retrieved documents, means for having the user select a target document from the temporarily held documents, and means for retrieving, from the temporarily held documents, documents similar to the document selected by the user.

[0007]

[Embodiments of the Invention] FIG. 1 is a block diagram of a document retrieval device showing an example of an embodiment of the present invention. Reference numeral 1 denotes a user interface for entering search keywords and selecting a suitable document from the search results; 2 denotes a thesaurus table holding dictionary data on the broader-narrower relationships between terms; 3 denotes a document database holding the documents to be searched; 4 denotes a document search unit that searches the document database 3 using the search keyword entered through the user interface 1 and the thesaurus table 2; 5 denotes a search result holding unit that holds the documents returned by the document search unit 4; 6 denotes a word extraction unit that extracts the words contained in each of the search result documents held in the search result holding unit 5; 7 denotes a similarity calculation unit that calculates the similarity between the document selected through the user interface 1 and each search result document; and 8 denotes a document selection unit that narrows down the documents using the similarity information calculated by the similarity calculation unit 7 between the selected document and each search result document.

[0008] Next, the operation of the document retrieval device described above is explained. First, the user enters, through the user interface 1, a search keyword for retrieving documents from the document database 3. The document search unit 4 searches the document database 3 using the search keyword entered through the user interface 1. It also uses the thesaurus table 2 to extract terms that are broader or narrower than the search keyword and searches with those terms as well. The documents obtained as the search result are held in the search result holding unit 5.
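
As an illustration only, the keyword search with thesaurus expansion described in [0008] could be sketched as follows. The thesaurus entries, the sample documents, and the names `THESAURUS`, `DOCUMENTS`, `expand_keyword`, and `search` are hypothetical and do not appear in the disclosure; matching on whitespace-separated English words is likewise an assumption made to keep the sketch short.

```python
from typing import Dict, List, Set

# Hypothetical thesaurus table (2): keyword -> broader and narrower terms.
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "car": {"broader": ["vehicle"], "narrower": ["truck", "sedan"]},
}

# Hypothetical document database (3): document id -> text.
DOCUMENTS: Dict[int, str] = {
    1: "new sedan development announced",
    2: "car accident on the highway",
    3: "vehicle trade problem between countries",
    4: "weather forecast for tomorrow",
}


def expand_keyword(keyword: str) -> Set[str]:
    """Return the keyword together with its broader and narrower terms."""
    entry = THESAURUS.get(keyword, {})
    terms = {keyword}
    terms.update(entry.get("broader", []))
    terms.update(entry.get("narrower", []))
    return terms


def search(keyword: str) -> List[int]:
    """Return ids of documents that contain any of the expanded terms."""
    terms = expand_keyword(keyword)
    return sorted(
        doc_id
        for doc_id, text in DOCUMENTS.items()
        if terms & set(text.split())  # at least one term occurs in the text
    )


# The returned ids play the role of the search result holding unit (5).
held_results = search("car")
```

With these sample data, the documents mentioning "sedan" and "vehicle" are retrieved along with the one mentioning "car", even though only "car" was entered.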

[0009] FIG. 2 is an explanatory diagram showing an example of the search results held in the search result holding unit 5. In FIG. 2, documents 11 to 15 are examples of documents retrieved using the keyword "car". Next, to narrow down the search results, the user selects from them, through the user interface 1, a document in the field the user wants. This selection is made by displaying a portion of the retrieved documents to the user.

[0010] The selection may consist of one document or several. However, selecting several documents requires displaying many documents, so a single document is normally assumed to be selected. Next, the word extraction unit 6 extracts the words contained in each document held in the search result holding unit 5.

[0011] FIG. 3 is an explanatory diagram showing an example of the words extracted from the documents shown in FIG. 2. In FIG. 3, word string 21 is extracted from document 11 of FIG. 2, word string 22 from document 12, word string 23 from document 13, word string 24 from document 14, and word string 25 from document 15. In FIG. 3, the words are separated by "/".
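
The disclosure does not say how the word extraction unit 6 splits a document into words. The following is a minimal sketch under stated assumptions: the text is English, words are runs of letters and digits, and repeated words are kept only once. The function names `extract_words` and `format_word_string` are hypothetical; Japanese text would need a morphological analyzer instead of the regular expression used here.

```python
import re
from typing import List


def extract_words(text: str) -> List[str]:
    """Split text into lowercase words, keeping first-occurrence order without repeats."""
    seen = set()
    words = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


def format_word_string(words: List[str]) -> str:
    """Join the words with "/", the separator used for the word strings of FIG. 3."""
    return "/".join(words)


# Hypothetical sample sentence, not taken from FIG. 2.
sample_words = extract_words("Car makers develop a new car.")
sample_string = format_word_string(sample_words)
```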

[0012] Next, the similarity calculation unit 7 calculates the similarity between the document selected through the user interface 1 and each document in the search result holding unit 5 by comparing the words extracted by the word extraction unit 6. One example of a similarity calculation is to count how many of the words in the user-specified document also occur in each retrieved document and to treat that count as the document similarity. Another conceivable method uses a semantic relationship between words: the mutual information value proposed in K. W. Church et al., "Using Statistics in Lexical Analysis", Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon (Zernik Uri (ed.)), London, Lawrence Erlbaum Associates, 1991, pp. 115-164, is calculated between the words of the documents being compared, and the values are, for example, summed to give the document similarity.
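
The first of the two methods in [0012], counting shared words, could be sketched as below. This is an illustration, not the disclosed implementation: the names `overlap_similarity` and `score_documents` are hypothetical, and counting each distinct shared word once is an assumption, since the text does not say how repeated words are treated.

```python
from typing import Dict, Iterable, List


def overlap_similarity(selected_words: Iterable[str], candidate_words: Iterable[str]) -> int:
    """Count the distinct words that occur in both the selected and the candidate document."""
    return len(set(selected_words) & set(candidate_words))


def score_documents(selected_id: int, word_lists: Dict[int, List[str]]) -> Dict[int, int]:
    """Score every held document except the selected one against the selected document."""
    selected = word_lists[selected_id]
    return {
        doc_id: overlap_similarity(selected, words)
        for doc_id, words in word_lists.items()
        if doc_id != selected_id
    }


# Hypothetical word lists keyed by document number; not the contents of FIG. 3.
example_words = {
    11: ["car", "development", "engine"],
    12: ["car", "accident"],
    14: ["car", "development", "engine", "design"],
}
example_scores = score_documents(11, example_words)
```

In this sample, document 14 shares three words with document 11 and document 12 shares one, so document 14 would be treated as the more similar of the two.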

[0013] FIG. 4 is an explanatory diagram showing the result of calculating the similarity to each document in the search result holding unit when the user has selected document 11 in FIG. 2. The similarity here is calculated with the former of the methods described above, that is, by counting how many of the words in the user-specified document also occur in each retrieved document.

[0014] In FIG. 4, 31 denotes document 12 of FIG. 2, and 35 denotes the number of words in that document that also occur in the selected document. 32 denotes document 13 of FIG. 2, and 36 denotes the number of words in that document that also occur in the selected document. 33 denotes document 14 of FIG. 2, and 37 denotes the number of words in that document that also occur in the selected document. 34 denotes document 15 of FIG. 2, and 38 denotes the number of words in that document that also occur in the selected document.

[0015] Finally, the document selection unit 8 uses the similarity information calculated by the similarity calculation unit 7 to narrow down the documents by selecting the documents in the search result holding unit 5 in descending order of similarity, and displays them to the user through the user interface 1. For example, with the result of FIG. 4, the similarity is judged to decrease in the order 33, 31, 32, 34, and if, say, the top 10% are to be displayed, document 33 is displayed.
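
The selection step of [0015] could be sketched as follows. The similarity values below are hypothetical placeholders chosen only so that the ranking is 33, 31, 32, 34 as stated in the text; the actual counts appear in FIG. 4 and are not reproduced here. Rounding the cutoff up so that at least one document is shown is also an assumption, made because 10% of four documents is less than one. The name `select_top` is hypothetical.

```python
import math
from typing import Dict, List


def select_top(scores: Dict[int, int], fraction: float = 0.1) -> List[int]:
    """Return the highest-scoring documents, keeping the given fraction (at least one)."""
    # Sort document ids by similarity, highest first.
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical similarity values keyed by the reference numerals 31 to 34 of FIG. 4.
hypothetical_scores = {31: 3, 32: 2, 33: 5, 34: 1}
full_ranking = select_top(hypothetical_scores, fraction=1.0)
displayed = select_top(hypothetical_scores, fraction=0.1)
```

With these placeholder values, only document 33 is selected for display at the 10% setting, which matches the example in the text.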

[0016]

[Effects of the Invention] As described above, when documents provisionally retrieved by keyword are narrowed down further to the target documents, the present invention has the user select some of those documents and extracts the documents strongly related to the selected documents by calculating document similarity. Consequently, the searcher no longer has to judge which keywords are effective for narrowing down to the target documents, and can narrow down to the desired documents without knowing the keywords.

[0017] In addition, it is no longer necessary to assign numerous search indexes to the large number of documents in the document database being searched, which reduces the labor of assigning indexes.

[Brief description of the drawings]

FIG. 1 is a block diagram of a document retrieval device showing an example of an embodiment of the present invention.

FIG. 2 is an explanatory diagram showing an example of search results.

FIG. 3 is an explanatory diagram showing an example of words extracted from documents.

FIG. 4 is an explanatory diagram showing the result of the similarity calculation.

[Explanation of symbols]

1 User interface
2 Thesaurus table
3 Document database
4 Document search unit
5 Search result holding unit
6 Word extraction unit
7 Similarity calculation unit
8 Document selection unit

Claims

1. A document retrieval device for retrieving documents matching a user's purpose from a plurality of documents retrieved using a specified character string, the device comprising: means for temporarily holding the plurality of retrieved documents; means for having the user select a target document from the temporarily held documents; and means for retrieving, from the temporarily held documents, a document similar to the document selected by the user.

2. The document retrieval device according to claim 1, comprising: means for extracting words from the temporarily held documents; means for obtaining, from the word information, a similarity to the document selected by the user; and means for retrieving a document similar to the document selected by the user using the similarity information.