JPH103481A - Document retrieval device

Info

Publication number: JPH103481A
Application number: JP8156764A
Authority: JP
Inventors: Shoichi Tateno; 昌一舘野; Hiroshi Masuichi; 博増市
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1996-06-18
Filing date: 1996-06-18
Publication date: 1998-01-06
Anticipated expiration: 2016-06-18
Also published as: JP3707506B2

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval device that improves the likelihood of obtaining a desired document from among the documents to be searched. SOLUTION: A morphological analysis unit 3 performs morphological analysis on the documents in a document storage unit 2 using the noun dictionary in a noun dictionary storage unit 11 and extracts keywords. An index file is then generated and added to an index storage unit 4, and the extracted keywords are added to the noun dictionary in a registered noun dictionary storage unit 12. When a sentence specifying the search target is entered from the input unit 1, the morphological analysis unit 3 performs morphological analysis using the noun dictionary in the registered noun dictionary storage unit 12 and extracts keywords. It then performs morphological analysis again using the noun dictionary in the noun dictionary storage unit 11 and extracts keywords. A specifying unit 5 performs retrieval based on the extracted keywords using the index file in the index storage unit 4, and the retrieval result is output from an output unit 6.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a document retrieval device that uses a keyword search method.

[0002]

2. Description of the Related Art

The keyword search method is commonly used to search very large collections of documents. In this method, words that can serve as keywords are extracted in advance from the documents to be searched and registered in a file usually called an index file. The index file records each keyword paired with information such as the names of the documents in which it appears and its positions within those documents. At search time, the keyword that matches the word entered to specify the desired document is looked up in the index, and the document name or the position within the document is obtained from it, so documents can be retrieved at high speed.
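The index file described above can be pictured as an inverted index. The following is a minimal Python sketch, not code from the publication: the names `build_index` and `lookup`, the document names, and the keyword lists are all hypothetical, and keyword extraction is assumed to have been done already.

```python
from collections import defaultdict


def build_index(documents):
    """Map each keyword to (document name, position) pairs."""
    index = defaultdict(list)
    for name, keywords in documents.items():
        for position, keyword in enumerate(keywords):
            index[keyword].append((name, position))
    return index


def lookup(index, word):
    """Return the names of documents recorded for an exactly matching keyword."""
    return sorted({name for name, _ in index.get(word, [])})


# Hypothetical documents, given as already-extracted keyword lists.
documents = {
    "doc1": ["copier", "release", "Company A"],
    "doc2": ["printer", "release"],
}
index = build_index(documents)
```

Because the lookup is an exact match on the keyword string, a near miss such as `"copiers"` finds nothing, which is the limitation discussed next.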

[0003] Keywords can be extracted from a document either manually or automatically, using a keyword extraction technique such as morphological analysis. In terms of the time cost of keyword extraction, automatic extraction is the more advantageous approach.

[0004] The problem with the keyword search method is that the words entered to specify a desired document must exactly match the keywords registered in the index file, so choosing appropriate input words is not easy.

[0005] To solve this problem, Japanese Unexamined Patent Publication No. Hei 7-182370, for example, allows a sentence as the input for specifying a desired document. Keywords are automatically extracted from the specified sentence and compared with the keywords registered in the index file to obtain the desired document. With this method, even when the input does not exactly match a word in the index file, the desired document can be obtained as long as the words contained in the input sentence match, which solves the problem described above. For example, if "I want to know about the copiers released by Company A" is specified as the input, the keywords "Company A", "release", and "copier" are automatically extracted, and documents containing all of these keywords are retrieved.

[0006] However, in automatic keyword extraction techniques, including morphological analysis, the keywords that are extracted depend on context. Consequently, when keywords are extracted by morphological analysis from a sentence that specifies a document, as in the publication cited above, keywords that differ from the searcher's intent may be extracted, and more documents are missed. For example, suppose a search target document contains the string 「Ａ社系列車」 ("Company A affiliated train"), and that keyword extraction at index creation time extracted and registered 「Ａ社系」 ("Company A affiliated") and 「列車」 ("train"). Suppose the searcher, wishing to retrieve this document, specifies 「Ａ社系列」 ("Company A group"). If morphological analysis is performed as in the cited publication, 「Ａ社」 ("Company A") and 「系列」 ("group") may be extracted from this input and used for the search. In that case, the desired document cannot be obtained.

[0007]

SUMMARY OF THE INVENTION

The present invention has been made in view of the circumstances described above, and its object is to provide a document retrieval device that increases the likelihood of obtaining a desired document from among the documents to be searched.

[0008]

MEANS FOR SOLVING THE PROBLEM

The invention according to claim 1 is a document retrieval device comprising: document storage means for storing a plurality of documents; index storage means for storing an index file that records keywords extracted in advance from each document stored in the document storage means, together with the names of the documents in which those keywords appear; extraction means for analyzing a sentence input to specify a search target and extracting keywords; specifying means for specifying the document to be retrieved by comparing the keywords extracted by the extraction means with the keywords in the index file stored in the index storage means; and output means for reading the document specified by the specifying means from the document storage means and outputting it, wherein the extraction means extracts keywords by performing analysis that uses the keyword information present in the index file stored in the index storage means.

[0009] The invention according to claim 2 is the document retrieval device according to claim 1, wherein the extraction means additionally performs analysis without using the keyword information present in the index file stored in the index storage means, and extracts keywords from the sentence.

[0010] The invention according to claim 3 is the document retrieval device according to claim 1, further comprising first and second vocabulary dictionaries, and index creation means that analyzes the documents stored in the document storage means using the first vocabulary dictionary, extracts keywords, creates the index file, and adds the extracted keywords to the second vocabulary dictionary, wherein the extraction means analyzes the sentence using the second vocabulary dictionary and extracts keywords.

[0011]

DESCRIPTION OF THE EMBODIMENTS

FIG. 1 is a block diagram showing one embodiment of the document retrieval device of the present invention. In the figure, 1 is an input unit, 2 is a document storage unit, 3 is a morphological analysis unit, 4 is an index storage unit, 5 is a specifying unit, 6 is an output unit, 11 is a noun dictionary storage unit, and 12 is a registered noun dictionary storage unit. In this embodiment, the words extracted as keywords and registered in the index file are nouns, but adjectives, verbs, and other parts of speech could readily be treated as keywords as well.

[0012] The input unit 1 has a user interface through which a searcher can enter a sentence specifying a desired document. The entered sentence is sent to the morphological analysis unit 3.

[0013] The document storage unit 2 stores a plurality of documents to be searched.

[0014] The morphological analysis unit 3 performs morphological analysis on the documents stored in the document storage unit 2 and on the sentence entered at the input unit 1, and extracts keywords. It includes the noun dictionary storage unit 11, which stores the noun dictionary used when analyzing documents stored in the document storage unit 2, and the registered noun dictionary storage unit 12, which stores the keywords extracted from the documents and is used when analyzing search sentences. The morphological analysis unit 3 also includes the grammar dictionary and the dictionaries of vocabulary other than nouns that are needed to perform morphological analysis.

[0015] When analyzing a document stored in the document storage unit 2, the morphological analysis unit 3 uses the noun dictionary in the noun dictionary storage unit 11. It creates an index file that records the keywords extracted from the document together with information such as the name of the document containing each keyword, and adds it to the index storage unit 4. It then adds the keywords registered in the index file to the dictionary stored in the registered noun dictionary storage unit 12.
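The indexing step can be sketched as follows in Python. This is an illustrative sketch under stated assumptions, not the publication's implementation: the function name `register_document`, the dict-of-sets index layout, and the sample documents are hypothetical, and the keywords are assumed to have already been extracted with the noun dictionary of storage unit 11.

```python
def register_document(name, keywords, index, registered_nouns):
    """Record each keyword in the index and mirror it into the registered noun dictionary."""
    for keyword in keywords:
        index.setdefault(keyword, set()).add(name)  # index storage unit 4
        registered_nouns.add(keyword)               # registered noun dictionary (unit 12)


index = {}
registered_nouns = set()
register_document("doc1", ["Ａ社系", "列車"], index, registered_nouns)
register_document("doc2", ["列車", "時刻表"], index, registered_nouns)
```

Updating both structures in the same loop keeps the registered noun dictionary identical to the set of keywords in the index.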

[0016] When analyzing a sentence entered from the input unit 1, the morphological analysis unit 3 first analyzes it using the noun dictionary in the registered noun dictionary storage unit 12, extracts keywords, and sends them to the specifying unit 5. It then analyzes the sentence using the noun dictionary in the noun dictionary storage unit 11, extracts keywords, and sends them to the specifying unit 5.

[0017] The noun dictionary storage unit 11 stores a dictionary describing the noun vocabulary needed for morphological analysis. This noun dictionary is referenced when the morphological analysis unit 3 analyzes the documents stored in the document storage unit 2 and the sentence entered from the input unit 1.

[0018] The registered noun dictionary storage unit 12 stores a dictionary listing the keywords obtained when the morphological analysis unit 3 analyzed the documents in the document storage unit 2. This noun dictionary is referenced when the morphological analysis unit 3 analyzes a sentence entered from the input unit 1. The nouns in this dictionary match the keywords in the index file stored in the index storage unit 4.

[0019] The index storage unit 4 stores the index file, which records the keywords obtained from the documents in the document storage unit 2 through analysis by the morphological analysis unit 3, together with information such as the names of the documents containing each keyword.

[0020] The specifying unit 5 refers to the index file in the index storage unit 4 to identify, from among the documents in the document storage unit 2, the documents that contain all of the keywords obtained from the sentence entered at the input unit 1 through analysis by the morphological analysis unit 3.
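Identifying the documents that contain all of the extracted keywords amounts to intersecting the document sets recorded for each keyword. The Python sketch below illustrates this under assumptions: the function name `specify`, the dict-of-sets index layout, and the sample data are hypothetical and are not taken from the publication.

```python
def specify(index, keywords):
    """Return the names of documents that contain every keyword."""
    if not keywords:
        return set()
    result = None
    for keyword in keywords:
        names = index.get(keyword, set())  # an unknown keyword matches nothing
        result = set(names) if result is None else result & names
    return result


# Hypothetical index: keyword -> names of documents containing it.
index = {
    "Company A": {"doc1", "doc3"},
    "release": {"doc1", "doc2"},
    "copier": {"doc1", "doc2"},
}
```

A single keyword that is missing from the index empties the whole result, which is why a mismatch between the keywords extracted from the query and those registered at indexing time causes documents to be missed.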

[0021] The output unit 6 has a user interface that displays the documents identified by the specifying unit 5 to the searcher as the search result.

[0022] FIG. 2 is a flowchart showing an example of the operation of the morphological analysis unit 3 in one embodiment of the document retrieval device of the present invention. In S21, the text to be analyzed is received. If it is determined in S22 that the text is a document in the document storage unit 2, morphological analysis is first performed in S23 using the noun dictionary in the noun dictionary storage unit 11, and keywords are extracted. Next, in S24, the extracted keywords are recorded together with the names of the documents containing them to form an index file, which is added to the index storage unit 4. Then, in S25, the keywords registered in the index file are added to the noun dictionary stored in the registered noun dictionary storage unit 12. As a result, the nouns in the dictionary stored in the registered noun dictionary storage unit 12 match the keywords in the index file stored in the index storage unit 4.

[0023] If the text received in S21 is a sentence entered at the input unit 1, morphological analysis is first performed in S26 using the noun dictionary in the registered noun dictionary storage unit 12, and keywords are extracted. In S27, the result is passed to the specifying unit 5 to obtain a search result. Then, in S28, morphological analysis is performed using the noun dictionary in the noun dictionary storage unit 11 to extract keywords, and in S29 the result is passed to the specifying unit 5 to obtain further search results. The search results are output from the output unit 6.
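The query branch of the flowchart (S26 to S29) can be sketched in Python as two extraction passes feeding the same lookup. Everything named here is hypothetical: `search`, `specify`, the sample index, and the two lambda extractors, which merely stand in for morphological analysis with dictionary 12 and dictionary 11. Merging the two result sets into one ordered list without duplicates is also an assumption, since the publication only says that further results are obtained.

```python
def search(sentence, extract_registered, extract_general, specify):
    """Two-pass query handling; results of the first pass are listed first."""
    results = []
    # Pass 1 (S26, S27): keywords found with the registered noun dictionary.
    # Pass 2 (S28, S29): keywords found with the general noun dictionary.
    for extract in (extract_registered, extract_general):
        for name in sorted(specify(extract(sentence))):
            if name not in results:
                results.append(name)
    return results


# Hypothetical index, for illustration only.
index = {"Ａ社系": {"doc1"}, "列車": {"doc1"}, "Ａ社": {"doc2"}, "系列": {"doc2"}}


def specify(keywords):
    """Names of documents containing every keyword."""
    sets = [index.get(k, set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


hits = search(
    "Ａ社系列",
    lambda s: ["Ａ社系"],        # stand-in for analysis with dictionary 12
    lambda s: ["Ａ社", "系列"],  # stand-in for analysis with dictionary 11
    specify,
)
```

Running the registered-dictionary pass first places the documents matched through already-indexed keywords at the head of the result list.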

[0024] In this way, performing morphological analysis with the noun dictionary in the registered noun dictionary storage unit 12 to extract keywords raises the hit rate of keyword search. Subsequently performing morphological analysis with the ordinary noun dictionary in the noun dictionary storage unit 11 to extract keywords enables a search with even fewer omissions.

[0025] For example, suppose a search target document contains the string 「Ａ社系列車」 ("Company A affiliated train"), and that keyword extraction at index creation time extracted 「Ａ社系」 ("Company A affiliated") and 「列車」 ("train") and registered them in the index file. Suppose the searcher, wishing to retrieve this document, specifies 「Ａ社系列」 ("Company A group") at the input unit 1. The morphological analysis unit 3 performs morphological analysis using the noun dictionary in the registered noun dictionary storage unit 12. Because 「Ａ社系」, extracted from the document, is registered in that dictionary, the analysis splits the specified 「Ａ社系列」 into the registered word 「Ａ社系」 and the unregistered word 「列」, yielding the keyword 「Ａ社系」. The desired document can therefore be retrieved.
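The example can be reproduced with a toy segmenter. The publication does not specify a segmentation algorithm, so the Python sketch below swaps in a simple minimum-cost dictionary segmentation as a stand-in for morphological analysis: a dictionary word costs 1 and an unknown character costs 2. The function names, the cost values, and the contents of both dictionaries are assumptions chosen to mirror the example.

```python
def segment(text, dictionary, unknown_cost=2):
    """Minimum-cost split of text into dictionary words and unknown single characters."""
    n = len(text)
    best = [(0, [])] + [None] * n  # best[i] = (cost, tokens) for text[:i]
    for end in range(1, n + 1):
        candidates = []
        for start in range(end):
            piece = text[start:end]
            cost, tokens = best[start]
            if piece in dictionary:
                candidates.append((cost + 1, tokens + [(piece, True)]))
            elif end - start == 1:
                candidates.append((cost + unknown_cost, tokens + [(piece, False)]))
        best[end] = min(candidates, key=lambda c: c[0])
    return best[n][1]


def keywords(tokens):
    """Keep only the tokens that were found in the dictionary."""
    return [word for word, known in tokens if known]


general = {"Ａ社", "Ａ社系", "系列", "列車"}  # toy stand-in for noun dictionary 11
registered = {"Ａ社系", "列車"}               # toy stand-in for registered dictionary 12

doc_keywords = keywords(segment("Ａ社系列車", general))
query_general = keywords(segment("Ａ社系列", general))
query_registered = keywords(segment("Ａ社系列", registered))
```

With these toy dictionaries, the document yields 「Ａ社系」 and 「列車」, the query analyzed with the general dictionary yields 「Ａ社」 and 「系列」 and so misses the document, and the same query analyzed with the registered dictionary yields 「Ａ社系」, which matches the index.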

[0026]

EFFECT OF THE INVENTION

As is clear from the above description, the present invention makes it possible to extract keywords from an input sentence while giving priority to the keywords actually registered in the index file. This prevents the search omissions caused by the context dependence of morphological analysis during keyword extraction, which are a drawback of searches based on automatic keyword extraction, and achieves a search with a high hit rate. In addition, extracting keywords from the input sentence with the ordinary dictionary as well reduces search omissions further.

[Brief description of the drawings]

FIG. 1 is a block diagram showing one embodiment of the document retrieval device of the present invention.

FIG. 2 is a flowchart showing an example of the operation of the morphological analysis unit 3 in one embodiment of the document retrieval device of the present invention.

[Explanation of symbols]

1: input unit, 2: document storage unit, 3: morphological analysis unit, 4: index storage unit, 5: specifying unit, 6: output unit, 11: noun dictionary storage unit, 12: registered noun dictionary storage unit

Claims

1. A document retrieval device comprising: document storage means for storing a plurality of documents; index storage means for storing an index file that records keywords extracted in advance from each document stored in the document storage means, together with the names of the documents in which those keywords appear; extraction means for analyzing a sentence input to specify a search target and extracting keywords; specifying means for specifying the document to be retrieved by comparing the keywords extracted by the extraction means with the keywords in the index file stored in the index storage means; and output means for reading the document specified by the specifying means from the document storage means and outputting it, wherein the extraction means extracts keywords by performing analysis that uses the keyword information present in the index file stored in the index storage means.

2. The document retrieval device according to claim 1, wherein the extraction means additionally performs analysis without using the keyword information present in the index file stored in the index storage means, and extracts keywords from the sentence.

3. The document retrieval device according to claim 1, further comprising: first and second vocabulary dictionaries; and index creation means that analyzes the documents stored in the document storage means using the first vocabulary dictionary, extracts keywords, creates the index file, and adds the extracted keywords to the second vocabulary dictionary, wherein the extraction means analyzes the sentence using the second vocabulary dictionary and extracts keywords.