JPH08115340A

JPH08115340A - Document retrieval device and generating device for index file used for the same

Info

Publication number: JPH08115340A
Application number: JP6278543A
Authority: JP
Inventors: Hiroko Matsuo; 裕子松尾; Makoto Ando; 誠安藤; Akio Yamashita; 明男山下; Kazuo Aihara; 一雄相原; Tatsuomi Kita; 辰臣喜多; Shinji Kawamoto; 真司川本; Hiroshi Yamaguchi; 浩山口
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1994-10-19
Filing date: 1994-10-19
Publication date: 1996-05-07

Abstract

PURPOSE: To reduce retrieval omissions and increase the speed of keyword retrieval in a document retrieval device that uses an index file. CONSTITUTION: A retrieval character string specification part 709 specifies a retrieval character string. A keyword expansion part 705 finds the broader term (hypernym) of the specified retrieval character string. A retrieval part 711 searches the index file 708 for the broader term found by the keyword expansion part 705. If a match is found, the retrieval part then searches, using the retrieval character string, only the range of keywords classified under that broader term. Thus, when a broader term exists, the retrieval character string is compared only with the keywords in the limited range of the index file belonging to that broader term, so retrieval speed is greatly improved.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a document retrieval device that retrieves documents efficiently by registering a broader term (hypernym) in the index together with each key when a document is registered, and to an index file creation device for use in that document retrieval device.

[0002]

[Prior Art] One conventional document retrieval device uses an index search method. When documents are registered, an index file is created that associates key words with the documents. When a document is retrieved, the device checks whether the search keyword matches a keyword in the index file and, if so, displays the names of the documents containing the keyword and the positions where it appears. In this method, if the search keyword entered by the user is not found in the index file, the search keyword can be expanded, for example with a thesaurus dictionary, and the search repeated with the synonyms as search keys. Alternatively, the search keyword can be expanded from the outset and several synonyms searched for. However, searching the index file with several expanded search keywords makes the search procedure more complex and slower as the number of search keywords grows. Limiting the number of expanded search keywords shortens the search time but risks retrieval omissions.

[0003] To shorten search time and eliminate retrieval omissions, the "database system" described in, for example, Japanese Unexamined Patent Publication No. H3-177972 replaces each word used in a document, at data entry time, with the representative word of the group of words having the same meaning. In the devices described in, for example, Japanese Unexamined Patent Publication Nos. H5-89177 and H5-89178, individual target names and their higher-level general names are registered in advance in a hierarchy of at least two levels. With the device of No. H5-89177, a search by target name not only retrieves that target name but can also, at the user's instruction, expand to all names belonging to the general name above the target name, and the search range can be widened further by moving up to still higher general names. With the device of No. H5-89178, a search by general name is broken down into searches for the target names it covers.

[0004]

[Problems to Be Solved by the Invention] In the prior art described above that replaces the words used in the data with representative words when the data is entered into the database, the original data is altered, so the exact data cannot be stored and retrieved. Moreover, because the user must specify whether each word is to be replaced with a representative word and, for a word belonging to several categories, which category to use, registering a large amount of data at one time is very difficult.

[0005] In the prior art described above that uses a knowledge base in which individual target names and higher-level general names form a tree structure of at least two levels, the tree-structured knowledge base corresponds to a related-term dictionary for expanding a given search term into coordinate terms or narrower terms. Applying this structure to an index file would not, however, improve search speed.

[0006] An object of the present invention is to reduce retrieval omissions and increase the speed of searching by search keyword in a document retrieval device that uses an index file. A further object of the present invention is to create an index file for document retrieval easily and automatically.

[0007]

[Means for Solving the Problems] The document retrieval device of the present invention comprises: document storage means (508) for storing documents; index file storage means (507) for storing the keywords of the documents, classified by broader term, together with document identification information; search character string specification means (501) for specifying a search character string for retrieving documents; broader term extraction means (504, 505) for obtaining the broader term of the search character string specified by the search character string specification means; and search means (506) for checking whether that broader term matches a broader term in the index file and, if it does, searching the range of keywords classified under that broader term using the search character string.

[0008] The index file creation device of the present invention comprises: document input means (102) for reading documents; index file storage means (108) for storing an index file that associates keywords, classified by broader term, with document identification information; keyword extraction means (103) for extracting, from an input document, keywords to be used for retrieval; broader term extraction means (105) for obtaining the broader term of each keyword extracted by the keyword extraction means; and registration means (107) for classifying the keywords extracted by the keyword extraction means under the broader terms obtained by the broader term extraction means and registering the broader terms and keywords in the index file storage means together with the document identification information of the document that was read.

[0009]

[Operation] In the document retrieval device of the present invention, the search character string specification means first specifies a search character string. The broader term extraction means obtains the broader term of the specified search character string. The search means searches the index file with the broader term obtained by the broader term extraction means; that is, it checks whether that broader term matches a broader term in the index file. If it matches, the search means then searches only the range of keywords classified under that broader term, using the search character string. When a broader term exists, the search character string therefore needs to be compared only with the keywords in the limited range of the index file that belongs to that broader term, so search speed is greatly improved. The search by broader term is an extra step compared with the prior art, but because the set of keywords to be compared with the search character string is narrowed, a large overall gain in search speed is achieved. Furthermore, the comparison targets are narrowed not by simply reducing their number, as in the prior art, but on the basis of the broader-term relationship, so retrieval omissions are unlikely.
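The two-stage lookup described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the in-memory list `INDEX`, the `BROADER` table standing in for the thesaurus, the function name `search`, the document names, and the "river" rows are all hypothetical; only the mountain keywords come from the Fig. 3(c) example.

```python
from bisect import bisect_left, bisect_right

# Index entries sorted by (broader term, keyword). The "river" rows and all
# document names are hypothetical.
INDEX = [
    ("mountain", "Akagi", ["doc3"]),
    ("mountain", "Aso", ["doc1", "doc2"]),
    ("mountain", "Fuji", ["doc2"]),
    ("mountain", "Ishizuchi", ["doc1"]),
    ("river", "Shinano", ["doc4"]),
    ("river", "Tone", ["doc3"]),
]

# Hypothetical stand-in for the thesaurus: keyword -> broader term (hypernym).
BROADER = {"Aso": "mountain", "Fuji": "mountain", "Tone": "river"}


def search(keyword):
    """Return (documents, number of keyword comparisons performed)."""
    broader = BROADER.get(keyword)
    if broader is None:
        return [], 0
    # Stage 1: locate the block of entries filed under the broader term.
    terms = [entry[0] for entry in INDEX]
    lo = bisect_left(terms, broader)
    hi = bisect_right(terms, broader)
    # Stage 2: compare the keyword only against entries inside that block.
    compared = 0
    for _, kw, docs in INDEX[lo:hi]:
        compared += 1
        if kw == keyword:
            return docs, compared
    return [], compared
```

With this data, `search("Aso")` never touches the "river" rows, which is the narrowing effect the paragraph describes. The use of `bisect` to find the block is a choice made for this sketch; the patent text does not prescribe how the broader-term range is located.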

[0010] In the index file creation device of the present invention, the keyword extraction means extracts keywords to be used for retrieval from the input document. The broader term extraction means obtains the broader term of each extracted keyword. The registration means classifies the extracted keywords under the broader terms so obtained and registers the broader terms and keywords in the index file storage means together with the document identification information of the document that was read.

[0011]

[Embodiments]

(Embodiment 1) FIG. 1 is a block diagram showing the configuration of an embodiment of an index file creation device that creates an index file when documents are registered in the document retrieval device of the present invention. The index file creation device of this embodiment has: a document storage unit 101 that stores documents; a document input unit 102 that reads a target document so that index entries can be extracted; a keyword extraction unit 103 that extracts keywords from the document input by the document input unit 102; an analysis dictionary 104 consulted for analysis when keywords are extracted; a keyword expansion unit 105 that expands the extracted keywords; a thesaurus dictionary 106 consulted when keywords are expanded; an index registration unit 107 that registers, as the index, the keywords extracted by the keyword extraction unit 103 and the words obtained by the keyword expansion unit 105; and a file storage unit 108 that holds the index file (the index file storage unit 108 below).

[0012] FIG. 2 shows the operation flow of the embodiment configured as above. At the user's instruction, the document input unit 102 stores a document file in the document storage unit 101 and at the same time reads that document file into the keyword extraction unit 103 (step S201). The keyword extraction unit 103 analyzes the read document in order using a well-known analysis technique such as morphological analysis, extracts the independent words (content words), and uses them as keywords, which become the data of the index file (step S202). The analysis dictionary 104 contains a list of independent words that can serve as index entries, and the keyword extraction unit 103 uses it to extract the independent words. The keyword extraction unit 103 also accesses the thesaurus dictionary 106 through the keyword expansion unit 105 and obtains the broader term of each keyword (step S203). The index registration unit 107 classifies the keywords by broader term and stores the broader terms and keywords, together with position identification information and the like, in the index file storage unit 108 (step S204). Steps S201 to S204 are repeated for each of the specified documents (step S205). Within the index file storage unit 108, keywords are stored in ascending order of broader term and, within each broader term, in ascending order of keyword. Because the entries of the dictionary and of the thesaurus dictionary carry serial numbers that do not overlap one another, the sort may be in ascending order of the serial numbers corresponding to the words, or in character code order of the words.
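The registration loop of steps S201 to S205 can be sketched as follows. This is only an illustration under assumptions: `ANALYSIS_DICT` and `THESAURUS` are toy stand-ins for dictionaries 104 and 106, whitespace splitting replaces real morphological analysis, and the function names and document texts are hypothetical. Skipping keywords that have no broader term is also an assumption; the text does not say how such keywords are handled.

```python
from bisect import insort

# Hypothetical stand-ins for analysis dictionary 104 and thesaurus dictionary 106.
ANALYSIS_DICT = {"Aso", "Fuji", "Tone"}
THESAURUS = {"Aso": "mountain", "Fuji": "mountain", "Tone": "river"}


def extract_keywords(text):
    """Toy replacement for morphological analysis: keep the whitespace-separated
    tokens that appear in the analysis dictionary (step S202)."""
    return [tok for tok in text.split() if tok in ANALYSIS_DICT]


def register_documents(documents):
    """Build an index sorted by (broader term, keyword) from {doc_id: text}."""
    index = []  # sorted list of [broader term, keyword, [doc_id, ...]]
    for doc_id, text in documents.items():       # S201, repeated as in S205
        for keyword in extract_keywords(text):    # S202
            broader = THESAURUS.get(keyword)      # S203
            if broader is None:
                continue  # assumption: keywords without a broader term are skipped
            for entry in index:
                if entry[0] == broader and entry[1] == keyword:
                    if doc_id not in entry[2]:
                        entry[2].append(doc_id)
                    break
            else:
                # S204: insert at the position given by the sort order.
                insort(index, [broader, keyword, [doc_id]])
    return index
```

Here each keyword is classified as soon as it is extracted, so the index stays sorted by broader term and then by keyword after every insertion.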

[0013] FIG. 3 shows examples of index files. FIG. 3(a) is an example of a conventional, general index file, which associates each keyword with a list of the document files containing it. FIG. 3(b) is an example of the index file of this embodiment. It differs from the conventional index file of FIG. 3(a) in that each keyword has a broader term and, because the entries are sorted by broader term, the keywords are grouped by broader term. In this embodiment the broader terms and keywords are stored as the character strings of the words themselves, but they may instead be represented by pointers to where the strings are stored. The file list is a list of the positions where the keyword appears, or a pointer to such a list. An appearance position is expressed as a file name, a file serial number, a physical address, or the like. A word stored as a broader term may be identical to a word stored as a keyword under another broader term, but because this embodiment has only two levels, no processing is done to remove such duplication. That is, when one keyword has several broader terms, this embodiment places the same keyword under each of them, although semantic analysis could instead be used to select a broader term each time the keyword appears. FIG. 3(c) shows a concrete example of a created index file: keywords such as "Aso" (阿蘇) and "Ishizuchi" (石鎚), classified under broader terms such as "mountain" (山) and "river" (川), are associated with lists of the documents containing them.
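The record layout of FIG. 3(b) and the rule that a keyword with several broader terms is stored under each of them can be sketched as follows. The class `IndexEntry`, the helper `entries_for`, the second broader term "volcano", and the document names are hypothetical and serve only to illustrate the layout.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class IndexEntry:
    broader: str   # broader term (hypernym), the primary sort key
    keyword: str   # keyword, the secondary sort key
    # File list for the keyword; excluded from comparisons and ordering.
    files: list = field(default_factory=list, compare=False)


def entries_for(keyword, files, broader_terms):
    """Create one entry per broader term, so the keyword hangs under each."""
    return [IndexEntry(broader, keyword, list(files)) for broader in broader_terms]


# "volcano" is a hypothetical second broader term for "Aso".
index = sorted(
    entries_for("Aso", ["doc1"], ["mountain", "volcano"])
    + entries_for("Fuji", ["doc2"], ["mountain"])
)
```

After sorting, the two "mountain" entries are adjacent and "Aso" appears a second time under "volcano", matching the two-level layout in which duplicates are deliberately left in place.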

[0014] (Embodiment 2) Like Embodiment 1, Embodiment 2 creates an index file when documents are registered, but part of the procedure for extracting keywords and registering them in the index file differs. In Embodiment 1, as described above, each time a keyword is extracted its broader term is obtained, the keyword is classified under that broader term, and the broader term and key are saved in the index file together with position identification information; keywords and their broader terms are, so to speak, extracted and registered in parallel. In Embodiment 2, by contrast, each extracted keyword is first saved in the index file together with its position identification information. Only after all keywords have been extracted are the broader terms of the keywords in the index file obtained, the keywords classified by broader term, and the broader terms and keys saved again in the index file together with the position identification information. The block diagram of the device is the same as in Embodiment 1 and is shown in FIG. 1.

[0015] FIG. 4 is a flowchart of keyword extraction and index file registration in Embodiment 2. At the user's instruction, the document input unit 102 begins reading from the document storage unit 101, into the keyword extraction unit 103, the document files from which keywords are to be extracted (step S401). The keyword extraction unit 103 analyzes the read documents in order and extracts independent words with reference to the analysis dictionary 104 (step S402). The index registration unit 107 stores the extracted words in the index file storage unit 108 as index file data (step S403). Once keywords have been extracted from all the documents to be registered (step S404), the index file data is supplemented: the thesaurus dictionary 106 is accessed through the keyword expansion unit 105 to obtain the broader term of each keyword (steps S405 and S406), and the index registration unit 107 classifies the keywords by broader term and stores the broader terms and keywords again in the index file storage unit 108, together with position identification information and the like (step S407). As in Embodiment 1, the resulting index file associates keywords such as "Aso" and "Ishizuchi", classified under broader terms such as "mountain" and "river" as illustrated in FIG. 3(c), with lists of the documents containing them.
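The two-phase procedure of FIG. 4 can be sketched as two separate functions. As before, this is a sketch under assumptions: `ANALYSIS_DICT` and `THESAURUS` are toy stand-ins for dictionaries 104 and 106, whitespace splitting replaces morphological analysis, and the function names and documents are hypothetical.

```python
# Hypothetical stand-ins for analysis dictionary 104 and thesaurus dictionary 106.
ANALYSIS_DICT = {"Aso", "Fuji", "Tone"}
THESAURUS = {"Aso": "mountain", "Fuji": "mountain", "Tone": "river"}


def extract_all(documents):
    """Phase 1 (S401 to S404): store every keyword with its documents,
    without any classification."""
    flat = {}
    for doc_id, text in documents.items():
        for token in text.split():  # toy stand-in for morphological analysis
            if token in ANALYSIS_DICT:
                docs = flat.setdefault(token, [])
                if doc_id not in docs:
                    docs.append(doc_id)
    return flat


def classify_all(flat):
    """Phase 2 (S405 to S407): look up the broader term of each stored keyword
    and store the index again, sorted by (broader term, keyword)."""
    rows = [(THESAURUS[kw], kw, docs) for kw, docs in flat.items() if kw in THESAURUS]
    return sorted(rows, key=lambda row: (row[0], row[1]))
```

Separating the phases means the thesaurus is consulted once per distinct keyword rather than once per occurrence, and the second phase can be deferred until all documents have been read.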

[0016] (Embodiment 3) Embodiment 3 is an example of a retrieval device that searches using the index file created by Embodiment 1 or Embodiment 2; FIG. 5 is its block diagram. The retrieval device comprises: a search character string specification unit 501 for specifying a search character string for retrieving documents; a search keyword extraction unit 502 that analyzes the search character string specified by the search character string specification unit 501 and extracts search keys; an analysis dictionary 503 consulted by the search keyword extraction unit 502 to extract search keywords; a keyword expansion unit 504 that obtains the broader term of each keyword extracted by the search keyword extraction unit 502; a thesaurus dictionary 505 consulted by the keyword expansion unit 504 to obtain broader terms; an index file storage unit 507 that stores an index file associating keywords, classified by broader term, with the appearance positions of the documents containing them; a search unit 506 that uses the index file to find the appearance positions of the search keys; a document storage unit 508 that stores documents; and a display unit 509 that displays the search results.

[0017] FIG. 6 is a flowchart showing the operation of the retrieval device of Embodiment 3 configured as above. The user specifies the search character string with the search character string specification unit 501 (step S601). The search keyword extraction unit 502 analyzes the search character string and extracts search keywords (step S602), using the analysis dictionary 503. For example, from the search keyword "search system" (検索システム), the two words "search" and "system" are extracted. The search keyword extraction unit 502 also accesses the thesaurus dictionary 505 through the keyword expansion unit 504, obtains the broader term of the search keyword, and sends the result to the search unit 506 (step S603). The search unit 506 searches the index file in the index file storage unit 507. It first checks whether the broader term obtained by the keyword expansion unit 504 matches any broader term in the index file (step S604). If a matching broader term exists, the search continues with the search keyword or with its synonyms (steps S605, S606, S607, and S608); if none exists, the search for that search keyword ends (the "no" branch at step S605).

[0018] The case in which a matching broader term exists is explained further. Whether to perform a thesaurus search is either specified in advance, or the user is told that a broader term was found and the device waits for the user's choice. If no thesaurus search is to be performed, the search keyword obtained in step S602 is used to search the part of the index file that belongs to that broader term (step S607). Suppose, for example, that the index file contains the information illustrated in FIG. 3(c) and the search keyword is "Aso". The index file is searched for its broader term, "mountain", and a match is found. The device then determines whether to perform a thesaurus search; if not, it searches by comparing the specified keyword "Aso" with the keywords under "mountain" in the index file. The search is thus confined to "Aso", "Ishizuchi", "Akagi", and "Fuji", which speeds it up.

[0019] When the search keyword is to be expanded with the thesaurus, the specified search keyword is expanded using the thesaurus dictionary to obtain its synonyms, and the index file is searched with those synonyms within the range where the broader term matches (steps S606 and S608). Here too, as in the "Aso" example above, the search is confined to the range of the broader term "mountain" in the index file, so it is fast. When there are several search keywords, steps S603 to S608 are repeated once for each search keyword (step S609).
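The per-keyword loop of steps S603 to S609, with and without thesaurus expansion, can be sketched as follows. The nested-dictionary index, the `BROADER` and `SYNONYMS` tables, the synonym "Mt. Aso", the function name, and the document names are all hypothetical. Searching the original keyword alongside its synonyms in the thesaurus branch is also an assumption, and the step numbers in the comments follow the text only approximately.

```python
# Index grouped by broader term: broader term -> {keyword: documents}.
# Mountain keywords follow Fig. 3(c); everything else is hypothetical.
INDEX = {
    "mountain": {"Akagi": ["doc3"], "Aso": ["doc1"], "Fuji": ["doc2"], "Ishizuchi": ["doc1"]},
    "river": {"Tone": ["doc3"]},
}
BROADER = {"Aso": "mountain", "Mt. Aso": "mountain", "Tone": "river"}
SYNONYMS = {"Mt. Aso": ["Aso"]}


def search(keywords, use_thesaurus=False):
    """Return {search keyword: documents}, searching only inside the group of
    the keyword's broader term."""
    results = {}
    for keyword in keywords:                    # repeated per keyword (S609)
        broader = BROADER.get(keyword)          # obtain the broader term (S603)
        group = INDEX.get(broader)              # look it up in the index (S604)
        if group is None:                       # no matching broader term (S605: no)
            results[keyword] = []
            continue
        if use_thesaurus:                       # thesaurus search (S606, S608)
            candidates = [keyword] + SYNONYMS.get(keyword, [])
        else:                                   # plain keyword search (S607)
            candidates = [keyword]
        docs = []
        for candidate in candidates:
            for doc in group.get(candidate, []):
                if doc not in docs:
                    docs.append(doc)
        results[keyword] = docs
    return results
```

In this sketch `search(["Mt. Aso"])` finds nothing because only "Aso" is indexed, while `search(["Mt. Aso"], use_thesaurus=True)` finds it through the synonym, and in both cases only the "mountain" group is examined.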

[0020] The search results are displayed on the display unit 509 (step S610) and, if the user so requests (step S611), the document storage unit 508 is accessed and the positions where the keyword appears in the documents are displayed as well (step S612).

[0021] In this embodiment, an index file whose keywords are classified by broader term is first searched with the broader term of the search keyword; if that broader term is present, only the keywords in the limited part of the index file belonging to it need to be compared. The search by broader term is an extra step compared with the prior art, but overall search speed is greatly improved. Moreover, if the device is configured to report whether a broader term was found and to ask the user whether to search by synonym, the user knows in advance whether a broader term exists and can judge whether a synonym search would be worthwhile before choosing, so the desired results are obtained efficiently. When all synonyms are to be searched, the result can be obtained by the broader-term search alone, without expanding the search key with the thesaurus.
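One way to read the closing remark of this paragraph is that the keywords grouped under a broader term (hypernym) already form the set of related terms, so listing that group replaces thesaurus expansion. A minimal sketch under that assumption follows; the nested-dictionary layout, the function name `search_by_broader_term`, and the document names are hypothetical.

```python
# Index grouped by broader term. Keywords follow the Fig. 3(c) example;
# document names are hypothetical.
INDEX = {
    "mountain": {"Aso": ["doc1"], "Ishizuchi": ["doc1"], "Akagi": ["doc3"], "Fuji": ["doc2"]},
}


def search_by_broader_term(broader):
    """Return every keyword filed under the broader term with its documents,
    without consulting a thesaurus for synonyms."""
    group = INDEX.get(broader, {})
    return {keyword: list(docs) for keyword, docs in sorted(group.items())}
```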

[0022] (Embodiment 4) This embodiment combines the index file creation device of Embodiment 1 or Embodiment 2 with the retrieval device of Embodiment 3. When an index file is created, a document storage unit 701, document input unit 702, keyword extraction unit 703, analysis dictionary 704, index file storage unit 708, keyword expansion unit 705, index registration unit 707, and thesaurus dictionary 706 function as an index file creation device with the same configuration and operation as in Embodiment 1 or Embodiment 2. When the index file is used for a search, a search character string specification unit 709, search keyword extraction unit 710, search unit 711, and display unit 712, together with the keyword expansion unit 705, thesaurus dictionary 706, analysis dictionary 704, index file storage unit 708, and document storage unit 701, function as a retrieval device with the same configuration and operation as in Embodiment 3. With the device of Embodiment 4, a single device can both create and register an index file and search with it.

[0023] (Embodiment 5) FIG. 8 is a block diagram showing the configuration of Embodiment 5, which differs from Embodiment 4, shown in FIG. 7, only in the configuration of the dictionary section. If keyword extraction is restricted to the words registered in the thesaurus, the analysis dictionary used for keyword extraction and the thesaurus can be unified into a single dictionary. In FIG. 8, the dictionary 804 is such a unified thesaurus and analysis dictionary; the rest of the configuration is the same as in Embodiment 4 of FIG. 7. The operations of creating and registering an index file and of searching with it are also almost the same.

【００２４】（実施例の変形例）なお、実施例４および
実施例５において、インデックスの作成登録には図４
の手順を用いることができるが、その際に、図４のステ
ップＳ４０５、ステップＳ４０６、ステップＳ４０７を
バックグラウンドで処理するようにしてもよい。この場
合、バックグラウンド処理中に検索の指令が出ると、図
６の手順に従って検索を行う。(Modification of the Embodiments) In Embodiments 4 and 5, the procedure of FIG. 4 can be used to create and register the index. In that case, steps S405, S406, and S407 of FIG. 4 may be processed in the background. If a search command is issued during this background processing, the search is performed according to the procedure of FIG. 6.
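One way to picture this modification is a worker thread that performs registration while searches remain possible. The sketch below is only an assumption about how such background processing could be arranged: the body of `register_in_background` is a placeholder for steps S405 to S407, whose details are not reproduced here, and the lock, queue, and all names are hypothetical.

```python
import queue
import threading

index = {}                      # broader term -> {keyword -> set of doc ids}
index_lock = threading.Lock()   # guards index against concurrent access
pending = queue.Queue()         # work handed to the background thread


def register_in_background():
    """Placeholder for steps S405 to S407, run off the main thread."""
    while True:
        item = pending.get()
        if item is None:        # sentinel: stop the worker
            pending.task_done()
            break
        broader, keyword, doc_id = item
        with index_lock:
            index.setdefault(broader, {}).setdefault(keyword, set()).add(doc_id)
        pending.task_done()


def search(broader, keyword):
    """Search may be called while registration is still in progress."""
    with index_lock:
        return set(index.get(broader, {}).get(keyword, set()))


worker = threading.Thread(target=register_in_background, daemon=True)
worker.start()
pending.put(("output device", "printer", "doc-1"))
pending.put(("input device", "scanner", "doc-2"))
early = search("output device", "printer")  # may or may not see doc-1 yet
pending.join()                              # wait until the queue is drained
```

A search issued before the queue is drained simply sees whatever has been registered so far; whether that is acceptable depends on the application.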

【００２５】また、以上の各実施例において、インデッ
クスに格納するキーや上位語、検索キーは図３（ｃ）に
示すように文字列であるとしたが、文字列の代わりに同
図（ｄ）に示すように解析用辞書やシソーラス辞書にお
ける通し番号を用いるようにしてもよい。この場合、固
定長で検索できるので、検索の際のアドレス計算が簡単
となり、検索速度が向上し、さらにインデックスファイ
ルの大きさも小さくなる。In each of the above embodiments, the keys and broader terms stored in the index, and the search keys, are character strings as shown in FIG. 3(c). Instead of character strings, serial numbers in the analysis dictionary or the thesaurus dictionary may be used, as shown in FIG. 3(d). In this case the entries have a fixed length, so address calculation during a search is simpler, search speed improves, and the index file becomes smaller.
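The benefit of fixed-length entries can be illustrated with a short sketch. The record layout below (three unsigned 32-bit integers for broader-term number, keyword number, and document number) and the use of binary search are assumptions chosen for the example; the patent does not specify a byte layout.

```python
import struct

RECORD = struct.Struct("<III")  # broader-term id, keyword id, document id


def pack_index(entries):
    """Pack (broader_id, keyword_id, doc_id) tuples, sorted, into bytes."""
    return b"".join(RECORD.pack(*e) for e in sorted(entries))


def read_record(blob, i):
    """Record i starts at byte i * RECORD.size: no scanning for delimiters."""
    return RECORD.unpack_from(blob, i * RECORD.size)


def find_docs(blob, broader_id, keyword_id):
    """Binary search for the first matching record, then collect matches."""
    count = len(blob) // RECORD.size
    lo, hi = 0, count
    target = (broader_id, keyword_id)
    while lo < hi:
        mid = (lo + hi) // 2
        if read_record(blob, mid)[:2] < target:
            lo = mid + 1
        else:
            hi = mid
    docs = []
    while lo < count and read_record(blob, lo)[:2] == target:
        docs.append(read_record(blob, lo)[2])
        lo += 1
    return docs
```

With variable-length strings, locating the i-th entry requires either scanning or a separate offset table; with fixed-length numbers the offset is a single multiplication.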

【００２６】[0026]

【発明の効果】本発明の文書検索装置によれば、上位語
で分類してあるキーワードを持つインデックスファイル
を、まず検索文字列の上位語で検索し、上位語があれば
その上位語に属する範囲の限定されたキーワードとの比
較による検索をすればよいので、上位語による検索の処
理が従来に比べ余分に必要があるとしても、総合的には
検索速度が大幅に向上する。また、検索文字列の比較対
象を絞る際に、従来技術のように単に数を減らすのでは
なく、上位語という関連に基づいて対象を絞るので、検
索漏れが生ずる恐れは少ない。According to the document search device of the present invention, an index file whose keywords are classified by broader term is first searched using the broader term of the search character string. If that broader term is found, the search character string only needs to be compared with the limited range of keywords belonging to it. Even though the broader-term search is an extra step compared with the conventional approach, overall search speed improves greatly. In addition, the comparison targets are narrowed on the basis of the broader-term relationship rather than by simply reducing their number as in the conventional technique, so search omissions are unlikely.
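The two-stage search described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the in-memory `INDEX` and `THESAURUS`, the sample data, and the fallback of scanning every group when no broader term is known are all hypothetical. The function also counts keyword comparisons so the narrowing effect is visible.

```python
# Hypothetical thesaurus: keyword -> broader term.
THESAURUS = {
    "printer": "output device",
    "plotter": "output device",
    "scanner": "input device",
    "keyboard": "input device",
}

# Hypothetical index: broader term -> list of (keyword, document ids).
INDEX = {
    "output device": [("plotter", ["doc-2"]), ("printer", ["doc-1", "doc-3"])],
    "input device": [("keyboard", ["doc-5"]), ("scanner", ["doc-4"])],
}


def search(search_string, index=INDEX, thesaurus=THESAURUS):
    """Return (matching doc ids, number of keyword comparisons made)."""
    broader = thesaurus.get(search_string)
    if broader in index:
        # Stage 1 succeeded: compare only within this broader term's group.
        groups = [index[broader]]
    else:
        # Assumed fallback: no usable broader term, so scan every group.
        groups = list(index.values())
    docs, comparisons = [], 0
    for group in groups:
        for keyword, doc_ids in group:
            comparisons += 1
            if keyword == search_string:
                docs.extend(doc_ids)
    return docs, comparisons
```

In this toy index, a search for a term with a known broader term touches two keywords, while a term without one touches all four.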

【００２７】本発明のインデックス作成装置によれば、
指定された文書を解析用辞書により解析してキーワード
を抽出し、その抽出したキーワードの上位語を求め、抽
出したキーワードを上位語で分類してインデックスファ
イルを作成する。その作成は自動的に行うことができ、
かつ容易に行うことができる。According to the index creation device of the present invention, a designated document is analyzed with the analysis dictionary to extract keywords, the broader term of each extracted keyword is obtained, and the extracted keywords are classified by broader term to create an index file. The index file can thus be created automatically and easily.
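The index creation flow summarized above (extract keywords, look up each broader term, classify, register) can be sketched in a few lines. Everything here is a simplifying assumption for illustration: whitespace tokenization stands in for dictionary-based analysis (Japanese text would need morphological analysis), and the `UNCLASSIFIED` bucket for keywords without a broader term is hypothetical.

```python
from collections import defaultdict

UNCLASSIFIED = "(unclassified)"  # assumed bucket for keywords with no broader term


def build_index(documents, analysis_dictionary, thesaurus):
    """Build {broader term: {keyword: [doc ids]}} from {doc id: text}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        # Keyword extraction: keep only words found in the analysis dictionary.
        words = set(text.lower().split())
        for keyword in sorted(words & analysis_dictionary):
            # Broader-term lookup, then registration under that term.
            broader = thesaurus.get(keyword, UNCLASSIFIED)
            index[broader][keyword].append(doc_id)
    return {broader: dict(keywords) for broader, keywords in index.items()}
```

For example, with documents `{"doc-1": "The printer jammed", "doc-2": "Scanner and printer setup"}`, an analysis dictionary containing `printer`, `scanner`, and `jammed`, and a thesaurus mapping `printer` to `output device` and `scanner` to `input device`, both documents are registered under `output device` / `printer`.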

[Brief description of drawings]

【図１】実施例１（および実施例２）の概略ブロック図FIG. 1 is a schematic block diagram of a first embodiment (and a second embodiment).

【図２】実施例１のインデックス作成処理の流れを示す
図FIG. 2 is a diagram showing a flow of index creation processing according to the first embodiment.

【図３】インデックスファイルの例を示す図で、（ａ）
は従来例、（ｂ）〜（ｄ）は本発明の例を示す図FIG. 3 is a diagram showing examples of index files; (a) is a conventional example, and (b) to (d) are examples according to the present invention.

【図４】実施例２のインデックス作成の流れを示す図FIG. 4 is a diagram showing a flow of creating an index according to the second embodiment.

【図５】実施例３の概略ブロック図FIG. 5 is a schematic block diagram of a third embodiment.

【図６】実施例３の検索処理の流れを示す図FIG. 6 is a diagram showing a flow of search processing according to the third embodiment.

【図７】実施例４の概略ブロック図FIG. 7 is a schematic block diagram of a fourth embodiment.

【図８】実施例５の概略ブロック図FIG. 8 is a schematic block diagram of a fifth embodiment.

[Explanation of symbols]

１０１…文書記憶部、１０２…文書入力部、１０３…キ
ーワード抽出部、１０４…解析用辞書、１０８…インデ
ックスファイル記憶部、１０５…キーワード展開部、１
０７…インデックス登録部、１０６…シソーラス辞書、
５０１…検索文字列指定部、５０２…検索キーワード抽
出部、５０３…解析用辞書、５０４…キーワード展開
部、５０５…シソーラス辞書、５０６…検索部、５０７
…インデックスファイル記憶部、５０８…文書記憶部、
５０９…表示部。101 ... document storage unit, 102 ... document input unit, 103 ... keyword extraction unit, 104 ... analysis dictionary, 108 ... index file storage unit, 105 ... keyword expansion unit, 107 ... index registration unit, 106 ... thesaurus dictionary, 501 ... search character string designating unit, 502 ... search keyword extracting unit, 503 ... analysis dictionary, 504 ... keyword expansion unit, 505 ... thesaurus dictionary, 506 ... search unit, 507 ... index file storage unit, 508 ... document storage unit, 509 ... display unit.

Continuation of front page
(72) Inventor: Akio Yamashita, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki City, Kanagawa Prefecture
(72) Inventor: Kazuo Aihara, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki City, Kanagawa Prefecture
(72) Inventor: Tatsuomi Kita, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki City, Kanagawa Prefecture
(72) Inventor: Shinji Kawamoto, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki City, Kanagawa Prefecture
(72) Inventor: Hiroshi Yamaguchi, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki City, Kanagawa Prefecture

Claims

[Claims]

1. A document search device comprising: document storage means for storing documents; index file storage means for storing keywords for the documents, classified by broader term, together with document identification information; search character string specifying means for specifying a search character string for searching for a document; broader term extracting means for obtaining a broader term of the search character string specified by the search character string specifying means; and search means for searching the index file for a broader term matching the obtained broader term and, if a match is found, searching the range of the keyword group classified under that broader term using the search character string.

2. An index file creating apparatus comprising: document input means for reading a document; index file storage means for storing an index file in which keywords classified by broader term are associated with document identification information; keyword extracting means for extracting, from the input document, keywords used for retrieval; broader term extracting means for obtaining a broader term of each keyword extracted by the keyword extracting means; and registration means for classifying the keywords extracted by the keyword extracting means by the broader terms obtained by the broader term extracting means, and registering the broader terms and keywords, together with the document identification information of the read document, in the index file storage means.