JPH041853A

JPH041853A - Document retrieving device

Info

Publication number: JPH041853A
Application number: JP2103606A
Authority: JP
Inventors: Takeshi Shichino; 七野　剛; Yasutada Nagano; 永野　靖忠; Satoshi Tanaka; 聡田中; Takao Hirata; 平田　孝雄
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1990-04-19
Filing date: 1990-04-19
Publication date: 1992-01-07
Anticipated expiration: 2011-10-30
Also published as: JP2549745B2

Abstract

PURPOSE: To improve document retrieval efficiency by providing an index extraction means that generates an index, and a retrieving means that displays, at retrieval time, a document format stored in a document format storage means and retrieves a document by reference to the index based on the retrieval conditions entered. CONSTITUTION: A document format storage means 9 stores the formats of the documents to be retrieved, and an index extraction means 10 extracts the data necessary for an index based on the format stored in the means 9 and generates the index. A retrieving means 12 then displays the format stored in the means 9 at retrieval time and retrieves a document by reference to the index based on the retrieval conditions entered. As a result, the index is produced automatically, and the data to be extracted from each document is designated accurately by its format. Furthermore, the searcher can supply the retrieval conditions in a fill-in-the-blank manner without knowing any special retrieval language. Document retrieval efficiency is thereby improved.

Description

[Detailed Description of the Invention] [Field of Industrial Application] The present invention relates to a document retrieval device that uses an index to retrieve a designated document from among a plurality of stored documents, and in particular to a retrieval device for documents whose format is predetermined.

[Prior Art] To construct a document database using a computer system, it is necessary to create an index containing keywords that appropriately represent the document contents according to the purpose of use. In conventional document retrieval devices the index was generally created by hand: a person looked at the document itself, found the necessary keywords in it based on the format and the like, and entered them into a table-format index. Constructing a document database therefore required a great deal of cost and time.

To address this, a database system such as that shown in FIG. 11 has been put into practical use. In the figure, 1 is a document input means such as a word processor for inputting documents; 2 is a document storage means such as a magnetic disk device for storing the input documents; 3 is an automatic keyword extraction means that automatically extracts keywords from the stored documents using a natural language processing function; 4 is a keyword storage means for storing the extracted keywords; 5 is a search request input means such as a keyboard and display for inputting keywords and the like that serve as search conditions in a search request; 6 is a search means that, based on the input keywords, refers to the keyword storage means 4 and retrieves the designated document from among the documents stored in the document storage means 2; and 7 is a search result output means such as a display for outputting the search result.

In this system, the input document is first segmented into individual words based on grammar (word segmentation), and the resulting words are then evaluated to remove unneeded terms such as particles, so that the keywords needed at retrieval time are extracted and set automatically.
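The prior-art extraction described above can be illustrated with a minimal sketch. It is not the implementation of any actual system: English text stands in for Japanese, a regular expression stands in for grammar-based word segmentation, and the `FUNCTION_WORDS` list and the `extract_keywords` name are assumptions made for illustration only.

```python
import re

# Assumed stand-in for "unneeded terms such as particles": a tiny English
# function-word list. A real system would use a morphological analyzer.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "and", "for"}


def extract_keywords(text):
    """Segment text into words and keep every word that is not a function word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    keywords = []
    for word in words:
        # Every remaining word becomes a keyword, whether or not it is useful.
        if word not in FUNCTION_WORDS and word not in keywords:
            keywords.append(word)
    return keywords


keywords = extract_keywords("The design of the software is described in the report")
print(keywords)
```

Note that a generic word such as "described" survives the filter alongside the useful terms, which is the kind of ambiguous or inappropriate keyword that such an approach tends to produce.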

[Problems to be Solved by the Invention] Because conventional document retrieval devices are configured as described above, the index generally has to be created by hand, so constructing a database requires a great deal of cost and time. Devices that automatically extract keywords from documents using a natural language processing function have also been put into practical use, but because every noun, verb, and so on becomes a keyword, the extracted keywords are often ambiguous or inappropriate. As a result, extracting and setting keywords takes time, and retrieval cannot be performed efficiently.

The present invention was made to solve the above problems. Its object is to obtain a document retrieval device that can automatically extract only the keywords that are truly necessary, can automatically and accurately extract detailed items of the document contents as an index, and can perform retrieval efficiently.

[Means for Solving the Problems] A document retrieval device according to the present invention comprises: a document format storage means that stores the format of the documents to be retrieved; an index extraction means that extracts the data necessary for an index from a document to be retrieved, using the format stored in the document format storage means, and generates the index; and a search means that, at retrieval time, displays the format stored in the document format storage means and retrieves a document using the index based on the search conditions entered against the displayed format.

[Operation] The present invention focuses on the fact that documents are generally written according to a fixed format. The document format is stored in advance, and the index is generated automatically by extracting the data necessary for the index from the documents to be retrieved using that format.

Using the format in this way makes index creation accurate, because the data to be extracted from a document is specified by the format. Moreover, because the document structure can be analyzed using the format, detailed items of the document contents can be made into the index. In addition, because the format can be presented to the searcher at retrieval time, the searcher can easily give search conditions in a fill-in-the-blank manner without knowing any special search language.

[Embodiment] An embodiment of the present invention will be described below with reference to the drawings.

FIG. 1 is a block diagram showing the overall configuration of the document retrieval device of the embodiment. Parts identical or corresponding to those in FIG. 11 are given the same reference numerals, and their description is omitted. In the figure, 8 is a document format input means such as a word processor for inputting the format defined for the documents to be retrieved; 9 is a document format storage means such as a magnetic disk device for storing the input document format; 10 is an index extraction means that extracts the data necessary for an index from the documents stored in the document storage means 2, using the format stored in the document format storage means 9, and generates the index; 11 is an index storage means such as a magnetic disk device in which the index generated by the index extraction means 10 is stored; and 12 is a search means that, at retrieval time, displays the document format stored in the document format storage means 9 on the display of the search request input means 5 and, based on the search conditions entered from the keyboard against it, retrieves documents in the document storage means 2 using the index in the index storage means 11. The search result is output to the search result output means 7, such as a display. The index extraction means 10 and the search means 12 are realized by a processor constituting a computer system and by software running on it.

Next, the operation will be described.

As stated above, the present application focuses on the fact that documents are generally written according to a fixed format. In particular, technical documents that are the target of document databases, such as research reports, standards, and specifications, share a common format, as shown in FIG. 2: the cover on page 1 and the table of contents on page 2 are divided by ruled-line frames into fields indicating the type, the title, and so on. Besides the above, the formats handled in this application may also be tables within a document (for example, a standards table for some item), and they need not have frames, as is the case for a specification.

First, the document itself is input by the document input means 1 and stored in the document storage means 2, as in the conventional case. In this device, the format of the document is additionally input from the document format input means 8 and stored in the document format storage means 9. One format suffices when only documents of the same format are handled. When documents of different formats are handled, one format is input for each kind, and the kind is specified at index extraction time and at retrieval time. A document generally consists of a collection of paragraphs (logical units that can each be regarded as a meaningful group). Accordingly, as shown in FIG. 3, the data input as a format comprises the document structure (frames and paragraph titles), which expresses what paragraphs the document is composed of; the positions of the data to be extracted (the hatched areas); and the correspondence between the extracted data and the index (the pointers shown by arrows). In the case of FIG. 3, a paragraph is the interior of an area enclosed by ruled lines.
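The format data described above (document structure, extraction positions, and pointers into the index) could be modeled roughly as follows. This is only a sketch under assumptions: the class names `ParagraphSpec` and `DocumentFormat`, the attribute names, and the sample field names are all hypothetical and do not come from the specification or from FIG. 3.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParagraphSpec:
    """One paragraph of a document format (hypothetical model)."""
    paragraph_id: int                   # position of the paragraph in the format
    title: str                          # paragraph title used to find the paragraph
    extract: bool = True                # whether the body is data to be extracted
    index_field: Optional[str] = None   # index field the extracted data points to


@dataclass
class DocumentFormat:
    """A format of one kind: an ordered list of paragraph specifications."""
    kind: str
    paragraphs: List[ParagraphSpec] = field(default_factory=list)

    def index_fields(self) -> List[str]:
        """Names of the index fields that this format populates."""
        return [p.index_field for p in self.paragraphs
                if p.extract and p.index_field is not None]


# Example format for a research-report cover; the field names are invented.
report_cover = DocumentFormat(
    kind="research report",
    paragraphs=[
        ParagraphSpec(1, "Type", index_field="type"),
        ParagraphSpec(2, "Title", index_field="title"),
        ParagraphSpec(3, "Author", index_field="author"),
        ParagraphSpec(4, "Summary", index_field="summary"),
    ],
)
print(report_cover.index_fields())
```

Storing one `DocumentFormat` per kind of document mirrors the statement that a format is input for each kind and selected by kind at extraction and retrieval time.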

Each time a document is input from the document input means 1 and stored in the document storage means 2, the index extraction means 10 executes the series of processes shown in FIG. 4(a): document structure analysis (step S1), extraction data determination (step S2), and index generation (step S3). FIG. 4(b) shows the input, processing, and output of each step, and FIG. 4(c) shows the data flow. The document structure analysis in that figure takes as its example the generation of an index from a document cover enclosed by ruled lines, but document structure analysis can also handle paragraphs that are not enclosed by ruled lines, as shown in FIG. 5(a), and paragraphs of variable length, as shown in FIG. 5(b).

FIG. 6 is a diagram for explaining the document structure analysis in more detail. The document structure analysis (step S1) is performed by an inference unit and is divided into three processes: document element analysis (step S11), which decomposes the document data retrieved from the document storage means 2 into minimum constituent elements; paragraph correspondence (step S12), which associates the resulting minimum constituent elements with the paragraphs of the format data retrieved from the document format storage means 9 (several alternative associations are possible); and structure analysis (step S13), which selects and outputs the most likely association from among those obtained. The processing in each is as follows.

(1) Step S11 (document element analysis): Here, the contents of the document data are decomposed into minimum constituent elements, and the elements are numbered in order. A numbered minimum constituent element is called element data. A minimum constituent element here is one of: a) a line; b) a row in a table; c) a non-text region such as a figure or graph. A line is a character string running up to a line-break symbol or a table ruled line.
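Step S11 can be sketched as follows for plain text only. The sketch assumes that a table ruled line is written as the character "|" and ignores non-text regions such as figures; the function name `split_into_elements` and the sample text are hypothetical.

```python
def split_into_elements(document_text: str) -> list[tuple[int, str]]:
    """Split text into numbered elements (element data).

    A line ends at a line break or at a table ruled line, which this sketch
    assumes is written as the character "|".
    """
    elements = []
    number = 1
    for raw_line in document_text.splitlines():
        for cell in raw_line.split("|"):
            text = cell.strip()
            if text:  # skip empty fragments between ruled lines
                elements.append((number, text))
                number += 1
    return elements


sample = "Type | Research report\nTitle | Software design methods\nAuthor | Yamaguchi"
elements = split_into_elements(sample)
for number, text in elements:
    print(number, text)
```

Numbering the elements in document order is what later allows the order of paragraphs to be checked.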

(2) Step S12 (paragraph correspondence): Here, the beginning of each item of element data is matched against the paragraph title (the keyword for finding the paragraph) of each paragraph in the format data, and the element data that could begin each paragraph are selected (more than one may be selected). These associations are called paragraph correspondence data. A concrete example is shown in FIGS. 7(a) and 7(b). From the two sets of data shown in FIGS. 7(a) and 7(b), the five items of paragraph correspondence data shown in FIG. 8 are obtained. Items ■ and ■ shown in that figure contradict each other, but at this stage both are kept as candidates, and one of the two is chosen in the next step, S13 (structure analysis).
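The matching in step S12 can be sketched as a prefix comparison that keeps every candidate, including conflicting ones. The function name `match_paragraphs`, the tuple layout, and the sample data are assumptions for illustration; they are not the data of FIGS. 7 and 8.

```python
def match_paragraphs(elements, titles):
    """Return candidate (paragraph_id, element_number) pairs.

    elements: list of (element_number, text)
    titles:   list of (paragraph_id, paragraph_title)
    Every element whose text starts with a paragraph title is kept as a
    candidate, so one paragraph may receive several candidates.
    """
    candidates = []
    for paragraph_id, title in titles:
        for number, text in elements:
            if text.startswith(title):
                candidates.append((paragraph_id, number))
    return candidates


elements = [
    (1, "Type: research report"),
    (2, "Title: Software design methods"),
    (3, "Author: Yamaguchi"),
    (4, "Summary: This report surveys software design."),
    (5, "Title: see also the earlier report"),
]
titles = [(1, "Type"), (2, "Title"), (3, "Author"), (4, "Summary")]
candidates = match_paragraphs(elements, titles)
print(candidates)
```

In this sample the "Title" paragraph receives two candidates (elements 2 and 5). Both are retained here, and choosing between them is left to the structure analysis step.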

(3) Step S13 (structure analysis): The most appropriate combination is selected from the paragraph correspondence data obtained in step S12, using a production rule set, and the finally determined set of paragraph correspondence data is retained as paragraph segmentation data. This determines the element data at the head of each paragraph, and hence also the set of element data that makes up each paragraph.

Possible production rule sets include the consistency of the paragraph order and the plausibility of an element as a paragraph. Examples of production rule sets are given below.

- Consistency of paragraph order: a) In the paragraph correspondence data, any association that breaks the ascending order of the paragraph ids is removed from the candidates.

- Plausibility as a paragraph: b) If the paragraph title is "Title" and the corresponding minimum constituent element is 40 characters or longer, the association is removed from the candidates.

In the paragraph correspondence data above (FIG. 8), 3′ is removed from the candidates by rule set b) and 2′ by rule set a). The remaining associations are then selected as the most appropriate combination.
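The two example rules can be sketched as a filter over the candidates. This is a simplification rather than a production-rule engine: rule a) is applied as a single greedy pass in document order, the length in rule b) is measured on the whole element text, and `TITLE_PARAGRAPH_ID`, `apply_rules`, and the sample data are hypothetical.

```python
TITLE_PARAGRAPH_ID = 2   # assumed id of the "Title" paragraph
MAX_TITLE_LENGTH = 40    # rule b): elements this long or longer are rejected


def apply_rules(candidates, element_text):
    """Filter (paragraph_id, element_number) candidates with rules a) and b)."""
    # Rule b): a title candidate of 40 characters or more is implausible.
    plausible = [
        (pid, num) for pid, num in candidates
        if not (pid == TITLE_PARAGRAPH_ID
                and len(element_text[num]) >= MAX_TITLE_LENGTH)
    ]
    # Rule a): walk the candidates in document order and drop any whose
    # paragraph id does not continue the ascending sequence.
    kept, last_pid = [], 0
    for pid, num in sorted(plausible, key=lambda c: c[1]):
        if pid > last_pid:
            kept.append((pid, num))
            last_pid = pid
    return kept


element_text = {
    1: "Type: research report",
    2: "Title: A very long sentence that merely happens to begin with the word Title",
    3: "Title: Software design methods",
    4: "Author: Yamaguchi",
    5: "Summary: This report surveys software design.",
    6: "Author: unrelated later line",
}
candidates = [(1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (3, 6)]
selected = apply_rules(candidates, element_text)
print(selected)
```

Here the long element 2 is rejected by rule b) and the out-of-order element 6 by rule a), leaving one head element per paragraph. A greedy pass can choose badly when an early spurious candidate has a high paragraph id, so a fuller implementation would compare whole combinations.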

When the document structure analysis (step S1) has finished as described above, the next process, extraction data determination (step S2), takes the set of element data constituting each paragraph obtained in step S13 (structure analysis), removes that paragraph's paragraph title, and extracts the remainder as the data that will become the index (see FIG. 9 and FIG. 4(c)).
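Step S2 can be sketched as follows, assuming that each paragraph runs from its head element up to the element before the next paragraph's head, and that the head element begins with the paragraph title. The function name `extract_paragraph_data`, the argument layout, and the sample data are hypothetical.

```python
def extract_paragraph_data(elements, starts, titles):
    """Return {paragraph_id: extracted text} for each segmented paragraph.

    elements: list of (element_number, text) in document order
    starts:   {paragraph_id: element_number of the paragraph's first element}
    titles:   {paragraph_id: paragraph_title}
    """
    ordered = sorted(starts.items(), key=lambda item: item[1])
    extracted = {}
    for i, (pid, first) in enumerate(ordered):
        # A paragraph runs up to the element before the next paragraph start.
        end = ordered[i + 1][1] if i + 1 < len(ordered) else float("inf")
        body = [text for num, text in elements if first <= num < end]
        # Drop the paragraph title from the front of the first element.
        body[0] = body[0][len(titles[pid]):].lstrip(" :")
        extracted[pid] = " ".join(part for part in body if part)
    return extracted


elements = [
    (1, "Title: Software design methods"),
    (2, "Author: Yamaguchi"),
    (3, "Summary: This report surveys"),
    (4, "software design practice."),
]
starts = {2: 1, 3: 2, 4: 3}
titles = {2: "Title", 3: "Author", 4: "Summary"}
extracted = extract_paragraph_data(elements, starts, titles)
print(extracted)
```

Because the paragraph boundaries come from the head elements alone, a variable-length paragraph such as the two-line summary above is collected without any change to the format data.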

Finally, in index generation (step S3), the data extracted in step S2 (extraction data determination) are entered into the fields of the table that stores the index data (see FIG. 4(c)), thereby generating the index in the index storage means 11.
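Step S3 can be sketched with an in-memory SQLite table standing in for the index storage means. The table name `doc_index`, its columns, the `FIELD_OF_PARAGRAPH` mapping, and the function names are assumptions for illustration; the specification only says that the extracted data are entered into the fields of a table.

```python
import sqlite3

# Assumed pointers from paragraph id to index field (cf. the arrows of FIG. 3).
FIELD_OF_PARAGRAPH = {2: "title", 3: "author", 4: "summary"}


def create_index(conn):
    """Create the table that holds the index data."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS doc_index (
               doc_id TEXT PRIMARY KEY, title TEXT, author TEXT, summary TEXT)"""
    )


def add_to_index(conn, doc_id, extracted):
    """Enter one document's extracted data into the index table fields."""
    row = {"doc_id": doc_id, "title": None, "author": None, "summary": None}
    for pid, text in extracted.items():
        field = FIELD_OF_PARAGRAPH.get(pid)
        if field is not None:
            row[field] = text
    conn.execute(
        """INSERT INTO doc_index (doc_id, title, author, summary)
           VALUES (:doc_id, :title, :author, :summary)""",
        row,
    )


conn = sqlite3.connect(":memory:")
create_index(conn)
add_to_index(conn, "R-001", {
    2: "Software design methods",
    3: "Yamaguchi",
    4: "This report surveys software design practice.",
})
row = conn.execute(
    "SELECT title, author FROM doc_index WHERE doc_id = ?", ("R-001",)
).fetchone()
print(row)
```

Calling `add_to_index` once per stored document corresponds to running the extraction each time a document is input.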

Next, retrieval is described. Suppose, for example, that index data such as that shown in FIG. 10(a) has been prepared by the automatic extraction described above, and that the searcher makes the search request "I want to see the summary part of a 'research report' written by 'Yamaguchi' whose summary contains the words 'software design'." In this case, as shown in FIG. 10(b), the search means 12 displays the framework of the corresponding format on the search request input screen, and the searcher supplies the search conditions to the search means 12 simply by entering the necessary items into the format displayed by the system.

The * in the figure is the well-known wildcard; when it is given, a text search is performed. The search means 12 then performs an ordinary search using the index and, as the search result, outputs the designated part of each document that satisfies the search request, as in FIG. 10(c). When there are multiple such documents, they are output in multiple windows.
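The fill-in-the-blank search can be sketched as follows. Blank fields impose no condition, fields containing * are matched as patterns, and all other fields are compared exactly. The function name `search`, the row layout, and the sample rows are assumptions for illustration, and Python's `fnmatchcase` is used as a stand-in for the text search (it also treats `?` and `[` as pattern characters).

```python
from fnmatch import fnmatchcase


def search(index_rows, form, show_field):
    """Return (doc_id, value of show_field) for rows matching the filled form."""
    results = []
    for row in index_rows:
        matched = True
        for field, condition in form.items():
            if not condition:          # blank field: no condition
                continue
            value = row.get(field, "")
            if "*" in condition:
                ok = fnmatchcase(value, condition)   # wildcard text search
            else:
                ok = value == condition              # exact match on the index
            if not ok:
                matched = False
                break
        if matched:
            results.append((row["doc_id"], row.get(show_field, "")))
    return results


index_rows = [
    {"doc_id": "R-001", "type": "research report", "author": "Yamaguchi",
     "summary": "A study of software design for control systems."},
    {"doc_id": "R-002", "type": "research report", "author": "Yamaguchi",
     "summary": "Measurements of disk latency."},
    {"doc_id": "S-001", "type": "specification", "author": "Yamaguchi",
     "summary": "software design rules"},
]
form = {"type": "research report", "author": "Yamaguchi",
        "title": "", "summary": "*software design*"}
results = search(index_rows, form, show_field="summary")
print(results)
```

The `form` dictionary plays the role of the displayed format with the searcher's entries filled in, and `show_field` names the part of the document the searcher asked to see.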

[Effects of the Invention] As described above, the present invention comprises a document format storage means that stores the format of the documents to be retrieved; an index extraction means that extracts the data necessary for an index from a document to be retrieved, using the format stored in the document format storage means, and generates the index; and a search means that, at retrieval time, displays the format stored in the document format storage means and retrieves a document using the index based on the search conditions entered against the displayed format. The index can therefore be created automatically, and extraction is accurate because the data to be extracted from a document is specified by the format. Moreover, because the document structure can be analyzed using the format, detailed items of the document contents can be made into the index. In addition, because the format can be presented to the searcher at retrieval time, the searcher can easily give search conditions in a fill-in-the-blank manner without knowing any special search language.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing the overall configuration of a document retrieval device according to an embodiment of the present invention; FIG. 2 is a diagram for explaining the document format; FIG. 3 is a diagram for explaining the data input as a format; FIG. 4 is a diagram for explaining the operation of the index extraction means; FIG. 5 is a diagram showing other examples of documents from which an index can be extracted; FIG. 6 is a diagram for explaining the document structure analysis in more detail; FIG. 7 is a diagram showing an example of paragraph correspondence; FIG. 8 is a diagram showing an example of paragraph correspondence data; FIG. 9 is a diagram showing an example of extracted data; FIG. 10 is a diagram for explaining the operation of the search means; and FIG. 11 is a block diagram showing the configuration of a conventional example.
1 is a document input means, 2 is a document storage means, 5 is a search request input means, 7 is a search result output means, 8 is a document format input means, 9 is a document format storage means, 10 is an index extraction means, 11 is an index storage means, and 12 is a search means. In the figures, the same reference numerals denote identical or corresponding parts.

Claims

[Claims] A document retrieval device that retrieves a designated document from among a plurality of stored documents using an index, comprising: a document format storage means that stores the format of the documents to be retrieved; an index extraction means that extracts the data necessary for the index from a document to be retrieved, using the format stored in the document format storage means, and generates the index; and a search means that, at retrieval time, displays the format stored in the document format storage means and retrieves a document using the index based on search conditions entered against the displayed format.