JP2000099543A

JP2000099543A - Information retrieval device

Info

Publication number: JP2000099543A
Application number: JP10272641A
Authority: JP
Inventors: Katsuhiko Itonori; 勝彦糸乘
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1998-09-28
Filing date: 1998-09-28
Publication date: 2000-04-07

Abstract

PROBLEM TO BE SOLVED: To provide an information retrieval device that handles document information in different document formats in a unified manner and retrieves documents by using the relationships indicated within the documents. SOLUTION: A document format sharing unit 1 converts document information to a common document format in advance, and the converted information is stored in a document information storage unit 2. At retrieval time, when a user designates document information and a logical structure through a designation device 5, a logical structure extraction unit 3 extracts the designated logical structure from the designated document information. A document information search unit 4 searches the document information stored in the document information storage unit 2, using the content of the extracted logical structure as a search key. The retrieved document information is stored in a search result storage unit 6. For each retrieved piece of document information, the logical structure extraction unit 3 again extracts the logical structure and the document information search unit 4 again performs a search. The retrieval results are stored in a storage device 9, and the relationships among the pieces of document information are displayed on a display device 7 so that they can be grasped.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] Field of the Invention: The present invention relates to a document management system that stores and retrieves document information having a plurality of different document formats, and more particularly to an information retrieval device that retrieves related information starting from a given document.

[0002] Description of the Related Art: Conventionally, to retrieve document information, the document formats within the document management system that stores the documents were unified, and retrieval was performed with a search method specific to that format. For example, a document management system for structured documents stores only structured documents and allows retrieval that uses the structure of the documents. Naturally, however, such retrieval did not cover document images or unstructured documents simply written with a word processor. In addition, the retrieved documents were each displayed as independent items, and the user had to read the full text to confirm how the documents were related.

[0003] Documents are, by nature, often written in relation to other documents, and many citation relationships also exist within a single document. For example, citing a document as a reference expresses a relationship between the citing document and the cited document. By following such relationships, a user can find documents more easily. Conventional document management systems, however, rarely made use of them.

[0004] As techniques that actively use relationships between documents, JP-A-7-230467 and JP-A-8-287087, for example, make it possible to display references or cited documents when they exist. JP-A-8-272818, for example, creates citation and reference relationships between documents in advance and makes documents retrievable through those relationships. Further, JP-A-9-146968, for example, not only retrieves references but also retrieves other documents that list the retrieved document as a reference, which makes it possible to obtain related documents newer than the original document.

[0005] All of these techniques enable the retrieval described above only after the data stored in the document management system has been processed so that citation relationships through references can be used. For example, only the bibliographic information is extracted from the reference list, a bibliographic database is built separately from the body text, and searches concerning references are run against this bibliographic database.

[0006] In systems such as those described above, handling a document image requires that the document information first be converted to character codes, for example by character recognition, before it is registered. JP-A-10-3483, for example, proposes a system that uses recently developed document image processing techniques to automatically extract citation relationships written under certain conditions from document image information, convert them to character codes, and retrieve the related document information. To keep costly character recognition to a minimum, this system locates citation relationships in the document using predetermined image patterns, applies character recognition only to the corresponding portions, and extracts bibliographic information for retrieval.

[0007] However, citation relationships between or within documents are written in a variety of ways. To indicate a cited document, a series of numbers such as [1], [2], ... may be used, or the author and publication year may be used, as in [nori98] or [taro96]. A method that searches for predetermined image patterns, such as that described in JP-A-10-3483, therefore has difficulty coping flexibly with these different notations.

[0008] Furthermore, none of the systems described above that can use relationships between documents can handle electronic documents and document images in a unified manner. No technique has so far been proposed that presents related documents for document information in which different document formats, such as electronic documents and document images, are mixed.

[0009] Problem to Be Solved by the Invention: The present invention has been made in view of the circumstances described above. Its object is to provide an information retrieval device that handles document information in different document formats, such as electronic documents and document images, in a unified manner and enables retrieval of documents using relationships indicated in the documents, such as citation relationships.

[0010] Means for Solving the Problem: In the present invention, a plurality of pieces of document information are stored after their document formats have been converted to a common format. When the document information is a document image, for example, the image is divided into regions with different properties, character recognition is performed on the text regions, and the logical structure of the document is determined from the results of region division and character recognition. When the document information has only document content and formatting information, as with a document simply created on a word processor, the logical structure is determined from changes in the formatting information and from the document content. In either case the document information is then converted to the common document format.

[0011] When document information and a specific logical structure within it are designated, the logical structure extraction means extracts the designated logical structure from the designated document information, and document information is retrieved using the document content corresponding to the extracted logical structure as a search key. The designated logical structure is then extracted from each retrieved piece of document information, and the document information in the document information storage means is searched using the document content corresponding to that extracted structure as a search key. This process is repeated until no more search results remain. When extracting the logical structure, the device can be configured to refer to the information of each node constituting the document information and to extract a structure that is semantically equivalent to the designated logical structure.

[0012] In this way, document information having the designated logical structure can be retrieved regardless of document format. If, for example, a structure representing a citation relationship is designated as the logical structure, the retrieved pieces of document information can be obtained as information linked to one another by citation relationships. By displaying the retrieved pieces of document information in association with one another, for example, the citation relationships among them can be presented to the user in an easily understood form.

[0013] Embodiments of the Invention: FIG. 1 is a block diagram showing an embodiment of the information retrieval device of the present invention. In the figure, 1 is a document format sharing unit, 2 is a document information storage unit, 3 is a logical structure extraction unit, 4 is a document information search unit, 5 is a designation device, 6 is a search result storage unit, 7 is a display device, 8 is a central control unit, and 9 is a storage device.

[0014] The document format sharing unit 1 converts the document format of document information to a common format. The document information storage unit 2 stores a plurality of pieces of document information whose formats have been made common. The logical structure extraction unit 3 extracts a specific logical structure from document information. The document information search unit 4 searches the document information stored in the document information storage unit 2, using the content of the logical structure extracted by the logical structure extraction unit 3 as a search key. The designation device 5 includes input devices such as a mouse and a keyboard and is used to designate specific document information and a logical structure. The search result storage unit 6 temporarily stores search results. The display device 7 displays search results and the operating status. The central control unit 8 controls the operation of the entire device. The storage device 9 stores the programs executed by the central control unit 8 and their data, and also accumulates the document information that has been used for logical structure extraction by the logical structure extraction unit 3 and for setting search keys in the document information search unit 4.

[0015] FIG. 2 is a flowchart outlining the operation of the embodiment of the information retrieval device of the present invention. Beforehand, in S11, the document format sharing unit 1 converts the document information to the common document format, and the result is stored in the document information storage unit 2. Later, when a search is performed, the user designates, in S12, the document information and the logical structure from which search keys are to be set, using the designation device 5. The designated document information is stored in a variable A. In S13, the document information held in variable A is stored in the storage device 9. In S14, the logical structure extraction unit 3 extracts the designated logical structure from the document information held in variable A. In S15, the document information search unit 4 searches the document information stored in the document information storage unit 2, using the content of the logical structure extracted by the logical structure extraction unit 3 as a search key. In S16, the retrieved document information is stored in the search result storage unit 6.

[0016] In S17, it is determined whether any retrieved document information is stored in the search result storage unit 6. If so, one piece is selected and taken out in S18, stored in variable A, and the process returns to S13. The selected document information is then accumulated in the storage device 9, the logical structure extraction unit 3 extracts the designated logical structure from it, and the document information search unit 4 searches for document information using the content of the extracted logical structure as a search key. This processing is repeated until no retrieved document information remains in the search result storage unit 6.

[0017] When S17 finds that no document information remains in the search result storage unit 6, the document information retrieved up to that point is used in S19 to display on the display device 7 the relationships among the pieces of document information, that is, which document information was retrieved from which, in a form the user can understand.
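The loop of steps S12 to S19 can be sketched as a worklist algorithm. The following Python is a minimal sketch under stated assumptions, not the implementation of this device: `extract_keys` and `search` are hypothetical stand-ins for the logical structure extraction unit 3 and the document information search unit 4, documents are represented by plain identifiers, and the `visited` check is an added assumption so that circular citations terminate.

```python
from collections import deque

def chained_search(start_doc, extract_keys, search):
    """Follow a designated logical structure from document to document.

    extract_keys(doc) -> list of search keys (stand-in for unit 3)
    search(key)       -> list of matching document ids (stand-in for unit 4)
    Returns the (source, found) pairs that the display step S19 would show.
    """
    pending = deque([start_doc])   # plays the role of search result storage unit 6
    visited = set()                # plays the role of storage device 9
    relations = []
    while pending:                 # S17: is any retrieved document left?
        doc = pending.popleft()    # S18: take one out
        if doc in visited:         # assumption: skip documents already processed
            continue
        visited.add(doc)           # S13
        for key in extract_keys(doc):      # S14
            for found in search(key):      # S15
                relations.append((doc, found))
                pending.append(found)      # S16
    return relations


# Toy corpus: each document id maps to the ids it cites.
cites = {"A": ["B", "C"], "B": ["C"], "C": []}
relations = chained_search(
    "A",
    lambda doc: cites[doc],
    lambda key: [key] if key in cites else [],
)
print(relations)
```

The returned pairs record which document was retrieved from which, which is the relationship the display device 7 is described as presenting.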

[0018] The operation outlined above is described in detail below using specific examples. First, the document format sharing unit 1 converts the document formats of the plurality of input pieces of document information to one specific document format. The format used for this purpose should be able to express logical structure, so that retrieval using logical structure is easy, and should allow simple description. SGML and HTML are known document formats that can handle logical structure. SGML, however, is not suited to the present invention, because its DTD restricts the tag names that represent logical elements and the schema of those elements. Documents that express many kinds of information and have many kinds of logical structure are expected as input, so it is difficult to anticipate in advance the schemas of all logical structures and the tag names of all logical elements that will be used. HTML is likewise unsuitable, because the tag names that represent logical elements are severely limited to keep the format easy to use. A specific document format could be defined independently, but XML is used as the common document format in the description here. XML does not require a DTD to be defined and lets the user define tag names freely, which makes it well suited for use as the common format in the present invention.

[0019] First, converting SGML or HTML, which are document formats capable of describing logical structure, to XML is easy. Because XML is defined as a subset of SGML, the two share the same markup model and there is basically nothing that has to be changed. However, since the present invention does not need a document type definition, the document type definition portion can be deleted before use.
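Deleting the document type definition can be illustrated as follows. This is a minimal Python sketch that assumes the input is already well-formed XML apart from its DOCTYPE declaration; real SGML or HTML input may need further normalization (for example, closing omitted end tags), which is not shown. The regular expression, the function name `strip_doctype`, and the sample markup are all illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Matches a DOCTYPE declaration, with or without a simple internal subset in [...].
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^\[>]*(\[.*?\])?\s*>", re.DOTALL)

def strip_doctype(markup: str) -> str:
    """Remove the document type declaration and keep the element markup."""
    return DOCTYPE_RE.sub("", markup, count=1)

source = """<!DOCTYPE article SYSTEM "article.dtd">
<article><title>Storage and retrieval</title><para>Body text.</para></article>"""

# After stripping, the remaining markup parses without any DTD.
root = ET.fromstring(strip_doctype(source).strip())
print(root.tag, [child.tag for child in root])
```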

[0020] Next, the conversion of document information that has only formatting information, such as a document simply created with a word processor, is described. Such document information contains much of the information needed for display, such as the font and font size of each character, paragraphs, and indentation. The logical structure can therefore be estimated from changes in this font and layout information. Because an exact logical structure is difficult to estimate at this point, only coarse structures such as "section", "paragraph", and "body" are assigned. Section headings, for example, are often emphasized by a font size larger than that of the body text or by a bold typeface. Body text is sometimes indented relative to section headings. Structure is assigned on the basis of such information. As a result, names such as "section" and "paragraph" are used as the tag names of the XML nodes, and the logical structure is built from them.
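The coarse assignment of structure from formatting can be sketched as below. This is a minimal Python illustration under assumptions, not the conversion actually used: the `Run` record, the rule that the most frequent font size is the body size, and the English tag names `section`, `section_title`, and `paragraph` are all hypothetical, and indentation is ignored for brevity.

```python
from collections import Counter
from dataclasses import dataclass
import xml.etree.ElementTree as ET

@dataclass
class Run:
    """One block of text with the formatting a word processor would record."""
    text: str
    font_size: float
    bold: bool = False

def runs_to_xml(runs):
    """Assign coarse section / section_title / paragraph tags from formatting."""
    # Assumption: the most frequent font size is the body text size.
    body_size = Counter(r.font_size for r in runs).most_common(1)[0][0]
    root = ET.Element("document")
    current = root
    for r in runs:
        if r.font_size > body_size or r.bold:
            # Larger or bold text is treated as a heading that opens a section.
            current = ET.SubElement(root, "section")
            ET.SubElement(current, "section_title").text = r.text
        else:
            ET.SubElement(current, "paragraph").text = r.text
    return root

runs = [
    Run("1. Introduction", 14, bold=True),
    Run("Documents are rarely written in isolation.", 10),
    Run("References", 14, bold=True),
    Run("[1] (first toy entry)", 10),
    Run("[2] (second toy entry)", 10),
]
doc = runs_to_xml(runs)
print(ET.tostring(doc, encoding="unicode"))
```

Only formatting differences drive the tagging here; deciding that the second section is a reference list is left to the later matching step.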

[0021] FIG. 3 is an explanatory diagram of a specific example of the conversion of document information that has only formatting information. In this example, document information that has only formatting information (an RTF document), as shown in FIG. 3(A), is converted to an XML description, yielding the XML shown in FIG. 3(B).

[0022] Estimating structure is even more difficult for a document image. A document image is divided into regions according to differences in the properties of the image information, and structure is assigned to the text regions. The information available at this point is the position and size of each region on the image and, by using a document recognition device, the character information within each text region. The structure assigned at this stage, however, is only a coarse one based solely on the positional relationships among the text regions. For example, text regions at the very top and bottom of the image may be a header or a footer, but they may also be body text. At this stage, therefore, only the positional relationships of the text regions (text blocks) are described in XML, and the structure is finally determined by the structure matching performed in the logical structure extraction unit 3.
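Recording only the positional relationships of text blocks can be sketched as follows. This minimal Python example rests on assumptions: the `TextBlock` record, the flat `page` / `text_block` layout, and the attribute names are hypothetical, and the recognized text is supplied directly rather than produced by a recognizer.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

@dataclass
class TextBlock:
    x: int
    y: int
    width: int
    height: int
    text: str   # character recognition result for this region

def blocks_to_xml(blocks):
    """Write text blocks in reading order with their geometry; assign no roles yet."""
    root = ET.Element("page")
    # Sort top-to-bottom, then left-to-right.
    for b in sorted(blocks, key=lambda blk: (blk.y, blk.x)):
        node = ET.SubElement(root, "text_block", {
            "x": str(b.x), "y": str(b.y),
            "width": str(b.width), "height": str(b.height),
        })
        node.text = b.text
    return root

blocks = [
    TextBlock(100, 300, 400, 600, "(body text)"),
    TextBlock(150, 40, 300, 30, "Storage and Retrieval of Document Information"),
    TextBlock(220, 90, 160, 20, "(author name)"),
]
page = blocks_to_xml(blocks)
print(ET.tostring(page, encoding="unicode"))
```

No block is labeled as a title, header, or footer at this point, which mirrors the deferral of that decision to the matching step in the logical structure extraction unit 3.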

[0023] FIG. 4 is an explanatory diagram of a specific example of the conversion of a document image. In this example, a document image such as that shown in FIG. 4(A) is input. From this image, the text regions "文書情報の保存と検索" ("Storage and Retrieval of Document Information"), "富○太郎", and "富○学園大学", together with the text region below them drawn as a row of ■ marks, are separated from one another and subjected to character recognition, yielding the XML description shown in FIG. 4(B).

[0024] The document information converted to XML is stored in the document information storage unit 2. Either the converted documents alone may be stored, or each original document may be stored as a pair with its converted counterpart. Storing them as pairs also makes it possible to retrieve and use search results in the original document format.

[0025] To perform a search, one document from which search keys are to be set is designated first. This can be document information stored in the document information storage unit 2, designated with the designation device 5. It may instead be document information from an external document database or document information newly input by the user. In those cases, the document information must first be converted to the common format by the document format sharing unit 1 as described above.

[0026] The logical structure extraction unit 3 extracts the target logical structure from the document information designated by the user or from document information taken out of the search result storage unit 6. FIG. 5 is an explanatory diagram of a specific example of a logical structure to be extracted, and FIG. 6 is an explanatory diagram of specific examples of logical structures that document information may have. As a specific example, the extraction of the logical structure in FIG. 5, which represents the reference list placed at the end of an academic paper, is described here. In the structure shown in FIG. 5, a node with the tag name "reference list" (参考文献リスト) has, as child nodes, a number of nodes with the tag name "document" (文献). The content of each "document" node is the bibliographic information of one cited document, and the content of the "reference list" node is a heading such as "参考文献" (references), "文献" (literature), or "References".
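For orientation, the target structure of FIG. 5 might be written in XML roughly as below. This is only an illustrative sketch: the English tag names `reference_list` and `document` stand in for the Japanese tag names, and the entry texts are placeholders rather than content taken from the figure.

```python
import xml.etree.ElementTree as ET

# A "reference list" node whose own content is the heading and whose
# children each hold the bibliographic information of one cited document.
target = ET.fromstring("""
<reference_list>References
  <document>[1] (bibliographic information of the first cited document)</document>
  <document>[2] (bibliographic information of the second cited document)</document>
</reference_list>
""")

heading = (target.text or "").strip()
entries = [d.text for d in target.findall("document")]
print(heading, len(entries))
```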

[0027] If a document's logical structure defines each element with these tag names from the outset, the corresponding logical structure can be extracted by a matching operation between logical structures. However, if the stored document is an XML document defined with different tag names, or is document information converted from a document image, a simple matching operation alone cannot extract the designated logical structure.

[0028] For example, neither of the structures shown in FIGS. 6(A) and 6(B) has the tag names shown in FIG. 5. In the structure of FIG. 6(A), however, if the node with the tag name "section title" has the content "参考文献" (references) and its sibling "paragraph" nodes contain the bibliographic information of the individual references, the structure can be judged to represent a reference list equivalent to the structure in FIG. 5. Likewise, in the structure of FIG. 6(B), if the node with the tag name "text block" has the content "参考文献" and its child nodes contain the bibliographic information of references, the structure can be judged to be a reference list equivalent to the structure in FIG. 5.

[0029] When the logical structure alone is not sufficient for a decision, the matching operation can also take the characteristic content of each node into account. For example, the bibliographic information of each reference begins with a number or a specific character string so that it can be matched with the citations in the body text. By picking out such features from the content of each node while matching the nodes against the designated logical structure, the designated logical structure can be extracted.

[0030] FIG. 7 is a flowchart showing an example of the process by which the logical structure extraction unit 3 extracts a designated logical structure. Following the specific example above, it shows the extraction of the reference list structure. First, in S21, the first node is taken from the document information and stored in a variable A. In S22, it is determined whether a node was obtained, and the following processing is repeated until no nodes remain.

[0031] In S23, the content of the node taken out in S21 is checked to see whether it is "参考文献" (references) or "文献" (literature). If it is neither, the process returns to S21 and the next node is taken out. Repeating this processing locates the heading of the section in which the references are listed.

[0032] Once a node whose content is "参考文献" or "文献" has been found, its child nodes or sibling nodes are processed. First, in S24, it is determined whether the node has child nodes. If it does, a child node is taken out in S25. In S26, it is determined whether a child node was obtained, and bibliographic information is extracted from the child nodes for as long as they exist. Bibliographic information begins with a special notation so that it can be matched with the places in the document where it is cited. Here, as an example, the correspondence is assumed to be made by a number enclosed in square brackets. Each node is therefore checked to see whether its content is text and, if so, S27 determines whether the content begins with the combination of "[", a number, and "]". A node that begins with this notation is regarded as part of the bibliographic information of a reference, and its content is taken out in S28. When no child nodes remain, the process returns from S26 to S21 and continues by looking for the next node whose content indicates references.

【００３３】Similarly, when sibling nodes exist, a sibling node is taken out in S29. S30 determines whether a sibling node was obtained, and bibliographic information is extracted from the sibling nodes for as long as any remain. S31 checks whether the content of each node is text and, if it is, whether the text begins with the combination of "[", a number, and "]". A node beginning with this notation is regarded as part of the bibliographic information of a reference, and its content is extracted in S32. When no sibling nodes remain, processing returns from S30 to S21 and continues looking for the next node whose content is a reference.
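
The child and sibling scan of S24 to S32 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the sample XML, its tag names, the `labels` tuple, and the helper names `node_text` and `extract_bibliography` are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document in a common XML format; tag names are illustrative only.
SAMPLE = """
<document>
  <section>
    <title>References</title>
    <para>[1] T. Fuji, S. Yamada, Document Image Analysis, 1989.</para>
    <para>[2] Another entry.</para>
    <para>Unnumbered trailing note.</para>
  </section>
</document>
"""

# "[", digits, "]" at the start of the text, as tested in S27 and S31.
ENTRY_START = re.compile(r"^\s*\[\d+\]")


def node_text(node):
    """Return the node's own text content, stripped (empty string if none)."""
    return (node.text or "").strip()


def extract_bibliography(root, labels=("reference", "references")):
    """Collect the text of child and sibling nodes that start with "[n]"."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    entries = []
    for node in root.iter():
        if node_text(node).lower() not in labels:
            continue  # not a node whose content names the reference list
        children = list(node)
        parent = parent_of.get(node)
        siblings = [s for s in parent if s is not node] if parent is not None else []
        for candidate in children + siblings:
            text = node_text(candidate)
            if ENTRY_START.match(text):
                entries.append(text)
    return entries
```

With the sample above, the two numbered paragraphs are returned and the unnumbered note is skipped, because only text starting with a bracketed number passes the check.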

【００３４】In this example, matching looks only at the content of each node, but it can be made faster and more accurate by, for example, also comparing the tag names of the nodes. As FIG. 5 and FIG. 6 show, however, the tag names of the nodes are not unified, so the matching must take the correspondence between tag names into account. For example, processing may follow a rule that a "reference" tag matches a "text block" tag or a "section" tag. Sibling nodes are searched in order to handle document structures such as the one shown in FIG. 6(A), in which the section title is a child node of the section and the bibliographic information of the references is placed in its sibling nodes.
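
A tag-name correspondence rule of this kind could be held in a small lookup table. The sketch below is an assumption about one possible representation: only the "reference" rule comes from the text, and the names `TAG_EQUIVALENTS` and `tags_match` are hypothetical.

```python
# Hypothetical correspondence table: a requested tag name maps to the set of
# tag names treated as equivalent. Only the "reference" rule is from the text.
TAG_EQUIVALENTS = {
    "reference": {"reference", "text block", "section"},
}


def tags_match(requested, actual, table=TAG_EQUIVALENTS):
    """Return True if `actual` is the requested tag or a registered equivalent."""
    # Tags without a rule fall back to an exact-name comparison.
    return actual in table.get(requested, {requested})
```

Checking the tag name first narrows the candidate nodes before the content test is run, which is where the speed-up described above would come from.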

【００３５】Next, the document information search unit 4 searches the information stored in the document information storage unit 2, using the content of each node of the logical structure extracted by the logical structure extraction unit 3 as a search key. The document information search unit 4 can be implemented with a general document search method. The bibliographic information can also be decomposed, using specific symbols as delimiters, into elements such as author, title, and issuing organization, and these elements can be used as search keys for the search device. In the example document information shown in FIG. 3(B), for instance, the document content has already been divided at punctuation marks and similar delimiters. A search expression can therefore be created by identifying the author names, the title, and so on among these pieces of document content. To identify author names, the character string of each piece of content is compared with a personal name dictionary; if the string is registered in the dictionary, that piece is taken to be an author name. If a sequence of digits matches a date notation such as "×月×日" (month and day) or "YYMMDD", the piece can be judged to be date information. A piece containing a keyword such as "論文集" (collected papers), "予稿集" (proceedings), or "in Proc" can be judged to be a source name. The remaining pieces are taken to be the title.
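
The classification order described above (name dictionary, then date notation, then source keywords, with the remainder treated as title) can be sketched as follows. The dictionary contents, the regular expressions, and the function names are assumptions for illustration; in particular, the bare four-digit year pattern is an addition not stated in the text.

```python
import re

# Hypothetical personal name dictionary.
PERSON_NAMES = {"T. Fuji", "S. Yamada"}

DATE_PATTERNS = [
    re.compile(r"^\d{1,2}月\d{1,2}日$"),  # month and day notation
    re.compile(r"^\d{6}$"),               # YYMMDD
    re.compile(r"^\d{4}$"),               # bare year: an added assumption
]

SOURCE_KEYWORDS = ("論文集", "予稿集", "in Proc")


def classify_piece(piece):
    """Label one delimited piece as author, date, source, or title."""
    text = piece.strip().rstrip(".").strip()
    if text in PERSON_NAMES:
        return "author"
    if any(p.match(text) for p in DATE_PATTERNS):
        return "date"
    if any(k in text for k in SOURCE_KEYWORDS):
        return "source"
    return "title"  # whatever is left over is treated as the title


def classify_entry(pieces):
    """Group the pieces of one bibliographic entry by their label."""
    fields = {"author": [], "date": [], "source": [], "title": []}
    for piece in pieces:
        fields[classify_piece(piece)].append(piece.strip())
    return fields
```

Because the title is decided by elimination, the quality of the result depends on how complete the name dictionary and keyword list are.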

【００３６】FIG. 8 is an explanatory diagram of a specific example of search expression generation in the document information search unit. FIG. 8(A) is the same XML document as FIG. 3(B). If the names "T. Fuji" and "S. Yamada" are registered in the personal name dictionary, these pieces are identified as author names; "Document Image Analysis" and "Document Recognition" are identified as the title, "in Proc of xxx Symposium." as the source name, and "1989." as the date. From these, search expressions such as those shown in FIG. 8(B) and FIG. 8(C) can be generated.
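
The step from classified elements to a search expression might look like the sketch below. The actual expression syntax of FIG. 8(B) and FIG. 8(C) is not reproduced here; the `field:"value"` form, the `AND` operator, and the function name `build_search_expression` are purely hypothetical stand-ins.

```python
def build_search_expression(fields, operator="AND"):
    """Join field:"value" terms into one boolean expression (hypothetical syntax)."""
    terms = []
    for field in ("author", "title", "source", "date"):
        for value in fields.get(field, []):
            cleaned = value.strip().rstrip(".")  # drop the trailing delimiter
            terms.append(f'{field}:"{cleaned}"')
    return f" {operator} ".join(terms)
```

Passing `operator="OR"` would give a looser expression, which is one plausible reason for generating more than one expression from the same entry.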

【００３７】In this example the title is used as a keyword as is, but this is not the only option: morphological analysis may also be performed on the title in the bibliographic information, and the keywords extracted from it may be used for the search.

【００３８】By searching the document information stored in the document information storage unit 2 with the search expression generated in this way, the user can, starting from a single document, easily retrieve related document information (in this case, the references) from a large volume of document information. The search works even if the original document information is in different document formats. The user does not need to specify processing such as document format conversion, nor to set any special keywords. The results retrieved in this way can, for example, be listed on the display device 7. By selecting the document information to be consulted from the list, the user can obtain the desired document information.

【００３９】The document information retrieved by the document information search unit 4 is temporarily stored in the search result storage unit 6. For each piece of retrieved document information, the logical structure extraction unit 3 again extracts the specified logical structure, and the document information stored in the document information storage unit 2 is searched. This processing is repeated until no search-result document information remains in the search result storage unit 6. The retrieved results are accumulated successively in the storage device 9.
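
The repetition described here behaves like a worklist loop. The sketch below assumes the extraction and search steps have already been reduced to a mapping from a document id to the ids it cites; `references_of`, `collect_cited`, and the `seen` set (which prevents a document from being processed twice) are assumptions added for the example.

```python
from collections import deque


def collect_cited(start_id, references_of):
    """Follow references transitively from `start_id`.

    `references_of` maps a document id to the ids of the documents it cites
    (a stand-in for "extract the reference list, then search for each entry").
    """
    pending = deque([start_id])  # plays the role of search result storage unit 6
    found = []                   # plays the role of storage device 9
    seen = {start_id}            # added guard against reprocessing a document
    while pending:               # repeat until no pending results remain
        doc_id = pending.popleft()
        for cited in references_of.get(doc_id, []):
            if cited not in seen:
                seen.add(cited)
                found.append(cited)
                pending.append(cited)
    return found
```

Without a guard such as `seen`, two documents that cite each other would keep the loop from ever emptying the pending store.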

【００４０】With this kind of search alone, only documents published earlier than the initially designated document are accumulated in the storage device 9. Next, therefore, the document information stored in the document information storage unit 2 is searched using the bibliographic information of the designated document as the search key. In this case too, the search results are temporarily stored in the search result storage unit 6, and the search is repeated until no document information remains in the search result storage unit 6. Restricting the search range within each piece of document information to the section that lists the references makes the search faster and more accurate. The logical structure extraction unit 3 can be used to find the section in which the references are listed. The bibliographic information of a document can be obtained from the information on its front page: the title, the author names, and the author affiliations. The logical structure extraction unit 3 can likewise be used to extract these structures. The results of this search are also stored in the storage device 9.
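
The search in the opposite direction, for documents that cite the designated one, can be sketched in the same worklist style. Here the title alone stands in for the bibliographic search key, and each document's reference section is assumed to be available as plain text; `titles`, `reference_sections`, `find_citing`, and `collect_citing` are hypothetical names.

```python
def find_citing(title, reference_sections):
    """Return ids of documents whose reference section mentions `title`."""
    needle = title.casefold()
    return [doc_id for doc_id, section in reference_sections.items()
            if needle in section.casefold()]


def collect_citing(start_id, titles, reference_sections):
    """Repeat the cited-by search until no unprocessed result remains."""
    pending = [start_id]
    seen = {start_id}
    found = []
    while pending:
        current = pending.pop()
        # Only the reference sections are searched, not the full documents.
        for citing in find_citing(titles[current], reference_sections):
            if citing not in seen:
                seen.add(citing)
                found.append(citing)
                pending.append(citing)
    return found
```

Restricting the match to reference sections avoids false hits where a title merely appears in running text.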

【００４１】In this way, document information that stands in a reference relationship to the document designated by the user is accumulated in the storage device 9. The accumulated document information can, for example, be sorted by publication date and displayed on the display device 7. FIG. 9 is an explanatory diagram of an example of a display form for search results. In this example, the results are displayed as a graph so that the relationships among the retrieved document information stored in the storage device 9 can be seen. The initially designated document is placed at the center of the display area so that it can be identified; in FIG. 9 it is hatched to indicate that it is displayed differently. The document information obtained by the search is arranged around it, and documents in a citing and cited relationship are connected by lines to show the relationships between documents. Displaying the search results in this way lets the user grasp the relationships among them at a glance. The display form shown in FIG. 9 is of course only an example, and other display forms may be used.
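
The data preparation behind such a display can be sketched as three small steps: sort by publication date, derive the citing and cited pairs to be drawn as lines, and place the designated document at the center with the others around it. The record layout (`id`, `year`, `cites`) and the circular placement are assumptions for the example; the actual layout of FIG. 9 is not specified beyond the center position.

```python
import math


def sort_by_publication(docs):
    """Order document records by publication year, oldest first."""
    return sorted(docs, key=lambda d: d["year"])


def citation_edges(docs):
    """Return (citing, cited) pairs for documents present in the result set."""
    known = {d["id"] for d in docs}
    return [(d["id"], cited) for d in docs for cited in d["cites"] if cited in known]


def layout(center_id, doc_ids, radius=1.0):
    """Put the designated document at the origin and the rest on a circle."""
    others = [d for d in doc_ids if d != center_id]
    positions = {center_id: (0.0, 0.0)}
    for i, doc_id in enumerate(others):
        angle = 2 * math.pi * i / len(others)
        positions[doc_id] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions
```

The edge list and the positions together are enough input for any plotting routine to draw the nodes and connecting lines.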

【００４２】[0042]

[Effects of the Invention] As is apparent from the above description, according to the present invention, even if the original document information is in different document formats, the user can search the document information without being concerned about the document format of the stored document information, and can further retrieve all related document information. For a search, the user only needs to designate document information and a specific logical structure, and a search using logical structure can be performed even on document information that has no logical structure. The search results can also be displayed in a form that the user can easily grasp. The present invention thus provides various effects.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the information retrieval device of the present invention.

FIG. 2 is a flowchart outlining the operation of an embodiment of the information retrieval device of the present invention.

FIG. 3 is an explanatory diagram of a specific example of format unification processing for document information that has only format information.

FIG. 4 is an explanatory diagram of a specific example of format unification processing for a document image.

FIG. 5 is an explanatory diagram of a specific example of a logical structure to be extracted.

FIG. 6 is an explanatory diagram of a specific example of a logical structure contained in document information.

FIG. 7 is a flowchart showing an example of the processing by which the logical structure extraction unit extracts a designated logical structure.

FIG. 8 is an explanatory diagram of a specific example of search expression generation in the document information search unit.

FIG. 9 is an explanatory diagram of an example of a display form for search results.

[Explanation of symbols]

1: document format unification unit, 2: document information storage unit, 3: logical structure extraction unit, 4: document information search unit, 5: designation device, 6: search result storage unit, 7: display device, 8: central control device, 9: storage device.

Claims

[Claims]

1. An information retrieval apparatus comprising: document information storage means for storing a plurality of pieces of document information; document format unification means for converting the plurality of pieces of document information to a common document format and storing them in the document information storage means; designating means for designating a specific logical structure; logical structure extracting means for extracting the designated logical structure from designated document information and from retrieved document information; and search means for searching the document information in the document information storage means using, as a search key, the content of the logical structure extracted by the logical structure extracting means, wherein the extraction of the logical structure by the logical structure extracting means and the search by the search means are repeated on the document information retrieved by the search means until no search results remain.

2. The information retrieval apparatus according to claim 1, further comprising display means for displaying the document information successively retrieved by the search means in association with one another.

3. The information retrieval apparatus according to claim 1, wherein, when the document information is a document image, the document format unification means divides the document image into regions having different properties, performs character recognition on the character regions, determines a logical structure of the document from the result of the region division and the result of the character recognition, and converts the document into the common document format.

4. The information retrieval apparatus according to claim 1, wherein, when the document information has only document content and format information, the document format unification means determines a logical structure of the document information from changes in the format information within the document and from the document content, and converts the document information into the common document format.

5. The information retrieval apparatus according to claim 1, wherein the logical structure extracting means extracts a structure semantically equivalent to the designated logical structure by referring to information of each node constituting the document information.