JPH10198697A

JPH10198697A - Structured document retrieving device

Info

Publication number: JPH10198697A
Application number: JP9004269A
Authority: JP
Inventors: Hisashi Nakatsuyama; 恒中津山
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1997-01-14
Filing date: 1997-01-14
Publication date: 1998-07-31
Anticipated expiration: 2017-01-14
Also published as: JP3612914B2

Abstract

PROBLEM TO BE SOLVED: To enable retrieval with high recall even when searching a plurality of documents whose logical structures differ. SOLUTION: When a search expression 1 is input, a search expression generating means 2 rewrites the search conditions in that expression into progressively looser conditions and generates relaxed search expressions 4. A certainty calculating means 3 calculates, for each relaxed search expression 4, a certainty indicating how reliable its search results are, according to the rewrites performed to generate that expression. A search execution means 5 searches the structured documents stored in a document storing means 6 using both the input search expression 1 and the relaxed search expressions 4 generated by the means 2. A search result merging means 7 merges the search results, arranging them in descending order of certainty. Consequently, documents whose logical structure was not created correctly can also be retrieved, and recall is improved.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a structured document search apparatus that searches structured documents, and more particularly to a structured document search apparatus whose search targets are documents generated from a plurality of document types.

[0002]

[Description of the Related Art] In a structured document, the content of the document is called its logical structure and is represented as a tree made up of document components such as chapters, sections, and figures. FIG. 19 shows an example of a logical structure. Such a logical structure 61 cannot be created arbitrarily; it must be created according to syntax rules called a document type.

[0003] FIG. 20 shows an example of a document type. In this document type 60, each rectangular node defines an element type. The label of a node is the name of the element type, and nodes with the same name denote the same element type. The element type named "section" in FIG. 20 is therefore defined recursively.

[0004] Nodes drawn as ellipses define how elements are connected; these nodes are called constructors. A SEQ node indicates that instances of the nodes connected to it are generated in the given order. A REP node indicates that instances of the node connected to it are generated one or more times. An OPT node indicates that an instance of the node connected to it may or may not appear. A CHO node indicates that an instance of any one of the nodes connected to it is generated. Here, an "instance" is an element of a document generated on the basis of this document type.

[0005] Written out, the document type of FIG. 20 is defined as follows. An "article" consists of one or more "sections"; a "section" consists of a "heading", zero or more "paragraphs" or "figures", and zero or more "sections". As noted above, "sections" may be nested. The logical structure 61 of FIG. 19 satisfies the constraints of the document type 60 of FIG. 20.
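The written-out definition above can be checked mechanically. The following is a minimal sketch, not part of the original disclosure: the names `Element`, `CONTENT_MODEL`, and `conforms` and the single-letter type codes are assumptions made for illustration, and the reading that paragraphs and figures precede nested sections is one interpretation of the text.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Element:
    type: str
    children: list = field(default_factory=list)

# One-letter code per element type (hypothetical encoding for this sketch).
CODES = {"article": "A", "section": "S", "heading": "H",
         "paragraph": "P", "figure": "F"}

# Content model of each element type as a regular expression over the
# codes of its children: REP maps to "+", optional repetition to "*",
# CHO to a character class, SEQ to concatenation.
CONTENT_MODEL = {
    "article": r"S+",        # one or more sections
    "section": r"H[PF]*S*",  # heading, paragraphs/figures, nested sections
    "heading": r"",
    "paragraph": r"",
    "figure": r"",
}

def conforms(element):
    """Return True if the subtree rooted at `element` fits the document type."""
    model = CONTENT_MODEL.get(element.type)
    if model is None:
        return False
    child_codes = "".join(CODES.get(c.type, "?") for c in element.children)
    if re.fullmatch(model, child_codes) is None:
        return False
    return all(conforms(c) for c in element.children)

article = Element("article", [
    Element("section", [
        Element("heading"),
        Element("paragraph"),
        Element("figure"),
        Element("section", [Element("heading")]),  # nested section
    ]),
])
print(conforms(article))
```

A section without a heading, or an article without any section, is rejected by the same check.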

[0006] FIG. 20 shows a simple document type, but document types in practical use are large, and it is not unusual for them to contain several hundred element types. A document type corresponds to a schema in a database: it describes the meaning of the elements of a document and the relationships among them. Just as database processing is performed according to a schema, structured documents are processed on the basis of document type information. For example, layout instructions are defined according to the document type, and document layout is performed with a document instance and the layout instructions as input. As another example, necessary parts may be extracted from an existing set of documents and combined to create a new document, with new parts inserted where needed. In such processing, document type information is used in the search process that identifies the necessary parts and in the verification process that checks whether the newly composed document has the desired form.

[0007] To create a structured document based on a document type, a native editor that presents the logical structure directly to the user can be used. Using a native editor requires familiarity with structured documents themselves and with the document type used to create the document.

[0008] When the user is not familiar with structured documents or document types, a widely used approach is to create the document with a text editor or with document creation software whose screen display is nearly identical to the printed image (a WYSIWYG editor), and then to obtain the desired logical structure with a converter. In this case, the document must be written in the text editor or WYSIWYG editor according to certain rules.

[0009] Two kinds of authoring rules are in use: writing the text so that it matches predetermined patterns from which the structure can be extracted, and assigning a specific style (information that specifies appearance) to each part to be treated as an element.

[0010] When patterns are used, the conventions include, for example, inserting a blank line (consecutive line breaks) between elements, indenting particular elements, numbering items so that they match a pattern predetermined by a regular expression or the like, and using character strings predetermined by a regular expression or the like.

[0011] When styles are used, on the other hand, a predetermined style is assigned to each paragraph, sentence, phrase, or the like that is to be used as an element. Turning to document search, full-text search, which uses the text of a document as the search condition, is commonly used. In full-text search, Boolean search, in which conditions on the words a document contains are combined with AND, OR, and NOT, is typical. Simple Boolean search tends to give good recall but low precision; that is, the search results often include many documents the user did not expect.
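For reference, recall and precision can be stated concretely. The sketch below uses the standard set-based definitions; the function name `precision_recall` and the sample document identifiers are hypothetical and do not come from the original text.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A loose Boolean query that finds every relevant document (d1, d2)
# but also returns two documents the user did not want.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2"]))
```

In this made-up case recall is perfect while only half of the returned documents are wanted, which is the tendency the paragraph above attributes to simple Boolean search.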

[0012] For this reason, document search over structured documents widely uses the logical structure to raise precision. For example, by searching for figures whose caption contains "database" and that lie in a chapter whose heading contains the string "document processing", one can retrieve figures about databases in the context of document processing, such as document databases or documents generated from data in a database.
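The structural query in this example can be sketched as a walk over the logical structure. This is an illustration under assumptions, not the method of the cited prior art: the `Node` class, the element type names, the choice to store a figure's caption in its `text` field, and the sample document are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    type: str
    text: str = ""
    children: list = field(default_factory=list)

def descendants(node):
    """Yield every node below `node`, depth first."""
    for child in node.children:
        yield child
        yield from descendants(child)

def figures_in_chapters(root, heading_term, caption_term):
    """Figures whose caption contains `caption_term` and that lie in a
    chapter whose heading contains `heading_term`."""
    hits = []
    for node in [root, *descendants(root)]:
        if node.type != "chapter":
            continue
        heading_ok = any(c.type == "heading" and heading_term in c.text
                         for c in node.children)
        if heading_ok:
            hits.extend(d for d in descendants(node)
                        if d.type == "figure" and caption_term in d.text)
    return hits

doc = Node("document", children=[
    Node("chapter", children=[
        Node("heading", "Document processing"),
        Node("section", children=[Node("figure", "Document database architecture")]),
    ]),
    Node("chapter", children=[
        Node("heading", "Networking"),
        Node("figure", "Distributed database replication"),
    ]),
])

for fig in figures_in_chapters(doc, "Document processing", "database"):
    print(fig.text)
```

A plain full-text search for "database" would match both figures; the structural condition on the chapter heading excludes the one in the networking chapter.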

[0013] Prior art using these techniques includes Japanese Patent Application Laid-Open No. 4-217073 (document retrieval apparatus in a document storage system). In that apparatus, the search target can be specified using the parent-child relationships of nodes in the logical structure of a document and the contents and attributes of the nodes. Thus, provided the logical structure has been created correctly, using the structure makes it possible to raise precision without lowering recall.

[0014]

[Problems to Be Solved by the Invention] However, conventional document search apparatuses for structured documents have the problem that, when the logical structure has not been created correctly, recall falls even though precision does not.

[0015] Because documents are created by users, their logical structure is not always created correctly. In particular, when a document is created with a WYSIWYG editor, text may be written in a way that does not match the prescribed patterns, or document elements that differ in logical structure but look identical or similar on screen may be confused with one another. The result is a document that satisfies the constraints of the document type but whose logical structure differs from the one it should have.

[0016] For these reasons, in conventional document search apparatuses for structured documents, recall falls when a large number of documents are searched using the element type information defined by the document type. This applies to all existing document search apparatuses, including technologies other than the prior art cited above.

[0017] The present invention has been made in view of these points, and its object is to provide a structured document search apparatus that maintains high recall even when searching a plurality of documents with differing logical structures.

[0018]

[Means for Solving the Problems] To solve the above problem, the present invention provides a structured document search apparatus that searches structured documents, comprising: search expression generating means which, when a search expression described in terms of node types in the document structure, node contents, node attributes, and structural relationships between nodes is input, generates relaxed search expressions by rewriting the search conditions in the search expression into progressively looser conditions; certainty calculating means which calculates, according to the rewrites performed to generate each relaxed search expression, a certainty indicating how reliable the search results of that relaxed search expression are; search execution means which performs searches based on the input search expression and the relaxed search expressions; and search result merging means which merges the search results of the search execution means, arranging them in descending order of certainty.

[0019] In this structured document search apparatus, when a search expression is input, the search expression generating means rewrites the search conditions in the search expression into progressively looser conditions and generates relaxed search expressions. The certainty calculating means then calculates, according to the rewrites performed to generate each relaxed search expression, a certainty indicating how reliable the search results of that expression are, and the search execution means performs searches based on the input search expression and the relaxed search expressions. The search result merging means merges the search results, arranging them in descending order of certainty.

[0020] The invention also provides a structured document search apparatus that searches structured documents, comprising: sub-expression evaluating means which, when a search expression described in terms of node types in the document structure, node contents, node attributes, and structural relationships between nodes is input, treats each search condition in the search expression as an individual sub-expression and, for each sub-expression, extracts the document parts that satisfy its condition; certainty calculating means which calculates a certainty indicating the degree of reliability of each document part according to which sub-expressions the document part extracted by the sub-expression evaluating means satisfies; and search result output means which outputs the search results in descending order of the certainty of the document parts.

[0021] In this structured document search apparatus, when a search expression is input, the sub-expression evaluating means treats each search condition in the search expression as an individual sub-expression and extracts, for each sub-expression, the document parts that satisfy its condition. The certainty calculating means then calculates a certainty for each document part according to which sub-expressions the extracted document part satisfies. Finally, the search result output means outputs the search results in descending order of the certainty of the document parts.
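The sub-expression approach can be pictured with a small sketch. The text up to this point does not say how the certainty is derived from the satisfied sub-expressions, so the weighted fraction used here is purely an assumption, as are the name `rank_parts`, the dictionary representation of document parts, and the sample data.

```python
def rank_parts(parts, subexpressions):
    """Score each document part by the sub-expressions it satisfies.

    parts          -- list of document parts (any objects)
    subexpressions -- list of (weight, predicate) pairs, one per search
                      condition of the original expression
    Returns (certainty, part) pairs, most certain first. Parts that
    satisfy no sub-expression are dropped.
    """
    total = sum(weight for weight, _ in subexpressions)
    scored = []
    for part in parts:
        satisfied = sum(weight for weight, pred in subexpressions if pred(part))
        if satisfied > 0:
            # Assumed scoring: fraction of the total weight satisfied.
            scored.append((satisfied / total, part))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
    return scored

parts = [
    {"type": "figure", "text": "database schema"},
    {"type": "paragraph", "text": "database"},
    {"type": "figure", "text": "page layout"},
    {"type": "table", "text": "fonts"},
]
subexpressions = [
    (1.0, lambda p: p["type"] == "figure"),
    (1.0, lambda p: "database" in p["text"]),
]
for certainty, part in rank_parts(parts, subexpressions):
    print(certainty, part["type"], part["text"])
```

The part that satisfies both conditions is listed first, the two that satisfy only one follow, and the part that satisfies none is omitted.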

[0022]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows the principle of the present invention. First, when a search expression 1 described in terms of node types in the document structure, node contents, node attributes, and structural relationships between nodes is input, the search expression generating means 2 rewrites the search conditions in the expression into progressively looser conditions and generates relaxed search expressions 4. For example, rewriting an "AND" in a structural condition to "OR" yields a relaxed search expression with looser search conditions. Once the relaxed search expressions 4 have been generated, the certainty calculating means 3 calculates, according to the rewrites performed to generate each relaxed search expression 4, a certainty indicating how reliable the search results of that expression are.

[0023] The search execution means 5 executes searches over the structured documents in the document holding means 6 using the input search expression 1 and each of the relaxed search expressions 4 generated by the search expression generating means 2. The search result merging means 7 merges the search results of the search execution means 5, arranging them in descending order of certainty.

[0024] This makes it possible to retrieve documents whose logical structure has not been created correctly. Taken as a whole, the search results have lower precision than those of a search with the original expression alone. However, results that match the original expression exactly and results that match it only loosely are not mixed at random; they are output in order of how well they match the original expression, so the user can examine the results starting from the most reliable.
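The flow of FIG. 1 (generate relaxed expressions, attach a certainty to each, execute all of them, merge by certainty) can be sketched on a toy model. Everything concrete here is an assumption for illustration: search expressions are reduced to an operator plus a tuple of terms, the only rewrite is AND to OR, the certainty values 1.0 and 0.5 are arbitrary, and removing duplicate hits during merging is a design choice of this sketch rather than something stated in the text.

```python
# Toy document store: document id -> set of words (hypothetical data).
DOCS = {
    "d1": {"document", "processing", "database"},
    "d2": {"database"},
    "d3": {"layout"},
}

def execute(expr):
    """Search execution: evaluate an (operator, terms) expression."""
    op, terms = expr
    test = all if op == "AND" else any
    return [doc_id for doc_id, words in sorted(DOCS.items())
            if test(term in words for term in terms)]

def relax(expr):
    """Expression generation plus certainty calculation: return
    (relaxed expression, certainty) pairs. Only AND -> OR is modeled."""
    op, terms = expr
    return [(("OR", terms), 0.5)] if op == "AND" else []

def search(expr):
    """Run the original and relaxed expressions, then merge the hits
    in descending order of certainty."""
    candidates = [(expr, 1.0)] + relax(expr)  # original gets full certainty
    merged, seen = [], set()
    for candidate, certainty in sorted(candidates, key=lambda c: c[1], reverse=True):
        for doc_id in execute(candidate):
            if doc_id not in seen:  # keep each hit at its best certainty
                seen.add(doc_id)
                merged.append((doc_id, certainty))
    return merged

print(search(("AND", ("database", "processing"))))
```

Document d1 satisfies the original expression and is listed first; d2 is found only by the relaxed expression and follows with a lower certainty; d3 matches neither.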

[0025] FIG. 2 is a block diagram of a first embodiment of the structured document search apparatus. This embodiment consists of the following components. In this embodiment, the "certainty" used in the description of FIG. 1 is expressed as a "penalty": the smaller the penalty, the more reliable the search result.

[0026] The search expression input means 11 provides the user with functions for creating and entering a search expression; it can also accept search expressions created by other programs. The penalty assigning means 12 provides the user with a function for assigning, to each search expression rewrite rule, the value on which penalty calculation is based. The penalty calculating means 13 obtains the total penalty of each search expression according to the penalty values entered through the penalty assigning means 12. The penalty specifying means 14 provides the user with a function for entering an upper limit on the penalty.

[0027] The search expression generating means 15 generates search expressions with looser search conditions from the search expression entered through the search expression input means 11, according to the search expression rewrite rules. The search expression holding means 16 holds the search expression entered through the search expression input means 11 and the search expressions generated by the search expression generating means 15, each with the penalty value calculated by the penalty calculating means 13 attached.

[0028] The document holding means 17 stores a plurality of documents with differing logical structures. The search execution means 18 searches the documents held in the document holding means 17 using the search expressions stored in the search expression holding means 16. The search result holding means 19 stores the results of the searches performed by the search execution means 18; each result carries the penalty value set for its search expression. The search result merging means 20 sorts the search results stored in the search result holding means 19 in ascending order of their penalty values. The search result output means 21 displays the search results on the screen of a display device in the order produced by the search result merging means 20.

[0029] To execute a search with a structured document search apparatus configured in this way, penalties for the search expression rewrite rules are first set using the penalty assigning means 12. The greater the degree to which a rewrite loosens the search conditions, the larger the penalty assigned to it. The entered penalties are passed to the penalty calculating means 13 and held there. The user can also enter an upper limit on the penalty using the penalty specifying means 14; this upper limit is not a mandatory setting.

[0030] After these settings have been made, a user who wishes to search for documents enters an arbitrary search expression into the search expression generating means 15 through the search expression input means 11. The search expression generating means 15 then generates from the entered expression a number of search expressions whose conditions are progressively looser. The generated search expressions are held temporarily in the search expression holding means 16. If an upper limit on the penalty has been set, no search expression whose penalty exceeds the limit is generated.

[0031] The search expressions held in the search expression holding means 16 are taken out one at a time by the search execution means 18 and used to search the structured documents stored in the document holding means 17. The search results obtained by evaluating each search expression are held in the search result holding means 19 together with the penalty value attached to that expression.

[0032] After all the search expressions have been evaluated, the search results held with their penalties in the search result holding means 19 are sorted in ascending order of penalty and merged by the search result merging means 20. The merged search results are output in order by the search result output means 21.

[0033] Next, the first embodiment is described concretely. First, an example of the penalties assigned to the rewrite rules is given. FIG. 3 shows an example of penalties assigned to rewrite rules. These penalties 13a are entered through the penalty assigning means 12 and held by the penalty calculating means 13. In this example, the penalty is "10" for the rewrite type "parent-child → ancestor-descendant", "30" for "ignore type", "100" for "AND → OR on descendants", "200" for "AND → OR on text conditions", and "500" for "Boolean search".

[0034] With these penalties set, the user enters a search expression through the search expression input means 11, for example the following. FIG. 4 shows an example of a search expression. As shown in the figure, the search expression 30 entered by the user through the search expression input means 11 is represented internally as a directed graph.

[0035] Nodes 31 and 33 to 35, whose labels are other than AND, each express a condition on a document element itself (a local condition) and a condition on the structural relationship (a structural condition) between that element and the element specified by the local condition of the nearest node with a label other than AND. The local conditions shown in this example are whether the node is an extraction target (true or false), a condition on the type of the node, and a condition on the text of the node. The structural condition specifies whether the relationship to the upper node is parent-child or ancestor-descendant.

[0036] Node 32, whose label is AND, represents the conjunction of structural conditions. The penalty of a search expression entered through the search expression input means 11 is always "0".
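One possible in-memory form of such a search expression is sketched below. The class names, field names, and all concrete types and text values are hypothetical; the contents of FIG. 4 are not reproduced in the text, so only the node numbering and the kinds of conditions (extraction target, type, text, structural relation) follow the description.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConditionNode:
    """A node with a label other than AND: local plus structural conditions."""
    node_id: int
    target: bool                    # local: is this the element to extract?
    type: str                       # local: required element type
    text: Optional[str] = None      # local: required text, if any
    relation: Optional[str] = None  # structural: "child" or "descendant" of the
                                    # nearest non-AND node above; None at the root
    children: list = field(default_factory=list)

@dataclass
class AndNode:
    """Conjunction of the structural conditions of its children."""
    node_id: int
    children: list = field(default_factory=list)

def condition_nodes(node):
    """Yield every non-AND node of the expression, top down."""
    if isinstance(node, ConditionNode):
        yield node
    for child in node.children:
        yield from condition_nodes(child)

# Hypothetical expression: figures with a caption about databases, inside a
# chapter whose heading mentions document processing.
expression_30 = ConditionNode(31, target=False, type="chapter", children=[
    AndNode(32, children=[
        ConditionNode(33, target=False, type="heading",
                      text="document processing", relation="child"),
        ConditionNode(34, target=True, type="figure", relation="child", children=[
            ConditionNode(35, target=False, type="caption",
                          text="database", relation="child"),
        ]),
    ]),
])

for n in condition_nodes(expression_30):
    print(n.node_id, n.type, n.relation, "target" if n.target else "")
```

Keeping the structural relation on the lower node means that loosening a parent-child condition to an ancestor-descendant condition touches a single field of a single node.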

[0037] Hereinafter, the nodes of the directed graph of FIG. 4 and the conditions specified within those nodes are called sub-expressions of the original search expression. When a search expression 30 such as that of FIG. 4 is entered, the search expression is rewritten before the search process is executed.

[0038] FIG. 5 is a flowchart of the search expression rewriting process. The process is described below following the steps of FIG. 5.

[S1] The penalty calculating means 13 initializes the penalty value, that is, sets it to "0".

[0039] Steps S2 to S9 below are repeated over the sub-expressions of the internal representation of the search expression.

[S2] The search expression generating means 15 checks whether any rewrite rule is applicable to the current search expression. If an applicable rewrite rule exists, the process proceeds to step S3; otherwise, the process ends.

[S3] The search expression generating means 15 selects one applicable rewrite rule.

[S4] The search expression generating means 15 checks whether the rewrite rule selected in step S3 is "ignore type". If it is, the process proceeds to step S5; otherwise, it proceeds to step S6.

[S5] The search expression generating means 15 rewrites the type conditions of the nodes that are not extraction targets (extraction target is "false") to "ANY", which denotes any type. That is, the type conditions are disabled. The process then proceeds to step S7.

[S6] The search expression generating means 15 applies the rewrite rule selected in step S3 and rewrites the search expression. If the rule is "parent-child → ancestor-descendant", "AND → OR on descendants", or "AND → OR on text conditions", the expression is rewritten as the rule specifies. If the rule is "Boolean search", the entire search expression is changed into a Boolean search; the procedure for this change is described later (shown in FIG. 10).

[S7] The penalty calculating means 13 adds the penalty corresponding to the rewrite rule selected in step S3 to the search expression currently being processed.

[S8] The search expression generating means 15 checks whether the penalty has exceeded the reference value. If it has, the process ends; otherwise, it proceeds to step S9.

[S9] The search expression holding means 16 stores the search expression obtained by the rewrite in step S5 or step S6, and the process returns to step S2.
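Steps S1 to S9 can be sketched as a loop. This is a simplified model under stated assumptions, not the apparatus itself: the search expression is flattened into a dictionary, each pass applies a rule at a single place, rules are tried in order of increasing penalty, and the "Boolean search" rule is left out. The key names, rule names, and sample expression are hypothetical; the penalty values are those of the FIG. 3 example.

```python
import copy

PENALTY = {"child_to_descendant": 10, "ignore_type": 30,
           "struct_and_to_or": 100, "text_and_to_or": 200}

def applicable_rewrites(expr):
    """Return (rule, site) pairs still applicable, cheapest rule first."""
    sites = [("child_to_descendant", i) for i, n in enumerate(expr["nodes"])
             if n["relation"] == "child"]
    sites += [("ignore_type", i) for i, n in enumerate(expr["nodes"])
              if not n["target"] and n["type"] != "ANY"]
    if expr["struct_op"] == "AND":
        sites.append(("struct_and_to_or", None))
    if expr["text_op"] == "AND":
        sites.append(("text_and_to_or", None))
    return sites

def generate_relaxed(expr, limit=None):
    """Return [(penalty, relaxed expression), ...] in generation order."""
    penalty = 0                                    # S1: initialize penalty
    held = []
    current = expr
    while True:
        sites = applicable_rewrites(current)       # S2: any rule applicable?
        if not sites:
            break
        rule, site = sites[0]                      # S3: select one rule
        current = copy.deepcopy(current)           # never mutate held expressions
        if rule == "ignore_type":                  # S4 -> S5: disable type condition
            current["nodes"][site]["type"] = "ANY"
        elif rule == "child_to_descendant":        # S6: apply the other rules
            current["nodes"][site]["relation"] = "descendant"
        elif rule == "struct_and_to_or":
            current["struct_op"] = "OR"
        else:
            current["text_op"] = "OR"
        penalty += PENALTY[rule]                   # S7: add the rule's penalty
        if limit is not None and penalty > limit:  # S8: stop above the limit
            break
        held.append((penalty, current))            # S9: store, back to S2
    return held

expr = {
    "struct_op": "AND", "text_op": "AND",
    "nodes": [
        {"target": False, "type": "chapter", "relation": None},
        {"target": False, "type": "heading", "relation": "child"},
        {"target": False, "type": "caption", "relation": "child"},
        {"target": True,  "type": "figure",  "relation": "child"},
    ],
}
print([p for p, _ in generate_relaxed(expr)])
print([p for p, _ in generate_relaxed(expr, limit=100)])
```

Because every stored expression carries the penalty accumulated so far, results can later be ordered by penalty without re-deriving which rewrites produced each expression.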

[0040] Through this process, a plurality of new search expressions are generated from the search expression shown in FIG. 4, and a penalty is attached to each according to the rewrites applied. FIGS. 6 to 9 show examples of rewriting the search expression 30 of FIG. 4.

[0041] FIG. 6 is a diagram showing a first example of a rewritten search expression. This search expression 30a is obtained by applying, at three places in the search expression 30 of FIG. 4, the rewrite rule that changes a parent-child relationship to an ancestor-descendant relationship. The penalty in this case is 10 × 3 = 30.

[0042] FIG. 7 is a diagram showing a second example of a rewritten search expression. This search expression 30b is obtained by applying the rewrite rule that ignores the type at three places in the search expression 30a of FIG. 6. The penalty in this case is (penalty of search expression 30a in FIG. 6) + 30 × 3 = 30 + 90 = 120.

[0043] FIG. 8 is a diagram showing a third example of a rewritten search expression. This search expression 30c is obtained by applying, at one place in the search expression 30b of FIG. 7, the rewrite rule that changes an AND condition on descendants to OR. The penalty in this case is (penalty of search expression 30b in FIG. 7) + 100 = 120 + 100 = 220.

[0044] FIG. 9 is a diagram showing a fourth example of a rewritten search expression. This search expression 30d is obtained by applying, at one place in the search expression 30c of FIG. 8, the rewrite rule that changes the AND of text conditions to OR. The penalty in this case is (penalty of search expression 30c in FIG. 8) + 200 = 220 + 200 = 420.
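The penalties of FIGS. 6 to 9 accumulate because each expression is derived from the previous one. The short sketch below reproduces that arithmetic. The rule keys are illustrative names, while the per-application values (10, 30, 100, 200) are the ones used in the worked examples above.

```python
# Per-application penalties from the worked examples; the key names are illustrative.
RULE_PENALTY = {
    "parent_child_to_ancestor": 10,
    "ignore_type": 30,
    "descendant_and_to_or": 100,
    "text_and_to_or": 200,
}

# (resulting expression, rule applied, number of places where it is applied)
steps = [
    ("30a", "parent_child_to_ancestor", 3),
    ("30b", "ignore_type", 3),
    ("30c", "descendant_and_to_or", 1),
    ("30d", "text_and_to_or", 1),
]

penalties = {}
total = 0  # the original search expression 30 carries a penalty of 0
for name, rule, places in steps:
    total += RULE_PENALTY[rule] * places
    penalties[name] = total

print(penalties)
```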

[0045] FIGS. 6 to 8 are examples in which a tree-structured search expression is rewritten into another tree-structured search expression. For these rewrites, it suffices to rewrite the search conditions in each node according to the rewrite rule. Rewriting to a Boolean search, on the other hand, requires the following procedure.

[0046] FIG. 10 is a flowchart of the procedure for rewriting an entire search expression into a Boolean search. This procedure is performed by the search expression generation means 15. Its input is a node of the search expression represented as a tree, and its output is the Boolean search expression obtained by converting the subtree of the search expression rooted at that node. In this procedure, the output search expression is represented as a character string.
[S11] It is determined whether the node is a condition on an element. If it is, the process proceeds to step S12; otherwise, it proceeds to step S13.
[S12] The condition on text content specified in the node is taken as the search expression, and the process proceeds to step S14.
[S13] The search expression is set to empty, and the process proceeds to step S14.
[S14] It is determined whether the node has child nodes. If a child node exists, the process proceeds to step S15; otherwise, it proceeds to step S23.

[0047] Steps S15 to S22 below are repeated for the child nodes of the node.
[S15] It is determined whether any child node has not yet been converted. If an unprocessed child node exists, the process proceeds to step S16; otherwise, it proceeds to step S23.
[S16] One unprocessed child node is selected.
[S17] The procedure for rewriting to a Boolean search is called recursively with the node selected in step S16 as its argument.
[S18] It is determined whether the search expression is empty. If it is empty, the process proceeds to step S19; otherwise, it proceeds to step S20.
[S19] The search expression (character string) returned by the procedure call in step S17 is taken as the search expression, and the process returns to step S15.
[S20] It is determined whether the node given as the argument of this procedure is an OR node. If it is, the process proceeds to step S22; otherwise, it proceeds to step S21.
[S21] The search expression and the search expression returned by the procedure call in step S17 are joined with AND, and the process returns to step S15.
[S22] The search expression and the search expression returned by the procedure call in step S17 are joined with OR, and the process returns to step S15.
[S23] The procedure ends, returning the search expression as its return value.
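Steps S11 to S23 amount to a recursive walk over the search expression tree. The following is a minimal sketch under stated assumptions: the `Node` class, its `kind` labels, the parenthesization of the output string, and the skipping of children that yield an empty expression are choices made for this example and are not specified in the flowchart.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                                   # "ELEMENT", "AND" or "OR" (hypothetical labels)
    text: Optional[str] = None                  # text condition of an element node, if any
    children: List["Node"] = field(default_factory=list)


def to_boolean(node: Node) -> str:
    """Convert the subtree rooted at `node` into a Boolean search string."""
    # S11-S13: start from the node's own text condition, or from an empty expression.
    expr = node.text if node.kind == "ELEMENT" and node.text else ""
    for child in node.children:                 # S15-S16: take each unprocessed child in turn
        sub = to_boolean(child)                 # S17: recursive call
        if not sub:
            continue                            # assumption: ignore children that contribute nothing
        if not expr:                            # S18-S19: first non-empty result becomes the expression
            expr = sub
        elif node.kind == "OR":                 # S20, S22: OR nodes join with OR
            expr = f"({expr} OR {sub})"
        else:                                   # S21: every other node joins with AND
            expr = f"({expr} AND {sub})"
    return expr                                 # S23


tree = Node("ELEMENT", children=[
    Node("ELEMENT", text="SGML"),
    Node("OR", children=[
        Node("ELEMENT", text="conference"),
        Node("ELEMENT", text="workshop"),
    ]),
])
print(to_boolean(tree))
```

All structural and type conditions are dropped in the conversion; only the text conditions and the AND/OR connectives survive, which is why this rewrite is the loosest one.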

[0048] Through this processing, a Boolean search expression can be obtained from a tree-structured search expression. The following is an example in which the search expression of FIG. 4 is rewritten directly into a Boolean search expression.

[0049]

[Equation 1] (SGML AND conference) AND "Yuri Rubinski" AND Publishing

The penalty of this Boolean search expression is (penalty of search expression 30 in FIG. 4) + 500 = 0 + 500 = 500.

[0050] When searches are performed with each of the tree-structured search expressions and the Boolean search expression described above, the penalty of the search expression is attached to each of its search results. All search results are then arranged in ascending order of penalty. If a single document is found by more than one search expression, the smaller penalty is adopted for that result. The arranged search results are displayed on the screen of the display device.
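The merging rule just described (keep the smallest penalty per document, then sort in ascending order of penalty) can be sketched as follows. The document identifiers, the `merge_results` name, and the tie-break by identifier are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple


def merge_results(runs: Iterable[Tuple[int, List[str]]]) -> List[Tuple[str, int]]:
    """Merge per-expression results; each run is (penalty of the expression, documents found)."""
    best: Dict[str, int] = {}
    for penalty, docs in runs:
        for doc in docs:
            # A document found by several expressions keeps its smallest penalty.
            if doc not in best or penalty < best[doc]:
                best[doc] = penalty
    # Ascending penalty; ties are broken by identifier (an assumption, not stated in the text).
    return sorted(best.items(), key=lambda item: (item[1], item[0]))


runs = [
    (0, ["doc-a"]),              # original search expression
    (30, ["doc-a", "doc-b"]),    # a relaxed tree-structured expression
    (500, ["doc-b", "doc-c"]),   # Boolean search expression
]
merged = merge_results(runs)
print(merged)
```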

[0051] FIG. 11 is a diagram showing a display example of search results according to the first embodiment. In the figure, the documents obtained as search results are displayed in the search result display screen 21a, each with its penalty, document name, author, and creation date. The screen has a scroll bar. By operating it, the user can scroll the displayed documents and view the search results in turn, starting from those with the smallest penalties. If an upper limit has been set for the penalty, no search result exceeds that limit.

[0052] In the figure, the strictness of the search condition is shown as a penalty, but a score that is higher for stricter search conditions may be calculated and displayed instead. For example, when an upper limit for the penalty has been specified with the penalty specifying means 14, the score can be defined as that upper limit minus the penalty.
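As a small worked example of this conversion, the sketch below turns penalties into scores by subtracting them from the upper limit. The upper limit of 500 and the sample penalties are assumptions chosen for illustration.

```python
def penalty_to_score(penalty: int, upper_limit: int) -> int:
    """Score that grows as the search condition gets stricter (i.e. as the penalty shrinks)."""
    return upper_limit - penalty


UPPER_LIMIT = 500  # hypothetical value entered through the penalty specifying means
scores = [penalty_to_score(p, UPPER_LIMIT) for p in (0, 30, 120, 220, 420)]
print(scores)
```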

[0053] In this way, the user can browse search results obtained with search expressions whose conditions are relaxed step by step from the input search expression. Therefore, even when a search expression written on the basis of the correct logical structure is input, documents whose logical structure was not created correctly can be found, and recall improves. Moreover, because results from the expressions whose conditions were relaxed the least are displayed first, browsing does not become inconvenient for the user even when the number of search results grows.

[0054] The first embodiment described above relaxes the search conditions step by step by rewriting the conditions of the input search expression. It is also possible to extract document parts (the individual elements of a structured document) on the basis of the sub-expressions of the input search expression, and to output as search results the document parts that satisfy the conditions of the search expression to a high degree. Such an example is described below as a second embodiment.

[0055] FIG. 12 is a block diagram of a second embodiment of the structured document search device. In this embodiment, the likelihood of a search result is expressed as a "score". The larger the score, the more likely the search result. The search expression input means 41 provides the user with functions for creating and entering a search expression. The search expression holding means 42 holds the search expression entered through the search expression input means 41.

[0056] The document holding means 43 stores a plurality of documents with different logical structures. The sub-expression evaluation means 44 takes the sub-expressions of the search expression stored in the search expression holding means 42 one at a time and evaluates each against the documents stored in the document holding means 43. That is, it extracts from the document holding means 43, as candidates, the document parts that satisfy the condition expressed by the sub-expression. Meanwhile, the score assigning means 45 provides the user with a function for assigning, for each type of sub-expression, the reference value used to calculate scores. The score calculation means 46 determines the score of each candidate extracted by the sub-expression evaluation means 44 according to the score values entered through the score assigning means 45.

[0057] The candidate holding means 47 holds the candidates obtained by the sub-expression evaluation means 44, each with its score attached. The score specifying means 48 provides the user with a function for entering a lower limit for the score. The search result output means 49 displays on the screen of the display device, as search results and in descending order of score, the document parts whose scores are greater than the lower limit specified through the score specifying means 48.

[0058] The sub-expression evaluation means 44, the score calculation means 46, and the candidate holding means 47 in the figure together constitute the search expression evaluation means 40. To run a search on this structured document search device, the user first uses the score assigning means 45 to assign, for each type of sub-expression, the reference value used to calculate scores. The more strongly a search condition narrows the search targets, the larger the score assigned to it. The entered scores are passed to the score calculation means 46 and held there. The user can also enter a lower limit for the score with the score specifying means 48. The lower limit is not a required setting.

[0059] After making the above settings, a user who wishes to search for documents uses the search expression input means 41 to enter a search expression, either written by the user or generated by a given program, into the search expression holding means 42, which holds it. The sub-expression evaluation means 44 evaluates the held search expression, sub-expression by sub-expression, against the documents held in the document holding means 43. The candidates obtained by evaluating each sub-expression are held in the candidate holding means 47 together with the score calculated by the score calculation means 46 for that sub-expression. When a candidate document part is obtained while the sub-expressions are being evaluated and that part is not yet in the candidate holding means 47, the score calculation means 46 assigns the score of the sub-expression as the candidate's score. If the candidate is already in the candidate holding means 47, the score calculation means 46 increases the candidate's score by the score of the sub-expression.
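The accumulation rule above (assign the sub-expression's score to a new candidate, add it to an existing one) together with the lower-limit filter and descending ordering of the output can be sketched as follows. The function names, part identifiers, and sample scores are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def accumulate(matches: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Each match is (document part, score of the sub-expression it satisfied)."""
    candidates: Dict[str, int] = {}
    for part, score in matches:
        # New candidate: start from this score; existing candidate: add to it.
        candidates[part] = candidates.get(part, 0) + score
    return candidates


def ranked(candidates: Dict[str, int],
           lower_limit: Optional[int] = None) -> List[Tuple[str, int]]:
    """Keep parts scoring above the optional lower limit, highest score first."""
    kept = [(part, score) for part, score in candidates.items()
            if lower_limit is None or score > lower_limit]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


matches = [("sec-1", 100), ("sec-2", 100), ("sec-1", 200), ("sec-1", 100)]
candidates = accumulate(matches)
print(ranked(candidates))
print(ranked(candidates, lower_limit=100))
```

Unlike the first embodiment, no relaxed expressions are generated here; a part that satisfies only some sub-expressions still appears, just with a lower total.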

[0060] After all the sub-expressions have been evaluated, the search results held with their scores in the candidate holding means 47 are output in order of score by the search result output means 49. Next, the second embodiment is described concretely. First, an example of the scores assigned to the conditions used to evaluate documents is shown.

[0061] FIG. 13 is a diagram showing an example of scores assigned to the conditions used to evaluate documents. These scores 46a are entered through the score assigning means 45 and held by the score calculation means 46. In the example shown, the score is "100" when the condition is "type", "100" when it is "parent-child relationship", "100" when it is "ancestor-descendant relationship", and "100" when it is "text condition". The score is "200" when the condition is "AND of types" and "200" when it is "AND of text conditions".

[0062] With these scores set, the user enters a search expression through the search expression input means 41, for example the search expression shown in FIG. 4. The search expression evaluation means 40 then evaluates the search expression against the documents in the document holding means 43 and takes the highly rated documents as the search results.

[0063] FIG. 14 is a flowchart of the procedure for evaluating a search expression. In this procedure, scores are given to the search result candidates as the search conditions are evaluated.
[S31] The sub-expression evaluation means 44 checks whether any node in the internal representation of the search expression is still unprocessed. If an unprocessed node exists, the process proceeds to step S32; otherwise, it proceeds to step S37.
[S32] The sub-expression evaluation means 44 selects one unprocessed node from the internal representation of the search expression.
[S33] The sub-expression evaluation means 44 checks whether any search condition of the node selected in step S32 is still unprocessed. If an unprocessed condition exists, the process proceeds to step S34; otherwise, it returns to step S31.
[S34] The sub-expression evaluation means 44 selects one unprocessed search condition of the node selected in step S32.
[S35] The sub-expression evaluation means 44 evaluates the search condition selected in step S34.
[S36] The score calculation means 46 adds, to each candidate obtained by evaluating the search condition in step S35, the score corresponding to that search condition. The process then returns to step S33.
[S37] The sub-expression evaluation means 44 checks whether any structural condition is still unevaluated. If an unevaluated structural condition exists, the process proceeds to step S38; otherwise, it proceeds to step S41.
[S38] The sub-expression evaluation means 44 selects one unevaluated structural condition.
[S39] The sub-expression evaluation means 44 evaluates the structural condition selected in step S38.
[S40] The score calculation means 46 adds, to each candidate obtained by the evaluation in step S39, the score corresponding to the condition evaluated in step S39.
[S41] The candidate holding means 47 removes from the candidates those that are not extraction targets and holds the rest. That is, only candidates whose extraction-target flag is "true" are held.

[0064] Through the above processing, search results are obtained from the search expression shown in FIG. 4, and a score is attached to each search expression. Because the only extraction target in the search expression of FIG. 4 is node 31, whose type is "Section", the search results retrieved are limited to document parts of type "Section". FIGS. 15 to 17 show search results obtained with the search expression 30 of FIG. 4, together with their scores.

[0065] FIG. 15 is a diagram showing a first example of a search result. In this document part 50, the relationships and contents of nodes 51 to 54 satisfy all the conditions of the search expression 30 of FIG. 4. The score is therefore the maximum value (= 1500).

[0066] FIG. 16 is a diagram showing a second example of a search result. This document part 50a satisfies only the conditions on the three nodes 51, 52, and 54. Its score is therefore lower than in the example of FIG. 15: score = 1000.

[0067] FIG. 17 is a diagram showing a third example of a search result. This document part 50b satisfies only the conditions on the two nodes 51 and 52. Its score is therefore lower still than in the example of FIG. 16: score = 700.

[0068] The search results retrieved in this way are displayed on the screen of the display device in order of score. FIG. 18 is a diagram showing a display example of search results according to the second embodiment. In the figure, the documents obtained as search results are displayed in the search result display screen 49a, each with its score, document name, author, and creation date. By operating the scroll bar, the user can scroll the displayed documents and bring results with lower scores onto the screen in turn. If an upper limit has been set for the score, search results exceeding that limit are not displayed.

[0069] In addition to the two embodiments above, the following variations are also conceivable. In the embodiments above, the scores and penalties are fixed values, but they may instead vary with the state of the database. Examples of ways to change these values according to the state of the database are given below.

[0070] For example, for a condition on text, the more frequently a word or phrase appears in the database, the lower its score or penalty is made. Likewise, for a condition on type, the more frequently the word or phrase appears in the database, the lower the score or penalty is made.

[0071] To calculate a penalty or score according to the frequency of a word or phrase in this way, the penalty calculation means 13 (shown in FIG. 2) or the score calculation means 46 (shown in FIG. 12) need only have means for determining, when a condition on text is specified, how frequently the word or phrase in that condition occurs, and means for determining a penalty or score according to that frequency.

[0072] In addition, when a parent-child or ancestor-descendant relationship necessarily holds, the score for matching a condition on that relationship is set to 0. In this case, because the relationship always holds, the penalty value, whatever it is, has no practical meaning. Whether a parent-child or ancestor-descendant relationship necessarily holds can be determined by examining the document type. When another type T2 can be reached from a type T1 by following only SEQ or REP nodes, T2 always exists if T1 exists (and vice versa), and the two are in a parent-child relationship. Likewise, when another type T2 can be reached from a type T1 by following only SEQ, REP, or type nodes, T2 always exists if T1 exists (and vice versa), and the two are in an ancestor-descendant relationship.
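The reachability test described above can be sketched as a graph search over the document type. The dictionary encoding, the node names, and the "OR" label for a choice node are assumptions made for this example; only the SEQ, REP, and type node kinds come from the text.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding of a document type: node name -> (kind, child node names).
DocType = Dict[str, Tuple[str, List[str]]]


def reaches(doc_type: DocType, start: str, target: str, via: Set[str]) -> bool:
    """True if `target` is reachable from `start` by passing only through nodes whose kind is in `via`."""
    stack = list(doc_type[start][1])
    seen: Set[str] = set()
    while stack:
        name = stack.pop()
        if name == target:
            return True
        if name in seen:
            continue
        seen.add(name)
        kind, children = doc_type[name]
        if kind in via:                 # only allowed node kinds may be traversed
            stack.extend(children)
    return False


def always_parent_child(doc_type: DocType, t1: str, t2: str) -> bool:
    return reaches(doc_type, t1, t2, {"SEQ", "REP"})


def always_ancestor_descendant(doc_type: DocType, t1: str, t2: str) -> bool:
    return reaches(doc_type, t1, t2, {"SEQ", "REP", "TYPE"})


doc_type: DocType = {
    "Article": ("TYPE", ["seq1"]),
    "seq1": ("SEQ", ["Title", "rep1"]),
    "rep1": ("REP", ["Section"]),
    "Section": ("TYPE", ["seq2"]),
    "seq2": ("SEQ", ["Heading", "alt1"]),
    "alt1": ("OR", ["Para", "List"]),   # choice node: neither alternative is guaranteed
    "Title": ("TYPE", []),
    "Heading": ("TYPE", []),
    "Para": ("TYPE", []),
    "List": ("TYPE", []),
}
print(always_parent_child(doc_type, "Article", "Section"))
print(always_ancestor_descendant(doc_type, "Article", "Heading"))
print(always_ancestor_descendant(doc_type, "Article", "Para"))
```

A condition whose relationship passes this test carries no information about a document, which is the reason given above for scoring it as 0.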

[0073]

[Effects of the Invention] As described above, in the first invention, condition-relaxed search expressions whose search conditions are loosened step by step are generated from the input search expression, and searches with these looser conditions are also performed. Documents whose logical structure was not created correctly can therefore also be found, and recall improves. Moreover, because the search results are output in order of likelihood with respect to the original search expression, the user can examine them starting from the most likely ones.

[0074] In the second invention, document parts are extracted using the sub-expressions of the input search expression and are output as search results in descending order of the degree to which they satisfy the conditions of the input search expression. As in the first invention, documents whose logical structure was not created correctly can therefore also be found, and recall improves.

[Brief description of the drawings]

FIG. 1 is a principle configuration diagram of the present invention.

FIG. 2 is a block diagram of a first embodiment of the structured document search device.

FIG. 3 is a diagram showing an example of penalties assigned to rewrite rules.

FIG. 4 is a diagram showing an example of a search expression.

FIG. 5 is a flowchart of the search expression rewriting process.

FIG. 6 is a diagram showing a first example of a rewritten search expression.

FIG. 7 is a diagram showing a second example of a rewritten search expression.

FIG. 8 is a diagram showing a third example of a rewritten search expression.

FIG. 9 is a diagram showing a fourth example of a rewritten search expression.

FIG. 10 is a flowchart of the procedure for rewriting an entire search expression into a Boolean search.

FIG. 11 is a diagram showing a display example of search results according to the first embodiment.

FIG. 12 is a block diagram of a second embodiment of the structured document search device.

FIG. 13 is a diagram showing an example of scores assigned to the conditions used to evaluate documents.

FIG. 14 is a flowchart of the procedure for evaluating a search expression.

FIG. 15 is a diagram showing a first example of a search result.

FIG. 16 is a diagram showing a second example of a search result.

FIG. 17 is a diagram showing a third example of a search result.

FIG. 18 is a diagram showing a display example of search results according to the second embodiment.

FIG. 19 is a diagram showing an example of a logical structure.

FIG. 20 is a diagram showing an example of a document type.

[Explanation of symbols]

1 search expression
2 search expression generating means
3 accuracy calculating means
4 condition-relaxed search expression
5 search execution means
6 document holding means
7 search result merging means

Claims

1. A structured document search device that searches structured documents, comprising: search expression generating means for, when a search expression described in terms of node types, node contents, node attributes, and structural relationships between nodes in a document structure is input, generating condition-relaxed search expressions by rewriting the search conditions in the search expression into progressively looser conditions; accuracy calculating means for calculating, according to the content of the rewriting performed to generate each condition-relaxed search expression, an accuracy indicating the likelihood of the search results obtained with that condition-relaxed search expression; search execution means for performing searches based on the input search expression and the condition-relaxed search expressions; and search result merging means for merging the search results obtained by the search execution means, arranged in descending order of accuracy.

2. The structured document search device according to claim 1, wherein the search expression generating means generates only condition-relaxed search expressions whose accuracy is higher than a predetermined limit value.

3. The document search device according to claim 1, further comprising accuracy assigning means for assigning a reference accuracy to each rewrite rule, wherein the accuracy calculating means calculates the accuracy of each condition-relaxed search expression based on the reference accuracy assigned by the accuracy assigning means.

4. A structured document search device that searches structured documents, comprising: sub-expression evaluation means for, when a search expression described in terms of node types, node contents, node attributes, and structural relationships between nodes in a document structure is input, treating each search condition in the search expression as an individual sub-expression and extracting, for each sub-expression, the document parts that satisfy its condition; accuracy calculating means for calculating an accuracy indicating the likelihood of each document part based on which sub-expression conditions the document part satisfies; and search result output means for outputting the search results in descending order of the accuracy of the document parts.

5. The structured document search device according to claim 4, wherein the search result output means takes as search results only the document parts whose accuracy is higher than a predetermined limit value.

6. The structured document search device according to claim 4, further comprising accuracy assigning means for assigning a reference accuracy to each type of sub-expression, wherein the accuracy calculating means calculates the accuracy of each document part based on the reference accuracy assigned by the accuracy assigning means.