JP2001014326A

JP2001014326A - Device and method for retrieving similar document by structure specification

Info

Publication number: JP2001014326A
Application number: JP11183349A
Authority: JP
Inventors: Tadataka Matsubayashi; 忠孝松林; Katsumi Tada; 勝己多田; Natsuko Sugaya; 菅谷　　奈津子; Yasuhiko Inaba; 靖彦稲場; Akihiko Yamaguchi; 明彦山口; Yosuke Gochi; 陽介後地
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1999-06-29
Filing date: 1999-06-29
Publication date: 2001-01-19

Abstract

PROBLEM TO BE SOLVED: To improve retrieval precision when retrieving documents similar to a seed document (a document specified as a retrieval condition) by adding the specification of a structure to be retrieved to the retrieval conditions. SOLUTION: A retrieval condition expression analyzing program 130 receives, as retrieval conditions, the specification of a seed document and of a structure to be retrieved. A featured character string extracting program 150 extracts featured character strings from the text of the specified seed document. A retrieval object structure ID acquiring program 151 converts the specified structure into its ID. A similarity calculating program 152 searches an appearance frequency file 181 to acquire the appearance frequencies, per document, of entries that match both a featured character string and the structure ID, and calculates the similarity of each such document to the seed document. A retrieval result output program 132 displays the identifier and similarity of each similar document as the retrieval result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to an apparatus and a method for retrieving documents similar to a document specified as a search condition (a seed document), and more particularly to an apparatus and a method that perform retrieval targeting the structures of structured documents.

[0002]

[Description of the Related Art] In recent years, with the spread of personal computers and the Internet, the number of electronic documents stored in databases has been increasing, and there is growing demand to search this vast collection accurately, quickly, and efficiently for documents containing the information a user wants.

[0003] Various search techniques have been proposed to meet this demand. For example, according to Japanese Unexamined Patent Publication No. H10-240752, for documents in which the individual logical structural elements constituting the document can be identified (hereinafter, structured documents), a highly accurate search can be performed by adding conditions on the logical structure to the search conditions.

[0004] Japanese Unexamined Patent Publication No. H11-143902 discloses a similar-document search technique in which the user specifies a document or passage having the desired content (hereinafter, a seed document) and documents similar to it are retrieved. With this technique, the target documents can be found simply by presenting a sample document, which saves the user the effort of devising and entering complicated search condition expressions and enables efficient searching.

[0005]

[Problems to be Solved by the Invention] The technique of the above-mentioned Japanese Unexamined Patent Publication No. H11-143902 is easy to use and lets the user search efficiently, but it leaves a problem of search accuracy, as illustrated below.

[0006] FIG. 3 shows the processing procedure of a conventional similar-document search system. A search condition acquisition program displays, on a display device, a guidance screen for entering search conditions; for example, it displays list information such as the document numbers and headings of a plurality of candidate documents that include the seed document. When a seed document is specified as the search condition, a feature n-gram extraction program is started; it retrieves the full text of the seed document from the document file and extracts characteristic character strings from that text. Next, a similarity calculation program is started. It refers to an appearance frequency file in which, for each characteristic character string, document numbers and the number of occurrences of that string are registered. Based on the characteristic character strings of the seed document, it calculates the similarity to the seed document of each related document that uses the same characteristic character strings, and displays list information such as the document number, similarity, and heading of each candidate document as the search result.

[0007] In the search result of FIG. 3, the document that closely resembles the seed document is document 4. Nevertheless, the weakly related document 1 receives a higher similarity because the characteristic character strings appear in it more frequently, and it is therefore displayed with priority. This is the problem.

[0008] An object of the present invention is to provide a similar-document search apparatus and search method that avoid the entry of complicated search conditions yet achieve high search accuracy.

[0009]

[Means for Solving the Problems] The present invention is characterized by a method of retrieving, using a computer, structured documents similar to a seed document by structure specification, the method comprising a step of receiving, as search conditions for similarity calculation, the specification of a seed document and of at least one structure belonging to the structured documents, and a step of displaying, after the similarity calculation, target documents having higher similarity with priority.

[0010] The present invention is also characterized by a method of retrieving documents similar to a structured seed document by structure specification, the method comprising a step of receiving the specification of a seed document and of at least one structure belonging to that seed document, and a step of displaying, after similarity calculation, target documents having higher similarity with priority.

[0011] The present invention is further characterized by a search apparatus having the above functions.

[0012] Here, specifying a structure means specifying the name of a logical structural element constituting a document.

[0013]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0014] FIG. 1 is a configuration diagram of the similar-document search system of the first embodiment. The computer hardware that implements this system consists of a display device 100, an input device 101, a central processing unit (CPU) 102, an external storage device 103, a floppy disk drive (FDD) 104, a main memory 106, and a bus 107 connecting these devices.

[0015] The external storage device 103 stores text 180, an appearance frequency file 181, and a structure index 182. The text 180 holds a set of structured document files or unstructured document files. Here, a structured document is a document having a logical structure conforming to a standard format such as SGML or XML, or a document made up of a plurality of flat texts extracted for each logical structure. Documents stored on a floppy disk 105 are registered in the text 180 via the FDD 104 and the main memory 106.

[0016] The system control program 110 stored in the main memory 106 includes an operating system, a program that provides a graphical user interface, and the like. The document registration control program 111 controls the execution of the document registration programs. The registration programs are a text registration program 120, an appearance frequency file creation program 121 that includes an appearance frequency counting program 140, and a structure index creation program 122. The text registration program 120 registers documents on the floppy disk 105 in the text 180.

[0017] The search control program 112 controls the execution of the programs involved in similar-document search. The search programs are a search condition expression analysis program 130, a similar document search program 131, and a search result output program 132. The similar document search program 131 includes a characteristic character string extraction program 150, a search target structure ID acquisition program 151, and a similarity calculation program 152. The functions of these search programs are explained in the description of the search processing procedure below. The search control program 112 and the search programs can be stored on a storage medium, read into the main memory 106 via a drive device, and executed by the CPU 102.

[0018] A structured document analysis program 170 is stored in the main memory 106 as a shared library 160. The work area 161 is used as temporary storage and working space for data read from the text 180, the appearance frequency file 181, and the structure index 182.

[0019] As shown in part of FIG. 2, the appearance frequency file 181 stores, for each character string or word, a document number, the ID of the logical structure that contains the character string, and the number of occurrences within that logical structure. The documents registered in the appearance frequency file 181 may be structured or unstructured. The appearance frequency file creation program 121 reads the documents in the text 180 one at a time, extracts characteristic character strings from the text, uses the appearance frequency counting program 140 to count the occurrences of each characteristic character string in each logical structure, creates the appearance frequency file 181, and registers it in the external storage device 103. For an unstructured document, when a division into logical structures is specified, characteristic character strings are associated with logical structure IDs according to that specification. Japanese Unexamined Patent Publication No. H11-143902, for example, discloses a processing procedure for an appearance frequency file creation program.
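The layout of the appearance frequency file described above can be illustrated with a minimal Python sketch. The function name `build_frequency_file`, the in-memory dictionary layout, and the use of simple substring counting are assumptions made for illustration; the publication refers to JP H11-143902 for the actual creation procedure.

```python
from collections import defaultdict


def build_frequency_file(documents, strings):
    """Map each string to postings of (document number, structure ID, count).

    documents: {doc_no: {structure_id: text}}  (hypothetical input layout)
    strings:   characteristic character strings to count
    """
    freq = defaultdict(list)
    for doc_no, structures in documents.items():
        for structure_id, text in structures.items():
            for s in strings:
                count = text.count(s)  # non-overlapping occurrences
                if count:
                    freq[s].append((doc_no, structure_id, count))
    return dict(freq)


documents = {
    1: {1: "aspirin relieves pain", 2: "stomach pain, nausea"},
    2: {1: "relieves fever"},
}
frequency_file = build_frequency_file(documents, ["pain", "fever"])
```

Each posting keeps the structure ID next to the count, which is what later allows a search to be restricted to the specified structures.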

[0020] As shown in part of FIG. 2, the structure index 182 stores the correspondence between logical structures and their IDs. The structure index creation program 122 calls the structured document analysis program 170, analyzes the logical structure of the document text read from the text 180, assigns an ID to each logical structure, and registers the result in the external storage device 103. Japanese Unexamined Patent Publication No. H10-240752, for example, discloses a processing procedure for a structure index creation program.
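As a rough illustration of the structure index, the following Python sketch assigns an integer ID to each distinct logical structure name in order of first appearance. The function name and the sequential numbering scheme are assumptions; the publication only states that each logical structure is given an ID.

```python
def build_structure_index(structure_names):
    """Assign sequential IDs (starting at 1) to distinct structure names."""
    index = {}
    for name in structure_names:
        if name not in index:
            index[name] = len(index) + 1
    return index


structure_index = build_structure_index(
    ["title", "efficacy", "title", "side_effects"]
)
```

Looking up `structure_index["efficacy"]` then plays the role of the search target structure ID acquisition step.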

[0021] FIG. 2 shows the processing procedure of the first embodiment. In the first embodiment, the seed document is an unstructured document, and the search target documents are documents (structured or unstructured) already registered in the appearance frequency file 181. The search condition expression analysis program 130 displays a guidance screen on the display device 100 and accepts the input of a search condition expression.

[0022] Here, the search condition expression consists of a seed document and at least one search target structure. The seed document can be selected from a plurality of candidate documents whose headings and the like are already displayed, entered directly from the input device 101, or supplied via the FDD 104, a CD-ROM device (not shown), a network (not shown), or the like.

[0023] Furthermore, as shown in FIG. 12, the search condition expression may be entered through a screen interface on the display device 100 that has a seed document input area 1200, a search target structure input area 1201, and a search execution button 1202. A seed document can be entered directly into the seed document input area 1200 from the input device 101, or text on a search result display screen (not shown) can be copied into the seed document input area 1200. Alternatively, the seed document may be a designated portion of such text.

[0024] At least one search target structure can be selected from a displayed drop-down menu. When a plurality of search target structures are specified, the search condition may assign a weight to each structure. The weights may be entered through a weight input area (not shown) or defined in a system definition file (not shown).

[0025] When the seed document and the search target structure are entered via the input device 101, the search condition expression analysis program 130 obtains the text of the seed document from the specified search condition. Instead of specifying the structures to be searched, the user may specify structures not to be searched (structures excluded from the search); in that case, the search condition expression analysis program 130 treats the remaining structures as the search target structures. Also, instead of entering the search target structure via the input device 101 as part of the search condition expression, a search target structure set in advance in a system definition file (not shown) may be used.
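The three ways of determining the search target structures (explicit specification, exclusion, or a preset default) can be sketched as follows. The function name, parameter names, and the order of precedence are assumptions for illustration; the publication does not state which source takes priority when more than one is supplied.

```python
def resolve_target_structures(all_structures, include=None, exclude=None, default=None):
    """Return the structures to search.

    include: structures named in the search condition (assumed to win if given)
    exclude: structures to leave out; the remaining ones become the targets
    default: structures preset in a (hypothetical) system definition file
    """
    if include:
        return [s for s in all_structures if s in include]
    if exclude:
        return [s for s in all_structures if s not in exclude]
    return list(default or [])


structures = ["title", "efficacy", "side_effects"]
targets = resolve_target_structures(structures, exclude={"title"})
```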

[0026] The search target structure ID acquisition program 151 refers to the structure index 182 and obtains the identifier corresponding to the specified structure as the search target structure ID. The characteristic character string extraction program 150 retrieves the full text of the specified seed document from the text 180, extracts characteristic character strings, and counts the number of occurrences of each extracted string. The method described in Japanese Unexamined Patent Publication No. H11-143902, for example, can be used to extract the characteristic character strings. In the example of FIG. 2, the extracted characteristic character strings with the highest priority are adopted. Alternatively, words may be cut out of the document text and extracted by checking them against the words registered in a word dictionary (not shown).
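To make the extraction and counting step concrete, here is a minimal Python sketch that uses plain character n-grams ranked by frequency as a stand-in. This is an assumption made for illustration only: the actual extraction method is the one described in JP H11-143902, which is not reproduced here, and the function name and the `n` and `top_k` parameters are hypothetical.

```python
from collections import Counter


def extract_characteristic_strings(text, n=2, top_k=5):
    """Count character n-grams (skipping any that contain whitespace)
    and keep the top_k most frequent as stand-in characteristic strings."""
    grams = Counter(
        text[i:i + n]
        for i in range(len(text) - n + 1)
        if not any(ch.isspace() for ch in text[i:i + n])
    )
    return dict(grams.most_common(top_k))


seed_counts = extract_characteristic_strings("abab", n=2, top_k=1)
```

Keeping only the highest-ranked strings mirrors the remark that the example of FIG. 2 adopts the extracted strings with the highest priority, although the ranking criterion here (raw frequency) is a simplification.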

[0027] Next, the similarity calculation program 152 refers to the appearance frequency file 181 and acquires the document numbers, together with the occurrence frequencies, of the documents in which each extracted characteristic character string or word appears under a matching search target structure ID. It then refers to the appearance frequency file 181 again and, for the search target structure ID of each acquired document, obtains the occurrence frequencies of character strings or words other than the extracted ones, and calculates the similarity to the seed document for each document. As the similarity calculation method, Equation 1 described in Japanese Unexamined Patent Publication No. H11-143902, for example, can be used. Alternatively, a feature vector whose elements are the normalized appearance weights of the characteristic character strings (words) of the seed document may be computed along with a feature vector for each acquired document, and the similarity of each document may be calculated as the inner product of the feature vectors of the seed document and that document.
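The inner-product alternative can be sketched in Python as follows. The sketch assumes L2 normalization of raw counts as the "normalized appearance weight", and an appearance frequency file held as a dictionary of `(document number, structure ID, count)` postings; both are illustrative assumptions, and Equation 1 of JP H11-143902 is not reproduced.

```python
import math


def collect_vectors(freq_file, target_ids):
    """Build per-document count vectors restricted to the target structure IDs."""
    vectors = {}
    for string, postings in freq_file.items():
        for doc_no, structure_id, count in postings:
            if structure_id in target_ids:
                vec = vectors.setdefault(doc_no, {})
                vec[string] = vec.get(string, 0) + count
    return vectors


def normalize(counts):
    """Scale a count vector to unit length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else {}


def similarity(seed_counts, doc_counts):
    """Inner product of the normalized seed and document vectors."""
    seed_vec, doc_vec = normalize(seed_counts), normalize(doc_counts)
    return sum(w * doc_vec.get(k, 0.0) for k, w in seed_vec.items())


freq_file = {"pain": [(1, 2, 3), (2, 1, 5)], "fever": [(1, 2, 4)]}
vectors = collect_vectors(freq_file, {2})  # only structure ID 2 is searched
score = similarity({"pain": 3, "fever": 4}, vectors[1])
```

Because document 2 contains "pain" only under structure ID 1, it never enters `vectors`, which is the effect the structure specification is meant to achieve.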

[0028] Finally, the search result output program 132 sorts the acquired documents in descending order of similarity, determines the display priority according to that order, and displays the document numbers and their similarities on the display device 100, starting with the highest-priority document. Bibliographic items such as the heading and summary of each document may be obtained from a file (not shown) and displayed as well. The document number, similarity, heading, and so on of the seed document can also be displayed for comparison with the similar documents.

[0029] When a plurality of search target structures are specified, the similarity of each document may be calculated by accumulating the similarities of the individual search target structures into a cumulative value, and the documents may be sorted in descending order of this cumulative value. Here, the cumulative value is, for example, the sum of the similarities of the individual search target structures or the square root of the sum of their squares. Alternatively, for each document, the highest of the similarities of its search target structures may be adopted, and the documents sorted in descending order of the adopted similarity. For example, when logical structures of the same type appear repeatedly in a document, as with [Claim n] in a patent specification, calculating a similarity for each logical structure and adopting the highest one for comparison with the seed document makes it possible to compare the logical structures whose contents are most similar, regardless of the order in which structures of the same type appear. Both a mode that computes the cumulative similarity and a mode that adopts the highest similarity may be provided, with the mode selectable as one of the search conditions or set in advance in a system definition file (not shown).
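The aggregation modes named above (sum, square root of the sum of squares, and maximum) can be written out as a short Python sketch. The function names and mode labels are assumptions for illustration; per-structure similarities are taken as already computed.

```python
import math


def aggregate(similarities, mode="sum"):
    """Combine per-structure similarities into one document similarity."""
    if mode == "sum":
        return sum(similarities)
    if mode == "rss":  # square root of the sum of squares
        return math.sqrt(sum(s * s for s in similarities))
    if mode == "max":  # adopt the highest per-structure similarity
        return max(similarities)
    raise ValueError(f"unknown mode: {mode!r}")


def rank_documents(per_structure_sims, mode="sum"):
    """Return document numbers in descending order of aggregated similarity."""
    scored = {doc: aggregate(sims, mode) for doc, sims in per_structure_sims.items()}
    return sorted(scored, key=lambda doc: (-scored[doc], doc))


sims = {1: [0.3, 0.4], 2: [0.6, 0.0]}
by_sum = rank_documents(sims, "sum")
by_max = rank_documents(sims, "max")
```

With these inputs the two modes disagree: document 1 leads under the sum, while document 2 leads under the maximum, which shows why the mode is worth exposing as a search condition.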

[0030] FIG. 4 shows the configuration of the similar document search program 131a of the second embodiment. In the second embodiment, a seed document structure analysis program 400 is added to the characteristic character string extraction program 150a. The seed document structure analysis program 400 is configured to call the structured document analysis program 170 stored in the shared library 160. In addition, a corresponding structure determination program 401 is added to the similarity calculation program 152a.

[0031] FIG. 5 shows the processing procedure of the second embodiment. In the second embodiment, the seed document is a structured document, and the search target documents are documents (structured or unstructured) already registered in the appearance frequency file 181. The search conditions of the second embodiment are a seed document and at least one structure belonging to the seed document. When the seed document and the structure are entered via the input device 101, the search condition expression analysis program 130 obtains the text of the seed document from the specified search condition. The logical structure of the seed document specified in the search condition is assumed to match the logical structure of the search target documents. Instead of specifying the structures to be searched, the user may specify structures not to be searched (structures excluded from the search); in that case, the search condition expression analysis program 130 treats the remaining structures belonging to the seed document as the search target structures. As in the first embodiment, the search target structure may be set in advance in the system definition file.

[0032] Next, the seed document structure analysis program 400 retrieves the text of the seed document from the text 180, analyzes the structure of the seed document, and extracts only the body text belonging to the specified structure. If the text of the seed document does not contain the specified structure, an error is raised. A method of analyzing document structure is described, for example, in Japanese Unexamined Patent Publication No. H10-240752 as the processing procedure of a document structure analysis program. Next, the search target structure ID acquisition program 151 refers to the structure index 182 and obtains the search target structure ID corresponding to the specified structure. The characteristic character string extraction program 150a extracts characteristic character strings (words) from the extracted text and counts the number of occurrences of each. If the characteristic character strings of each structure of the seed document have already been extracted, their occurrences per structure counted, and the results registered in the same manner as the appearance frequency file 181, it suffices to refer to that file and extract the characteristic character strings and occurrence counts for the specified structure. In that case, the processing of the seed document structure analysis program 400 and the characteristic character string extraction program 150a can be skipped.
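The extraction of only the specified structure's text, with an error when the structure is absent, can be sketched for an XML seed document using Python's standard `xml.etree.ElementTree` module. The element names and the function name are hypothetical, and the actual structure analysis procedure is the one referenced in JP H10-240752.

```python
import xml.etree.ElementTree as ET


def extract_structure_text(xml_text, structure_name):
    """Return the text of every element named structure_name.

    Raises ValueError when the seed document lacks the structure,
    mirroring the error case described above.
    """
    root = ET.fromstring(xml_text)
    texts = ["".join(el.itertext()).strip() for el in root.iter(structure_name)]
    if not texts:
        raise ValueError(f"structure {structure_name!r} not found in seed document")
    return texts


seed = (
    "<doc><efficacy>relieves pain</efficacy>"
    "<side_effects>may cause drowsiness</side_effects></doc>"
)
side_effect_text = extract_structure_text(seed, "side_effects")
```

Returning a list rather than a single string keeps repeated structures of the same type separate, so a similarity can later be computed for each occurrence.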

[0033] Next, as in the first embodiment, the similarity calculation program 152a refers to the appearance frequency file 181, acquires the document numbers of the documents in which each extracted characteristic character string (word) appears under a matching search target structure ID, and calculates the similarity to the seed document for each document. In doing so, the corresponding structure determination program 401 matches the structures of the seed document against the structure IDs of the documents obtained from the appearance frequency file 181, divides the characteristic character strings into groups by search target structure, and calculates a similarity for each search target structure. When a plurality of search target structures are specified, the final similarity of each document is calculated in the same way as in the first embodiment. Finally, the search result output program 132 sorts the acquired documents in descending order of the calculated similarity and displays their document numbers, similarities, headings, and so on on the display device 100.

[0034] In the above description of the second embodiment, the specified structure of the seed document and the search target structure were assumed to be the same, but they may be different logical structures. That is, when a structure of the seed document is specified together with a different search target structure, the seed document structure analysis program 400 and the characteristic character string extraction program 150a extract characteristic character strings from the specified structure of the seed document, and the similarity calculation program 152a searches the search target documents using the ID of the specified search target structure. The corresponding structure determination program 401 treats the specified structure of the seed document and the specified search target structure as belonging to the same group and associates them. For example, by using [side effects] of a drug efficacy statement as the structure from which characteristic character strings are extracted, and [efficacy] as the search target structure for which the similarity of the search target documents is calculated, it becomes possible to find documents describing drugs that suppress the side effects of the drug described in the seed document.
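The side-effects-to-efficacy example can be illustrated with a small Python sketch in which strings taken from one structure of the seed document are looked up under a different structure ID in the candidates. The scoring used here (a plain sum of matching counts) is a simplification assumed for brevity and is not the similarity formula of the publication; the data and names are hypothetical.

```python
def cross_structure_candidates(seed_strings, freq_file, target_structure_id):
    """Score documents by how often the seed strings occur in the target structure."""
    scores = {}
    for s in seed_strings:
        for doc_no, structure_id, count in freq_file.get(s, []):
            if structure_id == target_structure_id:
                scores[doc_no] = scores.get(doc_no, 0) + count
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


EFFICACY, SIDE_EFFECTS = 1, 2
freq_file = {
    "drowsiness": [(10, EFFICACY, 2), (11, SIDE_EFFECTS, 1)],
    "nausea": [(10, EFFICACY, 1), (12, EFFICACY, 1)],
}
# Strings extracted from the seed document's [side effects] structure.
seed_side_effects = ["drowsiness", "nausea"]
candidates = cross_structure_candidates(seed_side_effects, freq_file, EFFICACY)
```

Document 11 mentions drowsiness only as its own side effect, so it is not returned; documents 10 and 12 mention the seed's side effects in their efficacy sections.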

[0035] FIG. 6 illustrates a problem with the second embodiment. In this example, [efficacy], [side effects], and [precautions for use] are specified as the structures of the seed document, these structures are treated as the search target structures, and the search is executed. As a result, characteristic character strings that are of little importance or meaningless for a drug efficacy statement, such as "taking", "automobile", and "driving", are extracted. Because the similarity calculation is based on a set of characteristic character strings that includes these, relatively unimportant documents such as document 2 and document 3 receive similarities too large to ignore and appear in the search result.

[0036] FIG. 7 shows the configuration of the similar document search program 131b of the third embodiment. In the third embodiment, a structure weighting program 600 is further added to the characteristic character string extraction program 150b.

[0037] FIG. 8 is a diagram showing the processing procedure of the third embodiment. The third embodiment is the same as the second embodiment except that the structure weighting program 600 is applied to the characteristic character strings extracted by the characteristic character string extraction program 150a. The structure weighting program 600 refers to a system definition file in which an importance is set for each logical structure and, for each logical structure, decides according to that importance how many of the extracted characteristic character strings to adopt for the search. For example, for an "important" structure all extracted characteristic character strings are adopted, whereas for a "normal" structure only some of them are adopted, chosen according to the importance of each extracted characteristic character string. Alternatively, the number of characteristic character strings to adopt for each logical structure may be set in the system definition file, and that number of characteristic character strings may be adopted from the extracted ones in descending order of importance. A predetermined number may also be adopted from each of predetermined character types. As a method of calculating the importance of a characteristic character string, JP-A-11-143902, for example, gives a formula for the importance of a characteristic character string as its Expression 2. Instead of being set in the system definition file, the importance of each logical structure and the number of characteristic character strings to adopt may be specified through the input device 101 as part of the search condition expression. The method of choosing the characteristic character strings to adopt according to the importance of the logical structure or the importance of the characteristic character string is also applicable to the first embodiment described above.
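The selection performed by the structure weighting program 600 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the dictionary `STRUCTURE_IMPORTANCE` stands in for the system definition file, the function name `select_feature_strings`, the structure names, and the fraction kept for a "normal" structure are hypothetical, and the importance scores are made-up inputs rather than values computed with Expression 2 of JP-A-11-143902.

```python
import math

# Hypothetical stand-in for the system definition file:
# an importance level for each logical structure.
STRUCTURE_IMPORTANCE = {
    "efficacy": "important",
    "side_effects": "important",
    "precautions": "normal",
}

# Assumed policy: fraction of extracted strings kept at each importance level.
KEEP_FRACTION = {"important": 1.0, "normal": 0.5}


def select_feature_strings(extracted,
                           structure_importance=STRUCTURE_IMPORTANCE,
                           keep_fraction=KEEP_FRACTION):
    """Choose which extracted feature strings to adopt for the search.

    `extracted` maps a structure name to {feature string: importance score}.
    Returns a mapping from structure name to the adopted strings,
    ordered from most to least important.
    """
    adopted = {}
    for structure, scores in extracted.items():
        level = structure_importance.get(structure, "normal")
        keep = math.ceil(len(scores) * keep_fraction[level])
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda s: (-scores[s], s))
        adopted[structure] = ranked[:keep]
    return adopted


# Made-up scores for illustration only.
extracted = {
    "efficacy": {"headache": 0.9, "fever": 0.8},
    "precautions": {"drowsiness": 0.7, "taking": 0.2,
                    "automobile": 0.1, "driving": 0.1},
}
adopted = select_feature_strings(extracted)
```

With these inputs the "important" structure keeps both of its strings, while the "normal" structure keeps only its two highest-scoring strings, so low-value strings such as "automobile" and "driving" no longer take part in the similarity calculation.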

[0038] Applying the similarity calculation program 152a and the corresponding structure determination program 401 after the characteristic character strings have been narrowed down in this way makes it possible to exclude documents of low importance from the search results. In addition, because the number of characteristic character strings is smaller than in the first and second embodiments, the time needed to search the appearance frequency file 181 is shortened.

[0039] FIG. 9 is a diagram showing the configuration of the search result display program 132a of the fourth embodiment. In the fourth embodiment, a structure-specific display method acquisition program 700 is added to the search result display program 132a.

[0040] FIG. 10 is a diagram showing the processing procedure of the fourth embodiment. The fourth embodiment is the same as the first to third embodiments except for the processing of the search result display program 132a. For each document returned as a result of the similarity calculation program 152a, the structure-specific display method acquisition program 700 displays the similarity for each search target structure. It also highlights the characteristic character strings extracted for each search target structure.

[0041] FIG. 11 is a PAD diagram showing the processing procedure of the structure-specific display method acquisition program 700. The structure-specific display method acquisition program 700 stores in the work area 161 the characteristic character strings that the characteristic character string extraction program 150a extracted for each structure (step 701). Next, it stores in the work area 161 the similarity that the similarity calculation program 152a calculated for each structure (step 702). It then repeats the following processing for every specified structure of each retrieved document (step 703). First, it obtains the similarity of that structure from the work area 161 and displays it (step 704). Next, it obtains the characteristic character strings of that structure from the work area 161 and highlights them (step 705). In this embodiment both the similarity display and the highlighting of characteristic character strings are performed for each logical structure, but only one of the two may be performed. The display conditions for the search results may be set in the system definition file or specified as part of the search condition expression.
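Steps 701 to 705 can be sketched as below. This is only an illustrative outline under assumptions: the function names `highlight` and `build_display`, the dictionary used in place of the work area 161, the `<<` and `>>` emphasis markers, and the sample data are all hypothetical and do not come from the publication.

```python
import re


def highlight(text, features, mark=("<<", ">>")):
    """Wrap every occurrence of a feature string in emphasis markers."""
    if not features:
        return text
    # Longest alternatives first so a longer string wins over its own prefix.
    ordered = sorted(features, key=len, reverse=True)
    pattern = "|".join(re.escape(f) for f in ordered)
    return re.sub(pattern, lambda m: mark[0] + m.group(0) + mark[1], text)


def build_display(features_by_structure, similarity_by_doc, text_by_doc):
    """Return the lines to display, following steps 701 to 705."""
    work_area = {
        "features": dict(features_by_structure),   # step 701
        "similarity": dict(similarity_by_doc),     # step 702
    }
    lines = []
    for doc_id, per_structure in work_area["similarity"].items():
        for structure, similarity in per_structure.items():   # step 703
            # Step 704: show the similarity of this structure.
            lines.append(f"{doc_id} [{structure}] similarity={similarity:.2f}")
            # Step 705: show the text with its feature strings highlighted.
            features = work_area["features"].get(structure, [])
            lines.append(highlight(text_by_doc[doc_id][structure], features))
    return lines


lines = build_display(
    {"efficacy": ["headache", "fever"]},
    {"doc1": {"efficacy": 0.82}},
    {"doc1": {"efficacy": "Relieves headache and fever."}},
)
```

Displaying only the similarity or only the highlighting, as the paragraph allows, would amount to dropping one of the two `lines.append` calls.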

[0042] For the seed document and the similar documents returned as search results, a heading, a summary, and the like may be displayed for each document number, and the adopted characteristic character strings contained in these texts may be highlighted. Displaying the results in this way allows the characteristic character strings contained in a similar document to be compared with those contained in the seed document.

[0043] The appearance frequency file 181 used in the above embodiments may be replaced with the n-gram index of JP-A-11-143902. That is, combining JP-A-11-143902 with the structured character string index of JP-A-10-240752 makes it possible to construct a file that takes the place of the appearance frequency file 181, because the one or more occurrence positions recorded for a given character string also indicate how many times that character string appears.
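The last point, that recorded occurrence positions also give the appearance count, can be illustrated with a small positional n-gram index. The layout below is a hypothetical sketch chosen for the example; it is not the index format of JP-A-11-143902 or JP-A-10-240752, and the function names are assumptions.

```python
from collections import defaultdict


def build_positional_index(documents, n=2):
    """Build index[(structure_id, ngram)][doc_id] -> list of character offsets.

    `documents` maps a document ID to {structure_id: text}.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, structures in documents.items():
        for structure_id, text in structures.items():
            for pos in range(len(text) - n + 1):
                index[(structure_id, text[pos:pos + n])][doc_id].append(pos)
    return index


def appearance_frequency(index, structure_id, ngram, doc_id):
    """The number of recorded positions is the appearance frequency."""
    postings = index.get((structure_id, ngram), {})
    return len(postings.get(doc_id, []))


# Structure ID 1 of doc1 contains the bigram "ab" at offsets 0 and 3.
index = build_positional_index({"doc1": {1: "abcab"}})
frequency = appearance_frequency(index, 1, "ab", "doc1")
```

Because the key includes the structure ID, a lookup returns counts only for the structure being searched, which is the role the appearance frequency file 181 plays in the embodiments.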

[0044] In the first to fourth embodiments described above a single document is designated as the seed document, but a plurality of seed documents may be designated. In that case, either all of the characteristic character strings extracted from the individual seed documents may be used, or only the characteristic character strings common to all of the seed documents may be used.
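The two options for multiple seed documents correspond to a set union and a set intersection. A minimal sketch, in which the function name `combine_seed_features` and the mode labels "all" and "common" are assumptions made for illustration:

```python
def combine_seed_features(feature_sets, mode="all"):
    """Combine the feature strings extracted from several seed documents.

    mode="all"    uses every string extracted from any seed document (union).
    mode="common" uses only strings shared by every seed document (intersection).
    """
    sets = [set(features) for features in feature_sets]
    if not sets:
        return set()
    if mode == "all":
        return set().union(*sets)
    if mode == "common":
        return set.intersection(*sets)
    raise ValueError(f"unknown mode: {mode!r}")


seed_a = {"headache", "fever"}
seed_b = {"fever", "cough"}
all_features = combine_seed_features([seed_a, seed_b], mode="all")
common_features = combine_seed_features([seed_a, seed_b], mode="common")
```

The union widens the search to anything resembling either seed document, while the intersection narrows it to what the seed documents have in common.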

[0045]

[Effects of the Invention] As described above, according to the present invention, the designation of a logical structure is added to the search conditions of a similar document search, so search accuracy can be improved while the advantages of similar document search are fully retained. When a plurality of search target structures are designated, weakly related characteristic character strings can be excluded according to the preset importance of the logical structures or the importance of the characteristic character strings, which improves search accuracy further. The seed document may be either a structured document or an unstructured document, so the user does not need to take particular care in selecting the seed document.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of the similar document search system of the embodiments.

FIG. 2 is a diagram showing the processing procedure of the first embodiment.

FIG. 3 is a diagram showing the processing procedure of a conventional similar document search system.

FIG. 4 is a diagram showing the configuration of the similar document search program 131a of the second embodiment.

FIG. 5 is a diagram showing the processing procedure of the second embodiment.

FIG. 6 is a diagram illustrating a problem with the second embodiment.

FIG. 7 is a diagram showing the configuration of the similar document search program 131b of the third embodiment.

FIG. 8 is a diagram showing the processing procedure of the third embodiment.

FIG. 9 is a diagram showing the configuration of the search result output program 132a of the fourth embodiment.

FIG. 10 is a diagram showing the processing procedure of the fourth embodiment.

FIG. 11 is a diagram showing the processing procedure of the structure-specific display method acquisition program of the fourth embodiment.

FIG. 12 is a diagram showing an example of a search condition input screen.

[Explanation of symbols]

131: similar document search program, 132: search result output program, 150: characteristic character string extraction program, 151: search target structure ID acquisition program, 152: similarity calculation program, 180: text, 181: appearance frequency file, 182: structure index, 400: seed document structure analysis program

Continued from the front page

(72) Inventor: Natsuko Sugaya, c/o Hitachi, Ltd., System Development Division, 890 Kashimada, Saiwai-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Yasuhiko Inaba, c/o Hitachi, Ltd., System Development Division, 890 Kashimada, Saiwai-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Akihiko Yamaguchi, c/o Hitachi, Ltd., System Development Division, 890 Kashimada, Saiwai-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Yosuke Gochi, c/o Hitachi, Ltd., Software Division, 5030 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa

F-terms (reference): 5B009 QA09 VA02; 5B075 ND03 NK06 NK39 PP13 PQ02 PQ22 PQ36 PQ46 PQ75 PR06 QM08

Claims

[Claims]

1. A method for searching for similar documents by structure designation, in which a computer is used to search for structured documents similar to a document or text (hereinafter collectively referred to as a seed document) specified as a search condition, the method comprising: receiving, as search conditions for similarity calculation, a designation of the seed document and of at least one structure belonging to the structured documents; and, after calculating the similarity, preferentially displaying target documents having higher similarity.

2. A method for searching for similar documents by structure designation, in which a computer is used to search for structured documents similar to a seed document, the method comprising: when a seed document and a search target structure are specified, extracting characteristic character strings from the text of the specified seed document; calculating, for documents that match both the extracted characteristic character strings and the specified search target structure, a similarity to the seed document based on the characteristic character strings; and determining display priority according to the order of the calculated similarity.

3. A method for searching for similar documents by structure designation, in which a computer is used to search for documents similar to a structured seed document, the method comprising: receiving a designation of a seed document and of at least one structure belonging to the seed document; and, after calculating the similarity, preferentially displaying target documents having higher similarity.

4. A method for searching for similar documents by structure designation, in which a computer is used to search for documents similar to a structured seed document, the method comprising: when a seed document and a search target structure are specified, extracting characteristic character strings from the text belonging to the specified structure within the text of the specified seed document; calculating, for documents that match both the extracted characteristic character strings and the specified search target structure, a similarity to the seed document based on the characteristic character strings; and determining display priority according to the order of the calculated similarity.

5. The method for searching for similar documents by structure designation according to claim 1, wherein the seed document is text designated on a display screen.

6. The method for searching for similar documents by structure designation according to claim 1, wherein a structure to be excluded from the search is specified instead of the search target structure.

7. The method for searching for similar documents by structure designation according to claim 3, wherein, instead of a structure belonging to the seed document, a structure that belongs to the seed document and is to be excluded from the search is designated.

8. The method for searching for similar documents by structure designation according to claim 2 or 4, wherein, when a plurality of structures are specified as search targets, a similarity is calculated for each search target structure, and the cumulative value of the similarities over all the search target structures is used as the final similarity when determining the priority.

9. The method for searching for similar documents by structure designation according to claim 2 or 4, wherein, when a plurality of structures are specified as search targets, a similarity is calculated for each search target structure, and the highest of these similarities for the target document is used as the final similarity when determining the priority.

10. The method for searching for similar documents by structure designation according to claim 2, further comprising, after extracting the characteristic character strings, determining the characteristic character strings to be adopted according to the importance of the characteristic character strings.

11. The method for searching for similar documents by structure designation according to claim 4, further comprising, after extracting the characteristic character strings, determining the characteristic character strings to be adopted according to the importance of the structure belonging to the specified seed document.

12. The method for searching for similar documents by structure designation according to claim 2, further comprising, after extracting the characteristic character strings, determining the characteristic character strings to be adopted according to the importance of the search target structure.

13. The method for searching for similar documents by structure designation according to claim 2, wherein, when the search results are displayed, the characteristic character strings extracted for each search target structure are highlighted in the displayed documents.

14. The method for searching for similar documents by structure designation according to claim 2, wherein, when the search results are displayed, the similarity for each search target structure is displayed for the displayed documents.

15. An apparatus for searching for similar documents by structure designation, which searches for structured documents similar to a document or text (seed document) specified as a search condition, the apparatus comprising: means for receiving, as search conditions for similarity calculation, a designation of the seed document and of at least one structure belonging to the structured documents; and means for preferentially displaying, after calculating the similarity, target documents having higher similarity.

16. An apparatus for searching for similar documents by structure designation, which searches for structured documents similar to a seed document, the apparatus comprising: means for extracting, when a seed document and a search target structure are specified, characteristic character strings from the text of the specified seed document; means for calculating, for documents that match both the extracted characteristic character strings and the specified search target structure, a similarity to the seed document based on the characteristic character strings; and means for determining display priority according to the order of the calculated similarity.

17. An apparatus for searching for similar documents by structure designation, which searches for documents similar to a structured seed document, the apparatus comprising: means for receiving a designation of a seed document and of at least one structure belonging to the seed document; and means for preferentially displaying, after calculating the similarity, target documents having higher similarity.

18. An apparatus for searching for similar documents by structure designation, which searches for documents similar to a structured seed document, the apparatus comprising: means for extracting, when a seed document and a search target structure are specified, characteristic character strings from the text belonging to the specified structure within the text of the specified seed document; means for calculating, for documents that match both the extracted characteristic character strings and the specified search target structure, a similarity to the seed document based on the characteristic character strings; and means for determining display priority according to the order of the calculated similarity.

19. A storage medium storing a program, the program being stored in a computer-readable storage medium and searching for structured documents similar to a document or text (seed document) specified as a search condition, the program comprising: program means for receiving, as search conditions for similarity calculation, a designation of the seed document and of at least one structure belonging to the structured documents; and program means for preferentially displaying, after calculating the similarity, target documents having higher similarity.

20. A storage medium storing a program, the program being stored in a computer-readable storage medium and searching for structured documents similar to a seed document, the program comprising: program means for extracting, when a seed document and a search target structure are specified, characteristic character strings from the text of the specified seed document; program means for calculating, for documents that match both the extracted characteristic character strings and the specified search target structure, a similarity to the seed document based on the characteristic character strings; and program means for determining display priority according to the order of the calculated similarity.

21. A storage medium storing a program, the program being stored in a computer-readable storage medium and searching for documents similar to a structured seed document, the program comprising: program means for receiving a designation of a seed document and of at least one structure belonging to the seed document; and program means for preferentially displaying, after calculating the similarity, target documents having higher similarity.

22. A storage medium storing a program, the program being stored in a computer-readable storage medium and searching for documents similar to a structured seed document, the program comprising: program means for extracting, when a seed document and a search target structure are specified, characteristic character strings from the text belonging to the specified structure within the text of the specified seed document; program means for calculating, for documents that match both the extracted characteristic character strings and the specified search target structure, a similarity to the seed document based on the characteristic character strings; and program means for determining display priority in descending order of the calculated similarity.
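
The claims that deal with a plurality of search target structures name two ways of reducing the per-structure similarities to one final similarity: the cumulative value over all structures, or the highest single value. The difference can be shown with a short sketch. The function names, the `method` labels, and the sample similarity values are assumptions made for illustration and are not taken from the publication.

```python
def final_similarity(per_structure, method="sum"):
    """Reduce per-structure similarities to one final similarity.

    method="sum" accumulates over all search target structures;
    method="max" takes the highest per-structure similarity.
    """
    values = list(per_structure.values())
    if not values:
        return 0.0
    if method == "sum":
        return sum(values)
    if method == "max":
        return max(values)
    raise ValueError(f"unknown method: {method!r}")


def rank_documents(similarities, method="sum"):
    """Return document IDs ordered from highest to lowest final similarity."""
    scored = {doc: final_similarity(per_structure, method)
              for doc, per_structure in similarities.items()}
    return sorted(scored, key=lambda doc: (-scored[doc], doc))


# Made-up per-structure similarities for two documents.
similarities = {
    "doc1": {"efficacy": 0.6, "side_effects": 0.1},
    "doc2": {"efficacy": 0.4, "side_effects": 0.4},
}
by_sum = rank_documents(similarities, method="sum")
by_max = rank_documents(similarities, method="max")
```

In this made-up case the cumulative value ranks doc2 first because it is moderately similar in both structures, while the maximum ranks doc1 first because of its single strong match.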