JPH10116290A

JPH10116290A - Document classification managing method and document retrieving method

Info

Publication number: JPH10116290A
Application number: JP8269668A
Authority: JP
Inventors: Osamu Moriguchi; 修森口; Makoto Imamura; 誠今村; Yoichi Fujii; 洋一藤井; Yasuhiro Takayama; 泰博高山
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1996-10-11
Filing date: 1996-10-11
Publication date: 1998-05-06

Abstract

PROBLEM TO BE SOLVED: To obtain a document classification managing method that appropriately reflects document contents, and a document retrieving method that remains efficient even for new documents and for documents drawn from different classification systems. SOLUTION: Document parameters enclosed in tags of a specified format are extracted from N gathered documents. For the M-dimensional document parameter vector formed by the M document parameters extracted from the N documents, a weight based on the appearance frequency of each component is calculated, and a document parameter vector is determined for each document. The similarities between the per-document parameter vectors are then found, and the gathered documents are classified and managed in the document classes of the classification system determined by those similarities.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document classification management method and a document retrieval method that classify, manage, or retrieve documents based on document information.

[0002]

[Description of the Related Art] When various documents are filed to build a database, retrieval information such as classifications and keywords must be created. In addition, documents collected from multiple providers differ in classification system and terminology from provider to provider, so a separate retrieval method would have to be prepared for each provider. To avoid this and search documents within a single framework, the classification system and the terms used in the documents must be unified. This work is important because its accuracy determines how convenient the database is to use afterwards. If people perform it, however, they must read and understand the entire content of each document, which makes the work extremely laborious, and the results vary from worker to worker and therefore lack accuracy.

[0003] Various systems have therefore been developed to classify documents automatically. For example, the "method and apparatus for automatically classifying documents and method and apparatus for creating a classification dictionary" disclosed in Japanese Patent Application Laid-Open No. 6-282587 calculates a degree of field dependence from keywords that are extracted, sorted by type, from the words and phrases making up a document, and determines the field to which an arbitrary document belongs from within a known classification system.

[0004] This method, however, determines the classification field from keywords extracted from the document. It therefore depends on the chance that terms suitable as classification keys are used in the document in an identifiable notation, and it lacks reliability. Furthermore, the attribute given to a keyword in the "keywords extracted while being sorted by type" is a grammatical case (nominative, objective, and so on), which does not necessarily match the semantic case. The result therefore depends on the author's phrasing, such as active versus passive voice, and lacks accuracy. In addition, the classification method of "determining the field to which an arbitrary document belongs from within a known classification system" discards the classification system information that each provider has optimized and folds the documents into a single known, fixed classification system. It therefore cannot accommodate new fields or refine the classification system as the number of documents grows.

[0005] Another document classification method, such as the "keyword assignment system" disclosed in Japanese Patent Application Laid-Open No. 7-78182, extracts classification-dependent keywords from already classified document data and assigns them to documents as keywords. With this method, however, a document that is similar but does not contain the specific keyword is missed by the search.

[0006] Yet another method, such as the "attribute-based classification and retrieval system" disclosed in Japanese Patent Application Laid-Open No. 3-282777, assigns multiple document parameters to each document and simultaneously builds classification systems from multiple viewpoints, each using a single, arbitrarily selected document parameter as its classification axis. When the number of document parameters increases, however, selecting the document parameter to use as the classification axis becomes a burden on the searcher.

[0007]

[Problems to Be Solved by the Invention] Conventional document classification management and retrieval methods are configured as described above. Either a person intervenes, understands the document content, and defines and maintains classification rules; or documents are classified on the basis of fixed keywords; or multiple document parameters are assigned, a document parameter axis for classification is also set, and documents are classified along that axis. These methods have the drawbacks described above: the work is difficult, documents are missed so reliability is lacking, and the choice of classification criterion is left to people so the results vary widely.

[0008] The present invention has been made to solve the above problems. Its object is to obtain a document classification management method that appropriately reflects document content, or an efficient document retrieval method, both for new documents and for documents spanning different classification systems. To that end it uses the original classification information, extracts document parameters from the documents, and classifies the documents by document parameter vectors composed of those parameters.

[0009]

[Means for Solving the Problems] The document classification management method according to the present invention comprises: a document parameter extraction step of extracting the document parameters enclosed in tags of a predetermined format from N collected documents; a document similarity calculation step of calculating, for the M-dimensional document parameter vector composed of the M document parameters extracted from the N documents, a weight based on the appearance frequency of each component of the parameter vector, and determining a document parameter vector for each document; and a document classification step of obtaining the similarity between the document parameter vectors determined for the documents and classifying each document into a document class of the classification system determined by that similarity. The collected documents are thereby classified and managed.

[0010] The document retrieval method of the present invention comprises: a document parameter extraction step of extracting the document parameters enclosed in tags of a predetermined format from N collected documents; a document similarity calculation step of calculating, for the M-dimensional document parameter vector composed of the M document parameters extracted from the N documents, a weight based on the appearance frequency of each component of the parameter vector, and determining a document parameter vector for each document; a document classification step of obtaining the similarity between the document parameter vectors determined for the documents and classifying each document into a document class of the classification system determined by that similarity; and, in addition, a document classification retrieval step of extracting, from the documents thus classified and managed, the documents that match a document class indicated in the classification system.

[0011] Further, a document parameter search guidance step is added to the basic configuration: when an appearance ratio of a specific document parameter relative to all documents is specified for the documents classified and managed by the document classification management method, document parameters whose appearance ratio is close to the specified ratio are extracted.

[0012] Further, a classification system merging step is added: an average parameter vector is obtained for each document class within a classification system, and if the similarity between the average parameter vectors of classes in different classification systems is within a predetermined value, those classes are treated as the same document class in a merged classification system that combines the different classification systems.
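The merging test in [0012] can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how "within a predetermined value" is evaluated, so the sketch treats it as "similarity at or above a threshold" and rescales the average vectors to unit length before taking the inner product. The class names, vectors, and the `threshold` value are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    """Inner product of u and v after scaling both to unit length (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mergeable_pairs(system_a, system_b, threshold):
    """Return (class in A, class in B) pairs whose average vectors are similar enough."""
    pairs = []
    for name_a, vectors_a in system_a.items():
        mean_a = mean_vector(vectors_a)
        for name_b, vectors_b in system_b.items():
            if cosine(mean_a, mean_vector(vectors_b)) >= threshold:
                pairs.append((name_a, name_b))
    return pairs

# Hypothetical classification systems: class name -> document parameter vectors.
system_a = {"DRAM": [[1.0, 0.0], [0.8, 0.6]]}
system_b = {"Dynamic RAM": [[1.0, 0.0]], "Display": [[0.0, 1.0]]}

pairs = mergeable_pairs(system_a, system_b, threshold=0.9)
```

With these made-up vectors only the pair ("DRAM", "Dynamic RAM") passes the threshold, so those two classes would be treated as one class in the merged system.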

[0013] Further, when the document class whose similarity is being evaluated has lower-level document classes connected to it, the classification system merging step manages the result as a merged classification system that preserves the lower-level connections.

[0014] Further, a classification system subdivision step is added: for the documents belonging to the same document class, the most similar document is found by calculation and the process moves from document to document in turn, extracting the inter-document similarity at each move.

[0015] Further, a document classification search guidance step is added: terms and attributes extracted from the documents by predetermined term extraction rules are built and managed as a term attribute dictionary associated with the document classes, and when a specific attribute is specified, the document classes having terms that correspond to that attribute are extracted.

[0016] Further, terms and attributes extracted from the documents by predetermined term extraction rules are built and managed as a term attribute dictionary associated with the document classes, and a keyword search mechanism for retrieving documents that contain a specific keyword is added. When a specific attribute is specified, the corresponding terms are extracted and used as keywords for the search.

[0017] Further, terms and attributes extracted from the documents by predetermined term extraction rules are built and managed as a term attribute dictionary associated with the document classes, and two steps are added: a document classification search guidance step of extracting, when a specific attribute is specified, the document classes having terms that correspond to that attribute; and a similar document addition step of extracting and adding a predetermined number of documents in descending order of similarity to the document classes in the search result.

[0018] Further, a synonym and similar-word dictionary is built and managed on the basis of the similarity between terms of attributes in the same document class, and a keyword search mechanism for retrieving documents that contain a specific keyword is added. In a keyword search, terms registered as equivalent in the synonym and similar-word dictionary are added to the keyword before searching.

[0019] Further, a characteristic term determination step of extracting characteristic terms based on their appearance frequency in a specific document class, and a keyword search mechanism for retrieving documents that contain a specific keyword, are added. In a keyword search, the characteristic terms extracted in the characteristic term determination step are used as the keywords.

[0020]

BEST MODE FOR CARRYING OUT THE INVENTION

Embodiment 1. Embodiment 1 of the document classification management method and document retrieval method according to the present invention will be described in detail with reference to the accompanying drawings. FIG. 1 is an operation flowchart of the basic document retrieval method according to the present invention. FIG. 2 illustrates the operation of FIG. 1 and shows document providers, examples of specific documents to be analyzed, and examples of the extracted "characteristics". FIG. 3 illustrates the document similarity calculation: FIG. 3(a) shows an example of a document whose characteristics are enclosed in tags together with the resulting document parameters, FIG. 3(b) shows examples of document parameter vectors, and FIG. 3(c) shows an example of a classification system. FIG. 4 is a flowchart of the document similarity calculation.

[0021] First, as shown in FIG. 1, document collection step 1 collects documents from multiple providers. Document collection step 1 reads the location information of the providers' documents, which is stored in the form of network addresses such as the source address 1001 in FIG. 2(a), and downloads the documents using a known file transfer protocol or the like.

[0022] Next, document parameter extraction step 2 extracts document parameters from each document collected in document collection step 1. A document parameter is information on a "characteristic" of the thing that the document describes. In the documents to be analyzed, this characteristic information already appears as pairs of a "characteristic name" and a "characteristic value" enclosed in tags. That is, each document is assumed to be written in a predetermined format such as SGML (a structured document description language, ISO 8879), and document parameter extraction step 2 extracts the document parameters by analyzing the format of the document content with, for example, a known SGML parser that checks the format of SGML documents. For example, if the provider is a component manufacturer and the document is a component specification, the document content describes the component's characteristics enclosed in tags as shown in FIG. 2(b), and the component's "characteristic name" (or simply "name") and "characteristic value" are extracted as document parameters. If the document is a memory chip specification, the tags <component specification>, <characteristic>, <name>, <value>, and </component specification> written in the SGML document A shown in FIG. 2(b) are analyzed. As a result, document parameter "names" such as memory capacity, refresh cycle, and access time, and the corresponding "characteristic values" such as 4 MB, 20 ns, and 10 ns, are extracted as shown in FIG. 2(c). In the example of FIG. 2(a), document parameters are extracted in the same way from documents B and C supplied by vendors B and C, giving the extraction result of FIG. 2(c).

[0023] The product catalog 1010 in FIG. 3(a) is an example of a display monitor product catalog written in SGML. Analyzing the tags <product specification>, <product characteristic>, <characteristic name>, <characteristic value>, </product characteristic>, and </product specification> of this catalog in the same way yields the document parameters 1011 shown in FIG. 3(a): the "name" CRT size with the corresponding "characteristic value" 17 inches, the "name" dot pitch with the "characteristic value" 0.25 mm, and the "name" horizontal scanning frequency with the "characteristic value" 24 kHz to 86 kHz.
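As a rough illustration of document parameter extraction step 2, the Python sketch below pulls name and value pairs out of a tagged catalog. It rests on several assumptions: the catalog text is a hypothetical reconstruction (FIG. 3(a) is not reproduced here), the English tag names stand in for the tags named in the text, and a regular expression stands in for the SGML parser that the description relies on, which would also validate the document format.

```python
import re

# Hypothetical catalog text; the tag names are English stand-ins for the
# tags of the product catalog described in the text.
CATALOG = """
<product-spec>
  <product-property>
    <property-name>CRT size</property-name>
    <property-value>17 inches</property-value>
  </product-property>
  <product-property>
    <property-name>Dot pitch</property-name>
    <property-value>0.25 mm</property-value>
  </product-property>
  <product-property>
    <property-name>Horizontal scanning frequency</property-name>
    <property-value>24 kHz to 86 kHz</property-value>
  </product-property>
</product-spec>
"""

# One match per tagged property: group 1 is the name, group 2 the value.
PROPERTY_RE = re.compile(
    r"<product-property>\s*"
    r"<property-name>(.*?)</property-name>\s*"
    r"<property-value>(.*?)</property-value>\s*"
    r"</product-property>",
    re.DOTALL,
)

def extract_document_parameters(text):
    """Return a dict mapping each characteristic name to its characteristic value."""
    return {name.strip(): value.strip() for name, value in PROPERTY_RE.findall(text)}

params = extract_document_parameters(CATALOG)
```

Only the parameter names are used later to build the document parameter vectors; the values are kept here because the extraction step returns them as pairs.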

[0024] Next, document similarity calculation step 3 calculates the similarity between documents using the document parameters extracted in document parameter extraction step 2 of FIG. 1. Following the flowchart of FIG. 4, document similarity calculation step 3 creates for each document a document parameter vector such as those shown in FIG. 3(b). As described later, the inner product of two document parameter vectors is taken as the similarity between the documents. A document parameter vector is a vector in which each document parameter corresponds to one vector component. If N documents have been collected and M kinds of document parameters extracted, the document parameter vector has M components. The M kinds of document parameters are numbered i = 1 to M, and the i-th document parameter P(i) corresponds to the i-th component of the document parameter vector. Document parameters are distinguished by name, so parameters with the same name are regarded as the same kind of document parameter. The N documents are numbered 1 to N, and the i-th component of the document parameter vector of the j-th document D(j) is written V(j, i). If an (M+1)-th component has to be provided for some reason, that component is set to 0.

[0025] In FIG. 4, steps S100 to S105 calculate the weight C(i) of each component of the document parameter vector from the appearance frequency of the document parameters. Step S100 initializes the document parameter number i to 1. Step S101 checks whether i exceeds the maximum value M and, if it does, ends the loop of steps S101 to S105. Step S102 sets the i-th document parameter in P(i). Step S103 counts how many of the N documents have the i-th document parameter P(i) and sets that count in S. Step S104 sets the weight C(i) of the i-th component of the document parameter vector to 1/S. Step S105 increments i by one so that the weight of the component for the next document parameter is calculated. When i exceeds M, the weights C(1) to C(M) of all components of the document parameter vector have been calculated. For example, among the three documents (N = 3) shown in FIG. 2(c), two documents, A and B, have the second (i = 2) document parameter P(2), "refresh cycle" (S = 2). The weight C(2) of P(2) is therefore 1/2, the reciprocal of the number of documents having P(2).
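The weight loop of steps S100 to S105 can be sketched as follows. The membership sets are assumptions chosen to agree with the figures quoted in the text (P(2) appears in documents A and B only); in particular, which documents contain P(4) is not stated, so placing it in document A alone is hypothetical. `Fraction` is used only to keep the reciprocals exact.

```python
from fractions import Fraction

def parameter_weights(documents, parameters):
    """Return C(i) = 1/S, where S is the number of documents that have P(i).

    Every parameter was extracted from at least one document, so S >= 1.
    """
    weights = {}
    for p in parameters:                                   # loop S101 to S105
        s = sum(1 for params in documents.values() if p in params)  # S103
        weights[p] = Fraction(1, s)                        # S104
    return weights

# Hypothetical membership of parameters P1..P4 in documents A, B, C.
documents = {
    "A": {"P1", "P2", "P3", "P4"},
    "B": {"P1", "P2", "P3"},
    "C": {"P1", "P3"},
}
parameters = ["P1", "P2", "P3", "P4"]

weights = parameter_weights(documents, parameters)
```

A parameter shared by many documents gets a small weight and a rare one a large weight, so rare characteristics contribute more to the similarity between two documents.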

[0026] Steps S109 to S114 calculate the temporary document parameter vector of the j-th document D(j) and the square L of its length. Step S109 initializes the squared length L of the temporary document parameter vector to 0. S110 initializes the document parameter number i to 1. S111 checks whether i exceeds the maximum value M and, if it does, ends the loop of S111 to S114. S112 checks whether the j-th document D(j) has the i-th document parameter P(i). If D(j) has P(i), the i-th component of the temporary document parameter vector of D(j) is set to C(i), and S113 adds the square of C(i) to L. If D(j) does not have P(i) (which is also the case for an (M+1)-th component), the i-th component of the temporary document parameter vector of D(j) is set to 0. S114 increments i by one so that the temporary component for the next document parameter is calculated. When i exceeds M, the components of the temporary document parameter vector of D(j) and the square L of its length have been calculated.
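Steps S109 to S114 can be sketched as below. The weights for P1 to P3 follow the values quoted in the description; the weight of P4 and the parameter names themselves are assumptions made for the example.

```python
from fractions import Fraction

def temporary_vector(doc_params, parameters, weights):
    """Return (vector, L): component i is C(i) if the document has P(i), else 0.

    L is the squared length of the temporary vector.
    """
    vector = []
    squared_length = Fraction(0)          # S109
    for p in parameters:                  # loop S111 to S114
        if p in doc_params:               # S112
            c = weights[p]
            vector.append(c)
            squared_length += c * c       # S113
        else:
            vector.append(Fraction(0))
    return vector, squared_length

parameters = ["P1", "P2", "P3", "P4"]
weights = {
    "P1": Fraction(1, 3),
    "P2": Fraction(1, 2),
    "P3": Fraction(1, 3),
    "P4": Fraction(1, 1),  # assumed; not given in the text
}

# Document C is taken to have the first and third parameters only.
vec_c, l_c = temporary_vector({"P1", "P3"}, parameters, weights)
```

For a document with the first and third parameters this gives the vector (1/3, 0, 1/3, 0) and L = 2/9.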

[0027] Steps S106, S107, and S115 to S122 calculate the document parameter vectors of all N documents. Step S122 initializes the document number j to 1, and S107 sets the j-th document in D(j). S106 checks whether j exceeds the maximum value N and, if it does, ends the loop of S106, S107, and S109 to S121. S115 initializes the document parameter number i to 1. S116 checks whether i exceeds the maximum value M and, if it does, ends the loop of S116 to S120. S117 checks whether the j-th document D(j) has the i-th document parameter P(i). If it does, S118 divides the i-th value C(i) of the temporary document parameter vector by the length √L of the temporary document parameter vector and sets the result in V(j, i) as the value of the document parameter vector. If it does not, S119 sets V(j, i) to 0. S120 increments i by one so that the component of the document parameter vector of D(j) for the next document parameter is calculated. When i exceeds M, the components of the document parameter vector of D(j) have been calculated. S121 increments j by one so that the document parameter vector of the next document is calculated. When j exceeds N, the document parameter vectors of all documents have been calculated.

[0028] For example, for the third document (j = 3), document C, shown in FIG. 2(c), the temporary document parameter vector is (C(1), 0, C(3), 0) = (1/3, 0, 1/3, 0). Because document C has no fourth characteristic value, the fourth component is 0, and the squared length L is 1/9 + 1/9 = 2/9. Dividing each component of the temporary document parameter vector by its length √L gives the document parameter vector of document C normalized to length 1: (V(3, 1), V(3, 2), V(3, 3), V(3, 4)) = (0.71, 0, 0.71, 0). The components of the document parameter vectors of documents A and B are calculated in the same way, so that a document parameter vector such as those shown in FIG. 3(b) is obtained for each document.
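The normalization for document C can be checked with a few lines of Python. The function name is illustrative, and the sketch assumes the document has at least one parameter, since a vector of all zeros has no length to divide by.

```python
import math

def normalize(temporary):
    """Divide each component by the vector length sqrt(L) so the result has length 1."""
    length = math.sqrt(sum(c * c for c in temporary))
    return [c / length if c else 0.0 for c in temporary]

# Temporary vector of document C from the description: (1/3, 0, 1/3, 0).
v_c = normalize([1 / 3, 0, 1 / 3, 0])
rounded = [round(x, 2) for x in v_c]
```

Rounded to two places, the non-zero components are 0.71, matching the values quoted in the text, and the squared components sum to 1.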

[0029] Once document parameter extraction step 2 and document similarity calculation step 3 have made it possible to calculate the similarity between documents, document classification step 4 of FIG. 1 registers each document in the document class with the highest similarity within a separately defined classification system. Here, the similarity between document D(m) and document D(n) is defined as the inner product of their document parameter vectors: V(m, 1) × V(n, 1) + V(m, 2) × V(n, 2) + ... + V(m, M) × V(n, M). Documents whose similarity is at or above a set value are placed in the same document class. For example, suppose the separately defined classification system is given as in FIG. 3(c). Document classification step 4 then obtains the similarity from the document parameter vector of a collected document and the document parameter vector assigned to each document class. The similarity between document A and the document class DRAM 1005 in FIG. 3(c) is obtained as 0.986 by calculating the inner product of the document parameter vector 1004 in FIG. 3(b) and the document parameter vector 1007 in FIG. 3(c). Likewise, the similarity to the document parameter vector of the document class SRAM 1006 in FIG. 3(c) is obtained as 0.616. The similarity between document A and every document class is calculated in the same way. If the document class DRAM 1005 in FIG. 3(c) turns out to have the highest similarity to document A, and the setting is to choose the class with the highest similarity, document A is registered in the document class DRAM 1005. In this way it can be decided in which document class of an existing classification system a collected document should be registered, which simplifies document registration. With the above operation steps, multiple collected documents can be classified and managed.
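Document classification step 4 can be sketched as an inner product followed by a choice of the best-scoring class. The vectors below are made-up unit vectors, not the vectors of FIG. 3, so the scores differ from the 0.986 and 0.616 quoted in the text; the function names are likewise illustrative.

```python
def similarity(u, v):
    """Inner product of two document parameter vectors of the same dimension."""
    return sum(a * b for a, b in zip(u, v))

def classify(doc_vector, class_vectors):
    """Return (class name, similarity) for the class most similar to the document."""
    scored = ((name, similarity(doc_vector, vec)) for name, vec in class_vectors.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical unit-length vectors for two document classes and one document.
class_vectors = {
    "DRAM": [0.6, 0.8, 0.0],
    "SRAM": [1.0, 0.0, 0.0],
}
doc_vector = [0.6, 0.8, 0.0]

best_class, best_score = classify(doc_vector, class_vectors)
```

Because every vector has length 1, the inner product lies between 0 and 1 for non-negative components, and a document identical in profile to a class scores 1.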

【００３０】分類された文書は図１の文書分類検索ステ
ップ５によって分類体系における文書クラスを選択する
ことにより文書を検索する。例えば、図３（ｃ）に示す
ような分類体系を検索者に提示する。検索者が文書クラ
スの「メモリ」を選択した場合、文書分類検索ステップ
５は文書クラス「ＤＲＡＭ」もしくは文書クラス「ＳＲ
ＡＭ」に登録された文書のみを検索者に提示する。次に
検索者が文書クラスの「ＤＲＡＭ」を選択した場合、文
書分類検索ステップ５は文書クラス「ＤＲＡＭ」に登録
された文書のみを検索者に提示する。検索機能は文書ク
ラス名を属性とすることで公知のリレーショナルデータ
ベースの検索機能によって実現可能である。The classified documents are retrieved by selecting a document class in the classification system through the document classification search step 5 of FIG. 1. For example, a classification system such as that shown in FIG. 3(c) is presented to the searcher. When the searcher selects the document class "Memory", the document classification search step 5 presents only the documents registered in document class "DRAM" or document class "SRAM". When the searcher then selects the document class "DRAM", step 5 presents only the documents registered in document class "DRAM". This search function can be implemented with the search facilities of a well-known relational database by using the document class name as an attribute.
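
As a rough illustration of class-based retrieval with a relational database, the sketch below uses an in-memory SQLite database. The table names, column names, and sample rows are hypothetical, and only one level of parent and child classes is handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE classes (name TEXT PRIMARY KEY, parent TEXT);
CREATE TABLE documents (title TEXT, class_name TEXT REFERENCES classes(name));
INSERT INTO classes VALUES ('Memory', NULL), ('DRAM', 'Memory'), ('SRAM', 'Memory');
INSERT INTO documents VALUES ('Document A', 'DRAM'), ('Document B', 'SRAM');
""")


def documents_in(class_name):
    """Titles registered in the class itself or in one of its direct child classes."""
    rows = conn.execute(
        "SELECT d.title FROM documents d JOIN classes c ON d.class_name = c.name "
        "WHERE c.name = ? OR c.parent = ? ORDER BY d.title",
        (class_name, class_name),
    )
    return [title for (title,) in rows]
```

Selecting "Memory" returns the documents of both child classes, and selecting "DRAM" narrows the result to that class alone.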

【００３１】実施の形態２．図５は本発明の実施の形態
２の検索方法を示す動作フローチャートであり、図にお
いて実施の形態１と同様または相当する部分については
同一符号を付しその説明を省略する。本実施の形態で
は、実施の形態１における文書分類管理方法において、
検索に際して全文書中に占める特定文書パラメータを持
つ文書の出現期待割合を検索コストとして定義し、この
検索コスト（出現期待割合）を指定すると、その検索コ
ストに近い文書パラメータを抽出するステップを付加す
る。こうすることで検索範囲を更に容易に絞り込むこと
ができる。Embodiment 2. FIG. 5 is an operation flowchart showing the search method of Embodiment 2 of the present invention. In the figure, parts identical or equivalent to those of Embodiment 1 are given the same reference numerals and their description is omitted. This embodiment adds the following to the document classification management method of Embodiment 1: for retrieval, the expected proportion of all documents that have a specific document parameter is defined as the search cost, and when a search cost (expected proportion) is specified, a step extracts the document parameters whose cost is close to it. This makes it easier to narrow the search range further.

【００３２】図５における文書パラメタ検索ステップ６
は、実施の形態１に前記した文書パラメタを検索条件と
して文書を検索する。この検索機能は文書パラメタを属
性・属性値としてリレーショナルデータベースを構築す
ることによって公知の手法で実現可能である。The document parameter search step 6 in FIG. 5 searches for documents using the document parameters described in Embodiment 1 as search conditions. This search function can be implemented by known techniques by building a relational database with the document parameters as attributes and attribute values.

【００３３】本実施の形態の文書検索方法は、文書パラ
メタ検索ステップ６における検索条件入力を、分類パラ
メタ検索ガイダンスステップ７によって作成された検索
ガイダンスを利用して行なうことを特徴とする。分類パ
ラメタ検索ガイダンスステップ７で作成する検索ガイダ
ンスは、検索対象の文書が有する文書パラメタを検索効
率の観点で各々評価し、検索効率の高い順番に文書パラ
メタを表示するものである。または対象文書の件数に対
する「名称」か「特性値」別の件数を表示する。これに
より、検索効率の高い文書パラメタを検索条件として入
力することを検索者に促す効果がある。検索対象の文書
とは、文書検索を複数回繰り返すことによって文書を絞
りこむ場合の、検索途中の文書の集合である。したがっ
て、分類パラメタ検索ガイダンスステップ７は文書検索
を繰り返す度に実行する。The document search method of this embodiment is characterized in that search conditions for the document parameter search step 6 are entered using the search guidance produced by the classification parameter search guidance step 7. That guidance evaluates each document parameter of the documents being searched in terms of search efficiency and displays the document parameters in descending order of search efficiency. Alternatively, it displays the number of documents per "name" or per "characteristic value" against the number of target documents. This encourages the searcher to enter document parameters with high search efficiency as search conditions. The documents being searched are the intermediate set of documents obtained when documents are narrowed down by repeating the search several times. The classification parameter search guidance step 7 is therefore executed every time the document search is repeated.

【００３４】図６は、分類パラメタ検索ガイダンスステ
ップ７の動作の詳細を示すフローチャートである。ここ
で、検索対象の文書が有する文書パラメタの種類をＩと
し、検索対象の文書が有する文書パラメタに付与した番
号をｉとし、Ｐ（ｉ）をｉ番目の文書パラメタとする。
また、検索対象の文書数をＪとし、検索対象の文書に付
与した番号をｊとし、Ｄ（ｊ）をｊ番目の文書とする。
ここで、ｉ：文書パラメタの番号、Ｉ：検索対象文書のパラメタの種類、Ｐ（ｉ）：ｉ番目の文書パラメタ、ｊ：検索対象の文書の番号、Ｊ：検索対象の文書数、Ｄ（ｊ）：ｊ番目の文書、Ｓ（ｉ）：Ｐ（ｉ）を有する文書数、Ｑ（ｉ）：Ｐ（ｉ）の検索コスト、Ｂ（ｉ）：Ｓ（ｉ）≠Ｊ、Ｃ１：検索文書比率の最適値、Ｃ２：文書パラメタ値の種類の最小値、Ｃ３：実数分布の最小比率、Ｃ４：文書パラメタ値分割数、とする。また、左上のループのステップＳ２０２ないし
Ｓ２０８は、ある文書パラメタ値を持っている文書数を
カウントすることを意味し、左下のループのステップＳ
２１４ないしＳ２２１は、後述の共通パラメタを持つ対
象となる文書を分割して対象とする動作を意味し、右中
のステップＳ２１２は後述の固有パラメタを持つ対象の
文書のコスト計算、つまり出現頻度の位置を定めてい
る。FIG. 6 is a flowchart showing the details of the operation of the classification parameter search guidance step 7. Here, I is the number of document parameter types held by the documents being searched, i is the number assigned to each such document parameter, and P(i) is the i-th document parameter. J is the number of documents being searched, j is the number assigned to each such document, and D(j) is the j-th document. The symbols are: i: document parameter number; I: number of parameter types of the documents being searched; P(i): i-th document parameter; j: number of a document being searched; J: number of documents being searched; D(j): j-th document; S(i): number of documents having P(i); Q(i): search cost of P(i); B(i): S(i) ≠ J; C1: optimal value of the retrieved-document ratio; C2: minimum number of document parameter value types; C3: minimum ratio for a real-number distribution; C4: number of divisions of document parameter values. Steps S202 to S208, the upper-left loop, count the number of documents having a given document parameter value. Steps S214 to S221, the lower-left loop, handle the documents that share a common parameter (described below) by dividing them. Step S212, at the middle right, calculates the cost for documents having a unique parameter (described below); that is, it determines where the appearance frequency lies.

【００３５】ステップＳ２０１〜Ｓ２０８ではＰ（ｉ）
を有する文書数Ｓ（ｉ）を計算する。ステップＳ２０１
でＳ（１）〜Ｓ（Ｉ）に初期値として０をセットし、文
書番号ｊの初期値として１をセットする。Ｓ２０２でｊ
が検索対象の文書数Ｊを超えていないかチェックし、超
えていればＳ（ｉ）の計算を終了する。Ｓ２０３では文
書パラメタ番号ｉの初期値として１をセットし、Ｓ２０
４でｉが検索対象の文書が有する文書パラメタの種類Ｉ
を超えていないかチェックし、超えていればＳ２０５で
文書番号ｊを１つ増加させ次の文書を処理する。Ｓ２０
６はＤ（ｊ）がＰ（ｉ）を有するかどうかチェックし、
有する場合はＳ（ｉ）を１つ増加させる。Ｓ２０８でｉ
を１つ増加させ、次の文書パラメタを処理する。以上に
より検索対象の文書が有するすべての文書パラメタに対
して、各々の文書パラメタを有する文書の数Ｓ（ｉ）が
計算される。In steps S201 to S208, the number S(i) of documents having P(i) is calculated. Step S201 sets S(1) to S(I) to the initial value 0 and sets the document number j to the initial value 1. S202 checks whether j exceeds the number J of documents being searched; if it does, the calculation of S(i) ends. S203 sets the document parameter number i to the initial value 1. S204 checks whether i exceeds the number I of document parameter types held by the documents being searched; if it does, S205 increments the document number j by one and the next document is processed. S206 checks whether D(j) has P(i) and, if so, S(i) is incremented by one. S208 increments i by one and the next document parameter is processed. In this way, for every document parameter held by the documents being searched, the number S(i) of documents having that parameter is calculated.
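
The counting loop of S201 to S208 can be sketched as below. Representing each document as a dictionary from parameter name to value is an assumption made for the example, as are the function name and the sample data.

```python
def count_documents_per_parameter(documents, parameters):
    """Return S: for each parameter P(i), the number of documents that have it."""
    counts = {p: 0 for p in parameters}      # S201: S(1)..S(I) = 0
    for doc in documents:                    # outer loop over j = 1..J
        for p in parameters:                 # inner loop over i = 1..I
            if p in doc:                     # S206: does D(j) have P(i)?
                counts[p] += 1
    return counts


# Hypothetical documents: parameter name -> characteristic value.
documents = [
    {"supply_voltage": 3.3, "refresh_cycle": 64},
    {"supply_voltage": 5.0},
    {"supply_voltage": 3.3, "standby_current": 0.1},
]
parameters = ["supply_voltage", "refresh_cycle", "standby_current"]
S = count_documents_per_parameter(documents, parameters)
```

A parameter whose count equals the number of documents (here "supply_voltage") is what the following paragraphs call a common parameter; the others are unique parameters.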

【００３６】ステップＳ２０９〜Ｓ２２１は検索対象の
文書が有する文書パラメタを検索効率の観点で各々評価
し、検索コストＱ（ｉ）を計算する処理である。ここで
いう検索コストとは、全文書に対してどの割合で対象文
書を抽出したいかの出現期待割合である。論理的には１
／２が対象を絞り込む回数が一番少なくなる。ここでは
検索コストＱ（ｉ）が小さい程、文書パラメタ（ｉ）は
検索効率が高いとみなす。Ｓ２０９、Ｓ２１０、Ｓ２１
３により文書パラメタ番号ｉを１からＩに変化させ、す
べての文書パラメタの検索コストＱ（ｉ）を計算する。
Ｓ２１１で文書パラメタＰ（ｉ）を有する文書数Ｓ
（ｉ）が文書数Ｊと一致するかどうかをチェックする。Steps S209 to S221 evaluate each document parameter held by the documents being searched in terms of search efficiency and calculate the search cost Q(i). The search cost here is the expected appearance ratio, that is, the proportion of all documents that one wishes to extract as target documents. In theory, a ratio of 1/2 minimizes the number of narrowing operations. Here, the smaller the search cost Q(i), the higher the search efficiency of document parameter P(i) is considered to be. S209, S210, and S213 vary the document parameter number i from 1 to I so that the search cost Q(i) is calculated for every document parameter. S211 checks whether the number S(i) of documents having document parameter P(i) equals the number of documents J.

【００３７】Ｓ（ｉ）とＪが一致する場合はすなわち検
索対象の文書すべてが文書パラメタＰ（ｉ）を有する場
合であり、そのような文書パラメタＰ（ｉ）を共通パラ
メタと呼ぶ。共通パラメタＰ（ｉ）に関しては「名称」
では対象を絞り込めないために、その「特性値」を検索
条件とする必要がある。従って、共通パラメタＰ（ｉ）
の値の分布をもとにＱ（ｉ）をＳ２１４以降のステップ
で計算する。Ｓ（ｉ）とＪが一致しない場合は、すなわ
ち検索対象の文書の一部のみが文書パラメタＰ（ｉ）を
有する場合は、そのような文書パラメタＰ（ｉ）を固有
パラメタと呼ぶ。固有パラメタＰ（ｉ）に関してはその
存在を検索条件とする必要があるため、固有パラメタＰ
（ｉ）の存在比率をもとにＱ（ｉ）をＳ２１２のステッ
プで計算する。When S(i) equals J, every document being searched has the document parameter P(i); such a parameter is called a common parameter. Because a common parameter P(i) cannot narrow down the targets by its "name", its "characteristic value" must be used as the search condition. Q(i) is therefore calculated from the distribution of the values of P(i) in steps S214 onward. When S(i) does not equal J, only some of the documents being searched have P(i); such a parameter is called a unique parameter. For a unique parameter P(i), its presence must be used as the search condition, so Q(i) is calculated in step S212 from the proportion of documents in which P(i) is present.

【００３８】Ｓ２１２ではＪ×Ｃ１−Ｓ（ｉ）の絶対値
またはＪ×（１−Ｃ１）−Ｓ（ｉ）の絶対値のうち小さ
い方の値を検索コストとしてＱ（ｉ）にセットする。例
えば検索対象の文書数Ｊが１２０であり、検索文書比率
の最適値Ｃ１を１／３に設定している場合、Ｓ（ｉ）が
４０またはその倍数の８０は残りが１／３となり、Ｐ
（ｉ）が最も検索コストが小さい、すなわち最も検索効
率の良い文書パラメタとなる。また、Ｓ２１２ではＰ
（ｉ）が固有パラメタであることをＢ（ｉ）を１にセッ
トすることによって保存する。S212 sets Q(i) to the smaller of |J × C1 − S(i)| and |J × (1 − C1) − S(i)| as the search cost. For example, when the number J of documents being searched is 120 and the optimal retrieved-document ratio C1 is set to 1/3, an S(i) of 40, or of its multiple 80 (which leaves 1/3 remaining), gives P(i) the smallest search cost, making it the most efficient document parameter for searching. S212 also records that P(i) is a unique parameter by setting B(i) to 1.
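
The cost formula of S212 and the J = 120, C1 = 1/3 example can be checked with a short sketch. The function name is an assumption, and exact fractions are used only to avoid rounding in the illustration.

```python
from fractions import Fraction


def unique_parameter_cost(J, C1, S_i):
    """Q(i) of S212: distance of S(i) from the nearer of J*C1 and J*(1 - C1)."""
    return min(abs(J * C1 - S_i), abs(J * (1 - C1) - S_i))


J, C1 = 120, Fraction(1, 3)
costs = {S_i: unique_parameter_cost(J, C1, S_i) for S_i in (20, 40, 60, 80, 100)}
```

With these values the cost is 0 for S(i) = 40 and S(i) = 80 and grows as S(i) moves away from either target, which matches the example in the text.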

【００３９】ステップＳ２１４〜Ｓ２２１では検索対象
の文書が共通に共通パラメタＰ（ｉ）の値の分布から共
通パラメタＰ（ｉ）の検索コストＱ（ｉ）を計算する処
理である。Ｓ２１４ではＪ個の文書について共通パラメ
タＰ（ｉ）の値が何種類あるかをカウントしＸにセット
する。あらかじめ設定した文書パラメタ値の種類の最小
値Ｃ２より共通パラメタＰ（ｉ）の値の種類Ｘが大きい
かどうかをＳ２１５でチェックする。ＸがＣ２以下の場
合は、共通パラメタの値が離散的であるとみなし、Ｓ２
２０で同一のパラメタ値の数をカウントし、その最大値
をＭにセットする。Steps S214 to S221 calculate the search cost Q(i) of a common parameter P(i) from the distribution of the values of P(i) across the documents being searched. S214 counts how many distinct values P(i) takes over the J documents and sets the count in X. S215 checks whether X is larger than the preset minimum number C2 of document parameter value types. When X is C2 or less, the values of the common parameter are regarded as discrete; S220 counts the number of documents for each identical parameter value and sets the largest count in M.

【００４０】ＸがＣ２より大きい場合は、共通パラメタ
の値が連続的であるとみなされるため、共通パラメタ値
の範囲をあらかじめ設定した分割数Ｃ４個に分割するこ
とによって離散的に扱う。このとき、Ｓ２１６で文書パ
ラメタ値の絶対値の最大値と絶対値の最小値との比率Ｗ
を求め、Ｗがあらかじめ設定した分布比率Ｃ３より大き
いかどうかをＳ２１７でチェックする。ＷがＣ３より大
きい場合はＳ２１９でパラメタ値を対数軸上で等間隔に
Ｃ４個に分割し、ＷがＣ３以下の場合はＳ２１８でパラ
メタ値を実数軸上で等間隔にＣ４個に分割する。これに
よりＸがＣ２以下の場合と同様にパラメタ値をＳ２２０
で離散的に扱うことが可能となる。Ｓ２２１ではＳ２１
２と同様にＪ×Ｃ１−Ｍの絶対値またはＪ×（１−Ｃ
１）−Ｍの絶対値のうち小さい方の値を検索コストとし
てＱ（ｉ）にセットする。また、Ｓ２２１ではＰ（ｉ）
が共通パラメタであることをＢ（ｉ）を０にセットする
ことによって保存する。以上の処理をＳ２１０で文書パ
ラメタ番号ｉが文書パラメタの種類Ｉを超えるまで繰り
返すことにより、すべての文書パラメタが共通パラメタ
であるか固有パラメタであるかの判別Ｂ（ｉ）と、検索
コストＱ（ｉ）が計算される。When X is larger than C2, the values of the common parameter are regarded as continuous, so the range of values is divided into the preset number C4 of intervals and handled discretely. S216 obtains the ratio W of the largest absolute parameter value to the smallest absolute parameter value, and S217 checks whether W is larger than the preset distribution ratio C3. When W is larger than C3, S219 divides the parameter values into C4 equal intervals on a logarithmic axis; when W is C3 or less, S218 divides them into C4 equal intervals on the real axis. The parameter values can then be handled discretely in S220, just as when X is C2 or less. As in S212, S221 sets Q(i) to the smaller of |J × C1 − M| and |J × (1 − C1) − M| as the search cost. S221 also records that P(i) is a common parameter by setting B(i) to 0. Repeating the above until S210 finds that the document parameter number i exceeds the number I of document parameter types yields, for every document parameter, the flag B(i) indicating whether it is a common or a unique parameter, and the search cost Q(i).
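
One possible reading of S214 to S221 is sketched below. It assumes that all parameter values are strictly positive numbers (so that the ratio W and the logarithm are defined) and that the last interval is closed at its upper end; neither detail is stated in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def largest_group(values, C2, C3, C4):
    """M of S220: size of the largest group of documents sharing a value or interval."""
    if len(set(values)) <= C2:                      # S214/S215: X <= C2, discrete
        return max(Counter(values).values())
    W = max(values) / min(values)                   # S216 (positive values assumed)
    transform = math.log10 if W > C3 else float     # S219 log axis / S218 real axis
    lo, hi = transform(min(values)), transform(max(values))
    width = (hi - lo) / C4
    if width == 0:                                  # all values identical
        return len(values)
    bins = Counter(min(int((transform(v) - lo) / width), C4 - 1) for v in values)
    return max(bins.values())


def common_parameter_cost(values, C1, C2, C3, C4):
    """Q(i) of S221 for a common parameter whose values are given per document."""
    J, M = len(values), largest_group(values, C2, C3, C4)
    return min(abs(J * C1 - M), abs(J * (1 - C1) - M))


wide = [1, 3, 30, 300, 500, 900]        # W = 900 > C3, so the log axis is used
narrow = [1, 2, 3, 4, 5, 6, 7, 8]       # W = 8 <= C3, so the real axis is used
```

Splitting on a logarithmic axis keeps values that span several orders of magnitude from collapsing into a single interval, which appears to be the purpose of the C3 test.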

【００４１】ステップＳ２２２ではＢ（ｉ）の値により
文書パラメタＰ（ｉ）を共通パラメタと固有パラメタの
２種類に分割し、ステップＳ２２３では分割された文書
パラメタＰ（ｉ）をそれぞれ検索コストＱ（ｉ）によっ
て昇順に並べることにより順位付けする。さらに、ステ
ップＳ２２４では固有パラメタＰ（ｉ）から共起パラメ
タを除去する。共起パラメタとは、固有パラメタＡおよ
び固有パラメタＢにおいてＡを有する文書のすべてがＢ
を有し、Ｂを有する文書のすべてがＡを有する固有パラ
メタＡ，Ｂの組み合わせである。共起関係にある固有パ
ラメタの組み合わせのうち任意の１個の固有パラメタ以
外の固有パラメタをＳ２２４で削除し、それ以下の順位
の固有パラメタは順位を繰り上げる。Step S222 divides the document parameters P(i) into two groups, common parameters and unique parameters, according to the value of B(i). Step S223 ranks the parameters in each group by sorting them in ascending order of search cost Q(i). Step S224 then removes co-occurring parameters from the unique parameters. Unique parameters A and B co-occur when every document that has A also has B and every document that has B also has A. For each set of co-occurring unique parameters, S224 deletes all but one arbitrarily chosen parameter, and the unique parameters ranked below a deleted one move up in rank.
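
The ranking and co-occurrence removal of S223 and S224 might look like the following sketch for the unique parameters. The data layout and names are assumptions; where the text allows any one member of a co-occurring set to be kept, this sketch keeps the first one in rank order.

```python
def rank_unique_parameters(costs, holders):
    """costs: {name: Q(i)}; holders: {name: set of ids of documents having the parameter}."""
    ranked = sorted(costs, key=lambda name: costs[name])   # S223: ascending Q(i)
    kept, seen = [], set()
    for name in ranked:
        key = frozenset(holders[name])
        if key in seen:        # same documents as a kept parameter: co-occurring
            continue           # S224: drop it; later parameters move up in rank
        seen.add(key)
        kept.append(name)
    return kept


# Hypothetical unique parameters and the documents that contain them.
costs = {"refresh_cycle": 0, "cas_latency": 0, "standby_current": 10}
holders = {
    "refresh_cycle": {1, 2},
    "cas_latency": {1, 2},
    "standby_current": {3},
}
ranking = rank_unique_parameters(costs, holders)
```

Two parameters co-occur exactly when the sets of documents containing them are equal, so comparing those sets is enough to detect the relation.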

【００４２】検索コストの指定なしに検索結果の抽出文
書の文書パラメタ別のカウント数を表示することもでき
る。分類パラメタ検索ガイダンスステップ７による文書
パラメタ検索条件入力画面の例を図７に示す。共通パラ
メタ名表示３７は順位付けされた共通パラメタの上位か
ら固定数の共通パラメタの名称の表示をしている。共通
パラメタ名表示３７と対を成すパラメタ値入力領域３８
に各共通パラメタの値による検索条件を入力し、検索実
行ボタン４３によって検索を実行する。なお、本実施の
形態では更に検索範囲を変更するのに便利なように、新
規ボタン４０が選択されていれば検索対象の文書を破棄
して文書全体を検索対象とする。また、追加ボタン４１
が選択されていれば文書全体を検索対象とした検索結果
を検索対象の文書に追加し、絞込ボタン４２が選択され
ていれば現在の検索対象の文書を検索対象とする。検索
実行後に戻るボタン４４によって検索前の状態を復元す
る。It is also possible to display, without specifying a search cost, the number of retrieved documents per document parameter. FIG. 7 shows an example of the document parameter search condition input screen produced by the classification parameter search guidance step 7. The common parameter name display 37 shows the names of a fixed number of common parameters taken from the top of the ranking. A search condition on the value of each common parameter is entered in the parameter value input area 38 paired with the common parameter name display 37, and the search is executed with the search execution button 43. To make it convenient to change the search range, in this embodiment, if the New button 40 is selected, the current set of documents being searched is discarded and the entire document collection becomes the search target. If the Add button 41 is selected, the results of searching the entire collection are added to the documents being searched; if the Narrow button 42 is selected, the current documents being searched are the search target. After a search, the Back button 44 restores the state before the search.

【００４３】検索対象文書数３９によって現在検索対象
となっている文書数を表示する。この例では、共通パラ
メタ「名称」を含む文書が２０７件あることを示してい
る。図７（ｂ）は共通パラメタの「名称」最小動作電源
電圧に対し、その「特性」値別の件数であり、パラメタ
値入力領域３８にある入力候補ボタン４９によって入力
値の候補および各入力値候補に属する文書数からなる共
通パラメタ入力値候補５０が表示される。例えばここで
は最小動作電源電圧の値が３．６Ｖ〜４．０Ｖである文
書が検索対象の文書中に５６個あることがわかる。さら
に分布ボタン４５によって共通パラメタ値分布グラフ５
１が表示され、共通パラメタの値と文書数の分布状態が
グラフ表示される。The search target document count 39 displays the number of documents currently being searched. In this example, it shows that 207 documents contain the common parameter "name". FIG. 7(b) shows the number of documents per "characteristic" value for the common parameter whose "name" is minimum operating supply voltage. Pressing the input candidate button 49 in the parameter value input area 38 displays the common parameter input value candidates 50, which consist of candidate input values and the number of documents belonging to each candidate. Here, for example, it can be seen that 56 of the documents being searched have a minimum operating supply voltage of 3.6 V to 4.0 V. In addition, the distribution button 45 displays the common parameter value distribution graph 51, which plots the distribution of documents over the values of the common parameter.

【００４４】固有パラメタ名４７は、順位付けされた固
有パラメタの上位から固定数の固有パラメタの名称表示
をしている。固有パラメタ名４７と対を成すパラメタ選
択ボタン４６によって固有パラメタの有無を検索条件と
して入力する。固有パラメタ名４７と対を成す固有パラ
メタ含有文書数４８は固有パラメタを含む文書数を表示
している。なお、システムとしては、パラメタ選択ボタ
ン４６の選択状態を３種類とし、それぞれ指定した固有
パラメタを含むという文書検索条件、指定した固有パラ
メタを含まないという文書検索条件、指定した固有パラ
メタを検索条件としないという３つの文書検索条件に対
応させることもできる。以上のように文書を文書パラメ
タによって検索するパラメタ検索において、検索条件を
入力するためのガイダンスを検索対象の文書を解析する
ことによって生成し、検索者に提示するようにしたた
め、検索者による検索条件の入力が容易となる効果があ
る。The unique parameter name 47 shows the names of a fixed number of unique parameters taken from the top of the ranking. The presence or absence of a unique parameter is entered as a search condition with the parameter selection button 46 paired with the unique parameter name 47. The unique-parameter document count 48 paired with the unique parameter name 47 shows the number of documents containing that parameter. The system may also give the parameter selection button 46 three selection states corresponding to three search conditions: the document contains the specified unique parameter, the document does not contain it, or the parameter is not used as a search condition. As described above, in a parameter search that retrieves documents by document parameters, guidance for entering search conditions is generated by analyzing the documents being searched and is presented to the searcher, which makes it easier for the searcher to enter search conditions.

【００４５】実施の形態３．図８は本実施の形態におけ
る文書検索方法の動作フローチャートであり、実施の形
態１と同様または相当する部分については同一符号を付
しその説明を省略する。本実施の形態の検索方法は、実
施の形態１において分類管理されている文書に対して、
異なる分類体系の下にある文書を分類体系を超えて統合
して分類管理する、または統合して検索する。こうする
ことで検索範囲を広げることができる。Embodiment 3. FIG. 8 is an operation flowchart of the document search method of this embodiment; parts identical or equivalent to those of Embodiment 1 are given the same reference numerals and their description is omitted. The search method of this embodiment takes the documents classified and managed in Embodiment 1 and integrates documents held under different classification systems across those systems, either to classify and manage them together or to search them together. This widens the search range.

【００４６】図において、分類体系獲得ステップ２２は
文書収集ステップ１において文書を収集する際に、提供
元が構築した分類体系を同時に収集する。文書の集合を
参照する目次が提供元に存在し、目次の内容がＳＧＭＬ
（構造化文書記述言語：ＩＳＯ８８７９）などの所定の
形式により記述されているものとすれば、ＳＧＭＬ形式
の文書の形式をチェックする公知のＳＧＭＬパーザなど
を用いて目次の構造を解析することにより文書の分類体
系を得ることができる。In the figure, the classification system acquisition step 22 collects the classification system built by the provider at the same time as documents are collected in the document collection step 1. If the provider has a table of contents that references the set of documents, and its content is written in a predetermined format such as SGML (a structured document description language: ISO 8879), the classification system of the documents can be obtained by analyzing the structure of the table of contents with, for example, a known SGML parser that checks the form of SGML documents.

【００４７】分類体系合併ステップ８は、分類体系獲得
ステップ２２により得られた複数の分類体系を合併し、
統一した単一の分類体系を構築する処理である。図９は
分類体系合併ステップ８の動作の詳細を説明するフロー
チャートである。また図１０はその動作説明図である。
図１０において、Ｔ（１），Ｔ（２）の２つの分類体系
を合併することとする。まず、図９において、ｉ：分類体系の番号（ｉ＝１〜２）、Ｔ（ｉ）：ｉ番目の分類体系、Ｊ（ｉ）：Ｔ（ｉ）の終端文書クラスの数、ｊ：終端文書クラスの番号、Ｃ（ｉ，ｊ）：Ｔ（ｉ）のｊ番目の終端文書クラス、Ｖ（ｉ，ｊ）：Ｃ（ｉ，ｊ）の平均パラメタベクトル、Ｋ（ｊ）：Ｖ（１，ｊ）とＶ（２，ｋ）との類似度が最
大となるｋ、とする。この動作フローの左上のループのステップＳ３
０１ないしＳ３０７は、ある文書クラスの平均ベクトル
Ｖ（ｉ，ｊ）を求めることを意味し、右側のループのス
テップＳ３０８ないしＳ３１４は分類体系の異なる文書
クラス間の上記平均ベクトルの類似度を計算することを
意味し、左下のループのステップＳ３１５ないしＳ３２
０は、分類体系合併後においても下位文書クラスの接続
関係を維持して合併する動作を意味している。The classification system merging step 8 merges the multiple classification systems obtained by the classification system acquisition step 22 into a single unified classification system. FIG. 9 is a flowchart explaining the details of the operation of step 8, and FIG. 10 is a diagram illustrating that operation. In FIG. 10, two classification systems, T(1) and T(2), are to be merged. In FIG. 9, the symbols are: i: classification system number (i = 1 to 2); T(i): i-th classification system; J(i): number of terminal document classes of T(i); j: terminal document class number; C(i,j): j-th terminal document class of T(i); V(i,j): average parameter vector of C(i,j); K(j): the k that maximizes the similarity between V(1,j) and V(2,k). In this flow, steps S301 to S307 (upper-left loop) obtain the average vector V(i,j) of each document class; steps S308 to S314 (right-hand loop) calculate the similarity of those average vectors between document classes of the different classification systems; and steps S315 to S320 (lower-left loop) perform the merge while preserving the connections of the lower-level document classes after the merge.

【００４８】図９のステップＳ３０１〜Ｓ３０７によっ
てＴ（ｉ）の終端に位置するすべての文書クラスＣ
（ｉ，ｊ）の平均ベクトルＶ（ｉ，ｊ）を計算する。Ｓ
３０１、Ｓ３０２、Ｓ３０５によって分類体系Ｔの番号
ｉを１から２に変化させ、つまり分類体系Ｔ（１）から
Ｔ（２）に変化させ、Ｓ３０３、Ｓ３０４、Ｓ３０５に
よって終端文書クラスの番号を１からＪ（ｉ）に変化さ
せ、Ｓ３０６で平均ベクトルを計算することにより、分
類体系Ｔ（１），Ｔ（２）のすべての文書クラスＣ
（ｉ，ｊ）のそれぞれの平均ベクトルＶ（ｉ，ｊ）を計
算する。平均ベクトルＶ（ｉ，ｊ）とは文書クラスＣ
（ｉ，ｊ）に属するすべての文書の文書パラメタベクト
ルにおいて、各文書パラメタベクトル成分毎の平均値に
よる仮の平均ベクトルを作成し、長さが１となるよう正
規化した文書パラメタベクトルである。Steps S301 to S307 of FIG. 9 calculate the average vector V(i,j) of every document class C(i,j) located at the end of T(i). S301, S302, and S305 vary the classification system number i from 1 to 2, that is, from classification system T(1) to T(2); S303, S304, and S305 vary the terminal document class number from 1 to J(i); and S306 calculates the average vector. This yields the average vector V(i,j) of every document class C(i,j) in T(1) and T(2). The average vector V(i,j) is the document parameter vector obtained by forming a provisional average vector from the component-wise mean of the document parameter vectors of all documents belonging to C(i,j) and normalizing it to length 1.
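
The average vector of a document class can be sketched as below. The function name and the sample vectors are assumptions for illustration.

```python
import math


def average_vector(vectors):
    """Component-wise mean of the class's document vectors, scaled to length 1."""
    n = len(vectors)
    mean = [sum(column) / n for column in zip(*vectors)]   # provisional average
    length = math.sqrt(sum(x * x for x in mean))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in mean]


# Two hypothetical document parameter vectors belonging to one class.
v = average_vector([[1.0, 0.0], [0.0, 1.0]])
```

Normalizing to length 1 keeps the inner product between two class vectors comparable regardless of how many documents each class contains.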

【００４９】次にＳ３０８〜Ｓ３２０によって分類体系
Ｔ（２）を基準の分類体系とし分類体系Ｔ（１）のすべ
ての文書クラスが分類体系Ｔ（２）のどの文書クラスと
の類似度が最も高いかを判定する処理を行なう。Ｓ３０
８、Ｓ３０９、Ｓ３１２によってｊを１からＪ（１）に
変化させ、分類体系Ｔ（１）のすべての終端文書クラス
Ｃ（１，ｊ）を処理すると同時に、Ｓ３１０、Ｓ３１
１、Ｓ３１４によってｋを１からＪ（２）に変化させ、
分類体系Ｔ（２）のすべての終端文書クラスＣ（２，
ｋ）を処理する。このとき、Ｓ３１３でｊに対してＶ
（１，ｊ）とＶ（２，ｋ）との類似度すなわち内積値が
最大となるｋを求め、これを最大類似文書クラス番号Ｋ
（ｊ）にセットする。すなわち、分類体系Ｔ（２）のす
べての終端文書クラスであるＣ（２，１）からＣ（２，
Ｊ（２））の中で、Ｃ（１，ｊ）と最も類似する終端文
書クラスはＣ（２，Ｋ（ｊ））である。Next, S308 to S320 take classification system T(2) as the reference and determine, for every document class of T(1), which document class of T(2) it is most similar to. S308, S309, and S312 vary j from 1 to J(1) to process every terminal document class C(1,j) of T(1), while S310, S311, and S314 vary k from 1 to J(2) to process every terminal document class C(2,k) of T(2). For each j, S313 finds the k that maximizes the similarity, that is, the inner product, between V(1,j) and V(2,k) and sets it as the most similar document class number K(j). In other words, among all the terminal document classes C(2,1) to C(2,J(2)) of T(2), the one most similar to C(1,j) is C(2,K(j)).
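
A minimal sketch of the search for K(j) follows. It uses 0-based indices instead of the 1-based numbering of the text, and the function name and sample vectors are assumptions.

```python
def most_similar_classes(V1, V2):
    """For each average vector V1[j] of T(1), return the index k of the closest V2[k]."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return [max(range(len(V2)), key=lambda k: dot(v1, V2[k])) for v1 in V1]


# Hypothetical normalized average vectors of the terminal classes.
V1 = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]   # classes of T(1)
V2 = [[1.0, 0.0], [0.0, 1.0]]               # classes of T(2)
K = most_similar_classes(V1, V2)
```

Several classes of T(1) may map to the same class of T(2), as the second and third classes do here; the next paragraph describes how that case is handled.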

【００５０】Ｓ３１５〜Ｓ３２０では分類体系Ｔ（１）
の各終端文書クラスＣ（１，ｍ）を分類体系Ｔ（２）に
結合する処理を行なう。Ｓ３１５、Ｓ３１６、Ｓ３２０
によってｍを１からＪ（１）に変化させ、分類体系Ｔ
（１）のすべての終端文書クラスＣ（１，ｍ）を処理す
る。Ｓ３１７においてＫ（ｍ）＝Ｋ（ｎ）となるｎ（ｎ
≠ｍ）が１つ以上存在するかどうかを判定する。Ｋ
（ｍ）＝Ｋ（ｎ）となる場合はＴ（１）の２つ以上の終
端文書クラスＣ（１，ｍ），Ｃ（１，ｎ）がいずれもＴ
（２）の終端文書クラスＣ（２，Ｋ（ｍ））と最も類似
度が高くなっている。この場合は、Ｓ３１９によりＣ
（１，ｍ），Ｃ（１，ｎ）をＣ（２，Ｋ（ｍ））の下位
の文書クラスとみなして結合する。このとき、Ｃ（２，
Ｋ（ｍ））に所属していた文書はＣ（１，ｍ），Ｃ
（１，ｎ）のうちの何れかの類似度が高い方の文書クラ
スへ移動する。Ｋ（ｍ）＝Ｋ（ｎ）となるｎが存在しな
い場合、Ｓ３１８によりＣ（１，ｍ）をＣ（２，Ｋ
（ｍ））と同一の文書クラスとみなして結合する。S315 to S320 join each terminal document class C(1,m) of T(1) to T(2). S315, S316, and S320 vary m from 1 to J(1) to process every terminal document class C(1,m) of T(1). S317 determines whether there is at least one n (n ≠ m) such that K(m) = K(n). If K(m) = K(n), two or more terminal document classes of T(1), C(1,m) and C(1,n), are both most similar to the terminal document class C(2,K(m)) of T(2). In that case, S319 joins C(1,m) and C(1,n) as lower-level document classes of C(2,K(m)), and each document that belonged to C(2,K(m)) moves to whichever of C(1,m) and C(1,n) it is more similar to. If no n with K(m) = K(n) exists, S318 joins C(1,m) to C(2,K(m)), treating them as the same document class.
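
The decision made in S317 to S319 can be summarized by grouping the classes of T(1) by their most similar class in T(2). The sketch below only produces the merge plan; moving the documents of C(2,K(m)) into the new subclasses is omitted, and the names and 0-based indices are assumptions.

```python
from collections import defaultdict


def merge_plan(K):
    """Map each T(2) class index k to ('same' | 'subclasses', [T(1) class indices])."""
    groups = defaultdict(list)
    for m, k in enumerate(K):
        groups[k].append(m)
    return {
        k: ("subclasses" if len(members) > 1 else "same", members)
        for k, members in groups.items()
    }


plan = merge_plan([0, 1, 1])   # K(m) for three hypothetical classes of T(1)
```

A class of T(2) claimed by a single class of T(1) is merged with it as the same class, while one claimed by several classes of T(1) becomes their parent.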

【００５１】以上の動作を図解したものが図１０の動作
説明図である。図において、分類体系Ｔ（２）の終端文
書クラスの出力装置９０３は、分類体系Ｔ（１）の終端
文書クラスの出力デバイス９０１と最も類似度が高く、
Ｔ（１）の終端文書クラスの出力デバイス９０１を最も
類似度が高いとするＴ（２）の他の終端文書クラスは存
在しないため、Ｔ（２）の終端文書クラスの出力装置９
０３はＴ（１）の終端文書クラスの出力デバイス９０１
と同一の文書クラスであるとみなして結合し、新しく終
端文書クラスの出力デバイス９０６を作成する。Ｔ
（２）の終端文書クラスのハードディスク９０４および
ＦＤＤ９０５は共にＴ（１）の終端文書クラスの記憶デ
バイス９０２と最も類似度が高いため、Ｔ（２）の終端
文書クラスのハードディスク９０４およびＦＤＤ９０５
は共にＴ（１）の終端文書クラスの記憶デバイス９０２
の下位の文書クラスとみなして結合する。以上に説明し
たように、文書と同時に提供元から文書の分類体系を収
集し、文書クラス間の類似度を計算して類似度の高い文
書を同一文書クラスとして分類管理する事により、文書
クラス単位で文書を分類体系を超えて合併して分類管理
し登録することが可能となる。このため、以後の検索で
は類似度の高い文書を同一文書クラスとして取り扱うこ
とができる。また、提供元から収集した文書の分類体系
が既存の分類体系よりも詳細である場合、既存の分類体
系を詳細化することが可能となる。FIG. 10 illustrates the above operation. In the figure, the terminal document class "output apparatus" 903 of T(2) is most similar to the terminal document class "output device" 901 of T(1), and no other terminal document class of T(2) has "output device" 901 of T(1) as its most similar class. "Output apparatus" 903 of T(2) is therefore regarded as the same document class as "output device" 901 of T(1) and joined with it, creating the new terminal document class "output device" 906. The terminal document classes "hard disk" 904 and "FDD" 905 of T(2) are both most similar to the terminal document class "storage device" 902 of T(1), so both are joined as lower-level document classes of "storage device" 902. As described above, by collecting the providers' classification systems together with the documents, calculating the similarity between document classes, and classifying and managing highly similar documents under the same document class, documents can be merged across classification systems class by class and then classified, managed, and registered. In subsequent searches, highly similar documents can therefore be handled as one document class. Furthermore, when the classification system of the documents collected from a provider is more detailed than the existing one, the existing classification system can be refined.

【００５２】実施の形態４．図１１は本実施の形態にお
ける文書検索方法の動作フローチャートであり、図の実
施の形態３と同様または相当する部分については同一符
号を付しその説明を省略する。また、分類体系合併ステ
ップ８は必須ではない。本実施の形態は、同一の文書ク
ラスに属する文書が多くなり検索に時間がかかるなどの
検索効率の低下を改善し、検索範囲を絞れるように同一
クラスの文書を細分化して管理できるようにする。Embodiment 4. FIG. 11 is an operation flowchart of the document search method of this embodiment; parts identical or equivalent to those of Embodiment 3 are given the same reference numerals and their description is omitted. The classification system merging step 8 is not essential. This embodiment addresses the loss of search efficiency that arises when, for example, many documents belong to the same document class and searching takes a long time: documents of the same class are subdivided and managed so that the search range can be narrowed.

【００５３】分類体系細分化ステップ９は前記分類体系
合併ステップ８によって合併された分類体系を細分化す
る。図１２は分類体系細分化ステップ９の動作の流れを
説明するフローチャートである。図において、Ｓ：移動ステップ、ｉ，ｊ：特定の文書クラス内の文書の番号、Ｉ：特定の文書クラス内の文書の数、Ｊ：移動先の文書の番号、Ｃ（ｉ）：特定の文書クラス内のｉ番目の文書、Ｓｉｍ（Ａ，Ｂ）：文書Ａ・文書Ｂ間の類似度、Ｐ（ｉ）：文書Ｃ（ｉ）の通過フラグ、Ｌ（ｓ）：移動ステップＳの移動文書間の類似度、である。ステップＳ４０１〜Ｓ４１２は、ある終端文書
クラスに属する文書数が上限値を超えるなどの条件を満
たした場合に起動して、その終端文書クラスの下位の文
書クラスを作成する処理であり、これをすべての終端文
書クラスに対して適用することによって分類体系を細分
化する。The classification system subdivision step 9 subdivides the classification system merged by the classification system merging step 8. FIG. 12 is a flowchart explaining the flow of step 9. In the figure: s: movement step; i, j: document numbers within a specific document class; I: number of documents in the class; J: number of the destination document; C(i): i-th document in the class; Sim(A, B): similarity between documents A and B; P(i): visited flag of document C(i); L(s): similarity between the documents of movement step s. Steps S401 to S412 are started when a condition is met, such as the number of documents in a terminal document class exceeding an upper limit, and create lower-level document classes under that terminal class. Applying this to every terminal document class subdivides the classification system.

【００５４】Ｓ４０１〜Ｓ４０４は初期設定である。Ｓ
４０１ですべての文書の通過フラグＰ（ｉ）を０にリセ
ットする。Ｓ４０２で文書番号ｉに１をセットすること
により、特定の終端文書クラスに属する、文書のうちの
任意の１文書である文書Ｃ（１）を移動する文書の始点
として選択する。移動ステップｓの初期値として１をセ
ットする。Ｓ４０３では移動先の文書を探索するための
移動先文書番号ｊの初期値として２をセットする。Ｓ４
０４では移動文書間の類似度の最大値Ｍを０に初期化
し、移動元文書の通過フラグＰ（ｉ）を１にセットす
る。Ｓ４０５およびＳ４１０からなる繰り返し処理によ
って移動先の文書番号Ｊを判定する。S401 to S404 are initialization. S401 resets the visited flag P(i) of every document to 0. S402 sets the document number i to 1, thereby selecting document C(1), an arbitrary document of the specific terminal document class, as the starting point of the movement. The movement step s is set to the initial value 1. S403 sets the destination document number j, used to search for the destination document, to the initial value 2. S404 initializes the maximum similarity M between moved documents to 0 and sets the visited flag P(i) of the source document to 1. The loop formed by S405 and S410 determines the destination document number J.

[0055] Because the destination must be a document that has not yet been passed, S406 checks the pass flag P(j) of the destination candidate C(j). If P(j) is 0, S407 calculates the similarity between the source document C(i) and the candidate C(j) and stores it in S. S408 checks whether the similarity S exceeds the current maximum M, and if it does, S409 updates the maximum similarity M and the destination document number J. When S405 determines that all candidates j have been examined, the destination document number J and the maximum similarity M are fixed. In other words, S405 to S410 compare the first selected document C(1) with all the remaining documents in the same document class and select the one most similar to it.

[0056] If the maximum similarity M is 0 in S411, all movement between documents is judged to be complete and the processing ends. If M is not 0, S412 moves from document C(i) to document C(J) and records M as the similarity L(s) between the moved documents for that step. Then s is incremented by one and i is set to J, and the movement for the next step is performed. In this way, within a specific document class, the processing starts from an arbitrary document, never passes through the same document twice, and moves successively to the most similar remaining document until all documents have been passed. Storing the similarity L(s) at each movement step s gives the relationship between the movement step s and the similarity L(s) between moved documents.
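The traversal of steps S401 to S412 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names `dot` and `traverse` are hypothetical, indices are 0-based, and Sim(A, B) is assumed to be the inner product of two document parameter vectors.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity Sim(A, B), assumed here to be the inner product."""
    return sum(x * y for x, y in zip(a, b))


def traverse(vectors: List[Sequence[float]]) -> Tuple[List[int], List[float]]:
    """Move greedily to the most similar document that has not been passed.

    Returns the visiting order (0-based indices) and the list of
    similarities, one per move, corresponding to L(s).
    """
    if not vectors:
        return [], []
    passed = [False] * len(vectors)        # pass flags P(i), all reset (S401)
    i = 0                                  # arbitrary starting document (S402)
    order: List[int] = [i]
    sims: List[float] = []
    while True:
        passed[i] = True                   # mark the source as passed (S404)
        best, best_j = 0.0, -1             # maximum M and destination J
        for j in range(len(vectors)):      # examine every candidate (S405-S410)
            if passed[j]:
                continue                   # S406: skip passed documents
            s = dot(vectors[i], vectors[j])    # S407
            if s > best:                   # S408
                best, best_j = s, j        # S409
        if best == 0.0:                    # S411: nothing left to move to
            break
        sims.append(best)                  # S412: record L(s) = M
        order.append(best_j)
        i = best_j
    return order, sims


order, sims = traverse([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]])
```

As in the flowchart, the loop also stops early if every remaining document has similarity 0 to the current one.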

[0057] FIG. 13(a) illustrates the above processing. Starting from an arbitrary document (1) 121 in the document class, the processing moves to document (7) 122, which has the highest similarity within the same document class. A document that has been passed once is not passed again, and the movement to the most similar document is repeated in the same way until all documents in the document class have been passed.

[0058] The above yields the relationship between the movement step s and the similarity L(s) between moved documents at each step, as shown in FIG. 13(b). That is, the figure plots, in order, the movement step s at step S412 of the operation flow of FIG. 12 against the corresponding similarity L. In general, a fixed range of movement steps W is set in advance, the sum of the similarities L(s) over the range from movement step s to s + W is calculated, and a plurality of maximal ranges 124 and 125, in which this sum reaches a local maximum, are obtained.
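The windowed sum over L(s) can be illustrated with a short sketch. The names `window_sums` and `maximal_ranges` are hypothetical, steps are 0-based, and the handling of ties (a plateau is reported at its first window) is an assumption, since the text does not specify it.

```python
from typing import List, Tuple


def window_sums(sims: List[float], w: int) -> List[float]:
    """Sum of L(s) over each range s .. s+W (W+1 consecutive values)."""
    return [sum(sims[s:s + w + 1]) for s in range(len(sims) - w)]


def maximal_ranges(sims: List[float], w: int) -> List[Tuple[int, int]]:
    """Return the (s, s+W) ranges whose windowed sum is a local maximum."""
    sums = window_sums(sims, w)
    ranges: List[Tuple[int, int]] = []
    for s, total in enumerate(sums):
        left = sums[s - 1] if s > 0 else float("-inf")
        right = sums[s + 1] if s + 1 < len(sums) else float("-inf")
        # Strictly greater than the left neighbour, at least the right one.
        if total > left and total >= right:
            ranges.append((s, s + w))
    return ranges


ranges = maximal_ranges([0.9, 0.8, 0.1, 0.2, 0.9, 0.9, 0.1], w=1)
```

In this example the similarities are high at the start and again near the end of the traversal, so two maximal ranges are found, which would correspond to two candidate lower-level document classes.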

[0059] By examining, in the plot obtained above, the document parameters of the documents at the center of each maximal range, it is easy to create a lower-level document class corresponding to each maximal range. Specifically, for each maximal range, the average vector of the documents belonging to the range from movement step s to s + W is calculated and used as the document parameter vector of the lower-level document class. Alternatively, the document parameters of a new lower-level document class are set based on the result of the examination. Each document outside the maximal ranges is moved to whichever lower-level document class it is most similar to, and the document class is thereby subdivided. In the example of FIG. 12, the documents are divided into at least the two groups belonging to the maximal ranges 124 and 125, and two lower-level document classes are created. Because a single document class is subdivided into a plurality of lower-level document classes in this way, the classification system can be subdivided and managed as the number of registered documents increases.
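The subdivision itself can be sketched as follows, under stated assumptions: each maximal range has already been turned into a non-empty, disjoint group of document indices (`seed_groups`, a hypothetical input), and similarity is the inner product. The function names are illustrative only.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity, assumed to be the inner product."""
    return sum(x * y for x, y in zip(a, b))


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Average vector of a non-empty group of document parameter vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def subdivide(
    vectors: List[Sequence[float]], seed_groups: List[List[int]]
) -> Tuple[List[List[float]], List[List[int]]]:
    """Create one lower-level class per seed group and assign the rest.

    The average vector of each seed group becomes the class vector; every
    document outside the seed groups joins the most similar class.
    """
    centroids = [centroid([vectors[i] for i in group]) for group in seed_groups]
    classes = [list(group) for group in seed_groups]
    seeded = {i for group in seed_groups for i in group}
    for i, vector in enumerate(vectors):
        if i in seeded:
            continue
        best = max(range(len(centroids)), key=lambda k: dot(vector, centroids[k]))
        classes[best].append(i)
    return centroids, classes


docs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0.8, 0.3]]
centroids, classes = subdivide(docs, [[0, 1], [2, 3]])
```

Here document 4 lies outside both seed groups and is assigned to the first class, whose average vector it resembles more closely.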

[0060] Embodiment 5. FIG. 14 is an operation flowchart of the document search method according to this embodiment. Parts that are the same as or correspond to those in Embodiment 1 are given the same reference numerals, and their description is omitted. The search method of this embodiment separately extracts terms paired with attributes to create a term attribute dictionary; when an attribute is specified, the document classes containing that attribute are extracted, which makes searching easier. In the figure, the term extraction step 10 extracts terms from the text data portion of a document. Term extraction can be realized by dividing the text data into words with a known morphological analysis process and extracting the words tagged with specific parts of speech, such as nouns.

[0061] The term attribute determination step 11 determines the term attribute (or simply the attribute) of each extracted term by applying term extraction rules to the text data that was divided into words in the term extraction step 10. FIG. 15 is a flowchart explaining the detailed operation of the term attribute determination step 11 and the term attribute dictionary construction step 12. A term extraction rule R(i) in the figure consists of a text pattern part and a term attribute extraction part, as shown in FIG. 16(a). When part of the text data matches the text pattern part of a rule, the term attribute extraction part is executed and the attribute of the term is determined. For example, when the term attribute determination step 11 of FIG. 14 is applied to the text portion of a document such as the one shown in FIG. 16(b), terms with attributes such as those shown in FIG. 16(c) are extracted. The word "用" ("for") in the text portion of the document in FIG. 16(b) matches the text pattern part of the term extraction rule with rule number 1 in FIG. 16(a), and the variable $1 of the text pattern part matches the word "留守番電話機" ("answering machine"). The term attribute extraction part therefore determines that the term attribute of the word "留守番電話機" is "用途" ("use").
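A rule of this kind can be sketched as a token pattern plus an attribute. FIG. 16 is not reproduced here, so the rule format below (a list of tokens in which `$1` is the variable, paired with the attribute assigned to the matched term) and the function name `apply_rules` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (text pattern part, attribute assigned to the term bound to $1)
Rule = Tuple[List[str], str]


def apply_rules(tokens: List[str], rules: List[Rule]) -> Dict[str, str]:
    """Apply every rule to every position of the word-segmented text.

    Returns a term attribute dictionary mapping each matched term to its
    attribute. A later match overwrites an earlier one for the same term.
    """
    dictionary: Dict[str, str] = {}
    for pattern, attribute in rules:
        n = len(pattern)
        for start in range(len(tokens) - n + 1):
            term = None
            matched = True
            for expected, token in zip(pattern, tokens[start:start + n]):
                if expected == "$1":
                    term = token           # the variable binds any one word
                elif expected != token:
                    matched = False        # a literal word did not match
                    break
            if matched and term is not None:
                dictionary[term] = attribute
    return dictionary


# Hypothetical rule 1: "<term> 用" means the term describes a use (用途).
rules = [(["$1", "用"], "用途")]
tokens = ["留守番電話機", "用", "の", "集積回路"]
term_attributes = apply_rules(tokens, rules)
```

With this input, `term_attributes` maps "留守番電話機" to "用途", mirroring the example in the text.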

[0062] A term whose attribute has been determined is registered in the attribute dictionary, together with its term attribute information, by the term attribute dictionary construction step 12 of FIG. 14. The operation is described with reference to FIG. 15, which shows the details of the term attribute determination step 11 and the term attribute dictionary construction step 12. In the figure, i: term extraction rule number; I: number of term extraction rules; R(i): the i-th term extraction rule. Steps S502 to S508 apply all the term extraction rules to the text data that was divided into words in the term extraction step 10. In S501, the text description portion is determined by analyzing the format of the document. For example, for a document in SGML format, the ranges designated by tags such as "Overview", "Description", and "Product Description" are determined to be the text description portion. S502, S503, and S508 vary the rule number i from 1 to I, so that all the term extraction rules R(i) are applied in S504.

[0063] In S505, the text pattern part of the term extraction rule R(i) is matched against the text description portion determined in S501. If the matching succeeds, S506 executes the term attribute extraction part of the rule R(i) to determine the attribute. In S507, the term whose attribute was determined is registered in the term attribute dictionary together with the attribute information. The document classification search guidance step 13 uses the attributed terms extracted from the documents in this way to provide two functions in the document classification search step 5: displaying the attributed terms contained in the documents of each document class, and extracting and displaying the document classes that contain a specified attributed term.

[0064] Suppose that the term attribute determination step 11 and the term attribute dictionary construction step 12 produce a result such as the one shown in FIG. 17(a). Here the term and attribute 1502 have been extracted from a document 1501 belonging to document class T(4). FIG. 17(b) shows the search screen of the document classification search step 5. The classification system 1506, consisting of document classes T(1) to T(5), is displayed. When an attribute is specified with 1503 and the display button 1505 is selected, the term list 1507 shows, out of the terms contained in the documents of each document class, those with the attribute specified in 1503. When a term is specified from the term list 1507 and the search button 1505 is selected, the document classes that contain the specified term are displayed. For example, when "use" is selected as the attribute, the term list 1507 shows only the terms whose attribute is "use", such as "answering machine" and "rice cooker", out of the terms extracted from the documents registered in document class T(4). The searcher can thus list document classes by the terms of a specified attribute and grasp the classification system from a semantic viewpoint. In addition, using the displayed terms as search conditions makes it easier to enter the search conditions for the keyword search described later.

[0065] Embodiment 6. FIG. 18 is an operation flowchart of the document search method according to this embodiment. Parts that are the same as or correspond to those in Embodiment 5 are given the same reference numerals, and their description is omitted. In the figure, the document full-text search step 14 is a known search process that takes freely entered keywords as the search condition and returns the documents containing those keywords as the search result.

[0066] The document full-text search guidance step 15 presents the attributed terms extracted in the term attribute determination step 11 as input candidates for the search condition keywords of the document full-text search step 14. FIG. 19 is an example of the search condition input screen of the document full-text search step 14: a search keyword is entered in the search keyword input 1801, and selecting the search button 1802 searches for documents containing the entered keyword. The document full-text search guidance step 15 adds to this screen a term attribute selection mechanism 1803 and a keyword candidate display mechanism 1804. When an attribute is selected with the term attribute selection mechanism 1803, the keyword candidate display mechanism 1804 displays the list of terms with the selected attribute that are contained in the documents to be searched. When a term displayed in the keyword candidate display mechanism 1804 is selected, it is added to the search keyword input 1801 as a search keyword. The searcher can thus enter the search conditions of a keyword search by using, out of the terms extracted from the documents to be searched, those with the term attribute the searcher specified, which makes entering search conditions easier. Furthermore, because terms extracted from the documents to be searched are used as the search conditions of the keyword search, the search result is guaranteed to contain at least one document.

[0067] Embodiment 7. FIG. 20 is an operation flowchart of the document search method according to this embodiment. Parts that are the same as or correspond to those in Embodiment 6 are given the same reference numerals, and their description is omitted. As in the other embodiments, some steps may be omitted or combined. The synonym/similar word determination step 16 extracts synonyms and similar words by judging the synonymy and similarity of terms with different notations that were extracted from different documents in the term extraction step 10. The synonym/similar word dictionary construction step 17 builds a synonym/similar word dictionary from the synonyms and similar words extracted in the synonym/similar word determination step 16.

[0068] FIG. 21 is a flowchart explaining the detailed operation of the synonym/similar word determination step 16 and the synonym/similar word dictionary construction step 17. In the figure, steps S1901, S1902, and S1905 process all the document classes in the classification system. S1903, S1904, and S1908 process all the documents belonging to a specific document class. S1906, S1907, and S1915 process all the term attributes of a specific document. S1909 extracts from a specific document D the set W of terms to which a specific attribute A has been assigned. S1910 finds a document E whose similarity to the document being processed is at least a predetermined lower limit L. S1911 extracts from document E the set X of terms to which attribute A has been assigned. S1912 obtains the set Y by removing from W the terms it shares with X, and the set Z by removing from X the terms it shares with W. In S1913, if Y contains exactly one term Y(1) and Z contains exactly one term Z(1), the terms Y(1) and Z(1) are regarded as synonyms or similar words and are extracted. The similarity between document D and document E is used as the similarity between Y(1) and Z(1). S1914 registers the pair Y(1) and Z(1) in the synonym/similar word dictionary together with the similarity.

[0069] For example, suppose that the specifications of two integrated circuit components are registered in the same document class and that their similarity is as high as 0.97. From the specification of one component, the two terms "mobile phone" and "cellular phone" are extracted as terms whose attribute is "use"; from the specification of the other component, the two terms "car phone" and "cellular phone" are extracted as terms whose attribute is "use". The term "cellular phone" is common to both and is ignored. The two remaining terms, "mobile phone" and "car phone", one from each document, are judged to be synonyms or similar words and are registered in the synonym/similar word dictionary with a similarity of 0.97.
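The set operations of S1909 to S1913 can be sketched as below. The function name `find_synonym_pair` and the lower limit value used in the example are hypothetical; the document similarity is assumed to have been computed beforehand.

```python
from typing import Optional, Set, Tuple


def find_synonym_pair(
    terms_d: Set[str], terms_e: Set[str], similarity: float, lower_limit: float
) -> Optional[Tuple[str, str, float]]:
    """Return (Y(1), Z(1), similarity) if exactly one term is left on each side.

    terms_d and terms_e are the terms with the same attribute extracted
    from documents D and E (the sets W and X in the text).
    """
    if similarity < lower_limit:
        return None                    # E is not similar enough to D (S1910)
    y = terms_d - terms_e              # W without the shared terms (S1912)
    z = terms_e - terms_d              # X without the shared terms (S1912)
    if len(y) == 1 and len(z) == 1:    # S1913
        return next(iter(y)), next(iter(z)), similarity
    return None


pair = find_synonym_pair(
    {"mobile phone", "cellular phone"},
    {"car phone", "cellular phone"},
    similarity=0.97,
    lower_limit=0.9,
)
```

For the two specifications in the example, `pair` is `("mobile phone", "car phone", 0.97)`, which is what S1914 would register in the dictionary.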

[0070] The document full-text search synonym/similar word addition step 18 adds, to the search result of the document full-text search step 14 (which searches for documents using keywords as the search condition), the documents that contain synonyms or similar words of the search condition keywords. In other words, a keyword search result is obtained that also covers synonyms and similar words of the keywords. The documents in the search result can be ranked by using the similarity between the search condition keyword and the synonym or similar word as the evaluation value of each retrieved document. When documents containing a synonym or similar word of a synonym or similar word are added to the search result, the product of the respective similarities can be used as the evaluation value of the retrieved document.

[0071] For example, suppose a full-text search is performed with the keyword "mobile phone" and the synonym/similar word dictionary gives "car phone" as a synonym or similar word of "mobile phone". Documents containing "car phone" are then presented to the searcher as search results in addition to documents containing "mobile phone". If the similarity between "mobile phone" and "car phone" is 0.97, and a document containing one occurrence of "mobile phone" is given an evaluation value of 1, a document containing one occurrence of "car phone" can be ranked with an evaluation value of 0.97. In this way, even documents that contain no term exactly matching the keyword entered as the full-text search condition, but that contain terms similar to it, can be presented to the searcher as search results ranked by term similarity.
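The ranking described above can be sketched as follows. The dictionary layout, the function name `search_with_synonyms`, and the use of simple substring matching are assumptions made for illustration; a document's evaluation value is taken to be the highest score among the terms it contains.

```python
from typing import Dict, List, Tuple


def search_with_synonyms(
    keyword: str,
    documents: Dict[str, str],
    synonyms: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Full-text search that also returns documents containing synonyms.

    A document containing the keyword itself scores 1.0; a document that
    only contains a synonym scores the similarity stored in the dictionary.
    """
    candidates = dict(synonyms.get(keyword, {}))
    candidates[keyword] = 1.0          # an exact match always scores 1.0
    scores: Dict[str, float] = {}
    for doc_id, text in documents.items():
        hits = [score for term, score in candidates.items() if term in text]
        if hits:
            scores[doc_id] = max(hits)
    # Highest evaluation value first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


documents = {
    "a": "new mobile phone chip",
    "b": "car phone amplifier",
    "c": "rice cooker controller",
}
synonyms = {"mobile phone": {"car phone": 0.97}}
results = search_with_synonyms("mobile phone", documents, synonyms)
```

Here document "a" is ranked first with an evaluation value of 1.0 and document "b" second with 0.97, while document "c" is not returned.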

[0072] Embodiment 8. FIG. 22 is a flowchart showing the operation of the document search method according to this embodiment. Parts that are the same as or correspond to those in Embodiment 6 are given the same reference numerals, and their description is omitted.

[0073] When the number of documents in the search result of the document full-text search step 14 is smaller than a predetermined value, the document full-text search similar document addition step 19 adds to the search result those documents that have a high similarity, as calculated by the document similarity calculation step 3, to the documents already in the result, so that the number of documents in the search result reaches the predetermined value. For example, suppose the searcher specifies in advance that at least 10 documents are wanted as the full-text search result, and then executes a full-text search, but only three documents contain the keyword entered as the search condition. In this case, for each of the three documents containing the keyword, the document full-text search similar document addition step 19 uses the document similarity calculation step 3, which computes the inner product of document parameter vectors, to obtain the similarity to the documents that do not contain the keyword. The seven documents with the highest similarity are then added to the search result, and 10 documents are presented to the searcher. In this way, whenever the full-text search finds at least one document, the number of documents the searcher wants can be presented as the search result.
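The padding of a short result list can be sketched as below. The function name `pad_results` is hypothetical, and scoring each non-matching document by its best inner product with any matching document is one reasonable reading of the text rather than the only one.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two document parameter vectors."""
    return sum(x * y for x, y in zip(a, b))


def pad_results(
    hits: List[int], vectors: List[Sequence[float]], minimum: int
) -> List[int]:
    """Extend the keyword hits with similar documents up to `minimum` entries.

    Nothing is added when there are no hits at all, because there is then
    no document to measure similarity against.
    """
    if not hits or len(hits) >= minimum:
        return list(hits)
    hit_set = set(hits)
    scored = []
    for j, vector in enumerate(vectors):
        if j in hit_set:
            continue
        best = max(dot(vectors[h], vector) for h in hits)
        scored.append((best, j))
    # Most similar first; ties are broken by document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    extra = [j for _, j in scored[:minimum - len(hits)]]
    return list(hits) + extra


vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.5, 0.5]]
padded = pad_results([0], vectors, minimum=3)
```

With a single hit and a requested minimum of three, the two documents most similar to the hit are appended, giving `[0, 1, 3]`.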

[0074] Embodiment 9. FIG. 23 is a flowchart showing the operation of the document search method according to this embodiment. Parts that are the same as or correspond to those in Embodiment 6 are given the same reference numerals, and their description is omitted. In the figure, the characteristic term attribute determination step 20 evaluates how strongly the terms extracted in the term extraction step 10 depend on the classification system built in the document classification step 4, and thereby extracts characteristic terms that appear frequently in a specific document class.

[0075] FIG. 24 is a flowchart explaining the details of the operation of the characteristic term attribute determination step 20. Steps S2101, S2102, and S2105 process all the document classes that make up the classification system. S2103, S2104, and S2113 process all the documents belonging to a specific document class. S2106, S2107, and S2112 process all the terms contained in a specific document. S2109 calculates the appearance frequency F(W, C) of term W in document class C. S2110 calculates the appearance frequency F(W) of term W in the entire document set. S2111 calculates the dependency P(W, C) of term W on document class C. S2108 sorts the terms W appearing in document class C in descending order of P(W, C). P(W, C) is the dependency of term W on document class C; its value is higher the more often term W appears in document class C and the less often it appears in the documents as a whole.

[0076] For example, if 30 documents are registered in the document class "communication equipment" and the term "mobile phone" appears 60 times in them, F(mobile phone, communication equipment) is 60/30 = 2. If the entire document set consists of 1000 documents and the term "mobile phone" appears 100 times in them, F(mobile phone) is 100/1000 = 0.1. The dependency P(mobile phone, communication equipment) of the term "mobile phone" on the document class "communication equipment" is therefore 2/0.1 = 20. In this way, the dependency on the document class is calculated for every term appearing in every document class, and the characteristic terms that depend strongly on each document class can be extracted.
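The worked example implies that the dependency is the ratio of the two frequencies, P(W, C) = F(W, C) / F(W). The sketch below assumes that formula; the function names and the second term in the usage example ("circuit") are hypothetical.

```python
from typing import Dict, List, Tuple


def frequency(occurrences: int, num_documents: int) -> float:
    """Appearance frequency: occurrences per document."""
    return occurrences / num_documents


def dependency(
    occ_in_class: int, docs_in_class: int, occ_total: int, docs_total: int
) -> float:
    """P(W, C), assumed to be F(W, C) / F(W)."""
    return frequency(occ_in_class, docs_in_class) / frequency(occ_total, docs_total)


def rank_terms(
    class_counts: Dict[str, int],
    docs_in_class: int,
    total_counts: Dict[str, int],
    docs_total: int,
) -> List[Tuple[str, float]]:
    """Sort the terms of one document class by dependency, highest first."""
    scores = {
        term: dependency(count, docs_in_class, total_counts[term], docs_total)
        for term, count in class_counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


p = dependency(60, 30, 100, 1000)   # the "mobile phone" example from the text
ranking = rank_terms(
    {"mobile phone": 60, "circuit": 30}, 30,
    {"mobile phone": 100, "circuit": 1000}, 1000,
)
```

A term such as "circuit" that is equally common inside and outside the class gets a dependency of about 1, so it ranks below "mobile phone", whose dependency is about 20.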

[0077] The document full-text search characteristic term guidance step 21 presents the characteristic terms found in the characteristic term attribute determination step 20 to depend strongly on a document class. They are shown on the search condition input screen of the document full-text search step 14 as a term list sorted in descending order of dependency, and can be selected as search condition keywords. Because a keyword search that uses these characteristic terms as keywords achieves a document search effect equivalent to a classification search, a searcher with little knowledge of the document classification system can still search with terms the searcher already knows. Characteristic terms may also be presented for each document class in the classification search 5. This lets the searcher grasp the classification system through the characteristic terms, which provide information beyond the document class names.

【００７８】[0078]

[Effects of the Invention] As described above, according to the present invention, document parameters are extracted from documents collected from a plurality of providers, document parameter vectors are generated, and the documents are classified and managed on that basis. The result is unaffected by variations in document format or term usage, and high-quality document classification management and search can easily be achieved within a single framework.

[0079] Because the proportion in which a specific document parameter appears is shown during a document search that uses document parameters as search conditions, narrowing down the target documents becomes easier.

[0080] Because a plurality of different classification systems are collected and merged for classification management, a unified classification system can be constructed that still reflects the classification systems each provider has optimized for its documents. Documents can also be searched across a plurality of classification systems.

[0081] Because the classification system can be subdivided, for example when the number of documents increases, an optimal document search can be performed even as the number of target documents grows.

[0082] Because the terms extracted from the documents are presented as guidance information for each attribute, document classes or the set of documents to be searched can also be searched by attribute.

[0083] Because synonyms and similar words are extracted from the documents and a synonym/similar word dictionary is provided, similar documents containing synonyms or similar words can also be retrieved.

[0084] Because characteristic terms that depend on the classification are extracted from the classified documents, document classes or the set of documents to be searched can be searched by characteristic term.

[Brief description of the drawings]

FIG. 1 is a flowchart showing the operation of the document classification management and document search method according to Embodiment 1 of the present invention.

FIG. 2 is a diagram showing an example of a document to be analyzed and the document parameters extracted from it, for explaining the operation of the document classification management method of Embodiment 1.

FIG. 3 is a diagram showing examples of another document and its document parameters, the generated document parameter vectors, and a classification system with its document parameter vectors.

FIG. 4 is a flowchart of document parameter vector generation in document similarity calculation step 3 of FIG. 1.

FIG. 5 is a flowchart showing the operation of the document classification management and document search method according to Embodiment 2 of the present invention.

FIG. 6 is a flowchart showing the detailed operation of classification parameter search guidance step 7 of FIG. 5.

FIG. 7 is a diagram of a screen showing an example of operations and results, for explaining the operation of the classification parameter search step.

FIG. 8 is a flowchart showing the operation of the document classification management and document search method according to Embodiment 3 of the present invention.

FIG. 9 is a flowchart showing the detailed operation of classification system merging step 8 of FIG. 8.

FIG. 10 is a diagram showing changes in the classification system, for explaining the operation of the classification system merging step.

FIG. 11 is a flowchart showing the operation of the document classification management and document search method according to Embodiment 4 of the present invention.

FIG. 12 is a flowchart showing the detailed operation of classification system subdividing step 9 of FIG. 11.

FIG. 13 is a diagram showing an example of the relationship between movement step S and the similarity L between moved documents, for explaining the operation of the classification system subdividing step.

FIG. 14 is a flowchart showing the operation of the document search method according to Embodiment 5 of the present invention.

FIG. 15 is a flowchart showing the detailed operation of term attribute determination step 11 and term attribute dictionary construction step 12 of FIG. 14.

FIG. 16 is a diagram showing an example of the term extraction rules of FIG. 15 and the term extraction result obtained by applying them to a certain document.

FIG. 17 is a diagram showing an example of term attribute extraction results and an example of applying those results to document search.

FIG. 18 is a flowchart showing the operation of the document search method according to Embodiment 6 of the present invention.

FIG. 19 is a diagram showing an example of the input screen for full-text search conditions in the document full-text search step of FIG. 18.

FIG. 20 is a flowchart showing the operation of the document search method according to Embodiment 7 of the present invention.

FIG. 21 is a flowchart showing the detailed operation of synonym/similar-word determination step 16 and synonym/similar-word dictionary construction step 17 of FIG. 20.

FIG. 22 is a flowchart showing the operation of the document search method according to Embodiment 8 of the present invention.

FIG. 23 is a flowchart showing the operation of the document search method according to Embodiment 9 of the present invention.

FIG. 24 is a flowchart showing the detailed operation of characteristic term determination step 20 of FIG. 23.

[Description of Reference Numerals] 1: document collection step, 2: document parameter extraction step, 3: document similarity calculation step, 4: document classification step, 5: document classification search step, 6: document parameter search step, 7: classification parameter search guidance step, 8: classification system merging step, 9: classification system subdividing step, 10: term extraction step, 11: term attribute determination step, 12: term attribute dictionary construction step, 13: document classification search guidance step, 14: document full-text search step, 15: document full-text search guidance step, 16: synonym/similar-word determination step, 17: synonym/similar-word dictionary construction step, 18: document full-text search synonym/similar-word addition step, 19: document full-text search similar document addition step, 20: characteristic term attribute determination step, 21: document full-text search characteristic term guidance step, 22: classification system acquisition step, 37: common parameter name display, 38: parameter value input area, 39: number of search target documents, 40: New button, 41: Add button, 42: Narrow-down button, 43: Execute search button, 44: Back button, 45: Distribution button, 46: parameter selection button, 47: unique parameter name, 48: number of documents containing the unique parameter, 49: input candidate button, 50: common parameter input value candidates, 51: common parameter value distribution graph.

Continued from the front page: (72) Inventor: Yasuhiro Takayama, c/o Mitsubishi Electric Corporation, 2-2-3 Marunouchi, Chiyoda-ku, Tokyo

Claims

[Claims]

1. A document classification management method comprising: a document parameter extracting step of extracting document parameters delimited by tags of a predetermined format from N collected documents; a step of calculating, for an M-dimensional document parameter vector composed of the M document parameters extracted from the N documents, a weight for each component of the vector based on its appearance frequency so as to determine a document parameter vector for each document, and obtaining the similarity between the document parameter vectors determined for the respective documents; and a document classification step of classifying the collected documents into document classes of a classification system determined according to the similarity.

2. A document search method comprising: a document parameter extracting step of extracting document parameters delimited by tags of a predetermined format from N collected documents; a step of calculating, for an M-dimensional document parameter vector composed of the M document parameters extracted from the N documents, a weight for each component of the vector based on its appearance frequency so as to determine a document parameter vector for each document, and obtaining the similarity between the document parameter vectors determined for the respective documents; a document classification step of classifying the documents into document classes of a classification system determined according to the similarity; and a document classification search step of extracting, from the plurality of documents classified and managed in the document classification step, documents matching a document class indicated in the classification system.

3. The document search method according to claim 2, further comprising a document parameter search guidance step of, when an appearance ratio of a specific document parameter relative to all documents is designated, extracting document parameters having a predetermined appearance ratio close to the designated appearance ratio.

4. The document classification management method according to claim 1, further comprising a classification system merging step of determining an average parameter vector for each document class within the same classification system and, when the similarity between the average parameter vectors of document classes in different classification systems is within a predetermined value, treating those document classes as the same document class in a merged classification system that combines the different classification systems.

5. The document classification management method according to claim 4, wherein, in the classification system merging step, when a lower-level document class is connected to a document class whose similarity is to be obtained, the lower-level connection structure is carried over and managed in the merged classification system.

6. A document classification management method further comprising a classification system subdividing step of sequentially moving through the most similar documents belonging to the same document class and extracting the inter-document similarity at each move.

7. The document search method according to claim 2, wherein terms and their attributes extracted from documents by predetermined term extraction rules are built and managed as a term attribute dictionary in association with document classes, the method further comprising a document classification search guidance step of, when a specific attribute is designated, extracting the document classes having terms corresponding to that attribute.

8. The document search method according to claim 7, further comprising a keyword search mechanism for searching for documents having a specific keyword, wherein, when a specific attribute is designated, the corresponding terms are extracted from the term attribute dictionary and searched for as keywords.

9. The document search method according to claim 7, further comprising a similar document addition step of extracting and adding a predetermined number of documents in descending order of similarity with the search result document class.

10. The document search method according to claim 2, wherein a synonym/similar-word dictionary is built and managed based on the similarity between terms of the attributes of the same document class, a keyword search mechanism for searching for documents having a specific keyword is added, and terms registered in the synonym/similar-word dictionary as equivalent to the keyword are searched for in addition to the keyword.

11. The document search method according to claim 2, further comprising a characteristic term determination step of extracting characteristic terms based on their appearance frequency in a specific document class, and a keyword search mechanism for searching for documents having a specific keyword, wherein the characteristic terms extracted in the characteristic term determination step are searched for as the keyword.