JP2006190235A

JP2006190235A - Document classifying method, document classifying program and document classifying device

Info

Publication number: JP2006190235A
Application number: JP2005058100A
Authority: JP
Inventors: Yosuke Kunishi; 洋介国司
Original assignee: Shin Etsu Polymer Co Ltd; Shin Etsu Chemical Co Ltd
Current assignee: Shin Etsu Polymer Co Ltd; Shin Etsu Chemical Co Ltd
Priority date: 2004-12-09
Filing date: 2005-03-02
Publication date: 2006-07-20
Anticipated expiration: 2025-03-02
Also published as: JP4545614B2

Abstract

PROBLEM TO BE SOLVED: To provide a document classification method that shows the similarity relationships among a plurality of classification target documents in a visually recognizable manner.

SOLUTION: A stable inter-document distance between each two of the classification target documents is calculated according to the degree of similarity between the two documents. Initial placement documents are selected from the classification target documents and their position coordinates are initialized. For each initial placement document, a step of calculating the inter-document force vector received from each other initial placement document, based on the direction of the separation vector to that document and on the difference between the length of the separation vector and the stable inter-document distance, and a step of calculating the position coordinates at the next processing time according to the inter-document force vectors are repeated until the initial placement documents converge, which yields provisional position coordinates. Further documents are then added in several rounds, with provisional position coordinates calculated each time, and the step of calculating the inter-document force vectors and the step of updating the position coordinates are finally repeated until all classification target documents converge.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a document classification method for classifying a plurality of classification target documents according to their contents.

Conventional document classification devices include, for example, those described in Patent Documents 1 to 3. Patent Document 4 relates to a graphics creation method that places nodes and arcs, whose combinations are determined in advance, in a display space with a visually suitable balance. The document classification device described in Patent Document 1 classifies each of a plurality of classification target documents into one of a plurality of predetermined categories according to the distances between the documents. The document classification device described in Patent Document 2 classifies each of a plurality of classification target documents into one of a set of two-dimensionally arranged cells according to the distance between the word vector of the document and the word vector of each cell.
Patent Documents 1 to 4, in order: Japanese Patent Laid-Open No. H8-221447; Japanese Patent No. 3385297; JP 2003-288352 A; JP 2002-312803 A

The document classification devices described in Patent Documents 1 and 2 classify each document in units of categories or cells. This gives a rough picture of the content-based positional relationships among the classification target documents as a whole. These devices cannot, however, show the positional relationships among documents classified into the same category or cell. Classifying the documents within one category or cell in more detail would require the user to check their contents one by one, which makes the manual workload heavy.
The spring model described in Patent Document 3 has the problem that the amount of calculation needed before the arrangement converges becomes enormous. Patent Document 4 discloses a method of placing nodes while adding them to the display space one after another. However, the graphics creation method of Patent Document 4 does not represent the interrelationships between nodes as inter-node distances, and, unlike the present invention, it is not a method in which adding a new placement point (node, document) requires the positions of all existing placement points to be reconstructed. The node addition method disclosed in Patent Document 4 basically adds new nodes while keeping the existing nodes fixed, and on each addition only the positions of nearby existing nodes are reconstructed. If the graphics creation method of Patent Document 4 were applied to document classification, the positions of all placement points would have to be reconstructed every time a single placement point was added, so the amount of calculation would increase rather than decrease.
The present invention has been made in view of the above problems. Its object is to provide a document classification method, a document classification program, and a document classification device that make it possible to know in detail the positional relationships among documents within a plurality of classification target documents and, in addition, to suppress the increase in the amount of calculation even when the number of classification target documents becomes large.

The document classification method of the present invention classifies a plurality of classification target documents according to their contents and comprises:
a first step of calculating a stable inter-document distance between each two of the classification target documents according to the degree of similarity between the two documents;
a second step of selecting initial placement documents from the classification target documents;
a third step of calculating the position coordinates at which each initial placement document is initially placed on a display coordinate system;
a fourth step of calculating, for each two of the placed initial placement documents, a separation vector from one document to the other on the display coordinate system at the current processing time, and calculating, for each placed initial placement document, the inter-document force vector received from each other initial placement document based on the direction of the separation vector and on the difference between the length of the separation vector and the stable inter-document distance;
a fifth step of calculating, for each placed initial placement document, a total inter-document force vector by summing the inter-document force vectors received from the other initial placement documents;
a sixth step of calculating, for each placed initial placement document, the position coordinates at the next processing time according to the total inter-document force vector;
a seventh step of judging, during the repetition of the fourth to sixth steps, whether the position coordinates of the initial placement documents have converged, and ending the repetition;
an eighth step of selecting, from the classification target documents, next placement documents to be newly incorporated into the display coordinate system;
a ninth step of calculating the position coordinates at which each next placement document is initially placed in the display coordinate system;
a tenth step of calculating, for each newly placed next placement document, a total inter-document force vector by summing the inter-document force vectors received from the existing placed documents;
an eleventh step of calculating, for each newly placed next placement document, the position coordinates at the next processing time according to the total inter-document force vector; and
a twelfth step of judging, during the repetition of the tenth and eleventh steps, whether the position coordinates of the next placement documents have converged, and ending the repetition.
Here, the current processing time is a point in time before the sixth step (or the eleventh step) is executed in a given round of the repetition of the fourth to sixth steps (or of the tenth and eleventh steps), and the next processing time is a point in time before the sixth step (or the eleventh step) is executed in the round that follows that given round.
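The twelve steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`force_on`, `relax`, `classify`), the movement coefficient `k = 0.05`, the tolerance, the iteration cap, the unit-square random start, and the choice to add one document per round are all hypothetical. The force expression follows the example formulas given later in the description (with α = 2, β = 0.5), and each document is moved along its total force vector.

```python
import math
import random

def force_on(i, positions, l0, others, alpha=2.0, beta=0.5, eps=1e-12):
    """Total inter-document force on document i from the documents in `others`."""
    fx = fy = 0.0
    xi, yi = positions[i]
    for j in others:
        if j == i:
            continue
        dx, dy = xi - positions[j][0], yi - positions[j][1]
        dist = math.hypot(dx, dy)              # length of the separation vector
        d0 = l0[frozenset((i, j))]             # stable inter-document distance
        f = (d0 - dist) / (d0 + eps) ** alpha  # > 0 repels, < 0 attracts
        fx += f * dx / (dist + eps) ** beta
        fy += f * dy / (dist + eps) ** beta
    return fx, fy

def relax(moving, placed, positions, l0, k=0.05, tol=1e-6, max_iter=10000):
    """Move only the documents in `moving` until their mean step is <= tol."""
    for _ in range(max_iter):
        forces = {i: force_on(i, positions, l0, placed) for i in moving}
        total = 0.0
        for i, (fx, fy) in forces.items():
            x, y = positions[i]
            positions[i] = (x + k * fx, y + k * fy)
            total += math.hypot(k * fx, k * fy)
        if total / len(moving) <= tol:
            return True
    return False  # iteration cap reached (the cap itself is an assumption)

def classify(docs, l0, n_initial=2, seed=0):
    rng = random.Random(seed)                  # seeded only for reproducibility
    positions = {}
    placed = list(docs[:n_initial])            # step 2: initial placement documents
    for d in placed:                           # step 3: initial coordinates
        positions[d] = (rng.random(), rng.random())
    relax(placed, placed, positions, l0)       # steps 4 to 7
    for d in docs[n_initial:]:                 # step 8: next placement document
        positions[d] = (rng.random(), rng.random())  # step 9
        placed.append(d)
        relax([d], placed, positions, l0)      # steps 10 to 12: only d moves
    return positions
```

For example, calling `classify(["A", "B", "C"], l0)` with every stable distance in `l0` set to 1.0 (keys are `frozenset` pairs of document names) should arrange the three documents roughly as an equilateral triangle with unit sides.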

By executing this series of steps, the classification target documents are distributed in the display coordinate system so that the forces that try to keep each document at its stable inter-document distance from the other classification target documents are balanced. Because the stable inter-document distance is calculated according to the similarity between two classification target documents, the resulting distribution places highly similar documents close to each other and dissimilar documents far from each other.

If the movement process is simply performed on all classification target documents at once, the amount of calculation becomes enormous when the number of documents is large. In addition, when many documents of low similarity lie between classification target documents of high similarity, the highly similar documents may fail to approach each other no matter how often the movement process is repeated. These problems can be avoided by first executing the movement process on a small number of classification target documents and then adding further documents in turn and executing the movement process again. The ninth to twelfth steps may be performed either on all of the next placement documents at once or on one document at a time in sequence.
Another aspect of the document classification method of the present invention classifies a plurality of classification target documents according to their contents and comprises:
a first step of calculating a stable inter-document distance between each two of the classification target documents according to the degree of similarity between the two documents;
a second step of selecting initial placement documents from the classification target documents;
a third step of calculating the position coordinates at which each initial placement document is initially placed on a display coordinate system;
a fourth step of calculating, for each two of the placed initial placement documents, a separation vector from one document to the other on the display coordinate system at the current processing time, and calculating, for each placed initial placement document, the inter-document force vector received from each other initial placement document based on the direction of the separation vector and on the difference between the length of the separation vector and the stable inter-document distance;
a fifth step of calculating, for each placed initial placement document, a total inter-document force vector by summing the inter-document force vectors received from the other initial placement documents;
a sixth step of calculating, for each placed initial placement document, the position coordinates at the next processing time according to the total inter-document force vector;
a seventh step of judging, during the repetition of the fourth to sixth steps, whether the position coordinates of the initial placement documents have converged, and ending the repetition;
an eighth step of selecting, from the classification target documents, a plurality of next placement documents to be newly incorporated into the display coordinate system;
a ninth step of calculating, for one of the next placement documents, the position coordinates at which it is initially placed in the display coordinate system, and placing it there;
a tenth step of calculating, for the next placement document placed in the ninth step, a total inter-document force vector by summing the inter-document force vectors received from the existing placed documents;
an eleventh step of calculating, for the next placement document placed in the ninth step, the position coordinates at the next processing time according to the total inter-document force vector;
a twelfth step of judging, during the repetition of the tenth and eleventh steps, whether the position coordinates of the next placement document placed in the ninth step have converged, and taking its position at that time as its provisional position;
a thirteenth step of executing, after the ninth to twelfth steps have been executed for all of the next placement documents selected in the eighth step, the repetition of the fourth to seventh steps for all placed documents; and
a fourteenth step of executing the eighth to thirteenth steps for the remaining classification target documents.
According to this aspect, what is performed each time a new classification target document is added is only a provisional determination of that document's position, and the positions of all placed documents are reconstructed only once each time the plurality of next placement documents selected in the eighth step has been placed. This prevents the amount of calculation from growing.
In the document classification method of the present invention, each classification target document preferably has a document number that identifies it, and the degree of similarity between two documents in the first step is preferably calculated using the document numbers of the classification target documents that the two documents cite. This allows classification target documents written in different languages to be handled. It is also preferable to regard each document as one of the classification target documents that it cites. The fact that one of the two documents being compared cites the other is then reflected in the calculated similarity.
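The citation-based similarity can be illustrated with a short Python sketch. The description does not give a formula, so the Jaccard overlap used here, the conversion `distance = 1 - similarity`, and the names `cited_set`, `similarity`, and `stable_distance` are assumptions chosen only to show the effect of counting a document among its own citations.

```python
def cited_set(doc_number, citations):
    """Document numbers cited by a document, plus the document itself."""
    return set(citations.get(doc_number, ())) | {doc_number}

def similarity(a, b, citations):
    """Jaccard overlap of the two cited sets (an assumed measure)."""
    sa, sb = cited_set(a, citations), cited_set(b, citations)
    return len(sa & sb) / len(sa | sb)

def stable_distance(a, b, citations):
    """Assumed mapping: more similar documents get a smaller stable distance."""
    return 1.0 - similarity(a, b, citations)

# Hypothetical citation data keyed by document number.
citations = {
    "P0001": ["P0002", "P0009"],
    "P0002": ["P0009"],
    "P0003": [],
}
```

In this sample data P0001 cites P0002. Because each document counts as one of its own citations, P0002 appears in both cited sets, so the overlap is 2 of 3 document numbers rather than 1 of 2. Only document numbers are compared, which is why the language of the documents does not matter.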

Preferred forms of the present invention include forms obtained by adding any of the constituent elements of the dependent claims (in any combination) to a form specified by an independent claim.

The present invention makes it possible to know in detail the positional relationships among documents within a plurality of classification target documents.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. FIG. 1 is a block diagram showing an embodiment of a document classification device according to the present invention. The document classification device 1 classifies a plurality of classification target documents according to the contents of each document. It includes a database 10, a stable inter-document distance calculation unit 22, a placement document selection unit 23, a position coordinate initial value setting unit 24, an inter-document force vector calculation unit 26, and a position coordinate update unit 28. The database 10 includes a classification target document DB 12, a stable inter-document distance DB 14, a position coordinate DB 16, and an inter-document force vector DB 18. The classification target document DB 12 stores the classification target documents in association with document codes that identify each document. The classification target documents are stored in the classification target document DB 12 in advance, but they can also be entered as needed through suitable input means.

The stable inter-document distance DB 14 stores the stable inter-document distances calculated by the stable inter-document distance calculation unit 22 in association with document codes. FIG. 2 is a configuration diagram showing an example of the contents of the stable inter-document distance DB 14. As shown in the figure, the stable inter-document distance between each two documents is stored in association with their document codes (P0001, P0002, ...). In this example, the stable inter-document distance between document P0001 and document P0002 is 0.005.

The position coordinate DB 16 stores, in association with document codes, the initial values of the position coordinates of each document set by the position coordinate initial value setting unit 24 and the position coordinates updated by the position coordinate update unit 28. FIG. 3 is a configuration diagram showing an example of the contents of the position coordinate DB 16. As shown in the figure, the position coordinates (X coordinate, Y coordinate) of each document are stored in association with its document code. In this example, the position coordinates of document P0003 are (0.5155, 0.3417).

The inter-document force vector DB 18 stores the total inter-document force vectors calculated by the inter-document force vector calculation unit 26 in association with document codes. FIG. 4 is a configuration diagram showing an example of the contents of the inter-document force vector DB 18. As shown in the figure, the total inter-document force vector (FX, FY) acting on each document is stored in association with its document code. In this example, the total inter-document force vector of document P0002 is (0.007, −0.003).
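The three tables described above can be pictured as simple keyed mappings. The following is a minimal Python sketch, not the device's actual storage format: the variable names, the helper `stable_distance`, and the use of `frozenset` keys for unordered document pairs are assumptions, while the sample values are the ones quoted for FIGS. 2 to 4.

```python
# Stable inter-document distance DB 14: one value per unordered pair of codes.
stable_distance_db = {
    frozenset({"P0001", "P0002"}): 0.005,
}

# Position coordinate DB 16: (X, Y) per document code, overwritten on update.
position_db = {
    "P0003": (0.5155, 0.3417),
}

# Inter-document force vector DB 18: total force (FX, FY) per document code.
force_db = {
    "P0002": (0.007, -0.003),
}

def stable_distance(code_a, code_b):
    """Look up a stable distance regardless of the order of the two codes."""
    return stable_distance_db[frozenset((code_a, code_b))]
```

Keying the distance table by an unordered pair reflects that the stable distance between two documents is the same in both directions.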

The stable inter-document distance calculation unit 22 calculates, for the classification target documents stored in the classification target document DB 12, the stable inter-document distance between each two documents according to the degree of similarity between them. The stable inter-document distance becomes smaller as the contents of the two documents become more similar and larger as they become less similar.

The position coordinate initial value setting unit 24 sets the initial values of the position coordinates of each document on a two-dimensional coordinate plane. An example of how it sets the initial values is as follows. For convenience, let the number of classification target documents be N (N is an integer of 2 or more) and denote each document by T_i (i = 1, 2, ..., N). First, the stable inter-document distance L_0(i, j) between document T_i and document T_j (j = 1, 2, ..., N; j ≠ i) is read into a table L_a. After L_0(i, j) has been read for all pairs (i, j), the average value L_avg of L_0(i, j) is calculated. The position coordinates (X_i, Y_i) of each document T_i are then obtained from
X_i = L_avg × rnd
Y_i = L_avg × rnd
where rnd denotes a random number. This sets the initial values of the position coordinates of each document. The stable inter-document distances L_0(i, j) are normalized by dividing them by the average value L_avg.
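The initialization just described can be sketched in Python as follows. The function name `init_positions`, the dictionary layout, the seed, and the assumption that rnd is drawn uniformly from [0, 1) are illustrative choices; the text only says that rnd is a random number.

```python
import random

def init_positions(l0, n, seed=0):
    """l0 maps (i, j) with i < j to the stable inter-document distance L_0(i, j)."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    l_avg = sum(l0.values()) / len(l0)
    # X_i = L_avg * rnd and Y_i = L_avg * rnd, with a fresh rnd for each coordinate.
    positions = [(l_avg * rng.random(), l_avg * rng.random()) for _ in range(n)]
    # Normalize every stable distance by the average value L_avg.
    l0_norm = {pair: d / l_avg for pair, d in l0.items()}
    return positions, l0_norm, l_avg

# Hypothetical stable distances for three documents.
l0 = {(0, 1): 0.005, (0, 2): 0.02, (1, 2): 0.035}
positions, l0_norm, l_avg = init_positions(l0, n=3)
```

After normalization the stable distances average to 1, so later parameters such as the movement coefficient do not depend on the raw scale of the similarity measure.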

The inter-document force vector calculation unit 26 calculates the total inter-document force vector acting on each document. The total inter-document force vector is the vector sum of the inter-document forces that a document receives from the other documents. The inter-document force is a concept introduced on the assumption that an attractive force acts between two documents when the distance on the coordinate plane obtained from their position coordinates is larger than their stable inter-document distance, and that a repulsive force acts between them when the distance on the coordinate plane is smaller than their stable inter-document distance. The magnitude of these forces increases as the absolute value of the difference between the distance on the coordinate plane and the stable inter-document distance increases, and decreases as that absolute value decreases. When the distance on the coordinate plane equals the stable inter-document distance, the inter-document force acting between the two documents is zero.

An example of how the inter-document force vector calculation unit 26 calculates the inter-document force vector is as follows. First, the distance L(i, j) between document T_i and document T_j is obtained from their position coordinates at that processing time (the current processing time) using
L(i, j) = {(X_i − X_j)^2 + (Y_i − Y_j)^2}^0.5
In this embodiment, when position coordinate updates are described in terms of the "current processing time" and the "next processing time", the current processing time is the time at which a given round of the repetition in the movement process starts, and the next processing time is the time at which the following round starts. The movement process is the process of repeating, for each of all or some of the placed documents, the calculation of the total inter-document force vector and the update of the position coordinates based on it until the position coordinates are judged to have converged. The distance is based on the position coordinates "at that processing time" because, as described later, the position coordinates of each document T_i are updated as needed and do not always take the same values. Next, the inter-document force f(i, j) between document T_i and document T_j is obtained from
f(i, j) = (L_0(i, j) − L(i, j)) / (L_0(i, j) + ε1)^α
Here, ε1 is a constant that handles the case where L_0(i, j) is 0 and is set to, for example, 1 × 10^−12. The exponent α is set so that the inter-document force f(i, j) grows exponentially as the stable inter-document distance L_0(i, j) becomes smaller. A larger inter-document force then acts when the similarity between documents is high. As a result, groups of similar documents form easily, the positions at which the groups are placed come close to human intuition, and convergence remains easy even when the number N of classification target documents is large. When the number of classification target documents is relatively small (N less than 50), α is set to a value from 0.8 to 2.3. When N exceeds 100, setting α to a value from 1.8 to 2.2 makes convergence easy, and α = 2 is particularly suitable for N = 101 to 3000. When the classification target documents are mapped into a two-dimensional space in particular, a value of α below this range often causes the coordinates of some documents to diverge rather than converge during the repetition of the movement process, and a value above this range tends to produce uniform groups of documents that do not reflect the contents of the individual documents. Overall, α is set, for example, to a value from 0.8 to 2.3 and is preferably 2. Next, the X component fX(i, j) and the Y component fY(i, j) of the inter-document force that document T_i receives from document T_j are obtained from
fX(i, j) = f(i, j) × (X_i − X_j) / (L(i, j) + ε2)^β
fY(i, j) = f(i, j) × (Y_i − Y_j) / (L(i, j) + ε2)^β
Here, ε2 is a constant that handles the case where L(i, j) is 0 and is set to, for example, 1 × 10^−12, and β is set to, for example, 0.5. Finally, the X component FX_i and the Y component FY_i of the sum of the inter-document forces acting on each document T_i are obtained from
FX_i = Σ_j fX(i, j)
FY_i = Σ_j fY(i, j)
Here, Σ_j denotes the sum over all placed documents. The vector whose components are the FX_i and FY_i calculated in this way is the total inter-document force vector described above.
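The force calculation can be sketched in Python as follows. This is an illustration only: the function names `pair_force` and `total_forces` and the dictionary keyed by index pairs are assumptions, the distance is taken as the ordinary Euclidean distance between the two documents, and α = 2, β = 0.5, and ε1 = ε2 = 1 × 10^−12 are the example values from the text.

```python
import math

EPS1 = 1e-12  # guards against L_0(i, j) = 0
EPS2 = 1e-12  # guards against L(i, j) = 0

def pair_force(pi, pj, l0_ij, alpha=2.0, beta=0.5):
    """Force (fX, fY) that the document at pi receives from the document at pj."""
    dx, dy = pi[0] - pj[0], pi[1] - pj[1]
    dist = math.hypot(dx, dy)                    # L(i, j)
    f = (l0_ij - dist) / (l0_ij + EPS1) ** alpha  # f(i, j)
    scale = f / (dist + EPS2) ** beta
    return scale * dx, scale * dy

def total_forces(positions, l0, alpha=2.0, beta=0.5):
    """Total force (FX_i, FY_i) on every document; l0 is keyed by (i, j), i < j."""
    totals = []
    for i, pi in enumerate(positions):
        fx = fy = 0.0
        for j, pj in enumerate(positions):
            if i == j:
                continue
            gx, gy = pair_force(pi, pj, l0[(min(i, j), max(i, j))], alpha, beta)
            fx += gx
            fy += gy
        totals.append((fx, fy))
    return totals
```

With two documents placed farther apart than their stable distance, `pair_force` returns a vector pointing from the first document toward the second (attraction). When they are closer than the stable distance it points away (repulsion), and at exactly the stable distance it is zero.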

The position coordinate update unit 28 updates the position coordinates of each document so that the absolute value of the total inter-document force vector calculated by the inter-document force vector calculation unit 26 becomes smaller. An example of the update method used by the position coordinate update unit 28 is as follows. The position coordinates (X_i, Y_i) of each document T_i are updated, based on the inter-document force vector (FX_i, FY_i) calculated by the inter-document force vector calculation unit 26, by the following equations:
X_i′ = X_i − k × FX_i
Y_i′ = Y_i − k × FY_i
Here, (X_i′, Y_i′) represents the updated position coordinates, and k is a movement coefficient, for example a constant not less than 1 × 10⁻²³ and not more than 1 × 10⁻²². These equations mean that each document T_i is moved, in the direction of the inter-document force vector, by a distance proportional to the magnitude of that vector. The updated position coordinates are stored in the position coordinate DB 16, overwriting the position coordinates stored until then. In this embodiment, together with updating the position coordinates, the position coordinate update unit 28 obtains an average value ML of the movement distances of the documents T_i from the following equation:
ML = Σ {(k × FX_i)² + (k × FY_i)²}^0.5
This average value ML is used by the convergence condition determination unit 30, described later, when it determines the convergence condition.
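As a rough illustration of the update equations and of ML, the following Python sketch applies one update step. The function name `update_positions` is hypothetical. ML is computed exactly as the printed formula reads, that is, as a sum of per-document movement distances; if a true mean is intended, the result would additionally be divided by the number of documents, which is an assumption this sketch does not make.

```python
import math


def update_positions(positions, forces, k):
    """Apply X' = X - k*FX and Y' = Y - k*FY to every document.

    positions : list of (X, Y) tuples
    forces    : list of (FX, FY) tuples, same order as positions
    k         : movement coefficient
    Returns (new_positions, ML).
    """
    new_positions = []
    ml = 0.0
    for (x, y), (fx, fy) in zip(positions, forces):
        dx, dy = k * fx, k * fy
        new_positions.append((x - dx, y - dy))
        # {(k*FX)^2 + (k*FY)^2}^0.5: distance moved by this document
        ml += math.hypot(dx, dy)
    return new_positions, ml
```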

The document classification device 1 further includes a convergence condition determination unit 30, a display unit 32 (output means), and an input unit 34. The convergence condition determination unit 30 determines whether the convergence condition is satisfied after the position coordinates have been updated by the position coordinate update unit 28. For example, the convergence condition can be set as the average value ML obtained by the position coordinate update unit 28 being equal to or less than a specified value. When this convergence condition is not satisfied, the convergence condition determination unit 30 causes the inter-document force vector calculation unit 26 to calculate the total inter-document force vector again using the updated position coordinates, and causes the position coordinate update unit 28 to update the position coordinates again using that total inter-document force vector. The position coordinate update unit 28 therefore keeps updating the position coordinates until the convergence condition is satisfied.

After the above convergence condition has been satisfied and the position coordinate update unit 28 has finished updating the position coordinates, the display unit 32 visualizes and displays the relative positional relationship among the documents T_i on the coordinate plane, based on the determined position coordinates. An example of the display method used by the display unit 32 is as follows. FIG. 5 shows an example of a result display screen produced by the display unit 32. In this example, the display area 50 is first divided into m × n cells (here, m = n = 4). The maximum values (X_max, Y_max) and minimum values (X_min, Y_min) of the X and Y coordinates that define the display area are entered through the input unit 34, described later. Instead of entering these values, the largest and smallest X and Y coordinates among the already determined position coordinates of all documents can be used as default values. Next, the display unit 32 obtains the coordinate range corresponding to each cell from these values. The number of documents contained in each cell is then displayed as shown in FIG. 5. In this case, for example, the top-right cell contains one document. Further, in this example, an image of each document contained in a cell is created, and the image is hyperlinked to the cell. As shown in FIG. 5, when the mouse pointer 52 is placed on a cell of interest, the documents contained in that cell can be displayed as a matching document list. Here, patent documents such as published patent applications are assumed as the documents to be classified, and the matching document list shows the type and publication number of each patent document. Because these entries are hyperlinked, clicking, for example, the portion displayed as "JP-A-8-○○○○○○" on the screen gives access to the image of that published patent application so that its contents can be viewed.
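The cell-counting step of this display method can be sketched as follows. This is an illustration under stated assumptions rather than the display unit's actual logic: the function name `count_documents_per_cell` is hypothetical, m is taken as the number of columns and n as the number of rows, row 0 is taken as the lowest Y range, and documents outside the entered display area are simply skipped.

```python
def count_documents_per_cell(positions, m=4, n=4, bounds=None):
    """Count documents in each cell of a grid of m columns by n rows.

    positions : list of (X, Y) tuples
    bounds    : (x_min, x_max, y_min, y_max); when omitted, the extremes
                of the document coordinates are used as default values
    Returns counts[row][col], where row 0 covers the lowest Y range.
    """
    if bounds is None:
        xs = [p[0] for p in positions]
        ys = [p[1] for p in positions]
        bounds = (min(xs), max(xs), min(ys), max(ys))
    x_min, x_max, y_min, y_max = bounds
    width = (x_max - x_min) or 1.0   # avoid dividing by zero
    height = (y_max - y_min) or 1.0
    counts = [[0] * m for _ in range(n)]
    for x, y in positions:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # outside the display area
        # Points on the upper boundary are placed in the last cell.
        col = min(int((x - x_min) / width * m), m - 1)
        row = min(int((y - y_min) / height * n), n - 1)
        counts[row][col] += 1
    return counts
```

Keeping the per-cell membership list alongside the count would be a natural extension for the hover list described above.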

The input unit 34 is used to enter, for example, the display area on the coordinate plane that is to be displayed by the display unit 32; a keyboard, a mouse, or the like is used. In the example of FIG. 5, the values of X_max, Y_max, X_min, and Y_min that define the display area 50 can be entered from the input unit 34. The entered information is passed to the display unit 32.

Next, the operation of the document classification device 1 is described, together with an embodiment of the document classification method according to the present invention. FIG. 6 is a flowchart showing the initial processing and the processing for arranging and moving the initially arranged documents in the two-dimensional display coordinate system. First, the stable inter-document distance calculation unit 22 reads the documents to be classified stored in the classification target document DB 12, calculates the stable inter-document distance between each pair of documents, and stores the calculated stable inter-document distances in the stable inter-document distance DB 14 (S61). Next, the stable inter-document distance calculation unit 22 reads the stable inter-document distances stored in the inter-document distance DB 14 and calculates their average value (S62), then normalizes each stable inter-document distance by dividing it by this average value and updates the data in the stable inter-document distance DB 14 (S63). The arrangement document selection unit 23 selects int√N initially arranged documents T_k, that is, the documents to be classified that are placed first in the display coordinate system, where int√N is the square root of the total number N of documents to be classified with the fractional part discarded (S64). The position coordinate initial value setting unit 24 sets the initial values of the position coordinates of each document using the above average value, and stores the initial values in the position coordinate DB 16 (S65). The inter-document force vector calculation unit 26 then reads the stable inter-document distances stored in the inter-document distance DB 14 and the position coordinates stored in the position coordinate DB 16, uses these values to calculate the total inter-document force vector acting on each document, and stores the calculated total inter-document force vectors in the inter-document force vector DB 18 (S66). After that, the position coordinate update unit 28 reads the total inter-document force vectors stored in the inter-document force vector DB 18, updates the position coordinates of each document based on those vectors, and stores the updated position coordinates in the position coordinate DB 16 (S67). Once the position coordinates have been updated, the convergence condition determination unit 30 determines the convergence condition; if the condition is not satisfied, the above steps (S66 to S67) are executed again. If the condition is satisfied, the process moves on to adding new documents to be classified.
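Steps S62 to S64 involve two small calculations that may be easier to follow in code. The Python sketch below shows the normalization by the mean and the int√N count; the function names and the dictionary layout for the distances are hypothetical choices made for illustration, and how the initially arranged documents themselves are chosen is left out.

```python
import math


def normalize_by_mean(distances):
    """Divide every stable inter-document distance by the mean (S62, S63).

    distances : dict mapping a document pair (i, j) to L0(i, j)
    """
    mean = sum(distances.values()) / len(distances)
    return {pair: d / mean for pair, d in distances.items()}


def initial_document_count(n_documents):
    """Number of initially arranged documents, int(sqrt(N)) (S64)."""
    return math.isqrt(n_documents)
```

After normalization the mean of the stored distances is 1, so the layout scale no longer depends on how the raw similarities were scaled.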

Next, the effects of this embodiment are described. When the document classification device 1 determines the position coordinates of the documents to be classified on the coordinate plane, it assumes the following: if the distance between two documents on the coordinate plane is larger than the stable inter-document distance calculated by the stable inter-document distance calculation unit 22, an attractive force proportional to the difference acts between the two documents as the inter-document force; conversely, if the distance on the coordinate plane is smaller than the stable inter-document distance, a repulsive force proportional to the difference acts between them. Under this assumption, the inter-document force vector calculation unit 26 calculates the vector sum of the inter-document forces that each document receives from the other documents. Because the inter-document force between two documents grows as their distance on the coordinate plane departs from the inter-document distance calculated by the stable inter-document distance calculation unit 22, it is desirable to determine the position coordinates of each document so that the inter-document force becomes as small as possible. The position coordinates of each document are therefore updated by the position coordinate update unit 28 so that the absolute value of the inter-document force vector becomes smaller. This update is executed once or several times until a predetermined convergence condition is satisfied. When the update is executed several times, before each update from the second one onward the inter-document force vector calculation unit 26 recalculates the inter-document force vector based on the position coordinates of each document at that time, that is, the position coordinates after the previous update. In this way, the distance between each pair of documents on the coordinate plane can be brought close to the inter-document distance calculated by the stable inter-document distance calculation unit 22 while consistency is maintained across all documents to be classified. According to the document classification device 1 and the document classification method of this embodiment, the position coordinates of each document can thus be determined based on the inter-document distance calculated for each pair of documents, so the positional relationships among the documents to be classified can be known in detail.

The document classification device 1 also includes the display unit 32. By viewing the display on the display unit 32, the user can easily grasp the relative positional relationship among the documents. The document classification device 1 may also be configured without the display unit 32. In that case, for example, an output unit that outputs the classification result may be provided instead of the display unit 32, and the output may be shown on an external display or printed by an external printer.

The document classification device 1 further includes the input unit 34, through which the maximum and minimum values of the X and Y coordinates that define the range shown in the display area 50 (see FIG. 5) can be entered. The user can thereby display a desired range of the coordinate plane and examine in detail the positional relationships among the documents in that range.

It is preferable that the position coordinate update unit 28 continue updating the position coordinates until the sum, taken over all documents to be classified, of the absolute values of the inter-document force vectors acting on the individual documents reaches a local minimum. In this case, the position coordinates of each document can be determined while maintaining particularly high consistency among all documents to be classified.

FIG. 7 is a block diagram showing an example of the configuration of the stable inter-document distance calculation unit 22 of FIG. 1. The stable inter-document distance calculation unit 22 includes a word extraction unit 70 that extracts words from various documents, and various databases 80 that store the words extracted by the word extraction unit 70.

The word extraction unit 70 includes a keyword extraction unit 71 that extracts words from a key document as keywords, a reference word extraction unit 72 that extracts words from reference documents as reference words, and a search word extraction unit 73 that extracts words from a search document as search words. The distinction between "key document" and "search document" is one of convenience: in the stable inter-document distance calculation unit 22, one of the two documents whose inter-document distance is to be obtained is treated as the key document and the other as the search document. A reference document is a document that is consulted when setting the keyword evaluation value, that is, a value representing the degree to which each keyword is specific to the key document. As the reference documents, for example, all documents in the classification target document DB 12 (see FIG. 1), or a subset of the documents in the classification target document DB 12 extracted at random in advance, can be used. The reference documents can be supplied to the stable inter-document distance calculation unit 22 as needed through suitable input means. The stable inter-document distance calculation unit 22 also includes storage means (not shown) for storing the reference documents.

For Japanese text, each of the extraction units 71 to 73 has a function of extracting the words in a document either by using hiragana, punctuation marks, special symbols, and spaces as delimiters or by using a morphological analysis tool or the like. So that the same word is not extracted more than once from a single document, each of the extraction units 71 to 73 compares every word cut out of a document with the words already cut out of that document and extracts only the words that do not match. For languages written alphabetically, such as English, each of the extraction units 71 to 73 has a function of extracting the words in a document either by using special symbols and/or spaces as delimiters or by using a morphological analysis tool or the like.

The database (DB) 80 includes a keyword DB 81, an all-word DB 82, an evaluation value DB 83, a search word DB 84, and a similarity DB 85. The keyword DB 81 stores the keywords extracted from the key document. Each keyword is stored in association with a key document code that identifies the key document from which it was extracted. The all-word DB 82 stores the keywords extracted from the key document and the reference words extracted from the reference documents. The keywords and reference words are stored in association with, respectively, the key document code that identifies the source key document and the reference document code that identifies the source reference document. The evaluation value DB 83 stores the evaluation values calculated by a keyword evaluation value calculation unit 91, described later. The search word DB 84 stores the search words extracted from the search document. Each search word is stored in association with a search document code that identifies the search document from which it was extracted. The similarity DB 85 stores the similarities calculated by a similarity calculation unit 92, described later.

The keywords, reference words, and search words may each be extracted from the whole of the document concerned or from only part of it. For example, if the document is a patent document, the extraction range may be limited to the bibliographic data, the abstract, the claims, the examples, or the like. Narrowing the extraction range to part of the document is particularly effective when the amount of data is limited. The balance between so-called noise and omissions can also be adjusted by varying the extraction range for each kind of word as appropriate, for example by extracting the reference words from part of the reference documents while extracting the keywords and search words from the whole of the key document and the search document, respectively.

The stable inter-document distance calculation unit 22 also includes a keyword evaluation value calculation unit 91, a similarity calculation unit 92, and an inter-document distance calculation unit 93. The keyword evaluation value calculation unit 91 has a function of calculating the rate at which a given keyword appears across all documents, that is, the key document together with the reference documents. When there are N reference documents and the keyword is present in B of them, the keyword appearance rate across all documents is calculated as B / (1 + N). The keyword evaluation value calculation unit 91 searches the keywords and reference words stored in the all-word DB 82 and counts how many instances of the same keyword, and of reference words identical to the keyword, exist. Because "reference word" is simply a convenient name for a word extracted from a reference document, a "reference word identical to the keyword" means the keyword as it occurs in a reference document. The keyword appearance rate across all documents is obtained by dividing this count by the number of all documents. The keyword evaluation value calculation unit 91 further has a function of calculating the keyword evaluation value as the reciprocal of the keyword appearance rate. The keyword evaluation value is thus calculated as (1 + N) / B and indicates the degree to which each keyword is specific to the key document.
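The evaluation value (1 + N) / B can be sketched in Python as follows. The function name and the use of word sets are hypothetical. One point is an assumption: B is counted here over the key document plus the reference documents, following the counting procedure that tallies both the keyword itself and the matching reference words, so B is never zero.

```python
def keyword_evaluation_values(key_words, reference_word_sets):
    """Return {keyword: (1 + N) / B} for every keyword of the key document.

    key_words           : set of words extracted from the key document
    reference_word_sets : list of word sets, one per reference document
                          (N is the length of this list)
    """
    total_docs = 1 + len(reference_word_sets)  # key document + N references
    values = {}
    for word in key_words:
        # Assumption: B includes the key document itself, hence the leading 1.
        b = 1 + sum(1 for ref in reference_word_sets if word in ref)
        values[word] = total_docs / b
    return values
```

A word found only in the key document receives the highest value, 1 + N, while a word found in every document receives 1, which matches the stated intent that the value expresses how specific a keyword is to the key document.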

The similarity calculation unit 92 has a function of calculating a similarity, which represents the degree to which the key document and the search document resemble each other, by adding up the evaluation values of all keywords contained in the search document and dividing the total by the number of keywords contained in that search document. The similarity calculation unit 92 also stores the calculated similarity in the similarity DB 85.
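Read literally, the similarity is the mean evaluation value of the keywords that also occur in the search document. A minimal Python sketch under that reading follows; the function name is hypothetical, and returning 0.0 when no keyword occurs in the search document is an assumption, since the text does not cover that case.

```python
def similarity(evaluation_values, search_words):
    """Similarity of a search document to a key document.

    evaluation_values : dict {keyword: evaluation value} for the key document
    search_words      : set of words extracted from the search document
    """
    matched = [w for w in evaluation_values if w in search_words]
    if not matched:
        return 0.0  # assumption: no shared keyword means zero similarity
    return sum(evaluation_values[w] for w in matched) / len(matched)
```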

The inter-document distance calculation unit 93 has a function of calculating the stable inter-document distance L₀(i, j) between document T_i and document T_j using the similarities stored in the similarity DB 85. The stable inter-document distance L₀(i, j) is obtained from the following equation:
L₀(i, j) = 2 / (S_ij + S_ji)
Here, S_ij represents the similarity when document T_i is the key document and document T_j is the search document, and S_ji represents the similarity when document T_j is the key document and document T_i is the search document. In other words, the equation takes the average of the two similarities calculated with the key document and search document roles swapped between T_i and T_j, and then takes the reciprocal of that average. The average of the two similarities is used because S_ij and S_ji do not necessarily coincide. The stable inter-document distance L₀(i, j) calculated in this way becomes smaller as the similarity between the two documents becomes higher, and larger as the similarity becomes lower.
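The distance formula can be applied to a whole similarity table as in the Python sketch below. The function name and the nested-list layout `sim[i][j]` for S_ij are hypothetical, and mapping a pair with zero total similarity to an infinite distance is an assumption made only to keep the sketch from dividing by zero.

```python
def stable_distances(sim):
    """Return {(i, j): L0(i, j)} for every pair with i < j.

    sim : square nested list where sim[i][j] is S_ij, the similarity with
          document i as key document and document j as search document
    """
    n = len(sim)
    distances = {}
    for i in range(n):
        for j in range(i + 1, n):
            total = sim[i][j] + sim[j][i]
            # L0(i, j) = 2 / (S_ij + S_ji): reciprocal of the mean similarity
            distances[(i, j)] = 2.0 / total if total > 0 else float("inf")
    return distances
```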

FIG. 8 is a flowchart showing the process of calculating the stable inter-document distance (a subroutine of the stable inter-document distance calculation step (S61) in FIG. 6). First, the keyword extraction unit 71 extracts keywords from the key document and stores the extracted keywords in the keyword DB 81 (S81). The reference word extraction unit 72 extracts reference words from the reference documents and stores the extracted reference words in the all-word DB 82 (S82). The all-word DB 82 also stores the keywords extracted by the keyword extraction unit 71. Next, the keyword evaluation value calculation unit 91 reads the keywords and reference words stored in the all-word DB 82, calculates the evaluation value of each keyword, and stores the evaluation values in the evaluation value DB 83 (S83). The inter-document distance calculation unit 93 extracts search words from the search document and stores the extracted search words in the search word DB 84 (S84). Next, the similarity calculation unit 92 reads the evaluation values stored in the evaluation value DB 83 and the search words stored in the search word DB 84, calculates the similarity between the key document and the search document, and stores the similarity in the similarity DB 85 (S85). Finally, the inter-document distance calculation unit 93 reads the similarities stored in the similarity DB 85 and calculates the stable inter-document distance between each pair of documents (S86).

According to the stable inter-document distance calculation unit 22 of this embodiment, the stable inter-document distance between two documents is calculated using evaluation values that indicate the degree to which each keyword is specific to the key document, so a stable inter-document distance that accurately reflects how similar the two documents are can be obtained. The stable inter-document distance is not limited to one calculated by the method of this embodiment; it may, for example, be calculated as a distance between vectors such as word vectors, as described in Patent Documents 1 and 2. To reflect the degree of similarity between two documents accurately, however, the calculation method of this embodiment is preferable.
As another embodiment of the process for calculating the stable inter-document distance when the present invention is applied to the classification display of patent documents, a method using duplicate cited documents, that is, documents cited by both of two documents, is conceivable. FIG. 19 is a diagram that assists the description of this embodiment. FIG. 20 is a flowchart showing the processing in this embodiment. First, the documents cited by both document T_i and document T_j are detected (S201). For example, as shown in FIG. 19, let US6713520B2 be document T_i and US6433090B1 be document T_j. Document T_i cites three documents (the document itself is also treated as a cited document). Document T_j cites 23 documents (again, the document itself is also treated as a cited document). Of these, only US6433090 (document T_j) is detected as a duplicate cited document. Next, the evaluation value of the duplicate cited document US6433090 is calculated (S202). Specifically, the evaluation value is the reciprocal of the number of documents to be classified that cite the document. For example, assuming that exactly one document to be classified besides document T_i cites US6433090, its evaluation value is 1/2 = 0.5. Next, the similarity between document T_i and document T_j is calculated from the evaluation values of the duplicate cited documents obtained in this way (S203). Specifically, the similarity is the sum of the evaluation values of the duplicate cited documents divided by the sum of the number of documents cited by T_i and the number of documents cited by T_j. In this example, US6433090 is the only duplicate cited document and its evaluation value is 0.5, so the sum of the evaluation values is 0.5. The sum of the number of documents cited by T_i and the number cited by T_j is 3 + 23 = 26. The similarity between document T_i and document T_j is therefore 0.5 / 26 ≈ 0.019. Finally, the stable inter-document distance between document T_i and document T_j is calculated from the similarity (S204). Specifically, the reciprocal of the similarity (26 / 0.5 = 52) is used as the stable inter-document distance. In this embodiment the evaluation value is the reciprocal of the number of other documents to be classified that cite the duplicate cited document; this method is particularly useful when the number of documents to be classified is less than 100 (preferably less than 50). Alternatively, when the number of documents to be classified is less than 2000, the reciprocal of the square root of the number of other documents to be classified that cite the duplicate cited document can be used as the evaluation value. When the number exceeds 100, it is desirable to fix the evaluation value at 1. To shorten the processing time in S201, a table may be prepared in advance that lists, for each document to be classified, its own document number and the document numbers of the other documents to be classified that cite it.
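The worked example (0.5 / 26 and a distance of 52) can be reproduced with the short Python sketch below. The function name and the dictionary `citing_counts` are hypothetical, and only the reciprocal-of-count evaluation value is shown, not the square-root or fixed-value variants.

```python
def citation_similarity(cites_i, cites_j, citing_counts):
    """Similarity of two documents from the documents they both cite.

    cites_i, cites_j : sets of cited document numbers, each including the
                       document's own number as in the worked example
    citing_counts    : dict {document number: how many documents to be
                       classified cite it}
    """
    shared = cites_i & cites_j  # duplicate cited documents (S201)
    # Evaluation value of each shared document: reciprocal of the count (S202)
    score = sum(1.0 / citing_counts[doc] for doc in shared)
    # Divide by the combined number of cited documents (S203)
    return score / (len(cites_i) + len(cites_j))
```

For the example in the text, a three-element set for T_i and a 23-element set for T_j that share only US6433090, with a citing count of 2 for that document, give a similarity of 0.5 / 26; its reciprocal is the stable inter-document distance of 52 (S204).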

FIG. 9 is a flowchart showing the subroutine of the total inter-document force vector calculation step (S66) of FIG. 6. First, the inter-document force vector calculation unit 26 reads the position coordinates of each document stored in the position coordinate DB 16 and calculates from them the distance on the coordinate plane between each pair of documents (the length of the separation vector) (S91). The inter-document force vector calculation unit 26 then reads the stable inter-document distance stored in the inter-document distance DB 14 and calculates the inter-document force from the stable inter-document distance and the distance calculated in step S91 (S92). Next, it calculates the X and Y components of the inter-document force based on the separation vector (S93), and obtains the total inter-document force vector for a given document as the vector sum of the inter-document forces acting on it from the other arranged documents (S94). The flow ends once the total inter-document force vector has been calculated for all arranged documents; if any document remains for which it has not been calculated, steps S91 to S94 are repeated (S95).
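Steps S91 to S94 can be sketched as below. This passage does not restate the force law, so the sketch assumes a linear form in which the magnitude is proportional to the difference between the current distance and the stable inter-document distance (attracting when too far apart, repelling when too close). The function name, the `strength` factor, and the handling of coincident documents are likewise assumptions.

```python
import math


def total_force_vectors(positions, stable_dist, strength=1.0):
    """Total inter-document force vector for every arranged document.

    positions: list of (x, y) tuples, one per arranged document.
    stable_dist: stable_dist[i][j] is the stable inter-document distance
        between documents i and j.
    strength: hypothetical proportionality constant of the assumed linear law.
    """
    totals = []
    for i, (xi, yi) in enumerate(positions):
        fx = fy = 0.0
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi      # separation vector from i to j
            dist = math.hypot(dx, dy)      # S91: length of the separation vector
            if dist == 0.0:
                continue                   # assumption: no direction, so skip
            # S92: positive pulls i toward j, negative pushes i away from j.
            magnitude = strength * (dist - stable_dist[i][j])
            fx += magnitude * dx / dist    # S93: X component
            fy += magnitude * dy / dist    # S93: Y component
        totals.append((fx, fy))            # S94: vector sum over other documents
    return totals
```

For two documents two units apart, a stable distance of 1 yields forces that point toward each other, and a stable distance of 3 yields forces that point apart.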

FIG. 10 is a flowchart showing the subroutine of the position coordinate update step (S67) of FIG. 6. First, the position coordinate update unit 28 reads the inter-document force vectors stored in the inter-document force vector DB 18 and moves each document, that is, changes its position coordinates, according to its vector (S101). The position coordinate update unit 28 then calculates the average moving distance of the documents, which is used to judge the convergence condition (S102). In S68, the convergence condition can be that this average moving distance falls below a threshold. Alternatively, the convergence condition can be that the position coordinate update step has been repeated int(√N) times.
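The update of S101 and S102 and the two alternative convergence conditions can be sketched as follows. The update rule (each document moves by the movement coefficient k times its total force vector) and all function names are assumptions made for illustration; the text only states that each document is moved according to its vector.

```python
import math


def update_positions(positions, forces, k=0.1):
    """S101 and S102: move each document and report the mean moving distance.

    Assumes new position = old position + k * total force vector, where k is
    the movement coefficient.
    """
    new_positions = [
        (x + k * fx, y + k * fy) for (x, y), (fx, fy) in zip(positions, forces)
    ]
    moves = [
        math.hypot(nx - x, ny - y)
        for (x, y), (nx, ny) in zip(positions, new_positions)
    ]
    return new_positions, sum(moves) / len(moves)


def has_converged(mean_move, threshold):
    """First condition of S68: mean moving distance falls below a threshold."""
    return mean_move < threshold


def iteration_limit(n_documents):
    """Alternative condition of S68: stop after int(sqrt(N)) update steps."""
    return math.isqrt(n_documents)
```

A caller would loop over force calculation and `update_positions` until either `has_converged` returns true or the loop counter reaches `iteration_limit(N)`, depending on which condition is chosen.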

FIG. 11 is a flowchart showing the process of adding further arranged documents to the display coordinate system. With reference to FIG. 11, the following describes how, after the arrangement and movement of the initially arranged documents shown in FIG. 6 has finished, additional documents are added in sequence until all classification target documents have been arranged and moved.

The arrangement document selection unit 23 randomly selects int(mm/10) next arranged documents to add to the display coordinate system, where mm is the number of classification target documents already arranged in the display coordinate system (S111). If fewer than int(mm/10) classification target documents remain, all remaining classification target documents become next arranged documents. The number of next arranged documents added immediately after the arrangement and movement of the initially arranged documents is int{(int√N)/10}.
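The selection rule of S111 can be sketched as follows. The function name and the `minimum` parameter are assumptions: the text does not say what happens when int(mm/10) evaluates to 0 (fewer than ten arranged documents), so this sketch falls back to one document per round to guarantee progress.

```python
import random


def select_next_batch(unplaced, placed_count, rng=random, minimum=1):
    """S111: randomly pick int(mm/10) documents from those not yet arranged.

    unplaced: list of identifiers of documents not yet arranged.
    placed_count: mm, the number of documents already arranged.
    minimum: hypothetical floor on the batch size (not in the source text).
    """
    batch_size = max(minimum, placed_count // 10)
    if len(unplaced) <= batch_size:
        return list(unplaced)  # too few remain: take all of them
    return rng.sample(list(unplaced), batch_size)
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which can help when comparing layouts between runs.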

The position coordinate initial value setting unit 24 calculates the position coordinates at which each next arranged document is first placed on the display coordinate system (S112). In this embodiment, a next arranged document is initially placed near the already arranged classification target document to which its stable inter-document distance is shortest. Specifically, the position coordinate initial value setting unit 24 refers to the inter-document distance DB 14, finds the c (c = 1 to mm) that minimizes L_0(c, mm+k), and sets the initial position coordinates of the next arranged document T_k to (X_k, Y_k) = (X_c + ε, Y_c + ε), where ε is a constant. Alternatively, the initial position of a next arranged document may be determined by random numbers, as for the initially arranged documents.
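The nearest-neighbor placement of S112 can be sketched as below. The function name, the dictionary layout of `placed_positions` and `stable_dist`, and the default value of `epsilon` are assumptions chosen for illustration.

```python
def initial_position(new_doc, placed_positions, stable_dist, epsilon=0.01):
    """S112: place a next arranged document beside its closest arranged document.

    new_doc: identifier of the document being added.
    placed_positions: dict mapping each arranged document to its (x, y).
    stable_dist: stable_dist[c][new_doc] is the stable inter-document
        distance L_0 between arranged document c and the new document.
    epsilon: the constant offset added to both coordinates.
    """
    # Find the arranged document c with the smallest stable distance.
    nearest = min(placed_positions, key=lambda c: stable_dist[c][new_doc])
    x_c, y_c = placed_positions[nearest]
    return (x_c + epsilon, y_c + epsilon)
```

Starting next to the most similar arranged document means the new document usually has only a short way to travel before the forces acting on it balance.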

After the position coordinates of a next arranged document have been set to their initial values, the inter-document force vector calculation unit 26 calculates the current total inter-document force vector that this document receives from the other arranged classification target documents (S113). The position coordinate update unit 28 updates the position coordinates of the document based on this total inter-document force vector (S114). The convergence condition determination unit 30 judges that the position coordinates of the document have converged when its current movement amount is at or below a threshold, or when the processing of S113 to S115 has been executed a predetermined number of times (S115). If convergence is not judged, the processing of S113 to S115 is repeated for that document.

The processing of S112 to S115 is performed in turn for the other next arranged documents. When arrangement and movement have been completed for every next arranged document, the position calculation (the processing of S66 to S68) is performed √N times for all arranged documents, including the next arranged documents just added to the display coordinate system (S117). Thus, instead of placing all next arranged documents at once and running the movement process (the process of calculating the total inter-document force vector for each of all or some of the arranged documents and updating the position coordinates accordingly, repeated until convergence of the position coordinates is judged), the next arranged documents are arranged and moved one at a time with the position coordinates of the other arranged documents held fixed, so that the position coordinates of all next arranged documents are provisionally determined, and the arrangement and movement process is then run for all arranged documents including the current next arranged documents. This reduces the number of repetitions of the movement process. The processing of S111 to S117 is then performed for the classification target documents not yet arranged. When no unarranged classification target documents remain, the arrangement and movement process ends at that point.

FIG. 12 is a flowchart showing the subroutine of the result display step (output step) of FIG. 11. First, the display unit 32 divides the display area into m × n cells (S121). The user then enters, through the input unit 34, the maximum and minimum values of the X and Y coordinates that define the display area (S122). Next, the display unit 32 calculates the coordinate range corresponding to each cell from the entered values (S123). The display unit 32 then displays in the display area, for each cell, the number of documents whose position coordinates fall within the coordinate range of that cell (S124). The display unit 32 also creates images of the documents contained in each cell (S125) and hyperlinks the document images to each cell (S126).
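The cell counting of S121 to S124 can be sketched as follows. The function name, the row and column orientation of the returned grid, and the treatment of points lying exactly on the upper boundary (assigned to the last cell) are assumptions made here.

```python
def count_documents_per_cell(positions, x_min, x_max, y_min, y_max, m, n):
    """Count documents per cell in a display area split into m x n cells.

    positions: iterable of (x, y) document coordinates.
    Returns a list of n rows, each holding m column counts.
    """
    counts = [[0] * m for _ in range(n)]
    cell_w = (x_max - x_min) / m   # S123: coordinate range of one cell
    cell_h = (y_max - y_min) / n
    for x, y in positions:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue               # outside the display area: not shown
        col = min(int((x - x_min) / cell_w), m - 1)
        row = min(int((y - y_min) / cell_h), n - 1)
        counts[row][col] += 1      # S124: number of documents in the cell
    return counts
```

Calling the function again with narrower minimum and maximum values and the same m and n splits the chosen region into finer cells.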

FIGS. 13(a) and 13(b) are diagrams for explaining a modification of the result display by the display unit 32 of FIG. 1. The display area 50 shown in these figures corresponds to that of FIG. 5. In this example, designating part of the display area 50 as a new display area redisplays that part over the whole of the display area 50. For example, when the four central cells in FIG. 13(a) (outlined with a thick line) are designated, the designated part is redisplayed over the whole display area 50 as shown in FIG. 13(b). Because the number of cells in the display area 50 does not change, the designated part is divided into finer cells. For example, the cell showing a document count of "5" in FIG. 13(a) corresponds to the four upper-right cells in FIG. 13(b), so the document counts of those four cells sum to 5. The display area can be designated, for example, by entering coordinate values through the input unit 34 of FIG. 1 or by selecting with a mouse on the screen.

FIG. 14 is a diagram for explaining a modification of the result display by the display unit 32 of FIG. 1. In this example, the display unit 32 plots the position coordinates of each document in the display coordinate system (a two-dimensional display coordinate system) in a plot area 54p within the display area 54d. For each plot 56, the title or similar information of the document is shown in a text box 53 so that the corresponding document can be identified. (A "text box" here is a small area of fixed shape and size for displaying attributes of a plotted document, although text boxes of several different shapes or sizes may be set according to the amount of attribute information displayed.) So that the text box 53 does not protrude from the display area 54d, the plot area 54p is reduced inward from the display area 54a by the size of the text box 53 (when text boxes of several shapes or sizes are set, by the longest vertical or horizontal dimension). That is, the region enclosed by the dotted lines at each corner of the display area 54a and the frame of the display area 54a has the same size and shape as the text box 53. Here, published patent applications are assumed as the classification target documents, and the application number is displayed as the document number. A hyperlink is attached to the displayed text box 53 or document number, so that clicking the document number on the screen gives access to the image of that document. In addition to the document number, the text box 53 can display the title of the invention, the applicant, and keywords extracted from the abstract, which makes the classified content easier to grasp.
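The relation between the display area and the plot area can be sketched as below. The function name and the rectangle representation (x_min, y_min, x_max, y_max) are assumptions; the sketch insets every side by the full text box width or height, following the description of the corner regions.

```python
def plot_area(display_area, box_width, box_height):
    """Shrink the display area inward by the text box size on every side.

    display_area: (x_min, y_min, x_max, y_max) of the display area.
    box_width, box_height: size of the text box; with several text box
        sizes, pass the longest horizontal and vertical dimensions.
    """
    x_min, y_min, x_max, y_max = display_area
    inner = (
        x_min + box_width,
        y_min + box_height,
        x_max - box_width,
        y_max - box_height,
    )
    if inner[0] >= inner[2] or inner[1] >= inner[3]:
        raise ValueError("text box is too large for this display area")
    return inner
```

Any text box anchored at a plot inside the returned rectangle then stays within the frame of the display area, whichever side of the plot it is drawn on.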

FIG. 15 shows a flowchart for the modification of FIG. 14. First, the user enters the maximum and minimum values of the X and Y coordinates that define the display area 54a (S151). This input can be made through the input unit 34 of FIG. 1. The display unit 32 sets the plot area 54p (S152). The display unit 32 plots the position coordinates of each document lying within the plot area 54p and displays a text box 53 in association with each plot 56 (S153). The display unit 32 also hyperlinks a document image to each plot (S154). In the display example of FIG. 14, a partial region of the display can be designated and enlarged as a new display area, or a point in the display area, for example a text box at the center of the area, can be designated with the mouse pointer and the display enlarged or reduced around it. Copying the contents of one or more designated text boxes onto a worksheet in spreadsheet software or the like further eases the work that follows classification.

FIG. 16 is a diagram for explaining a modification of the result display by the display unit 32 of FIG. 1. In this example, the display unit 32 takes the position coordinates of a reference document designated by the user as the base point and displays, as matching documents, the documents whose position coordinates lie within a display radius of it on the coordinate plane. The reference document and the display radius can be designated through the input unit 34 of FIG. 1. The reference document is chosen from the classification target documents stored in the classification target document DB 12. In FIG. 16, the matching document list is displayed on the right side of the display screen 57, and its entries are hyperlinked. Here, four documents lie within the display radius of the reference document. The list is sorted in ascending order of distance from the reference document. In this example, a display area 58 is also provided on the left side of the display screen 57. In the display area 58, the position coordinates of each document are plotted around the reference document, together with a circle 59 centered on the reference document whose radius is the display radius. The circle 59 can serve as a guide when redesignating the display radius. The number attached to each plot corresponds to the number in the matching document list. This example makes it possible to search for documents similar to the reference document. Displaying the reference document highlighted, with a different color or typeface, makes it easier to grasp the positional relationship between the reference document and the documents similar to it.

FIG. 17 shows a flowchart for the modification of FIG. 16. First, the user enters the reference document and the display radius through the input unit 34 of FIG. 1 (S171, S172). The display unit 32 then reads the position coordinates of each document stored in the position coordinate DB 16 and displays the documents within the display radius of the reference document as the matching document list (S173). The display unit 32 also hyperlinks the images of the documents shown in the matching document list (S174). The user follows the hyperlinks as needed to access the displayed document images and check their contents (S175). To search again with a redesignated display radius, steps S172 to S175 are repeated; otherwise the flow ends (S176).
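The search of S173 can be sketched as follows. The function name and the dictionary layout are assumptions, as is the choice to leave the reference document itself out of the returned list.

```python
import math


def documents_within_radius(reference, positions, radius):
    """S173: list documents within the display radius, nearest first.

    reference: identifier of the reference document.
    positions: dict mapping each document identifier to its (x, y).
    radius: the display radius entered by the user.
    """
    rx, ry = positions[reference]
    hits = []
    for doc, (x, y) in positions.items():
        if doc == reference:
            continue  # assumption: the reference document is not listed
        distance = math.hypot(x - rx, y - ry)
        if distance <= radius:
            hits.append((distance, doc))
    hits.sort(key=lambda pair: pair[0])  # ascending distance from reference
    return [doc for _, doc in hits]
```

Re-running the function with a different `radius` corresponds to repeating steps S172 to S175.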

Finally, the hardware configuration of the document classification device 1 is described with reference to FIG. 18, which is a block diagram showing the hardware configuration of the document classification device 1 of FIG. 1. As shown in FIG. 18, the document classification device 1 physically comprises a control device 1a, a memory 1b, a storage device 1c, an input device 1d, and a display device 1e. These devices are electrically connected via a bus 1f so that they can exchange various signals with one another.

Specifically, the control device 1a is, for example, a CPU (Central Processing Unit), and the memory 1b is a volatile semiconductor memory such as RAM (Random Access Memory). The storage device 1c is a non-volatile magnetic disk such as an HDD (Hard Disc Drive). The input device 1d is, for example, a keyboard or a mouse, and the display device 1e is an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube) display.

The correspondence between the hardware configuration and the functional configuration is as follows. In the document classification device 1, the functions of the database 10 are realized by the storage device 1c as a physical component. The functions of the stable inter-document distance calculation unit 22, the position coordinate initial value setting unit 24, the inter-document force vector calculation unit 26, the position coordinate update unit 28, and the convergence condition determination unit 30 are realized by the control device 1a executing a predetermined program. The functions of the input unit 34 are realized by the input device 1d. The functions of the display unit 32 are realized by the control device 1a and the display device 1e: the control device 1a performs predetermined calculations to determine the display content of the classification result, and the display device 1e displays the classification result according to that content.

The document classification device and document classification method according to the present invention are not limited to the above embodiment, and various modifications are possible. For example, although a configuration has been described in which the position coordinates of each document are determined on a two-dimensional coordinate plane, the coordinate plane may be one-dimensional. In that case each document has its position coordinate on a single straight line, which is still referred to as a one-dimensional "coordinate plane" for convenience. The position coordinates of each document may also be determined in three or more dimensions.

Although the convergence condition has been described as the average moving distance of the documents being at or below a specified value, the convergence condition is not limited to this. For example, the convergence condition may be that the maximum moving distance of the documents is at or below a specified value.

The movement coefficient k used when updating the position coordinates need not be constant. After convergence has progressed to some extent, the movement coefficient k may be increased or decreased according to whether the average moving distance of the documents has risen or fallen, in order to speed up convergence. For example, if the average moving distance is larger than after the previous update, k' = k × 0.01 (k': the adjusted movement coefficient), and if it is smaller, k' = k × 1.03.
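The adjustment rule can be sketched as follows. The function and parameter names are assumptions, and so is leaving k unchanged when the average moving distance has not changed, a case the text does not address.

```python
def adjust_movement_coefficient(k, mean_move, previous_mean_move,
                                shrink=0.01, grow=1.03):
    """Return the adjusted movement coefficient k'.

    mean_move: average moving distance after the current update.
    previous_mean_move: average moving distance after the previous update.
    shrink, grow: the example factors 0.01 and 1.03 given in the text.
    """
    if mean_move > previous_mean_move:
        return k * shrink   # documents moved further: damp the step strongly
    if mean_move < previous_mean_move:
        return k * grow     # documents are settling: enlarge the step slightly
    return k                # assumption: unchanged when the averages are equal
```

The asymmetry between the two factors means a single overshoot cuts the step size sharply, while recovery takes many small increases.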

The relationship of similarity among a plurality of patent documents can be shown in a visually recognizable manner.

FIG. 1 is a block diagram showing an embodiment of the document classification device according to the present invention.
FIG. 2 is a configuration diagram showing an example of the database of the inter-document distance DB 14.
FIG. 3 is a configuration diagram showing an example of the database of the position coordinate DB 16.
FIG. 4 is a configuration diagram showing an example of the database of the inter-document force vector DB 18.
FIG. 5 is a diagram showing an example of the result display screen of the display unit 32.
FIG. 6 is a flowchart showing the initial processing and the process of arranging and moving the initially arranged documents in the two-dimensional display coordinate system.
FIG. 7 is a block diagram showing an example of the configuration of the stable inter-document distance calculation unit 22 of FIG. 1.
FIG. 8 is a flowchart showing the operation of the stable inter-document distance calculation unit 22 of FIG. 7.
FIG. 9 is a flowchart showing the subroutine of the total inter-document force vector calculation step (S66) of FIG. 6.
FIG. 10 is a flowchart showing the subroutine of the position coordinate update step (S67) of FIG. 6.
FIG. 11 is a flowchart showing the process of adding further arranged documents to the display coordinate system.
FIG. 12 is a flowchart showing the subroutine of the result display step of FIG. 6.
FIGS. 13(a) and 13(b) are diagrams for explaining a modification of the result display by the display unit 32 of FIG. 1.
FIG. 14 is a diagram for explaining a modification of the result display by the display unit 32 of FIG. 1.
FIG. 15 shows a flowchart for the modification of FIG. 14.
FIG. 16 is a diagram for explaining a modification of the result display by the display unit 32 of FIG. 1.
FIG. 17 shows a flowchart for the modification of FIG. 16.
FIG. 18 is a block diagram showing the hardware configuration of the document classification device 1 of FIG. 1.
FIG. 19 is a diagram that assists the description of the embodiment in which the stable inter-document distance is calculated using duplicate citations.
FIG. 20 is a flowchart showing the processing in the embodiment in which the stable inter-document distance is calculated using duplicate citations.

Explanation of symbols

1: document classification device, 10: database, 12: classification target document DB, 14: inter-document distance DB, 16: position coordinate DB, 18: inter-document force vector DB, 22: stable inter-document distance calculation unit, 23: arrangement document selection unit, 24: position coordinate initial value setting unit, 26: inter-document force vector calculation unit, 28: position coordinate update unit, 30: convergence condition determination unit, 32: display unit, 34: input unit, 53: text box, 54a: display area, 54p: plot area, 56: plot.

Claims

A document classification method for classifying a plurality of documents to be classified according to their contents,
A first step of calculating a stable inter-document distance between each two documents of the classification target document according to the degree of similarity between both documents;
A second step of selecting an initially arranged document from the classification target documents;
A third step of calculating position coordinates where each of the initial placement documents is initially placed on a display coordinate system;
A fourth step of calculating, for each pair of the arranged initially arranged documents, a separation vector from one document to the other on the display coordinate system at the current processing time, and calculating an inter-document force vector that each arranged initially arranged document receives from another initially arranged document, based on the difference between the length of the separation vector and the stable inter-document distance and on the direction of the separation vector;
A fifth step of calculating a total inter-document force vector by summing up inter-document force vectors received from each other initial-arranged document for each of the initially arranged documents;
A sixth step of calculating the position coordinates at the next processing time according to the total inter-document force vector for each of the arranged initial arranged documents;
A seventh step of determining convergence of the position coordinates of the initially arranged document during execution of the repetition process of the fourth to sixth steps and terminating the repetition process;
An eighth step of selecting a next layout document to be newly incorporated into the display coordinate system from the classification target document;
A ninth step of calculating a position coordinate at which each next placement document is initially placed in the display coordinate system;
A tenth step of calculating a total document inter-force vector by summing up inter-document force vectors received from existing arranged documents for the newly arranged next-arranged document;
An eleventh step of calculating the position coordinates at the next processing time in accordance with the total inter-document force vector for the newly placed next-placed document;
A document classification method comprising: a twelfth step of determining convergence of position coordinates of the next arranged document during execution of the repetition processing of the tenth and eleventh steps and ending the repetition processing.

The document classification method according to claim 1, wherein, in the fourth step,
the length of the inter-document force vector depends on the absolute value of the difference between the length of the separation vector and the stable inter-document distance, and
the direction of the inter-document force vector is the same as or opposite to the direction of the separation vector, being the direction in which the document is drawn toward the other initially arranged document when the length of the separation vector is larger than the stable inter-document distance, and the direction in which the document is repelled from the other initially arranged document when the length of the separation vector is smaller than the stable inter-document distance.

In the seventh step,
When the number of classification target documents is N,
The document classification method according to claim 1, wherein convergence is determined when the fourth to sixth steps are repeated N times.

The document classification method according to claim 1, wherein the eighth to twelfth steps are repeated for the next arranged document.

The document classification method according to claim 4, wherein, when the number of classification target documents is N, the number of the initially arranged documents is not less than √N and not more than (√N + 100), and,
when the number of classification target documents already arranged in the display coordinate system is N_k, the number of next arranged documents is not less than 0.01 times and not more than 1 times N_k.

Steps 10 to 12 are sequentially performed on the next arranged document one by one.
In the twelfth step,
The document classification method according to claim 1, wherein convergence is determined when the total inter-document force vector calculated in the eleventh step is less than or equal to a threshold value for the next arranged document being processed.

In the ninth step,
The document classification method according to claim 6, wherein the position coordinates at which each of the next arranged documents is initially arranged in the display coordinate system are determined by random numbers.

In the ninth step,
The document classification method according to claim 6, wherein the position coordinates at which each next arranged document is initially arranged in the display coordinate system are set in the neighborhood of the previously arranged document having the smallest stable inter-document distance.

The document classification method according to claim 1, wherein the tenth to twelfth steps are performed sequentially on the next arranged documents one at a time, and then the repetition process of the fourth to seventh steps is executed a predetermined number of times for all the arranged documents including the next arranged documents.

The document classification method according to claim 1, wherein the initial arrangement document is randomly selected from the classification target documents.

The document classification method according to claim 1, further comprising a step of plotting a mark at the converged position coordinates of each classification target document on the display coordinate system in a plot area within the display area of the display means, and displaying, in the display area, a text box containing a label of each plotted classification target document in association with its mark, wherein the plot area is reduced inward from the frame of the display area by the size of the text box.
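
The reduced plot area can be sketched as below. The sketch assumes the inset equals the full text box width on the left and right and the full text box height on the top and bottom, and that layout coordinates are scaled linearly into the plot area; both are illustrative choices, and the function names are hypothetical.

```python
def plot_area(display_w, display_h, box_w, box_h):
    """Return (left, top, right, bottom) of the plot area inside the display area."""
    # Inset every edge by the text box size so that a label attached to a
    # mark near the border still fits inside the display area.
    left, top = box_w, box_h
    right, bottom = display_w - box_w, display_h - box_h
    if right <= left or bottom <= top:
        raise ValueError("display area too small for the text box")
    return left, top, right, bottom

def to_screen(points, area):
    """Linearly map layout coordinates into the plot area."""
    left, top, right, bottom = area
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1.0  # avoid division by zero
    span_y = (max(ys) - min_y) or 1.0
    return [(left + (x - min_x) / span_x * (right - left),
             top + (y - min_y) / span_y * (bottom - top)) for x, y in points]
```

For an 800 by 600 display area and a 100 by 20 text box, marks are confined to the rectangle from (100, 20) to (700, 580).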

A document classification method for classifying a plurality of classification target documents according to their contents, comprising:
A first step of calculating a stable inter-document distance between each two of the classification target documents according to the degree of similarity between the two documents;
A second step of selecting initially arranged documents from the classification target documents;
A third step of calculating position coordinates at which each of the initially arranged documents is initially arranged on a display coordinate system;
A fourth step of calculating, between each two of the arranged initially arranged documents, a separation vector from one document to the other on the display coordinate system at the current processing time, and calculating, for each arranged initially arranged document, an inter-document force vector received from another initially arranged document based on the difference between the length of the separation vector and the stable inter-document distance and on the direction of the separation vector;
A fifth step of calculating, for each of the arranged initially arranged documents, a total inter-document force vector by summing the inter-document force vectors received from each of the other initially arranged documents;
A sixth step of calculating, for each of the arranged initially arranged documents, the position coordinates at the next processing time according to the total inter-document force vector;
A seventh step of determining convergence of the position coordinates of the initially arranged documents during execution of the repetition process of the fourth to sixth steps and terminating the repetition process;
An eighth step of selecting, from the classification target documents, a plurality of next arranged documents to be newly incorporated into the display coordinate system;
A ninth step of calculating, for one of the next arranged documents, the position coordinates at which it is initially arranged in the display coordinate system, and arranging it there;
A tenth step of calculating, for the next arranged document newly arranged in the ninth step, a total inter-document force vector by summing the inter-document force vectors received from the already arranged documents;
An eleventh step of calculating, for the next arranged document newly arranged in the ninth step, the position coordinates at the next processing time according to the total inter-document force vector;
A twelfth step of determining convergence of the position coordinates of the next arranged document arranged in the ninth step during execution of the repetition process of the tenth and eleventh steps, terminating the repetition process, and taking the position at that time as the provisionally determined position of that next arranged document;
A thirteenth step of executing the repetition process of the fourth to seventh steps for all the arranged documents after the ninth to twelfth steps have been executed for all the next arranged documents selected in the eighth step; and
A fourteenth step of executing the eighth to thirteenth steps for the remaining classification target documents.
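
The fourteen steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions rather than the claimed implementation: the stable distances are supplied as a ready-made matrix `dist` (the first step), the force law is linear, the step size is divided by the number of interacting documents for numerical stability, initial documents and positions are chosen at random, and convergence in the seventh step is approximated by a fixed iteration count. All function and parameter names are hypothetical.

```python
import math
import random

def force(p, q, d):
    """Force on a document at p from one at q with stable distance d (assumed linear)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return 0.0, 0.0
    m = (length - d) / length
    return m * dx, m * dy

def total_force(i, pos, dist, among):
    """Steps 5 and 10: sum the forces on document i from the documents in `among`."""
    fx = fy = 0.0
    for j in among:
        if j != i:
            gx, gy = force(pos[i], pos[j], dist[i][j])
            fx, fy = fx + gx, fy + gy
    return fx, fy

def relax_all(pos, dist, docs, iterations, step):
    """Steps 4 to 7: move every arranged document together for a fixed number of rounds."""
    s = step / max(1, len(docs) - 1)  # assumption: scale the step for stability
    for _ in range(iterations):
        forces = {i: total_force(i, pos, dist, docs) for i in docs}
        for i in docs:
            pos[i] = (pos[i][0] + s * forces[i][0], pos[i][1] + s * forces[i][1])

def relax_one(i, pos, dist, placed, threshold, step, max_iter=1000):
    """Steps 10 to 12: move only document i until its total force is small."""
    s = step / max(1, len(placed))
    for _ in range(max_iter):
        fx, fy = total_force(i, pos, dist, placed)
        if math.hypot(fx, fy) <= threshold:
            break
        pos[i] = (pos[i][0] + s * fx, pos[i][1] + s * fy)

def layout(dist, n_initial, batch, iterations=200, threshold=1e-3, step=0.5, seed=0):
    rng = random.Random(seed)
    order = list(range(len(dist)))
    rng.shuffle(order)                                         # step 2 (random choice assumed)
    placed = order[:n_initial]
    pos = {i: (rng.random(), rng.random()) for i in placed}    # step 3
    relax_all(pos, dist, placed, iterations, step)             # steps 4 to 7
    rest = order[n_initial:]
    while rest:                                                # step 14
        new, rest = rest[:batch], rest[batch:]                 # step 8
        for i in new:
            pos[i] = (rng.random(), rng.random())              # step 9
            relax_one(i, pos, dist, placed, threshold, step)   # steps 10 to 12
            placed.append(i)
        relax_all(pos, dist, placed, iterations, step)         # step 13
    return pos
```

Moving only the new document in steps 10 to 12 costs time proportional to the number of arranged documents per iteration, whereas the joint relaxation in step 13 is quadratic; adding documents in batches limits how often the expensive joint pass runs.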

The document classification method according to any one of claims 1 to 12, wherein each classification target document has a document number identifying that document, and the degree of similarity between two documents in the first step is calculated using the document numbers of the classification target documents cited by both documents.

The document classification method according to claim 13, wherein the value representing the degree of similarity between two documents in the first step is obtained by dividing the number of classification target documents cited by both documents by the sum of the numbers of citations in each of the two documents.
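
The ratio described in this claim can be written as a short Python function. The function name, the use of sets of document numbers, and the handling of documents without citations are assumptions.

```python
def citation_similarity(cited_a, cited_b):
    """Number of shared citations divided by the sum of the two citation counts."""
    a, b = set(cited_a), set(cited_b)
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # assumption: documents without citations count as dissimilar
    return len(a & b) / total
```

Under this definition two documents with identical citation lists score 0.5, and documents with no citation in common score 0.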

The document classification method according to claim 13, wherein the reciprocal of the number of classification target documents that cite a given classification target document is used as the evaluation value of that document, and the value representing the degree of similarity between two documents in the first step is obtained by dividing the sum of the evaluation values of the classification target documents cited by both documents by the sum of the numbers of citations in each of the two documents.
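
A minimal sketch of this weighted variant follows. It assumes `citations` maps each document number to the document numbers it cites; the function names and the zero-citation fallback are hypothetical.

```python
def evaluation_values(citations):
    """Reciprocal of the number of documents that cite each cited document."""
    counts = {}
    for cited in citations.values():
        for doc in set(cited):
            counts[doc] = counts.get(doc, 0) + 1
    return {doc: 1.0 / c for doc, c in counts.items()}

def weighted_similarity(a, b, citations, weights):
    """Sum of evaluation values of shared citations over the sum of citation counts."""
    cited_a, cited_b = set(citations[a]), set(citations[b])
    total = len(cited_a) + len(cited_b)
    if total == 0:
        return 0.0  # assumption: no citations means no measurable similarity
    return sum(weights[d] for d in cited_a & cited_b) / total
```

A document cited by many others receives a small evaluation value, so sharing a widely cited reference contributes less to the similarity than sharing a rarely cited one.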

The document classification method according to claim 13, wherein each document itself is regarded as one of the classification target documents cited by that document.

The document classification method according to any one of claims 13 to 15, wherein the classification target documents are patent documents, and a classification target document cited in a foreign patent document relating to a foreign patent application that claims priority based on the own document is regarded as one of the classification target documents cited by the own document.
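
The two ways of extending a document's citation set described in the preceding claims (counting the document itself, and counting citations made by its foreign family members) can be sketched together. Combining both in one helper is a choice made here for brevity; the function name and the `family_citations` mapping are hypothetical.

```python
def effective_citations(doc, citations, targets, family_citations=None):
    """Citation set of `doc` restricted to the classification target documents."""
    cited = set(citations.get(doc, ()))
    cited.add(doc)  # the document is treated as citing itself
    if family_citations is not None:
        # Citations found in foreign patent documents claiming priority from `doc`.
        cited |= set(family_citations.get(doc, ()))
    # Only classification target documents are counted.
    return cited & set(targets)
```

Including the document itself means that a document directly citing another shares at least one entry with it, so a direct citation relationship raises the similarity.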

The document classification method, wherein the display means displays only the classification target documents located within a threshold distance, on the display coordinate system, from a specified classification target document.
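
The distance filter of this claim can be sketched as below, assuming `positions` maps each document label to its converged coordinates; the function name is hypothetical.

```python
import math

def within_threshold(positions, selected, threshold):
    """Documents no farther than `threshold` from `selected` on the layout."""
    sx, sy = positions[selected]
    return [doc for doc, (x, y) in positions.items()
            if math.hypot(x - sx, y - sy) <= threshold]
```

The specified document itself is at distance zero and therefore always remains in the displayed set.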

A document classification program for causing a computer system to execute the steps according to any one of claims 1 to 18.

A document classification device for classifying a plurality of classification target documents according to their contents, comprising:
Means for executing a first step of calculating a stable inter-document distance between each two of the classification target documents according to the degree of similarity between the two documents;
Means for executing a second step of selecting initially arranged documents from the classification target documents;
Means for executing a third step of calculating position coordinates at which each of the initially arranged documents is initially arranged on a display coordinate system;
Means for executing a fourth step of calculating, between each two of the arranged initially arranged documents, a separation vector from one document to the other on the display coordinate system at the current processing time, and calculating, for each arranged initially arranged document, an inter-document force vector received from another initially arranged document based on the difference between the length of the separation vector and the stable inter-document distance and on the direction of the separation vector;
Means for executing a fifth step of calculating, for each of the arranged initially arranged documents, a total inter-document force vector by summing the inter-document force vectors received from each of the other initially arranged documents;
Means for executing a sixth step of calculating, for each of the arranged initially arranged documents, the position coordinates at the next processing time according to the total inter-document force vector;
Means for executing a seventh step of determining convergence of the position coordinates of the initially arranged documents during execution of the repetition process of the fourth to sixth steps and terminating the repetition process;
Means for executing an eighth step of selecting, from the classification target documents, a next arranged document to be newly incorporated into the display coordinate system;
Means for executing a ninth step of calculating position coordinates at which each next arranged document is initially arranged in the display coordinate system;
Means for executing a tenth step of calculating, for the newly arranged next arranged document, a total inter-document force vector by summing the inter-document force vectors received from the already arranged documents;
Means for executing an eleventh step of calculating, for the newly arranged next arranged document, the position coordinates at the next processing time according to the total inter-document force vector; and
Means for executing a twelfth step of determining convergence of the position coordinates of the next arranged document during execution of the repetition process of the tenth and eleventh steps and terminating the repetition process.