JP2013196212A

JP2013196212A - Document division device, document division program and document division method

Info

Publication number: JP2013196212A
Application number: JP2012061209A
Authority: JP
Inventors: Isao Nanba; 功難波
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2012-03-16
Filing date: 2012-03-16
Publication date: 2013-09-30

Abstract

PROBLEM TO BE SOLVED: To provide a document dividing device, a document dividing program, and a document dividing method that allow clustering of documents to be processed in parallel efficiently. SOLUTION: A generation unit 41 generates index data 31 as feature information indicating the features of each of a plurality of documents in document data 30. An inter-document similarity calculation unit 42 calculates the similarity between the documents on the basis of the generated index data 31 and stores the result as inter-document similarity data 32. An inter-set similarity calculation unit 44 calculates, for each document, the similarity between the sets of documents whose inter-document similarity ranks within a predetermined number of top positions, and stores the result as inter-set similarity data 34. A classification unit 45 classifies documents having high inter-set similarity into groups. A dividing unit 46 divides the plurality of documents in units of the groups stored in cluster data 35.

Description

The present invention relates to a document dividing device, a document dividing program, and a document dividing method.

Techniques are known for calculating the similarity between a plurality of documents and clustering documents that have high similarity. Such clustering is used, for example, to identify frequently occurring inquiries when creating FAQs (Frequently Asked Questions) from inquiries received at a help desk or the like.

In clustering, for example, a plurality of documents are compared with one another exhaustively to find similar documents. Because every document is compared against every other document to obtain the similarity between each pair, both the storage capacity used by the processing and the processing time increase as the number of documents to be clustered increases.

Parallel processing by a plurality of computers or processors is a known way of reducing the processing time and processing load of data processing. One conceivable approach, for example, is to prepare multiple copies of the data and have a plurality of computers or processors perform the clustering in parallel.

JP 2003-263443 A

However, when clustering is processed in parallel in this way, copies of the data are created at the very time that more storage capacity is already required for the growing number of target documents, so even more storage capacity is consumed.

The disclosed technique has been made in view of the above, and an object thereof is to provide a document dividing device, a document dividing program, and a document dividing method capable of efficiently performing parallel processing of clustering on documents.

The document dividing device disclosed in the present application includes a generation unit, an inter-document similarity calculation unit, an inter-set similarity calculation unit, a classification unit, and a dividing unit. The generation unit generates feature information indicating the features of each of a plurality of documents to be clustered. The inter-document similarity calculation unit calculates the similarity between the plurality of documents based on the feature information generated by the generation unit. The inter-set similarity calculation unit calculates, for each document, the similarity between the sets of documents whose inter-document similarity ranks within a predetermined number of top positions. The classification unit classifies documents having high inter-set similarity into groups. The dividing unit divides the plurality of documents in units of the groups classified by the classification unit.

According to the document dividing device disclosed in the present application, parallel processing of clustering on documents can be performed efficiently.

FIG. 1 is a diagram showing the overall configuration of the document dividing apparatus.
FIG. 2 is a diagram illustrating a configuration example of inter-document similarity data.
FIG. 3 is a diagram illustrating a configuration example of inter-set similarity data.
FIG. 4 is an explanatory diagram for explaining the similarity between sets of documents.
FIG. 5 is a diagram illustrating an example of a flow of processing for classifying documents.
FIG. 6 is a diagram illustrating an example of a flow of processing for dividing documents.
FIG. 7 is an explanatory diagram for explaining clustering.
FIG. 8 is a diagram schematically showing the flow of processing for document data.
FIG. 9 is a diagram illustrating an example of the document names of the clustering target documents included in the document data and the words included in those documents.
FIG. 10 is a diagram illustrating an example of index data.
FIG. 11 is a diagram illustrating an example of inter-document similarity data.
FIG. 12 is a diagram illustrating an example of excluded document data.
FIG. 13 is a diagram illustrating an example of inter-set similarity data.
FIG. 14 is a diagram illustrating an example of cluster data.
FIG. 15 is a diagram illustrating an example of distributed data.
FIG. 16 is a flowchart illustrating the procedure of the dividing process.
FIG. 17 is a flowchart illustrating the procedure of the index generation process.
FIG. 18 is a flowchart illustrating the procedure of the inter-document similarity calculation process.
FIG. 19 is a flowchart illustrating the procedure of the excluded document specifying process.
FIG. 20 is a flowchart illustrating the inter-set similarity calculation process.
FIG. 21 is a flowchart illustrating the classification process.
FIG. 22 is a flowchart illustrating the dividing process.
FIG. 23 is a flowchart illustrating the procedure of the clustering process.
FIG. 24 is a diagram for describing an example of a computer that executes a dividing program.

Embodiments of the document dividing device, document dividing program, and document dividing method disclosed in the present application are described below in detail with reference to the drawings. These embodiments do not limit the disclosed technique. The embodiments can be combined as appropriate as long as their processing contents do not contradict one another.

[Configuration of Document Dividing Apparatus 10]
A document dividing apparatus according to the first embodiment will be described. FIG. 1 is a diagram showing the overall configuration of the document dividing apparatus. The document dividing apparatus 10 is an apparatus that divides a plurality of documents into sets of documents having high similarity to one another, and is, for example, a computer.

The document dividing apparatus 10 includes an external I/F (interface) unit 20, a storage unit 21, and a control unit 22.

The external I/F unit 20 is an interface that exchanges various types of information with the outside. The external I/F unit 20 may be, for example, an interface that exchanges information with an external device via a data transmission path such as a network, or an interface that exchanges information with a storage medium such as a flash memory. The external I/F unit 20 acquires various types of information by communicating with an external device or via a storage medium. For example, the plurality of documents to be clustered are input to the external I/F unit 20. The external I/F unit 20 also outputs the divided distributed data 36 to the computers or processors that perform clustering, and exchanges the clustering results with those computers or processors.

The storage unit 21 is a storage device such as a semiconductor memory element (for example, a flash memory), a hard disk, or an optical disk. The storage unit 21 is not limited to these types of storage device and may be a RAM (Random Access Memory) or a ROM (Read Only Memory).

The storage unit 21 stores an OS (Operating System) executed by the control unit 22 and the various programs used for the document division described later. The storage unit 21 also stores various data used by the programs executed by the control unit 22. As examples of such data, the storage unit 21 stores document data 30, index data 31, inter-document similarity data 32, excluded document data 33, inter-set similarity data 34, cluster data 35, and distributed data 36.

The document data 30 is data in which the plurality of documents to be clustered are stored. As one example, the documents to be clustered that are input to the external I/F unit 20 are stored as the document data 30 by a registration unit 40 described later. As another example, the document data 30 is referred to by a generation unit 41 and a dividing unit 46 described later.

The index data 31 is data storing an index that indicates, for each word appearing in the documents stored in the document data 30, which documents contain that word. As one example, the index data 31 is generated and stored by the generation unit 41 described later. As another example, the index data 31 is referred to by an inter-document similarity calculation unit 42 described later.

The inter-document similarity data 32 is data storing, for each document stored in the document data 30, the similarity between that document and each document, including the document itself. As one example, the inter-document similarity data 32 is calculated and stored by the inter-document similarity calculation unit 42 described later. As another example, the inter-document similarity data 32 is referred to by an excluded document specifying unit 43, an inter-set similarity calculation unit 44, and an output unit 47, which are described later.

FIG. 2 is a diagram illustrating a configuration example of the inter-document similarity data. Each document included in the document data 30 is assigned a document name that identifies it. If a document name has been determined in advance for each document and is stored in the document data 30, that name may be used. Alternatively, the document dividing apparatus 10 may assign a document name to each document included in the document data 30. In the example of FIG. 2, for each of the documents named "A" to "Z", the names of the partner documents whose similarity is equal to or higher than a predetermined first threshold are stored together with their similarities in the form [document name, similarity], in descending order of similarity. A larger similarity value indicates greater similarity. In this embodiment, the similarity is normalized to the range 0.0 to 1.0. The example of FIG. 2 contains the following entries:
Document A: [A, 1.0], [B, 0.8], [C, 0.8], [E, 0.7]
Document B: [B, 1.0], [A, 0.8], [D, 0.8], [G, 0.5]
Document C: [C, 1.0], [E, 0.9], [A, 0.8], [D, 0.7]
Document D: [D, 1.0], [B, 0.8], [C, 0.7]
Document I: [I, 1.0]
Document J: [J, 1.0], [X, 0.9], [K, 0.5]
Document X: [X, 1.0], [J, 0.9], [Y, 0.8]
Document Y: [Y, 1.0], [Z, 0.8], [X, 0.8]
Document Z: [Z, 1.0], [X, 0.9], [Y, 0.8]

The excluded document data 33 is data in which each document that has no document other than itself with a similarity equal to or higher than the first threshold is stored as an excluded document. As one example, the excluded documents are specified by the excluded document specifying unit 43 described later and stored as the excluded document data 33. As another example, the excluded document data 33 is referred to by the dividing unit 46 described later.

The inter-set similarity data 34 is data storing, for each document stored in the document data 30, the similarity between the set of documents similar to that document and the sets of documents similar to other documents. As one example, the inter-set similarity data 34 is calculated and stored by the inter-set similarity calculation unit 44 described later. As another example, the inter-set similarity data 34 is referred to by a classification unit 45 described later.

FIG. 3 is a diagram illustrating a configuration example of the inter-set similarity data. In the example of FIG. 3, the similarity between the sets of similar documents is stored for each of the documents named "A" to "Z". A larger inter-set similarity value indicates that the documents contained in the two sets are more similar. In this embodiment, the inter-set similarity is normalized to the range 0.0 to 1.0. For each document, the name of the partner document and the inter-set similarity are stored in the form [document name, similarity], in descending order of similarity. The example of FIG. 3 contains the following entries:
Document A: [B, 0.5]
Document B: [A, 0.5], [C, 0.3]
Document C: [D, 0.3]
Document D: [B, 0.4], [C, 0.3]
Document J: [K, 0.4]
Document X: [Y, 0.5]
Document Y: [Z, 0.6], [X, 0.5]
Document Z: [Y, 0.6]

The cluster data 35 is data in which the plurality of documents in the document data 30 are classified into groups and the documents of each group are stored as a cluster. As one example, the cluster data 35 is generated and stored by the classification unit 45 described later. As another example, the cluster data 35 is referred to by the dividing unit 46 described later.

The distributed data 36 is data storing the documents of the document data 30 divided in units of the groups indicated by the cluster data 35. As one example, the distributed data 36 is generated and stored by the dividing unit 46 described later. As another example, the distributed data 36 is referred to by the output unit 47 and a clustering unit 48 described later.

Returning to FIG. 1, the control unit 22 has an internal memory for storing programs that define various processing procedures and control data, and executes various processes using them. Various integrated circuits and electronic circuits can be employed as the control unit 22. Examples of the integrated circuit include an ASIC (Application Specific Integrated Circuit). Examples of the electronic circuit include a CPU (Central Processing Unit) and an MPU (Micro Processing Unit). As shown in FIG. 1, the control unit 22 includes a registration unit 40, a generation unit 41, an inter-document similarity calculation unit 42, an excluded document specifying unit 43, an inter-set similarity calculation unit 44, a classification unit 45, a dividing unit 46, an output unit 47, and a clustering unit 48.

Of these, the registration unit 40 is a processing unit that registers document data in the storage unit 21. In one aspect, when document data representing the plurality of documents to be clustered is input to the external I/F unit 20 from a data transmission path or a storage medium, the registration unit 40 registers the document data in the storage unit 21. The document data 30 is thereby stored in the storage unit 21.

The generation unit 41 is a processing unit that generates feature information indicating the features of each of the documents to be clustered that are stored in the document data 30. In one aspect, the generation unit 41 generates, as the feature information, index data 31 indicating which documents contain each word that appears in the documents of the document data 30. For example, the generation unit 41 extracts the words from each document in the document data 30 and stores each word in association with the document name. The generation unit 41 then sorts the extracted word and document name pairs in character code order of the words. After sorting, the pairs for the same word are adjacent to one another. The generation unit 41 reads the sorted pairs in order and generates the index data 31 indicating which documents contain each word. The generation unit 41 stores the generated index data 31 in the storage unit 21.
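The index generation described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes the words have already been extracted from each document, and the names `build_index` and `documents` are hypothetical.

```python
from collections import defaultdict

def build_index(documents):
    """Return a mapping from each word to the names of the documents containing it.

    `documents` maps a document name to its already-extracted words
    (word extraction itself is outside the scope of this sketch).
    """
    # Step 1: associate every extracted word with its document name.
    pairs = {(word, name) for name, words in documents.items() for word in words}

    # Step 2: sort the pairs by word; Python compares strings by code point,
    # which stands in here for "character code order".
    index = defaultdict(list)
    for word, name in sorted(pairs):
        # Step 3: pairs for the same word are now adjacent, so one pass
        # collects the documents containing each word.
        index[word].append(name)
    return dict(index)

# Documents A and B from the worked example in the description.
documents = {"A": ["a", "b", "c", "d"], "B": ["b", "c", "d", "e"]}
index = build_index(documents)
print(index)
```

With this structure, the words of a given document can be recovered by scanning the index for entries that list the document name.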

The inter-document similarity calculation unit 42 is a processing unit that calculates the similarity between the documents to be clustered that are stored in the document data 30. In one aspect, the inter-document similarity calculation unit 42 calculates the similarity between the documents stored in the document data 30 based on the index data 31 generated by the generation unit 41. For example, the inter-document similarity calculation unit 42 identifies the words included in each document of the document data 30 based on the index data 31. The inter-document similarity calculation unit 42 then regards each document as a vector of words and calculates the cosine of the angle θ between the vectors (cos θ) as the similarity between the documents.

For example, assume that document A includes word a, word b, word c, and word d, and that document B includes word b, word c, word d, and word e. In this case, the inter-document similarity calculation unit 42 obtains, for each document, a vector whose elements correspond to the words included in document A or document B, with the element value set to 1 if the document includes the corresponding word and to 0 if it does not. For example, when the vector elements correspond to (word a, word b, word c, word d, word e), the vector A of document A is (1, 1, 1, 1, 0) and the vector B of document B is (0, 1, 1, 1, 1). The inter-document similarity calculation unit 42 obtains the similarity between document A and document B by computing cos θ for vectors A and B as shown in equation (1) below.
Similarity = (A · B) / (|A| × |B|)   (1)

Here, A · B is calculated as 1 × 0 + 1 × 1 + 1 × 1 + 1 × 1 + 0 × 1 = 3. |A| is calculated as (1 × 1 + 1 × 1 + 1 × 1 + 1 × 1 + 0 × 0)^(1/2) = 2. Similarly, |B| is calculated as (0 × 0 + 1 × 1 + 1 × 1 + 1 × 1 + 1 × 1)^(1/2) = 2. The similarity between document A and document B is therefore 3 / (2 × 2) = 0.75.
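The calculation of equation (1) can be reproduced with a short sketch. The function name `cosine_similarity` and the use of Python sets to hold the words are assumptions made for illustration only.

```python
import math

def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the binary word vectors of two documents."""
    # Vector elements correspond to every word found in either document.
    vocabulary = sorted(set(words_a) | set(words_b))
    vec_a = [1 if w in words_a else 0 for w in vocabulary]
    vec_b = [1 if w in words_b else 0 for w in vocabulary]

    dot = sum(x * y for x, y in zip(vec_a, vec_b))       # A · B
    norm_a = math.sqrt(sum(x * x for x in vec_a))        # |A|
    norm_b = math.sqrt(sum(y * y for y in vec_b))        # |B|
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a document with no words is similar to nothing
    return dot / (norm_a * norm_b)

# Documents A and B from the worked example above.
sim = cosine_similarity({"a", "b", "c", "d"}, {"b", "c", "d", "e"})
print(sim)  # 3 / (2 * 2) = 0.75
```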

For each document in the document data 30, the inter-document similarity calculation unit 42 obtains the similarity with every document in the document data 30, including the document itself, as the partner document. The inter-document similarity calculation unit 42 then stores, in the storage unit 21 as the inter-document similarity data 32, the document names and similarities of the partner documents whose similarity is equal to or higher than the predetermined first threshold. For example, with the first threshold set to 0.4, the inter-document similarity calculation unit 42 stores the document names and similarities of the partner documents whose similarity is 0.4 or higher. The first threshold is not limited to this example and may be set to any value by the user of the document dividing apparatus 10. As a result, the inter-document similarity data 32 is stored in the storage unit 21 as shown in FIG. 2. The method of calculating the similarity is not limited to regarding a document as a vector of words and obtaining cos θ as described above; any method may be used as long as it yields the degree to which documents are similar. For example, the proportion of words that the document and the partner document have in common may be calculated as the similarity.
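As a rough sketch of how the inter-document similarity data might be assembled, the following compares every document with every document (including itself), keeps the entries at or above the first threshold of 0.4, and orders them by descending similarity. The names `binary_cosine`, `inter_document_similarity`, and `FIRST_THRESHOLD` are hypothetical, the three sample documents are invented for the illustration, and the set-based formula is simply the binary-vector case of equation (1).

```python
import math

FIRST_THRESHOLD = 0.4  # example value from the description; user-configurable

def binary_cosine(a, b):
    """cos θ for binary word vectors: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def inter_document_similarity(documents, threshold=FIRST_THRESHOLD):
    """For each document, list (partner name, similarity) pairs at or above
    the threshold, in descending order of similarity."""
    data = {}
    for name, words in documents.items():
        entries = [(other, binary_cosine(words, other_words))
                   for other, other_words in documents.items()]
        kept = [(other, s) for other, s in entries if s >= threshold]
        kept.sort(key=lambda entry: -entry[1])  # stable sort keeps ties in input order
        data[name] = kept
    return data

# Illustrative input: A and B share three words, I shares none with either.
documents = {
    "A": {"a", "b", "c", "d"},
    "B": {"b", "c", "d", "e"},
    "I": {"x", "y"},
}
similarity_data = inter_document_similarity(documents)
print(similarity_data)
```

Note that this sketch still compares all pairs; it only illustrates the thresholding and ordering of the stored entries.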

The excluded document specifying unit 43 is a processing unit that specifies documents that are not similar to any other document. In one aspect, the excluded document specifying unit 43 specifies, based on the inter-document similarity data 32, each document for which no document other than itself has a similarity equal to or higher than the first threshold. For example, the excluded document specifying unit 43 specifies each document for which no similar document other than itself is registered in the inter-document similarity data 32. In the example of FIG. 2, document I has no document other than itself with a similarity of 0.4 or higher. The excluded document specifying unit 43 therefore specifies document I as an excluded document. The excluded document specifying unit 43 stores the specified documents in the storage unit 21 as the excluded document data 33.
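The selection of excluded documents can be illustrated with a small sketch that operates on entries in the [document name, similarity] form of FIG. 2. The function name `find_excluded_documents` and the dictionary layout are assumptions, and only a subset of the FIG. 2 entries is reproduced.

```python
def find_excluded_documents(similarity_data):
    """Return the documents whose only registered similar document is themselves."""
    return [name for name, entries in similarity_data.items()
            if all(partner == name for partner, _ in entries)]

# Subset of the FIG. 2 example (entries already filtered by the first threshold).
similarity_data = {
    "A": [("A", 1.0), ("B", 0.8), ("C", 0.8), ("E", 0.7)],
    "D": [("D", 1.0), ("B", 0.8), ("C", 0.7)],
    "I": [("I", 1.0)],
    "J": [("J", 1.0), ("X", 0.9), ("K", 0.5)],
}
excluded = find_excluded_documents(similarity_data)
print(excluded)  # ['I']
```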

The inter-set similarity calculation unit 44 is a processing unit that calculates, for the documents stored in the document data 30, the similarity between the sets of documents similar to the respective documents. In one aspect, the inter-set similarity calculation unit 44 calculates the similarity between the sets formed, for each document, by the documents whose inter-document similarity ranks within a predetermined number of top positions. For example, excluding the documents specified by the excluded document specifying unit 43, the inter-set similarity calculation unit 44 calculates the similarity between the sets of documents that rank within the top four by inter-document similarity for each document. When a document has fewer partner documents with a similarity equal to or higher than the first threshold than the predetermined number, the inter-set similarity calculation unit 44 uses the set of all documents whose similarity is equal to or higher than the first threshold. That is, when only three documents have a similarity equal to or higher than the first threshold, the similarity is calculated with those three documents as the set. In this example, the documents ranked within the top four are used as the documents within the predetermined top positions. The number of ranks used for the calculation is not limited to this example and may be set to any value by the user of the document dividing apparatus 10.

集合間類似度算出部４４は、文書間類似度データ３２に基づき、文書データ３０に示される各文書を、それぞれ類似する文書を要素とし、類似する文書との類似度を要素の値としたベクトルと見なして、ベクトル間の類似度から文書の集合間の類似度を算出する。例えば、文書Ａは、上位４位の文書および類似度が[Ａ、１．０]、[Ｂ、０．８]、[Ｃ、０．８]、[Ｅ、０．７]であり、文書Ｂは、上位４位以内の文書および類似度が[Ｂ、１．０]、[Ａ、０．８]、[Ｄ、０．８]、[Ｇ、０．５]であるものとする。この場合、集合間類似度算出部４４は、文書Ａおよび文書Ｂに類似する文書を要素とし、類似度を要素の値としたベクトルを求める。例えば、ベクトルの要素が（文書Ａ、文書Ｂ、文書Ｃ、文書Ｄ、文書Ｅ、文書Ｇ）に対応するものとした場合、文書ＡのベクトルＡは（１．０、０．８、０．８、０．０、０．７、０．０）と求まる。また、文書ＢのベクトルＢは（０．８、１．０、０．０、０．８、０．０、０．５）と求まる。集合間類似度算出部４４は、ベクトルＡ、ベクトルＢについて、以下の（２）式に示す角度cosineθを求めこれを文書Ａと文書Ｂの類似度としている。 Based on the inter-document similarity data 32, the inter-set similarity calculation unit 44 regards each document indicated in the document data 30 as a vector whose elements are its similar documents and whose element values are the similarities to those documents, and calculates the similarity between sets of documents from the similarity between the vectors. For example, suppose that the top four documents and similarities for document A are [A, 1.0], [B, 0.8], [C, 0.8], [E, 0.7], and those for document B are [B, 1.0], [A, 0.8], [D, 0.8], [G, 0.5]. In this case, the inter-set similarity calculation unit 44 obtains vectors whose elements are the documents similar to document A or document B and whose element values are the similarities. For example, when the vector elements correspond to (document A, document B, document C, document D, document E, document G), vector A of document A is (1.0, 0.8, 0.8, 0.0, 0.7, 0.0), and vector B of document B is (0.8, 1.0, 0.0, 0.8, 0.0, 0.5). For vectors A and B, the inter-set similarity calculation unit 44 obtains the cosine of the angle θ between them, given by the following equation (2), and uses it as the similarity between document A and document B.
類似度＝Ａ・Ｂ／（｜Ａ｜×｜Ｂ｜）（２） Similarity = A · B / (|A| × |B|) (2)

ここで、Ａ・Ｂは、１．０×０．８＋０．８×１．０＋０．８×０．０＋０．０×０．８＋０．７×０．０＋０．０×０．５＝１．６と算出される。また、｜Ａ｜は、（１．０×１．０＋０．８×０．８＋０．８×０．８＋０．７×０．７）^１／２＝（２．７７）^１／２と算出される。同様に、｜Ｂ｜は、（０．８×０．８＋１．０×１．０＋０．８×０．８＋０．５×０．５）^１／２＝（２．５３）^１／２と算出される。よって、文書Ａと文書Ｂの類似度は、１．６／（（２．７７）^１／２×（２．５３）^１／２）≒０．６と算出される。 Here, A · B is calculated as 1.0 × 0.8 + 0.8 × 1.0 + 0.8 × 0.0 + 0.0 × 0.8 + 0.7 × 0.0 + 0.0 × 0.5 = 1.6. Also, |A| is calculated as (1.0 × 1.0 + 0.8 × 0.8 + 0.8 × 0.8 + 0.7 × 0.7)^1/2 = (2.77)^1/2. Similarly, |B| is calculated as (0.8 × 0.8 + 1.0 × 1.0 + 0.8 × 0.8 + 0.5 × 0.5)^1/2 = (2.53)^1/2. Accordingly, the similarity between document A and document B is calculated as 1.6 / ((2.77)^1/2 × (2.53)^1/2) ≈ 0.6.

また、例えば、文書Ａおよび文書Ｂが、相手文書が３つしかなく、文書Ａは、相手文書および類似度が[Ａ、１．０]、[Ｂ、０．５]、[Ｃ、０．３]であり、文書Ｂは、相手文書および類似度が[Ｂ、１．０]、[Ａ、０．８]、[Ｃ、０．４]であるものとする。この場合も、集合間類似度算出部４４は、文書Ａおよび文書Ｂに類似する文書を要素とし、類似度を要素の値としたベクトルを求める。例えば、ベクトルの要素が（文書Ａ、文書Ｂ、文書Ｃ）に対応するものとした場合、文書ＡのベクトルＡは（１．０、０．５、０．３）と求まる。また、文書ＢのベクトルＢは（０．８、１．０、０．４）と求まる。集合間類似度算出部４４は、ベクトルＡ、ベクトルＢについて、上述の（２）式に示す演算を行って文書Ａと文書Ｂの類似度を求める。 Further, for example, document A and document B have only three counterpart documents, and document A has counterpart documents and similarities of [A, 1.0], [B, 0.5], [C, 0. 3], and document B is assumed to be [B, 1.0], [A, 0.8], [C, 0.4] with the counterpart document. Also in this case, the inter-set similarity calculation unit 44 obtains a vector having the documents similar to the documents A and B as elements and the similarity as the element value. For example, when the vector element corresponds to (document A, document B, document C), the vector A of document A is obtained as (1.0, 0.5, 0.3). Further, the vector B of the document B is obtained as (0.8, 1.0, 0.4). The inter-set similarity calculation unit 44 calculates the similarity between the document A and the document B by performing the calculation shown in the above equation (2) for the vector A and the vector B.

ここで、Ａ・Ｂは、１．０×０．８＋０．５×１．０＋０．３×０．４＝１．４２と算出される。また、｜Ａ｜は、（１．０×１．０＋０．５×０．５＋０．３×０．３）^１／２＝（１．３４）^１／２と算出される。同様に、｜Ｂ｜は、（０．８×０．８＋１．０×１．０＋０．４×０．４）^１／２＝（１．８）^１／２と算出される。よって、文書Ａと文書Ｂの類似度は、１．４２／（（１．３４）^１／２×（１．８）^１／２）≒０．９１と算出される。 Here, A · B is calculated as 1.0 × 0.8 + 0.5 × 1.0 + 0.3 × 0.4 = 1.42. Also, |A| is calculated as (1.0 × 1.0 + 0.5 × 0.5 + 0.3 × 0.3)^1/2 = (1.34)^1/2. Similarly, |B| is calculated as (0.8 × 0.8 + 1.0 × 1.0 + 0.4 × 0.4)^1/2 = (1.8)^1/2. Accordingly, the similarity between document A and document B is calculated as 1.42 / ((1.34)^1/2 × (1.8)^1/2) ≈ 0.91.
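
Equation (2) can be evaluated mechanically for the two pairs of vectors given above. The following sketch assumes that each document's similar documents are held in a dictionary (partner name to similarity); the function name `inter_set_similarity` and this layout are illustrative assumptions, not part of the described device.

```python
import math


def inter_set_similarity(neighbors_a, neighbors_b):
    """Cosine similarity between two documents' neighbor vectors.

    Each argument is a hypothetical dict: partner document -> similarity.
    A partner missing from one dict contributes 0.0 for that element,
    which matches the zero entries in the vectors written out in the text.
    """
    keys = set(neighbors_a) | set(neighbors_b)
    dot = sum(neighbors_a.get(k, 0.0) * neighbors_b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in neighbors_a.values()))
    norm_b = math.sqrt(sum(v * v for v in neighbors_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# First pair: the top-four partners of documents A and B.
first = inter_set_similarity(
    {"A": 1.0, "B": 0.8, "C": 0.8, "E": 0.7},
    {"B": 1.0, "A": 0.8, "D": 0.8, "G": 0.5},
)
# Second pair: documents A and B with only three partners each.
second = inter_set_similarity(
    {"A": 1.0, "B": 0.5, "C": 0.3},
    {"B": 1.0, "A": 0.8, "C": 0.4},
)
print(round(first, 2), round(second, 2))
```

Using dictionaries rather than fixed-length lists avoids having to agree on an element order in advance; the union of keys plays the role of the element list (document A, document B, ...).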

ここで、各文書の文書間の類似度が上位の所定位以内の文書の集合間の類似度について説明する。図４は、文書の集合間の類似度を説明するための説明図である。図４の例では、各文書の文書間の類似度の一例が示されている。また、図４の例では、文書Ａ〜Ｇ、Ｎ〜Ｑを文書間の類似度が高いほど文書間の距離を短く配置して文書間の類似度を模式的に示している。また、図４の例では、文書間の類似度が高い文書Ａ〜Ｇを「●」で示し、文書間の類似度が低い文書Ｎ〜Ｑを「○」で示している。図４の例では、例えば、クラスタリングを行った場合、互いに文書間の類似度が高い範囲５０内の文書Ａ〜Ｇがクラスタにまとまることが期待される。この文書Ａ〜Ｇは、互いに類似しているため、類似する文書の集合間の類似度を算出した場合、類似度が高く算出される。よって、類似する文書の集合間の類似度が高い文書を同じグループにまとめることにより、類似する文書が別の分散データ３６に分割されることを抑制できる。また、それぞれの文書間について、類似する文書の集合で類似度を算出することにより、類似する文書が同じグループに分類しやすくなる。 Here, the similarity between the sets of documents ranked within the predetermined top rank of inter-document similarity for each document will be described. FIG. 4 is an explanatory diagram for explaining the similarity between sets of documents. The example of FIG. 4 shows an example of the inter-document similarity of each document. In the example of FIG. 4, the inter-document similarity is shown schematically by placing documents A to G and N to Q closer together the higher their inter-document similarity. Documents A to G, which have high inter-document similarity, are indicated by "●", and documents N to Q, which have low inter-document similarity, are indicated by "○". In the example of FIG. 4, when clustering is performed, documents A to G within the range 50, whose inter-document similarities to one another are high, are expected to be gathered into a cluster. Because documents A to G are similar to one another, the similarity between their sets of similar documents is also calculated to be high. Therefore, by putting documents whose sets of similar documents have high similarity into the same group, similar documents can be kept from being divided into different distributed data 36. In addition, calculating the similarity between each pair of documents from their sets of similar documents makes it easier to classify similar documents into the same group.

集合間類似度算出部４４は、各文書毎に、相手文書の文書名および集合間の類似度を集合間類似度データ３４として記憶部２１に格納する。これにより、記憶部２１には、図３に示すように集合間類似度データ３４が記憶される。なお、図３に示す集合間類似度データ３４は、一例であり、図２に示す文書間類似度データ３２から求めたものではない。 The inter-set similarity calculation unit 44 stores, for each document, the document name of the counterpart document and the similarity between sets in the storage unit 21 as inter-set similarity data 34. Thereby, the storage unit 21 stores the inter-set similarity data 34 as shown in FIG. The inter-set similarity data 34 shown in FIG. 3 is an example, and is not obtained from the inter-document similarity data 32 shown in FIG.

分類部４５は、文書データ３０に記憶された各文書を集合間の類似度の高い文書同士をグループに分類する処理部である。一態様としては、分類部４５は、集合間類似度データ３４に基づき、各文書を集合間の類似度が所定の第２閾値以上である文書同士をグループに分類する。図５は、文書を分類する処理の流れの一例を示す図である。例えば、分類部４５は、集合間類似度データ３４に記憶された各文書を、文書の集合間の類似度の大きい順に並べ、類似度の大きい順に各文書を順次、処理対象と特定する。分類部４５は、処理対象とされた文書が、既に処理対象とされた文書と集合間の類似度が第２閾値以上である場合、既に処理対象とされた文書と同じグループに分類する。また、分類部４５は、既に処理対象とされた文書と集合間の類似度が第２閾値未満である場合、新たなグループとして分類する。例えば、第２閾値を０．４とし、分類部４５は、集合間の類似度が０．４以上である場合、既に処理対象とされた文書と同じグループに分類し、集合間の類似度が０．４未満である場合、新たなグループとして分類するものとする。これにより、文書データ３０に記憶された各文書は、グループに分類される。これにより、図５の例では、文書Ｙ、Ｚ、Ｘが同じグループに分類され、文書Ａ、Ｂ、Ｃ、Ｄが同じグループに分類され、文書Ｊ、Ｋが同じグループに分類される。分類部４５は、分類した各グループの文書をクラスタデータ３５として記憶部２１に格納する。なお、第２閾値については、この例示に限るものではなく、文書分割装置１０を利用する者が任意の値に設定してよい。 The classification unit 45 is a processing unit that classifies the documents stored in the document data 30 into groups of documents having high inter-set similarity to one another. In one aspect, based on the inter-set similarity data 34, the classification unit 45 classifies documents whose inter-set similarity is equal to or higher than a predetermined second threshold into the same group. FIG. 5 is a diagram illustrating an example of the flow of processing for classifying documents. For example, the classification unit 45 arranges the documents stored in the inter-set similarity data 34 in descending order of inter-set similarity, and takes each document in that order as the processing target. When the inter-set similarity between the document being processed and a document that has already been processed is equal to or higher than the second threshold, the classification unit 45 classifies the document into the same group as that already processed document. When the inter-set similarity to the already processed documents is less than the second threshold, the classification unit 45 classifies the document into a new group. For example, with the second threshold set to 0.4, the classification unit 45 classifies a document into the same group as an already processed document when the inter-set similarity is 0.4 or more, and into a new group when it is less than 0.4. As a result, each document stored in the document data 30 is classified into a group. In the example of FIG. 5, documents Y, Z, and X are classified into the same group, documents A, B, C, and D are classified into the same group, and documents J and K are classified into the same group. The classification unit 45 stores the documents of each group in the storage unit 21 as cluster data 35. The second threshold is not limited to this example, and a user of the document dividing device 10 may set it to any value.
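
One possible reading of this classification procedure is sketched below. Several details are assumptions made only for illustration: documents are ordered by their strongest inter-set similarity to another document, a document joins the group of the already processed partner with which it has the highest qualifying similarity, and the function name `classify_documents` and the sample values are hypothetical (they are not the values of FIG. 5).

```python
def classify_documents(inter_set_sim, threshold=0.4):
    """Greedy grouping by inter-set similarity.

    inter_set_sim is a hypothetical dict: document -> {partner: similarity}.
    Returns a list of groups, each a list of document names.
    """
    def strongest(doc):
        # Assumption: rank a document by its best similarity to another document.
        others = [s for p, s in inter_set_sim[doc].items() if p != doc]
        return max(others, default=0.0)

    order = sorted(inter_set_sim, key=strongest, reverse=True)
    group_of = {}   # document -> index of its group
    groups = []
    for doc in order:
        # Already processed partners that meet the second threshold.
        candidates = [
            (s, p) for p, s in inter_set_sim[doc].items()
            if p != doc and p in group_of and s >= threshold
        ]
        if candidates:
            _, partner = max(candidates)
            index = group_of[partner]
        else:
            index = len(groups)      # no qualifying partner: open a new group
            groups.append([])
        groups[index].append(doc)
        group_of[doc] = index
    return groups


# Illustrative similarities only.
sims = {
    "Y": {"Y": 1.0, "Z": 0.9, "X": 0.8},
    "Z": {"Z": 1.0, "Y": 0.9, "X": 0.7},
    "X": {"X": 1.0, "Y": 0.8, "Z": 0.7},
    "A": {"A": 1.0, "B": 0.6, "Y": 0.1},
    "B": {"B": 1.0, "A": 0.6},
}
groups = classify_documents(sims, threshold=0.4)
print(groups)
```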

分割部４６は、文書データ３０に記憶された各文書を分散処理させる分散データに分割する処理部である。一態様としては、分割部４６は、クラスタデータ３５に記憶されたグループ単位で、複数の文書を分割する。また、分割部４６は、除外文書データ３３に記憶された除外文書を分散データに分割する。図６は、文書を分割する処理の流れの一例を示す図である。例えば、分割部４６は、分類部４５により分類されたグループを文書数の多い順に並べ、文書数の多い順に各グループを順次、処理対象と特定する。そして、分割部４６は、複数の分散データ３６に分割する場合、各分散データ３６のうち、文書数が最も少ない分散データ３６にグループ単位で文書を分割する。これにより、図６の例では、文書Ａ、Ｂ、Ｃ、Ｄのグループと文書Ｊ、Ｋのグループが分散データ３６Ａに分割され、文書Ｙ、Ｚ、Ｘが分散データ３６Ｂに分割される。また、分割部４６は、除外文書データ３３に記憶された除外文書も文書数が最も少ない分散データに分割する。なお、分割部４６は、文書数が最も少ない分散データ３６が複数ある場合、文書数が最も少ない何れかの分散データに文書を分割する。これにより、分散データには、文書数が同等となるように複数の文書が分割される。分割部４６は、分割した各分散データ３６を記憶部２１に格納する。 The dividing unit 46 is a processing unit that divides the documents stored in the document data 30 into distributed data for distributed processing. In one aspect, the dividing unit 46 divides the plurality of documents in units of the groups stored in the cluster data 35. The dividing unit 46 also assigns the excluded documents stored in the excluded document data 33 to the distributed data. FIG. 6 is a diagram illustrating an example of the flow of processing for dividing documents. For example, the dividing unit 46 arranges the groups classified by the classification unit 45 in descending order of the number of documents, and takes each group in that order as the processing target. When dividing into a plurality of pieces of distributed data 36, the dividing unit 46 assigns the documents, group by group, to the distributed data 36 that currently holds the fewest documents. Thus, in the example of FIG. 6, the group of documents A, B, C, and D and the group of documents J and K are assigned to distributed data 36A, and documents Y, Z, and X are assigned to distributed data 36B. The dividing unit 46 also assigns the excluded documents stored in the excluded document data 33 to the distributed data holding the fewest documents. When more than one piece of distributed data 36 holds the fewest documents, the dividing unit 46 assigns the documents to any one of them. As a result, the documents are divided among the distributed data so that the numbers of documents are roughly equal. The dividing unit 46 stores each piece of distributed data 36 in the storage unit 21.
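
The rule "largest group first, into the distributed data with the fewest documents" is a greedy load-balancing scheme and can be sketched as below. The function name `divide_into_partitions`, the tie-breaking choice (the first partition with the minimum count), and the sample document names are assumptions made for illustration; the sample is not the data of FIG. 6.

```python
def divide_into_partitions(groups, excluded, num_partitions):
    """Assign whole groups, then excluded documents, to the emptiest partition.

    groups: list of lists of document names (hypothetical layout).
    excluded: list of document names that belong to no group.
    """
    partitions = [[] for _ in range(num_partitions)]
    # Process groups from the largest to the smallest.
    for group in sorted(groups, key=len, reverse=True):
        # min() returns the first partition with the fewest documents,
        # which is one way of picking "any" of the tied candidates.
        target = min(partitions, key=len)
        target.extend(group)      # a group is never split across partitions
    for doc in excluded:
        min(partitions, key=len).append(doc)
    return partitions


# Illustrative input: three groups of sizes 4, 3, 2 and one excluded document.
groups_in = [["p1", "p2", "p3", "p4"], ["q1", "q2", "q3"], ["r1", "r2"]]
parts = divide_into_partitions(groups_in, ["s1"], num_partitions=2)
print([len(p) for p in parts])
```

Because groups are kept intact, the partition sizes are only approximately equal; how close they get depends on the group sizes.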

出力部４７は、記憶部２１に記憶された分散データ３６を、それぞれクラスタリングを実行する処理部へ出力する処理部である。一態様としては、出力部４７は、クラスタリングを実行する複数のコンピュータやプロセッサに対してクラスタリング対象のデータとして分散データ３６を出力する。また、出力部４７は、分散データ３６と共に、文書間類似度データ３２に記憶された当該分散データ３６に含まれる文書にかかる文書間の類似度をクラスタリングを実行する複数のコンピュータやプロセッサに対して出力する。本実施例では、何れか１つの分散データ３６を処理対象として文書分割装置１０においてクラスタリングを実行する。出力部４７は、クラスタリングの処理対象とされた分散データ３６以外の分散データ３６を他のコンピュータやプロセッサに対して出力する。出力された分散データ３６は、他のコンピュータやプロセッサにおいてクラスタリングが実行される。また、出力部４７は、分散データ３６と共に、文書間類似度データ３２に記憶された当該分散データ３６に含まれる文書にかかる文書間の類似度を他のコンピュータやプロセッサに対して出力する。他のコンピュータやプロセッサでは、分散データ３６に含まれる文書にかかる文書間の類似度を使用してクラスタリングを行うことにより、類似度の算出処理を軽減できる。 The output unit 47 is a processing unit that outputs the distributed data 36 stored in the storage unit 21 to a processing unit that executes clustering. As one aspect, the output unit 47 outputs the distributed data 36 as data to be clustered to a plurality of computers and processors that execute clustering. Further, the output unit 47 outputs the similarity between documents related to the documents included in the distributed data 36 stored in the inter-document similarity data 32 together with the distributed data 36 to a plurality of computers and processors that execute clustering. Output. In this embodiment, clustering is executed in the document dividing apparatus 10 with any one distributed data 36 as a processing target. The output unit 47 outputs the distributed data 36 other than the distributed data 36 that is the clustering processing target to another computer or processor. The output distributed data 36 is subjected to clustering in another computer or processor. The output unit 47 outputs the similarity between documents related to the document included in the distributed data 36 stored in the inter-document similarity data 32 together with the distributed data 36 to another computer or processor. 
In other computers and processors, the similarity calculation processing can be reduced by performing clustering using the similarity between documents concerning the document included in the distributed data 36.

クラスタリング部４８は、クラスタリングの処理対象とされた分散データ３６のクラスタリングを実行する処理部である。一態様としては、クラスタリング部４８は、処理対象とされた分散データ３６に含まれる文書間の類似度を算出し、類似する文書をまとめるクラスタリング処理を実行する。この際、クラスタリング部４８は、文書間の類似度を算出する際、文書間類似度データ３２に処理対象の分散データ３６に含まれる文書にかかる文書間の類似度が記憶されている場合、記憶された類似度を使用する。すなわち、クラスタリング部４８は、文書間の類似度が文書間類似度データ３２に記憶されている場合、類似度の算出を行わず、記憶された類似度を文書間の類似度として使用する。このように、文書間類似度データ３２に類似度が記憶されている文書間については、記憶された類似度を使用することにより、類似度の算出処理が省略され、処理負荷を軽減することができる。 The clustering unit 48 is a processing unit that performs clustering of the distributed data 36 selected as the clustering target. In one aspect, the clustering unit 48 calculates the similarity between the documents included in the distributed data 36 to be processed, and executes a clustering process that gathers similar documents. When calculating the similarity between documents, if the inter-document similarity data 32 already stores the similarity between documents included in the distributed data 36 to be processed, the clustering unit 48 uses the stored similarity. That is, when the similarity between two documents is stored in the inter-document similarity data 32, the clustering unit 48 does not calculate it again and uses the stored value as the inter-document similarity. In this way, for document pairs whose similarity is stored in the inter-document similarity data 32, the similarity calculation is omitted and the processing load can be reduced.
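
Reusing stored similarities amounts to a lookup before computation. A minimal sketch follows, assuming the stored similarities are keyed by unordered document pairs; the names `similarity_with_cache`, `stored`, and `compute_similarity` are hypothetical, and `compute_similarity` is only a placeholder for whatever similarity calculation the clustering actually uses.

```python
def similarity_with_cache(doc_a, doc_b, stored, compute):
    """Return the stored similarity for a pair if present, else compute it."""
    key = frozenset((doc_a, doc_b))   # unordered pair: (a, b) equals (b, a)
    if key in stored:
        return stored[key]            # skip the calculation entirely
    return compute(doc_a, doc_b)


calls = []


def compute_similarity(doc_a, doc_b):
    # Placeholder calculation; it records the call so reuse can be observed.
    calls.append((doc_a, doc_b))
    return 0.0


stored = {frozenset(("doc01", "doc02")): 0.8}
reused = similarity_with_cache("doc02", "doc01", stored, compute_similarity)
fresh = similarity_with_cache("doc01", "doc09", stored, compute_similarity)
print(reused, fresh, len(calls))
```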

図７は、クラスタリングを説明するための説明図である。図７の例では、上段に文書Ａ〜Ｆをクラスタリングした一例が示されている。また、図７の例では、下段に文書Ａ〜Ｆを類似する文書の集合間の類似度が高い文書Ａ〜Ｃと文書Ｄ〜Ｆとに分割し、文書Ａ〜Ｃと文書Ｄ〜Ｆをそれぞれ別なプロセッサ１、２によりクラスタリングした一例が示されている。クラスタリングでは、例えば、複数の文書同士を総当りで比較して類似する文書を求める。このため、図７の上段に示すように、文書Ａ〜Ｆをクラスタリングする場合は、文書Ａ〜Ｆの間の類似度を算出し、類似度を記憶するため、クラスタリングの対象とする文書数の増加に伴い処理で使用する記憶容量が増加し、処理時間も増加する。例えば、文書数ｎが多くなるとｎ×ｎで類似度の計算量および記憶容量も増加する。一例として、文書数ｎを１００万個とし、各類似度を記憶する記憶領域のサイズを４バイトとした場合、類似度を記憶する記憶領域は、（１０^６×２−１）^２×４バイト≒１４．５ＴＢとなる。よって、文書数ｎが多い場合、１つのプロセッサでは文書のクラスタリングを行うことに限界がある。 FIG. 7 is an explanatory diagram for explaining clustering. The upper part of FIG. 7 shows an example in which documents A to F are clustered together. The lower part of FIG. 7 shows an example in which documents A to F are divided into documents A to C and documents D to F, each set having high similarity between its sets of similar documents, and documents A to C and documents D to F are clustered by separate processors 1 and 2. In clustering, for example, all pairs of documents are compared with one another to find similar documents. Therefore, as shown in the upper part of FIG. 7, when documents A to F are clustered, the similarities among documents A to F are calculated and stored, so the storage capacity used by the processing and the processing time both increase as the number of documents to be clustered increases. For example, as the number of documents n grows, the amount of similarity calculation and the storage capacity grow in proportion to n × n. As an example, when the number of documents n is 1 million and the storage area for each similarity is 4 bytes, the storage area for the similarities is (10^6 × 2 - 1)^2 × 4 bytes ≈ 14.5 TB. Therefore, when the number of documents n is large, there is a limit to clustering the documents with a single processor.
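
The storage estimate can be checked with a few lines of arithmetic. The sketch below assumes n = 10^6 (1 million documents), reads the expression as (2n - 1)^2 entries of 4 bytes each, and assumes that "TB" is counted in units of 2^40 bytes; the variable names are illustrative.

```python
n = 10**6                      # number of documents (1 million)
bytes_per_similarity = 4       # size of one stored similarity value

entries = (2 * n - 1) ** 2     # number of stored values in the expression
total_bytes = entries * bytes_per_similarity
size_in_tb = total_bytes / 2**40   # assumption: 1 TB taken as 2**40 bytes

# For comparison, a plain n x n table of 4-byte values.
plain_table_tb = (n * n * bytes_per_similarity) / 2**40

print(f"{size_in_tb:.2f} TB, plain n x n table: {plain_table_tb:.2f} TB")
```

Either way the requirement is on the order of terabytes for one million documents, which is the point the paragraph makes about the limits of a single processor.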

一方、図７の下段に示すように、文書Ａ〜Ｃと文書Ｄ〜Ｆをそれぞれ別なプロセッサによりクラスタリングした場合は、分割したことによりクラスタリング対象の文書数が減るため、記憶容量を抑制でき、処理時間も抑制できる。また、文書Ａ〜Ｃと文書Ｄ〜Ｆは、それぞれ類似する文書の集合間の類似度が低く、文書間の類似度も低い。このため、文書Ａ〜Ｃと文書Ｄ〜Ｆをクラスタリングする処理部は、互いに算出した類似度を交換することなく、類似した文書のクラスタリングを行うことができる。 On the other hand, as shown in the lower part of FIG. 7, when the documents A to C and the documents D to F are clustered by different processors, the number of documents to be clustered decreases due to the division, so that the storage capacity can be suppressed. Processing time can also be suppressed. Documents A to C and documents D to F have low similarity between sets of similar documents, and the similarity between documents is also low. For this reason, the processing unit for clustering the documents A to C and the documents D to F can perform clustering of similar documents without exchanging the similarities calculated with each other.

クラスタリング部４８は、クラスタリングを実行した他のコンピュータやプロセッサと通信を行ってそれぞれのクラスタリング処理の結果をマージし、文書データ３０に対するクラスタリングの結果としてまとめる。 The clustering unit 48 communicates with other computers and processors that have performed clustering, merges the results of the respective clustering processes, and summarizes the results as clustering for the document data 30.

図８は、文書データに対する処理の流れを模式的に示した図である。生成部４１は、文書データ３０の複数の文書それぞれの特徴を示す特徴情報として索引データ３１を生成する。文書間類似度算出部４２は、生成された索引データ３１に基づき、複数の文書間の類似度を算出し、算出した文書間の類似度を文書間類似度データ３２として格納する。除外文書特定部４３は、文書間類似度データ３２において自文書以外に類似する文書が登録されていない文書を特定し、特定した文書を除外文書データ３３として格納する。集合間類似度算出部４４は、各文書の文書間の類似度が上位の所定位以内の文書の集合間の類似度を算出し、算出した集合間の類似度を集合間類似度データ３４として格納する。分類部４５は、集合間の類似度の高い文書同士をグループに分類する。分割部４６は、クラスタデータ３５に記憶されたグループ単位で、複数の文書を分散データ３６に分割する。また、分割部４６は、除外文書データ３３に記憶された除外文書を分散データ３６に分割する。このように分散データ３６に分割することにより、類似する文書が別な分散データ３６に分かれることが抑制されるため、それぞれの分散データ３６に対するクラスタリングの並列処理を効率的に行うことができる。また、分割したことによりクラスタリング対象の文書数が減るため、記憶容量、処理時間も抑制されたため、文書に対する効率的なクラスタリングの並列処理を行うことができる。 FIG. 8 is a diagram schematically showing the flow of processing for document data. The generation unit 41 generates index data 31 as feature information indicating the features of each of the plurality of documents in the document data 30. The inter-document similarity calculation unit 42 calculates the similarity between a plurality of documents based on the generated index data 31, and stores the calculated similarity between documents as inter-document similarity data 32. The excluded document specifying unit 43 specifies a document in which similar documents other than the own document are not registered in the inter-document similarity data 32, and stores the specified document as excluded document data 33. The inter-set similarity calculation unit 44 calculates the similarity between sets of documents that are within a predetermined upper level of similarity between documents of each document, and uses the calculated similarity between sets as inter-set similarity data 34. Store. The classification unit 45 classifies documents with high similarity between sets into groups. The dividing unit 46 divides a plurality of documents into distributed data 36 in units of groups stored in the cluster data 35. The dividing unit 46 divides the excluded document stored in the excluded document data 33 into distributed data 36. 
By dividing into the distributed data 36 in this way, it is possible to suppress a similar document from being divided into different distributed data 36, so that parallel processing of clustering for each distributed data 36 can be performed efficiently. Further, since the number of documents to be clustered is reduced by the division, the storage capacity and the processing time are also suppressed, so that efficient clustering parallel processing can be performed on the documents.

［具体例］ [Specific example]
次に、実施例１に係る文書分割装置１０により文書を分割する具体例を説明する。図９は、文書データに含まれるクラスタリング対象の文書の文書名および文書に含まれる単語の一例を示す図である。図９の例では、クラスタリング対象の文書として文書名「ｄｏｃ０１」〜「ｄｏｃ１１」の１１個の文書が示されており、それぞれの文書名の横に文書に含まれる単語がアルファベットの記号として示されている。図９の例では、文書名ｄｏｃ０１の文書は、単語Ａ〜Ｊを含むことを示す。 Next, a specific example in which documents are divided by the document dividing device 10 according to the first embodiment will be described. FIG. 9 is a diagram illustrating an example of the document names of the documents to be clustered that are included in the document data and the words included in those documents. In the example of FIG. 9, eleven documents with document names "doc01" to "doc11" are shown as the documents to be clustered, and the words included in each document are shown as alphabetic symbols next to its document name. In the example of FIG. 9, the document with the document name doc01 includes the words A to J.

生成部４１は、クラスタリング対象の各文書に含まれる各単語がどの文書に含まれるかを示す索引データ３１を生成する。図１０は、索引データの一例を示す図である。図１０の例では、クラスタリング対象の文書に含まれる単語が一覧で示されており、それぞれの単語の横に単語を含む文書の文書名が示されている。図１０の例では、単語Ａは、文書名ｄｏｃ０１、ｄｏｃ０２、ｄｏｃ０３、ｄｏｃ０４、ｄｏｃ０５の各文書に含まれていることを示す。 The generation unit 41 generates index data 31 indicating which document includes each word included in each document to be clustered. FIG. 10 is a diagram illustrating an example of index data. In the example of FIG. 10, the words included in the clustering target document are shown in a list, and the document name of the document including the word is shown next to each word. In the example of FIG. 10, it is indicated that the word A is included in each document with the document names doc01, doc02, doc03, doc04, and doc05.

文書間類似度算出部４２は、索引データ３１に基づき、文書データ３０に記憶された複数の文書間の類似度を算出し、算出した類似度が第１閾値以上である相手文書の文書名および類似度を文書間類似度データ３２として記憶部２１に格納する。なお、第１閾値は０．２としている。図１１は、文書間類似度データの一例を示す図である。図１１の例では、クラスタリング対象の文書の文書名が示されており、それぞれの文書名の横に、類似度が第１閾値以上である相手の文書名と、類似度が、文書名：類似度の形式で類似度が大きい順位に並べ示されている。図１１の例では、文書名ｄｏｃ０１の文書は、文書名ｄｏｃ０１の文書との類似度が１．０であり、文書名ｄｏｃ０２の文書との類似度が０．８であり、文書名ｄｏｃ０３の文書との類似度が０．７であることを示す。また、図１１の例では、文書名ｄｏｃ０１の文書は、文書名ｄｏｃ０４の文書との類似度が０．４であり、文書名ｄｏｃ０５の文書との類似度が０．３であることを示す。 Based on the index data 31, the inter-document similarity calculation unit 42 calculates the similarity between the plurality of documents stored in the document data 30, and stores the document names and similarities of the counterpart documents whose calculated similarity is equal to or higher than the first threshold in the storage unit 21 as inter-document similarity data 32. The first threshold is 0.2 here. FIG. 11 is a diagram illustrating an example of inter-document similarity data. In the example of FIG. 11, the document names of the documents to be clustered are shown, and next to each document name, the names of the counterpart documents whose similarity is equal to or higher than the first threshold and their similarities are listed in the form document name:similarity, in descending order of similarity. In the example of FIG. 11, the document doc01 has a similarity of 1.0 to the document doc01, 0.8 to the document doc02, and 0.7 to the document doc03. The example of FIG. 11 also shows that the document doc01 has a similarity of 0.4 to the document doc04 and 0.3 to the document doc05.

除外文書特定部４３は、文書間類似度データ３２において自文書以外に類似する文書が登録されていない文書を特定し、特定した文書を除外文書データ３３として記憶部２１に格納する。図１２は、除外文書データの一例を示す図である。図１２の例では、図１１に示した文書間類似度データ３２から自文書以外に類似する文書が登録されていない文書名ｄｏｃ１１の文書が除外文書として記憶されている。 The excluded document specifying unit 43 specifies a document in which similar documents other than the own document are not registered in the inter-document similarity data 32, and stores the specified document in the storage unit 21 as excluded document data 33. FIG. 12 is a diagram illustrating an example of excluded document data. In the example of FIG. 12, a document with the document name doc11 in which no similar document other than the own document is registered from the inter-document similarity data 32 shown in FIG. 11 is stored as an excluded document.

集合間類似度算出部４４は、文書間類似度データ３２に基づき、各文書の文書間の類似度が上位４位以内の文書の集合間の類似度を算出し、各文書毎に、相手文書の文書名および集合間の類似度を集合間類似度データ３４として記憶部２１に格納する。図１３は、集合間類似度データの一例を示す図である。図１３の例では、クラスタリング対象の文書の文書名が示されており、それぞれの文書名の横に、相手の文書名と、類似する文書の集合間の類似度が、文書名：類似度の形式で類似度が大きい順位に並べ示されている。図１３の例では、文書名ｄｏｃ０１の文書は、文書名ｄｏｃ０１の文書との類似度が１．００であり、文書名ｄｏｃ０２の文書との類似度が０．９８であり、文書名ｄｏｃ０３の文書との類似度が０．９６であることを示す。また、図１３の例では、文書名ｄｏｃ０１の文書は、文書名ｄｏｃ０４の文書との類似度が０．７６であり、文書名ｄｏｃ０５の文書との類似度が０．４４であることを示す。 Based on the inter-document similarity data 32, the inter-set similarity calculation unit 44 calculates the similarity between the sets of documents ranked in the top four by inter-document similarity for each document, and stores, for each document, the document names of the counterpart documents and the inter-set similarities in the storage unit 21 as inter-set similarity data 34. FIG. 13 is a diagram illustrating an example of inter-set similarity data. In the example of FIG. 13, the document names of the documents to be clustered are shown, and next to each document name, the counterpart document names and the similarities between the sets of similar documents are listed in the form document name:similarity, in descending order of similarity. In the example of FIG. 13, the document doc01 has a similarity of 1.00 to the document doc01, 0.98 to the document doc02, and 0.96 to the document doc03. The example of FIG. 13 also shows that the document doc01 has a similarity of 0.76 to the document doc04 and 0.44 to the document doc05.

分類部４５は、集合間類似度データ３４に基づき、各文書を集合間の類似度が第２閾値以上である文書同士をグループに分類し、分類した各グループの文書をクラスタデータ３５として記憶部２１に格納する。なお、第２閾値は０．７としている。図１４は、クラスタデータの一例を示す図である。図１４の例では、グループ１として文書名ｄｏｃ０８〜ｄｏｃ１０の文書がクラスタとされ、グループ２として文書名ｄｏｃ０６、ｄｏｃ０７の文書がクラスタとされ、グループ３として文書名ｄｏｃ０１〜ｄｏｃ０５の文書がクラスタとされていることを示す。 Based on the inter-set similarity data 34, the classification unit 45 classifies documents whose inter-set similarity is equal to or higher than the second threshold into groups, and stores the documents of each group in the storage unit 21 as cluster data 35. The second threshold is 0.7 here. FIG. 14 is a diagram illustrating an example of cluster data. In the example of FIG. 14, the documents doc08 to doc10 form a cluster as group 1, the documents doc06 and doc07 form a cluster as group 2, and the documents doc01 to doc05 form a cluster as group 3.

分割部４６は、分類部４５により分類されたグループを文書数の多い順に並べ、文書数の多い順に各グループを順次、文書数が最も少ない分散データ３６にグループ単位で文書を分割する。また、分割部４６は、除外文書データ３３に記憶された除外文書も文書数が最も少ない分散データ３６に分割する。図１５は、分散データの一例を示す図である。図１５は、文書を２つの分散データ３６Ａ、３６Ｂに分割した例を示している。図１５の例では、グループ３の文書名ｄｏｃ０１〜ｄｏｃ０５の文書と、グループ２の文書名ｄｏｃ０６、ｄｏｃ０７の文書が分散データ３６Ａとして分割されていることを示す。また、図１５の例では、グループ１の文書名ｄｏｃ０８〜ｄｏｃ１０の文書と、文書名ｄｏｃ１１の除外文書が分散データ３６Ｂとして分割されていることを示す。 The dividing unit 46 arranges the groups classified by the classification unit 45 in descending order of the number of documents and, taking each group in that order, assigns its documents as a unit to the distributed data 36 holding the fewest documents. The dividing unit 46 also assigns the excluded documents stored in the excluded document data 33 to the distributed data 36 holding the fewest documents. FIG. 15 is a diagram illustrating an example of distributed data. FIG. 15 shows an example in which the documents are divided into two pieces of distributed data 36A and 36B. In the example of FIG. 15, the documents doc01 to doc05 of group 3 and the documents doc06 and doc07 of group 2 are assigned to distributed data 36A. The documents doc08 to doc10 of group 1 and the excluded document doc11 are assigned to distributed data 36B.

このように文書データ３０を分散データ３６Ａ、３６Ｂに分割することにより文書に対する効率的なクラスタリングの並列処理を実行させることができる。 As described above, by dividing the document data 30 into the distributed data 36A and 36B, it is possible to execute an efficient clustering parallel process for the document.

［処理の流れ］ [Process flow]
次に、本実施例に係る文書分割装置１０によりクラスタリング対象の複数の文書を分割する分割処理の流れを説明する。図１６は、分割処理の手順を示すフローチャートである。この分割処理は、記憶部２１に文書データ３０が格納された後に分割開始を指示する所定操作が行われたタイミングで実行される。 Next, the flow of the dividing process in which the document dividing device 10 according to the present embodiment divides a plurality of documents to be clustered will be described. FIG. 16 is a flowchart illustrating the procedure of the dividing process. This dividing process is executed when a predetermined operation instructing the start of division is performed after the document data 30 has been stored in the storage unit 21.

図１６に示すように、生成部４１は、文書の特徴を示す特徴情報として、文書データ３０に示される各文書に含まれる各単語がどの文書に含まれるかを示す索引データ３１を生成する索引生成処理を実行する（ステップＳ１０）。文書間類似度算出部４２は、生成された索引データ３１に基づき、文書データ３０に記憶された複数の文書間の類似度を算出する文書間類似度算出処理を実行する（ステップＳ１１）。除外文書特定部４３は、何れの文書とも類似していない除外文書の特定する除外文書特定処理を実行する（ステップＳ１２）。文書間類似度算出部４２は、文書間類似度算出処理により算出された比較元文書、相手文書、類似度の組を、比較元文書毎に、[文書名，類似度]の形式で類似度が大きい順位に並べ文書間類似度データ３２として記憶部２１に格納する（ステップＳ１３）。集合間類似度算出部４４は、各文書の文書間の類似度が上位の所定位以内の文書の集合間の類似度を算出する集合間類似度算出処理を実行する（ステップＳ１４）。分類部４５は、文書データ３０に記憶された各文書を集合間の類似度の高い文書同士をグループに分類する分類処理を実行する（ステップＳ１５）。分割部４６は、各文書を分散処理させる分散データに分割する分割処理を実行し（ステップＳ１６）、処理を終了する。 As illustrated in FIG. 16, the generation unit 41 executes an index generation process that generates, as feature information indicating the features of the documents, index data 31 indicating which documents contain each word included in the documents of the document data 30 (step S10). The inter-document similarity calculation unit 42 executes an inter-document similarity calculation process that calculates the similarity between the plurality of documents stored in the document data 30 based on the generated index data 31 (step S11). The excluded document specifying unit 43 executes an excluded document specifying process that specifies excluded documents, which are not similar to any other document (step S12). The inter-document similarity calculation unit 42 arranges the sets of comparison source document, counterpart document, and similarity calculated by the inter-document similarity calculation process, for each comparison source document, in the form [document name, similarity] in descending order of similarity, and stores them in the storage unit 21 as inter-document similarity data 32 (step S13). The inter-set similarity calculation unit 44 executes an inter-set similarity calculation process that calculates the similarity between the sets of documents ranked within the predetermined top rank of inter-document similarity for each document (step S14). The classification unit 45 executes a classification process that classifies the documents stored in the document data 30 into groups of documents having high inter-set similarity (step S15). The dividing unit 46 executes a dividing process that divides the documents into distributed data for distributed processing (step S16), and ends the process.

次に、本実施例に係る索引生成処理の流れについて説明する。図１７は、索引生成処理の手順を示すフローチャートである。この索引生成処理は、図１６に示す分割処理のステップＳ１０から呼び出されて実行される。 Next, the flow of index generation processing according to the present embodiment will be described. FIG. 17 is a flowchart showing the procedure of index generation processing. This index generation process is called and executed from step S10 of the division process shown in FIG.

As shown in FIG. 17, the generation unit 41 initializes a storage area for storing words and document names (step S20). The generation unit 41 sets a pointer A, which marks the document being processed, to the first of the documents in the document data 30 (step S21). The generation unit 41 extracts all words from the document indicated by the pointer A (step S22). The generation unit 41 removes duplicates from the extracted words, keeping one instance of each, so that every word is unique (step S23). The generation unit 41 registers each extracted word in association with the name of the document indicated by the pointer A (step S24). The generation unit 41 advances the pointer A to the next document (step S25) and determines whether there is a document indicated by the pointer A (step S26). If there is such a document (Yes at step S26), the process returns to step S22.

If there is no document indicated by the pointer A (No at step S26), the generation unit 41 sorts the words and document names extracted from the documents in character-code order of the words (step S27). The generation unit 41 then converts the sorted words and document names in order to generate the index data 31, which indicates which documents contain each word (step S28), and returns to the caller.
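The index generation of steps S20 to S28 can be sketched as building an inverted index. The following is a minimal illustration, not the implementation of the embodiment: the function name `build_index`, the dictionary input, and whitespace splitting as the word extraction method are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Build an inverted index mapping each word to the documents containing it.

    `documents` maps a document name to its text. Splitting on whitespace is
    an assumption that stands in for the actual word extraction.
    """
    pairs = []                                    # S20: initialise storage
    for name, text in documents.items():          # S21, S25, S26: walk the documents
        words = set(text.split())                 # S22, S23: extract and deduplicate
        pairs.extend((word, name) for word in words)  # S24: register (word, name)
    pairs.sort()                                  # S27: sort by word (code-point order)
    index = defaultdict(list)
    for word, name in pairs:                      # S28: convert to word -> documents
        index[word].append(name)
    return dict(index)
```

Because the pairs are sorted before conversion, both the words and the document names under each word come out in a stable order, which keeps later lookups deterministic.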

Next, the flow of the inter-document similarity calculation process according to the present embodiment will be described. FIG. 18 is a flowchart showing the procedure of the inter-document similarity calculation process. This process is called from step S11 of the division process shown in FIG. 16 and executed.

As illustrated in FIG. 18, the inter-document similarity calculation unit 42 sets a first threshold for similarity (step S30). The inter-document similarity calculation unit 42 sets a pointer A, which marks the document being processed, to the first of the documents in the document data 30 (step S31). The inter-document similarity calculation unit 42 extracts all words from the document indicated by the pointer A (step S32). Based on the index data 31, the inter-document similarity calculation unit 42 calculates the similarity between that document and the documents stored in the document data 30, including the document itself (step S33). The inter-document similarity calculation unit 42 stores the document names and similarities of the partner documents whose calculated similarity is equal to or greater than the first threshold in the storage unit 21 as the inter-document similarity data 32 (step S34). The inter-document similarity calculation unit 42 advances the pointer A to the next document (step S35) and determines whether there is a document indicated by the pointer A (step S36). If there is such a document (Yes at step S36), the process returns to step S32. If there is not (No at step S36), the process returns to the caller.
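A rough sketch of steps S30 to S36 follows. This passage does not fix the similarity measure, so the Jaccard coefficient over word sets is used here purely as an assumption; the function name `inter_document_similarities` and the data shapes are likewise hypothetical. The index is used to limit comparison to documents that share at least one word.

```python
def inter_document_similarities(documents, index, threshold):
    """Return {name: [(partner, similarity), ...]} sorted by descending similarity.

    `documents` maps names to text, `index` maps words to document names.
    Assumption: similarity is the Jaccard coefficient of the two word sets.
    """
    word_sets = {name: set(text.split()) for name, text in documents.items()}
    result = {}
    for name, words in word_sets.items():               # S31, S35, S36
        # S32, S33: candidate partners are the documents sharing a word,
        # looked up through the index (this includes the document itself).
        candidates = {other for w in words for other in index.get(w, [])}
        scored = []
        for other in candidates:
            shared = len(words & word_sets[other])
            total = len(words | word_sets[other])
            similarity = shared / total
            if similarity >= threshold:                  # S34: keep if >= first threshold
                scored.append((other, similarity))
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[name] = scored
    return result
```

Sorting each list by descending similarity here also covers the ordering that step S13 of FIG. 16 applies before the data is stored.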

Next, the flow of the excluded document specifying process according to the present embodiment will be described. FIG. 19 is a flowchart showing the procedure of the excluded document specifying process. This process is called from step S12 of the division process shown in FIG. 16 and executed.

As illustrated in FIG. 19, the excluded document specifying unit 43 sets a pointer A, which marks the document being processed, to the first of the documents in the document data 30 (step S40). The excluded document specifying unit 43 initializes the excluded document data 33 (step S41). The excluded document specifying unit 43 determines whether the document indicated by the pointer A has no document, other than itself, whose similarity is equal to or greater than the first threshold (step S42). If no such document exists (Yes at step S42), the excluded document specifying unit 43 registers the document indicated by the pointer A in the excluded document data 33 (step S43). If such a document exists (No at step S42), the process proceeds to step S44.

The excluded document specifying unit 43 advances the pointer A to the next document (step S44) and determines whether there is a document indicated by the pointer A (step S45). If there is such a document (Yes at step S45), the process returns to step S42. If there is not (No at step S45), the process returns to the caller.
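Steps S40 to S45 amount to a single pass over the thresholded similarity data. A minimal sketch, assuming the data has the hypothetical shape `{name: [(partner, similarity), ...]}` and already contains only partners at or above the first threshold:

```python
def find_excluded(similarity_data):
    """Return the documents whose only sufficiently similar partner is themselves."""
    excluded = []                                          # S41: initialise
    for name, partners in similarity_data.items():         # S40, S44, S45
        has_other = any(other != name for other, _ in partners)  # S42
        if not has_other:
            excluded.append(name)                          # S43: register as excluded
    return excluded
```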

Next, the flow of the inter-set similarity calculation process according to the present embodiment will be described. FIG. 20 is a flowchart showing the inter-set similarity calculation process. This process is called from step S14 of the division process shown in FIG. 16 and executed.

As illustrated in FIG. 20, the inter-set similarity calculation unit 44 sets a pointer A, which marks the document being processed, to the first of the documents in the document data 30 (step S50). The inter-set similarity calculation unit 44 sets a pointer B to the document following the one indicated by the pointer A (step S51). The inter-set similarity calculation unit 44 calculates the similarity between two sets: the documents ranked in the top four by similarity to the document indicated by the pointer A, and the documents ranked in the top four by similarity to the document indicated by the pointer B (step S52). The inter-set similarity calculation unit 44 stores the calculated value as the inter-set similarity between the sets of documents similar to the document indicated by the pointer A and to the document indicated by the pointer B (step S53).

The inter-set similarity calculation unit 44 advances the pointer B to the next document (step S54) and determines whether there is a document indicated by the pointer B (step S55). If there is such a document (Yes at step S55), the process returns to step S52. If there is not (No at step S55), the inter-set similarity calculation unit 44 advances the pointer A to the next document (step S56) and determines whether there is a document indicated by the pointer A (step S57). If there is such a document (Yes at step S57), the process returns to step S51. If there is not (No at step S57), the process returns to the caller.
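The nested pointer loop of steps S50 to S57 visits every unordered pair of documents once. The sketch below treats each document as a sparse vector over its top-ranked similar documents, with the similarities as element values, and compares the vectors. Using cosine similarity for the vector comparison is an assumption made for this example, as are the function names and the `{name: [(partner, similarity), ...]}` input shape.

```python
import math


def top_k_vector(partners, k=4):
    """Sparse vector: the k most similar documents mapped to their similarity."""
    ranked = sorted(partners, key=lambda pair: (-pair[1], pair[0]))[:k]
    return dict(ranked)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def inter_set_similarities(similarity_data, k=4):
    """Return {(a, b): similarity} for every pair with a listed before b."""
    names = list(similarity_data)
    vectors = {name: top_k_vector(similarity_data[name], k) for name in names}
    result = {}
    for i, a in enumerate(names):                 # pointer A: S50, S56, S57
        for b in names[i + 1:]:                   # pointer B: S51, S54, S55
            result[(a, b)] = cosine(vectors[a], vectors[b])  # S52, S53
    return result
```

Two documents that rank largely the same neighbours highly end up with a score near 1 even when their direct similarity is modest, which is what lets the later grouping keep related documents together.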

Next, the flow of the classification process according to the present embodiment will be described. FIG. 21 is a flowchart showing the classification process. This process is called from step S15 of the division process shown in FIG. 16 and executed.

As shown in FIG. 21, the classification unit 45 arranges the documents stored in the inter-set similarity data 34 in descending order of inter-set similarity (step S60). The classification unit 45 sets a pointer A, which marks the document being processed, to the first document in that order (step S61). The classification unit 45 classifies the document indicated by the pointer A into a group (step S62). The classification unit 45 advances the pointer A to the next document (step S63) and determines whether there is a document indicated by the pointer A (step S64). If there is such a document (Yes at step S64), the process proceeds to step S65. If there is not (No at step S64), the process proceeds to step S70.

The classification unit 45 determines whether the documents already assigned to groups include a document whose inter-set similarity with the document indicated by the pointer A is equal to or greater than a second threshold (step S65). If such a document exists (Yes at step S65), the classification unit 45 adds the document indicated by the pointer A to the group containing that document (step S66), and the process proceeds to step S63. If no such document exists (No at step S65), the classification unit 45 determines whether the document indicated by the pointer A has any inter-set similarity equal to or greater than the second threshold (step S67). If it does (Yes at step S67), the classification unit 45 classifies the document indicated by the pointer A into a new group (step S68), and the process proceeds to step S63. If it does not (No at step S67), the classification unit 45 stores the document indicated by the pointer A as "other" (step S69), and the process proceeds to step S63.

At step S70, the classification unit 45 sets a pointer B, which marks the document being processed, to the first document stored as "other". The classification unit 45 adds the document indicated by the pointer B to the group containing the document with which it has the highest inter-document similarity (step S71). The classification unit 45 advances the pointer B to the next document (step S72) and determines whether there is a document indicated by the pointer B (step S73). If there is such a document (Yes at step S73), the process returns to step S71. If there is not (No at step S73), the process returns to the caller.
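The classification of steps S60 to S73 can be read as a greedy grouping followed by a clean-up pass for the documents stored as "other". The sketch below is one possible reading, not the embodiment's implementation. Its assumptions: each document is ordered by its highest inter-set similarity, the first matching grouped document decides the group, excluded documents have been removed beforehand, and an "other" document with no grouped neighbour starts its own group. The name `classify` and the input shapes are hypothetical.

```python
def classify(set_sims, doc_sims, threshold):
    """Group documents by inter-set similarity.

    set_sims: {(a, b): inter-set similarity}
    doc_sims: {name: [(partner, inter-document similarity), ...]}
    threshold: the second threshold.
    """
    def pair_sim(a, b):
        return set_sims.get((a, b), set_sims.get((b, a), 0.0))

    names = list(doc_sims)
    # Assumption: the sort key is each document's highest inter-set similarity.
    best = {n: max((pair_sim(n, m) for m in names if m != n), default=0.0)
            for n in names}
    ordered = sorted(names, key=lambda n: (-best[n], n))           # S60

    groups, group_of, others = [], {}, []
    for position, name in enumerate(ordered):                       # S61, S63, S64
        match = next((m for m in group_of
                      if pair_sim(name, m) >= threshold), None)      # S65
        if match is not None:
            gid = group_of[match]                                   # S66
        elif position == 0 or best[name] >= threshold:              # S62, S67, S68
            groups.append([])
            gid = len(groups) - 1
        else:
            others.append(name)                                     # S69
            continue
        groups[gid].append(name)
        group_of[name] = gid

    for name in others:                                             # S70 to S73
        ranked = sorted(doc_sims[name], key=lambda pair: -pair[1])
        gid = next((group_of[o] for o, _ in ranked
                    if o != name and o in group_of), None)          # S71
        if gid is None:          # assumption: no grouped neighbour, start a group
            groups.append([])
            gid = len(groups) - 1
        groups[gid].append(name)
        group_of[name] = gid
    return groups
```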

Next, the flow of the division process according to the present embodiment will be described. FIG. 22 is a flowchart showing the division process. This process is called from step S16 of the overall division process shown in FIG. 16 and executed.

As shown in FIG. 22, when the documents are to be divided into n parts, the dividing unit 46 prepares n areas for storing the divided documents (step S80). The dividing unit 46 arranges the groups produced by the classification process in descending order of the number of documents (step S81). The dividing unit 46 sets a pointer A, which marks the group being processed, to the first group in that order (step S82). The dividing unit 46 compares the numbers of documents stored in the n areas and stores the documents of the group indicated by the pointer A in the area holding the fewest documents (step S83). The dividing unit 46 advances the pointer A to the next group (step S84) and determines whether there is a group indicated by the pointer A (step S85). If there is such a group (Yes at step S85), the process returns to step S83.

If there is no group indicated by the pointer A (No at step S85), the dividing unit 46 sets a pointer B, which marks the document being processed, to the first excluded document stored in the excluded document data 33 (step S86). The dividing unit 46 compares the numbers of documents stored in the n areas and stores the document indicated by the pointer B in the area holding the fewest documents (step S87). The dividing unit 46 advances the pointer B to the next excluded document (step S88) and determines whether there is a document indicated by the pointer B (step S89). If there is such a document (Yes at step S89), the process returns to step S87. If there is not (No at step S89), the dividing unit 46 stores the documents held in each of the n areas in the storage unit 21 as the distributed data 36 (step S90) and returns to the caller.
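Steps S80 to S90 describe a greedy balancing scheme: the largest groups are placed first, each into the currently smallest area, and the excluded documents then fill the remaining gaps one at a time. A minimal sketch, with the function name `divide` and list-based inputs assumed for illustration:

```python
def divide(groups, excluded, n):
    """Split grouped and excluded documents into n areas of similar size."""
    areas = [[] for _ in range(n)]                        # S80: prepare n areas
    for group in sorted(groups, key=len, reverse=True):   # S81, S82, S84, S85
        min(areas, key=len).extend(group)                 # S83: smallest area so far
    for document in excluded:                             # S86, S88, S89
        min(areas, key=len).append(document)              # S87
    return areas                                          # S90: one list per part
```

Groups are never split across areas, so documents that were judged similar stay in the same part; only the excluded documents, which have no similar partners, are distributed individually.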

The output unit 47 outputs the distributed data 36 produced by the division process described above, as the data to be clustered, to the computers or processors that execute clustering.

Next, the flow of the clustering process performed by the document dividing device 10 according to the present embodiment will be described. FIG. 23 is a flowchart illustrating the procedure of the clustering process. This clustering process may be executed immediately after the division process described above, or at the time a predetermined operation instructing clustering is performed.

As shown in FIG. 23, the clustering unit 48 calculates the similarity between the documents included in the distributed data 36 assigned to it for processing (step S100). Based on the calculated inter-document similarities, the clustering unit 48 performs clustering that groups similar documents together (step S101). The clustering unit 48 merges its clustering result with the results of the other processors that executed clustering (step S102), and the process ends.

[Effects of Embodiment 1]
The document dividing device 10 according to the present embodiment generates the index data 31 as feature information indicating the features of each of the documents in the document data 30. Based on the generated index data 31, the document dividing device 10 calculates the similarity between the documents. For each document, the document dividing device 10 calculates the similarity between the sets of documents ranked within a predetermined top rank by inter-document similarity. The document dividing device 10 then classifies documents having high inter-set similarity into groups, and divides the documents, in units of the classified groups, into the distributed data 36 for distributed processing. As a result, the document dividing device 10 according to the present embodiment enables parallel clustering of documents to be performed efficiently.

In addition, the document dividing device 10 according to the present embodiment regards each document as a vector whose elements are the documents ranked within the predetermined top rank by similarity to that document and whose element values are those similarities, and calculates the inter-set similarity from the similarity between the vectors. This allows documents whose sets of similar documents are highly similar to be placed in the same group, which suppresses similar documents being split into different distributed data 36.

Further, the document dividing device 10 according to the present embodiment outputs each piece of divided distributed data 36, together with the inter-document similarities of the documents included in that distributed data 36, to the processing unit that executes clustering on it. This reduces the processing needed to calculate inter-document similarities.

Further, the document dividing device 10 according to the present embodiment divides the documents into the distributed data 36 so that the numbers of documents are roughly equal. As a result, even when the clustering of the distributed data 36 is processed in parallel by a plurality of computers or processors, the load is kept from concentrating on a specific computer or processor.

An embodiment of the disclosed device has been described above, but the present invention may be implemented in various forms other than that embodiment. Other embodiments included in the present invention are therefore described below.

For example, the above embodiment describes the case where the documents are divided into the distributed data 36 so that the numbers of documents are roughly equal, but the disclosed device is not limited to this. For example, the documents of each group may be stored in the distributed data 36 in turn, in descending order of the number of documents per group. The excluded documents stored in the excluded document data 33 may also be stored in one specific piece of distributed data 36.

The above embodiment also describes the case where the clustering unit 48 performs clustering of one of the pieces of distributed data 36, but the disclosed device is not limited to this. For example, all of the distributed data 36 may be clustered by external computers or processors.

The above embodiment also describes the case where words are extracted from each document and the similarity between documents is calculated from them, but the disclosed device is not limited to this. For example, when documents include images, the similarity between the images may also be taken into account in calculating the similarity between the documents.

The above embodiment also describes the case where a document that has no similar document other than itself is stored in the excluded document data 33 as an excluded document, but the disclosed device is not limited to this. For example, each document that has no similar document other than itself may form a group on its own. In that case, dividing the documents into the distributed data 36 in units of groups also assigns the excluded documents to the distributed data 36.

The above embodiment also describes the case where the documents included in the document data 30 are divided into the distributed data 36 so that the numbers of documents are roughly equal, but the disclosed device is not limited to this. For example, information indicating the processing capability of each processing unit that executes clustering, such as a computer or processor, may be acquired, and the documents included in the document data 30 may be divided into the distributed data 36 at a ratio corresponding to the processing capability of each processing unit. Examples of information indicating processing capability include the CPU clock frequency, the amount of memory, a score measured by software that benchmarks processing performance, and a value set by an administrator according to the processing capability. Because each processing unit that executes clustering is then assigned distributed data 36 containing a number of documents that matches its processing capability, the load on any specific processing unit is kept from becoming excessive.
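One way to realize a capability-proportional split is to reuse the greedy placement of the division process but compare load relative to capability instead of raw document counts. The sketch below is only an illustration under that assumption; the name `divide_by_capability` and the use of a single positive number per processing unit as its capability score are hypothetical.

```python
def divide_by_capability(groups, capabilities):
    """Assign groups to areas so document counts roughly follow the capability ratio.

    `capabilities` holds one positive score per processing unit, for example a
    benchmark result or an administrator-defined weight.
    """
    areas = [[] for _ in capabilities]
    for group in sorted(groups, key=len, reverse=True):   # largest groups first
        # Choose the area whose load is lowest relative to its capability.
        target = min(range(len(areas)),
                     key=lambda i: len(areas[i]) / capabilities[i])
        areas[target].extend(group)
    return areas
```

With capabilities of 2.0 and 1.0, for instance, the first area tends to receive about twice as many documents as the second, while each group still stays in one piece.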

[Distribution and Integration]
The components of each illustrated device do not necessarily need to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the processing units of the document dividing device 10, namely the registration unit 40, the generation unit 41, the inter-document similarity calculation unit 42, the excluded document specifying unit 43, the inter-set similarity calculation unit 44, the classification unit 45, the dividing unit 46, the output unit 47, and the clustering unit 48, may be integrated as appropriate. The processing of each processing unit may also be separated into the processing of a plurality of processing units as appropriate. Furthermore, the processing units may be provided in separate devices that are connected over a network and cooperate to realize the functions of the document dividing device 10.

[Division Program]
The various processes described in the above embodiment can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. An example of a computer that executes a division program having the same functions as the above embodiment is therefore described below with reference to FIG. 24.

FIG. 24 is a diagram for describing an example of a computer that executes the division program. As illustrated in FIG. 24, the computer 200 includes an operation unit 210, a display 220, and a communication unit 230. The computer 200 further includes a CPU 250, a ROM 260, an HDD 270, and a RAM 280. These units 210 to 280 are connected via a bus 240.

The HDD 270 stores in advance a division program 270a that provides the same functions as the registration unit 40, the generation unit 41, the inter-document similarity calculation unit 42, the excluded document specifying unit 43, the inter-set similarity calculation unit 44, the classification unit 45, the dividing unit 46, the output unit 47, and the clustering unit 48. Like the components described in Embodiment 1, the division program 270a may be integrated or separated as appropriate. That is, not all of the data need always be stored in the HDD 270; it is sufficient for the HDD 270 to store only the data necessary for processing.

The CPU 250 reads the division program 270a from the HDD 270 and loads it into the RAM 280. As a result, as shown in FIG. 24, the division program 270a functions as a division process 280a. The division process 280a loads the various data read from the HDD 270 into the area of the RAM 280 allocated to it as appropriate, and executes various processes based on the loaded data. The division process 280a includes the processes executed by the registration unit 40, the generation unit 41, the inter-document similarity calculation unit 42, the excluded document specifying unit 43, the inter-set similarity calculation unit 44, the classification unit 45, the dividing unit 46, the output unit 47, and the clustering unit 48, for example the processes shown in FIGS. 16 to 23. That is, the division process 280a performs the same operations as those processing units. Note that not all of the processing units virtually realized on the CPU 250 need always operate on the CPU 250; it is sufficient for only the processing units necessary for processing to be virtually realized.

The division program 270a does not necessarily need to be stored in the HDD 270 or the ROM 260 from the beginning. For example, each program may be stored on a "portable physical medium" inserted into the computer 200, such as a flexible disk (FD), a CD-ROM, a DVD, a magneto-optical disk, or an IC card, and the computer 200 may acquire each program from such a medium and execute it. Alternatively, each program may be stored on another computer or server device connected to the computer 200 via a public line, the Internet, a LAN, a WAN, or the like, and the computer 200 may acquire each program from there and execute it.

Description of symbols
10 Document dividing device
21 Storage unit
22 Control unit
30 Document data
31 Index data
32 Inter-document similarity data
33 Excluded document data
34 Inter-set similarity data
35 Cluster data
36 Distributed data
41 Generation unit
42 Inter-document similarity calculation unit
43 Excluded document specification unit
44 Inter-set similarity calculation unit
45 Classification unit
46 Division unit
47 Output unit
48 Clustering unit

Claims

A document dividing device comprising:
a generation unit that generates feature information indicating the features of each of a plurality of documents to be clustered;
an inter-document similarity calculation unit that calculates the similarity between the plurality of documents based on the feature information generated by the generation unit;
an inter-set similarity calculation unit that calculates the similarity between sets of documents, each set consisting of the documents whose inter-document similarity to a respective document ranks within a predetermined number of top positions;
a classification unit that classifies documents having high similarity between the sets into groups; and
a division unit that divides the plurality of documents in units of the groups classified by the classification unit.
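The classification step recited above can be illustrated with a minimal sketch. The claim only states that documents with high inter-set similarity are put into the same group; the fixed `threshold`, the union-find merging, and the names `classify_into_groups` and `set_sims` are assumptions made for illustration and are not taken from the specification.

```python
def classify_into_groups(doc_ids, set_sims, threshold):
    """Group documents whose inter-set similarity is at least `threshold`.

    doc_ids:  iterable of document identifiers
    set_sims: dict mapping (doc_a, doc_b) pairs to an inter-set similarity
    Assumption: "high similarity" is modelled as a simple threshold, and
    groups are the connected components of the resulting graph.
    """
    parent = {d: d for d in doc_ids}

    def find(d):
        # Follow parent links to the representative, halving the path as we go.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (a, b), s in set_sims.items():
        if s >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for d in doc_ids:
        groups.setdefault(find(d), []).append(d)
    return list(groups.values())
```

Each returned list is one group; a later division step can then move whole groups around without separating documents that are likely to end up in the same cluster.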

The document dividing device according to claim 1, wherein the inter-set similarity calculation unit regards each document as a vector whose elements are the documents ranking within the predetermined number of top positions in similarity to that document and whose element values are those similarities, and calculates the similarity between the sets of documents from the similarity between the vectors.
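As a rough illustration of this vector representation, the sketch below builds, for each document, a sparse vector of its top-k most similar documents and compares two such vectors. The claim does not name the vector similarity measure; cosine similarity is used here as an assumption, and the function names and sample similarity values are hypothetical. Whether a document's own entry belongs in its vector is also not stated in the claim, so this sketch leaves it out.

```python
import math

def top_k_vector(doc_id, sims, k):
    """Sparse vector for doc_id: keys are its k most similar documents,
    values are the corresponding inter-document similarities."""
    ranked = sorted(sims[doc_id].items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (an assumed measure)."""
    dot = sum(w * v[d] for d, w in u.items() if d in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def inter_set_similarity(a, b, sims, k):
    """Similarity between the top-k neighbour sets of documents a and b."""
    return cosine(top_k_vector(a, sims, k), top_k_vector(b, sims, k))

# Hypothetical inter-document similarities, made up for illustration only.
sims = {
    "d1": {"d2": 0.9, "d3": 0.8, "d4": 0.1, "d5": 0.0},
    "d2": {"d1": 0.9, "d3": 0.7, "d4": 0.0, "d5": 0.1},
    "d3": {"d1": 0.8, "d2": 0.7, "d4": 0.1, "d5": 0.1},
    "d4": {"d1": 0.1, "d2": 0.0, "d3": 0.1, "d5": 0.9},
    "d5": {"d1": 0.0, "d2": 0.1, "d3": 0.1, "d4": 0.9},
}
```

With `k = 2`, documents `d1` and `d2` share the neighbour `d3` and therefore get a positive inter-set similarity, while `d1` and `d4` share no top-ranked neighbour and get zero.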

The document dividing device according to claim 1 or 2, further comprising an output unit that outputs each piece of distributed data obtained by the division unit dividing the plurality of documents, together with the inter-document similarities of the documents included in that distributed data, to a respective processing unit that executes clustering.

The document dividing device according to claim 1, wherein the division unit divides the plurality of documents into distributed data so that the number of documents in each piece of distributed data is equal.
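One way to combine division in units of groups with equal document counts is sketched below. The greedy largest-group-first assignment and the name `divide_groups` are assumptions for illustration; the claim does not specify how the balance is achieved, and a greedy pass only approximates equality when group sizes do not allow an exact split.

```python
import heapq

def divide_groups(groups, n_parts):
    """Assign whole groups to n_parts pieces of distributed data.

    groups:  list of lists of document identifiers (one list per group)
    n_parts: number of pieces of distributed data to produce
    Assumption: groups are never split, and each group goes to the piece
    that currently holds the fewest documents.
    """
    parts = [[] for _ in range(n_parts)]
    heap = [(0, i) for i in range(n_parts)]  # (document count, piece index)
    heapq.heapify(heap)
    for group in sorted(groups, key=len, reverse=True):
        count, i = heapq.heappop(heap)       # piece with the fewest documents
        parts[i].extend(group)
        heapq.heappush(heap, (count + len(group), i))
    return parts
```

For example, groups of sizes 3, 2, 2, and 1 divided into two pieces end up as two pieces of four documents each, with every group kept intact.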

A document dividing program that causes a computer to execute a process comprising:
generating feature information indicating the features of each of a plurality of documents to be clustered;
calculating the similarity between the plurality of documents based on the generated feature information;
calculating the similarity between sets of documents, each set consisting of the documents whose inter-document similarity to a respective document ranks within a predetermined number of top positions;
classifying documents having high similarity between the sets into groups; and
dividing the plurality of documents in units of the classified groups.

A document dividing method in which a computer executes a process comprising:
generating feature information indicating the features of each of a plurality of documents to be clustered;
calculating the similarity between the plurality of documents based on the generated feature information;
calculating the similarity between sets of documents, each set consisting of the documents whose inter-document similarity to a respective document ranks within a predetermined number of top positions;
classifying documents having high similarity between the sets into groups; and
dividing the plurality of documents in units of the classified groups.