JP5621773B2 - Classification hierarchy re-creation system, classification hierarchy re-creation method, and classification hierarchy re-creation program - Google Patents

Info

Publication number: JP5621773B2
Application number: JP2011521779A
Authority: JP (Japan)
Legal status: Active
Other versions: JPWO2011004529A1 (Japanese)
Inventors: 弘紀 水口, 大 久寿居
Original assignee: 日本電気株式会社 (NEC Corporation)
Priority application: JP2009160071
International application: PCT/JP2010/002855, published as WO2011004529A1
Prior art keywords: classification, hierarchy, data, group

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types

Description

  The present invention relates to a classification hierarchy re-creation system, a classification hierarchy re-creation method, and a classification hierarchy re-creation program for reconstructing a hierarchical classification and creating a new classification hierarchy.

Patent Document 1 describes a data division method that divides multidimensional data, organized into items having a hierarchical structure, into groups appropriate for the purpose of analysis. On receiving a data group and the classification hierarchy of that data group, the data dividing device described in Patent Document 1 outputs a classification hierarchy obtained by deleting non-characteristic hierarchies from the input hierarchy based on the distribution of the received data group. Specifically, with a specific classification as the division target, the device performs a statistical test based on the distribution of the data group (the division target group) and determines an attribute indicating whether or not the division target group is characteristic. Next, the dividing unit divides the division target group into child groups belonging to the child hierarchy based on the determination result, and sets these as the new division targets. Then, the integration unit integrates non-characteristic child groups into the parent group based on the attribute given by the determination result. In effect, the integration unit deletes the non-characteristic hierarchies and leaves only the characteristic ones. As a result, classifications down to a characteristic child hierarchy can be obtained by tracing the output classification hierarchy in order from the parent classification.

Patent Document 2 describes a term dictionary generation method that outputs relationships between terms based on input document data. In the term dictionary generation method described in Patent Document 2, related words are first selected based on each word and the position information of the document data. Next, a graph with the words and related words as nodes is created. In addition, co-occurrence statistics are calculated for every combination of two nodes in the graph, and similarities are calculated from a synonym dictionary and other document data. The graph is then transformed based on conversion rules that use the co-occurrence statistics and the similarity values.

Patent Document 3 describes a document organizing apparatus that automatically classifies, with high accuracy, a large number of documents stored in an information processing apparatus according to the characteristics of the document group. The document organizing apparatus described in Patent Document 3 defines a support sup(H → B) and a certainty factor conf(H → B) representing the co-occurrence frequency of a keyword pair (H, B). The XY plane defined by the points (X, Y) = (conf(kw → wi), conf(wi → kw)) is then divided into five regions, and hierarchical, equivalence, and associative relationships are determined.

Patent Document 4 describes a classification system generating device that automatically constructs a hierarchical classification system from a flat classification frame. In the classification system generation device described in Patent Document 4, clusters are first generated by clustering starting from a non-hierarchical (that is, flat) classification frame. A hierarchical classification system is then prepared using the generated clusters as upper classification frames, and the hierarchy is extended by re-clustering the upper classification frames (that is, clusters) whose classification accuracy is lower than a reference value. In addition, the classification system generation apparatus described in Patent Document 4 stores the classification system of the document classification unit in the classification system storage unit and optimizes it when the classification accuracy of the existing classification system falls below the reference value or when the classification system is to be modified according to the situation. The classification accuracy is then improved by evaluating and changing the classification based on classified documents input from the document input unit or on sample documents representing the situation.

JP 2008-299382 A (paragraphs 0027, 0047-0048, 0079)
JP 11-96177 A (paragraphs 0015-0017, FIG. 1)
JP 2005-266866 A (paragraphs 0021, 0051, FIG. 4)
JP 2000-10996 A (paragraphs 0081, 0084-0085, FIG. 11)

In the data dividing method described in Patent Document 1, hierarchies that are not characteristic are deleted, so there is a problem that data can no longer be classified under a hierarchy that has been deleted. For example, the method works well if a viewpoint matching the data characteristics already exists in the classification hierarchy, but if no such viewpoint exists, an appropriate classification hierarchy cannot be obtained. Moreover, even for hierarchies that are not division targets, it cannot create a classification that takes the hierarchical relationships of those hierarchies into account, or a classification that integrates classifications with the same meaning (for example, when classification 1 and classification 2 are assigned to exactly the same data, they could be merged into a single classification with the same meaning).

In addition, the data division method described in Patent Document 1 is not efficient, because a determination must be made for every hierarchy in order to decide whether each hierarchy is characteristic. Similarly, the term dictionary generation method described in Patent Document 2 is not efficient, because co-occurrence statistics and similarities must be calculated for the relationships between the words corresponding to all nodes in order to convert the relationships between nodes. The document organizing apparatus described in Patent Document 3 is likewise inefficient, because it generates a directory file based on all stored keywords.

Further, the classification system generation device described in Patent Document 4 stratifies classification frames by repeatedly clustering them based on the degree of association with a sample document. However, since the degree of relevance is determined from the appearance frequency of words in each cluster, the device described in Patent Document 4 cannot create a classification that takes the hierarchical relationships of hierarchies into account, nor a classification that integrates classifications with the same meaning.

Therefore, an object of the present invention is to provide a classification hierarchy re-creation system, a classification hierarchy re-creation method, and a classification hierarchy re-creation program that, when reconstructing an existing classification hierarchy to create a new one, efficiently create a classification hierarchy that takes the hierarchical relationships of classifications into account and a classification hierarchy that integrates classifications with the same meaning.

The classification hierarchy re-creation system according to the present invention comprises: clustering means for clustering a data group associated with a hierarchical classification and creating, for each cluster, a classification group, which is a group of the classifications that correspond to the data in the cluster and satisfy a predetermined condition; co-occurrence degree calculation means for calculating the co-occurrence degree of two classifications selected from the classification group; and classification hierarchy re-creation means for re-creating the classification hierarchy based on the classification group and the co-occurrence degree.

In the classification hierarchy re-creation method according to the present invention, the clustering means of the data processing device clusters a data group associated with a hierarchical classification and creates, for each cluster, a classification group, which is a group of the classifications that correspond to the data in the cluster and satisfy a predetermined condition; the co-occurrence degree calculation means of the data processing device calculates the co-occurrence degree of two classifications selected from the classification group; and the classification hierarchy re-creation means re-creates the classification hierarchy based on the classification group and the co-occurrence degree.

The classification hierarchy re-creation program according to the present invention causes a computer to execute: a clustering process of clustering a data group associated with a hierarchical classification and creating, for each cluster, a classification group, which is a group of the classifications that correspond to the data in the cluster and satisfy a predetermined condition; a co-occurrence degree calculation process of calculating the co-occurrence degree of two classifications selected from the classification group; and a classification hierarchy re-creation process of re-creating the classification hierarchy based on the classification group and the co-occurrence degree.

According to the present invention, when a new classification hierarchy is created by reconstructing an existing classification hierarchy, a classification hierarchy that takes the hierarchical relationships of classifications into account or a classification hierarchy that integrates classifications with the same meaning can be created efficiently.

FIG. 1 is a block diagram showing an example of the classification hierarchy re-creation system in the first embodiment of the present invention.
FIG. 2 is an explanatory diagram showing an example of the data group input to the input means 11 and its classifications.
FIG. 3 is a flowchart showing an example of the operation of the data processing device 100 in the first embodiment.
FIG. 4 is an explanatory diagram showing an example of a classification hierarchy.
FIG. 5 is an explanatory diagram showing an example of a cross tabulation table.
FIG. 6 is an explanatory diagram showing an example of a cross tabulation table after division.
FIG. 7 is an explanatory diagram showing an example of a calculation result of co-occurrence degrees.
FIG. 8 is an explanatory diagram showing an example of a classification hierarchy in the middle of being updated.
FIG. 9 is an explanatory diagram showing an example of the result of updating the classification hierarchy.
FIG. 10 is an explanatory diagram showing an example of the updated classification hierarchy.
FIG. 11 is an explanatory diagram showing an example of the updated classification hierarchy.
FIG. 12 is a block diagram showing an example of the classification hierarchy re-creation system in the second embodiment of the present invention.
FIG. 13 is an explanatory diagram showing an example of structured data.
FIG. 14 is a block diagram showing an example of the classification hierarchy re-creation system in the third embodiment of the present invention.
FIG. 15 is a flowchart showing an example of the operation of the data processing device 100 in the third embodiment.
FIG. 16 is an explanatory diagram showing an example of a data group received by the input means 11.
FIG. 17 is an explanatory diagram showing an example of a classification hierarchy.
FIG. 18 is an explanatory diagram showing an example of a cross tabulation table.
FIG. 19 is an explanatory diagram showing an example of the result of dividing a cross tabulation table.
FIG. 20 is an explanatory diagram showing an example of a calculation result of co-occurrence scores.
FIG. 21 is an explanatory diagram showing an example of a classification hierarchy.
FIG. 22 is an explanatory diagram showing an example of a classification hierarchy.
FIG. 23 is a block diagram showing the minimum configuration of the present invention.

  Hereinafter, embodiments of the present invention will be described with reference to the drawings.

Embodiment 1.
FIG. 1 is a block diagram showing an example of the classification hierarchy re-creation system in the first exemplary embodiment of the present invention. The classification hierarchy re-creation system in this embodiment includes a data processing device 100, a data storage device 101, an input unit 11, and an output unit 16. The input unit 11 is an input device such as a keyboard, for example, but the mode of the input unit 11 is not limited to the keyboard. For example, the input unit 11 may be an input interface that receives data from another device. The output unit 16 is an output device such as a display device, for example, but the mode of the output unit 16 is not limited to the display device. For example, the output unit 16 may be an output interface that transmits data to another device.

  The data processing apparatus 100 includes clustering means 13, co-occurrence degree calculation means 14, and classification hierarchy update means 15.

The data storage device 101 includes a classification hierarchy storage unit 12 that stores the hierarchical relationships of classifications (hereinafter referred to as a classification hierarchy). The classification hierarchy represents the vertical (parent-child) relationships of classifications and is represented, for example, by a directed graph structure with classifications as nodes. In the following description, the case where the classification hierarchy is represented by a directed graph structure with classifications as nodes will be described. However, the classification hierarchy is not limited to this structure; it may be any other structure that can express the hierarchical relationships of the classifications. The classification hierarchy storage unit 12 is realized by, for example, a magnetic disk device provided in the data storage device 101. Each of the above means operates as follows.

The input unit 11 receives the input data group and the classification of each data item, and notifies the clustering unit 13 of them. FIG. 2 is an explanatory diagram showing an example of an input data group and its classifications. In the example shown in FIG. 2, each data item and the classifications to which it belongs (hereinafter referred to as data classifications or simply "classifications") are represented by one record, and the entire table containing these records represents the data group. In addition, "..." in the table represents an omission. In the example shown in FIG. 2, multiple classifications separated by "," (commas) represent the classifications to which each data item belongs. For example, "text data 1" in the first record belongs to the classifications "F", "G", and "H".

The clustering means 13 receives the data group and the classification of each data item from the input means 11 and clusters the received data group. The clustering means 13 may cluster the data group using a clustering method such as K-means, for example, or may use a method other than K-means.

Next, the clustering means 13 aggregates the data in each cluster by classification and, for each cluster, groups the classifications to which a large number of data items belong. For example, the clustering means 13 creates a cross tabulation table using the classifications corresponding to the data in each cluster. Specifically, the clustering unit 13 arranges information indicating clusters horizontally and information indicating classifications vertically, and creates a cross tabulation table whose values are the number of data items for each cluster and classification. The clustering means 13 then refers to the cross tabulation table, marks the cells with a large number of data items, and groups the marked cells for each cluster.
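As an illustration of this aggregation step, the following is a minimal Python sketch (not taken from the patent; the mappings `cluster_of` and `classes_of`, the function names, and the mark threshold are assumptions introduced here) that builds the cluster-by-classification cross tabulation and collects the classifications whose cell values reach a threshold.

```python
from collections import defaultdict

def cross_tabulate(cluster_of, classes_of):
    """Count, for every (cluster, classification) pair, how many data items
    in that cluster carry that classification."""
    table = defaultdict(int)
    for item, cluster in cluster_of.items():
        for cls in classes_of[item]:
            table[(cluster, cls)] += 1
    return table

def classification_groups(table, threshold):
    """Collect, per cluster, the classifications whose cell value reaches the
    threshold (the 'marked' cells in the description above)."""
    groups = defaultdict(set)
    for (cluster, cls), count in table.items():
        if count >= threshold:
            groups[cluster].add(cls)
    return dict(groups)

# toy usage: item -> cluster id, item -> set of classifications
cluster_of = {"text data 1": 1, "text data 2": 1, "text data 3": 2}
classes_of = {"text data 1": {"F", "G"}, "text data 2": {"F"}, "text data 3": {"H"}}
print(classification_groups(cross_tabulate(cluster_of, classes_of), threshold=2))
# {1: {'F'}}
```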

Next, the clustering means 13 refers to the classification hierarchy and divides a cluster's group of marked classifications (that is, the grouped classifications) when the classifications in it are far apart in the hierarchy. The clustering means 13 then notifies the co-occurrence degree calculating means 14 of the groups created based on the division result (hereinafter referred to as classification groups).

The co-occurrence degree calculating means 14 receives the classification groups and calculates the co-occurrence degree for each combination of two classifications selected from a classification group. Here, co-occurrence means that two classifications both appear in (are assigned to) one data item. The co-occurrence degree is a statistic calculated based on co-occurrence and is a value indicating the degree of co-occurrence. The co-occurrence degree calculation means 14 calculates the co-occurrence degree of two classifications, for example, by using the number of data items in which the two classifications co-occur as the numerator and the number of data items belonging to each classification as the denominator. For example, suppose the number of data items in which the classification "F" and the classification "G" co-occur is 9 and the number of data items in the classification "G" is 10. In this case, the co-occurrence degree calculating means 14 calculates the co-occurrence degree P as P(classification "F", classification "G" | classification "G") = 9/10 = 0.9. In the following description, the number of data items in which two classifications co-occur is called the co-occurrence frequency. In the above example, the co-occurrence frequency of the classification "F" and the classification "G" is 9.
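The following is a minimal Python sketch of this calculation (an illustration only; the mapping `classes_of` from data items to their classifications and the function name are assumptions). It computes the two conditional co-occurrence degrees that the text later calls (Formula 1) and (Formula 2).

```python
def co_occurrence_scores(classes_of, cls1, cls2):
    """Co-occurrence degrees of two classifications, conditioned on each one:
    the co-occurrence frequency divided by the frequency of the conditioning
    classification (cf. (Formula 1) and (Formula 2) below)."""
    n1 = sum(1 for classes in classes_of.values() if cls1 in classes)
    n2 = sum(1 for classes in classes_of.values() if cls2 in classes)
    n12 = sum(1 for classes in classes_of.values()
              if cls1 in classes and cls2 in classes)
    score1 = n12 / n1 if n1 else 0.0   # P(cls1, cls2 | cls1)
    score2 = n12 / n2 if n2 else 0.0   # P(cls1, cls2 | cls2)
    return score1, score2
```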

The classification hierarchy updating means 15 updates the classification hierarchy by creating hierarchical relationships between classifications and integrating classifications, using the classification groups and the co-occurrence degrees. First, the classification hierarchy update unit 15 takes one classification group and extracts two classifications from it. If the two extracted classifications have a co-occurrence degree equal to or greater than a predetermined threshold and satisfy an inclusion relationship, the classification hierarchy update unit 15 creates a parent-child hierarchical relationship between them. If, on the other hand, the two extracted classifications have a co-occurrence degree equal to or greater than a predetermined threshold and satisfy an agreement relationship, the classification hierarchy update unit 15 integrates the two classifications. The classification hierarchy updating means 15 updates the classification hierarchy by repeating this processing for every combination of two classifications in the group and for all classification groups.

Here, an inclusion relationship is a relationship in which, of the concepts indicated by the two classifications, one concept is broad and the other is narrow, and the broad concept includes the narrow one. An agreement relationship is a relationship in which the concepts indicated by the two classifications fall under the same broader concept, that is, the two classifications have substantially the same meaning. In other words, the classification hierarchy update unit 15 uses the co-occurrence degree to determine whether two classifications are in an inclusion relationship or an agreement relationship, and updates the classification hierarchy based on the determined relationship.

  The output means 16 outputs the contents of the updated classification hierarchy to a display device or the like.

The clustering means 13, the co-occurrence degree calculating means 14, and the classification hierarchy updating means 15 are realized by a CPU of a computer that operates according to a program (the classification hierarchy re-creation program). For example, the program may be stored in a storage unit (not shown) of the data processing apparatus 100, and the CPU may read the program and operate as the clustering unit 13, the co-occurrence degree calculating unit 14, and the classification hierarchy updating unit 15 according to the program. Alternatively, the clustering means 13, the co-occurrence degree calculating means 14, and the classification hierarchy updating means 15 may each be realized by dedicated hardware.

  Next, the operation will be described. FIG. 3 is a flowchart showing an example of the operation of the data processing apparatus 100 in the present embodiment.

First, when the input unit 11 notifies the clustering unit 13 of the received data group, the clustering unit 13 performs clustering based on the data group (step S1). The clustering means 13 can use a clustering method suited to the received data; for example, it may use a well-known method such as K-means. In the present embodiment, the case where the clustering unit 13 clusters text data will be described, but the data group to be clustered is not limited to text data. For example, the clustering unit 13 may cluster binary data such as audio or images.

Next, the clustering means 13 refers to the classification hierarchy stored in the classification hierarchy storage means 12, creates a cross tabulation table of the clusters and the data classifications, and creates classification groups (step S2). FIG. 4 is an explanatory diagram illustrating an example of a classification hierarchy. FIG. 5 is an explanatory diagram showing an example of a cross tabulation table.

In the example shown in FIG. 4, the classification hierarchy is expressed as a directed graph structure with classifications as nodes. In the example shown in FIG. 5, the cross tabulation table is a table in which information indicating clusters is arranged horizontally and information indicating classifications is arranged vertically. The values in the cross tabulation table illustrated in FIG. 5 are the numbers of data items in each cluster that belong to each classification (that is, values obtained by counting, for the data in a cluster, how many items belong to each classification). However, this is only an example; for instance, a value obtained by dividing the number of data items by the total number of data items in the cluster may be used, or a value obtained by dividing it by the total number of data items in the classification may be used.

Here, the clustering means 13 marks cells whose value is equal to or greater than a certain threshold. In the example shown in FIG. 5, the marked cells are shown surrounded by a thick line, and the clustering unit 13 has marked cells with a value of 10 or more (the threshold). A marked cell indicates that many data items included in the cluster belong to that classification. For example, "cluster 1" illustrated in FIG. 5 contains many data items belonging to the classification H, the classification I, and the classification J. Here, "many data items belonging to a classification" means that the number is equal to or greater than the predetermined threshold.

The clustering means 13 creates a classification group from the classifications marked for each cluster. For example, in the example illustrated in FIG. 5, the clustering unit 13 groups the classifications marked in "cluster 2" (classification H, classification I, and classification J) into one group (classification group). Next, the clustering means 13 refers to the cross tabulation table and the classification hierarchy and divides classification groups in which the hierarchical distance is large (step S3). For each pair of classifications in a classification group, the clustering means 13 determines whether or not their hierarchical distance is equal to or greater than a threshold; if it is, the clustering unit 13 divides the classification group. Here, the hierarchical distance is an index indicating how far apart two classifications are in the hierarchy, and in the present embodiment it means the shortest number of hops between the two classifications in the classification hierarchy.
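For illustration, a minimal Python sketch is given below (assumptions introduced here: the classification hierarchy is given as a list of parent-child edges, edges are traversed in either direction when counting hops, and `split_group` is a simple greedy division rather than the patent's exact procedure).

```python
from collections import deque

def hop_distance(edges, a, b):
    """Shortest number of hops between two classifications, treating each
    parent-child edge of the classification hierarchy as traversable in
    both directions."""
    adjacency = {}
    for parent, child in edges:
        adjacency.setdefault(parent, set()).add(child)
        adjacency.setdefault(child, set()).add(parent)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return float("inf")   # the two classifications are not connected

def split_group(group, edges, threshold):
    """Separate one classification group into sub-groups so that no pair whose
    hierarchical distance reaches the threshold stays together."""
    subgroups = []
    for cls in group:
        for sub in subgroups:
            if all(hop_distance(edges, cls, other) < threshold for other in sub):
                sub.append(cls)
                break
        else:
            subgroups.append([cls])
    return subgroups
```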

A method in which the clustering unit 13 divides a classification group when the threshold is 5 hops will now be described with reference to FIGS. 4 and 5. In the example shown in FIGS. 4 and 5, in the classification group of "cluster 3" (classification O, classification P, classification Q, classification R), the pairs classification O and classification Q, classification O and classification R, classification P and classification Q, and classification P and classification R are each 6 hops apart, so they become division targets. These classification pairs are divided into the separate groups (classification O, classification P) and (classification Q, classification R). An example of the result of dividing the cross tabulation table is shown in FIG. 6. In the example illustrated in FIG. 6, the classification group of "cluster 3" (classification O, classification P, classification Q, classification R) has been divided into the classification group of "cluster 3" (classification O, classification P) and the classification group of "cluster 3′" (classification Q, classification R). In the following description, the cluster numbers illustrated in FIG. 6 are referred to as classification group numbers (hereinafter, group numbers).

Next, the co-occurrence degree calculation means 14 calculates the co-occurrence degree of two classifications selected from a classification group (step S4). FIG. 7 is an explanatory diagram illustrating an example of the calculation results. The table illustrated in FIG. 7 contains the group number, the two classifications for which the co-occurrence degree is calculated ("classification 1" and "classification 2"), and "co-occurrence score 1" and "co-occurrence score 2", which indicate the co-occurrence degree with respect to each classification. In the following description, "co-occurrence score 1" and "co-occurrence score 2" are the conditional probabilities that "classification 1" and "classification 2" co-occur, conditioned on "classification 1" and on "classification 2", respectively. The value of "co-occurrence score 1" can be calculated by the following (Formula 1), and the value of "co-occurrence score 2" by the following (Formula 2).

Co-occurrence score 1 = P(classification 1, classification 2 | classification 1) = (co-occurrence frequency of classification 1 and classification 2) / (frequency of classification 1)  (Formula 1)

Co-occurrence score 2 = P(classification 1, classification 2 | classification 2) = (co-occurrence frequency of classification 1 and classification 2) / (frequency of classification 2)  (Formula 2)

The co-occurrence degree calculation means 14 determines whether the two classifications are in an inclusion relationship or an agreement relationship based on these two values (that is, co-occurrence score 1 and co-occurrence score 2).

For example, when only one of co-occurrence score 1 and co-occurrence score 2 is high, there is an inclusion relationship between the classification corresponding to the high score and the other classification. When both co-occurrence score 1 and co-occurrence score 2 are high, there is an agreement relationship between the two classifications. This is because the common part, which is the numerator, is the same in both scores, while the classification frequency in the denominator differs.

The case where co-occurrence score 1 is high and co-occurrence score 2 is low will be described as a concrete example. When co-occurrence score 1 is high, almost all data belonging to classification 1 also belong to classification 2. When co-occurrence score 2 is low, the data belonging to classification 2 also include many items outside classification 1. Therefore, classification 2 is broader than classification 1, and classification 2 includes classification 1. Conversely, when co-occurrence score 2 is high and co-occurrence score 1 is low, classification 1 includes classification 2.

On the other hand, if the two co-occurrence scores (that is, co-occurrence score 1 and co-occurrence score 2) are both high, the same data items frequently appear in both classifications (that is, classification 1 and classification 2), so classification 1 and classification 2 can be said to be in an agreement relationship.

Next, the classification hierarchy updating means 15 updates the classification hierarchy based on the classification groups and the co-occurrence degrees (step S5). When, as a result of the determination based on the co-occurrence degrees, the relationship between two classifications satisfies the inclusion relationship, the classification hierarchy updating unit 15 updates the two classifications into a parent-child pair. When the relationship between the two classifications satisfies the agreement relationship, the classification hierarchy updating unit 15 integrates the two classifications into one. The classification hierarchy update unit 15 determines whether a co-occurrence score is high or low using a threshold, hereinafter referred to as the co-occurrence score threshold.

The process of updating the classification hierarchy will be described below using the examples shown in FIGS. 4 and 7. Here, it is assumed that the co-occurrence score thresholds are preset in the system: the classification hierarchy updating unit 15 determines that a co-occurrence score is high when it is 0.7 or more, and low when it is 0.3 or less.
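A minimal sketch of this decision rule is shown below (illustrative only; the function name is an assumption, and the default thresholds 0.7 and 0.3 are the values quoted above).

```python
def classify_relation(score1, score2, high=0.7, low=0.3):
    """Decide the relation between classification 1 and classification 2 from
    their two co-occurrence scores, following the rules described above."""
    if score1 >= high and score2 >= high:
        return "agreement"            # integrate the two classifications
    if score1 >= high and score2 <= low:
        return "1 is a child of 2"    # classification 2 includes classification 1
    if score2 >= high and score1 <= low:
        return "2 is a child of 1"    # classification 1 includes classification 2
    return "no update"                # neither relation holds; leave the hierarchy
```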

According to the co-occurrence scores of classification G and classification H in "group 1" illustrated in FIG. 7, "co-occurrence score 1" is high and "co-occurrence score 2" is low. Therefore, these two classifications are in an inclusion relationship, with classification H as the parent and classification G as the child. Accordingly, the classification hierarchy updating unit 15 updates the classification hierarchy so that classification H illustrated in FIG. 4 becomes the parent of classification G. FIG. 8 shows an example of the classification hierarchy in the middle of being updated; classification G has been updated to be a child of classification H. The broken line drawn from classification B to classification G indicates the parent-child relationship before the update. The classification hierarchy update unit 15 may or may not delete the parent-child relationship that existed before the update; in the following description, the pre-update parent-child relationship is deleted later.

Looking at the co-occurrence scores of classification H and classification I in "group 2" illustrated in FIG. 7, "co-occurrence score 2" is high and "co-occurrence score 1" is low. Therefore, these two classifications are also in an inclusion relationship, with classification H as the parent and classification I as the child. Similarly, from the co-occurrence scores of classification H and classification J, classification H is the parent and classification J is the child. On the other hand, classification I and classification J are in an agreement relationship because both of their co-occurrence scores are high, so the classification hierarchy update unit 15 integrates these two classifications.

FIG. 9 shows an example of the result of updating the classification hierarchy based on the classification groups; the classification hierarchy illustrated in FIG. 9 has been updated using "group 1" and "group 2". When integrating classifications in an agreement relationship, the parent classifications of the two classifications may differ. In this case, the classification hierarchy updating unit 15 integrates the classification with the smaller amount of data into the classification with the larger amount of data, creating one classification.

In addition, classification O and classification P of "group 3" illustrated in FIG. 7 are in an agreement relationship, so the classification hierarchy updating unit 15 integrates these two classifications. On the other hand, classification Q and classification R of "group 3′" illustrated in FIG. 7 are in neither an inclusion nor an agreement relationship, so the classification hierarchy updating unit 15 does not update the classification hierarchy.

FIG. 10 shows an example of the classification hierarchy updated as a result of the above. The classifications surrounded by thick lines in FIG. 10 are classifications to which data belong. The classification hierarchy update unit 15 may or may not delete the parent-child relationships that existed before the update (the relationships connected by broken lines in the figure). If they are left rather than deleted, a request to classify data using the pre-update classification hierarchy, for example, can still be satisfied.

Furthermore, the classification hierarchy update unit 15 may perform processing for classifications to which no data belong. For example, the classification hierarchy update unit 15 may delete a classification that has no belonging data and no child classification. In the example shown in FIG. 10, no data belong to classification L, classification M, or classification N, so the classification hierarchy updating unit 15 may delete these classifications.

Further, for a classification that has no belonging data and only one child classification, the classification hierarchy updating means 15 may delete that classification and create a hierarchical relationship between the parent classification and the child classification of the deleted classification. That is, the classification hierarchy update unit 15 may create a hierarchical relationship in which the grandchild classification becomes a child classification, because there is little point in keeping a classification that has no data and only one child. For example, since classification E has only the classification O+P as a child, the classification hierarchy updating unit 15 deletes classification E and creates a direct parent-child relationship between classification B and classification O+P. FIG. 11 shows an example of the classification hierarchy updated as a result of the above.
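The two clean-up rules just described can be sketched as follows (an illustration, not the patent's implementation; the tree representation as a parent-to-children mapping, the `has_data` predicate, and treating classification B as the top of the shown fragment are assumptions).

```python
def prune(children, has_data, root):
    """Apply the two clean-up rules described above to a classification tree
    given as a parent -> list-of-children mapping:
    rule 1: delete a classification with no data and no remaining children;
    rule 2: splice out a classification with no data and exactly one remaining
            child, attaching that child directly to the grandparent."""
    def walk(node):
        kept = []
        for child in children.get(node, []):
            kept.extend(walk(child))
        if node != root and not has_data(node) and len(kept) <= 1:
            children.pop(node, None)   # the classification itself disappears
            return kept                # [] for rule 1, the single child for rule 2
        children[node] = kept
        return [node]
    walk(root)
    return children

# example mirroring the step above: classification E has no data and only the
# child O+P, so E is removed and O+P becomes a direct child of B
children = {"B": ["E"], "E": ["O+P"], "O+P": []}
print(prune(children, has_data=lambda c: c in {"B", "O+P"}, root="B"))
# {'B': ['O+P'], 'O+P': []}
```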

As described above, according to the present embodiment, the clustering means 13 clusters a data group associated with a hierarchical classification. The clustering unit 13 then creates, for each cluster, a classification group consisting of the classifications that correspond to the data in the cluster and satisfy a predetermined condition (for example, the condition "a large number of data items belong to the classification"). When the co-occurrence degree calculating unit 14 calculates the co-occurrence degree of two classifications selected from a classification group, the classification hierarchy updating unit 15 re-creates the classification hierarchy based on the classification group and the co-occurrence degree. Therefore, when a new classification hierarchy is created by reconstructing an existing one, a classification hierarchy that takes the hierarchical relationships of classifications into account, or one that integrates classifications having the same meaning, can be created efficiently.

That is, according to the present embodiment, the classification hierarchy updating unit 15 creates hierarchical relationships between classifications and integrates classifications based on the co-occurrence degrees of the classifications in a classification group, so hierarchical relationships can be created and classifications integrated appropriately. In addition, according to the present embodiment, the clustering unit 13 creates groups of similar classifications in advance, and the co-occurrence degree calculating unit 14 calculates co-occurrence degrees only within each group, so the classification hierarchy can be updated efficiently while taking these relationships into account.

Embodiment 2.
FIG. 12 is a block diagram illustrating an example of the classification hierarchy re-creation system according to the second exemplary embodiment of the present invention. The second embodiment differs from the first embodiment in that the input unit 11 is replaced by the second input unit 21 and the clustering unit 13 is replaced by the second clustering unit 23. Components similar to those of the first embodiment are given the same reference numerals as in FIG. 1, and their description is omitted.

The classification hierarchy re-creation system in this embodiment includes a data processing device 100, a data storage device 101, a second input unit 21, and an output unit 16. The data storage device 101 is the same as in the first embodiment, and the mode of the second input means 21 is the same as that of the input means 11 in the first embodiment. The second input means 21 receives the input structured data group and the classification of each data item. In the following description, structured data means data in which each portion is given a name that identifies it (hereinafter referred to as a structure part name).

FIG. 13 is an explanatory diagram showing an example of structured data; here, the example is patent data. Patent data has structure information such as summary, purpose, and problem in advance. The second input means 21 receives such structured data as one data item. Although the above description uses text data as the structured data, the second input unit 21 may also receive audio data, image data, or the like. In the case of audio data, the structured part may be, for example, the speech segments of a specific speaker; in the case of image data, it may be the region showing a specific person.

The second input means 21 also receives the structure part name that is to be analyzed (that is, clustered) by the second clustering means 23 described later. The structure part name can be regarded as the name of a piece of structure information. In the example illustrated in FIG. 13, the structure part names are summary, purpose, problem, and so on. The second input means 21 may receive a plurality of structure part names; for example, it may receive the two structure part names "problem" and "object of the invention".

  The data processing apparatus 100 includes second clustering means 23, co-occurrence degree calculating means 14, and classification hierarchy updating means 15. Since the co-occurrence degree calculating unit 14 and the classification hierarchy updating unit 15 are the same as those in the first embodiment, the description thereof is omitted.

The second clustering means 23 receives the structured data group, the classification of each data item, and the structure part name from the second input means 21, and clusters the structured data group. Specifically, the second clustering means 23 does not perform clustering based on the entire structured data; it extracts only the part corresponding to the received structure part name from each data item and performs clustering based on the extracted information. For example, the second clustering means 23 extracts the text corresponding to "problem" and "object of the invention" from structured data with the structure illustrated in FIG. 13 and performs clustering by computing similarity using only that text. The second clustering unit 23 may, for example, cluster the data group using a clustering method such as K-means, or may use another method.
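A minimal sketch of this extraction step is shown below (illustrative; representing structured data as a mapping from item identifiers to part-name/text dictionaries is an assumption). The extracted texts would then be vectorized and clustered as in the first embodiment.

```python
def extract_parts(structured_data, part_names):
    """Keep only the requested structure parts (e.g. 'problem' and 'object of
    the invention') of each structured data item before clustering."""
    extracted = {}
    for item_id, parts in structured_data.items():
        extracted[item_id] = " ".join(parts.get(name, "") for name in part_names)
    return extracted

# the extracted texts are then clustered exactly as in the first embodiment
# (for example with K-means)
patents = {"doc1": {"summary": "...", "problem": "reduce cost"},
           "doc2": {"summary": "...", "problem": "improve accuracy"}}
texts = extract_parts(patents, ["problem", "object of the invention"])
```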

When the structured data is audio data and a specific speaker name is received as the structure part name, the second clustering means 23 may, for example, extract the waveform of the part corresponding to that speaker and perform clustering by calculating similarity on it. Likewise, when the structured data is image data and a specific person name is received as the structure part name, the second clustering means 23 may extract only the image regions in which that person appears and perform clustering by calculating similarity.

The second clustering means 23, the co-occurrence degree calculating means 14, and the classification hierarchy updating means 15 are realized by a CPU of a computer that operates according to a program (the classification hierarchy re-creation program). Alternatively, the second clustering means 23, the co-occurrence degree calculating means 14, and the classification hierarchy updating means 15 may each be realized by dedicated hardware.

Next, the operation will be described. The operation of the data processing apparatus 100 in this embodiment follows the flowchart illustrated in FIG. 3. The second embodiment differs from the first in that the second clustering unit 23 receives the structured data group, the classification of each data item, and the structure part name from the second input unit 21 and clusters the structured data group. Specifically, whereas in the first embodiment the clustering unit 13 performs clustering based on the entire data, in the second embodiment the second clustering means 23 extracts only the part corresponding to the received structure part name from each data item and performs clustering based on the extracted information. The other operations are the same as in the first embodiment.

As described above, according to the present embodiment, the second clustering unit 23 clusters the data group using the data extracted from the structured data based on the structure part name. Therefore, in addition to the effects of the first embodiment, the classification hierarchy can be re-created from the viewpoint that the user wants to analyze.

That is, according to the present embodiment, the second clustering means 23 extracts and clusters only the part to be analyzed; specifically, clustering is performed using the structured data and the structure part name to be analyzed. Therefore, the classification hierarchy can be updated from the viewpoint that the user wants to analyze. Since the classification groups change when the analysis target changes, the characteristics of the analyzed part can be reflected in the classification hierarchy. For example, if the target data is patent data, the classification hierarchy can be updated from the viewpoint of dividing by purpose or dividing by problem.

Embodiment 3.
FIG. 14 is a block diagram illustrating an example of the classification hierarchy re-creation system according to the third exemplary embodiment of the present invention. The third embodiment differs from the first embodiment in that the data processing apparatus 100 further includes a re-updating unit 31. Components similar to those of the first embodiment are given the same reference numerals as in FIG. 1, and their description is omitted. That is, the data processing apparatus 100 according to the third embodiment includes the clustering means 13, the co-occurrence degree calculating means 14, the classification hierarchy updating means 15, and the re-updating means 31. The clustering unit 13, the co-occurrence degree calculating unit 14, and the classification hierarchy updating unit 15 are the same as those in the first embodiment, and their description is omitted.

The re-updating means 31 receives the updated classification hierarchy from the classification hierarchy updating means 15 and, when the received classification hierarchy does not satisfy a predetermined condition, instructs that the classification hierarchy be re-updated. Here, the predetermined condition is, for example, at least one of the number of classifications in the hierarchy, the depth of the hierarchy, the number of re-updates, and the presence or absence of a stop instruction from the user, or a combination of these, but the predetermined condition is not limited to these.

Specifically, the re-updating unit 31 replaces the data group's classifications and the classification hierarchy with the updated classification hierarchy. In addition, the re-updating unit 31 relaxes the threshold used for clustering and the threshold used by the classification hierarchy updating unit 15 to determine inclusion and agreement relationships (that is, the co-occurrence score threshold). The re-updating unit 31 then instructs the clustering unit 13 to re-create the classification hierarchy.

  The clustering means 13, the co-occurrence degree calculating means 14, the classification hierarchy updating means 15, and the re-update means 31 are realized by a CPU of a computer that operates according to a program (classification hierarchy re-creation program). The clustering unit 13, the co-occurrence degree calculating unit 14, the classification hierarchy updating unit 15, and the re-updating unit 31 may be realized by dedicated hardware.

Next, the operation will be described. FIG. 15 is a flowchart showing an example of the operation of the data processing apparatus 100 in the present embodiment. The processing from the input means 11 receiving the data until the classification hierarchy updating means updates the classification hierarchy is the same as steps S1 to S5 in FIG. 3. The re-updating means 31 receives the updated classification hierarchy from the classification hierarchy updating means 15 and determines whether or not it satisfies the predetermined condition (step S6). If the condition is not satisfied (NO in step S6), the re-updating means 31 relaxes the clustering threshold or the co-occurrence score threshold (step S7) and instructs the clustering means 13 to re-create the classification hierarchy; the processes of steps S1 to S6 are then repeated. If the condition is satisfied (YES in step S6), the re-updating unit 31 ends the update process.
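The control loop of steps S1 to S7 can be sketched as follows (an illustration only; `run_once`, `satisfied`, and `relax` are placeholder callables standing for steps S1 to S5, the condition check of step S6, and the threshold relaxation of step S7, and the round limit is an assumption).

```python
def recreate_until_satisfied(run_once, satisfied, relax, params, max_rounds=10):
    """Repeat the cluster / co-occurrence / update cycle, relaxing the clustering
    threshold and the co-occurrence score threshold after every round whose
    result does not satisfy the condition (steps S6-S7)."""
    hierarchy = run_once(params)          # steps S1-S5 with the current thresholds
    for _ in range(max_rounds):
        if satisfied(hierarchy):          # step S6: number/depth of classifications, etc.
            return hierarchy
        params = relax(params)            # step S7: relax the thresholds
        hierarchy = run_once(params)      # re-create the classification hierarchy
    return hierarchy
```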

As described above, according to the present embodiment, the re-updating unit 31 instructs that the classification hierarchy re-created by the classification hierarchy updating unit 15 be updated again. Specifically, when the re-created classification hierarchy does not satisfy a predetermined requirement, the re-updating unit 31 changes the condition for creating the classification groups and the conditions on the co-occurrence degrees used for re-creating the classification hierarchy. The clustering means 13 then creates classification groups using the changed condition, and the classification hierarchy updating means 15 re-creates the classification hierarchy based on the changed conditions. Therefore, in addition to the effects of the first embodiment, a classification hierarchy closer to the desired condition can be obtained: even if the condition is not met at first, repeating the update via the re-updating unit 31 yields a classification hierarchy that comes closer to satisfying it.

  Hereinafter, the present invention will be described with reference to specific examples, but the scope of the present invention is not limited to the contents described below. In this embodiment, a specific example will be described based on the block diagram illustrated in FIG. 1 and the flowchart illustrated in FIG.

  First, when the input unit 11 notifies the clustering unit 13 of the received data group, the clustering unit 13 performs clustering based on the data group (step S1 in FIG. 3). An example of a data group received by the input means 11 is shown in FIG. The data group illustrated in FIG. 16 includes “data” and “classification” in one record. In the present embodiment, text data is described as an example of data, but the data may be voice or image. In addition, the classifications illustrated in FIG. 16 are separated by commas and indicate that a plurality of classifications are specified.

A case where the clustering means 13 clusters this data will now be described. The clustering means 13 performs clustering using a clustering method suited to the data. In this example, since the received data is text data, the clustering means 13 uses the K-means method, calculating similarity with the text of each data item treated as vector data. Specifically, the clustering means 13 first morphologically analyzes the text of each data item and divides it into words. Next, it converts each data item into vector data in which the dimensions are words and the values are word counts. The clustering means 13 then creates K clusters based on the cosine similarity between the vector data. In this example, K = 4, so the clustering means 13 creates four clusters.
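A minimal sketch of this clustering step is shown below, assuming scikit-learn is available; the placeholder tokenizer stands in for a morphological analyzer, and L2-normalizing the count vectors so that Euclidean K-means approximates cosine-similarity-based K-means is a substitution introduced here, not a detail stated in the patent.

```python
# a sketch only: a real Japanese pipeline would pass a morphological analyzer
# (e.g. a MeCab-based tokenizer) instead of the whitespace tokenizer used here
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

def cluster_texts(texts, k=4, tokenize=str.split):
    # 1. split each text into words (morphological analysis for Japanese text)
    # 2. turn the word counts into vector data (dimensions = words, values = counts)
    vectors = CountVectorizer(tokenizer=tokenize, token_pattern=None).fit_transform(texts)
    # 3. L2-normalize so that Euclidean K-means behaves like cosine-based K-means
    vectors = normalize(vectors)
    # 4. create K clusters (K = 4 in the worked example)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
```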

Note that when the received data is not text data but binary data such as audio or images, the clustering means 13 may use a method suited to each type of data. For example, in the case of audio data, the clustering unit 13 may read the audio waveform data and cluster based on similarity calculated from it; in the case of images, it may generate a color histogram from each image and cluster based on the similarity of the histograms.

Next, the clustering means 13 refers to the classification hierarchy stored in the classification hierarchy storage means 12, creates a cross tabulation table of the resulting clusters and the classifications, and creates classification groups (step S2 in FIG. 3). FIG. 17 shows an example of the classification hierarchy, and FIG. 18 shows an example of the cross tabulation table.

The classification hierarchy illustrated in FIG. 17 has a directed graph structure with classifications as nodes. In the example shown in FIG. 17, "main category" is the root classification, the classifications "society" and "nature" exist in the hierarchy below it, and various broad classifications exist in the hierarchy below the classification "society".

The cross tabulation table illustrated in FIG. 18 is a table in which information indicating clusters is arranged horizontally and information indicating classifications is arranged vertically. Its values indicate, for the data existing in each cluster, the number of data items belonging to each classification. However, this is only an example; the value may instead be the number of data items divided by the total number of data items in the cluster, or divided by the total number of data items in the classification. In this example, it is assumed that only data belonging to classifications under "Society" are input.

Here, the clustering means 13 marks cells whose value is equal to or greater than a certain threshold. In the example shown in FIG. 18, the marked cells are shown surrounded by a thick line, and the clustering unit 13 has marked cells with a value of 10 or more (the threshold). A marked cell indicates that many data items included in the cluster belong to that classification. For example, "cluster 1" illustrated in FIG. 18 includes many data items belonging to the classifications "transplant" and "relatives". Here, "many data items belonging to a classification" means that the number is equal to or greater than the predetermined threshold.

The clustering means 13 creates classification groups from the classifications marked for each cluster. For example, in the example shown in FIG. 18, the clustering means 13 groups the classifications marked in "cluster 1" ("transplant", "relatives") into one group (classification group). Likewise, the clustering means 13 creates the group ("health", "medicine", "transplant") from "cluster 2", the group ("administration", "diplomat") from "cluster 3", and the group ("home", "child-raising") from "cluster 4".

Next, the clustering unit 13 refers to the cross tabulation table and the classification hierarchy and divides the classification groups whose hierarchical distance is large (step S3 in FIG. 3). For each pair of classifications in a classification group, the clustering means 13 determines whether or not their hierarchical distance is equal to or greater than the threshold; if it is, the clustering unit 13 divides the classification group. In this example, the hierarchical distance means the shortest number of hops between two classifications in the classification hierarchy.

The case where the threshold is 5 hops will be described below with reference to FIG. 17. In the example shown in FIG. 17, in the group ("transplant", "relatives"), "transplant" and "relatives" are 5 hops apart, so they are division targets. This group is therefore divided into ("transplant") and ("relatives"). An example of the result of dividing the cross tabulation table is shown in FIG. 19: the classifications "transplant" and "relatives" of "cluster 1" are divided into "cluster 1" and "cluster 1′", respectively. In the following description, the cluster numbers illustrated in FIG. 19 are referred to as group numbers.

Next, the co-occurrence degree calculating means 14 calculates the co-occurrence degree of two classifications selected from a classification group (step S4 in FIG. 3). Here, the co-occurrence degree is a statistic based on the co-occurrence frequency of the two classifications. FIG. 20 shows an example of the calculated co-occurrence scores. The table illustrated in FIG. 20 contains the classification group number, the two classifications for which the co-occurrence degree is calculated ("classification 1" and "classification 2"), and "co-occurrence score 1" and "co-occurrence score 2", which indicate the co-occurrence degree with respect to each classification. In this example, "co-occurrence score 1" and "co-occurrence score 2" are the conditional probabilities that "classification 1" and "classification 2" co-occur, conditioned on "classification 1" and "classification 2", respectively. The values of "co-occurrence score 1" and "co-occurrence score 2" can be calculated by (Formula 1) and (Formula 2) above, respectively.

  Specifically, the co-occurrence scores are calculated as follows. “Classification group 1” and “classification group 1′” each contain only one marked classification (that is, only one classification to which data equal to or greater than the threshold belongs), so the co-occurrence degree calculating means 14 does not calculate a co-occurrence score for them. On the other hand, “classification group 2” contains two or more marked classifications (for example, “health” and “medicine”). Therefore, the co-occurrence degree calculating means 14 calculates the co-occurrence scores for the two classifications “health” and “medicine” of “classification group 2” as follows.

  Here, suppose that the number of data items to which both “health” and “medicine” are assigned (that is, the co-occurrence frequency of “health” and “medicine”) is 16, the appearance frequency of “health” is 21, and the appearance frequency of “medicine” is 20. In this case, each co-occurrence score is calculated as follows.

  Co-occurrence score 1 = P(health, medicine | health) = co-occurrence frequency of “health” and “medicine” / frequency of “health” = 16 / 21 ≈ 0.76

  Co-occurrence score 2 = P(health, medicine | medicine) = co-occurrence frequency of “health” and “medicine” / frequency of “medicine” = 16 / 20 = 0.8

  Since other co-occurrence scores are calculated in the same manner, description thereof is omitted.
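  For reference, the conditional-probability form of the co-occurrence scores can be computed as in the following sketch, which assumes that (Expression 1) and (Expression 2) are the simple conditional probabilities described above; the numbers are those of the worked example.

    def co_occurrence_scores(co_freq, freq_1, freq_2):
        """Co-occurrence scores in the sense of (Expression 1) and (Expression 2):
        the conditional probability that class 1 and class 2 co-occur, given
        class 1 and given class 2 respectively."""
        return co_freq / freq_1, co_freq / freq_2

    # "health" appears 21 times, "medicine" 20 times, and both are assigned
    # to the same data item 16 times.
    score1, score2 = co_occurrence_scores(16, 21, 20)
    print(score1, score2)  # about 0.76 and 0.8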

  Next, the classification hierarchy updating means 15 updates the classification hierarchy based on the classification groups and the co-occurrence degrees (step S5 in FIG. 3). The classification hierarchy updating means 15 determines whether each co-occurrence degree (that is, each co-occurrence score) is high or low using thresholds on the co-occurrence score. In this embodiment, the classification hierarchy updating means 15 determines that a co-occurrence score is high when it is 0.7 or more, and that it is low when it is 0.2 or less.

  According to the co-occurrence degrees (co-occurrence scores) of “health” and “medicine” of “group 2” illustrated in FIG. 20, both “co-occurrence score 1” and “co-occurrence score 2” are determined to be high. Therefore, it can be said that these two classifications have a consensus relationship. Further, as described above, since the appearance frequency of “health” is 21 and the appearance frequency of “medicine” is 20, “health” can be regarded as the larger classification. Therefore, the classification hierarchy updating means 15 updates the classification hierarchy by integrating “medicine” into “health”.

  On the other hand, the co-occurrence degree of “health” and “transplant” in “group 2” and the co-occurrence degree of “medicine” and “transplant” in “group 2” illustrated in FIG. 20 are neither high nor low. Therefore, the classification hierarchy updating means 15 does not update the classification hierarchy for these pairs.

  Further, according to the co-occurrence degrees of “administration” and “diplomat” of “group 3” illustrated in FIG. 20, “co-occurrence score 1” is determined to be low and “co-occurrence score 2” is determined to be high. Therefore, it can be said that these two classifications have an inclusion relationship. Therefore, the classification hierarchy updating means 15 updates the classification hierarchy with “administration” as the parent and “diplomat” as the child.

  Similarly, both “co-occurrence score 1” and “co-occurrence score 2” of “group 4” illustrated in FIG. 20 are determined to be high. Therefore, it can be said that these two classifications have a consensus relationship. Here, since “home” is the larger classification, the classification hierarchy updating means 15 updates the classification hierarchy by integrating “child-raising” into “home”.
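  A minimal sketch of the update rules of step S5 follows, under the assumptions that the hierarchy is kept as {child: parent} links and that children of an integrated classification are re-attached to the surviving classification (a detail not specified above). The frequencies and scores in the usage lines are partly hypothetical.

    HIGH, LOW = 0.7, 0.2  # co-occurrence score thresholds of this embodiment

    def update_hierarchy(parent, freq, cls1, cls2, score1, score2):
        """parent: {child: parent} edges of the classification hierarchy.
        freq: number of data items belonging to each classification."""
        if score1 >= HIGH and score2 >= HIGH:
            # consensus relation: integrate the smaller classification into the larger one
            big, small = (cls1, cls2) if freq[cls1] >= freq[cls2] else (cls2, cls1)
            for child, par in list(parent.items()):
                if par == small:
                    parent[child] = big      # assumed: children of the merged class move up
            parent.pop(small, None)
            return f"integrated {small} into {big}"
        if score1 <= LOW and score2 >= HIGH:
            parent[cls2] = cls1              # cls1 includes cls2: cls1 becomes the parent
            return f"{cls1} set as parent of {cls2}"
        if score2 <= LOW and score1 >= HIGH:
            parent[cls1] = cls2
            return f"{cls2} set as parent of {cls1}"
        return "no update"                   # neither clearly high nor clearly low

    parent = {}
    freq = {"health": 21, "medicine": 20, "administration": 30, "diplomat": 9}  # last two hypothetical
    print(update_hierarchy(parent, freq, "health", "medicine", 0.76, 0.8))        # integrated medicine into health
    print(update_hierarchy(parent, freq, "administration", "diplomat", 0.15, 0.9))  # administration set as parent of diplomat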

  An example of the resulting classification hierarchy is shown in FIG. 21. The broken lines shown in FIG. 21 indicate parent-child relationships that existed before the classification hierarchy was updated. Among the classifications illustrated in FIG. 21, a classification to which data belongs is drawn with a thick border, and a classification to which no data belongs is drawn without a thick border. Note that the parent-child relationships that existed before the update may or may not be deleted; in this embodiment, the classification hierarchy updating means 15 deletes them later.

  Furthermore, the classification hierarchy updating means 15 may process classifications to which no data belongs. In this embodiment, a classification that has no belonging data and no child classification is deleted. For example, among the classifications illustrated in FIG. 21, “family law”, “diplomatic history”, and “government” have no belonging data and no child classifications, so the classification hierarchy updating means 15 updates the classification hierarchy by deleting them. The classification hierarchy updating means 15 may also delete a classification that has no belonging data and exactly one child classification, and create a direct hierarchical relationship by moving the child classification up to the parent of the deleted classification. In this embodiment, however, no such classification exists, so the classification hierarchy is not updated in this way. An example of the resulting classification hierarchy is shown in FIG. 22.
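  The clean-up of classifications with no belonging data could look like the following sketch; the {child: parent} representation and the example data are assumptions, and root classifications without data are not handled here.

    def prune(parent, has_data):
        """parent: {child: parent} edges; has_data: classifications with belonging data.
        Deletes data-less leaves and splices out data-less classes with one child."""
        changed = True
        while changed:
            changed = False
            children = {}
            for c, p in parent.items():
                children.setdefault(p, []).append(c)
            for cls in list(parent):
                if cls in has_data or cls not in parent:
                    continue
                kids = children.get(cls, [])
                if not kids:                      # no data and no children: delete
                    del parent[cls]
                    changed = True
                elif len(kids) == 1:              # no data, one child: move the child up
                    parent[kids[0]] = parent[cls]
                    del parent[cls]
                    changed = True
        return parent

    parent = {"family law": "law", "diplomatic history": "diplomat", "government": "administration",
              "diplomat": "administration", "law": "society", "administration": "society"}
    has_data = {"diplomat", "administration", "law"}
    print(prune(parent, has_data))
    # 'family law', 'diplomatic history' and 'government' disappear as data-less leaves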

  In addition, when information search results are displayed, the present invention can be applied to classifying and displaying the search results. The present invention can also be applied to displaying related words defined based on the relationship between the updated classification hierarchy and the words in each classification.

  Next, the minimum configuration of the present invention will be described. FIG. 23 is a block diagram showing the minimum configuration of the present invention. The classification hierarchy re-creation system according to the present invention includes clustering means 81 (for example, the clustering means 13) that clusters a data group associated with a hierarchical classification and creates a classification group, which is a group of classifications extracted, from among the classifications corresponding to each data item in a cluster, as satisfying a predetermined condition (for example, classifications to which many data items belong); co-occurrence degree calculation means 82 (for example, the co-occurrence degree calculation means 14) that calculates the co-occurrence degree of two classifications selected from the classification group (for example, according to (Expression 1) and (Expression 2)); and classification hierarchy re-creation means 83 (for example, the classification hierarchy updating means 15) that re-creates the classification hierarchy based on the classification group and the co-occurrence degree.

  With such a configuration, when an existing classification hierarchy is reconstructed to create a new classification hierarchy, a classification hierarchy that takes the hierarchical relationship of classifications into account, and in which classifications having the same meaning are integrated, can be created efficiently.
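  Purely as an illustration of how the three means of the minimum configuration might be chained, the following driver reuses the sketches above (make_classification_groups, co_occurrence_scores, update_hierarchy); the co_freq mapping keyed by frozenset pairs is an assumed data structure, and this wiring is not asserted to be the patented processing order.

    def recreate_hierarchy(cross_tab, parent, freq, co_freq, threshold=10):
        """co_freq maps frozenset({class1, class2}) -> co-occurrence frequency."""
        groups = make_classification_groups(cross_tab, threshold)            # clustering means 81
        for group in groups.values():
            for i, cls1 in enumerate(group):
                for cls2 in group[i + 1:]:
                    pair = frozenset((cls1, cls2))
                    if pair not in co_freq:
                        continue
                    s1, s2 = co_occurrence_scores(co_freq[pair], freq[cls1], freq[cls2])  # means 82
                    update_hierarchy(parent, freq, cls1, cls2, s1, s2)        # re-creation means 83
        return parent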

  Moreover, at least the classification hierarchy re-creation systems described below are disclosed in the embodiments described above.

(1) A classification hierarchy re-creation system comprising: clustering means (for example, the clustering means 13) that clusters a data group associated with a hierarchical classification and creates a classification group, which is a group of classifications extracted, from among the classifications corresponding to each data item in a cluster, as satisfying a predetermined condition (for example, classifications to which many data items belong); co-occurrence degree calculation means (for example, the co-occurrence degree calculation means 14) that calculates the co-occurrence degree of two classifications selected from the classification group (for example, according to (Expression 1) and (Expression 2)); and classification hierarchy re-creation means (for example, the classification hierarchy updating means 15) that re-creates the classification hierarchy based on the classification group and the co-occurrence degree.

(2) The classification hierarchy re-creation system in which the clustering means, when classifications in a created classification group are separated by a predetermined distance or more, divides that classification group and creates the resulting classification groups.

(3) The classification hierarchy re-creation system in which the co-occurrence degree calculation means calculates the co-occurrence degree based on the co-occurrence frequency, which is the number of data items in which the two classifications co-occur, and the number of data items belonging to each classification, and the classification hierarchy re-creation means determines, based on the co-occurrence degree, whether the two classifications are in an inclusion relationship or a consensus relationship and re-creates the classification hierarchy based on the determination result.

(4) The classification hierarchy re-creation system in which the classification hierarchy re-creation means, when the relationship between the two classifications is an inclusion relationship, re-creates the classification hierarchy by adding a hierarchy in which the including classification is the parent classification and the included classification is the child classification, and, when the relationship between the two classifications is a consensus relationship, re-creates the classification hierarchy by creating one classification in which the classification with fewer belonging data items is integrated into the classification with more belonging data items.

(5) The classification hierarchy re-creation system in which the classification hierarchy re-creation means, when adding a hierarchy in which the included classification becomes a child classification, re-creates the classification hierarchy by deleting the parent-child relationship that the child classification had before the classification hierarchy was re-created.

(6) The classification hierarchy re-creation system in which the classification hierarchy re-creation means re-creates the classification hierarchy by deleting a classification that has no belonging data and no child classification, and, for a classification that has no belonging data and exactly one child classification, by deleting that classification and creating a hierarchical relationship between the parent classification of the deleted classification and its child classification.

(7) The classification hierarchy re-creation system in which the clustering means (for example, the second clustering means 23) clusters a structured data group based on structured data, which is data having a structure, and a structure part name, which is a name identifying each part of the structured data, using the data extracted from the structured data for the part corresponding to the structure part name.

(8) The classification hierarchy re-creation system further comprising re-update means (for example, the re-update means 31) that instructs re-update of the classification hierarchy re-created by the classification hierarchy re-creation means, in which, when the re-created classification hierarchy does not satisfy a predetermined requirement, the re-update means changes at least one of the condition for creating classification groups and the condition on the co-occurrence degree for re-creating the classification hierarchy, the clustering means creates a classification group by extracting classifications that satisfy the changed condition, and the classification hierarchy re-creation means re-creates the classification hierarchy based on the changed condition.

(9) The classification hierarchy re-creation system in which the re-update means changes the condition when at least one of the number of classifications in the classification hierarchy, the depth of the classification hierarchy, the number of re-updates of the classification hierarchy, and the presence or absence of a stop instruction does not satisfy a predetermined requirement.

(10) The classification hierarchy re-creation system in which the clustering means creates a classification group by extracting, from the classifications corresponding to each data item in a cluster, classifications to which more than a predetermined number of data items belong.

  Although the present invention has been described with reference to the embodiments and examples, the present invention is not limited to the above embodiments and examples. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

  This application claims priority based on Japanese Patent Application No. 2009-160071 filed on July 6, 2009, the disclosure of which is incorporated herein in its entirety.

  The present invention is preferably applied to a classification hierarchy recreating system that reconstructs a hierarchical classification and creates a new classification hierarchy.

DESCRIPTION OF SYMBOLS
11 Input means
12 Classification hierarchy storage means
13 Clustering means
14 Co-occurrence degree calculation means
15 Classification hierarchy update means
21 Second input means
23 Second clustering means
31 Re-update means
100 Data processing device
101 Data storage device

Claims (22)

  1. A classification hierarchy re-creation system comprising:
    clustering means for clustering a data group associated with hierarchized classifications and creating a classification group, which is a group of classifications that satisfy a predetermined condition, extracted from among the classifications corresponding to each data item in a cluster;
    co-occurrence degree calculation means for calculating the co-occurrence degree of two classifications selected from the classification group; and
    classification hierarchy re-creation means for re-creating the classification hierarchy based on the classification group and the co-occurrence degree.
  2. The classification hierarchy re-creation system according to claim 1, wherein the clustering means, when classifications in a created classification group are separated by a predetermined distance or more, divides that classification group to create new classification groups.
  3. The classification hierarchy re-creation system according to claim 1 or 2, wherein the co-occurrence degree calculation means calculates the co-occurrence degree based on the co-occurrence frequency, which is the number of data items in which the two classifications co-occur, and the number of data items belonging to each classification, and
    the classification hierarchy re-creation means determines, based on the co-occurrence degree, whether the two classifications are in an inclusion relationship or a consensus relationship, and re-creates the classification hierarchy based on the determination result.
  4. The classification hierarchy re-creation system according to claim 3, wherein, when the relationship between the two classifications is an inclusion relationship, the classification hierarchy re-creation means re-creates the classification hierarchy by adding a hierarchy in which the including classification is the parent classification and the included classification is the child classification, and, when the relationship between the two classifications is a consensus relationship, re-creates the classification hierarchy by creating one classification in which the classification with fewer belonging data items is integrated into the classification with more belonging data items.
  5. The classification hierarchy re-creation system according to claim 4, wherein, when adding a hierarchy in which the included classification becomes a child classification, the classification hierarchy re-creation means re-creates the classification hierarchy by deleting the parent-child relationship that the child classification had before the re-creation.
  6. The classification hierarchy re-creation system according to claim 5, wherein the classification hierarchy re-creation means re-creates the classification hierarchy by deleting a classification that has no belonging data and no child classification, and, for a classification that has no belonging data and exactly one child classification, by deleting that classification and creating a hierarchical relationship between the parent classification of the deleted classification and its child classification.
  7. The classification hierarchy re-creation system according to claim 1, wherein the clustering means clusters a structured data group based on structured data, which is data having a structure, and a structure part name, which is a name for identifying each part of the structured data, using the data extracted from the structured data for the part corresponding to the structure part name.
  8. The classification hierarchy re-creation system according to any one of claims 1 to 7, further comprising re-update means for instructing re-update of the classification hierarchy re-created by the classification hierarchy re-creation means, wherein
    when the re-created classification hierarchy does not satisfy a predetermined requirement, the re-update means changes at least one of a condition for creating classification groups and a condition on the co-occurrence degree for re-creating the classification hierarchy,
    the clustering means creates a classification group by extracting classifications that satisfy the changed condition, and
    the classification hierarchy re-creation means re-creates the classification hierarchy based on the changed condition.
  9. The classification hierarchy re-creation system according to claim 8, wherein the re-update means changes the condition when at least one of the number of classifications in the classification hierarchy, the depth of the classification hierarchy, the number of re-updates of the classification hierarchy, and the presence or absence of a stop instruction does not satisfy a predetermined requirement.
  10. The classification hierarchy re-creation system according to any one of the above claims, wherein the clustering means creates a classification group by extracting, from the classifications corresponding to each data item in a cluster, classifications to which more than a predetermined number of data items belong.
  11. A classification hierarchy re-creation method wherein:
    clustering means of a data processing device clusters a data group associated with hierarchized classifications,
    the clustering means creates a classification group, which is a group of classifications that satisfy a predetermined condition, extracted from among the classifications corresponding to each data item in a cluster,
    co-occurrence degree calculation means of the data processing device calculates the co-occurrence degree of two classifications selected from the classification group, and
    classification hierarchy re-creation means of the data processing device re-creates the classification hierarchy based on the classification group and the co-occurrence degree.
  12. The classification hierarchy re-creation method according to claim 11, wherein the clustering means, when classifications in a created classification group are separated by a predetermined distance or more, divides that classification group to create new classification groups.
  13. The classification hierarchy re-creation method according to claim 11 or 12, wherein the co-occurrence degree calculation means calculates the co-occurrence degree based on the co-occurrence frequency, which is the number of data items in which the two classifications co-occur, and the number of data items belonging to each classification,
    the classification hierarchy re-creation means determines, based on the co-occurrence degree, whether the two classifications are in an inclusion relationship or a consensus relationship, and
    the classification hierarchy re-creation means re-creates the classification hierarchy based on the determination result.
  14. The classification hierarchy re-creation method according to claim 13, wherein, when the relationship between the two classifications is an inclusion relationship, the classification hierarchy re-creation means re-creates the classification hierarchy by adding a hierarchy in which the including classification is the parent classification and the included classification is the child classification, and, when the relationship between the two classifications is a consensus relationship, re-creates the classification hierarchy by creating one classification in which the classification with fewer belonging data items is integrated into the classification with more belonging data items.
  15. The classification hierarchy re-creation method according to claim 14, wherein, when adding a hierarchy in which the included classification becomes a child classification, the classification hierarchy re-creation means re-creates the classification hierarchy by deleting the parent-child relationship that the child classification had before the re-creation.
  16. The classification hierarchy re-creation method according to claim 15, wherein the classification hierarchy re-creation means re-creates the classification hierarchy by deleting a classification that has no belonging data and no child classification, and, for a classification that has no belonging data and exactly one child classification, by deleting that classification and creating a hierarchical relationship between the parent classification of the deleted classification and its child classification.
  17. The classification hierarchy re-creation method according to claim 11, wherein the clustering means clusters a structured data group based on structured data, which is data having a structure, and a structure part name, which is a name for identifying each part of the structured data, using the data extracted from the structured data for the part corresponding to the structure part name.
  18. The classification hierarchy re-creation method according to any one of claims 11 to 17, wherein, when the re-created classification hierarchy does not satisfy a predetermined requirement, re-update means of the data processing device changes at least one of a condition for creating classification groups and a condition on the co-occurrence degree for re-creating the classification hierarchy, and instructs re-update of the re-created classification hierarchy,
    the clustering means creates a classification group by extracting classifications that satisfy the changed condition, and
    the classification hierarchy re-creation means re-creates the classification hierarchy based on the changed condition.
  19. The classification hierarchy re-creation method according to claim 18, wherein the condition is changed when at least one of the number of classifications in the classification hierarchy, the depth of the classification hierarchy, the number of re-updates of the classification hierarchy, and the presence or absence of a stop instruction does not satisfy the predetermined requirement.
  20. The classification hierarchy re-creation method according to any one of the above claims, wherein the clustering means creates a classification group by extracting, from the classifications corresponding to each data item in a cluster, classifications to which more than a predetermined number of data items belong.
  21. A classification hierarchy re-creation program for causing a computer to execute:
    a clustering process of clustering a data group associated with hierarchized classifications and creating a classification group, which is a group of classifications that satisfy a predetermined condition, extracted from among the classifications corresponding to each data item in a cluster;
    a co-occurrence degree calculation process of calculating the co-occurrence degree of two classifications selected from the classification group; and
    a classification hierarchy re-creation process of re-creating the classification hierarchy based on the classification group and the co-occurrence degree.
  22. The classification hierarchy re-creation program according to claim 21, further causing the computer, in the clustering process, when classifications in a created classification group are separated by a predetermined distance or more, to divide that classification group and create new classification groups.
JP2011521779A 2009-07-06 2010-04-20 Classification hierarchy re-creation system, classification hierarchy re-creation method, and classification hierarchy re-creation program Active JP5621773B2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2009160071 2009-07-06
JP2009160071 2009-07-06
PCT/JP2010/002855 WO2011004529A1 (en) 2009-07-06 2010-04-20 Classification hierarchy re-creation system, classification hierarchy re-creation method, and classification hierarchy re-creation program

Publications (2)

Publication Number Publication Date
JPWO2011004529A1 JPWO2011004529A1 (en) 2012-12-13
JP5621773B2 true JP5621773B2 (en) 2014-11-12

Family

ID=43428962

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011521779A Active JP5621773B2 (en) 2009-07-06 2010-04-20 Classification hierarchy re-creation system, classification hierarchy re-creation method, and classification hierarchy re-creation program

Country Status (3)

Country Link
US (1) US8732173B2 (en)
JP (1) JP5621773B2 (en)
WO (1) WO2011004529A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6043067B2 (en) * 2011-06-03 2016-12-14 株式会社東芝 Stock trading knowledge extraction device and program
JP5308593B2 (en) * 2011-07-25 2013-10-09 楽天株式会社 Genre generator
US8650198B2 (en) * 2011-08-15 2014-02-11 Lockheed Martin Corporation Systems and methods for facilitating the gathering of open source intelligence
JP5583107B2 (en) * 2011-12-02 2014-09-03 日本電信電話株式会社 Keyword place name pair extraction apparatus, method, and program
US9286391B1 (en) * 2012-03-19 2016-03-15 Amazon Technologies, Inc. Clustering and recommending items based upon keyword analysis
EP2857985A4 (en) 2012-05-31 2016-08-03 Toshiba Kk Knowledge extraction device, knowledge updating device, and program
JP6201779B2 (en) * 2014-01-20 2017-09-27 富士ゼロックス株式会社 Information processing apparatus and information processing program
US9928232B2 (en) 2015-02-27 2018-03-27 Microsoft Technology Licensing, Llc Topically aware word suggestions
JP6359045B2 (en) * 2016-03-16 2018-07-18 ヤフー株式会社 Information processing apparatus, information processing method, and information processing program

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1196177A (en) * 1997-09-22 1999-04-09 Nippon Telegr & Teleph Corp <Ntt> Method for generating term dictionary, and storage medium recording term dictionary generation program
JP2000010996A (en) * 1998-06-24 2000-01-14 Fujitsu Ltd Document arranging device and method therefor
JP2003140942A (en) * 2001-10-19 2003-05-16 Xerox Corp Methods, systems, and articles of manufacture for soft hierarchical clustering of co-occurring objects
US20040181554A1 (en) * 1998-06-25 2004-09-16 Heckerman David E. Apparatus and accompanying methods for visualizing clusters of data and hierarchical cluster classifications
JP2005266866A (en) * 2004-03-16 2005-09-29 Fuji Xerox Co Ltd Document classifying device and classification system generating device and method for document classifying device
JP2008299382A (en) * 2007-05-29 2008-12-11 Fujitsu Ltd Data division program, recording medium with the program recorded thereon, data distribution device and data distribution method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8055597B2 (en) * 2006-05-16 2011-11-08 Sony Corporation Method and system for subspace bounded recursive clustering of categorical data
JP4884871B2 (en) 2006-07-27 2012-02-29 東京エレクトロン株式会社 Coating method and coating apparatus
US20080109454A1 (en) * 2006-11-03 2008-05-08 Willse Alan R Text analysis techniques

Also Published As

Publication number Publication date
WO2011004529A1 (en) 2011-01-13
JPWO2011004529A1 (en) 2012-12-13
US8732173B2 (en) 2014-05-20
US20120109963A1 (en) 2012-05-03

Legal Events

Date Code Title Description
20130306  A621  Written request for application examination  (JAPANESE INTERMEDIATE CODE: A621)

20140401  A131  Notification of reasons for refusal  (JAPANESE INTERMEDIATE CODE: A131)

20140515  A521  Written amendment  (JAPANESE INTERMEDIATE CODE: A523)

          TRDD  Decision of grant or rejection written

20140826  A01   Written decision to grant a patent or to grant a registration (utility model)  (JAPANESE INTERMEDIATE CODE: A01)

20140908  A61   First payment of annual fees (during grant procedure)  (JAPANESE INTERMEDIATE CODE: A61)

          R150  Certificate of patent or registration of utility model  (Ref document number: 5621773; Country of ref document: JP; JAPANESE INTERMEDIATE CODE: R150)