JP2008299382A - Data division program, recording medium with the program recorded thereon, data distribution device and data distribution method

Info

Publication number: JP2008299382A
Application number: JP2007141681A
Authority: JP
Inventor: Takahiro Saito (孝広齊藤)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2007-05-29
Filing date: 2007-05-29
Publication date: 2008-12-11
Anticipated expiration: 2027-05-29
Also published as: JP5045240B2

Abstract

PROBLEM TO BE SOLVED: To automatically divide multidimensional data into appropriate groups that match the purpose of analysis when the data is divided by an item having a hierarchical structure, and to present, for the resulting groups, the characteristics the user is interested in.

SOLUTION: In a data division device 100, an input unit 201 accepts the input of explanatory variables and objective variables. A determination unit 202 and a division unit 203 then execute determination processing and division processing (determination/division processing). When hierarchical structure data and a determination table are obtained as the division processing results, an integration unit 204 executes integration processing. When the hierarchical structure data and the determination table are obtained as the integration processing results, an output unit 205 outputs them.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a data division program that divides a multidimensional data group having explanatory variables that represent a hierarchy and objective variables that can represent a bias in distribution, and to a recording medium on which the program is recorded, a data division device, and a data division method.

Conventionally, analyzing the relationships between items has been a widely performed task in the analysis of multidimensional data. In many cases, one or more of those items have a hierarchical structure. For example, patent gazette data carries items with no hierarchy, such as the applicant, together with a hierarchical classification code called the IPC code.

Consider the task of analyzing the relationship between applicants and IPC codes in such patent gazette data (prior art 1). An IPC code encodes the technical content of a filed patent publication. Analyzing the relationship between IPC codes and applicants reveals the technical strengths and weaknesses of each applicant, so this analysis is widely performed.

For example, if many patent publications with a certain IPC code are found to have been filed by a certain applicant, that applicant can be seen to be strong in the technical field represented by that IPC code, and an appropriate strategy toward that applicant can be devised.

In this analysis method, for example, the patent gazette data is divided by IPC code and the bias in the distribution of applicants in each group is examined. Specifically, a stacked graph of the number of applications per applicant is created, with the IPC code on the horizontal axis and the applicants stacked on the vertical axis. From this graph one can read off, for instance, the IPC codes in which a specific applicant accounts for most of the applications and the IPC codes in which applications are spread evenly across applicants.
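As a rough illustration of the tabulation behind such a stacked graph, the following Python sketch counts applications per applicant within each IPC group. The records, the applicant names, and the helper `applicant_counts_by_ipc` are hypothetical, and truncating the code string to a prefix is only a simplified stand-in for choosing an IPC hierarchy level.

```python
from collections import Counter, defaultdict

# Hypothetical records: (IPC code, applicant). Values are illustrative only.
records = [
    ("G06F", "Company X"), ("G06F", "Company X"), ("G06F", "Company Y"),
    ("G06N", "Company Y"), ("G06N", "Company Z"), ("G06Q", "Company X"),
]

def applicant_counts_by_ipc(records, prefix_len):
    """Count applicants per IPC group, truncating each code to prefix_len characters."""
    table = defaultdict(Counter)
    for ipc, applicant in records:
        table[ipc[:prefix_len]][applicant] += 1
    return table

# The same data viewed at two levels of the hierarchy.
by_subclass = applicant_counts_by_ipc(records, 4)  # e.g. "G06F"
by_class = applicant_counts_by_ipc(records, 3)     # e.g. "G06"

for ipc, counts in sorted(by_subclass.items()):
    print(ipc, dict(counts))
```

Each level yields a different table, which is why the choice of level matters for what the graph reveals.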

In the above description, an item having a hierarchy, such as the “IPC code”, is referred to as an “explanatory variable”. An item whose distribution bias is of interest, such as the “applicant”, is referred to as an “objective variable”. In particular, a predetermined non-numeric value, such as an applicant, is referred to as a “fixed value”.

Among known data division techniques, one that resembles the above viewpoint is described in Patent Document 1 below (prior art 2). Prior art 2 uses a statistical test to determine whether the differences in the distribution of the objective variable among the divided results are significant, and merges groups between which there is no significant difference. With this technique, the original division results are merged into several groups, and a statistically significant difference in the distribution of the objective variable is guaranteed between the resulting groups.

As a known technique for dividing hierarchical data, there is also prior art 3 (see, for example, Patent Document 2 below), which creates groups of uniform granularity by dividing the data so that the number of items in each group falls within a certain range.

There is also an analysis in which the objective variable is not an item that takes fixed values, such as the “applicant” in the patent gazette data of prior art 1, but an item that takes orderable numerical information (continuous or discrete values) (prior art 4).

For example, in the task of analyzing the relationship between the sales count of a product and other items using POS data, the objective variable of interest is the sales count. If this analysis shows, for example, that the sales count of product A differs greatly between weekdays and holidays, the result can be used to decide an appropriate purchase quantity for each day of the week.

However, daily sales counts normally vary, so useful knowledge cannot be obtained merely by comparing individual daily sales records. It is therefore necessary to group the individual records and perform a statistical analysis under an assumption such as “under exactly the same conditions, the sales count follows a Poisson distribution” (this assumption is reasonable).

There is also an analysis that, instead of using an item recorded in the data as the hierarchical explanatory variable, applies a technique called hierarchical clustering to a free-description (text) item in the data, thereby creating a hierarchical structure of the data and using that structure as the explanatory variable (prior art 5).

One example is a monitor questionnaire about a product in which each record consists only of the respondent's gender and a free-text impression of the product, analyzed to find out how the aspects of the product that respondents focus on differ by gender.

In this analysis, a hierarchical structure is created from the similarity of the free-text entries and divided into groups at a suitable level. This yields knowledge such as the following: a group containing many impressions about the product's design has a high proportion of women, while a group containing many impressions about usability has a high proportion of men.

Commonly and widely used methods for dividing such hierarchical clustering results include dividing with uniform granularity, so that the differences in the number of data items among the resulting groups are kept as small as possible, and dividing at points where the similarity calculated during clustering is weak (prior art 6).

JP 2004-157814 A
Japanese Patent Laid-Open No. 10-124528

However, in prior art 1 described above, because the IPC code is hierarchical, it is unclear which level should be used for the horizontal axis of the graph in order for the desired information to appear. Even when the data as a whole is filed evenly by the applicants, a clear bias may emerge once it is divided at a particular level of the IPC code. The level at which such a feature becomes visible differs from one data set to another, so the appropriate level cannot be determined before the graph is created.

The analyst therefore had to create a graph for each level by trial and error and examine all of them. Moreover, given the actual purpose of the analysis, there is no reason to align the horizontal axis of the graph to IPC codes of a single level. For example, it is quite conceivable that a feature of one group can be read at the second level, whereas a feature of another group appears only after division down to the third level. It is therefore desirable that the data be divided at a granularity at which each group exhibits its own feature.

In addition, there may be groups that show no feature even when divided down to the lowest level (in the above example, a group of IPC codes for which applications are filed evenly by the applicants). Like groups with features, such featureless groups are important information, but for an analyst to find them, the multiple graphs created for the individual levels had to be examined in detail.

Prior art 2 described above does not consider the hierarchical case and therefore cannot be applied as-is to the task of prior art 1.

With prior art 3 described above, too, it is unclear whether each of the created groups has a feature with respect to the distribution of the objective variable. Furthermore, even if the above known techniques are combined with modifications that take the hierarchy of the explanatory variable into account, the intended result is not obtained.

In prior art 4 described above, the purchases recorded in POS data take place under a wide variety of conditions, so “exactly the same conditions” is extremely difficult to define. Groups must therefore be created by disregarding those differences among individual purchase records that are negligible for the analysis. When the item used for this division has a hierarchical structure, it is unclear at which level the division yields groups whose conditions genuinely differ, so the same problem arises as in the division of patent gazette data in prior art 1.

For example, before the analysis it is unknown whether the sales count of a product changes by season, by month, by week, or by day of the week. The analyst therefore has no choice but to divide the data into units chosen by common sense and verify the result, or to compare sales counts across every possible unit of division.

In prior art 5 described above as well, it is unclear at which level the data should be divided to obtain useful knowledge. Suppose, for example, that a group containing many impressions about the product's design shows no large difference in gender ratio, but that at the next level down it splits into a group saying the design is “good” and a group saying it is “bad”, each with a heavily skewed gender ratio. This knowledge is useful, yet it cannot be obtained when the data is divided at the upper level. In this case, too, it is unclear how the hierarchical clustering result should be divided to obtain useful findings.

Even if the division methods of prior art 6 are applied to prior art 5, it is unclear whether the division result has any feature with respect to the objective variable.

To solve the problems of the prior art described above, the first object of the present invention is to automatically divide multidimensional data into appropriate groups that match the purpose of analysis when the data is divided by an item having a hierarchical structure. Here, an “appropriate group that matches the purpose of analysis” is, for example, a group whose distribution of the item of interest to the user is distinctive, or a group that can be regarded as being under the same conditions with respect to the variation of a numerical item of interest. The second object is to present, for the divided groups, the features the user is interested in.

To solve the above problems and achieve these objects, the data division program, the recording medium on which the program is recorded, the data division device, and the data division method according to the present invention are characterized as follows. A group of multidimensional data corresponding to an explanatory variable representing a hierarchy of a multidimensional data group is taken as a division target group. The attribute of the division target group is determined by performing a statistical test based on the distribution, within that group, of an objective variable that can represent a bias in distribution. Based on the determination result, the division target group is divided into the child groups belonging to the child level of the level to which it belongs, and a child group selected from among them becomes the new division target group. Then, taking any group in the resulting group set as an integration target group, the child groups belonging to the child level of the integration target group are integrated into the integration target group based on their attributes, yielding an integration result.

According to this invention, multidimensional data can be divided at a granularity suited to the objective variable.

In the above invention, when it is determined, based on the number of multidimensional data items belonging to the division target group, that the statistical test is usable, the attribute of the division target group may be determined by performing the statistical test.

According to this invention, the statistical test can be performed more efficiently.

In the above invention, when it is determined, based on the number of multidimensional data items belonging to the division target group, that the statistical test is not usable, the division target group may be determined to be a group at which the transition to the child groups of its child level stops, and a group not yet selected as a division target group may become the new division target group.

According to this invention, attribute determination can be made more efficient and meaningless division can be suppressed.

In the above invention, it may be determined whether the distribution of the objective variable in the division target group is a characteristic distribution.

In particular, when the objective variable is a fixed value, whether the distribution of the objective variable in the division target group is a characteristic distribution may be determined by a significance test between that distribution and the distribution of the objective variable in the entire group set.

According to this invention, groups with features can be distinguished from groups without features.

In the above invention, when the two distributions differ significantly, the distribution of the objective variable in the division target group may be determined to be characteristic in comparison with the distribution of the entire group set, and a group not yet selected as a division target group may become the new division target group.

According to this invention, groups with features can be detected and over-division into lower levels can be suppressed.
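The significance test for a fixed-value objective variable could, for instance, be a chi-square test of the group's counts against the proportions of the entire group set. The text above does not name a specific test, so this choice, the function names, and the sample counts are all assumptions made for illustration. The critical value is supplied by the caller; 5.991 is the standard 5% critical value for 2 degrees of freedom.

```python
from collections import Counter

def chi_square_statistic(group_counts, overall_counts):
    """Chi-square statistic of a group's counts against the overall proportions."""
    n_group = sum(group_counts.values())
    n_all = sum(overall_counts.values())
    stat = 0.0
    for category, total in overall_counts.items():
        expected = n_group * total / n_all  # count expected if the group mirrored the whole
        observed = group_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

def is_characteristic(group_counts, overall_counts, critical_value):
    """True when the group's distribution differs significantly from the overall one."""
    return chi_square_statistic(group_counts, overall_counts) > critical_value

# Hypothetical applicant counts (three applicants, so 2 degrees of freedom).
overall = Counter({"X": 50, "Y": 30, "Z": 20})
skewed = Counter({"X": 18, "Y": 1, "Z": 1})   # dominated by applicant X
even = Counter({"X": 10, "Y": 6, "Z": 4})     # same proportions as the whole
CRITICAL_5PCT_DF2 = 5.991

print(is_characteristic(skewed, overall, CRITICAL_5PCT_DF2))
print(is_characteristic(even, overall, CRITICAL_5PCT_DF2))
```

Because the group is itself part of the entire set, this comparison is only approximate; a stricter variant would compare the group against the remaining records.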

In the above invention, when the objective variable is numerical, whether the distribution of the objective variable in the division target group is a characteristic distribution may be determined by performing a goodness-of-fit test between that distribution and a known probability distribution model estimated from it.

In particular, when the distribution of the objective variable in the division target group does not differ from the probability distribution model, the distribution may be determined to be a characteristic distribution that follows the probability distribution model, and a group not yet selected as a division target group may become the new division target group.

In the above invention, when the distribution of the objective variable in the division target group is not a characteristic distribution that follows the probability distribution model and no child groups belonging to the child level exist, the division target group may be determined to be a group at which division into the groups of its child level stops, and a group not yet selected as a division target group may become the new division target group.

In the above invention, when the distribution of the objective variable in the division target group is not a characteristic distribution that follows the probability distribution model and child groups belonging to the child level exist, the division target group may be determined to be a group whose attribute can be determined from the attributes of its child groups, and those child groups may become the new division target groups.

According to this invention, over-division into lower levels can be suppressed.
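The three outcomes for a numerical objective variable can be sketched as follows. This is a minimal illustration, assuming a Poisson model (suggested by the sales-count example earlier in this description) whose mean is estimated from the group, and a chi-square goodness-of-fit statistic. The function names, the attribute labels, the binning parameter `max_k`, and the sample data are hypothetical.

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_fit_statistic(values, max_k):
    """Chi-square statistic against Poisson(mean of values).

    Bins are 0 .. max_k-1 plus one tail bin for values >= max_k.
    Assumes the mean of values is positive.
    """
    n = len(values)
    lam = sum(values) / n                      # model estimated from the data itself
    observed = Counter(min(v, max_k) for v in values)
    stat, tail = 0.0, 1.0
    for k in range(max_k):
        p = poisson_pmf(k, lam)
        tail -= p
        stat += (observed.get(k, 0) - n * p) ** 2 / (n * p)
    stat += (observed.get(max_k, 0) - n * tail) ** 2 / (n * tail)
    return stat

def decide_numeric(values, has_children, max_k, critical_value):
    """Return 'characteristic', 'descend', or 'stop' for a division target group."""
    if poisson_fit_statistic(values, max_k) <= critical_value:
        return "characteristic"                # fits the model: no further division
    return "descend" if has_children else "stop"

# Hypothetical daily sales counts that follow a Poisson shape closely.
poisson_like = [0] * 14 + [1] * 27 + [2] * 27 + [3] * 18 + [4] * 9 + [5] * 4 + [6]
# Hypothetical mixture of two very different conditions (e.g. weekdays and holidays).
mixed = [0] * 50 + [10] * 50

# 9.488 is the 5% chi-square critical value for 4 degrees of freedom
# (6 bins, minus 1, minus 1 estimated parameter).
print(decide_numeric(poisson_like, True, 5, 9.488))
print(decide_numeric(mixed, True, 12, 19.675))
```

A group that mixes different conditions fails the fit and is handed down to its child groups, while a group that already fits the model is kept whole.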

In the above invention, when all the child groups belonging to the child level of the integration target group have been determined to be groups at which division stops, the child groups may be deleted and the integration target group may be given the same attribute as the child groups.

According to this invention, over-divided groups can be integrated.

In the above invention, when the distributions of the objective variable in all the child groups belonging to the child level of the integration target group have been determined to be characteristic distributions, the child groups need not be integrated into the integration target group.

According to this invention, only groups with characteristic distributions can be retained.
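The two integration rules above can be sketched as a bottom-up pass over the group tree. The dictionary layout, the attribute labels `"stop"`, `"characteristic"`, and `"descend"`, the group names, and the post-order traversal are assumptions made for this illustration. Mixed cases, which the passage above does not address, are simply left unchanged here.

```python
def group(name, attr, children=()):
    """Hypothetical group record: a name, a determined attribute, and child groups."""
    return {"name": name, "attr": attr, "children": list(children)}

def integrate(node):
    """Merge over-divided child groups back into their parent, bottom-up."""
    for child in node["children"]:
        integrate(child)
    children = node["children"]
    if children and all(c["attr"] == "stop" for c in children):
        node["children"] = []   # delete the child groups
        node["attr"] = "stop"   # the parent takes the children's attribute
    # If all children are "characteristic", they are kept as they are.
    return node

tree = group("R", "descend", [
    group("A", "descend", [group("A1", "stop"), group("A2", "stop")]),
    group("B", "descend", [group("B1", "characteristic"),
                           group("B2", "characteristic")]),
])
integrate(tree)

for g in tree["children"]:
    print(g["name"], g["attr"], [c["name"] for c in g["children"]])
```

In this example, group A absorbs its two featureless children, while group B keeps both of its characteristic children.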

According to the data division program, the recording medium on which the program is recorded, the data division device, and the data division method of the present invention, multidimensional data can be automatically divided into appropriate groups that match the purpose of analysis when it is divided by an item having a hierarchical structure. In addition, the features the user is interested in can be presented for the divided groups.

Preferred embodiments of the data division program, the recording medium on which the program is recorded, the data division device, and the data division method according to the present invention are described in detail below with reference to the accompanying drawings.

(Hardware configuration of the data division device)
First, the hardware configuration of the data division device according to the embodiment of the present invention is described. FIG. 1 is an explanatory diagram showing the hardware configuration of the data division device according to the embodiment of the present invention.

In FIG. 1, the data division device 100 comprises a computer main body 110, an input device 120, and an output device 130, and can be connected to a network 140 such as a LAN, a WAN, or the Internet via a router or modem (not shown).

The computer main body 110 has a CPU, a storage device, and an interface. The CPU controls the data division device 100 as a whole. The storage device comprises a ROM, a RAM, an HD, an optical disk 111, and a flash memory. The RAM is used as the work area of the CPU.

The storage device also stores various programs, which are loaded in response to instructions from the CPU. Reading and writing of data on the HD and the optical disk 111 are controlled by disk drives. The optical disk 111 and the flash memory are removable from the computer main body 110. The interface controls input from the input device 120, output to the output device 130, and transmission to and reception from the network 140.

The input device 120 includes a keyboard 121, a mouse 122, a scanner 123, and the like. The keyboard 121 has keys for entering characters, numbers, and various instructions, and is used to input data. It may be of a touch panel type. The mouse 122 is used to move the cursor, select a range, move a window, change its size, and so on. The scanner 123 reads images optically. A read image is captured as image data and stored in the storage device in the computer main body 110. The scanner 123 may be given an OCR function.

The output device 130 includes a display 131, a printer 132, a speaker 133, and the like. The display 131 displays a cursor, icons, and toolboxes as well as data such as documents, images, and function information. The printer 132 prints image data and document data. The speaker 133 outputs audio such as sound effects and read-aloud speech.

(Functional configuration of the data division device)
Next, the functional configuration of the data division device according to the embodiment of the present invention is described. FIG. 2 is a block diagram showing the functional configuration of the data division device according to the embodiment of the present invention.

The data division device 100 comprises an input unit 201, a determination unit 202, a division unit 203, an integration unit 204, an output unit 205, and a DB 210, and divides a multidimensional data group having explanatory variables representing a hierarchy and objective variables that can represent a bias in distribution. Each function is realized by having the CPU execute a program stored in a storage device such as the memory or HD shown in FIG. 1, or by the I/F.

First, the input unit 201 has a function of accepting input of an explanatory variable that represents a hierarchy of multidimensional data and an objective variable that can represent a bias in distribution. Multidimensional data is data having an explanatory variable and an objective variable. For example, as described above, when the multidimensional data is patent gazette data, it has the IPC code as an explanatory variable and, as objective variables, fixed values such as “applicant” and “inventor” and numerical values such as the “number of citations received” (the number of patents that cite the patent publication).

The multidimensional data may contain a free-description text item instead of an explanatory variable. In that case, a known technique called hierarchical clustering may be applied to the text in the free-description item to create a hierarchical structure of the multidimensional data group, and that hierarchical structure may be used as the explanatory variable. The input unit 201 may also accept input of parameters used in the division processing (for example, the significance level used in a statistical test, or the number of data items at which division goes no further).

When an explanatory variable is input in this way, the group of multidimensional data corresponding to it is identified. This identified group becomes the initial division target group. For example, in the analysis of patent gazette data, if the IPC code is “G06”, then “G06” becomes the initial division target group. FIG. 3 shows hierarchical structure data whose root is the explanatory variable serving as this initial division target group.

FIG. 3 is an explanatory diagram showing hierarchical structure data rooted at an explanatory variable. In the hierarchical structure data 300, the overall group R represents the explanatory variable. In FIG. 3, as an example, descendant groups branch down to the first, second, and third levels. Groups A, B, and C of the first level are child groups of the overall group R.

Hereinafter, a group belonging to the child hierarchy of a group G (where G is a group name) is written G# (where # is a digit), and a group belonging to the child hierarchy of G# is written G##. For example, the groups in the second hierarchy under Group A of the first hierarchy are A1, A2, and A3, and the groups in the third hierarchy, which is the child hierarchy of Group A1, are A11 and A12.

The hierarchical structure data 300 may be supplied directly to the input unit 201. Alternatively, when an arbitrary explanatory variable is input, the data may be created automatically by referring to the explanatory variables held by the multidimensional data in the DB 210. If hierarchical structure data 300 for the entire multidimensional data group is already stored in the DB 210, inputting an arbitrary explanatory variable to the input unit 201 may extract the partial hierarchical structure data 300 rooted at that explanatory variable.
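As an illustration of automatic creation, the following minimal sketch builds parent-to-children links from group names alone. It assumes the G, G#, G## naming convention with exactly one character per hierarchy; the function names `parent_of` and `build_hierarchy` and the root label "R" are hypothetical and do not appear in the original description.

```python
from typing import Dict, List, Optional


def parent_of(name: str, root: str = "R") -> Optional[str]:
    """Return the parent of a group named by the G, G#, G## convention."""
    if name == root:
        return None          # the entire group R has no parent
    if len(name) == 1:
        return root          # first-hierarchy groups (A, B, C) hang under R
    return name[:-1]         # drop the last digit: "A12" -> "A1" -> "A"


def build_hierarchy(names: List[str], root: str = "R") -> Dict[str, List[str]]:
    """Map each parent to its sorted child groups, adding missing ancestors."""
    children: Dict[str, List[str]] = {}
    for name in names:
        node = name
        # Walk up to the root so that intermediate groups are created too.
        while node != root:
            parent = parent_of(node, root)
            siblings = children.setdefault(parent, [])
            if node not in siblings:
                siblings.append(node)
            node = parent
    return {p: sorted(c) for p, c in children.items()}


# Hypothetical input: group labels found on the data items.
tree = build_hierarchy(["A11", "A12", "A2", "A3", "B", "C"])
```

A real explanatory variable such as an IPC code has its own segmentation rules, so `parent_of` would need to be replaced by a parser for that code system.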

Furthermore, only the group corresponding to the explanatory variable may be supplied at first, with the child groups of the lower hierarchy added each time the dividing unit 203 performs a division. In this case, the hierarchical structure data 600 shown in FIG. 6 is generated directly.

The entire group R, which is the root of the hierarchical structure data 300, is the initial division target group. A division target group is a group that is subject to determination by the determination unit 202 and to division by the dividing unit 203. By treating each group of the child hierarchy obtained by division (a child group) as a new division target group, the determination process of the determination unit 202 and the division process of the dividing unit 203 can be executed recursively.

The input unit 201 can be realized, for example, by an I/F through which a user selects data items displayed on the display 131 shown in FIG. 1 using an input device 120 such as the mouse 122 or the keyboard 121, or by an I/F that receives the input from an external device via the network 140.

In FIG. 2, the determination unit 202 takes the group of multidimensional data corresponding to the explanatory variable input through the input unit 201 as the division target group, and determines the attribute of that group by performing a statistical test based on the distribution of the objective variable within it. Specifically, it determines what kind of group the division target group is. The determination result is held as a flag in the determination table.

FIG. 4 is an explanatory diagram showing the determination table 400 in its initial state. The determination table 400 holds a flag for each group in each hierarchy of the hierarchical structure data 300. Because the table in FIG. 4 represents the state before determination starts, every flag is set to "0". Flag: 0 represents the re-division condition; a group's flag remains "0" when the group matches neither a stop condition nor the conforming condition.
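The initial state of the determination table can be pictured with a small sketch. It assumes the hierarchy is held as a parent-to-children mapping; the name `initial_table` and the sample hierarchy are hypothetical.

```python
from typing import Dict, List


def initial_table(children_of: Dict[str, List[str]], root: str = "R") -> Dict[str, int]:
    """Give every group reachable from the root the initial flag 0."""
    table: Dict[str, int] = {}
    stack = [root]
    while stack:
        group = stack.pop()
        table[group] = 0                       # Flag: 0 before any determination
        stack.extend(children_of.get(group, []))
    return table


# Hypothetical fragment of a hierarchy like the one in FIG. 3.
children_of = {"R": ["A", "B", "C"], "A": ["A1", "A2", "A3"], "A1": ["A11", "A12"]}
table = initial_table(children_of)
```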

When the determination unit 202 executes the determination process, a flag is set in the determination table 400 as the result. Several kinds of flag are provided, one for each group attribute. Flag: 1 indicates that the division target group is a "conforming group". The condition under which the flag is set to 1 is called the conforming condition.

The conforming condition is used to determine whether the distribution of the objective variable in the division target group is characteristic. If the distribution is characteristic, the group matches the conforming condition and its flag is set to "1". If it is not characteristic, the group does not match the conforming condition.

How the conforming condition is evaluated depends on the type of the objective variable. First, consider an objective variable that takes fixed (categorical) values, such as "applicant". In this case, the group matches the conforming condition if a statistical test shows that the distribution of the objective variable in the division target group differs significantly from its distribution in the entire group R. If the difference is significant, the distribution is regarded as characteristic and the flag is set to "1"; otherwise the group does not match the conforming condition.

When the objective variable is numerical, on the other hand, the group matches the conforming condition if the values of the objective variable in the division target group follow a known probability distribution model estimated from the multidimensional data of that group. If the values follow the model, the flag is set to "1"; if they do not, the group does not match the conforming condition.

Flag: 2 indicates that the division target group is a "stop group". The condition under which the flag is set to 2 is called a stop condition. A stop condition is used to determine whether to stop dividing the group into the groups of its child hierarchy. Two stop conditions are provided, referred to here as "stop condition 1" and "stop condition 2" to distinguish them.

Stop condition 1 concerns whether the statistical test method used to evaluate the conforming condition can be applied to the multidimensional data in the division target group. If the data do not satisfy the requirements of the test, the group matches stop condition 1 and its flag is set to "2". If the data do satisfy them, the test can be applied and the group does not match stop condition 1.

An example of stop condition 1 is that the number of multidimensional data items in the division target group is smaller than the number required to guarantee the accuracy of the statistical test method. If the group has fewer items than that, it matches stop condition 1 and its flag is set to "2".

Stop condition 2 is that the division target group has no child hierarchy and therefore cannot be divided further. A group with no child hierarchy matches stop condition 2 and its flag is set to "2". A group that has a child hierarchy does not match stop condition 2 and can be divided into that child hierarchy.

The dividing unit 203, based on the result determined by the determination unit 202, divides the division target group into the groups belonging to the child hierarchy of its hierarchy (the child groups), and sets a child group selected from among them as the new division target group. Specifically, the division into child groups is performed when the division target group matches the re-division condition.

When the hierarchical structure data 300 shown in FIG. 3 has been supplied before division, "division" means a transition from the division target group to its child groups. An arbitrary group among the destination child groups then becomes the new division target group. If child groups exist but there is no need to divide (transition) into them, those child groups and all groups below them are deleted.

When the hierarchical structure data 300 shown in FIG. 3 has not been supplied before division and only the entire group R is given, "division" means reading the child groups of the division target group from the DB 210 and adding them under the division target group. In this case only the child groups actually produced by division are added to the child hierarchy, so the hierarchical structure data 600 shown in FIG. 6 is generated directly. Specific examples of the determination process of the determination unit 202 and the division process of the dividing unit 203 are described later.

The integration unit 204 takes an arbitrary group in the hierarchical structure data 300 that forms the group set resulting from division as the integration target group, and integrates the groups belonging to the child hierarchy of that group (the child groups) into the integration target group based on their attributes.

Specifically, the integration unit 204 uses group attributes as clues to detect groups that were over-divided by the dividing unit 203, and integrates them into their parent group by deleting the over-divided groups. In this case, the flag representing the attribute of the parent group is changed from Flag: 0 (intermediate group, representing the re-division condition) to Flag: 2 (representing the stop condition). Details of the integration process of the integration unit 204 are described later.

The output unit 205 outputs the integration result produced by the integration unit 204. The integration result may be the contents of the finally obtained hierarchical structure data 1100 (FIG. 11) and determination table 400, or output data in which the objective variable is tallied for each resulting group and its distribution is graphed; the output is not limited to these forms.

Because each group in the output hierarchical structure data 1100 carries the attribute "conforming group (Flag: 1)" or "stop group (Flag: 2)", that attribute can be presented as well. The attribute can be displayed in various ways. For example, in the graphed output data described above, groups whose objective-variable distribution is characteristic compared with the whole ("conforming group (Flag: 1)") and groups that show no difference in distribution ("stop group (Flag: 2)") can be displayed in different colors.

(Specific examples of the determination process and the division process)
Next, specific examples of the determination process of the determination unit 202 and the division process of the dividing unit 203 are described. Regardless of the type of the objective variable, the determination unit 202 executes the determination in the order stop condition 1, conforming condition, stop condition 2. The case in which the objective variable takes fixed values is described first.
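The order of the three checks can be sketched as follows. This is only an illustration under stated assumptions: the names `Flag` and `determine` and the three arguments are hypothetical, and the statistical test is passed in as a callable so that it is evaluated only when stop condition 1 has not matched.

```python
from enum import IntEnum
from typing import Callable


class Flag(IntEnum):
    INTERMEDIATE = 0    # re-division condition: divide further
    CONFORMING = 1      # conforming condition matched
    STOP = 2            # stop condition 1 or 2 matched
    NOT_INTEGRABLE = 3  # assigned later, during the integration process


def determine(test_not_applicable: bool,
              is_characteristic: Callable[[], bool],
              has_children: bool) -> Flag:
    """Apply stop condition 1, the conforming condition, then stop condition 2."""
    if test_not_applicable:      # stop condition 1
        return Flag.STOP
    if is_characteristic():      # conforming condition (the test is applicable here)
        return Flag.CONFORMING
    if not has_children:         # stop condition 2
        return Flag.STOP
    return Flag.INTERMEDIATE     # nothing matched: re-division condition


# A group with enough data, no characteristic distribution, and child groups.
flag_a = determine(False, lambda: False, True)
```

Checking stop condition 1 first means the test is never run on a group too small for it to be reliable.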

For example, suppose the objective variable takes the fixed values X, Y, and Z. In an analysis of patent gazette data, these correspond to, for example, applicant X, applicant Y, and applicant Z. When the division target group is the entire group R at the root and R passes the check of stop condition 2 (that is, R has a child hierarchy and can be divided), the dividing unit 203 divides it into Groups A to C of the first hierarchy, which is its child hierarchy.

The determination unit 202 selects an undetermined group (for example, Group A) from among Groups A to C and sets it as the division target group. The number of multidimensional data items for each objective-variable value X, Y, and Z in Group A, now the division target group, and in the entire group R is tabulated below.

FIG. 5 is a chart showing the number of multidimensional data items for each objective-variable value X, Y, and Z in Group A and in the entire group R. First, it is determined whether the group matches stop condition 1. An example of stop condition 1 is "the number of multidimensional data items in the division target group is 10 or fewer".

Alternatively, a condition determined by the statistical test method used in the subsequent conforming-condition check and by the properties of the multidimensional data under test may be used. For example, the χ² test, a common statistical test method, is known to have the property that "when the null hypothesis holds, test accuracy drops markedly if more than 40% of the objective-variable values have an expected data count of less than 5". A more elaborate condition of this kind can also be adopted as stop condition 1.

In FIG. 5, when the null hypothesis holds, that is, when the distribution of the objective variable does not differ between Group A and the entire group R, the three values X, Y, and Z are expected to be evenly distributed in Group A as well, because they are evenly distributed in the entire group R (100 items each).

Group A contains 90 multidimensional data items in total, so the expected number of items for each of the values X, Y, and Z is 30 (90/3). None of the three expected counts is less than 5, so Group A does not match stop condition 1. The process therefore moves on to the conforming condition.
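The expected counts and the check of stop condition 1 for the FIG. 5 example can be sketched as follows. The function names and the default thresholds (expected count below 5 for more than 40% of the values) are assumptions taken from the rule quoted in the text, not a prescribed implementation.

```python
from typing import Dict


def expected_counts(whole: Dict[str, int], group_total: int) -> Dict[str, float]:
    """Expected counts in the group if it had the same distribution as the whole."""
    whole_total = sum(whole.values())
    return {value: group_total * n / whole_total for value, n in whole.items()}


def matches_stop_condition_1(expected: Dict[str, float],
                             min_expected: float = 5.0,
                             max_share: float = 0.4) -> bool:
    """True if more than max_share of the values have an expected count below min_expected."""
    small = sum(1 for e in expected.values() if e < min_expected)
    return small / len(expected) > max_share


whole_r = {"X": 100, "Y": 100, "Z": 100}    # entire group R (FIG. 5)
expected_a = expected_counts(whole_r, 90)   # Group A holds 90 items
stop_1 = matches_stop_condition_1(expected_a)
```

With the FIG. 5 figures every expected count is 30, so `stop_1` is false and the conforming condition is evaluated next.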

When the objective variable is numerical, a goodness-of-fit test against a known probability distribution model is performed. For that test it is known that "test accuracy drops if even one class has an expected number of multidimensional data items below 1", so this property can also be adopted as stop condition 1.

Next, it is determined whether the group matches the conforming condition. When the χ² test described above is used as the statistical test method, the χ² statistic given by equation (1) below is calculated.

χ² = Σ {(observed value - expected value)² / expected value} ... (1)

In the case of FIG. 5, the expected count for each objective-variable value is 30 and the observed counts are 32, 28, and 30 in order. Substituting these into equation (1) gives χ² = (32 - 30)²/30 + (28 - 30)²/30 + (30 - 30)²/30 = 8/30, or approximately 0.27.

Next, the calculated χ² statistic is used to judge whether the difference in distribution is significant. In FIG. 5 the objective variable can take three distinct values, so the number of degrees of freedom is one less, namely 2. For a χ² test with 2 degrees of freedom, if the significance level (the risk rate, that is, the probability of judging a difference significant when it is not) is set to, for example, 5%, the critical value is 5.991.

Accordingly, the distributions are judged significantly different when the calculated χ² statistic exceeds the critical value 5.991; in that case the group matches the conforming condition and its flag is set to "1". The χ² statistic of Group A, the current division target group, is 0.27, which does not exceed 5.991, so Group A does not match the conforming condition. The process therefore moves on to stop condition 2.
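The calculation for Group A can be reproduced with a few lines. This sketch uses only equation (1) and the critical value quoted in the text; the names `chi_square` and `CRITICAL_DF2_5PCT` are hypothetical.

```python
from typing import Sequence


def chi_square(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Equation (1): sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


CRITICAL_DF2_5PCT = 5.991  # critical value quoted in the text (2 degrees of freedom, 5%)

# Observed counts of X, Y, Z in Group A and their expected counts (FIG. 5).
stat_a = chi_square([32, 28, 30], [30.0, 30.0, 30.0])
conforming_a = stat_a > CRITICAL_DF2_5PCT
print(round(stat_a, 2), conforming_a)  # prints: 0.27 False
```

In practice the critical value would be looked up for the chosen significance level and degrees of freedom rather than hard-coded.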

Finally, it is determined whether the group matches stop condition 2. Group A in FIG. 5 has a child hierarchy, so it can be divided further and does not match stop condition 2. A group that matches none of the above conditions is judged to match the re-division condition. Group A matches the re-division condition, so it is divided into the three Groups A1 to A3 of the second hierarchy, and the same determination process is executed for each of them.

Group A1 matches the conforming condition, so it is given the attribute "conforming group (Flag: 1)" and is not divided further. Groups A2 and A3 match the re-division condition, so they are given the attribute "intermediate group (Flag: 0)" and are divided further. Group A2 is divided into Groups A21 and A22, each of which matches the conforming condition and is given the attribute "conforming group (Flag: 1)".

Group A3 is divided into Groups A31 and A32, each of which matches a stop condition and is given the attribute "stop group (Flag: 2)". The multidimensional data group is divided by applying the same processing to Groups B and C. The division result and the corresponding determination result are shown next.
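The recursive determination and division of Group A can be sketched as follows. The statistical determination is replaced by a lookup of the outcomes stated in the text, and the hierarchy below Group A1 (A11, A12) follows FIG. 3; the name `divide` and the data layout are hypothetical.

```python
from typing import Callable, Dict, List


def divide(group: str,
           children_of: Dict[str, List[str]],
           judge: Callable[[str], int],
           flags: Dict[str, int]) -> Dict[str, int]:
    """Judge a group and descend into its child groups only when its flag is 0."""
    flags[group] = judge(group)
    if flags[group] == 0:  # re-division condition
        for child in children_of.get(group, []):
            divide(child, children_of, judge, flags)
    return flags


# Hierarchy supplied before division.
children_of = {"A": ["A1", "A2", "A3"], "A1": ["A11", "A12"],
               "A2": ["A21", "A22"], "A3": ["A31", "A32"]}
# Outcomes stated in the text, standing in for the real determination.
outcomes = {"A": 0, "A1": 1, "A2": 0, "A3": 0,
            "A21": 1, "A22": 1, "A31": 2, "A32": 2}
flags = divide("A", children_of, outcomes.__getitem__, {})
```

Because Group A1 is a conforming group, its child groups A11 and A12 are never visited, which corresponds to "no further division".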

FIG. 6 is an explanatory diagram showing the hierarchical structure data 600 after the division process (before the integration process), and FIG. 7 is a chart showing the determination table 400 after the division process (before the integration process). In FIG. 6, the number at the lower right of each group is the value of the flag representing that group's attribute.

When an item that takes numerical values is selected as the objective variable, only the conforming condition changes: it is determined whether the objective variable of the division target group follows a known probability distribution model estimated from the multidimensional data (this is generally called a "goodness-of-fit test"). For example, suppose an item that takes discrete values, such as the number of units of a product sold per day, is selected as the objective variable. The chart below shows the frequency distribution of the objective-variable values in, for example, Group A1 (50 multidimensional data items).

FIG. 8 is a chart showing the frequency distribution of the objective-variable values in Group A1. When the Poisson distribution is adopted as the probability distribution for comparison, the expected value (mean) of the objective variable is 2.4, so the probability p(x) that the objective variable equals x is defined by equation (2) below.

p(x) = exp(-2.4) × 2.4^x / x! ... (2)

The expected number of multidimensional data items taking each objective-variable value, assuming the objective variable follows the probability distribution of equation (2), is shown below.

FIG. 9 is a chart showing the frequency distribution of the objective-variable values in Group A1 with the expected frequencies added. Among the six classes obtained by dividing the data by objective-variable value, none has an expected frequency below 1, so the group does not match stop condition 1. The conforming condition is therefore evaluated.
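The expected frequencies under equation (2) can be computed as in the sketch below. The contents of FIG. 9 are not reproduced in the text, so the split into six classes (values 0 to 4 plus a pooled class "5 or more") is an assumption made for illustration, as are the function names.

```python
import math
from typing import List


def poisson_pmf(x: int, mean: float) -> float:
    """Equation (2) generalized to an arbitrary mean."""
    return math.exp(-mean) * mean ** x / math.factorial(x)


def expected_frequencies(n: int, mean: float, top: int) -> List[float]:
    """Expected counts for x = 0 .. top-1 plus one pooled class for x >= top."""
    probs = [poisson_pmf(x, mean) for x in range(top)]
    probs.append(1.0 - sum(probs))   # remaining probability mass, x >= top
    return [n * p for p in probs]


# Group A1: 50 data items, mean 2.4, six classes (assumed: 0, 1, 2, 3, 4, 5 or more).
freqs = expected_frequencies(50, 2.4, 5)
stop_1 = any(f < 1.0 for f in freqs)   # stop condition 1 for the goodness-of-fit test
```

Under this assumed class split every expected frequency is above 1, which agrees with the statement that stop condition 1 is not matched.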

Next, the modified χ² statistic, a statistic often used in goodness-of-fit tests, is calculated. The modified χ² statistic is defined by equation (1) above, and for this data it is 2.13.

Meanwhile, at a significance level of 5%, the critical value is 9.488, which is the χ² critical value for 4 degrees of freedom (six classes, less one, less one more for the mean estimated from the data). The modified χ² statistic for this data is below that critical value, so the two distributions do not differ significantly. In other words, the distribution of the objective variable in this division target group can be said to follow the Poisson distribution. Group A1 therefore matches the conforming condition and is not divided further.
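Note that the direction of the decision differs between the two kinds of objective variable: for fixed values, a significant difference from the entire group makes the group conforming, whereas for numerical values, the absence of a significant difference from the model does. A minimal sketch, using the statistics and critical values quoted in the text and hypothetical function names:

```python
def conforming_for_fixed_values(stat: float, critical: float) -> bool:
    """Fixed values: conforming when the group differs significantly from the whole."""
    return stat > critical


def conforming_for_numeric(stat: float, critical: float) -> bool:
    """Numerical values: conforming when the data do not differ significantly from the model."""
    return not stat > critical


group_a = conforming_for_fixed_values(0.27, 5.991)   # Group A in the FIG. 5 example
group_a1 = conforming_for_numeric(2.13, 9.488)       # Group A1 in the Poisson example
```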

In the examples above, a statistical test method based on the common χ² statistic was used, but the statistical test is not limited to this method. Likewise, the Poisson distribution was used as the probability distribution for comparison when the objective variable is numerical, but the test is not limited to it. An appropriate probability distribution model can be selected by considering the probability distribution expected from the properties of the objective variable, such as whether it takes continuous or discrete values.

For example, when paint is sprayed onto a metal product to form a rust-preventive coating, the coating thickness under identical conditions is expected to follow a normal distribution. An analysis that finds the conditions for securing sufficient coating thickness with as little paint as possible is useful, and where the conditions have a hierarchical structure, the present method is useful for disregarding negligible differences in conditions and dividing the multidimensional data at the granularity that truly affects coating thickness.

(Specific example of the integration process)
Next, a specific example of the integration process by the integration unit 204 is described. The division result produced by the dividing unit 203 may contain over-division. The integration unit 204 detects over-divided groups and integrates them at the parent hierarchy. Specifically, over-divided groups are a set of groups all of which match a stop condition. Groups that match a stop condition show no characteristic with respect to the objective variable, and there is little point in keeping such featureless groups separate, so the integration unit 204 integrates them.

In the integration process, the integration unit 204 initially takes the root group (the entire group R) of the group set resulting from division (the hierarchical structure data 600 after division) as the integration target group, and proceeds according to steps (1) to (4) below.

(1) Determine whether the integration target group has the attribute "intermediate group (Flag: 0)" (that is, whether it has groups belonging to its child hierarchy). If it does not have this attribute, end the integration process for that integration target group.

(2) If it does have the attribute, execute (1) recursively with each group belonging to the child hierarchy as the integration target group. As a result of this integration processing, each of those groups that was an intermediate group is given either the attribute "stop group (Flag: 2)" or "intermediate group that cannot be integrated (Flag: 3)".

(3) If the groups belonging to the child hierarchy of the integration target group include a group with the attribute "conforming group (Flag: 1)" or "intermediate group that cannot be integrated (Flag: 3)", give the integration target group the attribute "intermediate group that cannot be integrated (Flag: 3)".

(4) If, after the integration process has been performed on all groups in the hierarchies below the integration target group (its descendant groups), the integration target group has not been given the attribute "intermediate group that cannot be integrated (Flag: 3)", give it the attribute "stop group (Flag: 2)" and delete all of its descendant groups. Steps (1) to (4) are explained below using the hierarchical structure data 600 shown in FIG. 6, which is the division result.
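Steps (1) to (4) can be sketched as a short recursive function. The data layout (a parent-to-children mapping and a flag table) and the function names are assumptions made for illustration; the sample data cover only the part of the division result under Group A, with the flags stated in the division example.

```python
from typing import Dict, List

INTERMEDIATE, CONFORMING, STOP, NOT_INTEGRABLE = 0, 1, 2, 3


def remove_descendants(group: str, children_of: Dict[str, List[str]],
                       flags: Dict[str, int]) -> None:
    """Delete every group below `group` from the hierarchy and the flag table."""
    for child in children_of.pop(group, []):
        remove_descendants(child, children_of, flags)
        del flags[child]


def integrate(group: str, children_of: Dict[str, List[str]],
              flags: Dict[str, int]) -> None:
    # (1) Only intermediate groups (Flag: 0) are processed.
    if flags[group] != INTERMEDIATE:
        return
    children = children_of.get(group, [])
    # (2) Process each child group recursively.
    for child in children:
        integrate(child, children_of, flags)
    # (3) A conforming or non-integrable child blocks integration.
    if any(flags[c] in (CONFORMING, NOT_INTEGRABLE) for c in children):
        flags[group] = NOT_INTEGRABLE
    else:
        # (4) Every child is a stop group: merge them into this group.
        flags[group] = STOP
        remove_descendants(group, children_of, flags)


children_of = {"A": ["A1", "A2", "A3"], "A2": ["A21", "A22"], "A3": ["A31", "A32"]}
flags = {"A": 0, "A1": 1, "A2": 0, "A3": 0,
         "A21": 1, "A22": 1, "A31": 2, "A32": 2}
integrate("A", children_of, flags)
```

After the call, Groups A and A2 carry Flag: 3, Group A3 carries Flag: 2, and Groups A31 and A32 no longer exist.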

FIG. 10 is an explanatory diagram showing the hierarchical structure data 600 during the integration process, FIG. 11 is an explanatory diagram showing the hierarchical structure data 1100 after the integration process, and FIG. 12 is a chart showing the determination table 400 after the integration process.

[1] In FIG. 6, the entire group R has been divided into the three child groups A to C (Flag: 0), so the process moves to [2].

[2] Group A (Flag: 0) has been further divided into the three child groups A1, A2, and A3, so the recursive processing of [2a] to [2c] below is executed with Group A as the integration target group.

[2a] The child group A1 has the attribute “conforming group (flag: 1)”, so the attribute of group A, the integration target group, is changed to “intermediate group that cannot be integrated (flag: 3)” (see FIG. 10).

[2b] The child group A2 (flag: 0) has groups in its own child hierarchy, so the integration process is applied with group A2 as the integration target group. Both child groups of group A2, groups A21 and A22, have the attribute “conforming group (flag: 1)”, so the attribute of group A2 is changed to “intermediate group that cannot be integrated (flag: 3)” (see FIG. 10). As a result, group A is again given the attribute “intermediate group that cannot be integrated (flag: 3)”, which is the same attribute it already received on account of group A1.

[2c] Group A3 (flag: 0) has groups in its child hierarchy, so the integration process is applied with group A3 as the integration target group. Both child groups of group A3, groups A31 and A32, have the attribute “stop group (flag: 2)”, so the attribute of group A3 is changed to “stop group (flag: 2)” (see FIG. 10). As a result, the child groups A31 and A32 are deleted.

As a result, in FIG. 10, group A has the attribute “intermediate group that cannot be integrated (flag: 3)” and is divided into four groups: group A1 (flag: 1), group A21 (flag: 1), group A22 (flag: 1), and group A3 (flag: 2). Because the attribute of group A is “intermediate group that cannot be integrated (flag: 3)”, the entire group R is also given the attribute “intermediate group that cannot be integrated (flag: 3)”.

[3] In FIG. 6, group B (flag: 0) also has groups in its child hierarchy, so the integration process is performed recursively, as for group A. As a result, group B is given the attribute “stop group (flag: 2)”, and its descendant groups B1, B2, B21, and B22 are deleted (see FIG. 11).

More specifically, the attribute of group B1 is “stop group (flag: 2)”, so no recursive processing is performed for it. Group B2 has the attribute “intermediate group (flag: 0)”, so the integration process is performed recursively with group B2 as the integration target group.

Both child groups of group B2, groups B21 and B22, have the attribute “stop group (flag: 2)”, so the attribute of group B2 is changed to “stop group (flag: 2)” and its child groups B21 and B22 are deleted (see FIG. 10).

As a result, both groups B1 and B2 have the attribute “stop group (flag: 2)”, so their parent, group B, is changed to “stop group (flag: 2)” and groups B1 and B2 are deleted (see FIG. 11).

[4] In FIG. 6, group C (flag: 0) also has groups in its child hierarchy, so the integration process is performed recursively, as for groups A and B. As a result, the attribute of group C is changed to “intermediate group that cannot be integrated (flag: 3)”, and group C is divided into its child groups, group C1 (flag: 1) and group C2 (flag: 2) (see FIGS. 10 and 11).

Because the attribute of group C is “intermediate group that cannot be integrated (flag: 3)”, the entire group R keeps the attribute “intermediate group that cannot be integrated (flag: 3)” that it has already been given.

[5] The entire group R has been given the attribute “intermediate group that cannot be integrated (flag: 3)”, so the series of processes ends. This yields the hierarchical structure data 1100 after the integration process shown in FIG. 11 and the determination table 400 after the integration process shown in FIG. 12.
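Steps [1] to [5] can be sketched in code. The following is a minimal Python sketch under stated assumptions, not the implementation of the embodiment: the `Group` class, the `integrate` and `leaves` functions, and the flag constant names are introduced here for illustration only, and the tree is the division result of FIG. 6 as far as it can be read from the walkthrough above.

```python
from dataclasses import dataclass, field

# Flag values as used in the text; the constant names are illustrative.
INTERMEDIATE, CONFORMING, STOP, UNMERGEABLE = 0, 1, 2, 3

@dataclass
class Group:
    name: str
    flag: int = INTERMEDIATE
    children: list = field(default_factory=list)

def integrate(group):
    """Apply rules (3) and (4) to one integration target group, children first."""
    if group.flag != INTERMEDIATE:
        return  # flag 1 and flag 2 groups are not integration targets
    for child in group.children:
        integrate(child)
    if any(c.flag in (CONFORMING, UNMERGEABLE) for c in group.children):
        group.flag = UNMERGEABLE      # rule (3)
    else:
        group.flag = STOP             # rule (4): all children are stop groups
        group.children.clear()        # delete the descendant groups

def leaves(group):
    """Names of the groups that remain at the bottom of the tree."""
    if not group.children:
        return [group.name]
    return [name for c in group.children for name in leaves(c)]

def g(name, flag=INTERMEDIATE, *children):
    return Group(name, flag, list(children))

# Division result described for FIG. 6 (flags before integration).
root = g("R", 0,
         g("A", 0,
           g("A1", 1),
           g("A2", 0, g("A21", 1), g("A22", 1)),
           g("A3", 0, g("A31", 2), g("A32", 2))),
         g("B", 0,
           g("B1", 2),
           g("B2", 0, g("B21", 2), g("B22", 2))),
         g("C", 0, g("C1", 1), g("C2", 2)))

integrate(root)
print(root.flag)     # 3
print(leaves(root))  # ['A1', 'A21', 'A22', 'A3', 'B', 'C1', 'C2']
```

The recursion visits children before deciding the parent's flag, which is why group A3 and group B collapse into stop groups while groups A, A2, C, and R end up with flag 3.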

Comparing the division result (FIGS. 6 and 7) with the integration result (FIGS. 11 and 12) shows that, where all child groups are stop groups (flag: 2), the parent group has become a stop group (flag: 2) and its child groups have been deleted. In addition, the groups that were not integrated, whose attribute was flag: 0 in the division result, have the attribute “intermediate group that cannot be integrated (flag: 3)” after the integration process.

In the above description, a group is not integrated if even one of its child groups has the attribute “conforming group (flag: 1)” or “intermediate group that cannot be integrated (flag: 3)”, but the integration process can also be performed with this condition relaxed.

For example, the number of child groups with the attribute “conforming group (flag: 1)” or “intermediate group that cannot be integrated (flag: 3)” may be compared with the total number of child groups, and integration performed when that ratio is at or below a fixed value. Alternatively, integration may be performed when the ratio of the number of multidimensional data items belonging to “conforming group (flag: 1)” or “intermediate group that cannot be integrated (flag: 3)” groups to the number of multidimensional data items in the parent group is at or below a fixed value.
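The two relaxed conditions can be illustrated as follows. This is a hedged sketch only: the function names, the `max_ratio` parameter, and the representation of child groups as `(flag, data_count)` pairs are assumptions made for this example, and the text does not specify a threshold value.

```python
BLOCKING_FLAGS = (1, 3)  # conforming group, intermediate group that cannot be integrated

def may_integrate_by_group_count(children, max_ratio):
    """children: list of (flag, data_count) pairs, one per child group.

    Integrate when the share of blocking child groups is at or below max_ratio.
    """
    if not children:
        return True
    blocking = sum(1 for flag, _ in children if flag in BLOCKING_FLAGS)
    return blocking / len(children) <= max_ratio

def may_integrate_by_data_count(children, parent_data_count, max_ratio):
    """Integrate when the share of the parent's data held by blocking
    child groups is at or below max_ratio."""
    blocking = sum(n for flag, n in children if flag in BLOCKING_FLAGS)
    return blocking / parent_data_count <= max_ratio

# Hypothetical parent with one small conforming child and three stop children.
children = [(1, 10), (2, 300), (2, 400), (2, 290)]
print(may_integrate_by_group_count(children, max_ratio=0.25))       # True  (1 of 4 groups)
print(may_integrate_by_data_count(children, 1000, max_ratio=0.05))  # True  (10 of 1000 items)
print(may_integrate_by_group_count(children, max_ratio=0.1))        # False
```

The two criteria can disagree: a single conforming child that holds most of the parent's data passes the group-count test easily but fails the data-count test.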

Besides integration into the parent group, when the child groups of a group (the parent group) are split between the attributes “conforming group (flag: 1)” and “stop group (flag: 2)”, a process that combines the multiple “stop group (flag: 2)” groups into a single group may be added. This process suits analyses in which the groups that become “conforming group (flag: 1)” are what matter, and the “stop group (flag: 2)” groups, which have no distinguishing characteristics, need no further detailed analysis. As a result of this process, one or more conforming groups and one stop group are created as child groups directly under the parent group.
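Combining sibling stop groups might look like the sketch below. The `Group` class, its `records` field, the function name `merge_stop_children`, and the naming of the combined group are all assumptions made for illustration; the text only states that the stop groups under one parent are combined into a single group.

```python
from dataclasses import dataclass, field

CONFORMING, STOP = 1, 2

@dataclass
class Group:
    name: str
    flag: int
    records: list = field(default_factory=list)
    children: list = field(default_factory=list)

def merge_stop_children(parent):
    """Replace all stop-group children of `parent` with one combined stop group."""
    stops = [c for c in parent.children if c.flag == STOP]
    if len(stops) < 2:
        return  # nothing to combine
    merged = Group(
        name="+".join(c.name for c in stops),  # hypothetical naming scheme
        flag=STOP,
        records=[r for c in stops for r in c.records],
    )
    parent.children = [c for c in parent.children if c.flag != STOP] + [merged]

# Hypothetical parent P with one conforming child and two stop children.
parent = Group("P", 3, children=[
    Group("P1", CONFORMING, [1, 2]),
    Group("P2", STOP, [3]),
    Group("P3", STOP, [4, 5]),
])
merge_stop_children(parent)
print([(c.name, c.flag) for c in parent.children])  # [('P1', 1), ('P2+P3', 2)]
```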

For example, in analyzing patent gazette data, groups whose IPC codes show a characteristic distribution of applicants are analyzed further from the viewpoint of which applicants file many (or few) applications, whereas the groups without characteristics are all “groups of patent gazettes filed evenly by every applicant” and so are not examined further.

(Data division processing procedure)
Next, the data division processing procedure according to the embodiment of the present invention is described. FIG. 13 is a flowchart showing this procedure. In FIG. 13, first, the input unit 201 executes input processing that accepts the input of the explanatory variables and the objective variable (step S1301), and then the determination processing by the determination unit 202 and the division processing by the division unit 203 (determination/division processing) are executed (step S1302).

When the division processing result (the hierarchical structure data 600 shown in FIG. 6 and the determination table 400 shown in FIG. 7) is obtained, the integration unit 204 executes the integration processing (step S1303). When the integration unit 204 obtains the integration processing result (the hierarchical structure data 1100 shown in FIG. 11 and the determination table 400 shown in FIG. 12), the output unit 205 executes output processing that outputs the integration processing result (step S1304).

(Determination/division processing procedure)
Next, the detailed procedure of the determination/division processing shown in FIG. 13 is described. FIG. 14 is a flowchart showing this detailed procedure. In FIG. 14, first, the hierarchical structure data 300 (see FIG. 3) rooted at the explanatory variable is acquired (step S1401), and flag: 0 is set as the attribute of all of its groups (step S1402).

Next, it is determined whether the entire group R matches stop condition 2 (step S1403). If it matches (step S1403: Yes), no child groups exist, so the attribute of the entire group R is changed to flag: 2 (step S1404) and the process proceeds to step S1303.

If stop condition 2 is not matched (step S1403: No), the group is divided into child groups (step S1405). It is then determined whether any of these child groups has the attribute flag: 0 (step S1406). If one does (step S1406: Yes), one child group with flag: 0 is selected as the division target group (step S1407) and the determination processing is executed (step S1408). A child group for which the determination processing has already been executed is not selected as a division target group again. The determination processing (step S1408) is described later.

The flag assigned to the division target group by this determination processing is then checked (step S1409). If it is flag: 0 (step S1409: Yes), the division target group has child groups, so the process returns to step S1405. In this way, division is performed recursively.

If it is not flag: 0 (step S1409: No), that is, if the flag is 1 or 2, the descendant groups of the division target group are deleted from the hierarchical structure data 300 shown in FIG. 3, and the process returns to step S1406.

If, in step S1406, none of the child groups has the attribute flag: 0 (step S1406: No), the determination processing for those child groups is complete, so the process moves to their parent group (step S1410). It is then determined whether that parent group is the entire group R (step S1411).

If the parent group is not the entire group R (step S1411: No), the process returns to step S1406, so that the search for child groups with flag: 0 continues. If it is the entire group R (step S1411: Yes), the process proceeds to step S1303. This ends the series of determination/division processing.
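The top-down flow of FIG. 14 can be approximated with a short recursive sketch. This is not the flowchart's literal loop structure (which walks up and down the hierarchy explicitly) but a recursive equivalent under stated assumptions: the `Group` class and the function names are hypothetical, `judge` stands in for the determination processing of step S1408, and `matches_stop2` stands in for stop condition 2, which is defined elsewhere in this description.

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    name: str
    flag: int = 0  # step S1402: every group starts with flag: 0
    children: list = field(default_factory=list)

def determine_and_divide(root, judge, matches_stop2):
    """Sketch of FIG. 14 for the entire group `root`."""
    if matches_stop2(root):            # step S1403
        root.flag = 2                  # step S1404
        return
    _divide(root, judge)

def _divide(parent, judge):
    for child in parent.children:      # steps S1405 to S1407
        child.flag = judge(child)      # step S1408
        if child.flag == 0:            # step S1409: Yes
            _divide(child, judge)      # divide further
        else:                          # step S1409: No (flag 1 or 2)
            child.children.clear()     # delete the descendant groups

# Hypothetical tree and a lookup table standing in for the real determination.
tree = Group("R", children=[
    Group("X", children=[Group("X1"), Group("X2")]),
    Group("Y", children=[Group("Y1")]),
])
verdicts = {"X": 0, "X1": 1, "X2": 2, "Y": 1}
determine_and_divide(tree, lambda grp: verdicts[grp.name],
                     matches_stop2=lambda grp: not grp.children)  # assumed reading of stop condition 2
print([(c.name, c.flag, len(c.children)) for c in tree.children])
# [('X', 0, 2), ('Y', 1, 0)]
```

Because a group that receives flag 1 or 2 loses its descendants immediately, no determination is ever run below it, which is the source of the efficiency discussed later in the comparison with the known technique.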

(Determination processing procedure)
Next, the detailed procedure of the determination processing (step S1408) shown in FIG. 14 is described. FIG. 15 is a flowchart showing this detailed procedure. In FIG. 15, first, it is determined whether the child group selected as the division target group in step S1407 matches stop condition 1 (step S1501). If it matches (step S1501: Yes), the flag of the division target group is changed to 2 (step S1502) and the process proceeds to step S1409.

If stop condition 1 is not matched (step S1501: No), it is determined whether the conformity condition is matched (step S1503). If it is (step S1503: Yes), the flag of the division target group is changed to 1 (step S1504) and the process proceeds to step S1409.

If the conformity condition is not matched (step S1503: No), it is determined whether stop condition 2 is matched (step S1505). If it is (step S1505: Yes), the flag of the division target group is changed to 2 (step S1506) and the process proceeds to step S1409. If it is not (step S1505: No), the flag of the division target group remains 0 and the process proceeds to step S1409. This ends the series of determination processing.
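The order of the three checks in FIG. 15 matters, and a small sketch makes it explicit. The function `judge` and its predicate parameters are hypothetical names; the three predicates used in the example (`too_small`, `is_characteristic`, `has_no_children`) and the dictionary fields are placeholders invented for illustration only and are not the actual stop and conformity conditions, which are defined elsewhere in this description.

```python
def judge(group, stop1, conforms, stop2):
    """Return the flag for a division target group, following FIG. 15."""
    if stop1(group):      # step S1501
        return 2          # step S1502: stop group
    if conforms(group):   # step S1503
        return 1          # step S1504: conforming group
    if stop2(group):      # step S1505
        return 2          # step S1506: stop group
    return 0              # remains an intermediate group

# Placeholder predicates (assumptions, not the patent's conditions).
too_small = lambda g: g["size"] < 5
is_characteristic = lambda g: g["p_value"] < 0.05
has_no_children = lambda g: not g["children"]

examples = [
    {"size": 3,   "p_value": 0.01, "children": ["x"]},  # stop condition 1 wins
    {"size": 100, "p_value": 0.01, "children": ["x"]},  # conforming
    {"size": 100, "p_value": 0.50, "children": []},     # stop condition 2
    {"size": 100, "p_value": 0.50, "children": ["x"]},  # keep dividing
]
print([judge(g, too_small, is_characteristic, has_no_children) for g in examples])
# [2, 1, 2, 0]
```

Stop condition 1 is checked before the conformity condition, so a group that matches both is still marked as a stop group.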

(Integration processing)
Next, the detailed procedure of the integration processing shown in FIG. 13 is described. FIG. 16 is a flowchart showing this detailed procedure. In FIG. 16, first, the flag of the entire group R is checked (step S1601). If it is not flag: 0 (step S1601: No), the process proceeds to step S1304.

If it is flag: 0 (step S1601: Yes), the process moves to the child hierarchy (step S1602), and it is determined whether the groups belonging to that child hierarchy (the child groups) include an unselected child group (step S1603). If they do (step S1603: Yes), one such child group is selected (step S1604) and its flag is checked (step S1605).

If the flag is 0 (step S1605: Yes), the process returns to step S1602 and moves further down to that group's child hierarchy. If the flag is not 0 (step S1605: No), the process returns to step S1603 to determine whether an unselected child group remains.

If, in step S1603, the child groups include no unselected child group (step S1603: No), the parent group of those child groups is taken as the integration target group and the flags of the child groups are checked (step S1606).

If the flags of the child groups are all 2 (step S1606: Yes), the flag of the parent group, which is the integration target, is changed to 2 (step S1607) and the child groups are deleted (step S1608). If the child groups have groups in lower hierarchies, those are deleted as well; that is, all descendant groups are deleted. The process then proceeds to step S1610.

If the flags of the child groups are not all 2 (step S1606: No), the flag of the parent group is changed to 3 (step S1609) and the process proceeds to step S1610. In step S1610, it is determined whether the parent group is the entire group R. If it is not (step S1610: No), the process moves to the parent hierarchy (step S1611) and returns to step S1603. If it is (step S1610: Yes), the process proceeds to step S1304. This ends the series of integration processing.

Thus, according to the embodiment of the present invention, multidimensional data can be divided appropriately for the purpose of the analysis. When the objective variable takes fixed-form values, each group is given an attribute describing its distribution, such as “group whose distribution is characteristic compared with the whole” or “group whose distribution does not differ from the whole”, and this attribute can be presented. When the objective variable takes numerical values, each group is given an attribute such as “group that follows a known probability distribution model” or “group that does not follow a known probability distribution model”, and for the former the parameters of the probability distribution model that the objective variable follows can also be presented.

Here, the present embodiment is compared with the known technique (Patent Document 1 described above). The known technique integrates two groups from the division result when there is no significant difference between their objective variable distributions. The processing of the present embodiment and the processing using the known technique are explained with the following example. The example specifies an item taking fixed-form values as the objective variable; when an item taking numerical values is specified, only the test method differs and the result is similar.

・First hierarchy: the explanatory variable takes two values, group A and group B, at the first hierarchy.
・Second hierarchy: groups A and B of the first hierarchy take the values of groups A1 to A50 and groups B1 to B50, respectively.
・Characteristics of each group: the objective variable distributions of groups A and B are characteristic compared with the entire group R. Among the lower groups of groups A and B (groups A1 to A50 and groups B1 to B50), the objective variable distributions differ slightly.

In this case, the determination/division processing divides the data into groups A and B at the first hierarchy. Each of groups A and B matches the conformity condition, so each is given the attribute “conforming group (flag: 1)” and no further division is performed. The number of determinations in this processing is two.

In the integration processing, the entire group R is divided into groups A and B. Groups A and B have the attribute “conforming group (flag: 1)”, so the entire group R is given the attribute “intermediate group that cannot be integrated (flag: 3)” and the processing ends. The number of determinations in this processing is two.

This determination simply refers to the attributes of the two lower groups to decide whether to integrate; it involves no complex processing such as a statistical test. As a result, the multidimensional data is divided into groups A and B of the first hierarchy, and each of the two groups is given the attribute that its objective variable distribution differs from that of the entire group R (at the significance level used in the test).

Next, the processing result of the known technique is described. First, groups A1 to A50 and B1 to B50 are created as the result of division down to the lowest groups. Then, as a device for taking the hierarchical nature of the multidimensional data into account, integration is performed under the added condition that “groups with different upper groups are not integrated”. Without this condition, the known technique determines whether integration is possible for every combination of groups, which takes 4950 determinations, the number of ways to choose 2 groups out of 100.

Under this condition, integration is first performed on, for example, groups A1 to A50, the lower groups of group A. Suppose that groups are integrated one pair at a time whenever the two selected groups show no significant difference. If no determination is wrong, each determination integrates two groups, so 49 determinations integrate the 50 groups A1 to A50 into one group.

However, the statistical test used for the determination is not guaranteed to give the correct result every time. For example, a test at the commonly used 5% significance level judges, 5% of the time, that there is a significant difference when there is in fact none. If determination errors occur at this rate in the processing of group A, one or two of the 49 determinations can be expected to be wrong. When such an error occurs, some of the lower groups A1 to A50 of group A are left unintegrated.

In that case, each determination that leaves groups unintegrated fails to reduce the number of groups, so the number of determinations increases further (in the worst case, where nothing is integrated, 1225 determinations are needed). When such unintegrated groups arise in the lower hierarchy, integration into group A is ultimately not achieved, and an appropriate result is not obtained. Such errors are equally likely in the processing of group B.
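The counts quoted for the known technique follow from simple combinatorics, as the short check below shows. The last two quantities go beyond the text and rest on an assumption that is labeled in the code: that the 49 tests are independent and each has exactly a 5% false-positive rate.

```python
from math import comb

# Without the "same upper group" condition: every pair among 100 lowest groups.
all_pairs = comb(100, 2)
print(all_pairs)                 # 4950

# Within group A, if every determination is correct: one merge per determination.
error_free_determinations = 50 - 1
print(error_free_determinations) # 49

# Worst case within group A: no pair is integrated, so every pair is tested.
worst_case = comb(50, 2)
print(worst_case)                # 1225

# ASSUMPTION (not from the text): independent tests, each with a 5% false-positive rate.
alpha = 0.05
expected_errors = error_free_determinations * alpha
p_at_least_one_error = 1 - (1 - alpha) ** error_free_determinations
print(round(expected_errors, 2))       # 2.45
print(round(p_at_least_one_error, 2))  # 0.92
```

Under that independence assumption, at least one false rejection among the 49 determinations is far more likely than not, which is consistent with the concern raised above that group A often fails to collapse into a single group.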

The significance level of the test can be controlled by the user, so the error probability above can be reduced by setting it as low as possible. However, lowering the significance level increases the opposite error, namely judging that there is no significant difference when there is in fact one. As a result, it becomes more likely that, for example, groups A and B, formed by the integration processing at the second hierarchy, are also judged to show no significant difference, so that in the end no division is performed at all.

Moreover, even if the data is correctly divided into groups A and B, it remains unknown whether those two groups have a biased objective variable distribution compared with the entire group R, that is, whether they are characteristic. For example, suppose the multidimensional data is such that group A is characteristic compared with the entire group R but group B is not. In the present embodiment the two can be told apart by the flag given to each group, whereas the known technique yields only the division result without making that distinction. The processing results of the known technique can be summarized as follows.

・The number of determinations is at least 99 (49 for each of the two second-hierarchy groups, and 1 at the first hierarchy).
・Because the number of tests is large, test errors are likely to occur, and an appropriate division result is likely not to be obtained.
・The characteristics of the resulting groups relative to the whole are unknown.

This tendency worsens as the number of hierarchies in the multidimensional data grows, because the number of groups in the lowest hierarchy increases. In the present embodiment, by contrast, determination starts from the top hierarchy, so even with many hierarchies no further division is performed once the conditions are met at an upper hierarchy, which is efficient.

In addition, the present embodiment performs its tests on upper-hierarchy groups that contain many multidimensional data items, so it has the advantage of higher test accuracy than the known technique, which tests between lowest-hierarchy groups containing few data items.

Next, consider different data in which group B is not characteristic but some of its lower groups are. In the present embodiment, the same division result is obtained regardless of the order in which the tests are performed. With the known technique, if groups that showed no significant difference in a single test are integrated one after another, the division result may change depending on the order in which the pairs of groups to be tested are chosen.

However, an appropriate order for these tests cannot be determined in advance. To obtain, with the known technique, a division result that never depends on the order of the data, a procedure such as the following is needed: test every pair among the 50 groups, integrate the pair whose objective variable distributions are closest, and repeat.

In that case, the first integration requires 1225 tests, the number of ways to choose 2 out of 50, after which the closest pair of groups is integrated. For the next integration, statistics are computed for the pairings of the newly formed group with each remaining group (48 pairs) to decide which groups to integrate next, and this is repeated until no further integration is possible.

With this method, the same division result is obtained regardless of the order of groups B1 to B50, but the number of statistic computations, searches for the closest mergeable pair, and merge operations becomes very large. For example, if only a few groups remain that cannot be merged any further, the above series of operations is performed about 2,500 times.

In the present embodiment, by contrast, each of the 50 subgroups of group B is tested only once, for 50 tests in total, so processing finishes with far fewer tests than with the known technique.
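The test counts compared above can be reproduced with a short sketch. It assumes one reading of the exhaustive procedure that the text does not state explicitly: the first round tests every pair, and each later round tests only the newly merged group against the groups that remain. The function names `pairwise_merge_tests` and `hierarchical_tests` and the parameter `n_final` are illustrative and do not appear in the specification.

```python
from math import comb


def pairwise_merge_tests(n_groups: int, n_final: int = 1) -> int:
    """Count tests for exhaustive closest-pair merging (assumed reading).

    Round 1 tests every pair of groups. Each later round tests only the
    newly merged group against each of the other remaining groups.
    Merging continues until n_final groups are left.
    """
    if not 1 <= n_final < n_groups:
        raise ValueError("n_final must satisfy 1 <= n_final < n_groups")
    total = comb(n_groups, 2)      # first merge: all pairs
    remaining = n_groups - 1       # groups left after the first merge
    while remaining > n_final:
        total += remaining - 1     # new group vs. each other group
        remaining -= 1
    return total


def hierarchical_tests(n_subgroups: int) -> int:
    """One test per subgroup, as described for the present embodiment."""
    return n_subgroups


if __name__ == "__main__":
    print(pairwise_merge_tests(50, 49))  # first merge only
    print(pairwise_merge_tests(50))      # merge down to a single group
    print(hierarchical_tests(50))
```

Under these assumptions, merging 50 groups down to one takes 1,225 + 48 + 47 + ... + 1 = 2,401 tests, which is of the same order as the figure of about 2,500 given above, against 50 tests for the present embodiment.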

Note that in this case the known technique and the present embodiment produce different outputs. In the known technique, groups in the same level whose objective-variable distributions are judged to be identical are merged into a single group, whereas the present embodiment performs no partial merging among groups in the same level (although groups without a characteristic can be merged into one if no further analysis is to be performed on them).

Examining this difference shows that the division result (strictly speaking, the integration result) of the present embodiment is the more appropriate one when, for example, the purpose of the analysis is to examine applicant trends by IPC code in the patent gazette data given as the application example.

For example, suppose that two IPC codes X and Y in the same level have 1,000 and 100 patent gazette records, respectively. If their applicant distributions can be regarded as identical, the known technique creates a single group of 1,100 records, and the breakdown between the two codes is lost.

In the present embodiment, by contrast, these two groups (IPC codes X and Y) are not merged, so the important information that the technology of IPC code X has many applications while that of IPC code Y has comparatively few is retained.

As described above, according to the embodiment of the present invention, when multidimensional data is divided by an item having a hierarchical structure, it can be divided automatically into appropriate groups that match the purpose of the analysis. In addition, the characteristics of interest to the user can be presented for the resulting groups.

The data division method described in the present embodiment can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The program is recorded on a computer-readable recording medium such as a hard disk, flexible disk, CD-ROM, MO, or DVD, and is executed by being read from the recording medium by the computer. The program may also be a transmission medium that can be distributed via a network such as the Internet.

The data division device 100 described in the present embodiment can also be realized by an application-specific IC (hereinafter simply "ASIC") such as a standard cell or a structured ASIC (Application Specific Integrated Circuit), or by a PLD (Programmable Logic Device) such as an FPGA. Specifically, for example, the data division device 100 can be manufactured by defining the functional configuration of the data division device 100 described above in an HDL description, logically synthesizing that HDL description, and supplying the result to the ASIC or PLD.

(Supplementary note 1) A data division program that causes a computer to function as:
determination means for, when a group of multidimensional data corresponding to an explanatory variable representing a hierarchy of a multidimensional data group is taken as a division target group, determining an attribute of the division target group by performing a statistical test based on the distribution, in the division target group, of an objective variable capable of representing distribution bias;
division means for, based on the determination result of the determination means, dividing the division target group into child groups belonging to the child level of the level to which the division target group belongs, and newly setting a child group selected from those child groups as the division target group; and
integration means for, when an arbitrary group in the group set obtained by the division means is taken as an integration target group, integrating the child groups belonging to the child level of the level of the integration target group into the integration target group based on the attributes of those child groups, thereby obtaining an integration result.

(Supplementary note 2) The data division program according to supplementary note 1, wherein the determination means determines the attribute of the division target group by performing the statistical test when the statistical test is determined to be applicable based on the number of multidimensional data items belonging to the division target group.

(Supplementary note 3) The data division program according to supplementary note 2, wherein
the determination means, when the statistical test is determined to be inapplicable based on the number of multidimensional data items belonging to the division target group, determines that the division target group is a group for which transition to the child groups belonging to the child level of its level is to be stopped, and
the division means newly sets, as the division target group, a group that has not yet been selected as the division target group.

(Supplementary note 4) The data division program according to supplementary note 1 or 2, wherein the determination means determines whether the distribution of the objective variable in the division target group is a characteristic distribution.

(Supplementary note 5) The data division program according to supplementary note 4, wherein, when the objective variable is a fixed-form value, the determination means determines whether the distribution of the objective variable in the division target group is a characteristic distribution by a significance test between the distribution of the objective variable in the division target group and the distribution of the objective variable in the entire group set.

(Supplementary note 6) The data division program according to supplementary note 5, wherein
the determination means, when the two distributions differ significantly, determines that the distribution of the objective variable in the division target group is a characteristic distribution in comparison with the distribution of the entire group set, and
the division means newly sets, as the division target group, a group that has not yet been selected as the division target group.
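The significance test of supplementary notes 5 and 6 can be illustrated with a minimal sketch. The notes do not name a particular test. The sketch assumes that a fixed-form objective variable can be treated as categorical and that a chi-square comparison of the group's category counts against the proportions of the entire group set at a 5% significance level is acceptable. The names `chi_square_vs_overall`, `is_characteristic`, and `CHI2_CRITICAL_05`, and the counts in the usage example, are hypothetical.

```python
# Upper 5% points of the chi-square distribution, keyed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_vs_overall(group_counts, overall_counts):
    """Return (statistic, degrees of freedom).

    Expected counts for the group are derived from the category
    proportions of the entire group set. Every category in
    overall_counts is assumed to have a positive count.
    """
    n_group = sum(group_counts.values())
    n_overall = sum(overall_counts.values())
    statistic = 0.0
    for category, overall in overall_counts.items():
        expected = n_group * overall / n_overall
        observed = group_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic, len(overall_counts) - 1


def is_characteristic(group_counts, overall_counts):
    """True when the group's distribution differs significantly (assumed 5% level)."""
    statistic, dof = chi_square_vs_overall(group_counts, overall_counts)
    return statistic > CHI2_CRITICAL_05[dof]


if __name__ == "__main__":
    overall = {"X": 100, "Y": 100, "Z": 100}   # hypothetical counts
    skewed = {"X": 30, "Y": 10, "Z": 10}
    flat = {"X": 17, "Y": 16, "Z": 17}
    print(is_characteristic(skewed, overall))
    print(is_characteristic(flat, overall))
```

A group judged characteristic in this way would not be divided further, and the division means would move on to a group that has not yet been selected.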

(Supplementary note 7) The data division program according to supplementary note 4, wherein, when the objective variable is numeric, the determination means determines whether the distribution of the objective variable in the division target group is a characteristic distribution by performing a goodness-of-fit test between the distribution of the objective variable in the division target group and a known probability distribution model estimated from that distribution.

(Supplementary note 8) The data division program according to supplementary note 7, wherein
the determination means, when the distribution of the objective variable in the division target group does not differ from the probability distribution model, determines that the distribution of the objective variable in the division target group is a characteristic distribution that follows the probability distribution model, and
the division means newly sets, as the division target group, a group that has not yet been selected as the division target group.
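The goodness-of-fit test of supplementary notes 7 and 8 can be sketched as follows. The notes do not specify the probability distribution model or the test. The sketch assumes a normal model whose mean and standard deviation are estimated from the group's own values, and uses the Kolmogorov-Smirnov distance as the test statistic. The names `ks_distance_to_fitted_normal` and `follows_model` are hypothetical, and the critical value is left to the caller because the standard Kolmogorov-Smirnov tables do not apply directly when the parameters are estimated from the same data.

```python
from statistics import NormalDist, fmean, stdev


def ks_distance_to_fitted_normal(values):
    """Largest gap between the empirical CDF and a normal model
    fitted to the same values (assumed model, for illustration)."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("at least two values are required")
    sigma = stdev(xs)
    if sigma == 0:
        raise ValueError("values must not all be identical")
    model = NormalDist(fmean(xs), sigma)
    distance = 0.0
    for i, x in enumerate(xs):
        f = model.cdf(x)
        # Compare the model with the empirical CDF just after and just before x.
        distance = max(distance, (i + 1) / n - f, f - i / n)
    return distance


def follows_model(values, critical_value):
    """True when the distribution is not judged to differ from the model."""
    return ks_distance_to_fitted_normal(values) <= critical_value


if __name__ == "__main__":
    bell = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
    two_peaks = [0.0] * 25 + [10.0] * 25
    print(ks_distance_to_fitted_normal(bell))
    print(ks_distance_to_fitted_normal(two_peaks))
```

Note the direction of the decision: under supplementary note 8, a group whose distribution does not differ from the model is the one treated as characteristic.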

(Supplementary note 9) The data division program according to supplementary note 4, wherein
the determination means, when the distribution of the objective variable in the division target group is not a characteristic distribution that follows the probability distribution model and no child groups belonging to the child level exist, determines that the division target group is a group for which division into the groups belonging to the child level of its level is to be stopped, and
the division means newly sets, as the division target group, a group that has not yet been selected as the division target group.

(Supplementary note 10) The data division program according to supplementary note 4, wherein
the determination means, when the distribution of the objective variable in the division target group is not a characteristic distribution that follows the probability distribution model and child groups belonging to the child level exist, determines that the division target group is a group that can be determined by the attributes of the child groups belonging to the child level of its level, and
the division means newly sets the child group as the division target group.

(Supplementary note 11) The data division program according to any one of supplementary notes 1 to 10, wherein, when all child groups belonging to the child level of the level of the integration target group have been determined by the determination means to be groups for which division is to be stopped, the integration means deletes the child groups and gives the integration target group the same attribute as the child groups.

(Supplementary note 12) The data division program according to any one of supplementary notes 1 to 11, wherein, when the distributions of the objective variable in the child groups belonging to the child level of the level of the integration target group have all been determined to be characteristic distributions, the integration means does not integrate the child groups into the integration target group.

(Supplementary note 13) A computer-readable recording medium on which the data division program according to any one of supplementary notes 1 to 12 is recorded.

(Supplementary note 14) A data division device comprising:
determination means for, when a group of multidimensional data corresponding to an explanatory variable representing a hierarchy of a multidimensional data group is taken as a division target group, determining an attribute of the division target group by performing a statistical test based on the distribution, in the division target group, of an objective variable capable of representing distribution bias;
division means for, based on the determination result of the determination means, dividing the division target group into child groups belonging to the child level of the level to which the division target group belongs, and newly setting a child group selected from those child groups as the division target group; and
integration means for, when an arbitrary group in the group set obtained by the division means is taken as an integration target group, integrating the child groups belonging to the child level of the level of the integration target group into the integration target group based on the attributes of those child groups, thereby obtaining an integration result.

(Supplementary note 15) A data division method comprising:
a determination step of, when a group of multidimensional data corresponding to an explanatory variable representing a hierarchy of a multidimensional data group is taken as a division target group, determining an attribute of the division target group by performing a statistical test based on the distribution, in the division target group, of an objective variable capable of representing distribution bias;
a division step of, based on the determination result of the determination step, dividing the division target group into child groups belonging to the child level of the level to which the division target group belongs, and newly setting a child group selected from those child groups as the division target group; and
an integration step of, when an arbitrary group in the group set obtained by the division step is taken as an integration target group, integrating the child groups belonging to the child level of the level of the integration target group into the integration target group based on the attributes of those child groups, thereby obtaining an integration result.
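The determination, division, and integration steps recited in the supplementary notes can be summarized in a minimal sketch. It is an illustration under stated assumptions, not the implementation of the embodiment. The statistical test is replaced by a caller-supplied predicate `is_characteristic`, and the applicability check is replaced by a simple threshold `min_size`. Bottom-up integration and leaving mixed sets of child groups unmerged are also assumptions. All identifiers (`Group`, `determine`, `divide`, `integrate`, `final_groups`, and the attribute names) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

STOP = "stop"                      # division stopped at this group
CHARACTERISTIC = "characteristic"  # objective-variable distribution is characteristic
BY_CHILDREN = "by_children"        # to be judged through its child groups


@dataclass
class Group:
    name: str
    size: int
    children: List["Group"] = field(default_factory=list)
    attribute: Optional[str] = None


def determine(group: Group, is_characteristic: Callable[[Group], bool],
              min_size: int) -> str:
    if group.size < min_size:      # test not applicable: stop descending
        return STOP
    if is_characteristic(group):   # stand-in for the statistical test
        return CHARACTERISTIC
    return BY_CHILDREN if group.children else STOP


def divide(root: Group, is_characteristic: Callable[[Group], bool],
           min_size: int) -> None:
    pending = [root]
    while pending:
        group = pending.pop()
        group.attribute = determine(group, is_characteristic, min_size)
        if group.attribute == BY_CHILDREN:
            pending.extend(group.children)   # children become division targets


def integrate(group: Group) -> None:
    if group.attribute != BY_CHILDREN:
        return
    for child in group.children:             # assumed: bottom-up order
        integrate(child)
    if all(child.attribute == STOP for child in group.children):
        group.children = []                  # delete the child groups
        group.attribute = STOP               # parent takes their attribute
    # Otherwise (e.g. all children characteristic) nothing is merged.


def final_groups(group: Group) -> List[str]:
    if group.attribute != BY_CHILDREN:
        return [group.name]
    return [name for child in group.children for name in final_groups(child)]


if __name__ == "__main__":
    root = Group("R", 1000, [
        Group("A", 300),
        Group("B", 500, [Group("B1", 5), Group("B2", 200)]),
        Group("C", 200, [Group("C1", 100), Group("C2", 100)]),
    ])
    divide(root, lambda g: g.name in {"A", "C1", "C2"}, min_size=10)
    integrate(root)
    print(final_groups(root))
```

In this hypothetical run, B1 and B2 are both stopped groups and are folded back into B, whereas C1 and C2 are both characteristic and remain separate, in line with supplementary notes 11 and 12.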

As described above, the data division program, the recording medium on which the program is recorded, the data division device, and the data division method according to the present invention relate to the division of multidimensional data having a plurality of items, and are suitable for analyzing data in which one of the items has a hierarchical structure, such as patent gazette data. They are also suitable for analyzing data in which one of the items is written as text.

An explanatory diagram showing the hardware configuration of the data division device according to the embodiment of the present invention.
A block diagram showing the functional configuration of the data division device according to the embodiment of the present invention.
An explanatory diagram showing hierarchical structure data whose root is the explanatory variable.
An explanatory diagram showing the determination table (initial state).
A chart showing the number of multidimensional data items for each of the objective variables X, Y, and Z in group A and in the overall group R.
An explanatory diagram showing the hierarchical structure data after division processing (before integration processing).
A chart showing the determination table after division processing (before integration processing).
A chart showing the distribution of data frequencies for each objective variable in group A1.
A chart showing the distribution of data frequencies for each objective variable in group A1 (with expected frequencies added).
An explanatory diagram showing the hierarchical structure data during integration processing.
An explanatory diagram showing the hierarchical structure data after integration processing.
A chart showing the determination table after integration processing.
A flowchart showing the data division processing procedure according to the embodiment of the present invention.
A flowchart showing the detailed procedure of the determination/division processing shown in FIG. 13.
A flowchart showing the detailed procedure of the determination processing shown in FIG. 14.
A flowchart showing the detailed procedure of the integration processing shown in FIG. 13.

Explanation of symbols

100 Data division device
201 Input unit
202 Determination unit
203 Division unit
204 Integration unit
205 Output unit
300, 600, 1100 Hierarchical structure data
400 Determination table

Claims

1. A data division program that causes a computer to function as:
determination means for, when a group of multidimensional data corresponding to an explanatory variable representing a hierarchy of a multidimensional data group is taken as a division target group, determining an attribute of the division target group by performing a statistical test based on the distribution, in the division target group, of an objective variable capable of representing distribution bias;
division means for, based on the determination result of the determination means, dividing the division target group into child groups belonging to the child level of the level to which the division target group belongs, and newly setting a child group selected from those child groups as the division target group; and
integration means for, when an arbitrary group in the group set obtained by the division means is taken as an integration target group, integrating the child groups belonging to the child level of the level of the integration target group into the integration target group based on the attributes of those child groups, thereby obtaining an integration result.

2. The data division program according to claim 1, wherein the determination means determines the attribute of the division target group by performing the statistical test when the statistical test is determined to be applicable based on the number of multidimensional data items belonging to the division target group.

3. The data division program according to claim 2, wherein
the determination means, when the statistical test is determined to be inapplicable based on the number of multidimensional data items belonging to the division target group, determines that the division target group is a group for which transition to the child groups belonging to the child level of its level is to be stopped, and
the division means newly sets, as the division target group, a group that has not yet been selected as the division target group.

4. The data division program according to claim 1 or 2, wherein the determination means determines whether the distribution of the objective variable in the division target group is a characteristic distribution.

5. The data division program according to claim 4, wherein, when the objective variable is a fixed-form value, the determination means determines whether the distribution of the objective variable in the division target group is a characteristic distribution by a significance test between the distribution of the objective variable in the division target group and the distribution of the objective variable in the entire group set.

6. A computer-readable recording medium on which the data division program according to claim 1 is recorded.

7. A data division device comprising:
determination means for, when a group of multidimensional data corresponding to an explanatory variable representing a hierarchy of a multidimensional data group is taken as a division target group, determining an attribute of the division target group by performing a statistical test based on the distribution, in the division target group, of an objective variable capable of representing distribution bias;
division means for, based on the determination result of the determination means, dividing the division target group into child groups belonging to the child level of the level to which the division target group belongs, and newly setting a child group selected from those child groups as the division target group; and
integration means for, when an arbitrary group in the group set obtained by the division means is taken as an integration target group, integrating the child groups belonging to the child level of the level of the integration target group into the integration target group based on the attributes of those child groups, thereby obtaining an integration result.

8. A data division method comprising:
a determination step of, when a group of multidimensional data corresponding to an explanatory variable representing a hierarchy of a multidimensional data group is taken as a division target group, determining an attribute of the division target group by performing a statistical test based on the distribution, in the division target group, of an objective variable capable of representing distribution bias;
a division step of, based on the determination result of the determination step, dividing the division target group into child groups belonging to the child level of the level to which the division target group belongs, and newly setting a child group selected from those child groups as the division target group; and
an integration step of, when an arbitrary group in the group set obtained by the division step is taken as an integration target group, integrating the child groups belonging to the child level of the level of the integration target group into the integration target group based on the attributes of those child groups, thereby obtaining an integration result.