JP2013196609A

JP2013196609A - Data analysis support device and data analysis support processing program

Info

Publication number: JP2013196609A
Application number: JP2012065768A
Authority: JP
Inventors: Seiji Egawa; 誠二江川; Rumi Hayakawa; ルミ早川; Shigeaki Sakurai; 茂明櫻井; Kazuyoshi Nishi; 一嘉西
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2012-03-22
Filing date: 2012-03-22
Publication date: 2013-09-30
Anticipated expiration: 2032-03-22
Also published as: CN103325002A; CN103325002B; JP5367112B2

Abstract

PROBLEM TO BE SOLVED: To improve the precision of analysis when data on different organizations are integrated, even if missing values arise because data attributes differ among the organizations. SOLUTION: A data analysis support device includes: organization-specific data table storage means for storing organization-specific data tables that manage, for each of a plurality of organizations to be analyzed, records that are data having at least one kind of attribute; distance calculation means for calculating, for each pair of records of organizations that share at least one kind of common attribute in the organization-specific data tables, the distance between the records on the basis of the values of the common attributes, using the number of attributes common to the records and the aggregated values of those common attributes; and analysis processing means for performing an analysis on the basis of the calculated distances.

Description

Embodiments of the present invention relate to a data analysis support device and a data analysis support processing program that support the analysis of data related to different organizations.

Conventionally, office error data, which is aggregated data on each of several different organizations such as multiple financial institutions, has been integrated and analyzed so that organizations whose office error data show the same tendency are clustered together.

Here, suppose that the aggregated data of a specific organization and that of another organization have attributes that do not match; for example, the specific organization aggregates information on the causes of office errors while the other organization does not. In that case, the cause information for office errors at the other organization becomes missing information when the aggregated data of the organizations are integrated and analyzed.

Conventionally, one way of handling missing information is as follows. Missing data, in which some features are missing, is detected in the group of aggregated data accumulated in a database, and the group is divided into the missing data and normal data without missing features. Normal data similar to the missing data is found using a predetermined similarity measure, and the feature values in that normal data corresponding to the missing features are substituted into the missing data as complementary data.

JP 2002-215646 A

As described above, in the method that finds normal data similar to the missing data using a predetermined similarity measure, when the aggregated data of one organization has missing values, they are complemented using the aggregated data of other organizations. The complemented aggregated data therefore cannot be said to be sufficiently reliable, and the accuracy of the analysis has been insufficient.

The problem to be solved by the present invention is to provide a data analysis support device and a data analysis support processing program that can improve the accuracy of analysis when the data of different organizations are integrated, even if missing values arise because the data attributes differ between organizations.

According to an embodiment, a data analysis support device includes: organization-specific data table storage means for storing organization-specific data tables that manage, by organization, records that are data having at least one kind of attribute for each of a plurality of organizations to be analyzed; distance calculation means for calculating, for each pair of records of organizations that share at least one kind of common attribute among the organizations shown in the organization-specific data tables, the distance between the pair of records on the basis of the values of the common attributes, using the number of attributes common to the records and the aggregated values of the common attributes; and analysis processing means for performing an analysis on the basis of the distance calculated by the distance calculation means.

A block diagram showing an example of the functional configuration of the data analysis support device in the embodiment.
A diagram showing, in table form, an example of office error data of the branches of a plurality of banks.
A diagram showing, in table form, an example of the number of office errors aggregated by branch, based on the office error data of the branches of a plurality of banks.
A diagram showing an example of the flow of processing data in the data analysis support device in the embodiment.
A flowchart showing an example of the procedure of the processing operation of the data analysis support device in the embodiment.
A flowchart showing an example of the processing operation for same-attribute extraction by the data table combining unit 41 of the data analysis support device in the present embodiment.
A diagram showing, in table form, an example of the combined data table generated by the data table combining unit of the data analysis support device in the present embodiment.
A flowchart showing an example of the processing operation of the inter-record distance calculation unit of the data analysis support device in the present embodiment.
A flowchart showing an example of the processing operation of the clustering execution unit of the data analysis support device in the present embodiment.
A diagram showing an example of setting the initial set of cluster center branches.
A diagram showing an example in which each branch is associated with a cluster center branch.
A diagram showing, in table form, an example of the attributes and attribute values of the organizations included in a given cluster defined in the combined data table.
A diagram showing, in table form, an example of the calculated centroid of each attribute of the organizations included in a given cluster defined in the combined data table.
A diagram showing an example of the centroid of each cluster.
A diagram showing, in table form, an example of the recalculated cluster center branch of a cluster defined in the combined data table.
A diagram showing, in table form, the experimental data used to evaluate clustering accuracy.
A diagram showing, in table form, the office error collection status of each bank used to evaluate clustering accuracy.
A diagram showing, in table form, the experimental data containing missing items used to evaluate clustering accuracy.
A diagram showing, in table form, the rate of branches correctly classified into each cluster.

Hereinafter, embodiments will be described with reference to the drawings.
In the present embodiment, organization-specific data tables are stored that manage, by organization, records that are aggregated data having at least one kind of attribute for each of a plurality of organizations to be analyzed. For each pair of records of organizations that share at least one kind of common attribute, the distance between the records is calculated on the basis of the values of the common attributes, using the number of attributes common to the records and the aggregated values of those attributes. On the basis of the calculated distances, the organizations corresponding to the records are then clustered.

FIG. 1 is a block diagram showing an example of the functional configuration of the data analysis support device in the embodiment. As shown in FIG. 1, the data analysis support device 10 includes a control unit 11 that governs the processing operations of the entire device, a storage device 12, a data table combining unit 41, an inter-record distance calculation unit 42, and a clustering execution unit 43. The data table combining unit 41, the inter-record distance calculation unit 42, and the clustering execution unit 43 are processing units executed as software on a microprocessor, and they can exchange information with one another via the storage device 12, as shown in FIG. 1.
Of these, the inter-record distance calculation unit 42 has a notable feature compared with the prior art and provides the main function for solving the problem.

The storage device 12 is a storage medium such as a nonvolatile memory, and includes an organization-specific data table storage unit 31, a combined data table storage unit 32, an inter-record distance storage unit 33, and a clustering result storage unit 34.

The present embodiment describes an example of analyzing the office error data of bank branches, each branch being an organization to be analyzed by clustering. The data analysis support device 10 clusters the branches of a plurality of banks according to the characteristics of their office errors, using data in which missing values have arisen from combining the aggregated data collected at the branches of each bank.

Each bank accumulates, as office error data, information indicating when, in which business, by whom, and what kind of error occurred in daily operations, such as a wrong fee or a wrongly specified account number.

The information on who made the error indicates the position or title of the employee who made it.
The information on what kind of error occurred indicates, for example, the cause of the error and the amount of the loss.
Although the attributes of the information collected on office errors are largely the same across banks, some attributes are specific to particular banks, and a given attribute is not necessarily collected by every bank.

FIG. 2 is a diagram showing, in table form, an example of office error data of the branches of a plurality of banks. FIG. 2 shows office error data that occurred at the branches of three banks: Bank A, Bank B, and Bank C. All three banks collect the date of occurrence, the branch of occurrence, and the business in which the error occurred. The office error data of every bank therefore has these three as common attributes.

On the other hand, the position of the person responsible for the error is collected by Bank A and Bank C but not by Bank B. The office error data of Bank B therefore has no attribute "position of the person responsible", and the value of this attribute is a missing value.

Likewise, the cause of the error is collected by Bank A and Bank B but not by Bank C. The office error data of Bank C therefore has no attribute "cause of the error", and the value of this attribute is a missing value.

In FIG. 2, the missing values are shown explicitly as "NULL" for convenience; in the actual office error data of each bank, however, the attribute itself does not exist for data that is not collected.

FIG. 3 is a diagram showing, in table form, an example of the number of office errors aggregated by branch, based on the office error data of the branches of a plurality of banks.
The following explains how the number of errors is aggregated by branch for each bank from the office error data shown in FIG. 2. To simplify the explanation, only the attributes "business in which the error occurred", "position of the person responsible", and "cause of the error" in the office error data of FIG. 2 are aggregated.

For Bank A, the following values are aggregated for each branch: "branch number", which identifies the branch of Bank A; the attribute "Business: Deposit", for errors that occurred in deposit business; the attribute "Business: Loan", for errors that occurred in loan business; the attribute "Position: General employee", for errors made by a general employee; the attribute "Position: Part-timer", for errors made by a part-time worker; the attribute "Cause: Lack of ability", for errors caused by insufficient ability; and the attribute "Cause: Mistake", for errors caused by human error.

For example, the value "31" in the column of the attribute "Business: Deposit" in the row of branch number "A001" in the organization-specific data table shown in FIG. 3 indicates that, among the office errors that occurred at the branch of Bank A whose branch number is "A001", 31 occurred in deposit business.
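The per-branch aggregation described above can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the record field names (`branch`, `business`, `position`, `cause`), the sample records, and the function name `aggregate_by_branch` are all assumptions introduced here.

```python
from collections import Counter, defaultdict

# Hypothetical raw office error records for one bank (illustrative values only).
records = [
    {"branch": "A001", "business": "Deposit", "position": "General employee", "cause": "Mistake"},
    {"branch": "A001", "business": "Deposit", "position": "Part-timer", "cause": "Lack of ability"},
    {"branch": "A002", "business": "Loan", "position": "General employee", "cause": "Mistake"},
]


def aggregate_by_branch(records, fields):
    """Count errors per branch for every "Field: Value" attribute seen in the bank."""
    counts = defaultdict(Counter)
    attributes = set()
    for rec in records:
        for field in fields:
            if field in rec:  # a field the bank does not collect never becomes an attribute
                attr = f"{field.capitalize()}: {rec[field]}"
                attributes.add(attr)
                counts[rec["branch"]][attr] += 1
    # Every branch of the same bank gets the same attribute columns (0 if no errors).
    return {
        branch: {attr: c[attr] for attr in sorted(attributes)}
        for branch, c in counts.items()
    }


bank_a_table = aggregate_by_branch(records, ["business", "position", "cause"])
```

Because attributes are created only from fields that actually occur in a bank's records, a bank that does not collect, say, the cause of errors ends up with no "Cause: ..." columns at all, which mirrors the point that uncollected attributes do not exist in that bank's table.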

If a column for the attribute "Position: General employee" were defined in the row of branch number "B001" of the organization-specific data table shown in FIG. 3, the value of that cell would be the missing value "null". As described above, in the actual office error data of each bank the attribute itself does not exist for data that is not collected, so attributes a bank does not collect do not appear in its aggregated error counts either. In FIG. 3, attribute names that do not exist in a bank's organization-specific data table are shown in light type with the count given as "null", but in practice the columns for such attributes do not exist.

In the present embodiment, the data on the number of errors for each attribute, aggregated by branch, is stored as organization-specific data tables in the organization-specific data table storage unit 31 of the storage device 12 shown in FIG. 1. In the example of FIG. 3, the branches of the same bank have the same attributes in their aggregated data, so the data tables of those branches are grouped into one organization-specific data table per bank.

FIG. 4 is a diagram showing an example of the flow of processing data in the data analysis support device in the embodiment.
The data table combining unit 41 reads, as input data, the organization-specific data tables of the branches of each bank stored in the organization-specific data table storage unit 31 of the storage device 12. It then identifies, from the attributes in the organization-specific data tables, the attributes that are the same between organizations, that is, between branches; combines the data tables of the organizations on the basis of the identified attributes to generate a single combined data table; and stores it in the combined data table storage unit 32 of the storage device 12.

The inter-record distance calculation unit 42 treats the set of error counts for the attributes of one branch in the combined data table as one record. For any two records, that is, the error counts of any two branches regardless of bank, it calculates a distance indicating how similar the records are and stores the result in the inter-record distance storage unit 33 of the storage device 12.

The clustering execution unit 43 clusters the records in the combined data table using the inter-record distance information stored in the inter-record distance storage unit 33, stores the clustering result in the clustering result storage unit 34 of the storage device 12, and also outputs it to a display device 20 such as a liquid crystal display.

FIG. 5 is a flowchart showing an example of the procedure of the processing operation of the data analysis support device in the embodiment. The procedure described here is an outline; the details of each process are described later.
First, the data table combining unit 41 of the data analysis support device 10 extracts the attributes of the organization-specific data table of each organization stored in the organization-specific data table storage unit 31 of the storage device 12 (step S1).

The data table combining unit 41 extracts, from the organization-specific data tables, the same attributes, that is, attributes that are identical between organizations (step S2). One way to identify them is to detect exact matches of attribute names between the organization-specific data tables of the banks, as shown in FIG. 3.

Using the same attributes extracted in step S2, the data table combining unit 41 combines the organization-specific data tables stored in the organization-specific data table storage unit 31 of the storage device 12 to generate a single combined data table, and stores it in the combined data table storage unit 32 of the storage device 12 (step S3). If an attribute exists in only some of the organization-specific data tables, the data table combining unit 41 adds that attribute to the data tables of the organizations that lack it and sets its value to the missing value (null).
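As a rough sketch of step S3, the combination can be pictured as taking the union of all attribute names and filling in null wherever a bank lacks an attribute. The nested-dictionary layout, the function name `combine_tables`, and every count other than the 31 deposit errors of branch A001 are assumptions made for illustration; `None` stands in for the null value.

```python
def combine_tables(tables_by_bank):
    """Merge {bank: {branch: {attribute: count}}} into one {branch: {attribute: count}}.

    Attributes that a bank's table lacks are added with None (the missing value).
    """
    # Union of attribute names over all banks; exact name match identifies same attributes.
    all_attributes = sorted(
        {attr for table in tables_by_bank.values() for row in table.values() for attr in row}
    )
    combined = {}
    for table in tables_by_bank.values():
        for branch, row in table.items():
            combined[branch] = {attr: row.get(attr) for attr in all_attributes}
    return combined


# Illustrative organization-specific tables (values other than 31 are made up).
tables_by_bank = {
    "A": {"A001": {"Business: Deposit": 31, "Position: General employee": 12, "Cause: Mistake": 20}},
    "B": {"B001": {"Business: Deposit": 18, "Cause: Mistake": 9}},
    "C": {"C001": {"Business: Deposit": 25, "Position: General employee": 7}},
}
combined = combine_tables(tables_by_bank)
```

In this toy input, `combined["B001"]["Position: General employee"]` and `combined["C001"]["Cause: Mistake"]` are `None`, matching the situation where Bank B does not collect positions and Bank C does not collect causes.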

The inter-record distance calculation unit 42 selects any two records from the records of the combined data table stored in the combined data table storage unit 32 of the storage device 12, and calculates the distance between the selected records (step S4).

As a notable feature compared with the prior art, in the present embodiment the inter-record distance calculation unit 42 excludes any attribute whose value is missing in at least one of the two selected records, and calculates the distance between the records using only the attributes for which both records have values. The inter-record distance calculation unit 42 stores the calculated distance in the inter-record distance storage unit 33 of the storage device 12. It performs this process for every combination of two records in the combined data table.
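The idea of restricting the distance to attributes that both records actually have can be sketched as below. The text at this point does not give the exact formula, so the one used here, a Euclidean distance over the common attributes normalized by their number, is purely an assumption chosen for illustration; the function names are likewise hypothetical, and `None` stands in for the missing value.

```python
import math
from itertools import combinations


def record_distance(rec_a, rec_b):
    """Distance over attributes for which BOTH records have a value.

    Assumed formula (not taken from the source): root of the mean squared
    difference over the common attributes. Returns None if nothing is comparable.
    """
    common = [
        attr for attr in rec_a
        if attr in rec_b and rec_a[attr] is not None and rec_b[attr] is not None
    ]
    if not common:
        return None
    squared = sum((rec_a[attr] - rec_b[attr]) ** 2 for attr in common)
    return math.sqrt(squared / len(common))


def all_pair_distances(table):
    """Apply record_distance to every combination of two records in the table."""
    return {
        (p, q): record_distance(table[p], table[q])
        for p, q in combinations(sorted(table), 2)
    }


# Toy combined table: B001 has no position data, so that attribute is skipped.
table = {
    "A001": {"Business: Deposit": 3, "Position: Part-timer": 4, "Cause: Mistake": 1},
    "B001": {"Business: Deposit": 0, "Position: Part-timer": None, "Cause: Mistake": 5},
}
distances = all_pair_distances(table)
```

Dividing by the number of common attributes is one way to keep pairs that share many attributes comparable with pairs that share few; other normalizations would fit the same description.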

The clustering execution unit 43 clusters the branches by clustering the records in the combined data table, using the inter-record distance information stored in the inter-record distance storage unit 33 of the storage device 12 (step S5). The clustering execution unit 43 then stores the clustering result in the clustering result storage unit 34 of the storage device 12 and outputs it to the display device 20 (step S6).
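The clustering algorithm itself is not specified at this point in the text. As one possible reading of step S5, the sketch below shows a single assignment pass in which each branch is attached to the nearest of a set of already chosen cluster center branches, using only precomputed pairwise distances. The function name, the distance values, and the choice of centers are all hypothetical.

```python
def assign_to_centers(branches, centers, distance):
    """Assign each branch to its nearest cluster center branch.

    distance maps frozenset({p, q}) to the precomputed distance between p and q.
    """
    clusters = {center: [] for center in centers}
    for branch in branches:
        if branch in clusters:
            clusters[branch].append(branch)  # a center belongs to its own cluster
            continue
        nearest = min(centers, key=lambda c: distance[frozenset((branch, c))])
        clusters[nearest].append(branch)
    return clusters


# Hypothetical precomputed inter-record distances.
distance = {
    frozenset(("A002", "A001")): 1.0,
    frozenset(("A002", "B001")): 4.0,
    frozenset(("C001", "A001")): 5.0,
    frozenset(("C001", "B001")): 2.0,
}
clusters = assign_to_centers(["A001", "A002", "B001", "C001"], ["A001", "B001"], distance)
```

Working from a distance table rather than raw attribute vectors is what lets records with different missing attributes be clustered together without imputing values.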

Next, the operation of the data table combining unit 41 is described in detail.
As described above, the data table combining unit 41 extracts attributes from the organization-specific data tables stored in the organization-specific data table storage unit 31 of the storage device 12, identifies the attributes that are the same between organizations, and combines the data tables.

FIG. 6 is a flowchart showing an example of the processing operation for same-attribute extraction by the data table combining unit 41 of the data analysis support device in the present embodiment.
The processing operation shown in FIG. 6 describes step S2 of the processing operation shown in FIG. 5 in detail; it is the operation for extracting the attributes that are the same between organizations.

Taking the organization-specific data tables shown in FIG. 3 as an example, if attributes with the same attribute name exist in the organization-specific data tables of branches of different banks, the data table combining unit 41 extracts them as the same attribute.

The data table combining unit 41 reads the data table of each bank from the organization-specific data table storage unit 31 of the storage device 12 and generates an attribute set T consisting of all the attributes of all the banks (step S11).

Specifically, the attribute set T that the data table combining unit 41 obtains in step S11 from the organization-specific data tables shown in FIG. 3 has the following 14 attributes as its elements.

"Business: Deposit (Bank A)", "Business: Deposit (Bank B)", "Business: Deposit (Bank C)"
"Business: Loan (Bank A)", "Business: Loan (Bank B)", "Business: Loan (Bank C)"
"Position: General employee (Bank A)", "Position: General employee (Bank C)"
"Position: Part-timer (Bank A)", "Position: Part-timer (Bank C)"
"Cause: Lack of ability (Bank A)", "Cause: Lack of ability (Bank B)"
"Cause: Mistake (Bank A)", "Cause: Mistake (Bank B)"
Here, attributes with the same attribute name are counted as separate attributes if their values are aggregated from different banks. For example, "Business: Deposit (Bank A)", "Business: Deposit (Bank B)", and "Business: Deposit (Bank C)" all have the attribute name "Business: Deposit" once the bank name is removed, but each of them is a separate element of the attribute set T.
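To make the count of 14 concrete, the attribute set T of step S11 can be modeled as a list of (attribute name, bank) pairs, so that the same name from different banks yields distinct elements. The representation and the function name `build_attribute_set` are assumptions for illustration; the attribute names follow the FIG. 3 example described above.

```python
def build_attribute_set(attributes_by_bank):
    """Return T as a list of (attribute name, bank) pairs, one per bank-specific attribute."""
    return [
        (name, bank)
        for bank, names in attributes_by_bank.items()
        for name in names
    ]


# Attribute names collected by each bank in the FIG. 3 example.
attributes_by_bank = {
    "A": ["Business: Deposit", "Business: Loan",
          "Position: General employee", "Position: Part-timer",
          "Cause: Lack of ability", "Cause: Mistake"],
    "B": ["Business: Deposit", "Business: Loan",
          "Cause: Lack of ability", "Cause: Mistake"],
    "C": ["Business: Deposit", "Business: Loan",
          "Position: General employee", "Position: Part-timer"],
}
T = build_attribute_set(attributes_by_bank)
```

Bank A contributes six elements and Banks B and C four each, which gives the 14 elements listed above.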

The data table combining unit 41 extracts any one attribute from the attribute set T generated in step S11 (step S12). The extracted attribute is referred to as attribute a.
If the attribute set T contains attributes with the same attribute name as the attribute a extracted in step S12, the data table combining unit 41 extracts them regardless of which bank their values are aggregated from (step S13). The extracted attributes are referred to as attribute b, attribute c, and so on.

Specifically, if the data table combining unit 41 extracts "Business: Deposit (Bank A)" as attribute a in step S12, then in step S13 it extracts "Business: Deposit (Bank B)" and "Business: Deposit (Bank C)", which have the same attribute name "Business: Deposit" as attribute a, as attribute b and attribute c.

The data table combining unit 41 stores the information on the attributes a, b, c, and so on extracted in steps S12 and S13 in the storage device 12 as a same attribute (step S14).
If the attribute set T contains no other attribute with the same attribute name as the extracted attribute a, that is, if attribute a exists only in the organization-specific data table of a single organization and not in those of the other organizations, the data table combining unit 41 stores attribute a alone in the storage device 12 as a same attribute.

If the attribute set T generated in step S11 still contains an attribute that has not been processed in step S12 or step S13, that is, an attribute that has been extracted neither as attribute a in step S12 nor as attribute b, c, and so on in step S13 (YES in step S15), the data table combining unit 41 returns to step S12, extracts one of the unextracted attributes from the attribute set T as a new attribute a, and performs steps S13 and S14 again for it.

また、データテーブル結合部４１は、ステップＳ１１で生成した属性集合Ｔ中の属性をすべて処理している場合、つまり、属性集合Ｔ中の全ての属性を、ステップＳ１２で属性ａとして抽出済みである場合、またはステップＳ１３で属性ｂ、ｃ、…として抽出済みである場合は（ステップＳ１５のＮＯ）、同一属性抽出のための処理を終了する。 Further, when all the attributes in the attribute set T generated in step S11 are processed, the data table combining unit 41 has extracted all the attributes in the attribute set T as the attribute a in step S12. In this case, or when extracted as attributes b, c,... In step S13 (NO in step S15), the process for extracting the same attribute is terminated.
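
The same-attribute extraction of steps S11 to S15 can be sketched as follows. This is only an illustrative Python sketch under stated assumptions: the function name `extract_same_attributes`, the representation of an attribute as an (attribute name, organization) pair, and the per-bank attribute lists (inferred from the sets obtained for FIG. 3) are hypothetical and not part of the embodiment.

```python
def extract_same_attributes(tables):
    """Group attributes that share a name across organizations.

    tables: dict mapping an organization name to its list of attribute names
    (a hypothetical stand-in for the organization-specific data tables).
    """
    # Step S11: attribute set T, one (attribute name, organization) entry per attribute.
    attribute_set = [(name, org) for org, names in tables.items() for name in names]
    processed = set()
    groups = []
    for a in attribute_set:
        if a in processed:
            continue  # already extracted as attribute a, b, c, ...
        # Steps S12/S13: take a, then every unprocessed attribute with the same name.
        same = [b for b in attribute_set if b[0] == a[0] and b not in processed]
        groups.append(same)     # Step S14: store the group as "same attributes".
        processed.update(same)  # Step S15: repeat until nothing is unprocessed.
    return groups


tables = {
    "Bank A": ["business: deposit", "business: loan", "title: general employee",
               "title: part-timer", "cause: insufficient ability", "cause: mistake"],
    "Bank B": ["business: deposit", "business: loan",
               "cause: insufficient ability", "cause: mistake"],
    "Bank C": ["business: deposit", "business: loan",
               "title: general employee", "title: part-timer"],
}
groups = extract_same_attributes(tables)
for group in groups:
    print(group[0][0], [org for _, org in group])
```

An attribute that exists in only one organization would simply form a group of size one, which matches the handling described for step S14.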

データテーブル結合部４１が同一属性抽出の処理を終了した際、このデータテーブル結合部４１が、図３に示した組織別データテーブルから同一属性として得た属性の組は、以下の（ア）、（イ）、（ウ）、（エ）、（オ）、（カ）の６組である。 When the data table combining unit 41 completes the same-attribute extraction process, the sets of attributes it has obtained as the same attributes from the organization-specific data tables shown in FIG. 3 are the following six sets (a) to (f).

（ア）：「業務：預金（Ａ銀行）」、「業務：預金（Ｂ銀行）」、「業務：預金（Ｃ銀行）」
（イ）：「業務：融資（Ａ銀行）」、「業務：融資（Ｂ銀行）」、「業務：融資（Ｃ銀行）」
（ウ）：「役職：一般行員（Ａ銀行）」、「役職：一般行員（Ｃ銀行）」
（エ）：「役職：パート（Ａ銀行）」、「役職：パート（Ｃ銀行）」
（オ）：「原因：能力不足（Ａ銀行）」、「原因：能力不足（Ｂ銀行）」
（カ）：「原因：ミス（Ａ銀行）」、「原因：ミス（Ｂ銀行）」
次に、データテーブル結合部４１により、ステップＳ１４で記憶装置１２に記憶した同一属性を用いて組織別データテーブルを結合するための処理動作を以下に示す。
データテーブル結合部４１は、組織別データテーブルから、すべての組織の組織別データテーブルについて同一属性が抽出された場合は、この属性を結合済データテーブルに組み入れ、この結合済データベースにおける一属性とする。 (a): “Business: Deposit (Bank A)”, “Business: Deposit (Bank B)”, “Business: Deposit (Bank C)”
(b): “Business: Loan (Bank A)”, “Business: Loan (Bank B)”, “Business: Loan (Bank C)”
(c): “Title: General employee (Bank A)”, “Title: General employee (Bank C)”
(d): “Title: Part-timer (Bank A)”, “Title: Part-timer (Bank C)”
(e): “Cause: Insufficient ability (Bank A)”, “Cause: Insufficient ability (Bank B)”
(f): “Cause: Mistake (Bank A)”, “Cause: Mistake (Bank B)”
Next, the processing operation for combining the organization-specific data tables using the same attribute stored in the storage device 12 in step S14 by the data table combining unit 41 will be described below.
When the same attribute has been extracted from the organization-specific data tables of all organizations, the data table combining unit 41 incorporates this attribute into the combined data table and sets it as one attribute in the combined database.

具体的には、図３に示した組織別データテーブルから抽出された同一属性の組（ア）では、Ａ，Ｂ，Ｃ銀行の各支店の組織別データテーブルには「業務：預金」の属性が同一属性として存在しており、全ての銀行について組織別データテーブルに、この「業務：預金」の属性が存在しているので、この属性を結合済データテーブルに組み入れて、当該結合済みデータベースにおける属性「業務：預金」とする。 Specifically, for the same-attribute set (a) extracted from the organization-specific data tables shown in FIG. 3, the attribute “business: deposit” exists as the same attribute in the organization-specific data tables of the branches of banks A, B, and C. Because this attribute exists in the organization-specific data tables of all the banks, it is incorporated into the combined data table as the attribute “business: deposit” of the combined database.

同様に、図３に示した組織別データテーブルから抽出された、上記の同一属性の組（イ）では、Ａ，Ｂ，Ｃ銀行の各支店の組織別データテーブルには「業務：融資」の属性が同一属性として存在しており、全ての銀行について組織別データテーブルにこの「業務：融資」の属性が存在しているので、この属性を結合済データテーブルに組み入れて当該結合済みデータベースにおける属性「業務：融資」とする。 Similarly, for the same-attribute set (b) extracted from the organization-specific data tables shown in FIG. 3, the attribute “business: loan” exists as the same attribute in the organization-specific data tables of the branches of banks A, B, and C. Because this attribute exists in the organization-specific data tables of all the banks, it is incorporated into the combined data table as the attribute “business: loan” of the combined database.

また、図３に示した組織別データテーブルから、一部の銀行の各支店の組織別データテーブルから抽出された属性と同一の属性がその他の銀行の各支店の組織別データテーブルから抽出されなかった場合は、このその他の銀行の属性に当該属性を追加して、結合済データテーブルにおける一属性とする。その際、結合済みデータテーブルにおける、前述したその他の銀行における前述した追加された属性の属性値はすべて欠損値とする。 When an attribute extracted from the organization-specific data tables of the branches of some banks in FIG. 3 has no identical attribute in the organization-specific data tables of the branches of the other banks, the attribute is added to the attributes of those other banks and becomes one attribute of the combined data table. In the combined data table, all attribute values of the added attribute for those other banks are set to missing values.

具体的には、図３に示した組織別データテーブルから抽出された同一属性の組（ウ）では、Ａ，Ｃ銀行の各支店の組織別データテーブルには「役職：一般行員」の属性が同一属性として存在するが、Ｂ銀行の各支店の組織別データテーブルにはこの「役職：一般行員」の属性が存在していない。 Specifically, for the same-attribute set (c) extracted from the organization-specific data tables shown in FIG. 3, the attribute “title: general employee” exists as the same attribute in the organization-specific data tables of the branches of banks A and C, but does not exist in the organization-specific data table of the branches of bank B.

そこで、データテーブル結合部４１は、Ｂ銀行の各支店の組織別データテーブルに「役職：一般行員」を追加したものを結合済データベースに組み入れ、この結合済みデータベースにおけるＢ銀行の各支店の行の「役職：一般行員」の列のセルの値である属性値はすべて欠損値とする。 Therefore, the data table combining unit 41 adds “title: general employee” to the organization-specific data table of the branches of bank B and incorporates the result into the combined database. In this combined database, all attribute values, that is, the cell values in the “title: general employee” column of the rows for the branches of bank B, are set to missing values.

また、図３に示した組織別データテーブルから抽出された同一属性の組（エ）では、Ａ，Ｃ銀行の各支店の組織別データテーブルには「役職：パート」の属性が同一属性として存在するが、Ｂ銀行の各支店の組織別データテーブルにはこの「役職：パート」の属性が存在していない。そこで、データテーブル結合部４１は、Ｂ銀行の各支店の組織別データテーブルに「役職：パート」を追加したものを結合済データベースに組み入れ、この結合済みデータベースにおけるＢ銀行の各支店の行の「役職：パート」の列のセルの値である属性値はすべて欠損値とする。 Further, for the same-attribute set (d) extracted from the organization-specific data tables shown in FIG. 3, the attribute “title: part-timer” exists as the same attribute in the organization-specific data tables of the branches of banks A and C, but does not exist in the organization-specific data table of the branches of bank B. Therefore, the data table combining unit 41 adds “title: part-timer” to the organization-specific data table of the branches of bank B and incorporates the result into the combined database. In this combined database, all attribute values, that is, the cell values in the “title: part-timer” column of the rows for the branches of bank B, are set to missing values.

また、図３に示した組織別データテーブルから抽出された同一属性の組（オ）では、Ａ，Ｂ銀行の各支店の組織別データテーブルには「原因：能力不足」の属性が同一属性として存在するが、Ｃ銀行の各支店の組織別データテーブルにはこの「原因：能力不足」の属性が存在していない。
そこで、データテーブル結合部４１は、Ｃ銀行の各支店の組織別データテーブルに「原因：能力不足」を追加したものを結合済データベースに組み入れ、この結合済みデータベースにおけるＣ銀行の各支店の行の「原因：能力不足」の列のセルの値である属性値はすべて欠損値とする。 Further, in the same attribute set (e) extracted from the organization-specific data table shown in FIG. 3, the attribute “Cause: Insufficient Capability” is assumed to be the same attribute in the organization-specific data table of each branch of banks A and B. Although it exists, this “Cause: Insufficient Capability” attribute does not exist in the organization-specific data table of each branch of bank C.
Therefore, the data table combining unit 41 adds “cause: insufficient ability” to the organization-specific data table of the branches of bank C and incorporates the result into the combined database. In this combined database, all attribute values, that is, the cell values in the “cause: insufficient ability” column of the rows for the branches of bank C, are set to missing values.

また、図３に示した組織別データテーブルから抽出された同一属性の組（カ）では、Ａ，Ｂ銀行の各支店の組織別データテーブルには「原因：ミス」の属性が同一属性として存在するが、Ｃ銀行の各支店の組織別データテーブルにはこの「原因：ミス」の属性が存在していない。
そこで、データテーブル結合部４１は、Ｃ銀行の各支店の組織別データテーブルに「原因：ミス」を追加したものを結合済データベースに組み入れ、この結合済みデータベースにおけるＣ銀行の各支店の行の「原因：ミス」の列のセルの値である属性値はすべて欠損値とする。 Further, in the same attribute group (f) extracted from the organization data table shown in FIG. 3, the attribute “Cause: Miss” exists as the same attribute in the organization data table of each bank of A and B banks. However, this “Cause: Miss” attribute does not exist in the organization-specific data table of each branch of bank C.
Therefore, the data table combining unit 41 adds “cause: mistake” to the organization-specific data table of the branches of bank C and incorporates the result into the combined database. In this combined database, all attribute values, that is, the cell values in the “cause: mistake” column of the rows for the branches of bank C, are set to missing values.

このようにして、データテーブル結合部４１は、図３に示した各銀行の各支店の組織別データテーブルを結合して、単一の結合済データテーブルを生成して、記憶装置１２の結合済データテーブル格納部３２に格納する。 In this way, the data table combining unit 41 combines the organization-specific data tables of the branches of each bank shown in FIG. 3 to generate a single combined data table, and stores it in the combined data table storage unit 32 of the storage device 12.
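
The combining step can be illustrated with a short sketch. The following Python code is a minimal sketch only, assuming a nested-dictionary representation of the organization-specific data tables. The function name `combine_tables` is hypothetical, and all numeric values except the value 31 for “business: deposit” of branch A001 are made-up placeholders.

```python
def combine_tables(tables):
    """Combine per-organization tables into one table with None for missing values.

    tables: dict mapping organization -> {branch number -> {attribute -> value}}.
    """
    # Collect every attribute name that appears in any table, in first-seen order.
    attributes = []
    for rows in tables.values():
        for record in rows.values():
            for name in record:
                if name not in attributes:
                    attributes.append(name)
    combined = {}
    for rows in tables.values():
        for branch, record in rows.items():
            # dict.get returns None for an attribute the organization does not have.
            combined[branch] = {name: record.get(name) for name in attributes}
    return combined


# Only the value 31 for "business: deposit" of branch A001 comes from the text;
# every other number below is a made-up placeholder.
tables = {
    "Bank A": {"A001": {"business: deposit": 31, "business: loan": 10,
                        "title: general employee": 12, "title: part-timer": 8,
                        "cause: insufficient ability": 9, "cause: mistake": 11}},
    "Bank B": {"B001": {"business: deposit": 20, "business: loan": 15,
                        "cause: insufficient ability": 7, "cause: mistake": 6}},
    "Bank C": {"C001": {"business: deposit": 18, "business: loan": 14,
                        "title: general employee": 5, "title: part-timer": 4}},
}
combined = combine_tables(tables)
print(combined["B001"]["title: general employee"])  # None (missing value)
```

Here `None` plays the role of the missing value written into the cells of attributes that an organization does not have.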

図７は、本実施形態におけるデータ分析支援装置のデータテーブル結合部により生成した結合済データテーブルの一例を表形式で示す図である。
この結合済データテーブルの各行は、各銀行の各支店の一レコードに対応し、各列は各行の支店番号、および、結合元の組織別データテーブル中の各属性である「業務：預金」、「業務：融資」、「役職：一般行員」、「役職：パート」、「原因：能力不足」、「原因：ミス」に対応する。 FIG. 7 is a diagram illustrating an example of a combined data table generated by the data table combining unit of the data analysis support apparatus according to the present embodiment in a table format.
Each row of this combined data table corresponds to one record for one branch of a bank. The columns correspond to the branch number of each row and to the attributes of the source organization-specific data tables, namely “business: deposit”, “business: loan”, “title: general employee”, “title: part-timer”, “cause: insufficient ability”, and “cause: mistake”.

例えば、図３に示したＡ銀行の各支店の組織別データテーブルの支店番号「Ａ００１」の行で定義される属性「業務：預金」の列のセルの値は「３１」であるので、結合済データテーブルの支店番号「Ａ００１」の行で定義される属性「業務：預金」の列のセルの値も「３１」となる。 For example, the cell value in the column of the attribute “business: deposit” defined in the row of the branch number “A001” in the organization-specific data table of each branch of the bank A shown in FIG. The value of the cell in the column of the attribute “business: deposit” defined in the row of the branch number “A001” in the completed data table is also “31”.

また、図３に示したＢ銀行の各支店の組織別データテーブルの支店番号「Ｂ００１」の行で定義される属性「役職：一般行員」や「役職：パート」の列のセルの値は存在しないので、結合済データテーブルの支店番号「Ｂ００１」の行で定義される属性「役職：一般行員」や「役職：パート」の列のセルの値は「ｎｕｌｌ」となる。 Further, no cell values exist in the columns of the attributes “title: general employee” and “title: part-timer” for the row of branch number “B001” in the organization-specific data table of the branches of bank B shown in FIG. 3. Therefore, the cell values in the columns of the attributes “title: general employee” and “title: part-timer” for the row of branch number “B001” in the combined data table are “null”.

また、図３に示したＣ銀行の各支店の組織別データテーブルの支店番号「Ｃ００１」の行で定義される属性「原因：能力不足」や「原因：ミス」の列のセルの値は存在しないので、結合済データテーブルの支店番号「Ｃ００１」の行で定義される属性「原因：能力不足」や「原因：ミス」の列のセルの値は「ｎｕｌｌ」となる。 Further, no cell values exist in the columns of the attributes “cause: insufficient ability” and “cause: mistake” for the row of branch number “C001” in the organization-specific data table of the branches of bank C shown in FIG. 3. Therefore, the cell values in the columns of the attributes “cause: insufficient ability” and “cause: mistake” for the row of branch number “C001” in the combined data table are “null”.

次に、レコード間距離算出部４２の動作の詳細について説明する。
図８は、本実施形態におけるデータ分析支援装置のレコード間距離算出部による処理動作の一例を示すフローチャートである。
図８に示す処理動作は、図５に示す処理動作のステップＳ４を詳細に説明するものであり、結合済データテーブルにおける行方向に沿ったセルの集合を一レコードとした際の任意の２レコード間の類似度の高低を示す距離を算出するための処理動作である。 Next, details of the operation of the inter-record distance calculation unit 42 will be described.
FIG. 8 is a flowchart illustrating an example of a processing operation performed by the inter-record distance calculation unit of the data analysis support device according to this embodiment.
The processing operation shown in FIG. 8 explains step S4 of the processing operation shown in FIG. 5 in detail. Treating the set of cells along the row direction in the combined data table as one record, it calculates, for any two records, a distance indicating how similar the two records are.

レコード間距離算出部４２は、結合済データテーブルの２つのレコードの組であるレコードペア（レコードｉとレコードjとする）を任意に指定し（ステップＳ２１）、このレコードペアのそれぞれがともに値をもつ属性である共通属性を特定する（ステップＳ２２）。 The inter-record distance calculation unit 42 arbitrarily designates a record pair, that is, a set of two records in the combined data table (referred to as record i and record j) (step S21), and identifies the common attributes, which are the attributes for which both records of the pair have a value (step S22).

次に、レコード間距離算出部４２は、ステップＳ２２で特定した共通属性を考慮して、以下の式（１）にしたがって、レコードｉとレコードjとの間の距離ｄ_ｉ，ｊを算出して、この算出した距離の情報をレコードペアの各レコードの識別名の情報とともに記憶装置１２のレコード間距離格納部３３に格納する（ステップＳ２３）。

Next, taking into account the common attributes identified in step S22, the inter-record distance calculation unit 42 calculates the distance d_{i,j} between record i and record j according to the following equation (1), and stores the calculated distance, together with the identification names of the records of the record pair, in the inter-record distance storage unit 33 of the storage device 12 (step S23).
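
Equation (1) itself is not reproduced in this text. Based on the symbol definitions given for it and on the worked examples (a Euclidean distance over the common attributes, divided by the number of common attributes), a form consistent with the description would be the following. This is a reconstruction under those assumptions, not the published formula.

```latex
d_{i,j} = \frac{1}{n} \sqrt{\sum_{a \in C} \left( a_i - a_j \right)^2} \tag{1}
```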

式（１）におけるｎは、レコードｉとレコードｊとの間の共通属性の数である。 n in equation (1) is the number of common attributes between record i and record j.

式（１）におけるＣは、レコードｉとレコードｊとの間の共通属性の集合である。 C in equation (1) is the set of common attributes between record i and record j.

式（１）におけるａは、属性である。 a in equation (1) is an attribute.

式（１）におけるａ_ｋは、レコードｋにおける属性ａの属性値である。 a_k in equation (1) is the attribute value of attribute a in record k.

具体例について説明する。まず、第１の例として、図７に示した支店番号「Ａ００１」の行のレコードと支店番号「Ａ００２」の行のレコードとのペアを選択した場合、「Ａ００１」の行のレコードは、欠損値でない値をもつ属性は、「業務：預金」、「業務：融資」、「役職：一般行員」、「役職：パート」、「原因：能力不足」、「原因：ミス」の６つである。また、「Ａ００２」の行のレコードは、「Ａ００１」の行のレコードと同様に、欠損値でない値をもつ属性は、「業務：預金」、「業務：融資」、「役職：一般行員」、「役職：パート」、「原因：能力不足」、「原因：ミス」の６つである。 A specific example will be described. As a first example, suppose the pair consisting of the record in the row of branch number “A001” and the record in the row of branch number “A002” shown in FIG. 7 is selected. The record in the “A001” row has non-missing values for six attributes: “business: deposit”, “business: loan”, “title: general employee”, “title: part-timer”, “cause: insufficient ability”, and “cause: mistake”. Like the record in the “A001” row, the record in the “A002” row has non-missing values for the same six attributes.

つまり、両レコードは、互いに６つの属性を持っており、これらはすべて共通属性であるので、ステップＳ２１で特定される共通属性は、「業務：預金」、「業務：融資」、「役職：一般行員」、「役職：パート」、「原因：能力不足」、「原因：ミス」の６つとなる。 That is, both records have six attributes, all of which are common attributes, so the common attributes identified in step S22 are the six attributes “business: deposit”, “business: loan”, “title: general employee”, “title: part-timer”, “cause: insufficient ability”, and “cause: mistake”.

この場合、レコード間距離算出部４２は、共通属性である６属性を用いて、ユークリッド距離を算出し、この距離を式（１）に従って共通属性数６で割った以下の値が支店番号「Ａ００１」の行のレコードと支店番号「Ａ００２」の行のレコードとの距離となる。

In this case, the inter-record distance calculation unit 42 calculates the Euclidean distance using the six common attributes, and the value obtained by dividing this distance by the number of common attributes, 6, in accordance with equation (1) is the distance between the record in the row of branch number “A001” and the record in the row of branch number “A002”.

また、第２の例として、図７に示した支店番号「Ａ００１」の行のレコードと支店番号「Ｂ００１」の行のレコードとのペアを選択した場合、「Ａ００１」の行のレコードは、欠損値でない値をもつ属性は、「業務：預金」、「業務：融資」、「役職：一般行員」、「役職：パート」、「原因：能力不足」、「原因：ミス」の６つである。一方、「Ｂ００１」の行のレコードは、欠損値でない値をもつ属性は、「業務：預金」、「業務：融資」、「原因：能力不足」、「原因：ミス」の４つである。 As a second example, suppose the pair consisting of the record in the row of branch number “A001” and the record in the row of branch number “B001” shown in FIG. 7 is selected. The record in the “A001” row has non-missing values for six attributes: “business: deposit”, “business: loan”, “title: general employee”, “title: part-timer”, “cause: insufficient ability”, and “cause: mistake”. The record in the “B001” row, on the other hand, has non-missing values for four attributes: “business: deposit”, “business: loan”, “cause: insufficient ability”, and “cause: mistake”.

つまり、両レコードは、「業務：預金」、「業務：融資」、「原因：能力不足」、「原因：ミス」の４つの属性については、ともに欠損値でない値を有しており、これらの属性がステップＳ２１で特定される共通属性はとなる。 That is, both records have non-missing values for the four attributes “business: deposit”, “business: loan”, “cause: insufficient ability”, and “cause: mistake”, and these attributes are the common attributes identified in step S22.

一方、「Ａ００１」の行のレコードで値を有する「役職：一般行員」、「役職：パート」の２属性については、「Ｂ００１」の行のレコードでは欠損値を有しており、これらの属性は、ステップＳ２１で特定される共通属性とはならない。 On the other hand, the two attributes “title: general employee” and “title: part-timer”, which have values in the record in the “A001” row, are missing values in the record in the “B001” row, so these attributes are not common attributes identified in step S22.

この場合、レコード間距離算出部４２は、共通属性である４属性を用いて、ユークリッド距離を算出し、この距離を式（１）に従って共通属性数４で割った以下の値が支店番号「Ａ００１」の行のレコードと支店番号「Ｂ００１」の行のレコードとの距離となる。

In this case, the inter-record distance calculation unit 42 calculates the Euclidean distance using the four common attributes, and the value obtained by dividing this distance by the number of common attributes, 4, in accordance with equation (1) is the distance between the record in the row of branch number “A001” and the record in the row of branch number “B001”.

また、第３の例として、図７に示した支店番号「Ｂ００１」の行のレコードと支店番号「Ｃ００１」の行のレコードとのペアを選択した場合、「Ｂ００１」の行のレコードは、欠損値でない値をもつ属性は、「業務：預金」、「業務：融資」、「原因：能力不足」、「原因：ミス」の４つである。一方、「Ｃ００１」の行のレコードは、欠損値でない値をもつ属性は、「業務：預金」、「業務：融資」、「役職：一般行員」、「役職：パート」の４つである。 As a third example, suppose the pair consisting of the record in the row of branch number “B001” and the record in the row of branch number “C001” shown in FIG. 7 is selected. The record in the “B001” row has non-missing values for four attributes: “business: deposit”, “business: loan”, “cause: insufficient ability”, and “cause: mistake”. The record in the “C001” row, on the other hand, has non-missing values for four attributes: “business: deposit”, “business: loan”, “title: general employee”, and “title: part-timer”.

つまり、両レコードは、「業務：預金」、「業務：融資」の２つの属性については、ともに欠損値でない値を有しており、これらの属性がステップＳ２１で特定される共通属性となる。 That is, both records have non-missing values for the two attributes “business: deposit” and “business: loan”, and these attributes are the common attributes identified in step S22.

一方、「Ｃ００１」の行のレコードで値を有する「役職：一般行員」、「役職：パート」の２属性については、「Ｂ００１」の行のレコードでは欠損値を有しており、これらの属性は、ステップＳ２１で特定される共通属性とはならない。また、「Ｂ００１」の行のレコードで値を有する「役職：一般行員」、「役職：パート」の２属性については、「Ｃ００１」の行のレコードでは欠損値を有しており、これらの属性も、ステップＳ２１で特定される共通属性とはならない。 On the other hand, the two attributes “title: general employee” and “title: part-timer”, which have values in the record in the “C001” row, are missing values in the record in the “B001” row, so these attributes are not common attributes identified in step S22. Likewise, the two attributes “cause: insufficient ability” and “cause: mistake”, which have values in the record in the “B001” row, are missing values in the record in the “C001” row, so these attributes are not common attributes identified in step S22 either.

この場合、レコード間距離算出部４２は、共通属性である２属性を用いて、ユークリッド距離を算出し、この距離を式（１）に従って共通属性数２で割った以下の値が支店番号「Ｂ００１」の行のレコードと支店番号「Ｃ００１」の行のレコードとの距離となる。

In this case, the inter-record distance calculation unit 42 calculates the Euclidean distance using the two common attributes, and the value obtained by dividing this distance by the number of common attributes, 2, in accordance with equation (1) is the distance between the record in the row of branch number “B001” and the record in the row of branch number “C001”.

つまり、本実施形態における、各支店間の距離の算出では、従来技術のような、共通する属性が多いほど加算する項が増加して、これらの和である距離の値が不当に大きくなる事を防いでいる。 In other words, the calculation of the distance between branches in this embodiment prevents the problem seen in the conventional technique, in which the number of terms to be added grows as the number of common attributes increases, so that the distance, which is the sum of these terms, becomes unreasonably large.

さらに、本実施形態では、式（１）に示すように、１属性あたりの属性値の差が大きいほど、算出される距離が大きくなり、また、共通する属性の数が多いほど、算出される距離が小さくなるので、従来技術に比して精度の高い距離を算出する事が可能となる。 Furthermore, in the present embodiment, as shown in Expression (1), the greater the difference in attribute values per attribute, the greater the calculated distance, and the greater the number of common attributes, the greater the calculated value. Since the distance becomes smaller, it is possible to calculate a distance with higher accuracy than in the prior art.

あるレコードペアに対するステップＳ２３の処理の後、レコード間距離算出部４２は、結合済データテーブル上のすべてのレコードペアに対する、レコード間の距離の算出が終了していない場合には（ステップＳ２４のＮＯ）、ステップＳ２１に戻って、結合済データテーブルの２つレコードの新たなペアを任意に指定し、ステップＳ２２，Ｓ２３の処理を再度行なう。 After the process in step S23 for a certain record pair, the inter-record distance calculation unit 42 determines that the distance between records for all record pairs on the combined data table has not been completed (NO in step S24). ), Returning to step S21, a new pair of two records in the combined data table is arbitrarily designated, and the processes of steps S22 and S23 are performed again.

また、レコード間距離算出部４２は、結合済データテーブル上のすべてのレコードペアに対する、レコード間の距離の算出が終了した場合には（ステップＳ２４のＹＥＳ）、レコード間の距離の算出のための処理を終了する。
このようにして、レコード間距離算出部４２は、結合済データテーブル上のすべてのレコードペアに対して、レコード間の距離を算出する。 Also, the inter-record distance calculation unit 42 calculates the distance between records when the calculation of the distance between records for all record pairs on the combined data table is completed (YES in step S24). The process ends.
In this way, the inter-record distance calculation unit 42 calculates the inter-record distance for all record pairs on the combined data table.
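
The distance calculation of steps S21 to S24 can be sketched in Python as follows. This is a minimal sketch, assuming that equation (1) is the Euclidean distance over the common attributes divided by the number of common attributes, as the worked examples describe. The function names, the use of `None` for missing values, and the handling of a pair with no common attribute (returning `None`) are assumptions. The attribute values of branches B003 and C003 are the ones quoted from FIG. 12 later in this description.

```python
import math
from itertools import combinations


def record_distance(rec_i, rec_j):
    """Euclidean distance over common attributes, divided by their count."""
    # Step S22: common attributes have a non-missing value in both records.
    common = [a for a in rec_i
              if rec_i[a] is not None and rec_j.get(a) is not None]
    if not common:
        return None  # assumption: the text does not cover this case
    euclidean = math.sqrt(sum((rec_i[a] - rec_j[a]) ** 2 for a in common))
    return euclidean / len(common)  # Step S23


def all_pair_distances(table):
    """Steps S21 to S24: distance for every unordered pair of records."""
    return {frozenset((i, j)): record_distance(table[i], table[j])
            for i, j in combinations(table, 2)}


# Attribute values of branches B003 and C003 as quoted from FIG. 12 in the text.
table = {
    "B003": {"business: deposit": 3, "business: loan": 4,
             "title: general employee": None, "title: part-timer": None,
             "cause: insufficient ability": 2, "cause: mistake": 5},
    "C003": {"business: deposit": 3, "business: loan": 3,
             "title: general employee": 4, "title: part-timer": 2,
             "cause: insufficient ability": None, "cause: mistake": None},
}
distances = all_pair_distances(table)
print(distances[frozenset(("B003", "C003"))])  # 0.5
```

The two records share only “business: deposit” and “business: loan”, so the Euclidean distance 1 is divided by 2.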

次に、クラスタリング実施部４３の動作の詳細について説明する。
図９は、本実施形態におけるデータ分析支援装置のクラスタリング実施部による処理動作の一例を示すフローチャートである。
図９に示す処理動作は、図５に示す処理動作のステップＳ５を詳細に説明するものであり、記憶装置１２の結合済データテーブル格納部３２に格納される結合済データテーブルを読み出し、このテーブルの支店番号の列で示されるすべての支店のクラスタリングを行なうための処理動作である。 Next, details of the operation of the clustering execution unit 43 will be described.
FIG. 9 is a flowchart illustrating an example of a processing operation performed by the clustering execution unit of the data analysis support apparatus according to this embodiment.
The processing operation shown in FIG. 9 explains step S5 of the processing operation shown in FIG. 5 in detail. The combined data table stored in the combined data table storage unit 32 of the storage device 12 is read, and this table This is a processing operation for clustering all the branches indicated by the branch number column.

以下、各銀行の各支店をクラスタリングする過程を二次元マップで示す。本実施形態では、各支店をクラスタリングするために、クラスタ中心支店を設定して、このクラスタ中心支店に対し距離が近い支店を対応付けてクラスタを設定した上で、このクラスタの重心を求めて、この重心に最も距離が近い支店を新たなクラスタ中心支店として設定し、重心を求める前後のクラスタ中心支店が同じである場合に正しいクラスタリングが行えたとして、クラスタリング結果を出力する。 In the following, the process of clustering each branch of each bank is shown in a two-dimensional map. In the present embodiment, in order to cluster each branch, a cluster central branch is set, a cluster is set by associating a branch having a short distance to the cluster central branch, and then the center of gravity of the cluster is obtained. The branch closest to the center of gravity is set as a new cluster center branch, and if the cluster center branch before and after obtaining the center of gravity is the same, the clustering result is output assuming that correct clustering has been performed.

図１０は、クラスタ中心支店の初期集合の設定例を示す図である。
図１０に示した二次元マップでは、結合済データテーブルでの各行の支店番号で示される各支店を円で表す。そして、この二次元マップでは、支店間の距離は、記憶装置１２のレコード間距離格納部３３に格納されている距離を表す。 FIG. 10 is a diagram illustrating an example of setting an initial set of cluster central branches.
In the two-dimensional map shown in FIG. 10, each branch indicated by the branch number of each row in the combined data table is represented by a circle. In this two-dimensional map, the distance between branches represents the distance stored in the inter-record distance storage unit 33 of the storage device 12.

クラスタリング実施部４３は、予め指定されたクラスタ数と同数の支店を無作為に選択し、これら選択した各支店をクラスタ中心支店に設定する（ステップＳ３１）。 The clustering execution unit 43 randomly selects the same number of branches as the number of clusters designated in advance, and sets each selected branch as a cluster center branch (step S31).

例えばクラスタ数が３と指定された場合、クラスタリング実施部４３は、図１０を例にとると、この図１０で示される黒丸の３つの支店のそれぞれをクラスタ中心支店に設定する。 For example, when the number of clusters is specified as 3, the clustering execution unit 43 sets each of the three black circled branches shown in FIG. 10 as the cluster center branch, taking FIG. 10 as an example.

次に、クラスタリング実施部４３は、ステップＳ３１で設定したクラスタ中心支店以外の各支店の１つを任意に選択し（ステップＳ３２）、この選択した支店と各クラスタ中心支店との距離のそれぞれを、記憶装置１２のレコード間距離格納部３３から読み出して参照し（ステップＳ３３）、当該選択した支店を、各クラスタ中心支店のうち最も距離が近いクラスタ中心支店に対応付けることでクラスタを任意に生成する（ステップＳ３４）。この生成されたクラスタの要素は、クラスタ中心支店および当該クラスタ中心支店に対応付けられたその他の支店のそれぞれである。 Next, the clustering execution unit 43 arbitrarily selects one of the branches other than the cluster central branch set in step S31 (step S32), and sets the distance between the selected branch and each cluster central branch, The data is read from the inter-record distance storage unit 33 of the storage device 12 and referred to (step S33), and a cluster is arbitrarily generated by associating the selected branch with the closest cluster center branch among the cluster center branches ( Step S34). The generated cluster elements are the cluster central branch and the other branch associated with the cluster central branch.

図１１は、各支店をクラスタ中心支店に対応付けた例を示す図である。
図１１に示した例では、第１のクラスタ、第２のクラスタ、第３のクラスタといった３つのクラスタが示される。 FIG. 11 is a diagram illustrating an example in which each branch is associated with a cluster central branch.
In the example shown in FIG. 11, three clusters such as a first cluster, a second cluster, and a third cluster are shown.

第１のクラスタは、図１０に示した各クラスタ中心支店のうち第１のクラスタ中心支店５１に最も距離が近い２支店を対応付けた３支店でなるクラスタである。
第２のクラスタは、図１０に示した各クラスタ中心支店のうち第２のクラスタ中心支店５２に最も距離が近い３支店を対応付けた４支店でなる、二重線Ｌ１で囲まれたクラスタである。
第３のクラスタは、図１０に示した各クラスタ中心支店のうち第３のクラスタ中心支店５３に最も距離が近い５支店を対応付けた６支店でなるクラスタである。 The first cluster is a cluster composed of three branches that are associated with two branches that are closest to the first cluster central branch 51 among the cluster central branches shown in FIG.
The second cluster is a cluster surrounded by a double line L1 that is composed of four branches that are associated with the three branches closest to the second cluster central branch 52 among the cluster central branches shown in FIG. is there.
The third cluster is a cluster composed of six branches that are associated with five branches that are closest to the third cluster central branch 53 among the cluster central branches shown in FIG.
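
The assignment of branches to cluster central branches in steps S32 to S34 can be sketched as follows. This Python sketch is illustrative only: the function name `assign_to_clusters`, the branch labels, and the distance values are hypothetical, and the dictionary `distance` merely stands in for the distances read from the inter-record distance storage unit 33.

```python
def assign_to_clusters(branches, centers, distance):
    """Attach each non-center branch to its nearest cluster center (steps S32 to S34).

    distance: dict mapping frozenset({branch, center}) -> precomputed distance.
    """
    clusters = {center: [center] for center in centers}
    for branch in branches:
        if branch in centers:
            continue  # cluster central branches are already members of their cluster
        nearest = min(centers, key=lambda c: distance[frozenset((branch, c))])
        clusters[nearest].append(branch)
    return clusters


# Hypothetical branches and distances, for illustration only.
branches = ["P", "Q", "R", "S"]
centers = ["P", "S"]
distance = {
    frozenset(("Q", "P")): 0.2, frozenset(("Q", "S")): 0.9,
    frozenset(("R", "P")): 0.7, frozenset(("R", "S")): 0.3,
}
clusters = assign_to_clusters(branches, centers, distance)
print(clusters)  # {'P': ['P', 'Q'], 'S': ['S', 'R']}
```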

続いて、クラスタリング実施部４３は、クラスタを生成するための、クラスタ中心支店以外のすべての支店の選択済みであれば（ステップＳ３５のＹＥＳ）、ステップＳ３４で生成された各クラスタの重心を計算する（ステップＳ３６）。 Subsequently, the clustering execution unit 43 calculates the center of gravity of each cluster generated in step S34 if all branches other than the cluster central branch for generating the cluster have been selected (YES in step S35). (Step S36).

ここでは、図１１の二重線Ｌ１で囲った４支店でなる第２のクラスタに焦点を当てて説明する。
図１２は、結合済みデータテーブルで定義される所定のクラスタに含まれる各組織の属性および属性値の一例を表形式で示す図である。この図では、図７に示した結合済データテーブルから、上述の第２のクラスタに含まれる４つの支店のレコードの属性および当該属性の属性値を示す。 Here, the description will focus on the second cluster consisting of four branches surrounded by the double line L1 in FIG.
FIG. 12 is a diagram showing an example of attributes and attribute values of each organization included in a predetermined cluster defined in the combined data table in a table format. In this figure, from the combined data table shown in FIG. 7, the attributes of the records of the four branches included in the second cluster and the attribute values of the attributes are shown.

第２のクラスタに含まれる４つの支店は、図７に示した結合済データテーブルの支店番号「Ａ００３」に対応する支店、支店番号「Ａ００４」に対応する支店、支店番号「Ｂ００３」に対応する支店、支店番号「Ｃ００３」に対応する支店である。 The four branches included in the second cluster correspond to the branch corresponding to the branch number “A003”, the branch corresponding to the branch number “A004”, and the branch number “B003” in the combined data table illustrated in FIG. This is the branch corresponding to the branch and branch number “C003”.

具体的には、ステップＳ３３では、クラスタリング実施部４３は、第２のクラスタに含まれる４つの支店に対応するレコードの各属性について、各レコードの同じ属性の属性値の平均値を算出し、その平均値を重心の属性値とする。
ただし、算出する重心に係わる各レコードのうち属性値が欠損値であるレコードが存在する場合は、このレコードの属性値を平均値算出の対象外とし、属性値を持つレコードのみを対象として平均値を算出する。 Specifically, in step S36, the clustering execution unit 43 calculates, for each attribute of the records corresponding to the four branches included in the second cluster, the average of the attribute values of that attribute across the records, and uses this average as the attribute value of the centroid.
However, if there is a record whose attribute value is a missing value among the records related to the center of gravity to be calculated, the attribute value of this record is excluded from the average value calculation, and the average value only for the records having the attribute value Is calculated.

例えば、図１２に示した各レコードの「業務：預金」の属性値の平均値、つまり「業務：預金」の重心の属性値は、支店番号「Ａ００３」の行の値「５」、支店番号「Ａ００４」の行の値「２」、支店番号「Ｂ００３」の行の値「３」、支店番号「Ｃ００３」の行の値「３」の総和を、各レコードのうち「業務：預金」の属性値が欠損値でない値として存在するレコード数「４」で割った値であり、以下の式のようになる。 For example, the average value of the “business: deposit” attribute value of each record shown in FIG. 12, that is, the attribute value of the center of gravity of “business: deposit” is the value “5” in the row of the branch number “A003”, the branch number. The sum of the value “2” in the row “A004”, the value “3” in the row of the branch number “B003”, and the value “3” in the row of the branch number “C003” is set to “business: deposit” in each record. The attribute value is a value divided by the number of records “4” existing as non-missing values, as shown in the following formula.

(5+2+3+3)/4=3.25
また、図１２に示した各レコードの「業務：融資」の属性値の平均値、つまり「業務：融資」の重心の属性値は、支店番号「Ａ００３」の行の値「３」、支店番号「Ａ００４」の行の値「５」、支店番号「Ｂ００３」の行の値「４」、支店番号「Ｃ００３」の行の値「３」の総和を、各レコードのうち「業務：融資」の属性値が欠損値でない値として存在するレコード数「４」で割った値であり、以下の式のようになる。 (5 + 2 + 3 + 3) /4=3.25
Further, the average value of the attribute value of “business: loan” of each record shown in FIG. 12, that is, the attribute value of the center of gravity of “business: loan” is the value “3” in the row of the branch number “A003”, the branch number The sum of the value “5” in the row “A004”, the value “4” in the row of the branch number “B003”, and the value “3” in the row of the branch number “C003” The attribute value is a value divided by the number of records “4” existing as non-missing values, as shown in the following formula.

(3+5+4+3)/4=3.75
また、図１２に示した各レコードの「役職：一般行員」の属性値の平均値、つまり「役職：一般行員」の重心の属性値は、支店番号「Ａ００３」の行の値「２」、支店番号「Ａ００４」の行の値「４」、支店番号「Ｃ００３」の行の値「４」の総和を、各レコードのうち「役職：一般」の属性値が欠損値でない値として存在するレコード数「３」で割った値であり、以下の式のようになる。 (3 + 5 + 4 + 3) /4=3.75
Further, the average value of the attribute values of “title: general employee” of the records shown in FIG. 12, that is, the attribute value of the centroid for “title: general employee”, is the sum of the value “2” in the row of branch number “A003”, the value “4” in the row of branch number “A004”, and the value “4” in the row of branch number “C003”, divided by “3”, the number of records in which the attribute value of “title: general employee” exists as a non-missing value, as in the following formula.

(2+4+4)/3≒3.33である。 (2 + 4 + 4) /3≈3.33.

また、図１２に示した各レコードの「役職：パート」の属性値の平均値、つまり「役職：パート」の重心の属性値は、支店番号「Ａ００３」の行の値「６」、支店番号「Ａ００４」の行の値「３」、支店番号「Ｃ００３」の行の値「２」の総和を、各レコードのうち「役職：パート」の属性値が欠損値でない値として存在するレコード数「３」で割った値であり、以下の式のようになる。 Further, the average value of the attribute values of “position: part” of each record shown in FIG. 12, that is, the attribute value of the center of gravity of “position: part” is the value “6” in the row of the branch number “A003”, the branch number. The sum of the value “3” in the row of “A004” and the value “2” in the row of the branch number “C003” is used as the number of records in which the attribute value of “title: part” is not a missing value among the records “ This is the value divided by 3 ”, as shown in the following equation.

(6+3+2)/3≒3.67
Further, the average of the “cause: insufficient ability” attribute values of the records shown in FIG. 12, that is, the “cause: insufficient ability” attribute value of the center of gravity, is the sum of the value “3” in the row of branch number “A003”, the value “3” in the row of branch number “A004”, and the value “2” in the row of branch number “B003”, divided by “3”, the number of records in which the “cause: insufficient ability” attribute value is present rather than missing. It is given by the following formula.

(3+3+2)/3≈2.67
Further, the average of the “cause: miss” attribute values of the records shown in FIG. 12, that is, the “cause: miss” attribute value of the center of gravity, is the sum of the value “5” in the row of branch number “A003”, the value “4” in the row of branch number “A004”, and the value “5” in the row of branch number “B003”, divided by “3”, the number of records in which the “cause: miss” attribute value is present rather than missing. It is given by the following formula.

(5+4+5)/3≈4.67
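The four averages above can be reproduced with a short sketch. The record layout is an assumption: FIG. 12 is not reproduced here, so the sketch supposes that the cluster holds branches A003, A004, B003, and C003, that B003 has no “title” values, and that C003 has no “cause” values, which is what the quoted rows imply. The function name `centroid` and the use of `None` for a missing value are illustrative choices, not part of the embodiment.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]  # None marks a missing value (assumed encoding)


def centroid(records: List[Record]) -> Dict[str, float]:
    """Average each attribute over the records in which it is not missing."""
    names = {name for rec in records for name in rec}
    result: Dict[str, float] = {}
    for name in sorted(names):
        present = [rec[name] for rec in records if rec.get(name) is not None]
        if present:  # an attribute missing in every record has no average
            result[name] = sum(present) / len(present)
    return result


# Values quoted in the text; the None entries are an assumed reading of FIG. 12.
cluster = {
    "A003": {"title: general employee": 2, "title: part": 6,
             "cause: insufficient ability": 3, "cause: miss": 5},
    "A004": {"title: general employee": 4, "title: part": 3,
             "cause: insufficient ability": 3, "cause: miss": 4},
    "B003": {"title: general employee": None, "title: part": None,
             "cause: insufficient ability": 2, "cause: miss": 5},
    "C003": {"title: general employee": 4, "title: part": 2,
             "cause: insufficient ability": None, "cause: miss": None},
}

center = centroid(list(cluster.values()))
for name, value in center.items():
    print(f"{name}: {value:.2f}")
```

Dividing by the number of non-missing values, rather than by the number of records in the cluster, is what keeps a branch that does not collect an attribute from pulling that attribute's average toward zero.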
FIG. 13 is a diagram showing, in table form, an example of the calculated center of gravity of each attribute for the organizations (branches) included in a given cluster defined in the combined data table.
FIG. 14 is a diagram showing an example of the center of gravity of each cluster.
In FIG. 14, the centers of gravity of the first, second, and third clusters on the two-dimensional map are each indicated by an “x” mark.

Finally, the clustering execution unit 43 recalculates the cluster center branch of each cluster (step S37). Specifically, from among all branches, including the cluster center branches set in step S31, the clustering execution unit 43 finds the branch whose distance to the center of gravity calculated in step S36 for the given cluster is smallest, and sets this branch as the new cluster center branch. By doing this for each of the cluster center branches set in step S31, it sets a new set of cluster centers.

As with the distance between branches, the distance between each branch and the center of gravity of a cluster is calculated using formula (1) above.
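Step S37 can be sketched as a nearest-record search. Formula (1) is defined earlier in the description and is not repeated in this part, so the `distance` function below is only an assumed stand-in (a root-mean-square difference over the attributes present in both records); substitute formula (1) in practice. The names `distance` and `nearest_branch` and the sample values are hypothetical.

```python
import math
from typing import Dict, Optional

Record = Dict[str, Optional[float]]  # None marks a missing value (assumed encoding)


def distance(a: Record, b: Record) -> float:
    """Assumed stand-in for formula (1): RMS difference over shared attributes."""
    common = [k for k in a if a.get(k) is not None and b.get(k) is not None]
    if not common:
        return math.inf  # assumption: no shared attribute means "infinitely far"
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in common) / len(common))


def nearest_branch(branches: Dict[str, Record], center: Record) -> str:
    """Return the branch number whose record is closest to the center of gravity."""
    return min(branches, key=lambda name: distance(branches[name], center))


# Hypothetical records and center of gravity, for illustration only.
branches = {
    "A003": {"x": 2, "y": 6},
    "B003": {"x": None, "y": 3},
    "C003": {"x": 4, "y": None},
}
center_of_gravity = {"x": 3.0, "y": 3.5}
new_center_branch = nearest_branch(branches, center_of_gravity)
print(new_center_branch)
```

Because the search runs over all branches, as the text states, the new cluster center branch is always an actual branch record and never the synthetic center of gravity itself.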

FIG. 15 is a diagram showing, in table form, an example of the result of recalculating the cluster center branches of the clusters defined in the combined data table.
If the set of cluster center branches recalculated in step S37 has changed from the original set of cluster center branches set in step S31 (YES in step S38), the clustering execution unit 43 regards the clustering as not yet appropriate and returns to step S32. It arbitrarily selects one of the branches other than the cluster center branches recalculated in step S37 and performs the processing from step S33 onward again, using the recalculated cluster center branches as the reference.

In the example shown in FIG. 15, the set of cluster center branches has changed from the state shown in FIG. 10, so the process returns to step S32. Specifically, as shown in FIG. 15, the cluster center branch of the first cluster has changed from the initial cluster center branch 51 to cluster center branch 61, that of the second cluster from the initial cluster center branch 52 to cluster center branch 62, and that of the third cluster from the initial cluster center branch 53 to cluster center branch 63.

If, on the other hand, the set of cluster center branches recalculated in step S37 has not changed from the original set of cluster center branches set in step S31 (NO in step S38), the clustering execution unit 43 regards the clustering as appropriate, ends the clustering process, stores the clustering result in the clustering result storage unit 34 of the storage device 12, and outputs it to the display device 20, for example a liquid crystal display.
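The loop over steps S31 to S38 can be sketched as follows. This is a minimal illustration under several assumptions, not the embodiment's implementation: `distance` is a stand-in for formula (1), every branch (including each center itself) is assigned to its nearest center in one pass, duplicate centers are merged, and `max_iter` is an added safeguard against endless iteration. All function names and the sample data are hypothetical.

```python
import math
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]  # None marks a missing value (assumed encoding)


def distance(a: Record, b: Record) -> float:
    """Assumed stand-in for formula (1): RMS difference over shared attributes."""
    common = [k for k in a if a.get(k) is not None and b.get(k) is not None]
    if not common:
        return math.inf
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in common) / len(common))


def centroid(records: List[Record]) -> Record:
    """Per-attribute mean over the records in which the attribute is present."""
    names = {k for rec in records for k in rec}
    means: Record = {}
    for k in names:
        present = [rec[k] for rec in records if rec.get(k) is not None]
        if present:
            means[k] = sum(present) / len(present)
    return means


def cluster_branches(branches: Dict[str, Record],
                     initial_centers: List[str],
                     max_iter: int = 100) -> Dict[str, List[str]]:
    """Repeat assignment and center recalculation until the center set is stable."""
    centers = list(initial_centers)                      # step S31
    clusters: Dict[str, List[str]] = {}
    for _ in range(max_iter):
        clusters = {c: [] for c in centers}
        for name, rec in branches.items():               # steps S32 to S35
            nearest = min(centers, key=lambda c: distance(rec, branches[c]))
            clusters[nearest].append(name)
        new_centers = []
        for c in centers:
            gravity = centroid([branches[n] for n in clusters[c]])   # step S36
            new_centers.append(                                       # step S37
                min(branches, key=lambda n: distance(branches[n], gravity)))
        new_centers = list(dict.fromkeys(new_centers))   # assumption: merge duplicates
        if set(new_centers) == set(centers):             # step S38: NO, finished
            break
        centers = new_centers                            # step S38: YES, repeat
    return clusters


# Hypothetical one-attribute data, for illustration only.
sample = {"a": {"x": 0}, "b": {"x": 1}, "c": {"x": 10}, "d": {"x": 11}, "e": {"x": 12}}
result = cluster_branches(sample, ["a", "c"])
print(result)
```

The termination test compares sets of center branches rather than cluster memberships, mirroring the check in step S38.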

Next, the results of an experiment in which the present embodiment was applied to real data to evaluate the clustering accuracy are described below.
In this experiment, the accuracy of the following three methods was compared.
(a) The method of the present embodiment
(b) An existing method (with missing items)
(c) An existing method (without missing items)
For the existing methods (b) and (c), the following commonly used approach was adopted.

“An attribute that has a missing value in at least one record is not used for analysis.”
For method (c), however, data without missing items was used as the input data. This corresponds to the case in which all of the input data can be used, and indicates the upper limit of the accuracy of the clustering method.
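The rule used by the existing methods can be sketched as a column filter. The function name `drop_incomplete_attributes`, the attribute names, and the sample values are hypothetical, and `None` is an assumed encoding for a missing value.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[float]]  # None marks a missing value (assumed encoding)


def drop_incomplete_attributes(records: Dict[str, Record]) -> Dict[str, Dict[str, float]]:
    """Keep only the attributes that are present in every record."""
    names = set().union(*(rec.keys() for rec in records.values()))
    complete = [n for n in sorted(names)
                if all(rec.get(n) is not None for rec in records.values())]
    return {branch: {n: rec[n] for n in complete} for branch, rec in records.items()}


# Hypothetical ratios, for illustration only.
data = {
    "A01": {"operation a": 0.3, "cause x": 0.4, "situation p": 0.5},
    "B01": {"operation a": None, "cause x": 0.2, "situation p": 0.6},
    "C01": {"operation a": 0.1, "cause x": None, "situation p": 0.7},
}
filtered = drop_incomplete_attributes(data)
print(filtered)
```

In this sample only “situation p” survives, which illustrates how much information the existing rule can discard when each organization omits a different attribute.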

Next, the data used in the experiment is described. FIG. 16 is a diagram showing, in table form, the experimental data used to evaluate the clustering accuracy. The data aggregates the office errors of a total of 30 branches of three banks, Bank A, Bank B, and Bank C, and has a bank type column, a branch number column, a branch type column, and columns of error attribute values used for clustering.

The attribute values used for clustering, however, are ratios of error counts rather than aggregated error counts. For example, in the data shown in FIG. 16, the attribute value 0.291 of “operation a” in the record for branch number “A01” of Bank A is the ratio of “the number of errors in operation a that occurred at branch A01” to “the number of all errors that occurred at branch A01”. In other words, the attribute values of operation a through operation e in a record sum to 1.
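The conversion from counts to ratios can be sketched as below. The raw counts are hypothetical and were chosen only so that “operation a” yields the 0.291 quoted in the text; the actual counts behind FIG. 16 are not given. The function name `to_ratios` is likewise illustrative.

```python
from typing import Dict


def to_ratios(counts: Dict[str, int]) -> Dict[str, float]:
    """Divide each error count by the total number of errors at the branch."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("a branch with no recorded errors has no defined ratios")
    return {name: count / total for name, count in counts.items()}


# Hypothetical counts for branch A01, for illustration only.
a01_counts = {"operation a": 291, "operation b": 209, "operation c": 150,
              "operation d": 200, "operation e": 150}
a01_ratios = to_ratios(a01_counts)
print(a01_ratios["operation a"])
```

Using ratios instead of counts makes a large branch and a small branch comparable, since the distance then reflects the mix of errors and not their volume.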

The data shown in FIG. 16 is the input data for method (c). The input data for methods (a) and (b), on the other hand, was created by artificially introducing missing values into the data shown in FIG. 16.

FIG. 17 is a diagram showing, in table form, the office error collection status of each bank used to evaluate the clustering accuracy.
Based on FIG. 17, it is assumed that every bank collects information on the “person in charge” and “situation” items, that Bank B does not collect information on the “operation” item, and that Bank C does not collect information on the “cause” item. The corresponding parts of FIG. 16 were therefore set to missing values.
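The way the missing values were introduced can be sketched as a per-bank mask. The mapping reflects the collection status described in the text; the attribute naming scheme (an item name followed by a suffix), the function name `mask_uncollected`, and the sample values are assumptions.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[float]]  # None marks a missing value (assumed encoding)

# Item that each bank does not collect, as described for FIG. 17.
UNCOLLECTED_ITEM = {"B": "operation", "C": "cause"}


def mask_uncollected(record: Record, bank: str) -> Record:
    """Replace the values of the item a bank does not collect with None."""
    item = UNCOLLECTED_ITEM.get(bank)
    return {name: (None if item is not None and name.startswith(item) else value)
            for name, value in record.items()}


# Hypothetical complete record, for illustration only.
full = {"operation a": 0.4, "operation b": 0.6, "cause x": 0.7, "cause y": 0.3,
        "situation p": 1.0}
masked_b = mask_uncollected(full, "B")
masked_c = mask_uncollected(full, "C")
print(masked_b)
print(masked_c)
```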

FIG. 18 is a diagram showing, in table form, the experimental data including missing items that was used to evaluate the clustering accuracy.
The clustering results are evaluated using the values in the “branch type” column shown in FIGS. 16 and 18. The “branch type” is a classification of branches used in common by the three banks and takes one of three attribute values: “large”, “small”, and “special”.

Here, the clusters generated by each method with the number of clusters set to 3 are regarded as the “large”, “small”, and “special” sets. The ratio of the number of correctly classified branches to the total number of branches is calculated for every combination of the three clusters and the three branch types, and the highest value is taken as the accuracy rate of the method.

For example, suppose that the branch type that should be classified into the first cluster is “large”, into the second cluster “small”, and into the third cluster “special”. Suppose further that a certain method actually classified the branches as follows: five branches into the first cluster (“large”, “large”, “small”, “special”, “special”), two branches into the second cluster (“small”, “small”), and four branches into the third cluster (“large”, “special”, “special”, “special”).

First cluster: large, large, small, special, special
Second cluster: small, small
Third cluster: large, special, special, special
In this case, two “large” branches were classified into the first cluster, two “small” branches into the second cluster, and three “special” branches into the third cluster, so the accuracy rate, that is, the ratio of the number of branches correctly classified into each cluster to the total number of branches, is (2+2+3)/11=7/11. If this is the highest accuracy rate among all combinations of the three clusters and the three branch types, it is taken as the accuracy rate of the clustering result of this method.
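The accuracy rate described above can be sketched by trying every assignment of branch types to clusters and keeping the best one. The function name `accuracy_rate` is hypothetical; the cluster contents are the ones listed in the example.

```python
from itertools import permutations
from typing import List, Sequence


def accuracy_rate(clusters: Sequence[List[str]], branch_types: Sequence[str]) -> float:
    """Best ratio of correctly classified branches over all cluster/type pairings."""
    total = sum(len(members) for members in clusters)
    best = 0
    for assignment in permutations(branch_types, len(clusters)):
        # Count the members whose actual type matches the type assigned to their cluster.
        correct = sum(members.count(branch_type)
                      for members, branch_type in zip(clusters, assignment))
        best = max(best, correct)
    return best / total


example_clusters = [
    ["large", "large", "small", "special", "special"],  # first cluster
    ["small", "small"],                                 # second cluster
    ["large", "special", "special", "special"],         # third cluster
]
rate = accuracy_rate(example_clusters, ["large", "small", "special"])
print(rate)  # 7/11, about 0.636
```

For this example the six possible pairings give 7, 2, 4, 2, 2, and 5 correctly classified branches, so the pairing used in the text (first cluster “large”, second “small”, third “special”) is indeed the best one.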

Next, FIG. 19 shows the accuracy rate of each of methods (a), (b), and (c), that is, the ratio of the number of branches correctly classified into each cluster to the total number of branches, which indicates how accurate each method is.

As this example shows, the accuracy of “(a) the method of the present embodiment” exceeds that of “(b) the existing method (with missing items)”. Compared with the existing method, the method of the present embodiment can therefore be said to be robust to data that includes missing items.

As described above, the present embodiment stores an organization-specific data table for managing, by organization, records that are aggregated data having at least one type of attribute for each of a plurality of organizations to be analyzed. For each pair of records of organizations that share at least one type of common attribute, as shown in the organization-specific data table, it calculates the distance between the pair of records from the values of the common attributes, based on the number of attributes common to the records and the aggregate values of those common attributes. Based on the calculated distances, it then clusters the organizations corresponding to the records.
Therefore, even when the aggregated data does not match across all organizations because the attributes collected differ from organization to organization, using the information of the attributes common to the organizations makes effective use of the collected data and enables accurate analysis that integrates the data of multiple organizations.

According to each of these embodiments, it is possible to provide a data analysis support apparatus that can improve the accuracy of analysis when the data of different organizations is integrated, even if missing values arise because the data attributes differ among the organizations.
In the embodiment described above, the organization-specific data table was described as a data table for managing, by organization, records that are aggregated data having at least one type of attribute for each of a plurality of organizations to be analyzed. The data managed by this organization-specific data table may be either quantitative or qualitative.

Also, in the present embodiment, the clustering execution unit 43 of the data analysis support apparatus 10 was described as clustering the organizations corresponding to the records, based on the distance between each pair of records of organizations that share at least one type of common attribute, as shown in the organization-specific data table. The embodiment is not limited to this. As long as the analysis uses the distances between pairs of records, an analysis execution unit may be provided in place of the clustering execution unit 43, and this analysis execution unit may, for example, use the distances between pairs of records to perform analysis with a self-organizing map or with multidimensional scaling.

The methods described in the above embodiment can also be stored, as a program executable by a computer, on a storage medium such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), a magneto-optical disk (MO), or a semiconductor memory, and distributed.

The storage medium may use any storage format as long as it can store the program and is readable by a computer.

In addition, an OS (operating system) running on the computer, or MW (middleware) such as database management software or network software, may execute part of each process for realizing the above embodiment, based on instructions from the program installed on the computer from the storage medium.

Furthermore, the storage medium in the present invention is not limited to a medium independent of the computer, and also includes a storage medium that stores, or temporarily stores, a downloaded program transmitted via a LAN, the Internet, or the like.

The number of storage media is not limited to one. The case in which the processing of the above embodiment is executed from a plurality of media is also covered by the storage medium in the present invention, and the media may have any configuration.

The computer in the present invention executes each process of the above embodiment based on the program stored in the storage medium, and may have any configuration, such as a single device, for example a personal computer, or a system in which a plurality of devices are connected via a network.

The term computer in the present invention is not limited to a personal computer. It also includes arithmetic processing units and microcomputers included in information processing equipment, and is a generic term for equipment and devices capable of realizing the functions of the present invention by means of a program.

Although several embodiments of the invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, substitutions, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

DESCRIPTION OF SYMBOLS 10 ... Data analysis support apparatus, 11 ... Control unit, 12 ... Storage device, 20 ... Display device, 31 ... Organization-specific data table storage unit, 32 ... Combined data table storage unit, 33 ... Inter-record distance storage unit, 41 ... Data table combining unit, 42 ... Inter-record distance calculation unit, 43 ... Clustering execution unit.

Claims

A data analysis support apparatus comprising:
organization-specific data table storage means for storing an organization-specific data table for managing, by organization, records that are data having at least one type of attribute for each of a plurality of organizations to be analyzed;
distance calculation means for calculating, for each pair of records of the plurality of organizations that share at least one type of common attribute as shown in the organization-specific data table, a distance between the pair of records from the values of the common attributes, based on the number of attributes common to the records and the aggregate values of the common attributes; and
analysis processing means for performing analysis based on the distance calculated by the distance calculation means.

The data analysis support apparatus according to claim 1, wherein the distance calculation means
calculates, for each pair of records of the plurality of organizations having a common attribute as shown in the organization-specific data table, the distance between the records based on the difference between the value of each common attribute in one record and the value of that attribute in the other record, and on the reciprocal of the number of common attributes.

The data analysis support apparatus according to claim 1, wherein the analysis processing means,
based on the distances between the pairs of records calculated by the distance calculation means, arbitrarily sets a plurality of cluster centers, as a set of cluster centers, from among the cluster elements corresponding to the respective records; sets a cluster for each cluster center in the set by associating with it at least one cluster element that is close to that cluster center; calculates the center of gravity of each cluster thus set; sets the cluster element closest to the calculated center of gravity as the new cluster center of that cluster, thereby setting a new set of cluster centers; and, if the newly set set of cluster centers is not identical to the set of cluster centers that was set before the calculation of the center of gravity, again sets a cluster by associating with each new cluster center at least one cluster element that is close to it, and calculates the center of gravity of each cluster thus set as a new center of gravity;
and, if the newly set set of cluster centers is identical to the set of cluster centers that was set before the calculation of the center of gravity, outputs information on the most recently set clusters as the clustering result.

A data analysis support processing program for causing a computer having an organization-specific data table storage device, which stores an organization-specific data table for managing, by organization, records that are data having at least one type of attribute for each of a plurality of organizations to be analyzed, to function as:
distance calculation means for calculating, for each pair of records of the plurality of organizations that share at least one type of common attribute as shown in the organization-specific data table, a distance between the pair of records from the values of the common attributes, based on the number of attributes common to the records and the aggregate values of the common attributes; and
analysis processing means for performing analysis based on the distance calculated by the distance calculation means.