JP6567720B1

JP6567720B1 - Data preprocessing device, data preprocessing method, and data preprocessing program

Info

Publication number: JP6567720B1
Application number: JP2018060085A
Authority: JP
Inventors: 拓馬若森; 希望稲子
Original assignee: Nippon Telegraph and Telephone West Corp
Current assignee: Nippon Telegraph and Telephone West Corp
Priority date: 2018-03-27
Filing date: 2018-03-27
Publication date: 2019-08-28
Anticipated expiration: 2038-03-27
Also published as: JP2019174960A

Abstract

Problem: To provide a technique that enables even a non-expert to preprocess data. Solution: A clustering server 50 classifies high-dimensional data into clusters and calculates a score for each two-dimensional combination (pair of dimensions) of the high-dimensional data. A visualization server 30 displays the scores for the two-dimensional combinations in tabular form and accepts a selection, plots the high-dimensional data, divided into clusters, on a two-dimensional plane whose axes are the dimensions of the selected combination, and accepts preprocessing operations on the high-dimensional data. Selected drawing: FIG. 2

Description

The present invention relates to a technique for preprocessing data in data analysis.

In recent years, business analytics, in which business data is analyzed and used to formulate business strategy, has become widespread. Real business data is high-dimensional, often noisy and incomplete, and frequently lacks the necessary label information, so preprocessing such as cleansing, labeling, and feature selection is indispensable. Patent Document 1 describes a technique that can reduce the trial and error and man-hours that users spend on preprocessing work.

Patent Document 1: JP 2012-243013 A

It is said that preprocessing accounts for 90% of data analysis. Preprocessing consumes data scientists' working time, so they cannot devote enough time to model building, which is their primary task. The shortage of data scientists is also serious.

The present invention has been made in view of the above, and its object is to provide a technique that enables data preprocessing even by a non-expert who has domain knowledge of the data but lacks basic data science knowledge such as probability and statistics.

A data preprocessing device according to a first aspect of the present invention includes: a clustering unit that classifies high-dimensional data into clusters and calculates a score for any two-dimensional combination of the high-dimensional data; a global analysis screen display unit that displays a global analysis screen showing the scores for a plurality of the two-dimensional combinations and accepting selection of a two-dimensional combination; a local analysis screen display unit that, when a two-dimensional combination is selected on the global analysis screen, displays a local analysis screen that draws the high-dimensional data, divided into clusters, on a two-dimensional plane whose axes are the dimensions of the selected combination and that accepts operations on the high-dimensional data; and a history management unit that keeps the high-dimensional data immutable and manages a history of operations on the high-dimensional data. When an operation on the high-dimensional data is accepted on the local analysis screen, the history of that operation is registered in the history management unit, and the clustering unit recalculates the scores using the latest data, obtained by applying the operations in the operation history to the high-dimensional data. On moving from the local analysis screen to the global analysis screen, the global analysis screen display unit displays the recalculated scores and accepts selection of a two-dimensional combination.

A data preprocessing method according to a second aspect of the present invention is a data preprocessing method executed by a computer, and includes: a step of classifying high-dimensional data into clusters and calculating a score for any two-dimensional combination of the high-dimensional data; a step of displaying a global analysis screen that shows the scores for a plurality of the two-dimensional combinations and accepts selection of a two-dimensional combination; and a step of, when a two-dimensional combination is selected on the global analysis screen, displaying a local analysis screen that draws the high-dimensional data, divided into clusters, on a two-dimensional plane whose axes are the dimensions of the selected combination and that accepts operations on the high-dimensional data. The high-dimensional data is kept immutable. When an operation on the high-dimensional data is accepted on the local analysis screen, the history of that operation is registered in a history management unit, and the scores are recalculated using the latest data, obtained by applying the operations in the operation history to the high-dimensional data. On moving from the local analysis screen to the global analysis screen, the recalculated scores are displayed on the global analysis screen and selection of a two-dimensional combination is accepted.

A data preprocessing program according to a third aspect of the present invention causes a computer to operate as each unit of the above data preprocessing device.

According to the present invention, even a non-expert can perform data preprocessing.

FIG. 1 is a diagram for explaining the outline of the data preprocessing system of this embodiment.
FIG. 2 is a functional block diagram showing the configuration of the data preprocessing system of this embodiment.
FIG. 3(a) is a diagram showing an example of real business data before conversion, and FIG. 3(b) is a diagram showing an example of the converted data.
FIG. 4 is a diagram showing an example of the database structure of the DB server.
FIG. 5 is an example of the global analysis screen, which visualizes the global analysis.
FIG. 6 is an example of the local analysis screen, which visualizes the local analysis.
FIG. 7 is a diagram showing an example of removing outliers in the local analysis.
FIG. 8 is a diagram showing an example of labeling in the local analysis.
FIG. 9 is a flowchart showing the flow of data conversion by the data preprocessing system of this embodiment.
FIG. 10 is a flowchart showing the flow of global analysis by the data preprocessing system of this embodiment.
FIG. 11 is a flowchart showing the flow of local analysis by the data preprocessing system of this embodiment.
FIG. 12A is a sequence diagram showing the flow of data conversion by the data preprocessing system of this embodiment.
FIG. 12B is a sequence diagram showing the flow of global analysis by the data preprocessing system of this embodiment.
FIG. 12C is a sequence diagram showing the flow of local analysis by the data preprocessing system of this embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

As shown in FIG. 1, the data preprocessing system of this embodiment presents, as a global analysis screen, a table in which the score calculated for each two-dimensional combination of high-dimensional real business data is visualized as the diameter of a circle, and accepts selection of a circle. Instead of the circle diameter, other ways of visualizing the score can be chosen, such as color shading or the numerical value itself. If the circle sizes differ markedly, the differences on screen can be reduced, for example by taking the logarithm of the score. As a local analysis screen, the system plots the data on the two-dimensional plane corresponding to the selected circle and allows outliers to be removed and data to be labeled. Because the global analysis displays the scores for the two-dimensional combinations and accepts a selection, and the local analysis draws the data on a two-dimensional plane whose axes are the dimensions of the selected combination and accepts operations on the data, the high-dimensional data is visualized in an intuitive, easy-to-understand form, so even a non-expert can carry out data preprocessing. In addition, because the analyst can move easily between the global and local analyses and repeat them, preprocessing can be performed while surveying the clustering result from multiple viewpoints.
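The mapping from score to circle diameter, including the optional logarithmic compression, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the function name circle_diameters, the max_diameter parameter, the use of log1p, and the sample scores are all hypothetical, and scores are assumed to be non-negative.

```python
import math

def circle_diameters(scores, max_diameter=40.0, use_log=False):
    """Map each (axis1, axis2) score to a circle diameter (hypothetical helper)."""
    # Optionally compress the range; log1p maps a score of 0 to 0.
    values = {pair: (math.log1p(s) if use_log else s) for pair, s in scores.items()}
    largest = max(values.values(), default=0.0)
    if largest <= 0:
        return {pair: 0.0 for pair in values}
    # The largest score gets max_diameter; the others scale proportionally.
    return {pair: max_diameter * v / largest for pair, v in values.items()}

# Hypothetical scores for two pairs of dimensions.
scores = {("age", "contract_days"): 256.0, ("age", "inquiries"): 4.0}
linear = circle_diameters(scores)
logged = circle_diameters(scores, use_log=True)
```

With the linear mapping the smaller circle is only 1/64 of the larger one, while the logarithmic mapping keeps it clearly visible, which is the effect the text describes.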

FIG. 2 is a functional block diagram showing the configuration of the data preprocessing system of this embodiment. The system shown in FIG. 2 includes a conversion server 10, a database (DB) server 20, a visualization server 30, a history management server 40, and a clustering server 50. Each device in the system may be implemented as a computer having a processor, a storage device, and the like, with the processing of each device executed by a program. The program is stored in the storage device of each device; it can also be recorded on a recording medium such as a magnetic disk, an optical disc, or a semiconductor memory, or provided over a network. The functions of the devices may be implemented on a single computer or distributed across multiple computers. Each device is described below.

The conversion server 10 performs data conversion on the high-dimensional real business data. FIG. 3(a) shows an example of real business data before conversion (a sales-visit history), and FIG. 3(b) shows an example of the converted data. As data conversion, the conversion server 10 removes categorical variables, converts continuous values (for example, dates) to numbers, standardizes each column, and so on. In the example of FIG. 3, the gender column is removed, the visit date is converted to a number, and each column is standardized. The ID is retained for identification.
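The conversion steps just described (dropping categorical columns, turning dates into numbers, and standardizing each column while keeping the ID) can be sketched in Python as follows. This is only an illustration under assumptions: the column names id, gender, age, and visit_date, the ISO date format, the use of the proleptic day ordinal as the numeric date, and the sample rows are all hypothetical, since the contents of FIG. 3 are not reproduced here.

```python
from datetime import date
from statistics import mean, pstdev

def convert(rows, id_key="id", categorical=("gender",), date_keys=("visit_date",)):
    """Return new standardized rows; the input rows are not modified."""
    numeric = []
    for row in rows:
        out = {}
        for key, value in row.items():
            if key == id_key or key in categorical:
                continue  # the ID is re-attached below; categorical columns are dropped
            if key in date_keys:
                value = date.fromisoformat(value).toordinal()  # date -> day number
            out[key] = float(value)
        numeric.append(out)
    columns = list(numeric[0]) if numeric else []
    # Per-column mean and population standard deviation.
    stats = {c: (mean(r[c] for r in numeric), pstdev(r[c] for r in numeric)) for c in columns}
    converted = []
    for src, row in zip(rows, numeric):
        new = {id_key: src[id_key]}
        for c in columns:
            mu, sigma = stats[c]
            new[c] = (row[c] - mu) / sigma if sigma > 0 else 0.0
        converted.append(new)
    return converted

# Hypothetical sales-visit records.
raw = [
    {"id": "D1", "gender": "F", "age": 30, "visit_date": "2018-01-01"},
    {"id": "D2", "gender": "M", "age": 50, "visit_date": "2018-01-03"},
]
converted = convert(raw)
```

Constant columns are mapped to 0.0 rather than dividing by zero; that choice is an assumption of this sketch, not something the source specifies.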

The DB server 20 holds a converted data table 21, a preprocessing history table 22, and a clustering result table 23.

The converted data table 21 holds the converted data that the conversion server 10 produced from the real business data. The preprocessing history table 22 holds the history of each preprocessing operation on the converted data. In this embodiment, the converted data is treated as an immutable object, each preprocessing operation (outlier removal, labeling) is versioned, and the entire change history is managed with a version control system such as Git. When the visualization server 30 and the clustering server 50 refer to the converted data, they refer to the latest data (hereinafter also called "data" or "high-dimensional data"), which is the converted data in the converted data table 21 with the change history in the preprocessing history table 22 applied. Because the converted data itself is never modified and only the history of operations on it is managed, the data as of any point in time can be restored after an erroneous operation.
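The idea of leaving the converted data untouched and deriving the latest data by replaying the operation history can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record layout (keys op, targets, value), the function name latest_data, and the upto parameter are assumptions introduced here.

```python
def latest_data(converted, history, upto=None):
    """Replay history[:upto] over the immutable converted rows and return new rows."""
    deleted = set()
    labels = {}  # data ID -> label name
    for record in history[:upto]:
        if record["op"] == "delete":
            deleted.update(record["targets"])
        elif record["op"] == "label":
            for data_id in record["targets"]:
                labels[data_id] = record["value"]
    # Build fresh dicts so the stored converted rows are never mutated.
    return [
        {**row, "label": labels.get(row["id"])}
        for row in converted
        if row["id"] not in deleted
    ]

# Hypothetical converted data and operation history.
converted = (
    {"id": "D1", "age": -1.0},
    {"id": "D2", "age": 0.0},
    {"id": "D3", "age": 1.0},
)
history = [
    {"op": "delete", "targets": ["D1"]},
    {"op": "label", "targets": ["D2", "D3"], "value": "regular"},
]
current = latest_data(converted, history)             # all operations applied
before_label = latest_data(converted, history, upto=1)  # state after the first operation
original = latest_data(converted, history, upto=0)      # no operations applied
```

Because every view is recomputed from the unchanged converted rows, undoing a mistaken operation amounts to replaying a shorter prefix of the history.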

The clustering result table 23 holds the clustering results produced by the clustering server 50. A clustering result includes the assignment of the high-dimensional data to clusters and the score for each two-dimensional combination of the high-dimensional data.

FIG. 4 shows an example of the database structure of the DB server 20. The arrows in the figure indicate dependencies. The preprocessing history table 22 contains records pairing a history ID with an operation and value, and records pairing a history ID with a target ID. For example, for history ID H0, "delete" is registered as the operation and D1 as the target ID; this represents the operation of deleting the data with data ID D1 from the converted data. The clustering result table 23 contains records pairing a cluster ID with a history ID, axis 1, axis 2, and a score, and records pairing a cluster ID with a target ID and a cluster number. For example, the record with cluster ID C0 has history ID H0, age as axis 1, contract days as axis 2, and a score of 256. This means that, for the data obtained by applying every operation up to history ID H0 to the converted data, clustering with age as axis 1 and contract days as axis 2 yields a score of 256. In the clustering with cluster ID C0, data D1 belongs to the cluster with cluster number 0 and data D2 belongs to the cluster with cluster number 1.
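One way to picture the table layout described above is the following in-memory SQLite sketch. The table and column names (history_op, history_target, cluster_run, cluster_member, and so on) are hypothetical, as FIG. 4 is not reproduced here; only the example values H0, D1, D2, C0, age, contract days, and 256 come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Preprocessing history: one row per operation, plus the data IDs it targets.
CREATE TABLE history_op (history_id TEXT PRIMARY KEY, operation TEXT, value TEXT);
CREATE TABLE history_target (history_id TEXT REFERENCES history_op, target_id TEXT);
-- Clustering results: one row per (history state, axis pair), plus memberships.
CREATE TABLE cluster_run (cluster_id TEXT PRIMARY KEY,
                          history_id TEXT REFERENCES history_op,
                          axis1 TEXT, axis2 TEXT, score REAL);
CREATE TABLE cluster_member (cluster_id TEXT REFERENCES cluster_run,
                             target_id TEXT, cluster_number INTEGER);
""")

# The example records given in the description.
conn.execute("INSERT INTO history_op VALUES ('H0', 'delete', NULL)")
conn.execute("INSERT INTO history_target VALUES ('H0', 'D1')")
conn.execute("INSERT INTO cluster_run VALUES ('C0', 'H0', 'age', 'contract_days', 256)")
conn.executemany(
    "INSERT INTO cluster_member VALUES (?, ?, ?)",
    [("C0", "D1", 0), ("C0", "D2", 1)],
)

# Look up the score recorded for the data state after history ID H0.
run = conn.execute(
    "SELECT axis1, axis2, score FROM cluster_run WHERE history_id = 'H0'"
).fetchone()
members = conn.execute(
    "SELECT target_id, cluster_number FROM cluster_member "
    "WHERE cluster_id = 'C0' ORDER BY target_id"
).fetchall()
```

Keying each clustering run by history ID ties a score to the exact data state it was computed from, which matches the dependency arrows the text mentions.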

For the global analysis, the visualization server 30 uses the score information held in the clustering result table 23 to display the score calculated for each two-dimensional combination of the dimensions that make up the high-dimensional data. More specifically, it lists character strings naming the dimensions along both the vertical and horizontal axes and visualizes the scores in tabular form, with the magnitude of the score for each two-dimensional combination mapped to the diameter of a circle. FIG. 5 shows an example of the global analysis screen.

In the global analysis, the analyst can select a pair of dimensions of interest (the circle at the intersection of two axes) based on the visualized scores. When a circle is selected, the visualization server 30 moves to the local analysis for the two dimensions corresponding to that circle and visualizes its result.

For the local analysis, the visualization server 30 plots all the data on the two-dimensional plane selected in the global analysis. It draws each data point with a shape or color that identifies the cluster it belongs to. Every data point belongs to one of the clusters. FIG. 6 shows an example of the local analysis screen. In the example of FIG. 6, the data is plotted on a plane with the number of inquiries on the horizontal axis and the number of days since the visit on the vertical axis. Each data point is drawn with the shape and color of its cluster. A legend at the upper left of the screen explains the shape, color, and name of each cluster, the shape marking each cluster's center point, and the shape marking each cluster's farthest point.
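The center point and farthest point shown in the legend could be computed along the following lines. This sketch makes two assumptions that the text does not state: the center is taken as the mean of the cluster's members in the selected two-dimensional plane, and the farthest point is the member farthest from that center. The function name cluster_markers and the sample coordinates are hypothetical.

```python
import math

def cluster_markers(points, assignment):
    """points: {data_id: (x, y)}; assignment: {data_id: cluster_number}."""
    markers = {}
    for cluster in sorted(set(assignment.values())):
        members = [i for i in points if assignment[i] == cluster]
        # Center: mean of the member coordinates in the selected plane.
        cx = sum(points[i][0] for i in members) / len(members)
        cy = sum(points[i][1] for i in members) / len(members)
        # Farthest point: the member with the largest Euclidean distance to the center.
        farthest = max(members, key=lambda i: math.dist(points[i], (cx, cy)))
        markers[cluster] = {"center": (cx, cy), "farthest": farthest}
    return markers

# Hypothetical data in the (inquiries, days since visit) plane.
points = {"D1": (0.0, 0.0), "D2": (2.0, 0.0), "D3": (10.0, 0.0), "D4": (5.0, 5.0)}
assignment = {"D1": 0, "D2": 0, "D3": 0, "D4": 1}
markers = cluster_markers(points, assignment)
```

A point that sits far from its cluster center, such as D3 here, is a natural candidate for the analyst to inspect as a possible outlier.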

While the visualization server 30 displays the local analysis result, the analyst visually checks which cluster each data point belongs to, removes outliers, and labels the data.

FIG. 7 shows outlier removal. To remove outliers, the user selects any point or set of points representing data on the two-dimensional plane and performs a delete operation on the selection. The history of this delete operation is recorded by the history management server 40.

FIG. 8 shows labeling. To label data, the analyst selects a cluster name shown in the legend and changes it to the desired label name. This operation labels all the data belonging to that cluster at once. Any other method may be used as long as all the data belonging to a cluster can be labeled. The history of this labeling is recorded by the history management server 40.

The history management server 40 manages the history of preprocessing operations performed on the data in the local analysis. It records the operation history in the preprocessing history table 22 of the DB server 20.

The clustering server 50 clusters the latest data, obtained by applying the operation history recorded in the preprocessing history table 22 to the converted data, and calculates a score for each two-dimensional combination of the dimensions that make up the high-dimensional data. The clustering result and the scores for the two-dimensional combinations are recorded in the clustering result table 23.

Clustering uses k-means++; other algorithms may be used. Drawing on knowledge of the data, the analyst analyzes the data by dividing it into several groups, and the number of groups is specified as the number of clusters k. The score is calculated with the index known as the Calinski-Harabasz index, which is the ratio of between-cluster variance to within-cluster variance; other indices may be used. In its standard form the index is

\[ \mathrm{CH} = \frac{SS_B / (k - 1)}{SS_W / (N - k)} \]

where SS_W is the within-cluster variance (sum of squared distances), SS_B is the between-cluster variance (the sum of squared distances of all samples from the overall center point, minus SS_W), k is the number of clusters, and N is the total number of samples.
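The scoring of every pair of dimensions can be sketched in plain Python as follows. This is a rough illustration under assumptions rather than the patent's implementation: it clusters each pair of dimensions separately in its own two-dimensional plane (the text leaves open whether clustering is done per pair), uses Lloyd's iterations with k-means++ seeding, and all function names, the fixed random seed, and the sample rows are hypothetical.

```python
import itertools
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans_pp(points, k, iters=50, seed=0):
    """Lloyd's algorithm with k-means++ seeding; returns one cluster number per point."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # k-means++ seeding: pick the next center with probability proportional to D(x)^2.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:  # every point coincides with an existing center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights)[0])
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the previous center if a cluster became empty
                centers[j] = centroid(members)
    return labels

def calinski_harabasz(points, labels):
    """(SS_B / (k - 1)) / (SS_W / (N - k)), with k counted over non-empty clusters."""
    k, n = len(set(labels)), len(points)
    if k < 2 or n <= k:
        return 0.0
    overall = centroid(points)
    ss_total = sum(sq_dist(p, overall) for p in points)
    ss_w = 0.0
    for j in set(labels):
        members = [p for p, l in zip(points, labels) if l == j]
        c = centroid(members)
        ss_w += sum(sq_dist(p, c) for p in members)
    ss_b = ss_total - ss_w  # between-cluster part, as defined in the text
    if ss_w == 0:
        return float("inf")
    return (ss_b / (k - 1)) / (ss_w / (n - k))

def score_all_pairs(rows, dims, k):
    """Return {(dim_a, dim_b): score} for every pair of dimensions."""
    scores = {}
    for a, b in itertools.combinations(dims, 2):
        pts = [(r[a], r[b]) for r in rows]
        scores[(a, b)] = calinski_harabasz(pts, kmeans_pp(pts, k))
    return scores

# Hypothetical data: x and y separate into two tight groups, z does not.
rows = [
    {"x": 0.0, "y": 0.0, "z": 0.0},
    {"x": 0.1, "y": 0.1, "z": 1.0},
    {"x": 5.0, "y": 5.0, "z": 0.0},
    {"x": 5.1, "y": 5.1, "z": 1.0},
]
scores = score_all_pairs(rows, ["x", "y", "z"], k=2)
```

In this toy data the (x, y) pair receives a far higher score than the pairs involving z, so it would appear as the largest circle on the global analysis screen.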

The terminal 60 is the device operated by the analyst. It displays the global and local analysis screens created by the visualization server 30, accepts the two-dimensional combination to be analyzed locally, accepts operations on the data, and so on.

Next, the operation of the data preprocessing system of this embodiment will be described.

FIG. 9 is a flowchart showing the flow of data conversion by the data preprocessing system of this embodiment.

The conversion server 10 reads the real data (step S11), converts it (step S12), and sends the converted data to the DB server 20 (step S13).

On receiving the converted data, the DB server 20 initializes the converted data table 21 with it (step S14).

The clustering server 50 clusters the converted data and calculates a score for each two-dimensional combination of the dimensions that make up the high-dimensional data (step S15). The clustering result and the scores are recorded in the clustering result table 23.

FIG. 10 is a flowchart showing the flow of global analysis by the data preprocessing system of this embodiment.

The visualization server 30 obtains the scores from the clustering result table 23 and performs the visualization for the global analysis (step S21).

The visualization server 30 accepts selection of any two-dimensional combination of the dimensions that make up the high-dimensional data. When a two-dimensional combination is selected (YES in step S22), the local analysis is performed (step S23).

When the visualization server 30 receives a data output instruction, preprocessing is complete (YES in step S24); it creates the output data by applying the operation history recorded in the preprocessing history table 22 to the converted data held in the converted data table 21, and outputs it (step S25).

If preprocessing is not complete (NO in step S24), the global analysis (step S21) and the local analysis (step S23) are repeated. If outliers were removed in the local analysis of step S23, the visualization server 30 performs the global analysis visualization using the latest scores held in the clustering result table 23.

FIG. 11 is a flowchart showing the flow of local analysis by the data preprocessing system of this embodiment.

The visualization server 30 performs the visualization for the local analysis of the two-dimensional combination selected in the global analysis (step S31).

When the analyst selects a point or set of points and performs a delete operation (YES in step S32), the selected point or set of points is deleted and the history management server 40 records the delete operation in the history (steps S33, S34).

The clustering server 50 performs clustering using the latest data, from which the selected point or set of points has been deleted (step S35).

When the analyst selects a cluster in the legend and changes its name (YES in step S36), the selected cluster is labeled and the history management server 40 records the labeling operation in the history (steps S37, S38).

Outlier removal and labeling may be performed in either order. Outliers may be removed more than once, and more than one cluster may be labeled.
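The local analysis flow of FIG. 11 can be sketched as an event handler that records every operation in the history and reclusters only after deletions. This is a simplified illustration under assumptions: the event and record layouts, the function name handle_local_events, and the recluster callback are hypothetical stand-ins for the history management server 40 and the clustering server 50.

```python
def handle_local_events(events, history, recluster):
    """Append each analyst operation to the history; recluster only after deletions."""
    for event in events:
        if event["type"] == "delete":    # YES in step S32 -> steps S33, S34
            history.append({"op": "delete", "targets": list(event["ids"])})
            recluster(history)           # step S35: cluster the latest data again
        elif event["type"] == "label":   # YES in step S36 -> steps S37, S38
            history.append(
                {"op": "label", "cluster": event["cluster"], "value": event["name"]}
            )
        else:
            raise ValueError(f"unknown event type: {event['type']}")
    return history

# Hypothetical session: one outlier removal followed by one labeling operation.
calls = []  # records the history length at each reclustering request
history = handle_local_events(
    [
        {"type": "delete", "ids": ["D7", "D9"]},
        {"type": "label", "cluster": 1, "name": "loyal customers"},
    ],
    history=[],
    recluster=lambda h: calls.append(len(h)),
)
```

Since the events are processed in whatever order they arrive, the sketch reflects the statement that removal and labeling may be performed in any order and any number of times.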

Next, the operation of the data preprocessing system as a whole will be described.

FIGS. 12A to 12C are sequence diagrams showing the processing flow of the data preprocessing system of this embodiment.

First, the flow of data conversion is described with reference to FIG. 12A.

The conversion server 10 converts the real data (step S101) and sends the converted data to the DB server 20 (step S102).

The DB server 20 initializes the converted data table 21 with the received converted data (step S103).

The DB server 20 creates the latest data by applying the operation history held in the preprocessing history table 22 to the converted data held in the converted data table 21 (step S104), and sends the latest data to the clustering server 50 (step S105). At this stage there is no operation history, so the converted data held in the converted data table 21 is sent as the latest data.

The clustering server 50 performs clustering using the latest data (step S106) and sends the clustering result to the DB server 20 (step S107).

The DB server 20 updates the clustering result table 23 with the received clustering result (step S108).

Next, the flow of global analysis is described with reference to FIG. 12B.

The DB server 20 creates the latest data by applying the operation history held in the preprocessing history table 22 to the converted data held in the converted data table 21 (step S201), and sends the latest data and the clustering result to the visualization server 30 (step S202).

The visualization server 30 creates the global analysis screen using the latest data and the clustering result (step S203) and displays it on the terminal 60 (step S204).

The terminal 60 accepts the analyst's selection (step S205) and sends the two selected dimensions (axis 1 and axis 2) to the visualization server 30 (step S206).

続いて、図１２Ｃを参照し、局所分析の流れについて説明する。 Next, the flow of local analysis will be described with reference to FIG. 12C.

ＤＢサーバ２０は、変換後データテーブル２１の保持する変換後データに前処理履歴テーブル２２の保持する操作履歴を適用して最新データを作成し（ステップＳ３０１）、最新データおよびクラスタリング結果を可視化サーバ３０へ送信する（ステップＳ３０２）。 The DB server 20 creates the latest data by applying the operation history held in the preprocessing history table 22 to the converted data held in the converted data table 21 (step S301), and visualizes the latest data and the clustering result. (Step S302).

可視化サーバ３０は、最新データおよびクラスタリング結果を用いて局所分析画面を作成し（ステップＳ３０３）、局所分析画面を端末６０に表示させる（ステップＳ３０４）。 The visualization server 30 creates a local analysis screen using the latest data and the clustering result (step S303), and displays the local analysis screen on the terminal 60 (step S304).

端末６０は、分析者から外れ値の選択および削除の操作を受け付けると（ステップＳ３０５）、外れ値のデータＩＤを含む外れ値データを履歴管理サーバ４０へ送信する（ステップＳ３０６）。 Upon receiving an outlier selection and deletion operation from the analyst (step S305), the terminal 60 transmits outlier data including the outlier data ID to the history management server 40 (step S306).

履歴管理サーバ４０は、外れ値データを受信すると、外れ値データに含まれるデータＩＤの示すデータを削除する履歴データを作成し（ステップＳ３０７）、作成した履歴データをＤＢサーバ２０へ送信する（ステップＳ３０８）。 When receiving the outlier data, the history management server 40 creates history data for deleting the data indicated by the data ID included in the outlier data (step S307), and transmits the created history data to the DB server 20 (step S307). S308).

The DB server 20 registers the received history data in the preprocessing history table 22 (step S309).

Because a data deletion operation has been performed, the DB server 20 creates the latest data by applying the operation history held in the preprocessing history table 22 to the converted data held in the converted data table 21 (step S310), and transmits the latest data to the clustering server 50 (step S311).

The clustering server 50 performs clustering using the latest data (step S312), and transmits the clustering result to the DB server 20 (step S313).

The DB server 20 updates the clustering result table 23 with the received clustering result (step S314).

Likewise, when the terminal 60 accepts a labeling operation from the analyst (step S315), it transmits label data, which includes the number of the cluster to be labeled and the label name to assign, to the history management server 40 (step S316).

When the history management server 40 receives the label data, it creates history data for assigning the label name to the cluster number included in the label data (step S317), and transmits the created history data to the DB server 20 (step S318).

The DB server 20 registers the received history data in the preprocessing history table 22 (step S319).

In this embodiment, the analyst selects the two-dimensional combination to analyze locally on the global analysis screen and selects the outliers to remove on the local analysis screen; however, these steps can also be automated by providing a processing unit that performs them automatically.

For example, in the global analysis, the processing unit selects, from all two-dimensional combinations, the combination with the highest score.
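The automatic selection in the global analysis can be sketched in Python as follows. This is only an illustration under assumptions: the function name `select_best_pair`, the dimension names, and the score values are hypothetical, and the `score` callable stands in for whatever score the clustering server 50 computes for a pair of dimensions.

```python
from itertools import combinations


def select_best_pair(dimensions, score):
    """Return the pair of dimensions whose score is highest.

    `score(a, b)` is a hypothetical callable that returns the score
    of the two-dimensional combination (a, b).
    """
    # Enumerate every unordered two-dimensional combination once.
    return max(combinations(dimensions, 2), key=lambda pair: score(*pair))


# Hypothetical score table for three dimensions.
table = {
    ("age", "income"): 0.31,
    ("age", "visits"): 0.74,
    ("income", "visits"): 0.52,
}
best = select_best_pair(["age", "income", "visits"], lambda a, b: table[(a, b)])
```

If several combinations share the highest score, `max` returns the first one in enumeration order; the source does not specify a tie-breaking rule.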

For outlier removal in the local analysis, the processing unit selects and removes outliers for each cluster according to the following criteria.

An outlier threshold θ_th (1 or more) and a maximum number of iterations N_r are set, the removal count is initialized to N_d = 0, and the following processing is performed on the data points X_n = {x_n1, x_n2, ..., x_nN} belonging to cluster n (1 ≤ n < k).

For every x_nm = (x_nm1, x_nm2) (1 ≤ m < N), the L² distance (Euclidean distance) d of x_nm from the center point c_n = (c_n1, c_n2) of cluster n is calculated by the following equation.

$$ d = \sqrt{(x_{nm1} - c_{n1})^2 + (x_{nm2} - c_{n2})^2} $$

Among all values of d, the farthest point x_np (1 ≤ p < N), which gives the maximum d, is obtained.

An anomaly determination is performed on x_np by the local outlier factor (LOF) method. If the anomaly score exceeds θ_th, x_np is regarded as an outlier. In that case, if N_d < N_r, x_np is removed and N_d is set to N_d + 1; if N_d ≥ N_r, outlier removal ends.
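The automatic removal procedure above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment. The function names (`dist`, `knn`, `lof`, `remove_outliers`), the neighbor count `k`, the use of the cluster mean as the center point c_n, and the choice to stop as soon as the farthest point is not judged an outlier are all assumptions. The LOF computation is a simplified textbook version that ignores distance ties and assumes the cluster contains no duplicate points.

```python
import math


def dist(a, b):
    """Euclidean (L2) distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: dist(points[i], points[j]))
    return others[:k]


def lof(points, i, k):
    """Simplified local outlier factor of points[i] (ties are ignored)."""
    def k_distance(j):
        return dist(points[j], points[knn(points, j, k)[-1]])

    def lrd(j):  # local reachability density
        neigh = knn(points, j, k)
        reach = [max(k_distance(o), dist(points[j], points[o])) for o in neigh]
        return len(reach) / sum(reach)

    neigh = knn(points, i, k)
    return sum(lrd(o) for o in neigh) / (len(neigh) * lrd(i))


def remove_outliers(points, theta_th=1.5, n_r=10, k=3):
    """Return (kept, removed) for one cluster of 2-D points."""
    kept, removed = list(points), []
    # N_d corresponds to len(removed); the loop ends once N_d >= N_r.
    while len(removed) < n_r and len(kept) > k + 1:
        # Assumption: the center point c_n is the mean of the remaining points.
        cx = sum(p[0] for p in kept) / len(kept)
        cy = sum(p[1] for p in kept) / len(kept)
        # Farthest point x_np from the center.
        far = max(range(len(kept)), key=lambda m: dist(kept[m], (cx, cy)))
        if lof(kept, far, k) <= theta_th:
            break  # assumption: stop when the farthest point is not an outlier
        removed.append(kept.pop(far))
    return kept, removed
```

Testing only the farthest point in each iteration keeps the number of LOF evaluations at or below N_r per cluster, which appears to be the purpose of combining the distance ranking with the LOF check.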

For labeling in the local analysis, a label may be assigned to each cluster automatically.

As described above, according to the present embodiment, the clustering server 50 classifies the high-dimensional data into clusters and calculates a score for each two-dimensional combination of the high-dimensional data. The visualization server 30 displays the scores of the two-dimensional combinations in a table and accepts a selection, plots the high-dimensional data, divided into clusters, on a two-dimensional plane whose axes are the dimensions of the selected combination, and accepts preprocessing operations on the high-dimensional data. Because the high-dimensional data is visualized in an intuitive form that is easy for non-experts to understand, even a non-expert can carry out data preprocessing.

According to the present embodiment, the high-dimensional data itself is kept unchanged, the history management server 40 manages the history of preprocessing operations on the high-dimensional data, and whenever the high-dimensional data is referenced, the latest data obtained by reflecting the operation history in the converted data of the converted data table 21 is used. As a result, if an erroneous operation is performed, the data can be restored to its state at any point in time.
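The idea of keeping the converted data unchanged and deriving the latest data from the operation history can be sketched in Python as follows. This is an illustration under assumptions: the record types `DeleteOp` and `LabelOp`, the function `latest_data`, and the `upto` parameter are hypothetical names, and the actual layout of the preprocessing history table 22 is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Union


@dataclass(frozen=True)
class DeleteOp:
    data_ids: FrozenSet[int]  # data IDs removed as outliers


@dataclass(frozen=True)
class LabelOp:
    cluster: int  # cluster number to label
    label: str    # label name to assign


Op = Union[DeleteOp, LabelOp]


def latest_data(converted: Dict[int, dict], history: List[Op],
                upto: Optional[int] = None):
    """Replay history[:upto] over a copy of the converted data.

    The converted data is never modified, so replaying a shorter
    prefix of the history restores an earlier state.
    """
    rows = {i: dict(r) for i, r in converted.items()}
    labels: Dict[int, str] = {}
    for op in history[:upto]:
        if isinstance(op, DeleteOp):
            for i in op.data_ids:
                rows.pop(i, None)
        else:
            labels[op.cluster] = op.label
    return rows, labels


# Hypothetical converted data and history.
converted = {1: {"x": 0.1}, 2: {"x": 9.9}, 3: {"x": 0.2}}
history = [DeleteOp(frozenset({2})), LabelOp(0, "regular"), DeleteOp(frozenset({3}))]

rows_now, labels_now = latest_data(converted, history)              # all operations applied
rows_before, labels_before = latest_data(converted, history, upto=1)  # state after the first operation
```

Because every state is recomputed from the same unchanged base, undoing a mistaken deletion amounts to replaying fewer history entries rather than reconstructing lost rows.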

DESCRIPTION OF SYMBOLS
10 ... Conversion server
20 ... DB server
21 ... Converted data table
22 ... Preprocessing history table
23 ... Clustering result table
30 ... Visualization server
40 ... History management server
50 ... Clustering server
60 ... Terminal

Claims

A data preprocessing apparatus comprising:
a clustering unit that classifies high-dimensional data into clusters and calculates a score for an arbitrary two-dimensional combination of the high-dimensional data;
a global analysis screen display unit that displays a global analysis screen which shows the scores for a plurality of the two-dimensional combinations and accepts selection of a two-dimensional combination;
a local analysis screen display unit that, when a two-dimensional combination is selected on the global analysis screen, displays a local analysis screen which draws the high-dimensional data, divided into clusters, on a two-dimensional plane whose axes are the dimensions of the selected two-dimensional combination, and which accepts operations on the high-dimensional data; and
a history management unit that keeps the high-dimensional data unchanged and manages a history of operations on the high-dimensional data,
wherein, when an operation on the high-dimensional data is accepted on the local analysis screen, the history of the operation is registered in the history management unit, and the clustering unit recalculates the score using the latest data in which the operations indicated by the operation history are reflected in the high-dimensional data, and
wherein, when the display transitions from the local analysis screen to the global analysis screen, the global analysis screen display unit displays the recalculated score and accepts selection of a two-dimensional combination.

The data preprocessing apparatus according to claim 1, wherein the global analysis screen display unit arranges character strings representing the dimensions of the high-dimensional data along a vertical axis and a horizontal axis, and displays the scores in a table format.

The data preprocessing apparatus according to claim 1, further comprising a processing unit that removes, as an outlier, high-dimensional data for which the distance between the center point of each cluster, on the two-dimensional plane whose axes are the dimensions of the two-dimensional combination, and the high-dimensional data belonging to that cluster satisfies a predetermined criterion.

A data preprocessing method executed by a computer, the method comprising:
classifying high-dimensional data into clusters and calculating a score for an arbitrary two-dimensional combination of the high-dimensional data;
displaying a global analysis screen that shows the scores for a plurality of the two-dimensional combinations and accepts selection of a two-dimensional combination; and
when a two-dimensional combination is selected on the global analysis screen, displaying a local analysis screen that draws the high-dimensional data, divided into clusters, on a two-dimensional plane whose axes are the dimensions of the selected two-dimensional combination, and that accepts operations on the high-dimensional data,
wherein the high-dimensional data is kept unchanged,
when an operation on the high-dimensional data is accepted on the local analysis screen, the history of the operation is registered in the history management unit, and the score is recalculated using the latest data in which the operations indicated by the operation history are reflected in the high-dimensional data, and
when the display transitions from the local analysis screen to the global analysis screen, the recalculated score is displayed on the global analysis screen and selection of a two-dimensional combination is accepted.

A data preprocessing program for causing a computer to operate as each unit of the data preprocessing apparatus according to any one of claims 1 to 3.