JP2019168820A

JP2019168820A - Data analysis support system and data analysis support method

Info

Publication number: JP2019168820A
Application number: JP2018054882A
Authority: JP
Inventors: Takaaki Yamada; Yuki Maekawa
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2018-03-22
Filing date: 2018-03-22
Publication date: 2019-10-03
Anticipated expiration: 2038-03-22
Also published as: JP6886935B2

Abstract

To provide a data analysis support system that reveals data relationships among tables.
SOLUTION: In a data analysis support system comprising a computer, the computer comprises a memory in which a program is stored and a CPU that executes the program stored in the memory. The CPU inputs multiple tables, extracts keywords from each of the input tables, performs machine learning on the extracted keywords, extracts from each table, as a result of the machine learning, feature components each including pairs of a keyword and an appearance probability, identifies combinations of feature components across tables based on matches between tables of the keywords included in the extracted feature components, synthesizes the appearance probabilities of the feature components included in each identified combination to aggregate the analysis results, and outputs the aggregated analysis results.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a data analysis support system and a data analysis support method.

Demand for data analysis has long been strong. Because huge volumes of data are difficult to analyze by hand, a variety of data analysis techniques such as data mining have been developed. One of them, machine learning, formulates a model based on statistics, probability, or the like, and then estimates the detailed parameters of that model from actually observed data. Decisions made by a computer on the basis of a machine-learned model are regarded as intelligent decisions and can be useful for classification, prediction, recommendation, and so on.

However, when big data of mixed quality is analyzed directly and without forethought, the analysis tends to yield only results that could be guessed from the surface, despite the huge computing resources invested, and latent results are hard to obtain. Because of this tendency, a specific topic is often analyzed against specific data in order to obtain useful results.

For example, Patent Document 1 discloses a technique that refers to a purchase information table to estimate the topics each account is interested in, estimates the probability that a product is purchased for each topic, raises that probability when the product is a new product, and thereby recommends new products.

Patent Document 2 discloses a technique that uses a topic-model-based analysis method to learn the similarity between job posting information and resume information.

Patent Document 1: JP 2012-242940 A
Patent Document 2: JP 2017-134732 A

With the techniques disclosed in Patent Document 1 or Patent Document 2, a specific topic can be analyzed against specific data. However, when big data that consists of many tables and has no predetermined topic is machine-learned, relationships among data within the same table are strong, and relationships that span tables tend to be buried.

If relationships within a table are regarded as noise (N) and relationships across tables as signal (S), simply increasing the number of records in the tables to be learned does not improve this S/N ratio. Neither Patent Document 1 nor Patent Document 2 discloses a technique for improving the S/N ratio in such learning.

In addition, it is difficult to get an overview of data relationships inside a data lake, which stores big data in which large volumes of widely varied data are mixed. In particular, it is important to help the data analyst discover unknown patterns in the relationships among raw data, in other words, to extract the data relationship information that prompts such a discovery.

An object of the present invention is to provide a data analysis support system that reveals data relationships between tables.

A representative data analysis support system according to the present invention is a data analysis support system configured by a computer. The computer includes a memory in which a program is stored and a CPU that executes the program stored in the memory. The CPU inputs a plurality of tables; extracts keywords from each of the input tables; performs machine learning on the extracted keywords; extracts, for each of the tables, feature components each including pairs of a keyword and an appearance probability as the analysis result of the machine learning; identifies combinations of feature components across tables based on matches, between tables, of the keywords included in the extracted feature components; synthesizes the appearance probabilities of the feature components included in each identified combination to aggregate the analysis results; and outputs the aggregated analysis results.

According to the present invention, it is possible to provide a data analysis support system that reveals data relationships between tables.

FIG. 1 is a diagram showing an example of a data analysis support system.
FIG. 2 is a diagram showing an example of a computer.
FIG. 3 is a diagram showing an example of data sorting using machine learning.
FIG. 4 is a diagram showing an example of a conversion relationship definition table.
FIG. 5 is a diagram showing an example of a quantity-quality conversion table.
FIG. 6 is a diagram showing an example of the relationship between data values and processing.
FIG. 7 is a diagram showing an example of combination optimization.
FIG. 8 is a diagram showing an example of the processing flow of the data analysis support system.
FIG. 9 is a diagram showing an example of a user interaction screen.
FIG. 10 is a diagram showing an example of the processing flow of machine learning.

Among the techniques disclosed in this specification, the configuration and operation according to one aspect are as follows. An embodiment of the present invention is described below as a working example with reference to the drawings.

FIG. 1 is a diagram showing an example of a data analysis support system. The data analysis support system 101 is configured by a computer and is connected via a network 120 to a user terminal 102, a data lake 103, a business operation system 104, and a data warehouse 105.

The user terminal 102, the data lake 103, the business operation system 104, and the data warehouse 105 may also be computers. Through the user terminal 102, the user accesses these computers, including the data analysis support system 101, to enter data and to have data displayed.

In FIG. 1, the solid-line network 120 represents the physical connections between computers, and the broken lines represent data flows (or control flows). The broken lines show only representative data flows; data flows other than those shown in FIG. 1 may exist.

The data lake 103 is a facility that stores raw data before analysis. Operation management logs, maintenance records, and the like are stored from the business operation system 104, via the network 120, in the transaction DB 106, which is a database in the data lake 103. As a result, a large volume of records that arise in time series and conform to the schema structure accumulates in the transaction DB 106.

The transaction DB 106 may also contain heterogeneous tables that are not time series. As a database other than the transaction DB 106, the data lake 103 has a master DB 107 in which data encoding methods and schema structures are defined in advance. The information in the master DB 107 may be stored in advance by the user.

The memory of each computer, including the business operation system 104 and the data lake 103, stores a communication program module that defines the data transmission and reception procedure in advance. The computers communicate when the CPU (Central Processing Unit) of each computer executes this communication program.

The user selects data in the data lake 103, has the selected data converted as necessary, and then has it transferred to and stored in the data warehouse 105. Alternatively, a selection-conversion-analysis program module that defines the selection, conversion, and analysis procedures in advance may be stored beforehand in the memory of the data warehouse 105.

The user analyzes data interactively using the data warehouse 105. For this analysis the user selects data from among the data in the data lake 103, and the data analysis support system 101 provides the user with information about which data should be selected.

Put another way, the user uses the data analysis support system 101 to receive recommendations of data to select. The data warehouse 105 has been described above in order to explain the role of the data analysis support system 101, but the data warehouse 105 is not limited to what has been described.

In the data analysis support system 101, a program module in which the analysis support procedure is described in advance is stored in memory, together with parameters such as conditions specified interactively by the user and predetermined initial values. The analysis support processing is realized when the CPU executes the program, stored in the memory, that describes the analysis support procedure.

The program module describing the analysis support procedure consists of several submodules: an overall control module 108, a learning control module 109, an aggregation control module 110, and a dialogue control module 111. Each submodule need not be the program itself; it may instead be the CPU executing the corresponding program.

Databases are built on the disk (storage device) or in the memory of the data analysis support system 101. They include a management DB 112, a quantity-quality conversion DB 113, a learning DB 114, and an aggregation DB 115. The management DB 112 stores, created in advance, the conversion relationship definition table (FIG. 4) and information on how to access the various data in the data lake 103.

The quantity-quality conversion DB 113 stores quantity-quality conversion tables (FIG. 5) for time, place, and so on, created in advance. The learning DB 114 stores intermediate processing results as learning tables (FIG. 6). The aggregation DB 115 stores final processing results as an aggregation table (FIG. 6).

The data analysis support system 101 starts processing in response to a request from the user, outputs the result to the user, and ends the processing. First, the overall control module 108 puts the dialogue control module 111 into a state of waiting for a request from the user. The hand-offs in the series of processes described below may be performed under the overall control of the overall control module 108.

When the dialogue control module 111 determines that data analysis support has been requested from the user terminal 102 by a user operation, it hands processing over to the overall control module 108. The overall control module 108 reads the access methods for the various data from the management information defined in the management DB 112 and hands processing over to the learning control module 109.

The learning control module 109 reads the data of the transaction DB 106 (FIG. 6) using the quantity-quality conversion DB 113 and stores a plurality of learning results in the learning DB 114. The aggregation control module 110 reads the plurality of learning results from the learning DB 114 and stores the aggregated data obtained from them in the aggregation table in the aggregation DB 115. The dialogue control module 111 reads the aggregation result from the aggregation DB 115 and outputs it to the user terminal 102.

FIG. 2 is a diagram showing an example of a computer. The computer 201 is, for example, the data analysis support system 101, and may also be any of the other facilities shown in FIG. 1. The computer 201 includes a CPU 202 with calculation registers, a memory 203, a disk 205, an input/output unit 206, a timer 204, and a sensor 207.

The disk 205 may be a magnetic storage device or a non-volatile semiconductor storage device. The input/output unit 206 includes a display, a keyboard, a mouse, and a circuit for communicating with the network 120. The display, keyboard, and mouse of the input/output unit 206 may be used in place of the user terminal 102.

FIG. 3 is a diagram showing an example of data sorting using machine learning. The data lake 103 contains multiple databases and multiple tables, some of which are heterogeneous to one another. Here, two tables that share a common field (such as time) are homogeneous to each other. The data of two homogeneous tables can be converted into the data of a single table by a table join.

In contrast, two tables that share no common field are heterogeneous to each other, and two heterogeneous tables cannot be joined. For example, suppose there are four fields A, B, C, and D and three tables P, Q, and R with the data structures P(A, B), Q(C, D), and R(A, C). Then table P and table Q are heterogeneous, table P and table R are homogeneous, and table Q and table R are homogeneous. The three tables P, Q, and R therefore include heterogeneous tables.
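The homogeneous/heterogeneous distinction above can be sketched as a simple field-overlap test. The following is a minimal illustration, not the patent's implementation; the names `SCHEMAS` and `classify_pairs` are hypothetical, and only the P/Q/R schemas come from the text.

```python
from itertools import combinations

# Schemas from the P/Q/R example in the text: table name -> set of field names.
SCHEMAS = {"P": {"A", "B"}, "Q": {"C", "D"}, "R": {"A", "C"}}


def classify_pairs(schemas):
    """Label every pair of tables by whether they share at least one field."""
    kinds = {}
    for left, right in combinations(sorted(schemas), 2):
        common = schemas[left] & schemas[right]
        kinds[(left, right)] = "homogeneous" if common else "heterogeneous"
    return kinds


pair_kinds = classify_pairs(SCHEMAS)
for pair, kind in pair_kinds.items():
    print(pair, kind)
```

Only the pair (P, Q) has an empty intersection, which matches the statement that P and Q are heterogeneous while each is homogeneous with R.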

If only tables P and Q are input, the heterogeneous tables cannot be joined, so the relationship between the data values in field B and those in field D cannot be analyzed. If tables P, Q, and R are input, on the other hand, the three tables can be combined into one by first joining P and R and then joining Q. The joined table can then be used to analyze the relationship between the data values in fields B and D.
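As a rough sketch of the P, R, then Q join order described above, the following pure-Python example joins lists of dictionary rows on whatever field names they share. The function `natural_join` and all row values are made up for illustration; a real system would more likely delegate joins to the database.

```python
def natural_join(left_rows, right_rows):
    """Join two lists of dict rows on every field name they have in common."""
    if not left_rows or not right_rows:
        return []
    keys = set(left_rows[0]) & set(right_rows[0])
    if not keys:
        # No shared field: the tables are heterogeneous and cannot be joined.
        raise ValueError("heterogeneous tables: no common field to join on")
    joined = []
    for left in left_rows:
        for right in right_rows:
            if all(left[k] == right[k] for k in keys):
                joined.append({**left, **right})
    return joined


# Illustrative rows only; the field names follow P(A, B), Q(C, D), R(A, C).
P = [{"A": 1, "B": "b1"}, {"A": 2, "B": "b2"}]
Q = [{"C": 10, "D": "d1"}, {"C": 20, "D": "d2"}]
R = [{"A": 1, "C": 10}, {"A": 2, "C": 20}]

pr = natural_join(P, R)    # joined on field A
pqr = natural_join(pr, Q)  # joined on field C
b_d_pairs = [(row["B"], row["D"]) for row in pqr]
print(b_d_pairs)
```

Calling `natural_join(P, Q)` directly raises `ValueError`, which mirrors the point that B and D can only be related once R bridges the two tables.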

However, because the tables in a data lake are so varied, the number of joinable combinations is enormous, and in conventional work inside a data lake, merely finding a join path required a great deal of effort. Moreover, because the analysis results can differ depending on the combination, finding an appropriate way to join tables has depended on the analyst's trial and error and intuition.

With analysis based on machine learning, by contrast, even heterogeneous tables can be analyzed together. Even when the data of multiple heterogeneous tables is input, the data can be sorted according to some model, and information on the relationships among the data can be output. A wide variety of methods have been developed and published for such machine learning models.

However, whichever model is used, when the data of heterogeneous tables is machine-learned directly, data within the same table tends to be evaluated as strongly related simply because it is in the same table, and its relationships to data in other, heterogeneous tables tend to be evaluated as weak.

FIG. 3 illustrates a case in which the data of four tables is learned using a machine learning method that analyzes data and classifies it into three classes along three analysis axes X, Y, and Z. When the data is machine-learned directly, it tends to end up sorted by table: the data of the first table is concentrated on the X axis, the data of the second table on the Y axis, and so on.

If the learned sorting result is biased by table in this way, then even after going to the trouble of analyzing many (multiple, heterogeneous) tables, what is learned carries little meaning, and the benefit of machine learning is not obtained.

In the data analysis of this embodiment, as a first step, the machine learning that performs three-class classification is applied table by table, so that each of the four tables is machine-learned independently. This yields four sets of bases, or analysis axes: (X1, Y1, Z1), ..., (X4, Y4, Z4).

As a second step, the relationships among the analysis axes are optimally aggregated to obtain three new aggregation axes (Group1, Group2, Group3). Because the aggregation axes focus on relationships that span tables, including heterogeneous ones, the data relationships between tables are revealed and can be extracted.
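The two-step idea can be sketched as follows, under loudly stated assumptions. The per-table topics below stand in for the output of the first step, with made-up keywords and probabilities. For the second step, the sketch pairs topics from different tables that share at least one keyword and merges their probabilities by simple averaging. The averaging rule, the pairwise grouping, and the name `combine_across_tables` are illustrative assumptions; the optimal aggregation the text refers to is not specified in this section.

```python
from itertools import combinations

# Assumed step-1 output: for each table, a list of topics (keyword -> probability).
topics_by_table = {
    "Table-A": [{"daytime": 0.6, "caution": 0.4}, {"night": 0.7, "normal": 0.3}],
    "Table-B": [{"daytime": 0.5, "rain": 0.5}, {"normal": 0.8, "clear": 0.2}],
}


def combine_across_tables(topics_by_table):
    """Pair topics from different tables that share a keyword and merge them."""
    groups = []
    for (t1, topics1), (t2, topics2) in combinations(topics_by_table.items(), 2):
        for i, a in enumerate(topics1):
            for j, b in enumerate(topics2):
                shared = sorted(set(a) & set(b))
                if not shared:
                    continue  # no keyword match between the tables: skip this pair
                # Assumed synthesis rule: average the two probability lists.
                merged = {
                    word: (a.get(word, 0.0) + b.get(word, 0.0)) / 2
                    for word in sorted(set(a) | set(b))
                }
                groups.append(
                    {"members": [(t1, i), (t2, j)], "shared": shared, "keywords": merged}
                )
    return groups


groups = combine_across_tables(topics_by_table)
for group in groups:
    print(group["members"], group["shared"], group["keywords"])
```

Because each group is built only from topics that come from different tables, every group expresses a cross-table relationship by construction, which is the property the aggregation axes are meant to have.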

Data relationships between heterogeneous tables are hard to notice for a data analyst faced with the huge volume of data those tables contain. They are information obtained by using the data analysis support system 101 of this embodiment, and the ability to obtain such information is a useful effect of this embodiment.

This embodiment uses a topic model as one example of machine learning. In the following, an analysis axis is called a topic. A topic is also a feature component. A topic model is mainly used for sorting documents, and one topic is formulated as a list of pairs of a word and its appearance probability. The result of machine learning on many documents is output as a plurality of topics. In this respect, it may be the same as general machine learning.

Put simply, a topic model is machine learning that takes a group of documents as input and outputs a group of topics. When documents are input to a topic model, each document is morphologically analyzed to extract its keywords, and the list of keywords is used as an input vector. By reading many input vectors, the topic model formulates appropriate topics (pairs of a keyword and its appearance probability) that sort the group of input vectors.

Each topic represents a latent meaning that is representative of some subset of the documents, but that meaning is not expressed by a single word. For example, when a group of documents is sorted into two topics, the topics are output in a form such as the following: in the first topic, word A appears with probability 60%, word B with 0%, and word C with 40%; in the second topic, word A appears with probability 0%, word B with 70%, and word C with 30%.

Which topic an individual document belongs to can be determined by using, as an index, the inner product or the distance between the word vector of the input document and the word vector that constitutes each topic. In other words, if the topics obtained by machine learning are taken as analysis axes, a topic model with those axes can be used to classify and sort groups of unseen documents.
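The inner-product test can be illustrated with the two example topics given above (words A, B, and C). This is a minimal sketch: the relative-frequency word vector and the function names are assumptions made for the example, and a distance measure could be used in place of the inner product.

```python
from collections import Counter

# The two example topics from the text: word -> appearance probability.
TOPICS = {
    "topic1": {"A": 0.6, "B": 0.0, "C": 0.4},
    "topic2": {"A": 0.0, "B": 0.7, "C": 0.3},
}


def word_vector(words):
    """Relative frequency of each word in the document (assumed representation)."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def inner_product(doc_vec, topic):
    """Inner product of a document vector and a topic's word-probability vector."""
    return sum(weight * topic.get(word, 0.0) for word, weight in doc_vec.items())


def assign_topic(words, topics):
    """Return the name of the topic with the largest inner product."""
    vec = word_vector(words)
    return max(topics, key=lambda name: inner_product(vec, topics[name]))


print(assign_topic(["A", "A", "C"], TOPICS))
print(assign_topic(["B", "C", "B"], TOPICS))
```

A document dominated by word A scores highest against the first topic, and one dominated by word B against the second; word C alone does not decide the outcome because both topics give it a similar probability.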

When a group of database records is input to a topic model, on the other hand, the morphological analysis described above is unnecessary: each record corresponds to one document, and each field value corresponds to a word. The machine learning process is described in detail later with reference to FIG. 10. Record groups, however, have problems specific to transaction data.
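The record-to-document correspondence can be sketched in a few lines. The field names and values below are made up for illustration, and `records_to_documents` is a hypothetical helper; the only point taken from the text is that one record becomes one document and each field value becomes one word.

```python
def records_to_documents(records):
    """Turn each record into a 'document': a list of its field values as words."""
    documents = []
    for record in records:
        # Skip empty fields; every remaining field value is treated as one word.
        documents.append([str(value) for value in record.values() if value is not None])
    return documents


# Illustrative records only.
records = [
    {"time_of_day": "daytime", "alarm": "caution", "note": None},
    {"time_of_day": "night", "alarm": "normal", "note": "inspection"},
]

documents = records_to_documents(records)
print(documents)
```

No tokenizer is involved, since the field boundaries already delimit the words.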

First, the information in database records is usually encoded, for example to save disk space. If the encoded field values are used as input vectors as they are, code values such as "0" and "1" appear extremely often in the input vectors, and the learning machine cannot interpret them properly.

In addition, machine learning becomes inefficient and requires a huge amount of sample data. If the code words (code values) are decoded before machine learning, learning efficiency improves, and reasonable learning becomes possible even with a small number of samples.

Second, quantitative variables such as temperature and time vary widely as word strings, so the learning machine cannot interpret them properly. Even if quantitative variables were machine-learned directly, doing so would require an extremely large number of samples and a great deal of processing time, which is not realistic.

In addition, the generated topics become hard for a data analyst to interpret. For example, if the word "3" becomes a keyword of a topic, what that word means, and in which table, is ambiguous. Therefore, quantitative variables are converted into qualitative variables, and the converted qualitative variables are machine-learned.

For example, the weather code "3" is converted to "clear". This conversion makes efficient machine learning possible with a relatively small number of samples, which saves computing resources. It also makes reasonable learning possible in a short period, and the computed results become easier for the data analyst to understand.
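The ambiguity of raw code values can be seen by counting words before and after decoding. In the sketch below, only the mapping of weather code "3" to "clear" comes from the text; the other code meanings, the field names, and the records are hypothetical.

```python
from collections import Counter

# Assumed code tables. Only weather "3" -> "clear" is taken from the text.
WEATHER = {"1": "rain", "3": "clear"}
ALARM = {"1": "caution", "9": "failure"}

records = [
    {"weather": "3", "alarm": "1"},
    {"weather": "1", "alarm": "1"},
    {"weather": "3", "alarm": "9"},
]


def raw_words(record):
    """Use the encoded field values directly as words."""
    return list(record.values())


def decoded_words(record):
    """Decode each code value to its meaning before using it as a word."""
    return [WEATHER[record["weather"]], ALARM[record["alarm"]]]


raw_counts = Counter(word for record in records for word in raw_words(record))
decoded_counts = Counter(word for record in records for word in decoded_words(record))
print(raw_counts)
print(decoded_counts)
```

In the raw counts, the word "1" merges a weather code with an alarm code, so its frequency says nothing about either field. After decoding, "rain" and "caution" are counted separately and remain readable to the analyst.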

FIG. 4 is a diagram showing an example of the conversion relationship definition table. The conversion relationship definition table 401 has the following fields: transaction table name 402, transaction field name 403, master table name 404, master field name 405, quantity-quality conversion table name 406, and quantity-quality conversion field name 407.

For example, in the first record of the conversion relationship definition table 401, the transaction table name 402 is "Table-B", the transaction field name 403 is "date-time", and the quantity-quality conversion table name 406 is "date, time".

This record defines the following correspondence: the quantitative date-time value in the "date-time" field of the transaction table "Table-B" is converted into qualitative values based on the information in the quantity-quality conversion table "date" and the information in the quantity-quality conversion table "time".

In the third record of the conversion relationship definition table 401, the transaction table name 402 is "Table-B", the transaction field name 403 is "alarm", the master table name 404 is "Table-M1", and the master field name 405 is "F". In the "F" field of the master table "Table-M1", code words used as alarm codes, such as "1" and "9", are associated with their meanings (qualities).

This record and the master table together define the following correspondence: the alarm code in the "alarm" field of the transaction table "Table-B" is converted, based on the "F" field of the master table "Table-M1", into its meaning, for example from "1" to "caution".
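A lookup driven by the conversion relationship definition table might look like the sketch below. The dictionary layout, the function names, and the meaning assigned to code "9" are assumptions for illustration; the table names, the field names, and the "1" to "caution" mapping follow the text. Only the two records described above are modeled.

```python
# Simplified stand-in for the conversion relationship definition table 401.
CONVERSION_DEFINITIONS = [
    {"table": "Table-B", "field": "date-time", "master_table": None,
     "master_field": None, "conversion_tables": ["date", "time"]},
    {"table": "Table-B", "field": "alarm", "master_table": "Table-M1",
     "master_field": "F", "conversion_tables": []},
]

# Stand-in for the master DB: master table -> field -> code word -> meaning.
# The meaning of "9" is hypothetical.
MASTER_TABLES = {"Table-M1": {"F": {"1": "caution", "9": "failure"}}}


def find_definition(table, field):
    """Return the definition record for a transaction table and field, if any."""
    for definition in CONVERSION_DEFINITIONS:
        if definition["table"] == table and definition["field"] == field:
            return definition
    return None


def decode(table, field, value):
    """Decode a code word via the master table; return the value unchanged otherwise."""
    definition = find_definition(table, field)
    if definition is None or definition["master_table"] is None:
        return value
    codes = MASTER_TABLES[definition["master_table"]][definition["master_field"]]
    return codes.get(value, value)


print(decode("Table-B", "alarm", "1"))
```

Keeping the mapping in a definition table rather than in code means new transaction fields can be covered by adding records, without changing the decoding logic.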

Using such a conversion relationship definition table 401, the learning control module 109 can decode code words and perform quantity-quality conversion before machine learning. The example in FIG. 4 shows a case in which the master table and the quantity-quality conversion table are mutually exclusive, but both a master table and a quantity-quality conversion table may be defined for the same field.

When both a master table and a quantity-quality conversion table are defined in the conversion relationship definition table 401, either table may be selected according to preset information. Alternatively, the code words or quantitative variables defined in the master table and in the quantity-quality conversion table may be mutually exclusive, in which case the table is selected according to the code word or quantitative variable to be converted.

FIG. 5 is a diagram showing an example of a quantity-quality conversion table. A quantity-quality conversion table defines the correspondence used to convert a quantitative variable into a qualitative variable. Quantitative variables include date, time, position, distance, price, temperature, and age; they are expressed as numbers and are implicitly continuous.

For example, November 30, 2017 is a Thursday. "November 30, 2017" is regarded as quantitative, whereas expressions such as "autumn" or "Thursday" are regarded as qualitative. Converting a qualitative variable into a quantitative variable requires special constraints, but the reverse conversion, from quantitative to qualitative, is possible.

このような量質変換テーブルの中で、量質変換テーブル５０１は時刻に関するテーブルであって、１つの量的変数から複数の質的変数への変換を可能にするテーブルである。量質変換テーブル５０１は、下限値５０２、上限値５０３、質的表現（１）５０４、質的表現（２）５０５、逆引きテーブル名５０６、およびレコード番号５０７の各フィールドを備える。 Among such quantitative-to-qualitative conversion tables, the table 501 relates to time and enables conversion from one quantitative variable into a plurality of qualitative variables. The quantitative-to-qualitative conversion table 501 includes the fields lower limit value 502, upper limit value 503, qualitative expression (1) 504, qualitative expression (2) 505, reverse lookup table name 506, and record number 507.

例えば、量質変換テーブル５０１の１番目のレコードは、下限値５０２が「１４：００：００」、上限値５０３が「１４：５９：５９」、質的表現（１）５０４が「昼間」、質的表現（２）５０５が「２時台」である。 For example, in the first record of the quantitative-to-qualitative conversion table 501, the lower limit value 502 is “14:00:00”, the upper limit value 503 is “14:59:59”, the qualitative expression (1) 504 is “daytime”, and the qualitative expression (2) 505 is “2 o'clock”.

このレコードにより、量的表現としての時刻が、「１４：００：００」と「１４：５９：５９」との間に入るのであれば、「昼間」と「２時台」という質的表現に変換するという対応関係が定義されている。また、図４に示した変換関係定義テーブル４０１を逆引きできるように、逆引きテーブル名５０６とレコード番号５０７の各フィールドを備える。 This record defines the correspondence that a time, as a quantitative expression, falling between “14:00:00” and “14:59:59” is converted into the qualitative expressions “daytime” and “2 o'clock”. The table also includes the fields reverse lookup table name 506 and record number 507 so that the conversion relationship definition table 401 shown in FIG. 4 can be looked up in reverse.

そして、図５に示した例で、「１４：０６」という特定の時刻は、１番目のレコードと２番目のレコードの「昼間」、「２時台」、および「午後」と３つの質的表現が対応付けられている。 In the example shown in FIG. 5, the specific time “14:06” is associated, through the first and second records, with three qualitative expressions: “daytime”, “2 o'clock”, and “afternoon”.
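The range lookup described for the table 501 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `CONVERSION_TABLE` and `to_qualitative` are hypothetical, the first row follows the values given in the text, and the bounds of the second row are assumed because the text does not state them.

```python
from datetime import time

# Hypothetical rows modeled on conversion table 501:
# (lower limit, upper limit, qualitative expressions).
# Row 1 follows the text; the bounds of row 2 are an assumption.
CONVERSION_TABLE = [
    (time(14, 0, 0), time(14, 59, 59), ("daytime", "2 o'clock")),
    (time(12, 0, 0), time(23, 59, 59), ("afternoon",)),
]


def to_qualitative(t, table=CONVERSION_TABLE):
    """Collect the qualitative expressions of every row whose range contains t."""
    labels = []
    for lower, upper, expressions in table:
        if lower <= t <= upper:
            for expression in expressions:
                if expression not in labels:
                    labels.append(expression)
    return labels


labels_1406 = to_qualitative(time(14, 6))
```

Because every matching row contributes its expressions, one quantitative value can yield several qualitative keywords, which is the one-to-many behavior the text attributes to the table 501.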

量質変換テーブル５０１を用いた変換処理によって、データのおおまかな特徴を機械が学習しやすくなり、少ないサンプルでの機械学習を効率化し、同時に計算資源消費を抑制し、高速な処理を実現できる。 Conversion using the quantitative-to-qualitative conversion table 501 makes it easier for the machine to learn the rough characteristics of the data, makes machine learning from a small number of samples more efficient, and at the same time reduces the consumption of computing resources, enabling high-speed processing.

文書での単語の仕分けと異なり、トランザクションレコードのデータ値には多様な量的変数が含まれ、かつ量的変数のデータこそがレコードの本質的な情報である。トピックモデルを用いた機械学習において、レコードに含まれるトピックを推定するとき、「１４：０６」という単語の出現頻度で学習するよりも、「午後」あるいは「２時台」などの広い解釈を含めた単語の出現頻度で学習する方が、データ分析支援に寄与しやすい。 Unlike the sorting of words in documents, the data values of transaction records contain a wide variety of quantitative variables, and it is the quantitative data that carries the essential information of a record. In machine learning using a topic model, when the topics contained in records are estimated, learning from the appearance frequency of words that carry a broader interpretation, such as “afternoon” or “2 o'clock”, contributes more readily to data analysis support than learning from the appearance frequency of the word “14:06”.

図６は、データと処理の関係の例を示す図である。データレイク１０３内のトランザクションＤＢ１０６のトランザクションテーブルのデータが、データ分析支援システム１０１へ入力される。図６に示した例で、トランザクションＤＢ１０６には、Ｔａｂｌｅ−Ａ６０１とＴａｂｌｅ−Ｂ６０２の２つのトランザクションテーブルがある。 FIG. 6 is a diagram illustrating an example of the relationship between data and processing. Data in the transaction table of the transaction DB 106 in the data lake 103 is input to the data analysis support system 101. In the example illustrated in FIG. 6, the transaction DB 106 includes two transaction tables, Table-A 601 and Table-B 602.

Ｔａｂｌｅ−Ａ６０１は、日付時刻６０３、担当者６０４、場所６０５、および作業６０６の各フィールドを備え、作業者の作業記録が格納される。Ｔａｂｌｅ−Ｂ６０２は、日時６０７、警報６０８、および位置６０９の各フィールドを備え、ある設備のセンサ出力記録が格納される。 The Table-A 601 includes fields of a date time 603, a person in charge 604, a place 605, and a work 606, and stores a work record of the worker. Table-B 602 includes fields of date / time 607, alarm 608, and position 609, and stores a sensor output record of a certain facility.

このように図６の例は、これらの情報源を元にした異種テーブルであるＴａｂｌｅ−Ａ６０１とＴａｂｌｅ−Ｂ６０２が、日時と場所の関係を介して分析支援を行う場合の例である。 Thus, FIG. 6 is an example in which analysis support is performed for Table-A 601 and Table-B 602, which are heterogeneous tables based on these information sources, through the relationship between date/time and place.

学習テーブル６１０と学習テーブル６１１は、トピックモデルに基づく学習結果が格納される学習ＤＢ１１４のテーブルである。Ｔａｂｌｅ−Ａ６０１に格納されたデータが、復号され、量的変数から質的変数に変換されて、変換された質的変数の機械学習による学習結果が学習テーブル６１０に格納される。 The learning table 610 and the learning table 611 are tables of the learning DB 114 in which learning results based on the topic model are stored. Data stored in Table-A 601 is decoded, converted from a quantitative variable to a qualitative variable, and a learning result by machine learning of the converted qualitative variable is stored in a learning table 610.

また、Ｔａｂｌｅ−Ｂ６０２に格納されたデータが、復号され、量的変数から質的変数に変換されて、変換された質的変数の機械学習による学習結果が学習テーブル６１１に格納される。 In addition, the data stored in Table-B 602 is decoded, converted from a quantitative variable to a qualitative variable, and a learning result by machine learning of the converted qualitative variable is stored in the learning table 611.

そして、学習テーブル６１０は、トピック番号６１２、キーワード６１３、および出現確率６１４の各フィールドを備え、オブジェクト指向の書式で示されている。学習テーブル６１１も同じ各フィールドを備える。 The learning table 610 includes the fields topic number 612, keyword 613, and appearance probability 614, and is shown in an object-oriented format. The learning table 611 includes the same fields.

トピック番号６１２の採番では、例えば、１つ目のテーブル（Ｔａｂｌｅ−Ａ６０１）に対する１つ目のトピックを「＃１−１」と番号付ける。トピックは１０個に分けるものとする。ここで、１０個はあらかじめ設定された個数であってもよいし、計算された個数であってもよい。 In assigning the topic number 612, for example, the first topic of the first table (Table-A 601) is numbered “#1-1”. The topics are assumed to be divided into ten. The number ten may be preset or may be calculated.

このようにして、Ｔａｂｌｅ−Ａ６０１とＴａｂｌｅ−Ｂ６０２が、それぞれ１０トピックに仕分けられた場合のトピック群が学習テーブル６１０と学習テーブル６１１にそれぞれ示されている。 In this way, topic groups when Table-A 601 and Table-B 602 are sorted into 10 topics are shown in the learning table 610 and the learning table 611, respectively.

Ｔａｂｌｅ−Ａ６０１が分析されて得られたトピックは１０個である。例えば、トピック番号６１２が「＃１−１」のトピックは、キーワード６１３で示されるように「火曜」、「２時台」、「Ｈａｎｋａ」、および「大阪」などのキーワードによって潜在的な意味を示すトピックである。 Analyzing Table-A 601 yields ten topics. For example, the topic whose topic number 612 is “#1-1” is a topic whose latent meaning is indicated by keywords such as “Tuesday”, “2 o'clock”, “Hanka”, and “Osaka”, as shown in the keyword 613.

また、キーワード６１３の各キーワードの出現確率は、出現確率６１４で示されるように「０．３１」、「０．２」、「０．０４」、および「０．０３」などであり、トピック番号６１２が「＃１−１」のトピックでの出現確率の総和は１．０である。 The appearance probabilities of the keywords 613 are “0.31”, “0.2”, “0.04”, “0.03”, and so on, as indicated by the appearance probability 614, and the appearance probabilities within the topic whose topic number 612 is “#1-1” sum to 1.0.

すなわち、あるレコードが存在した場合、「火曜」が「０．３１」の出現確率で含まれ、「２時台」が「０．２」の出現確率で含まれるときに、その存在したレコードは、トピック番号６１２が「＃１−１」のトピックに含まれると判定されることが原理的には可能である。 That is, in principle, a given record could be judged to belong to the topic whose topic number 612 is “#1-1” when it contains “Tuesday” with an appearance probability of “0.31” and “2 o'clock” with an appearance probability of “0.2”.

しかし、実際のレコードでは「火曜」のデータが含まれるか含まれないか２値的であり、中途半端な出現確率で存在することはない。このように、出現確率は、あくまでもトピックの潜在的な意味を示すものである。 However, an actual record either contains the data “Tuesday” or does not; this is binary, and the data never exists with a fractional appearance probability. The appearance probability therefore indicates only the latent meaning of the topic.

同様に、Ｔａｂｌｅ−Ｂ６０２が分析されて得られたトピックは１０個ある。例えば、トピック番号が「＃２−２」のトピックは、「火曜」、「警報（３）」、および「大阪」などのキーワードによって潜在的な意味を示すトピックである。また、各キーワードの出現確率は「０．５１」、「０．０７」、「０．０１」などである。 Similarly, analyzing Table-B 602 yields ten topics. For example, the topic whose topic number is “#2-2” is a topic whose latent meaning is indicated by keywords such as “Tuesday”, “alarm (3)”, and “Osaka”. The appearance probabilities of these keywords are “0.51”, “0.07”, “0.01”, and so on.

集約テーブル６１５は、後述する組合せ最適化および基底集約の処理を経て、トピック同士の最適な組合せを集約した結果が格納される集約ＤＢ１１５のテーブルである。集約テーブル６１５は、グループ番号６１６、トピック番号６１７、キーワード６１８、および出現頻度６１９を各フィールドに備え、それらはオブジェクト指向の可変長である。 The aggregation table 615 is a table of the aggregation DB 115 that stores the result of aggregating the optimal combination of topics through combination optimization and base aggregation processing described later. The aggregation table 615 includes a group number 616, a topic number 617, a keyword 618, and an appearance frequency 619 in each field, and these are object-oriented variable lengths.

トピックグループは、学習テーブルをまたいだデータの関係性を示す情報である。トピックグループは、ある学習テーブルを構成する複数のトピックの中から１つのトピックが選択され、次の学習テーブルを構成する複数のトピックの中から１つのトピックが選択されて、学習ＤＢ１１４内の学習テーブルから順次に１つずつのトピックが選択されて組み合わせられたものである。 A topic group is information indicating the relationship of data across learning tables. In the topic group, one topic is selected from a plurality of topics constituting a certain learning table, and one topic is selected from a plurality of topics constituting the next learning table, and the learning table in the learning DB 114 is selected. Each topic is selected and combined sequentially.

トピックグループは、グループ番号６１６により識別される。グループ番号６１６が「＃１」のトピックグループは、「Ｔｏｐｉｃ＃１−１」と「Ｔｏｐｉｃ２−２」を含む。これは、「火曜」などのキーワードが一致しており、最適な組合せと判定された組合せである。 A topic group is identified by the group number 616. The topic group whose group number 616 is “#1” includes “Topic#1-1” and “Topic2-2”. Keywords such as “Tuesday” match between these topics, and the pair was therefore judged to be the optimal combination.

グループ番号６１６が「＃１」のトピックグループの中で、「Ｈａｎｋａ」と「警報（３）」は元々Ｔａｂｌｅ−Ａ６０１とＴａｂｌｅ−Ｂ６０２すなわち異種テーブルに含まれる情報であり、テーブルの内容を既存のテーブルブラウザ等を用いて概観しているだけでは、両者の関係にデータ分析作業者は気付きにくい。 Within the topic group whose group number 616 is “#1”, “Hanka” and “alarm (3)” are items of information that originally belong to Table-A 601 and Table-B 602, that is, to heterogeneous tables. A data analysis worker who merely browses the table contents with an existing table browser or the like is unlikely to notice the relationship between the two.

これらに加えて、場所と天気の関係を格納した異種テーブルがあったとき、複数のテーブル間の関係はさらに気付きにくい。例えば、この設備の故障を知らせる「警報（３）」は、雨天時に「Ｈａｎｋａ」の部品交換作業が不適切だったために断続的に起きていたものかもしれない。このような因果関係の仮説をデータ分析作業者が思いつくデータ発見の起点となりえる。 If, in addition, there is a heterogeneous table storing the relationship between place and weather, the relationships among the tables are even harder to notice. For example, the “alarm (3)” that signals a failure of this equipment may have occurred intermittently because the parts replacement work performed by “Hanka” in rainy weather was inappropriate. The topic group can serve as a starting point for data discovery, prompting the data analysis worker to conceive such a causal hypothesis.

図７は、組合せ最適化の例を示す図である。集約制御モジュール１１０は、図６に示した学習テーブル６１０と学習テーブル６１１を入力し、集約テーブル６１５を出力する。このために、集約制御モジュール１１０は、以下で説明する処理を実行する。 FIG. 7 is a diagram illustrating an example of combination optimization. The aggregation control module 110 inputs the learning table 610 and the learning table 611 illustrated in FIG. 6 and outputs the aggregation table 615. For this purpose, the aggregation control module 110 executes processing described below.

説明のために、分析対象となる学習テーブル数をＰとする。トピックモデルに基づく機械学習を行うと、１つの学習テーブルは複数のトピック（分析軸）で仕分けられる。ここでの１つの学習テーブルを構成するトピック数をＱとする。 For the sake of explanation, let P be the number of learning tables to be analyzed. When machine learning based on a topic model is performed, one learning table is sorted by a plurality of topics (analysis axes). Let Q be the number of topics that make up one learning table.

１つのトピックは、複数のキーワードと、対応する出現確率のデータを含む。ここでのトピックを構成するキーワード数をＲとする。ここでは説明の簡単化のために、どの学習テーブルも同じトピック数のトピックを含み、どのトピックも同じキーワード数のキーワードを持つとするが、同じでなくともよい。 One topic includes a plurality of keywords and corresponding appearance probability data. The number of keywords constituting the topic here is R. Here, for simplification of explanation, it is assumed that all learning tables include topics having the same number of topics and all topics have keywords having the same number of keywords, but they need not be the same.

ｐ番目の学習テーブルのｑ番目のトピックを構成するｒ番目のキーワードをｗ（ｐ，ｑ，ｒ）とし、対応するキーワードの出現確率をφ（ｐ，ｑ，ｒ）とする。キーワードｗは出現確率の高いもの順にソートされたものである。ｗ（ｐ，ｑ，１）は、ｐ番目のテーブルのｑ番目のトピック内で最頻のキーワードである。 Assume that the r-th keyword constituting the q-th topic of the p-th learning table is w (p, q, r), and the appearance probability of the corresponding keyword is φ (p, q, r). The keywords w are sorted in descending order of appearance probability. w (p, q, 1) is the most frequent keyword in the qth topic of the pth table.

あるトピックを構成するキーワードの出現確率の総和は１である。以下で単に出現確率と呼ぶ場合もキーワードの出現確率を示す。出現確率は最大値１最小値０の実数形式である。キーワードは可変長文字列形式である。 The sum of the appearance probabilities of keywords constituting a certain topic is 1. In the following, the term “probability of appearance” will also indicate the probability of appearance of a keyword. The appearance probability is a real number format having a maximum value 1 and a minimum value 0. Keywords are in variable-length character string format.

例えば、図６に示した例において、ｐ＝１の学習テーブル６１０を分析し、ｑ＝１のトピックとしてトピック番号６１２の「＃１−１」のトピックが得られた時、出現確率６１４が最大「０．３１」となるキーワード６１３は「火曜」である。このため、ｗ（ｐ，ｑ，１）＝「火曜」、φ（ｐ，ｑ，１）＝「０．３１」である。 For example, in the example shown in FIG. 6, when the learning table 610 with p = 1 is analyzed and the topic with topic number 612 “#1-1” is obtained as the topic with q = 1, the keyword 613 with the largest appearance probability 614, “0.31”, is “Tuesday”. Therefore, w(p, q, 1) = “Tuesday” and φ(p, q, 1) = “0.31”.

トピックグループは複数の学習テーブルのトピックについてグループ分けが行なわれたものである。ここで、トピックグループの数は学習テーブルのトピック数と同じＱとする。 A topic group is obtained by grouping topics of a plurality of learning tables. Here, the number of topic groups is Q, which is the same as the number of topics in the learning table.

ｋ番目の学習テーブルのｊ番目のトピックがｉ番目のトピックグループに属するか否かを示す隣接行列ｘが定義できる。隣接行列ｘの要素値ｘ（ｉ，ｊ，ｋ）は、値が「１」であれば、対応する要素がトピックグループに属し、値が「０」であれば、対応する要素がトピックグループに属さない。 An adjacency matrix x indicating whether or not the jth topic of the kth learning table belongs to the ith topic group can be defined. If the element value x (i, j, k) of the adjacency matrix x is “1”, the corresponding element belongs to the topic group, and if the value is “0”, the corresponding element is the topic group. Does not belong.

目的関数は、出現確率の高いキーワードが互いに多く含まれたトピック同士が同一トピックグループに属するように定義される。すなわち、図７に示すように、トピックグループを構成する複数トピックにおいて、トピック同士のキーワードが一致する範囲で出現確率の積の総和を計算し、その総和を最大化することを最適化目的とする。 The objective function is defined so that topics sharing many keywords with high appearance probabilities belong to the same topic group. That is, as shown in FIG. 7, for the topics constituting a topic group, the sum of the products of appearance probabilities is calculated over the keywords that match between topics, and the optimization objective is to maximize that sum.

キーワードが一致する範囲の判定のために、文字列照合する別関数を定義することによって、目的関数が隣接行列ｘの１次式として定式化されてもよい。キーワードの一致ではなく、あらかじめ設定された何らかの基準に基づくキーワードの近似であってもよい。キーワードｗが出現確率の高いもの順にソートされていることを利用し、出現確率の低いキーワードｗに関する計算は打ち切られてもよい。 The objective function may be formulated as a linear expression of the adjacency matrix x by defining another function for character string matching in order to determine a range where the keywords match. Instead of matching keywords, it may be keyword approximation based on some preset criteria. Using the fact that the keywords w are sorted in descending order of appearance probability, the calculation related to the keyword w having the low appearance probability may be aborted.

トピックのトピックグループへの割当てが適切になされた隣接行列では、対応関係のある要素値だけが「１」となり、「１」となる要素値と同じ行の他の要素値、および「１」となる要素値と同じ列の他の要素値は「０」となる。すなわち、図７に示すように制約条件として、隣接行列の行方向の値の総和は「１」であり、列方向の値の総和も「１」であり、要素値ｘは２値である。 In an adjacency matrix in which topics are appropriately assigned to topic groups, only the corresponding element value is “1”, the other element values in the same row as the element value “1”, and “1” The other element value in the same column as the element value becomes “0”. That is, as shown in FIG. 7, as a constraint condition, the sum of the values in the row direction of the adjacency matrix is “1”, the sum of the values in the column direction is also “1”, and the element value x is binary.
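FIG. 7 is not reproduced in this text, so the following is only one possible reading of the objective and constraints described above, written with the symbols x, w, and φ defined earlier. The matching function δ (1 when two keywords match, 0 otherwise) and the pairwise score s are notation assumed here for illustration.

```latex
\begin{align*}
\text{maximize}\quad
  & \sum_{i=1}^{Q} \sum_{k<k'} \sum_{j=1}^{Q} \sum_{j'=1}^{Q}
    x(i,j,k)\, x(i,j',k')\, s(k,j;k',j') \\
\text{where}\quad
  & s(k,j;k',j') = \sum_{r=1}^{R} \sum_{r'=1}^{R}
    \delta\bigl(w(k,j,r),\, w(k',j',r')\bigr)\,
    \phi(k,j,r)\, \phi(k',j',r') \\
\text{subject to}\quad
  & \sum_{i=1}^{Q} x(i,j,k) = 1 \quad \forall j,k, \qquad
    \sum_{j=1}^{Q} x(i,j,k) = 1 \quad \forall i,k, \qquad
    x(i,j,k) \in \{0,1\}
\end{align*}
```

In this pairwise form the objective is quadratic in x. When the assignment of one learning table is fixed (for example, topic j of the first table is placed in group j), the objective for a second table becomes linear in the remaining variables, which is consistent with the linear formulation the text mentions.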

以上の定式化で得られる最適化問題は０１整数計画問題である。この定式化によって、一般的な最適化ソルバを用いて高速に最適解の隣接行列ｘを得ることができる。例えば学生のクラス分けに用いられるような別解法の手順が適用されてもよい。 The optimization problem obtained from this formulation is a 0-1 integer programming problem. With this formulation, a general-purpose optimization solver can obtain the optimal adjacency matrix x quickly. Another solution procedure, such as one used for assigning students to classes, may be applied instead.
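For two learning tables, the grouping reduces to a one-to-one pairing of topics, which the sketch below searches exhaustively. Exhaustive search is substituted here for the optimization solver named in the text and is practical only for a small number of topics; the function names and the sample data are hypothetical, and the keyword lists are truncated, so their probabilities do not sum to 1.

```python
from itertools import permutations


def match_score(topic_a, topic_b):
    """Sum of probability products over the keywords shared by two topics."""
    return sum(p * topic_b[w] for w, p in topic_a.items() if w in topic_b)


def best_pairing(table_a, table_b):
    """Exhaustively search one-to-one pairings of topics between two tables.

    Each table is a list of topics with the same length; each topic maps
    keyword -> appearance probability. Returns (assignment, score), where
    assignment[j] is the index of the table_b topic grouped with topic j
    of table_a.
    """
    best, best_score = None, float("-inf")
    for perm in permutations(range(len(table_b))):
        score = sum(match_score(table_a[j], table_b[perm[j]])
                    for j in range(len(table_a)))
        if score > best_score:
            best, best_score = perm, score
    return best, best_score


# Hypothetical, truncated learning tables with two topics each.
table_a = [
    {"Tuesday": 0.31, "2 o'clock": 0.2, "Hanka": 0.04, "Osaka": 0.03},
    {"Monday": 0.4, "Tokyo": 0.1},
]
table_b = [
    {"Monday": 0.3, "alarm (1)": 0.2},
    {"Tuesday": 0.51, "alarm (3)": 0.07, "Osaka": 0.01},
]
assignment, score = best_pairing(table_a, table_b)
```

Each permutation satisfies the row-sum and column-sum constraints by construction, so only the objective needs to be compared. With more tables or more topics, a solver-based 0-1 formulation would be the more realistic choice.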

図８は、データ分析支援システムの処理フローの例を示す図である。図８に示した処理フローは、統括制御モジュール１０８により全体が統括される。まず、統括制御モジュール１０８の制御により、対話制御モジュール１１１がユーザからの入力（要求）待ち状態になる。 FIG. 8 is a diagram illustrating an example of a processing flow of the data analysis support system. The overall processing flow shown in FIG. 8 is controlled by the overall control module 108. First, the dialog control module 111 waits for an input (request) from the user under the control of the overall control module 108.

ステップ８０１において、対話制御モジュール１１１はユーザからの入力を受け付け、受け付けた入力を統括制御モジュール１０８に渡す。統括制御モジュール１０８は入力に基づいて管理ＤＢ１１２を読み込み、初期設定し、トランザクションＤＢ１０６を開く。 In step 801, the dialog control module 111 receives an input from the user, and passes the received input to the overall control module 108. The overall control module 108 reads the management DB 112 based on the input, initializes it, and opens the transaction DB 106.

ステップ８０２において、統括制御モジュール１０８は、管理ＤＢ１１２の変換関係定義テーブル４０１を参照して、符号語を復号するテーブル関係を設定する。符号語を復号するテーブル関係は、学習制御モジュール１０９により設定されてもよい。 In step 802, the overall control module 108 refers to the conversion relationship definition table 401 of the management DB 112 and sets a table relationship for decoding codewords. The table relationship for decoding codewords may be set by the learning control module 109.

ステップ８０３において、学習制御モジュール１０９は、量質変換ＤＢ１１３の量質変換テーブル５０１を参照して、量的変数を質的変数に変換するテーブル関係を設定する。これらの符号語を復号するテーブル関係の設定と、量的変数を質的変数に変換するテーブル関係を設定により、以下の処理にて復号と変換が可能になる。 In step 803, the learning control module 109 refers to the quantitative-to-qualitative conversion table 501 of the quantitative-to-qualitative conversion DB 113 and sets the table relationship for converting quantitative variables into qualitative variables. Setting the table relationship for decoding codewords and the table relationship for converting quantitative variables into qualitative variables makes decoding and conversion possible in the subsequent processing.

ステップ８０４において、学習制御モジュール１０９は、開かれたトランザクションＤＢ１０６の中で処理対象となるトランザクションテーブルそれぞれについてステップ８０５、８０６、８０９の処理をそれぞれ実行するように繰り返す。 In step 804, the learning control module 109 repeats the processes in steps 805, 806, and 809 for each transaction table to be processed in the opened transaction DB 106.

ステップ８０５において、学習制御モジュール１０９は、質的変数の数に基づいてトピック数Ｑを算出する。トピック数Ｑは、あらかじめユーザにより設定されてもよいし、データサンプルを用いて決められてもよい。データサンプルは、ステップ８０４の繰り返しの中でｉ番目のトランザクションテーブルから取得されてもよく、このために符号語が復号されたり、量的変数が質的変数に変換されたりしてもよい。 In step 805, the learning control module 109 calculates the topic number Q based on the number of qualitative variables. The number of topics Q may be set in advance by the user or may be determined using data samples. Data samples may be obtained from the i-th transaction table in the iteration of step 804, for which codewords may be decoded or quantitative variables may be converted to qualitative variables.

ここで、本実施例固有の構成は、量質変換テーブル５０１を参照してトピック数Ｑを決めることである。例えば、質的変数として曜日だけに着目する場合、Ｑ≧７とすることによって曜日ごとの傾向が浮かび上がりやすくなる。 A configuration specific to this embodiment is that the number of topics Q is determined by referring to the quantitative-to-qualitative conversion table 501. For example, when only the day of the week is considered as a qualitative variable, setting Q ≧ 7 makes the tendency for each day of the week easier to reveal.

場所を表す量的変数も量質変換されうる。量的変数の内容にかかわらず、（（質的変数の数）＋１）を用いると、分析した特徴成分が質的変数を１つずつ含み、かつ、雑音となる成分が残りの特徴成分に集まるため、注目した質的変数で分類する効果が高まる。 Quantitative variables representing places can also be converted into qualitative variables. Regardless of the content of the quantitative variables, when ((number of qualitative variables) + 1) is used as the number of topics, each analyzed feature component contains one qualitative variable and the noise components gather in the remaining feature component, which strengthens the effect of classifying by the qualitative variables of interest.

量質変換テーブルにて、時間をｘ通りに仕分け、場所をｙ通りに仕分けるデータを用意し、関数ｆ（）：Ｑ＝ｆ（ｘ，ｙ）＝ｍａｘ（ｘ，ｙ）＋１、を定義してトピック数Ｑを設定する。ｘは、量質変換テーブル（日付）があらかじめ作成され、そのテーブルの行数が数えられた値である。ｙは、量質変換テーブル（場所）があらかじめ作成され、そのテーブルの行数が数えられた値である。 In the quantitative-to-qualitative conversion tables, data is prepared that sorts time into x classes and place into y classes, and the function f(): Q = f(x, y) = max(x, y) + 1 is defined to set the number of topics Q. Here, x is the number of rows counted in a quantitative-to-qualitative conversion table (date) created in advance, and y is the number of rows counted in a quantitative-to-qualitative conversion table (place) created in advance.

さらに、ｚを量質変換テーブル（時刻）の行数として、関数ｇ（）：Ｑ＝ｇ（ｘ，ｙ，ｚ）＝ｍａｘ（ｘ，ｙ，ｚ）＋１を定義して、トピック数Ｑが設定されてもよい。あるいは、トピック数Ｑが、Ｑ＝ｘ＊ｙ＊ｚのように組み合わせ数の値であると、ユーザは、時空間の組合せに基づく分析結果を網羅的に見ることができる。 Further, with z as the number of rows of the quantitative-to-qualitative conversion table (time), the function g(): Q = g(x, y, z) = max(x, y, z) + 1 may be defined to set the number of topics Q. Alternatively, when the number of topics Q is set to the number of combinations, as in Q = x * y * z, the user can view the analysis results for every combination of time and space.
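The two ways of choosing the number of topics Q can be written compactly as below. The function names are hypothetical, and the row counts in the usage lines (7 date classes, 3 place classes, 24 time classes) are illustrative values, not figures from the source.

```python
def topic_count_max(*row_counts):
    """Q = max(row counts) + 1, covering f(x, y) and g(x, y, z) from the text."""
    return max(row_counts) + 1


def topic_count_product(*row_counts):
    """Q = x * y * z, the number of combinations of qualitative classes."""
    q = 1
    for n in row_counts:
        q *= n
    return q


# Illustrative row counts: 7 date classes, 3 place classes, 24 time classes.
q_f = topic_count_max(7, 3)
q_g = topic_count_max(7, 3, 24)
q_all = topic_count_product(7, 3, 24)
```

The max-plus-one rule keeps Q small and leaves one extra topic to absorb noise, while the product rule grows quickly and trades computation for exhaustive coverage of the combinations.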

ステップ８０６において、学習制御モジュール１０９は、ステップ８０４の繰り返しの中でｉ番目のトランザクションテーブルのレコードのそれぞれについてステップ８０７、８０８の処理をそれぞれ実行するように繰り返す。 In step 806, the learning control module 109 repeats the processing in steps 807 and 808 for each of the records in the i-th transaction table in the repetition of step 804.

ステップ８０７において、学習制御モジュール１０９は、ステップ８０６の繰り返しのｊ番目のレコードに対応する入力ベクトルを作成する。この入力ベクトルの作成では、符号語を復号して量的変数を質的変数に変換しつつ、キーワードの集合が作成される。なお、ｊ番目のレコードに量的変数が含まれている場合は、復号がスキップされてもよい。 In step 807, the learning control module 109 creates an input vector corresponding to the jth record in the repetition of step 806. In creating the input vector, a set of keywords is created while decoding the code word and converting the quantitative variable into the qualitative variable. Note that when the j-th record includes a quantitative variable, decoding may be skipped.
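A minimal sketch of how one record might be turned into a keyword set in step 807 is shown below. The record layout, the `alarm_master` mapping, and the hour-labeling rule are assumptions made for illustration; in the described system these conversions come from the master table and the quantitative-to-qualitative conversion tables.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def build_keywords(record, alarm_master):
    """Decode codewords and convert quantitative values into keywords.

    `record` is a hypothetical Table-B style row with the keys
    "datetime", "alarm", and "position".
    """
    stamp = record["datetime"]
    keywords = set()
    # Quantitative to qualitative: day of week and hour-of-day label.
    keywords.add(WEEKDAYS[stamp.weekday()])
    keywords.add(f"{stamp.hour % 12 or 12} o'clock")
    # Codeword decoding through the (assumed) master table mapping;
    # unknown codes are kept as they are.
    keywords.add(alarm_master.get(record["alarm"], record["alarm"]))
    # The position is already qualitative in this sketch.
    keywords.add(record["position"])
    return keywords


record = {"datetime": datetime(2017, 11, 30, 14, 6),
          "alarm": "1", "position": "Osaka"}
keywords = build_keywords(record, {"1": "Caution"})
```

A set is used because the topic model here only needs to know which keywords a record contains; a list or a count vector would serve equally well if repeated values matter.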

ステップ８０８において、学習制御モジュール１０９は、トピック数Ｑと作成した入力ベクトルを用いて機械学習する。なお、機械学習の処理そのものは、学習制御モジュール１０９が実行してもよいし、データ分析支援システム１０１の外部で実行されてもよく、外部で実行する場合の学習制御モジュール１０９は、その外部へトピック数Ｑと入力ベクトルの情報を送り、学習結果を得てもよい。その機械学習の処理については図１０を用いて後で詳細に説明する。 In step 808, the learning control module 109 performs machine learning using the number of topics Q and the created input vector. The machine learning itself may be executed by the learning control module 109 or outside the data analysis support system 101. When it is executed externally, the learning control module 109 may send the number of topics Q and the input vector to the external system and receive the learning result. The machine learning process is described in detail later with reference to FIG. 10.

ステップ８０９において、学習制御モジュール１０９は、ステップ８０７、８０８の繰り返しが終了すると、その繰り返しにより得られたｉ番目のトランザクションテーブルの学習結果を保存する。 In step 809, when the repetition of steps 807 and 808 is completed, the learning control module 109 stores the learning result of the i-th transaction table obtained by the repetition.

ステップ８１０において、集約制御モジュール１１０は、トピック群の組合せを最適化する。図７を用いて説明した通り、最適解は図７に示す隣接行列ｘの形式である。隣接行列ｘの要素値を参照して、ｘ（ｉ，ｊ，ｋ）＝１であれば、ｋ番目の学習テーブルのｊ番目のトピックはｉ番目のトピックグループに属すると判定する。 In step 810, the aggregation control module 110 optimizes the combination of topic groups. As described with reference to FIG. 7, the optimum solution is in the form of the adjacency matrix x shown in FIG. With reference to the element value of the adjacency matrix x, if x (i, j, k) = 1, it is determined that the jth topic of the kth learning table belongs to the ith topic group.

ステップ８１１において、集約制御モジュール１１０は、トピックグループを構成する各種のデータ（トピック、キーワード、および出現確率）を集約する。この集約のために、集約制御モジュール１１０は、第１に、おのおののトピックグループを示す索引ｉ（０≦ｉ≦Ｑ）について、トピックリストをメモリ領域に確保し、以下の第２と第３の処理手順を繰り返す。 In step 811, the aggregation control module 110 aggregates the various data (topics, keywords, and appearance probabilities) constituting each topic group. For this aggregation, the aggregation control module 110 first allocates a topic list in a memory area for each index i (0 ≦ i ≦ Q) indicating a topic group, and then repeats the second and third procedures below.

第２に、集約制御モジュール１１０は、ｉ番目のトピックグループに含まれるｋ番目の学習テーブルのｊ番目のトピックを構成するr番目の構成要素、すなわちキーワードｗ（ｋ，ｊ，ｒ）と出現確率φ（ｋ，ｊ，ｒ）をトピックリストに加えてメモリに保存する。 Second, the aggregation control module 110 adds the r-th component of the j-th topic of the k-th learning table included in the i-th topic group, that is, the keyword w(k, j, r) and the appearance probability φ(k, j, r), to the topic list and stores it in memory.

第３に、集約制御モジュール１１０は、保存時にキーワードの重複を避けて統合する。例えば、トピックリストにあるキーワードと同じキーワードが別の学習テーブルでの構成要素であった場合、２つの出現確率の平均値を、そのキーワードの出現確率として計算して保存することにより、トピックリスト内のキーワードは、そのトピックリスト内で一意となる。 Third, the aggregation control module 110 merges keywords at the time of storage so that no keyword is duplicated. For example, when a keyword already in the topic list is also a component of a topic in another learning table, the average of the two appearance probabilities is calculated and stored as the appearance probability of that keyword, so that each keyword is unique within the topic list.

これによって、それぞれの学習テーブルごとの分析結果が集約され、学習テーブルをまたいだデータ関係が顕在化する。そして、集約制御モジュール１１０は、メモリに保存した情報すなわち集約した各種のデータを、集約テーブルとして集約ＤＢ１１５に格納する。 As a result, the analysis results for each learning table are aggregated, and the data relationship across the learning tables becomes apparent. The aggregation control module 110 stores the information stored in the memory, that is, the aggregated various data in the aggregation DB 115 as an aggregation table.
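The second and third procedures can be sketched as a merge that averages the probabilities of duplicated keywords. The function name `aggregate_group` and the sample topics are hypothetical, and sorting the result by probability is a presentation choice made here, not something the source specifies.

```python
def aggregate_group(topics):
    """Merge the topics of one topic group into a single topic list.

    Each topic maps keyword -> appearance probability. A keyword that
    occurs in more than one topic gets the average of its probabilities,
    so every keyword is unique in the result.
    """
    collected = {}
    for topic in topics:
        for keyword, prob in topic.items():
            collected.setdefault(keyword, []).append(prob)
    merged = {k: sum(v) / len(v) for k, v in collected.items()}
    # Sorted by probability, highest first (an assumption for display).
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical, truncated topics from two learning tables in one group.
topic_1_1 = {"Tuesday": 0.31, "2 o'clock": 0.2, "Hanka": 0.04, "Osaka": 0.03}
topic_2_2 = {"Tuesday": 0.51, "alarm (3)": 0.07, "Osaka": 0.01}
topic_list = aggregate_group([topic_1_1, topic_2_2])
```

Averaging keeps the merged values in the range 0 to 1 but does not renormalize them, so the merged probabilities of a group need not sum to 1.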

ステップ８１２において、対話制御モジュール１１１は、分析結果と対話入力領域をユーザ端末１０２のディスプレイに出力する。分析結果は、集約ＤＢ１１５から得た情報であり、トピックリストのキーワードと出現確率を含む。さらに、対話入力領域には、キーワードに並列してユーザ選択を受け付けるチェックボックスを含み、ユーザからの入力を受け付ける。 In step 812, the dialogue control module 111 outputs the analysis result and the dialogue input area on the display of the user terminal 102. The analysis result is information obtained from the aggregation DB 115, and includes the keyword of the topic list and the appearance probability. Furthermore, the dialog input area includes a check box for accepting user selection in parallel with the keyword, and accepts input from the user.

ユーザが複数のチェックボックスを選択し、ユーザ端末１０２を用いて関連情報出力の指示を入力すると、対話制御モジュール１１１は、選択されたチェックボックスの情報を受け付けて、逆引き情報が検索されて、関連情報としてユーザ端末１０２に出力する。 When the user selects a plurality of check boxes and enters an instruction to output related information from the user terminal 102, the dialog control module 111 accepts the information of the selected check boxes, the reverse lookup information is searched, and the result is output to the user terminal 102 as related information.

図５を用いて、チェックされたキーワードが「２時台」だった場合の関連情報の出力について説明する。統括制御モジュール１０８は、対話制御モジュール１１１から「２時台」のキーワードと関連情報の検索の要求を受け取り、受け取ったキーワードに基づいて量質変換テーブル５０１を参照し、受け取ったキーワードと同じ質的表現を含むレコードを検索する。 The output of related information when the checked keyword is “2 o'clock” is described with reference to FIG. 5. The overall control module 108 receives the keyword “2 o'clock” and a request to search for related information from the dialog control module 111, refers to the quantitative-to-qualitative conversion table 501 based on the received keyword, and searches for records containing the same qualitative expression as the received keyword.

この検索により、統括制御モジュール１０８は、１番目のレコードの質的表現（２）５０５に「２時台」を見つけて、見つかった１番目のレコードに登録された逆引きテーブル名５０６の「Ｔａｂｌｅ−Ｂ」とレコード番号５０７の「１」を得る。そして、統括制御モジュール１０８は、トランザクションＤＢ１０６から「Ｔａｂｌｅ−Ｂ」というトランザクションテーブルの１番目のレコードを検索する。 Through this search, the overall control module 108 finds “2 o'clock” in the qualitative expression (2) 505 of the first record and obtains “Table-B” from the reverse lookup table name 506 and “1” from the record number 507 registered in that record. The overall control module 108 then retrieves the first record of the transaction table “Table-B” from the transaction DB 106.

統括制御モジュール１０８は、検索により見つけた１番目のレコードを得て、対話制御モジュール１１１へ送る。対話制御モジュールは受け取った１番目のレコードを、量質変換前の関連情報としてユーザ端末１０２に出力する。なお、逆引きテーブル名５０６とレコード番号５０７の情報が、関連情報としてユーザ端末１０２に出力されてもよい。 The overall control module 108 obtains the first record found by the search and sends it to the dialog control module 111. The dialog control module outputs the received record to the user terminal 102 as related information in its form before quantitative-to-qualitative conversion. The reverse lookup table name 506 and the record number 507 may also be output to the user terminal 102 as related information.
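The reverse lookup walked through above can be sketched as a two-stage search: first from the keyword to the conversion-table rows, then from each row's table name and record number to the original record. All names and sample data below are hypothetical; the second conversion row and the record contents are assumed for illustration.

```python
def reverse_lookup(keyword, conversion_rows, transaction_db):
    """Return (table name, record number, record) for rows matching keyword.

    `conversion_rows` holds (qualitative expressions, reverse lookup table
    name, record number) tuples; `transaction_db` maps a table name to a
    dict of record number -> record.
    """
    hits = []
    for expressions, table_name, record_no in conversion_rows:
        if keyword in expressions:
            record = transaction_db[table_name][record_no]
            hits.append((table_name, record_no, record))
    return hits


# Hypothetical data; only the first row's expressions, table name, and
# record number follow the text.
conversion_rows = [
    (("daytime", "2 o'clock"), "Table-B", 1),
    (("afternoon",), "Table-B", 2),
]
transaction_db = {
    "Table-B": {
        1: {"datetime": "14:06", "alarm": "3", "position": "Osaka"},
        2: {"datetime": "16:20", "alarm": "1", "position": "Tokyo"},
    }
}
related = reverse_lookup("2 o'clock", conversion_rows, transaction_db)
```

Returning the table name and record number together with the record mirrors the option in the text of showing the reverse lookup fields themselves as related information.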

図９は、ユーザ対話画面の例を示す図である。図９に示した画面９０１は、分析結果出力領域９０２と対話入力領域９０３を含み、図８を用いて説明したステップ８１２において作成出力される。図９に示した例において、分析結果出力領域９０２には、「Ｇｒｏｕｐ１」すなわちトピックグループ１の分析結果として、「火曜０．３１」のようにキーワードと出現確率のペアが列挙される。 FIG. 9 is a diagram illustrating an example of a user interaction screen. A screen 901 shown in FIG. 9 includes an analysis result output area 902 and a dialog input area 903, and is created and output in step 812 described with reference to FIG. In the example shown in FIG. 9, the analysis result output area 902 lists pairs of keywords and appearance probabilities such as “Tuesday 0.31” as analysis results of “Group 1”, that is, topic group 1.

図９に示した例では、端的には、「火曜」が含まれるレコードはトピックグループ１に属する可能性が高いことを示している。また、「Ｇｒｏｕｐ２」のトピックグループ２には別のキーワードと出現確率が列挙されている。同一のキーワードが複数のトピックグループに出現してもよい。 In the example illustrated in FIG. 9, in short, it is indicated that a record including “Tuesday” is highly likely to belong to the topic group 1. In addition, another keyword and appearance probability are listed in the topic group 2 of “Group 2”. The same keyword may appear in a plurality of topic groups.

対話入力領域９０３はチェックボックスであり、ユーザがチェックボックスにチェックをつけて、キーワードを選択することが可能な入力領域である。図９に示した例では、「火曜」、「大阪」、および「警報（３）」それぞれのチェックボックスにチェックがついており、これらのキーワードがユーザにより選択された状態を示す。 The dialog input area 903 is a check box, and is an input area where the user can select a keyword by checking the check box. In the example shown in FIG. 9, the check boxes of “Tuesday”, “Osaka”, and “Alarm (3)” are checked, and the keywords are selected by the user.

逆引き情報９０４は、分析の基となった情報を示し、関連情報としてステップ８１２で作成されて出力される。量質変換により「火曜」と解釈された数値が含まれるレコードは、「トランザクションテーブルＴａｂｌｅ−Ｂ」のレコード番号が１、４、５などのレコードであり、さらに「保守テーブルＴａｂｌｅ−Ｈ０」のレコード番号が２、３、４などのレコードである。 The reverse lookup information 904 indicates information that is the basis of the analysis, and is created and output in step 812 as related information. The record including the numerical value interpreted as “Tuesday” by the quality conversion is a record having a record number of “transaction table Table-B” 1, 4, 5, etc., and a record of “maintenance table Table-H0”. It is a record whose number is 2, 3, 4, etc.

データ分析作業者は、図９に示した以上の情報から「警報（３）」が「火曜」、「大阪」、および「晴」と同時に表れるという傾向を見いだせる。元々の警報情報が含まれた「トランザクションテーブルＴａｂｌｅ−Ｂ」だけを分析していても、「警報（３）」は不定期に各地で頻発していて、特定の曜日や場所に注目しにくい場合もあり得る。 From the above information shown in FIG. 9, the data analysis worker can find a tendency that “alarm (3)” appears simultaneously with “Tuesday”, “Osaka”, and “Sunny”. Even if only the “transaction table Table-B” containing the original alarm information is analyzed, “alarm (3)” occurs frequently in various places, and it is difficult to pay attention to a specific day or place. There is also a possibility.

このような場合であって、特定の作業者の保守作業方法が不適切であり、特定の天候のときにのみ異常が生じるという複合要因による警報の原因を調べる場合、従来は、そのような仮説をデータ分析作業者が考えて、関係する異種データを自力で調べなければならなかった。 In such a case, to investigate the cause of an alarm that stems from a combination of factors (for example, a specific worker's maintenance method is inappropriate and an abnormality occurs only in specific weather), the data analysis worker conventionally had to conceive such a hypothesis and examine the related heterogeneous data unaided.

その上、このような仮説の組合せは極めて多く、データ分析作業者の作業時間は極めて大きくなる傾向がある。本実施例では、そのような仮説の可能性について、トピックグループを提示することにより、データ分析作業者にリコメンドできる。データ分析作業者の作業時間が大幅に削減することが可能になる。 In addition, there are many combinations of such hypotheses, and the working time of data analysis workers tends to be extremely long. In this embodiment, the possibility of such a hypothesis can be recommended to a data analysis worker by presenting a topic group. The time required for data analysis workers can be greatly reduced.

すなわち、「トランザクションテーブルＴａｂｌｅ−Ｂ」、「保守テーブルＴａｂｌｅ−Ｈ０」、および天候テーブルなどの異種テーブルを集約して分析することにより、異種テーブル間に分散していた有用なパターンを、データ分析作業者に見つけやすくさせることが可能になる。 That is, by aggregating and analyzing heterogeneous tables such as the “transaction table Table-B”, the “maintenance table Table-H0”, and a weather table, useful patterns that were scattered across the heterogeneous tables become easier for the data analysis worker to find.

The processing flow of the machine learning in step 808 will be described in detail with reference to FIG. 10.
The input information for the machine learning consists of a parameter α indicating the bias in topic appearance frequency, a parameter β indicating the bias in word appearance frequency, a parameter S giving the upper limit on the number of sampling iterations, the total number of topics K, and a set r of record data.

A stationary distribution with fixed values may be used as the initial value of the parameters α and β. The parameter S, the upper limit on the number of iterations, may also be a fixed value. The following four assumptions are made about the variables: (1) the appearance probability of topics in a record follows a Dirichlet distribution with parameter α; (2) the appearance probability of data values in the k-th topic follows a Dirichlet distribution with parameter β; (3) the latent variable that generated the i-th data value in the d-th record follows a multinomial distribution; and (4) the i-th data value in the d-th record follows a multinomial distribution.

In addition, two statistical variables are defined: (1) the number of times topic k appears in the d-th record, and (2) the number of times, over all records, that the k-th topic is estimated for data value index v.
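The four assumptions and the two count variables can be written compactly as follows. This is a sketch under stated assumptions: the symbols θ, z, and n and the function name index are introduced here for illustration and do not appear in the original text, and the choice of θ and φ as the parameters of the two multinomial distributions follows the usual topic model formulation, which the text does not spell out.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{(1) topic probabilities in record } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta)
  && \text{(2) data value probabilities in topic } k \\
z_{d,i} &\sim \mathrm{Multinomial}(\theta_d)
  && \text{(3) latent variable of the $i$-th data value in record } d \\
w_{d,i} &\sim \mathrm{Multinomial}(\phi_{z_{d,i}})
  && \text{(4) the $i$-th data value in record } d \\
n_{d,k} &= \#\{\, i : z_{d,i} = k \,\}
  && \text{occurrences of topic $k$ in record } d \\
n_{k,v} &= \#\{\, (d,i) : z_{d,i} = k,\ \mathrm{index}(w_{d,i}) = v \,\}
  && \text{assignments of topic $k$ to value index } v
\end{align*}
```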

Step 1001 controls repetition of the subsequent processing over all tables. In step 1002, the quantitative data values of all records are classified and converted into qualitative data values. The same number of records may be handled for every table. In addition, an index v that uniquely corresponds to each data value w is created.
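Step 1002 can be illustrated with a small Python sketch. The helper names `to_qualitative` and `build_value_index`, the half-open range convention, and the temperature ranges and labels are all hypothetical and chosen only for illustration; the text does not specify how ranges are stored or how indexes are numbered.

```python
def to_qualitative(value, conversion_table):
    """Return every label whose half-open range [low, high) contains value.

    conversion_table: list of (low, high, label) tuples (assumed layout).
    """
    return [label for low, high, label in conversion_table if low <= value < high]


def build_value_index(values):
    """Assign each distinct data value w a unique integer index v, in first-seen order."""
    index = {}
    for w in values:
        if w not in index:
            index[w] = len(index)
    return index


# Hypothetical conversion table: temperature ranges in degrees Celsius.
temperature_table = [
    (-50.0, 10.0, "cold"),
    (10.0, 25.0, "mild"),
    (25.0, 60.0, "hot"),
]
readings = [3.5, 18.0, 31.2, 12.4]

# Convert each quantitative reading into its qualitative label(s).
keywords = [label for x in readings for label in to_qualitative(x, temperature_table)]
# Create the index v for every distinct data value w.
value_index = build_value_index(keywords)
```

Numbering the indexes in first-seen order keeps the mapping deterministic for a given input order, which makes the later sampling results easier to trace back to keywords.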

Step 1003 controls repetition of the subsequent processing. Step 1004 controls repetition of the subsequent processing over all records. Step 1005 controls repetition of the subsequent processing over all data value records. In step 1006, the latent variable is sampled based on the assumptions above. In step 1007, the two statistical variables are updated.

In step 1008, the parameters α and β are updated. In step 1009, the appearance probability φ of data values in the k-th topic is output together with the previously created data values w. By repeating the above processing, the initially input parameters α and β can be updated based on the statistical properties of the actual data, so that their accuracy improves and they converge. In step 1010, combinatorial optimization is performed over the plurality of topics (w and φ).

Steps 1003 through 1009 are equivalent to the processing of step 808 and are based on the existing technique of collapsed (marginalized) Gibbs sampling.
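The loop structure of steps 1003 through 1009 resembles textbook collapsed Gibbs sampling for a topic model, which can be sketched in Python as below. This is an illustrative sketch, not the implementation of the embodiment. It assumes symmetric scalar values for α and β, assumes that the outer loop of step 1003 runs a fixed number of sweeps up to the limit S, and omits the hyperparameter update of step 1008. The function name and the names `n_dk`, `n_kv`, and `n_k` are hypothetical.

```python
import random


def collapsed_gibbs_lda(records, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, max_sweeps=50, seed=0):
    """Estimate phi[k][v], the probability of value index v in topic k.

    records: list of records, each a list of data value indexes v.
    """
    rng = random.Random(seed)
    K, V = num_topics, vocab_size
    n_dk = [[0] * K for _ in records]   # times topic k appears in record d
    n_kv = [[0] * V for _ in range(K)]  # times topic k is assigned to index v
    n_k = [0] * K                       # total assignments per topic
    z = []                              # latent topic of each data value

    # Random initial topic assignment for every data value.
    for d, record in enumerate(records):
        z.append([])
        for v in record:
            k = rng.randrange(K)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kv[k][v] += 1
            n_k[k] += 1

    for _ in range(max_sweeps):                  # cf. step 1003 (assumed limit S)
        for d, record in enumerate(records):     # cf. step 1004
            for i, v in enumerate(record):       # cf. step 1005
                # Remove the current assignment from the counts.
                k_old = z[d][i]
                n_dk[d][k_old] -= 1
                n_kv[k_old][v] -= 1
                n_k[k_old] -= 1
                # cf. step 1006: sample from the collapsed conditional.
                weights = [
                    (n_dk[d][k] + alpha) * (n_kv[k][v] + beta) / (n_k[k] + V * beta)
                    for k in range(K)
                ]
                k_new = rng.choices(range(K), weights=weights)[0]
                # cf. step 1007: update the two count variables.
                z[d][i] = k_new
                n_dk[d][k_new] += 1
                n_kv[k_new][v] += 1
                n_k[k_new] += 1

    # cf. step 1009: smoothed estimate of phi from the final counts.
    return [
        [(n_kv[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
        for k in range(K)
    ]


# Toy input: three records over four value indexes (made-up data).
toy_records = [[0, 1, 0], [2, 3, 3], [0, 1, 1]]
phi = collapsed_gibbs_lda(toy_records, num_topics=2, vocab_size=4, max_sweeps=20, seed=1)
```

Each row of `phi` is a probability distribution over the value indexes, so pairing the indexes back with their data values w gives the keyword and appearance probability pairs that make up a topic.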

101 Data analysis support system
109 Learning control module
110 Aggregation control module
113 Quantity-to-quality conversion DB
114 Learning DB
115 Aggregation DB

Claims

A data analysis support system configured by a computer, wherein
the computer comprises:
a memory in which a program is stored; and
a CPU that executes the program stored in the memory,
and the CPU:
inputs a plurality of tables;
extracts keywords from each of the plurality of input tables;
performs machine learning on the extracted keywords and, as an analysis result of the machine learning, extracts feature components each including a pair of a keyword and an appearance probability for each of the plurality of tables;
identifies combinations of feature components between tables based on matches, between tables, of the keywords included in the extracted feature components;
aggregates the analysis results by combining the appearance probabilities of the feature components included in the identified combinations; and
outputs the aggregated analysis results.

The data analysis support system according to claim 1, wherein
the computer further comprises a disk storing a quantity-to-quality conversion table that associates a range of values of a quantitative variable with values of one or more qualitative variables, and
the CPU converts the values of quantitative variables included in each of the plurality of input tables into values of one or more qualitative variables using the quantity-to-quality conversion table, and extracts the converted values of the qualitative variables as keywords.

The data analysis support system according to claim 2, wherein
the disk further stores a conversion relationship definition table that associates the names of the plurality of input tables, the names of the fields included in the plurality of input tables, and the quantity-to-quality conversion table, and
the CPU identifies the names of the plurality of input tables and the names of the fields, identifies the quantity-to-quality conversion table to be used by means of the conversion relationship definition table, converts the values of quantitative variables included in each of the plurality of input tables into values of one or more qualitative variables using the identified quantity-to-quality conversion table, and extracts the converted values of the qualitative variables as keywords.

The data analysis support system according to claim 3, wherein
the CPU converts code words included in each of the plurality of input tables into values of quantitative variables, converts the converted values of the quantitative variables into values of one or more qualitative variables using the quantity-to-quality conversion table, and extracts the converted values of the qualitative variables as keywords.

The data analysis support system according to claim 2, wherein
the CPU calculates, for each table among the plurality of input tables, the number of feature components using the number of kinds of values of each of the plurality of qualitative variables obtained from that table, and
in the machine learning of the extracted keywords, applies the calculated number of feature components to the machine learning and takes that number of feature components as the analysis result of the machine learning.

The data analysis support system according to claim 5, wherein
the CPU, for each table among the plurality of input tables, identifies the number of kinds of values of each of the plurality of qualitative variables obtained from that table, identifies the maximum of those numbers among the plurality of qualitative variables, and calculates the number of feature components by adding 1 to the identified maximum.

The data analysis support system according to claim 5, wherein
the CPU, for each table among the plurality of input tables, calculates the number of feature components by multiplying together the numbers of kinds of values of the respective qualitative variables obtained from that table.

The data analysis support system according to claim 2, wherein
the CPU converts the values of quantitative variables included in each of the plurality of input tables into values of one or more qualitative variables using the quantity-to-quality conversion table, stores in the quantity-to-quality conversion table information that associates the name of the table containing the value of the quantitative variable before conversion with the converted value of the qualitative variable, and extracts the converted values of the qualitative variables as keywords, and
outputs the aggregated analysis results of the feature components including the keywords, accepts a selection input for the output keywords, identifies the keyword selected by the accepted input, retrieves from the quantity-to-quality conversion table the name of the table associated with the value of the qualitative variable extracted as the identified keyword, and outputs the retrieved name.

A data analysis support method performed by a computer, wherein
the computer comprises:
a memory in which a program is stored; and
a CPU that executes the program stored in the memory,
and the CPU:
inputs a plurality of tables;
extracts keywords from each of the plurality of input tables;
performs machine learning on the extracted keywords and, as an analysis result of the machine learning, extracts feature components each including a pair of a keyword and an appearance probability for each of the plurality of tables;
identifies combinations of feature components between tables based on matches, between tables, of the keywords included in the extracted feature components;
aggregates the analysis results by combining the appearance probabilities of the feature components included in the identified combinations; and
outputs the aggregated analysis results.

The data analysis support method according to claim 9, wherein
the computer further comprises a disk storing a quantity-to-quality conversion table that associates a range of values of a quantitative variable with values of one or more qualitative variables, and
the CPU converts the values of quantitative variables included in each of the plurality of input tables into values of one or more qualitative variables using the quantity-to-quality conversion table, and extracts the converted values of the qualitative variables as keywords.