JP6886935B2

JP6886935B2 - Data analysis support system and data analysis support method

Info

Publication number: JP6886935B2
Application number: JP2018054882A
Authority: JP
Inventors: 山田　隆亮; 隆亮山田; 勇樹前川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2018-03-22
Filing date: 2018-03-22
Publication date: 2021-06-16
Anticipated expiration: 2038-03-22
Also published as: JP2019168820A

Description

The present invention relates to a data analysis support system and a data analysis support method.

Demand for data analysis has long been strong. Analyzing huge volumes of data by hand is difficult, and a variety of data analysis techniques, such as data mining, have been developed. One of them, machine learning, formulates a model based on statistics, probability, or the like, and estimates the detailed parameters of the model backward from actually observed data. Computer decisions based on a machine-learned model are regarded as intelligent decisions and can be useful for classification, prediction, recommendation, and so on.

However, when big data that mixes valuable and worthless items is analyzed directly and indiscriminately, the analysis tends to yield only results that could be inferred superficially, despite the enormous computing resources invested, and latent results are hard to obtain. Because of this tendency, a specific topic is often analyzed on specific data in order to obtain useful results.

For example, Patent Document 1 discloses a technique that refers to a purchase information table to estimate the topics of interest to each account, estimates the probability that a product will be purchased for each topic, raises that probability when the product is a new product, and thereby recommends new products.

Patent Document 2 discloses a technique that learns the similarity between job posting information and resume information by using an analysis method based on a topic model.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2012-242940
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2017-134732

With the techniques disclosed in Patent Document 1 or Patent Document 2, a specific topic can be analyzed on specific data. However, when big data that consists of many tables and for which no topic can be fixed is machine-learned, relationships among data within the same table are strong, and relationships across tables tend to be buried.

If relationships within a table are regarded as noise (N) and relationships across tables as signal (S), simply increasing the number of records in the tables to be learned does not improve this SN ratio. Neither Patent Document 1 nor Patent Document 2 discloses a technique for improving the SN ratio in such learning.

Furthermore, it is difficult to get an overview of data relationships in a data lake that stores big data in which large volumes of diverse data are mixed. In particular, it is important to support data analysts in discovering unknown patterns in the relationships among raw data, in other words, to extract data relationship information that prompts such insights.

An object of the present invention is to provide a data analysis support system that brings data relationships between tables to the surface.

A representative data analysis support system according to the present invention is a data analysis support system configured by a computer. The computer includes a memory that stores a program and a CPU that executes the program stored in the memory. The CPU receives a plurality of tables as input; extracts keywords from each of the input tables; performs machine learning on the extracted keywords; extracts, for each of the tables, feature components that contain pairs of a keyword and an appearance probability as the analysis result of the machine learning; identifies combinations of feature components across tables based on matches, between tables, of the keywords contained in the extracted feature components; combines the appearance probabilities of the feature components contained in the identified combinations to aggregate the analysis results; and outputs the aggregated analysis results.

According to the present invention, it is possible to provide a data analysis support system that brings data relationships between tables to the surface.

A diagram showing an example of a data analysis support system.
A diagram showing an example of a computer.
A diagram showing an example of data sorting using machine learning.
A diagram showing an example of a conversion relationship definition table.
A diagram showing an example of a quantity-quality conversion table.
A diagram showing an example of the relationship between data values and processing.
A diagram showing an example of combinatorial optimization.
A diagram showing an example of the processing flow of the data analysis support system.
A diagram showing an example of a user dialogue screen.
A diagram showing an example of the processing flow of machine learning.

Among the techniques disclosed in this specification, the configuration and operation according to one aspect are as follows. Embodiments of the present invention are described below as examples with reference to the drawings.

FIG. 1 is a diagram showing an example of a data analysis support system. The data analysis support system 101 is configured by a computer and is connected to a user terminal 102, a data lake 103, a business operation system 104, and a data warehouse 105 via a network 120.

The user terminal 102, the data lake 103, the business operation system 104, and the data warehouse 105 may also be computers. Through the user terminal 102, the user accesses these computers, including the data analysis support system 101, to enter data and to have data displayed.

In FIG. 1, the solid-line network 120 represents physical connections between computers, and the broken lines represent data flows (or control flows). The broken lines show only representative data flows, and data flows other than those shown by broken lines in FIG. 1 may exist.

The data lake 103 is a facility that stores raw data before analysis. Operation management logs, maintenance records, and the like are sent from the business operation system 104 via the network 120 and stored in the transaction DB 106, which is a database in the data lake 103. As a result, a large number of records that arise in time series and conform to a schema structure accumulate in the transaction DB 106.

The transaction DB 106 may also contain heterogeneous tables that are not time-series. As a database other than the transaction DB 106, the data lake 103 has a master DB 107 in which data encoding methods and schema structures are defined in advance. The information in the master DB 107 may be stored in advance by the user.

The memory of each computer, including the business operation system 104 and the data lake 103, stores a communication program module in which data transmission and reception procedures are defined in advance. The computers communicate when the CPU (Central Processing Unit) of each computer executes the communication program.

The user selects data in the data lake 103, has the selected data converted as necessary, and then has it transferred to and stored in the data warehouse 105. Alternatively, a selection-conversion-analysis program module in which selection, conversion, and analysis procedures are defined in advance may be stored beforehand in the memory of the data warehouse 105.

The user analyzes data interactively using the data warehouse 105. For this analysis, the user selects data from among the data in the data lake 103, and the data analysis support system 101 provides the user with information on which data should be selected.

Put another way, the user receives recommendations of data to select by using the data analysis support system 101. The data warehouse 105 has been described above in order to explain the role of the data analysis support system 101, but the data warehouse 105 is not limited to what is described above.

In the data analysis support system 101, a program module in which an analysis support procedure is described in advance, and parameters such as conditions specified interactively by the user and predetermined initial values, are stored in memory. Analysis support processing is realized when the CPU executes the program, stored in the memory, in which the analysis support procedure is described.

The program module in which the analysis support procedure is described has a plurality of submodules: an overall control module 108, a learning control module 109, an aggregation control module 110, and a dialogue control module 111. Here, each submodule may be the CPU that executes the corresponding program rather than the program itself.

Databases are built on the disk (storage device) or in the memory of the data analysis support system 101. These databases include a management DB 112, a quantity-quality conversion DB 113, a learning DB 114, and an aggregation DB 115. The management DB 112 stores a conversion relationship definition table (FIG. 4) and information on how to access the various data in the data lake 103, both created in advance.

The quantity-quality conversion DB 113 stores quantity-quality conversion tables (FIG. 5) for time, place, and the like, created in advance. The learning DB 114 stores intermediate processing results as learning tables (FIG. 6). The aggregation DB 115 stores final processing results as an aggregation table (FIG. 6).

The data analysis support system 101 starts processing in response to a request from the user, outputs the result to the user, and ends the processing. First, the overall control module 108 puts the dialogue control module 111 into a state of waiting for a request from the user. The handovers in the series of processes below may be performed under the overall control of the overall control module 108.

When the dialogue control module 111 determines that data analysis support has been requested from the user terminal 102 by a user operation, it hands processing over to the overall control module 108. The overall control module 108 reads the access methods for the various data based on the management information defined in the management DB 112, and hands processing over to the learning control module 109.

The learning control module 109 reads the data of the transaction DB 106 (FIG. 6) using the quantity-quality conversion DB 113, and stores a plurality of learning results in the learning DB 114. The aggregation control module 110 reads the plurality of learning results from the learning DB 114 and stores aggregated data, in which those learning results are aggregated, in the aggregation table in the aggregation DB 115. The dialogue control module 111 reads the aggregation result from the aggregation DB 115 and outputs it to the user terminal 102.

FIG. 2 is a diagram showing an example of a computer. The computer 201 is, for example, the data analysis support system 101, and may also be any of the other facilities shown in FIG. 1. The computer 201 includes a CPU 202 with computation registers, a memory 203, a disk 205, an input/output unit 206, a timer 204, and a sensor 207.

The disk 205 may be a magnetic storage device or a non-volatile semiconductor storage device. The input/output unit 206 includes a display, a keyboard, a mouse, and a circuit for communicating with the network 120. The display, keyboard, and mouse of the input/output unit 206 may be used in place of the user terminal 102.

FIG. 3 is a diagram showing an example of data sorting using machine learning. The data lake 103 contains a plurality of databases and a plurality of tables, and the tables include heterogeneous ones. Here, two tables that have a common field (such as time) are said to be of the same kind. The data of two tables of the same kind can be converted into the data of a single table by a table join.

In contrast, two tables that have no common field are heterogeneous to each other, and two heterogeneous tables cannot be joined. For example, suppose there are four fields A, B, C, and D and three tables P, Q, and R with the data structures P(A, B), Q(C, D), and R(A, C). Then table P and table Q are heterogeneous, table P and table R are of the same kind, and table Q and table R are of the same kind. The three tables P, Q, and R therefore include heterogeneous tables.
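The same-kind test above can be sketched in Python as a set intersection over field names. This is a minimal illustration, not the embodiment's implementation; the names `SCHEMAS`, `same_kind`, and `contains_heterogeneous_pair` are hypothetical.

```python
from itertools import combinations

# Table schemas from the example in the text: P(A, B), Q(C, D), R(A, C).
SCHEMAS = {
    "P": {"A", "B"},
    "Q": {"C", "D"},
    "R": {"A", "C"},
}


def same_kind(left, right, schemas=SCHEMAS):
    """Two tables are of the same kind when they share at least one field."""
    return bool(schemas[left] & schemas[right])


def contains_heterogeneous_pair(schemas=SCHEMAS):
    """True when at least one pair of tables has no field in common."""
    return any(
        not same_kind(a, b, schemas) for a, b in combinations(schemas, 2)
    )


if __name__ == "__main__":
    for a, b in combinations(SCHEMAS, 2):
        kind = "same kind" if same_kind(a, b) else "heterogeneous"
        print(f"{a} and {b}: {kind}")
```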

When only tables P and Q are input, the heterogeneous tables cannot be joined, so the relationship between the data values in field B and those in field D cannot be analyzed. When tables P, Q, and R are input, on the other hand, the three tables can be joined into one table by first joining P and R and then joining Q. The joined table can then be used to analyze the relationship between the data values in fields B and D.
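Finding such a join order amounts to a path search over a graph whose nodes are tables and whose edges connect tables that share a field. The following sketch uses a breadth-first search for illustration; the function name `join_path` is hypothetical, and the text does not say how the embodiment searches for join routes.

```python
from collections import deque


def join_path(schemas, start, goal):
    """Return a chain of tables from start to goal in which each
    neighbouring pair shares a field, or None if no chain exists."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        table = queue.popleft()
        if table == goal:
            path = []
            while table is not None:
                path.append(table)
                table = previous[table]
            return path[::-1]
        for other in schemas:
            if other not in previous and schemas[table] & schemas[other]:
                previous[other] = table
                queue.append(other)
    return None


schemas_pq = {"P": {"A", "B"}, "Q": {"C", "D"}}
schemas_pqr = {**schemas_pq, "R": {"A", "C"}}

if __name__ == "__main__":
    print(join_path(schemas_pq, "P", "Q"))   # no chain: P and Q share no field
    print(join_path(schemas_pqr, "P", "Q"))  # chain through R
```

With only P and Q the search finds no chain, and with R added it finds the chain P, R, Q, matching the join order described above.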

However, because the tables in a data lake are highly diverse, the number of joinable combinations is enormous, and in conventional work inside a data lake, merely finding a join route required a great deal of effort. In addition, because the analysis result can differ depending on the combination, finding an appropriate way to join tables has depended on the analyst's trial and error and intuition.

With analysis based on machine learning, on the other hand, even heterogeneous tables can be analyzed together. Even when data from a plurality of heterogeneous tables is input, the data can be sorted based on some model, and information on the relationships among the data can be output. A wide variety of methods have been developed and published for such machine learning models.

However, whatever model is used, when the data of heterogeneous tables is machine-learned directly, the result tends to rate strongly the relationship among data that arises merely from being in the same table, and to rate weakly the relationship with data in other, heterogeneous tables.

FIG. 3 illustrates a case in which data from four tables is learned using a machine learning method that analyzes the data and classifies it into three classes along three analysis axes X, Y, and Z. If the data is machine-learned directly, the results tend to be sorted by table: data from the first table concentrates on the X axis, data from the second table on the Y axis, and so on.

If the learned sorting results are skewed by table in this way, then analyzing many (multiple, heterogeneous) tables offers little insight into what is being learned, and the benefit of machine learning is not obtained.

In the data analysis of this embodiment, as a first step, machine learning that performs three-way classification is applied table by table, so that each of the four tables is machine-learned independently. This yields four bases, or sets of analysis axes: (X1, Y1, Z1), ..., (X4, Y4, Z4).

As a second step, the relationships among the analysis axes are optimally aggregated to obtain three new aggregation axes (Group1, Group2, Group3). Because the aggregation axes focus on relationships across tables, including those between heterogeneous tables, the data relationships between tables come to the surface and can be extracted.

Data relationships between heterogeneous tables are hard for a data analyst to notice when faced with the huge volume of data that those tables contain. They are information obtained by using the data analysis support system 101 of this embodiment, and obtaining such information is a useful effect of this embodiment.
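The two steps can be sketched loosely as follows, assuming that per-table learning has already produced topics as keyword-to-probability mappings. Everything in this block is an illustrative assumption: the table names and topic values are made up, pairing topics whenever they share any keyword stands in for the optimal aggregation, and averaging is only one possible way to combine appearance probabilities. The text in this passage does not specify either operation.

```python
from itertools import combinations


def shared_keywords(topic_a, topic_b):
    """Keywords that appear in both topics."""
    return set(topic_a) & set(topic_b)


def combine(topic_a, topic_b):
    """Merge two topics. ASSUMPTION: probabilities are simply averaged."""
    keywords = set(topic_a) | set(topic_b)
    return {
        kw: (topic_a.get(kw, 0.0) + topic_b.get(kw, 0.0)) / 2
        for kw in keywords
    }


def aggregate(table_topics):
    """Pair topics from different tables that share a keyword and merge them.

    table_topics maps a table name to the list of topics learned from it.
    """
    groups = []
    for (name_a, topics_a), (name_b, topics_b) in combinations(
        table_topics.items(), 2
    ):
        for topic_a in topics_a:
            for topic_b in topics_b:
                if shared_keywords(topic_a, topic_b):
                    groups.append(((name_a, name_b), combine(topic_a, topic_b)))
    return groups


# Hypothetical per-table learning results (step 1), one topic per table.
TABLE_TOPICS = {
    "Table-A": [{"afternoon": 0.5, "caution": 0.5}],
    "Table-B": [{"afternoon": 0.4, "sunny": 0.6}],
    "Table-C": [{"morning": 1.0}],
}

if __name__ == "__main__":
    for tables, topic in aggregate(TABLE_TOPICS):
        print(tables, topic)
```

In this toy input only Table-A and Table-B share a keyword, so only that pair yields a merged topic that spans two tables.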

In this embodiment, a topic model is used as an example of machine learning. Hereinafter, an analysis axis is called a topic. A topic is also a feature component. A topic model is used mainly for sorting documents, and one topic is formulated as a list of pairs of a word and its appearance probability. The result of machine learning on many documents is output as a plurality of topics. In this respect, the topic model may be the same as in general machine learning.

Put simply, a topic model is machine learning that takes a set of documents as input and outputs a set of topics. When documents are input to a topic model, each document is morphologically analyzed to extract the keywords in it, and the list of those keywords is used as an input vector. By reading a plurality of input vectors, the topic model formulates appropriate topics (pairs of a keyword and its appearance probability) that sort the input vectors.

Each topic expresses a latent meaning that represents a subset of the documents, but that meaning is not expressed by a single word. For example, when a document set is sorted into two topics, the topics are output in a form such as the following: in the first topic, word A appears with a probability of 60%, word B with 0%, and word C with 40%; in the second topic, word A appears with a probability of 0%, word B with 70%, and word C with 30%.

Which topic an individual document belongs to can be determined by using, as an index, the inner product of or the distance between the word vector of the input document and the word vector that constitutes each topic. That is, if the topics obtained by machine learning are taken as analysis axes, a topic model with those analysis axes can be used to classify and sort unknown documents.
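The inner-product criterion can be sketched in Python using the two example topics given above. This is a minimal sketch under stated assumptions: the document vector is taken to be raw word counts, ties are broken in favor of the first topic, and the names `TOPICS`, `inner_product`, and `best_topic` are hypothetical.

```python
from collections import Counter

# The two example topics from the text: word -> appearance probability.
TOPICS = [
    {"A": 0.6, "B": 0.0, "C": 0.4},
    {"A": 0.0, "B": 0.7, "C": 0.3},
]


def inner_product(word_counts, topic):
    """Inner product of a document's word-count vector and a topic vector."""
    return sum(
        count * topic.get(word, 0.0) for word, count in word_counts.items()
    )


def best_topic(words, topics=TOPICS):
    """Index of the topic with the largest inner product for the document."""
    counts = Counter(words)
    scores = [inner_product(counts, topic) for topic in topics]
    return max(range(len(topics)), key=scores.__getitem__)


if __name__ == "__main__":
    print(best_topic(["A", "A", "C"]))  # scores 1.6 vs 0.3
    print(best_topic(["B", "C"]))       # scores 0.4 vs 1.0
```

A distance-based index, which the text also allows, would replace `inner_product` with a distance function and `max` with `min`.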

When database records are input to a topic model, on the other hand, the morphological analysis described above is unnecessary: each record corresponds to one document, and each field value corresponds to a word. The machine learning process is described in detail later with reference to FIG. 10. Records, however, have problems specific to transaction data.

First, the information in database records is usually encoded, for example to save disk space. If the encoded field values are used as input vectors as they are, code values such as "0" and "1" occur extremely often in the input vectors, and the learning machine cannot interpret them properly.

In addition, machine learning becomes inefficient and requires a huge amount of sample data. If the codewords (code values) are decoded before machine learning, learning efficiency improves and reasonable learning becomes possible even with few samples.

Second, quantitative variables such as temperature and time vary widely as word strings, and the learning machine cannot interpret them properly. Even if quantitative variables were machine-learned directly, doing so would require an extremely large number of samples and a great deal of processing time, which is not realistic.

In addition, the generated topics become harder for data analysts to interpret. For example, when the word "3" becomes a keyword that constitutes a topic, it is ambiguous which table the word comes from and what it means there. Quantitative variables are therefore converted into qualitative variables, and the converted qualitative variables are machine-learned.

For example, the weather code "3" is converted to "sunny". This conversion makes efficient machine learning possible with a relatively small number of samples, which saves computing resources. Reasonable learning also becomes possible in a short time, and the computed results become easier for data analysts to understand.

FIG. 4 is a diagram showing an example of a conversion relationship definition table. The conversion relationship definition table 401 has the following fields: transaction table name 402, transaction field name 403, master table name 404, master field name 405, quantity-quality conversion table name 406, and quantity-quality conversion field name 407.

For example, in the first record of the conversion relationship definition table 401, the transaction table name 402 is "Table-B", the transaction field name 403 is "date-time", and the quantity-quality conversion table name 406 is "date, time".

This record defines a correspondence in which the quantitative date-time value in the "date-time" field of the transaction table "Table-B" is converted into qualitative values based on the information in the quantity-quality conversion table "date" and the information in the quantity-quality conversion table "time".

In the third record of the conversion relationship definition table 401, the transaction table name 402 is "Table-B", the transaction field name 403 is "alarm", the master table name 404 is "Table-M1", and the master field name 405 is "F". In the "F" field of the master table "Table-M1", codewords used as alarm codes, such as "1" and "9", are associated with their meanings (qualities).

This record and the master table define a correspondence in which the alarm code in the "alarm" field of the transaction table "Table-B" is converted, based on the "F" field of the master table "Table-M1", into its meaning, for example from "1" to "caution".
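The decoding defined by this record can be sketched as a lookup driven by the conversion relationship definition. This is a minimal illustration: the dictionary layout and the names `CONVERSION_DEFINITIONS`, `MASTER_TABLES`, and `decode_record` are hypothetical, only the "1" to "caution" mapping comes from the text, and leaving unknown codes unchanged is an assumption.

```python
# One conversion relationship definition, mirroring the third record of
# table 401 as described in the text.
CONVERSION_DEFINITIONS = [
    {
        "transaction_table": "Table-B",
        "transaction_field": "alarm",
        "master_table": "Table-M1",
        "master_field": "F",
    },
]

# Master table contents. Only "1" -> "caution" is given in the text;
# the meanings of other codes such as "9" are not stated, so they are omitted.
MASTER_TABLES = {"Table-M1": {"F": {"1": "caution"}}}


def decode_record(table_name, record):
    """Replace codewords in a transaction record with their meanings.

    ASSUMPTION: codes missing from the master table are left unchanged.
    """
    decoded = dict(record)
    for definition in CONVERSION_DEFINITIONS:
        if definition["transaction_table"] != table_name:
            continue
        field = definition["transaction_field"]
        if field not in record:
            continue
        master = MASTER_TABLES[definition["master_table"]]
        code_map = master[definition["master_field"]]
        decoded[field] = code_map.get(record[field], record[field])
    return decoded


if __name__ == "__main__":
    print(decode_record("Table-B", {"alarm": "1"}))
```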

Using such a conversion relationship definition table 401, the learning control module 109 can decode codewords and perform quantity-quality conversion before machine learning. In the example of FIG. 4, the master table and the quantity-quality conversion table are mutually exclusive, but both a master table and a quantity-quality conversion table may be defined.

When both a master table and a quantity-quality conversion table are defined in the conversion relationship definition table 401, one of the tables may be selected according to preset information. Alternatively, the codewords or quantitative variables defined in the master table and in the quantity-quality conversion table may be mutually exclusive, and one of the tables may be selected according to the codeword or quantitative variable to be converted.

FIG. 5 is a diagram showing an example of a quantity-quality conversion table. A quantity-quality conversion table defines a correspondence for converting quantitative variables into qualitative variables. Quantitative variables include date, time, position, distance, price, temperature, and age; they are expressed as numbers and are implicitly continuous.

For example, November 30, 2017 is a Thursday. The date November 30, 2017 is regarded as quantitative, whereas expressions such as "autumn" or "Thursday" are regarded as qualitative. Converting a qualitative variable into a quantitative variable requires special constraints, but the reverse conversion, from quantitative to qualitative, is possible.

Among such quantity-quality conversion tables, the quantity-quality conversion table 501 relates to time and enables conversion from one quantitative variable into a plurality of qualitative variables. The quantity-quality conversion table 501 has the following fields: lower limit 502, upper limit 503, qualitative expression (1) 504, qualitative expression (2) 505, reverse lookup table name 506, and record number 507.

例えば、量質変換テーブル５０１の１番目のレコードは、下限値５０２が「１４：００：００」、上限値５０３が「１４：５９：５９」、質的表現（１）５０４が「昼間」、質的表現（２）５０５が「２時台」である。 For example, in the first record of the quantity conversion table 501, the lower limit value 502 is "14:00:00", the upper limit value 503 is "14:59:59", the qualitative expression (1) 504 is "daytime", and the qualitative expression (2) 505 is "2 o'clock".

このレコードにより、量的表現としての時刻が、「１４：００：００」と「１４：５９：５９」との間に入るのであれば、「昼間」と「２時台」という質的表現に変換するという対応関係が定義されている。また、図４に示した変換関係定義テーブル４０１を逆引きできるように、逆引きテーブル名５０６とレコード番号５０７の各フィールドを備える。 This record defines the correspondence that a time given as a quantitative expression is converted into the qualitative expressions "daytime" and "2 o'clock" if it falls between "14:00:00" and "14:59:59". In addition, the fields of the reverse lookup table name 506 and the record number 507 are provided so that the conversion relationship definition table 401 shown in FIG. 4 can be looked up in reverse.

そして、図５に示した例で、「１４：０６」という特定の時刻は、１番目のレコードと２番目のレコードの「昼間」、「２時台」、および「午後」と３つの質的表現が対応付けられている。 Then, in the example shown in FIG. 5, the specific time "14:06" is associated with three qualitative expressions, "daytime", "2 o'clock", and "afternoon", through the first and second records.
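The range lookup described for the quantity conversion table 501 can be sketched in Python as follows. The first row follows the values given in the text. The bounds of the second row are an assumption, since the text only states that the second record contributes "afternoon". The names `CONVERSION_ROWS` and `to_qualitative` are hypothetical.

```python
from datetime import time

# Rows of a time-related conversion table:
# (lower limit, upper limit, qualitative expressions).
CONVERSION_ROWS = [
    (time(14, 0, 0), time(14, 59, 59), ["daytime", "2 o'clock"]),
    # Assumed bounds: the text only says the second record yields "afternoon".
    (time(12, 0, 0), time(17, 59, 59), ["afternoon"]),
]


def to_qualitative(t: time) -> list:
    """Collect the qualitative expressions of every row whose range contains t."""
    labels = []
    for lower, upper, expressions in CONVERSION_ROWS:
        if lower <= t <= upper:
            labels.extend(expressions)
    return labels


print(to_qualitative(time(14, 6)))  # ['daytime', "2 o'clock", 'afternoon']
```

Because every matching row contributes its expressions, a single quantitative value can map to several qualitative values, as in the "14:06" example.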

量質変換テーブル５０１を用いた変換処理によって、データのおおまかな特徴を機械が学習しやすくなり、少ないサンプルでの機械学習を効率化し、同時に計算資源消費を抑制し、高速な処理を実現できる。 The conversion process using the quantity conversion table 501 makes it easier for the machine to learn the rough features of the data, streamlines machine learning with a small number of samples, and at the same time suppresses the consumption of computational resources and realizes high-speed processing.

文書での単語の仕分けと異なり、トランザクションレコードのデータ値には多様な量的変数が含まれ、かつ量的変数のデータこそがレコードの本質的な情報である。トピックモデルを用いた機械学習において、レコードに含まれるトピックを推定するとき、「１４：０６」という単語の出現頻度で学習するよりも、「午後」あるいは「２時台」などの広い解釈を含めた単語の出現頻度で学習する方が、データ分析支援に寄与しやすい。 Unlike the sorting of words in a document, the data values of a transaction record contain various quantitative variables, and the data of those quantitative variables is the essential information of the record. In machine learning using a topic model, when estimating the topics contained in a record, learning from the frequency of occurrence of words that carry a broader interpretation, such as "afternoon" or "2 o'clock", is more likely to contribute to data analysis support than learning from the frequency of occurrence of the word "14:06".

図６は、データと処理の関係の例を示す図である。データレイク１０３内のトランザクションＤＢ１０６のトランザクションテーブルのデータが、データ分析支援システム１０１へ入力される。図６に示した例で、トランザクションＤＢ１０６には、Ｔａｂｌｅ−Ａ６０１とＴａｂｌｅ−Ｂ６０２の２つのトランザクションテーブルがある。 FIG. 6 is a diagram showing an example of the relationship between data and processing. The data of the transaction table of the transaction DB 106 in the data lake 103 is input to the data analysis support system 101. In the example shown in FIG. 6, the transaction DB 106 has two transaction tables, Table-A601 and Table-B602.

Ｔａｂｌｅ−Ａ６０１は、日付時刻６０３、担当者６０４、場所６０５、および作業６０６の各フィールドを備え、作業者の作業記録が格納される。Ｔａｂｌｅ−Ｂ６０２は、日時６０７、警報６０８、および位置６０９の各フィールドを備え、ある設備のセンサ出力記録が格納される。 The Table-A601 includes fields of date and time 603, person in charge 604, place 605, and work 606, and stores the work record of the worker. The Table-B602 includes fields of date and time 607, alarm 608, and position 609, and stores a sensor output record of a facility.

このように図６の例は、これらの情報源を元にした異種テーブルであるＴａｂｌｅ−Ａ６０１とＴａｂｌｅ−Ｂ６０２が、日時と場所の関係を介して分析支援を行う場合の例である。 As described above, the example of FIG. 6 is an example in which Table-A601 and Table-B602, which are heterogeneous tables based on these information sources, support analysis through the relationship between the date and time and the place.

学習テーブル６１０と学習テーブル６１１は、トピックモデルに基づく学習結果が格納される学習ＤＢ１１４のテーブルである。Ｔａｂｌｅ−Ａ６０１に格納されたデータが、復号され、量的変数から質的変数に変換されて、変換された質的変数の機械学習による学習結果が学習テーブル６１０に格納される。 The learning table 610 and the learning table 611 are tables of the learning DB 114 in which the learning results based on the topic model are stored. The data stored in the Table-A601 is decoded, converted from a quantitative variable to a qualitative variable, and the learning result of the converted qualitative variable by machine learning is stored in the learning table 610.

また、Ｔａｂｌｅ−Ｂ６０２に格納されたデータが、復号され、量的変数から質的変数に変換されて、変換された質的変数の機械学習による学習結果が学習テーブル６１１に格納される。 Further, the data stored in the Table-B602 is decoded, converted from a quantitative variable to a qualitative variable, and the learning result of the converted qualitative variable by machine learning is stored in the learning table 611.

そして、学習テーブル６１０は、トピック番号６１２、キーワード６１３、および出現確率６１４の各フィールドを備え、オブジェクト指向の書式で示されている。学習テーブル６１０も同じ各フィールドを備える。 The learning table 610 includes the fields of topic number 612, keyword 613, and probability of occurrence 614, and is shown in an object-oriented format. The learning table 611 also includes the same fields.

トピック番号６１２の採番では、例えば、１つ目のテーブル（Ｔａｂｌｅ−Ａ６０１）に対する１つ目のトピックを「＃１−１」と番号付ける。トピックは１０個に分けるものとする。ここで、１０個はあらかじめ設定された個数であってもよいし、計算された個数であってもよい。 In the numbering of topic number 612, for example, the first topic for the first table (Table-A601) is numbered "# 1-1". The topics are assumed to be divided into 10. Here, the number 10 may be a preset number or a calculated number.

このようにして、Ｔａｂｌｅ−Ａ６０１とＴａｂｌｅ−Ｂ６０２が、それぞれ１０トピックに仕分けられた場合のトピック群が学習テーブル６１０と学習テーブル６１１にそれぞれ示されている。 In this way, the topic groups when the Table-A601 and the Table-B602 are divided into 10 topics are shown in the learning table 610 and the learning table 611, respectively.

Ｔａｂｌｅ−Ａ６０１が分析されて得られたトピックは１０個である。例えば、トピック番号６１２が「＃１−１」のトピックは、キーワード６１３で示されるように「火曜」、「２時台」、「Ｈａｎｋａ」、および「大阪」などのキーワードによって潜在的な意味を示すトピックである。 The number of topics obtained by analyzing the Table-A601 is 10. For example, a topic whose topic number 612 is "# 1-1" has a potential meaning by keywords such as "Tuesday", "2 o'clock", "Hanka", and "Osaka" as indicated by keyword 613. This is the topic to show.

また、キーワード６１３の各キーワードの出現確率は、出現確率６１４で示されるように「０．３１」、「０．２」、「０．０４」、および「０．０３」などであり、トピック番号６１２が「＃１−１」のトピックでの出現確率の総和は１．０である。 Further, the appearance probabilities of the keywords of keyword 613 are "0.31", "0.2", "0.04", "0.03", and so on, as shown by the appearance probability 614, and the sum of the appearance probabilities in the topic whose topic number 612 is "# 1-1" is 1.0.

すなわち、あるレコードが存在した場合、「火曜」が「０．３１」の出現確率で含まれ、「２時台」が「０．２」の出現確率で含まれるときに、その存在したレコードは、トピック番号６１２が「＃１−１」のトピックに含まれると判定されることが原理的には可能である。 That is, given a record, if it contains "Tuesday" with an appearance probability of "0.31" and "2 o'clock" with an appearance probability of "0.2", the record can in principle be determined to belong to the topic whose topic number 612 is "# 1-1".

しかし、実際のレコードでは「火曜」のデータが含まれるか含まれないか２値的であり、中途半端な出現確率で存在することはない。このように、出現確率は、あくまでもトピックの潜在的な意味を示すものである。 However, in the actual record, it is binary whether the data of "Tuesday" is included or not, and it does not exist with a halfway appearance probability. In this way, the probability of appearance only indicates the potential meaning of the topic.

同様に、Ｔａｂｌｅ−Ｂ６０２が分析されて得られたトピックは１０個ある。例えば、トピック番号が「＃２−２」のトピックは、「火曜」、「警報（３）」、および「大阪」などのキーワードによって潜在的な意味を示すトピックである。また、各キーワードの出現確率は「０．５１」、「０．０７」、「０．０１」などである。 Similarly, there are 10 topics obtained by analyzing Table-B602. For example, a topic with a topic number of "# 2-2" is a topic whose potential meaning is indicated by keywords such as "Tuesday", "alarm (3)", and "Osaka". The appearance probability of each keyword is "0.51", "0.07", "0.01", and the like.

集約テーブル６１５は、後述する組合せ最適化および基底集約の処理を経て、トピック同士の最適な組合せを集約した結果が格納される集約ＤＢ１１５のテーブルである。集約テーブル６１５は、グループ番号６１６、トピック番号６１７、キーワード６１８、および出現頻度６１９を各フィールドに備え、それらはオブジェクト指向の可変長である。 The aggregation table 615 is a table of the aggregation DB 115 that stores the result of aggregating the optimum combinations of topics through the combinatorial optimization and basis aggregation processes described later. The aggregation table 615 includes the fields of group number 616, topic number 617, keyword 618, and frequency of occurrence 619, which are object-oriented and of variable length.

トピックグループは、学習テーブルをまたいだデータの関係性を示す情報である。トピックグループは、ある学習テーブルを構成する複数のトピックの中から１つのトピックが選択され、次の学習テーブルを構成する複数のトピックの中から１つのトピックが選択されて、学習ＤＢ１１４内の学習テーブルから順次に１つずつのトピックが選択されて組み合わせられたものである。 A topic group is information that shows the relationship of data across learning tables. In the topic group, one topic is selected from a plurality of topics constituting a certain learning table, one topic is selected from a plurality of topics constituting the next learning table, and the learning table in the learning DB 114 is selected. One topic is selected and combined in order from.

トピックグループは、グループ番号６１６により識別される。グループ番号６１６が「＃１」のトピックグループは、「Ｔｏｐｉｃ＃１−１」と「Ｔｏｐｉｃ２−２」を含む。これは、「火曜」などのキーワードが一致しており、最適な組合せと判定された組合せである。 Topic groups are identified by group number 616. The topic group whose group number 616 is "# 1" includes "Topic # 1-1" and "Topic2-2". This is a combination that is determined to be the optimum combination because the keywords such as "Tuesday" match.

グループ番号６１６が「＃１」のトピックグループの中で、「Ｈａｎｋａ」と「警報（３）」は元々Ｔａｂｌｅ−Ａ６０１とＴａｂｌｅ−Ｂ６０２すなわち異種テーブルに含まれる情報であり、テーブルの内容を既存のテーブルブラウザ等を用いて概観しているだけでは、両者の関係にデータ分析作業者は気付きにくい。 In the topic group in which the group number 616 is "# 1", "Hanka" and "alarm (3)" are originally the information contained in the Table-A601 and the Table-B602, that is, the heterogeneous table, and the contents of the table are existing. It is difficult for data analysis workers to notice the relationship between the two just by looking at them using a table browser or the like.

これらに加えて、場所と天気の関係を格納した異種テーブルがあったとき、複数のテーブル間の関係はさらに気付きにくい。例えば、この設備の故障を知らせる「警報（３）」は、雨天時に「Ｈａｎｋａ」の部品交換作業が不適切だったために断続的に起きていたものかもしれない。このような因果関係の仮説をデータ分析作業者が思いつくデータ発見の起点となりえる。 In addition, when there is a further heterogeneous table that stores the relationship between place and weather, the relationships among the multiple tables are even harder to notice. For example, "alarm (3)", which reports a failure of this equipment, may have occurred intermittently because the parts replacement work by "Hanka" in rainy weather was improper. This can serve as a starting point for data discovery, from which the data analysis worker comes up with such a causal hypothesis.

図７は、組合せ最適化の例を示す図である。集約制御モジュール１１０は、図６に示した学習テーブル６１０と学習テーブル６１１を入力し、集約テーブル６１５を出力する。このために、集約制御モジュール１１０は、以下で説明する処理を実行する。 FIG. 7 is a diagram showing an example of combinatorial optimization. The aggregation control module 110 inputs the learning table 610 and the learning table 611 shown in FIG. 6, and outputs the aggregation table 615. For this purpose, the aggregation control module 110 executes the process described below.

説明のために、分析対象となる学習テーブル数をＰとする。トピックモデルに基づく機械学習を行うと、１つの学習テーブルは複数のトピック（分析軸）で仕分けられる。ここでの１つの学習テーブルを構成するトピック数をＱとする。 For the sake of explanation, let P be the number of learning tables to be analyzed. When machine learning is performed based on a topic model, one learning table is sorted by a plurality of topics (analysis axes). Let Q be the number of topics that make up one learning table here.

１つのトピックは、複数のキーワードと、対応する出現確率のデータを含む。ここでのトピックを構成するキーワード数をＲとする。ここでは説明の簡単化のために、どの学習テーブルも同じトピック数のトピックを含み、どのトピックも同じキーワード数のキーワードを持つとするが、同じでなくともよい。 One topic contains multiple keywords and corresponding probability of occurrence data. Let R be the number of keywords that make up the topic here. Here, for the sake of simplicity, it is assumed that all learning tables contain topics with the same number of topics, and all topics have keywords with the same number of keywords, but they do not have to be the same.

ｐ番目の学習テーブルのｑ番目のトピックを構成するｒ番目のキーワードをｗ（ｐ，ｑ，ｒ）とし、対応するキーワードの出現確率をφ（ｐ，ｑ，ｒ）とする。キーワードｗは出現確率の高いもの順にソートされたものである。ｗ（ｐ，ｑ，１）は、ｐ番目のテーブルのｑ番目のトピック内で最頻のキーワードである。 Let w (p, q, r) be the r-th keyword constituting the q-th topic of the p-th learning table, and let φ (p, q, r) be the appearance probability of the corresponding keyword. The keyword w is sorted in descending order of appearance probability. w (p, q, 1) is the most frequent keyword in the qth topic of the pth table.

あるトピックを構成するキーワードの出現確率の総和は１である。以下で単に出現確率と呼ぶ場合もキーワードの出現確率を示す。出現確率は最大値１最小値０の実数形式である。キーワードは可変長文字列形式である。 The sum of the appearance probabilities of the keywords that make up a topic is 1. The probability of appearance of a keyword is also shown below when it is simply called the probability of appearance. The appearance probability is a real number format with a maximum value of 1 and a minimum value of 0. Keywords are in variable length string format.

例えば、図６に示した例において、ｐ＝１の学習テーブル６１０を分析し、ｑ＝１のトピックとしてトピック番号６１２の「＃１−１」のトピックが得られた時、出現確率６１４が最大「０．３１」となるキーワード６１３は「火曜」である。このため、ｗ（ｐ，ｑ，１）＝「火曜」、φ（ｐ，ｑ，１）＝「０．３１」である。 For example, in the example shown in FIG. 6, when the learning table 610 with p = 1 is analyzed and the topic “# 1-1” with topic number 612 is obtained as the topic with q = 1, the appearance probability 614 is the maximum. The keyword 613 that becomes "0.31" is "Tuesday". Therefore, w (p, q, 1) = "Tuesday" and φ (p, q, 1) = "0.31".

トピックグループは複数の学習テーブルのトピックについてグループ分けが行なわれたものである。ここで、トピックグループの数は学習テーブルのトピック数と同じＱとする。 A topic group is a grouping of topics in multiple learning tables. Here, the number of topic groups is Q, which is the same as the number of topics in the learning table.

ｋ番目の学習テーブルのｊ番目のトピックがｉ番目のトピックグループに属するか否かを示す隣接行列ｘが定義できる。隣接行列ｘの要素値ｘ（ｉ，ｊ，ｋ）は、値が「１」であれば、対応する要素がトピックグループに属し、値が「０」であれば、対応する要素がトピックグループに属さない。 An adjacency matrix x can be defined that indicates whether the j-th topic of the k-th learning table belongs to the i-th topic group. For the element value x (i, j, k) of the adjacency matrix x, if the value is "1", the corresponding element belongs to the topic group, and if the value is "0", the corresponding element belongs to the topic group. Does not belong.

目的関数は、出現確率の高いキーワードが互いに多く含まれたトピック同士が同一トピックグループに属するように定義される。すなわち、図７に示すように、トピックグループを構成する複数トピックにおいて、トピック同士のキーワードが一致する範囲で出現確率の積の総和を計算し、その総和を最大化することを最適化目的とする。 The objective function is defined so that topics that share many keywords with high appearance probabilities belong to the same topic group. That is, as shown in FIG. 7, for the topics constituting a topic group, the sum of the products of the appearance probabilities is calculated over the range where the keywords of the topics match, and the optimization objective is to maximize that sum.

キーワードが一致する範囲の判定のために、文字列照合する別関数を定義することによって、目的関数が隣接行列ｘの１次式として定式化されてもよい。キーワードの一致ではなく、あらかじめ設定された何らかの基準に基づくキーワードの近似であってもよい。キーワードｗが出現確率の高いもの順にソートされていることを利用し、出現確率の低いキーワードｗに関する計算は打ち切られてもよい。 The objective function may be formulated as a linear expression of the adjacency matrix x by defining another function for string matching to determine the range in which the keywords match. Instead of matching keywords, it may be an approximation of keywords based on some preset criteria. By utilizing the fact that the keyword w is sorted in descending order of appearance probability, the calculation regarding the keyword w having a low appearance probability may be terminated.

トピックのトピックグループへの割当てが適切になされた隣接行列では、対応関係のある要素値だけが「１」となり、「１」となる要素値と同じ行の他の要素値、および「１」となる要素値と同じ列の他の要素値は「０」となる。すなわち、図７に示すように制約条件として、隣接行列の行方向の値の総和は「１」であり、列方向の値の総和も「１」であり、要素値ｘは２値である。 In an adjacency matrix where topics are properly assigned to topic groups, only the associated element values are "1", the other element values in the same row as the element value that is "1", and "1". The other element value in the same column as the element value is "0". That is, as shown in FIG. 7, as a constraint condition, the sum of the values in the row direction of the adjacency matrix is "1", the sum of the values in the column direction is also "1", and the element value x is binary.
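FIG. 7 is not reproduced in this text, so the following LaTeX is only a reconstruction from the surrounding description, written for the case of two learning tables (P = 2). It assumes that topic i of the first table defines topic group i, that is, x(i, i, 1) = 1, and introduces a match score s(i, j) that is not named in the original.

```latex
\begin{align*}
  s(i, j) &= \sum_{r=1}^{R} \sum_{r'=1}^{R}
      \mathbf{1}\bigl[\, w(1, i, r) = w(2, j, r') \,\bigr]\,
      \phi(1, i, r)\, \phi(2, j, r') \\
  \text{maximize} \quad & \sum_{i=1}^{Q} \sum_{j=1}^{Q} x(i, j, 2)\, s(i, j) \\
  \text{subject to} \quad
      & \sum_{i=1}^{Q} x(i, j, k) = 1 \quad \text{for all } j, k, \\
      & \sum_{j=1}^{Q} x(i, j, k) = 1 \quad \text{for all } i, k, \\
      & x(i, j, k) \in \{0, 1\}.
\end{align*}
```

Under these assumptions the objective is linear in x, because the string matching is confined to the precomputed score s(i, j).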

以上の定式化で得られる最適化問題は０１整数計画問題である。この定式化によって、一般的な最適化ソルバを用いて高速に最適解の隣接行列ｘを得ることができる。例えば学生のクラス分けに用いられるような別解法の手順が適用されてもよい。 The optimization problem obtained by the above formulation is a 0-1 integer programming problem. With this formulation, the adjacency matrix x of the optimum solution can be obtained at high speed using a general-purpose optimization solver. An alternative solution procedure, such as one used for assigning students to classes, may also be applied.
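For two learning tables, the optimization can be sketched as an assignment problem. The Python below is a minimal illustration that enumerates all permutations instead of calling an optimization solver, so it is only practical for a small number of topics. The function names and every probability not quoted in the text are hypothetical.

```python
from itertools import permutations


def match_score(topic_a: dict, topic_b: dict) -> float:
    """Sum of probability products over the keywords shared by two topics."""
    return sum(p * topic_b[w] for w, p in topic_a.items() if w in topic_b)


def best_grouping(table_a: list, table_b: list):
    """Return (perm, score); perm[i] is the topic of table_b grouped with topic i of table_a."""
    q = len(table_a)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(q)):  # each permutation is one feasible 0-1 assignment
        score = sum(match_score(table_a[i], table_b[perm[i]]) for i in range(q))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Each topic maps keyword -> appearance probability (only the top keywords are listed).
table_a = [
    {"Tuesday": 0.31, "2 o'clock": 0.2, "Hanka": 0.04, "Osaka": 0.03},
    {"Friday": 0.4, "Tokyo": 0.1},          # hypothetical topic
]
table_b = [
    {"Friday": 0.3, "alarm (1)": 0.2},      # hypothetical topic
    {"Tuesday": 0.51, "alarm (3)": 0.07, "Osaka": 0.01},
]

perm, score = best_grouping(table_a, table_b)
print(perm)  # (1, 0): the first topic of table_a is grouped with the second topic of table_b
```

A permutation satisfies the row and column sum constraints by construction. For larger Q, a dedicated assignment or integer programming solver would replace the enumeration.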

図８は、データ分析支援システムの処理フローの例を示す図である。図８に示した処理フローは、統括制御モジュール１０８により全体が統括される。まず、統括制御モジュール１０８の制御により、対話制御モジュール１１１がユーザからの入力（要求）待ち状態になる。 FIG. 8 is a diagram showing an example of a processing flow of the data analysis support system. The entire processing flow shown in FIG. 8 is controlled by the integrated control module 108. First, under the control of the integrated control module 108, the dialogue control module 111 is put into an input (request) waiting state from the user.

ステップ８０１において、対話制御モジュール１１１はユーザからの入力を受け付け、受け付けた入力を統括制御モジュール１０８に渡す。統括制御モジュール１０８は入力に基づいて管理ＤＢ１１２を読み込み、初期設定し、トランザクションＤＢ１０６を開く。 In step 801 the dialogue control module 111 receives an input from the user and passes the received input to the overall control module 108. The integrated control module 108 reads the management DB 112 based on the input, makes initial settings, and opens the transaction DB 106.

ステップ８０２において、統括制御モジュール１０８は、管理ＤＢ１１２の変換関係定義テーブル４０１を参照して、符号語を復号するテーブル関係を設定する。符号語を復号するテーブル関係は、学習制御モジュール１０９により設定されてもよい。 In step 802, the overall control module 108 sets a table relationship for decoding the codeword with reference to the conversion relationship definition table 401 of the management DB 112. The table relation for decoding the codeword may be set by the learning control module 109.

ステップ８０３において、学習制御モジュール１０９は、量質変換ＤＢ１１３の量質変換テーブル５０１を参照して、量的変数を質的変数に変換するテーブル関係を設定する。これらの符号語を復号するテーブル関係の設定と、量的変数を質的変数に変換するテーブル関係を設定により、以下の処理にて復号と変換が可能になる。 In step 803, the learning control module 109 refers to the quantity conversion table 501 of the quantity conversion DB 113 and sets a table relationship for converting the quantitative variable into the qualitative variable. By setting the table relation for decoding these codewords and the table relation for converting quantitative variables to qualitative variables, decoding and conversion can be performed by the following processing.

ステップ８０４において、学習制御モジュール１０９は、開かれたトランザクションＤＢ１０６の中で処理対象となるトランザクションテーブルそれぞれについてステップ８０５、８０６、８０９の処理をそれぞれ実行するように繰り返す。 In step 804, the learning control module 109 repeats the processing of steps 805, 806, and 809 for each transaction table to be processed in the opened transaction DB 106.

ステップ８０５において、学習制御モジュール１０９は、質的変数の数に基づいてトピック数Ｑを算出する。トピック数Ｑは、あらかじめユーザにより設定されてもよいし、データサンプルを用いて決められてもよい。データサンプルは、ステップ８０４の繰り返しの中でｉ番目のトランザクションテーブルから取得されてもよく、このために符号語が復号されたり、量的変数が質的変数に変換されたりしてもよい。 In step 805, the learning control module 109 calculates the number of topics Q based on the number of qualitative variables. The number of topics Q may be set in advance by the user or may be determined using a data sample. The data sample may be obtained from the i-th transaction table in the iteration of step 804, for which the codeword may be decoded or the quantitative variable may be converted to a qualitative variable.

ここで、本実施例固有の構成は、量質変換テーブル５０１を参照してトピック数Ｑを決めることである。例えば、質的変数として曜日だけに着目する場合、Ｑ≧７とすることによって曜日ごとの傾向が浮かび上がりやすくなる。 Here, the configuration peculiar to this embodiment is to determine the number of topics Q with reference to the quantity conversion table 501. For example, when focusing only on the day of the week as a qualitative variable, setting Q ≧ 7 makes it easier for the tendency for each day of the week to emerge.

場所を表す量的変数も量質変換されうる。量的変数の内容にかかわらず、（（質的変数の数）＋１）を用いると、分析した特徴成分が質的変数を１つずつ含み、かつ、雑音となる成分が残りの特徴成分に集まるため、注目した質的変数で分類する効果が高まる。 Quantitative variables representing places can also be converted into qualitative variables. Regardless of the content of the quantitative variables, using ((number of qualitative variables) + 1) as the number of topics makes each analyzed feature component contain one qualitative variable while the noise components gather in the remaining feature component, which strengthens the effect of classifying by the qualitative variables of interest.

量質変換テーブルにて、時間をｘ通りに仕分け、場所をｙ通りに仕分けるデータを用意し、関数ｆ（）：Ｑ＝ｆ（ｘ，ｙ）＝ｍａｘ（ｘ，ｙ）＋１、を定義してトピック数Ｑを設定する。ｘは、量質変換テーブル（日付）があらかじめ作成され、そのテーブルの行数が数えられた値である。ｙは、量質変換テーブル（場所）があらかじめ作成され、そのテーブルの行数が数えられた値である。 In the quantity conversion table, prepare data for sorting time in x ways and places in y ways, and define a function f (): Q = f (x, y) = max (x, y) + 1. And set the number of topics Q. x is a value in which a quantity conversion table (date) is created in advance and the number of rows in the table is counted. y is a value in which a quantity conversion table (location) is created in advance and the number of rows in the table is counted.

さらに、ｚを量質変換テーブル（時刻）の行数として、関数ｇ（）：Ｑ＝ｇ（ｘ，ｙ，ｚ）＝ｍａｘ（ｘ，ｙ，ｚ）＋１を定義して、トピック数Ｑが設定されてもよい。あるいは、トピック数Ｑが、Ｑ＝ｘ＊ｙ＊ｚのように組み合わせ数の値であると、ユーザは、時空間の組合せに基づく分析結果を網羅的に見ることができる。 Furthermore, the function g (): Q = g (x, y, z) = max (x, y, z) + 1 is defined with z as the number of rows in the quantity conversion table (time), and the number of topics Q is It may be set. Alternatively, when the number of topics Q is a value of the number of combinations such as Q = x * y * z, the user can comprehensively see the analysis result based on the combination of space-time.
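The two ways of setting the number of topics Q described above can be written as a short Python sketch. The function names and the example row counts (7, 3, and 24) are assumptions chosen for illustration.

```python
import math


def topic_count_max(*row_counts: int) -> int:
    """Q = max(x, y, ...) + 1; the extra topic is meant to absorb noise components."""
    return max(row_counts) + 1


def topic_count_product(*row_counts: int) -> int:
    """Q = x * y * z; one topic per combination of qualitative values."""
    return math.prod(row_counts)


# Hypothetical row counts of the date, place, and time conversion tables.
x, y, z = 7, 3, 24
print(topic_count_max(x, y))         # 8, corresponds to f(x, y)
print(topic_count_max(x, y, z))      # 25, corresponds to g(x, y, z)
print(topic_count_product(x, y, z))  # 504
```

The product form grows quickly, so it trades computation for exhaustive coverage of the space-time combinations.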

ステップ８０６において、学習制御モジュール１０９は、ステップ８０４の繰り返しの中でｉ番目のトランザクションテーブルのレコードのそれぞれについてステップ８０７、８０８の処理をそれぞれ実行するように繰り返す。 In step 806, the learning control module 109 repeats the process of steps 807 and 808 for each of the records in the i-th transaction table in the repetition of step 804.

ステップ８０７において、学習制御モジュール１０９は、ステップ８０６の繰り返しのｊ番目のレコードに対応する入力ベクトルを作成する。この入力ベクトルの作成では、符号語を復号して量的変数を質的変数に変換しつつ、キーワードの集合が作成される。なお、ｊ番目のレコードに量的変数が含まれている場合は、復号がスキップされてもよい。 In step 807, the learning control module 109 creates an input vector corresponding to the jth record of the repetition of step 806. In the creation of this input vector, a set of keywords is created while decoding codewords and converting quantitative variables into qualitative variables. If the j-th record contains a quantitative variable, decoding may be skipped.
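The creation of the input vector in step 807 can be sketched as follows. This is a minimal Python illustration that assumes a record is a dictionary, that codewords are decoded through a master table, and that only date-time values are converted to qualitative expressions. The codewords "P001" and "L07" and all function names are hypothetical.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical master table that decodes codewords into readable values.
MASTER_TABLE = {"P001": "Hanka", "L07": "Osaka"}


def hour_label(dt: datetime) -> str:
    """Label the hour on a 12-hour clock, so 14:06 becomes "2 o'clock"."""
    return f"{dt.hour % 12 or 12} o'clock"


def make_input_vector(record: dict) -> set:
    """Build the keyword set for one record."""
    keywords = set()
    for value in record.values():
        if isinstance(value, datetime):
            # Quantitative variable: convert to qualitative expressions, no decoding needed.
            keywords.add(WEEKDAYS[value.weekday()])
            keywords.add(hour_label(value))
        else:
            # Codeword: decode if it is registered, otherwise keep the value as-is.
            keywords.add(MASTER_TABLE.get(value, value))
    return keywords


record = {
    "datetime": datetime(2017, 11, 28, 14, 6),
    "person": "P001",
    "place": "L07",
    "work": "parts replacement",
}
print(sorted(make_input_vector(record)))
```

The resulting set plays the role that the bag of words plays for a text document in an ordinary topic model.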

ステップ８０８において、学習制御モジュール１０９は、トピック数Ｑと作成した入力ベクトルを用いて機械学習する。なお、機械学習の処理そのものは、学習制御モジュール１０９が実行してもよいし、データ分析支援システム１０１の外部で実行されてもよく、外部で実行する場合の学習制御モジュール１０９は、その外部へトピック数Ｑと入力ベクトルの情報を送り、学習結果を得てもよい。その機械学習の処理については図１０を用いて後で詳細に説明する。 In step 808, the learning control module 109 performs machine learning using the number of topics Q and the created input vector. The machine learning process itself may be executed by the learning control module 109 or outside the data analysis support system 101. When it is executed externally, the learning control module 109 may send the number of topics Q and the input vector to the external system and obtain the learning result. The machine learning process will be described in detail later with reference to FIG. 10.

ステップ８０９において、学習制御モジュール１０９は、ステップ８０７、８０８の繰り返しが終了すると、その繰り返しにより得られたｉ番目のトランザクションテーブルの学習結果を保存する。 In step 809, when the repetition of steps 807 and 808 is completed, the learning control module 109 stores the learning result of the i-th transaction table obtained by the repetition.

ステップ８１０において、集約制御モジュール１１０は、トピック群の組合せを最適化する。図７を用いて説明した通り、最適解は図７に示す隣接行列ｘの形式である。隣接行列ｘの要素値を参照して、ｘ（ｉ，ｊ，ｋ）＝１であれば、ｋ番目の学習テーブルのｊ番目のトピックはｉ番目のトピックグループに属すると判定する。 In step 810, the aggregate control module 110 optimizes the combination of topic groups. As described with reference to FIG. 7, the optimal solution is in the form of the adjacency matrix x shown in FIG. With reference to the element value of the adjacency matrix x, if x (i, j, k) = 1, it is determined that the j-th topic of the k-th learning table belongs to the i-th topic group.

ステップ８１１において、集約制御モジュール１１０は、トピックグループを構成する各種のデータ（トピック、キーワード、および出現確率）を集約する。この集約のために、集約制御モジュール１１０は、第１に、おのおののトピックグループを示す索引ｉ（０≦ｉ≦Ｑ）について、トピックリストをメモリ領域に確保し、以下の第２と第３の処理手順を繰り返す。 In step 811 the aggregation control module 110 aggregates various data (topics, keywords, and appearance probabilities) that make up the topic group. For this aggregation, the aggregation control module 110 first allocates a topic list in the memory area for the index i (0≤i≤Q) indicating each topic group, and the following second and third Repeat the processing procedure.

第２に、集約制御モジュール１１０は、ｉ番目のトピックグループに含まれるｋ番目の学習テーブルのｊ番目のトピックを構成するr番目の構成要素、すなわちキーワードｗ（ｋ，ｊ，ｒ）と出現確率φ（ｋ，ｊ，ｒ）をトピックリストに加えてメモリに保存する。 Second, the aggregate control module 110 includes the r-th component that constitutes the j-th topic of the k-th learning table included in the i-th topic group, that is, the keyword w (k, j, r) and the appearance probability. Add φ (k, j, r) to the topic list and save it in memory.

第３に、集約制御モジュール１１０は、保存時にキーワードの重複を避けて統合する。例えば、トピックリストにあるキーワードと同じキーワードが別の学習テーブルでの構成要素であった場合、２つの出現確率の平均値を、そのキーワードの出現確率として計算して保存することにより、トピックリスト内のキーワードは、そのトピックリスト内で一意となる。 Thirdly, the aggregation control module 110 integrates by avoiding duplication of keywords at the time of storage. For example, if the same keyword as the keyword in the topic list is a component in another learning table, the average value of the two occurrence probabilities is calculated and saved as the appearance probability of the keyword in the topic list. Keywords are unique within the topic list.

これによって、それぞれの学習テーブルごとの分析結果が集約され、学習テーブルをまたいだデータ関係が顕在化する。そして、集約制御モジュール１１０は、メモリに保存した情報すなわち集約した各種のデータを、集約テーブルとして集約ＤＢ１１５に格納する。 As a result, the analysis results for each learning table are aggregated, and the data relationship across the learning tables becomes apparent. Then, the aggregation control module 110 stores the information stored in the memory, that is, various aggregated data, in the aggregation DB 115 as an aggregation table.
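The second and third steps of the aggregation can be sketched in Python as below. Each topic is assumed to be a dictionary from keyword to appearance probability, and the function name `aggregate_group` is hypothetical. The keyword probabilities are those quoted for "# 1-1" and "# 2-2" in the text.

```python
def aggregate_group(topics: list) -> dict:
    """Merge the topics of one topic group into a keyword -> probability list."""
    merged = {}
    for topic in topics:
        for keyword, prob in topic.items():
            if keyword in merged:
                # Duplicate keyword: store the average of the two probabilities.
                merged[keyword] = (merged[keyword] + prob) / 2
            else:
                merged[keyword] = prob
    return merged


topic_1_1 = {"Tuesday": 0.31, "2 o'clock": 0.2, "Hanka": 0.04, "Osaka": 0.03}
topic_2_2 = {"Tuesday": 0.51, "alarm (3)": 0.07, "Osaka": 0.01}

group_1 = aggregate_group([topic_1_1, topic_2_2])
print(sorted(group_1))  # each keyword appears once in the topic list
```

With more than two learning tables, this pairwise averaging weights later tables more heavily than a true mean would. The text only describes the two-value case, so the sketch keeps that behavior.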

ステップ８１２において、対話制御モジュール１１１は、分析結果と対話入力領域をユーザ端末１０２のディスプレイに出力する。分析結果は、集約ＤＢ１１５から得た情報であり、トピックリストのキーワードと出現確率を含む。さらに、対話入力領域には、キーワードに並列してユーザ選択を受け付けるチェックボックスを含み、ユーザからの入力を受け付ける。 In step 812, the dialogue control module 111 outputs the analysis result and the dialogue input area to the display of the user terminal 102. The analysis result is the information obtained from the aggregated DB 115, and includes the keyword of the topic list and the appearance probability. Further, the interactive input area includes a check box for accepting user selection in parallel with the keyword, and accepts input from the user.

ユーザが複数のチェックボックスを選択し、ユーザ端末１０２を用いて関連情報出力の指示を入力すると、対話制御モジュール１１１は、選択されたチェックボックスの情報を受け付けて、逆引き情報が検索されて、関連情報としてユーザ端末１０２に出力する。 When the user selects a plurality of check boxes and inputs an instruction to output related information using the user terminal 102, the dialogue control module 111 accepts the information of the selected check boxes, the reverse lookup information is searched, and the dialogue control module 111 outputs it to the user terminal 102 as related information.

図５を用いて、チェックされたキーワードが「２時台」だった場合の関連情報の出力について説明する。統括制御モジュール１０８は、対話制御モジュール１１１から「２時台」のキーワードと関連情報の検索の要求を受け取り、受け取ったキーワードに基づいて量質変換テーブル５０１を参照し、受け取ったキーワードと同じ質的表現を含むレコードを検索する。 The output of related information when the checked keyword is "2 o'clock" will be described with reference to FIG. The integrated control module 108 receives a request for searching the keyword "2 o'clock" and related information from the dialogue control module 111, refers to the quantity conversion table 501 based on the received keyword, and has the same qualitative as the received keyword. Search for records that contain expressions.

この検索により、統括制御モジュール１０８は、１番目のレコードの質的表現（２）５０５に「２時台」を見つけて、見つかった１番目のレコードに登録された逆引きテーブル名５０６の「Ｔａｂｌｅ−Ｂ」とレコード番号５０７の「１」を得る。そして、統括制御モジュール１０８は、トランザクションＤＢ１０６から「Ｔａｂｌｅ−Ｂ」というトランザクションテーブルの１番目のレコードを検索する。 By this search, the integrated control module 108 finds "2 o'clock" in the qualitative expression (2) 505 of the first record, and "Table" of the reverse lookup table name 506 registered in the first record found. -B "and" 1 "of record number 507 are obtained. Then, the integrated control module 108 searches the transaction DB 106 for the first record of the transaction table called "Table-B".

統括制御モジュール１０８は、検索により見つけた１番目のレコードを得て、対話制御モジュール１１１へ送る。対話制御モジュールは受け取った１番目のレコードを、量質変換前の関連情報としてユーザ端末１０２に出力する。なお、逆引きテーブル名５０６とレコード番号５０７の情報が、関連情報としてユーザ端末１０２に出力されてもよい。 The integrated control module 108 obtains the first record found by the search and sends it to the interactive control module 111. The dialogue control module outputs the first received record to the user terminal 102 as related information before the quantity conversion. The information of the reverse lookup table name 506 and the record number 507 may be output to the user terminal 102 as related information.
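The reverse lookup from a checked keyword back to the original transaction record can be sketched as follows. The conversion row follows the first record of the quantity conversion table 501 as described in the text. The content of the transaction record and all Python names are hypothetical.

```python
# First record of the conversion table, including its reverse lookup fields.
CONVERSION_TABLE = [
    {"lower": "14:00:00", "upper": "14:59:59",
     "expr1": "daytime", "expr2": "2 o'clock",
     "reverse_table": "Table-B", "record_no": 1},
]

# Hypothetical transaction DB: table name -> record number -> record.
TRANSACTION_DB = {
    "Table-B": {
        1: {"datetime": "2017-11-28 14:06", "alarm": "alarm (3)", "position": "Osaka"},
    },
}


def reverse_lookup(keyword: str) -> list:
    """Return (table name, record number, record) for every row matching the keyword."""
    results = []
    for row in CONVERSION_TABLE:
        if keyword in (row["expr1"], row["expr2"]):
            table, number = row["reverse_table"], row["record_no"]
            results.append((table, number, TRANSACTION_DB[table][number]))
    return results


for table, number, rec in reverse_lookup("2 o'clock"):
    print(table, number, rec)
```

Returning the table name and record number together with the record mirrors the option of showing the reverse lookup fields themselves as related information.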

図９は、ユーザ対話画面の例を示す図である。図９に示した画面９０１は、分析結果出力領域９０２と対話入力領域９０３を含み、図８を用いて説明したステップ８１２において作成出力される。図９に示した例において、分析結果出力領域９０２には、「Ｇｒｏｕｐ１」すなわちトピックグループ１の分析結果として、「火曜０．３１」のようにキーワードと出現確率のペアが列挙される。 FIG. 9 is a diagram showing an example of a user dialogue screen. The screen 901 shown in FIG. 9 includes an analysis result output area 902 and an interactive input area 903, and is created and output in step 812 described with reference to FIG. In the example shown in FIG. 9, in the analysis result output area 902, as the analysis result of "Group 1", that is, the topic group 1, a pair of a keyword and an appearance probability is listed as "Tuesday 0.31".

図９に示した例では、端的には、「火曜」が含まれるレコードはトピックグループ１に属する可能性が高いことを示している。また、「Ｇｒｏｕｐ２」のトピックグループ２には別のキーワードと出現確率が列挙されている。同一のキーワードが複数のトピックグループに出現してもよい。 In the example shown in FIG. 9, it is simply shown that the record including "Tuesday" is likely to belong to the topic group 1. In addition, another keyword and the appearance probability are listed in the topic group 2 of "Group2". The same keyword may appear in multiple topic groups.

対話入力領域９０３はチェックボックスであり、ユーザがチェックボックスにチェックをつけて、キーワードを選択することが可能な入力領域である。図９に示した例では、「火曜」、「大阪」、および「警報（３）」それぞれのチェックボックスにチェックがついており、これらのキーワードがユーザにより選択された状態を示す。 The interactive input area 903 is a check box, which is an input area in which the user can check the check box and select a keyword. In the example shown in FIG. 9, the check boxes of "Tuesday", "Osaka", and "Alarm (3)" are checked, indicating that these keywords are selected by the user.

逆引き情報９０４は、分析の基となった情報を示し、関連情報としてステップ８１２で作成されて出力される。量質変換により「火曜」と解釈された数値が含まれるレコードは、「トランザクションテーブルＴａｂｌｅ−Ｂ」のレコード番号が１、４、５などのレコードであり、さらに「保守テーブルＴａｂｌｅ−Ｈ０」のレコード番号が２、３、４などのレコードである。 The reverse lookup information 904 indicates the information on which the analysis is based, and is created and output as related information in step 812. The record containing the numerical value interpreted as "Tuesday" by the quantity conversion is the record whose record number of "transaction table Table-B" is 1, 4, 5, etc., and further, the record of "maintenance table Table-H0". The numbers are records such as 2, 3, and 4.

データ分析作業者は、図９に示した以上の情報から「警報（３）」が「火曜」、「大阪」、および「晴」と同時に表れるという傾向を見いだせる。元々の警報情報が含まれた「トランザクションテーブルＴａｂｌｅ−Ｂ」だけを分析していても、「警報（３）」は不定期に各地で頻発していて、特定の曜日や場所に注目しにくい場合もあり得る。 From the above information shown in FIG. 9, the data analysis worker can find a tendency for "alarm (3)" to appear together with "Tuesday", "Osaka", and "sunny". If only the "transaction table Table-B", which contains the original alarm information, is analyzed, "alarm (3)" may appear to occur irregularly and frequently in various places, making it hard to focus on a specific day of the week or place.

このような場合であって、特定の作業者の保守作業方法が不適切であり、特定の天候のときにのみ異常が生じるという複合要因による警報の原因を調べる場合、従来は、そのような仮説をデータ分析作業者が考えて、関係する異種データを自力で調べなければならなかった。 In such a case, when investigating the cause of an alarm that arises from compound factors, such as a specific worker's maintenance method being improper and the abnormality occurring only in specific weather, the data analysis worker conventionally had to come up with such a hypothesis and examine the related heterogeneous data unaided.

その上、このような仮説の組合せは極めて多く、データ分析作業者の作業時間は極めて大きくなる傾向がある。本実施例では、そのような仮説の可能性について、トピックグループを提示することにより、データ分析作業者にリコメンドできる。データ分析作業者の作業時間が大幅に削減することが可能になる。 Moreover, there are numerous combinations of such hypotheses, and the working hours of data analysts tend to be extremely long. In this example, the possibility of such a hypothesis can be recommended to data analysts by presenting a topic group. It is possible to significantly reduce the work time of data analysis workers.

すなわち、「トランザクションテーブルＴａｂｌｅ−Ｂ」、「保守テーブルＴａｂｌｅ−Ｈ０」、および天候テーブルなどの異種テーブルを集約して分析することにより、異種テーブル間に分散していた有用なパターンを、データ分析作業者に見つけやすくさせることが可能になる。 That is, by aggregating and analyzing different types of tables such as "transaction table Table-B", "maintenance table Table-H0", and weather table, useful patterns distributed among different types of tables can be analyzed by data analysis work. It makes it easier for people to find it.

図１０を用いてステップ８０８の機械学習の処理フローについて詳細に説明する。 The machine learning processing flow of step 808 will be described in detail with reference to FIG. 10.

機械学習への入力情報は、トピック出現頻度の偏りを示すパラメータα、語出現頻度の偏りを示すパラメータβ、サンプリング繰り返し回数上限のパラメータＳ、トピック総数Ｋ、レコードデータの集合ｒである。 The input information for machine learning is a parameter α indicating the bias of the topic appearance frequency, a parameter β indicating the bias of the word appearance frequency, a parameter S giving the upper limit of the number of sampling repetitions, the total number of topics K, and a set r of record data.

A stationary distribution with fixed values may be used as the initial values of the parameters α and β. The parameter S, the upper limit on the number of iterations, may be a fixed value. In addition, the following four assumptions are made about the variables: (1) the appearance probability of topics in a record follows a Dirichlet distribution with parameter α; (2) the appearance probability of data values in the k-th topic follows a Dirichlet distribution with parameter β; (3) the latent variable that generated the i-th data value in the d-th record follows a multinomial distribution; and (4) the i-th data value in the d-th record follows a multinomial distribution.
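These four assumptions match the generative model commonly known as latent Dirichlet allocation. Written out, with θ_d (topic probabilities of record d) and z_{d,i} (latent topic of the i-th data value) introduced here as assumed symbols, since the text itself names only α, β, φ, and w:

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{(assumption 1)}\\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{(assumption 2)}\\
z_{d,i} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) && \text{(assumption 3)}\\
w_{d,i} \mid z_{d,i}, \phi &\sim \mathrm{Multinomial}(\phi_{z_{d,i}}) && \text{(assumption 4)}
\end{aligned}
```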

In addition, two statistics are defined: (1) the number of times topic k appears in the d-th record, and (2) the number of times, over all records, that the k-th topic has been estimated for data value index v.

In step 1001, the subsequent processing is repeated for all the tables. In step 1002, the quantitative data values of all the records are classified and converted into qualitative data values. The same number of records may be handled for every table. In addition, an index v that uniquely corresponds to each data value w is created.
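Step 1002 can be illustrated with a short Python sketch that bins quantitative values into qualitative labels and then assigns each distinct data value w an index v. The temperature bounds, the labels, and the function names are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical conversion table: values below 10.0 are "cold",
# from 10.0 up to (not including) 25.0 are "mild", and the rest are "hot".
TEMPERATURE_BOUNDS = [10.0, 25.0]
TEMPERATURE_LABELS = ["cold", "mild", "hot"]

def to_qualitative(value, bounds, labels):
    """Map a quantitative value to the label of the range it falls in."""
    return labels[bisect_right(bounds, value)]

def build_index(values):
    """Assign each distinct data value w a unique index v, in order of first appearance."""
    index = {}
    for w in values:
        index.setdefault(w, len(index))
    return index

# Two records, each holding quantitative values of one field.
records = [[3.5, 18.0], [27.2, 18.0]]
words = [[to_qualitative(x, TEMPERATURE_BOUNDS, TEMPERATURE_LABELS) for x in rec]
         for rec in records]
index = build_index(w for rec in words for w in rec)
ids = [[index[w] for w in rec] for rec in words]

print(words)
print(index)
print(ids)
```

The resulting lists of indices are the form of input that a topic-model sampler typically consumes.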

In step 1003, the subsequent processing is repeated. In step 1004, the subsequent processing is repeated for all the records. In step 1005, the subsequent processing is repeated for all the data value records. In step 1006, the latent variables are sampled based on the above assumptions. In step 1007, the two statistics are updated.

In step 1008, the parameters α and β are updated. In step 1009, the appearance probability φ of the data values in the k-th topic is output, together with the data values w that have been created. By repeating the above processing, the initially input parameters α and β can be updated and made to converge, improving their accuracy based on the statistical properties of the actual data. In step 1010, combinatorial optimization is performed over the plurality of topics (w and φ).

Steps 1003 through 1009 are equivalent to the processing of step 808 and are based on the existing technique of collapsed (marginalized) Gibbs sampling.
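The loop of steps 1003 to 1007 and the output of step 1009 can be sketched as a textbook collapsed Gibbs sampler. This is a minimal illustration rather than the implementation of the embodiment: it assumes scalar (symmetric) α and β, omits the hyperparameter update of step 1008, and uses hypothetical names (`collapsed_gibbs`, `n_dk`, `n_kv`, `n_k`).

```python
import random

def collapsed_gibbs(docs, K, V, alpha=0.1, beta=0.01, S=200, seed=0):
    """docs: list of records, each a list of data-value indices v in [0, V).

    Returns phi, where phi[k][v] is the appearance probability of v in topic k.
    """
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # times topic k appears in record d
    n_kv = [[0] * V for _ in range(K)]  # times topic k is assigned to index v
    n_k = [0] * K                       # total assignments per topic

    def bump(d, k, v, delta):
        """Add delta to every count touched by assigning topic k to (d, v)."""
        n_dk[d][k] += delta
        n_kv[k][v] += delta
        n_k[k] += delta

    # Random initial topic assignment for every data value.
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            bump(d, z[d][i], v, +1)

    for _ in range(S):                       # step 1003: up to S sweeps
        for d, doc in enumerate(docs):       # step 1004: every record
            for i, v in enumerate(doc):      # step 1005: every data value
                bump(d, z[d][i], v, -1)      # remove the current assignment
                weights = [
                    (n_dk[d][k] + alpha) * (n_kv[k][v] + beta) / (n_k[k] + V * beta)
                    for k in range(K)
                ]
                z[d][i] = rng.choices(range(K), weights=weights)[0]  # step 1006
                bump(d, z[d][i], v, +1)      # step 1007: update both statistics

    # Step 1009: appearance probability of each data value in each topic.
    return [[(n_kv[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
            for k in range(K)]

phi = collapsed_gibbs([[0, 1, 0], [2, 3, 3]], K=2, V=4, S=50)
for k, row in enumerate(phi):
    print(k, [round(p, 3) for p in row])
```

Each row of `phi` sums to 1, so it can be read directly as the per-topic distribution over data values; pairing it with the index built in step 1002 recovers the keywords.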

101 Data analysis support system
109 Learning control module
110 Aggregation control module
113 Quantity-to-quality conversion DB
114 Learning DB
115 Aggregation DB

Claims

A data analysis support system comprising a computer, wherein
the computer includes
a memory in which a program is stored, and
a CPU that executes the program stored in the memory,
and
the CPU
inputs a plurality of tables,
extracts keywords from each of the plurality of input tables,
performs machine learning on the extracted keywords and, as the analysis result of the machine learning, extracts, for each of the plurality of tables, feature components each including pairs of a keyword and an appearance probability,
identifies combinations of feature components across tables based on matches, between tables, of the keywords included in the extracted feature components,
combines the appearance probabilities of the feature components included in each identified combination to aggregate the analysis results, and
outputs the aggregated analysis results.

The data analysis support system according to claim 1, wherein
the computer
further includes a disk storing a quantity-to-quality conversion table that associates a range of values of a quantitative variable with values of one or more qualitative variables, and
the CPU
converts the values of the quantitative variables contained in each of the plurality of input tables into values of one or more qualitative variables using the quantity-to-quality conversion table, and extracts the converted qualitative variable values as keywords.

The data analysis support system according to claim 2, wherein
the disk
further stores a conversion relationship definition table that associates the names of the plurality of input tables, the names of the fields included in the plurality of input tables, and the quantity-to-quality conversion table, and
the CPU
identifies the names of the plurality of input tables and the names of the fields, identifies the quantity-to-quality conversion table to be used by means of the conversion relationship definition table, converts the values of the quantitative variables contained in each of the plurality of input tables into values of one or more qualitative variables using the identified quantity-to-quality conversion table, and extracts the converted qualitative variable values as keywords.

The data analysis support system according to claim 3, wherein
the CPU
converts code words contained in each of the plurality of input tables into values of quantitative variables, converts the resulting quantitative variable values into values of one or more qualitative variables using the quantity-to-quality conversion table, and extracts the converted qualitative variable values as keywords.

The data analysis support system according to claim 2, wherein
the CPU
calculates, for each table among the plurality of input tables, a number of feature components using the number of value types of each of the plurality of qualitative variables obtained from that table, and,
in the machine learning on the extracted keywords, applies the calculated number of feature components to the machine learning and takes the calculated number of feature components as the analysis result of the machine learning.

The data analysis support system according to claim 5, wherein
the CPU,
for each table among the plurality of input tables, identifies the maximum among the numbers of value types of the plurality of qualitative variables obtained from that table, and calculates the number of feature components by adding 1 to the maximum.

The data analysis support system according to claim 5, wherein
the CPU,
for each table among the plurality of input tables, calculates the number of feature components by multiplying together the numbers of value types of the plurality of qualitative variables obtained from that table.

The data analysis support system according to claim 2, wherein
the CPU
converts the values of the quantitative variables contained in each of the plurality of input tables into values of one or more qualitative variables using the quantity-to-quality conversion table, stores in the quantity-to-quality conversion table information that associates the name of the table containing the quantitative variable value before conversion with the qualitative variable value after conversion, and extracts the converted qualitative variable values as keywords, and
outputs the aggregated analysis results of the feature components including the keywords, accepts a selection input for an output keyword, identifies the keyword selected by the accepted input, and retrieves from the quantity-to-quality conversion table, and outputs, the name of the table associated with the qualitative variable value that was extracted as the identified keyword.

A data analysis support method using a computer, wherein
the computer includes
a memory in which a program is stored, and
a CPU that executes the program stored in the memory,
and
the CPU
inputs a plurality of tables,
extracts keywords from each of the plurality of input tables,
performs machine learning on the extracted keywords and, as the analysis result of the machine learning, extracts, for each of the plurality of tables, feature components each including pairs of a keyword and an appearance probability,
identifies combinations of feature components across tables based on matches, between tables, of the keywords included in the extracted feature components,
combines the appearance probabilities of the feature components included in each identified combination to aggregate the analysis results, and
outputs the aggregated analysis results.

The data analysis support method according to claim 9, wherein
the computer
further includes a disk storing a quantity-to-quality conversion table that associates a range of values of a quantitative variable with values of one or more qualitative variables, and
the CPU
converts the values of the quantitative variables contained in each of the plurality of input tables into values of one or more qualitative variables using the quantity-to-quality conversion table, and extracts the converted qualitative variable values as keywords.