JP6695431B2

JP6695431B2 - Analytical apparatus, analytical system and analytical method

Info

Publication number: JP6695431B2
Application number: JP2018536626A
Authority: JP
Inventors: 琢磨柴原; 英司金森; 昌宏荻野; 鈴木　麻由美; 麻由美鈴木
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2016-09-01
Filing date: 2016-09-01
Publication date: 2020-05-20
Anticipated expiration: 2036-09-01
Also published as: JPWO2018042606A1; WO2018042606A1

Description

本発明は、データを分析する分析装置、分析システムおよび分析方法に関する。 The present invention relates to an analysis device, an analysis system and an analysis method for analyzing data.

特許文献１は、患者属性と１つ以上の有害事象（ＡｄｖｅｒｓｅＥｖｅｎｔｓ；ＡＥ）との間の相関に関する情報を識別および提供する臨床意思決定支援システムとともに使用するコンピュータ実装方法、システム、およびコンピュータ可読記憶媒体を開示する。特許文献１のプロセスは、ＡＥと患者属性との間の相関に対してＡＥおよび１つ以上の患者属性を含むデータベース情報を処理することと、１つ以上のＡＥと１つ以上の患者属性との間の少なくとも１つの相関を識別することとを含む。相関は、１つ以上の相関ルールを決定するための相関ルール発見プロセスを介して発見されてもよい。各相関ルールは、確信度、支持度、および／または他の閾値を満たす。当該プロセスは、識別または発見された相関に基づいて、ユーザに情報または警告をさらに提供する。 US Patent Application Publication No. 2004/011187 A1 is a computer-implemented method, system, and computer-readable storage for use with a clinical decision support system that identifies and provides information about a correlation between patient attributes and one or more adverse events (AEs). Disclose the medium. The process of U.S. Patent No. 6,096,981 processes database information including AEs and one or more patient attributes for correlation between AEs and patient attributes, one or more AEs and one or more patient attributes, Identifying at least one correlation between Correlations may be discovered via a correlation rule discovery process to determine one or more association rules. Each association rule meets certainty, support, and / or other thresholds. The process further provides the user with information or alerts based on the identified or discovered correlations.

特許文献２は、診療に対する適切な支援を行う診療支援プログラムを開示する。特許文献２の診療支援プログラムでは、診断された病気に対する患者の治療期間と前記診断された病気に対する基準治癒期間とを比較し、前記患者の治療期間が前記基準治癒期間を越えている場合に、類似する症状を発症させるそれぞれの病気を関連付けて記憶する記憶手段から前記診断された病気の症状に類似する症状を発症させる他の病気を検索し、検索した前記他の病気の病名情報を出力する、処理をコンピュータに実行させる。 Patent Document 2 discloses a medical care support program that appropriately supports medical care. In the medical treatment support program of Patent Document 2, a treatment period of a patient for a diagnosed disease is compared with a reference cure period for the diagnosed disease, and when the treatment period of the patient exceeds the reference cure period, Searching for other diseases that develop similar symptoms to the diagnosed disease from the storage means that associates and stores the respective diseases that cause similar symptoms, and outputs the disease name information of the searched other diseases. , Causes the computer to execute the process.

特表２０１２−５２４９４５号公報Special table 2012-524945 gazette 特開２０１４−１９９５９７号公報JP, 2014-199597, A

しかしながら、上述した従来技術では、学習データから学習モデルを生成しても、どの因子が他のどの因子と関連するかがわからないという問題がある。たとえば、目的変数を疾病確率、因子を複数の薬の投与量とした場合、たとえば、薬Ａと薬Ｂとを組み合わせて患者に投与することが効果的なのか、副作用が生じるのかがわからないという問題がある。 However, the above-mentioned conventional technique has a problem that even if a learning model is generated from learning data, it is not known which factor is associated with which other factor. For example, when the objective variable is the disease probability and the factor is the dose of a plurality of drugs, for example, it is not known whether it is effective to administer a combination of drug A and drug B to a patient or whether side effects occur. There is.

本発明は、因子の組み合わせの有効性を分析することを目的とする。 The present invention aims to analyze the effectiveness of a combination of factors.

本願において開示される発明の一側面となる分析装置、分析システムおよび分析方法は、記憶デバイスに、目的変数の実測値と複数の因子の実測値とを含む学習データを複数有する学習データ集合と、前記複数の因子の予測値を含む前記学習データ由来の予測データを複数有する予測データ集合と、前記目的変数の実測値と前記複数の因子の実測値との関係を示す学習モデルと、を記憶しておき、前記複数の因子の値どうしが類似するように前記予測データ集合をクラスタリングして、複数の因子クラスタを生成する第１生成処理と、前記予測データ集合を用いて、前記複数の因子の相関により前記複数の因子が共起する共起量を算出する第１算出処理と、前記第１算出処理によって算出された共起量に基づいて前記複数の因子をクラスタリングして、２以上の因子を含む共起クラスタを１以上有する複数の共起クラスタを生成する第２生成処理と、前記第１生成処理によって生成された複数の因子クラスタの中の２以上の因子を含む特定の因子クラスタに含まれる特定の予測データ群における前記２以上の因子の予測値のうち、前記第２生成処理によって生成された複数の共起クラスタの中の特定の共起クラスタが示す２以上の特定の因子の予測値を、前記学習モデルに与えることにより、前記特定の因子クラスタにおける前記目的変数の予測値を算出する第２算出処理と、を実行することを特徴とする。 An analysis apparatus, an analysis system, and an analysis method according to one aspect of the invention disclosed in the present application, a storage device, a learning data set having a plurality of learning data including measured values of objective variables and measured values of a plurality of factors, A prediction data set having a plurality of prediction data derived from the learning data including prediction values of the plurality of factors, and a learning model indicating a relationship between the measured value of the objective variable and the measured values of the plurality of factors are stored. A first generation process of clustering the prediction data sets to generate a plurality of factor clusters so that the values of the plurality of factors are similar to each other, and using the prediction data set, A first calculation process for calculating a co-occurrence amount in which the plurality of factors co-occur by correlation, and clustering the plurality of factors based on the co-occurrence amount calculated in the first calculation process to obtain two or more factors. A second generation process for generating a plurality of co-occurrence clusters having one or more co-occurrence clusters including, and a specific factor cluster including two or more factors among the plurality of factor clusters generated by the first generation process. Of the predicted values of the two or more factors in the included specific prediction data group, the two or more specific factors indicated by the specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the second generation processing are included. A second calculation process of calculating a predicted value of the objective variable in the specific factor cluster by giving the predicted value to the learning model.

本発明の代表的な実施の形態によれば、因子の組み合わせの有効性を分析することができる。前述した以外の課題、構成及び効果は、以下の実施例の説明により明らかにされる。 According to the exemplary embodiment of this invention, the effectiveness of the combination of factors can be analyzed. Problems, configurations and effects other than those described above will be clarified by the following description of the embodiments.

図１は、実施例１にかかるデータ分析例を示す説明図である。FIG. 1 is an explanatory diagram of a data analysis example according to the first embodiment. 図２は、分析装置のハードウェア構成例を示すブロック図である。FIG. 2 is a block diagram showing a hardware configuration example of the analyzer. 図３は、図１に示した学習データの詳細な内容を示す説明図である。FIG. 3 is an explanatory diagram showing the detailed contents of the learning data shown in FIG. 図４は、初期設定画面例を示す説明図である。FIG. 4 is an explanatory diagram showing an example of the initial setting screen. 図５は、分析装置による分析処理手順例を示すフローチャートである。FIG. 5 is a flowchart showing an example of an analysis processing procedure by the analysis device. 図６は、因子の確率分布を示す説明図である。FIG. 6 is an explanatory diagram showing the probability distribution of factors. 図７は、統合確率分布の一例を示す説明図である。FIG. 7 is an explanatory diagram showing an example of the integrated probability distribution. 図８は、因子クラスタリング結果を示す説明図である。FIG. 8 is an explanatory diagram showing the result of factor clustering. 図９は、共起クラスタリングの処理例を示す説明図である。FIG. 9 is an explanatory diagram showing a processing example of co-occurrence clustering. 図１０は、ステップＳ５１０による予測結果を示す説明図である。FIG. 10: is explanatory drawing which shows the prediction result by step S510. 図１１は、表示画面例を示す説明図である。FIG. 11 is an explanatory diagram showing an example of a display screen. 図１２は、分析システムのシステム構成例を示す説明図である。FIG. 12 is an explanatory diagram showing a system configuration example of the analysis system. 図１３は、分析システムによる分散処理手順例を示すフローチャート１である。FIG. 13 is a flowchart 1 showing an example of distributed processing procedure by the analysis system. 図１４は、分析システムによる分散処理手順例を示すフローチャート２である。FIG. 14 is a flowchart 2 showing an example of distributed processing procedure by the analysis system. 図１５は、分析システムによる分散処理手順例を示すフローチャート３である。FIG. 15 is a flowchart 3 showing an example of distributed processing procedure by the analysis system. 図１６は、図１５に示した分析システムによる分散処理手順例を示すフローチャート３の変形例を示すフローチャートである。FIG. 16 is a flowchart showing a modification of the flowchart 3 showing an example of the distributed processing procedure by the analysis system shown in FIG.

＜データ分析例＞
図１は、実施例１にかかるデータ分析例を示す説明図である。（１）〜（６）は、分析装置による分析方法の手順を示す。（１）分析装置は、学習データ集合１０から学習モデルを生成する。学習データ集合１０は、例として、目的変数を薬効、具体的には疾病確率とし、因子を複数の薬の患者への投与量とする。疾病確率は、０％〜１００％で表現できるが、ここでは、疾病を１（＝１００％）、健康を０（＝０％）とする。また、因子は、便宜的に薬１〜薬４の４つの説明変数であるが、実際には、たとえば、数万から数億の薬である。また、各エントリは、患者を示す。患者は便宜的にＡ〜Ｆの６人であるが、実際には、たとえば、数万から数億の患者である。<Data analysis example>
FIG. 1 is an explanatory diagram of a data analysis example according to the first embodiment. (1) to (6) show the procedure of the analysis method by the analyzer. (1) The analysis device generates a learning model from the learning data set 10. In the learning data set 10, for example, the objective variable is the drug effect, specifically the disease probability, and the factor is the dose of a plurality of drugs to the patient. The disease probability can be expressed as 0% to 100%, but here, the disease is 1 (= 100%) and the health is 0 (= 0%). In addition, the factors are four explanatory variables of drug 1 to drug 4 for convenience, but actually, for example, tens of thousands to hundreds of millions of drugs. Each entry also indicates a patient. For convenience, the number of patients is 6 from A to F, but actually, for example, tens of thousands to hundreds of millions of patients.

（１）学習モデルの生成において、生成される学習モデルには、線形モデルと非線形モデルがある。線形モデルには、たとえば、線形分類（ＬｉｎｅａｒＣｌａｓｓｉｆｉｃａｔｉｏｎ）とロジスティック回帰（ＬｏｇｉｓｔｉｃＲｅｇｒｅｓｓｉｏｎ）とがある。非線形モデルには、たとえば、ニューラルネットワーク（ＮｅｕｒａｌＮｅｔｗｏｒｋ）、サポートベクターマシン（ＳｕｐｐｏｒｔＶｅｃｔｏｒＭａｃｈｉｎｅ）、アダブースト（Ａｄａｂｏｏｓｔ）、ランダムフォレスト（ＲａｎｄｏｍＦｏｒｅｓｔｓ）がある。ユーザは、学習モデルの生成の際に、いずれかのモデルを選択することができる。たとえば、ユーザは、因子の組み合わせの有効性を高速に分析したい場合には、線形モデルを選択すればよく、高精度に分析したい場合には、非線形モデルを選択すればよい。 (1) In generating the learning model, the learning model generated includes a linear model and a non-linear model. Linear models include, for example, linear classification and logistic regression. Examples of the non-linear model include a neural network (Neural Network), a support vector machine (Support Vector Machine), an Adaboost, and a random forest (Random Forests). The user can select one of the models when generating the learning model. For example, the user may select a linear model when analyzing the effectiveness of a combination of factors at high speed, and may select a non-linear model when analyzing with high accuracy.

（２）分析装置は、（１）で生成された学習モデルから各因子の確率分布２０を生成する。具体的には、たとえば、分析装置は、マルコフ連鎖モンテカルロ法に代表される確率サンプリング法を用いて、学習データ集合１０由来の因子の確率分布２０を２組（それぞれｄ１、ｄ２と称す）生成する。これにより、仮想的な因子データを大量に収集することができる。 (2) The analysis device generates the probability distribution 20 of each factor from the learning model generated in (1). Specifically, for example, the analyzer generates two sets of probability distributions 20 of factors derived from the learning data set 10 (referred to as d1 and d2, respectively) by using a probability sampling method represented by the Markov chain Monte Carlo method. .. Thereby, a large amount of virtual factor data can be collected.

（３）分析装置は、（２）で生成された因子の確率分布ｄ１，ｄ２が同一の確率分布に収束するか否かを判定する。収束判定には、具体的には、たとえば、Ｇｅｌｍａｎ−Ｒｕｂｉｎ法が用いられる。収束するまで、分析装置は、（２）の因子の確率分布２０を生成する。 (3) The analyzer determines whether the probability distributions d1 and d2 of the factors generated in (2) converge to the same probability distribution. Specifically, for example, the Gelman-Rubin method is used for the convergence determination. Until it converges, the analyzer generates the probability distribution 20 of the factor (2).

（４）分析装置は、（３）で収束すると判定された因子の確率分布ｄ１、ｄ２を統合し、統合した因子の確率分布（統合確率分布Ｄ）について、因子クラスタリングを実行する。因子クラスタリングには、具体的には、たとえば、ｋ−ｍｅａｎｓクラスタリングが用いられる。クラスタ数は、あらかじめ設定される。ここでは、クラスタ数は例として「３」とする。これにより、因子クラスタリング結果４０において、統合確率分布Ｄのエントリは、３種類の患者タイプα、β、γに分類される。 (4) The analysis device integrates the probability distributions d1 and d2 of the factors determined to converge in (3), and performs factor clustering on the integrated probability distribution of the factors (integrated probability distribution D). For the factor clustering, specifically, for example, k-means clustering is used. The number of clusters is set in advance. Here, the number of clusters is “3” as an example. As a result, in the factor clustering result 40, the entries of the integrated probability distribution D are classified into three types of patient types α, β, and γ.

（５）また、分析装置は、統合確率分布Ｄについて、共起クラスタリングを実行する。具体的には、たとえば、分析装置は、統合確率分布Ｄの因子同士の相関係数を共起量として算出する。そして、分析装置は、共起量に階層クラスタリング法を適用し、共起クラスタを生成する。ここでは、共起クラスタ１（薬１，薬２）と共起クラスタ２（薬３，薬４）が得られたものとする。なお、ここでは、共起クラスタは、２つの因子の組み合わせであるが、３以上の因子の組み合わせでもよい。 (5) Further, the analysis device executes co-occurrence clustering on the integrated probability distribution D. Specifically, for example, the analysis device calculates the correlation coefficient between the factors of the integrated probability distribution D as the co-occurrence amount. Then, the analysis device applies the hierarchical clustering method to the co-occurrence amount to generate a co-occurrence cluster. Here, it is assumed that the co-occurrence cluster 1 (drug 1, drug 2) and the co-occurrence cluster 2 (drug 3, drug 4) are obtained. Although the co-occurrence cluster is a combination of two factors here, it may be a combination of three or more factors.

（６）分析装置は、患者タイプα、β、γごとに、共起クラスタに属する因子を学習モデルに与えることにより、患者タイプα、β、γごとの疾病確率の予測値を算出する。このように、分析装置は、因子の組み合わせの有効性を分析することができる。 (6) The analyzer calculates the predicted value of the disease probability for each of the patient types α, β, γ by giving the learning model a factor belonging to the co-occurrence cluster for each of the patient types α, β, γ. In this way, the analyzer can analyze the effectiveness of the combination of factors.

＜分析装置のハードウェア構成例＞
図２は、分析装置のハードウェア構成例を示すブロック図である。分析装置２００は、プロセッサ２０１と、記憶デバイス２０２と、入力デバイス２０３と、出力デバイス２０４と、通信インターフェース（通信ＩＦ２０５）と、を有する。プロセッサ２０１、記憶デバイス２０２、入力デバイス２０３、出力デバイス２０４、および通信ＩＦ２０５は、バスにより接続される。プロセッサ２０１は、分析装置２００を制御する。記憶デバイス２０２は、プロセッサ２０１の作業エリアとなる。また、記憶デバイス２０２は、各種プログラムやデータを記憶する非一時的なまたは一時的な記録媒体である。記憶デバイス２０２としては、たとえば、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）、フラッシュメモリがある。入力デバイス２０３は、データを入力する。入力デバイス２０３としては、たとえば、キーボード、マウス、タッチパネル、テンキー、スキャナがある。出力デバイス２０４は、データを出力する。出力デバイス２０４としては、たとえば、ディスプレイ、プリンタがある。通信ＩＦ２０５は、ネットワークと接続し、データを送受信する。<Example of hardware configuration of analyzer>
FIG. 2 is a block diagram showing a hardware configuration example of the analyzer. The analysis apparatus 200 has a processor 201, a storage device 202, an input device 203, an output device 204, and a communication interface (communication IF 205). The processor 201, storage device 202, input device 203, output device 204, and communication IF 205 are connected by a bus. The processor 201 controls the analysis device 200. The storage device 202 serves as a work area of the processor 201. The storage device 202 is a non-temporary or temporary recording medium that stores various programs and data. Examples of the storage device 202 include a ROM (Read Only Memory), a RAM (Random Access Memory), a HDD (Hard Disk Drive), and a flash memory. The input device 203 inputs data. Examples of the input device 203 include a keyboard, a mouse, a touch panel, a numeric keypad, and a scanner. The output device 204 outputs data. Examples of the output device 204 include a display and a printer. The communication IF 205 connects to a network and transmits / receives data.

＜学習データ例＞
図３は、図１に示した学習データ集合１０の詳細な内容を示す説明図である。学習データ集合１０は、例として、テーブル形式のデータとする。なお、以降のデータベースまたはテーブルの説明において、ＡＡフィールドｂｂｂ（ＡＡはフィールド名、ｂｂｂは符号）の値を、ＡＡｂｂｂと表記する場合がある。たとえば、患者ＩＤフィールド３０１の値を、患者ＩＤ３０１と表記する。<Example of learning data>
FIG. 3 is an explanatory diagram showing the detailed contents of the learning data set 10 shown in FIG. The learning data set 10 is, for example, data in a table format. In the following description of the database or table, the value of the AA field bbb (AA is a field name, bbb is a code) may be referred to as AAbbb. For example, the value of the patient ID field 301 is described as patient ID 301.

学習データ集合１０は、患者ＩＤフィールド３０１と、目的変数フィールド３０２と、因子フィールド３０３と、を有する。同一行における各フィールド３０１〜３０３の値が患者情報となるエントリを構成する。図３では、エントリ数は「６」であるが、実際には、たとえば、数万から数億の患者のエントリがある。 The learning data set 10 has a patient ID field 301, an objective variable field 302, and a factor field 303. The values of the fields 301 to 303 in the same row form an entry that is patient information. In FIG. 3, the number of entries is “6”, but actually there are, for example, tens to hundreds of millions of patient entries.

患者ＩＤフィールド３０１は、患者ＩＤを格納する記憶領域である。患者ＩＤ３０１は、患者を一意に特定する識別情報である。 The patient ID field 301 is a storage area for storing a patient ID. The patient ID 301 is identification information that uniquely identifies the patient.

目的変数フィールド３０２は、患者ＩＤ３０１ごとの目的変数を格納する記憶領域である。目的変数３０２は、疾病確率を示す。疾病確率は、０％〜１００％で表現できるが、学習データ集合１０は実測値であるため、疾病を１（＝１００％）、健康を０（＝０％）とする。 The target variable field 302 is a storage area for storing a target variable for each patient ID 301. The objective variable 302 indicates a disease probability. The disease probability can be expressed as 0% to 100%, but since the learning data set 10 is an actual measurement value, the disease is 1 (= 100%) and the health is 0 (= 0%).

因子フィールド３０３は、複数の因子を格納する記憶領域である。因子３０３は、薬の投与量を示す説明変数である。本例では、因子３０３は、便宜的に薬１〜薬４の４つの説明変数であるが、実際には、たとえば、数万から数億の薬である。なお、因子３０３である薬の投与量の単位は、薬ごとに定められる。 The factor field 303 is a storage area that stores a plurality of factors. Factor 303 is an explanatory variable indicating the dose of the drug. In this example, the factors 303 are four explanatory variables, that is, drug 1 to drug 4 for convenience, but actually, for example, tens of thousands to hundreds of millions of drugs. The unit of the dose of the drug that is the factor 303 is determined for each drug.

図３において、患者ＩＤ３０１が「患者Ａ」のエントリは、患者Ａに薬１を「２０」、薬２を「１３．０」、薬４を「２２．０」を投与された結果、患者Ａは疾病であることを示す。また、患者ＩＤ３０１が「患者Ｂ」のエントリは、患者Ｂに薬１を「１０」、薬２を「２３．０」、薬３を「１」、薬４を「３１．０」を投与された結果、患者Ｂは疾病であることを示す。 In FIG. 3, for the entry with the patient ID 301 of “patient A”, patient A is administered with drug 1 “20”, drug 2 “13.0”, and drug 4 “22.0”. Indicates a disease. In addition, for the entry having the patient ID 301 of “patient B”, the patient 1 is administered with medicine 1 “10”, medicine 2 “23.0”, medicine 3 “1”, and medicine 4 “31.0”. As a result, patient B is ill.

＜初期設定画面例＞
図４は、初期設定画面例を示す説明図である。初期設定画面４００は、出力デバイス２０４の一例であるディスプレイに表示され、入力デバイス２０３により設定される。機械学習選択領域４０１は、機械学習方法を選択するプルダウン式のインタフェースである。因子クラスタリング設定領域４０２は、クラスタリング方法と、クラスタ数と、を設定する領域である。因子クラスタリング選択領域４０３は、因子クラスタリングの手法を選択するプルダウン式のインタフェースである。因子クラスタ数設定領域４０４は、因子クラスタリングで得たいクラスタの数を設定する入力欄である。<Example of initial setting screen>
FIG. 4 is an explanatory diagram showing an example of the initial setting screen. The initial setting screen 400 is displayed on a display, which is an example of the output device 204, and is set by the input device 203. The machine learning selection area 401 is a pull-down interface for selecting a machine learning method. The factor clustering setting area 402 is an area for setting the clustering method and the number of clusters. The factor clustering selection area 403 is a pull-down interface for selecting a factor clustering method. The factor cluster number setting area 404 is an input field for setting the number of clusters desired to be obtained by factor clustering.

σ値設定領域４０５は、σ値を設定する入力欄である。σ値は、図１の（２）各因子の確率分布２０の生成において、マルコフ連鎖モンテカルロ法の採択率αで用いられる固定のパラメータである。σ値は、０よりも大きく１以下の範囲の値である。 The σ value setting area 405 is an input field for setting the σ value. The σ value is a fixed parameter used in the adoption rate α of the Markov chain Monte Carlo method in the generation of the probability distribution 20 of each factor (2) in FIG. The σ value is a value in the range of more than 0 and 1 or less.

共起クラスタリング設定領域４０６は、共起方法と、クラスタリング方法と、クラスタ数と、しきい値とを設定する領域である。共起量選択領域４０７は、共起量の計算方法を選択するプルダウン式のインタフェースである。共起クラスタリング選択領域４０８は、共起クラスタリングの手法を選択するプルダウン式のインタフェースである。共起クラスタ数設定領域４０９は、因子クラスタリングで得たい共起クラスタの数を設定する入力欄である。しきい値設定領域４１０は、因子クラスタの関連度を示す相関値の予測値についてのしきい値を設定する入力欄である。決定ボタン４１１は、各項目４０１〜４１０の値を入力するボタンである。 The co-occurrence clustering setting area 406 is an area for setting a co-occurrence method, a clustering method, the number of clusters, and a threshold value. The co-occurrence amount selection area 407 is a pull-down interface for selecting a method of calculating the co-occurrence amount. The co-occurrence clustering selection area 408 is a pull-down interface for selecting a method of co-occurrence clustering. The co-occurrence cluster number setting area 409 is an input field for setting the number of co-occurrence clusters to be obtained by factor clustering. The threshold value setting area 410 is an input field for setting a threshold value for a predicted value of a correlation value indicating the degree of association of factor clusters. The enter button 411 is a button for inputting the values of the items 401 to 410.

＜分析処理手順例＞
図５は、分析装置２００による分析処理手順例を示すフローチャートである。分析装置２００は、記憶デバイス２０２に記憶された分析プログラムをプロセッサ２０１に実行させることにより、図５のフローチャートに示す処理を実行する。まず、分析装置２００は、初期設定を実行する（ステップＳ５０１）。初期設定（ステップＳ５０１）では、図４に示した初期設定画面がディスプレイに表示される。ユーザは、初期設定画面の各項目４０１〜４０９について選択または入力をする。分析装置２００は、入力ボタン４１０の押下を検出することで、各項目４０１〜４０９の値を読み込む。<Example of analysis processing procedure>
FIG. 5 is a flowchart showing an example of an analysis processing procedure by the analysis device 200. The analysis apparatus 200 executes the processing shown in the flowchart of FIG. 5 by causing the processor 201 to execute the analysis program stored in the storage device 202. First, the analysis device 200 executes initial setting (step S501). In the initial setting (step S501), the initial setting screen shown in FIG. 4 is displayed on the display. The user selects or inputs each item 401 to 409 on the initial setting screen. The analysis apparatus 200 reads the values of the items 401 to 409 by detecting the pressing of the input button 410.

つぎに、分析装置２００は、図１の（１）に示したように、学習データ集合１０から学習モデルを生成する（ステップＳ５０２）。ロジスティック回帰の場合、学習モデルは下記式（１）で表現される。 Next, the analysis device 200 generates a learning model from the learning data set 10 as shown in (1) of FIG. 1 (step S502). In the case of logistic regression, the learning model is expressed by the following equation (1).

ｙ＝ｆ（ｘ）＝σ（ｗ^ｔｘ＋ｂ）・・・（１）y = f (x) = σ (w ^t x + b) (1)

ｙは目的変数を示すスカラである。ｘはｍ次元の特徴量ベクトルである。ｍは因子の個数に相当する。図３の学習データ集合１０では、因子３０３の数は４個（薬１〜薬４）であるため、ｍ＝４である。σ（）はシグモイド関数である。ベクトルｗとスカラｂは、それぞれ、重みとバイアスのパラメータであり、学習パラメータと呼ばれる。非線形モデルの場合、シグモイド関数σ（）内のｗ^ｔｘが、ベクトルｗと因子ｘとに基づくｗ^ｔｘよりも複雑な関数に置き換わる。y is a scalar indicating an objective variable. x is an m-dimensional feature vector. m corresponds to the number of factors. In the learning data set 10 of FIG. 3, the number of factors 303 is four (medicine 1 to drug 4), and therefore m = 4. σ () is a sigmoid function. The vector w and the scalar b are parameters of weight and bias, respectively, and are called learning parameters. In the case of a non-linear model, w ^t x in the sigmoid function σ () is replaced by a more complex function than w ^t x based on the vector w and the factor x.

分析装置２００は、図４の機械学習選択領域４０１で選択された機械学習方法に応じた学習モデルを選択して、学習モデルを表現する学習パラメータを求める。 The analysis apparatus 200 selects a learning model according to the machine learning method selected in the machine learning selection area 401 of FIG. 4, and obtains a learning parameter expressing the learning model.

つぎに、分析装置２００は、図１の（２）に示したように、マルコフ連鎖モンテカルロ法に代表される確率サンプリング法を用いて、学習データ集合１０由来の因子の確率分布ｄ１，ｄ２を生成する（ステップＳ５０３）。 Next, the analyzer 200 generates probability distributions d1 and d2 of the factors derived from the learning data set 10 by using the probability sampling method represented by the Markov chain Monte Carlo method, as shown in (2) of FIG. Yes (step S503).

図６は、因子の確率分布ｄ１，ｄ２を示す説明図である。因子の確率分布ｄ１，ｄ２は、仮想患者ＩＤフィールド６０１と、目的変数フィールド６０２と、因子フィールド６０３と、を有する。同一行における各フィールド６０１〜６０３の値が仮想患者情報となるエントリを構成する。なお、エントリ数は、学習データ集合１０のエントリ数と同数とする。 FIG. 6 is an explanatory diagram showing the probability distributions d1 and d2 of the factors. The factor probability distributions d1 and d2 have a virtual patient ID field 601, an objective variable field 602, and a factor field 603. The values of the fields 601 to 603 in the same row form an entry that is virtual patient information. The number of entries is the same as the number of entries in the learning data set 10.

仮想患者ＩＤフィールド６０１は、仮想患者ＩＤを格納する記憶領域である。仮想患者ＩＤ６０１は、仮想患者を一意に特定する識別情報である。 The virtual patient ID field 601 is a storage area for storing a virtual patient ID. The virtual patient ID 601 is identification information that uniquely identifies the virtual patient.

目的変数フィールド６０２は、仮想患者ＩＤ６０１ごとの目的変数を格納する記憶領域である。目的変数６０２は、疾病確率を示す。疾病確率は、０％〜１００％で表現される。 The target variable field 602 is a storage area for storing a target variable for each virtual patient ID 601. The objective variable 602 indicates a disease probability. The disease probability is expressed as 0% to 100%.

因子フィールド６０３は、複数の因子を格納する記憶領域である。因子６０３は、薬の投与量を示す説明変数である。本例では、因子６０３の数は、学習データ集合１０の因子３０３の数と同数となる。 The factor field 603 is a storage area that stores a plurality of factors. The factor 603 is an explanatory variable indicating the dose of the drug. In this example, the number of factors 603 is the same as the number of factors 303 of the learning data set 10.

因子の確率分布ｄ１，ｄ２のエントリである仮想患者情報の生成例について説明する。分析装置２００は、学習データ集合１０のエントリ群からいずれかのエントリの因子ベクトルを選択する。たとえば、患者ＩＤ３０１が「患者Ａ」の因子ベクトルｘ＝（２０，１３．０，０，２２．０）が選択されたとする。分析装置２００は、選択した因子ベクトルの各要素に乱数値ｒを加算して、仮想因子ベクトルｘ’＝（２０＋ｒ，１３．０＋ｒ，０＋ｒ，２２．０＋ｒ）とする。 An example of generation of virtual patient information that is entries of the factor probability distributions d1 and d2 will be described. The analysis apparatus 200 selects a factor vector of any entry from the entry group of the learning data set 10. For example, it is assumed that the factor vector x = (20,13.0,0,22.0) with the patient ID 301 of "patient A" is selected. The analysis apparatus 200 adds the random number value r to each element of the selected factor vector to obtain the virtual factor vector x ′ = (20 + r, 13.0 + r, 0 + r, 22.0 + r).

分析装置２００は、選択された因子ベクトルｘと仮想因子ベクトルｘ’とをマルコフ連鎖モンテカルロ法の採択率αの式（２）に代入する。 The analysis apparatus 200 substitutes the selected factor vector x and the virtual factor vector x ′ into the equation (2) of the adoption rate α of the Markov chain Monte Carlo method.

関数ｑはガウス分布関数である。関数ｑ（ｘ’｜ｘ）は、因子ベクトルｘが与えられた場合に仮想因子ベクトルｘ’を生成する確率を示すガウス分布関数である。関数ｑ（ｘ｜ｘ’）は、仮想因子ベクトルｘ’が与えられた場合に因子ベクトルｘを生成する確率を示すガウス分布関数である。関数ｆは、たとえば、式（１）に示したような、ステップＳ５０２で生成された学習モデルである。σには、σ値設定領域４０５に入力されたσ値が代入される。σ値により、採択率αは、（１−σ）以上の疾病確率の患者情報を含むガウス分布となる。すなわち、（１−σ）以上の疾病確率となる仮想患者情報の仮想因子ベクトルｘ’を採択率αで採択することができる。 The function q is a Gaussian distribution function. The function q (x '| x) is a Gaussian distribution function indicating the probability of generating the virtual factor vector x'when the factor vector x is given. The function q (x | x ') is a Gaussian distribution function indicating the probability of generating the factor vector x when the virtual factor vector x'is given. The function f is, for example, the learning model generated in step S502 as shown in Expression (1). The σ value input to the σ value setting area 405 is substituted for σ. Depending on the σ value, the acceptance rate α has a Gaussian distribution including the patient information with the disease probability of (1−σ) or more. That is, the virtual factor vector x ′ of the virtual patient information having the disease probability of (1−σ) or more can be adopted at the adoption rate α.

次に、０〜１の区間で一様な乱数βを発生させ、採択率αがしきい値β（たとえば、１）以上である場合、分析装置２００は、仮想因子ベクトルｘ’を採択する。採択率αがしきい値以上でない場合、分析装置２００は、因子ベクトルｘを採択する。採択された因子ベクトルを採択因子ベクトル＜ｘ＞と表記する。 Next, when a uniform random number β is generated in the interval of 0 to 1 and the adoption rate α is equal to or greater than the threshold value β (for example, 1), the analysis device 200 adopts the virtual factor vector x ′. When the acceptance rate α is not greater than or equal to the threshold value, the analysis device 200 adopts the factor vector x. The adopted factor vector is expressed as adopted factor vector <x>.

採択率αがしきい値β（たとえば、１）以上である場合、分析装置２００は、採択因子ベクトル＜ｘ＞と乱数ベクトルＲとを比較する。具体的には、たとえば、分析装置２００は、採択因子ベクトル＜ｘ＞のすべての要素が、乱数ベクトルＲの対応する要素以上であるか否かを判断する。採択因子ベクトル＜ｘ＞のすべての要素が、乱数ベクトルＲの対応する要素以上である場合、分析装置２００は、採択因子ベクトル＜ｘ＞を新規の仮想患者の仮想因子ベクトルに決定する。 When the adoption rate α is equal to or larger than the threshold value β (for example, 1), the analysis device 200 compares the adoption factor vector <x> with the random number vector R. Specifically, for example, the analysis device 200 determines whether or not all the elements of the adoption factor vector <x> are greater than or equal to the corresponding elements of the random number vector R. When all the elements of the adoption factor vector <x> are equal to or more than the corresponding elements of the random number vector R, the analysis device 200 determines the adoption factor vector <x> as the virtual factor vector of the new virtual patient.

採択因子ベクトル＜ｘ＞のすべての要素が、乱数ベクトルＲの対応する要素以上でない場合、分析装置２００は、因子ベクトルｘを新規の仮想患者の仮想因子ベクトルに決定する。なお、採択因子ベクトル＜ｘ＞のすべての要素が、乱数ベクトルＲの対応する要素以上であることを判断の条件としたが、採択因子ベクトル＜ｘ＞の一部の要素が、乱数ベクトルＲの対応する要素以上であるとしてもよい。 When all the elements of the adopted factor vector <x> are not equal to or greater than the corresponding elements of the random number vector R, the analysis device 200 determines the factor vector x as the virtual factor vector of the new virtual patient. Although all the elements of the adoption factor vector <x> are greater than or equal to the corresponding elements of the random number vector R, the determination condition is that some elements of the adoption factor vector <x> are It may be more than the corresponding element.

このあと、分析装置２００は、各仮想患者情報のエントリにおいて、学習モデルに新規の仮想患者の仮想因子ベクトルである因子６０３を与えることで、目的変数６０２である疾病確率を算出する。このようにして、ステップＳ５０３において、仮想患者情報のエントリが設定され、因子の確率分布ｄ１，ｄ２が生成される。 After that, the analysis apparatus 200 calculates the disease probability that is the objective variable 602 by giving the learning model a factor 603 that is a virtual factor vector of a new virtual patient in each virtual patient information entry. In this way, in step S503, the entry of the virtual patient information is set, and the probability distributions d1 and d2 of the factors are generated.

図５に戻り、分析装置２００は、図１の（３）に示したように、因子の確率分布ｄ１，ｄ２が同一の確率分布に収束しているかを判定する（ステップＳ５０４）。具体的には、たとえば、分析装置２００は、因子の確率分布ｄ１，ｄ２が同一の確率分布に収束しているかを検証するための収束値を、Ｇｅｌｍａｎ−Ｒｕｂｉｎ法により計算する。より具体的には、分析装置２００は、因子の確率分布ｄ１の列データと、当該列データに対応する因子の確率分布ｄ２の列データとを、Ｇｅｌｍａｎ−Ｒｕｂｉｎの収束判定式に与えて、収束値Ｒｈａｔを算出する。 Returning to FIG. 5, the analyzer 200 determines whether the probability distributions d1 and d2 of the factors converge to the same probability distribution as shown in (3) of FIG. 1 (step S504). Specifically, for example, the analysis device 200 calculates a convergence value for verifying whether the probability distributions d1 and d2 of the factors converge to the same probability distribution by the Gelman-Rubin method. More specifically, the analysis device 200 gives the column data of the probability distribution d1 of the factor and the column data of the probability distribution d2 of the factor corresponding to the column data to the Gelman-Rubin convergence determination formula to converge. Calculate the value Rhat.

たとえば、分析装置２００は、因子の確率分布ｄ１の目的変数６０２の列データと、因子の確率分布ｄ２の目的変数６０２の列データとをＧｅｌｍａｎ−Ｒｕｂｉｎの収束判定式に与えて、収束値Ｒｈａｔを算出する。また、分析装置２００は、因子の確率分布ｄ１の因子６０３における薬１の列データと、因子の確率分布ｄ２の因子６０３における薬１の列データとをＧｅｌｍａｎ−Ｒｕｂｉｎの収束判定式に与えて、収束値Ｒｈａｔを算出する。薬２以降の列データに付いても同様に、分析装置２００は、収束値Ｒｈａｔを算出する。 For example, the analysis apparatus 200 gives the column data of the objective variable 602 of the probability distribution d1 of the factor and the column data of the objective variable 602 of the probability distribution d2 of the factor to the Gelman-Rubin convergence determination formula to obtain the convergence value Rhat. calculate. Further, the analysis device 200 gives the column data of the drug 1 in the factor 603 of the factor probability distribution d1 and the column data of the drug 1 in the factor 603 of the factor probability distribution d2 to the Gelman-Rubin convergence determination formula, The convergence value Rhat is calculated. Similarly, the analyzer 200 calculates the convergence value Rhat for the column data after the medicine 2.

収束値Ｒｈａｔが１．１以下であれば、因子の確率分布ｄ１，ｄ２の列データは、同一の確率分布に収束すると判定する。分析装置２００は、収束しないと判定された列データを削除する。残存列データの数がしきい値（たとえば、５０％以上）以上であれば、因子の確率分布ｄ１，ｄ２が同一の確率分布に収束していることとなり（ステップＳ５０４：Ｙｅｓ）、ステップＳ５０５に移行する。残存列データの数がしきい値以上でなければ（ステップＳ５０４：Ｎｏ）、ステップＳ５０３に戻り、分析装置２００は、学習データ集合１０由来の因子の確率分布ｄ１，ｄ２を再生成する。また、因子の確率分布ｄ１，ｄ２の因子６０３の列データが１つでも削除された場合、分析装置２００は、残存する因子６０３を学習モデルに与えて、目的変数６０２を再計算する。 If the convergence value Rhat is 1.1 or less, it is determined that the column data of the factor probability distributions d1 and d2 converge to the same probability distribution. The analysis device 200 deletes the column data determined not to converge. If the number of remaining column data is equal to or larger than the threshold value (for example, 50% or more), it means that the probability distributions d1 and d2 of the factors have converged to the same probability distribution (step S504: Yes), and the process proceeds to step S505. Transition. If the number of remaining column data is not equal to or more than the threshold value (step S504: No), the analysis device 200 returns to step S503 and regenerates the probability distributions d1 and d2 of the factors derived from the learning data set 10. Further, when even one column data of the factor 603 of the factor probability distributions d1 and d2 is deleted, the analyzer 200 gives the remaining factor 603 to the learning model and recalculates the objective variable 602.

収束しない列データを削除することにより、因子の確率分布ｄ１，ｄ２の信頼性の向上を図ることができ、分析精度が向上する。また、残存列データの数がしきい値以上であれば、分析装置２００は、収束しないと判定された列データを削除せずに、ステップＳ５０４に移行してもよい。これにより、因子６０３を網羅した分析をおこなうことができる。また、ステップＳ５０４を実行しないこととしてもよい。これにより、分析速度の向上を図ることができる。 By deleting the column data that does not converge, it is possible to improve the reliability of the probability distributions d1 and d2 of the factors and improve the analysis accuracy. If the number of remaining column data is equal to or larger than the threshold value, the analysis device 200 may move to step S504 without deleting the column data determined not to converge. As a result, an analysis covering the factor 603 can be performed. Further, step S504 may not be executed. As a result, the analysis speed can be improved.

つぎに、分析装置２００は、ステップＳ５０４において収束判定された因子の確率分布ｄ１，ｄ２を統合する（ステップＳ５０５）。統合した因子の確率分布を統合確率分布Ｄとする。 Next, the analysis device 200 integrates the probability distributions d1 and d2 of the factors that are determined to converge in step S504 (step S505). The probability distribution of the integrated factors is referred to as an integrated probability distribution D.

図７は、統合確率分布Ｄの一例を示す説明図である。図７では、説明の便宜上、図６に示した因子の確率分布ｄ１，ｄ２を連結した内容としたが、ステップＳ５０４において因子６０３におけるいずれかの列データが削除されている場合は、統合確率分布Ｄにおいても削除された状態となる。 FIG. 7 is an explanatory diagram showing an example of the integrated probability distribution D. In FIG. 7, the content of the probability distributions d1 and d2 of the factors shown in FIG. 6 is connected for convenience of description. However, if any column data in the factor 603 is deleted in step S504, the integrated probability distribution Even in D, the state is deleted.

つぎに、分析装置２００は、図１の（４）に示したように、統合確率分布Ｄを用いて、因子クラスタリングにより因子クラスタを生成する（ステップＳ５０６）。分析装置２００は、初期設定（ステップＳ５０１）において、因子クラスタリング選択領域４０３で選択された因子クラスタリングを実行し、因子クラスタ数設定領域４０４で設定されたクラスタ数分の因子クラスタを生成する。 Next, as shown in (4) of FIG. 1, the analysis apparatus 200 uses the integrated probability distribution D to generate a factor cluster by factor clustering (step S506). In the initial setting (step S501), the analyzer 200 executes the factor clustering selected in the factor clustering selection area 403 to generate the number of factor clusters set in the factor cluster number setting area 404.

図８は、因子クラスタリング結果４０を示す説明図である。因子クラスタリング結果４０は、患者タイプＩＤフィールド８０１と、目的変数フィールド８０２と、因子フィールド８０３と、を有する。同一行における各フィールド８０１〜８０３の値が患者タイプ情報となるエントリを構成する。 FIG. 8 is an explanatory diagram showing the result 40 of the factor clustering. The factor clustering result 40 has a patient type ID field 801, an objective variable field 802, and a factor field 803. The value of each field 801 to 803 in the same row constitutes an entry that becomes patient type information.

患者タイプＩＤフィールド８０１は、患者タイプＩＤを格納する記憶領域である。患者タイプＩＤ８０１は、因子クラスタリングで分類された患者タイプを一意に特定する識別情報である。 The patient type ID field 801 is a storage area for storing a patient type ID. The patient type ID 801 is identification information that uniquely identifies the patient type classified by the factor clustering.

目的変数フィールド８０２は、患者タイプＩＤ８０１ごとの目的変数を格納する記憶領域である。目的変数８０２は、疾病確率を示す。疾病確率は、０％〜１００％で表現される。 The target variable field 802 is a storage area for storing a target variable for each patient type ID 801. The objective variable 802 indicates a disease probability. The disease probability is expressed as 0% to 100%.

因子フィールド８０３は、複数の因子を格納する記憶領域である。因子８０３は、患者タイプへの薬の投与量を示す説明変数である。本例では、因子８０３は、便宜的に薬１〜薬４の４つの説明変数であるが、実際には、たとえば、収束判定（ステップＳ５０４）後に残存する薬である。 The factor field 803 is a storage area that stores a plurality of factors. Factor 803 is an explanatory variable indicating the dose of drug to the patient type. In the present example, the factor 803 is four explanatory variables of drug 1 to drug 4 for convenience, but actually, for example, the drug remains after the convergence determination (step S504).

図８では、因子クラスタリングとしてｋ−ｍｅａｎｓクラスタリングが用いられ、クラスタ数は例として「３」とする。これにより、統合確率分布Ｄのエントリは、３種類の患者タイプα、β、γの因子クラスタに分類される。 In FIG. 8, k-means clustering is used as the factor clustering, and the number of clusters is “3” as an example. As a result, the entries of the integrated probability distribution D are classified into the factor clusters of the three patient types α, β and γ.

図５に戻り、分析装置２００は、各因子クラスタから各因子の統計値を算出する（ステップＳ５０７）。具体的には、たとえば、分析装置２００は、因子フィールド８０３に、当該エントリの患者タイプに所属する統合確率分布Ｄ内の仮想患者情報における統計値を設定する。当該統計値は、たとえば、中央値である。中央値のほか、平均値、最大値、最小値、ランダムに選択された値でもよい。また、分析装置２００は、因子８０３である統計値を学習モデルに与えることにより、目的変数８０２である疾病確率を算出する。このように、患者タイプの因子８０３および説明変数８０２は、統計値および統計値由来の疾病確率に集約される。 Returning to FIG. 5, the analyzer 200 calculates the statistical value of each factor from each factor cluster (step S507). Specifically, for example, the analysis device 200 sets a statistical value in the virtual patient information in the integrated probability distribution D belonging to the patient type of the entry in the factor field 803. The statistical value is, for example, a median value. In addition to the median value, the average value, the maximum value, the minimum value, or a randomly selected value may be used. Further, the analysis device 200 calculates the disease probability that is the objective variable 802 by giving the learning model a statistical value that is the factor 803. In this way, the patient type factor 803 and the explanatory variable 802 are aggregated into a statistical value and a statistical probability-derived disease probability.

また、分析装置２００は、統合確率分布Ｄの因子同士の共起量を算出する（ステップＳ５０８）。共起量とは、２つの因子間の相関値である。具体的には、たとえば、分析装置２００は、統合確率分布Ｄ内の全因子を総当たりで組み合わせ、因子間の相関値を算出する。相関値は、初期設定（ステップＳ５０１）において、共起量選択領域４０７で選択された計算方法により算出される。 In addition, the analysis device 200 calculates the co-occurrence amount of the factors of the integrated probability distribution D (step S508). The co-occurrence amount is a correlation value between two factors. Specifically, for example, the analysis device 200 combines all factors in the integrated probability distribution D in a brute force manner and calculates a correlation value between the factors. The correlation value is calculated by the calculation method selected in the co-occurrence amount selection area 407 in the initial setting (step S501).

つぎに、分析装置２００は、図１の（５）に示したように、共起クラスタリングにより共起クラスタを生成する（ステップＳ５０９）。具体的には、たとえば、分析装置２００は、共起量に階層クラスタリング法を適用し、共起クラスタを生成する。階層クラスタリングとは、個々のデータを１つの共起クラスタとして設定しておき、共起クラスタ間の類似度を計算し、最も類似する共起クラスタを併合し、すべての共起クラスタが１つのクラスタになるまで処理を繰り返し、デンドログラムを生成するすクラスタリングである。ここで、共起クラスタ間の類似度とは、たとえば、共起クラスタ間の距離の短さである。具体的には、たとえば、最近隣法、最遠隣法、または重心法により、共起クラスタ間の距離が定義される。 Next, the analysis device 200 generates a co-occurrence cluster by co-occurrence clustering, as shown in (5) of FIG. 1 (step S509). Specifically, for example, the analysis device 200 applies a hierarchical clustering method to the co-occurrence amount to generate a co-occurrence cluster. Hierarchical clustering is to set individual data as one co-occurrence cluster, calculate the similarity between co-occurrence clusters, merge the most similar co-occurrence clusters, and all co-occurrence clusters are one cluster. It is a clustering that repeats the process until and becomes a dendrogram. Here, the similarity between co-occurrence clusters is, for example, the short distance between co-occurrence clusters. Specifically, the distance between co-occurrence clusters is defined by, for example, the nearest neighbor method, the farthest neighbor method, or the center of gravity method.

図９は、共起クラスタリング（Ｓ５０８、Ｓ５０９）の処理例を示す説明図である。（Ａ）は、ステップＳ５０８の処理を示す。共起量テーブル９００は、因子間の相関値を保持するテーブルである。（Ｂ）は、ステップＳ５０９の処理を示す。（Ｂ）において、分析装置２００は、同一因子の相関値を削除する。また、分析装置２００は、階層クラスタリングのために相関値を１から相関値を減じた相関値に変換する。（Ｂ）では、相関値が小さいほどその因子同士は類似することを意味する。したがって、分析装置２００は、相関値が最小となる因子の組み合わせを共起クラスタとして選択する。（Ｂ）の場合は、薬１と薬２の組み合わせ（共起クラスタ１）と、薬３と薬４の組み合わせ（共起クラスタ２）とが選択される。なお、ここでは、共起クラスタは、２つの因子の組み合わせであるが、３以上の因子の組み合わせでもよい。 FIG. 9 is an explanatory diagram showing a processing example of co-occurrence clustering (S508, S509). (A) shows the process of step S508. The co-occurrence amount table 900 is a table that holds correlation values between factors. (B) shows the process of step S509. In (B), the analyzer 200 deletes the correlation value of the same factor. Further, the analysis device 200 converts the correlation value into a correlation value obtained by subtracting the correlation value from 1 for hierarchical clustering. In (B), the smaller the correlation value, the more similar the factors are. Therefore, the analysis device 200 selects the combination of factors having the smallest correlation value as the co-occurrence cluster. In the case of (B), a combination of drug 1 and drug 2 (co-occurrence cluster 1) and a combination of drug 3 and drug 4 (co-occurrence cluster 2) are selected. Although the co-occurrence cluster is a combination of two factors here, it may be a combination of three or more factors.

なお、（Ｂ）の処理は、共起クラスタの数が共起クラスタ数設定領域４０９で設定された共起クラスタ数になるまで、または、これ以上クラスタを併合できない状態になるまで、実行される。 The process (B) is executed until the number of co-occurrence clusters reaches the number of co-occurrence clusters set in the co-occurrence cluster number setting area 409, or until no more clusters can be merged. ..

図５に戻り、分析装置２００は、図１の（６）に示したように、共起クラスタの予測値を算出する（ステップＳ５１０）。具体的には、たとえば、分析装置２００は、患者タイプα、β、γごとに、共起クラスタに属する因子を学習モデルに与えることにより、患者タイプα、β、γごとの疾病確率の予測値を算出する。 Returning to FIG. 5, the analysis device 200 calculates the predicted value of the co-occurrence cluster as shown in (6) of FIG. 1 (step S510). Specifically, for example, the analysis device 200 gives the learning model the factors belonging to the co-occurrence cluster for each of the patient types α, β, γ, to thereby predict the disease probability for each of the patient types α, β, γ. To calculate.

図１０は、ステップＳ５１０による予測結果１０００を示す説明図である。このように、分析装置２００は、因子の組み合わせの有効性を分析することができる。 FIG. 10: is explanatory drawing which shows the prediction result 1000 by step S510. In this way, the analysis device 200 can analyze the effectiveness of the combination of factors.

図５に戻り、分析装置２００は、予測結果１０００のしきい値処理を実行する（ステップＳ５１１）。具体的には、たとえば、分析装置２００は、予測値がしきい値以上の患者タイプと因子クラスタの組み合わせを選択する。たとえば、しきい値設定領域４１０に設定されたしきい値が「０．８」である場合、分析装置２００は、患者タイプαの因子クラスタ１、患者タイプβの因子クラスタ１、患者タイプγの因子クラスタ１を計算マーカとして選択する。 Returning to FIG. 5, the analysis device 200 executes threshold processing of the prediction result 1000 (step S511). Specifically, for example, the analysis device 200 selects a combination of a patient type and a factor cluster whose predicted value is equal to or higher than a threshold value. For example, when the threshold value set in the threshold value setting area 410 is “0.8”, the analysis device 200 determines that the factor cluster 1 of the patient type α, the factor cluster 1 of the patient type β, and the patient type γ. Select factor cluster 1 as the calculation marker.

分析装置２００は、ステップＳ５１０またはＳ５１１の処理結果を出力する（ステップＳ５１２）。具体的には、たとえば、分析装置２００は、出力デバイス２０４の一例であるディスプレイの表示画面を制御して処理結果を表示画面に表示したり、通信ＩＦ２０５を介して外部装置に処理結果を送信したり、記憶デバイス２０２に処理結果を書き込んだりする。また、ステップＳ５０４の収束判定結果も出力してもよい。 The analyzer 200 outputs the processing result of step S510 or S511 (step S512). Specifically, for example, the analysis apparatus 200 controls the display screen of the display which is an example of the output device 204 to display the processing result on the display screen, or transmits the processing result to the external device via the communication IF 205. Alternatively, the processing result is written in the storage device 202. Also, the convergence determination result of step S504 may be output.

＜表示画面例＞
図１１は、表示画面例を示す説明図である。表示画面１１００は、出力デバイス２０４の一例であるディスプレイに表示される。表示画面１１００は、スコア表示領域１１０１と、予測結果表示領域１１０２と、デンドログラム表示領域１１０３と、を有する。スコア表示領域１１０１には、収束判定（ステップＳ５０４）での収束値Ｒｈａｔが表示される。予測結果表示領域１１０２には、図１０に示した予測結果１０００が表示される。図１１に示すように、棒グラフで表示してもよい。デンドログラム表示領域１１０３には、階層クラスタリングにおけるデンドログラムが表示される。このように、図５に示した処理の途中結果や最終結果が表示画面１１００に表示される。<Display screen example>
FIG. 11 is an explanatory diagram showing an example of a display screen. The display screen 1100 is displayed on a display that is an example of the output device 204. The display screen 1100 has a score display area 1101, a prediction result display area 1102, and a dendrogram display area 1103. In the score display area 1101, the convergence value Rhat in the convergence determination (step S504) is displayed. In the prediction result display area 1102, the prediction result 1000 shown in FIG. 10 is displayed. As shown in FIG. 11, a bar graph may be displayed. A dendrogram in hierarchical clustering is displayed in the dendrogram display area 1103. In this way, the intermediate results and final results of the processing shown in FIG. 5 are displayed on the display screen 1100.

このように、実施例１によれば、分析装置２００は、複数の因子の値どうしが類似するように予測データ集合（たとえば、統合確率分布Ｄ）をクラスタリングして、複数の因子クラスタを生成する第１生成処理を実行する（ステップＳ５０６）。分析装置２００は、予測データ集合（たとえば、統合確率分布Ｄ）を用いて、複数の因子の相関により複数の因子が共起する共起量を算出する第１算出処理を実行する（ステップＳ５０８）。分析装置２００は、第１算出処理によって算出された共起量に基づいて複数の因子をクラスタリングして、２以上の因子を含む共起クラスタを１以上有する複数の共起クラスタを生成する第２生成処理を実行する（ステップＳ５０９）。分析装置２００は、第１生成処理によって生成された複数の因子クラスタの中の２以上の因子を含む特定の因子クラスタに含まれる特定の予測データ群における２以上の因子の予測値のうち、第２生成処理によって生成された複数の共起クラスタの中の特定の共起クラスタが示す２以上の特定の因子の予測値を、学習モデルに与える。そして、分析装置２００は、特定の因子クラスタにおける目的変数の予測値を算出する第２算出処理を実行する（ステップＳ５１０）。 As described above, according to the first embodiment, the analysis device 200 clusters the prediction data set (for example, the integrated probability distribution D) so that the values of the plurality of factors are similar to each other to generate a plurality of factor clusters. The first generation process is executed (step S506). The analysis apparatus 200 uses the prediction data set (for example, the integrated probability distribution D) to execute the first calculation process of calculating the co-occurrence amount in which a plurality of factors co-occur due to the correlation of a plurality of factors (step S508). .. The analysis device 200 clusters a plurality of factors based on the co-occurrence amount calculated by the first calculation process to generate a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors. Generation processing is executed (step S509). The analysis device 200 selects the first predicted value of the two or more factors in the specific predicted data group included in the specific factor cluster including the two or more factors among the plurality of factor clusters generated by the first generation process. The predicted values of two or more specific factors indicated by a specific co-occurrence cluster among a plurality of co-occurrence clusters generated by the 2 generation process are given to the learning model. Then, the analysis device 200 executes the second calculation process of calculating the predicted value of the objective variable in the specific factor cluster (step S510).

これにより、分析装置２００は、複数の因子が共起した特定の因子クラスタにおける目的変数の予測値により、因子の組み合わせの有効性を分析することができる。 Thereby, the analysis device 200 can analyze the effectiveness of the combination of factors based on the predicted value of the objective variable in the specific factor cluster in which a plurality of factors co-occur.

また、分析装置２００は、特定の予測データ群における２以上の因子の予測値に基づいて、特定の因子クラスタにおける２以上の因子の予測値を代表する統計値を算出する第３算出処理を実行する（ステップＳ５１０）。これにより、分析装置２００は、複数の因子が共起した特定の因子クラスタにおける目的変数の予測値の算出に際し、計算量の低減化を図ることができる。したがって、分析速度の向上を図ることができる。 Further, the analysis device 200 executes a third calculation process of calculating a statistical value representing the predicted values of the two or more factors in the specific factor cluster, based on the predicted values of the two or more factors in the specific predicted data group. (Step S510). As a result, the analyzer 200 can reduce the amount of calculation when calculating the predicted value of the target variable in the specific factor cluster in which a plurality of factors co-occur. Therefore, the analysis speed can be improved.

また、分析装置２００は、学習モデルの種類を設定する設定処理を実行する（ステップＳ５０１）。また、分析装置２００は、目的変数の実測値と複数の因子の実測値とを用いて、設定処理によって設定された種類の学習モデルを生成して、記憶デバイスに格納する第３生成処理を実行する（ステップＳ５０２）。これにより、ユーザは、目的に応じて学習モデルの種類を選択することができる。 The analysis apparatus 200 also executes a setting process for setting the type of learning model (step S501). In addition, the analysis device 200 uses the measured values of the objective variables and the measured values of the plurality of factors to generate a learning model of the type set by the setting process, and executes a third generation process of storing the learning model in the storage device. Yes (step S502). This allows the user to select the type of learning model according to the purpose.

また、分析装置２００は、設定処理では、種類として、線形モデルまたは非線形モデルを設定する。これにより、分析装置２００は、線形モデルが設定された場合、分析速度の向上を図ることができ、非線形モデルが設定された場合、分析精度の向上を図ることができる。換言すれば、ユーザは、分析結果がより早く得たい場合は、線形モデルを選択し、分析精度を上げたい場合は、非線形モデルを選択することができる。 In the setting process, the analysis device 200 sets a linear model or a non-linear model as the type. As a result, the analysis apparatus 200 can improve the analysis speed when the linear model is set, and can improve the analysis accuracy when the nonlinear model is set. In other words, the user can select a linear model if he / she wants to obtain an analysis result earlier and a non-linear model if he / she wants to improve the analysis accuracy.

また、予測データ集合（たとえば、統合確率分布Ｄ）は、学習モデルを用いた確率サンプリング法によって学習データ集合１０から生成されたデータ集合としてもよい。これにより、予測データ集合（たとえば、統合確率分布Ｄ）は、学習モデルに依存したデータ集合となる。したがって、たとえば、非線形モデルが設定された場合、予測データ集合（たとえば、統合確率分布Ｄ）は、線形モデルが設定された場合に比べて、精度のよいデータ集合となる。 Further, the prediction data set (for example, integrated probability distribution D) may be a data set generated from the learning data set 10 by the probability sampling method using the learning model. As a result, the predicted data set (for example, the integrated probability distribution D) becomes a data set that depends on the learning model. Therefore, for example, when a non-linear model is set, the prediction data set (for example, integrated probability distribution D) becomes a data set with higher accuracy than when a linear model is set.

また、分析装置２００は、学習モデルを用いた確率サンプリング法（たとえば、マルコフ連鎖モンテカルロ法）によって予測データまたは予測データに類似するデータのいずれか一方を採択することにより、２つの予測データ群（たとえば、因子の確率分布ｄ１，ｄ２）を生成する第４生成処理を実行する（ステップＳ５０３）。予測データに類似するデータとは、上述したように、予測データである因子の各値にランダム値が加算されたデータである。分析装置２００は、第４生成処理によって生成された２つの予測データ群（たとえば、因子の確率分布ｄ１，ｄ２）が同一の確率分布に収束するか否かを判定する判定処理を実行する（ステップＳ５０４）。分析装置２００は、判定処理による判定結果に基づいて２つの予測データ群（たとえば、因子の確率分布ｄ１，ｄ２）を統合することにより、予測データ集合（たとえば、統合確率分布Ｄ）を生成する統合処理を実行する（ステップＳ５０５）。 In addition, the analysis device 200 adopts either one of the prediction data or the data similar to the prediction data by the probability sampling method (for example, Markov chain Monte Carlo method) using the learning model, so that two prediction data groups (for example, , The probability generation distributions d1, d2) of the factors are executed (step S503). The data similar to the prediction data is, as described above, data in which a random value is added to each value of the factor that is the prediction data. The analysis apparatus 200 executes the determination process of determining whether or not the two prediction data groups (for example, the probability distributions d1 and d2 of factors) generated by the fourth generation process converge to the same probability distribution (step). S504). The analysis apparatus 200 integrates two prediction data groups (for example, the probability distributions d1 and d2 of factors) based on the determination result of the determination process to generate a prediction data set (for example, the integrated probability distribution D). The process is executed (step S505).

判定処理により、２つの予測データ群（たとえば、因子の確率分布ｄ１，ｄ２）が同一の確率分布、たとえば、学習データ集合１０の確率分布に収束するか否かが判定される。これにより、収束していれば、２つの予測データ群（たとえば、因子の確率分布ｄ１，ｄ２）が学習データ集合１０に類似すると判明するため、２つの予測データ群（たとえば、因子の確率分布ｄ１，ｄ２）から予測データ集合（たとえば、統合確率分布Ｄ）が生成される。これにより、予測データ集合（たとえば、統合確率分布Ｄ）の予測値としての確からしさ、すなわち、生成精度の向上を図ることができる。 By the determination process, it is determined whether or not the two prediction data groups (for example, the probability distributions d1 and d2 of the factors) converge to the same probability distribution, for example, the probability distribution of the learning data set 10. As a result, if the two prediction data groups (for example, the probability distributions d1 and d2 of factors) are similar to the learning data set 10 if they converge, the two prediction data groups (for example, the probability distribution d1 of the factor). , D2), a predicted data set (for example, integrated probability distribution D) is generated. This makes it possible to improve the certainty of the predicted data set (for example, the integrated probability distribution D) as a predicted value, that is, the generation accuracy.

また、分析装置２００は、学習モデルを用いた確率サンプリング法（たとえば、マルコフ連鎖モンテカルロ法）によって予測データまたは予測データに類似するデータのいずれか一方を採択する採択率αを制御するパラメータの値（たとえば、σ値）を設定する設定処理を実行する（ステップＳ５０１）。これにより、（１−σ）以上の目的変数となる因子を採択率αで採択することができる。 In addition, the analysis device 200 adopts a probability sampling method (for example, Markov chain Monte Carlo method) using a learning model to adopt either prediction data or data similar to the prediction data, and the value of a parameter that controls the adoption rate α ( For example, a setting process for setting the σ value) is executed (step S501). As a result, it is possible to adopt a factor that is an objective variable of (1-σ) or more at the adoption rate α.

また、分析装置２００は、因子クラスタの生成数を設定する設定処理を実行する（ステップＳ５０１）。これにより、分析装置２００は、ユーザが指定した数分の因子クラスタを生成することができる。具体的には、たとえば、因子クラスタの生成数が増加するほど、予測データ集合（たとえば、統合確率分布Ｄ）が細分化される。これにより、ユーザは、分析結果がより早く得たい場合は、因子クラスタの生成数を低めに設定し、分析精度を上げたい場合は、因子クラスタの生成数を高めに設定することができる。 In addition, the analysis device 200 executes a setting process for setting the number of generated factor clusters (step S501). Thereby, the analysis device 200 can generate the number of factor clusters designated by the user. Specifically, for example, as the number of generated factor clusters increases, the prediction data set (for example, integrated probability distribution D) is subdivided. Thus, the user can set the number of generated factor clusters to a lower number when the analysis result is desired to be obtained earlier, and can set the number of generated factor clusters to a higher number when the analysis accuracy is desired to be improved.

また、分析装置２００は、共起クラスタの生成数を設定する設定処理を実行する（ステップＳ５０１）。これにより、これにより、分析装置２００は、ユーザが指定した数分の共起クラスタを生成することができる。具体的には、たとえば、共起クラスタの生成数が増加するほど、共起しあう因子の数や、共起しあう因子の組み合わせの数が増加する。したがって、ユーザは、分析結果がより早く得たい場合は、共起クラスタの生成数を低めに設定し、分析精度を上げたい場合は、共起クラスタの生成数を高めに設定することができる。 The analysis apparatus 200 also executes a setting process for setting the number of co-occurrence clusters to be generated (step S501). As a result, the analysis apparatus 200 can generate the number of co-occurrence clusters designated by the user. Specifically, for example, as the number of generated co-occurrence clusters increases, the number of co-occurring factors and the number of combinations of co-occurring factors increase. Therefore, the user can set the number of generated co-occurrence clusters to a lower number if the analysis result is to be obtained earlier, and can set the number of generated co-occurrence clusters to a higher number to improve the analysis accuracy.

また、実施例１では、複数の因子３０３，６０３を複数の薬の患者への投与量とし、目的変数３０２，６０２を患者に複数の薬を投与量投与した場合の薬効を示す値（たとえば、疾病確率）とした。これにより、複数の薬の各々をどのタイプ（因子クラスタ）の患者にどの程度投与したら、どの程度の薬効があるかを予測することができる。 Further, in Example 1, a plurality of factors 303 and 603 are set as doses of a plurality of drugs to a patient, and objective variables 302 and 602 are values indicating drug efficacy when a plurality of doses are administered to a patient (for example, Disease probability). This makes it possible to predict how much each of a plurality of drugs is administered to a patient of which type (factor cluster) and how much the drug is effective.

なお、上述した実施例１では、薬効分析を例に挙げて説明したが、商品レコメンデーションにも適用可能である。この場合、図３に示した学習データ集合１０において、患者ＩＤ３０１は、たとえば、患者ではなく顧客に替わる。因子３０３は、たとえば、商品またはサービス（商品またはサービスのジャンルでもよい）の購入数（商品の場合）や利用回数（サービスの場合）を示す。目的変数３０２は、たとえば、商品またはサービス（商品またはサービスのジャンルでもよい）の購入金額（商品の場合）や利用金額（サービスの場合）を示す。因子の確率分布ｄ１，ｄ２、統合確率分布Ｄも同様である。 In addition, in Example 1 described above, the drug efficacy analysis was described as an example, but it is also applicable to product recommendation. In this case, in the learning data set 10 shown in FIG. 3, the patient ID 301 is, for example, a customer instead of a patient. The factor 303 indicates, for example, the number of purchases (in the case of a product) or the number of uses (in the case of a service) of a product or service (the genre of the product or service may be used). The objective variable 302 indicates, for example, the purchase price (in the case of a product) or the usage price (in the case of a service) of a product or service (the genre of the product or service may be used). The same applies to the probability distributions d1 and d2 of the factors and the integrated probability distribution D.

また、ニュース記事の分析の場合、図３に示した学習データ集合１０において、患者ＩＤ３０１は、たとえば、患者ではなく新聞や雑誌、ｗｅｂページに掲載されたニュース記事に替わる。因子３０３は、たとえば、単語の出現回数を示す。目的変数３０２は、たとえば、政治、社会、スポーツ、天気といったニュース記事のジャンルを示す。因子の確率分布ｄ１，ｄ２、統合確率分布Ｄも同様である。 Further, in the case of analysis of news articles, in the learning data set 10 shown in FIG. 3, the patient ID 301 is replaced with, for example, news articles published in newspapers, magazines, and web pages instead of patients. The factor 303 indicates the number of appearances of a word, for example. The objective variable 302 indicates the genre of news articles such as politics, society, sports, and weather. The same applies to the probability distributions d1 and d2 of the factors and the integrated probability distribution D.

実施例２について説明する。実施例１では、１台の計算機により図５に示した分析処理を実行したが、実施例２では、複数台の計算機により図５に示した分析処理を分散処理する。これにより、計算機の負荷低減と分析速度の高速化を図る。各計算機は、具体的には、たとえば、図２に示したハードウェア構成を有する。 Example 2 will be described. In the first embodiment, the analysis process shown in FIG. 5 is executed by one computer, but in the second embodiment, the analysis process shown in FIG. 5 is distributed by a plurality of computers. This will reduce the load on the computer and increase the analysis speed. Specifically, each computer has, for example, the hardware configuration shown in FIG.

図１２は、分析システムのシステム構成例を示す説明図である。分析システム１２００は、複数台の計算機（以下、単に、ノード）Ｎ０〜Ｎｎ（ｎは１以上の整数）と、１台以上のクライアント端末Ｃとを含む。複数台のノードＮ０〜Ｎｎ（ｎは２以上の整数）と、１台以上のクライアント端末Ｃとは、ネットワーク１２０１を介して通信可能に接続される。ノードＮ０は、マスターノードＮ０であり、ノードＮ１〜ＮｎはワーカーノードＮ１〜Ｎｎである。マスターノードＮ０は、ワーカーノードＮ１〜Ｎｎを管理する。ワーカーノードＮ１〜Ｎｎは、マスターノードＮ０の指示にしたがって処理を実行する。なお、マスターノードＮ０の機能をワーカーノードＮ１〜Ｎｎのいずれかが担当してもよい。 FIG. 12 is an explanatory diagram showing a system configuration example of the analysis system. The analysis system 1200 includes a plurality of computers (hereinafter, simply nodes) N0 to Nn (n is an integer of 1 or more) and one or more client terminals C. A plurality of nodes N0 to Nn (n is an integer of 2 or more) and one or more client terminals C are communicably connected via a network 1201. The node N0 is the master node N0, and the nodes N1 to Nn are the worker nodes N1 to Nn. The master node N0 manages the worker nodes N1 to Nn. The worker nodes N1 to Nn execute processing according to the instruction from the master node N0. Note that any one of the worker nodes N1 to Nn may be responsible for the function of the master node N0.

＜分散処理手順例＞
図１３〜図１５は、分析システム１２００による分散処理手順例を示すフローチャートである。なお、ここでは、一例として、ｎ＝２、すなわち、分析システム１２００は、マスターノードＮ０、ワーカーノードＮ１、Ｎ２、クライアント端末Ｃとする。<Example of distributed processing procedure>
13 to 15 are flowcharts showing an example of distributed processing procedure by the analysis system 1200. Here, as an example, n = 2, that is, the analysis system 1200 is the master node N0, the worker nodes N1 and N2, and the client terminal C.

まず、クライアント端末Ｃが初期設定（ステップＳ５０１）を実行する（ステップＳ１３０１）。そして、クライアント端末Ｃは、初期設定（ステップＳ５０１）の設定内容である解析リクエストを、マスターノードＮ０に送信する（ステップＳ１３０２）。 First, the client terminal C executes initial setting (step S501) (step S1301). Then, the client terminal C transmits an analysis request, which is the setting content of the initial setting (step S501), to the master node N0 (step S1302).

マスターノードＮ０は、学習モデル生成リクエストをワーカーノードＮ１に送信する（ステップＳ１３０３）。ワーカーノードＮ１は、学習モデル生成リクエストを受信した場合、ステップＳ５０２と同様、学習モデルを生成する（ステップＳ１３０４）。ワーカーノードＮ１は、学習モデルを生成すると、マスターノードＮ０に学習モデルを送信する（ステップＳ１３０５）。マスターノードＮ０は、ワーカーノードＮ１から学習モデルを受信すると、他のワーカーノードＮ２に学習モデルを送信する（ステップＳ１３０６）。 The master node N0 transmits a learning model generation request to the worker node N1 (step S1303). When receiving the learning model generation request, the worker node N1 generates a learning model as in step S502 (step S1304). After generating the learning model, the worker node N1 transmits the learning model to the master node N0 (step S1305). Upon receiving the learning model from the worker node N1, the master node N0 transmits the learning model to another worker node N2 (step S1306).

つぎに、マスターノードＮ０は、因子の確率分布ｄ１の生成リクエストをワーカーノードＮ１に送信し（ステップＳ１３０７）、因子の確率分布ｄ２の生成リクエストをワーカーノードＮ２に送信する（ステップＳ１３０８）。これにより、因子の確率分布ｄ１，ｄ２を並列処理で生成することができる。 Next, the master node N0 transmits a generation request of the factor probability distribution d1 to the worker node N1 (step S1307), and transmits a generation request of the factor probability distribution d2 to the worker node N2 (step S1308). Thereby, the probability distributions d1 and d2 of the factors can be generated by parallel processing.

つぎに、ワーカーノードＮ１は、ステップＳ５０３と同様、マルコフ連鎖モンテカルロ法に代表される確率サンプリング法を用いて、学習データ集合１０由来の因子の確率分布ｄ１を生成する（ステップＳ１３０９）。ワーカーノードＮ２も、ステップＳ５０３と同様、マルコフ連鎖モンテカルロ法に代表される確率サンプリング法を用いて、学習データ集合１０由来の因子の確率分布ｄ２を生成する（ステップＳ１３１０）。ワーカーノードＮ１は、生成した因子の確率分布ｄ１をマスターノードＮ０に送信する（ステップＳ１３１１）。ワーカーノードＮ２も、生成した因子の確率分布ｄ２をマスターノードＮ０に送信する（ステップＳ１３１２）。 Next, the worker node N1 generates the probability distribution d1 of the factors derived from the learning data set 10 by using the probability sampling method represented by the Markov chain Monte Carlo method as in step S503 (step S1309). Similarly to step S503, the worker node N2 also uses the probability sampling method represented by the Markov chain Monte Carlo method to generate the probability distribution d2 of the factors derived from the learning data set 10 (step S1310). The worker node N1 transmits the generated probability distribution d1 of the factors to the master node N0 (step S1311). The worker node N2 also transmits the generated probability distribution d2 of the factors to the master node N0 (step S1312).

マスターノードＮ０は、ステップＳ５０４と同様、因子の確率分布ｄ１，ｄ２が同一の確率分布に収束しているかを判定する（ステップＳ１３１３）。マスターノードＮ０は、その判定結果をクライアント端末Ｃに送信する（ステップＳ１３１４）。クライアント端末Ｃは、図１１に示したように、判定結果（たとえば、Ｇｅｌｍａｎ−Ｒｕｂｉｎスコア）を受信して表示する（ステップＳ１３１５）。 Similar to step S504, the master node N0 determines whether the probability distributions d1 and d2 of the factors converge to the same probability distribution (step S1313). The master node N0 transmits the determination result to the client terminal C (step S1314). As shown in FIG. 11, the client terminal C receives and displays the determination result (for example, Gelman-Rubin score) (step S1315).

図１４において、マスターノードＮ０は、ステップＳ５０５と同様、因子の確率分布ｄ１，ｄ２を統合して統合確率分布Ｄを生成する（ステップＳ１４０１）。そして、マスターノードＮ０は、因子クラスタリングリクエストをワーカーノードＮ１に送信する（ステップＳ１４０２）。ワーカーノードＮ１は、因子クラスタリングリクエストを受信した場合、ステップＳ５０６と同様、統合確率分布Ｄを用いて、因子クラスタリングにより因子クラスタを生成する（ステップＳ１４０３）。また、ワーカーノードＮ１は、ステップＳ５０７と同様、各因子クラスタから各因子の統計値を算出する（ステップＳ１４０４）。ワーカーノードＮ１は、算出した統計値をマスターノードＮ０に送信する（ステップＳ１４０５）。マスターノードＮ０は、他のワーカーノードＮ２に、受信した統計値を送信する（ステップＳ１４０６）。 In FIG. 14, the master node N0 integrates the probability distributions d1 and d2 of the factors to generate the integrated probability distribution D, as in step S505 (step S1401). Then, the master node N0 transmits a factor clustering request to the worker node N1 (step S1402). When the worker node N1 receives the factor clustering request, similarly to step S506, the worker node N1 uses the integrated probability distribution D to generate a factor cluster by factor clustering (step S1403). Further, the worker node N1 calculates the statistical value of each factor from each factor cluster, similarly to step S507 (step S1404). The worker node N1 transmits the calculated statistical value to the master node N0 (step S1405). The master node N0 transmits the received statistical value to another worker node N2 (step S1406).

マスターノードＮ０は、共起量計算リクエストをワーカーノードＮ２に送信する（ステップＳ１４０７）。ワーカーノードＮ２は、ステップＳ５０８と同様、統合確率分布Ｄの因子同士の共起量を算出する（ステップＳ１４０８）。そして、ワーカーノードＮ２は、算出した共起量（図９の（Ａ）を参照）をマスターノードＮ０に送信する（ステップＳ１４０９）。 The master node N0 transmits a co-occurrence amount calculation request to the worker node N2 (step S1407). The worker node N2 calculates the co-occurrence amount of the factors of the integrated probability distribution D, as in step S508 (step S1408). Then, the worker node N2 transmits the calculated co-occurrence amount (see (A) of FIG. 9) to the master node N0 (step S1409).

図１５において、マスターノードＮ０は、ステップＳ５０９と同様、共起クラスタリングにより共起クラスタを生成し、共起クラスタのＩＤリストＡ，Ｂを生成する（ステップＳ１５０１）。共起クラスタのＩＤリストＡとは、統合確率分布Ｄのエントリを分割した一方のエントリ群を一意に特定するＩＤリストである。共起クラスタのＩＤリストＢとは、統合確率分布Ｄのエントリを分割した他方のエントリ群を一意に特定するＩＤリストである。 In FIG. 15, the master node N0 generates co-occurrence clusters by co-occurrence clustering, as in step S509, and generates co-occurrence cluster ID lists A and B (step S1501). The co-occurrence cluster ID list A is an ID list that uniquely identifies one of the entry groups obtained by dividing the entries of the integrated probability distribution D. The co-occurrence cluster ID list B is an ID list that uniquely identifies the other entry group obtained by dividing the entries of the integrated probability distribution D.

マスターノードＮ０は、共起クラスタのＩＤリストＡをワーカーノードＮ１に送信し（ステップＳ１５０２）、共起クラスタのＩＤリストＢをワーカーノードＮ２に送信する（ステップＳ１５０３）。ワーカーノードＮ１は、ステップＳ５０９と同様、ＩＤリストＡについて、共起クラスタリングにより共起クラスタを生成する（ステップＳ１５０４）。ワーカーノードＮ２も、ステップＳ５０９と同様、ＩＤリストＢについて、共起クラスタリングにより共起クラスタを生成する（ステップＳ１５０５）。 The master node N0 transmits the ID list A of the co-occurrence cluster to the worker node N1 (step S1502) and the ID list B of the co-occurrence cluster to the worker node N2 (step S1503). The worker node N1 generates a co-occurrence cluster for the ID list A by co-occurrence clustering, similarly to step S509 (step S1504). Similarly to step S509, the worker node N2 also generates a co-occurrence cluster for the ID list B by co-occurrence clustering (step S1505).

ワーカーノードＮ１は、ステップＳ５１０と同様、ステップＳ１５０４で得られた共起クラスタの予測値を算出する（ステップＳ１５０６）。ワーカーノードＮ２も、ステップＳ５１０と同様、ステップＳ１５０５で得られた共起クラスタの予測値を算出する（ステップＳ１５０７）。ワーカーノードＮ１は、ステップＳ１５０６で得られた予測値を記憶デバイス２０２に保存する（ステップＳ１５０８）。ワーカーノードＮ２も、ステップＳ１５０７で得られた予測値を記憶デバイス２０２に保存する（ステップＳ１５０９）。ワーカーノードＮ１は、ステップＳ１５０６で得られた予測値をマスターノードＮ０に送信する（ステップＳ１５１０）。ワーカーノードＮ２も、ステップＳ１５０７で得られた予測値をマスターノードＮ０に送信する（ステップＳ１５１１）。 The worker node N1 calculates the predicted value of the co-occurrence cluster obtained in step S1504, as in step S510 (step S1506). Similarly to step S510, the worker node N2 also calculates the predicted value of the co-occurrence cluster obtained in step S1505 (step S1507). The worker node N1 saves the predicted value obtained in step S1506 in the storage device 202 (step S1508). The worker node N2 also stores the predicted value obtained in step S1507 in the storage device 202 (step S1509). The worker node N1 transmits the predicted value obtained in step S1506 to the master node N0 (step S1510). The worker node N2 also transmits the predicted value obtained in step S1507 to the master node N0 (step S1511).

マスターノードＮ０は、ステップＳ５１１と同様、予測値のしきい値処理を実行する（ステップＳ１５１２）。そして、マスターノードＮ０は、その実行結果である計算マーカをクライアント端末Ｃに送信する（ステップＳ１５１３）。クライアント端末Ｃは、計算マーカを表示画面に表示する（ステップＳ１５１４）。 The master node N0 executes threshold value processing of the predicted value, similarly to step S511 (step S1512). Then, the master node N0 transmits the calculation marker which is the execution result to the client terminal C (step S1513). The client terminal C displays the calculation marker on the display screen (step S1514).

図１６は、図１５に示した分析システム１２００による分散処理手順例を示すフローチャート３の変形例を示すフローチャートである。図１５では、ＩＤリストＡ，ＢごとにワーカーノードＮ１、Ｎ２が並列で共起クラスタリングを実行することで、処理の高速化を実現した。一方、図１６では、ＩＤリストＡ，Ｂの共起クラスタ計算は、ワーカーノードＮ１，Ｎ２ではなく、マスターノードＮ０が実行する。なお、図１５と同一処理については同一ステップ番号を付し、その説明を省略する。 FIG. 16 is a flowchart showing a modification of the flowchart 3 showing an example of the distributed processing procedure by the analysis system 1200 shown in FIG. In FIG. 15, the worker nodes N1 and N2 execute co-occurrence clustering in parallel for each of the ID lists A and B, thereby realizing high-speed processing. On the other hand, in FIG. 16, the co-occurrence cluster calculation of the ID lists A and B is executed by the master node N0, not by the worker nodes N1 and N2. The same steps as those in FIG. 15 are designated by the same step numbers, and the description thereof will be omitted.

図１６において、マスターノードＮ０は、ステップＳ５０９と同様、ＩＤリストＡについて、共起クラスタリングにより共起クラスタを生成する（ステップＳ１６０２）。マスターノードＮ０は、ＩＤリストＡの共起クラスタをワーカーノードＮ１に送信する（ステップＳ１６０３）。 In FIG. 16, the master node N0 generates a co-occurrence cluster by co-occurrence clustering for the ID list A, as in step S509 (step S1602). The master node N0 transmits the co-occurrence cluster of the ID list A to the worker node N1 (step S1603).

ワーカーノードＮ１は、ステップＳ５１０と同様、ステップＳ１６０２で得られた共起クラスタの予測値を算出する（ステップＳ１６０４）。ワーカーノードＮ１は、ステップＳ１６０４で得られた予測値を記憶デバイス２０２に保存する（ステップＳ１６０４）。ワーカーノードＮ１は、ステップＳ１６０４で得られた予測値をマスターノードＮ０に送信する（ステップＳ１６０６）。 The worker node N1 calculates the predicted value of the co-occurrence cluster obtained in step S1602, as in step S510 (step S1604). The worker node N1 stores the predicted value obtained in step S1604 in the storage device 202 (step S1604). The worker node N1 transmits the predicted value obtained in step S1604 to the master node N0 (step S1606).

マスターノードＮ０は、ステップＳ５０９と同様、ＩＤリストＢについて、共起クラスタリングにより共起クラスタを生成する（ステップＳ１６０７）。マスターノードＮ０は、ＩＤリストＢの共起クラスタをワーカーノードＮ２に送信する（ステップＳ１６０８）。 Similar to step S509, the master node N0 generates co-occurrence clusters for the ID list B by co-occurrence clustering (step S1607). The master node N0 transmits the co-occurrence cluster of the ID list B to the worker node N2 (step S1608).

ワーカーノードＮ２は、ステップＳ５１０と同様、ステップＳ１６０７で得られた共起クラスタの予測値を算出する（ステップＳ１６０９）。ワーカーノードＮ１は、ステップＳ１６０９で得られた予測値を記憶デバイス２０２に保存する（ステップＳ１６１０）。ワーカーノードＮ２は、ステップＳ１６０９で得られた予測値をマスターノードＮ０に送信する（ステップＳ１６１１）。 The worker node N2 calculates the predicted value of the co-occurrence cluster obtained in step S1607, as in step S510 (step S1609). The worker node N1 stores the predicted value obtained in step S1609 in the storage device 202 (step S1610). The worker node N2 transmits the predicted value obtained in step S1609 to the master node N0 (step S1611).

このように、実施例２によれば、実施例１と同様の効果を奏する。また、実施例２によれば、複数台の計算機により図５に示した分析処理を分散処理する。これにより、計算機の負荷低減と分析速度の高速化を図ることができる。なお、図１３〜図１６に示した分散処理は一例である。したがって、このほかにも、たとえば、図１３〜図１６に示したステップのうち少なくとも２以上のステップを異なる計算機で実行してもよい。 Thus, according to the second embodiment, the same effect as that of the first embodiment can be obtained. Further, according to the second embodiment, the analysis processing shown in FIG. 5 is distributed and processed by a plurality of computers. This makes it possible to reduce the load on the computer and increase the analysis speed. The distributed processing shown in FIGS. 13 to 16 is an example. Therefore, in addition to this, for example, at least two or more of the steps shown in FIGS. 13 to 16 may be executed by different computers.

なお、本発明は前述した実施例に限定されるものではなく、添付した特許請求の範囲の趣旨内における様々な変形例及び同等の構成が含まれる。例えば、前述した実施例は本発明を分かりやすく説明するために詳細に説明したものであり、必ずしも説明した全ての構成を備えるものに本発明は限定されない。また、ある実施例の構成の一部を他の実施例の構成に置き換えてもよい。また、ある実施例の構成に他の実施例の構成を加えてもよい。また、各実施例の構成の一部について、他の構成の追加、削除、または置換をしてもよい。 The present invention is not limited to the above-described embodiments, but includes various modifications and equivalent configurations within the spirit of the appended claims. For example, the above-described embodiments have been described in detail in order to explain the present invention in an easy-to-understand manner, and the present invention is not necessarily limited to those having all the configurations described. Further, part of the configuration of one embodiment may be replaced with the configuration of another embodiment. Further, the configuration of another embodiment may be added to the configuration of one embodiment. Moreover, you may add, delete, or replace another structure with respect to a part of structure of each Example.

また、前述した各構成、機能、処理部、処理手段等は、それらの一部又は全部を、例えば集積回路で設計する等により、ハードウェアで実現してもよく、プロセッサがそれぞれの機能を実現するプログラムを解釈し実行することにより、ソフトウェアで実現してもよい。 Further, each of the above-mentioned configurations, functions, processing units, processing means, etc. may be realized by hardware, for example, by designing a part or all of them with an integrated circuit, and the processor realizes each function. It may be realized by software by interpreting and executing the program.

各機能を実現するプログラム、テーブル、ファイル等の情報は、メモリ、ハードディスク、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）等の記憶装置、又は、ＩＣカード、ＳＤカード、ＤＶＤ等の記録媒体に格納することができる。 Information such as programs, tables, and files that implement each function can be stored in a storage device such as a memory, a hard disk, an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.

また、制御線や情報線は説明上必要と考えられるものを示しており、実装上必要な全ての制御線や情報線を示しているとは限らない。実際には、ほとんど全ての構成が相互に接続されていると考えてよい。 Further, the control lines and information lines are shown to be necessary for explanation, and not all the control lines and information lines necessary for mounting are shown. In reality, it can be considered that almost all configurations are connected to each other.

Claims

An analysis apparatus comprising: a processor that executes a program; and a storage device that stores the program,
The storage device includes a learning data set including a plurality of learning data sets each including an actual measurement value of an objective variable and an actual measurement value of a plurality of factors, and a prediction including a plurality of prediction data-derived prediction data including prediction values of the plurality of factors. A data set, and a learning model showing a relationship between the measured values of the objective variable and the measured values of the plurality of factors are stored,
The processor is
A first generation process of clustering the prediction data set so that the values of the plurality of factors are similar to each other, and generating a plurality of factor clusters;
A first calculation process of calculating a co-occurrence amount in which the plurality of factors are co-occurring by the correlation of the plurality of factors using the prediction data set;
A second generation process for clustering the plurality of factors based on the co-occurrence amount calculated by the first calculation process to generate a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors; ,
Of the predicted values of the two or more factors in the specific prediction data group included in the specific factor cluster including the two or more factors among the plurality of factor clusters generated by the first generation processing, the second generation Prediction of the target variable in the specific factor cluster by giving the learning model prediction values of two or more specific factors indicated by a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the processing. A second calculation process for calculating a value,
An analyzer for performing the following.

The analysis device according to claim 1, wherein
The processor is
Executing a third calculation process for calculating a statistical value representative of the predicted values of the two or more factors in the specific factor cluster, based on the predicted values of the two or more factors in the specific predicted data group,
In the second calculation process, the processor includes two or more specific factors indicated by the specific co-occurrence cluster among the statistical values representing the predicted values of the two or more factors calculated by the third calculation process. The analysis device is characterized in that a predicted value of the objective variable in the specific factor cluster is calculated by giving the statistical value of the above to the learning model.

The analysis device according to claim 1, wherein
The processor is
Setting processing for setting the type of the learning model,
A third generation process of generating a learning model of the type set by the setting process using the measured values of the objective variable and the measured values of the plurality of factors and storing the learning model in the storage device;
An analyzer for performing the following.

The analysis device according to claim 3, wherein
In the setting process, the processor sets a linear model or a non-linear model as the type.

The analysis device according to claim 1, wherein
The predictive data set is a data set generated from the learning data set by a probability sampling method using the learning model.

The analysis device according to claim 1, wherein
The processor is
A fourth generation process for generating two prediction data groups by adopting one of the prediction data or data similar to the prediction data by a probability sampling method using the learning model;
Determination processing for determining whether or not the two prediction data groups generated by the fourth generation processing converge to the same probability distribution,
Performing an integration process of generating the prediction data set by integrating the two prediction data groups based on the determination result of the determination process,
In the first generation processing, the processor clusters the prediction data set obtained by the integration processing so that the values of the plurality of factors are similar to each other, and generates the plurality of factor clusters,
In the first calculation process, the processor calculates a co-occurrence amount in which the plurality of factors co-occur by correlation of the plurality of factors, using the prediction data set obtained by the integration process. Analyzer.

The analysis device according to claim 6, wherein
The processor is
By performing a setting process for setting the value of a parameter that controls the adoption rate to adopt one of the prediction data or the data similar to the prediction data by the probability sampling method using the learning model,
In the fourth generation processing, the processor generates the two prediction data groups by adopting one of the prediction data or data similar to the prediction data based on the adoption rate. Analyzer.

The analysis device according to claim 1, wherein
The processor is
Execute the setting process to set the number of generation of the factor cluster,
In the first generation processing, the processor clusters the prediction data set so that the values of the plurality of factors are similar to each other, and generates the number of generation factor clusters set by the setting processing. Analyzer.

The analysis device according to claim 1, wherein
The processor is
Execute a setting process to set the number of co-occurrence clusters generated,
In the second generation process, the processor clusters the plurality of factors based on the co-occurrence amount calculated in the first calculation process, and has one or more co-occurrence clusters including two or more factors. An analyzing apparatus, wherein the number of generated clusters set by the setting process is generated.

The analysis device according to claim 1, wherein
The analyzer is characterized in that the plurality of factors are doses of a plurality of drugs to a patient, and the objective variable is a value indicating a drug effect when the plurality of drugs are administered to the patient in the doses.

An analysis system in which a plurality of computers are communicably connected,
Any one of the plurality of computers, a learning data set having a plurality of learning data including the actual measurement value of the objective variable and the actual measurement value of the plurality of factors, and the prediction data derived from the learning data including the prediction values of the plurality of factors Storing a prediction data set having a plurality of, and a learning model showing the relationship between the actual measurement value of the objective variable and the actual measurement value of the plurality of factors,
One of the plurality of computers,
A first generation process of clustering the prediction data set so that the values of the plurality of factors are similar to each other, and generating a plurality of factor clusters;
A first calculation process of calculating a co-occurrence amount in which the plurality of factors are co-occurring by the correlation of the plurality of factors using the prediction data set;
A second generation process for clustering the plurality of factors based on the co-occurrence amount calculated by the first calculation process to generate a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors; ,
Of the predicted values of the two or more factors in the specific prediction data group included in the specific factor cluster including the two or more factors among the plurality of factor clusters generated by the first generation processing, the second generation Prediction of the target variable in the specific factor cluster by giving the learning model prediction values of two or more specific factors indicated by a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the processing. A second calculation process for calculating a value,
An analysis system characterized by executing.

An analysis method by an analysis device having a processor that executes a program and a storage device that stores the program,
The storage device includes a learning data set having a plurality of learning data including measured values of objective variables and measured values of a plurality of factors, and prediction having a plurality of prediction data derived from the learning data including predicted values of the plurality of factors. A data set, and a learning model showing a relationship between the measured values of the objective variable and the measured values of the plurality of factors are stored,
The processor is
A first generation process of clustering the prediction data set so that the values of the plurality of factors are similar to each other, and generating a plurality of factor clusters;
A first calculation process of calculating a co-occurrence amount in which the plurality of factors are co-occurring by the correlation of the plurality of factors using the prediction data set;
A second generation process for clustering the plurality of factors based on the co-occurrence amount calculated by the first calculation process to generate a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors; ,
Of the predicted values of the two or more factors in the specific prediction data group included in the specific factor cluster including the two or more factors among the plurality of factor clusters generated by the first generation processing, the second generation Prediction of the target variable in the specific factor cluster by giving the learning model prediction values of two or more specific factors indicated by a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the processing. A second calculation process for calculating a value,
An analysis method characterized by executing.