WO2018042606A1

WO2018042606A1 - Analysis device, analysis system, and analysis method

Info

Publication number: WO2018042606A1
Application number: PCT/JP2016/075726
Authority: WO
Inventors: 琢磨柴原; 英司金森; 昌宏荻野; 鈴木　麻由美
Original assignee: 株式会社日立製作所
Priority date: 2016-09-01
Filing date: 2016-09-01
Publication date: 2018-03-08
Also published as: JP6695431B2; JPWO2018042606A1

Abstract

An analysis device stores, in a storage device, a learning data set having a plurality of pieces of learning data including an actual value of an objective variable and actual values of a plurality of factors, a prediction data set having a plurality of pieces of prediction data derived from learning data including prediction values of the plurality of factors, and a learning model indicating a relationship between the actual value of the objective variable and the actual values of the plurality of factors; performs clustering of the prediction data set in such a manner that the values of the plurality of factors are similar to each other to generate a plurality of factor clusters; calculates, using the prediction data set, a co-occurrence amount of co-occurrence of the plurality of factors on the basis of a correlation of the plurality of factors; performs clustering of the plurality of factors on the basis of the co-occurrence amount to generate a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors; and feeds the learning model with, from among the prediction values of two or more factors in a specific prediction data group included in a specific factor cluster including two or more factors in the plurality of factor clusters, prediction values of two or more specific factors indicated by a specific co-occurrence cluster out of the plurality of co-occurrence clusters, thereby calculating a prediction value of the objective variable in the specific factor cluster.

Description

Analysis apparatus, analysis system, and analysis method

The present invention relates to an analysis apparatus, an analysis system, and an analysis method for analyzing data.

US Pat. No. 6,057,031 discloses a computer-implemented method, system, and computer-readable storage for use with a clinical decision support system that identifies and provides information regarding correlations between patient attributes and one or more adverse events (AEs). A medium is disclosed. The process of U.S. Patent No. 6,057,056 processes database information including AE and one or more patient attributes for correlation between AE and patient attributes, and includes one or more AEs and one or more patient attributes. Identifying at least one correlation between. Correlation may be discovered through an association rule discovery process to determine one or more association rules. Each association rule satisfies confidence, support, and / or other thresholds. The process further provides information or alerts to the user based on the identified or discovered correlation.

Patent Document 2 discloses a medical assistance program that provides appropriate support for medical care. In the medical assistance program of Patent Document 2, the treatment period of the patient for the diagnosed disease is compared with the reference cure period for the diagnosed disease, and when the treatment period of the patient exceeds the reference cure period, Search for other illnesses that develop symptoms similar to the symptoms of the diagnosed illness from the storage means that associates and stores each illness that develops similar symptoms, and outputs the disease name information of the searched other illnesses , Cause the computer to execute the process.

Special table 2012-524945 gazette JP 2014-199597 A

However, the above-described conventional technique has a problem that even if a learning model is generated from learning data, it is not known which factor is associated with which other factor. For example, when the objective variable is the disease probability and the factor is the dose of a plurality of drugs, for example, it is difficult to know whether it is effective to administer the drug A and the drug B to the patient in combination or cause side effects. There is.

The present invention aims to analyze the effectiveness of factor combinations.

An analysis apparatus, an analysis system, and an analysis method according to an aspect of the invention disclosed in the present application include a learning data set having a plurality of learning data including measured values of objective variables and measured values of a plurality of factors in a storage device; Storing a prediction data set having a plurality of prediction data derived from the learning data including prediction values of the plurality of factors, and a learning model indicating a relationship between the actual measurement values of the objective variable and the actual measurement values of the plurality of factors; In addition, the prediction data set is clustered so that the values of the plurality of factors are similar, and a plurality of factor clusters are generated, and the prediction data set is used to generate the plurality of factor data. A first calculation process for calculating a co-occurrence amount in which the plurality of factors co-occur by correlation, and a cluster of the plurality of factors based on the co-occurrence amount calculated by the first calculation process A second generation process for generating a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors, and two or more of the plurality of factor clusters generated by the first generation process. Among the predicted values of the two or more factors in the specific prediction data group included in the specific factor cluster including the factor, a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the second generation process is A second calculation process for calculating a predicted value of the objective variable in the specific factor cluster by giving predicted values of two or more specific factors to the learning model.

According to a typical embodiment of the present invention, the effectiveness of factor combinations can be analyzed. Problems, configurations, and effects other than those described above will become apparent from the description of the following embodiments.

FIG. 1 is an explanatory diagram of a data analysis example according to the first embodiment. FIG. 2 is a block diagram illustrating a hardware configuration example of the analysis apparatus. FIG. 3 is an explanatory diagram showing the detailed contents of the learning data shown in FIG. FIG. 4 is an explanatory diagram illustrating an example of an initial setting screen. FIG. 5 is a flowchart illustrating an example of an analysis processing procedure performed by the analysis apparatus. FIG. 6 is an explanatory diagram showing a probability distribution of factors. FIG. 7 is an explanatory diagram showing an example of the integrated probability distribution. FIG. 8 is an explanatory diagram showing the result of factor clustering. FIG. 9 is an explanatory diagram of an example of co-occurrence clustering processing. FIG. 10 is an explanatory diagram illustrating the prediction result obtained in step S510. FIG. 11 is an explanatory diagram illustrating a display screen example. FIG. 12 is an explanatory diagram illustrating a system configuration example of the analysis system. FIG. 13 is a flowchart 1 illustrating an example of a distributed processing procedure by the analysis system. FIG. 14 is a flowchart 2 illustrating an example of a distributed processing procedure by the analysis system. FIG. 15 is a flowchart 3 illustrating an example of a distributed processing procedure by the analysis system. FIG. 16 is a flowchart showing a modification of the flowchart 3 showing an example of a distributed processing procedure by the analysis system shown in FIG.

<Data analysis example>
FIG. 1 is an explanatory diagram of a data analysis example according to the first embodiment. (1) to (6) show the procedure of the analysis method by the analyzer. (1) The analysis apparatus generates a learning model from the learning data set 10. In the learning data set 10, for example, the objective variable is the drug effect, specifically, the disease probability, and the factor is the dose of a plurality of drugs to the patient. The disease probability can be expressed as 0% to 100%, but here, the disease is 1 (= 100%) and the health is 0 (= 0%). In addition, the factor is four explanatory variables of drug 1 to drug 4 for convenience, but actually, for example, tens of thousands to hundreds of millions of drugs. Each entry indicates a patient. The number of patients is six from A to F for convenience, but actually, for example, there are tens of thousands to hundreds of millions of patients.

(1) In generating a learning model, the generated learning model includes a linear model and a non-linear model. The linear model includes, for example, a linear classification and a logistic regression. Nonlinear models include, for example, neural networks (Neural Network), support vector machines (Support Vector Machine), Adaboost, and random forests (Random Forests). The user can select one of the models when generating the learning model. For example, when the user wants to analyze the effectiveness of the combination of factors at high speed, the user may select a linear model, and when he wants to analyze with high accuracy, the user may select a nonlinear model.

(2) The analysis device generates a probability distribution 20 of each factor from the learning model generated in (1). Specifically, for example, the analysis apparatus generates two sets of probability distributions 20 of factors derived from the learning data set 10 (referred to as d1 and d2 respectively) using a probability sampling method typified by the Markov chain Monte Carlo method. . Thereby, a large amount of virtual factor data can be collected.

(3) The analysis apparatus determines whether or not the probability distributions d1 and d2 of the factors generated in (2) converge to the same probability distribution. Specifically, for example, the Gelman-Rubin method is used for the convergence determination. Until convergence, the analyzer generates a probability distribution 20 of the factor (2).

(4) The analysis apparatus integrates the probability distributions d1 and d2 of the factors determined to converge in (3), and executes factor clustering on the integrated probability distribution of the factors (integrated probability distribution D). Specifically, for example, k-means clustering is used for factor clustering. The number of clusters is set in advance. Here, the number of clusters is “3” as an example. Thus, in the factor clustering result 40, the entry of the integrated probability distribution D is classified into three types of patient types α, β, and γ.

(5) Further, the analysis apparatus performs co-occurrence clustering on the integrated probability distribution D. Specifically, for example, the analysis device calculates a correlation coefficient between factors of the integrated probability distribution D as a co-occurrence amount. Then, the analysis device applies a hierarchical clustering method to the co-occurrence amount to generate a co-occurrence cluster. Here, it is assumed that co-occurrence cluster 1 (drug 1, drug 2) and co-occurrence cluster 2 (drug 3, drug 4) are obtained. Here, the co-occurrence cluster is a combination of two factors, but may be a combination of three or more factors.

(6) The analysis apparatus calculates predicted values of disease probabilities for the patient types α, β, and γ by giving factors belonging to the co-occurrence cluster to the learning model for the patient types α, β, and γ. Thus, the analyzer can analyze the effectiveness of the combination of factors.

<Hardware configuration example of analyzer>
FIG. 2 is a block diagram illustrating a hardware configuration example of the analysis apparatus. The analysis apparatus 200 includes a processor 201, a storage device 202, an input device 203, an output device 204, and a communication interface (communication IF 205). The processor 201, the storage device 202, the input device 203, the output device 204, and the communication IF 205 are connected by a bus. The processor 201 controls the analysis device 200. The storage device 202 serves as a work area for the processor 201. The storage device 202 is a non-temporary or temporary recording medium that stores various programs and data. Examples of the storage device 202 include a ROM (Read Only Memory), a RAM (Random Access Memory), a HDD (Hard Disk Drive), and a flash memory. The input device 203 inputs data. Examples of the input device 203 include a keyboard, a mouse, a touch panel, a numeric keypad, and a scanner. The output device 204 outputs data. Examples of the output device 204 include a display and a printer. The communication IF 205 is connected to a network and transmits / receives data.

<Learning data example>
FIG. 3 is an explanatory diagram showing the detailed contents of the learning data set 10 shown in FIG. The learning data set 10 is, for example, data in a table format. In the following description of the database or table, the value of the AA field bbb (AA is a field name and bbb is a code) may be expressed as AAbbb. For example, the value of the patient ID field 301 is expressed as patient ID 301.

The learning data set 10 includes a patient ID field 301, an objective variable field 302, and a factor field 303. The values of the fields 301 to 303 in the same row constitute an entry that is patient information. In FIG. 3, the number of entries is “6”, but actually there are, for example, tens of thousands to hundreds of millions of patient entries.

The patient ID field 301 is a storage area for storing a patient ID. The patient ID 301 is identification information that uniquely identifies a patient.

The objective variable field 302 is a storage area for storing objective variables for each patient ID 301. The objective variable 302 indicates the disease probability. Although the disease probability can be expressed in the range of 0% to 100%, the learning data set 10 is actually measured values, so that the disease is 1 (= 100%) and the health is 0 (= 0%).

The factor field 303 is a storage area for storing a plurality of factors. The factor 303 is an explanatory variable indicating the dose of medicine. In this example, the factor 303 is four explanatory variables of medicine 1 to medicine 4 for convenience, but actually, for example, tens of thousands to hundreds of millions of medicines. In addition, the unit of the dosage of the medicine which is the factor 303 is determined for each medicine.

In FIG. 3, the entry with the patient ID 301 of “patient A” is that patient 1 is given “20” for medicine 1, “13.0” for medicine 2, and “22.0” for medicine 4. Indicates a disease. In addition, an entry whose patient ID 301 is “patient B” is given to patient B “10” for medicine 1, “23.0” for medicine 2, “1” for medicine 3, and “31.0” for medicine 4. As a result, it is shown that patient B is ill.

<Initial setting screen example>
FIG. 4 is an explanatory diagram illustrating an example of an initial setting screen. The initial setting screen 400 is displayed on a display which is an example of the output device 204 and is set by the input device 203. The machine learning selection area 401 is a pull-down interface for selecting a machine learning method. The factor clustering setting area 402 is an area for setting a clustering method and the number of clusters. The factor clustering selection area 403 is a pull-down interface for selecting a factor clustering method. The factor cluster number setting area 404 is an input field for setting the number of clusters to be obtained by factor clustering.

Σ value setting area 405 is an input field for setting a σ value. The σ value is a fixed parameter used in (2) the probability distribution 20 of each factor in FIG. 1 in the generation rate α of the Markov chain Monte Carlo method. The σ value is a value in the range of greater than 0 and less than or equal to 1.

The co-occurrence clustering setting area 406 is an area for setting a co-occurrence method, a clustering method, the number of clusters, and a threshold value. The co-occurrence amount selection area 407 is a pull-down interface for selecting a co-occurrence amount calculation method. The co-occurrence clustering selection area 408 is a pull-down interface for selecting a co-occurrence clustering method. The co-occurrence cluster number setting area 409 is an input field for setting the number of co-occurrence clusters to be obtained by factor clustering. The threshold setting area 410 is an input field for setting a threshold for a predicted value of a correlation value indicating the degree of association of factor clusters. The decision button 411 is a button for inputting values of the items 401 to 410.

<Analysis processing procedure example>
FIG. 5 is a flowchart illustrating an example of an analysis process procedure performed by the analysis apparatus 200. The analysis apparatus 200 executes the processing shown in the flowchart of FIG. 5 by causing the processor 201 to execute the analysis program stored in the storage device 202. First, the analysis apparatus 200 performs initial setting (step S501). In the initial setting (step S501), the initial setting screen shown in FIG. 4 is displayed on the display. The user selects or inputs each item 401 to 409 on the initial setting screen. The analysis apparatus 200 reads the values of the items 401 to 409 by detecting the pressing of the input button 410.

Next, the analysis apparatus 200 generates a learning model from the learning data set 10 as shown in (1) of FIG. 1 (step S502). In the case of logistic regression, the learning model is expressed by the following formula (1).

y = f (x) = σ (w ^t x + b) (1)

y is a scalar indicating the objective variable. x is an m-dimensional feature vector. m corresponds to the number of factors. In the learning data set 10 of FIG. 3, since the number of factors 303 is four (drug 1 to drug 4), m = 4. σ () is a sigmoid function. The vector w and the scalar b are weight and bias parameters, respectively, and are called learning parameters. In the case of the nonlinear model, w ^t x in the sigmoid function σ () is replaced with a more complicated function than w ^t x based on the vector w and the factor x.

The analysis apparatus 200 selects a learning model corresponding to the machine learning method selected in the machine learning selection area 401 in FIG. 4 and obtains a learning parameter that represents the learning model.

Next, the analysis apparatus 200 generates probability distributions d1 and d2 of factors derived from the learning data set 10 using a probability sampling method represented by the Markov chain Monte Carlo method, as shown in (2) of FIG. (Step S503).

FIG. 6 is an explanatory diagram showing probability distributions d1 and d2 of factors. The factor probability distributions d1 and d2 include a virtual patient ID field 601, an objective variable field 602, and a factor field 603. The value of each field 601 to 603 in the same row constitutes an entry that becomes virtual patient information. The number of entries is the same as the number of entries in the learning data set 10.

The virtual patient ID field 601 is a storage area for storing a virtual patient ID. The virtual patient ID 601 is identification information that uniquely identifies a virtual patient.

The objective variable field 602 is a storage area for storing objective variables for each virtual patient ID 601. The objective variable 602 indicates the disease probability. The disease probability is expressed as 0% to 100%.

The factor field 603 is a storage area for storing a plurality of factors. The factor 603 is an explanatory variable indicating the dose of medicine. In this example, the number of factors 603 is the same as the number of factors 303 in the learning data set 10.

An example of generation of virtual patient information that is an entry of factor probability distributions d1 and d2 will be described. The analysis apparatus 200 selects a factor vector of any entry from the entry group of the learning data set 10. For example, it is assumed that the factor vector x = (20, 13.0, 0, 22.0) having the patient ID 301 “patient A” is selected. The analysis apparatus 200 adds a random value r to each element of the selected factor vector to obtain a virtual factor vector x ′ = (20 + r, 13.0 + r, 0 + r, 22.0 + r).

The analyzing apparatus 200 substitutes the selected factor vector x and the virtual factor vector x ′ into the expression (2) of the acceptance rate α of the Markov chain Monte Carlo method.

The function q is a Gaussian distribution function. The function q (x ′ | x) is a Gaussian distribution function indicating the probability of generating the virtual factor vector x ′ when the factor vector x is given. The function q (x | x ′) is a Gaussian distribution function indicating the probability of generating the factor vector x when the virtual factor vector x ′ is given. The function f is, for example, a learning model generated in step S502 as shown in the equation (1). The σ value input to the σ value setting area 405 is substituted for σ. Depending on the σ value, the acceptance rate α becomes a Gaussian distribution including patient information with a disease probability of (1−σ) or more. That is, a virtual factor vector x ′ of virtual patient information having a disease probability of (1−σ) or more can be adopted at the adoption rate α.

Next, a uniform random number β is generated in the interval from 0 to 1, and when the acceptance rate α is equal to or greater than a threshold β (for example, 1), the analysis apparatus 200 adopts the virtual factor vector x ′. If the acceptance rate α is not greater than or equal to the threshold value, the analysis apparatus 200 adopts the factor vector x. The adopted factor vector is denoted as adopted factor vector <x>.

When the acceptance rate α is equal to or greater than a threshold value β (for example, 1), the analysis apparatus 200 compares the acceptance factor vector <x> with the random number vector R. Specifically, for example, the analysis apparatus 200 determines whether or not all elements of the adopted factor vector <x> are equal to or greater than corresponding elements of the random number vector R. If all the elements of the adopted factor vector <x> are equal to or greater than the corresponding elements of the random number vector R, the analysis apparatus 200 determines the adopted factor vector <x> as the virtual factor vector of the new virtual patient.

If all the elements of the adopted factor vector <x> are not equal to or greater than the corresponding elements of the random number vector R, the analysis apparatus 200 determines the factor vector x as the virtual factor vector of the new virtual patient. It should be noted that although all the elements of the adopted factor vector <x> are equal to or more than the corresponding elements of the random number vector R, the judgment condition is that some elements of the adopted factor vector <x> It may be more than the corresponding element.

Thereafter, the analysis apparatus 200 calculates a disease probability that is the objective variable 602 by giving a factor 603 that is a virtual factor vector of a new virtual patient to the learning model in each virtual patient information entry. In this way, in step S503, the entry of the virtual patient information is set, and the factor probability distributions d1 and d2 are generated.

5, the analysis apparatus 200 determines whether or not the factor probability distributions d1 and d2 have converged to the same probability distribution as shown in (3) of FIG. 1 (step S504). Specifically, for example, the analysis apparatus 200 calculates a convergence value for verifying whether the probability distributions d1 and d2 of the factors converge to the same probability distribution by the Gelman-Rubin method. More specifically, the analysis apparatus 200 gives the column data of the factor probability distribution d1 and the column data of the factor probability distribution d2 corresponding to the column data to the convergence determination formula of Gelman-Rubin so as to converge. The value Rhat is calculated.

For example, the analysis apparatus 200 gives the convergence value Rhat by giving the column data of the objective variable 602 of the factor probability distribution d1 and the column data of the objective variable 602 of the factor probability distribution d2 to the convergence determination formula of Gelman-Rubin. calculate. Further, the analysis apparatus 200 provides the Gelman-Rubin convergence judgment formula with the column data of the drug 1 in the factor 603 of the factor probability distribution d1 and the column data of the drug 1 in the factor 603 of the factor probability distribution d2. A convergence value Rhat is calculated. Similarly, the analysis apparatus 200 calculates the convergence value Rhat for the column data after the medicine 2.

If the convergence value Rhat is 1.1 or less, it is determined that the column data of the probability distributions d1 and d2 of the factors converge to the same probability distribution. The analysis device 200 deletes the column data determined not to converge. If the number of remaining column data is equal to or greater than a threshold value (for example, 50% or more), the factor probability distributions d1 and d2 have converged to the same probability distribution (step S504: Yes), and the process proceeds to step S505. Transition. If the number of remaining column data is not greater than or equal to the threshold value (step S504: No), the process returns to step S503, and the analysis device 200 regenerates the probability distributions d1 and d2 of the factors derived from the learning data set 10. When even one column data of the factor 603 of the factor probability distributions d1 and d2 is deleted, the analysis apparatus 200 gives the remaining factor 603 to the learning model and recalculates the objective variable 602.

By deleting non-converging column data, the reliability of the factor probability distributions d1 and d2 can be improved, and the analysis accuracy is improved. If the number of remaining column data is equal to or greater than the threshold value, the analysis apparatus 200 may move to step S504 without deleting the column data determined not to converge. Thereby, analysis covering the factor 603 can be performed. Further, step S504 may not be executed. Thereby, the analysis speed can be improved.

Next, the analysis apparatus 200 integrates the probability distributions d1 and d2 of the factors determined to be converged in step S504 (step S505). An integrated probability distribution D is a probability distribution of integrated factors.

FIG. 7 is an explanatory diagram showing an example of the integrated probability distribution D. In FIG. 7, for convenience of explanation, the content of the factor probability distributions d1 and d2 shown in FIG. 6 is connected. However, if any column data in the factor 603 is deleted in step S504, the integrated probability distribution is used. In D, it is also deleted.

Next, the analysis apparatus 200 generates a factor cluster by factor clustering using the integrated probability distribution D as shown in (4) of FIG. 1 (step S506). In the initial setting (step S501), the analysis apparatus 200 executes the factor clustering selected in the factor clustering selection region 403, and generates factor clusters for the number of clusters set in the factor cluster number setting region 404.

FIG. 8 is an explanatory diagram showing the factor clustering result 40. The factor clustering result 40 has a patient type ID field 801, an objective variable field 802, and a factor field 803. The value of each field 801 to 803 in the same row constitutes an entry that becomes patient type information.

The patient type ID field 801 is a storage area for storing a patient type ID. The patient type ID 801 is identification information that uniquely identifies a patient type classified by factor clustering.

The objective variable field 802 is a storage area for storing objective variables for each patient type ID 801. The objective variable 802 indicates the disease probability. The disease probability is expressed as 0% to 100%.

The factor field 803 is a storage area for storing a plurality of factors. Factor 803 is an explanatory variable indicating the dose of the drug to the patient type. In this example, the factor 803 is four explanatory variables of medicine 1 to medicine 4 for convenience, but actually, for example, it is a medicine that remains after the convergence determination (step S504).

In FIG. 8, k-means clustering is used as factor clustering, and the number of clusters is “3” as an example. As a result, the entries of the integrated probability distribution D are classified into three types of factor clusters of patient types α, β, and γ.

Referring back to FIG. 5, the analysis apparatus 200 calculates the statistical value of each factor from each factor cluster (step S507). Specifically, for example, the analysis apparatus 200 sets a statistical value in the virtual patient information in the integrated probability distribution D belonging to the patient type of the entry in the factor field 803. The statistical value is, for example, a median value. In addition to the median value, an average value, a maximum value, a minimum value, or a randomly selected value may be used. The analysis apparatus 200 calculates a disease probability that is the objective variable 802 by giving a statistical value that is the factor 803 to the learning model. Thus, patient type factors 803 and explanatory variables 802 are aggregated into statistics and disease probabilities derived from statistics.

Moreover, the analysis apparatus 200 calculates the co-occurrence amount between factors of the integrated probability distribution D (step S508). The co-occurrence amount is a correlation value between two factors. Specifically, for example, the analysis device 200 combines all factors in the integrated probability distribution D with brute force and calculates a correlation value between the factors. The correlation value is calculated by the calculation method selected in the co-occurrence amount selection area 407 in the initial setting (step S501).

Next, the analysis device 200 generates a co-occurrence cluster by co-occurrence clustering as shown in (5) of FIG. 1 (step S509). Specifically, for example, the analysis apparatus 200 applies a hierarchical clustering method to the co-occurrence amount to generate a co-occurrence cluster. Hierarchical clustering means that individual data is set as one co-occurrence cluster, the similarity between co-occurrence clusters is calculated, the most similar co-occurrence clusters are merged, and all co-occurrence clusters are one cluster. It is a clustering that repeats the process until it becomes to generate a dendrogram. Here, the similarity between co-occurrence clusters is, for example, a short distance between co-occurrence clusters. Specifically, for example, the distance between co-occurrence clusters is defined by the nearest neighbor method, the farthest neighbor method, or the centroid method.

FIG. 9 is an explanatory diagram showing a processing example of co-occurrence clustering (S508, S509). (A) shows the process of step S508. The co-occurrence amount table 900 is a table that holds correlation values between factors. (B) shows the process of step S509. In (B), the analyzer 200 deletes the correlation value of the same factor. The analysis apparatus 200 converts the correlation value into a correlation value obtained by subtracting the correlation value from 1 for hierarchical clustering. In (B), the smaller the correlation value, the more similar the factors are. Therefore, the analysis apparatus 200 selects a combination of factors that minimize the correlation value as a co-occurrence cluster. In the case of (B), a combination of medicine 1 and medicine 2 (co-occurrence cluster 1) and a combination of medicine 3 and medicine 4 (co-occurrence cluster 2) are selected. Here, the co-occurrence cluster is a combination of two factors, but may be a combination of three or more factors.

The process (B) is executed until the number of co-occurrence clusters reaches the number of co-occurrence clusters set in the co-occurrence cluster number setting area 409, or until no more clusters can be merged. .

Referring back to FIG. 5, the analysis apparatus 200 calculates the predicted value of the co-occurrence cluster as shown in (6) of FIG. 1 (step S510). Specifically, for example, the analysis apparatus 200 gives a factor belonging to the co-occurrence cluster to the learning model for each patient type α, β, γ, thereby predicting the disease probability for each patient type α, β, γ. Is calculated.

FIG. 10 is an explanatory diagram showing the prediction result 1000 in step S510. Thus, the analysis apparatus 200 can analyze the effectiveness of the combination of factors.

Returning to FIG. 5, the analysis apparatus 200 executes threshold processing of the prediction result 1000 (step S <b> 511). Specifically, for example, the analysis apparatus 200 selects a combination of a patient type and a factor cluster whose predicted value is equal to or greater than a threshold value. For example, when the threshold value set in the threshold setting area 410 is “0.8”, the analysis apparatus 200 causes the factor cluster 1 of the patient type α, the factor cluster 1 of the patient type β, and the patient type γ. Factor cluster 1 is selected as a calculation marker.

The analysis apparatus 200 outputs the processing result of step S510 or S511 (step S512). Specifically, for example, the analysis apparatus 200 controls the display screen of a display that is an example of the output device 204 to display the processing result on the display screen, or transmits the processing result to the external apparatus via the communication IF 205. Or the processing result is written in the storage device 202. Further, the convergence determination result in step S504 may also be output.

<Example of display screen>
FIG. 11 is an explanatory diagram illustrating a display screen example. The display screen 1100 is displayed on a display that is an example of the output device 204. The display screen 1100 includes a score display area 1101, a prediction result display area 1102, and a dendrogram display area 1103. In the score display area 1101, the convergence value Rhat in the convergence determination (step S504) is displayed. A prediction result 1000 shown in FIG. 10 is displayed in the prediction result display area 1102. As shown in FIG. 11, it may be displayed as a bar graph. A dendrogram display area 1103 displays a dendrogram in hierarchical clustering. As described above, the intermediate result and final result of the processing shown in FIG. 5 are displayed on the display screen 1100.

As described above, according to the first embodiment, the analysis apparatus 200 generates a plurality of factor clusters by clustering the prediction data set (for example, the integrated probability distribution D) so that the values of the plurality of factors are similar to each other. The first generation process is executed (step S506). The analysis apparatus 200 executes a first calculation process for calculating a co-occurrence amount in which a plurality of factors co-occur by correlation of a plurality of factors, using a prediction data set (for example, the integrated probability distribution D) (step S508). . The analysis apparatus 200 clusters a plurality of factors based on the co-occurrence amount calculated by the first calculation process, and generates a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors. Generation processing is executed (step S509). The analysis apparatus 200 includes, among prediction values of two or more factors in a specific prediction data group included in a specific factor cluster including two or more factors among the plurality of factor clusters generated by the first generation process. A prediction value of two or more specific factors indicated by a specific co-occurrence cluster among a plurality of co-occurrence clusters generated by the two generation process is given to the learning model. Then, the analysis apparatus 200 executes a second calculation process for calculating the predicted value of the objective variable in the specific factor cluster (step S510).

Thereby, the analysis apparatus 200 can analyze the effectiveness of the combination of factors based on the predicted value of the objective variable in a specific factor cluster in which a plurality of factors co-occur.

Further, the analysis apparatus 200 executes a third calculation process for calculating a statistical value representative of the predicted values of the two or more factors in the specific factor cluster based on the predicted values of the two or more factors in the specific predicted data group. (Step S510). As a result, the analysis apparatus 200 can reduce the amount of calculation when calculating the predicted value of the objective variable in a specific factor cluster in which a plurality of factors co-occur. Therefore, the analysis speed can be improved.

Also, the analysis apparatus 200 executes a setting process for setting the type of learning model (step S501). Further, the analysis apparatus 200 generates a learning model of the type set by the setting process using the measured value of the objective variable and the measured values of a plurality of factors, and executes a third generation process that stores the learning model in the storage device. (Step S502). Thereby, the user can select the kind of learning model according to the objective.

Further, the analysis apparatus 200 sets a linear model or a nonlinear model as a type in the setting process. Thereby, the analysis apparatus 200 can improve the analysis speed when the linear model is set, and can improve the analysis accuracy when the nonlinear model is set. In other words, the user can select a linear model if the analysis result is desired to be obtained earlier, and can select a non-linear model if the analysis accuracy is to be improved.

Further, the prediction data set (for example, the integrated probability distribution D) may be a data set generated from the learning data set 10 by a probability sampling method using a learning model. Thereby, a prediction data set (for example, integrated probability distribution D) becomes a data set depending on the learning model. Therefore, for example, when a non-linear model is set, the prediction data set (for example, the integrated probability distribution D) is a data set with higher accuracy than when a linear model is set.

In addition, the analysis apparatus 200 adopts either one of the prediction data or the data similar to the prediction data by a probability sampling method (for example, Markov chain Monte Carlo method) using a learning model, thereby two prediction data groups (for example, The fourth generation process for generating the factor probability distributions d1, d2) is executed (step S503). Data similar to the prediction data is data obtained by adding a random value to each value of the factor that is the prediction data, as described above. The analysis apparatus 200 executes a determination process for determining whether or not two prediction data groups (for example, the probability distributions d1 and d2 of the factors) generated by the fourth generation process converge to the same probability distribution (step) S504). The analysis apparatus 200 integrates two prediction data groups (for example, factor probability distributions d1 and d2) based on the determination result of the determination process, thereby generating a prediction data set (for example, an integrated probability distribution D). Processing is executed (step S505).

The determination process determines whether or not two prediction data groups (for example, probability distributions d1 and d2 of factors) converge to the same probability distribution, for example, the probability distribution of the learning data set 10. As a result, if it has converged, it becomes clear that the two prediction data groups (for example, the probability distributions d1 and d2 of the factors) are similar to the learning data set 10, and therefore two prediction data groups (for example, the probability distribution d1 of the factors) , D2), a prediction data set (for example, integrated probability distribution D) is generated. Thereby, the probability as a predicted value of a prediction data set (for example, integrated probability distribution D), that is, the generation accuracy can be improved.

In addition, the analysis apparatus 200 uses a probability sampling method (for example, a Markov chain Monte Carlo method) using a learning model to determine the value of a parameter that controls the adoption rate α for adopting either prediction data or data similar to the prediction data ( For example, a setting process for setting (σ value) is executed (step S501). As a result, a factor that becomes an objective variable of (1−σ) or more can be adopted at the acceptance rate α.

Moreover, the analysis apparatus 200 executes a setting process for setting the number of factor clusters generated (step S501). Thereby, the analysis apparatus 200 can generate as many factor clusters as specified by the user. Specifically, for example, as the number of factor clusters generated increases, the prediction data set (for example, integrated probability distribution D) is subdivided. As a result, the user can set the factor cluster generation number to be lower when the analysis result is desired to be obtained earlier, and can set the factor cluster generation number to be higher to increase the analysis accuracy.

Moreover, the analysis apparatus 200 executes a setting process for setting the number of co-occurrence clusters generated (step S501). Thereby, the analyzer 200 can generate the number of co-occurrence clusters specified by the user. Specifically, for example, as the number of co-occurrence clusters generated increases, the number of co-occurring factors and the number of combinations of co-occurring factors increase. Therefore, the user can set the number of co-occurrence clusters generated to be lower when the analysis result is desired to be obtained earlier, and can be set to be higher to increase the analysis accuracy.

Further, in Example 1, a plurality of

factors

303 and 603 are set as doses of a plurality of drugs to a patient, and

objective variables

302 and 602 are values indicating drug efficacy when a plurality of drugs are administered to a patient (for example, Disease probability). This makes it possible to predict how much of each of a plurality of drugs is administered to which type (factor cluster) and how much drug is effective.

In Example 1 described above, the medicinal effect analysis is described as an example, but the present invention can also be applied to product recommendation. In this case, in the learning data set 10 shown in FIG. 3, for example, the patient ID 301 is replaced by a customer, not a patient. The factor 303 indicates, for example, the number of purchases (in the case of a product) or the number of uses (in the case of a service) of a product or service (may be a product or service genre). The objective variable 302 indicates, for example, a purchase amount (in the case of a product) or a usage amount (in the case of a service) of a product or service (which may be a genre of the product or service). The same applies to the factor probability distributions d1 and d2 and the integrated probability distribution D.

In the case of news article analysis, in the learning data set 10 shown in FIG. 3, the patient ID 301 is replaced with, for example, a news article published in a newspaper, magazine, or web page instead of a patient. The factor 303 indicates, for example, the number of times the word appears. The objective variable 302 indicates the genre of a news article such as politics, society, sports, and weather. The same applies to the factor probability distributions d1 and d2 and the integrated probability distribution D.

Example 2 will be described. In the first embodiment, the analysis process shown in FIG. 5 is executed by one computer. In the second embodiment, the analysis process shown in FIG. 5 is distributed by a plurality of computers. This will reduce the load on the computer and increase the analysis speed. Specifically, each computer has, for example, the hardware configuration shown in FIG.

FIG. 12 is an explanatory diagram showing a system configuration example of the analysis system. The analysis system 1200 includes a plurality of computers (hereinafter simply referred to as nodes) N0 to Nn (n is an integer of 1 or more) and one or more client terminals C. A plurality of nodes N0 to Nn (n is an integer of 2 or more) and one or more client terminals C are communicably connected via a network 1201. The node N0 is the master node N0, and the nodes N1 to Nn are worker nodes N1 to Nn. The master node N0 manages the worker nodes N1 to Nn. Worker nodes N1 to Nn execute processing in accordance with instructions from master node N0. Note that one of the worker nodes N1 to Nn may be responsible for the function of the master node N0.

<Example of distributed processing procedure>
FIGS. 13 to 15 are flowcharts showing examples of distributed processing procedures by the analysis system 1200. FIG. Here, as an example, n = 2, that is, the analysis system 1200 is a master node N0, worker nodes N1, N2, and a client terminal C.

First, the client terminal C executes initial setting (step S501) (step S1301). Then, the client terminal C transmits an analysis request that is a setting content of the initial setting (step S501) to the master node N0 (step S1302).

The master node N0 transmits a learning model generation request to the worker node N1 (step S1303). When receiving the learning model generation request, the worker node N1 generates a learning model as in step S502 (step S1304). When the worker node N1 generates a learning model, the worker node N1 transmits the learning model to the master node N0 (step S1305). When receiving the learning model from the worker node N1, the master node N0 transmits the learning model to another worker node N2 (step S1306).

Next, the master node N0 transmits a generation request for the factor probability distribution d1 to the worker node N1 (step S1307), and transmits a generation request for the factor probability distribution d2 to the worker node N2 (step S1308). Thereby, the probability distributions d1 and d2 of factors can be generated by parallel processing.

Next, similarly to step S503, the worker node N1 generates a probability distribution d1 of factors derived from the learning data set 10 using a probability sampling method typified by the Markov chain Monte Carlo method (step S1309). Similarly to step S503, the worker node N2 also generates a probability distribution d2 of factors derived from the learning data set 10 using a probability sampling method typified by the Markov chain Monte Carlo method (step S1310). The worker node N1 transmits the generated probability distribution d1 of the factor to the master node N0 (step S1311). The worker node N2 also transmits the generated probability distribution d2 of the factor to the master node N0 (step S1312).

The master node N0 determines whether the factor probability distributions d1 and d2 converge to the same probability distribution as in step S504 (step S1313). The master node N0 transmits the determination result to the client terminal C (step S1314). As shown in FIG. 11, the client terminal C receives and displays the determination result (eg, Gelman-Rubin score) (step S1315).

In FIG. 14, the master node N0 generates the integrated probability distribution D by integrating the probability distributions d1 and d2 of factors (step S1401), as in step S505. Then, the master node N0 transmits a factor clustering request to the worker node N1 (step S1402). When receiving the factor clustering request, the worker node N1 generates a factor cluster by factor clustering using the integrated probability distribution D as in step S506 (step S1403). Further, the worker node N1 calculates the statistical value of each factor from each factor cluster as in step S507 (step S1404). The worker node N1 transmits the calculated statistical value to the master node N0 (step S1405). The master node N0 transmits the received statistical value to the other worker node N2 (step S1406).

The master node N0 transmits a co-occurrence amount calculation request to the worker node N2 (step S1407). The worker node N2 calculates the co-occurrence amount of factors of the integrated probability distribution D as in step S508 (step S1408). Then, the worker node N2 transmits the calculated co-occurrence amount (see (A) of FIG. 9) to the master node N0 (step S1409).

In FIG. 15, the master node N0 generates a co-occurrence cluster by co-occurrence clustering as in step S509, and generates ID lists A and B of the co-occurrence cluster (step S1501). The co-occurrence cluster ID list A is an ID list that uniquely identifies one entry group obtained by dividing the entry of the integrated probability distribution D. The co-occurrence cluster ID list B is an ID list that uniquely specifies the other entry group obtained by dividing the entry of the integrated probability distribution D.

The master node N0 transmits the co-occurrence cluster ID list A to the worker node N1 (step S1502), and transmits the co-occurrence cluster ID list B to the worker node N2 (step S1503). As in step S509, the worker node N1 generates a co-occurrence cluster for the ID list A by co-occurrence clustering (step S1504). Similarly to step S509, the worker node N2 also generates a co-occurrence cluster for the ID list B by co-occurrence clustering (step S1505).

The worker node N1 calculates the predicted value of the co-occurrence cluster obtained in step S1504 as in step S510 (step S1506). Similarly to step S510, worker node N2 also calculates the predicted value of the co-occurrence cluster obtained in step S1505 (step S1507). The worker node N1 stores the predicted value obtained in step S1506 in the storage device 202 (step S1508). The worker node N2 also stores the predicted value obtained in step S1507 in the storage device 202 (step S1509). The worker node N1 transmits the predicted value obtained in step S1506 to the master node N0 (step S1510). The worker node N2 also transmits the predicted value obtained in step S1507 to the master node N0 (step S1511).

The master node N0 executes the threshold processing for the predicted value as in step S511 (step S1512). Then, the master node N0 transmits a calculation marker that is the execution result to the client terminal C (step S1513). The client terminal C displays the calculation marker on the display screen (step S1514).

FIG. 16 is a flowchart showing a modification of the flowchart 3 showing an example of the distributed processing procedure by the analysis system 1200 shown in FIG. In FIG. 15, the worker nodes N1 and N2 execute the co-occurrence clustering in parallel for each of the ID lists A and B, thereby realizing high-speed processing. On the other hand, in FIG. 16, the co-occurrence cluster calculation of the ID lists A and B is executed not by the worker nodes N1 and N2 but by the master node N0. The same processes as those in FIG. 15 are denoted by the same step numbers, and the description thereof is omitted.

In FIG. 16, the master node N0 generates a co-occurrence cluster by co-occurrence clustering for the ID list A as in step S509 (step S1602). The master node N0 transmits the co-occurrence cluster of the ID list A to the worker node N1 (step S1603).

The worker node N1 calculates the predicted value of the co-occurrence cluster obtained in step S1602, similarly to step S510 (step S1604). The worker node N1 stores the predicted value obtained in step S1604 in the storage device 202 (step S1604). The worker node N1 transmits the predicted value obtained in step S1604 to the master node N0 (step S1606).

The master node N0 generates a co-occurrence cluster by co-occurrence clustering for the ID list B as in step S509 (step S1607). The master node N0 transmits the co-occurrence cluster of the ID list B to the worker node N2 (step S1608).

The worker node N2 calculates the predicted value of the co-occurrence cluster obtained in step S1607 as in step S510 (step S1609). The worker node N1 stores the predicted value obtained in step S1609 in the storage device 202 (step S1610). The worker node N2 transmits the predicted value obtained in step S1609 to the master node N0 (step S1611).

Thus, according to the second embodiment, the same effects as the first embodiment can be obtained. Further, according to the second embodiment, the analysis processing shown in FIG. 5 is distributed by a plurality of computers. Thereby, it is possible to reduce the load on the computer and increase the analysis speed. Note that the distributed processing shown in FIGS. 13 to 16 is an example. Therefore, in addition to this, for example, at least two or more of the steps shown in FIGS. 13 to 16 may be executed by different computers.

The present invention is not limited to the above-described embodiments, and includes various modifications and equivalent configurations within the scope of the appended claims. For example, the above-described embodiments have been described in detail for easy understanding of the present invention, and the present invention is not necessarily limited to those having all the configurations described. A part of the configuration of one embodiment may be replaced with the configuration of another embodiment. Moreover, you may add the structure of another Example to the structure of a certain Example. Moreover, you may add, delete, or replace another structure about a part of structure of each Example.

In addition, each of the above-described configurations, functions, processing units, processing means, etc. may be realized in hardware by designing a part or all of them, for example, with an integrated circuit, and the processor realizes each function. It may be realized by software by interpreting and executing the program to be executed.

Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, and an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, and a DVD.

Also, the control lines and information lines indicate what is considered necessary for the explanation, and do not necessarily indicate all control lines and information lines necessary for mounting. In practice, it can be considered that almost all the components are connected to each other.

Claims

An analysis apparatus comprising a processor that executes a program and a storage device that stores the program,
The storage device has a learning data set including a plurality of learning data including actual values of objective variables and actual values of a plurality of factors, and a prediction including a plurality of prediction data derived from the learning data including prediction values of the plurality of factors. Storing a data set and a learning model indicating a relationship between the measured values of the objective variable and the measured values of the plurality of factors;
The processor is
Clustering the prediction data sets so that the values of the plurality of factors are similar to each other, and generating a plurality of factor clusters;
A first calculation process for calculating a co-occurrence amount in which the plurality of factors co-occur by correlation of the plurality of factors using the prediction data set;
A second generation process for clustering the plurality of factors based on the co-occurrence amount calculated by the first calculation process to generate a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors; ,
Among the predicted values of the two or more factors in the specific prediction data group included in the specific factor cluster including two or more factors among the plurality of factor clusters generated by the first generation processing, the second generation Predicting the objective variable in the specific factor cluster by giving the learning model prediction values of two or more specific factors indicated by the specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the processing A second calculation process for calculating a value;
The analysis apparatus characterized by performing.
The analyzer according to claim 1,
The processor is
Based on the predicted values of the two or more factors in the specific prediction data group, execute a third calculation process for calculating a statistical value representative of the predicted values of the two or more factors in the specific factor cluster,
In the second calculation process, the processor includes two or more specific factors indicated by the specific co-occurrence cluster among statistical values representing the predicted values of the two or more factors calculated by the third calculation process. An analysis apparatus characterized in that a predicted value of the objective variable in the specific factor cluster is calculated by giving the statistical value of the above to the learning model.
The analyzer according to claim 1,
The processor is
A setting process for setting the type of the learning model;
Using a measured value of the objective variable and a measured value of the plurality of factors to generate a learning model of the type set by the setting process, and store it in the storage device;
The analysis apparatus characterized by performing.
The analyzer according to claim 3,
In the setting process, the processor sets a linear model or a nonlinear model as the type.
The analyzer according to claim 1,
The analysis apparatus, wherein the prediction data set is a data set generated from the learning data set by a probability sampling method using the learning model.
The analyzer according to claim 1,
The processor is
A fourth generation process for generating two prediction data groups by adopting either one of the prediction data or data similar to the prediction data by a probability sampling method using the learning model;
A determination process for determining whether or not the two prediction data groups generated by the fourth generation process converge to the same probability distribution;
An integration process for generating the prediction data set by integrating the two prediction data groups based on the determination result of the determination process;
In the first generation process, the processor generates the plurality of factor clusters by clustering the prediction data set obtained by the integration process so that the values of the plurality of factors are similar to each other.
In the first calculation process, the processor calculates a co-occurrence amount in which the plurality of factors co-occur by correlation of the plurality of factors, using the prediction data set obtained by the integration processing. Analyzing device.
The analyzer according to claim 6,
The processor is
Executing a setting process for setting a value of a parameter for controlling an acceptance rate for adopting either the prediction data or the data similar to the prediction data by a probability sampling method using the learning model;
In the fourth generation process, the processor generates the two prediction data groups by adopting either the prediction data or data similar to the prediction data based on the acceptance rate. Analyzing device.
The analyzer according to claim 1,
The processor is
Execute a setting process for setting the number of generations of the factor clusters;
In the first generation process, the processor clusters the prediction data sets so that the values of the plurality of factors are similar to each other, and generates a generation number of factor clusters set by the setting process. Analyzing device.
The analyzer according to claim 1,
The processor is
Execute a setting process for setting the number of generated co-occurrence clusters;
In the second generation process, the processor clusters the plurality of factors based on the co-occurrence amount calculated by the first calculation process, and has one or more co-occurrence clusters including two or more factors. An analyzer that generates a generation number set by the setting process.
The analyzer according to claim 1,
The analyzer is characterized in that the plurality of factors are doses of a plurality of drugs to a patient, and the objective variable is a value indicating a drug effect when the dose of the plurality of drugs is administered to the patient.
An analysis system in which a plurality of computers are communicably connected,
A learning data set in which any one of the plurality of computers includes a plurality of learning data including actual measurement values of objective variables and actual measurement values of a plurality of factors, and prediction data derived from the learning data including prediction values of the plurality of factors And a learning model indicating the relationship between the measured values of the objective variable and the measured values of the plurality of factors, and a prediction data set having a plurality of
One of the plurality of computers is
Clustering the prediction data sets so that the values of the plurality of factors are similar to each other, and generating a plurality of factor clusters;
A first calculation process for calculating a co-occurrence amount in which the plurality of factors co-occur by correlation of the plurality of factors using the prediction data set;
A second generation process for clustering the plurality of factors based on the co-occurrence amount calculated by the first calculation process to generate a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors; ,
Among the predicted values of the two or more factors in the specific prediction data group included in the specific factor cluster including two or more factors among the plurality of factor clusters generated by the first generation processing, the second generation Predicting the objective variable in the specific factor cluster by giving the learning model prediction values of two or more specific factors indicated by the specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the processing A second calculation process for calculating a value;
An analysis system characterized by executing.
An analysis method by an analysis apparatus having a processor that executes a program and a storage device that stores the program,
The storage device has a learning data set including a plurality of learning data including actual values of objective variables and actual values of a plurality of factors, and a prediction including a plurality of prediction data derived from the learning data including prediction values of the plurality of factors. Storing a data set and a learning model indicating a relationship between the measured values of the objective variable and the measured values of the plurality of factors;
The processor is
Clustering the prediction data sets so that the values of the plurality of factors are similar to each other, and generating a plurality of factor clusters;
A first calculation process for calculating a co-occurrence amount in which the plurality of factors co-occur by correlation of the plurality of factors using the prediction data set;
A second generation process for clustering the plurality of factors based on the co-occurrence amount calculated by the first calculation process to generate a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors; ,
Among the predicted values of the two or more factors in the specific prediction data group included in the specific factor cluster including two or more factors among the plurality of factor clusters generated by the first generation processing, the second generation Predicting the objective variable in the specific factor cluster by giving the learning model prediction values of two or more specific factors indicated by the specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the processing A second calculation process for calculating a value;
The analysis method characterized by performing.