WO2018042606A1 - Analysis device, analysis system, and analysis method - Google Patents

Analysis device, analysis system, and analysis method

Info

Publication number
WO2018042606A1
Authority
WO
WIPO (PCT)
Prior art keywords
factors
occurrence
factor
values
clusters
Application number
PCT/JP2016/075726
Other languages
French (fr)
Japanese (ja)
Inventor
琢磨 柴原
英司 金森
昌宏 荻野
鈴木 麻由美
Original Assignee
株式会社日立製作所 (Hitachi, Ltd.)
Application filed by Hitachi, Ltd. (株式会社日立製作所)
Priority to JP2018536626A (patent JP6695431B2)
Priority to PCT/JP2016/075726
Publication of WO2018042606A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/22: Social work or social welfare, e.g. community support activities or counselling services
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00: ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/10: ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • the present invention relates to an analysis apparatus, an analysis system, and an analysis method for analyzing data.
  • Patent Document 1 (US Pat. No. 6,057,031) discloses a computer-implemented method, system, and computer-readable storage medium for use with a clinical decision support system that identifies and provides information regarding correlations between patient attributes and one or more adverse events (AEs).
  • the process of Patent Document 1 analyzes database information including AEs and one or more patient attributes, and identifies at least one correlation between one or more AEs and one or more patient attributes. Correlations may be discovered through an association rule discovery process that determines one or more association rules, each of which satisfies confidence, support, and/or other thresholds. The process further provides information or alerts to the user based on the identified or discovered correlations.
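The support and confidence thresholds mentioned above can be illustrated with a minimal sketch; the transaction data, item names, and rule below are hypothetical illustrations, not details from Patent Document 1:

```python
# Minimal sketch of support and confidence for an association rule A -> B.
# The transactions and item names are hypothetical illustrations.
transactions = [
    {"drug X", "rash"},
    {"drug X", "rash", "fever"},
    {"drug X"},
    {"drug Y", "rash"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Among transactions containing the antecedent, the fraction that
    # also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

# Rule "drug X -> rash" is kept only if it clears the chosen thresholds.
s = support({"drug X", "rash"})       # 2 of 4 transactions
c = confidence({"drug X"}, {"rash"})  # 2 of 3 drug-X transactions
```

A rule discovery process enumerates candidate rules and keeps only those whose support and confidence exceed the configured thresholds.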
  • Patent Document 2 discloses a medical assistance program that provides appropriate support for medical care.
  • in Patent Document 2, the treatment period of a patient for a diagnosed disease is compared with a reference cure period for that disease. When the patient's treatment period exceeds the reference cure period, the program searches storage means, which associates each disease with other diseases that develop similar symptoms, for other diseases whose symptoms resemble those of the diagnosed disease, and causes the computer to execute a process of outputting the disease name information of the diseases found.
  • however, the above-described conventional techniques have a problem: even if a learning model is generated from learning data, it is not known which factor is associated with which other factor.
  • for example, when the objective variable is the disease probability and the factors are the doses of a plurality of drugs, it is difficult to know whether administering drug A and drug B to a patient in combination is effective or causes side effects.
  • the present invention therefore aims to analyze the effectiveness of combinations of factors.
  • An analysis apparatus, an analysis system, and an analysis method according to the present invention store, in a storage device: a learning data set having a plurality of learning data that include measured values of an objective variable and measured values of a plurality of factors; a prediction data set having a plurality of prediction data that are derived from the learning data and include predicted values of the plurality of factors; and a learning model indicating the relationship between the measured values of the objective variable and the measured values of the plurality of factors.
  • They execute a first generation process that clusters the prediction data set so that entries with similar factor values are grouped together, generating a plurality of factor clusters.
  • They execute a first calculation process that uses the prediction data set to calculate a co-occurrence amount with which the plurality of factors co-occur, for example by correlation, and a second generation process that, based on the co-occurrence amount calculated by the first calculation process, generates a plurality of co-occurrence clusters of the factors, one or more of which include two or more factors.
  • They execute a second calculation process that, for a specific factor cluster among the plurality of factor clusters generated by the first generation process, gives the predicted values of the two or more specific factors of a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the second generation process to the learning model, thereby calculating a predicted value of the objective variable in the specific factor cluster.
  • FIG. 1 is an explanatory diagram of a data analysis example according to the first embodiment.
  • FIG. 2 is a block diagram illustrating a hardware configuration example of the analysis apparatus.
  • FIG. 3 is an explanatory diagram showing the detailed contents of the learning data set 10 shown in FIG. 1.
  • FIG. 4 is an explanatory diagram illustrating an example of an initial setting screen.
  • FIG. 5 is a flowchart illustrating an example of an analysis processing procedure performed by the analysis apparatus.
  • FIG. 6 is an explanatory diagram showing a probability distribution of factors.
  • FIG. 7 is an explanatory diagram showing an example of the integrated probability distribution.
  • FIG. 8 is an explanatory diagram showing the result of factor clustering.
  • FIG. 9 is an explanatory diagram of an example of co-occurrence clustering processing.
  • FIG. 10 is an explanatory diagram illustrating the prediction result obtained in step S510.
  • FIG. 11 is an explanatory diagram illustrating a display screen example.
  • FIG. 12 is an explanatory diagram illustrating a system configuration example of the analysis system.
  • FIG. 13 is a flowchart 1 illustrating an example of a distributed processing procedure by the analysis system.
  • FIG. 14 is a flowchart 2 illustrating an example of a distributed processing procedure by the analysis system.
  • FIG. 15 is a flowchart 3 illustrating an example of a distributed processing procedure by the analysis system.
  • FIG. 16 is a flowchart showing a modification of flowchart 3 (FIG. 15), which illustrates an example of a distributed processing procedure by the analysis system.
  • FIG. 1 is an explanatory diagram of a data analysis example according to the first embodiment.
  • steps (1) to (6) show the procedure of the analysis method performed by the analysis apparatus.
  • the analysis apparatus generates a learning model from the learning data set 10.
  • the objective variable is the drug effect, specifically, the disease probability.
  • the factor is the dose of a plurality of drugs to the patient.
  • for convenience, the factors are four explanatory variables, drug 1 to drug 4; in practice there are, for example, tens of thousands to hundreds of millions of drugs.
  • Each entry indicates a patient.
  • for convenience, the number of patients is six, A to F; in practice there are, for example, tens of thousands to hundreds of millions of patients.
  • the generated learning model includes a linear model and a non-linear model.
  • the linear model includes, for example, a linear classification and a logistic regression.
  • Nonlinear models include, for example, neural networks, support vector machines, AdaBoost, and random forests.
  • the user can select one of the models when generating the learning model. For example, when the user wants to analyze the effectiveness of factor combinations at high speed, the user may select a linear model; when the user wants to analyze with high accuracy, the user may select a nonlinear model.
  • the analysis device generates a probability distribution 20 of each factor from the learning model generated in (1).
  • the analysis apparatus generates two sets of probability distributions 20 of factors derived from the learning data set 10 (referred to as d1 and d2, respectively) using a probability sampling method typified by the Markov chain Monte Carlo method. A large amount of virtual factor data can thereby be collected.
  • the analysis apparatus determines whether the probability distributions d1 and d2 of the factors generated in (2) converge to the same probability distribution. Specifically, for example, the Gelman-Rubin method is used for the convergence determination. Until convergence, the analysis apparatus repeats the generation of the factor probability distributions 20 in (2).
  • the analysis apparatus integrates the probability distributions d1 and d2 of the factors determined to converge in (3), and executes factor clustering on the integrated probability distribution of the factors (integrated probability distribution D).
  • for factor clustering, for example, k-means clustering is used.
  • the number of clusters is set in advance.
  • the number of clusters is “3” as an example.
  • the entries of the integrated probability distribution D are classified into three patient types α, β, and γ.
  • the analysis apparatus performs co-occurrence clustering on the integrated probability distribution D. Specifically, for example, the analysis device calculates a correlation coefficient between factors of the integrated probability distribution D as a co-occurrence amount. Then, the analysis device applies a hierarchical clustering method to the co-occurrence amount to generate a co-occurrence cluster.
  • Co-occurrence cluster 1: drug 1, drug 2
  • Co-occurrence cluster 2: drug 3, drug 4
  • the co-occurrence cluster is a combination of two factors, but may be a combination of three or more factors.
  • the analysis apparatus calculates a predicted value of the disease probability for each of the patient types α, β, and γ by giving the factors belonging to each co-occurrence cluster to the learning model.
  • in this way, the analysis apparatus can analyze the effectiveness of combinations of factors.
  • FIG. 2 is a block diagram illustrating a hardware configuration example of the analysis apparatus.
  • the analysis apparatus 200 includes a processor 201, a storage device 202, an input device 203, an output device 204, and a communication interface (communication IF 205).
  • the processor 201, the storage device 202, the input device 203, the output device 204, and the communication IF 205 are connected by a bus.
  • the processor 201 controls the analysis device 200.
  • the storage device 202 serves as a work area for the processor 201.
  • the storage device 202 is a non-temporary or temporary recording medium that stores various programs and data.
  • Examples of the storage device 202 include a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), and a flash memory.
  • the input device 203 inputs data. Examples of the input device 203 include a keyboard, a mouse, a touch panel, a numeric keypad, and a scanner.
  • the output device 204 outputs data. Examples of the output device 204 include a display and a printer.
  • the communication IF 205 is connected to a network and transmits / receives data.
  • FIG. 3 is an explanatory diagram showing the detailed contents of the learning data set 10 shown in FIG.
  • the learning data set 10 is, for example, data in a table format.
  • in the following, the value of field AA with reference numeral bbb (AA is a field name and bbb is a reference numeral) may be written as AA bbb.
  • for example, the value of the patient ID field 301 is written as patient ID 301.
  • the learning data set 10 includes a patient ID field 301, an objective variable field 302, and a factor field 303.
  • the values of the fields 301 to 303 in the same row constitute an entry that is patient information.
  • the number of entries here is "6", but in practice there are, for example, tens of thousands to hundreds of millions of patient entries.
  • the patient ID field 301 is a storage area for storing a patient ID.
  • the patient ID 301 is identification information that uniquely identifies a patient.
  • the objective variable field 302 is a storage area for storing objective variables for each patient ID 301.
  • the factor field 303 is a storage area for storing a plurality of factors.
  • the factor 303 is an explanatory variable indicating the dose of medicine.
  • for convenience, the factors 303 are four explanatory variables, medicine 1 to medicine 4; in practice there are, for example, tens of thousands to hundreds of millions of medicines.
  • the unit of the dosage of the medicine which is the factor 303 is determined for each medicine.
  • the entry whose patient ID 301 is "patient A" indicates that patient A was given "20" of medicine 1, "13.0" of medicine 2, and "22.0" of medicine 4, and became ill.
  • the entry whose patient ID 301 is "patient B" indicates that patient B was given "10" of medicine 1, "23.0" of medicine 2, "1" of medicine 3, and "31.0" of medicine 4, and became ill.
  • FIG. 4 is an explanatory diagram illustrating an example of an initial setting screen.
  • the initial setting screen 400 is displayed on a display which is an example of the output device 204 and is set by the input device 203.
  • the machine learning selection area 401 is a pull-down interface for selecting a machine learning method.
  • the factor clustering setting area 402 is an area for setting a clustering method and the number of clusters.
  • the factor clustering selection area 403 is a pull-down interface for selecting a factor clustering method.
  • the factor cluster number setting area 404 is an input field for setting the number of clusters to be obtained by factor clustering.
  • ⁇ value setting area 405 is an input field for setting a ⁇ value.
  • the γ value is a fixed parameter used in the acceptance rate α of the Markov chain Monte Carlo method when generating the probability distributions 20 of the factors in (2) of FIG. 1.
  • the ⁇ value is a value in the range of greater than 0 and less than or equal to 1.
  • the co-occurrence clustering setting area 406 is an area for setting a co-occurrence method, a clustering method, the number of clusters, and a threshold value.
  • the co-occurrence amount selection area 407 is a pull-down interface for selecting a co-occurrence amount calculation method.
  • the co-occurrence clustering selection area 408 is a pull-down interface for selecting a co-occurrence clustering method.
  • the co-occurrence cluster number setting area 409 is an input field for setting the number of clusters to be obtained by co-occurrence clustering.
  • the threshold setting area 410 is an input field for setting a threshold for the predicted correlation value indicating the degree of association of factor clusters.
  • the decision button 411 is a button for inputting values of the items 401 to 410.
  • FIG. 5 is a flowchart illustrating an example of an analysis process procedure performed by the analysis apparatus 200.
  • the analysis apparatus 200 executes the processing shown in the flowchart of FIG. 5 by causing the processor 201 to execute the analysis program stored in the storage device 202.
  • the analysis apparatus 200 performs initial setting (step S501).
  • in the initial setting (step S501), the initial setting screen 400 shown in FIG. 4 is displayed on the display.
  • the user selects or inputs each of the items 401 to 410 on the initial setting screen.
  • the analysis apparatus 200 reads the values of the items 401 to 410 upon detecting that the decision button 411 has been pressed.
  • the analysis apparatus 200 generates a learning model from the learning data set 10 as shown in (1) of FIG. 1 (step S502).
  • the learning model is expressed by the following formula (1): y = σ(wᵀx + b) (1)
  • y is a scalar indicating the objective variable.
  • x is an m-dimensional feature vector.
  • m corresponds to the number of factors.
  • ⁇ () is a sigmoid function.
  • the vector w and the scalar b are weight and bias parameters, respectively, and are called learning parameters.
  • in a nonlinear model, wᵀx in the sigmoid function σ() is replaced with a function of the vector w and the factor vector x that is more complicated than wᵀx.
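As a sketch, formula (1) with the sigmoid output can be coded directly; the weight vector, bias, and dose vector below are hypothetical values, not learned parameters:

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: maps any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def learning_model(x, w, b):
    # Formula (1): y = sigma(w^T x + b), where y is the predicted disease
    # probability and x is the m-dimensional factor (dose) vector.
    return sigmoid(np.dot(w, x) + b)

# Hypothetical learning parameters for m = 4 factors (drugs 1 to 4).
w = np.array([0.8, -0.5, 0.3, 0.1])
b = -0.2
x = np.array([20.0, 13.0, 0.0, 22.0])  # doses for one patient
y = learning_model(x, w, b)
```

With w = 0 and b = 0 the model outputs 0.5, the sigmoid's midpoint; training adjusts w and b so the output matches the measured disease outcomes.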
  • the analysis apparatus 200 selects a learning model corresponding to the machine learning method selected in the machine learning selection area 401 in FIG. 4 and obtains a learning parameter that represents the learning model.
  • the analysis apparatus 200 generates probability distributions d1 and d2 of factors derived from the learning data set 10 using a probability sampling method represented by the Markov chain Monte Carlo method, as shown in (2) of FIG. (Step S503).
  • FIG. 6 is an explanatory diagram showing probability distributions d1 and d2 of factors.
  • the factor probability distributions d1 and d2 include a virtual patient ID field 601, an objective variable field 602, and a factor field 603.
  • the value of each field 601 to 603 in the same row constitutes an entry that becomes virtual patient information.
  • the number of entries is the same as the number of entries in the learning data set 10.
  • the virtual patient ID field 601 is a storage area for storing a virtual patient ID.
  • the virtual patient ID 601 is identification information that uniquely identifies a virtual patient.
  • the objective variable field 602 is a storage area for storing objective variables for each virtual patient ID 601.
  • the objective variable 602 indicates the disease probability.
  • the disease probability is expressed as 0% to 100%.
  • the factor field 603 is a storage area for storing a plurality of factors.
  • the factor 603 is an explanatory variable indicating the dose of medicine.
  • the number of factors 603 is the same as the number of factors 303 in the learning data set 10.
  • the analysis apparatus 200 substitutes the selected factor vector x and the virtual factor vector x′ into expression (2) for the acceptance rate α of the Markov chain Monte Carlo method.
  • the function q is a Gaussian distribution function.
  • q(x′ | x) is a Gaussian distribution function indicating the probability of generating the virtual factor vector x′ when the factor vector x is given.
  • q(x | x′) is a Gaussian distribution function indicating the probability of generating the factor vector x when the virtual factor vector x′ is given.
  • the function f is, for example, a learning model generated in step S502 as shown in the equation (1).
  • the γ value input to the γ value setting area 405 is substituted for γ.
  • the acceptance rate α then becomes a Gaussian distribution that includes patient information with a disease probability of (1 - γ) or more. That is, a virtual factor vector x′ of virtual patient information having a disease probability of (1 - γ) or more can be adopted at the acceptance rate α.
  • a uniform random number η is generated in the interval from 0 to 1. When the acceptance rate α is equal to or greater than the threshold η, the analysis apparatus 200 adopts the virtual factor vector x′; otherwise, the analysis apparatus 200 adopts the factor vector x.
  • the adopted factor vector is denoted by <x>.
  • the analysis apparatus 200 compares the adopted factor vector <x> with a random number vector R. Specifically, for example, the analysis apparatus 200 determines whether all elements of the adopted factor vector <x> are equal to or greater than the corresponding elements of the random number vector R. If they are, the analysis apparatus 200 determines the adopted factor vector <x> as the virtual factor vector of a new virtual patient.
  • otherwise, the analysis apparatus 200 determines the factor vector x as the virtual factor vector of the new virtual patient. Although the judgment condition here is that all elements of the adopted factor vector <x> are equal to or greater than the corresponding elements of the random number vector R, the condition may instead be that only some elements of <x> are equal to or greater than the corresponding elements.
  • the analysis apparatus 200 calculates the disease probability, which is the objective variable 602, by giving the factor 603, i.e. the virtual factor vector of the new virtual patient, to the learning model for each virtual patient information entry. In this way, in step S503, the entries of the virtual patient information are set, and the factor probability distributions d1 and d2 are generated.
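The adoption loop of step S503 can be sketched as a generic Metropolis step. Since expression (2) is not reproduced in this text, the target density below is a hypothetical stand-in for the role of the learning model f and the γ value, and the symmetric Gaussian proposal q makes the q terms of the acceptance rate cancel:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Hypothetical unnormalised density standing in for the role of the
    # learning model f and the gamma value in expression (2).
    return np.exp(-0.5 * np.sum((x - 3.0) ** 2))

def metropolis_step(x, step=1.0):
    # Propose x' from a symmetric Gaussian q(x'|x); for a symmetric proposal
    # the q terms cancel, leaving alpha = min(1, target(x') / target(x)).
    x_new = x + rng.normal(0.0, step, size=x.shape)
    alpha = min(1.0, target(x_new) / target(x))
    if rng.uniform() < alpha:  # compare with a uniform random number
        return x_new           # adopt the virtual factor vector x'
    return x                   # otherwise keep the factor vector x

# Run one chain of virtual factor vectors (2 factors for brevity).
chain = [np.zeros(2)]
for _ in range(2000):
    chain.append(metropolis_step(chain[-1]))
chain = np.array(chain)
```

Running two such chains from different starting points yields the two probability distributions d1 and d2 whose convergence is checked in the next step.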
  • the analysis apparatus 200 determines whether the factor probability distributions d1 and d2 have converged to the same probability distribution, as shown in (3) of FIG. 1 (step S504). Specifically, for example, the analysis apparatus 200 calculates a convergence value for verifying whether the probability distributions d1 and d2 converge to the same probability distribution by the Gelman-Rubin method. More specifically, the analysis apparatus 200 gives each column of data of the factor probability distribution d1 and the corresponding column of data of the factor probability distribution d2 to the Gelman-Rubin convergence determination formula to calculate the convergence value Rhat.
  • for example, the analysis apparatus 200 calculates the convergence value Rhat by giving the column data of the objective variable 602 of the factor probability distribution d1 and the column data of the objective variable 602 of the factor probability distribution d2 to the Gelman-Rubin convergence determination formula. Similarly, the analysis apparatus 200 gives the column data of medicine 1 in the factor 603 of d1 and the corresponding column data of d2 to the formula to calculate the convergence value Rhat, and does the same for the column data of medicine 2 onward.
  • the analysis apparatus 200 deletes the column data determined not to converge. If the ratio of remaining column data is equal to or greater than a threshold (for example, 50%), the factor probability distributions d1 and d2 are judged to have converged to the same probability distribution (step S504: Yes), and the process proceeds to step S505. Otherwise (step S504: No), the process returns to step S503, and the analysis apparatus 200 regenerates the probability distributions d1 and d2 of the factors derived from the learning data set 10. When any column data of the factor 603 of the probability distributions d1 and d2 has been deleted, the analysis apparatus 200 gives the remaining factors 603 to the learning model and recalculates the objective variable 602.
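The convergence value Rhat of step S504 can be sketched per column of data. The text does not reproduce the Gelman-Rubin formula, so the standard textbook form of the statistic is used here, with synthetic chains:

```python
import numpy as np

def gelman_rubin(chains):
    # chains: (m, n) array of m parallel chains of length n for one column.
    # Rhat compares between-chain variance B with within-chain variance W;
    # values close to 1 indicate convergence to the same distribution.
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n / (m - 1) * np.sum((chain_means - chain_means.mean()) ** 2)
    W = chains.var(axis=1, ddof=1).mean()
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
# Two chains drawn from the same distribution: Rhat should be close to 1.
same = rng.normal(0.0, 1.0, size=(2, 5000))
# Chains centred 5 apart: Rhat is clearly above 1 (no convergence).
apart = same + np.array([[0.0], [5.0]])
```

A common practical rule is to treat a column as converged when Rhat falls below a small threshold such as 1.1.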
  • alternatively, the analysis apparatus 200 may proceed to step S505 without deleting the column data determined not to converge; analysis covering all factors 603 can thereby be performed. Further, step S504 itself may be skipped, which improves the analysis speed.
  • the integrated probability distribution D is a probability distribution obtained by integrating the factor probability distributions d1 and d2 (step S505).
  • FIG. 7 is an explanatory diagram showing an example of the integrated probability distribution D.
  • in the integrated probability distribution D, the contents of the factor probability distributions d1 and d2 shown in FIG. 6 are concatenated.
  • in the subsequent processing, the integrated probability distribution D is used.
  • column data deleted in the convergence determination is also deleted from D.
  • the analysis apparatus 200 generates a factor cluster by factor clustering using the integrated probability distribution D as shown in (4) of FIG. 1 (step S506).
  • the analysis apparatus 200 executes the factor clustering method selected in the factor clustering selection area 403, and generates as many factor clusters as the number of clusters set in the factor cluster number setting area 404.
  • FIG. 8 is an explanatory diagram showing the factor clustering result 40.
  • the factor clustering result 40 has a patient type ID field 801, an objective variable field 802, and a factor field 803.
  • the value of each field 801 to 803 in the same row constitutes an entry that becomes patient type information.
  • the patient type ID field 801 is a storage area for storing a patient type ID.
  • the patient type ID 801 is identification information that uniquely identifies a patient type classified by factor clustering.
  • the objective variable field 802 is a storage area for storing objective variables for each patient type ID 801.
  • the objective variable 802 indicates the disease probability.
  • the disease probability is expressed as 0% to 100%.
  • the factor field 803 is a storage area for storing a plurality of factors.
  • Factor 803 is an explanatory variable indicating the dose of the drug to the patient type.
  • for convenience, the factors 803 are four explanatory variables, medicine 1 to medicine 4; in practice, the factors are, for example, the medicines that remain after the convergence determination (step S504).
  • k-means clustering is used as factor clustering, and the number of clusters is “3” as an example.
  • the entries of the integrated probability distribution D are classified into three factor clusters corresponding to the patient types α, β, and γ.
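The factor clustering of step S506 can be sketched with a minimal k-means; the farthest-point initialisation and the synthetic dose data below are choices made here to keep the sketch deterministic, not details of the method described above:

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    # Minimal k-means (Lloyd's algorithm). Greedy farthest-point
    # initialisation keeps the sketch deterministic.
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assign each row to its nearest centroid, then recompute centroids.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Synthetic virtual-patient rows (doses of 4 drugs) in three well-separated
# groups, standing in for the patient types alpha, beta, and gamma.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc, 0.5, size=(20, 4)) for loc in (0.0, 10.0, 20.0)])
labels, centroids = kmeans(X, k=3)
```

Each resulting label plays the role of a patient type, and the rows sharing a label form one factor cluster.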
  • the analysis apparatus 200 calculates the statistical value of each factor from each factor cluster (step S507). Specifically, for example, the analysis apparatus 200 sets, in the factor field 803 of each entry, statistical values computed from the virtual patient information of the integrated probability distribution D belonging to that entry's patient type.
  • the statistical value is, for example, a median value. In addition to the median value, an average value, a maximum value, a minimum value, or a randomly selected value may be used.
  • the analysis apparatus 200 calculates a disease probability that is the objective variable 802 by giving a statistical value that is the factor 803 to the learning model.
  • in this way, for each patient type, the factors 803 and the objective variable 802 are aggregated into statistical values and the disease probability derived from those statistical values.
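The aggregation of step S507 can be sketched as a per-cluster median; the factor rows and cluster labels below are hypothetical:

```python
import numpy as np

# Hypothetical virtual-patient factor rows (doses of 4 drugs) and the
# factor-cluster (patient-type) label assigned to each row in step S506.
D_factors = np.array([
    [20.0, 13.0, 0.0, 22.0],
    [22.0, 11.0, 1.0, 20.0],
    [ 5.0, 30.0, 9.0,  2.0],
    [ 7.0, 28.0, 8.0,  4.0],
])
labels = np.array([0, 0, 1, 1])

# Step S507: summarise each factor cluster by the per-factor median; the
# median could be swapped for the mean, max, min, or a random member.
medians = {int(t): np.median(D_factors[labels == t], axis=0)
           for t in np.unique(labels)}
```

The resulting per-cluster medians are the factor values 803 that are then given to the learning model to derive the cluster's disease probability.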
  • the analysis apparatus 200 calculates the co-occurrence amount between factors of the integrated probability distribution D (step S508).
  • the co-occurrence amount is a correlation value between two factors.
  • the analysis apparatus 200 exhaustively combines all pairs of factors in the integrated probability distribution D and calculates a correlation value for each pair.
  • the correlation value is calculated by the calculation method selected in the co-occurrence amount selection area 407 in the initial setting (step S501).
  • the analysis device 200 generates a co-occurrence cluster by co-occurrence clustering as shown in (5) of FIG. 1 (step S509).
  • the analysis apparatus 200 applies a hierarchical clustering method to the co-occurrence amount to generate a co-occurrence cluster.
  • Hierarchical clustering first treats each individual data item as its own co-occurrence cluster, then calculates the similarity between co-occurrence clusters and merges the most similar pair; this process is repeated until all co-occurrence clusters have been merged into one, generating a dendrogram.
  • the similarity between co-occurrence clusters is based, for example, on the distance between them: the shorter the distance, the more similar the clusters.
  • the distance between co-occurrence clusters is defined by the nearest neighbor method, the farthest neighbor method, or the centroid method.
  • FIG. 9 is an explanatory diagram showing a processing example of co-occurrence clustering (S508, S509).
  • A shows the process of step S508.
  • the co-occurrence amount table 900 is a table that holds correlation values between factors.
  • B shows the process of step S509.
  • the analyzer 200 removes the correlation values of each factor with itself (the diagonal elements).
  • for hierarchical clustering, the analysis apparatus 200 converts each correlation value r into the distance 1 − r.
  • after this conversion, the smaller the value, the more similar the two factors are. Therefore, the analysis apparatus 200 selects the combination of factors that minimizes this value as a co-occurrence cluster.
  • a combination of medicine 1 and medicine 2 (co-occurrence cluster 1) and a combination of medicine 3 and medicine 4 (co-occurrence cluster 2) are selected.
  • the co-occurrence cluster is a combination of two factors, but may be a combination of three or more factors.
  • the process (B) is executed until the number of co-occurrence clusters reaches the number set in the co-occurrence cluster number setting area 409, or until no more clusters can be merged.
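Steps S508 and S509 together can be sketched with a standard hierarchical-clustering library. This sketch assumes the Pearson correlation as the co-occurrence amount and the nearest neighbor (single linkage) method; the medicine names and correlation values are illustrative:

```python
# Sketch of (A)/(B) in FIG. 9: convert correlations r to distances 1 - r,
# then merge the closest pairs until the requested number of co-occurrence
# clusters remains.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

names = ["medicine1", "medicine2", "medicine3", "medicine4"]
corr = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])

dist = 1.0 - corr                 # small distance = strongly co-occurring
np.fill_diagonal(dist, 0.0)      # drop self-correlations (diagonal)
Z = linkage(squareform(dist), method="single")    # nearest neighbor method
labels = fcluster(Z, t=2, criterion="maxclust")   # target: 2 clusters

clusters = {}
for name, lab in zip(names, labels):
    clusters.setdefault(int(lab), []).append(name)
# medicine1/medicine2 and medicine3/medicine4 end up in separate clusters
```

`method="complete"` or `method="centroid"` would correspond to the farthest neighbor or centroid methods mentioned above, and `scipy.cluster.hierarchy.dendrogram(Z)` would draw the dendrogram shown in the dendrogram display area 1103.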
  • the analysis apparatus 200 calculates the predicted value of each co-occurrence cluster as shown in (6) of FIG. 1 (step S510). Specifically, for example, the analysis apparatus 200 gives the factors belonging to the co-occurrence cluster to the learning model for each of the patient types α, β, and γ, thereby calculating the predicted disease probability for each patient type.
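Step S510 can be sketched as follows, assuming for illustration that the learning model is a logistic regression and that the per-patient-type statistics of step S507 are median doses. The training data, patient types, and dose values are invented for the example:

```python
# Sketch of step S510: feed the per-patient-type factor statistics to the
# learning model to obtain a predicted disease probability per patient type.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in learning data set: doses of two drugs -> disease outcome.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, 200) > 10).astype(int)

model = LogisticRegression().fit(X, y)   # stand-in learning model

# Median doses per patient type (from the factor clusters), illustrative:
type_stats = {"alpha": [9.0, 8.0], "beta": [1.0, 2.0]}
predicted = {
    ptype: model.predict_proba([doses])[0, 1]   # disease probability
    for ptype, doses in type_stats.items()
}
```

Restricting `doses` to only the factors of one co-occurrence cluster (with the remaining factors held at their statistics) gives the per-cluster predicted values of the prediction result 1000.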
  • FIG. 10 is an explanatory diagram showing the prediction result 1000 in step S510.
  • the analysis apparatus 200 can analyze the effectiveness of the combination of factors.
  • the analysis apparatus 200 executes threshold processing of the prediction result 1000 (step S511). Specifically, for example, the analysis apparatus 200 selects the combinations of a patient type and a factor cluster whose predicted value is equal to or greater than a threshold value. For example, when the threshold value set in the threshold setting area 410 is “0.8”, the analysis apparatus 200 selects factor cluster 1 of patient type α, factor cluster 1 of patient type β, and factor cluster 1 of patient type γ as calculation markers.
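The threshold processing of step S511 is then a simple filter over the prediction result; the prediction values below are illustrative:

```python
# Sketch of step S511: keep (patient type, cluster) pairs whose predicted
# value is at or above the threshold from threshold setting area 410.
predictions = {
    ("alpha", "cluster1"): 0.92,
    ("alpha", "cluster2"): 0.41,
    ("beta",  "cluster1"): 0.85,
    ("gamma", "cluster1"): 0.80,
}

THRESHOLD = 0.8   # value set in threshold setting area 410
markers = sorted(k for k, v in predictions.items() if v >= THRESHOLD)
```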
  • the analysis apparatus 200 outputs the processing result of step S510 or S511 (step S512). Specifically, for example, the analysis apparatus 200 displays the processing result on the display screen of a display (an example of the output device 204), transmits the processing result to an external apparatus via the communication IF 205, or writes the processing result to the storage device 202. The convergence determination result of step S504 may also be output.
  • FIG. 11 is an explanatory diagram illustrating a display screen example.
  • the display screen 1100 is displayed on a display that is an example of the output device 204.
  • the display screen 1100 includes a score display area 1101, a prediction result display area 1102, and a dendrogram display area 1103.
  • in the score display area 1101, the convergence value Rhat from the convergence determination (step S504) is displayed.
  • a prediction result 1000 shown in FIG. 10 is displayed in the prediction result display area 1102. As shown in FIG. 11, it may be displayed as a bar graph.
  • a dendrogram display area 1103 displays a dendrogram in hierarchical clustering. As described above, the intermediate result and final result of the processing shown in FIG. 5 are displayed on the display screen 1100.
  • as described above, the analysis apparatus 200 executes the first generation process (step S506), which generates a plurality of factor clusters by clustering the prediction data set (for example, the integrated probability distribution D) so that the values of the plurality of factors are similar to each other.
  • the analysis apparatus 200 executes a first calculation process that uses the prediction data set (for example, the integrated probability distribution D) to calculate, from the correlation of the plurality of factors, a co-occurrence amount with which the plurality of factors co-occur (step S508).
  • the analysis apparatus 200 clusters the plurality of factors based on the co-occurrence amount calculated by the first calculation process, and generates a plurality of co-occurrence clusters including one or more co-occurrence clusters that each contain two or more factors (second generation process, step S509).
  • the analysis apparatus 200 gives to the learning model, among the prediction values of the two or more factors in a specific prediction data group included in a specific factor cluster (a factor cluster containing two or more factors among the plurality of factor clusters generated by the first generation process), the prediction values of two or more specific factors indicated by a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the second generation process. The analysis apparatus 200 thereby executes a second calculation process for calculating the predicted value of the objective variable in the specific factor cluster (step S510).
  • the analysis apparatus 200 can analyze the effectiveness of the combination of factors based on the predicted value of the objective variable in a specific factor cluster in which a plurality of factors co-occur.
  • the analysis apparatus 200 executes a third calculation process that calculates, based on the predicted values of the two or more factors in the specific prediction data group, a statistical value representative of the predicted values of the two or more factors in the specific factor cluster (step S510).
  • the analysis apparatus 200 can reduce the amount of calculation when calculating the predicted value of the objective variable in a specific factor cluster in which a plurality of factors co-occur. Therefore, the analysis speed can be improved.
  • the analysis apparatus 200 executes a setting process for setting the type of learning model (step S501). Further, the analysis apparatus 200 executes a third generation process that generates a learning model of the type set by the setting process, using the measured value of the objective variable and the measured values of the plurality of factors, and stores the learning model in the storage device (step S502). Thereby, the user can select the type of learning model according to the objective.
  • the analysis apparatus 200 sets a linear model or a nonlinear model as the type in the setting process. Thereby, the analysis apparatus 200 can improve the analysis speed when a linear model is set, and can improve the analysis accuracy when a nonlinear model is set. In other words, the user can select a linear model to obtain the analysis result sooner, or a nonlinear model to improve the analysis accuracy.
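The linear-versus-nonlinear trade-off can be sketched as follows. The concrete model classes are illustrative stand-ins (the embodiment does not name them); the data set is invented so that the relation between factors and objective variable is deliberately nonlinear:

```python
# Sketch of steps S501/S502: the set model type decides which learning model
# is generated. A linear model is fast; a nonlinear model fits more relations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def build_model(model_type):
    if model_type == "linear":
        return LogisticRegression()    # faster, less flexible
    if model_type == "nonlinear":
        return RandomForestClassifier(n_estimators=50, random_state=0)
    raise ValueError(model_type)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # nonlinear (XOR-like) relation

linear = build_model("linear").fit(X, y)
nonlinear = build_model("nonlinear").fit(X, y)
# On this relation the nonlinear model fits far better than the linear one.
```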
  • the prediction data set (for example, the integrated probability distribution D) may be a data set generated from the learning data set 10 by a probability sampling method using a learning model.
  • a prediction data set (for example, integrated probability distribution D) becomes a data set depending on the learning model. Therefore, for example, when a non-linear model is set, the prediction data set (for example, the integrated probability distribution D) is a data set with higher accuracy than when a linear model is set.
  • the analysis apparatus 200 executes a fourth generation process (step S503) that generates two prediction data groups (for example, the factor probability distributions d1 and d2) by adopting, at each step of a probability sampling method (for example, the Markov chain Monte Carlo method) using the learning model, either the prediction data or data similar to the prediction data.
  • Data similar to the prediction data is data obtained by adding a random value to each value of the factor that is the prediction data, as described above.
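The sampling loop of step S503 can be sketched as a Metropolis-style random walk: at each step, "data similar to the prediction data" is proposed by adding a random value, and either the proposal or the current data is adopted. The target density below is an illustrative stand-in for the density derived from the learning model:

```python
# Sketch of step S503: generate two prediction data groups d1 and d2 by a
# Markov chain Monte Carlo random walk over a stand-in target density.
import math
import random

def density(x):
    return math.exp(-0.5 * x * x)   # illustrative stand-in target

def sample_chain(steps, seed):
    rnd = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(steps):
        proposal = x + rnd.gauss(0.0, 1.0)      # similar data: x + noise
        if rnd.random() < density(proposal) / density(x):
            x = proposal                        # adopt the proposal
        chain.append(x)                         # else keep the current data
    return chain

d1 = sample_chain(5000, seed=1)   # factor probability distribution d1
d2 = sample_chain(5000, seed=2)   # factor probability distribution d2
```

Running the same chain twice from different random seeds yields the two groups whose convergence is checked in step S504.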
  • the analysis apparatus 200 executes a determination process for determining whether or not the two prediction data groups (for example, the factor probability distributions d1 and d2) generated by the fourth generation process converge to the same probability distribution (step S504).
  • the analysis apparatus 200 executes an integration process that generates the prediction data set (for example, the integrated probability distribution D) by integrating the two prediction data groups (for example, the factor probability distributions d1 and d2) based on the determination result of the determination process (step S505).
  • the determination process determines whether or not two prediction data groups (for example, probability distributions d1 and d2 of factors) converge to the same probability distribution, for example, the probability distribution of the learning data set 10.
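The convergence determination of step S504 uses the Gelman-Rubin score (the Rhat value shown in the score display area 1101). As a minimal sketch, Rhat compares the within-chain and between-chain variance of d1 and d2; values near 1 indicate convergence to the same distribution. The chains below are illustrative:

```python
# Sketch of step S504: Gelman-Rubin convergence statistic for two chains.
import random

def gelman_rubin(chains):
    m = len(chains)                     # number of prediction data groups
    n = len(chains[0])                  # samples per group
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)   # between-chain
    W = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1)
        for c, mu in zip(chains, means)
    ) / m                                                      # within-chain
    var_hat = (n - 1) / n * W + B / n
    return (var_hat / W) ** 0.5

rnd = random.Random(42)
d1 = [rnd.gauss(0, 1) for _ in range(2000)]
d2 = [rnd.gauss(0, 1) for _ in range(2000)]
rhat = gelman_rubin([d1, d2])   # near 1.0 -> the chains have converged
```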
  • the analysis apparatus 200 executes a setting process (step S501) that sets the value of a parameter (the ε value) controlling the adoption rate ε at which either the prediction data or data similar to the prediction data is adopted in the probability sampling method (for example, the Markov chain Monte Carlo method) using the learning model. As a result, factors for which the objective variable becomes (1 − ε) or more can be adopted at the adoption rate ε.
  • the analysis apparatus 200 executes a setting process for setting the number of factor clusters generated (step S501).
  • the analysis apparatus 200 can generate the number of factor clusters specified by the user. Specifically, for example, as the number of generated factor clusters increases, the prediction data set (for example, the integrated probability distribution D) is subdivided more finely.
  • therefore, the user can set a smaller number of factor clusters to obtain the analysis result sooner, or a larger number to increase the analysis accuracy.
  • the analysis apparatus 200 executes a setting process for setting the number of co-occurrence clusters generated (step S501).
  • the analyzer 200 can generate the number of co-occurrence clusters specified by the user. Specifically, for example, as the number of generated co-occurrence clusters increases, the number of co-occurring factors and the number of combinations of co-occurring factors increase. Therefore, the user can set a smaller number of co-occurrence clusters to obtain the analysis result sooner, or a larger number to increase the analysis accuracy.
  • in Example 1, the plurality of factors 303 and 603 are the doses of a plurality of drugs administered to a patient, and the objective variables 302 and 602 are values indicating the drug efficacy when those drugs are administered (for example, the disease probability). This makes it possible to predict, for each patient type (factor cluster), how effective which combination of drugs is at which doses.
  • in Example 1, medicinal effect analysis is described as an example, but the present invention can also be applied to product recommendation.
  • in that case, the patient ID 301 identifies a customer instead of a patient.
  • the factor 303 indicates, for example, the number of purchases (in the case of a product) or the number of uses (in the case of a service) of a product or service (may be a product or service genre).
  • the objective variable 302 indicates, for example, a purchase amount (in the case of a product) or a usage amount (in the case of a service) of a product or service (which may be a genre of the product or service).
  • in another application, the patient ID 301 is replaced with, for example, a news article published in a newspaper, magazine, or web page instead of a patient.
  • the factor 303 indicates, for example, the number of times the word appears.
  • the objective variable 302 indicates the genre of a news article such as politics, society, sports, and weather. The same applies to the factor probability distributions d1 and d2 and the integrated probability distribution D.
  • Example 2 will be described.
  • in Example 1, the analysis process shown in FIG. 5 is executed by one computer.
  • in Example 2, the analysis process shown in FIG. 5 is distributed across a plurality of computers. This reduces the load on each computer and increases the analysis speed.
  • each computer has, for example, the hardware configuration shown in FIG.
  • FIG. 12 is an explanatory diagram showing a system configuration example of the analysis system.
  • the analysis system 1200 includes a plurality of computers (hereinafter simply referred to as nodes) N0 to Nn (n is an integer of 1 or more) and one or more client terminals C.
  • the nodes N0 to Nn and the client terminals C are communicably connected via a network 1201.
  • the node N0 is the master node N0, and the nodes N1 to Nn are worker nodes N1 to Nn.
  • the master node N0 manages the worker nodes N1 to Nn.
  • Worker nodes N1 to Nn execute processing in accordance with instructions from master node N0. Note that one of the worker nodes N1 to Nn may be responsible for the function of the master node N0.
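The master/worker division of labor can be sketched as follows. This is an assumption-laden miniature: the workers are emulated with local threads, whereas in the analysis system 1200 they would be separate computers reached over the network 1201, and the worker job shown (generating one factor probability distribution, as in steps S1309/S1310) stands in for any of the delegated steps:

```python
# Sketch of FIG. 12: the master node dispatches independent jobs (e.g.
# generating d1 and d2) to worker nodes in parallel and collects the results.
from concurrent.futures import ThreadPoolExecutor
import random

def worker_generate_distribution(seed, size):
    """Worker-side job: generate one factor probability distribution."""
    rnd = random.Random(seed)
    return [rnd.gauss(0, 1) for _ in range(size)]

def master():
    # Master node N0: one generation request per worker, then integrate.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(worker_generate_distribution, 1, 1000)  # worker N1 -> d1
        f2 = pool.submit(worker_generate_distribution, 2, 1000)  # worker N2 -> d2
        d1, d2 = f1.result(), f2.result()
    return d1 + d2   # integrated probability distribution D (step S1401)

D = master()
```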
  • FIGS. 13 to 15 are flowcharts showing examples of distributed processing procedures by the analysis system 1200.
  • in this example, n = 2; that is, the analysis system 1200 includes a master node N0, worker nodes N1 and N2, and a client terminal C.
  • the client terminal C executes initial setting (step S501) (step S1301). Then, the client terminal C transmits an analysis request that is a setting content of the initial setting (step S501) to the master node N0 (step S1302).
  • the master node N0 transmits a learning model generation request to the worker node N1 (step S1303).
  • When receiving the learning model generation request, the worker node N1 generates a learning model as in step S502 (step S1304).
  • When the worker node N1 has generated the learning model, it transmits the learning model to the master node N0 (step S1305).
  • the master node N0 transmits the learning model to another worker node N2 (step S1306).
  • the master node N0 transmits a generation request for the factor probability distribution d1 to the worker node N1 (step S1307), and transmits a generation request for the factor probability distribution d2 to the worker node N2 (step S1308).
  • the probability distributions d1 and d2 of factors can be generated by parallel processing.
  • the worker node N1 generates a probability distribution d1 of factors derived from the learning data set 10 using a probability sampling method typified by the Markov chain Monte Carlo method (step S1309).
  • the worker node N2 also generates a probability distribution d2 of factors derived from the learning data set 10 using a probability sampling method typified by the Markov chain Monte Carlo method (step S1310).
  • the worker node N1 transmits the generated probability distribution d1 of the factor to the master node N0 (step S1311).
  • the worker node N2 also transmits the generated probability distribution d2 of the factor to the master node N0 (step S1312).
  • the master node N0 determines whether the factor probability distributions d1 and d2 converge to the same probability distribution as in step S504 (step S1313).
  • the master node N0 transmits the determination result to the client terminal C (step S1314).
  • the client terminal C receives and displays the determination result (eg, Gelman-Rubin score) (step S1315).
  • the master node N0 generates the integrated probability distribution D by integrating the probability distributions d1 and d2 of factors (step S1401), as in step S505. Then, the master node N0 transmits a factor clustering request to the worker node N1 (step S1402).
  • When receiving the factor clustering request, the worker node N1 generates factor clusters by factor clustering using the integrated probability distribution D, as in step S506 (step S1403). Further, the worker node N1 calculates the statistical value of each factor from each factor cluster, as in step S507 (step S1404).
  • the worker node N1 transmits the calculated statistical value to the master node N0 (step S1405).
  • the master node N0 transmits the received statistical value to the other worker node N2 (step S1406).
  • the master node N0 transmits a co-occurrence amount calculation request to the worker node N2 (step S1407).
  • the worker node N2 calculates the co-occurrence amount of factors of the integrated probability distribution D as in step S508 (step S1408). Then, the worker node N2 transmits the calculated co-occurrence amount (see (A) of FIG. 9) to the master node N0 (step S1409).
  • the master node N0 generates a co-occurrence cluster by co-occurrence clustering as in step S509, and generates ID lists A and B of the co-occurrence cluster (step S1501).
  • the co-occurrence cluster ID list A is an ID list that uniquely identifies one entry group obtained by dividing the entry of the integrated probability distribution D.
  • the co-occurrence cluster ID list B is an ID list that uniquely specifies the other entry group obtained by dividing the entry of the integrated probability distribution D.
  • the master node N0 transmits the co-occurrence cluster ID list A to the worker node N1 (step S1502), and transmits the co-occurrence cluster ID list B to the worker node N2 (step S1503).
  • the worker node N1 generates a co-occurrence cluster for the ID list A by co-occurrence clustering (step S1504).
  • the worker node N2 also generates a co-occurrence cluster for the ID list B by co-occurrence clustering (step S1505).
  • the worker node N1 calculates the predicted value of the co-occurrence cluster obtained in step S1504 as in step S510 (step S1506). Similarly to step S510, worker node N2 also calculates the predicted value of the co-occurrence cluster obtained in step S1505 (step S1507). The worker node N1 stores the predicted value obtained in step S1506 in the storage device 202 (step S1508). The worker node N2 also stores the predicted value obtained in step S1507 in the storage device 202 (step S1509). The worker node N1 transmits the predicted value obtained in step S1506 to the master node N0 (step S1510). The worker node N2 also transmits the predicted value obtained in step S1507 to the master node N0 (step S1511).
  • the master node N0 executes the threshold processing for the predicted value as in step S511 (step S1512). Then, the master node N0 transmits a calculation marker that is the execution result to the client terminal C (step S1513). The client terminal C displays the calculation marker on the display screen (step S1514).
  • FIG. 16 is a flowchart showing a modification of FIG. 15, which shows an example of the distributed processing procedure by the analysis system 1200.
  • in FIG. 15, the worker nodes N1 and N2 execute the co-occurrence clustering in parallel for the ID lists A and B, realizing high-speed processing.
  • in FIG. 16, the co-occurrence cluster calculation for the ID lists A and B is executed not by the worker nodes N1 and N2 but by the master node N0.
  • the same processes as those in FIG. 15 are denoted by the same step numbers, and the description thereof is omitted.
  • the master node N0 generates a co-occurrence cluster by co-occurrence clustering for the ID list A as in step S509 (step S1602).
  • the master node N0 transmits the co-occurrence cluster of the ID list A to the worker node N1 (step S1603).
  • the worker node N1 calculates the predicted value of the co-occurrence cluster obtained in step S1602, similarly to step S510 (step S1604).
  • the worker node N1 stores the predicted value obtained in step S1604 in the storage device 202 (step S1605).
  • the worker node N1 transmits the predicted value obtained in step S1604 to the master node N0 (step S1606).
  • the master node N0 generates a co-occurrence cluster by co-occurrence clustering for the ID list B as in step S509 (step S1607).
  • the master node N0 transmits the co-occurrence cluster of the ID list B to the worker node N2 (step S1608).
  • the worker node N2 calculates the predicted value of the co-occurrence cluster obtained in step S1607 as in step S510 (step S1609).
  • the worker node N2 stores the predicted value obtained in step S1609 in the storage device 202 (step S1610).
  • the worker node N2 transmits the predicted value obtained in step S1609 to the master node N0 (step S1611).
  • As described above, according to Example 2, the same effects as in Example 1 can be obtained.
  • the analysis processing shown in FIG. 5 is distributed by a plurality of computers. Thereby, it is possible to reduce the load on the computer and increase the analysis speed.
  • the distributed processing shown in FIGS. 13 to 16 is an example. Therefore, in addition to this, for example, at least two or more of the steps shown in FIGS. 13 to 16 may be executed by different computers.
  • the present invention is not limited to the above-described embodiments, and includes various modifications and equivalent configurations within the scope of the appended claims.
  • the above-described embodiments have been described in detail for easy understanding of the present invention, and the present invention is not necessarily limited to those having all the configurations described.
  • a part of the configuration of one embodiment may be replaced with the configuration of another embodiment.
  • each of the above-described configurations, functions, processing units, processing means, and the like may be realized in hardware by designing part or all of them as, for example, an integrated circuit, or may be realized in software by a processor interpreting and executing a program that implements each function.
  • Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, and an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, and a DVD.
  • control lines and information lines indicate those considered necessary for the explanation, and do not necessarily indicate all control lines and information lines necessary for implementation. In practice, almost all components may be considered to be connected to each other.


Abstract

An analysis device stores, in a storage device, a learning data set having a plurality of pieces of learning data including an actual value of an objective variable and actual values of a plurality of factors, a prediction data set having a plurality of pieces of prediction data derived from learning data including prediction values of the plurality of factors, and a learning model indicating a relationship between the actual value of the objective variable and the actual values of the plurality of factors; performs clustering of the prediction data set in such a manner that the values of the plurality of factors are similar to each other to generate a plurality of factor clusters; calculates, using the prediction data set, a co-occurrence amount of co-occurrence of the plurality of factors on the basis of a correlation of the plurality of factors; performs clustering of the plurality of factors on the basis of the co-occurrence amount to generate a plurality of co-occurrence clusters having one or more co-occurrence clusters including two or more factors; and feeds the learning model with, from among the prediction values of two or more factors in a specific prediction data group included in a specific factor cluster including two or more factors in the plurality of factor clusters, prediction values of two or more specific factors indicated by a specific co-occurrence cluster out of the plurality of co-occurrence clusters, thereby calculating a prediction value of the objective variable in the specific factor cluster.

Description

Analysis apparatus, analysis system, and analysis method
The present invention relates to an analysis apparatus, an analysis system, and an analysis method for analyzing data.
Patent Document 1 discloses a computer-implemented method, system, and computer-readable storage medium for use with a clinical decision support system that identifies and provides information regarding correlations between patient attributes and one or more adverse events (AEs). The process of Patent Document 1 includes processing database information including AEs and one or more patient attributes for correlations between AEs and patient attributes, and identifying at least one correlation between one or more AEs and one or more patient attributes. Correlations may be discovered through an association rule discovery process that determines one or more association rules, each of which satisfies confidence, support, and/or other thresholds. The process further provides information or alerts to the user based on the identified or discovered correlations.
Patent Document 2 discloses a medical care support program that provides appropriate support for medical care. The program of Patent Document 2 causes a computer to execute a process of comparing the treatment period of a patient for a diagnosed disease with a reference cure period for that disease and, when the patient's treatment period exceeds the reference cure period, searching storage means, which stores diseases causing similar symptoms in association with each other, for other diseases causing symptoms similar to those of the diagnosed disease, and outputting the disease name information of the other diseases found.
Patent Document 1: JP 2012-524945 T
Patent Document 2: JP 2014-199597 A
However, the above-described conventional techniques have a problem that, even if a learning model is generated from learning data, it is not known which factor is associated with which other factor. For example, when the objective variable is the disease probability and the factors are the doses of a plurality of drugs, it is not known whether administering, for example, drug A and drug B to a patient in combination is effective or causes side effects.
An object of the present invention is to analyze the effectiveness of combinations of factors.
An analysis apparatus, analysis system, and analysis method according to one aspect of the invention disclosed in the present application store, in a storage device, a learning data set having a plurality of pieces of learning data each including an actual value of an objective variable and actual values of a plurality of factors, a prediction data set having a plurality of pieces of prediction data derived from the learning data and including prediction values of the plurality of factors, and a learning model indicating a relationship between the actual value of the objective variable and the actual values of the plurality of factors; and execute: a first generation process of clustering the prediction data set so that the values of the plurality of factors are similar to each other, thereby generating a plurality of factor clusters; a first calculation process of calculating, using the prediction data set, a co-occurrence amount with which the plurality of factors co-occur, based on the correlation of the plurality of factors; a second generation process of clustering the plurality of factors based on the co-occurrence amount calculated by the first calculation process, thereby generating a plurality of co-occurrence clusters including one or more co-occurrence clusters each containing two or more factors; and a second calculation process of giving to the learning model, among the prediction values of the two or more factors in a specific prediction data group included in a specific factor cluster containing two or more factors among the plurality of factor clusters generated by the first generation process, the prediction values of two or more specific factors indicated by a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the second generation process, thereby calculating a prediction value of the objective variable in the specific factor cluster.
According to a representative embodiment of the present invention, the effectiveness of combinations of factors can be analyzed. Problems, configurations, and effects other than those described above will become apparent from the description of the following embodiments.
FIG. 1 is an explanatory diagram of a data analysis example according to the first embodiment. FIG. 2 is a block diagram illustrating a hardware configuration example of the analysis apparatus. FIG. 3 is an explanatory diagram showing the detailed contents of the learning data set shown in FIG. 1. FIG. 4 is an explanatory diagram illustrating an example of an initial setting screen. FIG. 5 is a flowchart illustrating an example of an analysis processing procedure performed by the analysis apparatus. FIG. 6 is an explanatory diagram showing probability distributions of factors. FIG. 7 is an explanatory diagram showing an example of the integrated probability distribution. FIG. 8 is an explanatory diagram showing a factor clustering result. FIG. 9 is an explanatory diagram of a processing example of co-occurrence clustering. FIG. 10 is an explanatory diagram illustrating the prediction result obtained in step S510. FIG. 11 is an explanatory diagram illustrating a display screen example. FIG. 12 is an explanatory diagram illustrating a system configuration example of the analysis system. FIG. 13 is a first flowchart illustrating an example of a distributed processing procedure performed by the analysis system. FIG. 14 is a second flowchart illustrating an example of a distributed processing procedure performed by the analysis system. FIG. 15 is a third flowchart illustrating an example of a distributed processing procedure performed by the analysis system. FIG. 16 is a flowchart showing a modification of the third flowchart shown in FIG. 15.
<Data Analysis Example>
FIG. 1 is an explanatory diagram of a data analysis example according to the first embodiment. Steps (1) to (6) show the procedure of the analysis method performed by the analysis apparatus. (1) The analysis apparatus generates a learning model from the learning data set 10. In the learning data set 10, as an example, the objective variable is a drug effect, specifically a disease probability, and the factors are the doses of a plurality of drugs administered to patients. Although the disease probability can range from 0% to 100%, here disease is recorded as 1 (=100%) and health as 0 (=0%). For convenience, the factors are the four explanatory variables drug 1 to drug 4, but in practice there may be, for example, tens of thousands to hundreds of millions of drugs. Each entry represents a patient. For convenience there are six patients, A to F, but in practice there may be, for example, tens of thousands to hundreds of millions of patients.
(1) The learning model generated in this step may be either a linear model or a nonlinear model. Linear models include, for example, linear classification and logistic regression. Nonlinear models include, for example, neural networks, support vector machines, AdaBoost, and random forests. The user can select one of these models when the learning model is generated. For example, a user who wants to analyze the effectiveness of factor combinations quickly may select a linear model, while a user who wants a high-accuracy analysis may select a nonlinear model.
(2) The analysis apparatus generates a probability distribution 20 of each factor from the learning model generated in (1). Specifically, for example, the analysis apparatus uses a probabilistic sampling method typified by the Markov chain Monte Carlo method to generate two sets of probability distributions 20 of factors derived from the learning data set 10 (referred to as d1 and d2, respectively). This makes it possible to collect a large amount of virtual factor data.
(3) The analysis apparatus determines whether the probability distributions d1 and d2 of the factors generated in (2) converge to the same probability distribution. Specifically, for example, the Gelman-Rubin method is used for this convergence determination. Until convergence is reached, the analysis apparatus continues generating the probability distributions 20 of the factors in (2).
(4) The analysis apparatus integrates the probability distributions d1 and d2 of the factors determined to converge in (3), and performs factor clustering on the integrated probability distribution of the factors (integrated probability distribution D). Specifically, for example, k-means clustering is used for the factor clustering. The number of clusters is set in advance; here it is "3" as an example. As a result, in the factor clustering result 40, the entries of the integrated probability distribution D are classified into three patient types α, β, and γ.
(5) The analysis apparatus also performs co-occurrence clustering on the integrated probability distribution D. Specifically, for example, the analysis apparatus calculates the correlation coefficients between the factors of the integrated probability distribution D as co-occurrence amounts, and then applies a hierarchical clustering method to the co-occurrence amounts to generate co-occurrence clusters. Here, it is assumed that co-occurrence cluster 1 (drug 1, drug 2) and co-occurrence cluster 2 (drug 3, drug 4) are obtained. Although each co-occurrence cluster here is a combination of two factors, it may be a combination of three or more factors.
(6) For each of the patient types α, β, and γ, the analysis apparatus gives the factors belonging to a co-occurrence cluster to the learning model, thereby calculating a predicted disease probability for each patient type. In this way, the analysis apparatus can analyze the effectiveness of combinations of factors.
<Hardware Configuration Example of the Analysis Apparatus>
FIG. 2 is a block diagram illustrating a hardware configuration example of the analysis apparatus. The analysis apparatus 200 includes a processor 201, a storage device 202, an input device 203, an output device 204, and a communication interface (communication IF 205), which are connected by a bus. The processor 201 controls the analysis apparatus 200. The storage device 202 serves as a work area for the processor 201 and is a non-transitory or transitory recording medium that stores various programs and data. Examples of the storage device 202 include a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), and a flash memory. The input device 203 is used to input data; examples include a keyboard, a mouse, a touch panel, a numeric keypad, and a scanner. The output device 204 outputs data; examples include a display and a printer. The communication IF 205 connects to a network and transmits and receives data.
<Learning Data Example>
FIG. 3 is an explanatory diagram showing the detailed contents of the learning data set 10 shown in FIG. 1. As an example, the learning data set 10 is data in table format. In the following descriptions of databases and tables, the value of an AA field bbb (where AA is a field name and bbb is a reference numeral) may be written as AA bbb; for example, the value of the patient ID field 301 is written as patient ID 301.
The learning data set 10 has a patient ID field 301, an objective variable field 302, and a factor field 303. The values of the fields 301 to 303 in the same row constitute an entry of patient information. In FIG. 3 the number of entries is six, but in practice there may be, for example, tens of thousands to hundreds of millions of patient entries.
The patient ID field 301 is a storage area for storing patient IDs. A patient ID 301 is identification information that uniquely identifies a patient.
The objective variable field 302 is a storage area for storing the objective variable for each patient ID 301. The objective variable 302 indicates the disease probability. Although the disease probability can range from 0% to 100%, the learning data set 10 contains measured values, so disease is recorded as 1 (=100%) and health as 0 (=0%).
The factor field 303 is a storage area for storing a plurality of factors. Each factor 303 is an explanatory variable indicating a drug dose. In this example, the factors 303 are the four explanatory variables drug 1 to drug 4 for convenience, but in practice there may be, for example, tens of thousands to hundreds of millions of drugs. The unit of the dose for each factor 303 is determined per drug.
In FIG. 3, the entry whose patient ID 301 is "patient A" indicates that patient A was administered "20" of drug 1, "13.0" of drug 2, and "22.0" of drug 4, and that as a result patient A has the disease. Similarly, the entry whose patient ID 301 is "patient B" indicates that patient B was administered "10" of drug 1, "23.0" of drug 2, "1" of drug 3, and "31.0" of drug 4, and that as a result patient B has the disease.
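For illustration only, the two entries described above can be sketched as plain Python records (field names such as `patient_id` and `doses` are assumptions made for this sketch, not identifiers from the patent):

```python
# Sketch of the learning data set 10 (FIG. 3) as Python records.
# Values come from the two example entries in the text; field names are illustrative.
learning_data = [
    {"patient_id": "A", "disease": 1,  # objective variable: 1 = diseased, 0 = healthy
     "doses": {"drug1": 20.0, "drug2": 13.0, "drug3": 0.0, "drug4": 22.0}},
    {"patient_id": "B", "disease": 1,
     "doses": {"drug1": 10.0, "drug2": 23.0, "drug3": 1.0, "drug4": 31.0}},
]

def factor_vector(entry):
    """Return the m-dimensional factor (dose) vector of one entry."""
    return [entry["doses"][k] for k in ("drug1", "drug2", "drug3", "drug4")]
```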
<Initial Setting Screen Example>
FIG. 4 is an explanatory diagram illustrating an example of an initial setting screen. The initial setting screen 400 is displayed on a display, which is an example of the output device 204, and is configured via the input device 203. The machine learning selection area 401 is a pull-down interface for selecting a machine learning method. The factor clustering setting area 402 is an area for setting a clustering method and the number of clusters. The factor clustering selection area 403 is a pull-down interface for selecting a factor clustering method. The factor cluster number setting area 404 is an input field for setting the number of clusters to be obtained by factor clustering.
The σ value setting area 405 is an input field for setting a σ value. The σ value is a fixed parameter used in the acceptance rate α of the Markov chain Monte Carlo method when the probability distribution 20 of each factor is generated in (2) of FIG. 1. The σ value is greater than 0 and at most 1.
The co-occurrence clustering setting area 406 is an area for setting a co-occurrence method, a clustering method, the number of clusters, and a threshold. The co-occurrence amount selection area 407 is a pull-down interface for selecting a method of calculating the co-occurrence amount. The co-occurrence clustering selection area 408 is a pull-down interface for selecting a co-occurrence clustering method. The co-occurrence cluster number setting area 409 is an input field for setting the number of co-occurrence clusters to be obtained. The threshold setting area 410 is an input field for setting a threshold on the predicted correlation value indicating the degree of association of a factor cluster. The decision button 411 is a button for committing the values of the items 401 to 410.
<Analysis Processing Procedure Example>
FIG. 5 is a flowchart illustrating an example of an analysis processing procedure performed by the analysis apparatus 200. The analysis apparatus 200 executes the processing shown in the flowchart of FIG. 5 by causing the processor 201 to execute the analysis program stored in the storage device 202. First, the analysis apparatus 200 performs initial setting (step S501). In the initial setting (step S501), the initial setting screen 400 shown in FIG. 4 is displayed on the display. The user selects or enters values for the items 401 to 410 on the initial setting screen. The analysis apparatus 200 reads the values of the items 401 to 410 upon detecting that the decision button 411 has been pressed.
Next, as shown in (1) of FIG. 1, the analysis apparatus 200 generates a learning model from the learning data set 10 (step S502). In the case of logistic regression, the learning model is expressed by the following equation (1).
y = f(x) = σ(w^T x + b) ... (1)
Here, y is a scalar representing the objective variable, and x is an m-dimensional feature vector, where m is the number of factors. In the learning data set 10 of FIG. 3, the number of factors 303 is four (drug 1 to drug 4), so m = 4. σ() is the sigmoid function. The vector w and the scalar b are the weight and bias parameters, respectively, and are called learning parameters. In the case of a nonlinear model, the term w^T x inside the sigmoid function σ() is replaced by a function of the vector w and the factors x that is more complex than w^T x.
The analysis apparatus 200 selects a learning model corresponding to the machine learning method selected in the machine learning selection area 401 of FIG. 4, and obtains the learning parameters that express that learning model.
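As a hedged sketch, equation (1) can be written in Python as follows. The parameter values below are hypothetical placeholders, not parameters learned from the patent's data:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function sigma() used in Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    """Learning model of Eq. (1): y = sigma(w^T x + b).
    x: m-dimensional factor vector, w: weight vector, b: scalar bias."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical learned parameters for m = 4 factors (drug 1 .. drug 4).
w = np.array([0.02, 0.01, -0.03, 0.015])
b = -0.5
x = np.array([20.0, 13.0, 0.0, 22.0])  # patient A's dose vector from FIG. 3
y = predict(x, w, b)                   # disease probability, strictly in (0, 1)
```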
Next, as shown in (2) of FIG. 1, the analysis apparatus 200 generates probability distributions d1 and d2 of the factors derived from the learning data set 10, using a probabilistic sampling method typified by the Markov chain Monte Carlo method (step S503).
FIG. 6 is an explanatory diagram showing the probability distributions d1 and d2 of the factors. The probability distributions d1 and d2 each have a virtual patient ID field 601, an objective variable field 602, and a factor field 603. The values of the fields 601 to 603 in the same row constitute an entry of virtual patient information. The number of entries is the same as the number of entries in the learning data set 10.
The virtual patient ID field 601 is a storage area for storing virtual patient IDs. A virtual patient ID 601 is identification information that uniquely identifies a virtual patient.
The objective variable field 602 is a storage area for storing the objective variable for each virtual patient ID 601. The objective variable 602 indicates the disease probability, expressed in the range of 0% to 100%.
The factor field 603 is a storage area for storing a plurality of factors. Each factor 603 is an explanatory variable indicating a drug dose. In this example, the number of factors 603 is the same as the number of factors 303 in the learning data set 10.
An example of generating the virtual patient information that constitutes the entries of the probability distributions d1 and d2 is as follows. The analysis apparatus 200 selects the factor vector of one of the entries of the learning data set 10. For example, suppose the factor vector x = (20, 13.0, 0, 22.0) of the entry whose patient ID 301 is "patient A" is selected. The analysis apparatus 200 adds a random value r to each element of the selected factor vector to obtain the virtual factor vector x' = (20+r, 13.0+r, 0+r, 22.0+r).
The analysis apparatus 200 substitutes the selected factor vector x and the virtual factor vector x' into equation (2) for the acceptance rate α of the Markov chain Monte Carlo method.
Figure JPOXMLDOC01-appb-M000001
The function q is a Gaussian distribution function: q(x'|x) indicates the probability of generating the virtual factor vector x' given the factor vector x, and q(x|x') indicates the probability of generating the factor vector x given the virtual factor vector x'. The function f is the learning model generated in step S502, for example as shown in equation (1). The σ value entered in the σ value setting area 405 is substituted for σ. With this σ value, the acceptance rate α follows a Gaussian distribution that includes patient information whose disease probability is (1−σ) or more. That is, the virtual factor vector x' of virtual patient information whose disease probability is (1−σ) or more can be adopted at the acceptance rate α.
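Equation (2) survives in this text only as an image placeholder. The surrounding description (proposal density q, learned model f) matches the standard Metropolis-Hastings acceptance ratio, shown here as a hedged reconstruction rather than the patent's exact formula, which additionally incorporates the σ parameter:

```latex
\alpha(x \rightarrow x') = \min\!\left(1,\ \frac{f(x')\, q(x \mid x')}{f(x)\, q(x' \mid x)}\right)
```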
Next, a uniform random number β is generated in the interval from 0 to 1. When the acceptance rate α is equal to or greater than the threshold β (for example, 1), the analysis apparatus 200 adopts the virtual factor vector x'; otherwise, the analysis apparatus 200 adopts the factor vector x. The adopted vector is denoted as the adopted factor vector <x>.
When the acceptance rate α is equal to or greater than the threshold β (for example, 1), the analysis apparatus 200 compares the adopted factor vector <x> with a random number vector R. Specifically, for example, the analysis apparatus 200 determines whether every element of the adopted factor vector <x> is equal to or greater than the corresponding element of the random number vector R. If so, the analysis apparatus 200 determines the adopted factor vector <x> to be the virtual factor vector of a new virtual patient. If not, the analysis apparatus 200 determines the factor vector x to be the virtual factor vector of the new virtual patient. Although the condition used here is that every element of the adopted factor vector <x> is equal to or greater than the corresponding element of the random number vector R, the condition may instead be that only some elements of <x> are equal to or greater than the corresponding elements of R.
After this, for each entry of virtual patient information, the analysis apparatus 200 gives the factors 603, that is, the virtual factor vector of the new virtual patient, to the learning model, thereby calculating the disease probability as the objective variable 602. In this way, in step S503, the entries of virtual patient information are populated and the probability distributions d1 and d2 of the factors are generated.
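The adopt-or-reject loop described above resembles a Metropolis-type random walk. The following is a simplified, illustrative sketch only: it uses a symmetric Gaussian proposal (so the q-ratio in the acceptance rate cancels, leaving α = min(1, f(x')/f(x))), a stand-in model with hypothetical parameters, and omits the σ-dependent weighting and the random-vector comparison described in the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w=np.array([0.02, 0.01, -0.03, 0.015]), b=-0.5):
    # Stand-in for the learned model f of Eq. (1); parameters are hypothetical.
    return sigmoid(np.dot(w, x) + b)

def sample_virtual_patients(x0, n_samples, step=1.0):
    """Random-walk Metropolis sampling around a real patient's factor vector x0."""
    samples = []
    x = np.asarray(x0, dtype=float)
    for _ in range(n_samples):
        x_new = x + rng.normal(0.0, step, size=x.shape)  # add random noise r per element
        alpha = min(1.0, model(x_new) / model(x))        # acceptance rate
        if rng.uniform() < alpha:                        # accept with probability alpha
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

# Virtual factor data generated around patient A's dose vector.
chain = sample_virtual_patients([20.0, 13.0, 0.0, 22.0], n_samples=100)
```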
Returning to FIG. 5, as shown in (3) of FIG. 1, the analysis apparatus 200 determines whether the probability distributions d1 and d2 of the factors have converged to the same probability distribution (step S504). Specifically, for example, the analysis apparatus 200 computes a convergence value for verifying this by the Gelman-Rubin method. More specifically, the analysis apparatus 200 gives each column of data of the probability distribution d1 and the corresponding column of data of the probability distribution d2 to the Gelman-Rubin convergence criterion to calculate the convergence value Rhat.
For example, the analysis apparatus 200 gives the column data of the objective variable 602 of d1 and the column data of the objective variable 602 of d2 to the Gelman-Rubin convergence criterion to calculate Rhat. Likewise, the analysis apparatus 200 gives the column data of drug 1 among the factors 603 of d1 and the corresponding column data of d2 to the criterion to calculate Rhat, and does the same for the column data of drug 2 and onward.
If the convergence value Rhat is 1.1 or less, the column data of the probability distributions d1 and d2 is determined to converge to the same probability distribution. The analysis apparatus 200 deletes column data determined not to converge. If the number of remaining columns is equal to or greater than a threshold (for example, 50%), the probability distributions d1 and d2 are regarded as having converged to the same probability distribution (step S504: Yes), and the processing proceeds to step S505. If the number of remaining columns is below the threshold (step S504: No), the processing returns to step S503, and the analysis apparatus 200 regenerates the probability distributions d1 and d2 of the factors derived from the learning data set 10. Also, when even one column of the factors 603 of the probability distributions d1 and d2 has been deleted, the analysis apparatus 200 gives the remaining factors 603 to the learning model and recalculates the objective variable 602.
Deleting non-converging column data improves the reliability of the probability distributions d1 and d2, and thus the analysis accuracy. Alternatively, if the number of remaining columns is equal to or greater than the threshold, the analysis apparatus 200 may proceed to step S505 without deleting the column data determined not to converge; this allows an analysis that covers all of the factors 603. Step S504 may also be skipped entirely, which improves the analysis speed.
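A minimal sketch of the Gelman-Rubin convergence value Rhat for a single pair of column-data chains follows. The exact variant used by the apparatus is not specified; this follows the common two-chain formulation, where Rhat near 1 (e.g. at most 1.1, the threshold used above) suggests both chains sample the same distribution:

```python
import numpy as np

def gelman_rubin_rhat(chain1, chain2):
    """Gelman-Rubin statistic for two chains of equal length n."""
    chains = np.array([chain1, chain2], dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    # Between-chain variance B and mean within-chain variance W.
    B = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)
    W = chains.var(axis=1, ddof=1).mean()
    var_hat = (n - 1) / n * W + B / n   # pooled variance estimate
    return np.sqrt(var_hat / W)

# Two chains drawn from the same distribution should give Rhat close to 1.
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(0.0, 1.0, 500)
rhat = gelman_rubin_rhat(a, b)
```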
Next, the analysis apparatus 200 integrates the probability distributions d1 and d2 of the factors whose convergence was determined in step S504 (step S505). The integrated probability distribution of the factors is referred to as the integrated probability distribution D.
FIG. 7 is an explanatory diagram showing an example of the integrated probability distribution D. For convenience of explanation, FIG. 7 shows the probability distributions d1 and d2 of FIG. 6 simply concatenated; if any column of the factors 603 was deleted in step S504, it is likewise absent from the integrated probability distribution D.
Next, as shown in (4) of FIG. 1, the analysis apparatus 200 generates factor clusters by factor clustering using the integrated probability distribution D (step S506). The analysis apparatus 200 executes the factor clustering method selected in the factor clustering selection area 403 during the initial setting (step S501), and generates as many factor clusters as the number of clusters set in the factor cluster number setting area 404.
FIG. 8 is an explanatory diagram showing the factor clustering result 40. The factor clustering result 40 has a patient type ID field 801, an objective variable field 802, and a factor field 803. The values of the fields 801 to 803 in the same row constitute an entry of patient type information.
The patient type ID field 801 is a storage area for storing patient type IDs. A patient type ID 801 is identification information that uniquely identifies a patient type classified by the factor clustering.
The objective variable field 802 is a storage area for storing the objective variable for each patient type ID 801. The objective variable 802 indicates the disease probability, expressed in the range of 0% to 100%.
The factor field 803 is a storage area for storing a plurality of factors. Each factor 803 is an explanatory variable indicating the drug dose for the patient type. In this example, the factors 803 are the four explanatory variables drug 1 to drug 4 for convenience, but in practice they are, for example, the drugs that remain after the convergence determination (step S504).
In FIG. 8, k-means clustering is used as the factor clustering, and the number of clusters is "3" as an example. As a result, the entries of the integrated probability distribution D are classified into factor clusters for the three patient types α, β, and γ.
Returning to FIG. 5, the analysis apparatus 200 calculates a statistic of each factor from each factor cluster (step S507). Specifically, for example, the analysis apparatus 200 sets, in the factor field 803 of each entry, a statistic of the virtual patient information in the integrated probability distribution D belonging to that entry's patient type. The statistic is, for example, the median; the mean, maximum, minimum, or a randomly selected value may also be used. The analysis apparatus 200 then gives the statistics in the factors 803 to the learning model to calculate the disease probability as the objective variable 802. In this way, the factors 803 and objective variable 802 of each patient type are collapsed to the statistics and the disease probability derived from them.
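Steps S506 and S507 can be sketched as follows, assuming k-means over a toy two-factor version of the integrated distribution D and the median as the statistic. The data values and the minimal k-means implementation are illustrative, not the apparatus's actual implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means sketch: assign rows to nearest center, recompute centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Toy integrated distribution D: two clearly separated dose patterns (two factors).
X = np.array([[20.0, 13.0], [21.0, 12.0], [1.0, 30.0], [2.0, 31.0]])
labels, _ = kmeans(X, k=2)

# Step S507: collapse each factor cluster to a per-factor median statistic.
medians = {c: np.median(X[labels == c], axis=0) for c in set(labels.tolist())}
```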
The analysis apparatus 200 also calculates the co-occurrence amount between the factors of the integrated probability distribution D (step S508). The co-occurrence amount is a correlation value between two factors. Specifically, for example, the analysis apparatus 200 combines all factors in the integrated probability distribution D in a round-robin manner and calculates the correlation value for each pair of factors. The correlation value is calculated by the calculation method selected in the co-occurrence amount selection area 407 during the initial setting (step S501).
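The round-robin correlation computation of step S508 can be sketched as follows. Pearson's r is assumed here as the selected calculation method, and the dose series are illustrative.

```python
# Step S508 sketch: co-occurrence amounts as pairwise correlations over
# all factor pairs (brute force). Pearson's r stands in for whichever
# method was chosen in selection area 407; the series are placeholders.
from itertools import combinations

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

factors = {
    "drug1": [1.0, 2.0, 3.0, 4.0],
    "drug2": [2.1, 3.9, 6.2, 8.0],   # moves with drug1
    "drug3": [4.0, 3.0, 2.0, 1.0],   # moves against drug1
}
cooccurrence = {(a, b): pearson(factors[a], factors[b])
                for a, b in combinations(sorted(factors), 2)}
```

A value near +1 indicates a pair of factors that strongly co-occur; these pairs are the candidates for merging in the next step.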
Next, the analysis apparatus 200 generates co-occurrence clusters by co-occurrence clustering, as shown in (5) of FIG. 1 (step S509). Specifically, for example, the analysis apparatus 200 applies a hierarchical clustering method to the co-occurrence amounts to generate the co-occurrence clusters. Hierarchical clustering is a clustering method that initially treats each individual data item as its own co-occurrence cluster, computes the similarity between co-occurrence clusters, merges the most similar pair, and repeats this process until all co-occurrence clusters have been merged into one, thereby generating a dendrogram. Here, the similarity between co-occurrence clusters is, for example, the shortness of the distance between them. Specifically, for example, the distance between co-occurrence clusters is defined by the nearest-neighbor method, the farthest-neighbor method, or the centroid method.
FIG. 9 is an explanatory diagram showing a processing example of the co-occurrence clustering (S508, S509). Part (A) shows the processing of step S508. The co-occurrence amount table 900 is a table that holds the correlation values between factors. Part (B) shows the processing of step S509. In (B), the analysis apparatus 200 deletes the correlation values of identical factor pairs. For the hierarchical clustering, the analysis apparatus 200 also converts each correlation value into a value obtained by subtracting it from 1. In (B), a smaller value therefore means that the two factors are more similar. Accordingly, the analysis apparatus 200 selects the combination of factors with the minimum value as a co-occurrence cluster. In the case of (B), the combination of drug 1 and drug 2 (co-occurrence cluster 1) and the combination of drug 3 and drug 4 (co-occurrence cluster 2) are selected. Although each co-occurrence cluster here is a combination of two factors, a combination of three or more factors is also possible.
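The merging procedure of (B) can be sketched as follows: correlations are converted to distances (1 − r) and the closest pair of clusters is merged repeatedly, using single linkage (the nearest-neighbor method named above) as one of the possible distance definitions. The correlation values echo the style of FIG. 9 but are illustrative assumptions.

```python
# Steps S508/S509 sketch: agglomerative merging over 1 - r distances,
# single linkage, until the requested co-occurrence cluster count is
# reached. Correlation values are illustrative placeholders.

dist = {frozenset(p): 1.0 - r for p, r in {
    ("drug1", "drug2"): 0.9,    # strongly co-occurring pair
    ("drug1", "drug3"): 0.2,
    ("drug1", "drug4"): 0.1,
    ("drug2", "drug3"): 0.15,
    ("drug2", "drug4"): 0.2,
    ("drug3", "drug4"): 0.85,   # strongly co-occurring pair
}.items()}

def single_link(ca, cb):
    """Nearest-neighbor distance between two clusters."""
    return min(dist[frozenset((a, b))] for a in ca for b in cb)

clusters = [{"drug1"}, {"drug2"}, {"drug3"}, {"drug4"}]
target = 2                       # co-occurrence cluster count from area 409
while len(clusters) > target:
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] |= clusters.pop(j)   # merge the closest pair

clusters = sorted(tuple(sorted(c)) for c in clusters)
```

With these values the procedure reproduces the grouping of FIG. 9(B): {drug 1, drug 2} and {drug 3, drug 4}.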
The processing of (B) is executed until the number of co-occurrence clusters reaches the number set in the co-occurrence cluster count setting area 409, or until no further clusters can be merged.
Returning to FIG. 5, the analysis apparatus 200 calculates the predicted value of each co-occurrence cluster, as shown in (6) of FIG. 1 (step S510). Specifically, for example, for each of the patient types α, β, and γ, the analysis apparatus 200 gives the factors belonging to the co-occurrence cluster to the learning model, thereby calculating a predicted disease probability for each patient type.
FIG. 10 is an explanatory diagram showing the prediction result 1000 of step S510. In this way, the analysis apparatus 200 can analyze the effectiveness of combinations of factors.
Returning to FIG. 5, the analysis apparatus 200 executes threshold processing on the prediction result 1000 (step S511). Specifically, for example, the analysis apparatus 200 selects the combinations of patient type and factor cluster whose predicted value is equal to or greater than a threshold. For example, when the threshold set in the threshold setting area 410 is "0.8", the analysis apparatus 200 selects factor cluster 1 of patient type α, factor cluster 1 of patient type β, and factor cluster 1 of patient type γ as calculation markers.
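The threshold step S511 can be sketched as follows; the prediction table is an illustrative stand-in for FIG. 10, not the embodiment's actual values.

```python
# Step S511 sketch: keep the (patient type, co-occurrence cluster) pairs
# whose predicted disease probability is at or above the threshold set
# in area 410. The prediction table below is an illustrative placeholder.

predictions = {
    ("alpha", "cluster1"): 0.92,
    ("alpha", "cluster2"): 0.41,
    ("beta",  "cluster1"): 0.85,
    ("beta",  "cluster2"): 0.30,
    ("gamma", "cluster1"): 0.88,
    ("gamma", "cluster2"): 0.77,
}
threshold = 0.8                      # value from threshold setting area 410
markers = sorted(k for k, v in predictions.items() if v >= threshold)
```

With these values, cluster 1 is selected as a calculation marker for all three patient types, matching the example in the text.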
The analysis apparatus 200 outputs the processing result of step S510 or S511 (step S512). Specifically, for example, the analysis apparatus 200 controls the display screen of a display, which is an example of the output device 204, to display the processing result, transmits the processing result to an external apparatus via the communication IF 205, or writes the processing result to the storage device 202. The convergence determination result of step S504 may also be output.
<Example of display screen>
FIG. 11 is an explanatory diagram showing an example of the display screen. The display screen 1100 is displayed on a display, which is an example of the output device 204. The display screen 1100 has a score display area 1101, a prediction result display area 1102, and a dendrogram display area 1103. The score display area 1101 displays the convergence value Rhat of the convergence determination (step S504). The prediction result display area 1102 displays the prediction result 1000 shown in FIG. 10, which may be displayed as a bar graph as shown in FIG. 11. The dendrogram display area 1103 displays the dendrogram of the hierarchical clustering. In this way, the intermediate and final results of the processing shown in FIG. 5 are displayed on the display screen 1100.
As described above, according to the first embodiment, the analysis apparatus 200 executes a first generation process that clusters the prediction data set (for example, the integrated probability distribution D) so that entries with similar factor values fall into the same cluster, thereby generating a plurality of factor clusters (step S506). The analysis apparatus 200 executes a first calculation process that, using the prediction data set (for example, the integrated probability distribution D), calculates the co-occurrence amounts by which the plurality of factors co-occur, based on the correlations among the plurality of factors (step S508). The analysis apparatus 200 executes a second generation process that clusters the plurality of factors based on the co-occurrence amounts calculated by the first calculation process, thereby generating a plurality of co-occurrence clusters including at least one co-occurrence cluster that contains two or more factors (step S509). Among the predicted values of the two or more factors in a specific prediction data group included in a specific factor cluster containing two or more factors, among the plurality of factor clusters generated by the first generation process, the analysis apparatus 200 gives to the learning model the predicted values of the two or more specific factors indicated by a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the second generation process. The analysis apparatus 200 then executes a second calculation process that calculates the predicted value of the objective variable in the specific factor cluster (step S510).
As a result, the analysis apparatus 200 can analyze the effectiveness of combinations of factors from the predicted value of the objective variable in a specific factor cluster in which a plurality of factors co-occur.
The analysis apparatus 200 also executes a third calculation process that calculates, based on the predicted values of the two or more factors in the specific prediction data group, a statistical value representative of the predicted values of the two or more factors in the specific factor cluster (step S510). This allows the analysis apparatus 200 to reduce the amount of computation when calculating the predicted value of the objective variable in a specific factor cluster in which a plurality of factors co-occur, and thus to improve the analysis speed.
The analysis apparatus 200 also executes a setting process that sets the type of the learning model (step S501). Using the measured values of the objective variable and the measured values of the plurality of factors, the analysis apparatus 200 executes a third generation process that generates a learning model of the type set by the setting process and stores it in the storage device (step S502). This allows the user to select the type of learning model according to the purpose.
In the setting process, the analysis apparatus 200 sets a linear model or a nonlinear model as the type. When a linear model is set, the analysis apparatus 200 can improve the analysis speed; when a nonlinear model is set, it can improve the analysis accuracy. In other words, the user can select the linear model to obtain analysis results more quickly, or the nonlinear model to increase the analysis accuracy.
The prediction data set (for example, the integrated probability distribution D) may be a data set generated from the learning data set 10 by a probability sampling method using the learning model. The prediction data set then depends on the learning model; therefore, for example, when a nonlinear model is set, the prediction data set (for example, the integrated probability distribution D) is more accurate than when a linear model is set.
The analysis apparatus 200 also executes a fourth generation process that generates two prediction data groups (for example, the factor probability distributions d1 and d2) by adopting either prediction data or data similar to the prediction data by a probability sampling method using the learning model (for example, the Markov chain Monte Carlo method) (step S503). Data similar to the prediction data is, as described above, data obtained by adding a random value to each factor value of the prediction data. The analysis apparatus 200 executes a determination process that determines whether the two prediction data groups (for example, the factor probability distributions d1 and d2) generated by the fourth generation process converge to the same probability distribution (step S504). The analysis apparatus 200 executes an integration process that generates the prediction data set (for example, the integrated probability distribution D) by integrating the two prediction data groups (for example, the factor probability distributions d1 and d2) based on the result of the determination process (step S505).
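Since the display screen shows the convergence value Rhat and the Gelman-Rubin score is mentioned for step S1315, the convergence determination of step S504 can be sketched with the Gelman-Rubin diagnostic. The two sample chains below are illustrative placeholders, not output of the embodiment's sampler.

```python
# Sketch of the step S504 convergence check: the two chains d1, d2 are
# judged to have converged to the same distribution when the Gelman-Rubin
# statistic Rhat is close to 1. The chains below are placeholders.

def gelman_rubin(chains):
    n = len(chains[0])                      # samples per chain
    m = len(chains)                         # number of chains
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m              # within-chain
    var_plus = (n - 1) / n * w + b / n      # pooled variance estimate
    return (var_plus / w) ** 0.5

d1 = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0]
d2 = [1.0, 0.95, 1.05, 1.1, 0.9, 1.0]
rhat = gelman_rubin([d1, d2])
converged = rhat < 1.1                      # common rule of thumb
```

Only when `converged` holds would the two groups be integrated into the prediction data set D.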
The determination process determines whether the two prediction data groups (for example, the factor probability distributions d1 and d2) converge to the same probability distribution, for example, the probability distribution of the learning data set 10. If they have converged, the two prediction data groups are found to be similar to the learning data set 10, and the prediction data set (for example, the integrated probability distribution D) is generated from them. This improves the plausibility of the prediction data set (for example, the integrated probability distribution D) as predicted values, that is, the generation accuracy.
The analysis apparatus 200 also executes a setting process that sets the value of a parameter (for example, the σ value) controlling the adoption rate α at which either prediction data or data similar to the prediction data is adopted by the probability sampling method using the learning model (for example, the Markov chain Monte Carlo method) (step S501). This makes it possible to adopt, at the adoption rate α, factors whose objective variable is (1−σ) or higher.
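One reading of this sampling step is a Metropolis-style accept/reject loop: each candidate is the current factor vector plus random noise, and it is adopted when the learning model's predicted objective value is at least (1 − σ). The toy model function and all numeric values below are assumptions for illustration, not the embodiment's learning model or acceptance rule.

```python
# Illustrative accept/reject sketch of the sigma-controlled sampling
# (steps S503/S501). model() is a toy stand-in for the learning model;
# the acceptance criterion "objective >= 1 - sigma" follows the text.
import random

random.seed(0)

def model(factors):                      # toy stand-in, NOT the embodiment
    return max(0.0, 1.0 - abs(sum(factors) - 2.0))

def sample(start, sigma=0.2, steps=200):
    current, chain, accepts = list(start), [], 0
    for _ in range(steps):
        # Candidate = current data with a random value added to each factor.
        candidate = [x + random.gauss(0, 0.1) for x in current]
        if model(candidate) >= 1.0 - sigma:   # adopt only high-scoring data
            current, accepts = candidate, accepts + 1
        chain.append(list(current))
    return chain, accepts / steps

chain, rate = sample([1.0, 1.0])
```

Raising σ loosens the criterion and raises the adoption rate; lowering it keeps only factor vectors with high predicted objective values in the chain.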
The analysis apparatus 200 also executes a setting process that sets the number of factor clusters to generate (step S501). This allows the analysis apparatus 200 to generate as many factor clusters as the user specifies. Specifically, for example, the larger the number of factor clusters generated, the more finely the prediction data set (for example, the integrated probability distribution D) is subdivided. The user can therefore set a smaller number of factor clusters to obtain analysis results more quickly, or a larger number to increase the analysis accuracy.
The analysis apparatus 200 also executes a setting process that sets the number of co-occurrence clusters to generate (step S501). This allows the analysis apparatus 200 to generate as many co-occurrence clusters as the user specifies. Specifically, for example, the larger the number of co-occurrence clusters generated, the larger the number of co-occurring factors and of combinations of co-occurring factors. The user can therefore set a smaller number of co-occurrence clusters to obtain analysis results more quickly, or a larger number to increase the analysis accuracy.
In the first embodiment, the plurality of factors 303 and 603 are the doses of a plurality of drugs administered to patients, and the objective variables 302 and 602 are values indicating the drug efficacy (for example, the disease probability) when the plurality of drugs are administered to a patient at those doses. This makes it possible to predict how effective each of the plurality of drugs will be when administered in a given amount to each type (factor cluster) of patient.
Although the first embodiment has been described using drug efficacy analysis as an example, the present invention is also applicable to product recommendation. In this case, in the learning data set 10 shown in FIG. 3, the patient ID 301 is replaced with, for example, a customer instead of a patient. The factors 303 indicate, for example, the number of purchases (for products) or the number of uses (for services) of products or services (or genres of products or services). The objective variable 302 indicates, for example, the purchase amount (for products) or the usage amount (for services) of the products or services (or genres of products or services). The same applies to the factor probability distributions d1 and d2 and the integrated probability distribution D.
In the case of news article analysis, in the learning data set 10 shown in FIG. 3, the patient ID 301 is replaced with, for example, a news article published in a newspaper, a magazine, or a web page instead of a patient. The factors 303 indicate, for example, the numbers of occurrences of words. The objective variable 302 indicates the genre of the news article, such as politics, society, sports, or weather. The same applies to the factor probability distributions d1 and d2 and the integrated probability distribution D.
A second embodiment will now be described. In the first embodiment, the analysis processing shown in FIG. 5 is executed by a single computer; in the second embodiment, the analysis processing shown in FIG. 5 is distributed across a plurality of computers. This reduces the load on each computer and increases the analysis speed. Specifically, each computer has, for example, the hardware configuration shown in FIG. 2.
FIG. 12 is an explanatory diagram showing an example of the system configuration of the analysis system. The analysis system 1200 includes a plurality of computers (hereinafter simply "nodes") N0 to Nn (where n is an integer of 1 or more) and one or more client terminals C. The nodes N0 to Nn and the client terminals C are communicably connected via a network 1201. The node N0 is the master node, and the nodes N1 to Nn are worker nodes. The master node N0 manages the worker nodes N1 to Nn, and the worker nodes N1 to Nn execute processing in accordance with instructions from the master node N0. Any of the worker nodes N1 to Nn may take over the function of the master node N0.
<Example of distributed processing procedure>
FIGS. 13 to 15 are flowcharts showing an example of the distributed processing procedure of the analysis system 1200. Here, as an example, n = 2; that is, the analysis system 1200 consists of the master node N0, the worker nodes N1 and N2, and the client terminal C.
First, the client terminal C executes the initial setting (step S501) (step S1301). The client terminal C then transmits an analysis request containing the contents of the initial setting (step S501) to the master node N0 (step S1302).
The master node N0 transmits a learning model generation request to the worker node N1 (step S1303). On receiving the learning model generation request, the worker node N1 generates a learning model in the same manner as step S502 (step S1304). Having generated the learning model, the worker node N1 transmits it to the master node N0 (step S1305). On receiving the learning model from the worker node N1, the master node N0 transmits it to the other worker node N2 (step S1306).
Next, the master node N0 transmits a generation request for the factor probability distribution d1 to the worker node N1 (step S1307), and a generation request for the factor probability distribution d2 to the worker node N2 (step S1308). This allows the factor probability distributions d1 and d2 to be generated in parallel.
Next, in the same manner as step S503, the worker node N1 generates the probability distribution d1 of the factors derived from the learning data set 10 using a probability sampling method typified by the Markov chain Monte Carlo method (step S1309). Likewise, in the same manner as step S503, the worker node N2 generates the probability distribution d2 of the factors derived from the learning data set 10 (step S1310). The worker node N1 transmits the generated factor probability distribution d1 to the master node N0 (step S1311), and the worker node N2 likewise transmits the generated factor probability distribution d2 to the master node N0 (step S1312).
In the same manner as step S504, the master node N0 determines whether the factor probability distributions d1 and d2 have converged to the same probability distribution (step S1313), and transmits the determination result to the client terminal C (step S1314). As shown in FIG. 11, the client terminal C receives and displays the determination result (for example, the Gelman-Rubin score) (step S1315).
In FIG. 14, the master node N0 integrates the factor probability distributions d1 and d2 to generate the integrated probability distribution D, in the same manner as step S505 (step S1401). The master node N0 then transmits a factor clustering request to the worker node N1 (step S1402). On receiving the factor clustering request, the worker node N1 generates factor clusters by factor clustering using the integrated probability distribution D, in the same manner as step S506 (step S1403). In the same manner as step S507, the worker node N1 also calculates the statistical value of each factor from each factor cluster (step S1404) and transmits the calculated statistical values to the master node N0 (step S1405). The master node N0 transmits the received statistical values to the other worker node N2 (step S1406).
The master node N0 transmits a co-occurrence amount calculation request to the worker node N2 (step S1407). In the same manner as step S508, the worker node N2 calculates the co-occurrence amounts between the factors of the integrated probability distribution D (step S1408), and transmits the calculated co-occurrence amounts (see (A) of FIG. 9) to the master node N0 (step S1409).
In FIG. 15, the master node N0 generates co-occurrence clusters by co-occurrence clustering in the same manner as step S509, and generates co-occurrence cluster ID lists A and B (step S1501). The co-occurrence cluster ID list A uniquely identifies one of the two entry groups obtained by dividing the entries of the integrated probability distribution D, and the co-occurrence cluster ID list B uniquely identifies the other entry group.
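The split of the integrated distribution's entries into ID lists A and B, each processed by its own worker, can be sketched as follows. `ThreadPoolExecutor` stands in for the two worker nodes, and the per-worker workload is an illustrative stub, not the embodiment's clustering code.

```python
# Sketch of steps S1501-S1505: the master divides the entry IDs of the
# integrated probability distribution D into ID lists A and B, and farms
# each list out to a worker in parallel. The worker function is a stub.
from concurrent.futures import ThreadPoolExecutor

entry_ids = list(range(100))                 # illustrative entry IDs
id_list_a, id_list_b = entry_ids[:50], entry_ids[50:]

def co_occurrence_clustering(ids):           # stub for per-worker work
    return {"n_entries": len(ids), "first": ids[0]}

with ThreadPoolExecutor(max_workers=2) as pool:       # workers N1, N2
    result_a, result_b = pool.map(co_occurrence_clustering,
                                  [id_list_a, id_list_b])
```

The master would then collect `result_a` and `result_b`, just as steps S1510 and S1511 return the workers' predicted values to N0.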
The master node N0 transmits the co-occurrence cluster ID list A to the worker node N1 (step S1502), and the co-occurrence cluster ID list B to the worker node N2 (step S1503). In the same manner as step S509, the worker node N1 generates co-occurrence clusters for the ID list A by co-occurrence clustering (step S1504), and the worker node N2 likewise generates co-occurrence clusters for the ID list B (step S1505).
In the same manner as step S510, the worker node N1 calculates the predicted values of the co-occurrence clusters obtained in step S1504 (step S1506), and the worker node N2 likewise calculates the predicted values of the co-occurrence clusters obtained in step S1505 (step S1507). The worker node N1 stores the predicted values obtained in step S1506 in its storage device 202 (step S1508), and the worker node N2 stores the predicted values obtained in step S1507 in its storage device 202 (step S1509). The worker node N1 transmits the predicted values obtained in step S1506 to the master node N0 (step S1510), and the worker node N2 transmits the predicted values obtained in step S1507 to the master node N0 (step S1511).
In the same manner as step S511, the master node N0 executes threshold processing on the predicted values (step S1512). The master node N0 then transmits the calculation markers, which are the execution result, to the client terminal C (step S1513). The client terminal C displays the calculation markers on its display screen (step S1514).
FIG. 16 is a flowchart showing a modification of the distributed processing procedure of the analysis system 1200 shown in FIG. 15. In FIG. 15, the worker nodes N1 and N2 execute co-occurrence clustering in parallel for the ID lists A and B, thereby speeding up the processing. In FIG. 16, on the other hand, the co-occurrence cluster calculation for the ID lists A and B is executed by the master node N0 rather than by the worker nodes N1 and N2. Processing identical to that in FIG. 15 is given the same step numbers, and its description is omitted.
In FIG. 16, the master node N0 generates co-occurrence clusters for the ID list A by co-occurrence clustering, in the same manner as step S509 (step S1602). The master node N0 transmits the co-occurrence clusters of the ID list A to the worker node N1 (step S1603).
In the same manner as step S510, the worker node N1 calculates the predicted values of the co-occurrence clusters obtained in step S1602 (step S1604). The worker node N1 stores the predicted values obtained in step S1604 in the storage device 202 (step S1605) and transmits them to the master node N0 (step S1606).
In the same manner as step S509, the master node N0 generates co-occurrence clusters for the ID list B by co-occurrence clustering (step S1607). The master node N0 transmits the co-occurrence clusters of the ID list B to the worker node N2 (step S1608).
In the same manner as step S510, the worker node N2 calculates the predicted values of the co-occurrence clusters obtained in step S1607 (step S1609). The worker node N2 stores the predicted values obtained in step S1609 in the storage device 202 (step S1610) and transmits them to the master node N0 (step S1611).
 As described above, the second embodiment provides the same effects as the first embodiment. In addition, according to the second embodiment, the analysis processing shown in FIG. 5 is distributed across a plurality of computers, which reduces the load on each computer and speeds up the analysis. Note that the distributed processing shown in FIGS. 13 to 16 is only an example; for instance, any two or more of the steps shown in FIGS. 13 to 16 may be executed by different computers.
 The present invention is not limited to the embodiments described above, and includes various modifications and equivalent configurations within the scope of the appended claims. For example, the embodiments have been described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations having all of the described elements. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment, the configuration of one embodiment may be added to that of another, and part of the configuration of each embodiment may have other elements added, deleted, or substituted.
 Each of the configurations, functions, processing units, processing means, and the like described above may be realized partly or entirely in hardware, for example by designing them as integrated circuits, or may be realized in software by having a processor interpret and execute programs that implement the respective functions.
 Information such as the programs, tables, and files that realize each function can be stored in a storage device such as a memory, hard disk, or SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, or DVD.
 The control lines and information lines shown are those considered necessary for the explanation, and do not necessarily represent all the control lines and information lines required in an implementation. In practice, almost all components may be considered to be interconnected.

Claims (12)

  1.  An analysis apparatus comprising a processor that executes a program and a storage device that stores the program, wherein
     the storage device stores: a learning data set comprising a plurality of learning data items, each including measured values of an objective variable and measured values of a plurality of factors; a prediction data set comprising a plurality of prediction data items derived from the learning data, each including predicted values of the plurality of factors; and a learning model representing the relationship between the measured values of the objective variable and the measured values of the plurality of factors, and wherein
     the processor executes:
     a first generation process of clustering the prediction data set so that similar values of the plurality of factors are grouped together, thereby generating a plurality of factor clusters;
     a first calculation process of calculating, using the prediction data set, a co-occurrence amount with which the plurality of factors co-occur, based on the correlations among the plurality of factors;
     a second generation process of clustering the plurality of factors based on the co-occurrence amount calculated by the first calculation process, thereby generating a plurality of co-occurrence clusters at least one of which includes two or more factors; and
     a second calculation process of calculating a predicted value of the objective variable for a specific factor cluster, among the plurality of factor clusters generated by the first generation process, that includes two or more factors, by giving the learning model those predicted values of the two or more factors in the specific prediction data group belonging to the specific factor cluster that correspond to the two or more specific factors indicated by a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the second generation process.
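For illustration only (not part of the claims), the four processes of claim 1 might be sketched as below; the sign-based factor clustering, the correlation threshold, and the linear "learning model" are all hypothetical stand-ins for whatever methods an actual embodiment would use:

```python
import numpy as np

rng = np.random.default_rng(0)

# Prediction data set: rows = prediction data items, columns = factors.
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # make factors 0 and 1 co-occur

# Stand-in learning model: predicted objective = weighted sum of factor values.
weights = np.array([1.0, 2.0, 0.5, -1.0])
def learning_model(factors):
    return float(factors @ weights)

# First generation process: factor clusters group rows with similar values
# (here: a crude two-cluster split on the sign of factor 0).
factor_clusters = {0: X[X[:, 0] >= 0], 1: X[X[:, 0] < 0]}

# First calculation process: co-occurrence amount from factor correlations.
cooccurrence = np.abs(np.corrcoef(X, rowvar=False))

# Second generation process: co-occurrence clusters = factor pairs whose
# co-occurrence amount exceeds a (hypothetical) threshold.
threshold = 0.8
cooccurrence_clusters = [
    [i, j]
    for i in range(4) for j in range(i + 1, 4)
    if cooccurrence[i, j] > threshold
]

# Second calculation process: give the learning model only the co-occurring
# factors' predicted values (other factors zeroed) for one factor cluster.
specific_cluster = factor_clusters[0]
specific_factors = cooccurrence_clusters[0]          # e.g. [0, 1]
masked = np.zeros(4)
masked[specific_factors] = specific_cluster[:, specific_factors].mean(axis=0)
predicted_objective = learning_model(masked)
print(cooccurrence_clusters, round(predicted_objective, 3))
```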
  2.  The analysis apparatus according to claim 1, wherein
     the processor executes a third calculation process of calculating, based on the predicted values of the two or more factors in the specific prediction data group, statistical values representative of the predicted values of the two or more factors in the specific factor cluster, and wherein,
     in the second calculation process, the processor calculates the predicted value of the objective variable for the specific factor cluster by giving the learning model, among the representative statistical values calculated by the third calculation process, the statistical values of the two or more specific factors indicated by the specific co-occurrence cluster.
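For illustration only, the representative statistic of claim 2 might look like the sketch below; the claim leaves the statistic open, so the per-factor mean is used here as one plausible choice, and the values and factor indices are made up:

```python
import numpy as np

# Predicted factor values of the prediction data group in one factor cluster
# (rows = prediction data items, columns = factors); values are invented.
group = np.array([[1.0, 2.0, 0.5],
                  [1.2, 1.8, 0.7],
                  [0.8, 2.2, 0.6]])

# Third calculation process: one representative statistic per factor
# (the mean is a natural, but not the only possible, choice).
representative = group.mean(axis=0)

# The second calculation process then feeds only the specific co-occurring
# factors' statistics (hypothetically, factors 0 and 1) to the learning model.
specific_factors = [0, 1]
model_input = representative[specific_factors]
print(representative, model_input)
```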
  3.  The analysis apparatus according to claim 1, wherein
     the processor executes:
     a setting process of setting the type of the learning model; and
     a third generation process of generating a learning model of the type set by the setting process, using the measured values of the objective variable and the measured values of the plurality of factors, and storing the generated model in the storage device.
  4.  The analysis apparatus according to claim 3, wherein, in the setting process, the processor sets a linear model or a nonlinear model as the type.
  5.  The analysis apparatus according to claim 1, wherein the prediction data set is a data set generated from the learning data set by a probability sampling method using the learning model.
  6.  The analysis apparatus according to claim 1, wherein
     the processor executes:
     a fourth generation process of generating two prediction data groups by adopting, through a probability sampling method using the learning model, either the prediction data or data similar to the prediction data;
     a determination process of determining whether the two prediction data groups generated by the fourth generation process converge to the same probability distribution; and
     an integration process of generating the prediction data set by integrating the two prediction data groups based on the determination result of the determination process, and wherein,
     in the first generation process, the processor clusters the prediction data set obtained by the integration process so that similar values of the plurality of factors are grouped together, thereby generating the plurality of factor clusters, and wherein,
     in the first calculation process, the processor uses the prediction data set obtained by the integration process to calculate the co-occurrence amount with which the plurality of factors co-occur, based on the correlations among the plurality of factors.
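For illustration only, the two-group sampling and convergence check of claim 6 might be sketched as below; the Metropolis-style accept/reject rule, the normal density standing in for the learning model, and the crude mean/variance agreement test are all hypothetical stand-ins rather than the patent's actual sampling method or convergence criterion:

```python
import math
import random
import statistics

random.seed(1)

def model_density(x):
    """Stand-in for the learning model's probability: unnormalized N(0, 1)."""
    return math.exp(-0.5 * x * x)

def sample_chain(n, start):
    """Probability sampling: either keep the current prediction datum or
    adopt a similar (perturbed) one, according to an acceptance ratio."""
    x, out = start, []
    for _ in range(n):
        proposal = x + random.uniform(-1.0, 1.0)          # 'similar' data
        if random.random() < min(1.0, model_density(proposal) / model_density(x)):
            x = proposal                                   # adopt the similar datum
        out.append(x)                                      # else keep the current one
    return out

# Fourth generation process: two prediction data groups from different starts.
chain_a = sample_chain(5000, start=-3.0)
chain_b = sample_chain(5000, start=+3.0)

# Determination process (crude): treat the groups as having converged to the
# same distribution if their means and variances roughly agree.
converged = (abs(statistics.mean(chain_a) - statistics.mean(chain_b)) < 0.2
             and abs(statistics.variance(chain_a) - statistics.variance(chain_b)) < 0.3)

# Integration process: merge the two groups only once they have converged.
prediction_data_set = chain_a + chain_b if converged else None
print(converged)
```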
  7.  The analysis apparatus according to claim 6, wherein
     the processor executes a setting process of setting the value of a parameter that controls the acceptance rate at which either the prediction data or the data similar to the prediction data is adopted by the probability sampling method using the learning model, and wherein,
     in the fourth generation process, the processor generates the two prediction data groups by adopting either the prediction data or the similar data based on the acceptance rate.
  8.  The analysis apparatus according to claim 1, wherein
     the processor executes a setting process of setting the number of factor clusters to be generated, and wherein,
     in the first generation process, the processor clusters the prediction data set so that similar values of the plurality of factors are grouped together, thereby generating the number of factor clusters set by the setting process.
  9.  The analysis apparatus according to claim 1, wherein
     the processor executes a setting process of setting the number of co-occurrence clusters to be generated, and wherein,
     in the second generation process, the processor clusters the plurality of factors based on the co-occurrence amount calculated by the first calculation process, thereby generating the number, set by the setting process, of co-occurrence clusters at least one of which includes two or more factors.
  10.  The analysis apparatus according to claim 1, wherein the plurality of factors are doses of a plurality of drugs administered to a patient, and the objective variable is a value indicating the drug efficacy obtained when the plurality of drugs are administered to the patient at those doses.
  11.  An analysis system in which a plurality of computers are communicably connected, wherein
     one of the plurality of computers stores: a learning data set comprising a plurality of learning data items, each including measured values of an objective variable and measured values of a plurality of factors; a prediction data set comprising a plurality of prediction data items derived from the learning data, each including predicted values of the plurality of factors; and a learning model representing the relationship between the measured values of the objective variable and the measured values of the plurality of factors, and wherein
     one of the plurality of computers executes:
     a first generation process of clustering the prediction data set so that similar values of the plurality of factors are grouped together, thereby generating a plurality of factor clusters;
     a first calculation process of calculating, using the prediction data set, a co-occurrence amount with which the plurality of factors co-occur, based on the correlations among the plurality of factors;
     a second generation process of clustering the plurality of factors based on the co-occurrence amount calculated by the first calculation process, thereby generating a plurality of co-occurrence clusters at least one of which includes two or more factors; and
     a second calculation process of calculating a predicted value of the objective variable for a specific factor cluster, among the plurality of factor clusters generated by the first generation process, that includes two or more factors, by giving the learning model those predicted values of the two or more factors in the specific prediction data group belonging to the specific factor cluster that correspond to the two or more specific factors indicated by a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the second generation process.
  12.  An analysis method performed by an analysis apparatus having a processor that executes a program and a storage device that stores the program, wherein
     the storage device stores: a learning data set comprising a plurality of learning data items, each including measured values of an objective variable and measured values of a plurality of factors; a prediction data set comprising a plurality of prediction data items derived from the learning data, each including predicted values of the plurality of factors; and a learning model representing the relationship between the measured values of the objective variable and the measured values of the plurality of factors, and wherein
     the processor executes:
     a first generation process of clustering the prediction data set so that similar values of the plurality of factors are grouped together, thereby generating a plurality of factor clusters;
     a first calculation process of calculating, using the prediction data set, a co-occurrence amount with which the plurality of factors co-occur, based on the correlations among the plurality of factors;
     a second generation process of clustering the plurality of factors based on the co-occurrence amount calculated by the first calculation process, thereby generating a plurality of co-occurrence clusters at least one of which includes two or more factors; and
     a second calculation process of calculating a predicted value of the objective variable for a specific factor cluster, among the plurality of factor clusters generated by the first generation process, that includes two or more factors, by giving the learning model those predicted values of the two or more factors in the specific prediction data group belonging to the specific factor cluster that correspond to the two or more specific factors indicated by a specific co-occurrence cluster among the plurality of co-occurrence clusters generated by the second generation process.
PCT/JP2016/075726 2016-09-01 2016-09-01 Analysis device, analysis system, and analysis method WO2018042606A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2018536626A JP6695431B2 (en) 2016-09-01 2016-09-01 Analytical apparatus, analytical system and analytical method
PCT/JP2016/075726 WO2018042606A1 (en) 2016-09-01 2016-09-01 Analysis device, analysis system, and analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/075726 WO2018042606A1 (en) 2016-09-01 2016-09-01 Analysis device, analysis system, and analysis method

Publications (1)

Publication Number Publication Date
WO2018042606A1 true WO2018042606A1 (en) 2018-03-08

Family

ID=61301188

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/075726 WO2018042606A1 (en) 2016-09-01 2016-09-01 Analysis device, analysis system, and analysis method

Country Status (2)

Country Link
JP (1) JP6695431B2 (en)
WO (1) WO2018042606A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102151272B1 (en) * 2020-01-07 2020-09-02 한국토지주택공사 Method, apparatus and computer program for analyzing data using prediction model
WO2021053775A1 (en) * 2019-09-18 2021-03-25 日本電信電話株式会社 Learning device, estimation device, learning method, estimation method, and program

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
KR102198322B1 (en) * 2020-08-20 2021-01-04 플레인브레드 주식회사 Intelligent data visualization system using machine learning

Citations (6)

Publication number Priority date Publication date Assignee Title
JP2001034688A (en) * 1999-07-21 2001-02-09 Naoya Miyano Treatment data estimating method and treatment data estimating system
JP2006202235A (en) * 2005-01-24 2006-08-03 Nara Institute Of Science & Technology Time-based phenomenon occurrence analysis apparatus and time-based phenomenon occurrence analysis method
JP2008206575A (en) * 2007-02-23 2008-09-11 Hitachi Ltd Information management system and server
WO2011089872A1 (en) * 2010-01-22 2011-07-28 パナソニック株式会社 Image management device, image management method, program, recording medium, and integrated circuit
JP2011227838A (en) * 2010-04-23 2011-11-10 Kyoto Univ Prediction apparatus, learning apparatus for the same, and computer program for these apparatuses
JP2012524945A (en) * 2009-04-22 2012-10-18 リード ホース テクノロジーズ インコーポレイテッド Artificial intelligence assisted medical referencing system and method

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US9466024B2 (en) * 2013-03-15 2016-10-11 Northrop Grumman Systems Corporation Learning health systems and methods
JP6066826B2 (en) * 2013-05-17 2017-01-25 株式会社日立製作所 Analysis system and health business support method
JP6324828B2 (en) * 2014-07-07 2018-05-16 株式会社日立製作所 Medicinal effect analysis system and medicinal effect analysis method

Cited By (4)

Publication number Priority date Publication date Assignee Title
WO2021053775A1 (en) * 2019-09-18 2021-03-25 日本電信電話株式会社 Learning device, estimation device, learning method, estimation method, and program
JPWO2021053775A1 (en) * 2019-09-18 2021-03-25
JP7251642B2 (en) 2019-09-18 2023-04-04 日本電信電話株式会社 Learning device, estimation device, learning method, estimation method and program
KR102151272B1 (en) * 2020-01-07 2020-09-02 한국토지주택공사 Method, apparatus and computer program for analyzing data using prediction model

Also Published As

Publication number Publication date
JP6695431B2 (en) 2020-05-20
JPWO2018042606A1 (en) 2019-06-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16915166; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2018536626; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 16915166; Country of ref document: EP; Kind code of ref document: A1)