WO2016148107A1 - Data processing device, data processing method, and data processing program - Google Patents

Data processing device, data processing method, and data processing program

Info

Publication number
WO2016148107A1
Authority
WO
WIPO (PCT)
Prior art keywords
explanatory variable, explanatory variables, selection
Prior art date
Application number
PCT/JP2016/057992
Other languages
French (fr)
Japanese (ja)
Inventor
一夫 石井
利紀 古崎
哲郎 大森
周助 沼田
Original Assignee
Tokyo University of Agriculture and Technology (国立大学法人東京農工大学)
Tokushima University (国立大学法人徳島大学)
Priority date
Filing date
Publication date
Application filed by Tokyo University of Agriculture and Technology (国立大学法人東京農工大学) and Tokushima University (国立大学法人徳島大学)
Publication of WO2016148107A1 publication Critical patent/WO2016148107A1/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00 Subject matter not provided for in other groups of this subclass
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention relates to a data processing device, a data processing method, and a data processing program.
  • Patent Document 1: JP 2003-4739
  • Patent Document 2: JP 2002-528095
  • Patent Document 3: JP 2011-248789
  • A data processing device is provided for identifying a cause explanatory variable set, that is, a set of at least one explanatory variable that causes a predetermined event, from among a plurality of explanatory variables.
  • The device comprises: an acquisition unit that acquires a plurality of sample data in which the values of a plurality of explanatory variables are associated with the occurrence or non-occurrence of an event;
  • an explanatory variable selection unit that repeatedly and randomly selects a set of selected explanatory variables from the plurality of explanatory variables, without depending on the sets selected previously;
  • a learning processing unit that, for each of a plurality of sets of selected explanatory variables, learns a prediction model that predicts the occurrence of the event from the values of the selected explanatory variables, based on the plurality of sample data;
  • a model selection unit that preferentially selects, from the plurality of prediction models corresponding to the different sets of selected explanatory variables, a prediction model with a higher evaluation; and a determination unit that determines the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit as the cause explanatory variable set.
  • A block diagram of the data processing apparatus 10 of this embodiment is shown.
  • the processing flow by the data processing apparatus 10 of this embodiment is shown.
  • An example of the sample data concerning this embodiment is shown.
  • An example of the set of explanatory variables according to the present embodiment is shown.
  • The selection probability with which the initialization unit 122 selects each set of selected explanatory variables is shown.
  • An example of the occurrence probability of an event for each gene is shown.
  • An example of the processing flow of the learning of the prediction model by multivariate analysis is shown.
  • An example of the multidimensional space that the learning processing unit 130 generates is shown.
  • An example of the processing flow of the learning of the prediction model by the maximum likelihood estimation method is shown.
  • An example of the processing flow of learning of the prediction model by the Bayes method is shown.
  • An example of discrimination by the maximum likelihood estimation method or the Bayes method is shown.
  • An example of a method for generating a set of selected explanatory variables by the Markov chain Monte Carlo method is shown.
  • A modification of the processing flow of S200 by the explanatory variable selection unit 120 is shown.
  • a modification of the generation of a set of selected explanatory variables by the Markov chain Monte Carlo method by the generation unit 124 is shown.
  • The parallel processing apparatus 12 according to a modification of this embodiment that implements parallel processing is shown.
  • An example of the effect of learning by this embodiment is shown.
  • An example of the hardware configuration of a computer 1900 is shown.
  • FIG. 1 shows a block diagram of the data processing apparatus 10 of this embodiment.
  • the data processing apparatus 10 of this embodiment specifies a cause factor set that is a set of at least one explanatory variable that causes a predetermined event from a plurality of explanatory variables.
  • the data processing device 10 identifies a set of at least one gene that is an expression factor of an event such as a disease as a cause factor set from among a plurality of genes.
  • the data processing apparatus 10 includes an acquisition unit 110, an explanatory variable selection unit 120, a learning processing unit 130, a model selection unit 140, and a determination unit 150.
  • The acquisition unit 110 acquires a plurality of sample data in which the values of a plurality of explanatory variables are associated with the occurrence or non-occurrence of an event. For example, the acquisition unit 110 acquires, from the database 20, sample data on a plurality of subjects in which gene-related values (for example, the presence or absence, modification, or expression level of a specific structural gene) are associated with the presence or absence of a disease. The acquisition unit 110 provides the acquired sample data to the explanatory variable selection unit 120.
  • the explanatory variable selection unit 120 selects at least one set of selected explanatory variables from a plurality of explanatory variables.
  • For example, the explanatory variable selection unit 120 selects, from the plurality of genes held by the subjects in the sample data, a set of a predetermined number of genes as the set of selected explanatory variables, using a predetermined method (for example, the bootstrap method or the Markov chain Monte Carlo method).
  • the explanatory variable selection unit 120 includes an initialization unit 122 and a generation unit 124.
  • The initialization unit 122 determines the set of selected explanatory variables that the explanatory variable selection unit 120 selects in the initial stage of the search for the cause explanatory variables. For example, the initialization unit 122 includes in the initial set of selected explanatory variables randomly chosen explanatory variables, explanatory variables that occur frequently in the plurality of sample data, or explanatory variables with a high degree of contribution to the occurrence or non-occurrence of the event in the plurality of sample data (for example, a low Wilks' lambda statistic, high sensitivity or specificity, and/or a low Akaike Information Criterion (AIC) statistic).
  • The generation unit 124 generates the sets of selected explanatory variables that the explanatory variable selection unit 120 selects after the initial stage of the search. For example, the generation unit 124 sequentially generates sets of selected explanatory variables from the initial set using a Markov chain Monte Carlo method. In this way, the generation unit 124 generates each new set with a combination close to that of the previously selected set.
  • the initialization unit 122 and the generation unit 124 may determine or generate a set of selected explanatory variables based on the evaluation of the prediction model obtained from the previously generated set of selected explanatory variables. Details of processing by the explanatory variable selection unit 120 will be described later.
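The generation unit's behavior of producing each new set close to the previous one can be sketched as a Markov-chain-style proposal that swaps a single gene. This is an illustrative assumption: the patent does not fix the proposal mechanism, and the function name `propose_next_set` and the swap-one-element rule are not taken from the source.

```python
import random

def propose_next_set(current_set, all_genes, rng=random):
    """Generate a neighboring set of selected explanatory variables by
    swapping one gene, in the style of a Markov chain Monte Carlo
    proposal. The new set differs from the previous set in exactly one
    element, so successive sets stay close in combination space."""
    current = list(current_set)
    # Pick one gene in the set to drop and one gene outside the set to add.
    drop = rng.randrange(len(current))
    candidates = [g for g in all_genes if g not in current]
    current[drop] = rng.choice(candidates)
    return set(current)
```

A call such as `propose_next_set({"g1", "g2", "g3"}, gene_pool)` returns a set sharing all but one gene with its predecessor, which is what lets the search explore combinations incrementally rather than jumping at random.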
  • the explanatory variable selection unit 120 supplies the set of selected explanatory variables to the learning processing unit 130.
  • The learning processing unit 130 learns, based on the plurality of sample data, a prediction model that predicts the occurrence of the event from the values of the selected explanatory variables in the set. For example, the learning processing unit 130 learns a prediction model that predicts the presence or absence of a disease from the presence or absence of the genes in the set of selected explanatory variables for each subject in the sample data. As a result, the learning processing unit 130 obtains one prediction model per set of selected explanatory variables. The specific contents of the learning process of the learning processing unit 130 will be described later.
  • the learning processing unit 130 generates an evaluation on the prediction accuracy of event occurrence by each prediction model, and supplies this to the explanatory variable selection unit 120. For example, the learning processing unit 130 compares the result of predicting the occurrence of an event from sample data based on each prediction model with the presence or absence of an actual event, and generates an evaluation from the result. The learning processing unit 130 supplies the model selection unit 140 with a set and evaluation of selected explanatory variables corresponding to the learned prediction model.
  • The model selection unit 140 preferentially selects a prediction model with a higher evaluation from among the plurality of prediction models that the learning processing unit 130 learned for the different sets of selected explanatory variables. For example, the model selection unit 140 selects the prediction model with the highest evaluation. The model selection unit 140 supplies the selected prediction model to the determination unit 150.
  • the determining unit 150 determines a set of selected explanatory variables corresponding to the prediction model selected by the model selecting unit 140 as a cause explanatory variable set. As a result, the determination unit 150 can preferentially identify the cause of the event as a set of selected explanatory variables that give a prediction model that has a high evaluation, that is, predicts the occurrence of the event with high accuracy.
  • In this way, the data processing apparatus 10 selects sets of selected explanatory variables from the plurality of candidate explanatory variables in the sample data by the Markov chain Monte Carlo method or the like, learns a prediction model for each selected set, and identifies the set of selected explanatory variables corresponding to the highly evaluated prediction model as the cause explanatory variable set.
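The overall search just summarized can be sketched end to end. This is a minimal illustration only: the function name `find_cause_set` is an assumption, and a deliberately toy prediction rule (predict the event if any selected gene is expressed) stands in for the learned prediction model described in the patent.

```python
import random

def find_cause_set(sample_X, sample_y, gene_names, set_size=3, n_trials=100, seed=0):
    """Repeatedly select a random set of explanatory variables, fit a
    simple prediction rule, evaluate it against the sample data, and
    keep the set whose model receives the highest evaluation."""
    rng = random.Random(seed)
    best_score, best_set = -1.0, None
    for _ in range(n_trials):
        chosen = rng.sample(range(len(gene_names)), set_size)
        # Toy stand-in for the learned prediction model:
        # predict the event if any selected gene is expressed.
        correct = sum(
            (1 if any(x[j] for j in chosen) else 0) == t
            for x, t in zip(sample_X, sample_y)
        )
        score = correct / len(sample_y)
        if score > best_score:
            best_score, best_set = score, [gene_names[j] for j in chosen]
    return best_set, best_score
```

With data in which one gene perfectly tracks the event, the loop converges on the set containing that gene, mirroring how the model selection unit singles out the set behind the best-evaluated model.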
  • FIG. 2 shows a processing flow by the data processing apparatus 10 of the present embodiment.
  • The data processing apparatus 10 specifies the cause explanatory variable set by executing the processes of S120 to S240.
  • sample data in which each value of a plurality of explanatory variables is associated with whether or not an event has occurred is acquired for a plurality of objects.
  • the acquisition unit 110 acquires, from the database 20, sample data that associates the presence or absence of specific gene expression with the presence or absence of a specific disease such as colorectal cancer for a plurality of subjects.
  • FIG. 3 shows an example of sample data.
  • The acquisition unit 110 acquires, as sample data for each of M subjects, information indicating whether each of n (for example, millions of) genes g1 to gn is expressed (1 if expressed, 0 if not), associated with information on a biological event such as the presence or absence of a disease (1 if the disease occurred, 0 if not).
  • In the sample data shown in FIG. 3, subject 1 expressed gene g1 and gene g2, did not express gene g3, ..., expressed gene gn, and has no disease;
  • subject 2 did not express gene g1, expressed gene g2 and gene g3, ..., expressed gene gn, and has the disease;
  • subject M expressed gene g1 and gene g2, did not express gene g3, ..., did not express gene gn, and has no disease.
  • Instead of or in addition to the presence or absence of gene expression, the acquisition unit 110 may acquire information on gene expression levels, positional information on gene sequence polymorphisms, the frequency of gene mutations, the type and site of gene modifications, and/or the degree of gene modification. The acquisition unit 110 may also acquire information related to gene expression: for example, the amount of protein produced by gene expression and translation, the type, site, and/or degree of modification of the transcript or protein, and the amounts of metabolites produced as a result of the functional expression of the produced protein.
  • Such metabolites include, for example, (a) lipids, carbohydrates, vitamins, amino acids, nucleic acids, other alcohols, organic acids or their esters, other amines, or other organic compounds; (b) minerals, ions, or other inorganic compounds (nitrogen compounds, sulfur compounds, phosphorus-containing compounds, halogen compounds, etc.) or their ions; or (c) complexes or degradation products thereof.
  • Sample data including such quantity information may be acquired.
  • the acquisition unit 110 provides the acquired sample data to the explanatory variable selection unit 120.
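The sample-data layout of FIG. 3 can be encoded as a simple matrix. The specific values below are a hypothetical reduction of the figure to four genes; the variable names `X` and `y` are illustrative, not from the patent.

```python
# Hypothetical encoding of the sample data of FIG. 3: one row per subject,
# one column per gene (1 = expressed, 0 = not expressed); y records the
# biological event (1 = disease occurred, 0 = not).
X = [
    [1, 1, 0, 1],  # subject 1: g1, g2, gn expressed; g3 not; no disease
    [0, 1, 1, 1],  # subject 2: g2, g3, gn expressed; g1 not; disease
    [1, 1, 0, 0],  # subject M: g1, g2 expressed; g3, gn not; no disease
]
y = [0, 1, 0]
```

Each row of `X` paired with the matching entry of `y` corresponds to one sample-data record that the acquisition unit 110 passes to the explanatory variable selection unit 120.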
  • the initialization unit 122 determines an initial set of selected explanatory variables.
  • For example, the initialization unit 122 may determine an initial set of selected explanatory variables consisting of a predetermined number s of explanatory variables selected at random, with equal probability, from all explanatory variables in the sample data.
  • The initialization unit 122 may preferentially include explanatory variables that occur frequently in the plurality of sample data in the initial set of selected explanatory variables. For example, the initialization unit 122 determines genes that are frequently expressed by the subjects in the sample data (or in nature) as initial selected explanatory variables. As an example, the initialization unit 122 may select the s most frequent genes, or may select s mutually distinct genes with selection probabilities corresponding to their frequencies, as the initial set of selected explanatory variables.
  • The initialization unit 122 may select the set of selected explanatory variables from the plurality of explanatory variables as a whole. Alternatively, it may extract in advance, from the plurality of explanatory variables, the subset whose degree of contribution to the occurrence or non-occurrence of the event satisfies a predetermined criterion and/or whose frequency of occurrence among subjects is higher than a predetermined criterion, and select the set of selected explanatory variables from that extracted subset.
  • For example, the initialization unit 122 may extract, from the plurality of explanatory variables, only those whose degree of contribution to the occurrence or non-occurrence of the event, as judged by a t-test on each variable alone, is higher than average, and select s genes from the extracted explanatory variables. Alternatively, the initialization unit 122 may select s genes after excluding, from the plurality of explanatory variables, those corresponding to genes whose expression frequency is lower than the occurrence frequency of the event.
  • FIG. 4 shows an example with s = 3, in which set G1 includes g10, g41, and g301; another set includes g41, g301, and g282; and set GN includes ga, gb, gc, and gd (a, b, c, d ≤ n).
  • FIG. 5 shows an example of the selection probability with which the initialization unit 122 selects each set as the initial set of selected explanatory variables.
  • The horizontal axis of the graph arranges the N sets of explanatory variables G1 to GN, and the vertical axis represents the selection probability Ps corresponding to each set.
  • The initialization unit 122 selects each set of explanatory variables with a probability corresponding to its selection probability Ps.
  • Each set of explanatory variables is arranged in order of frequency. That is, the set Gx arranged leftmost in the graph is the combination of genes that appears (or is expected to appear) most frequently in the sample data (or in nature); the set Gy arranged to the right of Gx is the combination that appears next most frequently after Gx; ...; and the set Gz arranged rightmost is the combination of genes that appears (or is expected to appear) with the lowest frequency in the sample data (or in nature).
  • The number of explanatory variables included in each set arranged in FIG. 5 may be a single fixed value (for example, three) or may vary across sets (for example, from two to five).
  • The selection probability Ps corresponding to each set of explanatory variables in the graph may be a value whose magnitude corresponds to the frequency of that combination of explanatory variables.
  • Further, the initialization unit 122 may include, in the initial set of selected explanatory variables, explanatory variables that individually satisfy a predetermined criterion of contribution to the occurrence of the event in the plurality of sample data.
  • the initialization unit 122 calculates an event occurrence rate for each explanatory variable from the sample data.
  • For example, the initialization unit 122 calculates the degree of contribution of an explanatory variable (gene g1) to the occurrence or non-occurrence of the event (disease), that is, the incidence of disease due to the expression of gene g1, as 0.11%.
  • FIG. 6 shows an example of the occurrence probability of an event for each gene.
  • The disease incidence of subjects having gene g1 is 0.11%;
  • that of subjects having gene g2 is 0.15%;
  • that of subjects having gene g3 is 0.73%; and that of subjects having gene gn is 0.02%.
  • The initialization unit 122 determines genes with a high degree of contribution to the occurrence of disease as initial selected explanatory variables. As an example, the initialization unit 122 may select the s genes with the highest contribution to the presence or absence of disease, or may select s mutually distinct genes with selection probabilities according to their degrees of contribution, as the initial set of selected explanatory variables.
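Selecting s distinct genes with probability proportional to their contribution can be sketched as weighted sampling without replacement. The function name `initial_selection` and the dictionary interface are illustrative assumptions; the contribution values below are the per-gene disease incidences from FIG. 6.

```python
import random

def initial_selection(contributions, s, rng=random):
    """Pick s mutually distinct explanatory variables, each chosen with
    probability proportional to its contribution (e.g., the per-gene
    disease incidence computed from the sample data). Chosen genes are
    removed from the pool so the result has no duplicates."""
    genes = list(contributions)
    chosen = []
    for _ in range(s):
        weights = [contributions[g] for g in genes]
        total = sum(weights)
        r = rng.random() * total
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                chosen.append(genes.pop(i))
                break
    return chosen
```

With the FIG. 6 incidences, gene g3 (0.73%) is by far the most likely first pick, matching the intent of favoring high-contribution genes in the initial set.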
  • The learning processing unit 130 learns, based on the plurality of sample data, a prediction model that predicts the occurrence of the event from the value of each selected explanatory variable in the set determined in the immediately preceding process.
  • For example, the learning processing unit 130 learns a prediction model that predicts the presence or absence of a disease from the presence or absence of gene expression in the set of selected explanatory variables of each subject in the sample data.
  • The learning processing unit 130 learns the prediction model based on multivariate analysis (such as discriminant analysis or multiple regression analysis), machine learning (such as self-organizing maps, support vector machines, or deep learning), the maximum likelihood estimation method, or the Bayesian method.
  • FIG. 7 shows an example of a processing flow for learning a prediction model by multivariate analysis.
  • The learning processing unit 130 may execute the processing of S160 using multivariate analysis such as multiple regression analysis, principal component analysis, or cluster analysis, by executing the processes of S162 to S172.
  • For example, the learning processing unit 130 plots the subject of each sample data record in a three-dimensional space whose axes are the expression levels of the 10th, 23rd, and 45th genes.
  • When a subject's expression level of the 10th gene g10 is 0.1, that of gene g23 is 0.3, and that of gene g45 is 0.2, the subject is plotted at the point (0.1, 0.3, 0.2) in the three-dimensional space.
  • The learning processing unit 130 inputs the coordinates corresponding to each subject's plot into a first linear function f1(x), obtains a plurality of output values corresponding to the subjects, and optimizes the coefficients aij corresponding to each gene xij so that the variance of the output values is maximized.
  • The learning processing unit 130 then generates a second linear function f2(x) different from the first linear function f1(x). For example, the learning processing unit 130 optimizes the coefficients aij corresponding to each gene xij so that the variance of the output values becomes the second largest after that of the first linear function f1(x).
  • The learning processing unit 130 may further generate a third linear function f3(x) and subsequent functions in decreasing order of output variance. In this way, the learning processing unit 130 generates a predetermined number of linear functions in decreasing order of the variance of their output values.
  • From among the plurality of generated functions, the learning processing unit 130 selects at least one function to be used to determine whether the event occurred in the plurality of sample data.
  • For example, the learning processing unit 130 determines a combination of functions that can most clearly determine the boundary between occurrence and non-occurrence of the event when each subject is plotted in a multidimensional space whose axes are the output values of the functions.
  • As an example, the learning processing unit 130 may calculate the correlation coefficient between the output value of each linear function for each subject and whether (or to what degree) the event occurred in that subject, and select one or more functions whose correlation coefficients have large absolute values. In this description, for convenience, it is assumed that the learning processing unit 130 selects the first linear function f1(x) and the second linear function f2(x) generated in S162.
  • The learning processing unit 130 generates a multidimensional space whose dimensions correspond to the values of the selected functions. For example, the learning processing unit 130 inputs the selected explanatory variable genes of the subjects in the sample data into each selected function and plots the resulting values in a coordinate space with one axis per function.
  • FIG. 8 shows an example of a multidimensional space generated by the learning processing unit 130.
  • the learning processing unit 130 generates a two-dimensional space using two functions.
  • Each plot in the graph corresponds to each subject in the sample data.
  • The learning processing unit 130 plots each subject at the point (z1, z2) in the two-dimensional space, where z1, the component on axis LD1, is the output value obtained by inputting the subject's selected explanatory variable expression levels into the first function f1(x), and z2, the component on axis LD2, is the output value obtained by inputting them into the second function f2(x).
  • Dotted and solid circles (○) indicate subjects who did not develop the disease (that is, healthy persons), and dotted and solid crosses (X) indicate subjects who have the disease (that is, non-healthy persons).
  • The learning processing unit 130 generates a discriminant function that predicts whether the event has occurred from the selected explanatory variables. For example, the learning processing unit 130 generates the discriminant function that most accurately discriminates healthy persons (no disease) from non-healthy persons (with disease) based on various discrimination methods such as linear discrimination, quadratic discrimination, self-organizing maps, and support vector machines.
  • FIG. 8 shows a case where the learning processing unit 130 generates the linear discriminant function TH.
  • the subject plotted on the upper side of the linear discriminant function TH shown in FIG. 8 is predicted as a non-healthy person, and the subject plotted on the lower side is predicted as a healthy person.
  • By generating the discriminant function, the learning processing unit 130 learns a prediction model that predicts the occurrence of the event based on position in the multidimensional space. For example, because the learning processing unit 130 selected the first function f1(x) and the second function f2(x), it can determine a clear boundary TH that separates healthy persons from non-healthy persons in the two-dimensional space whose axes are the output values of these functions.
  • the learning processing unit 130 evaluates the prediction model using the learned discriminant function. For example, the learning processing unit 130 evaluates the prediction accuracy of the occurrence of the event by the discriminant function.
  • the dotted line ⁇ in FIG. 8 indicates a subject whose disease is predicted by the discriminant function but actually has no disease, and the solid line ⁇ indicates a subject who is not predicted by the discriminant function and actually has no disease.
  • the dotted line X indicates a subject who has not actually been diagnosed with a discriminant function but actually has a disease, and the solid line X indicates a subject who has been predicted to have a disease with the discriminant function and actually has a disease.
  • Using the discriminant functions generated by the various discrimination methods, the learning processing unit 130 calculates, as the evaluation of each discriminant function, at least one of the sensitivity (the proportion of solid crosses among all crosses in FIG. 8) and the specificity (the proportion of solid circles among all circles), or the average of both, when the presence or absence of disease is predicted from the gene expression of the subjects in the sample data.
  • the evaluation of the discriminant function is the evaluation of the prediction model corresponding to the discriminant function.
  • In this way, by plotting the sample data in a multidimensional space generated from functions that maximize the variance, the learning processing unit 130 learns a prediction model that improves the discriminability of occurrence or non-occurrence of the event based on the plurality of selected explanatory variables.
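The sensitivity/specificity evaluation of S172 can be made concrete. This is a minimal sketch; the function name `sensitivity_specificity` is an assumption, but the definitions follow the figure legend: sensitivity is the fraction of diseased subjects correctly predicted diseased (solid crosses among all crosses), and specificity is the fraction of healthy subjects correctly predicted healthy (solid circles among all circles).

```python
def sensitivity_specificity(y_true, y_pred):
    """Evaluate a discriminant function as in S172.
    y_true: actual event occurrence per subject (1 = disease, 0 = none).
    y_pred: predicted occurrence per subject.
    Returns (sensitivity, specificity, their average)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return sens, spec, (sens + spec) / 2
```

The returned average of sensitivity and specificity corresponds to the "average of both" option the patent names as a possible evaluation of the discriminant function.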
  • FIG. 9 shows an example of a processing flow for learning a prediction model by the maximum likelihood estimation method.
  • the learning processing unit 130 may execute the process of S160 using the maximum likelihood estimation method by executing the processes of S262 to S266.
  • The learning processing unit 130 generates a likelihood function. For example, based on the sample data, the learning processing unit 130 calculates a likelihood function lik(θ) = f_D(x_i | θ), which takes as input the variable x_i indicating whether or to what degree gene g_i in the set of selected explanatory variables is expressed, and outputs the likelihood that the event occurs given the parameter θ.
  • The learning processing unit 130 determines whether the event occurred for each subject in the sample data based on the likelihood function. For example, the learning processing unit 130 inputs the genes included in each subject's set of selected explanatory variables into the corresponding likelihood function and calculates the likelihood that the disease occurs in that subject. The learning processing unit 130 determines that the disease does not occur when the likelihood falls below a predetermined standard (for example, 0.5), and that it occurs when the likelihood is at or above the standard.
  • the learning processing unit 130 evaluates the accuracy of the likelihood function. For example, the learning processing unit 130 compares the occurrence / non-occurrence of each subject's event in the sample data with the result of predicting the occurrence / non-occurrence of each subject's event using the likelihood function. As an example, the learning processing unit 130 may evaluate the likelihood function based on sensitivity and specificity as in the processing in S172.
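The maximum likelihood step can be sketched with the simplest workable model: fit per-gene Bernoulli expression probabilities by maximum likelihood (the ML estimate of a Bernoulli parameter is the sample mean), separately for diseased and healthy subjects, then threshold a normalized likelihood at 0.5 as in S264. The class-conditional Bernoulli model and both function names are illustrative assumptions, not the patent's specified model.

```python
def fit_bernoulli_mle(X, y):
    """Maximum likelihood estimates of per-gene expression probabilities,
    fit separately for diseased (y=1) and healthy (y=0) subjects."""
    def means(rows):
        n = len(rows)
        return [sum(col) / n for col in zip(*rows)]
    return (means([x for x, t in zip(X, y) if t == 1]),
            means([x for x, t in zip(X, y) if t == 0]))

def disease_likelihood(x, p_dis, p_hea):
    """Normalized likelihood that the subject's gene pattern x arose from
    the disease model; predict disease when this is >= 0.5."""
    def lik(p):
        out = 1.0
        for xi, pi in zip(x, p):
            out *= pi if xi else (1.0 - pi)
        return out
    ld, lh = lik(p_dis), lik(p_hea)
    return ld / (ld + lh) if (ld + lh) else 0.5
```

Evaluating `disease_likelihood` over all subjects and comparing against the actual outcomes gives exactly the sensitivity/specificity evaluation described for S266.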
  • FIG. 10 shows an example of a processing flow for learning a prediction model by the Bayesian method.
  • the learning processing unit 130 may execute the process of S160 using the Bayesian method by executing the processes of S362 to S366.
  • The learning processing unit 130 calculates the posterior probability that each subject in the sample data develops the event. For example, the learning processing unit 130 may calculate the posterior probability from the product of the prior probability and the likelihood, using as the prior probability the frequency of the set of explanatory variables described in FIG. 5 or the product of the occurrence probabilities of the selected explanatory variables described in FIG. 6, and using the likelihood function generated as described above as the likelihood. In addition, the learning processing unit 130 may calculate the posterior probability by a sampling algorithm based on a Markov chain Monte Carlo method such as the Metropolis-Hastings method.
  • the learning processing unit 130 determines whether an event has occurred. For example, the learning processing unit 130 inputs the prior probability based on the genes included in the set of explanatory explanatory variables of each subject and the genes included in the set of explanatory explanatory variables of each subject to the corresponding likelihood function. Based on the above, the posterior probability distribution is calculated. The learning processing unit 130 determines that the disease is present when the posterior probability (for example, the average value, median value, or mode value of the posterior probability in the posterior probability distribution) is lower than a predetermined criterion (for example, 0.5). It is determined that the disease does not occur, and it is determined that the disease occurs when the posterior probability is equal to or higher than the reference.
  • the learning processing unit 130 evaluates the accuracy of the posterior probability. For example, the learning processing unit 130 compares the occurrence / non-occurrence of each subject's event in the sample data with the result of predicting the occurrence / non-occurrence of each subject by the posterior probability. As an example, the learning processing unit 130 may evaluate the likelihood function based on sensitivity and specificity as in the processing in S172.
  • FIG. 11 shows an example of discrimination by the maximum likelihood estimation method or the Bayes method.
  • the x1 axis and x2 axis of the graph correspond to the expression levels of the two genes when the set of selected explanatory variables includes two genes, and the z axis corresponds to the likelihood in the maximum likelihood estimation method or the posterior probability in the Bayes method. Each plot in the graph corresponds to one subject in the sample data.
  • a dotted or solid ○ indicates a subject who did not develop the disease (that is, a healthy person), and a dotted or solid X indicates a subject who developed the disease (that is, a non-healthy person).
  • the learning processing unit 130 predicts a subject as a non-healthy person when the likelihood or the posterior probability for that subject is greater than or equal to a threshold value TH (for example, 0.5), and predicts the subject as a healthy person when it is less than the threshold value TH.
  • the dotted ○ indicates a subject who was predicted by the discriminant function to have the disease but actually did not have the disease
  • the solid ○ indicates a subject who was not predicted by the discriminant function to have the disease and actually did not have the disease
  • the dotted X indicates a subject who was not predicted by the discriminant function to have the disease but actually had the disease
  • the solid X indicates a subject who was predicted by the discriminant function to have the disease and actually had the disease.
  • the learning processing unit 130 may calculate, as the evaluation, at least one of the sensitivity (the ratio of solid X marks among all X marks in FIG. 11) and the specificity (the ratio of solid ○ marks among all ○ marks) of predicting the presence or absence of the disease from the genes of the plurality of subjects in the sample data using the likelihood or the posterior probability.
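  • A minimal sketch of this evaluation, assuming boolean disease labels and the 0.5 threshold on the likelihood/posterior described above (the data values below are invented purely for illustration):

```python
def sensitivity_specificity(actual, predicted):
    """actual / predicted: booleans per subject (True = diseased).
    Sensitivity = correctly predicted diseased / all diseased subjects;
    specificity = correctly predicted healthy / all healthy subjects."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    return tp / (tp + fn), tn / (tn + fp)

# Threshold the likelihood / posterior probability at TH = 0.5 as in the text.
posterior = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1]
actual = [True, True, True, False, False, False]
predicted = [p >= 0.5 for p in posterior]
sens, spec = sensitivity_specificity(actual, predicted)
```

Here the third subject is a false negative (the dotted X of FIG. 11) and the fifth is a false positive (the dotted ○), so both sensitivity and specificity come out to 2/3.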
  • the learning processing unit 130 learns a prediction model corresponding to the set of selected explanatory variables using discriminant analysis, machine learning (self-organizing maps, support vector machines, deep learning, etc.), the maximum likelihood estimation method, the Bayesian method, or the like.
  • the learning processing unit 130 may learn the learning model using multivariate analysis such as multiple regression analysis, principal component analysis, and cluster analysis in addition to / in place of these.
  • the learning processing unit 130 supplies the model selection unit 140 with a set of selected explanatory variables corresponding to the prediction model and the evaluation of each prediction model.
  • the explanatory variable selection unit 120 determines whether to continue selecting sets of selected explanatory variables. For example, the explanatory variable selection unit 120 terminates the selection of sets of selected explanatory variables and advances the process to S220 on condition that the learning process of S160 has been executed a predetermined number of times and/or that a prediction model whose evaluation in S160 is equal to or higher than a criterion has been learned. If the explanatory variable selection unit 120 does not terminate the selection, it advances the process to S200.
  • the explanatory variable selection unit 120 causes the initialization unit 122 or the generation unit 124 to select at least one set of selected explanatory variables from among a plurality of explanatory variables.
  • the initialization unit 122 of the explanatory variable selection unit 120 may select a set of selected explanatory variables including a plurality of (for example, s) selected explanatory variables at random by the same processing as S140.
  • the initialization unit 122 repeatedly selects a set of selected explanatory variables in the loop of S160 to S200.
  • the initialization unit 122 may randomly select a set of selected explanatory variables without depending on a previously selected set of selected explanatory variables in each selection of S200 in the iteration.
  • the initialization unit 122 may select each set of selected explanatory variables independently of the sets selected in the past; it may allow a previously selected set to be selected again by adopting the bootstrap method, or may exclude previously selected sets from re-selection by adopting the jackknife method.
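  • The bootstrap-style and jackknife-style selections described above could be sketched as follows; the function names and the gene list are hypothetical, not taken from the embodiment:

```python
import random

def select_bootstrap(variables, s, rng):
    """Each draw is independent of past draws, so a previously selected
    set may be selected again (bootstrap-style)."""
    return frozenset(rng.sample(variables, s))

def select_jackknife(variables, s, seen, rng, max_tries=10000):
    """Each draw excludes sets already selected in the past (jackknife-style)."""
    for _ in range(max_tries):
        cand = frozenset(rng.sample(variables, s))
        if cand not in seen:
            seen.add(cand)
            return cand
    raise RuntimeError("could not find an unseen set")

rng = random.Random(0)
genes = [f"g{i}" for i in range(1, 11)]  # hypothetical explanatory variables
seen = set()
sets = [select_jackknife(genes, 3, seen, rng) for _ in range(5)]
```

With the jackknife variant every returned set is distinct; the bootstrap variant simply omits the `seen` bookkeeping.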
  • the generation unit 124 of the explanatory variable selection unit 120 may generate a set of selected explanatory variables. For example, the generation unit 124 may sequentially generate a set of selected explanatory variables from an initial set of selected explanatory variables using a Markov chain Monte Carlo method. Thereby, the generation unit 124 generates a set of selected explanatory variables that is close to the previously selected set of selected explanatory variables.
  • FIG. 12 shows an example of a method for generating a set of selected explanatory variables by the Markov chain Monte Carlo method.
  • the horizontal axis of the graph shows N sets of explanatory variables G 1 to G N arranged side by side.
  • each explanatory variable set is arranged in the order of the combination.
  • the set of explanatory variables related to G 1 includes genes g 1 , g 2, and g 3
  • the set of explanatory variables for G 2 , adjacent to G 1 , may include genes g 1 , g 2, and g 4 , in which only gene g 3 has been replaced by the nearby gene g 4 .
  • a plurality of explanatory variable sets may be arranged on the horizontal axis so that the distance between sets corresponds to their similarity, based on the similarity of the sets of genes (for example, edit distance) and/or the similarity of the genes themselves.
  • the number of explanatory variables included in each set arranged in FIG. 12 may be uniform (for example, three for every set) or may vary between sets (for example, two and three).
  • a plurality of sets are arranged one-dimensionally for explanation, but a plurality of sets of explanatory variables may be arranged multidimensionally.
  • G i indicates the set of selected explanatory variables chosen in the immediately preceding selection (S140 or the previous execution of S200), and the vertical axis indicates the selection probability P s with which the generation unit 124 selects each set of explanatory variables. That is, in S200, the generation unit 124 selects a set of explanatory variables with a probability corresponding to the selection probability P s .
  • the set having the combination of explanatory variables close to the previously selected set G i has the highest selection probability P s , and the selection probability P s gradually decreases according to the distance from G i .
  • FIG. 12 shows a probability distribution that is a normal distribution with its peak at G i .
  • the generation unit 124 generates a set of selected explanatory variables having a combination of explanatory variables close to the previously selected set. Note that the generation unit 124 may not select the set G i selected last time or the set that has been selected in the past.
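  • One hedged reading of this generation step, with the sets arranged on an axis and the next index drawn from a roughly normal step around the current one; the arrangement and all names below are invented for illustration:

```python
import random

def propose_next_set(arranged_sets, current, sd, rng):
    """Choose the index of the next set with probability decaying
    (approximately normally) with distance from the current set G_i
    on the axis; the current set itself is not re-selected."""
    n = len(arranged_sets)
    while True:
        step = round(rng.gauss(0.0, sd))
        j = current + step
        if 0 <= j < n and j != current:
            return j

# Hypothetical arrangement: adjacent sets differ by one nearby gene.
arranged = [("g1", "g2", "g3"), ("g1", "g2", "g4"), ("g1", "g2", "g5"),
            ("g1", "g3", "g4"), ("g1", "g3", "g5")]
rng = random.Random(1)
j = propose_next_set(arranged, current=0, sd=1.5, rng=rng)
next_set = arranged[j]
```

Small steps are most probable, so the proposed set tends to share most of its genes with the previous one, mirroring the normal-shaped selection probability of FIG. 12.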
  • FIG. 13 shows a modification of the processing flow of S200 by the explanatory variable selection unit 120.
  • the explanatory variable selection unit 120 executes the processing of S202 to S206, so that either the initialization unit 122 or the generation unit 124 selects the selected explanatory variable according to the previous evaluation of the set of selected explanatory variables. Select a set.
  • the explanatory variable selection unit 120 determines whether the evaluation of the previously selected set of selected explanatory variables is less than a predetermined criterion. For example, the explanatory variable selection unit 120 determines whether the evaluation generated in the process of S172, S266, or S366 for the set of selected explanatory variables generated in the process of S140 or S200 is less than the criterion. The explanatory variable selection unit 120 advances the process to S204 when the evaluation is determined to be less than the criterion, and advances the process to S206 otherwise.
  • instead of judging only the evaluation of the previously selected set of selected explanatory variables, the explanatory variable selection unit 120 may determine whether the most recent predetermined number of consecutive evaluations are all less than the criterion. For example, the explanatory variable selection unit 120 may advance the process to S204 when the evaluations of the ten most recently generated sets of selected explanatory variables are all lower than the criterion.
  • the initialization unit 122 newly selects a set of initial selection explanatory variables.
  • the initialization unit 122 may determine a new initial set of explanatory explanatory variables at random by executing the same processing as in S140.
  • the initialization unit 122 may randomly select a new set of selected explanatory explanatory variables by the bootstrap method or the jackknife method without depending on the set of selected explanatory explanatory variables selected in the past.
  • the generation unit 124 may sequentially generate a set of selected explanatory variables from the initial set of selected explanatory variables using the Markov chain Monte Carlo method. For example, the generation unit 124 may generate a set of selected explanatory variables by the method described in FIG.
  • the initialization unit 122 determines a new initial set of selected explanatory variables on condition that the evaluation of the prediction model learned according to the set selected by the initialization unit 122 is less than the criterion and the evaluation of the prediction model learned according to the set generated by the generation unit 124 is also less than the criterion.
  • the explanatory variable selection unit 120 thus resets the set of selected explanatory variables used as the starting point of the Markov chain Monte Carlo method when a set of selected explanatory variables with a high evaluation cannot be obtained.
  • the explanatory variable selection unit 120 thereby determines that there is no prospect of finding a good set of explanatory variables in that region, starts a search in another region, and streamlines the search for a set of explanatory variables.
  • the generation unit 124 further generates a set of selection explanatory variables sequentially from a new initial selection explanatory variable set.
  • the data processing apparatus 10 can continue searching for and evaluating other sets of selected explanatory variables in the vicinity of a set with an excellent evaluation (that is, a set of explanatory variables that are likely to be causal factors), and can thereby search for causal factors efficiently.
  • the explanatory variable selection unit 120 may have the initialization unit 122 randomly select sets of selected explanatory variables until a set whose evaluation is equal to or higher than the criterion is obtained, and may switch to selection by the generation unit 124 using the Markov chain Monte Carlo method on condition that the evaluation of the prediction model learned according to the selected set is equal to or higher than the criterion.
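  • The restart-and-refine behaviour of this modification can be sketched abstractly; here a toy one-dimensional state stands in for a set of selected explanatory variables, and every name and parameter below is an illustrative assumption:

```python
import random

def search_with_restarts(evaluate, random_state, neighbor, n_iter,
                         criterion, patience, rng):
    """If the most recent `patience` evaluations all fall below `criterion`,
    restart from a fresh random state (cf. S204); otherwise generate a
    neighbouring state from the current one (cf. S206)."""
    current = random_state(rng)
    best = current
    best_eval = evaluate(current)
    recent = [best_eval]
    for _ in range(n_iter):
        if len(recent) >= patience and all(e < criterion for e in recent[-patience:]):
            current = random_state(rng)       # re-initialization
            recent = []
        else:
            current = neighbor(current, rng)  # local (MCMC-style) move
        e = evaluate(current)
        recent.append(e)
        if e > best_eval:
            best, best_eval = current, e
    return best, best_eval

# Toy one-dimensional search standing in for sets of explanatory variables.
rng = random.Random(0)
evaluate = lambda x: 1.0 - abs(x - 7) / 10.0           # best at x = 7
random_state = lambda r: r.randrange(0, 20)
neighbor = lambda x, r: max(0, min(19, x + r.choice([-1, 1])))
best, best_eval = search_with_restarts(evaluate, random_state, neighbor,
                                       n_iter=200, criterion=0.8, patience=10,
                                       rng=rng)
```

Regions that keep evaluating poorly trigger a restart, while promising regions are refined by local moves, matching the flow of S202 to S206.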
  • FIG. 14 shows a modification of the generation of a set of selected explanatory variables by the Markov chain Monte Carlo method by the generation unit 124.
  • the generation unit 124 may use a fixed proposal distribution when generating sets of selected explanatory variables by the Markov chain Monte Carlo method. Instead, the generation unit 124 may use probability distributions with different shapes, as shown by the solid line and the dotted line in FIG. 14, as the proposal distribution.
  • the generation unit 124 changes the proposal distribution in the Markov chain Monte Carlo method according to the evaluation of the prediction model learned according to the set of selected explanatory variables.
  • the generation unit 124 may use, as the proposal distribution, a distribution whose width is negatively correlated with the evaluation of the previously selected set of selected explanatory variables (for example, a distribution whose width is inversely proportional to the evaluation value).
  • the generation unit 124 selects the next set of selected explanatory variables from a narrower probability distribution (for example, the dotted distribution in FIG. 14) when the evaluation of the previously selected set of selected explanatory variables becomes higher. Therefore, the probability of selecting a set of selected explanatory variables closer to the previous set of selected explanatory variables is increased.
  • the data processing apparatus 10 can thereby efficiently search for sets of selected explanatory variables with high evaluations.
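  • One hedged way to realize such an evaluation-dependent proposal width, taking the "inversely proportional" example literally; the constants are illustrative assumptions, not values from the embodiment:

```python
def proposal_sd(prev_evaluation, k=0.5, max_sd=3.0):
    """Proposal-distribution width inversely proportional to the previous
    evaluation value, clipped to max_sd: a high evaluation narrows the
    distribution so the next set stays close to the previous one."""
    return min(max_sd, k / max(prev_evaluation, 1e-9))

wide = proposal_sd(0.2)    # poor previous evaluation -> broad exploration
narrow = proposal_sd(0.9)  # good previous evaluation -> local refinement
```

The returned width could be fed directly into a step-generating routine such as a Gaussian proposal, reproducing the solid-line versus dotted-line shapes of FIG. 14.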
  • the explanatory variable selection unit 120 supplies the selected explanatory variable set to the learning processing unit 130, and returns the process to S160.
  • the model selection unit 140 preferentially selects, from among the plurality of prediction models learned by the learning processing unit 130 according to different sets of selected explanatory variables, a prediction model with a higher evaluation (for example, a lower Wilks' lambda statistic, a higher sensitivity or specificity, and/or a smaller Akaike Information Criterion (AIC)). For example, the model selection unit 140 selects the prediction model corresponding to the set of selected explanatory variables with the highest evaluation from among the plurality of prediction models generated by the loop processing of S160 to S200. Alternatively, the model selection unit 140 may select a prediction model corresponding to a set of selected explanatory variables with a probability whose magnitude corresponds to the evaluation value. The model selection unit 140 supplies the selected prediction model to the determination unit 150.
  • the determination unit 150 determines a set of selected explanatory variables corresponding to the prediction model selected by the model selection unit 140 as a cause explanatory variable set.
  • the determination unit 150 can thus identify, as the cause of the event, the set of selected explanatory variables that yields a highly evaluated prediction model.
  • the determination unit 150 can, for example, identify as disease-causing genes a set of genes that yields a prediction model with high accuracy in predicting the occurrence of the disease.
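  • The two selection policies of the model selection unit 140 (take the highest evaluation, or sample with probability proportional to the evaluation) could be sketched as follows; the gene sets and scores are invented for illustration:

```python
import random

def select_model(models, rng=None):
    """models: list of (selected_set, evaluation) pairs, higher is better.
    With rng=None the highest-evaluation model is returned; with an rng,
    a model is sampled with probability proportional to its evaluation."""
    if rng is None:
        return max(models, key=lambda m: m[1])
    total = sum(e for _, e in models)
    r = rng.random() * total
    acc = 0.0
    for m in models:
        acc += m[1]
        if r <= acc:
            return m
    return models[-1]

models = [(("g1", "g2"), 0.6), (("g3", "g7"), 0.9), (("g2", "g5"), 0.4)]
best_set, best_eval = select_model(models)                 # deterministic pick
sampled = select_model(models, rng=random.Random(0))       # probabilistic pick
```

The set returned by `select_model` is what the determination unit 150 would then designate as the set of cause explanatory variables.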
  • because the data processing apparatus 10 includes a process of randomly selecting the selected explanatory variables, it can identify the cause of the event with higher accuracy than when the set of selected explanatory variables is generated by selecting explanatory variables in order of their individual degree of contribution to the occurrence/non-occurrence of the event.
  • the data processing apparatus 10 may apply the flow of FIG. 2 with different values of s to determine an appropriate value of s.
  • FIG. 15 shows a parallel processing device 12 according to a modification of the present embodiment in which parallel processing is implemented.
  • the parallel processing device 12 includes a first processing unit 102 and a plurality of second processing units 104, and is different from the data processing device 10 in that parallel processing is executed by the plurality of second processing units 104.
  • the first processing unit 102 has functions of an acquisition unit 110, a model selection unit 140, and a determination unit 150.
  • Each of the plurality of second processing units 104 has functions of an explanatory variable selection unit 120 and a learning processing unit 130.
  • the acquisition unit 110 of this modification supplies the acquired sample data to the explanatory variable selection units 120 of the plurality of second processing units 104.
  • the plurality of explanatory variable selection units 120 select sets of selected explanatory variables in parallel from the plurality of explanatory variables, and the plurality of learning processing units 130 learn prediction models for the plurality of sets of selected explanatory variables in parallel.
  • the parallel processing device 12 executes the loop processing of S160 to S200 in parallel by the plurality of explanatory variable selection units 120 and the plurality of learning processing units 130.
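  • A minimal sketch of this parallel learning, using a thread pool as a stand-in for the plurality of second processing units 104; the scoring function is a deterministic toy, not the embodiment's learning process, and all names are assumptions:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def learn_and_evaluate(selected_set):
    """Stand-in for one second processing unit 104: "learn" a prediction
    model for one set of selected explanatory variables and return the
    set together with a deterministic toy evaluation score."""
    score = (sum(ord(c) for g in selected_set for c in g) % 100) / 100.0
    return selected_set, score

rng = random.Random(0)
genes = [f"g{i}" for i in range(1, 21)]
candidate_sets = [tuple(sorted(rng.sample(genes, 3))) for _ in range(8)]

# The loop of S160 to S200 executed in parallel over the candidate sets.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(learn_and_evaluate, candidate_sets))
best_set, best_eval = max(results, key=lambda r: r[1])
```

A process pool, GPU kernels, or a compute cluster could replace the thread pool for genuinely CPU-bound learning; the map-then-reduce structure (parallel evaluation, then selection of the best model) stays the same.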
  • alternatively, the first processing unit 102 may include the explanatory variable selection unit 120, each second processing unit 104 may include only the learning processing unit 130, and the parallel processing device 12 may parallelize only the learning processing of the learning processing units 130.
  • the parallel processing device 12 executes the processing of S160 in parallel by the plurality of learning processing units 130.
  • the plurality of second processing units 104 each independently select a set of selected explanatory variables by random or Markov chain Monte Carlo method or the like.
  • the second processing unit 104 may communicate with each other a set of selected explanatory variables and evaluation information thereof, and select different sets of selected explanatory variables.
  • another second processing unit 104 may be assigned to search the vicinity of the set of selected explanatory variables of a prediction model with a high evaluation. The second processing units 104 can thereby accelerate the search for prediction models expected to have high evaluations.
  • the second processing unit 104 can perform the selection processing of the set of selected explanatory variables and the learning process of the prediction model in parallel by a large number of processing entities, thereby improving the processing efficiency.
  • the first processing unit 102 may communicate information on a set of selected explanatory variables with a plurality of second processing units 104 and control parallel processing of the second processing unit 104.
  • the first processing unit 102 may be realized by, for example, a general-purpose CPU, and each of the plurality of second processing units 104 may be realized by a general-purpose GPU (GPGPU), a dedicated CPU, or the like.
  • An example of a general-purpose GPU platform is CUDA (Compute Unified Device Architecture) developed by NVIDIA.
  • the first processing unit 102 and the plurality of second processing units 104 may be realized by parallel FPGAs (Field-Programmable Gate Arrays), a cluster of many information processing devices (a computer cluster), a plurality of virtual machine images accessed via a network (machine images deployed by a cloud service, etc.), and/or a plurality of cores in a processor (a many-core CPU).
  • Examples of parallel FPGAs include the RIVYERA models from SciEngines GmbH.
  • Examples of servers equipped with a many-core CPU include HP ProLiant DL980 G7 (80-core CPU) manufactured by Hewlett-Packard and HP Integrity Superdome X Server (240-core CPU).
  • An example of a virtual machine image is Amazon Web Service AMI (Amazon Machine Images).
  • the plurality of second processing units 104 may execute parallel processing by in-memory parallel distributed processing. Examples of the in-memory parallel distributed processing technology include Apache Spark.
  • in the above description, the data processing device 10 and the parallel processing device 12 acquire sample data in which the presence/absence or expression level of specific genes is associated with the presence or absence of a disease, and estimate the causal genes of the disease; however, the application target of the data processing device 10 and the like is not limited to this.
  • the data processing device 10 or the like may acquire sample data including the drug resistance of a pest such as an insect and the gene sequence information of the pest, and specify a combination of genes that contribute to the drug resistance.
  • the data processing apparatus 10 or the like may acquire sample data including gene sequence information of a plurality of closely related species and specify a combination of genes that serve as an index when creating an evolutionary phylogenetic tree.
  • the data processing apparatus 10 or the like generates a branch diagram pattern from each of a plurality of selected sets of selected explanatory variables (gene sets), and determines whether each branch diagram pattern belongs to the majority group or a minority group.
  • the data processing apparatus 10 or the like gives a high evaluation to a set of selected explanatory variables that give a branch diagram belonging to the majority, and gives a low evaluation to a set of selected explanatory variables that give a branch diagram that belongs to the minority.
  • the application target of the data processing apparatus 10 or the like is not limited to specifying a gene as a selected explanatory variable.
  • the data processing apparatus 10 or the like can be used to specify cause explanatory variables for all phenomena in which some of the plurality of explanatory variables contribute to the event.
  • the data processing device 10 or the like can be used to identify factors of purchase behavior, causes of fluctuations in stock prices, factors of information propagation in a network, or causes of natural phenomena such as weather.
  • FIG. 16 shows an example of the effect of learning according to this embodiment.
  • the vertical axis of the graph indicates the Wilks' lambda statistic of the finally obtained prediction model; the lower the value, the higher the evaluation (that is, the higher the prediction accuracy).
  • the horizontal axis of the graph indicates the number (s) of explanatory variables included in the set of selected explanatory variables.
  • a prediction model for predicting the occurrence of colorectal cancer was generated from the genes, using the gene expression levels in the tissues of 18 colorectal cancer patients and 18 healthy individuals as sample data.
  • the solid line in the graph shows the result obtained when, in the processing flow described in FIG. 2, the set of selected explanatory variables is selected randomly from all explanatory variables in the sample data (for example, in the processing of S140 and S200 in FIG. 2) without depending on past selections.
  • the broken line in the graph shows the result when the set of selected explanatory variables is generated by selecting explanatory variables from the plurality of explanatory variables in descending order of their degree of contribution to the occurrence/non-occurrence of the event, as determined by a t-test.
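  • The t-test baseline of the broken line could be sketched as follows, ranking genes by the absolute Welch t-statistic between patient and control groups and taking the top s; the expression values below are invented purely for illustration:

```python
import math

def welch_t(case_values, control_values):
    """Welch's t-statistic for one explanatory variable (gene)."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return m, v
    m1, v1 = mean_var(case_values)
    m2, v2 = mean_var(control_values)
    return (m1 - m2) / math.sqrt(v1 / len(case_values) + v2 / len(control_values))

# Hypothetical expression levels: g1 differs strongly between the groups.
cases = {"g1": [5.1, 5.3, 4.9], "g2": [1.0, 1.2, 0.9], "g3": [3.0, 3.1, 2.8]}
controls = {"g1": [2.0, 2.1, 1.9], "g2": [1.1, 0.9, 1.0], "g3": [3.1, 2.9, 3.0]}

ranked = sorted(cases, key=lambda g: abs(welch_t(cases[g], controls[g])),
                reverse=True)
top_s = ranked[:2]  # baseline set of s = 2 explanatory variables
```

Such per-gene ranking evaluates each explanatory variable in isolation, which is exactly why, per the comparison of FIG. 16, it can miss combinations of genes that only jointly explain the event.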
  • FIG. 17 shows an example of a hardware configuration of a computer 1900 that functions as the data processing apparatus 10 or the like.
  • the computer 1900 includes a CPU peripheral unit having a CPU 2000, a RAM 2020, a graphic controller 2075, and a display device 2080 connected to one another by a host controller 2082; an input/output unit having a communication interface 2030, a hard disk drive 2040, and a CD-ROM drive 2060 connected to the host controller 2082 by an input/output controller 2084; and a legacy input/output unit having a ROM 2010, a flexible disk drive 2050, and an input/output chip 2070 connected to the input/output controller 2084.
  • the host controller 2082 connects the RAM 2020 to the CPU 2000 and the graphic controller 2075 that access the RAM 2020 at a high transfer rate.
  • the CPU 2000 operates based on a program (for example, a parallel processing program) stored in the ROM 2010 and the RAM 2020, and controls each unit.
  • the graphic controller 2075 acquires image data generated by the CPU 2000 or the like on a frame buffer provided in the RAM 2020 and displays it on the display device 2080.
  • the graphic controller 2075 may include a frame buffer for storing image data generated by the CPU 2000 or the like.
  • the input / output controller 2084 connects the host controller 2082 to the communication interface 2030, the hard disk drive 2040, and the CD-ROM drive 2060, which are relatively high-speed input / output devices.
  • the communication interface 2030 communicates with other devices via a network by wire or wireless.
  • the communication interface functions as hardware that performs communication.
  • the hard disk drive 2040 stores programs and data used by the CPU 2000 in the computer 1900.
  • the CD-ROM drive 2060 reads a program or data from the CD-ROM 2095 and provides it to the hard disk drive 2040 via the RAM 2020.
  • the input/output controller 2084 is also connected to the ROM 2010, the flexible disk drive 2050, and the input/output chip 2070, which are relatively low-speed input/output devices.
  • the ROM 2010 stores a boot program that the computer 1900 executes at startup and / or a program that depends on the hardware of the computer 1900.
  • the flexible disk drive 2050 reads a program or data from the flexible disk 2090 and provides it to the hard disk drive 2040 via the RAM 2020.
  • the input / output chip 2070 connects the flexible disk drive 2050 to the input / output controller 2084 and inputs / outputs various input / output devices via, for example, a parallel port, a serial port, a keyboard port, a mouse port, and the like. Connect to controller 2084.
  • the program provided to the hard disk drive 2040 via the RAM 2020 is stored in a recording medium such as the flexible disk 2090, the CD-ROM 2095, or an IC card and provided by the user.
  • the program is read from the recording medium, installed in the hard disk drive 2040 in the computer 1900 via the RAM 2020, and executed by the CPU 2000.
  • the program installed in the computer 1900 and causing the computer 1900 to function as the data processing apparatus 10 and the like includes an acquisition module, an explanatory variable selection module, an initialization module, a generation module, a learning processing module, a model selection module, and a determination module. These programs or modules work on the CPU 2000 or the like to make the computer 1900 into an acquisition unit 110, an explanatory variable selection unit 120, an initialization unit 122, a generation unit 124, a learning processing unit 130, a model selection unit 140, and a determination unit. Each of them may function as 150.
  • the information processing described in these programs is read into the computer 1900 and thereby functions as the acquisition unit 110, the explanatory variable selection unit 120, the initialization unit 122, the generation unit 124, the learning processing unit 130, the model selection unit 140, and the determination unit 150, which are specific means in which the software and the various hardware resources described above cooperate.
  • the specific data processing apparatus 10 according to the purpose of use is constructed by realizing calculation or processing of information according to the purpose of use of the computer 1900 in this embodiment by these specific means.
  • the CPU 2000 executes a communication program loaded on the RAM 2020 and instructs the communication interface 2030 to perform communication processing based on the processing content described in the communication program.
  • the communication interface 2030 reads transmission data stored in a transmission buffer area or the like provided on a storage device such as the RAM 2020, the hard disk drive 2040, the flexible disk 2090, or the CD-ROM 2095, and sends it to the network.
  • the communication interface 2030 also writes reception data received from the network into a reception buffer area or the like provided on the storage device.
  • the communication interface 2030 may transfer transmission/reception data to/from the storage device by a DMA (direct memory access) method. Instead, the CPU 2000 may transfer the transmission/reception data by reading the data from the transfer-source storage device or communication interface 2030 and writing the data to the transfer-destination communication interface 2030 or storage device.
  • the CPU 2000 reads all or a necessary portion of a file or database stored in an external storage device such as the hard disk drive 2040, the CD-ROM drive 2060 (CD-ROM 2095), or the flexible disk drive 2050 (flexible disk 2090) into the RAM 2020 by DMA transfer or the like, and performs various processes on the data in the RAM 2020. The CPU 2000 then writes the processed data back to the external storage device by DMA transfer or the like.
  • the RAM 2020 and the external storage device are collectively referred to as a memory, a storage unit, or a storage device.
  • the CPU 2000 can also hold part of the contents of the RAM 2020 in a cache memory and perform reading and writing on the cache memory. Even in such a form, the cache memory bears part of the function of the RAM 2020; therefore, in the present embodiment, the cache memory is regarded as included in the RAM 2020, the memory, and/or the storage device unless otherwise indicated.
  • the CPU 2000 performs, on the data read from the RAM 2020, the various kinds of processing described in the present embodiment, such as various operations, information processing, condition determination, and information search/replacement, as specified by the instruction sequence of the program, and writes the result back to the RAM 2020. For example, when performing condition determination, the CPU 2000 determines whether each of the various variables shown in the present embodiment satisfies a condition such as being larger than, smaller than, at least, at most, or equal to another variable or constant, and when the condition is satisfied (or not satisfied), branches to a different instruction sequence or calls a subroutine.
  • the CPU 2000 can search for information stored in a file or database in the storage device. For example, in the case where a plurality of entries in which the attribute value of the second attribute is associated with the attribute value of the first attribute are stored in the storage device, the CPU 2000 displays the plurality of entries stored in the storage device. The entry that matches the condition in which the attribute value of the first attribute is specified is retrieved, and the attribute value of the second attribute that is stored in the entry is read, thereby associating with the first attribute that satisfies the predetermined condition The attribute value of the specified second attribute can be obtained.
  • The programs or modules described above may be stored in an external recording medium. Examples of the recording medium include:
  • an optical recording medium such as DVD or CD
  • a magneto-optical recording medium such as MO
  • a tape medium, a semiconductor memory such as an IC card, and the like
  • In addition, a storage device such as a hard disk or a RAM provided in a server system connected to a dedicated communication network or the Internet may be used as the recording medium, and the program may be provided to the computer 1900 via the network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The number of possible gene combinations is vast, and it has been difficult to accurately and efficiently identify, from among all combinations, the genes that cause an event. The present invention provides a data processing device comprising: an acquisition unit that acquires a plurality of sample data in which the value of each of a plurality of explanatory variables is associated with the occurrence or non-occurrence of an event; an explanatory variable selection unit that selects sets of explanatory variables from the plurality of explanatory variables; a learning processing unit that, based on the plurality of sample data, learns for each of the multiple sets of explanatory variables a prediction model that predicts the occurrence or non-occurrence of the event from the values of the selected explanatory variables; a model selection unit that preferentially selects, from among the plurality of prediction models corresponding to different sets of selected explanatory variables, a prediction model with a higher evaluation; and a determination unit that determines, as a set of cause explanatory variables, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.

Description

Data processing device, data processing method, and data processing program
 The present invention relates to a data processing device, a data processing method, and a data processing program.
 With the spread of next-generation DNA sequencers that decode genetic information at high speed, vast amounts of genetic information on humans and other organisms have become available, and research is under way to identify genes that serve as explanatory variables for biological events such as diseases. Biological events are often caused by combinations of several genes, and methods for estimating such gene combinations are conventionally known (for example, Patent Documents 1 to 3).
 [Patent Document 1] JP 2003-4739 A
 [Patent Document 2] JP 2002-528095 A
 [Patent Document 3] JP 2011-248789 A
 However, the number of possible gene combinations is enormous, and it has remained difficult to estimate efficiently and accurately the combination of genes that causes an event.
 In a first aspect of the present invention, there is provided a data processing device that identifies, from among a plurality of explanatory variables, a cause explanatory variable set, i.e., a set of at least one explanatory variable that causes a predetermined event. The data processing device comprises: an acquisition unit that acquires a plurality of sample data in which the value of each of a plurality of explanatory variables is associated with the occurrence or non-occurrence of an event; an explanatory variable selection unit that repeatedly selects a set of selected explanatory variables from the plurality of explanatory variables, each selection being made at random without depending on previously selected sets of selected explanatory variables; a learning processing unit that, based on the plurality of sample data, learns for each set of selected explanatory variables a prediction model that predicts the occurrence or non-occurrence of the event from the values of the selected explanatory variables; a model selection unit that preferentially selects, from among the plurality of prediction models corresponding to different sets of selected explanatory variables, a prediction model with a higher evaluation; and a determination unit that determines, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.
 The above summary of the invention does not enumerate all the features of the present invention. Sub-combinations of these feature groups may also constitute inventions.
 The drawings show the following:
  • A block diagram of the data processing device 10 of this embodiment.
  • A processing flow performed by the data processing device 10 of this embodiment.
  • An example of sample data according to this embodiment.
  • An example of sets of explanatory variables according to this embodiment.
  • The selection probabilities with which the initialization unit 122 selects each set as the set of selected explanatory variables.
  • An example of the occurrence probability of an event for each gene.
  • An example of a processing flow for learning a prediction model by multivariate analysis.
  • An example of a multidimensional space generated by the learning processing unit 130.
  • An example of a processing flow for learning a prediction model by the maximum likelihood estimation method.
  • An example of a processing flow for learning a prediction model by the Bayesian method.
  • An example of discrimination by the maximum likelihood estimation method or the Bayesian method.
  • An example of a method of generating sets of selected explanatory variables by the Markov chain Monte Carlo method.
  • A modification of the processing flow of S200 by the explanatory variable selection unit 120.
  • A modification of the generation of sets of selected explanatory variables by the Markov chain Monte Carlo method by the generation unit 124.
  • A parallel processing device 12 according to a modification of this embodiment that implements parallel processing.
  • An example of the effect of learning according to this embodiment.
  • An example of a hardware configuration of the computer 1900.
 Hereinafter, the present invention will be described through embodiments of the invention, but the following embodiments do not limit the invention according to the claims. In addition, not all combinations of the features described in the embodiments are essential to the solving means of the invention.
 FIG. 1 shows a block diagram of the data processing device 10 of this embodiment. The data processing device 10 identifies, from among a plurality of explanatory variables, a cause factor set, i.e., a set of at least one explanatory variable that causes a predetermined event. For example, the data processing device 10 identifies, from among a plurality of genes, a set of at least one gene that causes an event such as a disease as the cause factor set. The data processing device 10 includes an acquisition unit 110, an explanatory variable selection unit 120, a learning processing unit 130, a model selection unit 140, and a determination unit 150.
 The acquisition unit 110 acquires a plurality of sample data in which the value of each of a plurality of explanatory variables is associated with the occurrence or non-occurrence of an event. For example, the acquisition unit 110 acquires, from the database 20, sample data on a plurality of subjects in which gene-related values (for example, the presence or absence, modification, or expression level of a specific structural gene) are associated with the presence or absence of a disease. The acquisition unit 110 provides the acquired sample data to the explanatory variable selection unit 120.
 The explanatory variable selection unit 120 selects a set of at least one selected explanatory variable from the plurality of explanatory variables. For example, the explanatory variable selection unit 120 selects, from among the plurality of genes possessed by the subjects included in the sample data, a set of genes containing a predetermined number of genes as the set of selected explanatory variables, based on a predetermined method (for example, the bootstrap method or the Markov chain Monte Carlo method). The explanatory variable selection unit 120 includes an initialization unit 122 and a generation unit 124.
 The initialization unit 122 determines the set of selected explanatory variables that the explanatory variable selection unit 120 selects at the start of the search for cause explanatory variables. For example, the initialization unit 122 includes in the initial set of selected explanatory variables randomly selected explanatory variables, explanatory variables that occur frequently in the plurality of sample data, or explanatory variables whose individual contribution to the occurrence or non-occurrence of the event in the plurality of sample data is high (for example, a small Wilks' lambda statistic, a high sensitivity or specificity, and/or a small Akaike information criterion (AIC) statistic).
 The generation unit 124 generates the sets of selected explanatory variables that the explanatory variable selection unit 120 selects after the initial stage of the search. For example, the generation unit 124 uses the Markov chain Monte Carlo method to generate sets of selected explanatory variables sequentially from the initial set of selected explanatory variables. In this way, the generation unit 124 generates a set of selected explanatory variables whose combination is close to that of the previously selected set.
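 As a minimal sketch of the "close to the previous combination" behavior described above, one Markov-chain step can be realized by swapping a single variable of the current set. The function name and the uniform swap rule below are illustrative assumptions, not the proposal or acceptance rule actually used by the generation unit 124:

```python
import random

def propose_neighbor(current_set, n_vars, rng=None):
    """One Markov-chain step: return a new set of selected explanatory
    variables that differs from the current set in exactly one element.
    current_set: list of distinct variable indices in [0, n_vars)."""
    rng = rng or random.Random()
    current = set(current_set)
    dropped = rng.choice(sorted(current))                      # variable to remove
    added = rng.choice([g for g in range(n_vars) if g not in current])
    return sorted((current - {dropped}) | {added})
```

 Repeatedly applying such a step yields a chain of candidate sets, each a small perturbation of its predecessor.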
 The initialization unit 122 and the generation unit 124 may determine or generate the set of selected explanatory variables based on the evaluation of prediction models obtained from previously generated sets of selected explanatory variables. Details of the processing by the explanatory variable selection unit 120 will be described later. The explanatory variable selection unit 120 supplies the set of selected explanatory variables to the learning processing unit 130.
 The learning processing unit 130 learns, based on the plurality of sample data, a prediction model that predicts the occurrence or non-occurrence of the event from the value of each selected explanatory variable in the set of selected explanatory variables. For example, the learning processing unit 130 learns a prediction model that predicts the occurrence of a disease from, for example, the presence or absence of the genes in each subject's set of selected explanatory variables in the sample data. The learning processing unit 130 thereby obtains a prediction model for each selected set of selected explanatory variables. The specific content of the learning processing of the learning processing unit 130 is described later.
 The learning processing unit 130 also generates an evaluation of the accuracy with which each prediction model predicts event occurrence, and supplies it to the explanatory variable selection unit 120. For example, the learning processing unit 130 compares the occurrence of the event predicted from the sample data by each prediction model with the actual occurrence or non-occurrence of the event, and generates the evaluation from this comparison. The learning processing unit 130 supplies the set of selected explanatory variables corresponding to each learned prediction model and its evaluation to the model selection unit 140.
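 The patent leaves the evaluation measure open; one minimal example of comparing predictions against observed outcomes is a simple agreement rate. The helper name `accuracy` below is a hypothetical illustration, not the evaluation actually used by the learning processing unit 130:

```python
def accuracy(predicted, observed):
    """Fraction of subjects for which the predicted event occurrence
    (1/0) matches the observed outcome (1/0)."""
    if len(predicted) != len(observed):
        raise ValueError("prediction/observation length mismatch")
    hits = sum(1 for p, o in zip(predicted, observed) if p == o)
    return hits / len(observed)
```

 Other measures (sensitivity, specificity, AIC, and so on, mentioned elsewhere in the specification) could be substituted for this agreement rate.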
 The model selection unit 140 preferentially selects, from among the plurality of prediction models that the learning processing unit 130 has learned for different sets of selected explanatory variables, a prediction model with a higher evaluation. For example, the model selection unit 140 selects the prediction model with the highest evaluation. The model selection unit 140 supplies the selected prediction model to the determination unit 150.
 The determination unit 150 determines the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit 140 as the cause explanatory variable set. The determination unit 150 can thereby preferentially identify as the cause of the event the set of selected explanatory variables that yields a highly evaluated prediction model, i.e., one that predicts the occurrence of the event with high accuracy.
 In this way, the data processing device 10 selects sets of selected explanatory variables from among the candidate cause explanatory variables in the sample data by the Markov chain Monte Carlo method or the like, learns a prediction model for each selected set of selected explanatory variables, and identifies the set of selected explanatory variables corresponding to a highly evaluated prediction model as the cause explanatory variables.
 FIG. 2 shows a processing flow performed by the data processing device 10 of this embodiment. The data processing device 10 identifies the cause explanatory variable set by executing the processes of S120 to S240.
 First, in S120, the acquisition unit 110 acquires, for a plurality of subjects, sample data in which the value of each of a plurality of explanatory variables is associated with the occurrence or non-occurrence of an event. For example, the acquisition unit 110 acquires from the database 20 sample data that associates, for a plurality of subjects, the presence or absence of specific gene expression with the presence or absence of a specific disease such as colorectal cancer.
 FIG. 3 shows an example of sample data. The acquisition unit 110 acquires, as sample data, information on whether each of M subjects expresses each of n (for example, several million) genes g1 to gn (for example, 1 if the gene is expressed, 0 otherwise), together with information associated with a biological event such as the presence or absence of a disease (for example, 1 if the disease has occurred, 0 otherwise). For example, the sample data shown in FIG. 3 indicate that subject 1 expresses gene g1 and gene g2, does not express gene g3, ..., expresses gene gn, and has no disease; that subject 2 does not express gene g1, expresses gene g2 and gene g3, ..., expresses gene gn, and has the disease; and that subject M expresses gene g1 and gene g2, does not express gene g3, ..., does not express gene gn, and has no disease.
 In addition to or instead of the presence or absence of the subject's gene expression, the acquisition unit 110 may acquire information on gene expression levels, positional information on gene sequence polymorphisms, the frequency of gene mutations, the type and site of gene modifications, and/or the degree of gene modification. The acquisition unit 110 may also acquire information related to gene expression. For example, the acquisition unit 110 may acquire sample data including information on the amount of protein produced by gene expression and translation; the type, site, and/or degree of modification of transcripts or proteins; and the type and/or amount of metabolites produced as a result of the functional expression of the produced protein (for example, (a) lipids, carbohydrates, vitamins, amino acids, nucleic acids, other alcohols, organic acids or their esters, other amines, or other organic compounds; (b) minerals or their ions, or other inorganic compounds (nitrogen compounds, sulfur compounds, phosphorus-containing compounds, halogen compounds, etc.) or their ions; or (c) complexes or coordination compounds thereof, or their degradation products). The acquisition unit 110 provides the acquired sample data to the explanatory variable selection unit 120.
 Next, in S140, the initialization unit 122 determines an initial set of selected explanatory variables. For example, the initialization unit 122 may determine an initial set of selected explanatory variables that contains, as selected explanatory variables, a predetermined number s of explanatory variables selected at random with equal probability from all the explanatory variables in the sample data.
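 The equal-probability initialization of S140 can be sketched in a few lines; the helper name is illustrative:

```python
import random

def initial_set(n_vars, s, seed=None):
    """Choose s distinct explanatory-variable indices out of n_vars
    candidates, each combination equally likely (cf. S140)."""
    return sorted(random.Random(seed).sample(range(n_vars), s))
```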
 Alternatively, the initialization unit 122 may preferentially include in the initial set of selected explanatory variables those explanatory variables that occur frequently in the plurality of sample data. For example, the initialization unit 122 determines genes that subjects express frequently in the sample data (or in nature) as initial selected explanatory variables. As one example, the initialization unit 122 may select the s most frequent genes, or select s mutually distinct genes with selection probabilities corresponding to their frequencies, as the initial set of selected explanatory variables.
 The initialization unit 122 may select the set of selected explanatory variables from the entire plurality of explanatory variables. Alternatively, it may first extract, from the plurality of explanatory variables, a subset of explanatory variables whose individual contribution to the occurrence or non-occurrence of the event satisfies a predetermined criterion and/or whose frequency of expression in subjects is higher than a predetermined criterion, and then select s selected explanatory variables from the extracted subset. As one example, the initialization unit 122 may extract, from the plurality of explanatory variables, only those whose individual contribution to the occurrence or non-occurrence of the event, as measured by a t-test, is at least average, and select s genes from the extracted explanatory variables. The initialization unit 122 may also exclude from the plurality of explanatory variables those corresponding to genes whose expression frequency is lower than the occurrence frequency of the event, and select s genes from the remainder.
 A method by which the initialization unit 122 determines the set of selected explanatory variables based on frequency is described with reference to FIGS. 4 and 5. FIG. 4 shows an example of sets of explanatory variables according to this embodiment. From combinations of the n explanatory variables g1 to gn, N (N = nCs) sets of explanatory variables G1 to GN, each containing s explanatory variables, can be generated. In the example of FIG. 4, set G1 contains g10, g41, g301, and g510; set G2 contains g10, g41, g301, and g282; ...; and set GN contains ga, gb, gc, and gd (a, b, c, d ∈ {1, ..., n}).
 FIG. 5 shows an example of the selection probabilities with which the initialization unit 122 selects each set as the initial set of selected explanatory variables. The horizontal axis of the graph shows the N sets of explanatory variables G1 to GN in order, and the vertical axis shows the selection probability Ps corresponding to each set of explanatory variables. The initialization unit 122 selects a set of explanatory variables with a probability according to its selection probability Ps.
 In FIG. 5, the sets of explanatory variables are arranged in order of frequency. That is, the set of explanatory variables (i.e., genes) Gx placed at the far left of the graph is the combination of genes that appears (or is expected to appear) most frequently in the sample data (or in nature); the set Gy placed immediately to the right of Gx is the next most frequent combination; ...; and the set Gz placed at the far right is the combination of genes that appears (or is expected to appear) least frequently in the sample data (or in nature). The number of explanatory variables contained in each set arranged in FIG. 5 may be a single fixed value (for example, 3) or may vary over the sets (for example, from 2 to 5).
 The selection probability Ps corresponding to each set of explanatory variables in the graph may be a value whose magnitude corresponds to the frequency of that combination of explanatory variables. The initialization unit 122 can thereby preferentially select sets of genes with high occurrence frequencies.
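 Selection in proportion to Ps can be sketched as weighted sampling without replacement. The loop below is one simple assumed realization of frequency-weighted selection, not the procedure specified by the patent:

```python
import random

def weighted_initial_set(genes, freqs, s, rng=None):
    """Draw s distinct genes, each draw weighted by its occurrence
    frequency, so frequent genes are preferentially selected."""
    rng = rng or random.Random()
    pool, weights = list(genes), list(freqs)
    chosen = []
    for _ in range(s):
        g = rng.choices(pool, weights=weights)[0]  # frequency-weighted draw
        i = pool.index(g)
        pool.pop(i)                                # remove so draws stay distinct
        weights.pop(i)
        chosen.append(g)
    return chosen
```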
 Alternatively, the initialization unit 122 may include in the initial set of selected explanatory variables those explanatory variables whose individual contribution to the occurrence or non-occurrence of the event in the plurality of sample data satisfies a predetermined criterion. For example, the initialization unit 122 calculates the event occurrence rate for each explanatory variable from the sample data. As one example, if 11 of 10,000 subjects who express gene g1 have the disease, the initialization unit 122 calculates that the degree to which the explanatory variable (gene g1) contributes to the occurrence of the event (the disease), i.e., the disease incidence associated with expression of gene g1, is 0.11%.
 FIG. 6 shows an example of the occurrence probability of an event for each gene. According to FIG. 6, in the sample data, the disease incidence among subjects having gene g1 is 0.11%, among subjects having gene g2 is 0.15%, among subjects having gene g3 is 0.73%, and among subjects having gene gn is 0.02%.
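 Per-gene incidence figures of this kind follow directly from counts such as "11 of 10,000 carriers". A sketch of that computation (the tuple data layout is an assumption for illustration):

```python
def event_rate(samples, gene_idx):
    """Event rate among subjects expressing gene gene_idx.
    Each sample is (gene_values, event_flag) with 1/0 entries."""
    carriers = [event for genes, event in samples if genes[gene_idx] == 1]
    return sum(carriers) / len(carriers) if carriers else 0.0
```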
 Here, the initialization unit 122 determines genes with a high degree of contribution to the occurrence or non-occurrence of the disease as initial selected explanatory variables. As one example, the initialization unit 122 may select the s genes with the highest degrees of contribution to the occurrence or non-occurrence of the disease, or select s mutually distinct genes with selection probabilities according to those degrees, as the initial set of selected explanatory variables.
 Next, in S160, the learning processing unit 130 learns, based on the plurality of sample data, a prediction model that predicts the occurrence or non-occurrence of the event from the value of each selected explanatory variable in the set determined in the immediately preceding process. For example, the learning processing unit 130 learns a prediction model that predicts the occurrence of the disease from, for example, the presence or absence of gene expression in each subject's set of selected explanatory variables in the sample data. As one example, the learning processing unit 130 learns the prediction model based on multivariate analysis (such as discriminant analysis and multiple regression analysis), machine learning (such as self-organizing maps, support vector machines, and deep learning), the maximum likelihood estimation method, or the Bayesian method.
 FIG. 7 shows an example of a processing flow for learning a prediction model by multivariate analysis. The learning processing unit 130 may execute the process of S160 using multivariate analysis such as multiple regression analysis, principal component analysis, and cluster analysis by executing the processes of S162 to S172.
 First, in S162, the learning processing unit 130 generates a plurality of functions, each containing as variables the selected explanatory variables in the set determined in the immediately preceding process, chosen so as to maximize the variance of the projections of the subjects in the sample data. For example, the learning processing unit 130 generates a function f(x) of the vector x (x = {x_i1, x_i2, ..., x_is}) of s variables x_ij indicating, for example, the presence or absence of the s genes g_ij included in the set of selected explanatory variables. f(x) may be a linear function of the elements of the vector x.
 As one example, assume that the set of selected explanatory variables contains the 10th, 23rd, and 45th genes. For convenience of explanation, suppose the learning processing unit 130 plots the subject of each sample data point in a three-dimensional space whose axes are, for example, the expression levels of the 10th, 23rd, and 45th genes. For example, if a subject's expression level of the 10th gene g10 is 0.1, that of gene g23 is 0.3, and that of gene g45 is 0.2, the learning processing unit 130 plots that subject at the point (0.1, 0.3, 0.2) in the three-dimensional space.
After completing the plotting of all subjects, or of at least a predetermined reference number of subjects, the learning processing unit 130 generates a first linear function f1(x) = a10·x10 + a23·x23 + a45·x45 + const1 in the three-dimensional space. Here, when the coordinates of each subject's plot are input to the first linear function f1(x) to obtain one output value per subject, the learning processing unit 130 optimizes the coefficients aij corresponding to the genes xij so that the variance of those output values is maximized.
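The variance-maximizing coefficient vectors of S162 correspond to principal component directions, so one way to realize this step is an eigendecomposition of the sample covariance matrix. A minimal sketch for the three-gene example above, using NumPy; the expression values are hypothetical:

```python
import numpy as np

# Hypothetical expression levels of genes g10, g23, g45 (one row per subject)
X = np.array([
    [0.1, 0.3, 0.2],
    [0.4, 0.1, 0.5],
    [0.2, 0.6, 0.1],
    [0.5, 0.2, 0.4],
    [0.3, 0.5, 0.3],
])

# Eigenvectors of the covariance matrix give candidate coefficient vectors a;
# the eigenvector with the largest eigenvalue maximizes the variance of f1(x) = a.x
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]   # decreasing variance: f1, f2, f3
coeffs = eigvecs[:, order]          # column k holds the coefficients of f_{k+1}

f1_outputs = X @ coeffs[:, 0]       # projections of all subjects onto f1
f2_outputs = X @ coeffs[:, 1]       # second-largest variance, as in S162
```

By construction the variance of `f1_outputs` equals the largest eigenvalue, and each subsequent function captures the next-largest variance.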
Next, the learning processing unit 130 generates a second linear function f2(x) different from the first linear function f1(x). For example, the learning processing unit 130 optimizes the coefficients aij corresponding to the genes xij so that the variance of the output values is the second largest, after that of the first linear function f1(x). The learning processing unit 130 may further generate a third linear function f3(x) and subsequent linear functions in decreasing order of the variance of their output values. In this way, the learning processing unit 130 generates a predetermined number of linear functions in decreasing order of output-value variance.
Next, in S164, the learning processing unit 130 selects, from among the plurality of generated functions, at least one function to be used for determining whether the event occurs in the plurality of sample data. Here, the learning processing unit 130 determines a combination of functions for which, when each subject is plotted in a multidimensional space whose axes are the output values of the functions, the boundary between occurrence and non-occurrence of the event can be determined more clearly. For example, when only some of the generated functions are to be selected, the learning processing unit 130 may calculate, for each linear function, the correlation coefficient between its output values for the subjects and the occurrence (or degree of occurrence) of the event in those subjects, and select one or more functions whose correlation coefficients have large absolute values. For convenience, this description assumes that the learning processing unit 130 selects the first linear function f1(x) and the second linear function f2(x) generated in S162.
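The correlation-based screening of S164 can be sketched as follows; the output values and event labels are hypothetical, and keeping the top two functions is an illustrative choice:

```python
import numpy as np

# Hypothetical output values of three candidate functions (one row per subject)
outputs = np.array([
    [2.1, 0.3, 1.0],
    [1.8, 0.1, 1.3],
    [0.4, 0.2, 0.9],
    [0.2, 0.4, 1.1],
])
event = np.array([1, 1, 0, 0])  # 1 = event (disease) occurred in the subject

# Absolute correlation of each function's outputs with event occurrence
corrs = np.array([abs(np.corrcoef(outputs[:, k], event)[0, 1])
                  for k in range(outputs.shape[1])])
selected = np.argsort(corrs)[::-1][:2]  # keep the two most correlated functions
```

Functions whose outputs barely co-vary with the event contribute little to the discrimination boundary and are dropped.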
Next, in S166, the learning processing unit 130 generates a multidimensional space whose dimensions are the values of the at least one selected function. For example, the learning processing unit 130 inputs the gene values of the selected explanatory variables of the plurality of subjects in the sample data into the selected functions, and plots the resulting values in a coordinate space having one axis per function.
FIG. 8 shows an example of the multidimensional space generated by the learning processing unit 130. Here, an example is shown in which the learning processing unit 130 generates a two-dimensional space from two functions. Each point in the graph corresponds to a subject in the sample data.
For example, the learning processing unit 130 plots in the two-dimensional space, together with the disease outcome, the point (o1, o2) whose component o1 on axis LD1 is the output value obtained by inputting the gene expression levels or the like of the subject's selected explanatory variables into the first function f1(x), and whose component o2 on axis LD2 is the output value obtained by inputting them into the second function f2(x). In FIG. 8, the dotted and solid circles indicate subjects in whom the disease did not occur (i.e., healthy subjects), and the dotted and solid crosses indicate subjects in whom the disease occurred (i.e., non-healthy subjects).
Next, in S168, the learning processing unit 130 generates a discriminant function that predicts the occurrence of the event from the selected explanatory variables. For example, the learning processing unit 130 generates the discriminant function that most accurately separates healthy subjects (no disease) from non-healthy subjects (disease) based on various discrimination methods such as linear discrimination, quadratic discrimination, self-organizing maps, and support vector machines. FIG. 8 shows the case in which the learning processing unit 130 generates a linear discriminant function TH. Subjects plotted above the linear discriminant function TH shown in FIG. 8 are predicted to be non-healthy, and subjects plotted below it are predicted to be healthy. In this way, by generating a discriminant function, the learning processing unit 130 learns a prediction model that predicts the occurrence of the event based on position in the multidimensional space. For example, by selecting the first function f1(x) and the second function f2(x), the learning processing unit 130 can determine a clear boundary TH separating healthy and non-healthy subjects in the two-dimensional space whose axes are the output values of these functions.
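Of the discrimination methods listed for S168, the linear case can be sketched with Fisher's linear discriminant: a weight vector from the pooled within-class covariance and the class means, with the boundary TH placed at the midpoint score. The (o1, o2) coordinates below are hypothetical:

```python
import numpy as np

# Hypothetical (o1, o2) coordinates in the space of FIG. 8
healthy = np.array([[0.2, 0.1], [0.3, 0.2], [0.1, 0.3], [0.4, 0.1]])
diseased = np.array([[0.8, 0.9], [0.9, 0.7], [0.7, 0.8], [0.6, 0.9]])

# Fisher linear discriminant: w = Sw^-1 (m1 - m0), boundary at the midpoint score
m0, m1 = healthy.mean(axis=0), diseased.mean(axis=0)
Sw = np.cov(healthy, rowvar=False) + np.cov(diseased, rowvar=False)
w = np.linalg.solve(Sw, m1 - m0)
threshold = w @ ((m0 + m1) / 2.0)

def predict_diseased(point):
    """True if the point falls on the non-healthy side of the boundary TH."""
    return bool(np.dot(w, point) >= threshold)
```

Quadratic discriminants, self-organizing maps, or support vector machines would replace only the boundary-fitting step; the plotting and evaluation steps are unchanged.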
Next, in S172, the learning processing unit 130 evaluates the prediction model using the learned discriminant function. For example, the learning processing unit 130 evaluates the accuracy with which the discriminant function predicts the occurrence of the event. In FIG. 8, a dotted circle indicates a subject for whom the discriminant function predicted the disease but who actually has no disease; a solid circle indicates a subject for whom the discriminant function did not predict the disease and who actually has no disease; a dotted cross indicates a subject for whom the discriminant function did not predict the disease but who actually has the disease; and a solid cross indicates a subject for whom the discriminant function predicted the disease and who actually has the disease.
As an example, the learning processing unit 130 calculates, as the evaluation of the discriminant function, at least one of the sensitivity (the proportion of solid crosses among all crosses in FIG. 8) and the specificity (the proportion of solid circles among all circles), or the average of the two, when the presence or absence of the disease is predicted from the gene expression of the plurality of subjects in the sample data by the discriminant functions generated by the various discrimination methods. The evaluation of the discriminant function serves as the evaluation of the prediction model corresponding to that discriminant function.
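The sensitivity and specificity used as the evaluation in S172 reduce to ratios over the four prediction/outcome cells of FIG. 8; a minimal sketch with hypothetical labels:

```python
def sensitivity_specificity(predicted, actual):
    """predicted/actual: lists of booleans, True = disease.

    Sensitivity = true positives / all actually diseased (solid x / all x in FIG. 8);
    specificity = true negatives / all actually healthy (solid o / all o).
    """
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = sensitivity_specificity(
    predicted=[True, True, False, False, True],
    actual=[True, False, False, True, True],
)
```

Averaging the two values, as the text allows, gives a single score that is insensitive to class imbalance between healthy and non-healthy subjects.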
In this way, by plotting the sample data in a multidimensional space generated from a plurality of variance-maximizing functions, the learning processing unit 130 learns a prediction model with an increased likelihood of being able to discriminate the occurrence of the event based on the plurality of selected explanatory variables.
FIG. 9 shows an example of a processing flow for learning a prediction model by the maximum likelihood estimation method. By executing the processing of S262 to S266, the learning processing unit 130 may perform the processing of S160 using maximum likelihood estimation.
First, in S262, the learning processing unit 130 generates a likelihood function. For example, based on the sample data, the learning processing unit 130 calculates a likelihood function lik(θ) = fD(xi | θ) that takes as input the variables xi indicating the presence/absence, expression level, or the like of the genes gi included in the set of selected explanatory variables and outputs the likelihood θ of the event occurring. As an example, when the set of selected explanatory variables includes the 10th, 23rd, and 45th genes, the learning processing unit 130 calculates the likelihood function lik(θ) = fD(x10, x23, x45 | θ).
Next, in S264, the learning processing unit 130 determines, based on the likelihood function, whether the event occurs for each subject in the sample data. For example, the learning processing unit 130 inputs the genes included in each subject's set of selected explanatory variables into the corresponding likelihood function to calculate the likelihood that the disease occurs in that subject. The learning processing unit 130 determines that the disease does not occur when the likelihood falls below a predetermined reference (for example, 0.5), and determines that the disease occurs when the likelihood is at or above the reference.
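The threshold judgment of S264 can be sketched as follows. The text only specifies the form lik(θ) = fD(x | θ), so the logistic function, weights, and bias below are assumed purely for illustration:

```python
import math

def likelihood(x10, x23, x45, weights=(2.0, 1.5, -1.0), bias=-0.8):
    """Assumed logistic form of lik(theta) for the three-gene example;
    the weights and bias are hypothetical, not from the source."""
    z = weights[0] * x10 + weights[1] * x23 + weights[2] * x45 + bias
    return 1.0 / (1.0 + math.exp(-z))

def judge(x10, x23, x45, reference=0.5):
    """S264: the disease is judged to occur iff the likelihood is at or
    above the predetermined reference (0.5 in the text's example)."""
    return likelihood(x10, x23, x45) >= reference
```

Whatever functional form fD takes, S264 only consumes its scalar output, so the judgment step is independent of how the likelihood was modeled.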
Next, in S266, the learning processing unit 130 evaluates the accuracy of the likelihood function. For example, the learning processing unit 130 compares the actual occurrence of the event for each subject in the sample data with the occurrence predicted for that subject by the likelihood function. As an example, the learning processing unit 130 may evaluate the likelihood function by sensitivity and specificity, as in the processing of S172.
FIG. 10 shows an example of a processing flow for learning a prediction model by the Bayesian method. By executing the processing of S362 to S366, the learning processing unit 130 may perform the processing of S160 using the Bayesian method.
First, in S362, the learning processing unit 130 calculates, for the sample data, the posterior probability that each target gives rise to the event. For example, the learning processing unit 130 may use as the prior probability the frequency of the set of explanatory variables described in connection with FIG. 5, or the product of the event occurrence probabilities of the selected explanatory variables described in connection with FIG. 6, use the likelihood function generated by the processing of FIG. 9 as the likelihood, and calculate the posterior probability based on the product of the prior probability and the likelihood. The learning processing unit 130 may also calculate the posterior probability by a sampling algorithm based on a Markov chain Monte Carlo method such as the Metropolis-Hastings method.
Next, in S364, the learning processing unit 130 determines whether the event occurs. For example, the learning processing unit 130 calculates a posterior probability distribution based on the prior probability derived from the genes included in each subject's set of selected explanatory variables and on the result of inputting those genes into the corresponding likelihood function. The learning processing unit 130 determines that the disease does not occur when the posterior probability (for example, the mean, median, or mode of the posterior probability distribution) falls below a predetermined reference (for example, 0.5), and determines that the disease occurs when the posterior probability is at or above the reference.
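For the binary disease/no-disease case, the product of prior and likelihood in S362, normalized over the two hypotheses, is just Bayes' rule; a minimal sketch with hypothetical numbers:

```python
def posterior(prior, lik_diseased, lik_healthy):
    """S362/S364: posterior probability of disease by Bayes' rule.

    prior: prior probability of disease (e.g. from the frequency of the
    explanatory variable set); lik_*: likelihood of the observed gene values
    under each hypothesis. All values here are hypothetical.
    """
    num = prior * lik_diseased
    return num / (num + (1.0 - prior) * lik_healthy)

p = posterior(prior=0.3, lik_diseased=0.8, lik_healthy=0.2)
judged_diseased = p >= 0.5  # reference 0.5, as in S364
```

When the posterior is instead obtained as a Metropolis-Hastings sample rather than in closed form, the same 0.5 reference is applied to a summary of the sampled distribution (mean, median, or mode), as the text describes.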
Next, in S366, the learning processing unit 130 evaluates the accuracy of the posterior probability. For example, the learning processing unit 130 compares the actual occurrence of the event for each subject in the sample data with the occurrence predicted for that subject from the posterior probability. As an example, the learning processing unit 130 may perform the evaluation by sensitivity and specificity, as in the processing of S172.
FIG. 11 shows an example of discrimination by the maximum likelihood estimation method or the Bayesian method. The x1 and x2 axes of the graph correspond to the expression levels or the like of the two genes when the set of selected explanatory variables includes two genes, and the z axis corresponds to the likelihood in the maximum likelihood estimation method or the posterior probability in the Bayesian method. Each point in the graph corresponds to a subject in the sample data. In FIG. 11, the dotted and solid circles indicate subjects in whom the disease did not occur (i.e., healthy subjects), and the dotted and solid crosses indicate subjects in whom the disease occurred (i.e., non-healthy subjects).
For example, the learning processing unit 130 predicts a subject to be non-healthy when the subject's likelihood or posterior probability is at or above a threshold TH (for example, 0.5), and predicts the subject to be healthy when it is below the threshold TH. In FIG. 11, a dotted circle indicates a subject for whom the disease was predicted but who actually has no disease; a solid circle indicates a subject for whom the disease was not predicted and who actually has no disease; a dotted cross indicates a subject for whom the disease was not predicted but who actually has the disease; and a solid cross indicates a subject for whom the disease was predicted and who actually has the disease.
As an example, the learning processing unit 130 may calculate as the evaluation at least one of the sensitivity (the proportion of solid crosses among all crosses in FIG. 11) and the specificity (the proportion of solid circles among all circles) when the presence or absence of the disease is predicted from the genes of the plurality of subjects in the sample data by the likelihood or the posterior probability.
Thus, in S160, the learning processing unit 130 learns a prediction model that predicts the occurrence of the event from a set of selected explanatory variables, using discriminant analysis, machine learning (self-organizing maps, support vector machines, deep learning, and so on), maximum likelihood estimation, the Bayesian method, or the like. In addition to or instead of these, the learning processing unit 130 may learn the prediction model using multivariate analysis such as multiple regression analysis, principal component analysis, and cluster analysis. The learning processing unit 130 also supplies the set of selected explanatory variables corresponding to each prediction model, together with the evaluation of each prediction model, to the model selection unit 140.
Following S160, in S180, the explanatory variable selection unit 120 determines whether to continue selecting sets of selected explanatory variables. For example, the explanatory variable selection unit 120 ends the selection of sets of selected explanatory variables on condition that the learning process of S160 has been executed a predetermined number of times and/or that S160 has learned a prediction model whose evaluation is at or above a predetermined reference, and advances the processing to S220. When the selection of sets of selected explanatory variables is not to be ended, the explanatory variable selection unit 120 advances the processing to S200.
In S200, the explanatory variable selection unit 120 selects, via the initialization unit 122 or the generation unit 124, at least one set of selected explanatory variables from among the plurality of explanatory variables. For example, the initialization unit 122 of the explanatory variable selection unit 120 may randomly select a set of selected explanatory variables containing a plurality of (for example, s) selected explanatory variables, by the same processing as in S140. The initialization unit 122 repeatedly selects sets of selected explanatory variables in the loop of S160 to S200. In each selection of S200 during the iteration, the initialization unit 122 may randomly select a set of selected explanatory variables without depending on the sets selected in the past. Here, in selecting a set of selected explanatory variables independently of past selections, the initialization unit 122 may adopt the bootstrap method, allowing a set identical to a previously selected one to be selected again, or may adopt the jackknife method, excluding previously selected sets from selection.
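The bootstrap/jackknife distinction above amounts to whether a freshly drawn set may repeat one already in the selection history. A minimal sketch, with hypothetical gene names and set size:

```python
import random

def select_initial_set(genes, s, history, method="bootstrap", rng=random):
    """Randomly pick s genes as a set of selected explanatory variables.

    bootstrap: a previously selected set may recur;
    jackknife: sets already in `history` are excluded and redrawn.
    """
    while True:
        chosen = frozenset(rng.sample(genes, s))
        if method == "bootstrap" or chosen not in history:
            return chosen

# Hypothetical pool of 50 candidate genes; draw five distinct 3-gene sets
genes = [f"g{i}" for i in range(1, 51)]
history = set()
for _ in range(5):
    history.add(select_initial_set(genes, 3, history, method="jackknife"))
```

With the jackknife option, each returned set is guaranteed new, so the history grows by one set per call.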
Instead of selection by the initialization unit 122, the generation unit 124 of the explanatory variable selection unit 120 may generate the set of selected explanatory variables. For example, the generation unit 124 may use a Markov chain Monte Carlo method to generate sets of selected explanatory variables sequentially, starting from an initial set. In this way, the generation unit 124 generates a set of selected explanatory variables close to the previously selected set.
FIG. 12 shows an example of a method of generating sets of selected explanatory variables by the Markov chain Monte Carlo method. The horizontal axis of the graph shows N arranged sets of explanatory variables G1 to GN. In FIG. 12, the sets of explanatory variables are arranged in order of closeness of their combinations. For example, if the set of explanatory variables corresponding to G1 includes genes g1, g2, and g3, the set of explanatory variables corresponding to G2, adjacent to G1, may include genes g1, g2, and g4, in which only gene g3 has been replaced by another gene g4 close to g3.
For example, in FIG. 12, the plurality of explanatory variable sets may be arranged along the horizontal axis such that the distance between sets reflects their similarity, based on the similarity of the sets of genes (for example, edit distance) and/or the similarity of the genes themselves. The sets of explanatory variables arranged in FIG. 12 may all contain the same number of explanatory variables (for example, three) or varying numbers (for example, two and three). Although FIG. 12 shows the plurality of sets arranged in one dimension for the sake of explanation, the plurality of explanatory variable sets may be arranged in multiple dimensions.
In the graph of FIG. 12, Gi denotes the set of selected explanatory variables selected in the immediately preceding selection (S140 or the previous processing of S200), and the vertical axis shows the selection probability Ps with which the generation unit 124 selects each set of explanatory variables. That is, in S200, the generation unit 124 selects a set of explanatory variables with a probability according to the selection probability Ps.
As shown in FIG. 12, the set whose combination of explanatory variables is closest to the previously selected set Gi has the highest selection probability Ps, and the selection probability Ps decreases gradually with distance from Gi. For example, FIG. 12 shows a probability distribution that is a normal distribution peaking at Gi. In this way, the generation unit 124 generates a set of selected explanatory variables whose combination of explanatory variables is close to the previously selected set. Note that the generation unit 124 may avoid re-selecting the previously selected set Gi or any set selected in the past.
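The normal-shaped selection probability of FIG. 12 can be sketched as a discretized Gaussian proposal over the arranged set indices; the width `sigma` and the index arrangement are assumptions for illustration:

```python
import math
import random

def propose_next(index_of_current, n_sets, sigma=2.0, rng=random):
    """Pick the index of the next set with probability Ps peaking at the
    current set G_i (discretized normal proposal, as in the FIG. 12 sketch;
    sigma is a hypothetical width parameter)."""
    weights = [math.exp(-((k - index_of_current) ** 2) / (2 * sigma ** 2))
               for k in range(n_sets)]
    r = rng.random() * sum(weights)
    for k, w in enumerate(weights):
        r -= w
        if r <= 0:
            return k
    return n_sets - 1

rng = random.Random(0)
draws = [propose_next(50, 101, sigma=2.0, rng=rng) for _ in range(2000)]
mean_draw = sum(draws) / len(draws)
```

Most proposals land within a few positions of the current set, so the chain explores the neighborhood of Gi while still occasionally jumping farther away.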
FIG. 13 shows a modification of the processing flow of S200 by the explanatory variable selection unit 120. In this example, by executing the processing of S202 to S206, the explanatory variable selection unit 120 has either the initialization unit 122 or the generation unit 124 select the set of selected explanatory variables according to the evaluation of the previously selected set.
First, in S202, the explanatory variable selection unit 120 determines whether the evaluation of the previously selected set of selected explanatory variables falls below a predetermined reference. For example, the explanatory variable selection unit 120 determines whether the evaluation generated in the processing of S172, S266, or S366 for the set of selected explanatory variables generated in the processing of S140 or S200 falls below the reference. The explanatory variable selection unit 120 advances the processing to S204 when it determines that the evaluation is below the reference, and otherwise advances the processing to S206.
Instead of judging the evaluation of the single previously selected set of selected explanatory variables, the explanatory variable selection unit 120 may determine whether the evaluations of the most recent predetermined number of consecutive selections all fall below the reference. For example, the explanatory variable selection unit 120 may advance the processing to S204 when the evaluations of the sets of selected explanatory variables generated in the last ten iterations are all below the reference value.
In S204, the initialization unit 122 newly selects an initial set of selected explanatory variables. For example, the initialization unit 122 may randomly determine the new initial set of selected explanatory variables by executing the same processing as in S140. For example, the initialization unit 122 may randomly select a new initial set of selected explanatory variables by the bootstrap method or the jackknife method, without depending on the sets selected in the past.
In S206, the generation unit 124 may use the Markov chain Monte Carlo method to generate sets of selected explanatory variables sequentially from the initial set. For example, the generation unit 124 may generate the set of selected explanatory variables by the technique described with reference to FIG. 12.
Thus, according to this modification, the initialization unit 122 determines a new initial set of selected explanatory variables on condition that the evaluation of the prediction model learned from the at least one selected explanatory variable chosen by the initialization unit 122 is below the reference, and that the evaluation of the prediction model learned from the set of selected explanatory variables generated by the generation unit 124 is below the reference. In this way, the explanatory variable selection unit 120 resets the set of selected explanatory variables serving as the starting point of the Markov chain Monte Carlo method when no highly evaluated set can be obtained. That is, when no accurate prediction model is obtained in a certain region, the explanatory variable selection unit 120 judges the search for explanatory variable sets in that region to be unpromising and starts a search in another region, making the search over sets of explanatory variables more efficient.
Thereafter, in the subsequent iterations of the loop of S160 to S200, the generation unit 124 generates further sets of selected explanatory variables sequentially from the new initial set. In this way, the data processing apparatus 10 of this modification can continue to search for and evaluate other sets of selected explanatory variables in the vicinity of a well-evaluated set (that is, a set of explanatory variables likely to contain the causal factors), and can thereby search for causal factors efficiently.
Further, according to this modification, the explanatory variable selection unit 120 can have the initialization unit 122 randomly select sets of selected explanatory variables until a set whose evaluation is at or above the reference is obtained, and switch to selection by the generation unit 124 using the Markov chain Monte Carlo method on condition that the evaluation of the prediction model learned from a randomly selected set of selected explanatory variables is at or above the reference.
Thus, the data processing apparatus 10 of this modification can try combinations of selected explanatory variables at random and, once a set of selected explanatory variables yields an evaluation above a certain level, search in the vicinity of that set for even more promising sets of selected explanatory variables as candidate combinations of causal explanatory variables.
FIG. 14 shows a modification of the generation of sets of selected explanatory variables by the generation unit 124 using the Markov chain Monte Carlo method. When generating sets of selected explanatory variables by the Markov chain Monte Carlo method, the generation unit 124 may use a probability distribution of fixed shape, or may instead use, as the proposal distribution, probability distributions of different shapes such as those shown by the solid and dotted lines in FIG. 14.
For example, the generation unit 124 changes the proposal distribution in the Markov chain Monte Carlo method according to the evaluation of the prediction model learned from the set of selected explanatory variables. As one example, when generating a set of selected explanatory variables in S206, the generation unit 124 may use a distribution whose variance is negatively correlated with the evaluation of the previously selected set (e.g., a variance inversely proportional to the evaluation value). As a result, when the evaluation of the previously selected set is high, the generation unit 124 selects the next set from a narrower probability distribution (e.g., the dotted distribution in FIG. 14), so the probability of selecting a set close to the previous one increases. This allows the data processing device 10 to search efficiently for highly evaluated sets of selected explanatory variables.
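One way to realize such an evaluation-dependent proposal is to let the number of variables swapped per step shrink as the evaluation rises, so that a well-evaluated set proposes only nearby sets. The sketch below is an illustration under our own assumptions (the swap-count rule and the score range [0, 1] are not specified by the patent):

```python
import random

def propose_next_set(current, score, n_vars, rng=None):
    """Propose the next set of selected explanatory variables.  The
    number of variables swapped out is negatively correlated with the
    current evaluation `score` (assumed to lie in [0, 1]): a high score
    narrows the proposal so the next set stays close to the current one."""
    rng = rng or random.Random()
    s = len(current)
    n_swap = max(1, round(s * (1.0 - score)))  # narrower proposal when score is high
    keep = rng.sample(current, s - n_swap)
    pool = [v for v in range(n_vars) if v not in current]
    return keep + rng.sample(pool, n_swap)
```

With a score near 1, only one variable is exchanged per step; with a score near 0, the whole set is redrawn, mimicking the wide solid-line distribution of FIG. 14.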
In S200, after the initialization unit 122 or the generation unit 124 selects/generates a set of selected explanatory variables, the explanatory variable selection unit 120 supplies the set to the learning processing unit 130 and returns the process to S160.
In S220, the model selection unit 140 preferentially selects, from among the plurality of prediction models learned by the learning processing unit 130 for the different sets of selected explanatory variables, models with higher evaluations (e.g., a lower Wilks' lambda statistic, higher sensitivity or specificity, and/or a smaller Akaike Information Criterion (AIC) statistic). For example, the model selection unit 140 selects, from the plurality of prediction models generated by the loop processing of S160 to S200, the prediction model corresponding to the set of selected explanatory variables with the highest evaluation. Alternatively, the model selection unit 140 may select a prediction model corresponding to a set of selected explanatory variables with a probability whose magnitude corresponds to the evaluation value. The model selection unit 140 supplies the selected prediction model to the determination unit 150.
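The priority rule of S220 — lower Wilks' lambda and lower AIC are better — might be approximated as below. The tie-breaking order is an assumption for illustration; the patent only requires that more highly evaluated models be preferred.

```python
def select_best_model(candidates):
    """Pick the most highly evaluated prediction model.  Each candidate
    pairs a set of selected explanatory variables with its metrics;
    lower Wilks' lambda is better, and a lower AIC breaks ties.  This
    particular ordering is illustrative, not prescribed by the patent."""
    return min(candidates, key=lambda c: (c["wilks_lambda"], c["aic"]))
```

A probabilistic variant, as the text also allows, would instead sample a candidate with probability increasing in its evaluation.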
Next, in S240, the determination unit 150 determines the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit 140 as the cause explanatory variable set. In this way, the determination unit 150 can identify the set of selected explanatory variables that yields a highly evaluated prediction model as the cause of the event; for example, it can identify a set of genes for which a prediction model accurately predicts the occurrence of a disease as the disease-causing genes. In particular, the data processing device 10 includes a process of randomly selecting the selected explanatory variables. Consequently, the data processing device 10 can identify the cause of an event with higher accuracy than when the set of selected explanatory variables is generated by picking explanatory variables in order of their individual contribution to the occurrence of the event.
In S140 and S200 of the flow of FIG. 2, the data processing device 10 selects s explanatory variables to determine a set of selected explanatory variables. Here, the data processing device 10 may apply the flow of FIG. 2 with different values of s to determine an appropriate value of s. For example, the data processing device 10 executes a predetermined number of S160-S200 loops for each of s = 2, 3, ..., m, obtains the evaluation of the prediction model selected in S220 after the loop processing for each s, and determines the value of s beyond which the evaluation no longer improves. As one example, if the evaluation improves by at least a predetermined margin from s = 2 up to s = 6 but no longer improves by that margin from s = 7 onward, the data processing device 10 may determine s = 6 as the appropriate value for determining the set of selected explanatory variables and continue the search by performing additional S160-S200 loops for s = 6. In this way, the data processing device 10 can prevent overfitting caused by increasing the number of explanatory variables.
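The stopping rule for the subset size s — grow s while the best evaluation keeps improving by a meaningful margin — can be sketched like this. The margin `min_gain` and the higher-is-better score convention are assumptions made for the sketch:

```python
def choose_subset_size(score_by_s, min_gain=0.01):
    """Given the best evaluation (higher is better) obtained for each
    subset size s, return the size beyond which adding another
    explanatory variable no longer improves the evaluation by at least
    `min_gain`, guarding against overfitting as variables are added."""
    sizes = sorted(score_by_s)
    best_s = sizes[0]
    for prev, cur in zip(sizes, sizes[1:]):
        if score_by_s[cur] - score_by_s[prev] < min_gain:
            break  # improvement has saturated; stop growing s
        best_s = cur
    return best_s
```

In the s = 6 example of the text, the scores would rise steadily up to s = 6 and flatten from s = 7, so the function returns 6.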
FIG. 15 shows a parallel processing device 12 according to a modification of the present embodiment that implements parallel processing. In this modification, the parallel processing device 12 differs from the data processing device 10 in that it includes a first processing unit 102 and a plurality of second processing units 104, and executes parallel processing using the plurality of second processing units 104.

The first processing unit 102 has the functions of the acquisition unit 110, the model selection unit 140, and the determination unit 150. Each of the plurality of second processing units 104 has the functions of the explanatory variable selection unit 120 and the learning processing unit 130.
The acquisition unit 110 of this modification supplies the acquired sample data to the explanatory variable selection units 120 of the plurality of second processing units 104. In the plurality of second processing units 104, the plurality of explanatory variable selection units 120 select sets of selected explanatory variables from the plurality of explanatory variables in parallel, and the plurality of learning processing units 130 learn prediction models for the respective sets in parallel. In this case, the parallel processing device 12 executes the loop processing of S160-S200 in parallel using the plurality of explanatory variable selection units 120 and the plurality of learning processing units 130. Alternatively, the first processing unit 102 may include the explanatory variable selection unit 120 while each second processing unit 104 includes only the learning processing unit 130, so that the parallel processing device 12 parallelizes only the learning processing of the learning processing units 130. In this case, the parallel processing device 12 executes the processing of S160 in parallel using the plurality of learning processing units 130.
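A minimal fan-out of the S160-S200 loop across workers, in the spirit of the second processing units 104, might look like the following Python sketch. Threads stand in for GPGPU or cluster workers, and the evaluation is again a toy stand-in rather than an actual model-training step:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def search_worker(seed, n_vars=10, s=3, n_iters=200):
    """One second processing unit: independently draws random sets of
    explanatory variables and returns the best (score, set) it finds.
    The score is a toy stand-in for training and evaluating a model."""
    rng = random.Random(seed)
    best = (-1.0, ())
    for _ in range(n_iters):
        subset = tuple(sorted(rng.sample(range(n_vars), s)))
        score = len(set(subset) & {1, 3, 5}) / 3.0  # toy evaluation
        best = max(best, (score, subset))
    return best

def parallel_search(n_workers=4):
    # First processing unit 102: fan the search out and keep the overall best.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(search_worker, range(n_workers)))
    return max(results)
```

A real deployment would replace the thread pool with GPGPU kernels, cluster nodes, or Spark tasks as described below, and would let workers exchange their best sets as the text describes.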
Each of the plurality of second processing units 104 independently selects sets of selected explanatory variables, at random or by the Markov chain Monte Carlo method or the like. The second processing units 104 may communicate their selected sets and evaluation information to one another so as to select mutually different sets. Further, when one of the second processing units 104 generates a prediction model whose evaluation exceeds the reference, the other second processing units 104 may be assigned to search the neighborhood of that model's set of selected explanatory variables. In this way, the second processing units 104 can make the search for prediction models expected to have high evaluations more efficient.

As a result, the second processing units 104 can process the selection of sets of selected explanatory variables and the learning of prediction models in parallel across many processing entities, improving processing efficiency. The first processing unit 102 may also communicate information such as the sets of selected explanatory variables with the plurality of second processing units 104 and control their parallel processing.
The first processing unit 102 may be realized by, for example, a general-purpose CPU, and each of the plurality of second processing units 104 may be realized by a general-purpose GPU (GPGPU), a dedicated CPU, or the like. An example of a general-purpose GPU platform is CUDA (Compute Unified Device Architecture) developed by NVIDIA. Alternatively, the first processing unit 102 and the plurality of second processing units 104 may be realized by a parallelized FPGA (Field-Programmable Gate Array), a cluster of many information processing devices (a computer cluster), a plurality of virtual machine images accessed via a network (e.g., machine images deployed by a cloud service), and/or a plurality of cores within a processor (a many-core CPU). An example of a parallelized FPGA is the RIVYERA model from SciEngines GmbH. Examples of servers equipped with many-core CPUs include the Hewlett-Packard HP ProLiant DL980 G7 (80-core CPU) and HP Integrity Superdome X Server (240-core CPU). An example of virtual machine images is Amazon Web Services' AMIs (Amazon Machine Images). The plurality of second processing units 104 may also execute parallel processing by in-memory parallel distributed processing; an example of such technology is Apache Spark.
In the present embodiment and its modifications, examples were described in which the data processing device 10 and the parallel processing device 12 (collectively, the data processing device 10 and the like) acquire sample data associating the presence or absence, or the expression level, of specific genes with the presence or absence of a disease and estimate the disease-causing genes; however, the applications of the data processing device 10 and the like are not limited to this.

For example, the data processing device 10 and the like may acquire sample data including the drug resistance of a pest such as an insect and the gene sequence information of the pest, and identify a combination of genes that contributes to the drug resistance.
Also, for example, the data processing device 10 and the like may acquire sample data including the gene sequence information of a plurality of closely related species and identify a combination of genes that serves as an index when constructing an evolutionary phylogenetic tree. In this case, the data processing device 10 and the like generates a branching-diagram pattern from each of the selected sets of selected explanatory variables (gene sets) and determines whether each branching-diagram pattern belongs to the majority or the minority. The data processing device 10 and the like gives a high evaluation to sets of selected explanatory variables that yield branching diagrams belonging to the majority, and a low evaluation to sets that yield branching diagrams belonging to the minority.
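The majority/minority scoring of branching-diagram patterns can be sketched by counting how often each pattern occurs and scoring by relative frequency. The scoring formula below is our own assumption; the patent prescribes only that majority patterns receive higher evaluations than minority ones.

```python
from collections import Counter

def score_branch_patterns(patterns):
    """Score each branching-diagram pattern by how common it is: the
    most frequent pattern(s) get 1.0 and rarer patterns get a score
    proportional to their frequency.  The exact formula is illustrative."""
    counts = Counter(patterns)
    most = max(counts.values())
    return {p: c / most for p, c in counts.items()}
```

Here patterns could be, for instance, Newick-style strings derived from each gene set's phylogenetic tree.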
Furthermore, the applications of the data processing device 10 and the like are not limited to identifying genes as selected explanatory variables. The data processing device 10 and the like can be used to identify the cause explanatory variables of any phenomenon in which some of a plurality of explanatory variables contribute to an event. For example, the data processing device 10 and the like can be used to identify factors in purchasing behavior, causes of stock price fluctuations, information propagation in networks, or causes of natural phenomena such as weather.
FIG. 16 shows an example of the effect of learning according to the present embodiment. The vertical axis of the graph shows the Wilks' lambda statistic of the finally obtained prediction model; lower values indicate higher evaluations (i.e., higher prediction accuracy). The horizontal axis shows the number (s) of explanatory variables included in the set of selected explanatory variables. In the example of FIG. 16, prediction models that predict the occurrence of colorectal cancer from genes are generated using gene expression levels in tissues from 18 colorectal cancer patients and 18 healthy subjects. The solid line in the graph shows the results obtained when, in the processing flow described with FIG. 2, the sets of selected explanatory variables were selected at random (e.g., in the processing of S140 and S200 of FIG. 2) from all explanatory variables in the sample data, without depending on past selections. The broken line shows the results when the sets of selected explanatory variables were generated by selecting explanatory variables from the plurality of explanatory variables in descending order of their individual contribution to the occurrence of the event as measured by a t-test.
As shown, the results from the set of selected explanatory variables finally chosen from the repeatedly and randomly generated sets according to the present embodiment are superior to the results from the sets generated by the t-test. In general, the difference between the two tends to grow as the number of explanatory variables increases, but it saturates around nine or more explanatory variables. In this case, therefore, the set of selected explanatory variables can be searched for efficiently by executing the processing flow of FIG. 2 with s = 9.
FIG. 17 shows an example of the hardware configuration of a computer 1900 that functions as the data processing device 10 or the like. The computer 1900 according to the present embodiment includes: a CPU peripheral section having a CPU 2000, a RAM 2020, a graphics controller 2075, and a display device 2080 interconnected by a host controller 2082; an input/output section having a communication interface 2030, a hard disk drive 2040, and a CD-ROM drive 2060 connected to the host controller 2082 by an input/output controller 2084; and a legacy input/output section having a ROM 2010, a flexible disk drive 2050, and an input/output chip 2070 connected to the input/output controller 2084.

The host controller 2082 connects the RAM 2020 with the CPU 2000 and the graphics controller 2075, which access the RAM 2020 at a high transfer rate. The CPU 2000 operates based on programs stored in the ROM 2010 and the RAM 2020 (e.g., a parallel processing program) and controls each unit. The graphics controller 2075 acquires image data that the CPU 2000 and the like generate in a frame buffer provided in the RAM 2020 and displays it on the display device 2080. Alternatively, the graphics controller 2075 may internally include a frame buffer that stores the image data generated by the CPU 2000 and the like.

The input/output controller 2084 connects the host controller 2082 with the communication interface 2030, the hard disk drive 2040, and the CD-ROM drive 2060, which are relatively high-speed input/output devices. The communication interface 2030 communicates with other devices via a network, by wire or wirelessly; it functions as hardware that performs communication. The hard disk drive 2040 stores programs and data used by the CPU 2000 in the computer 1900. The CD-ROM drive 2060 reads programs or data from a CD-ROM 2095 and provides them to the hard disk drive 2040 via the RAM 2020.

The input/output controller 2084 is also connected to the ROM 2010, the flexible disk drive 2050, and the input/output chip 2070, which are relatively low-speed input/output devices. The ROM 2010 stores a boot program executed by the computer 1900 at startup and/or programs dependent on the hardware of the computer 1900. The flexible disk drive 2050 reads programs or data from a flexible disk 2090 and provides them to the hard disk drive 2040 via the RAM 2020. The input/output chip 2070 connects the flexible disk drive 2050 to the input/output controller 2084 and also connects various input/output devices to the input/output controller 2084 via, for example, a parallel port, a serial port, a keyboard port, and a mouse port.

Programs provided to the hard disk drive 2040 via the RAM 2020 are stored on a recording medium such as the flexible disk 2090, the CD-ROM 2095, or an IC card, and are provided by the user. A program is read from the recording medium, installed on the hard disk drive 2040 in the computer 1900 via the RAM 2020, and executed by the CPU 2000.
A program installed on the computer 1900 to make it function as the data processing device 10 or the like includes an acquisition module, an explanatory variable selection module, an initialization module, a generation module, a learning processing module, a model selection module, and a determination module. These programs or modules may act on the CPU 2000 and the like to make the computer 1900 function as the acquisition unit 110, the explanatory variable selection unit 120, the initialization unit 122, the generation unit 124, the learning processing unit 130, the model selection unit 140, and the determination unit 150, respectively.

When read into the computer 1900, the information processing described in these programs functions as the acquisition unit 110, the explanatory variable selection unit 120, the initialization unit 122, the generation unit 124, the learning processing unit 130, the model selection unit 140, and the determination unit 150, which are concrete means realized by the cooperation of the software and the various hardware resources described above. By realizing, through these concrete means, the computation or processing of information according to the intended use of the computer 1900 in the present embodiment, a data processing device 10 or the like specific to that intended use is constructed.
As one example, when the computer 1900 communicates with an external device or the like, the CPU 2000 executes a communication program loaded on the RAM 2020 and instructs the communication interface 2030 to perform communication processing based on the processing content described in the communication program. Under the control of the CPU 2000, the communication interface 2030 reads transmission data stored in a transmission buffer area or the like provided on a storage device such as the RAM 2020, the hard disk drive 2040, the flexible disk 2090, or the CD-ROM 2095 and transmits it to the network, or writes reception data received from the network into a reception buffer area or the like provided on a storage device. The communication interface 2030 may thus transfer transmission/reception data to and from a storage device by DMA (direct memory access); alternatively, the CPU 2000 may transfer the data by reading it from the source storage device or communication interface 2030 and writing it to the destination communication interface 2030 or storage device.

The CPU 2000 also has all or a necessary portion of a file, database, or the like stored in an external storage device, such as the hard disk drive 2040, the CD-ROM drive 2060 (CD-ROM 2095), or the flexible disk drive 2050 (flexible disk 2090), read into the RAM 2020 by DMA transfer or the like, and performs various processing on the data in the RAM 2020. The CPU 2000 then writes the processed data back to the external storage device by DMA transfer or the like. Since the RAM 2020 can be regarded in such processing as temporarily holding the contents of the external storage device, in the present embodiment the RAM 2020, the external storage devices, and the like are collectively referred to as a memory, a storage unit, or a storage device.

The various information in the present embodiment, such as programs, data, tables, and databases, is stored on such storage devices and becomes the subject of information processing. The CPU 2000 can also hold part of the RAM 2020 in a cache memory and read from and write to the cache memory. Since the cache memory assumes part of the function of the RAM 2020 even in such a form, in the present embodiment the cache memory is also included in the RAM 2020, the memory, and/or the storage devices, except where distinguished.

The CPU 2000 also performs, on data read from the RAM 2020, various processing specified by the program's instruction sequence, including the various computations, information processing, condition determinations, and information searches and replacements described in the present embodiment, and writes the results back to the RAM 2020. For example, when making a condition determination, the CPU 2000 determines whether the various variables shown in the present embodiment satisfy conditions such as being greater than, less than, at least, at most, or equal to other variables or constants, and branches to a different instruction sequence or calls a subroutine when the condition is satisfied (or not satisfied).

The CPU 2000 can also search for information stored in a file, database, or the like in a storage device. For example, when a plurality of entries, in each of which an attribute value of a second attribute is associated with an attribute value of a first attribute, are stored in a storage device, the CPU 2000 can obtain the attribute value of the second attribute associated with a first attribute satisfying a predetermined condition by searching the stored entries for one whose first attribute value matches a specified condition and reading the second attribute value stored in that entry.

The programs or modules described above may be stored on an external recording medium. Besides the flexible disk 2090 and the CD-ROM 2095, usable recording media include optical recording media such as DVDs and CDs, magneto-optical recording media such as MOs, tape media, and semiconductor memories such as IC cards. A storage device such as a hard disk or RAM provided in a server system connected to a dedicated communication network or the Internet may also be used as the recording medium, and the programs may be provided to the computer 1900 via the network.
While the present invention has been described using an embodiment, the technical scope of the present invention is not limited to the scope described in the above embodiment. It will be apparent to those skilled in the art that various changes or improvements can be made to the above embodiment. It is apparent from the claims that embodiments incorporating such changes or improvements can also be included in the technical scope of the present invention.

It should be noted that the order of execution of the processes, such as operations, procedures, steps, and stages, in the devices, systems, programs, and methods shown in the claims, the specification, and the drawings can be realized in any order unless explicitly indicated by terms such as "before" or "prior to", and unless the output of an earlier process is used in a later process. Even if the operation flows in the claims, the specification, and the drawings are described using "first", "next", and the like for convenience, this does not mean that they must be performed in that order.
 10 data processing device, 12 parallel processing device, 20 database, 102 first processing unit, 104 second processing unit, 110 acquisition unit, 120 explanatory variable selection unit, 122 initialization unit, 124 generation unit, 130 learning processing unit, 140 model selection unit, 150 determination unit

Claims (11)

  1.  A data processing device that identifies, from among a plurality of explanatory variables, a cause explanatory variable set, which is a set of at least one explanatory variable that causes a predetermined event, the device comprising:
     an acquisition unit that acquires a plurality of pieces of sample data, each associating the values of the plurality of explanatory variables with whether or not the event occurred;
     an explanatory variable selection unit that repeatedly selects a set of selected explanatory variables from among the plurality of explanatory variables, and in each selection selects the set of selected explanatory variables at random, independently of any sets selected in the past;
     a learning processing unit that, based on the plurality of pieces of sample data, learns, for each of the plurality of sets of selected explanatory variables, a prediction model that predicts whether or not the event occurs from the values of the selected explanatory variables;
     a model selection unit that, from among the plurality of prediction models corresponding to the different sets of selected explanatory variables, preferentially selects a prediction model with a higher evaluation; and
     a determination unit that determines, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.
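By way of illustration only, and not as the claimed implementation, the random-search loop of claim 1 can be sketched as follows. The `train_and_score` helper is an assumption: a toy "prediction model" (nearest class centroid over the chosen variables, evaluated by training accuracy) standing in for whatever learning and evaluation the learning processing unit actually uses.

```python
import random

def train_and_score(samples, subset):
    """Toy 'prediction model': classify by nearest class centroid over the
    chosen explanatory variables, evaluated by training accuracy."""
    by_label = {0: [], 1: []}
    for values, occurred in samples:
        by_label[occurred].append([values[i] for i in subset])
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items() if rows
    }
    correct = 0
    for values, occurred in samples:
        point = [values[i] for i in subset]
        pred = min(centroids,
                   key=lambda c: sum((p - q) ** 2
                                     for p, q in zip(point, centroids[c])))
        correct += (pred == occurred)
    return correct / len(samples)

def find_cause_set(samples, n_variables, set_size=2, iterations=200, seed=0):
    """Repeatedly draw a random variable subset, independent of past draws
    (claim 1), learn a model for each, and keep the highest-scoring subset."""
    rng = random.Random(seed)
    best_subset, best_score = None, -1.0
    for _ in range(iterations):
        subset = tuple(sorted(rng.sample(range(n_variables), set_size)))
        score = train_and_score(samples, subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

On synthetic data in which, say, variables 0 and 3 jointly determine the event and the rest are uninformative, the search recovers the pair {0, 3} because only its model scores perfectly.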
  2.  The data processing device according to claim 1, wherein the explanatory variable selection unit extracts, from the plurality of explanatory variables, a subset of explanatory variables whose individual degree of contribution to the occurrence or non-occurrence of the event satisfies a predetermined criterion, and repeatedly selects the set of selected explanatory variables from the extracted subset.
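The screening step of claim 2 can be sketched as below. The claim does not fix a particular statistic for a variable's individual contribution; the absolute difference between a variable's mean in event samples and in non-event samples is used here purely as an illustrative assumption.

```python
def screen_variables(samples, threshold):
    """Keep only variables whose individual association with the event meets
    a criterion; here, the absolute difference between the variable's mean
    among event samples and among non-event samples (an assumed statistic)."""
    n_vars = len(samples[0][0])
    kept = []
    for i in range(n_vars):
        event_vals = [v[i] for v, occurred in samples if occurred]
        other_vals = [v[i] for v, occurred in samples if not occurred]
        if not event_vals or not other_vals:
            continue
        contribution = abs(sum(event_vals) / len(event_vals)
                           - sum(other_vals) / len(other_vals))
        if contribution >= threshold:
            kept.append(i)
    return kept
```

The random search of claim 1 would then draw its subsets from the returned indices instead of from all variables, shrinking the search space before any model is trained.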
  3.  The data processing device according to claim 1 or 2, wherein the explanatory variable selection unit switches to selection using a Markov chain Monte Carlo method on the condition that the evaluation of a prediction model learned from a randomly selected set of selected explanatory variables is equal to or higher than a reference.
  4.  The data processing device according to claim 3, wherein the explanatory variable selection unit changes the proposal distribution of the Markov chain Monte Carlo method according to the evaluation of the prediction model learned from the set of selected explanatory variables.
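Claims 3 and 4 together can be sketched as a two-phase search: random draws until a reference score is reached, then a Metropolis-style walk over subsets whose proposal width adapts to the current score. The switching threshold, the temperature, and the rule "high score means smaller moves" are all illustrative assumptions, not details fixed by the claims.

```python
import math
import random

def mcmc_refine(score_fn, n_variables, start_subset, start_score,
                steps=100, temperature=0.05, seed=0):
    """Metropolis-style refinement of a variable subset.  A proposal swaps
    k current members for unused variables; k (the proposal distribution's
    width) shrinks as the score improves, per claim 4 (assumed rule)."""
    rng = random.Random(seed)
    current, current_score = set(start_subset), start_score
    best, best_score = set(start_subset), start_score
    for _ in range(steps):
        # Adapt the proposal: high score -> local moves, low score -> bolder.
        k = 1 if current_score >= 0.8 else 2
        k = min(k, len(current))
        outside = [i for i in range(n_variables) if i not in current]
        proposal = set(current)
        for old, new in zip(rng.sample(sorted(current), k),
                            rng.sample(outside, min(k, len(outside)))):
            proposal.discard(old)
            proposal.add(new)
        prop_score = score_fn(tuple(sorted(proposal)))
        # Metropolis acceptance on the score difference.
        if prop_score >= current_score or \
           rng.random() < math.exp((prop_score - current_score) / temperature):
            current, current_score = proposal, prop_score
        if current_score > best_score:
            best, best_score = set(current), current_score
    return tuple(sorted(best)), best_score

def search_with_switch(score_fn, n_variables, set_size, switch_at=0.7,
                       max_random_draws=200, seed=0):
    """Claim 3: draw subsets at random until one scores at or above the
    reference `switch_at`, then switch to MCMC-style refinement."""
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(max_random_draws):
        subset = tuple(sorted(rng.sample(range(n_variables), set_size)))
        score = score_fn(subset)
        if score > best_score:
            best, best_score = subset, score
        if best_score >= switch_at:
            return mcmc_refine(score_fn, n_variables, best, best_score,
                               seed=seed)
    return best, best_score
```

The design rationale is that random draws explore the space broadly, while the Markov chain exploits a promising region once one is found; adapting the proposal width trades exploration for exploitation as the evaluation rises.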
  5.  A data processing device that identifies, from among a plurality of explanatory variables, a cause explanatory variable set, which is a set of at least one explanatory variable that causes a predetermined event, the device comprising:
     an acquisition unit that acquires a plurality of pieces of sample data, each associating the values of the plurality of explanatory variables with whether or not the event occurred;
     an explanatory variable selection unit that extracts, from the plurality of explanatory variables, a subset of explanatory variables whose individual degree of contribution to the occurrence or non-occurrence of the event satisfies a predetermined criterion, and selects a set of selected explanatory variables from the extracted subset;
     a learning processing unit that, based on the plurality of pieces of sample data, learns, for each of the plurality of sets of selected explanatory variables, a prediction model that predicts whether or not the event occurs from the values of the selected explanatory variables;
     a model selection unit that, from among the plurality of prediction models corresponding to the different sets of selected explanatory variables, preferentially selects a prediction model with a higher evaluation; and
     a determination unit that determines, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.
  6.  A data processing device that identifies, from among a plurality of explanatory variables, a cause explanatory variable set, which is a set of at least one explanatory variable that causes a predetermined event, the device comprising:
     an acquisition unit that acquires a plurality of pieces of sample data, each associating the values of the plurality of explanatory variables with whether or not the event occurred;
     an explanatory variable selection unit that selects a set of selected explanatory variables from among the plurality of explanatory variables at random, independently of any sets selected in the past, and switches to selection using a Markov chain Monte Carlo method on the condition that the evaluation of a prediction model learned from the randomly selected set of selected explanatory variables is equal to or higher than a reference;
     a learning processing unit that, based on the plurality of pieces of sample data, learns, for each of the plurality of sets of selected explanatory variables, the prediction model that predicts whether or not the event occurs from the values of the selected explanatory variables;
     a model selection unit that, from among the plurality of prediction models corresponding to the different sets of selected explanatory variables, preferentially selects a prediction model with a higher evaluation; and
     a determination unit that determines, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.
  7.  The data processing device according to any one of claims 1 to 6, wherein the device identifies, from among a plurality of genes, a set of at least one gene that is an expression factor of the event.
  8.  The data processing device according to any one of claims 1 to 7, wherein the learning processing unit:
     generates a plurality of functions, each containing the selected explanatory variables in the set as variables and selected so as to maximize variance;
     selects, from among the plurality of functions, at least one function to be used for discriminating whether or not the event occurred in the plurality of pieces of sample data; and
     learns the prediction model, which predicts whether or not the event occurs, based on positions in a multidimensional space whose dimensions are the values of the at least one function.
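The variance-maximizing functions of claim 8 resemble principal components. As an illustrative sketch only, and under the assumption that one component suffices, the following computes the first variance-maximizing direction by power iteration and discriminates a sample by which class mean its position along that direction is closer to; the claim itself does not prescribe these particular computations.

```python
def principal_component(rows, iters=100):
    """First variance-maximizing direction of `rows`, via power iteration
    on the (unnormalised) covariance matrix."""
    n = len(rows[0])
    means = [sum(r[i] for r in rows) / len(rows) for i in range(n)]
    centered = [[r[i] - means[i] for i in range(n)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) for j in range(n)]
           for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v, means

def project(values, direction, means):
    """Position of a sample along the variance-maximizing direction."""
    return sum((x - m) * d for x, m, d in zip(values, means, direction))

def learn_projected_model(samples):
    """Fit the component on all samples, represent each sample by its
    position along it, and discriminate by the nearer class mean."""
    rows = [v for v, _ in samples]
    direction, means = principal_component(rows)
    pos = {0: [], 1: []}
    for values, occurred in samples:
        pos[occurred].append(project(values, direction, means))
    class_means = {c: sum(p) / len(p) for c, p in pos.items() if p}
    def predict(values):
        z = project(values, direction, means)
        return min(class_means, key=lambda c: abs(z - class_means[c]))
    return predict
```

With several selected functions rather than one, each sample would instead be mapped to a point in the multidimensional space spanned by the function values, and the discrimination would use distances in that space.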
  9.  The data processing device according to any one of claims 1 to 8, comprising a plurality of the explanatory variable selection units and a plurality of the learning processing units, wherein:
     the plurality of explanatory variable selection units select the sets of selected explanatory variables in parallel;
     the plurality of learning processing units learn the prediction models in parallel; and
     the plurality of explanatory variable selection units and the plurality of learning processing units are realized by at least one of a many-core CPU, a computer cluster, a GPGPU, a parallelized FPGA, and a virtual machine image accessed via a network.
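The parallelism of claim 9 is natural here because each random draw is independent of every other, so no coordination between units is needed. The sketch below is illustrative only: a thread pool stands in for the many-core CPU, cluster, GPGPU, FPGA, or virtual-machine targets named in the claim, and the stand-in scoring function is an assumption.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def evaluate_subset(args):
    """One selection unit plus one learning unit: draw a random subset with
    a private seed and score it with the supplied (stand-in) scoring
    function."""
    seed, n_variables, set_size, score_fn = args
    rng = random.Random(seed)
    subset = tuple(sorted(rng.sample(range(n_variables), set_size)))
    return score_fn(subset), subset

def parallel_search(score_fn, n_variables, set_size, draws=256, workers=4):
    """Run many independent subset evaluations in parallel and keep the
    best-scoring subset.  Because the draws share no state, the search is
    embarrassingly parallel and scales with the number of workers."""
    tasks = [(seed, n_variables, set_size, score_fn) for seed in range(draws)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(evaluate_subset, tasks))
    return max(results)  # (best_score, best_subset)
```

Seeding each task separately keeps the result deterministic regardless of how the scheduler interleaves the workers, which is one practical way to keep a parallel stochastic search reproducible.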
  10.  A data processing method, executed by a computer, for identifying, from among a plurality of explanatory variables, a cause explanatory variable set, which is a set of at least one explanatory variable that causes a predetermined event, the method comprising:
     an acquisition step of acquiring a plurality of pieces of sample data, each associating the values of the plurality of explanatory variables with whether or not the event occurred;
     an explanatory variable selection step of repeatedly selecting a set of at least one selected explanatory variable from among the plurality of explanatory variables, and in each selection selecting the set of selected explanatory variables at random, independently of any sets selected in the past;
     a learning processing step of learning, based on the plurality of pieces of sample data and for each of the plurality of sets of selected explanatory variables, a prediction model that predicts whether or not the event occurs from the values of the selected explanatory variables;
     a model selection step of preferentially selecting, from among the plurality of prediction models corresponding to the different sets of selected explanatory variables, a prediction model with a higher evaluation; and
     a determination step of determining, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected in the model selection step.
  11.  A data processing program that causes a computer to function as a data processing device that identifies, from among a plurality of explanatory variables, a cause explanatory variable set, which is a set of at least one explanatory variable that causes a predetermined event, wherein, when executed, the program causes the computer to function as:
     an acquisition unit that acquires a plurality of pieces of sample data, each associating the values of the plurality of explanatory variables with whether or not the event occurred;
     a plurality of explanatory variable selection units that repeatedly select a set of at least one selected explanatory variable from among the plurality of explanatory variables, and in each selection select the set of selected explanatory variables at random, independently of any sets selected in the past;
     a learning processing unit that, based on the plurality of pieces of sample data, learns, for each of the plurality of sets of selected explanatory variables, a prediction model that predicts whether or not the event occurs from the values of the selected explanatory variables;
     a model selection unit that, from among the plurality of prediction models corresponding to the different sets of selected explanatory variables, preferentially selects a prediction model with a higher evaluation; and
     a determination unit that determines, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.
PCT/JP2016/057992 2015-03-16 2016-03-14 Data processing device, data processing method, and data processing program WO2016148107A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2015-052105 2015-03-16
JP2015052105A JP2018077547A (en) 2015-03-16 2015-03-16 Parallel processing apparatus, parallel processing method, and parallelization processing program

Publications (1)

Publication Number Publication Date
WO2016148107A1 true WO2016148107A1 (en) 2016-09-22

Family

ID=56920176

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/057992 WO2016148107A1 (en) 2015-03-16 2016-03-14 Data processing device, data processing method, and data processing program

Country Status (2)

Country Link
JP (1) JP2018077547A (en)
WO (1) WO2016148107A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7196696B2 (en) * 2019-03-07 2022-12-27 株式会社ジェイテクト Machine learning device and machine learning method
JP2021183017A (en) * 2020-05-21 2021-12-02 キヤノン株式会社 Information processing device, information processing method, and program
JP2021197100A (en) * 2020-06-18 2021-12-27 国立研究開発法人産業技術総合研究所 Information processing system, information processing method, identification method and program
WO2022208734A1 (en) * 2021-03-31 2022-10-06 富士通株式会社 Information presentation program, information presentation method, and information presentation device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004355174A (en) * 2003-05-28 2004-12-16 Ishihara Sangyo Kaisha Ltd Data analysis method and system
JP2006048429A (en) * 2004-08-05 2006-02-16 Nec Corp System of type having replaceable analysis engine and data analysis program


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110392899A (en) * 2017-12-18 2019-10-29 甲骨文国际公司 The dynamic feature selection generated for model
JP2021507323A (en) * 2017-12-18 2021-02-22 オラクル・インターナショナル・コーポレイション Dynamic feature selection for model generation
US11599753B2 (en) 2017-12-18 2023-03-07 Oracle International Corporation Dynamic feature selection for model generation
JP7340456B2 (en) 2017-12-18 2023-09-07 オラクル・インターナショナル・コーポレイション Dynamic feature selection for model generation
CN110392899B (en) * 2017-12-18 2023-09-15 甲骨文国际公司 Dynamic feature selection for model generation
JP2021002126A (en) * 2019-06-20 2021-01-07 昭和電工マテリアルズ株式会社 Design support device, design support method and design support program

Also Published As

Publication number Publication date
JP2018077547A (en) 2018-05-17

Similar Documents

Publication Publication Date Title
WO2016148107A1 (en) Data processing device, data processing method, and data processing program
US10387430B2 (en) Geometry-directed active question selection for question answering systems
JP6954003B2 (en) Determining device and method of convolutional neural network model for database
US10776400B2 (en) Clustering using locality-sensitive hashing with improved cost model
CN105488539B (en) The predictor method and device of the generation method and device of disaggregated model, power system capacity
CN108446741B (en) Method, system and storage medium for evaluating importance of machine learning hyper-parameter
JP2016062544A (en) Information processing device, program, information processing method
JP2021099803A (en) Efficient cross-modal retrieval via deep binary hashing and quantization
JP6299759B2 (en) Prediction function creation device, prediction function creation method, and program
JP2007095069A (en) Spread kernel support vector machine
US9372959B2 (en) Assembly of metagenomic sequences
EP3779806A1 (en) Automated machine learning pipeline identification system and method
WO2016095068A1 (en) Pedestrian detection apparatus and method
CN113255611A (en) Twin network target tracking method based on dynamic label distribution and mobile equipment
US20070239415A2 (en) General graphical gaussian modeling method and apparatus therefore
KR20230004566A (en) Inferring Local Ancestry Using Machine Learning Models
JP2020529060A (en) Prediction of molecular properties of molecular variants using residue-specific molecular structural features
US10248462B2 (en) Management server which constructs a request load model for an object system, load estimation method thereof and storage medium for storing program
Li et al. Informative SNPs selection based on two-locus and multilocus linkage disequilibrium: criteria of max-correlation and min-redundancy
JP5975470B2 (en) Information processing apparatus, information processing method, and program
CN115169555A (en) Edge attack network disruption method based on deep reinforcement learning
JP7014582B2 (en) Quotation acquisition device, quotation acquisition method and program
CN110348581B (en) User feature optimizing method, device, medium and electronic equipment in user feature group
JP2002175305A (en) Graphical modeling method and device for inferring gene network
CN115344386A (en) Method, device and equipment for predicting cloud simulation computing resources based on sequencing learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16764937

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 16764937

Country of ref document: EP

Kind code of ref document: A1