WO2016148107A1 - Data processing device, data processing method, and data processing program - Google Patents

Data processing device, data processing method, and data processing program

Info

Publication number
WO2016148107A1
Authority
WO
WIPO (PCT)
Prior art keywords
explanatory variable, explanatory variables, selection
Prior art date
Application number
PCT/JP2016/057992
Other languages
French (fr)
Japanese (ja)
Inventor
一夫 石井
利紀 古崎
哲郎 大森
周助 沼田
Original Assignee
Tokyo University of Agriculture and Technology (国立大学法人東京農工大学)
Tokushima University (国立大学法人徳島大学)
Priority date
Filing date
Publication date
Application filed by Tokyo University of Agriculture and Technology (国立大学法人東京農工大学) and Tokushima University (国立大学法人徳島大学)
Publication of WO2016148107A1 publication Critical patent/WO2016148107A1/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00 Subject matter not provided for in other groups of this subclass
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention relates to a data processing device, a data processing method, and a data processing program.
  • Patent Document 1: JP 2003-4739
  • Patent Document 2: JP 2002-528095
  • Patent Document 3: JP 2011-248789
  • A data processing device is provided for identifying a cause explanatory variable set, that is, a set of at least one explanatory variable that causes a predetermined event, from among a plurality of explanatory variables.
  • The device comprises: an acquisition unit that acquires a plurality of sample data in which the values of a plurality of explanatory variables are associated with the occurrence or non-occurrence of an event;
  • an explanatory variable selection unit that repeatedly and randomly selects a set of selected explanatory variables from the plurality of explanatory variables, without depending on the sets selected previously;
  • a learning processing unit that, for each of a plurality of sets of selected explanatory variables, learns a prediction model that predicts the occurrence of the event from the values of the selected explanatory variables, based on the plurality of sample data;
  • a model selection unit that preferentially selects, from the plurality of prediction models corresponding to the different sets of selected explanatory variables, a prediction model with a higher evaluation; and a determination unit that determines the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit as the cause explanatory variable set.
  • A block diagram of the data processing apparatus 10 of this embodiment is shown.
  • the processing flow by the data processing apparatus 10 of this embodiment is shown.
  • An example of the sample data concerning this embodiment is shown.
  • An example of the set of explanatory variables according to the present embodiment is shown.
  • The selection probability with which the initialization unit 122 selects each set of selected explanatory variables is shown.
  • An example of the occurrence probability of an event for each gene is shown.
  • An example of the processing flow of the learning of the prediction model by multivariate analysis is shown.
  • An example of the multidimensional space that the learning processing unit 130 generates is shown.
  • An example of the processing flow of the learning of the prediction model by the maximum likelihood estimation method is shown.
  • An example of the processing flow of learning of the prediction model by the Bayes method is shown.
  • An example of discrimination by the maximum likelihood estimation method or the Bayes method is shown.
  • An example of a method for generating a set of selected explanatory variables by the Markov chain Monte Carlo method is shown.
  • A modification of the processing flow of S200 by the explanatory variable selection unit 120 is shown.
  • a modification of the generation of a set of selected explanatory variables by the Markov chain Monte Carlo method by the generation unit 124 is shown.
  • The parallel processing apparatus 12 according to a modification of this embodiment that implements parallel processing is shown.
  • An example of the effect of learning by this embodiment is shown.
  • An example of the hardware configuration of a computer 1900 is shown.
  • FIG. 1 shows a block diagram of the data processing apparatus 10 of this embodiment.
  • the data processing apparatus 10 of this embodiment specifies a cause factor set that is a set of at least one explanatory variable that causes a predetermined event from a plurality of explanatory variables.
  • the data processing device 10 identifies a set of at least one gene that is an expression factor of an event such as a disease as a cause factor set from among a plurality of genes.
  • the data processing apparatus 10 includes an acquisition unit 110, an explanatory variable selection unit 120, a learning processing unit 130, a model selection unit 140, and a determination unit 150.
  • The acquisition unit 110 acquires a plurality of sample data in which the values of a plurality of explanatory variables are associated with the occurrence or non-occurrence of an event. For example, the acquisition unit 110 acquires, from the database 20, sample data on a plurality of subjects in which gene-related values (for example, the presence or absence, modification, or expression level of a specific structural gene) are associated with the presence or absence of a disease. The acquisition unit 110 provides the acquired sample data to the explanatory variable selection unit 120.
  • the explanatory variable selection unit 120 selects at least one set of selected explanatory variables from a plurality of explanatory variables.
  • For example, the explanatory variable selection unit 120 selects, from the plurality of genes held by the subjects in the sample data, a set of a predetermined number of genes as the set of selected explanatory variables, using a predetermined method (for example, the bootstrap method or the Markov chain Monte Carlo method).
  • the explanatory variable selection unit 120 includes an initialization unit 122 and a generation unit 124.
  • The initialization unit 122 determines the set of selected explanatory variables that the explanatory variable selection unit 120 selects in the initial stage of the search for the cause explanatory variables. For example, the initialization unit 122 includes in the initial set of selected explanatory variables randomly chosen explanatory variables, explanatory variables that occur frequently in the plurality of sample data, or explanatory variables with a high degree of contribution to the occurrence or non-occurrence of the event in the plurality of sample data (for example, a low Wilks' lambda statistic, high sensitivity or specificity, and/or a low Akaike Information Criterion (AIC) statistic).
  • The generation unit 124 generates the sets of selected explanatory variables that the explanatory variable selection unit 120 selects after the initial stage of the search. For example, the generation unit 124 sequentially generates sets of selected explanatory variables from the initial set using a Markov chain Monte Carlo method. In this way, the generation unit 124 generates each new set with a combination close to that of the previously selected set.
  • the initialization unit 122 and the generation unit 124 may determine or generate a set of selected explanatory variables based on the evaluation of the prediction model obtained from the previously generated set of selected explanatory variables. Details of processing by the explanatory variable selection unit 120 will be described later.
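The generation unit's behavior of producing each new set close to the previous one can be sketched as a Markov-chain-style proposal that swaps a single gene. This is an illustrative assumption: the patent does not fix the proposal mechanism, and the function name `propose_next_set` and the swap-one-element rule are not taken from the source.

```python
import random

def propose_next_set(current_set, all_genes, rng=random):
    """Generate a neighboring set of selected explanatory variables by
    swapping one gene, in the style of a Markov chain Monte Carlo
    proposal. The new set differs from the previous set in exactly one
    element, so successive sets stay close in combination space."""
    current = list(current_set)
    # Pick one gene in the set to drop and one gene outside the set to add.
    drop = rng.randrange(len(current))
    candidates = [g for g in all_genes if g not in current]
    current[drop] = rng.choice(candidates)
    return set(current)
```

A call such as `propose_next_set({"g1", "g2", "g3"}, gene_pool)` returns a set sharing all but one gene with its predecessor, which is what lets the search explore combinations incrementally rather than jumping at random.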
  • the explanatory variable selection unit 120 supplies the set of selected explanatory variables to the learning processing unit 130.
  • The learning processing unit 130 learns, based on the plurality of sample data, a prediction model that predicts the occurrence of the event from the values of the selected explanatory variables in the set. For example, the learning processing unit 130 learns a prediction model that predicts the presence or absence of a disease from the presence or absence of the genes in the set of selected explanatory variables for each subject in the sample data. As a result, the learning processing unit 130 obtains one prediction model per set of selected explanatory variables. The specific contents of the learning process of the learning processing unit 130 will be described later.
  • the learning processing unit 130 generates an evaluation on the prediction accuracy of event occurrence by each prediction model, and supplies this to the explanatory variable selection unit 120. For example, the learning processing unit 130 compares the result of predicting the occurrence of an event from sample data based on each prediction model with the presence or absence of an actual event, and generates an evaluation from the result. The learning processing unit 130 supplies the model selection unit 140 with a set and evaluation of selected explanatory variables corresponding to the learned prediction model.
  • The model selection unit 140 preferentially selects a prediction model with a higher evaluation from among the plurality of prediction models that the learning processing unit 130 learned for the different sets of selected explanatory variables. For example, the model selection unit 140 selects the prediction model with the highest evaluation. The model selection unit 140 supplies the selected prediction model to the determination unit 150.
  • the determining unit 150 determines a set of selected explanatory variables corresponding to the prediction model selected by the model selecting unit 140 as a cause explanatory variable set. As a result, the determination unit 150 can preferentially identify the cause of the event as a set of selected explanatory variables that give a prediction model that has a high evaluation, that is, predicts the occurrence of the event with high accuracy.
  • In this way, the data processing apparatus 10 selects sets of selected explanatory variables from the plurality of candidate explanatory variables in the sample data by the Markov chain Monte Carlo method or the like, learns a prediction model for each selected set, and identifies the set of selected explanatory variables corresponding to the highly evaluated prediction model as the cause explanatory variable set.
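The overall search just summarized can be sketched end to end. This is a minimal illustration only: the function name `find_cause_set` is an assumption, and a deliberately toy prediction rule (predict the event if any selected gene is expressed) stands in for the learned prediction model described in the patent.

```python
import random

def find_cause_set(sample_X, sample_y, gene_names, set_size=3, n_trials=100, seed=0):
    """Repeatedly select a random set of explanatory variables, fit a
    simple prediction rule, evaluate it against the sample data, and
    keep the set whose model receives the highest evaluation."""
    rng = random.Random(seed)
    best_score, best_set = -1.0, None
    for _ in range(n_trials):
        chosen = rng.sample(range(len(gene_names)), set_size)
        # Toy stand-in for the learned prediction model:
        # predict the event if any selected gene is expressed.
        correct = sum(
            (1 if any(x[j] for j in chosen) else 0) == t
            for x, t in zip(sample_X, sample_y)
        )
        score = correct / len(sample_y)
        if score > best_score:
            best_score, best_set = score, [gene_names[j] for j in chosen]
    return best_set, best_score
```

With data in which one gene perfectly tracks the event, the loop converges on the set containing that gene, mirroring how the model selection unit singles out the set behind the best-evaluated model.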
  • FIG. 2 shows a processing flow by the data processing apparatus 10 of the present embodiment.
  • The data processing apparatus 10 specifies the cause explanatory variable set by executing the processes of S120 to S240.
  • sample data in which each value of a plurality of explanatory variables is associated with whether or not an event has occurred is acquired for a plurality of objects.
  • the acquisition unit 110 acquires, from the database 20, sample data that associates the presence or absence of specific gene expression with the presence or absence of a specific disease such as colorectal cancer for a plurality of subjects.
  • FIG. 3 shows an example of sample data.
  • The acquisition unit 110 acquires, as sample data for each of M subjects, information indicating whether each of n (for example, millions of) genes g1 to gn is expressed (1 if expressed, 0 if not), associated with information on a biological event such as the presence or absence of a disease (1 if the disease occurred, 0 if not).
  • In the sample data shown in FIG. 3, subject 1 expressed gene g1 and gene g2, did not express gene g3, ..., expressed gene gn, and has no disease;
  • subject 2 did not express gene g1, expressed gene g2 and gene g3, ..., expressed gene gn, and has the disease;
  • subject M expressed gene g1 and gene g2, did not express gene g3, ..., did not express gene gn, and has no disease.
  • Instead of or in addition to the presence or absence of gene expression, the acquisition unit 110 may acquire information on gene expression levels, positional information on gene sequence polymorphisms, the frequency of gene mutations, the type and site of gene modifications, and/or the degree of gene modification. The acquisition unit 110 may also acquire information related to gene expression: for example, the amount of protein produced by gene expression and translation, the type, site, and/or degree of modification of the transcript or protein, and the amounts of metabolites produced as a result of the functional expression of the produced protein.
  • Such metabolites include, for example, (a) lipids, carbohydrates, vitamins, amino acids, nucleic acids, other alcohols, organic acids or their esters, other amines, or other organic compounds; (b) minerals, ions, or other inorganic compounds (nitrogen compounds, sulfur compounds, phosphorus-containing compounds, halogen compounds, etc.) or their ions; or (c) complexes or degradation products thereof.
  • Sample data including such quantity information may be acquired.
  • the acquisition unit 110 provides the acquired sample data to the explanatory variable selection unit 120.
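The sample-data layout of FIG. 3 can be encoded as a simple matrix. The specific values below are a hypothetical reduction of the figure to four genes; the variable names `X` and `y` are illustrative, not from the patent.

```python
# Hypothetical encoding of the sample data of FIG. 3: one row per subject,
# one column per gene (1 = expressed, 0 = not expressed); y records the
# biological event (1 = disease occurred, 0 = not).
X = [
    [1, 1, 0, 1],  # subject 1: g1, g2, gn expressed; g3 not; no disease
    [0, 1, 1, 1],  # subject 2: g2, g3, gn expressed; g1 not; disease
    [1, 1, 0, 0],  # subject M: g1, g2 expressed; g3, gn not; no disease
]
y = [0, 1, 0]
```

Each row of `X` paired with the matching entry of `y` corresponds to one sample-data record that the acquisition unit 110 passes to the explanatory variable selection unit 120.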
  • the initialization unit 122 determines an initial set of selected explanatory variables.
  • For example, the initialization unit 122 may determine an initial set of selected explanatory variables consisting of a predetermined number s of explanatory variables selected at random, with equal probability, from all explanatory variables in the sample data.
  • The initialization unit 122 may preferentially include explanatory variables that occur frequently in the plurality of sample data in the initial set of selected explanatory variables. For example, the initialization unit 122 determines genes that are frequently expressed by the subjects in the sample data (or in nature) as initial selected explanatory variables. As an example, the initialization unit 122 may select the s most frequent genes, or may select s mutually distinct genes with selection probabilities corresponding to their frequencies, as the initial set of selected explanatory variables.
  • The initialization unit 122 may select the set of selected explanatory variables from the plurality of explanatory variables as a whole. Alternatively, it may extract in advance, from the plurality of explanatory variables, the subset whose degree of contribution to the occurrence or non-occurrence of the event satisfies a predetermined criterion and/or whose frequency of occurrence among subjects is higher than a predetermined criterion, and select the set of selected explanatory variables from that extracted subset.
  • For example, the initialization unit 122 may extract, from the plurality of explanatory variables, only those whose degree of contribution to the occurrence or non-occurrence of the event, as judged by a t-test on each variable alone, is higher than average, and select s genes from the extracted explanatory variables. Alternatively, the initialization unit 122 may select s genes after excluding, from the plurality of explanatory variables, those corresponding to genes whose expression frequency is lower than the occurrence frequency of the event.
  • FIG. 4 shows an example with s = 3, in which set G1 includes g10, g41, and g301; another set includes g41, g301, and g282; and set GN includes ga, gb, gc, and gd (a, b, c, d ≤ n).
  • FIG. 5 shows an example of the selection probability with which the initialization unit 122 selects each set as the initial set of selected explanatory variables.
  • The horizontal axis of the graph arranges the N sets of explanatory variables G1 to GN, and the vertical axis represents the selection probability Ps corresponding to each set.
  • The initialization unit 122 selects each set of explanatory variables with a probability corresponding to its selection probability Ps.
  • Each set of explanatory variables is arranged in order of frequency. That is, the set Gx arranged leftmost in the graph is the combination of genes that appears (or is expected to appear) most frequently in the sample data (or in nature); the set Gy arranged to the right of Gx is the combination that appears next most frequently after Gx; ...; and the set Gz arranged rightmost is the combination of genes that appears (or is expected to appear) with the lowest frequency in the sample data (or in nature).
  • The number of explanatory variables included in each set arranged in FIG. 5 may be a single fixed value (for example, three) or may vary across sets (for example, from two to five).
  • The selection probability Ps corresponding to each set of explanatory variables in the graph may be a value whose magnitude corresponds to the frequency of that combination of explanatory variables.
  • Further, the initialization unit 122 may include, in the initial set of selected explanatory variables, explanatory variables that individually satisfy a predetermined criterion of contribution to the occurrence of the event in the plurality of sample data.
  • the initialization unit 122 calculates an event occurrence rate for each explanatory variable from the sample data.
  • For example, the initialization unit 122 calculates the degree of contribution of an explanatory variable (gene g1) to the occurrence or non-occurrence of the event (disease), that is, the incidence of disease due to the expression of gene g1, as 0.11%.
  • FIG. 6 shows an example of the occurrence probability of an event for each gene.
  • The disease incidence of subjects having gene g1 is 0.11%;
  • that of subjects having gene g2 is 0.15%;
  • that of subjects having gene g3 is 0.73%; and that of subjects having gene gn is 0.02%.
  • The initialization unit 122 determines genes with a high degree of contribution to the occurrence of disease as initial selected explanatory variables. As an example, the initialization unit 122 may select the s genes with the highest contribution to the presence or absence of disease, or may select s mutually distinct genes with selection probabilities according to their degrees of contribution, as the initial set of selected explanatory variables.
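Selecting s distinct genes with probability proportional to their contribution can be sketched as weighted sampling without replacement. The function name `initial_selection` and the dictionary interface are illustrative assumptions; the contribution values below are the per-gene disease incidences from FIG. 6.

```python
import random

def initial_selection(contributions, s, rng=random):
    """Pick s mutually distinct explanatory variables, each chosen with
    probability proportional to its contribution (e.g., the per-gene
    disease incidence computed from the sample data). Chosen genes are
    removed from the pool so the result has no duplicates."""
    genes = list(contributions)
    chosen = []
    for _ in range(s):
        weights = [contributions[g] for g in genes]
        total = sum(weights)
        r = rng.random() * total
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                chosen.append(genes.pop(i))
                break
    return chosen
```

With the FIG. 6 incidences, gene g3 (0.73%) is by far the most likely first pick, matching the intent of favoring high-contribution genes in the initial set.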
  • The learning processing unit 130 learns, based on the plurality of sample data, a prediction model that predicts the occurrence of the event from the value of each selected explanatory variable in the set determined in the immediately preceding process.
  • For example, the learning processing unit 130 learns a prediction model that predicts the presence or absence of a disease from the presence or absence of gene expression in the set of selected explanatory variables of each subject in the sample data.
  • The learning processing unit 130 learns the prediction model based on multivariate analysis (such as discriminant analysis or multiple regression analysis), machine learning (such as self-organizing maps, support vector machines, or deep learning), the maximum likelihood estimation method, or the Bayesian method.
  • FIG. 7 shows an example of a processing flow for learning a prediction model by multivariate analysis.
  • The learning processing unit 130 may execute the processing of S160 using multivariate analysis such as multiple regression analysis, principal component analysis, or cluster analysis, by executing the processes of S162 to S172.
  • For example, the learning processing unit 130 plots the subject of each sample data record in a three-dimensional space whose axes are the expression levels of the 10th, 23rd, and 45th genes.
  • When a subject's expression level of the 10th gene g10 is 0.1, that of gene g23 is 0.3, and that of gene g45 is 0.2, the subject is plotted at the point (0.1, 0.3, 0.2) in the three-dimensional space.
  • The learning processing unit 130 inputs the coordinates corresponding to each subject's plot into a first linear function f1(x), obtains a plurality of output values corresponding to the subjects, and optimizes the coefficients aij corresponding to each gene xij so that the variance of the output values is maximized.
  • The learning processing unit 130 then generates a second linear function f2(x) different from the first linear function f1(x). For example, the learning processing unit 130 optimizes the coefficients aij corresponding to each gene xij so that the variance of the output values becomes the second largest after that of the first linear function f1(x).
  • The learning processing unit 130 may further generate a third linear function f3(x) and subsequent functions in decreasing order of output variance. In this way, the learning processing unit 130 generates a predetermined number of linear functions in decreasing order of the variance of their output values.
  • From among the plurality of generated functions, the learning processing unit 130 selects at least one function to be used to determine whether the event occurred in the plurality of sample data.
  • For example, the learning processing unit 130 determines a combination of functions that can most clearly determine the boundary between occurrence and non-occurrence of the event when each subject is plotted in a multidimensional space whose axes are the output values of the functions.
  • As an example, the learning processing unit 130 may calculate the correlation coefficient between the output value of each linear function for each subject and whether (or to what degree) the event occurred in that subject, and select one or more functions whose correlation coefficients have large absolute values. In this description, for convenience, it is assumed that the learning processing unit 130 selects the first linear function f1(x) and the second linear function f2(x) generated in S162.
  • The learning processing unit 130 generates a multidimensional space whose dimensions correspond to the values of the selected functions. For example, the learning processing unit 130 inputs the selected explanatory variable genes of the subjects in the sample data into each selected function and plots the resulting values in a coordinate space with one axis per function.
  • FIG. 8 shows an example of a multidimensional space generated by the learning processing unit 130.
  • the learning processing unit 130 generates a two-dimensional space using two functions.
  • Each plot in the graph corresponds to each subject in the sample data.
  • The learning processing unit 130 plots each subject at the point (z1, z2) in the two-dimensional space, where z1, the component on axis LD1, is the output value obtained by inputting the subject's selected explanatory variable expression levels into the first function f1(x), and z2, the component on axis LD2, is the output value obtained by inputting them into the second function f2(x).
  • Dotted and solid circles (○) indicate subjects who did not develop the disease (that is, healthy persons), and dotted and solid crosses (X) indicate subjects who have the disease (that is, non-healthy persons).
  • The learning processing unit 130 generates a discriminant function that predicts whether the event has occurred from the selected explanatory variables. For example, the learning processing unit 130 generates the discriminant function that most accurately discriminates healthy persons (no disease) from non-healthy persons (with disease) based on various discrimination methods such as linear discrimination, quadratic discrimination, self-organizing maps, and support vector machines.
  • FIG. 8 shows a case where the learning processing unit 130 generates the linear discriminant function TH.
  • the subject plotted on the upper side of the linear discriminant function TH shown in FIG. 8 is predicted as a non-healthy person, and the subject plotted on the lower side is predicted as a healthy person.
  • By generating the discriminant function, the learning processing unit 130 learns a prediction model that predicts the occurrence of the event based on position in the multidimensional space. For example, because the learning processing unit 130 selected the first function f1(x) and the second function f2(x), it can determine a clear boundary TH that separates healthy persons from non-healthy persons in the two-dimensional space whose axes are the output values of these functions.
  • the learning processing unit 130 evaluates the prediction model using the learned discriminant function. For example, the learning processing unit 130 evaluates the prediction accuracy of the occurrence of the event by the discriminant function.
  • the dotted line ⁇ in FIG. 8 indicates a subject whose disease is predicted by the discriminant function but actually has no disease, and the solid line ⁇ indicates a subject who is not predicted by the discriminant function and actually has no disease.
  • the dotted line X indicates a subject who has not actually been diagnosed with a discriminant function but actually has a disease, and the solid line X indicates a subject who has been predicted to have a disease with the discriminant function and actually has a disease.
  • Using the discriminant functions generated by the various discrimination methods, the learning processing unit 130 calculates, as the evaluation of each discriminant function, at least one of the sensitivity (the proportion of solid crosses among all crosses in FIG. 8) and the specificity (the proportion of solid circles among all circles), or the average of both, when the presence or absence of disease is predicted from the gene expression of the subjects in the sample data.
  • the evaluation of the discriminant function is the evaluation of the prediction model corresponding to the discriminant function.
  • In this way, by plotting the sample data in a multidimensional space generated from functions that maximize the variance, the learning processing unit 130 learns a prediction model that improves the discriminability of occurrence or non-occurrence of the event based on the plurality of selected explanatory variables.
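The sensitivity/specificity evaluation of S172 can be made concrete. This is a minimal sketch; the function name `sensitivity_specificity` is an assumption, but the definitions follow the figure legend: sensitivity is the fraction of diseased subjects correctly predicted diseased (solid crosses among all crosses), and specificity is the fraction of healthy subjects correctly predicted healthy (solid circles among all circles).

```python
def sensitivity_specificity(y_true, y_pred):
    """Evaluate a discriminant function as in S172.
    y_true: actual event occurrence per subject (1 = disease, 0 = none).
    y_pred: predicted occurrence per subject.
    Returns (sensitivity, specificity, their average)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return sens, spec, (sens + spec) / 2
```

The returned average of sensitivity and specificity corresponds to the "average of both" option the patent names as a possible evaluation of the discriminant function.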
  • FIG. 9 shows an example of a processing flow for learning a prediction model by the maximum likelihood estimation method.
  • the learning processing unit 130 may execute the process of S160 using the maximum likelihood estimation method by executing the processes of S262 to S266.
  • The learning processing unit 130 generates a likelihood function. For example, based on the sample data, the learning processing unit 130 calculates a likelihood function lik(θ) = f_D(x_i | θ), which takes as input the variable x_i indicating whether or to what degree gene g_i in the set of selected explanatory variables is expressed, and outputs the likelihood that the event occurs given the parameter θ.
  • The learning processing unit 130 determines whether the event occurred for each subject in the sample data based on the likelihood function. For example, the learning processing unit 130 inputs the genes included in each subject's set of selected explanatory variables into the corresponding likelihood function and calculates the likelihood that the disease occurs in that subject. The learning processing unit 130 determines that the disease does not occur when the likelihood falls below a predetermined standard (for example, 0.5), and that it occurs when the likelihood is at or above the standard.
  • the learning processing unit 130 evaluates the accuracy of the likelihood function. For example, the learning processing unit 130 compares the occurrence / non-occurrence of each subject's event in the sample data with the result of predicting the occurrence / non-occurrence of each subject's event using the likelihood function. As an example, the learning processing unit 130 may evaluate the likelihood function based on sensitivity and specificity as in the processing in S172.
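The maximum likelihood step can be sketched with the simplest workable model: fit per-gene Bernoulli expression probabilities by maximum likelihood (the ML estimate of a Bernoulli parameter is the sample mean), separately for diseased and healthy subjects, then threshold a normalized likelihood at 0.5 as in S264. The class-conditional Bernoulli model and both function names are illustrative assumptions, not the patent's specified model.

```python
def fit_bernoulli_mle(X, y):
    """Maximum likelihood estimates of per-gene expression probabilities,
    fit separately for diseased (y=1) and healthy (y=0) subjects."""
    def means(rows):
        n = len(rows)
        return [sum(col) / n for col in zip(*rows)]
    return (means([x for x, t in zip(X, y) if t == 1]),
            means([x for x, t in zip(X, y) if t == 0]))

def disease_likelihood(x, p_dis, p_hea):
    """Normalized likelihood that the subject's gene pattern x arose from
    the disease model; predict disease when this is >= 0.5."""
    def lik(p):
        out = 1.0
        for xi, pi in zip(x, p):
            out *= pi if xi else (1.0 - pi)
        return out
    ld, lh = lik(p_dis), lik(p_hea)
    return ld / (ld + lh) if (ld + lh) else 0.5
```

Evaluating `disease_likelihood` over all subjects and comparing against the actual outcomes gives exactly the sensitivity/specificity evaluation described for S266.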
  • FIG. 10 shows an example of a processing flow for learning a prediction model by the Bayesian method.
  • the learning processing unit 130 may execute the process of S160 using the Bayesian method by executing the processes of S362 to S366.
  • The learning processing unit 130 calculates the posterior probability that each subject in the sample data develops the event. For example, the learning processing unit 130 may calculate the posterior probability from the product of the prior probability and the likelihood, using as the prior probability the frequency of the set of explanatory variables described in FIG. 5 or the product of the occurrence probabilities of the selected explanatory variables described in FIG. 6, and using the likelihood function generated as described above as the likelihood. In addition, the learning processing unit 130 may calculate the posterior probability by a sampling algorithm based on a Markov chain Monte Carlo method such as the Metropolis-Hastings method.
  • the learning processing unit 130 determines whether an event has occurred. For example, the learning processing unit 130 inputs the prior probability based on the genes included in the set of explanatory explanatory variables of each subject and the genes included in the set of explanatory explanatory variables of each subject to the corresponding likelihood function. Based on the above, the posterior probability distribution is calculated. The learning processing unit 130 determines that the disease is present when the posterior probability (for example, the average value, median value, or mode value of the posterior probability in the posterior probability distribution) is lower than a predetermined criterion (for example, 0.5). It is determined that the disease does not occur, and it is determined that the disease occurs when the posterior probability is equal to or higher than the reference.
  • the learning processing unit 130 evaluates the accuracy of the posterior probability. For example, the learning processing unit 130 compares the occurrence / non-occurrence of each subject's event in the sample data with the result of predicting the occurrence / non-occurrence of each subject by the posterior probability. As an example, the learning processing unit 130 may evaluate the likelihood function based on sensitivity and specificity as in the processing in S172.
  • FIG. 11 shows an example of discrimination by the maximum likelihood estimation method or the Bayes method.
  • the x1 axis and x2 axis of the graph correspond to the expression levels of the two genes when the set of selected explanatory variables includes two genes, and the z axis corresponds to the likelihood in the maximum likelihood estimation method or the posterior probability in the Bayes method. Each plot in the graph corresponds to one subject in the sample data.
  • a dotted or solid ○ indicates a subject who did not develop the disease (that is, a healthy person), and a dotted or solid X indicates a subject who developed the disease (that is, a non-healthy person).
  • the learning processing unit 130 predicts a subject as a non-healthy person when the likelihood or the posterior probability for that subject is greater than or equal to a threshold value TH (for example, 0.5), and predicts the subject as a healthy person when it is less than the threshold value TH.
  • the dotted ○ indicates a subject who was predicted by the discriminant function to have the disease but actually did not have the disease
  • the solid ○ indicates a subject who was not predicted by the discriminant function to have the disease and actually did not have the disease
  • the dotted X indicates a subject who was not predicted by the discriminant function to have the disease but actually had the disease
  • the solid X indicates a subject who was predicted by the discriminant function to have the disease and actually had the disease.
  • the learning processing unit 130 may calculate, as the evaluation, at least one of the sensitivity (the ratio of solid X marks among all X marks in FIG. 11) and the specificity (the ratio of solid ○ marks among all ○ marks) of predicting the presence or absence of the disease from the genes of the plurality of subjects in the sample data using the likelihood or the posterior probability.
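  • A minimal sketch of this evaluation, assuming boolean disease labels and the 0.5 threshold on the likelihood/posterior described above (the data values below are invented purely for illustration):

```python
def sensitivity_specificity(actual, predicted):
    """actual / predicted: booleans per subject (True = diseased).
    Sensitivity = correctly predicted diseased / all diseased subjects;
    specificity = correctly predicted healthy / all healthy subjects."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    return tp / (tp + fn), tn / (tn + fp)

# Threshold the likelihood / posterior probability at TH = 0.5 as in the text.
posterior = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1]
actual = [True, True, True, False, False, False]
predicted = [p >= 0.5 for p in posterior]
sens, spec = sensitivity_specificity(actual, predicted)
```

Here the third subject is a false negative (the dotted X of FIG. 11) and the fifth is a false positive (the dotted ○), so both sensitivity and specificity come out to 2/3.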
  • the learning processing unit 130 learns a prediction model corresponding to the set of selected explanatory variables using discriminant analysis, machine learning (self-organizing maps, support vector machines, deep learning, etc.), the maximum likelihood estimation method, the Bayesian method, or the like.
  • the learning processing unit 130 may learn the learning model using multivariate analysis such as multiple regression analysis, principal component analysis, and cluster analysis in addition to / in place of these.
  • the learning processing unit 130 supplies the model selection unit 140 with a set of selected explanatory variables corresponding to the prediction model and the evaluation of each prediction model.
  • the explanatory variable selection unit 120 determines whether to continue selecting sets of selected explanatory variables. For example, the explanatory variable selection unit 120 terminates the selection of sets of selected explanatory variables and advances the process to S220 on condition that the learning process of S160 has been executed a predetermined number of times and/or that a prediction model whose evaluation in S160 is equal to or higher than a criterion has been learned. If the explanatory variable selection unit 120 does not terminate the selection, it advances the process to S200.
  • the explanatory variable selection unit 120 causes the initialization unit 122 or the generation unit 124 to select at least one set of selected explanatory variables from among a plurality of explanatory variables.
  • the initialization unit 122 of the explanatory variable selection unit 120 may select a set of selected explanatory variables including a plurality of (for example, s) selected explanatory variables at random by the same processing as S140.
  • the initialization unit 122 repeatedly selects a set of selected explanatory variables in the loop of S160 to S200.
  • the initialization unit 122 may randomly select a set of selected explanatory variables without depending on a previously selected set of selected explanatory variables in each selection of S200 in the iteration.
  • the initialization unit 122 may select each set of selected explanatory variables independently of the sets selected in the past; it may allow a previously selected set to be selected again by adopting the bootstrap method, or may exclude previously selected sets from re-selection by adopting the jackknife method.
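  • The bootstrap-style and jackknife-style selections described above could be sketched as follows; the function names and the gene list are hypothetical, not taken from the embodiment:

```python
import random

def select_bootstrap(variables, s, rng):
    """Each draw is independent of past draws, so a previously selected
    set may be selected again (bootstrap-style)."""
    return frozenset(rng.sample(variables, s))

def select_jackknife(variables, s, seen, rng, max_tries=10000):
    """Each draw excludes sets already selected in the past (jackknife-style)."""
    for _ in range(max_tries):
        cand = frozenset(rng.sample(variables, s))
        if cand not in seen:
            seen.add(cand)
            return cand
    raise RuntimeError("could not find an unseen set")

rng = random.Random(0)
genes = [f"g{i}" for i in range(1, 11)]  # hypothetical explanatory variables
seen = set()
sets = [select_jackknife(genes, 3, seen, rng) for _ in range(5)]
```

With the jackknife variant every returned set is distinct; the bootstrap variant simply omits the `seen` bookkeeping.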
  • the generation unit 124 of the explanatory variable selection unit 120 may generate a set of selected explanatory variables. For example, the generation unit 124 may sequentially generate a set of selected explanatory variables from an initial set of selected explanatory variables using a Markov chain Monte Carlo method. Thereby, the generation unit 124 generates a set of selected explanatory variables that is close to the previously selected set of selected explanatory variables.
  • FIG. 12 shows an example of a method for generating a set of selected explanatory variables by the Markov chain Monte Carlo method.
  • the horizontal axis of the graph shows N sets of explanatory variables G 1 to G N arranged side by side.
  • each explanatory variable set is arranged in the order of the combination.
  • the set of explanatory variables related to G 1 includes genes g 1 , g 2, and g 3
  • the set of explanatory variables for G 2 , adjacent to G 1 , may include genes g 1 , g 2, and g 4 , in which only gene g 3 has been replaced by the nearby gene g 4 .
  • a plurality of explanatory variable sets may be arranged on the horizontal axis so that the distance between sets corresponds to their similarity, based on the similarity of the sets of genes (for example, edit distance) and/or the similarity of the genes themselves.
  • the number of explanatory variables included in each set arranged in FIG. 12 may be uniform (for example, three for every set) or may vary between sets (for example, two and three).
  • a plurality of sets are arranged one-dimensionally for explanation, but a plurality of sets of explanatory variables may be arranged multidimensionally.
  • G i indicates the set of selected explanatory variables chosen in the immediately preceding selection (S140 or the previous execution of S200), and the vertical axis indicates the selection probability P s with which the generation unit 124 selects each set of explanatory variables. That is, in S200, the generation unit 124 selects a set of explanatory variables with a probability corresponding to the selection probability P s .
  • the set having the combination of explanatory variables close to the previously selected set G i has the highest selection probability P s , and the selection probability P s gradually decreases according to the distance from G i .
  • FIG. 12 shows a probability distribution that is a normal distribution with its peak at G i .
  • the generation unit 124 generates a set of selected explanatory variables having a combination of explanatory variables close to the previously selected set. Note that the generation unit 124 may not select the set G i selected last time or the set that has been selected in the past.
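  • One hedged reading of this generation step, with the sets arranged on an axis and the next index drawn from a roughly normal step around the current one; the arrangement and all names below are invented for illustration:

```python
import random

def propose_next_set(arranged_sets, current, sd, rng):
    """Choose the index of the next set with probability decaying
    (approximately normally) with distance from the current set G_i
    on the axis; the current set itself is not re-selected."""
    n = len(arranged_sets)
    while True:
        step = round(rng.gauss(0.0, sd))
        j = current + step
        if 0 <= j < n and j != current:
            return j

# Hypothetical arrangement: adjacent sets differ by one nearby gene.
arranged = [("g1", "g2", "g3"), ("g1", "g2", "g4"), ("g1", "g2", "g5"),
            ("g1", "g3", "g4"), ("g1", "g3", "g5")]
rng = random.Random(1)
j = propose_next_set(arranged, current=0, sd=1.5, rng=rng)
next_set = arranged[j]
```

Small steps are most probable, so the proposed set tends to share most of its genes with the previous one, mirroring the normal-shaped selection probability of FIG. 12.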
  • FIG. 13 shows a modification of the processing flow of S200 by the explanatory variable selection unit 120.
  • the explanatory variable selection unit 120 executes the processing of S202 to S206, so that either the initialization unit 122 or the generation unit 124 selects the selected explanatory variable according to the previous evaluation of the set of selected explanatory variables. Select a set.
  • the explanatory variable selection unit 120 determines whether the evaluation of the previously selected set of selected explanatory variables is less than a predetermined criterion. For example, the explanatory variable selection unit 120 determines whether the evaluation generated in the process of S172, S266, or S366 for the set of selected explanatory variables generated in the process of S140 or S200 is less than the criterion. The explanatory variable selection unit 120 advances the process to S204 when the evaluation is determined to be less than the criterion, and advances the process to S206 otherwise.
  • instead of judging only the evaluation of the previously selected set of selected explanatory variables, the explanatory variable selection unit 120 may determine whether the most recent predetermined number of consecutive evaluations are all less than the criterion. For example, the explanatory variable selection unit 120 may advance the process to S204 when the evaluations of the ten most recently generated sets of selected explanatory variables are all lower than the criterion.
  • the initialization unit 122 newly selects a set of initial selection explanatory variables.
  • the initialization unit 122 may determine a new initial set of explanatory explanatory variables at random by executing the same processing as in S140.
  • the initialization unit 122 may randomly select a new set of selected explanatory explanatory variables by the bootstrap method or the jackknife method without depending on the set of selected explanatory explanatory variables selected in the past.
  • the generation unit 124 may sequentially generate a set of selected explanatory variables from the initial set of selected explanatory variables using the Markov chain Monte Carlo method. For example, the generation unit 124 may generate a set of selected explanatory variables by the method described in FIG.
  • the initialization unit 122 determines a new initial set of selected explanatory variables on condition that the evaluation of the prediction model learned according to the set selected by the initialization unit 122 is less than the criterion and the evaluation of the prediction model learned according to the set generated by the generation unit 124 is also less than the criterion.
  • the explanatory variable selection unit 120 thus resets the set of selected explanatory variables used as the starting point of the Markov chain Monte Carlo method when a set of selected explanatory variables with a high evaluation cannot be obtained.
  • the explanatory variable selection unit 120 thereby determines that there is no prospect of finding a good set of explanatory variables in that region, starts a search in another region, and streamlines the search for a set of explanatory variables.
  • the generation unit 124 further generates a set of selection explanatory variables sequentially from a new initial selection explanatory variable set.
  • the data processing apparatus 10 can continue searching for and evaluating other sets of selected explanatory variables in the vicinity of a set with an excellent evaluation (that is, a set of explanatory variables that are likely to be causal factors), and can thereby search for causal factors efficiently.
  • the explanatory variable selection unit 120 may have the initialization unit 122 randomly select sets of selected explanatory variables until a set whose evaluation is equal to or higher than the criterion is obtained, and may switch to selection by the generation unit 124 using the Markov chain Monte Carlo method on condition that the evaluation of the prediction model learned according to the selected set is equal to or higher than the criterion.
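  • The restart-and-refine behaviour of this modification can be sketched abstractly; here a toy one-dimensional state stands in for a set of selected explanatory variables, and every name and parameter below is an illustrative assumption:

```python
import random

def search_with_restarts(evaluate, random_state, neighbor, n_iter,
                         criterion, patience, rng):
    """If the most recent `patience` evaluations all fall below `criterion`,
    restart from a fresh random state (cf. S204); otherwise generate a
    neighbouring state from the current one (cf. S206)."""
    current = random_state(rng)
    best = current
    best_eval = evaluate(current)
    recent = [best_eval]
    for _ in range(n_iter):
        if len(recent) >= patience and all(e < criterion for e in recent[-patience:]):
            current = random_state(rng)       # re-initialization
            recent = []
        else:
            current = neighbor(current, rng)  # local (MCMC-style) move
        e = evaluate(current)
        recent.append(e)
        if e > best_eval:
            best, best_eval = current, e
    return best, best_eval

# Toy one-dimensional search standing in for sets of explanatory variables.
rng = random.Random(0)
evaluate = lambda x: 1.0 - abs(x - 7) / 10.0           # best at x = 7
random_state = lambda r: r.randrange(0, 20)
neighbor = lambda x, r: max(0, min(19, x + r.choice([-1, 1])))
best, best_eval = search_with_restarts(evaluate, random_state, neighbor,
                                       n_iter=200, criterion=0.8, patience=10,
                                       rng=rng)
```

Regions that keep evaluating poorly trigger a restart, while promising regions are refined by local moves, matching the flow of S202 to S206.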
  • FIG. 14 shows a modification of the generation of a set of selected explanatory variables by the Markov chain Monte Carlo method by the generation unit 124.
  • the generation unit 124 may use a fixed proposal distribution when generating sets of selected explanatory variables by the Markov chain Monte Carlo method. Instead, the generation unit 124 may use probability distributions with different shapes, as shown by the solid line and the dotted line in FIG. 14, as the proposal distribution.
  • the generation unit 124 changes the proposal distribution in the Markov chain Monte Carlo method according to the evaluation of the prediction model learned according to the set of selected explanatory variables.
  • the generation unit 124 may use, as the proposal distribution, a distribution whose width is negatively correlated with the evaluation of the previously selected set of selected explanatory variables (for example, a distribution whose width is inversely proportional to the evaluation value).
  • the generation unit 124 selects the next set of selected explanatory variables from a narrower probability distribution (for example, the dotted distribution in FIG. 14) when the evaluation of the previously selected set of selected explanatory variables becomes higher. Therefore, the probability of selecting a set of selected explanatory variables closer to the previous set of selected explanatory variables is increased.
  • the data processing apparatus 10 can thereby efficiently search for sets of selected explanatory variables with high evaluations.
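  • One hedged way to realize such an evaluation-dependent proposal width, taking the "inversely proportional" example literally; the constants are illustrative assumptions, not values from the embodiment:

```python
def proposal_sd(prev_evaluation, k=0.5, max_sd=3.0):
    """Proposal-distribution width inversely proportional to the previous
    evaluation value, clipped to max_sd: a high evaluation narrows the
    distribution so the next set stays close to the previous one."""
    return min(max_sd, k / max(prev_evaluation, 1e-9))

wide = proposal_sd(0.2)    # poor previous evaluation -> broad exploration
narrow = proposal_sd(0.9)  # good previous evaluation -> local refinement
```

The returned width could be fed directly into a step-generating routine such as a Gaussian proposal, reproducing the solid-line versus dotted-line shapes of FIG. 14.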
  • the explanatory variable selection unit 120 supplies the selected explanatory variable set to the learning processing unit 130, and returns the process to S160.
  • the model selection unit 140 preferentially selects, from among the plurality of prediction models learned by the learning processing unit 130 according to different sets of selected explanatory variables, a prediction model with a higher evaluation (for example, a lower Wilks' lambda statistic, a higher sensitivity or specificity, and/or a smaller Akaike Information Criterion (AIC)). For example, the model selection unit 140 selects the prediction model corresponding to the set of selected explanatory variables with the highest evaluation from among the plurality of prediction models generated by the loop processing of S160 to S200. Alternatively, the model selection unit 140 may select a prediction model corresponding to a set of selected explanatory variables with a probability whose magnitude corresponds to the evaluation value. The model selection unit 140 supplies the selected prediction model to the determination unit 150.
  • the determination unit 150 determines a set of selected explanatory variables corresponding to the prediction model selected by the model selection unit 140 as a cause explanatory variable set.
  • the determination unit 150 can thus identify, as the cause of the event, the set of selected explanatory variables that yields a highly evaluated prediction model.
  • the determination unit 150 can, for example, identify as disease-causing genes a set of genes that yields a prediction model with high accuracy in predicting the occurrence of the disease.
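  • The two selection policies of the model selection unit 140 (take the highest evaluation, or sample with probability proportional to the evaluation) could be sketched as follows; the gene sets and scores are invented for illustration:

```python
import random

def select_model(models, rng=None):
    """models: list of (selected_set, evaluation) pairs, higher is better.
    With rng=None the highest-evaluation model is returned; with an rng,
    a model is sampled with probability proportional to its evaluation."""
    if rng is None:
        return max(models, key=lambda m: m[1])
    total = sum(e for _, e in models)
    r = rng.random() * total
    acc = 0.0
    for m in models:
        acc += m[1]
        if r <= acc:
            return m
    return models[-1]

models = [(("g1", "g2"), 0.6), (("g3", "g7"), 0.9), (("g2", "g5"), 0.4)]
best_set, best_eval = select_model(models)                 # deterministic pick
sampled = select_model(models, rng=random.Random(0))       # probabilistic pick
```

The set returned by `select_model` is what the determination unit 150 would then designate as the set of cause explanatory variables.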
  • because the data processing apparatus 10 includes a process of randomly selecting the selected explanatory variables, it can identify the cause of the event with higher accuracy than when the set of selected explanatory variables is generated by selecting explanatory variables in order of their individual degree of contribution to the occurrence/non-occurrence of the event.
  • the data processing apparatus 10 may apply the flow of FIG. 2 with different values of s to determine an appropriate value of s.
  • FIG. 15 shows a parallel processing device 12 according to a modification of the present embodiment in which parallel processing is implemented.
  • the parallel processing device 12 includes a first processing unit 102 and a plurality of second processing units 104, and is different from the data processing device 10 in that parallel processing is executed by the plurality of second processing units 104.
  • the first processing unit 102 has functions of an acquisition unit 110, a model selection unit 140, and a determination unit 150.
  • Each of the plurality of second processing units 104 has functions of an explanatory variable selection unit 120 and a learning processing unit 130.
  • the acquisition unit 110 of this modification supplies the acquired sample data to the explanatory variable selection units 120 of the plurality of second processing units 104.
  • the plurality of explanatory variable selection units 120 select sets of selected explanatory variables in parallel from the plurality of explanatory variables, and the plurality of learning processing units 130 learn prediction models for the plurality of sets of selected explanatory variables in parallel.
  • the parallel processing device 12 executes the loop processing of S160 to S200 in parallel by the plurality of explanatory variable selection units 120 and the plurality of learning processing units 130.
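  • A minimal sketch of this parallel learning, using a thread pool as a stand-in for the plurality of second processing units 104; the scoring function is a deterministic toy, not the embodiment's learning process, and all names are assumptions:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def learn_and_evaluate(selected_set):
    """Stand-in for one second processing unit 104: "learn" a prediction
    model for one set of selected explanatory variables and return the
    set together with a deterministic toy evaluation score."""
    score = (sum(ord(c) for g in selected_set for c in g) % 100) / 100.0
    return selected_set, score

rng = random.Random(0)
genes = [f"g{i}" for i in range(1, 21)]
candidate_sets = [tuple(sorted(rng.sample(genes, 3))) for _ in range(8)]

# The loop of S160 to S200 executed in parallel over the candidate sets.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(learn_and_evaluate, candidate_sets))
best_set, best_eval = max(results, key=lambda r: r[1])
```

A process pool, GPU kernels, or a compute cluster could replace the thread pool for genuinely CPU-bound learning; the map-then-reduce structure (parallel evaluation, then selection of the best model) stays the same.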
  • alternatively, the first processing unit 102 may include the explanatory variable selection unit 120, each second processing unit 104 may include only the learning processing unit 130, and the parallel processing device 12 may parallelize only the learning processing of the learning processing units 130.
  • the parallel processing device 12 executes the processing of S160 in parallel by the plurality of learning processing units 130.
  • the plurality of second processing units 104 each independently select a set of selected explanatory variables by random or Markov chain Monte Carlo method or the like.
  • the second processing unit 104 may communicate with each other a set of selected explanatory variables and evaluation information thereof, and select different sets of selected explanatory variables.
  • another second processing unit 104 may be assigned to search the vicinity of the set of selected explanatory variables of a prediction model with a high evaluation. The second processing units 104 can thereby accelerate the search for prediction models expected to have high evaluations.
  • the second processing unit 104 can perform the selection processing of the set of selected explanatory variables and the learning process of the prediction model in parallel by a large number of processing entities, thereby improving the processing efficiency.
  • the first processing unit 102 may communicate information on a set of selected explanatory variables with a plurality of second processing units 104 and control parallel processing of the second processing unit 104.
  • the first processing unit 102 may be realized by, for example, a general-purpose CPU, and each of the plurality of second processing units 104 may be realized by a general-purpose GPU (GPGPU), a dedicated CPU, or the like.
  • An example of a general-purpose GPU platform is CUDA (Compute Unified Device Architecture) developed by NVIDIA.
  • the first processing unit 102 and the plurality of second processing units 104 may be realized by parallel FPGAs (Field-Programmable Gate Arrays), a cluster of many information processing devices (a computer cluster), a plurality of virtual machine images accessed via a network (machine images deployed by a cloud service, etc.), and/or a plurality of cores in a processor (a many-core CPU).
  • Examples of parallel FPGAs include the RIVYERA models from SciEngines GmbH.
  • Examples of servers equipped with a many-core CPU include HP ProLiant DL980 G7 (80-core CPU) manufactured by Hewlett-Packard and HP Integrity Superdome X Server (240-core CPU).
  • An example of a virtual machine image is Amazon Web Service AMI (Amazon Machine Images).
  • the plurality of second processing units 104 may execute parallel processing by in-memory parallel distributed processing. Examples of the in-memory parallel distributed processing technology include Apache Spark.
  • in the above description, the data processing device 10 and the parallel processing device 12 acquire sample data in which the presence/absence or expression level of specific genes is associated with the presence or absence of a disease, and estimate the causal genes of the disease; however, the application target of the data processing device 10 and the like is not limited to this.
  • the data processing device 10 or the like may acquire sample data including the drug resistance of a pest such as an insect and the gene sequence information of the pest, and specify a combination of genes that contribute to the drug resistance.
  • the data processing apparatus 10 or the like may acquire sample data including gene sequence information of a plurality of closely related species and specify a combination of genes that serve as an index when creating an evolutionary phylogenetic tree.
  • the data processing apparatus 10 or the like generates a branch diagram pattern from each of a plurality of selected sets of selected explanatory variables (gene sets), and determines whether each branch diagram pattern belongs to the majority group or a minority group.
  • the data processing apparatus 10 or the like gives a high evaluation to a set of selected explanatory variables that give a branch diagram belonging to the majority, and gives a low evaluation to a set of selected explanatory variables that give a branch diagram that belongs to the minority.
  • the application target of the data processing apparatus 10 or the like is not limited to specifying a gene as a selected explanatory variable.
  • the data processing apparatus 10 or the like can be used to specify cause explanatory variables for all phenomena in which some of the plurality of explanatory variables contribute to the event.
  • the data processing device 10 or the like can be used to identify factors of purchase behavior, causes of fluctuations in stock prices, factors of information propagation in a network, or causes of natural phenomena such as weather.
  • FIG. 16 shows an example of the effect of learning according to this embodiment.
  • the vertical axis of the graph indicates the Wilks' lambda statistic of the finally obtained prediction model; the lower the value, the higher the evaluation (that is, the higher the prediction accuracy).
  • the horizontal axis of the graph indicates the number (s) of explanatory variables included in the set of selected explanatory variables.
  • a prediction model for predicting the occurrence of colorectal cancer was generated from the genes, using the gene expression levels in the tissues of 18 colorectal cancer patients and 18 healthy individuals as sample data.
  • the solid line in the graph shows the result obtained when, in the processing flow described in FIG. 2, the set of selected explanatory variables is selected randomly from all explanatory variables in the sample data (for example, in the processing of S140 and S200 in FIG. 2) without depending on past selections.
  • the broken line in the graph shows the result when the set of selected explanatory variables is generated by selecting explanatory variables from the plurality of explanatory variables in descending order of their degree of contribution to the occurrence/non-occurrence of the event, as determined by a t-test.
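  • The t-test baseline of the broken line could be sketched as follows, ranking genes by the absolute Welch t-statistic between patient and control groups and taking the top s; the expression values below are invented purely for illustration:

```python
import math

def welch_t(case_values, control_values):
    """Welch's t-statistic for one explanatory variable (gene)."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return m, v
    m1, v1 = mean_var(case_values)
    m2, v2 = mean_var(control_values)
    return (m1 - m2) / math.sqrt(v1 / len(case_values) + v2 / len(control_values))

# Hypothetical expression levels: g1 differs strongly between the groups.
cases = {"g1": [5.1, 5.3, 4.9], "g2": [1.0, 1.2, 0.9], "g3": [3.0, 3.1, 2.8]}
controls = {"g1": [2.0, 2.1, 1.9], "g2": [1.1, 0.9, 1.0], "g3": [3.1, 2.9, 3.0]}

ranked = sorted(cases, key=lambda g: abs(welch_t(cases[g], controls[g])),
                reverse=True)
top_s = ranked[:2]  # baseline set of s = 2 explanatory variables
```

Such per-gene ranking evaluates each explanatory variable in isolation, which is exactly why, per the comparison of FIG. 16, it can miss combinations of genes that only jointly explain the event.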
  • FIG. 17 shows an example of a hardware configuration of a computer 1900 that functions as the data processing apparatus 10 or the like.
  • the computer 1900 includes a CPU peripheral unit having a CPU 2000, a RAM 2020, a graphic controller 2075, and a display device 2080 connected to one another by a host controller 2082; an input/output unit having a communication interface 2030, a hard disk drive 2040, and a CD-ROM drive 2060 connected to the host controller 2082 by an input/output controller 2084; and a legacy input/output unit having a ROM 2010, a flexible disk drive 2050, and an input/output chip 2070 connected to the input/output controller 2084.
  • the host controller 2082 connects the RAM 2020 to the CPU 2000 and the graphic controller 2075 that access the RAM 2020 at a high transfer rate.
  • the CPU 2000 operates based on a program (for example, a parallel processing program) stored in the ROM 2010 and the RAM 2020, and controls each unit.
  • the graphic controller 2075 acquires image data generated by the CPU 2000 or the like on a frame buffer provided in the RAM 2020 and displays it on the display device 2080.
  • the graphic controller 2075 may include a frame buffer for storing image data generated by the CPU 2000 or the like.
  • the input / output controller 2084 connects the host controller 2082 to the communication interface 2030, the hard disk drive 2040, and the CD-ROM drive 2060, which are relatively high-speed input / output devices.
  • the communication interface 2030 communicates with other devices via a network by wire or wireless.
  • the communication interface functions as hardware that performs communication.
  • the hard disk drive 2040 stores programs and data used by the CPU 2000 in the computer 1900.
  • the CD-ROM drive 2060 reads a program or data from the CD-ROM 2095 and provides it to the hard disk drive 2040 via the RAM 2020.
  • the input/output controller 2084 is also connected to the ROM 2010, the flexible disk drive 2050, and the input/output chip 2070, which are relatively low-speed input/output devices.
  • the ROM 2010 stores a boot program that the computer 1900 executes at startup and / or a program that depends on the hardware of the computer 1900.
  • the flexible disk drive 2050 reads a program or data from the flexible disk 2090 and provides it to the hard disk drive 2040 via the RAM 2020.
  • the input / output chip 2070 connects the flexible disk drive 2050 to the input / output controller 2084 and inputs / outputs various input / output devices via, for example, a parallel port, a serial port, a keyboard port, a mouse port, and the like. Connect to controller 2084.
  • the program provided to the hard disk drive 2040 via the RAM 2020 is stored in a recording medium such as the flexible disk 2090, the CD-ROM 2095, or an IC card and provided by the user.
  • the program is read from the recording medium, installed in the hard disk drive 2040 in the computer 1900 via the RAM 2020, and executed by the CPU 2000.
  • the program installed in the computer 1900 and causing the computer 1900 to function as the data processing apparatus 10 and the like includes an acquisition module, an explanatory variable selection module, an initialization module, a generation module, a learning processing module, a model selection module, and a determination module. These programs or modules work on the CPU 2000 or the like to make the computer 1900 into an acquisition unit 110, an explanatory variable selection unit 120, an initialization unit 122, a generation unit 124, a learning processing unit 130, a model selection unit 140, and a determination unit. Each of them may function as 150.
  • the information processing described in these programs is read into the computer 1900 and thereby functions as the acquisition unit 110, the explanatory variable selection unit 120, the initialization unit 122, the generation unit 124, the learning processing unit 130, the model selection unit 140, and the determination unit 150, which are specific means in which the software and the various hardware resources described above cooperate.
  • the specific data processing apparatus 10 according to the purpose of use is constructed by realizing calculation or processing of information according to the purpose of use of the computer 1900 in this embodiment by these specific means.
  • the CPU 2000 executes a communication program loaded on the RAM 2020 and instructs the communication interface 2030 to perform communication processing based on the processing content described in the communication program.
  • the communication interface 2030 reads transmission data stored in a transmission buffer area or the like provided on a storage device such as the RAM 2020, the hard disk drive 2040, the flexible disk 2090, or the CD-ROM 2095, and sends it to the network.
  • the communication interface 2030 also writes reception data received from the network into a reception buffer area or the like provided on the storage device.
  • the communication interface 2030 may transfer transmission/reception data to/from the storage device by a DMA (direct memory access) method. Instead, the CPU 2000 may transfer the transmission/reception data by reading the data from the transfer-source storage device or communication interface 2030 and writing the data to the transfer-destination communication interface 2030 or storage device.
  • the CPU 2000 reads all or a necessary portion of a file or database stored in an external storage device such as the hard disk drive 2040, the CD-ROM drive 2060 (CD-ROM 2095), or the flexible disk drive 2050 (flexible disk 2090) into the RAM 2020 by DMA transfer or the like, and performs various processes on the data in the RAM 2020. The CPU 2000 then writes the processed data back to the external storage device by DMA transfer or the like.
  • the RAM 2020 and the external storage device are collectively referred to as a memory, a storage unit, or a storage device.
  • the CPU 2000 can also hold part of the contents of the RAM 2020 in a cache memory and perform reading and writing on the cache memory. Even in such a form, the cache memory bears part of the function of the RAM 2020; therefore, in the present embodiment, the cache memory is regarded as included in the RAM 2020, the memory, and/or the storage device unless otherwise indicated.
  • the CPU 2000 performs, on the data read from the RAM 2020, the various kinds of processing described in the present embodiment, such as various operations, information processing, condition determination, and information search/replacement, as specified by the instruction sequence of the program, and writes the result back to the RAM 2020. For example, when performing condition determination, the CPU 2000 determines whether each of the various variables shown in the present embodiment satisfies a condition such as being larger than, smaller than, at least, at most, or equal to another variable or constant, and when the condition is satisfied (or not satisfied), branches to a different instruction sequence or calls a subroutine.
  • the CPU 2000 can search for information stored in a file or database in the storage device. For example, in the case where a plurality of entries in which the attribute value of the second attribute is associated with the attribute value of the first attribute are stored in the storage device, the CPU 2000 displays the plurality of entries stored in the storage device. The entry that matches the condition in which the attribute value of the first attribute is specified is retrieved, and the attribute value of the second attribute that is stored in the entry is read, thereby associating with the first attribute that satisfies the predetermined condition The attribute value of the specified second attribute can be obtained.
  • The programs or modules described above may be stored in an external recording medium. Examples of the recording medium include:
  • an optical recording medium such as DVD or CD
  • a magneto-optical recording medium such as MO
  • a tape medium, a semiconductor memory such as an IC card, and the like
  • In addition, a storage device such as a hard disk or a RAM provided in a server system connected to a dedicated communication network or the Internet may be used as the recording medium, and the program may be provided to the computer 1900 via the network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The number of possible gene combinations is vast, and it has been difficult to accurately and efficiently identify, from among all combinations, the genes that cause an event. The present invention provides a data processing device comprising: an acquisition unit that acquires a plurality of sample data in which the value of each of a plurality of explanatory variables is associated with the occurrence or non-occurrence of an event; an explanatory variable selection unit that selects sets of explanatory variables from the plurality of explanatory variables; a learning processing unit that, based on the plurality of sample data, learns for each of the multiple sets of explanatory variables a prediction model that predicts the occurrence or non-occurrence of the event from the values of the selected explanatory variables; a model selection unit that preferentially selects, from among the plurality of prediction models corresponding to different sets of selected explanatory variables, a prediction model with a higher evaluation; and a determination unit that determines, as a set of cause explanatory variables, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.

Description

Data processing device, data processing method, and data processing program
 The present invention relates to a data processing device, a data processing method, and a data processing program.
 With the spread of next-generation DNA sequencers that decode genetic information at high speed, vast amounts of genetic information on humans and other organisms have become available, and research is under way to identify genes that serve as explanatory variables for biological events such as diseases. Biological events are often caused by combinations of several genes, and methods for estimating such gene combinations are conventionally known (for example, Patent Documents 1 to 3).
 [Patent Document 1] JP 2003-4739 A
 [Patent Document 2] JP 2002-528095 A
 [Patent Document 3] JP 2011-248789 A
 However, the number of possible gene combinations is enormous, and it has remained difficult to estimate efficiently and accurately the combination of genes that causes an event.
 In a first aspect of the present invention, there is provided a data processing device that identifies, from among a plurality of explanatory variables, a cause explanatory variable set, i.e., a set of at least one explanatory variable that causes a predetermined event. The data processing device comprises: an acquisition unit that acquires a plurality of sample data in which the value of each of a plurality of explanatory variables is associated with the occurrence or non-occurrence of an event; an explanatory variable selection unit that repeatedly selects a set of selected explanatory variables from the plurality of explanatory variables, each selection being made at random without depending on previously selected sets of selected explanatory variables; a learning processing unit that, based on the plurality of sample data, learns for each set of selected explanatory variables a prediction model that predicts the occurrence or non-occurrence of the event from the values of the selected explanatory variables; a model selection unit that preferentially selects, from among the plurality of prediction models corresponding to different sets of selected explanatory variables, a prediction model with a higher evaluation; and a determination unit that determines, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.
 The above summary of the invention does not enumerate all the features of the present invention. Sub-combinations of these feature groups may also constitute inventions.
 The drawings show the following:
  • A block diagram of the data processing device 10 of this embodiment.
  • A processing flow performed by the data processing device 10 of this embodiment.
  • An example of sample data according to this embodiment.
  • An example of sets of explanatory variables according to this embodiment.
  • The selection probabilities with which the initialization unit 122 selects each set as the set of selected explanatory variables.
  • An example of the occurrence probability of an event for each gene.
  • An example of a processing flow for learning a prediction model by multivariate analysis.
  • An example of a multidimensional space generated by the learning processing unit 130.
  • An example of a processing flow for learning a prediction model by the maximum likelihood estimation method.
  • An example of a processing flow for learning a prediction model by the Bayesian method.
  • An example of discrimination by the maximum likelihood estimation method or the Bayesian method.
  • An example of a method of generating sets of selected explanatory variables by the Markov chain Monte Carlo method.
  • A modification of the processing flow of S200 by the explanatory variable selection unit 120.
  • A modification of the generation of sets of selected explanatory variables by the Markov chain Monte Carlo method by the generation unit 124.
  • A parallel processing device 12 according to a modification of this embodiment that implements parallel processing.
  • An example of the effect of learning according to this embodiment.
  • An example of a hardware configuration of the computer 1900.
 Hereinafter, the present invention will be described through embodiments of the invention, but the following embodiments do not limit the invention according to the claims. In addition, not all combinations of the features described in the embodiments are essential to the solving means of the invention.
 FIG. 1 shows a block diagram of the data processing device 10 of this embodiment. The data processing device 10 identifies, from among a plurality of explanatory variables, a cause factor set, i.e., a set of at least one explanatory variable that causes a predetermined event. For example, the data processing device 10 identifies, from among a plurality of genes, a set of at least one gene that causes an event such as a disease as the cause factor set. The data processing device 10 includes an acquisition unit 110, an explanatory variable selection unit 120, a learning processing unit 130, a model selection unit 140, and a determination unit 150.
 The acquisition unit 110 acquires a plurality of sample data in which the value of each of a plurality of explanatory variables is associated with the occurrence or non-occurrence of an event. For example, the acquisition unit 110 acquires, from the database 20, sample data on a plurality of subjects in which gene-related values (for example, the presence or absence, modification, or expression level of a specific structural gene) are associated with the presence or absence of a disease. The acquisition unit 110 provides the acquired sample data to the explanatory variable selection unit 120.
 The explanatory variable selection unit 120 selects a set of at least one selected explanatory variable from the plurality of explanatory variables. For example, the explanatory variable selection unit 120 selects, from among the plurality of genes possessed by the subjects included in the sample data, a set of genes containing a predetermined number of genes as the set of selected explanatory variables, based on a predetermined method (for example, the bootstrap method or the Markov chain Monte Carlo method). The explanatory variable selection unit 120 includes an initialization unit 122 and a generation unit 124.
 The initialization unit 122 determines the set of selected explanatory variables that the explanatory variable selection unit 120 selects at the start of the search for cause explanatory variables. For example, the initialization unit 122 includes in the initial set of selected explanatory variables randomly selected explanatory variables, explanatory variables that occur frequently in the plurality of sample data, or explanatory variables whose individual contribution to the occurrence or non-occurrence of the event in the plurality of sample data is high (for example, a small Wilks' lambda statistic, a high sensitivity or specificity, and/or a small Akaike information criterion (AIC) statistic).
 The generation unit 124 generates the sets of selected explanatory variables that the explanatory variable selection unit 120 selects after the initial stage of the search. For example, the generation unit 124 uses the Markov chain Monte Carlo method to generate sets of selected explanatory variables sequentially from the initial set of selected explanatory variables. In this way, the generation unit 124 generates a set of selected explanatory variables whose combination is close to that of the previously selected set.
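 As a minimal sketch of the "close to the previous combination" behavior described above, one Markov-chain step can be realized by swapping a single variable of the current set. The function name and the uniform swap rule below are illustrative assumptions, not the proposal or acceptance rule actually used by the generation unit 124:

```python
import random

def propose_neighbor(current_set, n_vars, rng=None):
    """One Markov-chain step: return a new set of selected explanatory
    variables that differs from the current set in exactly one element.
    current_set: list of distinct variable indices in [0, n_vars)."""
    rng = rng or random.Random()
    current = set(current_set)
    dropped = rng.choice(sorted(current))                      # variable to remove
    added = rng.choice([g for g in range(n_vars) if g not in current])
    return sorted((current - {dropped}) | {added})
```

 Repeatedly applying such a step yields a chain of candidate sets, each a small perturbation of its predecessor.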
 The initialization unit 122 and the generation unit 124 may determine or generate the set of selected explanatory variables based on the evaluation of prediction models obtained from previously generated sets of selected explanatory variables. Details of the processing by the explanatory variable selection unit 120 will be described later. The explanatory variable selection unit 120 supplies the set of selected explanatory variables to the learning processing unit 130.
 The learning processing unit 130 learns, based on the plurality of sample data, a prediction model that predicts the occurrence or non-occurrence of the event from the value of each selected explanatory variable in the set of selected explanatory variables. For example, the learning processing unit 130 learns a prediction model that predicts the occurrence of a disease from, for example, the presence or absence of the genes in each subject's set of selected explanatory variables in the sample data. The learning processing unit 130 thereby obtains a prediction model for each selected set of selected explanatory variables. The specific content of the learning processing of the learning processing unit 130 is described later.
 The learning processing unit 130 also generates an evaluation of the accuracy with which each prediction model predicts event occurrence, and supplies it to the explanatory variable selection unit 120. For example, the learning processing unit 130 compares the occurrence of the event predicted from the sample data by each prediction model with the actual occurrence or non-occurrence of the event, and generates the evaluation from this comparison. The learning processing unit 130 supplies the set of selected explanatory variables corresponding to each learned prediction model and its evaluation to the model selection unit 140.
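 The patent leaves the evaluation measure open; one minimal example of comparing predictions against observed outcomes is a simple agreement rate. The helper name `accuracy` below is a hypothetical illustration, not the evaluation actually used by the learning processing unit 130:

```python
def accuracy(predicted, observed):
    """Fraction of subjects for which the predicted event occurrence
    (1/0) matches the observed outcome (1/0)."""
    if len(predicted) != len(observed):
        raise ValueError("prediction/observation length mismatch")
    hits = sum(1 for p, o in zip(predicted, observed) if p == o)
    return hits / len(observed)
```

 Other measures (sensitivity, specificity, AIC, and so on, mentioned elsewhere in the specification) could be substituted for this agreement rate.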
 The model selection unit 140 preferentially selects, from among the plurality of prediction models that the learning processing unit 130 has learned for different sets of selected explanatory variables, a prediction model with a higher evaluation. For example, the model selection unit 140 selects the prediction model with the highest evaluation. The model selection unit 140 supplies the selected prediction model to the determination unit 150.
 The determination unit 150 determines the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit 140 as the cause explanatory variable set. The determination unit 150 can thereby preferentially identify as the cause of the event the set of selected explanatory variables that yields a highly evaluated prediction model, i.e., one that predicts the occurrence of the event with high accuracy.
 In this way, the data processing device 10 selects sets of selected explanatory variables from among the candidate cause explanatory variables in the sample data by the Markov chain Monte Carlo method or the like, learns a prediction model for each selected set of selected explanatory variables, and identifies the set of selected explanatory variables corresponding to a highly evaluated prediction model as the cause explanatory variables.
 FIG. 2 shows a processing flow performed by the data processing device 10 of this embodiment. The data processing device 10 identifies the cause explanatory variable set by executing the processes of S120 to S240.
 First, in S120, the acquisition unit 110 acquires, for a plurality of subjects, sample data in which the value of each of a plurality of explanatory variables is associated with the occurrence or non-occurrence of an event. For example, the acquisition unit 110 acquires from the database 20 sample data that associates, for a plurality of subjects, the presence or absence of specific gene expression with the presence or absence of a specific disease such as colorectal cancer.
 FIG. 3 shows an example of sample data. The acquisition unit 110 acquires, as sample data, information on whether each of M subjects expresses each of n (for example, several million) genes g1 to gn (for example, 1 if the gene is expressed, 0 otherwise), together with information associated with a biological event such as the presence or absence of a disease (for example, 1 if the disease has occurred, 0 otherwise). For example, the sample data shown in FIG. 3 indicate that subject 1 expresses gene g1 and gene g2, does not express gene g3, ..., expresses gene gn, and has no disease; that subject 2 does not express gene g1, expresses gene g2 and gene g3, ..., expresses gene gn, and has the disease; and that subject M expresses gene g1 and gene g2, does not express gene g3, ..., does not express gene gn, and has no disease.
 In addition to or instead of the presence or absence of the subject's gene expression, the acquisition unit 110 may acquire information on gene expression levels, positional information on gene sequence polymorphisms, the frequency of gene mutations, the type and site of gene modifications, and/or the degree of gene modification. The acquisition unit 110 may also acquire information related to gene expression. For example, the acquisition unit 110 may acquire sample data including information on the amount of protein produced by gene expression and translation; the type, site, and/or degree of modification of transcripts or proteins; and the type and/or amount of metabolites produced as a result of the functional expression of the produced protein (for example, (a) lipids, carbohydrates, vitamins, amino acids, nucleic acids, other alcohols, organic acids or their esters, other amines, or other organic compounds; (b) minerals or their ions, or other inorganic compounds (nitrogen compounds, sulfur compounds, phosphorus-containing compounds, halogen compounds, etc.) or their ions; or (c) complexes or coordination compounds thereof, or their degradation products). The acquisition unit 110 provides the acquired sample data to the explanatory variable selection unit 120.
 Next, in S140, the initialization unit 122 determines an initial set of selected explanatory variables. For example, the initialization unit 122 may determine an initial set of selected explanatory variables that contains, as selected explanatory variables, a predetermined number s of explanatory variables selected at random with equal probability from all the explanatory variables in the sample data.
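 The equal-probability initialization of S140 can be sketched in a few lines; the helper name is illustrative:

```python
import random

def initial_set(n_vars, s, seed=None):
    """Choose s distinct explanatory-variable indices out of n_vars
    candidates, each combination equally likely (cf. S140)."""
    return sorted(random.Random(seed).sample(range(n_vars), s))
```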
 Alternatively, the initialization unit 122 may preferentially include in the initial set of selected explanatory variables those explanatory variables that occur frequently in the plurality of sample data. For example, the initialization unit 122 determines genes that subjects express frequently in the sample data (or in nature) as initial selected explanatory variables. As one example, the initialization unit 122 may select the s most frequent genes, or select s mutually distinct genes with selection probabilities corresponding to their frequencies, as the initial set of selected explanatory variables.
 The initialization unit 122 may select the set of selected explanatory variables from the entire plurality of explanatory variables. Alternatively, it may first extract, from the plurality of explanatory variables, a subset of explanatory variables whose individual contribution to the occurrence or non-occurrence of the event satisfies a predetermined criterion and/or whose frequency of expression in subjects is higher than a predetermined criterion, and then select s selected explanatory variables from the extracted subset. As one example, the initialization unit 122 may extract, from the plurality of explanatory variables, only those whose individual contribution to the occurrence or non-occurrence of the event, as measured by a t-test, is at least average, and select s genes from the extracted explanatory variables. The initialization unit 122 may also exclude from the plurality of explanatory variables those corresponding to genes whose expression frequency is lower than the occurrence frequency of the event, and select s genes from the remainder.
 A method by which the initialization unit 122 determines the set of selected explanatory variables based on frequency is described with reference to FIGS. 4 and 5. FIG. 4 shows an example of sets of explanatory variables according to this embodiment. From combinations of the n explanatory variables g1 to gn, N (N = nCs) sets of explanatory variables G1 to GN, each containing s explanatory variables, can be generated. In the example of FIG. 4, set G1 contains g10, g41, g301, and g510; set G2 contains g10, g41, g301, and g282; ...; and set GN contains ga, gb, gc, and gd (a, b, c, d ∈ {1, ..., n}).
 FIG. 5 shows an example of the selection probabilities with which the initialization unit 122 selects each set as the initial set of selected explanatory variables. The horizontal axis of the graph shows the N sets of explanatory variables G1 to GN in order, and the vertical axis shows the selection probability Ps corresponding to each set of explanatory variables. The initialization unit 122 selects a set of explanatory variables with a probability according to its selection probability Ps.
 In FIG. 5, the sets of explanatory variables are arranged in order of frequency. That is, the set of explanatory variables (i.e., genes) Gx placed at the far left of the graph is the combination of genes that appears (or is expected to appear) most frequently in the sample data (or in nature); the set Gy placed immediately to the right of Gx is the next most frequent combination; ...; and the set Gz placed at the far right is the combination of genes that appears (or is expected to appear) least frequently in the sample data (or in nature). The number of explanatory variables contained in each set arranged in FIG. 5 may be a single fixed value (for example, 3) or may vary over the sets (for example, from 2 to 5).
 The selection probability Ps corresponding to each set of explanatory variables in the graph may be a value whose magnitude corresponds to the frequency of that combination of explanatory variables. The initialization unit 122 can thereby preferentially select sets of genes with high occurrence frequencies.
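 Selection in proportion to Ps can be sketched as weighted sampling without replacement. The loop below is one simple assumed realization of frequency-weighted selection, not the procedure specified by the patent:

```python
import random

def weighted_initial_set(genes, freqs, s, rng=None):
    """Draw s distinct genes, each draw weighted by its occurrence
    frequency, so frequent genes are preferentially selected."""
    rng = rng or random.Random()
    pool, weights = list(genes), list(freqs)
    chosen = []
    for _ in range(s):
        g = rng.choices(pool, weights=weights)[0]  # frequency-weighted draw
        i = pool.index(g)
        pool.pop(i)                                # remove so draws stay distinct
        weights.pop(i)
        chosen.append(g)
    return chosen
```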
 Alternatively, the initialization unit 122 may include in the initial set of selected explanatory variables those explanatory variables whose individual contribution to the occurrence or non-occurrence of the event in the plurality of sample data satisfies a predetermined criterion. For example, the initialization unit 122 calculates the event occurrence rate for each explanatory variable from the sample data. As one example, if 11 of 10,000 subjects who express gene g1 have the disease, the initialization unit 122 calculates that the degree to which the explanatory variable (gene g1) contributes to the occurrence of the event (the disease), i.e., the disease incidence associated with expression of gene g1, is 0.11%.
 FIG. 6 shows an example of the occurrence probability of an event for each gene. According to FIG. 6, in the sample data, the disease incidence among subjects having gene g1 is 0.11%, among subjects having gene g2 is 0.15%, among subjects having gene g3 is 0.73%, and among subjects having gene gn is 0.02%.
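 Per-gene incidence figures of this kind follow directly from counts such as "11 of 10,000 carriers". A sketch of that computation (the tuple data layout is an assumption for illustration):

```python
def event_rate(samples, gene_idx):
    """Event rate among subjects expressing gene gene_idx.
    Each sample is (gene_values, event_flag) with 1/0 entries."""
    carriers = [event for genes, event in samples if genes[gene_idx] == 1]
    return sum(carriers) / len(carriers) if carriers else 0.0
```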
 Here, the initialization unit 122 determines genes with a high degree of contribution to the occurrence or non-occurrence of the disease as initial selected explanatory variables. As one example, the initialization unit 122 may select the s genes with the highest degrees of contribution to the occurrence or non-occurrence of the disease, or select s mutually distinct genes with selection probabilities according to those degrees, as the initial set of selected explanatory variables.
 Next, in S160, the learning processing unit 130 learns, based on the plurality of sample data, a prediction model that predicts the occurrence or non-occurrence of the event from the value of each selected explanatory variable in the set determined in the immediately preceding process. For example, the learning processing unit 130 learns a prediction model that predicts the occurrence of the disease from, for example, the presence or absence of gene expression in each subject's set of selected explanatory variables in the sample data. As one example, the learning processing unit 130 learns the prediction model based on multivariate analysis (such as discriminant analysis and multiple regression analysis), machine learning (such as self-organizing maps, support vector machines, and deep learning), the maximum likelihood estimation method, or the Bayesian method.
 FIG. 7 shows an example of a processing flow for learning a prediction model by multivariate analysis. The learning processing unit 130 may execute the process of S160 using multivariate analysis such as multiple regression analysis, principal component analysis, and cluster analysis by executing the processes of S162 to S172.
 First, in S162, the learning processing unit 130 generates a plurality of functions, each containing as variables the selected explanatory variables in the set determined in the immediately preceding process, chosen so as to maximize the variance of the projections of the subjects in the sample data. For example, the learning processing unit 130 generates a function f(x) of the vector x (x = {x_i1, x_i2, ..., x_is}) of s variables x_ij indicating, for example, the presence or absence of the s genes g_ij included in the set of selected explanatory variables. f(x) may be a linear function of the elements of the vector x.
 As one example, assume that the set of selected explanatory variables contains the 10th, 23rd, and 45th genes. For convenience of explanation, suppose the learning processing unit 130 plots the subject of each sample data point in a three-dimensional space whose axes are, for example, the expression levels of the 10th, 23rd, and 45th genes. For example, if a subject's expression level of the 10th gene g10 is 0.1, that of gene g23 is 0.3, and that of gene g45 is 0.2, the learning processing unit 130 plots that subject at the point (0.1, 0.3, 0.2) in the three-dimensional space.
After completing the plotting of all subjects, or of at least a predetermined reference number of subjects, the learning processing unit 130 generates a first linear function f1(x) = a10·x10 + a23·x23 + a45·x45 + const1 in the three-dimensional space. Here, when the coordinates of each subject's plot are input to the first linear function f1(x) to obtain one output value per subject, the learning processing unit 130 optimizes the coefficients aij corresponding to the genes xij so that the variance of those output values is maximized.
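The variance-maximizing coefficient vectors of S162 correspond to principal component directions, so one way to realize this step is an eigendecomposition of the sample covariance matrix. A minimal sketch for the three-gene example above, using NumPy; the expression values are hypothetical:

```python
import numpy as np

# Hypothetical expression levels of genes g10, g23, g45 (one row per subject)
X = np.array([
    [0.1, 0.3, 0.2],
    [0.4, 0.1, 0.5],
    [0.2, 0.6, 0.1],
    [0.5, 0.2, 0.4],
    [0.3, 0.5, 0.3],
])

# Eigenvectors of the covariance matrix give candidate coefficient vectors a;
# the eigenvector with the largest eigenvalue maximizes the variance of f1(x) = a.x
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]   # decreasing variance: f1, f2, f3
coeffs = eigvecs[:, order]          # column k holds the coefficients of f_{k+1}

f1_outputs = X @ coeffs[:, 0]       # projections of all subjects onto f1
f2_outputs = X @ coeffs[:, 1]       # second-largest variance, as in S162
```

By construction the variance of `f1_outputs` equals the largest eigenvalue, and each subsequent function captures the next-largest variance.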
Next, the learning processing unit 130 generates a second linear function f2(x) different from the first linear function f1(x). For example, the learning processing unit 130 optimizes the coefficients aij corresponding to the genes xij so that the variance of the output values is the second largest, after that of the first linear function f1(x). The learning processing unit 130 may further generate a third linear function f3(x) and subsequent linear functions in decreasing order of the variance of their output values. In this way, the learning processing unit 130 generates a predetermined number of linear functions in decreasing order of output-value variance.
Next, in S164, the learning processing unit 130 selects, from among the plurality of generated functions, at least one function to be used for determining whether the event occurs in the plurality of sample data. Here, the learning processing unit 130 determines a combination of functions for which, when each subject is plotted in a multidimensional space whose axes are the output values of the functions, the boundary between occurrence and non-occurrence of the event can be determined more clearly. For example, when only some of the generated functions are to be selected, the learning processing unit 130 may calculate, for each linear function, the correlation coefficient between its output values for the subjects and the occurrence (or degree of occurrence) of the event in those subjects, and select one or more functions whose correlation coefficients have large absolute values. For convenience, this description assumes that the learning processing unit 130 selects the first linear function f1(x) and the second linear function f2(x) generated in S162.
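The correlation-based screening of S164 can be sketched as follows; the output values and event labels are hypothetical, and keeping the top two functions is an illustrative choice:

```python
import numpy as np

# Hypothetical output values of three candidate functions (one row per subject)
outputs = np.array([
    [2.1, 0.3, 1.0],
    [1.8, 0.1, 1.3],
    [0.4, 0.2, 0.9],
    [0.2, 0.4, 1.1],
])
event = np.array([1, 1, 0, 0])  # 1 = event (disease) occurred in the subject

# Absolute correlation of each function's outputs with event occurrence
corrs = np.array([abs(np.corrcoef(outputs[:, k], event)[0, 1])
                  for k in range(outputs.shape[1])])
selected = np.argsort(corrs)[::-1][:2]  # keep the two most correlated functions
```

Functions whose outputs barely co-vary with the event contribute little to the discrimination boundary and are dropped.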
Next, in S166, the learning processing unit 130 generates a multidimensional space whose dimensions are the values of the at least one selected function. For example, the learning processing unit 130 inputs the gene values of the selected explanatory variables of the plurality of subjects in the sample data into the selected functions, and plots the resulting values in a coordinate space having one axis per function.
FIG. 8 shows an example of the multidimensional space generated by the learning processing unit 130. Here, an example is shown in which the learning processing unit 130 generates a two-dimensional space from two functions. Each point in the graph corresponds to a subject in the sample data.
For example, the learning processing unit 130 plots in the two-dimensional space, together with the disease outcome, the point (o1, o2) whose component o1 on axis LD1 is the output value obtained by inputting the gene expression levels or the like of the subject's selected explanatory variables into the first function f1(x), and whose component o2 on axis LD2 is the output value obtained by inputting them into the second function f2(x). In FIG. 8, the dotted and solid circles indicate subjects in whom the disease did not occur (i.e., healthy subjects), and the dotted and solid crosses indicate subjects in whom the disease occurred (i.e., non-healthy subjects).
Next, in S168, the learning processing unit 130 generates a discriminant function that predicts the occurrence of the event from the selected explanatory variables. For example, the learning processing unit 130 generates the discriminant function that most accurately separates healthy subjects (no disease) from non-healthy subjects (disease) based on various discrimination methods such as linear discrimination, quadratic discrimination, self-organizing maps, and support vector machines. FIG. 8 shows the case in which the learning processing unit 130 generates a linear discriminant function TH. Subjects plotted above the linear discriminant function TH shown in FIG. 8 are predicted to be non-healthy, and subjects plotted below it are predicted to be healthy. In this way, by generating a discriminant function, the learning processing unit 130 learns a prediction model that predicts the occurrence of the event based on position in the multidimensional space. For example, by selecting the first function f1(x) and the second function f2(x), the learning processing unit 130 can determine a clear boundary TH separating healthy and non-healthy subjects in the two-dimensional space whose axes are the output values of these functions.
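Of the discrimination methods listed for S168, the linear case can be sketched with Fisher's linear discriminant: a weight vector from the pooled within-class covariance and the class means, with the boundary TH placed at the midpoint score. The (o1, o2) coordinates below are hypothetical:

```python
import numpy as np

# Hypothetical (o1, o2) coordinates in the space of FIG. 8
healthy = np.array([[0.2, 0.1], [0.3, 0.2], [0.1, 0.3], [0.4, 0.1]])
diseased = np.array([[0.8, 0.9], [0.9, 0.7], [0.7, 0.8], [0.6, 0.9]])

# Fisher linear discriminant: w = Sw^-1 (m1 - m0), boundary at the midpoint score
m0, m1 = healthy.mean(axis=0), diseased.mean(axis=0)
Sw = np.cov(healthy, rowvar=False) + np.cov(diseased, rowvar=False)
w = np.linalg.solve(Sw, m1 - m0)
threshold = w @ ((m0 + m1) / 2.0)

def predict_diseased(point):
    """True if the point falls on the non-healthy side of the boundary TH."""
    return bool(np.dot(w, point) >= threshold)
```

Quadratic discriminants, self-organizing maps, or support vector machines would replace only the boundary-fitting step; the plotting and evaluation steps are unchanged.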
Next, in S172, the learning processing unit 130 evaluates the prediction model using the learned discriminant function. For example, the learning processing unit 130 evaluates the accuracy with which the discriminant function predicts the occurrence of the event. In FIG. 8, a dotted circle indicates a subject for whom the discriminant function predicted the disease but who actually has no disease; a solid circle indicates a subject for whom the discriminant function did not predict the disease and who actually has no disease; a dotted cross indicates a subject for whom the discriminant function did not predict the disease but who actually has the disease; and a solid cross indicates a subject for whom the discriminant function predicted the disease and who actually has the disease.
As an example, the learning processing unit 130 calculates, as the evaluation of the discriminant function, at least one of the sensitivity (the proportion of solid crosses among all crosses in FIG. 8) and the specificity (the proportion of solid circles among all circles), or the average of the two, when the presence or absence of the disease is predicted from the gene expression of the plurality of subjects in the sample data by the discriminant functions generated by the various discrimination methods. The evaluation of the discriminant function serves as the evaluation of the prediction model corresponding to that discriminant function.
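The sensitivity and specificity used as the evaluation in S172 reduce to ratios over the four prediction/outcome cells of FIG. 8; a minimal sketch with hypothetical labels:

```python
def sensitivity_specificity(predicted, actual):
    """predicted/actual: lists of booleans, True = disease.

    Sensitivity = true positives / all actually diseased (solid x / all x in FIG. 8);
    specificity = true negatives / all actually healthy (solid o / all o).
    """
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = sensitivity_specificity(
    predicted=[True, True, False, False, True],
    actual=[True, False, False, True, True],
)
```

Averaging the two values, as the text allows, gives a single score that is insensitive to class imbalance between healthy and non-healthy subjects.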
In this way, by plotting the sample data in a multidimensional space generated from a plurality of variance-maximizing functions, the learning processing unit 130 learns a prediction model with an increased likelihood of being able to discriminate the occurrence of the event based on the plurality of selected explanatory variables.
FIG. 9 shows an example of a processing flow for learning a prediction model by the maximum likelihood estimation method. By executing the processing of S262 to S266, the learning processing unit 130 may perform the processing of S160 using maximum likelihood estimation.
First, in S262, the learning processing unit 130 generates a likelihood function. For example, based on the sample data, the learning processing unit 130 calculates a likelihood function lik(θ) = fD(xi | θ) that takes as input the variables xi indicating the presence/absence, expression level, or the like of the genes gi included in the set of selected explanatory variables and outputs the likelihood θ of the event occurring. As an example, when the set of selected explanatory variables includes the 10th, 23rd, and 45th genes, the learning processing unit 130 calculates the likelihood function lik(θ) = fD(x10, x23, x45 | θ).
Next, in S264, the learning processing unit 130 determines, based on the likelihood function, whether the event occurs for each subject in the sample data. For example, the learning processing unit 130 inputs the genes included in each subject's set of selected explanatory variables into the corresponding likelihood function to calculate the likelihood that the disease occurs in that subject. The learning processing unit 130 determines that the disease does not occur when the likelihood falls below a predetermined reference (for example, 0.5), and determines that the disease occurs when the likelihood is at or above the reference.
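The threshold judgment of S264 can be sketched as follows. The text only specifies the form lik(θ) = fD(x | θ), so the logistic function, weights, and bias below are assumed purely for illustration:

```python
import math

def likelihood(x10, x23, x45, weights=(2.0, 1.5, -1.0), bias=-0.8):
    """Assumed logistic form of lik(theta) for the three-gene example;
    the weights and bias are hypothetical, not from the source."""
    z = weights[0] * x10 + weights[1] * x23 + weights[2] * x45 + bias
    return 1.0 / (1.0 + math.exp(-z))

def judge(x10, x23, x45, reference=0.5):
    """S264: the disease is judged to occur iff the likelihood is at or
    above the predetermined reference (0.5 in the text's example)."""
    return likelihood(x10, x23, x45) >= reference
```

Whatever functional form fD takes, S264 only consumes its scalar output, so the judgment step is independent of how the likelihood was modeled.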
Next, in S266, the learning processing unit 130 evaluates the accuracy of the likelihood function. For example, the learning processing unit 130 compares the actual occurrence of the event for each subject in the sample data with the occurrence predicted for that subject by the likelihood function. As an example, the learning processing unit 130 may evaluate the likelihood function by sensitivity and specificity, as in the processing of S172.
FIG. 10 shows an example of a processing flow for learning a prediction model by the Bayesian method. By executing the processing of S362 to S366, the learning processing unit 130 may perform the processing of S160 using the Bayesian method.
First, in S362, the learning processing unit 130 calculates, for the sample data, the posterior probability that each target gives rise to the event. For example, the learning processing unit 130 may use as the prior probability the frequency of the set of explanatory variables described in connection with FIG. 5, or the product of the event occurrence probabilities of the selected explanatory variables described in connection with FIG. 6, use the likelihood function generated by the processing of FIG. 9 as the likelihood, and calculate the posterior probability based on the product of the prior probability and the likelihood. The learning processing unit 130 may also calculate the posterior probability by a sampling algorithm based on a Markov chain Monte Carlo method such as the Metropolis-Hastings method.
Next, in S364, the learning processing unit 130 determines whether the event occurs. For example, the learning processing unit 130 calculates a posterior probability distribution based on the prior probability derived from the genes included in each subject's set of selected explanatory variables and on the result of inputting those genes into the corresponding likelihood function. The learning processing unit 130 determines that the disease does not occur when the posterior probability (for example, the mean, median, or mode of the posterior probability distribution) falls below a predetermined reference (for example, 0.5), and determines that the disease occurs when the posterior probability is at or above the reference.
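For the binary disease/no-disease case, the product of prior and likelihood in S362, normalized over the two hypotheses, is just Bayes' rule; a minimal sketch with hypothetical numbers:

```python
def posterior(prior, lik_diseased, lik_healthy):
    """S362/S364: posterior probability of disease by Bayes' rule.

    prior: prior probability of disease (e.g. from the frequency of the
    explanatory variable set); lik_*: likelihood of the observed gene values
    under each hypothesis. All values here are hypothetical.
    """
    num = prior * lik_diseased
    return num / (num + (1.0 - prior) * lik_healthy)

p = posterior(prior=0.3, lik_diseased=0.8, lik_healthy=0.2)
judged_diseased = p >= 0.5  # reference 0.5, as in S364
```

When the posterior is instead obtained as a Metropolis-Hastings sample rather than in closed form, the same 0.5 reference is applied to a summary of the sampled distribution (mean, median, or mode), as the text describes.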
Next, in S366, the learning processing unit 130 evaluates the accuracy of the posterior probability. For example, the learning processing unit 130 compares the actual occurrence of the event for each subject in the sample data with the occurrence predicted for that subject from the posterior probability. As an example, the learning processing unit 130 may perform the evaluation by sensitivity and specificity, as in the processing of S172.
FIG. 11 shows an example of discrimination by the maximum likelihood estimation method or the Bayesian method. The x1 and x2 axes of the graph correspond to the expression levels or the like of the two genes when the set of selected explanatory variables includes two genes, and the z axis corresponds to the likelihood in the maximum likelihood estimation method or the posterior probability in the Bayesian method. Each point in the graph corresponds to a subject in the sample data. In FIG. 11, the dotted and solid circles indicate subjects in whom the disease did not occur (i.e., healthy subjects), and the dotted and solid crosses indicate subjects in whom the disease occurred (i.e., non-healthy subjects).
For example, the learning processing unit 130 predicts a subject to be non-healthy when the subject's likelihood or posterior probability is at or above a threshold TH (for example, 0.5), and predicts the subject to be healthy when it is below the threshold TH. In FIG. 11, a dotted circle indicates a subject for whom the disease was predicted but who actually has no disease; a solid circle indicates a subject for whom the disease was not predicted and who actually has no disease; a dotted cross indicates a subject for whom the disease was not predicted but who actually has the disease; and a solid cross indicates a subject for whom the disease was predicted and who actually has the disease.
As an example, the learning processing unit 130 may calculate as the evaluation at least one of the sensitivity (the proportion of solid crosses among all crosses in FIG. 11) and the specificity (the proportion of solid circles among all circles) when the presence or absence of the disease is predicted from the genes of the plurality of subjects in the sample data by the likelihood or the posterior probability.
Thus, in S160, the learning processing unit 130 learns a prediction model that predicts the occurrence of the event from a set of selected explanatory variables, using discriminant analysis, machine learning (self-organizing maps, support vector machines, deep learning, and so on), maximum likelihood estimation, the Bayesian method, or the like. In addition to or instead of these, the learning processing unit 130 may learn the prediction model using multivariate analysis such as multiple regression analysis, principal component analysis, and cluster analysis. The learning processing unit 130 also supplies the set of selected explanatory variables corresponding to each prediction model, together with the evaluation of each prediction model, to the model selection unit 140.
Following S160, in S180, the explanatory variable selection unit 120 determines whether to continue selecting sets of selected explanatory variables. For example, the explanatory variable selection unit 120 ends the selection of sets of selected explanatory variables on condition that the learning process of S160 has been executed a predetermined number of times and/or that S160 has learned a prediction model whose evaluation is at or above a predetermined reference, and advances the processing to S220. When the selection of sets of selected explanatory variables is not to be ended, the explanatory variable selection unit 120 advances the processing to S200.
In S200, the explanatory variable selection unit 120 selects, via the initialization unit 122 or the generation unit 124, at least one set of selected explanatory variables from among the plurality of explanatory variables. For example, the initialization unit 122 of the explanatory variable selection unit 120 may randomly select a set of selected explanatory variables containing a plurality of (for example, s) selected explanatory variables, by the same processing as in S140. The initialization unit 122 repeatedly selects sets of selected explanatory variables in the loop of S160 to S200. In each selection of S200 during the iteration, the initialization unit 122 may randomly select a set of selected explanatory variables without depending on the sets selected in the past. Here, in selecting a set of selected explanatory variables independently of past selections, the initialization unit 122 may adopt the bootstrap method, allowing a set identical to a previously selected one to be selected again, or may adopt the jackknife method, excluding previously selected sets from selection.
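The bootstrap/jackknife distinction above amounts to whether a freshly drawn set may repeat one already in the selection history. A minimal sketch, with hypothetical gene names and set size:

```python
import random

def select_initial_set(genes, s, history, method="bootstrap", rng=random):
    """Randomly pick s genes as a set of selected explanatory variables.

    bootstrap: a previously selected set may recur;
    jackknife: sets already in `history` are excluded and redrawn.
    """
    while True:
        chosen = frozenset(rng.sample(genes, s))
        if method == "bootstrap" or chosen not in history:
            return chosen

# Hypothetical pool of 50 candidate genes; draw five distinct 3-gene sets
genes = [f"g{i}" for i in range(1, 51)]
history = set()
for _ in range(5):
    history.add(select_initial_set(genes, 3, history, method="jackknife"))
```

With the jackknife option, each returned set is guaranteed new, so the history grows by one set per call.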
Instead of selection by the initialization unit 122, the generation unit 124 of the explanatory variable selection unit 120 may generate the set of selected explanatory variables. For example, the generation unit 124 may use a Markov chain Monte Carlo method to generate sets of selected explanatory variables sequentially, starting from an initial set. In this way, the generation unit 124 generates a set of selected explanatory variables close to the previously selected set.
FIG. 12 shows an example of a method of generating sets of selected explanatory variables by the Markov chain Monte Carlo method. The horizontal axis of the graph shows N arranged sets of explanatory variables G1 to GN. In FIG. 12, the sets of explanatory variables are arranged in order of closeness of their combinations. For example, if the set of explanatory variables corresponding to G1 includes genes g1, g2, and g3, the set of explanatory variables corresponding to G2, adjacent to G1, may include genes g1, g2, and g4, in which only gene g3 has been replaced by another gene g4 close to g3.
For example, in FIG. 12, the plurality of explanatory variable sets may be arranged along the horizontal axis such that the distance between sets reflects their similarity, based on the similarity of the sets of genes (for example, edit distance) and/or the similarity of the genes themselves. The sets of explanatory variables arranged in FIG. 12 may all contain the same number of explanatory variables (for example, three) or varying numbers (for example, two and three). Although FIG. 12 shows the plurality of sets arranged in one dimension for the sake of explanation, the plurality of explanatory variable sets may be arranged in multiple dimensions.
In the graph of FIG. 12, Gi denotes the set of selected explanatory variables selected in the immediately preceding selection (S140 or the previous processing of S200), and the vertical axis shows the selection probability Ps with which the generation unit 124 selects each set of explanatory variables. That is, in S200, the generation unit 124 selects a set of explanatory variables with a probability according to the selection probability Ps.
As shown in FIG. 12, the set whose combination of explanatory variables is closest to the previously selected set Gi has the highest selection probability Ps, and the selection probability Ps decreases gradually with distance from Gi. For example, FIG. 12 shows a probability distribution that is a normal distribution peaking at Gi. In this way, the generation unit 124 generates a set of selected explanatory variables whose combination of explanatory variables is close to the previously selected set. Note that the generation unit 124 may avoid re-selecting the previously selected set Gi or any set selected in the past.
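The normal-shaped selection probability of FIG. 12 can be sketched as a discretized Gaussian proposal over the arranged set indices; the width `sigma` and the index arrangement are assumptions for illustration:

```python
import math
import random

def propose_next(index_of_current, n_sets, sigma=2.0, rng=random):
    """Pick the index of the next set with probability Ps peaking at the
    current set G_i (discretized normal proposal, as in the FIG. 12 sketch;
    sigma is a hypothetical width parameter)."""
    weights = [math.exp(-((k - index_of_current) ** 2) / (2 * sigma ** 2))
               for k in range(n_sets)]
    r = rng.random() * sum(weights)
    for k, w in enumerate(weights):
        r -= w
        if r <= 0:
            return k
    return n_sets - 1

rng = random.Random(0)
draws = [propose_next(50, 101, sigma=2.0, rng=rng) for _ in range(2000)]
mean_draw = sum(draws) / len(draws)
```

Most proposals land within a few positions of the current set, so the chain explores the neighborhood of Gi while still occasionally jumping farther away.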
FIG. 13 shows a modification of the processing flow of S200 by the explanatory variable selection unit 120. In this example, by executing the processing of S202 to S206, the explanatory variable selection unit 120 has either the initialization unit 122 or the generation unit 124 select the set of selected explanatory variables according to the evaluation of the previously selected set.
First, in S202, the explanatory variable selection unit 120 determines whether the evaluation of the previously selected set of selected explanatory variables falls below a predetermined reference. For example, the explanatory variable selection unit 120 determines whether the evaluation generated in the processing of S172, S266, or S366 for the set of selected explanatory variables generated in the processing of S140 or S200 falls below the reference. The explanatory variable selection unit 120 advances the processing to S204 when it determines that the evaluation is below the reference, and otherwise advances the processing to S206.
Instead of judging the evaluation of the single previously selected set of selected explanatory variables, the explanatory variable selection unit 120 may determine whether the evaluations of the most recent predetermined number of consecutive selections all fall below the reference. For example, the explanatory variable selection unit 120 may advance the processing to S204 when the evaluations of the sets of selected explanatory variables generated in the last ten iterations are all below the reference value.
In S204, the initialization unit 122 newly selects an initial set of selected explanatory variables. For example, the initialization unit 122 may randomly determine the new initial set of selected explanatory variables by executing the same processing as in S140. For example, the initialization unit 122 may randomly select a new initial set of selected explanatory variables by the bootstrap method or the jackknife method, without depending on the sets selected in the past.
In S206, the generation unit 124 may use the Markov chain Monte Carlo method to generate sets of selected explanatory variables sequentially from the initial set. For example, the generation unit 124 may generate the set of selected explanatory variables by the technique described with reference to FIG. 12.
Thus, according to this modification, the initialization unit 122 determines a new initial set of selected explanatory variables on condition that the evaluation of the prediction model learned from the at least one selected explanatory variable chosen by the initialization unit 122 is below the reference, and that the evaluation of the prediction model learned from the set of selected explanatory variables generated by the generation unit 124 is below the reference. In this way, the explanatory variable selection unit 120 resets the set of selected explanatory variables serving as the starting point of the Markov chain Monte Carlo method when no highly evaluated set can be obtained. That is, when no accurate prediction model is obtained in a certain region, the explanatory variable selection unit 120 judges the search for explanatory variable sets in that region to be unpromising and starts a search in another region, making the search over sets of explanatory variables more efficient.
Thereafter, in the subsequent iterations of the loop of S160 to S200, the generation unit 124 generates further sets of selected explanatory variables sequentially from the new initial set. In this way, the data processing apparatus 10 of this modification can continue to search for and evaluate other sets of selected explanatory variables in the vicinity of a well-evaluated set (that is, a set of explanatory variables likely to contain the causal factors), and can thereby search for causal factors efficiently.
Further, according to this modification, the explanatory variable selection unit 120 can have the initialization unit 122 randomly select sets of selected explanatory variables until a set whose evaluation is at or above the reference is obtained, and switch to selection by the generation unit 124 using the Markov chain Monte Carlo method on condition that the evaluation of the prediction model learned from a randomly selected set of selected explanatory variables is at or above the reference.
Thus, the data processing apparatus 10 of this modification can try combinations of selected explanatory variables at random and, once a set of selected explanatory variables yields an evaluation above a certain level, search in the vicinity of that set for even more promising sets of selected explanatory variables as candidate combinations of causal explanatory variables.
FIG. 14 shows a modification of the generation of sets of selected explanatory variables by the generation unit 124 using the Markov chain Monte Carlo method. When generating sets of selected explanatory variables by the Markov chain Monte Carlo method, the generation unit 124 may use a probability distribution of fixed shape, or may instead use, as the proposal distribution, probability distributions of different shapes such as those shown by the solid and dotted lines in FIG. 14.
For example, the generation unit 124 changes the proposal distribution in the Markov chain Monte Carlo method according to the evaluation of the prediction model learned from the set of selected explanatory variables. As one example, when generating a set of selected explanatory variables in S206, the generation unit 124 may use a distribution whose variance is negatively correlated with the evaluation of the previously selected set (e.g., a variance inversely proportional to the evaluation value). As a result, when the evaluation of the previously selected set is high, the generation unit 124 selects the next set from a narrower probability distribution (e.g., the dotted distribution in FIG. 14), so the probability of selecting a set close to the previous one increases. This allows the data processing device 10 to search efficiently for highly evaluated sets of selected explanatory variables.
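One way to realize such an evaluation-dependent proposal is to let the number of variables swapped per step shrink as the evaluation rises, so that a well-evaluated set proposes only nearby sets. The sketch below is an illustration under our own assumptions (the swap-count rule and the score range [0, 1] are not specified by the patent):

```python
import random

def propose_next_set(current, score, n_vars, rng=None):
    """Propose the next set of selected explanatory variables.  The
    number of variables swapped out is negatively correlated with the
    current evaluation `score` (assumed to lie in [0, 1]): a high score
    narrows the proposal so the next set stays close to the current one."""
    rng = rng or random.Random()
    s = len(current)
    n_swap = max(1, round(s * (1.0 - score)))  # narrower proposal when score is high
    keep = rng.sample(current, s - n_swap)
    pool = [v for v in range(n_vars) if v not in current]
    return keep + rng.sample(pool, n_swap)
```

With a score near 1, only one variable is exchanged per step; with a score near 0, the whole set is redrawn, mimicking the wide solid-line distribution of FIG. 14.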
In S200, after the initialization unit 122 or the generation unit 124 selects/generates a set of selected explanatory variables, the explanatory variable selection unit 120 supplies the set to the learning processing unit 130 and returns the process to S160.
In S220, the model selection unit 140 preferentially selects, from among the plurality of prediction models learned by the learning processing unit 130 for the different sets of selected explanatory variables, models with higher evaluations (e.g., a lower Wilks' lambda statistic, higher sensitivity or specificity, and/or a smaller Akaike Information Criterion (AIC) statistic). For example, the model selection unit 140 selects, from the plurality of prediction models generated by the loop processing of S160 to S200, the prediction model corresponding to the set of selected explanatory variables with the highest evaluation. Alternatively, the model selection unit 140 may select a prediction model corresponding to a set of selected explanatory variables with a probability whose magnitude corresponds to the evaluation value. The model selection unit 140 supplies the selected prediction model to the determination unit 150.
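The priority rule of S220 — lower Wilks' lambda and lower AIC are better — might be approximated as below. The tie-breaking order is an assumption for illustration; the patent only requires that more highly evaluated models be preferred.

```python
def select_best_model(candidates):
    """Pick the most highly evaluated prediction model.  Each candidate
    pairs a set of selected explanatory variables with its metrics;
    lower Wilks' lambda is better, and a lower AIC breaks ties.  This
    particular ordering is illustrative, not prescribed by the patent."""
    return min(candidates, key=lambda c: (c["wilks_lambda"], c["aic"]))
```

A probabilistic variant, as the text also allows, would instead sample a candidate with probability increasing in its evaluation.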
Next, in S240, the determination unit 150 determines the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit 140 as the cause explanatory variable set. In this way, the determination unit 150 can identify the set of selected explanatory variables that yields a highly evaluated prediction model as the cause of the event; for example, it can identify a set of genes for which a prediction model accurately predicts the occurrence of a disease as the disease-causing genes. In particular, the data processing device 10 includes a process of randomly selecting the selected explanatory variables. Consequently, the data processing device 10 can identify the cause of an event with higher accuracy than when the set of selected explanatory variables is generated by picking explanatory variables in order of their individual contribution to the occurrence of the event.
In S140 and S200 of the flow of FIG. 2, the data processing device 10 selects s explanatory variables to determine a set of selected explanatory variables. Here, the data processing device 10 may apply the flow of FIG. 2 with different values of s to determine an appropriate value of s. For example, the data processing device 10 executes a predetermined number of S160-S200 loops for each of s = 2, 3, ..., m, obtains the evaluation of the prediction model selected in S220 after the loop processing for each s, and determines the value of s beyond which the evaluation no longer improves. As one example, if the evaluation improves by at least a predetermined margin from s = 2 up to s = 6 but no longer improves by that margin from s = 7 onward, the data processing device 10 may determine s = 6 as the appropriate value for determining the set of selected explanatory variables and continue the search by performing additional S160-S200 loops for s = 6. In this way, the data processing device 10 can prevent overfitting caused by increasing the number of explanatory variables.
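The stopping rule for the subset size s — grow s while the best evaluation keeps improving by a meaningful margin — can be sketched like this. The margin `min_gain` and the higher-is-better score convention are assumptions made for the sketch:

```python
def choose_subset_size(score_by_s, min_gain=0.01):
    """Given the best evaluation (higher is better) obtained for each
    subset size s, return the size beyond which adding another
    explanatory variable no longer improves the evaluation by at least
    `min_gain`, guarding against overfitting as variables are added."""
    sizes = sorted(score_by_s)
    best_s = sizes[0]
    for prev, cur in zip(sizes, sizes[1:]):
        if score_by_s[cur] - score_by_s[prev] < min_gain:
            break  # improvement has saturated; stop growing s
        best_s = cur
    return best_s
```

In the s = 6 example of the text, the scores would rise steadily up to s = 6 and flatten from s = 7, so the function returns 6.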
FIG. 15 shows a parallel processing device 12 according to a modification of the present embodiment that implements parallel processing. In this modification, the parallel processing device 12 differs from the data processing device 10 in that it includes a first processing unit 102 and a plurality of second processing units 104, and executes parallel processing using the plurality of second processing units 104.

The first processing unit 102 has the functions of the acquisition unit 110, the model selection unit 140, and the determination unit 150. Each of the plurality of second processing units 104 has the functions of the explanatory variable selection unit 120 and the learning processing unit 130.
The acquisition unit 110 of this modification supplies the acquired sample data to the explanatory variable selection units 120 of the plurality of second processing units 104. In the plurality of second processing units 104, the plurality of explanatory variable selection units 120 select sets of selected explanatory variables from the plurality of explanatory variables in parallel, and the plurality of learning processing units 130 learn prediction models for the respective sets in parallel. In this case, the parallel processing device 12 executes the loop processing of S160-S200 in parallel using the plurality of explanatory variable selection units 120 and the plurality of learning processing units 130. Alternatively, the first processing unit 102 may include the explanatory variable selection unit 120 while each second processing unit 104 includes only the learning processing unit 130, so that the parallel processing device 12 parallelizes only the learning processing of the learning processing units 130. In this case, the parallel processing device 12 executes the processing of S160 in parallel using the plurality of learning processing units 130.
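A minimal fan-out of the S160-S200 loop across workers, in the spirit of the second processing units 104, might look like the following Python sketch. Threads stand in for GPGPU or cluster workers, and the evaluation is again a toy stand-in rather than an actual model-training step:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def search_worker(seed, n_vars=10, s=3, n_iters=200):
    """One second processing unit: independently draws random sets of
    explanatory variables and returns the best (score, set) it finds.
    The score is a toy stand-in for training and evaluating a model."""
    rng = random.Random(seed)
    best = (-1.0, ())
    for _ in range(n_iters):
        subset = tuple(sorted(rng.sample(range(n_vars), s)))
        score = len(set(subset) & {1, 3, 5}) / 3.0  # toy evaluation
        best = max(best, (score, subset))
    return best

def parallel_search(n_workers=4):
    # First processing unit 102: fan the search out and keep the overall best.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(search_worker, range(n_workers)))
    return max(results)
```

A real deployment would replace the thread pool with GPGPU kernels, cluster nodes, or Spark tasks as described below, and would let workers exchange their best sets as the text describes.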
Each of the plurality of second processing units 104 independently selects sets of selected explanatory variables, at random or by the Markov chain Monte Carlo method or the like. The second processing units 104 may communicate their selected sets and evaluation information to one another so as to select mutually different sets. Further, when one of the second processing units 104 generates a prediction model whose evaluation exceeds the reference, the other second processing units 104 may be assigned to search the neighborhood of that model's set of selected explanatory variables. In this way, the second processing units 104 can make the search for prediction models expected to have high evaluations more efficient.

As a result, the second processing units 104 can process the selection of sets of selected explanatory variables and the learning of prediction models in parallel across many processing entities, improving processing efficiency. The first processing unit 102 may also communicate information such as the sets of selected explanatory variables with the plurality of second processing units 104 and control their parallel processing.
The first processing unit 102 may be realized by, for example, a general-purpose CPU, and each of the plurality of second processing units 104 may be realized by a general-purpose GPU (GPGPU), a dedicated CPU, or the like. An example of a general-purpose GPU platform is CUDA (Compute Unified Device Architecture) developed by NVIDIA. Alternatively, the first processing unit 102 and the plurality of second processing units 104 may be realized by a parallelized FPGA (Field-Programmable Gate Array), a cluster of many information processing devices (a computer cluster), a plurality of virtual machine images accessed via a network (e.g., machine images deployed by a cloud service), and/or a plurality of cores within a processor (a many-core CPU). An example of a parallelized FPGA is the RIVYERA model from SciEngines GmbH. Examples of servers equipped with many-core CPUs include the Hewlett-Packard HP ProLiant DL980 G7 (80-core CPU) and HP Integrity Superdome X Server (240-core CPU). An example of virtual machine images is Amazon Web Services' AMIs (Amazon Machine Images). The plurality of second processing units 104 may also execute parallel processing by in-memory parallel distributed processing; an example of such technology is Apache Spark.
In the present embodiment and its modifications, examples were described in which the data processing device 10 and the parallel processing device 12 (collectively, the data processing device 10 and the like) acquire sample data associating the presence or absence, or the expression level, of specific genes with the presence or absence of a disease and estimate the disease-causing genes; however, the applications of the data processing device 10 and the like are not limited to this.

For example, the data processing device 10 and the like may acquire sample data including the drug resistance of a pest such as an insect and the gene sequence information of the pest, and identify a combination of genes that contributes to the drug resistance.
Also, for example, the data processing device 10 and the like may acquire sample data including the gene sequence information of a plurality of closely related species and identify a combination of genes that serves as an index when constructing an evolutionary phylogenetic tree. In this case, the data processing device 10 and the like generates a branching-diagram pattern from each of the selected sets of selected explanatory variables (gene sets) and determines whether each branching-diagram pattern belongs to the majority or the minority. The data processing device 10 and the like gives a high evaluation to sets of selected explanatory variables that yield branching diagrams belonging to the majority, and a low evaluation to sets that yield branching diagrams belonging to the minority.
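The majority/minority scoring of branching-diagram patterns can be sketched by counting how often each pattern occurs and scoring by relative frequency. The scoring formula below is our own assumption; the patent prescribes only that majority patterns receive higher evaluations than minority ones.

```python
from collections import Counter

def score_branch_patterns(patterns):
    """Score each branching-diagram pattern by how common it is: the
    most frequent pattern(s) get 1.0 and rarer patterns get a score
    proportional to their frequency.  The exact formula is illustrative."""
    counts = Counter(patterns)
    most = max(counts.values())
    return {p: c / most for p, c in counts.items()}
```

Here patterns could be, for instance, Newick-style strings derived from each gene set's phylogenetic tree.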
Furthermore, the applications of the data processing device 10 and the like are not limited to identifying genes as selected explanatory variables. The data processing device 10 and the like can be used to identify the cause explanatory variables of any phenomenon in which some of a plurality of explanatory variables contribute to an event. For example, the data processing device 10 and the like can be used to identify factors in purchasing behavior, causes of stock price fluctuations, information propagation in networks, or causes of natural phenomena such as weather.
FIG. 16 shows an example of the effect of learning according to the present embodiment. The vertical axis of the graph shows the Wilks' lambda statistic of the finally obtained prediction model; lower values indicate higher evaluations (i.e., higher prediction accuracy). The horizontal axis shows the number (s) of explanatory variables included in the set of selected explanatory variables. In the example of FIG. 16, prediction models that predict the occurrence of colorectal cancer from genes are generated using gene expression levels in tissues from 18 colorectal cancer patients and 18 healthy subjects. The solid line in the graph shows the results obtained when, in the processing flow described with FIG. 2, the sets of selected explanatory variables were selected at random (e.g., in the processing of S140 and S200 of FIG. 2) from all explanatory variables in the sample data, without depending on past selections. The broken line shows the results when the sets of selected explanatory variables were generated by selecting explanatory variables from the plurality of explanatory variables in descending order of their individual contribution to the occurrence of the event as measured by a t-test.
As shown, the results from the set of selected explanatory variables finally chosen from the repeatedly and randomly generated sets according to the present embodiment are superior to the results from the sets generated by the t-test. In general, the difference between the two tends to grow as the number of explanatory variables increases, but it saturates around nine or more explanatory variables. In this case, therefore, the set of selected explanatory variables can be searched for efficiently by executing the processing flow of FIG. 2 with s = 9.
FIG. 17 shows an example of the hardware configuration of a computer 1900 that functions as the data processing device 10 or the like. The computer 1900 according to the present embodiment includes: a CPU peripheral section having a CPU 2000, a RAM 2020, a graphics controller 2075, and a display device 2080 interconnected by a host controller 2082; an input/output section having a communication interface 2030, a hard disk drive 2040, and a CD-ROM drive 2060 connected to the host controller 2082 by an input/output controller 2084; and a legacy input/output section having a ROM 2010, a flexible disk drive 2050, and an input/output chip 2070 connected to the input/output controller 2084.

The host controller 2082 connects the RAM 2020 with the CPU 2000 and the graphics controller 2075, which access the RAM 2020 at a high transfer rate. The CPU 2000 operates based on programs stored in the ROM 2010 and the RAM 2020 (e.g., a parallel processing program) and controls each unit. The graphics controller 2075 acquires image data that the CPU 2000 and the like generate in a frame buffer provided in the RAM 2020 and displays it on the display device 2080. Alternatively, the graphics controller 2075 may internally include a frame buffer that stores the image data generated by the CPU 2000 and the like.

The input/output controller 2084 connects the host controller 2082 with the communication interface 2030, the hard disk drive 2040, and the CD-ROM drive 2060, which are relatively high-speed input/output devices. The communication interface 2030 communicates with other devices via a network, by wire or wirelessly; it functions as hardware that performs communication. The hard disk drive 2040 stores programs and data used by the CPU 2000 in the computer 1900. The CD-ROM drive 2060 reads programs or data from a CD-ROM 2095 and provides them to the hard disk drive 2040 via the RAM 2020.

The input/output controller 2084 is also connected to the ROM 2010, the flexible disk drive 2050, and the input/output chip 2070, which are relatively low-speed input/output devices. The ROM 2010 stores a boot program executed by the computer 1900 at startup and/or programs dependent on the hardware of the computer 1900. The flexible disk drive 2050 reads programs or data from a flexible disk 2090 and provides them to the hard disk drive 2040 via the RAM 2020. The input/output chip 2070 connects the flexible disk drive 2050 to the input/output controller 2084 and also connects various input/output devices to the input/output controller 2084 via, for example, a parallel port, a serial port, a keyboard port, and a mouse port.

Programs provided to the hard disk drive 2040 via the RAM 2020 are stored on a recording medium such as the flexible disk 2090, the CD-ROM 2095, or an IC card, and are provided by the user. A program is read from the recording medium, installed on the hard disk drive 2040 in the computer 1900 via the RAM 2020, and executed by the CPU 2000.
A program installed on the computer 1900 to make it function as the data processing device 10 or the like includes an acquisition module, an explanatory variable selection module, an initialization module, a generation module, a learning processing module, a model selection module, and a determination module. These programs or modules may act on the CPU 2000 and the like to make the computer 1900 function as the acquisition unit 110, the explanatory variable selection unit 120, the initialization unit 122, the generation unit 124, the learning processing unit 130, the model selection unit 140, and the determination unit 150, respectively.

When read into the computer 1900, the information processing described in these programs functions as the acquisition unit 110, the explanatory variable selection unit 120, the initialization unit 122, the generation unit 124, the learning processing unit 130, the model selection unit 140, and the determination unit 150, which are concrete means realized by the cooperation of the software and the various hardware resources described above. By realizing, through these concrete means, the computation or processing of information according to the intended use of the computer 1900 in the present embodiment, a data processing device 10 or the like specific to that intended use is constructed.
As one example, when the computer 1900 communicates with an external device or the like, the CPU 2000 executes a communication program loaded on the RAM 2020 and instructs the communication interface 2030 to perform communication processing based on the processing content described in the communication program. Under the control of the CPU 2000, the communication interface 2030 reads transmission data stored in a transmission buffer area or the like provided on a storage device such as the RAM 2020, the hard disk drive 2040, the flexible disk 2090, or the CD-ROM 2095 and transmits it to the network, or writes reception data received from the network into a reception buffer area or the like provided on a storage device. The communication interface 2030 may thus transfer transmission/reception data to and from a storage device by DMA (direct memory access); alternatively, the CPU 2000 may transfer the data by reading it from the source storage device or communication interface 2030 and writing it to the destination communication interface 2030 or storage device.

The CPU 2000 also has all or a necessary portion of a file, database, or the like stored in an external storage device, such as the hard disk drive 2040, the CD-ROM drive 2060 (CD-ROM 2095), or the flexible disk drive 2050 (flexible disk 2090), read into the RAM 2020 by DMA transfer or the like, and performs various processing on the data in the RAM 2020. The CPU 2000 then writes the processed data back to the external storage device by DMA transfer or the like. Since the RAM 2020 can be regarded in such processing as temporarily holding the contents of the external storage device, in the present embodiment the RAM 2020, the external storage devices, and the like are collectively referred to as a memory, a storage unit, or a storage device.

The various information in the present embodiment, such as programs, data, tables, and databases, is stored on such storage devices and becomes the subject of information processing. The CPU 2000 can also hold part of the RAM 2020 in a cache memory and read from and write to the cache memory. Since the cache memory assumes part of the function of the RAM 2020 even in such a form, in the present embodiment the cache memory is also included in the RAM 2020, the memory, and/or the storage devices, except where distinguished.

The CPU 2000 also performs, on data read from the RAM 2020, various processing specified by the program's instruction sequence, including the various computations, information processing, condition determinations, and information searches and replacements described in the present embodiment, and writes the results back to the RAM 2020. For example, when making a condition determination, the CPU 2000 determines whether the various variables shown in the present embodiment satisfy conditions such as being greater than, less than, at least, at most, or equal to other variables or constants, and branches to a different instruction sequence or calls a subroutine when the condition is satisfied (or not satisfied).

The CPU 2000 can also search for information stored in a file, database, or the like in a storage device. For example, when a plurality of entries, in each of which an attribute value of a second attribute is associated with an attribute value of a first attribute, are stored in a storage device, the CPU 2000 can obtain the attribute value of the second attribute associated with a first attribute satisfying a predetermined condition by searching the stored entries for one whose first attribute value matches a specified condition and reading the second attribute value stored in that entry.

The programs or modules described above may be stored on an external recording medium. Besides the flexible disk 2090 and the CD-ROM 2095, usable recording media include optical recording media such as DVDs and CDs, magneto-optical recording media such as MOs, tape media, and semiconductor memories such as IC cards. A storage device such as a hard disk or RAM provided in a server system connected to a dedicated communication network or the Internet may also be used as the recording medium, and the programs may be provided to the computer 1900 via the network.
While the present invention has been described using an embodiment, the technical scope of the present invention is not limited to the scope described in the above embodiment. It will be apparent to those skilled in the art that various changes or improvements can be made to the above embodiment. It is apparent from the claims that embodiments incorporating such changes or improvements can also be included in the technical scope of the present invention.

It should be noted that the order of execution of the processes, such as operations, procedures, steps, and stages, in the devices, systems, programs, and methods shown in the claims, the specification, and the drawings can be realized in any order unless explicitly indicated by terms such as "before" or "prior to", and unless the output of an earlier process is used in a later process. Even if the operation flows in the claims, the specification, and the drawings are described using "first", "next", and the like for convenience, this does not mean that they must be performed in that order.
 10 data processing device, 12 parallel processing device, 20 database, 102 first processing unit, 104 second processing unit, 110 acquisition unit, 120 explanatory variable selection unit, 122 initialization unit, 124 generation unit, 130 learning processing unit, 140 model selection unit, 150 determination unit

Claims (11)

  1.  A data processing device that identifies, from among a plurality of explanatory variables, a cause explanatory variable set, which is a set of at least one explanatory variable that causes a predetermined event, the device comprising:
     an acquisition unit that acquires a plurality of pieces of sample data, each associating the values of the plurality of explanatory variables with whether or not the event occurred;
     an explanatory variable selection unit that repeatedly selects a set of selected explanatory variables from among the plurality of explanatory variables, and in each selection selects the set of selected explanatory variables at random, independently of any sets selected in the past;
     a learning processing unit that, based on the plurality of pieces of sample data, learns, for each of the plurality of sets of selected explanatory variables, a prediction model that predicts whether or not the event occurs from the values of the selected explanatory variables;
     a model selection unit that, from among the plurality of prediction models corresponding to the different sets of selected explanatory variables, preferentially selects a prediction model with a higher evaluation; and
     a determination unit that determines, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.
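By way of illustration only, and not as the claimed implementation, the random-search loop of claim 1 can be sketched as follows. The `train_and_score` helper is an assumption: a toy "prediction model" (nearest class centroid over the chosen variables, evaluated by training accuracy) standing in for whatever learning and evaluation the learning processing unit actually uses.

```python
import random

def train_and_score(samples, subset):
    """Toy 'prediction model': classify by nearest class centroid over the
    chosen explanatory variables, evaluated by training accuracy."""
    by_label = {0: [], 1: []}
    for values, occurred in samples:
        by_label[occurred].append([values[i] for i in subset])
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items() if rows
    }
    correct = 0
    for values, occurred in samples:
        point = [values[i] for i in subset]
        pred = min(centroids,
                   key=lambda c: sum((p - q) ** 2
                                     for p, q in zip(point, centroids[c])))
        correct += (pred == occurred)
    return correct / len(samples)

def find_cause_set(samples, n_variables, set_size=2, iterations=200, seed=0):
    """Repeatedly draw a random variable subset, independent of past draws
    (claim 1), learn a model for each, and keep the highest-scoring subset."""
    rng = random.Random(seed)
    best_subset, best_score = None, -1.0
    for _ in range(iterations):
        subset = tuple(sorted(rng.sample(range(n_variables), set_size)))
        score = train_and_score(samples, subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

On synthetic data in which, say, variables 0 and 3 jointly determine the event and the rest are uninformative, the search recovers the pair {0, 3} because only its model scores perfectly.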
  2.  The data processing device according to claim 1, wherein the explanatory variable selection unit extracts, from the plurality of explanatory variables, a subset of explanatory variables whose individual degree of contribution to the occurrence or non-occurrence of the event satisfies a predetermined criterion, and repeatedly selects the set of selected explanatory variables from the extracted subset.
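The screening step of claim 2 can be sketched as below. The claim does not fix a particular statistic for a variable's individual contribution; the absolute difference between a variable's mean in event samples and in non-event samples is used here purely as an illustrative assumption.

```python
def screen_variables(samples, threshold):
    """Keep only variables whose individual association with the event meets
    a criterion; here, the absolute difference between the variable's mean
    among event samples and among non-event samples (an assumed statistic)."""
    n_vars = len(samples[0][0])
    kept = []
    for i in range(n_vars):
        event_vals = [v[i] for v, occurred in samples if occurred]
        other_vals = [v[i] for v, occurred in samples if not occurred]
        if not event_vals or not other_vals:
            continue
        contribution = abs(sum(event_vals) / len(event_vals)
                           - sum(other_vals) / len(other_vals))
        if contribution >= threshold:
            kept.append(i)
    return kept
```

The random search of claim 1 would then draw its subsets from the returned indices instead of from all variables, shrinking the search space before any model is trained.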
  3.  The data processing device according to claim 1 or 2, wherein the explanatory variable selection unit switches to selection using a Markov chain Monte Carlo method on the condition that the evaluation of a prediction model learned from a randomly selected set of selected explanatory variables is equal to or higher than a reference.
  4.  The data processing device according to claim 3, wherein the explanatory variable selection unit changes the proposal distribution of the Markov chain Monte Carlo method according to the evaluation of the prediction model learned from the set of selected explanatory variables.
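Claims 3 and 4 together can be sketched as a two-phase search: random draws until a reference score is reached, then a Metropolis-style walk over subsets whose proposal width adapts to the current score. The switching threshold, the temperature, and the rule "high score means smaller moves" are all illustrative assumptions, not details fixed by the claims.

```python
import math
import random

def mcmc_refine(score_fn, n_variables, start_subset, start_score,
                steps=100, temperature=0.05, seed=0):
    """Metropolis-style refinement of a variable subset.  A proposal swaps
    k current members for unused variables; k (the proposal distribution's
    width) shrinks as the score improves, per claim 4 (assumed rule)."""
    rng = random.Random(seed)
    current, current_score = set(start_subset), start_score
    best, best_score = set(start_subset), start_score
    for _ in range(steps):
        # Adapt the proposal: high score -> local moves, low score -> bolder.
        k = 1 if current_score >= 0.8 else 2
        k = min(k, len(current))
        outside = [i for i in range(n_variables) if i not in current]
        proposal = set(current)
        for old, new in zip(rng.sample(sorted(current), k),
                            rng.sample(outside, min(k, len(outside)))):
            proposal.discard(old)
            proposal.add(new)
        prop_score = score_fn(tuple(sorted(proposal)))
        # Metropolis acceptance on the score difference.
        if prop_score >= current_score or \
           rng.random() < math.exp((prop_score - current_score) / temperature):
            current, current_score = proposal, prop_score
        if current_score > best_score:
            best, best_score = set(current), current_score
    return tuple(sorted(best)), best_score

def search_with_switch(score_fn, n_variables, set_size, switch_at=0.7,
                       max_random_draws=200, seed=0):
    """Claim 3: draw subsets at random until one scores at or above the
    reference `switch_at`, then switch to MCMC-style refinement."""
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(max_random_draws):
        subset = tuple(sorted(rng.sample(range(n_variables), set_size)))
        score = score_fn(subset)
        if score > best_score:
            best, best_score = subset, score
        if best_score >= switch_at:
            return mcmc_refine(score_fn, n_variables, best, best_score,
                               seed=seed)
    return best, best_score
```

The design rationale is that random draws explore the space broadly, while the Markov chain exploits a promising region once one is found; adapting the proposal width trades exploration for exploitation as the evaluation rises.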
  5.  A data processing device that identifies, from among a plurality of explanatory variables, a cause explanatory variable set, which is a set of at least one explanatory variable that causes a predetermined event, the device comprising:
     an acquisition unit that acquires a plurality of pieces of sample data, each associating the values of the plurality of explanatory variables with whether or not the event occurred;
     an explanatory variable selection unit that extracts, from the plurality of explanatory variables, a subset of explanatory variables whose individual degree of contribution to the occurrence or non-occurrence of the event satisfies a predetermined criterion, and selects a set of selected explanatory variables from the extracted subset;
     a learning processing unit that, based on the plurality of pieces of sample data, learns, for each of the plurality of sets of selected explanatory variables, a prediction model that predicts whether or not the event occurs from the values of the selected explanatory variables;
     a model selection unit that, from among the plurality of prediction models corresponding to the different sets of selected explanatory variables, preferentially selects a prediction model with a higher evaluation; and
     a determination unit that determines, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.
  6.  A data processing device that identifies, from among a plurality of explanatory variables, a cause explanatory variable set, which is a set of at least one explanatory variable that causes a predetermined event, the device comprising:
     an acquisition unit that acquires a plurality of pieces of sample data, each associating the values of the plurality of explanatory variables with whether or not the event occurred;
     an explanatory variable selection unit that selects a set of selected explanatory variables from among the plurality of explanatory variables at random, independently of any sets selected in the past, and switches to selection using a Markov chain Monte Carlo method on the condition that the evaluation of a prediction model learned from the randomly selected set of selected explanatory variables is equal to or higher than a reference;
     a learning processing unit that, based on the plurality of pieces of sample data, learns, for each of the plurality of sets of selected explanatory variables, the prediction model that predicts whether or not the event occurs from the values of the selected explanatory variables;
     a model selection unit that, from among the plurality of prediction models corresponding to the different sets of selected explanatory variables, preferentially selects a prediction model with a higher evaluation; and
     a determination unit that determines, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.
  7.  The data processing device according to any one of claims 1 to 6, wherein the device identifies, from among a plurality of genes, a set of at least one gene that is an expression factor of the event.
  8.  The data processing device according to any one of claims 1 to 7, wherein the learning processing unit:
     generates a plurality of functions, each containing the selected explanatory variables in the set as variables and selected so as to maximize variance;
     selects, from among the plurality of functions, at least one function to be used for discriminating whether or not the event occurred in the plurality of pieces of sample data; and
     learns the prediction model, which predicts whether or not the event occurs, based on positions in a multidimensional space whose dimensions are the values of the at least one function.
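The variance-maximizing functions of claim 8 resemble principal components. As an illustrative sketch only, and under the assumption that one component suffices, the following computes the first variance-maximizing direction by power iteration and discriminates a sample by which class mean its position along that direction is closer to; the claim itself does not prescribe these particular computations.

```python
def principal_component(rows, iters=100):
    """First variance-maximizing direction of `rows`, via power iteration
    on the (unnormalised) covariance matrix."""
    n = len(rows[0])
    means = [sum(r[i] for r in rows) / len(rows) for i in range(n)]
    centered = [[r[i] - means[i] for i in range(n)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) for j in range(n)]
           for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v, means

def project(values, direction, means):
    """Position of a sample along the variance-maximizing direction."""
    return sum((x - m) * d for x, m, d in zip(values, means, direction))

def learn_projected_model(samples):
    """Fit the component on all samples, represent each sample by its
    position along it, and discriminate by the nearer class mean."""
    rows = [v for v, _ in samples]
    direction, means = principal_component(rows)
    pos = {0: [], 1: []}
    for values, occurred in samples:
        pos[occurred].append(project(values, direction, means))
    class_means = {c: sum(p) / len(p) for c, p in pos.items() if p}
    def predict(values):
        z = project(values, direction, means)
        return min(class_means, key=lambda c: abs(z - class_means[c]))
    return predict
```

With several selected functions rather than one, each sample would instead be mapped to a point in the multidimensional space spanned by the function values, and the discrimination would use distances in that space.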
  9.  The data processing device according to any one of claims 1 to 8, comprising a plurality of the explanatory variable selection units and a plurality of the learning processing units, wherein:
     the plurality of explanatory variable selection units select the sets of selected explanatory variables in parallel;
     the plurality of learning processing units learn the prediction models in parallel; and
     the plurality of explanatory variable selection units and the plurality of learning processing units are realized by at least one of a many-core CPU, a computer cluster, a GPGPU, a parallelized FPGA, and a virtual machine image accessed via a network.
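The parallelism of claim 9 is natural here because each random draw is independent of every other, so no coordination between units is needed. The sketch below is illustrative only: a thread pool stands in for the many-core CPU, cluster, GPGPU, FPGA, or virtual-machine targets named in the claim, and the stand-in scoring function is an assumption.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def evaluate_subset(args):
    """One selection unit plus one learning unit: draw a random subset with
    a private seed and score it with the supplied (stand-in) scoring
    function."""
    seed, n_variables, set_size, score_fn = args
    rng = random.Random(seed)
    subset = tuple(sorted(rng.sample(range(n_variables), set_size)))
    return score_fn(subset), subset

def parallel_search(score_fn, n_variables, set_size, draws=256, workers=4):
    """Run many independent subset evaluations in parallel and keep the
    best-scoring subset.  Because the draws share no state, the search is
    embarrassingly parallel and scales with the number of workers."""
    tasks = [(seed, n_variables, set_size, score_fn) for seed in range(draws)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(evaluate_subset, tasks))
    return max(results)  # (best_score, best_subset)
```

Seeding each task separately keeps the result deterministic regardless of how the scheduler interleaves the workers, which is one practical way to keep a parallel stochastic search reproducible.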
  10.  A data processing method, executed by a computer, for identifying, from among a plurality of explanatory variables, a cause explanatory variable set, which is a set of at least one explanatory variable that causes a predetermined event, the method comprising:
     an acquisition step of acquiring a plurality of pieces of sample data, each associating the values of the plurality of explanatory variables with whether or not the event occurred;
     an explanatory variable selection step of repeatedly selecting a set of at least one selected explanatory variable from among the plurality of explanatory variables, and in each selection selecting the set of selected explanatory variables at random, independently of any sets selected in the past;
     a learning processing step of learning, based on the plurality of pieces of sample data and for each of the plurality of sets of selected explanatory variables, a prediction model that predicts whether or not the event occurs from the values of the selected explanatory variables;
     a model selection step of preferentially selecting, from among the plurality of prediction models corresponding to the different sets of selected explanatory variables, a prediction model with a higher evaluation; and
     a determination step of determining, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected in the model selection step.
  11.  A data processing program that causes a computer to function as a data processing device that identifies, from among a plurality of explanatory variables, a cause explanatory variable set, which is a set of at least one explanatory variable that causes a predetermined event, wherein, when executed, the program causes the computer to function as:
     an acquisition unit that acquires a plurality of pieces of sample data, each associating the values of the plurality of explanatory variables with whether or not the event occurred;
     a plurality of explanatory variable selection units that repeatedly select a set of at least one selected explanatory variable from among the plurality of explanatory variables, and in each selection select the set of selected explanatory variables at random, independently of any sets selected in the past;
     a learning processing unit that, based on the plurality of pieces of sample data, learns, for each of the plurality of sets of selected explanatory variables, a prediction model that predicts whether or not the event occurs from the values of the selected explanatory variables;
     a model selection unit that, from among the plurality of prediction models corresponding to the different sets of selected explanatory variables, preferentially selects a prediction model with a higher evaluation; and
     a determination unit that determines, as the cause explanatory variable set, the set of selected explanatory variables corresponding to the prediction model selected by the model selection unit.
PCT/JP2016/057992 2015-03-16 2016-03-14 Data processing device, data processing method, and data processing program WO2016148107A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2015-052105 2015-03-16
JP2015052105A JP2018077547A (en) 2015-03-16 2015-03-16 Parallel processing apparatus, parallel processing method, and parallelization processing program

Publications (1)

Publication Number Publication Date
WO2016148107A1 true WO2016148107A1 (en) 2016-09-22

Family

ID=56920176

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/057992 WO2016148107A1 (en) 2015-03-16 2016-03-14 Data processing device, data processing method, and data processing program

Country Status (2)

Country Link
JP (1) JP2018077547A (en)
WO (1) WO2016148107A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7196696B2 (en) * 2019-03-07 2022-12-27 株式会社ジェイテクト Machine learning device and machine learning method
JP2021183017A (en) * 2020-05-21 2021-12-02 キヤノン株式会社 Information processing device, information processing method, and program
JP2021197100A (en) * 2020-06-18 2021-12-27 国立研究開発法人産業技術総合研究所 Information processing system, information processing method, identification method and program
WO2022208734A1 (en) * 2021-03-31 2022-10-06 富士通株式会社 Information presentation program, information presentation method, and information presentation device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004355174A (en) * 2003-05-28 2004-12-16 Ishihara Sangyo Kaisha Ltd Data analysis method and system
JP2006048429A (en) * 2004-08-05 2006-02-16 Nec Corp System of type having replaceable analysis engine and data analysis program


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110392899A (en) * 2017-12-18 2019-10-29 甲骨文国际公司 The dynamic feature selection generated for model
JP2021507323A (en) * 2017-12-18 2021-02-22 オラクル・インターナショナル・コーポレイション Dynamic feature selection for model generation
US11599753B2 (en) 2017-12-18 2023-03-07 Oracle International Corporation Dynamic feature selection for model generation
JP7340456B2 (en) 2017-12-18 2023-09-07 オラクル・インターナショナル・コーポレイション Dynamic feature selection for model generation
CN110392899B (en) * 2017-12-18 2023-09-15 甲骨文国际公司 Dynamic feature selection for model generation
JP2021002126A (en) * 2019-06-20 2021-01-07 昭和電工マテリアルズ株式会社 Design support device, design support method and design support program

Also Published As

Publication number Publication date
JP2018077547A (en) 2018-05-17

Similar Documents

Publication Publication Date Title
WO2016148107A1 (en) Data processing device, data processing method, and data processing program
US10387430B2 (en) Geometry-directed active question selection for question answering systems
JP6954003B2 (en) Determining device and method of convolutional neural network model for database
US10776400B2 (en) Clustering using locality-sensitive hashing with improved cost model
CN105488539B (en) The predictor method and device of the generation method and device of disaggregated model, power system capacity
CN108446741B (en) Method, system and storage medium for evaluating importance of machine learning hyper-parameter
JP2016062544A (en) Information processing device, program, information processing method
JP2021099803A (en) Efficient cross-modal retrieval via deep binary hashing and quantization
JP6299759B2 (en) Prediction function creation device, prediction function creation method, and program
JP2007095069A (en) Spread kernel support vector machine
US9372959B2 (en) Assembly of metagenomic sequences
EP3779806A1 (en) Automated machine learning pipeline identification system and method
WO2016095068A1 (en) Pedestrian detection apparatus and method
CN113255611A (en) Twin network target tracking method based on dynamic label distribution and mobile equipment
US20070239415A2 (en) General graphical gaussian modeling method and apparatus therefore
KR20230004566A (en) Inferring Local Ancestry Using Machine Learning Models
JP2020529060A (en) Prediction of molecular properties of molecular variants using residue-specific molecular structural features
US10248462B2 (en) Management server which constructs a request load model for an object system, load estimation method thereof and storage medium for storing program
Li et al. Informative SNPs selection based on two-locus and multilocus linkage disequilibrium: criteria of max-correlation and min-redundancy
JP5975470B2 (en) Information processing apparatus, information processing method, and program
CN115169555A (en) Edge attack network disruption method based on deep reinforcement learning
JP7014582B2 (en) Quotation acquisition device, quotation acquisition method and program
CN110348581B (en) User feature optimizing method, device, medium and electronic equipment in user feature group
JP2002175305A (en) Graphical modeling method and device for inferring gene network
CN115344386A (en) Method, device and equipment for predicting cloud simulation computing resources based on sequencing learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16764937

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 16764937

Country of ref document: EP

Kind code of ref document: A1