GB2583176A

GB2583176A - Prediction device, prediction program, and prediction method for predicting human judgments

Info

Publication number: GB2583176A
Application number: GB2002189.5A
Authority: GB
Inventors: Oguri Hidenobu; Itoh Kouichi
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2019-02-28
Filing date: 2020-02-18
Publication date: 2020-10-21
Also published as: JP2020140521A; GB202002189D0

Abstract

Machine learning is used to predict a target variable’s value for a target case based on explanatory variables. For example, a financial penalty imposed by a court when a business violates data protection laws may be predicted from previously decided case law based on the types of information leaked, the number of people affected, and the duration of the leak. A classification analysis model and a regression analysis model are trained based on the explanatory variables (the facts of the case) and target variable (the imposed penalty) of decided cases. One explanatory variable is divided into regions (Y1-Y5, fig. 7). The target variable is converted from its absolute value into a level based on its magnitude (e.g. “Level B: Over £100,000”). A separate regression model linking the explanatory variables to the target variable is prepared for each of these levels. For a previously unseen target case, the classification model predicts the level of the target variable (the likely range of any penalty) by identifying the training case with the set of explanatory variables closest to the target case’s explanatory variables. The target variable’s absolute value is then predicted using the regression model associated with the determined target variable level.

Description

PREDICTION DEVICE, PREDICTION PROGRAM, AND PREDICTION METHOD FOR PREDICTING HUMAN JUDGMENTS

FIELD

The present invention relates to a prediction device, a prediction program, and a prediction method for predicting human judgments.

BACKGROUND

Human judgments such as judgments taken by a government agency in relation to a law violation and judgments taken by a judging official in a gymnastics or figure skating competition, for example, include judgments based on unclear judgment flowcharts and checklists. A judgment made in relation to a law violation will be described below as an example.

For example, as information processing technology develops, the risk of attacks on information systems and information leakages is becoming more diversified. In response thereto, many business enterprises require complex responses to security system updates. The reason for this is that there is a tendency among government agencies responsible for ensuring compliance with personal information protection laws in respective countries to tighten regulations relating to personal information, and as a result, penalties imposed on business enterprises for duty violations of personal information protection laws are becoming harsher.

Machine learning for predicting a certain value is disclosed in the following prior art.

Patent Literature 1: Japanese Laid-open Patent Publication No. 2006-252011

Patent Literature 2: Japanese Laid-open Patent Publication No. 2018-153901

SUMMARY

However, the relationship between information security incidents and penalties or sanctions imposed by government agencies is non-transparent and difficult to predict using machine learning or artificial intelligence. In particular, the number of cases in which sanctions have been imposed is small, and therefore sufficient training data cannot be acquired, making it difficult to improve the prediction precision of an analysis model generated by machine learning. Moreover, every time the impact of a violation of personal information protection laws increases in scale compared with past cases, the relevant government authority may determine new penalties or sanctions using new, unpublished judgment indicators. Therefore, many business enterprises are being forced to invest heavily in security systems in order to avoid excessive penalties and sanctions from government agencies.

As described above, in present circumstances, it is difficult to predict judgments by government authorities, and as a result, many business enterprises determine the degree of priority to be placed on investment in security systems while bearing risk.

Hence, an object of a first aspect of this embodiment is to provide a prediction device, a prediction program, and a prediction method for predicting human judgments with improved prediction precision.

One aspect of the embodiment is a prediction device comprising: a processor; and a memory that is accessible by the processor, wherein the processor is configured to: (a) generate a classification prediction model and a regression prediction model based on a plurality of training data which includes a plurality of sets of case data, each set of the case data including a value of a target variable corresponding to values of a plurality of explanatory variables, at least one of the values of the plurality of explanatory variables being converted, based on a variable region including a plurality of regions, into one of identification numbers of the plurality of regions, the classification prediction model including a plurality of classification training data, which includes the plurality of training data, the value of the target variable in each of the plurality of training data being converted into one of a plurality of levels corresponding to magnitude of the value of the target variable, and determining the level of the target variable included in the classification training data, among the plurality of classification training data, in which a norm distance of the value of the explanatory variable to a prediction subject case data is shortest as the level of the target variable of the prediction subject case data, and the regression prediction model including, for each of the plurality of levels, a regression line that is close to coordinate points of the explanatory variables of a plurality of level-divided training data into which the plurality of training data are divided according to the level of the target variable, and calculating a value of the target variable of the prediction subject case data based on the regression line corresponding to the level determined by the classification prediction model; (b) predict a level of the target variable of the prediction subject case data by applying the prediction subject case data to the classification prediction model; and (c) predict a value of the target variable of the prediction subject case data by applying the prediction subject case data to the regression line corresponding to the predicted level.

According to the first aspect, the prediction precision of the prediction device can be improved.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a view illustrating an example of a decision action or a judgment action.

FIG. 2 is a view depicting an example of the prediction model corresponding to the judgment action of FIG. 1.

FIG. 3 is a view depicting a flowchart that illustrates an outline of prediction model generation, prediction, and updating according to this embodiment.

FIG. 4 is a view depicting an example configuration of a prediction device according to this embodiment.

FIG. 5 is a view depicting a flowchart of processing for generating a training data master of existing cases.

FIG. 6 is a view depicting an example of the case data.

FIG. 7 is a view depicting an example of a list of the variable regions of the explanatory variables and the training data master.

FIG. 8 depicts processing for searching suitable variable regions for the classification analysis model.

FIG. 9 depicts processing for searching suitable variable regions for the regression analysis model.

FIG. 10 is a view depicting an example of the subset variable list.

FIG. 11 is a view depicting a flowchart of the processing S23 for calculating the prediction precision of the classification analysis model.

FIG. 12 is a view illustrating calculation of the prediction precision of the classification analysis model.

FIG. 13 is a view depicting a method for predicting an unknown case using the classification analysis model.

FIG. 14 is a view depicting a method for calculating the prediction precision of the classification analysis model.

FIG. 15 is a view depicting a method for determining the optimum subset variable of the classification analysis model.

FIG. 16 is a flowchart of processing S34 for calculating the prediction precision of the regression analysis model.

FIG. 17 is a view illustrating calculation of the prediction precision of the regression analysis model.

FIG. 18 is a view depicting a method for generating the regression analysis model.

FIG. 19 is a view depicting an example in which the subset variables of the regression analysis model are sorted by prediction precision.

FIG. 20 is a view depicting a method for determining the optimum variable region of the regression analysis model.

FIG. 21 is a flowchart of the processing C and D for generating a prediction model and using the prediction model to make a prediction in relation to the prediction subject case.

FIG. 22 is a flowchart of the processing F for updating the variable region of the explanatory variable.

FIG. 23 is a view depicting an example of a list of search subject variable regions and a list of search subject subset variables.

FIG. 24 is a view depicting an example of the prediction precision of an analysis model using each of the subset variables SSV_UP_1 to SSV_UP_5, calculated in the processing of S52.

FIG. 25 is a view depicting a list of variable regions and a list of subset variables acquired during first-generation and second-generation search.

FIG. 26 is a view depicting examples of prediction precision comparison results and determinations made in relation thereto.

FIG. 27 is a view depicting relationships among the three subset variables in each of first-generation and second-generation search.

FIG. 28 is a view depicting values applied to the variable Y in the first to third variable regions on the variable region list of first-generation search.

FIG. 29 is a view illustrating values applied to the variable Y in the first to third variable regions on the list of variable regions searched during second-generation search.

FIG. 30 is a view illustrating values applied to the variable Y in the first to third variable regions on the list of variable regions searched during second-generation search.

DESCRIPTION OF EMBODIMENTS

Glossary of Terms

Terms used in this embodiment will be described briefly below.

Analysis model: a model for predicting a target variable from a plurality of explanatory variables in a case. In this embodiment, a classification analysis model and a regression analysis model are used together. The analysis model will also be referred to as a prediction model.

Case: A case includes a plurality of explanatory variables and a target variable serving as a prediction subject.

Training data master: a set of data acquired by constantizing and quantifying variables with respect to a plurality of existing cases and setting a value (a region identification number) acquired by quantifying each explanatory variable and a value or level of the target variable for each of the plurality of cases.

Variable constantization: converting a variable in a case into a numerical value.

Variable quantification: converting the value of a variable into an identification number (1, 2, 3, etc.) (a quantitative value) of a region based on a variable region (a quantification reference) having a plurality of regions.

Variable region: a plurality of regions serving as references when the values of the explanatory variables are converted into region identification numbers. By setting an optimum variable region in the explanatory variable, the prediction precision of the analysis model can be improved. For example, by allocating different variable regions Y1, Y2, Y3, and so on to a variable Y, different explanatory variables Y1, Y2, Y3, and so on can be defined for the same explanatory variable Y. In the following description, the same reference symbol will be used for the variable Y having the different variable regions and the variable regions Y1 etc. A variable Y1 denotes the variable Y in the variable region Y1.

Variable region search: calculating the prediction precision of the analysis model with respect to each of a plurality of variable region candidates applied to the explanatory variable Y and combinations thereof and finding the variable region in which the highest prediction precision is acquired. Variable region search is performed during a process for initializing the explanatory variables.

Subset variable: a combination of explanatory variables serving as a subset of the explanatory variable having the different variable regions and the explanatory variables having a single variable region. An analysis model is determined for each subset variable.

Prediction processing using analysis model: generating an analysis model based on training data acquired by quantifying the explanatory variables in an existing case using the variable region selected in the search process and making a prediction with respect to a prediction subject case using the analysis model.

Variable region updating: updating the analysis model by assuming that judgment criteria have been modified in response to a new case when the prediction precision of the generated analysis model decreases, calculating the prediction precision of the analysis model with respect to a plurality of subset variables respectively having a plurality of variable region candidates in relation to which the prediction precision can be estimated to be high, and updating the variable region to a variable region in which the prediction precision is higher than that of the existing analysis model.

In this embodiment, firstly, an analysis model is determined based on existing cases, which have already been decided or judged, of real-world decision actions or judgment actions in which qualitative human judgments are intermixed with quantitative judgments using a checklist, a flowchart, or the like, making it difficult to predict the judgment result.

Secondly, the analysis model is constituted by a classification analysis model and a regression analysis model.

Thirdly, the optimum variable region of the explanatory variable in each of the two analysis models is searched for in the initialization process.

Fourthly, when the prediction precision of an existing analysis model decreases in response to the appearance of a new case, an analysis model with improved prediction precision is determined by updating the variable region of the explanatory variable. The features of each of the above points will be described below.

Decision actions or judgment actions include, for example, (1) an action taken by a government agency to determine a penalty or a sanction when a business enterprise violates the law, (2) an action for deciding the amount of a bid on an extremely large project, (3) a pass/fail judgment made using both an interview test and a written test, (4) an action for deciding a term of imprisonment imposed in relation to a criminal action, and so on. It may also be possible to apply this embodiment to decision actions or judgment actions other than those described above.

Judgment Action and Prediction Model

FIG. 1 is a view illustrating an example of a decision action or a judgment action. In the example of FIG. 1, a sanction is determined in response to a law violation. For example, a business enterprise 1 constructs and operates a system utilizing personal information. A problem such as leakage of the personal information occurs due to internal fraud, external intrusion, or attack on the system (S1). When a government authority 2 is notified of the problem (S2), the authority 2 investigates the circumstances of the problem by conducting a hearing with the business enterprise and customers (S3). The authority 2 then applies the law to the circumstances of the problem and judges whether or not the law has been violated (S4). Finally, the authority 2 notifies the business enterprise 1 of the punishment or sanction decided in accordance with the judgment (S5) and publishes a judged case 3 (S6). A business enterprise performing risk assessment in relation to the business enterprise 1 generates a prediction model 4 for predicting a judgment using past cases as training data and uses the prediction model 4 to predict the judgment in an unknown case (S7). The business enterprise 1 implements improvements on the system in consideration of the predicted judgment.

FIG. 2 is a view depicting an example of the prediction model corresponding to the judgment action of FIG. 1. Existing cases 3 include a plurality of judged cases published by the relevant government authority. The cases 3 are cases of personal information leakage, for example. First, training data 5 constituted by explanatory variables x, y, z and a target variable SA (Sanction) are generated for each of the existing cases 3. Variables included in the cases are artificially constantized from the existing cases, variable regions of the variables are set, and the values of the respective variables of the cases are converted into identification numbers indicating the regions to which the values of the respective variables of the cases correspond.

Training data 5 in the cases depicted in FIG. 2 include the types of information x, the number of affected people y, and the leakage period z as the explanatory variables and the sanction SA as the target variable. By machine-learning the training data 5, a prediction model 4 of a case having an unknown target variable is generated.

For example, the explanatory variable x is the types of information, and the variable regions thereof are constituted by identification numbers of three regions, "1" being a region in which the personal information includes one type of information, among address, name, and age, "2" being a region in which the personal information includes two types, and "3" being a region in which the personal information includes three types. The explanatory variable y is the number of affected people, indicating the number of people whose personal information was leaked, and the variable regions thereof are constituted by identification numbers of three regions, "1" being a region in which fewer than 100 people are affected, "2" being a region in which fewer than 1,000 people are affected, and "3" being a region in which 1,000 or more people are affected.

The explanatory variable z is the leakage period, and the variable regions thereof are constituted by identification numbers of three regions, "1" being a region in which the leakage period is less than 5 years, "2" being a region in which the leakage period is less than 10 years, and "3" being a region in which the leakage period is 10 years or more. The target variable SA is the sanction.
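The quantification described above can be illustrated with a short sketch. This is not taken from the specification: the names REGION_BOUNDS, quantify, and quantify_case are hypothetical, and the sketch assumes that each region is identified by its lower bound, so that a value equal to a bound falls into the higher region (for example, exactly 1,000 affected people falls into region "3").

```python
from bisect import bisect_right

# Lower bounds of regions 2, 3, ... for each explanatory variable, following
# the example in the text; region 1 covers every value below the first bound.
# The dictionary layout and names are assumptions made for illustration.
REGION_BOUNDS = {
    "x": [2, 3],        # types of information: 1, 2, or 3 types
    "y": [100, 1000],   # affected people: <100, <1,000, >=1,000
    "z": [5, 10],       # leakage period in years: <5, <10, >=10
}


def quantify(name, value):
    """Return the 1-based region identification number for a raw value."""
    return bisect_right(REGION_BOUNDS[name], value) + 1


def quantify_case(case):
    """Quantify every explanatory variable of one case."""
    return {name: quantify(name, value) for name, value in case.items()}


# A case with two types of information, 1,000 affected people, 4 years.
print(quantify_case({"x": 2, "y": 1000, "z": 4}))
```

Changing the variable region of an explanatory variable then amounts to swapping the list of bounds for that variable, which is the operation the variable region search and update processes vary.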

The prediction model 4 is a model based on a predetermined machine learning algorithm 6. The machine learning algorithm 6 is linear regression analysis or polynomial regression analysis, for example. FIG. 2 depicts, as an example of the prediction model 4, a multiple regression analysis model in which the target variable SA is calculated using a linear function of the plurality of variables x, y, z. Once the prediction model 4 has been generated, thereafter, the explanatory variables x, y, z of a case 7 having an unknown target variable are input into the prediction model 4, the function of the prediction model is calculated, and a predicted value 8 of the target variable SA is output.

In the above example, a prediction model is generated from judgment cases in which a plurality of judgment indicators, such as qualitative human judgments and quantitative judgments using a checklist or a flowchart, are intermixed, whereupon the target variable of a case having an unknown target variable is predicted by the prediction model. When constructing a prediction model of this type, the following problems occur.

Firstly, the judgment mechanisms, such as checklists and flowcharts, used in the judgments of existing cases are not published. For example, in a case, a plurality of people may make non-continuous judgments and decide the sanction using a plurality of judgment methods. Further, although general safety standards are published, explanations of combinations and details of the standards remain undisclosed.

Secondly, although past cases in which sanctions were imposed are published, the number of cases is extremely small, and therefore an amount of training data that is usable during machine learning cannot be prepared. When machine learning is performed using a small amount of training data, the search space is too large, making it difficult to generate a meaningful prediction model.

Thirdly, the qualitative and quantitative judgment criteria used to judge existing cases may vary over time. For example, when historically large damage occurs, quantification of past cases becomes meaningless, and as a result, the prediction precision of future cases may decrease.

Outline of Prediction Model Generation, Prediction, and Updating According to This Embodiment

FIG. 3 is a view depicting a flowchart that illustrates an outline of prediction model generation, prediction, and updating according to this embodiment.

To generate the prediction model, a training data master of existing cases is generated (A), the optimum variable region of an explanatory variable in the cases is searched (B), and a prediction model of a subset variable, which is a combination of variables in the variable region selected in the search process, is generated (C).

To predict an unknown case using the prediction model, with respect to the prediction subject case, the value of the target variable is predicted using the prediction model generated in process C (D), whereupon prediction D using the prediction model determined in process C is repeated for as long as the unknown case prediction precision does not decrease (NO in E).

When the prediction precision of the prediction model decreases, for example when the predicted value of an unknown case differs greatly from a sanction imposed in a subsequently disclosed sanction case (YES in E), the variable region of the explanatory variable is updated to a more appropriate variable region based on the existing cases and the newly disclosed case (F). A prediction model of a subset variable of explanatory variables in the updated variable region is then determined again (C), whereupon the prediction model is used to predict subsequent unknown cases (D).

In the process A for generating the training data master, variable constantization and quantification are performed with respect to a plurality of existing cases, whereby the cases are each turned into data including quantified values (region identification numbers) of the explanatory variables and a value or a level of the target variable. Variable constantization and quantification are as described in the glossary of terms.

In the process B for searching the variable regions, the prediction precision of the prediction model is calculated for each of a plurality of variable region candidates applied to the explanatory variable Y and combinations thereof, whereupon the variable region in which the highest prediction precision is acquired is extracted. The variable region search process B is performed during a process for initializing the explanatory variables, which is performed before the processing C for determining the prediction model.

In the process C for generating the prediction model and the prediction processing D using the prediction model, the prediction model is generated based on the training data acquired by quantifying the explanatory variables of the existing cases in the variable region extracted in the search process B (C), whereupon the value of the target variable of the prediction subject case is predicted using the analysis model (D).

In the process F for updating the variable region, the analysis model is updated by assuming that the judgment criteria have been modified in response to a new case when the prediction precision of the generated analysis model decreases, calculating the prediction precision of the analysis model with respect to a plurality of subset variables respectively having a plurality of variable region candidates in relation to which the prediction precision can be estimated to be high, and updating the variable region to a variable region in which the prediction precision is higher than that of the existing analysis model.
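The cycle of processes A to F outlined in FIG. 3 might be sketched as follows. This is a minimal sketch under stated assumptions rather than the implementation of the embodiment: the function run_cycle, the callables search_regions, build_model, and update_regions, and the use of a fixed tolerance to decide that prediction precision has decreased (step E) are all hypothetical and introduced here only for illustration.

```python
def run_cycle(existing_cases, incoming, search_regions, build_model,
              update_regions, tolerance):
    """Sketch of the FIG. 3 flow.

    existing_cases: list of (explanatory_variables, sanction) pairs, assumed
        to be already constantized and quantified (process A).
    incoming: iterable of (explanatory_variables, actual_sanction_or_None);
        None means the sanction has not been disclosed yet.
    """
    cases = list(existing_cases)
    regions = search_regions(cases)          # B: search variable regions
    model = build_model(cases, regions)      # C: generate prediction model
    predictions = []
    for variables, actual in incoming:
        predicted = model(variables)         # D: predict the unknown case
        predictions.append(predicted)
        if actual is None:
            continue
        cases.append((variables, actual))    # newly disclosed case
        # E: assumed criterion for "prediction precision has decreased".
        if abs(predicted - actual) > tolerance:
            regions = update_regions(cases, regions)  # F: update regions
            model = build_model(cases, regions)       # C again
    return predictions, regions
```

The model is rebuilt only when step E reports a decrease in precision, which mirrors the loop between D and E in the flowchart.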

FIG. 4 is a view depicting an example configuration of a prediction device according to this embodiment. A prediction device 10 is an information processing device or a computer such as a server or a personal computer. The prediction device 10 includes a processor 11 such as a CPU, a main memory 12 such as a random access memory (RAM), a network interface 13, a bus 14, and large-capacity auxiliary storage devices 20, 30 such as a hard disk drive (HDD) or a solid state drive (SSD). The network interface 13 can be accessed from terminal devices 40, 41 of clients over the Internet or the like.

The auxiliary storage device 20 stores various programs such as a variable region search program 21, a prediction model generation program 22, a prediction model program 23, and a variable region updating program 24. The auxiliary storage device 30 stores various data such as case data 31, a training data master 32, a variable region list and a subset variable list 33, classification training data 34, regression training data (1) 35_1, and regression training data (2) 35_2.

The processor 11 of the computer 10 executes the variable region search processing B by executing the variable region search program 21. Similarly, the processor 11 executes the prediction model generation processing C by executing the prediction model generation program 22. Further, the processor 11 executes the processing D for predicting an unknown case using the prediction model by executing the prediction model program 23. Furthermore, the processor 11 executes the variable region updating processing F by executing the variable region updating program 24.

Process A for Generating Training Data Master

FIG. 5 is a view depicting a flowchart of processing for generating a training data master of existing cases. Constantizable values are extracted from existing cases as explanatory variables (S10). In this extraction processing, the processor 11 may execute an explanatory variable extraction program depicted in FIG. 2 in order to extract the explanatory variables by analyzing the wording of case reports. Alternatively, the explanatory variables may be extracted from case reports artificially.

FIG. 6 is a view depicting an example of the case data. The cases depicted in FIG. 6 are cases of personal information leakage, similarly to those described above. In FIG. 6, the types of information X, the number of leakage victims Y, and the leakage period Z are extracted as constantizable explanatory variables X, Y, Z for each of cases 1 to 5. The types of information X denote the types of personal information included in the leaked data. The number of leakage victims Y denotes the number of people whose personal information was included in the leaked data. The leakage period Z denotes the period over which the data were leaked.

Returning to FIG. 5, the processor 11 executes a variable region candidate determination program in relation to the explanatory variables depicted in FIG. 2, and based on respective ranges of the values of the extracted explanatory variables, determines one or a plurality of variable region candidates acquired by dividing a region of a predetermined range into a predetermined number (granularity) (S11). Alternatively, the variable region candidates of the extracted explanatory variables may be determined artificially.

FIG. 7 is a view depicting an example of a list of the variable regions of the explanatory variables and the training data master. In the process S11 for determining the variable region candidates of the explanatory variables, a plurality of variable region candidates are determined as search candidates for the variable region search processing to be executed subsequently. The number of determined variable region candidates is preferably as large as possible. Next, the processor executes the variable region search program 21 to extract from the variable region candidates the variable regions in which the prediction precision of the prediction model increases.

In FIG. 7, for the types of information X serving as an explanatory variable, a variable region including three regions, namely a region in which one type of personal information is included, a region in which two types of personal information are included, and a region in which three types of personal information are included, is determined. Further, for the leakage period Z, a variable region including the four regions depicted in the figure is determined.

Further, for example, variable regions Y1, Y2, Y3, Y4, Y5 are determined as candidates of the variable region of the number of leakage victims Y. As illustrated in the figure, these variable regions Y1, Y2, Y3, Y4, Y5 are determined so that the largest region (the cap region) of the variable region increases successively. According to a case database depicted in FIG. 6, the data range of the number of leakage victims Y is at least 500 people and no more than 10,000 people, and in consideration thereof, the variable regions Y1, Y2, Y3 respectively have different cap regions of at least 100 people, at least 1,000 people, and at least 10,000 people. Further, the variable regions Y4, Y5 (having cap regions of at least 50,000 people and at least 100,000 people, respectively) are added to the variable region candidates as variable regions that may be added to the candidates in a future variable region update. These variable regions are added to the candidates after predicting the scale of future leaks.

When the variable regions Y1 to Y5 are respectively applied to the number of leakage victims Y serving as an explanatory variable, respectively different explanatory variables Y1 to Y5 are acquired. By setting this variable region optimally, the prediction precision of the prediction model can be further improved. For example, when different sanctions have been determined in a case where the number of leakage victims was at least 1,000 people and a case where the number of leakage victims was at least 10,000 people, a different sanction can be predicted by selecting the variable region Y3 rather than Y2.

Returning to FIG. 5, when determining the variable region candidates of the explanatory variables, the processor 11 allocates a plurality of variable regions to a predetermined explanatory variable. In the example of FIG. 7, the five variable regions Y1 to Y5 are allocated to the explanatory variable Y (the number of leakage victims).

Next, the processor 11 executes an explanatory variable quantification program, not depicted in the figures, to quantify the explanatory variables of the cases using the variable regions of the explanatory variables as a reference (S13). The training data master 32 in FIG. 7 illustrates data acquired by quantifying the case data in relation to the explanatory variables X, Y1 to Y5, and Z of cases 1 to 5 using the variable region candidates X, Y1 to Y5, and Z as a reference. For example, the number of leakage victims in case 1 was 1,000 people, and therefore "2" is set as the quantified data for all of the variable regions Y1 to Y5. In case 2, meanwhile, the number of leakage victims was 10,000 people, and therefore 2, 3, 4, 4, and 4 are set respectively as the quantified data of the variable regions Y1 to Y5.

Returning to FIG. 5, the processor 11 executes a target variable quantification program to generate a new target variable SAL (Sanction Level) by replacing the analog value of the target variable SA with a plurality of levels (a plurality of levels corresponding to the size of the value of the target variable) (S14). In the example of FIG. 7, a sanction SA of less than 100,000 and a sanction of at least 100,000 are replaced by a sanction level SAL of level A and a sanction level SAL of level B, respectively. Since the amount of training data is small, the analysis model (or prediction model) to be described below predicts the value (the sanction) of the target variable in relation to an unknown case using a model combining a classification analysis model and a regression analysis model. The new target variable SAL is used by the classification analysis model.
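The level conversion of S14 can be illustrated with a minimal sketch. The function name `sanction_level` and its parameters are hypothetical; only the single boundary of 100,000 and the two level names A and B are taken from the example of FIG. 7.

```python
from bisect import bisect_right


def sanction_level(sanction, boundaries=(100_000,), labels=("A", "B")):
    """Replace an absolute sanction SA with a sanction level SAL.

    boundaries must be sorted in ascending order, and labels must contain
    one more entry than boundaries. A sanction equal to a boundary falls
    into the higher level ("at least 100,000" is level B).
    """
    return labels[bisect_right(boundaries, sanction)]


level_small = sanction_level(60_000)    # below 100,000
level_large = sanction_level(100_000)   # at least 100,000
```

Passing a longer tuple of boundaries and labels would yield more than two levels, should a finer division of the target variable be wanted.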

By executing the processing described above, a training data master including a set of training data acquired from case content is generated.

Process B for Retrieving Optimum Variable Region of Explanatory Variable

Next, the processor 11 executes the variable region search program 21 to search the optimum variable region from the variable region candidates. In the variable region search processing, a variable region suitable for the classification analysis model and a variable region suitable for the regression analysis model mentioned above are extracted respectively.

FIGS. 8 and 9 are views depicting flowcharts of the processing of the variable region search processing. FIG. 8 depicts processing for searching suitable variable regions for the classification analysis model, and FIG. 9 depicts processing for searching suitable variable regions for the regression analysis model. The processing for searching variable regions for the classification analysis model and the regression analysis model will be described below.

Retrieval of Variable Regions for Classification Analysis Model

In FIG. 8, the processor 11 executes the variable region search program 21 in order to set a plurality of variable region candidates (Y1 to Y5) in the corresponding explanatory variable (Y) and generate the subset variable list 33 including preferably all subset variables, a subset variable being a subset of a set of the plurality of explanatory variables (X, Y1 to Y5, and Z) (S21).

FIG. 10 is a view depicting an example of the subset variable list. Subset variables SSV (SubSet Variables) on the list are subsets of a set of the plurality of explanatory variables (X, Y1 to Y5, and Z). For example, a subset variable SSV7 is a subset including all 7 variables, while a subset variable SSV6-1 is a subset including the explanatory variables X and Y1 to Y5 but excluding the explanatory variable Z. The other subset variables are similar.
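As an illustration only, a list of all non-empty subsets of the seven explanatory variables could be enumerated as follows. The function name `subset_variable_list` is hypothetical, and the ordering and naming of the entries in FIG. 10 (SSV7, SSV6-1, and so on) are not reproduced here.

```python
from itertools import combinations

VARIABLES = ["X", "Y1", "Y2", "Y3", "Y4", "Y5", "Z"]


def subset_variable_list(variables):
    """Return every non-empty subset of the variables, largest subsets first."""
    subsets = []
    for size in range(len(variables), 0, -1):
        subsets.extend(combinations(variables, size))
    return subsets


subsets = subset_variable_list(VARIABLES)
```

With seven variables this yields 2^7 - 1 = 127 subsets, the first being the full set and one of the six-variable subsets being the one that omits Z.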

Accordingly, as depicted in FIG. 8, the processor 11 extracts one subset variable from the subset variable list 33 (S22), extracts training data corresponding to the extracted subset variable from the training data master, and calculates the prediction precision of the classification analysis model (S23). The processor 11 then repeats the processing of S23 for calculating the prediction precision of the classification analysis model for all of the subset variables on the subset variable list (S24).

FIG. 11 is a view depicting a flowchart of the processing S23 for calculating the prediction precision of the classification analysis model. FIG. 12 is a view illustrating calculation of the prediction precision of the classification analysis model. FIG. 13 is a view depicting a method for predicting an unknown case using the classification analysis model. FIG. 14 is a view depicting a method for calculating the prediction precision of the classification analysis model. FIG. 15 is a view depicting a method for determining the optimum subset variable of the classification analysis model. Referring to these figures, the processing for calculating the prediction precision of the classification analysis model will be described.

The processor 11 calculates the prediction precision of the classification analysis model using a cross-validation method depicted in FIG. 12. The processing for calculating the prediction precision in FIG. 11 will be described below with reference to FIG. 12.

As illustrated in FIG. 11, the processor 11 selects one case (case 1, for example) from the training data master 32 as an evaluation subject case (S231) and predicts the sanction level (A or B) of the target variable of the evaluation subject case (case 1) using a classification prediction model that employs the training data of the remaining cases (cases 2-5) other than the evaluation subject case (case 1) (S232).

Further, the processor determines the success or failure (success/failure) of the prediction according to whether or not the predicted level of the evaluation subject case (case 1) matches the true level of the case (case 1) (S233).

The processor then repeats the processing of S232 and S233 in relation to all of the remaining cases (cases 2-5) in the training data master (S234). As a result, success/failure is determined in relation to the remaining evaluation subject cases (cases 2-5) according to whether or not the predicted level matches the true level of the case. Finally, the processor sets the accuracy rate (the proportion of prediction successes) of the success/failure determination in all of the evaluation subject cases as the prediction precision of the extracted subset variable (S235). For example, the accuracy rate is calculated by dividing the number of times the predicted level matches the true value by the number of cases.

In FIG. 12, the processor extracts the values of the variables corresponding to the extracted subset variable (one of the subset variables SSV in FIG. 10, for example) in relation to all of cases 1-5 from the training data master 32 and generates training data for the extracted subset variable.

The processor then predicts the sanction level of the evaluation subject case (case 1, for example) by executing a classification prediction algorithm using a prediction model that employs the training data of the remaining cases 2-5 other than the evaluation subject case (case 1).

Referring to FIG. 13, a method for predicting an unknown case using the classification analysis model will be described. FIG. 13 depicts the training data of the subset variable SSV7 (variables X, Y1 to Y5, and Z). The subset variable SSV7 includes all of the explanatory variables X, Y1 to Y5, and Z, and therefore the training data thereof are identical to the training data master 32. The sanction level of a prediction subject case E1 is predicted based on the training data by the following method.

The lower half of FIG. 13 depicts a coordinate space having the variables of the subset variable as coordinate axes, and cases 1-5 included in the training data are plotted within the coordinate space. When the prediction subject case E1 is plotted in this coordinate space, the case having the shortest L2-norm distance L2_ND to the prediction subject case E1 is selected from cases 1-5.

The sanction level of the selected case is then set as the predicted sanction level of the prediction subject case E1.

The L2-norm distance is the distance between two cases within a coordinate space. The distance L2_ND within the coordinate space is determined by squaring differences between values on the same coordinate axis, adding together the squared differences, and calculating the square root of the added value. Hence, the processor 11 executes the classification prediction model program 23 in order to calculate the norm distances between the prediction subject case E1 and all of cases 1-5 and detect the case having the shortest distance from cases 1-5.

According to the example in FIG. 13, the case having the shortest distance to case E1 is case 4, and therefore the predicted sanction level of case E1 is determined to be identical to the true sanction level A of case 4.
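The nearest-case prediction described above can be sketched as follows. The function names and the three-variable training values are hypothetical and are not the values of FIG. 13; only the distance calculation and the selection of the closest case follow the description.

```python
import math


def l2_distance(a, b):
    """Square the per-axis differences, sum them, and take the square root."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_level(training, target):
    """Return the level of the training case closest to the target case.

    training is a list of (quantified explanatory variables, level) pairs.
    """
    nearest = min(training, key=lambda case: l2_distance(case[0], target))
    return nearest[1]


# Hypothetical quantified values for three explanatory variables.
training = [((1, 2, 1), "A"), ((2, 4, 2), "B"), ((1, 3, 1), "A")]
predicted = predict_level(training, (1, 2, 2))
```

In this made-up data the first training case lies at distance 1 from the target, closer than the other two, so its level is returned.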

Now that the prediction method of the classification prediction model has been described, the method S23 for calculating the prediction precision of the classification prediction model will be described next. A method for predicting the sanction level of evaluation subject case 3, in the example depicted in FIG. 12, based on the remaining cases 1, 2, 4, and 5 will be described with reference to FIG. 14.

In the lower half of FIG. 14, similarly to FIG. 13, cases 1-5 are plotted in a coordinate space. Accordingly, the processor detects the case having the shortest L2-norm distance to evaluation subject case 3 from the remaining cases 1, 2, 4, and 5. As illustrated in FIG. 14, case 2 has the shortest L2-norm distance L2_ND_2 to evaluation subject case 3. Therefore, the processor determines the predicted sanction level of evaluation subject case 3 to be the detected sanction level B of case 2 (S232). The processor then compares the predicted sanction level B of evaluation subject case 3 with the true sanction level A of case 3, and since the levels do not match, determines the predicted sanction level to be inaccurate (S233).

As illustrated in FIG. 12, the processor predicts sanction levels for each of cases 1-5 using a classification prediction model based on the remaining cases and then determines whether or not the respective predicted levels match the true sanction levels, or in other words whether or not the prediction is accurate (successful). A validation result table is depicted in the lower right of FIG. 12, and according to this table, accurate predictions were acquired in cases 1, 2, 4, and 5, while an inaccurate prediction was acquired in case 3. Hence, the processor determines the prediction precision of the classification prediction model of the subset variable extracted in S22 in FIG. 8 as the accuracy rate (0.80) of the validation result table (S235).
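The cross-validation of S231 to S235 can be sketched as a leave-one-out loop. The function names and the two-dimensional coordinates below are hypothetical; they were chosen only so that, as in the validation result table, the third case is mispredicted and the other four are predicted correctly.

```python
import math


def nearest_level(training, features):
    """Level of the training case with the shortest L2-norm distance."""
    return min(training, key=lambda case: math.dist(case[0], features))[1]


def leave_one_out_accuracy(cases):
    """Hold out each case in turn and predict it from the remaining cases."""
    hits = 0
    for i, (features, true_level) in enumerate(cases):
        remaining = cases[:i] + cases[i + 1:]
        if nearest_level(remaining, features) == true_level:
            hits += 1
    return hits / len(cases)


# Hypothetical coordinates; not the values of FIG. 12 or FIG. 14.
cases = [((1, 1), "A"), ((5, 5), "B"), ((4, 4), "A"), ((1, 2), "A"), ((6, 5), "B")]
accuracy = leave_one_out_accuracy(cases)
```

With five cases and one failure, the accuracy rate is 4/5 = 0.80.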

Returning to FIG. 8, the processor calculates the prediction precision (S23) of the classification prediction models corresponding respectively to all of the subset variables SSV (S24). FIG. 15 illustrates results sorted by the prediction precision (the accuracy rate) acquired with respect to all of the subset variables SSV. The processor determines the optimum variable region from the top N (top 5, for example) subset variables SSV having the highest prediction precision (S25). In the example of FIG. 15, the variable region Y3, which appears most frequently in the top N (top 5, for example) subset variables SSV sorted into descending order of prediction precision, is determined to be the optimum variable region. The processor then sets X, Y3, and Z as the subset variable of the classification prediction model (S25 in FIG. 15). In FIG. 15, a subset variable SSV3_# (the subset variable including the explanatory variables X, Y3, and Z) having the optimum variable region Y3 as the variable Y is determined.
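One possible reading of S25 is a frequency count over the top N subset variables. The sketch below assumes that reading; the function name, the subset list, and the precision values are hypothetical and do not reproduce FIG. 15.

```python
from collections import Counter

REGIONS = {"Y1", "Y2", "Y3", "Y4", "Y5"}


def optimum_region(scored_subsets, regions=REGIONS, top_n=5):
    """Return the variable region that occurs most often in the top-N subsets.

    scored_subsets is a list of (subset variable, prediction precision) pairs.
    """
    top = sorted(scored_subsets, key=lambda item: item[1], reverse=True)[:top_n]
    counts = Counter(v for subset, _ in top for v in subset if v in regions)
    return counts.most_common(1)[0][0]


# Hypothetical precision values for a handful of subset variables.
scored = [
    (("X", "Y3", "Z"), 0.80),
    (("X", "Y3"), 0.80),
    (("Y3", "Z"), 0.80),
    (("X", "Y2", "Y3", "Z"), 0.60),
    (("X", "Y2", "Z"), 0.60),
    (("X", "Y1", "Z"), 0.40),
]
best_region = optimum_region(scored)
```

How ties between equally frequent regions should be broken is not stated in the description; this sketch simply returns the first one counted.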

Further, the processor generates classification training data based on the determined classification subset variable by extracting the data of the subset variable from the training data master (S26).

Retrieval of Variable Regions of Regression Analysis Model

Next, search of the variable regions of the regression analysis model will be described. FIG. 9 is a flowchart of the processing for searching the variable regions of the regression analysis model. Further, FIG. 16 is a flowchart of processing S34 for calculating the prediction precision of the regression analysis model. FIG. 17 is a view illustrating calculation of the prediction precision of the regression analysis model. FIG. 18 is a view depicting a method for generating the regression analysis model. FIG. 19 is a view depicting an example in which the subset variables of the regression analysis model are sorted by prediction precision. FIG. 20 is a view depicting a method for determining the optimum variable region of the regression analysis model.

Referring to these figures, the processing S34 for calculating the prediction precision of the regression analysis model will be described.

In FIG. 9, the processor 11 executes the variable region search program 21 in order to select the level of the target variable (S31). In the regression analysis model, variable regions are searched for each target variable level.

The levels of the target variable are A and B in the present embodiment, and here, it is assumed that level A is selected.

Next, the processor extracts the training data of the cases belonging to the selected level A from the training data master (S32). Here, the extracted training data of the cases belonging to level A form a training data master belonging to level A. As illustrated in FIG. 17, cases 1, 3, and 4 belonging to level A are extracted from the training data master 32 as evaluation subjects.

Next, the processor generates a regression analysis model and calculates the prediction precision thereof based on cases 1, 3, 4 belonging to level A in relation to all of the subset variables on the subset variable list 33 (FIG. 10) generated in S21 of FIG. 8 (S33 to S35). More specifically, the processor extracts one subset variable from the subset variable list 33 (FIG. 10) (S33), extracts the training data corresponding to the extracted subset variable from the training data master of level A, described above, generates the regression analysis model, and calculates the prediction precision thereof (S34). The processor 11 then repeats the processing S34 for calculating the prediction precision of the regression analysis model for all of the subset variables on the subset variable list (S35).

The processor 11 performs the processing S34 for calculating the prediction precision of the regression analysis model using a cross-validation method illustrated in FIG. 17. The processing for calculating the prediction precision, illustrated in FIG. 16, will be described below with reference to FIG. 17.

As illustrated in FIGS. 16 and 17, the processor 11 selects one case (case 1, for example) from the training data master of level A as the evaluation subject case (S341). Further, the processor generates a regression prediction model Pmodel_1 using the training data of the remaining cases (cases 3 and 4) other than the evaluation subject case (case 1) (S342). Furthermore, the processor predicts the predicted value, the sanction value, of the target variable of the evaluation subject case (case 1) using the regression prediction model Pmodel_1 (S342).

The processor then determines an absolute ratio of the predicted value 40,000 and the true value 60,000 of evaluation subject case 1 (S343). The absolute ratio of the two values is determined by dividing the smaller value by the larger value. The processing of S341, S342, and S343 described above is then repeated for all evaluation subject cases (cases 1, 3, and 4) (S344). Finally, the processor outputs the mean of the absolute ratios of all of the evaluation subject cases as the prediction precision of the subset variable (S345). In the example of FIG. 17, the prediction precision (the mean absolute ratio) of level A is 0.71.

In level A in FIG. 17, for evaluation subject case 1, the regression prediction model Pmodel_1 is generated from the training data of the remaining cases 3 and 4, whereupon the predicted value is calculated as 40,000 and the absolute ratio of the predicted value and the true value of case 1 is calculated as 0.67.

Regression prediction models Pmodel_3, Pmodel_4 are then generated in a similar manner for evaluation subject cases 3 and 4, whereupon the predicted values thereof are calculated as 100,000 and 120,000, respectively, and the absolute ratios thereof are calculated as 0.7 and 0.75, respectively.
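The level A figures can be checked with a short calculation. The function name `absolute_ratio` is hypothetical. The predicted and true values of case 1 and the ratios 0.7 and 0.75 of cases 3 and 4 are taken from the description; the true values of cases 3 and 4 are not stated there, so their ratios are used directly.

```python
def absolute_ratio(predicted, true):
    """Divide the smaller value by the larger one (both assumed positive)."""
    return min(predicted, true) / max(predicted, true)


ratio_case_1 = absolute_ratio(40_000, 60_000)   # case 1: predicted vs. true
ratios = [ratio_case_1, 0.70, 0.75]             # cases 1, 3, and 4
precision_level_a = sum(ratios) / len(ratios)   # mean absolute ratio
```

Rounded to two decimal places, the ratio for case 1 is 0.67 and the mean over the three cases is 0.71, in agreement with the values given for FIG. 17.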

The processor also generates prediction models Pmodel_2 and Pmodel_5 for cases 2 and 5 of the remaining level B, and calculates the predicted values and prediction precision (mean absolute ratios) thereof (S342, S343). As a result, in the example depicted in FIG. 17, the prediction precision (the mean absolute ratio) of level B is 0.67.

FIG. 18 depicts a method for generating the regression analysis model. FIG. 18 depicts the training data of the selected subset variable SSV7. On the basis of cases 1, 3, and 4 of the training data, in which the sanction serving as the target variable corresponds to level A, a seven-dimension multiple regression analysis model RG_ANL_A of level A is generated using the method of least squares, which minimizes the sum of squared errors and coincides with maximum likelihood estimation when the errors are assumed to be normally distributed. The multiple regression analysis model RG_ANL_A corresponds to the regression prediction model Pmodel illustrated in FIG. 17. Similarly, based on cases 2 and 5 of the training data, in which the sanction serving as the target variable corresponds to level B, a seven-dimension multiple regression analysis model RG_ANL_B of level B is generated using the method of least squares. In the coordinate space of FIG. 18, the axis of the explanatory variables of the multiple regression analysis model is depicted in simplified form.

In the regression cross-validation method depicted in FIG. 17, a regression analysis model is generated for evaluation subject case 1 based on the remaining cases 3 and 4. This point differs from the example depicted in FIG. 18, in which the regression analysis model of level A is generated based on cases 1, 3, and 4.

FIG. 19 depicts the subset variables, and illustrates examples of prediction formulae applied to the respective subset variables in the regression analysis model and examples of the prediction precision calculated in relation to each subset variable from FIGS. 9 and 16. When the subset variable list is sorted by prediction precision, a sorted subset variable list illustrated in FIG. 20 is acquired.

Returning to FIG. 9, the processor determines the optimum variable region from the sorted subset variable list of FIG. 20, or in other words from the variable regions Y1 to Y5 of the variable Y within the top N subset variables having the highest prediction precision (S36). In the example of FIG. 20, the variable region Y3, for example, which appears most frequently among the variable regions Y1 to Y5 of the variable Y within the top five subset variables SSV4-# to SSV6-4, is determined to be the optimum variable region.

Accordingly, the processor sets SSV3-# (the variables X, Y3, and Z) as a subset variable 36_A of the multiple regression analysis model of level A. A subset variable 36_B of the multiple regression analysis model of level B is determined using a similar method.

Further, after determining the variable region of the variable Y for levels A and B and determining the respective subset variables thereof, the processor extracts the respective training data 35_A and 35_B of levels A and B from the training data master 32 based on the determined regression subset variable (S37). The processing of FIG. 9 is repeated for all levels (S38).

Hence, the optimum variable region is determined from the variable region candidates Y1 to Y5 of the explanatory variable Y. The optimum variable region is determined for each of the classification analysis model and the regression analysis models of the respective levels (A and B). By optimizing the variable region in this manner, the prediction precision of the prediction model corresponding to the subset variable that includes the optimum variable region can be improved.

Processing C and D for Generating Prediction Model and Using Prediction Model to Predict Target Variable of Prediction Subject Case

Next, as depicted in FIG. 3, the processor executes the prediction model generation program 22 and the prediction model program 23 to generate a prediction model and predict the target variable of the prediction subject case using the prediction model.

FIG. 21 is a flowchart of the processing C and D for generating a prediction model and predicting the target variable of the prediction subject case using the prediction model. The processor executes the prediction model generation program 22 in order to generate the classification prediction model from the classification training data 26 (S41). As illustrated in FIG. 13, generation of the classification prediction model is completed upon acquisition of the classification training data 26. Further, the processor executes the prediction model generation program 22 in order to generate the regression prediction model of level A and the regression prediction model of level B from the regression training data 35_A and 35_B, respectively (S42, S43). Generation of the regression prediction models is as illustrated in FIG. 18.

Thus, generation of three types of prediction models based on past cases is completed. More specifically, the classification prediction model, the regression prediction model of level A, and the regression prediction model of level B are generated. The processor then uses these prediction models to predict the sanction serving as the target variable of the prediction subject case (D). As depicted in FIG. 21, the processor receives the training data of the prediction subject case (S44), applies the explanatory variable data of the prediction subject case to the classification prediction model, and predicts the sanction level (level A or B) of the target variable (S45). The prediction method using the classification prediction model is as illustrated in FIG. 13.

Next, the processor selects the regression prediction model corresponding to the predicted sanction level (level A or B) (S46). The processor then applies the explanatory variable data of the prediction subject case to the selected regression prediction model and predicts the value of the sanction serving as the target variable (S47). To predict the sanction of the prediction subject case using the regression prediction model, the values of the explanatory variables of the prediction subject case are input into a multiple regression line RG_ANL_A or RG_ANL_B of the regression prediction model (FIG. 18), and the sanction SA serving as the target variable is calculated.

In this embodiment, the number of existing cases is small, and therefore the level (level A or B) of the sanction is predicted using a classification prediction model (a classification analysis model), whereupon the value of the target variable of the prediction subject case having an unknown target variable is predicted using either the regression prediction model of level A or the regression prediction model of level B, the regression prediction models being generated using the training data 35_A, 35_B of the existing cases in each of levels A and B. As illustrated in FIG. 18, the regression prediction models RG_ANL_A, RG_ANL_B corresponding to the respective levels A, B are generated by the method of least squares or the like using the training data of a plurality of cases for each level. In other words, even with a small number of cases, the sanction level is predicted using a classification prediction model which has a comparatively high degree of prediction precision. The sanction value of the prediction subject case is then predicted using one of the regression prediction models generated for each predicted sanction level. The regression prediction models are generated from the training data 35_A, 35_B of cases divided based on the predicted sanction level, and therefore the prediction precision of the regression prediction models increases. Accordingly, the prediction precision of the sanction value can be improved. By predicting the unknown sanction value of a prediction subject case using a model combining a classification prediction model and a regression prediction model in this manner, the prediction precision can be improved.
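The combined use of the two kinds of model (S45 to S47) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the training values, and the regression coefficients are all hypothetical, and the classification model and both regression models are assumed to share the same three explanatory variables, whereas the description allows each model its own subset variable.

```python
import math


def predict_sanction(classification_training, regression_models, features):
    """Predict the level by nearest case, then apply that level's regression."""
    _, level = min(classification_training,
                   key=lambda case: math.dist(case[0], features))
    intercept, coefficients = regression_models[level]
    value = intercept + sum(w * x for w, x in zip(coefficients, features))
    return level, value


# Hypothetical training cases: (quantified explanatory variables, level).
classification_training = [((1, 2, 1), "A"), ((2, 4, 2), "B")]

# Hypothetical fitted models per level: (intercept, one coefficient per variable).
regression_models = {
    "A": (10_000, (5_000, 10_000, 5_000)),
    "B": (50_000, (20_000, 30_000, 10_000)),
}

level, sanction = predict_sanction(classification_training, regression_models, (1, 2, 2))
```

Here the target case lies closest to the level A training case, so the level A regression is applied and yields 10,000 + 5,000 + 20,000 + 10,000 = 45,000.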

Process F for Updating Variable Region of Explanatory Variable

As depicted in FIG. 3, in this embodiment, when the prediction precision of a predicted case decreases (YES in E), the processor updates the variable region of the predetermined explanatory variable of the classification analysis model and the regression analysis models used for the prediction to improve the prediction precision of the respective analysis models (F). A reduction in the prediction precision of a predicted case can be detected by comparing the prediction result predicted for the case using the prediction models with a sanction result of a new sanction case announced by the relevant government authority.

In the process F for updating the variable region, as a first process F, the process B for searching the optimum variable region, illustrated in FIGS. 3 and 8, may be executed using a training data master acquired by adding the data of a new case to the existing training data master.

As a second process F for updating the variable region, the prediction precision of an analysis model candidate of a subset variable including a new variable region candidate may be compared with the prediction precision of the existing analysis model for each of the classification analysis model and the regression analysis models, and when the prediction precision of the analysis model candidate is higher, the existing variable region may be updated to the new variable region candidate. In this case, the prediction precision is calculated based on a training data master acquired by adding the new case to the existing cases.

Further, as a third process F, by searching the variable region having the highest prediction precision between two variable region candidates both having high prediction precision, the variable region in which the highest prediction precision is acquired can be detected in a smaller number of steps. For example, the following process is executed repeatedly: the prediction precision of first and second analysis models, corresponding to two (first and second) variable region candidates having different cap variable regions, is compared with the prediction precision of a third analysis model corresponding to a composite variable region of the first and second variable region candidates, and a determination is made as to whether the variable region with the higher prediction precision lies, between the first and second variable region candidates, on the side of the first or of the second variable region candidate.

For example, by determining, in each search phase, first and second variable regions defining the search range of the next search phase based on the prediction precision order of the three analysis models, and repeating identical comparison and determination processing to that described above, the variable region having the highest prediction precision can be found by drilling down. In the third analysis model including a composite variable region of the first and second variable region candidates, as will be described below, the subset variable includes both first variable region candidates Y_old and second variable region candidates Y_new.

The most typical example of a reduction in prediction precision is a case where, for example, the variable Y, i.e. the number of people affected by a new sanction case, increases to a scale not seen in existing cases. When the number of people increases to a scale not envisaged in existing cases in this manner, the prediction precision of the prediction model may decrease with the variable region (one of Y1 to Y5) determined in the initialization processing. In this case, the processor detects the optimum variable region of the variable Y from a training data master acquired by adding the new case, which was disclosed after the prediction model was created, to the existing cases, and updates the variable region of the prediction model to the optimum variable region.

FIG. 22 is a flowchart of the processing F for updating the variable region of the explanatory variable. The process for updating the variable region depicted in FIGS. 22 to 30 and described below is an example applied to a regression analysis model, but the process can also be applied to a classification analysis model.

The processing F for updating the variable region in FIG. 22 is repeated for all of the levels of the target variable (the sanction). In other words, the variable regions of the regression analysis models corresponding to the respective levels of the target variable are updated.

First, the processor executes the variable region updating program 24 to select the level of the target variable (the sanction) and repeat the processing of S51 to S61 for all of the levels (S51, S62).

After selecting the level, the processor selects the variable region to be used to search the variable region having the higher prediction precision (S52). Here, with respect to the plurality of variable region candidates of the update subject variable included in the subset variable of the prediction model, the processor performs the analysis model prediction precision calculation that was performed in the process B for searching the optimum variable region of the explanatory variable, as described above, and selects first and second variable region candidates between which the variable region with the highest prediction precision is predicted to exist.

FIG. 23 is a view depicting an example of a list of search subject variable regions and a list of search subject subset variables. In FIG. 23, the list of search subject variable regions includes a plurality of search subject variable regions Y_UP_1 to Y_UP_5. The plurality of search subject variable regions include the variable region Y_UP_1, in which the largest cap region is "at least 10,000 people", the same as that of the variable region of the existing analysis model. In other words, since the sanction imposed in the new case is a higher amount, it is assumed that the largest cap region of the variable Y, i.e. "the number of leakage victims", has increased above the existing largest cap region of "at least 10,000 people", and based on this assumption, the variable region candidates Y_UP_2 to Y_UP_5, in which the largest cap region "at least 10,000 people" (see FIGS. 20 and 7) of the variable region Y3 of the existing analysis model has been changed to "at least 50,000 people", "at least 100,000 people", "at least 200,000 people", and "at least 500,000 people", respectively, are selected.

In FIG. 23, subset variables SSV_UP_1 to SSV_UP_5 including the variable Y in the search subject variable regions Y_UP_1 to Y_UP_5 described above are illustrated as the search subject subset variables. In the processing of S52 described above, the prediction precision of an analysis model using each of the subset variables SSV_UP_1 to SSV_UP_5 is calculated based on a training data master 32_UP acquired by adding the new case described above using the same regression cross-validation method (FIG. 9) as that of the processing B for searching the optimum variable region.

FIG. 24 depicts an example of the prediction precision of an analysis model using each of the subset variables SSV_UP_1 to SSV_UP_5, calculated in the processing of S52. According to this example, the prediction precision order is as follows.

SSV_UP_3 > SSV_UP_2 > SSV_UP_1 > SSV_UP_4 > SSV_UP_5

Y_UP_3 > Y_UP_2 > Y_UP_1 > Y_UP_4 > Y_UP_5

100,000 > 50,000 > 10,000 > 200,000 > 500,000

In this case, the variable region having the highest prediction precision can be predicted to exist between 100,000 and 200,000. Of course, the possibility that the variable region having the highest prediction precision exists between 50,000 and 100,000 cannot be denied, but from the increase rate of the prediction precision between the variable region candidates and so on, it is assumed that the above prediction is made. In this case, the processor selects the variable region Y_UP_3 and the variable region Y_UP_4 as the variable regions to be searched (S52), and then retrieves the variable region with the highest prediction precision, which exists between the first variable region Y_UP_3 and the second variable region Y_UP_4, using a bisection method to be described below. The processing of S52 for selecting the variable regions to be searched corresponds to rough variable region search processing.

Returning to FIG. 22, the processor generates a list of subset variables acquired during first-generation search based on the selected first and second variable regions Y_UP_3 and Y_UP_4 (S53). From this point, fine variable region search processing begins.

FIG. 25 is a view depicting a list of variable regions and a list of subset variables acquired during first-generation and second-generation search. Here, during first-generation search, the prediction precision of each of a subset variable SS_I_old including a first variable region Y_I_old, a subset variable SS_I_new including a second variable region Y_I_new, and a subset variable SS_I_mid including the first and second variable regions (a subset variable including a composite variable region) is calculated. Then, from the magnitude relationships between the three prediction precision values, a determination is made as to whether the highest prediction precision is on the first variable region side of the midway point between the first variable region and the second variable region, the second variable region side of the midway point, or in the midway variable region.

Next, during second-generation search, the same processing as that of first-generation search is executed in relation to either the first variable region side or the second variable region side of the midway point between the first and second variable regions. Hence, by a bisection method, the searched region becomes gradually narrower, and as a result, the variable region having the highest prediction precision is detected in a small number of steps.

In the example of FIG. 25, a variable region candidate list 33_I (VR) of first-generation search includes the first variable region Y_I_old, the second variable region Y_I_new, and the third variable region Y_I_mid, as described above.

The first variable region Y_I_old is a variable region in which the largest cap region is at least 100,000 people, while the second variable region Y_I_new is a variable region in which the largest cap region is at least 200,000 people. Further, the third variable region Y_I_mid is a composite variable region including both the first and the second variable regions.

A subset variable candidate list 33_I (SSV) of first-generation search includes a first subset variable SS_I_old, which includes the first variable region Y_I_old (= Yo) and the variables X and Z, a second subset variable SS_I_new, which includes the second variable region Y_I_new (= Yn) and the variables X and Z, and a third subset variable SS_I_mid, which includes the first and second variable regions and the variables X and Z.

As illustrated in FIG. 22, next, the processor selects one subset variable from the subset variable list 33_UP (33_I (SSV) in FIG. 25), extracts the training data of the level selected in S51 and the selected subset variable from the training data master 32_UP, generates a regression analysis model, and calculates the prediction precision thereof (S54). The processing of S54 is identical to the cross-validation method illustrated in FIGS. 16 and 17. The processor repeats the processing of S54 for all of the three subset variables (the first to third subset variables SS_I_old, SS_I_new, and SS_I_mid) on the subset variable list 33_I (SSV) of FIG. 25 (S54).

Further, the processor evaluates the prediction precision of the analysis models of all three subset variables (S56) and makes a determination based on prediction precision comparisons executed in the processing of S57 and S59. Having determined during first-generation search that the prediction precision of the subset variable SS_I_mid including the composite variable region is highest (YES in S57), the processor selects a new variable region and a new subset variable based on the prediction precision comparison result (S58), then returns to the processing of S53 and executes second-generation variable region search (the processing of S53 onward). By executing second-generation search, a variable region and a subset variable with a higher prediction precision are searched.

Having determined that the prediction precision of the analysis model of the subset variable SS_I_new that includes the variable region Y_I_new is highest (YES in S59), on the other hand, the processor sets the variable region Y_I_new as the new variable region (S61). Conversely, having determined that the prediction precision of the analysis model of the subset variable SS_I_old that includes the variable region Y_I_old is highest (NO in S59), the processor sets the variable region Y_I_old as the new variable region (S60).

FIG. 28 is a view depicting values applied to the variable Y in the first to third variable regions on the variable region list of first-generation search. In the first variable region Y_I_old, the largest cap region is 100,000 people, and therefore, in a case where the number of leakage victims equals or exceeds 100,000, the value of the variable Y increases from 3 to 4. In the second variable region Y_I_new, meanwhile, the largest cap region is 200,000 people, and therefore, in a case where the number of leakage victims equals or exceeds 200,000, the value of the variable Y increases from 3 to 4. Further, the third variable region Y_I_mid forms a subset variable including the first and second variable regions and is therefore substantially equivalent to a subset variable including a variable region in which the largest cap region is 150,000 people.
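The effect of the largest cap region on the value of the variable Y can be illustrated with a short sketch. The lower region boundaries (1,000 and 10,000 people) and the function name encode_victims are assumptions made only for this example; the description states only that the value of Y changes from 3 to 4 once the number of leakage victims reaches the largest cap region.

```python
from bisect import bisect_right


def encode_victims(victims, largest_cap):
    """Convert a number of leakage victims into a region number for Y.

    Assumed regions: 1 = below 1,000, 2 = 1,000 to 9,999,
    3 = 10,000 up to (but excluding) largest_cap, 4 = largest_cap or more.
    The two lower boundaries are hypothetical.
    """
    boundaries = [1_000, 10_000, largest_cap]
    # bisect_right counts how many boundaries are <= victims.
    return bisect_right(boundaries, victims) + 1


# The same case is encoded differently under Y_I_old and Y_I_new.
y_old = encode_victims(150_000, largest_cap=100_000)  # cap of Y_I_old
y_new = encode_victims(150_000, largest_cap=200_000)  # cap of Y_I_new
print(y_old, y_new)
```

A case with 150,000 victims therefore receives the value 4 under Y_I_old and the value 3 under Y_I_new, which is why analysis models built on different variable regions predict different sanctions for the same case.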

As described above, analysis models respectively having variables with different variable regions have different quantitative values depending on the scale of the number of leakage victims of the case, and the value of the predicted sanction also differs. Therefore, when the relevant government authority imposes an unprecedented sanction on a case having an unprecedented number of leakage victims, the variable regions of the variable Y relating to the number of leakage victims may be modified.

FIG. 25 depicts examples of prediction formulae of the prediction model and examples of calculated prediction precision values on the subset variable list 33_I (SSV) of first-generation search. The following two examples are provided as examples of the prediction precision values illustrated on the subset variable list 33_I (SSV).

SS_I_mid (= 0.85) > SS_I_old (= 0.75) > SS_I_new (= 0.70) : C > A > B (Y_I_old to Y_I_mid side)
SS_I_mid (= 0.85) > SS_I_new (= 0.75) > SS_I_old (= 0.70) : C > B > A (Y_I_mid to Y_I_new side)

Here, A, B, and C respectively denote SS_I_old, SS_I_new, and SS_I_mid.

FIG. 26 is a view depicting examples of prediction precision comparison results and determinations made in relation thereto. Further, FIG. 27 is a view depicting relationships among the three subset variables in each of first-generation and second-generation search. FIG. 27 conceptually illustrates a function of the analysis model, on which the horizontal axis is set as the explanatory variable axes and the vertical axis is set as the target variable axis.

As described above, the two examples of comparison of the prediction precision depicted in FIG. 25 are C > A > B and C > B > A. C > A > B and C > B > A both indicate that the subset variable C including the composite variable region has the highest prediction precision (YES in S57).

The content of the determinations corresponding to the prediction precision comparisons C > A > B and C > B > A of FIG. 26 is as follows. In the case of C > A > B (case a), a region of 10% to 50% between A:SS_I_old and B:SS_I_new (between A and C) is set as the new range of variable region search. Further, in the case of C > B > A (case b), a region of 50% to 90% between A:SS_I_old and B:SS_I_new (between C and B) is set as the new range of variable region search. This determination content will be described below.

FIG. 27 illustrates conceptual relationships among the prediction model functions of the first, second, and third subset variables SS_I_old (= A), SS_I_new (= B), and SS_I_mid (= C) of first-generation search.

In the case of C > A > B (case a), the prediction model of the subset variable including the variable region searched during second-generation search (a) is between SS_II_old_a (= SS_I_mid) and SS_II_new_a. In this case, the subset variable (the variable region) having the highest prediction precision is estimated to be somewhere on the SS_I_old (Y_old) side near SS_I_mid (Y_mid) of the first generation.

FIG. 25 depicts a list 33_II (VR) of the first and second variable regions searched during second-generation search, and in the case of C > A > B, the first variable region and the second variable region are Y_II_old_a and Y_II_new_a, respectively. Further, the second, first, and third subset variables searched during second-generation search (a) in the case of C > A > B correspond respectively to SS_II_new_a, SS_II_old_a, and SS_II_mid_a on the subset variable list 33_II_a (SSV).

In the case of C > B > A (case b), meanwhile, the first and second variable regions searched during second-generation search (b) are SS_II_old_b (= SS_I_mid) and SS_II_new_b, respectively. In this case, the variable region having the highest prediction precision is estimated to be somewhere on the SS_I_new (Y_new) side near SS_I_mid (Y_mid) of the first generation.

On the list 33_II (VR) of the first and second variable regions searched during second-generation search, depicted in FIG. 25, in the case of C > B > A, the first variable region and the second variable region are Y_II_old_b and Y_II_new_b, respectively. Further, the first, second, and third subset variables searched during second-generation search (b) in the case of C > B > A correspond respectively to SS_II_old_b, SS_II_new_b, and SS_II_mid_b on the subset variable list 33_II_b (SSV).

FIG. 27 depicts the first to third subset variables SS_II_old_a, SS_II_mid_a, and SS_II_new_a during second-generation search (a) in the case of C > A > B (case a). The region searched during second-generation search is close to the third subset variable SS_I_mid during first-generation search and within the region on the first subset variable SS_I_old side.

FIG. 27 also depicts the first to third subset variables SS_II_old_b, SS_II_mid_b, and SS_II_new_b during second-generation search (b) in the case of C > B > A (case b). The region searched during second-generation search is close to the third subset variable SS_I_mid during first-generation search and within the region on the second subset variable SS_I_new side.

FIG. 29 is a view illustrating values applied to the variable Y in the first to third variable regions on the list of variable regions during second-generation search. FIG. 29 illustrates C > A > B (case a) as an example. In the first variable region Y_II_old_a, the largest cap region is 150,000 people, and therefore, in a case where the number of leakage victims equals or exceeds 150,000, the value of the variable Y increases from 3 to 4. In the second variable region Y_II_new_a, meanwhile, the largest cap region is 120,000 people, and therefore, in a case where the number of leakage victims equals or exceeds 120,000, the value of the variable Y increases from 3 to 4. Further, the third variable region Y_II_mid_a is a subset variable including the first and second variable regions and is therefore substantially equivalent to a subset variable including a variable region in which the largest cap region is between 120,000 and 150,000 people.

FIG. 30 is a view illustrating values applied to the variable Y in the first to third variable regions on the list of variable regions during second-generation search. FIG. 30 illustrates C > B > A (case b) as an example. In the first variable region Y_II_old_b, the largest cap region is 150,000 people. In the second variable region Y_II_new_b, meanwhile, the largest cap region is 170,000 people. Further, the third variable region Y_II_mid_b is a subset variable including the first and second variable regions and is therefore substantially equivalent to a subset variable including a variable region in which the largest cap region is between 150,000 and 170,000 people.
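The relationship between the 10% to 50% and 50% to 90% search ranges of FIG. 26 and the cap values of FIGS. 29 and 30 can be checked with a small calculation. The sketch below assumes that the percentages are measured linearly from the largest cap region of Y_I_old (100,000) to that of Y_I_new (200,000); the function name second_generation_range is hypothetical, and the function presupposes that C has already been found to have the highest prediction precision.

```python
def second_generation_range(old_cap, new_cap, precision_a, precision_b):
    """Return the (low, high) cap range searched in the second generation.

    Assumes C (the composite region) already has the highest precision.
    Case a (A > B): 10% to 50% of the way from old_cap to new_cap.
    Case b (otherwise): 50% to 90% of the way from old_cap to new_cap.
    """
    span = new_cap - old_cap
    low_pct, high_pct = (10, 50) if precision_a > precision_b else (50, 90)
    # Integer arithmetic keeps the cap values exact.
    return old_cap + span * low_pct // 100, old_cap + span * high_pct // 100


# Precision values of the two examples given for FIG. 25.
range_a = second_generation_range(100_000, 200_000, 0.75, 0.70)  # C > A > B
range_b = second_generation_range(100_000, 200_000, 0.70, 0.75)  # C > B > A
print(range_a, range_b)
```

Under this assumption, the caps of 150,000 and 120,000 in FIG. 29 fall within the range for case a, and the caps of 150,000 and 170,000 in FIG. 30 fall within the range for case b.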

The content of the determinations corresponding to the four remaining prediction precision comparisons A > B > C, A > C > B, B > A > C, and B > C > A made during first-generation search, depicted in FIG. 26, is as follows.

In the case of A > B > C, the first subset variable SS_I_old (the first variable region Y_I_old) corresponding to A has the highest prediction precision, and therefore the processor updates the variable region to the first variable region Y_I_old (S60) and then terminates the variable region updating processing.

In the case of A > C > B, the first subset variable SS_I_old (the first variable region Y_I_old) corresponding to A has the highest prediction precision, but the third subset variable SS_I_mid (the third variable region Y_I_mid) corresponding to C is also high, and therefore the processor updates the variable region to the finely adjusted first variable region Y_I_old and then terminates the variable region updating processing.

In the case of B > A > C, the second subset variable SS_I_new (the second variable region Y_I_new) corresponding to B has the highest prediction precision, and therefore the processor updates the variable region to the second variable region Y_I_new (S61) and then terminates the variable region updating processing.

In the case of B > C > A, the second subset variable SS_I_new (the second variable region Y_I_new) corresponding to B has the highest prediction precision, but the third subset variable SS_I_mid (the third variable region Y_I_mid) corresponding to C is also high, and therefore the processor updates the variable region to the finely adjusted second variable region Y_I_new and then terminates the variable region updating processing.

In the cases of C > A > B and C > B > A, the processor continues second-generation variable region search. When the prediction precision comparisons result in A > B > C, A > C > B, B > A > C, and B > C > A during both search generations, the processor updates the variable region to the first variable region or the second variable region, similarly to the first generation described above, and then terminates the variable region updating processing.

Further, when C > A > B or C > B > A is acquired during a generation after the second generation, the search is continued up to a predetermined generation, and at that point, the variable region is updated to the variable region corresponding to C, whereupon the search processing is terminated. The reason for this is that the demerits of extending the search process are greater than the merits of improving the prediction precision.
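The fine variable region search of S53 to S61 can be summarised in the following sketch. It is an illustration under several assumptions, not the embodiment itself: the composite region is treated as a cap at the midpoint (as FIG. 28 suggests), the next generation pairs that midpoint with the far end of the 10% to 50% or 50% to 90% range, and precision_of stands in for the cross-validation of S54. The names fine_cap_search and toy_precision and the peak at 128,000 are hypothetical.

```python
def fine_cap_search(old_cap, new_cap, precision_of, max_generations=3):
    """Narrow the largest cap region by comparing A (old), B (new), C (mid)."""
    if max_generations < 1:
        raise ValueError("max_generations must be at least 1")
    for generation in range(1, max_generations + 1):
        mid_cap = (old_cap + new_cap) // 2       # stands in for the composite region
        a = precision_of(old_cap)
        b = precision_of(new_cap)
        c = precision_of(mid_cap)
        if c > a and c > b:                      # YES in S57
            if generation == max_generations:
                return mid_cap                   # stop at the predetermined generation
            # Case a (A > B): search toward old; case b: search toward new.
            pct = 10 if a > b else 90
            far_end = old_cap + (new_cap - old_cap) * pct // 100
            old_cap, new_cap = mid_cap, far_end  # S58, then back to S53
            continue
        # A or B is highest: S61 if B is highest, otherwise S60.
        return new_cap if b > a else old_cap


def toy_precision(cap):
    """Hypothetical precision curve with a single peak at 128,000."""
    return 1.0 - abs(cap - 128_000) / 1_000_000


result = fine_cap_search(100_000, 200_000, toy_precision)
print(result)
```

With this toy curve the search moves from the pair (100,000, 200,000) to (150,000, 110,000) and then to (130,000, 114,000), terminating in the third generation at a cap close to the assumed peak after nine precision evaluations.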

According to the processing F for updating the variable region of the explanatory variable, as illustrated in FIG. 22, the rough variable region search processing of S52 is combined with the fine variable region search processing of S53 to S61, and as a result, the number of steps involved in the variable region updating processing can be reduced. Needless to mention, the processing F for updating the variable region of the explanatory variable may be configured similarly to the processing B for searching the optimum variable region of the explanatory variable during initialization so that a large number of variable region candidates are set, the prediction precision of the analysis model is calculated comprehensively, and the variable region is updated to the variable region having the best prediction precision.

Claims

1. A prediction device comprising: a processor; and a memory that is accessible by the processor, wherein the processor is configured to: (a) generate a classification prediction model and a regression prediction model based on a plurality of training data which includes a plurality of sets of case data, each set of the case data including a value of a target variable corresponding to values of a plurality of explanatory variables, at least one of the values of the plurality of explanatory variables being converted, based on a variable region including a plurality of regions, into one of identification numbers of the plurality of regions, the classification prediction model including a plurality of classification training data, which includes the plurality of training data, the value of the target variable in each of the plurality of training data being converted into one of a plurality of levels corresponding to magnitude of the value of the target variable, and determining the level of the target variable included in the classification training data, among the plurality of classification training data, in which a norm distance of the value of the explanatory variable to a prediction subject case data is shortest as the level of the target variable of the prediction subject case data, and the regression prediction model including, for each of the plurality of levels, a regression line that is close to coordinate points of the explanatory variables of a plurality of level-divided training data into which the plurality of training data are divided according to the level of the target variable, and calculating a value of the target variable of the prediction subject case data based on the regression line corresponding to the level determined by the classification prediction model; (b) predict a level of the target variable of the prediction subject case data by applying the prediction subject case data to the classification prediction model; and (c) predict a value of the target variable of the prediction subject case data by applying the prediction subject case data to the regression line corresponding to the predicted level.
2. The prediction device according to claim 1, wherein the processor is configured to perform searching an optimum variable region of the classification prediction model before generating the classification prediction model, the searching the optimum variable region for the classification prediction model including: generating a plurality of subset variables, which are subsets of a set including the plurality of explanatory variables and a plurality of variable region candidates set in at least one of the explanatory variables; calculating a prediction precision of the classification prediction model for each of the subset variables using training data of the plurality of subset variables, which are acquired by extracting values of the explanatory variables and the variable region candidates both included in each of the plurality of subset variables from the plurality of training data; and setting a variable region candidate that is most included in the N subset variables with the highest prediction precision as the optimum variable region of the classification prediction model, N being a positive integer.
3. The prediction device according to claim 2, wherein the processor is configured to perform searching an optimum variable region of the regression prediction model before generating the regression prediction model, the searching the optimum variable region for the regression prediction model including: (d) calculating a prediction precision of the regression prediction model for each of the subset variables using training data of the plurality of subset variables, which are acquired by extracting values of the explanatory variables and the variable region candidates both included in each of the plurality of subset variables from the plurality of level-divided training data; (e) setting a variable region candidate that is most included in the M subset variables with the highest prediction precision as the optimum variable region of the regression prediction model, M being a positive integer; and repeating the processing of (d) and (e) for each of the plurality of levels.
4. The prediction device according to claim 2 or 3, wherein, in the plurality of variable region candidates, at least the largest region of the plurality of regions forming each of the variable region candidates is different.
5. The prediction device according to claim 2, wherein, the calculating a prediction precision of the classification prediction model for each of the subset variables includes, repeating, using each case as an evaluation subject case, predicting the level of the target variable in the training data of the evaluation subject case using a classification prediction model including training data, among the training data of the subset variable, of remaining cases other than the evaluation subject case, and determining whether or not prediction of the evaluation subject case is success according to whether or not the predicted level matches the level of the training data of the evaluation subject case, repeat the prediction and determination processing, and outputting a success rate of all of the cases as the prediction precision.
6. The prediction device according to claim 3, wherein, the calculating a prediction precision of the regression prediction model for each of the subset variables includes, (f) repeating, using each case as an evaluation subject case, predicting the value of the target variable in the training data of the evaluation subject case using a regression prediction model generated based on the training data, among the level-divided training data of the subset variable into which the training data of the subset variable are divided according to the level of the target variable, of remaining cases other than the evaluation subject case, and determining a ratio of the predicted value to the value of the target variable in the training data of the evaluation subject case; and (g) outputting a mean value of the ratios corresponding to all of the cases as the prediction precision.
7. The prediction device according to claim 1, wherein, when a prediction precision acquired by comparing the value of the target variable in the training data of a prediction subject case, which was predicted using the regression prediction model, with a determined value of the target variable of a new case falls below a reference precision, the processor is configured to perform updating the variable region of the regression prediction model, the updating the variable region of the regression prediction model includes: generating a plurality of subset variables, which are subsets of a set including the plurality of explanatory variables and a plurality of variable region candidates set in at least one of the explanatory variables; (h) calculating the prediction precision of the regression prediction model for each of the subset variables using training data of the plurality of subset variables, which are acquired by extracting values of the explanatory variables and the variable region candidates both included in each of the plurality of subset variables from a plurality of new level-divided training data acquired by dividing a plurality of new training data according to the level, the plurality of new training data being acquired by adding the training data of the new case to the plurality of training data; (i) updating the variable region of the regression prediction model with a variable region candidate included in a subset variable having a relatively high prediction precision; and repeating the processing of (h) and (i) for each of the plurality of levels.
8. The prediction device according to claim 7, wherein, in the plurality of variable region candidates, at least largest regions of the respective variable region candidates differ from each other, the plurality of variable region candidates include a first variable region candidate including a first largest region that is larger than the largest region of the variable region candidate having the highest prediction precision, a second variable region candidate including a second largest region that is smaller than the largest region of the variable region candidate having the highest prediction precision, and a third variable region candidate including a composite variable region of the first variable region candidate and the second variable region candidate, and during the processing of (h), the processor is configured to: execute first-generation search processing for comparing with each other a first prediction precision of the regression prediction model of a first subset variable that includes the first variable region candidate, a second prediction precision of the regression prediction model of a second subset variable that includes the second variable region candidate, and a third prediction precision of the regression prediction model of a third subset variable that includes the third variable region candidate; and when the third prediction precision is higher than the first prediction precision and the second prediction precision, execute second-generation variable region search processing including setting a variable region including a third largest region midway between the first largest region and the second largest region as the first variable region candidate, setting either a variable region including a fourth largest region between the third largest region and the first largest region or a variable region including a fifth largest region between the third largest region and the second largest region as the second variable region candidate, and executing the first-generation search processing.
9. A non-transitory computer readable medium that includes therein a prediction program causing a computer to execute a process comprising: (a) generating a classification prediction model and a regression prediction model based on a plurality of training data which includes a plurality of sets of case data, each set of the case data including a value of a target variable corresponding to values of a plurality of explanatory variables, at least one of the values of the plurality of explanatory variables being converted, based on a variable region including a plurality of regions, into one of identification numbers of the plurality of regions, the classification prediction model including a plurality of classification training data, which includes the plurality of training data, the value of the target variable in each of the plurality of training data being converted into one of a plurality of levels corresponding to magnitude of the value of the target variable, and determining the level of the target variable included in the classification training data, among the plurality of classification training data, in which a norm distance of the value of the explanatory variable to a prediction subject case data is shortest as the level of the target variable of the prediction subject case data, and the regression prediction model including, for each of the plurality of levels, a regression line that is close to coordinate points of the explanatory variables of a plurality of level-divided training data into which the plurality of training data are divided according to the level of the target variable, and calculating a value of the target variable of the prediction subject case data based on the regression line corresponding to the level determined by the classification prediction model; (b) predicting a level of the target variable of the prediction subject case data by applying the prediction subject case data to the classification prediction model; and (c) predicting a value of the target variable of the prediction subject case data by applying the prediction subject case data to the regression line corresponding to the predicted level.
10. A method of predicting comprising, by a computer: (a) generating a classification prediction model and a regression prediction model based on a plurality of training data which includes a plurality of sets of case data, each set of the case data including a value of a target variable corresponding to values of a plurality of explanatory variables, at least one of the values of the plurality of explanatory variables being converted, based on a variable region including a plurality of regions, into one of identification numbers of the plurality of regions, the classification prediction model including a plurality of classification training data, which includes the plurality of training data, the value of the target variable in each of the plurality of training data being converted into one of a plurality of levels corresponding to magnitude of the value of the target variable, and determining the level of the target variable included in the classification training data, among the plurality of classification training data, in which a norm distance of the value of the explanatory variable to a prediction subject case data is shortest as the level of the target variable of the prediction subject case data, and the regression prediction model including, for each of the plurality of levels, a regression line that is close to coordinate points of the explanatory variables of a plurality of level-divided training data into which the plurality of training data are divided according to the level of the target variable, and calculating a value of the target variable of the prediction subject case data based on the regression line corresponding to the level determined by the classification prediction model; (b) predicting a level of the target variable of the prediction subject case data by applying the prediction subject case data to the classification prediction model; and (c) predicting a value of the target variable of the prediction subject case data by applying the prediction subject case data to the regression line corresponding to the predicted level.