JP2020140521A

JP2020140521A - Human determination prediction device, prediction program and prediction method

Info

Publication number: JP2020140521A
Application number: JP2019036418A
Authority: JP
Inventors: Hidenobu Oguri; Koichi Ito
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2019-02-28
Filing date: 2019-02-28
Publication date: 2020-09-03
Also published as: GB2583176A; GB202002189D0

Abstract

To provide a human determination prediction device with improved prediction accuracy. SOLUTION: A prediction device has a processor and a memory that the processor can access. The processor (a) generates, for a plurality of case data having values of objective variables corresponding to values of a plurality of explanatory variables, a classification type prediction model and a regression type prediction model based on a plurality of learning data obtained by replacing the values of the explanatory variables with identification numbers of a plurality of regions based on a variable region having the plurality of regions, (b) applies the learning data of a prediction target case to the classification type prediction model to predict the level of the objective variable of the learning data of the prediction target case, and (c) applies the learning data of the prediction target case to the regression line corresponding to the predicted level to predict the value of the objective variable of the learning data of the prediction target case. SELECTED DRAWING: Figure 2

Description

The present invention relates to a prediction device, a prediction program, and a prediction method for human judgment.

Human judgments, such as rulings by government agencies on violations of laws and regulations or scoring by judges in gymnastics and figure skating competitions, include judgments based on flowcharts and checklists that are not clearly defined. The following description uses judgments on violations of laws and regulations as an example.

For example, as information processing technology develops, the risks of attacks on information systems and of information leaks are diversifying. In response, many companies face difficult decisions when updating their security systems. The reason is that the government agencies responsible for enforcing the personal information protection laws of each country tend to tighten regulations on personal information, and the penalties imposed on companies for breaching their obligation to protect personal information are becoming stricter.

Japanese Unexamined Patent Publication No. 2006-252011
Japanese Unexamined Patent Publication No. 2018-153901

However, the relationship between information security incidents and the penalties and sanctions imposed by government agencies is opaque and difficult to predict using machine learning or artificial intelligence. In particular, because few cases have resulted in sanctions, sufficient training data cannot be obtained, and the prediction accuracy of an analysis model built by machine learning does not improve. In addition, each time the scale of the impact of a violation of a personal information protection law exceeds that of past cases, government authorities may determine penalties and sanctions using new, undisclosed criteria. As a result, many companies are forced to invest heavily in security systems to avoid excessive penalties and sanctions from government agencies.

As described above, it is difficult in practice to predict the judgment of government authorities, so many companies set priorities for security system investment while bearing risk.

Therefore, an object of a first aspect of the present embodiment is to provide a prediction device, a prediction program, and a prediction method for human judgment with improved prediction accuracy.

A first aspect of the present embodiment is a prediction device comprising a processor and a memory accessible to the processor, wherein the processor:
(a) for a plurality of case data each having a value of an objective variable corresponding to values of a plurality of explanatory variables, generates a classification type prediction model and a regression type prediction model based on a plurality of training data in which the values of the explanatory variables are replaced with identification numbers of a plurality of regions, based on a variable range having the plurality of regions,
the classification type prediction model having a plurality of classification type training data in which the values of the objective variable of the plurality of training data are replaced with a plurality of levels according to magnitude, and determining, as the level of the objective variable of the training data of a prediction target case, the level of the objective variable of the classification type training data whose explanatory variable values have the shortest norm distance to those of the training data of the prediction target case,
the regression type prediction model having, for each level, a regression line close to the coordinate points of the explanatory variables of the training data grouped by level of the objective variable, and calculating the value of the objective variable of the training data of the prediction target case using the regression line corresponding to the level determined by the classification type prediction model;
(b) applies the training data of the prediction target case to the classification type prediction model to predict the level of the objective variable of the training data of the prediction target case; and
(c) applies the training data of the prediction target case to the regression line corresponding to the predicted level to predict the value of the objective variable of the training data of the prediction target case.
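The two-stage prediction of steps (a) to (c) can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the use of the Euclidean norm, the thresholds in `level_edges`, and the least-squares fit via NumPy are all assumptions made for the example.

```python
import numpy as np

def fit_two_stage(X, y, level_edges):
    """X: (n, d) region identification numbers; y: (n,) objective values.
    level_edges: ascending thresholds that split y into levels 0, 1, ... (assumed)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Classification type training data: objective values replaced with levels.
    levels = np.searchsorted(level_edges, y, side="right")
    # Regression type model: one least-squares line per level.
    lines = {}
    for lv in np.unique(levels):
        mask = levels == lv
        A = np.hstack([X[mask], np.ones((int(mask.sum()), 1))])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        lines[int(lv)] = coef
    return {"X": X, "levels": levels, "lines": lines}

def predict_two_stage(model, x):
    """Step (b): level of the nearest training point; step (c): that level's line."""
    x = np.asarray(x, dtype=float)
    dists = np.linalg.norm(model["X"] - x, axis=1)  # Euclidean norm (assumed)
    level = int(model["levels"][np.argmin(dists)])
    value = float(np.append(x, 1.0) @ model["lines"][level])
    return level, value
```

Splitting the cases by level before fitting keeps each regression line local to cases of similar magnitude, which is the stated purpose of combining the two model types.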

According to the first aspect, the prediction accuracy of the prediction device can be improved.

The drawings, in order, are as follows.
A diagram illustrating an example of a decision act or a judgment act.
A diagram showing an example of a prediction model corresponding to the judgment act of FIG. 1.
A flowchart outlining the generation, prediction, and update of the prediction model in the present embodiment.
A diagram showing a configuration example of the prediction device in the present embodiment.
A flowchart of the process of generating the training data master of existing cases.
A diagram showing an example of case data.
A diagram showing a list of variable ranges of explanatory variables and an example of the training data master.
A flowchart of the processing of the variable search program.
A flowchart of the processing of the variable search program.
A diagram showing an example of a subvariable list.
A flowchart of process S23, which calculates the prediction accuracy of the classification type analysis model.
A diagram illustrating the calculation of the prediction accuracy of the classification type analysis model.
A diagram showing how an unknown case is predicted by the classification type analysis model.
A diagram showing how the prediction accuracy of the classification type analysis model is calculated.
A diagram showing how the optimum subvariable of the classification type analysis model is determined.
A flowchart of process S34, which calculates the prediction accuracy of the regression type analysis model.
A diagram illustrating the calculation of the prediction accuracy of the regression type analysis model.
A diagram showing how the regression type analysis model is generated.
A diagram showing an example in which the subvariables of the regression type analysis model are sorted by prediction accuracy.
A diagram showing how the optimum variable range of the regression type analysis model is determined.
A flowchart of processes C and D, which generate a prediction model and predict a prediction target case with the prediction model.
A flowchart of process F, which updates the variable ranges of the explanatory variables.
A diagram showing an example of a variable range list and a subvariable list to be searched.
A diagram showing an example of the prediction accuracy of the analysis model for each of the subvariables SSV_UP_1 to SSV_UP_5 calculated in process S52.
A diagram showing the variable range lists and subvariable lists in the first-generation and second-generation searches.
A diagram showing a comparison of prediction accuracies and an example of the resulting determination.
A diagram showing the relationship among the three subvariables in each of the first-generation and second-generation searches.
A diagram showing the values applied to the variable Y of the first to third variable ranges in the variable range list of the first-generation search.
A diagram showing the values applied to the variable Y of the first to third variable ranges in the variable range list of the second-generation search.
A diagram showing the values applied to the variable Y of the first to third variable ranges in the variable range list of the second-generation search.

[Glossary]
The following are brief definitions of the terms used in the present embodiment.

Analysis model: A model that predicts the objective variable from the plurality of explanatory variables in a case. The present embodiment uses a classification type analysis model and a regression type analysis model together. An analysis model is also called a prediction model.

Case: A case has a plurality of explanatory variables and an objective variable to be predicted.

Training data master: A set of data obtained by constantizing and quantifying the variables of a plurality of existing cases, so that each case is represented by the quantified values (region identification numbers) of its explanatory variables and the value or level of its objective variable.

Variable constantization: Expressing the variables in a case as numerical values.

Variable quantification: Converting the value of a variable into a region identification number (1, 2, 3, and so on; the quantified value) based on a variable range having a plurality of regions (the quantification standard).

Variable range: The plurality of regions used as the standard when converting the value of an explanatory variable into a region identification number. Setting an optimum variable range for each explanatory variable raises the prediction accuracy of the analysis model. For example, by assigning different variable ranges Y1, Y2, Y3, and so on to a variable Y, different explanatory variables Y1, Y2, Y3, and so on can be defined for the same explanatory variable Y. In the following description, a variable Y under a given variable range is denoted by the same symbol as that variable range. The variable Y1 means the variable Y under the variable range Y1.

Variable range search: Calculating the prediction accuracy of the analysis model for each of the variable range candidates applied to an explanatory variable Y, and for their combinations, and finding the variable range that yields high prediction accuracy. The variable range search is performed in the initialization step for the explanatory variables.
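As a rough illustration of the variable range search, the sketch below scores each candidate variable range for a single explanatory variable and keeps the best one. The scoring method (leave-one-out accuracy of a nearest-neighbor level classifier), the function names, and the candidate names are assumptions for this example; the sketch also evaluates candidates individually only and does not cover combinations.

```python
from bisect import bisect_right

def quantize(value, boundaries):
    """Region identification number (1, 2, 3, ...) of value under ascending boundaries."""
    return bisect_right(boundaries, value) + 1

def loo_accuracy(rows, levels):
    """Leave-one-out accuracy of a nearest-neighbor level classifier (assumed metric)."""
    hits = 0
    for i, (row, level) in enumerate(zip(rows, levels)):
        # Squared distance from case i to every other case, with that case's level.
        others = [(sum((a - b) ** 2 for a, b in zip(row, r)), lv)
                  for j, (r, lv) in enumerate(zip(rows, levels)) if j != i]
        _, predicted = min(others, key=lambda t: t[0])
        hits += predicted == level
    return hits / len(rows)

def search_variable_range(values, levels, candidates):
    """Return the name of the best-scoring candidate and all scores."""
    scored = {}
    for name, boundaries in candidates.items():
        rows = [(quantize(v, boundaries),) for v in values]
        scored[name] = loo_accuracy(rows, levels)
    best = max(scored, key=scored.get)
    return best, scored
```

For example, `search_variable_range(values, levels, {"Y1": [100, 1000], "Y2": [10000]})` compares a three-region candidate against a two-region candidate for the same raw values.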

Subvariable: A combination of explanatory variables that is a subset of the explanatory variables having different variable ranges and the explanatory variables having a single variable range. An analysis model is determined for each subvariable.

Prediction by analysis model: Generating an analysis model from training data in which the explanatory variables of existing cases are quantified using the variable range selected in the search step, and predicting a prediction target case with that analysis model.

Variable range update: When the prediction accuracy of a generated analysis model declines, the judgment criteria are regarded as having changed in response to a new case. The prediction accuracy of the analysis model is calculated for a plurality of subvariables, each having variable range candidates that are estimated to give high prediction accuracy; the variable range is updated to one that gives higher prediction accuracy than the existing analysis model; and the analysis model is updated.

In the present embodiment, first, an analysis model is determined from existing cases that have already been decided or judged. The target is a real-world decision act or judgment act whose outcome is difficult to predict because qualitative human judgment is mixed with quantitative judgment based on checklists, flowcharts, and the like.

Second, the analysis model consists of a classification type analysis model and a regression type analysis model.

Third, in the initialization step, the optimum variable ranges of the explanatory variables are searched for each of the two analysis models.

Fourth, when the prediction accuracy of the existing analysis model declines because a new case has appeared, the variable ranges of the explanatory variables are updated and an analysis model with improved prediction accuracy is determined. Each of these features is described below.

Examples of decision acts or judgment acts include (1) the determination of fines or sanctions by administrative authorities when a company violates laws and regulations, (2) the determination of a bid amount for a very large project, (3) pass/fail decisions based on both an interview and a written examination, and (4) the determination of a prison sentence for a criminal act. The present embodiment may also be applicable to decision acts or judgment acts other than these.

[Judgment act and prediction model]
FIG. 1 is a diagram illustrating an example of a decision act or a judgment act. The example of FIG. 1 concerns the determination of a sanction for a violation of a law. For example, company 1 builds and operates a system that uses personal information. A problem such as a leak of personal information occurs because of internal misconduct, external intrusion, or an attack on this system (S1). When the problem is reported to the administrative authority 2 (S2), the authority 2 interviews the company and its customers to confirm the circumstances of the problem (S3). The authority 2 then applies the law to those circumstances and judges whether a violation has occurred (S4). Finally, the authority 2 notifies company 1 of the punishment or sanction decided according to the judgment (S5) and publishes the judged case 3 (S6). A company that performs risk assessments for other companies generates a prediction model 4 that predicts judgments using past cases as training data, and uses the prediction model 4 to predict the judgment of an unknown case (S7). Company 1 improves its system with reference to this predicted judgment.

FIG. 2 is a diagram showing an example of a prediction model corresponding to the judgment act of FIG. 1. The existing cases 3 include a plurality of judged cases published by the administrative authorities. The cases 3 are, for example, cases of personal information leaks. First, training data 5 consisting of explanatory variables x, y, z and an objective variable SA (Sanction) is generated for each of the existing cases 3. The variables contained in each existing case are manually converted to numerical values, a variable range is set for each variable, and the value of each variable is converted into an identification number indicating the region of the variable range into which it falls.

The training data 5 in the example of FIG. 2 contains the information type x, the number of affected persons y, and the leakage period z as explanatory variables, and the sanction SA as the objective variable. Machine learning on the training data 5 generates the prediction model 4 for cases whose objective variable is unknown.

For example, the explanatory variable x is the information type. Its variable range consists of three region identification numbers: "1" when the personal information includes one of address, name, and age, "2" when it includes two of them, and "3" when it includes all three. The explanatory variable y is the number of affected persons, that is, the number of persons whose personal information was leaked. Its variable range consists of three region identification numbers: "1" for fewer than 100 persons, "2" for fewer than 1000 persons, and "3" for 1000 persons or more. The explanatory variable z is the leakage period. Its variable range consists of three region identification numbers: "1" for less than 5 years, "2" for less than 10 years, and "3" for 10 years or more. The objective variable SA is the sanction.
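The quantification described above can be written as a short lookup. The thresholds are the ones given in the text; the dictionary keys (`info_types`, `affected`, `years`) and the function names are hypothetical names chosen for this sketch.

```python
from bisect import bisect_right

# Region boundaries taken from the example in the text.
VARIABLE_RANGES = {
    "y": [100, 1000],  # affected persons: <100 -> 1, <1000 -> 2, >=1000 -> 3
    "z": [5, 10],      # leakage period in years: <5 -> 1, <10 -> 2, >=10 -> 3
}
INFO_TYPES = {"address", "name", "age"}

def region_id(value, boundaries):
    """Identification number of the region that value falls in."""
    return bisect_right(boundaries, value) + 1

def quantize_case(case):
    """Convert one case (hypothetical dict layout) into region identification numbers."""
    return {
        # x: how many of address, name, and age the leaked data contains.
        "x": len(set(case["info_types"]) & INFO_TYPES),
        "y": region_id(case["affected"], VARIABLE_RANGES["y"]),
        "z": region_id(case["years"], VARIABLE_RANGES["z"]),
    }
```

A leak of names and addresses affecting 999 persons over 10 years would be quantified as x = 2, y = 2, z = 3.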

The prediction model 4 is a model based on a predetermined machine learning algorithm 6. The machine learning algorithm 6 is, for example, linear regression analysis or polynomial regression analysis. As an example of the prediction model 4, FIG. 2 shows a multiple regression analysis model that calculates the objective variable SA as a linear function of the variables x, y, and z.
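A multiple regression model of this kind, SA = a·x + b·y + c·z + d, can be fitted by ordinary least squares. The sketch below is one possible way to do so with NumPy; the function names are hypothetical and the text does not specify the fitting procedure.

```python
import numpy as np

def fit_multiple_regression(X, sa):
    """Least-squares fit of SA = a*x + b*y + c*z + d. Returns [a, b, c, d]."""
    X = np.asarray(X, dtype=float)
    A = np.hstack([X, np.ones((len(X), 1))])  # extra column of ones for the intercept d
    coef, *_ = np.linalg.lstsq(A, np.asarray(sa, dtype=float), rcond=None)
    return coef

def predict_sa(coef, xyz):
    """Evaluate the fitted linear function for one case (x, y, z)."""
    return float(np.append(np.asarray(xyz, dtype=float), 1.0) @ coef)
```

With only a handful of published cases, such a fit is easily underdetermined, which is one reason the embodiment reduces each variable to a few region identification numbers.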

Once the prediction model 4 has been generated, the explanatory variables x, y, z of a case 7 whose objective variable is unknown are input to the prediction model 4, the function of the prediction model is evaluated, and the predicted value 8 of the objective variable SA is output.

In the above example, a prediction model is generated from judged cases in which several kinds of judgment criteria are mixed, such as qualitative human judgment and quantitative judgment based on checklists and flowcharts, and the model predicts the objective variable of a case whose objective variable is unknown. Building such a prediction model raises the following problems.

First, the judgment mechanisms used in existing cases, such as checklists and flowcharts, are not disclosed. For example, in some cases several people make discontinuous judgments using several judgment methods before the sanction is decided. In addition, although general safety standards are public, how those standards are combined and their details are not.

Second, although past cases in which sanctions were imposed are public, there are very few of them, so not enough training data can be prepared for machine learning. With so little training data, the search space is too large for machine learning to produce a meaningful prediction model.

Third, the qualitative and quantitative criteria used for judgments in existing cases may change over time. For example, when an incident causes the greatest damage on record, the quantification of past cases may become meaningless and the prediction accuracy for future cases may decline.

[Outline of generation, prediction, and update of the prediction model in the present embodiment]
FIG. 3 is a flowchart outlining the generation, prediction, and update of the prediction model in the present embodiment.

To generate the prediction model, a training data master of existing cases is generated (A), the optimum variable ranges of the explanatory variables of the cases are searched for (B), and a prediction model is generated for the subvariable, which is the combination of variables under the variable ranges selected by the search (C).

To predict an unknown case, the value of the objective variable of the prediction target case is predicted by the prediction model generated in step C (D). As long as the prediction accuracy for unknown cases does not decline (NO in E), prediction D with the prediction model determined in step C is repeated.

When the prediction accuracy of the prediction model declines (YES in E), for example when the predicted value for an unknown case differs greatly from the sanction in the case as subsequently published, the variable ranges of the explanatory variables are updated to more appropriate ones based on the existing cases and the newly published cases (F). A prediction model is then determined again for the subvariable of the explanatory variables under the updated variable ranges (C) and used to predict subsequent unknown cases (D).
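The loop through D, E, F, and C can be sketched as follows. This is only an outline under stated assumptions: the degradation criterion (relative error above a `tolerance` of 0.5), the function names, and the `rebuild` callback standing in for steps F and C are all hypothetical; the text does not state how degradation is measured.

```python
def accuracy_degraded(predicted, actual, tolerance=0.5):
    """Decision E (assumed criterion): relative error exceeds the tolerance."""
    return abs(predicted - actual) > tolerance * abs(actual)

def run_cycle(model, new_cases, rebuild):
    """Predict each newly published case (D); on degradation, rebuild the model.

    model:     callable mapping explanatory variables to a predicted sanction
    new_cases: iterable of (explanatory variables, published sanction)
    rebuild:   callable standing in for steps F and C; receives the cases seen so far
    """
    history = []
    for features, published in new_cases:
        predicted = model(features)           # D: predict with the current model
        history.append((features, published))
        if accuracy_degraded(predicted, published):  # E: YES
            model = rebuild(history)          # F then C: update ranges, regenerate model
    return model
```

In practice the `rebuild` callback would also have access to the existing cases, since the text states that the update uses both the existing and the newly published cases.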

In the generation A of the training data master, the variables of a plurality of existing cases are constantized and quantified, so that each case becomes data having the quantified values (region identification numbers) of its explanatory variables and the value or level of its objective variable. Constantization and quantification of variables are as defined in the glossary.

In the variable range search B, the prediction accuracy of the prediction model is calculated for each of the variable range candidates applied to an explanatory variable Y, and for their combinations, and the variable range that yields high prediction accuracy is extracted. The variable range search B is performed in the initialization step for the explanatory variables, which precedes the process C of determining the prediction model.

In the prediction model generation step C and the prediction process D, a prediction model is generated from training data in which the explanatory variables of existing cases are quantified using the variable range extracted in the search step B (C), and the value of the objective variable of a prediction target case is predicted with that model (D).

In the variable range update F, when the prediction accuracy of a generated analysis model declines, the judgment criteria are regarded as having changed in response to a new case. The prediction accuracy of the analysis model is calculated for a plurality of subvariables, each having variable range candidates that are estimated to give high prediction accuracy; the variable range is updated to one that gives higher prediction accuracy than the existing analysis model; and the analysis model is updated.

FIG. 4 is a diagram showing a configuration example of the prediction device in the present embodiment. The prediction device 10 is an information processing device, that is, a computer, such as a server or a personal computer. The prediction device 10 has a processor 11 such as a CPU, a main memory 12 such as a RAM (Random Access Memory), a network interface 13, a bus 14, and large-capacity auxiliary storage devices 20 and 30 such as an HDD (Hard Disk Drive) or SSD (Solid State Drive). The network interface 13 is accessible from client terminal devices 40 and 41 via the Internet or the like.

The auxiliary storage device 20 stores various programs, including a variable range search program 21, a prediction model generation program 22, a prediction model program 23, and a variable range update program 24. The auxiliary storage device 30 stores various data, including case data 31, a training data master 32, a variable range list and subvariable list 33, training data 34 for classification, training data (1) 35_1 for regression, and training data (2) 35_2 for regression.

The processor 11 of the computer 10 executes the variable range search program 21 to perform the variable range search process B. Similarly, the processor 11 executes the prediction model generation program 22 to perform the prediction model generation process C. The processor 11 also executes the prediction model program 23 to perform the prediction process D, in which the prediction model predicts an unknown case. Finally, the processor 11 executes the variable range update program 24 to perform the variable range update process F.

[Generation A of the training data master]
FIG. 5 is a flowchart of the process of generating the training data master of existing cases. Values that can be constantized are extracted from the existing cases as explanatory variables (S10). For this extraction, the processor 11 may execute the explanatory variable extraction program shown in FIG. 2 and extract the explanatory variables by analyzing the text of the case reports. Alternatively, the explanatory variables may be extracted from the case reports manually.

図６は、事例データの一例を示す図である。図６の事例は、前述と同様に個人情報漏洩事例である。図６では、事例１〜５について、定数化可能な説明変数X,Y,Zとして、情報種類X、漏洩人数Y、漏洩期間Zが抽出されている。情報種類Xは、漏洩データにどのような個人情報の種類が含まれていたかを示す。漏洩人数Yは、漏洩したデータに含まれていた個人情報の人数を示す。漏洩期間Zは、データが漏洩した期間を示す。 FIG. 6 is a diagram showing an example of case data. As before, the cases in FIG. 6 are personal information leakage cases. In FIG. 6, the information type X, the number of leaked persons Y, and the leakage period Z are extracted for cases 1 to 5 as explanatory variables X, Y, and Z that can be converted to constants. The information type X indicates which types of personal information were included in the leaked data. The number of leaked persons Y indicates the number of persons whose personal information was contained in the leaked data. The leakage period Z indicates the period during which the data was leaked.

図５に戻り、プロセッサ１１が、図２に示した説明変数の変数域候補決定プログラムを実行して、抽出した説明変数の値の範囲に基づいて、所定の範囲の領域を所定の数（粒度）で分割した変数域の候補を単数または複数決定する（S11）。または、抽出した説明変数の変数域の候補を人為的に決定してもよい。 Returning to FIG. 5, the processor 11 executes the variable range candidate determination program for the explanatory variables shown in FIG. 2 and, based on the range of values of the extracted explanatory variables, determines one or more variable range candidates, each obtained by dividing a predetermined range into a predetermined number (granularity) of regions (S11). Alternatively, the variable range candidates for the extracted explanatory variables may be determined manually.

図７は、説明変数の変数域のリストと学習データマスタの例を示す図である。説明変数の変数域候補の決定S11では、後で実行される変数域探索処理の探索候補である複数の変数域候補が決定される。できるだけ多くの変数域候補を決定するのが好ましい。その後、プロセッサは、変数域探索プログラム２１を実行して、予測モデルの予測精度が高くなるような変数域を変数域候補から探し出す。 FIG. 7 is a diagram showing an example of the list of variable ranges of the explanatory variables and of the learning data master. In the determination S11 of variable range candidates for the explanatory variables, a plurality of variable range candidates are determined as search candidates for the variable range search process executed later. It is preferable to determine as many variable range candidates as possible. The processor then executes the variable area search program 21 to find, among the candidates, a variable range that yields high prediction accuracy for the prediction model.

図７では、説明変数である情報種類Xには、個人情報が１種類含まれる、２種類含まれる、３種類含まれる、の３つの領域を有する変数域が決定されている。また、漏洩期間Zには、図示される４つの領域を有する変数域が決定される。 In FIG. 7, a variable range with three regions is determined for the explanatory variable information type X: one type, two types, or three types of personal information included. For the leakage period Z, a variable range with the four regions shown in the figure is determined.

そして、一例として、漏洩人数Yの変数域として変数域Y1,Y2,Y3,Y4,Y5が候補として決定されている。これら変数域Y1,Y2,Y3,Y4,Y5は、図示されるとおり、変数域の最大領域（キャップ領域）が順に多くなるように決定されている。図６の事例データベースによれば、漏洩人数Yのデータの範囲が５００人以上１００００人以下であることを考慮して、変数域Y1,Y2,Y3は、キャップ領域がそれぞれ１００人以上、１０００人以上、１００００人以上とそれぞれ異なっている。さらに、将来の変数域の更新で候補に追加する可能性のある変数域Y4,Y5（キャップ領域が５万人以上、１０万人以上）が、変数域候補に加えられている。将来の漏れの規模を予測して、これらの変数域が候補に追加される。 As an example, the variable ranges Y1, Y2, Y3, Y4, and Y5 are determined as candidates for the variable range of the number of leaked persons Y. As shown in the figure, these variable ranges are determined so that their maximum region (cap region) increases in order. Considering that, according to the case database of FIG. 6, the values of Y range from 500 to 10,000 persons, the variable ranges Y1, Y2, and Y3 have different cap regions of 100 or more, 1,000 or more, and 10,000 or more persons, respectively. In addition, the variable ranges Y4 and Y5 (cap regions of 50,000 or more and 100,000 or more persons), which may be needed in future variable range updates, are included among the candidates. These variable ranges are added in anticipation of the scale of future leaks.

説明変数である漏洩人数Yは、変数域Y1〜Y5がそれぞれ適用されると、それぞれ異なる説明変数Y1〜Y5となる。この変数域を最適に設定することで、予測モデルの予測精度をより高くできる。例えば、事例の漏洩人数が千人以上と１万人以上とで異なる制裁金が判定されていた場合、変数域Y2よりもY3を選択したほうが、異なる制裁金を予測することができる。 When the variable ranges Y1 to Y5 are applied to the explanatory variable Y (the number of leaked persons), they yield the distinct explanatory variables Y1 to Y5. Setting this variable range optimally increases the prediction accuracy of the prediction model. For example, if past cases were assigned different sanctions depending on whether the number of leaked persons was 1,000 or more or 10,000 or more, selecting Y3 rather than Y2 makes it possible to predict the different sanctions.

図５に戻り、プロセッサ１１は、説明変数の変数域候補を決定する際に、所定の説明変数に複数の変数域を割り当てる。図７例では、説明変数Y（漏洩人数）に５種類の変数域Y1〜Y5が割り当てられている。 Returning to FIG. 5, the processor 11 allocates a plurality of variable ranges to predetermined explanatory variables when determining variable range candidates for the explanatory variables. In the example of FIG. 7, five types of variable ranges Y1 to Y5 are assigned to the explanatory variables Y (number of leaks).

次に、プロセッサ１１は、図示してない説明変数の定量化プログラムを実行して、説明変数の変数域を基準にして、事例の説明変数を定量化する（S13）。図７内の学習データマスタ３２には、事例１〜５の説明変数XとY1〜Y5とZそれぞれについて、事例のデータを変数域候補XとY1〜Y5とZを基準にして、定量化したデータが示される。例えば、事例１の漏洩人数が１０００人であるので、変数域Y1〜Y5の定量化データはすべて「２」である。一方、事例２の漏洩人数が１００００人であるので、変数域Y1〜Y5の定量化データは２，３，４，４，４となる。 Next, the processor 11 executes an explanatory variable quantification program (not shown) to quantify the explanatory variables of each case with reference to the variable ranges of the explanatory variables (S13). The learning data master 32 in FIG. 7 shows, for the explanatory variables X, Y1 to Y5, and Z of cases 1 to 5, the data obtained by quantifying the case data with reference to the variable range candidates X, Y1 to Y5, and Z. For example, since the number of leaked persons in case 1 is 1,000, the quantified data for the variable ranges Y1 to Y5 are all "2". In contrast, since the number of leaked persons in case 2 is 10,000, the quantified data for the variable ranges Y1 to Y5 are 2, 3, 4, 4, and 4.
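
The quantification step S13 can be pictured as a lookup of the region that contains a raw value. The following is a minimal sketch, not the quantification program of the embodiment: the function names, the candidate names Ya to Yc, and all region boundaries are hypothetical, because the actual boundaries are those of FIG. 7, which is not reproduced in this text.

```python
from bisect import bisect_right

def quantify(value, lower_bounds):
    """Return the 1-based region number that contains `value`.

    `lower_bounds` lists, in ascending order, the lower bound of region 2,
    region 3, and so on. The last bound opens the cap region, which has no
    upper limit.
    """
    return bisect_right(lower_bounds, value) + 1

# Hypothetical variable range candidates for the number of leaked persons Y.
# These boundaries are assumptions for illustration, not the values of FIG. 7.
RANGES = {
    "Ya": [100],               # regions: below 100, 100 or more (cap)
    "Yb": [100, 1000],         # cap region: 1000 or more
    "Yc": [100, 1000, 10000],  # cap region: 10000 or more
}

def quantify_all(value):
    """Quantify one raw value against every candidate variable range."""
    return {name: quantify(value, bounds) for name, bounds in RANGES.items()}

print(quantify_all(50))
print(quantify_all(2500))
print(quantify_all(20000))
```

Because every value above the last bound falls into the cap region, a coarse candidate such as Ya cannot distinguish 2,500 from 20,000 persons, while Yc can. This is the effect that the later variable range search exploits.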

図５に戻り、プロセッサ１１は、目的変数の定量化プログラムを実行して、目的変数SAのアナログの値を複数のレベル（目的変数の値の大きさに応じた複数のレベル）に置換して、新目的変数SAL（Sanction Level）を生成する（S14）。図７の例では、制裁金SAが１０万未満をレベルA、１０万以上をレベルBの制裁レベルSALに置き換えられている。後述する分析モデル（または予測モデル）は、学習データが少ないことを考慮して、分類型の分析モデルと回帰型の分析モデルを組合せたモデルにより、未知の事例について目的変数の値（制裁金）を予測する。新目的変数SALは、この分類型の分析モデルで使用される。 Returning to FIG. 5, the processor 11 executes the objective variable quantification program to replace the analog value of the objective variable SA with one of a plurality of levels (levels corresponding to the magnitude of the value of the objective variable), thereby generating a new objective variable SAL (Sanction Level) (S14). In the example of FIG. 7, a sanction SA of less than 100,000 is replaced with sanction level A, and a sanction SA of 100,000 or more with sanction level B. Because little learning data is available, the analysis model (or prediction model) described later predicts the value of the objective variable (the sanction) for an unknown case with a model that combines a classification type analysis model and a regression type analysis model. The new objective variable SAL is used in the classification type analysis model.
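
As a small illustration of step S14, the replacement of the sanction SA by the sanction level SAL can be sketched as a threshold test. The threshold of 100,000 is the one stated for the FIG. 7 example; the function name and the sample amounts are assumptions made for this sketch.

```python
# Threshold from the FIG. 7 example: below 100,000 is level A, otherwise level B.
LEVEL_B_THRESHOLD = 100_000

def sanction_level(sanction_amount):
    """Map a sanction amount SA to its sanction level SAL ("A" or "B")."""
    return "A" if sanction_amount < LEVEL_B_THRESHOLD else "B"

# Illustrative sanction amounts (not taken from the case data of FIG. 6).
for amount in (60_000, 99_999, 100_000, 250_000):
    print(amount, sanction_level(amount))
```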

以上の処理により、事例の内容から学習データの集合を有する学習データマスタが生成される。 By the above processing, a learning data master having a set of learning data is generated from the contents of the case.

［説明変数の最適な変数域の探索B］ [Search for the optimum variable range of explanatory variables B]
次に、プロセッサ１１は、変数域探索プログラム２１を実行して、変数域候補から最適な変数域を探索する。この変数域探索処理では、前述した分類型分析モデルに適した変数域と、回帰型分析モデルに適した変数域をそれぞれ探し出す。 Next, the processor 11 executes the variable area search program 21 to search for the optimum variable area from the variable area candidates. In this variable area search process, a variable area suitable for the above-mentioned classification type analysis model and a variable area suitable for the regression type analysis model are searched for, respectively.

図８及び図９は、変数探索プログラムの処理のフローチャートを示す図である。図８は、分類型分析モデルに適した変数域の探索処理を、図９は、回帰型分析モデルに適した変数域の探索処理を、それぞれ示す。分類型分析モデルと回帰型分析モデルそれぞれの変数域の探索処理について以下説明する。 FIGS. 8 and 9 are flowcharts of the processing of the variable search program. FIG. 8 shows the variable range search process suited to the classification type analysis model, and FIG. 9 shows the variable range search process suited to the regression type analysis model. The variable range search process for each of the two models is described below.

［分類型分析モデルの変数域の探索］ [Searching the variable range of the classification type analysis model]
図８において、プロセッサ１１は、変数域探索プログラム２１を実行し、複数の変数域候補（Y1〜Y5）を対応する説明変数（Y）に設定し、複数の説明変数（X,Y1〜Y5，Z）の集合の部分集合である部分変数を、望ましくは、全て有する部分変数リスト３３を生成する（S21）。 In FIG. 8, the processor 11 executes the variable area search program 21, sets the plurality of variable area candidates (Y1 to Y5) for the corresponding explanatory variable (Y), and generates a partial variable list 33 that contains, preferably, all partial variables, each being a subset of the set of explanatory variables (X, Y1 to Y5, Z) (S21).

図１０は、部分変数リストの例を示す図である。リスト内の部分変数SSV（SubSet Variable）は、複数の説明変数（X,Y1〜Y5，Z）の集合の部分集合である。例えば、部分変数SSV7は、７つの変数すべてを含む部分集合であり、部分変数SSV6-1は、説明変数Zを除く説明変数X,Y1〜Y5を含む部分集合である。他の部分変数も同様である。 FIG. 10 is a diagram showing an example of a partial variable list. The subset variable SSV (SubSet Variable) in the list is a subset of the set of multiple explanatory variables (X, Y1 to Y5, Z). For example, the subvariable SSV7 is a subset containing all seven variables, and the subvariable SSV6-1 is a subset containing the explanatory variables X, Y1 to Y5 excluding the explanatory variable Z. The same applies to other subvariables.
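
The partial variable list of step S21 can be sketched as an enumeration of subsets. The sketch below assumes that the list holds every non-empty subset of the seven explanatory variables and orders them from largest to smallest; the function name and that ordering are choices made for this illustration, and the numbering of FIG. 10 beyond SSV7 and SSV6-1 is not assumed.

```python
from itertools import combinations

VARIABLES = ["X", "Y1", "Y2", "Y3", "Y4", "Y5", "Z"]

def partial_variable_list(variables):
    """Return every non-empty subset of `variables`, largest subsets first."""
    subsets = []
    for size in range(len(variables), 0, -1):
        subsets.extend(combinations(variables, size))
    return subsets

ssv_list = partial_variable_list(VARIABLES)
print(len(ssv_list))  # 127, i.e. 2**7 - 1 non-empty subsets
print(ssv_list[0])    # the full set of seven variables, as in SSV7
print(ssv_list[1])    # X and Y1 to Y5 without Z, as in SSV6-1
```

The list grows exponentially with the number of variables, which is tolerable here only because the number of explanatory variables and variable range candidates is small.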

そこで、図８に示すとおり、プロセッサ１１は、部分変数リスト３３から一つの部分変数を抽出し（S22）、抽出した部分変数に対応する学習データを学習データマスタから抽出し、分類型分析モデルの予測精度を算出する（S23）。プロセッサ１１は、分類型分析モデルの予測精度の算出処理S23を、部分変数リスト内の全部分変数について繰り返す（S24）。 Then, as shown in FIG. 8, the processor 11 extracts one subvariable from the subvariable list 33 (S22), extracts the training data corresponding to the extracted subvariable from the training data master, and calculates the prediction accuracy of the classification type analysis model (S23). The processor 11 repeats the calculation process S23 of the prediction accuracy of the classification type analysis model for all the partial variables in the partial variable list (S24).

図１１は、分類型分析モデルの予測精度を算出する処理S23のフローチャートを示す図である。図１２は、分類型分析モデルの予測精度の算出を説明する図である。図１３は、分類型分析モデルによる未知の事例の予測方法を示す図である。図１４は、分類型分析モデルの予測精度の算出方法を示す図である。そして、図１５は、分類型分析モデルの最適な部分変数の決定方法を示す図である。これらの図を参照して、分類型分析モデルの予測精度を算出する処理を説明する。 FIG. 11 is a diagram showing a flowchart of the process S23 for calculating the prediction accuracy of the classification type analysis model. FIG. 12 is a diagram for explaining the calculation of the prediction accuracy of the classification type analysis model. FIG. 13 is a diagram showing a method of predicting an unknown case by a classification type analysis model. FIG. 14 is a diagram showing a method of calculating the prediction accuracy of the classification type analysis model. FIG. 15 is a diagram showing a method of determining the optimum subvariables of the classification type analysis model. The process of calculating the prediction accuracy of the classification type analysis model will be described with reference to these figures.

プロセッサ１１は、分類型分析モデルの予測精度の算出を、図１２に示す交差検証法により行う。以下、図１２を参照して、図１１の予測精度算出の処理について説明する。 The processor 11 calculates the prediction accuracy of the classification type analysis model by the cross-validation method shown in FIG. 12. The prediction accuracy calculation process of FIG. 11 is described below with reference to FIG. 12.

図１１に記載されるとおり、プロセッサ１１は、学習データマスタ３２内の一つの事例（例えば事例１）を評価対象の事例として選択し（S231）、評価対象の事例（事例１）を除いた残りの事例（事例２−５）の学習データによる分類型予測モデルで、評価対象の事例（事例１）の目的変数の制裁レベル（AまたはB）を予測する（S232）。 As shown in FIG. 11, the processor 11 selects one case (for example, case 1) in the training data master 32 as the case to be evaluated (S231), and the rest excluding the case to be evaluated (case 1). The sanction level (A or B) of the objective variable of the case to be evaluated (Case 1) is predicted by the classification type prediction model based on the training data of the case (Case 2-5) (S232).

更に、プロセッサは、評価対象の事例（事例１）の予測レベルと、その事例（事例１）の真のレベルとの一致・不一致により、予測の成功か失敗（成否）を判定する（S233）。 Further, the processor determines the success or failure (success or failure) of the prediction based on the match / mismatch between the prediction level of the case (case 1) to be evaluated and the true level of the case (case 1) (S233).

プロセッサは、学習データマスタ内の残りすべての事例（事例２〜５）についても、処理S232とS233を繰り返す（S234）。この結果、残りの評価対象の事例（事例２〜５）について、予測レベルと事例の真のレベルとの一致・不一致による成否が判定される。そして、最後に、プロセッサは、全評価対象の事例における成否の正解率（予測成功の割合）を、抽出した部分変数の予測精度とする（S235）。例えば、予測レベルと真値とが一致した回数を事例数で除算して、正解率が算出される。 The processor repeats the processes S232 and S233 for all the remaining cases (cases 2 to 5) in the training data master (S234). As a result, the success or failure of the remaining evaluation target cases (cases 2 to 5) is determined by the agreement / mismatch between the prediction level and the true level of the case. Finally, the processor uses the success / failure rate (ratio of successful predictions) in all the cases to be evaluated as the prediction accuracy of the extracted partial variables (S235). For example, the correct answer rate is calculated by dividing the number of times the prediction level and the true value match by the number of cases.

図１２において、プロセッサは、抽出した部分変数（例えば、図１０内のいずれかの部分変数SSV）に対応する変数値の値を、学習データマスタ３２から全事例１〜５について抽出し、抽出した部分変数の学習データを生成する。 In FIG. 12, the processor extracts, from the training data master 32 and for all cases 1 to 5, the values of the variables corresponding to the extracted partial variable (for example, any of the partial variables SSV in FIG. 10), and generates the training data for the extracted partial variable.

そして、プロセッサは、評価対象の事例（例えば事例１）を除いた残りの事例２〜５の学習データによる予測モデルにより、分類型の予測アルゴリズムで、評価対象の事例（事例１）の制裁レベルを予測する。 Then, using the classification type prediction algorithm and a prediction model based on the learning data of the remaining cases 2 to 5, that is, all cases except the case to be evaluated (for example, case 1), the processor predicts the sanction level of the case to be evaluated (case 1).

図１３を参照して分類型分析モデルによる未知の事例の予測方法を説明する。図１３には部分変数SSV7（変数x, Y1〜Y5, Z）の学習データが示される。部分変数SSV7は、全ての説明変数X, Y1〜Y5, Zを有するので、その学習データは、学習データマスタ３２と同じになる。そして、予測対象事例E1の制裁レベルが学習データに基づいて次の方法で予測される。 A method of predicting an unknown case with the classification type analysis model is described with reference to FIG. 13. FIG. 13 shows the training data of the partial variable SSV7 (variables X, Y1 to Y5, Z). Since the partial variable SSV7 has all the explanatory variables X, Y1 to Y5, and Z, its training data is the same as the training data master 32. The sanction level of the prediction target case E1 is predicted from this training data by the following method.

図１３の下半分には、部分変数の変数を座標軸とする座標空間が示され、その座標空間内に、学習データ内の事例１〜５がプロットされている。このような座標空間内に予測対象事例E1をプロットした場合、予測対象事例E1とL2ノルム距離L2_NDが最小の事例を事例１〜５から選択する。そして、選択した事例の制裁レベルを、予測対象事例E1の予測制裁レベルとする。 The lower half of FIG. 13 shows a coordinate space whose axes are the variables of the partial variable, with cases 1 to 5 of the training data plotted in it. When the prediction target case E1 is plotted in this coordinate space, the case with the smallest L2 norm distance L2_ND to E1 is selected from cases 1 to 5. The sanction level of the selected case is then taken as the predicted sanction level of the prediction target case E1.

L2ノルム距離は、２つの事例間の座標空間内の距離である。この座標空間内の距離L2_NDは、同じ座標軸の値の差分をそれぞれ二乗して加算し、加算値の平方根を算出することで求められる。したがって、プロセッサ１１は、分類型の予測モデルプログラム２３を実行し、予測対象事例E1と全事例１〜５とのノルム距離を算出し、最小距離の事例を事例１〜５から検出する。 The L2 norm distance is the distance in the coordinate space between the two cases. The distance L2_ND in this coordinate space is obtained by squaring the differences between the values of the same coordinate axes and adding them to calculate the square root of the added value. Therefore, the processor 11 executes the classification type prediction model program 23, calculates the norm distance between the prediction target case E1 and all the cases 1 to 5, and detects the case of the minimum distance from the cases 1 to 5.

図１３の例によれば、事例E1に最短距離の事例は事例４であるので、事例E1の予測制裁レベルは、事例４の真の予測レベルAと同じと判定される。 In the example of FIG. 13, the case nearest to case E1 is case 4, so the predicted sanction level of case E1 is determined to be the same as the true level A of case 4.
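
The prediction described above amounts to a nearest-neighbour lookup under the L2 norm distance. The following is a minimal sketch under stated assumptions: the case vectors are invented region numbers for a three-variable partial variable (X, Y3, Z) and do not reproduce FIG. 13, and the function names are hypothetical.

```python
import math

def l2_distance(a, b):
    """Square the per-axis differences, sum them, and take the square root."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def predict_level(target, cases):
    """Return the sanction level of the case nearest to `target`.

    `cases` is a list of (vector, level) pairs. If two cases are equally
    near, min() keeps the one that appears first in the list.
    """
    nearest = min(cases, key=lambda case: l2_distance(target, case[0]))
    return nearest[1]

# Hypothetical learning data: region numbers for (X, Y3, Z) and the level.
CASES = [
    ((1, 2, 1), "A"),
    ((3, 4, 3), "B"),
    ((2, 3, 2), "A"),
    ((1, 3, 2), "A"),
    ((3, 4, 4), "B"),
]

print(predict_level((2, 3, 1), CASES))
print(predict_level((3, 4, 4), CASES))
```

Because the explanatory variables are region numbers rather than raw values, the choice of variable range directly changes these distances, which is why the accuracy depends on the selected candidate.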

分類型予測モデルの予測方法が理解できたところで、続けて、分類型予測モデルの予測精度の算出方法S23を説明する。図１２に示される例の中で、評価対象の事例３を、残りの事例１，２，４，５に基づいて、制裁レベルを予測する方法を、図１４を参照して説明する。 With the prediction method of the classification type prediction model explained, the method S23 of calculating its prediction accuracy is described next. Taking the example shown in FIG. 12, the method of predicting the sanction level of the evaluated case 3 from the remaining cases 1, 2, 4, and 5 is described with reference to FIG. 14.

図１４の下半分には、図１３と同様に事例１〜５が座標空間内にプロットされている。そこで、プロセッサは、評価対象の事例３に対し最短のL2ノルム距離を持つ事例を、残りの事例１，２，４，５から検出する。図１４に示されるとおり、評価対象の事例３は、事例２が最短のL2ノルム距離L2_ND_2である。したがって、プロセッサは、評価対象の事例３の予測制裁レベルを、検出した事例２の制裁レベルBと判定する（S232）。そして、プロセッサは、評価対象の事例３の予測制裁レベルBと事例３の真の制裁レベルAとを比較し、不一致であることから、予測制裁レベルは不正解と判定する（S233）。 In the lower half of FIG. 14, cases 1 to 5 are plotted in the coordinate space as in FIG. 13. The processor finds, among the remaining cases 1, 2, 4, and 5, the case with the shortest L2 norm distance to the evaluated case 3. As shown in FIG. 14, case 2 is nearest to case 3, at the L2 norm distance L2_ND_2. The processor therefore determines the predicted sanction level of case 3 to be sanction level B of the detected case 2 (S232). The processor then compares the predicted sanction level B of case 3 with the true sanction level A of case 3 and, because they do not match, judges the prediction to be incorrect (S233).

図１２に示すとおり、プロセッサは、全事例１〜５それぞれについて、残りの事例に基づく分類型予測モデルによる制裁レベルの予測を行い、それぞれの予測レベルが真の制裁レベルと一致するか否か、つまり予測が正解か否か（成否）を判定する。図１２の右下には結果検証の表が示され、それによれば、事例１，２，４，５で正解となり、事例３で不正解となっている。その結果、プロセッサは、図８のS22で抽出した部分変数の分類型予測モデルの予測精度を、結果検証の表の正解率（０．８）と求める（S235）。 As shown in FIG. 12, for each of cases 1 to 5 the processor predicts the sanction level with the classification type prediction model based on the remaining cases and determines whether the predicted level matches the true sanction level, that is, whether the prediction is correct. The result verification table at the lower right of FIG. 12 shows correct predictions for cases 1, 2, 4, and 5 and an incorrect prediction for case 3. The processor therefore takes the correct answer rate of the result verification table, 0.8 (four correct predictions out of five cases), as the prediction accuracy of the classification type prediction model for the partial variable extracted in S22 of FIG. 8 (S235).
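
The cross-validation of steps S231 to S235 is a leave-one-out scheme: each case is held out in turn, predicted from the others, and the share of correct predictions is the accuracy. The sketch below assumes a nearest-neighbour classifier under the L2 norm distance; the two-dimensional case vectors are invented so that exactly one of five cases is misclassified, and they do not reproduce the data of FIG. 12.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def nearest_level(target, cases):
    """Sanction level of the case nearest to `target`."""
    return min(cases, key=lambda case: l2_distance(target, case[0]))[1]

def loocv_accuracy(cases):
    """Leave-one-out accuracy: number of matches divided by number of cases."""
    hits = 0
    for i, (vector, true_level) in enumerate(cases):
        rest = cases[:i] + cases[i + 1:]  # all cases except the evaluated one
        if nearest_level(vector, rest) == true_level:
            hits += 1
    return hits / len(cases)

# Hypothetical quantified cases 1 to 5 as (vector, true level).
CASES = [
    ((1, 1), "A"),
    ((4, 4), "B"),
    ((3, 3), "A"),  # nearest neighbour is the level B case (4, 4)
    ((1, 2), "A"),
    ((5, 4), "B"),
]

print(loocv_accuracy(CASES))  # 0.8
```

Running the same function once per partial variable, on the columns that the partial variable selects, yields the per-subset accuracies that are sorted in the next step.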

図８に戻り、プロセッサは、全部分変数SSVについて（S24）、それに対応する分類型予測モデルの予測精度を算出する（S23）。図１５には、全部分変数SSVに対する予測精度（正解率）でソートした結果が示される。プロセッサは、予測精度の高い上位N個（例えば上位５個）の部分変数SSVから最適な変数域を決定する（S25）。図１５の例では、予測精度の高い順にソートされた上位N個（上位５個）の部分変数SSVに最も多く含まれる変数域Y3を、最適な変数域と決定する。そして、分類型の予測モデルの部分変数をX，Y3,Zと決定する（図１５のS25）。図１５には、最適な変数域Y3を変数Yとする部分変数SSV3_#（説明変数X，Y3,Zを有する部分変数）が決定されている。 Returning to FIG. 8, the processor calculates the prediction accuracy of the corresponding classification type prediction model (S23) for every partial variable SSV (S24). FIG. 15 shows all partial variables SSV sorted by prediction accuracy (correct answer rate). The processor determines the optimum variable range from the top N (for example, top five) partial variables SSV with the highest prediction accuracy (S25). In the example of FIG. 15, the variable range Y3, which appears most often among the top N (top five) partial variables SSV sorted in descending order of prediction accuracy, is determined to be the optimum variable range. The partial variable of the classification type prediction model is then determined to be X, Y3, Z (S25 in FIG. 15). In FIG. 15, the partial variable SSV3_# (the partial variable having the explanatory variables X, Y3, and Z), in which the optimum variable range Y3 serves as the variable Y, is determined.
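
The selection in step S25 can be sketched as a vote among the best-scoring partial variables. In the sketch below the accuracies and subsets are invented and do not reproduce FIG. 15, the function name is hypothetical, and the handling of ties or of a top list without any candidate is an assumption, since the text does not specify it.

```python
from collections import Counter

def best_variable_range(scored_subsets, candidates, top_n=5):
    """Pick the candidate variable range seen most often in the top-N subsets.

    `scored_subsets` is a list of (subset, accuracy) pairs. Returns None when
    no candidate occurs in the top-N subsets (an assumption of this sketch).
    """
    top = sorted(scored_subsets, key=lambda item: item[1], reverse=True)[:top_n]
    counts = Counter(
        name for subset, _accuracy in top for name in subset if name in candidates
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical accuracies per partial variable.
SCORED = [
    (("X", "Y3", "Z"), 0.90),
    (("X", "Y3"), 0.85),
    (("Y3", "Z"), 0.80),
    (("X", "Y2", "Z"), 0.80),
    (("Y1", "Y3"), 0.75),
    (("X", "Y1"), 0.60),
]
Y_CANDIDATES = {"Y1", "Y2", "Y3", "Y4", "Y5"}

print(best_variable_range(SCORED, Y_CANDIDATES))  # Y3
```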

さらに、プロセッサは、決定した分類型の部分変数に基づき、学習データマスタから部分変数のデータを抽出し、分類型の学習データを生成する（S26）。 Further, the processor extracts the partial variable data from the training data master based on the determined classification type subvariable, and generates the classification type training data (S26).

［回帰型分析モデルの変数域の探索］ [Search for variable range of regression analysis model]
次に、回帰型分析モデルの変数域の探索について説明する。図９が、回帰型分析モデルの変数域の探索の処理のフローチャート図である。また、図１６は、回帰型分析モデルの予測精度の算出S34の処理のフローチャート図である。図１７は、回帰型分析モデルの予測精度の算出を説明する図である。図１８は、回帰型分析モデルの生成方法を示す図である。図１９は、回帰型分析モデルの部分変数を予測精度でソートした例を示す図である。そして、図２０は、回帰型分析モデルの最適な変数域の決定方法を示す図である。これらの図を参照して、回帰型分析モデルの予測精度を算出する処理を説明する。 Next, the search for the variable range of the regression analysis model will be described. FIG. 9 is a flowchart of the process of searching the variable region of the regression analysis model. Further, FIG. 16 is a flowchart of the process of calculating the prediction accuracy of the regression analysis model S34. FIG. 17 is a diagram for explaining the calculation of the prediction accuracy of the regression analysis model. FIG. 18 is a diagram showing a method of generating a regression analysis model. FIG. 19 is a diagram showing an example in which the partial variables of the regression analysis model are sorted by the prediction accuracy. FIG. 20 is a diagram showing a method of determining the optimum variable range of the regression analysis model. The process of calculating the prediction accuracy of the regression analysis model will be described with reference to these figures.

図９において、プロセッサ１１は、変数域探索プログラム２１を実行し、目的変数のレベルを選択する（S31）。回帰型分析モデルの変数域の探索は、目的変数のレベル毎に行われる。目的変数のレベルはAとBがあり、ここではレベルAを選択したとする。 In FIG. 9, the processor 11 executes the variable area search program 21 and selects the level of the objective variable (S31). The search for the variable range of the regression analysis model is performed for each level of the objective variable. There are A and B levels of the objective variable, and it is assumed that level A is selected here.

次に、プロセッサは、選択したレベルAに属する事例の学習データを学習データマスタから抽出する（S32）。ここで抽出したレベルAに属する事例の学習データは、レベルAに属する学習データマスタとなる。図１７に示すとおり、学習データマスタ内のレベルAに属する事例１，３，４が評価対象として抽出される。 Next, the processor extracts the training data of the case belonging to the selected level A from the training data master (S32). The learning data of the case belonging to level A extracted here becomes the learning data master belonging to level A. As shown in FIG. 17, cases 1, 3 and 4 belonging to level A in the learning data master are extracted as evaluation targets.

次に、プロセッサは、図８のS21で生成した部分変数リスト３３（図１０）内の全部分変数について、レベルAに属する事例１，３，４に基づき、回帰型分析モデルの生成と、その予測精度を算出する（S33-S35）。即ち、プロセッサは、部分変数リスト３３（図１０）から一つの部分変数を抽出し（S33）、抽出した部分変数に対応する学習データを前述のレベルAの学習データマスタから抽出し、回帰型分析モデルを生成し、その予測精度を算出する（S34）。プロセッサ１１は、回帰型分析モデルの予測精度の算出処理S34を、部分変数リスト内の全部分変数について繰り返す（S35）。 Next, for every partial variable in the partial variable list 33 (FIG. 10) generated in S21 of FIG. 8, the processor generates a regression analysis model based on cases 1, 3, and 4, which belong to level A, and calculates its prediction accuracy (S33-S35). That is, the processor extracts one partial variable from the partial variable list 33 (FIG. 10) (S33), extracts the training data corresponding to the extracted partial variable from the level A training data master described above, generates a regression analysis model, and calculates its prediction accuracy (S34). The processor 11 repeats the calculation process S34 of the prediction accuracy of the regression analysis model for all the partial variables in the partial variable list (S35).

プロセッサ１１は、回帰型分析モデルの予測精度の算出S34を、図１７に示す交差検証法により行う。以下、図１７を参照して、図１６の予測精度算出の処理について説明する。 The processor 11 performs the calculation S34 of the prediction accuracy of the regression analysis model by the cross-validation method shown in FIG. 17. The prediction accuracy calculation process of FIG. 16 is described below with reference to FIG. 17.

図１６及び図１７に記載されるとおり、プロセッサ１１は、レベルAの学習データマスタ内の一つの事例（例えば事例１）を評価対象の事例として選択する（S341）。さらに、プロセッサは、評価対象の事例（事例１）を除いた残りの事例（事例３，４）の学習データにより、回帰型予測モデルPmodel_1を生成する（S342）。さらに、プロセッサは、回帰型予測モデルPmodel_1により評価対象の事例（事例１）の目的変数の予測値、制裁値を予測する（S342）。 As shown in FIGS. 16 and 17, the processor 11 selects one case (for example, case 1) in the level A learning data master as the case to be evaluated (S341). The processor then generates a regression prediction model Pmodel_1 from the training data of the remaining cases (cases 3 and 4), that is, all level A cases except the case to be evaluated (S342). The processor further uses the regression prediction model Pmodel_1 to predict the value of the objective variable, that is, the sanction value, of the case to be evaluated (case 1) (S342).

そして、プロセッサは、評価対象の事例１の予測値40000と真値60000の絶対比を求める（S343）。両値の絶対比は、小さい値を大きい値で除算して求められる。上記処理S341、S342、S343が、評価対象の全事例（事例１，３，４）について繰り返される（S344）。最後に、プロセッサは、評価対象の全事例の絶対比の平均を、部分変数の予測精度として出力する（S345）。図１７の例では、レベルAの予測精度（平均絶対比）が０．７１となる。 Then, the processor obtains the absolute ratio of the predicted value of 40000 and the true value of 60000 in Case 1 to be evaluated (S343). The absolute ratio of both values is obtained by dividing the smaller value by the larger value. The above processes S341, S342, and S343 are repeated for all the cases to be evaluated (cases 1, 3, and 4) (S344). Finally, the processor outputs the average of the absolute ratios of all the cases to be evaluated as the prediction accuracy of the partial variables (S345). In the example of FIG. 17, the prediction accuracy (average absolute ratio) of level A is 0.71.
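
The absolute ratio and its average can be checked with a few lines of arithmetic. The sketch below uses the figures quoted for level A in FIG. 17 (predicted value 40000 and true value 60000 for case 1, and the reported ratios 0.67, 0.7, and 0.75); the function name, the restriction to non-negative values, and the treatment of two zero values are assumptions of this sketch.

```python
from statistics import mean

def absolute_ratio(predicted, true):
    """Divide the smaller of two non-negative values by the larger one."""
    smaller, larger = sorted((predicted, true))
    if larger == 0:
        return 1.0  # assumption: two zero values count as a perfect match
    return smaller / larger

# Case 1 of level A: predicted value 40000, true value 60000.
ratio_case1 = absolute_ratio(40000, 60000)
print(round(ratio_case1, 2))  # 0.67

# Level A accuracy: mean of the absolute ratios reported for cases 1, 3 and 4.
level_a_accuracy = mean([0.67, 0.7, 0.75])
print(round(level_a_accuracy, 2))  # 0.71
```

The measure is symmetric in the predicted and true values and lies between 0 and 1, so an overestimate and an underestimate by the same factor are penalized equally.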

図１７のレベルAでは、評価対象の事例１について、残りの事例３，４の学習データから回帰型予測モデルPmodel_1が生成され、予測値40000が算出され、事例１の予測値と真の値の絶対比0.67が算出されている。評価対象の事例３，４も同様にして回帰型予測モデルPmodel_3, Pmodel_4が生成され、予測値100000、12000が算出され、絶対比0.7,0.75が算出されている。 At level A in FIG. 17, for the evaluated case 1, the regression prediction model Pmodel_1 is generated from the training data of the remaining cases 3 and 4, the predicted value 40000 is calculated, and the absolute ratio 0.67 between the predicted and true values of case 1 is obtained. For the evaluated cases 3 and 4, the regression prediction models Pmodel_3 and Pmodel_4 are generated in the same way, the predicted values 100000 and 12000 are calculated, and the absolute ratios 0.7 and 0.75 are obtained.

プロセッサは、残りのレベルBの事例２，５についても、予測モデルPmodel_2, Pmodel_5の生成と予測値の算出、及び上記の予測精度（平均絶対比）の算出（S342,S343）を、実行する。その結果、図１７の例では、レベルBの予測精度（平均絶対比）が０．６７となる。 The processor also generates the prediction models Pmodel_2 and Pmodel_5, calculates the prediction value, and calculates the prediction accuracy (average absolute ratio) (S342, S343) for the remaining Level B cases 2 and 5. As a result, in the example of FIG. 17, the prediction accuracy (average absolute ratio) of level B is 0.67.

図１８は、回帰型の分析モデルの生成方法を示す。図１８には、選択された部分変数SSV7の学習データが示される。この学習データのうち、目的変数である制裁金のレベルAの事例１，３，４に基づいて、二乗和誤差を最小化する最尤推定である最小二乗法によりレベルAの７次元の重回帰型分析モデルRG_ANL_Aが生成される。この重回帰型分析モデルRG_ANL_Aは、図１７の回帰型予測モデルPmodelに対応する。同様に、学習データのうち、目的変数である制裁金のレベルBの事例２，５に基づいて、最小二乗法によりレベルBの７次元の重回帰型分析モデルRG_ANL_Bが生成される。図１８の座標空間では、重回帰型分析モデルの説明変数の軸が簡略化して示される。 FIG. 18 shows a method of generating the regression type analysis model. FIG. 18 shows the training data of the selected partial variable SSV7. From cases 1, 3, and 4 of this training data, which belong to level A of the sanction (the objective variable), the level A seven-dimensional multiple regression analysis model RG_ANL_A is generated by the least squares method, a maximum likelihood estimation that minimizes the sum of squared errors. This multiple regression analysis model RG_ANL_A corresponds to the regression prediction model Pmodel of FIG. 17. Similarly, from cases 2 and 5, which belong to level B, the level B seven-dimensional multiple regression analysis model RG_ANL_B is generated by the least squares method. In the coordinate space of FIG. 18, the axes of the explanatory variables of the multiple regression analysis models are shown in simplified form.
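
The idea of fitting one least squares model per sanction level can be shown in reduced form. The sketch below is a deliberate simplification: it fits a straight line over a single explanatory variable instead of the seven-dimensional multiple regression of the embodiment, and all case values and function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("explanatory values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_per_level(cases):
    """Group (x, sanction, level) cases by level and fit one line per level."""
    by_level = {}
    for x, sanction, level in cases:
        xs, ys = by_level.setdefault(level, ([], []))
        xs.append(x)
        ys.append(sanction)
    return {level: fit_line(xs, ys) for level, (xs, ys) in by_level.items()}

# Hypothetical cases: (region number of one variable, sanction SA, level SAL).
CASES = [
    (1, 20_000, "A"),
    (2, 40_000, "A"),
    (3, 60_000, "A"),
    (3, 200_000, "B"),
    (4, 300_000, "B"),
]

for level, (slope, intercept) in sorted(fit_per_level(CASES).items()):
    print(level, slope, intercept)
```

Fitting the levels separately keeps each line close to cases of similar magnitude, whereas a single line through all five hypothetical cases would have to bridge the jump between the two groups.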

図１７の回帰型の交差検証法では、評価対象の事例１について、残りの事例３，４に基づいて回帰型分析モデルが生成されている。この点は、図１８の例の事例１，３，４に基づいてレベルAの回帰型分析モデルが生成されていることと異なる。 In the regression-type cross-validation method of FIG. 17, a regression-type analysis model is generated for the evaluation target case 1 based on the remaining cases 3 and 4. This point is different from the fact that the level A regression analysis model is generated based on the cases 1, 3 and 4 of the example of FIG.

図１９は、部分変数を示し、各部分変数に対する回帰型分析モデルの予測式の例と、図９，図１６により算出した各部分変数に対する予測精度を示す。この部分変数リストを予測精度でソートすると、図２０のソート後の部分変数リストになる。 FIG. 19 shows the subvariables, and shows an example of the prediction formula of the regression analysis model for each subvariable and the prediction accuracy for each subvariable calculated by FIGS. 9 and 16. When this subvariable list is sorted with predictive accuracy, the sorted subvariable list shown in FIG. 20 is obtained.

図９に戻り、プロセッサは、図２０のソート後の部分変数リスト、つまり予測精度が高い上位N個の部分変数の変数Yの変数域Y1〜Y5から、最適な変数域を決定する（S36）。図２０の例では、例えば、上位５個の部分変数SSV4-#〜SSC6-4の変数Yの変数域Y1〜Y5のうち、最も多く出現した変数域Y3が最適な変数域と決定されている。その結果、プロセッサは、レベルAの重回帰型分析モデルの部分変数36_Aを、SSV3-#（変数X,Y3,Z）と決定する。レベルBの重回帰型分析モデルの部分変数36_Bも同様の方法で決定する。 Returning to FIG. 9, the processor determines the optimum variable range from the sorted partial variable list of FIG. 20, that is, from the variable ranges Y1 to Y5 of the variable Y in the top N partial variables with the highest prediction accuracy (S36). In the example of FIG. 20, among the variable ranges Y1 to Y5 of the variable Y in the top five partial variables SSV4-# to SSC6-4, the variable range Y3, which appears most often, is determined to be the optimum variable range. As a result, the processor determines the partial variable 36_A of the level A multiple regression analysis model to be SSV3-# (variables X, Y3, Z). The partial variable 36_B of the level B multiple regression analysis model is determined in the same way.

さらに、レベルA、Bの変数Yの変数域が決定し、それぞれの部分変数が決定したことに伴い、プロセッサは、決定した回帰型の部分変数に基づき、学習データマスタ３２から、レベルAとBそれぞれの学習データ35_A, 35_Bを抽出する（S37）。図９の処理は、全レベルについて繰り返される（S38）。 Further, the variable range of the variable Y of the levels A and B is determined, and as each subvariable is determined, the processor is based on the determined regression type subvariable, and the level A and B are determined from the training data master 32. The respective training data 35_A and 35_B are extracted (S37). The process of FIG. 9 is repeated for all levels (S38).

以上で、説明変数Yの変数域候補Y1〜Y5から最適な変数域が決定される。最適な変数域は、分類型分析モデルと各レベル（A,B）の回帰型分析モデルそれぞれについて決定される。この変数域を最適化することで、最適な変数域を有する部分変数に対応する予測モデルの予測精度を高くすることができる。 With the above, the optimum variable range is determined from the variable range candidates Y1 to Y5 of the explanatory variable Y. The optimum variable range is determined separately for the classification type analysis model and for the regression type analysis model of each level (A, B). Optimizing the variable range in this way increases the prediction accuracy of the prediction model that corresponds to the partial variable having the optimum variable range.

［予測モデルの生成及び予測対象の事例について予測モデルで予測C,D］ [Generation of prediction model and prediction of cases to be predicted by prediction model C, D]
次に、図３に示すとおり、プロセッサは、予測モデル生成プログラム２２、予測モデルプログラム２３を実行して、予測モデルの生成及び予測対象事例について予測モデルによる目的変数の予測を行う。 Next, as shown in FIG. 3, the processor executes the prediction model generation program 22 and the prediction model program 23 to generate the prediction model and predict the objective variable by the prediction model for the prediction target case.

図２１は、予測モデルの生成及び予測対象の事例について予測モデルで予測する処理C,Dのフローチャート図である。プロセッサは、予測モデル生成プログラム２２を実行して、分類型の学習データ２６から分類型予測モデルを生成する（S41）。分類型予測モデルの生成は、図１３で説明したとおり、分類型の学習データ２６を取得することで完了する。また、プロセッサは、予測モデル生成プログラム２２を実行して、回帰型の学習データ35_A, 35_BそれぞれからレベルAの回帰型予測モデルと、レベルBの回帰型予測モデルを生成する（S42, S43）。回帰型予測モデルの生成は、図１８で説明したとおりである。 FIG. 21 is a flowchart of the processes C and D, which generate the prediction models and predict a prediction target case with them. The processor executes the prediction model generation program 22 to generate a classification type prediction model from the classification type learning data 26 (S41). As described with reference to FIG. 13, generating the classification type prediction model is complete once the classification type learning data 26 has been acquired. The processor also executes the prediction model generation program 22 to generate a level A regression type prediction model and a level B regression type prediction model from the regression type training data 35_A and 35_B, respectively (S42, S43). The regression type prediction models are generated as described with reference to FIG. 18.

This completes the generation of three prediction models based on past cases: the classification-type prediction model, the level A regression-type prediction model, and the level B regression-type prediction model. The processor uses these prediction models to predict the sanction amount, which is the objective variable, for the prediction target case (D).

As shown in FIG. 21, the processor receives the learning data of the prediction target case (S44), applies it to the classification-type prediction model, and predicts the sanction level (level A or B) of the objective variable (S45). The prediction method of the classification-type prediction model was described with reference to FIG. 13.
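The level prediction of S45 can be sketched as a nearest-neighbour lookup, following the claim language in which the level of the classification-type learning data with the shortest norm distance is adopted. This is a minimal sketch under assumptions: the Euclidean norm, the function name `predict_level`, and the sample data are illustrative and are not taken from FIG. 13.

```python
import math

# Hypothetical classification-type learning data: encoded explanatory
# variables (region identification numbers) paired with a sanction level.
CLASSIFICATION_DATA = [
    ((1, 1, 2), "A"),
    ((2, 1, 1), "A"),
    ((3, 4, 2), "B"),
    ((4, 3, 3), "B"),
]


def predict_level(case, data=CLASSIFICATION_DATA):
    """Return the level of the learning datum closest to `case`.

    Assumption: the norm distance is Euclidean; the source only says
    "norm distance".
    """
    nearest = min(data, key=lambda item: math.dist(item[0], case))
    return nearest[1]
```

Calling `predict_level((3, 3, 2))` compares the case with every stored datum and returns the level attached to the closest one.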

Next, the processor selects the regression-type prediction model corresponding to the predicted sanction level (S46). The processor then applies the learning data of the prediction target case to the selected regression-type prediction model and predicts the value of the sanction amount, which is the objective variable (S47). To do so, it inputs the variable values of the learning data of the prediction target case into the multiple regression line RG_ANL_A or RG_ANL_B (FIG. 18) of the regression-type prediction model and calculates the sanction amount SA, the objective variable.
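Steps S46 and S47 amount to choosing a regression line by level and evaluating it. The following is a minimal sketch, assuming each line is stored as an intercept plus one coefficient per explanatory variable. The dictionary `REGRESSION_LINES`, its numbers, and the function name `predict_sanction` are hypothetical placeholders, not values of RG_ANL_A or RG_ANL_B from FIG. 18.

```python
# Hypothetical regression lines per sanction level: (intercept, coefficients).
# The numbers are placeholders, not values from FIG. 18.
REGRESSION_LINES = {
    "A": (10.0, (2.0, 5.0)),
    "B": (100.0, (20.0, 50.0)),
}


def predict_sanction(level, variables, lines=REGRESSION_LINES):
    """Evaluate SA = b0 + sum(b_i * x_i) on the line chosen by `level`."""
    intercept, coefficients = lines[level]      # S46: select the model
    if len(coefficients) != len(variables):
        raise ValueError("variable count does not match the regression line")
    # S47: apply the encoded variables to the selected regression line.
    return intercept + sum(b * x for b, x in zip(coefficients, variables))
```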

In the present embodiment, the number of existing cases is small. The classification-type prediction model (classification-type analysis model) therefore first predicts the level of the sanction amount (level A or B), and the level A or level B regression-type prediction model, generated from the learning data of the existing cases of the respective level, then predicts the value of the objective variable of a case whose objective variable is unknown. As shown in FIG. 18, the regression-type prediction model for each level is generated by, for example, the least squares method from the learning data of the plurality of cases at that level. In other words, the sanction level is predicted by a classification-type prediction model, which achieves relatively high prediction accuracy even with few cases, and the sanction value of the prediction target case is then predicted by the regression-type prediction model generated for that level. Because each regression-type prediction model is generated from the learning data of cases grouped by sanction level, its prediction accuracy is high, and the sanction value can therefore be predicted more accurately. Combining a classification-type prediction model with regression-type prediction models in this way raises the accuracy with which the sanction value of an unknown case is predicted.
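Generating one regression line per level by least squares can be sketched as follows. This is an illustration under assumptions: the normal equations are solved by Gauss-Jordan elimination in plain Python, and the function names `fit_least_squares` and `fit_per_level` and the tuple layout of `cases` are hypothetical. The source states only that a method such as least squares is used.

```python
def fit_least_squares(rows, targets):
    """Fit y = b0 + b1*x1 + ... by solving (X^T X) b = X^T y."""
    design = [(1.0, *row) for row in rows]      # prepend the intercept column
    n = len(design[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_per_level(cases):
    """Group cases by level and fit one regression line per level.

    `cases` is a list of (level, variables, sanction_amount) tuples.
    """
    grouped = {}
    for level, variables, amount in cases:
        xs, ys = grouped.setdefault(level, ([], []))
        xs.append(variables)
        ys.append(amount)
    return {lvl: fit_least_squares(xs, ys) for lvl, (xs, ys) in grouped.items()}
```

Each returned list holds the intercept followed by one coefficient per explanatory variable, so a level with very few or collinear cases raises an error instead of returning an unreliable line.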

[Updating the variable range of an explanatory variable (F)]
As shown in FIG. 3, in the present embodiment, when the prediction accuracy for predicted cases falls (YES in E), the processor updates the variable range of a predetermined explanatory variable in the classification-type analysis model and in the regression-type analysis model used for the prediction, thereby improving the prediction accuracy of each analysis model (F). A fall in prediction accuracy can be detected by comparing the results that the prediction model predicted for cases with the sanction results of sanction cases newly announced by the government authorities.
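The check at E can be sketched as a comparison between predicted and announced sanction amounts. The source does not specify the error measure or a threshold, so the mean relative error, the 20% `tolerance`, and the function name `accuracy_has_degraded` below are all assumptions made for illustration.

```python
def accuracy_has_degraded(predicted, announced, tolerance=0.2):
    """Return True when the mean relative error exceeds `tolerance`.

    `predicted` and `announced` map a case identifier to a sanction amount.
    Only cases present in both mappings, with a non-zero announced amount,
    are compared.
    """
    shared = [
        case for case in announced
        if case in predicted and announced[case] != 0
    ]
    if not shared:
        return False  # nothing to compare yet
    errors = [
        abs(predicted[case] - announced[case]) / abs(announced[case])
        for case in shared
    ]
    return sum(errors) / len(errors) > tolerance
```

A result of `True` would correspond to the YES branch of E and trigger the variable range update F.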

In the variable range update F, the optimum variable range search process B described with reference to FIGS. 3 and 8 may be executed on a learning data master obtained by adding the data of the new cases to the existing learning data master.

As another variable range update method F, for each of the classification-type analysis model and the regression-type analysis model, the prediction accuracy of an analysis model candidate whose partial variable has a new variable range candidate may be compared with the prediction accuracy of the existing analysis model, and the existing variable range may be updated to the variable range candidate when the candidate's prediction accuracy is higher. The prediction accuracy in this case is calculated from the learning data master in which the new cases are added to the existing cases.

Furthermore, searching for the variable range with the highest prediction accuracy between two variable range candidates that both have high prediction accuracy allows that variable range to be found with little effort. For example, the processor compares the prediction accuracies of a first and a second analysis model, which have two (first and second) variable range candidates with different cap regions, and of a third analysis model, which has the composite variable range of the first and second variable range candidates, and determines on which side between the two candidates the variable range with high prediction accuracy lies. It repeats this comparison and determination.

For example, in each search phase, the processor determines, from the ranking of prediction accuracy among these three analysis models, the first and second variable ranges that delimit the search range of the next search phase, and repeats the same comparison and determination to drill down to the variable range with the highest prediction accuracy. As described later, the third analysis model having the composite variable range of the first and second variable range candidates means that its partial variable has both the first variable range candidate Y_old and the second variable range candidate Y_new.

The most typical case in which prediction accuracy falls is one in which the variable Y, the number of people affected, of a new sanction case is far larger than in any existing case. When the number of people reaches a scale that the existing cases did not anticipate, the prediction accuracy of the prediction model may fall with the variable range determined in the initialization process (one of Y1 to Y5). In such a case, the processor detects the optimum variable range of the variable Y from a learning data master in which the new cases published after the prediction model was created are added to the existing cases, and updates the variable range of the prediction model.

FIG. 22 is a flowchart of the process F for updating the variable range of an explanatory variable. The variable range update described with reference to FIG. 22 and subsequent figures is an example applied to the regression-type analysis model, but it can also be applied to the classification-type analysis model.

The variable range update process F of FIG. 22 is repeated for all levels of the objective variable (sanction amount). That is, the variable range of the regression-type analysis model corresponding to each level of the objective variable is updated.

First, the processor executes the variable range update program 24, selects a level of the objective variable (sanction amount), and repeats the processing of S51 to S61 for all levels (S51, S62).

After selecting a level, the processor selects the variable ranges between which it will search for a variable range with higher prediction accuracy (S52). Here, for a plurality of variable range candidates of the variable to be updated among the partial variables of the prediction model, the processor calculates the prediction accuracy of the analysis model in the same way as in the optimum variable range search B described above, and selects the first and second variable range candidates that are predicted to have the variable range with the highest prediction accuracy between them.

FIG. 23 shows an example of the list of variable ranges to be searched and the list of partial variables to be searched. In FIG. 23, the variable range list includes a plurality of variable ranges to be searched, Y_UP_1 to Y_UP_5. Among them, the variable range Y_UP_1 has a maximum cap region of "10,000 people or more", the same as the variable range of the existing analysis model. That is, on the presumption that the sanction amount of the new case was high because the maximum cap region of the variable Y, "number of people affected by the leak", has become larger than the existing "10,000 people or more", the variable range candidates Y_UP_2 to Y_UP_5 are selected, in which the maximum cap region "10,000 people or more" of the variable range Y3 of the existing analysis model (see FIGS. 20 and 7) is changed to "50,000 people or more", "100,000 people or more", "200,000 people or more", and "500,000 people or more", respectively.

FIG. 23 also shows, as search targets, the partial variables SSV_UP_1 to SSV_UP_5 whose variable Y has the variable ranges Y_UP_1 to Y_UP_5 described above. In the process S52, the prediction accuracy of the analysis model of each of the partial variables SSV_UP_1 to SSV_UP_5 is calculated from the learning data master 32_UP, to which the new cases have been added, by the same regression-type cross-validation method as in the optimum variable range search process B.

FIG. 24 shows an example of the prediction accuracy of the analysis models of the partial variables SSV_UP_1 to SSV_UP_5 calculated in the process S52. According to this example, the ranking of prediction accuracy is as follows.
SSV_UP_3 > SSV_UP_1 > SSV_UP_2 > SSV_UP_4 > SSV_UP_5
Y_UP_3 > Y_UP_1 > Y_UP_2 > Y_UP_4 > Y_UP_5
100,000 > 50,000 > 10,000 > 200,000 > 500,000

In this case, the variable range with the highest prediction accuracy can be predicted to lie between 100,000 and 200,000. The possibility that it lies between 50,000 and 100,000 cannot be ruled out, but it is assumed here that the above prediction could be made from, for example, the rate of increase in prediction accuracy between variable range candidates. The processor then selects the variable ranges Y_UP_3 and Y_UP_4 as the variable ranges to search (S52) and, by the bisection method described below, searches for the variable range with the maximum prediction accuracy between the first variable range Y_UP_3 and the second variable range Y_UP_4. This selection process S52 corresponds to a coarse search of the variable range.

Returning to FIG. 22, the processor generates the partial variable list for the first-generation search from the selected first and second variable ranges Y_UP_3 and Y_UP_4 (S53). The fine search of the variable range starts here.

FIG. 25 shows the variable range lists and partial variable lists in the first-generation and second-generation searches. In the first-generation search, the processor calculates the prediction accuracy of each of the partial variable SS_I_old having the first variable range Y_I_old, the partial variable SS_I_new having the second variable range Y_I_new, and the partial variable SS_I_mid having both the first and second variable ranges (the partial variable of the composite variable range). From the ordering of the three prediction accuracies, it determines whether the highest prediction accuracy lies on the first variable range side of the midpoint between the first and second variable ranges, on the second variable range side of the midpoint, or at the intermediate variable range.

In the subsequent second-generation search, the same procedure as in the first-generation search is applied to either the first variable range side or the second variable range side of the midpoint between the first and second variable ranges. The search region is thus narrowed step by step by the bisection method, and the variable range with the highest prediction accuracy is found with little effort.

In the example of FIG. 25, the variable range candidate list 33_I(VR) for the first-generation search includes the first variable range Y_I_old, the second variable range Y_I_new, and the third variable range Y_I_mid described above. The first variable range Y_I_old has a maximum cap region of 100,000 people or more, and the second variable range Y_I_new has a maximum cap region of 200,000 people or more. The third variable range Y_I_mid is a composite variable range that has both the first and second variable ranges.

The partial variable list 33_I(SSV) for the first-generation search has a first partial variable SS_I_old, which has the first variable range Y_I_old (=Yo) and the variables X and Y; a second partial variable SS_I_new, which has the second variable range Y_I_new (=Yn) and the variables X and Y; and a third partial variable SS_I_mid, which has the first and second variable ranges and the variables X and Z.

As shown in FIG. 22, the processor next selects one partial variable in the partial variable list 33_UP (33_I(SSV) in FIG. 25), extracts from the learning data master 32_UP the learning data for the level selected in S51 and for the selected partial variable, and generates a regression-type analysis model and calculates its prediction accuracy (S54). This process S54 is the same as the cross-validation method described with reference to FIGS. 16 and 17. The processor repeats the process S54 for all three partial variables (the first to third partial variables SS_I_old, SS_I_new, and SS_I_mid) in the partial variable list 33_I(SSV) of FIG. 25 (S55).

The processor then evaluates the prediction accuracy of the analysis models of all three partial variables (S56) and makes the determinations of processes S57 and S59 by comparing those accuracies. When the processor determines in the first-generation search that the partial variable SS_I_mid of the composite variable range has the highest prediction accuracy (YES in S57), it selects new variable ranges and new partial variables based on the comparison result (S58), returns to process S53, and performs the second-generation variable range search (process S53 onward). The second-generation search looks for a variable range and partial variable with still higher prediction accuracy.

On the other hand, when the analysis model of the partial variable SS_I_new of the variable range Y_I_new has the highest prediction accuracy (YES in S59), the processor determines the variable range Y_I_new as the new variable range (S61). Conversely, when the analysis model of the partial variable SS_I_old of the variable range Y_I_old has the highest prediction accuracy (NO in S59), the processor determines the variable range Y_I_old as the new variable range (S60).

FIG. 28 shows the values applied to the variable Y for the first to third variable ranges in the variable range list of the first-generation search. In the first variable range Y_I_old, the maximum cap region starts at 100,000 people, and for cases in which the number of people affected by the leak is 100,000 or more, the value of the variable Y increases from 3 to 4. In the second variable range Y_I_new, the maximum cap region starts at 200,000 people, and the value of the variable Y increases from 3 to 4 for cases in which the number is 200,000 or more. The third variable range Y_I_mid yields a partial variable that has both the first and second variable ranges, and is therefore substantially equivalent to a partial variable whose variable range has a maximum cap region of 150,000 people.
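The effect of the cap region on the value of Y can be sketched as a threshold lookup. In this sketch the lower region boundaries in `LOWER_BOUNDS`, the zero-based numbering of the lowest region, and the function names are hypothetical. Only the cap thresholds of 100,000 and 200,000 and the step from 3 to 4 come from the description of FIG. 28.

```python
from bisect import bisect_right

# Hypothetical lower region boundaries; only the cap boundaries of
# 100,000 and 200,000 come from the description of FIG. 28.
LOWER_BOUNDS = (100, 1_000, 10_000)


def encode_y(people, cap, lower_bounds=LOWER_BOUNDS):
    """Replace a head count with a region identification number.

    Counts below the first bound map to 0 and each bound reached adds 1,
    so counts at or above `cap` map to len(lower_bounds) + 1 (here 4).
    """
    return bisect_right((*lower_bounds, cap), people)


def encode_composite(people, cap_old, cap_new):
    """Composite variable range: keep both encodings side by side (Yo, Yn)."""
    return encode_y(people, cap_old), encode_y(people, cap_new)
```

For a case with 150,000 affected people, the old range already assigns the capped value while the new range does not, so the composite partial variable carries both views of the same case.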

Analysis models whose variables have different variable ranges, as above, assign different quantitative values depending on the scale of the number of people affected by the leak in a case, and therefore predict different sanction amounts. Accordingly, when the government authorities impose an unprecedented sanction amount on a case with an unprecedented number of affected people, the variable range of the variable Y, the number of people affected, may have been changed.

FIG. 25 shows, in the partial variable list 33_I(SSV) of the first-generation search, examples of the prediction formulas of the prediction models and examples of the calculated prediction accuracy. The partial variable list 33_I(SSV) shows the following two examples of prediction accuracy.
SS_I_mid (=0.85) > SS_I_old (=0.75) > SS_I_new (=0.70): C>A>B (Y_I_old to Y_I_mid side)
SS_I_mid (=0.85) > SS_I_new (=0.75) > SS_I_old (=0.70): C>B>A (Y_I_mid to Y_I_new side)
Here, A denotes SS_I_old, B denotes SS_I_new, and C denotes SS_I_mid.

FIG. 26 shows the comparison results of prediction accuracy and the corresponding determinations. FIG. 27 shows the relationship among the three partial variables in each of the first-generation and second-generation searches. FIG. 27 conceptually shows the functions of the analysis models, with the horizontal axis representing the explanatory variable and the vertical axis representing the objective variable.

As noted above, the two comparison examples shown in FIG. 25 are C>A>B and C>B>A. Both are cases in which the partial variable C of the composite variable range has the highest prediction accuracy (YES in S57).

The determinations in FIG. 26 corresponding to the comparisons C>A>B and C>B>A are as follows. In the case of C>A>B (case a), the region from 10% to 50% of the way between A: SS_I_old and B: SS_I_new becomes the range of the new variable range search. In the case of C>B>A (case b), the region from 50% to 90% of the way between A: SS_I_old and B: SS_I_new becomes the range of the new variable range search. These determinations are explained below.
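The two windows can be computed as follows. This is a minimal sketch, assuming the percentages are measured linearly on the cap threshold from the old range toward the new range; the function name `next_search_window` and the case labels passed as strings are hypothetical.

```python
def next_search_window(cap_old, cap_new, case):
    """Return the (start, end) cap thresholds of the next search window.

    case "a" (C>A>B): 10% to 50% of the way from old to new.
    case "b" (C>B>A): 50% to 90% of the way from old to new.
    """
    fractions = {"a": (10, 50), "b": (50, 90)}
    if case not in fractions:
        raise ValueError("case must be 'a' or 'b'")
    span = cap_new - cap_old
    start_pct, end_pct = fractions[case]
    # Integer arithmetic keeps the thresholds exact head counts.
    return cap_old + span * start_pct // 100, cap_old + span * end_pct // 100
```

With the first-generation thresholds of 100,000 and 200,000, case a gives a window from 110,000 to 150,000 and case b a window from 150,000 to 190,000 under this linear reading.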

FIG. 27 shows the conceptual relationship among the functions of the first, second, and third partial variables SS_I_old (=A), SS_I_new (=B), and SS_I_mid (=C) in the first-generation search.

In the case of C>A>B (case a), the prediction models of the partial variables containing the variable ranges searched in the second generation (search (a)) lie between SS_II_old_a (=SS_I_mid) and SS_II_new_a. In this case, the partial variable (variable range) with the highest prediction accuracy is presumed to lie somewhere on the SS_I_old (Y_old) side, close to the first-generation SS_I_mid (Y_mid).

FIG. 25 shows the list 33_II(VR) of the first and second variable ranges searched in the second generation. In the case of C>A>B, the first variable range is Y_II_old_a and the second variable range is Y_II_new_a. The second, first, and third partial variables searched in the second generation (search (a)) in this case are SS_II_new_a, SS_II_old_a, and SS_II_mid_a in the partial variable list 33_II_a(SSV).

In the case of C>B>A (case b), the first and second variable ranges searched in the second generation (search (b)) lie between SS_II_old_b (=SS_I_mid) and SS_II_new_b. In this case, the variable range with the highest prediction accuracy is presumed to lie somewhere on the SS_I_new (Y_new) side, close to the first-generation SS_I_mid (Y_mid).

In the list 33_II(VR) of the first and second variable ranges searched in the second generation in FIG. 25, the first variable range is Y_II_old_b and the second variable range is Y_II_new_b in the case of C>B>A. The first, second, and third partial variables searched in the second generation (search (b)) in this case are SS_II_old_b, SS_II_new_b, and SS_II_mid_b in the partial variable list 33_II_b(SSV).

FIG. 27 shows the first to third partial variables SS_II_old_a, SS_II_mid_a, and SS_II_new_a in the second-generation search (a) for the case of C>A>B (case a). The region of the second-generation search is close to the third partial variable SS_I_mid of the first-generation search and lies on the side of the first partial variable SS_I_old.

FIG. 27 also shows the first to third partial variables SS_II_old_b, SS_II_mid_b, and SS_II_new_b in the second-generation search (b) for the case of C>B>A (case b). The region of the second-generation search is close to the third partial variable SS_I_mid of the first-generation search and lies on the side of the second partial variable SS_I_new.

FIG. 29 shows the values applied to the variable Y for the first to third variable ranges in the variable range list of the second-generation search, for the case of C>A>B (case a). In the first variable range Y_II_old_a, the maximum cap region starts at 150,000 people, and the value of the variable Y increases from 3 to 4 for cases in which the number of people affected by the leak is 150,000 or more. In the second variable range Y_II_new_a, the maximum cap region starts at 120,000 people, and the value of the variable Y increases from 3 to 4 for cases in which the number is 120,000 or more. The third variable range Y_II_mid_a yields a partial variable that has both the first and second variable ranges, and is therefore substantially equivalent to a partial variable whose variable range has a maximum cap region between 120,000 and 150,000 people.

FIG. 30 shows the values applied to the variable Y for the first to third variable ranges in the variable range list of the second-generation search, for the case of C>B>A (case b). In the first variable range Y_II_old_b, the maximum cap region starts at 150,000 people. In the second variable range Y_II_new_b, the maximum cap region starts at 170,000 people. The third variable range Y_II_mid_b yields a partial variable that has both the first and second variable ranges, and is therefore substantially equivalent to a partial variable whose variable range has a maximum cap region between 150,000 and 170,000 people.

The determinations in FIG. 26 for the remaining four comparisons of prediction accuracy in the first-generation search, A>B>C, A>C>B, B>A>C, and B>C>A, are as follows.

In the case of A>B>C, the first partial variable SS_I_old (first variable range Y_I_old), denoted A, has the highest prediction accuracy, so the processor changes the variable range to the first variable range Y_I_old (S60) and ends the variable range update process.

In the case of A>C>B, the first partial variable SS_I_old (first variable range Y_I_old), denoted A, has the highest prediction accuracy, but that of the third partial variable SS_I_mid (third variable range Y_I_mid), denoted C, is also high. The processor therefore changes the variable range to a fine-tuned version of the first variable range Y_I_old and ends the variable range update process.

In the case of B>A>C, the second partial variable SS_I_new (second variable range Y_I_new), denoted B, has the highest prediction accuracy, so the processor changes the variable range to the second variable range Y_I_new (S61) and ends the variable range update process.

In the case of B>C>A, the second partial variable SS_I_new (second variable range Y_I_new), denoted B, has the highest prediction accuracy, but that of the third partial variable SS_I_mid (third variable range Y_I_mid), denoted C, is also high. The processor therefore changes the variable range to a fine-tuned version of the second variable range Y_I_new and ends the variable range update process.

In the cases of C>A>B and C>B>A, the processor continues with the second-generation variable range search. In any search generation, when the comparison of prediction accuracy yields A>B>C, A>C>B, B>A>C, or B>C>A, the processor changes the variable range to the first or second variable range, as in the first generation described above, and ends the variable range update process.
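The six orderings and their determinations can be collected into one function. This is a sketch: the returned action names are hypothetical labels, and the handling of ties (which the source does not address) is an assumption that falls through to the "new" side.

```python
def decide(acc_old, acc_new, acc_mid):
    """Map the ranking of A (old), B (new), C (mid) to a determination."""
    a, b, c = acc_old, acc_new, acc_mid
    if c > a and c > b:
        # C is best: continue with the next generation on the better side.
        return "search_old_side" if a > b else "search_new_side"
    if a > b:
        # A is best; when C is the runner-up the old range is fine-tuned.
        return "adopt_old_fine_tuned" if c > b else "adopt_old"
    # B is best (or tied with A, by assumption); same rule mirrored.
    return "adopt_new_fine_tuned" if c > a else "adopt_new"
```

Feeding in the accuracies 0.75, 0.70, and 0.85 for A, B, and C reproduces the first example ordering C>A>B and selects a further search on the old side.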

Furthermore, in the second and subsequent generations, if the result is still C>A>B or C>B>A after the search has continued up to a predetermined generation, the processor changes the variable range to that of C at that point and ends the search process. This is because the disadvantage of the longer search effort would otherwise outweigh the advantage of higher prediction accuracy.
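The generation-limited search can be sketched as a loop over cap thresholds. This sketch rests on several assumptions that are not in the source: each candidate is represented by a single cap threshold, the composite range is scored at the midpoint threshold (the description calls it only "substantially equivalent" to such a range), the next endpoint is placed at 30% or 70% of the way from old to new, and the default limit is three generations. The names `refine_cap` and `accuracy` are hypothetical.

```python
def refine_cap(cap_old, cap_new, accuracy, max_generations=3):
    """Narrow the cap threshold until A or B wins or the limit is reached.

    `accuracy(cap)` returns the prediction accuracy for one cap threshold.
    Assumption: the composite range is scored at the midpoint threshold.
    """
    for _ in range(max_generations):
        cap_mid = (cap_old + cap_new) // 2
        a, b, c = accuracy(cap_old), accuracy(cap_new), accuracy(cap_mid)
        if not (c > a and c > b):
            # A or B is best: adopt it and stop (S60 / S61).
            return cap_old if a > b else cap_new
        # C is best: the next generation starts at C and probes the
        # better side at an assumed position inside the search window.
        pct = 30 if a > b else 70
        cap_old, cap_new = cap_mid, cap_old + (cap_new - cap_old) * pct // 100
    # Limit reached with C still best: cap_old holds the last C.
    return cap_old
```

A caller would supply `accuracy` as a wrapper around the cross-validation of S54, so each generation costs three model evaluations.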

According to process F for updating the variable range of an explanatory variable shown in FIG. 22, combining the coarse variable range search S52 with the fine variable range search S53 to S61 reduces the man-hours of the variable range update process. Of course, process F may instead, like process B that searches for the optimal variable range of the explanatory variables at initialization, set a large number of variable range candidates, exhaustively calculate the prediction accuracy of the analysis model, and update to the variable range with the best prediction accuracy.

10: Prediction device, computer
21: Variable range search program
22: Prediction model generation program
23: Prediction program
24: Variable range update program
31: Case data
32: Learning data master
33: Variable range list and subvariable list
34: Learning data for classification
35: Learning data for regression
X, Y, Z: Explanatory variables
Y1 to Y5: Variable ranges, variable range candidates
SA: Objective variable

Claims

A prediction device comprising:
a processor; and
a memory accessible to the processor,
wherein the processor
(a) generates a classification type prediction model and a regression type prediction model based on a plurality of learning data obtained, for a plurality of case data having values of an objective variable corresponding to values of a plurality of explanatory variables, by replacing the values of the explanatory variables with identification numbers of a plurality of regions based on a variable range having the plurality of regions,
the classification type prediction model having a plurality of classification type learning data in which the values of the objective variable of the plurality of learning data are replaced with a plurality of levels according to their magnitude, and determining, as the level of the objective variable of the learning data of a prediction target case, the level of the objective variable of the classification type learning data that, among the plurality of classification type learning data, has the shortest norm distance to the learning data of the prediction target case in the values of the explanatory variables,
the regression type prediction model having, for each level into which the plurality of learning data are divided according to the level of the objective variable, a regression line that fits the coordinate points of the explanatory variables of the learning data of that level, and calculating the value of the objective variable of the learning data of the prediction target case from the regression line corresponding to the level determined by the classification type prediction model,
(b) applies the learning data of the prediction target case to the classification type prediction model to predict the level of the objective variable of the learning data of the prediction target case, and
(c) applies the learning data of the prediction target case to the regression line corresponding to the predicted level to predict the value of the objective variable of the learning data of the prediction target case.
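As an illustrative, non-limiting sketch of processes (b) and (c), the two-stage prediction might look like the following. The sketch assumes a Euclidean norm for the distance, and for brevity fits the regression line on a single explanatory variable selected by the hypothetical parameter `reg_index`; the function names and data layout are likewise assumptions and are not taken from the embodiment.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx

def predict(cases, levels, target, reg_index=0):
    """cases: list of (region_ids, objective_value); levels: level per case."""
    # Stage 1 (classification): take the level of the nearest case
    nearest = min(range(len(cases)),
                  key=lambda i: math.dist(cases[i][0], target))
    level = levels[nearest]
    # Stage 2 (regression): fit a line using only cases of that level
    same = [c for c, lv in zip(cases, levels) if lv == level]
    slope, intercept = fit_line([c[0][reg_index] for c in same],
                                [c[1] for c in same])
    return level, slope * target[reg_index] + intercept

cases = [((1, 1), 10.0), ((2, 1), 20.0), ((3, 2), 30.0),
         ((7, 5), 400.0), ((8, 5), 500.0), ((9, 6), 600.0)]
levels = ["low"] * 3 + ["high"] * 3
level, value = predict(cases, levels, (8, 6))
```

Fitting one line per level keeps each regression local to cases of similar magnitude, which is the motivation the claim gives for classifying first.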

The prediction device according to claim 1, wherein the processor
generates a plurality of subvariables, each being a subset of a set having the plurality of explanatory variables and a plurality of variable range candidates set for at least one explanatory variable,
calculates, for each of the subvariables, the prediction accuracy of the classification type prediction model on learning data of the plurality of subvariables obtained by extracting, from the plurality of learning data, the values of the explanatory variables and variable range candidates included in each of the plurality of subvariables,
determines, as the variable range of the classification type prediction model, the variable range candidate included most often in the N subvariables with the highest prediction accuracy, N being a positive integer, and
generates the classification type prediction model after determining the variable range of the classification type prediction model.

The prediction device according to claim 2, wherein the processor
(d) calculates, for each level, the prediction accuracy of the regression type prediction model for each of the subvariables on learning data of the plurality of subvariables obtained by extracting, from the plurality of learning data, the values of the explanatory variables and variable range candidates included in each of the plurality of subvariables,
(e) determines, as the variable range of the regression type prediction model, the variable range candidate included most often in the M subvariables with the highest prediction accuracy, M being a positive integer,
repeats the processes (d) and (e) for the plurality of levels, and
generates the regression type prediction model after determining the variable range of the regression type prediction model.

The prediction device according to claim 2 or 3, wherein the plurality of variable range candidates differ from one another in at least the maximum region among the plurality of regions of each variable range candidate.

The prediction device according to claim 2, wherein, in calculating the prediction accuracy of the classification type prediction model for each subvariable, the processor
predicts the level of the objective variable of the learning data of an evaluation target case using a classification type prediction model having the learning data of the remaining cases, excluding the evaluation target case, among the learning data of the subvariable,
judges the success or failure of the prediction for the evaluation target case based on whether the predicted level matches the level of the learning data of the evaluation target case,
repeats the prediction and the judgment with every case as the evaluation target case, and
outputs the ratio of successful predictions among the evaluation target cases as the prediction accuracy.
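The accuracy calculation recited here corresponds to a leave-one-out evaluation, which may be sketched as follows. The sketch assumes a Euclidean norm and a nearest-case classifier as in claim 1; the function name and data layout are hypothetical.

```python
import math

def loo_classification_accuracy(features, levels):
    """features: list of region-ID tuples; levels: level label per case."""
    hits = 0
    for i, target in enumerate(features):
        # Classify case i using all remaining cases
        rest = [j for j in range(len(features)) if j != i]
        nearest = min(rest, key=lambda j: math.dist(features[j], target))
        # Success when the predicted level matches the recorded level
        hits += levels[nearest] == levels[i]
    return hits / len(features)

features = [(1, 1), (1, 2), (5, 5), (8, 8), (8, 9)]
levels = ["low", "low", "low", "high", "high"]
accuracy = loo_classification_accuracy(features, levels)
```

In this example the case (5, 5) lies closest to a "high" case and is therefore misclassified, while the other four cases are predicted correctly.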

The prediction device according to claim 3, wherein, in calculating the prediction accuracy of the regression type prediction model for each subvariable, the processor
(f) for each level into which the learning data of the subvariable are divided according to the level of the objective variable, predicts the value of the objective variable of the learning data of an evaluation target case using a regression type prediction model generated from the learning data of the remaining cases, excluding the evaluation target case, among the learning data of the subvariable, calculates the ratio of the predicted value to the value of the objective variable of the learning data of the evaluation target case, and repeats the prediction and the calculation with every case as the evaluation target case,
(g) outputs the average of the ratios over all the cases as the prediction accuracy, and
repeats the processes (f) and (g) for the plurality of levels.
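Processes (f) and (g) for a single level may be sketched as follows. For brevity the sketch assumes one explanatory variable and an ordinary least-squares line, and it assumes that no recorded objective value is zero; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx

def loo_regression_accuracy(xs, ys):
    """Average ratio of predicted to recorded value over all cases of one level."""
    ratios = []
    for i in range(len(xs)):
        # (f) fit on the remaining cases, then predict the held-out case
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        ratios.append((slope * xs[i] + intercept) / ys[i])
    # (g) the mean ratio is output as the prediction accuracy
    return sum(ratios) / len(ratios)

accuracy = loo_regression_accuracy([1, 2, 3, 4], [10.0, 20.0, 30.0, 40.0])
```

With perfectly linear data, as in this example, every held-out prediction equals the recorded value, so each ratio is 1.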

The prediction device according to claim 1, wherein, when the prediction accuracy obtained by comparing the value of the objective variable predicted using the regression type prediction model for the learning data of the prediction target case with the confirmed value of the objective variable of the learning data of a new case is lower than a reference accuracy, the processor
generates a plurality of subvariables, each being a subset of a set having the plurality of explanatory variables and a plurality of variable range candidates set for at least one explanatory variable,
(h) for each level into which a new plurality of learning data, obtained by adding the learning data of the new case to the plurality of learning data, are divided according to the level, calculates the prediction accuracy of the regression type prediction model for each of the subvariables on learning data of the plurality of subvariables obtained by extracting, from the new plurality of learning data, the values of the explanatory variables and variable range candidates included in each of the plurality of subvariables,
(i) updates the variable range of the regression type prediction model with the variable range candidate included in a subvariable having relatively high prediction accuracy, and
repeats the processes (h) and (i) for the plurality of levels.

The prediction device according to claim 7, wherein
the plurality of variable range candidates differ from one another in at least the maximum region of each variable range candidate,
the plurality of variable range candidates include a first variable range candidate having a first maximum region larger than the maximum region of the variable range candidate having the highest prediction accuracy, a second variable range candidate having a second maximum region smaller than that maximum region, and a third variable range candidate having a variable range that combines the first variable range candidate and the second variable range candidate, and
the processor, in the process (h),
in a first-generation search process, compares a first prediction accuracy of the regression type prediction model of a first subvariable having the first variable range candidate, a second prediction accuracy of the regression type prediction model of a second subvariable having the second variable range candidate, and a third prediction accuracy of the regression type prediction model of a third subvariable having the third variable range candidate, and
when the third prediction accuracy is higher than the first and second prediction accuracies, in a second-generation variable range search process, executes the first-generation search process with
a variable range having a third maximum region between the first maximum region and the second maximum region as the first variable range candidate, and
a variable range having a fourth maximum region between the third maximum region and the first maximum region, or a fifth maximum region between the third maximum region and the second maximum region, as the second variable range candidate.

A computer-readable prediction program that causes a computer to execute a prediction process, the prediction process comprising:
(a) generating a classification type prediction model and a regression type prediction model based on a plurality of learning data obtained, for a plurality of case data having values of an objective variable corresponding to values of a plurality of explanatory variables, by replacing the values of the explanatory variables with identification numbers of a plurality of regions based on a variable range having the plurality of regions,
the classification type prediction model having a plurality of classification type learning data in which the values of the objective variable of the plurality of learning data are replaced with a plurality of levels according to their magnitude, and determining, as the level of the objective variable of the learning data of a prediction target case, the level of the objective variable of the classification type learning data that, among the plurality of classification type learning data, has the shortest norm distance to the learning data of the prediction target case in the values of the explanatory variables,
the regression type prediction model having, for each level into which the plurality of learning data are divided according to the level of the objective variable, a regression line that fits the coordinate points of the explanatory variables of the learning data of that level, and calculating the value of the objective variable of the learning data of the prediction target case from the regression line corresponding to the level determined by the classification type prediction model;
(b) applying the learning data of the prediction target case to the classification type prediction model to predict the level of the objective variable of the learning data of the prediction target case; and
(c) applying the learning data of the prediction target case to the regression line corresponding to the predicted level to predict the value of the objective variable of the learning data of the prediction target case.

A prediction method in which a processor
(a) generates a classification type prediction model and a regression type prediction model based on a plurality of learning data obtained, for a plurality of case data having values of an objective variable corresponding to values of a plurality of explanatory variables, by replacing the values of the explanatory variables with identification numbers of a plurality of regions based on a variable range having the plurality of regions,
the classification type prediction model having a plurality of classification type learning data in which the values of the objective variable of the plurality of learning data are replaced with a plurality of levels according to their magnitude, and determining, as the level of the objective variable of the learning data of a prediction target case, the level of the objective variable of the classification type learning data that, among the plurality of classification type learning data, has the shortest norm distance to the learning data of the prediction target case in the values of the explanatory variables,
the regression type prediction model having, for each level into which the plurality of learning data are divided according to the level of the objective variable, a regression line that fits the coordinate points of the explanatory variables of the learning data of that level, and calculating the value of the objective variable of the learning data of the prediction target case from the regression line corresponding to the level determined by the classification type prediction model,
(b) applies the learning data of the prediction target case to the classification type prediction model to predict the level of the objective variable of the learning data of the prediction target case, and
(c) applies the learning data of the prediction target case to the regression line corresponding to the predicted level to predict the value of the objective variable of the learning data of the prediction target case.