JP2023087998A

JP2023087998A - Machine learning system and machine learning method

Info

Publication number: JP2023087998A
Application number: JP2021202595A
Authority: JP
Inventors: 一樹山根; Kazuki Yamane; 敏之鵜飼; Toshiyuki Ukai; 光之助山本; Mitsunosuke Yamamoto
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2021-12-14
Filing date: 2021-12-14
Publication date: 2023-06-26
Anticipated expiration: 2041-12-14
Also published as: JP7359829B2

Abstract

PROBLEM TO BE SOLVED: To provide a machine learning system capable of further improving the estimation of the cause of an outlier and ensuring sufficient prediction accuracy. SOLUTION: A machine learning system 1 comprises an outlier detection unit 10 and a learning unit 20. The outlier detection unit 10 performs a first outlier detection process with an objective variable (for example, sales) as the detection target for all records, divides all the records into a plurality of groups based on explanatory variables, performs a second outlier detection process with the objective variable as the detection target for each group, and estimates the cause of each outlier based on the results of these two outlier detection processes. The learning unit 20 performs machine learning based on the results of estimating the cause of the outlier, and creates a learning model for predicting the objective variable. SELECTED DRAWING: Figure 1

Description

The present invention relates to a machine learning system and a machine learning method.

Conventionally, an outlier detection method capable of accurately detecting outliers (see, for example, Patent Document 1) and an outlier factor estimation method capable of easily inferring the cause of an outlier (see, for example, Patent Document 2) have been known.

Patent Document 1: JP 2015-082190 A
Patent Document 2: Japanese Patent No. 6719612

In a machine learning system, when the objective variable contains outliers, it becomes difficult to ensure sufficient prediction accuracy with ordinary machine learning that uses the mean squared error or the like as the loss function. In addition, even when a value appears to be an outlier mathematically, it may actually be the result of a mistake during measurement or entry, so there has been a problem that appropriate countermeasures cannot be taken unless the cause is estimated. In particular, when the objective variable is expressed as a ratio of two numerical values, such as "cost-effectiveness" or "sales per call time", the outlier cannot be dealt with unless it is determined whether the numerator or the denominator causes the objective variable to become an outlier.

A machine learning system according to an aspect of the present invention includes: a first outlier detection unit that performs, for all records of learning data, a first outlier detection process whose detection target is an objective variable that is the prediction target of machine learning; a group processing unit that divides all the records into a plurality of groups based on explanatory variables included in the records; a second outlier detection unit that performs, for each group, a second outlier detection process whose detection target is the objective variable; a cause estimation unit that estimates the cause of an outlier based on the detection results of the first outlier detection unit and the second outlier detection unit; and a model creation unit that performs machine learning based on the estimation result of the cause estimation unit to create a learning model for predicting the objective variable.

According to the present invention, the estimation of the cause of an outlier can be further improved, and sufficient prediction accuracy can be ensured in machine learning.

FIG. 1 is a block diagram showing a configuration example of a machine learning system.
FIG. 2 is a diagram showing an example of learning data.
FIG. 3 is a flowchart showing the overall processing of the machine learning method in the machine learning system.
FIG. 4 is a flowchart showing details of the outlier detection process executed in step S10 of FIG. 3.
FIG. 5 is a flowchart showing details of the group-unit outlier detection process in step S170 of FIG. 4.
FIG. 6 is a flowchart showing the detailed processing of outlier cause classification in step S1800 of FIG. 5.
FIG. 7 is a diagram for explaining the grouping of learning data.
FIG. 8 is a diagram showing an example of a screen display of outlier cause estimation results.
FIG. 9 is a flowchart showing details of the learning model creation process executed in step S20 of FIG. 3.
FIG. 10 is a flowchart showing details of the machine learning process in step S220 of FIG. 9.
FIG. 11 is a flowchart showing details of the learning model evaluation process in step S30 of FIG. 3.
FIG. 12 is a hardware diagram of a computer that implements the machine learning system.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. The following description and drawings are examples for explaining the present invention, and are omitted and simplified as appropriate for clarity of explanation. In the following description, the same or similar elements and processes are denoted by the same reference numerals, and redundant description may be omitted. The contents described below merely show one example of an embodiment of the present invention; the present invention is not limited to the following embodiment and can be implemented in various other forms.

The machine learning system of the present embodiment is well suited to cases where the objective variables to be predicted using machine learning include one expressed as a ratio of two numerical values, such as "cost-effectiveness" or "sales per call time". FIG. 1 is a block diagram showing a configuration example of the machine learning system 1 according to this embodiment. As databases related to machine learning, the system includes a learning data database DB1 (hereinafter referred to as learning data DB1) that stores learning data, evaluation data, and the like, a setting information database DB2 (hereinafter referred to as setting information DB2) that stores setting information, an outlier information database DB3 (hereinafter referred to as outlier information DB3) that stores outlier information, and a model information database DB4 (hereinafter referred to as model information DB4) that stores information on learning models. The system also includes an outlier detection unit 10, a learning unit 20, and a model evaluation unit 30, which perform machine learning processing based on the data in these databases.

FIG. 2 shows an example of the learning data stored in the learning data DB1, in the form of a table. The table consists of multiple columns and rows, which are called columns and records, respectively. Columns C1 to C5 store the customer name, age, occupation, call time, and sales, respectively. In this embodiment, the customer name, age, occupation, call time, and sales are called variables. Each of the records R1 to R3 consists of data for each variable (customer name, age, occupation, call time, and sales). For example, in record R1, the customer name is A, the age is 48, the occupation is housewife, the call time is 600 seconds, and the sales are 100,000 yen. Although only three records R1 to R3 are shown in FIG. 2, the learning data contains a large number of records.

Returning to FIG. 1, the setting information DB2 stores information on the probability distribution of the objective variables, information on digit correction coefficients, information on loss functions, information on missing value imputation, and the like. The outlier information DB3 stores, in advance, information on the domain of the objective variables and on periods during which measurement trouble occurred. As will be described later, the series of results from the outlier detection process (for example, outlier cause estimation results) is also stored in the outlier information DB3. As will be described later, the model information DB4 stores the models obtained through a series of learning runs, together with information about those models (for example, the loss function used and the corrections performed).

The outlier detection unit 10, the learning unit 20, and the model evaluation unit 30 perform the arithmetic processing for creating a learning model, and each performs the corresponding processing shown in the flowchart of FIG. 3. FIG. 3 is a flowchart showing the overall processing of the machine learning method in the system. This embodiment describes the case where the learning data shown in FIG. 2 is stored in the learning data DB1 and the objective variable to be predicted using machine learning is the ratio = (sales)/(call time). When the objective variable is expressed as a ratio, it becomes difficult to ensure the prediction accuracy of the objective variable if at least one of the denominator and the numerator contains an erroneous value caused by a mistake during measurement or entry. In particular, a value entered with the wrong number of digits often becomes a mathematical outlier. In addition, when machine learning is performed with the numerator and the denominator each as an objective variable, a loss function that takes outliers into account must be set if they contain mathematical outliers. However, if an outlier is actually due to an erroneous entry, what is needed is correction of the entry, not a different loss function. Therefore, to perform machine learning under appropriate settings both when the ratio is used directly as the objective variable and when the numerator and the denominator are used individually as objective variables, it is necessary to extract the outliers contained in the numerator and the denominator and to estimate whether or not each was caused by an erroneous entry.

Therefore, in the present embodiment, a process for detecting statistical outliers is performed not only for the objective variable to be predicted (= ratio), but also for the denominator and the numerator of that ratio. Furthermore, records with similar attributes in terms of "age" or "occupation", which are explanatory variables for the objective variable, are grouped together, and the statistical outlier detection process is performed again within each group. The cause of each outlier is then estimated based on the results of these two stages of processing, and the preprocessing and loss function used for learning model generation are set based on the cause estimation result.

In step S10 of FIG. 3, the outlier detection unit 10 performs the two-stage outlier detection process described above and carries out outlier cause estimation, which estimates what kind of outlier each one is (for example, a true outlier or an outlier due to an erroneous entry).

In step S20, the learning unit 20 sets the preprocessing and the loss function for learning model creation based on the cause estimation result obtained in step S10. Then, based on these settings, it performs both prediction for each of the numerator and the denominator of the objective variable and prediction for the objective variable (= ratio). For example, machine learning with erroneous entries corrected and machine learning with a loss function that is robust to outliers (robust regression) are applied, and the model with the best accuracy is adopted for each.
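The difference between an ordinary loss and an outlier-robust loss can be illustrated with a minimal sketch. The text does not name a specific robust loss, so the Huber loss and the threshold `delta` used here are assumptions chosen purely for illustration.

```python
def squared_loss(residual: float) -> float:
    """Ordinary squared-error loss; grows quadratically with the residual."""
    return 0.5 * residual ** 2


def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Huber loss (assumed example of a robust loss).

    Quadratic for |residual| <= delta, linear beyond it, so a single
    large residual contributes far less than under the squared loss.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * a ** 2
    return delta * (a - 0.5 * delta)


for r in (0.5, 10.0):
    print(r, squared_loss(r), huber_loss(r))
```

For small residuals the two losses agree; for a residual of 10 the squared loss is much larger than the Huber loss, which is why a robust loss limits the influence of true outliers on the fitted model.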

In step S30, the model evaluation unit 30 performs the evaluation process. For example, using test data, the ratio is obtained from the individual prediction models for the numerator and the denominator created in step S20 (hereinafter referred to as model 1). In addition, the value of the prediction model for the ratio itself (hereinafter referred to as model 2) is calculated using the same test data. Of model 1 and model 2, the one with the higher prediction accuracy is adopted. Details of the processing performed by the outlier detection unit 10, the learning unit 20, and the model evaluation unit 30 will be described later.
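The comparison in step S30 can be sketched as follows. This is a minimal illustration, not the evaluation procedure of FIG. 11: the function names are hypothetical, mean squared error is assumed as the accuracy measure, and the predicted denominators are assumed to be non-zero.

```python
from typing import List, Sequence, Tuple


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Assumed accuracy measure: lower is better."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def choose_model(
    actual_ratio: Sequence[float],
    pred_numerator: Sequence[float],
    pred_denominator: Sequence[float],
    pred_ratio: Sequence[float],
) -> Tuple[str, float]:
    """Compare model 1 (numerator / denominator) with model 2 (direct ratio)."""
    ratio_from_parts: List[float] = [
        n / d for n, d in zip(pred_numerator, pred_denominator)
    ]
    err1 = mean_squared_error(actual_ratio, ratio_from_parts)
    err2 = mean_squared_error(actual_ratio, pred_ratio)
    return ("model 1", err1) if err1 <= err2 else ("model 2", err2)


# Hypothetical test data, for illustration only.
best = choose_model(
    actual_ratio=[100.0, 200.0],
    pred_numerator=[1000.0, 4000.0],
    pred_denominator=[10.0, 20.0],
    pred_ratio=[110.0, 190.0],
)
print(best)
```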

<Detailed Description of Outlier Detection Processing in Outlier Detection Unit 10>
FIG. 4 is a flowchart showing details of the outlier detection process executed in step S10 of FIG. 3. In the present embodiment, not only the ratio, which is the objective variable, but also the denominator (= call time) and the numerator (= sales) of the ratio are detection targets for outlier detection, so all of them are referred to as objective variables. Hereinafter, the ratio, the denominator, and the numerator are referred to as the objective variable (ratio), the objective variable (denominator), and the objective variable (numerator), respectively.

In the flowchart of FIG. 4, the processing from step S100 to step S180 is performed for each objective variable. That is, the processing from step S100 to step S180 is first performed for the objective variable (denominator), then for the objective variable (numerator), and finally for the objective variable (ratio).

In step S100, the outlier detection processing loop for each objective variable is started. In step S110, the data in the objective variable column is obtained from the learning data. When the target is the objective variable (denominator) or the objective variable (numerator), the data in the call time column or the sales column of the table in FIG. 2 is obtained. When the target is the objective variable (ratio), the call time and the sales are read from the table, and the value of (sales)/(call time) calculated from them is used as the objective variable (ratio).
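Step S110 can be sketched as below. The field names and the helper `objective_columns` are hypothetical; record R1 uses the values given for FIG. 2, while the second record is invented for illustration.

```python
from typing import Dict, List

# Record R1 comes from FIG. 2; the second record is hypothetical.
records = [
    {"customer": "A", "age": 48, "occupation": "housewife",
     "call_time": 600, "sales": 100_000},
    {"customer": "X", "age": 35, "occupation": "office worker",
     "call_time": 300, "sales": 30_000},
]


def objective_columns(records: List[dict]) -> Dict[str, List[float]]:
    """Return the three detection targets in the order the loop processes them:
    denominator (call time), numerator (sales), then ratio (sales / call time)."""
    return {
        "denominator": [r["call_time"] for r in records],
        "numerator": [r["sales"] for r in records],
        "ratio": [r["sales"] / r["call_time"] for r in records],
    }


columns = objective_columns(records)
for name, values in columns.items():
    print(name, values)
```

The sketch assumes every call time is non-zero; real data would need a guard before computing the ratio.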

In step S120, it is checked whether probability distribution information on the objective variable in question (for example, information that sales can be approximated by a normal distribution) is registered in the setting information DB2. In step S130, it is determined whether such probability distribution information exists; if it does (yes), the process proceeds to step S140, and if it does not (no), the process proceeds to step S150. In step S140, outlier detection process 1A, which assumes the registered probability distribution, is performed. For example, when a normal distribution is registered as the probability distribution information for the objective variable, outlier detection using the Smirnov-Grubbs test is performed. In step S150, on the other hand, outlier detection process 1B is performed, which does not assume a specific probability distribution and can be applied even when the underlying distribution is unknown, for example outlier detection using the interquartile range.
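The branch between processes 1A and 1B can be sketched as follows. This is a simplified illustration under stated assumptions: the interquartile-range rule uses the conventional multiplier 1.5, which the text does not specify, and process 1A is represented by a plain z-score rule as a stand-in, not by the Smirnov-Grubbs test itself (which additionally requires t-distribution critical values). All function names are hypothetical.

```python
import statistics
from typing import List, Optional, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[int]:
    """Process 1B sketch: indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def zscore_outliers(values: Sequence[float], threshold: float = 3.0) -> List[int]:
    """Stand-in for process 1A under a normal assumption (NOT Smirnov-Grubbs)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


def detect_outliers(values: Sequence[float],
                    registered_distribution: Optional[str] = None) -> List[int]:
    """Steps S130 to S150: use the distribution-based test only when a
    distribution is registered; otherwise use the distribution-free rule."""
    if registered_distribution == "normal":
        return zscore_outliers(values)
    return iqr_outliers(values)


# Hypothetical data, for illustration only.
print(detect_outliers([10, 12, 11, 13, 12, 11, 10, 12, 100]))
```

Both functions return record indices, so the results can be written back to the records as flags for the later cause estimation.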

In step S160, which values of the objective variable were outliers is stored in the outlier information DB3. Step S170 is a routine for group-unit outlier detection, the details of which will be described later with reference to the flowchart of FIG. 5. In step S180, the outlier detection processing loop for each objective variable is completed. When the series of processes from step S100 to step S180 has been completed for each of the objective variable (denominator), the objective variable (numerator), and the objective variable (ratio), the outlier detection process of FIG. 4 ends.

(Group-unit outlier detection processing)
FIG. 5 is a flowchart showing details of the group-unit outlier detection process in step S170 of FIG. 4. In step S1700, all the data is sorted into several groups based on the explanatory variables, using clustering or the like. In the learning data shown in FIG. 2, each record has the customer name, age, and occupation as explanatory variables, and the records can be grouped by, for example, occupation or age. For example, when grouping by occupation, assuming that there are three occupations (housewife, office worker, part-time worker) in the data shown in FIG. 2, the records are divided into three groups as shown in FIGS. 7(a) to 7(c). In the examples shown in FIGS. 7(a) to 7(c), the determination results of outlier detection process 1 (1A, 1B) and of outlier detection process 2, which will be described later, are also shown to the right of column C5. Note that FIGS. 7(a) and 7(b) show the outlier detection results for the objective variable (denominator) = call time, while FIG. 7(c) shows the outlier detection results for the objective variable (numerator) = sales. Here, grouping is performed on a single categorical explanatory variable, but a continuous explanatory variable may also be used for grouping by dividing its range into a plurality of intervals. In addition, grouping may be performed on two or more explanatory variables using a technique such as clustering.

Although the description is simplified in FIGS. 7(a) to 7(c), when outlier detection is performed on a group basis, it is also performed for each of the objective variable (numerator), the objective variable (denominator), and the objective variable (ratio). The determination results (detected or not detected) of outlier detection processes 1 and 2 are therefore obtained for each of the objective variable (numerator), the objective variable (denominator), and the objective variable (ratio).

In step S1710, the outlier information DB3 is searched for remarks on the call time and sales data, such as the domain (for example, an upper limit on call time) or periods of measurement trouble, and any such information is acquired from the outlier information DB3. In step S1720, the group-unit outlier detection processing loop (the processing from step S1720 to step S1810) is started. In the grouping example shown in FIGS. 7(a) to 7(c), the records are divided into three groups (housewives, office workers, and part-time workers), and the processing loop from step S1720 to step S1810 is executed for each of the housewife group, the office worker group, and the part-time worker group.

In step S1730, outlier detection process 2 is performed on the objective variable. To distinguish it from outlier detection process 1 (1A, 1B) described above, which uses all records of the learning data, the group-unit outlier detection process is referred to as outlier detection process 2. A detailed description of outlier detection process 2 is omitted here, but except that it is limited to the records in the group, the same processing as steps S110 to S160 in FIG. 4 is performed. However, outlier detection process 2 in step S1730 uses outlier detection that does not assume a specific probability distribution.

When the outlier detection processing loop in FIG. 4 is being executed for a given objective variable, the group-unit outlier detection process (step S170) within that loop is executed for the same objective variable. For example, if the outlier detection processing loop in FIG. 4 is being executed for the objective variable (denominator) = call time, the objective variable of outlier detection process 2 in step S1730 is also the objective variable (denominator) = call time. That is, the values in column C4 are obtained from the grouped learning data and the outlier detection process is performed on them.

In step S1740, it is determined whether there is any record in which an outlier was detected by outlier detection process 1 (1A, 1B); if there is (yes), the process proceeds to step S1750, and if there is not (no), the process proceeds to step S1760. In step S1750, the outlier detection result of each such record is updated to "outlier", and the process then proceeds to step S1760. For example, in the examples shown in FIGS. 7(a) to 7(c), the records of customer names A1 to A3, B1, C1, and C2 were flagged by outlier detection process 1 (1A, 1B), so the process proceeds from step S1740 to step S1750 and the outlier detection result of each of these records is updated to "outlier".

To explain the timing of the updates in more detail: for the records of customer names A1 to A3, the outlier detection result for call time is updated to "outlier" by steps S1740 and S1750 in the processing loop in which the objective variable is call time (= objective variable (denominator)) and the group is the housewife group. For the record of customer name B1, the outlier detection result for call time is updated to "outlier" by steps S1740 and S1750 in the processing loop in which the objective variable is call time (= objective variable (denominator)) and the group is the office worker group. For the records of customer names C1 and C2, the outlier detection result for sales is updated to "outlier" by steps S1740 and S1750 in the processing loop in which the objective variable is sales (= objective variable (numerator)) and the group is the part-time worker group.

In step S1760, it is determined whether there is any record in which an outlier was detected by outlier detection process 2; if there is (yes), the process proceeds to step S1770, and if there is not (no), the process proceeds to step S1780. In step S1770, the outlier detection result of each such record is updated to "general error", and the process then proceeds to step S1780. For example, in the examples shown in FIGS. 7(a) to 7(c), the records of customer names B1 and C3 were flagged by outlier detection process 2, so the process proceeds from step S1760 to step S1770 and the outlier detection result of each of these records is updated to "general error". Note that the outlier detection result for the record of customer name B1 was set to "outlier" in step S1750, but is updated again to "general error" in step S1770.

In step S1780, it is determined whether the data (call time, sales) of each record includes data outside the domain or data measured during a period when the measuring equipment had trouble; if it does (yes), the process proceeds to step S1790, and if it does not (no), the process proceeds to step S1800. In step S1790, the outlier detection result of the corresponding record is updated to "error due to measurement error", and the process then proceeds to step S1800. For example, if 3600 seconds is specified as the upper limit of the domain for call time, a value of 4800 seconds is determined to be an "error due to measurement error". Likewise, if the remarks indicate that the data was taken during a period of measuring equipment trouble, the data is unconditionally determined to be an "error due to measurement error".
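The label updates in steps S1740 to S1790 form a cascade in which later checks overwrite earlier ones. A minimal sketch, assuming a hypothetical function `classify_record`, an initial label of "normal" (the text does not name one), and a domain given only as an upper limit:

```python
from typing import Optional


def classify_record(
    flagged_all_records: bool,   # result of outlier detection process 1 (1A/1B)
    flagged_in_group: bool,      # result of outlier detection process 2
    value: float,
    upper_limit: Optional[float] = None,  # domain info from the outlier information DB3
    in_trouble_period: bool = False,      # remark on measurement trouble
) -> str:
    label = "normal"  # assumed initial state, not named in the text
    if flagged_all_records:                      # steps S1740 / S1750
        label = "outlier"
    if flagged_in_group:                         # steps S1760 / S1770
        label = "general error"
    out_of_domain = upper_limit is not None and value > upper_limit
    if out_of_domain or in_trouble_period:       # steps S1780 / S1790
        label = "error due to measurement error"
    return label


# A record flagged only by process 1 stays "outlier" (like A1 to A3),
# one flagged by both ends up as "general error" (like B1),
# and a call time above the 3600-second limit is a measurement error.
print(classify_record(True, False, 600))
print(classify_record(True, True, 600))
print(classify_record(False, False, 4800, upper_limit=3600))
```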

In the example shown in FIGS. 7(a) to 7(c), when the processing from step S1730 to step S1790 is performed for the housewife group, the outlier detection result for the call time data of each of the records of customer names A1 to A3 is "outlier". When the processing from step S1730 to step S1790 is performed for the office worker group, the outlier detection result for the call time data of the record of customer name B1 is "general error". When the processing from step S1730 to step S1790 is performed for the part-time worker group, the outlier detection result for the sales data of the records of customer names C1 and C2 is "outlier", and the outlier detection result for the sales data of the record of customer name C3 is "general error". Step S1800 is a routine for outlier cause classification, the details of which are shown in FIG. 6.

(Detailed description of outlier cause classification processing)
In step S1802 in FIG. 6, the digit correction coefficients are obtained from the setting information DB2. A digit correction coefficient is a value by which data determined to be an erroneous entry is multiplied in order to correct the digits of the data value, that is, the position of the decimal point. For example, values in 10-fold increments such as 10^-5, 10^-4, ..., 10^4, 10^5 are stored in advance in the setting information DB2 as digit correction coefficients. In step S1804, the data value determined to be a "general error" is multiplied by each of the acquired digit correction coefficients.

In step S1806, it is determined whether any of the multiplied values falls within the frequent range of the data values in the same group that are neither clerical errors nor outliers. An example of the frequent range is the range from the first quartile to the third quartile. If the determination in step S1806 is yes, the process proceeds to step S1808; if no, step S1808 is skipped. In step S1808, the outlier detection result is updated from "general error" to "error due to digit error".
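Steps S1802 to S1808 can be sketched roughly as below. The function name, the sample sales values, and the decision to skip the factor 10^0 are assumptions made for illustration; the quartiles are computed with Python's `statistics.quantiles` using its default method, which is one of several possible quartile definitions.

```python
import statistics


def find_digit_correction(value, normal_values, exponents=range(-5, 6)):
    """Return a power-of-ten factor that moves `value` into the first-to-third
    quartile range of `normal_values`, or None if no factor does."""
    q1, _, q3 = statistics.quantiles(normal_values, n=4)
    for e in exponents:
        if e == 0:
            continue  # assumption: a factor of 1 would not be a correction
        factor = 10.0 ** e
        if q1 <= value * factor <= q3:
            return factor
    return None


# Hypothetical sales values (yen) of records in the same group that are
# neither clerical errors nor outliers.
normal_sales = [2500, 2800, 3000, 3200, 3500]
factor = find_digit_correction(30, normal_sales)
```

If a factor is found, the label would be updated from "general error" to "error due to digit error" and the factor kept for the later correction; if `None` is returned, the label stays unchanged.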

Returning to FIG. 5, in step S1810 the per-group outlier detection processing loop is completed. When the per-group loop has been completed for all groups, the process proceeds to step S1820. In step S1820, the series of outlier detection results is stored in the outlier information DB3. If an outlier detection result is "error due to digit error", the digit correction coefficient used at that time is stored in the outlier information DB3 together with the result.

When the processing of FIG. 5 ends, the process returns to FIG. 4 and proceeds to step S180. In step S180, the per-objective-variable outlier detection processing loop is completed. When the loop has been completed for all objective variables, namely the objective variable (denominator), the objective variable (numerator), and the objective variable (ratio), the series of processes in FIG. 4 ends.

The outlier cause estimation result may be displayed on a monitor provided in the output device 5700, which will be described later. FIG. 8 shows an example of a screen display of the outlier cause estimation results. The fields enclosed in dashed rectangular frames are the objective variables identified as outliers or clerical errors.

<Detailed Description of Learning Model Creation Processing in Learning Unit 20>
Next, the details of the learning model creation process executed in step S20 of FIG. 3 will be described using the flowchart of FIG. 9. In the flowchart of FIG. 9, the processing from step S200 to step S240 is performed separately for each objective variable: the objective variable (denominator), the objective variable (numerator), and the objective variable (ratio). In the learning model creation process, when the outlier detection process of step S10 has detected an outlier or a clerical error in the objective variable, the preprocessing and the loss function used for generating the learning model are set based on the obtained cause estimation result. On the other hand, when neither an outlier nor a clerical error has been detected in the objective variable, learning is performed as usual. In each learning run, as in ordinary AutoML, multiple learning algorithms may be tried, and hyperparameters may be optimized for each algorithm tried.

In step S200, the per-objective-variable model creation processing loop is started. In step S210, it is checked whether information on clerical errors or outliers of the objective variable is registered in the outlier information DB3. Step S220 is a routine for machine learning processing according to the content of the clerical errors and outliers, and the details of the processing are shown in FIG. 10. In the machine learning processing of step S220, if no information on clerical errors or outliers of the objective variable is registered in the outlier information DB3, learning is performed as usual.

(Detailed description of machine learning processing according to the content of clerical errors and outliers)
FIG. 10 is a flowchart showing the details of the machine learning processing according to the content of clerical errors and outliers. In step S2200, information on outliers, clerical errors, and digit correction coefficients for each objective variable is obtained from the outlier information DB3. In step S2210, each objective variable value labeled "error due to digit error" is multiplied by its corresponding digit correction coefficient to correct the value. For example, in the record of customer name C3 in FIG. 8, the sales value of "30 yen" has been determined to be a clerical error due to a digit mistake, and "10^2" is registered in the outlier information DB3 as the digit correction coefficient. Therefore, in step S2210, the value "30 yen" is corrected to "3,000 yen", which is obtained by multiplying it by "10^2".

In step S2220, it is determined whether the objective variable includes a record determined to be an "outlier". If so (yes), the process proceeds to step S2230; if not (no), it proceeds to step S2240. When there are outliers that are not clerical errors, robust regression is performed. Robust regression is needed for the following reasons.
・Some learning algorithms based on linear regression, tree models, and neural networks can be run with the loss function to be optimized swapped out.
・One example of a loss function is the mean squared error, but a model trained to optimize it preferentially reduces the error between predicted and actual values for the outliers. As a result, the error for non-outlier values does not become sufficiently small, and a model that predicts values other than outliers poorly is generated.
・For this reason, there are techniques that make it easier to generate a model that accurately predicts non-outlier values, for example by substituting a loss function whose value is less inflated by the presence of outliers. These techniques are collectively called robust regression.

Specific examples of loss functions used in robust regression include "Huber-loss", "pseudo-Huber-loss", and "τ-quantile loss". In step S2230, the type of robust regression loss function to be used in the model creation process is acquired from the setting information DB2, and the process then proceeds to step S2240.
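To make the difference between these loss functions concrete, the following sketch writes out their standard textbook forms for a single residual (actual value minus predicted value). The function names and the default `delta` and `tau` values are illustrative assumptions, not settings taken from the setting information DB2.

```python
import math


def squared_loss(residual: float) -> float:
    """Ordinary squared error (halved), which grows quadratically with the residual."""
    return 0.5 * residual ** 2


def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Quadratic near zero, linear beyond `delta`, so outliers weigh less."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a ** 2
    return delta * (a - 0.5 * delta)


def pseudo_huber_loss(residual: float, delta: float = 1.0) -> float:
    """Smooth approximation of the Huber loss."""
    return delta ** 2 * (math.sqrt(1.0 + (residual / delta) ** 2) - 1.0)


def quantile_loss(residual: float, tau: float = 0.5) -> float:
    """Pinball loss for the tau-quantile; residual = actual - predicted."""
    return max(tau * residual, (tau - 1.0) * residual)
```

For a residual of 10, the squared loss is 50 while the Huber loss with `delta = 1` is 9.5, which illustrates why a single outlier dominates training far less under a robust loss.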

In step S2240, it is determined whether the objective variable includes a record determined to be a clerical error other than "error due to digit error". If so (yes), the process proceeds to step S2250; if not (no), it proceeds to step S2260. In step S2250, the type of missing value imputation to be performed in the model creation process is obtained from the setting information DB2, and the process then proceeds to step S2260.

When clerical errors other than "error due to digit error" are included, namely "general error" and "error due to measurement error", the corresponding objective variable values are regarded as missing values, and one or more missing value imputation methods are tried as a countermeasure to create the learning model. Missing value imputation methods include, for example, excluding the records from the training data, imputing another value, and applying semi-supervised learning. Methods of imputing another value include imputing the median, the mean, or a separately specified value, in which case random noise may additionally be added to the imputed value. There are also imputation methods based on matrix factorization and imputation methods that apply regression-capable algorithms, such as missForest.
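As one possible illustration of the "impute another value" option, the sketch below replaces missing objective variable values (represented here as `None`) with the median of the observed values, optionally adding Gaussian noise. The function name, the `None` convention, and the noise model are assumptions made for this example.

```python
import random
import statistics


def impute_median(values, noise_scale=0.0, rng=None):
    """Replace None entries with the median of the observed values.

    `noise_scale` is the standard deviation of optional Gaussian noise
    added to each imputed value (0.0 means no noise).
    """
    observed = [v for v in values if v is not None]
    center = statistics.median(observed)
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    return [
        v if v is not None else center + rng.gauss(0.0, noise_scale)
        for v in values
    ]
```

Excluding the affected records instead would simply filter out the `None` entries before training; which option works best is left to the comparison of trained models.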

In step S2260, the loss function used when robust regression is not performed is acquired from the setting information DB2. In step S2270, machine learning is performed for each combination of the loss function types and missing value imputation types acquired as described above. For example, if three types of loss function and three types of missing value imputation have been acquired, there are nine (= 3 × 3) combinations of loss function and missing value imputation, and in step S2270 machine learning is performed with each combination. Ordinary learning for the case with no "outlier" is also included, as the combination of loss function (= ordinary squared error) × missing value imputation (= no imputation). In step S2280, the models obtained through the series of learning runs are stored in the model information DB4. As can be seen from FIG. 9, the machine learning of step S2270 is performed for each of the three objective variables: the objective variable (numerator), the objective variable (denominator), and the objective variable (ratio).
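The enumeration of training configurations in step S2270 amounts to a Cartesian product. In the sketch below, the option names are hypothetical placeholders for whatever types are read from the setting information DB2.

```python
from itertools import product

# Hypothetical option names; the actual types come from the setting information DB2.
loss_functions = ["squared_error", "huber", "pseudo_huber"]
imputations = ["none", "drop_record", "median"]

# Every (loss function, imputation) pair is one training configuration.
configs = list(product(loss_functions, imputations))

for loss_name, imputation_name in configs:
    # A real system would train and store one model per configuration here.
    pass
```

With three options on each side this yields nine configurations, and the pair `("squared_error", "none")` corresponds to the ordinary learning mentioned in the text.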

Returning to FIG. 9, in step S230 the models obtained in the individual learning runs are evaluated with common test data, and the most accurate model is stored in the model information DB4. In step S240, when the per-objective-variable learning loop has been completed for every objective variable, the series of model creation processes shown in FIG. 9 ends.

<Detailed description of learning model evaluation processing>
The details of the learning model evaluation process in step S30 of FIG. 3 will be described with reference to the flowchart of FIG. 11. In step S300, evaluation data is obtained from the learning data DB1. In step S310, the most accurate models for the objective variable (numerator), the objective variable (denominator), and the objective variable (ratio), which were stored in the model information DB4 in step S230, are each obtained from the model information DB4. Here, the acquired models are called the numerator prediction model, the denominator prediction model, and the ratio prediction model.

In step S320, using the evaluation data acquired in step S300, the values of the numerator, the denominator, and the ratio are estimated with the numerator prediction model, the denominator prediction model, and the ratio prediction model, respectively. In step S330, the accuracy of (numerator/denominator) (= accuracy 1) is compared with the accuracy of the ratio (= accuracy 2), and it is determined whether "accuracy 1 > accuracy 2" holds. If so, the process proceeds to step S340, where the numerator prediction model and the denominator prediction model are adopted as the best model and the decision is stored in the model information DB4. Otherwise, the process proceeds to step S350, where the ratio prediction model is adopted as the best model and the decision is stored in the model information DB4.
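The comparison in steps S330 to S350 can be sketched as follows. The text does not name the accuracy metric, so this example assumes the negative mean absolute error (higher is better); the function names and return labels are also illustrative.

```python
def mean_abs_error(pred, actual):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)


def choose_best_model(num_pred, den_pred, ratio_pred, ratio_actual):
    """Pick between the numerator/denominator pair and the ratio model."""
    composed = [n / d for n, d in zip(num_pred, den_pred)]
    accuracy_1 = -mean_abs_error(composed, ratio_actual)    # numerator / denominator
    accuracy_2 = -mean_abs_error(ratio_pred, ratio_actual)  # direct ratio prediction
    # As in step S330, the ratio model is kept unless accuracy 1 is strictly higher.
    if accuracy_1 > accuracy_2:
        return "numerator_and_denominator"
    return "ratio"
```

A tie falls to the ratio prediction model, matching the branch to step S350 when "accuracy 1 > accuracy 2" does not hold.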

In the embodiment described above, the case where the objective variable is expressed as a ratio has been described. However, even when the objective variable is not a ratio, for example when the sales shown in FIG. 5 is used as the objective variable, the two-stage outlier detection process shown in FIGS. 4 to 6 can be applied. In that case, the only objective variable in the processing from step S100 to step S180 in the flowchart of FIG. 5 is sales, and the two-stage outlier detection process shown in FIGS. 4 to 6 is performed for objective variable = sales. The learning model creation process of FIGS. 9 and 10 and the learning model evaluation process of FIG. 11 that follow the outlier detection process are likewise performed with sales as the objective variable. In other words, the two-stage outlier detection process, the learning model creation process, and the learning model evaluation process of this embodiment can be applied whether or not the objective variable is a ratio, which makes it possible to build a more accurate learning model.

<Computer Realizing Machine Learning System 1>
FIG. 12 is a hardware diagram of a computer 5000 that implements the machine learning system 1. In the computer 5000, a processor 5300 typified by a CPU (Central Processing Unit), a memory 5400 such as a RAM (Random Access Memory), an input device 5600 (for example, a keyboard, a mouse, or a touch panel), and an output device 5700 (for example, a video graphics card connected to an external display monitor) are interconnected via a memory controller 5500.

In the computer 5000, a program for realizing the machine learning system 1 is read from an external storage device 5800 such as an SSD or HDD via an I/O (Input/Output) controller 5200 and is executed through the cooperation of the processor 5300 and the memory 5400. The machine learning system 1 is thereby realized. Alternatively, the program for realizing the machine learning system 1 may be acquired from an external computer by communication via the network interface 5100, or may be read from a recording medium by a medium reader.

The embodiment of the present invention described above provides the following effects.

(C1) As shown in FIGS. 1 to 8, the machine learning system 1 includes an outlier detection unit 10 and a learning unit 20. The outlier detection unit 10 performs, on all records of the learning data, a first outlier detection process whose detection target is the objective variable that machine learning is to predict. For example, if the objective variable is sales, the first outlier detection process is performed with sales as the detection target. The outlier detection unit 10 also performs group processing that divides all the records R1, R2, R3, ... into, for example, a group of housewives, a group of office workers, and a group of part-time workers, based on the explanatory variable "occupation" included in the records R1 to R3. Furthermore, the outlier detection unit 10 performs, for each of the plurality of groups, a second outlier detection process with sales as the detection target. The outlier detection unit 10 then estimates the cause of each outlier based on the detection results of the first and second outlier detection processes. Finally, the learning unit 20 performs machine learning based on the result of estimating the causes of the outliers and creates a learning model that predicts sales.

As described above, the first outlier detection process is performed on all records of the learning data with sales as the detection target, the second outlier detection process is performed on each group of records with sales as the detection target, and the cause of each outlier is estimated based on this two-stage outlier detection. As a result, the estimation of outlier causes improves, and machine learning appropriate to the cause of each outlier becomes possible.
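The two-stage structure summarized above can be sketched as below. The detection rule itself is not specified in this passage, so the example substitutes the common 1.5 × IQR fence rule purely as an assumption; the function names, the dictionary-based records, and the sample values are likewise hypothetical. Each group needs at least two records for the quartiles to be defined.

```python
import statistics
from collections import defaultdict


def iqr_outlier_indices(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed detection rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return {i for i, v in enumerate(values) if v < low or v > high}


def two_stage_detection(records, group_key, target_key):
    """Run detection over all records, then again inside each group."""
    targets = [r[target_key] for r in records]
    global_flags = iqr_outlier_indices(targets)  # first outlier detection

    by_group = defaultdict(list)
    for idx, r in enumerate(records):
        by_group[r[group_key]].append(idx)

    group_flags = set()
    for indices in by_group.values():  # second outlier detection, per group
        local = iqr_outlier_indices([targets[i] for i in indices])
        group_flags |= {indices[j] for j in local}
    return global_flags, group_flags


# Hypothetical data: the value 50 is unremarkable overall but unusual within group "X".
records = (
    [{"job": "X", "sales": s} for s in (10, 11, 12, 13, 14, 15, 50)]
    + [{"job": "Y", "sales": s} for s in (100, 105, 110, 115, 120, 125, 130)]
)
global_flags, group_flags = two_stage_detection(records, "job", "sales")
```

In this sample, the per-group pass flags a record that the whole-data pass misses, which is the kind of difference the cause estimation relies on.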

(C2) As shown in FIGS. 1 to 8, when the objective variable is expressed as the ratio between a first variable (sales) and a second variable (call time) included in the records R1 to R3 of the learning data, the outlier detection unit 10 performs, on all records of the learning data, the first outlier detection process separately for each detection target, namely the objective variable (ratio), sales, and call time. Furthermore, the outlier detection unit 10 performs, for each of the plurality of groups, the second outlier detection process separately for each of the same detection targets. The learning unit 20 then performs machine learning based on the result of estimating the causes of the outliers and creates a learning model that predicts the ratio of sales to call time.

As described above, the first outlier detection process is performed on all records of the learning data with the ratio, the numerator of the ratio, and the denominator of the ratio as objective variables, the second outlier detection process is performed on each group of records with the same three objective variables, and the cause of each outlier is estimated based on this two-stage outlier detection. As a result, the cause of an outlier can be identified for each of the ratio, its numerator, and its denominator, and machine learning appropriate to the cause of each outlier becomes possible.

(C3) As shown in FIG. 5, in estimating the cause of an outlier, the outlier detection unit 10 estimates a detection target detected by the first outlier detection process to be an outlier, and estimates a detection target detected by the second outlier detection process to be a clerical error. Estimating causes in this way makes it possible to distinguish true outliers from clerical errors even when the data contains values that merely look like outliers mathematically, and to perform machine learning appropriate to each.

(C4) As shown in FIGS. 6 and 8, in estimating the cause of an outlier, the outlier detection unit 10 multiplies the detection target estimated to be a clerical error by each of a plurality of digit correction coefficients of different magnitudes, and estimates it to be a clerical error due to a digit mistake when a multiplied value falls within the frequent range of the same detection target in records of the same group that are not determined to be outliers.

For example, in the record of FIG. 8 whose customer name is C3, the sales value of 30 yen is determined to be an outlier. When that value is multiplied by each of a plurality of digit correction coefficients in ten-fold steps, such as 10^-5, 10^-4, ..., 10^4, 10^5, the value of 3,000 yen obtained by multiplying by 10^2 falls within the numerical range of sales not determined to be outliers. Therefore, the 30 yen is estimated to be a clerical error due to a digit mistake; the determination goes beyond a mere clerical error and identifies the cause as a digit mistake.

(C5) As shown in FIGS. 8 and 10, the outlier detection unit 10 multiplies a detection target estimated to be a clerical error due to a digit mistake (for example, the sales value of 30 yen in FIG. 8) by the digit correction coefficient (= 10^2) whose multiplication result falls within the frequent range, thereby correcting 30 yen to 3,000 yen (step S2210). The learning unit 20 then performs machine learning using the corrected sales value of 3,000 yen in place of the sales value of 30 yen. As a result, accuracy can be improved compared with the case where clerical errors due to digit mistakes are not corrected.

(C6) As shown in FIG. 10, when there is a detection target estimated to be an outlier, the learning unit 20 performs machine learning by robust regression to create the learning model. When outliers are present, ordinary machine learning that uses the mean squared error or the like as the loss function cannot ensure sufficient prediction accuracy, but performing machine learning by robust regression as described above improves prediction accuracy.

(C7) As shown in FIGS. 10 and 11, the learning unit 20 creates a numerator prediction model that predicts the objective variable (numerator), which is sales, a denominator prediction model that predicts the objective variable (denominator), which is call time, and a ratio prediction model that predicts the objective variable (ratio). The machine learning system 1 further includes a model evaluation unit 30 that uses evaluation data to obtain the predicted values of the numerator prediction model, the denominator prediction model, and the ratio prediction model, compares the accuracy of the ratio between the predicted value of the numerator prediction model and the predicted value of the denominator prediction model with the accuracy of the predicted value of the ratio prediction model, and adopts the more accurate prediction model as the learning model.

In this way, the numerator prediction model and the denominator prediction model are considered in addition to the ratio prediction model, and the learning model is chosen by comparing the accuracy of the ratio prediction model's predicted value with the accuracy of the ratio between the predicted values of the numerator and denominator prediction models. A more accurate learning model is therefore obtained.

(C8) As shown in FIGS. 1 to 8, the machine learning method of the present invention performs, on all records of the learning data, a first outlier detection process whose detection target is the objective variable that machine learning is to predict; divides all the records R1, R2, R3, ... into a group of housewives, a group of office workers, and a group of part-time workers based on the explanatory variable "occupation" included in the records R1 to R3; performs, for each of the plurality of groups, a second outlier detection process with the objective variable as the detection target; estimates the cause of each outlier based on the detection results of the first outlier detection process and the second outlier detection process; and performs machine learning based on the estimated causes of the outliers to create a learning model that predicts the objective variable.

Each embodiment described above is merely an example, and the present invention is not limited to these contents as long as the features of the invention are not impaired. Other aspects conceivable within the scope of the technical idea of the present invention are also included in the scope of the present invention. The above embodiments and modifications may be combined in part or in whole, provided that they remain mutually consistent and do not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS: 1... machine learning system, 10... outlier detection unit, 20... learning unit, 30... model evaluation unit, 5000... computer, 5700... output device, DB1... learning data database, DB2... setting information database, DB3... outlier information database, DB4... model information database

Claims

A machine learning system comprising:
a first outlier detection unit that performs a first outlier detection process on all records of learning data, with an objective variable that is a prediction target of machine learning as a detection target;
a group processing unit that divides all the records into a plurality of groups based on explanatory variables included in the records;
a second outlier detection unit that performs a second outlier detection process with the objective variable as a detection target for each of the groups;
a cause estimating unit that estimates the cause of the outlier based on the detection results of the first outlier detection unit and the second outlier detection unit; and
a model creating unit that creates a learning model for predicting the objective variable by performing machine learning based on the estimation result of the cause estimating unit.

In the machine learning system of claim 1,
The objective variable is expressed as a ratio of a first variable and a second variable included in the learning data record,
The first outlier detection unit performs, on all records of the learning data, the first outlier detection process with any one of the objective variable, the first variable, and the second variable as the detection target, separately for each detection target,
The second outlier detection unit performs, for each of the groups, the second outlier detection process with any one of the objective variable, the first variable, and the second variable as the detection target, separately for each detection target,
The machine learning system, wherein the model creating unit performs machine learning based on the estimation result of the cause estimating unit to create a learning model for predicting the ratio between the first variable and the second variable.

In the machine learning system of claim 1,
The machine learning system, wherein the cause estimating unit estimates a detection target detected by the first outlier detection unit to be an outlier, and estimates a detection target detected by the second outlier detection unit to be a clerical error.

In the machine learning system according to claim 3,
The machine learning system, wherein the cause estimating unit multiplies the detection target estimated to be a clerical error by each of a plurality of digit correction coefficients of different magnitudes, and estimates the detection target to be a clerical error due to a digit mistake when a multiplied value is within the frequent range of the same detection target in records of the same group that are not determined to be outliers.

In the machine learning system according to claim 4,
Further comprising a correction unit that corrects the detection target estimated to be a writing error due to a digit mistake by multiplying it by the digit correction coefficient whose multiplication result falls within the frequent occurrence range,
The machine learning system, wherein the model creation unit performs machine learning using the detection target corrected by the correction unit in place of the detection target estimated to be a writing error due to the digit mistake.
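
The digit-mistake check and the correction step can be sketched as below. The claims do not define the digit correction coefficients or the frequent occurrence range, so this sketch assumes powers of ten for the coefficients and takes the range to be the minimum and maximum of the non-outlier values in the same group. The function names are hypothetical.

```python
def find_digit_coefficient(value, group_normals,
                           coefficients=(0.001, 0.01, 0.1, 10, 100, 1000)):
    """Return the first coefficient that moves `value` into the group's
    frequent occurrence range, or None if no coefficient does.

    Assumptions: coefficients are powers of ten, and the frequent
    occurrence range is [min, max] of the group's non-outlier values.
    """
    low, high = min(group_normals), max(group_normals)
    for c in coefficients:
        if low <= value * c <= high:
            return c
    return None


def correct_value(value, group_normals):
    """Apply the digit correction if one is found; otherwise keep the value."""
    c = find_digit_coefficient(value, group_normals)
    return value * c if c is not None else value


# Hypothetical group whose non-outlier values cluster around 1000.
normals = [1000, 1020, 980, 1010, 990]
coefficient = find_digit_coefficient(100, normals)   # 100 looks like 1000 with a dropped digit
corrected = correct_value(100, normals)
not_a_digit_slip = find_digit_coefficient(5000, normals)
```

If the range is wide, more than one coefficient may fit; this sketch simply returns the first match, and a real implementation would need a rule for that case.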

In the machine learning system according to claim 3,
The machine learning system, wherein the model creation unit creates a learning model by performing machine learning based on robust regression when there is a detection target estimated to be an outlier by the cause estimation unit.
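
The claim names robust regression without specifying a method. As one illustration only, the sketch below swaps in the Theil-Sen estimator (median of pairwise slopes) for a single explanatory variable and contrasts it with ordinary least squares. The data and function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, median


def theil_sen(xs, ys):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


def least_squares(xs, ys):
    """Ordinary least-squares line fit, shown for comparison."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data on the line y = 2x + 1, with the last value
# replaced by an outlier (113 instead of 13).
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1, 3, 5, 7, 9, 11, 113]

robust_slope, robust_intercept = theil_sen(xs, ys)
ols_slope, ols_intercept = least_squares(xs, ys)
```

On this data the robust fit recovers the underlying line, while the least-squares slope is pulled far upward by the single outlier, which is the motivation for switching to a robust method when outliers remain in the learning data.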

In the machine learning system according to claim 2,
The model creation unit creates a first prediction model for predicting the first variable, a second prediction model for predicting the second variable, and a third prediction model for predicting the objective variable,
The machine learning system further comprising a model evaluation unit that obtains predicted values of the first prediction model, the second prediction model, and the third prediction model using evaluation data, compares the accuracy of the ratio between the predicted value of the first prediction model and the predicted value of the second prediction model with the accuracy of the predicted value of the third prediction model, and adopts the prediction model with higher accuracy as the learning model.
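
The comparison performed by the model evaluation unit can be sketched as follows. The claim does not name an accuracy metric, so mean absolute error is assumed here. The sketch also assumes the objective variable is the first variable divided by the second. The three models are stand-in callables and the field names are hypothetical.

```python
def mean_abs_error(predicted, actual):
    """Mean absolute error (assumed accuracy metric)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def select_model(model_first, model_second, model_direct, eval_rows):
    """Compare the ratio of two predictions against a direct prediction
    of the objective variable and report the more accurate approach."""
    # Assumption: objective variable = first variable / second variable.
    actual = [row["first"] / row["second"] for row in eval_rows]

    # A predicted second variable of zero would need separate handling.
    ratio_pred = [model_first(row) / model_second(row) for row in eval_rows]
    direct_pred = [model_direct(row) for row in eval_rows]

    err_ratio = mean_abs_error(ratio_pred, actual)
    err_direct = mean_abs_error(direct_pred, actual)
    if err_ratio <= err_direct:
        return "ratio_of_models", err_ratio
    return "direct_model", err_direct


# Hypothetical evaluation data and stand-in models.
eval_rows = [
    {"x": 1, "first": 10, "second": 2},
    {"x": 2, "first": 20, "second": 4},
    {"x": 3, "first": 30, "second": 6},
]
choice, error = select_model(
    model_first=lambda row: 10 * row["x"],   # predicts the first variable
    model_second=lambda row: 2 * row["x"],   # predicts the second variable
    model_direct=lambda row: 4.0,            # predicts the ratio directly
    eval_rows=eval_rows,
)
```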

performing a first outlier detection process on all records of the learning data, with the objective variable, which is the prediction target of machine learning, as the detection target;
dividing all the records into a plurality of groups based on explanatory variables contained in the records;
performing a second outlier detection process with the objective variable as a detection target for each of the groups;
estimating the cause of the outlier based on the detection results of the first outlier detection process and the second outlier detection process;
A machine learning method, wherein machine learning is performed based on the estimated causes of the outliers to create a learning model that predicts the objective variable.