JP7359829B2

JP7359829B2 - Machine learning systems and methods

Info

Publication number: JP7359829B2
Application number: JP2021202595A
Authority: JP
Inventors: 一樹山根; 敏之鵜飼; 光之助山本
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2021-12-14
Filing date: 2021-12-14
Publication date: 2023-10-11
Anticipated expiration: 2041-12-14
Also published as: JP2023087998A

Description

The present invention relates to a machine learning system and a machine learning method.

Conventionally, outlier detection methods that can detect outliers accurately (see, for example, Patent Document 1) and outlier factor estimation methods that make it easy to infer the factors behind outliers (see, for example, Patent Document 2) have been known.

Patent Document 1: Japanese Patent Application Publication No. 2015-082190
Patent Document 2: Japanese Patent No. 6719612

In a machine learning system, if the objective variable contains outliers, ordinary machine learning that uses a loss function such as the mean squared error struggles to ensure sufficient prediction accuracy. Furthermore, even when a value looks like an outlier mathematically, it may in fact be a mistake made during measurement or data entry, so appropriate measures cannot be taken unless the cause is estimated. In particular, when the objective variable is expressed as a ratio of two numbers, such as "cost effectiveness" or "sales per unit of call time," the outlier cannot be dealt with unless it is determined whether the numerator or the denominator is what makes the objective variable an outlier.

A machine learning system according to an aspect of the present invention includes: a first outlier detection unit that performs, on all records of learning data, a first outlier detection process whose detection target is an objective variable that is the prediction target of machine learning; a group processing unit that divides all the records into a plurality of groups based on explanatory variables included in the records; a second outlier detection unit that performs, for each group, a second outlier detection process whose detection target is the objective variable; a cause estimation unit that estimates the cause of an outlier based on the detection results of the first outlier detection unit and the second outlier detection unit; and a model creation unit that performs machine learning based on the estimation result of the cause estimation unit to create a learning model that predicts the objective variable.

According to the present invention, the estimation of outlier causes can be improved, and sufficient prediction accuracy can be ensured in machine learning.

FIG. 1 is a block diagram showing a configuration example of a machine learning system.
FIG. 2 is a diagram showing an example of learning data.
FIG. 3 is a flowchart showing the overall processing of the machine learning method in the machine learning system.
FIG. 4 is a flowchart showing details of the outlier detection process executed in step S10 of FIG. 3.
FIG. 5 is a flowchart showing details of the group-by-group outlier detection process in step S170 of FIG. 4.
FIG. 6 is a flowchart showing the detailed processing of outlier cause classification in step S1800 of FIG. 5.
FIG. 7 is a diagram illustrating the grouping of learning data.
FIG. 8 is a diagram showing an example of a screen display of the outlier cause estimation results.
FIG. 9 is a flowchart showing details of the learning model creation process executed in step S20 of FIG. 3.
FIG. 10 is a flowchart showing details of the machine learning process in step S220 of FIG. 9.
FIG. 11 is a flowchart showing details of the learning model evaluation process in step S30 of FIG. 3.
FIG. 12 is a hardware diagram of a computer that implements the machine learning system.

Embodiments of the present invention will be described below with reference to the drawings. The following description and drawings are examples for explaining the present invention, and parts are omitted or simplified as appropriate for clarity. In the following description, the same or similar elements and processes are denoted by the same reference numerals, and redundant explanations may be omitted. The content described below is merely one example of an embodiment of the present invention; the present invention is not limited to the embodiment described below and can be implemented in various other forms.

The machine learning system of this embodiment is well suited to cases where the objective variables to be predicted by machine learning include one expressed as a ratio of two numerical values, such as "cost effectiveness" or "sales per unit of call time." FIG. 1 is a block diagram showing a configuration example of a machine learning system 1 according to this embodiment. As databases related to machine learning, the system includes a learning data database DB1 (hereinafter referred to as learning data DB1) in which learning data, evaluation data, and the like are stored, a setting information database DB2 (hereinafter referred to as setting information DB2) in which setting information is stored, an outlier information database DB3 (hereinafter referred to as outlier information DB3) in which outlier information is stored, and a model information database DB4 (hereinafter referred to as model information DB4) in which information on learning models is stored. The system also includes an outlier detection unit 10, a learning unit 20, and a model evaluation unit 30, which perform machine learning processing based on the data in these databases.

FIG. 2 shows an example of the learning data stored in the learning data DB1, in the form of a table. The table consists of multiple columns and rows, and each row is called a record. Columns C1 to C5 store the customer name, age, occupation, call time, and sales amount, respectively. In this embodiment, customer name, age, occupation, call time, and sales amount are referred to as variables. Each of the records R1 to R3 consists of data for each variable (customer name, age, occupation, call time, and sales amount). For example, in record R1, the customer name is A, the age is 48, the occupation is housewife, the call time is 600 seconds, and the sales amount is 100,000 yen. Although FIG. 2 shows only three records R1 to R3, the learning data contains a large number of records.

Returning to FIG. 1, the setting information DB2 stores information on the probability distribution of the objective variable, information on digit correction coefficients, information on loss functions, information on missing value imputation, and the like. The outlier information DB3 stores in advance information on the domain of the objective variable and on periods during which measurement trouble occurred. As described later, the series of results from the outlier detection process (for example, outlier cause estimation results) is also stored in the outlier information DB3. As described later, the model information DB4 stores the models obtained through a series of learning runs and information about those models (for example, the loss function used and the corrections applied).

The outlier detection unit 10, the learning unit 20, and the model evaluation unit 30 perform the arithmetic processing involved in creating a learning model, and each performs the processing shown in the flowchart of FIG. 3. FIG. 3 is a flowchart showing the overall processing of the machine learning method in the system. This embodiment describes a case in which the learning data shown in FIG. 2 is stored in the learning data DB1 and the objective variable to be predicted by machine learning is the ratio = (sales amount)/(call time). When an objective variable is expressed as a ratio, the prediction accuracy of the objective variable becomes hard to ensure if at least one of the numerator and the denominator contains an incorrect value caused by a mistake in measurement or data entry. In particular, a value entered with the wrong number of digits often becomes a mathematical outlier. In addition, when machine learning is performed with the numerator and the denominator each used as an objective variable, a loss function that takes outliers into account must be set if either contains mathematical outliers. If an outlier is in fact the result of an entry error, however, what is needed is correction of the entry error rather than a different loss function. Therefore, to perform machine learning under appropriate settings both when the ratio itself is used directly as the objective variable and when the numerator and the denominator are used individually as objective variables, it is necessary to extract the outliers contained in the numerator and in the denominator and to estimate whether or not each was caused by an entry error.

Therefore, in this embodiment, statistical outlier detection is performed not only on the objective variable that is the prediction target (= the ratio) but also on the denominator and the numerator of that ratio. Furthermore, records with similar attributes in terms of "age" or "occupation," which are explanatory variables for the objective variable, are grouped together, and statistical outlier detection is performed again within each group. The cause of each outlier is then estimated based on the results of these two stages, and the preprocessing and loss function settings used when generating the learning model are determined based on the cause estimation results.

In step S10 of FIG. 3, the outlier detection unit 10 executes the two-stage outlier detection process described above and performs outlier cause estimation, which estimates what kind of outlier each detected value is (for example, a true outlier or an outlier caused by an entry error).

In step S20, the learning unit 20 sets the preprocessing and the loss function used to create the learning model, based on the cause estimation results obtained in step S10. Based on these settings, it then makes predictions both for the numerator and the denominator of the objective variable individually and for the objective variable itself (= the ratio). For example, it applies machine learning with entry errors corrected and machine learning that uses a loss function robust to outliers (robust regression), and adopts the most accurate model in each case.
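As a rough illustration of what a loss function robust to outliers can look like, the sketch below compares the squared loss with the Huber loss. The text does not name a specific robust loss, so the choice of Huber and the threshold `delta` are assumptions made only for this example.

```python
def squared_loss(residual: float) -> float:
    """Ordinary squared loss; grows quadratically with the residual."""
    return 0.5 * residual * residual


def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Huber loss (an assumed example of a robust loss, not named in the text).

    Quadratic for |residual| <= delta, linear beyond it, so a single
    extreme residual contributes far less than under the squared loss.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)
```

With `delta = 1.0`, a residual of 10 costs 50 under the squared loss but 9.5 under the Huber loss, which is why a model fitted with such a loss is pulled less strongly toward outliers.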

In step S30, the model evaluation unit 30 executes evaluation processing. For example, using test data, the ratio is computed from the separate prediction models for the numerator and the denominator obtained in step S20 (hereinafter referred to as model 1). In addition, the value of the prediction model for the ratio itself (hereinafter referred to as model 2) is calculated using the same test data. Of model 1 and model 2, the one with the higher prediction accuracy is adopted. Details of the processing performed by the outlier detection unit 10, the learning unit 20, and the model evaluation unit 30 are described later.
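The comparison in step S30 can be sketched as follows. The accuracy metric is not specified at this point in the text, so mean absolute error is used here as an assumption, and the function names are hypothetical.

```python
from typing import List, Tuple


def mean_absolute_error(y_true: List[float], y_pred: List[float]) -> float:
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def select_model(
    true_ratio: List[float],
    pred_numerator: List[float],
    pred_denominator: List[float],
    pred_ratio: List[float],
) -> Tuple[str, float]:
    """Pick between model 1 (ratio of separate predictions) and model 2
    (direct ratio prediction) on the same test data.

    Assumes every predicted denominator is non-zero.
    """
    model1_ratio = [n / d for n, d in zip(pred_numerator, pred_denominator)]
    err1 = mean_absolute_error(true_ratio, model1_ratio)
    err2 = mean_absolute_error(true_ratio, pred_ratio)
    # Ties go to model 1; this tie-break is an arbitrary choice.
    return ("model 1", err1) if err1 <= err2 else ("model 2", err2)
```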

<Detailed explanation of outlier detection processing in outlier detection unit 10>
FIG. 4 is a flowchart showing details of the outlier detection process executed in step S10 of FIG. 3. In this embodiment, not only the ratio, which is the objective variable, but also the denominator (= call time) and the numerator (= sales amount) of the ratio are targets of outlier detection, so all of them are referred to as objective variables. Hereinafter, the ratio, the denominator, and the numerator are referred to as objective variable (ratio), objective variable (denominator), and objective variable (numerator), respectively.

In the flowchart of FIG. 4, the processing from step S100 to step S180 is performed once for each objective variable. That is, steps S100 to S180 are performed first for the objective variable (denominator), then for the objective variable (numerator), and finally for the objective variable (ratio).

In step S100, the outlier detection loop over the objective variables is started. In step S110, the data in the column of the objective variable is acquired from the learning data. When the target is the objective variable (denominator) or the objective variable (numerator), it suffices to acquire the data in the call time column or the sales amount column of the table in FIG. 2. When the target is the objective variable (ratio), the call time and the sales amount are read from the table, and the value (sales amount)/(call time) calculated from them is used as the objective variable (ratio).
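The column acquisition in step S110 might look like the sketch below. The dictionary keys and the helper name are hypothetical; only the values of record R1 come from the text, and the other two records are made up for illustration.

```python
from typing import Dict, List, Optional

# R1 follows the example in the text; R2 and R3 are illustrative values only.
records = [
    {"customer": "A", "age": 48, "occupation": "housewife", "call_time": 600, "sales": 100000},
    {"customer": "B", "age": 35, "occupation": "office worker", "call_time": 300, "sales": 30000},
    {"customer": "C", "age": 22, "occupation": "part-time worker", "call_time": 120, "sales": 6000},
]


def target_columns(rows: List[dict]) -> Dict[str, List[Optional[float]]]:
    """Return the three detection targets: denominator, numerator, ratio."""
    denominator = [r["call_time"] for r in rows]
    numerator = [r["sales"] for r in rows]
    # The ratio is undefined for a zero call time; None marks that case.
    # How such records are handled is an assumption of this sketch.
    ratio = [r["sales"] / r["call_time"] if r["call_time"] else None for r in rows]
    return {"denominator": denominator, "numerator": numerator, "ratio": ratio}


targets = target_columns(records)
# The loop of FIG. 4 would then visit the targets in this order.
for name in ("denominator", "numerator", "ratio"):
    values = targets[name]  # input to the outlier detection for this target
```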

In step S120, it is checked whether probability distribution information for the objective variable of interest, for example information that the sales amount can be approximated by a normal distribution, is registered in the setting information DB2. In step S130, it is determined whether such probability distribution information exists; if it does (yes), the process proceeds to step S140, and if it does not (no), the process proceeds to step S150. In step S140, outlier detection process 1A, which assumes the registered probability distribution, is performed. For example, if a normal distribution is registered as the probability distribution information for the objective variable, outlier detection using the Smirnov-Grubbs test is performed. In step S150, on the other hand, outlier detection process 1B is performed. Process 1B does not assume any specific probability distribution and can therefore be applied even when the underlying distribution is unknown; one example is outlier detection using the interquartile range.
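A minimal sketch of a distribution-free detector of the kind used in process 1B is shown below, based on the interquartile range. The fence multiplier `k = 1.5` and the "inclusive" quantile method are conventional choices assumed here; the text does not specify them. Process 1A (the Smirnov-Grubbs test) is not sketched because its critical value requires a t-distribution quantile, which the Python standard library does not provide.

```python
import statistics
from typing import List


def iqr_outlier_flags(values: List[float], k: float = 1.5) -> List[bool]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    Returns one boolean per input value, in the original order.
    Requires at least two values.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]
```

For call times of 600, 620, 580, 610, 590, and 6000 seconds, only the 6000-second entry falls outside the fences and is flagged.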

In step S160, information on which values of the objective variable were outliers is stored in the outlier information DB3. Step S170 is a routine for group-by-group outlier detection; its details are described later using the flowchart of FIG. 5. In step S180, the outlier detection loop over the objective variables is completed. Once the series of processes from step S100 to step S180 has been completed for each of the objective variable (denominator), the objective variable (numerator), and the objective variable (ratio), the outlier detection process of FIG. 4 ends.

(Group-by-group outlier detection processing)
FIG. 5 is a flowchart showing details of the group-by-group outlier detection process in step S170 of FIG. 4. In step S1700, all the data is sorted into several groups based on explanatory variables, using clustering or a similar technique. In the case of the learning data shown in FIG. 2, each record has customer name, age, and occupation as explanatory variables, and the records can be grouped by, for example, occupation or age. When grouping by occupation, assuming that the data shown in FIG. 2 contains three occupations (housewife, office worker, part-time worker), the records are divided into three groups as shown in FIGS. 7(a) to 7(c). In the examples shown in FIGS. 7(a) to 7(c), the determination results of outlier detection process 1 (1A, 1B) and of outlier detection process 2, described later, are also listed to the right of column C5. Note that FIGS. 7(a) and 7(b) show the outlier detection results for the objective variable (denominator) = call time, while FIG. 7(c) shows the outlier detection results for the objective variable (numerator) = sales amount. Here the grouping is based on a single categorical explanatory variable, but a continuous explanatory variable may also be used by dividing its range into a plurality of intervals and grouping by interval. Grouping may also be based on two or more explanatory variables, using a technique such as clustering.
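The two simple grouping options described for step S1700 could be sketched as follows: grouping on one categorical variable, and grouping on a continuous variable split into intervals. The function names, record keys, and bin edges are hypothetical, and clustering over several variables is not shown.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_category(rows: List[dict], key: str) -> Dict[str, List[dict]]:
    """Group records by the value of one categorical explanatory variable."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def group_by_bins(rows: List[dict], key: str, edges: Sequence[float]) -> Dict[int, List[dict]]:
    """Group records by the interval a continuous variable falls into.

    `edges` must be sorted ascending. Bin 0 holds values below edges[0],
    bin i holds values with edges[i-1] <= value < edges[i], and the last
    bin holds values at or above edges[-1].
    """
    groups: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        groups[bisect.bisect_right(edges, row[key])].append(row)
    return dict(groups)
```

Each resulting group would then be passed separately to the group-level outlier detection.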

Although FIGS. 7(a) to 7(c) are simplified, outlier detection at the group level is also performed for each of the objective variable (numerator), the objective variable (denominator), and the objective variable (ratio). The determination results (detected or not detected) of outlier detection processes 1 and 2 are therefore obtained for each of the objective variable (numerator), the objective variable (denominator), and the objective variable (ratio).

In step S1710, the outlier information DB3 is searched for notes on the call time and sales amount data, for example information such as the domain (for example, the upper limit of call time) or periods of measurement trouble, and any such information found is acquired from the outlier information DB3. In step S1720, the group-by-group outlier detection loop (the processing from step S1720 to step S1810) is started. In the grouping example shown in FIGS. 7(a) to 7(c), the records are divided into three groups (housewife, office worker, part-time worker), and the loop from step S1720 to step S1810 is executed for each of the housewife group, the office worker group, and the part-time worker group.

In step S1730, outlier detection process 2 is performed on the objective variable. To distinguish it from outlier detection process 1 (1A, 1B) described above, which uses all records of the learning data, the group-by-group outlier detection is referred to as outlier detection process 2. A detailed description of outlier detection process 2 is omitted here; except that it is limited to the records within a group, it is the same as the processing of steps S110 to S160 in FIG. 4. However, outlier detection process 2 in step S1730 performs outlier detection that does not assume a specific probability distribution.

When the outlier detection loop of FIG. 4 is being executed for a given objective variable, the group-by-group outlier detection (step S170) within that iteration is executed for the same objective variable. For example, if the current iteration of the outlier detection loop in FIG. 4 is for the objective variable (denominator) = call time, the objective variable of outlier detection process 2 in step S1730 is also the objective variable (denominator) = call time. That is, the values of column C4 are acquired from the grouped learning data and outlier detection is performed on them.

In step S1740, it is determined whether there are any records that were detected as outliers in outlier detection process 1 (1A, 1B); if there are (yes), the process proceeds to step S1750, and if there are not (no), the process proceeds to step S1760. In step S1750, the outlier detection result of each such record is updated to "outlier," and the process then proceeds to step S1760. For example, in the examples shown in FIGS. 7(a) to 7(c), the records of customer names A1 to A3, B1, C1, and C2 were detected as outliers in outlier detection process 1 (1A, 1B), so the process proceeds from step S1740 to step S1750 and the outlier detection result of each of these records is updated to "outlier."

To explain the timing of these updates in more detail: for the records of customer names A1 to A3, the outlier detection result for call time is updated to "outlier" by steps S1740 and S1750 in the iteration where the objective variable is call time (= objective variable (denominator)) and the group is the housewife group. For the record of customer name B1, the outlier detection result for call time is updated to "outlier" by steps S1740 and S1750 in the iteration where the objective variable is call time (= objective variable (denominator)) and the group is the office worker group. For the records of customer names C1 and C2, the outlier detection result for sales amount is updated to "outlier" by steps S1740 and S1750 in the iteration where the objective variable is sales amount (= objective variable (numerator)) and the group is the part-time worker group.

In step S1760, it is determined whether there are any records that were detected as outliers in outlier detection process 2; if there are (yes), the process proceeds to step S1770, and if there are not (no), the process proceeds to step S1780. In step S1770, the outlier detection result of each such record is updated to "general error," and the process then proceeds to step S1780. For example, in the examples shown in FIGS. 7(a) to 7(c), the records of customer names B1 and C3 were detected as outliers in outlier detection process 2, so the process proceeds from step S1760 to step S1770 and the outlier detection result of each of these records is updated to "general error." Note that the outlier detection result for the record of customer name B1 was set to "outlier" in step S1750 but is updated again, to "general error," in step S1770.

In step S1780, it is determined whether the data of each record (call time, sales amount) includes data outside the domain or data measured during a period in which the measuring equipment had trouble; if it does (yes), the process proceeds to step S1790, and if it does not (no), the process proceeds to step S1800. In step S1790, the outlier detection result of the corresponding record is updated to "error due to measurement error," and the process then proceeds to step S1800. For example, if 3600 seconds is specified as the upper limit of the domain for call time, a value of 4800 seconds is determined to be an "error due to measurement error." Likewise, if the notes indicate that the data was taken during a period of measuring equipment trouble, the record is unconditionally determined to be an "error due to measurement error."
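The check in step S1780 can be sketched as below. FIG. 2 as described does not show a measurement date, so the `measured_on` argument and the representation of trouble periods as date pairs are assumptions of this example, as is the function name.

```python
from datetime import date
from typing import Iterable, Optional, Tuple


def is_measurement_error(
    value: float,
    measured_on: date,
    upper_limit: Optional[float] = None,
    trouble_periods: Iterable[Tuple[date, date]] = (),
) -> bool:
    """Return True if the value is outside the domain or was measured
    during a registered trouble period (both ends inclusive)."""
    if upper_limit is not None and value > upper_limit:
        return True
    return any(start <= measured_on <= end for start, end in trouble_periods)
```

With an upper limit of 3600 seconds, a call time of 4800 seconds returns True regardless of the date, matching the example in the text.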

In the examples shown in FIGS. 7(a) to 7(c), once the processing from step S1730 to step S1790 has been performed for the housewife group, the outlier detection results for the records of customer names A1 to A3 mark the call time data as "outlier." Once the processing from step S1730 to step S1790 has been performed for the office worker group, the outlier detection result for the record of customer name B1 marks the call time data as "general error." Likewise, once the processing from step S1730 to step S1790 has been performed for the part-time worker group, the outlier detection results for the records of customer names C1 and C2 mark the sales amount data as "outlier," and the outlier detection result for the record of customer name C3 marks the sales amount data as "general error." Step S1800 is a routine for outlier cause classification; its details are shown in FIG. 6.
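The order in which steps S1750, S1770, and S1790 overwrite the detection result can be summarized in a small sketch. The function name is hypothetical, and using `None` for a record that no step has labeled is an assumption, since the text does not name the initial state.

```python
from typing import Optional


def classify_record(
    detected_by_process_1: bool,
    detected_by_process_2: bool,
    measurement_error: bool,
) -> Optional[str]:
    """Apply the label updates in flowchart order; later steps overwrite earlier ones."""
    label: Optional[str] = None  # assumed initial state
    if detected_by_process_1:    # step S1750
        label = "outlier"
    if detected_by_process_2:    # step S1770
        label = "general error"
    if measurement_error:        # step S1790
        label = "error due to measurement error"
    return label
```

For the record of customer name B1, which is detected by both processes, the result is "general error," in line with the re-update described for step S1770.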

(Detailed explanation of outlier cause classification processing)
In step S1802 of FIG. 6, the digit correction coefficients are acquired from the setting information DB2. A digit correction coefficient is a factor by which data determined to be an error is multiplied in order to correct the number of digits of the data value, that is, the position of the decimal point. For example, values in steps of a factor of 10, such as 10^-5, 10^-4, ..., 10^4, 10^5, are stored in advance in the setting information DB2 as digit correction coefficients. In step S1804, the data value determined to be a "general error" is multiplied by a digit correction coefficient, and this multiplication is performed for each of the acquired digit correction coefficients.

In step S1806, it is determined whether any of the multiplied values falls within the frequent range of the values in the same group that are neither entry errors nor outliers. One example of the frequent range is the interval from the first quartile to the third quartile. If the determination in step S1806 is yes, the process proceeds to step S1808; if no, step S1808 is skipped. In step S1808, the outlier detection result is updated from "general entry error" to "entry error due to wrong digits".
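Steps S1802 to S1808 can be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: the function name `find_digit_correction`, the sample sales values, and the decision to skip the coefficient 10^0 are assumptions, and the frequent range is taken to be the first-to-third-quartile interval mentioned above.

```python
import statistics

def find_digit_correction(value, normal_values, exponents=range(-5, 6)):
    """Return the power-of-ten coefficient that moves `value` into the
    frequent range of `normal_values`, or None if no coefficient does."""
    # Frequent range assumed here: first quartile to third quartile.
    q1, _, q3 = statistics.quantiles(normal_values, n=4)
    for k in exponents:
        if k == 0:
            continue  # assumption: a coefficient of 1 is not a correction
        coefficient = 10.0 ** k
        if q1 <= value * coefficient <= q3:
            return coefficient
    return None

# Hypothetical sales values (yen) in one group that are neither errors nor outliers.
normal_sales = [2500, 2800, 3000, 3500, 4000]

label = "general entry error"
coefficient = find_digit_correction(30, normal_sales)
if coefficient is not None:
    # Corresponds to the label update in step S1808.
    label = "entry error due to wrong digits"
print(label, coefficient)
```

Returning the matching coefficient, rather than only a yes/no result, mirrors the later step in which the coefficient is stored together with the detection result.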

Returning to FIG. 5, step S1810 closes the per-group outlier detection loop. Once the loop has completed for all groups, the process proceeds to step S1820, where the series of outlier detection results is stored in the outlier information DB3. When a detection result is "entry error due to wrong digits", the digit correction coefficient that produced the match is stored in the outlier information DB3 together with the result.

When the process of FIG. 5 ends, the process returns to FIG. 4 and proceeds to step S180. Step S180 closes the per-objective-variable outlier detection loop. Once the loop has completed for all objective variables, namely the objective variable (denominator), the objective variable (numerator), and the objective variable (ratio), the series of processes in FIG. 4 ends.

The outlier cause estimation results may be displayed on a monitor provided in the output device 5700, described later. FIG. 8 shows an example of such a screen display. The fields enclosed in dashed rectangular frames are the objective variables judged to be outliers or entry errors.

<Detailed explanation of the learning model creation process in the learning unit 20>
Next, the learning model creation process executed in step S20 of FIG. 3 is described in detail using the flowchart of FIG. 9. In FIG. 9, steps S200 to S240 are performed separately for each objective variable: the objective variable (denominator), the objective variable (numerator), and the objective variable (ratio). If the outlier detection process of step S10 detected an outlier or entry error in an objective variable, the preprocessing and the loss function used to generate the learning model are set according to the cause estimation result. If neither an outlier nor an entry error was detected, learning is performed as usual. In either case, as in ordinary AutoML, several learning algorithms may be tried, and hyperparameters may be optimized for each algorithm tried.

Step S200 opens the per-objective-variable model creation loop. In step S210, it is checked whether entry-error or outlier information for the objective variable is registered in the outlier information DB3. Step S220 is the routine that performs machine learning according to the kind of entry error or outlier; its details are shown in FIG. 10. If no entry-error or outlier information for the objective variable is registered in the outlier information DB3, step S220 performs learning as usual.

(Detailed explanation of the machine learning process according to the kind of entry error or outlier)
FIG. 10 is a flowchart showing the details of the machine learning process according to the kind of entry error or outlier. In step S2200, the outlier, entry-error, and digit-correction-coefficient information for each objective variable is acquired from the outlier information DB3. In step S2210, each objective variable value labeled "entry error due to wrong digits" is multiplied by its corresponding digit correction coefficient to correct the value. For example, in the record for customer C3 in FIG. 8, the sales value of 30 yen is judged to be an entry error due to wrong digits, and 10^2 is registered in the outlier information DB3 as its digit correction coefficient. Step S2210 therefore corrects 30 yen to 3,000 yen, the value obtained by multiplying by 10^2.
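The correction in step S2210 amounts to one multiplication per affected record. The sketch below assumes a hypothetical in-memory row layout (`customer`, `sales`, `label`, `coefficient`) standing in for what would be read from the outlier information DB3; the value shown for customer C1 is invented for illustration.

```python
# Hypothetical rows standing in for data read from the outlier information DB3.
records = [
    {"customer": "C1", "sales": 9000000, "label": "outlier", "coefficient": None},
    {"customer": "C3", "sales": 30, "label": "entry error due to wrong digits",
     "coefficient": 10 ** 2},
]

def apply_digit_corrections(rows, field):
    """Return copies of rows in which digit-error values are multiplied
    by the digit correction coefficient stored with them."""
    corrected = []
    for row in rows:
        row = dict(row)  # copy so the input rows stay unchanged
        if row["label"] == "entry error due to wrong digits":
            row[field] = row[field] * row["coefficient"]
        corrected.append(row)
    return corrected

corrected = apply_digit_corrections(records, "sales")
print(corrected[1]["sales"])  # 30 yen corrected to 3000 yen
```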

In step S2220, it is determined whether any record in the objective variable is labeled "outlier". If so (yes), the process proceeds to step S2230; if not (no), it proceeds to step S2240. When there are outliers that are not entry errors, robust regression is performed, for the following reasons.
- Some learning algorithms based on linear regression, tree models, and neural networks allow the loss function being optimized to be swapped.
- One example of a loss function is the mean squared error, but training to minimize it preferentially reduces the prediction error on outliers. As a result, the prediction error on non-outlier values is not reduced enough, and the resulting model predicts non-outlier values poorly.
- There are therefore techniques, such as substituting a loss function whose value grows less readily in the presence of outliers, that make it easier to generate a model that accurately predicts non-outlier values. These techniques are collectively called robust regression.
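The effect described in these reasons can be seen in the simplest possible model, a single constant prediction. The constant that minimizes the mean squared error is the mean, while the constant that minimizes the mean absolute error is the median. The sample values below are hypothetical and the sketch is not part of the disclosed embodiment.

```python
import statistics

# Hypothetical objective-variable values; the last one is an outlier.
values = [10.0, 11.0, 9.0, 10.0, 1000.0]

def mse(constant, data):
    """Mean squared error of predicting `constant` for every value."""
    return sum((x - constant) ** 2 for x in data) / len(data)

def mae(constant, data):
    """Mean absolute error of predicting `constant` for every value."""
    return sum(abs(x - constant) for x in data) / len(data)

mean_fit = statistics.mean(values)      # minimizes the mean squared error
median_fit = statistics.median(values)  # minimizes the mean absolute error

# The squared-error fit is dragged toward the outlier; the absolute-error
# fit stays with the bulk of the data.
print(mean_fit, median_fit)
```

With these values the squared-error fit lands far from every non-outlier value, which is the failure mode that motivates a robust loss.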

Specific examples of loss functions used in robust regression include the Huber loss, the pseudo-Huber loss, and the τ-quantile loss. In step S2230, the type of robust-regression loss function to be used in the model creation process is acquired from the setting information DB2, and the process then proceeds to step S2240.
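For reference, the three loss functions named above can be written per residual as in the following sketch. These are the standard textbook definitions; the parameter names `delta` and `tau` and their default values are assumptions, since the embodiment does not specify them.

```python
import math

def huber_loss(residual, delta=1.0):
    """Quadratic near zero, linear beyond |residual| = delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a ** 2
    return delta * (a - 0.5 * delta)

def pseudo_huber_loss(residual, delta=1.0):
    """Smooth approximation of the Huber loss."""
    return delta ** 2 * (math.sqrt(1.0 + (residual / delta) ** 2) - 1.0)

def quantile_loss(residual, tau=0.5):
    """Pinball loss for the tau-quantile; residual = actual - predicted."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual

# Compare with half the squared error for a small and a large residual.
for r in (0.5, 100.0):
    print(r, 0.5 * r ** 2, huber_loss(r), pseudo_huber_loss(r), quantile_loss(r))
```

For the large residual, all three robust losses grow roughly linearly, whereas the squared error grows quadratically, so a single outlier contributes far less to the total loss.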

In step S2240, it is determined whether any record in the objective variable is labeled with an entry error other than "entry error due to wrong digits". If so (yes), the process proceeds to step S2250; if not (no), it proceeds to step S2260. In step S2250, the type of missing-value imputation to be used in the model creation process is acquired from the setting information DB2, and the process then proceeds to step S2260.

When the data contains entry errors other than "entry error due to wrong digits", namely "general entry error" and "entry error due to a measurement mistake", the affected objective variable values are treated as missing values, and one or more missing-value imputation methods are tried as a countermeasure when creating the learning model. Imputation methods include, for example, excluding the records from the training data, filling in another value, and applying semi-supervised learning. Filling in another value may use, for example, the median, the mean, or a separately specified value, and random noise may additionally be added to the filled-in value. Other options include imputation based on matrix factorization and imputation that applies a regression-capable algorithm such as missForest.
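The simpler options listed above (exclusion, and filling with the median, the mean, or a specified value, optionally with random noise) can be sketched as follows. The function name `impute_missing`, its parameters, and the use of `None` to mark a value treated as missing are assumptions made for illustration; semi-supervised learning, matrix factorization, and missForest are not covered.

```python
import random
import statistics

def impute_missing(values, method="median", fill_value=None, noise_sd=0.0, seed=0):
    """Handle None entries (objective values treated as missing).

    method: "drop", "median", "mean", or "constant" (uses fill_value).
    noise_sd: standard deviation of optional Gaussian noise added to filled values.
    """
    observed = [v for v in values if v is not None]
    if method == "drop":
        return observed
    if method == "median":
        fill = statistics.median(observed)
    elif method == "mean":
        fill = statistics.mean(observed)
    elif method == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown method: {method}")
    rng = random.Random(seed)
    result = []
    for v in values:
        if v is not None:
            result.append(v)
        else:
            noise = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0
            result.append(fill + noise)
    return result

# Hypothetical sales values; None marks a value judged to be a general entry error.
sales = [2500, None, 3500, 3000]
print(impute_missing(sales, "median"))
print(impute_missing(sales, "drop"))
```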

In step S2260, the loss function to be used when robust regression is not performed is acquired from the setting information DB2. In step S2270, machine learning is performed for each combination of the loss function types and missing-value imputation types acquired above. For example, if three loss function types and three imputation types were acquired, there are 9 (= 3 × 3) combinations, and step S2270 performs machine learning with each of them. Ordinary learning for the case without any "outlier" is also included, as the combination of loss function (= ordinary squared error) × missing-value imputation (= none). In step S2280, the models obtained from this series of training runs are stored in the model information DB4. As FIG. 9 also shows, the machine learning of step S2270 is performed for each of the three objective variables: the objective variable (numerator), the objective variable (denominator), and the objective variable (ratio).
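The enumeration of combinations in step S2270 can be sketched with a Cartesian product. The setting names below are hypothetical stand-ins for values that would be read from the setting information DB2, and `train` is a placeholder rather than a real training routine.

```python
from itertools import product

# Hypothetical settings; the real ones would come from the setting information DB2.
loss_functions = ["squared_error", "huber", "pseudo_huber"]
imputations = ["none", "median", "drop"]

def train(loss, imputation):
    """Placeholder for one training run; returns a description instead of a model."""
    return {"loss": loss, "imputation": imputation}

# One training run per (loss function, imputation) pair: 3 x 3 = 9 runs.
models = [train(loss, imp) for loss, imp in product(loss_functions, imputations)]
print(len(models))
```

The first combination, squared error with no imputation, corresponds to the ordinary learning case mentioned above.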

Returning to FIG. 9, in step S230 the models obtained from each training run are evaluated on common test data, and the most accurate model is stored in the model information DB4. In step S240, once the per-objective-variable learning loop has completed for every objective variable, the series of model creation processes shown in FIG. 9 ends.

<Detailed explanation of the learning model evaluation process>
The learning model evaluation process in step S30 of FIG. 3 is described in detail with reference to the flowchart of FIG. 11. In step S300, evaluation data is acquired from the learning data DB1. In step S310, the most accurate models for the objective variable (numerator), the objective variable (denominator), and the objective variable (ratio), which were stored in the model information DB4 in step S230, are each acquired from the model information DB4. These models are referred to below as the numerator prediction model, the denominator prediction model, and the ratio prediction model.

In step S320, the evaluation data acquired in step S300 is used to estimate the numerator, denominator, and ratio values with the numerator prediction model, the denominator prediction model, and the ratio prediction model, respectively. In step S330, the accuracy of (numerator / denominator) (= accuracy 1) is compared with the accuracy of the ratio (= accuracy 2) to determine whether accuracy 1 > accuracy 2. If so, the process proceeds to step S340, where the numerator prediction model and the denominator prediction model are adopted as the best model and the decision is stored in the model information DB4. Otherwise, the process proceeds to step S350, where the ratio prediction model is adopted as the best model and the decision is stored in the model information DB4.
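The comparison in steps S330 to S350 can be sketched as below. The embodiment does not state which accuracy metric is used, so this sketch assumes the mean absolute error, with a lower error treated as a higher accuracy; the function names and sample values are likewise hypothetical, and predicted denominators are assumed to be nonzero.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def choose_best_model(actual_ratio, pred_numerator, pred_denominator, pred_ratio):
    """Return which model(s) to adopt, treating a lower error as higher accuracy."""
    ratio_from_parts = [n / d for n, d in zip(pred_numerator, pred_denominator)]
    error_1 = mean_absolute_error(actual_ratio, ratio_from_parts)  # accuracy 1
    error_2 = mean_absolute_error(actual_ratio, pred_ratio)        # accuracy 2
    if error_1 < error_2:  # corresponds to "accuracy 1 > accuracy 2" (step S340)
        return "numerator and denominator prediction models"
    return "ratio prediction model"  # step S350

# Hypothetical evaluation data: sales per unit of call time.
actual_ratio = [100.0, 200.0, 150.0]
choice = choose_best_model(
    actual_ratio,
    pred_numerator=[3000.0, 4100.0, 4500.0],
    pred_denominator=[30.0, 20.0, 30.0],
    pred_ratio=[90.0, 210.0, 160.0],
)
print(choice)
```

As in step S330, a tie falls through to the ratio prediction model, because the numerator and denominator models are adopted only when their combined accuracy is strictly higher.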

The embodiment above describes the case where the objective variable is expressed as a ratio, but the two-stage outlier detection process shown in FIGS. 4 to 6 can also be applied when the objective variable is not a ratio, for example when the sales shown in FIG. 5 is the objective variable. In that case, sales is the only objective variable in steps S100 to S180 of the flowchart of FIG. 4, and the two-stage outlier detection process of FIGS. 4 to 6 is performed for the objective variable = sales. The learning model creation process of FIGS. 9 and 10 and the learning model evaluation process of FIG. 11 that follow the outlier detection process are likewise performed with sales as the objective variable. In other words, the two-stage outlier detection process, the learning model creation process, and the learning model evaluation process of this embodiment apply whether or not the objective variable is a ratio, and thereby allow a more accurate learning model to be built.

<Computer that implements the machine learning system 1>
FIG. 12 is a hardware diagram of a computer 5000 that implements the machine learning system 1. In the computer 5000, a processor 5300 such as a CPU (Central Processing Unit), a memory 5400 such as a RAM (Random Access Memory), an input device 5600 (for example, a keyboard, mouse, or touch panel), and an output device 5700 (for example, a video graphics card connected to an external display monitor) are interconnected via a memory controller 5500.

In the computer 5000, a program for implementing the machine learning system 1 is read from an external storage device 5800 such as an SSD or HDD via an I/O (Input/Output) controller 5200 and executed by the processor 5300 and the memory 5400 working together. The machine learning system 1 is thereby realized. Alternatively, the program may be obtained from an external computer by communication via a network interface 5100, or read from a recording medium by a medium reading device.

The embodiment of the present invention described above provides the following effects.

(C1) As shown in FIGS. 1 to 8, the machine learning system 1 includes an outlier detection unit 10 and a learning unit 20. The outlier detection unit 10 performs, on all records of the learning data, a first outlier detection process whose detection target is the objective variable to be predicted by machine learning. For example, if the objective variable is sales, the first outlier detection process targets sales. The outlier detection unit 10 also performs group processing that divides all records R1, R2, R3, ... into groups, for example a housewife group, a company-employee group, and a part-time-worker group, based on the explanatory variable "occupation" included in records R1 to R3. The outlier detection unit 10 further performs, for each of the groups, a second outlier detection process that targets sales, and estimates the cause of each outlier based on the detection results of the first and second outlier detection processes. The learning unit 20 then performs machine learning based on the estimated outlier causes to create a learning model that predicts sales.

As described above, the first outlier detection process is performed on all records of the learning data with sales as the detection target, the second outlier detection process is performed on each group of records with sales as the detection target, and the cause of each outlier is estimated from these two stages of detection. This improves outlier cause estimation and makes it possible to perform machine learning suited to the cause of each outlier.
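The two stages can be sketched as one detector applied first to all records and then within each group. This passage does not name the detection rule, so the sketch assumes the common interquartile-range fence (values outside Q1 - 1.5 IQR to Q3 + 1.5 IQR); the function names, field names, and sample records are hypothetical.

```python
import statistics
from collections import defaultdict

def iqr_outlier_indices(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return {i for i, v in enumerate(values) if v < low or v > high}

def two_stage_detection(records, target, group_key):
    """Return (indices flagged over all records, indices flagged within their group)."""
    global_hits = iqr_outlier_indices([r[target] for r in records])
    groups = defaultdict(list)
    for i, r in enumerate(records):
        groups[r[group_key]].append(i)
    group_hits = set()
    for members in groups.values():
        if len(members) < 2:
            continue  # statistics.quantiles needs at least two data points
        local = iqr_outlier_indices([records[i][target] for i in members])
        group_hits.update(members[j] for j in local)
    return global_hits, group_hits

# Hypothetical records: sales grouped by the explanatory variable "occupation".
housewife = [1000, 1100, 900, 1050, 950, 5000]
employee = [5000, 5200, 4800, 5100, 4900, 1000000]
records = ([{"occupation": "housewife", "sales": s} for s in housewife]
           + [{"occupation": "employee", "sales": s} for s in employee])

global_hits, group_hits = two_stage_detection(records, "sales", "occupation")
print(sorted(global_hits), sorted(group_hits))
```

In this sample, the housewife record with sales of 5000 is unremarkable across all records but stands out within its own group, which is the kind of difference between the two stages that the cause estimation relies on.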

(C2) As shown in FIGS. 1 to 8, when the objective variable is expressed as the ratio of a first variable (sales) to a second variable (call time) included in records R1 to R3 of the learning data, the outlier detection unit 10 performs, on all records of the learning data and separately for each detection target, the first outlier detection process with any one of the objective variable (ratio), sales, and call time as the detection target. The outlier detection unit 10 likewise performs, for each of the groups and separately for each detection target, the second outlier detection process with any one of the objective variable (ratio), sales, and call time as the detection target. The learning unit 20 then performs machine learning based on the estimated outlier causes to create a learning model that predicts the ratio of sales to call time.

As described above, the first outlier detection process is performed on all records of the learning data with the ratio, its numerator, and its denominator each as an objective variable, the second outlier detection process is performed on each group of records with the ratio, its numerator, and its denominator each as an objective variable, and the cause of each outlier is estimated from these two stages of detection. As a result, the cause of an outlier can be identified for each of the ratio, its numerator, and its denominator, making it possible to perform machine learning suited to that cause.

(C3) As shown in FIG. 5, in estimating the cause of an outlier, the outlier detection unit 10 estimates a detection target detected by the first outlier detection process to be an outlier, and a detection target detected by the second outlier detection process to be an entry error. Estimating causes in this way makes it possible to distinguish true outliers from entry errors even when the data contains values that merely look like outliers mathematically, and to perform machine learning suited to each.

(C4) As shown in FIGS. 6 and 8, in estimating the cause of an outlier, the outlier detection unit 10 multiplies a detection target estimated to be an entry error by each of a plurality of digit correction coefficients of different magnitudes. If a multiplied value falls within the frequent range of the same detection target among the records in the same group that are not judged to be outliers, the detection target is estimated to be an entry error due to wrong digits.

For example, in the record for customer C3 in FIG. 8, the sales value of 30 yen is judged to be an outlier. When that value is multiplied by each of a plurality of digit correction coefficients in powers of ten, such as 10^-5, 10^-4, ..., 10^4, 10^5, the 3,000 yen obtained with 10^2 falls within the numerical range of sales values not judged to be outliers. It can therefore be estimated that 30 yen is an entry error due to wrong digits; the judgment goes beyond a mere entry error and identifies wrong digits as its cause.

(C5) As shown in FIGS. 8 and 10, the outlier detection unit 10 multiplies a detection target estimated to be an entry error due to wrong digits (for example, the sales value of 30 yen in FIG. 8) by the digit correction coefficient (= 10^2) whose product falls within the frequent range, correcting 30 yen to 3,000 yen (step S2210). The learning unit 20 then performs machine learning using the corrected sales value of 3,000 yen in place of the 30 yen estimated to be an entry error due to wrong digits. As a result, accuracy can be improved compared with the case where entry errors due to wrong digits are left uncorrected.

(C6) As shown in FIG. 10, when there is a detection target estimated to be an outlier, the learning unit 20 creates the learning model by machine learning with robust regression. When outliers are present, ordinary machine learning with a loss function such as the mean squared error can no longer ensure sufficient prediction accuracy, but machine learning with robust regression as described above can improve prediction accuracy.

(C7) As shown in FIGS. 10 and 11, the learning unit 20 creates a numerator prediction model that predicts the objective variable (numerator), which is sales, a denominator prediction model that predicts the objective variable (denominator), which is call time, and a ratio prediction model that predicts the objective variable (ratio). The machine learning system 1 further includes a model evaluation unit 30 that obtains the predicted values of the numerator prediction model, the denominator prediction model, and the ratio prediction model using evaluation data, compares the accuracy of the ratio of the numerator prediction model's predicted value to the denominator prediction model's predicted value with the accuracy of the ratio prediction model's predicted value, and adopts the more accurate prediction model as the learning model.

In this way, the numerator prediction model and the denominator prediction model are considered in addition to the ratio prediction model, and the learning model is chosen by comparing the accuracy of the ratio prediction model's predicted value with the accuracy of the ratio of the numerator prediction model's predicted value to the denominator prediction model's predicted value. A more accurate learning model is therefore obtained.

(C8) As shown in FIGS. 1 to 8, the machine learning method of the present invention performs, on all records of the learning data, a first outlier detection process whose detection target is the objective variable to be predicted by machine learning; divides all records R1, R2, R3, ... into a housewife group, a company-employee group, and a part-time-worker group based on the explanatory variable "occupation" included in records R1 to R3; performs, for each of the groups, a second outlier detection process whose detection target is the objective variable; estimates the cause of each outlier based on the detection results of the first and second outlier detection processes; and performs machine learning based on the estimated outlier causes to create a learning model that predicts the objective variable.

The embodiments described above are merely examples, and the present invention is not limited to their contents as long as the features of the invention are not impaired. Other aspects conceivable within the scope of the technical idea of the present invention are also included in the scope of the present invention. The above embodiment and modifications may also be combined in part or in whole, within a mutually consistent range and without departing from the spirit of the present invention.

Description of symbols: 1...machine learning system, 10...outlier detection unit, 20...learning unit, 30...model evaluation unit, 5000...computer, 5700...output device, DB1...learning data database, DB2...setting information database, DB3...outlier information database, DB4...model information database

Claims

A machine learning system comprising:
a first outlier detection unit that performs, on all records of learning data, a first outlier detection process whose detection target is an objective variable to be predicted by machine learning;
a group processing unit that divides all the records into a plurality of groups based on an explanatory variable included in the records;
a second outlier detection unit that performs, for each of the groups, a second outlier detection process whose detection target is the objective variable;
a cause estimation unit that estimates the cause of an outlier based on the detection results of the first outlier detection unit and the second outlier detection unit; and
a model creation unit that performs machine learning based on the estimation result of the cause estimation unit to create a learning model that predicts the objective variable.

The machine learning system according to claim 1, wherein
the objective variable is expressed as a ratio of a first variable to a second variable included in the records of the learning data,
the first outlier detection unit performs, on all records of the learning data and separately for each detection target, the first outlier detection process with any one of the objective variable, the first variable, and the second variable as the detection target,
the second outlier detection unit performs, for each of the groups and separately for each detection target, the second outlier detection process with any one of the objective variable, the first variable, and the second variable as the detection target, and
the model creation unit performs machine learning based on the estimation result of the cause estimation unit to create a learning model that predicts the ratio of the first variable to the second variable.

The machine learning system according to claim 1, wherein
the cause estimation unit estimates a detection target detected by the first outlier detection unit to be an outlier, and estimates a detection target detected by the second outlier detection unit to be an entry error.

The machine learning system according to claim 3, wherein
the cause estimation unit multiplies the detection target estimated to be a typographical error by each of a plurality of digit correction coefficients of different magnitudes, and estimates the typographical error to be due to a digit error when a value after the multiplication falls within a frequently occurring range of the same detection target in records of the same group that are not determined to be outliers.
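
The digit-error check can be sketched as follows. The claim does not define how the frequently occurring range is computed, so this sketch assumes it is the minimum-to-maximum span of the non-outlier values in the same group. The powers of ten used as digit correction coefficients and the name `estimate_digit_error` are likewise hypothetical.

```python
def estimate_digit_error(
    value,
    normal_values,
    coefficients=(0.001, 0.01, 0.1, 10, 100, 1000),
):
    """Return the first coefficient that moves `value` into the normal range.

    `normal_values` are the same detection target taken from records of the
    same group that were not flagged as outliers. Returns None when no
    coefficient fits, meaning a digit error is not suspected.
    """
    # Assumed definition of the frequently occurring range: min to max.
    lo, hi = min(normal_values), max(normal_values)
    for c in coefficients:
        if lo <= value * c <= hi:
            return c
    return None
```

A returned coefficient of 0.1, for example, suggests that one extra digit was entered. Multiplying the suspect value by the returned coefficient gives a candidate corrected value.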

The machine learning system according to claim 4,
further comprising a correction unit that corrects the detection target estimated to be a typographical error due to a digit error by multiplying it by the digit correction coefficient whose multiplication result falls within the frequently occurring range, wherein
the model creation unit performs machine learning using the detection target corrected by the correction unit in place of the detection target estimated to be a typographical error due to the digit error.

The machine learning system according to claim 3, wherein
the model creation unit creates the learning model by performing machine learning using robust regression when there is a detection target estimated to be an outlier by the cause estimation unit.
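
The claim names robust regression without fixing a particular method. As one possible instance, the sketch below uses the Theil-Sen estimator for a single explanatory variable. This choice and the name `theil_sen` are assumptions, not the method of the specification.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Fit y = slope * x + intercept with the Theil-Sen estimator.

    The slope is the median of all pairwise slopes, so a small number of
    extreme objective-variable values has little influence on the fit.
    """
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip vertical pairs, whose slope is undefined
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept
```

With five points on the line y = 2x and one extreme sixth point, the fitted line stays at slope 2, whereas a least-squares fit that minimizes mean squared error would be pulled toward the extreme point.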

The machine learning system according to claim 2, wherein
the model creation unit creates a first prediction model that predicts the first variable, a second prediction model that predicts the second variable, and a third prediction model that predicts the objective variable,
the machine learning system further comprising a model evaluation unit that uses evaluation data to calculate predicted values of the first prediction model, the second prediction model, and the third prediction model, compares the accuracy of the ratio between the predicted value of the first prediction model and the predicted value of the second prediction model with the accuracy of the predicted value of the third prediction model, and adopts the prediction model with the higher accuracy as the learning model.
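
The comparison performed by the model evaluation unit can be sketched as below. The claim does not name an accuracy measure, so mean absolute error is an assumption, as is treating the first variable as the numerator of the ratio. The names `mean_abs_error` and `select_model` and the returned labels are hypothetical.

```python
def mean_abs_error(predicted, actual):
    """Mean absolute error, used here as an assumed accuracy measure."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def select_model(pred_first, pred_second, pred_third, actual_ratio):
    """Compare the ratio of two component predictions with a direct prediction.

    pred_first and pred_second are predictions of the two component
    variables, and pred_third is the direct prediction of the objective
    variable, all computed on the same evaluation data.
    """
    ratio_pred = [a / b for a, b in zip(pred_first, pred_second)]
    err_ratio = mean_abs_error(ratio_pred, actual_ratio)
    err_direct = mean_abs_error(pred_third, actual_ratio)
    # Ties go to the direct model (an arbitrary choice for this sketch).
    choice = "ratio_of_models" if err_ratio < err_direct else "direct_model"
    return choice, err_ratio, err_direct
```

The lower error decides which approach is adopted as the learning model. Any other accuracy measure could be substituted for `mean_abs_error` without changing the selection logic.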

A machine learning method comprising:
performing, on all records of learning data, a first outlier detection process whose detection target is an objective variable that is a prediction target of machine learning;
dividing all the records into a plurality of groups based on explanatory variables included in the records;
performing, for each of the groups, a second outlier detection process whose detection target is the objective variable;
estimating a cause of an outlier based on detection results of the first outlier detection process and the second outlier detection process; and
performing machine learning based on the estimated cause of the outlier to create a learning model that predicts the objective variable.