JP2019159538A - Data set verification device, data set verification method, and data set verification program

Info

Publication number: JP2019159538A
Application number: JP2018042764A
Authority: JP
Inventors: 大和岡本; Yamato Okamoto; 五郎幡山; Goro Hatayama; 海虹張; Haihong Zhang; 丈嗣内藤; Joji Naito; 哲二大和; Tetsuji Yamato
Original assignee: Omron Corp; Omron Tateisi Electronics Co
Current assignee: Omron Corp
Priority date: 2018-03-09
Filing date: 2018-03-09
Publication date: 2019-09-19
Anticipated expiration: 2038-03-09
Also published as: JP6973197B2

Abstract

To suppress execution of a learning phase that would construct a statistical model with low generalization performance. SOLUTION: A data set generation unit 21 generates, from an original data set, a learning data set used for machine learning and an evaluation data set for evaluating an identification model obtained by that machine learning. A feature extraction unit 22 extracts features of a first data group belonging to the learning data set and features of a second data group belonging to the evaluation data set. A determination unit 23 determines whether the learning data set is appropriate based on the features of the first data group and the features of the second data group. SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for generating, from an original data set, a learning data set used for machine learning.

Conventionally, in machine learning, a statistical model (identification model) is constructed by repeating a learning phase and an evaluation phase. The learning phase is a step of constructing a statistical model using a given learning data set. The evaluation phase is a step of evaluating the statistical model constructed in the learning phase using a given evaluation data set. A learning device that constructs a statistical model by machine learning is disclosed in, for example, Patent Document 1.

In machine learning, a subset of the data group belonging to the collected original data set is extracted to form the learning data set, and the learning phase is executed using this learning data set. All or part of the data group that was not extracted as the learning data set is used to form the evaluation data set, and the evaluation phase is executed using this evaluation data set.

In many cases, to make effective use of the data group belonging to the original data set, the original data set is divided into two parts: a learning data set and an evaluation data set.

Patent Document 1: JP 2010-152751 A

However, if the learning data set consists of a data group with biased attributes extracted from the original data set, the learning phase constructs a statistical model for those biased attributes only. That is, the learning phase learns only a biased subset of the attributes, and some attributes remain unlearned. As a result, the learning phase constructs a statistical model with low generalization performance, one that cannot achieve a sufficient identification rate for the unlearned attributes. The attributes referred to here are properties of the data group, such as the types of events, the frequency of occurrence of events, and the occurrence tendency of events.

Thus, if the attributes of the data group extracted from the original data set as the learning data set are biased, a learning phase that constructs a statistical model with high generalization performance cannot be executed.

An object of the present invention is to provide a technique for suppressing execution of a learning phase that would construct a statistical model with low generalization performance.

To achieve the above object, the data set verification device according to the present invention is configured as follows.

A data set generation unit generates, from the original data set, a learning data set used for machine learning and an evaluation data set for evaluating an identification model obtained by machine learning using the learning data set. A feature extraction unit extracts features of a first data group belonging to the learning data set generated by the data set generation unit, and features of a second data group belonging to the evaluation data set generated by the data set generation unit. For example, the feature extraction unit extracts a likelihood function of the first data group as the feature of the first data group, and a likelihood function of the second data group as the feature of the second data group. A determination unit then determines whether the learning data set generated by the data set generation unit is appropriate, based on the features of the first data group and the features of the second data group extracted by the feature extraction unit. For example, if the features of the first data group and the features of the second data group are not similar, the determination unit determines that the learning data set generated by the data set generation unit is not appropriate.

When the learning data set consists of a data group with biased attributes extracted from the original data set, the similarity between the features of the data group belonging to the learning data set (the first data group) and the features of the data group belonging to the evaluation data set (the second data group) is low. Conversely, when the learning data set does not consist of a data group with biased attributes extracted from the original data set, the similarity between the features of the first data group and the features of the second data group is high. Therefore, by restricting execution of the learning phase with a learning data set that the determination unit has determined to be inappropriate, execution of a learning phase that would construct a statistical model with low generalization performance can be suppressed.

In addition, the accuracy of determining whether the learning data set consists of a data group with biased attributes extracted from the original data set can be increased by also using the features of a third data group belonging to the original data set. For example, the learning data set generated by the data set generation unit may be determined to be inappropriate if the features are not similar in any one of the three possible pairs of data groups chosen from the first data group, the second data group, and the third data group. Execution of a learning phase that would construct a statistical model with low generalization performance can thus be suppressed more reliably.
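The three-pair check described above can be illustrated with a minimal Python sketch. It assumes that each data group's feature has already been reduced to a discrete distribution over attribute values (a dict mapping value to probability). The function names, the use of total variation distance as the similarity measure, and the threshold of 0.1 are illustrative assumptions and are not specified in this document.

```python
from itertools import combinations


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means identical distributions, 1.0 means disjoint support.
    (Assumed similarity measure; the document does not prescribe one.)
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def dissimilar_pairs(features, threshold=0.1):
    """Return the names of every pair of data groups whose features differ
    by more than the (hypothetical) threshold."""
    return [
        (name_a, name_b)
        for (name_a, feat_a), (name_b, feat_b) in combinations(features.items(), 2)
        if total_variation(feat_a, feat_b) > threshold
    ]


def learning_set_is_appropriate(features, threshold=0.1):
    """The learning data set is judged appropriate only if no pair is dissimilar."""
    return not dissimilar_pairs(features, threshold)
```

With three groups named "first", "second", and "third", `dissimilar_pairs` examines the pairs (first, second), (first, third), and (second, third), so a single dissimilar pair is enough to reject the learning data set.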

Furthermore, whether the learning data set consists of a data group with biased attributes extracted from the original data set may be determined from the features of the first data group and the features of the third data group. This configuration can also suppress execution of a learning phase that would construct a statistical model with low generalization performance.

According to the present invention, execution of a learning phase that would construct a statistical model with low generalization performance can be suppressed.

A block diagram showing the configuration of the main part of an example machine learning system to which the data set verification device according to the present invention is applied.
A block diagram showing the configuration of the main part of the data set verification device.
A flowchart showing the operation of the data set verification device.
A flowchart showing the operation of a data set verification device according to another example.
A flowchart showing the operation of a data set verification device according to another example.

Embodiments of the present invention are described below.

<1. Application example>
FIG. 1 is a block diagram showing the configuration of the main part of an example machine learning system to which the data set verification device according to the present invention is applied. The machine learning system of this example includes a data set verification device 1, an identification model construction device 2, an identification model evaluation device 3, and an original data set storage database 4 (original data set storage DB 4).

The original data set storage DB 4 stores the original data set. The original data set is a collected data group. Each item of data is, for example, an N-dimensional real vector or a time series of N-dimensional real vectors. As a concrete example, the data may be face image data, in which case the original data set is a collection of face images of men and women of various ages. The data may instead be vehicle image data, in which case the original data set is a collection of images of vehicles of various types, such as two-wheeled vehicles, light vehicles, ordinary vehicles, trucks, and buses. The data may also be audio data, in which case the original data set is a collection of audio recordings of utterances by men and women of various ages. The type of data is determined by the type of identification model to be constructed.

The data set verification device 1 generates a learning data set and an evaluation data set from the original data set stored in the original data set storage DB 4. Specifically, the data set verification device 1 extracts a subset of the data group belonging to the original data set and uses it as the learning data set. The data set verification device 1 uses the data group that was not extracted as the learning data set as the evaluation data set. That is, in this example, the data set verification device 1 divides the data group belonging to the original data set into two, using one part as the learning data set and the other as the evaluation data set.

The data set verification device 1 determines whether the data group belonging to the learning data set is a data group with biased attributes extracted from the original data set. If it determines that the learning data set is such a biased extraction, the data set verification device 1 regenerates the learning data set and the evaluation data set. For example, it regenerates them by exchanging part of the data group of the learning data set with part of the data group of the evaluation data set. Alternatively, it regenerates them by executing the above-described process again, extracting a subset of the original data set as the learning data set and using the remaining data group as the evaluation data set.
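The first regeneration option, exchanging part of the learning data set with part of the evaluation data set, might look like the following Python sketch. The function name `swap_regenerate`, the parameter `n_swap`, and the random choice of which items to exchange are assumptions made for illustration; this document does not state how the exchanged items are selected.

```python
import random


def swap_regenerate(learning, evaluation, n_swap, rng=None):
    """Exchange n_swap items between the learning and evaluation data sets.

    The sizes of both data sets are preserved. Items to exchange are picked
    at random (an assumption; the selection rule is not given in the text).
    """
    if rng is None:
        rng = random.Random()
    n_swap = min(n_swap, len(learning), len(evaluation))

    # Indices of the items that leave each data set.
    out_idx = set(rng.sample(range(len(learning)), n_swap))
    in_idx = set(rng.sample(range(len(evaluation)), n_swap))

    moved_out = [x for i, x in enumerate(learning) if i in out_idx]
    moved_in = [x for i, x in enumerate(evaluation) if i in in_idx]

    new_learning = [x for i, x in enumerate(learning) if i not in out_idx] + moved_in
    new_evaluation = [x for i, x in enumerate(evaluation) if i not in in_idx] + moved_out
    return new_learning, new_evaluation
```

Because only items are exchanged, the two regenerated data sets together still contain exactly the data of the original data set.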

The data group belonging to the learning data set corresponds to the first data group referred to in the present invention, the data group belonging to the evaluation data set corresponds to the second data group, and the data group belonging to the original data set corresponds to the third data group.

The attributes referred to here are properties of the data, such as the types of events, the frequency of occurrence of events, and the occurrence tendency of events.

If the data set verification device 1 determines that the learning data set is not a data group with biased attributes extracted from the original data set, it supplies the learning data set to the identification model construction device 2 and supplies the evaluation data set to the identification model evaluation device 3.

The identification model construction device 2 executes a learning phase that constructs an identification model using the supplied learning data set. The identification model construction device 2 is a neural network that performs deep learning. The identification model construction device 2 outputs the constructed identification model to the identification model evaluation device 3.

The identification model evaluation device 3 executes an evaluation phase that evaluates the identification model constructed by the identification model construction device 2 using the supplied evaluation data set. The identification model evaluation device 3 outputs the evaluation result for that identification model.

Thus, in this machine learning system, when the learning data set is a data group with biased attributes extracted from the original data set, the identification model construction device 2 is not made to execute a learning phase using that learning data set. The identification model construction device 2 can therefore be kept from executing a learning phase that would construct a statistical model with low generalization performance. In other words, the identification model construction device 2 can be made to execute a learning phase that constructs a statistical model with high generalization performance.

<2. Configuration example>
FIG. 2 is a block diagram showing the configuration of the main part of the data set verification device. The data set verification device 1 includes a control unit 11, a database access unit 12 (DB access unit 12), a learning data set supply unit 13, and an evaluation data set supply unit 14.

The control unit 11 controls the operation of each part of the main body of the data set verification device 1. The control unit 11 includes a data set generation unit 21, a feature extraction unit 22, and a determination unit 23, which are described in detail later.

The DB access unit 12 is the interface to the original data set storage DB 4, a learning data set storage database 5 (learning data set storage DB 5), and an evaluation data set storage database 6 (evaluation data set storage DB 6). The data set verification device 1 reads data from and writes data to the original data set storage DB 4, the learning data set storage DB 5, and the evaluation data set storage DB 6 via the DB access unit 12. The learning data set storage DB 5 stores the learning data set. The evaluation data set storage DB 6 stores the evaluation data set.

The learning data set supply unit 13 supplies the learning data set stored in the learning data set storage DB 5 to the identification model construction device 2. The evaluation data set supply unit 14 supplies the evaluation data set stored in the evaluation data set storage DB 6 to the identification model evaluation device 3.

Next, the data set generation unit 21, the feature extraction unit 22, and the determination unit 23 of the control unit 11 are described.

The data set generation unit 21 extracts a subset of the data group of the original data set stored in the original data set storage DB 4 and uses the extracted subset as the learning data set. The data set generation unit 21 uses the data group of the original data set that was not extracted as the learning data set as the evaluation data set. That is, in this example, the data set generation unit 21 divides the original data set stored in the original data set storage DB 4 into two data groups, using one as the learning data set and the other as the evaluation data set.
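As a rough illustration of this two-way division, the following Python sketch splits a list standing in for the original data set. Random selection of the learning subset, the function name `split_original`, and the `learning_fraction` and `seed` parameters are assumptions; this document only states that a subset is extracted and the remainder becomes the evaluation data set.

```python
import random


def split_original(original, learning_fraction=0.5, seed=None):
    """Divide the original data set into a learning set and an evaluation set.

    Every item ends up in exactly one of the two sets. Which items go where
    is decided by a random shuffle (an assumed selection rule).
    """
    rng = random.Random(seed)
    indices = list(range(len(original)))
    rng.shuffle(indices)

    cut = int(len(indices) * learning_fraction)
    learning = [original[i] for i in indices[:cut]]
    evaluation = [original[i] for i in indices[cut:]]
    return learning, evaluation
```

A different `seed` yields a different division, which is one way to obtain a new split when the data sets need to be regenerated from scratch.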

The feature extraction unit 22 extracts the features of the learning data set and the features of the evaluation data set generated by the data set generation unit 21. The feature of the learning data set referred to here is the distribution of attributes, that is, of properties such as the types of events, the frequency of occurrence of events, and the occurrence tendency of events, over the data group belonging to the learning data set. Similarly, the feature of the evaluation data set is the distribution of such attributes over the data group belonging to the evaluation data set. The feature extraction unit 22 extracts, for example, a probability distribution function, a probability density function, or a likelihood function of the attributes as the feature.
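For a categorical attribute such as vehicle type, one simple realization of such a feature is the empirical probability distribution of the attribute values. The Python sketch below assumes each item of data has already been reduced to a single attribute label; the function name `attribute_distribution` is hypothetical, and how attributes are obtained from raw images or audio is outside the scope of this sketch.

```python
from collections import Counter


def attribute_distribution(labels):
    """Empirical probability distribution of attribute values in a data group.

    `labels` holds one attribute value per item of data (assumed to be
    available already). Returns a dict mapping value -> relative frequency.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Example: a data group of four vehicle images, two of which show cars.
feature = attribute_distribution(["car", "car", "truck", "bus"])
# feature == {"car": 0.5, "truck": 0.25, "bus": 0.25}
```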

The determination unit 23 compares the features of the learning data set with the features of the evaluation data set extracted by the feature extraction unit 22 and, based on their similarity, determines whether the learning data set is a data group with biased attributes extracted from the original data set. In this example, as described above, the data group of the original data set is divided into two data groups, one used as the learning data set and the other as the evaluation data set. Consequently, if the learning data set is a data group with biased attributes extracted from the original data set, the similarity between the features of the data group belonging to the learning data set and the features of the data group belonging to the evaluation data set is low. Conversely, if the learning data set is not such a biased extraction, the similarity between the two is high. Depending on the type of feature extracted by the feature extraction unit 22 for the learning data set and the evaluation data set, the determination unit 23 judges the similarity by the distance between the probability distributions or by the ratio of the probability densities.
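The two kinds of similarity measure mentioned above can be sketched in Python for discrete attribute distributions (dicts mapping value to probability). The Jensen-Shannon divergence is used here as one possible distance between probability distributions, and the largest probability ratio over all attribute values as one possible density-ratio criterion. Both choices, the function names, and the threshold `max_js` are assumptions; this document does not name a specific distance or threshold.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Ranges from 0.0 (identical) to 1.0 (disjoint support).
    """
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def max_density_ratio(p, q):
    """Largest ratio p(x)/q(x) or q(x)/p(x) over all attribute values.

    1.0 means identical distributions; infinity means some value occurs
    in only one of the two data groups.
    """
    worst = 1.0
    for k in set(p) | set(q):
        a, b = p.get(k, 0.0), q.get(k, 0.0)
        if a == 0.0 and b == 0.0:
            continue
        if a == 0.0 or b == 0.0:
            return math.inf
        worst = max(worst, a / b, b / a)
    return worst


def is_similar(p, q, max_js=0.05):
    """Judge similarity by distance; the threshold value is hypothetical."""
    return js_divergence(p, q) <= max_js
```

A density-ratio test would compare `max_density_ratio(p, q)` against a chosen upper bound in the same way; which measure suits a given feature type is left open here.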

The control unit 11 of the data set verification device 1 is composed of a hardware CPU, a memory, and other electronic circuits. When the hardware CPU executes the data set verification program according to the present invention, it operates as the data set generation unit 21, the feature extraction unit 22, and the determination unit 23. The memory has an area into which the data set verification program according to the present invention is loaded and an area for temporarily storing data generated while the program is executed. The control unit 11 may be an LSI in which the hardware CPU, the memory, and other components are integrated. The hardware CPU is the computer that executes the data set verification method according to the present invention.

The original data set storage DB 4, the learning data set storage DB 5, and the evaluation data set storage DB 6 may be auxiliary storage devices such as hard disk drives or solid state drives. They may be implemented on a single auxiliary storage device or on a plurality of auxiliary storage devices.

<3. Example of operation>
Next, the operation of the data set verification device 1 of this example is described. FIG. 3 is a flowchart showing the operation of the data set verification device. The data set verification device 1 executes the process shown in FIG. 3 when it receives an input instructing it to create data sets. The data set creation instruction is input to the data set verification device 1 by an operator's input operation on an operation unit (not shown) or by an input command from an external device.

The data set verification device 1 generates a learning data set and an evaluation data set (s1). Specifically, the data set generation unit 21 divides the data group stored in the original data set storage DB 4, which is connected via the DB access unit 12, into two, using one part as the learning data set and the other as the evaluation data set. The data set generation unit 21 may divide the data group stored in the original data set storage DB 4 into two equal parts or into two unequal parts. The data set generation unit 21 stores the generated learning data set in the learning data set storage DB 5 and stores the generated evaluation data set in the evaluation data set storage DB 6.

The data set verification device 1 extracts the features of the learning data set generated in s1 (s2). It also extracts the features of the evaluation data set generated in s1 (s3). The feature extraction unit 22 executes the processing of s2 and s3. As the feature of the learning data set, the feature extraction unit 22 extracts a probability distribution function, a probability density function, or a likelihood function of the attributes, that is, of properties such as the types of events, the frequency of occurrence of events, and the occurrence tendency of events, in the data group belonging to the learning data set generated in s1. As the feature of the evaluation data set, it extracts the corresponding probability distribution function, probability density function, or likelihood function of the attributes in the data group belonging to the evaluation data set generated in s1.

The feature of the learning data set extracted in s2 and the feature of the evaluation data set extracted in s3 are of the same type. The processing of s2 and s3 may be performed in the reverse order.

The data set verification device 1 determines whether the features of the learning data set extracted in s2 and the features of the evaluation data set extracted in s3 are similar (s4). The determination unit 23 makes the determination of s4. Depending on the type of feature extracted by the feature extraction unit 22 for the learning data set and the evaluation data set, the determination unit 23 determines whether they are similar by the distance between the probability distributions or by the ratio of the probability densities.

If the learning data set is a data group with biased attributes extracted from the original data set, the similarity between the features of the learning data set and the features of the evaluation data set is low. Conversely, if the learning data set is not such a biased extraction, the similarity between the two is high. Accordingly, when s4 determines that the features of the learning data set and the features of the evaluation data set are not similar, the learning data set is treated as a data group with biased attributes extracted from the original data set. When s4 determines that the features are similar, the learning data set is treated as not being such a biased extraction.

If the determination unit 23 determines that there is no similarity, the data set verification device 1 has the data set generation unit 21 regenerate the learning data set and the evaluation data set (s7). In s7, as in s1 described above, the data set generation unit 21 may divide the data group stored in the original data set storage DB 4 into two, using one part as the learning data set and the other as the evaluation data set. In this case, the data set generation unit 21 divides the data group stored in the original data set storage DB 4 differently from the previous time. Alternatively, the data set generation unit 21 may extract part of the data group of the learning data set that was generated in s1 and stored in the learning data set storage DB 5, extract part of the data group of the evaluation data set stored in the evaluation data set storage DB 6, and regenerate the learning data set and the evaluation data set by exchanging the extracted data groups.

After regenerating the learning data set and the evaluation data set in s7, the data set verification device 1 repeats the processing from s2 onward.

When the data set verification device 1 determines in s4 that there is similarity, the learning data set supply unit 13 supplies the learning data set stored in the learning data set storage DB 5 at that time to the identification model construction device 2 (s5). The evaluation data set supply unit 14 then supplies the evaluation data set stored in the evaluation data set storage DB 6 at that time to the identification model evaluation device 3 (s6), and this process ends. The order of s5 and s6 may be reversed.
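The flow of s1 through s7 can be summarized as a retry loop. The sketch below is not the implementation of the data set verification device 1. It assumes scalar data, uses the mean and population standard deviation as the feature, judges similarity with a fixed tolerance `tol`, and adds a `max_tries` limit that the description does not mention. All function and parameter names are hypothetical.

```python
import random
import statistics

def extract_feature(data):
    """s2, s3: assumed feature = (mean, population standard deviation)."""
    return statistics.fmean(data), statistics.pstdev(data)

def is_similar(f1, f2, tol):
    """s4: assumed criterion = every feature component differs by at most tol."""
    return all(abs(a - b) <= tol for a, b in zip(f1, f2))

def generate_verified_split(original, tol=0.5, max_tries=100, seed=0):
    """Return a (learning, evaluation) pair whose features are judged similar."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        shuffled = list(original)
        rng.shuffle(shuffled)  # s1 on the first pass, s7 on every retry
        half = len(shuffled) // 2
        learning, evaluation = shuffled[:half], shuffled[half:]
        if is_similar(extract_feature(learning), extract_feature(evaluation), tol):
            return learning, evaluation  # ready to be supplied in s5 and s6
    raise RuntimeError("no acceptable split found within max_tries")
```

Only a split that passes the check is returned, which mirrors the point that a biased learning data set is never supplied to the identification model construction device 2.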

The identification model construction device 2 performs machine learning using the learning data set supplied in s5 and constructs an identification model. The identification model evaluation device 3 evaluates the identification model constructed by the identification model construction device 2 using the evaluation data set supplied in s6, and outputs the evaluation result.

In this way, when the generated learning data set is a data group with biased attributes extracted from the original data set, the data set verification device 1 does not supply this learning data set to the identification model construction device 2. This suppresses the execution, in the identification model construction device 2, of a learning phase that constructs a statistical model with low generalization performance.

In addition, because the features of the learning data set and the evaluation data set are similar, the identification model evaluation device 3 can properly evaluate the identification model constructed by the identification model construction device 2.

<4. Modification>
Next, another example of the data set verification device 1 according to the present invention will be described. The machine learning system to which the data set verification device 1 of this example is applied also has the configuration shown in FIG. 1, and the data set verification device 1 of this example has the configuration shown in FIG. 2 described above. This example differs from the above example in that the data set verification device 1 executes the process shown in FIG. 4 instead of the process shown in FIG. 3. FIG. 4 is a flowchart showing the operation of the data set verification device 1 according to this example.

After executing the processing of s1 to s3 described above, the data set verification device 1 of this example extracts the features of the original data set stored in the original data set storage DB 4 (s11). The feature extraction in s11 is the same as in s2 and s3 described above, except that the target from which the features are extracted differs.

The data set verification device 1 determines whether there is similarity between the features of the learning data set and the features of the evaluation data set, which form the first combination (s12). The processing of s12 is the same as that of s4 in the example described above. When the data set verification device 1 determines in s12 that there is no similarity between the features of the learning data set and the features of the evaluation data set, it proceeds to s7. As described above, if the learning data set was obtained by extracting a data group with biased attributes from the original data set, the similarity between the features of the learning data set and the features of the evaluation data set is low.

When the data set verification device 1 determines in s12 that the features of the learning data set and the features of the evaluation data set, which form the first combination, are similar, it determines whether there is similarity between the features of the original data set and the features of the learning data set, which form the second combination (s13). The processing of s13 is the same as that of s4 in the example described above, except that the pair compared for similarity differs. When the data set verification device 1 determines in s13 that there is no similarity between the features of the original data set and the features of the learning data set, it proceeds to s7. If the learning data set was obtained by extracting a data group with biased attributes from the original data set, the similarity between the features of the original data set and the features of the learning data set is also low.

Further, when the data set verification device 1 determines in s13 that the features of the original data set and the features of the learning data set, which form the second combination, are similar, it determines whether there is similarity between the features of the original data set and the features of the evaluation data set, which form the third combination (s14). The processing of s14 is also the same as that of s4 in the example described above, except that the pair compared for similarity differs. When the data set verification device 1 determines in s14 that there is no similarity between the features of the original data set and the features of the evaluation data set, it proceeds to s7. If the learning data set was obtained by extracting a data group with biased attributes from the original data set, the similarity between the features of the original data set and the features of the evaluation data set is also low.

Note that the determinations in s12 to s14 are not limited to the above order and may be performed in any order.

When the data set verification device 1 determines in s14 that the features of the original data set and the features of the evaluation data set, which form the third combination, are similar, it performs the processing of s5 and s6 described above and ends this process.

In this way, in this example, the learning data set is judged not to be a data group with biased attributes extracted from the original data set only when the features are similar in each of three combinations: the original data set and the learning data set, the original data set and the evaluation data set, and the learning data set and the evaluation data set. This more reliably suppresses the execution, in the identification model construction device 2, of a learning phase that constructs a statistical model with low generalization performance.
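The three pairwise determinations of s12 to s14 amount to requiring similarity for every pair drawn from the three data sets. The following sketch illustrates this under assumptions: the feature is the pair (mean, population standard deviation) of scalar data, similarity is a component-wise tolerance check, and the names `extract_feature`, `is_similar`, and `split_is_appropriate` are hypothetical.

```python
from itertools import combinations
import statistics

def extract_feature(data):
    """Assumed feature: (mean, population standard deviation) of scalar data."""
    return statistics.fmean(data), statistics.pstdev(data)

def is_similar(f1, f2, tol):
    """Assumed criterion: every feature component differs by at most tol."""
    return all(abs(a - b) <= tol for a, b in zip(f1, f2))

def split_is_appropriate(original, learning, evaluation, tol=0.5):
    """Return True only if all three pairwise combinations are similar."""
    features = {
        "original": extract_feature(original),      # s11
        "learning": extract_feature(learning),      # s2
        "evaluation": extract_feature(evaluation),  # s3
    }
    # s12 to s14: the order of the checks does not affect the result.
    return all(
        is_similar(features[a], features[b], tol)
        for a, b in combinations(features, 2)
    )
```

Because `all` stops at the first failing pair, a single dissimilar combination is enough to send the flow to s7, in whichever order the pairs are examined.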

Also, in this example, it can be accurately determined whether the learning data set is a data group with biased attributes extracted from the original data set even when, in s1, the learning data set and the evaluation data set are generated by dividing the original data set into three data groups: one used as the learning data set, another used as the evaluation data set, and the remaining one belonging to neither (a data group used as neither the learning data set nor the evaluation data set).
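A three-way division of this kind might look like the sketch below. The ratios, the random shuffling, and the name `three_way_split` are assumptions made for illustration; the description does not specify how the three data groups are chosen.

```python
import random

def three_way_split(original, train_ratio=0.4, eval_ratio=0.3, seed=None):
    """Divide the original data set into learning, evaluation, and unused groups."""
    rng = random.Random(seed)
    shuffled = list(original)
    rng.shuffle(shuffled)
    n_learn = int(len(shuffled) * train_ratio)
    n_eval = int(len(shuffled) * eval_ratio)
    learning = shuffled[:n_learn]
    evaluation = shuffled[n_learn:n_learn + n_eval]
    unused = shuffled[n_learn + n_eval:]  # belongs to neither data set
    return learning, evaluation, unused
```

With an unused group, the learning and evaluation data sets together no longer cover the original data set, which is why comparing each of them against the original data set adds information beyond comparing them with each other.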

Furthermore, because the features of the learning data set and the evaluation data set are similar, the identification model evaluation device 3 can properly evaluate the identification model constructed by the identification model construction device 2.

A further example of the data set verification device 1 according to the present invention will now be described. The machine learning system to which the data set verification device 1 of this example is applied also has the configuration shown in FIG. 1, and the data set verification device 1 of this example has the configuration shown in FIG. 2 described above. This example differs from the above example in that the data set verification device 1 executes the process shown in FIG. 5 instead of the process shown in FIG. 3. FIG. 5 is a flowchart showing the operation of the data set verification device 1 according to this example.

The data set verification device 1 of this example differs in that it performs the processing of s21 and s22 in place of the processing of s3 and s4 shown in FIG. 3. In s21, the features of the original data set are extracted instead of the features of the evaluation data set. The processing of s21 is the same as that of s11 described above. The data set verification device 1 determines whether there is similarity between the features of the learning data set extracted in s2 and the features of the original data set extracted in s21 (s22). When the data set verification device 1 determines in s22 that there is no similarity between the features of the learning data set and the features of the original data set, it proceeds to s7. On the other hand, when it determines in s22 that the features of the learning data set and the features of the original data set are similar, it proceeds to s5 and s6.

In this example, the data set verification device 1 does not supply the learning data set to the identification model construction device 2 unless the features of the original data set and the features of the learning data set are similar. That is, the data set verification device 1 of this example also does not supply the learning data set to the identification model construction device 2 when the generated learning data set is a data group with biased attributes extracted from the original data set. This more reliably suppresses the execution, in the identification model construction device 2, of a learning phase that constructs a statistical model with low generalization performance.
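The branch taken after s22 can be sketched as a single comparison between the learning data set and the original data set. As before, the feature (mean and population standard deviation of scalar data), the tolerance `tol`, and the names `extract_feature` and `next_step` are assumptions for illustration only.

```python
import statistics

def extract_feature(data):
    """Assumed feature: (mean, population standard deviation) of scalar data."""
    return statistics.fmean(data), statistics.pstdev(data)

def next_step(original, learning, tol=5.0):
    """Return the label of the step that follows the determination in s22."""
    f_learning = extract_feature(learning)  # s2
    f_original = extract_feature(original)  # s21
    similar = all(abs(a - b) <= tol for a, b in zip(f_learning, f_original))  # s22
    return "s5, s6" if similar else "s7"

original = list(range(100))
print(next_step(original, original[:50]))  # lower half only: regenerate
print(next_step(original, original[::2]))  # every other item: supply the data sets
```

This variant never inspects the evaluation data set, so it checks only that the learning data set is representative of the original data set.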

Note that the present invention is not limited to the above embodiment as it is, and at the implementation stage the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

Furthermore, the correspondence between the configuration according to the present invention and the configuration according to the above-described embodiment can be described as in the following appendix.
<Appendix>
A data set verification device (1) comprising:
a data set generation unit (21) that generates, from an original data set, a learning data set used for machine learning and an evaluation data set for evaluating an identification model obtained by the machine learning using the learning data set;
a feature extraction unit (22) that extracts features of a first data group belonging to the learning data set generated by the data set generation unit (21) and features of a second data group belonging to the evaluation data set generated by the data set generation unit (21); and
a determination unit (23) that determines whether the learning data set generated by the data set generation unit (21) is appropriate, based on the features of the first data group and the features of the second data group extracted by the feature extraction unit (22).

DESCRIPTION OF SYMBOLS
1 ... Data set verification device
2 ... Identification model construction device
3 ... Identification model evaluation device
4 ... Original data set storage database (original data set storage DB)
5 ... Learning data set storage database (learning data set storage DB)
6 ... Evaluation data set storage database (evaluation data set storage DB)
11 ... Control unit
12 ... Database access unit (DB access unit)
13 ... Learning data set supply unit
14 ... Evaluation data set supply unit
21 ... Data set generation unit
22 ... Feature extraction unit
23 ... Determination unit

Claims

A data set verification device comprising:
a data set generation unit that generates, from an original data set, a learning data set used for machine learning and an evaluation data set for evaluating an identification model obtained by the machine learning using the learning data set;
a feature extraction unit that extracts features of a first data group belonging to the learning data set generated by the data set generation unit and features of a second data group belonging to the evaluation data set generated by the data set generation unit; and
a determination unit that determines whether the learning data set generated by the data set generation unit is appropriate, based on the features of the first data group and the features of the second data group extracted by the feature extraction unit.

The data set verification device according to claim 1, wherein the feature extraction unit extracts a likelihood function of the first data group as the feature of the first data group, and extracts a likelihood function of the second data group as the feature of the second data group.

The data set verification device according to claim 1 or 2, wherein the determination unit determines that the learning data set generated by the data set generation unit is not appropriate if there is no similarity between the features of the first data group and the features of the second data group.

The data set verification device according to claim 1, wherein the feature extraction unit also extracts features of a third data group belonging to the original data set, and
the determination unit determines whether the learning data set generated by the data set generation unit is appropriate, based on the features of the first data group, the features of the second data group, and the features of the third data group extracted by the feature extraction unit.

The data set verification device according to claim 4, wherein the feature extraction unit extracts a likelihood function of the first data group as the feature of the first data group, extracts a likelihood function of the second data group as the feature of the second data group, and extracts a likelihood function of the third data group as the feature of the third data group.

The data set verification device according to claim 4, wherein the determination unit determines that the learning data set generated by the data set generation unit is not appropriate if there is no similarity in features in any one of the combinations of two data groups selected from the first data group, the second data group, and the third data group.

A data set verification device comprising:
a data set generation unit that generates, from an original data set, a learning data set used for machine learning;
a feature extraction unit that extracts features of a first data group belonging to the learning data set generated by the data set generation unit and features of a third data group belonging to the original data set; and
a determination unit that determines whether the learning data set generated by the data set generation unit is appropriate, based on the features of the first data group and the features of the third data group extracted by the feature extraction unit.

The data set verification device according to claim 7, wherein the feature extraction unit extracts a likelihood function of the first data group as the feature of the first data group, and extracts a likelihood function of the third data group as the feature of the third data group.

The data set verification device according to claim 7 or 8, wherein the determination unit determines that the learning data set generated by the data set generation unit is not appropriate if there is no similarity between the features of the first data group and the features of the third data group.

The determination unit, when determining that the learning data set generated by the data set generation unit is not appropriate, causes the data set generation unit to regenerate the learning data set. The data set verification device described in 1.

A data set verification method in which a computer executes:
a data set generation step of generating, from an original data set, a learning data set used for machine learning and an evaluation data set for evaluating an identification model obtained by the machine learning using the learning data set;
a feature extraction step of extracting features of a first data group belonging to the learning data set generated in the data set generation step and features of a second data group belonging to the evaluation data set generated in the data set generation step; and
a determination step of determining whether the learning data set generated in the data set generation step is appropriate, based on the features of the first data group and the features of the second data group extracted in the feature extraction step.

A data set verification method in which a computer executes:
a data set generation step of generating, from an original data set, a learning data set used for machine learning;
a feature extraction step of extracting features of a first data group belonging to the learning data set generated in the data set generation step and features of a third data group belonging to the original data set; and
a determination step of determining whether the learning data set generated in the data set generation step is appropriate, based on the features of the first data group and the features of the third data group extracted in the feature extraction step.

A data set verification program that causes a computer to execute:
a data set generation step of generating, from an original data set, a learning data set used for machine learning and an evaluation data set for evaluating an identification model obtained by the machine learning using the learning data set;
a feature extraction step of extracting features of a first data group belonging to the learning data set generated in the data set generation step and features of a second data group belonging to the evaluation data set generated in the data set generation step; and
a determination step of determining whether the learning data set generated in the data set generation step is appropriate, based on the features of the first data group and the features of the second data group extracted in the feature extraction step.

A data set verification program that causes a computer to execute:
a data set generation step of generating, from an original data set, a learning data set used for machine learning;
a feature extraction step of extracting features of a first data group belonging to the learning data set generated in the data set generation step and features of a third data group belonging to the original data set; and
a determination step of determining whether the learning data set generated in the data set generation step is appropriate, based on the features of the first data group and the features of the third data group extracted in the feature extraction step.