JP5733229B2

JP5733229B2 - Classifier creation device, classifier creation method, and computer program

Info

Publication number: JP5733229B2
Application number: JP2012023418A
Authority: JP
Inventors: 政典塩谷; 森　純一; 純一森; 政二福岡; 知弘梅田; 周一高谷; 俊二鈴木; 勝昭多田; 統原田; 健司鳥飼; 悠内田; 盛夫野田; 宗之前田
Original assignee: Nippon Steel Corp
Current assignee: Nippon Steel Corp
Priority date: 2012-02-06
Filing date: 2012-02-06
Publication date: 2015-06-10
Anticipated expiration: 2032-02-06
Also published as: JP2013161298A

Description

The present invention relates to a classifier creating apparatus, a classifier creating method, and a computer program, and is particularly suitable for automatically constructing, from learning data whose classes are known, a classifier that determines which of two classes given data belongs to.

Classifiers, which are devices that determine, based on a certain rule, which of two classes given data belongs to, have long been used in various industries. In manufacturing, for example, a classifier is used to automatically determine, from sensor measurement data of a product, whether the product is non-defective. In the medical field, for example, a classifier is used to automatically determine from test results whether a certain disease is strongly suspected.

Automatically creating the rules of such a classifier from a large number of learning data whose classes are known is called supervised learning. When a classifier is created by supervised learning, a highly accurate classifier is often not obtained if the composition ratio of the classes to which the learning data belong is extremely biased. For example, suppose a rule for determining from sensor measurement data whether a product is non-defective is created so that the accuracy rate (the proportion of cases in which the predicted class matches the actual class) becomes high. Because the number of defective products is generally far smaller than the number of non-defective products, the result may be a classifier that always predicts the class to which the relatively large number of learning data belong (non-defective), or one that predicts that class with extremely high frequency. In the following description, “the class to which a relatively small number of learning data belong” is referred to as the “minority class” as necessary, and “the class to which a relatively large number of learning data belong” is referred to as the “majority class” as necessary.
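The effect described above can be illustrated with a small sketch. The data are hypothetical (990 non-defective and 10 defective products), and the function names are introduced here only for illustration: a classifier that always answers "good" scores a high accuracy rate while missing every defective product.

```python
def accuracy(actual, predicted):
    """Fraction of items whose predicted class matches the actual class."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


def minority_recall(actual, predicted, minority="defective"):
    """Fraction of minority-class items that are predicted as minority."""
    caught = sum(a == minority and p == minority
                 for a, p in zip(actual, predicted))
    total = sum(a == minority for a in actual)
    return caught / total


# Hypothetical inspection data: 990 non-defective products, 10 defective ones.
actual = ["good"] * 990 + ["defective"] * 10
# A degenerate classifier that always predicts the majority class.
predicted = ["good"] * 1000

print(accuracy(actual, predicted))         # 0.99
print(minority_recall(actual, predicted))  # 0.0
```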

When a classifier is created using learning data whose class composition ratio is extremely biased in this way, there are many cases in which one wants to avoid, as far as possible, judging data that should belong to the minority class as belonging to the majority class (so-called missing of the minority class). One such case is where judging some non-defective products to be defective is acceptable, but judging a defective product to be non-defective must be avoided.

However, as described above, learning algorithms for classifier rules are usually designed to make the accuracy rate as high as possible. Because misses of the minority class and misses of the majority class are then weighed on the same scale, a demand such as the one in the above case cannot be met, and a highly accurate classifier often cannot be obtained.
For this reason, classifiers are created after processing (sampling) the learning data. Specifically, in the method called oversampling, at least part of the minority-class learning data is duplicated so that the number of minority-class learning data becomes N times (N > 1) its original number, and the classifier rules are created using the minority-class learning data thus increased together with the majority-class learning data. Conversely, in the method called undersampling, part of the majority-class learning data is deleted so that the number of majority-class learning data becomes 1/M times (M > 1) its original number, and the classifier rules are created using the majority-class learning data thus reduced together with the minority-class learning data. Here, N and M are called “weights”.
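The two sampling methods can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the inputs are non-empty lists, and the rule for handling a fractional weight (whole copies first, then a random subset for the remainder) is an assumption of this sketch rather than something the text specifies.

```python
import random


def oversample(minority, n_weight):
    """Duplicate minority items so that about n_weight times as many remain.

    Assumption: whole copies of the list are made first, and any remainder
    is filled with a random subset drawn without replacement.
    """
    if n_weight < 1:
        raise ValueError("weight N must be at least 1")
    target = round(len(minority) * n_weight)
    full_copies, remainder = divmod(target, len(minority))
    return minority * full_copies + random.sample(minority, remainder)


def undersample(majority, m_weight):
    """Delete majority items at random so that about 1/m_weight of them remain."""
    if m_weight < 1:
        raise ValueError("weight M must be at least 1")
    target = round(len(majority) / m_weight)
    return random.sample(majority, target)


minority = list(range(10))    # hypothetical minority-class records
majority = list(range(100))   # hypothetical majority-class records
print(len(oversample(minority, 2.5)))  # 25
print(len(undersample(majority, 4)))   # 25
```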

When a classifier is built using such an oversampling or undersampling method, its performance depends on the values of the weights N and M, so the weight values must be determined appropriately.
In the technique described in Non-Patent Document 1, the value obtained by dividing the number of majority-class learning data by the number of minority-class learning data is used as the weight N in the oversampling method (N = number of learning data belonging to the majority class / number of learning data belonging to the minority class). By making the number of minority-class learning data equal to the number of majority-class learning data in this way, the technique of Non-Patent Document 1 corrects the bias in the composition ratio of the classes to which the learning data belong.

In the technique described in Patent Document 1, the average number P of the minority class and the majority class is calculated. The value obtained by dividing P by the number of minority-class learning data is used as the weight N (N = P / number of learning data belonging to the minority class), and the reciprocal of the value obtained by dividing P by the number of majority-class learning data is used as the weight M (1/M = P / number of learning data belonging to the majority class).
In the technique of Patent Document 1, the number of minority-class learning data and the number of majority-class learning data are both brought to their average number P, thereby correcting the bias in the composition ratio of the classes to which the learning data belong.
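As a worked illustration of the two prior-art rules, the following sketch computes the weights for a hypothetical data set with 900 majority-class and 100 minority-class records. The function names and the counts are assumptions made for this example only.

```python
def weight_non_patent_document_1(n_majority, n_minority):
    """N = (number of majority-class data) / (number of minority-class data)."""
    return n_majority / n_minority


def weights_patent_document_1(n_majority, n_minority):
    """Return (N, M) that bring both classes to their average count P."""
    p = (n_majority + n_minority) / 2
    n_weight = p / n_minority   # N = P / number of minority-class data
    m_weight = n_majority / p   # from 1/M = P / number of majority-class data
    return n_weight, m_weight


print(weight_non_patent_document_1(900, 100))  # 9.0
print(weights_patent_document_1(900, 100))     # (5.0, 1.8)
```

With these hypothetical counts, the first rule raises the minority class from 100 to 900 records, while the second rule brings both classes to P = 500.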

Patent Document 1: JP 2010-204966 A

Non-Patent Document 1: Yasutaka Kamei, Akito Monden, Kenichi Matsumoto, “Application of Oversampling Methods to Fault-proneness Models” (in Japanese), 3rd Workshop of the Software Reliability Study Group, pp. 97-103, July 2006.
Non-Patent Document 2: Kazunori Yamaguchi and Junichi Takahashi, “Basics and Mechanisms of Multivariate Analysis” (in Japanese), Shuwa System, 2004.

Patent Document 1 and Non-Patent Document 1 described above rest on the assumption that a high-performance classifier is obtained if the imbalance between the number of learning data belonging to the minority class and the number belonging to the majority class is eliminated.
However, this assumption rarely holds. When a classifier is created using learning data with a biased class composition ratio, examining the distribution of the attributes of those data often shows that the minority-class learning data are distributed within the majority-class learning data. It is therefore necessary to create a classifier that accurately picks out the minority class from within the majority class.

FIG. 9 is a diagram showing an example of the attributes of learning data with a biased class composition ratio.
In FIG. 9, ○ indicates minority-class learning data and × indicates majority-class learning data.
In the example shown in FIG. 9, the minority class is distributed only within the dotted circle. Therefore, majority-class data located far from this dotted circle (the majority-class data indicated by × on the left side of FIG. 9) are unnecessary for training the classifier. Increasing the minority-class learning data by oversampling until their number equals the number of majority-class learning data is thus clearly redundant.

On the other hand, uniformly reducing the majority-class learning data by undersampling until their number equals the number of minority-class learning data also goes too far, because it unnecessarily deletes majority-class learning data inside the dotted circle. That is, such learning data are deleted even though they are needed to define the boundary with the minority-class learning data.

Based on this new finding, the present inventors came to recognize that the number of minority-class learning data after sampling may be equal to or less than the number of majority-class learning data. In other words, they recognized that the optimum values of the weights N and M exist in the range from “1” to “the number of learning data belonging to the majority class / the number of learning data belonging to the minority class”, inclusive.
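The range stated above can be written compactly. In the block below, n_maj and n_min are shorthand introduced only for this note (the numbers of majority-class and minority-class learning data), and N_r denotes their ratio.

```latex
\[
1 \le N \le N_r, \qquad N_r = \frac{n_{\mathrm{maj}}}{n_{\mathrm{min}}}
\quad\Longrightarrow\quad
n_{\mathrm{min}} \;\le\; N\, n_{\mathrm{min}} \;\le\; n_{\mathrm{maj}},
\]
\[
1 \le M \le N_r
\quad\Longrightarrow\quad
n_{\mathrm{min}} \;\le\; \frac{n_{\mathrm{maj}}}{M} \;\le\; n_{\mathrm{maj}} .
\]
```

Read this way, any weight in the range leaves the resampled minority class no larger than the majority class, and the resampled majority class no smaller than the minority class.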

The present invention has been made based on this recognition. Its object is, when a classifier for determining which of two classes given data belongs to is automatically constructed using learning data to which classes have been assigned, to sample the learning data without excess or deficiency and to construct a classifier that can classify data with high accuracy, even when the number of learning data belonging to one class is far smaller than the number belonging to the other class.

The classifier creating apparatus of the present invention creates new learning data by changing the number of learning data, each of which is known to belong to one of two classes, through at least one of (a) increasing the number of learning data of the minority class, which is the class to which a relatively small number of learning data belong, by a factor corresponding to a weight, and (b) decreasing the number of learning data of the majority class, which is the class to which a relatively large number of learning data belong, by a factor corresponding to the reciprocal of the weight, and uses the new learning data to create a classifier for determining which of the two classes given data belongs to. The apparatus comprises: temporary weight determination means for setting an upper limit value and a lower limit value of the weight within the range of values that are 1 or more and not more than the value obtained by dividing the number of majority-class learning data by the number of minority-class learning data, and for determining, from the range between the lower limit value and the upper limit value, a plurality of temporary weights whose values differ from one another; learning data sampling means for creating a plurality of sets of the new learning data by performing, for each of the plurality of temporary weights, at least one of increasing the number of minority-class learning data by a factor corresponding to the temporary weight and decreasing the number of majority-class learning data by a factor corresponding to the reciprocal of the temporary weight; learning means for obtaining a plurality of temporary classifiers by creating, for each set of the new learning data, a temporary classifier for determining which of the two classes given learning data belongs to, using the new learning data whose number has been changed by the learning data sampling means; evaluation value calculation means for obtaining a plurality of evaluation values by calculating, for each of the plurality of temporary classifiers, an evaluation value that evaluates the performance of the temporary classifier obtained by the learning means, based on the result of classifying the learning data into one of the two classes with that temporary classifier; optimum weight derivation means for obtaining the relationship between evaluation value and weight using the evaluation values obtained by the evaluation value calculation means and the temporary weights used to obtain them, and for deriving, as the optimum value of the weight, the weight that corresponds to the largest evaluation value in the obtained relationship within the range from the lower limit value to the upper limit value; and classifier creation means for creating the new learning data by performing, with the optimum value used as the weight, at least one of increasing the number of minority-class learning data by a factor corresponding to the weight and decreasing the number of majority-class learning data by a factor corresponding to the reciprocal of the weight, and for using the new learning data to create a classifier for determining which of the two classes given learning data belongs to.

The classifier creating method of the present invention creates new learning data by changing the number of learning data, each of which is known to belong to one of two classes, through at least one of (a) increasing the number of learning data of the minority class, which is the class to which a relatively small number of learning data belong, by a factor corresponding to a weight, and (b) decreasing the number of learning data of the majority class, which is the class to which a relatively large number of learning data belong, by a factor corresponding to the reciprocal of the weight, and uses the new learning data to create a classifier for determining which of the two classes given data belongs to. The method comprises: a temporary weight determination step of setting an upper limit value and a lower limit value of the weight within the range of values that are 1 or more and not more than the value obtained by dividing the number of majority-class learning data by the number of minority-class learning data, and of determining, from the range between the lower limit value and the upper limit value, a plurality of temporary weights whose values differ from one another; a learning data sampling step of creating a plurality of sets of the new learning data by performing, for each of the plurality of temporary weights, at least one of increasing the number of minority-class learning data by a factor corresponding to the temporary weight and decreasing the number of majority-class learning data by a factor corresponding to the reciprocal of the temporary weight; a learning step of obtaining a plurality of temporary classifiers by creating, for each set of the new learning data, a temporary classifier for determining which of the two classes given learning data belongs to, using the new learning data whose number has been changed in the learning data sampling step; an evaluation value calculation step of obtaining a plurality of evaluation values by calculating, for each of the plurality of temporary classifiers, an evaluation value that evaluates the performance of the temporary classifier obtained in the learning step, based on the result of classifying the learning data into one of the two classes with that temporary classifier; an optimum weight derivation step of obtaining the relationship between evaluation value and weight using the evaluation values obtained in the evaluation value calculation step and the temporary weights used to obtain them, and of deriving, as the optimum value of the weight, the weight that corresponds to the largest evaluation value in the obtained relationship within the range from the lower limit value to the upper limit value; and a classifier creation step of creating the new learning data by performing, with the optimum value used as the weight, at least one of increasing the number of minority-class learning data by a factor corresponding to the weight and decreasing the number of majority-class learning data by a factor corresponding to the reciprocal of the weight, and of using the new learning data to create a classifier for determining which of the two classes given learning data belongs to.

The computer program of the present invention causes a computer to function as each means of the classifier creating apparatus.

According to the present invention, an upper limit value and a lower limit value of the weight are set within the range of values that are 1 or more and not more than the number of majority-class learning data divided by the number of minority-class learning data. Within the range between these limits, the weight corresponding to the highest evaluation value of the classifier is taken as the optimum value of the weight, and the learning data are sampled using this optimum value to create the classifier. Therefore, even when the number of learning data belonging to one class is far smaller than the number belonging to the other class, the learning data can be sampled without excess or deficiency, and a classifier that can classify data with high accuracy can be constructed.
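The selection of the optimum weight can be sketched as below. This is a loose illustration, not the method of the invention as such: the text up to this point does not state how the relationship between evaluation value and weight is obtained, so fitting a parabola through three (weight, evaluation value) points and scanning a grid is purely an assumption of this sketch. The callables `resample`, `train`, and `evaluate` are hypothetical placeholders for the sampling, learning, and evaluation stages.

```python
def score_temporary_weights(temporary_weights, resample, train, evaluate):
    """Build one temporary classifier per temporary weight and score each."""
    scores = []
    for weight in temporary_weights:
        new_learning_data = resample(weight)   # hypothetical sampling stage
        classifier = train(new_learning_data)  # hypothetical learning stage
        scores.append(evaluate(classifier))    # hypothetical evaluation stage
    return scores


def quadratic_through(points):
    """Return f(x) for the parabola through three (x, y) points (Lagrange form)."""
    (x0, y0), (x1, y1), (x2, y2) = points

    def f(x):
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))

    return f


def derive_optimum_weight(weights, scores, lower, upper, steps=1000):
    """Return the grid point in [lower, upper] with the largest fitted score.

    Assumption: exactly three distinct weights, fitted with a parabola.
    """
    f = quadratic_through(list(zip(weights, scores)))
    grid = [lower + (upper - lower) * i / steps for i in range(steps + 1)]
    return max(grid, key=f)


# Hypothetical evaluation values for temporary weights 1, 5, and 9.
best = derive_optimum_weight([1, 5, 9], [0.2, 0.6, 0.4], lower=1, upper=9)
print(round(best, 2))
```

When the fitted scores keep rising over the whole range, the scan simply returns the upper limit, which keeps the result inside the permitted range.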

FIG. 1 is a diagram showing an example of the configuration of the classifier creating apparatus.
FIG. 2 is a diagram showing an example of a decision tree.
FIG. 3 is a diagram explaining an example of a confusion matrix.
FIG. 4 is a diagram showing an example of the relationship between the F value and the weight N.
FIG. 5 is a flowchart explaining an example of the processing of the classifier creating apparatus.
FIG. 6 is a flowchart explaining the details of the learning data sampling process.
FIG. 7 is a flowchart explaining the optimum weight derivation process.
FIG. 8 is a diagram showing the confusion matrices in the example and the comparative examples.
FIG. 9 is a diagram showing an example of the attributes of learning data with a biased class composition ratio.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
[First Embodiment]
First, a first embodiment of the present invention will be described.
(Configuration of classifier creating apparatus 100)
FIG. 1 is a diagram showing an example of the configuration of the classifier creating apparatus 100. The classifier creating apparatus 100 shown in FIG. 1 is realized, for example, by using a computer system (information processing apparatus) that includes a CPU, a ROM, a RAM, an HDD, and various interfaces. The classifier created by the classifier creating apparatus 100 applies the attributes of given data to rules and determines which of two classes the data belong to. An example of the functions of the classifier creating apparatus 100 is described below. In this embodiment, the case where the rules of the classifier are constructed by a decision tree is described as an example. A decision tree is a data analysis method that, as shown in FIG. 2, classifies data according to various conditions in the manner of the branches and leaves of a tree. In the example shown in FIG. 2, the data are finally classified into one of two classes, non-defective or defective.

<Learning data input unit 101>
The learning data input unit 101 inputs a plurality of learning data based on, for example, an operation by an operator (the creator of the classifier). Each of the learning data is actual performance data whose correct class is known. The learning data are, for example, sensor measurement data of products, and the class is, for example, whether or not the product is non-defective. In this case, information indicating whether the product on which the learning data are based is non-defective or defective is attached to (included in) each of the learning data as class information. The number of learning data can be determined as appropriate according to, for example, the performance expected of the classifier to be created and the computational load of creating it.

The learning data input unit 101 is realized, for example, by the CPU inputting learning data based on the operator's operation of a user interface and storing them in the RAM or the like. Alternatively, the CPU may input learning data from an external device via a communication interface and store them in the RAM or the like. The CPU may also input (read) learning data stored in a portable storage medium and store them in the RAM or the like.

<Weight range input unit 102>
The weight range input unit 102 inputs the range of the weight N based on, for example, an operation by the operator. As described above, the present inventors obtained the new finding that the optimum value of the weight exists in the range from “1” to “the number of learning data belonging to the majority class / the number of learning data belonging to the minority class”, inclusive. Accordingly, writing this ratio as N_r, the operator sets the range of the weight N so that the lower limit value N_min of the range input by the weight range input unit 102 is “1” or more (N_min ≥ 1) and the upper limit value N_max of that range is N_r or less (N_max ≤ N_r).
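The constraint on the entered range can be sketched as a simple check. The function name is hypothetical, and the additional requirement that the lower limit not exceed the upper limit is a sanity check assumed for this sketch rather than a condition stated in the text.

```python
def check_weight_range(n_min, n_max, n_majority, n_minority):
    """Validate an operator-entered weight range and return the ratio N_r.

    Conditions from the text: N_min >= 1 and N_max <= N_r, where
    N_r = (number of majority-class data) / (number of minority-class data).
    Assumption of this sketch: N_min must not exceed N_max.
    """
    n_r = n_majority / n_minority
    if n_min < 1:
        raise ValueError("lower limit N_min must be 1 or more")
    if n_max > n_r:
        raise ValueError("upper limit N_max must not exceed N_r")
    if n_min > n_max:
        raise ValueError("N_min must not exceed N_max")
    return n_r


# Hypothetical counts: 900 majority-class and 100 minority-class records.
print(check_weight_range(1, 9, n_majority=900, n_minority=100))  # 9.0
```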

The weight range input unit 102 is realized, for example, by the CPU inputting the range of the weight N based on the operator's operation of a user interface and storing it in the RAM or the like. Alternatively, the CPU may input the range of the weight N from an external device via a communication interface and store it in the RAM or the like. The CPU may also input (read) the range of the weight N stored in a portable storage medium and store it in the RAM or the like.

<Temporary weight determination unit 103>
The temporary weight determination unit 103 determines, by a preset method, a preset number of temporary weights N′ from the range of the weight N input by the weight range input unit 102, such that no two values coincide. In this embodiment, the case where three temporary weights N′₁, N′₂, and N′₃ are determined by the following equations (1) to (3) is described as an example.
Temporary weight N′₁ = N_min ... (1)
Temporary weight N′₂ = (N_min + N_max) / 2 ... (2)
Temporary weight N′₃ = N_max ... (3)

As described above, in this embodiment the lower limit value N_min of the range of the weight N is “1” or more, and the upper limit value N_max is N_r or less. Therefore, when N_min = 1 and N_max = N_r, determining the temporary weights with equations (1) to (3) gives N′₁ = 1, N′₂ = (1 + N_r) / 2, and N′₃ = N_r.
The temporary weights may also be determined by methods other than equations (1) to (3). The present invention predicts the optimum value of the weight N from a plurality of temporary weights and their respective evaluation values within the range from N_min to N_max, so the temporary weight values need only be chosen so that they do not coincide and so that they broadly cover the range from N_min to N_max. For example, the temporary weights may be determined using the following equations (1)′ to (3)′.
Temporary weight N′₁ = N_min + 1 × (N_max − N_min) / 4 ... (1)′
Temporary weight N′₂ = N_min + 2 × (N_max − N_min) / 4 ... (2)′
Temporary weight N′₃ = N_min + 3 × (N_max − N_min) / 4 ... (3)′
The temporary weight determination unit 103 is realized, for example, by the CPU reading the range of the weight N from the RAM or the like, calculating the values of the temporary weights N′ by equations (1) to (3) or the like, and storing them in the RAM or the like.
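Both ways of choosing the three temporary weights can be written out directly from the equations. The function names below are hypothetical, and the range 1 to 9 is only an example.

```python
def temporary_weights_endpoints(n_min, n_max):
    """Equations (1) to (3): lower limit, midpoint, and upper limit."""
    return [n_min, (n_min + n_max) / 2, n_max]


def temporary_weights_interior(n_min, n_max):
    """Equations (1)' to (3)': the three interior quarter points of the range."""
    return [n_min + k * (n_max - n_min) / 4 for k in (1, 2, 3)]


# Example range: N_min = 1 and N_max = 9.
print(temporary_weights_endpoints(1, 9))  # [1, 5.0, 9]
print(temporary_weights_interior(1, 9))   # [3.0, 5.0, 7.0]
```

Either choice yields three distinct values as long as N_min is smaller than N_max.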

<Learning data sampling unit 104>
Using each of the plurality of temporary weights N′ determined by the temporary weight determination unit 103, the learning data sampling unit 104 oversamples the learning data belonging to the minority class, among the learning data input by the learning data input unit 101, and creates new learning data in which the number of learning data belonging to the minority class has been increased.
The learning data sampling unit 104 thus obtains new learning data for each temporary weight N′. In this embodiment, three sets of new learning data are obtained by the learning data sampling unit 104, corresponding to the three temporary weights N′₁, N′₂, and N′₃. To obtain these sets individually, the learning data sampling unit 104 of this embodiment has three learning data sampling units (a first learning data sampling unit 104a, a second learning data sampling unit 104b, and a third learning data sampling unit 104c). The first, second, and third learning data sampling units 104a, 104b, and 104c oversample the learning data belonging to the minority class using the temporary weights N′₁, N′₂, and N′₃, respectively.

The first, second, and third learning data sampling units 104a, 104b, and 104c have the same function except for the value of the temporary weight N′ that they use. Therefore, only the second learning data sampling unit 104b, which samples the learning data belonging to the minority class using the temporary weight N′₂, is described here, and detailed descriptions of the first and third learning data sampling units 104a and 104c are omitted.

The second learning data sampling unit 104b always adds the learning data belonging to the majority class to the new learning data. It also adds each learning data belonging to the minority class to the new learning data repeatedly, duplicating it N′₂ times on average.
Specifically, the second learning data sampling unit 104b performs the following processing on each learning data, to which a data number i (i = 0, 1, 2, ...) identifying the learning data is assigned, in ascending order of the data number i.

First, the second learning data sampling unit 104b determines whether the learning data with the data number i belongs to the minority class. If it does not (that is, if it belongs to the majority class), the second learning data sampling unit 104b adopts the learning data with the data number i as new learning data.

On the other hand, if the learning data with the data number i belongs to the minority class, the second learning data sampling unit 104b repeatedly adopts the learning data with the data number i as new learning data until the cumulative number p of added learning data (p = 0, 1, 2, ...) becomes equal to or greater than its upper limit q (the initial value of q is the temporary weight N′₂). The second learning data sampling unit 104b then sets the value obtained by adding the temporary weight N′₂ to the current upper limit q as the new upper limit q.
For example, when the temporary weight N′₂ is 2.5, the learning data belonging to the minority class are adopted as new learning data alternately three times and twice, in ascending order of the data number i.
The learning data sampling unit 104 is realized, for example, by the CPU performing the above processing to oversample the learning data belonging to the minority class based on each temporary weight N′, creating the new learning data, and storing the new learning data in the RAM or the like.
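The oversampling rule above, with its counter p and running upper limit q, can be sketched as follows. This is an illustrative reading of the description, not the apparatus itself: the function name `oversample_minority` and the parallel-list representation (`samples` plus a boolean `is_minority` flag per sample) are assumptions made for the example.

```python
def oversample_minority(samples, is_minority, weight):
    """Duplicate minority-class samples `weight` times on average.

    `samples` and `is_minority` are parallel sequences (a hypothetical
    representation); `weight` may be fractional, e.g. 2.5.
    """
    new_samples = []
    p = 0        # cumulative number of minority copies added so far
    q = weight   # running upper limit for p, initialised to the weight
    for sample, minority in zip(samples, is_minority):
        if not minority:
            # Majority-class data are always adopted exactly once.
            new_samples.append(sample)
            continue
        # Adopt the same minority sample until p reaches the limit q.
        while p < q:
            new_samples.append(sample)
            p += 1
        # Raise the limit by the weight for the next minority sample.
        q += weight
    return new_samples
```

Because q grows by the weight after every minority sample while p grows by whole copies, a fractional weight is realised by alternating the number of copies, which is how the average duplication count is met without random sampling.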

<Learning unit 105>
The learning unit 105 learns each set of new learning data obtained by the learning data sampling unit 104 and creates, for each set, a temporary classifier that maximizes the correct answer rate (the proportion of data for which the class predicted by the classifier matches the actual class). It thus creates as many temporary classifiers as there are sets of new learning data.
In this embodiment, the learning data sampling unit 104 obtains three sets of new learning data, so the learning unit 105 obtains three temporary classifiers. To create these three temporary classifiers individually, the learning unit 105 includes three learning units (a first learning unit 105a, a second learning unit 105b, and a third learning unit 105c). The first, second, and third learning units 105a, 105b, and 105c learn the new learning data obtained by the first, second, and third learning data sampling units 104a, 104b, and 104c, respectively, and each creates a temporary classifier.

The learning algorithm for creating a decision tree as the classifier can be realized using a known technique, such as the one described in Non-Patent Document 2. A detailed description of the learning algorithm for creating the classifier is therefore omitted here.
The learning unit 105 is realized, for example, by the CPU learning the new learning data for each set of new learning data, creating as many temporary classifiers as there are sets of new learning data, and storing information on the created temporary classifiers in the RAM or the like.

<Evaluation value calculation unit 106>
The evaluation value calculation unit 106 evaluates each of the temporary classifiers obtained by the learning unit 105 using the original learning data and calculates an evaluation value. In this embodiment, the learning unit 105 creates three temporary classifiers, so three evaluation values are obtained. To obtain these three evaluation values, the evaluation value calculation unit 106 includes three evaluation value calculation units (a first evaluation value calculation unit 106a, a second evaluation value calculation unit 106b, and a third evaluation value calculation unit 106c). The first, second, and third evaluation value calculation units 106a, 106b, and 106c calculate the evaluation values of the temporary classifiers obtained by the first, second, and third learning units 105a, 105b, and 105c, respectively.

In this embodiment, the case where the F value described in Non-Patent Document 1 and elsewhere is used as the evaluation value of a classifier is described as an example. The first, second, and third evaluation value calculation units 106a, 106b, and 106c have the same function except for the classifier that they evaluate. Therefore, only the first evaluation value calculation unit 106a is described here, and detailed descriptions of the second and third evaluation value calculation units 106b and 106c are omitted.

First, the first evaluation value calculation unit 106a inputs the original learning data input by the learning data input unit 101 into the temporary classifier created by the first learning unit 105a, classifies the data, and creates a confusion matrix.
FIG. 3 is a diagram illustrating an example of the confusion matrix. In FIG. 3, n₁₁ is the number of learning data whose actual class is the majority class and that were predicted correctly, and n₁₂ is the number of learning data whose actual class is the majority class and that were predicted incorrectly. Likewise, n₂₂ is the number of learning data whose actual class is the minority class and that were predicted correctly, and n₂₁ is the number of learning data whose actual class is the minority class and that were predicted incorrectly.

Using these values n₁₁, n₁₂, n₂₁, and n₂₂, the first evaluation value calculation unit 106a calculates the precision given by the following equation (4) and the recall given by the following equation (5).
Precision = n₂₂ / (n₁₂ + n₂₂) ... (4)
Recall = n₂₂ / (n₂₁ + n₂₂) ... (5)
The precision is the proportion of the learning data predicted to belong to the minority class that actually belong to the minority class. For example, when the learning data are sensor measurement data of products and the class indicates whether a product is non-defective, the precision indicates what proportion of the products predicted to be defective are actually defective.
The recall is the proportion of the learning data actually belonging to the minority class that were predicted to belong to the minority class. In the above example, the recall indicates what proportion of the actually defective products were predicted to be defective.
For both the recall and the precision, a larger value indicates a better evaluation.
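Equations (4) and (5) translate directly into code. The sketch below is illustrative only: the function name `precision_recall` is hypothetical, and returning 0.0 when a denominator is zero is an assumption that the text does not address.

```python
def precision_recall(n11, n12, n21, n22):
    """Compute precision and recall for the minority class.

    n11: actual majority, predicted majority (not needed by either ratio)
    n12: actual majority, predicted minority
    n21: actual minority, predicted majority
    n22: actual minority, predicted minority
    """
    predicted_minority = n12 + n22
    actual_minority = n21 + n22
    # Returning 0.0 for an empty denominator is an assumption.
    precision = n22 / predicted_minority if predicted_minority else 0.0
    recall = n22 / actual_minority if actual_minority else 0.0
    return precision, recall
```

Neither ratio uses n₁₁, which is why a classifier that always predicts the majority class scores zero here even though its correct answer rate is high.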

Next, the first evaluation value calculation unit 106a calculates the F value given by the following equation (6), using the precision, the recall, and an adjustment coefficient β whose value is set in advance (usually 1).

As equation (6) shows, the F value is a weighted harmonic mean of the precision and the recall.
In this way, the first, second, and third evaluation value calculation units 106a, 106b, and 106c obtain the F values F₁, F₂, and F₃ as the evaluation values of the temporary classifiers obtained by the first, second, and third learning units 105a, 105b, and 105c.
The evaluation value calculation unit 106 is realized, for example, by the CPU calculating the evaluation value (F value) for each of the plurality of temporary classifiers and storing it in the RAM or the like.
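Equation (6) itself is not reproduced in this text. The sketch below therefore assumes the commonly used F-measure, F = (1 + β²) × precision × recall / (β² × precision + recall), which is a weighted harmonic mean and reduces to the ordinary harmonic mean when β = 1. The function name `f_value` and the 0.0 result for a zero denominator are likewise assumptions.

```python
def f_value(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (assumed form).

    beta > 1 weights recall more heavily, beta < 1 weights precision
    more heavily; beta = 1 gives 2PR / (P + R).
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Assumption: treat the undefined case as the worst score.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator
```

Under this assumed form, a classifier scores well only when both the precision and the recall for the minority class are reasonably high.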

<Optimum weight deriving unit 107>
The optimum weight deriving unit 107 calculates the optimum weight value N_opt using the plurality of temporary weights N′ obtained by the temporary weight determination unit 103 and the corresponding evaluation values F obtained by the evaluation value calculation unit 106. In this embodiment, the optimum weight value N_opt is obtained using the three temporary weights N′₁, N′₂, and N′₃ and the corresponding three F values F₁, F₂, and F₃.
In this embodiment, the relationship F(N) between the weight N and the F value is approximated by the quadratic function given by the following equation (7). From the three pairs of temporary weight and F value, (N′₁, F₁), (N′₂, F₂), and (N′₃, F₃), the optimum weight deriving unit 107 calculates the coefficients a, b, and c of the quadratic function of equation (7) using the following equations (8) to (10).

In equation (7), F(N) indicates that the F value is expressed as a function of N, and N denotes the weight variable (weight).
The optimum weight deriving unit 107 substitutes the coefficients a, b, and c calculated by equations (8) to (10) and N_min, the lower limit of the range of the weight N, into equation (7) to calculate the F value F(N_min) for the weight N = N_min. Similarly, it substitutes the coefficients a, b, and c and N_max, the upper limit of the range of the weight N, into equation (7) to calculate the F value F(N_max) for the weight N = N_max.

Based on the coefficients a, b, and c calculated as described above, the optimum weight deriving unit 107 determines whether the quadratic function of equation (7) is convex upward. Specifically, the quadratic function of equation (7) is convex upward when the following relationship (11) is satisfied.
a + b + c < 0 ... (11)

The following explains that a quadratic function passing through three different points can be expressed as in equation (7), and that the quadratic function of equation (7) is convex upward when relationship (11) is satisfied.
First, a quadratic function passing through any three different points (x₁, y₁), (x₂, y₂), and (x₃, y₃) is expressed by the following equation (12) using coefficients a, b, and c.

Substituting the point (x₁, y₁) into equation (12) gives the following equation (13). Likewise, substituting the points (x₂, y₂) and (x₃, y₃) into equation (12) gives the following equations (14) and (15), respectively.

Thus, the quadratic function passing through the three different points (x₁, y₁), (x₂, y₂), and (x₃, y₃) is the function of equation (12) with the coefficients a, b, and c calculated by equations (13), (14), and (15). Replacing x₁ with N′₁, y₁ with F₁, x₂ with N′₂, y₂ with F₂, x₃ with N′₃, and y₃ with F₃ makes equation (12) correspond to equation (7), equation (13) to equation (8), equation (14) to equation (9), and equation (15) to equation (10). In this way, a quadratic function passing through three different points can be expressed as in equation (7).

Next, differentiating equation (12) with respect to x gives the following equation (16). The point x_opt at which the quadratic function y takes its extremum is the value of x at which the right side of equation (16) is 0 (zero), and is therefore obtained as in the following equation (17).

The condition for the quadratic function y to be convex upward at x_opt is that the second derivative of equation (12) is negative at x_opt, which gives the following equations (18) and (19).

Therefore, as stated above, the quadratic function of equation (7) is convex upward when relationship (11) is satisfied.
When relationship (11) is satisfied (that is, when the quadratic function of equation (7) is convex upward), the optimum weight deriving unit 107 calculates the weight N_opt corresponding to the maximum of the quadratic function of equation (7) using the following equation (20).

The optimum weight deriving unit 107 determines whether the value of the weight N calculated by equation (20) lies in the range from N_min, the lower limit of the range of the weight N, to N_max, the upper limit of the range of the weight N, inclusive.
If the value of the weight N calculated by equation (20) lies within this range, the optimum weight deriving unit 107 sets that value as the optimum weight value N_opt.
If it does not, the optimum weight deriving unit 107 compares the F value F(N_min) corresponding to N_min with the F value F(N_max) corresponding to N_max, and selects the weight N (N_min or N_max) corresponding to the larger F value as the optimum weight value N_opt.

Similarly, when relationship (11) is not satisfied (that is, when the quadratic function of equation (7) is not convex upward), the optimum weight deriving unit 107 selects, as the optimum weight value N_opt, the weight N (N_min or N_max) corresponding to the larger of F(N_min) and F(N_max).
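The selection logic of the optimum weight deriving unit 107 can be sketched as follows. Equations (7) to (10) and (20) are not reproduced in this text, so the sketch assumes the Lagrange form F(N) = a(N − N′₂)(N − N′₃) + b(N − N′₁)(N − N′₃) + c(N − N′₁)(N − N′₂), whose second derivative is 2(a + b + c) and which is therefore consistent with relationship (11). The function name `optimum_weight`, the vertex formula derived from that assumed form, and the choice of N_min when F(N_min) equals F(N_max) are all assumptions.

```python
def optimum_weight(weights, f_values, n_min, n_max):
    """Pick the optimum weight from three (weight, F value) pairs.

    Assumes the Lagrange form of the interpolating quadratic; the three
    weights must be distinct.
    """
    (x1, x2, x3), (y1, y2, y3) = weights, f_values
    # Coefficients of the parabola through the three points.
    a = y1 / ((x1 - x2) * (x1 - x3))
    b = y2 / ((x2 - x1) * (x2 - x3))
    c = y3 / ((x3 - x1) * (x3 - x2))

    def f(n):
        return (a * (n - x2) * (n - x3)
                + b * (n - x1) * (n - x3)
                + c * (n - x1) * (n - x2))

    if a + b + c < 0:  # relationship (11): convex upward
        # Zero of the first derivative of the assumed form.
        vertex = ((a * (x2 + x3) + b * (x1 + x3) + c * (x1 + x2))
                  / (2 * (a + b + c)))
        if n_min <= vertex <= n_max:
            return vertex
    # Otherwise fall back to the better end of the range
    # (ties resolved toward n_min, which is an assumption).
    return n_min if f(n_min) >= f(n_max) else n_max
```

For example, the pairs (1, 0.2), (5, 0.6), and (9, 0.4) give a parabola that is convex upward with its maximum near N = 5.67, which lies inside the range from 1 to 9 and is therefore returned as the optimum weight.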

FIG. 4 is a diagram illustrating an example of the relationship between the F value and the weight N. FIG. 4 shows, as an example, the case where the quadratic function of equation (7) is convex upward. Because a larger F value means a better evaluation of the classifier, in the example of FIG. 4 the weight N at which the F value reaches its maximum is the optimum weight value N_opt.
The optimum weight deriving unit 107 is realized, for example, by the CPU calculating the optimum weight value N_opt and storing it in the RAM or the like.

<Learning data sampling unit 108>
The learning data sampling unit 108 uses the optimum weight value N_opt obtained by the optimum weight deriving unit 107 to oversample the learning data belonging to the minority class among the learning data input by the learning data input unit 101, thereby creating new learning data in which the learning data belonging to the minority class are increased. The oversampling method is the same as that described for the learning data sampling unit 104, so a detailed description is omitted here.
The learning data sampling unit 108 is realized, for example, by the CPU oversampling the learning data belonging to the minority class based on the optimum weight value N_opt, creating the new learning data, and storing the new learning data in the RAM or the like.

<Learning unit 109, classifier storage unit 110>
The learning unit 109 learns the new learning data obtained by the learning data sampling unit 108, creates one classifier that maximizes the correct answer rate, and stores it in the classifier storage unit 110. As described above, the learning algorithm for creating a decision tree as the classifier can be realized using a known technique, so a detailed description of it is omitted here.
The learning unit 109 is realized, for example, by the CPU learning the new learning data, creating the classifier, and storing information on the created classifier in an HDD or the like. The classifier storage unit 110 is realized, for example, by an HDD or the like.

The classifier creating apparatus 100 uses the classifier obtained as described above to determine which of the two classes the data to be evaluated belongs to, and outputs the result. The result may be output, for example, by displaying it on a display device, storing it in a storage medium, or transmitting it to an external device. The classifier creating apparatus 100 may thus both create the classifier and classify data using it. Alternatively, the classifier created by the classifier creating apparatus 100 may be transferred to a separate information processing apparatus, which then classifies data using the classifier.

(Operation flowchart)
Next, an example of the processing of the classifier creating apparatus 100 is described with reference to the flowchart of FIG. 5.
First, in step S501, the learning data input unit 101 inputs a plurality of learning data based on an operation by an operator or the like.
Next, in step S502, the weight range input unit 102 inputs the range of the weight N. The weight range input unit 102 accepts the input range only when its lower limit N_min is 1 or more and its upper limit N_max is N_r or less. Otherwise, it prompts the operator, using a display screen or the like, to input the range of the weight N again.

Next, in step S503, the temporary weight determination unit 103 determines three temporary weights N′₁, N′₂, and N′₃ from within the range of the weight N input in step S502.
Next, in step S504, the learning data sampling unit 104 selects the temporary weights N′₁, N′₂, and N′₃ determined in step S503 one at a time, in this order.
Next, in step S505, the learning data sampling unit 104 performs learning data sampling processing, in which it oversamples the learning data belonging to the minority class using the temporary weight N′ selected in step S504 and creates new learning data. The learning data sampling processing is described in detail later with reference to FIG. 6.

Next, in step S506, the learning unit 105 learns the new learning data obtained in step S505 and creates a temporary classifier that maximizes the correct answer rate.
Next, in step S507, the evaluation value calculation unit 106 calculates the F value, which is the evaluation value of the temporary classifier obtained in step S506, using the learning data input in step S501.

Next, in step S508, the learning data sampling unit 104 determines whether all of the temporary weights N′₁, N′₂, and N′₃ determined in step S503 have been selected (that is, whether N′₃ has been selected). If not all of them have been selected, the processing returns to step S504. Steps S504 to S508 are repeated until a temporary classifier has been created and its evaluation value (F value) derived for every one of the temporary weights N′₁, N′₂, and N′₃ determined in step S503.

When it is determined in step S508 that all of the temporary weights N′₁, N′₂, and N′₃ determined in step S503 have been selected, the processing proceeds to step S509.
In step S509, the optimum weight deriving unit 107 performs optimum weight derivation processing, in which it derives the optimum weight value N_opt using the temporary weights N′₁, N′₂, and N′₃ obtained in step S503 and the corresponding F values F₁, F₂, and F₃ obtained by the evaluation value calculation unit 106. The optimum weight derivation processing is described in detail later with reference to FIG. 7.

Next, in step S510, the learning data sampling unit 108 performs the same learning data sampling processing as in step S505: it oversamples the learning data belonging to the minority class using the optimum weight value N_opt obtained in step S509 and creates new learning data.
Next, in step S511, the learning unit 109 learns the new learning data obtained in step S510, creates one classifier that maximizes the correct answer rate, and stores it in the classifier storage unit 110.
The processing according to the flowchart of FIG. 5 then ends.

Next, the learning data sampling processing in steps S505 and S510 of FIG. 5 is described in detail with reference to the flowchart of FIG. 6. In the learning data sampling processing of step S505 of FIG. 5, the following processing is performed by the first, second, and third learning data sampling units 104a, 104b, and 104c, in that order. The learning data sampling processing of step S510 of FIG. 5 differs from that of step S505 only in that the optimum weight value N_opt is used in place of the temporary weight N′. Therefore, only the learning data sampling processing of step S510 of FIG. 5 is described in detail here.

First, in step S601, the learning data sampling unit 104 sets the data number i to its initial value (0 (zero)), the cumulative number p of added learning data to its initial value (0 (zero)), and the upper limit q of the cumulative number p to its initial value (the temporary weight N′ selected in step S504 of FIG. 5).
Next, in step S602, the learning data sampling unit 104 selects the learning data with the data number i from the learning data input in step S501 of FIG. 5.

Next, in step S603, the learning data sampling unit 104 determines whether the learning data with data number i belongs to the minority class.
If the learning data with data number i belongs to the minority class, the process proceeds to step S607, described later. If it does not (that is, if it belongs to the majority class), the process proceeds to step S604.

In step S604, the learning data sampling unit 104 adopts the learning data with data number i as new learning data.
Next, in step S605, the learning data sampling unit 104 determines whether all of the learning data input in step S501 of FIG. 5 have been selected. If all of the learning data have been selected, the processing according to the flowchart of FIG. 6 ends.
If not all of the learning data have been selected, the process proceeds to step S606.

In step S606, the learning data sampling unit 104 adds 1 to the data number i to update it.
The processing from step S602 onward is then performed on the learning data with the next data number i.
As described above, if it is determined in step S603 that the learning data with data number i belongs to the minority class, the process proceeds to step S607.

In step S607, the learning data sampling unit 104 determines whether the cumulative added count p of learning data is less than its upper limit q (p < q). If p is less than q, the process proceeds to step S608.
In step S608, the learning data sampling unit 104 adopts the learning data with data number i as new learning data.
Next, in step S609, the learning data sampling unit 104 adds 1 to the cumulative added count p to update it.

The process then returns to step S607, and steps S607 to S609 are repeated until the cumulative added count p becomes equal to or greater than its upper limit q.
When p becomes equal to or greater than q (p ≥ q), the process proceeds to step S610. In step S610, the learning data sampling unit 104 adds the temporary weight N′ selected in step S504 of FIG. 5 to the upper limit q to update it. The process then proceeds to step S605 described above.
Through steps S607 to S610, each learning data item belonging to the minority class is adopted as new learning data repeatedly, N′ times on average.
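The flow of FIG. 6 can be sketched in a few lines of Python. This is an illustrative reading of steps S601 to S610, not the patented implementation; the function name `oversample_minority`, the label values, and the use of index lists are assumptions made for the example.

```python
def oversample_minority(labels, minority_label, weight):
    """Return indices of the new learning data, following the flow of FIG. 6.

    Majority-class items are adopted once (S604). Each minority-class item
    is adopted repeatedly while p < q (S607 to S609), after which the upper
    limit q is raised by `weight` (S610).
    """
    new_indices = []
    p = 0        # cumulative added count of minority-class items (S601)
    q = weight   # upper limit of p, initially the temporary weight (S601)
    for i, label in enumerate(labels):     # S602, S605, S606
        if label != minority_label:        # S603: majority class
            new_indices.append(i)          # S604
            continue
        while p < q:                       # S607
            new_indices.append(i)          # S608
            p += 1                         # S609
        q += weight                        # S610
    return new_indices


# Hypothetical labels: "ng" is the minority class.
labels = ["ok", "ng", "ok", "ng", "ng", "ng"]
picked = oversample_minority(labels, "ng", 2.5)
copies = [picked.count(i) for i in range(len(labels))]
print(copies)  # [1, 3, 1, 2, 3, 2]
```

With a weight of 2.5 the minority items are copied 3, 2, 3, 2 times, which is 2.5 times on average; the running counter p and limit q are what allow a non-integer weight without any random selection.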

Next, the optimum weight derivation process of step S509 in FIG. 5 will be described with reference to the flowchart of FIG. 7.
First, in step S701, the optimum weight deriving unit 107 calculates the coefficients a, b, and c of the quadratic function shown in equation (7) from the three pairs of temporary weight and F value, (N′₁, F₁), (N′₂, F₂), and (N′₃, F₃) (see equations (8) to (10) for the calculation).
Next, in step S702, the optimum weight deriving unit 107 substitutes the coefficients a, b, and c obtained in step S701 and N_min, the lower limit of the range of the weight N, into equation (7) to calculate F(N_min), the F value when the weight N equals N_min. Likewise, it substitutes the coefficients a, b, and c and N_max, the upper limit of the range of the weight N, into equation (7) to calculate F(N_max), the F value when the weight N equals N_max.

Next, in step S703, the optimum weight deriving unit 107 determines, based on the coefficients a, b, and c obtained in step S701, whether the quadratic function shown in equation (7) is an upward-convex function. The function is determined to be upward convex when the coefficients a, b, and c satisfy the relationship of equation (11).
If the quadratic function shown in equation (7) is upward convex, the process proceeds to step S704.
In step S704, the optimum weight deriving unit 107 calculates the weight N corresponding to the local maximum of the quadratic function shown in equation (7) (see equation (20)).

Next, in step S705, the optimum weight deriving unit 107 determines whether the weight N obtained in step S704 lies in the range from N_min, the lower limit of the range of the weight N, to N_max, the upper limit of that range, inclusive.
If the weight N obtained in step S704 lies in the range from N_min to N_max inclusive, the process proceeds to step S706.
In step S706, the optimum weight deriving unit 107 adopts the weight N obtained in step S704 as the optimum weight value N_opt. The processing according to the flowchart of FIG. 7 then ends.

If it is determined in step S703 that the quadratic function shown in equation (7) is not upward convex, or if it is determined in step S705 that the weight N obtained in step S704 does not lie in the range from N_min to N_max inclusive, the process proceeds to step S707.
In step S707, the optimum weight deriving unit 107 determines whether F(N_min), the F value corresponding to the lower limit N_min of the range of the weight N, is less than F(N_max), the F value corresponding to the upper limit N_max of that range.
If this condition holds, the process proceeds to step S708.

In step S708, the optimum weight deriving unit 107 adopts N_max, the upper limit of the range of the weight N, as the optimum weight value N_opt. The processing according to the flowchart of FIG. 7 then ends.
If the condition of step S707 does not hold, the process proceeds to step S709. In step S709, the optimum weight deriving unit 107 adopts N_min, the lower limit of the range of the weight N, as the optimum weight value N_opt. The processing according to the flowchart of FIG. 7 then ends.
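The decision flow of FIG. 7 can be sketched as follows. The sketch assumes that equation (7) has the form F(N) = a·N² + b·N + c, that the upward-convexity test of equation (11) amounts to a < 0, and that equation (20) gives the peak at N = −b/(2a); these are the standard forms for a parabola, but the equations themselves are not reproduced in this passage. The function names and sample values are hypothetical.

```python
def fit_quadratic(points):
    """Coefficients (a, b, c) of the parabola through three (N, F) points."""
    (x1, f1), (x2, f2), (x3, f3) = points
    # Newton divided differences for a parabola through three points.
    d12 = (f2 - f1) / (x2 - x1)
    d23 = (f3 - f2) / (x3 - x2)
    a = (d23 - d12) / (x3 - x1)
    b = d12 - a * (x1 + x2)
    c = f1 - a * x1 ** 2 - b * x1
    return a, b, c


def optimum_weight(points, n_min, n_max):
    """Choose the optimum weight following steps S701 to S709."""
    a, b, c = fit_quadratic(points)              # S701

    def f(n):
        return a * n ** 2 + b * n + c

    f_min, f_max = f(n_min), f(n_max)            # S702
    if a < 0:                                    # S703: upward convex
        n_peak = -b / (2 * a)                    # S704
        if n_min <= n_peak <= n_max:             # S705
            return n_peak                        # S706
    # S707 to S709: fall back to the better end of the range.
    return n_max if f_min < f_max else n_min


# Hypothetical (weight, F value) pairs, for illustration only.
samples = [(1.0, 0.10), (3.0, 0.30), (5.0, 0.20)]
print(round(optimum_weight(samples, 1.0, 5.0), 2))  # 3.33
```

Because the fit is exact for three points, no iterative search is needed; the only branching is whether the fitted peak is usable or an end of the range must be taken instead.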

(Example)
Next, an example of the present invention will be described. This example describes the creation of a classifier that determines, from data obtained after or during the manufacture of a steel product, whether that steel product passes through a generation process in the manufacturing process.
In the process of manufacturing steel products, a worker inspects the rolled and cooled steel product. If the product is badly bent, the bend is corrected in a straightening process; if the product has a surface flaw, the flaw is ground off with a grinder in a conditioning process. There are thus processes for which it becomes known only after the steel product has been manufactured whether the product passes through them. Such a process is called a generation process. When a product passes through a generation process, the manufacturing lead time is extended. For this reason, when an ordered product is manufactured, it is necessary to predict before manufacturing whether the product will pass through a generation process, and to start manufacturing in time for the customer's delivery date. However, whether a product passes through a generation process is not uniquely determined by the product specifications (size, hardness, and so on). Therefore, whether a product will pass through a generation process is predicted from past operation data, and the manufacturing start date is determined on the basis of that prediction.

As a comparative example for this example, a decision tree that predicts whether a product passes through a generation process in a certain steel manufacturing process was created without weighting (equivalent to a weight of 1). As the decision tree learning algorithm, the method called C5.0 described in Non-Patent Document 2, which is based on the gain ratio of information entropy, was used. FIG. 8(a) is a confusion matrix representing the performance of this decision tree. As shown in FIG. 8(a), although 51510 steel products actually passed through the generation process, only 1467 steel products were predicted to pass through it. The F value was 2.44 [%].

Using the same decision tree learning algorithm, a decision tree was created by the method described in this embodiment. Here, the adjustment coefficient β used in calculating the F value was set to 1 (β = 1). The lower limit N_min of the range of the weight N was set to 1, and the upper limit N_max was set to 10. The number of temporary weights N′ was set to 3. The temporary weights N′₁, N′₂, and N′₃ were obtained from equations (1), (2), and (3). That is, the temporary weight N′₁ was set to 1 (N′₁ = 1), the temporary weight N′₂ to 5.5 (N′₂ = 5.5), and the temporary weight N′₃ to 10 (N′₃ = 10).

Using these temporary weights N′₁, N′₂, and N′₃, the learning data belonging to the minority class were oversampled by the method described in this embodiment to obtain three sets of new learning data. These three sets of new learning data were then applied to the decision tree learning algorithm described above to create three decision trees (temporary classifiers). The F values F₁, F₂, and F₃, which are the evaluation values of the three temporary classifiers created using the temporary weights N′₁, N′₂, and N′₃, were 2.44 [%] (= 0.0244), 28.9 [%] (= 0.289), and 25.4 [%] (= 0.254), respectively (F₁ = 0.0244, F₂ = 0.289, F₃ = 0.254).

The relationship between the weight N and the F value (F(N)) was then approximated by the quadratic function shown in equation (7) to obtain the optimum weight value N_opt. As a result, the optimum weight value N_opt was 7.2243. Using this optimum weight value N_opt, the learning data belonging to the minority class were oversampled by the method described in this embodiment to obtain new learning data. These new learning data were applied to the decision tree learning algorithm described above to create a decision tree (classifier). The confusion matrix representing the performance of the decision tree (classifier) obtained in this way is shown in FIG. 8(b). In this example, 99533 steel products were predicted to pass through the generation process, of which 22561 actually passed through it. The result of this example is therefore clearly better than that of the unweighted method (comparative example) shown in FIG. 8(a). In addition, whereas the F value of the comparative example shown in FIG. 8(a) was 2.44 [%], the F value in this example improved to 29.9 [%].
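The two figures reported above can be checked by hand. The working below assumes that equation (7) is the parabola F(N) = aN² + bN + c through the three (N′, F) pairs, and that the F value with β = 1 reduces to the harmonic mean of precision and recall; the symbols h, TP, FP, and FN are introduced here for the check and do not appear in the original text. The count of 51510 actual passes is taken from the comparative example.

```latex
% Peak of the parabola through three equally spaced points (spacing h = 4.5)
\[
N_{\mathrm{opt}}
  = N'_2 - \frac{h\,(F_3 - F_1)}{2\,(F_1 - 2F_2 + F_3)}
  = 5.5 - \frac{4.5 \times (0.254 - 0.0244)}{2 \times (0.0244 - 2 \times 0.289 + 0.254)}
  = 5.5 + \frac{1.0332}{0.5992}
  \approx 7.2243
\]

% F value (beta = 1) from the counts of FIG. 8(b):
% TP = 22561, predicted positives TP + FP = 99533, actual positives TP + FN = 51510
\[
F = \frac{2\,TP}{(TP + FP) + (TP + FN)}
  = \frac{2 \times 22561}{99533 + 51510}
  = \frac{45122}{151043}
  \approx 0.2987
\]
```

Both results agree with the reported values of 7.2243 and 29.9 [%].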

N_r, the upper limit of the range of the weight N, is expressed as "the number of learning data belonging to the majority class / the number of learning data belonging to the minority class", and is therefore approximately 12.7 (= 652623/51510) in this example. That is, the optimum weight value N_opt (= 7.2243) lies between 1, the lower limit of the range of the weight N, and N_r (= 12.7), the upper limit of that range. It follows that, to maximize the F value, the learning data belonging to the minority class must be sampled so that their number remains smaller than the number of learning data belonging to the majority class.

(Summary)
As described above, in this embodiment, three temporary weights N′₁, N′₂, and N′₃ are determined from within a range of the weight N specified by a lower limit N_min (≥ 1) and an upper limit N_max (≤ N_r). Each of these temporary weights is used to oversample the learning data belonging to the minority class, creating three sets of new learning data. Three temporary classifiers are created from the three sets of new learning data, and their evaluation values, the F values F₁, F₂, and F₃, are obtained. From the three pairs of temporary weight and F value, (N′₁, F₁), (N′₂, F₂), and (N′₃, F₃), the coefficients a, b, and c of a quadratic function representing the relationship between the weight N and the F value are calculated, and the weight N corresponding to the local maximum of that quadratic function is obtained. If this weight N is not less than N_min and not more than N_max, it is taken as the optimum weight value N_opt, and a classifier is constructed using new learning data created on the basis of N_opt. If this weight N is not in that range, whichever of N_min and N_max has the larger corresponding F value is taken as the optimum weight value N_opt, and a classifier is constructed using new learning data created on the basis of N_opt.
That is, in order to create a classifier capable of predicting as accurately as possible, the learning data belonging to the minority class are sampled using the optimum weight value N_opt at which the F value is highest. Therefore, even when the number of learning data belonging to one class is extremely small compared with the number belonging to the other class, the learning data can be sampled without excess or deficiency and a highly accurate classifier can be constructed.
In addition, in this embodiment, the relationship between the weight N and the F value is approximated by a quadratic function, and the optimum weight value N_opt is calculated from a plurality of temporary weights N′ and their corresponding F values. This removes the need for a convergence calculation, so the computational load of calculating the optimum weight value N_opt can be reduced as much as possible.

(Modification)
In this embodiment, a range of the weight N is input, and the values of the plurality of temporary weights N′ are determined so as to be equally spaced within that range. However, this is not essential. For example, the values of the plurality of temporary weights N′ may be input directly, based on an operator's operation or the like. In that case, however, only values not less than N_min and not more than N_max must be accepted as the values of the temporary weights N′.

The lower limit N_min of the range of the weight N is not limited to 1. For example, if the approximate range in which the optimum weight N_opt lies is known from past experience, a value greater than 1 may be adopted as the lower limit N_min. Likewise, the upper limit N_max of the range of the weight N is not limited to N_r (the number of learning data belonging to the majority class / the number of learning data belonging to the minority class). For example, if the approximate range in which an appropriate weight N lies is known from past experience, a value smaller than N_r may be adopted as the upper limit N_max. Further, when the value of N_r is larger than a predetermined value (for example, 10), that predetermined value (for example, 10) may be adopted as the upper limit N_max.
The reasons for setting the range of the weight N (N_min to N_max) to something other than 1 to N_r are as follows. The present invention predicts the optimum weight N_opt by approximating the relationship between the weight N and the evaluation value, which is a continuous function, from a plurality of temporary weights and their evaluation values. Therefore, when the range in which the optimum weight N_opt lies is known from past experience, setting the range of the weight N to that range improves the accuracy with which N_opt is predicted and makes it possible to create a classifier with higher performance. Further, when the learning data are heavily imbalanced, N_r becomes excessively large. If the minority-class data are oversampled with the weight N_r, the number of new learning data becomes enormous, so that the data may not fit in the storage device (RAM or the like) of the computer system and the classifier cannot be computed, or, even if the data fit, computing the classifier may take an enormous amount of time. In such cases, setting the upper limit N_max of the range of the weight N to a value smaller than N_r, according to the learning algorithm used to create the classifier and the capacity of the storage device of the computer system, prevents these computational problems.

The method of oversampling the learning data belonging to the minority class for each temporary weight N′ is not limited to the method described above. For example, when the value of the temporary weight N′ has an integer part and a fractional part (for example, 2.5), the new learning data can be determined as follows. First, all of the original learning data are adopted as new learning data. Next, the learning data belonging to the minority class are adopted (repeatedly) as new learning data a number of times equal to the integer part minus 1 (for example, once (= 2 − 1)). Finally, a number of learning data equal to the number of learning data belonging to the minority class multiplied by the fractional part (for example, the number of minority-class learning data multiplied by 0.5) is selected at random from the minority-class learning data, using random numbers or the like and without selecting the same item twice, and the selected data are adopted as new learning data. The learning data belonging to the minority class can also be sampled in this way.
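The integer-plus-fraction variant described above might look like the following sketch. The function name, the `seed` parameter, and the rounding of the fractional count to the nearest integer are assumptions; the text does not say how a non-integer count should be rounded. The sketch also assumes a weight of at least 1.

```python
import math
import random


def oversample_fractional(labels, minority_label, weight, seed=0):
    """Return indices of new learning data for a weight such as 2.5."""
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(labels) if lab == minority_label]
    whole = math.floor(weight)           # integer part, e.g. 2
    frac = weight - whole                # fractional part, e.g. 0.5

    # 1) Adopt all of the original learning data once.
    new_indices = list(range(len(labels)))
    # 2) Adopt every minority item (integer part - 1) more times.
    new_indices += minority * (whole - 1)
    # 3) Adopt a random subset sized by the fractional part, with no item
    #    selected twice (rounding is an assumption of this sketch).
    extra = round(len(minority) * frac)
    new_indices += rng.sample(minority, extra)
    return new_indices


# Hypothetical data: 6 majority items and 4 minority items.
labels = ["ok"] * 6 + ["ng"] * 4
idx = oversample_fractional(labels, "ng", 2.5)
print(sum(labels[i] == "ng" for i in idx))  # 10
```

Unlike the counter-based flow of FIG. 6, this variant depends on a random draw, so which minority items receive the extra copy varies with the seed while the total count stays fixed.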

The evaluation value for the temporary classifier is not limited to the F value. For example, a weighted average of the precision and the recall may be adopted as an evaluation value that uses the precision and the recall. Alternatively, the G-mean, which is the geometric mean of the TN rate expressed by equation (21) below and the TP rate (identical to the recall described above), may be calculated as the evaluation value by equation (22) below.
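Equations (21) and (22) are not reproduced in this text. As a hedged illustration, the sketch below uses the standard definitions, TN rate = TN / (TN + FP) and G-mean = √(TP rate × TN rate); whether the patent's equations are written in exactly this form is an assumption, and the counts used are hypothetical.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of the TP rate (recall) and the TN rate."""
    tp_rate = tp / (tp + fn)   # share of actual positives predicted positive
    tn_rate = tn / (tn + fp)   # share of actual negatives predicted negative
    return math.sqrt(tp_rate * tn_rate)


# Hypothetical confusion-matrix counts, for illustration only.
print(round(g_mean(tp=80, fn=20, tn=90, fp=10), 4))  # 0.8485
```

Because the two rates are multiplied, a classifier that always predicts the majority class scores 0 on this measure, which is why it suits imbalanced data.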

The number of temporary weights N′ may be any number, as long as it is two or more. When the number of temporary weights N′ is four or more, the relationship F(N) between the weight N and the F value is approximated by the quadratic function shown in equation (23) below. The coefficients a, b, and c are then calculated by the least squares method so as to minimize J, the sum of squared errors between the F value F(N′_k) obtained by substituting the temporary weight N′_k for N in this quadratic function and the F value F_k calculated by the evaluation value calculation unit 106 for that temporary weight (see equation (24)). Using the calculated coefficients a, b, and c, the optimum weight value N_opt can be obtained as described above. Further, when there are four or more temporary weights N′, the relationship F(N) between the weight N and the F value may be approximated not by a quadratic function but by an n-th order function (n is an integer of 3 or more), and the value of N that maximizes this function between N_min, the lower limit of the range of the weight N, and N_max, the upper limit of that range, may be taken as the optimum weight value N_opt. The relationship F(N) between the weight N and the F value may also be a linear function. In that case, the optimum weight value N_opt can be obtained by performing the processing that follows a NO determination in step S703 of FIG. 7.
However, it is desirable that the relationship F(N) between the weight N and the F value be an n-th order function (n is an integer of 2 or more), because the relationship F(N) between the weight N and the F value is generally not linear.
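For four or more temporary weights, the least-squares fit can be sketched as below. Equations (23) and (24) are not reproduced in this text, so the sketch assumes the usual form F(N) = a·N² + b·N + c with J = Σ (F(N′_k) − F_k)², solved through the normal equations with Cramer's rule. The function name and the sample values are hypothetical.

```python
def fit_quadratic_least_squares(points):
    """Least-squares (a, b, c) for F(N) = a*N**2 + b*N + c."""
    # Power sums s[k] = sum(N**k) and moments t[k] = sum(F * N**k).
    s = [sum(x ** k for x, _ in points) for k in range(5)]
    t = [sum(f * x ** k for x, f in points) for k in range(3)]
    # Normal equations obtained by setting dJ/da = dJ/db = dJ/dc = 0.
    m = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]

    def det3(q):
        return (q[0][0] * (q[1][1] * q[2][2] - q[1][2] * q[2][1])
                - q[0][1] * (q[1][0] * q[2][2] - q[1][2] * q[2][0])
                + q[0][2] * (q[1][0] * q[2][1] - q[1][1] * q[2][0]))

    d = det3(m)
    coeffs = []
    for col in range(3):                 # Cramer's rule, one column at a time
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)


# Hypothetical F values for four equally spaced temporary weights.
pairs = [(1.0, 0.05), (4.0, 0.25), (7.0, 0.30), (10.0, 0.24)]
a, b, c = fit_quadratic_least_squares(pairs)
if a < 0:
    print("fitted peak at N =", -b / (2 * a))
```

The resulting coefficients would then go through the same checks as in FIG. 7 (upward convexity, peak within range, otherwise the better end of the range).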

In this embodiment, the case where the classifier is a decision tree has been described as an example, but the classifier is not limited to a decision tree. For example, an SVM (Support Vector Machine), a neural network, linear discrimination, or the like may be used as the classifier.

[Second Embodiment]
Next, a second embodiment of the present invention will be described. The first embodiment described above took as an example the case where the learning data belonging to the minority class are oversampled to increase their number. In contrast, this embodiment takes as an example the case where the learning data belonging to the majority class are undersampled to reduce their number. This embodiment thus differs from the first embodiment mainly in part of the learning data sampling process. Therefore, in the description of this embodiment, parts identical to those of the first embodiment are given the same reference numerals as in FIGS. 1 to 8, and their detailed description is omitted.

In this embodiment, the learning data sampling process is realized by replacing "minority class" with "majority class" and "N′" with "1/N′" in the flowchart of FIG. 6. In this case, the learning data belonging to the minority class are added to the new learning data as they are (see steps S603 and S604), whereas the learning data belonging to the majority class are adopted as new learning data 1/N′ times on average.
This avoids, as far as possible, the situation in which the majority-class learning data are always undersampled until their number equals the number of minority-class learning data, with the result that majority-class learning data important for creating a highly accurate classifier are left out of the new learning data. Therefore, even when the number of learning data belonging to one class is extremely small compared with the number belonging to the other class, the learning data can be sampled without excess or deficiency and a highly accurate classifier can be constructed.
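Applying the substitution described above to the flow of FIG. 6 gives the following sketch of the undersampling variant. As before, this is an illustrative reading rather than the patented implementation; the function name and the label values are assumptions.

```python
def undersample_majority(labels, majority_label, weight):
    """Return indices of new learning data; majority items kept 1/weight of the time."""
    step = 1.0 / weight   # "N'" replaced by "1/N'"
    new_indices = []
    p = 0                 # cumulative count of adopted majority-class items
    q = step              # upper limit of p
    for i, label in enumerate(labels):
        if label != majority_label:   # minority class: adopted as is
            new_indices.append(i)
            continue
        while p < q:                  # adopt only while p is below the limit
            new_indices.append(i)
            p += 1
        q += step                     # raise the limit by 1/N'
    return new_indices


# Hypothetical data: 8 majority items ("ok") and 2 minority items ("ng").
labels = ["ok"] * 8 + ["ng"] * 2
print(undersample_majority(labels, "ok", 4))  # [0, 4, 8, 9]
```

With a weight of 4, one majority item in every four is kept, so two of the eight remain while both minority items are retained.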

The various modifications described in the first embodiment can also be adopted in this embodiment. When the learning data belonging to the majority class are selected at random, a number of learning data equal to 1/N′ times the number of majority-class learning data may be selected at random from the majority-class learning data, using random numbers or the like and without selecting the same item twice.

[Third Embodiment]
Next, a third embodiment will be described. The first embodiment described above took oversampling as an example, and the second embodiment took undersampling as an example. In contrast, the present embodiment takes as an example the case where both oversampling and undersampling are performed. The present embodiment thus differs from the first and second embodiments mainly in part of the learning data sampling process. Accordingly, in the description of the present embodiment, parts identical to those of the first and second embodiments are given the same reference numerals as in FIGS. 1 to 8, and their detailed description is omitted.

In the present embodiment, a priority $r$, which represents the degree to which increasing the learning data belonging to the minority class is prioritized over decreasing the number of learning data belonging to the majority class, is set, for example, based on an operation by an operator. The priority $r$ takes a value of 0 or more and 1 or less ($0 \le r \le 1$).
A temporary weight $N'_{+}$ representing the rate of increase of the learning data belonging to the minority class (hereinafter referred to as the "temporary weight $N'_{+}$ for the minority class") is defined by equation (25), and a temporary weight $N'_{-}$ representing the rate of decrease of the learning data belonging to the majority class (hereinafter referred to as the "temporary weight $N'_{-}$ for the majority class") is defined by equation (26).
$$N'_{+} = r \times N' + (1 - r) \quad \cdots (25)$$
$$N'_{-} = N' / N'_{+} \quad \cdots (26)$$
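Equations (25) and (26) can be evaluated directly. The sketch below is illustrative only; the function name `class_temporary_weights` and its argument checks are assumptions introduced here, not part of the original description.

```python
def class_temporary_weights(n_prime, r):
    """Return (N'_plus, N'_minus) from the temporary weight N' and the priority r.

    N'_plus  is the temporary weight for the minority class, equation (25).
    N'_minus is the temporary weight for the majority class, equation (26).
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("priority r must satisfy 0 <= r <= 1")
    if n_prime < 1:
        raise ValueError("temporary weight N' is assumed to be 1 or more")
    n_plus = r * n_prime + (1.0 - r)   # equation (25)
    n_minus = n_prime / n_plus         # equation (26)
    return n_plus, n_minus
```

For example, `class_temporary_weights(4, 0.5)` splits a temporary weight of 4 into an oversampling factor of 2.5 and an undersampling divisor of 1.6.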

In equations (25) and (26), $N'$ is the temporary weight described in the first and second embodiments.
Then, for example, in the learning data sampling process described in the first embodiment (see FIG. 6), the minority-class learning data is oversampled using the temporary weight $N'_{+}$ for the minority class in place of $N'$. After that, in the learning data sampling process described in the second embodiment (see FIG. 6), the majority-class learning data contained in the oversampled learning data is undersampled using $1/N'_{-}$ in place of $1/N'$. The learning data obtained in this way becomes the new learning data. Needless to say, as many temporary weights $N'_{+}$ for the minority class and temporary weights $N'_{-}$ for the majority class are obtained as there are temporary weights $N'$ (three in the first embodiment).
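The two-stage procedure above (oversample the minority class, then undersample the majority class) can be sketched as follows. The steps of FIG. 6 are not reproduced in this section, so the replication rule used here is an assumption: each item is copied `floor(factor)` times plus one extra copy with probability equal to the fractional part, which gives `factor` copies on average. The names `replicate_on_average` and `resample` are hypothetical.

```python
import math
import random


def replicate_on_average(items, factor, rng):
    """Return a list in which each item appears `factor` times on average.

    Assumed rule: floor(factor) guaranteed copies, plus one more copy with
    probability equal to the fractional part of `factor`.
    """
    whole = math.floor(factor)
    frac = factor - whole
    out = []
    for item in items:
        copies = whole + (1 if rng.random() < frac else 0)
        out.extend([item] * copies)
    return out


def resample(minority, majority, n_prime, r, seed=None):
    """Oversample the minority class by N'_plus, then undersample the majority by 1/N'_minus."""
    rng = random.Random(seed)
    n_plus = r * n_prime + (1.0 - r)   # equation (25)
    n_minus = n_prime / n_plus         # equation (26)
    new_minority = replicate_on_average(minority, n_plus, rng)
    new_majority = replicate_on_average(majority, 1.0 / n_minus, rng)
    return new_minority, new_majority
```

With `r = 1` this reduces to pure oversampling, and with `r = 0` to pure undersampling, matching the two earlier embodiments.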

When the temporary weight $N'_{+}$ for the minority class and the temporary weight $N'_{-}$ for the majority class are defined by equations (25) and (26), and the value of the priority $r$ is 1, the temporary weight $N'_{+}$ for the minority class equals the temporary weight $N'$ shown in the first embodiment ($N'_{+} = N'$), and the temporary weight $N'_{-}$ for the majority class is 1 ($N'_{-} = 1$). Therefore, when the value of the priority $r$ is 1, the new learning data is created by oversampling only (that is, the same processing as in the first embodiment is performed).
On the other hand, when the value of the priority $r$ is 0, the temporary weight $N'_{+}$ for the minority class is 1 ($N'_{+} = 1$), and the temporary weight $N'_{-}$ for the majority class equals the temporary weight $N'$ shown in the second embodiment ($N'_{-} = N'$). Therefore, when the value of the priority $r$ is 0, the new learning data is created by undersampling only (that is, the same processing as in the second embodiment is performed).
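As a worked illustration of equations (25) and (26), the following evaluates both temporary weights for an assumed temporary weight $N' = 4$ (a value chosen only for this example) at three priorities. Equation (26) also implies that the product $N'_{+} \times N'_{-}$ always equals $N'$, so the priority only changes how the adjustment is split between the two classes.

```latex
\begin{aligned}
r = 1:   &\quad N'_{+} = 1 \times 4 + 0 = 4,       &\quad N'_{-} &= 4 / 4 = 1 \\
r = 0.5: &\quad N'_{+} = 0.5 \times 4 + 0.5 = 2.5, &\quad N'_{-} &= 4 / 2.5 = 1.6 \\
r = 0:   &\quad N'_{+} = 0 \times 4 + 1 = 1,       &\quad N'_{-} &= 4 / 1 = 4 \\
\text{in every case:} &\quad N'_{+} \times N'_{-} = N' = 4
\end{aligned}
```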

The number of learning data belonging to the minority class and the number of learning data belonging to the majority class in the new learning data are given by equations (27) and (28), respectively.
$$\text{(number of minority-class learning data in the new learning data)} = \text{(number of minority-class learning data in the original learning data)} \times N'_{+} \quad \cdots (27)$$
$$\text{(number of majority-class learning data in the new learning data)} = \text{(number of majority-class learning data in the original learning data)} \times N'_{+} / N' \quad \cdots (28)$$
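Equations (27) and (28) can be checked numerically. In the sketch below, the function name `new_class_counts` and the example class sizes are assumptions made for illustration; the returned values are the counts the equations predict, not the result of an actual sampling run.

```python
def new_class_counts(n_minority, n_majority, n_prime, r):
    """Return the (minority, majority) counts predicted by equations (27) and (28)."""
    n_plus = r * n_prime + (1.0 - r)              # equation (25)
    new_minority = n_minority * n_plus            # equation (27)
    new_majority = n_majority * n_plus / n_prime  # equation (28)
    return new_minority, new_majority
```

For instance, with 100 minority-class items, 1000 majority-class items, $N' = 4$, and $r = 0.5$, the predicted counts are 250 and 625.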

Therefore, when the temporary weight $N'$ equals $N_r$ (= number of majority-class learning data in the original learning data / number of minority-class learning data in the original learning data), which is the upper limit of the range of the weight $N$ ($N' = N_r$), the number of minority-class learning data and the number of majority-class learning data in the new learning data are equal. When the temporary weight $N'$ equals 1, the lower limit of the range of the weight $N$ ($N' = 1$), $N'_{+} = N'_{-} = 1$ holds, so neither oversampling nor undersampling is performed. Hence, regardless of the value of the priority $r$, the range of the weight $N$ may be set from 1 to $N_r$, just as when only the oversampling described in the first embodiment or only the undersampling described in the second embodiment is performed.
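The equality at $N' = N_r$ follows directly from equations (27) and (28). In the sketch below, $n_{min}$ and $n_{maj}$ are shorthand introduced here (not symbols from the original text) for the numbers of minority-class and majority-class learning data in the original learning data, so that $N_r = n_{maj} / n_{min}$.

```latex
\begin{aligned}
\text{new majority count}
  &= n_{maj} \times \frac{N'_{+}}{N'}                      && \text{equation (28)} \\
  &= n_{maj} \times \frac{N'_{+}}{N_r}                     && \text{setting } N' = N_r \\
  &= n_{maj} \times N'_{+} \times \frac{n_{min}}{n_{maj}}  && \text{since } N_r = n_{maj} / n_{min} \\
  &= n_{min} \times N'_{+}                                 && \text{the new minority count, equation (27)}
\end{aligned}
```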

Furthermore, when the classifier is created using the optimum value $N_{opt}$ of the weight, $N_{opt}$ is substituted for $N'$ in equations (25) and (26) to derive a weight $N_{+}$ for the minority class and a weight $N_{-}$ for the majority class. The minority-class learning data is then oversampled using the weight $N_{+}$ for the minority class, after which the majority-class learning data is undersampled using the weight $N_{-}$ for the majority class.
In this way, the effects described in the first and second embodiments can be obtained. The various modifications described in the first and second embodiments can also be adopted in the present embodiment.

The present embodiment has been described taking as an example the case of using the priority $r$ of increasing the learning data belonging to the minority class over decreasing the number of learning data belonging to the majority class. However, a priority $r$ of decreasing the number of learning data belonging to the majority class over increasing the learning data belonging to the minority class may be used instead. In that case, equations (29) and (30) are used in place of equations (25) and (26).
$$N'_{+} = N' / N'_{-} \quad \cdots (29)$$
$$N'_{-} = r \times N' + (1 - r) \quad \cdots (30)$$
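Under this alternative definition the roles of the two weights are mirrored. The sketch below evaluates equations (29) and (30); the function name `class_temporary_weights_alt` is a hypothetical name introduced for illustration.

```python
def class_temporary_weights_alt(n_prime, r):
    """Return (N'_plus, N'_minus) when r prioritizes reducing the majority class.

    N'_minus follows equation (30) and N'_plus follows equation (29).
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("priority r must satisfy 0 <= r <= 1")
    if n_prime < 1:
        raise ValueError("temporary weight N' is assumed to be 1 or more")
    n_minus = r * n_prime + (1.0 - r)  # equation (30)
    n_plus = n_prime / n_minus         # equation (29)
    return n_plus, n_minus
```

Here `r = 1` yields undersampling only ($N'_{+} = 1$, $N'_{-} = N'$) and `r = 0` yields oversampling only, the reverse of equations (25) and (26).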

The embodiments of the present invention described above can be realized by a computer executing a program. A computer-readable recording medium on which the program is recorded, and a computer program product such as the program, can also be applied as embodiments of the present invention. As the recording medium, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a magnetic tape, a nonvolatile memory card, or a ROM can be used.
The embodiments of the present invention described above are all merely specific examples of carrying out the present invention, and the technical scope of the present invention must not be interpreted restrictively on their basis. That is, the present invention can be carried out in various forms without departing from its technical idea or its main features.

[Relationship with Claims]
<Claims 1 and 9>
The temporary weight determining means is realized, for example, by using the weight range input unit 102 and the temporary weight determination unit 103, and the temporary weight determining step is realized, for example, by performing the processing of steps S502 and S503. Here, the lower limit value corresponds, for example, to $N_{min}$, and the upper limit value corresponds to $N_{max}$.
The learning data sampling means is realized, for example, by using the learning data sampling unit 104, and the learning data sampling step is realized, for example, by performing the processing of step S505 (FIG. 6).
The learning means is realized, for example, by using the learning unit 105, and the learning step is realized, for example, by performing the processing of step S506.
The evaluation value calculating means is realized, for example, by using the evaluation value calculation unit 106, and the evaluation value calculating step is realized, for example, by performing the processing of step S507.
The optimum weight deriving means is realized, for example, by using the optimum weight derivation unit 107, and the optimum weight deriving step is realized, for example, by performing the processing of step S509 (FIG. 7).
The classifier creating means is realized, for example, by using the learning data sampling unit 108 and the learning unit 109, and the classifier creating step is realized, for example, by performing the processing of steps S409 to S411.
<Claims 3 and 11>
Claims 3 and 11 correspond, for example, to the learning data sampling process in the first embodiment.
<Claims 4 and 12>
Claims 4 and 12 correspond, for example, to the learning data sampling process in the second embodiment.
<Claims 5 and 13>
Claims 5 and 13 correspond, for example, to the learning data sampling process in the third embodiment.

DESCRIPTION OF SYMBOLS
100 Classifier creating apparatus
101 Learning data input unit
102 Weight range input unit
103 Temporary weight determination unit
104 Learning data sampling unit
105 Learning unit
106 Evaluation value calculation unit
107 Optimum weight calculation unit
108 Learning data sampling unit
109 Learning unit
110 Classifier storage unit

Claims

A classifier creating apparatus that creates new learning data by changing the number of learning data whose class, out of two classes, is known, through at least one of increasing the number of learning data of a minority class, which is the class to which relatively few learning data belong, by a factor corresponding to a weight, and reducing the number of learning data of a majority class, which is the class to which relatively many learning data belong, by a factor corresponding to the reciprocal of the weight, and that uses the new learning data to create a classifier for determining to which of the two classes given data belongs, the apparatus comprising:
temporary weight determining means for determining an upper limit value and a lower limit value of the weight from the range of values that are 1 or more and not more than the value obtained by dividing the number of learning data of the majority class by the number of learning data of the minority class, and for determining a plurality of temporary weights having mutually different values from the range from the lower limit value to the upper limit value;
learning data sampling means for creating a plurality of sets of the new learning data by performing, for each of the plurality of temporary weights, at least one of increasing the number of learning data of the minority class by a factor corresponding to the temporary weight and reducing the number of learning data of the majority class by a factor corresponding to the reciprocal of the temporary weight;
learning means for obtaining a plurality of temporary classifiers by creating, for each set of the new learning data, a temporary classifier for determining to which of the two classes given learning data belongs, using the new learning data whose number of learning data has been changed by the learning data sampling means;
evaluation value calculating means for obtaining a plurality of evaluation values by calculating, for each of the plurality of temporary classifiers, an evaluation value that evaluates the performance of the temporary classifier obtained by the learning means, based on the result of classifying the learning data into one of the two classes with the temporary classifier;
optimum weight deriving means for obtaining a relationship between the evaluation value and the weight from the evaluation values obtained by the evaluation value calculating means and the temporary weights used to obtain those evaluation values, and for deriving, as an optimum value of the weight, the weight corresponding to the largest evaluation value in the range from the lower limit value to the upper limit value in the obtained relationship; and
classifier creating means for creating the new learning data by performing, with the optimum value of the weight used as the weight, at least one of increasing the number of learning data of the minority class by a factor corresponding to the weight and reducing the number of learning data of the majority class by a factor corresponding to the reciprocal of the weight, and for creating, using the new learning data, a classifier for determining to which of the two classes given learning data belongs.

The classifier creating apparatus according to claim 1, wherein the number of temporary weights is 3 or more.

The classifier creating apparatus according to claim 1, wherein
the learning data sampling means creates the plurality of sets of the new learning data by duplicating, for each of the plurality of temporary weights, at least part of the learning data of the minority class so that the number of new learning data of the minority class equals the number of learning data of the minority class multiplied by the temporary weight, and
the classifier creating means generates the new learning data by duplicating the learning data of the minority class so that the number of new learning data of the minority class equals the number of learning data of the minority class multiplied by the optimum value of the weight, and creates the classifier using the generated new learning data.

The classifier creating apparatus according to claim 1, wherein
the learning data sampling means creates the plurality of sets of the new learning data by deleting, for each of the plurality of temporary weights, part of the learning data of the majority class so that the number of new learning data of the majority class equals the number of learning data of the majority class multiplied by the reciprocal of the temporary weight, and
the classifier creating means generates the new learning data by deleting part of the learning data of the majority class so that the number of new learning data of the majority class equals the number of learning data of the majority class multiplied by the reciprocal of the optimum value of the weight, and creates the classifier using the generated new learning data.

The classifier creating apparatus according to claim 1 or 2, comprising:
temporary weight deriving means for deriving, using the temporary weight and a priority that is set in advance and takes a value of 0 or more and 1 or less, a temporary weight for the minority class, which is a temporary weight for the learning data of the minority class, and a temporary weight for the majority class, which is a temporary weight for the learning data of the majority class, the priority being either a priority of increasing the number of learning data of the minority class over decreasing the number of learning data of the majority class, or a priority of decreasing the number of learning data of the majority class over increasing the number of learning data of the minority class; and
weight deriving means for deriving, using the priority and the optimum value of the weight, a weight for the minority class, which is a weight for the learning data of the minority class, and a weight for the majority class, which is a weight for the learning data of the majority class, wherein
the learning data sampling means creates the plurality of sets of the new learning data by performing, for each of the plurality of temporary weights, duplicating at least part of the learning data of the minority class so that the number of new learning data of the minority class equals the number of learning data of the minority class multiplied by the temporary weight for the minority class, and deleting part of the learning data of the majority class so that the number of new learning data of the majority class equals the number of learning data of the majority class multiplied by the reciprocal of the temporary weight for the majority class,
the classifier creating means generates the new learning data by duplicating the learning data of the minority class so that the number of new learning data of the minority class equals the number of learning data of the minority class multiplied by the optimum value of the weight for the minority class, and deleting part of the learning data of the majority class so that the number of new learning data of the majority class equals the number of learning data of the majority class multiplied by the reciprocal of the optimum value of the weight for the majority class, and creates the classifier using the generated new learning data,
when the priority is the priority of increasing the number of learning data of the minority class over decreasing the number of learning data of the majority class, the temporary weight for the minority class is the same as the temporary weight when the value of the priority is 1, and the temporary weight for the majority class is the same as the temporary weight when the value of the priority is 0, and
when the priority is the priority of decreasing the number of learning data of the majority class over increasing the number of learning data of the minority class, the temporary weight for the majority class is the same as the temporary weight when the value of the priority is 1, and the temporary weight for the minority class is the same as the temporary weight when the value of the priority is 0.

The classifier creating apparatus according to claim 1, wherein the evaluation value calculating means calculates, based on the result of classifying the learning data into one of the two classes with the temporary classifier, a precision, which is the proportion of learning data actually belonging to the minority class among the learning data classified into the minority class, and a recall, which is the proportion of learning data classified into the minority class among the learning data actually belonging to the minority class, and calculates, as the evaluation value, an F value that is a weighted harmonic mean of the precision and the recall.

The classifier creating apparatus described in 1, wherein
the temporary weight determining means derives three temporary weights whose values are, respectively, the upper limit value, the lower limit value, and one half of the sum of the upper limit value and the lower limit value, and
the optimum weight deriving means obtains the relationship between the evaluation value and the weight on the assumption that the evaluation value is expressed by a quadratic function of the weight.

The classifier creating apparatus according to claim 1, wherein the classifier is a decision tree.

A classifier creation method that creates new learning data by changing the number of learning data whose class, out of two classes, is known, through at least one of increasing the number of learning data of a minority class, which is the class to which relatively few learning data belong, by a factor corresponding to a weight, and reducing the number of learning data of a majority class, which is the class to which relatively many learning data belong, by a factor corresponding to the reciprocal of the weight, and that uses the new learning data to create a classifier for determining to which of the two classes given data belongs, the method comprising:
a temporary weight determining step of determining an upper limit value and a lower limit value of the weight from the range of values that are 1 or more and not more than the value obtained by dividing the number of learning data of the majority class by the number of learning data of the minority class, and determining a plurality of temporary weights having mutually different values from the range from the lower limit value to the upper limit value;
a learning data sampling step of creating a plurality of sets of the new learning data by performing, for each of the plurality of temporary weights, at least one of increasing the number of learning data of the minority class by a factor corresponding to the temporary weight and reducing the number of learning data of the majority class by a factor corresponding to the reciprocal of the temporary weight;
a learning step of obtaining a plurality of temporary classifiers by creating, for each set of the new learning data, a temporary classifier for determining to which of the two classes given learning data belongs, using the new learning data whose number of learning data has been changed in the learning data sampling step;
an evaluation value calculating step of obtaining a plurality of evaluation values by calculating, for each of the plurality of temporary classifiers, an evaluation value that evaluates the performance of the temporary classifier obtained in the learning step, based on the result of classifying the learning data into one of the two classes with the temporary classifier;
an optimum weight deriving step of obtaining a relationship between the evaluation value and the weight from the evaluation values obtained in the evaluation value calculating step and the temporary weights used to obtain those evaluation values, and deriving, as an optimum value of the weight, the weight corresponding to the largest evaluation value in the range from the lower limit value to the upper limit value in the obtained relationship; and
a classifier creating step of creating the new learning data by performing, with the optimum value of the weight used as the weight, at least one of increasing the number of learning data of the minority class by a factor corresponding to the weight and reducing the number of learning data of the majority class by a factor corresponding to the reciprocal of the weight, and creating, using the new learning data, a classifier for determining to which of the two classes given learning data belongs.

The classifier creation method according to claim 9, wherein the number of temporary weights is 3 or more.

The classifier creation method according to claim 9 or 10, wherein
the learning data sampling step creates the plurality of sets of the new learning data by duplicating, for each of the plurality of temporary weights, at least part of the learning data of the minority class so that the number of new learning data of the minority class equals the number of learning data of the minority class multiplied by the temporary weight, and
the classifier creating step generates the new learning data by duplicating the learning data of the minority class so that the number of new learning data of the minority class equals the number of learning data of the minority class multiplied by the optimum value of the weight, and creates the classifier using the generated new learning data.

The classifier creation method according to claim 9 or 10, wherein
the learning data sampling step creates the plurality of sets of the new learning data by deleting, for each of the plurality of temporary weights, part of the learning data of the majority class so that the number of new learning data of the majority class equals the number of learning data of the majority class multiplied by the reciprocal of the temporary weight, and
the classifier creating step generates the new learning data by deleting part of the learning data of the majority class so that the number of new learning data of the majority class equals the number of learning data of the majority class multiplied by the reciprocal of the optimum value of the weight, and creates the classifier using the generated new learning data.

The classifier creation method according to claim 9 or 10, comprising:
a temporary weight deriving step of deriving, using the temporary weight and a priority that is set in advance and takes a value of 0 or more and 1 or less, a temporary weight for the minority class, which is a temporary weight for the learning data of the minority class, and a temporary weight for the majority class, which is a temporary weight for the learning data of the majority class, the priority being either a priority of increasing the number of learning data of the minority class over decreasing the number of learning data of the majority class, or a priority of decreasing the number of learning data of the majority class over increasing the number of learning data of the minority class; and
a weight deriving step of deriving, using the priority and the optimum value of the weight, a weight for the minority class, which is a weight for the learning data of the minority class, and a weight for the majority class, which is a weight for the learning data of the majority class, wherein
the learning data sampling step creates the plurality of sets of the new learning data by performing, for each of the plurality of temporary weights, duplicating at least part of the learning data of the minority class so that the number of new learning data of the minority class equals the number of learning data of the minority class multiplied by the temporary weight for the minority class, and deleting part of the learning data of the majority class so that the number of new learning data of the majority class equals the number of learning data of the majority class multiplied by the reciprocal of the temporary weight for the majority class,
the classifier creating step generates the new learning data by duplicating the learning data of the minority class so that the number of new learning data of the minority class equals the number of learning data of the minority class multiplied by the optimum value of the weight for the minority class, and deleting part of the learning data of the majority class so that the number of new learning data of the majority class equals the number of learning data of the majority class multiplied by the reciprocal of the optimum value of the weight for the majority class, and creates the classifier using the generated new learning data,
when the priority is the priority of increasing the number of learning data of the minority class over decreasing the number of learning data of the majority class, the temporary weight for the minority class is the same as the temporary weight when the value of the priority is 1, and the temporary weight for the majority class is the same as the temporary weight when the value of the priority is 0, and
when the priority is the priority of decreasing the number of learning data of the majority class over increasing the number of learning data of the minority class, the temporary weight for the majority class is the same as the temporary weight when the value of the priority is 1, and the temporary weight for the minority class is the same as the temporary weight when the value of the priority is 0.

The classifier creation method according to claim 9, wherein the evaluation value calculating step calculates, based on the result of classifying the learning data into one of the two classes with the temporary classifier, a precision, which is the proportion of learning data actually belonging to the minority class among the learning data classified into the minority class, and a recall, which is the proportion of learning data classified into the minority class among the learning data actually belonging to the minority class, and calculates, as the evaluation value, an F value that is a weighted harmonic mean of the precision and the recall.

The classifier creation method described in 2, wherein
the temporary weight determining step derives three temporary weights whose values are, respectively, the upper limit value, the lower limit value, and one half of the sum of the upper limit value and the lower limit value, and
the optimum weight deriving step obtains the relationship between the evaluation value and the weight on the assumption that the evaluation value is expressed by a quadratic function of the weight.

The classifier creation method according to claim 9, wherein the classifier is a decision tree.

A computer program for causing a computer to function as each means of the classifier creating apparatus according to claim 1.