JP6193428B1

JP6193428B1 - Feature selection device, feature selection method, and program

Info

Publication number: JP6193428B1
Application number: JP2016054517A
Authority: JP
Inventors: 信太郎高橋; 実西澤; 秀将伊藤
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2016-03-17
Filing date: 2016-03-17
Publication date: 2017-09-06
Anticipated expiration: 2036-03-17
Also published as: JP2017167979A

Abstract

[Problem] To provide a feature selection device, a feature selection method, and a program capable of selecting a subset so that the target value set for each of a plurality of evaluation indexes is achieved as far as possible. [Solution] A feature selection device 1 according to an embodiment includes a model evaluation unit 7 that calculates an evaluation value for each of a plurality of evaluation indexes relating to the performance of a model, and an evaluation value integration unit 8 that calculates, for each of the plurality of evaluation indexes, the degree of achievement of the evaluation value with respect to a set target value, and calculates an integrated evaluation value that is rated higher as the degree of achievement of each of the plurality of evaluation indexes is higher and as the variation in the degree of achievement among the plurality of evaluation indexes is smaller. The device searches for a subset for which the integrated evaluation value is high. [Selected drawing] FIG. 6

Description

Embodiments of the present invention relate to a feature selection device, a feature selection method, and a program.

When a model is built using a machine learning algorithm, a technique called feature selection is used to select, from an arbitrary feature set, a subset of features that is useful for machine learning. Several feature selection methods exist, one of which is known as the Wrapper method. The Wrapper method repeatedly generates and evaluates a model while changing the subset, and searches for a subset that yields a high evaluation value.

Various evaluation indexes are available for evaluating the performance of a model. For example, evaluation indexes for a prediction model that predicts the occurrence of an event include "recall", which indicates the completeness of the predictions, and "precision", which indicates their correctness. Recall and precision are essentially in a trade-off relationship, and it is difficult to select a subset for which both evaluation values take their maximum at the same time. For this reason, the "F-measure", which is the harmonic mean of recall and precision, is sometimes used as an evaluation index. When a model is evaluated using the F-measure, a model with a good balance between recall and precision receives a high evaluation value.

Depending on the application of the model, however, a user may wish, for example, to build a model that prioritizes precision while securing a certain level of recall, or conversely, a model that prioritizes recall while securing a certain level of precision. With conventional techniques, when the targets differ from one evaluation index to another in this way, a subset cannot be selected so as to achieve those targets as far as possible, and improvement is called for.

JP-A-5-197813

The problem to be solved by the present invention is to provide a feature selection device, a feature selection method, and a program capable of selecting a subset so that the target value set for each of a plurality of evaluation indexes is achieved as far as possible.

A feature selection device according to an embodiment repeatedly generates and evaluates a model using a subset of a feature set and searches for the subset that yields a high evaluation value, and includes a model evaluation unit and an evaluation value integration unit. The model evaluation unit calculates an evaluation value for each of a plurality of evaluation indexes relating to the performance of the model. The evaluation value integration unit calculates, for each of the plurality of evaluation indexes, the degree of achievement of the evaluation value with respect to a set target value, and calculates an integrated evaluation value that is rated higher as the degree of achievement of each of the plurality of evaluation indexes is higher and as the variation in the degree of achievement among the plurality of evaluation indexes is smaller. The feature selection device according to the embodiment searches for the subset for which the integrated evaluation value is high.

FIG. 1 is a diagram illustrating an example in which estimation by a model is performed at a fixed period.
FIG. 2 is a diagram illustrating an example in which data in a future period is predicted using time-series data.
FIG. 3 is a diagram illustrating an example of a data set.
FIG. 4 is a diagram illustrating an example of cost setting data.
FIG. 5 is a diagram illustrating an example of correction setting data.
FIG. 6 is a block diagram illustrating an example of the functional configuration of the feature selection device according to the example.
FIG. 7 is a diagram illustrating an example of target value data.
FIG. 8 is a diagram illustrating an example of cost constraint data.
FIG. 9 is a diagram illustrating an example of end condition data.
FIG. 10 is a diagram illustrating an example of feature candidate data.
FIG. 11 is a diagram illustrating an example of selected set data.
FIG. 12 is a diagram illustrating an example of mode data.
FIG. 13 is a diagram illustrating an example of evaluation target set data.
FIG. 14 is a diagram illustrating the relationship between the processing of the cost evaluation unit and data.
FIG. 15 is a diagram illustrating an example of cost data.
FIG. 16 is a diagram illustrating an example of model data.
FIG. 17 is a diagram illustrating an example of evaluation value data.
FIG. 18 is a diagram illustrating the relationship between the processing of the evaluation value integration unit and data.
FIG. 19 is a diagram illustrating an example of integrated evaluation value data.
FIG. 20 is a diagram illustrating the relationship between the processing of the evaluation value correction unit and data.
FIG. 21 is a diagram illustrating an example of corrected evaluation value data.
FIG. 22 is a diagram illustrating the relationship between the processing of the exclusion loss evaluation unit and data.
FIG. 23 is a diagram illustrating an example of final evaluation data.
FIG. 24 is a diagram illustrating an example of provisional first-place data.
FIG. 25 is a diagram illustrating an example of best set data.
FIG. 26-1 is a flowchart illustrating an example of the processing procedure of the feature selection device in the addition mode.
FIG. 26-2 is a flowchart illustrating an example of the processing procedure of the feature selection device in the addition mode.
FIG. 27-1 is a flowchart illustrating an example of the processing procedure of the feature selection device in the exclusion mode.
FIG. 27-2 is a flowchart illustrating an example of the processing procedure of the feature selection device in the exclusion mode.
FIG. 28 is a block diagram illustrating an example of the hardware configuration of the feature selection device.

Hereinafter, a feature selection device, a feature selection method, and a program according to an embodiment will be described in detail with reference to the drawings. The feature selection device of the present embodiment is a feature selection device based on the Wrapper method: it repeatedly generates and evaluates a model using a subset of an arbitrary feature set and searches for a subset that yields a high evaluation value.

<Outline of Embodiment>
First, an outline of the present embodiment will be described. A conventional feature selection device based on the Wrapper method has no function for taking into account the target values set for a plurality of evaluation indexes relating to model performance and performing feature selection so that those target values are achieved as far as possible. As a result, it is difficult to obtain the model that the user wants.

In the present embodiment, therefore, an "integrated evaluation value" that integrates the evaluation values of a plurality of evaluation indexes is calculated on the basis of the degree to which each evaluation value achieves the target value of its evaluation index, and a subset for which this integrated evaluation value is high is searched for. The integrated evaluation value is rated higher as the degree of achievement of each of the plurality of evaluation indexes is higher and as the variation in the degree of achievement among the plurality of evaluation indexes is smaller. This makes it possible to select a subset so that the target value set for each of the plurality of evaluation indexes is achieved as far as possible, and thus to obtain the model that the user wants.
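
As a rough illustration of the idea described above, the following Python sketch computes a degree of achievement for each index and combines the degrees so that high, evenly spread achievements score best. The formula used here (mean achievement minus a weighted standard deviation), the cap at 1.0, the evaluation values, and the names `achievement`, `merged_eval_sketch`, and `spread_weight` are all assumptions made for illustration; this passage does not state how the embodiment actually computes the integrated evaluation value.

```python
from statistics import mean, pstdev


def achievement(value: float, target: float) -> float:
    """Degree of achievement of one index, capped at 1.0 (assumed definition)."""
    if target <= 0:
        raise ValueError("target must be positive")
    return min(value / target, 1.0)


def merged_eval_sketch(values: dict, targets: dict, spread_weight: float = 1.0) -> float:
    """Hypothetical integrated value: high when every achievement is high
    and the achievements differ little from one index to another."""
    degrees = [achievement(values[name], targets[name]) for name in targets]
    # Penalize the spread (population standard deviation) of the achievements.
    return mean(degrees) - spread_weight * pstdev(degrees)


# Target values as in the FIG. 7 example; the evaluation values are made up.
targets = {"recall": 0.9, "precision": 0.4}
balanced = merged_eval_sketch({"recall": 0.72, "precision": 0.32}, targets)
lopsided = merged_eval_sketch({"recall": 0.90, "precision": 0.24}, targets)
```

Both candidates have the same mean achievement (about 0.8), but the lopsided one is penalized for its spread, so under this assumed formula the balanced candidate receives the higher value.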

In addition, a conventional feature selection device based on the Wrapper method has no function for performing feature selection that takes into account factors other than model performance (prediction accuracy, estimation accuracy, and the like). In actual data analysis, however, there are cases in which factors other than performance should also be considered when evaluating a model.

For example, consider a case in which prediction or estimation by a model is performed at a fixed period, as shown in FIG. 1. In this case, feature extraction from the observation data and application of the model must be completed within each prediction or estimation period. If the feature set in use (the subset selected by the feature selection device) contains many features that take a long time to extract, prediction or estimation cannot be performed at the fixed period. When performing feature selection, it is therefore necessary to check not only the performance of the model obtained from the subset under evaluation but also whether the processing time required to extract all the features in that subset fits within a predetermined time.

In the present embodiment, therefore, the processing time required to extract all the features in a subset is defined as the cost of that subset, and feature selection is performed so that the integrated evaluation value described above becomes high while the constraint that the cost of the subset is no greater than a predetermined upper limit is satisfied. This makes it possible to obtain a model that is both useful in actual data analysis and high in performance.

As another example, consider a case in which the occurrence of an event in a future period is predicted using time-series data, as shown in FIG. 2 (hereinafter, this period is referred to as the "prediction target period"). If some countermeasure is to be taken before the prediction target period in response to the prediction result, the prediction result should be available as early as possible. Even if the prediction accuracy is high, the value of the prediction model is reduced if the prediction result becomes available only immediately before the prediction target period. Conversely, a prediction model may be more valuable if it delivers its result early, even if its prediction accuracy is somewhat lower.

As shown in FIG. 2, when features are extracted from windows placed at various positions relative to the prediction target period and used as candidates for feature selection, the time at which the prediction result becomes available also changes depending on the feature set in use (the subset selected by the feature selection device). In the present embodiment, therefore, the earliness with which the model built from the subset under evaluation yields its prediction result (the length of the period T in FIG. 2) is defined as the score of that subset, and the integrated evaluation value described above is corrected on the basis of the score of the subset. The correction is made so that, for models with the same prediction accuracy, the evaluation value becomes higher as the period T in FIG. 2 becomes longer. This makes it possible to obtain a model that is both useful in actual data analysis and high in performance.

<Example>
A more specific example of the present embodiment is described below. In this example, greedy forward selection and greedy backward selection are assumed as the approaches for searching for a subset that yields a high evaluation value. Greedy forward selection repeats the process of adding a feature to the set of already selected features (hereinafter referred to as the "selected set"). The feature selection device of this example is based on the feature selection mode that uses greedy forward selection and, where necessary, switches to the feature selection mode that uses greedy backward selection, which excludes features from the selected set. Hereinafter, the former mode is referred to as the "addition mode" and the latter as the "exclusion mode".
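
The addition mode can be pictured with a minimal Python sketch of greedy forward selection. The function name `forward_greedy`, the `evaluate` callback, and the toy usefulness values are hypothetical; in the device described here, evaluating a candidate set would involve generating and evaluating a model, and the switch to the exclusion mode is not shown.

```python
from typing import Callable, FrozenSet, Iterable


def forward_greedy(
    candidates: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    max_features: int,
) -> FrozenSet[str]:
    """Greedy forward selection: at each step, add the single feature
    whose addition gives the highest evaluation value."""
    selected: FrozenSet[str] = frozenset()
    remaining = set(candidates)
    while remaining and len(selected) < max_features:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda f: evaluate(selected | {f}))
        selected = selected | {best}
        remaining.discard(best)
    return selected


# Toy evaluation (hypothetical): each feature contributes a fixed usefulness.
usefulness = {"f1": 0.5, "f2": 0.3, "f3": 0.1}
chosen = forward_greedy(
    usefulness,
    lambda subset: sum(usefulness[f] for f in subset),
    max_features=2,
)
```

With these toy values the sketch first adds `f1` and then `f2`, and stops because the maximum number of features has been reached.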

The example described below is only one example; as long as the device has equivalent functions, the way in which the functions are divided does not matter. Likewise, the various data shown below may be represented and stored in any form, as long as they contain the same information as in this example.

In this example, feature selection and model generation are performed according to the preconditions and settings described below. The device may also support conditions and settings other than those shown here, and may allow them to be switched according to user input or the like during the initial setting described later. For example, the evaluation indexes to be used may be made selectable according to user input at the time of initial setting.

<Model to be generated>
In this example, a model (prediction model) is generated that predicts, by two-class classification, whether a failure of a certain device will occur. For example, a model is generated that estimates to which of two classes, the "class in which a failure occurs" or the "class in which no failure occurs", the given input data belongs, and classifies the data accordingly. For the input feature values, this model outputs data indicating whether a failure will occur in the prediction target period (hereinafter referred to as a "label"). Such a model can be generated by a machine learning algorithm such as a Support Vector Machine, a decision tree, or logistic regression.

<Learning data set and evaluation data set>
The learning data set is a data set used to generate a model by machine learning, and the evaluation data set is a data set used to evaluate the generated model. The learning data set and the evaluation data set are each a collection of samples, and both contain samples of the same kind. Hereinafter, when the learning data set and the evaluation data set are not distinguished, they are simply referred to as the "data set".

FIG. 3 is a diagram illustrating an example of the data set D1. In the data set D1 shown in FIG. 3, each row corresponds to one sample. Each sample consists of the set of values of the features that are candidates for selection and the label of the correct prediction (hereinafter referred to as the "correct label"). The data set D1 shown in FIG. 3 is an example in which the number of candidate features (the total number of features in the feature set) is 300. The correct label of each sample is "1" if a failure occurred in the device and "-1" if no failure occurred. The samples in the learning data set and the evaluation data set may be swapped at random each time before model generation, or may be determined using cross-validation. For simplicity, this example assumes that the samples contained in the learning data set and in the evaluation data set do not change.

<Termination condition for feature selection processing>
In this example, processing ends when the number of features contained in the selected set reaches the maximum number determined at the time of the initial setting described later, or when processing in the exclusion mode has been executed the upper-limit number of times determined at the time of initial setting. This is only one example, and other termination conditions may be used, such as ending when the degrees of achievement of the recall and precision of the model built from the selected set reach or exceed a certain level.

<Evaluation indexes of the model>
The performance of the model generated using the subset to be evaluated (hereinafter referred to as the "evaluation target set") X, that is, the prediction accuracy of the prediction model, is evaluated by making predictions on the evaluation data set with the model and calculating, for the prediction results, the recall expressed by equation (1) and the precision expressed by equation (2). In other words, of the two classes "class in which a failure occurs" and "class in which no failure occurs", attention is paid to the "class in which a failure occurs": recall evaluates whether the data that should be classified into that class has been classified into it without omission, and precision evaluates the proportion of the data classified into that class that has been classified correctly. Although only the "class in which a failure occurs" is evaluated here, recall and precision may be evaluated for a plurality of classes.
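
Equations (1) and (2) are not reproduced in this text. The following Python sketch assumes the standard definitions, recall = TP / (TP + FN) and precision = TP / (TP + FP) for the "failure occurs" class, which match the verbal description above. The function name `recall_precision`, the sample labels, and the choice to return 0.0 when a denominator is zero are assumptions for illustration.

```python
def recall_precision(y_true, y_pred, positive=1):
    """Return (recall, precision) for the positive class ("failure occurs")."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Returning 0.0 for an empty denominator is an assumed convention.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Labels follow the data set convention: 1 = failure, -1 = no failure.
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1]
recall, precision = recall_precision(y_true, y_pred)
```

In this made-up example two of the three actual failures are predicted (recall 2/3), and two of the three failure predictions are correct (precision 2/3).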

In this example, the user sets a target value for each of recall and precision at the time of initial setting. The recall value (evaluation value) and the precision value (evaluation value) obtained by evaluating the model are then integrated, taking their respective target values into account, to calculate the integrated evaluation value merged_eval(X). The evaluation indexes shown here are only examples; any kind of evaluation index may be used as long as it is calculated as a numerical value and a target value can be set for it. Three or more evaluation indexes may also be used.

<Cost and constraint>
In this example, a cost feature_cost(x) is assumed to be set for each feature x in the feature set. The cost feature_cost(x) may be, for example, the average processing time required to extract the feature x. The cost cost(X) of the evaluation target set X is defined, for example as in equation (3), as the sum of the costs feature_cost(x) of the features x contained in the evaluation target set X, that is, cost(X) = Σ_{x∈X} feature_cost(x).

In this example, as in equation (4), the constraint is set that the cost cost(X) of the evaluation target set X is no greater than the upper limit max_cost, that is, cost(X) ≤ max_cost. The value of the upper limit max_cost is set by the user at the time of initial setting.

These settings are only examples. Any definition may be used for the cost cost(X) of the evaluation target set X, as long as it is a numerical value that tends to increase as the number of elements of X increases. For example, if feature extraction is processed in parallel, the maximum of the costs feature_cost(x) of the features x contained in the evaluation target set X may be used instead of their sum.
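
The cost definition and the constraint can be sketched in Python as follows, including the maximum-based variant mentioned for parallel extraction. The function names and the per-feature cost values are hypothetical; the upper limit 5.0 is the value used in the FIG. 8 example.

```python
def cost_sum(subset, feature_cost):
    """cost(X) as the sum of the per-feature costs (sequential extraction)."""
    return sum(feature_cost[x] for x in subset)


def cost_max(subset, feature_cost):
    """Alternative cost(X): the largest per-feature cost (parallel extraction)."""
    return max((feature_cost[x] for x in subset), default=0.0)


def satisfies_constraint(subset, feature_cost, max_cost, cost_fn=cost_sum):
    """True if cost(X) <= max_cost under the chosen cost definition."""
    return cost_fn(subset, feature_cost) <= max_cost


# Hypothetical per-feature costs (e.g. average extraction time).
feature_cost = {"x1": 1.5, "x2": 2.0, "x3": 2.5}
max_cost = 5.0

ok_pair = satisfies_constraint({"x1", "x2"}, feature_cost, max_cost)
ok_all_sum = satisfies_constraint({"x1", "x2", "x3"}, feature_cost, max_cost)
ok_all_max = satisfies_constraint({"x1", "x2", "x3"}, feature_cost, max_cost, cost_max)
```

With these values the full set violates the constraint under the sum definition (6.0 > 5.0) but satisfies it under the maximum definition (2.5 ≤ 5.0), which shows how strongly the choice of cost definition affects which subsets are admissible.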

The settings relating to the cost feature_cost(x) are assumed to be defined before the feature selection device of this example starts processing, for example by cost setting data D2 such as that shown in FIG. 4. The cost setting data D2 gives the cost feature_cost(x) of each feature x in the feature set. The value of the cost feature_cost(x) of a feature x, the method of calculating the cost cost(X) of the evaluation target set X, and so on may also be made configurable according to user input, for example at the time of initial setting.

<Score and correction method>
In this example, a score feature_score(x) is assumed to be set for each feature x in the feature set. The score feature_score(x) may be, for example, the length of the grace period t of each feature shown in FIG. 2. The score score(X) of the evaluation target set X is defined, for example as in equation (5), as the minimum of the scores feature_score(x) of the features x contained in the evaluation target set X, that is, score(X) = min_{x∈X} feature_score(x).

In this example, after the integrated evaluation value merged_eval(X) has been calculated, it is corrected according to the score score(X) of the evaluation target set X, as in equation (6), to obtain the corrected evaluation value corrected_eval(X).

The function f(·) used here may be, for example, the one shown in equation (7). In equation (7), α(score(X)) is a coefficient that becomes larger as score(X) becomes larger. In this example, the value of α(score(X)) corresponding to each value of score(X) is assumed to be set in advance (see FIG. 5), but any other function may be defined for the relationship between score(X) and α(score(X)).

The correction method shown here is only one example; any method may be used as long as the corrected evaluation value corrected_eval(X) becomes better as the score score(X) of the evaluation target set X becomes better. For example, in equation (7) the integrated evaluation value merged_eval(X) is multiplied by the coefficient α(score(X)) corresponding to the score score(X) of the evaluation target set X, but the coefficient α(score(X)) may be added instead. A more complex function may also be defined.
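
The multiplicative correction described above can be sketched in Python as follows. The score of a set is the minimum of its feature scores, α is looked up in a threshold table, and the integrated evaluation value is multiplied by α. The table layout, the threshold and coefficient values, the feature scores, and the function names are assumptions for illustration; the actual contents of the correction setting data are not given in this text.

```python
def subset_score(subset, feature_score):
    """score(X): the minimum feature score in the subset."""
    return min(feature_score[x] for x in subset)


def alpha(score, table):
    """Coefficient for the largest threshold not exceeding score.
    table is a list of (threshold, coefficient) pairs in ascending order."""
    coefficient = table[0][1]
    for threshold, value in table:
        if score >= threshold:
            coefficient = value
    return coefficient


def corrected_eval(merged_eval, subset, feature_score, table):
    """Multiplicative correction: alpha(score(X)) * merged_eval(X)."""
    return alpha(subset_score(subset, feature_score), table) * merged_eval


# Hypothetical scores (grace periods) and a hypothetical alpha table.
feature_score = {"x1": 3.0, "x2": 1.0, "x3": 2.0}
alpha_table = [(0.0, 1.0), (2.0, 1.1), (3.0, 1.2)]

early = corrected_eval(0.8, {"x1", "x3"}, feature_score, alpha_table)
late = corrected_eval(0.8, {"x1", "x2"}, feature_score, alpha_table)
```

Both sets have the same integrated evaluation value in this made-up example, but the set whose slowest feature still leaves a longer grace period receives the larger corrected value.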

The settings relating to the score feature_score(x) and to the coefficient α(score(X)) corresponding to the score score(X) are assumed to be defined before the feature selection device of this example starts processing, for example by correction setting data D3 such as that shown in FIG. 5. The correction setting data D3 contains the score feature_score(x) of each feature x in the feature set and a table of the coefficient α(score(X)) for each value of the score score(X) of the evaluation target set X. The value of the score feature_score(x) of a feature x, the method of calculating the score score(X) of the evaluation target set X, the method of calculating the corrected evaluation value corrected_eval(X), and so on may also be made configurable according to user input, for example at the time of initial setting.

<Device configuration>
Next, the functional configuration of the feature selection device of this example will be described. FIG. 6 is a block diagram illustrating an example of the functional configuration of the feature selection device 1 of this example. As shown in FIG. 6, the feature selection device 1 of this example includes an input reception unit 2, an initial setting unit 3, an evaluation target set generation unit 4, a cost evaluation unit 5, a model generation unit 6, a model evaluation unit 7, an evaluation value integration unit 8, an evaluation value correction unit 9, an exclusion loss evaluation unit 10, a provisional first-place update unit 11, a selected set update unit 12, an end determination unit 13, and an output unit 14.

The input reception unit 2 receives input from the user for the initial setting. Here, the user is assumed to input the target values for recall and precision, the upper limit max_cost of the cost cost(X) of the evaluation target set X, and, for the termination condition, the maximum number of features and the maximum number of executions of the exclusion mode. For the upper limit max_cost of the cost cost(X) of the evaluation target set X, a value smaller than the minimum of the scores feature_score(x) of the features x is not accepted.

The initial setting unit 3 reflects the user input received by the input reception unit 2, and generates and initializes the target value data D4, the cost constraint data D5, the end condition data D6, the feature candidate data D7, the selected set data D8, and the mode data D9.

FIG. 7 is a diagram showing an example of the target value data D4. The target value data D4 records the target values for recall and precision in accordance with the user input. The target value data D4 illustrated in FIG. 7 indicates that the target value for recall is set to 0.9 and the target value for precision is set to 0.4.

FIG. 8 is a diagram showing an example of the cost constraint data D5. The cost constraint data D5 records the upper limit max_cost on the cost cost(X) of the evaluation target set X in accordance with the user input. The cost constraint data D5 illustrated in FIG. 8 indicates that the upper limit max_cost is set to 5.0.

FIG. 9 is a diagram showing an example of the end condition data D6. The end condition data D6 records the maximum number of features to be selected and the maximum number of exclusion mode executions in accordance with the user input. The end condition data D6 illustrated in FIG. 9 indicates that the maximum number of features to be selected is set to 50 and the maximum number of exclusion mode executions is set to 10.

FIG. 10 is a diagram showing an example of the feature candidate data D7. The feature candidate data D7 records all features that are candidates for selection (all features included in the feature set) together with their states. FIG. 10 shows an example of the feature candidate data D7 in which the number of candidate features (the total number of features included in the feature set) is 300. The state of each feature is one of “unevaluated”, “pending”, “evaluated”, “selected”, “violation”, and “excluded”. At initial setting, the states of all features in the feature candidate data D7 are set to “unevaluated”.

FIG. 11 is a diagram showing an example of the selected set data D8. The selected set data D8 records all features included in the selected set, the corrected evaluation value corrected_eval(X) calculated for the selected set, and the cost cost(X) calculated for the selected set. At initial setting, the selected set in the selected set data D8 is the empty set, and arbitrary values are set for the corrected evaluation value corrected_eval(X) and the cost cost(X).

FIG. 12 is a diagram showing an example of the mode data D9. The mode data D9 records whether the current mode is the addition mode or the exclusion mode. At initial setting, the mode in the mode data D9 is set to the addition mode.
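For illustration only, the initial settings held in the data D4 to D9 could be represented in memory as in the following minimal sketch. The class and field names (`FeatureState`, `Mode`, `SelectorState`, `initialize`) are assumptions introduced here and do not appear in the embodiment; the initial value 0.0 stands in for the "arbitrary values" of the corrected evaluation value and the cost.

```python
from dataclasses import dataclass, field
from enum import Enum


class FeatureState(Enum):
    """The six feature states recorded in the feature candidate data D7."""
    UNEVALUATED = "unevaluated"
    PENDING = "pending"
    EVALUATED = "evaluated"
    SELECTED = "selected"
    VIOLATION = "violation"
    EXCLUDED = "excluded"


class Mode(Enum):
    """The two modes recorded in the mode data D9."""
    ADDITION = "addition"
    EXCLUSION = "exclusion"


@dataclass
class SelectorState:
    targets: dict                  # D4: target value per evaluation index
    max_cost: float                # D5: upper limit on cost(X)
    max_features: int              # D6: maximum number of selected features
    max_exclusion_runs: int        # D6: maximum number of exclusion mode runs
    feature_states: dict           # D7: feature name -> FeatureState
    selected: set = field(default_factory=set)  # D8: starts as the empty set
    selected_eval: float = 0.0     # D8: arbitrary initial corrected_eval
    selected_cost: float = 0.0     # D8: arbitrary initial cost
    mode: Mode = Mode.ADDITION     # D9: starts in the addition mode


def initialize(feature_names, recall_target, precision_target,
               max_cost, max_features, max_exclusion_runs):
    """Build the initial state; every candidate feature starts as unevaluated."""
    return SelectorState(
        targets={"recall": recall_target, "precision": precision_target},
        max_cost=max_cost,
        max_features=max_features,
        max_exclusion_runs=max_exclusion_runs,
        feature_states={n: FeatureState.UNEVALUATED for n in feature_names},
    )
```

With the values illustrated in FIGS. 7 to 10, the call would be `initialize(names, 0.9, 0.4, 5.0, 50, 10)` for a list `names` of 300 feature names.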

The evaluation target set generation unit 4 refers to the mode data D9, the feature candidate data D7, and the selected set data D8 and generates an evaluation target set. In the addition mode, it adds one feature whose state is “unevaluated” to the selected set; in the exclusion mode, it removes from the selected set one feature whose state in the feature candidate data D7 is not “evaluated”. It then sets the state in the feature candidate data D7 of the added feature (addition mode) or the removed feature (exclusion mode) to “pending”. The evaluation target set generation unit 4 then generates evaluation target set data D10, which records each feature included in the evaluation target set, for example as shown in FIG. 13. If evaluation target set data D10 already exists, it is deleted and new evaluation target set data D10 is generated.
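The generation of an evaluation target set can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: states are plain strings, the function name `generate_evaluation_target` is hypothetical, and the choice of the first candidate in a fixed order is an assumption, since the text does not specify which candidate is taken.

```python
def generate_evaluation_target(mode, selected, feature_states):
    """Return (evaluation_target_set, changed_feature), or None if no candidate.

    mode: "addition" or "exclusion"
    selected: set of feature names in the selected set (D8)
    feature_states: dict of feature name -> state string (D7), updated in place
    """
    if mode == "addition":
        # Candidates are features that have not yet been tried in this round.
        candidates = [f for f, s in feature_states.items() if s == "unevaluated"]
    else:
        # Candidates are selected features not yet tried for removal.
        candidates = sorted(f for f in selected if feature_states[f] != "evaluated")
    if not candidates:
        return None

    feature = candidates[0]  # assumption: take the first candidate
    if mode == "addition":
        target = set(selected) | {feature}
    else:
        target = set(selected) - {feature}

    # The added or removed feature is marked "pending" until the cost is checked.
    feature_states[feature] = "pending"
    return target, feature
```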

The cost evaluation unit 5 calculates the cost cost(X) of the evaluation target set generated by the evaluation target set generation unit 4 and determines whether the calculated cost cost(X) exceeds the upper limit max_cost. FIG. 14 is a diagram showing the relationship between the processing of the cost evaluation unit 5 and the data. In the figure, Refer denotes a data reference, Create denotes data generation, and Update denotes a data update.

The cost evaluation unit 5 first refers to the evaluation target set data D10 and the cost setting data D2 and calculates the cost cost(X) of the evaluation target set generated by the evaluation target set generation unit 4 by Equation (3) above. Next, the cost evaluation unit 5 refers to the cost constraint data D5 and determines whether the calculated cost cost(X) exceeds the upper limit max_cost. If the cost cost(X) does not exceed the upper limit max_cost, the cost evaluation unit 5 refers to the feature candidate data D7 and updates the state of the feature whose state is “pending” to “evaluated”. It then generates cost data D11, which records the cost cost(X) of the evaluation target set, for example as shown in FIG. 15. If cost data D11 already exists, it is deleted and new cost data D11 is generated.

On the other hand, if the cost cost(X) of the evaluation target set exceeds the upper limit max_cost, the cost evaluation unit 5 refers to the feature candidate data D7 and updates the state of the feature whose state is “pending” to “violation”. If, as a result, the states of all features in the feature candidate data D7 are “selected”, “violation”, or “excluded”, the cost evaluation unit 5 updates the mode data D9 to the exclusion mode.
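The cost check and the resulting state updates can be sketched as follows. Equation (3) is not reproduced in this part of the text, so the sketch assumes that cost(X) is the sum of per-feature costs taken from the cost setting data D2; the function name `evaluate_cost` and the dictionary `feature_costs` are likewise hypothetical.

```python
def evaluate_cost(target, feature_costs, max_cost, feature_states):
    """Return (cost, within_limit, switch_to_exclusion).

    target: evaluation target set (D10)
    feature_costs: per-feature cost (stand-in for D2; additive cost is an assumption)
    max_cost: upper limit from the cost constraint data D5
    feature_states: dict of feature name -> state string (D7), updated in place
    """
    cost = sum(feature_costs[f] for f in target)  # assumed form of Equation (3)
    within_limit = cost <= max_cost               # the cost must not exceed max_cost

    # The pending feature becomes "evaluated" on success and "violation" otherwise.
    new_state = "evaluated" if within_limit else "violation"
    for name, state in feature_states.items():
        if state == "pending":
            feature_states[name] = new_state

    # After a violation, switch to the exclusion mode once every feature is
    # selected, in violation, or excluded (nothing is left that could be added).
    exhausted = all(s in {"selected", "violation", "excluded"}
                    for s in feature_states.values())
    return cost, within_limit, (not within_limit) and exhausted
```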

The model generation unit 6 refers to the evaluation target set data D10 and the training data set (data set D1) and generates a model using only the features recorded in the evaluation target set data D10. The model generation unit 6 then generates model data D12 representing the rules, parameters, and the like of the generated model. If model data D12 already exists, it is deleted and new model data D12 is generated. FIG. 16 is a diagram showing an example of the model data D12. The model data D12 illustrated in FIG. 16 is an example for the case where a linear classifier is used, and records the weight w for each feature and the bias b.

The model evaluation unit 7 refers to the evaluation target set data D10, the model data D12, and the evaluation data set (data set D1) and, using only the features recorded in the evaluation target set data D10, calculates the recall of the model generated by the model generation unit 6 by Equation (1) above and its precision by Equation (2) above. The model evaluation unit 7 then generates evaluation value data D13, which records the calculated recall and precision as evaluation values. If evaluation value data D13 already exists, it is deleted and new evaluation value data D13 is generated. FIG. 17 is a diagram showing an example of the evaluation value data D13. The evaluation value data D13 illustrated in FIG. 17 indicates that the calculated evaluation value for recall is 0.6 and that for precision is 0.3.
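As an illustration, recall and precision for a binary classifier can be computed as below. Equations (1) and (2) are not reproduced in this part of the text; the sketch assumes they are the standard definitions, recall = TP / (TP + FN) and precision = TP / (TP + FP), with label 1 as the positive class. The function name is hypothetical.

```python
def recall_precision(y_true, y_pred):
    """Return (recall, precision) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Five positive samples of which three are detected, plus seven false alarms.
y_true = [1, 1, 1, 1, 1] + [0] * 7
y_pred = [1, 1, 1, 0, 0] + [1] * 7
recall, precision = recall_precision(y_true, y_pred)
```

In this made-up data set the result is a recall of 3/5 and a precision of 3/10, which matches the values 0.6 and 0.3 illustrated in FIG. 17.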

For each of the plural evaluation indices, namely recall and precision, the evaluation value integration unit 8 calculates the degree of achievement of the evaluation value calculated by the model evaluation unit 7 with respect to its target value, and calculates an integrated evaluation value on the basis of the degree of achievement of recall and the degree of achievement of precision. FIG. 18 is a diagram showing the relationship between the processing of the evaluation value integration unit 8 and the data. In the figure, Refer denotes a data reference and Create denotes data generation.

The evaluation value integration unit 8 first refers to the evaluation value data D13 and the target value data D4 and calculates the degree of achievement of recall (recall achievement) by the following Equation (8) and the degree of achievement of precision (precision achievement) by the following Equation (9).

In the present example, recall and precision, for which a larger value means a better evaluation, are used as the evaluation indices, so the degree of achievement of an evaluation value with respect to its target value can be defined as above. For an evaluation index for which a smaller value means a better evaluation, such as the mean squared error, the degree of achievement may be defined with the numerator and denominator swapped (the evaluation value as the denominator and the target value as the numerator).
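Equations (8) and (9) are not reproduced in this text. Based on the remark about swapping the numerator and denominator, the degree of achievement appears to be the evaluation value divided by the target value; the following minimal sketch assumes that form. The function name and the `higher_is_better` flag are hypothetical.

```python
def achievement(evaluation, target, higher_is_better=True):
    """Degree of achievement of an evaluation value against its target value.

    Assumed form: evaluation / target for indices where larger is better
    (recall, precision), and target / evaluation for indices where smaller
    is better (for example, the mean squared error).
    """
    if higher_is_better:
        return evaluation / target
    return target / evaluation


# Values illustrated in FIG. 7 (targets) and FIG. 17 (evaluation values).
recall_achievement = achievement(0.6, 0.9)     # about 0.667
precision_achievement = achievement(0.3, 0.4)  # about 0.75
```

A value of 1.0 or more means the target has been reached for that index.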

If Equations (8) and (9) above are used as they are, then when one of the recall achievement and the precision achievement greatly exceeds 1.0, the integrated evaluation value may become high even though the other achievement is very small. It is therefore desirable to convert the recall achievement by the following Equation (10) and the precision achievement by the following Equation (11), obtain the converted recall achievement and the converted precision achievement, and calculate the integrated evaluation value from these converted values.

By setting both α_r in Equation (10) and α_p in Equation (11) to positive values smaller than 1.0 (for example, 0.1), the converted recall achievement and the converted precision achievement can be kept from greatly exceeding 1.0, which prevents the above problem. The values of α_r and α_p are assumed to be determined in advance, but may be set in accordance with user input at initial setting.
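Equations (10) and (11) are not reproduced in this text, so their exact form is unknown here. Purely as an assumption for illustration, the sketch below uses a piecewise-linear conversion that leaves an achievement of 1.0 or less unchanged and scales only the excess over 1.0 by the coefficient α. This is one form that matches the stated property (a positive α below 1.0 keeps the converted value from greatly exceeding 1.0); the embodiment's actual equations may differ.

```python
def convert_achievement(value, alpha=0.1):
    """Damp the part of an achievement that exceeds 1.0 (assumed conversion).

    value: raw degree of achievement
    alpha: positive coefficient below 1.0 (alpha_r or alpha_p in the text)
    """
    if value <= 1.0:
        return value                     # not yet at target: unchanged
    return 1.0 + alpha * (value - 1.0)   # overshoot counts only alpha times


# An index that overshoots its target threefold contributes only 1.2.
converted = convert_achievement(3.0, alpha=0.1)
```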

Next, the evaluation value integration unit 8 uses the converted recall achievement and the converted precision achievement to calculate the integrated evaluation value merged_eval(X) by the following Equation (12).

The evaluation value integration unit 8 then generates integrated evaluation value data D14, which records the calculated integrated evaluation value merged_eval(X). If integrated evaluation value data D14 already exists, it is deleted and new integrated evaluation value data D14 is generated. FIG. 19 is a diagram showing an example of the integrated evaluation value data D14. The integrated evaluation value data D14 illustrated in FIG. 19 indicates that the calculated integrated evaluation value merged_eval(X) is 0.706.

The integrated evaluation value merged_eval(X) calculated by Equation (12) above is the harmonic mean of the converted recall achievement and the converted precision achievement. That is, merged_eval(X) takes a large value when the converted recall achievement and the converted precision achievement are both large and similar in value. By selecting subsets so that merged_eval(X) becomes high, a model is obtained that achieves the target values for recall and precision as far as possible and has little variation in the degrees of achievement (is well balanced).
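Because the integrated evaluation value is stated to be a harmonic mean, it can be illustrated with the illustrated figures: targets of 0.9 and 0.4 (FIG. 7) and evaluation values of 0.6 and 0.3 (FIG. 17). The sketch assumes that each achievement is the evaluation value divided by the target value and that the conversion leaves values below 1.0 unchanged; the function name is hypothetical.

```python
def merged_eval(recall_achievement, precision_achievement):
    """Harmonic mean of the two (converted) degrees of achievement."""
    total = recall_achievement + precision_achievement
    if total == 0:
        return 0.0
    return 2 * recall_achievement * precision_achievement / total


# Achievements assumed to be evaluation value / target value.
example = merged_eval(0.6 / 0.9, 0.3 / 0.4)
print(round(example, 3))  # 0.706

# The harmonic mean penalizes imbalance more than the arithmetic mean does.
balanced = merged_eval(0.75, 0.75)    # 0.75
unbalanced = merged_eval(1.0, 0.5)    # about 0.667, although the average is also 0.75
```

Under these assumptions the result agrees with the value 0.706 illustrated in FIG. 19.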

The evaluation value correction unit 9 calculates the score score(X) of the evaluation target set generated by the evaluation target set generation unit 4 and corrects the integrated evaluation value merged_eval(X) calculated by the evaluation value integration unit 8 so that a higher score score(X) gives a better evaluation. FIG. 20 is a diagram showing the relationship between the processing of the evaluation value correction unit 9 and the data. In the figure, Refer denotes a data reference and Create denotes data generation.

The evaluation value correction unit 9 first refers to the evaluation target set data D10 and the correction setting data D3 and calculates the score score(X) of the evaluation target set generated by the evaluation target set generation unit 4 by Equation (5) above. Next, the evaluation value correction unit 9 refers to the integrated evaluation value data D14 and the correction setting data D3 and calculates, by Equations (6) and (7) above, the corrected evaluation value corrected_eval(X), which is the integrated evaluation value merged_eval(X) corrected in accordance with the score score(X). The evaluation value correction unit 9 then generates corrected evaluation value data D15, which records the calculated corrected evaluation value corrected_eval(X). If corrected evaluation value data D15 already exists, it is deleted and new corrected evaluation value data D15 is generated. FIG. 21 is a diagram showing an example of the corrected evaluation value data D15. The corrected evaluation value data D15 illustrated in FIG. 21 indicates that the calculated corrected evaluation value corrected_eval(X) is 0.776.

In the exclusion mode, the exclusion loss evaluation unit 10 calculates, for each feature to be excluded, an exclusion loss indicating the ratio between the decrease in the cost cost(X) and the decrease in model performance when one feature is excluded from the selected set. FIG. 22 is a diagram showing the relationship between the processing of the exclusion loss evaluation unit 10 and the data. In the figure, Refer denotes a data reference and Create denotes data generation.

The exclusion loss evaluation unit 10 refers to the mode data D9, the selected set data D8, the corrected evaluation value data D15, and the cost data D11 and calculates the exclusion loss in the exclusion mode. Specifically, the exclusion loss evaluation unit 10 calculates the exclusion loss by the following Equation (13), using the corrected evaluation value corrected_eval(X_s) and the cost cost(X_s) of the selected set X_s recorded in the selected set data D8, the corrected evaluation value corrected_eval(X) of the evaluation target set X recorded in the corrected evaluation value data D15, and the cost cost(X) recorded in the cost data D11.

The exclusion loss evaluation unit 10 then generates final evaluation data D16, which records the corrected evaluation value corrected_eval(X) of the evaluation target set X, the cost cost(X), and the exclusion loss calculated by Equation (13) above. FIG. 23 is a diagram showing an example of the final evaluation data D16. The final evaluation data D16 illustrated in FIG. 23 indicates that the calculated exclusion loss is 0.0021. In the addition mode, the value of the exclusion loss is not used, so final evaluation data D16 in which an arbitrary value is recorded as the exclusion loss may be generated.
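Equation (13) is not reproduced in this text. The description says only that the exclusion loss is a ratio between the decrease in cost and the decrease in model performance, and that a smaller exclusion loss is preferred when choosing the feature to exclude. The sketch below therefore assumes, for illustration, the performance lost per unit of cost saved; the function name, the handling of a zero cost reduction, and the sample numbers are all hypothetical.

```python
def exclusion_loss(eval_selected, cost_selected, eval_target, cost_target):
    """Performance lost per unit of cost saved by removing one feature (assumed form).

    eval_selected, cost_selected: corrected_eval(X_s) and cost(X_s) of the selected set
    eval_target, cost_target: corrected_eval(X) and cost(X) of the evaluation target set
    """
    cost_drop = cost_selected - cost_target
    eval_drop = eval_selected - eval_target
    if cost_drop <= 0:
        # Assumption: removing a feature that saves no cost is never attractive.
        return float("inf")
    return eval_drop / cost_drop


# Made-up numbers: the same small performance drop for different cost savings.
loss_cheap = exclusion_loss(0.776, 5.0, 0.770, 4.5)   # saves 0.5 of cost
loss_costly = exclusion_loss(0.776, 5.0, 0.770, 3.0)  # saves 2.0 of cost
```

Under this assumed form, removing the feature that saves more cost for the same performance drop yields the smaller loss and would therefore be preferred.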

The provisional first place update unit 11 refers to the mode data D9, the final evaluation data D16, and the provisional first place data D17 and updates the provisional first place data D17 in the following cases: in the addition mode, when the corrected evaluation value corrected_eval(X) recorded in the final evaluation data D16 is higher than that recorded in the provisional first place data D17; and in the exclusion mode, when the exclusion loss recorded in the final evaluation data D16 is smaller than that recorded in the provisional first place data D17.

FIG. 24 is a diagram showing an example of the provisional first place data D17. As shown in FIG. 24, the provisional first place data D17 records the evaluation target set X that is provisionally ranked first, together with the corrected evaluation value corrected_eval(X), the cost cost(X), and the exclusion loss of that evaluation target set X. If provisional first place data D17 does not exist, the provisional first place update unit 11 newly generates it and records in it the current evaluation target set X and the corrected evaluation value corrected_eval(X), the cost cost(X), and the exclusion loss recorded in the final evaluation data D16.
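The update rule for the provisional first place data can be sketched as follows. The dictionary keys and the function name are assumptions made for this illustration; `None` stands for the case where the provisional first place data D17 does not yet exist.

```python
def update_provisional_best(mode, final_eval, provisional):
    """Return the record to keep as the provisional first place data (D17).

    mode: "addition" or "exclusion"
    final_eval: dict with keys "features", "corrected_eval", "cost", "exclusion_loss"
    provisional: current D17 record, or None if it does not exist yet
    """
    if provisional is None:
        return dict(final_eval)  # the first candidate always becomes provisional first
    if mode == "addition":
        # Addition mode: a higher corrected evaluation value wins.
        better = final_eval["corrected_eval"] > provisional["corrected_eval"]
    else:
        # Exclusion mode: a smaller exclusion loss wins.
        better = final_eval["exclusion_loss"] < provisional["exclusion_loss"]
    return dict(final_eval) if better else provisional
```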

The selected set update unit 12 refers to the mode data D9, the selected set data D8, and the feature candidate data D7. In the addition mode, it first checks whether there is any feature whose state is “unevaluated”. If no feature in the “unevaluated” state exists, the selected set update unit 12 updates the selected set data D8 with the evaluation target set X, the corrected evaluation value corrected_eval(X), and the cost cost(X) recorded in the provisional first place data D17, and deletes the provisional first place data D17. The selected set update unit 12 also updates the state of each feature in the feature candidate data D7 as follows.
I. Features included in the selected set data D8: set the state to “selected”.
II. Features whose state is “excluded”: leave as is.
III. Features in any other state: set the state to “unevaluated”.
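Rules I. to III. can be sketched as a single pass over the feature candidate data. The function name `refresh_states` and the use of plain strings for states are assumptions of this illustration; rule I. is checked first, so a feature in the selected set always ends up “selected”.

```python
def refresh_states(feature_states, selected):
    """Apply rules I. to III. to the feature candidate data (D7) in place.

    feature_states: dict of feature name -> state string
    selected: set of feature names in the updated selected set (D8)
    """
    for name, state in feature_states.items():
        if name in selected:
            feature_states[name] = "selected"      # rule I.
        elif state == "excluded":
            continue                               # rule II.: leave as is
        else:
            feature_states[name] = "unevaluated"   # rule III.
    return feature_states


states = {"x1": "evaluated", "x2": "evaluated", "x3": "violation", "x4": "excluded"}
refresh_states(states, selected={"x1"})
```

After the call, x1 is “selected”, x2 and x3 return to “unevaluated” so that they can be tried again in the next round, and x4 stays “excluded”.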

Furthermore, the selected set update unit 12 refers to the best set data D18, and if the corrected evaluation value corrected_eval(X) of the updated selected set data D8 is higher than the corrected evaluation value corrected_eval(X) recorded in the best set data D18, it updates the best set data D18 with the contents of the selected set data D8.

FIG. 25 is a diagram showing an example of the best set data D18. As shown in FIG. 25, the best set data D18 records the best set, which is the subset with the highest evaluation at that point, together with the corrected evaluation value corrected_eval(X) and the cost cost(X) of that best set. If best set data D18 does not exist, the selected set update unit 12 newly generates it and records in it the contents of the selected set data D8.

In the exclusion mode, the selected set update unit 12 first checks whether the selected set X_s contains any feature whose state is not “evaluated”. If no such feature exists, the selected set update unit 12 refers to the selected set data D8 and the provisional first place data D17 and identifies the feature that is present in the selected set data D8 but absent from the provisional first place data D17. It then sets the state of that feature in the feature candidate data D7 to “excluded” and deletes the provisional first place data D17.

After that, as in the addition mode, the selected set update unit 12 updates the selected set data D8 and the best set data D18. Also as in the addition mode, the selected set update unit 12 updates the state of each feature in the feature candidate data D7 in accordance with rules I. to III. above.

The end determination unit 13 refers to the mode data D9, the selected set data D8, and the end condition data D6. In the addition mode, when the number of features included in the selected set X_s reaches the maximum number recorded in the end condition data D6, the end determination unit 13 instructs the output unit 14 to output the data recorded in the best set data D18 and ends the processing of the feature selection device 1.

In the exclusion mode, the end determination unit 13 counts the number of exclusion mode executions, and when the count reaches the maximum number of exclusion mode executions recorded in the end condition data D6, it instructs the output unit 14 to output the data recorded in the best set data D18 and ends the processing of the feature selection device 1.
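The two end conditions can be sketched as a single predicate. The function and parameter names are assumptions of this illustration; the default limits of 50 features and 10 exclusion mode executions are the values illustrated in FIG. 9.

```python
def should_finish(mode, selected_count, exclusion_runs,
                  max_features=50, max_exclusion_runs=10):
    """Return True when the end condition for the current mode is met.

    mode: "addition" or "exclusion"
    selected_count: number of features in the selected set X_s
    exclusion_runs: number of exclusion mode executions counted so far
    """
    if mode == "addition":
        # Addition mode ends when the selected set reaches the maximum size.
        return selected_count >= max_features
    # Exclusion mode ends when it has been executed the maximum number of times.
    return exclusion_runs >= max_exclusion_runs
```

When the predicate returns True, the contents of the best set data D18 would be handed to the output unit.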

The output unit 14 outputs the data recorded in the best set data D18 in accordance with the instruction from the end determination unit 13. The output unit 14 can output the data by, for example, display on a display device, storage in an external storage device, transmission to an external device, or any combination of these. If data other than that recorded in the best set data D18 is to be output, processing may be added to retain that data separately.

<Description of operation>
Next, the operation of the feature selection device 1 of the present example is described. The feature selection device 1 first performs the various initial settings in accordance with user input, and then proceeds while switching between the addition mode, in which a feature is added to the selected set, and the exclusion mode, in which a feature is excluded from the selected set. It basically operates in the addition mode, but shifts to the exclusion mode when the constraint condition given by Equation (4) above can no longer be satisfied. When a fixed number of features (one in the present example) has been excluded from the selected set, it returns to the addition mode. An example of each processing procedure is described below with reference to flowcharts, separately for the addition mode and the exclusion mode.

First, the processing procedure in the addition mode is described. FIGS. 26-1 and 26-2 are flowcharts showing an example of the processing procedure of the feature selection device 1 in the addition mode. The addition mode is the basic mode of the feature selection device 1 of the present example, in which a loop that adds a feature to the selected set is repeated.

Step S101: When processing in the addition mode starts, the evaluation target set generation unit 4 first checks the mode data D9. If the current mode indicated by the mode data D9 is the addition mode, the process proceeds to the next step S102; if it is the exclusion mode, the process shifts to the exclusion mode processing described later.

Step S102: The evaluation target set generation unit 4 refers to the feature candidate data D7 and checks whether there is any feature whose state is “unevaluated”. If there is an “unevaluated” feature (step S102: Yes), the process proceeds to the next step S103; if there is none (step S102: No), the process shifts to step S115.

Step S103: The evaluation target set generation unit 4 refers to the selected set data D8 and the feature candidate data D7 and generates an evaluation target set X by adding to the selected set one of the features in the feature candidate data D7 whose state is “unevaluated”. The evaluation target set generation unit 4 then generates evaluation target set data D10, which records the features included in the evaluation target set X, updates the state of the added feature in the feature candidate data D7 to “pending”, and proceeds to the next step S104.

Step S104: The cost evaluation unit 5 refers to the evaluation target set data D10 and the cost setting data D2, calculates the cost cost(X) of the evaluation target set X generated in step S103 by Equation (3) above, and proceeds to the next step S105.

Step S105: The cost evaluation unit 5 refers to the cost constraint data D5 and determines, by Equation (4) above, whether the cost cost(X) calculated in step S104 satisfies the constraint condition. If the cost cost(X) satisfies the constraint condition (step S105: Yes), the cost evaluation unit 5 generates cost data D11, which records the cost cost(X) calculated in step S104, updates the state of the feature set to “pending” in step S103 to “evaluated”, and proceeds to the next step S106. If the cost cost(X) does not satisfy the constraint condition (step S105: No), the cost evaluation unit 5 updates the state of the feature set to “pending” in step S103 to “violation” and shifts the process to step S113.

Step S106: The model generation unit 6 refers to the evaluation target set data D10 and the training data set (data set D1), and generates a model by machine learning that uses, for each sample in the training data set, only the features included in the evaluation target set data D10. The model generation unit 6 then generates model data D12 representing the rules, parameters, and the like of the generated model, and proceeds to step S107.

Step S107: The model evaluation unit 7 refers to the evaluation target set data D10, the model data D12, and the evaluation data set (data set D1), evaluates the model using, for each sample in the evaluation data set, only the features included in the evaluation target set data D10, and calculates the evaluation values (recall and precision) of the model generated in step S106 using equations (1) and (2) above. The model evaluation unit 7 then generates evaluation value data D13 recording the calculated recall and precision, and proceeds to step S108.

Step S108: The evaluation value integration unit 8 refers to the evaluation value data D13 and the target value data D4, and calculates the recall achievement degree and the precision achievement degree using equations (8) and (9) above. The evaluation value integration unit 8 also calculates the converted recall achievement degree and the converted precision achievement degree using equations (10) and (11) above, and calculates the integrated evaluation value merged_eval(X) using equation (12) above. The evaluation value integration unit 8 then generates integrated evaluation value data D14 recording the calculated integrated evaluation value merged_eval(X), and proceeds to step S109.

Step S109: The evaluation value correction unit 9 refers to the evaluation target set data D10 and the correction setting data D3, and calculates the score score(X) of the evaluation target set X generated in step S103 using equation (5) above. The evaluation value correction unit 9 also refers to the integrated evaluation value data D14 and the correction setting data D3 and, based on the integrated evaluation value merged_eval(X) and α(score(X)), calculates the corrected evaluation value corrected_eval(X) using equations (6) and (7) above. The evaluation value correction unit 9 then generates corrected evaluation value data D15 recording the calculated corrected evaluation value corrected_eval(X), and proceeds to step S110.

Step S110: The exclusion loss evaluation unit 10 refers to the corrected evaluation value data D15 and the cost data D11, generates final evaluation data D16 recording the corrected evaluation value corrected_eval(X), the cost cost(X), and the exclusion loss, and proceeds to step S111. In the addition mode, an arbitrary value is recorded as the exclusion loss.

Step S111: The provisional first-place update unit 11 refers to the final evaluation data D16 and the provisional first-place data D17, and determines whether the corrected evaluation value corrected_eval(X) in the final evaluation data D16 is higher than that in the provisional first-place data D17. If the corrected evaluation value corrected_eval(X) recorded in the final evaluation data D16 is higher than that of the provisional first-place data D17 (step S111: Yes), the process proceeds to step S112. If it is lower (step S111: No), the process returns to step S101 and the subsequent processing is repeated.

Step S112: The provisional first-place update unit 11 updates the provisional first-place data D17 with the evaluation target set X recorded in the evaluation target set data D10 and with the corrected evaluation value corrected_eval(X), the cost cost(X), and the exclusion loss recorded in the final evaluation data D16. The process then returns to step S101 and the subsequent processing is repeated.

Step S113: When the cost cost(X) calculated in step S104 does not satisfy the constraint (step S105: No), the cost evaluation unit 5 checks whether it is no longer possible to generate an evaluation target set that satisfies the constraint. Specifically, when the states of all features included in the feature candidate data D7 are "selected", "violation", or "excluded", the cost evaluation unit 5 determines that no evaluation target set satisfying the constraint can be generated (step S113: Yes), and proceeds to step S114. If an evaluation target set satisfying the constraint may still be generated (step S113: No), the process returns to step S101 and the subsequent processing is repeated.

Step S114: When the cost evaluation unit 5 determines that no evaluation target set satisfying the constraint can be generated (step S113: Yes), it updates the current mode in the mode data D9 from the addition mode to the exclusion mode, and the process returns to step S101 and the subsequent processing is repeated.

Step S115: When it is determined in step S102 that the feature candidate data D7 contains no feature whose state is "unevaluated" (step S102: No), the selected set update unit 12 updates the selected set data D8 with the evaluation target set X, the corrected evaluation value corrected_eval(X), and the cost cost(X) recorded in the provisional first-place data D17. The selected set update unit 12 also deletes the provisional first-place data D17, updates the state of each feature in the feature candidate data D7 as described above, and proceeds to step S116.

Step S116: The selected set update unit 12 refers to the best set data D18 and determines whether the corrected evaluation value corrected_eval(X) of the updated selected set data D8 is higher than that of the best set data D18. If the corrected evaluation value corrected_eval(X) recorded in the updated selected set data D8 is higher than that of the best set data D18 (step S116: Yes), the process proceeds to step S117. If it is equal to or lower than that of the best set data D18 (step S116: No), the process moves to step S118.

Step S117: The selected set update unit 12 updates the best set data D18 with the selected features, the corrected evaluation value corrected_eval(X), and the cost cost(X) recorded in the updated selected set data D8, and proceeds to step S118.

Step S118: The end determination unit 13 refers to the selected set data D8 and the end condition data D6, and determines whether the number of features included in the selected set X_s has reached the maximum number recorded in the end condition data D6, that is, whether the end condition is satisfied. If the end condition is satisfied (step S118: Yes), the process proceeds to step S119. If it is not satisfied (step S118: No), the process returns to step S101 and the subsequent processing is repeated.

Step S119: The output unit 14 outputs the data recorded in the best set data D18, and the series of processes performed by the feature selection device 1 of this example ends.
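The addition-mode steps above can be condensed into a short Python sketch. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function `addition_pass`, its parameters, and the plain-string states are hypothetical names introduced here; model generation, evaluation, integration, and correction (steps S106 to S110) are collapsed into a single caller-supplied `corrected_eval` function; and the cost constraint is assumed to take the form cost(X) <= max_cost.

```python
from typing import Callable, Dict, FrozenSet, Optional, Tuple

FeatureSet = FrozenSet[str]


def addition_pass(
    selected: FeatureSet,
    states: Dict[str, str],
    cost: Callable[[FeatureSet], float],
    corrected_eval: Callable[[FeatureSet], float],
    max_cost: float,
) -> Tuple[Optional[FeatureSet], bool]:
    """Try each "unevaluated" feature once, mutating `states` in place.

    Returns (best candidate set or None, whether to switch to exclusion mode).
    All names here are hypothetical; this only mirrors the flow of S103-S114.
    """
    best_set: Optional[FeatureSet] = None
    best_value = float("-inf")
    saw_violation = False
    for name in sorted(states):
        if states[name] != "unevaluated":
            continue
        candidate = selected | {name}        # S103: add one feature
        if cost(candidate) > max_cost:       # S104-S105: assumed constraint form
            states[name] = "violation"
            saw_violation = True
            continue
        states[name] = "evaluated"
        value = corrected_eval(candidate)    # S106-S110 collapsed into one call
        if value > best_value:               # S111-S112: keep the provisional best
            best_set, best_value = candidate, value
    # S113-S114: switch modes only if a violation occurred and nothing else can be tried
    blocked = all(s in ("selected", "violation", "excluded") for s in states.values())
    return best_set, saw_violation and blocked
```

The returned candidate set plays the role of the provisional first-place data D17; the caller would then promote it to the selected set (step S115) and compare it with the best set so far (steps S116 and S117).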

Next, the processing procedure in the exclusion mode is described. FIGS. 27-1 and 27-2 are flowcharts showing an example of the processing procedure of the feature selection device 1 in the exclusion mode. The exclusion mode is executed when, in the addition mode, the cost constraint can no longer be satisfied no matter which feature is added. In the exclusion mode, one feature is removed from the selected feature set. In this example, the device returns to the addition mode once one feature has been excluded, but the exclusion mode may instead be continued until several features have been excluded. The description below focuses on the parts whose behavior differs from the addition mode.

Step S201: When processing in the exclusion mode starts, the evaluation target set generation unit 4 first checks the mode data D9. If the current mode indicated by the mode data D9 is the exclusion mode, the process proceeds to step S202; if it is the addition mode, the process moves to the addition-mode processing described above.

Step S202: The evaluation target set generation unit 4 refers to the feature candidate data D7 and checks whether there is a feature whose state is "selected". If there is a "selected" feature (step S202: Yes), the process proceeds to step S203. If there is none (step S202: No), the process moves to step S211.

Step S203: The evaluation target set generation unit 4 selects one feature whose state is "selected" from the features included in the feature candidate data D7, and updates the state of that feature to "pending". It then generates a set consisting only of the features whose state is "selected" or "evaluated". This corresponds to generating the evaluation target set X by removing the feature set to "pending" from the selected set. The evaluation target set generation unit 4 generates evaluation target set data D10 recording this set. After that, the cost evaluation unit 5 updates the state of the "pending" feature in the feature candidate data D7 to "evaluated". The process then proceeds to step S204.

Step S204: The model generation unit 6 generates a model in the same way as in step S106 of the addition mode, generates model data D12 representing the rules, parameters, and the like of the generated model, and proceeds to step S205.

Step S205: The model evaluation unit 7 calculates the evaluation values (recall and precision) of the model in the same way as in step S107 of the addition mode, generates evaluation value data D13 recording the calculated recall and precision, and proceeds to step S206.

Step S206: The evaluation value integration unit 8 calculates the integrated evaluation value merged_eval(X) in the same way as in step S108 of the addition mode, generates integrated evaluation value data D14 recording the calculated integrated evaluation value merged_eval(X), and proceeds to step S207.

Step S207: The evaluation value correction unit 9 calculates the corrected evaluation value corrected_eval(X) in the same way as in step S109 of the addition mode, generates corrected evaluation value data D15 recording the calculated corrected evaluation value corrected_eval(X), and proceeds to step S208.

Step S208: The exclusion loss evaluation unit 10 refers to the selected set data D8, the corrected evaluation value data D15, and the cost data D11. Using the corrected evaluation value corrected_eval(X_s) and the cost cost(X_s) of the selected set X_s recorded in the selected set data D8, the corrected evaluation value corrected_eval(X) of the evaluation target set X recorded in the corrected evaluation value data D15, and the cost cost(X) recorded in the cost data D11, it calculates the exclusion loss using equation (13) above. The exclusion loss evaluation unit 10 then generates final evaluation data D16 recording the corrected evaluation value corrected_eval(X) and the cost cost(X) of the evaluation target set X together with the calculated exclusion loss, and proceeds to step S209.

Step S209: The provisional first-place update unit 11 refers to the final evaluation data D16 and the provisional first-place data D17, and determines whether the exclusion loss in the final evaluation data D16 is smaller than that in the provisional first-place data D17. If the exclusion loss recorded in the final evaluation data D16 is smaller than that of the provisional first-place data D17 (step S209: Yes), the process proceeds to step S210. If it is larger (step S209: No), the process returns to step S201 and the subsequent processing is repeated.

Step S210: The provisional first-place update unit 11 updates the provisional first-place data D17 with the evaluation target set X recorded in the evaluation target set data D10 and with the corrected evaluation value corrected_eval(X), the cost cost(X), and the exclusion loss recorded in the final evaluation data D16. The process then returns to step S201 and the subsequent processing is repeated.

Step S211: When it is determined in step S202 that the feature candidate data D7 contains no feature whose state is "selected" (step S202: No), the selected set update unit 12 refers to the selected set data D8 and the provisional first-place data D17, identifies the feature that is present in the selected set data D8 but absent from the provisional first-place data D17, sets the state of that feature in the feature candidate data D7 to "excluded", and deletes the provisional first-place data D17. The selected set update unit 12 then updates the selected set data D8 in the same way as in step S115 of the addition mode, updates each state in the feature candidate data D7, and proceeds to step S212.

Step S212: As in step S116 of the addition mode, the selected set update unit 12 determines whether the corrected evaluation value corrected_eval(X) of the updated selected set data D8 is higher than that of the best set data D18. If the corrected evaluation value corrected_eval(X) recorded in the updated selected set data D8 is higher than that of the best set data D18 (step S212: Yes), the process proceeds to step S213. If it is equal to or lower than that of the best set data D18 (step S212: No), the process moves to step S214.

Step S213: As in step S117 of the addition mode, the selected set update unit 12 updates the best set data D18 with the selected features, the corrected evaluation value corrected_eval(X), and the cost cost(X) recorded in the updated selected set data D8, and proceeds to step S214.

Step S214: The end determination unit 13 refers to the end condition data D6 and determines whether the number of times the exclusion mode has been executed has reached the maximum number recorded in the end condition data D6, that is, whether the end condition is satisfied. If the end condition is not satisfied (step S214: No), the process proceeds to step S215. If it is satisfied (step S214: Yes), the process moves to step S216.

Step S215: The end determination unit 13 updates the current mode in the mode data D9 from the exclusion mode to the addition mode, and the process returns to step S201 and the subsequent processing is repeated.

Step S216: The output unit 14 outputs the data recorded in the best set data D18, and the series of processes performed by the feature selection device 1 of this example ends.
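The core of the exclusion mode (steps S203 to S211) is a search for the selected feature whose removal gives the smallest exclusion loss. The sketch below illustrates that search in Python. Equation (13) is not reproduced in this part of the description, so the loss used here, the drop in corrected evaluation value divided by the drop in cost, is an assumption chosen only to match the statement that a smaller loss is preferred; the function names and the handling of a zero cost reduction are likewise hypothetical.

```python
from typing import Callable, FrozenSet, Optional, Tuple

FeatureSet = FrozenSet[str]


def exclusion_loss(eval_before: float, eval_after: float,
                   cost_before: float, cost_after: float) -> float:
    """Assumed form of the loss: evaluation drop per unit of cost saved."""
    cost_drop = cost_before - cost_after
    if cost_drop <= 0:
        # Hypothetical convention: removing a feature that saves no cost is never preferred.
        return float("inf")
    return (eval_before - eval_after) / cost_drop


def pick_feature_to_exclude(
    selected: FeatureSet,
    cost: Callable[[FeatureSet], float],
    corrected_eval: Callable[[FeatureSet], float],
) -> Tuple[Optional[str], float]:
    """Return (feature whose removal minimizes the loss, that loss)."""
    base_eval, base_cost = corrected_eval(selected), cost(selected)
    best_name: Optional[str] = None
    best_loss = float("inf")
    for name in sorted(selected):                 # S203: remove one feature at a time
        candidate = selected - {name}
        loss = exclusion_loss(base_eval, corrected_eval(candidate),
                              base_cost, cost(candidate))   # S204-S208 collapsed
        if loss < best_loss:                      # S209-S210: keep the smallest loss
            best_name, best_loss = name, loss
    return best_name, best_loss
```

Under this assumed loss, an expensive feature that contributes little to the corrected evaluation value is removed first, which is the behavior the exclusion mode is meant to produce.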

<Hardware configuration>
The functions of the feature selection device 1 described above can be realized, for example, by general computer hardware working together with software (a program). FIG. 28 shows an example of the hardware configuration of the feature selection device 1 in this case.

As shown in FIG. 28, for example, the feature selection device 1 of this example includes: a CPU (Central Processing Unit) 101 that performs information processing; a ROM (Read Only Memory) 102, a read-only memory that stores the BIOS and the like; a RAM (Random Access Memory) 103 that stores various data in a rewritable manner; an HDD (Hard Disk Drive) 104 that functions as various databases and stores various programs; a medium drive device 105 for storing information on a storage medium 110, distributing information to the outside, and obtaining information from the outside; an input device 106, such as a keyboard or mouse, with which the user inputs commands and information to the CPU 101; and a display device 107, such as an LCD (Liquid Crystal Display), that shows the processing progress and results to the user. A bus controller 108 arbitrates the data exchanged among these units.

In such a feature selection device 1, when the user turns on the power, the CPU 101 starts a program called a loader in the ROM 102, loads from the HDD 104 into the RAM 103 a program called an OS (Operating System), which manages the hardware and software of the computer, and starts the OS. The OS starts programs, reads data, and saves data in response to user operations. Windows (registered trademark) and UNIX (registered trademark) are known as typical OSs. Programs that run on these OSs are called application programs. An application program is not limited to one that runs on a given OS. It may hand off execution of part of the various processes described later to the OS, or it may be included as part of a group of program files constituting given application software, an OS, or the like.

As the application program mentioned above, the feature selection device 1 stores in the HDD 104 a program for generating each of the functional components shown in FIG. 6 as a process. The application program installed on the HDD 104 of the feature selection device 1 is generally provided recorded on a storage medium 110, which may be any of various types of media such as optical discs (for example CD-ROMs and DVDs), magneto-optical discs, magnetic disks (for example flexible disks), and semiconductor memory. The program may also be obtained from the outside, for example by communication over a network, and installed on the HDD 104.

When the hardware configuration described above is adopted, the CPU 101 executes various arithmetic processes in accordance with the above program running on the OS, whereby the functional components shown in FIG. 6 are generated, for example on the RAM 103, and the computer can be made to function as the feature selection device 1. Some or all of the functional components shown in FIG. 6 may instead be realized using dedicated hardware such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field-Programmable Gate Array).

This example assumes that the feature selection device 1 is configured as a single device, but this is not required. The feature selection device 1 may be made up of a plurality of physically separate devices connected via a network. The feature selection device 1 may also be realized as a virtual machine running on a cloud system.

<Effects of the embodiment>
As described above in detail with specific examples, the feature selection device of this embodiment calculates, for each of a plurality of evaluation indexes relating to model performance, the degree to which the evaluation value achieves the target value that has been set. It then calculates an integrated evaluation value that is rated higher the higher the achievement degree of each evaluation index and the smaller the variation in achievement degree among the evaluation indexes, and searches for a subset with a high integrated evaluation value. This makes it possible to take the plurality of evaluation indexes and their target values into account and to generate a model that better satisfies the requirements the user wants.
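The property described here, higher when every achievement degree is high and higher when the achievement degrees vary less, can be illustrated with a small Python sketch. It is not equations (8) to (12) of the example, which are not reproduced in this part of the description. As a stand-in, it assumes that the achievement degree is the evaluation value divided by a positive target value and capped at 1, and that the integrated value is the mean achievement minus its population standard deviation. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict


def achievement(value: float, target: float) -> float:
    """Assumed achievement degree: ratio to a positive target, capped at 1.0."""
    return min(value / target, 1.0)


def integrated_value(values: Dict[str, float], targets: Dict[str, float]) -> float:
    """Stand-in integrated value: mean achievement penalized by its spread."""
    degrees = [achievement(values[name], targets[name]) for name in targets]
    return mean(degrees) - pstdev(degrees)


targets = {"recall": 0.8, "precision": 0.8}
# Both candidates have the same mean achievement (about 0.9) ...
balanced = integrated_value({"recall": 0.72, "precision": 0.72}, targets)
skewed = integrated_value({"recall": 0.80, "precision": 0.64}, targets)
# ... but the balanced one is rated higher because its achievements vary less.
```

With this stand-in, a subset that reaches 90% of both targets is preferred over one that fully meets the recall target but reaches only 80% of the precision target, even though the average achievement is the same.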

The feature selection device of this embodiment also determines whether the evaluation target set satisfies the constraint that its cost is at or below an upper limit. When an evaluation target set satisfying the constraint can no longer be generated, the device shifts to the exclusion mode, in which a feature is removed from the selected set. At this point it calculates an exclusion loss, which indicates the ratio between the reduction in cost and the reduction in corrected evaluation value when a feature is removed. Using this exclusion loss, the feature to remove from the selected set can be chosen so that the cost falls substantially without a large drop in model performance.

In this way, this embodiment checks the cost constraint and, when the constraint is not satisfied, can update the selected set so as to reduce the cost while lowering model performance as little as possible. This enables feature selection that takes into account factors other than model performance, such as a limit on the processing time required for feature extraction.

The feature selection device of this embodiment also corrects the integrated evaluation value according to a score defined for the evaluation target set, and calculates a corrected evaluation value. It then searches for a subset with a high corrected evaluation value. This enables feature selection that takes into account factors other than model performance; for example, the device can be set so that a feature set allowing prediction earlier relative to the prediction target time is rated higher.

<Supplementary explanation>
The example described above assumed the generation of a prediction model that predicts whether a device failure will occur, and used recall and precision as the evaluation indexes relating to model performance. However, another example of a model to be generated is a model that classifies data into a plurality of classes (a classification model). When a classification model is generated, a class to be evaluated is set from among the plurality of classes, and the recall used as an evaluation index indicates, for each class to be evaluated, whether the data that should be classified into that class has been classified into it without omission. The precision indicates the proportion of correctly classified data among the data classified into the class to be evaluated. A single class may be evaluated, or a plurality of classes may be evaluated at once. The evaluation indexes of the model are not limited to recall and precision, and various evaluation indexes can be used. The number of evaluation indexes used is also not limited to two and may be three or more.
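For a classification model, the per-class recall and precision described above follow the standard definitions based on true positives, false negatives, and false positives. The Python sketch below computes them for one class to be evaluated; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def recall_precision_for_class(
    y_true: Sequence[str], y_pred: Sequence[str], target_class: str
) -> Tuple[float, float]:
    """Recall and precision for a single class to be evaluated."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    # Recall: share of the samples truly in the class that were classified into it.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of the samples classified into the class that truly belong to it.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
recall_b, precision_b = recall_precision_for_class(y_true, y_pred, "b")
```

Calling the function once per class yields one recall and one precision value per class, each of which could be given its own target value when several classes are evaluated at once.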

For example, when a single model produces a plurality of outputs, the accuracy required may differ for each output. Specifically, in a model that estimates gender and age from a face image, it may be required that gender be estimated fairly strictly while age is estimated only roughly. In such a case, the accuracy of each output can be treated as one of the plurality of evaluation indexes. That is, a target value may be set for the accuracy of each output, and the integrated evaluation index may be calculated based on the achievement degrees with respect to those target values.

In the embodiment described above, the cost of an evaluation target set is defined based on the processing time for extracting the features included in that set. However, if the processing time for applying the model also varies significantly, the cost of the evaluation target set may be defined taking the model application time into account as well.
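One possible way to fold the model application time into the cost is sketched below. It assumes that the cost of a set is the simple sum of per-feature extraction times, with the model application time added on top; the additive form, the function name `subset_cost`, and the timing values are illustrative assumptions.

```python
def subset_cost(subset, extraction_time, model_apply_time=0.0):
    """Cost of an evaluation target set under an assumed additive definition."""
    # Sum of extraction times of the features in the set, plus optional model application time.
    return sum(extraction_time[f] for f in subset) + model_apply_time

# Hypothetical per-feature extraction times in seconds.
extraction_time = {"f1": 0.2, "f2": 1.5, "f3": 0.3}

cost_extract_only = subset_cost({"f1", "f3"}, extraction_time)
cost_with_model = subset_cost({"f1", "f3"}, extraction_time, model_apply_time=0.4)
```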

In the embodiment described above, a constraint condition is imposed that the cost of the evaluation target set be equal to or lower than an upper limit value, but this constraint condition can also be disabled. To disable it, for example, the upper limit value max_cost set during initial setting may be set to a very large value. When the constraint condition is disabled, no transition from the addition mode to the exclusion mode occurs.
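The effect of a very large max_cost can be seen in a small sketch. The mode-selection logic here is a simplified assumption (switch to the exclusion mode only when the cost exceeds the upper limit), and `choose_mode` is a hypothetical helper; only the name max_cost comes from the text. Using `math.inf` is one convenient choice of "very large value".

```python
import math

def within_budget(cost, max_cost):
    # Constraint condition: the cost must be equal to or lower than the upper limit.
    return cost <= max_cost

def choose_mode(cost, max_cost):
    # Assumed simplification: exclusion mode is entered only when the cost exceeds the limit.
    return "addition" if within_budget(cost, max_cost) else "exclusion"

mode_limited = choose_mode(12.0, max_cost=10.0)        # limit exceeded
mode_unlimited = choose_mode(12.0, max_cost=math.inf)  # constraint effectively disabled
```

With an infinite upper limit the constraint check always passes, so the search in this sketch never leaves the addition mode.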

In the embodiment described above, the integrated evaluation value is corrected according to the score of the evaluation target set, but a configuration without this correction is also possible. For example, if the scores of all features in the correction setting data are set to the same value, the correction of the integrated evaluation value according to the score of the evaluation target set is effectively disabled.

Although embodiments of the present invention have been described above, these embodiments are presented as examples and are not intended to limit the scope of the invention. The novel embodiments described here can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are included in the invention described in the claims and its equivalents.

DESCRIPTION OF SYMBOLS
1 Feature selection device
2 Input reception unit
3 Initial setting unit
4 Evaluation target set generation unit
5 Cost evaluation unit
6 Model generation unit
7 Model evaluation unit
8 Evaluation value integration unit
9 Evaluation value correction unit
10 Exclusion loss evaluation unit
11 Provisional first-place update unit
12 Selected set update unit
13 End determination unit
14 Output unit

Claims

1. A feature selection device that repeatedly generates and evaluates a model using a subset of a feature set and searches for a subset with a high evaluation value, the device comprising:
a model evaluation unit that calculates an evaluation value for each of a plurality of evaluation indexes related to the performance of the model; and
an evaluation value integration unit that calculates, for each of the plurality of evaluation indexes, a degree of achievement of the evaluation value with respect to a set target value, and calculates an integrated evaluation value that is higher as the degree of achievement of each of the plurality of evaluation indexes is higher and as the variation in the degree of achievement among the plurality of evaluation indexes is smaller,
wherein the feature selection device searches for the subset for which the integrated evaluation value is high.

2. The feature selection device according to claim 1, wherein the model is an identification model that classifies data into a plurality of classes, and
the plurality of evaluation indexes include, with one or more of the plurality of classes set as evaluation target classes, for each evaluation target class, a recall indicating whether the data that should be classified into the class is classified into it without omission, and a precision (matching rate) indicating the proportion of correctly classified data among the data classified into the evaluation target class.

3. The feature selection device according to claim 1, wherein the model is a prediction model for predicting the occurrence of an event, and
the plurality of evaluation indexes include a recall indicating the completeness of prediction and a precision (matching rate) indicating the accuracy of prediction.

4. The feature selection device according to claim 1, wherein each feature included in the feature set is given a first cost that increases as the time required for feature extraction increases,
the feature selection device further comprises a cost evaluation unit that calculates a second cost, which is the cost of the subset, based on the first cost of each feature included in the subset, and
the feature selection device searches for the subset that satisfies a constraint condition that the second cost is equal to or lower than a set upper limit value and for which the integrated evaluation value is high.

5. The feature selection device according to claim 4, further comprising an exclusion loss evaluation unit that calculates, for each feature to be excluded, an exclusion loss indicating the ratio between the decrease in the second cost and the decrease in the integrated evaluation value when one feature is excluded from the subset,
wherein, when the second cost exceeds the upper limit value, a feature to be excluded from the subset is selected based on the exclusion loss.

6. The feature selection device according to claim 1, wherein the model is a prediction model for predicting the occurrence of an event,
the feature selection device further comprises an evaluation value correction unit that calculates, for the subset, a score corresponding to the time at which the prediction result by the model is obtained, and corrects the integrated evaluation value based on the calculated score, and
the feature selection device searches for the subset for which the corrected integrated evaluation value is high.

7. A feature selection method that repeatedly generates and evaluates a model using a subset of a feature set and searches for a subset with a high evaluation value, the method comprising:
a step of calculating an evaluation value for each of a plurality of evaluation indexes related to the performance of the model; and
a step of calculating, for each of the plurality of evaluation indexes, a degree of achievement of the evaluation value with respect to a set target value, and calculating an integrated evaluation value that is higher as the degree of achievement of each of the plurality of evaluation indexes is higher and as the variation in the degree of achievement among the plurality of evaluation indexes is smaller,
wherein the subset for which the integrated evaluation value is high is searched for.

8. A program for causing a computer to function as a feature selection device that repeatedly generates and evaluates a model using a subset of a feature set and searches for a subset with a high evaluation value,
the program causing the computer to realize:
a function of calculating an evaluation value for each of a plurality of evaluation indexes related to the performance of the model; and
a function of calculating, for each of the plurality of evaluation indexes, a degree of achievement of the evaluation value with respect to a set target value, and calculating an integrated evaluation value that is higher as the degree of achievement of each of the plurality of evaluation indexes is higher and as the variation in the degree of achievement among the plurality of evaluation indexes is smaller,
wherein the subset for which the integrated evaluation value is high is searched for.