JP2009282686A

JP2009282686A - Apparatus and method for learning classification model

Info

Publication number: JP2009282686A
Application number: JP2008133224A
Authority: JP
Inventors: Kota Nakata; 康太中田; Shigeaki Sakurai; 茂明櫻井; Ryohei Orihara; 良平折原
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2008-05-21
Filing date: 2008-05-21
Publication date: 2009-12-03

Abstract

PROBLEM TO BE SOLVED: To construct a highly accurate classification model even when low-quality teacher data are included.

SOLUTION: Coordinates corresponding to expert data, whose labeling reliability satisfies a prescribed criterion, and to non-expert data, whose labeling reliability is unknown, are acquired; the distances between the non-expert data and the expert data are calculated; and a neighborhood distance is defined by applying those distances to a prescribed rule. For each selected non-expert datum, the expert data within the neighborhood distance are retrieved and the same-label probability, that is, the probability that the label of the non-expert datum matches the labels of the expert data within the neighborhood distance, is calculated. This probability is applied to a reliability function to determine the reliability of the non-expert datum, and the reliability is added to that datum. A classification model for labeling desired data is then learned from the expert data and the non-expert data to which the reliability has been added.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a classification model learning apparatus and a classification model learning method for labeling classification target data in machine learning.

Machine learning is one of the important fields in data mining. It is often applied to classification problems, in which a computer builds a classification model by learning the evaluations given by humans. Applications of machine learning have produced many results in a wide range of fields such as image recognition, character recognition, and text classification. Learning of this kind is generally called supervised learning.

Supervised learning requires “teacher data” that teach the computer the correct judgment, that is, data to which a “label” has been attached by a human. From the teacher data the computer learns how to classify, and it can then make judgments on new data automatically. With the development of the IT environment, large amounts of detailed data are now available for machine learning, and using them as teacher data is expected to lead to more accurate classification models.

However, labeling becomes a problem when “a large amount of teacher data” is to be obtained from “a large amount of data”. To use the obtained data as teacher data, a label determined by a human must be assigned to each datum, and accurate labeling requires accurate judgment based on knowledge and experience of the domain from which the data were taken.

Ideally, experts in the target field who satisfy these conditions would perform the labeling, but asking experts to label all the data is very costly. Because cost is limited in practice, teacher data labeled by non-experts are used when a large amount of teacher data is required: high-cost teacher data from experts tend to be small in quantity, whereas low-cost teacher data from non-experts can be acquired in relatively large quantities. On the other hand, teacher data from non-experts may contain a relatively large number of erroneous labels caused by misjudgment and inaccurate knowledge.

Moreover, general machine learning does not use information about how the teacher data were acquired. Even when “good-quality teacher data”, such as data labeled by experts, and “noisy teacher data”, such as data labeled by non-experts, are mixed, all data are treated alike and used equally for learning.

Therefore, when a small amount of teacher data from experts and a large amount of teacher data from non-experts are treated alike and used for learning as before, the noise contained in the non-expert data can strongly affect learning, and an accurate classification model may not be obtained.

Meanwhile, when a classification model is learned, it is common to use part of the teacher data selectively or to place weights on part of the teacher data. AdaBoost, one of the representative methods of ensemble learning, is an example. AdaBoost assigns weights to the learning data and generates a learner, then increases the weights of the data that were misclassified and generates a learner again. By repeating this it obtains a plurality of weak learners, and it classifies by a weighted vote of those weak learners (see, for example, Patent Document 1 and Non-Patent Document 1).
Patent Document 1: JP 2002-133389 A
Non-Patent Document 1: Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm”, Proc. of the 13th Int. Conf. on Machine Learning, 1996, pp. 148-156

However, the conventional techniques merely weight the teacher data in a manner dictated by a predetermined algorithm; they do not incorporate knowledge or information available before the learning process starts, such as differences in the accuracy of the teacher data. When teacher data of different quality, for example a small amount of teacher data from experts (hereinafter “expert data”) and a large amount of teacher data from non-experts (hereinafter “non-expert data”), are treated alike and used for learning as before, the noise contained in the lower-quality teacher data strongly affects learning, and an accurate classification model cannot be constructed.

To address this problem, the present applicant proposed, in Japanese Patent Application No. 2007-278893, a technique for learning an accurate classification model by making use of a small amount of teacher data from experts. In this technique, a reliability is added to the labels of the non-expert teacher data on the basis of the expert teacher data, and the classification model is learned with that reliability reflected in the learning. The reliability is obtained from the distance (for example, the Euclidean distance or the cosine distance) between the coordinates to which the expert data and the non-expert data are each mapped according to a predetermined rule.

The N expert data closest to the target non-expert datum are searched for, and the non-expert datum obtains reliability from each of those expert data that has the same label. The reliability is given, for example, in inverse proportion to the distance, so that a non-expert datum obtains a high reliability when nearby expert data have the same label. This expresses the intuitive rating that the closer the reliable data are, the higher the reliability.

However, this method of assigning reliability implicitly assumes that the expert data contain no erroneous labels. If all the expert data are given appropriate labels, the reliability assigned to the non-expert data by referring to them can also be expected to be appropriate. If the expert data contain erroneous labels, however, reliability assigned in this way is not necessarily appropriate. FIG. 10 and FIG. 11 illustrate the relationship between the labeling of expert data and the reliability of the labeling of non-expert data.

In FIG. 10, an expert datum X₁ lies very close to a non-expert datum x₁. Because X₁ is so close, the reliability of x₁ is high if X₁ and x₁ have the same label and low if they differ. Suppose that the label that should properly be assigned to both X₁ and x₁ is L₁. If the expert datum X₁ carries the correct label L₁, the reliability is high when the non-expert datum x₁ carries L₁ and low when it carries a different label L₂. This is an appropriate reliability.

In FIG. 11, consider the case where the erroneous label L₂ is assigned to the expert datum X₁. The reliability is then low when the non-expert datum x₁ carries the label L₁ that should properly be assigned, and high when it carries the label L₂ that should not be assigned. This is clearly the opposite of an appropriate reliability.
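The behavior described for FIG. 10 and FIG. 11 can be reproduced with a minimal sketch of the earlier distance-based scoring. The text only says that the reliability is given "for example, in inverse proportion to the distance", so the exact rule below (sum of 1/distance over same-label experts among the nearest ones), the function name `inverse_distance_reliability`, the parameter `eps`, and the coordinates are all assumptions made for illustration.

```python
import math

def inverse_distance_reliability(target_xy, target_label, experts, n_nearest=1, eps=1e-9):
    """Score a non-expert label from its n_nearest expert data (assumed rule).

    experts is a list of (coordinates, label) pairs. Each of the nearest
    experts that shares the target's label contributes 1 / distance.
    """
    ranked = sorted(experts, key=lambda e: math.dist(target_xy, e[0]))
    score = 0.0
    for coords, label in ranked[:n_nearest]:
        if label == target_label:
            # eps avoids division by zero when the two points coincide
            score += 1.0 / (math.dist(target_xy, coords) + eps)
    return score

x1 = (0.0, 0.0)                  # non-expert datum; its proper label is "L1"
fig10 = [((0.0, 0.1), "L1")]     # expert X1 carries the correct label
fig11 = [((0.0, 0.1), "L2")]     # expert X1 carries the erroneous label

fig10_scores = {lab: inverse_distance_reliability(x1, lab, fig10) for lab in ("L1", "L2")}
fig11_scores = {lab: inverse_distance_reliability(x1, lab, fig11) for lab in ("L1", "L2")}
print(fig10_scores)
print(fig11_scores)
```

With the correctly labeled expert, the proper label L1 receives the higher score; with the mislabeled expert the ordering is reversed, which is the failure the text describes.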

That is, when the expert data contain erroneous labels, inappropriate reliability is added to the non-expert data, and the performance of the classification model generated with that reliability reflected deteriorates. Since in reality the expert data are also likely to contain a small number of erroneous labels, a way of adding reliability that is robust to erroneous labels in the expert data is needed.

In view of these problems of the prior art, an object of the present invention is to provide a classification model learning apparatus and a classification model learning method capable of constructing an accurate classification model even when low-quality teacher data are included.

A classification model learning apparatus according to the present invention comprises: an expert data storage unit that stores, as expert data, teacher data whose labeling reliability in machine learning satisfies a predetermined criterion; a non-expert data storage unit that stores, as non-expert data, teacher data whose labeling reliability is unknown; a neighborhood distance definition unit that is connected to the expert data storage unit and the non-expert data storage unit, acquires the coordinates to which each of the expert data and the non-expert data corresponds, calculates the distances from the non-expert data to the expert data, and defines a neighborhood distance by applying the calculated distances to a predetermined rule; a reliability function storage unit that stores a reliability function based on the probability that the label attached to a non-expert datum matches the labels attached to the expert data within the neighborhood distance; a reliability determination unit that is connected to the neighborhood distance definition unit, the reliability function storage unit, the expert data storage unit, and the non-expert data storage unit, searches for the expert data within the neighborhood distance of a selected non-expert datum to calculate the probability, and applies the calculated probability to the reliability function to determine the reliability of the labeling of the non-expert datum; and a classification model learning unit that is connected to the expert data storage unit and the reliability determination unit and learns, on the basis of the expert data and the non-expert data to which the reliability has been added, a classification model that labels desired classification target data.

A classification model learning method according to the present invention is performed by a computer that stores, as expert data, teacher data whose labeling reliability in machine learning satisfies a predetermined criterion and, as non-expert data, teacher data whose labeling reliability is unknown. The method comprises: a neighborhood distance definition step of acquiring the coordinates to which each of the expert data and the non-expert data corresponds, calculating the distances from the non-expert data to the expert data, and defining a neighborhood distance by applying the calculated distances to a predetermined rule; a selection step of selecting, from the stored non-expert data, a non-expert datum to which the reliability is to be added; a reliability determination step of searching for the expert data within the neighborhood distance of the selected non-expert datum, calculating the probability that the label attached to the non-expert datum matches the labels attached to those expert data, and applying the calculated probability to a predefined reliability function to determine the reliability of the labeling of the non-expert datum; and a classification model learning step of learning, on the basis of the expert data and the non-expert data to which the determined reliability has been added, a classification model that labels desired data.

According to the present invention, there are provided a classification model learning apparatus and a classification model learning method capable of constructing an accurate classification model even when low-quality teacher data are included.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing an example of the overall configuration of a classification model learning apparatus 1 according to an embodiment of the present invention. As shown in the figure, the classification model learning apparatus 1 according to the present embodiment comprises an expert data storage unit 11, a non-expert data storage unit 12, a neighborhood distance definition unit 13, a reliability function storage unit 14, a reliability determination unit 15, a reliability-attached non-expert data storage unit 16, a classification model learning unit 17, a classification target data storage unit 18, a prediction unit 19, and a result display unit 20.

The expert data storage unit 11 is a storage device that stores expert data. “Expert data” denotes teacher data that have been labeled, for the purpose of classifying data in machine learning, by a specialist with ample knowledge and experience, and whose labeling accuracy (reliability) is therefore high.

The non-expert data storage unit 12 is a storage device that stores non-expert data. “Non-expert data” denotes teacher data that have been labeled but whose labeling accuracy (reliability) is unclear.

The neighborhood distance definition unit 13 is a program that calculates the inter-coordinate distances from the non-expert data to the expert data and, based on these distances, defines a neighborhood distance representing the range within which the similarity between data is at or above a reference value. Here, a plurality of distances are selected from the calculated inter-coordinate distances according to a predetermined rule and the neighborhood distance is calculated as their average, but the calculation method is not limited to this.

FIG. 2 illustrates expert data and non-expert data concretely in two dimensions. In the figure, circles represent expert data, squares represent non-expert data, and the color of each mark represents its label. The coordinates are obtained by converting each datum according to a predetermined rule. In e-mail classification, for example, a feature word list is created in advance by analyzing a large number of junk mails, and a received mail is converted to coordinates by comparing the words in its body with this list. Specifically, for each of the N words in the feature word list, the comparison result is written as 1 if the word appears in the received mail and 0 if it does not, which converts the received mail into N-dimensional coordinates (for example, (1, 0, 1, ..., 1)). For the sake of explanation, the N-dimensional data obtained from the mail data are represented here schematically in two dimensions. Because received mails whose bodies are similar have nearby coordinates, the coordinates can be used for purposes such as labeling a mail as junk or not.
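The conversion of a mail body into N-dimensional 0/1 coordinates can be sketched as follows. The feature word list, the whitespace tokenization, the lower-casing, and the name `to_coordinates` are assumptions made for illustration; the text does not specify how words are extracted from the mail body.

```python
def to_coordinates(body, feature_words):
    """Map a mail body to a tuple of 0/1 flags, one per feature word."""
    # Assumed tokenization: lower-case and split on whitespace.
    tokens = set(body.lower().split())
    return tuple(1 if word in tokens else 0 for word in feature_words)

# Hypothetical feature word list (N = 4) built from junk-mail analysis.
feature_words = ["free", "winner", "meeting", "prize"]

vec = to_coordinates("You are a WINNER claim your free prize now", feature_words)
print(vec)  # (1, 1, 0, 1)
```

Two mails that share most feature words differ in few coordinates, so a distance computed on these tuples reflects the similarity of their bodies.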

FIG. 2 also shows the neighborhood distance definition unit 13 selecting a non-expert datum and successively obtaining the distance from it to each expert datum. For example, when the rule “the neighborhood distance is the average of the distances from the non-expert data to their fourth-nearest expert data” has been set in advance, the distance r₄ is obtained for each non-expert datum and the average of these distances is calculated.

The reliability function storage unit 14 is a storage device that stores reliability functions suited to the classification problem. A reliability function defines the reliability on the basis of the same-label probability of the expert data within the neighborhood distance of a non-expert datum, and it is preferable to prepare a plurality of such functions in advance for various classification problems. A specific way of defining them is described later.

The reliability determination unit 15 is a program that searches for the expert data in the neighborhood of a non-expert datum on the basis of the neighborhood distance defined by the neighborhood distance definition unit 13, calculates their same-label probability with respect to the non-expert datum, and applies this same-label probability to a reliability function acquired from the reliability function storage unit 14 to determine the reliability of the non-expert datum. Which of the plurality of reliability functions to use may be selected, for example, on the basis of information entered by the user from an input device (not shown) when the model is created, or the function to be used may be set in advance.

The reliability-attached non-expert data storage unit 16 is a storage device that stores non-expert data to which reliability has been added by the processing in the reliability determination unit 15 (hereinafter “non-expert data with reliability”).

The classification model learning unit 17 is a program that learns a classification model using the expert data and the non-expert data with reliability.

The classification target data storage unit 18 is a storage device that stores data to be newly classified, that is, data to which no label has been assigned (hereinafter “classification target data”).

The prediction unit 19 is a program that labels the classification target data stored in the classification target data storage unit 18 using the classification model obtained by the classification model learning unit 17. When AdaBoost is used, the classification method in the prediction unit 19 is the same as in ordinary AdaBoost, so its description is omitted.

The result display unit 20 is a display device, such as a monitor, that displays the prediction results of the prediction unit 19.

The operation of the classification model learning apparatus 1 is described below with reference to the drawings. In this embodiment, the expert data and the non-expert data are treated concretely as two-dimensional data. FIG. 3 is a flowchart showing a specific example of the processing in the neighborhood distance definition unit 13.

In S301, it is determined whether any non-expert data remain unselected. If all the non-expert data have been selected, the process proceeds to S305; if unselected non-expert data remain, the process proceeds to S302.

In S302, one non-expert datum that has not yet been selected is selected from the non-expert data storage unit 12.
In S303, the distances from the selected non-expert datum to all the expert data are calculated.

In S304, the distance from the selected non-expert datum to its k-th nearest expert datum is held in a buffer area (not shown). The optimal integer k depends on the problem; here k is a value set in advance by the user. For example, the distribution of the distances calculated in S303 can be analyzed and k set so that the distance from each non-expert datum falls within a predetermined range. When a plurality of neighboring expert data are to be taken into account in adding the reliability, k may be made larger.
In S305, all the held distances are averaged, the result is output to the reliability determination unit 15 as the neighborhood distance, and the processing ends.

The above processing yields the average distance to the k-th nearest expert datum. If an integer k suited to the problem is set, this distance can be regarded as a typical value defining the neighborhood.
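The loop S301 to S305 can be sketched as below. This is a minimal illustration that assumes Euclidean distance and coordinates given as tuples; the function name `neighborhood_distance` and the sample points are hypothetical.

```python
import math

def neighborhood_distance(non_expert_coords, expert_coords, k):
    """Mean, over all non-expert data, of the distance to the k-th nearest expert."""
    if not non_expert_coords:
        raise ValueError("at least one non-expert datum is required")
    if not 1 <= k <= len(expert_coords):
        raise ValueError("k must be between 1 and the number of expert data")
    held = []                                   # plays the role of the buffer in S304
    for x in non_expert_coords:                 # S301/S302: visit each datum once
        distances = sorted(math.dist(x, e) for e in expert_coords)  # S303
        held.append(distances[k - 1])           # S304: k-th nearest expert
    return sum(held) / len(held)                # S305: average of the held distances

experts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
non_experts = [(0.0, 0.0), (1.0, 0.0)]
r = neighborhood_distance(non_experts, experts, k=2)
print(r)  # 1.0
```

Sorting all distances is the simplest way to find the k-th nearest expert; for large data sets a partial selection or a spatial index could replace it without changing the result.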

FIG. 4 is a flowchart showing a specific example of the processing in the reliability determination unit 15. In S401, it is determined whether any non-expert data remain to be selected. If reliability has been assigned to all the non-expert data and none remain to be selected, the processing ends; if non-expert data without reliability remain, the process proceeds to S402.

In S402, one non-expert datum to which reliability has not yet been assigned is selected from the non-expert data storage unit 12. Here, it is assumed that the j-th non-expert datum, represented by formula (1), has been selected, where x denotes the coordinates and y the label.

In S403, the expert data contained in the neighborhood of the selected non-expert datum are searched for in the expert data storage unit 11 and held. In this example, the “neighborhood” is the region inside the circle of radius r centered on the non-expert datum of formula (1), where r is the neighborhood distance defined by the neighborhood distance definition unit 13. Thus, when the neighborhood distance r is 0.5, the expert datum X_j1 of formula (2) is contained in the neighborhood, whereas the expert datum X_j2 of formula (3) is not.

In S404, the same-label probability is calculated from the N expert data found by the search. In this example, K denotes the number of neighboring expert data that have the same label as the target non-expert datum x_j, and the same-label probability P_j is defined by formula (4):
P_j = K / N ... (4)
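Steps S403 and S404 can be sketched together as follows, assuming Euclidean distance. The text does not say whether a point exactly on the circle counts as inside, nor what happens when no expert datum lies within r; the inclusive boundary and the `None` return value are assumptions, as are the function name and the sample data.

```python
import math

def same_label_probability(target_xy, target_label, experts, r):
    """Return P_j = K / N for the experts within distance r of the target.

    experts is a list of (coordinates, label) pairs. Returns None when the
    neighborhood is empty (assumed behavior; not specified in the text).
    """
    neighbor_labels = [label for coords, label in experts
                       if math.dist(target_xy, coords) <= r]   # S403
    n = len(neighbor_labels)
    if n == 0:
        return None
    k = sum(1 for label in neighbor_labels if label == target_label)
    return k / n                                               # S404, formula (4)

experts = [((0.1, 0.0), "spam"), ((0.0, 0.3), "spam"), ((0.3, 0.3), "ham"),
           ((0.4, 0.0), "spam"), ((2.0, 2.0), "ham")]
p = same_label_probability((0.0, 0.0), "spam", experts, r=0.5)
print(p)  # 0.75: four experts lie within r, three of them labeled "spam"
```

Because P_j is a ratio over all neighbors rather than a contribution from the single closest expert, one mislabeled expert changes it by only 1/N.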

In S405, the reliability of the label of the non-expert datum is calculated using a reliability function that takes the value of formula (4) as its input. The suitable form of the reliability function depends on the classification problem. FIG. 5 to FIG. 7 illustrate specific examples of reliability functions for different evaluation criteria of the classification problem. To give an intuitive understanding of the properties of these reliability functions, consider a situation in which the neighborhood of the non-expert datum of formula (1) contains ten expert data and, because of noise, eight of them have the same label although nine should.

In this situation, if, for example, the labeling of a non-expert datum is to be given high reliability whenever it agrees with 80% or more of the neighboring expert data, a reliability function such as formula (5) is suitable, where a is a parameter that determines the shape of the function.

FIG. 5 illustrates the reliability function of formula (5). The horizontal axis is the same-label probability (P_j), the vertical axis is the reliability (c_j), and the curve connects the points obtained from formula (5) for a = 2.0. When the number of same-label data changes from nine to eight because of noise, the reliability c_j changes from C(9/10) ≈ 0.98 to C(8/10) ≈ 0.96, so the effect on c_j is small. That is, the function is set so that the reliability of the label of the non-expert datum remains high whether nine or eight of the ten neighboring data have the same label, and it can be called an intuitively reasonable reliability function.

When a strict setting against the inclusion of erroneous labels is desired, a reliability function such as Expression (6) is preferable.

FIG. 6 illustrates the reliability function of Expression (6). The horizontal axis is the same-label probability (P_j), the vertical axis is the reliability (c_j), and the curve connects the points given by Expression (6) for a = 5.0. With this function, even a single erroneous label lowers the reliability substantially. It is particularly useful in fields that demand high reliability, such as medicine.

Further, when a tolerant setting toward the inclusion of erroneous labels is desired, a reliability function such as Expression (7) is preferable.

FIG. 7 illustrates the reliability function of Expression (7). The horizontal axis is the same-label probability (P_j), the vertical axis is the reliability (c_j), and the curve connects the points given by Expression (7) for a = 10.0. With this function, the reliability does not drop sharply even when many erroneous labels are included; it decreases gradually as the number of erroneous labels increases.

In S406, the reliability c_j obtained in S405 is added to the target non-expert data, which is then stored in the reliability-attached non-expert data storage unit 16 in the form of Expression (8).

For the two-dimensional example described above (the non-expert data of Expression (1)), the data is stored in the reliability-attached non-expert data storage unit 16 in the form of Expression (9).

Since the reliability of the expert data is always 1, the expert data can be regarded as being stored in the expert data storage unit 11 in the form of Expression (10).
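As a rough illustration of storing a reliability alongside each teacher data item, the following sketch uses a single record type for both kinds of data. The class name `TeacherData`, its field layout, and the sample values are assumptions made for illustration; they are not the forms given by Expressions (8) to (10).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TeacherData:
    """Hypothetical record: coordinates, label, and label reliability."""
    coords: tuple
    label: str
    reliability: float = 1.0  # expert data is treated as always having reliability 1

# Expert data: the reliability defaults to 1.
expert = TeacherData(coords=(0.2, 0.7), label="spam")

# Non-expert data: the reliability c_j determined in S405 is attached in S406.
non_expert = TeacherData(coords=(0.3, 0.6), label="spam", reliability=0.96)
```

Using one record type for both kinds of data means the learning stage can read the reliability uniformly without distinguishing the two sources.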

By using a reliability function that takes into account the same-label probability within the neighborhood distance, an appropriate reliability can be assigned to the non-expert data even if the nearest expert data item carries an erroneous label, as long as the other labels are correct. Such reliability assignment is more robust against erroneous expert labels than one based only on the distance between data items.
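The idea can be sketched as follows, assuming that the same-label probability is the fraction of expert data items within the neighborhood distance whose label matches that of the non-expert data item. The function names, the sample data, and in particular `power_reliability` are hypothetical; the latter is only a stand-in and is not Expression (5), (6), or (7).

```python
import math

def same_label_probability(coords, label, experts, neighborhood_distance):
    """Fraction of expert items within the neighborhood distance sharing `label`."""
    neighbor_labels = [
        expert_label
        for expert_coords, expert_label in experts
        if math.dist(coords, expert_coords) <= neighborhood_distance
    ]
    if not neighbor_labels:
        return None  # assumption: no expert evidence in the neighborhood
    matches = sum(1 for expert_label in neighbor_labels if expert_label == label)
    return matches / len(neighbor_labels)

def power_reliability(p, a=2.0):
    """Hypothetical stand-in reliability function (not one of the patent's)."""
    return p ** a

experts = [
    ((0.0, 0.0), "spam"),
    ((0.1, 0.0), "spam"),
    ((0.0, 0.1), "ham"),   # e.g. an erroneous label on the nearest expert item
    ((5.0, 5.0), "ham"),   # outside the neighborhood
]
p = same_label_probability((0.0, 0.05), "spam", experts, neighborhood_distance=0.5)
c = power_reliability(p)
```

Here three expert items fall inside the neighborhood and two of them agree with the non-expert label, so a single disagreeing neighbor lowers the reliability without driving it to zero.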

FIG. 8 is a flowchart showing a specific example of the processing in the classification model learning unit 17. Any learner that can reflect the reliability is expected to work, but here the processing follows the AdaBoost method because the reliability is easy to incorporate into its data weights. Other methods such as Bagging may also be used.

In S801, equal data weights w_j are assigned to the loaded reliability-attached non-expert data and expert data, following the AdaBoost method. In the present invention, the reliability c_j obtained by the reliability determination unit 15 is attached to the teacher data in addition to the conventional AdaBoost data weight w_j. The n loaded non-expert data items are therefore processed in the form of Expression (11), and the N expert data items in the form of Expression (12).

In S802, the reliability c_j assigned to the non-expert data is reflected in the data weight. Here, the data weight w'_j, which reflects the reliability c_j in the AdaBoost data weight w_j, is set by Expression (13) below.
w'_j = c_j w_j ... (13)

With this setting, even for non-expert data whose data weight w_j is large and which would therefore strongly influence learning, the data weight w'_j becomes small if the reliability c_j is low. The influence of low-reliability teacher data contained in the non-expert data is thus reduced in a natural way.
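A minimal sketch of Expression (13) follows. The function name and the sample values are illustrative assumptions; whether the adjusted weights are renormalized afterwards is not stated here, so the sketch leaves them as they are.

```python
def reliability_weighted(weights, reliabilities):
    """Apply w'_j = c_j * w_j element by element (Expression (13))."""
    return [c * w for c, w in zip(reliabilities, weights)]

weights = [0.25, 0.25, 0.25, 0.25]      # equal AdaBoost data weights w_j
reliabilities = [1.0, 1.0, 0.9, 0.2]    # expert items have c_j = 1
adjusted = reliability_weighted(weights, reliabilities)
```

The last item keeps only a fifth of its weight, so a doubtful non-expert label contributes far less to the weak learner than an expert label does.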

In S803, a weak learner is generated using the data weights w'_j obtained in S802. Various weak learners, such as decision trees, can be used with AdaBoost.
In S804, the cost function, which depends on the data weights and the performance of the weak learner, is updated according to the AdaBoost algorithm.

In S805, it is determined whether the end condition is satisfied. If it is, the process proceeds to S806; if not, the process returns to S802. In the general AdaBoost method, the end condition is that the number of weak learners reaches a predetermined number. For example, if the user specifies 100 weak learners, the end condition is that S802 to S805 have been repeated 100 times.
In S806, a strong learner, which is a highly accurate classification model, is generated by combining the generated weak learners, and the process ends.
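The flow of S801 to S806 can be sketched as below, under several assumptions that go beyond the text: one-dimensional inputs, labels of +1 and -1, decision stumps as weak learners, normalization of w'_j before training, and the weighted error computed from w'_j. All function names and the sample data are hypothetical.

```python
import math

def train_stump(xs, ys, weights):
    """Return (error, threshold, polarity) of the stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def fit(xs, ys, reliabilities, rounds):
    n = len(xs)
    w = [1.0 / n] * n                                          # S801: equal data weights
    ensemble = []
    for _ in range(rounds):                                    # S805: fixed number of rounds
        w_adj = [c * wi for c, wi in zip(reliabilities, w)]    # S802: w' = c * w
        total = sum(w_adj)
        w_adj = [v / total for v in w_adj]                     # assumption: normalize w'
        error, threshold, polarity = train_stump(xs, ys, w_adj)  # S803: weak learner
        error = min(max(error, 1e-10), 1.0 - 1e-10)            # avoid division by zero
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # S804: standard AdaBoost update of the data weights
        w = [wi * math.exp(-alpha * y * (polarity if x >= threshold else -polarity))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """S806: combine the weak learners by a weighted vote."""
    score = sum(alpha * (polarity if x >= threshold else -polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

# Six expert items (reliability 1) and one doubtful non-expert item at x = 1.5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 1.5]
ys = [-1, -1, -1, 1, 1, 1, 1]
reliabilities = [1.0] * 6 + [0.1]
model = fit(xs, ys, reliabilities, rounds=3)
```

Because the low-reliability item is scaled down in every round, the first stump follows the expert data and splits between x = 2 and x = 3 rather than bending toward the doubtful label.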

In this way, knowledge available before the learning process starts, namely the difference in accuracy among the teacher data, is used to assign a reliability to the non-expert data, and this reliability is incorporated into the learning of the classification model. As a result, an accurate classification model can be obtained even when little expert data is available.

FIG. 9 is a flowchart showing a specific example of the processing in the prediction unit 19. In S901, it is determined whether the classification target data storage unit 18 contains classification target data. If it does, the process proceeds to S902; if not, the process ends.

In S902, one classification target data item is selected from the classification target data storage unit 18.
In S903, the selected classification target data is applied to the classification model to label it, and the process returns to S901. The processing from S901 to S903 is repeated until all the classification target data has been labeled.
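The loop of S901 to S903 can be sketched as follows. The function `label_all`, the queue standing in for the classification target data storage unit 18, and the threshold classifier passed in as the model are all illustrative assumptions.

```python
from collections import deque

def label_all(targets, classify):
    """Label every pending item, one at a time, until none remain."""
    pending = deque(targets)
    results = []
    while pending:                               # S901: is there data left?
        item = pending.popleft()                 # S902: select one item
        results.append((item, classify(item)))   # S903: apply the model
    return results

# Hypothetical classification model: a simple threshold rule.
results = label_all([0.5, 4.5], lambda x: 1 if x >= 3 else -1)
```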

With the above configuration, even when the teacher data regarded as highly reliable (the expert data) contains noise, an accurate classification model can be obtained. This is achieved by assigning a reliability to the non-expert data using a reliability function that takes the same-label probability as input together with prior knowledge about the accuracy of each teacher data item, and by incorporating that reliability into the learning of the classification model.

The present invention is not limited to the above embodiment as it is; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiment. For example, some constituent elements may be removed from the full set shown in the embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

The above embodiment has been described using text data such as mail data as an example, but the type of target data is not limited to this. A classification model can also be created for data such as image data and audio data by converting it into coordinates according to a predetermined rule. For example, if two-dimensional X-ray image data is divided into M rows and N columns and converted into data of 1 row and M×N columns, M×N-dimensional coordinates are obtained. In this case, it is preferable to use the color level of the image data (for example, a 16-level gray scale) as the matrix elements. Then, with X-ray image data whose presence or absence of lesions was judged (labeled) by experienced doctors treated as expert data and data judged by inexperienced doctors treated as non-expert data, reliability can be assigned in the same way as in the above embodiment to create a highly accurate classification model.
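The conversion of an M-row, N-column image into M×N-dimensional coordinates can be sketched as below. The function name and the sample gray levels (0 to 15, matching a 16-level gray scale) are illustrative assumptions.

```python
def image_to_coordinates(image):
    """Flatten an M x N grid of gray levels, row by row, into one M*N-dimensional point."""
    return [level for row in image for level in row]

# A 2 x 3 image with gray levels in the range 0..15.
image = [
    [0, 15, 3],
    [7, 8, 1],
]
coords = image_to_coordinates(image)
```

Once every image is flattened in the same row-by-row order, the distance between two images can be computed in the same way as for the two-dimensional example.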

FIG. 1 is a block diagram showing an example of the overall configuration of the classification model learning apparatus 1 according to an embodiment of the present invention.
FIG. 2 is a diagram specifically illustrating expert data and non-expert data in two dimensions.
FIG. 3 is a flowchart showing a specific example of the processing in the neighborhood distance defining unit 13.
FIG. 4 is a flowchart showing a specific example of the processing in the reliability determination unit 15.
FIG. 5 is a diagram illustrating a specific example of a reliability function chosen according to the evaluation criterion of the classification problem.
FIG. 6 is a diagram illustrating a specific example of a reliability function chosen according to the evaluation criterion of the classification problem.
FIG. 7 is a diagram illustrating a specific example of a reliability function chosen according to the evaluation criterion of the classification problem.
FIG. 8 is a flowchart showing a specific example of the processing in the classification model learning unit 17.
FIG. 9 is a flowchart showing a specific example of the processing in the prediction unit 19.
FIG. 10 is a diagram illustrating the relationship between the labeling of expert data and the reliability of the labeling of non-expert data.
FIG. 11 is a diagram illustrating the relationship between the labeling of expert data and the reliability of the labeling of non-expert data.

Explanation of symbols

1 ... Classification model learning apparatus,
11 ... Expert data storage unit,
12 ... Non-expert data storage unit,
13 ... Neighborhood distance defining unit,
14 ... Reliability function storage unit,
15 ... Reliability determination unit,
16 ... Reliability-attached non-expert data storage unit,
17 ... Classification model learning unit,
18 ... Classification target data storage unit,
19 ... Prediction unit,
20 ... Result display unit.

Claims

A classification model learning apparatus comprising:
an expert data storage unit that stores, as expert data, teacher data whose labeling reliability in machine learning satisfies a predetermined criterion;
a non-expert data storage unit that stores, as non-expert data, teacher data whose labeling reliability is unknown;
a neighborhood distance defining unit that is connected to the expert data storage unit and the non-expert data storage unit, obtains coordinates corresponding to the expert data and the non-expert data, calculates the distances from the non-expert data to the expert data, and defines a neighborhood distance by applying the calculated distances to a predetermined rule;
a reliability function storage unit that stores a reliability function based on the probability that the label attached to the non-expert data matches the label attached to the expert data within the range of the neighborhood distance;
a reliability determination unit that is connected to the neighborhood distance defining unit, the reliability function storage unit, the expert data storage unit, and the non-expert data storage unit, searches for the expert data within the range of the neighborhood distance from selected non-expert data, calculates the probability, and applies the calculated probability to the reliability function to determine the reliability of the labeling of the non-expert data; and
a classification model learning unit that is connected to the expert data storage unit and the reliability determination unit and learns a classification model for labeling desired classification target data based on the expert data and the non-expert data to which the reliability has been added.

The classification model learning apparatus according to claim 1, wherein the reliability function defines the relationship between the probability that the label attached to the non-expert data matches the label attached to the expert data within the neighborhood distance and the reliability of the labeling.

The classification model learning apparatus according to claim 1 or 2, wherein the neighborhood distance defining unit calculates the distances between the coordinates from the non-expert data to the expert data, ranks them for each non-expert data item, sums the distance at a desired rank over the non-expert data items to calculate an average value, and defines the average value as the neighborhood distance.

The classification model learning apparatus according to claim 1, wherein the reliability function is created in advance according to a desired evaluation criterion in a classification problem.

The classification model learning apparatus according to claim 1, wherein the classification model learning unit learns the classification model by reflecting the reliability added by the reliability determination unit in the data weights of ensemble learning.

A classification model learning method performed by a computer that stores, as expert data, teacher data whose labeling reliability in machine learning satisfies a predetermined criterion and, as non-expert data, teacher data whose labeling reliability is unknown, the method comprising:
a neighborhood distance defining step of obtaining coordinates corresponding to the expert data and the non-expert data, calculating the distances from the non-expert data to the expert data, and defining a neighborhood distance by applying the calculated distances to a predetermined rule;
a selection step of selecting, from the stored non-expert data, the non-expert data to which the reliability is to be added;
a reliability determination step of searching for the expert data within the range of the neighborhood distance from the selected non-expert data, calculating the probability that the label attached to the non-expert data matches the label attached to the expert data, and applying the calculated probability to a predefined reliability function to determine the reliability of the labeling of the non-expert data; and
a classification model learning step of learning a classification model for labeling desired data based on the expert data and the non-expert data to which the determined reliability has been added.

The classification model learning method according to claim 6, wherein the reliability function defines the relationship between the probability that the label attached to the non-expert data matches the label attached to the expert data within the neighborhood distance and the reliability of the labeling.

The classification model learning method according to claim 6 or 7, wherein, in the neighborhood distance defining step, the distances between the coordinates from the non-expert data to the expert data are calculated and ranked for each non-expert data item, the distance at a desired rank is summed over the non-expert data items to calculate an average value, and the average value is defined as the neighborhood distance.

The classification model learning method according to any one of claims 6 to 8, wherein the reliability function is created in advance according to a desired evaluation criterion in a classification problem.

The classification model learning method as described in any one of Claims, wherein, in the classification model learning step, the classification model is learned by reflecting the reliability added in the reliability determination step in the data weights of ensemble learning.