JP2020187408A - Learning data creation support system and learning data creation support method

Info

Publication number: JP2020187408A
Application number: JP2019089769A
Authority: JP
Inventors: 雅文露木; Masafumi Tsuyuki; 佑介西; Yusuke Nishi
Original assignee: Hitachi Systems Ltd
Current assignee: Hitachi Systems Ltd
Priority date: 2019-05-10
Filing date: 2019-05-10
Publication date: 2020-11-19
Anticipated expiration: 2039-05-10
Also published as: JP7213138B2

Abstract

To make active learning for classifier creation efficient by identifying unlabeled data that has high learning efficiency and that each oracle can label efficiently.

SOLUTION: A learning data creation support system 100 comprises: a storage device that holds unlabeled data 202 and learning data 201; and an arithmetic device that: calculates, using a classifier trained on the learning data 201 and its labels, the classification probability of the unlabeled data 202 for each classification class and the uncertainty of that classification; calculates, using a predetermined oracle classifier trained on the learning data 201 and oracle information, the classification probability of the unlabeled data 202 for each oracle and the certainty of that classification; selects a predetermined number of unlabeled data items in descending order of the sum of the uncertainty and the certainty; and requests a predetermined number of oracles, in descending order of their classification probability for the selected unlabeled data, to label it.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a learning data creation support system and a learning data creation support method. Specifically, it relates to a technique that identifies unlabeled data which has high learning efficiency and which each oracle can label efficiently, thereby making active learning for classifier creation efficient.

In the field of information security, analysts manually classify samples (URLs and files) obtained from system logs as benign or malicious from an information security standpoint and, if a sample is malicious, classify it further by specific attack method.
Such classification requires a high degree of specialized knowledge and takes a long time per sample, so automatic classification by machine learning is useful.

In classification by machine learning, the features of the item to be classified are input to a classifier, and a classification probability is obtained as output for each candidate class. The features may take any format, for example a one-hot representation of the words contained in a system log string, or the recorded string itself.
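As a small illustration of one of the feature formats mentioned above, the following sketch builds one-hot vectors for the words of a log string. The whitespace tokenizer, the sample log lines, and the function names are assumptions made for this example; the description does not prescribe them.

```python
def build_vocabulary(log_lines):
    """Assign each distinct whitespace-separated word an index, in first-seen order."""
    vocab = {}
    for line in log_lines:
        for word in line.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index (all zeros if unknown)."""
    vec = [0] * len(vocab)
    if word in vocab:
        vec[vocab[word]] = 1
    return vec


# Hypothetical log lines used only to build the vocabulary.
vocab = build_vocabulary(["GET /login HTTP/1.1", "GET /admin HTTP/1.1"])
features = [one_hot(w, vocab) for w in "GET /admin".split()]
```

Each word becomes a vector as long as the vocabulary, so real systems with large vocabularies often prefer the distributed representations that the description also mentions.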

Such classifiers are generally created by supervised learning. Specifically, a classifier is created by training it on learning data that consists of the features of the data to be classified and the classification result (label) corresponding to those features.

Obtaining a highly accurate classifier requires a sufficient amount of learning data that covers the variety of data that may need to be classified. Creating such learning data requires repeating a task called annotation, which is the work of labeling unlabeled data that consists only of features. Annotation therefore demands a large amount of labor and time. The problem is especially serious in the field of information security, where annotation requires advanced knowledge and considerable time.

To address this problem, there is a technique called active learning, which aims to create learning data efficiently with the minimum number of annotations needed to improve the accuracy of the classifier.

In active learning, a classifier is first created from the small amount of learning data that is initially available. Next, the unlabeled data that are candidates for annotation are classified by this classifier, the results are used to evaluate which unlabeled data the classifier finds difficult to classify, and those data are selected as the unlabeled data to be annotated.

Annotation of the selected unlabeled data is then requested from an oracle, and the annotation result is added to the learning data. An oracle is generally a human who knows the correct annotation, but it may be any machine or program.

Finally, the classifier is retrained with the augmented learning data, and the accuracy of the resulting classifier is evaluated on test data. In active learning, this addition of learning data and retraining are repeated until the classifier exceeds the desired accuracy.
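The general active learning cycle described above (train, pick the hardest unlabeled item, ask an oracle, retrain, evaluate) can be sketched as follows. This is a minimal illustration, not the patented method: the 1-D threshold classifier, the sigmoid probability, the entropy-based query rule, and all names are assumptions chosen to keep the example self-contained.

```python
import math


def entropy(probs):
    """Information entropy of a probability vector; higher means harder to classify."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def train(labeled):
    """Toy 1-D classifier: threshold midway between the largest class-0 value
    and the smallest class-1 value (both classes must be present)."""
    highest_neg = max(x for x, y in labeled if y == 0)
    lowest_pos = min(x for x, y in labeled if y == 1)
    return (highest_neg + lowest_pos) / 2.0


def predict_proba(threshold, x):
    """Return [P(class 0), P(class 1)] from a sigmoid around the threshold."""
    p1 = 1.0 / (1.0 + math.exp(-(x - threshold)))
    return [1.0 - p1, p1]


def accuracy(threshold, test_set):
    hits = sum((predict_proba(threshold, x)[1] >= 0.5) == bool(y) for x, y in test_set)
    return hits / len(test_set)


def active_learning(labeled, unlabeled, oracle, test_set, target_accuracy):
    labeled, unlabeled = list(labeled), list(unlabeled)
    while True:
        threshold = train(labeled)
        # Stop once the desired accuracy is exceeded or nothing is left to annotate.
        if accuracy(threshold, test_set) > target_accuracy or not unlabeled:
            return threshold, labeled
        # Query the item the current classifier is least sure about.
        query = max(unlabeled, key=lambda x: entropy(predict_proba(threshold, x)))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))


# Hypothetical oracle that knows the true rule "x > 5 is class 1".
threshold, labeled = active_learning(
    labeled=[(0, 0), (20, 1)],
    unlabeled=[4, 6, 9, 15],
    oracle=lambda x: int(x > 5),
    test_set=[(3, 0), (5.5, 1), (7, 1)],
    target_accuracy=0.9,
)
```

In this toy run a single oracle query is enough to move the threshold so that the test accuracy exceeds the target; real classifiers normally need many more rounds.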

Several methods for selecting unlabeled data have been proposed for active learning. One such prior-art technique is a method that selects the unlabeled data expected to minimize the information entropy after the data are added (see Non-Patent Document 1).

Another disclosed method efficiently creates learning data that covers diverse unlabeled data by selecting unlabeled data belonging to clusters near the classification boundary of the classifier (see Patent Document 1).

A further disclosed method realizes active learning for multi-class classification by using information entropy as an index that quantifies the classifier's uncertainty (see Patent Document 2).

Patent Document 1: JP-A-2017-167834
Patent Document 2: JP-A-2010-231768

Non-Patent Document 1: A. Holub, P. Perona and M. C. Burl, "Entropy-based active learning for object recognition," 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, AK, 2008, pp. 1-8.

The prior art, however, assumes the existence of an all-purpose oracle that can classify any kind of unlabeled data accurately and immediately, and focuses only on reducing the number of annotations.

In reality, there are multiple oracles, and each has kinds of unlabeled data that it is good at or poor at handling. Consequently, even for the same single annotation, the accuracy and the time needed to return the annotation result differ from oracle to oracle. The prior art therefore has the problem that an oracle may be asked to classify unlabeled data it handles poorly, which makes active learning inefficient.

An object of the present invention is therefore to provide a technique that identifies unlabeled data which has high learning efficiency and which each oracle can label efficiently, thereby making active learning for classifier creation efficient.

The learning data creation support system of the present invention that solves the above problem comprises: a storage device that holds unlabeled data, to which no label indicating a classification class has been given for data classification by a predetermined classifier, and learning data required for training the classifier, the learning data including labels that oracles have given to unlabeled data and information on those oracles; and an arithmetic device that executes a process of calculating, with the classifier trained on the learning data and the labels, the classification probability of the unlabeled data for each classification class and the uncertainty of the classification, a process of calculating, with a predetermined oracle classifier trained on the learning data and the oracle information, the classification probability of the unlabeled data for each oracle and the certainty of the classification, a process of selecting a predetermined number of the unlabeled data items in descending order of the sum of the uncertainty and the certainty, and a process of requesting a predetermined number of the oracles, in descending order of the classification probability for the selected unlabeled data, to give a label.

In the learning data creation support method of the present invention, an information processing apparatus that holds unlabeled data, to which no label indicating a classification class has been given for data classification by a predetermined classifier, and learning data required for training the classifier, the learning data including labels that oracles have given to unlabeled data and information on those oracles, executes: a process of calculating, with the classifier trained on the learning data and the labels, the classification probability of the unlabeled data for each classification class and the uncertainty of the classification; a process of calculating, with a predetermined oracle classifier trained on the learning data and the oracle information, the classification probability of the unlabeled data for each oracle and the certainty of the classification; a process of selecting a predetermined number of the unlabeled data items in descending order of the sum of the uncertainty and the certainty; and a process of requesting a predetermined number of the oracles, in descending order of the classification probability for the selected unlabeled data, to give a label.

According to the present invention, unlabeled data that has high learning efficiency and that each oracle can label efficiently can be identified, making active learning for classifier creation efficient.

FIG. 1 is a diagram showing an example network configuration including the learning data creation support system of the present embodiment.
FIG. 2 is a diagram showing an example hardware configuration of the learning data selection server of the present embodiment.
FIG. 3 is a diagram showing a configuration example of the learning data of the present embodiment.
FIG. 4 is a diagram showing a configuration example of the unlabeled data of the present embodiment.
FIG. 5 is a diagram showing a configuration example of the classifier information of the present embodiment.
FIG. 6 is a diagram showing a configuration example of the forgetting coefficient information of the present embodiment.
FIG. 7 is a diagram showing a configuration example of the target classification result information of the present embodiment.
FIG. 8 is a diagram showing a configuration example of the oracle classification result information of the present embodiment.
FIG. 9 is a diagram showing a configuration example of the oracle selection information of the present embodiment.
FIG. 10 is a diagram showing a configuration example of the data selection information of the present embodiment.
FIG. 11 is a diagram showing a configuration example of the coefficient information of the present embodiment.
FIG. 12 is a diagram showing a configuration example of the annotation request information of the present embodiment.
FIG. 13 is a flow chart of the learning data selection server of the present embodiment.
FIG. 14 is a flow chart of the learning unit of the present embodiment.
FIG. 15 is a flow chart of the selection unit of the present embodiment.

--- System configuration ---

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a network configuration diagram including the learning data creation support system 100 of the present embodiment. The learning data creation support system 100 shown in FIG. 1 is a computer system that identifies unlabeled data which has high learning efficiency and which each oracle can label efficiently, thereby making active learning for classifier creation efficient.

The learning data creation support system 100 is composed of a learning data management server 102 and a learning data selection server 101. The learning data creation support system 100 is also communicably connected, via an appropriate network 406, to the oracles' terminals, that is, oracle terminals 103.

Of these, the learning data selection server 101 is composed of a learning unit 301, a classification unit 302, a selection unit 303, an annotation request unit 304, and a data set update unit 305.

The learning data selection server 101 reads the learning data 201 and the unlabeled data 202 from the learning data management server 102, selects the unlabeled data to be annotated and the oracle to which the annotation is to be requested, and outputs the result to the oracle terminal 103.
The selection of unlabeled data described above is executed by the target classifier execution unit 302-1, and the selection of the oracle is executed by the oracle classifier execution unit 302-3.

When the oracle terminal 103 receives from an oracle a label entered as the annotation result (data classification result) for unlabeled data, it updates the learning data 201 on the learning data management server 102 through the data set update unit 305 of the learning data selection server 101.

The learning data creation support system 100 repeats this series of processes until the accuracy of the target classifier, which is the goal of the active learning, exceeds a predetermined value. Because this embodiment assumes human classifiers as oracles, it illustrates a form in which the oracle terminal 103 is used as the interface. An oracle may, however, be any classifier implemented as a program.

The operation of each unit of the learning data creation support system 100 is described in detail below together with specific examples. As one example of an organization that operates the learning data creation support system 100 of this embodiment, consider a business in which analysts scrutinize the logs of an IT system, classify the methods of cyber attacks that third parties have attempted against the IT system, and implement appropriate countermeasures for each attack method.

Examples of such attack methods include DoS (Denial of Service) attacks, phishing, and XSS (cross-site scripting).

In an IT system managed by such a business, a log collection system deployed in the IT system collects the system's operation logs daily. The collected operation logs are converted into a suitably designed feature format and then accumulated as the unlabeled data 202 described above.

For the features, a format suited to the data to be classified is selected, for example a distributed representation or one-hot representation of the words contained in the operation log string, or the recorded string itself.

The results obtained when analysts scrutinize the unlabeled data 202 and classify them into classes representing attack methods (classification classes) are recorded as the learning data 201. Specific configuration examples of the learning data 201 and the unlabeled data 202 are described later with reference to FIGS. 3 and 4.

Next, the classification basis and the correctness of the target class label are described. The target classifier in this embodiment is a classifier that is trained on the learning data 201 and that takes features as input and classifies them into classes representing attack methods.

The algorithm of the target classifier is selected to suit the classification target, for example a decision tree, Random Forest, or Support Vector Machine, and is configured so that the learning execution unit 301-1 can execute its training process.
The oracle classifier in this embodiment is a classifier that is trained on the learning data 201 and that takes features as input and classifies them into classes representing oracle names.

The algorithm of the oracle classifier is likewise selected to suit the classification target, for example a decision tree, Random Forest, or Support Vector Machine, and is configured so that the learning execution unit 301-1 can execute its training process.
The learning unit 301 is composed of the learning execution unit 301-1, the classifier information 301-2, and the forgetting coefficient information 301-3.

Of these, the learning execution unit 301-1 trains the target classifier and the oracle classifier on the learning data 201 and records how to execute the two classifiers in the classifier information 301-2.

At this time, the learning execution unit 301-1 calculates the difference between the time at which training is executed and the annotation time included in the learning data 201, and multiplies this difference by the value of the forgetting coefficient information 301-3 to obtain a weight coefficient for the learning data. The learning execution unit 301-1 may then train the oracle classifier while reducing, according to this weight coefficient, the value of the loss function for learning data whose annotation time is old.
By "forgetting" old learning data in this way, learning data can be selected with priority given to the annotations that each oracle has experienced recently.
Specific configuration examples of the classifier information 301-2 and the forgetting coefficient information 301-3 are described later with reference to FIG. 5.
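The forgetting step can be sketched as below. The product of elapsed time and forgetting coefficient follows the description; how that product is turned into a loss multiplier is not specified there, so the `1 / (1 + product)` mapping, the day-based time unit, and all function names are assumptions for illustration only.

```python
from datetime import datetime


def forgetting_weight(trained_at, annotated_at, forgetting_coefficient):
    """Weight for one learning data record; older annotations get smaller weights."""
    elapsed_days = (trained_at - annotated_at).total_seconds() / 86400.0
    # Product described in the text: time difference x forgetting coefficient.
    decay = forgetting_coefficient * elapsed_days
    # Assumed mapping from that product to a loss multiplier in (0, 1].
    return 1.0 / (1.0 + decay)


def weighted_loss(per_sample_losses, weights):
    """Weighted mean of per-sample losses, so old samples contribute less."""
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / sum(weights)


now = datetime(2019, 5, 10)
w_recent = forgetting_weight(now, datetime(2019, 5, 9), forgetting_coefficient=0.1)
w_old = forgetting_weight(now, datetime(2019, 4, 10), forgetting_coefficient=0.1)
loss = weighted_loss([1.0, 1.0], [w_recent, w_old])
```

Many training libraries accept such values as per-sample weights, which is one way the reduced loss for old annotations could be applied when training the oracle classifier.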

The classification unit 302 is composed of the target classifier execution unit 302-1, the target classification result information 302-2, the oracle classifier execution unit 302-3, and the oracle classification result information 302-4.

The classification unit 302 reads from the classifier information 301-2 how to execute the oracle classifier and the target classifier, classifies the unlabeled data 202, and stores the results in the oracle classification result information 302-4 and the target classification result information 302-2. Configuration examples of the oracle classification result information 302-4 and the target classification result information 302-2 are described later with reference to FIGS. 8 and 7, respectively.

The selection unit 303 is composed of the oracle selection unit 303-1, the oracle selection information 303-2, the data selection unit 303-3, the data selection information 303-4, the coefficient update unit 303-5, and the coefficient information 303-6.

Of these, the oracle selection unit 303-1 selects, for each unlabeled data item, the oracle with the highest classification probability in the oracle classification result information 302-4 as the oracle to which annotation is requested, and stores the selection result in the oracle selection information 303-2.
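The oracle selection described above amounts to an argmax over the oracle classifier's output for each unlabeled data item. A minimal sketch, assuming the oracle classification results are available as a nested dictionary (the layout, the IDs, and the oracle names are hypothetical):

```python
def select_oracles(oracle_probs):
    """Map each unlabeled data ID to the oracle with the highest classification probability.

    oracle_probs: {unlabeled_data_id: {oracle_name: probability}}
    """
    return {uid: max(probs, key=probs.get) for uid, probs in oracle_probs.items()}


oracle_selection = select_oracles({
    "u1": {"analyst_A": 0.7, "analyst_B": 0.2, "analyst_C": 0.1},
    "u2": {"analyst_A": 0.1, "analyst_B": 0.3, "analyst_C": 0.6},
})
```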

The data selection unit 303-3 calculates a priority as the sum of the uncertainty contained in the target classification result information 302-2 and the certainty contained in the oracle classification result information 302-4, selects a predetermined number of unlabeled data items in descending order of this priority, and stores them in the data selection information 303-4. Configuration examples of the oracle selection information 303-2, the data selection information 303-4, and the coefficient information 303-6 are described later with reference to FIGS. 9, 10, and 11.

The uncertainty can be calculated, specifically, by the method described in Patent Document 2. The certainty can be calculated, for example, by multiplying the uncertainty value by -1.

In this case, the data selection unit 303-3 may multiply the uncertainty by the coefficient C1 and the certainty by the coefficient C2, both recorded in the coefficient information 303-6, and then take their sum to obtain the priority.
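The priority calculation can be sketched as follows. The weighted sum C1 x uncertainty + C2 x certainty and the "certainty = -1 x uncertainty" rule follow the description; computing uncertainty as natural-log information entropy is an assumption (the description refers to Patent Document 2 for the exact method), as are the data layout and names.

```python
import math


def uncertainty(probs):
    """Assumed uncertainty measure: information entropy of the class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def certainty(probs):
    """Certainty as the uncertainty multiplied by -1."""
    return -1.0 * uncertainty(probs)


def select_unlabeled(target_probs, oracle_probs, c1, c2, n):
    """Return the n unlabeled data IDs with the highest priority.

    target_probs / oracle_probs: {unlabeled_data_id: [probability, ...]} from the
    target classifier and the oracle classifier respectively.
    """
    priority = {
        uid: c1 * uncertainty(target_probs[uid]) + c2 * certainty(oracle_probs[uid])
        for uid in target_probs
    }
    return sorted(priority, key=priority.get, reverse=True)[:n]


selected = select_unlabeled(
    target_probs={"u1": [0.5, 0.5], "u2": [0.9, 0.1], "u3": [0.5, 0.5]},
    oracle_probs={"u1": [0.9, 0.1], "u2": [0.9, 0.1], "u3": [0.6, 0.4]},
    c1=1.0, c2=1.0, n=2,
)
```

Here "u1" ranks first because the target classifier is unsure about it while the oracle classifier clearly points to one oracle, which is the combination the priority is designed to favor.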

These coefficients are parameters that adjust the balance between prioritizing the classification uncertainty of the target classifier, so that unlabeled data with high learning efficiency is selected, and prioritizing the classification certainty of the oracle classifier, so that unlabeled data that is easy for the oracle to annotate is selected.

Considering the efficiency of active learning, it is preferable to set C2 >= C1 in the early stage of active learning so that unlabeled data that is easy to annotate is selected preferentially. In that case, to handle the situation in which easy-to-annotate unlabeled data alone does not sufficiently improve the accuracy of the target classifier, the coefficient update unit 303-5 increases the value of coefficient C1 step by step, according to the number of times the target classifier has been trained, until C2 <= C1, so that unlabeled data that is hard to annotate also comes to be selected.
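One possible schedule for the coefficient update is shown below. The description only states that C1 is increased step by step with the number of training rounds; the linear step, the upper bound, and the concrete values are assumptions for illustration.

```python
def updated_c1(initial_c1, rounds_done, step, c1_max):
    """Increase C1 linearly with the number of completed training rounds, up to c1_max."""
    return min(initial_c1 + step * rounds_done, c1_max)


C2 = 1.0  # kept fixed in this sketch
schedule = [updated_c1(0.2, r, step=0.2, c1_max=2.0) for r in range(12)]
# Early rounds: C2 >= C1 (easy-to-annotate data preferred).
# Later rounds: C2 <= C1 (informative but harder data also selected).
```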

The annotation request unit 304 is composed of the annotation request information creation unit 304-1 and the annotation request information 304-2. Of these, the annotation request information creation unit 304-1 joins the oracle selection information 303-2 and the data selection information 303-4 using the unlabeled data ID as the key, and creates the annotation request information 304-2. The specific configuration of the annotation request information 304-2 is described later with reference to FIG. 12.
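The join on the unlabeled data ID can be sketched as follows. The record fields shown are hypothetical, since the actual layout of the annotation request information (FIG. 12) is not reproduced in this section.

```python
def build_annotation_requests(oracle_selection, data_selection):
    """Join the two selection results on the unlabeled data ID.

    oracle_selection: {unlabeled_data_id: oracle_name}
    data_selection:   [(unlabeled_data_id, priority), ...] in priority order
    """
    return [
        {"unlabeled_data_id": uid, "oracle_name": oracle_selection[uid], "priority": prio}
        for uid, prio in data_selection
        if uid in oracle_selection  # inner join: skip IDs without a selected oracle
    ]


requests = build_annotation_requests(
    oracle_selection={"u1": "analyst_A", "u2": "analyst_B", "u3": "analyst_A"},
    data_selection=[("u3", 0.42), ("u1", 0.10)],
)
```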

When the annotation request information 304-2 is updated, the oracle terminal 103 displays its contents on a display or the like and accepts, from the classifier who acts as the oracle, the input of the target class label corresponding to the unlabeled data.

On receiving the target class label from the oracle, the oracle terminal 103 sends the unlabeled data ID, the target class label, and the oracle name, which is the name of the oracle who made the input, to the data set update unit 305 of the learning data selection server 101.
In this case, it is preferable that the oracle terminal 103 also accepts, as additional information, the user's designation of the features that form the classification basis for assigning the target class label.

When target class labels are entered by multiple oracles for the same unlabeled data ID, the oracle terminal 103 may evaluate the "correctness of the target class label" of each input by taking a majority vote, and send the result to the data set update unit 305. This majority-vote processing may instead be executed on the learning data selection server 101.
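A minimal sketch of the majority vote, assuming the labels for one unlabeled data ID are collected per oracle name. The tie-breaking behavior (first label encountered wins) and the returned structure are assumptions, as the description does not specify them.

```python
from collections import Counter


def majority_vote(labels_by_oracle):
    """Return the majority label and, per oracle, whether its label agrees with it.

    labels_by_oracle: {oracle_name: target_class_label} for one unlabeled data ID.
    On a tie, Counter.most_common keeps the label that was counted first.
    """
    counts = Counter(labels_by_oracle.values())
    winner, _ = counts.most_common(1)[0]
    correctness = {oracle: label == winner for oracle, label in labels_by_oracle.items()}
    return winner, correctness


label, correctness = majority_vote(
    {"analyst_A": "XSS", "analyst_B": "XSS", "analyst_C": "DoS"}
)
```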

The data set update unit 305 of the learning data selection server 101 receives the unlabeled data ID and the target class label from the oracle terminal 103, deletes the row containing that unlabeled data ID from the unlabeled data 202, and adds the information received from the oracle terminal 103 to the learning data 201.

--- Hardware configuration ---

Next, an example hardware configuration of the learning data selection server 101, the main component of the learning data creation support system 100 of this embodiment, is described with reference to FIG. 2.
The learning data selection server 101 of this embodiment includes a storage device 401, a memory 404, an arithmetic device 403, and a communication device 405.

Of these, the storage device 401 is composed of an appropriate non-volatile storage element such as an SSD (Solid State Drive) or a hard disk drive. The memory 404 is composed of a volatile storage element such as RAM.

The arithmetic device 403 is a CPU that executes the program 402 held in the storage device 401, for example by reading it into the memory 404, to perform overall control of the apparatus itself as well as various determination, calculation, and control processes.

The communication device 405 connects to an appropriate network 406 such as the Internet or a LAN and handles communication with other devices such as the learning data management server 102 and the oracle terminal 103.

In addition to the program 402 for implementing the functions required of the learning data selection server 101 that constitutes the learning data creation support system 100 of this embodiment, the storage device 401 stores the classifier information 301-2, the forgetting coefficient information 301-3, the target classification result information 302-2, the oracle classification result information 302-4, the oracle selection information 303-2, the data selection information 303-4, the coefficient information 303-6, and the annotation request information 304-2. Details of this information are described later.

When the arithmetic device 403 executes the program 402, the following functional units are implemented: the learning execution unit 301-1, target classifier execution unit 302-1, oracle classifier execution unit 302-3, oracle selection unit 303-1, data selection unit 303-3, coefficient update unit 303-5, annotation request information creation unit 304-1, and data set update unit 305. The operation of these functional units is also described later.
--- Data structure example ---

Next, the data used by the learning data selection server 101 and the learning data management server 102, which constitute the learning data creation support system 100 of the present embodiment, will be described.

FIG. 3 is a diagram showing a configuration example of the learning data 201 in the present embodiment. The learning data 201 is a collection of records keyed by a learning data ID 201a, a numerical value or character string that uniquely identifies each item of learning data. Each record associates the key with the feature amount 201b of the learning data, the target class label 201c indicating the class into which the data is classified (the classification class), the oracle name 201d of the oracle that assigned (annotated) the target class label, the annotation time 201e at which that oracle performed the annotation, the classification basis 201f on which the annotation rests, and the correctness 201g of the target class label, which indicates whether the assigned class label is correct.

The feature amount 201b is a value that characterizes the data to be classified, or data derived from it, and may take any form such as a character string, a numerical value, or a vector.

FIG. 4 is a diagram showing a configuration example of the unlabeled data 202 in the present embodiment. The unlabeled data 202 is a collection of records keyed by an unlabeled data ID 202a, a numerical value or character string that uniquely identifies each item of unlabeled data, each associated with the feature amount 202b of that item.

The feature amount 202b is a value that characterizes the data to be classified, or data derived from it, and may take any form such as a character string, a numerical value, or a vector.
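
The two record layouts described for FIG. 3 and FIG. 4 can be sketched as plain data classes. This is only an illustration: the class and field names are hypothetical, and the feature amount is typed as a list of floats for simplicity even though the description allows strings, numbers, or vectors.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class LearningRecord:
    """One record of the learning data 201 (field names are hypothetical)."""
    learning_data_id: str                        # 201a, record key
    features: List[float]                        # 201b, feature amount
    target_class_label: str                      # 201c, e.g. "XSS"
    oracle_name: str                             # 201d, who annotated
    annotation_time: datetime                    # 201e, when it was annotated
    classification_basis: Optional[str] = None   # 201f, may be absent
    label_is_correct: bool = True                # 201g


@dataclass
class UnlabeledRecord:
    """One record of the unlabeled data 202 (field names are hypothetical)."""
    unlabeled_data_id: str                       # 202a, record key
    features: List[float]                        # 202b, feature amount
```

Keeping the oracle name and annotation time on each labeled record is what allows the same data set to train both the target classifier and the oracle classifier.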

FIG. 5 is a diagram showing a configuration example of the classifier information 301-2 in the present embodiment. The classifier information 301-2 is a collection of records keyed by a classifier type 301-2a, which indicates the type of classifier, each associated with a classification execution method 301-2b, which indicates how the classifier program is executed, and a classification destination class 301-2c, which lists the classes into which data can be classified.

FIG. 6 is a diagram showing a configuration example of the forgetting coefficient information 301-3 in the present embodiment. The forgetting coefficient information 301-3 consists of a forgetting coefficient 301-3a, which takes a predetermined numerical value specified when the learning data selection server 101 is started.

FIG. 7 is a diagram showing a configuration example of the target classification result information 302-2 in the present embodiment. The target classification result information 302-2 is a collection of records keyed by an unlabeled data ID 302-2a, a numerical value or character string that uniquely identifies each item of unlabeled data, each associated with the feature amount 302-2b of that item, the classification probabilities for each class calculated by the target classifier (302-2c, 302-2d, 302-2e, 302-2f), and the uncertainty 302-2g.

FIG. 8 is a diagram showing a configuration example of the oracle classification result information 302-4 in the present embodiment. The oracle classification result information 302-4 is a collection of records keyed by an unlabeled data ID 302-4a, a numerical value or character string that uniquely identifies each item of unlabeled data, each associated with the feature amount 302-4b of that item, the classification probabilities for each class calculated by the oracle classifier (302-4c, 302-4d, 302-4e), and the certainty 302-4f.

FIG. 9 is a diagram showing a configuration example of the oracle selection information 303-2 in the present embodiment. The oracle selection information 303-2 is a collection of records keyed by an unlabeled data ID 303-2a, a numerical value or character string that uniquely identifies each item of unlabeled data, each associated with the requested oracle 303-2b identified as the destination of the annotation request.

FIG. 10 is a diagram showing a configuration example of the data selection information 303-4 in the present embodiment. The data selection information 303-4 is a collection of records keyed by an unlabeled data ID 303-4a, a numerical value or character string that uniquely identifies each item of unlabeled data, each associated with the priority 303-4b for performing annotation.

FIG. 11 is a diagram showing a configuration example of the coefficient information 303-6 in the present embodiment. The coefficient information 303-6 consists of a coefficient C1 (303-6a) and a coefficient C2 (303-6b), each a numerical value calculated by the coefficient update unit 303-5.

FIG. 12 is a diagram showing a configuration example of the annotation request information 304-2 in the present embodiment. The annotation request information 304-2 is a collection of records keyed by an unlabeled data ID 304-2a, a numerical value or character string that uniquely identifies each item of unlabeled data, each associated with the estimated target class 304-2b, which is the class into which the target classifier classified the item, the requested oracle 304-2c in charge of classifying the item, and the feature amount 304-2d of the item.
--- Flow example (main flow) ---

The actual procedure of the learning data creation support method in the present embodiment is described below with reference to the drawings. The operations described below are realized by programs that the learning data selection server 101, the learning data management server 102, and the oracle terminal 103, which constitute the learning data creation support system 100, read into memory and execute. Each program consists of code for performing the operations described below.

FIG. 13 is a diagram showing a flow example of the learning data selection method in the present embodiment; specifically, it is a flowchart showing the operation of the learning data selection server 101. The learning data selection server 101 starts the processing in this flow on receiving a request to start active learning of the target classifier, for example from the terminal of a predetermined administrator.

The learning data selection server 101 determines, in the learning unit 301, whether a target classifier and an oracle classifier exist (S101). The learning unit 301 is assumed to hold the target classifier and the oracle classifier once they have been generated (that is, trained) and to store information on them in the classifier information 301-2.

If the determination finds that either the target classifier or the oracle classifier does not exist (S101: No), the learning execution unit 301-1 of the learning data selection server 101 reads the learning data 201 and trains the target classifier and the oracle classifier (S106). The details of the training in S106 are described later with the flow example shown in FIG. 14.

If, on the other hand, both the target classifier and the oracle classifier exist (S101: Yes), the learning data selection server 101 classifies the unlabeled data 202 with the target classifier execution unit 302-1 and the oracle classifier execution unit 302-3, using the method given in the classification execution method 301-2b of the classifier information 301-2 (S102).

Next, the selection unit 303 of the learning data selection server 101 obtains the oracle classification result information 302-4 and the target classification result information 302-2, which are the results of the classification in S102, and selects unlabeled data and oracles (S103). The selection unit 303 stores the results of this selection in the oracle selection information 303-2 and the data selection information 303-4. The specific processing of S103 is described later with the flow example of FIG. 15.

The annotation request unit 304 of the learning data selection server 101 then sends, to the oracle terminal 103 of each oracle selected in S103, a request to annotate the unlabeled data selected in S103 (S104).

When requesting annotation from the oracle terminal 103 in S104, the annotation request unit 304 preferably issues requests so that the number of data items for each target class label in the learning data 201 becomes equal.

For example, after sending the annotation request to the oracle terminal 103, the annotation request unit 304 extracts the target class label indicated by each annotation result returned by the oracle terminal 103 and counts the annotation results per target class label. It then continues issuing annotation requests until, for example, the same number of learning data items has been collected for each of the target class labels "Dos", "phishing", and "XSS" shown in the learning data 201 of FIG. 3.
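
The per-class counting described above can be sketched as follows. The function name and the `target_per_class` parameter are assumptions made for illustration; the description only states that requests continue until every target class label has the same number of items.

```python
from collections import Counter


def classes_needing_labels(labels, classes, target_per_class):
    """Return {class: shortfall} for classes that still have too few items.

    labels:           target class labels received so far
    classes:          all target class labels of interest
    target_per_class: hypothetical common count every class should reach
    """
    counts = Counter(labels)
    return {
        c: target_per_class - counts[c]
        for c in classes
        if counts[c] < target_per_class
    }
```

An empty result means every class has reached the common count, so no further annotation requests are needed for balancing.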

The annotation request unit 304 receives the annotation results returned by the oracle terminal 103 in response to the request made in S104 and passes them to the data set update unit 305. The data set update unit 305 adds the annotation results to the learning data 201 and, at the same time, deletes the annotated items from the unlabeled data 202 (S105).

If the target class labels differ among multiple annotation results obtained for a single item of unlabeled data, the data set update unit 305 may determine the correct label by taking a majority vote over the differing target class labels. In this case, when adding the annotation results to the learning data 201, the data set update unit 305 records the outcome of the majority vote in the correctness 201g of the target class label.
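
A minimal sketch of the majority vote, assuming each annotation result arrives as an (oracle name, label) pair. The function name is hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption, since the description does not say how ties are handled.

```python
from collections import Counter


def majority_vote(annotations):
    """Decide the winning label and mark each annotation as correct or not.

    annotations: list of (oracle_name, label) pairs for one unlabeled item.
    Returns (winning_label, [(oracle_name, label, is_correct), ...]).
    On a tie, the label that appears first wins (an assumption).
    """
    counts = Counter(label for _, label in annotations)
    winner, _ = counts.most_common(1)[0]
    marked = [(oracle, label, label == winner) for oracle, label in annotations]
    return winner, marked
```

The `is_correct` flag corresponds to the correctness 201g, which later lets training skip records whose label lost the vote.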

The learning data selection server 101 then retrains the target classifier and the oracle classifier with the learning execution unit 301-1, based on the learning data 201 to which data was added in S105 (S106). The details of S106 are described later with the flow of FIG. 14.

If the learning result of S106 satisfies a predetermined termination condition (whether the target classifier exceeds a predetermined classification accuracy) (S107: Yes), the learning data selection server 101 ends the process.

On the other hand, if the learning result of S106 does not satisfy the termination condition (S107: No), the learning data selection server 101 returns to S102 and repeats the processes of S102 to S106.
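
The control flow of S101 to S107 can be sketched as a small driver. The function name, the `ops` dictionary of callables keyed by step name, and the `max_rounds` safety cap are all assumptions for illustration; the real processing behind each step is described in the surrounding text.

```python
def run_main_flow(ops, max_rounds=1000):
    """Run the main flow and return the order in which steps were executed.

    ops: dict mapping "S101".."S107" to zero-argument callables (hypothetical).
         ops["S101"]() returns True if both classifiers already exist.
         ops["S107"]() returns True if the termination condition is met.
    max_rounds: safety cap on iterations (an assumption, not in the source).
    """
    trace = []

    def step(name):
        trace.append(name)
        return ops[name]()

    # S101: No -> go straight to training (S106) on the first pass.
    skip_to_training = not step("S101")
    for _ in range(max_rounds):
        if not skip_to_training:
            for name in ("S102", "S103", "S104", "S105"):
                step(name)  # classify, select, request, update data set
        skip_to_training = False
        step("S106")        # (re)train both classifiers
        if step("S107"):    # termination condition satisfied
            break
    return trace
```

Returning the trace rather than a model keeps the sketch focused on the ordering of steps, which is the part the flowchart specifies.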
--- Flow example (detailed flow of the learning unit) ---

FIG. 14 is a diagram showing a flow example of the classifier learning method in the learning unit 301 of the present embodiment. The learning execution unit 301-1 of the learning unit 301 reads the learning data 201 (S201) and removes the records whose "correctness of the target class label" 201g is "incorrect" (S202).

Next, the learning execution unit 301-1 trains the target classifier and the oracle classifier in parallel on the learning data 201 that has passed through S202.

In training the target classifier, the learning execution unit 301-1 uses the feature amount 201b and the target class label 201c included in the learning data 201 (S203).

In training the oracle classifier, when the forgetting coefficient information 301-3 is available, the learning execution unit 301-1 calculates the difference between the time at which training is executed and the annotation time 201e included in the learning data 201. It then computes a weighting coefficient for each item of learning data by multiplying this difference by the value of the forgetting coefficient information 301-3. According to the calculated weighting coefficients, it weights the loss function so that learning data with an older annotation time 201e contributes a smaller loss value (S204).
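
One way to realize this time-based weighting is sketched below. The description only says that the age of a record is multiplied by the forgetting coefficient and that older records should contribute less to the loss; the exponential decay used here to turn that product into a weight, the unit of days, and the function names are all assumptions.

```python
import math
from datetime import datetime, timedelta


def forgetting_weight(annotation_time, now, forgetting_coefficient):
    """Weight in (0, 1] that shrinks as the annotation gets older.

    The product (age in days) * forgetting_coefficient follows the text;
    mapping it through exp(-x) is an assumed choice of decreasing function.
    """
    age_days = max((now - annotation_time).total_seconds() / 86400.0, 0.0)
    return math.exp(-forgetting_coefficient * age_days)


def weighted_mean_loss(per_sample_losses, weights):
    """Weighted average of per-sample losses, so old samples count less."""
    total_weight = sum(weights)
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total_weight
```

Many training libraries accept per-sample weights directly, in which case only `forgetting_weight` would be needed and the weights would be passed to the training routine.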

Next, when the learning data 201 contains a classification basis 201f, the learning execution unit 301-1 applies weighting so that training draws on the feature amounts named in that classification basis 201f (S205). The loss function and weighting referred to above are assumed to be the standard ones used in machine learning.

The learning execution unit 301-1 then trains the oracle classifier from the feature amounts and oracle names included in the learning data 201 (S206).

Finally, the learning execution unit 301-1 stores the learning results of the target classifier and the oracle classifier from S203 and S206 in the classifier information 301-2 (S207), and ends the process.
--- Flow example (detailed flow of the selection unit) ---
FIG. 15 is a flowchart showing the operation of the selection unit 303 in the present embodiment; specifically, it shows a flow example of the method for selecting unlabeled data and oracles.
The selection unit 303 first reads the target classification result information 302-2 and the oracle classification result information 302-4 (S301).

Next, the selection unit 303 selects one item of unlabeled data (S302). For the selected item, the selection unit 303 executes the oracle selection process and the priority calculation process in parallel.

In the oracle selection process, the oracle selection unit 303-1 first selects, from the oracles included in the oracle classification result information 302-4, the oracle with the highest classification probability for the unlabeled data item selected in S302 (S303).

The oracle selection unit 303-1 then appends the name of the oracle selected in S303 and the ID of the corresponding unlabeled data item (unlabeled data ID) to the oracle selection information 303-2 (S304).

For the priority calculation, the data selection unit 303-3 first updates the coefficient information 303-6 (S305). On each execution of this flow, the update either increases the coefficient C1 applied to the uncertainty by a predetermined amount (for example, 1%) or decreases the coefficient C2 applied to the certainty by a predetermined amount (for example, 1%).

Next, the data selection unit 303-3 multiplies the uncertainty by the coefficient C1 and the certainty by the coefficient C2, both taken from the updated coefficient information 303-6, and takes their sum as the priority (S306).

The data selection unit 303-3 then adds the priority obtained in S306 for the unlabeled data item to the data selection information 303-4 (S307). It repeats the processes of S305 to S307 for all unlabeled data and then ends the process (S308).
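
The selection flow of S302 to S308 can be sketched as below. The function names and the dictionary layouts are hypothetical. The uncertainty and certainty are taken as precomputed inputs, since their definitions lie outside this passage. Applying the S305 coefficient update once per call is one reading of "on each execution of this flow" and is an assumption.

```python
def update_coefficients(c1, c2, rate=0.01, raise_c1=True):
    """S305: raise C1 or lower C2 by a fixed fraction (1% in the example)."""
    if raise_c1:
        return c1 * (1.0 + rate), c2
    return c1, c2 * (1.0 - rate)


def select_oracles_and_priorities(target_results, oracle_results, c1, c2):
    """Return (oracle_selection, data_selection, c1, c2).

    target_results: {data_id: uncertainty}
    oracle_results: {data_id: {"probs": {oracle_name: prob}, "certainty": float}}
    """
    c1, c2 = update_coefficients(c1, c2)                       # S305
    oracle_selection, data_selection = {}, {}
    for data_id, uncertainty in target_results.items():        # S302, S308
        entry = oracle_results[data_id]
        probs = entry["probs"]
        # S303, S304: the oracle with the highest classification probability
        oracle_selection[data_id] = max(probs, key=probs.get)
        # S306, S307: priority = C1 * uncertainty + C2 * certainty
        data_selection[data_id] = c1 * uncertainty + c2 * entry["certainty"]
    return oracle_selection, data_selection, c1, c2
```

Sorting `data_selection` by value in descending order then gives the order in which items would be sent out for annotation.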

The best mode for carrying out the present invention has been described in detail above, but the present invention is not limited to it and can be modified in various ways without departing from its gist.

According to the present embodiment, unlabeled data that offers high learning efficiency and that each oracle can label efficiently can be identified, making active learning for classifier creation efficient.

The description in this specification makes at least the following clear. In the learning data creation support system of the present embodiment, the storage device may further hold, for the learning data, information on the feature amounts that the oracle used as the classification basis when assigning a label. In that case, when the classifier calculates the classification probability and the classification uncertainty, the arithmetic device calculates the classification probability of the unlabeled data into the classification class and the uncertainty with respect to the feature amounts that form the classification basis. Likewise, when the oracle classifier calculates the classification probability and the classification certainty, the arithmetic device calculates the classification probability of the unlabeled data into the oracle and the certainty with respect to those feature amounts.

This makes it possible to calculate the classification probability, uncertainty, and certainty with processing narrowed down to the classification basis. As a result, unlabeled data that offers high learning efficiency and that each oracle can label efficiently can be identified, making active learning for classifier creation more efficient.

In the learning data creation support system of the present embodiment, the arithmetic device may, as the number of labeling requests increases, multiply by predetermined coefficients so that the uncertainty term grows in the sum of the uncertainty and the certainty, and then calculate the sum.

With this, unlabeled data that helps improve the accuracy of the classifier and is also easy for a human oracle to classify is prioritized at first, while the processing also reflects the point that, in the course of pursuing further accuracy gains, it becomes important to adopt data with high uncertainty. As a result, unlabeled data that offers high learning efficiency and that each oracle can label efficiently can be identified, making active learning for classifier creation more efficient.

In the learning data creation support system of the present embodiment, the arithmetic device may, when requesting labeling, request labels such that the number of items of learning data in each classification class becomes equal.

This avoids active learning that is biased between classes. As a result, unlabeled data that offers high learning efficiency and that each oracle can label efficiently can be identified, making active learning for classifier creation more efficient.

In the learning data creation support system of the present embodiment, the storage device may further hold, for the learning data, information on the classification time at which an oracle assigned a label in the past, and the arithmetic device may use an oracle classifier trained with weights that, based on the classification time, are lowered for older learning data according to a predetermined forgetting coefficient.

This takes into account the tendency of an oracle's knowledge and skills to deteriorate over time, making it easier to keep the oracle classification accuracy of the oracle classifier high. As a result, unlabeled data that offers high learning efficiency and that each oracle can label efficiently can be identified, making active learning for classifier creation more efficient.

In the learning data creation support system of the present embodiment, when a single item of unlabeled data has been given different labels by multiple oracles, the arithmetic device may determine the correct label by taking a majority vote over those labels, record the correctness of each label in the learning data in addition to the oracle information, and train the oracle classifier using only learning data whose label is correct.

This makes it possible to aggregate the classification results of multiple oracles correctly and to train the oracle classifier on learning data that reflects the aggregated result. As a result, unlabeled data that offers high learning efficiency and that each oracle can label efficiently can be identified, making active learning for classifier creation more efficient.

In the learning data creation support method of the present embodiment, the information processing device may further hold in the storage device, for the learning data, information on the feature amounts that the oracle used as the classification basis when assigning a label. In that case, when the classifier calculates the classification probability and the classification uncertainty, the information processing device calculates the classification probability of the unlabeled data into the classification class and the uncertainty with respect to the feature amounts that form the classification basis. Likewise, when the oracle classifier calculates the classification probability and the classification certainty, it calculates the classification probability of the unlabeled data into the oracle and the certainty with respect to those feature amounts.

In the learning data creation support method of the present embodiment, the information processing device may, as the number of labeling requests increases, multiply by predetermined coefficients so that the uncertainty term grows in the sum of the uncertainty and the certainty, and then calculate the sum.

In the learning data creation support method of the present embodiment, the information processing device may, when requesting labeling, request labels such that the number of items of learning data in each classification class becomes equal.

In the learning data creation support method of the present embodiment, the information processing device may further hold in the storage device, for the learning data, information on the classification time at which an oracle assigned a label in the past, and may use an oracle classifier trained with weights that, based on the classification time, are lowered for older learning data according to a predetermined forgetting coefficient.

In the learning data creation support method of the present embodiment, when a single item of unlabeled data has been given different labels by multiple oracles, the information processing device may determine the correct label by taking a majority vote over those labels, record the correctness of each label in the learning data in addition to the oracle information, and train the oracle classifier using only learning data whose label is correct.

１０１学習データ選択サーバ
１０２学習データ管理サーバ
１０３オラクル端末
２０１学習データ
２０２ラベルなしデータ
３０１学習部
３０１−１学習実行部
３０１−２分類器情報
３０１−３忘却係数情報
３０２分類部
３０２−１標的分類器実行部
３０２−２標的分類結果情報
３０２−３オラクル分類器実行部
３０２−４オラクル分類結果情報
３０３選択部
３０３−１オラクル選択部
３０３−２オラクル選択情報
３０３−３データ選択部
３０３−４データ選択情報
３０３−５係数更新部
３０３−６係数情報
３０４アノテーション依頼部
３０４−１アノテーション依頼情報作成部
３０４−２アノテーション依頼情報
３０５データセット更新部
４０１記憶装置
４０２プログラム
４０３演算装置
４０４メモリ
４０５通信装置
４０６通信ネットワーク

101 Learning data selection server
102 Learning data management server
103 Oracle terminal
201 Learning data
202 Unlabeled data
301 Learning unit
301-1 Learning execution unit
301-2 Classifier information
301-3 Forgetting coefficient information
302 Classification unit
302-1 Target classifier execution unit
302-2 Target classification result information
302-3 Oracle classifier execution unit
302-4 Oracle classification result information
303 Selection unit
303-1 Oracle selection unit
303-2 Oracle selection information
303-3 Data selection unit
303-4 Data selection information
303-5 Coefficient update unit
303-6 Coefficient information
304 Annotation request unit
304-1 Annotation request information creation unit
304-2 Annotation request information
305 Data set update unit
401 Storage device
402 Program
403 Arithmetic unit
404 Memory
405 Communication device
406 Communication network

Claims

A learning data creation support system comprising:
a storage device that holds unlabeled data, to which no label indicating a classification class used when a predetermined classifier classifies data has been assigned, and learning data that is required for training the classifier and includes a label assigned to the unlabeled data by an oracle and information on that oracle; and
an arithmetic unit that executes a process of calculating, with the classifier trained on the learning data and the labels, a classification probability of the unlabeled data for the classification classes and an uncertainty of that classification, a process of calculating, with a predetermined oracle classifier trained on the learning data and the oracle information, a classification probability of the unlabeled data for the oracles and a certainty of that classification, a process of selecting a predetermined number of the unlabeled data in descending order of the sum of the uncertainty and the certainty, and a process of requesting a predetermined number of the oracles, in descending order of the classification probability of the selected unlabeled data, to assign labels.

The learning data creation support system according to claim 1, wherein
the storage device further holds, for the learning data, information on the feature amount that the oracle used as the basis for classification at the time of labeling, and
the arithmetic unit, when calculating the classification probability and the uncertainty with the classifier, calculates the classification probability of the unlabeled data for the classification classes and the uncertainty with respect to the feature amount that served as the basis for classification, and, when calculating the classification probability and the certainty with the oracle classifier, calculates the classification probability of the unlabeled data for the oracles and the certainty with respect to the feature amount that served as the basis for classification.

The learning data creation support system according to claim 1, wherein
the arithmetic unit, as the number of labeling requests increases, multiplies the uncertainty by a predetermined coefficient so that its value becomes larger, and then calculates the sum of the uncertainty and the certainty.

The learning data creation support system according to claim 1, wherein
the arithmetic unit, when requesting the labeling, makes the requests so that the number of data items in each classification class included in the learning data becomes equal.

The learning data creation support system according to claim 1, wherein
the storage device further holds, for the learning data, information on the classification time at which an oracle labeled the data in the past, and
the arithmetic unit uses an oracle classifier trained with weights that, based on the classification time, are lowered for older learning data according to a predetermined forgetting coefficient.

The learning data creation support system according to claim 1, wherein
the arithmetic unit, when a single item of unlabeled data has been given different labels by a plurality of oracles, determines the correct label by taking a majority vote among the different labels, records in the learning data whether each label is correct in addition to the oracle information, and
trains the oracle classifier using only the learning data whose labels are correct.

A learning data creation support method in which an information processing device, which holds unlabeled data to which no label indicating a classification class used when a predetermined classifier classifies data has been assigned, and learning data that is required for training the classifier and includes a label assigned to the unlabeled data by an oracle and information on that oracle, executes:
a process of calculating, with the classifier trained on the learning data and the labels, a classification probability of the unlabeled data for the classification classes and an uncertainty of that classification;
a process of calculating, with a predetermined oracle classifier trained on the learning data and the oracle information, a classification probability of the unlabeled data for the oracles and a certainty of that classification;
a process of selecting a predetermined number of the unlabeled data in descending order of the sum of the uncertainty and the certainty; and
a process of requesting a predetermined number of the oracles, in descending order of the classification probability of the selected unlabeled data, to assign labels.
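For orientation, the four processes of the method can be sketched in Python as below. Several choices here are assumptions rather than statements of the claimed method: the uncertainty is taken as the entropy of the class probabilities, the certainty as the highest oracle probability, and the "classification probability" used to rank oracles as the oracle classifier's output. The function names and input layouts are hypothetical, and the probabilities are supplied directly instead of being produced by trained classifiers.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector; assumed uncertainty measure."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def plan_label_requests(class_probs, oracle_probs, num_data, num_oracles):
    """Choose which data to label and which oracles to ask.

    class_probs: {data_id: [probability per classification class]}
    oracle_probs: {data_id: {oracle_id: probability from the oracle classifier}}
    Returns {data_id: [oracle_id, ...]} for the selected data, best-scored first.
    """
    scores = {}
    for data_id, probs in class_probs.items():
        uncertainty = entropy(probs)                     # how unsure the classifier is
        certainty = max(oracle_probs[data_id].values())  # how clearly one oracle fits
        scores[data_id] = uncertainty + certainty
    # Select the data in descending order of the sum.
    selected = sorted(scores, key=scores.get, reverse=True)[:num_data]
    plan = {}
    for data_id in selected:
        # Ask the oracles in descending order of their classification probability.
        ranked = sorted(oracle_probs[data_id].items(), key=lambda kv: kv[1], reverse=True)
        plan[data_id] = [oracle_id for oracle_id, _ in ranked[:num_oracles]]
    return plan


if __name__ == "__main__":
    class_probs = {"d1": [0.5, 0.5], "d2": [0.9, 0.1], "d3": [0.6, 0.4]}
    oracle_probs = {
        "d1": {"alice": 0.7, "bob": 0.3},
        "d2": {"alice": 0.2, "bob": 0.8},
        "d3": {"alice": 0.45, "bob": 0.55},
    }
    print(plan_label_requests(class_probs, oracle_probs, num_data=2, num_oracles=1))
```

In this sketch the two terms live on different scales (entropy can exceed 1 when there are more than two classes), so a real system would likely normalize them or weight them before summing.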

The learning data creation support method according to claim 7, wherein the information processing device
further holds in the storage device, for the learning data, information on the feature amount that the oracle used as the basis for classification at the time of labeling, and,
when calculating the classification probability and the uncertainty with the classifier, calculates the classification probability of the unlabeled data for the classification classes and the uncertainty with respect to the feature amount that served as the basis for classification, and, when calculating the classification probability and the certainty with the oracle classifier, calculates the classification probability of the unlabeled data for the oracles and the certainty with respect to the feature amount that served as the basis for classification.

The learning data creation support method according to claim 7, wherein the information processing device,
as the number of labeling requests increases, multiplies the uncertainty by a predetermined coefficient so that its value becomes larger, and then calculates the sum of the uncertainty and the certainty.

The learning data creation support method according to claim 7, wherein the information processing device,
when requesting the labeling, makes the requests so that the number of data items in each classification class included in the learning data becomes equal.

The learning data creation support method according to claim 7, wherein the information processing device
further holds in the storage device, for the learning data, information on the classification time at which an oracle labeled the data in the past, and
uses an oracle classifier trained with weights that, based on the classification time, are lowered for older learning data according to a predetermined forgetting coefficient.

The learning data creation support method according to claim 7, wherein the information processing device,
when a single item of unlabeled data has been given different labels by a plurality of oracles, determines the correct label by taking a majority vote among the different labels, records in the learning data whether each label is correct in addition to the oracle information, and
trains the oracle classifier using only the learning data whose labels are correct.