JP2020038514A

JP2020038514A - Learning data generating device, learning data generating method, and program

Info

Publication number: JP2020038514A
Application number: JP2018165599A
Authority: JP
Inventors: 亮小坂; Ryo Kosaka
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2018-09-05
Filing date: 2018-09-05
Publication date: 2020-03-12

Abstract

PROBLEM TO BE SOLVED: To generate a highly accurate classifier, learning data suited to the machine learning algorithm must be collected, and generating good learning data from input data requires data cleansing. In particular, in methods that classify on the basis of the results of classifiers generated by a plurality of machine learning algorithms, appropriate data cleansing is difficult because each machine learning algorithm has different characteristics.
SOLUTION: When generating learning models using a plurality of different machine learning algorithms, a learning data generating device performs cleansing processing appropriate to the characteristics of each machine learning algorithm. This makes it possible to generate the learning data needed for machine learning according to each machine learning algorithm.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a learning data generation device, a learning data generation method, and a program for generating, from input data, learning data suited to a machine learning algorithm.

A classifier classifies input data on the basis of a learning model built with a machine learning algorithm, and different types of machine learning algorithms yield classifiers with different strengths. Using several classifiers of different types therefore produces several classification results, which can serve as clues for improving classification accuracy. Techniques known as ensemble learning and stacking have been developed to improve accuracy further by statistically processing the results of multiple classifiers.

Creating a highly accurate classifier requires learning data suited to the machine learning algorithm that constitutes the classifier.
The input data therefore has to be preprocessed so that it is suitable for learning by the machine learning algorithm. Preprocessing here covers transformations such as data normalization/standardization and the generation of new data, as well as data filtering that removes inappropriate data (see, for example, Patent Document 1). These are hereinafter collectively referred to as “cleansing processing”.

Patent Document 1: JP 2005-181928 A

As described above, generating a highly accurate classifier requires collecting learning data suited to the machine learning algorithm, and producing good-quality learning data from input data requires cleansing the input data.

In particular, in methods that classify on the basis of the classification results of classifiers generated by a plurality of machine learning algorithms, appropriate cleansing of the data is difficult because each machine learning algorithm has different characteristics.

The present invention is a learning data generation device comprising: acquisition means for acquiring input data; processing means for performing cleansing processing on the input data to generate learning data; and generation means that includes a machine learning algorithm and uses it to generate a learning model on the basis of the learning data. The generation means includes, as the machine learning algorithm, a plurality of machine learning algorithms of different types, and the processing means performs first cleansing processing on the input data in correspondence with each of the plurality of machine learning algorithms.

According to the present invention, the learning data needed to generate learning models using a plurality of different machine learning algorithms can be generated according to each machine learning algorithm. This improves the efficiency of machine learning and the accuracy of classification estimation. In addition, common input data can be used regardless of the machine learning algorithm, which reduces the burden on the user.

FIG. 1 is a diagram illustrating the functional configuration of the learning data generation program.
FIG. 2 shows flowcharts of the cleansing processing performed in the cleansing processing unit, and an example of the algorithm database.
FIG. 3 shows an example of a PC operation log and a task list.
FIG. 4 is a flowchart illustrating the process of generating a learning model.
FIG. 5 is a flowchart illustrating the classification estimation process for input data.
FIG. 6 shows an example of the phrases appearing in each task and of post-cleansing data.

Hereinafter, embodiments for carrying out the present invention are described with reference to the drawings. The components described in the embodiments are merely examples and are not intended to limit the scope of the present invention to those embodiments.
<First embodiment>

<Functional configuration of the learning data generation program>
FIG. 1 is a diagram showing the functional configuration for generating learning data in the classifier 100.
FIG. 1(a) shows the overall functions of the classifier 100.
The common input data 110 is the data used both when generating the learning models that perform classification and when performing classification estimation. It is collected as common data from which the learning data input to the plurality of classification estimation units 140 (described later) is generated, and is stored as a database.
The input unit 120 is a processing unit that acquires the common input data 110 as input data.

FIG. 1(b) shows the functions of the cleansing processing unit 130.
The cleansing processing unit 130 is a processing unit that generates learning data (post-cleansing data 170) best suited to learning and classification estimation in the classification estimation units 140 described later. As shown in FIG. 1(b), the cleansing execution unit 132 of the cleansing processing unit 130 cleanses the input data in accordance with the characteristics of the machine learning algorithms recorded in the algorithm database 131 (described later with reference to FIG. 2(c)).

The classification estimation unit 140 is a processing unit that classifies learning data using a machine learning algorithm. One classification estimation unit 140 is provided for each type of machine learning algorithm, so there are a plurality of them.
As noted above, this is because using several types of machine learning algorithms yields classifiers with different strengths, and because using several classification estimation results as clues improves classification accuracy.

FIG. 1(c) shows the functional configuration of each classification estimation unit 140.
As shown in FIG. 1(c), each classification estimation unit 140 consists of a learning model generation unit 141 and a learning model execution unit 143.
The learning model generation unit 141 performs learning based on a machine learning algorithm using the learning data (post-cleansing data 170) generated by the cleansing processing unit 130, and generates a learning model 142.
The learning model execution unit 143 performs classification on the basis of the learning data and the learning model 142 and calculates, for each label to be classified, an evaluation value of the learning data. It then outputs the label with the largest evaluation value as the classification estimation result 144 for that learning data.
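As an illustration only, the final step of the learning model execution unit 143 (choosing the label with the largest evaluation value) can be sketched in Python. The function name `select_label` and the example evaluation values are hypothetical and do not appear in the specification.

```python
def select_label(evaluation_values):
    """Return the label whose evaluation value is largest.

    evaluation_values: dict mapping each candidate label to its
    evaluation value (hypothetical numbers in the example below).
    """
    return max(evaluation_values, key=evaluation_values.get)


# Hypothetical evaluation values for one item of learning data.
values = {"Project A": 0.2, "Project B": 0.7, "Project C": 0.1}
result = select_label(values)
print(result)
```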

The output determination unit 150 evaluates the classification estimation results 144 output from the plurality of classification estimation units 140 together and determines the final output value.
For example, it may tally the classification estimation results 144 from the individual classification estimation units 140 and output the result estimated most often. Alternatively, it may compare the probabilities of the classification estimation results 144 and output the result with the highest probability. The determination method is not limited to these.
The output unit 160 presents the output value determined by the output determination unit 150 to the user by a predetermined method.
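The two determination methods named for the output determination unit 150 might look like the following Python sketch. It assumes each classification estimation unit reports a (label, probability) pair; that data shape, the function names, and the sample values are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(results):
    """Return the label estimated by the largest number of classifiers.

    results: list of (label, probability) pairs, one per classifier.
    Ties go to the label that was seen first.
    """
    counts = Counter(label for label, _ in results)
    return counts.most_common(1)[0][0]


def highest_probability(results):
    """Return the label of the single estimate with the highest probability."""
    return max(results, key=lambda pair: pair[1])[0]


# Hypothetical outputs of three classification estimation units.
results = [("Project A", 0.55), ("Project B", 0.90), ("Project A", 0.60)]
print(majority_vote(results))        # Project A
print(highest_probability(results))  # Project B
```

The two rules can disagree, as they do here, which is one reason the specification leaves the determination method open.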

The post-cleansing data 170 is the learning data obtained after the cleansing processing unit 130 has cleansed the input data for each of the classification estimation units 140, and it is stored in a database. Details of the post-cleansing data 170 are described later with reference to FIG. 6.

<About the cleansing processing unit 130>
FIG. 2 is a flowchart illustrating an example of the input data cleansing performed by the cleansing processing unit 130.

In S210 of the flowchart in FIG. 2(a), the cleansing processing unit 130 acquires, from the common input data 110 via the input unit 120, the input data needed to generate learning data.
In S220, the cleansing processing unit 130 selects, in order, the classification estimation unit 140 to be processed from among the plurality of classification estimation units 140.

In S230, the cleansing processing unit 130 cleanses the input data according to the machine learning algorithm and generates the post-cleansing data 170 as learning data. Cleansing here means, for example, deleting data unsuitable for learning or transforming data to improve learning efficiency (normalization/standardization and the like). The details are described later with reference to the flowchart in FIG. 2(b).
In S240, the cleansing processing unit 130 checks whether all classification estimation units 140 have been selected, and repeats S220 to S240 until no unprocessed classification estimation unit 140 remains.
Once learning data has been generated for all classification estimation units 140, the cleansing processing of FIG. 2(a) ends.
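The loop of S210 to S240 can be sketched as follows. This is a minimal Python illustration, assuming each classification estimation unit is represented by a name and a cleansing function; `generate_learning_data`, the two toy cleansing functions, and the sample records are hypothetical.

```python
def generate_learning_data(common_input, cleansers):
    """Produce one cleansed data set per classification estimation unit.

    common_input: records shared by every unit (acquired in S210).
    cleansers: dict mapping a unit name to its cleansing function.
    """
    post_cleansing = {}
    for name, cleanse in cleansers.items():  # S220: select the next unit
        # S230: cleanse a copy so every unit starts from the same input
        post_cleansing[name] = cleanse(list(common_input))
    return post_cleansing  # S240: every unit has been processed


# Toy cleansing functions standing in for the algorithm-specific processing.
cleansers = {
    "frequency": lambda records: [r for r in records if r != "data"],
    "sequence": lambda records: records[::2],
}
learning_data = generate_learning_data(["data", "YYY", "data", "ZZZ"], cleansers)
print(learning_data)
```

Copying the common input for each unit reflects the point that one shared data set feeds every algorithm while each receives its own cleansed version.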

<Example of a use case for cleansing input data>
To show the details of the input data cleansing performed in S230 of the flowchart in FIG. 2(a), an example of a use case that requires such cleansing is described next.
The description below uses, as an example, a use case in which the operations a worker performs on a personal computer are the input data and each item of input data is classified into the worker's business activity (hereinafter, “task”). Here, the worker's tasks are classified into projects of the kind carried out within a company, such as project A (regular meeting), project B (material creation), and project C (morning meeting). Needless to say, the assumed use case and the data used are not limited to these.

FIG. 3(a) is an example of a PC operation log serving as input data.
As shown in FIG. 3(a), the PC operation log 310 is data in which the operations the worker performed on the personal computer are recorded in chronological order. The PC operation log 310 includes fields such as ID 311, time 312, application name 313, operation target information 314, operation content 315, input key information 316, cursor position 317, and file properties 318.
The operation target information 314 stores, for a document file, information such as the path where the file is saved and the file name; for Web browsing, the URL and the Web page title; and for mail software, information such as the recipient and the subject.

The fields collected for the operation log are not limited to these. For example, the full text of documents and Web pages, the body and attachment file names of e-mails, and information on the GUI parts operated with the mouse (menu item names, button names, and so on) can also be collected and added as new fields.

FIG. 3(b) is an example of a list of the tasks into which the PC operation log 310 shown in FIG. 3(a) is classified.
The task list 320 is data in which the history of the tasks the worker performed is recorded in chronological order. The task list 320 is used when generating the learning model 142.
The task list 320 includes fields such as ID 321, start time 322, end time 323, and task name 324. By referring to the PC operation log 310 of FIG. 3(a) together with the task list 320 of FIG. 3(b), the records of the PC operation log 310 that fall within each task's execution time (from start time 322 to end time 323) are associated with that task, which serves as the correct label.
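The association between log records and correct labels can be sketched in Python as below. The dictionary keys, the timestamps, and the decision to skip records that fall outside every task interval are assumptions made for illustration; the contents of FIG. 3 are not reproduced here.

```python
from datetime import datetime


def label_records(log, tasks):
    """Attach a task name to each log record whose time lies in a task interval.

    Records that fall outside every interval are skipped (an assumption).
    """
    labeled = []
    for record in log:
        for task in tasks:
            if task["start"] <= record["time"] <= task["end"]:
                labeled.append({**record, "task": task["name"]})
                break  # one correct label per record
    return labeled


# Hypothetical operation log and task list.
log = [
    {"id": "a001", "time": datetime(2018, 9, 5, 9, 5), "app": "mail"},
    {"id": "a002", "time": datetime(2018, 9, 5, 10, 30), "app": "spreadsheet"},
    {"id": "a003", "time": datetime(2018, 9, 5, 12, 0), "app": "browser"},
]
tasks = [
    {"name": "Project C (morning meeting)",
     "start": datetime(2018, 9, 5, 9, 0), "end": datetime(2018, 9, 5, 9, 30)},
    {"name": "Project B (material creation)",
     "start": datetime(2018, 9, 5, 10, 0), "end": datetime(2018, 9, 5, 11, 0)},
]
labeled = label_records(log, tasks)
print([(r["id"], r["task"]) for r in labeled])
```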

On the basis of the correct labels obtained in this way and the corresponding PC operation log 310, the classification estimation unit 140 classifies the input data into tasks. At estimation time, when the PC operation log to be estimated is input, the trained classifier estimates the task with the highest confidence. Because many classification techniques can be used as the machine learning for training the classifier, appropriate learning data has to be generated according to the machine learning algorithm used by each technique.

<Example of machine learning algorithms used for task classification>
Various known techniques that focus on the state of the elements contained in the input data can be applied as the machine learning algorithm for classifying input data consisting of the PC operation log 310 into tasks.
Here, the elements contained in the input data are, specifically, the words contained in the PC operation log 310 (for example, “spreadsheet software” and “YYY”, described later with reference to FIG. 6). The state of an element is, for example, the frequency with which words appear, the pattern in which words appear, or the order in which words appear (continuity). Each machine learning algorithm is characterized by which state of the elements it focuses on.

As machine learning algorithms that focus on the frequency with which words appear, BoW (Bag of Words) or SVM (Support Vector Machine) can be used.
BoW is a technique that counts how many times each phrase appears in the data associated with each correct label and learns the phrases characteristic of each label. In this case, it is desirable to cleanse the input data so that phrases appearing frequently across labels are excluded and phrases appearing frequently only under a particular correct label are collected.
SVM is a technique that calculates feature vectors for the words appearing in the operation log and, on the basis of the relationship between correct labels and feature vectors, generates a classifier that finds the boundary separating two labels. One classifier is generated per correct label, and the label whose classifier outputs the highest evaluation value is taken as the classification result. In this case as well, it is desirable to cleanse the input data into learning data that collects words whose feature vectors appear frequently only under a particular correct label.
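To make the notion of frequency-based features concrete, the sketch below turns the words of a record into a count vector over a fixed vocabulary. The vocabulary and the function name `count_vector` are hypothetical; an actual BoW or SVM pipeline would typically rely on a machine learning library, which is not shown here.

```python
from collections import Counter


def count_vector(words, vocabulary):
    """Return how often each vocabulary entry occurs in `words`."""
    counts = Counter(words)
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary built from words in the operation log.
vocabulary = ["spreadsheet software", "data", "YYY"]
vector = count_vector(["YYY", "data", "YYY"], vocabulary)
print(vector)  # [0, 1, 2]
```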

As a machine learning algorithm that focuses on the feature patterns of words, CNN (Convolutional Neural Network), which is often used in image classification, can be used.
CNN here is a technique that treats the records of a predetermined time span, or a predetermined number of records, like a single image and learns to extract the features tied to the correct label. In this case, it is desirable to cleanse the input data into learning data in which the features appear concentrated within a run of records.
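One possible way to group a predetermined number of records into image-like blocks is sketched below. It assumes each record has already been encoded as a row of numbers; the encoding, the window size, and the choice to discard a trailing incomplete window are all assumptions, and the network itself is not shown.

```python
def to_windows(rows, window):
    """Split encoded records into non-overlapping blocks of `window` rows.

    A trailing block with fewer than `window` rows is discarded.
    """
    return [rows[i:i + window] for i in range(0, len(rows) - window + 1, window)]


# Five hypothetical records, each encoded as a row of three numbers.
rows = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
windows = to_windows(rows, window=2)
print(len(windows))  # 2
```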

As machine learning algorithms that focus on the continuity of the input data, algorithms that handle time-series data, of the kind often used in automatic translation and dialogue generation, can be used. Examples are RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory).
RNN and LSTM both learn the order of the records worked on and thereby learn the characteristics of the work procedure followed in each task. In this case, it is desirable to cleanse the input data into learning data from which the rough work procedure is easy to grasp.

<Details of the input data cleansing>
The details of the input data cleansing that the cleansing processing unit 130 performs for task classification with the machine learning algorithms described above are explained with reference to FIG. 2(b).

In S231 of the flowchart in FIG. 2(b), the cleansing processing unit 130 acquires, from the algorithm database 131 shown in FIG. 2(c), the characteristics of the machine learning algorithm used in the classification estimation unit 140 selected in S220 of FIG. 2(a).

In S232, the cleansing processing unit 130 determines whether the machine learning algorithm is a technique that classifies on the basis of the frequency of the words appearing in the PC operation log 310.
If it is (for example, BoW or SVM), the flow proceeds to S233. If not, the flow proceeds to S234.

In S233, the cleansing processing unit 130 performs cleansing that makes the word frequencies distinctive.
For BoW and SVM described above, the cleansing consists of data processing that deletes data common to multiple correct labels, so that the frequencies under each correct label become distinctive.

For example, using the input data shown in FIGS. 3(a) and 3(b), the number of times each phrase appears in each task is counted and a list of appearing phrases is generated per task.
FIG. 6(a) is an example of the appearing-phrase lists generated per task. (a1) is the list for project A (regular meeting) and (a2) is the list for project B (material creation). Each list records an appearing phrase 332 together with its number of appearances 333.
This information shows that the phrases “spreadsheet software” and “data” appear in both project A and project B. The phrases whose ID 331 is ta01 (spreadsheet software) and ta04 (data) are therefore deleted from the task of project A, and the phrases tb03 (spreadsheet software) and tb04 (data) are deleted from the task of project B.

As a result, only the phrases highly relevant to each task can be extracted from the appearing-phrase lists of FIG. 6(a), as shown in FIG. 6(b). Here, (b1) is the appearing-phrase list for project A (regular meeting) after cleansing, and (b2) is the appearing-phrase list for project B (material creation) after cleansing.
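The deletion of phrases shared between tasks can be sketched in Python as follows. Only “spreadsheet software” and “data” come from the description; the other phrases and all counts are hypothetical, as is the function name `remove_shared_phrases`.

```python
from collections import Counter


def remove_shared_phrases(phrase_counts):
    """Drop every phrase that occurs in more than one task's phrase list."""
    tasks_per_phrase = Counter(
        phrase for counts in phrase_counts.values() for phrase in counts
    )
    return {
        task: {p: n for p, n in counts.items() if tasks_per_phrase[p] == 1}
        for task, counts in phrase_counts.items()
    }


# Hypothetical appearing-phrase lists (phrase -> number of appearances).
phrase_counts = {
    "Project A": {"spreadsheet software": 12, "minutes": 8, "agenda": 5, "data": 4},
    "Project B": {"report": 9, "draft": 6, "spreadsheet software": 5, "data": 3},
}
cleansed = remove_shared_phrases(phrase_counts)
print(cleansed)
```

A stricter or looser rule (for example, removing a phrase only when it is frequent in several tasks) would also fit the description; this sketch uses the simplest reading.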

In S234, the cleansing processing unit 130 determines whether the machine learning algorithm is a technique that classifies on the basis of feature patterns in the operation log content within a fixed time span.
If it is (for example, CNN), the flow proceeds to S235. If not, the flow proceeds to S236.

In S235, the cleansing processing unit 130 performs cleansing that makes feature patterns easier to detect.
For CNN described above, the cleansing consists of data processing such as arranging the fields and selecting suitable records, so that feature patterns are easier to detect.

For example, in the PC operation log 310 of FIG. 3(a), the records whose ID 311 is a007 and a009 to a011 use data analysis software as the application. Partway through, however, a single brief operation on a file viewer (the record with ID a008) occurs. Deleting such isolated, brief records concentrates the features.
Alternatively, the data can be normalized, for example by increasing or decreasing the number of data items according to working time, so that the relative weight of records that finish quickly and records that continue for a long time is taken into account. For example, in the PC operation log 310 of FIG. 3(a), record a007, with a working time of 5 minutes, can be weighted according to its working time relative to record a006, with a working time of 5 seconds. Records with different working times are then no longer treated equally as one record each but carry a feature that reflects the weight of their working time.
Data such as that shown in FIG. 6(c) is thereby generated as the cleansed data.
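Both cleansing options described for S235 can be sketched in Python. The record IDs follow the description, but the application assigned to a006, the working times other than 5 seconds and 5 minutes, the 10-second threshold, and the one-copy-per-minute weighting rule are assumptions chosen for illustration.

```python
def drop_isolated_records(records, max_seconds=10):
    """Remove a brief record sandwiched between two records of another application."""
    kept = []
    for i, rec in enumerate(records):
        prev_app = records[i - 1]["app"] if i > 0 else None
        next_app = records[i + 1]["app"] if i + 1 < len(records) else None
        isolated = (
            prev_app is not None
            and prev_app == next_app
            and rec["app"] != prev_app
            and rec["seconds"] <= max_seconds
        )
        if not isolated:
            kept.append(rec)
    return kept


def weight_by_duration(records, unit_seconds=60):
    """Repeat each record once per started unit of working time."""
    weighted = []
    for rec in records:
        copies = max(1, -(-rec["seconds"] // unit_seconds))  # ceiling division
        weighted.extend([rec] * copies)
    return weighted


records = [
    {"id": "a006", "app": "mail", "seconds": 5},
    {"id": "a007", "app": "data analysis", "seconds": 300},
    {"id": "a008", "app": "file viewer", "seconds": 3},
    {"id": "a009", "app": "data analysis", "seconds": 30},
    {"id": "a010", "app": "data analysis", "seconds": 30},
    {"id": "a011", "app": "data analysis", "seconds": 30},
]
concentrated = drop_isolated_records(records)
weighted = weight_by_duration(concentrated)
print([r["id"] for r in concentrated])
print(len(weighted))
```

With these values, a008 is removed as an isolated record and a007 is repeated five times, so a long operation contributes more rows than a brief one.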

In S236, the cleansing processing unit 130 determines whether the machine learning algorithm is a technique that classifies on the basis of the continuity of the operation log (the order of the records).
If it is (for example, RNN or LSTM), the flow proceeds to S237. If not, the flow proceeds to S238.

In S237, the cleansing processing unit 130 performs cleansing that makes trends in continuity easier to see.
For RNN and LSTM described above, the cleansing consists of data processing such as thinning that limits the number of consecutive records for the same operation target, so that the characteristics of the work procedure can be extracted in broad terms.

For example, in the PC operation log 310 of FIG. 3(a), the records whose ID 311 is a009 to a011 show operations on the data analysis software performed consecutively over about 90 seconds. If these are thinned so that there is one operation per predetermined time (for example, one minute), record a010 is deleted, which adjusts the number of consecutive appearances. Records whose processing time is shorter than a predetermined time, such as a006, may also be deleted before the thinning is performed.
Data such as that shown in FIG. 6(d) is thereby generated as the cleansed data.
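The thinning described for S237 can be sketched as below. Times are expressed as seconds from an arbitrary origin, and the specific offsets (0, 45, 90) are hypothetical values consistent with three operations spread over about 90 seconds; the function name `thin_consecutive` is likewise an assumption.

```python
def thin_consecutive(records, min_gap_seconds=60):
    """Keep at most one record per `min_gap_seconds` within a run on the same target."""
    kept = []
    for rec in records:
        last = kept[-1] if kept else None
        same_run = last is not None and last["target"] == rec["target"]
        if same_run and rec["time"] - last["time"] < min_gap_seconds:
            continue  # too close to the previously kept record, so thin it out
        kept.append(rec)
    return kept


records = [
    {"id": "a009", "target": "data analysis", "time": 0},
    {"id": "a010", "target": "data analysis", "time": 45},
    {"id": "a011", "target": "data analysis", "time": 90},
]
thinned = thin_consecutive(records)
print([r["id"] for r in thinned])  # ['a009', 'a011']
```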

In S238, the cleansing processing unit 130 associates the data cleansed by the processing up to S237 with the machine learning algorithm of the classification estimation unit 140 and stores it as the post-cleansing data 170.
When this cleansing processing ends, the flow returns to S240 of FIG. 2(a) and processing resumes.
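The branching of S231 to S238 can be summarized in a short Python sketch. The contents of the algorithm database 131 are not given in this text, so the mapping below (one characteristic per algorithm) and the names `ALGORITHM_DB` and `select_cleansing_step` are assumptions.

```python
# Hypothetical stand-in for the algorithm database 131: algorithm -> characteristic.
ALGORITHM_DB = {
    "BoW": "frequency",
    "SVM": "frequency",
    "CNN": "pattern",
    "RNN": "sequence",
    "LSTM": "sequence",
}


def select_cleansing_step(algorithm):
    """Return the cleansing step that applies to `algorithm`, or None."""
    feature = ALGORITHM_DB.get(algorithm)  # S231: look up the characteristic
    if feature == "frequency":             # S232
        return "S233"
    if feature == "pattern":               # S234
        return "S235"
    if feature == "sequence":              # S236
        return "S237"
    return None                            # no branch applies; continue to S238


print(select_cleansing_step("CNN"))   # S235
print(select_cleansing_step("LSTM"))  # S237
```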

The description so far has used task classification with the PC operation log 310 as the input data as an example. The input data may, however, be other data that records information about the worker's behavior, such as the worker's conversations, call history, action history, and action schedule, in various formats including PC operation logs, video recordings, and audio recordings.
Likewise, the algorithms that classify the input data, the aspects each algorithm is made to learn, and the data cleansing methods are not limited to those described above and can be adapted to a variety of situations.

<About the learning model generation unit 141>
FIG. 4 is a flowchart illustrating the process by which the learning model generation unit 141 of the classification estimation unit 140 generates the learning model 142.

In S410, the learning model generation unit 141 of the classification estimation unit 140 acquires the post-cleansing data 170 generated by the cleansing processing unit 130 as learning data.

In S420, the learning model generation unit 141 generates a learning model using a machine learning algorithm, based on the learning data and the task assigned to each piece of learning data.
The learning model is a model for estimating, for one piece of learning data or for a plurality of pieces of learning data within a predetermined time or of a predetermined amount, an evaluation value for every task to be classified. By learning input data in association with tasks, the model can estimate the evaluation value for each task even for unknown input data.

In S430, the learning model generation unit 141 determines whether the accuracy of classification estimation using the learning model generated in S420 is sufficient.
If the classification estimation accuracy is determined to be sufficient, the process proceeds to S440. If it is determined to be insufficient, the process ends.
The classification estimation accuracy can be obtained by, for example, cross-validation. In cross-validation, the learning data is divided into training data and verification data, and a learning model is generated from the learning data assigned as training data. The accuracy of the learning model is then calculated as the rate at which the learning data assigned as verification data is output with the correct task.
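The accuracy check of S430 can be sketched with a small k-fold cross-validation. The toy word-count classifier, the sample data, and the 0.8 threshold below are all assumptions for illustration; the source does not specify the classifier or the criterion for "sufficient" accuracy.

```python
from collections import Counter, defaultdict

def train(samples):
    """Toy model: for each task, count how often each word appears."""
    model = defaultdict(Counter)
    for words, task in samples:
        model[task].update(words)
    return model

def predict(model, words):
    """Return the task whose word counts best match the input words."""
    scores = {task: sum(counts[w] for w in words) for task, counts in model.items()}
    return max(scores, key=scores.get)

def cross_validate(samples, k=3):
    """Average accuracy over k folds; fold i holds out every k-th sample."""
    accuracies = []
    for i in range(k):
        validation = samples[i::k]
        training = [s for j, s in enumerate(samples) if j % k != i]
        model = train(training)
        correct = sum(predict(model, words) == task for words, task in validation)
        accuracies.append(correct / len(validation))
    return sum(accuracies) / k

# Hypothetical labeled learning data: (words in the record, assigned task).
samples = [
    (["mail", "send"], "communication"),
    (["chart", "analysis"], "analysis"),
    (["mail", "reply"], "communication"),
    (["analysis", "export"], "analysis"),
    (["mail", "inbox"], "communication"),
    (["chart", "analysis"], "analysis"),
]
accuracy = cross_validate(samples, k=3)
is_sufficient = accuracy >= 0.8  # hypothetical threshold for the S430 decision
```

Each fold trains only on the records not held out, so the reported accuracy reflects how the model handles data it has not seen, which is the property S430 is checking before the model is adopted in S440.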

In S440, the learning model generation unit 141 adopts the learning model generated in S420 as the updated learning model 142 for task classification.
This completes the flowchart of FIG. 4.

<About the learning model execution unit 143>
FIG. 5 is a flowchart illustrating the classification estimation process that the learning model execution unit 143 of the classification estimation unit 140 performs on input data acquired from the input unit 120.

In S510, the learning model execution unit 143 acquires, from the post-cleansing data 170, the learning data to be analyzed that was generated by the cleansing processing unit 130.

In S520, the learning model execution unit 143 uses the previously generated learning model 142 to estimate the task into which the acquired learning data to be analyzed is classified.
When the task is estimated, an evaluation value representing the degree of association between the learning data to be analyzed and each task to be classified is also output. One task may be estimated for one piece of input data, or one task may be estimated for a plurality of pieces of input data within a predetermined time or of a predetermined amount.

In S530, based on the evaluation values estimated in S520, the learning model execution unit 143 determines the task with the largest evaluation value as the task estimation result for the learning data to be analyzed, and outputs it as the classification estimation result 144.
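The selection in S530 amounts to taking the task with the maximum evaluation value. A minimal sketch follows; the task names and scores are hypothetical, and `classify` is an illustrative name rather than anything defined in the source.

```python
def classify(evaluation_values):
    """Return the task with the largest evaluation value, and that value."""
    task = max(evaluation_values, key=evaluation_values.get)
    return task, evaluation_values[task]

# Hypothetical evaluation values for one piece of data, as estimated in S520.
scores = {"Project A": 0.72, "Project B": 0.18, "Other": 0.10}
estimated_task, best_score = classify(scores)
```

Returning the winning score together with the task keeps the degree of association available to later stages that may want to judge how reliable the estimate is.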

As described above, in the present embodiment, the means that performs classification determination using a plurality of machine learning algorithms executes cleansing processing of the input data according to each machine learning algorithm.
This makes it possible to generate learning data suited to each machine learning algorithm. It also removes the need to collect input data separately for each of the machine learning algorithms, which helps reduce the time required to generate learning models and improve their accuracy.
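One way to picture cleansing per algorithm is a lookup from algorithm to cleansing function, applied to the same input data. The sketch below is an assumption throughout: the registry `CLEANSERS`, the algorithm names, and the two cleansing functions are hypothetical, and collapsing runs to a single record is just one simple instance of limiting consecutive appearances.

```python
def collapse_runs(targets):
    """Sequence-oriented cleansing: collapse runs of the same operation target."""
    cleaned = []
    for t in targets:
        if not cleaned or cleaned[-1] != t:
            cleaned.append(t)
    return cleaned

def keep_all(targets):
    """Placeholder cleansing that leaves the data unchanged."""
    return list(targets)

# Hypothetical registry: algorithm name -> cleansing function.
CLEANSERS = {
    "lstm": collapse_runs,
    "frequency_based": keep_all,
}

def cleanse_for_each_algorithm(targets):
    """Produce one cleansed data set per registered algorithm."""
    return {name: cleanse(targets) for name, cleanse in CLEANSERS.items()}

raw = ["mail", "analysis", "analysis", "analysis", "mail"]
post_cleansing = cleanse_for_each_algorithm(raw)
```

Keying the result by algorithm name mirrors the idea of storing cleansed data in association with the algorithm that will consume it, so a single set of input data serves every learning model.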

<Second embodiment>
The first embodiment provided a data cleansing process that generates learning data by cleansing the input data appropriately for each machine learning algorithm.
The second embodiment additionally provides a common data cleansing process that cleanses input data that is unsuitable for all of the machine learning algorithms alike.

The second embodiment is also described using, as an example, task classification with the personal computer operation log 310 shown in FIG. 3A, which was used in the first embodiment.
The personal computer operation log 310 shown in FIG. 3A contains records that include no valid field information (missing records) and records that appear in common across all tasks (duplicate records). A common data cleansing process is therefore performed to delete such missing records and duplicate records.

For example, the record whose ID 311 is a005 in FIG. 3A shows that a new message was created using the mail software, but the recipient and the subject are unknown. This record is therefore regarded as not valid and is deleted.
The cleansing process may also delete records that contain a predetermined word (keyword) designated in advance (designated records), or assign such designated records to a specific task.
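The deletion of missing records and designated records can be sketched as a single filter. The dictionary layout, the field names `subject` and `recipient`, the keyword, and the records h001 and h002 are hypothetical; only a005, a mail record with unknown recipient and subject, follows the example in the text.

```python
def common_cleanse(records, detail_fields, blocked_keywords=()):
    """Drop records with no usable detail fields or with a blocked keyword."""
    cleaned = []
    for rec in records:
        details = [rec.get(f) or "" for f in detail_fields]
        if not any(details):
            continue  # missing record: no valid field information
        text = " ".join(details).lower()
        if any(k.lower() in text for k in blocked_keywords):
            continue  # designated record: contains a keyword specified in advance
        cleaned.append(rec)
    return cleaned

log = [
    {"id": "h001", "app": "mail", "subject": "Weekly report", "recipient": "team"},
    {"id": "a005", "app": "mail", "subject": "", "recipient": None},
    {"id": "h002", "app": "browser", "subject": "Lunch menu", "recipient": None},
]
kept_ids = [r["id"] for r in common_cleanse(log, ("subject", "recipient"), ("lunch",))]
```

This sketch only deletes designated records. Assigning them to a specific task instead, which the text also allows, would replace the second `continue` with a relabeling step.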

When a task, which serves as the correct-answer label, is assigned to the personal computer operation log 310, a task is sometimes assigned collectively to all records within a certain period, for example, "9:00 to 10:00: Project A work". In practice, however, the worker often performs other operations during that period because of interruptions. A common cleansing process may detect and delete in advance records of work that is clearly unrelated to the assigned task.

The common data cleansing process is not limited to these examples and can take various forms, such as deleting records whose execution time is shorter than a predetermined time. As in the first embodiment, the assumed use cases and the data used are likewise not limited to those described.
As described above, in the second embodiment, performing the common cleansing process on the input data improves the accuracy of the learning models generated for all of the machine learning algorithms.
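The removal of records that appear in common across all tasks, mentioned for this embodiment, could look like the following sketch. Representing each record as a `(content, task)` pair and comparing contents by exact match are assumptions; the source does not say how "appearing in common" is detected.

```python
from collections import defaultdict

def drop_records_common_to_all_tasks(labeled_records):
    """Remove records whose content occurs under every task label."""
    tasks_by_content = defaultdict(set)
    for content, task in labeled_records:
        tasks_by_content[content].add(task)
    all_tasks = {task for _, task in labeled_records}
    return [
        (content, task)
        for content, task in labeled_records
        if tasks_by_content[content] != all_tasks
    ]

# Hypothetical labeled records: (record content, assigned task).
labeled = [
    ("open file explorer", "Project A"),
    ("edit design document", "Project A"),
    ("open file explorer", "Project B"),
    ("run data analysis", "Project B"),
]
remaining = drop_records_common_to_all_tasks(labeled)
```

A record that occurs under every task carries no information for telling the tasks apart, which is why removing it can help every algorithm. Note that with only one task label present, this rule would remove every record, so a real implementation would need a guard for that case.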

(Other embodiments)
The present invention can also be realized by a process in which a program that implements one or more functions of the above-described embodiments is supplied to a system or an apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more of the functions.
The present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device.
The present invention is not limited to the above-described embodiments. Various modifications (including organic combinations of the embodiments) are possible based on the spirit of the present invention, and they are not excluded from its scope. That is, all configurations that combine the above-described embodiments and their modifications are also included in the present invention.

110 common input data
120 input unit
130 cleansing processing unit
131 algorithm database
132 cleansing processing execution unit
140 classification estimation unit
141 learning model generation unit
142 learning model
143 learning model execution unit
144 classification estimation result
150 output determination unit
160 output unit
170 post-cleansing data

Claims

A learning data generation device comprising:
acquisition means for acquiring input data;
processing means for performing cleansing processing on the input data to generate learning data; and
generating means that includes a machine learning algorithm and generates a learning model based on the learning data,
wherein the generating means includes, as the machine learning algorithm, a plurality of machine learning algorithms of different types, and
the processing means performs a first cleansing process on the input data in correspondence with each of the plurality of machine learning algorithms.

The learning data generation device according to claim 1, wherein the processing means performs a different first cleansing process in accordance with the feature of each of the plurality of machine learning algorithms.

The learning data generation device according to claim 2, wherein the plurality of machine learning algorithms includes at least one algorithm whose feature focuses on one of an appearance frequency, an appearance pattern, and an appearance order of elements included in the input data.

The learning data generation device according to claim 3, wherein at least one of the plurality of machine learning algorithms includes one in which the feature focuses on an appearance frequency of an element included in input data.

The learning data generation device according to claim 3, wherein at least one of the plurality of machine learning algorithms includes one in which the feature focuses on an appearance pattern of an element included in input data.

The learning data generation device according to claim 3, wherein at least one of the plurality of machine learning algorithms includes one in which the feature focuses on an appearance order of elements included in input data.

The learning data generation device according to claim 3, wherein the element is a word.

The learning data generation device according to any one of claims 3 to 7, wherein the processing means performs, in addition to the first cleansing process, a second cleansing process common to all of the plurality of machine learning algorithms.

The learning data generation device according to claim 8, wherein the second cleansing process deletes input data that does not include valid field information.

The learning data generation device according to claim 8, wherein the second cleansing process deletes input data including an element specified in advance.

The learning data generation device according to any one of claims 8 to 10, wherein the input data is classified into one of a plurality of labels based on the learning model, and input data that appears in common across all of the plurality of labels is deleted.

The learning data generation device according to any one of claims 1 to 11, wherein the input data is information on a behavior of a worker.

The learning data generation device according to claim 12, wherein the action is a personal computer operation.

The learning data generation device according to any one of claims 1 to 13, wherein each input data is classified into one of a plurality of labels based on the learning model.

The learning data generation device according to claim 14, wherein the input data is classified into one of the plurality of labels based on an evaluation value indicating the degree of association between each input data and each label.

A learning data generation method comprising:
an acquisition step of acquiring input data;
a processing step of performing cleansing processing on the input data to generate learning data; and
a generating step of generating a learning model based on the learning data using a machine learning algorithm,
wherein, in the generating step, a plurality of machine learning algorithms of different types are provided as the machine learning algorithm, and
in the processing step, a first cleansing process is performed on the input data in correspondence with each of the plurality of machine learning algorithms.

A program for causing a computer to execute the learning data generation method according to claim 16.