JP2019101902A

JP2019101902A - Data processing apparatus, data processing method, and data processing program

Info

Publication number: JP2019101902A
Application number: JP2017234195A
Authority: JP
Inventors: 英裕最首; Eihiro Saishu
Original assignee: Groovenauts Inc
Current assignee: Groovenauts Inc
Priority date: 2017-12-06
Filing date: 2017-12-06
Publication date: 2019-06-24

Abstract

To provide a data processing apparatus, a data processing method, and a data processing program that can divide data in a short time while preventing unintended bias in data content between learning data and verification data. SOLUTION: A data processing apparatus 10 has an acquisition unit 12 that acquires data having a plurality of records, each including a plurality of fields corresponding to a plurality of items; an input unit 13 that receives an input representing selection of one of the plurality of items and a division ratio for the data; a dividing unit 15 that extracts the field corresponding to the selected item from each of the plurality of records and, on the basis of the extracted fields, divides the data in accordance with the division ratio into a plurality of data sets including learning data for machine learning and verification data; and a learning processing unit that carries out machine learning processing based on the learning data and verifies, on the basis of the verification data, a learning model generated by the machine learning processing. SELECTED DRAWING: Figure 1

Description

The present invention relates to a data processing apparatus, a data processing method, and a data processing program.

In recent years, research has progressed on having a learning model such as a neural network analyze data and execute a predetermined task (see, for example, Patent Document 1). A learning model is trained so that it learns the relationship between input and output from learning data and thereby improves the accuracy with which it executes the task. The learned model generated by this learning process can analyze new data and perform various tasks. For example, learned models are used to classify content by attribute and to predict product sales trends.

Patent Document 1: JP 2017-168057 A

A learned model may be generated through a learning process, in which learning data is used to make the learning model learn the relationship between input and output, and a verification process, in which verification data is used to verify whether the learned model can execute the task properly. For this reason, the data prepared for generating a learned model is sometimes divided into learning data and verification data before the learning process is performed.

In the past, data was sometimes divided while its content was checked by hand, so that neither the learning data nor the verification data would contain unintended bias in data content. However, the data used to generate a learned model is often enormous, and the division can take a long time. If, on the other hand, the data is divided without checking its content, the division can be done quickly, but unintended bias in data content may then arise between the learning data and the verification data, so that the learning model cannot be trained efficiently or the verification process yields inappropriate results.

Accordingly, an object of the present invention is to provide a data processing apparatus, a data processing method, and a data processing program that can divide data in a short time without causing unintended bias in data content between learning data and verification data.

A data processing apparatus according to one aspect of the present invention includes: an acquisition unit that acquires data having a plurality of records, each including a plurality of fields corresponding to a plurality of items; an input unit that accepts selection of one of the plurality of items and input of a division ratio for the data; a dividing unit that extracts the field corresponding to the selected item from each of the plurality of records and, based on the extracted fields, divides the data at the division ratio into a plurality of data sets including learning data and verification data for machine learning; and a learning processing unit that performs machine learning processing based on the learning data and verifies, based on the verification data, the learning model generated by the machine learning processing.

According to this aspect, the data used for machine learning is divided, based on the values of a specific field, into a plurality of data sets including learning data and verification data, so the data can be divided in a short time without causing unintended bias in data content between the learning data and the verification data. Consequently, so-called big data can be used as it is, without preparing data specially arranged for learning, to train the learning model efficiently, and the performance of the learned model can be verified properly in the verification process.

In the above aspect, the input unit may accept input of a field addition rule, the data processing apparatus may further include an adding unit that adds one non-empty field to each of the plurality of records according to the addition rule, and the data may be divided, based on the added fields, at the division ratio into a plurality of data sets including learning data and verification data for machine learning.

According to this aspect, even if the data does not already contain a field suitable as a basis for division, a suitable field can be newly added, so the data can be divided without causing unintended bias in data content between the learning data and the verification data.

In the above aspect, the acquisition unit may acquire a plurality of data sets, the input unit may accept, for the plurality of data sets, selection of an item that gives a combining condition and selection of items to be combined, and the data processing apparatus may further include a combining unit that, based on the fields corresponding to the item giving the combining condition, combines the plurality of data sets into a single data set containing the fields corresponding to the items to be combined.

According to this aspect, a plurality of separately prepared data sets can be combined into a single data set without carrying over items that do not need to be combined, so the body of data required for the learning and verification processes of the learning model can be generated easily.

In the above aspect, when the data contains an empty field, the input unit may accept selection of a completion method for the empty field, and the data processing apparatus may further include an editing unit that fills the empty field based on a statistic, computed according to the selected completion method, over the fields that share an item with the empty field.

According to this aspect, even when empty fields in the records would prevent proper learning or verification if the data were used as learning data or verification data, filling in the missing values allows the learning and verification processes to be performed properly without deleting the records that contain missing values.

In the above aspect, when the data type of the empty field is numeric, the statistic may be the mode, median, mean, maximum, or minimum of the fields sharing the item; when the data type of the empty field is a character type, the statistic may be the most frequent characters among the fields sharing the item.
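The completion methods listed above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function name `fill_empty`, the method names, the use of `None` for an empty field, and the reading of "most frequent characters" as the most frequent whole field value are all assumptions made for the example.

```python
from statistics import mean, median, mode

# Hypothetical mapping from a completion-method name to a statistic.
NUMERIC_METHODS = {
    "mode": mode,
    "median": median,
    "mean": mean,
    "max": max,
    "min": min,
}

def fill_empty(column, method="mode"):
    """Return a copy of `column` with empty fields (None) filled in."""
    present = [v for v in column if v is not None]
    if not present:
        return list(column)  # nothing to compute a statistic from
    if all(isinstance(v, (int, float)) for v in present):
        # Numeric column: use the selected statistic.
        fill = NUMERIC_METHODS[method](present)
    else:
        # Character-type column: assume the most frequent value is used.
        fill = mode(present)
    return [fill if v is None else v for v in column]

sales = fill_empty([100, None, 300, 100], method="median")
genre = fill_empty(["rock", None, "jazz", "rock"])
```

Because the statistic is computed only over non-empty fields of the same column, no record has to be dropped to obtain a complete data set.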

According to this aspect, empty fields can be filled with a statistic appropriate to the data type, and the data can be divided without causing unintended bias in data content between the learning data and the verification data.

In the above aspect, the input unit may accept selection of an item and selection of a data type, and the editing unit may change the data type of the fields corresponding to the selected item to the selected data type.

According to this aspect, when an error occurs because a data type is inappropriate, the error can be avoided by changing to an appropriate data type. In addition, having the data type selected from among the data types to which a change is possible prevents a change to a wrong data type.
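The idea of offering only the data types to which a change is possible can be sketched as below. The type names, the `CASTS` table, and both function names are hypothetical; the source does not say which data types the apparatus supports or how it checks convertibility.

```python
# Hypothetical set of selectable data types and their conversions.
CASTS = {"integer": int, "float": float, "string": str}

def changeable_types(column):
    """List the type names to which every field in `column` can be converted."""
    ok = []
    for name, cast in CASTS.items():
        try:
            for value in column:
                cast(value)
        except (TypeError, ValueError):
            continue  # at least one field cannot be converted to this type
        ok.append(name)
    return ok

def change_type(column, type_name):
    """Convert every field in `column`, refusing types that would fail."""
    if type_name not in changeable_types(column):
        raise ValueError(f"column cannot be changed to {type_name!r}")
    return [CASTS[type_name](value) for value in column]

year_options = changeable_types(["2015", "2016"])
genre_options = changeable_types(["rock", "jazz"])
years = change_type(["2015", "2016"], "integer")
```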

In the above aspect, the input unit may accept selection of an item and input of a replacement rule, and the editing unit may replace the fields corresponding to the selected item according to the replacement rule.

According to this aspect, replacing the numerical values or characters in fields according to a predetermined replacement rule makes it possible to convert units and reduce notational variation, so the learning and verification processes of the learning model can be performed more efficiently.
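As an illustration of the two uses named above, the sketch below applies a replacement rule to one item across all records. Representing a rule as a Python function, and the names `replace_fields`, `records`, and the sample values, are assumptions made for the example; the source does not specify how a replacement rule is expressed.

```python
def replace_fields(records, item, rule):
    """Apply `rule` (a function) to the field of `item` in every record."""
    return [{**rec, item: rule(rec[item])} for rec in records]

# Illustrative records with inconsistent notation in the "Genre" item.
records = [
    {"Genre": "Rock ", "Sales": 1200},
    {"Genre": "ROCK", "Sales": 800},
]

# Reducing notational variation: trim whitespace and lower-case the text.
normalized = replace_fields(records, "Genre", lambda s: s.strip().lower())

# Unit conversion: for example, units to thousands of units.
converted = replace_fields(normalized, "Sales", lambda v: v / 1000)
```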

In the above aspect, the input unit may accept selection of an item and input of an item name, and the editing unit may change the name of the selected item to the input item name.

According to this aspect, an item name can be changed to any item name, and when an error occurs because of the character code of the characters used in an item name, the error can be avoided by changing them to characters with an appropriate character code. The prepared data can therefore be used to train the learning model efficiently, and the performance of the learned model can be verified properly in the verification process.

In the above aspect, the input unit may accept selection of a target item and a horizontal-axis item, and the editing unit may create a graph with the target item on the vertical axis and the horizontal-axis item on the horizontal axis.

According to this aspect, the relationship between the selected target item and the horizontal-axis item can be displayed visually as a graph. Items that affect the target item can therefore be identified easily, and the items to be included in the learning data can be chosen appropriately.

A data processing method according to another aspect of the present invention includes: acquiring data having a plurality of records, each including a plurality of fields corresponding to a plurality of items; accepting selection of one of the plurality of items and input of a division ratio for the data; extracting the field corresponding to the selected item from each of the plurality of records and, based on the extracted fields, dividing the data at the division ratio into a plurality of data sets including learning data and verification data for machine learning; and performing machine learning processing based on the learning data and verifying, based on the verification data, the learning model generated by the machine learning processing.

A data processing program according to another aspect of the present invention causes a computer to function as: an acquisition unit that acquires data having a plurality of records, each including a plurality of fields corresponding to a plurality of items; an input unit that accepts selection of one of the plurality of items and input of a division ratio for the data; a dividing unit that extracts the field corresponding to the selected item from each of the plurality of records and, based on the extracted fields, divides the data at the division ratio into a plurality of data sets including learning data and verification data for machine learning; and a learning processing unit that performs machine learning processing based on the learning data and verifies, based on the verification data, the learning model generated by the machine learning processing.

According to the present invention, data can be divided in a short time without causing unintended bias in data content between learning data and verification data.

FIG. 1 is a functional block diagram of a data processing apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of data display by the data processing apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of accepting the addition of a column by the data processing apparatus according to the embodiment of the present invention.
FIG. 4 is a diagram showing an example of accepting the division of data by the data processing apparatus according to the embodiment of the present invention.
FIG. 5 is a flowchart showing the flow of machine learning processing executed by the data processing apparatus according to the embodiment of the present invention.
FIG. 6 is a diagram showing an example of accepting the combining of data by the data processing apparatus according to the embodiment of the present invention.
FIG. 7 is a flowchart showing the flow of machine learning processing executed by the data processing apparatus according to the embodiment of the present invention.
FIG. 8 is a diagram showing an example of accepting missing-value completion by the data processing apparatus according to the embodiment of the present invention.
FIG. 9 is a diagram showing an example of accepting a data type change by the data processing apparatus according to the embodiment of the present invention.
FIG. 10 is a flowchart showing the flow of machine learning processing executed by the data processing apparatus according to the embodiment of the present invention.
FIG. 11 is a diagram showing an example of a statistical information column and a data graph created by the data processing apparatus according to the embodiment of the present invention.

Embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, elements given the same reference numerals have the same or similar configurations.

FIG. 1 is a functional block diagram of a data processing apparatus 10 according to an embodiment of the present invention. The data processing apparatus 10 can communicate with a user terminal 20 and a database 30 via a communication network N, and performs machine learning processing based on data acquired from the database 30. In the present embodiment, the data processing apparatus 10 processes the data acquired from the database 30 based on information acquired from the user terminal 20. For example, the data processing apparatus 10 can divide the data acquired from the database 30 into learning data and verification data for machine learning and send the processed data back to the database 30. Of the divided data, the learning data is used for the machine learning processing performed by the data processing apparatus 10, and the verification data is used for the verification processing that verifies the performance of the learned model generated by the data processing apparatus 10. A communication unit 11 included in the data processing apparatus 10 is connected to the communication network N and, through it, to the user terminal 20 or the database 30.

Learning data is the data given to the learning model when the learning process is performed on it. Verification data is the data used to verify whether the output obtained when the learned model executes the task achieves the desired accuracy. For example, by using part of a product's sales trend data as learning data to create a learned model and the remaining data as verification data to verify the sales trends output by the learned model, it is possible to confirm whether the created learned model produces appropriate output.

The communication network N is a wired or wireless telecommunication line, for example the Internet. The communication network N may also be a LAN (Local Area Network) or the like.

The user terminal 20 is a terminal operated by a user and includes at least an arithmetic circuit, a communication interface, an input unit, a display unit, and a storage unit. The user terminal 20 may be, for example, a personal computer, a tablet, or a smartphone. The user enters instructions for data editing into the input unit while checking the content of the data displayed on the display unit of the user terminal 20. The entered instructions are sent via the communication network N to the communication unit 11 of the data processing apparatus 10.

The database 30 stores and manages the data used for machine learning. The database 30 may be provided, for example, as a cloud service. The database 30 transmits data to the data processing apparatus 10 based on instructions from the user terminal 20.

The data processing apparatus 10 is an apparatus that executes machine learning processing and includes the communication unit 11, an acquisition unit 12, an input unit 13, an adding unit 14, a dividing unit 15, a combining unit 16, and an editing unit 17. The data processing apparatus 10 acquires data stored in the database 30 and processes it based on the instructions received by the input unit 13. After processing the data, the data processing apparatus 10 executes machine learning processing using that data.

The data processed by the data processing apparatus 10 may be CSV data composed of a plurality of records each including a plurality of fields, or tabular data in which cells are arranged in rows and columns. The present embodiment is described using CSV data as an example. The CSV data has a second number of records, each including a first number of fields. The first number of fields in each record correspond to a first number of items, respectively. The first number and the second number are arbitrary natural numbers, so the data has at least one field and at least one record.

The set of fields stored at the same position in each record is called a column. For example, when each record contains three fields, the first fields of the records form one column, and the middle fields and the last fields likewise each form a column. The data can thus be said to include a first number of columns and a second number of rows: each row corresponds to a record, and each element of a column corresponds to a field. The fields that make up a column share a common item. For example, if one field has the item "Year", every field in the same column also has the item "Year".

Unless otherwise noted, the data editing processing in the present embodiment is described using data that contains a plurality of records of three fields each. That is, unless otherwise noted, the first number before any field is added is 3 and the second number is an arbitrary natural number. The field at the beginning of each record has the item "Year", the field in the middle has the item "Sales", and the field at the end has the item "Genre". The order of the fields in a record is arbitrary; for example, the middle field may have the item "Genre" and the last field the item "Sales". The last field may be the one that holds the item to be predicted. For example, when training a learning model that takes the fields of the items "Year" and "Genre" as input and outputs the field of the item "Sales", the order of the fields may be rearranged so that the last field has the item "Sales".
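The record, field, and column terminology above can be made concrete with a short sketch that reads CSV text using Python's standard `csv` module. The sample values and the variable names (`first_number`, `second_number`, `columns`) are illustrative assumptions, and the sketch assumes that the first line of the CSV holds the item names.

```python
import csv
import io

# Illustrative CSV data: three items (Year, Sales, Genre), three records.
csv_text = """Year,Sales,Genre
2015,120,rock
2016,95,jazz
2017,130,rock
"""

rows = list(csv.reader(io.StringIO(csv_text)))
item_names, records = rows[0], rows[1:]

first_number = len(item_names)   # number of fields per record (columns)
second_number = len(records)     # number of records (rows)

# A column is the set of fields at the same position in every record.
columns = {
    name: [record[i] for record in records]
    for i, name in enumerate(item_names)
}
```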

The data may include item names. An item name is the name attached to an item, and selecting an item name is equivalent to selecting the item. An item name may directly describe the content of the item or may differ from it, so any name can be attached to an item. In the present embodiment, the item "Year" has the item name "Year", the item "Sales" has the item name "Sales", and the item "Genre" has the item name "ジャンル" (Japanese for "genre"). If the data does not include item names, item names can be added. As shown in the item name column 40 of FIG. 2, the item names are displayed on the display unit of the user terminal 20, and the user can select a column while checking the item names when processing data.

The communication unit 11 communicates with the user terminal 20 and the database 30 via the communication network N. The communication unit 11 receives data processing instructions from the user terminal 20 and passes them to the input unit 13. The communication unit 11 also receives the data to be used for machine learning from the database 30 and passes it to the acquisition unit 12. Furthermore, the communication unit 11 acquires the data processed by the dividing unit 15, the combining unit 16, or the editing unit 17 of the data processing apparatus 10 and transmits it to the database 30.

The acquisition unit 12 acquires data from the communication unit 11 and passes it to the adding unit 14, the combining unit 16, or the editing unit 17. The communication unit 11 determines the destination of the data based on the content of the data processing instruction received by the input unit 13. For example, when the input unit 13 has accepted an instruction to combine data, the data is passed to the combining unit 16.

The input unit 13 receives data processing instructions from the communication unit 11. A data processing instruction includes, for example, the selection of an item, a division ratio for the data, or a field addition rule. Depending on its content, an instruction received by the input unit 13 is passed to the adding unit 14, the dividing unit 15, the combining unit 16, or the editing unit 17. For example, when an instruction to add a field is received, it is passed to the adding unit 14.

The adding unit 14 adds one non-empty field to each of the second number of records based on the field addition rule received from the input unit 13. The fields added by the adding unit 14 form a column. The field addition rule specifies the nature of the fields to be added; examples include sequential and random, but the rules are not limited to these. When the addition rule is sequential, fields with consecutive numerical values are added to the records; when it is random, fields with random or pseudo-random numbers are added. More specifically, when the addition rule is sequential, integer fields incremented or decremented by 1 from record to record may be added, and when it is random, fields with random or pseudo-random numbers in the range of 0 to 1 may be added. In other words, with the sequential rule adjacent records have fields with consecutive values, whereas with the random rule the values are not necessarily consecutive.
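The two addition rules described above can be sketched as follows. The function name `add_field`, the default field name `"added"`, the starting value 1 for the sequential rule, and the optional `seed` parameter are assumptions for illustration; the source only states that sequential values change by 1 per record and that random values lie between 0 and 1.

```python
import random

def add_field(records, rule, name="added", seed=None):
    """Return new records with one non-empty field appended to each."""
    rng = random.Random(seed)
    result = []
    for position, record in enumerate(records, start=1):
        if rule == "sequential":
            value = position          # integers incremented by 1 per record
        elif rule == "random":
            value = rng.random()      # pseudo-random number in [0, 1)
        else:
            raise ValueError(f"unknown addition rule: {rule!r}")
        result.append({**record, name: value})
    return result

records = [{"Year": 2015}, {"Year": 2016}, {"Year": 2017}]
sequential = add_field(records, "sequential")
randomized = add_field(records, "random", seed=0)
```

A field added with the random rule carries no relationship to the content of the record, which is what makes it a candidate basis for division when no existing field is suitable.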

The dividing unit 15 divides the data based on the instruction received from the input unit 13. As the data processing instruction, the dividing unit 15 acquires from the input unit 13 the item that serves as the basis for division and the division ratio, and divides the data at the division ratio based on the fields corresponding to that item. Data division is described later with reference to FIG. 4.
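One plausible reading of dividing the data at a division ratio based on the fields of a selected item is to order the records by that field and cut the ordered list at the ratio. The sketch below follows that reading only as an assumption; the procedure the apparatus actually uses is the one described with FIG. 4 and may differ. The name `split_by_field` and the field name `"rand"` are hypothetical.

```python
def split_by_field(records, item, ratio):
    """Split records into (learning, verification) by the value of `item`.

    `ratio` is the fraction of records assigned to the learning data.
    """
    if not 0 < ratio < 1:
        raise ValueError("ratio must be between 0 and 1")
    ordered = sorted(records, key=lambda record: record[item])
    cut = round(len(ordered) * ratio)
    return ordered[:cut], ordered[cut:]

# Illustrative records whose "rand" field was added with a random rule.
records = [
    {"Year": 2013, "rand": 0.91},
    {"Year": 2014, "rand": 0.12},
    {"Year": 2015, "rand": 0.55},
    {"Year": 2016, "rand": 0.33},
    {"Year": 2017, "rand": 0.78},
]
learning, verification = split_by_field(records, "rand", 0.8)
```

Splitting on a random field spreads the years across both data sets, whereas splitting on "Year" itself would place the most recent records in the verification data.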

The combining unit 16 combines a plurality of data sets based on the instruction received from the input unit 13. The combining unit 16 acquires from the input unit 13 the item that gives the combining condition and the items to be combined, acquires the plurality of data sets from the acquisition unit 12, and then, based on the fields corresponding to the item giving the combining condition, combines the data sets into a single data set containing the fields corresponding to the items to be combined. The combining of data is described later with reference to FIG. 6.
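The combining step can be sketched as a join on the item that gives the combining condition, keeping only the items selected for combination. The sketch assumes an inner join on equal field values, which the source does not state; the name `combine` and the sample data sets are hypothetical.

```python
def combine(left, right, key, targets):
    """Join two record lists on `key`, keeping only `key` and `targets`."""
    # Index the right-hand records by the value of the key item.
    index = {}
    for record in right:
        index.setdefault(record[key], []).append(record)

    combined = []
    for left_record in left:
        for right_record in index.get(left_record[key], []):
            merged = {**right_record, **left_record}
            combined.append({name: merged[name] for name in [key, *targets]})
    return combined

sales = [
    {"Year": 2016, "Sales": 95, "Genre": "jazz"},
    {"Year": 2017, "Sales": 130, "Genre": "rock"},
]
weather = [
    {"Year": 2017, "Temp": 16.1, "Station": "A"},
    {"Year": 2018, "Temp": 15.0, "Station": "A"},
]
combined = combine(sales, weather, "Year", ["Sales", "Temp"])
```

Items that were not selected, here "Genre" and "Station", do not appear in the combined data.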

The editing unit 17 acquires from the input unit 13 the selected item to be edited and the editing content, and edits the fields corresponding to the selected item according to the editing content. Examples of editing content include complementing (filling in) fields, changing a data type, replacing fields, and changing an item name. Data editing is described later with reference to FIGS. 8 and 9.

FIG. 2 is a diagram showing an example of data display by the data processing apparatus 10 according to the embodiment of the present invention. The data processing apparatus 10 displays the content of the data, column by column, on the display unit of the user terminal 20. The figure shows a data table 100 displayed on the display unit of the user terminal 20. The data table 100 shows the content of the data for each column and includes an item name column 40, a data type column 41, a unique value column 42, a missing value column 43, a column addition button 44, and a column editing button 45. The data content displayed in the data table 100 is not limited to what is shown in FIG. 2.

The item name column 40 displays item names. In this embodiment, the item name column 40 contains four cells, which display, from the top, "Item name", "Year", "Sales", and "Genre". The user can tell the columns apart by using the item names as labels. If the data does not include item names, provisional item names such as "columnA" and "columnB" may be displayed in the item name column 40. The user can change the item names freely.

The data type column 41 displays the data type of the fields that make up each column. In this embodiment, the data type column 41 displays, from the top, "Data type", "INTEGER", "FLOAT", and "STRING". Here, "INTEGER" denotes the integer type, "FLOAT" the floating-point type, and "STRING" the character (string) type. That is, the fields in the "Year" column are of the integer type, the fields in the "Sales" column are of the floating-point type, and the fields in the "Genre" column are of the character type. The data type of a field is not limited to the integer, floating-point, and character types, and may be any data type such as a Boolean type or a timestamp type.

Here, the integer type is a numeric type that can store integers. The floating-point type is a numeric type that can store real numbers. The character type is a data type that can store character strings. The Boolean type is a data type that can store a truth value (such as "true" or "false") and is used in condition checks and the like. The timestamp type is a data type that stores a time value composed of year, month, day, hour, minute, and second.

The unique value column 42 displays the unique value of each column. The unique value is the number of fields in a column when duplicate fields (fields holding the same numerical value or the same character string) are counted as one. For example, in a column of five fields, if two of the fields hold the same numerical value, the unique value is 4. The unique value column 42 shown in FIG. 2 displays, from the top, "Unique value", "50", "46", and "2"; that is, the unique value of the "Year" column is 50, that of the "Sales" column is 46, and that of the "Genre" column is 2.

The missing value column 43 displays information about missing values. A missing value is an empty field in which no data exists. The missing value column 43 displays the number of empty fields in a column and the ratio of the number of empty fields to the total number of fields in the column. The missing value column 43 shown in FIG. 2 displays, from the top, "1 (0.01%)", "1 (0.01%)", and "0 (0%)". That is, the "Year" column contains one empty field, which accounts for 0.01% of the fields in the column; the "Sales" column contains one empty field, which likewise accounts for 0.01%; and the "Genre" column contains no empty fields, so the ratio is 0%.
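The per-column figures shown in the unique value column 42 and the missing value column 43 can be computed as in the sketch below. The function name `summarize_column` and the dict-based record format are assumptions, as is the choice to treat `None`, an absent key, and the empty string as empty fields and to exclude them from the unique count; the source does not say how empty fields are counted there.

```python
def summarize_column(records, name):
    """Compute the unique value and missing-value statistics of one column.

    Assumption: a field is "empty" when the key is absent or the value is
    None or "". Empty fields are not counted toward the unique value.
    """
    values = [record.get(name) for record in records]
    present = [v for v in values if v is not None and v != ""]
    missing = len(values) - len(present)
    return {
        "unique": len(set(present)),   # duplicates are counted once
        "missing": missing,
        "missing_ratio": missing / len(values) if values else 0.0,
    }


# Five fields, two of which hold the same value: the unique value is 4.
sales = [{"Sales": 1.0}, {"Sales": 2.0}, {"Sales": 2.0},
         {"Sales": 3.0}, {"Sales": 4.0}]
print(summarize_column(sales, "Sales"))
```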

The column addition button 44 is a button for adding a new column to the data. When the column addition button 44 is selected, the column addition menu 50 shown in FIG. 3 is displayed on the display unit of the user terminal 20. Column addition is described later with reference to FIG. 3. The column editing button 45 is a button for editing the fields included in a column. When the column editing button 45 is selected, a column editing menu 80 such as that shown in FIG. 8 is displayed. Column editing is described later with reference to FIG. 8.

FIG. 3 is a diagram showing an example of how the data processing apparatus according to the embodiment of the present invention accepts a column addition. The figure shows a column addition screen 200 displayed on the display unit of the user terminal 20. When the column addition button 44 is selected, the column addition menu 50 is displayed, and the content of the column to be added by the adding unit 14 can be specified through the column addition menu 50. The column addition menu 50 includes an addition rule selection unit 51, an item name input unit 52, a cancel button 53, and an execution button 54.

The addition rule selection unit 51 displays the options "Sequential value" and "Random value". When the user selects "Sequential value", a column of consecutive values is added; when the user selects "Random value", a column of irregular values is added. The term "value" here covers numerical values, character strings, symbols, and other values. For example, an irregular value may be a real number selected at random from a predetermined numerical range, an integer selected at random from a predetermined numerical range, or a numerical value selected at random from several candidate values. A column of irregular dates, times, or the like may also be added as irregular values. In the addition rule selection unit 51 shown in FIG. 3, "Random value" is selected, so a column of irregular numerical values (including pseudorandom numbers), for example, is added. The item name of the column to be added can be entered in the item name input unit 52. The figure shows that the item name of the column to be added is "Add". The options shown in FIG. 3 are examples, and the added column is not limited to a column of numerical values. For example, the data processing apparatus 10 may display an option such as "Random characters" in the addition rule selection unit 51 and, when the user selects that option, add a column of irregular character strings. A column of letters of the alphabet, or a column of colors selected regularly or irregularly from a plurality of colors, may also be added. In addition, when "Sequential value" or "Random value" is selected in the addition rule selection unit 51 of FIG. 3, the user may be allowed to select or specify the type of value to be added (numerical value, character string, symbol, and so on).

The cancel button 53 is a button for canceling the column addition; selecting it closes the column addition menu 50. The execution button 54 is a button for carrying out the column addition; when it is selected, the adding unit 14 of the data processing apparatus 10 adds a column based on the content entered in the column addition menu 50.

FIG. 4 is a diagram showing an example of how the data processing apparatus according to the embodiment of the present invention accepts a data division. The figure shows a data division screen 300 displayed on the display unit of the user terminal 20. The data division screen 300 includes a reference item selection unit 60, a sort condition selection unit 61, a division number input unit 62, divided data name input units 63, and division ratio input units 64.

The reference item selection unit 60 accepts the selection of the item that serves as the reference for data division. The dividing unit 15 divides the data based on the fields of the reference item. For example, when data containing a column of integer fields from 1 to 100 is to be divided at a ratio of 8 to 2, that column's item is chosen as the reference item, and the data is divided into the records whose field holds an integer from 1 to 80 and the records whose field holds an integer from 81 to 100. Likewise, when data containing a column of random-number fields from 0 to 1 is to be divided at a ratio of 8 to 2, that column's item is chosen as the reference item, and the data is divided into the records whose field is at least 0 and less than 0.8 and the records whose field is at least 0.8 and at most 1. However, the data does not necessarily contain a column suited to serving as such a division reference. For this reason, the data processing apparatus 10 according to the present embodiment allows the adding unit 14 to add a column suited to serving as the division reference. In this embodiment, "Add" is displayed in the reference item selection unit 60 of FIG. 4, so the data is divided with reference to the "Add" column added by the adding unit 14.
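The value-based division in the two examples above can be sketched as a simple threshold test on the reference field. The function name `split_by_value` and the dict-based records are assumptions for illustration. With sequential integers 1 to 100 and a boundary of 81, the split is exactly 80 to 20; with uniform random values and a boundary of 0.8, the split is only approximately 8 to 2, because the number of values that happen to fall below 0.8 varies from run to run.

```python
def split_by_value(records, field, boundary):
    """Divide records into (below, at_or_above) by comparing `field`
    with `boundary`. Records keep their original relative order."""
    below = [r for r in records if r[field] < boundary]
    at_or_above = [r for r in records if r[field] >= boundary]
    return below, at_or_above


# A column of integers 1..100 used as the reference item.
data = [{"Add": i, "Sales": float(i)} for i in range(1, 101)]
learning, verification = split_by_value(data, "Add", 81)
print(len(learning), len(verification))
```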

The sort condition selection unit 61 accepts the sort condition for the fields. For example, when the sort condition selection unit 61 is selected, the options "Ascending" and "Descending" are displayed, and the user can select either one. In the sort condition selection unit 61 shown in FIG. 4, "Ascending" is selected, so the dividing unit 15 rearranges the records so that the fields of the "Add" column are in ascending order. In this embodiment, a column of random numerical values (item name "Add") is added and the records are then sorted on the added column, so the records of the original data end up shuffled.

The column used as the sort reference is not limited to an added column; the sort can also be performed on a column of the original data, such as "Year" or "Sales". For example, by sorting on the "Year" column and then dividing the data, the data can be divided into data from older years and data from newer years. Sorting the items in the data also makes it easier to check the data content when the CSV data is output to a display or to paper. Sorting is not required when dividing, and whether to sort is optional.

The division number input unit 62 accepts the number of parts into which the data is to be divided. In the division number input unit 62 shown in FIG. 4, "2" has been entered, so the dividing unit 15 divides the data into two. The number entered in the division number input unit 62 is not limited to 2 and may be any value. For example, the data may be divided into three, with two of the parts used as learning data and verification data for creating a learned model and the remaining part used as comparison data for showing how closely the output of the learned model matches reality. Alternatively, the data may be divided into three, with one part used as validation data for tuning the hyperparameters of the learning model and the remaining two parts used as learning data and verification data for creating the learned model.
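Dividing already-ordered records into an arbitrary number of parts at given ratios can be sketched with cumulative cut positions. The function name `split_by_ratio` and the 6:2:2 ratio in the usage lines are assumptions chosen for illustration; the source only states that the number of parts and the ratios are entered by the user.

```python
def split_by_ratio(records, ratios):
    """Divide `records` (in their current order) into len(ratios) parts.

    Cut positions are computed from cumulative ratios, so every record
    lands in exactly one part even when the sizes do not divide evenly.
    """
    total = sum(ratios)
    parts, start, cumulative = [], 0, 0
    for ratio in ratios:
        cumulative += ratio
        end = round(len(records) * cumulative / total)
        parts.append(records[start:end])
        start = end
    return parts


rows = [{"id": i} for i in range(10)]
learning, validation, verification = split_by_ratio(rows, [6, 2, 2])
print(len(learning), len(validation), len(verification))
```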

The divided data name input unit 63 accepts the name of each data after division, and the division ratio input unit 64 accepts the division ratio of the data. As many divided data name input units 63 and division ratio input units 64 are displayed as the number entered in the division number input unit 62. As shown in the figure, the original data is divided at a ratio of 8 to 2, and the data after division are named "learning data" and "verification data", respectively.

FIG. 5 is a flowchart showing the flow of machine learning processing executed by the data processing apparatus according to the embodiment of the present invention. First, based on an instruction from the user terminal 20, the acquisition unit 12 acquires data for machine learning from the database 30 via the communication unit 11 (S10). Once the data has been acquired, the data processing apparatus 10 displays the data division screen 300 shown in FIG. 4 on the display unit of the user terminal 20 and accepts the user's instruction on how the acquired data should be divided for learning. At this point, the user may divide the data using an existing field of the acquired data, such as the sales year (Year) or the sales amount (Sales), or may add a new field and divide the data based on that field. When the data is to be divided without using an existing field, the data processing apparatus 10 displays the column addition screen 200 shown in FIG. 3 on the display unit of the user terminal 20, based on an instruction from the user terminal 20. The input unit 13 then accepts the addition rule and the item name of the column to be added (S11). Based on the accepted input, the adding unit 14 adds a new field to the data acquired from the database 30 (S12). For example, a column of random numerical values with the item name "Add" is added.

After the new field has been added, the data processing apparatus 10 displays the data division screen 300 shown in FIG. 4 on the display unit of the user terminal 20 again and accepts the user's instruction on how the acquired data should be divided for learning. The input unit 13 accepts the selection of the item name of the field that serves as the reference for dividing the data, the selection of the sort condition for the fields corresponding to the selected item name, the number of divisions, the names of the divided data, and the division ratio (S13). The dividing unit 15 rearranges the data so that the fields corresponding to the selected item name are ordered according to the selected sort condition (S14), divides the data with reference to the column of the selected item name so as to obtain the entered number of divisions and division ratio, and names the divided data (S15). In this example, the dividing unit 15 rearranges the records so that the added "Add" column is in ascending order and divides the data at a ratio of 8 to 2 with reference to the "Add" column. After the division, the data that makes up 80% of the original data is named "learning data" and the data that makes up 20% is named "verification data". This completes the processing by the data processing apparatus 10. Thereafter, the data processing apparatus 10 performs machine learning processing based on the divided data (S16). For example, a learning model is generated using the data divided off as learning data and is then verified using the data divided off as verification data, and a learning model verified to have a predetermined accuracy can be adopted. Performing machine learning processing here means either that the data processing apparatus 10 itself executes the machine learning processing, or that the data processing apparatus 10 supports the machine learning processing while an external machine learning server (not shown) executes it. In the former case, the data processing apparatus 10 may include a learning processing unit that performs machine learning processing and verification processing, and the learning processing unit may execute the machine learning processing. In the latter case, for example, the learning data and the verification data may be passed from the data processing apparatus 10 to the external machine learning server via a network, the machine learning processing may be executed on that server, and the data processing apparatus 10 may then receive the generated learning model.
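Steps S12 to S15 of the flow described above can be sketched end to end: add a random "Add" column, sort on it, divide by position at the given ratio, and name the parts. The function name `shuffle_split`, its parameters, and the dict-based records are assumptions for illustration and are not taken from the patent; the machine learning step S16 is outside the scope of this sketch.

```python
import random


def shuffle_split(records, ratios, names, key="Add", descending=False, seed=None):
    """Sketch of S12-S15: add a random column, sort on it, divide, name."""
    rng = random.Random(seed)
    # S12: add a pseudorandom field in [0.0, 1.0) to every record.
    keyed = [{**record, key: rng.random()} for record in records]
    # S14: sort on the added column; this shuffles the original order.
    keyed.sort(key=lambda record: record[key], reverse=descending)
    # S15: cut the sorted records at cumulative ratio positions and
    # attach the user-supplied name to each part.
    total = sum(ratios)
    parts, start, cumulative = {}, 0, 0
    for name, ratio in zip(names, ratios):
        cumulative += ratio
        end = round(len(keyed) * cumulative / total)
        parts[name] = keyed[start:end]
        start = end
    return parts


# Records ordered by year; a positional 8:2 cut without shuffling would
# put only the most recent years into the verification data.
source = [{"Year": 1968 + i, "Sales": 10.0 * i} for i in range(50)]
parts = shuffle_split(source, (8, 2), ("learning data", "verification data"), seed=1)
print(len(parts["learning data"]), len(parts["verification data"]))
```

Dividing by position after the sort, rather than by comparing the random value with 0.8, yields the requested 8 to 2 ratio exactly (up to rounding) regardless of which random values were drawn.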

According to the present embodiment, the data used for machine learning is divided, based on the values of a specific field, into a plurality of data including learning data and verification data. This makes it possible to divide the data in a short time in such a way that no bias in data content arises between the learning data and the verification data.

Further, according to the present embodiment, even when the data does not already contain a field suited to serving as the reference for data division, adding a field of sequential values, random numerical values, or the like to each record as described above makes it possible to divide the data so that no unintended bias in data content arises between the learning data and the verification data. As a result, the learning processing of a learning model can be performed efficiently using so-called big data as it is, without preparing data specially arranged for learning, and the performance of the learned model can be verified properly in the verification processing.

FIG. 6 is a diagram showing an example of how the data processing apparatus 10 according to the embodiment of the present invention accepts a data combination. When the data that a learning model is to learn from exists as several separate data, running the machine learning processing repeatedly on each of them is inefficient. A plurality of data may therefore be combined so that the machine learning processing can be performed efficiently on a single consolidated data.

The figure shows a data combination screen 400 displayed on the display unit of the user terminal 20. The data combination screen 400 includes a first combining condition selection unit 70a, a second combining condition selection unit 70b, a first combination target selection unit 71a, a second combination target selection unit 71b, and a combination result display unit 72. Data combination is described below using the case of combining two data (hereinafter, Data1 and Data2) as an example.

The first combining condition selection unit 70a displays the item names of the items included in Data1, and the user can select the item that gives the combining condition. An item that gives the combining condition is also selected from Data2. Here, the item that gives the combining condition is the item that serves as the reference for combining the data; normally, an item that the data to be combined have in common is selected. The fields corresponding to the items that give the combining condition are merged in the combined data and form a single column. As shown in FIG. 6, the item "Year" is selected in the first combining condition selection unit 70a as the item that gives the combining condition.

The second combining condition selection unit 70b has the same configuration as the first combining condition selection unit 70a, and the user can select the item that gives the combining condition. As shown in FIG. 6, the item "Year" is selected in the second combining condition selection unit 70b as the item that gives the combining condition.

When data are combined, the fields of the items that give the combining condition, selected from the plurality of data, are compared with one another to check whether fields with common content exist. When such fields exist, the records that have those common fields are merged into a single record. Here, "common content" covers not only the case where the numerical values or characters of the fields match exactly but also the case where the fields represent the same content. For example, for fields representing a year, if the field in one data expresses the year as a Japanese era name and the field in another data expresses the same year in the Western calendar, the two fields may be judged to have common content. A record that has no field with common content may be included in the combined data without being merged with any other record. In this embodiment, the item "Year" is selected from both Data1 and Data2 as the item that gives the combining condition. Therefore, when Data1 and Data2 are combined, the "Year" fields of the two data are compared, and records whose fields have common content are merged into a single record. Records may be merged, for example, by adding the selected fields of Data2 other than "Year" to the record of Data1.
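The combination described above, including the case where one data writes the year as an era name, can be sketched as a join on a normalized key. Everything named here is an assumption for illustration: the functions `normalize_year` and `join_records`, the romanized "Heisei 29" input format, and the rule that a later duplicate key in the second data overwrites an earlier one. The era offsets follow from Showa 1 = 1926, Heisei 1 = 1989, and Reiwa 1 = 2019.

```python
ERA_OFFSETS = {"Showa": 1925, "Heisei": 1988, "Reiwa": 2018}


def normalize_year(value):
    """Map 2017, "2017", or "Heisei 29" to the integer 2017."""
    if isinstance(value, int):
        return value
    parts = str(value).split()
    if len(parts) == 2 and parts[0] in ERA_OFFSETS:
        return ERA_OFFSETS[parts[0]] + int(parts[1])
    return int(str(value).strip())


def join_records(left, right, key, right_fields):
    """Merge records whose normalized `key` fields match; keep the rest."""
    index = {normalize_year(r[key]): r for r in right}  # last duplicate wins
    merged, seen = [], set()
    for record in left:
        k = normalize_year(record[key])
        row = {**record, key: k}
        match = index.get(k)
        for field in right_fields:
            row[field] = match.get(field) if match else None
        merged.append(row)
        seen.add(k)
    # Records of `right` with no counterpart are kept without being merged.
    for k, record in index.items():
        if k not in seen:
            merged.append({key: k, **{f: record.get(f) for f in right_fields}})
    return merged


data1 = [{"Year": 2017, "Sales": 1.5, "Genre": "A"}]
data2 = [{"Year": "Heisei 29", "Name": "X"}, {"Year": 2018, "Name": "Y"}]
print(join_records(data1, data2, "Year", ["Name"]))
```

The merged data holds a single "Year" column, which mirrors the statement that the fields of the combining-condition items are merged into one column.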

The first combination target selection unit 71a displays the name of the data to be combined, its item names, and a check box for each item name. The first combination target selection unit 71a shown in FIG. 6 displays "Data1" as the data name and "Year", "Sales", and "Genre" as the item names, with a check box provided for each item name. By checking the check boxes, the user can select the items to be included in the combined data. The items included in the combined data are called the items to be combined. The figure shows that the combined data will include the items "Year", "Sales", and "Genre".

The second combination target selection unit 71b has the same configuration as the first combination target selection unit 71a. The second combination target selection unit 71b shown in FIG. 6 displays "Data2" as the data name and "Year" and "Name" as the item names, and all of the check boxes provided for the item names are checked. The figure shows that the combined data will include the items "Year" and "Name".

The combination result display unit 72 shows the name and the item names of the combined data. The combination result display unit 72 shown in FIG. 6 indicates that the name of the combined data is "Data1-Data2-merge" and that the item names of the items it contains are "Year", "Sales", "Genre", and "Name". The fields corresponding to the item name "Year", which existed in both Data1 and Data2, are merged in the combined data and form a single column. The combination result display unit 72 therefore displays the item name "Year" only once.

FIG. 7 is a flowchart showing the flow of machine learning processing executed by the data processing apparatus according to the embodiment of the present invention. First, based on an instruction from the user terminal 20, the acquisition unit 12 acquires a plurality of data for machine learning (S20). At this point, the user may have the data processing apparatus 10 acquire, for example, two data, Data1 and Data2, as the data for machine learning. Once the data have been acquired, the data processing apparatus 10 displays the data combination screen 400 shown in FIG. 6 on the display unit of the user terminal 20 and accepts the user's instruction on how the acquired data should be combined for learning. The input unit 13 then accepts the selection of the item names of the items that give the combining condition (S21). At this point, the user may select, for example, the item "Year" from both Data1 and Data2 as the item that gives the combining condition. The input unit 13 also accepts the selection of the items to be combined (S22). At this point, the user may select, for example, the items "Year", "Sales", and "Genre" from Data1 and the items "Year" and "Name" from Data2 as the items to be combined.

The combining unit 16 compares the fields of the selected items across the plurality of pieces of data, integrates records that contain a common field, and combines the plurality of pieces of data so that the fields of the items to be combined are contained in a single piece of data (S23). In this example, the combining unit 16 compares the "Year" fields contained in Data1 and Data2, integrates the records that share the same year, and combines Data1 and Data2 so that the fields of the items "Year", "Sales", "ジャンル" and "Name" are contained in a single piece of data. The data processing apparatus 10 then performs machine learning processing based on the combined data (S24). This completes the processing by the data processing apparatus 10. The number of pieces of data to be combined is not limited to two; any number of pieces of data can be combined.
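The combining step S23 can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function name `merge_on_key`, the representation of records as dictionaries, the extra "Memo" item, and the choice to keep only records whose key appears in both data sets (an inner join) are all assumptions made for the example.

```python
def merge_on_key(data1, data2, key, keep1, keep2):
    """Combine two lists of dict records on `key`, keeping only the selected items.

    Assumption: only records whose key value occurs in both inputs are kept.
    """
    # Index the second data set by the field of the combining-condition item.
    index = {}
    for rec in data2:
        index.setdefault(rec[key], []).append(rec)

    merged = []
    for rec1 in data1:
        for rec2 in index.get(rec1[key], []):
            row = {name: rec1[name] for name in keep1}
            # The shared key is stored once because dict keys are unique.
            row.update({name: rec2[name] for name in keep2})
            merged.append(row)
    return merged


# Hypothetical sample data; "Memo" stands for an item the user chose not to combine.
data1 = [
    {"Year": 2015, "Sales": 120, "ジャンル": "A", "Memo": "x"},
    {"Year": 2016, "Sales": 150, "ジャンル": "B", "Memo": "y"},
]
data2 = [
    {"Year": 2016, "Name": "Foo"},
    {"Year": 2017, "Name": "Bar"},
]
merged = merge_on_key(data1, data2, "Year",
                      ["Year", "Sales", "ジャンル"], ["Year", "Name"])
print(merged)
```

In this sketch "Year" appears as a single item in each combined record, and the unselected "Memo" item is left out of the result.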

According to the present embodiment, a plurality of separately prepared pieces of data can be combined into a single piece of data without carrying over items that do not need to be combined, so that a consolidated data set required for the learning processing and verification processing of a learning model can be generated easily.

FIG. 8 is a diagram showing an example of how the data processing apparatus 10 according to the embodiment of the present invention accepts missing value complementation. If data containing missing values are used for machine learning processing, the learning processing may not be performed properly; for example, the performance of the generated trained model may degrade. Likewise, if data containing missing values are used for verification processing, the verification processing may not be performed properly; for example, the performance of the trained model may not be verified accurately. To perform machine learning processing or verification processing properly with data that contain no missing values, the missing values therefore need to be complemented. FIG. 8 shows a data editing screen 500 displayed on the display unit of the user terminal 20. On the data editing screen 500, a column editing button 45 is displayed for each column. The user selects the column to be edited by selecting one of the column editing buttons 45 provided for the respective columns. When a column editing button 45 is selected, a column editing menu 80 containing a plurality of options is displayed. The column editing menu 80 shown in FIG. 8 displays the options "Missing values", "Change data type", "Replace fields", "Change item name", "Delete column" and "Graph display".

When the option "Missing values" is selected in the column editing menu 80, a missing value processing menu 81 is displayed, and the missing values contained in the data can be processed. The missing value processing menu 81 displays the options "Delete rows" and "Complement missing values". When "Delete rows" is selected, the records containing missing values are deleted, so that the column no longer contains missing values.

When the option "Complement missing values" is selected, a complement menu 82 is displayed, and the fields containing missing values can be complemented with the value chosen in the complement menu 82. The complement menu 82 displays the options "Mode", "Median", "Average", "Maximum", "Minimum", "Adjacent value" and "Arbitrary value". When "Mode" is selected, missing values are complemented with the value that occurs most often in the column. When "Median" is selected, they are complemented with the value located in the middle when the values in the column are arranged in ascending order. When "Average" is selected, they are complemented with the sum of the values in the column divided by the number of values in the column. When "Maximum" is selected, they are complemented with the largest value in the column. When "Minimum" is selected, they are complemented with the smallest value in the column. When "Adjacent value" is selected, each missing value is complemented with the value of the same column's field in the record before or after the record containing the missing value. When "Arbitrary value" is selected, the user can enter any value, and the missing values are complemented with the entered value.
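The complementation options described above can be sketched as follows. This is an illustrative sketch only: the function name `impute`, the method keys, and the use of `None` to mark a missing field are assumptions. The "Adjacent value" behaviour is also an assumption, since the text does not say which neighbour takes priority; here the preceding record is used first, and the following record is used only when there is no preceding value.

```python
import statistics


def impute(values, method, custom=None):
    """Return a copy of `values` in which None entries are filled by `method`."""
    if method == "neighbor":
        filled = list(values)
        last = None
        for i, v in enumerate(filled):            # take the preceding value
            if v is None:
                filled[i] = last
            else:
                last = v
        nxt = None
        for i in range(len(filled) - 1, -1, -1):  # leading gaps: take the following value
            if filled[i] is None:
                filled[i] = nxt
            else:
                nxt = filled[i]
        return filled

    present = [v for v in values if v is not None]
    chooser = {
        "mode": statistics.mode,
        "median": statistics.median,
        "mean": statistics.mean,
        "max": max,
        "min": min,
    }
    fill = custom if method == "custom" else chooser[method](present)
    return [fill if v is None else v for v in values]


sales = [100, None, 300, 100, None]   # hypothetical "Sales" column
print(impute(sales, "mode"))
print(impute(sales, "max"))
print(impute(sales, "neighbor"))
print(impute(sales, "custom", 0))
```

The statistic is computed from the non-missing fields of the same column only, which matches the description of complementing a column from its own values.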

When the data type of the fields constituting a column that contains missing values is the character type, the complement menu 82 displays the option "Most frequent character". When "Most frequent character" is selected, the missing values are complemented with the character that occurs most often in the column. In this way, the complement menu 82 displays the complementation methods suited to the data type of the fields constituting the column.

According to the present embodiment, even when records contain empty fields and the data therefore cannot be used properly as learning data or verification data, complementing the missing values allows the learning processing and verification processing to be performed properly without deleting the records that contain them. In addition, empty fields can be complemented with a statistic appropriate to the data type, and the data can be divided so that no unintended bias in data content arises between the learning data and the verification data.

When the option "Change data type" is selected in the column editing menu 80, the data type of the fields constituting the column can be changed. Changing the data type is described later with reference to FIG. 9.

When the option "Replace fields" is selected in the column editing menu 80, a replacement rule can be set, and the fields constituting the column can be replaced with other fields in accordance with the replacement rule. For example, when a column consists of sales amounts expressed in Japanese yen, setting the yen-to-US-dollar conversion rate as the replacement rule and executing the replacement produces sales amounts expressed in US dollars. As another example, when the same object is written under several different names, the names can be unified to eliminate the notational variation. According to the present embodiment, replacing the numerical values or characters of fields in accordance with a predetermined replacement rule makes it possible to convert units and reduce notational variation, so that the learning processing and verification processing of the learning model can be performed more efficiently.
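The two kinds of replacement rule mentioned above (unit conversion and name unification) can be sketched as plain functions applied to one column. Everything named here is hypothetical: `replace_fields`, the exchange rate `JPY_PER_USD`, the "City" item, and the `ALIASES` table are illustrative assumptions, not values taken from the source.

```python
def replace_fields(records, item, rule):
    """Return new records in which the field of `item` is replaced by rule(field)."""
    return [{**rec, item: rule(rec[item])} for rec in records]


JPY_PER_USD = 110.0  # hypothetical conversion rate, for illustration only


def yen_to_usd(yen):
    """Replacement rule: convert an amount in yen to US dollars."""
    return round(yen / JPY_PER_USD, 2)


# Hypothetical alias table used to unify different spellings of one name.
ALIASES = {"NYC": "New York", "New York City": "New York"}


def unify_name(name):
    """Replacement rule: map known aliases to one canonical spelling."""
    return ALIASES.get(name, name)


rows = [
    {"Sales": 11000, "City": "NYC"},
    {"Sales": 22000, "City": "New York"},
]
rows = replace_fields(rows, "Sales", yen_to_usd)
rows = replace_fields(rows, "City", unify_name)
print(rows)
```

Expressing a rule as a function of a single field keeps the replacement step independent of what the rule does, so the same helper serves both examples.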

When the option "Change item name" is selected in the column editing menu 80, the item name can be changed to any name. For example, the item name "ジャンル" shown in FIG. 2 can be changed to "Genre". When an error occurs because an item name is written in double-byte characters such as kana or kanji, the error can be avoided by changing the item name to single-byte characters such as Latin letters or digits.

According to the present embodiment, an item name can be changed to any item name, and when an error is caused by the character encoding of the characters used in an item name, the error can be avoided by changing the name to characters in an appropriate encoding. As a result, the learning processing of the learning model can be performed efficiently with the prepared data, and the performance of the trained model can be verified properly in the verification processing.

When the option "Delete column" is selected in the column editing menu 80, the column can be deleted from the data. In FIG. 8, the column with the item name "Sales" is selected, so if "Delete column" were selected in the column editing menu 80, the "Sales" column would be deleted.

When the option "Graph display" is selected in the column editing menu 80, statistical information on the selected column and the data contained in the column can be displayed as a graph. The graph display is described later with reference to FIG. 11.

FIG. 9 is a diagram showing an example of how the data processing apparatus according to the embodiment of the present invention accepts a data type change. FIG. 9 shows a data type change screen 600 displayed on the display unit of the user terminal 20. When the option "Change data type" is selected in the column editing menu 80, a data type selection menu 83 is displayed. The data type selection menu 83 displays "Character type", "Integer type", "Floating point type", "Boolean type" and "Timestamp type" as candidate data types after the change. For example, when the character type is selected in the data type selection menu 83 of FIG. 9, the data type of the fields constituting the "Year" column is changed from the integer type to the character type. The data type selection menu 83 may display only the data types to which a change is possible.

For example, suppose that an error occurs when data are input to a learning model because a column whose data type is the integer type contains a field with no numerical value (an empty field). In this case, the error can be resolved by changing the data type from the integer type to the character type, so that the field with no numerical value is replaced by a field containing zero characters (that is, a field with no character entry).
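A minimal sketch of this integer-to-character conversion is shown below. The function name `to_text_column` and the use of `None` to represent an empty field are assumptions made for illustration.

```python
def to_text_column(values):
    """Convert an integer column to text; an empty field becomes a zero-length string."""
    return ["" if v is None else str(v) for v in values]


years = [2015, None, 2017]        # hypothetical "Year" column with one empty field
text_years = to_text_column(years)
print(text_years)                 # ['2015', '', '2017']
```

After the conversion every field holds a value of the same type, so a consumer that rejects an empty field in an integer column no longer encounters one.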

According to the present embodiment, when an error is caused by an inappropriate data type, the error can be avoided by changing to an appropriate data type. In addition, having the user choose from among the data types to which a change is possible prevents a change to an incorrect data type.

FIG. 10 is a flowchart showing the flow of machine learning processing executed by the data processing apparatus according to the embodiment of the present invention. FIG. 10 shows the flow of missing value complementation, data type change, field replacement, item name change and machine learning processing performed by the data processing apparatus 10. First, information on missing values is displayed on the display unit of the user terminal 20 (S30). For example, as shown in the missing value column 43 of FIG. 8, the number and proportion of fields containing missing values may be displayed on the display unit of the user terminal 20. Then, when the input unit 13 receives a column selection and a missing value complementation method (S31: Yes), statistics are computed over the fields of the selected column to obtain the value with which the missing values are to be complemented (S32). For example, the data processing apparatus 10 may display a data editing screen 500 such as that shown in FIG. 8 on the display unit of the user terminal 20 and accept an instruction from the user on which method to use for complementing the missing values. As shown in FIG. 8, when the "Sales" column is selected and the option "Mode" is selected from the complement menu 82, the editing unit 17 computes the mode of the fields in the "Sales" column and then complements the missing values with that mode.

When the missing value complementation is complete, or when the input unit 13 has not received an instruction to complement missing values (S31: No), the input unit 13 determines whether a column selection and a data type selection have been received (S34). When the input unit 13 has received a column selection and a data type selection (S34: Yes), the editing unit 17 changes the data type of the selected column to the selected data type (S35). For example, the data processing apparatus 10 may display a data type change screen 600 such as that shown in FIG. 9 on the display unit of the user terminal 20 and accept an instruction from the user on which data type to change to. As shown in FIG. 9, when the "Year" column is selected and the option "Character type" is selected from the data type selection menu 83, the editing unit 17 changes the data type of the "Year" column from the integer type to the character type.

When the data type change is complete, or when the input unit 13 has not received an instruction to change the data type (S34: No), the input unit 13 determines whether a column selection and an input of a field replacement rule have been received (S36). When the input unit 13 has received a column selection and an input of a field replacement rule (S36: Yes), the editing unit 17 replaces the fields of the selected column in accordance with the replacement rule (S37). For example, when the option "Replace fields" is selected in a column editing menu 80 such as that shown in FIG. 8, the data processing apparatus 10 displays a replacement rule setting screen on the display unit of the user terminal and accepts the setting of the replacement rule. In this example, when the yen-to-US-dollar conversion rate is set as the replacement rule, the editing unit 17 replaces the sales amounts in Japanese yen contained in the "Sales" column with sales amounts in US dollars based on the conversion rate.

When the field replacement is complete, or when the input unit 13 has not received an instruction to replace fields (S36: No), the input unit 13 determines whether a column selection and an input of an item name have been received (S38). When the input unit 13 has received a column selection and an input of an item name (S38: Yes), the editing unit 17 changes the item name of the selected column to the input item name (S39). For example, when the option "Change item name" is selected in a column editing menu 80 such as that shown in FIG. 8, the data processing apparatus 10 displays an input screen for the new item name on the display unit of the user terminal and accepts the input of the item name. In this example, when "Genre" is input as the new item name, the editing unit 17 changes the item name "ジャンル" to "Genre". When the item name change is complete, or when no instruction to change the item name has been received (S38: No), the data processing apparatus 10 performs machine learning processing based on the edited data (S40). This completes the processing by the data processing apparatus 10. The various kinds of data processing performed by the data processing apparatus 10, such as missing value complementation, data type change, field replacement and item name change, need not be executed in the order described with reference to FIG. 10 and may be executed in any order.
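The branching in FIG. 10 amounts to applying each editing step only when the corresponding instruction has been received. A rough sketch under stated assumptions follows: the names `run_edit_flow`, `rename`, `steps` and `requests` are hypothetical, and each step is modelled as a function that takes the data and its arguments and returns edited data.

```python
def rename(records, args):
    """Item name change (S39): rename one item in every record."""
    old, new = args
    return [{(new if k == old else k): v for k, v in rec.items()} for rec in records]


def run_edit_flow(data, requests, steps):
    """Apply each requested edit in turn and return the data to be used for learning (S40)."""
    for name in ("impute", "change_type", "replace", "rename"):
        if name in requests:      # S31 / S34 / S36 / S38: was an instruction received?
            data = steps[name](data, requests[name])
        # otherwise (the "No" branch) the step is skipped
    return data


steps = {"rename": rename}        # only one step is implemented in this sketch
rows = [{"ジャンル": "A", "Sales": 1}]
edited = run_edit_flow(rows, {"rename": ("ジャンル", "Genre")}, steps)
print(edited)                     # [{'Genre': 'A', 'Sales': 1}]
```

Because each step consumes and returns a complete data set, the tuple of step names could be reordered freely, which is consistent with the statement that the editing operations may be executed in any order.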

FIG. 11 is a diagram showing an example of a statistical information column 700 and a data graph 800 created by the data processing apparatus 10 according to the embodiment of the present invention. The statistical information column 700 and the data graph 800 created by the data processing apparatus 10 may be transmitted to the user terminal 20 and then displayed on the display unit of the user terminal 20. FIG. 11 shows the statistical information column 700 and the data graph 800 as displayed on the display unit of the user terminal 20. The statistical information column 700 and the data graph 800 may be displayed, for example, when the user selects the option "Graph display" in the column editing menu 80 shown in FIG. 8. An item that the user selects in order to display its statistical information, or to graph its relationship with another item, is called a target item; in this example, the item "Sales" is selected as the target item.

The statistical information column 700 displays the result of computing statistics over the fields corresponding to the selected target item. The statistical information column 700 shown in FIG. 11 displays the statistics of the fields corresponding to the target item "Sales". Although the statistical information column 700 displays "Mode", "Average", "Maximum" and "Minimum" as statistical information, the displayed statistical information is not limited to these, and any statistical information such as the median or the variance may be displayed.
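The contents of such a statistics panel can be sketched with the standard library. The function name `column_stats`, the dictionary keys, and the decision to skip missing fields (represented here by `None`) are assumptions for illustration.

```python
import statistics


def column_stats(values):
    """Summarize one column, ignoring missing fields (None)."""
    present = [v for v in values if v is not None]
    return {
        "mode": statistics.mode(present),
        "mean": statistics.mean(present),
        "max": max(present),
        "min": min(present),
        "median": statistics.median(present),  # an optional extra statistic
    }


sales = [100, 300, 100, None, 300, 100]   # hypothetical "Sales" column
stats = column_stats(sales)
print(stats)
```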

The data graph 800 is a graph representing the relationship between the target item and a horizontal axis item. The data graph 800 includes a horizontal axis 90 and a vertical axis 91. The vertical axis 91 is marked with a scale suited to the content of the selected target item, and the item name of the target item is displayed as the axis label. In this example, the vertical axis 91 is marked with numerical values at regular intervals to indicate the magnitude of sales, and the item name "Sales" of the target item is displayed as the axis label. Like the vertical axis 91, the horizontal axis 90 is marked with an appropriate scale, and the item name of the horizontal axis item is displayed as the axis label. The user can freely select the item to be set as the horizontal axis item. For example, the data processing apparatus 10 may list the items contained in the data in a horizontal axis item selection unit 92 and have the user select from the list the item to be set as the horizontal axis item. In this example, the item "Year" is selected in the horizontal axis item selection unit 92, so the horizontal axis 90 is marked with years as its scale, and the item name "Year" is displayed as the axis label. The range of the horizontal axis 90 to be graphed can be selected in a range selection unit 93. In this example, the range "2000" to "2017" is selected for the horizontal axis 90 in the range selection unit 93, so the sales from 2000 to 2017 are displayed in the data graph 800. Although one item can be set on the horizontal axis in this example, the number of items is not necessarily limited to one, and a plurality of items may be set. For example, when two items are set as horizontal axis items, a new axis may be added in the depth direction of the data graph 800 shown in FIG. 11 to create a three-dimensional data graph 800 in which three axes intersect. The data graph 800 created by the data processing apparatus 10 is not limited to a bar graph and may be any type of graph, such as a line graph or a scatter plot.

The graph output button 94 is a button for outputting the graph created by the data processing apparatus 10. For example, when the user selects the graph output button 94, image data or the like of the data graph 800 is transmitted from the data processing apparatus 10 to the user terminal 20 or another destination. The table output button 95 is a button for outputting a table created by the data processing apparatus 10. For example, when the user selects the table output button 95, the data processing apparatus 10 creates a table listing the fields corresponding to the target item and the fields corresponding to the horizontal axis item, and transmits CSV data, image data or the like of the table to the user terminal 20 or another destination.

According to the present embodiment, the relationship between the selected target item and the horizontal axis item can be displayed visually as a graph. This makes it easy to identify the items that affect the target item and to select appropriately the items to be included in the learning data. For example, consider preparing learning data for creating a trained model that forecasts sales. In this case, a graph is created in which the sales item contained in the prepared data is set on the vertical axis 91 as the target item and the sales year item is set on the horizontal axis 90 as the horizontal axis item. If the values in the resulting graph vary from one sales year to another, the sales year is likely to affect sales, which indicates that the sales year item should be included in the learning data. Conversely, if another item is set on the horizontal axis 90 and the values in the graph hardly change across the fields of that item, the item is unlikely to affect sales, which indicates that the need to include it in the learning data is lower than for the sales year item.
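The visual judgment described above can be approximated numerically. The following sketch is not part of the described apparatus: it groups the target item by a candidate horizontal axis item and reports how far the group averages spread, as a crude stand-in for reading the graph. The function names, the "Region" item and the sample records are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def group_means(records, target, axis_item):
    """Average the target item for each distinct value of the horizontal axis item."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[axis_item]].append(rec[target])
    return {key: mean(vals) for key, vals in groups.items()}


def spread(records, target, axis_item):
    """Difference between the largest and smallest group average (a rough heuristic)."""
    means = group_means(records, target, axis_item).values()
    return max(means) - min(means)


rows = [
    {"Year": 2015, "Region": "East", "Sales": 100},
    {"Year": 2015, "Region": "West", "Sales": 100},
    {"Year": 2016, "Region": "East", "Sales": 200},
    {"Year": 2016, "Region": "West", "Sales": 200},
]
print(spread(rows, "Sales", "Year"))    # sales differ between years
print(spread(rows, "Sales", "Region"))  # sales do not differ between regions
```

In this made-up data the spread is large for "Year" and zero for "Region", mirroring the case in which the sales year would be kept in the learning data and the other item would be a weaker candidate.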

The embodiments described above are intended to facilitate understanding of the present invention and are not to be construed as limiting it. The elements of the embodiments and their arrangement, materials, conditions, shapes, sizes and the like are not limited to those illustrated and may be changed as appropriate. Configurations shown in different embodiments may also be partially substituted for or combined with one another.

DESCRIPTION OF SYMBOLS: 10 ... data processing apparatus, 11 ... communication unit, 12 ... acquisition unit, 13 ... input unit, 14 ... addition unit, 15 ... division unit, 16 ... combining unit, 17 ... editing unit, 20 ... user terminal, 30 ... database, 40 ... item name column, 41 ... data type column, 42 ... unique value column, 43 ... missing value column, 44 ... column addition button, 45 ... column editing button, 50 ... column addition menu, 51 ... addition rule selection unit, 52 ... item name input unit, 53 ... cancel button, 54 ... execution button, 60 ... reference item selection unit, 61 ... sort condition selection unit, 62 ... division number input unit, 63 ... divided data name input unit, 64 ... division ratio input unit, 70a ... first combining condition selection unit, 70b ... second combining condition selection unit, 71a ... first combination target selection unit, 71b ... second combination target selection unit, 72 ... combination result display unit, 80 ... column editing menu, 81 ... missing value processing menu, 82 ... complement menu, 83 ... data type selection menu, 90 ... horizontal axis, 91 ... vertical axis, 92 ... horizontal axis item selection unit, 93 ... range selection unit, 94 ... graph output button, 95 ... table output button, 100 ... data table, 200 ... column addition screen, 300 ... data division screen, 400 ... data combination screen, 500 ... data editing screen, 600 ... data type change screen, 700 ... statistical information column, 800 ... data graph

Claims

An acquisition unit for acquiring data having a plurality of records, each including a plurality of fields corresponding to a plurality of items;
An input unit that receives selection of one of the plurality of items and input of a division ratio of the data;
A division unit that extracts the field corresponding to the selected item from each of the plurality of records and, based on the extracted fields, divides the data at the division ratio into a plurality of pieces of data including learning data for machine learning and verification data;
A learning processing unit that performs machine learning processing based on the learning data, and verifies a learning model generated by the machine learning processing based on the verification data;
A data processing apparatus comprising:

The input unit receives an input of an addition rule for a field,
The apparatus further comprises an addition unit that adds one non-empty field to each of the plurality of records in accordance with the addition rule,
The division unit divides the data at the division ratio, based on the added field, into a plurality of pieces of data including learning data for machine learning and verification data.
The data processing apparatus according to claim 1.

The acquisition unit acquires a plurality of pieces of the data,
The input unit receives a selection of an item that provides a combining condition and a selection of items to be combined from among the items of the plurality of pieces of data,
The apparatus further comprises a combining unit that, based on the fields corresponding to the item that provides the combining condition, combines the plurality of pieces of data into a single piece of data including the fields corresponding to the items to be combined.
The data processing apparatus according to claim 1.
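
The combining step can be illustrated with a small Python sketch. It assumes that the combining condition is equality on one shared item and that every record of the first data is kept, with unmatched records receiving an empty (`None`) field; the claim itself does not fix either point. `combine`, `sales`, and `stores` are hypothetical names.

```python
def combine(first, second, condition_item, target_items):
    """Attach target_items from `second` to the records of `first`.

    Assumptions: records match when condition_item is equal, and the
    first matching record in `second` wins.
    """
    lookup = {}
    for record in second:
        lookup.setdefault(record[condition_item], record)
    combined = []
    for record in first:
        match = lookup.get(record[condition_item], {})
        merged = dict(record)
        for item in target_items:
            merged[item] = match.get(item)  # None marks an empty field
        combined.append(merged)
    return combined


sales = [
    {"store": "A", "amount": 120},
    {"store": "B", "amount": 80},
    {"store": "C", "amount": 50},
]
stores = [{"store": "A", "region": "east"}, {"store": "B", "region": "west"}]
combined = combine(sales, stores, "store", ["region"])
```

Here store "C" has no counterpart in `stores`, so its `region` field stays empty and would be a candidate for the complementation described in the following claims.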

The data processing apparatus according to any one of claims 1 to 3, wherein
the input unit receives, when the data includes an empty field, selection of a complementation method for the empty field, and
the apparatus further comprises an editing unit that complements the empty field based on a statistic, determined according to the selected complementation method, of the fields that share an item with the empty field.

The data processing apparatus according to claim 4, wherein
the statistic is obtained, when the data type of the empty field is a numeric type, by determining a mode, a median, an average, a maximum value, or a minimum value of the common fields, and, when the data type of the empty field is a character type, by finding the most frequent character of the common fields.
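
A minimal Python sketch of this kind of complementation, using only the standard library, might look like the following. It assumes that an empty field is represented by `None`, and it reads "most frequent character" as the most frequent value of the item; both are assumptions made for illustration. `complement` and `NUMERIC_METHODS` are hypothetical names.

```python
import statistics
from collections import Counter

# Statistic used for numeric items, keyed by the selected method.
NUMERIC_METHODS = {
    "mode": statistics.mode,
    "median": statistics.median,
    "average": statistics.mean,
    "maximum": max,
    "minimum": min,
}


def complement(records, item, method="mode"):
    """Fill empty (None) fields of one item from the non-empty ones."""
    present = [r[item] for r in records if r.get(item) is not None]
    if not present:
        return [dict(r) for r in records]  # nothing to compute a statistic from
    if all(isinstance(v, (int, float)) for v in present):
        fill = NUMERIC_METHODS[method](present)
    else:
        # Character type: use the most frequent value of the item.
        fill = Counter(present).most_common(1)[0][0]
    return [
        {**r, item: fill} if r.get(item) is None else dict(r) for r in records
    ]


records = [
    {"age": 20, "city": "Tokyo"},
    {"age": None, "city": "Tokyo"},
    {"age": 40, "city": None},
    {"age": 30, "city": "Osaka"},
]
filled = complement(complement(records, "age", "median"), "city")
```

In this example the missing age is filled with the median of 20, 40, and 30, and the missing city with the most frequent city.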

The data processing apparatus according to claim 4 or 5, wherein
the input unit receives selection of an item and selection of a data type, and
the editing unit changes the data type of the field corresponding to the selected item to the selected data type.

The data processing apparatus according to any one of claims 4 to 6, wherein
the input unit receives selection of an item and input of a replacement rule, and
the editing unit replaces the field corresponding to the selected item according to the replacement rule.

The data processing apparatus according to any one of claims 4 to 7, wherein
the input unit receives selection of an item and input of an item name, and
the editing unit changes the name of the selected item to the input item name.
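
The three editing operations recited in the preceding claims (data type change, replacement, and item renaming) can be sketched together in Python. The sketch assumes that a data type can be modeled as a Python conversion function and that a replacement rule is a mapping from old values to new values; the claims do not specify either form. All function names and sample items are hypothetical.

```python
def change_type(records, item, new_type):
    """Convert the field of one item in every record with new_type."""
    return [{**r, item: new_type(r[item])} for r in records]


def replace_values(records, item, rule):
    """Apply a replacement rule (old -> new); unmatched values are kept."""
    return [{**r, item: rule.get(r[item], r[item])} for r in records]


def rename_item(records, old_name, new_name):
    """Rename one item while keeping the order of the other items."""
    return [
        {(new_name if key == old_name else key): value for key, value in r.items()}
        for r in records
    ]


records = [{"price": "100", "size": "S"}, {"price": "250", "size": "L"}]
edited = change_type(records, "price", int)
edited = replace_values(edited, "size", {"S": "small", "L": "large"})
edited = rename_item(edited, "size", "size_label")
```

Each helper returns new records instead of modifying its input, so the operations can be chained in any order without side effects on the original data.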

The data processing apparatus according to any one of claims 4 to 8, wherein
the input unit receives selection of a target item and a horizontal axis item, and
the editing unit creates a graph in which the target item is set on the vertical axis and the horizontal axis item is set on the horizontal axis.

A data processing method comprising:
acquiring data having a plurality of records, each including a plurality of fields corresponding to a plurality of items;
receiving selection of one of the plurality of items and input of a division ratio of the data;
extracting the field corresponding to the selected item from each of the plurality of records and, based on the extracted fields, dividing the data at the division ratio into a plurality of pieces of data including learning data for machine learning and verification data; and
performing machine learning processing based on the learning data and verifying, based on the verification data, a learning model generated by the machine learning processing.

A data processing program that causes a computer to function as:
an acquisition unit that acquires data having a plurality of records, each including a plurality of fields corresponding to a plurality of items;
an input unit that receives selection of one of the plurality of items and input of a division ratio of the data;
a division unit that extracts the field corresponding to the selected item from each of the plurality of records and, based on the extracted fields, divides the data at the division ratio into a plurality of pieces of data including learning data for machine learning and verification data; and
a learning processing unit that performs machine learning processing based on the learning data and verifies, based on the verification data, a learning model generated by the machine learning processing.