JP6229512B2

JP6229512B2 - Information processing program, information processing method, and information processing apparatus

Info

Publication number: JP6229512B2
Application number: JP2014012485A
Authority: JP
Inventors: 田中　一成; 一成田中; 諒石崎
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2014-01-27
Filing date: 2014-01-27
Publication date: 2017-11-15
Anticipated expiration: 2034-01-27
Also published as: JP2015141455A

Description

The present invention relates to a technique for performing data processing on a computer using a processing flow in which processing components are combined.

When data processing is performed using a computer, tools are used that can create a processing flow defining a series of data processing steps by combining processing components, each of which performs a small unit of data processing such as data extraction or aggregation. By using such a tool, a user can reduce the effort of writing a data processing program from scratch. In addition, processing flows and individual processing components created by knowledgeable users can be shared among a plurality of users.

As an example of a conventional technique related to data processing, a technique has been proposed for processing data from spreadsheet software or plain files into a form desired by the user. In this technique, input data is processed according to a scenario that defines a series of data processing procedures such as data input and decomposition. In this data processing, label data accompanying the input data, such as data item names, is analyzed with a predetermined pattern to extract the item names, which are then stored as schema information.

As an example of a technique that supports the work of combining processing components, a technique for connecting components in a web application has been proposed. In this technique, each component has attribute information indicating, for example, the type of a property, which is data representing a characteristic of the component. Other components having the same attribute information as the component selected by the user are then presented as connection destination candidates.

As another example of a technique that provides support when combining processing components, the following technique has been proposed. In this technique, a search is made for component definition information of component programs having a function that belongs to a function type compatible with the component program corresponding to the component definition information set for an icon. The identification information of the component definition information found by the search, and the identification information of the component programs corresponding to it, are then listed on the display screen together with the icon as set information.

Japanese Patent Laid-Open No. 10-171636
JP 2010-176195 A
JP 2011-008358 A

Here, when a processing flow is generated by combining processing components, some constituent elements of the input data (for example, in the case of tabular data, some of the columns or the rows satisfying a specific condition) may be set as parameters of a processing component, depending on what the processing component does. One example is a case where only some constituent elements of the input data are to be processed by the processing component. In this case, the user may specify the constituent elements to be used as parameters. To allow the user to specify the constituent elements correctly, it is desirable that the user be able to select them from specifiable candidates rather than type in their names manually.

However, depending on the processing component, the constituent elements of the output data may change dynamically according to the contents of the data. One example is a process that converts the table structure according to the contents of the data and adds new columns to the input data. For a subsequent processing component that takes the output data of such a processing component as input to be able to offer candidate constituent elements of its input data for selection, information on the constituent elements that reflects the data processed by the preceding processing component needs to be extracted in advance.
Accordingly, an object of one aspect of the present invention is to enable efficient extraction of information on the constituent elements of data processed by a processing component, in a processing flow in which processing components are combined.

In one aspect of the present invention, input data is acquired for one target processing component included in a processing flow in which a plurality of processing components are combined, the target processing component performing processing that generates output data whose constituent elements depend on the contents of the input data. In accordance with a plurality of preset sampling methods, a plurality of test data sets are generated for each sampling method by sampling data of a plurality of different sample counts from the input data. Further, for each of the sampling methods, processing by the target processing component is executed with each test data set as input while the sample count of the test data is increased, and the output data of each processing result is acquired. Then, from among the plurality of sampling methods, the sampling method is selected for which the increase in the number of constituent elements of the output data relative to the increase in the sample count of the test data falls to or below a predetermined threshold at the stage where test data with the smallest sample count has been processed.

According to one aspect of the present invention, information on the constituent elements of data processed by a processing component can be extracted efficiently in a processing flow in which processing components are combined.

FIG. 1 is an explanatory diagram showing an example of an analysis flow.
FIG. 2 is an explanatory diagram of an example of an input data table.
FIG. 3 is an explanatory diagram of an example of a data profile and a processing component setting screen.
FIG. 4 is an explanatory diagram of an example of data obtained by converting input data into an item matrix.
FIG. 5 is an explanatory diagram of an example of the correlation between the number of samples of input data and the number of columns of output data.
FIG. 6 is an explanatory diagram showing an example of the functional configuration and data configuration of an analysis server.
FIG. 7 is an explanatory diagram of an example of an analysis flow table.
FIG. 8 is an explanatory diagram of an example of a component database.
FIG. 9 is an explanatory diagram of an example of a sampling method table.
FIG. 10 is an explanatory diagram of an example of a test data table.
FIG. 11 is an explanatory diagram of an example of a data profile temporary storage file.
FIG. 12 is an explanatory diagram of an example of a data profile temporary storage file.
FIG. 13 is an explanatory diagram of an example of a data profile temporary storage file.
FIG. 14 is an explanatory diagram of an example of a column number table.
FIG. 15 is an explanatory diagram of an example of a sample data table.
FIG. 16 is an explanatory diagram of an example of a data profile table.
FIG. 17 is a flowchart showing an example of the overall processing executed by the analysis server.
FIG. 18 is a flowchart showing an example of the data profile generation processing executed by the analysis server.
FIG. 19 is a flowchart showing an example of the sampling method selection processing executed by the analysis server.
FIG. 20 shows an example of the hardware configuration of the analysis server in the present embodiment.

[Background and Outline of the Present Embodiment]
The present embodiment describes a technique that enables efficient extraction of information on the constituent elements of data processed by a processing component, in a tool that creates a processing flow by combining processing components that each perform a small unit of data processing such as data extraction or aggregation. The description uses a data analysis tool as an example of such a data processing tool. The data analysis tool analyzes the contents of data according to the combination of processing components and outputs the analysis result.

FIG. 1 shows an example of an analysis flow constructed in an example of a data analysis tool. As shown in FIG. 1, the analysis flow includes, for example, input data, a combination of processing components that each perform a small unit of data processing, and the final output of the processing results. Each processing component is connected so that it processes the output data of the preceding processing component (or, for the first component, the initial input data) and passes an intermediate file containing its processing result to the subsequent processing component (or, for the last component, produces the final output). Data is analyzed by executing the series of processes in such an analysis flow.
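The chaining described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each processing component is a plain function taking the previous output and returning new data; the name `run_flow` and the toy components are hypothetical and do not come from the patent.

```python
def run_flow(input_data, components):
    """Run processing components in order, feeding each one the previous output.

    Returns the final output and the list of intermediate results
    (standing in for the intermediate files passed between components).
    """
    intermediates = []
    data = input_data
    for component in components:
        data = component(data)
        intermediates.append(data)
    return data, intermediates


# Hypothetical components: an extraction step followed by an aggregation step.
def extract_positive(values):
    return [v for v in values if v > 0]


def total(values):
    return sum(values)


final_output, intermediate_files = run_flow([120, -5, 300], [extract_positive, total])
```

Here `intermediate_files[0]` plays the role of intermediate file 1 in FIG. 1, and `final_output` corresponds to the final output of the flow.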

Using the data analysis tool, a user first constructs such an analysis flow. The user can create processing components personally. In addition, a processing component that has been created once can be shared and used by a plurality of users, so that users do not have to write the data analysis processing from scratch. The user constructs an analysis flow by deciding on the input data and then combining and connecting processing components that process the input data in sequence.

Here, depending on what a processing component does, only some constituent elements of the input data may be set as parameters. Examples include a case where only some constituent elements are to be processed and a case where they are used as parameters of a clustering analysis. In such cases, the user may specify the constituent elements to be processed by that processing component.
For example, suppose that the input data of the analysis flow shown in FIG. 1 is data in the table format shown in FIG. 2, and that intermediate file 1 output by processing component A has the same data structure as the data shown in FIG. 2. As shown in FIG. 2, this input data includes the columns “date”, “accounting time”, and “sales” as examples of constituent elements. Suppose further that, in processing component B, the user selects only specific columns of the input data shown in FIG. 2 as the processing target. When an ordinary database table or the like is used as input data, the data profile of the table generally includes column information indicating which columns the table contains. Therefore, as shown in FIG. 3, the setting screen of the processing component can display the selectable column names as options, based on the column information in the data profile of the input data.

However, suppose that processing component A, which precedes processing component B, performs processing that dynamically changes the constituent elements of the output data according to the contents of the data. One example, shown in FIG. 4, is converting the table structure (forming an item matrix) so that, within the range of times for which data exists in the “accounting time” column of the input data shown in FIG. 2 (the range from the minimum value to the maximum value), each time slot (each hour) is marked T or F according to whether there were sales in that slot. Converting the input data into such a data structure makes it usable for processing such as clustering analysis and machine learning. In such a case, the column information of the data profile shown in FIG. 3 cannot be used to display column name options in the subsequent processing component B, which takes the output data of processing component A as input.
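The item-matrix conversion can be sketched in Python as follows. This is only an illustration under stated assumptions: rows are dictionaries with the hypothetical keys `date`, `accounting_time`, and `sales`, one output row is produced per date, and the date and sales values are made up. Only the times 10:23 and 21:48 are taken from the example in the text; the actual layout of FIG. 4 may differ.

```python
def to_item_matrix(rows, time_column="accounting_time", key_column="date"):
    """Convert rows into an item matrix with one T/F column per hourly slot.

    The slots span the minimum to the maximum hour found in time_column,
    so the set of output columns depends on the contents of the input.
    Assumes rows is non-empty and times are 'HH:MM' strings.
    """
    hours = [int(row[time_column].split(":")[0]) for row in rows]
    slots = list(range(min(hours), max(hours) + 1))
    columns = [key_column] + [f"{h}:00" for h in slots]

    # Collect, per key (assumed here to be the date), the hours that had sales.
    hours_by_key = {}
    for row, hour in zip(rows, hours):
        hours_by_key.setdefault(row[key_column], set()).add(hour)

    matrix = [
        [key] + ["T" if h in seen else "F" for h in slots]
        for key, seen in hours_by_key.items()
    ]
    return columns, matrix


# Illustrative rows; date and sales values are invented for the sketch.
rows = [
    {"date": "day 1", "accounting_time": "10:23", "sales": 1200},
    {"date": "day 1", "accounting_time": "21:48", "sales": 800},
]
columns, matrix = to_item_matrix(rows)
```

With the two times above, the sketch yields twelve hourly item columns (10:00 through 21:00) in addition to the key column, which matches the column count given for the example in the description.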

For this reason, in the present embodiment, a data profile of the data processed by the preceding processing component A is generated in advance so that processing component B can obtain the column information of its input data. The most reliable way to generate the data profile is to have processing component A process all of the data and to generate the data profile from the result. However, an analysis tool may be given a large amount of data as input. If all of the input data is processed just to generate the data profile of the output of processing component A, the user's construction work with the analysis tool may be halted for a long time.

Therefore, in the present embodiment, sample data is used to generate the data profile of the data processed by processing component A. The sample data is at least a part of the entire input data processed by processing component A. It is desirable that the columns of the output data obtained when processing component A processes the sample data be the same as the columns of the output data obtained when processing component A processes the entire input data.

The question is then how to extract such sample data, that is, which sampling method to use. Depending on what the processing component does, the number of samples needed to generate the same columns differs from one sampling method to another. To generate the columns efficiently with fewer samples, it is desirable to select a sampling method suited to the processing performed by the processing component. If the user knows specifically what processing component A does, it is relatively easy for the user to select an appropriate sampling method. However, as described above, the data analysis tool may use processing components created by other users. In that case the user does not necessarily understand the detailed processing inside the processing component, so selecting an appropriate sampling method is difficult.

For this reason, in the present embodiment, an appropriate sampling method for the sample data used to generate the data profile of the data processed by a processing component is selected automatically. An appropriate sampling method here means a method that can extract, with fewer samples, sample data whose output has constituent elements (here, the column structure) closer to those of the output obtained when the processing component processes the entire input data.

Specifically, in the present embodiment, input data is acquired for one target processing component included in a processing flow in which a plurality of processing components are combined, the target processing component performing processing that generates output data whose constituent elements depend on the contents of the input data. In accordance with a plurality of preset sampling methods, a plurality of test data sets are generated for each sampling method by sampling data of a plurality of different sample counts from the input data. Further, for each sampling method, processing by the target processing component is executed with each test data set as input while the sample count of the test data is increased, and the output data of each processing result is acquired. Then, from among these sampling methods, the one is selected for which the increase in the number of constituent elements of the output data relative to the increase in the sample count converges (specifically, falls to or below a predetermined threshold) at the stage where test data with the smallest sample count has been processed.

FIG. 5 shows an example of the correlation between the number of samples of input data and the number of columns of output data for each sampling method. In this example, processing component A in the analysis flow described above converts the input data of FIG. 2 into an item matrix that indicates with T or F whether there were sales in each time slot, within the range of times for which data exists in “accounting time”. Three sampling methods are used, for example: “random sampling”, “sampling that includes the maximum and minimum values”, and “sampling that includes more distinct values”. The column used as the basis for sampling (the column from which the maximum value, minimum value, distinct values, and so on are extracted) is “accounting time”.
In this example, processing component A adds one column per hour within the range between the minimum and maximum values of the accounting time. Therefore, the wider the range of “accounting time” values in the sampled data, the more columns are generated.
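The three sampling methods named above could look roughly like the following Python sketch. The function names, the `key` argument for the reference column, and the way remaining slots are filled once the extreme or distinct values have been taken are all assumptions made for illustration; the description does not specify these details.

```python
import random


def random_sampling(rows, n, key=None, seed=0):
    """Pick n rows uniformly at random (key is unused, kept for a uniform signature)."""
    return random.Random(seed).sample(rows, min(n, len(rows)))


def min_max_sampling(rows, n, key):
    """Pick the rows with the minimum and maximum key first, then fill with others."""
    ordered = sorted(rows, key=key)
    if n <= 0 or not ordered:
        return []
    picked = [ordered[0]]
    if n >= 2 and len(ordered) >= 2:
        picked.append(ordered[-1])
    rest = ordered[1:-1]
    picked.extend(rest[: max(0, n - len(picked))])
    return picked


def distinct_sampling(rows, n, key):
    """Prefer rows whose key value has not been seen yet, then fill with duplicates."""
    seen, picked, leftovers = set(), [], []
    for row in rows:
        value = key(row)
        if value in seen:
            leftovers.append(row)
        else:
            seen.add(value)
            picked.append(row)
    return (picked + leftovers)[:n]
```

For the hourly item-matrix example, `min_max_sampling` with `n=2` already returns the rows that bound the range of the reference column, which is why that method reaches the full column count so quickly.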

With such processing, as shown in FIG. 4, sampling that includes the maximum and minimum values obtains the maximum value (21:48) and the minimum value (10:23) of “accounting time” as soon as the number of samples reaches two. Consequently, the number of columns indicating accounting time slots (item columns) in the output data is 12, the same as in the output data of FIG. 4, which is the item matrix obtained by processing all of the input data of FIG. 2. This is the maximum number of columns; increasing the number of samples further does not change it, so the column count has converged. In the present embodiment, “convergence” refers to the state (saturation) in which the number of columns has reached the number of columns in the output data obtained by processing the entire input data with the processing component, or a number close to it, and the increase in the number of columns relative to the increase in the number of samples has become smaller than a predetermined threshold.
With random sampling, on the other hand, increasing the number of samples does not necessarily extract data that widens the range of values. The same applies to sampling that includes more distinct values. With these sampling methods, therefore, columns are not generated efficiently even as the number of samples of input data increases, and the increase in the number of columns does not converge.

In the present embodiment, the number of columns of output data is determined for each number of samples of input data, the state of convergence of the increase in the number of columns is determined for each sampling method, and the efficient sampling method for which the increase in the number of columns converges earliest is selected. The user therefore does not need to work out an appropriate sampling method, which reduces the user's workload. In addition, when the data profile is generated, only the sample data, rather than the entire input data, needs to be processed by the processing component, so the data profile can be generated with a relatively small amount of processing. As a result, the column information of the input data that is to be offered for selection in the subsequent processing component can be extracted efficiently.
Specific aspects of the present embodiment are described in detail below.

<System functional configuration and data configuration>
FIG. 6 shows an example of the functional configuration and data configuration of the analysis server 1, which is an example of a computer provided with the data analysis tool in the present embodiment. The functional configuration described in the present embodiment assumes that each processing component in the analysis flow has already been created and is ready for use.

As shown in FIG. 6, the analysis server 1 includes the following units, whose functions are realized by executing a program: an analysis flow reading unit 11, an analysis flow change detection unit 12, a data profile generation unit 13 (comprising a test data generation unit 14, a test data processing unit 15, a sampling method selection unit 16, a sample data generation unit 17, and a sample data processing unit 18), an analysis flow update unit 19, a column selection setting unit 20, and an analysis flow execution unit 21. The analysis server 1 also holds, in its storage means, an analysis flow table 31, a component database 32, an input data table 33, a sampling method table 34, a test data table 35, a data profile temporary storage file 36, a column number table 37, a sample data table 38, and a data profile table 39.

The analysis flow reading unit 11 reads, from the analysis flow table 31, information on the processing components included in the analysis flow and on the connections between them. The analysis flow reading unit 11 also reads, from the component database 32, information such as the processing contents of each processing component included in the analysis flow, and reads the input data of the analysis flow from the input data table 33.
Based on the information read by the analysis flow reading unit 11, the analysis flow change detection unit 12 detects that a processing component has been added or changed, that the input data has been changed, or the like, in the analysis flow.

The data profile generation unit 13 generates a data profile of the data processed by a processing component, and includes the test data generation unit 14, the test data processing unit 15, the sampling method selection unit 16, the sample data generation unit 17, and the sample data processing unit 18.
The test data generation unit 14 generates a plurality of test data sets from the input data of the target processing component, that is, the processing component for which a data profile is to be generated, by sampling data of a plurality of different sample counts for each of the sampling methods stored in the sampling method table 34. It then stores the generated test data in the test data table 35.

For each sampling method, the test data processing unit 15 executes processing by the target processing component with the test data generated by the test data generation unit 14 as input. At this time, the test data processing unit 15 executes the processing with test data of the same sample count for every sampling method and acquires the output data of each processing result. The test data processing unit 15 then repeats the same processing while increasing the sample count of the test data.
The sampling method selection unit 16 selects the sampling method for which the increase in the number of columns of output data relative to the increase in the sample count of the test data converges earliest. Specifically, the sampling method selection unit 16 selects the sampling method for which this increase in the number of columns falls to or below a predetermined threshold at the stage where test data with the smallest sample count has been processed.
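The interplay of test data generation, test data processing, and method selection can be sketched as follows. This is a simplified illustration, not the patented procedure itself: the function `select_sampling_method`, the toy samplers, and in particular the tie-break (preferring the method with more columns when several converge at the same sample count) are assumptions that the description does not state.

```python
def select_sampling_method(rows, component, methods, sizes, threshold=0):
    """Return (method name, sample count) for the method that converges first.

    methods:   dict mapping a name to a sampler(rows, n) -> sampled rows
    component: function mapping sampled rows to a list of output column names
    A method is treated as converged at a sample count when its column count
    grew by no more than `threshold` since the previous, smaller sample count.
    """
    previous = {name: None for name in methods}
    for n in sorted(sizes):
        converged = []
        for name, sampler in methods.items():
            count = len(component(sampler(rows, n)))
            if previous[name] is not None and count - previous[name] <= threshold:
                converged.append((count, name))
            previous[name] = count
        if converged:
            # Assumed tie-break: prefer the method that produced more columns.
            count, name = max(converged, key=lambda item: item[0])
            return name, n
    return None, None


# Toy setup: one row per hour from 10 to 21, and a component that emits
# one column per hour between the smallest and largest sampled hour.
rows = [{"hour": h} for h in range(10, 22)]


def hourly_columns(sample):
    hours = [row["hour"] for row in sample]
    return [f"{h}:00" for h in range(min(hours), max(hours) + 1)]


def head_sampler(data, n):
    return data[:n]


def min_max_sampler(data, n):
    ordered = sorted(data, key=lambda row: row["hour"])
    return [ordered[0], ordered[-1]] + ordered[1 : n - 1]


methods = {"head": head_sampler, "min_max": min_max_sampler}
result = select_sampling_method(rows, hourly_columns, methods, sizes=[2, 4, 6])
```

In this toy setup the min/max sampler yields the full column range from the first test data set onward, so its growth is zero at the second sample count, while the head sampler keeps adding columns.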

The sample data generation unit 17 generates sample data by sampling data from the input data of the target processing component using the selected sampling method, and stores the sample data in the sample data table 38.
The sample data processing unit 18 executes processing by the target processing component with the sample data in the sample data table 38 as input, and acquires the output data of the processing result. It then obtains column information from that output data and generates a data profile containing the column information. Finally, it stores the generated data profile in the data profile table 39.
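A minimal sketch of deriving a data profile from the output of a sample run is given below. It assumes, for illustration only, that output rows are dictionaries keyed by column name and that the profile is a dictionary holding an ordered list of column names; the actual layout of the data profile table 39 is shown in FIG. 16 and is not reproduced here.

```python
def build_data_profile(component, sample_rows):
    """Run the component on sample data and collect the output column names.

    Columns are recorded in first-seen order so they can be offered
    as options in the order the component emitted them.
    """
    output_rows = component(sample_rows)
    columns = []
    for row in output_rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return {"columns": columns}


# Hypothetical component that adds a flag column to every row.
def add_flag(rows):
    return [{**row, "has_sales": "T" if row["sales"] > 0 else "F"} for row in rows]


profile = build_data_profile(add_flag, [{"date": "day 1", "sales": 500}])
```

The resulting `profile["columns"]` is what a later setting screen could present as selectable column names for the subsequent processing component.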

The analysis flow update unit 19 reflects additions or changes to processing components in the analysis flow, changes to data profiles, and the like in the analysis flow table 31, the component database 32, and so on.
The column selection setting unit 20 performs processing that lets the user set the constituent elements used as parameters of a processing component. As a specific example, the column selection setting unit 20 extracts the column information for the target processing component from the data profile table 39 and displays the column names identified by that column information on the setting screen as options for the columns to be processed. It then accepts the user's selection of column names and stores the selection in the analysis flow table 31.
Based on operation input by the user, the analysis flow execution unit 21 executes the analysis flow with the actual input data (all records, not the sample data) as input, and outputs the data analysis result.

Next, the data held by the analysis server 1 will be described in detail with reference to FIG. 2 and FIGS. 7 to 16. For the tables other than the analysis flow table 31 and the component database 32, the figures show only the data related to the target processing component for which a data profile is generated in the specific example of the present embodiment.

The analysis flow table 31 stores, for each analysis flow, information on the content of that flow. As shown in FIG. 7, for example, the analysis flow table 31 includes columns for: a component ID uniquely assigned to each processing component in the analysis flow; the component name of the processing component; a connection source ID identifying the processing component connected immediately before it; a connection destination ID identifying the processing component connected immediately after it; the parameters specified for the processing component; the input data processed by the processing component and the output data resulting from its processing; and the data profile of the data after processing by the processing component.

The component database 32 stores, in advance, information on the components that can be used in an analysis flow. As shown in FIG. 8, for example, the component database 32 includes columns for the component name (which indicates an outline of the processing performed by the component), the parameters that can be specified for the component, the variable name of the input data, the data type of the input data, the variable name of the output data, the data type of the output data, and so on.
The input data table 33 stores the individual input data of an analysis flow, and its structure differs depending on the content of the input data. In the specific example of the present embodiment, the input data includes date, accounting time, and sales columns, as shown in FIG. 2 described above.

The sampling method table 34 stores, in advance, information on various sampling methods. As shown in FIG. 9, for example, the sampling method table 34 includes columns for the sampling method and the details of that sampling method.
The test data table 35 is a table (an array of structures) that stores the test data generated by each sampling method. As shown in FIG. 10, for example, the test data is stored for each sampling method and for each number of samples.

The data profile temporary storage file 36 is a file in which the data profile tables 39 generated by processing the test data with the processing component are stored in sequence. As shown in FIGS. 11 to 13, for example, the data profile temporary storage file 36 stores, for each number of samples and each sampling method, a data profile table 39 that includes columns for the column name of the input data and the data type of that column.
The column number table 37 stores in sequence, for each sampling method of the test data, the number of columns obtained as a result of the processing component processing the test data of each number of samples (that is, the number of columns whose count varies with the number of samples). As shown in FIG. 14, for example, the column number table 37 includes columns for the sampling method and the number of samples.

The sample data table 38 stores the sample data sampled by the selected sampling method after a sampling method has been selected. In the specific example of the present embodiment, the sample data table 38 includes date, accounting time, and sales columns, as shown in FIG. 15, for example.
The data profile table 39 stores the data profile generated by actually processing the sample data with the processing component after a sampling method has been selected. As shown in FIG. 16, for example, the data profile table 39 includes columns for the column name of the input data and the data type of that column.

<Processes executed in the analysis server>
Next, the processes executed in the analysis server 1 will be described with reference to FIGS. 17 to 19.
First, the overall process executed by the analysis server 1 will be described with reference to FIG. 17.

In step S1, the analysis flow reading unit 11 accepts the user's designation of an analysis flow and reads information on the content of that analysis flow from the analysis flow table 31. It then reads, from the component table, information on the components corresponding to the component names contained in the information read from the analysis flow table 31. These reads identify the processing to be executed across the entire analysis flow.
In step S2, the analysis flow reading unit 11 reads the input data of the first processing component of the analysis flow, based on the information read from the analysis flow table 31.
Note that steps S1 and S2 are unnecessary when, for example, a new analysis flow is being created.
In step S3, the analysis flow reading unit 11 waits for an operation input by the user.
In step S4, the analysis flow reading unit 11 determines whether an operation input instructing execution of the analysis flow has been received. If no such input has been received, the process proceeds to step S5 (No); if it has, the process proceeds to step S16 (Yes).

Steps S5 to S15 below are executed when an analysis flow is created or changed.
In step S5, the analysis flow change detection unit 12 detects the addition of a new processing component to the analysis flow, if one has been added.
In step S6, the analysis flow change detection unit 12 detects changes such as a change to the settings of a processing component of the analysis flow or a change to the input data.
In step S7, the analysis flow change detection unit 12 determines whether the column name setting screen for the columns to be processed has been opened for any processing component of the analysis flow. If the setting screen has not been opened, the process proceeds to step S8 (No); if it has been opened, the process proceeds to step S13 (Yes).

In step S8, the analysis flow change detection unit 12 determines whether a data profile needs to be generated (or regenerated) for any processing component as a result of the addition or change of processing components in the analysis flow. Specifically, when a processing component has been added, for example, a data profile for that processing component (the data profile of the data after processing by that component) must be generated. Likewise, when the content of the input data has been changed, or when the processing content of a processing component (or of a preceding processing component) has been changed, the data profiles of the affected processing components may need to be regenerated. If a data profile needs to be generated, the process proceeds to step S9 (Yes); otherwise, the process proceeds to step S12 (No).

Steps S9 to S11 below are executed to generate a data profile when an analysis flow is created or changed.
In step S9, the data profile generation unit 13 identifies the processing components for which a data profile needs to be generated.
In step S10, the data profile generation unit 13 generates, for each processing component identified in this way, the data profile of the data after processing by that component. The data profile generation process is described in detail later.

In step S11, the data profile generation unit 13 stores the generated data profile in the data profile table 39.
In step S12, the analysis flow update unit 19 reflects the detected additions and changes to processing components, changes to data profiles, and the like in the data of the analysis flow table 31 as necessary.

Steps S13 to S15 below are executed to let the user set the column names to be processed by a processing component when an analysis flow is created or changed.
In step S13, the column selection setting unit 20 identifies the processing component whose column name setting screen has been opened.
In step S14, the column selection setting unit 20 extracts column information from the data profile of the data that is input to the identified processing component, that is, the data profile of the data after processing by the processing component connected immediately before it. It then displays the column names identified by the extracted column information on the setting screen as choices for the column names to be processed.
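The lookup in step S14 can be pictured as following the connection source ID to the preceding component and reading the column names from its data profile. The Python sketch below is illustrative only: `FlowEntry`, its field names, and `column_choices` are hypothetical names that do not appear in the embodiment, and returning an empty list for the first component (whose input would come from the input data table 33) is a simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FlowEntry:
    """One row of a simplified analysis flow table (hypothetical layout)."""
    component_id: str
    component_name: str
    source_id: Optional[str]  # component connected immediately before, if any
    # Data profile of the data after processing: column name -> data type.
    profile: Dict[str, str] = field(default_factory=dict)


def column_choices(flow: Dict[str, FlowEntry], component_id: str) -> List[str]:
    """Return the column names to offer on the setting screen of a component."""
    source_id = flow[component_id].source_id
    if source_id is None:
        # Simplification: the first component is not handled in this sketch.
        return []
    return list(flow[source_id].profile)
```

Because the choices come from the preceding component's profile, they stay correct only as long as that profile is regenerated whenever the preceding component changes, which is the role of steps S8 to S11.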

In step S15, the column selection setting unit 20 accepts the user's selection of the column names to be processed. It then stores information indicating the selected column names in the parameter item of the corresponding processing component's data in the analysis flow table 31.
Steps S16 and S17 below execute the analysis flow.
In step S16, the analysis flow execution unit 21 executes the analysis flow designated by the user.
In step S17, the analysis flow execution unit 21 outputs the execution result of the analysis flow.

Next, the data profile generation process executed in the analysis server 1 (the process of step S10 above) will be described with reference to FIG. 18.

In step S21, the test data generation unit 14 refers to the analysis flow table 31, identifies the input data to the target processing component for which a data profile needs to be generated, and acquires that input data. If the target processing component is the first processing component of the analysis flow, the input data can be acquired from the input data table 33. If the target processing component is in the middle of the analysis flow, an intermediate file or the like produced by the preceding processing component serves as its input data.
In step S22, the sampling method selection unit 16 selects the sampling method to be used for sampling the sample data from the input data. The sampling method selection process is described in detail later.

In step S23, the sample data generation unit 17 samples the input data of the target processing component with the sampling method determined in step S22 and generates sample data. Here, the sample data generation unit 17 may refer to the column number table 37 and sample as many records as the number of samples at the earliest stage at which the increase in the number of columns converged (that is, the smallest such number of samples). Specifically, going back from the number of samples for which the number of columns was stored last, it may use the smallest number of samples among those for which the increase in the number of columns (relative to the increase in the number of samples) has continuously remained equal to or less than the threshold. The sample data generation unit 17 then stores the generated sample data in the sample data table 38.
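One way to read the backward search in step S23 is sketched below in Python. The function name and the list-based representation of the column number table 37 are assumptions made for illustration, as is the choice to count the starting point of the first small increase as part of the converged run. That reading reproduces the specific example of this embodiment, where column counts of 12, 12, 12 for 2, 3, and 4 samples lead to 2 records being sampled.

```python
from typing import List


def earliest_converged_sample_count(
    sample_counts: List[int], column_counts: List[int], threshold: int
) -> int:
    """Walk back from the last stored column count while each increase
    stays within the threshold, and return the smallest sample count reached."""
    i = len(column_counts) - 1
    while i > 0 and column_counts[i] - column_counts[i - 1] <= threshold:
        i -= 1
    return sample_counts[i]


# Column counts from the specific example (FIG. 14), threshold 1.
print(earliest_converged_sample_count([2, 3, 4], [12, 12, 12], 1))  # 2
```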

In step S24, the sample data processing unit 18 inputs the sample data stored in the sample data table 38 to the target processing component and executes the processing of that component.
In step S25, the sample data processing unit 18 extracts the columns from the data resulting from step S24, estimates the data type of each column from its data, and generates the data profile of the target processing component. The sample data processing unit 18 then stores the generated data profile in the data profile table 39.
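The column extraction and data type estimation of step S25 may be sketched as follows. The embodiment does not state how data types are estimated, so the rules here (try integer, then float, otherwise string) and the names `infer_type` and `build_profile` are assumptions, as is the representation of output rows as dictionaries of strings.

```python
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Guess a data type name from the string values of one column."""
    for name, parse in (("integer", int), ("float", float)):
        try:
            for value in values:
                parse(value)
            return name  # every value parsed with this type
        except ValueError:
            continue
    return "string"


def build_profile(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Build a data profile (column name -> data type) from output rows."""
    columns: Dict[str, List[str]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(values) for name, values in columns.items()}
```

The number of entries in the returned profile is the column count that the sampling method selection process compares across numbers of samples.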

Next, the sampling method selection process executed in the analysis server 1 (the process of step S22 above) will be described with reference to FIG. 19.

In step S31, the sampling method selection unit 16 determines a plurality of different numbers of samples. For example, the smallest number of samples can be set to 2, and the remaining numbers of samples can be obtained by successively adding an increment equal to the total number of input records divided into equal parts. The sampling method selection unit 16 then generates an array of these numbers of samples (not shown).
In step S32, the sampling method selection unit 16 takes the next number from the array, starting from the smallest, and uses it as the current number of samples.
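The array of step S31 may be sketched as below. The function name `sample_counts`, the `divisions` parameter, and capping the array at the total number of records are assumptions; the embodiment only states that the smallest number is 2 and that an equal division of the total is added successively.

```python
from typing import List


def sample_counts(total: int, divisions: int, minimum: int = 2) -> List[int]:
    """Return increasing sample counts: minimum, minimum + step, ...

    step is the total number of input records divided into `divisions`
    equal parts (at least 1), and no count exceeds `total`.
    """
    step = max(1, total // divisions)
    return list(range(minimum, total + 1, step))


print(sample_counts(100, 4))  # [2, 27, 52, 77]
```

Step S32 then simply iterates over this list from the front.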

In step S33, the sampling method selection unit 16 refers to the sampling method table 34 and, for each sampling method, samples as many records as the current number of samples. The column used as the basis for sampling (the column from which the maximum value, minimum value, distinct values, and so on are taken) is the column whose data content the processing component uses to generate new columns (the "accounting time" column in the specific example above). The present embodiment assumes that at least the information needed to identify this column can be acquired from, for example, the analysis flow table 31 or the component database 32. The sampling method selection unit 16 then stores the sampled data in the test data table 35.
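The three sampling methods named in the specific example of this embodiment could look roughly like the following Python sketch. The detailed definitions in the sampling method table 34 (FIG. 9) are not reproduced here, so the function bodies are one plausible interpretation, the names are hypothetical, and the sketch assumes 2 <= n <= len(rows) and a key column whose values sort correctly (for example, zero-padded "HH:MM" strings).

```python
import random
from typing import Dict, List

Row = Dict[str, str]


def random_sampling(rows: List[Row], key: str, n: int, rng: random.Random) -> List[Row]:
    """Pick n rows uniformly at random (the key column is not used)."""
    return rng.sample(rows, n)


def max_min_sampling(rows: List[Row], key: str, n: int, rng: random.Random) -> List[Row]:
    """Always include the rows with the smallest and largest key value,
    then fill the remaining n - 2 slots at random."""
    ordered = sorted(rows, key=lambda row: row[key])
    return [ordered[0], ordered[-1]] + rng.sample(ordered[1:-1], n - 2)


def distinct_sampling(rows: List[Row], key: str, n: int, rng: random.Random) -> List[Row]:
    """Prefer rows whose key value has not been seen yet, so the sample
    contains as many distinct key values as possible (rng is unused here)."""
    seen, first_seen, repeats = set(), [], []
    for row in rows:
        if row[key] in seen:
            repeats.append(row)
        else:
            seen.add(row[key])
            first_seen.append(row)
    return (first_seen + repeats)[:n]
```

Giving all three functions the same signature lets the selection loop call them interchangeably for each number of samples.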

In step S34, the sampling method selection unit 16 executes, for each sampling method, the processing of the target processing component with the test data of the current number of samples stored in the test data table 35 as input, and acquires the output data of each processing result.
In step S35, the sampling method selection unit 16 extracts the columns from the output data of each test data processing result from step S34, estimates the data type of each column from its data, and generates the data profile of the target processing component. It then stores the data profile in the data profile temporary storage file 36.

In step S36, the sampling method selection unit 16 refers to the data profile temporary storage file 36 and, for each sampling method, stores in the column number table 37 the number of columns that indicate accounting-time time slots, which is the part of the column count that varies, among the columns extracted with the current number of samples (that is, the item columns other than the fixed "date", "accounting time", and "sales" columns). Alternatively, the total number of columns contained in the data profile may be stored here.
In step S37, the sampling method selection unit 16 refers to the column number table 37 and reads, for each sampling method, the numbers of columns extracted with the current, previous, and second-previous numbers of samples.

In step S38, the sampling method selection unit 16 determines whether any sampling method satisfies both of the following requirements.
(number of columns at the previous number of samples - number of columns at the second-previous number of samples) is equal to or less than the threshold, and
(number of columns at the current number of samples - number of columns at the previous number of samples) is equal to or less than the threshold.

In other words, this determination checks whether the increase in the number of columns of the processing result that accompanies the increase in the number of samples has converged. The threshold used in the determination may be set in advance, according to the number and content of the input data, to a value suitable for judging that the increase in the number of columns has converged. The determination could be made with only the criterion "(number of columns at the current number of samples - number of columns at the previous number of samples) is equal to or less than the threshold". However, depending on the sampling method and the processing content of the processing component, the number of columns may increase irregularly, and the increase may happen to be small at one particular step. For this reason, the determination in step S38 also applies the criterion "(number of columns at the previous number of samples - number of columns at the second-previous number of samples) is equal to or less than the threshold", as described above. The determination may also be based on the increases in the number of columns over an even larger number of past outputs.
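The difference between the single criterion and the two-step criterion of step S38 can be illustrated with a short Python sketch. The function names are hypothetical, and the column counts 3, 8, 8 are made-up values chosen to show an irregular increase; they are not taken from the figures.

```python
from typing import List


def small_last_increase(counts: List[int], threshold: int) -> bool:
    """Single criterion: only the latest increase is checked."""
    return len(counts) >= 2 and counts[-1] - counts[-2] <= threshold


def has_converged(counts: List[int], threshold: int) -> bool:
    """Criterion of step S38: the last two increases must both be
    equal to or less than the threshold."""
    if len(counts) < 3:
        return False  # previous and second-previous counts not yet available
    two_back, previous, current = counts[-3:]
    return previous - two_back <= threshold and current - previous <= threshold


irregular = [3, 8, 8]  # hypothetical column counts with one flat step
print(small_last_increase(irregular, 0))  # True
print(has_converged(irregular, 0))        # False
```

With the single criterion the flat step from 8 to 8 would be taken as convergence, even though the count had jumped by 5 just before; the two-step check rejects it.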

Setting the threshold to "0" makes it possible to identify the state in which the number of columns has reached its maximum and the increase has converged completely.
This determination assumes that, as a rule, the increase in the number of columns does not converge while the number of columns is still small (at least, smaller than the number of columns under another sampling method whose increase has not yet converged). The same applies to step S41 described later.
If a sampling method satisfies the above requirements, the process proceeds to step S39 (Yes); if none does, the process proceeds to step S40 (No).

Steps S37 and S38 above are executed only from the third number of samples onward, because the numbers of columns for the previous and second-previous numbers of samples are not available during the first and second passes.

In step S39, the sampling method selection unit 16 selects the sampling method that satisfies the requirements checked in step S38. In other words, this step selects the sampling method whose increase in the number of columns converges earlier (at the stage where test data with the smallest number of samples has been executed).
In step S40, the sampling method selection unit 16 determines whether data profiles have been generated from the test data of all the numbers of samples. If so, the process proceeds to step S41 (Yes); if not, the process returns to step S32 and moves on to the next number of samples, which is larger than the current one (No).
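The loop formed by steps S32 to S40 may be sketched as follows. This is a simplified illustration, not the embodiment's implementation: `count_columns` is a hypothetical callback standing in for steps S33 to S36 (sample, run the processing component, count the columns), and returning the first matching method when several satisfy the requirements at the same time is an assumption, since the text does not specify a tie-break.

```python
from typing import Callable, Dict, List, Optional


def select_sampling_method(
    methods: List[str],
    sample_counts: List[int],
    count_columns: Callable[[str, int], int],
    threshold: int,
) -> Optional[str]:
    """Return the first method whose last two column-count increases are
    both within the threshold, or None if no method converges."""
    history: Dict[str, List[int]] = {m: [] for m in methods}
    for pass_no, n in enumerate(sample_counts, start=1):   # S32
        for m in methods:                                  # S33 to S36
            history[m].append(count_columns(m, n))
        if pass_no < 3:
            continue  # S37/S38 run only from the third pass onward
        for m in methods:                                  # S37, S38
            two_back, previous, current = history[m][-3:]
            if previous - two_back <= threshold and current - previous <= threshold:
                return m                                   # S39
    return None  # all sample counts processed (S40): fall through to S41
```

Because the loop stops at the first pass in which some method satisfies the requirements, the method that converges at the smallest number of samples is the one returned.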

In step S41, the sampling method selection unit 16 refers to the column number table 37 and selects the sampling method for which the increase in the number of columns of the processing result accompanying the increase in the number of samples has converged the most. Specifically, as one example, it selects the sampling method for which the following expression yields the smallest value.
(number of columns at the previous number of samples - number of columns at the second-previous number of samples) + (number of columns at the current number of samples - number of columns at the previous number of samples)

As in step S38 above, the degree of convergence could be identified from the value of "(number of columns at the current number of samples - number of columns at the previous number of samples)" alone. However, to allow for cases where the number of columns of the processing result increases irregularly, the value of "(number of columns at the previous number of samples - number of columns at the second-previous number of samples)" is added as shown above.
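The fallback selection of step S41 may be sketched as below. The function name and the dictionary representation of the column number table 37 are assumptions, the column counts in the usage line are made-up values, and each method is assumed to have at least three stored counts.

```python
from typing import Dict, List


def most_converged(table: Dict[str, List[int]]) -> str:
    """Return the method with the smallest sum of its last two increases."""

    def recent_growth(counts: List[int]) -> int:
        two_back, previous, current = counts[-3:]
        return (previous - two_back) + (current - previous)

    return min(table, key=lambda method: recent_growth(table[method]))


# Hypothetical column counts for two methods.
print(most_converged({"a": [2, 5, 9], "b": [4, 6, 7]}))  # b
```

Note that the sum of the two increases equals the current count minus the second-previous count, so this criterion measures the total growth over the last two steps.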

<Specific example of the sampling method selection process>
The sampling method selection process is now described using specific example data.
Consider the case where the input data of a certain processing component is the input data shown in FIG. 2, and one of the sampling methods stored in the sampling method table 34 shown in FIG. 9 is to be selected, namely "random sampling", "sampling that includes the maximum and minimum values", or "sampling that includes more distinct values". Assume that the array of numbers of samples is 2, 3, and 4 records. Also assume that the column used as the basis for sampling (the column from which the maximum value, minimum value, distinct values, and so on are taken) is "accounting time", the column of the input data whose content the processing component uses to generate new columns.

In this specific example, the test data that the test data generation unit 14 generates for each sampling method, at each number of samples in the array, is the content of the test data table 35 shown in FIG. 10. The test data processing unit 15 then processes these test data with the processing component in ascending order of the number of samples. The data profiles generated from the resulting output data form the content of the data profile temporary storage file 36 shown in FIGS. 11 to 13. Furthermore, from the content of the data profile temporary storage file 36, the test data processing unit 15 stores in the column number table 37, for each sampling method and each number of samples, the number of columns indicating accounting-time time slots among the columns of the data profile (that is, among the columns of the output data of the processing result), as shown in FIG. 14.

Referring to the column number table 37 in FIG. 14, "sampling that includes the maximum and minimum values" yields the same number of columns, 12, whether the number of samples is 2, 3, or 4. Suppose that the threshold for the determination in step S38 of the sampling method selection process is set to "1". In the determination of step S38 when the number of samples is 4, (number of columns at the previous number of samples - number of columns at the second-previous number of samples) is (12 - 12) = 0, and (number of columns at the current number of samples - number of columns at the previous number of samples) is also (12 - 12) = 0. The requirements of step S38 are therefore satisfied, which shows that the increase in the number of columns of the processing result accompanying the increase in the number of samples has converged. Accordingly, "sampling that includes the maximum and minimum values" is selected as the most appropriate sampling method.

The sample data generation unit 17 then generates sample data by sampling, with "sampling that includes the maximum and minimum values", 2 records, which is the first number of samples at which the number of columns converged. The result is the sample data of the sample data table 38 shown in FIG. 15. The sample data processing unit 18 then executes the processing of the processing component with this sample data as input and generates the data profile shown in FIG. 16.

The number of columns increases as in the above example when, for instance, the target processing component converts the table structure, as described earlier, so that it indicates as positive or negative whether there were sales in each time slot within the range of times present in the "accounting time" column of the input data shown in FIG. 2. The correlation between the number of samples of the test data and the number of columns of the output data in this example is as shown in FIG. 5. If the user knew in advance that the target processing component performs this kind of processing, the user could predict that "sampling that includes the maximum and minimum values" is the most appropriate sampling method. However, as noted earlier, the user does not necessarily know the processing content of a processing component. Even in such a case, the present embodiment can select an appropriate sampling method by identifying how the number of columns converges, as described above, and can then generate the data profile from a small amount of sample data.
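Why a sample containing only the minimum and maximum accounting times already produces every time-slot column can be seen from a small Python sketch of such a processing component. The function name, the one-hour slot width, and the "HH:MM" time format are all assumptions for illustration, and the times used below are made-up values rather than the data of FIG. 2.

```python
from typing import List


def time_slot_columns(times: List[str]) -> List[str]:
    """Return one column name per hourly slot between the earliest and
    latest accounting time present in the data (inclusive)."""
    hours = [int(t.split(":")[0]) for t in times]
    return [f"{hour:02d}:00" for hour in range(min(hours), max(hours) + 1)]


# Two records that span the whole range already yield every slot column;
# adding a record in between does not add columns.
print(len(time_slot_columns(["09:10", "20:45"])))           # 12
print(len(time_slot_columns(["09:10", "14:00", "20:45"])))  # 12
```

Since the set of output columns depends only on the two extremes, any sampling method that guarantees both extremes reaches the full column count at the smallest number of samples, whereas random sampling reaches it only once it happens to pick them.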

<Effects of this embodiment>
According to the present embodiment, test data with a plurality of sample counts is generated from the input data of a processing component for each of a plurality of sampling methods. The processing component is then run on the test data, and the way the number of columns in the processing result grows as the number of samples increases is identified for each sampling method. Through the process of step S38 described above, the sampling method whose column count stops increasing at the earliest stage (that is, at the smallest number of samples) is selected as the suitable sampling method. An efficient sampling method can therefore be selected, and the data profile of the data output by the processing component can be obtained with a small amount of processing. Because column information can be extracted from the data profile, the setting screen of a subsequent processing component, which takes the output of this processing component as its input, can present the user with a list of candidate column names to process. As a result, the user can select column names correctly.
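As a rough illustration of how column information from a data profile could feed a setting screen, the following Python sketch collects column names from a component's output rows. The profile layout (a list of `column`/`type` entries) and the function names are assumptions made for this example; the actual structure shown in FIG. 16 may differ.

```python
def build_data_profile(output_rows):
    """Collect each column name and a rough type label from output rows
    (hypothetical profile layout, in first-seen column order)."""
    seen = {}
    for row in output_rows:
        for name, value in row.items():
            seen.setdefault(name, type(value).__name__)
    return [{"column": name, "type": label} for name, label in seen.items()]

def column_choices(profile):
    """Column names a subsequent component's setting screen could offer."""
    return [entry["column"] for entry in profile]

# Illustrative output of a processing component run on sample data.
sample_output = [
    {"store": "A", "09:00": True},
    {"store": "B", "09:00": False, "10:00": True},
]
profile = build_data_profile(sample_output)
print(column_choices(profile))
```

The point of the sketch is only that the choices come from running the component on a small sample, not from the full input data.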

Further, according to the present embodiment, the process of step S38 considers not only the requirement that the increase in the number of columns between the current and previous sample counts be equal to or less than the threshold, but also the requirement that the increase between the previous sample count and the one before it be equal to or less than the threshold. As described earlier, depending on the sampling method and on what the processing component does, the number of columns in the processing result may grow irregularly. The additional requirement reduces the risk of mistaking a state in which the increase merely happened to be small for a state in which the increase has converged.
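The convergence check and the selection it drives can be sketched in Python as follows. This is only one possible reading of step S38: the function names, the choice to report the first sample count of the flat run, and the column counts in the example are assumptions, not values taken from the figures.

```python
def converged_at(column_counts, threshold=0, runs=2):
    """Index of the first test data set in a run of `runs` consecutive
    increases that are all <= threshold, or None if no such run exists."""
    streak = 0
    for i in range(1, len(column_counts)):
        if column_counts[i] - column_counts[i - 1] <= threshold:
            streak += 1
            if streak >= runs:
                return i - runs
        else:
            streak = 0
    return None

def select_sampling_method(counts_by_method, sample_sizes, threshold=0, runs=2):
    """Pick the method whose column count converges at the smallest sample size."""
    best = None
    for method, counts in counts_by_method.items():
        i = converged_at(counts, threshold, runs)
        if i is not None and (best is None or sample_sizes[i] < best[1]):
            best = (method, sample_sizes[i])
    return best

# Hypothetical column counts per sampling method at each sample size.
sample_sizes = [2, 4, 8, 16, 32]
counts_by_method = {
    "random": [2, 3, 5, 6, 6],
    "min_max": [9, 9, 9, 9, 9],
    "distinct": [2, 4, 8, 9, 9],
}
print(select_sampling_method(counts_by_method, sample_sizes))  # ('min_max', 2)
```

Requiring two consecutive small increases matters for sequences such as `[1, 1, 5, 5, 5]`: with `runs=1` the first flat step would be taken as convergence, while with `runs=2` the check waits for the later, stable plateau.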

Further, in the present embodiment, the candidate sampling methods include, for example, "random sampling", "sampling that includes the maximum and minimum values", and "sampling that includes more distinct values". Keeping several sampling methods with different characteristics as candidates makes it possible to handle a wide variety of input data and processing component behavior.
For example, even when columns are generated from accounting-time data as described above, if the processing component does not generate columns for time slots that have no corresponding data, "sampling that includes more distinct values" obtains the column information more efficiently than "sampling that includes the maximum and minimum values". With "random sampling", several records with the same time may be drawn, so the increase in the number of columns converges more slowly than with "sampling that includes more distinct values".
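The three candidate methods can be sketched in Python as below. These are simplified, assumed implementations over a single list of values (here, hypothetical accounting hours); the function names and tie-breaking rules are illustrative and are not specified in the text above.

```python
import random

def random_sampling(values, n, seed=0):
    """Draw n values uniformly without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(values, min(n, len(values)))

def min_max_sampling(values, n):
    """Always include the smallest and largest value, then fill up to n."""
    ordered = sorted(values)
    picked = [ordered[0], ordered[-1]] if len(ordered) > 1 else ordered[:1]
    return picked + ordered[1:-1][: max(n - len(picked), 0)]

def distinct_sampling(values, n):
    """Prefer values not seen before; fall back to repeats only when needed."""
    seen, firsts, repeats = set(), [], []
    for v in values:
        if v in seen:
            repeats.append(v)
        else:
            seen.add(v)
            firsts.append(v)
    return (firsts + repeats)[:n]

# Hypothetical accounting hours; a component that creates a column only for
# hours that actually appear would produce one column per distinct value.
hours = [10, 10, 10, 11, 11, 13]
print(len(set(distinct_sampling(hours, 3))))  # 3 columns
print(len(set(min_max_sampling(hours, 3))))   # 2 columns
```

For a component that skips empty time slots, the distinct-value sample reaches all three columns with three records, while the min/max sample of the same size still misses one.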

As another example, suppose the input data contains a column indicating importance, and that column holds values such as "A+, A-, B+, B-, C+, C-". Assume a processing component that adds to the input data a column grouping these values by their leading character ("A, B, C" in this example). With such input data and such a processing component, obtaining all of the column information with "random sampling" requires increasing the number of samples until every leading character is included. The same applies to "sampling that includes the maximum and minimum values". With "sampling that includes more distinct values", in contrast, the column information can be obtained efficiently.

As yet another example, suppose the input data is the set of words contained in given text data, and assume a processing component that, in order to perform clustering analysis on that set of words, creates columns for a predetermined number of words in order of frequency of occurrence. With such input data and such a processing component, if "sampling that includes more distinct values" is selected, for example, every word ends up with a frequency of "1", so the number of columns only increases and does not converge. With "random sampling", in contrast, the sampled data approximates the true word frequencies of the text data as a whole, and the increase in the number of columns converges appropriately.

Although the description above processes table-structured input data and relies on how the increase in the number of columns converges, the data targeted by the technique described in this embodiment is not limited to data with that structure. The structure of the data profile is likewise only an example; data of any structure may be used as long as it allows the number of constituent elements, which varies with the content of the input data, to be identified.

[Hardware configuration, etc.]
FIG. 20 shows an example of the hardware configuration of a computer that functions as the analysis server 1 described above. The computer includes a processor 101, a memory 102, a storage 103, a portable storage medium drive device 104, an input/output device 105, and a communication interface 106.
The processor 101 includes a control unit, an arithmetic unit, an instruction decoder, and the like. An execution unit performs arithmetic and logical operations using the arithmetic unit, in accordance with the program instructions decoded by the instruction decoder and in response to control signals output from the control unit. The processor 101 also includes control registers that store various information used for control, a cache that can temporarily hold contents of the memory 102 and the like that have already been accessed, and a TLB that serves as a cache of the virtual-memory page table. The processor 101 may have a plurality of CPU (Central Processing Unit) cores.

The memory 102 is a storage device such as a RAM (Random Access Memory). It is the main memory into which the program executed by the processor 101 is loaded and in which data used in the processing of the processor 101 is stored. The storage 103 is a storage device such as an HDD (Hard Disk Drive) or a flash memory, and stores programs and various data. The portable storage medium drive device 104 reads data and programs stored in a portable storage medium 107. The portable storage medium 107 is, for example, a magnetic disk, an optical disk, a magneto-optical disk, or a flash memory. The processor 101 executes programs stored in the storage 103 or the portable storage medium 107 while cooperating with the memory 102 and the storage 103. The programs executed by the processor 101 and the data to be accessed may be stored in another device that can communicate with the computer. The storage means of the analysis server 1 described in this embodiment refers to at least one of the memory 102, the storage 103, the portable storage medium 107, and another device that can communicate with the computer.

The input/output device 105 is, for example, a keyboard, a touch panel, or a display. It accepts operation commands issued through user operations and the like, and outputs the results of processing by the computer.
The communication interface 106 can include, for example, a LAN (Local Area Network) card, as well as radio-frequency receivers and transmitters and optical receivers and transmitters. These receivers and transmitters can be implemented to operate over one or more communication networks, such as a Wi-Fi network, a Bluetooth network, or Long Term Evolution.
These components of the computer are connected by a bus 108.

[Others]
The functional and physical configurations of the computer described in this specification are not limited to the forms described above. For example, the functions and physical resources may be implemented in an integrated manner or, conversely, may be distributed further.
In this specification, wherever a comparison with a threshold or the like is described as "equal to or greater than" or "equal to or less than", the description is not limited to that wording unless otherwise noted, and may be replaced as appropriate with "greater than (exceeding)" or "less than (below)". The reverse also applies.

The following appendices are further disclosed with respect to the embodiment described above.
(Appendix 1)
An information processing program that causes a computer to execute a process comprising:
acquiring input data for a target processing component, the target processing component being one processing component included in a processing flow that combines a plurality of processing components and performing processing that generates output data having constituent elements that depend on the content of the input data;
generating, for each of a plurality of preset sampling methods, a plurality of test data by sampling data of a plurality of different sample counts from the input data in accordance with that sampling method;
for each of the plurality of sampling methods, executing processing by the target processing component with each of the test data as input while increasing the number of samples of the test data, and identifying the number of constituent elements in the output data of each processing result; and
selecting, from among the plurality of sampling methods, the sampling method for which the increase in the number of constituent elements of the output data in response to the increase in the number of samples of the test data becomes equal to or less than a predetermined threshold at the stage where the test data with the smallest number of samples has been executed.

(Appendix 2)
The information processing program according to appendix 1, wherein the process of selecting the sampling method selects the sampling method only when the increase in the number of constituent elements of the output data in response to the increase in the number of samples of the test data is equal to or less than the predetermined threshold a plurality of consecutive times.

(Appendix 3)
The information processing program according to appendix 1 or 2, further causing the computer to execute a process comprising:
generating sample data by sampling data from the input data in accordance with the selected sampling method; and
executing processing by the target processing component with the sample data as input, extracting information on the constituent elements from the output data of the processing result, and generating a data profile that includes the information on the constituent elements.

(Appendix 4)
The information processing program according to appendix 3, further causing the computer to execute a process of referring to the data profile of the output data of the target processing component and displaying choices of the constituent elements of the output data of the target processing component on a setting screen for parameters used in processing by a subsequent processing component that takes the output data of the target processing component as input.

(Appendix 5)
The information processing program according to any one of appendices 1 to 4, wherein the input data is data having a table structure and the constituent elements are columns of the table structure.

(Appendix 6)
The information processing program according to any one of appendices 1 to 5, wherein the plurality of sampling methods include at least one of random sampling, sampling that includes the maximum and minimum values, and sampling that includes more distinct values.

(Appendix 7)
An information processing method in which a computer executes a process comprising:
acquiring input data for a target processing component, the target processing component being one processing component included in a processing flow that combines a plurality of processing components and performing processing that generates output data having constituent elements that depend on the content of the input data;
generating, for each of a plurality of preset sampling methods, a plurality of test data by sampling data of a plurality of different sample counts from the input data in accordance with that sampling method;
for each of the plurality of sampling methods, executing processing by the target processing component with each of the test data as input while increasing the number of samples of the test data, and identifying the number of constituent elements of each processing result; and
selecting, from among the plurality of sampling methods, the sampling method for which the increase in the number of constituent elements of the output data in response to the increase in the number of samples of the test data becomes equal to or less than a predetermined threshold at the stage where the test data with the smallest number of samples has been executed.

(Appendix 8)
An information processing apparatus comprising:
a data reading unit that acquires input data for a target processing component, the target processing component being one processing component included in a processing flow that combines a plurality of processing components and performing processing that generates output data having constituent elements that depend on the content of the input data;
a test data generation unit that generates, for each of a plurality of preset sampling methods, a plurality of test data by sampling data of a plurality of different sample counts from the input data in accordance with that sampling method;
a test data processing unit that, for each of the plurality of sampling methods, executes processing by the target processing component with each of the test data as input while increasing the number of samples of the test data, and identifies the number of constituent elements of each processing result; and
a sampling method selection unit that selects, from among the plurality of sampling methods, the sampling method for which the increase in the number of constituent elements of the output data in response to the increase in the number of samples of the test data becomes equal to or less than a predetermined threshold at the stage where the test data with the smallest number of samples has been executed.

Description of symbols: 1 ... analysis server, 11 ... analysis flow reading unit, 12 ... analysis flow change detection unit, 13 ... data profile generation unit, 14 ... test data generation unit, 15 ... test data processing unit, 16 ... sampling method selection unit, 17 ... sample data generation unit, 18 ... sample data processing unit, 19 ... analysis flow update unit, 20 ... column selection setting unit, 21 ... analysis flow execution unit, 31 ... analysis flow table, 32 ... component database, 33 ... input data table, 34 ... sampling method table, 35 ... test data table, 36 ... data profile temporary storage file, 37 ... column count table, 38 ... sample data table, 39 ... data profile table

Claims

An information processing program that causes a computer to execute a process comprising:
acquiring input data for a target processing component, the target processing component being one processing component included in a processing flow that combines a plurality of processing components and performing processing that generates output data having constituent elements that depend on the content of the input data;
generating, for each of a plurality of preset sampling methods, a plurality of test data by sampling data of a plurality of different sample counts from the input data in accordance with that sampling method;
for each of the plurality of sampling methods, executing processing by the target processing component with each of the test data as input while increasing the number of samples of the test data, and identifying the number of constituent elements in the output data of each processing result; and
selecting, from among the plurality of sampling methods, the sampling method for which the increase in the number of constituent elements of the output data in response to the increase in the number of samples of the test data becomes equal to or less than a predetermined threshold at the stage where the test data with the smallest number of samples has been processed.

The information processing program according to claim 1, wherein the process of selecting the sampling method selects the sampling method only when the increase in the number of constituent elements of the output data in response to the increase in the number of samples of the test data is equal to or less than the predetermined threshold a plurality of consecutive times.

The information processing program according to claim 1, further causing the computer to execute a process comprising:
generating sample data by sampling data from the input data in accordance with the selected sampling method; and
executing processing by the target processing component with the sample data as input, extracting information on the constituent elements from the output data of the processing result, and generating a data profile that includes the information on the constituent elements.

The information processing program according to claim 3, further causing the computer to execute a process of referring to the data profile of the output data of the target processing component and displaying choices of the constituent elements of the output data of the target processing component on a setting screen for parameters used in processing by a subsequent processing component that takes the output data of the target processing component as input.

The information processing program according to any one of claims 1 to 4, wherein the input data is data having a table structure and the constituent elements are columns of the table structure.

The information processing program according to claim 1, wherein the plurality of sampling methods include at least one of random sampling, sampling that includes the maximum and minimum values, and sampling that includes more distinct values.

An information processing method in which a computer executes a process comprising:
acquiring input data for a target processing component, the target processing component being one processing component included in a processing flow that combines a plurality of processing components and performing processing that generates output data having constituent elements that depend on the content of the input data;
generating, for each of a plurality of preset sampling methods, a plurality of test data by sampling data of a plurality of different sample counts from the input data in accordance with that sampling method;
for each of the plurality of sampling methods, executing processing by the target processing component with each of the test data as input while increasing the number of samples of the test data, and identifying the number of constituent elements in the output data of each processing result; and
selecting, from among the plurality of sampling methods, the sampling method for which the increase in the number of constituent elements of the output data in response to the increase in the number of samples of the test data becomes equal to or less than a predetermined threshold at the stage where the test data with the smallest number of samples has been processed.

An information processing apparatus comprising:
a data reading unit that acquires input data for a target processing component, the target processing component being one processing component included in a processing flow that combines a plurality of processing components and performing processing that generates output data having constituent elements that depend on the content of the input data;
a test data generation unit that generates, for each of a plurality of preset sampling methods, a plurality of test data by sampling data of a plurality of different sample counts from the input data in accordance with that sampling method;
a test data processing unit that, for each of the plurality of sampling methods, executes processing by the target processing component with each of the test data as input while increasing the number of samples of the test data, and identifies the number of constituent elements in the output data of each processing result; and
a sampling method selection unit that selects, from among the plurality of sampling methods, the sampling method for which the increase in the number of constituent elements of the output data in response to the increase in the number of samples of the test data becomes equal to or less than a predetermined threshold at the stage where the test data with the smallest number of samples has been processed.