JP7396213B2

JP7396213B2 - Data analysis system, data analysis method, and data analysis program

Info

Publication number: JP7396213B2
Application number: JP2020106939A
Authority: JP
Inventors: 俊宏井口
Original assignee: TDK Corp
Current assignee: TDK Corp
Priority date: 2020-06-22
Filing date: 2020-06-22
Publication date: 2023-12-12
Anticipated expiration: 2040-06-22
Also published as: JP2022002029A

Description

The present invention relates to a data analysis system, a data analysis method, and a data analysis program.

A known data analysis method creates a predictive model that represents the relationship between an objective variable and explanatory variables based on the data set to be analyzed, and then performs analysis based on that model (see, for example, Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Publication No. 2020-24544 (JP2020-24544A)

With a data analysis method such as the one described above, using machine learning, for example, can yield a model that predicts the objective variable from the explanatory variables with high accuracy. However, the resulting model is often not easy to interpret, which can make data analysis difficult. In addition, when the analysis target contains character data as well as numerical data, the analysis may not be performed appropriately, so the degree of freedom of the analysis target is low.

An object of the present invention is to provide a data analysis system, a data analysis method, and a data analysis program that can facilitate data analysis and improve the degree of freedom of the analysis target.

The data analysis system of the present invention includes at least one processor. The at least one processor receives a data set including a plurality of data units, each of which is a collection of data for a plurality of items; calculates, based on the data set and by a calculation method used in hypothesis testing, a significance probability between an objective variable consisting of one of the items and each of a plurality of explanatory variables consisting of two or more of the other items; and causes a display unit to display the explanatory variables in ascending order of significance probability.

In this data analysis system, the significance probability between the objective variable and each of the explanatory variables is calculated by a calculation method used in hypothesis testing, and the explanatory variables are displayed on the display unit in ascending order of significance probability. The user can therefore easily identify explanatory variables with a small significance probability, that is, explanatory variables expected to be strongly related to the objective variable. The user can also compare the explanatory variables using the significance probability as a common criterion. Because a significance probability is obtained whichever hypothesis testing method's calculation is used, explanatory variables can be compared on the same basis even when different hypothesis testing methods are applied to them. As a result, explanatory variables can be suitably compared even when, for example, the analysis target contains both numerical data and character data. This data analysis system therefore facilitates data analysis and improves the degree of freedom of the analysis target. Note that although the system calculates the significance probability using a calculation method from hypothesis testing, the hypothesis test itself need not be performed. A hypothesis test is a statistical procedure for deciding, based on observed values, whether to reject the null hypothesis in favor of the alternative hypothesis or not to reject the null hypothesis.
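The ordering described above can be sketched in a few lines of Python. This is only an illustration: the function name `rank_by_significance`, the item names, and the p-values are hypothetical and do not come from the publication.

```python
def rank_by_significance(p_values):
    """Return (name, p) pairs sorted so the smallest significance probability comes first."""
    return sorted(p_values.items(), key=lambda item: item[1])


# Hypothetical p-values; each could have been produced by a different test method.
p_values = {"item B": 0.30, "item E": 0.002, "item C": 0.04}
ranking = rank_by_significance(p_values)
for name, p in ranking:
    print(f"{name}: p = {p:.3g}")
```

Because every method yields a value on the same 0 to 1 scale, a single sort is enough regardless of which method produced each entry.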

The calculation method used by the at least one processor may include a calculation method of a nonparametric test. The calculation method of a nonparametric test is unlikely to lose accuracy even when the data to be analyzed contain abnormal values such as outliers, and it requires no assumption about the population distribution, so the degree of freedom of the analysis target can be improved further.

The calculation method used by the at least one processor may include a calculation method of a first hypothesis testing method and a calculation method of a second hypothesis testing method different from the first. When both the objective variable and the explanatory variable consist of numerical data, the at least one processor may calculate the significance probability using the calculation method of the first hypothesis testing method; when at least one of the objective variable and the explanatory variable consists of character data, it may calculate the significance probability using the calculation method of the second hypothesis testing method. In this case, the significance probability can be suitably calculated even when the analysis target contains both numerical data and character data.

The second hypothesis testing method may include a third hypothesis testing method and a fourth hypothesis testing method different from the third. When one of the objective variable and the explanatory variable consists of numerical data and the other consists of character data, the at least one processor may calculate the significance probability using the calculation method of the third hypothesis testing method; when both the objective variable and the explanatory variable consist of character data, it may calculate the significance probability using the calculation method of the fourth hypothesis testing method. In this case, the significance probability can be calculated even more suitably when the analysis target contains both numerical data and character data.

The at least one processor may calculate a plurality of significance probabilities between the objective variable and an explanatory variable using the calculation methods of a plurality of mutually different hypothesis testing methods, and may take the smallest of them as the significance probability between the objective variable and that explanatory variable. In this case, the significance probability can be calculated with higher accuracy.

The plurality of explanatory variables may include a first explanatory variable and a second explanatory variable. The at least one processor may calculate the significance probability between the objective variable and the first explanatory variable using the calculation method of the first hypothesis testing method, and may calculate the significance probability between the objective variable and the second explanatory variable using the calculation method of a second hypothesis testing method different from the first. In this case, the significance probabilities can be calculated using the calculation methods of both the first and second hypothesis testing methods, and the degree of freedom of the analysis target can be improved further.

The at least one processor may cause the display unit to display a graph showing the relationship between the objective variable and one explanatory variable selected from among the explanatory variables displayed on the display unit. In this case, the user can easily grasp the relationship between the selected explanatory variable and the objective variable.

The data analysis method of the present invention is executed by a data analysis system including at least one processor, and includes the steps of: receiving a data set including a plurality of data units, each of which is a collection of data for a plurality of items; calculating, based on the data set and by a calculation method used in hypothesis testing, a significance probability between an objective variable consisting of one of the items and each of a plurality of explanatory variables consisting of two or more of the other items; and causing a display unit to display the explanatory variables in ascending order of significance probability. For the reasons given above, this data analysis method facilitates data analysis and improves the degree of freedom of the analysis target.

The data analysis program of the present invention causes a computer to execute the steps of: receiving a data set including a plurality of data units, each of which is a collection of data for a plurality of items; calculating, based on the data set and by a calculation method used in hypothesis testing, a significance probability between an objective variable consisting of one of the items and each of a plurality of explanatory variables consisting of two or more of the other items; and causing a display unit to display the explanatory variables in ascending order of significance probability. For the reasons given above, this data analysis program facilitates data analysis and improves the degree of freedom of the analysis target.

According to the present invention, it is possible to provide a data analysis system, a data analysis method, and a data analysis program that can facilitate data analysis and improve the degree of freedom of the analysis target.

FIG. 1 is a diagram showing an example of the functional configuration of a data analysis system according to an embodiment.
FIG. 2 is a diagram showing an example of the hardware configuration of a computer constituting the data analysis system.
FIG. 3 is a flowchart showing an example of the operation of the data analysis system.
FIG. 4 is a diagram showing an example of a data set.
FIG. 5 is a diagram showing an example of a method of converting character data into numerical data.
FIG. 6 is a diagram showing a display example of the display unit.
FIG. 7 is a diagram showing an example of a graph.
FIG. 8 is a diagram showing an example of a graph.
FIG. 9 is a diagram showing an example of a graph.
FIGS. 10(a) to 10(c) are diagrams showing examples of graphs.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. In the following description, identical or equivalent elements are denoted by the same reference numerals, and redundant description is omitted.

[System configuration]

As shown in FIG. 1, the data analysis system 1 according to the embodiment includes, as functional elements, a reception unit 11, a calculation unit 12, and a display control unit 13. The reception unit 11 receives a data set 30. The calculation unit 12 calculates, based on the data set 30, the significance probability between the objective variable and each of the explanatory variables. The display control unit 13 causes the display unit 26, described later, to display the explanatory variables in ascending order of significance probability.

The data analysis system 1 is constituted by, for example, a computer 20. As shown in FIG. 2, the computer 20 includes a processor 21, a main storage unit 22, an auxiliary storage unit 23, a communication control unit 24, an input unit 25, and a display unit 26. The processor 21 is, for example, a CPU, and executes an operating system, application programs, and the like. The main storage unit 22 is constituted by, for example, a ROM and a RAM. The auxiliary storage unit 23 is constituted by, for example, a hard disk or a flash memory, and stores a larger amount of data than the main storage unit 22. The communication control unit 24 is constituted by, for example, a network card or a wireless communication module. The input unit 25 is constituted by, for example, a keyboard, a mouse, or a touch panel. The display unit 26 is constituted by, for example, a monitor or a touch panel display.

Each functional element of the data analysis system 1 is realized by executing a data analysis program 27 stored in advance in the auxiliary storage unit 23. Specifically, the functions of the reception unit 11, the calculation unit 12, and the display control unit 13 are realized by loading the data analysis program 27 onto the processor 21 or the main storage unit 22 and having the processor 21 execute it. In accordance with the data analysis program 27, the processor 21 operates the communication control unit 24, the input unit 25, and the display unit 26, and reads and writes data in the main storage unit 22 and the auxiliary storage unit 23. Data or databases required for processing are stored in the main storage unit 22 or the auxiliary storage unit 23.

The data analysis program 27 may be provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, a DVD-ROM, or a semiconductor memory. That is, the data analysis program 27 may be provided recorded on a computer-readable recording medium. Alternatively, the data analysis program 27 may be provided via a communication network as a data signal superimposed on a carrier wave.

The data analysis system 1 may be constituted by a single computer 20 or by a plurality of computers 20. When a plurality of computers 20 are used, they may be connected to one another via a communication network such as the Internet or an intranet so as to logically form one data analysis system 1.

[System operation]

An example of the data analysis method executed by the data analysis system 1 will be described with reference to FIG. 3. First, the reception unit 11 receives the data set 30 (step S1). The data set 30 is input to the reception unit 11 by, for example, the user via the input unit 25 and the display unit 26. For example, when the user specifies a data set 30 stored in the auxiliary storage unit 23, the specified data set 30 is read and received by the reception unit 11.

The data set 30 is the analysis target and includes a plurality of data units 31, each of which is a collection of data for a plurality of items. The items of a data unit 31 may be set arbitrarily. The items may be, for example, properties or compositions of materials or compounds, or properties, dimensions, or materials of apparatuses or devices. The data of each item are numerical data or character data. Character data are data other than numerical data and are represented by characters or symbols. As described later, character data may be converted into numerical data before use.

The data set 30 may be, for example, a collection of data acquired in a manufacturing process at a factory. The spread of the IoT (Internet of Things) is expected to make it possible to acquire large amounts of data in manufacturing processes. The items may include the quality and characteristics of manufactured products, manufacturing conditions, and the like. Examples of product quality or characteristics include defect rate, breakdown voltage, and short-circuit rate. Examples of manufacturing conditions include unique numbers or symbols assigned to manufacturing equipment, the mean or variance of material thickness, and the duration or number of times a process is performed. The data set 30 may be time-series data. In this case, one item may consist of numerical data representing the time at which, or the order in which, each data unit 31 was acquired.

The data set 30 may include data units 31 containing missing values. A missing value means that a datum is absent. The data set 30 may also include data units 31 containing abnormal values (outliers). An abnormal value is a value that deviates extremely from the other data of the same item and can arise from, for example, a measurement or recording error. The handling of missing values and abnormal values is described later. The number of data units 31 is not limited and may be, for example, several hundred or more. The number of items is not limited and may be, for example, several thousand or more.

FIG. 4 is a diagram showing an example of the data set 30. In this example, the data set 30 is represented in tabular form. Each row corresponds to a data unit 31, and each column corresponds to an item. Each data unit 31 includes, as items, item A, item B, item C, item D, item E, item F, item G, and item H. For example, items A to D consist of numerical data, and items E to H consist of character data.
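As a rough illustration of the tabular layout just described, the sketch below holds each data unit as a Python dictionary and classifies each item as numerical or character data. The helper names `column` and `item_kind` and all cell values are assumptions made for illustration; the actual values of FIG. 4 are not reproduced here.

```python
def column(data_set, item):
    """Collect the values of one item (column) across all data units (rows)."""
    return [unit[item] for unit in data_set]


def item_kind(values):
    """Return 'numeric' if every non-missing value is a number, else 'character'."""
    present = [v for v in values if v is not None]
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "numeric" if is_number else "character"


# Hypothetical data units; None marks a missing value.
data_set = [
    {"A": 1.2, "B": 10, "E": "X", "F": "lot-1"},
    {"A": 0.8, "B": 12, "E": "Y", "F": "lot-2"},
    {"A": 1.5, "B": None, "E": "X", "F": "lot-1"},
]
kinds = {item: item_kind(column(data_set, item)) for item in data_set[0]}
print(kinds)
```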

Following step S1, the reception unit 11 receives analysis conditions (step S2). The analysis conditions are input to the reception unit 11 by, for example, the user via the input unit 25 and the display unit 26. The analysis conditions include the designation of the objective variable and the explanatory variables. The user selects one of the items in the data set 30 as the objective variable and selects a plurality of the remaining items as explanatory variables. For example, a selection box for choosing the objective variable is displayed on the display unit 26, and the user selects the objective variable by choosing an item in that box. The items other than the one selected as the objective variable are then set as explanatory variables. The system may also allow the user to choose which of the items other than the objective variable are to be set as explanatory variables.

The analysis conditions may also include the designation of numerical ranges for the objective variable and for each explanatory variable, and may include the designation of which data units 31 are to be analyzed.
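One possible way to apply such conditions is sketched below. The function `apply_conditions`, its parameters, and the choice to treat a missing value as out of range are all assumptions for illustration and are not specified in the publication.

```python
def apply_conditions(data_set, ranges=None, unit_indices=None):
    """Keep data units that are selected and whose values lie inside the given ranges.

    ranges maps an item name to a (low, high) pair; unit_indices is an optional
    collection of row indices to analyze. Both are hypothetical parameters.
    """
    ranges = ranges or {}
    selected = []
    for index, unit in enumerate(data_set):
        if unit_indices is not None and index not in unit_indices:
            continue
        # Assumption: a missing value (None) fails any range condition on that item.
        in_range = all(
            unit[item] is not None and low <= unit[item] <= high
            for item, (low, high) in ranges.items()
        )
        if in_range:
            selected.append(unit)
    return selected


data_set = [{"A": 1.0, "B": 5}, {"A": 2.5, "B": 7}, {"A": 9.0, "B": 6}]
subset = apply_conditions(data_set, ranges={"A": (0.0, 3.0)}, unit_indices={1, 2})
print(subset)
```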

Following step S2, the calculation unit 12 calculates, based on the data set 30 and by a calculation method used in hypothesis testing, the significance probability (p-value) between the objective variable and each explanatory variable (step S3). In statistical hypothesis testing, the significance probability is the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one observed. A small significance probability suggests that the null hypothesis is unlikely to hold. The null hypothesis differs depending on the hypothesis testing method, but is, for example, the hypothesis that there is no relationship between the objective variable and the explanatory variable. The calculation unit 12 calculates the significance probability with respect to the objective variable for each of the explanatory variables. The calculation methods used by the calculation unit 12 include those of a plurality of mutually different hypothesis testing methods. As described below, the calculation unit 12 determines, for each combination of objective variable and explanatory variable, which hypothesis testing method's calculation to use.

When both the objective variable and the explanatory variable consist of numerical data, the calculation unit 12 calculates the significance probability using the calculation method of the first hypothesis testing method. When at least one of the objective variable and the explanatory variable consists of character data, the calculation unit 12 calculates the significance probability using the calculation method of the second hypothesis testing method.

More specifically, when one of the objective variable and the explanatory variable consists of numerical data and the other consists of character data, the calculation unit 12 calculates the significance probability using the calculation method of the third hypothesis testing method. When both the objective variable and the explanatory variable consist of character data, the calculation unit 12 calculates the significance probability using the calculation method of the fourth hypothesis testing method. That is, the calculation methods used by the calculation unit 12 include those of the first and second hypothesis testing methods, and the second hypothesis testing method includes the third and fourth hypothesis testing methods. The first, third, and fourth hypothesis testing methods differ from one another. Whichever of these calculation methods is used, a significance probability is obtained.
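The selection rule above amounts to a three-way dispatch on the data types of the two variables. A minimal sketch follows; the function `choose_method` and the string labels are hypothetical names introduced for illustration.

```python
def choose_method(objective_kind, explanatory_kind):
    """Pick the hypothesis testing method family from the two variables' data types."""
    kinds = {objective_kind, explanatory_kind}
    if kinds == {"numeric"}:
        return "first"   # both numerical
    if kinds == {"numeric", "character"}:
        return "third"   # one numerical, one character
    if kinds == {"character"}:
        return "fourth"  # both character
    raise ValueError(f"unknown data types: {objective_kind!r}, {explanatory_kind!r}")


for pair in [("numeric", "numeric"), ("character", "numeric"), ("character", "character")]:
    print(pair, "->", choose_method(*pair))
```

Using a set makes the rule symmetric, so it does not matter which of the two variables is the objective variable.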

The first hypothesis testing method is applicable when both variables under test are numerical data. It tests the correlation between the variables. Examples of the first hypothesis testing method include the test of Spearman's rank correlation coefficient and the test of Kendall's rank correlation coefficient, both of which are nonparametric tests. A nonparametric test is a statistical test that does not assume a specific population distribution such as the normal distribution. With a nonparametric test, accuracy is unlikely to degrade even when the data to be analyzed contain abnormal values such as outliers.
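To make the rank-correlation idea concrete, the following sketch computes Spearman's coefficient and an exact two-sided p-value by enumerating every permutation of the ranks. This exact permutation approach is a choice made here for a dependency-free example; it is practical only for small samples (roughly ten data units or fewer), it assumes neither variable is constant, and the publication does not state how the p-value is obtained. All function names are hypothetical.

```python
from itertools import permutations


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_exact(x, y):
    """Spearman's rho and the exact two-sided permutation p-value."""
    rx, ry = average_ranks(x), average_ranks(y)
    rho = pearson(rx, ry)
    extreme = total = 0
    for shuffled in permutations(ry):
        total += 1
        if abs(pearson(rx, shuffled)) >= abs(rho) - 1e-12:
            extreme += 1
    return rho, extreme / total


rho, p = spearman_exact([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(rho, p)
```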

The third hypothesis testing method is applicable when one of the variables under test is numerical data and the other is character data. It tests the difference in a representative value (of the numerical data) between levels (of the character data). Examples of the third hypothesis testing method include the Kruskal-Wallis test and the Fligner-Killeen test, both of which are nonparametric tests.
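A dependency-free sketch of the Kruskal-Wallis calculation is given below, using the standard H statistic with tie correction and the usual chi-square approximation for the p-value. The function names are hypothetical, the sketch assumes at least two levels and at least two distinct numerical values, and the Fligner-Killeen test is not covered.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^i / i! for i < df/2.
        return math.exp(-half) * sum(
            half ** i / math.factorial(i) for i in range(df // 2)
        )
    # Odd df: erfc term plus a finite sum over half-integer gamma values.
    tail = math.erfc(math.sqrt(half))
    tail += math.exp(-half) * sum(
        half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1)
    )
    return tail


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def kruskal_wallis(values, labels):
    """Return (H, p) for numerical values grouped by character labels."""
    n = len(values)
    rank_sums, sizes = Counter(), Counter()
    for rank, label in zip(average_ranks(values), labels):
        rank_sums[label] += rank
        sizes[label] += 1
    h = 12.0 / (n * (n + 1)) * sum(
        rank_sums[g] ** 2 / sizes[g] for g in sizes
    ) - 3.0 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    ties = Counter(values)
    h /= 1.0 - sum(t ** 3 - t for t in ties.values()) / float(n ** 3 - n)
    return h, chi2_sf(h, len(sizes) - 1)


h, p = kruskal_wallis([1, 2, 3, 4, 5, 6, 7, 8, 9], list("AAABBBCCC"))
print(h, p)
```

Note that the labels are used only to group the numerical values, so no conversion of character data to numbers is needed for this calculation.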

The fourth hypothesis testing method is applicable when both variables under test are character data. It tests the independence of the contingency table created from the two variables. Examples of the fourth hypothesis testing method include the chi-square test of independence and Fisher's exact test, both of which are nonparametric tests.
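As an illustration of a test on a contingency table, the sketch below builds a 2x2 table from two character variables and computes the two-sided p-value of Fisher's exact test from the hypergeometric distribution. It is deliberately restricted to variables with exactly two levels each; larger tables and the chi-square test are outside this sketch. The function name and sample labels are hypothetical.

```python
from collections import Counter
from math import comb


def fisher_exact_2x2(x_labels, y_labels):
    """Two-sided Fisher exact p-value for two character variables with two levels each."""
    x_levels = sorted(set(x_labels))
    y_levels = sorted(set(y_labels))
    if len(x_levels) != 2 or len(y_levels) != 2:
        raise ValueError("this sketch only handles two levels per variable")
    # Build the 2x2 contingency table [[a, b], [c, d]].
    counts = Counter(zip(x_labels, y_labels))
    a = counts[(x_levels[0], y_levels[0])]
    b = counts[(x_levels[0], y_levels[1])]
    c = counts[(x_levels[1], y_levels[0])]
    d = counts[(x_levels[1], y_levels[1])]
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        prob(k) for k in range(low, high + 1) if prob(k) <= p_observed * (1 + 1e-9)
    )


equipment = ["A", "A", "A", "B", "B", "B"]
result = ["ok", "ok", "ok", "ng", "ng", "ng"]
p = fisher_exact_2x2(equipment, result)
print(p)
```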

A plurality of mutually different hypothesis testing methods may be set as at least one of the first, third, and fourth hypothesis testing methods. In this case, the calculation unit 12 calculates a plurality of significance probabilities between the objective variable and the explanatory variable using the calculation methods of the hypothesis testing methods that have been set, and takes the smallest of them as the significance probability between the objective variable and the explanatory variable. For example, suppose that both the Kruskal-Wallis test and the Fligner-Killeen test are set as the third hypothesis testing method. When one of the objective variable and the explanatory variable consists of numerical data and the other consists of character data, the calculation unit 12 calculates the significance probability of the difference in representative value between levels using the calculation method of each of the two tests, and takes the smaller of the two as the significance probability for that pair of objective variable and explanatory variable.
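The selection of the smallest value can be sketched as follows. The function `smallest_p_value` is hypothetical, and the two entries in `methods` are stand-in callables that return fixed numbers purely so the example runs; real callables would compute a p-value from the data.

```python
def smallest_p_value(x, y, methods):
    """Apply every configured method and keep the smallest significance probability."""
    results = {name: method(x, y) for name, method in methods.items()}
    best = min(results, key=results.get)
    return best, results[best]


# Stand-in callables returning fixed values (assumptions for illustration only).
methods = {
    "kruskal_wallis": lambda x, y: 0.08,
    "fligner_killeen": lambda x, y: 0.03,
}
name, p = smallest_p_value([1.0, 2.0, 3.0], ["A", "B", "A"], methods)
print(name, p)
```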

The calculation unit 12 also handles missing values as follows when calculating the significance probability. When an objective variable or explanatory variable composed of numerical data contains a missing value, the calculation unit 12 excludes the data units 31 containing the missing value from the analysis and calculates the significance probability using the remaining data units 31. When an objective variable or explanatory variable composed of character data contains a missing value, the calculation unit 12 replaces the missing value with a predetermined word (for example, "NA") and then calculates the significance probability.
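
A minimal sketch of this missing-value handling for one pair of variables is shown below. It assumes that a missing value is represented by `None` and that a column counts as character data if it contains at least one string. Both conventions, and the names `is_text` and `handle_missing`, are assumptions made for this example.

```python
def is_text(column):
    """Treat a column as character data if any value is a string (assumption)."""
    return any(isinstance(v, str) for v in column)


def handle_missing(objective, explanatory, placeholder="NA"):
    """Return the (objective, explanatory) pairs to use for one test."""
    obj_text, exp_text = is_text(objective), is_text(explanatory)
    kept = []
    for y, x in zip(objective, explanatory):
        # Character data: replace a missing value with the predetermined word.
        if y is None and obj_text:
            y = placeholder
        if x is None and exp_text:
            x = placeholder
        # Numerical data: exclude the whole data unit if a value is still missing.
        if y is None or x is None:
            continue
        kept.append((y, x))
    return kept


# Hypothetical data: a numerical objective variable and a character explanatory variable.
yield_rate = [0.91, None, 0.88, 0.93]
device = ["A", "B", None, "A"]
pairs = handle_missing(yield_rate, device)
print(pairs)
```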

The calculation unit 12 may calculate the significance probability after converting character data into numerical data. Any method may be used for this conversion. In the example of FIG. 5, the item "device" contains column data consisting of three kinds of characters, "A", "B", and "C", and this column data is converted into three-column matrix data consisting of the numerical values "0" and "1". Alternatively, the calculation unit 12 may calculate the significance probability without converting character data into numerical data. For example, the calculation method of the Kruskal-Wallis test yields a significance probability without such a conversion.
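
The conversion described for FIG. 5 corresponds to what is commonly called one-hot encoding. A minimal sketch follows. The function name `one_hot` and the sample column are assumptions made for this example, and the column order (sorted levels) is one possible choice.

```python
def one_hot(column):
    """Convert a character column into a 0/1 matrix with one column per level."""
    levels = sorted(set(column))
    matrix = [[1 if value == level else 0 for level in levels] for value in column]
    return levels, matrix


# Hypothetical "device" column with the three levels "A", "B", and "C".
levels, matrix = one_hot(["A", "B", "C", "A"])
print(levels)
for row in matrix:
    print(row)
```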

Following step S3, the display control unit 13 causes the display unit 26 to display the plurality of explanatory variables in ascending order of significance probability, that is, smallest first (step S4).
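
The ordering in step S4 amounts to an ascending sort on the significance probability. In the sketch below, the item names follow FIG. 6, but the p-values are invented for illustration and are not taken from the original.

```python
# Hypothetical significance probabilities per explanatory variable (item).
p_values = {"B": 0.20, "C": 0.012, "D": 0.35, "E": 0.048, "H": 0.0007}

# Smallest significance probability first, as in step S4.
ranked = sorted(p_values.items(), key=lambda item: item[1])

for name, p in ranked:
    print(f"item {name}: p = {p}")
```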

FIG. 6 is a diagram showing a display example of the display unit 26. In this example, a table 40 showing the calculation results is displayed on the display unit 26. Table 40 shows an example in which the objective variable is item A and the explanatory variables are items B to H. The explanatory variables, items B to H, are arranged from the top in ascending order of their minimum significance probability. That is, in this example, item H has the smallest minimum significance probability and item F has the largest. Immediately to the right of the item name, text indicates whether the data constituting the explanatory variable (item) is numerical data or character data (that is, the data type of the explanatory variable). Immediately to the right of the data type, the number of data points of the explanatory variable (item) is displayed as a numerical value.

To the right of the number of data points, the significance probabilities calculated using the calculation methods of hypothesis testing methods A, B, C, and D are displayed as numerical values. Hypothesis testing methods A and B are both instances of the first hypothesis testing method described above. That is, in this example, two mutually different hypothesis testing methods are set as the first hypothesis testing method. Hypothesis testing method C is the third hypothesis testing method (a second hypothesis testing method) described above. Hypothesis testing method D is the fourth hypothesis testing method (a second hypothesis testing method) described above.

In this example, item A, the objective variable, and items B to D, which are explanatory variables, are composed of numerical data, while items E to H are composed of character data. The significance probabilities between item A and each of items B to D are therefore calculated using the calculation methods of hypothesis testing methods A and B, which are first hypothesis testing methods. The calculated significance probabilities are displayed as numerical values in the "Significance probability (method A)" and "Significance probability (method B)" columns, respectively.

The significance probabilities between item A and each of items E to H are calculated using the calculation method of hypothesis testing method C, which is the third hypothesis testing method. The calculated significance probabilities are displayed as numerical values in the "Significance probability (method C)" column. In this example, because item A, the objective variable, is numerical data, the calculation method of hypothesis testing method D is not used, and the "Significance probability (method D)" column is therefore blank. When a significance probability is calculated using the calculation method of hypothesis testing method D, it is displayed as a numerical value in the "Significance probability (method D)" column. In the "Significance probability (method A)", "Significance probability (method B)", and "Significance probability (method C)" columns as well, the cells are blank except for the items to which the method applies.

The rightmost column displays the minimum significance probability as a numerical value. In this example, for items B to D, the minimum significance probability is the smaller of the significance probability calculated using the calculation method of hypothesis testing method A and that calculated using the calculation method of hypothesis testing method B. For items E to H, the minimum significance probability is the significance probability calculated using the calculation method of hypothesis testing method C. In other words, when a plurality of significance probabilities between the objective variable and an explanatory variable are calculated using the calculation methods of a plurality of hypothesis testing methods, the smallest of them is taken as the significance probability between the objective variable and that explanatory variable.

After step S4, the user can select, from among the plurality of explanatory variables (items) displayed on the display unit 26, one explanatory variable for which a graph 50, described later, is to be displayed. For example, a selection box is displayed on the display unit 26, and the selection box expands when the user presses it. In the expanded state, labels indicating the plurality of explanatory variables are displayed as text in the selection box, arranged from the top in ascending order of significance probability. The user selects one explanatory variable by selecting its label in the selection box. Upon receiving this selection, the display control unit 13 displays, on the display unit 26, a graph 50 showing the relationship between the selected explanatory variable and the objective variable. The graph 50 and the selection box are displayed, for example, on a screen (tab) different from that of table 40, but they may instead be displayed together with table 40 on the same screen. Because the selection box lists the explanatory variables in ascending order of significance probability, the user can proceed with the analysis efficiently, for example by selecting explanatory variables in order from the top and checking the graph 50 for each.

FIGS. 7 to 10 are diagrams showing examples of the graph 50. FIGS. 7 to 10 show the graph 50 for the case where the explanatory variable is item X and the objective variable is item Y. In the example of FIG. 7, the relationship between item X and item Y is shown as a scatter plot. The significance probability (P) and the number of data points (n) are displayed at the upper left, and a smoothed line 51 is also displayed. Whether to display the significance probability, the number of data points, and the smoothed line 51 may be selectable using check boxes. In the example of FIG. 8, the relationship between item X and item Y is shown as a box plot. Item X contains the character data "H1", "H2", "H3", "H4", "H5", and so on.

In the example of FIG. 9, the relationship between item X and item Y is displayed as time-series information using a line graph. The horizontal axis represents a numerical value indicating the time or order in which each data unit 31 was acquired, and the vertical axis represents the numerical values of item X and item Y. In this way, the relationship between item X and item Y may be displayed as time-series information. If the dataset has an item consisting of numerical data that represents the time or order in which each data unit 31 was acquired, the data of that item may be used for the horizontal axis. Alternatively, the row number of each data unit 31 may be used for the horizontal axis.

The relationship between item X and item Y may be displayed as a scatter plot as shown in FIG. 10(a), as a box plot as shown in FIG. 10(b), or as a violin plot as shown in FIG. 10(c). In the examples of FIGS. 10(a) to 10(c), item X is composed of two kinds of character data, "a" and "b". Numerical data greater than 0 may be logarithmically transformed before display. When both item X and item Y are composed of character data, a mosaic plot may be used. A plurality of graphs 50 may be displayed on the display unit 26, in which case they may be arranged in ascending order of the significance probability of the corresponding explanatory variable. The data analysis system 1 may be configured to output the analysis results, including the table 40 and the graph 50, to a file in a predetermined format.

[Actions and effects]

In the data analysis system 1, the significance probability between the objective variable and each of the plurality of explanatory variables is calculated using a calculation method of a hypothesis test, and the plurality of explanatory variables are displayed on the display unit 26 in ascending order of significance probability. This allows the user to easily identify explanatory variables with a small significance probability, that is, explanatory variables expected to be strongly related to the objective variable. In addition, the user can compare the plurality of explanatory variables using the significance probability as the criterion. Because a significance probability is obtained regardless of which hypothesis testing method's calculation method is used, using it as the criterion makes it possible to compare explanatory variables on a common basis even when different hypothesis testing methods are applied to them. As a result, the plurality of explanatory variables can be suitably compared even when, for example, the analysis target contains both numerical data and character data. The data analysis system 1 can therefore facilitate data analysis and increase the degree of freedom of the analysis target.

As described above, a large amount of data can be acquired every day in a manufacturing process. Because the amount of data is enormous, however, it is not easy to find the data related to product quality. Machine learning can in some cases predict the objective variable from the explanatory variables with high accuracy, but the resulting model is not easy to interpret. In analyzing manufacturing process data, predicting the defect rate with high accuracy is not meaningful in itself, since the aim is to reduce the defect rate. In this regard, when a quality abnormality occurs in a manufacturing process, it is often due to a single cause rather than to several intertwined causes. For example, defects may increase when a product is manufactured with specific equipment or from a specific raw material. Commercially available software can also calculate correlation coefficients, but the calculation fails when missing values are present, and its accuracy drops sharply when abnormal values are present. In addition, a correlation coefficient cannot be calculated accurately unless the relationship between the numerical values is linear, and it cannot be calculated at all between numerical values and characters or between characters and characters. In contrast, as described above, the data analysis system 1 allows the user to easily identify explanatory variables with a small significance probability, that is, explanatory variables expected to be strongly related to the objective variable. As a result, even when, for example, a quality abnormality occurs in a manufacturing process, its cause can be easily identified. The data analysis system 1 is also applicable when the analysis target contains both numerical data and character data, and when missing values or abnormal values are present. The data analysis system 1 can therefore facilitate data analysis and increase the degree of freedom of the analysis target.

The calculation methods used by the processor 21 include only calculation methods of non-parametric tests. The calculation methods of non-parametric tests do not readily lose accuracy even when the data under analysis contains abnormal values such as outliers, and they do not require assumptions such as the distribution of the population. The degree of freedom of the analysis target can therefore be further increased.

The calculation methods used by the processor 21 include a calculation method of a first hypothesis testing method and a calculation method of a second hypothesis testing method different from the first hypothesis testing method. When both the objective variable and the explanatory variable are composed of numerical data, the processor 21 calculates the significance probability using the calculation method of the first hypothesis testing method. When at least one of the objective variable and the explanatory variable is composed of character data, the processor 21 calculates the significance probability using the calculation method of the second hypothesis testing method. The significance probability can thus be suitably calculated even when the analysis target contains both numerical data and character data.

The second hypothesis testing method includes a third hypothesis testing method and a fourth hypothesis testing method different from the third hypothesis testing method. When one of the objective variable and the explanatory variable is composed of numerical data and the other is composed of character data, the processor 21 calculates the significance probability using the calculation method of the third hypothesis testing method. When both the objective variable and the explanatory variable are composed of character data, the processor 21 calculates the significance probability using the calculation method of the fourth hypothesis testing method. The significance probability can thus be calculated even more suitably when the analysis target contains both numerical data and character data.
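
The selection among the first, third, and fourth hypothesis testing methods by data type can be sketched as below. The function names, the use of `None` for missing values, and the rule that a column is numerical only if all of its non-missing values are real numbers are assumptions made for this example.

```python
from numbers import Real


def is_numeric(column):
    """True if every non-missing value is a real number (bools excluded)."""
    values = [v for v in column if v is not None]
    return bool(values) and all(
        isinstance(v, Real) and not isinstance(v, bool) for v in values
    )


def select_testing_method(objective, explanatory):
    """Return which hypothesis testing method applies to the pair of variables."""
    numeric_count = is_numeric(objective) + is_numeric(explanatory)
    if numeric_count == 2:
        return "first"   # both numerical, e.g. a rank correlation test
    if numeric_count == 1:
        return "third"   # one numerical, one character, e.g. Kruskal-Wallis
    return "fourth"      # both character, e.g. Fisher's exact test


# Hypothetical columns.
thickness = [1.2, 1.4, None, 1.1]
temperature = [200, 210, 205, 198]
device = ["A", "B", "A", "C"]
grade = ["ok", "ng", "ok", "ok"]

print(select_testing_method(thickness, temperature))
print(select_testing_method(thickness, device))
print(select_testing_method(grade, device))
```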

The processor 21 calculates a plurality of significance probabilities between the objective variable and the explanatory variable using the calculation methods of a plurality of mutually different hypothesis testing methods, and takes the smallest of them as the significance probability between the objective variable and the explanatory variable. The significance probability can thus be calculated with higher accuracy.

The processor 21 causes the display unit 26 to display a graph 50 showing the relationship between the objective variable and one explanatory variable selected from among the plurality of explanatory variables displayed on the display unit 26. The user can thus easily grasp the relationship between the selected explanatory variable and the objective variable.

In the above embodiment, the calculation method used to calculate the significance probability for one of the plurality of explanatory variables differs from the calculation method used for another. That is, the plurality of explanatory variables include a first explanatory variable and a second explanatory variable, and the processor 21 calculates the significance probability between the objective variable and the first explanatory variable using the calculation method of the first hypothesis testing method, and calculates the significance probability between the objective variable and the second explanatory variable using the calculation method of the second hypothesis testing method, which differs from the first. Because significance probabilities can be calculated using the calculation methods of both the first and second hypothesis testing methods, the degree of freedom of the analysis target can be further increased.

The present invention is not limited to the above embodiment. For example, in the above embodiment the calculation methods used by the processor 21 are limited to those of non-parametric tests, but they may additionally include calculation methods of parametric tests, or may consist only of calculation methods of parametric tests. An example of a parametric first hypothesis testing method, applicable when both variables under test are numerical data, is the test of Pearson's correlation coefficient. An example of a parametric third hypothesis testing method, applicable when one of the variables under test is numerical data and the other is character data, is analysis of variance.

Description of symbols: 1: data analysis system, 20: computer, 21: processor, 26: display unit, 27: data analysis program, 30: dataset, 31: data unit, 50: graph.

Claims

A data analysis system comprising at least one processor,
wherein the at least one processor:
accepts a dataset including a plurality of data units, each being a collection of data for a plurality of items;
for an objective variable consisting of one item among the plurality of items and a plurality of explanatory variables consisting of two or more other items among the plurality of items, calculates, based on the dataset, a significance probability between the objective variable and each of the plurality of explanatory variables using a calculation method of a hypothesis test whose null hypothesis is that there is no relationship between the objective variable and the explanatory variable; and
causes a display unit to display the plurality of explanatory variables in ascending order of the significance probability.

The data analysis system according to claim 1, wherein the calculation method used by the at least one processor includes a calculation method in a non-parametric testing method.

The data analysis system according to claim 1 or 2, wherein the calculation method used by the at least one processor includes a calculation method of a first hypothesis testing method and a calculation method of a second hypothesis testing method different from the first hypothesis testing method, and
the at least one processor:
calculates the significance probability using the calculation method of the first hypothesis testing method when both the objective variable and the explanatory variable are composed of numerical data; and
calculates the significance probability using the calculation method of the second hypothesis testing method when at least one of the objective variable and the explanatory variable is composed of character data.

The data analysis system according to claim 3, wherein the second hypothesis testing method includes a third hypothesis testing method and a fourth hypothesis testing method different from the third hypothesis testing method, and
the at least one processor:
calculates the significance probability using the calculation method of the third hypothesis testing method when one of the objective variable and the explanatory variable is composed of numerical data and the other is composed of character data; and
calculates the significance probability using the calculation method of the fourth hypothesis testing method when both the objective variable and the explanatory variable are composed of character data.

The data analysis system according to claim 1, wherein the at least one processor:
calculates a plurality of significance probabilities between the objective variable and the explanatory variable using calculation methods of a plurality of mutually different hypothesis testing methods; and
takes the smallest of the plurality of significance probabilities as the significance probability between the objective variable and the explanatory variable.

The data analysis system according to claim 1 or 2, wherein the plurality of explanatory variables include a first explanatory variable and a second explanatory variable, and
the at least one processor:
calculates the significance probability between the objective variable and the first explanatory variable using a calculation method of a first hypothesis testing method; and
calculates the significance probability between the objective variable and the second explanatory variable using a calculation method of a second hypothesis testing method different from the first hypothesis testing method.

The data analysis system according to any one of claims 1 to 6, wherein the at least one processor
causes the display unit to display a graph showing the relationship between the objective variable and one explanatory variable selected from among the plurality of explanatory variables displayed on the display unit.

A data analysis method performed by a data analysis system comprising at least one processor, the method comprising:
accepting a dataset including multiple data units that are a collection of data of multiple items;
for an objective variable consisting of one item among the plurality of items and a plurality of explanatory variables consisting of two or more other items among the plurality of items, calculating, based on the dataset, a significance probability between the objective variable and each of the plurality of explanatory variables using a calculation method of a hypothesis test whose null hypothesis is that there is no relationship between the objective variable and the explanatory variable; and
displaying the plurality of explanatory variables on a display unit in ascending order of the significance probability.

A data analysis program that causes a computer to execute the steps of:
accepting a dataset including a plurality of data units, each being a collection of data for a plurality of items;
for an objective variable consisting of one item among the plurality of items and a plurality of explanatory variables consisting of two or more other items among the plurality of items, calculating, based on the dataset, a significance probability between the objective variable and each of the plurality of explanatory variables using a calculation method of a hypothesis test whose null hypothesis is that there is no relationship between the objective variable and the explanatory variable; and
displaying the plurality of explanatory variables on a display unit in ascending order of the significance probability.