JP4255779B2 - Data analysis apparatus, data analysis method, and data analysis program

Info

Publication number: JP4255779B2
Application number: JP2003272648A
Authority: JP
Inventors: 博明竹内
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2003-07-10
Filing date: 2003-07-10
Publication date: 2009-04-15
Anticipated expiration: 2023-07-10
Also published as: JP2005032117A

Description

The present invention relates to a data analysis apparatus, a data analysis method, and a data analysis program for analyzing the causal relationship between an output attribute (objective attribute) to be analyzed, such as a characteristic of a product manufactured in a manufacturing process, and input attributes (explanatory attributes) that influence the output attribute, such as manufacturing process conditions.

The decision tree technique is known as an effective way to analyze the causal relationship between an output attribute and input attributes (see Patent Document 1). This technique builds a tree structure in which the data are split successively by the values of the input attributes so that the values of the output attribute are well grouped at the leaves.

FIG. 10 shows an example of a decision tree described in the prior-art section of Patent Document 1 (see paragraphs [0002] to [0005] and FIG. 22 of Patent Document 1); it analyzes the data group of Table 1. The data group of Table 1 is a set of twelve records, each consisting of the values of four input attributes x1, x2, x3, and x4 together with the value of the output attribute y for those input attributes. In the decision tree produced by this technique (hereinafter “conventional decision tree-1”), the values X, Y, and Z of the output attribute y are cleanly separated by the values of the input attributes x1, x2, and x3, as shown in FIG. 10.

However, when conventional decision tree-1 of FIG. 10 is built, each split divides the data into as many subsets as the input attribute has values (the number of distinct attribute values). For example, input attribute x2 takes four values (a, b, c, d), so a split on x2 produces four subsets. Consequently, the decision tree may become unwieldy as the number of values taken by an input attribute grows.

To address this problem, Patent Document 1 proposes a decision tree in which, for each attribute, the attribute values that can be grouped together are represented by a single label and the data are classified by label.

FIG. 11 shows the label hierarchy described in the embodiment of Patent Document 1 (see paragraphs [0010] to [0028] and FIG. 13 of Patent Document 1). In this embodiment, for example, for attribute x3, which takes four attribute values (1, 2, 3, 4), the values “1” and “2” are labeled “2.5 or less” and the values “3” and “4” are labeled “2.5 or more”, thereby expressing a hierarchical structure. The decision tree built with this label hierarchy (hereinafter “conventional decision tree-2”) is as shown in FIG. 12 (see paragraphs [0010] to [0028] and FIG. 14 of Patent Document 1) and is far simpler than conventional decision tree-1 shown in FIG. 10.
Patent Document 1: JP-A-8-314725 (published November 29, 1996)
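The difference between splitting on raw attribute values and splitting on labels, as described for FIG. 10 and FIG. 11, can be pictured with a short sketch. The label mapping for x3 follows the example in the text; the record layout, the function names, and the sample records are assumptions made only for illustration and do not come from Patent Document 1.

```python
from collections import defaultdict

# Hand-written label hierarchy for attribute x3 (values 1 to 4),
# mirroring the "2.5 or less" / "2.5 or more" example in the text.
X3_LABELS = {1: "2.5 or less", 2: "2.5 or less",
             3: "2.5 or more", 4: "2.5 or more"}


def split_by_value(records, attribute):
    """One subset per distinct attribute value (conventional decision tree-1)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record)
    return dict(groups)


def split_by_label(records, attribute, labels):
    """One subset per label; the label map must be defined beforehand."""
    groups = defaultdict(list)
    for record in records:
        groups[labels[record[attribute]]].append(record)
    return dict(groups)


# Hypothetical records, not the data of Table 1.
records = [{"x3": v} for v in (1, 2, 3, 4, 1, 3)]
by_value = split_by_value(records, "x3")
by_label = split_by_label(records, "x3", X3_LABELS)
```

Splitting by value yields four branches here, whereas splitting by label yields two, but only because `X3_LABELS` was written by hand in advance.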

The problems of the prior art are explained below, taking as an example the application of the conventional decision tree generation techniques above to analyzing the causes of product characteristic defects in the manufacturing process of products such as devices.

Suppose that the input attributes x1, x2, x3, and x4 of Table 1 are various process data and in-line inspection data from the product manufacturing process, that the output attribute y is characteristic data of the manufactured product, and that y = Y corresponds to a product characteristic defect. Suppose further that a process engineer investigates the cause of the defect y = Y using either decision tree-1 (FIG. 10), generated by the technique described in the prior-art section of Patent Document 1, or conventional decision tree-2 (FIG. 12), generated by the technique disclosed in Patent Document 1.

In decision tree-1, generated by the technique described in the prior-art section of Patent Document 1, the outcome of interest, y = Y, is scattered over several places in the tree (four in the example of FIG. 10). The tree is therefore cumbersome, and it is hard for the process engineer to judge the cause of the defect, that is, “which input attribute lying in which range of values makes the product characteristic poor?” In the example of FIG. 10 there are only four input attributes, each with only four distinct values, so the process engineer can still manage to judge the cause. At actual manufacturing sites for products such as devices (semiconductor devices in particular), however, each process step has process data and in-line inspection data for roughly 10 to 100 attributes, and their values are multivalued and distributed over a very wide range. Moreover, because of disturbances (attributes that are not captured as input attributes), the value of the output attribute often varies even when the values of all input attributes are the same. If the technique described in the prior-art section of Patent Document 1 is used in such cases, the pursuit of an exact analysis splits the data into a virtually endless number of subsets, and the process engineer can no longer properly identify the cause of the defect.

In decision tree-2 (FIG. 12), generated by the technique disclosed in Patent Document 1, by contrast, the data are classified by the label hierarchy, so the tree is simple. The process engineer can therefore easily identify the cause of the defect y = Y.

However, building the simple decision tree-2 of FIG. 12 requires that the label hierarchy of FIG. 11 be defined in advance. The decision tree generation technique of Patent Document 1 therefore cannot be applied when there is no indication of which attribute values can be grouped. As noted above, at actual manufacturing sites for products such as devices, each process step has process data and in-line inspection data for roughly 10 to 100 attributes, and their values are multivalued and distributed over a very wide range. Moreover, because of disturbances (attributes that are not captured as input attributes), the value of the output attribute often varies even when the values of all input attributes are the same. Under such conditions, finding, for each input attribute, the attribute values that can be grouped under one label is very difficult even for an experienced process engineer.

The present invention has been made in view of the above problems of the prior art. Its object is to provide a data analysis apparatus, a data analysis method, and a data analysis program that can derive the causal relationship between an output attribute and input attributes in a concise form without a label hierarchy being defined in advance.

To solve the above problem, a data analysis apparatus according to the present invention takes as its analysis target a basic data group DA stored in an analysis target data storage unit, the basic data group DA being a set of data each consisting of a plurality of input attributes x_j (1 ≦ j ≦ N, where N is the number of input attributes) and one output attribute y, and analyzes the causal relationship between the input attributes and the output attribute. The apparatus is characterized by including:

character-numeric data conversion means for generating a numeric basic data group DA0, which is a set of numeric-attribute data, by converting the character-attribute data contained in the basic data group DA into numeric-attribute data according to an unambiguous conversion rule;

classification means for classifying the numeric basic data group DA0 into a first data group DA1 and a second data group DA2 by comparing the value of the output attribute y contained in DA0 with a predetermined threshold of the output attribute y;

first evaluation means for performing the following for each of the plurality of input attributes x_j: for each value that x_j can take, computing a first frequency (1-x_j cumulative frequency %), which is the ratio of the number of data in the first data group DA1 having a value less than or equal to that value to the number of all data in DA1; for each value that x_j can take, computing a second frequency (2-x_j cumulative frequency %), which is the ratio of the number of data in the second data group DA2 having a value less than or equal to that value to the number of all data in DA2; and, for each value that x_j can take, computing the difference between the first frequency and the second frequency (x_j cumulative frequency difference %);

threshold determination means for determining, for each of the plurality of input attributes x_j, on the basis of the differences (x_j cumulative frequency difference %) computed by the first evaluation means for each value that x_j can take, the value at which the largest difference was obtained as the threshold x_j-th of that input attribute;

second evaluation means for performing the following for each of the plurality of input attributes x_j: computing a first ratio, which is the ratio of the second frequency (2-x_j cumulative frequency %) to the first frequency (1-x_j cumulative frequency %) at the threshold x_j-th determined by the threshold determination means, and a second ratio, which is the ratio of (100% - second frequency) to (100% - first frequency) at that threshold, and selecting the larger of the first ratio and the second ratio; and

factor extraction means for extracting, as data indicating an input attribute condition, the input attribute x_j having the largest of the ratios selected for each input attribute by the second evaluation means, the threshold x_j-th of that input attribute, and a type indicating whether that largest ratio is the first ratio or the second ratio, and for storing the input attribute condition in an analysis result data storage unit.
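The quantities named in the paragraph above can be written compactly as follows. The symbols F_1, F_2, Δ, R_1, R_2, and R are shorthand introduced here for readability and do not appear in the source. The passage does not say whether the "largest difference" is signed or taken in absolute value; the absolute value is assumed below.

```latex
\begin{align*}
F_1(v) &= 100 \cdot \frac{\lvert\{\, d \in DA1 : x_j(d) \le v \,\}\rvert}{\lvert DA1 \rvert}
        && \text{(1-$x_j$ cumulative frequency \%)} \\
F_2(v) &= 100 \cdot \frac{\lvert\{\, d \in DA2 : x_j(d) \le v \,\}\rvert}{\lvert DA2 \rvert}
        && \text{(2-$x_j$ cumulative frequency \%)} \\
\Delta(v) &= F_1(v) - F_2(v)
        && \text{($x_j$ cumulative frequency difference \%)} \\
x_{j\text{-th}} &= \arg\max_{v}\, \lvert \Delta(v) \rvert
        && \text{(threshold; absolute value assumed)} \\
R_1 &= \frac{F_2(x_{j\text{-th}})}{F_1(x_{j\text{-th}})}, \qquad
R_2 = \frac{100 - F_2(x_{j\text{-th}})}{100 - F_1(x_{j\text{-th}})}
        && \text{(first and second ratio)} \\
R &= \max(R_1, R_2)
        && \text{(ratio selected for $x_j$)}
\end{align*}
```

The factor extraction means then picks the input attribute whose R is largest, together with its threshold and the information whether R came from R_1 (values at or below the threshold) or R_2 (values above the threshold).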

To solve the above problem, a data analysis method according to the present invention uses the data analysis apparatus described above, takes as its analysis target a basic data group DA stored in an analysis target data storage unit, the basic data group DA being a set of data each consisting of a plurality of input attributes x_j (1 ≦ j ≦ N, where N is the number of input attributes) and one output attribute y, and analyzes the causal relationship between the input attributes and the output attribute. The method is characterized by including:

a character-numeric data conversion step in which the character-numeric data conversion means generates a numeric basic data group DA0, which is a set of numeric-attribute data, by converting the character-attribute data contained in the basic data group DA into numeric-attribute data according to an unambiguous conversion rule;

a classification step in which the classification means classifies the numeric basic data group DA0 into a first data group DA1 and a second data group DA2 by comparing the value of the output attribute y contained in DA0 with a predetermined threshold of the output attribute y;

a first evaluation step in which the first evaluation means performs the following for each of the plurality of input attributes x_j: for each value that x_j can take, computing a first frequency (1-x_j cumulative frequency %), which is the ratio of the number of data in the first data group DA1 having a value less than or equal to that value to the number of all data in DA1; for each value that x_j can take, computing a second frequency (2-x_j cumulative frequency %), which is the ratio of the number of data in the second data group DA2 having a value less than or equal to that value to the number of all data in DA2; and, for each value that x_j can take, computing the difference between the first frequency and the second frequency (x_j cumulative frequency difference %);

a threshold determination step in which the threshold determination means determines, for each of the plurality of input attributes x_j, on the basis of the differences (x_j cumulative frequency difference %) computed by the first evaluation means for each value that x_j can take, the value at which the largest difference was obtained as the threshold x_j-th of that input attribute;

a second evaluation step in which the second evaluation means performs the following for each of the plurality of input attributes x_j: computing a first ratio, which is the ratio of the second frequency (2-x_j cumulative frequency %) to the first frequency (1-x_j cumulative frequency %) at the threshold x_j-th determined by the threshold determination means, and a second ratio, which is the ratio of (100% - second frequency) to (100% - first frequency) at that threshold, and selecting the larger of the first ratio and the second ratio; and

a factor extraction step in which the factor extraction means extracts, as data indicating an input attribute condition, the input attribute x_j having the largest of the ratios selected for each input attribute by the second evaluation means, the threshold x_j-th of that input attribute, and a type indicating whether that largest ratio is the first ratio or the second ratio, and stores the input attribute condition in an analysis result data storage unit.

To solve the above problem, a data analysis program according to the present invention is a program for causing a computer to function, the computer being provided in a data analysis apparatus that takes as its analysis target a basic data group DA stored in an analysis target data storage unit, the basic data group DA being a set of data each consisting of a plurality of input attributes x_j (1 ≦ j ≦ N, where N is the number of input attributes) and one output attribute y, and that analyzes the causal relationship between the input attributes and the output attribute. The data analysis apparatus includes:

character-numeric data conversion means for generating a numeric basic data group DA0, which is a set of numeric-attribute data, by converting the character-attribute data contained in the basic data group DA into numeric-attribute data according to an unambiguous conversion rule;

classification means for classifying the numeric basic data group DA0 into a first data group DA1 and a second data group DA2 by comparing the value of the output attribute y contained in DA0 with a predetermined threshold of the output attribute y;

first evaluation means for performing the following for each of the plurality of input attributes x_j: for each value that x_j can take, computing a first frequency (1-x_j cumulative frequency %), which is the ratio of the number of data in the first data group DA1 having a value less than or equal to that value to the number of all data in DA1; for each value that x_j can take, computing a second frequency (2-x_j cumulative frequency %), which is the ratio of the number of data in the second data group DA2 having a value less than or equal to that value to the number of all data in DA2; and, for each value that x_j can take, computing the difference between the first frequency and the second frequency (x_j cumulative frequency difference %);

threshold determination means for determining, for each of the plurality of input attributes x_j, on the basis of the differences (x_j cumulative frequency difference %) computed by the first evaluation means for each value that x_j can take, the value at which the largest difference was obtained as the threshold x_j-th of that input attribute;

second evaluation means for performing the following for each of the plurality of input attributes x_j: computing a first ratio, which is the ratio of the second frequency (2-x_j cumulative frequency %) to the first frequency (1-x_j cumulative frequency %) at the threshold x_j-th determined by the threshold determination means, and a second ratio, which is the ratio of (100% - second frequency) to (100% - first frequency) at that threshold, and selecting the larger of the first ratio and the second ratio; and

factor extraction means for extracting, as data indicating an input attribute condition, the input attribute x_j having the largest of the ratios selected for each input attribute by the second evaluation means, the threshold x_j-th of that input attribute, and a type indicating whether that largest ratio is the first ratio or the second ratio, and for storing the input attribute condition in an analysis result data storage unit.

The data analysis program is characterized by causing the computer to function as each of the above means.

To solve the above problem, a computer-readable recording medium according to the present invention is characterized by having the above data analysis program recorded on it.

With the above apparatus, method, program, or recording medium, the factor behind the output attribute condition (result) corresponding to the second data group can be extracted in a concise form without a label hierarchy being defined in advance. Therefore, if the second data group corresponds to a bad result (for example, the occurrence of defective products), the user can easily grasp the cause of that bad result. Conversely, if the second data group corresponds to a good result (for example, the occurrence of products with excellent characteristics), the user can easily grasp the cause of that good result.

More preferably, the data analysis method according to the present invention further includes dividing means that, on the basis of the input attribute condition extracted by the factor extraction means, divides the numeric basic data group DA0 into a factor data group that satisfies the input attribute condition and an other data group that does not, and sends at least one of the resulting data groups to the classification means as a new numeric basic data group DA0, so that a series of processes consisting of the processing by the classification means, the first evaluation means, the threshold determination means, the second evaluation means, the factor extraction means, and the dividing means is executed repeatedly.
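The repeated divide-and-reanalyze cycle described above can be sketched as a recursive function. This is a minimal illustration under stated assumptions: `find_condition` and `is_finished` are hypothetical callables standing in for the factor extraction and an end check, records are plain dictionaries, and both resulting groups are re-analyzed even though the text only requires that at least one of them be sent back.

```python
def build_tree(records, find_condition, is_finished, depth=0):
    """Repeatedly split records by the extracted input attribute condition."""
    if is_finished(records, depth):
        return {"leaf": records}
    condition = find_condition(records)  # (attribute, threshold, kind) or None
    if condition is None:
        return {"leaf": records}
    attribute, threshold, kind = condition
    if kind == "above":
        satisfied = [r for r in records if r[attribute] > threshold]
        other = [r for r in records if r[attribute] <= threshold]
    else:  # kind == "below_or_equal"
        satisfied = [r for r in records if r[attribute] <= threshold]
        other = [r for r in records if r[attribute] > threshold]
    if not satisfied or not other:
        return {"leaf": records}  # the condition no longer separates anything
    return {
        "condition": condition,
        "factor": build_tree(satisfied, find_condition, is_finished, depth + 1),
        "other": build_tree(other, find_condition, is_finished, depth + 1),
    }


# Toy stand-ins, purely for demonstration (not the patent's procedure).
def toy_condition(records):
    has_high = any(r["x"] > 2 for r in records)
    has_low = any(r["x"] <= 2 for r in records)
    return ("x", 2, "above") if has_high and has_low else None


def toy_finished(records, depth):
    return depth >= 3 or len(records) < 2


tree = build_tree([{"x": v} for v in (1, 2, 3, 4)], toy_condition, toy_finished)
```

Each internal node of the resulting dictionary holds one input attribute condition, so the nesting of `factor` and `other` branches corresponds to a tree whose nodes are the extracted factors.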

With this configuration, a tree structure can be built with multiple factors as its nodes. Therefore, even for an analysis target in which multiple factors are intertwined and which is hard to express with a single association rule, the factors can be identified with sufficiently high accuracy.

More preferably, the data analysis apparatus according to the present invention further includes end condition determination means for determining whether an end condition is satisfied, and ends the execution of the above series of processes when the end condition determination means determines that the end condition is satisfied. This avoids performing more processing than necessary.

More preferably, the first evaluation means includes: frequency calculation means that, for every value of each input attribute, computes as the first frequency the proportion of data in the first data group whose input attribute is less than or equal to that value and computes as the second frequency the proportion of data in the second data group whose input attribute is less than or equal to that value; and difference calculation means that, for every value of each input attribute, computes the difference between the first frequency and the second frequency. This makes the threshold evaluation index easy to compute.
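The frequency and difference calculations for a single input attribute can be sketched as follows. The function names and the sample values are assumptions for illustration; the candidate values are taken to be the distinct values observed in either data group, and the threshold is chosen at the largest absolute difference, which is one reading of the "largest difference" mentioned earlier in the text.

```python
from bisect import bisect_right


def cumulative_table(group1, group2):
    """Rows of (value, first frequency %, second frequency %, difference %)."""
    s1, s2 = sorted(group1), sorted(group2)
    rows = []
    for value in sorted(set(s1) | set(s2)):
        # bisect_right counts the entries that are <= value in a sorted list.
        freq1 = 100.0 * bisect_right(s1, value) / len(s1)
        freq2 = 100.0 * bisect_right(s2, value) / len(s2)
        rows.append((value, freq1, freq2, freq1 - freq2))
    return rows


def pick_threshold(rows):
    """Value with the largest absolute difference (first one wins on ties)."""
    return max(rows, key=lambda row: abs(row[3]))[0]


# Hypothetical values of one input attribute in each data group.
group1_values = [1, 2, 3, 4]   # data in the first data group
group2_values = [4, 4, 4, 4]   # data in the second data group
rows = cumulative_table(group1_values, group2_values)
threshold = pick_threshold(rows)
```

In this toy data the cumulative frequencies diverge most at the value 3, where 75% of the first group but none of the second group lies at or below the value, so 3 is picked as the threshold.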

More preferably, the second evaluation means computes, as a first rule evaluation value, a first ratio, which is the ratio of the proportion of data in the second data group whose input attribute is less than or equal to the threshold to the proportion of data in the first data group whose input attribute is less than or equal to the threshold; computes, as a second rule evaluation value, a second ratio, which is the ratio of the proportion of data in the second data group whose input attribute exceeds the threshold to the proportion of data in the first data group whose input attribute exceeds the threshold; and extracts the larger of the two ratios. The factor extraction means then extracts, as the data indicating the input attribute condition, the input attribute whose ratio is the largest among the ratios extracted for the input attributes by the second evaluation means, the threshold of that input attribute, and the type of the extracted ratio. This makes the first and second rule evaluation values easy to compute.
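The two rule evaluation values and the final selection across input attributes can be sketched as below. The names, the sample numbers, the tie-breaking in favor of the first ratio, and the handling of a zero denominator (treated as infinity when the numerator is positive) are assumptions; the text does not specify these cases.

```python
import math


def rule_ratios(freq1, freq2):
    """Return (kind, ratio) for one attribute at its threshold.

    freq1 and freq2 are the cumulative frequencies (%) of the first and
    second data group at the threshold.
    """
    def ratio(numerator, denominator):
        if denominator == 0:  # assumption: not covered by the source text
            return math.inf if numerator > 0 else 0.0
        return numerator / denominator

    first = ratio(freq2, freq1)                   # data at or below the threshold
    second = ratio(100.0 - freq2, 100.0 - freq1)  # data above the threshold
    return ("below_or_equal", first) if first >= second else ("above", second)


def extract_factor(per_attribute):
    """Pick the attribute with the largest selected ratio.

    per_attribute maps attribute name -> (threshold, freq1, freq2).
    """
    best = None
    for name, (threshold, freq1, freq2) in per_attribute.items():
        kind, value = rule_ratios(freq1, freq2)
        if best is None or value > best[3]:
            best = (name, threshold, kind, value)
    return best


# Hypothetical per-attribute results of the threshold determination.
candidates = {"x3": (3, 75.0, 0.0), "x4": (10, 50.0, 40.0)}
factor = extract_factor(candidates)
```

Here `x3` wins with the condition "x3 exceeds 3": above that threshold lie 100% of the second data group but only 25% of the first, giving a second ratio of 4.0.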

As described above, the apparatus, method, program, and recording medium of the present invention make it possible to derive the factor that causes a specific output attribute condition (a problem event) in a very concise form, such as “the input attribute is at or below the threshold” or “the input attribute exceeds the threshold”, without a label hierarchy being defined in advance. Furthermore, if multiple factors are derived, the causal relationships behind the problem event can be expressed as a very concise decision tree built from combinations of conditions such as “the input attribute is at or below the threshold” or “the input attribute exceeds the threshold” for each factor (input attribute).

One embodiment of the present invention is described below.

First, the data analysis apparatus of this embodiment is described with reference to FIG. 1.

As shown in FIG. 1, the data analysis apparatus comprises a character-to-numeric data conversion unit 1, an analysis target data storage unit 2, a threshold setting unit (threshold setting means) 3, a data classification unit (classification means) 4, a data string extraction unit 5, a frequency calculation unit (first evaluation means, frequency calculation means) 6, a frequency cumulative difference calculation unit (first evaluation means, difference calculation means) 7, an input attribute threshold determination unit (threshold determination means) 8, a frequency cumulative ratio calculation unit (second evaluation means) 16, a factor extraction unit (factor extraction means) 9, a factor-undiscovered data extraction unit (division means) 10, an end condition determination unit (end condition determination means) 11, an input attribute threshold table creation unit 12, a contribution rate calculation unit 13, an analysis result data storage unit 14, and an output unit 15.

Next, the data analysis method of this embodiment is described with reference to FIG. 2, taking as an example the case where the data group DA in Table 1 below is the analysis target. The data group DA in Table 1 is stored in the storage unit 2, such as a hard disk.

The data group DA in Table 1 consists of 12 records with ids (identifiers) 1 to 12. In Table 1, x1, x2, x3, and x4 are input attributes. Input attribute x1 is a character attribute that takes one of the four characters A, B, C, and D. Input attribute x2 is a character attribute that takes one of the four characters a, b, c, and d. Input attribute x3 is a discrete attribute that takes one of the four discrete values 1, 2, 3, and 4. Input attribute x4 is a discrete attribute that takes one of the four discrete values 10, 20, 30, and 40. An input attribute may also be a continuous attribute that takes continuous numerical values.

In Table 1, y is the output attribute. An output attribute may be a character attribute, a discrete attribute, or a continuous attribute; here it is a character attribute that takes one of the three characters X, Y, and Z.

In the data analysis method of this embodiment, the case y = Y is treated as the problem event, and the factors that cause the output attribute y to become Y are analyzed.

An example of analysis target data is data in which the input attributes are manufacturing process conditions and/or in-line inspection results (results of inspections performed partway through the manufacturing line) in a product manufacturing process, the output attribute is the product quality determination result, and the problem event y = Y is a failed quality determination. In this case, analyzing the causal relationship between the input attributes and the output attribute with the data analysis method of this embodiment, and deriving the factors behind the problem event y = Y, makes it easy to devise countermeasures that eliminate defective products, such as devices with defective characteristics. The manufacturing process can therefore be improved easily, for example by raising the yield.

A more specific example of analysis target data is data in which the input attributes x1, x2, x3, and x4 are process data of a plasma CVD process, such as gas flow rate, gas pressure, input power, and film formation time, and the output attribute y is the thickness of the thin film formed by the plasma CVD process. The values of these input and output attributes may be continuous, discrete, or character attributes. In the case of a character attribute, for example where the output attribute is film thickness, the values are expressed as ‘large’, ‘medium’, and ‘small’.
[Step 0]
First, the character-to-numeric data conversion unit 1 converts the character attributes in the data group DA of Table 1, stored in the analysis target data storage unit 2 such as a hard disk, into numeric attributes (numeric data) according to the following conversion rule (S0). Every record is thereby converted into numeric data. The character-to-numeric data conversion unit 1 then sends the converted data group to the data classification unit 4.
(x1) A → 1, B → 2, C → 3, D → 4
(x2) a → 1, b → 2, c → 3, d → 4
(x3) no conversion
(x4) no conversion
(y) X → 1, Y → 2, Z → 3
Wherever possible, the conversion rule is preferably set so that the output attribute value increases as the converted input attribute value increases, or in the reverse order. The conversion rule only needs to be unambiguous and is not limited to the example above.
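
The Step 0 conversion can be pictured as a simple lookup, as in the minimal sketch below. The names `ENCODING` and `encode_record` are hypothetical, and the sample record is illustrative only; it is not a row of Table 1.

```python
# Hypothetical lookup tables mirroring the Step 0 conversion rule.
ENCODING = {
    "x1": {"A": 1, "B": 2, "C": 3, "D": 4},
    "x2": {"a": 1, "b": 2, "c": 3, "d": 4},
    "y": {"X": 1, "Y": 2, "Z": 3},
}


def encode_record(record):
    """Return a copy of the record with character attributes mapped to numbers.

    Attributes without an entry in ENCODING (id, x3, x4) pass through unchanged.
    """
    return {
        name: ENCODING[name][value] if name in ENCODING else value
        for name, value in record.items()
    }


# Illustrative record only (an assumption, not taken from Table 1).
sample = {"id": 1, "x1": "C", "x2": "b", "x3": 4, "x4": 20, "y": "Y"}
encoded = encode_record(sample)
```

Keeping the rule in a plain dictionary makes it easy to check that it is unambiguous, which is the only requirement the text places on it.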

The data group DA0, obtained by converting DA into numeric data with the above conversion rule, is shown in Table 2.

The data group DA0 obtained by this conversion is a set of records composed of a plurality of input attributes (explanatory attributes) taking discrete values and an output attribute (objective attribute). Hereinafter, the data group DA0 is called the basic data group.
[Step 1]
The threshold setting unit 3 sets the threshold y_th (output attribute threshold) of the output attribute y of the basic data group DA0 that corresponds to the problem event y = Y of the data group DA, either according to predetermined setting information or in response to the user entering the attribute value y = Y of the problem event from an input unit (not shown) such as a keyboard or mouse, and outputs it to the data classification unit 4 (S1). In this example, the threshold of the output attribute y of the basic data group DA0 corresponding to the problem event y = Y of the data group DA is y_th = 2.
[Step 2]
Next, the data classification unit 4 divides (classifies) the basic data group DA0 into a first data group DA1 and a second data group DA2 according to comparison rules (1) and (2) below, which compare the value of the output attribute y in DA0 with the output attribute threshold y_th output from the threshold setting unit 3 (S2).

(1) y > y_th or y < y_th → DA1
(2) y = y_th → DA2
In other words, the data classification unit 4 classifies the basic data group DA0 into the first data group DA1, whose output attribute does not equal the output attribute threshold y_th (that is, y is 1 or 3), and the second data group DA2, whose output attribute equals y_th (= 2). The second data group DA2 is the data group of the problem event (for example, defective device characteristics). That is, DA2 is the data group in which the output attribute y has the attribute value that represents the problem event (2), and DA1 is the data group in which y has an attribute value that does not represent the problem event (1 or 3).
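
Comparison rules (1) and (2) amount to an equality test against y_th. The sketch below shows one way to write the split; `split_by_output` is a hypothetical name and the records are made up for illustration, not copied from Table 2.

```python
def split_by_output(records, y_th):
    """Split records into DA1 (y != y_th) and DA2 (y == y_th)."""
    da1 = [r for r in records if r["y"] != y_th]  # rule (1): y > y_th or y < y_th
    da2 = [r for r in records if r["y"] == y_th]  # rule (2): y = y_th
    return da1, da2


# Illustrative records only (an assumption, not the contents of Table 2).
da0 = [
    {"id": 1, "x1": 1, "y": 1},
    {"id": 2, "x1": 3, "y": 2},
    {"id": 3, "x1": 2, "y": 3},
    {"id": 4, "x1": 4, "y": 2},
]
da1, da2 = split_by_output(da0, y_th=2)
```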

The first data group DA1 is shown in Table 3, and the second data group DA2 in Table 4.

In the following, the first data group DA1 is referred to where appropriate as the non-defective (OK) data group, and the second data group DA2 as the defective (NG) data group.
[Step 3]
Next, the data string extraction unit 5 extracts the data string of each input attribute xj (1 ≤ j ≤ 4) from the non-defective data group DA1 (Table 3) (S3). These data strings are called the 1-xj data groups.

Similarly, the data string extraction unit 5 extracts the data string of each input attribute xj (1 ≤ j ≤ 4) from the defective data group DA2 (Table 4) (S3). These data strings are called the 2-xj data groups.

The 1-xj data groups are shown in Tables 5 to 8, and the 2-xj data groups in Tables 9 to 12.

[Step 4]
The frequency calculation unit 6 sorts each of the 1-xj data groups extracted from the non-defective data group DA1 in step 3, and each of the 2-xj data groups extracted from the defective data group DA2 in step 3, in ascending order of the input attribute xj. Then, for each individual value of xj, it calculates the 1-xj cumulative frequency %, which is the percentage of records in the first data group whose value is at or below that value, and the 2-xj cumulative frequency %, which is the percentage of records in the second data group whose value is at or below that value (S4).

Here, Tables 13 to 16 are obtained by sorting Tables 5 to 8 in ascending order of the input attribute xj. For each row (id), the number of records located at or above that row in the table (that is, up to and including that row), divided by the total number of records in the first data group (= 8), is calculated as the 1-xj cumulative frequency %. Similarly, Tables 17 to 20 are obtained by sorting Tables 9 to 12 in ascending order of the input attribute xj, and for each row (id), the number of records located at or above that row, divided by the total number of records in the second data group (= 4), is calculated as the 2-xj cumulative frequency %. The calculated 1-xj and 2-xj cumulative frequency % values are shown in Tables 13 to 20.

In steps 3 and 4 above, the 1-xj and 2-xj cumulative frequency % values are calculated after the data strings have been extracted and sorted, but they may also be calculated directly, without extracting or sorting the data strings.
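
The direct calculation mentioned above could look like the following minimal sketch, which counts the records at or below each observed value without building sorted intermediate tables. The function names and the two value lists are assumptions for illustration; the lists are not the data of Tables 5 to 12.

```python
def cumulative_percent(values, v):
    """Percentage of values that are less than or equal to v."""
    return 100.0 * sum(1 for x in values if x <= v) / len(values)


def frequency_table(ok_values, ng_values):
    """Rows of (value, A, B) for every value observed in either group.

    A is the cumulative % in the non-defective group, B in the defective group.
    """
    return [
        (v, cumulative_percent(ok_values, v), cumulative_percent(ng_values, v))
        for v in sorted(set(ok_values) | set(ng_values))
    ]


# Illustrative values of one input attribute (assumed, 8 OK and 4 NG records).
ok_values = [1, 1, 2, 2, 2, 3, 4, 4]
ng_values = [3, 3, 4, 4]
table = frequency_table(ok_values, ng_values)
```

Because A and B are evaluated for every value that occurs in either group, no blank cells arise and no fill-in step is needed in this formulation.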

Further, the frequency calculation unit 6 joins the table of each 1-xj data group (non-defective data, with its 1-xj cumulative frequency % calculated) with the table of the corresponding 2-xj data group (defective data, with its 2-xj cumulative frequency % calculated). Specifically, for input attribute x1, Tables 13 and 17 are joined to create the x1 cumulative frequency table of Table 21; for x2, Tables 14 and 18 are joined to create the x2 cumulative frequency table of Table 22; for x3, Tables 15 and 19 are joined to create the x3 cumulative frequency table of Table 23; and for x4, Tables 16 and 20 are joined to create the x4 cumulative frequency table of Table 24.

Further, the frequency calculation unit 6 sorts each of the cumulative frequency tables in Tables 21 to 24 in ascending order of the input attribute xj. At this point, any blank 1-xj or 2-xj cumulative frequency % cell is filled with the value immediately preceding it. Where the same value of xj occurs in consecutive rows, only the last of those rows after sorting is kept. In this way, for each value of the input attribute xj, the frequency calculation unit 6 obtains both the 1-xj cumulative frequency % (A; first frequency), the percentage of records in the non-defective first data group at or below that value, and the 2-xj cumulative frequency % (B; second frequency), the percentage of records in the defective second data group at or below that value (S4).
[Step 5]
Next, the frequency cumulative difference calculation unit 7 calculates, for each value of the input attribute xj, the difference (= |A − B|) between the non-defective 1-xj cumulative frequency (A) and the defective 2-xj cumulative frequency (B) (S5). This difference is called the xj cumulative frequency difference (= |A − B|). The calculated xj cumulative frequency differences are shown in Tables 25 to 28.

FIGS. 3 to 6 show the relationship between the input attribute xj and the non-defective 1-xj cumulative frequency (A), the defective 2-xj cumulative frequency (B), and the xj cumulative frequency difference |A − B|.

The xj cumulative frequency difference |A − B| for a given value is an index of how well splitting at that value, into the range where xj is at or below the value and the range where xj exceeds it, separates the non-defective first data group DA1 from the defective second data group DA2. In other words, |A − B| is a threshold evaluation index that represents the degree to which records whose input attribute is at or below that value are skewed toward one of the first and second data groups.

Although the xj cumulative frequency difference |A − B| is calculated here as the threshold evaluation index, another index that evaluates the degree of skew in the data, such as information gain, information gain ratio, the Gini index, or mean squared error, may be used as the threshold evaluation index for each value.
[Step 6]
For each input attribute xj, the input attribute threshold determination unit 8 extracts, from among the individual values of xj, the value at which the xj cumulative frequency difference |A − B| is largest (S6). This value is called the input attribute threshold xj-th.

As can be seen from FIGS. 3 to 6, the input attribute threshold xj-th is the value of xj at which splitting into the ranges xj ≤ xj-th and xj > xj-th most readily separates the non-defective first data group DA1 from the defective second data group DA2.
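
Steps 5 and 6 together reduce to an argmax over the observed values of one input attribute. The following is a sketch under stated assumptions: the function names are hypothetical, the input lists are illustrative, and ties are resolved in favor of the smallest value, which the text does not specify.

```python
def cumulative_percent(values, v):
    """Percentage of values that are less than or equal to v."""
    return 100.0 * sum(1 for x in values if x <= v) / len(values)


def best_threshold(ok_values, ng_values):
    """Return (threshold, difference) where |A - B| is largest.

    Candidates are scanned in ascending order and max() keeps the first
    maximum, so the smallest value wins a tie (an assumption).
    """
    candidates = sorted(set(ok_values) | set(ng_values))
    scored = [
        (v, abs(cumulative_percent(ok_values, v) - cumulative_percent(ng_values, v)))
        for v in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


# Illustrative values of one input attribute (assumed, not from the tables).
threshold, difference = best_threshold([1, 1, 2, 2, 2, 3, 4, 4], [3, 3, 4, 4])
```

For these inputs the differences at the values 1, 2, 3, and 4 are 25, 62.5, 25, and 0, so the threshold is 2.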

Although steps 3 to 6 are performed here for all input attributes at once, they may instead be repeated while j is incremented from 1 to N.
[Step 7]
Next, the frequency cumulative ratio calculation unit 16 calculates, at xj = xj-th, the ratio of the defective 2-xj cumulative frequency (B) to the non-defective 1-xj cumulative frequency (A). This ratio is called the 2-xjth lower ratio (= B/A). It also calculates the ratio of 100 minus the defective 2-xj cumulative frequency (= 100 − B) to 100 minus the non-defective 1-xj cumulative frequency (= 100 − A). This ratio is called the 2-xjth upper ratio (= (100 − B)/(100 − A)). It then extracts the larger of the two ratios as the 2-xjth ratio.

Here, the 2-xjth lower ratio represents the rate at which the input attribute condition “xj ≤ xj-th” detects the defective second data group separately from the non-defective first data group. The 2-xjth upper ratio represents the rate at which the input attribute condition “xj > xj-th” detects the defective second data group separately from the non-defective first data group.

In other words, the 2-xjth lower ratio is an evaluation value (first rule evaluation value) representing the certainty of the first association rule, “if the input attribute xj is at or below the input attribute threshold xj-th, the record belongs to the second data group”. Likewise, the 2-xjth upper ratio is an evaluation value (second rule evaluation value) representing the certainty of the second association rule, “if the input attribute xj exceeds the input attribute threshold xj-th, the record belongs to the second data group”.
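
The two rule evaluation values can be computed from A and B alone, as in the sketch below. The names `safe_ratio` and `rule_ratios` are hypothetical. Treating a positive numerator over a zero denominator as infinity matches the ∞ entry the text reports for x2; returning 0 for 0/0 and preferring the lower ratio on a tie are assumptions made only for this sketch.

```python
import math


def safe_ratio(numerator, denominator):
    """Divide, mapping x/0 to infinity for x > 0 and 0/0 to 0 (assumption)."""
    if denominator == 0:
        return math.inf if numerator > 0 else 0.0
    return numerator / denominator


def rule_ratios(a, b):
    """Return (lower ratio, upper ratio, kind of the larger one).

    a and b are the cumulative frequency % values A (non-defective) and
    B (defective) at the input attribute threshold.
    """
    lower = safe_ratio(b, a)              # B / A
    upper = safe_ratio(100 - b, 100 - a)  # (100 - B) / (100 - A)
    kind = "upper" if upper > lower else "lower"
    return lower, upper, kind
```

For example, A = 100 and B = 50 give a lower ratio of 0.5 and an infinite upper ratio: every non-defective record lies at or below the threshold, so the condition "above the threshold" isolates defective records only.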

Table 29 shows, for each input attribute xj, the extracted input attribute threshold xj-th and the values at xj = xj-th of the non-defective 1-xj cumulative frequency (A), the defective 2-xj cumulative frequency (B), the xj cumulative frequency difference |A − B|, the 2-xjth lower ratio B/A, the 2-xjth upper ratio (100 − B)/(100 − A), and the 2-xjth ratio.

[Step 8]
The factor extraction unit 9 extracts, from the input attributes x1 to x4, the input attribute whose 2-xjth ratio from step 7 is largest. As a result, that input attribute, its threshold, and the type of ratio adopted (upper or lower) are extracted as data indicating the factor (input attribute condition) behind the output attribute condition that corresponds to the second data group. This amounts to extracting data indicating the input attribute condition of the association rule that has the highest 2-xjth lower ratio or 2-xjth upper ratio among the association rules for all input attributes.
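
Step 8 is a maximum over the per-attribute results of step 7. In the sketch below, `extract_factor` is a hypothetical helper; the x2 entry follows the text (threshold 2, upper ratio, infinite value), while the x1 and x3 entries are invented placeholders, not values from Table 29.

```python
import math


def extract_factor(rules):
    """Pick the attribute whose extracted ratio is largest.

    rules maps an attribute name to (threshold, ratio, kind), where kind is
    "upper" (condition x > threshold) or "lower" (condition x <= threshold).
    """
    name = max(rules, key=lambda n: rules[n][1])
    threshold, _, kind = rules[name]
    operator = ">" if kind == "upper" else "<="
    return name, threshold, kind, f"{name} {operator} {threshold}"


# x2 follows the text; x1 and x3 are hypothetical placeholder values.
rules = {
    "x1": (2, 1.5, "upper"),
    "x2": (2, math.inf, "upper"),
    "x3": (3, 2.0, "lower"),
}
factor = extract_factor(rules)
```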

Although the 2-xjth ratio is calculated here as the index for extracting the input attribute of the association rule with the largest rule evaluation value, other evaluation indices may be used for this purpose, such as support, confidence, information gain, information gain ratio, the Gini index, or mean squared error.

Referring to Table 29, at input attribute x2 = x2-th = 2, the 2-x2th ratio = 2-x2th upper ratio = ∞. This shows that the input attribute condition “x2 > 2” detects the defective second data group DA2 completely separated from the non-defective first data group DA1, which is easier to see by referring to FIG. 4.

The extracted input attribute (= x2), the input attribute threshold representing the value of that attribute (= 2), and the type of ratio adopted (= upper) are saved in the analysis result data storage unit 14.

In this way, the input attribute condition “x2 > 2” is extracted as one factor of the problem event (the defective second data group DA2).
[Step 9]
Since the input attribute condition “x2 > 2” was extracted in step 8 as one factor of the problem event (the defective second data group DA2), other factors are investigated next. To this end, the factor-undiscovered data extraction unit 10 divides the basic data group DA0 (Table 2) into the data group that satisfies the input attribute condition “x2 > 2” (factor data group) and the data group within DA0 for which the factor of the problem event has not yet been found (other data group), that is, the data group that satisfies “x2 ≤ 2” (does not satisfy “x2 > 2”), and extracts the latter (Table 30).

The factor-undiscovered data extraction unit 10 sends the extracted data group to the data classification unit 4 as the next (new) basic data group DA0.
[Step 10]
Then, with the data group extracted in step 9 as the next basic data group DA0, steps 2 to 9 above are repeated until the end condition determination unit 11 determines that the end condition is satisfied. The end condition determination unit 11 of this embodiment determines that the end condition is satisfied when the number of records in the defective second data group DA2 becomes 0 in step 2 during the iteration. Repeating the processing until the number of records in the defective second data group DA2 becomes 0 in this way yields a detailed factor analysis result.
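
The loop over steps 2 to 9 can be sketched end to end as follows. This is an illustrative reading of the procedure, not the patented implementation: all function names are hypothetical, the eight records are made up (they are not Table 2), ties are broken in favor of the first candidate, 0/0 is treated as 0, and the extra stops on an empty DA1 and on `max_rounds` are safety assumptions added for the sketch.

```python
import math


def cum_pct(values, v):
    """Percentage of values that are less than or equal to v."""
    return 100.0 * sum(1 for x in values if x <= v) / len(values)


def ratio(num, den):
    """num/den with x/0 -> infinity and 0/0 -> 0 (the latter is an assumption)."""
    if den == 0:
        return math.inf if num > 0 else 0.0
    return num / den


def best_rule(da1, da2, attrs):
    """Steps 3 to 8: return (attribute, threshold, kind) with the largest ratio."""
    best = None
    for name in attrs:
        ok = [r[name] for r in da1]
        ng = [r[name] for r in da2]
        # Steps 4 to 6: threshold that maximizes |A - B|.
        th = max(sorted(set(ok) | set(ng)),
                 key=lambda v: abs(cum_pct(ok, v) - cum_pct(ng, v)))
        a, b = cum_pct(ok, th), cum_pct(ng, th)
        # Step 7: lower ratio B/A and upper ratio (100-B)/(100-A).
        lower, upper = ratio(b, a), ratio(100 - b, 100 - a)
        kind, value = ("upper", upper) if upper > lower else ("lower", lower)
        if best is None or value > best[3]:
            best = (name, th, kind, value)
    return best[:3]


def analyze(records, attrs, y_th, max_rounds=10):
    """Repeat steps 2 to 9 until no problem-event records remain."""
    rules, data = [], list(records)
    for _ in range(max_rounds):
        da1 = [r for r in data if r["y"] != y_th]   # step 2
        da2 = [r for r in data if r["y"] == y_th]
        if not da2 or not da1:                      # end condition
            break
        name, th, kind = best_rule(da1, da2, attrs)
        rules.append((name, th, kind))
        # Step 9: keep only records that do not satisfy the extracted condition.
        keep = (lambda r: r[name] <= th) if kind == "upper" else (lambda r: r[name] > th)
        data = [r for r in data if keep(r)]
    return rules


# Illustrative records as (x1, x2, y) tuples; assumed data, not Table 2.
rows = [(1, 1, 1), (2, 1, 1), (1, 2, 3), (3, 2, 1),
        (3, 1, 2), (4, 2, 2), (1, 3, 2), (2, 4, 2)]
records = [{"x1": a, "x2": b, "y": y} for a, b, y in rows]
rules = analyze(records, ["x1", "x2"], y_th=2)
```

On this toy data the first round extracts "x2 > 2" (its upper ratio is infinite), the second round extracts "x1 > 2" from the remaining records, and the third pass finds no problem-event records and stops.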

The end condition may instead be another condition based on the number of records in the second data group DA2, for example: (1) the number of records in DA2 in step 2 during the iteration falls to or below a predetermined number; (2) the ratio of the number of records in DA2 to the number of records in DA1 in step 2 during the iteration falls to or below a predetermined ratio; or (3) the rule evaluation value of the input attribute condition extracted in step 8 during the iteration falls below a predetermined value. Using end conditions such as these yields a more concise yet sufficient factor analysis result. Furthermore, when obtaining a concise factor analysis result takes priority, the end condition may simply be that the iteration has been performed a predetermined number of times. Alternatively, the end condition determination unit 11 may be omitted and the iteration performed as many times as possible.
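
The alternative end conditions (1) to (3) can be combined into a single predicate, as in this hedged sketch. The function name, the parameter names, and the default of 0 for `max_ng` (which reproduces the embodiment's own condition) are assumptions.

```python
def should_stop(n_ok, n_ng, rule_value=None, *,
                max_ng=0, max_ng_ratio=None, min_rule_value=None):
    """Return True when any configured end condition holds.

    n_ok and n_ng are the record counts of DA1 and DA2 in step 2, and
    rule_value is the rule evaluation value extracted in step 8, if known.
    """
    if n_ng <= max_ng:                       # condition (1); 0 = embodiment default
        return True
    if max_ng_ratio is not None and n_ok > 0 and n_ng / n_ok <= max_ng_ratio:
        return True                          # condition (2)
    if (min_rule_value is not None and rule_value is not None
            and rule_value < min_rule_value):
        return True                          # condition (3)
    return False
```

Leaving the optional thresholds at `None` disables the corresponding check, so a caller can enable only the conditions it needs.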

In this example, the factor-undiscovered data group with x1 ≤ 2, extracted in step 9 of the second iteration, contained no defective data (second data group DA2; y = 2), so the iteration ended at the second round (at the point where the second factor extraction had been performed).
[Step 11]
The input attribute threshold table creation unit 12 creates an input attribute threshold table that stores the input attribute xj, the input attribute threshold xj-th, and the type of ratio adopted, as extracted in each iteration of step 10 (Table 31).

Where necessary, the input attribute threshold table creation unit 12 converts the numeric values of the input attribute thresholds xj-th in the input attribute threshold table into character data. The conversion rule to character data is the inverse of the conversion in step 0, as follows.
(x1) 1 → A, 2 → B, 3 → C, 4 → D
(x2) 1 → a, 2 → b, 3 → c, 4 → d
(x3) no conversion
(x4) no conversion
Table 32 shows the input attribute threshold table obtained by converting the input attribute thresholds xj-th in the table of Table 31 into character data.
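
The inverse conversion can again be expressed as a lookup. In this sketch `DECODING` and `decode_rule` are hypothetical names, and the rule tuples are illustrative stand-ins for rows of the input attribute threshold table.

```python
# Hypothetical inverse of the Step 0 conversion rule.
DECODING = {
    "x1": {1: "A", 2: "B", 3: "C", 4: "D"},
    "x2": {1: "a", 2: "b", 3: "c", 4: "d"},
}


def decode_rule(rule):
    """Map the numeric threshold of (attribute, threshold, kind) back to a character.

    Attributes without an entry in DECODING (x3, x4) keep their numeric threshold.
    """
    name, threshold, kind = rule
    return name, DECODING.get(name, {}).get(threshold, threshold), kind


decoded = [decode_rule(r) for r in [("x2", 2, "upper"), ("x1", 2, "upper")]]
```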

This input attribute threshold table corresponds to the classification conditions of the conventional decision tree-2 described in Patent Document 1 (FIG. 12) when attention is focused on separating out the output attribute y = Y (y = 2).
[Step 12]
Next, the contribution rate calculation unit 13 uses the input attribute threshold table of Table 31 to obtain the contribution rate (corresponding to support, an evaluation index for association rules) of each extracted input attribute to the problem event (y = 2; the original second data group DA2, which is the defective data group).

Table 33 is the original second data group DA2 (Table 4), the problem event (defective products), in which records matching the input attribute condition “x2 > 2” extracted as a factor in the first round, or the input attribute condition “x1 > 2” extracted in the second round, are marked with “*”.

From Table 33, the contribution rates of the input attribute conditions “x1 > 2” and “x2 > 2” to the problem event (the original second data group DA2) are obtained as shown in Table 34.

In Table 34, the contribution rate at the intersection of “x1 > 2” with “x1 > 2” and the contribution rate at the intersection of “x2 > 2” with “x2 > 2” are the contribution rates of “x1 > 2” as a sole factor and of “x2 > 2” as a sole factor, respectively. The contribution rates at the intersections of “x1 > 2” with “x2 > 2” both represent the contribution rate of the combined factor of “x1 > 2” and “x2 > 2”. Table 34 can also be represented as in FIG. 7.
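
One plausible way to compute such a contribution table is sketched below. It assumes that a sole-factor contribution counts defective records matching only that condition and that the combined contribution counts records matching both; the text does not spell out this counting rule. The function name and the four records are likewise hypothetical and are not the data of Table 33.

```python
def contribution_rates(ng_records, cond_a, cond_b):
    """Percentages of defective records explained by each condition.

    Assumed counting rule: "a_only" and "b_only" match exactly one condition,
    "both" matches the two conditions together.
    """
    n = len(ng_records)
    a_only = sum(1 for r in ng_records if cond_a(r) and not cond_b(r))
    b_only = sum(1 for r in ng_records if cond_b(r) and not cond_a(r))
    both = sum(1 for r in ng_records if cond_a(r) and cond_b(r))
    return {
        "a_only": 100.0 * a_only / n,
        "b_only": 100.0 * b_only / n,
        "both": 100.0 * both / n,
    }


# Illustrative defective records (assumed, not Table 33).
ng = [{"x1": 3, "x2": 3}, {"x1": 4, "x2": 1}, {"x1": 1, "x2": 4}, {"x1": 3, "x2": 4}]
rates = contribution_rates(ng, lambda r: r["x1"] > 2, lambda r: r["x2"] > 2)
```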

From Table 34 or FIG. 7, countermeasures against the problem event (y = 2) can be taken in order of priority (rank 1: x1, rank 2: x2).
[Step 13]
This completes the data analysis. The input attribute threshold table created by the input attribute threshold table creation unit 12 and the contribution rate data are stored as analysis result data in the analysis result data storage unit 14, such as a hard disk. This analysis result data is sent as appropriate from the analysis result data storage unit 14 to the output unit 15, such as a display device or printing device, where it can be displayed as a decision tree or table on the display device or printed as a decision tree or table on the printing device.

According to the present embodiment, the factors of a problem event can be derived in the very concise form shown in the input attribute threshold table of Table 32 (or Table 31), without defining a label hierarchy (FIG. 11) in advance as is required for conventional decision tree-2 (FIG. 12) described in Patent Document 1. The table can then be used to obtain the contribution ratio of each factor (input attribute) to the problem event.

When the input attribute threshold table of the present embodiment shown in Table 32 (or Table 31) is expressed in the form of a decision tree, the result is as shown in FIG. 8. When the contribution ratio of each factor to the problem event y = Y (= 2) is expressed in the same format as FIG. 7 using conventional decision tree-2 (FIG. 12), the result is as shown in FIG. 9.

Comparing the decision tree derived from the present embodiment (FIG. 8) with conventional decision tree-2 (FIG. 12) shows that the contribution of the input attribute x3 is not expressed in the present embodiment. As can be seen by comparing FIG. 7 with FIG. 9, this is because the problem event y = Y (y = 2) does not arise from the input attribute x1 alone or from the input attribute x3 alone, and it results from step 10 not having been executed on the data group with x1 > 2 in step 9 of the second iteration described above.

Pursuing the factors in detail would require extracting the contribution of the input attribute x3 as well. If the aim is to eliminate (improve on) the problem event y = Y (y = 2), however, extracting the input attribute x1 alone is sufficient. The present embodiment focuses on this point and extracts only the main factors that should be addressed for the problem event, so it does not extract the input attribute x3. When a detailed analysis is needed, step 10 may be executed on both of the data groups produced by the two-way split in step 9.

In the embodiment described above, a plurality of factors are derived and a decision tree is generated. If only a single factor needs to be extracted, the process may end at step 8.

The data analysis method described above can be realized by a computer executing a data analysis program that includes processes corresponding to S0 to S12 (steps 0 to 13) in FIG. 2. Accordingly, the data analysis apparatus of FIG. 1 can be realized by the data analysis program causing a computer to function as the character-numeric data conversion unit 1, the analysis target data storage unit 2, the threshold setting unit 3, the data classification unit 4, the data string extraction unit 5, the frequency calculation unit 6, the frequency cumulative difference calculation unit 7, the input attribute threshold determination unit 8, the frequency cumulative ratio calculation unit 16, the factor extraction unit 9, the factor-undiscovered data extraction unit 10, the end condition determination unit 11, the input attribute threshold table creation unit 12, and the contribution ratio calculation unit 13.

The program can be provided to users stored on a computer-readable recording medium. The recording medium may be a built-in medium incorporated in the computer body or a removable medium configured to be separable from the computer body. Examples of built-in media include ROM; rewritable nonvolatile memory such as flash memory; and hard disks. Examples of removable media include optical recording media such as CD-ROM and DVD; magneto-optical recording media such as MO; magnetic recording media such as floppy (registered trademark) disks, cassette tapes, and removable hard disks; media with built-in rewritable nonvolatile memory, such as memory cards; and media with built-in ROM, such as ROM cassettes.

The program may be configured to be executed through access by the CPU. Alternatively, the program stored on the recording medium may be read out and transferred to a program storage area of the built-in medium, after which the program on the built-in medium is executed through access by the CPU. The program is not limited to one sold in a state stored on a computer-readable recording medium; it may also be sold in a form in which it is transferred to the user's computer over a communication network such as the Internet.

In the present embodiment, the data classification unit 4 performs classification by comparing the output attribute with the output attribute threshold. When the output attribute is a character attribute, however, the character-numeric data conversion unit 1 may leave the output attribute unconverted, and the data classification unit 4 may instead perform classification by comparing the output attribute with the output attribute value targeted for factor analysis (the character Y).
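
A minimal sketch of this variation, assuming the classification is expressed as a predicate on the output attribute. The function name `classify`, the parameter `is_target`, and the four rows are hypothetical; with a numerical output attribute the predicate would instead compare y with the output attribute threshold.

```python
def classify(rows, is_target):
    """Split (inputs, y) rows into (first data group, second data group).

    Rows whose output attribute satisfies `is_target` form the second
    data group (the problem event); all other rows form the first.
    """
    first, second = [], []
    for inputs, y in rows:
        (second if is_target(y) else first).append((inputs, y))
    return first, second


# Made-up rows with a character output attribute.
rows = [((1, 1), "X"), ((3, 2), "Y"), ((2, 3), "Z"), ((3, 3), "Y")]

# Character comparison: the target of the factor analysis is the label "Y".
first, second = classify(rows, is_target=lambda y: y == "Y")
print(len(first), len(second))
```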

As described above, the data analysis method according to the present embodiment takes as its analysis target a basic data group composed of N columns of input attribute data, made up of N attributes (N is an integer of 2 or more), and one column of output attribute data, made up of one attribute, and analyzes the causal relationship between the output attribute and the input attributes. The method includes:
a first step of setting an output attribute threshold;
a second step of dividing the basic data group into a first data group and a second data group based on a comparison between the value of the output attribute and the output attribute threshold;
a third step of extracting, from the first data group and the second data group respectively, a 1-J data string and a 2-J data string representing the data string of the J-th input attribute (J is an integer satisfying 1 ≦ J ≦ N);
a fourth step of calculating, for each value of the J-th input attribute in the 1-J data string, a 1-J frequency accumulation (%) representing the proportion of data at or below that value, and calculating, for each value of the J-th input attribute in the 2-J data string, a 2-J frequency accumulation (%) representing the proportion of data at or below that value;
a fifth step of calculating, for each of all the values of the J-th input attribute across both the 1-J data string and the 2-J data string, a J-th frequency cumulative difference representing the absolute value of the difference between the 1-J frequency accumulation (%) and the 2-J frequency accumulation (%);
a sixth step of extracting, as the J-th input attribute threshold, the value of the J-th input attribute at which the J-th frequency cumulative difference is largest;
a seventh step of calculating, with the J-th input attribute at the J-th input attribute threshold, a 2-J lower ratio representing the ratio of the 2-J frequency accumulation (%) to the 1-J frequency accumulation (%) and a 2-J upper ratio representing the ratio of 100 minus the 2-J frequency accumulation (%) to 100 minus the 1-J frequency accumulation (%), and extracting the larger of the two as the 2-J ratio;
an eighth step of repeating the third to seventh steps while increasing J from 1 to N, and extracting and saving the input attribute whose 2-J ratio is largest among the 2-J ratios extracted in the seventh step for the first to N-th input attributes, the input attribute threshold representing the value of that input attribute, and the type of ratio adopted;
a ninth step of dividing the basic data group in two based on the input attribute extracted in the eighth step; and
a tenth step of repeating the second to ninth steps, with at least one of the data groups produced in the ninth step as a new basic data group, until a predetermined end condition is satisfied.
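
The fourth to seventh steps can be written compactly as follows. The symbols A_J, B_J, D_J, t_J, n_1, and n_2 are notation introduced here for illustration only and do not appear in the original text; n_1 and n_2 denote the numbers of data in the first and second data groups.

\[
A_J(v) = 100 \cdot \frac{\left|\{\, i \in \mathrm{DA1} : x_J^{(i)} \le v \,\}\right|}{n_1},
\qquad
B_J(v) = 100 \cdot \frac{\left|\{\, i \in \mathrm{DA2} : x_J^{(i)} \le v \,\}\right|}{n_2}
\]

\[
D_J(v) = \left| A_J(v) - B_J(v) \right|,
\qquad
t_J = \arg\max_{v} D_J(v)
\]

\[
R_J^{\mathrm{lower}} = \frac{B_J(t_J)}{A_J(t_J)},
\qquad
R_J^{\mathrm{upper}} = \frac{100 - B_J(t_J)}{100 - A_J(t_J)},
\qquad
R_J = \max\left( R_J^{\mathrm{lower}},\, R_J^{\mathrm{upper}} \right)
\]

Here A_J and B_J are the 1-J and 2-J frequency accumulations (%), D_J is the J-th frequency cumulative difference, t_J is the J-th input attribute threshold, and R_J is the 2-J ratio. As an arithmetic illustration with values chosen only for this example, A_J(t_J) = 80 and B_J(t_J) = 25 give a lower ratio of 25/80 = 0.3125 and an upper ratio of 75/20 = 3.75, so the upper ratio would be adopted. The text does not state how a zero denominator is to be treated.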

According to the above method, multiple factors of a problem event can be derived in a very concise form without defining a label hierarchy in advance. The result can be used to create a decision tree representing the causal relationships and to obtain the contribution ratio of each factor (input attribute) to the problem event (output attribute).
To solve the above problem, the data analysis apparatus according to the present invention may be a data analysis apparatus that takes as its analysis target a basic data group, which is a set of data composed of a plurality of input attributes and an output attribute, analyzes the causal relationship between the input attributes and the output attribute, and extracts information indicating the causal relationship, the apparatus including: classification means for classifying the basic data group into a first data group and a second data group according to the output attribute; first evaluation means for calculating, for every numerical value of each input attribute, a threshold evaluation index representing the degree to which the data whose input attribute is at or below that value is biased toward one of the first data group and the second data group; threshold determination means for determining, based on the threshold evaluation indices calculated by the first evaluation means, the numerical value with the largest threshold evaluation index for each input attribute as the threshold of that input attribute; second evaluation means for calculating, for each input attribute and based on the thresholds determined by the threshold determination means, a first rule evaluation value representing the plausibility of the association rule “if the input attribute is at or below the threshold, the data belongs to the second data group” and a second rule evaluation value representing the plausibility of the association rule “if the input attribute exceeds the threshold, the data belongs to the second data group”; and factor extraction means for extracting data indicating the input attribute condition of the association rule with the highest rule evaluation value among the association rules for all input attributes, as information indicating a factor of the output attribute condition corresponding to the second data group.
To solve the above problem, the data analysis method according to the present invention may be a data analysis method that uses the above data analysis apparatus to take as its analysis target a basic data group, which is a set of data composed of a plurality of input attributes and an output attribute, analyze the causal relationship between the input attributes and the output attribute, and extract information indicating the causal relationship, the method including: a classification step in which the classification means classifies the basic data group into a first data group and a second data group according to the output attribute; a first evaluation step in which the first evaluation means calculates, for every numerical value of each input attribute, a threshold evaluation index representing the degree to which the data whose input attribute is at or below that value is biased toward one of the first data group and the second data group; a threshold determination step in which the threshold determination means determines, based on the threshold evaluation indices calculated in the first evaluation step, the numerical value with the largest threshold evaluation index for each input attribute as the threshold of that input attribute; a second evaluation step in which the second evaluation means calculates, for each input attribute and based on the thresholds determined in the threshold determination step, a first rule evaluation value representing the plausibility of the association rule “if the input attribute is at or below the threshold, the data belongs to the second data group” and a second rule evaluation value representing the plausibility of the association rule “if the input attribute exceeds the threshold, the data belongs to the second data group”; and a factor extraction step in which the factor extraction means extracts data indicating the input attribute condition of the association rule with the highest rule evaluation value among the association rules for all input attributes, as information indicating a factor of the output attribute condition corresponding to the second data group.
To solve the above problem, the data analysis program according to the present invention may be a data analysis program for causing a computer to function as: classification means for classifying a basic data group into a first data group and a second data group according to the output attribute; first evaluation means for calculating, for every numerical value of each input attribute, a threshold evaluation index representing the degree to which the data whose input attribute is at or below that value is biased toward one of the first data group and the second data group; threshold determination means for determining, based on the threshold evaluation indices calculated by the first evaluation means, the numerical value with the largest threshold evaluation index for each input attribute as the threshold of that input attribute; second evaluation means for calculating, for each input attribute and based on the thresholds determined by the threshold determination means, a first rule evaluation value representing the plausibility of the association rule “if the input attribute is at or below the threshold, the data belongs to the second data group” and a second rule evaluation value representing the plausibility of the association rule “if the input attribute exceeds the threshold, the data belongs to the second data group”; and factor extraction means for extracting data indicating the input attribute condition of the association rule with the highest rule evaluation value among the association rules for all input attributes, as information indicating a factor of the output attribute condition corresponding to the second data group.
The data analysis apparatus according to the present invention may further include dividing means for dividing the basic data group, based on the input attribute condition extracted by the factor extraction means, into a factor data group that satisfies the input attribute condition and an other data group that does not satisfy it, and sending at least one of the divided data groups to the classification means as a new basic data group, such that a series of processes consisting of the processing by the classification means, the first evaluation means, the threshold determination means, the second evaluation means, the factor extraction means, and the dividing means is executed repeatedly.
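
The repeated series of processes can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation of the apparatus: the function names, the predicate `is_problem`, the tie-breaking toward the smallest value, the handling of a zero denominator, the `max_rounds` guard, and the toy data are all hypothetical. The sketch sends only the data that do not satisfy the extracted condition back to the classification, and stops when the second data group is empty.

```python
def cum_pct(column, v):
    """Frequency accumulation (%): share of entries in `column` that are <= v."""
    return 100.0 * sum(1 for x in column if x <= v) / len(column)


def safe_ratio(num, den):
    # Assumption: the source does not say how a zero denominator is handled.
    if den > 0:
        return num / den
    return float("inf") if num > 0 else 0.0


def extract_factor(first, second):
    """Return (attribute index, threshold, kind, ratio) of the best rule.

    `first` and `second` are lists of input-attribute tuples.
    kind is "<=" (first ratio) or ">" (second ratio); indices are 0-based.
    """
    best = None
    for j in range(len(second[0])):
        c1 = [row[j] for row in first]
        c2 = [row[j] for row in second]
        # Threshold evaluation index |A - B| at each observed value;
        # ties go to the smallest value (an assumption).
        values = sorted(set(c1) | set(c2))
        th = max(values, key=lambda v: abs(cum_pct(c1, v) - cum_pct(c2, v)))
        a, b = cum_pct(c1, th), cum_pct(c2, th)
        candidates = (("<=", safe_ratio(b, a)),
                      (">", safe_ratio(100 - b, 100 - a)))
        for kind, ratio in candidates:
            if best is None or ratio > best[3]:
                best = (j, th, kind, ratio)
    return best


def analyse(rows, is_problem, max_rounds=10):
    """rows: list of (inputs, y). Returns the extracted conditions in order."""
    factors = []
    for _ in range(max_rounds):
        first = [x for x, y in rows if not is_problem(y)]
        second = [x for x, y in rows if is_problem(y)]
        # End condition: no problem data left. Stopping on an empty first
        # group as well is an added guard against division by zero.
        if not second or not first:
            break
        j, th, kind, _ratio = extract_factor(first, second)
        factors.append((j, kind, th))
        # Keep only the data that do NOT satisfy the extracted condition.
        if kind == "<=":
            rows = [(x, y) for x, y in rows if x[j] > th]
        else:
            rows = [(x, y) for x, y in rows if x[j] <= th]
    return factors


# Made-up data: inputs (x1, x2), output y; y == 2 is the problem event.
toy = [((1, 1), 1), ((2, 3), 1), ((1, 2), 1), ((3, 1), 1), ((2, 2), 1),
       ((3, 1), 2), ((3, 2), 2), ((1, 3), 2), ((3, 3), 2)]
factors = analyse(toy, is_problem=lambda y: y == 2)
print(factors)
```

On this toy data the sketch extracts a condition on the first attribute and then one on the second, each of the form "greater than 2". The order and values depend entirely on the made-up rows and say nothing about the tables in this document.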

A block diagram showing the configuration of a data analysis apparatus according to an embodiment of the present invention.
A flowchart showing a data analysis method according to an embodiment of the present invention.
A graph of an example of the output of the frequency cumulative difference calculation unit 7 (step 5) in the data analysis apparatus according to an embodiment of the present invention, showing the relationship between the input attribute x1 and the 1-x1 frequency accumulation of non-defective products (A), the 2-x1 frequency accumulation of defective products (B), and the x1 frequency cumulative difference |A-B|.
A graph of an example of the output of the frequency cumulative difference calculation unit 7 (step 5) in the data analysis apparatus according to an embodiment of the present invention, showing the relationship between the input attribute x2 and the 1-x2 frequency accumulation of non-defective products (A), the 2-x2 frequency accumulation of defective products (B), and the x2 frequency cumulative difference |A-B|.
A graph of an example of the output of the frequency cumulative difference calculation unit 7 (step 5) in the data analysis apparatus according to an embodiment of the present invention, showing the relationship between the input attribute x3 and the 1-x3 frequency accumulation of non-defective products (A), the 2-x3 frequency accumulation of defective products (B), and the x3 frequency cumulative difference |A-B|.
A graph of an example of the output of the frequency cumulative difference calculation unit 7 (step 5) in the data analysis apparatus according to an embodiment of the present invention, showing the relationship between the input attribute x4 and the 1-x4 frequency accumulation of non-defective products (A), the 2-x4 frequency accumulation of defective products (B), and the x4 frequency cumulative difference |A-B|.
An example of the data output by the contribution ratio calculation unit 13 (step 12) in the data analysis apparatus according to an embodiment of the present invention, showing the contribution ratios of the input attribute condition “x1 > 2” and the input attribute condition “x2 > 2” to the output attribute condition y = 2 (= Y), which is the problem event.
A diagram expressing the input attribute threshold table of the embodiment of the present invention in the form of a decision tree.
A diagram expressing conventional decision tree-2 in the same format as FIG. 7.
A diagram showing conventional decision tree-1.
A diagram showing the label hierarchy of conventional decision tree-2, in which (a) shows the x1 attribute, (b) the x2 attribute, (c) the x3 attribute, and (d) the x4 attribute.
A diagram showing conventional decision tree-2.

Explanation of symbols

3 Threshold setting unit (threshold setting means)
4 Data classification unit (classification means)
6 Frequency calculation unit (first evaluation means, frequency calculation means)
7 Frequency cumulative difference calculation unit (first evaluation means, difference calculation means)
8 Input attribute threshold determination unit (threshold determination means)
9 Factor extraction unit (factor extraction means)
10 Factor-undiscovered data extraction unit (dividing means)
11 End condition determination unit (end condition determination means)
16 Frequency cumulative ratio calculation unit (second evaluation means)

Claims

1. A data analysis apparatus that takes as its analysis target a basic data group DA, which is a set of data stored in an analysis target data storage unit and composed of a plurality of input attributes x_j (1 ≦ j ≦ N, where N is the number of input attributes) and one output attribute y, and that analyzes the causal relationship between the input attributes and the output attribute, the apparatus comprising:
character-numeric data conversion means for generating a numerical basic data group DA0, which is a set of numerical attribute data, by converting character attribute data included in the basic data group DA into numerical attribute data according to a unique conversion rule;
classification means for classifying the numerical basic data group DA0 into a first data group DA1 and a second data group DA2 based on a comparison of the magnitude relationship between the numerical value of the output attribute y included in the numerical basic data group DA0 and a predetermined threshold of the output attribute y;
first evaluation means for performing, for each of the plurality of input attributes, an operation on one input attribute x_j of the plurality of input attributes, the operation comprising obtaining, for each numerical value that the one input attribute x_j can take, a first frequency (1-x_j frequency accumulation %), which is the ratio of the number of data belonging to the first data group DA1 among the data having a value at or below that numerical value to the number of all data belonging to the first data group DA1, obtaining, for each numerical value that the one input attribute x_j can take, a second frequency (2-x_j frequency accumulation %), which is the ratio of the number of data belonging to the second data group DA2 among the data having a value at or below that numerical value to the number of all data belonging to the second data group DA2, and obtaining, for each numerical value that the one input attribute x_j can take, the difference between the first frequency and the second frequency (x_j frequency cumulative difference %);
threshold determination means for performing, for each of the plurality of input attributes, an operation of determining, for one input attribute x_j of the plurality of input attributes and based on the difference (x_j frequency cumulative difference %) calculated by the first evaluation means for each numerical value that the one input attribute x_j can take, the numerical value at which the difference is largest as the threshold x_{j-th} of the input attribute x_j;
second evaluation means for performing, for each of the plurality of input attributes, an operation of calculating, for one input attribute x_j of the plurality of input attributes, a first ratio, which is the ratio of the second frequency (2-x_j frequency accumulation %) to the first frequency (1-x_j frequency accumulation %) at the threshold x_{j-th} of the input attribute x_j determined by the threshold determination means, and a second ratio, which is the ratio of (100% - the second frequency (2-x_j frequency accumulation %)) to (100% - the first frequency (1-x_j frequency accumulation %)) at the threshold x_{j-th} of the input attribute x_j determined by the threshold determination means, and selecting the larger of the first ratio and the second ratio; and
factor extraction means for extracting, as data indicating an input attribute condition, the input attribute x_j with the largest ratio among the ratios selected for each input attribute by the second evaluation means, the threshold x_{j-th} of that input attribute x_j, and a type indicating whether the largest ratio is the first ratio or the second ratio, and storing the input attribute condition in an analysis result data storage unit.

2. The data analysis apparatus according to claim 1, further comprising dividing means for dividing the numerical basic data group DA0, based on the input attribute condition extracted by the factor extraction means, into a factor data group that satisfies the input attribute condition and an other data group that does not satisfy the input attribute condition, and sending at least one of the divided data groups to the classification means as a new numerical basic data group DA0,
wherein a series of processes consisting of the processing by the classification means, the processing by the first evaluation means, the processing by the threshold determination means, the processing by the second evaluation means, the processing by the factor extraction means, and the processing by the dividing means is executed repeatedly.

3. The data analysis apparatus according to claim 2, wherein the dividing means selects only the other data group from the divided data groups and sends it to the classification means as a new numerical basic data group DA0.

4. The data analysis apparatus according to claim 2, further comprising end condition determination means for determining whether an end condition is satisfied, wherein execution of the series of processes ends when the end condition determination means determines that the end condition is satisfied.

5. The data analysis apparatus according to claim 4, wherein the end condition determination means determines, as the end condition, whether the number of data in the second data group classified by the classification means is 0.

6. The data analysis apparatus according to claim 1, further comprising threshold setting means for setting the predetermined threshold of the output attribute in accordance with predetermined setting information or in response to an input from a user.

7. The data analysis apparatus according to claim 1, wherein
the input attributes are manufacturing process conditions and/or in-line inspection results in a product manufacturing process,
the output attribute is a quality determination result for the product, and
the second data group is a data group whose quality determination result is defective.

A plurality of input attributes x stored in the analysis target data storage unit using the data analysis device according to claim 1. _ｊj (1 ≦ j ≦ N, where N is the number of input attributes) and a basic data group DA that is a set of data composed of one output attribute y, and the causal relationship between the input attribute and the output attribute is A data analysis method for analyzing,
Numeric type basic data which is a set of numeric attribute data by converting character attribute data included in the basic data group DA into numeric attribute data according to a unique conversion rule by the character-numeric data conversion means. A character-numeric data conversion step for generating the group DA0;
Based on the comparison of the magnitude relationship between the numerical value of the output attribute y included in the numerical basic data group DA0 and the predetermined threshold value of the output attribute y by the classification means, the first data group DA1 And a classification step for classifying the data into the second data group DA2.
One input attribute x out of the plurality of input attributes by the first evaluation means. _ｊj The one input attribute x _ｊj The first frequency which is the ratio of the number of data belonging to the first data group DA1 to the number of all data belonging to the first data group DA1 among the data having a numerical value less than or equal to the numerical value that can be taken (1-x _ｊj Frequency accumulation%), and the one input attribute x _ｊj For each possible numerical value, a second frequency that is a ratio of the number of data belonging to the second data group DA2 to the number of all data belonging to the second data group DA2 among data having a numerical value equal to or lower than the numerical value. (2-x _ｊj Frequency accumulation%), and the one input attribute x _ｊj For each possible numerical value, the difference between the first frequency and the second frequency (x _ｊj A first evaluation step of performing an operation for obtaining a frequency cumulative difference%) for each of the plurality of input attributes;
A threshold value determination step in which the threshold value determination means determines, for each of the plurality of input attributes, a threshold value x_{j-th} of one input attribute x_j among the plurality of input attributes, the threshold value x_{j-th} being the numerical value at which the difference (x_j frequency cumulative difference %) calculated by the first evaluation means for each numerical value that the input attribute x_j can take is largest;
A second evaluation step in which the second evaluation means performs, for each of the plurality of input attributes, an operation of calculating, for one input attribute x_j among the plurality of input attributes: a first ratio, which is the ratio of the first frequency (1-x_j frequency cumulative %) to the second frequency (2-x_j frequency cumulative %) at the threshold value x_{j-th} of the input attribute x_j determined by the threshold value determination means; and a second ratio, which is the ratio of (100% - first frequency (1-x_j frequency cumulative %)) to (100% - second frequency (2-x_j frequency cumulative %)) at the threshold value x_{j-th} of the input attribute x_j determined by the threshold value determination means; and of selecting the larger of the first ratio and the second ratio; and
A factor extraction step in which the factor extraction means extracts, as data indicating an input attribute condition, the input attribute x_j having the largest ratio among the ratios selected for each input attribute by the second evaluation means, the threshold value x_{j-th} of that input attribute x_j, and a type indicating whether the largest ratio is the first ratio or the second ratio, and stores the input attribute condition in the analysis result data storage unit.
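
For illustration only, the first evaluation, threshold value determination, second evaluation, and factor extraction steps recited above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the use of the absolute difference when searching for the largest difference, the tie-breaking rule (smallest value wins), the handling of a zero denominator in `safe_ratio`, and the sample data are all hypothetical and are not specified in the claim.

```python
from typing import Dict, List, Tuple


def cumulative_pct(values: List[float], v: float) -> float:
    """Frequency cumulative %: share of `values` that are <= v."""
    return 100.0 * sum(1 for x in values if x <= v) / len(values)


def safe_ratio(num: float, den: float) -> float:
    """Ratio with an assumed convention for a zero denominator."""
    if den == 0:
        return float("inf") if num > 0 else 0.0
    return num / den


def evaluate_attribute(da1_vals: List[float],
                       da2_vals: List[float]) -> Tuple[float, float, str]:
    """Return (threshold, selected ratio, ratio type) for one attribute."""
    # Every numerical value the attribute takes in either group.
    candidates = sorted(set(da1_vals) | set(da2_vals))
    # First evaluation + threshold determination: the value where the
    # cumulative frequencies differ most (absolute difference is an
    # assumption; ties resolve to the smallest candidate).
    threshold = max(
        candidates,
        key=lambda v: abs(cumulative_pct(da1_vals, v)
                          - cumulative_pct(da2_vals, v)),
    )
    f1 = cumulative_pct(da1_vals, threshold)  # first frequency
    f2 = cumulative_pct(da2_vals, threshold)  # second frequency
    # Second evaluation: compare the two ratios and keep the larger.
    first_ratio = safe_ratio(f1, f2)
    second_ratio = safe_ratio(100.0 - f1, 100.0 - f2)
    if first_ratio >= second_ratio:
        return threshold, first_ratio, "first"
    return threshold, second_ratio, "second"


def extract_factor(da1: Dict[str, List[float]],
                   da2: Dict[str, List[float]]) -> Tuple[str, float, float, str]:
    """Factor extraction: attribute with the largest selected ratio."""
    results = {name: evaluate_attribute(da1[name], da2[name]) for name in da1}
    best = max(results, key=lambda name: results[name][1])
    threshold, ratio, kind = results[best]
    return best, threshold, ratio, kind


# Hypothetical data: per-attribute values for data groups DA1 and DA2.
da1 = {"x1": [1, 2, 2, 3], "x2": [5, 1, 4, 2]}
da2 = {"x1": [2, 4, 5, 5], "x2": [3, 1, 5, 2]}
factor = extract_factor(da1, da2)
print(factor)
```

With this sample data, the sketch selects x1 with threshold 3: all DA1 data but only a quarter of the DA2 data have x1 at or below 3, so the first ratio (100% / 25%) dominates.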

A data analysis program for causing a computer included in a data analysis device to function, the data analysis device analyzing the causal relationship between input attributes and an output attribute in a basic data group DA stored in an analysis target data storage unit, the basic data group DA being a set of data each composed of a plurality of input attributes x_j (1 ≦ j ≦ N, where N is the number of input attributes) and one output attribute y,
wherein the data analysis device comprises:
Character-numeric data conversion means for generating a numeric basic data group DA0, which is a set of numeric attribute data, by converting character attribute data included in the basic data group DA into numeric attribute data according to a unique conversion rule;
Classification means for classifying the numeric basic data group DA0 into a first data group DA1 and a second data group DA2, based on a comparison of the magnitude relationship between the numerical value of the output attribute y included in the numeric basic data group DA0 and a predetermined threshold value of the output attribute y;
First evaluation means for performing, for each of the plurality of input attributes, an operation of obtaining, for one input attribute x_j among the plurality of input attributes and for each numerical value that the input attribute x_j can take: a first frequency (1-x_j frequency cumulative %), which is the ratio of the number of data that belong to the first data group DA1 and have a numerical value less than or equal to that numerical value to the number of all data belonging to the first data group DA1; a second frequency (2-x_j frequency cumulative %), which is the ratio of the number of data that belong to the second data group DA2 and have a numerical value less than or equal to that numerical value to the number of all data belonging to the second data group DA2; and the difference between the first frequency and the second frequency (x_j frequency cumulative difference %);
Threshold value determination means for determining, for each of the plurality of input attributes, a threshold value x_{j-th} of one input attribute x_j among the plurality of input attributes, the threshold value x_{j-th} being the numerical value at which the difference (x_j frequency cumulative difference %) calculated by the first evaluation means for each numerical value that the input attribute x_j can take is largest;
Second evaluation means for performing, for each of the plurality of input attributes, an operation of calculating, for one input attribute x_j among the plurality of input attributes: a first ratio, which is the ratio of the first frequency (1-x_j frequency cumulative %) to the second frequency (2-x_j frequency cumulative %) at the threshold value x_{j-th} of the input attribute x_j determined by the threshold value determination means; and a second ratio, which is the ratio of (100% - first frequency (1-x_j frequency cumulative %)) to (100% - second frequency (2-x_j frequency cumulative %)) at the threshold value x_{j-th} of the input attribute x_j determined by the threshold value determination means; and of selecting the larger of the first ratio and the second ratio; and
Factor extraction means for extracting, as data indicating an input attribute condition, the input attribute x_j having the largest ratio among the ratios selected for each input attribute by the second evaluation means, the threshold value x_{j-th} of that input attribute x_j, and a type indicating whether the largest ratio is the first ratio or the second ratio, and for storing the input attribute condition in the analysis result data storage unit,
the data analysis program causing the computer to function as each of the above means.
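
As a rough illustration of the character-numeric data conversion means and the classification means recited above, the following Python sketch converts character attribute values to numbers and then splits the resulting data by a threshold on the output attribute y. Several points are assumptions rather than anything stated in the claim: the conversion rule (each distinct character value receives the next integer code in order of first appearance), the direction of the comparison (data with y above the threshold go to DA1), and all names and sample values are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, int, float]
Row = Dict[str, Value]
NumericRow = Dict[str, float]


def to_numeric(rows: List[Row]) -> Tuple[List[NumericRow], Dict[str, Dict[str, int]]]:
    """Convert character attribute data to numbers with a one-to-one rule.

    Assumed rule: per attribute, each distinct character value is given
    the next integer code (1, 2, 3, ...) in order of first appearance.
    """
    rules: Dict[str, Dict[str, int]] = {}
    converted: List[NumericRow] = []
    for row in rows:
        new_row: NumericRow = {}
        for name, value in row.items():
            if isinstance(value, str):
                rule = rules.setdefault(name, {})
                # Reuse the existing code or assign the next unused one.
                code = rule.setdefault(value, len(rule) + 1)
                new_row[name] = float(code)
            else:
                new_row[name] = float(value)
        converted.append(new_row)
    return converted, rules


def classify(da0: List[NumericRow], y_threshold: float,
             output: str = "y") -> Tuple[List[NumericRow], List[NumericRow]]:
    """Split DA0 into DA1 (y above threshold) and DA2 (the rest).

    Which side of the threshold maps to DA1 is an assumption here.
    """
    da1 = [r for r in da0 if r[output] > y_threshold]
    da2 = [r for r in da0 if r[output] <= y_threshold]
    return da1, da2


# Hypothetical basic data group DA with one character attribute (x1).
da = [
    {"x1": "A", "x2": 10, "y": 0.9},
    {"x1": "B", "x2": 12, "y": 0.2},
    {"x1": "A", "x2": 11, "y": 0.7},
    {"x1": "C", "x2": 15, "y": 0.1},
]
da0, rules = to_numeric(da)
da1, da2 = classify(da0, y_threshold=0.5)
print(rules)
print(len(da1), len(da2))
```

Returning the conversion rule alongside DA0 keeps the mapping reversible, so a threshold found on the numeric codes can be reported back in terms of the original character values.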

A computer-readable recording medium on which the data analysis program according to claim 9 is recorded.