JP5430702B2

JP5430702B2 - Specific data detection method, specific data detection program, and specific data detection apparatus

Info

Publication number: JP5430702B2
Application number: JP2012079627A
Authority: JP
Inventors: すみれ桑田; 賢吾吉田
Original assignee: Mitsubishi Electric Information Systems Corp
Current assignee: Mitsubishi Electric Information Systems Corp
Priority date: 2012-03-30
Filing date: 2012-03-30
Publication date: 2014-03-05
Anticipated expiration: 2032-03-30
Also published as: JP2013210759A

Description

The present invention relates to a technique for determining, in a data group containing a plurality of data items, whether each data item is singular data.

Data cleansing is a process that evaluates a plurality of attribute values associated with the data to be cleansed and detects, as singular data, data whose properties differ greatly from those of the other data. In data cleansing, singular data is detected based on a "degree of deviation" that indicates how far each attribute value of the data departs from a reference.

An example of such data cleansing is described in Patent Document 1. In the method of Patent Document 1, cleansing is performed by deleting records whose proportion of abnormal values is equal to or greater than a predetermined threshold.

Patent Document 1: JP 2004-29971 A

However, the conventional technique has a problem in that singular data cannot always be detected appropriately.
For example, the method of Patent Document 1 takes the proportion of abnormal values into account but not the degree of deviation of each abnormal value. As a result, data containing many slightly abnormal attribute values is over-sensitively judged to be singular data, whereas data containing only one abnormal value that deviates greatly is likely to be judged to be normal data regardless of how large that deviation is.

The present invention has been made to solve this problem, and its object is to provide a singular data detection method, a singular data detection program, and a singular data detection device that can detect singular data more appropriately.

To solve the above problem, the singular data detection method according to the present invention is a method in which an arithmetic unit provided in a computer determines whether data including attribute values, each representing a value for one of a plurality of attribute items, is singular data. The method includes: a step of determining an outlier evaluation score for each attribute value of the data, based on the degree of deviation from a reference range calculated in advance for each attribute item from a plurality of the data, the outlier evaluation score being 0 if the attribute value is within the reference range and being calculated according to the degree of deviation if the attribute value is outside the reference range; a step of counting, for each of the data, the number of outlier attributes, which is the number of attribute values whose outlier evaluation score is equal to or greater than a predetermined threshold; a step of determining, for each of the data, a singularity score based on the outlier evaluation scores and the number of outlier attributes; and a step of determining, for each of the data, whether the data is singular data based on the singularity score.
In the step of determining the singularity score, the arithmetic unit calculates a total outlier evaluation score by summing the outlier evaluation scores of the data, and determines the singularity score by multiplying this total by a weight determined based on the number of outlier attributes. The step of determining whether the data is singular data includes a step of determining a singularity score reference range based on either the first and third quartiles of all the singularity scores or a predetermined confidence interval for all the singularity scores, and a step of determining whether each singularity score belongs to the singularity score reference range.

The method may include a step of accepting input of a plurality of data and a step of outputting whether each of the data is singular data.
The reference range may be determined based on the first quartile and the third quartile of the corresponding attribute item, or based on a predetermined confidence interval for the corresponding attribute item.
The method may further include a step of generating a box plot for each attribute item, and the step of outputting whether the data is singular data may include a step of outputting, at the position in the box plot corresponding to each attribute value, a symbol indicating whether the data containing that attribute value is singular data.

The singular data detection program according to the present invention causes a computer to execute the above singular data detection method.

The singular data detection device according to the present invention is constituted by a computer that executes the above singular data detection method.

According to the singular data detection method, singular data detection program, and singular data detection device of the present invention, singular data is detected based on both the degree of deviation from the reference range and the number of attributes that fall outside the reference range, so singular data can be detected more appropriately.

FIG. 1 is a diagram showing the configuration of the singular data detection device according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of the data group of FIG. 1.
FIG. 3 is a flowchart showing the flow of processing of the singular data detection device of FIG. 1.
FIG. 4 is a diagram showing the relationship between a box plot and the reference range and related quantities.
FIG. 5 is a diagram showing the outlier evaluation scores corresponding to the attribute values of the data group of FIG. 2.
FIG. 6 is a diagram showing an example of the correspondence between the number of outlier attributes and the weight.

Embodiments of the present invention will be described below with reference to the accompanying drawings.
Embodiment 1.
FIG. 1 shows the configuration of a singular data detection device 10 according to Embodiment 1 of the present invention. The singular data detection device 10 detects singular data by determining, for a plurality of data, whether each data item is singular data.

The singular data detection device 10 is configured as a well-known computer and includes computing means 20 that performs computation and storage means 30 that stores information. The computing means 20 includes a CPU (central processing unit), and the storage means 30 includes storage media such as semiconductor memory and an HDD (hard disk drive).

Although not specifically illustrated, the singular data detection device 10 also includes input means, output means, and a network interface. The input means is used by the user to enter information and is, for example, a mouse or a keyboard. The output means presents information to the user and is, for example, a display device such as a liquid crystal display, although it may be a printing device such as a printer. The network interface is a means for exchanging information with a communication network or other computers.

The storage means 30 stores a data group 40 consisting of a plurality of data. The singular data detection device 10 performs singular data detection processing on the data included in this data group 40. The storage means 30 also stores a singular data detection program 50. When executed by the computing means 20, the singular data detection program 50 causes the computer to function as the singular data detection device 10.

FIG. 2 shows an example of the data group 40. The data group 40 includes a plurality of data. In FIG. 2, each row represents one data item, and 20 data items are shown. Each data item includes, for a plurality of items (hereinafter "attribute items"), quantitative measured values (hereinafter "attribute values") that represent the properties of that data item.

FIG. 2 shows five attribute items, "attribute A" through "attribute E". For example, data 01 has the value "0.18" as its attribute value for the attribute item named "attribute A". In the present embodiment, all attribute values are numerical. Each data item represents, for example, one system development project, and each attribute value represents, for example, a number related to man-hours, cost, quality, or the like. For the purposes of the later explanation, FIG. 2 includes outlines and annotations marking "outlier attribute values", but these are not part of the actual data group 40.

FIG. 2 also shows the first quartile Q1, the third quartile Q3, and the interquartile range IQR of each attribute item. For attribute E, for example, the first quartile is Q1 = 4.00, the third quartile is Q3 = 8.25, and the interquartile range is IQR = Q3 - Q1 = 4.25. The IQR corresponds to the box portion of the box plot for each attribute item. In the example of FIG. 2 the quartile values are obtained by linear interpolation, but one of the attribute values may instead be used as is without interpolation, and the values may be rounded as appropriate.

FIG. 3 is a flowchart showing the flow of processing of the singular data detection device 10. The singular data detection device 10, as a computer, executes the method according to this flowchart.
First, the computing means 20 accepts input of the plurality of data constituting the data group 40 and stores it in the storage means 30 (step S1). This input may be made, for example, via a keyboard, via a network, or via a portable storage medium.

Next, the computing means 20 determines a reference range for each attribute item based on the data group 40 (step S2).
FIG. 4 shows the relationship between a box plot and the reference range and related quantities. In the present embodiment, the reference range is determined as follows.
First, the first quartile Q1 and the third quartile Q3 are calculated, and the interquartile range IQR = Q3 - Q1 is calculated. The reference range is then determined as the range whose lower limit is Q1 - 1.5 × IQR and whose upper limit is Q3 + 1.5 × IQR.
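The reference range calculation of step S2 can be sketched as follows. This is only an illustrative sketch: the function names, the sample values, and the particular linear interpolation convention (position (n - 1) × p in the sorted list) are assumptions, since the text states only that linear interpolation is used.

```python
def quantile(sorted_values, p):
    """Linearly interpolated quantile of an already sorted list (0 <= p <= 1).

    Assumed convention: the quantile sits at index (n - 1) * p.
    """
    pos = (len(sorted_values) - 1) * p
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac


def reference_range(values, k=1.5):
    """Return (Q1, Q3, IQR, lower limit, upper limit) for one attribute item."""
    ordered = sorted(values)
    q1 = quantile(ordered, 0.25)
    q3 = quantile(ordered, 0.75)
    iqr = q3 - q1
    return q1, q3, iqr, q1 - k * iqr, q3 + k * iqr


# Hypothetical attribute values, not taken from FIG. 2.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(reference_range(sample))  # (3.0, 7.0, 4.0, -3.0, 13.0)
```

The multiplier `k` is exposed as a parameter because the text later allows the limits to be chosen differently.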

Next, for each attribute value of each data item, the computing means 20 determines an outlier evaluation score based on the degree of deviation from the reference range (step S3). As shown in FIG. 4, if the attribute value lies within the reference range, the degree of deviation is 0 and the outlier evaluation score is also 0. If the attribute value lies outside the reference range and its distance from the upper or lower limit is 1.5 × IQR or less, the degree of deviation is small and the outlier evaluation score is 10. If the attribute value lies outside the reference range and its distance from the upper or lower limit exceeds 1.5 × IQR, the degree of deviation is large and the outlier evaluation score is 20.

In FIG. 2, a thick broken line marks an outlier attribute value with a large degree of deviation (that is, an outlier evaluation score of 20), and a thin dash-dot line marks an outlier attribute value with a small degree of deviation (that is, an outlier evaluation score of 10). Hereinafter, an attribute value whose degree of deviation is not 0 (that is, whose outlier evaluation score is 10 or more) is called an "outlier attribute value".
FIG. 5 shows the outlier evaluation scores corresponding to the attribute values of the data group of FIG. 2.

Expressed as inequalities, the above criteria for determining the outlier evaluation score are as follows.
・ X(n,a) < Q1(a) - 3.0 × IQR(a): large degree of deviation, outlier evaluation score 20
・ Q1(a) - 3.0 × IQR(a) ≤ X(n,a) < Q1(a) - 1.5 × IQR(a): small degree of deviation, outlier evaluation score 10
・ Q1(a) - 1.5 × IQR(a) ≤ X(n,a) ≤ Q3(a) + 1.5 × IQR(a): degree of deviation 0, outlier evaluation score 0
・ Q3(a) + 1.5 × IQR(a) < X(n,a) ≤ Q3(a) + 3.0 × IQR(a): small degree of deviation, outlier evaluation score 10
・ Q3(a) + 3.0 × IQR(a) < X(n,a): large degree of deviation, outlier evaluation score 20
Here, X(n,a) denotes the attribute value of attribute item a (one of attribute A through attribute E in the example of FIG. 2) of the n-th data item (1 ≤ n ≤ 20 in the example of FIG. 2), and Q1(a), Q3(a), and IQR(a) denote Q1, Q3, and IQR of attribute item a, respectively.
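The five inequalities above map directly onto a small scoring function. The sketch below is illustrative only; the function name `outlier_score` is an assumption, and the sample inputs are hypothetical values checked against the attribute E quartiles quoted from FIG. 2 (Q1 = 4.00, Q3 = 8.25).

```python
def outlier_score(x, q1, q3):
    """Outlier evaluation score (0, 10, or 20) of attribute value x.

    q1 and q3 are the first and third quartiles of the attribute item.
    """
    iqr = q3 - q1
    # Beyond 3.0 x IQR from the quartiles: large degree of deviation.
    if x < q1 - 3.0 * iqr or x > q3 + 3.0 * iqr:
        return 20
    # Outside the reference range but within 3.0 x IQR: small degree of deviation.
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return 10
    # Inside the reference range (limits included).
    return 0


# Quartiles of attribute E as given in the text; the tested values are hypothetical.
Q1_E, Q3_E = 4.00, 8.25
for value in (10.0, 15.0, 21.0, 21.5):
    print(value, outlier_score(value, Q1_E, Q3_E))
```

With these quartiles the reference range is -2.375 to 14.625, so 10.0 scores 0, 15.0 and 21.0 score 10, and 21.5 scores 20.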

Next, the computing means 20 counts the number of outlier attribute values for each data item (step S4). FIG. 5 shows this count as the "number of outlier attributes".

Next, the computing means 20 determines a singularity score for each data item (step S5). The singularity score is determined, for example, as follows.
In the present embodiment, a total outlier evaluation score is first calculated for each data item from the outlier evaluation scores of its attribute values. This total is calculated, for example, by summing all the outlier evaluation scores of that data item. A high total outlier evaluation score indicates that the data item differs in nature from the other data in the data group.
The computing means 20 also determines a weight for each data item based on its number of outlier attributes, and then calculates the singularity score from the total outlier evaluation score and the weight. For example, the singularity score is calculated as the product of the total outlier evaluation score and the weight.
In this way, according to the present embodiment, singular data is detected based on both the degree of deviation from the reference range and the number of outlier attributes.
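Step S4 and the first half of step S5 reduce each data item to two numbers: the number of outlier attributes and the total outlier evaluation score. A minimal sketch follows, in which the function name and the example scores are hypothetical.

```python
def total_and_count(scores, threshold=10):
    """Return (total outlier evaluation score, number of outlier attributes).

    scores holds the per-attribute outlier evaluation scores of one data item.
    An attribute counts as an outlier when its score is at least `threshold`.
    """
    total = sum(scores)
    count = sum(1 for s in scores if s >= threshold)
    return total, count


# Hypothetical scores for attributes A to E of one data item.
print(total_and_count([0, 10, 0, 20, 0]))  # (30, 2)
```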

FIG. 6 shows an example of the correspondence between the number of outlier attributes and the weight. For data 01 in FIG. 5, the number of outlier attributes is 2, so the weight is 1.25; multiplying this weight by the sum of the outlier evaluation scores, 40, gives a singularity score of 40 × 1.25 = 50. In the present embodiment the weight is defined so that it increases with the number of outlier attributes, so even when the total outlier evaluation score is the same, a data item with more outlier attributes has a larger singularity score. In the example of FIG. 5, data 01 and data 16 both have a total outlier evaluation score of 40, but data 16, which has more outlier attributes, has the larger singularity score.
The correspondence in FIG. 6 can, for example, be stored in the storage means 30 in advance.
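The weighting can be sketched as a lookup table followed by a multiplication. In the sketch below, only the entry "2 outlier attributes gives weight 1.25" comes from the text; every other weight, the function name, and the ordering of the example scores are hypothetical stand-ins for the FIG. 6 table and the FIG. 5 data.

```python
# Weight per number of outlier attributes. Only 2 -> 1.25 is stated in the text;
# the remaining entries are hypothetical placeholders for the FIG. 6 table.
WEIGHTS = {0: 1.0, 1: 1.0, 2: 1.25, 3: 1.5, 4: 1.75, 5: 2.0}


def singularity_score(scores, weights=WEIGHTS, threshold=10):
    """Total outlier evaluation score multiplied by the weight for the outlier count."""
    total = sum(scores)
    n_outliers = sum(1 for s in scores if s >= threshold)
    return total * weights[n_outliers]


# Two outlier attributes totalling 40, as for data 01: 40 x 1.25.
print(singularity_score([20, 20, 0, 0, 0]))  # 50.0
# Same total spread over three attributes (hypothetical): 40 x 1.5.
print(singularity_score([20, 10, 10, 0, 0]))  # 60.0
```

Keeping the table as a plain mapping makes it easy to swap in a different correspondence, such as one in which the weight decreases with the count.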

Next, the computing means 20 determines whether each data item is singular data based on the singularity scores of all the data (step S6). This step corresponds to the step of detecting singular data.
Any determination based on a known statistical technique may be used to decide whether a data item is singular data; for example, a method similar to steps S2 and S3 can be used. That is, Q1 and Q3 are obtained with the singularity scores of all the data as the population, a singularity score reference range is determined from them, each singularity score is checked against this reference range, and any data item whose singularity score falls outside the range is determined to be singular data. Applying this method to the example of FIG. 5, data 01, data 05, and data 16 are determined to be singular data.
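Step S6 can be sketched by reusing the quartile-based range on the singularity scores themselves. The sketch below uses `statistics.quantiles` from the Python standard library with `method="inclusive"`, which interpolates linearly; the function name, the record identifiers, and the scores are hypothetical and are not the FIG. 5 values.

```python
from statistics import quantiles


def detect_singular(scores_by_id, k=1.5):
    """Return the ids whose singularity score lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(list(scores_by_id.values()), n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [key for key, score in scores_by_id.items() if score < low or score > high]


# Hypothetical singularity scores for eight data items.
scores = {"a": 0, "b": 0, "c": 10, "d": 0, "e": 10, "f": 50, "g": 0, "h": 10}
print(detect_singular(scores))  # ['f']
```

Here Q1 = 0 and Q3 = 10, so the singularity score reference range is -15 to 25 and only the item scoring 50 is flagged.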

Next, the computing means 20 generates a box plot for each attribute item (step S7; this step may be executed at any point after step S1). The box plot can be generated by a known method based on Q1, Q3, IQR, and the reference range of each attribute item. For example, a rectangle is generated that spans from the position of Q1 as its lower edge to the position of Q3 as its upper edge, an upper whisker is generated that ends at the largest attribute value within the reference range, and a lower whisker is generated that ends at the smallest attribute value within the reference range. A symbol representing each attribute value may also be generated.
Similarly, the computing means 20 may generate a box plot for the singularity scores.
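The geometric elements of such a box plot can be computed without any plotting library. The sketch below only derives the box edges, the whisker ends, and the values lying outside the reference range; the function name, the dictionary keys, and the sample values are assumptions, and the actual drawing is left to whatever output means is used.

```python
from statistics import quantiles


def boxplot_elements(values, k=1.5):
    """Compute box edges, whisker ends, and out-of-range values for one attribute item."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if low <= v <= high]
    return {
        "box": (q1, q3),                 # rectangle from Q1 to Q3
        "whisker_low": min(inside),      # smallest value within the reference range
        "whisker_high": max(inside),     # largest value within the reference range
        "outside": [v for v in values if v < low or v > high],
    }


# Hypothetical attribute values with one clear outlier.
print(boxplot_elements([1, 2, 3, 4, 5, 6, 7, 8, 30]))
```

For this sample the box spans 3.0 to 7.0, the whiskers end at 1 and 8, and 30 is reported as lying outside the reference range.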

Next, the computing means 20 outputs whether each data item is singular data (step S8). This output is made, for example, by displaying it on the display means or by recording a new file in the storage means 30. Any output format may be used. For example, the data determined to be singular data may be output as a list, an indication of whether each data item is singular data may be output in association with that data item, or the result may be output on the box plot generated in step S7. When outputting on the box plot, a symbol indicating whether the data containing each attribute value is singular data may be output at the position in the box plot corresponding to that attribute value.

If a box plot has been generated for the singularity scores, a similar output may be made on that box plot. For example, a symbol indicating whether the data having each singularity score is singular data may be output at the position corresponding to that singularity score.

According to Embodiment 1 of the present invention described above, singular data is detected based on both the degree of deviation from the reference range and the number of outlier attributes, so singular data can be detected more appropriately than with a technique that relies on only one of these. More appropriate data cleansing can therefore be achieved.

The following modifications can be made to Embodiment 1 described above.
The generation of the box plot (step S7) may be omitted. A graph representing each attribute value may also be generated instead of, or in addition to, the box plot.

The upper and lower limits of the reference range are not limited to those of Embodiment 1 and may define another range determined based on Q1 and Q3. They may also be based on other statistical values rather than on Q1 and Q3. Furthermore, the reference range may be based on the median Q2, or on quantiles or percentage points other than quartiles. It may also be based on the mean and the standard deviation.
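As one illustration of the mean and standard deviation variant, the reference range could be taken as the mean plus or minus a multiple of the standard deviation. The text does not specify the multiplier or whether the population or sample standard deviation is meant, so the factor `k = 2.0`, the use of the population standard deviation, the function name, and the sample values below are all assumptions.

```python
from statistics import mean, pstdev


def reference_range_mean_sd(values, k=2.0):
    """Reference range as mean +/- k population standard deviations (assumed form)."""
    center = mean(values)
    spread = pstdev(values)
    return center - k * spread, center + k * spread


# Hypothetical attribute values with mean 5.0 and population standard deviation 2.0.
print(reference_range_mean_sd([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```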

The reference range may be determined based on a confidence interval corresponding to a predetermined confidence width, and the confidence interval itself may be used as the reference range. For example, when at least two attribute items are (or may be) correlated with each other, the confidence width and confidence interval of one attribute value may be determined based on another attribute value.
Such a method can also be used when determining the singularity score reference range in step S6. The reference range for the attribute items and the singularity score reference range may be determined by different methods.

The degree of deviation from the reference range may be expressed by a function of the distance from the upper or lower limit of the reference range rather than by the distance itself. The outlier evaluation score may also be determined by a method other than that of FIG. 4; for example, the degree of deviation may be used directly as the outlier evaluation score. Furthermore, the criterion for deciding whether an attribute value is an outlier attribute value may differ from that of Embodiment 1; for example, the threshold on the outlier evaluation score may be a value other than 10.

The correspondence between the number of outlier attributes and the weight need not be that shown in FIG. 6; for example, the weight may decrease as the number of outlier attributes increases. In that case, data containing more extreme outlier attribute values is more readily determined to be singular data than data containing many outlier attribute values.

In Embodiment 1 the singularity score is calculated by multiplying the total outlier evaluation score by the weight, but any other operation based on the total outlier evaluation score and the weight may be used. For example, the singularity score may be calculated using addition, exponentiation, an exponential function, or the like.

In Embodiment 1 the correspondence between the number of outlier attributes and the weight is common to all attribute items, but it may differ for each attribute item. For example, a correspondence equivalent to FIG. 6 may be defined for each attribute item.
The singular data detection method of Embodiment 1 can be used to check the basic data underlying estimates and to support the setting of quality-related target values.

10: singular data detection device; 20: computing means; 30: storage means; 40: data group (data); 50: singular data detection program; Q1, Q3: quartiles; IQR: interquartile range.

Claims

A singular data detection method in which an arithmetic unit provided in a computer determines whether data including attribute values representing values for a plurality of attribute items is singular data, the method comprising:
a step of determining an outlier evaluation score for each attribute value of the data, based on a degree of deviation from a reference range calculated in advance for each of the attribute items from a plurality of the data, wherein the outlier evaluation score is calculated as 0 if the attribute value is within the reference range and is calculated according to the degree of deviation if the attribute value is outside the reference range;
a step of counting, for each of the data, a number of outlier attributes representing the number of attribute values whose outlier evaluation score is equal to or greater than a predetermined threshold;
a step of determining, for each of the data, a singularity score based on the outlier evaluation scores and the number of outlier attributes; and
a step of determining whether each of the data is singular data based on the singularity score,
wherein, in the step of determining the singularity score, the arithmetic unit calculates a total outlier evaluation score by summing the outlier evaluation scores of the data, and determines the singularity score by multiplying the total outlier evaluation score by a weight determined based on the number of outlier attributes, and
wherein the step of determining whether the data is singular data includes:
a step of determining a singularity score reference range based on the first quartile and the third quartile of all the singularity scores, or a step of determining the singularity score reference range based on a predetermined confidence interval for all the singularity scores; and
a step of determining whether each singularity score belongs to the singularity score reference range.

The singular data detection method according to claim 1, further comprising:
a step of accepting input of a plurality of the data; and
a step of outputting whether each of the data is singular data.

The singular data detection method according to claim 1 or 2, wherein the reference range is determined based on the first quartile and the third quartile of the corresponding attribute item, or is determined based on a predetermined confidence interval for the corresponding attribute item.

The singular data detection method according to claim 2, or according to claim 3 when dependent on claim 2, further comprising a step of generating a box plot for each of the attribute items,
wherein the step of outputting whether the data is singular data includes a step of outputting, at a position corresponding to each attribute value in the box plot, a symbol indicating whether the data including that attribute value is singular data.

A singular data detection program that causes a computer to execute the method according to any one of claims 1 to 4.

A singular data detection device constituted by a computer that executes the method according to any one of claims 1 to 4.