JP2004246622A - Outlier detection supporting program, outlier detection supporting method, and outlier detection supporting device

Info

Publication number: JP2004246622A
Application number: JP2003035637A
Authority: JP
Inventors: Arata Sato; 新佐藤; Ei Sakano; 鋭坂野; Takashi Suenaga; 高志末永
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2003-02-13
Filing date: 2003-02-13
Publication date: 2004-09-02

Abstract

PROBLEM TO BE SOLVED: To provide an outlier detection supporting program, an outlier detection supporting method, and an outlier detection supporting device capable of increasing the efficiency of detecting or removing outliers.

SOLUTION: This outlier detection supporting device is provided with a control part 103 that sets a data identifier for identifying each data item in a data group targeted for outlier detection, visualizes the data group as a plot, and makes a display part 102 display the data identifier corresponding to each data item in the visualized data group; and an input part 101 with which an operator designates the data identifier corresponding to outlier data in the visualized data group.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、データ集合からの外れ値の検出を支援するための外れ値検出支援プログラム、外れ値検出支援方法および外れ値検出支援装置に関するものであり、特に、外れ値の検出や削除にかかる効率を高めることができる外れ値検出支援プログラム、外れ値検出支援方法および外れ値検出支援装置に関するものである。
【０００２】
【従来の技術】
従来より、データマイニング、データ解析、パターン認識の分野においては、解析に悪影響を及ぼす異常な値を持ったデータ、すなわち、外れ値を検出することが非常に重要である。例えば、ただ１個の外れ値が存在しただけで、識別や予測の精度の低下を引き起こす場合があるため、主として統計学の分野においては、外れ値を検出する技術が開発されている。
【０００３】
図１２は、従来のデータテーブル１０を示す図である。このデータテーブル１０には、属性１（例えば、気温）および属性２（例えば、長さ）の各データが格納されている。
【０００４】
データテーブル１０において、レコード番号は、各レコードに対応付けられた番号である。属性１は、例えば、気温のデータである。属性２は、例えば、長さのデータである。すなわち、データテーブル１０では、気温と長さとの対応関係が表現されている。
【０００５】
図１３は、図１２に示したデータテーブル１０における各レコードのデータ（属性１：気温、属性２：長さ）をプロットした散布図２０である。この散布図２０において、横軸が属性１（気温）に対応しており、縦軸が属性２（長さ）に対応している。この散布図２０においては、例えば、丸枠２１内のプロットが外れ値とされる。
【０００６】
従って、オペレータは、丸枠２１内のプロット（データ）を外れ値として検出し、除去するために、散布図２０から同プロットのデータ（属性１および属性２）を読みとる。つぎに、オペレータは、図１２に示したデータテーブル１０から、読みとったデータ（属性１および属性２）に対応するレコードを検索した後、同レコード（外れ値）を削除する。
【０００７】
これにより、散布図２０からは、外れ値（丸枠２１内のプロット）が除去される。以後、上記動作が繰り返されることにより、外れ値の検出、除去が行われる。
【０００８】
【非特許文献１】
末永高志、佐藤新、坂野鋭著、「クラスタ構造に着目した特徴空間の可視化−クラスタ判別法−」、「電子情報通信学会論文誌Ｄ−ＩＩ、Ｖｏｌ．Ｊ８５−Ｄ−ＩＩＮｏ．５」、（社）電子情報通信学会発行、２００２年５月、ｐｐ７８５−７９５
【０００９】
【発明が解決しようとする課題】
ところで、前述したように、従来においては、図１３に示した散布図２０を一見しただけでは各プロットとデータテーブル１０（図１２参照）の各レコードとの対応関係が判らないため、散布図２０で読みとったプロット（外れ値）のデータをキーとして、データテーブル１０から外れ値に対応するレコードを検索、削除しなければならない。
【００１０】
このように、従来では、外れ値の検出や除去にかかる時間や手間が非常にかかり効率が悪いという問題があった。
【００１１】
本発明は、上記に鑑みてなされたもので、外れ値の検出や除去にかかる効率を高めることができる外れ値検出支援プログラム、外れ値検出支援方法および外れ値検出支援装置を提供することを目的とする。
【００１２】
【課題を解決するための手段】
上記目的を達成するために、請求項１にかかる発明は、コンピュータを、外れ値検出対象であるデータ集合における各データを識別するためのデータ識別子を設定するデータ識別子設定手段、前記データ集合を可視化する可視化手段、前記可視化されたデータ集合における各データに対応させて前記データ識別子を表示するデータ識別子表示手段、前記可視化されたデータ集合において外れ値のデータに対応するデータ識別子を指定するための外れ値指定手段、として機能させるための外れ値検出支援プログラムである。
【００１３】
この発明によれば、可視化されたデータ集合における各データに対応させて、各データを識別するためのデータ識別子を表示し、可視化されたデータ集合において外れ値のデータに対応するデータ識別子を指定することとしたので、外れ値の検出にかかる効率を高めることができる。
【００１４】
また、請求項２にかかる発明は、請求項１に記載の外れ値検出支援プログラムにおいて、前記コンピュータを、前記可視化されたデータ集合においてデータの重なりがある部分を拡大表示させる拡大表示手段、として機能させることを特徴とする。
【００１５】
この発明によれば、可視化されたデータ集合においてデータの重なりがある部分を拡大表示させることとしたので、データの重なりがある場合であっても、外れ値の検出にかかる効率を高めることができる。
【００１６】
また、請求項３にかかる発明は、請求項１または２に記載の外れ値検出支援プログラムにおいて、前記可視化手段は、前記可視化されたデータ集合において、所定の条件を満たす外れ値の候補としてのデータを色分けにて可視化することを特徴とする。
【００１７】
この発明によれば、可視化されたデータ集合において、所定の条件を満たす外れ値の候補としてのデータを色分けにて可視化することとしたので、外れ値の検出にかかる効率をさらに高めることができる。
【００１８】
また、請求項４にかかる発明は、請求項１〜３のいずれか一つに記載の外れ値検出支援プログラムにおいて、前記コンピュータを、前記外れ値指定手段により指定されたデータ識別子に対応するデータを外れ値として除去する除去手段、として機能させることを特徴とする。
【００１９】
この発明によれば、指定されたデータ識別子に対応するデータを外れ値として除去することとしたので、外れ値の除去にかかる効率を高めることができる。
【００２０】
また、請求項５にかかる発明は、コンピュータを、外れ値検出対象であるデータ集合における各データを識別するためのデータ識別子を設定するデータ識別子設定手段、前記データ集合を可視化する可視化手段、前記可視化されたデータ集合における各データに対応させて前記データ識別子を表示するデータ識別子表示手段、前記可視化されたデータ集合において、所定の条件を満たす外れ値のデータを除去する除去手段、として機能させるための外れ値検出支援プログラムである。
【００２１】
この発明によれば、可視化されたデータ集合における各データに対応させて、各データを識別するためのデータ識別子を表示し、可視化されたデータ集合において、所定の条件を満たす外れ値のデータを除去することとしたので、外れ値の検出や除去にかかる効率を高めることができる。
【００２２】
また、請求項６にかかる発明は、外れ値検出対象であるデータ集合における各データを識別するためのデータ識別子を設定するデータ識別子設定工程と、前記データ集合を可視化する可視化工程と、前記可視化されたデータ集合における各データに対応させて前記データ識別子を表示するデータ識別子表示工程と、前記可視化されたデータ集合において外れ値のデータに対応するデータ識別子を指定するための外れ値指定工程と、を含むことを特徴とする。
【００２３】
この発明によれば、可視化されたデータ集合における各データに対応させて、各データを識別するためのデータ識別子を表示し、可視化されたデータ集合において外れ値のデータに対応するデータ識別子を指定することとしたので、外れ値の検出にかかる効率を高めることができる。
【００２４】
また、請求項７にかかる発明は、外れ値検出対象であるデータ集合における各データを識別するためのデータ識別子を設定するデータ識別子設定工程と、前記データ集合を可視化する可視化工程と、前記可視化されたデータ集合における各データに対応させて前記データ識別子を表示するデータ識別子表示工程と、前記可視化されたデータ集合において、所定の条件を満たす外れ値のデータを除去する除去工程と、を含むことを特徴とする。
【００２５】
この発明によれば、可視化されたデータ集合における各データに対応させて、各データを識別するためのデータ識別子を表示し、可視化されたデータ集合において、所定の条件を満たす外れ値のデータを除去することとしたので、外れ値の検出や除去にかかる効率を高めることができる。
【００２６】
また、請求項８にかかる発明は、外れ値検出対象であるデータ集合における各データを識別するためのデータ識別子を設定するデータ識別子設定手段と、前記データ集合を可視化する可視化手段と、前記可視化されたデータ集合における各データに対応させて前記データ識別子を表示するデータ識別子表示手段と、前記可視化されたデータ集合において外れ値のデータに対応するデータ識別子を指定するための外れ値指定手段と、を備えたことを特徴とする。
【００２７】
この発明によれば、可視化されたデータ集合における各データに対応させて、各データを識別するためのデータ識別子を表示し、可視化されたデータ集合において外れ値のデータに対応するデータ識別子を指定することとしたので、外れ値の検出にかかる効率を高めることができる。
【００２８】
また、請求項９にかかる発明は、外れ値検出対象であるデータ集合における各データを識別するためのデータ識別子を設定するデータ識別子設定手段と、前記データ集合を可視化する可視化手段と、前記可視化されたデータ集合における各データに対応させて前記データ識別子を表示するデータ識別子表示手段と、前記可視化されたデータ集合において、所定の条件を満たす外れ値のデータを除去する除去手段と、を備えたことを特徴とする。
【００２９】
この発明によれば、可視化されたデータ集合における各データに対応させて、各データを識別するためのデータ識別子を表示し、可視化されたデータ集合において、所定の条件を満たす外れ値のデータを除去することとしたので、外れ値の検出や除去にかかる効率を高めることができる。
【００３０】
【発明の実施の形態】
以下、図面を参照して本発明にかかる外れ値検出支援プログラム、外れ値検出支援方法および外れ値検出支援装置の一実施の形態について詳細に説明する。
【００３１】
図１は、本発明にかかる一実施の形態の構成を示すブロック図である。この図において、外れ値検出支援装置１００は、データ集合からの外れ値の検出や削除を支援する装置である。外れ値検出支援装置１００において、入力部１０１は、キーボード、マウス、データ入力装置等である。表示部１０２は、各種データ散布図（図４〜図８参照）を表示する。
【００３２】
制御部１０３は、各種制御（散布図の表示制御、外れ値の検出支援制御等）を行う。この制御部１０３の動作の詳細については、後述する。記憶部１０４は、制御部１０３の制御により各種データを記憶する。
【００３３】
オリジナルデータベース１１０は、外れ値を検出するための外れ値検出処理が施される前のオリジナルデータをファイル単位で格納するデータベースである。図２（ａ）に示したオリジナルデータテーブル２００は、オリジナルデータベース１１０に格納されている複数のファイルのうち、ある一つのファイルを開いたものである。
【００３４】
オリジナルデータテーブル２００において、レコード番号は、各レコードに対応付けられた番号である。ここで、オリジナルデータテーブル２００においては、レコードが削除された場合、削除されたレコードより下のレコードに対応するレコード番号が１つずつ繰り上げられ、最上位レコードから最下位レコードまで連番とされる。属性１〜属性ｎは、ｎ個の属性を有するデータである。例えば、属性１が血圧のデータ、属性２が身長のデータ、・・・、属性ｎが血糖値のデータである。
【００３５】
図１に戻り、処理済みデータベース１２０は、上述した外れ値検出が施され、オリジナルデータから外れ値が除外された処理済みデータをファイル単位で格納するデータベースである。
【００３６】
（動作例１）
つぎに、一実施の形態の動作例１について、図３〜図８を参照しつつ説明する。図３は、一実施の形態の動作例１を説明するフローチャートである。同図に示したステップＳＡ１では、制御部１０３は、オリジナルデータベース１１０から外れ値検出対象のファイルを開き、図２（ａ）に示した初期のオリジナルデータテーブル２００を記憶部１０４上に展開する。
【００３７】
ステップＳＡ２では、制御部１０３は、図２（ａ）に示した初期のオリジナルデータテーブル２００の各レコードに対応させて、識別番号テーブル２１０を作成する。この識別番号テーブル２１０は、オリジナルデータテーブル２００における各データを識別するための識別番号を格納するテーブルである。
【００３８】
初期のオリジナルデータテーブル２００に対応する識別番号テーブル２１０では、レコード番号と同一の識別番号が付与されている。ここで、識別番号テーブル２１０においては、図２（ｂ）に示したようにオリジナルデータテーブル２００のレコード２００ａが削除された場合、当該レコード２００ａに対応するレコード２１０ａが削除され、識別番号２が欠番となるが、その他の識別番号１、３〜ｍが不変とされる。これにより、図２（ｃ）に示したた識別番号テーブル２１０’が作成される。
【００３９】
これに対して、オリジナルデータテーブル２００においては、図２（ｂ）に示したようにレコード２００ａが削除された場合、当該レコード２００ａに対応するレコードが削除され、その他の識別番号３〜ｍが一つずつ繰り上げられ、識別番号２〜ｍ−１とされる。これにより、図２（ｃ）に示したオリジナルデータテーブル２００’が作成される。
【００４０】
ステップＳＡ３およびステップＳＡ４では、周知のクラスタ判別法により、高次元（この場合、属性１〜ｎであるためｎ次元）のデータを低次元（この場合、２次元）のデータに圧縮する処理が実行される。クラスタ判別法は、高次元空間でクラスタリングを行い、求められたクラスタに属するデータを独立したカテゴリとみなして、判別分析を用いて低次元空間への写像を求める、データの可視化手法である。
【００４１】
すなわち、高次元のデータは、散布図にプロットし、データを可視化することが困難である。そこで、高次元のデータを低次元（２次元）のデータに圧縮することで、二次元のデータを散布図にプロットし、データの可視化が可能となる。
【００４２】
ステップＳＡ３では、制御部１０３は、クラスタ数を指定する。ステップＳＡ４では、制御部１０３は、クラスタ判別法の判別分析により、図２（ａ）に示したオリジナルデータテーブル２００に格納された高次元（ｎ次元）のオリジナルデータを、低次元（２次元）のデータに圧縮する。これにより、属性１〜属性ｎは、１次元の属性および２次元の属性に圧縮される。
【００４３】
なお、オリジナルデータテーブル２００に２次元（属性１および属性２）のオリジナルデータのみが既に格納されている場合には、ステップＳＡ３およびステップＳＡ４の処理がスキップされる。
【００４４】
ステップＳＡ５では、制御部１０３は、オリジナルデータテーブル２００の各レコード（但し、ｎ次元から２次元に圧縮されたデータ（１次元の属性および２次元の属性）を各プロットとして図４に示した散布図３００を表示部１０２に表示させる。この散布図３００において、横軸が１次元（属性）に対応しており、縦軸が２次元（属性）に対応している。
【００４５】
ステップＳＡ６では、制御部１０３は、図４に示した散布図３００の各プロット（各レコード）の近傍に、図２（ａ）に示した識別番号テーブル２１０の各識別番号を表示させ、散布図３００を図５に示した散布図３１０とする。例えば、図５に示した散布図３１０の「１４」は、図２（ａ）に示した識別番号テーブル２１０の識別番号（＝１４）に対応している。
【００４６】
ステップＳＡ７では、制御部１０３は、図５に示した散布図３１０においてプロットが重複しており、識別番号が見にくい部分があるか否かを判断する。この場合、図６に示した枠３１１内に存在する複数のプロットが重複しているため、制御部１０３は、ステップＳＡ７の判断結果を「Ｙｅｓ」とする。なお、ステップＳＡ７の判断結果が「Ｎｏ」である場合、制御部１０３は、ステップＳＡ８の判断を行う。
【００４７】
ステップＳＡ１２では、制御部１０３は、図６に示した枠３１１（重複部分）を拡大し、図７に示した拡大散布図３２０を、散布図３１０（図６参照）とともに表示部１０２に並列的に表示させる。これにより、識別番号やプロットが見易くなる。
【００４８】
ステップＳＡ８では、制御部１０３は、オペレータにより、散布図３１０（図６参照）および拡大散布図３２０（図７参照）において外れ値が存在していることを指示されたか否かを判断する。
【００４９】
この場合、オペレータは、図８に示した散布図３１０の「１４」（丸枠３１２）に対応するプロット（データ）を外れ値として認識し、入力部１０１を用いて、外れ値が存在しているという指示を出す。
【００５０】
これにより、制御部１０３は、ステップＳＡ８の判断結果を「Ｙｅｓ」とする。ステップＳＡ９では、オペレータは、入力部１０１を用いて、当該外れ値に対応する識別番号「１４」を指定する。
【００５１】
ステップＳＡ１０では、制御部１０３は、指定された上記識別番号「１４」に対応するレコードをオリジナルデータテーブル２００および識別番号テーブル２１０から削除する。ここで、当該識別番号「１４」は欠番となる。以後、ステップＳＡ８の判断結果が「Ｎｏ」になるまで、ステップＳＡ３〜ステップＳＡ１０が繰り返される。
【００５２】
そして、ステップＳＡ８の判断結果が「Ｎｏ」になると、ステップＳＡ１１では、制御部１０３は、外れ値に対応するレコードが削除されたオリジナルデータテーブル２００を処理済みデータとして処理済みデータベース１２０に格納する。
【００５３】
（動作例２）
さて、一実施の形態においては、散布図で外れ値の候補を色分け表示させ、オペレータによる外れ値の検出支援を行ってもよい。以下では、この場合を動作例２として説明する。図９は、一実施の形態の動作例２を説明するフローチャートである。
【００５４】
同図に示したステップＳＢ１〜ステップＳＢ４では、前述したステップＳＡ１〜ステップＳＡ４（図３参照）と同様の処理が実行される。
【００５５】
ステップＳＢ５では、制御部１０３は、オリジナルデータテーブル２００の各レコード（但し、ｎ次元から２次元に圧縮されたデータ（１次元の属性および２次元の属性））を各プロットとして図４に示した散布図３００を表示部１０２に表示させる。
【００５６】
ステップＳＢ６では、制御部１０３は、図４に示した散布図３００の各プロット（各レコード）の近傍に、図２（ａ）に示した識別番号テーブル２１０の各識別番号を表示させ、散布図３００を図５に示した散布図３１０とする。
【００５７】
ステップＳＢ７では、制御部１０３は、外れ値の候補として、クラスタ数が１のプロット（識別番号）を赤色で表示させる。この場合、例えば、識別番号「１４」およびこれに対応するプロットが赤色で表示されたとする。
【００５８】
ステップＳＢ８では、制御部１０３は、図５に示した散布図３１０においてプロットが重複しており、識別番号が見にくい部分があるか否かを判断し、この場合、判断結果を「Ｙｅｓ」とする。なお、ステップＳＢ８の判断結果が「Ｎｏ」である場合、制御部１０３は、ステップＳＢ９の判断を行う。
【００５９】
ステップＳＢ１３では、制御部１０３は、図６に示した枠３１１（重複部分）を拡大し、図７に示した拡大散布図３２０を、散布図３１０（図６参照）とともに表示部１０２に並列的に表示させる。
【００６０】
ステップＳＢ９では、制御部１０３は、オペレータにより、散布図３１０（図６参照）および拡大散布図３２０（図７参照）において除去したいプロットが存在していることを指示されたか否かを判断する。
【００６１】
この場合、オペレータは、図８に示した散布図３１０において、赤色表示された「１４」（丸枠３１２）に対応するプロット（データ）を外れ値として認識し、入力部１０１を用いて、除去したいプロットが存在しているという指示を出す。
【００６２】
これにより、制御部１０３は、ステップＳＢ９の判断結果を「Ｙｅｓ」とする。ステップＳＢ１０では、オペレータは、入力部１０１を用いて、当該外れ値に対応する除去プロットの識別番号「１４」を指定する。
【００６３】
ステップＳＢ１１では、制御部１０３は、指定された上記識別番号「１４」に対応するレコードをオリジナルデータテーブル２００および識別番号テーブル２１０から削除する。ここで、当該識別番号「１４」は欠番となる。以後、ステップＳＢ９の判断結果が「Ｎｏ」になるまで、ステップＳＢ３〜ステップＳＢ１１が繰り返される。
【００６４】
そして、ステップＳＢ９の判断結果が「Ｎｏ」になると、ステップＳＢ１２では、制御部１０３は、外れ値に対応するレコードが削除されたオリジナルデータテーブル２００を処理済みデータとして処理済みデータベース１２０に格納する。
【００６５】
（動作例３）
さて、一実施の形態の動作例２においては、散布図で外れ値の候補を色分け表示させ、オペレータの指示に従って、外れ値を検出し、除去する場合について説明したが、オペレータを介入させずに、色分け表示されたプロットを外れ値として自動的に除去してもよい。以下では、この場合を動作例３として説明する。図１０は、一実施の形態の動作例３を説明するフローチャートである。
【００６６】
同図に示したステップＳＣ１〜ステップＳＣ４では、前述したステップＳＡ１〜ステップＳＡ４（図３参照）と同様の処理が実行される。
【００６７】
ステップＳＣ５では、制御部１０３は、オリジナルデータテーブル２００の各レコード（但し、ｎ次元から２次元に圧縮されたデータ（１次元の属性および２次元の属性））を各プロットとして図４に示した散布図３００を表示部１０２に表示させる。
【００６８】
ステップＳＣ６では、制御部１０３は、図４に示した散布図３００の各プロット（各レコード）の近傍に、図２（ａ）に示した識別番号テーブル２１０の各識別番号を表示させ、散布図３００を図５に示した散布図３１０とする。
【００６９】
ステップＳＣ７では、制御部１０３は、外れ値の候補として、クラスタ数が１のプロット（識別番号）を赤色で表示させる。この場合、例えば、識別番号「１４」およびこれに対応するプロットが赤色で表示されたとする。
【００７０】
ステップＳＣ８では、制御部１０３は、図５に示した散布図３１０においてプロットが重複しており、識別番号が見にくい部分があるか否かを判断し、この場合、判断結果を「Ｙｅｓ」とする。なお、ステップＳＣ８の判断結果が「Ｎｏ」である場合、制御部１０３は、ステップＳＣ９の判断を行う。
【００７１】
ステップＳＣ１２では、制御部１０３は、図６に示した枠３１１（重複部分）を拡大し、図７に示した拡大散布図３２０を、散布図３１０（図６参照）とともに表示部１０２に並列的に表示させる。
【００７２】
ステップＳＣ９では、制御部１０３は、散布図３１０（図６参照）および拡大散布図３２０（図７参照）において、外れ値としての赤色プロットが存在するか否かを判断する。この場合、制御部１０３は、図８に示した散布図３１０において、赤色表示された「１４」（丸枠３１２）に対応するプロット（データ）が存在しているため、ステップＳＣ９の判断結果を「Ｙｅｓ」とする。
【００７３】
ステップＳＣ１０では、制御部１０３は、指定された上記識別番号「１４」に対応するレコードをオリジナルデータテーブル２００および識別番号テーブル２１０から削除する。ここで、当該識別番号「１４」は欠番となる。以後、ステップＳＣ９の判断結果が「Ｎｏ」になるまで、ステップＳＣ３〜ステップＳＣ１０が繰り返される。
【００７４】
そして、ステップＳＣ９の判断結果が「Ｎｏ」になると、ステップＳＣ１２では、制御部１０３は、外れ値に対応するレコードが削除されたオリジナルデータテーブル２００を処理済みデータとして処理済みデータベース１２０に格納する。
【００７５】
以上説明したように、一実施の形態（動作例１）によれば、図６に示したように、散布図３１０で可視化されたデータ集合における各データ（プロット）に対応させて、各データを識別するための識別番号を表示し、可視化されたデータ集合において外れ値のデータに対応する識別番号を入力部１０１で指定することとしたので、外れ値の検出にかかる効率を高めることができる。
【００７６】
また、一実施の形態（動作例１）によれば、指定された識別番号に対応するデータを外れ値として除去することとしたので、外れ値の除去にかかる効率を高めることができる。
【００７７】
また、一実施の形態（動作例１）によれば、図６および図７に示したように、可視化されたデータ集合においてデータ（プロット）の重なりがある部分（枠３１１の部分）を拡大表示させることとしたので、データ（プロット）の重なりがある場合であっても、外れ値の検出にかかる効率を高めることができる。
【００７８】
また、一実施の形態（動作例２）によれば、図９を参照して説明したように、可視化されたデータ集合において、所定の条件を満たす外れ値の候補としてのデータ（プロット、識別番号）を色分けにて可視化することとしたので、外れ値の検出にかかる効率をさらに高めることができる。
【００７９】
また、一実施の形態（動作例３）によれば、図１０を参照して説明したように、可視化されたデータ集合における各データに対応させて、各データを識別するための識別番号を表示し、可視化されたデータ集合において、所定の条件（クラスタ数が１）を満たす外れ値のデータを除去することとしたので、外れ値の検出や除去にかかる効率を高めることができる。
【００８０】
以上本発明にかかる一実施の形態について図面を参照して詳述してきたが、具体的な構成例はこの一実施の形態に限られるものではなく、本発明の要旨を逸脱しない範囲の設計変更等があっても本発明に含まれる。
【００８１】
例えば、前述した一実施の形態においては、図１に示した外れ値検出支援装置１００の機能を実現するためのプログラムを図１１に示したコンピュータ読み取り可能な記録媒体５００に記録して、この記録媒体５００に記録されたプログラムを同図に示したコンピュータ４００に読み込ませ、実行することにより各機能を実現してもよい。
【００８２】
同図に示したコンピュータ４００は、上記プログラムを実行するＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）４１０と、キーボード、マウス等の入力装置４２０と、各種データを記憶するＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）４３０と、演算パラメータ等を記憶するＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）４４０と、記録媒体５００からプログラムを読み取る読取装置４５０と、ディスプレイ、プリンタ等の出力装置４６０とから構成されている。
【００８３】
ＣＰＵ４１０は、読取装置４５０を経由して記録媒体５００に記録されているプログラムを読み込んだ後、プログラムを実行することにより、前述した機能を実現する。なお、記録媒体５００としては、光ディスク、フレキシブルディスク、ハードディスク等が挙げられる。
【００８４】
【発明の効果】
以上説明したように、請求項１、６、８にかかる発明によれば、可視化されたデータ集合における各データに対応させて、各データを識別するためのデータ識別子を表示し、可視化されたデータ集合において外れ値のデータに対応するデータ識別子を指定することとしたので、外れ値の検出にかかる効率を高めることができるという効果を奏する。
【００８５】
また、請求項２にかかる発明によれば、可視化されたデータ集合においてデータの重なりがある部分を拡大表示させることとしたので、データの重なりがある場合であっても、外れ値の検出にかかる効率を高めることができるという効果を奏する。
【００８６】
また、請求項３にかかる発明によれば、可視化されたデータ集合において、所定の条件を満たす外れ値の候補としてのデータを色分けにて可視化することとしたので、外れ値の検出にかかる効率をさらに高めることができるという効果を奏する。
【００８７】
また、請求項４にかかる発明によれば、指定されたデータ識別子に対応するデータを外れ値として除去することとしたので、外れ値の除去にかかる効率を高めることができるという効果を奏する。
【００８８】
また、請求項５、７、９にかかる発明によれば、可視化されたデータ集合における各データに対応させて、各データを識別するためのデータ識別子を表示し、可視化されたデータ集合において、所定の条件を満たす外れ値のデータを除去することとしたので、外れ値の検出や除去にかかる効率を高めることができるという効果を奏する。
【図面の簡単な説明】
【図１】本発明にかかる一実施の形態の構成を示すブロック図である。
【図２】同一実施の形態における各種テーブルを示す図である。
【図３】同一実施の形態の動作例１を説明するフローチャートである。
【図４】同一実施の形態における散布図３００である。
【図５】図４に示した各プロットに対応する識別番号が付された散布図３１０である。
【図６】図５に示した散布図３１０における枠３１１を示す図である。
【図７】図６に示した枠３１１に対応する拡大散布図３２０である。
【図８】図５に示した散布図３１０における丸枠３１２を示す図である。
【図９】同一実施の形態の動作例２を説明するフローチャートである。
【図１０】同一実施の形態の動作例３を説明するフローチャートである。
【図１１】同一実施の形態の変形例の構成を示すブロック図である。
【図１２】従来のデータテーブル１０を示す図である。
【図１３】図１２に示したデータテーブル１０の各レコードをプロットした散布図２０である。
【符号の説明】
１００外れ値検出支援装置
１０１入力部
１０２表示部
１０３制御部
１０４記憶部
１１０オリジナルデータベース
１２０処理済みデータベース

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an outlier detection support program, an outlier detection support method, and an outlier detection support device for supporting the detection of outliers in a data set, and in particular to an outlier detection support program, method, and device that can increase the efficiency of detecting and deleting outliers.
[0002]
[Prior art]
Conventionally, in the fields of data mining, data analysis, and pattern recognition, it has been very important to detect data having abnormal values that adversely affect analysis, that is, outliers. For example, the presence of even a single outlier may lower the accuracy of identification or prediction, and techniques for detecting outliers have therefore been developed, mainly in the field of statistics.
[0003]
FIG. 12 is a diagram showing a conventional data table 10. The data table 10 stores data of attribute 1 (for example, temperature) and attribute 2 (for example, length).
[0004]
In the data table 10, the record number is a number associated with each record. Attribute 1 is, for example, temperature data. Attribute 2 is, for example, length data. That is, in the data table 10, the correspondence between the temperature and the length is expressed.
[0005]
FIG. 13 is a scatter diagram 20 in which data (attribute 1: temperature, attribute 2: length) of each record in the data table 10 shown in FIG. 12 is plotted. In the scatter diagram 20, the horizontal axis corresponds to attribute 1 (temperature), and the vertical axis corresponds to attribute 2 (length). In the scatter diagram 20, for example, a plot in a circle 21 is an outlier.
[0006]
Accordingly, the operator reads the data (attribute 1 and attribute 2) of the plot from the scatter diagram 20 in order to detect and remove the plot (data) within the circle 21 as an outlier. Next, the operator searches the data table 10 shown in FIG. 12 for a record corresponding to the read data (attribute 1 and attribute 2), and then deletes the record (outlier).
[0007]
Thereby, outliers (plots in the circle 21) are removed from the scatter diagram 20. Thereafter, the above operation is repeated to detect and remove outliers.
[0008]
[Non-patent document 1]
Takashi Suenaga, Arata Sato, Akira Sakano, "Visualization of Feature Space Focusing on Cluster Structure: Cluster Discrimination Method", Transactions of the Institute of Electronics, Information and Communication Engineers D-II, Vol. J85-D-II, No. 5, published by The Institute of Electronics, Information and Communication Engineers, May 2002, pp. 785-795
[0009]
[Problems to be solved by the invention]
As described above, in the related art, the correspondence between each plot and each record of the data table 10 (see FIG. 12) cannot be determined merely by looking at the scatter diagram 20 shown in FIG. 13. The record corresponding to an outlier must therefore be searched for in, and deleted from, the data table 10 using the data of the plot (outlier) read from the scatter diagram 20 as a key.
[0010]
Thus, the conventional approach has the problem that detecting and removing outliers takes a great deal of time and labor and is inefficient.
[0011]
The present invention has been made in view of the above, and an object of the present invention is to provide an outlier detection support program, an outlier detection support method, and an outlier detection support apparatus that can increase the efficiency of detecting and removing outliers.
[0012]
[Means for Solving the Problems]
In order to achieve the above object, the invention according to claim 1 is an outlier detection support program for causing a computer to function as: data identifier setting means for setting a data identifier for identifying each data item in a data set that is an outlier detection target; visualization means for visualizing the data set; data identifier display means for displaying the data identifier in association with each data item in the visualized data set; and outlier specifying means for specifying the data identifier corresponding to outlier data in the visualized data set.
[0013]
According to the present invention, a data identifier for identifying each data is displayed in association with each data in the visualized data set, and a data identifier corresponding to outlier data is specified in the visualized data set. Therefore, it is possible to increase the efficiency of detecting an outlier.
[0014]
According to a second aspect of the present invention, in the outlier detection support program according to the first aspect, the program causes the computer to function as enlargement display means for enlarging and displaying a portion of the visualized data set where data overlap.
[0015]
According to the present invention, a portion of the visualized data set where data overlap is enlarged and displayed. Therefore, even when data overlap, the efficiency of detecting an outlier can be increased.
[0016]
According to a third aspect of the present invention, in the outlier detection support program according to the first or second aspect, the visualization means visualizes, by color coding, data in the visualized data set that are outlier candidates satisfying a predetermined condition.
[0017]
According to the present invention, in the visualized data set, data as an outlier candidate that satisfies a predetermined condition is visualized by color coding, so that the efficiency of outlier detection can be further increased.
[0018]
According to a fourth aspect of the present invention, in the outlier detection support program according to any one of the first to third aspects, the program causes the computer to function as removing means for removing, as an outlier, the data corresponding to the data identifier specified by the outlier specifying means.
[0019]
According to the present invention, since the data corresponding to the specified data identifier is removed as an outlier, the efficiency of removing the outlier can be increased.
[0020]
Further, the invention according to claim 5 is an outlier detection support program for causing a computer to function as: data identifier setting means for setting a data identifier for identifying each data item in a data set that is an outlier detection target; visualization means for visualizing the data set; data identifier display means for displaying the data identifier in association with each data item in the visualized data set; and removing means for removing, from the visualized data set, outlier data satisfying a predetermined condition.
[0021]
According to the present invention, a data identifier for identifying each data is displayed in association with each data in the visualized data set, and outlier data satisfying a predetermined condition is removed from the visualized data set. Therefore, the efficiency of detecting and removing outliers can be increased.
[0022]
The invention according to claim 6 is characterized by including: a data identifier setting step of setting a data identifier for identifying each data item in a data set that is an outlier detection target; a visualization step of visualizing the data set; a data identifier display step of displaying the data identifier in association with each data item in the visualized data set; and an outlier specifying step of specifying the data identifier corresponding to outlier data in the visualized data set.
[0023]
According to the present invention, a data identifier for identifying each data is displayed in association with each data in the visualized data set, and a data identifier corresponding to outlier data is specified in the visualized data set. Therefore, it is possible to increase the efficiency of detecting an outlier.
[0024]
The invention according to claim 7 is characterized by including: a data identifier setting step of setting a data identifier for identifying each data item in a data set that is an outlier detection target; a visualization step of visualizing the data set; a data identifier display step of displaying the data identifier in association with each data item in the visualized data set; and a removing step of removing, from the visualized data set, outlier data satisfying a predetermined condition.
[0025]
According to the present invention, a data identifier for identifying each data is displayed in association with each data in the visualized data set, and outlier data satisfying a predetermined condition is removed from the visualized data set. Therefore, the efficiency of detecting and removing outliers can be increased.
[0026]
The invention according to claim 8 is characterized by comprising: data identifier setting means for setting a data identifier for identifying each data item in a data set that is an outlier detection target; visualization means for visualizing the data set; data identifier display means for displaying the data identifier in association with each data item in the visualized data set; and outlier specifying means for specifying the data identifier corresponding to outlier data in the visualized data set.
[0027]
According to the present invention, a data identifier for identifying each data is displayed in association with each data in the visualized data set, and a data identifier corresponding to outlier data is specified in the visualized data set. Therefore, it is possible to increase the efficiency of detecting an outlier.
[0028]
The invention according to claim 9 is characterized by comprising: data identifier setting means for setting a data identifier for identifying each data item in a data set that is an outlier detection target; visualization means for visualizing the data set; data identifier display means for displaying the data identifier in association with each data item in the visualized data set; and removing means for removing, from the visualized data set, outlier data satisfying a predetermined condition.
[0029]
According to the present invention, a data identifier for identifying each data is displayed in association with each data in the visualized data set, and outlier data satisfying a predetermined condition is removed from the visualized data set. Therefore, the efficiency of detecting and removing outliers can be increased.
[0030]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, an embodiment of an outlier detection support program, an outlier detection support method, and an outlier detection support apparatus according to the present invention will be described in detail with reference to the drawings.
[0031]
FIG. 1 is a block diagram showing a configuration of an embodiment according to the present invention. In this figure, an outlier detection support apparatus 100 is an apparatus that supports detection and deletion of an outlier from a data set. In the outlier detection support device 100, the input unit 101 is a keyboard, a mouse, a data input device, or the like. The display unit 102 displays various data scatter diagrams (see FIGS. 4 to 8).
[0032]
The control unit 103 performs various controls (display control of a scatter diagram, control for detecting outliers, and the like). Details of the operation of the control unit 103 will be described later. The storage unit 104 stores various data under the control of the control unit 103.
[0033]
The original database 110 is a database that stores, on a file-by-file basis, original data that has not yet undergone the outlier detection processing for detecting outliers. The original data table 200 shown in FIG. 2A is obtained by opening one of a plurality of files stored in the original database 110.
[0034]
In the original data table 200, the record number is a number associated with each record. Here, in the original data table 200, when a record is deleted, the record numbers of the records below the deleted record are each moved up by one (that is, decreased by one), so that the record numbers remain consecutive from the top record to the bottom record. Attribute 1 to attribute n are data having n attributes. For example, attribute 1 is blood pressure data, attribute 2 is height data, ..., and attribute n is blood sugar level data.
[0035]
Returning to FIG. 1, the processed database 120 is a database that stores, on a file-by-file basis, processed data on which the above-described outlier detection has been performed and outliers have been excluded from the original data.
[0036]
(Operation example 1)
Next, operation example 1 of the embodiment will be described with reference to FIGS. 3 to 8. FIG. 3 is a flowchart illustrating operation example 1 of the embodiment. In step SA1 shown in the figure, the control unit 103 opens the outlier detection target file from the original database 110 and loads the initial original data table 200 shown in FIG. 2A into the storage unit 104.
[0037]
In step SA2, the control unit 103 creates the identification number table 210 in association with each record of the initial original data table 200 shown in FIG. 2A. The identification number table 210 is a table that stores identification numbers for identifying each data item in the original data table 200.
[0038]
In the identification number table 210 corresponding to the initial original data table 200, each identification number is the same as the corresponding record number. Here, in the identification number table 210, when the record 200a of the original data table 200 is deleted as shown in FIG. 2B, the record 210a corresponding to the record 200a is deleted and the identification number 2 becomes a missing number, but the other identification numbers 1 and 3 to m remain unchanged. Thus, the identification number table 210' shown in FIG. 2C is created.
[0039]
In the original data table 200, on the other hand, when the record 200a is deleted as shown in FIG. 2B, the record numbers 3 to m of the remaining lower records are each moved up by one and become record numbers 2 to m-1. As a result, the original data table 200' shown in FIG. 2C is created.
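The difference between the renumbered record numbers and the stable identification numbers can be sketched as follows. This is a minimal illustration only; the class name `OutlierTables`, its fields, and the sample attribute values are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class OutlierTables:
    """Original data table plus a parallel identification number table."""
    rows: list                               # attribute tuples; record number = position + 1
    ids: list = field(default_factory=list)  # identification number of each row

    def __post_init__(self):
        if not self.ids:
            # Initially each identification number equals the record number.
            self.ids = list(range(1, len(self.rows) + 1))

    def delete(self, identification_number):
        """Delete the record with this identification number from both tables."""
        pos = self.ids.index(identification_number)
        del self.rows[pos]
        del self.ids[pos]

    def record_numbers(self):
        """Record numbers stay consecutive, so they shift after a deletion."""
        return list(range(1, len(self.rows) + 1))


tables = OutlierTables(rows=[(120, 170), (300, 20), (118, 165), (125, 172)])
tables.delete(2)
print(tables.record_numbers())  # [1, 2, 3]
print(tables.ids)               # [1, 3, 4]
```

After the deletion, identification number 2 is simply absent, while the record numbers have closed the gap. A label drawn on a scatter diagram therefore keeps pointing at the same data item across repeated deletions.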
[0040]
In steps SA3 and SA4, high-dimensional data (in this case n-dimensional, since there are attributes 1 to n) is compressed into low-dimensional (in this case two-dimensional) data by the well-known cluster discrimination method. The cluster discrimination method is a data visualization method that performs clustering in the high-dimensional space, regards the data belonging to each obtained cluster as an independent category, and obtains a mapping to a low-dimensional space using discriminant analysis.
[0041]
That is, it is difficult to plot high-dimensional data on a scatter diagram and visualize the data. Therefore, by compressing high-dimensional data into low-dimensional (two-dimensional) data, two-dimensional data can be plotted on a scatter diagram to visualize the data.
[0042]
In step SA3, the control unit 103 specifies the number of clusters. In step SA4, the control unit 103 compresses the high-dimensional (n-dimensional) original data stored in the original data table 200 shown in FIG. 2A into low-dimensional (two-dimensional) data by the discriminant analysis of the cluster discrimination method. As a result, attributes 1 to n are compressed into a first-dimension attribute and a second-dimension attribute.
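The patent does not give formulas for the discriminant analysis step. As a hedged sketch, assuming the standard Fisher discriminant formulation is applied to the K clusters C_1, ..., C_K found in step SA3 (with N_k items in cluster C_k and N items in total), the two-dimensional mapping could be written as below. The symbols are introduced here for illustration only.

```latex
\begin{aligned}
\mu_k &= \frac{1}{N_k}\sum_{x \in C_k} x, \qquad
\mu = \frac{1}{N}\sum_{k=1}^{K}\sum_{x \in C_k} x \\
S_W &= \sum_{k=1}^{K}\sum_{x \in C_k} (x-\mu_k)(x-\mu_k)^{\top} \\
S_B &= \sum_{k=1}^{K} N_k\,(\mu_k-\mu)(\mu_k-\mu)^{\top} \\
A^{*} &= \arg\max_{A \in \mathbb{R}^{n \times 2}}
  \operatorname{tr}\!\left[(A^{\top} S_W A)^{-1} A^{\top} S_B A\right] \\
y &= A^{*\top} x \in \mathbb{R}^{2}
\end{aligned}
```

Under this assumption, and when S_W is invertible, the columns of A* are the eigenvectors of S_W^{-1} S_B with the two largest eigenvalues, and the two components of y play the role of the first-dimension and second-dimension attributes. Because S_B has rank at most K-1, at least three clusters would be needed for both axes to carry discriminant information.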
[0043]
If only two-dimensional (attribute 1 and attribute 2) original data has already been stored in the original data table 200, the processing of step SA3 and step SA4 is skipped.
[0044]
In step SA5, the control unit 103 causes the display unit 102 to display the scatter diagram 300 shown in FIG. 4, in which each record of the original data table 200 (more precisely, its data compressed from n dimensions to two dimensions, namely the first-dimension attribute and the second-dimension attribute) is a plot. In the scatter diagram 300, the horizontal axis corresponds to the first dimension (attribute), and the vertical axis corresponds to the second dimension (attribute).
[0045]
In step SA6, the control unit 103 displays each identification number of the identification number table 210 shown in FIG. 2A near the corresponding plot (record) of the scatter diagram 300 shown in FIG. 4, turning the scatter diagram 300 into the scatter diagram 310 shown in FIG. 5. For example, “14” in the scatter diagram 310 shown in FIG. 5 corresponds to the identification number (= 14) in the identification number table 210 shown in FIG. 2A.
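One way to picture steps SA5 and SA6 is as a function that pairs each two-dimensional plot with its identification number and computes where the label text goes. The sketch below makes that concrete. The function name `label_plots` and the `offset` parameter are hypothetical, and the actual drawing on the display unit is left out.

```python
def label_plots(points, ids, offset=(0.02, 0.02)):
    """Return one (x, y, text) label per plot, placed slightly off the point.

    points: list of (x, y) pairs after compression to two dimensions.
    ids:    identification numbers in the same order as points.
    offset: hypothetical label displacement in axis units.
    """
    if len(points) != len(ids):
        raise ValueError("each plot needs exactly one identification number")
    dx, dy = offset
    # The label text is the identification number, not the record number.
    return [(x + dx, y + dy, str(i)) for (x, y), i in zip(points, ids)]


labels = label_plots([(0.1, 0.4), (0.9, 0.2)], [13, 14])
```

Any plotting library could consume the returned tuples. The point of the sketch is only that the text drawn next to a plot comes from the identification number table.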
[0046]
In step SA7, the control unit 103 determines whether the scatter diagram 310 shown in FIG. 5 contains a portion where plots overlap and the identification numbers are hard to read. In this case, since a plurality of plots within the frame 311 shown in FIG. 6 overlap, the control unit 103 sets the determination result of step SA7 to “Yes”. If the determination result of step SA7 is “No”, the control unit 103 makes the determination of step SA8.
[0047]
In step SA12, the control unit 103 enlarges the frame 311 (overlapping portion) shown in FIG. 6 and causes the display unit 102 to display the enlarged scatter diagram 320 shown in FIG. 7 side by side with the scatter diagram 310 (see FIG. 6). This makes the identification numbers and the plots easier to see.
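The patent does not say how overlapping plots are detected or how the frame is chosen. A minimal sketch, assuming that two plots overlap when they are closer than a threshold and that the frame is the bounding box of all such plots, might look like this. The names `find_overlap_frame`, `min_gap`, and `margin` are hypothetical.

```python
import math


def find_overlap_frame(points, min_gap=0.05, margin=0.05):
    """Return (xmin, ymin, xmax, ymax) around crowded plots, or None.

    Two plots count as overlapping when they are closer than min_gap.
    Both min_gap and margin are hypothetical tuning values.
    """
    crowded = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < min_gap:
                crowded.update((i, j))
    if not crowded:
        return None  # corresponds to "No" at step SA7
    xs = [points[i][0] for i in crowded]
    ys = [points[i][1] for i in crowded]
    # The returned frame is the region to show in the enlarged scatter diagram.
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)
```

The pairwise check is quadratic in the number of plots, which is acceptable for a sketch. A real display would more likely test overlap in screen pixels, including the size of the label text.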
[0048]
In step SA8, the control unit 103 determines whether or not the operator has instructed that an outlier exists in the scatter diagram 310 (see FIG. 6) and the enlarged scatter diagram 320 (see FIG. 7).
[0049]
In this case, the operator recognizes the plot (data) corresponding to “14” (circle 312) in the scatter diagram 310 shown in FIG. 8 as an outlier and uses the input unit 101 to indicate that an outlier exists.
[0050]
Thereby, the control unit 103 sets the determination result of step SA8 to “Yes”. In step SA9, the operator uses the input unit 101 to specify the identification number “14” corresponding to the outlier.
[0051]
In step SA10, the control unit 103 deletes the record corresponding to the specified identification number “14” from the original data table 200 and the identification number table 210. Here, the identification number “14” is a missing number. Thereafter, steps SA3 to SA10 are repeated until the determination result of step SA8 becomes "No".
[0052]
Then, when the result of the determination in step SA8 is “No”, in step SA11, the control unit 103 stores the original data table 200 from which the record corresponding to the outlier has been deleted as processed data in the processed database 120.
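The designate-and-delete loop of steps SA8 to SA11 can be sketched as follows. The callback `choose_outlier` is a hypothetical stand-in for the operator and the input unit 101, and the re-projection and redisplay of steps SA3 to SA7 on each pass are omitted for brevity.

```python
def remove_outliers(rows, choose_outlier):
    """Repeat the designate-and-delete loop until no outlier is reported.

    rows:           list of attribute tuples (the original data table).
    choose_outlier: hypothetical callback standing in for the operator; it
                    receives the remaining identification numbers and returns
                    one to delete, or None when no outlier remains.
    """
    ids = list(range(1, len(rows) + 1))  # step SA2: IDs start equal to record numbers
    rows = list(rows)                    # work on a copy of the opened file
    while True:
        target = choose_outlier(ids)     # steps SA8 and SA9
        if target is None:
            break                        # "No" at step SA8
        pos = ids.index(target)          # step SA10: delete from both tables
        del rows[pos]
        del ids[pos]
    return rows, ids                     # step SA11: processed data to be stored


answers = iter([14, None])
processed_rows, remaining_ids = remove_outliers(
    [(i,) for i in range(1, 21)], lambda ids: next(answers)
)
```

In this usage example the simulated operator designates identification number 14 once and then reports that no outlier remains, so 19 rows survive and the number 14 is missing from `remaining_ids`.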
[0053]
(Operation example 2)
In the embodiment, outlier candidates may also be displayed in a different color in the scatter diagram to assist the operator in detecting outliers. This case is described below as Operation Example 2. FIG. 9 is a flowchart illustrating Operation Example 2 of the embodiment.
[0054]
In steps SB1 to SB4 shown in the same drawing, the same processing as the above-described steps SA1 to SA4 (see FIG. 3) is executed.
[0055]
In step SB5, the control unit 103 causes the display unit 102 to display the scatter diagram 300 shown in FIG. 4, in which each record of the original data table 200 (as data compressed from n dimensions to two dimensions, namely a first-dimension attribute and a second-dimension attribute) is drawn as a plot.
[0056]
In step SB6, the control unit 103 displays each identification number of the identification number table 210 shown in FIG. 2A near the corresponding plot (record) of the scatter diagram 300 shown in FIG. 4, which turns the scatter diagram 300 into the scatter diagram 310 shown in FIG. 5.
[0057]
In step SB7, the control unit 103 displays a plot (identification number) in which the number of clusters is 1 in red as an outlier candidate. In this case, it is assumed that the identification number “14” and the corresponding plot are displayed in red.
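The color coding in step SB7 could be sketched as follows. Two assumptions are made here and are not taken from the specification: the condition “the number of clusters is 1” is read as “the plot belongs to a cluster with a single member”, and clusters are formed by linking plots that lie within a distance `link_dist` of each other (single-linkage grouping). The default color `"black"` is likewise an assumption.

```python
from itertools import combinations
from typing import Dict, Tuple

Point = Tuple[float, float]


def cluster_sizes(plots: Dict[int, Point], link_dist: float) -> Dict[int, int]:
    """For each plot, the size of its cluster when plots within link_dist are linked together."""
    parent = {ident: ident for ident in plots}

    def find(ident: int) -> int:
        # Follow parent links to the cluster representative, shortening the path on the way.
        while parent[ident] != ident:
            parent[ident] = parent[parent[ident]]
            ident = parent[ident]
        return ident

    for a, b in combinations(plots, 2):
        (xa, ya), (xb, yb) = plots[a], plots[b]
        if (xa - xb) ** 2 + (ya - yb) ** 2 <= link_dist ** 2:
            parent[find(a)] = find(b)

    counts: Dict[int, int] = {}
    for ident in plots:
        root = find(ident)
        counts[root] = counts.get(root, 0) + 1
    return {ident: counts[find(ident)] for ident in plots}


def plot_colors(plots: Dict[int, Point], link_dist: float) -> Dict[int, str]:
    """Color single-member clusters red as outlier candidates; everything else is black."""
    sizes = cluster_sizes(plots, link_dist)
    return {ident: ("red" if sizes[ident] == 1 else "black") for ident in plots}


# Hypothetical data: plot 14 is far from the other three.
plots = {1: (1.0, 1.0), 2: (1.2, 1.1), 3: (0.9, 1.3), 14: (9.0, 9.0)}
colors = plot_colors(plots, link_dist=1.0)
```

With these sample values only plot 14 is colored red, mirroring the example in the text. A different clustering method or threshold would change which plots are flagged.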
[0058]
In step SB8, the control unit 103 determines whether or not the scatter diagram 310 illustrated in FIG. 5 contains a portion where plots overlap and the identification numbers are difficult to see. In this case, the determination result is “Yes”. If the determination result of step SB8 is “No”, the control unit 103 proceeds to the determination in step SB9.
[0059]
In step SB13, the control unit 103 enlarges the frame 311 (the overlapping portion) shown in FIG. 6 and causes the display unit 102 to display the resulting enlarged scatter diagram 320 shown in FIG. 7 alongside the scatter diagram 310 (see FIG. 6).
[0060]
In step SB9, the control unit 103 determines whether or not the operator has instructed that a plot to be removed exists in the scatter diagram 310 (see FIG. 6) and the enlarged scatter diagram 320 (see FIG. 7).
[0061]
In this case, the operator recognizes the plot (data) corresponding to “14” (round frame 312), which is displayed in red, as an outlier in the scatter diagram 310 shown in FIG. 8 and gives an instruction indicating that a plot to be removed exists.
[0062]
Thereby, the control unit 103 sets the determination result of step SB9 to “Yes”. In step SB10, the operator uses the input unit 101 to specify the identification number “14” of the plot to be removed, which corresponds to the outlier.
[0063]
In step SB11, the control unit 103 deletes the record corresponding to the specified identification number “14” from the original data table 200 and the identification number table 210. As a result, the identification number “14” becomes a missing number. Thereafter, steps SB3 to SB11 are repeated until the determination result of step SB9 becomes “No”.
[0064]
Then, when the result of the determination in step SB9 is “No”, in step SB12, the control unit 103 stores the original data table 200 from which the record corresponding to the outlier has been deleted in the processed database 120 as processed data.
[0065]
(Operation example 3)
Operation Example 2 of the embodiment described the case where outlier candidates are displayed in a different color in the scatter diagram and outliers are detected and removed in accordance with the operator's instruction. Alternatively, the color-coded plots may be removed automatically as outliers. This case is described below as Operation Example 3. FIG. 10 is a flowchart illustrating Operation Example 3 of the embodiment.
[0066]
In steps SC1 to SC4 shown in the same drawing, the same processing as the above-described steps SA1 to SA4 (see FIG. 3) is executed.
[0067]
In step SC5, the control unit 103 causes the display unit 102 to display the scatter diagram 300 shown in FIG. 4, in which each record of the original data table 200 (as data compressed from n dimensions to two dimensions, namely a first-dimension attribute and a second-dimension attribute) is drawn as a plot.
[0068]
In step SC6, the control unit 103 displays each identification number of the identification number table 210 shown in FIG. 2A near the corresponding plot (record) of the scatter diagram 300 shown in FIG. 4, which turns the scatter diagram 300 into the scatter diagram 310 shown in FIG. 5.
[0069]
In step SC7, the control unit 103 displays a plot (identification number) in which the number of clusters is 1 in red as a candidate for an outlier. In this case, it is assumed that the identification number “14” and the corresponding plot are displayed in red.
[0070]
In step SC8, the control unit 103 determines whether or not the scatter diagram 310 illustrated in FIG. 5 contains a portion where plots overlap and the identification numbers are difficult to see. In this case, the determination result is “Yes”. If the determination result of step SC8 is “No”, the control unit 103 proceeds to the determination in step SC9.
[0071]
In step SC12, the control unit 103 enlarges the frame 311 (the overlapping portion) shown in FIG. 6 and causes the display unit 102 to display the resulting enlarged scatter diagram 320 shown in FIG. 7 alongside the scatter diagram 310 (see FIG. 6).
[0072]
In step SC9, the control unit 103 determines whether or not a red plot, that is, an outlier, exists in the scatter diagram 310 (see FIG. 6) and the enlarged scatter diagram 320 (see FIG. 7). In this case, since the plot (data) corresponding to “14” (round frame 312) is displayed in red in the scatter diagram 310 illustrated in FIG. 8, the control unit 103 sets the determination result of step SC9 to “Yes”.
[0073]
In step SC10, the control unit 103 deletes the record corresponding to the specified identification number “14” from the original data table 200 and the identification number table 210. As a result, the identification number “14” becomes a missing number. Thereafter, steps SC3 to SC10 are repeated until the determination result of step SC9 becomes “No”.
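The repeat-until-“No” loop of Operation Example 3 might be sketched as below. As an assumption for this example, a plot counts as an outlier when no other plot lies within a distance `link_dist` of it, which is one way to express a cluster with a single member. The names `isolated_ids` and `auto_remove` are hypothetical, and the sketch removes all current candidates in one pass rather than one record per iteration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def isolated_ids(plots: Dict[int, Point], link_dist: float) -> List[int]:
    """Identification numbers of plots with no other plot within link_dist."""
    result = []
    for a, (xa, ya) in plots.items():
        has_neighbor = any(
            b != a and (xa - xb) ** 2 + (ya - yb) ** 2 <= link_dist ** 2
            for b, (xb, yb) in plots.items()
        )
        if not has_neighbor:
            result.append(a)
    return result


def auto_remove(plots: Dict[int, Point], link_dist: float) -> Tuple[Dict[int, Point], List[int]]:
    """Repeatedly delete outlier candidates until none remain; return (remaining plots, removed ids)."""
    remaining = dict(plots)  # work on a copy so the caller's data is untouched
    removed: List[int] = []
    while True:
        candidates = isolated_ids(remaining, link_dist)
        if not candidates:  # corresponds to the determination result "No"
            return remaining, removed
        for ident in candidates:
            del remaining[ident]
            removed.append(ident)


# Hypothetical data: plot 14 is the only isolated plot.
plots = {1: (1.0, 1.0), 2: (1.2, 1.1), 3: (0.9, 1.3), 14: (9.0, 9.0)}
remaining, removed = auto_remove(plots, link_dist=1.0)
```

The loop always terminates because each pass either removes at least one plot or returns. Since no operator confirms the removal, the choice of condition and threshold directly decides which records are lost.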
[0074]
Then, when the result of the determination in step SC9 is “No”, in step SC11, the control unit 103 stores the original data table 200 from which the record corresponding to the outlier has been deleted as processed data in the processed database 120.
[0075]
As described above, according to the embodiment (Operation Example 1), as shown in FIG. 6, an identification number for identifying each piece of data (plot) is displayed in association with that data in the data set visualized by the scatter diagram 310, and the identification number corresponding to the outlier data in the visualized data set is specified via the input unit 101. The efficiency of detecting outliers can therefore be increased.
[0076]
Further, according to the embodiment (Operation Example 1), since the data corresponding to the specified identification number is removed as an outlier, the efficiency of removing the outlier can be increased.
[0077]
According to one embodiment (operation example 1), as shown in FIGS. 6 and 7, a portion (frame 311) where data (plot) overlaps in the visualized data set is enlarged and displayed. Therefore, even when data (plots) overlap, the efficiency of detecting outliers can be increased.
[0078]
According to the embodiment (Operation Example 2), as described with reference to FIG. 9, data (a plot and its identification number) that is an outlier candidate satisfying a predetermined condition is visualized by color coding in the visualized data set, so that the efficiency of detecting outliers can be further increased.
[0079]
Further, according to the embodiment (Operation Example 3), as described with reference to FIG. 10, an identification number for identifying each piece of data is displayed in association with that data in the visualized data set, and outlier data that satisfies a predetermined condition (the number of clusters is 1) is removed from the visualized data set, so that the efficiency of detecting and removing outliers can be increased.
[0080]
An embodiment according to the present invention has been described in detail with reference to the drawings. However, the specific configuration is not limited to this embodiment, and design changes within a range not departing from the gist of the present invention are also included in the present invention.
[0081]
For example, in the embodiment described above, a program for realizing the functions of the outlier detection support device 100 shown in FIG. 1 may be recorded on the computer-readable recording medium 500 shown in FIG. 11, and each function may be realized by having the computer 400 shown in FIG. 11 read and execute the program recorded on the recording medium 500.
[0082]
The computer 400 shown in the figure includes a CPU (Central Processing Unit) 410 for executing the above program, an input device 420 such as a keyboard and a mouse, a ROM (Read Only Memory) 430 for storing various data, a RAM (Random Access Memory) 440 for storing arithmetic parameters and the like, a reading device 450 for reading the program from the recording medium 500, and an output device 460 such as a display or a printer.
[0083]
The CPU 410 realizes the above-described functions by reading the program recorded on the recording medium 500 via the reading device 450 and executing the program. Note that the recording medium 500 includes an optical disk, a flexible disk, a hard disk, and the like.
[0084]
[Effects of the invention]
As described above, according to the first, sixth, and eighth aspects of the present invention, a data identifier for identifying each piece of data is displayed in association with that data in the visualized data set, and the data identifier corresponding to the outlier data is specified in the visualized data set, so that the efficiency of detecting outliers can be increased.
[0085]
According to the second aspect of the present invention, a portion where data overlaps in the visualized data set is enlarged and displayed, so that the efficiency of detecting outliers can be increased even when data overlaps.
[0086]
According to the third aspect of the present invention, data that is an outlier candidate satisfying a predetermined condition is visualized by color coding in the visualized data set, so that the efficiency of detecting outliers can be further increased.
[0087]
According to the fourth aspect of the present invention, the data corresponding to the specified data identifier is removed as an outlier, so that the efficiency of removing outliers can be increased.
[0088]
According to the fifth, seventh, and ninth aspects of the present invention, a data identifier for identifying each piece of data is displayed in association with that data in the visualized data set, and outlier data satisfying a predetermined condition is removed from the visualized data set, so that the efficiency of detecting and removing outliers can be increased.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an embodiment according to the present invention.
FIG. 2 is a diagram showing various tables in the same embodiment.
FIG. 3 is a flowchart illustrating an operation example 1 of the same embodiment.
FIG. 4 is a scatter diagram 300 in the same embodiment.
FIG. 5 is a scatter diagram 310 in which identification numbers corresponding to the respective plots shown in FIG. 4 are assigned.
FIG. 6 is a diagram showing a frame 311 in the scatter diagram 310 shown in FIG. 5.
FIG. 7 is an enlarged scatter diagram 320 corresponding to the frame 311 shown in FIG. 6.
FIG. 8 is a diagram showing a round frame 312 in the scatter diagram 310 shown in FIG. 5.
FIG. 9 is a flowchart illustrating an operation example 2 of the same embodiment.
FIG. 10 is a flowchart illustrating an operation example 3 of the same embodiment.
FIG. 11 is a block diagram showing a configuration of a modification of the same embodiment.
FIG. 12 is a diagram showing a conventional data table 10.
FIG. 13 is a scatter diagram 20 in which each record of the data table 10 shown in FIG. 12 is plotted.
[Explanation of symbols]
100 Outlier detection support device
101 Input unit
102 Display unit
103 Control unit
104 Storage unit
110 Original database
120 Processed database

Claims

1. An outlier detection support program for causing a computer to function as:
data identifier setting means for setting a data identifier for identifying each piece of data in a data set that is an outlier detection target;
visualization means for visualizing the data set;
data identifier display means for displaying the data identifier in association with each piece of data in the visualized data set; and
outlier specifying means for specifying a data identifier corresponding to outlier data in the visualized data set.

2. The outlier detection support program according to claim 1, wherein the computer is caused to function as enlargement display means for enlarging and displaying a portion where data overlaps in the visualized data set.

3. The outlier detection support program according to claim 1, wherein the visualization means visualizes, by color coding, data that is an outlier candidate satisfying a predetermined condition in the visualized data set.

4. The outlier detection support program according to claim 1, wherein the computer is caused to function as outlier removing means for removing, as an outlier, data corresponding to the data identifier specified by the outlier specifying means.

5. An outlier detection support program for causing a computer to function as:
data identifier setting means for setting a data identifier for identifying each piece of data in a data set that is an outlier detection target;
visualization means for visualizing the data set;
data identifier display means for displaying the data identifier in association with each piece of data in the visualized data set; and
removing means for removing outlier data that satisfies a predetermined condition in the visualized data set.

6. An outlier detection support method, comprising:
a data identifier setting step of setting a data identifier for identifying each piece of data in a data set that is an outlier detection target;
a visualization step of visualizing the data set;
a data identifier display step of displaying the data identifier in association with each piece of data in the visualized data set; and
an outlier specifying step of specifying a data identifier corresponding to outlier data in the visualized data set.

7. An outlier detection support method, comprising:
a data identifier setting step of setting a data identifier for identifying each piece of data in a data set that is an outlier detection target;
a visualization step of visualizing the data set;
a data identifier display step of displaying the data identifier in association with each piece of data in the visualized data set; and
a removing step of removing outlier data that satisfies a predetermined condition in the visualized data set.

8. An outlier detection support device, comprising:
data identifier setting means for setting a data identifier for identifying each piece of data in a data set that is an outlier detection target;
visualization means for visualizing the data set;
data identifier display means for displaying the data identifier in association with each piece of data in the visualized data set; and
outlier specifying means for specifying a data identifier corresponding to outlier data in the visualized data set.

9. An outlier detection support device, comprising:
data identifier setting means for setting a data identifier for identifying each piece of data in a data set that is an outlier detection target;
visualization means for visualizing the data set;
data identifier display means for displaying the data identifier in association with each piece of data in the visualized data set; and
removing means for removing outlier data that satisfies a predetermined condition in the visualized data set.