JP2004038412A - Data mining method and data mining system and data mining program

Info

Publication number: JP2004038412A
Application number: JP2002192684A
Authority: JP
Inventors: Yasushi Shinohara; 篠原　靖志; Teruhisa Miura; 三浦　輝久
Original assignee: Central Research Institute of Electric Power Industry
Current assignee: Central Research Institute of Electric Power Industry
Priority date: 2002-07-01
Filing date: 2002-07-01
Publication date: 2004-02-05

Abstract

PROBLEM TO BE SOLVED: To supplement an analysis database with data efficiently and effectively.
SOLUTION: Discrimination knowledge, which takes the item data of a case as input and outputs the class to which the case belongs, is generated from an analysis database (step 2). Similar cases, whose item data are similar but whose classes differ, are identified in the analysis database; an additional item effective for differentiating the similar cases is identified; and data on the additional item for all cases in the analysis database are added to the analysis database from an information source other than the analysis database (step 8). New discrimination knowledge is then generated from the expanded analysis database (step 2).
COPYRIGHT: (C)2004,JPO

Description

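[0001]
[Technical field of the invention]
The present invention relates to a data mining method, system, and program. More specifically, it relates to a method, system, and program for efficiently and effectively expanding the database used for data mining.
[0002]
[Prior art]
Data mining is a technique for discovering useful correlations hidden in vast amounts of data. It is expected to be a powerful means of putting data to good use, for example by finding best-selling products and customer preferences in data obtained from a POS (point-of-sale) system. Data mining, which analyzes the large volumes of data held by a company and extracts "knowledge" that supports decision-making, is becoming indispensable for surviving competition among companies.
[0003]
[Problem to be solved by the invention]
Conventional data mining, however, targets a closed environment, that is, a state in which the data to be analyzed are already organized as a database in the analyst's hands. The data available for analysis at the time of data mining are therefore limited, and useful "knowledge" sometimes cannot be obtained from the existing data alone.
[0004]
A mechanism is therefore needed for efficiently obtaining useful information from external information sources other than those in the analyst's hands. Conventional data mining gives no consideration to such expansion of the database.
[0005]
Expanding the database indiscriminately wastes the time and cost of analysis. There is also a risk that external data obtained with effort ultimately do not improve the knowledge.
[0006]
An object of the present invention is therefore to provide a data mining method, system, and program that extract accurate and useful knowledge by supplementing an analysis database with data efficiently and effectively.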
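[0007]
[Means for solving the problem]
To achieve this object, the invention of claim 1 is a data mining method that generates, from an analysis database holding information organized by case and by item together with information indicating which of a plurality of predetermined classes each case belongs to, discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs. The method comprises at least: a step of identifying, in the analysis database, similar cases whose item data are similar but whose classes differ; a step of identifying an additional item effective for differentiating the similar cases; a step of adding data on the additional item, for all cases in the analysis database, to the analysis database from an information source other than the analysis database; and a step of generating discrimination knowledge from the expanded analysis database.
[0008]
Thus, by identifying similar cases that look alike but belong to different classes, and by adding to the analysis database data on an additional item effective for differentiating them, the analysis database is supplemented efficiently and effectively, and accurate, useful discrimination knowledge can be extracted.
[0009]
The invention of claim 2 is the data mining method of claim 1 in which the additional item is identified from among recommended items entered for differentiating the similar cases. The recommended items are entered by, for example, an analyst or another person carrying out the data mining. Because the similar cases have been identified, the analyst can compare and examine concrete cases (the similar cases). Compared with a vague, general question such as "What additional items are needed to improve the accuracy of the discrimination knowledge?", this sharpens the analyst's awareness of the problem and makes it easier to enter appropriate recommended items. The analyst's knowledge can therefore be elicited effectively.
[0010]
The invention of claim 3 is the data mining method of claim 1 in which the additional item is identified from the information source according to predetermined rules. An additional item effective for refining the discrimination knowledge is thus identified automatically, without waiting for recommended items to be entered. In this case the person carrying out the data mining does not need the knowledge to recommend items, which avoids the accuracy of the discrimination knowledge depending heavily on the analyst's judgment or ability.
[0011]
The invention of claim 4 is the data mining method of any of claims 1 to 3 in which, when the accuracy of the generated discrimination knowledge does not satisfy a predetermined criterion, data on the additional item are added to the analysis database and new discrimination knowledge is generated from the expanded analysis database. Unnecessary expansion of the analysis database is thereby avoided.
[0012]
In the invention of claim 5, when the accuracy of the generated discrimination knowledge does not satisfy a predetermined criterion, data on additional cases not present in the analysis database are added to the analysis database from the information source, and new discrimination knowledge is generated from the expanded analysis database. Thus not only data on additional items but also data on additional cases not present in the analysis database can be supplied to the analysis database to extract accurate, useful knowledge.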
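[0013]
The invention of claim 6 is the data mining method of claim 4 or 5 in which a hard-to-discriminate region is identified from the generated discrimination knowledge, the discrimination knowledge is applied to verification cases that fall within the hard-to-discriminate region and are not present in the analysis database, and the accuracy of the discrimination knowledge is verified from the result. Effective verification can therefore be performed with a small amount of data.
[0014]
The invention of claim 7 is the data mining method of claim 6 in which, when the accuracy of the discrimination knowledge does not satisfy the predetermined criterion, the data on the verification cases are added to the analysis database as data on additional cases. In this case, data on additional cases not present in the analysis database can be supplied to it efficiently and effectively.
[0015]
The invention of claim 8 is the data mining method of any of claims 4 to 7 in which the step of adding data on additional items or additional cases to the analysis database and the step of generating new discrimination knowledge from the expanded analysis database are repeated until the accuracy of the discrimination knowledge satisfies the predetermined criterion, or until the improvement in accuracy obtained by adding such data no longer satisfies a predetermined criterion. In this case, expansion of the analysis database with data on additional items or additional cases is repeated until sufficiently accurate discrimination knowledge is obtained or no further improvement in accuracy can be expected.
[0016]
The invention of claim 9 is the data mining method of claim 8 in which data on additional cases are added to the analysis database when the accuracy of the discrimination knowledge does not satisfy the predetermined criterion, and data on additional items are added when the improvement in accuracy obtained by adding data on additional cases does not satisfy the predetermined criterion. In this case, when the accuracy is low, data on additional cases are added first; data on additional items are added only when adding further cases can no longer be expected to improve the accuracy. The analysis database is therefore expanded efficiently and effectively.
[0017]
The data mining system of claim 10 embodies the data mining method of claim 1 as an apparatus. That is, the data mining system of claim 10 comprises at least: an analysis database holding information organized by case and by item together with information indicating which of a plurality of predetermined classes each case belongs to; means for identifying, in the analysis database, similar cases whose item data are similar but whose classes differ; means for identifying an additional item effective for differentiating the similar cases; means for adding data on the additional item, for all cases in the analysis database, to the analysis database from an information source other than the analysis database; and means for generating, from the expanded analysis database, discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs.
[0018]
The data mining program of claim 11 is a program for causing a computer to function as the data mining system of claim 10. That is, the program causes a computer to function as: an analysis database holding information organized by case and by item together with information indicating which of a plurality of predetermined classes each case belongs to; means for identifying, in the analysis database, similar cases whose item data are similar but whose classes differ; means for identifying an additional item effective for differentiating the similar cases; means for adding data on the additional item, for all cases in the analysis database, to the analysis database from an information source other than the analysis database; and means for generating, from the expanded analysis database, discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs.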
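[0019]
[Embodiments of the invention]
The configuration of the present invention is described in detail below on the basis of the embodiments shown in the drawings.
[0020]
FIGS. 1 to 18 show one embodiment of the data mining method, data mining system, and data mining program of the present invention. The data mining method of this embodiment generates, from analysis database 1 holding information organized by case and by item together with information indicating which of a plurality of predetermined classes each case belongs to, discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs. It comprises at least: a step of identifying, in analysis database 1, similar cases whose item data are similar but whose classes differ; a step of identifying an additional item effective for differentiating the similar cases; a step of adding data on the additional item, for all cases in analysis database 1, to analysis database 1 from an information source other than analysis database 1; and a step of generating discrimination knowledge from the expanded analysis database 1.
[0021]
The data added to analysis database 1 need not be limited to data on additional items; to obtain more accurate discrimination knowledge, data on cases not present in analysis database 1 may also be added as necessary. In this embodiment, for example, when the accuracy of the generated discrimination knowledge does not satisfy a predetermined criterion, data on additional items, or on additional cases not present in analysis database 1, are added to analysis database 1, and new discrimination knowledge is generated from the expanded analysis database 1.
[0022]
The method of verifying the accuracy of the discrimination knowledge is not particularly limited. In this embodiment, for example, a hard-to-discriminate region is identified from the generated discrimination knowledge, the discrimination knowledge is applied to verification cases that fall within that region and are not present in analysis database 1, and the accuracy of the discrimination knowledge is verified from the result.
[0023]
When the accuracy of the generated discrimination knowledge does not satisfy the predetermined criterion, this embodiment adds, for example, the data on the above verification cases to analysis database 1 as the data on additional cases. The rule for selecting additional cases is not, however, necessarily limited to this example.
[0024]
How far analysis database 1 is expanded with additional items or additional cases, and when that expansion is stopped, are not necessarily limited. In this embodiment, for example, the step of adding data on additional items or additional cases to analysis database 1 and the step of generating new discrimination knowledge from the expanded analysis database 1 are repeated until the accuracy of the discrimination knowledge satisfies the predetermined criterion, or until the improvement in accuracy obtained by adding such data no longer satisfies a predetermined criterion.
[0025]
The timing of adding data on additional items and additional cases to analysis database 1 is also not particularly limited; either may be added first, or both may be added at the same time. In this embodiment, for example, when the accuracy of the discrimination knowledge does not satisfy the predetermined criterion, data on additional cases are first added to analysis database 1; then, when the improvement in accuracy obtained by adding data on additional cases does not satisfy the predetermined criterion, data on additional items are added to analysis database 1.
[0026]
The data mining method of the present invention is embodied as a data mining system using a computer. The computer may be a well-known one comprising a central processing unit (CPU), a main storage device, an external storage device, an output device such as a display, an input device such as a keyboard, a communication interface to external information processing devices, and so on. The data mining system comprises at least: analysis database 1 holding information organized by case and by item together with information indicating which of a plurality of predetermined classes each case belongs to; means for identifying, in analysis database 1, similar cases whose item data are similar but whose classes differ; means for identifying an additional item effective for differentiating the similar cases; means for adding data on the additional item, for all cases in analysis database 1, to analysis database 1 from an information source other than analysis database 1; and means for generating, from the expanded analysis database 1, discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs.
[0027]
The data mining program of the present invention is a program for embodying the data mining method of the present invention as a data mining system using a computer. The program causes a computer to function as: analysis database 1 holding information organized by case and by item together with information indicating which of a plurality of predetermined classes each case belongs to; means for identifying, in analysis database 1, similar cases whose item data are similar but whose classes differ; means for identifying an additional item effective for differentiating the similar cases; means for adding data on the additional item, for all cases in analysis database 1, to analysis database 1 from an information source other than analysis database 1; and means for generating, from the expanded analysis database 1, discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs.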
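[0028]
In this embodiment, for convenience of explanation, all data in the world are regarded, as shown in the conceptual diagram of FIG. 11, as one huge two-dimensional table of samples × data items (the virtual database).
[0029]
A sample is an individual "case" that serves as the unit of analysis. Samples are represented by the rows of the virtual database (FIG. 11). For example, in a sales analysis that predicts in which regions a product such as an electric water heater or an induction cooker sells well, each city is a sample. In this embodiment, individual samples are denoted i or j.
[0030]
A data item (also called an attribute) is an "item" that expresses a feature of a sample. Data items are represented by the columns of the virtual database (FIG. 11). In the sales analysis example above, features of a city such as sales amount and number of customers are data items. In this embodiment, an individual data item is written a, a set of data items is written A, and the number of data items contained in analysis database 1 at a given time is written d.
[0031]
In this embodiment, the value of data item a of a sample is written x_a; in particular, the value of data item a of a specific sample i is written x_ia. The sequence (vector) of the values of the data items of a sample is written X, and the vector of the values of the data items belonging to data item set A is written X_A. For a specific sample i these are written X_i and X_iA, respectively. That is, X_A or X_iA is used when the data item set A needs to be specified, and X or X_i when it does not. Data written with the subscript j or the like instead of i refer to a sample j distinct from sample i. Where needed to distinguish data, the symbols z and Z are used in place of x and X. For simplicity, a symbol or notation once defined keeps the same meaning throughout this specification unless otherwise stated. When the values (X_A) of the data item set (A) are given for each sample, a sample can be expressed by the following equation.
[0032]
[Equation 1]
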
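[0033]
In this embodiment, an information source holding data that have actually been organized into a database, in a form the analyst can readily use for analysis, is called internal database 2. Examples are an in-house database or data warehouse that manages product sales data, customer data, and the like. Internal database 2 forms part of the virtual database.
[0034]
An information source holding the data of the virtual database other than those of internal database 2 is called external database 3. Its data are data that cannot be used without spending a certain amount of time or money, for example an outside database, data that exist in-house but are not in a usable form, or data that might be newly compiled into a database through a new survey. External database 3 therefore also includes data that have not actually been organized into a database.
[0035]
The database holding the data subject to data mining is called analysis database 1. In this embodiment, analysis database 1 is constructed, before data mining begins, by extracting data on some of the samples and data items in internal database 2. As the analysis proceeds, data on new samples and data items are supplied to analysis database 1 from internal database 2 or external database 3.
[0036]
Analysis database 1 stores class information indicating which class each of its samples belongs to. This class information may be information stored in advance in internal database 2, or information that is not stored in internal database 2 and is added when analysis database 1 is constructed, on the basis of the analyst's judgment or data from external database 3.
[0037]
In this embodiment, the information sources other than analysis database 1 are internal database 2 and external database 3. Using these two concepts makes it possible to build a model close to real problems. In the sales analysis example, the analyst first extracts information that seems relevant from the in-house database managing product sales data and customer data, that is, from internal database 2, builds analysis database 1, and performs analysis with it. The analyst then analyzes the sales of the product while supplying analysis database 1, as necessary, with samples and data items from internal database 2 that it does not yet contain. If internal database 2 alone cannot explain, for example, "why the product sells in one region and not in another", the analyst purchases, with the budget in mind, data from external database 3, such as data newly unearthed in-house, data obtained from a new market survey, or data from outside sources (for example regional economic statistics or population composition by age), and continues the analysis.
[0038]
Internal database 2 and external database 3 differ mainly in two respects. First, internal database 2 can be regarded as free or of negligible cost in terms of access cost, collection cost, and time, whereas external database 3 costs far more in purchase fees, new storage space, acquisition time, and so on. Second, the contents of internal database 2 are already known, or at least the actual data can easily be inspected, whereas for external database 3, even if the general characteristics of the data can be estimated, what the data actually are remains unknown or uncertain until they are acquired.
[0039]
There is also a risk that data obtained with effort from external database 3 ultimately do not improve the discrimination knowledge. When external database 3 is used to obtain accurate discrimination knowledge, it is therefore important to have a strategy for external data acquisition that properly identifies the data in external database 3 most likely to help extract better knowledge and keeps the overall acquisition cost low. This is a new problem that did not arise in conventional data mining, which mainly covers only internal database 2.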
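[0040]
The purpose of data mining in this embodiment is to obtain discrimination knowledge that infers the class of a sample from the values of its data item set. The number of classes is generally two. In that case each sample falls into one of two classes, such as "success" or "failure", "present" or "absent", or "normal" or "abnormal". This embodiment describes the case in which the class of a sample is determined from information on the sample, for example estimating the success or failure of a product sale from customer information, or diagnosing equipment as normal or abnormal from its inspection records. The classes in this embodiment are the two classes "success" and "failure". In the sales analysis example, cities where sales succeeded belong to the "success" class and cities where they failed belong to the "failure" class. The class is written y, with y = 1 for "success" and y = −1 for "failure".
[0041]
Discrimination knowledge is knowledge for determining the "success" or "failure" of each sample from the values of its data item set A. It can be expressed, for example, as a function f(X_A) of the data item set A. In this embodiment, the class of each sample is determined by the sign of the discriminant function f(X_A). That is, as the following expression shows, if f(X_A) is greater than 0 the sample is judged to belong to the "success" class (y = 1), and if f(X_A) is 0 or less it is judged to belong to the "failure" class (y = −1).
[0042]
[Equation 2]
if f(X_A) > 0 then y = 1
else               y = −1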
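[0043]
The form of the discriminant function f(X_A) is not particularly limited, but it is desirable that it be easy for the analyst to understand and that it be a mathematical expression that can be verified mechanically against data. Two representative forms that satisfy both requirements are a linear function and a radial basis function (hereinafter RBF).
[0044]
In the linear case, the discriminant function f(X_A) is expressed, for example, by the following equation.
[0045]
[Equation 3]
f(X_A) = W'^T・X_A + b = Σ_{a∈A} w_a x_a + b
where
X_A: the column vector (a d×1 matrix) given by the following equation.
[Equation 4]
X_A = (x_1, x_2, …, x_d)
W': the column vector of weights (a d×1 matrix) given by the following equation.
[Equation 5]
W' = (w_1, w_2, …, w_d)
W'^T: the transpose of the weight vector W'.
w_a: a weight.
b: the intercept.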
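The linear discriminant function and the sign rule of Equation 2 can be illustrated with a minimal Python sketch. The function names `linear_f` and `classify` and the numeric values are illustrative assumptions and do not appear in the specification.

```python
from typing import Sequence


def linear_f(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Weighted sum of the data-item values plus the intercept: f(X_A) = W'^T X_A + b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same number of data items")
    return sum(w_a * x_a for w_a, x_a in zip(w, x)) + b


def classify(f_value: float) -> int:
    """Sign rule of Equation 2: +1 ("success") if f > 0, otherwise -1 ("failure")."""
    return 1 if f_value > 0 else -1


if __name__ == "__main__":
    # Two data items with illustrative weights: f = 2*1 + (-1)*1 + 0.5
    f_value = linear_f([1.0, 1.0], [2.0, -1.0], 0.5)
    print(f_value, classify(f_value))
```

Note that a value of exactly 0 is assigned to the "failure" class, as stated in paragraph [0041].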
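[0046]
In the RBF case, the discriminant function f(X_A) is expressed, for example, by the following equation.
[0047]
[Equation 6]

where
y_i: the class value (±1) of sample i.
α_i: a weight, with α_i > 0.
N: a multivariate normal distribution.
M_A: the vector of mean values of the data items belonging to data item set A.
Σ_A^2: the covariance matrix of the data items belonging to data item set A.
SV: the representative samples.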
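[0048]
The meaning of the discriminant function f(X_A) in the linear case and in the RBF case is explained below.
[0049]
In the linear case, f(X_A) is a weighted sum of the values x_a of the data items. Because discrimination depends on the sign of f(X_A), the value x_a of a data item a whose weight w_a is large has a large influence on the result. Examining the weights w_a of the discriminant function therefore reveals which data items influence the decision. Linear discrimination knowledge can thus be regarded as knowledge expressed in terms of influential data items.
[0050]
In the RBF case, N(・; M_A, Σ_A^2) in Equation 6 is a multivariate normal distribution with center M_A and covariance Σ_A^2, where the dot indicates that any value may be substituted. The value of N(X_iA; M_A, Σ_A^2) for a given X_iA is therefore larger the closer X_iA is to the center M_A, that is, the more similar X_iA and M_A are. The value y_i is +1 when sample i belongs to the "success" class and −1 when it belongs to the "failure" class. A sample X_jA similar to a sample X_iA with a large α_i therefore tends strongly to belong to the same class y_i as X_iA. Discrimination knowledge based on an RBF can thus be regarded as case-based knowledge obtained by comparison with representative "success" and "failure" samples having large α_i. Knowing the representative samples and the values of their data items reveals the characteristics of "success" and "failure" samples. Because the multivariate normal distribution N decreases with distance from the mean, it is a good indicator of similarity to the mean. In this embodiment, the multivariate normal distribution N is therefore also called the similarity N.
[0051]
As described above, the linear form yields knowledge based on characteristic data items, and the RBF form yields knowledge based on representative examples. Both forms are intuitive for the analyst and constitute knowledge useful for decision-making.
[0052]
For simplicity, this embodiment is explained mostly in terms of the linear knowledge representation, a weighted sum of data items. Because discrimination with an RBF corresponds to linear discrimination in the infinite-dimensional feature space F given by the following equation, the same arguments hold for the RBF.
[0053]
[Equation 7]

where
I: the identity matrix.
R^d: the d-dimensional real space.
[0054]
Here d is the number of data items contained in analysis database 1 at that time (d = |A|).
[0055]
The technique for extracting discrimination knowledge (for example, discrimination knowledge with the linear or RBF representation above) from analysis database 1 is not particularly limited. For example, one known technique stores typical or frequent "success" and "failure" samples and determines the class of a target from its distance (a kind of similarity) to them. That technique may be adopted.
[0056]
The support vector machine (hereinafter SVM) is a technique for extracting, from sample data, accurate discrimination knowledge with the linear or RBF representation above. An SVM stores sets of samples whose data item values are similar but whose classes differ, and determines the class of a target from its distance to them. These samples, similar in data item values but different in class, become the support vectors. When a discriminant surface exists that classifies all samples correctly, the support vectors give precise information on the position of the decision boundary, so discrimination accuracy is generally high. When some exceptional samples intrude into the region of the opposite class, imposing a penalty C according to the intrusion distance makes it possible to identify the exceptional samples automatically and to determine a discriminant surface that excludes them. This is called c-SVM. In particular, the case in which different penalties C^+ and C^− are imposed on the intrusion of "success" samples into the "failure" region and of "failure" samples into the "success" region is called unbalanced c-SVM.
[0057]
Besides being accurate, c-SVM tolerates misclassification and can indicate regions where reliable decisions are difficult, which makes it well suited to extracting discrimination knowledge. This embodiment therefore uses c-SVM to generate discrimination knowledge, and uses unbalanced c-SVM when the numbers of "success" and "failure" samples differ greatly. The concept and formulation of c-SVM and unbalanced c-SVM are described in detail below.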
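[0058]
In c-SVM, using analysis database 1 as training data, the elements (w_1, w_2, …, w_nF) of the weight column vector W applied to the elements (φ_1(X), φ_2(X), …, φ_nF(X)) of the feature column vector φ(X) are determined so as to maximize the distance between the decision boundary given by the following equation and the nearest positive ("success") and negative ("failure") examples.
[0059]
[Equation 8]
f(X) = W^T・φ(X) + b
where
W: the column vector of weights (with nF elements) given by the following equation.
[Equation 9]
W = (w_1, w_2, …, w_nF)
W^T: the transpose of the weight vector W.
φ(X): the feature column vector (with nF elements) of X in the feature space F, given by the following equation.
[Equation 10]
φ(X) = (φ_1(X), φ_2(X), …, φ_nF(X))
nF: the dimension (number of elements) of the feature vector space.
[0060]
That is, a linear discriminant surface f(X) = W^T φ(X) + b is formed in the space of the higher-order feature vectors φ(X) of the samples X, and the class of a sample is determined by the sign of f(X). W and b are determined so as to maximize the distance |f(X)| / ‖W‖ between the linear discriminant surface and the samples closest to it (the support vectors). Here ‖W‖ is the Euclidean norm given by the following equation.
[Equation 11]
‖W‖ = (Σ_{k=1}^{nF} w_k^2)^{1/2}
[0061]
For the group of exceptional samples {X_i}, a penalty Σ C_i ξ_i proportional to the intrusion distance ξ_i given by the following equation is considered. Here max(「 」, 0) returns the larger of its two arguments: 0 if the expression in 「 」 is negative, and the value of that expression otherwise.
[0062]
[Equation 12]
ξ_i = max(1 − y_i f(X_i), 0)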
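As a hedged illustration of Equation 12 and the penalty Σ C_i ξ_i, the following Python sketch computes the intrusion distance of each sample and the total penalty, using separate penalties for the two classes as in unbalanced c-SVM. The names `intrusion` and `total_penalty`, the parameter names `c_plus` and `c_minus`, and the sample layout are assumptions made for the example.

```python
from typing import Callable, Iterable, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (data-item values X_i, class y_i in {+1, -1})


def intrusion(y_i: int, f_xi: float) -> float:
    """Equation 12: xi_i = max(1 - y_i * f(X_i), 0)."""
    return max(1.0 - y_i * f_xi, 0.0)


def total_penalty(samples: Iterable[Sample],
                  f: Callable[[Sequence[float]], float],
                  c_plus: float, c_minus: float) -> float:
    """Sum of C_i * xi_i, with C_i = c_plus for y_i = +1 and c_minus for y_i = -1."""
    total = 0.0
    for x, y in samples:
        c_i = c_plus if y == 1 else c_minus
        total += c_i * intrusion(y, f(x))
    return total


if __name__ == "__main__":
    f = lambda x: x[0]  # illustrative one-item discriminant function
    data = [([2.0], 1), ([0.5], 1), ([0.5], -1)]
    print(total_penalty(data, f, c_plus=2.0, c_minus=1.0))
```

A sample outside the margin on its own side contributes nothing; only samples that intrude toward the opposite class are penalized.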
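[0063]
The penalty C_i for "success" and "failure" samples is expressed, for example, by the following equation.
[0064]
[Equation 13]
C_i = C^+ (if y_i = 1), C_i = C^− (if y_i = −1)
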
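[0065]
W^T φ(X) and b, which make up f(X), can then be obtained as in Equations 15 and 16 from the solution α_i of the quadratic programming problem of Equation 14.
[0066]
[Equation 14]

where 0 ≤ α_i ≤ C_i
min_k (g(k)): the problem of minimizing the function g(k) with respect to the subscript k.
[0067]
[Equation 15]

[0068]
[Equation 16]

where 0 < α_{SV+} y_{SV+} < C^+, 0 > α_{SV−} y_{SV−} > −C^−
SV^+: a representative "success" sample.
SV^−: a representative "failure" sample.
[0069]
Here K(X, Z) is called the kernel function and is expressed by the following equation.
[0070]
[Equation 17]
K(X, Z) = φ(X)^T φ(Z)
φ(X)^T: the transpose of the feature vector φ(X).
φ(Z): the feature vector of Z in the feature space F.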
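[0071]
Three concrete examples of kernel functions are the linear kernel function, the polynomial kernel function, and the RBF kernel function. The linear kernel function corresponds to a linear discriminant function and is expressed by the following equation.
[Equation 18]

[0072]
The n-th order polynomial kernel function corresponds to a discriminant function that is an n-th order polynomial and is expressed by the following equation.
[Equation 19]

where
n: the order of the polynomial.
[0073]
The RBF kernel function corresponds to a discriminant function based on an RBF and is expressed by the following equation.
[Equation 20]

where
Σ^2: the covariance matrix.
[0074]
In particular, the RBF kernel function for the case in which the data items are uncorrelated is often used; in this case the RBF kernel function is expressed by the following equation.
[Equation 21]

where d is the size (number of items) of data item set A.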
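The three kernel families named above can be sketched in Python as follows. Because Equations 18 to 21 are not reproduced in this text, the sketch uses forms that are common in the SVM literature; the offset of 1 in the polynomial kernel, the single variance parameter `sigma2`, and the 2σ² scaling of the RBF kernel are assumptions and may differ from the normalization used in the specification (which involves the number of items d).

```python
import math
from typing import Sequence


def dot(x: Sequence[float], z: Sequence[float]) -> float:
    """Inner product X^T Z of two data-item vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """Linear kernel: corresponds to a linear discriminant function."""
    return dot(x, z)


def polynomial_kernel(x: Sequence[float], z: Sequence[float], n: int) -> float:
    """n-th order polynomial kernel (assumed form: (X^T Z + 1)^n)."""
    return (dot(x, z) + 1.0) ** n


def rbf_kernel(x: Sequence[float], z: Sequence[float], sigma2: float) -> float:
    """RBF kernel for uncorrelated items (assumed form: exp(-||X - Z||^2 / (2 sigma^2)))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma2))


if __name__ == "__main__":
    x, z = [1.0, 2.0], [3.0, 4.0]
    print(linear_kernel(x, z), polynomial_kernel(x, z, 2), rbf_kernel(x, z, 1.0))
```

Whatever the exact scaling, the RBF kernel equals 1 for identical samples and decreases as the samples move apart, which is what lets it act as a similarity measure.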
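[0075]
An X_i with 0 < α_i < C_i is a support vector of the first kind and lies on the plane at distance ±1/‖W‖ from the linear discriminant surface. An X_i with α_i = C_i is a support vector of the second kind: an exceptional sample that intrudes by ξ_i/‖W‖ from the ±1/‖W‖ plane in the direction opposite to its class y_i. The value of α_i indicates the importance of each support vector in the discrimination. Exceptional samples are also of two kinds. When 0 < ξ_i < 1, the sample is not "misjudged": it is classified correctly by f(X_i), but the reliability of the decision is low. When ξ_i > 1, the discrimination result sign(f(X_i)) differs from the actual y_i and the sample is truly "misjudged". Here sign( ) is the function that returns +1 or −1 from the sign of its argument in accordance with Equation 2.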
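The classification of samples described in this paragraph can be sketched as a small Python function. The function name `categorize`, the returned labels, and the numerical tolerance `tol` are assumptions for illustration. The text does not cover the case ξ_i = 1 exactly (the sample lies on the decision boundary); the sketch groups it with the low-confidence samples, which is also an assumption.

```python
def categorize(alpha_i: float, c_i: float, xi_i: float, tol: float = 1e-9) -> str:
    """Label a sample from its weight alpha_i, its penalty bound C_i and its intrusion xi_i."""
    if alpha_i <= tol:
        # alpha_i = 0: the sample does not take part in the discriminant function
        return "not a support vector"
    if alpha_i < c_i - tol:
        # 0 < alpha_i < C_i: lies on the plane at distance 1/||W|| from the boundary
        return "first-kind support vector (on the margin plane)"
    # alpha_i = C_i: exceptional sample intruding beyond the margin plane
    if xi_i > 1.0:
        return "second-kind support vector (misjudged)"
    if xi_i > 0.0:
        return "second-kind support vector (correct, low reliability)"
    return "second-kind support vector (no intrusion)"


if __name__ == "__main__":
    for alpha, xi in [(0.0, 0.0), (0.3, 0.0), (1.0, 0.4), (1.0, 1.7)]:
        print(alpha, xi, categorize(alpha, c_i=1.0, xi_i=xi))
```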
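[0076]
An example of the data mining process of this embodiment is described with reference to the flowchart of FIG. 1. First, analysis database 1 is prepared (step 1). Next, discrimination knowledge is generated from analysis database 1 (step 2). Next, it is verified whether the accuracy of the generated discrimination knowledge satisfies a predetermined criterion (step 3). If it does (step 4; Yes), data mining ends.
[0077]
If the accuracy of the discrimination knowledge does not satisfy the predetermined criterion (step 4; No), it is determined whether data on additional cases need to be added to analysis database 1 (step 5). If so (step 5; Yes), data on additional cases (data corresponding to rows of the virtual database of FIG. 11) are added to analysis database 1 (step 6).
[0078]
New discrimination knowledge is then generated from the expanded analysis database 1 (step 2). The addition of data on additional cases to analysis database 1 (step 6) is repeated until the accuracy of the discrimination knowledge satisfies the predetermined criterion (step 4; Yes) or until it is no longer necessary to add data on additional cases (for example, until adding such data can no longer be expected to improve the accuracy; step 5; No).
[0079]
If data on additional cases need not be added to analysis database 1 (step 5; No), for example because the improvement in accuracy obtained by adding such data does not satisfy the predetermined criterion, it is determined whether data on additional items need to be added to analysis database 1 (step 7). If so (step 7; Yes), data on additional items (data corresponding to columns of the virtual database of FIG. 11) are added to analysis database 1 (step 8).
[0080]
New discrimination knowledge is then generated from the expanded analysis database 1 (step 2). The addition of data on additional items to analysis database 1 (step 8) is repeated until the accuracy of the discrimination knowledge satisfies the predetermined criterion (step 4; Yes) or until it is no longer necessary to add data on additional items (for example, until adding such data can no longer be expected to improve the accuracy; step 7; No).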
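The control flow of steps 2 to 8 can be sketched in Python as below. This is only an outline under stated assumptions: the callables `train`, `evaluate`, `add_cases`, and `add_items`, the parameters `target_accuracy`, `min_gain`, and `max_rounds`, and the bookkeeping of the last improvement per kind of addition are all hypothetical and stand in for the processing described in the text.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def run_data_mining(db: Any,
                    train: Callable[[Any], Any],
                    evaluate: Callable[[Any, Any], float],
                    add_cases: Callable[[Any, Any], Any],
                    add_items: Callable[[Any, Any], Any],
                    target_accuracy: float,
                    min_gain: float,
                    max_rounds: int = 50) -> Tuple[Any, Optional[float], Any]:
    """Repeat: generate knowledge, verify it, then add cases first and items second."""
    gain: Dict[str, Optional[float]] = {"cases": None, "items": None}
    last_kind: Optional[str] = None
    prev_acc: Optional[float] = None
    model, acc = None, None
    for _ in range(max_rounds):
        model = train(db)                       # step 2: generate discrimination knowledge
        acc = evaluate(model, db)               # step 3: verify its accuracy
        if last_kind is not None:
            gain[last_kind] = acc - prev_acc    # improvement from the last addition
        if acc >= target_accuracy:              # step 4: Yes -> finish
            break
        if gain["cases"] is None or gain["cases"] >= min_gain:    # step 5
            db, last_kind = add_cases(db, model), "cases"         # step 6: add rows
        elif gain["items"] is None or gain["items"] >= min_gain:  # step 7
            db, last_kind = add_items(db, model), "items"         # step 8: add columns
        else:
            break                               # no further improvement expected
        prev_acc = acc
    return model, acc, db


if __name__ == "__main__":
    # Toy stand-ins: accuracy (in percent) saturates with rows and keeps growing with columns.
    toy_acc = lambda model, db: 50 + 10 * min(db["rows"], 2) + 15 * db["cols"]
    result = run_data_mining({"rows": 0, "cols": 0}, lambda db: dict(db), toy_acc,
                             lambda db, m: {**db, "rows": db["rows"] + 1},
                             lambda db, m: {**db, "cols": db["cols"] + 1},
                             target_accuracy=90, min_gain=5)
    print(result)
```

In the toy run, cases are added until they stop helping, after which items are added until the target accuracy is reached, mirroring the order described for this embodiment.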
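[0081]
Each of the above steps is described in more detail below.
[0082]
If the data mining process had no constraints such as storage capacity or execution time, the data of all samples could be taken from internal database 2. In practice various constraints apply, and under them it is preferable to use as few samples as possible. In this embodiment, therefore, analysis database 1 is constructed by extracting data on some of the samples in internal database 2 (step 1). The extraction of the samples to be analyzed from internal database 2 may be performed by the analyst's operation on the analyst's judgment, or automatically by the computer of the data mining system according to predetermined rules.
[0083]
In step 2, the computing function of the data mining system (computer) generates the discriminant function f(X_A), as the discrimination knowledge, from the data stored in analysis database 1.
[0084]
The discrimination knowledge generated in step 2 is based on analysis database 1 alone, so it must be verified whether it also holds for other data. This is done in step 3. FIG. 2 is a flowchart showing an example of the verification process in detail. In this embodiment, a hard-to-discriminate region ρ is first identified from the generated discrimination knowledge (step 3-1). The hard-to-discriminate region ρ is a region in which discrimination by f(X_A) is unreliable, for example the neighborhood of the decision boundary where the result changes between success and failure. FIG. 15 shows an example in which a straight decision boundary f(X_A) = 0 has been generated from analysis database 1. The hatched area is the hard-to-discriminate region ρ. The symbols ○ and ☆ in FIG. 15 represent known samples stored in analysis database 1. The width of the hard-to-discriminate region ρ is not particularly limited; in this embodiment it is set to about 1.2 times the margin width (1/‖W‖).
[0085]
Samples that fall within the hard-to-discriminate region ρ and are not present in analysis database 1 are searched for as verification cases, for example in internal database 2 as the information source (step 3-2). The discriminant function f(X_A) generated in step 2 is applied to the retrieved samples (verification cases) (step 3-3). The symbols ● and ★ in FIG. 15 represent verification cases retrieved from internal database 2. The actual class of each verification case is compared with the value of f(X_A) applied to it, and the number of misclassifications is counted (step 3-4). If the number of misclassifications is at or below a predetermined number (step 3-5; Yes), the accuracy of f(X_A) is judged sufficient (step 3-6). If it exceeds the predetermined number (step 3-5; No), the accuracy of f(X_A) is judged insufficient (step 3-7).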
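Steps 3-1 to 3-7 can be sketched as follows. The sketch assumes that the hard-to-discriminate region is the band within 1.2 times the margin width of the decision boundary; since the distance of a sample from the boundary is |f(X)|/‖W‖, this band is equivalent to |f(X)| ≤ 1.2. The function name `verify`, the dictionary layout of the internal database, and the parameter names are hypothetical.

```python
from typing import Callable, Dict, Sequence, Set, Tuple

Record = Tuple[Sequence[float], int]  # (data-item values, actual class in {+1, -1})


def verify(f: Callable[[Sequence[float]], float],
           analysis_ids: Set[str],
           internal_db: Dict[str, Record],
           max_errors: int,
           width: float = 1.2) -> Tuple[bool, int, int]:
    """Return (accuracy sufficient?, number of misclassifications, number of verification cases)."""
    # Steps 3-1 and 3-2: samples inside the region and not yet in the analysis database
    cases = [(x, y) for sid, (x, y) in internal_db.items()
             if sid not in analysis_ids and abs(f(x)) <= width]
    # Steps 3-3 and 3-4: apply f and count disagreements with the actual class
    errors = sum(1 for x, y in cases if (1 if f(x) > 0 else -1) != y)
    # Steps 3-5 to 3-7: compare with the permitted number of misclassifications
    return errors <= max_errors, errors, len(cases)


if __name__ == "__main__":
    f = lambda x: x[0]  # illustrative discriminant function
    internal = {"a": ([0.5], 1), "b": ([-0.5], 1), "c": ([3.0], 1),
                "d": ([0.2], -1), "e": ([-1.0], -1)}
    print(verify(f, {"a"}, internal, max_errors=1))
```

The verification cases found this way are also natural candidates for the additional cases of step 6, since they lie where the current knowledge is least reliable.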
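[0086]
For the decision on whether data on additional cases need to be added to analysis database 1 (step 5), this embodiment adopts, for example, the following criterion: whether data on additional cases have never yet been added to analysis database 1, or whether the improvement in accuracy obtained by adding data on additional cases satisfies a predetermined criterion. The improvement in accuracy is expressed, for example, as the difference between the accuracy of the discrimination knowledge after and before the data on additional cases were added. If data on additional cases have never been added, or if the improvement satisfies the criterion, that is, if adding data on additional cases can be expected to improve the accuracy (step 5; Yes), data on additional cases are added to analysis database 1 (step 6). The decision in step 5 may be made automatically by the data mining system (computer) on the basis of this criterion, or the system may display the effectiveness of the additional cases on a display or the like, with the final decision made by the analyst through an input operation.
[0087]
When data on additional cases are added to analysis database 1 (step 4; No, step 5; Yes), the data on the above verification cases in internal database 2 are added to analysis database 1 (step 6). The data on a verification case include class information indicating which class it belongs to. The class information may already be stored in internal database 2; if it is not, the class information of the verification case is added to analysis database 1 on the basis of, for example, the analyst's judgment or data from external database 3. The data on additional cases may be added to analysis database 1 by an input operation of the analyst or the like, or the computer of the data mining system may acquire the relevant data from internal database 2 and add them automatically.
[0088]
Using as verification cases samples that fall within the hard-to-discriminate region ρ and are not present in analysis database 1, as in this embodiment, allows effective verification with a small amount of data. When there are many misclassifications, the data on the verification cases are added to analysis database 1 and new discrimination knowledge is generated; repeating these steps makes it possible to generate accurate discrimination knowledge without examining the whole of internal database 2.
[0089]
If the accuracy of f(X_A) remains insufficient even after repeated addition of samples from internal database 2 (step 5; No), an attempt is made to improve the discrimination knowledge by adding new data items (step 7; Yes, step 8).
[0090]
For the decision on whether data on additional items need to be added to analysis database 1 (step 7), this embodiment adopts, for example, the following criterion: whether data on additional items have never yet been added to analysis database 1, or whether the improvement in accuracy obtained by adding data on additional items satisfies a predetermined criterion. If either holds, that is, if adding data on additional items can be expected to improve the accuracy (step 7; Yes), data on additional items are added to analysis database 1 (step 8). The decision in step 7 may be made automatically by the data mining system (computer) on the basis of this criterion, or the system may display the effectiveness of the additional items on a display or the like, with the final decision made by the analyst through an input operation.
[0091]
FIG. 3 is a flowchart showing the process of adding new data items in step 8 in detail, and FIGS. 12 to 14 illustrate new data items being added to analysis database 1. In this process, similar cases whose data item values are similar but whose classes differ are first identified in analysis database 1 (step 8-1). Reference signs 1a, 1b, and 1c in FIG. 12 denote the identified similar cases. Next, an additional item effective for differentiating the similar cases is identified (step 8-2). Reference sign 5 in FIG. 13 denotes the identified additional item. Next, data on the identified additional item for all cases in analysis database 1 are added to analysis database 1 from, for example, external database 3 as the information source (step 8-3). Reference sign 6 in FIG. 14 denotes the data on the additional item added to analysis database 1. The data on the additional item may be added by an input operation of the analyst or the like, or the computer of the data mining system may acquire the relevant data from a previously designated external database 3 and add them automatically.
[0092]
In identifying an additional item effective for differentiating the similar cases, the additional item may be identified from among recommended items entered, for example by the analyst, for differentiating the similar cases, or it may be identified from, for example, external database 3 as the information source according to predetermined rules.
[0093]
When the additional item is identified from among entered recommended items, the data mining system and the analyst interact and cooperate to identify an additional item effective for refining the discrimination knowledge. Estimating, under a given cost constraint, which of the many possible data in external database 3 can be obtained and what their general nature is, is something at which the analyst, as an expert, excels and at which computer systems are generally poor. Even an expert, however, finds it hard to answer properly when asked vaguely what data would help improve the discrimination knowledge. This embodiment therefore uses the triad elicitation method to draw out the analyst's knowledge effectively. In the triad method, a contrastive question, "What important item is common to A and B but not shared by C?", is asked about three samples A, B, and C, to elicit items that are important for differentiating samples of the group to which A and B belong from samples of the group to which C belongs. In discriminating sales success and failure, for two specific locations A and B where sales succeeded and a location C where they failed, the question elicits items common to A and B but not shared by C. For example, knowing that A and B had a transaction history while C had none, the respondent answers with an item such as "transaction history". Compared with a vague, general question such as "What items are important for knowledge that discriminates sales success and failure?", the comparison of concrete examples sharpens the respondent's awareness of the problem and makes an accurate answer easier.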
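[0094]
In this specification, the method of presenting to the analyst similar cases whose item data are similar but whose classes differ, and thereby obtaining an additional item effective for differentiating them, is called the extended triad method. In the extended triad method of this embodiment, effective data items are added to analysis database 1 in two stages: (1) a small number of samples are extracted from analysis database 1, and the effective data items to be added are considered with attention to those samples only; (2) only for the promising data items obtained in that consideration, the values for all samples in analysis database 1 are acquired. There are three reasons why an approach based on a small number of samples is considered effective. First, limiting consideration to a small number of samples greatly reduces access to external database 3, whose access cost is high. Second, c-SVM and similar methods reach an optimal solution quickly by repeatedly improving the solution with attention to only two samples that satisfy specific conditions, and a similar approach can be expected to be effective for adding data items. Third, when eliciting knowledge from experts, for example in building expert systems, questions based on comparison of concrete samples (cases) have been confirmed to be effective for eliciting accurate knowledge.
[0095]
FIG. 4 is a sequence diagram showing an example of the interaction between the data mining system and the analyst in the extended triad method. The data mining system presents the similar cases identified in analysis database 1 to the analyst through an output device such as a display, and accepts input of data on recommended items effective for differentiating them (step 8-2-1). The analyst enters into the data mining system data on recommended items considered effective for differentiating the presented similar cases (step 8-2-2). Reference signs 4a, 4b, and 4c in FIGS. 12 to 14 denote the entered data on recommended items. The data mining system selects, from the entered recommended items, the items effective for differentiating the similar cases (step 8-2-3).
[0096]
Presenting similar cases to the analyst not only elicits candidate additional data items but also has the following desirable effects. First, the elicitation process may uncover data items that do not yet exist but appear useful for improving the knowledge. That is, besides eliciting existing data items, it can lead to the discovery of items that should be newly surveyed to improve the knowledge. Second, the repeated presentation of hard-to-discriminate cases as similar cases gives the analyst an opportunity to reconsider whether a case is an exception or whether its class information is wrong. If the analyst notices that a case is an exception and corrects its success/failure label, the resulting discrimination knowledge may be refined.
[0097]
When the additional item is instead identified from, for example, external database 3 according to predetermined rules, the data mining system automatically identifies an additional item effective for refining the discrimination knowledge without relying on the analyst's knowledge. In this case the analyst does not need the knowledge to recommend data items to the system, which avoids the accuracy of the discrimination knowledge depending heavily on the analyst's judgment or ability. In recent years, meta-information sources that indicate where external information sources published on the Internet and elsewhere are located (for example, lists of statistical survey results of government ministries, private statistical compilations, or information from marketing companies) have been developed, and these may be used. With such meta-information sources, information sources can be narrowed down without relying entirely on the analyst's knowledge. In this case, the locations of external databases 3 that are accessible and from which data can be obtained are stored in the data mining system in advance or notified to it as necessary.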
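[0098]
FIG. 16 shows the idea behind selecting samples that are useful for choosing additional items to improve the accuracy of the discrimination knowledge. In FIG. 16(A), each sample is plotted with the value of the known data item a_1 on the horizontal axis and the value of the unknown data item a_2 on the vertical axis. In FIG. 16(B), each sample is plotted with the value of the known data item a_1 on the horizontal axis only. Points with the same horizontal (a_1) value in FIGS. 16(A) and 16(B) correspond to the same sample.
[0099]
When only data item a_1 is known, that is, in the case of FIG. 16(B), the decision boundary with the fewest misclassifications (f(x_a1) = 0) is l. In this case the correct answer cannot be given for some samples (★ and ● in FIG. 16). A new data item must therefore be added to obtain more accurate discrimination knowledge. In this embodiment, the data item to be added is selected not by using all samples (☆, ○, ★, ●, and so on) but by focusing on a specific small number of samples and adopting a data item by which they are discriminated well. These samples are called item evaluation samples (D). Among them, the "success" samples are denoted D^+ and the "failure" samples D^−. The item evaluation samples (D) include, at a minimum, the samples that should be classified correctly once a data item is added, that is, the misclassified samples (★ and ● in FIG. 16). A decision boundary l1 that considers only the misclassified samples does not necessarily discriminate the other samples properly. In this embodiment, therefore, in addition to the misclassified samples, the "success" samples near, along the a_1 axis, the misclassified samples that are actually "success" (the hatched ○ near ● in FIG. 16) and the "failure" samples near the misclassified samples that are actually "failure" (the hatched ☆ near ★ in FIG. 16) are included, so that an appropriate decision boundary l2 is obtained. The misclassified samples (★ and ●) and the added samples (hatched ☆ and ○) are the samples called support vectors in c-SVM. Using several support vectors as item evaluation samples (D) in this way is effective for obtaining an appropriate decision boundary.
[0100]
A more concrete method of selecting the item evaluation samples (D) is described next. The purpose of selecting item evaluation samples (D) is to find the additional data item that minimizes the estimate of the decision error on unknown data (also called the error rate, and expressed as generalization error, test error, and so on) of the discriminant function obtained by the SVM.
[0101]
Several methods are known for estimating the error rate (test error) of an SVM on unknown data. This embodiment mainly describes the case of unbalanced c-SVM, but the same treatment applies to other SVMs, for example an SVM that does not allow intrusion ξ or an SVM that gives a penalty proportional to ξ^2.
[0102]
The leave-one-out (LOO) method, expressed by the following equation, is a computationally expensive but accurate method of estimating test error.
[0103]
[Equation 22]

where
[x]_+: the step function (0 if x < 0, 1 if x ≥ 0).
y_e: the class value (±1) of the e-th sample.
p: the number of samples in analysis database 1.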
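A leave-one-out estimate can be sketched in Python as below. Equation 22 is not reproduced in this text, so the sketch assumes its usual form: the fraction of samples e for which the step function [−y_e f_e(X_e)]_+ equals 1, where f_e is trained without sample e. The toy trainer `train_midpoint` (a one-dimensional midpoint rule that needs both classes present) is a stand-in for c-SVM training and is purely illustrative.

```python
from typing import Callable, List, Tuple

Sample1D = Tuple[float, int]  # (value of a single data item, class in {+1, -1})


def train_midpoint(samples: List[Sample1D]) -> Callable[[float], float]:
    """Toy trainer: boundary halfway between the two class means (both classes required)."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == -1]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: x - mid


def loo_error(samples: List[Sample1D],
              train: Callable[[List[Sample1D]], Callable[[float], float]]) -> float:
    """Fraction of samples misjudged by the function trained without them."""
    errors = 0
    for e, (x_e, y_e) in enumerate(samples):
        f_e = train(samples[:e] + samples[e + 1:])  # retrain without the e-th sample
        if -y_e * f_e(x_e) >= 0:                    # step function [x]_+ is 1 when x >= 0
            errors += 1
    return errors / len(samples)


if __name__ == "__main__":
    data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1), (-1.5, 1)]
    print(loo_error(data, train_midpoint))
```

The loop retrains once per sample, which is why the text describes the method as accurate but computationally expensive.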
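[0104]
Here f_e( ) is the discriminant function obtained with c-SVM (that is, by solving Equation 14) from the training samples remaining after the e-th sample is removed from all p training samples, and f_e(X_e) is the value of that function for the e-th sample X_e. Assuming that removing one support vector does not change the other support vectors or the intrusion distances ξ_i determined by Equation 12, the following equation is derived.
[0105]
[Equation 23]

where
f(X_e): the value, for sample X_e, of the discriminant function obtained from all p training samples.
α_e: the weight of sample X_e obtained by Equation 14 from all training samples.
ξ_e: the intrusion distance of sample X_e obtained by Equation 12 from all training samples.
[0106]
Here S_e is the span, that is, the distance from φ(X_e) to the discriminant surface f_e(X) = 0. The span bound is the value of Equation 22 computed using Equation 23. The radius-margin bound r_M is one upper bound of the span bound. Because S_e is bounded above by the diameter 2r of the smallest hypersphere in the feature space F that contains all training data, the radius-margin bound r_M is obtained by the following equation.
[0107]
[Equation 24]

where
[Equation 25]

[0108]
The radius r is obtained by Equation 28 from the solution β_i of the quadratic programming problem of Equation 26. Here max_k (g(k)) denotes the problem of maximizing the function g(k) with respect to the subscript k.
[0109]
[Equation 26]

where
[Equation 27]

[Equation 28]

[0110]
Hereinafter, the set of samples with positive β_i is denoted RV. RV is the set of samples on the smallest hypersphere of radius r in the feature space F that contains all training data.
[0111]
Furthermore, when the change in the radius r caused by adding a data item is negligible, ‖W‖^2, the only quantity in Equation 24 that changes when a data item is added, can be used as a measure of the SVM error rate.
[0112]
This embodiment therefore describes, as one concrete method of selecting the item evaluation samples (D), the case in which ‖W‖^2 is used as the measure of the SVM error rate. The description assumes that discrimination knowledge is extracted with unbalanced c-SVM using the RBF kernel function RBF_I of Equation 21 as K(X, Z). In this case the data item set A in Equation 21 is the set of data items contained in analysis database 1 at that time. The same treatment applies to SVMs other than unbalanced c-SVM and to SVMs using other kernels.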
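[0113]
The characteristics of a data item that is not in analysis database 1 and is to be newly added are approximated by a data item a(D^+, D^−) taking the following values δ_i.
[0114]
[Equation 29]

where
δ_i: the value, for sample i, of the data item not present in analysis database 1.
μ: the mean of the values of the data item not present in analysis database 1.
δ: the deviation, from the mean, of the values of the data item for the samples belonging to D^+ and D^−.
[0115]
It is assumed that δ ≈ 0. Thus a(D^+, D^−) is a data item that changes characteristically in opposite directions for the sample groups D^+ and D^−.
[0116]
It is assumed that the data item a(D^+, D^−) to be added is the one most likely to reduce as far as possible ‖W^(a)‖^2, expressed by the following equation, which is the measure of the error rate when unbalanced c-SVM is applied to analysis database 1 after the data item is added.
[Equation 30]

where
W^(a): the weight vector of the discriminant function after data item a is added.
X_i^(a): the sequence of data item values of sample i after data item a is added.
α_i^(a): the weight of sample i, a support vector, after data item a is added.
[0117]
Consider the function Obj of the additional data item a(D^+, D^−) expressed by the following equation.
[Equation 31]

[0118]
The above expression is the objective function of unbalanced c-SVM (Equation 14) in which the support vectors and their weights α_i from before the data item is added are held fixed and the arguments of the kernel function are replaced by those after the data item is added. Because the following relation then holds, when δ is small, reducing ‖W^(a)‖^2 as far as possible amounts to selecting the data item a(D^+, D^−) that increases Obj as far as possible.
[Equation 32]

[0119]
The following then holds.
[0120]
[Theorem]
Adding a data item a(D^+, D^−) with a small change δ increases the objective function value Obj, and the increase is the sum of a part proportional to g(D^+, D^−), which depends on the data item added, and a positive constant Δ, which does not. Here each sample of the groups D^+ and D^− belongs to the "success" and "failure" class, respectively, and the following equations hold.
[0121]
[Equation 33]

[0122]
[Equation 34]

[0123]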
where
[Equation 35]
n = d / (d + 1)
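[0124]
[Proof]
The value of the kernel K(X_i, X_j) after the data item a(D^+, D^−) is added is expressed by the following equation.
[0125]
[Equation 36]

[0126]
Its change dK can therefore be approximated by the following expression.
[0127]
[Equation 37]

[0128]
Because K(X_i, X_j) ≤ 1 and n < 1, and because it contains no variable related to the additional data item, the second term is a positive constant that does not depend on the additional data item. The increase dObj in the value of the objective function Obj caused by adding the data item a(D^+, D^−) can then be approximated by the following expression.
[0129]
[Equation 38]

[0130]
Equality holds when the following equation is satisfied.
[0131]
[Equation 39]

[0132]
The first term of g(D^+, D^−) shows that pairs of support vectors (samples) with high importance (α_i ≫ 0) and high similarity (K(X_i, X_j) ≈ 1) are preferable for increasing Obj; the second term is expressed by the following equation and shows that exceptional samples (ξ_i > 0) are preferable.
[0133]
[Equation 40]

[0134]
In other words, the samples included in D^± should preferably be support vectors with high similarity that are also exceptional samples. The more support vectors are used, the larger the improvement.
[0135]
On the basis of this theory, in the case of discrimination by a linear function as shown in FIG. 16, for example, with α_i denoting the importance of each support vector obtained by c-SVM, a small number of samples with a large evaluation value g_LINEAR, given by the following equation, are preferable as item evaluation samples (D).
[0136]
[Equation 41]

[0137]
For discrimination using an RBF, a small number of samples with a large evaluation value g_RBF, given below, which additionally takes into account the similarity N(X_i, X_j) between samples, are preferable as item evaluation samples (D).
[0138]
[Equation 42]
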
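[0139]
FIG. 5 is a flowchart showing in more detail an example of the process of identifying the item evaluation samples (D), as the similar cases, in analysis database 1 (step 8-1). First, the samples that are support vectors are classified as in the following equation (step 8-1-1).
[0140]
[Equation 43]

[0141]
Next, for each pair of a "success" support vector (sv_i^+ ∈ SV^+) and a "failure" support vector (sv_j^− ∈ SV^−), the evaluation value g(sv_i^+, sv_j^−) is calculated according to Equation 41 or Equation 42 (step 8-1-2). Next, the pairs of "success" and "failure" support vectors are sorted in descending order of the evaluation value g(sv_i^+, sv_j^−) (step 8-1-3). Next, starting from the top of the sorted values g(sv^+, sv^−), a fixed number of distinct "success" support vectors and distinct "failure" support vectors are selected (in this embodiment, for example, two distinct samples of each class, sv_1^+, sv_2^+, sv_1^−, sv_2^−) and taken as item evaluation samples (D) (step 8-1-4). Next, a fixed number of samples that form a pair with any of the selected item evaluation samples (D) (in this embodiment, with any of sv_1^+, sv_2^+, sv_1^−, sv_2^−) are chosen in order from the top of the sorted evaluation values g(sv^+, sv^−) and added to the item evaluation samples (D) (step 8-1-5).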
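Steps 8-1-2 to 8-1-5 can be sketched in Python as follows. The evaluation value g is passed in as a function because Equations 41 and 42 are not reproduced in this text; the function name `select_item_evaluation_samples`, the parameters `k` (number of distinct samples per class) and `extra` (number of partner samples added afterwards), and the toy scores are assumptions for illustration.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def select_item_evaluation_samples(
        sv_pos: Sequence[Hashable],
        sv_neg: Sequence[Hashable],
        g: Callable[[Hashable, Hashable], float],
        k: int = 2,
        extra: int = 1) -> Tuple[List[Hashable], List[Hashable]]:
    """Return (D+, D-) chosen from the "success" and "failure" support vectors."""
    # Steps 8-1-2 and 8-1-3: evaluate every (success, failure) pair and sort by g, descending
    ranked = sorted(((g(p, n), p, n) for p in sv_pos for n in sv_neg),
                    key=lambda t: t[0], reverse=True)
    # Step 8-1-4: from the top, take k distinct samples of each class
    core_pos: List[Hashable] = []
    core_neg: List[Hashable] = []
    for _, p, n in ranked:
        if len(core_pos) < k and p not in core_pos:
            core_pos.append(p)
        if len(core_neg) < k and n not in core_neg:
            core_neg.append(n)
        if len(core_pos) == k and len(core_neg) == k:
            break
    # Step 8-1-5: add partners of the selected samples, again from the top of the ranking
    d_pos, d_neg = list(core_pos), list(core_neg)
    added = 0
    for _, p, n in ranked:
        if added >= extra:
            break
        if p in core_pos and n not in d_neg:
            d_neg.append(n)
            added += 1
        elif n in core_neg and p not in d_pos:
            d_pos.append(p)
            added += 1
    return d_pos, d_neg


if __name__ == "__main__":
    scores = {("p1", "n1"): 9, ("p2", "n2"): 8, ("p1", "n3"): 7, ("p3", "n1"): 6,
              ("p1", "n2"): 5, ("p2", "n1"): 4, ("p2", "n3"): 3, ("p3", "n2"): 2,
              ("p3", "n3"): 1}
    print(select_item_evaluation_samples(["p1", "p2", "p3"], ["n1", "n2", "n3"],
                                         lambda p, n: scores[(p, n)]))
```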
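[0142]
When the data mining system and the analyst interact and cooperate to identify the additional item, the item evaluation samples (D) are presented to the analyst pair by pair, prompting input of candidate data items to be added (step 8-2-1 in FIG. 4). The analyst compares and examines the item evaluation samples (D) and enters into the data mining system the data items to be added and the values of those items for each sample (step 8-2-2 in FIG. 4).
[0143]
When the data mining system instead identifies the additional item automatically without relying on the analyst's knowledge, for example when the analyst lacks sufficient knowledge to recommend data items to the system but some candidate data items and their values for the item evaluation samples (D) are known in advance (for example, when they are available in external database 3 as the information source), the additional item is identified from, for example, external database 3 according to predetermined rules. In this case the process shown in the flowchart of FIG. 6 is performed, for example. That is, the unused data items a of the item evaluation samples (D) and the values of those unused data items are retrieved from external database 3 (step 8-2-1'). Then the index value I_a expressed by the following equation is calculated for each unused data item a (step 8-2-2'). Here x_ia and x_ja are the values of data item a for samples i and j, respectively, and min(0, 「 」) returns the smaller of its two arguments: the value of the expression in 「 」 if it is negative, and 0 otherwise.
[0144]
[Equation 44]

[0145]
A fixed number of items are then selected as recommended items in ascending order of the index value I_a (step 8-2-3'). That is, the data items a are selected for which the sum of the products of the unused data item values of the success samples D^+ and the failure samples D^−, taken where those values have opposite signs, is smallest. In this case, data items a are selected that tend strongly to take values larger (or smaller) than the mean for the success samples D^+ and values smaller (or larger) than the mean for the failure samples D^−. Then, from the recommended items, the items effective for differentiating the similar cases are selected as additional items (step 8-2-4'). Step 8-2-4' may be omitted, in which case the items selected in step 8-2-3' are used as the additional items as they are.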
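Steps 8-2-2' and 8-2-3' can be sketched as below. Equation 44 is not reproduced in this text, so the sketch follows the verbal description: I_a is taken to be the sum, over pairs of a success sample i in D^+ and a failure sample j in D^−, of min(0, x_ia・x_ja), with the item values assumed to be expressed as deviations from the item mean. The function names, the data layout, and the item names in the example are hypothetical.

```python
from typing import Dict, List, Sequence


def index_value(pos_values: Sequence[float], neg_values: Sequence[float]) -> float:
    """Assumed I_a: sum over (i in D+, j in D-) of min(0, x_ia * x_ja)."""
    return sum(min(0.0, x_i * x_j) for x_i in pos_values for x_j in neg_values)


def recommend_items(candidates: Dict[str, Dict[str, float]],
                    d_pos: Sequence[str],
                    d_neg: Sequence[str],
                    top_n: int) -> List[str]:
    """Return the top_n unused items with the smallest index value I_a.

    candidates maps an item name to {sample id: mean-centred value of that item}.
    """
    scores = {item: index_value([values[i] for i in d_pos], [values[j] for j in d_neg])
              for item, values in candidates.items()}
    return sorted(scores, key=scores.get)[:top_n]


if __name__ == "__main__":
    candidates = {
        "income": {"p1": 1.0, "p2": 2.0, "n1": -1.0, "n2": -2.0},  # opposite signs per class
        "area": {"p1": 1.0, "p2": -1.0, "n1": 1.0, "n2": -1.0},    # no clear class pattern
    }
    print(recommend_items(candidates, ["p1", "p2"], ["n1", "n2"], top_n=1))
```

An item whose values lie on opposite sides of the mean for the two classes yields many negative products and therefore a small I_a, which is why ascending order of I_a is used.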
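[0146]
In selecting the effective items (additional items) from the recommended items (step 8-2-3 or step 8-2-4'), this embodiment evaluates effectiveness from the actual values of the recommended items and narrows down the additional items accordingly. A representative method for selecting an appropriate data item set A ⊂ A_0 from a set A_0 of candidate data items is forward stepwise selection. Using a function eval(A) that evaluates the desirability of the learning result obtained with a data item set A, this method adds variables one at a time by the procedure shown in FIG. 7.
[0147]
That is, A_use = ∅, A_nouse = A_0, and k = 0 are set (steps 801 to 803). Next, learning is performed with each unused data item a ∈ A_nouse added, and its evaluation eval(A_use ∪ {a}) is calculated (step 804). Next, the data item a_k that gives the maximum of the evaluations eval(A_use ∪ {a}) obtained in step 804 is found (step 805). If adding it brings no improvement, that is, if eval(A_use ∪ {a_k}) ≤ eval(A_use) (step 806; Yes), the addition ends. If adding it brings an improvement, that is, if eval(A_use ∪ {a_k}) > eval(A_use) (step 806; No), then A_use = A_use ∪ {a_k}, A_nouse = A_nouse − {a_k}, and k = k + 1 are set (steps 807 to 809), and the process returns to step 804.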
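The forward stepwise procedure of steps 801 to 809 can be sketched in Python as follows. The evaluation function is passed in as `evaluate`; in the specification it would involve relearning with the chosen items, whereas the additive toy score used in the example is only an assumption for demonstration.

```python
from typing import Callable, FrozenSet, Iterable, List


def forward_select(candidates: Iterable[str],
                   evaluate: Callable[[FrozenSet[str]], float]) -> List[str]:
    """Greedy forward selection: add the best item while the evaluation keeps improving."""
    used: List[str] = []                      # A_use, initially empty (step 801)
    unused: List[str] = list(candidates)      # A_nouse, all candidates (step 802)
    current = evaluate(frozenset(used))
    while unused:
        # Step 804: evaluate each unused item when added to the items in use
        scores = {a: evaluate(frozenset(used + [a])) for a in unused}
        # Step 805: item giving the maximum evaluation
        a_k = max(scores, key=scores.get)
        # Step 806: stop if adding it brings no improvement
        if scores[a_k] <= current:
            break
        # Steps 807 to 809: move a_k from unused to used and continue
        used.append(a_k)
        unused.remove(a_k)
        current = scores[a_k]
    return used


if __name__ == "__main__":
    value = {"a": 3.0, "b": 2.0, "c": 0.0}    # illustrative usefulness of each item
    print(forward_select(["c", "b", "a"], lambda items: sum(value[i] for i in items)))
```

Each pass through the loop requires one evaluation per remaining candidate, which is the cost that motivates the cheaper shortcut described in the next paragraph of the specification.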
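[0148]
With an SVM, however, unlike relearning after samples are added, relearning after the data items in use or the variances Σ^2 of the data items are changed requires a computational cost as large as learning from scratch. In particular, finding the optimal Σ^2 for the data items is very costly if cross-validation or the like is used. This embodiment therefore adopts, as a simple method, a scheme in which the optimal Σ^2 is obtained using all the recommended items, and a fixed number of the unused data items with the largest Σ^2 are taken as the data items of high suitability. As relatively fast methods of finding the optimal Σ^2 for a data item set A with an SVM, solution methods such as the gradient method and the quasi-Newton method, which use the derivative ∂T/∂log(Σ^2) and the second derivative ∂^2T/∂log(Σ^2)^2 of an evaluation function T(A) of the error rate on unknown data (generalization error, test error), have been proposed, and these methods can be adopted. With these methods, the amount of computation is about one tenth of that of cross-validation or the like, even in low-dimensional cases.
[0149]
FIGS. 8 and 9 show examples of the configuration of the data mining system of this embodiment. Data mining system 10 shown in FIG. 8 comprises at least: analysis database 1; discrimination knowledge generating means 101, which executes step 2; discrimination knowledge verifying means 102, which executes step 3 (see FIG. 2) and the decisions of steps 4, 5, and 7; case data adding means 103, which executes step 6; similar case presenting means 104, which executes step 8-1 (see FIG. 5) and steps 8-2-1 and 8-2-2 (see FIG. 4); and item data adding means 105, which executes steps 8-2-3 and 8-3. Data mining system 10' shown in FIG. 9 comprises at least: analysis database 1; discrimination knowledge generating means 101, which executes step 2; discrimination knowledge verifying means 102, which executes step 3 (see FIG. 2) and the decisions of steps 4, 5, and 7; case data adding means 103, which executes step 6; similar case identifying means 104', which executes step 8-1 (see FIG. 5); unused item data searching means 107, which executes step 8-2-1' (see FIG. 6); and item data adding means 105', which executes steps 8-2-2' to 8-2-4' (see FIG. 6) and step 8-3. The data mining program of this embodiment, when installed and set up on a computer, causes that computer to function as data mining system 10 or 10'. FIG. 10 shows an example of a data flow diagram for the data mining method and system of this embodiment.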
【０１５０】
尚、判別知識としての判別関数ｆ（ｘ）、類似事例としての項目評価用サンプル（Ｄ）を選択する方法、推薦項目の中から追加項目を選択する方法等は、上述の例に必ずしも限定されるものではない。以下に、類似事例としての項目評価用サンプル（Ｄ）を選択する他の方法、および推薦項目の中から追加項目を選択する他の方法の好適な一例を、本発明の第２の実施形態として説明する。
【０１５１】
本実施形態では、半径マージン限界ｒ_Ｍを新データに対する誤答率の推定値として用いる。これは、ＬＯＯでは計算コストがかかり、一方、スパンバウンドは多くの局所最小解を持つため最適化で使い難いためである。ただし、データ項目追加後の分析用データベース１に基づく判別関数の新データに対する誤答率（汎化誤差）が２次形式で近似できる場合には、半径マージン限界に限らず、スパンバウンド等を、誤答率の推定値として用いても良い。
【０１５２】
分析用データベース１中にないデータ項目ａのデータ項目値ｘ_ｉａをθ_ａ ^１／２倍（但し、θ_ａ≧０，θ_ａ≒０）したデータを追加した時の新しい判別関数の誤答率の推定値ｒ_Ｍ（θ_ａ）について、次式が成立する。
【０１５３】
【数４５】

但し、
ｒ_Ｍ（０）；データ項目ａを追加していない時の誤答率の推定値を表す。
【０１５４】
追加項目の選択においては、データ項目追加後の新しい判別関数の誤答率の推定値（半径マージン限界ｒ_Ｍ）を最も減少させるデータ項目ａを見つけることが好ましい。そこで本実施形態では、数式４５に基づいて、誤答率の減少幅−ｄｒ_Ｍ／ｄθ_ａが最大となるデータ項目ａを選択すると仮定する。
【０１５５】
主たるカーネルにおいて、減少幅−ｄｒ_Ｍ／ｄθ_ａは、２次形式Ｘ_：ａ ^ＴＧＸ_：ａとなる。Ｘ_：ａは、分析用データベース１中の全サンプルの項目ａの値のベクトルである。Ｘ_：ａ ^ＴはＸ_：ａの転置行列である。例えば、ｕｎｂａｌａｎｃｅｄ　ｃ−ＳＶＭでの線形カーネル関数、多項式カーネル関数では下記の定理Ａが成立し、ＲＢＦカーネル関数では下記の定理Ｂが成立する。
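[0156]
[Theorem A]
For a kernel function of the form k1(X_A^T·Z_A), −dr_M/dθ_a has the quadratic form X_:a^T G X_:a. In this case, G is given by the following equation.
[0157]
[Equation 46]

where
diag^−1("vector"): the diagonal matrix whose diagonal elements are "vector".
"matrix" .* "matrix": the element-wise product of the two matrices.
diag("matrix"): the vector of the diagonal elements of "matrix".
K1: the matrix whose (i, j) element is k1(X_iA^T X_jA + θ_a x_ia x_ja).
α: the vector of the weights α_i of all samples i obtained from Equation 14.
β: the vector of the weights β_i of all samples i obtained from Equation 26.
Y: the vector of the class values y_i of all samples.
r: the radius obtained from Equation 28.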
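[0158]
[Theorem B]
For a kernel of the form k2(−‖X_A−Z_A‖^2), −dr_M/dθ_a has the quadratic form X_:a^T G X_:a. In this case, G is given by the following equation.
[0159]
[Equation 47]

where
[Equation 48]

1_p: the vector of length p whose elements are all 1.
K2: the matrix whose (i, j) element is k2(−‖X_A−Z_A‖^2 − θ_a(x_ia−x_ja)^2).
[0160]
In Theorem B, G·1_p = 0 holds. This corresponds to the degree of freedom that the values x_ia − x_ja do not change even if all the values of data item a are shifted by the same amount.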
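The link between G·1_p = 0 and shift invariance can be checked in two lines. The following is a sketch that additionally assumes G is symmetric (an assumption made here for the illustration; the text does not state it) and uses an arbitrary constant shift c:

```latex
\begin{aligned}
(X_{:a} + c\,1_p)^{T}\, G\, (X_{:a} + c\,1_p)
  &= X_{:a}^{T} G X_{:a} + c\,X_{:a}^{T} G 1_p + c\,1_p^{T} G X_{:a} + c^{2}\, 1_p^{T} G 1_p \\
  &= X_{:a}^{T} G X_{:a}
  \qquad \text{since } G 1_p = 0 \text{ and } G^{T} = G .
\end{aligned}
```

Under that assumption, the estimated reduction is unchanged when every value of item a is shifted by the same constant.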
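[0161]
Theorems A and B also hold for other SVMs, for example an SVM that does not allow margin violations ξ and an SVM that penalizes violations ξ in proportion to ξ^2. Because the reduction −dr_M/dθ_a in the error rate (test error) obtained by introducing an additional item can be approximated by X_:a^T G X_:a, G is called the test error reduction matrix.
[0162]
Next, a discrete approximation of the test error reduction matrix G is described. G is a p×p matrix that has nonzero values only in the rows and columns corresponding to SV and RV; all other rows and columns are zero. Since SV∪RV is generally smaller than the number p of samples in the analysis database 1, G is already a sparse matrix. However, SV∪RV may still be large. It is therefore preferable to obtain a sparser matrix G' for estimating the reduction −dr_M/dθ. The simplest way to make a matrix sparse is to set elements smaller than some threshold to zero, but this operation can substantially change the large eigenvalues and the value of the quadratic form. For this reason, the following approximation is used in this embodiment. First, as shown in the following equation, the variance of the values of each item is assumed to be constant.
[0163]
[Equation 49]
X_:a^T X_:a = Σ_i x_ia^2 = 1
[0164]
Furthermore, when the RBF kernel (or another kernel of the form in Theorem B) is used, the mean of the values of each item is assumed to be 0, corresponding to G·1_p = 0, as shown in the following equation.
[0165]
[Equation 50]
X_:a^T·1_p = 0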
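The two normalization assumptions (Equations 49 and 50) can be sketched as follows. This is a minimal illustration, assuming the data are held in a NumPy array with one row per sample and one column per item; the function name and the handling of constant columns are choices made here, not part of the source.

```python
import numpy as np

def normalize_items(X, center=True):
    """Scale every item (column) to unit sum of squares; optionally center it first.

    center=True corresponds to the extra zero-mean assumption used with
    RBF-type kernels (Equation 50); the scaling corresponds to Equation 49.
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    norms = np.sqrt((X ** 2).sum(axis=0))
    norms[norms == 0.0] = 1.0  # assumption: constant columns are left as all zeros
    return X / norms
```

After this step each column x satisfies x^T x = 1 and, when centered, x^T 1_p = 0.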
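[0166]
Under the constraint X_:a^T X_:a = 1, the X_:a that maximizes the quadratic form X_:a^T G X_:a is the eigenvector of G belonging to its largest eigenvalue, so an item whose per-sample values are proportional to that eigenvector reduces the test error estimate r_M the most. Conversely, the eigenvector belonging to the negative eigenvalue of largest absolute value increases the test error estimate r_M the most. Therefore, as shown in the following equation, G is approximated by its m eigenvalues of largest absolute value and the corresponding eigenvectors, and the other directions are ignored.
[0167]
[Equation 51]

where
[Equation 52]

where
[Equation 53]

where
[Equation 54]

h_i: one of the m eigenvalues of largest absolute value of the test error reduction matrix G.
H: the diagonal matrix whose diagonal elements are the m eigenvalues of largest absolute value.
v_i: the eigenvector corresponding to eigenvalue h_i.
V: the matrix whose columns are the eigenvectors corresponding to eigenvalues h_1…h_m.
I_m×m: the m×m identity matrix.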
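The truncation to the m eigenvalues of largest absolute value can be sketched with NumPy. Equations 51 to 54 are not reproduced in this text, so the sketch assumes the approximation has the form G'' = V H V^T with V^T V = I_m×m, which is what the symbol definitions suggest; the function name and the symmetrization step are additions made here.

```python
import numpy as np

def top_m_eigen_approx(G, m):
    """Approximate a symmetric matrix G by its m eigenpairs of largest |eigenvalue|."""
    G = np.asarray(G, dtype=float)
    G = 0.5 * (G + G.T)               # assumption: only the symmetric part matters
    h, v = np.linalg.eigh(G)          # eigenvalues ascending, orthonormal eigenvectors
    idx = np.argsort(-np.abs(h))[:m]  # indices of the m largest absolute eigenvalues
    H = np.diag(h[idx])
    V = v[:, idx]
    return V @ H @ V.T, H, V
```

Keeping large negative eigenvalues as well as large positive ones preserves the directions that would increase the error estimate, not only those that reduce it.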
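[0168]
Next, the p×p matrix G'' is approximated by a sparse matrix G'. To this end, a decomposition of the following form is sought in which U^T U = I_p×p, most elements of U are close to 0, and only a few elements are close to ±1.
[0169]
[Equation 55]
G'' = U B U^T
where
U: a p×p rotation matrix.
B: a p×p matrix.
U^T: the transpose of U.
[0170]
A vector u that maximizes Σ_i u_i^4 subject to Σ_i u_i^2 = 1 has one element equal to ±1 and all other elements equal to 0. On the basis of this observation, the quartic program shown in the following equation is solved in order to obtain a U with many near-zero elements.
[0171]
[Equation 56]

where
[Equation 57]

[Equation 58]

u_i,k: the element in row i and column k of the rotation matrix U.
Q: a rotation matrix.
[0172]
The quartic programming problem of Equation 56 can be solved, for example, by the processing shown in FIG. 19. First, Q is initialized so that Q^T Q = I (step 201). Next, 0 is assigned to Q_old (step 202). Next, it is determined whether the following expression holds (step 203).
[0173]
[Equation 59]
1 − min(|diag(Q^T Q_old)|) > ε
where
ε: the tolerance (for example, about 10^−10).
[0174]
If Equation 59 does not hold (step 203; No), steps 204 to 207 below are repeated until it holds. That is, Q is assigned to Q_old (step 204). Then V Q^T is assigned to U (step 205). Then U^T.^3 V is assigned to Q_L (step 206). Then Q_L(Q_L^T Q_L)^−1/2 is assigned to Q (step 207). Here, "matrix".^n denotes the matrix obtained by raising each element of "matrix" to the n-th power.
[0175]
If Equation 59 holds (step 203; Yes), Q H Q^T is assigned to B (step 208). The above processing yields the values of U and B.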
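Steps 201 to 208 can be sketched as a fixed-point iteration in NumPy. This is an illustration under stated assumptions, not the patent's implementation: V is taken as the p×m eigenvector matrix and H as the m×m diagonal matrix of the preceding approximation (so U is p×m and B is m×m here); the loop continues while the quantity 1 − min(|diag(Q^T Q_old)|) still exceeds ε, that is, the inequality is read as "not yet converged"; the inverse square root of step 207 is computed through an SVD; and `max_iter` is a safety cap added here.

```python
import numpy as np

def sparsify_rotation(V, H, eps=1e-10, max_iter=1000):
    """Find an orthogonal Q such that U = V Q^T has many near-zero elements."""
    V = np.asarray(V, dtype=float)
    m = V.shape[1]
    Q = np.eye(m)                      # step 201: any Q with Q^T Q = I
    Q_old = np.zeros((m, m))           # step 202
    it = 0
    # step 203 (assumed reading): keep iterating while Q is still changing
    while 1.0 - np.min(np.abs(np.diag(Q.T @ Q_old))) > eps and it < max_iter:
        Q_old = Q                      # step 204
        U = V @ Q.T                    # step 205
        Q_L = (U.T ** 3) @ V           # step 206: element-wise cube, then product
        W, _, Zt = np.linalg.svd(Q_L)  # step 207: Q_L (Q_L^T Q_L)^(-1/2) equals W Z^T
        Q = W @ Zt
        it += 1
    U = V @ Q.T                        # recomputed so that U B U^T = V H V^T exactly
    B = Q @ H @ Q.T                    # step 208
    return U, B, Q
```

Because Q stays orthogonal, U B U^T reproduces V H V^T; the rotation only redistributes the weight of U toward a few large elements per column.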
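[0176]
The reason the processing of FIG. 19 solves the quartic programming problem of Equation 56 is as follows. Let Λ = (λ_i,j) be the matrix of Lagrange multipliers λ_ij corresponding to the constraint Q^T Q = I. Then the following equation holds at the optimum.
[0177]
[Equation 60]

where
trace("matrix"): the sum of the diagonal elements of "matrix".
[0178]
Setting L = Λ + Λ^T and Q_L = Q·L, the following equations hold.
[0179]
[Equation 61]

[Equation 62]

[Equation 63]

[Equation 64]

[0180]
Therefore, the optimum can be found by the fixed-point algorithm described above.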
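[0181]
Then, the matrix U is approximated by U' by setting the elements of U at or below a threshold to 0. Because U has many near-zero elements, U' is an even sparser matrix, and G' given by the following equation is a sparse approximation that roughly preserves the principal eigenvectors of G.
[0182]
[Equation 65]
G' = U' B U'^T
[0183]
The reduction −dr_M/dθ_a in the test error can thus be estimated by X_:a^T G' X_:a instead of X_:a^T G X_:a. If the set of samples corresponding to the nonzero rows of U' is taken as the item evaluation samples (D), G' has nonzero values only in the rows and columns corresponding to the item evaluation samples (D), and all other rows and columns are zero. Accordingly, the reduction −dr_M/dθ_a in the test error obtained by introducing an additional item can be estimated by X_:a^T G' X_:a from the data item values of the small sample set (D) alone.
[0184]
An example of the processing for selecting the item evaluation samples (D) as similar cases on the basis of the above theory is now described. Here, the sum of squares of the item values of the samples in the analysis database 1 is assumed to be normalized to a constant. When the RBF kernel (or another kernel of the form in Theorem B) is used, the mean of the item values of the samples in the analysis database 1 is also assumed to be normalized to 0. First, r and β are obtained from Equations 26 and 28. Next, SV∪RV is assigned to Vs. Here, Vs is the set of samples that are support vectors or that lie on the hypersphere of radius r in the high-dimensional feature space F. Vs is therefore the set of samples whose corresponding rows and columns of the test error reduction matrix G are nonzero. Once Vs has been obtained, only the elements belonging to Vs need to be computed when calculating the test error reduction matrix G, which simplifies the computation. Next, the test error reduction matrix G is obtained from Equation 46 or Equation 47. Next, the m eigenvalues h of largest absolute value of G and the eigenvector matrix V are obtained. Next, the quartic programming problem of Equation 56 is solved by the processing of FIG. 19 to obtain U and B. Next, U is discretized by setting the elements whose absolute value is smaller than a threshold to zero, which gives U'. Finally, the set of samples corresponding to the nonzero rows of U' is taken as the item evaluation samples (D). This yields the set of samples most useful for selecting data items.
[0185]
Further, an example is described of the processing for selecting an additional item, on the basis of the above theory, from recommended items presented by the analyst or prepared in advance in, for example, the external database 3 serving as an information source. In this case, the matrix G' for estimating the reduction −dr_M/dθ_a in the test error due to introducing an additional item is obtained from Equation 65. Using G', the reduction −dr_M/dθ_a in the test error obtained when a recommended item is added is evaluated on the item evaluation samples (D) by the following equation. The item for which this evaluation value is largest (that is, the item that maximizes the reduction in the test error) is then selected from the recommended items as the additional item. In this way, the additional item most effective for improving the accuracy of the discrimination knowledge is selected.
[0186]
[Equation 66]

where
A_R: the set of recommended items.
X_D,a: the vector of the values of item a for the item evaluation samples (D).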
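The thresholding of U and the scoring of recommended items can be sketched as below. Equation 66 is not reproduced in this text, so the sketch assumes, from the surrounding description, that each recommended item a is scored by X_D,a^T G' X_D,a restricted to the samples in D and that the item with the largest score is chosen. Function names, the `candidates` dictionary, and the threshold value are hypothetical.

```python
import numpy as np

def sparse_reduction_matrix(U, B, threshold):
    """Build G' = U' B U'^T and return it with the indices D of the nonzero rows of U'."""
    U = np.asarray(U, dtype=float)
    U_s = np.where(np.abs(U) < threshold, 0.0, U)   # U': small elements set to zero
    D = np.flatnonzero(np.any(U_s != 0.0, axis=1))  # item evaluation samples (D)
    return U_s @ B @ U_s.T, D

def select_additional_item(G_s, D, candidates):
    """candidates maps an item name to its values on the samples D (same order as D)."""
    G_DD = G_s[np.ix_(D, D)]                        # only rows/columns of D are nonzero
    scores = {}
    for name, values in candidates.items():
        x = np.asarray(values, dtype=float)
        scores[name] = float(x @ G_DD @ x)          # estimated reduction of r_M
    best = max(scores, key=scores.get)
    return best, scores
```

Only the values of each candidate item on the few samples in D are needed, which is the point of the sparse approximation: the remaining samples never have to be looked up in the external source during selection.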
【０１８７】
【実施例】
［実施例１］
上記のデータマイニング方法およびデータマイニングシステムの有効性を確認するために、同方法およびシステムを、市販の市区町村単位の民力データと地域通貨活動の市区町村の実績データを元に、民力データから各市区町村での地域通貨導入状況を推定する判別問題に適用した。この結果得られる判別知識を未導入市区町村に適用した結果得られる評価値ｆ（Ｘ）が、導入実績を持つ市区町村と同じならば、導入可能性が高いと判断される。
【０１８８】
地域通貨導入済み市区町村として、インターネットからの検索で、既に地域通貨を導入した又は検討している４８市区町村（表１参照）を特定した。尚、これらの４８市区町村に対する事前分析により、小市町村系８地点と大中都市系４０地点に分かれることが判明した。このため本実施例では、小都市系（表１の括弧で括られた都市）の８都市を除いた４０都市を判別することを目標とした。
【０１８９】
【表１】

【０１９０】
民力データとして、朝日新聞社編民力ＣＤ−ＲＯＭ　１９９９年版より、表２に示す１２２項目について、全国３３８４市区町村のデータを抜き出した。ただし、表２中の＊印のデータ項目に関しては変化を表すデータ項目を元のデータ項目から作成した。また、各項目は平均０、分散１にスケーリングした。
【０１９１】
【表２】

【０１９２】
本実施例における目的は、地域通貨導入済みの市区町村４０地点（表１参照）を特徴付ける判定知識を得ることである。本実施例における内部データベース２は、全国市区町村３３８４ヶ所の年齢別人口（１９９８年度のデータ。区分は０才から６４歳までの５歳刻みと６５歳以上）の１４データ項目のみのデータベースとした。本実施例における外部データベース３は、表２に示した１２２データ項目のデータベースとした。外部データベース３の全サンプル数は全国市区町村３３８４ヶ所であった。初期の分析用データベース１は、地域通貨導入済み４０市区町村およびランダムに選択した未導入４０市区町村の全データ項目（各年齢別人口１４項目）の値を内部データベース２から抜き出して構成した。
【０１９３】
本実施例における判定知識では、全国３３８４市区町村の中から代表市区町村群｛Ｘ_ｉ｝との類似性により判定を行う。本実施例では、数式２１のＲＢＦカーネル関数ＲＢＦ_Ｉ（Ｘ_Ａ，Ｚ_Ａ）を使用した。従って、次式に示す判別関数の値が地域通貨の導入可能性を示す。
【０１９４】
【数６７】

【０１９５】
判別関数ｆ（Ｚ_Ａ）の符号が正ならば導入地点、負なら導入していない地点となるように、係数α_ｉと判定用データ項目集合Ａを決定した。
【０１９６】
判別知識の生成には、ｕｎｂａｌａｎｃｅｄ　ｃ−ＳＶＭを使用した。「成」「否」サンプル数の差が大きいため、成／否サンプルの例外コストをそれぞれＣ^＋＝２００、Ｃ⁻＝１０に設定した。このように「成」サンプルに対する例外コストを大きくすると、多数の「否」サンプルに対する誤りよりも、少数の「成」サンプルに対する誤りを重く見ることとなり、「成」「否」サンプル数の差による少数サンプルの誤差の無視の危険を減らすことができる。
【０１９７】
また、本実施例では、データマイニングシステムが自動で追加項目を特定する処理を行った（図６参照）。また、本実施例では、ステップ８−２−３’において最も可能性が高いとされたデータ項目をそのまま採用した。この場合、分析者の推薦が必ずしも最適でなくても、判別知識の精度の向上が進むことが確認できる。
【０１９８】
生成された判別知識を検証し、必要な数のサンプルを順次追加した（ステップ６）。そして、２０回のデータ項目追加により、導入状況を誤って推定した市区町村が１２地点まで減少したため、そこで分析を終了した。追加されたデータ項目を表３に示す。
【０１９９】
【表３】

【０２００】
表３に示した２０データ項目の追加後、得られた判別知識により地域通貨を導入する可能性があると誤判断された市区町村を表４に示す。
【０２０１】
【表４】

【０２０２】
表４では、判別知識により導入可能性が高いと判断された市区町村を順に並べている。なお、表４の１２市区町村は、判別知識により誤判定された都市（ｆ（Ｘ）＞０）である。酒田市と京都市南区は完全に導入済み都市の領域に入っている（ｆ（Ｘ）≧１）。つまり最終的に得られた判別関数ｆ（Ｘ）では、これらの都市が「否」のサンプルではない（つまり導入可能性が高い、又はすでに導入している）と判断している。
【０２０３】
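FIG. 17 shows how the discrimination knowledge is refined as data items are added. The figure shows whether the municipalities nationwide could be discriminated correctly by the discrimination knowledge obtained at each point. The ○ marks indicate misclassification of samples that are actually positive (成, introduced), and the + marks indicate misclassification of samples that are actually negative (否, not introduced). The solid lines show the number of samples that the discriminant function classified differently from their actual class, and the dotted lines show the number of samples that were classified correctly but with low reliability (exception samples). In other words, the solid lines show the actual misclassifications, and the dotted lines show the counts when low-reliability samples are included. As the figure shows, misclassification of actually positive samples disappears after the 7th data item is added, and low-reliability positive samples disappear after the 9th data item is added. Misclassified and low-reliability negative samples, on the other hand, decrease as data items are added but do not disappear completely within 20 additions. The decrease slows after the 12th addition: misclassifications of actually negative samples stay at around 10 samples from the 12th addition onward, and low-reliability negative samples decrease only gradually, from 86 to 55 samples. This gentle slope indicates that, after the 12th addition, data items are being added in an attempt to discriminate the few remaining negative exception samples, but with little effect. It is therefore also conceivable to use only the first 12 data items, for which the additions were effective, as the discrimination knowledge.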
【０２０４】
以上のように、全サンプルのデータを見ることなしに、項目評価用の少数サンプルのみに注目し、それらのサンプルに関するデータを調査して追加データ項目を決定することで、判別知識の精度が効果的に向上することが確認された。
【０２０５】
次に、データ項目が追加されていく様子を、最初のデータ項目の追加の実行例により示す。第１回目の判別知識生成により、データ項目（年齢別人口）から、サンプルの「成」「否」を判別する判別関数ｆ（Ｘ）がシステムにより計算される。この第１回目の判別知識による判別結果は、以下の表５のようになる。尚、全「成」サンプル数は４０、全「否」サンプル数は３３８４である。また、信頼度の低いサンプル数は、誤判別サンプル数も含んでいる。
【０２０６】
【表５】

【０２０７】
そして、ＲＢＦを用いた項目評価用サンプル（Ｄ）の選択手法（図５参照）によって、類似事例として表６に示す成否サンプルの対が選択された。
【０２０８】
【表６】

【０２０９】
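For example, the pairs Sekizen village (関前村) and Hayakawa town (早川町), Sekizen village and Ota village (大田村), and Sekizen village and Sakugi village (作木村) cannot be discriminated properly by the current discrimination knowledge (α_i > 0) and take almost the same values for the data items in the analysis database 1 (population by age group), with similarity RBF_I(X_i, X_j) ≈ 100%. Nevertheless, the local currency has been introduced in Sekizen village (positive, 成) and has not been introduced in Hayakawa town, Ota village, or Sakugi village (negative, 否). The data mining system asks for a new data item that can effectively discriminate the sample pairs presented in this way.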
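The idea of presenting pairs that look alike in the current items but differ in class can be sketched in plain Python. This is a simplified illustration: the form of RBF_I (Equation 21) is not given in this section, so a standard Gaussian kernel with a single hypothetical width `sigma2` is assumed, and the restriction to poorly discriminated samples (α_i > 0) described in the text is omitted.

```python
import math

def rbf_similarity(x, z, sigma2=1.0):
    """Gaussian similarity in (0, 1]; 1.0 means identical item values."""
    d2 = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma2))

def similar_opposite_pairs(samples, min_similarity=0.95):
    """samples: list of (name, label, features) with label +1 or -1.

    Returns pairs with different labels whose similarity is at least
    min_similarity, most similar first.
    """
    pairs = []
    for i, (name_i, y_i, x_i) in enumerate(samples):
        for name_j, y_j, x_j in samples[i + 1:]:
            if y_i != y_j:
                s = rbf_similarity(x_i, x_j)
                if s >= min_similarity:
                    pairs.append((name_i, name_j, s))
    return sorted(pairs, key=lambda p: -p[2])
```

Each returned pair is a candidate "similar case" in the sense of the text: nearly identical in the items already held, yet different in class.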
【０２１０】
本実施例では、分析者の知識には頼らずに、図６に示す処理によりデータマイニングシステムが自動で類似事例を差別化するのに有効だと考えられる推薦項目を選択した。この結果、推薦されたデータ項目は推薦順に表７に示す通りである。
【０２１１】
【表７】

【０２１２】
データマイニングシステムでは、推薦項目の中から、判別知識の改善に効果的と思われるデータ項目を選択する。本実施例では、推薦項目群の中で最も見込みがあるとされたデータ項目（年齢別構成比４５−６４歳）を単純に選択した。そして、分析用データベース１中の全サンプルに対する、選択されたデータ項目（年齢別構成比４５−６４歳）の実際のデータを外部データベース３から取得し、分析用データベース１に追加する。そして、新たな分析用データベース１を用いて、判別知識を更新する。この結果、新たな判別知識による判別結果は、以下の表８のようになった。尚、信頼度の低いサンプル数は誤判別サンプル数も含む。
【０２１３】
【表８】

【０２１４】
新たな判別知識により、「成」サンプル誤判別が１１から６に減っている。このような過程を繰り返しながら、データ項目を追加することで知識の精度を改善していく。
【０２１５】
ここで、本実施例では、分析者の知識には頼らずにデータマイニングシステムが自動で追加項目を特定する処理を行ったが（図６参照）、データマイニングシステムと分析者とが対話し協調しながら、判別知識の洗練に有効な追加項目を特定するようにしても良い（図４参照）。以下に、拡張三つ組み法が、外部データベース３を考慮したデータマイニングに有効である点について説明する。
【０２１６】
拡張三つ組み法は、データ項目の追加に必要な知識を分析者から引き出すための手法である。即ち、拡張三つ組み法は、現在のデータ項目では判別が困難なサンプルを判別可能とするような新たなデータ項目を分析者がデータマイニングシステムに推薦し易くするための手法である。このために、現在のデータ項目では判別が困難で、成否が異なるサンプル対を分析者に提示する。現在は判別ができない、つまりシステムにとっては同じように見えるサンプル対が実際には成否が異なっているという事実を分析者に提示することで、提示サンプル対を判別することを可能とする新たなデータ項目を聞き出すことができる。特に本実施例で適用したＲＢＦカーネルを用いた場合には、例外度が高い誤判別サンプルと、そのサンプルに似ていて成否が異なるサンプルを対として提示する。これらのサンプルのデータ項目の値に基づき、新たなデータ項目の有効性を推測することは、システムにとって理論的な正当性を持つだけでなく、以下の例で示すように、人間である分析者にとっても、直感的に意味があり効果的である。
【０２１７】
例えば本実施例では、６回目のデータ項目の追加の際に、以下の表９に示すようなサンプル対が類似事例として特定された。
【０２１８】
【表９】

【０２１９】
これは、下京区では地域通貨が導入されているが、システムが持っている現在のデータでは似たように見える上京区、中京区、東山区では導入されていないことを示している。例えば、下京区と上京区の分析用データベース１上でのデータの類似度ＲＢＦ_Ｉ（Ｘ_ｉ，Ｘ_ｊ）は９９％であった。分析者はこのようなサンプル対の提示によって、現在のデータ項目に加えるべき新たなデータ項目の発想がし易くなる。
【０２２０】
次の表１０に示す例は、１０回目のデータ項目の追加の際に、類似事例として特定されたサンプル対である。
【０２２１】
【表１０】

【０２２２】
この例から、市区町村の類似が地理的に近くにあるといった観点からではなく、データマイニングシステムにとって類似していると思われる市区町村が対として挙げられているのが確認できる。例えば、多摩市と習志野市のデータの類似度ＲＢＦ_Ｉ（Ｘ_ｉ，Ｘ_ｊ）は９８％であった。
【０２２３】
データマイニングシステムでは、表９や表１０に示すサンプル対（類似事例）を分析者に提示することもできる。システムの観点から機械的に（即ち人的な思い込みや固定観念等にとらわれず）サンプル対の提示が行われることにより、分析者がそれまで気づいていなかった因子に気づくといった効果がある。また、一回ごとの提示サンプル対だけではなく、連続した提示市区町村を見ていくことによっても新たな知見が得られる可能性がある。例えば、本実施例で何度も類似事例として特定された市区町村を示すと、次の表１１のようになる。
【０２２４】
【表１１】

【０２２５】
表１１から明らかになるのは、データマイニングシステムが長野県駒ヶ根市を正しく判別するのに苦労しているという事実である。効果的なデータ項目の追加に失敗しているという可能性もあるが、「成」のサンプルの残りの３９市区町村と駒ヶ根市を同じ「成」のクラスに入れるのは間違っているという可能性もある。このように何度も提示される市区町村を分析者が見ることで、サンプルの成否自体の変更、分析の目的に適した成否の再検討も可能となる。なお、長野県の「否」のサンプルが何度も出現しているのは、それらがデータ上から駒ヶ根市に似た市区町村であるとデータマイニングシステムが判断しているためである。そうすると、大分県湯布院町の成否も再検討する必要があると考えられる。以上のように、拡張三つ組み法によれば、データマイニングシステムにとっても、分析者にとっても、現在の判別知識の効果的な見方を提供することができる。
【０２２６】
また、本実施例では、未使用のデータ項目の良さ（最適性）を次のように評価した。即ち、外部データベース３（民力データ）から未使用のデータ項目を１つ取り出し、実際に分析用データベース１に追加して、新しい判別知識を生成し、当該判別知識の最適性の評価値（ｃ−ＳＶＭの目的関数値）を算出し、当該評価値により、未使用の各データ項目の良さを順位付けた。図１８は、選択したデータ項目が、上記の最適性評価でどの順位にあたるかを示している。「○」「△」「＋」点は、図６のステップ８−２−３’の処理により１，２，３位と評価されたデータ項目の目的関数値の順位を示す。例えば１回目の追加項目選択では、３０番目に有効なデータ項目を第１推薦項目として推薦していることが分かる。１回目の時点で、分析用データベース１に含まれないデータ項目は１０８であるので、８つのサンプルに関するデータ項目のデータだけから単純な規則で選択された追加項目は、適切な選択を行っているといえる。全体的に見ても、高々８サンプルのデータ項目の値による追加データ項目の選択は比較的良い順位の項目を選択している。ステップ８−２−３’による処理では、必ずしも最良の項目を選択する保証はないが、例えば上位３つの推薦項目を考慮することで、２０位内の項目を比較的良く選択できることが確認された。ステップ８−２−３’の選択規則は単純なものでありながら、比較的良いデータ項目が選択されていることから、項目評価用サンプルが良い判断基準を与えていたと考えられる。
【０２２７】
［実施例２］
本実施例では、手書き文字認識問題に対して本発明を適用した。情報源としての内部データベース２は、７２９１個の学習用画像データと２００７個のテスト用画像データを有する。各文字は、１６×１６ピクセルのグレースケールの画像として切り出されている。本実施例では、各ピクセルを一つのデータ項目として扱い、２５６項目の入力データとした。尚、本データは、米国郵政公社提供のもので、ｆｔｐ：／／ｆｔｐ．ｋｙｂ．ｔｕｅｂｉｎｇｅｎ．ｍｐｇ．ｄｅ／ｐｕｂ／ｂｓ／ｄａｔａ／　から入手可能である。
【０２２８】
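In this example, starting from the five data items with the highest mutual information (the initial mask), data items (pixels) highly effective for recognizing the digit "5" were selected one after another on the basis of the training image data alone, without using the test image data. FIG. 21 shows the mutual information of each data item. FIG. 22 shows the selected initial mask; the black dots in FIG. 22 are the five initial data items selected. The data were normalized so that each data item has mean 0 and variance 1. The RBF kernel function of Equation 21 was used as the kernel function. The methods described in the second embodiment of the present invention were used to select the item evaluation samples (D) and to select additional items from the recommended items. The data mining system automatically identified the additional items from the internal database 2.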
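Choosing an initial mask by mutual information can be sketched as follows. The text does not say how the mutual information of grey-scale pixels was computed, so this sketch assumes each item has already been discretized (for example into a few intensity bins) and uses the standard plug-in estimate; all names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Plug-in estimate of I(feature; label) in bits for discrete values."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    count_f = Counter(feature)
    count_l = Counter(labels)
    mi = 0.0
    for (f, l), c in joint.items():
        # p(f,l) * log2( p(f,l) / (p(f) p(l)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_f[f] * count_l[l]))
    return mi

def top_k_items(columns, labels, k):
    """columns maps an item name to its list of discrete values, one per sample."""
    ranked = sorted(columns,
                    key=lambda a: mutual_information(columns[a], labels),
                    reverse=True)
    return ranked[:k]
```

With k = 5 and labels marking whether an image shows the digit "5", `top_k_items` would play the role of the initial mask selection.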
【０２２９】
図２０は、初期マスクにデータ項目を逐次追加した場合の学習用データおよびテスト用データでの誤認識率を示す。同図中の符号５０，５１は学習用データについての結果を示し、符号５２，５３はテスト用データについての結果を示す。また、同図中の実線は、項目評価用サンプル（Ｄ）の数を１６に、近似に使用する固有値・固有ベクトルを６個（ｍ＝６）に制限した場合を示す。同図中の破線は、制限をまったく設けなかった場合を示す。制限をした場合、データ項目の追加にしたがって認識率が改善し、２０データ項目の追加で３〜４％程度の誤認識率を実現している。同テスト画像データをマスクせずに見たときの人間の誤認識率が２．５％程度であることから、高い認識率を実現していることがわかる。制限をした方が誤認識率が低いのは、評価用サンプルの選択操作により類似データ項目が集中的に選択されることが防がれ、相互情報量が高いデータ項目を万遍なく選択しているためである（図２３及び図２４参照）。尚、図２３は、項目評価用サンプル（Ｄ）の数を１６に、近似に使用する固有値・固有ベクトルを６個に制限した場合に、最終的に選択された２５データ項目（初期マスクの５つと追加された２０のデータ項目）を示す。図２４は、上記制限をしなかった場合の最終的に選択された２５データ項目を示す。
【０２３０】
項目評価用サンプル（Ｄ）として現れる画像の集合は、しばしば、マスクされた画像では類似しているが一方は数字「５」であり、他方は「５」でないという画像の組であった。このような画像に基づいて選択される追加項目は、これらの類似画像を正しく判別できるデータ項目、すなわち、数字「５」の画像と数字「５」でない画像とで白黒が反転しているようなピクセルである。このような項目の追加により、サンプルの識別率が改善することは直感的に理解できる。
【０２３１】
［実施例３］
提案手法を評価するために、ＣｏＩＬ　２０００　Ｃｈａｌｌｅｎｇｅ　で使用されたデータに本発明を適用した。尚、このデータはカリフォルニア大学アーバイン校のデータレポジトリ（Ｕｎｉｖｅｒｓｉｔｙ　ｏｆ　Ｃａｌｉｆｏｒｎｉａ　Ｉｒｖｉｎｅ　Ｄａｔａ　Ｒｅｐｏｓｉｔｏｒｙ（ＵＣＩ　ＫＤＤ　Ａｒｃｈｉｖｅ））から入手可能である。
【０２３２】
データは、訓練集合とテスト集合からなる。訓練集合は、保険会社の顧客５８２２人のデータからなる。訓練集合のうち３４８人（６％）がキャラバン（自動車で牽引する移動住宅）をカバーする保険証券（以下、キャラバン保険証券と呼ぶ。）を持つ。テスト集合は、保険会社の顧客４０００人のデータからなる。テスト集合のうち２３８人がキャラバン保険証券を持つ。それぞれの顧客に対して、８５のデータ項目がある。本実施例でのデータマイニングの目的は、「キャラバン保険証券を持つ顧客を同定するための判別知識を得ること」である。
【０２３３】
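Before the present invention was applied, item number 1 of the above data was converted into 41 binary (two-valued) items and item number 5 into 10 binary items, because these items represent customer types rather than numerical data. As a result, 134 data items (85 − 2 + 41 + 10) were used in this example.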
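The conversion of the two categorical items into binary items, and the resulting item count, can be sketched in plain Python. The category codes 1 to 41 and 1 to 10 and the example record are assumptions made for the illustration.

```python
def one_hot(value, categories):
    """Return a list of 0/1 flags with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

def expand_record(record, categorical):
    """record: list of item values; categorical: dict of 0-based index -> category list."""
    out = []
    for idx, value in enumerate(record):
        if idx in categorical:
            out.extend(one_hot(value, categorical[idx]))
        else:
            out.append(value)
    return out

# Item number 1 (index 0) has 41 customer types, item number 5 (index 4) has 10.
categorical = {0: list(range(1, 42)), 4: list(range(1, 11))}
record = [33, 1, 3, 2, 8] + [0] * 80   # hypothetical customer with 85 items
expanded = expand_record(record, categorical)
print(len(expanded))                   # 85 - 2 + 41 + 10 items after expansion
```

Binary expansion avoids imposing an artificial ordering on the customer-type codes, which matters for a distance-based kernel such as the RBF kernel.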
【０２３４】
本実施例ではＲＢＦカーネルを用いたｃ−ＳＶＭを採用し、データマイニングシステムが自動で追加項目を特定する処理を行った（図６参照）。また、選択サンプル数（評価集合の大きさ）を８とし、初期の項目集合として相互情報量の上位１０項目を選択した。
【０２３５】
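FIG. 25 shows the relationship between the number of customers (samples) in the test set judged to hold a caravan insurance policy and the number of added items. The figure confirms that the number of customers judged to hold a caravan insurance policy increases as items are added. FIG. 26 shows the number of misclassified samples in the training set. In that figure, "×" is the number of misclassified positive (成) samples, "*" is the number of misclassified negative (否) samples, and "+" is the sum of the two. The figure confirms that the number of misclassified samples decreases as items are added. The results in FIGS. 25 and 26 show that additional items effective for improving the discrimination knowledge can be selected accurately on the basis of only a very small number of samples (eight in this example), without evaluating the item values for all samples.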
【０２３６】
以上のように本発明によれば、分析用データベース１に効率的且つ効果的にデータを補充して精度の高い有用な知識を抽出することができる。データマイニングにおいて、外部データを適切に取り入れながら精度の高い知識を獲得していく必要性は高い。特に、顧客データベースの整備によるお客様サービスの充実や、機器診断用のデータベースの整備による設備診断・保守の高度化が望まれている現状では、本発明は極めて有用性が高い技術である。また、現在はインターネットを通して自由にデータが取得できる情報源は限られているが、今後、本発明に関わるデータマイニングが活発になるにつれて、インターネットを通したデータの購入や一部のデータを部分的に販売することなどが実現すると考えられる。
【０２３７】
なお、上述の実施形態は本発明の好適な実施の一例ではあるがこれに限定されるものではなく、本発明の要旨を逸脱しない範囲において種々変形実施可能である。例えば、データマイニングシステムのユーザインターフェースを適宜ユーザフレンドリーに整備しても良い。また外部データベース３のアクセスコストの違いを考慮して、外部データベース３にアクセスするルール（例えば、ある基準より低いコスト又は無償のデータベースであれば取得するデータ量または接続時間に制限をかけないが、ある基準より高いコストのデータベースであれば取得するデータ量または接続時間を制限する、または同種の情報源であれば、より低コストのデータベースにアクセスする等）を設けるようにしても良い。
【０２３８】
【発明の効果】
以上の説明から明らかなように、請求項１記載のデータマイニング方法および請求項１０記載のデータマイニングシステムおよび請求項１１記載のデータマイニング用プログラムによれば、似て非なる類似事例を特定し、当該類似事例を差別化するために有効な追加項目に関するデータを分析用データベースに追加することで、分析用データベースに効率的且つ効果的にデータを補充して、精度の高い有用な判別知識を抽出することができる。
【０２３９】
データマイニングにおいて、外部データを適切に取り入れながら精度の高い知識を獲得していく必要性は高い。特に、顧客データベースの整備によるお客様サービスの充実や、機器診断用のデータベースの整備による設備診断・保守の高度化が望まれている現状では、本発明は極めて有用性が高い技術である。また、現在はインターネットを通して自由にデータが取得できる情報源は限られているが、今後、本発明に関わるデータマイニングが活発になるにつれて、インターネットを通したデータの購入や一部のデータを部分的に販売することなどが実現すると考えられる。
【０２４０】
さらに、請求項２記載のデータマイニング方法によれば、分析者等は具体的な事例（類似事例）についての対比検討が可能となるため、的確な推薦項目を入力し易く、分析者等の知識を効果的に引き出すことができる。
【０２４１】
一方、請求項３記載のデータマイニング方法によれば、分析者等には項目を推薦するための知識は要求されず、分析者等の判断又は能力によって判別知識の精度が大きく左右されてしまうことを回避でき、安定して高精度の判別知識を得ることができる。
【０２４２】
さらに、請求項４記載のデータマイニング方法によれば、不必要な分析用データベースの拡充が回避される。
【０２４３】
さらに、請求項５記載のデータマイニング方法によれば、追加項目に関するデータのみならず、分析用データベースに存在しない追加事例に関するデータも分析用データベースに補充して、精度の高い有用な知識を抽出することができる。
【０２４４】
さらに、請求項６記載のデータマイニング方法によれば、少数のデータを用いて効果的な検証を行うことができる。
【０２４５】
さらに、請求項７記載のデータマイニング方法によれば、効率的且つ効果的に、分析用データベースに存在しない追加事例に関するデータを分析用データベースに補充することができる。
【０２４６】
さらに、請求項８記載のデータマイニング方法によれば、追加項目または追加事例に関するデータによる分析用データベースの拡充は、充分に高精度の判別知識が得られるか、これ以上の精度向上が望めなくなるまで繰り返され、高精度の判別知識を確実に得ることができる。
【０２４７】
さらに、請求項９記載のデータマイニング方法によれば、判別知識の精度が低い場合には、先ず、追加事例に関するデータが分析用データベースに追加され、追加事例に関するデータの追加では、これ以上の精度向上が望めない場合に、追加項目に関するデータが分析用データベースに追加される。したがって、効率的且つ効果的に分析用データベースが拡充される。
【図面の簡単な説明】
【図１】本発明のデータマイニング方法およびシステム並びにプログラムの処理の一例を示す概略フローチャートである。
【図２】図１のステップ３を更に詳細化した処理の一例を示すフローチャートである。
【図３】図１のステップ８を更に詳細化した処理の一例を示すフローチャートである。
【図４】データマイニングシステムと分析者との対話（やりとり）の例をシークエンスで示す図である。
【図５】図３のステップ８−１を更に詳細化した処理の一例を示すフローチャートである。
【図６】図３のステップ８−２を更に詳細化した処理の一例を示すフローチャートである。
【図７】推薦項目の中から追加項目を選択する処理の一例を示すフローチャートである。
【図８】本発明のデータマイニングシステムの構成の一例を示す概略構成図である。
【図９】本発明のデータマイニングシステムの構成の他の例を示す概略構成図である。
【図１０】本発明のデータマイニング方法及びシステムにおけるデータの流れの一例を示すデータフロー・ダイヤグラムである。
【図１１】分析用データベースと分析用データベース以外の情報源（内部データベースおよび外部データベース）との関係を表す概念図である。
【図１２】分析用データベースの中から特定された類似事例と、当該類似事例を差別化するために入力された推薦項目とを表す概念図である。
【図１３】分析用データベースの中から特定された類似事例と、当該類似事例を差別化するために入力された推薦項目と、推薦項目の中から特定された追加項目を表す概念図である。
【図１４】分析用データベースの中から特定された類似事例と、当該類似事例を差別化するために入力された推薦項目と、追加項目に関する分析用データベースの全事例分のデータを表す概念図である。
【図１５】判別知識による判別が困難な領域を表す概念図である。
【図１６】判別知識の精度改善を行うための追加項目の選択に有効な事例を選択する考え方を表す概念図であり、（Ａ）は横軸に既知の項目（ａ_１）の値をとり、縦軸に未知の項目（ａ_２）の値をとって、各事例をプロットしたものであり、（Ｂ）は、横軸に既知の項目（ａ_１）の値をとって、各事例をプロットしたものである。（Ａ），（Ｂ）で横軸（ａ_１）の値が同じ点は、同一事例を表す。
【図１７】項目の追加にしたがって得られた判別知識が洗練されていく様子を示すグラフであり、横軸は項目追加の回数を示し、縦軸は誤判別された事例の数を示す。
【図１８】選択した項目が、上記の最適性評価でどの順位にあたるかを示すグラフであり、横軸は項目推薦の回数を示し、縦軸は最適性評価における順位を示す。
【図１９】本発明のデータマイニング方法およびシステム並びにプログラムの処理の他の例を示す概略フローチャートである。
【図２０】項目の追加にしたがって得られた判別知識が洗練されていく様子を示すグラフであり、横軸は追加項目数を示し、縦軸は誤判別率を示す。
【図２１】手書き文字認識問題に対して本発明を適用した一実施例を示し、１６×１６ピクセルのグレースケールの画像データについて、各ピクセルを一つのデータ項目として扱った場合に、各データ項目の相互情報量を表す図である。
【図２２】上記実施例において、図２１の相互情報量の最も高い５つのデータ項目を選択することによって得られた初期マスクを表す図である。
【図２３】上記実施例において、類似事例の数および近似に使用する固有値・固有ベクトルの数を制限した場合に、最終的に選択されたデータ項目を表す図である。
【図２４】上記実施例において、類似事例の数および近似に使用する固有値・固有ベクトルの数を制限しなかった場合に、最終的に選択されたデータ項目を表す図である。
【図２５】項目の追加にしたがって得られた判別知識が洗練されていく様子を示すグラフであり、横軸は追加項目数を示し、縦軸は「成」と判別された事例数を示す。
【図２６】項目の追加にしたがって得られた判別知識が洗練されていく様子を示すグラフであり、横軸は追加項目数を示し、縦軸は誤判別された事例の数を示す。
[Explanation of reference numerals]
1 Analysis database
2 Internal database (information source)
3 External database (information source)
10, 10' Data mining system
[0001]
[Technical field of the invention]
The present invention relates to a data mining method, system, and program. More specifically, the present invention relates to a method, a system, and a program for efficiently and effectively expanding a database used for data mining.
[0002]
[Prior art]
Data mining is a technique for finding useful correlations hidden in vast amounts of data. It is expected to be a powerful means of putting data to good use, for example by finding best-selling products and customer preferences in data obtained from a POS (point-of-sale) system. Data mining, which analyzes the large amounts of data held by a company and extracts "knowledge" useful for decision making, is becoming an indispensable technology for surviving competition between companies.
[0003]
[Problems to be solved by the invention]
However, conventional data mining is performed in a state in which data to be analyzed is already prepared as a database by an analyst, that is, in a closed environment. That is, data to be analyzed at the time of performing data mining is limited, and effective “knowledge” may not be obtained from only the existing data.
[0004]
Therefore, a mechanism for efficiently obtaining effective information from an external information source other than the information source at hand of the analyst is required. However, conventional data mining does not consider such expansion of the database.
[0005]
Unnecessarily expanding the database wastes the time and costs required for analysis. In addition, there is a danger that acquired external data does not result in improvement of knowledge.
[0006]
Therefore, an object of the present invention is to provide a data mining method, system, and program for efficiently and effectively supplementing data to an analysis database and extracting useful knowledge with high accuracy.
[0007]
[Means for Solving the Problems]
In order to achieve the above object, the invention according to claim 1 is a data mining method that, on the basis of an analysis database holding information organized by case and by item together with information indicating which of a plurality of predetermined classes each case belongs to, generates discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs. The method has at least: a step of identifying, from the analysis database, similar cases whose item data are similar but which belong to different classes; a step of identifying an additional item effective for differentiating the similar cases; a step of adding data on the additional item, for all cases in the analysis database, to the analysis database from an information source other than the analysis database; and a step of generating discrimination knowledge on the basis of the expanded analysis database.
[0008]
Therefore, by identifying similar cases that look alike but belong to different classes, and by adding data on additional items effective for differentiating them to the analysis database, the analysis database is supplemented with data efficiently and effectively, and highly accurate and useful discrimination knowledge can be extracted.
[0009]
According to a second aspect of the present invention, in the data mining method of the first aspect, the additional item is identified from recommended items that are input in order to differentiate the similar cases. The recommended items are input by the person carrying out the data mining, such as an analyst. Because the similar cases are identified, the analyst or the like can compare and examine concrete cases. Compared with a vague, general question such as "What additional items are needed to improve the accuracy of the discrimination knowledge?", this focuses the analyst's attention on the problem and makes it easier to input appropriate recommended items. The knowledge of the analyst or the like can therefore be drawn out effectively.
[0010]
According to a third aspect of the present invention, in the data mining method according to the first aspect, an additional item is specified from an information source in accordance with a predetermined rule. Therefore, an additional item effective for refining discrimination knowledge is automatically specified without waiting for input of a recommended item. In this case, the data mining executor such as the analyst does not need knowledge for recommending items, and it is possible to avoid that the accuracy of the determination knowledge is greatly affected by the judgment or ability of the analyst or the like.
[0011]
According to a fourth aspect of the present invention, in the data mining method according to any one of the first to third aspects, when the accuracy of the generated discrimination knowledge does not satisfy a predetermined criterion, data on an additional item is added to the analysis database, and new discrimination knowledge is generated on the basis of the expanded analysis database. Unnecessary expansion of the analysis database is therefore avoided.
[0012]
According to a fifth aspect of the present invention, when the accuracy of the generated discrimination knowledge does not satisfy a predetermined criterion, data on an additional case that does not exist in the analysis database is added to the analysis database from the information source, and new discrimination knowledge is generated on the basis of the expanded analysis database. Therefore, not only data on additional items but also data on additional cases that do not exist in the analysis database can be added to the analysis database, and highly accurate and useful knowledge can be extracted.
[0013]
According to a sixth aspect of the present invention, in the data mining method according to the fourth or fifth aspect, a region that is difficult to discriminate is identified on the basis of the generated discrimination knowledge, the discrimination knowledge is applied to verification cases that fall in that region and do not exist in the analysis database, and the accuracy of the discrimination knowledge is verified on the basis of the result. Effective verification can therefore be performed using a small amount of data.
[0014]
According to a seventh aspect of the present invention, in the data mining method of the sixth aspect, when the accuracy of the discrimination knowledge does not satisfy a predetermined criterion, the data on the verification cases is added to the analysis database as the data on additional cases. In this case, data on additional cases that do not exist in the analysis database can be added to the analysis database efficiently and effectively.
[0015]
According to an eighth aspect of the present invention, in the data mining method according to any one of the fourth to seventh aspects, the step of adding data on an additional item or an additional case to the analysis database and the step of generating new discrimination knowledge on the basis of the expanded analysis database are repeated until the accuracy of the discrimination knowledge satisfies a predetermined criterion, or until the improvement in accuracy obtained by adding data on additional items or additional cases no longer satisfies a predetermined criterion. In this case, the expansion of the analysis database with data on additional items or additional cases is repeated until sufficiently accurate discrimination knowledge is obtained or no further improvement in accuracy can be expected.
[0016]
According to a ninth aspect of the present invention, in the data mining method of the eighth aspect, when the accuracy of the discrimination knowledge does not satisfy a predetermined criterion, data on an additional case is added to the analysis database, and when the improvement in accuracy obtained by adding data on additional cases does not satisfy a predetermined criterion, data on an additional item is added to the analysis database. In this case, when the accuracy of the discrimination knowledge is low, data on additional cases is added first, and data on additional items is added once adding cases can no longer be expected to improve the accuracy. The analysis database is therefore expanded efficiently and effectively.
[0017]
A data mining system according to a tenth aspect realizes the data mining method of the first aspect as a system. That is, the data mining system according to the tenth aspect comprises: an analysis database holding information organized by case and by item together with information indicating which of a plurality of predetermined classes each case belongs to; means for identifying, from the analysis database, similar cases whose item data are similar but which belong to different classes; means for identifying an additional item effective for differentiating the similar cases; means for adding data on the additional item, for all cases in the analysis database, to the analysis database from an information source other than the analysis database; and means for generating, on the basis of the expanded analysis database, discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs.
[0018]
Further, a data mining program according to claim 11 is a program for causing a computer to function as the data mining system according to claim 10. That is, the data mining program according to claim 11 causes a computer that holds an analysis database, containing information organized by case and by item together with information indicating which of a plurality of predetermined classes each case belongs to, to function as: means for identifying, from the analysis database, similar cases whose item data are similar but which belong to different classes; means for identifying an additional item effective for differentiating the similar cases; means for adding data on the additional item, for all cases in the analysis database, to the analysis database from an information source other than the analysis database; and means for generating, on the basis of the expanded analysis database, discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs.
[0019]
[Best mode for carrying out the invention]
Hereinafter, the configuration of the present invention will be described in detail based on an embodiment shown in the drawings.
[0020]
FIGS. 1 to 18 show an embodiment of a data mining method, a data mining system, and a data mining program according to the present invention. The data mining method of this embodiment generates, on the basis of an analysis database 1 holding information organized by case and by item together with information indicating which of a plurality of predetermined classes each case belongs to, discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs. The method has at least: a step of identifying, from the analysis database 1, similar cases whose item data are similar but which belong to different classes; a step of identifying an additional item effective for differentiating the similar cases; a step of adding data on the additional item, for all cases in the analysis database 1, to the analysis database 1 from an information source other than the analysis database 1; and a step of generating discrimination knowledge on the basis of the expanded analysis database 1.
[0021]
Here, the data added to the analysis database 1 is not limited to data on additional items. In order to obtain more accurate discrimination knowledge, data on cases that do not exist in the analysis database 1 may be added as necessary. For example, in this embodiment, when the accuracy of the generated discrimination knowledge does not satisfy a predetermined criterion, data on an additional item, or on an additional case that does not exist in the analysis database 1, is added to the analysis database 1, and new discrimination knowledge is generated on the basis of the expanded analysis database 1.
[0022]
Further, the method of verifying the accuracy of the discrimination knowledge is not particularly limited. For example, in this embodiment, a region that is difficult to discriminate is identified on the basis of the generated discrimination knowledge, the discrimination knowledge is applied to verification cases that fall in that region and do not exist in the analysis database 1, and the accuracy of the discrimination knowledge is verified on the basis of the result.
[0023]
In this embodiment, for example, when the accuracy of the generated discrimination knowledge does not satisfy the predetermined criterion, the data on the above-described verification cases is added to the analysis database 1 as the data on additional cases. However, the rule for selecting additional cases is not necessarily limited to the example of this embodiment.
[0024]
Further, the extent to which the analysis database 1 is expanded with additional items or additional cases, and the point at which that expansion is ended, are not necessarily limited. For example, in this embodiment, the step of adding data on an additional item or an additional case to the analysis database 1 and the step of generating new discrimination knowledge on the basis of the expanded analysis database 1 are repeated until the accuracy of the discrimination knowledge satisfies the predetermined criterion, or until the improvement in accuracy obtained by adding data on additional items or additional cases no longer satisfies a predetermined criterion.
[0025]
Further, the timing of adding the data on additional items and additional cases to the analysis database 1 is not particularly limited: either may be added first, or both may be added at the same time. For example, in this embodiment, when the accuracy of the discrimination knowledge does not satisfy a predetermined criterion, data on additional cases is added to the analysis database 1 first, and when the improvement in accuracy obtained by adding data on additional cases does not satisfy a predetermined criterion, data on additional items is added to the analysis database 1.
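The control flow described in the two paragraphs above (add cases first, fall back to adding an item when cases stop helping, stop when the accuracy criterion is met or nothing helps) can be sketched as follows. Every name, threshold, and stub here is hypothetical; the callables stand in for the actual generation, verification, and data-acquisition steps.

```python
def refine(db, generate, accuracy, add_cases, add_items,
           target=0.95, min_gain=0.01, max_rounds=50):
    """Expand db until the accuracy target is met or no addition helps."""
    knowledge = generate(db)
    acc = accuracy(knowledge, db)
    for _ in range(max_rounds):
        if acc >= target:
            break
        db = add_cases(db, knowledge)          # additional cases are tried first
        knowledge = generate(db)
        new_acc = accuracy(knowledge, db)
        if new_acc - acc < min_gain:           # cases no longer help: add an item
            db = add_items(db, knowledge)
            knowledge = generate(db)
            item_acc = accuracy(knowledge, db)
            if item_acc - new_acc < min_gain:  # items do not help either: stop
                acc = item_acc
                break
            new_acc = item_acc
        acc = new_acc
    return knowledge, db, acc

# Toy stand-ins: accuracy saturates at 4 cases and at 3 items.
def _generate(db):
    return dict(db)

def _accuracy(knowledge, db):
    return 0.5 + 0.05 * min(db["cases"], 4) + 0.1 * min(db["items"], 3)

def _add_cases(db, knowledge):
    return {"cases": db["cases"] + 1, "items": db["items"]}

def _add_items(db, knowledge):
    return {"cases": db["cases"], "items": db["items"] + 1}

knowledge, final_db, final_acc = refine({"cases": 0, "items": 0},
                                        _generate, _accuracy, _add_cases, _add_items)
```

In the toy run, items start to be added only after extra cases stop improving the score, mirroring the ordering described in paragraph [0025].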
[0026]
The data mining method of the present invention is implemented as a data mining system using a computer. The computer may be a well-known computer including a central processing unit (CPU), a main storage device, an external storage device, an output device such as a display, an input device such as a keyboard, a communication interface with an external information processing device, and the like. . The data mining system includes an analysis database 1 having information arranged for each case and each item, and information indicating to which of a plurality of predetermined classes each case belongs. Means for specifying, from the analysis database 1, cases in which the data are similar and belong to different classes, means for specifying an additional item effective for differentiating similar cases, and an analysis database relating to the additional items. Means for adding the data for all the cases in 1 to the analysis database 1 from an information source other than the analysis database 1 and inputting the data of a certain case item based on the expanded analysis database 1 Means for generating discrimination knowledge for outputting which class the case belongs to.
[0027]
The data mining program of the present invention is a program for realizing the data mining method of the present invention as a data mining system using a computer. This data mining program causes a computer to function as: an analysis database 1 having information arranged for each case and each item, together with information indicating to which of a plurality of predetermined classes each case belongs; means for specifying, from the analysis database 1, similar cases whose item data are similar but whose classes differ; means for specifying an additional item effective for differentiating the similar cases; means for adding, to the analysis database 1 from an information source other than the analysis database 1, data on the additional item for all cases in the analysis database 1; and means for generating, based on the expanded analysis database 1, discrimination knowledge that takes the item data of a case as input and outputs the class to which the case belongs.
[0028]
In the present embodiment, for convenience of explanation, as shown in the conceptual diagram of FIG. 11, all data in the world is virtually regarded as a huge two-dimensional tabular database (virtual database) of sample × data item.
[0029]
A sample is an individual "case" that is the unit of analysis. The sample is represented by each row of the virtual database (FIG. 11). For example, in a sales analysis that predicts in which regions a certain product such as an electric water heater or an electromagnetic cooker sells well, each city corresponds to a sample. In this embodiment, each individual sample is represented by i or j.
[0030]
A data item (also called an attribute) is an “item” that represents a feature of a sample. The data items are represented by respective columns of the virtual database (FIG. 11). For example, in the example of the sales analysis, the characteristics of the city such as the sales amount and the number of customers correspond to the data items. In the present embodiment, each individual data item is denoted by a. A set of data items is denoted by A. Further, the number of data items included in the analysis database 1 at a certain point in time is denoted by d.
[0031]
In the present embodiment, the value of data item a of a sample is denoted x_a. In particular, the value of data item a for a specific sample i is denoted x_ia. The vector of values of the data items of a sample is denoted X, and the vector of values of the data items belonging to the data item set A is denoted X_A. In particular, for a specific sample i, the vector of values of the data items is denoted X_i, and the vector of values of the data items belonging to the data item set A is denoted X_iA. That is, X_A or X_iA is used when the data item set A needs to be specified, and X or X_i is used when it does not. A suffix j in place of the suffix i denotes data on a sample j distinct from sample i. For convenience of description, z and Z are used in place of x and X where needed to distinguish data. For simplicity, symbols and notation defined once keep the same meaning throughout this specification unless otherwise specified. For each sample, the values X_A of the data item set A can be expressed as follows:
[0032]
(Equation 1)

[0033]
In the present embodiment, an information source having a data group which is actually made into a database and which can be easily used for analysis by an analyst is referred to as an internal database 2. For example, an internal database or a data warehouse in which sales data and customer data of products in the company are managed corresponds to the internal database 2. The internal database 2 forms a part of the virtual database.
[0034]
In the present embodiment, an information source having a data group other than the internal database 2 in the virtual database is referred to as an external database 3. For example, data in outside databases, data that exists inside the company but is not in a usable form, and data that could be newly created through a new survey or the like, that is, data that cannot be used without incurring a certain amount of time and expense, correspond to the data held by the external database 3. Therefore, the external database 3 includes data groups that have not actually been made into a database.
[0035]
In the present embodiment, a database having a data group to be subjected to data mining is referred to as an analysis database 1. The analysis database 1 according to the present embodiment is configured by extracting data on some samples and data items existing in the internal database 2 at a stage before data mining. As the analysis progresses, the analysis database 1 is supplemented with data on new samples and data items from the internal database 2 or the external database 3.
[0036]
The analysis database 1 stores class information indicating to which class each sample in the analysis database 1 belongs. The class information may be information stored in the internal database 2 in advance or, if it is not stored in the internal database 2, information determined by the analyst when configuring the analysis database 1 or information added based on data of the external database 3.
[0037]
The information sources other than the analysis database 1 in the present embodiment are the internal database 2 and the external database 3. Using the concepts of the internal database 2 and the external database 3 as information sources other than the analysis database 1 makes it possible to construct a model close to real problems. For example, in the sales analysis described above, the analyst first extracts information that is likely to be relevant from an in-house database, that is, the internal database 2 in which sales data and customer data of products are managed, and constructs the analysis database 1. The analysis is then performed using the analysis database 1. As necessary, the sales situation of the product is analyzed while samples and data items not included in the analysis database 1 are supplemented from the internal database 2. If the internal database 2 alone cannot explain, for example, why a product sells in one area but not in another, then data newly discovered within the company, data obtained by new market research, or data from an external information source (for example, regional economic statistics or population composition by age), that is, data of the external database 3, is purchased in consideration of the budget and analyzed.
[0038]
The major differences between the internal database 2 and the external database 3 include, for example, the following two points. The first difference is that the access cost, collection cost, time, and so on of the internal database 2 can be regarded as free or negligible, whereas the external database 3 incurs substantial costs such as purchase fees, new data storage area, and acquisition time. The second difference is that the contents of the internal database 2 are already known, or at least the actual data can be easily checked, whereas for the external database 3, even if the characteristics of the data can be roughly estimated, the actual data is unknown or uncertain until it is acquired.
[0039]
There is also a danger that data acquired from the external database 3 will not, in the end, lead to improvement of the discrimination knowledge. Therefore, when the external database 3 is used to obtain highly accurate discrimination knowledge, a strategy for acquiring external data is important: data in the external database 3 that is likely to be useful for extracting better knowledge must be appropriately specified so that the overall acquisition cost is kept low. This is a new problem that did not arise in conventional data mining, which mainly targets the internal database 2.
[0040]
Here, the purpose of the data mining in the present embodiment is to obtain discrimination knowledge for estimating the class to which a sample belongs from the values of its data item set. In general, the number of classes is two. In this case, a sample is classified into one of two classes, for example "yes" or "no", "presence" or "absence", or "normal" or "abnormal". The present embodiment describes the case in which it is determined, based on the information of a sample, to which of two classes the sample belongs, such as estimating the "success" or "failure" of product sales from customer information, or diagnosing a device as "normal" or "abnormal" from its inspection record. It is assumed that there are two classes, "success" and "failure", in the present embodiment. In the sales analysis example described above, a city in which sales succeeded corresponds to the class "success", and a city in which sales failed corresponds to the class "failure". In the present embodiment, the class is represented by y, with y = 1 for "success" and y = -1 for "failure".
[0041]
The discrimination knowledge is knowledge for determining the "success" or "failure" of each sample from the values of the data item set A of that sample. The discrimination knowledge is expressed, for example, as a function f(X_A). In the present embodiment, the success or failure of each sample is determined by the value of the discriminant function f(X_A). That is, when the value of f(X_A) is greater than 0, the sample is determined to belong to the class "success" (y = 1), and when the value of f(X_A) is 0 or less, the sample is determined to belong to the class "failure" (y = -1).
[0042]
(Equation 2)
if f (X_A)> 0 then y = 1
else y = -1
[0043]
The form of the discriminant function f(X_A) is not particularly limited, but it is preferably a form that is easy for an analyst to understand and a mathematical expression that allows mechanical verification against data. Examples of the form of the discriminant function f(X_A) include a linear function and a radial basis function (hereinafter referred to as RBF).
[0044]
In the case of a linear function, the discriminant function f(X_A) is represented, for example, by the following equation.
[0045]
(Equation 3)

where
X_A: a column vector (a matrix of 1 row and d columns) shown in the following equation.
(Equation 4)

W': a column vector (a matrix of 1 row and d columns) of weights shown in the following equation.
(Equation 5)

W'^T: the transpose of the weight vector W'.
w_a: a weight.
b: an intercept.
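The linear form and the decision rule of Equation 2 can be illustrated with a short sketch. It assumes the usual reading of the definitions above, namely that f(X_A) is the weighted sum of the data item values plus the intercept b; the weights, intercept, and sample values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def linear_discriminant(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Assumed linear form: f(X_A) = sum over a of w_a * x_a, plus intercept b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same number of data items")
    return sum(w_a * x_a for w_a, x_a in zip(w, x)) + b


def classify(f_value: float) -> int:
    """Equation 2: y = 1 if f(X_A) > 0, otherwise y = -1."""
    return 1 if f_value > 0 else -1


# Hypothetical weights for two data items and a hypothetical intercept.
w = [0.5, -2.0]
b = 1.0
city = [4.0, 1.0]                          # data item values of one sample
f_value = linear_discriminant(city, w, b)  # 0.5*4.0 - 2.0*1.0 + 1.0
label = classify(f_value)
```

Note that a value of exactly 0 falls in the "failure" class, as Equation 2 specifies.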
[0046]
In the case of RBF, the discriminant function f(X_A) is represented, for example, by the following equation.
[0047]
(Equation 6)

where
y_i: the class value (±1) of sample i.
α_i: a weight, where α_i > 0.
N: a multidimensional normal distribution.
M_A: the vector of mean values of the data items belonging to the data item set A.
Σ_A²: the covariance matrix of the data items belonging to the data item set A.
SV: the representative samples.
[0048]
Here, the meaning of the discriminant function f(X_A) in the linear case and in the RBF case is described.
[0049]
In the case of a linear function, the discriminant function f(X_A) is a weighted sum of the values x_a of the data items. Because discrimination is performed by the value of f(X_A), the value x_a of a data item a whose corresponding weight w_a is large greatly influences the determination result. Therefore, by examining the weights w_a of the discriminant function f(X_A), it is possible to know which data items influence the determination. The discrimination knowledge in linear form can thus be regarded as knowledge based on the data items that have a large influence.
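As a small illustration of reading linear discrimination knowledge, the sketch below ranks data items by the magnitude of their weights. The item names and weight values are hypothetical, and the comparison is meaningful only under the assumption that the data items have been scaled to comparable ranges.

```python
from typing import Dict, List, Tuple


def rank_items_by_weight(weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort data items by |w_a|, largest first; the sign shows the direction of influence."""
    return sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical weights w_a of a linear discriminant function.
weights = {"sales_amount": 0.8, "customer_count": -1.5, "population": 0.1}
ranking = rank_items_by_weight(weights)
most_influential = ranking[0][0]
```

A negative weight, as for the hypothetical `customer_count` here, pushes the sample toward the "failure" side as the item value grows.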
[0050]
In the case of RBF, N(·; M_A, Σ_A²) is a multidimensional normal distribution with center M_A and covariance Σ_A². The symbol "·" in parentheses indicates that an arbitrary value may be entered. Therefore, for some X_iA, N(X_iA; M_A, Σ_A²) takes a larger value the closer X_iA is to the center M_A of the distribution, that is, the more similar X_iA and M_A are. y_i takes the value +1 when sample i belongs to the class "success" and -1 when it belongs to the class "failure". Therefore, a sample X_jA that is similar to a sample X_iA with a large α_i tends to belong to the same class y_i as X_iA. Thus, the discrimination knowledge obtained when RBF is selected can be regarded as case-based knowledge that compares a sample with representative "success" and "failure" samples having large α_i. By knowing the representative samples and the values of their data items, it is possible to know the characteristics of "success" and "failure" samples. Since the value of the multidimensional normal distribution N decreases as the distance from the mean increases, it is a good index of the degree of similarity to the mean. Therefore, in the present embodiment, the multidimensional normal distribution N is also referred to as the similarity N.
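The case-based reading of RBF knowledge can be sketched as a similarity-weighted vote over representative samples. The form used here is an assumption, since Equation 6 is not reproduced in this text: each representative sample is taken as the center of an unnormalized Gaussian with uncorrelated items and a common width sigma, and f is the sum of y_i * α_i * similarity plus b. All sample values and parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def similarity(x: Sequence[float], center: Sequence[float], sigma: float = 1.0) -> float:
    """Unnormalized Gaussian similarity: 1.0 at the center, decreasing with distance."""
    squared_distance = sum((a - m) ** 2 for a, m in zip(x, center))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def rbf_discriminant(
    x: Sequence[float],
    representatives: List[Tuple[Sequence[float], int, float]],
    b: float = 0.0,
    sigma: float = 1.0,
) -> float:
    """Assumed form: sum of y_i * alpha_i * similarity(x, x_i) over representatives, plus b."""
    return sum(
        y_i * alpha_i * similarity(x, x_i, sigma)
        for x_i, y_i, alpha_i in representatives
    ) + b


# Hypothetical representative samples: (data item values, class y_i, weight alpha_i).
representatives = [([1.0, 1.0], 1, 1.0), ([4.0, 4.0], -1, 1.0)]
near_success = rbf_discriminant([1.2, 0.9], representatives)
near_failure = rbf_discriminant([3.8, 4.1], representatives)
```

A sample close to the "success" representative yields a positive value, and one close to the "failure" representative yields a negative value, which matches the tendency described above.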
[0051]
As described above, in the case of a linear function, knowledge based on characteristic data items can be obtained, and in the case of an RBF, knowledge based on a representative example can be obtained. These formats are intuitive and easy to understand for analysts, and can be said to be useful knowledge for decision making.
[0052]
In the present embodiment, for the sake of simplicity, a description using a linear knowledge expression based on a weighted sum of data items is frequently used. However, since the discrimination using the RBF corresponds to the linear discrimination on the infinite-dimensional feature space F shown in the following equation, the same argument holds for the RBF.
[0053]
(Equation 7)

where
I: a unit matrix.
R^d: the d-dimensional real number space.
[0054]
Here, d is the number of data items (d = | A |) included in the analysis database 1 at that time.
[0055]
The method of extracting discrimination knowledge (for example, discrimination knowledge having the above-described linear or RBF knowledge representation) based on the analysis database 1 is not particularly limited. For example, one conventionally known determination method stores typical or frequent "success" and "failure" samples and determines the success or failure of a determination target based on the distance (a kind of similarity) between those samples and the target. Such a method may be employed.
[0056]
On the other hand, the support vector machine (hereinafter referred to as SVM) is a method of extracting, from sample data, highly accurate discrimination knowledge having the above-described linear or RBF knowledge representation. The SVM stores a set of samples that have similar data item values but differ in success or failure, and determines the success or failure of a determination target based on the distance between those samples and the target. The samples that have similar data item values but differ in success or failure are the support vectors. When there is a discrimination plane that can correctly discriminate the success or failure of all samples, the support vectors give accurate position information of the discrimination boundary, so the discrimination accuracy is generally high. When some exceptional samples intrude into the discrimination region opposite to their own class, a penalty C according to the intrusion distance is imposed, the exceptional samples are identified automatically, and the discrimination plane is determined with these samples excluded. This is called c-SVM. In particular, a c-SVM in which different penalties C⁺ and C⁻ are used for the intrusion of a "success" sample into the "failure" region and for the intrusion of a "failure" sample into the "success" region is called an unbalanced c-SVM.
[0057]
In addition to its high accuracy, the c-SVM allows misclassification and indicates the region in which highly reliable determination is difficult, and is therefore suitable as a method for extracting discrimination knowledge. In the present embodiment, c-SVM is used as the method of generating discrimination knowledge, and unbalanced c-SVM is used when there is a large difference between the numbers of "success" and "failure" samples. The concepts of c-SVM and unbalanced c-SVM and the details of their formulation are described below.
[0058]
In the c-SVM, using the analysis database 1 as training data, the weights (w₁, w₂, ..., w_nF) for the elements (φ₁(X), φ₂(X), ..., φ_nF(X)) of the feature column vector φ(X) are determined so that the distance between the discrimination boundary shown in the following equation and the positive example ("success") and negative example ("failure") closest to it is maximized.
[0059]
(Equation 8)
f(X) = W^T φ(X) + b
where
W: a column vector (a matrix of 1 row and nF columns) of weights shown in the following equation.
(Equation 9)
W = (w₁, w₂, ..., w_nF)
W^T: the transpose of the weight vector W.
φ(X): the feature column vector (a matrix of 1 row and nF columns) of X in the feature space F, shown in the following equation.
(Equation 10)
φ(X) = (φ₁(X), φ₂(X), ..., φ_nF(X))
nF: the dimension (number of elements) of the feature vector space.
[0060]
That is, a linear discriminant plane f(X) = W^T φ(X) + b is formed in the space of the higher-order feature vector φ(X) of the sample X, and the success or failure of the sample is determined from the sign of f(X). W and b are determined so as to maximize the distance |f(X)| / ‖W‖ between the linear discrimination plane and the sample closest to it (that is, the support vector). Note that ‖W‖ denotes the Euclidean norm shown in the following equation.
[Equation 11]
‖W‖ = √(w₁² + w₂² + ... + w_nF²)
[0061]
However, for the exception sample group {X_i}, a penalty Σ C_i ξ_i proportional to the intrusion distance ξ_i is considered. Here, max(·, 0) is a function that returns the larger of its argument and 0: it takes the value 0 if the argument is negative, and the value of the argument otherwise.
[0062]
(Equation 12)
ξ_i = max(1 - y_i f(X_i), 0)
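The intrusion distance of Equation 12 and the resulting penalty Σ C_i ξ_i can be computed as in the following sketch. It assumes that C_i equals C⁺ for "success" samples (y_i = 1) and C⁻ for "failure" samples (y_i = -1), as in the unbalanced c-SVM described earlier; the sample values and penalty constants are hypothetical.

```python
from typing import List, Tuple


def slack(y_i: int, f_x_i: float) -> float:
    """Equation 12: xi_i = max(1 - y_i * f(X_i), 0)."""
    return max(1.0 - y_i * f_x_i, 0.0)


def total_penalty(samples: List[Tuple[int, float]], c_pos: float, c_neg: float) -> float:
    """Sum of C_i * xi_i, with C_i = c_pos for y_i = 1 and c_neg for y_i = -1 (assumed)."""
    return sum((c_pos if y == 1 else c_neg) * slack(y, f) for y, f in samples)


# Hypothetical (class y_i, discriminant value f(X_i)) pairs.
samples = [(1, 2.0), (1, 0.4), (-1, 0.5)]
slacks = [slack(y, f) for y, f in samples]
penalty = total_penalty(samples, c_pos=1.0, c_neg=2.0)
```

The first sample lies outside the margin and incurs no penalty, the second lies inside the margin on the correct side, and the third lies on the wrong side of the boundary.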
[0063]
The penalty C_i for a "success" or "failure" sample is represented, for example, by the following equation.
[0064]
(Equation 13)

[0065]
At this time, W^T φ(X) and b constituting f(X) can be obtained, as in Expressions 15 and 16, from the solution α_i of the quadratic programming problem represented by Expression 14.
[0066]
[Equation 14]

where 0 ≤ α_i ≤ C_i
min_k g(k): represents the problem of minimizing the function g(k) with respect to k.
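For reference, the textbook dual problem of a soft-margin SVM with per-sample penalties is usually written as below. This is the standard formulation under the sign convention α_i ≥ 0, given as an aid to reading; it is not a reproduction of Expression 14, whose exact form may differ.

```latex
\min_{\alpha}\;
  \frac{1}{2}\sum_{i}\sum_{j}\alpha_i\,\alpha_j\,y_i\,y_j\,K(X_i, X_j)
  \;-\; \sum_{i}\alpha_i
\qquad \text{subject to} \qquad
  0 \le \alpha_i \le C_i ,
  \quad \sum_{i}\alpha_i\,y_i = 0

f(X) \;=\; \sum_{i}\alpha_i\,y_i\,K(X_i, X) \;+\; b
```

In this form, K is the kernel function, and only samples with α_i > 0 (the support vectors) contribute to f(X).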
[0067]
(Equation 15)

[0068]
(Equation 16)

where 0 < α_SV⁺ y_SV⁺ < C⁺ and 0 > α_SV⁻ y_SV⁻ > -C⁻
SV⁺: a representative sample of "success".
SV⁻: a representative sample of "failure".
[0069]
Here, K (X, Z) is called a kernel function and is represented by the following equation.
[0070]
[Equation 17]
K(X, Z) = φ(X)^T φ(Z)
φ(X)^T: the transpose of the feature vector φ(X).
φ(Z): the feature vector of Z in the feature space F.
[0071]
Specific examples of the kernel function include a linear kernel function, a polynomial kernel function, and an RBF kernel function. The linear kernel function corresponds to the case where the discriminant function is linear, and can be expressed by the following equation.
(Equation 18)

[0072]
The n-th order polynomial kernel function corresponds to the case where the discriminant function is the n-th order polynomial function, and can be expressed by the following equation.
[Equation 19]

where
n: the degree of the polynomial.
[0073]
The RBF kernel function corresponds to the case where the discriminant function is based on RBF, and can be expressed by the following equation.
(Equation 20)

where
Σ²: a covariance matrix.
[0074]
In particular, an RBF kernel function when each data item has no correlation is often used, and the RBF kernel function in this case is represented by the following equation.
(Equation 21)

Here, d is the size (the number of items) of the data item set A.
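The three kernel types named above can be sketched as follows. These are common textbook forms and are given as assumptions: Equations 18 to 21 are not reproduced in this text and may differ in constant factors. In particular, the "+ 1" offset in the polynomial kernel and the single width parameter sigma for uncorrelated items are choices made for this illustration.

```python
import math
from typing import Sequence


def linear_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """Inner product of the two data item vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x: Sequence[float], z: Sequence[float], n: int) -> float:
    """n-th order polynomial kernel in the assumed form (x . z + 1) ** n."""
    return (linear_kernel(x, z) + 1.0) ** n


def rbf_kernel(x: Sequence[float], z: Sequence[float], sigma: float) -> float:
    """Unnormalized Gaussian kernel for uncorrelated items with common width sigma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


x = [1.0, 2.0]
z = [3.0, 4.0]
k_lin = linear_kernel(x, z)          # 1*3 + 2*4
k_poly = polynomial_kernel(x, z, 2)  # (11 + 1) ** 2
k_rbf = rbf_kernel(x, z, sigma=1.0)
```

Each function is symmetric in its two arguments, as a kernel of the form φ(X)^T φ(Z) must be.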
[0075]
Note that the X_i for which 0 < α_i < C_i are support vectors of the first kind, which lie on the planes at a distance of ±1 / ‖W‖ from the linear discrimination plane. The X_i for which α_i = C_i are support vectors of the second kind, which are exception samples that have intruded by ξ_i / ‖W‖ from the plane at ±1 / ‖W‖ in the direction opposite to their class y_i. α_i indicates the importance of each support vector in the determination. There are two kinds of exception samples. When 0 < ξ_i < 1, the sample is not misjudged, because the sign of f(X_i) still agrees with y_i, but the reliability of the determination is low. When ξ_i > 1, the discrimination result sign(f(X_i)) differs from the actual y_i, and the sample is a true "misjudgment" sample. Here, sign() is a function that returns +1 or -1 according to the sign of its argument, based on Equation 2.
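The classification of samples by α_i and ξ_i described above can be written as a small lookup. The category names and the numerical tolerance are hypothetical, and the boundary case ξ_i = 1, which the text does not assign to either kind, is grouped with the low-reliability samples in this sketch.

```python
def categorize(alpha_i: float, c_i: float, xi_i: float, tol: float = 1e-9) -> str:
    """Classify a training sample from its weight alpha_i, penalty bound C_i, and slack xi_i."""
    if alpha_i <= tol:
        return "non_support"               # does not contribute to the discriminant
    if alpha_i < c_i - tol:
        return "margin_support_vector"     # first kind: lies on the +-1/||W|| plane
    # alpha_i == C_i: second kind, possibly an exception sample
    if xi_i > 1.0:
        return "misjudged_exception"       # sign(f(X_i)) differs from y_i
    if xi_i > tol:
        return "low_confidence_exception"  # correct side, but inside the margin
    return "bound_support_vector"          # alpha_i at its bound with no intrusion


categories = [
    categorize(0.0, 1.0, 0.0),
    categorize(0.5, 1.0, 0.0),
    categorize(1.0, 1.0, 0.4),
    categorize(1.0, 1.0, 1.7),
]
```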
[0076]
An example of data mining processing according to the present embodiment will be described with reference to the flowchart shown in FIG. First, an analysis database 1 is prepared (step 1). Next, discrimination knowledge is generated based on the analysis database 1 (step 2). Next, it is verified whether the accuracy of the generated discrimination knowledge satisfies a predetermined standard (step 3). If the accuracy of the discrimination knowledge satisfies a predetermined criterion (Step 4; Yes), the data mining is terminated.
[0077]
On the other hand, if the accuracy of the discrimination knowledge does not satisfy the predetermined criterion (Step 4; No), it is determined whether it is necessary to add data on an additional case to the analysis database 1 (Step 5). If it is necessary to add data on the additional case to the analysis database 1 (Step 5; Yes), add data on the additional case (data corresponding to the row of the virtual database (FIG. 11)) to the analysis database 1 ( Step 6).
[0078]
Then, new discrimination knowledge is generated based on the expanded analysis database 1 (step 2). The addition of data relating to additional cases to the analysis database 1 (step 6) is repeated until the accuracy of the discrimination knowledge satisfies the predetermined criterion (Step 4; Yes) or until it is no longer necessary to add data relating to additional cases to the analysis database 1 (for example, until adding such data no longer improves the accuracy of the discrimination knowledge; Step 5; No).
[0079]
When the degree of improvement in the accuracy of the discrimination knowledge obtained by adding data relating to additional cases no longer satisfies the predetermined criterion, so that it is no longer necessary to add data relating to additional cases to the analysis database 1 (Step 5; No), it is determined whether it is necessary to add data relating to additional items to the analysis database 1 (Step 7). If it is necessary to add data relating to additional items to the analysis database 1 (Step 7; Yes), data relating to the additional items (data corresponding to columns of the virtual database (FIG. 11)) is added to the analysis database 1 (Step 8).
[0080]
Then, new discrimination knowledge is generated based on the expanded analysis database 1 (step 2). The addition of data relating to additional items to the analysis database 1 (step 8) is repeated until the accuracy of the discrimination knowledge satisfies the predetermined criterion (Step 4; Yes) or until it is no longer necessary to add data relating to additional items to the analysis database 1 (for example, until adding such data no longer improves the accuracy of the discrimination knowledge; Step 7; No).
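The control flow of steps 2 to 8 can be summarized in the following sketch. The function `run_mining`, its parameters, and the toy accuracy model are all hypothetical; in particular, accuracy is modeled as an integer score and the "predetermined criteria" as a target score and a minimum gain, which is only one possible reading of the criteria described above.

```python
from typing import Any, Callable, List, Tuple


def run_mining(
    db: Any,
    train: Callable[[Any], Any],
    evaluate: Callable[[Any], int],
    add_cases: Callable[[Any], Any],
    add_items: Callable[[Any], Any],
    target: int,
    min_gain: int,
    max_rounds: int = 50,
) -> Tuple[Any, int, List[str]]:
    log: List[str] = []
    mode = "cases"                      # additional cases are tried first (steps 5 and 6)
    knowledge = train(db)               # step 2
    accuracy = evaluate(knowledge)      # step 3
    for _ in range(max_rounds):
        if accuracy >= target:          # step 4; Yes
            break
        db = add_cases(db) if mode == "cases" else add_items(db)  # step 6 or step 8
        log.append(mode)
        knowledge = train(db)           # step 2 on the expanded database
        new_accuracy = evaluate(knowledge)
        gain, accuracy = new_accuracy - accuracy, new_accuracy
        if accuracy < target and gain < min_gain:
            if mode == "cases":
                mode = "items"          # step 5; No, so move on to step 7
            else:
                break                   # step 7; No
    return knowledge, accuracy, log


# Toy model: the first two extra cases add 10 points each, every extra item adds 15.
def toy_train(db): return dict(db)
def toy_evaluate(k): return 50 + 10 * min(k["cases"], 2) + 15 * k["items"]
def toy_add_cases(db): return {"cases": db["cases"] + 1, "items": db["items"]}
def toy_add_items(db): return {"cases": db["cases"], "items": db["items"] + 1}

knowledge, accuracy, log = run_mining(
    {"cases": 0, "items": 0}, toy_train, toy_evaluate,
    toy_add_cases, toy_add_items, target=90, min_gain=5,
)
```

In the toy run, cases are added until a third addition brings no gain, after which items are added until the target score is reached.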
[0081]
Hereinafter, each of the above steps will be described in more detail.
[0082]
If there were no restrictions such as storage capacity or execution time in the data mining process, data on all samples could be obtained from the internal database 2. In reality, however, various restrictions are imposed, and under such restrictions it is preferable to use data on as few samples as possible. Therefore, in the present embodiment, data relating to a part of the samples existing in the internal database 2 is extracted to configure the analysis database 1 (step 1). The process of extracting the data of the samples to be analyzed from the internal database 2 may be performed by an operation of the analyst based on the analyst's judgment, or may be performed automatically by computer processing of the data mining system based on a predetermined rule.
[0083]
Then, in step 2, a discriminant function f(X_A) serving as discrimination knowledge is calculated from the data stored in the analysis database 1 by an arithmetic function provided in the data mining system (computer).
[0084]
The discrimination knowledge generated in step 2 is based only on the analysis database 1, so it is necessary to verify whether it also holds for other data. This is step 3. An example detailing the verification processing of the discrimination knowledge in step 3 is shown in the flowchart of FIG. In the verification processing of the present embodiment, first, a difficult-to-discriminate region ρ is specified based on the generated discrimination knowledge (step 3-1). The difficult-to-discriminate region ρ is a region in which the reliability of determination by the discriminant function f(X_A) is low, and corresponds, for example, to the vicinity of the discrimination boundary where the determination result (success or failure) changes. FIG. 15 shows a linear discrimination boundary f(X_A) = 0. The hatched area is the difficult-to-discriminate region ρ. In FIG. 15, ☆ and ☆ represent known samples stored in the analysis database 1. The width of the difficult-to-discriminate region ρ is not particularly limited. For example, in the present embodiment, the width is set to about 1.2 times the margin width (1 / ‖W‖).
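Membership in the difficult-to-discriminate region ρ can be tested as in the sketch below for a linear boundary. It assumes that the stated width applies on each side of the boundary, so that a sample is in ρ when its distance |f(X)| / ‖W‖ to the boundary is at most 1.2 / ‖W‖; the weights and sample values are hypothetical.

```python
import math
from typing import Sequence


def distance_to_boundary(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Geometric distance |f(X)| / ||W|| from a sample to the boundary f(X) = 0."""
    norm = math.sqrt(sum(w_k * w_k for w_k in w))
    f_value = sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
    return abs(f_value) / norm


def in_difficult_region(
    x: Sequence[float], w: Sequence[float], b: float, width_factor: float = 1.2
) -> bool:
    """True if the sample lies within width_factor times the margin width 1 / ||W||."""
    norm = math.sqrt(sum(w_k * w_k for w_k in w))
    return distance_to_boundary(x, w, b) <= width_factor / norm


# Hypothetical boundary with ||W|| = 5, so the margin width is 0.2 and rho extends to 0.24.
w = [3.0, 4.0]
b = 0.0
near = in_difficult_region([0.2, 0.1], w, b)   # distance about 0.2
far = in_difficult_region([1.0, 1.0], w, b)    # distance 1.4
```

Under this assumption the test is equivalent to checking |f(X)| ≤ 1.2, which avoids computing ‖W‖ at all.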
[0085]
A sample that falls in the difficult-to-discriminate region ρ and does not exist in the analysis database 1 is searched for as a verification case, for example from the internal database 2 as an information source (step 3-2). The discriminant function f(X_A) is applied to each retrieved sample (verification case) (step 3-3). In FIG. 15, ● and ★ represent verification cases retrieved from the internal database 2. The actual class of each verification case is compared with the result of the discriminant function f(X_A) to count the number of erroneous determinations (step 3-4). If the number of erroneous determinations is equal to or smaller than a predetermined number (step 3-5; Yes), the accuracy of the discriminant function f(X_A) is determined to be sufficient (step 3-6). If the number of erroneous determinations is greater than the predetermined number (step 3-5; No), the accuracy of the discriminant function f(X_A) is determined to be insufficient (step 3-7).
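Steps 3-2 to 3-7 can be sketched as a single function. The function name, the tuple layout of the candidate samples, and the use of |f(X)| ≤ 1.2 as the test for the region ρ are assumptions made for this illustration; the discriminant and the data are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

Sample = Tuple[str, Sequence[float], int]   # (sample id, data item values, actual class)


def verify(
    discriminant: Callable[[Sequence[float]], float],
    candidates: List[Sample],
    known_ids: Set[str],
    max_errors: int,
    width_factor: float = 1.2,
) -> Tuple[bool, int, List[str]]:
    # Step 3-2: samples in the region rho that are not yet in the analysis database.
    cases = [
        (sid, x, y) for sid, x, y in candidates
        if sid not in known_ids and abs(discriminant(x)) <= width_factor
    ]
    # Steps 3-3 and 3-4: apply the discriminant and count disagreements with the actual class.
    errors = sum(1 for _, x, y in cases if (1 if discriminant(x) > 0 else -1) != y)
    # Steps 3-5 to 3-7: accuracy is sufficient if the error count is within the limit.
    return errors <= max_errors, errors, [sid for sid, _, _ in cases]


# Hypothetical one-item discriminant and candidate samples from the internal database.
f = lambda x: x[0] - 2.0
candidates = [("a", [2.5], 1), ("b", [1.5], 1), ("c", [9.0], 1), ("d", [2.2], -1)]
sufficient, errors, used = verify(f, candidates, known_ids={"d"}, max_errors=1)
```

Sample "c" is skipped because it lies far from the boundary, and sample "d" is skipped because it is already in the analysis database.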
[0086]
For example, in the present embodiment, the following criterion is adopted for determining whether it is necessary to add data relating to additional cases to the analysis database 1 (step 5). That is, it is determined whether data relating to additional cases has not yet been added to the analysis database 1, or whether the degree of improvement in the accuracy of the discrimination knowledge obtained by adding such data satisfies a predetermined criterion. The degree of improvement is represented, for example, by the difference between the accuracy of the discrimination knowledge after the addition of the data relating to additional cases and its accuracy before the addition. If data relating to additional cases has not yet been added to the analysis database 1, or if the degree of improvement satisfies the predetermined criterion, that is, if it is expected that the accuracy of the discrimination knowledge can be improved by adding data relating to additional cases to the analysis database 1 (Step 5; Yes), data relating to additional cases is added to the analysis database 1 (Step 6). The determination in step 5 may be made automatically by the data mining system (computer) based on the above criterion, or the data mining system may display the validity of the additional cases on a display or the like based on the above criterion, with the final determination made by the analyst through an input operation on the data mining system.
[0087]
When data relating to additional cases is to be added to the analysis database 1 (Step 4; No, Step 5; Yes), the data relating to the verification cases existing in the internal database 2 is added to the analysis database 1 (Step 6). Here, the data on a verification case also includes class information indicating to which class the verification case belongs. The class information may be stored in the internal database 2 in advance; if it is not, the class information of the verification case is added to the analysis database 1 based on the judgment of the analyst or on data of the external database 3. The data relating to additional cases may be added to the analysis database 1 by an input operation on the data mining system by the analyst or the like, or the corresponding data may be acquired and added automatically by computer processing of the data mining system.
[0088]
By using as verification cases samples that fall in the difficult-to-discriminate region ρ and do not exist in the analysis database 1, as in the present embodiment, effective verification can be performed with a small amount of data. If the number of erroneous determinations is large, data on the verification cases is added to the analysis database 1 and new discrimination knowledge is generated. By repeating these steps, highly accurate discrimination knowledge can be generated without examining the entire internal database 2.
[0089]
On the other hand, if the accuracy of the discriminant function f(X_A) remains insufficient even after samples are repeatedly added from the internal database 2 (Step 5; No), an attempt is made to improve the discrimination knowledge by adding a new data item (Step 7; Yes, Step 8).
[0090]
Here, in the present embodiment, the following criterion is adopted, for example, for determining whether it is necessary to add data relating to additional items to the analysis database 1 (step 7). That is, it is determined whether data relating to additional items has not yet been added to the analysis database 1, or whether the degree of improvement in the accuracy of the discrimination knowledge obtained by adding such data satisfies a predetermined criterion. If data relating to additional items has not yet been added to the analysis database 1, or if the degree of improvement satisfies the predetermined criterion, that is, if it is expected that the accuracy of the discrimination knowledge can be improved by adding data relating to additional items to the analysis database 1 (Step 7; Yes), data relating to additional items is added to the analysis database 1 (Step 8). The determination in step 7 may be made automatically by the data mining system (computer) based on the above criterion, or the data mining system may display the validity of the additional items on a display or the like based on the above criterion, with the final determination made by the analyst through an input operation on the data mining system.
[0091]
FIG. 3 is a flowchart detailing the process of adding a new data item in step 8, and FIGS. 12 to 14 illustrate how a new data item is added to the analysis database 1. In this process, first, similar cases whose item data are similar but whose classes differ are identified from the analysis database 1 (step 8-1). Reference numerals 1a, 1b, and 1c in FIG. 12 indicate the identified similar cases. Next, an additional item effective for differentiating the similar cases is specified (step 8-2). Reference numeral 5 in FIG. 13 indicates the specified additional item. Next, the data on the specified additional item for all cases in the analysis database 1 is added to the analysis database 1 from an information source such as the external database 3 (step 8-3). Reference numeral 6 in FIG. 14 indicates the data relating to the additional item added to the analysis database 1. The data may be added by an input operation on the data mining system by an analyst or the like, or the data mining system may automatically acquire the corresponding data from a previously designated external database 3 and add it.
[0092]
An additional item effective for differentiating the similar cases may be specified, for example, from among recommended items that the analyst inputs for that purpose. Alternatively, it may be specified from an information source such as the external database 3 according to, for example, a predetermined rule.
[0093]
When the additional item is specified from input recommended items, the data mining system and the analyst interact and cooperate to specify an additional item effective for refining the discrimination knowledge. Guessing, under a given cost constraint, what kind of data can be obtained from the many external databases 3 and what properties that data has is something an expert can do. Analysts excel at this, whereas computer systems are generally poor at it. However, even an expert finds it difficult to answer appropriately when asked vaguely what data would be useful for improving the discrimination knowledge. Therefore, in the present embodiment, a triad method is used to draw out the analyst's knowledge effectively. In the triad method, three samples A, B, and C are presented together with the comparative question, "What important items are common to A and B but not to C?", thereby eliciting the items that are important in differentiating samples of the group to which A and B belong from samples of the group to which C belongs. For example, in judging the success or failure of sales, the analyst is asked, for two specific points A and B where sales succeeded and a point C where sales failed, which items are common to A and B but not to C. The respondent might answer with an item such as "transaction record", based on the knowledge that A and B have a transaction record but C does not. Compared with the vague, general question "What items are important for determining the success or failure of sales?", such a question is easier to answer accurately.
[0094]
In the present specification, the method of presenting to the analyst similar cases whose item data are similar but whose classes differ, and of obtaining additional items effective for differentiating them, is referred to as the extended triad method. For example, in the extended triad method according to the present embodiment, data on effective data items is added to the analysis database 1 in two steps: (1) a small number of samples are extracted from the analysis database 1, and effective data items to be added are examined by focusing only on those samples; and (2) for only the promising data items found in that examination, values are acquired for all the samples in the analysis database 1. The approach using a small number of samples is considered effective for the following three reasons. First, limiting consideration to a small number of samples greatly reduces access to the external database 3, whose access cost is high. Second, c-SVM and similar methods obtain an optimum solution by repeatedly focusing on only two samples that satisfy a specific condition and improving the solution, and the same approach can be expected to be effective for adding data items. Third, it has been confirmed, in eliciting knowledge from experts when constructing expert systems and the like, that questions based on comparing specific samples (cases) are effective for eliciting accurate knowledge.
[0095]
FIG. 4 shows, as a sequence, an example of the dialogue between the data mining system using the extended triad method and the analyst. The data mining system presents the similar cases identified from the analysis database 1 to the analyst via an output device such as a display, and accepts input of data on recommended items effective for differentiating the similar cases (step 8-2-1). The analyst inputs into the data mining system data relating to recommended items considered effective for differentiating the presented similar cases (step 8-2-2). Reference numerals 4a, 4b, and 4c in FIGS. 12 to 14 represent the data on the input recommended items. The data mining system then selects, from the input recommended items, an item effective for differentiating the similar cases (step 8-2-3).
[0096]
Having the data mining system present similar cases to the analyst not only helps in searching for candidate additional data items but also has the following desirable effects. First, in the course of the interview, data items that do not currently exist but may be useful for improving the knowledge may be found. That is, the process not only uncovers existing data items but also reveals new items that should be investigated to improve the knowledge. Second, by repeatedly presenting difficult-to-discriminate cases to the analyst as similar cases, the analyst is given an opportunity to review whether a case is an exception or whether its class information is incorrect. When the analyst notices that a case is an exception, or corrects its success/failure judgment, the resulting discrimination knowledge may be refined.
[0097]
On the other hand, when additional items are specified from an information source such as the external database 3 according to, for example, a predetermined rule, the data mining system automatically specifies additional items effective for refining the discrimination knowledge without relying on the analyst's knowledge. In this case, the analyst does not need the knowledge to recommend data items to the data mining system, and the accuracy of the discrimination knowledge is kept from being strongly affected by the analyst's judgment or ability. In recent years, meta-information sources indicating where external information sources published on the Internet are located (for example, lists of statistical survey results of government ministries and agencies, and information from private statistics collections or marketing companies) have been developed, and these may be used. By using these meta-information sources, the information sources can be narrowed down without relying entirely on the analyst's knowledge. In this case, the locations of the external databases 3 that the data mining system can access to obtain data are stored in advance or supplied as needed.
[0098]
FIG. 16 shows the concept of selecting samples that are effective for choosing an additional item to improve the accuracy of the discrimination knowledge. FIG. 16A plots each sample with the known data item a₁ on the horizontal axis and the unknown data item a₂ on the vertical axis. FIG. 16B plots each sample with only the known data item a₁ on the horizontal axis. In FIGS. 16A and 16B, the same position on the horizontal axis (a₁) corresponds to the same sample.
[0099]
When only the data item a₁ is used, that is, in the case of FIG. 16B, the discrimination boundary (f(x_a1) = 0) is l. In this case, some samples (★ and ● in FIG. 16) cannot be answered correctly. Therefore, a new data item must be added to obtain more accurate discrimination knowledge. In the present embodiment, the data item to be added is selected not by using all samples but by focusing only on a small number of specific samples and adopting a data item that discriminates these well. In the present embodiment, these specific few samples are referred to as item evaluation samples (D). Among the item evaluation samples (D), the "success" samples are denoted D⁺ and the "failure" samples D⁻. The item evaluation samples (D) according to the present embodiment include at least the samples that should come to be discriminated correctly by the added data item, that is, the erroneously discriminated samples (★ and ● in FIG. 16). However, the discrimination boundary l1 obtained by focusing only on the erroneously discriminated samples (★ and ● in FIG. 16) cannot always discriminate the other samples appropriately. Therefore, in the present embodiment, in addition to the erroneously discriminated samples, the "success" samples (hatched ○ in FIG. 16) that are close on the a₁ axis to the erroneously discriminated sample (● in FIG. 16) and the "failure" samples (hatched ☆ in FIG. 16) that are close on the a₁ axis to the erroneously discriminated sample (★ in FIG. 16), which is originally "failure", are added to obtain an appropriate discrimination boundary l2. The erroneously discriminated samples (★ and ● in FIG. 16) and the added samples (hatched ☆ and ○ in FIG. 16) are the samples called support vectors in c-SVM. Using a plurality of support vectors as the item evaluation samples (D) in this way is effective for obtaining an appropriate discrimination boundary.
[0100]
A more specific method for selecting the item evaluation sample (D) will now be described. The purpose of selecting the item evaluation sample (D) is to find the additional data item that minimizes the estimated discrimination error (also called the wrong-answer rate, generalization error, or test error) on unknown data of the discriminant function obtained by the SVM.
[0101]
Several conventional methods exist for estimating the wrong-answer rate (test error) of an SVM on unknown data. The present embodiment is described mainly for the unbalanced c-SVM, but the same applies to other SVMs, for example, an SVM that does not allow intrusion or an SVM that imposes a penalty proportional to the squared intrusion ξ².
[0102]
Although its calculation cost is high, the leave-one-out (LOO) method, represented by the following equation, is a highly accurate method of estimating the test error.
[0103]
(Equation 22)

where
[x]₊: a step function (0 if x < 0, 1 if x ≥ 0).
y_e: the class value (±1) of the e-th sample.
p: the number of samples in the analysis database 1.
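As a non-limiting illustration of the leave-one-out idea, the following sketch counts a sample e as an error when the step function applied to −y_e · f_e(X_e) equals 1, where f_e is trained without sample e, and divides the count by p. The exact form of Equation 22 is not reproduced in this text, so this reading is an assumption. The learner `train_centroid` is a hypothetical one-dimensional stand-in and is not c-SVM; it also assumes each class keeps at least one sample after removal.

```python
def loo_error(samples, train):
    """Leave-one-out estimate: fraction of samples e with -y_e * f_e(X_e) >= 0."""
    errors = 0
    for e, (x_e, y_e) in enumerate(samples):
        rest = samples[:e] + samples[e + 1:]  # remove the e-th sample
        f_e = train(rest)                     # retrain without it
        if -y_e * f_e(x_e) >= 0:              # step function [x]+ equals 1
            errors += 1
    return errors / len(samples)


def train_centroid(samples):
    """Stand-in learner (NOT c-SVM): 1-D nearest-centroid discriminant."""
    pos = [x for x, y in samples if y > 0]
    neg = [x for x, y in samples if y < 0]
    m_pos = sum(pos) / len(pos)
    m_neg = sum(neg) / len(neg)
    # Positive output means the sample is closer to the +1 centroid.
    return lambda x: abs(x - m_neg) - abs(x - m_pos)


data = [(0.0, -1), (1.0, -1), (4.0, 1), (5.0, 1), (0.5, 1)]
print(loo_error(data, train_centroid))
```

The loop retrains p times, which is why the description calls the method accurate but computationally expensive.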
[0104]
Here, f_e(·) denotes the discriminant function obtained using c-SVM (that is, by solving Expression 14) from the learning samples that remain after the e-th sample is removed from all p learning samples, and f_e(X_e) denotes the value of that discriminant function at the e-th sample X_e. Assuming that, even if one support vector is removed, the intrusion distances ξ_i determined by Expression 12 for the other support vectors do not change, the following equation is derived.
[0105]
[Equation 23]

where
f(X_e): the value at sample X_e of the discriminant function obtained from all p learning samples.
α_e: the weight of sample X_e obtained by Expression 14 based on all learning samples.
ξ_e: the intrusion distance of sample X_e obtained by Expression 12 based on all learning samples.
[0106]
Here, S_e is the span, that is, the distance from φ(X_e) to the discrimination surface f_e(X) = 0. The span bound is obtained by evaluating Expression 22 using Expression 23. The radius-margin bound r_M is one of the upper bounds of the span bound. Since S_e is bounded above by the diameter 2r of the hypersphere of minimum radius in the feature space F that contains all the learning data, the radius-margin bound r_M is obtained by the following equation.
[0107]
[Equation 24]

where
[Equation 25]

[0108]
The radius r is obtained by Expression 28 from the solution β_i of the quadratic programming problem represented by Expression 26. Here, max_k g(k) denotes the problem of maximizing a function g(k) of k with respect to k.
[0109]
[Equation 26]

where
[Equation 27]

[Equation 28]

[0110]
Below, the set of samples having a positive β_i is denoted RV. RV is the set of samples lying on the hypersphere of minimum radius r in the feature space F that contains all the learning data.
[0111]
Furthermore, if the change in the radius r due to the addition of a data item can be ignored, the only quantity in Expression 24 that changes with the addition is ‖W‖², so ‖W‖² can be used as a measure of the SVM error rate.
[0112]
Therefore, in the present embodiment, the case where ‖W‖² is used will be described as an example of a specific method of selecting the item evaluation sample (D). The description assumes that the discrimination knowledge is extracted with an unbalanced c-SVM that uses the RBF kernel function RBF_I expressed by Expression 21 as K(X, Z). In this case, the data item set A in Expression 21 is the set of data items included in the analysis database 1 at that time. The same treatment can be applied to SVMs other than the unbalanced c-SVM and to SVMs using other kernels.
[0113]
The characteristics of a newly added data item that does not exist in the analysis database 1 are represented by a data item a(D⁺, D⁻) taking the following values δ_i.
[0114]
[Equation 29]

where
δ_i: the value, for sample i, of the data item that does not exist in the analysis database 1.
μ: the average value of the data item that does not exist in the analysis database 1.
δ: the amount by which the values of the data item for the samples belonging to D⁺ and D⁻ deviate from the average.
[0115]
Note that δ ≒ 0. Thus, a(D⁺, D⁻) is a data item whose value characteristically changes in opposite directions for the sample groups D⁺ and D⁻.
[0116]
Here, the data item a(D⁺, D⁻) sought is one that is likely to make ‖W^(A)‖² as small as possible, where ‖W^(A)‖², given by the following equation, serves as a measure of the error rate when the unbalanced c-SVM is applied to the analysis database 1 to which the data item has been added.
[Equation 30]

where
W^(A): the weight vector of the discriminant function after the addition of the data item a.
X_i^(A): the sequence of data item values of sample i after the addition of the data item a.
α_i^(A): the weight of a sample i that is a support vector after the addition of the data item a.
[0117]
Here, a function Obj of the additional data item a(D⁺, D⁻) is defined as follows.
(Equation 31)

[0118]
The above expression is the objective function of the unbalanced c-SVM (Expression 14) in which the support vectors before the addition of the data item and their weights α_i are held fixed and the arguments of the kernel function are replaced with those after the addition. Since the following relation holds, when δ is minute, the data item a(D⁺, D⁻) can be selected so as to make ‖W^(A)‖² as small as possible.
(Equation 32)

[0119]
At this time, the following holds.
[0120]
[Theorem]
Adding the data item a(D⁺, D⁻) increases the objective function value Obj, and the amount of increase depends on g(D⁺, D⁻) and on a positive constant Δ that does not depend on the data item to be added. Here, the samples of the sample groups D⁺ and D⁻ belong to the "success" and "failure" classes, respectively, and the following equations are satisfied.
[0121]
[Equation 33]

[0122]
(Equation 34)

[0123]
where
(Equation 35)
n = (d + 1) / d
[0124]
[Proof]
After the data item a(D⁺, D⁻) is added, the kernel K(X_i, X_j) is represented by the following equation.
[0125]
[Equation 36]

[0126]
Therefore, the change dK can be approximated by the following equation.
[0127]
[Equation 37]

[0128]
Since K(X_i, X_j) ≦ 1 and n < 1, and since it contains no variable related to the additional data item, the second term is a positive constant independent of the additional data item. The increase dObj in the value of the objective function Obj caused by adding the data item a(D⁺, D⁻) can then be approximated by the following equation.
[0129]
[Equation 38]

[0130]
The equality sign holds when
[0131]
[Equation 39]

[0132]
Note that, to increase Obj, g(D⁺, D⁻) favors pairs of support vectors (samples) of high importance (α_i ≫ 0) and high similarity (K(X_i, X_j) ≒ 1), and the second term favors exceptional samples (ξ_i > 0).
[0133]
(Equation 40)

[0134]
That is, D^± preferably consists of support vectors that are highly similar to each other and are exceptional samples. Also, the more support vectors are used, the greater the improvement.
[0135]
Based on the above theory, for example, in the case of discrimination by the linear function shown in FIG. 16, letting α_i be the importance of each support vector obtained by c-SVM, samples for which the evaluation value g_LINEAR represented by the following equation is large are preferable as the item evaluation sample (D).
[0136]
(Equation 41)

[0137]
Further, when discrimination using the RBF is performed, samples for which the evaluation value g_RBF represented by the following equation, which also takes the similarity N(X_i, X_j) into account, is large are preferable as the item evaluation sample (D).
[0138]
(Equation 42)

[0139]
FIG. 5 is a flowchart of an example of a more detailed process in a case where the item evaluation sample (D) as a similar case is specified from the analysis database 1 (step 8-1). In this case, first, the samples serving as the support vectors are classified as in the following equation (step 8-1-1).
[0140]
[Equation 43]

[0141]
Next, for each pair of a "success" support vector (sv_i⁺ ∈ SV⁺) and a "failure" support vector (sv_j⁻ ∈ SV⁻), the evaluation value g(sv_i⁺, sv_j⁻) is calculated based on Equation 29 or Equation 30 (step 8-1-2). Next, the pairs of a "success" support vector (sv_i⁺ ∈ SV⁺) and a "failure" support vector (sv_j⁻ ∈ SV⁻) are sorted in descending order of the evaluation value g(sv_i⁺, sv_j⁻) (step 8-1-3). Next, from the top of the sorted g(sv⁺, sv⁻), a fixed number of "success" support vectors (sv_i⁺ ∈ SV⁺) and "failure" support vectors (sv_j⁻ ∈ SV⁻) are selected (for example, in this embodiment, two distinct samples of each class, sv₁⁺, sv₂⁺, sv₁⁻, and sv₂⁻) and used as the item evaluation sample (D) (step 8-1-4). Next, pairs that include one of the selected item evaluation samples (D) (for example, sv₁⁺, sv₂⁺, sv₁⁻, sv₂⁻) are selected in descending order of the evaluation value g(sv⁺, sv⁻) and added to the item evaluation sample (D) (step 8-1-5).
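As a non-limiting illustration of steps 8-1-1 to 8-1-4, the following sketch ranks pairs of one "success" and one "failure" support vector and keeps the best pairs. The equations defining the evaluation value are not reproduced in this text, so the score used here, α_i · α_j · K(X_i, X_j), is only an assumption motivated by the statement that important and mutually similar support vectors are preferred. The function names, the tuple layout, and the RBF form are likewise hypothetical.

```python
import math


def rbf(x, z, sigma2=1.0):
    """RBF similarity exp(-||x - z||^2 / sigma2); the exact kernel form is assumed."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma2)


def select_item_evaluation_pairs(support_vectors, n_pairs=2, kernel=rbf):
    """support_vectors: list of (x, y, alpha) with y = +1 (success) or -1 (failure)."""
    # Step 8-1-1: split the support vectors by class.
    pos = [(i, sv) for i, sv in enumerate(support_vectors) if sv[1] > 0]
    neg = [(j, sv) for j, sv in enumerate(support_vectors) if sv[1] < 0]
    # Step 8-1-2: score every (success, failure) pair with the assumed score.
    scored = [(p[2] * n[2] * kernel(p[0], n[0]), i, j)
              for i, p in pos for j, n in neg]
    # Step 8-1-3: sort in descending order of the score.
    scored.sort(reverse=True)
    # Step 8-1-4: take pairs from the top, using each sample at most once.
    chosen, used = [], set()
    for g, i, j in scored:
        if i in used or j in used:
            continue
        chosen.append((i, j, g))
        used.update((i, j))
        if len(chosen) == n_pairs:
            break
    return chosen


svs = [((0.0,), 1, 1.0), ((0.1,), -1, 1.0), ((5.0,), 1, 0.5), ((5.2,), -1, 0.5)]
print([(i, j) for i, j, _ in select_item_evaluation_pairs(svs)])
```

Step 8-1-5, which extends the selection with further pairs that share a sample with those already chosen, is omitted from this sketch.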
[0142]
When the data mining system and the analyst interact and cooperate to specify additional items, the item evaluation samples (D) are presented to the analyst as corresponding pairs, and input of candidate data items to be added is accepted (step 8-2-1 in FIG. 4). The analyst compares and evaluates the item evaluation samples (D) and inputs to the data mining system the data items to be added and the values of those items for each sample (step 8-2-2 in FIG. 4).
[0143]
On the other hand, the data mining system may specify additional items automatically without relying on the analyst's knowledge, for example, when the analyst does not have sufficient knowledge to recommend data items to the system but some candidate data items and their values for the item evaluation samples (D) are known, as when they are available in the external database 3 serving as an information source. In that case, additional items are specified from an information source such as the external database 3 according to, for example, a predetermined rule, and the processing shown in the flowchart of FIG. 6 is performed. That is, unused data items a and their values for the item evaluation samples (D) are retrieved from the external database 3 (step 8-2-1′). Then, the index value I_a represented by the following equation is calculated for each unused data item a (step 8-2-2′). Here, x_ia and x_ja are the values of data item a for samples i and j, respectively. min(0, ·) is the function that returns the smaller of 0 and its argument: it returns the argument if the argument is negative and 0 otherwise.
[0144]
[Equation 44]

[0145]
Then, data items are selected as recommended items in ascending order of the index value I_a (step 8-2-3′). That is, the data item a is selected that minimizes the sum of the products of the values of the unused data item for the success samples D⁺ and the failure samples D⁻ whenever those values have opposite signs. As a result, a data item a is selected for which the success samples D⁺ strongly tend to take values larger (or smaller) than the average while the failure samples D⁻ tend to take values smaller (or larger) than the average. Then, from the recommended items, an item effective for differentiating the similar cases is selected as an additional item (step 8-2-4′). Step 8-2-4′ may be omitted, and the items selected in step 8-2-3′ may be used directly as additional items.
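As a non-limiting illustration of steps 8-2-2′ and 8-2-3′, the following sketch computes an index value for each candidate item and lists the candidates in ascending order. The equation for I_a is not reproduced in this text. The form used here, the sum over pairs (i ∈ D⁺, j ∈ D⁻) of min(0, x_ia · x_ja) with values taken as deviations from the mean over the item evaluation samples, is an assumption based on the surrounding description. The item names and data are invented for the example.

```python
def index_value(pos_values, neg_values):
    """Assumed I_a: sum of min(0, x_ia * x_ja) over success/failure pairs.

    Values are centered on the mean over the item evaluation samples so that
    "opposite sign" means "opposite side of the average" (an assumption).
    """
    values = list(pos_values) + list(neg_values)
    mean = sum(values) / len(values)
    return sum(min(0.0, (xi - mean) * (xj - mean))
               for xi in pos_values for xj in neg_values)


def recommend_items(candidates, top_k):
    """candidates: {item name: (values for D+, values for D-)}."""
    scores = {name: index_value(pos, neg) for name, (pos, neg) in candidates.items()}
    # Step 8-2-3': ascending order, so the most negative index comes first.
    return sorted(scores, key=scores.get)[:top_k]


candidates = {
    "transaction_record": ([1.0, 1.0], [0.0, 0.0]),  # separates D+ from D-
    "noise": ([1.0, 0.0], [1.0, 0.0]),               # unrelated to the class
}
print(recommend_items(candidates, 1))
```

An item that places D⁺ and D⁻ on opposite sides of the average accumulates many negative products and therefore ranks first.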
[0146]
In selecting an effective item (additional item) from the recommended items (step 8-2-3 or step 8-2-4′), the present embodiment, for example, evaluates the effectiveness of each recommended item based on its actual values and narrows down the additional items accordingly. A typical method for selecting a data item set A ⊂ A₀ from a set A₀ of multiple data items is the variable increase (forward selection) method. This method uses a function eval(A) that evaluates the desirability of the learning result obtained with a given data item set A, and adds variables one at a time according to the procedure shown in the flowchart (steps 801 to 809).
[0147]
That is, first, A_use = φ, A_nouse = A, and k = 0 are set (steps 801 to 803). Next, for every unused data item a ∈ A_nouse, the evaluation eval(A_use ∪ {a}) is calculated (step 804). Next, the data item that gives the maximum value of the evaluation eval(A_use ∪ {a}) obtained in step 804 is taken as a_k (step 805). If the addition brings no improvement, that is, if eval(A_use ∪ {a_k}) ≦ eval(A_use) (step 806; Yes), the addition ends. If the addition brings an improvement, that is, if eval(A_use ∪ {a_k}) > eval(A_use) (step 806; No), then A_use = A_use ∪ {a_k}, A_nouse = A_nouse − {a_k}, and k = k + 1 are set (steps 807 to 809), and the process returns to step 804.
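As a non-limiting illustration, the variable increase procedure of steps 801 to 809 can be sketched as a greedy loop. The function name `forward_select`, the toy evaluation function, and the item names are hypothetical. In practice `evaluate` would be whatever eval(A) the analyst chooses, for example an accuracy estimate of the discrimination knowledge learned with item set A.

```python
def forward_select(candidates, evaluate):
    """Greedy forward selection following steps 801 to 809."""
    used = []                      # A_use = empty set   (step 801)
    unused = list(candidates)      # A_nouse             (step 802)
    best = evaluate(frozenset(used))
    while unused:
        # Step 804: evaluate A_use + {a} for every unused item a.
        scored = [(evaluate(frozenset(used + [a])), a) for a in unused]
        # Step 805: a_k is the item giving the maximum evaluation.
        score, a_k = max(scored, key=lambda t: t[0])
        # Step 806: stop when adding a_k brings no improvement.
        if score <= best:
            break
        # Steps 807 to 809: move a_k from A_nouse to A_use and continue.
        used.append(a_k)
        unused.remove(a_k)
        best = score
    return used


# Toy evaluation: each item contributes a fixed (hypothetical) gain.
gains = {"x": 3.0, "y": 1.0, "z": -2.0}
print(forward_select(gains, lambda items: sum(gains[a] for a in items)))
```

Each pass calls `evaluate` once per unused item, so the cost grows quickly when every evaluation requires re-learning.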
[0148]
However, in an SVM, unlike re-learning after samples are added, re-learning after the data items used or the variance σ² of the data items is changed requires a calculation cost as large as that of learning from scratch. In particular, finding the appropriate σ² for the data items by cross-validation is very costly. Therefore, the present embodiment adopts a simple method in which the optimal σ² is obtained using all the recommended items, and a fixed number of unused data items with the largest σ² values are taken as the most appropriate data items. As relatively fast methods of finding the optimal σ² for a data item set A in an SVM, solution methods have conventionally been proposed that apply a gradient method, a quasi-Newton method, or the like using the derivative ∂T/∂log(σ²) and the second derivative ∂²T/∂log(σ²)² of an evaluation function T(A) given by the wrong-answer rate (generalization error, test error) for unknown data, and these methods can be adopted. With these methods, the amount of calculation is about one tenth of that of cross-validation or the like, even in low-dimensional cases.
[0149]
FIGS. 8 and 9 show examples of the configuration of the data mining system according to the present embodiment. The data mining system 10 shown in FIG. 8 includes at least the analysis database 1, discrimination knowledge generating means 101 for executing the processing of step 2, discrimination knowledge verifying means 102 for executing the processing of step 3 (see FIG. 2) and the judgment processing of steps 4, 5, and 7, case data adding means 103 for executing the processing of step 6, means for executing the processing of step 8-1 (see FIG. 5) and steps 8-2-1 and 8-2-2 (see FIG. 4), and item data adding means 105 for executing the processing of steps 8-2-3 and 8-3. The data mining system 10' shown in FIG. 9 includes at least the analysis database 1, discrimination knowledge generating means 101 for executing the processing of step 2, discrimination knowledge verifying means 102 for executing the processing of step 3 (see FIG. 2) and the judgment processing of steps 4, 5, and 7, case data adding means 103 for executing the processing of step 6, similar case specifying means 104' for executing the processing of step 8-1 (see FIG. 5), unused item data search means 107 for executing the processing of step 8-2-1' (see FIG. 6), and item data adding means 105' for executing the processing of steps 8-2-2' to 8-2-4' (see FIG. 6) and step 8-3. The data mining program according to the present embodiment is installed and set up on a computer to cause the computer to function as the data mining system 10 or 10'. FIG. 10 shows an example of a data flow diagram for the data mining method and system of the present embodiment.
[0150]
Note that the discriminant function f(x) serving as discrimination knowledge, the method of selecting the item evaluation sample (D) as similar cases, the method of selecting additional items from recommended items, and the like are not necessarily limited to the above examples. Hereinafter, a preferred example of another method of selecting the item evaluation sample (D) as similar cases and another method of selecting additional items from recommended items will be described as a second embodiment of the present invention.
[0151]
In the present embodiment, the radius-margin bound r_M is used as the estimated wrong-answer rate for new data. This is because LOO is computationally expensive, while the span bound has many local minima and is difficult to use in optimization. However, as long as the error rate (generalization error) of the discriminant function based on the analysis database 1 after the addition of a data item can be approximated in quadratic form, not only the radius-margin bound but also the span bound or the like may be used as the estimated wrong-answer rate.
[0152]
Let r_M(θ_a) be the estimated error rate of the new discriminant function obtained when the values x_ia of a data item a not in the analysis database 1, multiplied by θ_a^(1/2) (where θ_a ≧ 0 and θ_a ≒ 0), are added. Then the following equation holds.
[0153]
[Equation 45]

where
r_M(0): the estimated wrong-answer rate when the data item a is not added.
[0154]
In selecting the additional item, it is preferable to find the data item a that most reduces the estimated error rate of the new discriminant function after the addition of the data item (the radius-margin bound r_M). Therefore, in the present embodiment, based on Expression 45, the data item a that maximizes the rate of decrease of the wrong-answer rate, −dr_M/dθ_a, is selected.
[0155]
For the main kernels, the rate of decrease −dr_M/dθ_a takes the quadratic form X_{:a}^T G X_{:a}. X_{:a} is the vector of the values of item a for all the samples in the analysis database 1, and X_{:a}^T is its transpose. For example, Theorem A below holds for the linear and polynomial kernel functions in the unbalanced c-SVM, and Theorem B below holds for the RBF kernel function.
[0156]
[Theorem A]
For kernel functions of the form k1(X_A^T · Z_A), −dr_M/dθ_a has the quadratic form X_{:a}^T G X_{:a}. In this case, G is represented by the following equation.
[0157]
[Equation 46]

where
diag⁻¹("vector"): the diagonal matrix having "vector" as its diagonal elements.
"matrix" .* "matrix": the element-wise product of the two matrices.
diag("matrix"): the vector of the diagonal elements of "matrix".
K1: the matrix whose element in row i and column j is k1(X_iA^T X_jA + θ_a x_ia x_ja).
α: the vector of the weights α_i of all samples i obtained by Expression 14.
β: the vector of the weights β_i of all samples i obtained by Expression 26.
Y: the vector of the class values y_i of all samples.
r: the radius determined by Expression 28.
[0158]
[Theorem B]
For kernel functions of the form k2(−‖X_A − Z_A‖²), −dr_M/dθ_a has the quadratic form X_{:a}^T G X_{:a}. In this case, G is represented by the following equation.
[0159]
[Equation 47]

where
[Equation 48]

1_p: the vector of length p whose elements are all 1.
K2: the matrix whose element in row i and column j is k2(−‖X_iA − X_jA‖² − θ_a (x_ia − x_ja)²).
[0160]
Note that G · 1_p = 0 holds in Theorem B. This corresponds to the degree of freedom that the values x_ia − x_ja remain unchanged even if all the values of the data item a are translated by the same amount.
[0161]
Theorems A and B above can be shown in the same way for other SVMs, for example, an SVM that does not allow intrusion or an SVM that imposes a penalty proportional to the squared intrusion ξ². The matrix G that maps X_{:a} to the rate of decrease of the wrong-answer rate (test error) due to the introduction of an additional item, −dr_M/dθ_a = X_{:a}^T G X_{:a}, is called the test error reduction matrix.
[0162]
Next, a discrete approximation of the test error reduction matrix G will be described. G, a p × p matrix, has non-zero values only in the rows and columns corresponding to SV and RV, and all other rows and columns are zero. Since SV ∪ RV is generally smaller than the number p of samples in the analysis database 1, G is already a sparse matrix. However, SV ∪ RV may still be large. It is therefore preferable to obtain a sparser matrix G′ for estimating the amount of decrease −dr_M/dθ_a. The simplest way to make a matrix sparse is to set the elements below a certain threshold to zero, but this operation may significantly change the large eigenvalues and the values of the quadratic form. Therefore, the present embodiment performs the following approximation. First, as shown in the following equation, the variance of the values of each item is assumed to be constant.
[0163]
[Equation 49]
X_{:a}^T X_{:a} = Σ_i x_ia² = 1
[0164]
Further, when an RBF kernel (or another kernel of the form in Theorem B) is used, the average of the values of each item is assumed to be 0, as shown in the following equation, corresponding to G · 1_p = 0.
[0165]
[Equation 50]
X_{:a}^T · 1_p = 0
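As a non-limiting illustration, once a test error reduction matrix G is available, candidate items can be compared by normalizing each item's value vector to mean 0 and unit norm, as in the two constraints above, and evaluating the quadratic form X^T G X. The matrix G, the item names, and the values below are invented for this sketch; computing an actual G requires the theorems' equations, which are not reproduced in this text.

```python
import math


def normalize_item(values):
    """Center to mean 0 and scale to unit Euclidean norm."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        raise ValueError("a constant item carries no information")
    return [c / norm for c in centered]


def quadratic_form(G, x):
    """Evaluate x^T G x for a square matrix G given as a list of rows."""
    return sum(x[i] * G[i][j] * x[j]
               for i in range(len(x)) for j in range(len(x)))


def rank_items(G, items):
    """Return item names sorted by decreasing x^T G x, plus the scores."""
    scores = {name: quadratic_form(G, normalize_item(vals))
              for name, vals in items.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical 3-sample matrix whose rows sum to zero (G . 1_p = 0).
G = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
items = {"a": [1.0, -1.0, 0.0], "b": [1.0, 1.0, -2.0]}
ranking, scores = rank_items(G, items)
print(ranking, scores)
```

Because every item is rescaled to the same norm, the ranking reflects only the direction of the item's value vector, not its units.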
[0166]
Under the constraint X_{:a}^T X_{:a} = 1, the X_{:a} that maximizes the quadratic form X_{:a}^T G X_{:a} is the eigenvector of G corresponding to its largest eigenvalue, and an item whose per-sample values are proportional to that eigenvector reduces the test error estimate r_M the most. Conversely, the eigenvector corresponding to the negative eigenvalue of largest absolute value increases the test error estimate r_M the most. Therefore, as shown in the following equation, G is approximated by the m eigenvalues of largest absolute value and their eigenvectors, and the other directions are ignored.
[0167]
(Equation 51)

However,
(Equation 52)

However,
[Equation 53]

However,
(Equation 54)

h_i: one of the m eigenvalues of the test error reduction matrix G having the largest absolute values.
H: the diagonal matrix having those m eigenvalues as its diagonal elements.
v_i: the eigenvector corresponding to the eigenvalue h_i.
V: the matrix in which the eigenvectors corresponding to the eigenvalues h_1, ..., h_m are arranged.
I_{m×m}: the m × m identity matrix.
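The approximation by the m eigenvalues of largest absolute value can be sketched as follows. This is only an illustrative sketch: the function name `top_abs_eig_approx`, the use of NumPy, and the random symmetric test matrix are assumptions, not part of the described system, and G is assumed to be symmetric.

```python
import numpy as np

def top_abs_eig_approx(G, m):
    """Approximate a symmetric matrix G by its m eigenpairs of largest |eigenvalue|."""
    eigvals, eigvecs = np.linalg.eigh(G)        # eigendecomposition of symmetric G
    idx = np.argsort(-np.abs(eigvals))[:m]      # indices of the m largest |h_i|
    h = eigvals[idx]                            # eigenvalues h_1, ..., h_m
    V = eigvecs[:, idx]                         # p x m matrix of eigenvectors v_i
    G_approx = V @ np.diag(h) @ V.T             # V H V^T, other directions ignored
    return h, V, G_approx

# Hypothetical toy data: a random symmetric 6 x 6 matrix standing in for G.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
G_toy = (A + A.T) / 2
h, V, G_approx = top_abs_eig_approx(G_toy, 3)
```

With m equal to p the approximation reproduces G exactly; smaller m keeps only the directions that change the quadratic form the most in either sign.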
[0168]
Further, the p × p matrix G″ is approximated by a sparse matrix G′. For this purpose, under the condition U^T U = I_{p×p}, a decomposition of the form shown in the following equation is obtained in which most elements of U are close to 0 and only a small number of elements are close to ±1.
[0169]
(Equation 55)
G″ = U B U^T
However,
U: a p × p rotation matrix.
B: a p × p matrix.
U^T: the transpose of U.
[0170]
Under the constraint Σ_i u_i^2 = 1, the vector u that maximizes Σ_i u_i^4 has one element equal to ±1 and all others equal to 0. Based on this observation, in order to obtain a U having many elements close to zero, the fourth-order programming problem represented by the following equation is solved.
[0171]
[Equation 56]

However,
[Equation 57]

[Equation 58]

u_{i,k}: the element in row i and column k of the rotation matrix U.
Q: a rotation matrix.
[0172]
The fourth-order programming problem shown in Equation 56 can be solved, for example, by the processing shown in FIG. 19. First, Q is initialized so that Q^T Q = I (step 201). Next, 0 is substituted for Q_old (step 202). Next, it is determined whether the following equation is satisfied (step 203).
[0173]
[Equation 59]
1 − min(|diag(Q^T Q_old)|) > ε
However,
ε: a tolerance (for example, about 10^{-10}).
[0174]
If Equation 59 does not hold (step 203; No), the following steps 204 to 207 are repeated until it holds. That is, Q is substituted for Q_old (step 204). Then V Q^T is substituted for U (step 205). Then (U^T)^{.3} V is substituted for Q_L (step 206). Then Q_L (Q_L^T Q_L)^{-1/2} is substituted for Q (step 207). Here, "matrix"^{.n} denotes the n-th power of each element of "matrix".
[0175]
If Equation 59 is satisfied (step 203; Yes), Q H Q^T is substituted for B (step 208). Through the above processing, the values of U and B are obtained.
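Steps 201 to 208 can be sketched as the following fixed-point iteration. This is a minimal sketch under stated assumptions: the function name `sparsifying_rotation`, the `max_iter` safeguard, and the toy data are hypothetical, and the test of step 203 is read here as a convergence test that ends the loop once 1 − min(|diag(Q^T Q_old)|) falls below ε.

```python
import numpy as np

def sparsifying_rotation(V, H, eps=1e-10, max_iter=1000):
    """Fixed-point iteration over a rotation Q so that U = V Q^T has large fourth moments."""
    m = V.shape[1]
    Q = np.eye(m)                                    # step 201: Q^T Q = I
    Q_old = np.zeros((m, m))                         # step 202
    for _ in range(max_iter):                        # max_iter is an added safeguard
        # step 203 (assumed reading): stop once Q has stopped changing
        if 1.0 - np.min(np.abs(np.diag(Q.T @ Q_old))) < eps:
            break
        Q_old = Q                                    # step 204
        U = V @ Q.T                                  # step 205
        Q_L = (U.T ** 3) @ V                         # step 206: element-wise cube
        w, E = np.linalg.eigh(Q_L.T @ Q_L)           # step 207: Q_L (Q_L^T Q_L)^(-1/2)
        Q = Q_L @ (E @ np.diag(w ** -0.5) @ E.T)
    U = V @ Q.T
    B = Q @ H @ Q.T                                  # step 208
    return U, B, Q

# Hypothetical toy input: 3 eigenpairs of largest |eigenvalue| of a random symmetric matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
w_all, E_all = np.linalg.eigh((A + A.T) / 2)
idx = np.argsort(-np.abs(w_all))[:3]
V, H = E_all[:, idx], np.diag(w_all[idx])
U, B, Q = sparsifying_rotation(V, H)
```

Because Q stays orthogonal, U B U^T equals V H V^T at every iteration, so the rotation changes only how the approximation is factored, not the approximation itself.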
[0176]
The grounds on which the process shown in FIG. 19 can solve the fourth-order programming problem of Equation 56 are described below. Let Λ = (λ_{i,j}) be the matrix of Lagrange undetermined multipliers λ_{ij} corresponding to the constraint Q^T Q = I. Then the following equation holds at the optimum.
[0177]
[Equation 60]

However,
trace("matrix"): the sum of the diagonal components of "matrix".
[0178]
Letting L = Λ + Λ^T and Q_L = Q·L, the following equations hold.
[0179]
[Equation 61]

(Equation 62)

[Equation 63]

[Equation 64]

[0180]
Therefore, the optimum value can be obtained by the above fixed point algorithm.
[0181]
Then, the matrix U is approximated by U′ by setting the elements of U whose absolute values are below a threshold to 0. Since U has many elements close to zero, U′ is a sparser matrix, and G′ given by the following equation is a sparse approximation that substantially preserves the principal eigenvectors of G.
[0182]
[Equation 65]
G′ = U′ B U′^T
[0183]
As described above, the test error reduction −dr_M/dθ_a can be estimated by X_{:a}^T G′ X_{:a} instead of X_{:a}^T G X_{:a}. If the set of samples corresponding to the non-zero rows of U′ is taken as the item evaluation samples (D), G′ has non-zero values only in the rows and columns corresponding to the item evaluation samples (D), and all other rows and columns are zero. Therefore, the reduction in test error −dr_M/dθ_a due to the introduction of an additional item can be estimated by the expression X_{:a}^T G′ X_{:a} based only on the data item values of the small sample set (D).
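The thresholding of U and the resulting item evaluation samples (D) can be illustrated as below. The function name `sparsify`, the threshold value, and the small matrices are hypothetical values chosen only to make the zero pattern visible.

```python
import numpy as np

def sparsify(U, B, threshold):
    """Zero small elements of U, then return U', the sample index set D, and G'."""
    U_s = np.where(np.abs(U) < threshold, 0.0, U)      # U': elements below threshold set to 0
    D = np.flatnonzero(np.any(U_s != 0.0, axis=1))     # samples with a non-zero row in U'
    G_s = U_s @ B @ U_s.T                              # G' = U' B U'^T
    return U_s, D, G_s

# Hypothetical example with p = 4 samples and 2 retained directions.
U = np.array([[0.99, 0.05],
              [0.02, -0.98],
              [0.10, 0.03],
              [0.01, 0.12]])
B = np.diag([2.0, -1.0])
U_s, D, G_s = sparsify(U, B, threshold=0.5)

# The quadratic form needs only the item values of the samples in D.
x = np.array([1.0, -2.0, 3.0, 0.5])
full = x @ G_s @ x
restricted = x[D] @ G_s[np.ix_(D, D)] @ x[D]
```

In this toy case only samples 0 and 1 survive the threshold, so `full` and `restricted` coincide and the values of the other two samples are never needed.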
[0184]
An example of a process for selecting the item evaluation samples (D) as similar cases based on the above theory will now be described. It is assumed that the sum of squares of each item's values over the samples in the analysis database 1 has been normalized to a constant value. When an RBF kernel (or another kernel of the form given in Theorem B) is used, it is also assumed that the average of each item's values over the samples in the analysis database 1 has been normalized to zero. First, r and β are obtained based on Equations 26 and 28. Next, Vs is set to SV∪RV. Here, Vs is the set of samples that are support vectors or points on the hypersphere of radius r in the high-dimensional feature space F. Vs is therefore the set of samples whose corresponding rows and columns of the test error reduction matrix G are non-zero. Once Vs has been obtained, only the elements for Vs need to be computed when calculating the test error reduction matrix G, which simplifies the calculation. Next, the test error reduction matrix G is obtained based on Equation 46 or Equation 47. Next, the m eigenvalues h of G having the largest absolute values and the eigenvector matrix V are obtained. Next, the fourth-order programming problem shown in Equation 56 is solved by the processing shown in FIG. 19. Next, U′ is obtained by discretizing the matrix U, setting elements whose absolute values are smaller than the threshold to zero. Next, the set of samples corresponding to the non-zero rows of U′ is taken as the item evaluation samples (D). In this way, the set of samples most useful for selecting a data item is obtained.
[0185]
Further, an example of a process, based on the above theory, for selecting an additional item from recommended items presented by the analyst or from recommended items prepared in advance in, for example, the external database 3 serving as an information source will be described. In this case, the reduction in test error −dr_M/dθ_a due to the introduction of an additional item is calculated based on Equation 65. That is, using the matrix G′, the reduction in test error −dr_M/dθ_a obtained when each recommended item is added is evaluated by the following equation on the item evaluation samples (D). Then, the recommended item with the maximum evaluation value (that is, the item giving the largest reduction in test error) is selected as the additional item. In this way, the additional item most effective for improving the accuracy of the discrimination knowledge is selected.
[0186]
[Equation 66]

However,
A_R: the set of recommended items.
X_{D,a}: the vector of the values of item a for the item evaluation samples (D).
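The selection of an additional item from the recommended items can be sketched as follows, assuming that the evaluation value of each recommended item a is the quadratic form X_{D,a}^T G′ X_{D,a} restricted to the item evaluation samples (D). The function name `select_additional_item`, the item names, and the numbers are hypothetical.

```python
import numpy as np

def select_additional_item(G_DD, X_D, names):
    """Score each candidate item by x^T G' x over the samples in D and return the best one."""
    # X_D has one row per item evaluation sample and one column per recommended item.
    scores = np.einsum("ia,ij,ja->a", X_D, G_DD, X_D)
    best = int(np.argmax(scores))                      # largest estimated error reduction
    return names[best], scores

# Hypothetical G' restricted to two evaluation samples; it rewards items whose
# values differ between the two samples.
G_DD = np.array([[1.0, -1.0],
                 [-1.0, 1.0]])
X_D = np.array([[1.0, 1.0, 0.5],     # values of items a, b, c for sample 1
                [1.0, -1.0, 0.0]])   # values of items a, b, c for sample 2
best_item, scores = select_additional_item(G_DD, X_D, ["a", "b", "c"])
```

Item "a" takes the same value for both samples and scores 0, whereas item "b" takes opposite values and scores highest, which matches the intent of choosing an item that differentiates the similar cases.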
[0187]
[Examples]
[Example 1]
In order to confirm the effectiveness of the data mining method and data mining system described above, they were applied to the discrimination problem of estimating the status of local currency introduction in each municipality, based on commercially available municipal data (civil power data) and data on municipalities with a record of local currency activities. If the evaluation value f(X) obtained by applying the resulting discrimination knowledge to a municipality that has not introduced a local currency is comparable to that of municipalities with an introduction record, the municipality is judged to have a high possibility of introduction.
[0188]
As municipalities that have introduced local currencies, 48 municipalities that have already introduced or are considering local currencies (see Table 1) were identified by searching the Internet. Preliminary analysis of these 48 municipalities revealed that they comprise 8 small municipalities and 40 large or medium-sized cities. For this reason, the present embodiment aims to discriminate the 40 cities, excluding the 8 small municipalities (those enclosed in parentheses in Table 1).
[0189]
[Table 1]

[0190]
As the civil power data, data on 122 items for 3,384 municipalities nationwide were extracted from the civil power CD-ROM (1999 edition) published by the Asahi Shimbun. For the data items marked with * in Table 2, data items representing changes were created from the original data items. Each item was scaled to mean 0 and variance 1.
[0191]
[Table 2]

[0192]
The purpose of the present embodiment is to obtain discrimination knowledge that characterizes the 40 municipalities (see Table 1) where a local currency has been introduced. The internal database 2 in the present embodiment was a database containing only 14 data items, namely the population by age of 3,384 municipalities nationwide (1998 data; five-year brackets from 0 to 64 years, plus 65 years and over). The external database 3 in this embodiment is a database of the 122 data items shown in Table 2. The total number of samples in the external database 3 was 3,384 municipalities. The initial analysis database 1 was constructed by extracting from the internal database 2 the values of all data items (the 14 population-by-age items) for the 40 municipalities that have introduced a local currency and for 40 randomly selected municipalities that have not.
[0193]
The discrimination knowledge in the present embodiment makes its determination based on similarity to representative municipalities {X_i}. In this embodiment, the RBF kernel function RBF_I(X_A, Z_A) was used. Therefore, the value of the discriminant function shown in the following equation indicates the possibility of introducing a local currency.
[0194]
[Equation 67]

[0195]
The determination is given by the sign of the discriminant function f(Z_A), a positive sign indicating introduction, together with the coefficients α_i and the data item set A used for the determination.
[0196]
An unbalanced c-SVM was used to generate the discrimination knowledge. Since the difference between the numbers of “success” and “no” samples is large, the exception costs were set to C⁺ = 200 and C⁻ = 10. When the exception cost for the “success” samples is increased in this manner, errors on the small number of “success” samples are weighted more heavily than errors on the large number of “no” samples, which reduces the risk of ignoring errors on the minority samples.
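The effect of the asymmetric exception costs can be illustrated with the penalty term alone. This sketch assumes the standard soft-margin slack max(0, 1 − y·f(x)); the function name `weighted_slack_cost` and the sample values are hypothetical, and the full c-SVM objective used in the embodiment is not reproduced here.

```python
def weighted_slack_cost(labels, margins, c_pos=200.0, c_neg=10.0):
    """Sum of C+ * slack over "success" (+1) samples and C- * slack over "no" (-1) samples."""
    total = 0.0
    for y, f in zip(labels, margins):
        slack = max(0.0, 1.0 - y * f)                 # assumed hinge-type slack
        total += (c_pos if y > 0 else c_neg) * slack  # class-dependent exception cost
    return total

# Hypothetical discriminant values f(x) for one "success" and two "no" samples.
labels = [+1, -1, -1]
margins = [0.5, -0.5, 0.2]
cost = weighted_slack_cost(labels, margins)
```

The single “success” sample with slack 0.5 contributes 100 to the cost, while the two “no” samples with total slack 1.7 contribute only 17, so the optimizer is pushed to fit the minority class first.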
[0197]
In the present embodiment, the data mining system automatically performs the process of specifying the additional item (see FIG. 6). Further, the data item determined to be the most promising in step 8-2-3′ is adopted as is. This makes it possible to confirm that the accuracy of the discrimination knowledge improves even if the analyst's recommendation is not always optimal.
[0198]
The generated discrimination knowledge was verified, and a required number of samples were sequentially added (step 6). After data items had been added 20 times, the number of municipalities whose introduction status was incorrectly estimated had been reduced to 12, and the analysis was terminated at that point. Table 3 shows the added data items.
[0199]
[Table 3]

[0200]
Table 4 shows the municipalities that, after the addition of the 20 data items shown in Table 3, were erroneously determined by the obtained discrimination knowledge to have a possibility of introducing a local currency.
[0201]
[Table 4]

[0202]
In Table 4, the municipalities are arranged in descending order of the possibility of introduction determined from the discrimination knowledge. The 12 municipalities in Table 4 are those erroneously determined by the discrimination knowledge (f(X) > 0). Sakata City and Minami Ward of Kyoto City lie fully within the region of the introducing cities (f(X) ≧ 1). In other words, the finally obtained discriminant function f(X) determines that these cities are not “no” samples (that is, that they are highly likely to introduce, or have already introduced, a local currency).
[0203]
FIG. 17 shows how the obtained discrimination knowledge is refined as data items are added. The figure shows whether cities across the country were correctly discriminated by the discrimination knowledge obtained at each point. In the figure, “○” indicates erroneous determinations for samples that are actually “success” (introduced), and “+” indicates erroneous determinations for samples that are actually “no” (not introduced). The solid-line graphs show the number of samples for which the determination by the discriminant function differed from the actual “success” or “no”, and the dotted-line graphs show the number of samples that also includes those correctly determined but with low reliability (exception samples). In other words, the solid lines count the actual erroneous determinations, and the dotted lines additionally count the samples with low reliability. As is clear from the figure, erroneous determinations for samples that are actually “success” disappear after the seventh data item is added, and “success” samples with low reliability disappear after the ninth data item is added. In contrast, the number of erroneous determinations for samples that are actually “no” and the number of “no” samples with low reliability decrease as data items are added but do not disappear completely even after 20 data items have been added. The downward trend slows after the twelfth data item is added: from then on, the number of misjudged “no” samples stays at around 10, and the number of “no” samples with low reliability goes from 86 to 55.
This gentle slope indicates that, after the twelfth addition, data items are being added in an attempt to discriminate a small number of exceptional “no” samples, with only limited effect. It is therefore conceivable to use as discrimination knowledge only the first 12 data items, for which the addition of data items is effective.
[0204]
As described above, it was confirmed that the accuracy of the discrimination knowledge improves effectively when additional data items are determined by focusing on only a small number of item evaluation samples and examining the data for those samples, without looking at the data for all samples.
[0205]
Next, the manner in which data items are added will be described using the first data item addition as an execution example. In the first generation of the discrimination knowledge, the system calculates, from the data items (population by age), the discriminant function f(X) for discriminating whether a sample is “success” or “no”. Table 5 below shows the result of the discrimination based on the first discrimination knowledge. Note that the number of all “success” samples is 40 and the number of all “no” samples is 3384. The number of samples with low reliability includes the number of erroneously determined samples.
[0206]
[Table 5]

[0207]
Then, the pairs of “success” and “no” samples shown in Table 6 were selected as similar cases by the method of selecting the item evaluation samples (D) using the RBF kernel (see FIG. 5).
[0208]
[Table 6]

[0209]
For example, the pairs Sekimae-mura and Hayakawa-cho, Sekimae-mura and Ota-mura, and Sekimae-mura and Sakuki-mura cannot be properly distinguished with the current discrimination knowledge (α_i > 0). Although they take almost the same values for the data items (population by age) in the analysis database 1 (similarity RBF_I(X_i, X_j) of 100%), Sekimae-mura has introduced a local currency (“success”) whereas Hayakawa-cho, Ota-mura, and Sakuki-mura have not (“no”). The data mining system calls for the addition of new data items that can effectively discriminate the presented sample pairs.
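The notion of a similar case, a pair of samples with nearly identical item values but different classes, can be illustrated with a simple pairwise search. This is a simplified stand-in and not the selection method of the embodiment, which derives the item evaluation samples (D) from the matrix G′. The Gaussian form exp(−‖x − z‖² / (2σ²)), the names `rbf_similarity` and `similar_opposite_pairs`, and the data are assumptions.

```python
import math

def rbf_similarity(x, z, sigma=1.0):
    """Assumed Gaussian RBF similarity in (0, 1]; 1.0 means identical item values."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def similar_opposite_pairs(samples, labels, min_sim=0.98, sigma=1.0):
    """List (i, j, similarity) for pairs with different labels and similarity >= min_sim."""
    pairs = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if labels[i] != labels[j]:
                s = rbf_similarity(samples[i], samples[j], sigma)
                if s >= min_sim:
                    pairs.append((i, j, s))
    return sorted(pairs, key=lambda t: -t[2])          # most similar pairs first

# Hypothetical scaled item values: sample 0 is "success" (+1), the rest are "no" (-1).
samples = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (0.0, 0.05)]
labels = [+1, -1, -1, -1]
pairs = similar_opposite_pairs(samples, labels)
```

Sample 2 is far from sample 0 and is not reported, while samples 3 and 1 are nearly indistinguishable from sample 0 despite belonging to the other class.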
[0210]
In the present embodiment, the data mining system automatically selects, by the processing shown in FIG. 6, recommended items considered effective for differentiating the similar cases, without depending on the knowledge of the analyst. The resulting recommended data items are shown in Table 7 in order of recommendation.
[0211]
[Table 7]

[0212]
In the data mining system, a data item considered effective for improving the discrimination knowledge is selected from the recommended items. In the present embodiment, the data item considered the most promising among the recommended items (age-specific composition ratio, 45-64 years) is simply selected. The actual data of the selected data item for all samples in the analysis database 1 are then acquired from the external database 3 and added to the analysis database 1, and the discrimination knowledge is updated using the new analysis database 1. The determination results based on the new discrimination knowledge are shown in Table 8 below. The number of samples with low reliability includes the number of erroneously determined samples.
[0213]
[Table 8]

[0214]
With the new discrimination knowledge, the number of misclassified “success” samples was reduced from 11 to 6. By repeating this process, the accuracy of the knowledge is improved through the addition of data items.
[0215]
In this embodiment, the data mining system automatically performs the process of specifying the additional item without relying on the knowledge of the analyst (see FIG. 6). However, additional items effective for refining the discrimination knowledge may instead be specified while the data mining system and the analyst interact and cooperate (see FIG. 4). Hereinafter, the point that the extended triad method is effective for data mining that takes the external database 3 into consideration will be described.
[0216]
The extended triad method is a method for eliciting from an analyst the knowledge necessary for adding data items. That is, it makes it easier for the analyst to recommend new data items to the data mining system so that samples that are difficult to discriminate with the current data items can be discriminated. For this purpose, sample pairs that are difficult to discriminate with the current data items and that differ in success or failure are presented to the analyst. By showing the analyst that sample pairs that are currently indistinguishable, that is, that look the same to the system, actually differ in success or failure, new data items that allow the presented pairs to be distinguished can be elicited. In particular, when the RBF kernel applied in the present embodiment is used, a misclassified sample with a high degree of exception is presented as a pair with a sample that is similar to it but differs in success or failure. Estimating the validity of a new data item from the data item values of these samples is not only theoretically valid for the system but, as the following examples show, also intuitively meaningful and effective for a human analyst.
[0217]
For example, in the present embodiment, at the time of the sixth data item addition, a sample pair as shown in Table 9 below was identified as a similar case.
[0218]
[Table 9]

[0219]
This indicates that a local currency has been introduced in Shimogyo-ku but not in Kamigyo-ku, Nakagyo-ku, or Higashiyama-ku, which look similar to the system in the current data. For example, the similarity RBF_I(X_i, X_j) between the data in the analysis database 1 for Shimogyo-ku and Kamigyo-ku was 99%. Presenting such sample pairs makes it easier for the analyst to think of new data items to add to the current data items.
[0220]
The example shown in the following Table 10 is a sample pair specified as a similar case at the time of adding the data item for the tenth time.
[0221]
[Table 10]

[0222]
From this example, it can be confirmed that municipalities are paired because the data mining system considers them similar, not because they are geographically close. For example, the similarity RBF_I(X_i, X_j) between the data of Tama City and Narashino City was 98%.
[0223]
In the data mining system, the sample pairs (similar cases) shown in Tables 9 and 10 can be presented to the analyst. Presenting sample pairs mechanically from the viewpoint of the system (that is, without being bound by human assumptions or stereotypes) has the effect of making the analyst notice factors not previously noticed. In addition, new knowledge may be obtained not only from the sample pair presented each time but also by observing the municipalities presented over successive rounds. For example, the municipalities identified as similar cases many times in this embodiment are shown in Table 11 below.
[0224]
[Table 11]

[0225]
What becomes clear from Table 11 is that the data mining system is struggling to discriminate Komagane City, Nagano Prefecture correctly. One possibility is that effective data items have not yet been added; another is that it is a mistake to place Komagane City in the same “success” class as the remaining 39 “success” municipalities. By letting the analyst see the municipalities that are presented repeatedly in this way, the success or failure assigned to a sample can itself be changed, and labels suited to the purpose of the analysis can be re-examined. The “no” samples in Nagano Prefecture appear many times because the data mining system determines from the data that they are municipalities similar to Komagane City. The success or failure assigned to Yufuin Town, Oita Prefecture also needs to be reconsidered. As described above, the extended triad method can give both the data mining system and the analyst an effective viewpoint on the current discrimination knowledge.
[0226]
In this example, the goodness (optimality) of the unused data items was evaluated as follows. Each unused data item was extracted in turn from the external database 3 (civil power data) and actually added to the analysis database 1, new discrimination knowledge was generated, its evaluation value (the objective function value of the c-SVM) was calculated, and the unused data items were ranked by that evaluation value. FIG. 18 shows where the selected data items rank in this optimality evaluation. The “O”, “Δ”, and “+” points indicate the ranks, by objective function value, of the data items evaluated as first, second, and third by the processing of step 8-2-3′ in FIG. 6. For example, in the first selection of an additional item, the data item recommended first is the 30th most effective. At the first selection, 108 data items are not yet included in the analysis database 1, so it can be said that the additional item, chosen by a simple rule from the data item values of only eight samples, is selected appropriately. Overall, selecting additional data items from the data item values of at most eight samples picks items of relatively good rank. The processing of step 8-2-3′ does not guarantee that the best item is always selected, but it was confirmed that relatively good items, at around twentieth place, can be selected by, for example, considering the top three recommended items. Since the selection rule in step 8-2-3′ is simple and yet relatively good data items are selected, the item evaluation samples are considered to provide a good criterion.
[0227]
[Example 2]
In the present embodiment, the present invention is applied to a handwritten character recognition problem. The internal database 2 serving as the information source has 7291 learning images and 2007 test images. Each character is cut out as a 16 × 16 pixel grayscale image. In this embodiment, each pixel is treated as one data item, giving 256 items of input data. The data are provided by the United States Postal Service and are available from ftp://ftp.kyb.tubingen.mpg.de/pub/bs/data/.
[0228]
In the present embodiment, starting from the five data items having the highest mutual information (the initial mask), the data items (pixels) most effective for recognizing the numeral “5” are selected sequentially based only on the learning image data, without using the test image data. FIG. 21 shows the mutual information of each data item. FIG. 22 shows the selected initial mask; the black dots in FIG. 22 are the five selected initial data items. The data were normalized so that each data item had a mean of 0 and a variance of 1. The RBF kernel function of Equation 21 was used as the kernel function. Further, the methods described in the second embodiment of the present invention were adopted for selecting the item evaluation samples (D) and for selecting an additional item from the recommended items. The data mining system automatically performed the process of specifying additional items from the internal database 2.
[0229]
FIG. 20 shows the false recognition rates for the learning data and the test data when data items are sequentially added to the initial mask. Reference numerals 50 and 51 in the figure indicate the results for the learning data, and 52 and 53 those for the test data. The solid lines show the case where the number of item evaluation samples (D) is limited to 16 and the number of eigenvalues and eigenvectors used for the approximation is limited to 6 (m = 6); the broken lines show the case where no restriction is imposed. With the restriction, the recognition rate improves as data items are added, and a false recognition rate of about 3 to 4% is achieved after 20 data items have been added. Given that the human false recognition rate is about 2.5% when the test image data are viewed without masking, this shows that a high recognition rate is achieved. The false recognition rate is lower with the restriction because the selection of evaluation samples prevents similar data items from being selected in a concentrated manner, so that data items with high mutual information are selected evenly (see FIGS. 23 and 24). FIG. 23 shows the 25 finally selected data items (the 5 initial mask items and the 20 added data items) when the number of item evaluation samples (D) is limited to 16 and the number of eigenvalues and eigenvectors used for the approximation is limited to 6. FIG. 24 shows the 25 finally selected data items when no such restriction is imposed.
[0230]
The sets of images that appeared as item evaluation samples (D) were often sets of images that looked similar when masked, although one was the numeral “5” and the other was not. An additional item selected based on such images is a data item that can correctly distinguish these similar images, that is, a pixel whose black and white are inverted between an image of the numeral “5” and an image that is not “5”. It can be intuitively understood that adding such items improves the sample identification rate.
[0231]
[Example 3]
To evaluate the proposed approach, the present invention was applied to the data used in the CoIL 2000 Challenge. The data are available from the data repository of the University of California, Irvine (UCI KDD Archive).
[0232]
The data consist of a training set and a test set. The training set consists of data on 5822 customers of an insurance company. Of these, 348 (6%) hold insurance policies covering caravans (automobile-drawn mobile homes), hereafter referred to as caravan insurance policies. The test set consists of data on 4000 customers of the insurance company, 238 of whom hold caravan insurance policies. There are 85 data items for each customer. The purpose of data mining in the present embodiment is to “obtain discrimination knowledge for identifying customers who hold a caravan insurance policy”.
[0233]
Before applying the present invention, item number 1 of the above data was converted into 41 binary data items and item number 5 into 10 binary data items, because these items represent customer types and are not numerical data. As a result, 134 data items (85 − 2 + 41 + 10) were used in this embodiment.
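The conversion of the two customer-type items into binary items can be sketched as follows. The function names `one_hot` and `expand_record`, the category codes 1 to 41 and 1 to 10, and the sample record are assumptions made for illustration; only the item counts come from the description above.

```python
def one_hot(value, categories):
    """Return one binary item per category, with 1 at the matching category."""
    return [1 if value == c else 0 for c in categories]

def expand_record(record, categorical):
    """Replace each categorical item (given by 0-based index) with its binary items."""
    out = []
    for idx, value in enumerate(record):
        if idx in categorical:
            out.extend(one_hot(value, categorical[idx]))
        else:
            out.append(value)
    return out

# Hypothetical customer record with 85 items; item numbers 1 and 5 are customer types.
record = [0] * 85
record[0] = 33                               # item number 1, assumed codes 1..41
record[4] = 8                                # item number 5, assumed codes 1..10
categorical = {0: list(range(1, 42)), 4: list(range(1, 11))}
expanded = expand_record(record, categorical)
```

The expanded record has 134 items, matching the count used in this embodiment.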
[0234]
In the present embodiment, c-SVM using an RBF kernel is adopted, and the data mining system automatically performs a process of specifying an additional item (see FIG. 6). The number of selected samples (the size of the evaluation set) was set to 8, and the top 10 items of mutual information were selected as the initial item set.
[0235]
FIG. 25 shows the relationship between the number of additional items and the number of customers (samples) in the test set determined to hold a caravan insurance policy. The figure confirms that the number of customers determined to hold a caravan insurance policy increases as items are added. FIG. 26 shows the number of erroneously determined samples in the training set. In the figure, “x” indicates the number of misjudgments for “success” samples, “*” indicates the number of misjudgments for “no” samples, and “+” indicates the sum of the two. The figure confirms that the number of erroneously determined samples decreases as items are added. The results in FIG. 25 and FIG. 26 show that items effective for improving the discrimination knowledge can be selected appropriately based on only a very small number of samples (8 samples in this embodiment), without evaluating the item values for all samples.
[0236]
As described above, according to the present invention, data can be supplemented to the analysis database 1 efficiently and effectively so that useful knowledge is extracted with high accuracy. In data mining, there is a strong need to acquire highly accurate knowledge while appropriately incorporating external data. In particular, the present invention is an extremely useful technology where customer services are to be enhanced by maintaining a customer database, or where facility diagnosis and maintenance are to be enhanced by maintaining a database for equipment diagnosis. At present, the information sources from which data can be freely obtained through the Internet are limited, but as data mining of the kind addressed by the present invention becomes more active, the purchase of data through the Internet and the sale of partial data are expected to become a reality.
[0237]
The above embodiment is a preferred example of the present invention, but the present invention is not limited to it, and various modifications can be made without departing from the gist of the present invention. For example, the user interface of the data mining system may be designed as appropriate to be user-friendly. Also, in consideration of differences in the access cost of external databases 3, rules for accessing the external database 3 may be provided (for example, if the database is free or its cost is lower than a certain standard, neither the amount of data to be acquired nor the connection time is limited; if the cost of the database is higher than a certain standard, the amount of data to be acquired or the connection time is limited; or, for information of the same kind, a lower-cost database is accessed).
[0238]
[Effects of the Invention]
As is apparent from the above description, according to the data mining method of claim 1, the data mining system of claim 10, and the data mining program of claim 11, similar cases, that is, cases whose data are similar but whose classes differ, are specified, and data on additional items effective for differentiating the similar cases are added to the analysis database. Data are thereby supplemented to the analysis database efficiently and effectively, and highly accurate and useful discrimination knowledge can be extracted.
[0239]
In data mining, there is a strong need to acquire highly accurate knowledge while appropriately incorporating external data. In particular, the present invention is an extremely useful technology where customer services are to be enhanced by maintaining a customer database, or where facility diagnosis and maintenance are to be enhanced by maintaining a database for equipment diagnosis. At present, the information sources from which data can be freely obtained through the Internet are limited, but as data mining of the kind addressed by the present invention becomes more active, the purchase of data through the Internet and the sale of partial data are expected to become a reality.
[0240]
Furthermore, according to the data mining method of claim 2, an analyst or the like can compare specific cases (the similar cases), so that it is easy to input accurate recommended items, and the knowledge of the analyst or the like can be effectively extracted.
[0241]
On the other hand, according to the data mining method of claim 3, the analyst or the like does not need knowledge for recommending items, a situation in which the accuracy of the discrimination knowledge depends greatly on the judgment or ability of the analyst or the like can be avoided, and highly accurate discrimination knowledge can be obtained stably.
[0242]
Further, according to the data mining method of claim 4, unnecessary expansion of the analysis database is avoided.
[0243]
Further, according to the data mining method of claim 5, not only data on additional items but also data on additional cases that do not exist in the analysis database is supplemented to the analysis database, so that highly accurate and useful knowledge can be extracted.
[0244]
Further, according to the data mining method of claim 6, effective verification can be performed using a small amount of data.
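Verification restricted to a difficult-to-discriminate region might look like the following minimal Python sketch. It assumes, purely for illustration, that the discrimination knowledge is a two-class scoring function and that the difficult region is where the absolute score falls below a margin; `score_fn`, `margin`, and the function names are hypothetical.

```python
def in_difficult_region(score, margin):
    """Assumed definition: scores near the decision boundary are 'difficult'."""
    return abs(score) < margin


def verify_on_difficult_region(score_fn, verification_cases, margin):
    """verification_cases: list of (item_values, true_class), class in {+1, -1},
    none of which exist in the analysis database.

    Returns (accuracy on the difficult region, the cases that fell in it).
    Accuracy is None when no verification case falls in the region.
    """
    hard = [(x, c) for x, c in verification_cases
            if in_difficult_region(score_fn(x), margin)]
    if not hard:
        return None, []
    correct = sum(1 for x, c in hard
                  if (1 if score_fn(x) >= 0 else -1) == c)
    return correct / len(hard), hard
```

Because only cases inside the region are checked, a small number of verification cases suffices; the returned cases are also natural candidates for additional cases when the accuracy is insufficient.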
[0245]
Furthermore, according to the data mining method of claim 7, data on additional cases that do not exist in the analysis database can be supplemented to the analysis database efficiently and effectively.
[0246]
Further, according to the data mining method of claim 8, expansion of the analysis database with data on additional items or additional cases is repeated until the discrimination knowledge becomes sufficiently accurate or no further improvement in accuracy can be expected, so that highly accurate discrimination knowledge can be reliably obtained.
[0247]
Further, according to the data mining method of claim 9, when the accuracy of the discrimination knowledge is low, data on additional cases is first added to the analysis database, and when no further improvement in accuracy can be expected from adding data on additional cases, data on additional items is added to the analysis database. The analysis database is therefore expanded efficiently and effectively.
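The repetition and ordering described for claims 8 and 9 can be sketched as a simple control loop. This is an illustrative Python sketch only: `evaluate`, `add_cases`, and `add_items` are hypothetical callbacks standing in for regenerating and assessing the discrimination knowledge and for the two kinds of database expansion, and the rule of never returning to additional cases after switching to additional items is an assumption.

```python
def expand_until_accurate(evaluate, add_cases, add_items,
                          target, min_gain, max_rounds=20):
    """evaluate(): regenerate discrimination knowledge on the current
    analysis database and return its accuracy.
    add_cases() / add_items(): expand the database; return False when
    no further data is available.
    """
    accuracy = evaluate()
    history = [("initial", accuracy)]
    use_cases = True  # additional cases are tried first
    for _ in range(max_rounds):
        if accuracy >= target:       # accuracy criterion satisfied
            break
        step = add_cases if use_cases else add_items
        label = "cases" if use_cases else "items"
        if not step():               # nothing left to add of this kind
            if use_cases:
                use_cases = False
                continue
            break
        new_accuracy = evaluate()
        history.append((label, new_accuracy))
        gain = new_accuracy - accuracy
        accuracy = new_accuracy
        if gain < min_gain:
            if use_cases:
                use_cases = False    # switch to additional items
            else:
                break                # no further improvement expected
    return accuracy, history
```

The `history` list records which kind of expansion produced each accuracy value, which makes it easy to inspect when the switch from additional cases to additional items occurred.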
[Brief description of the drawings]
FIG. 1 is a schematic flowchart illustrating an example of processing of a data mining method and system and a program according to the present invention.
FIG. 2 is a flowchart illustrating, in further detail, an example of the processing of step 3 in FIG. 1.
FIG. 3 is a flowchart illustrating, in further detail, an example of the processing of step 8 in FIG. 1.
FIG. 4 is a sequence diagram illustrating an example of the dialogue (exchange) between the data mining system and an analyst.
FIG. 5 is a flowchart illustrating, in further detail, an example of the processing of step 8-1 in FIG. 3.
FIG. 6 is a flowchart illustrating, in further detail, an example of the processing of step 8-2 in FIG. 3.
FIG. 7 is a flowchart illustrating an example of a process of selecting an additional item from recommended items.
FIG. 8 is a schematic configuration diagram illustrating an example of a configuration of a data mining system of the present invention.
FIG. 9 is a schematic configuration diagram showing another example of the configuration of the data mining system of the present invention.
FIG. 10 is a data flow diagram showing an example of a data flow in the data mining method and system of the present invention.
FIG. 11 is a conceptual diagram showing a relationship between an analysis database and information sources (an internal database and an external database) other than the analysis database.
FIG. 12 is a conceptual diagram showing similar cases identified from the analysis database and recommended items input for differentiating the similar cases.
FIG. 13 is a conceptual diagram showing a similar case identified from the analysis database, a recommended item input for differentiating the similar case, and an additional item identified from the recommended items.
FIG. 14 is a conceptual diagram showing similar cases identified from the analysis database, recommended items input for differentiating the similar cases, and the data of all cases in the analysis database for the additional items.
FIG. 15 is a conceptual diagram illustrating a region that is difficult to discriminate based on the discrimination knowledge.
FIG. 16 is a set of conceptual diagrams illustrating the selection of cases that are effective for selecting an additional item that improves the accuracy of the discrimination knowledge. In (A), each case is plotted with the known item (a₁) on the horizontal axis and the unknown item (a₂) on the vertical axis; in (B), each case is plotted with the known item (a₁) on the horizontal axis. The horizontal axes (a₁) in (A) and (B) represent the same cases.
FIG. 17 is a graph showing how discrimination knowledge obtained as items are added is refined. The horizontal axis indicates the number of times of addition of an item, and the vertical axis indicates the number of erroneously determined cases.
FIG. 18 is a graph showing the rank of a selected item in the above-described optimality evaluation, in which the horizontal axis indicates the number of item recommendations and the vertical axis indicates the rank in the optimality evaluation.
FIG. 19 is a schematic flowchart showing another example of the processing of the data mining method, system, and program of the present invention.
FIG. 20 is a graph showing how discrimination knowledge obtained as items are added is refined. The horizontal axis indicates the number of added items, and the vertical axis indicates the erroneous discrimination rate.
FIG. 21 is a diagram showing, for an embodiment in which the present invention is applied to a handwritten character recognition problem, the mutual information of each data item when each pixel of 16 × 16 pixel grayscale image data is treated as one data item.
FIG. 22 is a diagram illustrating an initial mask obtained by selecting the five data items having the highest mutual information amount in FIG. 21 in the embodiment.
FIG. 23 is a diagram illustrating data items finally selected when the number of similar cases and the number of eigenvalues / eigenvectors used for approximation are limited in the embodiment.
FIG. 24 is a diagram illustrating a data item finally selected when the number of similar cases and the number of eigenvalues / eigenvectors used for approximation are not limited in the embodiment.
FIG. 25 is a graph showing how discrimination knowledge obtained as items are added is refined. The horizontal axis indicates the number of added items, and the vertical axis indicates the number of cases determined to be “completion”.
FIG. 26 is a graph showing how discrimination knowledge obtained as items are added is refined. The horizontal axis indicates the number of added items, and the vertical axis indicates the number of cases where misclassification has occurred.
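The initial mask described for FIGS. 21 and 22 (each pixel treated as a data item, with the five items of highest mutual information selected) could be computed along the following lines. This is a minimal Python sketch under assumptions: pixels are binarized with a fixed threshold before the mutual information with the class label is estimated, and the function names and the threshold value are hypothetical; the embodiment may discretize the grayscale values differently.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def initial_mask(images, labels, k=5, threshold=128):
    """images: list of flat pixel lists (256 values each for 16 x 16 images).

    Returns the indices of the k pixels with the highest mutual information
    with the class label, together with the per-pixel scores.
    """
    n_pixels = len(images[0])
    scores = []
    for p in range(n_pixels):
        # Assumed discretization: binarize each grayscale pixel.
        column = [1 if img[p] >= threshold else 0 for img in images]
        scores.append(mutual_information(column, labels))
    ranked = sorted(range(n_pixels), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:k]), scores
```

The returned indices would serve as the initial set of data items, to which further pixels are then added as additional items.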
[Explanation of symbols]
1. Analysis database
2. Internal database (information source)
3. External database (information source)
10, 10' Data mining system

Claims

1. A data mining method for generating, based on an analysis database having information arranged for each case and each item and information indicating to which of a plurality of predetermined classes each of the cases belongs, discrimination knowledge that takes the data of the items of a case as input and outputs the class to which the case belongs, the method comprising at least: a step of identifying, from the analysis database, similar cases whose item data are similar and whose classes differ; a step of identifying an additional item effective for differentiating the similar cases; a step of adding, to the analysis database, data on the additional item for all the cases in the analysis database from an information source other than the analysis database; and a step of generating the discrimination knowledge based on the expanded analysis database.

2. The data mining method according to claim 1, wherein the additional item is identified from recommended items input for differentiating the similar cases.

3. The data mining method according to claim 1, wherein the additional item is identified from the information source according to a predetermined rule.

4. The data mining method according to any one of claims 1 to 3, wherein, when the accuracy of the generated discrimination knowledge does not satisfy a predetermined criterion, data on the additional item is added to the analysis database, and new discrimination knowledge is generated based on the expanded analysis database.

5. The data mining method according to claim 1, wherein, when the accuracy of the generated discrimination knowledge does not satisfy a predetermined criterion, data on an additional case that does not exist in the analysis database is added to the analysis database from the information source, and new discrimination knowledge is generated based on the expanded analysis database.

6. The data mining method according to claim 4, wherein a region that is difficult to discriminate is identified based on the generated discrimination knowledge, the discrimination knowledge is applied to a verification case that falls in the difficult-to-discriminate region and does not exist in the analysis database, and the accuracy of the discrimination knowledge is verified based on the result.

7. The data mining method according to claim 6, wherein when the accuracy of the discrimination knowledge does not satisfy a predetermined criterion, data on the verification case is added to the analysis database as data on the additional case.

8. The data mining method according to claim 4, wherein the step of adding data on the additional item or the additional case to the analysis database and generating new discrimination knowledge based on the expanded analysis database is repeated until the accuracy of the discrimination knowledge satisfies a predetermined criterion, or until the degree of improvement in the accuracy of the discrimination knowledge obtained by adding data on the additional item or the additional case no longer satisfies a predetermined criterion.

9. The data mining method according to claim 8, wherein, when the accuracy of the discrimination knowledge does not satisfy a predetermined criterion, data on the additional case is added to the analysis database, and when the degree of improvement in the accuracy of the discrimination knowledge obtained by adding data on the additional case does not satisfy a predetermined criterion, data on the additional item is added to the analysis database.

10. A data mining system comprising: an analysis database having information arranged for each case and each item and information indicating to which of a plurality of predetermined classes each of the cases belongs; means for identifying, from the analysis database, similar cases whose item data are similar and whose classes differ; means for identifying an additional item effective for differentiating the similar cases; means for adding, to the analysis database, data on the additional item for all the cases in the analysis database from an information source other than the analysis database; and means for generating, based on the expanded analysis database, discrimination knowledge that takes the data of the items of a case as input and outputs the class to which the case belongs.

11. A data mining program for causing a computer to function as: an analysis database having information arranged for each case and each item and information indicating to which of a plurality of predetermined classes each of the cases belongs; means for identifying, from the analysis database, similar cases whose item data are similar and whose classes differ; means for identifying an additional item effective for differentiating the similar cases; means for adding, to the analysis database, data on the additional item for all the cases in the analysis database from an information source other than the analysis database; and means for generating, based on the expanded analysis database, discrimination knowledge that takes the data of the items of a case as input and outputs the class to which the case belongs.