JP4057587B2

JP4057587B2 - Feature pattern output device

Info

Publication number: JP4057587B2
Application number: JP2004548006A
Authority: JP
Inventors: 宏弥稲越; 青史岡本; 陽佐藤; 剛寿安藤; 暢尾崎
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-11-01
Filing date: 2002-11-01
Publication date: 2008-03-05
Anticipated expiration: 2022-11-01
Also published as: WO2004040477A1; JPWO2004040477A1

Description

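[Technical Field]
[0001]
The present invention relates to a feature pattern output device that, from a database storing data having a plurality of items with each piece of data assigned to one of a plurality of classes, outputs a combination of items characteristically contained in a class as the feature pattern of that class. In particular, it relates to a feature pattern output device that can output feature patterns at high speed even when the database is large.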
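[Background Art]
[0002]
In recent years, techniques have been devised for extracting correlations among the data stored in a database and rules that the data obey. Such correlations and rules can be used, for example, to classify the data stored in the database or to classify new data.
[0003]
As a conventional association rule learning technique that extracts rules from a database and feeds them back to the database, Agrawal, R., "Fast Algorithms for Mining Association Rules" and the corresponding patent document "System and Method for Mining Sequential Patterns in a Large Database" (Japanese Patent Laid-Open No. H8-263346) have been published.
[0004]
According to the technique disclosed there, patterns are formed by combining the constituent elements of data, called items, and the association rules of the data are expressed by frequently occurring patterns.
[0005]
With this technique, however, the cost of extracting association rules is high, and when the contents of the database change, time is needed before the association rules reflect the change. Consequently, association rules are often extracted with the database taken offline, and the rules track database updates poorly.
Furthermore, the processing time needed to extract association rules and to classify data based on the extracted rules varies greatly with the parameter settings, and the rules obtained also depend heavily on the parameters. Setting the parameters appropriately requires expertise and experience. Depending on the settings, the usefulness of the obtained rules may decline, or the processing time may grow so long that operating the association rules becomes impractical.
[0006]
Another rule extraction technique has been published in J. Li, G. Dong, K. Ramamohanarao, and L. Wong, "DeEPs: A new instance-based discovery and classification system," Technical report, Dept of CSSE, University of Melbourne, 2000. DeEPs, disclosed there, enables real-time pattern discovery in which applicable patterns are learned after the input data is given. The database can therefore be updated at any time without being taken offline. In addition, DeEPs requires no parameters for pattern discovery, so less expertise and experience are needed in operation.
[0007]
However, DeEPs processes all the data in the database at pattern discovery time, so the required processing capacity grows with the number of records in the database. When the database holds a large number of records, pattern extraction therefore takes longer than is acceptable as a response time for real-time processing.
[0008]
Furthermore, in DeEPs the processing time grows in proportion to the number of items that make up the data. When each record contains many items, pattern extraction therefore takes an enormous amount of time.
[0009]
The present invention has been made to solve the above problems of the prior art. Its object is to provide a feature pattern output device that can execute pattern extraction at high speed even for a large database that contains many records, each having many items.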
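[Disclosure of the Invention]
[0010]
To solve the above problems and achieve the object, a feature pattern output device according to the present invention outputs, from a database storing data composed of a plurality of items with the data divided into a plurality of classes, a combination of items that characterizes each class as the feature pattern of that class. The device comprises: similar data extraction means for extracting, upon receipt of input data, similar data resembling the input data from the database for each class; similar pattern set calculation means for calculating a similar pattern set for each class from the similar data extracted by the similar data extraction means; and feature pattern calculation means for calculating a feature pattern for each class from the similar pattern set calculated by the similar pattern set calculation means.
[0011]
According to this invention, similar data resembling the input data are extracted from the database, and the feature patterns that characterize each class are calculated from the extracted similar data.
[0012]
In the feature pattern output device according to the present invention, the similar pattern set calculation means extracts, as a pattern set, the combinations of items for which the items forming the similar data extracted by the similar data extraction means match the items forming the input data; extracts, as a minimum pattern set, the minimum patterns, which are item combinations that have no subset other than themselves in the pattern set; extracts, as a maximum pattern set, the maximum patterns, which are item combinations that have no superset other than themselves in the pattern set; and outputs the minimum pattern set and the maximum pattern set as the similar pattern set.
[0013]
According to this invention, each item of the data extracted from the database is compared with each item of the input data, the maximum pattern set and the minimum pattern set are extracted from the combinations of matching items, and the feature patterns are calculated from the maximum pattern set and the minimum pattern set.
[0014]
In the feature pattern output device according to the present invention, the feature pattern calculation means extracts, from the minimum pattern sets, a common pattern set that appears across a plurality of classes, and calculates feature patterns that contain all the items of the common pattern set.
[0015]
According to this invention, the common patterns that appear across a plurality of classes are obtained from the minimum pattern sets, and the feature patterns are calculated as supersets of the common patterns.
[0016]
In the feature pattern output device according to the present invention, the similar data extraction means, when extracting similar data from the database, extracts the similar data under conditions that differ from class to class.
[0017]
According to this invention, the conditions for extracting similar data are changed for each class so that a sufficient number of similar data are obtained for every class.
[0018]
In the feature pattern output device according to the present invention, the similar pattern set calculation means, when a maximum pattern appears across a plurality of classes, excludes a predetermined item from that maximum pattern.
[0019]
According to this invention, an item is removed from a maximum pattern that appears across a plurality of classes, which prevents the situation in which no feature pattern exists.
[0020]
The feature pattern output device according to the present invention further comprises classification means for classifying the input data into one of the plurality of classes based on the feature patterns calculated by the feature pattern calculation means.
[0021]
According to this invention, the input data is classified based on the feature patterns calculated from the similar data.
[0022]
In the feature pattern output device according to the present invention, the classification means counts the number of feature patterns in the similar data of each class and classifies the input data into the class for which the count is largest.
[0023]
According to this invention, the number of occurrences of the feature patterns in the similar data of each class is counted, and the input data is classified into the class with the largest count.
[0024]
In the feature pattern output device according to the present invention, the similar pattern set calculation means determines that two item values match when the value of a given item forming the input data and the value of the corresponding item forming the similar data lie within a predetermined numerical range of each other.
[0025]
According to this invention, when an item is numerical data, a predetermined numerical range is set, and the item value of the input data and the item value of the similar data are determined to match when they lie within that range of each other.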
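[Best Mode for Carrying Out the Invention]
[0026]
Preferred embodiments of the feature pattern output device according to the present invention are described in detail below with reference to the accompanying drawings.
[0027]
(Embodiment 1)
FIG. 1 is a schematic configuration diagram of a feature pattern output device according to Embodiment 1 of the present invention. In FIG. 1, the feature pattern output device 21 is connected to a database 22. The database 22 stores information about customers, and one record corresponds to one customer. Each record has fields such as "age", "residence", "sex", and "marriage", and holds a value for each field. Hereinafter, the combination of a field and its value is called an item. The database 22 divides the customers, that is, the records, into classes according to whether credit can be granted. Customers for whom credit is possible are classified as "class P", and customers for whom credit is not possible are classified as "class N".
[0028]
The feature pattern output device 21 contains an input processing unit 31, a similar data extraction unit 32, a binarization processing unit 33, a similar pattern set calculation unit 34, a feature pattern set calculation unit 35, and an input data classification processing unit 36. On receiving customer information as input data, the input processing unit 31 outputs the input data to the similar data extraction unit 32 and the binarization processing unit 33.
[0029]
The similar data extraction unit 32 extracts data similar to the input data from the database 22 and outputs it to the binarization processing unit 33 as similar data. The binarization processing unit 33 binarizes the similar data on the basis of the input data and sends the result to the similar pattern set calculation unit 34 and the input data classification processing unit 36.
[0030]
The similar pattern set calculation unit 34 calculates a similar pattern set for each of class P and class N from the binarized similar data. The feature pattern set calculation unit 35 outputs, as feature patterns, the item combinations from the similar pattern sets that appear characteristically in class P or in class N.
[0031]
The input data classification processing unit 36 compares the binarized similar data with the feature patterns and decides whether the input data is classified into class P or class N.
[0032]
The feature pattern output device 21 outputs the feature patterns and the classification result of the input data. Because the device extracts data similar to the input data from the database 22 and calculates the feature patterns from this similar data, it can calculate feature patterns at high speed regardless of the number of records in the database 22 and the number of items in each record.
[0033]
Each process is now described in detail using a specific example.
FIG. 2 shows a specific example of input data and similar data. FIG. 2(a) is an example of input data, and FIG. 2(b) is an example of the data stored in the database 22. As shown in FIG. 2, the input data has "35" as the value of "age", "rented" as the value of "residence", "male" as the value of "sex", and "married" as the value of "marriage".
[0034]
The similar data extraction unit 32 adopts a similarity based on the city-block distance as the similarity function and extracts similar data from the database 22.
Specifically, with n the number of items, X a record stored in the database 22, and Y the input data,
[Equation 1]

Here, the item <fi: xi> indicates that the value of the field "fi" is "xi". Items whose field is a numerical attribute are all normalized to the interval [0, 1], and α is defined as a radius between 0 and 1. That is, δ is 1 when the value lies within radius α of the input data's value, and 0 when it lies outside radius α.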
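Equation 1 itself is not reproduced in this text. From the definitions around it (n items, a stored record X, input data Y, and a match indicator δ with radius α for numerical fields), it presumably has a form like the following. The function name Sim and the exact case split are assumptions made for illustration, not the patent's own notation.

```latex
\mathrm{Sim}(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i),
\qquad
\delta(x_i, y_i) =
\begin{cases}
1 & \text{if } f_i \text{ is discrete and } x_i = y_i,\\
1 & \text{if } f_i \text{ is numerical and } |x_i - y_i| \le \alpha,\\
0 & \text{otherwise.}
\end{cases}
```

Under this reading the similarity is simply the number of fields on which the stored record agrees with the input data.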
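[0035]
In other words, for each record stored in the database, this similarity function counts the number of items that match an item of the input data. In FIG. 2(b), the items of each record that match the input data are circled, and the output of the similarity function is shown as the similarity. Although "age" is numerical data, a margin of 5, which corresponds here to α = 0.18, is allowed, so the item is judged to match when the age is between 30 and 40.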
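The match counting described above can be sketched in Python as follows. This is a minimal illustration rather than the device's implementation: the record contents are hypothetical (FIG. 2(b) is not reproduced here), and the margin is applied to raw ages instead of values normalized to [0, 1].

```python
from typing import Dict, Union

Value = Union[int, float, str]
Record = Dict[str, Value]


def delta(x, y, margin=0.0):
    """Return 1 when two field values match, otherwise 0.

    Numerical values match when they differ by at most `margin`;
    all other values must be equal.
    """
    numeric = isinstance(x, (int, float)) and isinstance(y, (int, float))
    if numeric:
        return 1 if abs(x - y) <= margin else 0
    return 1 if x == y else 0


def similarity(record: Record, query: Record, margins: Dict[str, float]) -> int:
    """Count the fields of `query` whose value is matched by `record`."""
    return sum(
        delta(record.get(field), value, margins.get(field, 0.0))
        for field, value in query.items()
    )


# Input data from the example in the text.
query = {"age": 35, "residence": "rented", "sex": "male", "marriage": "married"}
# Margin of 5 years on "age", as in the text; other fields need exact equality.
margins = {"age": 5}
# Hypothetical stored record, invented for this sketch.
record = {"age": 38, "residence": "rented", "sex": "male", "marriage": "single"}

score = similarity(record, query, margins)
```

Here the hypothetical record agrees with the input data on age (within the margin), residence, and sex, so its similarity is the count of those three matching items.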
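[0036]
FIG. 3 shows a data space in which the records of FIG. 2(b) are arranged according to their similarity. In FIG. 3, the input data is shown as "★", records belonging to class P as "○", and records belonging to class N as "×". The number next to each symbol is the data number in FIG. 2(b).
[0037]
As shown in FIG. 3, data 7, 10, 12, and 13, whose similarity is 3, are closest to the input data and lie on concentric circle 41. Data 2 and 9, whose similarity is 2, lie on the next concentric circle 42. Data 1, 4, 5, 6, and 11, whose similarity is 1, lie on the next concentric circle 43, and data 3 and 8, whose similarity is 0, lie outside concentric circle 43.
[0038]
The similar data extraction unit 32 extracts, as similar data, the records whose similarity is equal to or greater than a predetermined threshold. Alternatively, it extracts a predetermined number of records, for example five, in descending order of similarity. Records with equal similarity are all included in the similar data. In FIG. 3, therefore, six records are extracted as similar data: data 7, 10, 12, and 13 with similarity 3, and data 2 and 9 with similarity 2.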
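Both extraction rules, including the rule that ties are kept, can be sketched as below. The similarity values are the ones stated for data 1 to 13 in the text; the function name and its parameters are assumptions made for this sketch.

```python
from typing import Dict, List, Optional


def extract_similar(
    scores: Dict[int, int],
    threshold: Optional[int] = None,
    top_k: Optional[int] = None,
) -> List[int]:
    """Return the ids of the records selected as similar data.

    With `threshold`, keep every record whose similarity is at least the
    threshold. With `top_k`, keep the k most similar records plus every
    record tied with the k-th one.
    """
    if threshold is not None:
        return sorted(rid for rid, s in scores.items() if s >= threshold)
    if top_k is None:
        return sorted(scores)
    if top_k <= 0:
        return []
    ranked = sorted(scores.values(), reverse=True)
    if top_k >= len(ranked):
        return sorted(scores)
    cutoff = ranked[top_k - 1]  # similarity of the k-th best record
    return sorted(rid for rid, s in scores.items() if s >= cutoff)


# Similarity of each record to the input data, as listed in the text.
scores = {1: 1, 2: 2, 3: 0, 4: 1, 5: 1, 6: 1, 7: 3,
          8: 0, 9: 2, 10: 3, 11: 1, 12: 3, 13: 3}

by_count = extract_similar(scores, top_k=5)
by_threshold = extract_similar(scores, threshold=2)
```

Asking for five records returns six, because the fifth and sixth candidates (data 2 and 9) share similarity 2. A threshold of 2 selects the same six records.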
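[0039]
The binarization processing unit 33 binarizes the similar data extracted by the similar data extraction unit 32. Specifically, the items for which δ = 0 are removed from the similar data, and for the fields for which δ = 1, the value is replaced with the value of the same field in the input data. For fields with a discrete attribute, the value is already identical to that of the input data. Rewriting the values of the numerical fields to the values in the input data therefore completes the binarization of the similar data.
Binarization yields the following similar data.
Data 2 {<residence: rented> <sex: male>}
Data 7 {<residence: rented> <sex: male> <marriage: married>}
Data 9 {<age: 35> <sex: male>}
Data 10 {<age: 35> <sex: male> <marriage: married>}
Data 12 {<age: 35> <residence: rented> <sex: male>}
Data 13 {<residence: rented> <sex: male> <marriage: married>}
Binarizing the similar data in this way leaves only items that are contained in the input data. The subsequent feature pattern calculation can therefore be carried out using only set operations on items.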
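A minimal sketch of the binarization step follows. The raw record is hypothetical; it is chosen so that its binarized form equals the item set listed for data 12. The helper names are assumptions for illustration.

```python
from typing import Dict, FrozenSet, Tuple, Union

Value = Union[int, float, str]
Item = Tuple[str, Value]


def matches(x, y, margin=0.0) -> bool:
    """True when two field values match (numerical values within `margin`)."""
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y) <= margin
    return x == y


def binarize(
    record: Dict[str, Value],
    query: Dict[str, Value],
    margins: Dict[str, float],
) -> FrozenSet[Item]:
    """Keep only the matching fields, written with the input data's values."""
    return frozenset(
        (field, value)
        for field, value in query.items()
        if matches(record.get(field), value, margins.get(field, 0.0))
    )


query = {"age": 35, "residence": "rented", "sex": "male", "marriage": "married"}
margins = {"age": 5}
# Hypothetical raw record: age 33 matches 35 within the margin and is
# rewritten to 35; the non-matching "marriage" field is dropped.
raw = {"age": 33, "residence": "rented", "sex": "male", "marriage": "single"}

pattern = binarize(raw, query, margins)
```

Because every surviving item is written with the input data's value, two binarized records can be compared with plain subset tests.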
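[0040]
Next, the processing of the similar pattern set calculation unit 34 is described. The similar pattern set calculation unit 34 calculates a maximum pattern set and a minimum pattern set for each of class P and class N. The maximum pattern set consists of the item sets that have no superset among the similar data of the class. The minimum pattern set consists of the item sets that have no subset among the similar data of the class.
[0041]
FIG. 4 shows the maximum pattern sets and the minimum pattern sets. FIG. 4(a) shows the inclusion relations among the sets in class P, and FIG. 4(b) shows the inclusion relations among the sets in class N.
[0042]
For class P, the similar data are
Data 2 {<residence: rented> <sex: male>}
Data 7 {<residence: rented> <sex: male> <marriage: married>}
and every item of data 2 is contained in data 7. That is, data 2 is a subset of data 7, and data 7 is a superset of data 2. This relation is shown by a solid arrow in FIG. 4(a).
[0043]
Among the similar data of class P, there is no set that is a superset of data 7. Data 7 is therefore a maximum pattern of class P. Data 1 and 6 are subsets of data 2, but their similarity is 1 and they were not selected as similar data. Among the similar data of class P there is thus no set that is a subset of data 2, so data 2 is a minimum pattern of the similar data of class P.
[0044]
Similarly, for class N, the similar data are
Data 9 {<age: 35> <sex: male>}
Data 10 {<age: 35> <sex: male> <marriage: married>}
Data 12 {<age: 35> <residence: rented> <sex: male>}
Data 13 {<residence: rented> <sex: male> <marriage: married>}
and every item of data 9 is contained in both data 10 and data 12. That is, data 9 is a subset of both data 10 and data 12, and data 10 and 12 are both supersets of data 9. This relation is shown by solid arrows in FIG. 4(b).
[0045]
Among the similar data of class N, there is no set that is a superset of data 10 or of data 12. Data 10 and 12 are therefore each a maximum pattern of class N. Among the similar data of class N there is also no set that is a subset of data 9, so data 9 is a minimum pattern of class N.
[0046]
Data 13 has neither a superset nor a subset among the similar data of class N. Data 13 is therefore both a maximum pattern and a minimum pattern of class N.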
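The maximum and minimum pattern sets can be computed with proper-subset tests on the binarized data. The sketch below uses the six binarized records from the text, with each item abbreviated to its value; the function names are assumptions for illustration.

```python
from typing import FrozenSet, Iterable, Set

Pattern = FrozenSet[str]


def minimum_patterns(patterns: Iterable[Pattern]) -> Set[Pattern]:
    """Patterns that have no proper subset among `patterns`."""
    ps = set(patterns)
    return {p for p in ps if not any(q < p for q in ps)}


def maximum_patterns(patterns: Iterable[Pattern]) -> Set[Pattern]:
    """Patterns that have no proper superset among `patterns`."""
    ps = set(patterns)
    return {p for p in ps if not any(q > p for q in ps)}


# Binarized similar data from the text (items abbreviated to their values).
class_p = {
    2: frozenset({"rented", "male"}),
    7: frozenset({"rented", "male", "married"}),
}
class_n = {
    9: frozenset({"35", "male"}),
    10: frozenset({"35", "male", "married"}),
    12: frozenset({"35", "rented", "male"}),
    13: frozenset({"rented", "male", "married"}),
}

Lp, Rp = minimum_patterns(class_p.values()), maximum_patterns(class_p.values())
Ln, Rn = minimum_patterns(class_n.values()), maximum_patterns(class_n.values())
```

With these inputs data 13 lands in both `Ln` and `Rn`, which matches the observation that it has neither a subset nor a superset among the class N similar data.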
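[0047]
For class P, let Dp be the binarized similar data, Lp the minimum pattern set, and Rp the maximum pattern set. The pattern set [Lp, Rp] is the set of all patterns that are a superset of at least one minimum pattern and a subset of at least one maximum pattern. Therefore,
Dp ⊆ [Lp, Rp]
holds.
[0048]
For the data shown in FIG. 4(a), Lp = {{rented, male}}, Rp = {{rented, male, married}}, and Dp = {{rented, male}, {rented, male, married}}.
Similarly, for class N, let Dn be the binarized similar data, Ln the minimum pattern set, and Rn the maximum pattern set. The pattern set [Ln, Rn] is the set of all patterns that are a superset of at least one minimum pattern and a subset of at least one maximum pattern. Therefore,
Dn ⊆ [Ln, Rn]
holds.
[0049]
For the data shown in FIG. 4(b), Ln = {{35, male}, {rented, male, married}}, Rn = {{35, rented, male}, {35, male, married}, {rented, male, married}}, and Dn = {{35, male}, {35, rented, male}, {35, male, married}, {rented, male, married}}.
[0050]
In the example of FIG. 4, Dp = [Lp, Rp]. In general, however, a pattern that is a superset of a minimum pattern and a subset of a maximum pattern is contained in [Lp, Rp] even when it does not occur in the similar data, that is, even when it is not in Dp.
[0051]
Here, <L, R> is defined as the border of the minimum pattern set L and the maximum pattern set R. The border <L, R> denotes the pattern set [L, R] as a pair consisting of the minimum patterns and the maximum patterns. With borders, set operations can be replaced by operations on the maximum and minimum patterns alone, without handling the elements of the sets directly, which makes the computation far more efficient.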
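Membership in a pattern set [L, R] can be tested from the border alone, without enumerating the set. The sketch below assumes patterns are frozensets; the class P border comes from the text, while the second border is a hypothetical one chosen to show a pattern that belongs to [L, R] without being a minimum or maximum pattern itself.

```python
from typing import FrozenSet, Set

Pattern = FrozenSet[str]


def in_border(pattern: Pattern, L: Set[Pattern], R: Set[Pattern]) -> bool:
    """True when `pattern` is a superset of some minimum pattern in L
    and a subset of some maximum pattern in R."""
    return any(l <= pattern for l in L) and any(pattern <= r for r in R)


# Border of class P in the worked example.
Lp = {frozenset({"rented", "male"})}
Rp = {frozenset({"rented", "male", "married"})}

inside = in_border(frozenset({"rented", "male"}), Lp, Rp)
too_small = in_border(frozenset({"rented"}), Lp, Rp)

# Hypothetical border with a gap between L and R.
L = {frozenset({"a"})}
R = {frozenset({"a", "b", "c"})}
between = in_border(frozenset({"a", "b"}), L, R)
```

The pair (L, R) stays small even when [L, R] is large, which is the efficiency argument made for borders above.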
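[0052]
The similar pattern set calculation unit 34 outputs the border <Lp, Rp> and the border <Ln, Rn> to the feature pattern set calculation unit 35 as the similar pattern sets and ends its processing.
[0053]
Next, the operation of the feature pattern set calculation unit 35 is described. First, when Rp and Rn are the maximum patterns of class P and class N computed over all the data, it has been proved that [{φ}, Rp] − [{φ}, Rn] is the pattern set containing all the patterns that appear only in class P (J. Li and K. Ramamohanarao, "The space of jumping emerging patterns and its incremental maintenance algorithm," in Proceedings of the 17th International Conference on Machine Learning, pages 551-558, Morgan Kaufmann, 2000).
[0054]
In the present invention, Rp and Rn are computed from data similar to the input data, so there is no guarantee that they are the maximum patterns over the whole data. However, because the similar data have a high similarity, they match many of the items of the input data, and because maximum patterns usually contain many items, the maximum patterns are likely to be contained in the similar patterns.
[0055]
Even if many maximum patterns are included, however, some maximum patterns may be missed, and a single missed pattern can lead to the discovery of an incorrect feature pattern. Such incorrect feature patterns lower the classification accuracy. Therefore, when feature patterns are calculated from the similar data, the condition is added that a feature pattern has more items than the patterns that appear in common in class P and class N of the similar data. This guards against missed maximum patterns and prevents a drop in classification accuracy.
[0056]
FIG. 5 shows the processing of the feature pattern set calculation unit 35. In FIG. 5, the feature pattern set calculation unit 35 first obtains, from the similar pattern sets <Lp, Rp> and <Ln, Rn>, the pattern set that appears in common in the pattern sets [{φ}, Lp] and [{φ}, Ln]. Specifically, the output data epLp and epRp are first initialized as epLp = {} and epRp = {}. Next, intersecOperation(<{φ}, Lp>, <{φ}, Ln>) is used to calculate <{φ}, [c1, ..., ck]> (step S102). This intersecOperation is the same as the one shown in the above literature. It outputs all the patterns that appear in common in the sets denoted by the two borders <{φ}, Lp> and <{φ}, Ln>, in the form of the border <{φ}, [c1, ..., ck]>.
[0057]
This processing yields [c1, ..., ck], the set of maximum patterns that appear in common in the pattern sets [{φ}, Lp] and [{φ}, Ln]. Because any ci in [c1, ..., ck] is a maximum common pattern, a superset of ci
- appears only in class P data,
- appears only in class N data, or
- appears in neither class P nor class N.
[0058]
Therefore, for each element ci of [c1, ..., ck], searching for the patterns that contain ci and that appear only in class P and not in class N yields the set of patterns that appear characteristically in class P.
[0059]
After obtaining [c1, ..., ck], the feature pattern set calculation unit 35 therefore sets the first pattern c1 as the processing target (step S103) and obtains, from the maximum pattern set Rp of class P, the pattern set rp consisting of the supersets of the common pattern being processed (step S104). It then obtains, from the maximum pattern set Rn of class N, the pattern set rn consisting of the supersets of the common pattern being processed (step S105).
[0060]
Next, the feature pattern set calculation unit 35 obtains the set of patterns that appear in the pattern set [{φ}, rp] and do not appear in the pattern set [{φ}, rn]. Specifically, jepProducer(<{φ}, rp>, <{φ}, rn>) is used to calculate <el, er> (step S106). This jepProducer is the same as the one shown in the above literature. It outputs, in the form of the border <el, er>, the set of patterns that appear in the pattern set [{φ}, rp] denoted by the border <{φ}, rp> and do not appear in the pattern set [{φ}, rn] denoted by the border <{φ}, rn>.
[0061]
If el is not {φ} (step S107, No), the feature pattern set calculation unit 35 adds the common pattern being processed to <el, er> to create the border <eL, eR> (step S108). The patterns denoted by the border <eL, eR> are supersets of the common pattern being processed, so they form a set of patterns that appear in class P and do not appear in class N.
[0062]
The feature pattern set calculation unit 35 adds the border <eL, eR> to the border <epLp, epRp> (step S109). The border <epLp, epRp> is the data that is finally output as the feature patterns. Here, epLp is monitored so that it always contains only minimum patterns, and any pattern that is not minimum is removed (step S110).
[0063]
After step S110, or when el is {φ} (step S107, Yes), the feature pattern set calculation unit 35 determines whether all the elements of the pattern set [c1, ..., ck] have been processed (step S111). If an unprocessed element remains (step S111, No), the unit sets the next element as the processing target (step S113) and returns to step S104.
[0064]
If all the elements have been processed (step S111, Yes), the feature pattern set calculation unit 35 outputs the border <epLp, epRp> (step S112).
[0065]
The feature pattern set calculation unit 35 can calculate the border <epLn, epRn> for class N in the same way. Using <epLp, epRp> and <epLn, epRn>, the unit outputs the feature pattern set SEP given by
SEP = epLp ∪ epLn
The feature pattern set SEP is the union of the minimum patterns that appear characteristically in class P or in class N. The feature pattern set calculation unit 35 outputs the feature pattern set SEP to the outside of the feature pattern output device 21 and also to the input data classification processing unit 36.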
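[0066]
Applying the processing of the feature pattern set calculation unit 35 to the data shown in FIG. 4, the minimum pattern set of class P is Lp = {{rented, male}} and the minimum pattern set of class N is Ln = {{35, male}, {rented, male, married}}, so the pattern set that appears in common is {{rented, male}} (step S102).
[0067]
The processing therefore continues with ci = {rented, male} (step S103).
For class P, of the maximum pattern set Rp = {{rented, male, married}}, the patterns that are supersets of ci = {rented, male} are rp = {{rented, male, married}} (step S104). Likewise for class N, of the maximum pattern set Rn = {{35, rented, male}, {35, male, married}, {rented, male, married}}, the patterns that are supersets of ci = {rented, male} are rn = {{35, rented, male}, {rented, male, married}} (step S105).
[0068]
The set of patterns that appear in [{φ}, rp] and do not appear in [{φ}, rn], obtained with jepProducer(<{φ}, rp>, <{φ}, rn>), is <el, er> = <{φ}, {φ}> (step S106).
[0069]
The maximum common pattern set {ci} has only one element, so in this example the feature pattern of class P is <epLp, epRp> = <{φ}, {φ}>.
[0070]
For class N, the results up to step S105 are the same as for class P: ci = {rented, male}, rn = {{35, rented, male}, {rented, male, married}}, and rp = {{rented, male, married}} (steps S101 to S105).
[0071]
The set of patterns that appear in [{φ}, rn] and do not appear in [{φ}, rp], obtained with jepProducer(<{φ}, rn>, <{φ}, rp>), is <el, er> = <{35}, {35, rented, male}> (step S106). Adding c1 to each of el and er gives the border <eL, eR> = <{35, rented, male}, {35, rented, male}> (step S108). The maximum common pattern set {c1} has only one element, so in this example the feature pattern set of class N is <epLn, epRn> = <{35, rented, male}, {35, rented, male}> (steps S109 to S110).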
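The worked example can be checked with a brute-force sketch that enumerates the pattern sets explicitly. This is not the border-based intersecOperation or jepProducer of the cited literature, which avoid enumeration; it is an assumed, simplified stand-in that should give the same borders on small inputs. All function names here are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set, Tuple

Pattern = FrozenSet[str]


def closure(maximal: Iterable[Pattern]) -> Set[Pattern]:
    """Every subset (including the empty one) of any pattern in `maximal`."""
    out: Set[Pattern] = set()
    for r in maximal:
        items = sorted(r)
        for k in range(len(items) + 1):
            out.update(frozenset(c) for c in combinations(items, k))
    return out


def minimal(ps: Set[Pattern]) -> Set[Pattern]:
    return {p for p in ps if not any(q < p for q in ps)}


def maximal(ps: Set[Pattern]) -> Set[Pattern]:
    return {p for p in ps if not any(q > p for q in ps)}


def feature_border(
    L_own: Set[Pattern], R_own: Set[Pattern],
    L_other: Set[Pattern], R_other: Set[Pattern],
) -> Tuple[Set[Pattern], Set[Pattern]]:
    """Brute-force version of the flow in FIG. 5 for one class."""
    # Maximum patterns common to both minimum-pattern closures (step S102).
    common = maximal(closure(L_own) & closure(L_other))
    ep: Set[Pattern] = set()
    for c in common:
        r_own = {r for r in R_own if c <= r}      # step S104
        r_other = {r for r in R_other if c <= r}  # step S105
        diff = closure(r_own) - closure(r_other)  # role of jepProducer
        ep |= {p | c for p in diff}               # add the common pattern
    return minimal(ep), maximal(ep)               # keep border form


Lp = {frozenset({"rented", "male"})}
Rp = {frozenset({"rented", "male", "married"})}
Ln = {frozenset({"35", "male"}), frozenset({"rented", "male", "married"})}
Rn = {frozenset({"35", "rented", "male"}),
      frozenset({"35", "male", "married"}),
      frozenset({"rented", "male", "married"})}

ep_p = feature_border(Lp, Rp, Ln, Rn)
ep_n = feature_border(Ln, Rn, Lp, Rp)
```

For class P the result is a pair of empty sets, and for class N both sides of the border are {35, rented, male}, which agrees with the values stated in the worked example. Enumeration is exponential in the pattern size, so this sketch is only suitable for checking small cases.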
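[0072]
Next, the operation of the input data classification processing unit 36 is described. FIG. 6 is a flowchart of the processing of the input data classification processing unit 36. In FIG. 6, the input data classification processing unit 36 first obtains, as its inputs, the binarized similar data of class P, Dp = {d1, d2, ..., ds}, and the feature patterns SEP = {p1, p2, ..., pt} (step S201).
[0073]
The input data classification processing unit 36 then sets d1, the first element of the similar data Dp, as the processing target (step S202), and sets p1, the first element of the feature patterns SEP, as the inspection target (step S203).
[0074]
The input data classification processing unit 36 checks whether the feature pattern being inspected is a subset of the similar data being processed (step S204). If it is (step S204, Yes), the unit increments the class P counter by one (step S209).
[0075]
If the feature pattern being inspected is not a subset of the similar data being processed (step S204, No), the input data classification processing unit 36 determines whether all the feature patterns have been inspected (step S205). If a feature pattern remains to be inspected (step S205, No), the unit sets the next feature pattern as the inspection target (step S208) and returns to step S204.
[0076]
When all the feature patterns have been inspected (step S205, Yes), or after the class P counter has been incremented, the input data classification processing unit 36 determines whether all the similar data have been processed (step S206). If similar data remain to be processed (step S206, No), the unit sets the next similar data as the processing target (step S210) and returns to step S203.
[0077]
When all the similar data have been processed (step S206, Yes), the input data classification processing unit 36 outputs the value of the class P counter and ends the processing. Through this processing, the unit counts how many of the similar data belonging to class P contain any of the feature patterns SEP. In other words, the value of the class P counter is the number of class P similar data that match one or more feature patterns.
[0078]
The input data classification processing unit 36 outputs the value of the class N counter by the same processing. The value of the class N counter is the number of class N similar data that match one or more feature patterns. The unit compares the value of the class P counter with the value of the class N counter and classifies the input data into the class with the larger value.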
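The counting in FIG. 6 reduces to a subset test per similar record. The sketch below applies it to the binarized data of the worked example with SEP = {{35, rented, male}}, the union of epLp (empty) and epLn from that example. The function names are assumptions, and the text does not say how ties between counters are resolved, so the tie behavior here (the first class listed wins) is only an artifact of the sketch.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Pattern = FrozenSet[str]


def count_matches(similar: Iterable[Pattern], sep: Iterable[Pattern]) -> int:
    """Number of similar records that contain at least one feature pattern."""
    sep = list(sep)
    return sum(1 for d in similar if any(p <= d for p in sep))


def classify(
    similar_by_class: Dict[str, List[Pattern]], sep: Iterable[Pattern]
) -> Tuple[str, Dict[str, int]]:
    """Return the class with the largest counter and all counter values."""
    sep = list(sep)
    counts = {c: count_matches(ds, sep) for c, ds in similar_by_class.items()}
    return max(counts, key=counts.get), counts


similar_by_class = {
    "P": [frozenset({"rented", "male"}),
          frozenset({"rented", "male", "married"})],
    "N": [frozenset({"35", "male"}),
          frozenset({"35", "male", "married"}),
          frozenset({"35", "rented", "male"}),
          frozenset({"rented", "male", "married"})],
}
sep = [frozenset({"35", "rented", "male"})]

label, counts = classify(similar_by_class, sep)
```

Only data 12 contains the feature pattern, so the class N counter is 1, the class P counter is 0, and under this sketch the input data would be assigned to class N.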
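[0079]
As described above, the feature pattern output device 21 of Embodiment 1 extracts data similar to the input data from the database 22, calculates the maximum pattern set and the minimum pattern set of each class from the similar data, and calculates the feature patterns from these sets. It can therefore calculate feature patterns at high speed regardless of the number of records in the database 22 and the number of items in each record.
[0080]
As a result, the input data can be classified easily by using the calculated feature patterns.
[0081]
Furthermore, because the feature patterns are calculated from data similar to the input data, even local feature patterns can be detected with high accuracy.
[0082]
When similar data are extracted on the basis of the input data, noise may arise in the similar data. Adding a noise removal mechanism to the similar data extraction unit 32 improves both the detection accuracy of the feature patterns and the classification accuracy of the input data.
[0083]
The noise that arises in similar data includes class noise, in which data of another class is mixed into the similar data of a given class, and attribute noise, in which an item of a given similar record is replaced by another item.
[0084]
When class noise is present, the same maximum pattern may appear in both class P and class N in the binarized similar data. When that happens, no feature pattern can be found and the classification accuracy drops markedly. Therefore, when the same pattern appears in both class P and class N, the common pattern is removed from each class and the patterns that are subsets of the removed pattern are newly included, which suppresses the effect of class noise.
[0085]
Attribute noise can be removed by the statistical test shown in FIG. 7. As shown in FIG. 7, attribute noise removal first takes one minimum pattern L as its input (step S301). With I1, I2, ..., Ik denoting the items contained in L, L = {I1, I2, ..., Ik}.
[0086]
Next, I1, the first item of L, is set as the processing target Ii (step S302). A pattern B is then generated by removing the processing target item from L (step S303). After that, a statistical test is performed on B ⇒ P and B ∧ Ii ⇒ P (step S304). This test determines whether the effect of adding the item Ii to the pattern B can be regarded statistically as mere chance. If it can, the item Ii is considered to have appeared because of attribute noise.
[0087]
Specifically, the statistical test sets up the statistical hypothesis that there is no difference between the probability distributions of B ⇒ P and B ∧ Ii ⇒ P, and tests whether this hypothesis can be rejected using the following expression.
T = (S_LP S_L − S_L S_BP) / (S_L S_BP (S_B − S_BP) / N)^(1/2)
Here, S_B is the number of records that match the pattern B, S_L is the number of records that match the pattern B ∧ Ii, S_BP is the number of class P records that match the pattern B, and S_LP is the number of class P records that match the pattern B ∧ Ii.
[0088]
T is known to follow a normal distribution. With the significance level denoted by a, z(a/2) is the value at which the density function of the normal distribution satisfies p(z) = a/2. If T ≥ z(a/2), the hypothesis is taken to hold, that is, there is no statistical difference between B ⇒ P and B ∧ Ii ⇒ P, so Ii is treated as having appeared by chance and is excluded from the pattern set Lp.
[0089]
In FIG. 7, therefore, it is determined from the result of the statistical test whether the hypothesis can be rejected (step S305). If the hypothesis cannot be rejected (step S305, No), the processing target item Ii is removed from L as attribute noise (step S308), and the processing moves to step S306.
[0090]
If the hypothesis can be rejected (step S305, Yes), it is determined whether all the items have been tested (step S306). If an untested item remains (step S306, No), the next item is set as the test target (step S309) and the processing moves to step S303.
[0091]
When all the items have been processed (step S306, Yes), the minimum pattern L from which the attribute noise has been removed is output (step S307), and the processing ends.
[0092]
Giving the similar data extraction unit 32 functions for removing class noise and attribute noise in this way improves both the detection accuracy of the feature patterns and the classification accuracy of the input data.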
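[0093]
(Embodiment 2)
Next, Embodiment 2 of the present invention is described. In Embodiment 1, a single predetermined threshold was set for extracting similar data from the database 22, and the records with a similarity equal to or greater than that threshold were extracted. In Embodiment 2, a threshold is set separately for the class P data and for the class N data, and similar data are extracted class by class. When similar data are extracted so as to reach a predetermined number, a number is set for each of class P and class N, and the extraction is performed separately for each class.
[0094]
FIG. 8 shows the relation between the data and the similarity in Embodiment 2. In FIG. 8, the arrangement of data 1 to 13 is the same as in FIG. 3. As in FIG. 3, concentric circle 51 indicates similarity 3, concentric circle 52 indicates similarity 2, and concentric circle 53 indicates similarity 1. FIG. 8 differs from FIG. 3, however, in that concentric circle 53 is the threshold for the class P data and concentric circle 52 is the threshold for the class N data.
[0095]
Because the similarity threshold for class P is lowered to 1, data 1, 4, 5, and 6 are newly extracted as similar data, as shown in FIG. 9(a). Data 1 and 6 are subsets of data 2, and data 4 is a subset of data 7. Data 5, however, has no superset and so becomes a maximum pattern of class P. In Embodiment 2, Rp therefore gains {35}, which corresponds to data 5, and becomes {{35}, {rented, male, married}}. As shown in FIG. 9(b), the threshold for class N is 2, so the similar patterns of class N do not change.
[0096]
As explained in Embodiment 1, it has been proved that all the feature patterns can be calculated if all the maximum patterns are obtained from all the data. When only the data near the input data are handled, as in the present invention, it is necessary, when calculating feature patterns from the similar data, to add the condition that a feature pattern has more items than the patterns that appear in common in class P and class N of the similar data, in order to guard against missed maximum patterns and prevent a drop in classification accuracy.
[0097]
Setting a threshold for each class and obtaining a sufficient number of samples from every class therefore prevents the drop in classification accuracy caused by missed maximum patterns.
[0098]
The binarization of the similar data and the calculation of the similar pattern sets are the same as in Embodiment 1, so their description is omitted. The similar pattern sets in Embodiment 2, however, use class-specific neighborhoods of the input data and approximate the whole of the data contained in the database 22. The feature pattern calculation therefore uses the jepProducer described above and calculates <epLp, epRp> as
<epLp, epRp> = jepProducer(<{φ}, Rp>, <{φ}, Rn>)
In this embodiment, the feature patterns can thus be calculated from the maximum pattern sets Rp and Rn without using the minimum pattern sets Lp and Ln.
[0099]
Furthermore, in this embodiment SEP = epLp ∪ epLn approximates the feature patterns for the whole database 22. When the input data is classified, the class P count and the class N count can therefore be calculated over all the data contained in the database 22.
[0100]
When the class P count is calculated over the whole database 22, it is preferable to divide the count by the size of the class P data contained in the database 22 so as to correct for the bias in the distribution of class P across the database 22. The same applies to the class N count. Correcting on the basis of the size of the data set belonging to each class in this way allows the input data to be classified accurately even when the class proportions in the database 22 are strongly skewed, for example when there are far more class N records than class P records.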
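The size correction described above amounts to comparing match rates instead of raw counts. The sketch below uses invented numbers purely for illustration; neither the counts nor the class sizes come from the patent, and the function name is an assumption.

```python
from typing import Dict


def normalized_scores(
    counts: Dict[str, int], class_sizes: Dict[str, int]
) -> Dict[str, float]:
    """Divide each class counter by the number of records in that class."""
    return {c: counts[c] / class_sizes[c] for c in counts}


# Hypothetical whole-database figures: class N is ten times larger.
counts = {"P": 30, "N": 120}
class_sizes = {"P": 100, "N": 1000}

scores = normalized_scores(counts, class_sizes)
raw_winner = max(counts, key=counts.get)
corrected_winner = max(scores, key=scores.get)
```

With these made-up figures the raw counters favor class N only because it has more records, while the corrected scores (0.3 against 0.12) favor class P.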
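[0101]
As described above, Embodiment 2 extracts similar data using a different threshold for each class, which guards against missed maximum patterns and improves the classification accuracy of the input data.
[0102]
Embodiment 2 also yields an approximation of the feature patterns of the whole database 22 and classifies the input data with high accuracy regardless of how the classes are distributed.
[0103]
In Embodiments 1 and 2 described above, the input data is classified by comparing the number of occurrences of the feature patterns in the similar data of class P and in the similar data of class N. The classification of the input data is not limited to this method, however, and the input data can be classified using other evaluation criteria or combinations of them.
[0104]
Evaluation criteria that can be used to classify the input data include, for example, the number of feature patterns and the number of items in the feature patterns. With the number of feature patterns, the evaluation is higher when there are more feature patterns. With the number of items in the feature patterns, the evaluation is higher when there are more items.
[0105]
Specifically, when the number of items in the feature patterns is used, the sum of the sizes of the feature patterns belonging to epLp is compared with the sum of the sizes of the feature patterns belonging to epLn, and the input data is classified into the class with the larger value.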
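[0106]
(Embodiment 3)
Embodiment 3 describes a computer system that executes a feature pattern output program having the same functions as the feature pattern output devices of Embodiments 1 and 2.
[0107]
The computer system 100 shown in FIG. 10 comprises a main unit 101; a display 102 that displays information such as images on a display screen 102a in response to instructions from the main unit 101; a keyboard 103 for entering various information into the computer system 100; a mouse 104 for specifying an arbitrary position on the display screen 102a of the display 102; a LAN interface for connecting to a local area network (LAN) 106 or a wide area network (WAN); and a modem 105 for connecting to a public line 107 such as the Internet. The LAN 106 connects the computer system 100 to another computer system (PC) 111, a server 112, a printer 113, and so on. As shown in FIG. 11, the main unit 101 comprises a CPU 121, a RAM 122, a ROM 123, a hard disk drive (HDD) 124, a CD-ROM drive 125, an FD drive 126, an I/O interface 127, and a LAN interface 128.
[0108]
To execute the feature pattern output method on the computer system 100, the feature pattern output program stored on a storage medium is installed on the computer system 100. The installed feature pattern output program is stored on the HDD 124 and executed by the CPU 121 using the RAM 122, the ROM 123, and so on. Here, the storage medium includes portable storage media such as a CD-ROM 109, a floppy disk 108, a DVD disc, a magneto-optical disc, and an IC card; storage devices such as the hard disk 124 provided inside or outside the computer system 100; the database of the server 112 that is connected via the LAN 106 and holds the program to be installed; another computer system 111 and its database; and transmission media on the public line 107.
[0109]
As described above, in Embodiment 3 a feature pattern output program that implements in software the configuration of the feature pattern output devices of Embodiments 1 and 2 is executed on the computer system 100. The same effects as those of the feature pattern output devices of Embodiments 1 and 2 can thus be obtained with an ordinary computer system.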
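[0110]
As described above, according to the present invention, similar data resembling the input data are extracted from the database and the feature patterns that characterize each class are calculated from the extracted similar data. This provides a feature pattern output device that can output feature patterns at high speed regardless of the size of the database.
[0111]
According to the present invention, the value of each item of the data extracted from the database is compared with the value of each item of the input data, the maximum pattern set and the minimum pattern set are extracted from the combinations of matching items, and the feature patterns are calculated from these sets. This provides a feature pattern output device that can output feature patterns at high speed with a simple configuration.
[0112]
According to the present invention, the common patterns that appear across a plurality of classes are obtained from the minimum pattern sets, and the feature patterns are calculated as supersets of the common patterns. This provides a feature pattern output device that can output feature patterns at high speed.
[0113]
According to the present invention, the conditions for extracting similar data are changed for each class so that a sufficient number of similar data are obtained for every class. This provides a feature pattern output device that approximates the whole database with the similar data and can output feature patterns at high speed.
[0114]
According to the present invention, an item is removed from a maximum pattern that appears across a plurality of classes, which prevents a maximum pattern from spanning a plurality of classes. This provides a feature pattern output device that can output feature patterns at high speed and with high accuracy.
[0115]
According to the present invention, the input data is classified based on the feature patterns calculated from the similar data. This provides a feature pattern output device that can classify input data at high speed regardless of the size of the database.
[0116]
According to the present invention, the number of occurrences of the feature patterns in the similar data of each class is counted, and the input data is classified into the class with the largest count. This provides a feature pattern output device that can output feature patterns with which input data can be classified at high speed and with high accuracy.
[0117]
According to the present invention, when an item is numerical data, a predetermined numerical range is set, and the item value of the input data and the item value of the similar data are determined to match when they lie within that range of each other. This provides a feature pattern output device that can output feature patterns at high speed with a simple configuration even when the items include numerical data.
[Industrial Applicability]
[0118]
As described above, the feature pattern output device according to the present invention is particularly useful for speeding up the extraction of feature patterns from large databases.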
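[Brief Description of the Drawings]
[0119]
[FIG. 1] A schematic configuration diagram of the feature pattern output device according to Embodiment 1 of the present invention.
[FIG. 2] A diagram showing a specific example of input data and similar data.
[FIG. 3] A diagram showing a data space in which the data are arranged according to their similarity.
[FIG. 4] A diagram showing the maximum pattern sets and the minimum pattern sets.
[FIG. 5] A diagram showing the processing of the feature pattern set calculation unit.
[FIG. 6] A flowchart of the processing of the input data classification processing unit 36.
[FIG. 7] A diagram explaining the statistical test that removes attribute noise.
[FIG. 8] A diagram showing the relation between the data and the similarity in Embodiment 2.
[FIG. 9] A diagram showing the maximum pattern sets and the minimum pattern sets in Embodiment 2.
[FIG. 10] An explanatory diagram of the computer system in Embodiment 3.
[FIG. 11] An explanatory diagram of the configuration of the main unit shown in FIG. 10.
[Technical field]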
[0001]
  The present invention relates to a feature pattern output device that, from a database storing data having a plurality of items with each piece of data assigned to one of a plurality of classes, outputs a combination of items characteristically included in a class as the feature pattern of that class, and in particular to a feature pattern output device that can output feature patterns at high speed even when the database is large.
[Background]
[0002]
  In recent years, methods have been devised for extracting correlations between data and rules of the data stored in a database. The correlation between data and the rules possessed by the data can be used when classifying data stored in the database or when classifying new data.
[0003]
  Conventionally, Agrawal, R., "Fast Algorithms for Mining Association Rules" and the corresponding patent document "System and Method for Mining Sequential Patterns in a Large Database" (Japanese Patent Laid-Open No. 8-263346) have been published as association rule learning methods that extract rules from a database and feed them back to the database.
[0004]
  According to the technique disclosed there, patterns are formed by combining the constituent elements of data, called items, and the association rules of the data are expressed by frequently occurring patterns.
[0005]
  However, with this method the cost of extracting association rules is high, and when the contents of the database change, time is needed before the association rules reflect the change. For this reason, association rules are often extracted with the database offline, and the rules track database updates poorly.
  Furthermore, the processing time required to extract association rules and to classify data based on the extracted rules varies greatly with the parameter settings, and the association rules obtained also depend heavily on the parameters. In other words, setting the parameters appropriately requires specialized knowledge and experience. Depending on the parameter settings, the usefulness of the obtained rules may be reduced, or the processing time may become so long that operating the association rules is impractical.
[0006]
  Another rule extraction method has been published in J. Li, G. Dong, K. Ramamohanarao, and L. Wong, "DeEPs: A new instance-based discovery and classification system," Technical report, Dept of CSSE, University of Melbourne, 2000. DeEPs, published there, enables real-time pattern discovery in which applicable patterns are learned after the input data is given. The database can therefore be updated at any time without being taken offline. DeEPs also requires no parameters for pattern discovery, so less expertise and experience are required during operation.
[0007]
  However, since DeEPs process all data in the database at the time of pattern discovery, the necessary processing capacity increases according to the number of data that the database has. Therefore, when the number of data in the database is large, there is a problem that the pattern extraction process requires an unacceptable time as a response time in the real-time process.
[0008]
  Furthermore, in DeEPs, processing time is required in proportion to the number of items that are data components. Therefore, when the number of items included in each data is large, there is a problem that a huge amount of time is required for the pattern extraction process.
[0009]
  The present invention has been made to solve the above-described problems of the prior art. Its object is to provide a feature pattern output device that can execute pattern extraction at high speed even for a large-scale database that contains a large number of records, each having many items.
DISCLOSURE OF THE INVENTION
[0010]
  In order to solve the above-described problems and achieve the object, a feature pattern output device according to the present invention outputs, from a database storing data consisting of a plurality of items with the data divided into a plurality of classes, a combination of items that characterizes each class as the feature pattern of that class. The device comprises: similar data extraction means for extracting, when input data is received, similar data resembling the input data from the database for each class; similar pattern set calculation means for calculating a similar pattern set for each class from the similar data extracted by the similar data extraction means; and feature pattern calculation means for calculating a feature pattern for each class from the similar pattern set calculated by the similar pattern set calculation means.
[0011]
  According to the present invention, similar data similar to the input data is extracted from the database, and a feature pattern that characterizes each class is calculated from the extracted similar data.
[0012]
  In the feature pattern output device according to the present invention, the similar pattern set calculation means extracts, as a pattern set, the combinations of items for which the items forming the similar data extracted by the similar data extraction means match the items forming the input data; extracts, as a minimum pattern set, the minimum patterns, which are item combinations that have no subset other than themselves in the pattern set; extracts, as a maximum pattern set, the maximum patterns, which are item combinations that have no superset other than themselves in the pattern set; and outputs the minimum pattern set and the maximum pattern set as the similar pattern set.
[0013]
  According to the present invention, each item of the extracted data extracted from the database is compared with each item of the input data, the maximum pattern set and the minimum pattern set are extracted from the matching item combination, and the maximum pattern set and the minimum pattern set are extracted. The feature pattern is calculated based on the pattern set.
[0014]
  In the feature pattern output device according to the present invention, the feature pattern calculation means extracts, from the minimum pattern sets, a common pattern set that appears across a plurality of classes, and calculates feature patterns that include all the items of the common pattern set.
[0015]
  According to the present invention, a common pattern that appears across a plurality of classes is obtained based on a minimum pattern set, and a feature pattern is calculated as a superset of the common pattern.
[0016]
  In the feature pattern output device according to the present invention, the similar data extraction means, when extracting similar data from the database, extracts the similar data under conditions that differ from class to class.
[0017]
  According to the present invention, when extracting similar data, the conditions are changed for each class, and a sufficient number of similar data is acquired for each class.
[0018]
  In the feature pattern output device according to the present invention, the similar pattern set calculation means, when a maximum pattern appears across a plurality of classes, excludes a predetermined item from that maximum pattern.
[0019]
  According to the present invention, it is possible to prevent a situation in which a feature pattern does not exist by removing an item of a maximum pattern that appears across a plurality of classes.
[0020]
  The feature pattern output device according to the present invention further comprises classification means for classifying the input data into one of the plurality of classes based on the feature patterns calculated by the feature pattern calculation means.
[0021]
  According to the present invention, the input data is classified based on the feature pattern calculated from the similar data.
[0022]
  In the feature pattern output device according to the present invention, the classification means counts the number of feature patterns in the similar data of each class and classifies the input data into the class for which the count is largest.
[0023]
  According to the present invention, the number of appearances of feature patterns in the similar data of each class is counted, and the input data is classified into the class having the largest count result.
[0024]
  In the feature pattern output device according to the present invention, the similar pattern set calculation means determines that two item values match when the value of a given item forming the input data and the value of the corresponding item forming the similar data lie within a predetermined numerical range of each other.
[0025]
  According to this invention, when the item is numerical data, a predetermined numerical range is set, and when the item value of the input data and the item value of the similar data are within the predetermined range, the value of both items Is determined to match.
BEST MODE FOR CARRYING OUT THE INVENTION
[0026]
  Preferred embodiments of the feature pattern output device according to the present invention are described in detail below with reference to the accompanying drawings.
[0027]
(Embodiment 1)
  FIG. 1 is a schematic configuration diagram of a feature pattern output device according to Embodiment 1 of the present invention. In FIG. 1, the feature pattern output device 21 is connected to a database 22. The database 22 stores information about customers, and one record corresponds to one customer. Each record has fields such as "age", "residence", "sex", and "marriage", and holds a value for each field. Hereinafter, the combination of a field and its value is referred to as an item. The database 22 divides the customers, that is, the records, into classes according to whether credit can be granted. Customers for whom credit is possible are classified as "class P", and customers for whom credit is not possible are classified as "class N".
[0028]
  The feature pattern output device 21 includes an input processing unit 31, a similar data extraction unit 32, a binarization processing unit 33, a similar pattern set calculation unit 34, a feature pattern set calculation unit 35, and an input data classification processing unit 36. When the input processing unit 31 receives customer information as input data, it outputs the input data to the similar data extraction unit 32 and the binarization processing unit 33.
[0029]
  The similar data extraction unit 32 extracts data similar to the input data from the database 22 and outputs it to the binarization processing unit 33 as similar data. The binarization processing unit 33 binarizes the similar data based on the input data and then transmits the binarized similar data to the similar pattern set calculation unit 34 and the input data classification processing unit 36.
[0030]
  The similar pattern set calculation unit 34 calculates a similar pattern set for each of class P and class N based on the binarized similar data. The feature pattern set calculation unit 35 outputs, as feature patterns, the combinations of items in the similar pattern sets that characteristically appear in class P or class N.
[0031]
  Further, the input data classification processing unit 36 compares the binarized similar data with the feature pattern, and determines whether to classify the input data into class P or class N.
[0032]
  The feature pattern output device 21 outputs this feature pattern and the classification result of the input data. That is, since the feature pattern output device 21 extracts data similar to the input data from the database 22 and calculates a feature pattern from the similar data, the feature pattern output device 21 does not depend on the number of data in the database 22 or the number of items of each data. The feature pattern can be calculated at high speed.
[0033]
  Next, each process will be described in detail using specific examples.
  FIG. 2 shows a specific example of input data and similar data. FIG. 2A is an example of input data, and FIG. 2B is an example of data stored in the database 22. As shown in FIG. 2, the input data has “35” as the value of “age”, “rental” as the value of “house”, “male” as the value of “sex”, and “married” as the value of “marriage”.
[0034]
  The similar data extraction unit 32 uses a similarity based on the City-block distance as the similarity function and extracts similar data from the database 22.
  Specifically, where n is the number of items, X is data stored in the database 22, and Y is the input data, the similarity is given by the following expression.
[Expression 1]

  Here, the item <fi: xi> indicates that the value of the attribute “fi” is “xi”. Values of numerical attributes are all normalized to the interval [0, 1], and α is a radius between 0 and 1. That is, δ is 1 when the value lies within the radius α around the corresponding value of the input data, and δ is 0 when the value lies outside the radius α.
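  One form of the similarity that is consistent with the description above (a count of matching items, with δ equal to 1 inside the radius α and 0 outside it) is sketched below. The name Sim and the exact case split for δ are assumptions made for illustration only.

```latex
\mathrm{Sim}(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i), \qquad
\delta(x_i, y_i) =
\begin{cases}
1 & \text{if } f_i \text{ is discrete and } x_i = y_i, \\
1 & \text{if } f_i \text{ is numerical and } |x_i - y_i| \le \alpha, \\
0 & \text{otherwise.}
\end{cases}
```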
[0035]
  In other words, for each piece of data stored in the database, this similarity function counts the number of items that match the items included in the input data. In FIG. 2 (b), the items of each piece of data that match the input data are circled, and the output of the similarity function is shown as the similarity. “Age” is numerical data; here a margin of 5, corresponding to α = 0.18, is allowed, so the items are determined to match when the age value is between 30 and 40.
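  The counting described above can be sketched as follows. This is a minimal illustration, not the device's implementation: the function name `similarity`, the dictionary representation of a record, the sample record `rec`, and the use of a margin in raw units (5 years) instead of the normalized radius α are all assumptions.

```python
def similarity(record, query, numeric_margin):
    """Count the items of `query` that `record` matches.

    `numeric_margin` maps a numerical attribute to the allowed absolute
    difference (an assumption; the text uses a normalized radius alpha).
    """
    score = 0
    for attr, q_value in query.items():
        if attr not in record:
            continue
        value = record[attr]
        if attr in numeric_margin:
            # Numerical attribute: match when within the margin.
            if abs(value - q_value) <= numeric_margin[attr]:
                score += 1
        elif value == q_value:
            # Discrete attribute: match only on equality.
            score += 1
    return score


# Input data of FIG. 2 (a); `rec` is a hypothetical database record.
query = {"age": 35, "house": "rental", "sex": "male", "marriage": "married"}
margin = {"age": 5}
rec = {"age": 38, "house": "owned", "sex": "male", "marriage": "married"}
print(similarity(rec, query, margin))
```

  With these hypothetical values the age, sex, and marriage items match and the house item does not.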
[0036]
  Further, FIG. 3 shows a data space in which the data group shown in FIG. 2 (b) is arranged according to the similarity. In FIG. 3, input data is indicated by “★”, data belonging to class P is indicated as “◯”, and data belonging to class N is indicated as “x”. The numbers shown in the vicinity of each symbol are the data numbers in FIG. 2 (b).
[0037]
  As shown in FIG. 3, data 7, 10, 12, and 13, which have a similarity of 3, are closest to the input data and lie on the concentric circle 41. Data 2 and 9, which have a similarity of 2, lie on the next concentric circle. Data 1, 4, 5, 6, and 11, which have a similarity of 1, lie on the next concentric circle 43, and data 3 and 8, which have a similarity of 0, lie outside the concentric circle 43.
[0038]
  The similar data extraction unit 32 extracts, as similar data, data whose similarity is equal to or greater than a predetermined threshold. Alternatively, a predetermined number of pieces of data, for example five, are extracted as similar data in descending order of similarity. In this case, all data having the same similarity as an extracted piece of data is also included in the similar data. Therefore, in FIG. 3, six pieces of data are extracted as similar data: data 7, 10, 12, and 13, which have a similarity of 3, and data 2 and 9, which have a similarity of 2.
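  The two extraction rules can be sketched as follows, using the similarities read from FIG. 3. The function name `extract_similar` and its parameters are hypothetical; the sketch only illustrates that ties at the cut-off similarity are all kept.

```python
def extract_similar(scored, threshold=None, count=None):
    """Return the ids of similar data from (data_id, similarity) pairs.

    Give either a similarity `threshold` or a target `count`; with `count`,
    every record tied with the last selected similarity is also kept.
    """
    if (threshold is None) == (count is None):
        raise ValueError("give exactly one of threshold or count")
    if threshold is None:
        if count < 1 or not scored:
            return []
        ranked = sorted((s for _, s in scored), reverse=True)
        # Similarity of the count-th best record becomes the cut-off.
        threshold = ranked[min(count, len(ranked)) - 1]
    return sorted(i for i, s in scored if s >= threshold)


# (data number, similarity) as described for FIG. 3
fig3 = [(1, 1), (2, 2), (3, 0), (4, 1), (5, 1), (6, 1), (7, 3),
        (8, 0), (9, 2), (10, 3), (11, 1), (12, 3), (13, 3)]
print(extract_similar(fig3, count=5))
```

  Requesting five pieces of data yields six, because data 2 and 9 share the cut-off similarity of 2.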
[0039]
  The binarization processing unit 33 performs binarization processing on the similar data extracted by the similar data extraction unit 32. Specifically, the item for which δ = 0 is excluded from the similar data, and the value of the item for which δ = 1 is replaced with the value for the same item in the input data. Here, the value of the item of the discrete value attribute is the same as the input data. Therefore, the similar data can be binarized by rewriting the value of the numeric attribute item to the value of the input data item.
  Therefore, the following similar data is obtained as a result of binarization.
  Data 2 {<house: rental> <sex: male>}
  Data 7 {<house: rental> <sex: male> <marriage: married>}
  Data 9 {<age: 35> <sex: male>}
  Data 10 {<age: 35> <sex: male> <marriage: married>}
  Data 12 {<age: 35> <house: rental> <sex: male>}
  Data 13 {<house: rental> <sex: male> <marriage: married>}
  In this way, by binarizing the similar data, the items included in the similar data are only items included in the input data. Therefore, the feature pattern calculation process can be performed only by calculating the item set.
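  A minimal sketch of this binarization is given below. The function name `binarize`, the record layout, and the sample record are assumptions; the raw attribute values of the data in FIG. 2 (b) are not reproduced here.

```python
def binarize(record, query, numeric_margin):
    """Keep only the items of `record` that match `query`, rewritten to
    the value held by `query` (so a matching age of 33 becomes 35)."""
    items = set()
    for attr, q_value in query.items():
        if attr not in record:
            continue
        value = record[attr]
        if attr in numeric_margin:
            matched = abs(value - q_value) <= numeric_margin[attr]
        else:
            matched = value == q_value
        if matched:
            # delta = 1: keep the item with the input data's value
            items.add((attr, q_value))
        # delta = 0: the item is dropped
    return frozenset(items)


query = {"age": 35, "house": "rental", "sex": "male", "marriage": "married"}
margin = {"age": 5}
# Hypothetical record that would binarize to the same item set as data 9.
record = {"age": 33, "house": "owned", "sex": "male", "marriage": "single"}
print(sorted(binarize(record, query, margin)))
```

  After this step every record is simply a subset of the items of the input data, which is what allows the later steps to work with plain set operations.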
[0040]
  Next, the processing of the similar pattern set calculation unit 34 will be described. The similar pattern set calculation unit 34 calculates a maximum pattern set and a minimum pattern set for each of class P and class N. The maximum pattern set consists of the item sets that have no superset among the similar data of the class. The minimum pattern set consists of the item sets that have no subset among the similar data of the class.
[0041]
  FIG. 4 shows the maximum pattern set and the minimum pattern set. FIG. 4 (a) is a diagram showing the inclusion relations of the sets in class P, and FIG. 4 (b) is a diagram showing the inclusion relations of the sets in class N.
[0042]
  Here, for class P,
  Data 2 {<house: rental> <sex: male>}
  Data 7 {<house: rental> <sex: male> <marriage: married>}
all items of data 2 are included in data 7. That is, data 2 is a subset of data 7, and data 7 is a superset of data 2. This relationship is indicated by solid arrows in FIG. 4 (a).
[0043]
  Here, in the similar data of class P, there is no set that is a superset of data 7. Therefore, data 7 is a maximum pattern of class P. On the other hand, data 1 and 6 are subsets of data 2. However, data 1 and 6 have a similarity of 1 and are not selected as similar data. That is, since the similar data of class P contains no set that is a subset of data 2, data 2 is a minimum pattern of class P.
[0044]
  Similarly, for class N,
  Data 9 {<age: 35> <sex: male>}
  Data 10 {<age: 35> <sex: male> <marriage: married>}
  Data 12 {<age: 35> <house: rental> <sex: male>}
  Data 13 {<house: rental> <sex: male> <marriage: married>}
the items of data 9 are all included in data 10 and 12. That is, data 9 is a subset of both data 10 and data 12, and data 10 and 12 are both supersets of data 9. This relationship is indicated by solid arrows in FIG. 4 (b).
[0045]
  Here, in the similar data of class N, there is no set that is a superset of data 10 or of data 12. Accordingly, data 10 and 12 are each maximum patterns of class N. Further, since the similar data of class N contains no set that is a subset of data 9, data 9 is a minimum pattern of class N.
[0046]
  Note that data 13 has neither a superset nor a subset in the similar data of class N. Therefore, data 13 is both a maximum pattern and a minimum pattern of class N.
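  The selection of minimum and maximum patterns described above can be sketched with plain set comparisons. The function names and the use of bare strings for items are assumptions made for brevity.

```python
def minimal_patterns(patterns):
    """Patterns that have no proper subset among `patterns`."""
    ps = set(patterns)
    return {p for p in ps if not any(q < p for q in ps)}


def maximal_patterns(patterns):
    """Patterns that have no proper superset among `patterns`."""
    ps = set(patterns)
    return {p for p in ps if not any(q > p for q in ps)}


# Binarized similar data of class P (data 2, 7) and class N (data 9, 10, 12, 13)
Dp = [frozenset({"rental", "male"}),
      frozenset({"rental", "male", "married"})]
Dn = [frozenset({"35", "male"}),
      frozenset({"35", "male", "married"}),
      frozenset({"35", "rental", "male"}),
      frozenset({"rental", "male", "married"})]

Lp, Rp = minimal_patterns(Dp), maximal_patterns(Dp)
Ln, Rn = minimal_patterns(Dn), maximal_patterns(Dn)
print(sorted(sorted(p) for p in Ln))
print(sorted(sorted(p) for p in Rn))
```

  Data 13 appears in both Ln and Rn, matching the note that it is a maximum pattern and a minimum pattern at the same time.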
[0047]
  Here, for class P, let Dp be the binarized similar data, Lp the minimum pattern set, and Rp the maximum pattern set. The pattern set [Lp, Rp] is the set of all patterns that are supersets of at least one minimum pattern and subsets of at least one maximum pattern. Therefore,
  Dp ⊆ [Lp, Rp]
holds.
[0048]
  In the data shown in FIG. 4 (a), Lp = {{rental, male}}, Rp = {{rental, male, married}}, and Dp = {{rental, male}, {rental, male, married}}.
  Similarly, for class N, let Dn be the binarized similar data, Ln the minimum pattern set, and Rn the maximum pattern set. The pattern set [Ln, Rn] is the set of all patterns that are supersets of at least one minimum pattern and subsets of at least one maximum pattern. Therefore,
  Dn ⊆ [Ln, Rn]
holds.
[0049]
  In the data shown in FIG. 4 (b), Ln = {{35, male}, {rental, male, married}}, Rn = {{35, rental, male}, {35, male, married}, {rental, male, married}}, and Dn = {{35, male}, {35, rental, male}, {35, male, married}, {rental, male, married}}.
[0050]
  In the example shown in FIG. 4, Dp = [Lp, Rp]. In general, however, any pattern that is a superset of a minimum pattern and a subset of a maximum pattern is included in [Lp, Rp], even if it does not exist in the similar data, that is, in Dp.
[0051]
  Here, <L, R> is defined as the border of the minimum pattern set L and the maximum pattern set R. The border <L, R> represents the pattern set [L, R] as a pair consisting of the minimum patterns and the maximum patterns. Therefore, by using borders, set operations can be carried out on the maximum patterns and minimum patterns alone, without directly handling the individual elements of the sets, which greatly improves calculation efficiency.
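  As an illustration of why a border is enough to represent [L, R], membership of a pattern can be tested against the two bounds alone. The function name `in_border` is hypothetical.

```python
def in_border(pattern, L, R):
    """True if `pattern` lies in [L, R]: it contains at least one minimum
    pattern and is contained in at least one maximum pattern."""
    return any(l <= pattern for l in L) and any(pattern <= r for r in R)


# Border <Ln, Rn> of class N from FIG. 4 (b)
Ln = {frozenset({"35", "male"}), frozenset({"rental", "male", "married"})}
Rn = {frozenset({"35", "rental", "male"}),
      frozenset({"35", "male", "married"}),
      frozenset({"rental", "male", "married"})}

print(in_border(frozenset({"35", "male", "married"}), Ln, Rn))
print(in_border(frozenset({"male"}), Ln, Rn))
```

  The first pattern lies between {35, male} and {35, male, married}; the second contains no minimum pattern and therefore falls outside the border.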
[0052]
  The similar pattern set calculation unit 34 outputs the border <Lp, Rp> and the border <Ln, Rn> as similar pattern sets to the feature pattern set calculation unit 35, and ends the process.
[0053]
  Next, the operation of the feature pattern set calculation unit 35 will be described. First, when Rp and Rn are the maximum patterns of class P and class N for all data, it has been proved that [{φ}, Rp] − [{φ}, Rn] is a pattern set that includes all patterns appearing only in class P (J. Li and K. Ramamohanarao. The space of jumping emerging patterns and its incremental maintenance algorithm. In Proceedings of the 17th International Conference on Machine Learning, pages 551-558. Morgan Kaufmann, 2000).
[0054]
  In the present invention, Rp and Rn are obtained by processing only the data similar to the input data, so there is no guarantee that they are the maximum patterns of the entire data. However, similar data has a high similarity, that is, many of its items match the items of the input data, and a maximum pattern usually has a large number of items. Therefore, there is a high possibility that the maximum patterns are included in the similar patterns.
[0055]
  However, even if most of the maximum patterns are included, some maximum patterns may still be missed, and any such miss may lead to an incorrect feature pattern being found. Such an incorrect feature pattern reduces classification accuracy. Therefore, when a feature pattern is calculated from similar data, a condition is added that the feature pattern must have more items than the patterns that appear in common in class P and class N. This compensates for missed maximum patterns and prevents a reduction in classification accuracy.
[0056]
  The processing operation of the feature pattern set calculation unit 35 is shown in FIG. 5. In FIG. 5, the feature pattern set calculation unit 35 first obtains, from the similar pattern sets <Lp, Rp> and <Ln, Rn>, the patterns that appear in common in the pattern sets [{φ}, Lp] and [{φ}, Ln]. Specifically, the output data epLp and epRp are first initialized as epLp = {} and epRp = {}. Next, <{φ}, [c1, ..., ck]> is calculated by intersecOperation (<{φ}, Lp>, <{φ}, Ln>) (step S102). This intersecOperation is the same as that shown in the above-mentioned document; it outputs all patterns that appear in common in the sets indicated by the two borders <{φ}, Lp> and <{φ}, Ln> in the form <{φ}, [c1, ..., ck]>.
[0057]
  That is, this process yields [c1, ..., ck], the set of maximum patterns that appear in common in the pattern sets [{φ}, Lp] and [{φ}, Ln]. Since every ci included in [c1, ..., ck] is a maximum common pattern, any superset of ci falls into one of the following cases:
  ・ It appears only in class P data
  ・ It appears only in class N data
  ・ It appears in neither class P nor class N
[0058]
  Therefore, for each element ci of [c1, ..., ck], searching for the patterns that include ci, appear in class P, and do not appear in class N yields the set of patterns that characteristically appear in class P.
[0059]
  Accordingly, after obtaining [c1, ..., ck], the feature pattern set calculation unit 35 sets the first pattern c1 as the processing target (step S103) and obtains, from the maximum pattern set Rp of class P, the pattern set rp consisting of the supersets of the common pattern being processed (step S104). It then obtains, from the maximum pattern set Rn of class N, the pattern set rn consisting of the supersets of the common pattern being processed (step S105).
[0060]
  Next, the feature pattern set calculation unit 35 obtains the pattern set that appears in the pattern set [{φ}, rp] and does not appear in the pattern set [{φ}, rn]. Specifically, <el, er> is calculated by jepProducer (<{φ}, rp>, <{φ}, rn>) (step S106). This jepProducer is the same as that shown in the above document; it outputs, in the form of a border <el, er>, the patterns that appear in the pattern set [{φ}, rp] indicated by the border <{φ}, rp> and do not appear in the pattern set [{φ}, rn] indicated by the border <{φ}, rn>.
[0061]
  Here, if el is not {φ} (No in step S107), the feature pattern set calculation unit 35 adds the common pattern being processed to each of el and er to create a border <eL, eR> (step S108). Since every pattern in the set indicated by the border <eL, eR> is a superset of the common pattern being processed, this pattern set appears in class P and does not appear in class N.
[0062]
  The feature pattern set calculation unit 35 adds the border <eL, eR> to the border <epLp, epRp> (step S109). The border <epLp, epRp> is the data that is finally output as the feature pattern. Here, epLp is maintained so that it contains only minimum patterns as elements, and patterns that are not minimum are excluded (step S110).
[0063]
  After step S110 ends, or when el is {φ} (step S107, Yes), the feature pattern set calculation unit 35 determines whether or not the processing has been completed for all elements of the pattern set [c1, ..., ck] (step S111). If there is an element that has not yet been processed (No at step S111), the feature pattern set calculation unit 35 sets the next element as the processing target (step S113) and proceeds to step S104.
[0064]
  On the other hand, when the processing has been completed for all the elements (step S111, Yes), the feature pattern set calculation unit 35 outputs a border <epLp, epRp> (step S112).
[0065]
  The feature pattern set calculation unit 35 can also calculate the border <epLn, epRn> for class N in the same manner. Using <epLp, epRp> and <epLn, epRn>, the feature pattern set calculation unit 35 outputs the feature pattern set SEP given by
  SEP = epLp∪epLn
This feature pattern set SEP is the union of the minimum patterns that appear characteristically in class P or class N. The feature pattern set calculation unit 35 outputs the feature pattern set SEP to the outside of the feature pattern output device 21 and also to the input data classification processing unit 36.
[0066]
  When the processing of the feature pattern set calculation unit 35 is applied to the data shown in FIG. 4, the minimum pattern set of class P is Lp = {{rental, male}} and the minimum pattern set of class N is Ln = {{35, male}, {rental, male, married}}, so the pattern set that appears in common is {{rental, male}} (step S102).
[0067]
  Therefore, the subsequent processing is continued with ci = {rental, male} (step S103).
  For class P, among the maximum pattern set Rp = {{rental, male, married}} of class P, the supersets of ci = {rental, male} are rp = {{rental, male, married}} (step S104). Similarly, for class N, among the maximum pattern set Rn = {{35, rental, male}, {35, male, married}, {rental, male, married}} of class N, the supersets of ci = {rental, male} are rn = {{35, rental, male}, {rental, male, married}} (step S105).
[0068]
  The pattern set that appears in the obtained [{φ}, rp] and does not appear in [{φ}, rn] is obtained by jepProducer (<{φ}, rp>, <{φ}, rn>) as <el, er> = <{φ}, {φ}> (step S106).
[0069]
  Since the maximum common pattern set {ci} has only one element, the feature pattern of class P in this example is ultimately <epLp, epRp> = <{φ}, {φ}>.
[0070]
  On the other hand, for class N, the processing results up to step S105 are the same as for class P: ci = {rental, male}, rn = {{35, rental, male}, {rental, male, married}}, and rp = {{rental, male, married}} (steps S101 to S105).
[0071]
  The pattern set that appears in the obtained [{φ}, rn] and does not appear in [{φ}, rp] is obtained by jepProducer (<{φ}, rn>, <{φ}, rp>) as <el, er> = <{35}, {35, rental, male}> (step S106). The border obtained by adding c1 to each of el and er is <eL, eR> = <{35, rental, male}, {35, rental, male}> (step S108). Since the maximum common pattern set {c1} has only one element, the class N feature pattern set in this example is <epLn, epRn> = <{35, rental, male}, {35, rental, male}> (steps S109 to S112).
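  The flow of FIG. 5 can be illustrated on this small example with the sketch below. It is not the border-based intersecOperation or jepProducer of the cited document: for readability it swaps in brute-force enumeration of every subset, which is only practical for tiny item sets. All function names are hypothetical, and an empty result is returned as a pair of empty sets where the text writes <{φ}, {φ}>.

```python
from itertools import combinations


def subsets_of(patterns):
    """Every subset of any pattern in `patterns`, i.e. the set [{φ}, patterns]."""
    out = set()
    for p in patterns:
        items = sorted(p)
        for k in range(len(items) + 1):
            out.update(frozenset(c) for c in combinations(items, k))
    return out


def minimal(ps):
    return {p for p in ps if not any(q < p for q in ps)}


def maximal(ps):
    return {p for p in ps if not any(q > p for q in ps)}


def feature_border(L_own, R_own, L_other, R_other):
    """Brute-force stand-in for steps S101 to S112 for one class."""
    ep_left, ep_right = set(), set()                      # S101
    # S102: maximum patterns common to [{φ}, L_own] and [{φ}, L_other]
    common = maximal(subsets_of(L_own) & subsets_of(L_other))
    for c in common:                                      # S103 / S113
        r_own = {r for r in R_own if c <= r}              # S104
        r_other = {r for r in R_other if c <= r}          # S105
        # S106: patterns under r_own that are not under r_other
        diff = subsets_of(r_own) - subsets_of(r_other)
        if not diff:                                      # S107
            continue
        ep_left |= {e | c for e in minimal(diff)}         # S108, S109
        ep_right |= {e | c for e in maximal(diff)}
        ep_left = minimal(ep_left)                        # S110
    return ep_left, ep_right                              # S112


Lp = {frozenset({"rental", "male"})}
Rp = {frozenset({"rental", "male", "married"})}
Ln = {frozenset({"35", "male"}), frozenset({"rental", "male", "married"})}
Rn = {frozenset({"35", "rental", "male"}),
      frozenset({"35", "male", "married"}),
      frozenset({"rental", "male", "married"})}

ep_p = feature_border(Lp, Rp, Ln, Rn)   # class P
ep_n = feature_border(Ln, Rn, Lp, Rp)   # class N
print(ep_p)
print(ep_n)
```

  On the data of FIG. 4 the sketch gives no feature pattern for class P and the single pattern {35, rental, male} for class N, in agreement with the worked example.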
[0072]
  Next, the operation of the input data classification processing unit 36 will be described. FIG. 6 is a flowchart for explaining the processing operation of the input data classification processing unit 36. In FIG. 6, the input data classification processing unit 36 first obtains, as input data, the binarized similar data Dp = {d1, d2, ..., ds} of class P and the feature pattern set SEP = {p1, p2, ..., pt} (step S201).
[0073]
  Subsequently, the input data classification processing unit 36 sets d1 which is the first element in the similar data Dp as a processing target (step S202). Furthermore, the input data classification processing unit 36 sets p1 which is the first element in the feature pattern SEP as an inspection target (step S203).
[0074]
  The input data classification processing unit 36 checks whether or not the feature pattern to be inspected is a subset of similar data to be processed (step S204). If the feature pattern to be inspected is a subset of the similar data to be processed (Yes in step S204), the input data classification processing unit 36 increments the value of the class P counter by one (step S209).
[0075]
  On the other hand, when the feature pattern to be inspected is not a subset of the similar data to be processed (step S204, No), the input data classification processing unit 36 determines whether or not the inspection has been completed for all the feature patterns. Is determined (step S205). If there is a feature pattern that has not yet been inspected (step S205, No), the input data classification processing unit 36 sets the next feature pattern as an inspection target (step S208), and proceeds to step S204.
[0076]
  When inspection has been completed for all feature patterns (Yes in step S205), or after the value of the class P counter has been increased, the input data classification processing unit 36 determines whether or not processing has been completed for all similar data (step S206). If there is similar data that has not yet been processed (No at step S206), the input data classification processing unit 36 sets the next similar data as the processing target (step S210) and proceeds to step S203.
[0077]
  On the other hand, when the process is completed for all similar data (step S206, Yes), the input data classification processing unit 36 outputs the value of the class P counter and ends the process. By this processing, the input data classification processing unit 36 can count the number of similar data including any one of the feature patterns SEP among the similar data belonging to the class P. That is, the value of the class P counter is the number of data matching one or more feature patterns among the similar data of class P.
[0078]
  Further, the input data classification processing unit 36 outputs the value of the class N counter by the same processing. The value of this class N counter is the number of data that match one or more feature patterns among the similar data of class N. The input data classification processing unit 36 compares the value of the class P counter with the value of the class N counter, and classifies the input data into the class having the larger value.
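  The counting of FIG. 6 and the final comparison can be sketched as follows. The function names are hypothetical, and returning `None` when the two counters are equal is an assumption, since the handling of a tie is not described.

```python
def count_matching(similar_data, feature_patterns):
    """Number of similar data that contain at least one feature pattern
    (each piece of data is counted at most once, as in FIG. 6)."""
    return sum(1 for d in similar_data
               if any(p <= d for p in feature_patterns))


def classify(dp, dn, sep):
    """Compare the class P and class N counters; None on a tie (assumption)."""
    count_p = count_matching(dp, sep)
    count_n = count_matching(dn, sep)
    if count_p == count_n:
        return None
    return "P" if count_p > count_n else "N"


Dp = [frozenset({"rental", "male"}),
      frozenset({"rental", "male", "married"})]
Dn = [frozenset({"35", "male"}),
      frozenset({"35", "male", "married"}),
      frozenset({"35", "rental", "male"}),
      frozenset({"rental", "male", "married"})]
SEP = {frozenset({"35", "rental", "male"})}   # result of the worked example
print(classify(Dp, Dn, SEP))
```

  With the feature pattern set of the worked example, only data 12 of class N contains a feature pattern, so the class N counter is larger.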
[0079]
  As described above, the feature pattern output device 21 shown in the first embodiment extracts data similar to the input data from the database 22, extracts the maximum pattern set and the minimum pattern set for each class from the similar data, and calculates the feature patterns from the maximum pattern set and the minimum pattern set of each class. Therefore, the feature patterns can be calculated at high speed without depending on the number of data in the database 22 or the number of items of each piece of data.
[0080]
  As a result, the input data can easily be classified using the calculated feature patterns.
[0081]
  Furthermore, by calculating a feature pattern from data similar to input data, it is possible to detect even a local feature pattern with high accuracy.
[0082]
  By the way, when extracting similar data based on input data, noise may occur in the similar data. Therefore, by adding a noise removal mechanism to the similar data extraction unit 32, it is possible to improve the feature pattern detection accuracy and the input data classification accuracy.
[0083]
  As noise generated in similar data, there are class noise in which data of another class is mixed in similar data of a predetermined class, and attribute noise in which an item of predetermined similar data is replaced with another item.
[0084]
  When class noise exists, the same maximum pattern may appear in class P and class N in the similar data after binarization processing. When the same maximum pattern appears in class P and class N, no feature pattern can be found, and the classification accuracy is significantly reduced. Therefore, when the same pattern appears in both class P and class N, the commonly appearing pattern is excluded from each class, and patterns that are subsets of the excluded pattern are newly included, whereby class noise can be suppressed.
[0085]
  Further, the attribute noise can be removed by the statistical test process shown in FIG. As shown in FIG. 7, in this attribute noise removal, first, L which is one of the minimum patterns is inputted (step S301). Here, if the items included in L are I1, I2,... Ik, L = {I1, I2... Ik}.
[0086]
  Next, I1, the first item in L, is set as the processing target Ii (step S302). Then, a pattern B is generated by excluding the item being processed from L (step S303). Thereafter, a statistical test is performed on B => P and B∧Ii => P (step S304). This test determines whether adding the item Ii being processed to the pattern B makes a statistically significant difference. If the difference cannot be regarded as statistically significant, the item Ii is considered to have appeared due to attribute noise.
[0087]
  Specifically, in the statistical test process, the statistical hypothesis that there is no difference between the probability distributions of B => P and B∧Ii => P is set up, and whether this hypothesis can be rejected is tested using the following statistic:
    $$T = \frac{S_{LP} S_{L} - S_{L} S_{BP}}{\sqrt{S_{L} S_{BP} (S_{B} - S_{BP}) / N}}$$
  Here, S_B is the number of data matching pattern B, S_L is the number of data matching the pattern B∧Ii, S_BP is the number of data of class P matching pattern B, and S_LP is the number of data of class P matching the pattern B∧Ii.
[0088]
  This T is known to follow a normal distribution. Let a be the significance level and z(a/2) the value at which the normal distribution gives p(z) = a/2. Unless T ≧ z(a/2), the hypothesis that there is no statistical difference between B => P and B∧Ii => P cannot be rejected; Ii is then treated as having appeared by chance and is excluded from the pattern L.
[0089]
  Accordingly, in FIG. 7, it is determined whether the hypothesis can be rejected as a result of the statistical test (step S305). If the hypothesis cannot be rejected (step S305, No), the item Ii being processed is regarded as attribute noise and removed (step S308), and the process proceeds to step S306.
[0090]
  On the other hand, if the hypothesis can be rejected (step S305, Yes), it is determined whether or not the test has been completed for all items (step S306). If there is an item that has not been tested yet (step S306, No), the next item is set as a test target (step S309), and the process proceeds to step S303.
[0091]
  If the processing is completed for all items (step S306, Yes), the minimum pattern L from which the attribute noise has been removed is output (step S307), and the processing ends.
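  The loop of FIG. 7 can be sketched as follows. The test statistic used here is a stand-in, not the expression given above: it is an ordinary one-sample z-test that compares the class P rate among data matching B∧Ii with the rate among data matching B. The function names, the two-sided comparison, the default significance level of 0.05, and the toy data are all assumptions.

```python
from statistics import NormalDist


def z_statistic(s_b, s_bp, s_l, s_lp):
    """Stand-in test: z-score of S_LP against the expectation S_L * S_BP / S_B."""
    p0 = s_bp / s_b
    variance = s_l * p0 * (1.0 - p0)
    if variance == 0.0:
        return 0.0  # no variation to test against; treated as "no difference"
    return (s_lp - s_l * p0) / variance ** 0.5


def remove_attribute_noise(pattern, data, labels, target="P", significance=0.05):
    """Drop each item Ii of `pattern` whose addition to B = pattern - {Ii}
    makes no statistically significant difference for class `target`."""
    z_crit = NormalDist().inv_cdf(1.0 - significance / 2.0)
    pattern = frozenset(pattern)
    kept = set(pattern)
    for item in sorted(pattern):                 # S302 / S309
        base = pattern - {item}                  # S303: pattern B
        s_b = s_bp = s_l = s_lp = 0
        for d, label in zip(data, labels):
            if base <= d:
                s_b += 1
                s_bp += label == target
                if item in d:
                    s_l += 1
                    s_lp += label == target
        if s_b == 0 or s_l == 0:
            continue                             # nothing to test; keep the item
        if abs(z_statistic(s_b, s_bp, s_l, s_lp)) < z_crit:   # S304, S305
            kept.discard(item)                   # S308: attribute noise
    return frozenset(kept)                       # S307


# Toy data: item "a" separates the classes, item "x" is irrelevant.
data = ([frozenset({"a", "x"})] * 30 + [frozenset({"a"})] * 30
        + [frozenset({"x"})] * 30 + [frozenset()] * 30)
labels = ["P"] * 60 + ["N"] * 60
print(sorted(remove_attribute_noise({"a", "x"}, data, labels)))
```

  In the toy data, removing "x" from the pattern does not change the class P rate, so "x" is dropped as noise, while "a" is kept.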
[0092]
  As described above, by providing the similar data extraction unit 32 with a function of removing class noise and attribute noise, it is possible to improve the detection accuracy of the feature pattern and the classification accuracy of the input data.
[0093]
(Embodiment 2)
  Next, a second embodiment of the present invention will be described. In the first embodiment, when similar data is extracted from the database 22, a single predetermined threshold is set and data having a similarity equal to or higher than the threshold is extracted. In the second embodiment, by contrast, a threshold is set separately for class P data and for class N data, and similar data is extracted for each class. When similar data is extracted so as to reach a predetermined number, a predetermined number may be set for each of class P and class N, and extraction may be performed separately for each class.
[0094]
  FIG. 8 shows the relationship between data and similarity in the second embodiment. In FIG. 8, the arrangement of data 1 to 13 is the same as that in FIG. 3. Likewise, as in FIG. 3, concentric circle 51 indicates similarity 3, concentric circle 52 indicates similarity 2, and concentric circle 53 indicates similarity 1. However, FIG. 8 differs from FIG. 3 in that the concentric circle 53 is the threshold for class P data and the concentric circle 52 is the threshold for class N data.
[0095]
  Since the similarity threshold for class P has been lowered to 1, data 1, 4, 5, and 6 are newly extracted as similar data, as shown in FIG. 9 (a). Here, data 1 and 6 are subsets of data 2, and data 4 is a subset of data 7. Data 5, however, has no superset and is therefore a maximum pattern of class P. Accordingly, Rp in the second embodiment is {{35}, {rental, male, married}}, with {35} corresponding to data 5 added. As shown in FIG. 9 (b), the threshold for class N remains 2, so the similar patterns of class N do not change.
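  Extraction with one threshold per class can be sketched as below. The function name `extract_by_class` is hypothetical. The class of data 11 is inferred from the fact that it is not newly extracted for class P, and data 3 and 8 are left out because their classes are not stated; both points are assumptions of this sketch.

```python
def extract_by_class(scored, thresholds):
    """Group the ids of (data_id, class_label, similarity) triples whose
    similarity reaches the threshold of their own class."""
    result = {label: [] for label in thresholds}
    for data_id, label, sim in scored:
        if label in thresholds and sim >= thresholds[label]:
            result[label].append(data_id)
    return result


# (data number, class, similarity) following FIG. 8
fig8 = [(1, "P", 1), (2, "P", 2), (4, "P", 1), (5, "P", 1), (6, "P", 1),
        (7, "P", 3), (9, "N", 2), (10, "N", 3), (11, "N", 1),
        (12, "N", 3), (13, "N", 3)]
print(extract_by_class(fig8, {"P": 1, "N": 2}))
```

  Lowering only the class P threshold adds data 1, 4, 5, and 6 to the class P similar data while leaving the class N similar data unchanged.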
[0096]
  As described in the first embodiment, it has been proved that all feature patterns can be calculated if all maximum patterns are obtained from all data. When only data in the vicinity of the input data is handled, as in the present invention, it is necessary, when calculating a feature pattern from similar data, to add the condition that the feature pattern has more items than the patterns that appear in common in class P and class N, so that missed maximum patterns do not degrade the classification accuracy.
[0097]
  Therefore, by setting a threshold value for each class and acquiring a sufficient number of samples from all classes, it is possible to prevent a reduction in classification accuracy due to a detection failure of the maximum pattern.
[0098]
  Since the binarization of the similar data and the calculation of the similar pattern set are the same as in the first embodiment, their description is omitted. In the second embodiment, however, the similar pattern set is regarded as an approximation of the entire data included in the database 22 by the class-by-class neighborhood of the input data. Therefore, in the feature pattern calculation process, <epLp, epRp> is calculated using the above jepProducer as
  <epLp, epRp> = jepProducer (<{φ}, Rp>, <{φ}, Rn>)
Thus, in the present embodiment, the feature pattern can be calculated from the maximum pattern sets Rp and Rn without using the minimum pattern sets Lp and Ln.
[0099]
  Further, in this embodiment, since SEP = epLp∪epLn is an approximation of the feature patterns of the entire database 22, the class P count and the class N count can be calculated over the entire data included in the database 22 when classifying the input data.
[0100]
  When the class P count is calculated over the entire database 22, it is preferable to divide the value by the number of class P data included in the database 22 so as to correct for the distribution of class P in the entire database 22. The same applies to the class N count. When correction is performed in this way based on the size of the data set belonging to each class, the input data can be classified with high accuracy even if the distribution ratio of the classes in the database 22 is heavily biased, for example, even if class N data is far more numerous than class P data.
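  The effect of this correction can be illustrated with a short sketch. The function name and the counts below are hypothetical figures chosen only to show a heavily biased class distribution.

```python
def classify_normalized(count_p, size_p, count_n, size_n):
    """Compare counts after dividing each by the size of its class.
    Returns None on a tie (an assumption; tie handling is not described)."""
    score_p = count_p / size_p if size_p else 0.0
    score_n = count_n / size_n if size_n else 0.0
    if score_p == score_n:
        return None
    return "P" if score_p > score_n else "N"


# Hypothetical database: 100 class P records, 10000 class N records.
# The raw counts (30 vs 200) favor N, the corrected ratios (0.3 vs 0.02) favor P.
print(classify_normalized(30, 100, 200, 10000))
```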
[0101]
  As described above, in the second embodiment, similar data is extracted using a different threshold for each class, which prevents maximum patterns from being missed and improves the classification accuracy of the input data.
[0102]
  In the second embodiment, an approximation of the feature pattern of the entire database 22 can be obtained, and the input data can be classified with high accuracy regardless of the class distribution state.
[0103]
  In Embodiment 1 and Embodiment 2 described above, the input data is classified by comparing the number of appearances of feature patterns in the similar data of class P with that in the similar data of class N. However, the classification is not limited to this method, and the input data can also be classified using other evaluation criteria or combinations thereof.
[0104]
  As an evaluation standard that can be used for classification of input data, for example, the number of feature patterns, the number of feature pattern items, and the like can be used. For the number of feature patterns, the evaluation is increased when the current number of feature patterns is large, and for the number of feature pattern items, the evaluation is increased when the number of items is large.
[0105]
  Specifically, when the number of feature patterns is used, the sum of the sizes of the feature patterns belonging to epLp is compared with the sum of the sizes of the feature patterns belonging to epLn, and the input data is classified into the class with the larger value.
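  This alternative criterion can be sketched as follows. The function name is hypothetical, and returning `None` on a tie is an assumption.

```python
def classify_by_pattern_size(ep_l_p, ep_l_n):
    """Compare the summed sizes of the feature patterns of each class."""
    size_p = sum(len(p) for p in ep_l_p)
    size_n = sum(len(p) for p in ep_l_n)
    if size_p == size_n:
        return None
    return "P" if size_p > size_n else "N"


# Result of the worked example of Embodiment 1: no class P feature pattern,
# one class N feature pattern with three items.
epLp = set()
epLn = {frozenset({"35", "rental", "male"})}
print(classify_by_pattern_size(epLp, epLn))
```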
[0106]
(Embodiment 3)
  In the third embodiment, a computer system that executes a feature pattern output program having the same function as the feature pattern output apparatus described in the first and second embodiments will be described.
[0107]
  A computer system 100 shown in FIG. 10 includes a main body 101, a display 102 that displays information such as images on a display screen 102a according to instructions from the main body 101, a keyboard 103 for inputting various information to the computer system 100, a mouse 104 for designating an arbitrary position on the display screen 102a of the display 102, a LAN interface connected to a local area network (LAN) 106 or a wide area network (WAN), and a modem 105 connected to a public line 107 such as the Internet. Here, the LAN 106 connects the computer system 100 to another computer system (PC) 111, a server 112, a printer 113, and the like. As shown in FIG. 11, the main body 101 includes a CPU 121, a RAM 122, a ROM 123, a hard disk drive (HDD) 124, a CD-ROM drive 125, an FD drive 126, an I/O interface 127, and a LAN interface 128.
[0108]
  When the data management method is executed in the computer system 100, the feature pattern output program stored in a storage medium is installed in the computer system 100. The installed feature pattern output program is stored in the HDD 124 and executed by the CPU 121 using the RAM 122, the ROM 123, and the like. Here, the storage medium includes a portable storage medium such as a CD-ROM 109, a floppy disk 108, a DVD disk, a magneto-optical disk, or an IC card; a storage device such as the hard disk 124 provided inside or outside the computer system 100; the database of the server 112 that holds the data management program of the installation source and is connected via the LAN 106; another computer system 111 and its database; and, further, a transmission medium on the public line 107.
[0109]
  As described above, the third embodiment is implemented by executing on the computer system 100 a feature pattern output program in which the configuration of the feature pattern output apparatus described in the first and second embodiments is realized by software. The same effects as those of the feature pattern output apparatus shown in the first and second embodiments can be realized by using a general computer system.
[0110]
  As described above, according to the present invention, similar data similar to the input data is extracted from the database, and the feature patterns that characterize each class are calculated from the extracted similar data, so that a feature pattern output device capable of outputting feature patterns at high speed can be provided.
[0111]
  Further, according to the present invention, the value of each item of the similar data extracted from the database is compared with the value of each item of the input data, the maximum pattern set and the minimum pattern set are extracted from the combinations of matching items, and the feature patterns are calculated based on the maximum pattern set and the minimum pattern set, so that a feature pattern output device capable of outputting feature patterns at high speed with a simple configuration can be provided.
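The extraction of the minimum and maximum pattern sets summarized above can be sketched as follows, under stated assumptions. Each data record is assumed to be a `dict` mapping item names to values, a pattern is taken to be the set of item names whose values match the input data, and records with no matching item are ignored. The function names are hypothetical and are not taken from the embodiments.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Set, Tuple

Record = Dict[str, Hashable]
Pattern = FrozenSet[str]


def matching_pattern(input_data: Record, record: Record) -> Pattern:
    """Items of `record` whose value equals the value of the same item in the input."""
    return frozenset(
        item for item, value in record.items()
        if item in input_data and input_data[item] == value
    )


def min_max_pattern_sets(
    input_data: Record, similar_records: Iterable[Record]
) -> Tuple[Set[Pattern], Set[Pattern]]:
    """Return (minimum pattern set, maximum pattern set) for one class."""
    patterns = {matching_pattern(input_data, r) for r in similar_records}
    patterns.discard(frozenset())  # assumption: empty matches carry no information
    # A minimum pattern has no proper subset in the pattern set.
    minimal = {p for p in patterns if not any(q < p for q in patterns)}
    # A maximum pattern has no proper superset in the pattern set.
    maximal = {p for p in patterns if not any(q > p for q in patterns)}
    return minimal, maximal
```

The pairwise subset checks are quadratic in the number of distinct patterns, which is acceptable for a sketch because only the similar data, not the whole database, is processed.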
[0112]
  In addition, according to the present invention, a common pattern that appears across multiple classes is obtained based on the minimum pattern set, and the feature pattern is calculated as a superset of the common pattern, so that a feature pattern output device capable of outputting feature patterns at high speed can be provided.
[0113]
  In addition, according to the present invention, when similar data is extracted, the conditions are changed for each class so that a sufficient number of similar data is acquired for each class, so that a feature pattern output device capable of approximating and outputting feature patterns at high speed can be provided.
[0114]
  In addition, according to the present invention, when a maximum pattern appears across multiple classes, an item is removed from that maximum pattern, which prevents maximum patterns from spanning multiple classes, so that a feature pattern output device capable of outputting feature patterns at high speed can be provided.
[0115]
  Further, according to the present invention, the input data is classified based on the feature patterns calculated from the similar data, so that a feature pattern output device capable of classifying the input data at high speed regardless of the scale of the database can be provided.
[0116]
  Further, according to the present invention, the number of appearances of the feature patterns in the similar data of each class is counted, and the input data is classified into the class for which the count is largest, so that a feature pattern output device capable of outputting feature patterns that allow the input data to be classified with high accuracy can be provided.
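The counting-based classification described above can be sketched as follows. This is an illustrative sketch only. It assumes that each similar data record has already been reduced to the set of items that match the input data, that a feature pattern "appears" in a record when it is a subset of that set, and that ties are resolved in favor of the class listed first. None of these details, nor the function names, are specified in the description.

```python
from typing import Dict, FrozenSet, List, Tuple

Pattern = FrozenSet[str]


def count_appearances(feature_patterns: List[Pattern], similar_data: List[Pattern]) -> int:
    """Count (feature pattern, record) pairs in which the pattern is contained in the record."""
    return sum(1 for fp in feature_patterns for rec in similar_data if fp <= rec)


def classify_by_appearances(
    feature_patterns_by_class: Dict[str, List[Pattern]],
    similar_data_by_class: Dict[str, List[Pattern]],
) -> Tuple[str, Dict[str, int]]:
    """Return the class with the largest appearance count, plus all counts."""
    counts = {
        cls: count_appearances(patterns, similar_data_by_class.get(cls, []))
        for cls, patterns in feature_patterns_by_class.items()
    }
    # max() keeps the first class on a tie (assumed tie-breaking rule).
    best = max(counts, key=counts.get)
    return best, counts
```

Returning the per-class counts together with the chosen class makes it straightforward to combine this criterion with others, such as the number of feature patterns or the number of feature pattern items.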
[0117]
  Further, according to the present invention, when an item is numerical data, a predetermined numerical range is set, and when the item value of the input data and the item value of the similar data are within the predetermined range, the values of both items are determined to match, so that a feature pattern output device capable of outputting feature patterns at high speed with a simple configuration, even when the items contain numerical data, can be provided.
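One possible reading of the numerical matching rule above is sketched below. It assumes that "within the predetermined range" means that the absolute difference between the two values does not exceed a tolerance; another reading, in which both values fall into the same predefined interval, is equally compatible with the text. The function name `values_match` and the parameter `tolerance` are hypothetical.

```python
from numbers import Real


def _is_numeric(value: object) -> bool:
    """True for int/float-like values, excluding bool (a subclass of int)."""
    return isinstance(value, Real) and not isinstance(value, bool)


def values_match(input_value: object, similar_value: object, tolerance: float = 0.0) -> bool:
    """Decide whether two item values match.

    Numerical values match when they differ by at most `tolerance`
    (assumed interpretation of the predetermined numerical range);
    all other values must be exactly equal.
    """
    if _is_numeric(input_value) and _is_numeric(similar_value):
        return abs(input_value - similar_value) <= tolerance
    return input_value == similar_value
```

With a tolerance of 0.5, the values 10.0 and 10.4 are treated as matching, while 10 and 11 are not; categorical values such as "red" are compared for exact equality.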
[Industrial applicability]
[0118]
  As described above, the feature pattern output device according to the present invention is particularly useful for speeding up the extraction of feature patterns in large-scale databases.
[Brief description of the drawings]
[0119]
FIG. 1 is a schematic configuration diagram illustrating a schematic configuration of a feature pattern output apparatus according to a first embodiment of the present invention.
FIG. 2 is a diagram showing a specific example of input data and similar data.
FIG. 3 is a diagram showing a data space in which data groups are arranged according to similarity.
FIG. 4 is a diagram showing a maximum pattern set and a minimum pattern set.
FIG. 5 is a diagram illustrating a processing operation of a feature pattern set calculation unit.
FIG. 6 is a flowchart for explaining the processing operation of the input data classification processing unit 36.
FIG. 7 is a diagram for explaining a statistical test process for removing attribute noise.
FIG. 8 is a diagram showing a relationship between data and similarity in the second embodiment.
FIG. 9 is a diagram showing a maximum pattern set and a minimum pattern set in the second embodiment.
FIG. 10 is an explanatory diagram for explaining a computer system according to the third embodiment.
FIG. 11 is an explanatory diagram for explaining the configuration of the main body shown in FIG. 10.

Claims

A feature pattern output device that has a database storing data composed of a plurality of items, the data being divided into a plurality of classes, and that outputs, as a feature pattern of each class, a combination of items characterizing that class, the feature pattern output device comprising:
similar data extraction means for extracting, upon receiving input data, similar data similar to the input data for each class from the database;
similar pattern set calculation means for calculating a similar pattern set for each class from the similar data extracted by the similar data extraction means; and
feature pattern calculation means for calculating a feature pattern for each class from the similar pattern set calculated by the similar pattern set calculation means.

The feature pattern output device according to claim 1, wherein the similar pattern set calculation means extracts, as a pattern set, combinations of items for which the value of each item forming the similar data extracted by the similar data extraction means matches the value of the corresponding item forming the input data; extracts, as a minimum pattern set, the minimum patterns, each being a combination of items for which no subset other than itself exists in the pattern set; extracts, as a maximum pattern set, the maximum patterns, each being a combination of items for which no superset other than itself exists in the pattern set; and outputs the minimum pattern set and the maximum pattern set as the similar pattern set.

The feature pattern output device according to claim 2, wherein the feature pattern calculation means extracts, from the minimum pattern set, a common pattern set appearing across a plurality of classes, and calculates a feature pattern having all items of the common pattern set.

3. The feature pattern output device according to claim 2, wherein the similar data extraction means extracts similar data based on conditions that differ for each class when extracting similar data from the database.

5. The feature pattern output device according to claim 4, wherein the similar pattern set calculation means excludes a predetermined item from the maximum pattern when there is a maximum pattern that appears across a plurality of classes.

The feature pattern output device according to any one of the preceding claims, further comprising classification means for classifying the input data into one of the plurality of classes based on the feature pattern calculated by the feature pattern calculation means.

The feature pattern output device according to claim 6, wherein the classification means counts the number of appearances of the feature patterns in the similar data of each class and classifies the input data into the class for which the count is largest.

The feature pattern output device according to claim 2, wherein, when the value of a predetermined item forming the input data and the value of the corresponding item forming the similar data are within a predetermined numerical range, the similar pattern set calculation means determines that the values of both items match.