JP2006053798A

JP2006053798A - Query deriving method

Info

Publication number: JP2006053798A
Application number: JP2004235579A
Authority: JP
Inventors: Haruaki Yamazaki; 山崎晴明
Original assignee: Yamanashi TLO Co Ltd
Current assignee: Yamanashi TLO Co Ltd
Priority date: 2004-08-12
Filing date: 2004-08-12
Publication date: 2006-02-23

Abstract

PROBLEM TO BE SOLVED: To provide a method for deriving a query that generates a sub-database approximating a sub-database that a user has selected without being clearly aware of the selection criteria.
SOLUTION: A unary query is executed for every term of the original database, and the unary query that satisfies a predetermined condition is detected and assigned as the root node. Starting from the root node, the search for a unary query satisfying the predetermined condition is repeated recursively until a termination condition is detected, thereby building a tree structure. For every path of the completed tree, the logical product of the queries from the root node to the terminal node is formed, and the logical sum of all these logical products is taken as the query that derives the approximate database.
COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a method for identifying a query that yields a sub-database approximating a sub-database which a user has extracted, without being clearly aware of the selection criteria, from an original database consisting of a large amount of data having a plurality of attribute values.

Data mining is a technology for finding useful information in an enormous amount of data. In recent years, with the introduction of POS systems in retail stores, improved infrastructure, and advances in database technology, customer data is being input continuously and huge volumes of data have accumulated. Extracting useful knowledge from this data is not easy, but doing so leads to great benefits.

Typical data mining techniques include: 1) classification, which discovers regularities that assign existing data to several predetermined classes and uses them to discriminate or predict unknown data; 2) clustering, which identifies a finite set of categories that describe the data and automatically groups data whose classification criteria are unknown according to similarity; and 3) association, which learns the correlations that exist among data attributes and searches for patterns of strongly related data combinations, for example association rule discovery.

However, all of these techniques require a preprocessing step that selects appropriate variables from among a large number of variables. In addition, the characteristic patterns they produce often cannot be used directly as effective knowledge.
Masayuki Kawano: "Current Status and Trends of Knowledge Discovery from Databases", Journal of the Japanese Society for Artificial Intelligence, Vol. 12, No. 4, pp. 497-504 (1997)
Blake, C. L. & Merz, C. J., UCI Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science (1998)
As a method for generating correlation rules between attribute values by data mining, a technique has been disclosed that creates correlation rules useful to the user by generating only those rules whose correlation coefficient, computed between attributes in the database, is equal to or greater than a predetermined value.

However, this technique only extracts relationships between attributes and turns them into rules. Consequently, when the user's intention is formed by the values of two uncorrelated attributes, for example, the technique cannot infer that intention. That is, the technique disclosed in Patent Document 1 can capture as a rule the tendency that tall people are also heavy, but it cannot infer the intention of a user who is looking for people who are tall and light.
"Data Mining Device and Data Mining Method": JP 2001-265596 A

Accordingly, an object of the present invention is to provide a query derivation method that derives, from an enormous original database, the query needed to obtain a sub-database approximating a sub-database that, for example, the user has selected without being clearly aware of the selection criteria.

The present invention provides a query derivation method for an original database composed of tuples having a plurality of attributes, that is, a table having the plurality of attributes as rows and the data as columns. The method derives the query needed to generate a database approximating a set that the user has extracted without being clearly aware of the criteria, in other words, a sub-database obtained by executing an unknown query on the original database.

In the query derivation method, every term of the sub-database, that is, the column of each attribute, is treated as a unary term, and a query (unary query) is executed for each term of the sub-database. A unary query satisfying a predetermined condition is thereby detected and assigned as the root node. The predetermined condition is twofold: (i) the probability that a random database, generated by randomly extracting n tuples from the original database, and the unary query database, generated by executing the unary query on the original database, have the same precision with respect to the sub-database is equal to or less than a predetermined risk rate; and (ii) the recall is the largest among the unary queries executed for all terms.

Once the unary query satisfying the predetermined condition has been determined, it is executed on the sub-database to generate an affirmative database. Likewise, the negative unary query, which negates the condition of that unary query, is executed on the sub-database to generate a negative database.

Next, the affirmative database or the negative database is regarded as the original database, and the overlap between the affirmative database and the sub-database, or the overlap between the negative database and the sub-database, is regarded as the sub-database. The affirmative query or the negative query is then executed on the overlap between the negative database and the sub-database.

In this way, an affirmative unary query or a negative unary query satisfying the predetermined condition is determined, and a tree structure is created in which that query becomes a node branching from the root node as a binary tree.

This is executed recursively until no unary query satisfying the predetermined condition can be detected, which completes the tree structure.

Next, for each path from the root node to a terminal node, the logical product of all unary queries along the path is formed, and the logical sum of the logical products obtained for all paths is derived as the query for the approximate database.

An embodiment is described in detail below in which the invention derives a query that extracts, from an enormous original database, an approximate sub-database approximating a sub-database that the user has selected without being clearly aware of the criteria.

1. Definition of terms
The terms used in this specification are defined as follows.

A table is a database, namely a table such as the one shown in FIG. 1. A table is a set of tuples.

A tuple is a row of the table shown in FIG. 1 and is a collection of attribute values, one for each attribute.

An attribute is, for example, the price, size, or color shown in FIG. 1, and represents a feature or property of the item.

An attribute value is a value of an attribute shown in FIG. 1, for example "high" for the attribute "price", "small" for the attribute "size", or "red" for the attribute "color".

A term is a column such as the "size" column shown in FIG. 1; a single column is a unary term.

A unary query is a query posed on a single attribute. Given the table shown in FIG. 1, an example is "find the tuples in which the attribute price (A) has the attribute value normal (a)". In this specification such a unary query is written (A:a).

A node is a branch point in the tree structure.

A path is a route connecting nodes of the tree structure.

A negative query is a query of the form "find the tuples in which attribute A is not a". In this specification, (A:≠a) denotes the negative unary query of the unary query (A:a), which is also called the affirmative unary query.

The product form (conjunction) is an expression that writes the product of two unary queries (A:a) and (B:b) as (A:a)×(B:b).

The sum form (disjunction) is an expression that writes the sum of two unary queries (A:a) and (B:b) as (A:a)+(B:b).
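
The query notation defined above can be made concrete with a small sketch. The Python below is only an illustration under assumptions that are not in the specification: a table is modelled as a list of dictionaries, attribute values are strings, and the helper names (`unary`, `negate`, `conj`, `disj`, `run`) and the sample rows are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]            # one tuple: attribute name -> attribute value
Query = Callable[[Row], bool]   # a query is modelled as a predicate over a tuple


def unary(attr: str, value: str) -> Query:
    """Affirmative unary query (A:a): attribute `attr` has value `value`."""
    return lambda row: row[attr] == value


def negate(q: Query) -> Query:
    """Negative query, e.g. (A:≠a) when applied to a unary query."""
    return lambda row: not q(row)


def conj(*qs: Query) -> Query:
    """Product form (A:a)×(B:b): every query must hold."""
    return lambda row: all(q(row) for q in qs)


def disj(*qs: Query) -> Query:
    """Sum form (A:a)+(B:b): at least one query must hold."""
    return lambda row: any(q(row) for q in qs)


def run(table: List[Row], q: Query) -> List[Row]:
    """Execute a query on a table and return the matching tuples."""
    return [row for row in table if q(row)]


# Hypothetical sample table in the spirit of FIG. 1 (the values are made up).
table = [
    {"price": "high", "size": "small", "color": "red"},
    {"price": "normal", "size": "small", "color": "blue"},
    {"price": "normal", "size": "large", "color": "red"},
]

normal_price = unary("price", "normal")
print(len(run(table, normal_price)))                               # 2
print(len(run(table, negate(normal_price))))                       # 1
print(len(run(table, conj(normal_price, unary("color", "red")))))  # 1
print(len(run(table, disj(normal_price, unary("color", "red")))))  # 3
```

Modelling a query as a predicate keeps the product and sum forms as plain function composition, which is one convenient choice among several.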

Recall is defined as follows. Suppose an original database (hereinafter, table R) and a sub-table of table R (hereinafter, table S) are given as shown in FIG. 2, and a unary query (X:x) is applied to table R to obtain a subset S_x of table R. S_x and S are then related as shown on the left side of FIG. 2. Recall is the ratio of the overlap S ∩ S_x between sub-database S and sub-database S_x to sub-database S:

$$\mathrm{Recall} = \frac{|S \cap S_x|}{|S|}$$

Recall is an index of whether S_x approximates S; it shows how much of S is covered by S_x.

Precision is defined as

$$\mathrm{Precision} = \frac{|S \cap S_x|}{|S_x|}$$

and indicates what fraction of S_x actually belongs to S.

In this specification, |X| denotes the number of elements in the set X.
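
As a brief illustration of the two ratios, the sketch below computes recall and precision on sets of row identifiers. The function names, the use of integer identifiers, and the convention of returning 0.0 for an empty denominator are assumptions made here, not part of the specification.

```python
from typing import Set


def recall(s: Set[int], s_x: Set[int]) -> float:
    """|S ∩ S_x| / |S|: how much of S is covered by S_x."""
    return len(s & s_x) / len(s) if s else 0.0


def precision(s: Set[int], s_x: Set[int]) -> float:
    """|S ∩ S_x| / |S_x|: how much of S_x lies inside S."""
    return len(s & s_x) / len(s_x) if s_x else 0.0


s = {0, 1, 2, 3}   # hypothetical table S, as row identifiers
s_x = {2, 3, 4}    # hypothetical result S_x of a unary query
print(recall(s, s_x))     # 0.5  (2 of the 4 tuples of S are covered)
print(precision(s, s_x))  # 2/3  (2 of the 3 tuples of S_x lie in S)
```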

The risk rate is the probability used in a statistical hypothesis test (hereinafter, a test) as the criterion for deciding whether to reject the null hypothesis. A very small probability is adopted, one at which the tested event is considered to occur only rarely; it is also called the significance level. This specification assumes a risk rate of 1% to 10%, but the risk rate is not limited to this range. The specific percentage is chosen according to the type of the target table R and the precision required of the query to be derived.

A test is the act of setting up a hypothesis about the event under examination and computing, with a statistic based on an appropriate statistical distribution, the probability of the observed outcome if the hypothesis is correct. This probability is compared with the predetermined risk rate to judge whether the outcome is a rare one. If the probability is smaller than the risk rate, it is judged that something that rarely happens has occurred.

2. Preparation for generating the tree structure
FIG. 2 shows the situation in which table S has been obtained from table R by a query of which the user is not clearly aware (the unknown query Q_x), and a unary query (X:x) has been executed on table R to obtain table S_x; it illustrates the relationship between table S_x and table S. In FIG. 2, recall is defined as Recall = |S ∩ S_x| / |S|. Writing |S ∩ S_x| = k and |S_x| = n, the precision is P = k/n.

On the other hand, when one tuple is taken arbitrarily from table R, the probability q that it is contained in table S is q = |S|/|R|. Therefore, in a set of n tuples extracted at random from table R, the probability P(r) that r of the tuples are contained in table S is

$$P(r) = \binom{n}{r}\, q^{r} (1-q)^{n-r}$$

This describes the recall obtained when random sampling extracts n tuples from table R. The probability that this sampling result has the same recall as the result of executing the unary query above is

$$P(k) = \binom{n}{k}\, q^{k} (1-q)^{n-k}$$

If P(k) is smaller than the predefined risk rate (for example, 5%), an outcome of this kind arises by chance in no more than 5% of cases, so an extremely rare event has occurred. Put the other way around, the result of executing the unary query (X:x) approximates S.
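
The check described in this section can be sketched in a few lines of Python. The function names and the sample sizes are hypothetical; the formula is the binomial probability P(k) given above, and the comparison uses "at or below the risk rate".

```python
from math import comb


def same_hit_probability(n: int, k: int, q: float) -> float:
    """P(k) = C(n, k) * q**k * (1 - q)**(n - k)."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def is_significant(size_r: int, size_s: int, n: int, k: int,
                   risk: float = 0.05) -> bool:
    """True if a random sample of n tuples would hit S exactly k times
    with probability at or below the risk rate."""
    q = size_s / size_r  # chance that a random tuple of R lies in S
    return same_hit_probability(n, k, q) <= risk


# Hypothetical numbers: |R| = 100, |S| = 20, and a query returning n = 10 tuples.
print(is_significant(100, 20, n=10, k=9))  # True:  P(9) is about 4.1e-6
print(is_significant(100, 20, n=10, k=2))  # False: P(2) is about 0.30
```

With q = 0.2, hitting S nine times out of ten is far from what random sampling would produce, whereas two hits out of ten is exactly the expected count and therefore unremarkable.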

3. Generation of the tree structure
FIGS. 3 to 6 show, as a tree structure, the procedure for deriving the unknown query Q_x shown in FIG. 2. First, a unary query (X:x) is executed on table R for every term; that is, the unary queries (A:a), (B:b), ..., (X:x), ..., (N:n) are executed in turn. From these unary queries, the unary query (X_1:x_1) whose probability P(k) is at or below the predetermined risk rate, for example 5%, and whose recall is the largest is found and assigned as the root node.
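
The root selection just described might be sketched as follows. This is a minimal illustration, not the patented implementation: rows are dictionaries, tables R and S are sets of row indices, every (attribute, value) pair occurring in R is treated as a candidate unary query, and the name `best_unary_query` is hypothetical. Skipping queries with zero recall and breaking ties by sorted order are assumptions added here.

```python
from math import comb
from typing import Dict, List, Optional, Set, Tuple

Row = Dict[str, str]


def best_unary_query(rows: List[Row], r_ids: Set[int], s_ids: Set[int],
                     risk: float = 0.05) -> Optional[Tuple[str, str]]:
    """Return the significant unary query (attr, value) with the largest recall.

    r_ids are the row indices playing the role of table R and s_ids those
    playing the role of table S (a subset of r_ids). Returns None when no
    query qualifies.
    """
    if not r_ids or not s_ids:
        return None
    q = len(s_ids) / len(r_ids)
    # Every (attribute, value) pair occurring in R is a candidate query.
    candidates = sorted({(a, v) for i in r_ids for a, v in rows[i].items()})
    best, best_recall = None, 0.0
    for attr, value in candidates:
        s_x = {i for i in r_ids if rows[i][attr] == value}
        n, k = len(s_x), len(s_x & s_ids)
        p_k = comb(n, k) * q**k * (1 - q) ** (n - k)
        rec = k / len(s_ids)
        # Assumption: a query with recall 0 is never chosen (rec > 0.0).
        if p_k <= risk and rec > best_recall:
            best, best_recall = (attr, value), rec
    return best


# Made-up data: 20 items, the first 8 are red, sizes alternate.
rows = [{"color": "red" if i < 8 else "blue",
         "size": "small" if i % 2 == 0 else "large"} for i in range(20)]
r_ids = set(range(20))
s_ids = set(range(8))   # the user happened to pick exactly the red items
print(best_unary_query(rows, r_ids, s_ids))  # ('color', 'red')
```

In this data (color:red) returns 8 tuples that all lie in S, so P(8) = 0.4^8 is far below 5% and the recall is 1, while (size:small) hits S no more often than chance and is rejected.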

FIG. 3(b) shows the relationship between table S and table S_1, the database obtained by executing the unary query (X_1:x_1) on table R. The overlap between table S and table S_1 is S ∩ S_1.

To generate the tree structure, first, as shown in FIG. 3(a), a binary tree of the unary query (X_2:x_2) and its complementary query (X_2:≠x_2) is created with the root node (X_1:x_1) as the starting point. Here, the database obtained by executing the unknown query Q_x is table S.

This binary tree is created as follows. As shown in FIG. 4(a), the overlap S ∩ S_1 between table S and table S_1 is regarded as table S, and table S_1 is regarded as table R. A unary query (X:x) is executed for every term, and the affirmative unary query (X_2:x_2) that satisfies the predetermined condition, for example a risk rate of 5% or less and the largest recall, is grafted onto the affirmative side of the root node.

The negative query (X_1:≠x_1) is executed in the same way, and the negative unary query (X_2:≠x_2) satisfying the predetermined condition is grafted. FIG. 4(b) shows the relationship between table S_1 and table S_2; their overlap is S ∩ S_1 ∩ S_2.

Next, table S_2 is regarded as table R and the subset S ∩ S_1 ∩ S_2 is regarded as table S. A unary query (X:x) is again executed for every term to find the affirmative unary query that has, for example, a risk rate of 5% or less and the largest recall. If no unary query satisfying these conditions is found, ψ is grafted onto the unary query (X_2:x_2). FIG. 5 shows the tree structure at that point.

When ψ has been grafted, the negative unary query of the node onto which ψ was grafted is regarded as the root, and the unary queries (X:x) are executed from that root for every term in the same way; this is repeated recursively until ψ is grafted. That is, the negative unary query (X_2:≠x_2) is processed for every term in the same way, the unary query (X_3:x_3) with a risk rate of 5% or less and the largest recall is found, and the tree structure is extended. If no unary query satisfying these conditions is found, ψ is grafted. FIG. 6 shows the tree structure at that point.

When all terminal nodes have been replaced by ψ as shown in FIG. 6, the tree generation process ends, and each ψ and the arrow leading to it are removed. Consequently, on both the affirmative side and the negative side, the node onto which ψ was grafted becomes a terminal node of the tree.
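
The recursive growth of the tree can be sketched as below. This is one possible reading of the procedure, written under loudly stated assumptions, and it does not claim to reproduce the exact tree of FIG. 6. Rows are dictionaries and tables are sets of row indices. The names `best_split` and `derive_paths` are hypothetical. The affirmative branch recurses into the tuples matched by the chosen query (removing noise), and the negative branch recurses into the unmatched tuples (recovering leaked tuples). A ψ on the affirmative side ends a path, while a ψ on the negative side contributes no path. Queries that match every tuple of the current table are skipped so that the recursion always shrinks; that guard is an addition made here.

```python
from math import comb
from typing import Dict, List, Optional, Set, Tuple

Row = Dict[str, str]
Literal = Tuple[str, str, bool]   # (attribute, value, affirmative?)


def best_split(rows: List[Row], r_ids: Set[int], s_ids: Set[int],
               risk: float) -> Optional[Tuple[str, str]]:
    """Significant unary query with the largest recall, or None (ψ)."""
    if not s_ids:
        return None
    q = len(s_ids) / len(r_ids)
    best, best_recall = None, 0.0
    for attr, value in sorted({(a, v) for i in r_ids for a, v in rows[i].items()}):
        s_x = {i for i in r_ids if rows[i][attr] == value}
        n, k = len(s_x), len(s_x & s_ids)
        if n == len(r_ids):
            continue  # assumption: skip queries that do not split the table
        p_k = comb(n, k) * q**k * (1 - q) ** (n - k)
        if p_k <= risk and k / len(s_ids) > best_recall:
            best, best_recall = (attr, value), k / len(s_ids)
    return best


def derive_paths(rows: List[Row], r_ids: Set[int], s_ids: Set[int],
                 risk: float = 0.05) -> Optional[List[List[Literal]]]:
    """Return the root-to-terminal paths as lists of literals, or None for ψ."""
    node = best_split(rows, r_ids, s_ids, risk)
    if node is None:
        return None
    attr, value = node
    pos = {i for i in r_ids if rows[i][attr] == value}   # plays the role of S_x
    neg = r_ids - pos
    paths: List[List[Literal]] = []
    sub = derive_paths(rows, pos, s_ids & pos, risk)     # affirmative side
    if sub is None:
        paths.append([(attr, value, True)])
    else:
        paths += [[(attr, value, True)] + p for p in sub]
    sub = derive_paths(rows, neg, s_ids & neg, risk)     # negative side
    if sub is not None:
        paths += [[(attr, value, False)] + p for p in sub]
    return paths


# Made-up data: 40 items; the hidden intent is "red or small".
rows = [{"color": "red" if i < 16 else "blue",
         "size": "small" if i % 2 == 0 else "large"} for i in range(40)]
r_ids = set(range(40))
s_ids = {i for i in r_ids
         if rows[i]["color"] == "red" or rows[i]["size"] == "small"}
print(derive_paths(rows, r_ids, s_ids))
# [[('size', 'small', True)], [('size', 'small', False), ('color', 'red', True)]]
```

In this run the root is (size:small); its affirmative side is already pure, and the negative side recovers the red items that are not small.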

From the root node of the resulting tree structure, the path to every terminal node is traced, the unary queries along each path are combined by AND, and the resulting products are combined by OR to generate a logical expression. For the tree of FIG. 6, for example, the logical expression is
(X_1:x_1)×(X_2:x_2)×(X_3:x_3) + (X_1:x_1)×(X_2:x_2)×(X_3:≠x_3) + (X_1:x_1)×(X_2:≠x_2)×(X_3:x_3) + (X_1:x_1)×(X_2:≠x_2).
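
A short sketch of how a set of paths turns into such a logical expression, and how the expression is evaluated on a tuple, is given below. The representation of a path as a list of (attribute, value, affirmative) triples and the names `to_expression` and `matches` are assumptions made for illustration; the sample paths are made up.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Literal = Tuple[str, str, bool]   # (attribute, value, affirmative?)


def to_expression(paths: List[List[Literal]]) -> str:
    """Render paths in the notation of this specification: × for AND, + for OR."""
    def render(literal: Literal) -> str:
        attr, value, affirmative = literal
        return f"({attr}:{value})" if affirmative else f"({attr}:≠{value})"
    return " + ".join("×".join(render(lit) for lit in path) for path in paths)


def matches(row: Row, paths: List[List[Literal]]) -> bool:
    """OR over all paths of the AND of the literals on each path."""
    return any(all((row[a] == v) == aff for a, v, aff in path) for path in paths)


paths = [
    [("size", "small", True)],
    [("size", "small", False), ("color", "red", True)],
]
print(to_expression(paths))  # (size:small) + (size:≠small)×(color:red)
print(matches({"size": "large", "color": "red"}, paths))   # True
print(matches({"size": "large", "color": "blue"}, paths))  # False
```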

The query Q_x can be obtained with the above logical expression for the following reason. After the unary query (X:x) has been executed, table S_1 contains a matching portion X and a non-matching (noise) portion Y, and there is also a leaked portion Z. Expanding the affirmative part of the tree further is a process of removing the noise portion Y from table S_1 while preserving as much of X as possible. At the end of this process the matching portion dominates, no significant unary query can be found any longer, and the process terminates. Expanding the negative part of the tree, on the other hand, is a process of recovering the leaked portion Z, and thus of finding the unary query most advantageous for that purpose. In this way a logical expression that reproduces the tuple set closest to S can be found.
Various schemes are conceivable for implementing the above tree structure in memory. The most direct and a suitable one is to prepare memory stacks corresponding to AND and OR.

The outline of the algorithm above does not address attributes that take continuous numerical values. For such attributes, dividing the range into intervals in advance appears appropriate. Various interval-splitting methods have been proposed to date; the RWS (Random Walk Splitting) method developed by the inventor has received the best evaluation.
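
The specification does not describe the RWS method itself, so the sketch below substitutes plain equal-width binning purely to show what "dividing into intervals in advance" means in practice. The function name, the bin labels, and the sample values are all hypothetical, and equal-width binning is a stand-in, not the inventor's method.

```python
from typing import List


def equal_width_bins(values: List[float], n_bins: int) -> List[str]:
    """Map each continuous value to a label 'bin0' .. 'bin{n_bins-1}'."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return ["bin0"] * len(values)   # all values identical
    # The maximum value would fall just past the last bin, so clamp it.
    return [f"bin{min(int((v - lo) / width), n_bins - 1)}" for v in values]


readings = [11.0, 12.0, 13.0, 14.0, 15.0]   # made-up continuous attribute
print(equal_width_bins(readings, 2))
# ['bin0', 'bin0', 'bin1', 'bin1', 'bin1']
```

After this step each continuous attribute behaves like a categorical one, so unary queries of the form (A:a) can be applied unchanged.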

Furthermore, the logical expression of a query that leads to a given result is not necessarily unique. When two logical expressions with the same recall and precision are given, an index is therefore needed to evaluate which is the better one. Since an expression with fewer unary queries joined by AND and OR is generally considered the more general logical expression, the following index M is suitable.

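$$M = -\log\left(\frac{\text{number of unary queries}}{\text{number of tuples returned by applying the logical expression}}\right)$$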
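
The index M is straightforward to compute. In the sketch below the function name is hypothetical and the natural logarithm is an assumption, since the specification does not state the base; the ranking of two expressions does not depend on the base.

```python
from math import log


def index_m(n_unary_queries: int, n_result_tuples: int) -> float:
    """M = -log(number of unary queries / number of result tuples)."""
    return -log(n_unary_queries / n_result_tuples)


# Two hypothetical expressions that both return 20 tuples:
print(index_m(2, 20) > index_m(4, 20))  # True: fewer unary queries gives a larger M
```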
Example 2
To examine whether the above method of deriving the unknown query Q_x is useful, the database named wine from the UCI repository (Blake, C. L. & Merz, C. J., UCI Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science (1998)) was used as table R. This database is a data set of 178 instances with 13 continuous-valued attributes.

The query derivation method of the present invention was verified by the following evaluation. A unary query (A:a) was executed on table R to obtain table S. A query that generates a database approximating this table S was then derived by the method of Example 1. The derived query was restricted to queries other than the unary query (A:a). The derived query was then applied to table R, and the result was evaluated in terms of recall, precision, and number of terms.

FIG. 7 shows the evaluation results. First, the experiment was repeated while varying the risk rate of the test. FIG. 8 plots the recall and precision of FIG. 7. These results show that, for risk rates in the range of 1% to 10%, both recall and precision were good, at 90% or higher. The larger the risk rate, the more noise is contained in the tuples extracted by a candidate query. In addition, repeating the test to remove noise from the tuples extracted by a candidate query increases the number of terms in the derived query.

Next, table R was divided in half, one half was regarded as table R, and a query was derived in the same way. FIG. 9 shows the evaluation results, FIG. 10 plots the recall and precision, and FIG. 11 plots the item evaluation formula. These results show that, with a risk rate of 5%, both recall and precision were good, at 90% or higher.

Next, table R was divided into quarters, one quarter was regarded as table R, and a query was derived in the same way. FIG. 12 shows the evaluation results, FIG. 13 plots the recall and precision, and FIG. 14 plots the item evaluation formula. With a risk rate of 5%, both recall and precision were good, at 90% or higher.

According to the present invention, the query needed to obtain, with high accuracy, a sub-database approximating a sub-database that, for example, the user has selected without being clearly aware of the criteria can be derived from an enormous original database.

FIG. 1 shows an example of the table described in this embodiment.
FIG. 2 shows the relationship among the sub-database S generated from table R by a query of which the user is not clearly aware (the unknown query Q_x), the approximate sub-database S_x, recall, and precision.
FIG. 3 shows, among other things, the binary tree of the unary query (X_1:x_1) and the complementary query (X_1:≠x_1) starting from the root.
FIG. 4 shows, among other things, the tree structure created by executing affirmative and negative queries from the root node and taking the unary queries that satisfy the predetermined condition as nodes.
FIG. 5 shows the tree structure in which no unary query satisfying the predetermined condition is detected any longer and the end of the tree has become a terminal query (ψ).
FIG. 6 shows the completed tree structure.
FIG. 7 shows the recall and precision obtained by applying the query derivation method of the present invention to the sample database.
FIG. 8 is a graph of the recall and precision shown in FIG. 7.
FIG. 9 shows the recall and precision obtained by applying the query derivation method of the present invention to a database obtained by reducing the sample database to 1/2.
FIG. 10 is a graph of the recall and precision obtained by applying the query derivation method of the present invention to the database obtained by reducing the sample database to 1/2.
FIG. 11 is a graph of the item evaluation formula of FIG. 9.
FIG. 12 shows the recall and precision obtained by applying the query derivation method of the present invention to a database obtained by reducing the sample database to 1/4.
FIG. 13 is a graph of the recall and precision obtained by applying the query derivation method of the present invention to the database obtained by reducing the sample database to 1/4.
FIG. 14 is a graph of the item evaluation formula of FIG. 12.

Claims

1. A query derivation method for deriving the query needed to generate a database approximating a sub-database obtained by executing an unknown query on an original database composed of tuples having a plurality of attributes, the method comprising:
executing a unary query for every term of the sub-database and assigning the unary query that satisfies a predetermined condition as a root node;
executing the unary query that satisfies the predetermined condition on the sub-database, thereby generating an affirmative database;
executing, on the sub-database, a negative unary query that negates the condition of the unary query satisfying the predetermined condition, thereby generating a negative database;
regarding the affirmative database or the negative database as the original database;
regarding the data common to the affirmative database or the negative database and the sub-database as the sub-database, and executing the affirmative query or the negative query for every term;
detecting an affirmative unary query and a negative unary query that satisfy the predetermined condition, and creating a tree structure in which the affirmative unary query and the negative unary query are nodes branching from the root node as a binary tree;
executing the foregoing recursively until the unary query that would become a node is an affirmative or negative unary query that does not satisfy the predetermined condition, thereby completing the tree structure;
forming the logical product of all unary queries along each path from the root node to a terminal node; and
deriving the logical sum of all the logical products obtained for the respective paths as the query needed to generate the approximating database.

2. The query derivation method according to claim 1, wherein the predetermined condition is that the probability that a random database, obtained by randomly extracting n tuples from the original database, and the unary query database have the same precision with respect to the sub-database is equal to or less than a predetermined risk rate, and that the recall is the largest among the unary queries executed for all the terms.

3. The query derivation method according to claim 1, wherein, in creating the tree structure, the logical expression of the unary query that satisfies the predetermined condition is stored in a memory cell and an address is assigned to the memory cell as the root node;
each time an affirmative unary query satisfying the predetermined condition and a negative unary query satisfying the predetermined condition are generated, the logical expression of each unary query is stored as a node in a memory cell that is addressed so that its relationship to the root node can be identified, thereby completing the tree structure; and
on the basis of the addresses, the logical sum of the logical products of all unary queries along each path from the root node to a terminal node is derived as the query needed to generate the approximating database.