JP2009265905A

JP2009265905A - Preprocessor using preliminary rule, preprocessing method, information extraction device using the preprocessor and information extraction method

Info

Publication number: JP2009265905A
Application number: JP2008114193A
Authority: JP
Inventors: Kanako Hattori; 可奈子服部
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2008-04-24
Filing date: 2008-04-24
Publication date: 2009-11-12

Abstract

PROBLEM TO BE SOLVED: To easily and appropriately determine a threshold for converting numerical data into category data.

SOLUTION: A preprocessor includes an event group database DB1, a threshold/constraint database DB2, a preliminary knowledge rule database DB3, an optimization parameter database DB4, a threshold optimization means 5, and a threshold parameter database DB6. The threshold optimization means 5 determines the values of threshold variables for categorizing the numerical data so as to satisfy a preliminary rule which a user has in advance.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an information extraction device that categorizes a large volume of information and extracts useful association rules, and more particularly to a preprocessing device that calculates thresholds for categorizing the information.

Technical background

In recent years, advances in sensors and storage devices have made it possible to accumulate many kinds of event data. Here, event data is data collected when some event occurs, for example numerical data indicating customer positions, obtained by observing the in-store movements of all customers who visit a store, or purchase data obtained from customers' purchase logs. Because the event data collected and accumulated in this way is voluminous, useful information has conventionally been provided by extracting highly correlated combinations from it as rules and presenting them. Here, an association rule is a combination of items that appear together across the event data, and a highly correlated rule is a combination of items that appear together with at least a certain probability across all the event data.

These highly correlated rules express only co-occurrence relationships between items that appear in the data. A causal relationship between the items does not necessarily exist, although some rules do reflect one. The user can therefore be assisted in decision making by selecting, from the extracted association rules, those that seem likely to be causal and verifying the causal relationship by another method. For example, suppose that one customer's in-store movement data and purchase data are treated as an event, and that in-store movements and purchased products are treated as items. Suppose further that the four items "stay at dessert section", "stay at confectionery section", "stay at bread section", and "roll cake purchase" are all included in 10% of all events, and that 90% of the events that include "stay at dessert section", "stay at confectionery section", and "stay at bread section" also include "roll cake purchase". From this result the user might form a hypothesis such as "because the products placed in the dessert, confectionery, and bread sections are not clearly distinguished, roll cake buyers spend more time than necessary looking for roll cakes", and then review the placement of products on the sales floor.

Thus, to extract useful association rules from a large amount of data, the numerical data obtained by observation must first undergo preprocessing (categorization) that classifies it into several categories. Candidate association rules, each of which is a set of items, are then generated; all events are searched for each candidate; the number of events that contain the candidate is counted; and it is checked whether that number reaches the proportion specified by the user (see, for example, Patent Document 1).
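The counting step described above can be illustrated with a minimal Python sketch. The function names `support` and `confidence`, the representation of an event as a set of item strings, and the 90 sample events are assumptions made for illustration; they are not taken from the patent.

```python
from typing import FrozenSet, List

Event = FrozenSet[str]  # hypothetical representation: one event = a set of item labels


def support(events: List[Event], itemset: Event) -> float:
    """Fraction of all events that contain every item in `itemset`."""
    if not events:
        return 0.0
    return sum(1 for e in events if itemset <= e) / len(events)


def confidence(events: List[Event], antecedent: Event, consequent: Event) -> float:
    """Among events containing `antecedent`, the fraction that also contain `consequent`."""
    base = sum(1 for e in events if antecedent <= e)
    if base == 0:
        return 0.0
    both = sum(1 for e in events if (antecedent | consequent) <= e)
    return both / base


# Illustrative data only: 90 events, 9 of which contain all four items.
STAYS = frozenset({
    "stay at dessert section",
    "stay at confectionery section",
    "stay at bread section",
})
ROLL_CAKE = frozenset({"roll cake purchase"})
EVENTS: List[Event] = (
    [STAYS | ROLL_CAKE] * 9
    + [STAYS] * 1
    + [frozenset({"stay at bread section"})] * 80
)

rule_support = support(EVENTS, STAYS | ROLL_CAKE)
rule_confidence = confidence(EVENTS, STAYS, ROLL_CAKE)
```

With this sample data the four items appear together in 9 of 90 events and 9 of the 10 events containing the three stays also contain the purchase, which mirrors the 10% and 90% figures of the roll cake example.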

Here, for example, when position coordinates indicating a customer's position are held as numerical data, the numerical data is categorized as follows.

That is, when x and y representing the customer's position satisfy (Expression 1) and (Expression 2), the position is defined as the "dessert section"; when they satisfy (Expression 1) and (Expression 4), as the "bread section"; when they satisfy (Expression 3) and (Expression 4), as the "confectionery section"; and when they satisfy (Expression 2) and (Expression 3), as the "drinking water section".

0 ≤ x < 10 (Expression 1)
0 ≤ y < 10 (Expression 2)
10 ≤ x < 20 (Expression 3)
10 ≤ y < 20 (Expression 4)

In this case, the position coordinates (1, 1) are converted to the "dessert section", and the position coordinates (1, 15) are converted to the "bread section".
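The fixed-threshold categorization of Expressions 1 to 4 can be written as a short Python sketch. The function name `categorize_position` and the choice to return `None` for positions outside all four regions are assumptions for illustration.

```python
from typing import Optional


def categorize_position(x: float, y: float) -> Optional[str]:
    """Map a position to a section label using Expressions 1 to 4.

    Returns None when the position lies outside all four regions
    (an assumption; the text does not say how such positions are handled).
    """
    x_low = 0 <= x < 10    # Expression 1
    y_low = 0 <= y < 10    # Expression 2
    x_high = 10 <= x < 20  # Expression 3
    y_high = 10 <= y < 20  # Expression 4

    if x_low and y_low:
        return "dessert section"
    if x_low and y_high:
        return "bread section"
    if x_high and y_high:
        return "confectionery section"
    if x_high and y_low:
        return "drinking water section"
    return None
```

Calling `categorize_position(1, 1)` yields "dessert section" and `categorize_position(1, 15)` yields "bread section", matching the two conversions stated in the text.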

Conventionally, to convert numerical data into category data in this way, the user set the thresholds personally, relying on prior rules or the like. In many cases, however, even with such prior rules the user cannot fix thresholds that divide the data cleanly into categories. For example, when determining thresholds that divide height into the three categories "tall", "average", and "short", it is difficult for the user to judge whether "tall" should mean 175 cm or more or 174 cm or more. The user therefore had to repeat a cycle of changing some of the thresholds and checking the output until the desired result was obtained.

Moreover, categorization with incorrect thresholds can affect the quality of the extracted association rules. For example, if the fish section area is set too wide and the meat section area too narrow, "fish section" and "pork purchase" may be extracted as an association rule. Incorrect rules extracted in such cases may confuse the user rather than help.

Patent Document 1: JP-A-11-250084

As described above, it is difficult for the user to have a clear criterion for the values of the threshold variables set when categorizing numerical data. The user must therefore repeatedly change some of the threshold variables and check the association rules generated as a result, which makes it difficult to obtain useful association rules efficiently.

Furthermore, categorization with incorrect threshold variable values can affect the quality of the extracted association rules. For example, if the fish section area is set too wide and the meat section area too narrow, "fish section" and "pork purchase" may be extracted as an association rule. Incorrect rules extracted in such cases may confuse the user rather than help.

The preprocessing device of the information extraction device of the present invention has been made in view of the above problems, and comprises: an event set database that records trajectory data and categorized category data, each item of which carries a first label, the trajectory data and category data sharing the same ID; a threshold/constraint database that records the conditions and constraints on the threshold variables needed to categorize the trajectory data recorded in the event set database, together with a second label attached to the trajectory data categorized on the basis of those conditions and constraints; a prior knowledge rule database that records, as a prior rule, a combination of the first and second labels that co-occur with a predetermined probability, together with that probability; threshold optimization means that calculates the values of the threshold variables so that all the prior rules contained in advance in the prior knowledge rule database are extracted from the event set database under the constraints; a threshold database that records the values of the threshold variables calculated by the threshold optimization means; and a display device that displays the values of the threshold variables recorded in the threshold database.

Likewise, the preprocessing method of the information extraction device of the present invention comprises: recording, in an event set database, trajectory data and categorized category data, each item of which carries a first label, the trajectory data and category data sharing the same ID; recording, in a threshold/constraint database, the conditions and constraints on the threshold variables needed to categorize the trajectory data recorded in the event set database, together with a second label attached to the trajectory data categorized on the basis of those conditions and constraints; recording, in a prior knowledge rule database, as a prior rule, a combination of the first and second labels that co-occur with a predetermined probability, together with that probability; extracting the prior rule from the prior knowledge rule database; extracting from the event set database all the trajectory data belonging to IDs that include the first label of the extracted prior rule; extracting the conditions and constraints from the threshold/constraint database; and calculating the values of the threshold variables so that, under the extracted constraints, the prior rule appears with the highest probability.

That is, in the present invention, using prior rules that the user already has, a specific item set, an item that is likely to be included in events containing that set, and the minimum probability with which they are included together (co-occurrence probability) are recorded in a database as a prior rule, and the values of the threshold variables needed to categorize the numerical data are determined automatically so that this prior rule is certain to appear.

According to the present invention, even when the user cannot decide the values of the threshold variables directly, the values of the threshold variables needed to categorize numerical data can be determined easily and appropriately.

Embodiments of the present invention will be described below with reference to FIGS. 1 to 15.

(First embodiment)
FIG. 1 is a block diagram schematically showing the configuration of a preprocessing device using prior rules according to an embodiment of the present invention.

As shown in FIG. 1, the preprocessing device using prior rules according to this embodiment comprises an event set database DB1, a threshold/constraint database DB2, a prior knowledge rule database DB3, an optimization parameter database DB4, threshold optimization means 5, a threshold parameter database DB6, and a threshold display device 7.

Next, each element of this preprocessing device using prior rules will be described with reference to FIGS. 2 to 7.

First, the data recorded in the event set database DB1 will be described with reference to FIG. 2.

The event set database DB1 stores the IDs of observed objects and data relating to the observed objects. The data relating to an observed object is classified into numerical data and category data. Numerical data is, for example, measured values of an observed person's motion obtained by observing with sensors the position, velocity, and acceleration of the person's body, head, feet, and hands, the orientation of the body, and the like; some of the observed person's attribute values obtained through questionnaires or the like, such as age and income; or the purchase amount and number of items purchased obtained from a POS terminal or the like. Category data is, for example, some of the observed object's attribute values or requests obtained from questionnaire data, such as sex and occupation, or the purchased products obtained from a POS terminal or the like. Here, as an example of data relating to an observed object, the case is described in which the numerical data is movement trajectory data of the observed person's body and the category data is the observed person's purchase data.

FIG. 2 shows the data recorded in the event set database DB1; FIG. 2A shows the numerical data, and FIG. 2B shows the category data.

As shown in FIG. 2A, the table recording the movement trajectory data of the observed object consists of a plurality of records, each of which holds the ID of the observed object, the data name, the data type, the observation time, and the x, y, and z coordinates indicating the position of the observed object's body. FIG. 2A shows that observed object 001 was at x = 30 cm, y = 20 cm, z = 170 cm at 15:32:00 on August 1, 2007, at x = 15 cm, y = 20 cm, z = 170 cm at 15:32:01, and at x = 15 cm, y = 10 cm, z = 170 cm at 15:32:02. It also shows that the data type is numerical and that the data name is body position. There may be multiple such tables, for example one per observed person, or the data may be recorded together in a single table. The table may contain a single record or, as in this embodiment, a plurality of records.

As shown in FIG. 2B, the table recording the purchase data of the observed objects consists of a plurality of records, each of which holds the ID of the observed object, the data name, the data type, the observation time, and the purchased product. FIG. 2B shows that observed object 001 purchased Item 10 and Item 21 at 15:30:00 on August 1, 2007, that observed object 002 purchased Item 35 at 15:10:00 on August 1, 2007, and that observed object 003 purchased Item 42 at 15:00:00 on August 1, 2007. It also shows that the data type is category data and that the data name is purchase. There may be multiple such tables, for example one per observed person, or the data may be recorded together in a single table. The table may contain a single record or, as in this embodiment, a plurality of records.
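As a rough illustration of how the two tables of FIG. 2 relate through the shared ID, the following Python sketch models the records and groups them into one event per observed object. The class names, field names, and the `group_by_id` helper are hypothetical; only the sample values come from the description of FIGS. 2A and 2B.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class TrajectoryRecord:
    """One row of the numerical (body position) table, coordinates in cm."""
    subject_id: str
    observed_at: datetime
    x: float
    y: float
    z: float


@dataclass(frozen=True)
class PurchaseRecord:
    """One row of the category (purchase) table."""
    subject_id: str
    observed_at: datetime
    item: str


@dataclass
class Event:
    """All data sharing one ID (hypothetical grouping structure)."""
    trajectory: List[TrajectoryRecord] = field(default_factory=list)
    purchases: List[PurchaseRecord] = field(default_factory=list)


def group_by_id(trajectory: List[TrajectoryRecord],
                purchases: List[PurchaseRecord]) -> Dict[str, Event]:
    """Collect trajectory and purchase records under their shared ID."""
    events: Dict[str, Event] = defaultdict(Event)
    for rec in trajectory:
        events[rec.subject_id].trajectory.append(rec)
    for rec in purchases:
        events[rec.subject_id].purchases.append(rec)
    return dict(events)


# Sample values as described for FIG. 2A and FIG. 2B.
TRAJECTORY = [
    TrajectoryRecord("001", datetime(2007, 8, 1, 15, 32, 0), 30, 20, 170),
    TrajectoryRecord("001", datetime(2007, 8, 1, 15, 32, 1), 15, 20, 170),
    TrajectoryRecord("001", datetime(2007, 8, 1, 15, 32, 2), 15, 10, 170),
]
PURCHASES = [
    PurchaseRecord("001", datetime(2007, 8, 1, 15, 30, 0), "Item 10"),
    PurchaseRecord("001", datetime(2007, 8, 1, 15, 30, 0), "Item 21"),
    PurchaseRecord("002", datetime(2007, 8, 1, 15, 10, 0), "Item 35"),
    PurchaseRecord("003", datetime(2007, 8, 1, 15, 0, 0), "Item 42"),
]

EVENTS_BY_ID = group_by_id(TRAJECTORY, PURCHASES)
```

Grouping by ID is one plausible way to form the per-customer events that the later rule evaluation operates on; the patent text does not prescribe a storage layout.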

Next, the data recorded in the threshold/constraint database DB2 will be described with reference to FIG. 3. Here, the case is described in which the position coordinates of the observed object, which are the numerical data recorded in the event set database DB1, are categorized by dividing them into product areas.

FIG. 3 shows the data recorded in the threshold/constraint database DB2. FIG. 3A shows the numerical data name and the threshold variables to be calculated, FIG. 3B shows the labels attached after categorization and the conditions for categorizing the numerical data, and FIG. 3C shows the constraints on the threshold variables.

As shown in FIG. 3A, the table recording the numerical data name and the threshold variables to be calculated consists of a plurality of records, each of which stores a data name and a threshold variable name. The threshold variable names are specified by the user. FIG. 3A shows that the data name is body position and that the threshold variables for classifying the numerical data indicating body position into categories, namely product areas, are X1, X2, Y1, Y2, and T1. There may be multiple such tables, for example one per product area to be classified, or, as in this embodiment, they may be recorded together in a single table. The table may contain a single record or, as in this embodiment, a plurality of records.

As shown in FIG. 3B, the table recording the labels assigned after the numerical data is categorized and the conditions for categorizing the numerical data consists of a plurality of records, each of which holds a data name, the name of the label assigned after categorization, and the conditions under which the label is assigned. The label names and the categorization conditions are specified by the user. FIG. 3B shows the conditions for classifying an observed object into, for example, the category "A product area stay". For the label "A product area stay" to be assigned and the object to be classified into that category, the position coordinates x and y of the numerical data must satisfy (Expression 5) and (Expression 6), respectively.

X1 ≤ x < X2 (Expression 5)
Y1 ≤ y < Y2 (Expression 6)

Furthermore, t, which denotes the time during which the observed object falls within the "A product area stay" category as classified under the conditions of (Expression 5) and (Expression 6), must satisfy (Expression 7).

T1 < t (Expression 7)

That is, for an observed object to be classified into the "A product area stay" category and given the label "A product area stay", (Expression 5), (Expression 6), and (Expression 7) must all be satisfied. There may be multiple such tables, for example one per category to be classified, or, as in this embodiment, they may be recorded together in a single table. The table may contain a single record or, as in this embodiment, a plurality of records.
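The labeling condition of Expressions 5 to 7 can be sketched in Python as follows. Several points are assumptions, since the text does not specify them: time is measured in seconds, the stay time t is the total (not necessarily contiguous) time inside the rectangle, and each sample is credited with the interval until the next sample. The names `dwell_time` and `has_stay_label` are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float, float]  # (time in seconds, x, y); units are assumed


def dwell_time(samples: Sequence[Sample],
               x1: float, x2: float, y1: float, y2: float) -> float:
    """Total time spent inside the rectangle X1 <= x < X2, Y1 <= y < Y2.

    Assumption: a sample inside the rectangle counts for the whole
    interval until the next sample; the last sample contributes nothing.
    """
    total = 0.0
    for (t0, x, y), (t_next, _, _) in zip(samples, samples[1:]):
        if x1 <= x < x2 and y1 <= y < y2:  # Expressions 5 and 6
            total += t_next - t0
    return total


def has_stay_label(samples: Sequence[Sample],
                   x1: float, x2: float, y1: float, y2: float,
                   t1: float) -> bool:
    """True when the stay time t satisfies T1 < t (Expression 7)."""
    return dwell_time(samples, x1, x2, y1, y2) > t1


# Illustrative trajectory: three samples inside the area, then one outside.
SAMPLES = [(0, 150, 150), (10, 150, 160), (20, 150, 170), (30, 300, 300)]
```

With X1 = 100, X2 = 200, Y1 = 100, Y2 = 200 the sample trajectory accumulates 30 seconds inside the area, so the label is assigned for T1 = 15 but not for T1 = 30, because Expression 7 is a strict inequality.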

As shown in FIG. 3C, the table recording the constraints on the threshold variables for categorizing the numerical data consists of a plurality of records, each of which holds a data name, a threshold variable name, and the constraint expression for that threshold variable. The constraint expressions are specified by the user. FIG. 3C shows that the threshold variables X1, X2, Y1, and Y2 must lie within the ranges given by (Expression 8), (Expression 9), (Expression 10), and (Expression 11), respectively.

0 ≤ X1 ≤ 2000 (Expression 8)
0 ≤ X2 ≤ 2000 (Expression 9)
0 ≤ Y1 ≤ 2000 (Expression 10)
0 ≤ Y2 ≤ 2000 (Expression 11)

Here, these constraint expressions are defined, for example, with the size of the store as the maximum value, and mean that no threshold can be set beyond the size of the store. There may be multiple such tables, for example one per category to be classified, or, as in this embodiment, they may be recorded together in a single table. The table may contain a single record or, as in this embodiment, a plurality of records.
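A candidate set of threshold values can be checked against the range constraints of Expressions 8 to 11 with a few lines of Python. The dictionary layout and the name `satisfies_constraints` are assumptions for illustration, as are the candidate values used below.

```python
from typing import Dict, Tuple

# Bounds from Expressions 8 to 11: each variable must lie in [0, 2000].
BOUNDS: Dict[str, Tuple[float, float]] = {
    "X1": (0, 2000),
    "X2": (0, 2000),
    "Y1": (0, 2000),
    "Y2": (0, 2000),
}


def satisfies_constraints(thresholds: Dict[str, float],
                          bounds: Dict[str, Tuple[float, float]] = BOUNDS) -> bool:
    """True when every constrained threshold variable lies within its bounds.

    Variables without an entry in `bounds` (such as T1 here) are not checked.
    """
    return all(lo <= thresholds[name] <= hi for name, (lo, hi) in bounds.items())


# Illustrative candidate values, not prescribed by the constraint table.
CANDIDATE = {"X1": 100, "X2": 200, "Y1": 100, "Y2": 200, "T1": 15}
```

A search procedure could call such a check before evaluating a candidate, so that thresholds outside the store dimensions are rejected early. Note that the constraints as described bound each variable separately and do not themselves require X1 < X2 or Y1 < Y2.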

Next, the data recorded in the prior knowledge rule database DB3 will be described with reference to FIG. 4.

As shown in FIG. 4, the table recorded in the prior knowledge rule database DB3 consists of a plurality of records. Each record holds, as a prior rule, an item set that includes a category-data item contained in many events and an item bearing a label assigned after the numerical data is categorized, together with the probability with which they are included together (co-occurrence probability). The prior rules are specified by the user. In FIG. 4, for example, the first record indicates that the purchase-data item "Item 10 purchase" is included, with a probability of 90% or more, in events that include the item labeled "A product area stay", which is derived from the numerical data indicating body position. The other records are read in the same way: the second record indicates that "Item 20 purchase" is included, with a probability of 80% or more, in events that include the item labeled "B product area stay"; the third record, that "Item 30 purchase" is included with a probability of 80% or more in events that include "C product area stay"; and the fourth record, that "Item 40 purchase" is included with a probability of 80% or more in events that include "D product area stay". The table may also contain only a single record.

In other words, the data shown in FIG. 4 are so-called self-evident rules created from the user's prior knowledge, such as "90% or more of the people who purchased Item 10 stay in the A product area".
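Whether a prior rule of this kind holds for a given categorization can be checked as in the following Python sketch. It reads the rule in the direction of the example sentence, that is, as the fraction of purchasers of an item whose event also carries the stay label. The `PriorRule` class, the function names, and the sample events are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class PriorRule:
    purchase_item: str      # e.g. "Item 10 purchase"
    stay_label: str         # e.g. "A product area stay"
    min_probability: float  # e.g. 0.9 for "90% or more"


def co_occurrence_probability(events: List[FrozenSet[str]], rule: PriorRule) -> float:
    """Fraction of events containing the purchase item that also contain the stay label."""
    buyers = [e for e in events if rule.purchase_item in e]
    if not buyers:
        return 0.0
    return sum(1 for e in buyers if rule.stay_label in e) / len(buyers)


def rule_is_satisfied(events: List[FrozenSet[str]], rule: PriorRule) -> bool:
    """True when the observed probability reaches the rule's minimum."""
    return co_occurrence_probability(events, rule) >= rule.min_probability


RULE_A = PriorRule("Item 10 purchase", "A product area stay", 0.9)

# Illustrative events: 10 purchasers of Item 10, 9 of whom stayed in area A.
SAMPLE_EVENTS = (
    [frozenset({"Item 10 purchase", "A product area stay"})] * 9
    + [frozenset({"Item 10 purchase"})]
    + [frozenset({"Item 35 purchase"})] * 5
)
```

Because the stay labels depend on the threshold variables, the result of `rule_is_satisfied` changes as the thresholds change, which is what makes the prior rules usable as a target when searching for threshold values.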

The prior rules in the prior knowledge rule database DB3 may also be created automatically by using a product master in which product categories are recorded hierarchically according to the level of abstraction of the classification: the user specifies a level of the hierarchy and inputs a co-occurrence probability, either common to that level or different for each entry, and the resulting prior rules are recorded in the prior knowledge rule database DB3. FIG. 5 shows an example of the product master. The first record indicates that the small category Item 10-01 belongs to Item 10 in the middle classification, and that Item 10 belongs to Item A in the large classification. Similarly, the second record indicates that the small category Item 10-02 belongs to Item 10 in the middle classification, and that Item 10 belongs to Item A in the large classification. Among these classifications, a large-classification label is assigned according to a broad product division such as confectionery, meat, or fish; a middle-classification label is assigned according to a finer division such as chocolate or rice crackers; and a small-classification label is assigned according to a detailed division that extends to, for example, the product name and flavor. With such a product master, when the user specifies the large classification, for example, "Item A purchase", "Item A area stay", and the co-occurrence probability input by the user are recorded as a prior rule in the prior knowledge rule database DB3. Similarly, when the user specifies the middle classification, "Item 10 purchase", "Item 10 area stay", and the co-occurrence probability input by the user are recorded as a prior rule in the prior knowledge rule database DB3, and when the user specifies the small classification, "Item 10-01 purchase", "Item 10-01 area stay", and the co-occurrence probability input by the user are recorded as a prior rule in the prior knowledge rule database DB3.
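The automatic creation of prior rules from the product master can be sketched in Python as below. The dictionary keys "small", "middle", and "large", the function name `generate_prior_rules`, and the use of a single probability shared by all entries of the chosen level are assumptions; the text also allows a different probability for each entry, which this sketch does not cover.

```python
from typing import Dict, List, Tuple

# The two product master records described for FIG. 5.
PRODUCT_MASTER: List[Dict[str, str]] = [
    {"small": "Item 10-01", "middle": "Item 10", "large": "Item A"},
    {"small": "Item 10-02", "middle": "Item 10", "large": "Item A"},
]

PriorRuleRow = Tuple[str, str, float]  # (purchase item, stay label, probability)


def generate_prior_rules(master: List[Dict[str, str]],
                         level: str,
                         probability: float) -> List[PriorRuleRow]:
    """Create one prior rule per distinct label at the chosen hierarchy level."""
    labels: List[str] = []
    for row in master:
        if row[level] not in labels:  # keep first-seen order, skip duplicates
            labels.append(row[level])
    return [(f"{label} purchase", f"{label} area stay", probability)
            for label in labels]
```

For example, `generate_prior_rules(PRODUCT_MASTER, "middle", 0.8)` produces a single rule pairing "Item 10 purchase" with "Item 10 area stay", while the level "small" produces one rule for each of Item 10-01 and Item 10-02.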

Next, the data recorded in the optimization parameter database DB4 will be described with reference to FIG. 6.

As shown in FIG. 6, the table recorded in the optimization parameter database DB4 consists of a plurality of records, each of which holds one of the parameters used by the threshold optimization means 5 described later. These parameters are specified by the user and are, for example, the maximum initial-value search count (Lth), which is the maximum number of times an initial value is searched for; the maximum search count (Nth), which is the maximum number of times the optimal value is searched for; the threshold increase/decrease values D1 and D2, which represent the increments and decrements for two kinds of threshold variables; and a parameter (R) needed to generate the initial values. FIG. 6 shows that Lth = 100, Nth = 10000, D1 = 10, D2 = 1, and R = 10. The threshold increase/decrease values D1 and D2 may instead be functions that decrease with the number of searches, for example as shown in (Expression 12).

D1 = α / Nth × N (Expression 12)
N: number of searches
α: constant

The table may contain a plurality of records as described above, or a single record, for example when there is only one parameter.

Next, the threshold optimization means 5, which is described in detail later, uses the numerical data and category data recorded in the event set database DB1 to automatically determine the values of the threshold variables recorded in the threshold/constraint database DB2 so that, under the constraints on the threshold variables recorded in the threshold/constraint database DB2, the prior rules recorded in the prior knowledge rule database DB3 appear appropriately. The values obtained are recorded in the threshold parameter database DB6 described below. The threshold variable values may be determined separately for each prior rule contained in the prior knowledge rule database DB3, or optimal values that satisfy all the prior rules contained in the prior knowledge rule database DB3 may be determined at once by a multi-objective optimization technique such as a GA (genetic algorithm), a known technique described in John H. Holland, "Adaptation in Natural and Artificial Systems", University of Michigan Press, 1975.

次に、閾値パラメータデータベースＤＢ６に記録されるデータについて、図７を参照して説明する。 Next, data recorded in the threshold parameter database DB6 will be described with reference to FIG.

図７に示すように、閾値パラメータデータベースＤＢ６に記録されるテーブルは、複数のレコードからなり、レコードには、データ名、閾値変数名、閾値最適化手段５によって算出された最適な閾値変数の値が記録されている。図７においては、１番目のレコードは、体の位置を示す数値データをカテゴリ化するための閾値変数Ｘ１の値が１００であることを示している。同様に閾値変数Ｘ２の値は２００、閾値変数Ｙ１の値は１００、閾値変数Ｙ２の値は２００、閾値変数Ｔ１の値は１５であることを示している。なお、このテーブルは、例えば分類するカテゴリ毎に複数あってもよいし、本実施形態に示すように、これらを１つのテーブルにまとめて記録させてもよい。また、例えば閾値変数が単数である場合等は、テーブルに記録されるレコードが単数であってもよい。 As shown in FIG. 7, the table recorded in the threshold parameter database DB 6 includes a plurality of records. The record includes a data name, a threshold variable name, and an optimum threshold variable value calculated by the threshold optimization unit 5. Is recorded. In FIG. 7, the first record indicates that the value of the threshold variable X1 for categorizing numerical data indicating the position of the body is 100. Similarly, the value of the threshold variable X2 is 200, the value of the threshold variable Y1 is 100, the value of the threshold variable Y2 is 200, and the value of the threshold variable T1 is 15. Note that there may be a plurality of this table for each category to be classified, for example, or these may be recorded together in one table as shown in the present embodiment. For example, when the threshold variable is singular, the record recorded in the table may be singular.

最後に、閾値表示装置７は、閾値パラメータデータベースＤＢ６に記録された最適な閾値変数をユーザに表示するための装置であり、例えば通常のディスプレイ装置がこれに該当する。 Finally, the threshold display device 7 is a device for displaying to the user the optimum threshold variables recorded in the threshold parameter database DB6; an ordinary display device is one example.

続いて、閾値最適化手段５として、１つの事前ルールに対する閾値変数の最適解を求める方法について、図８、図９を用いて説明する。 Next, the method by which the threshold optimization means 5 obtains the optimum solution of the threshold variables for one prior rule will be described with reference to FIGS. 8 and 9.

閾値最適化手段５は、数値データをカテゴリ化するための閾値変数の値の初期設定を行うための処理手順１と、処理手順１で設定された閾値変数の値を最適化するための処理手順２に大別される。図８に処理手順１、図９に処理手順２を示すフローチャートを示す。 The threshold optimization means 5 is broadly divided into processing procedure 1, which initializes the values of the threshold variables for categorizing numerical data, and processing procedure 2, which optimizes the threshold variable values set in processing procedure 1. FIG. 8 is a flowchart of processing procedure 1, and FIG. 9 is a flowchart of processing procedure 2.

まず、数値データをカテゴリ化するための閾値変数の値の初期設定を行うための処理手順１を、図８を参照して説明する。 First, the processing procedure 1 for initializing the value of the threshold variable for categorizing numerical data will be described with reference to FIG.

図８に示すように、処理手順１では、はじめに、事前知識ルールデータベースＤＢ３を参照し、そこに記録されている事前ルールの中から、ｊ番目の事前ルールを取り出す（Ｓ１０１）。jは抽出した事前ルールの格納されている順番を示しており、例えば、ｊ＝1では格納されている1番目の事前ルール取り出すこととなる。事前知識ルールデータベースＤＢ３が図４である場合、抽出される事前ルールは、「Ｉｔｅｍ１０購買」、「Ａ商品エリア滞在」で共起確率は９０％である。 As shown in FIG. 8, in the processing procedure 1, first, the prior knowledge rule database DB3 is referred to, and the jth prior rule is extracted from the prior rules recorded there (S101). j indicates the order in which the extracted prior rules are stored. For example, when j = 1, the stored first prior rule is extracted. When the prior knowledge rule database DB3 is FIG. 4, the extracted prior rules are “Item 10 purchase” and “A product area stay”, and the co-occurrence probability is 90%.

次に、イベント集合データベースＤＢ１を参照し、該当するカテゴリラベルを含む全ての被観測体のＩＤを全て抽出する。上述の例では、「Ｉｔｅｍ１０購入」というカテゴリデータを有する被観測体のＩＤを全て抽出する。そして、事前ルールに含まれるカテゴリ化する数値データのデータ名と抽出した被観測体のＩＤをもとに、イベント集合データベースＤＢ１から対象となる被観測体のデータを抜き出し、事前ルール該当イベントセットの作成を行う（Ｓ１０２）。ここで、イベントセットに含まれる被観測体数をＮ１とする。 Next, the event set database DB1 is referenced, and the IDs of all observed objects that have the relevant category label are extracted. In the above example, the IDs of all observed objects having the category data "Item10 purchase" are extracted. Then, based on the data name of the numerical data to be categorized in the prior rule and the extracted observed-object IDs, the data of the target observed objects is extracted from the event set database DB1 to create the event set corresponding to the prior rule (S102). The number of observed objects in this event set is denoted N1.

次に、閾値・制約データベースＤＢ２を参照し、Ｓ１０１で抽出した事前ルールに含まれるカテゴリ化する数値データのラベルに関する閾値変数とその条件式と制約式を抽出する（Ｓ１０３）。例えば、事前ルールに含まれるカテゴリ化する数値データのラベルが「Ａ商品エリア滞在」であり、閾値変数、条件式、制約式が図３Ａ、図３Ｂ、図３Ｃの場合、求めるべき閾値変数として、Ｘ１、Ｘ２、Ｙ１、Ｙ２、Ｔ１の５種類が抽出される。また条件式として、上述の(５式)、(６式)、(７式)が抽出される。また、制約式として、上述の(８式)、(９式)、(１０式)、(１１式)が抽出される。 Next, with reference to the threshold / constraint database DB2, a threshold variable, its conditional expression and constraint expression regarding the label of the numerical data to be categorized included in the pre-rule extracted in S101 are extracted (S103). For example, when the label of the numerical data to be categorized included in the prior rule is “A product area stay” and the threshold variable, the conditional expression, and the constraint expression are FIG. 3A, FIG. 3B, and FIG. Five types of X1, X2, Y1, Y2, and T1 are extracted. Further, the above-described (Expression 5), (Expression 6), and (Expression 7) are extracted as conditional expressions. Further, the above-described (Equation 8), (Equation 9), (Equation 10), and (Equation 11) are extracted as constraint equations.

次に、Ｓ１０２で作成した事前ルール該当イベント集合データセットに含まれる全てのデータ点を対象とし、半径Ｒの円に含まれるデータ点の個数(Ｎ２)を算出する。この円は、i番目に個数が多い点（Ｘｓ、Ｙｓ）を中心とした円であり、このときの（Ｘｓ、Ｙｓ）を用いて、Ｘ１，Ｘ２、Ｙ１、Ｙ２の初期値を（１３式）、（１４式）、（１５式）、（１６式）、（１７式）のように設定する（Ｓ１０４）。 Next, for all data points in the prior-rule event set created in S102, the number (N2) of data points contained in a circle of radius R is calculated. The circle used is the one centered on the point (Xs, Ys) that ranks i-th by this count, and the initial values of the threshold variables are set from (Xs, Ys) according to (Expression 13), (Expression 14), (Expression 15), (Expression 16), and (Expression 17) (S104).

Ｘ１Ｓ＝Ｘｓ−Ｒ（１３式）
Ｘ２Ｓ＝Ｘｓ＋Ｒ（１４式）
Ｙ１Ｓ＝Ｙｓ−Ｒ（１５式）
Ｙ２Ｓ＝Ｙｓ＋Ｒ（１６式）
Ｔ１Ｓ＝Ｎ２／Ｎ１（１７式）
X1S = Xs − R (Expression 13)
X2S = Xs + R (Expression 14)
Y1S = Ys − R (Expression 15)
Y2S = Ys + R (Expression 16)
T1S = N2 / N1 (Expression 17)
ここでiは、初期値の生成回数を示し、初期値の生成を繰り返すたびに増加する変数である。また、Ｒは、ユーザが自由に設定してよい。 Here, i is the number of times an initial value has been generated and is incremented each time initial-value generation is repeated. R may be set freely by the user.
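The initial-value step S104 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name `initial_thresholds` is hypothetical, candidate circle centers are assumed to be the data points themselves, and each observed object is assumed to contribute one data point, so that N1 equals the number of points.

```python
from math import hypot


def initial_thresholds(points, radius, rank=1):
    """Return (X1S, X2S, Y1S, Y2S, T1S) following Expressions 13 to 17.

    points: list of (x, y) positions from the prior-rule event set.
    rank:   i in the text; 1 selects the center with the most neighbors.
    """
    n1 = len(points)  # assumption: one point per observed object, so N1 = len(points)
    scored = []
    for cx, cy in points:
        # N2 for this candidate center: points within distance R (center included)
        n2 = sum(1 for x, y in points if hypot(x - cx, y - cy) <= radius)
        scored.append((n2, (cx, cy)))
    # Stable sort by count, largest first; ties keep their input order
    scored.sort(key=lambda item: item[0], reverse=True)
    n2, (xs, ys) = scored[rank - 1]
    return (xs - radius, xs + radius, ys - radius, ys + radius, n2 / n1)
```

Calling the function again with a larger `rank` corresponds to regenerating the initial value after i has been incremented.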

次に、Ｓ１０４で生成した初期値がＳ１０３で抽出した制約式を満たすか否かを判定する（Ｓ１０５）。そして、判定の結果、初期値が制約式を満たす場合は後述する処理手順２へ進む。 Next, it is determined whether or not the initial value generated in S104 satisfies the constraint expression extracted in S103 (S105). As a result of the determination, if the initial value satisfies the constraint equation, the process proceeds to process procedure 2 described later.

Ｓ１０５で初期値が制約式を満たさなかった場合、ｉ＝ｉ＋１として、最適化パラメータデータベースＤＢ４を参照し、初期探索回数ｉが最大初期探索回数Ｌｔｈより小さいかどうかを判定する（Ｓ１０６）。そして、判定の結果、初期探索回数が最大初期探索回数より小さい場合は、Ｓ１０４へ進む。 If the initial values do not satisfy the constraint expressions in S105, i is incremented (i = i + 1), and the optimization parameter database DB4 is referenced to determine whether the initial search count i is smaller than the maximum initial-value search count Lth (S106). If it is smaller, the process returns to S104.

一方、Ｓ１０６の判定の結果、初期探索回数ｉが最大初期探索回数Ｌｔｈ以上である場合、閾値が見つからないことをユーザに知らせる（Ｓ１０７）。 On the other hand, if the result of determination in S106 is that the initial search count i is greater than or equal to the maximum initial search count Lth, the user is informed that the threshold value is not found (S107).

以上のような処理手順１により、数値データをカテゴリ化するための閾値変数の値の初期設定を行う。上述の例で、例えばｉ＝１の場合、「Ｉｔｅｍ１０購入」というカテゴリデータを含む被観測体のＩＤを有する全てのデータ点のうち、半径Ｒの円に含まれるデータ点が最も多かったときの中心位置を基準として、（１３式）〜（１７式）に従って、閾値の初期設定がなされる。 Processing procedure 1 described above thus initializes the values of the threshold variables for categorizing numerical data. In the above example, when i = 1, the thresholds are initialized according to (Expression 13) to (Expression 17), taking as the reference the center position whose circle of radius R contains the most data points among all data points belonging to observed-object IDs that have the category data "Item10 purchase".
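The control flow of processing procedure 1 (S104 to S107) can be sketched as a bounded retry loop. The names `find_initial_thresholds`, `nth_candidate`, and `satisfies_constraints` are hypothetical; the constraint expressions themselves are stored in the threshold/constraint database DB2 and are passed in here as an arbitrary predicate.

```python
def find_initial_thresholds(nth_candidate, satisfies_constraints, l_th):
    """Try the 1st, 2nd, ... initial value until one satisfies the constraints.

    nth_candidate:         callable i -> thresholds generated for the i-th try (S104)
    satisfies_constraints: callable thresholds -> bool (S105)
    l_th:                  maximum initial-value search count Lth
    Returns the thresholds, or None when no threshold is found (S107).
    """
    i = 1
    while True:
        thresholds = nth_candidate(i)           # S104
        if satisfies_constraints(thresholds):   # S105
            return thresholds                   # continue with processing procedure 2
        i += 1                                  # S106: i = i + 1
        if not i < l_th:                        # S106: is i still smaller than Lth?
            return None                         # S107: report that no threshold was found
```

Returning `None` stands in for the user notification of S107; a real device would presumably raise an error or show a message instead.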

続いて、処理手順１で設定された閾値の最適化を行うための処理手順２を、図９を参照して説明する。 Next, process procedure 2 for optimizing the threshold set in process procedure 1 will be described with reference to FIG.

図９に示すように、処理手順２では、はじめに、処理手順１で求められた閾値Ｘ１Ｓ、Ｘ２Ｓ、Ｙ１Ｓ、Ｙ２Ｓ、Ｔ１Ｓを用いて数値データをカテゴリ化し、共起確率(Ｒ’)を求める（Ｓ２０１）。このときカテゴリ化は、上述の例で、例えば「Ｉｔｅｍ１０購入」というカテゴリデータを含む被観測体のＩＤを有する数値データが、（５式）〜（７式）を満たすか否かで判断され、満たすＩＤには、例えば「商品Ａエリア滞在」というラベルが付与される。また、共起確率は、例えば「Ｉｔｅｍ１０購入」というカテゴリデータを含む全てのＩＤのうち、このＩＤが有する数値データが「商品Ａエリア滞在」というカテゴリに分類される割合を示したものである。 As shown in FIG. 9, in the process procedure 2, first, the numerical data is categorized using the threshold values X1S, X2S, Y1S, Y2S, and T1S obtained in the process procedure 1 to obtain the co-occurrence probability (R ′) ( S201). At this time, categorization is determined by whether or not the numerical data having the ID of the observed object including the category data of “Item 10 purchase” satisfies (Expression 5) to (Expression 7) in the above example, For example, a label “stay in the product A area” is given to the ID to be satisfied. Further, the co-occurrence probability indicates, for example, a ratio in which numerical data included in this ID among all IDs including category data “Item 10 purchase” is classified into the category “product A area stay”.
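The co-occurrence probability R' of S201 can be illustrated with the following sketch. The conditional expressions (Expression 5) to (Expression 7) are not reproduced in this part of the description, so the sketch assumes, purely for illustration, that an observation falls in the area when X1 ≤ x ≤ X2 and Y1 ≤ y ≤ Y2 and its dwell time is at least T1. The names `Thresholds`, `stays_in_area`, and `co_occurrence` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    x1: float
    x2: float
    y1: float
    y2: float
    t1: float


def stays_in_area(observations, th):
    """True if any (x, y, dwell) observation meets the assumed area conditions."""
    return any(
        th.x1 <= x <= th.x2 and th.y1 <= y <= th.y2 and dwell >= th.t1
        for x, y, dwell in observations
    )


def co_occurrence(observations_by_id, target_ids, th):
    """Share of target IDs (e.g. buyers of Item10) that receive the area label."""
    if not target_ids:
        return 0.0
    hits = sum(
        1 for oid in target_ids if stays_in_area(observations_by_id.get(oid, []), th)
    )
    return hits / len(target_ids)
```

With the threshold values of FIG. 7 (X1 = 100, X2 = 200, Y1 = 100, Y2 = 200, T1 = 15), an observed object seen at (150, 150) for 20 time units would be labeled, while one seen there for only 5 would not.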

次に、事前知識ルールデータベースＤＢ３に記録された共起確率(Ｒ)とＳ２０１で求められた共起確率(Ｒ’) を比較する（Ｓ２０２）。そして、Ｒ’＜Ｒを満たさない場合、Ｘ１＝Ｘ１Ｓ、Ｘ２＝Ｘ２Ｓ、Ｙ１＝Ｙ１Ｓ、Ｙ２＝Ｙ２Ｓ、Ｔ１＝Ｔ１Ｓとして、これらの値を閾値最適化手段５の出力とする。 Next, the co-occurrence probability (R) recorded in the prior knowledge rule database DB3 is compared with the co-occurrence probability (R ') obtained in S201 (S202). If R ′ <R is not satisfied, X1 = X1S, X2 = X2S, Y1 = Y1S, Y2 = Y2S, and T1 = T1S, and these values are used as the output of the threshold optimization unit 5.

一方、Ｓ２０２でＲ’＜Ｒを満たす場合、Ｘ１Ｓ’＝Ｘ１Ｓ−Ｄ１とする（Ｓ２０３）。次に、閾値・制約データベースＤＢ２を参照し、Ｓ２０３で求められたＸ１Ｓ’が制約を満たすか否かを判定する（Ｓ２０４）。そして、Ｘ１Ｓ’が制約を満たさない場合はＲ’１＝０として、後述のＳ２０６に進む。 On the other hand, if R' < R is satisfied in S202, X1S' = X1S − D1 is set (S203). Next, the threshold/constraint database DB2 is referenced to determine whether X1S' obtained in S203 satisfies the constraints (S204). If X1S' does not satisfy the constraints, R'1 = 0 is set and the process proceeds to S206 described later.

一方、Ｘ１Ｓ’が制約を満たす場合は、Ｘ１Ｓ’、Ｘ２Ｓ、Ｙ１Ｓ、Ｙ２Ｓ、Ｔ１Ｓを用いて数値データをカテゴリ化し、共起確率(Ｒ’１)を求める（Ｓ２０５）。 On the other hand, if X1S ′ satisfies the constraints, the numerical data is categorized using X1S ′, X2S, Y1S, Y2S, and T1S, and the co-occurrence probability (R′1) is obtained (S205).

次に、Ｘ２Ｓ’＝Ｘ２Ｓ＋Ｄ１とする（Ｓ２０６）。 Next, X2S '= X2S + D1 is set (S206).

次に、閾値・制約データベースＤＢ２を参照し、Ｓ２０６で求められたＸ２Ｓ’が制約を満たすか否かを判定する（Ｓ２０７）。そして、Ｘ２Ｓ’が制約を満たさない場合はＲ’２＝０として、後述のＳ２０９に進む。 Next, the threshold/constraint database DB2 is referenced to determine whether X2S' obtained in S206 satisfies the constraints (S207). If X2S' does not satisfy the constraints, R'2 = 0 is set and the process proceeds to S209 described later.

一方、Ｘ２Ｓ’が制約を満たす場合は、Ｘ１Ｓ、Ｘ２Ｓ’、Ｙ１Ｓ、Ｙ２Ｓ、Ｔ１Ｓを用いて数値データをカテゴリ化し、共起確率(Ｒ’２)を求める（Ｓ２０８）。 On the other hand, if X2S ′ satisfies the constraints, the numerical data is categorized using X1S, X2S ′, Y1S, Y2S, and T1S, and the co-occurrence probability (R′2) is obtained (S208).

次に、Ｙ１Ｓ’＝Ｙ１Ｓ−Ｄ１とする（Ｓ２０９）。 Next, Y1S '= Y1S-D1 is set (S209).

次に、閾値・制約データベースＤＢ２を参照し、Ｓ２０９で求められたＹ１Ｓ’が制約を満たすか否かを判定する（Ｓ２１０）。そして、Ｙ１Ｓ’が制約を満たさない場合はＲ’３＝０として、後述のＳ２１２に進む。 Next, the threshold/constraint database DB2 is referenced to determine whether Y1S' obtained in S209 satisfies the constraints (S210). If Y1S' does not satisfy the constraints, R'3 = 0 is set and the process proceeds to S212 described later.

一方、Ｙ１Ｓ’が制約を満たす場合は、Ｘ１Ｓ、Ｘ２Ｓ、Ｙ１Ｓ’、Ｙ２Ｓ、Ｔ１Ｓを用いて数値データをカテゴリ化し、共起確率(Ｒ’３)を求める（Ｓ２１１）。 On the other hand, if Y1S ′ satisfies the constraints, the numerical data is categorized using X1S, X2S, Y1S ′, Y2S, and T1S, and the co-occurrence probability (R′3) is obtained (S211).

次に、Ｙ２Ｓ’＝Ｙ２Ｓ＋Ｄ１とする（Ｓ２１２）。 Next, Y2S '= Y2S + D1 is set (S212).

次に、閾値・制約データベースＤＢ２を参照し、Ｓ２１２で求められたＹ２Ｓ’が制約を満たすか否かを判定する（Ｓ２１３）。そして、Ｙ２Ｓ’が制約を満たさない場合はＲ’４＝０として、後述のＳ２１５に進む。 Next, the threshold/constraint database DB2 is referenced to determine whether Y2S' obtained in S212 satisfies the constraints (S213). If Y2S' does not satisfy the constraints, R'4 = 0 is set and the process proceeds to S215 described later.

一方、Ｙ２Ｓ’が制約を満たす場合は、Ｘ１Ｓ、Ｘ２Ｓ、Ｙ１Ｓ、Ｙ２Ｓ’、Ｔ１Ｓを用いて数値データをカテゴリ化し、共起確率(Ｒ’４)を求める（Ｓ２１４）。 On the other hand, if Y2S ′ satisfies the constraint, the numerical data is categorized using X1S, X2S, Y1S, Y2S ′, and T1S to obtain the co-occurrence probability (R′4) (S214).

次に、Ｔ１Ｓ’＝Ｔ１Ｓ−Ｄ２とする（Ｓ２１５）。 Next, T1S '= T1S-D2 is set (S215).

次に、閾値・制約データベースＤＢ２を参照し、Ｓ２１５で求められたＴ１Ｓ’が制約を満たすか否かを判定する（Ｓ２１６）。そして、Ｔ１Ｓ’が制約を満たさない場合はＲ’５＝０として、後述のＳ２１８に進む。 Next, the threshold/constraint database DB2 is referenced to determine whether T1S' obtained in S215 satisfies the constraints (S216). If T1S' does not satisfy the constraints, R'5 = 0 is set and the process proceeds to S218 described later.

一方、Ｔ１Ｓ’が制約を満たす場合は、Ｘ１Ｓ、Ｘ２Ｓ、Ｙ１Ｓ、Ｙ２Ｓ、Ｔ１Ｓ’を用いて数値データをカテゴリ化し、共起確率(Ｒ’５)を求める（Ｓ２１７）。 On the other hand, if T1S 'satisfies the constraint, the numerical data is categorized using X1S, X2S, Y1S, Y2S, T1S' to determine the co-occurrence probability (R'5) (S217).

次に、Ｒ’１、Ｒ’２、Ｒ’３、Ｒ’４、Ｒ’５の最大値をＲ’とし、共起確率が最大となった場合の閾値変数の値を更新する（Ｓ２１８）。例えば、Ｒ’１が最大値であった場合、Ｘ１Ｓの値のみＸ１Ｓ＝Ｘ１Ｓ’とする。同様に、Ｒ’２が最大値であった場合、Ｘ２Ｓの値のみＸ２Ｓ＝Ｘ２Ｓ’、Ｒ’３が最大値であった場合、Ｙ１Ｓの値のみＹ１Ｓ＝Ｙ１Ｓ’、Ｒ’４が最大値であった場合、Ｙ２Ｓの値のみＹ２Ｓ＝Ｙ２Ｓ’、Ｒ’５が最大値であった場合、Ｔ１Ｓの値のみＴ１Ｓ＝Ｔ１Ｓ’とする。 Next, the maximum of R'1, R'2, R'3, R'4, and R'5 is taken as R', and only the threshold variable that yielded this maximum co-occurrence probability is updated (S218). For example, if R'1 is the maximum, only X1S is updated, as X1S = X1S'. Similarly, if R'2 is the maximum, only X2S is updated (X2S = X2S'); if R'3 is the maximum, only Y1S is updated (Y1S = Y1S'); if R'4 is the maximum, only Y2S is updated (Y2S = Y2S'); and if R'5 is the maximum, only T1S is updated (T1S = T1S').

次に、事前知識ルールデータベースＤＢ３に記録された共起確率(Ｒ)とＳ２１８で求められた共起確率(Ｒ’) を比較する（Ｓ２１９）。そして、Ｒ’＜Ｒを満たさない場合、Ｘ１＝Ｘ１Ｓ、Ｘ２＝Ｘ２Ｓ、Ｙ１＝Ｙ１Ｓ、Ｙ２＝Ｙ２Ｓ、Ｔ１＝Ｔ１Ｓとして、これらの値を閾値最適化手段５の出力とする。 Next, the co-occurrence probability (R) recorded in the prior knowledge rule database DB3 is compared with the co-occurrence probability (R ') obtained in S218 (S219). If R ′ <R is not satisfied, X1 = X1S, X2 = X2S, Y1 = Y1S, Y2 = Y2S, and T1 = T1S, and these values are used as the output of the threshold optimization unit 5.

一方、Ｓ２１９でＲ’＜Ｒを満たす場合、最適化パラメータデータベースＤＢ４を参照し、最大探索回数（Ｎｔｈ）と現在の検索回数Ｎを比較する（Ｓ２２０）。そして、Ｎ＜Ｎｔｈかつ、Ｒ’１＝Ｒ’２＝Ｒ’３＝Ｒ’４＝Ｒ’５＝０を満たす場合は、検索回数ＮをＮ＝Ｎ＋１と更新し、Ｓ２０３に進む。一方、Ｎ＜Ｎｔｈかつ、Ｒ’１＝Ｒ’２＝Ｒ’３＝Ｒ’４＝Ｒ’５＝０を満たさない場合は、処理手順１で示した閾値の初期値の生成回数ｉをｉ＝ｉ＋１と更新し、処理手順１のＳ１０６に進む。 On the other hand, when R ′ <R is satisfied in S219, the optimization parameter database DB4 is referred to, and the maximum number of searches (Nth) is compared with the current number of searches N (S220). If N <Nth and R′1 = R′2 = R′3 = R′4 = R′5 = 0 are satisfied, the number of searches N is updated to N = N + 1, and the process proceeds to S203. On the other hand, when N <Nth and R′1 = R′2 = R′3 = R′4 = R′5 = 0 are not satisfied, the number of generations i of the initial value of the threshold shown in the processing procedure 1 is set to i. = I + 1, and the process proceeds to S106 of process procedure 1.

以上のような処理を行うことで、数値データをカテゴリ化するための閾値変数の値の最適解を自動的に求めることが可能となる。 By performing the processing as described above, it is possible to automatically obtain the optimum solution of the value of the threshold variable for categorizing numerical data.
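The search of processing procedure 2 (S201 to S220) amounts to a greedy, one-variable-at-a-time search, which can be sketched as below. This is a simplified sketch under stated assumptions: `optimize_thresholds`, `score`, and `feasible` are hypothetical names, thresholds are held in a dict with keys X1, X2, Y1, Y2, T1, and when no move is feasible the sketch simply stops and reports failure rather than returning to processing procedure 1.

```python
def optimize_thresholds(start, score, feasible, target, d1, d2, n_th):
    """Widen the area or lower the time threshold until score >= target.

    start:    dict with keys "X1", "X2", "Y1", "Y2", "T1" (output of procedure 1)
    score:    callable thresholds -> co-occurrence probability R'
    feasible: callable thresholds -> bool (constraint check against DB2)
    target:   co-occurrence probability R of the prior rule
    Returns (thresholds, reached), where reached tells whether R' >= R was achieved.
    """
    moves = [("X1", -d1), ("X2", +d1), ("Y1", -d1), ("Y2", +d1), ("T1", -d2)]
    current = dict(start)
    best = score(current)                          # S201
    n = 0
    while best < target and n < n_th:              # S202 / S219 / S220
        trials = []
        for name, delta in moves:                  # S203 to S217
            candidate = dict(current)
            candidate[name] += delta
            r = score(candidate) if feasible(candidate) else 0.0
            trials.append((r, candidate))
        r_max, candidate = max(trials, key=lambda t: t[0])   # S218
        if r_max == 0.0:
            break                                  # simplification: no usable move
        current, best = candidate, r_max           # update only the winning variable
        n += 1
    return current, best >= target
```

Because each move only enlarges the area or relaxes the time threshold, the score should not decrease from one iteration to the next under the assumed area conditions; ties are resolved in favor of the first move in the order X1, X2, Y1, Y2, T1.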

以上に説明したように、本実施形態による事前ルールを用いた前処理装置によれば、閾値最適化手段５によって、事前ルールが適切に現れるように、閾値・制約データベースＤＢ２に記録された閾値変数の値を自動的に求めることができる。すなわち、ユーザが閾値を決定できない場合であっても、容易かつ適切に閾値変数の値を決定することができる。 As described above, according to the preprocessing device using the pre-rule according to the present embodiment, the threshold variable recorded in the threshold / constraint database DB2 so that the pre-rule appears appropriately by the threshold optimizing means 5. Can be automatically determined. That is, even if the user cannot determine the threshold value, the value of the threshold variable can be determined easily and appropriately.

なお、このように求められた閾値変数の値は、閾値表示装置７を用いて、様々な形でユーザに表示することができる。例えば、同一のデータ名の同一のラベルの条件に含まれる閾値変数の中で、ユーザが指定した変数の値を用いて、図を描画してもよい。この例を図１０に示す。図１０に示すように、事前ルールから導き出した閾値変数の値を用いることで、すべての商品エリア２０を図示することができる。このように図示することで、ある商品の購入者が滞在する場所、複数の商品エリアが重なりすぎて混雑する場所などを視覚的にとらえることができるため、商品の棚２１の配置やＰＯＰの置き方などを見直すなどの施策を打つ際の知見を得ることができる。 The threshold variable values obtained in this way can be displayed to the user in various forms using the threshold display device 7. For example, a diagram may be drawn using the values of user-specified variables among the threshold variables included in the conditions of the same label of the same data name. An example is shown in FIG. 10. As shown in FIG. 10, all the product areas 20 can be drawn by using the threshold variable values derived from the prior rules. Such a diagram makes it possible to see at a glance where purchasers of a given product stay and where multiple product areas overlap heavily and become crowded, which provides insight for measures such as reviewing the arrangement of the product shelves 21 or the placement of POP displays.

(第２の実施形態) (Second Embodiment)
次に、第１の実施形態による事前ルールを用いた前処理装置を用いた情報抽出装置について、図１１〜図１５を参照して説明する。 Next, an information extraction device that uses the preprocessing device with prior rules according to the first embodiment will be described with reference to FIGS. 11 to 15.

図１１は、第１の実施形態による事前ルールを用いた前処理装置を用いた情報抽出装置の構成を概略的に示すブロック図である。 FIG. 11 is a block diagram schematically showing the configuration of an information extraction device using a preprocessing device using a pre-rule according to the first embodiment.

図１１に示すように、本実施形態に係る情報抽出装置は、閾値最適化処理部と情報抽出部とで構成される。このうち、閾値最適化処理部は、第１の実施形態に示す事前ルールを用いた前処理装置と同様の構成である。ただし、本実施形態ではイベント集合データベースＤＢ１を第１イベント集合データベースＤＢ１と称す。また、閾値表示装置７は必要なく、もし最適化された閾値を表示したい場合は、閾値パラメータデータベースＤＢ６を参照して、後述の相関ルール表示装置１３を用いてユーザに提示すればよい。 As shown in FIG. 11, the information extraction apparatus according to this embodiment includes a threshold optimization processing unit and an information extraction unit. Among these, the threshold optimization processing unit has the same configuration as the preprocessing apparatus using the pre-rule shown in the first embodiment. However, in the present embodiment, the event set database DB1 is referred to as a first event set database DB1. Further, the threshold value display device 7 is not necessary, and if it is desired to display an optimized threshold value, it may be presented to the user using the correlation rule display device 13 described later with reference to the threshold parameter database DB6.

一方、情報抽出部は、第２イベント集合データベースＤＢ７と、連続データカテゴリ化手段８と、変換後イベント集合データベースＤＢ９と、相関ルール抽出パラメータデータベースＤＢ１０と、相関ルール抽出手段１１と、相関ルールデータベースＤＢ１２と、相関ルール表示装置１３とで構成される。 On the other hand, the information extraction unit includes a second event set database DB7, a continuous data categorization means 8, a post-conversion event set database DB9, a correlation rule extraction parameter database DB10, a correlation rule extraction means 11, and a correlation rule database DB12. And the correlation rule display device 13.

続いて、このような情報抽出装置を構成する各要素について、図１２〜図１５を参照して説明する。なお、この情報抽出装置のうち、閾値最適化処理部は第１の実施形態と同様であるため説明を省略し、ここでは、情報抽出部を構成する各要素について説明する。 Next, each element constituting this information extraction device will be described with reference to FIGS. 12 to 15. Since the threshold optimization processing unit is the same as in the first embodiment, its description is omitted, and only the elements constituting the information extraction unit are described here.

まず、第２イベント集合データベースＤＢ７に記録されるデータについて説明する。 First, data recorded in the second event set database DB7 will be described.

第２イベント集合データベースＤＢ７は、第１イベント集合データベースＤＢ１と基本的に同一のものである。すなわち、第２イベント集合データベースＤＢ７は、それぞれ複数のレコードからなるテーブルを有し、１つのレコードには、被観測体のＩＤと被観測体の数値データまたはカテゴリデータが記録されている。この第２イベント集合データベースＤＢ７に記録されている被観測体に関するデータは、第１イベント集合データベースＤＢ１に記録されているデータと同種類のセンサ、または機器で取得したデータでもよいし、その一部でもよい。また、異なる種類のセンサ、または機器で取得したデータであってもよい。ただし、第２イベント集合データベースＤＢ７に記録されている被観測体に関するデータが、第1イベント集合データベースＤＢ１に記録されているデータと異なるセンサ、機器で取得したデータを含む場合には、これらのデータが数値データではなくカテゴリデータである必要がある。また、第1イベント集合データベースＤＢ１に記録されているデータと第２イベント集合データベースＤＢ７に記録されているデータは、全く同一であってもよい。 The second event set database DB7 is basically the same as the first event set database DB1. That is, the second event set database DB7 has tables each consisting of a plurality of records, and each record stores the ID of an observed object together with numerical data or category data of that observed object. The data on observed objects recorded in the second event set database DB7 may be data acquired by the same type of sensor or device as the data recorded in the first event set database DB1, or may be a part of such data. It may also be data acquired by a different type of sensor or device. However, if the data recorded in the second event set database DB7 includes data acquired by a sensor or device different from those used for the first event set database DB1, that data must be category data rather than numerical data. The data recorded in the first event set database DB1 and the data recorded in the second event set database DB7 may also be exactly the same.

次に、数値データカテゴリ化手段８について説明する。 Next, the numerical data categorizing means 8 will be described.

数値データカテゴリ化手段８は、第２イベント集合データベースＤＢ７に記録された被観測体の数値データを、閾値・制約データベースＤＢ２に記録されている条件式及び、閾値パラメータデータベースＤＢ６に記録されている最適化された閾値変数の値を使用して、カテゴリデータに変換する手段である。この数値データカテゴリ化手段８でカテゴリ化された数値データを有する被観測体のＩＤには、カテゴリに対応するラベルが付与され、変換後イベント集合データベースＤＢ９に記録される。 The numerical data categorizing means 8 converts the numerical data of the observed objects recorded in the second event set database DB7 into category data, using the conditional expressions recorded in the threshold/constraint database DB2 and the optimized threshold variable values recorded in the threshold parameter database DB6. The ID of each observed object whose numerical data has been categorized by the numerical data categorizing means 8 is given the label corresponding to the category and recorded in the post-conversion event set database DB9.

次に、変換後イベント集合データベースＤＢ９に記録されるデータついて、図１２を参照して説明する。 Next, data recorded in the post-conversion event set database DB9 will be described with reference to FIG.

変換後イベント集合データベースＤＢ９は、数値データを数値データカテゴリ化手段８によってカテゴリ化することで付与するラベルが記録されたテーブルと、カテゴリデータのラベルが記録されたテーブルからなり、それぞれ被観測体ＩＤとともに記録されている。このうち、カテゴリ化された数値データに付与するラベルが記録されたテーブルは、図１２に示すように、複数のレコードからなり、１つのレコードには、被観測体ＩＤ、データ名、データの種類、観測時間、ラベル名が記録されている。図１２においては、例えば、被観測体００１が２００７/０８/０１の１５時３０分００秒に「Ａ商品エリア滞在」したことを示している。同様に、被観測体００１は２００７/０８/０１の１５時４０分００秒に「Ｂ商品エリア通過」し、２００７/０８/０１の１５時４５分００秒に「Ｃ商品エリア滞在」したことを示している。一方、カテゴリデータのラベルが記録されたテーブルは、例えば図２Ｂと同様である。なお、これらのテーブルは、例えば被観測者毎に複数あってもよいし、これらをまとめて１つのテーブルに記録されていてもよい。また、テーブルに記録されたレコードは、単数であってもよいし、本実施形態に示すように、複数あってもよい。 The post-conversion event set database DB9 consists of a table recording the labels assigned by categorizing numerical data with the numerical data categorizing means 8 and a table recording the labels of category data, each recorded together with the observed-object ID. Of these, the table recording the labels assigned to categorized numerical data consists of a plurality of records, as shown in FIG. 12, and each record stores the observed-object ID, data name, data type, observation time, and label name. FIG. 12 shows, for example, that observed object 001 made an "A product area stay" at 15:30:00 on 2007/08/01. Similarly, observed object 001 made a "B product area pass" at 15:40:00 on 2007/08/01 and a "C product area stay" at 15:45:00 on 2007/08/01. The table recording the labels of category data is, for example, the same as in FIG. 2B. There may be a plurality of these tables, for example one per observed person, or they may be recorded together in one table. The table may contain a single record or, as in this embodiment, a plurality of records.

次に、相関ルール抽出パラメータデータベースＤＢ１０に記録されるデータついて、図１３を参照して説明する。 Next, data recorded in the correlation rule extraction parameter database DB10 will be described with reference to FIG.

図１３に示すように、相関ルール抽出パラメータデータベースＤＢ１０に記録されるテーブルは、複数のレコードからなり、１つのレコードには、後述する相関ルール抽出手段１１で相関ルールを抽出するために必要なパラメータの１つが記録されている。図１３においては、相関ルール抽出手段１１で使用するパラメータの一例として、相関ルールとして抽出されるのに満たさなければならない条件を表す最小支持度（Ｓｕｐ）が０．２であり、最小確信度（Ｃｏｎｆ）が０．６であることを示している。この最小支持度及び最小確信度は、ユーザによって指定されるものである。ここで支持度は（１８式）、確信度は（１９式）をそれぞれ用いて算出されるものである。 As shown in FIG. 13, the table recorded in the correlation rule extraction parameter database DB10 is composed of a plurality of records, and parameters necessary for extracting correlation rules by the correlation rule extraction means 11 described later are included in one record. One of these is recorded. In FIG. 13, as an example of parameters used in the correlation rule extraction unit 11, the minimum support level (Sup) representing a condition that must be satisfied to be extracted as a correlation rule is 0.2, and the minimum certainty factor (Sup) Conf) is 0.6. The minimum support level and the minimum confidence level are specified by the user. Here, the support level is calculated using (Equation 18) and the certainty factor is calculated using (Equation 19).

支持度Ｓ（Ｘ∧Ｙ）＝Ｍ（Ｘ∧Ｙ）／Ｍ (１８式)
Ｍ（Ｘ∧Ｙ）：アイテム集合「ＸとＹ」を含むイベント（被観測体）数
Ｍ：全イベント（被観測体）数
確信度Ｃ（Ｘ∧Ｙ）＝Ｍ（Ｘ∧Ｙ）／Ｍ（Ｘ） (１９式)
Ｍ（Ｘ）：アイテム集合「Ｘ」を含むイベント（被観測体）数
Support S(X∧Y) = M(X∧Y) / M (Expression 18)
M(X∧Y): number of events (observed objects) containing the item set "X and Y"
M: total number of events (observed objects)
Confidence C(X∧Y) = M(X∧Y) / M(X) (Expression 19)
M(X): number of events (observed objects) containing the item set "X"
上述の相関ルール抽出パラメータデータベースＤＢ１０には、最小支持度及び最小確信度が記録されており、例えばアイテム集合Ｘが「Ａ商品エリア滞在」と「Ｂ商品エリア滞在」であり、アイテム集合Ｙが「Ｉｔｅｍ０１購入」である場合、「Ａ商品エリア滞在」と「Ｂ商品エリア滞在」と「Ｉｔｅｍ０１購入」の３つのアイテムを含む被観測体が全ての被観測体の２０％以上であり、かつ「「Ａ商品滞在」かつ「Ｂ商品滞在」」を含む被観測体の６０％以上が「Ｉｔｅｍ０１購入」を含んでいる場合、「「Ａ商品滞在」かつ「Ｂ商品滞在」ならば「Ｉｔｅｍ０１購入」である」は、相関ルールとして抽出される。なお、このテーブルに含まれるレコードは、例えばパラメータが１つである場合には、単数であってもよい。 The correlation rule extraction parameter database DB10 described above records the minimum support and the minimum confidence. For example, suppose the item set X is "A product area stay" and "B product area stay", and the item set Y is "Item01 purchase". If the observed objects containing all three items "A product area stay", "B product area stay", and "Item01 purchase" make up 20% or more of all observed objects, and 60% or more of the observed objects containing both "A product area stay" and "B product area stay" also contain "Item01 purchase", then the rule "if 'A product area stay' and 'B product area stay', then 'Item01 purchase'" is extracted as a correlation rule. Note that this table may contain a single record when, for example, there is only one parameter.
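As a worked illustration of (Expression 18) and (Expression 19), the following sketch counts events directly. The function name `support_confidence` and the representation of each observed object as a set of item labels are assumptions made for the example only.

```python
def support_confidence(events, x_items, y_items):
    """Return (support, confidence) of the rule X -> Y.

    events: dict mapping observed-object ID -> set of item labels
    """
    x = set(x_items)
    xy = x | set(y_items)
    m = len(events)                                             # M
    m_x = sum(1 for items in events.values() if x <= items)     # M(X)
    m_xy = sum(1 for items in events.values() if xy <= items)   # M(X and Y)
    support = m_xy / m if m else 0.0                            # Expression 18
    confidence = m_xy / m_x if m_x else 0.0                     # Expression 19
    return support, confidence


events = {
    "001": {"A product area stay", "B product area stay", "Item01 purchase"},
    "002": {"A product area stay", "B product area stay", "Item01 purchase"},
    "003": {"A product area stay", "B product area stay"},
    "004": {"A product area stay"},
    "005": {"Item01 purchase"},
}
s, c = support_confidence(
    events, ["A product area stay", "B product area stay"], ["Item01 purchase"]
)
```

In this made-up data set, two of five observed objects contain all three items and three contain both stay labels, so the support is 2/5 and the confidence is 2/3. Both exceed the minimum values of FIG. 13 (0.2 and 0.6), so the rule would be extracted.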

次に、相関ルール抽出手段１１は、変換後イベント集合データベースＤＢ９に記録されている被観測体のイベントに対して、相関ルール抽出パラメータデータベースＤＢ１０に記録されているパラメータを用いて相関ルールを抽出し、相関ルールデータベースＤＢ１２に記録する手段である。以下、この相関ルール抽出手段１１を、図１４を参照して説明する。 Next, the correlation rule extracting means 11 extracts a correlation rule for the observed event recorded in the post-conversion event set database DB 9 using the parameters recorded in the correlation rule extraction parameter database DB 10. , Means for recording in the correlation rule database DB12. Hereinafter, the correlation rule extracting unit 11 will be described with reference to FIG.

図１４は、相関ルール抽出手段１１の処理手順を示すフローチャートを示す。 FIG. 14 is a flowchart showing the processing procedure of the correlation rule extraction means 11.

図１４に示すように、相関ルール抽出手段１１は、まず、変換後イベント集合データベースＤＢ９を参照し、シーケンス長ｋの相関ルールの候補集合を生成する（Ｓ３０１）。ここでシーケンス長とは、相関ルールに含まれるアイテム数をいう。このｋの初期値は1であり、相関ルールの候補の生成方法は、ｋ＝１とｋ＞１とでは異なる。 As shown in FIG. 14, the correlation rule extraction means 11 first refers to the post-conversion event set database DB9 and generates a candidate set of correlation rules of sequence length k (S301). Here, the sequence length is the number of items included in a correlation rule. The initial value of k is 1, and the method of generating correlation rule candidates differs between k = 1 and k > 1.

ｋ＝１の場合は、全イベントに含まれるアイテムを候補とする。一方、ｋ＞１の場合、シーケンス長がｋ−１の相関ルールとして抽出された相関ルールの中で、ｋ−２個のアイテムが共通する相関ルールを組み合わせて候補を生成する。例えば、シーケンス長3の相関ルールとして、「「商品Ａエリア滞在」、「商品Ｂエリア滞在」、「Ｉｔｅｍ０１購入」」と「「商品Ｂエリア滞在」、「商品Ｃエリア滞在」、「Ｉｔｅｍ１１購入」」と「「商品Ｂエリア滞在」、「商品Ｃエリア滞在」、「Ｉｔｅｍ０１購入」」の３つの相関ルールが存在する場合、シーケンス長4の相関ルールの候補は、「「商品Ａエリア滞在」、「商品Ｂエリア滞在」、「商品Ｃエリア滞在」、「Ｉｔｅｍ０１購入」」と「「商品Ｂエリア滞在」、「商品Ｃエリア滞在」、「Ｉｔｅｍ０１購入」、「Ｉｔｅｍ１１購入」」となる。 When k = 1, the items contained in all events are the candidates. When k > 1, candidates are generated by combining those correlation rules of sequence length k−1 that have k−2 items in common. For example, suppose three correlation rules of sequence length 3 exist: {"Product A area stay", "Product B area stay", "Item01 purchase"}, {"Product B area stay", "Product C area stay", "Item11 purchase"}, and {"Product B area stay", "Product C area stay", "Item01 purchase"}. The candidates for correlation rules of sequence length 4 are then {"Product A area stay", "Product B area stay", "Product C area stay", "Item01 purchase"} and {"Product B area stay", "Product C area stay", "Item01 purchase", "Item11 purchase"}.
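The candidate generation for k > 1 can be sketched as a pairwise join of the rules of length k−1. The function name `generate_candidates` and the use of frozensets to represent rules are assumptions for illustration; item order is ignored, as in the example above.

```python
from itertools import combinations


def generate_candidates(rules_prev):
    """Join rules of length k-1 that share k-2 items into candidates of length k.

    rules_prev: iterable of frozensets, all of the same length k-1
    """
    candidates = set()
    for a, b in combinations(rules_prev, 2):
        union = a | b
        # Two (k-1)-item sets share exactly k-2 items when their union has k items
        if len(union) == len(a) + 1:
            candidates.add(union)
    return candidates


rules_3 = [
    frozenset({"Product A area stay", "Product B area stay", "Item01 purchase"}),
    frozenset({"Product B area stay", "Product C area stay", "Item11 purchase"}),
    frozenset({"Product B area stay", "Product C area stay", "Item01 purchase"}),
]
candidates_4 = generate_candidates(rules_3)
```

Applied to the three rules of the example, the join yields the two candidates of sequence length 4 given in the text; the first and second rules share only one item and therefore produce no candidate.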

次に、Ｓ３０１で生成した相関ルール候補集合に含まれる相関ルール候補の数を数え、その数が０より大きいか否かを判定する（Ｓ３０２）。０の場合は相関ルール抽出手段１１を終了する。 Next, the number of correlation rule candidates included in the correlation rule candidate set generated in S301 is counted, and it is determined whether or not the number is greater than 0 (S302). In the case of 0, the correlation rule extracting means 11 is terminated.

一方、Ｓ３０１で生成した相関ルール候補集合に含まれる相関ルール候補の数が０より大きい場合、変換後イベント集合データベースＤＢ９を参照し、Ｓ３０１で生成された各相関ルール候補が変換後イベント集合データベースＤＢ９に含まれるか否かを調べる。そして、生成された相関ルールを含むイベント（被観測体）数を数え、支持度と確信度を算出する（Ｓ３０３）。 On the other hand, if the number of correlation rule candidates in the candidate set generated in S301 is greater than 0, the post-conversion event set database DB9 is referenced to check whether each correlation rule candidate generated in S301 is contained in the post-conversion event set database DB9. The number of events (observed objects) containing each candidate is then counted, and the support and confidence are calculated (S303).

次に、相関ルール抽出パラメータデータベースＤＢ１０を参照し、Ｓ３０３で算出した相関ルール候補の支持度が最小支持度以上であり、相関ルール候補の確信度が最小確信度以上であれば、この相関ルール候補を相関ルールとして、後述の相関ルールデータベースＤＢ１２に記録する（Ｓ３０４）。 Next, referring to the correlation rule extraction parameter database DB10, if the support level of the correlation rule candidate calculated in S303 is greater than or equal to the minimum support level and the confidence level of the correlation rule candidate is greater than or equal to the minimum confidence level, this correlation rule candidate Is recorded in the correlation rule database DB12 described later as a correlation rule (S304).

次に、Ｓ３０４で記録されたシーケンス長ｋの相関ルールの数を数え、その数が０より大きいか否かを判定する（Ｓ３０５）。０の場合は相関ルール抽出手段１１を終了する。 Next, the number of correlation rules of sequence length k recorded in S304 is counted, and it is determined whether that number is greater than 0 (S305). If it is 0, the correlation rule extraction means 11 terminates.

一方、Ｓ３０４で記録されたシーケンス長ｋの相関ルールが０より大きい場合、ｋ＝ｋ＋１として、Ｓ３０１に戻る。 On the other hand, if the number of correlation rules of sequence length k recorded in S304 is greater than 0, k = k + 1 is set and the process returns to S301.

以上のような手順により、相関ルールを生成することができる。なお、上述の例においては、アイテム間の順序を考慮していないが、考慮してもよい。 Correlation rules can be generated by the above procedure. In the above example the order of items is not taken into account, but it may be.
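The level-wise loop of S301 to S305 can be sketched as follows. This sketch is deliberately simplified and is not the procedure of FIG. 14 in full: it filters candidates by minimum support only, because this part of the description does not state how a candidate item set is split into a condition part and a conclusion part for the confidence calculation. The name `frequent_itemsets` and the list-of-sets event representation are assumptions.

```python
from itertools import combinations


def frequent_itemsets(events, min_support):
    """Return {itemset: support} for every itemset meeting min_support.

    events: list of sets of item labels, one set per observed object
    """
    m = len(events)

    def support(itemset):
        return sum(1 for items in events if itemset <= items) / m

    # k = 1: every item occurring in any event is a candidate (S301)
    candidates = {frozenset([item]) for items in events for item in items}
    result = {}
    while candidates:                                        # S302
        scored = {c: support(c) for c in candidates}         # S303
        kept = {c: s for c, s in scored.items() if s >= min_support}  # S304
        if not kept:                                         # S305
            break
        result.update(kept)
        # k = k + 1: join kept sets that differ in exactly one item (S301)
        candidates = {
            a | b for a, b in combinations(kept, 2) if len(a | b) == len(a) + 1
        }
    return result
```

A confidence filter could be added inside the loop once the condition and conclusion parts of each candidate are defined, for example by treating purchase items as conclusions; that choice is left open here.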

最後に、相関ルールデータベースＤＢ１２に記録されるデータついて、図１５を参照して説明する。 Finally, data recorded in the correlation rule database DB12 will be described with reference to FIG.

As shown in FIG. 15, the correlation rule database DB12 has a table in which the extracted correlation rules are recorded. The table consists of multiple records, each holding one correlation rule extracted by the correlation rule extraction means 11. In FIG. 15, for example, the first record holds the correlation rule "if 'A product area stay' (condition part), then 'Item01 purchase' (conclusion part)" and shows that this rule has a support of 0.5 and a confidence of 0.7. Similarly, the second record holds the correlation rule "if 'A product area stay' and 'B product area stay' (condition part), then 'Item10 purchase' (conclusion part)" with a support of 0.3 and a confidence of 0.8, and the third record holds the correlation rule "if 'A product area stay' and 'C product area stay' (condition part), then 'Item20 purchase' (conclusion part)" with a support of 0.4 and a confidence of 0.7. The table may contain a single record, for example when only one correlation rule has been extracted.

Finally, the correlation rules generated as described above and recorded in the correlation rule database DB12 are displayed on the correlation rule display device 13. The rules may be reordered according to their sequence length, support, or confidence, or only the rules that contain a specific item set may be extracted and displayed; the display may be arranged freely to suit the purpose. The correlation rule display device 13 is, for example, an ordinary display device and is the same kind of device as the threshold display device 7 of the first embodiment. The correlation rule display device 13 and the threshold display device 7 may be any devices that can visually present the correlation rules or the optimized thresholds to the user.
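The reordering and filtering described for the display can be sketched as follows. The record layout (a dictionary with `condition`, `conclusion`, `support`, and `confidence` keys) and the helper names `order_rules` and `rules_containing` are hypothetical; only the three example rows are taken from the description of FIG. 15.

```python
# Records mirroring the three example rows described for FIG. 15.
rules = [
    {"condition": ("A product area stay",),
     "conclusion": "Item01 purchase", "support": 0.5, "confidence": 0.7},
    {"condition": ("A product area stay", "B product area stay"),
     "conclusion": "Item10 purchase", "support": 0.3, "confidence": 0.8},
    {"condition": ("A product area stay", "C product area stay"),
     "conclusion": "Item20 purchase", "support": 0.4, "confidence": 0.7},
]

def order_rules(rules, key="confidence"):
    """Return the rules sorted by the chosen measure, highest first."""
    return sorted(rules, key=lambda r: r[key], reverse=True)

def rules_containing(rules, items):
    """Keep only rules whose condition and conclusion together contain every given item."""
    wanted = set(items)
    return [r for r in rules
            if wanted <= set(r["condition"]) | {r["conclusion"]}]
```

Sorting by `"support"` instead of `"confidence"` puts the first record on top, and `rules_containing(rules, ["B product area stay"])` keeps only the second record.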

As described above, the information extraction device according to the present embodiment can automatically determine the values of the threshold variables so that the prior rules appear appropriately. That is, even when the user cannot decide on a threshold, the values of the threshold variables can be determined easily and appropriately. Because the numerical data are categorized using the threshold values obtained in this way, useful correlation rules that the user could not have known can be extracted easily and efficiently.
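As a rough illustration of choosing a threshold so that a prior rule appears as strongly as possible, the sketch below tries each candidate threshold, categorizes the numerical data with it, and keeps the value with the highest co-occurrence probability. It is a simplified stand-in, not the procedure of the threshold optimization means 5. The function name `best_threshold`, the use of dwell time as the numerical data, the list of candidate values standing in for the constraints, and the definition of the co-occurrence probability as the share of subjects with the stay label who also carry the purchase label are all assumptions.

```python
def best_threshold(dwell_times, purchased, candidates):
    """Pick the candidate threshold giving the highest co-occurrence probability.

    dwell_times: seconds each observed subject spent in an area (hypothetical data)
    purchased:   True where the subject also carries the purchase label
    candidates:  threshold values assumed to satisfy the constraints already
    """
    best_value, best_prob = None, -1.0
    for t in candidates:
        # Categorize: a subject receives the "stay" label when dwell time >= t
        stayed = [d >= t for d in dwell_times]
        n_stayed = sum(stayed)
        if n_stayed == 0:
            continue  # the label never occurs, so the rule cannot appear
        both = sum(1 for s, p in zip(stayed, purchased) if s and p)
        prob = both / n_stayed  # assumed definition: P(purchase | stay)
        if prob > best_prob:
            best_value, best_prob = t, prob
    return best_value, best_prob
```

For dwell times `[5, 12, 30, 45, 60]`, purchase flags `[False, False, True, True, False]`, and candidates `[10, 20, 40]`, the candidate 20 wins because two of the three subjects who stayed at least 20 seconds also made the purchase.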

Extracting correlation rules that the user could not have known in this way makes the invention applicable mainly to marketing, for example to deciding where to display an advertisement for a given product.

Although an embodiment of the present invention has been described above, the invention is not limited to this embodiment and can be applied in various ways.

For example, by treating the movement trajectory of a pointer on a web page as the numerical data and the locations clicked on the web page as the category data, the values of the threshold variables can be determined automatically so that the corresponding prior rules appear appropriately. That is, even when the user cannot decide on a threshold, the values of the threshold variables can be determined easily and appropriately. Because the numerical data are categorized using the threshold values obtained in this way, useful correlation rules that the user could not have known can again be extracted easily and efficiently.

A block diagram schematically showing the configuration of the preprocessing device using prior rules according to an embodiment of the present invention.
A diagram showing a table, included in the event set database, that records data indicating body positions.
A diagram showing a table, included in the event set database, that records purchase data.
A diagram showing a table, included in the threshold/constraint database, that records the threshold variables.
A diagram showing a table, included in the threshold/constraint database, that records numerical data names, the labels assigned after categorization, and the numerical data conditions that correspond to each label.
A diagram showing a table, included in the threshold/constraint database, that records the constraints on the threshold variables.
A diagram showing a table included in the prior knowledge rule database.
A diagram showing a product master.
A diagram showing a table included in the optimization parameter database.
A diagram showing a table included in the threshold parameter database.
A flowchart showing the procedure for initializing the thresholds in the threshold optimization means.
A flowchart showing the procedure for optimizing the thresholds in the threshold optimization means.
A diagram showing a display example of the threshold display device.
A block diagram showing a correlation rule extraction device according to an embodiment of the present invention.
A diagram showing a table, included in the post-conversion event set database, that records the result of categorizing the numerical data indicating body positions.
A diagram showing a table included in the correlation rule extraction parameter database.
A flowchart showing the procedure for extracting correlation rules in the correlation rule extraction means.
A diagram showing a table included in the correlation rule database.

Explanation of symbols

DB1: (first) event set database; DB2: threshold/constraint database; DB3: prior knowledge rule database; DB4: optimization parameter database; DB6: threshold parameter database; DB7: second event set database; DB9: post-conversion event set database; DB10: correlation rule extraction parameter database; DB12: correlation rule database; 5: threshold optimization means; 7: threshold display device; 8: numerical data categorizing means; 11: correlation rule extraction means; 13: correlation rule display device; 20: product area; 21: product shelf.

Claims

A preprocessing device for an information extraction device, comprising:
an event set database that records trajectory data and category data having the same ID, the category data being categorized and each labeled with a first label;
a threshold/constraint database that records conditions and constraints of threshold variables needed to categorize the trajectory data recorded in the event set database, and a second label attached to the trajectory data categorized on the basis of these conditions and constraints;
a prior knowledge rule database in which a combination of the first and second labels that co-occur with a predetermined probability is recorded as a prior rule together with the probability;
threshold optimization means for calculating, under the constraints, the value of the threshold variable so that all the prior rules recorded in advance in the prior knowledge rule database are extracted from the event set database;
a threshold database in which the values of the threshold variables calculated by the threshold optimization means are recorded; and
a display device for displaying the values of the threshold variables recorded in the threshold database.

The preprocessing device for an information extraction device according to claim 1, wherein the prior rule recorded in the prior knowledge rule database is a correlation rule automatically generated using a product master in which categories of the category data are recorded hierarchically according to the abstraction level of classification.

An information extraction device, comprising:
the preprocessing device for an information extraction device according to claim 1 or 2;
numerical data categorizing means for converting the trajectory data included in the event set database into category data using the calculated values of the threshold variables;
a post-conversion event set database that records, as correlation rule candidates, combinations of the first labels and the second labels attached to the trajectory data categorized by the numerical data categorizing means;
correlation rule extraction means for extracting correlation rules, using correlation rule extraction parameters, from the correlation rule candidates recorded in the post-conversion event set database; and
a correlation rule database that records the correlation rules extracted by the correlation rule extraction means,
wherein the display device is a display device for displaying the correlation rules.

The information extraction device according to claim 3, wherein the correlation rule extraction parameters are a support and a confidence,
and the correlation rule extraction means extracts, as correlation rules, those correlation rule candidates whose support and confidence are at least certain values.

A preprocessing method for an information extraction device, comprising:
recording, in an event set database, trajectory data and category data having the same ID, the category data being categorized and each labeled with a first label;
recording, in a threshold/constraint database, conditions and constraints of threshold variables needed to categorize the trajectory data recorded in the event set database, and a second label attached to the trajectory data categorized on the basis of these conditions and constraints;
recording, in a prior knowledge rule database, a combination of the first and second labels that co-occur with a predetermined probability as a prior rule together with the probability;
extracting the prior rule from the prior knowledge rule database;
extracting, from the event set database, all the trajectory data of the IDs that include the first label of the extracted prior rule;
extracting the conditions and constraints from the threshold/constraint database; and
calculating the value of the threshold variable so that, under the extracted constraints, the prior rule appears with the highest probability.

The preprocessing method for an information extraction device according to claim 5, wherein calculating the value of the threshold variable comprises:
calculating, under the constraints, an initial value of the threshold variable from the extracted trajectory data;
categorizing the trajectory data using the calculated initial value;
calculating a co-occurrence probability between the second label attached to the categorized trajectory data and the first label; and
taking, from among the calculated co-occurrence probability and the co-occurrence probabilities newly obtained by changing at least one of the initial values, the value of the threshold variable at which the highest co-occurrence probability is obtained.

The preprocessing method for an information extraction device according to claim 5 or 6, wherein the prior rule recorded in the prior knowledge rule database is a rule automatically generated using a product master in which categories are recorded hierarchically according to the abstraction level of classification.

An information extraction method, comprising:
categorizing the trajectory data recorded in the event set database using the value of the threshold variable optimized by the preprocessing method for an information extraction device according to any one of claims 5 to 7;
recording, in a post-conversion event set database, combinations of the first label and the second label attached to the categorized trajectory data as correlation rule candidates; and
extracting, as correlation rules, those of the correlation rule candidates recorded in the post-conversion event set database that are selected using correlation rule extraction parameters.

The information extraction method according to claim 8, wherein the correlation rule extraction parameters are a support and a confidence,
and those correlation rule candidates whose support and confidence are at least certain values are extracted as correlation rules.