JP2005242739A

JP2005242739A - Data base analysis method by genetic programming and device therefor

Info

Publication number: JP2005242739A
Application number: JP2004052779A
Authority: JP
Inventors: Akira Hara (原 章); Takumi Ichimura (市村 匠)
Original assignee: Hiroshima Industrial Promotion Organization
Current assignee: Hiroshima Industrial Promotion Organization
Priority date: 2004-02-27
Filing date: 2004-02-27
Publication date: 2005-09-08

Abstract

PROBLEM TO BE SOLVED: Conventional rule extraction by genetic programming extracts only one rule from the target database, so it cannot extract a plurality of rules from a database in which instances satisfying different rules are mixed and the whole set of instances cannot be explained by a single rule.

SOLUTION: A database analysis method by genetic programming, and a device therefor, appropriately divide the set of instances by evolutionary optimization and extract a rule from each resulting partial instance set, thereby extracting a plurality of IF_THEN rules that describe the input-output relations of the instances in the database. Each extracted rule is given a priority determined from the proportion of instances in the database that satisfy the rule and from the results of inference using the IF_THEN rules. Consequently, an instance classification system with higher classification accuracy than conventional ones can be constructed.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a knowledge acquisition method that uses genetic programming, a technique for evolutionarily optimizing tree-structured programs, to acquire a plurality of IF_THEN rules satisfied by the cases in a database together with the priorities of those rules. It also relates to a scheme for constructing a system that uses the prioritized rules to predict the output of a case whose output is unknown.

Genetic programming is one of the methods for extracting rules that represent the input/output relationships of cases in a database. It is a kind of evolutionary computation, a family of optimization algorithms inspired by biological evolution, and was proposed by Koza (see Non-Patent Document 1).

Genetic programming is an optimization method for tree-structured programs. First, so that a solution to the problem can be expressed as a tree-structured program, one defines the function symbols, which serve as the internal nodes of the tree, and the terminal symbols, which serve as its leaves. For example, to express an input/output relationship as a mathematical expression, the four arithmetic operators can be used as function symbols, and variables and constants as terminal symbols. FIG. 2 shows an example of a tree-structured program with {+, -, ×} as function symbols and {x, 1} as terminal symbols; this program represents the expression (1+x)(x-1). A tree-structured program that represents a solution to the problem is called an individual in genetic programming.
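As an illustration of the tree representation described above, the following minimal Python sketch encodes the expression (1+x)(x-1) as a tree and evaluates it. The nested-tuple encoding and the names `FUNCTIONS`, `evaluate`, and `TREE` are assumptions made for this example; the exact shape of the tree in FIG. 2 is not reproduced here.

```python
import operator

# Function symbols {+, -, x} mapped to their implementations (assumed encoding).
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(node, x):
    """Evaluate a tree program for a given value of the variable x.

    A node is either a terminal ("x" or a numeric constant) or a tuple
    (function_symbol, left_child, right_child).
    """
    if node == "x":
        return x
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))


# One possible tree for (1 + x)(x - 1): root "*", children "+" and "-".
TREE = ("*", ("+", 1, "x"), ("-", "x", 1))

print(evaluate(TREE, 3))  # (1 + 3) * (3 - 1)
```

Internal nodes hold function symbols and leaves hold terminal symbols, which is the property that crossover and mutation rely on when they replace subtrees or individual symbols.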

Genetic programming generates a plurality of individuals and evolves the population according to the biological principle of survival of the fittest. FIG. 3 shows the flow of this optimization. First, a plurality of tree-structured programs are generated by randomly combining function symbols and terminal symbols (31). These form the population of the initial (first) generation. Next, the solution represented by each individual is applied to the problem and its performance is evaluated (32). Each individual is given an evaluation value, called its fitness, that indicates how good it is as a solution to the problem. Based on this fitness, the individuals that will be parents of the next generation are selected (33); individuals with higher fitness are more likely to be selected. The selected parents are grouped into pairs, and for each pair, with a probability called the crossover rate, parts of the tree-structured programs are exchanged between the two individuals (34). This operation is called crossover. In crossover, as shown in FIG. 4, one node is selected at random in each of the two individuals, and the subtrees rooted at those nodes are exchanged. In addition, for each individual, with a probability called the mutation rate, a symbol in the tree-structured program is forcibly changed to another symbol (35). This operation is called mutation. Fitness-based selection, crossover, and mutation are collectively called genetic operations.
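The subtree crossover described above can be sketched as follows. This is a minimal illustration, assuming trees are nested Python tuples of the form (symbol, child, child) with bare terminals as leaves; the helper names `paths`, `get`, `put`, `crossover`, and `size` are hypothetical and do not come from the patent.

```python
import random


def paths(node, prefix=()):
    """Yield the path (tuple of child indices) to every node of a tree."""
    yield prefix
    if isinstance(node, tuple):
        for i, child in enumerate(node[1:], start=1):
            yield from paths(child, prefix + (i,))


def get(node, path):
    """Return the subtree found by following path from node."""
    for i in path:
        node = node[i]
    return node


def put(node, path, new):
    """Return a copy of node with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return node[:i] + (put(node[i], path[1:], new),) + node[i + 1:]


def crossover(a, b, rng):
    """Pick one random node in each parent and exchange the subtrees below."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return put(a, pa, get(b, pb)), put(b, pb, get(a, pa))


def size(node):
    """Number of nodes in a tree."""
    return sum(1 for _ in paths(node))


parent_a = ("*", ("+", 1, "x"), ("-", "x", 1))
parent_b = ("+", "x", ("*", "x", "x"))
child_a, child_b = crossover(parent_a, parent_b, random.Random(1))
print(child_a, child_b)
```

Because the two subtrees are only swapped, the total number of nodes across the two children equals that of the two parents, which is a convenient sanity check on any crossover implementation.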

The operations above produce the population of the next generation. The fitness of each individual is then evaluated again (36). Operations (33) to (36) are repeated until the fitness of the fittest individual in the population (the best individual) exceeds a preset termination criterion. In this way the set of tree-structured programs evolves, and solutions better suited to the problem are obtained.

Consider a system in which the relationship between input and output signals is unknown. If terminal symbols and function symbols are defined so that a rule representing the input/output relationship can be expressed as a tree structure, and genetic programming is applied with known cases as training cases, then the rule represented by the tree-structured program of the best individual obtained by the optimization appropriately describes the input/output relationship of the cases. Applying the acquired rule to a case whose output is unknown makes it possible to predict that output. Here, a known case is expressed as a pair consisting of a feature vector, which is the input to a genetic programming individual, and the corresponding output signal.

JP 2003-317083 A describes a system that uses genetic programming to extract classification rules from a set of training cases in which each input feature vector X = (X1, X2, ..., Xn) is classified into one of m predetermined classification results Y1, Y2, ..., Ym. In the following, a set of cases that share the same classification result among the predetermined types is called a class. In that prior invention, to extract a rule that decides whether a case belongs to class Yi, the teaching signal is set so that only cases classified into class Yi return the positive value 1 and cases classified into any other class return the negative value -1. The sum over all training cases of the squared error between the rule's output value and the teaching signal is used as the fitness evaluation formula, and genetic programming is applied so as to make this value small. This optimization is run m times independently, once for each class i = 1, ..., m, to extract a rule that only the cases of each class satisfy. The resulting m rules are used to build a classification system for cases whose output is unknown. To classify such a case, its feature vector is applied to the rule for each class, and the case is assigned to the class indicated by the rule that returns the value closest to 1.
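The prior-art scheme described above, one rule per class trained against teaching signals of +1 and -1, can be sketched as follows. This is only an illustration under stated assumptions: rules are modelled as plain Python callables rather than evolved trees, and the names `squared_error` and `classify_closest_to_one` are hypothetical.

```python
def squared_error(rule, cases, target_class):
    """Sum of squared errors between rule outputs and teaching signals.

    cases is a list of (feature_vector, class_label) pairs. The teaching
    signal is +1 for cases of target_class and -1 for all other cases.
    A smaller value means a better rule for target_class.
    """
    total = 0.0
    for features, label in cases:
        teach = 1.0 if label == target_class else -1.0
        total += (rule(features) - teach) ** 2
    return total


def classify_closest_to_one(rules, features):
    """Assign the class whose rule returns the value closest to 1.

    rules maps each class label to its single extracted rule.
    """
    return min(rules, key=lambda c: abs(rules[c](features) - 1.0))


# Toy data and hand-written stand-ins for evolved rules (assumptions).
cases = [((0.0,), "alpha"), ((1.0,), "beta")]
rules = {
    "alpha": lambda f: 1.0 - 2.0 * f[0],
    "beta": lambda f: 2.0 * f[0] - 1.0,
}
print(squared_error(rules["alpha"], cases, "alpha"))
print(classify_closest_to_one(rules, (0.9,)))
```

Note that each class has exactly one rule in this scheme, which is the limitation the present invention addresses.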

The embodiment of the above prior invention treats, as a concrete example, a classification problem with three classes α, β, and γ. As shown in FIG. 5, the relationship between each of the classes α, β, and γ and the teaching signals used to obtain rules A, B, and C, each satisfied only by the cases of its class, is defined, and teacher data (pairs of an input feature vector and a teaching signal) are prepared. Rule extraction by genetic programming is performed independently for each target class, and finally one rule is determined for each class (three rules A to C in total).

Classification becomes possible by using the three rules A to C finally determined in this way. As shown in FIG. 6, a case is judged to be class α if the output value of rule A is the largest, class β if the output value of rule B is the largest, and class γ if the output value of rule C is the largest.

The problem with the conventional method described above is that only a single rule can be extracted as the rule satisfied by the cases of a given class. Even among cases classified into the same class, there may be exceptional cases that do not satisfy the typical rule of that class, so a rule valid for all cases of the class cannot always be written as a single rule. In such a situation, a plurality of rules must be prepared for one class, and a case must be judged to belong to the class when any one of them holds. Because the conventional method cannot do this, its classification accuracy is reduced.

Patent document: JP 2003-317083 A
Non-Patent Document 1: Koza, J., Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, 1992

Conventional methods for extracting rules from a database by genetic programming extract only the one rule that minimizes the total error over all training cases. Cases that do not fit that rule are ignored as noise or exceptional data. However, it is important to acquire not only general rules that apply to the majority of cases but also rules that can be used for the exceptional minority. When a plurality of rules is acquired, an index of how general each rule is for the cases of its class also becomes important. The object of the present invention is therefore to extract a plurality of rules satisfied by the cases in a database, together with the priority of each rule. A further object is to construct a highly accurate classification system using such a plurality of rules.

The invention provides a database analysis device for a database that stores cases, each of which is classified according to its output signal into one of a plurality of classification candidates prepared in advance, or into two or more of them where overlap is allowed. The device comprises: means for expressing, as an IF_THEN rule, the input/output relationship of cases having the same classification result, using a tree structure that combines terminal nodes, which represent the input signals making up a case and the values those signals can take, with function nodes, which represent magnitude comparisons and logical conjunctions of those values; means for automatically dividing the set of cases having any one classification result in the database into a plurality of partial case sets, each satisfying a single input/output relationship; means for extracting an IF_THEN rule that appropriately expresses the input/output relationship satisfied by each partial case set, by repeatedly generating a new population of tree structures from a plurality of generated tree structures through selection based on a fitness that indicates case classification accuracy, exchange of subtrees, and alteration of nodes; and means for assigning to the IF_THEN rules extracted from the partial case sets priorities used to decide which IF_THEN rule to adopt when more than one of them holds.

For the means recited in claim 1 (dividing a case set into a plurality of partial case sets each satisfying a single input/output relationship, extracting an IF_THEN rule that appropriately expresses the input/output relationship of each partial case set, and assigning priorities used to decide which IF_THEN rule to adopt when more than one holds), the invention provides a database analysis method based on the following optimization technique. A program that holds an IF_THEN rule satisfied by the cases of an arbitrary subset of a case set having the same classification result, and that returns an output according to that rule, is called an agent. A set of agents holding the same IF_THEN rule is called a group. The technique automatically searches for the number of groups formed by the agents, the number of agents belonging to each group, and the IF_THEN rule held by each agent, such that every case in the case set having one classification result is expressed by the rule of at least one agent.

For the priority-assigning means recited in claim 1, the invention provides a database analysis method in which the number of agents of claim 2 holding the same IF_THEN rule represents the priority of that rule. The fitness is defined so that the priority of an IF_THEN rule for a case set having a given classification result rises as the proportion of cases in that case set matching the rule increases, and falls as the proportion of cases belonging to other classification results for which the rule also holds increases.

The invention further provides the database analysis device of claim 1 for a database that stores cases, each of which is classified according to its output signal into one of a plurality of classification candidates prepared in advance, or into two or more of them where overlap is allowed. Here a set of cases having the same classification result is called a class. The device uses the means for automatically dividing any one class into a plurality of partial case sets each satisfying a single input/output relationship, the means for extracting an IF_THEN rule for each partial case set by repeated fitness-based selection, subtree exchange, and node alteration over a population of tree structures, and the means for assigning priorities used to decide which IF_THEN rule to adopt when more than one holds. It comprises means for extracting the IF_THEN rules for all classes in the database by repeating, for every class in the database, the operation of automatically acquiring IF_THEN rules that only the target class satisfies.

The invention further provides a database analysis device for a database that stores cases, each of which is classified according to its output signal into one of a plurality of classification candidates prepared in advance, or into two or more of them where overlap is allowed. Using the set of IF_THEN rules for all classes acquired by the means of claim 4, the device comprises: means for classifying a case in the database, or a case of the same format that is not in the database, according to which class's IF_THEN rule the case matches; and means for classifying the case, when rules indicating the characteristics of different classes hold for it and there are therefore several classification candidates, into the class indicated by the rule with the highest priority among the rules that hold.

As described above, when cases satisfying different rules are mixed in a database, the rule extraction means by genetic programming of the present invention can automatically extract as many IF_THEN rules as are needed. It therefore becomes possible to extract rules for the small number of exceptional data contained in the database, which was impossible with conventional rule extraction by genetic programming aimed at extracting a single rule.

Each of the extracted rules is automatically assigned a priority determined from the frequency with which the IF_THEN rule holds for the cases in the database and from the results of inference using the IF_THEN rule. The priority value indicates whether a rule is general or exceptional. Because general rules and exceptional rules are explicitly separated, the rules are also easier to understand.

When a plurality of IF_THEN rules for different classes hold for an unknown case, the classification result can be decided according to the priorities of those rules, which enables highly accurate classification.
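Priority-based classification as described above can be sketched as follows. This is a minimal illustration, assuming each extracted rule is available as a triple of class label, priority, and a condition function; the name `classify_by_priority`, the example conditions, and the choice to return `None` when no rule holds are assumptions, since this passage does not specify that case.

```python
def classify_by_priority(rules, features):
    """Return the class of the highest-priority rule whose condition holds.

    rules is a list of (class_label, priority, condition) triples, where
    condition(features) is True when the IF part of the rule is satisfied.
    Returns None when no rule holds (an assumption of this sketch).
    """
    fired = [(priority, label) for label, priority, cond in rules if cond(features)]
    if not fired:
        return None
    return max(fired, key=lambda t: t[0])[1]


# Hypothetical rule set: one general rule and one exceptional rule per class.
rules = [
    ("Y1", 8, lambda f: f["x1"] > 0.5),                    # general rule for Y1
    ("Y1", 2, lambda f: f["x2"] < 0.1),                    # exceptional rule for Y1
    ("Y2", 6, lambda f: f["x2"] > 0.7),                    # general rule for Y2
    ("Y2", 1, lambda f: f["x1"] < 0.2 and f["x2"] < 0.3),  # exceptional rule for Y2
]

# Rules for both Y1 (priority 8) and Y2 (priority 6) hold; Y1 wins.
print(classify_by_priority(rules, {"x1": 0.9, "x2": 0.9}))
```

Several rules may share a class label, so a class is recognised when any one of its rules holds, and the priority only matters when rules of different classes hold at the same time.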

As shown in FIG. 1, the database analysis method by genetic programming of the present invention, and the device therefor, appropriately divide a case set by evolutionary optimization and extract a rule from each partial case set, thereby extracting a plurality of IF_THEN rules that represent the input/output relationships of the cases in the database. Each extracted rule is given a priority determined from the proportion of cases in the database that match the rule and from the results of inference using the IF_THEN rules. As a result, a case classification system with higher classification accuracy than conventional methods can be constructed. Details and an embodiment are given below.

Here, an agent is an entity that holds a tree-structured program representing an input/output rule satisfied by the cases of some subset of the full training case set. To extract a plurality of rules from the database, a system containing a plurality of such agents (called a multi-agent system) is used. The training case set is divided into a plurality of subsets, and the agents extract rules from different subsets, so that the plurality of rules contained in the training cases can be extracted.

To realize this multi-rule extraction with a multi-agent system, it is necessary to decide into how many subsets the training case set is divided and how it is divided. In the present invention, the number of rules needed to represent the input/output relationships of all training cases, and the rules themselves expressed as tree-structured programs, are acquired using a form of genetic programming in which the representation of an individual and the genetic operations are modified.

When several agents extract a rule from the same subset of cases, those agents hold the same tree-structured program. A set of agents holding the same tree-structured program is called a group. Each agent prepared in advance belongs to exactly one group. The genetic programming used in the present invention is an optimization technique that searches, during evolution, for both the group structure (how many groups the agents are divided into and which agents belong to the same group) and the tree-structured program held by each group. Applied to rule extraction, this technique generates exactly as many distinct rules as are appropriate for expressing the input/output relationships of the training cases. Analyzing the acquired group structure also yields knowledge about the number of rules needed to express the input/output relationships of the cases and about the priority of each rule, which is based on the frequency of cases satisfying the rule and on the results of inference using the rule.

In the present invention, the set of tree-structured programs held by the groups of agents is regarded as one individual in genetic programming. That is, each individual represents a multi-agent system. FIG. 7 shows an example of one individual when four agents are used for rule extraction. In this figure, agent 1 and agent 2 form one group (71), and agent 3 and agent 4 form another group (72). As a result, this individual holds two tree-structured programs (73, 74), one for each group. In other words, this individual holds two rules.
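One possible in-memory representation of such an individual is sketched below. The class name `Individual`, its fields, and the zero-based agent numbering are assumptions for illustration; the trees are replaced by placeholder strings because FIG. 7 is not reproduced here. The `priorities` method follows the statement elsewhere in the document that the number of agents holding the same rule represents that rule's priority.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Individual:
    """One genetic programming individual, i.e. one multi-agent system."""

    group_of: list  # group_of[a] is the group id of agent a (0-based)
    trees: dict     # group id -> tree program (rule) shared by that group

    def rule_count(self):
        """Number of distinct rules held by the individual."""
        return len(self.trees)

    def priorities(self):
        """Group id -> number of agents in the group."""
        return Counter(self.group_of)


# The individual of FIG. 7: agents 1 and 2 in one group, agents 3 and 4 in another.
fig7 = Individual(
    group_of=[0, 0, 1, 1],
    trees={0: "tree 73 (placeholder)", 1: "tree 74 (placeholder)"},
)
print(fig7.rule_count(), dict(fig7.priorities()))
```

Storing the group id per agent, rather than a list of agents per group, makes it easy to look up the tree referenced by a given agent, which the crossover procedure needs.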

The concrete processing procedure of the genetic programming used in the present invention is described below. FIG. 8 shows the flow of optimization according to the present invention. First, an initial population is generated (81). FIG. 9 shows the procedure for generating one initial individual, and FIG. 10 is a conceptual diagram of a population in the present invention. With the processing shown in FIG. 9, the number of groups in an initial individual and the assignment of agents to groups are determined at random. The population is created by repeating the processing of FIG. 9 as many times as there are individuals in the initial generation. As a result, the population has a variety of group structures, as in FIG. 10. The individual (101) in FIG. 10 is a simplified depiction of the individual shown in FIG. 7.
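The random determination of the group structure of an initial individual can be sketched as follows. The exact procedure of FIG. 9 is not given in the text, so this is only one plausible way to obtain a random number of groups and a random assignment of agents; the function name `random_grouping` and the uniform choices are assumptions. A random tree would then be generated for each resulting group.

```python
import random


def random_grouping(n_agents, rng):
    """Return a random group id for each agent, using ids 0..g-1 with no gaps."""
    k = rng.randint(1, n_agents)                       # tentative number of groups
    raw = [rng.randrange(k) for _ in range(n_agents)]  # random assignment
    # Renumber in order of first appearance so that empty groups disappear.
    relabel = {}
    return [relabel.setdefault(g, len(relabel)) for g in raw]


rng = random.Random(0)
population_structures = [random_grouping(4, rng) for _ in range(5)]
print(population_structures)
```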

Among the genetic operations of the genetic programming used in the present invention, selection (83) and mutation (86) are the same as in ordinary genetic programming, but crossover (85) differs considerably. In addition, group mutation (84) is introduced as a new genetic operation.

The crossover method is described first. The concrete procedure for crossing two individuals is given in steps 1 to 3 below.

Step 1: For the two individuals to be crossed, one agent is selected arbitrarily. Let T and T' be the trees that this agent references in the respective individuals. These trees are used for crossover.

Step 2: In each parent individual, find the sets of agents A(T) and A(T') that reference the selected trees T and T', respectively. The relationship between these sets falls into one of the following three cases.

Case 1, A(T) = A(T'): The group structure of neither individual changes. Go to step 3.

Case 2, A(T) ⊃ A(T') or A(T) ⊂ A(T'): If A(T) ⊃ A(T'), then in the individual holding T, a new tree-structured program identical to T is generated, and the agents that are elements of A(T) ∩ A(T') are moved to a group that references this new program. The newly generated program is from then on called T. The group structure of the individual holding T' does not change. Conversely, if A(T) ⊂ A(T'), then in the individual holding T', a new tree-structured program identical to T' is generated, and the agents that are elements of A(T) ∩ A(T') are moved to a group that references this new program. The newly generated program is from then on called T'. The group structure of the individual holding T does not change. In this way a group is split in one of the individuals, so that only the tree-structured programs referenced by the agents in the intersection of A(T) and A(T') are used for crossover. FIG. 11 shows the case in which, for two parent individuals (111 and 112), the trees referenced by agent 2 are crossed. Go to step 3.

Case 3, neither of A(T) and A(T') contains the other (neither case 1 nor case 2 applies): Let A(T)~ and A(T')~ denote the complements of A(T) and A(T'), respectively. In the individual holding tree T, the agents that are elements of A(T)~ ∩ A(T') are moved to the group holding T; in the individual holding tree T', the agents that are elements of A(T')~ ∩ A(T) are moved to the group holding T'. Any group left without agents is deleted together with its tree structure. As a result, agents move in both individuals so that the agents that are elements of A(T) ∪ A(T') reference the same tree. FIG. 12 shows the case in which, for two parent individuals (121 and 122), the trees referenced by agent 1 are crossed. Go to step 3.

Step 3: In each of the trees T and T', one node is selected at random, and the subtrees rooted at the selected nodes are exchanged between the two individuals. This completes the crossover.

As described above, crossover is performed between the tree structure programs referenced by the same, arbitrarily chosen agent in the two parents, and the group structure is changed according to the relationship between the sets of agents that reference the trees used for crossover.
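The group restructuring that precedes the subtree exchange can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Individual` class, its field names, and `align_groups` are hypothetical, trees are treated as opaque values, and Case 1 (not described in this passage) is assumed to be the situation where the two agent sets are equal and nothing needs to change.

```python
import copy
from dataclasses import dataclass


@dataclass
class Individual:
    group_of: dict  # agent id -> group id
    trees: dict     # group id -> tree structure program (opaque here)

    def members(self, gid):
        """Set of agents that reference the tree of group gid, i.e. A(T)."""
        return {a for a, g in self.group_of.items() if g == gid}

    def split(self, gid, agents):
        """Case 2: copy the tree of gid into a new group and move agents to it."""
        new_gid = max(self.trees) + 1
        self.trees[new_gid] = copy.deepcopy(self.trees[gid])
        for a in agents:
            self.group_of[a] = new_gid
        return new_gid

    def absorb(self, gid, agents):
        """Case 3: move agents into gid, then delete groups left empty."""
        for a in agents:
            self.group_of[a] = gid
        used = set(self.group_of.values())
        for g in list(self.trees):
            if g not in used:
                del self.trees[g]


def align_groups(p1, p2, agent):
    """Return the group ids (in p1, p2) whose trees will be crossed."""
    g1, g2 = p1.group_of[agent], p2.group_of[agent]
    a1, a2 = p1.members(g1), p2.members(g2)
    if a1 == a2:        # assumed Case 1: nothing to restructure
        pass
    elif a1 > a2:       # Case 2, A(T) is a proper superset of A(T')
        g1 = p1.split(g1, a1 & a2)
    elif a1 < a2:       # Case 2, A(T) is a proper subset of A(T')
        g2 = p2.split(g2, a1 & a2)
    else:               # Case 3: neither set contains the other
        p1.absorb(g1, a2 - a1)
        p2.absorb(g2, a1 - a2)
    return g1, g2
```

After `align_groups` returns, the same set of agents references the two selected trees in both parents, so the subtree exchange of Step 3 affects exactly those agents.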

Group mutation (86) is used to prevent the population from converging to a single group structure. For each agent in an individual, with an occurrence probability called the group mutation rate, the agent is moved to an arbitrarily selected group. Because this operation serves to promote changes in group structure through crossover, it is performed before crossover.
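A minimal sketch of group mutation, under the assumption that an individual's grouping is stored as a mapping from agent to group id. The function name `group_mutation` and its parameters are hypothetical, and the sketch allows the randomly chosen group to coincide with the agent's current group, which the text does not rule out.

```python
import random


def group_mutation(group_of, group_ids, rate, rng=random):
    """Move each agent, with probability `rate`, to a randomly chosen group.

    group_of:  dict mapping agent id -> group id
    group_ids: list of group ids that exist in the individual
    rate:      group mutation rate in [0, 1]
    """
    mutated = dict(group_of)  # leave the caller's mapping untouched
    for agent in mutated:
        if rng.random() < rate:
            mutated[agent] = rng.choice(group_ids)
    return mutated
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.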

By repeating generational change with the above operations, the population searches for a solution while gradually bringing the group structure closer to a desirable one.

FIG. 13 is a block diagram of one embodiment of the present invention. It shows a system that extracts IF_THEN rules with assigned priorities (System 1) and a classification system (System 2) built from all the IF_THEN rules for each class obtained by running System 1 multiple times. In System 1, rule extraction processing using genetic programming (132) is applied to a database (131) that stores cases. The multiple rules (133) held by the best individual in the evolved population are taken as the IF_THEN rules for judging whether a case belongs to class Yi. With the present invention, each acquired rule is also given a priority based on how frequently the rule holds and on the results of inference using the rule. To extract the IF_THEN rules satisfied by the cases of every class, System 1 is executed with the target class Yi changed over i = 1, ..., m.

The database (131) stores cases in which an input feature vector X = (X1, X2, ..., Xn) is classified into one of m classes Y1, Y2, ..., Ym.

To determine which class a given input vector belongs to, a logical expression that only the cases of each class Yi (i = 1, ..., m) satisfy must be found. This logical expression is an AND combination of pairs consisting of an item and a value that the item can take, and is expressed, for example, as in Equation 1.

In this case, the logical expression must return true for cases of class Yk and false for cases of any other class. The expression is therefore an IF_THEN rule stating that a case is classified into class Yk if the logical expression holds. The logical expression forming the antecedent of this IF_THEN rule is represented by a tree structure program such as the one in FIG. 14.

The method for extracting IF_THEN rules satisfied only by cases of class Yk is as follows. FIG. 15 shows how the fitness of each individual is evaluated, and FIG. 16 shows the concept. Each of the multiple trees in an individual of the genetic programming used in the present invention represents a logical expression. Each case from a training case set containing a mixture of m classes is input to the system (152). For that input, whether the logical expression of each tree held by the individual under evaluation holds is computed (156). As with Data 2 in FIG. 16, if at least one of the logical expressions held by the individual is true (T), the input is judged to be Yk. As with Data 1 in FIG. 16, if all of the logical expressions return false (F), the input is judged not to be Yk. Optimization is performed so that, for cases that should be classified as Yk, at least one of the trees in the individual outputs true, and for cases of every other class, all trees return false.
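The judgment made by one individual reduces to a logical OR over its rules. The sketch below assumes each rule is available as a Python predicate over a case; the function name and the two example rules are hypothetical and are not taken from the figures.

```python
def judged_as_target(rules, case):
    """True if at least one rule of the individual holds for the case."""
    return any(rule(case) for rule in rules)


# Two illustrative rules over discretized features X1 and X2 (hypothetical).
rules = [
    lambda c: c["X1"] > 1 and c["X2"] == 0,
    lambda c: c["X2"] == 3,
]

print(judged_as_target(rules, {"X1": 2, "X2": 0}))  # first rule holds
print(judged_as_target(rules, {"X1": 0, "X2": 1}))  # no rule holds
```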

If a logical expression wrongly returns true for a case of a class other than the rule extraction target, the fitness of the individual is reduced by a penalty that depends on the number of agents holding that rule. This suppresses the allocation of agents to groups that frequently misrecognize.

In addition, considering the number of cases each agent is responsible for in rule extraction gives rise to the notion of the load of each agent. The load is calculated from the number of times the rule of each group is adopted and the number of agents belonging to each group. A rule's adoption count is incremented when the rule correctly returns true for a case of the rule extraction target class (160). When counting, if multiple trees output true, as with Data 3 in FIG. 16, the rule of the group with the most agents among them is adopted. If an agent a belongs to group g, the load Wa of this agent is calculated as in Equation 2.

Equalizing the agent loads calculated in this way allocates many agents to groups whose rules are adopted often and few agents to groups whose rules are adopted rarely. The number of agents referencing each rule therefore conveys important knowledge: how often each extracted rule is used, that is, how general the rule is as a description of the cases of that class. The number of agents referencing a rule is thus determined by the proportion of cases in the database that match the rule and by the results of inference using the rule, and it represents the rule's priority.
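The adoption counting and the load can be sketched as below. Equation 2 is not reproduced in this text, so the load formula here is an assumption: the adoption count of a group's rule divided by the number of agents in the group, which is one form consistent with the statement that equalizing loads makes group size follow adoption count. The function names and the tie-breaking behavior (first group found among equally large ones) are likewise hypothetical.

```python
def adoption_counts(rules, group_sizes, target_cases):
    """Count how often each group's rule is adopted on target-class cases.

    rules:        dict group id -> predicate over a case
    group_sizes:  dict group id -> number of agents in the group
    target_cases: cases belonging to the rule extraction target class
    """
    counts = {g: 0 for g in rules}
    for case in target_cases:
        firing = [g for g, rule in rules.items() if rule(case)]
        if firing:
            # When several rules hold, adopt the one with the most agents.
            adopted = max(firing, key=lambda g: group_sizes[g])
            counts[adopted] += 1
    return counts


def agent_loads(counts, group_sizes):
    """ASSUMED form of Equation 2: load shared equally within a group."""
    return {g: counts[g] / group_sizes[g] for g in group_sizes}
```

Under this assumed form, a group whose rule is adopted three times as often as another needs three times as many agents for the two loads to match.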

To satisfy the above requirements, the fitness is calculated by Equation 3. Evolving the population so that this fitness increases yields IF_THEN rules satisfied only by cases of class Yk.

Here, miss_target_data is the number of cases of the rule extraction target class Yk for which all rules returned false (158, 161, 162). misrecognition is the number of cases of other classes that were misrecognized because a rule output true (158, 161, 163). fault_agent is the number of agents belonging to the group holding a rule when that rule outputs true for a case of another class (164). The third term of Equation 3 therefore represents the average number of agents that support a wrong judgment of true for a case of another class. Vw is the variance of the load over all agents (166). The fitness is the sum of these terms weighted by α, β, and δ (167). In addition, to suppress redundant group division, the fitness is multiplied by the penalty coefficient γ (γ > 1) raised to the power (G - 1), where G is the number of groups in the individual.
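The quantities that enter the fitness can be tallied as in the following sketch. It deliberately stops short of combining them, because Equation 3 is not reproduced in this text and its exact form is not assumed here. The function names are hypothetical; summing the sizes of all wrongly firing groups for `fault_agent` and using the population variance for Vw are assumptions.

```python
from statistics import pvariance


def fitness_components(rules, group_sizes, cases, target):
    """Tally miss_target_data, misrecognition and fault_agent.

    rules:       dict group id -> predicate over a feature dict
    group_sizes: dict group id -> number of agents in the group
    cases:       list of (features, label) pairs
    target:      label of the rule extraction target class Yk
    """
    miss_target_data = misrecognition = fault_agent = 0
    for features, label in cases:
        firing = [g for g, rule in rules.items() if rule(features)]
        if label == target and not firing:
            miss_target_data += 1          # target case missed by all rules
        elif label != target and firing:
            misrecognition += 1            # other-class case judged true
            # ASSUMPTION: every wrongly firing group contributes its agents.
            fault_agent += sum(group_sizes[g] for g in firing)
    return miss_target_data, misrecognition, fault_agent


def load_variance(loads, group_sizes):
    """Vw: variance of the load over all agents (population variance assumed)."""
    per_agent = [loads[g] for g, n in group_sizes.items() for _ in range(n)]
    return pvariance(per_agent)
```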

Optimization with this fitness ensures that, for cases that should be classified as Yk, some tree returns true, and for cases that should be classified into other classes, the trees return false. More agents are allocated to a rule the more frequently it returns true for cases of the target class, and allocation of agents to groups that frequently misrecognize is suppressed. Consequently, a rule with many agents is a typical, frequently used judgment rule, whereas a rule with only a few agents is a judgment rule for exceptional data that appears rarely. Rules for the other classes are extracted by the same process.

Next, all the rules acquired for each class are used to build a classification system for cases whose class is unknown. FIG. 17 shows the flow of classification processing, and FIG. 18 shows the concept. To classify a case of unknown class, the case is applied to the rule sets of all classes. If all the rules extracted for a class return false, the case does not belong to that class. If at least one extracted rule returns true, the case is assigned to the class indicated by that rule. For example, Data 1 in FIG. 18 shows a case classified as class Y2.

For some cases, rules targeting different classes return true at the same time. Here the number of agents in the group holding each rule plays an important role. As a result of optimization, the number of agents in the group holding a rule is considered to reflect both the proportion of cases in the database that match the rule and the rule's predictive accuracy, so the class supported by more agents is considered the most plausible answer. The possibility of the other class, though smaller, can still be recognized. For Data 2 in FIG. 18, the rules for class Y2 and class Ym both output true. The case is classified as class Y2, whose rule is supported by more agents, while the possibility that it is a case of class Ym is also recognized. The ability to recognize such slight possibilities is one of the advantages of this system.
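The priority-based resolution can be sketched as a ranking of candidate classes by agent count. The function name, the data layout, and the placeholder predicates are hypothetical; when several rules of the same class hold, the sketch takes the largest agent count among them, which is an assumption in line with choosing the highest-priority rule that holds.

```python
def classify(rule_sets, case):
    """Rank candidate classes for a case by the agent count of firing rules.

    rule_sets: dict class label -> list of (predicate, number of agents)
    Returns a list of (label, agents) sorted from most to least supported.
    An empty list means that no rule of any class holds.
    """
    support = {}
    for label, rules in rule_sets.items():
        firing = [n_agents for rule, n_agents in rules if rule(case)]
        if firing:
            support[label] = max(firing)
    return sorted(support.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder predicates; the agent counts 8 and 3 are purely illustrative.
rule_sets = {
    "Y2": [(lambda c: c["X1"] == 2, 8)],
    "Ym": [(lambda c: c["X2"] == 1, 3)],
    "Y1": [(lambda c: c["X1"] == 0, 10)],
}
ranking = classify(rule_sets, {"X1": 2, "X2": 1})
```

The first entry of `ranking` is the adopted class, and the remaining entries are the less likely classes that are still worth flagging.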

As a working example, the present invention was applied to knowledge acquisition from a hepatobiliary disease database in the medical field. The database has nine test results as input items and, as output, a classification into four diseases (Alcoholic liver damage, Primary hepatoma, Liver cirrhosis, Cholelithiasis). FIG. 19 shows example cases for each disease. The database consists of 536 cases, of which 322 were used as training cases and the remaining 214 as test cases for evaluating the learning result. Each test item x of a case is discretized into four levels by Equation 4 using the thresholds shown in FIG. 20.
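A four-level discretization with three ascending thresholds can be sketched with the standard `bisect` module. Equation 4 and the thresholds of FIG. 20 are not reproduced in this text, so the threshold values below are made up, and the boundary handling (a value equal to a threshold falls in the higher level) is an assumption.

```python
from bisect import bisect_right


def discretize(x, thresholds):
    """Map a raw test value to a level 0..3 given three ascending thresholds."""
    # bisect_right counts how many thresholds are <= x.
    return bisect_right(thresholds, x)


# Hypothetical thresholds for one test item; not the values of FIG. 20.
example_thresholds = [40, 100, 200]
levels = [discretize(v, example_thresholds) for v in (10, 40, 150, 500)]
```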

The terminal and function symbols shown in FIG. 21 were used as the genetic programming symbols, subject to the following constraints: a terminal symbol never appears directly as an argument of the and symbol; the first argument arg0 of eq, gt, and lt is a test item such as GOT; and the second argument arg1 is one of the discrete values 0, 1, 2, 3. Crossover and mutation operations that would violate these constraints are not performed.
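The constraints on symbol use can be checked, and a rule tree evaluated, as in the sketch below. Trees are represented as nested tuples, which is a hypothetical encoding; only the symbols named in the text (and, eq, gt, lt) are handled, FIG. 21 may list others, and allowing and to take two or more arguments is an assumption. The item name X2 is illustrative.

```python
import operator

COMPARATORS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}


def is_valid(tree, items):
    """Check the constraints on `and`, `eq`, `gt` and `lt` described above."""
    if not isinstance(tree, tuple) or not tree:
        return False  # a bare terminal is not a valid (sub)expression
    op, *args = tree
    if op == "and":
        # Arguments of `and` must themselves be expressions, not terminals.
        return len(args) >= 2 and all(is_valid(a, items) for a in args)
    if op in COMPARATORS:
        # arg0 must be a test item, arg1 one of the discrete values 0..3.
        return len(args) == 2 and args[0] in items and args[1] in (0, 1, 2, 3)
    return False


def evaluate(tree, case):
    """Evaluate a valid rule tree against a case of discretized test values."""
    op, *args = tree
    if op == "and":
        return all(evaluate(a, case) for a in args)
    item, value = args
    return COMPARATORS[op](case[item], value)


ITEMS = ("GOT", "X2")
rule = ("and", ("gt", "GOT", 1), ("eq", "X2", 0))
```

A crossover or mutation operator could call `is_valid` on each candidate offspring and discard those that fail, which is one way to honor the rule that constraint-violating operations are not performed.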

The experiment used 50 agents per genetic programming individual. FIGS. 22, 23, 24, and 25 show the rules optimized to be satisfied only by the case sets of Alcoholic liver damage, Primary hepatoma, Liver cirrhosis, and Cholelithiasis, respectively. Multiple rules were extracted for judging each disease. For example, in FIG. 23, which shows the rules satisfied only by Primary hepatoma cases, the 50 agents of the best individual obtained are divided into 11 groups, and rules with more agents tend to be adopted more often. Rule 1 for Primary hepatoma is a typical rule that returns true for 32% of the target Primary hepatoma cases, whereas Rule 11 returns true for 2% of them. Rules for exceptional, infrequent cases were thus also extracted.

The classification system built by integrating the acquired rules was applied to the 322 cases used for training to verify its classification performance, and 80.4% of the cases were classified correctly. As an example of the actual classification behavior, data whose nine test item values are 0, 0, 0, 1, 2, 1, 1, 0, 1 appear in the training cases with two different diagnoses, Alcoholic liver damage and Cholelithiasis. For this case, Rule 2 for Alcoholic liver damage and Rule 6 for Cholelithiasis both returned true. These two rules are referenced by 8 and 3 agents, respectively. From the priority indicated by the agent counts, the case is classified as Alcoholic liver damage, while the system also succeeds in drawing attention to the possibility of Cholelithiasis.

When the classification system was applied to the 214 unseen cases, 157 were recognized correctly, a recognition rate of 73.4%. Among the cases counted as failures, however, 30 included the correct disease as a possibility, and in 25 of those, the large majority, the correct disease was the only competitor to the wrongly adopted one. In terms of detecting possibilities, 187 of the 214 cases (157 + 30), or 87.4%, were recognized successfully. Since the case data used also contain cases that were misclassified in the first place, this can be considered a high recognition rate.

Industrial applicability

The present invention is effective for databases that store cases in which the measurement of data, or the classification decided from the measurements, involves noise or human judgment and preference. Such data are characterized by cases that have different classification results despite identical inputs and, conversely, by sets of cases with the same classification result that cannot be expressed by a single rule.

Application examples of the present invention include disease diagnosis systems in the medical field, knowledge acquisition from customer purchase histories in retail, weather forecasting based on time-series variation in meteorological data, and stock price prediction.

In the medical field, for example, patients' diagnostic results are stored in databases. Discovering effective diagnostic rules from this large volume of cases is important for supporting medical care and for realizing evidence-based medicine. However, measured values such as test results stored in the database contain noise, and diagnoses depend heavily on the physician's experience and are therefore ambiguous, so diagnostic rules cannot always be expressed by a single rule. In medicine, whether patients who differ from the typical case can be treated without being overlooked is key to high-quality care. The present invention, which can also acquire knowledge about a small number of exceptional patients, is effective for extracting diagnostic rules from medical databases. Specifically, as in the working example, prioritized rules are extracted in advance for each disease from a database of past diagnostic results, and a new patient is diagnosed according to which class's rules hold. If rules for different diseases hold at the same time, the patient may have any of those diseases, and the likelihood of each can be gauged from the ratio of the rules' priorities. If no rule holds, the patient can be regarded as being in a normal state with none of the diseases.

FIG. 1: Conceptual diagram of the present invention
FIG. 2: Example of a tree structure program
FIG. 3: Process flow of genetic programming
FIG. 4: Example of crossover in genetic programming
FIG. 5: Examples of cases and teaching signals in the conventional invention
FIG. 6: Classification of unknown cases in the conventional invention
FIG. 7: Conceptual diagram of one individual of genetic programming in the present invention
FIG. 8: Process flow of genetic programming in the present invention
FIG. 9: Method for generating initial individuals of genetic programming in the present invention
FIG. 10: Conceptual diagram of the population of genetic programming in the present invention
FIG. 11: Crossover of genetic programming in the present invention (example in which a group split occurs)
FIG. 12: Crossover of genetic programming in the present invention (example in which group integration occurs)
FIG. 13: Configuration diagram of an embodiment of the present invention
FIG. 14: Example of a logical expression forming the antecedent of an IF_THEN rule
FIG. 15: Method for evaluating the fitness of each individual when extracting the IF_THEN rules satisfied by cases of class Yk
FIG. 16: Conceptual diagram of the method for judging whether each case belongs to class Yi
FIG. 17: Flow of classification processing for cases of unknown class
FIG. 18: Conceptual diagram of the classification of cases of unknown class
FIG. 19: Example of hepatobiliary disease data
FIG. 20: Thresholds for discretizing each test item
FIG. 21: Terminal and function symbols for creating IF_THEN rules for hepatobiliary diseases
FIG. 22: Rules extracted so as to be satisfied only by cases of Alcoholic liver damage
FIG. 23: Rules extracted so as to be satisfied only by cases of Primary hepatoma
FIG. 24: Rules extracted so as to be satisfied only by cases of Liver cirrhosis
FIG. 25: Rules extracted so as to be satisfied only by cases of Cholelithiasis

Explanation of symbols

71: Group consisting of agents 1 and 2
72: Group consisting of agents 3 and 4
73: Tree structure program referenced by group 71
74: Tree structure program referenced by group 72
101: One individual of genetic programming in the present invention (a simplified version of the individual in FIG. 7)
111: One of the parent individuals used for crossover
112: The other parent individual, crossed with the individual indicated by 111
121: One of the parent individuals used for crossover
122: The other parent individual, crossed with the individual indicated by 121
131: Database storing cases each classified into one of m classes
132: Processing unit that extracts the IF_THEN rules satisfied only by cases of one class
133: Set of rules satisfied only by cases of class Yi

Claims

1. A database analysis device for a database storing cases each classified, according to its output signal, into one of a plurality of classification candidates prepared in advance, or into two or more of them where overlap is allowed, the device comprising: means for expressing, as an IF_THEN rule, the input/output relationship of cases having the same classification result by a tree structure that combines terminal nodes, representing each input signal constituting a case and the values that signal can take, with function nodes, representing magnitude relations between and logical products of those values; means for automatically dividing the set of cases in the database having any one classification result into a plurality of partial case sets each satisfying the same input/output relationship; means for extracting an IF_THEN rule that appropriately represents the input/output relationship satisfied by each partial case set, by repeatedly selecting among a plurality of generated tree structures based on a fitness indicating case classification accuracy and generating a new group of tree structures by exchanging subtrees within the tree structures or changing nodes; and means for assigning priorities that determine which IF_THEN rule to adopt when a plurality of the IF_THEN rules extracted from the partial case sets hold.

2. A database analysis method wherein, in the means for dividing the case set into a plurality of partial case sets satisfying the same input/output relationship, the means for extracting an IF_THEN rule that appropriately represents the input/output relationship satisfied by each partial case set, and the means for assigning priorities that determine which IF_THEN rule to adopt when a plurality of IF_THEN rules extracted from the partial case sets hold, an entity that holds an IF_THEN rule satisfied by the cases of an arbitrary subset of the case set having the same classification result, and that returns an output according to that rule, is defined as an agent, a set of agents holding the same IF_THEN rule is defined as a group, and an optimization method is used that automatically searches for the number of groups formed by the plurality of agents, the number of agents belonging to each group, and the IF_THEN rule held by each agent, so that the entire set of cases having one classification result can be expressed by the rules of the agents.

3. A database analysis method wherein, in the means according to claim 1 for assigning priorities that determine which IF_THEN rule to adopt when a plurality of IF_THEN rules extracted from the partial case sets hold, the number of agents holding the same IF_THEN rule represents the priority of that rule, and priorities are assigned by setting the fitness so that the priority of an IF_THEN rule for a case set having a given classification result becomes higher as the proportion of cases belonging to that classification result that the rule matches becomes larger, and becomes lower as the proportion of cases belonging to other classification results for which the rule holds becomes larger.

4. The database analysis device according to claim 1, comprising means for extracting IF_THEN rules for all classes existing in a database that stores cases each classified, according to its output signal, into one of a plurality of classification candidates prepared in advance, or into two or more of them where overlap is allowed, by repeatedly executing, for every class existing in the database, an operation that automatically acquires the IF_THEN rules satisfied only by the target class, the operation using: the means for automatically dividing the set of cases having the same classification result, for an arbitrary class, into a plurality of partial case sets satisfying the same input/output relationship; the means for extracting an IF_THEN rule that appropriately represents the input/output relationship satisfied by each partial case set, by repeatedly selecting among generated tree structures based on a fitness indicating case classification accuracy and generating a new group of tree structures by exchanging subtrees within the tree structures or changing nodes; and the means for assigning priorities that determine which IF_THEN rule to adopt when a plurality of IF_THEN rules extracted from the partial case sets hold.

5. A database analysis device for a database storing cases each classified, according to its output signal, into one of a plurality of classification candidates prepared in advance, or into two or more of them where overlap is allowed, the device comprising: means for classifying a case in the database, or a case of the same format that does not exist in the database, according to which class's IF_THEN rule the case matches, using the set of IF_THEN rules for all classes acquired by the means of claim 4; and means for classifying the case, when a plurality of rules indicating the characteristics of different classes hold for it so that there are a plurality of classification candidates, into the class indicated by the rule having the highest priority among the rules that hold.