JP5325131B2

JP5325131B2 - Pattern extraction apparatus, pattern extraction method, and program

Info

Publication number: JP5325131B2
Application number: JP2010014603A
Authority: JP
Inventors: 努平尾; 潤鈴木; 秀樹磯崎; 昌明永田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-01-26
Filing date: 2010-01-26
Publication date: 2013-10-23
Anticipated expiration: 2030-01-26
Also published as: JP2011154469A

Abstract

PROBLEM TO BE SOLVED: To automatically generate as many patterns contributing to automatic classification processing as required, even when the number of labeled texts is small.

SOLUTION: A classification model is a statistical model that is generated using labeled texts as training data and is configured to output probability data representing the probability that an arbitrary text to which it is applied belongs to a predetermined set. The classification model is applied to unlabeled texts to generate probability data representing the probability that each unlabeled text belongs to the predetermined set. Using at least values determined from the generated probability data, an index is generated that represents the degree of relevance between a first pattern, which is an arbitrary pattern, and the classification result obtained when texts containing the first pattern are classified into the sets to which they belong.

Description

The present invention relates to a technique for extracting sequence patterns that contribute to the automatic classification of data consisting of symbol sequences.

Techniques exist for automatically classifying a group of data items consisting of symbol sequences (for example, document data written in character symbols, or gene sequences written in gene symbols). One example is a technique that classifies a mixed group of e-mails into a set of spam e-mails and a set of other e-mails. In such techniques, data are classified automatically on the basis of sequence (symbol sequence) patterns having certain characteristics. For example, sequence patterns that occur frequently in data belonging to a predetermined set (for example, the set of spam e-mails) are prepared, and each item of data to be classified is classified automatically according to whether it contains those sequence patterns. Such automatic classification techniques therefore presuppose the generation of sequence patterns that contribute to automatic classification (for example, sequence patterns that occur frequently in data belonging to the set of spam e-mails).

On the other hand, techniques exist for automatically extracting frequent sequence patterns from a group of data consisting of symbol sequences (see, for example, Non-Patent Document 1).

FIG. 1 is a diagram for explaining a conventional technique for automatically extracting frequent sequence patterns from a group of data consisting of symbol sequences.

In the following, an element (minimum unit) consisting of one or more symbols is called an item, a sequence consisting of one or more items is called a text, a sequence of one or more items contained in a text is called a pattern, and a set of texts is called a database. Examples of symbols are letters, digits, and marks; examples of items are letters, digits, marks, words, word strings, bases, and base pairs. In the example of FIG. 1, “a” to “d” are items, a sequence such as “a b c c” is one text, and sequences contained in that text such as “a”, “a b”, and “a c” are patterns. The example of FIG. 1 uses a database consisting of five texts.

Consider extracting from the database the patterns whose frequency of occurrence is greater than ζ = 2. Even if the same pattern appears several times in one text, it is counted only once for that text.
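The counting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names `contains` and `support` and the three-text `toy_db` are hypothetical (the five texts of FIG. 1 are not reproduced here). Gaps between items are allowed, since “a c” is given as a pattern contained in the text “a b c c”.

```python
from typing import List, Sequence


def contains(text: Sequence[str], pattern: Sequence[str]) -> bool:
    """Return True if `pattern` occurs in `text` in order (gaps allowed)."""
    it = iter(text)
    # `item in it` advances the iterator, so each item must be found
    # strictly after the previous one.
    return all(item in it for item in pattern)


def support(database: List[Sequence[str]], pattern: Sequence[str]) -> int:
    """Number of texts containing `pattern`; each text counts at most once."""
    return sum(1 for text in database if contains(text, pattern))


# Hypothetical toy database (not the data of FIG. 1).
toy_db = [["a", "b", "c", "c"], ["a", "c", "c"], ["b", "a"]]
```

With this toy database, `support(toy_db, ["c"])` is 2 even though “c” occurs four times in total, because repeated occurrences within one text are counted once.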

First, the frequency of occurrence of each item (pattern of length 1) is computed for the given database (called the “input database (IDB)”). In the example of FIG. 1, the frequencies of the items “a”, “b”, “c”, and “d” are 5, 4, 4, and 2, respectively. That is, in the input database, three items, “a”, “b”, and “c”, have a frequency greater than ζ = 2. These three items are stored in memory as elements of the output list (OUT) (OUT = {a, b, c}), and then the frequencies of the patterns beginning with each of “a”, “b”, and “c” are computed. Because no pattern beginning with “d” can have a frequency greater than ζ = 2, patterns beginning with “d” are excluded from further processing.

In this example, the frequencies in the input database of the patterns beginning with “a”, “b”, and “c” are computed using databases generated by projection. Projection means creating a new database by deleting, from each entry of a database (initially a text), everything from the beginning of the entry up to and including the first occurrence of a certain item (called the “item of interest”), and making the remaining sequence an entry of the new database. Entries that do not contain the item of interest are excluded from the database. In the frequency computation described here, the items “a”, “b”, and “c”, whose frequency is greater than ζ = 2, each serve as the item of interest.
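Projection as described here can be sketched as follows. The function name `project` and the sample entries are hypothetical. Whether an entry that ends with the item of interest is kept as an empty suffix or dropped is not stated in the text; this sketch keeps it, which does not affect any item count.

```python
from typing import List


def project(database: List[List[str]], item: str) -> List[List[str]]:
    """Build the projected database for `item` (the item of interest)."""
    projected = []
    for entry in database:
        if item in entry:
            first = entry.index(item)           # first occurrence only
            projected.append(entry[first + 1:])  # keep what follows it
        # entries without the item of interest are excluded
    return projected


# Hypothetical entries (not the data of FIG. 1).
sample = [["a", "b", "c", "c"], ["b", "a"], ["c", "d"]]
projected_on_a = project(sample, "a")
```

Here `projected_on_a` holds the suffix “b c c” from the first entry and an empty suffix from the second; the third entry is dropped because it does not contain “a”.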

First, projecting the input database with “a” as the item of interest (prj(a)) creates a database (DB(a)) with the entries “b c c”, “c”, “c”, and “b d”. Next, the frequency of each item (pattern of length 1) is computed for this database (DB(a)). If an item has a frequency greater than ζ = 2, the pattern obtained by prefixing that item with the items of interest of the projections performed so far is stored in memory as an element of the output list (OUT). In the example of the projection (prj(a)) in FIG. 1, two items, “b” and “c”, have a frequency greater than ζ = 2. The patterns “a b” and “a c”, obtained by prefixing “b” and “c” with the item of interest “a” of the projection so far, are therefore stored in memory as elements of the output list (OUT = {a, b, c, a b, a c}).

Projection is then performed again on the database (DB(a)). In the example of FIG. 1, the database (DB(a)) is projected with the item “b”, whose frequency is greater than ζ = 2, as the item of interest (prj(b)), generating the database (DB(a b)). Because the database (DB(a b)) contains no item whose frequency is greater than ζ = 2, no new element is added to the output list (OUT). Processing then returns to the database (DB(a)), which is projected with the item “c”, whose frequency is greater than ζ = 2, as the item of interest (prj(c)), generating the database (DB(a c)). Because the database (DB(a c)) contains no item whose frequency is greater than ζ = 2, no new element is added to the output list (OUT).

Next, processing returns to the input database (IDB), which is projected with “b” as the item of interest (prj(b)), generating the database (DB(b)), and “b a” and “b c” are stored in memory as elements of the output list (OUT = {a, b, c, a b, a c, b a, b c}). Thereafter, following the same criteria in depth-first order, the database (DB(b)) is projected with the items “a” and “c” as items of interest, and the input database (IDB) is projected with the item “c” as the item of interest, after which processing ends. Performing projection recursively in this way makes it possible to efficiently find the patterns whose frequency is greater than a given value ζ. In the example of FIG. 1, “a, b, c, a b, a c, b a, b c” are obtained as the patterns whose frequency is greater than ζ = 2.
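The recursive procedure walked through above can be sketched as follows. This is a simplified illustration in the spirit of PrefixSpan, not a reproduction of FIG. 1: the names `project` and `prefixspan` and the three-text `toy_db` are hypothetical, and the output order is strictly depth-first, which may differ from the order in which the walkthrough fills OUT.

```python
from collections import Counter
from typing import List, Tuple


def project(database: List[List[str]], item: str) -> List[List[str]]:
    """Keep, for each entry containing `item`, the part after its first occurrence."""
    return [e[e.index(item) + 1:] for e in database if item in e]


def prefixspan(database: List[List[str]], zeta: int,
               prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return every pattern contained in more than `zeta` texts."""
    counts = Counter()
    for entry in database:
        counts.update(set(entry))  # each entry counts an item at most once
    out = []
    for item in sorted(counts):
        if counts[item] > zeta:
            pattern = prefix + (item,)
            out.append(pattern)
            # Recurse on the projected database to extend the pattern.
            out.extend(prefixspan(project(database, item), zeta, pattern))
    return out


# Hypothetical toy database and threshold (not the data of FIG. 1).
toy_db = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"]]
patterns = prefixspan(toy_db, zeta=1)
```

For this toy database and ζ = 1, the patterns found are “a”, “a c”, “b”, “b c”, and “c”. Items whose count does not exceed ζ are never used as an item of interest, which is the pruning that makes the recursion efficient.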

J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. 2001. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In Proc. of the 17th ICDE, pages 215-224.

However, although the conventional technique of Non-Patent Document 1 can automatically extract frequently occurring patterns from a database, it cannot automatically generate patterns that contribute to automatic classification. For example, even if each text in the database is given a label, that is, data representing the set to which the text belongs (for example, a label indicating whether the text is spam), the conventional technique of Non-Patent Document 1 cannot take the labels into account when extracting patterns.

If labeled texts (texts associated with labels) are divided into the sets indicated by their labels (for example, into spam and non-spam) and the conventional technique of Non-Patent Document 1 is applied to each set, frequently occurring patterns can be extracted separately for each set. Patterns extracted in this way do contribute to automatic classification. However, associating labels with texts is usually done by hand and is costly, so it is difficult to prepare a large number of labeled texts. Moreover, when patterns are extracted from a small number of texts, very few patterns are obtained, and the performance of automatic classification using the extracted patterns degrades.

The present invention has been made in view of these points, and its object is to provide a technique capable of automatically generating as many patterns contributing to automatic classification as required, even when the number of labeled texts is small.

In the present invention, an element consisting of one or more symbols is an item; a sequence consisting of one or more items is a text; a sequence of one or more items contained in a text is a pattern; a text associated with a label, which is data representing the set to which the text belongs, is a labeled text; a text not associated with a label is an unlabeled text; and a classification model is a statistical model that is generated using labeled texts as training data and is configured to output probability data representing the probability that an arbitrary text to which it is applied belongs to a predetermined set. The classification model is applied to an unlabeled text, and probability data representing the probability that the unlabeled text belongs to the predetermined set are generated. Then, using at least a value determined from the generated probability data, an index is generated that represents the degree of relevance between a first pattern, which is an arbitrary pattern, and the classification result obtained when a text containing the first pattern is classified into the set to which that text belongs.

Using the index generated in the present invention, patterns that contribute strongly to automatic text classification can be generated automatically. The index can be generated from the result of applying the classification model to unlabeled texts. The present invention can therefore automatically generate as many patterns contributing to automatic classification as required, even when the number of labeled texts is small.

FIG. 1: A diagram for explaining a conventional technique for automatically extracting frequent sequence patterns from a group of data consisting of symbol sequences.
FIG. 2: A block diagram for explaining the functional configuration of the pattern extraction apparatus of the embodiment.
FIG. 3: A diagram for explaining the processing of the pattern extraction apparatus.
FIG. 4: A diagram for explaining the processing of the pattern extraction apparatus.
FIG. 5: A flowchart for explaining a specific example of the processing of step S14.
FIG. 6: A flowchart for explaining a specific example of the processing of step S14.
FIG. 7: A flowchart for explaining a specific example of the processing of step S14.
FIG. 8: Pseudocode for explaining a specific example of the processing of step S14.
FIG. 9: A diagram for explaining a worked example of step S14.
FIG. 10: A diagram for explaining a worked example of step S14.
FIG. 11: A diagram for explaining a worked example of step S14.

Embodiments of the present invention are described below with reference to the drawings.

<Functional configuration>
FIG. 2 is a block diagram for explaining the functional configuration of the pattern extraction apparatus 1 of the embodiment.

As illustrated in FIG. 2, the pattern extraction apparatus 1 of this embodiment includes a training unit 11, a classification unit 12, a database synthesis unit 13, an extraction unit 14, a control unit 15, and storage units 16a to 16e. The extraction unit 14 includes an index generation unit 14a, a limit value generation unit 14b, and a search unit 14c. The pattern extraction apparatus 1 of this embodiment is, for example, a special apparatus configured by loading a predetermined program into a known or dedicated computer having a CPU (central processing unit), RAM (random-access memory), ROM (read-only memory), and the like. That is, the training unit 11, the classification unit 12, the database synthesis unit 13, the extraction unit 14, and the control unit 15 are, for example, a CPU executing the predetermined program. The storage units 16a to 16e are, for example, an auxiliary storage device such as a hard disk drive, RAM, registers, cache memory, or a storage area combining at least parts of these. At least some of the functions of the training unit 11, the classification unit 12, the database synthesis unit 13, the extraction unit 14, and the control unit 15 may be implemented by an integrated circuit, and at least part of the storage areas of the storage units 16a to 16e may reside in an integrated circuit. The pattern extraction apparatus 1 executes each process under the control of the control unit 15. Although not always stated below, data generated in each operation are stored in the storage units and are read out as needed for use in subsequent operations.

<Processing>
FIGS. 3 and 4 are diagrams for explaining the processing of the pattern extraction apparatus 1. The processing of the pattern extraction apparatus 1 is described below with reference to these figures.

[Pre-processing]
As pre-processing, one or more items of labeled data, each including a labeled text, and one or more items of unlabeled data, each including an unlabeled text, are stored in the storage unit 16a (FIG. 2). Examples of labeled and unlabeled texts are document data written in character symbols, gene sequences written in gene symbols, and program sequences written in program code. A labeled text is a text with which a human has associated, in advance, a label, that is, data representing the set to which the text belongs. Any number of sets may be used as long as there are at least two; in the following, two sets are defined, one called “P₊ (positive example: +)” and the other “P₋ (negative example: -)”. In the example of FIG. 4, a tuple of an identifier (ID), a text (labeled text), and the label content (P₊, P₋) constitutes labeled data, where (P₊, P₋) = (1, 0) if the label indicates the set P₊ and (P₊, P₋) = (0, 1) if the label indicates the set P₋. A tuple of an identifier (ID), a text (unlabeled text), and information indicating that the label content (P₊, P₋) is unknown ((P₊, P₋) = (unk, unk)) constitutes unlabeled data.
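One possible in-memory representation of these records is sketched below. The class name `Record`, the helper `split`, and the sample rows are hypothetical; using `None` to stand for (unk, unk) is a choice made for this sketch only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    id: int
    text: List[str]                    # the text as a sequence of items
    label: Optional[Tuple[int, int]]   # (P+, P-); None stands for (unk, unk)


def split(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate labeled data from unlabeled data."""
    labeled = [r for r in records if r.label is not None]
    unlabeled = [r for r in records if r.label is None]
    return labeled, unlabeled


# Hypothetical rows (not the contents of FIG. 4).
rows = [
    Record(1, ["a", "b", "c"], (1, 0)),   # belongs to P+
    Record(2, ["c", "d"], (0, 1)),        # belongs to P-
    Record(3, ["a", "c", "d"], None),     # label unknown
]
labeled_rows, unlabeled_rows = split(rows)
```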

A predetermined attenuation parameter λ and a parameter K representing the upper limit on the number of elements of the output list are also stored in the storage unit 16e. The attenuation parameter λ is, for example, a value selected in advance from the range 0 ≤ λ ≤ 1, preferably from the range 0 < λ < 1. The parameter K is, for example, a positive integer selected in advance.

[Pattern extraction processing]
First, the training unit 11 extracts from the storage unit 16a the labeled texts contained in the labeled data and, using these labeled texts as training data, generates a classification model, that is, a statistical model configured to output probability data representing the probability that an arbitrary text to which it is applied belongs to a predetermined set (class). The generated classification model is stored in the storage unit 16b (step S11). Any classification model may be used as long as it is configured to output probability data representing the probability that an arbitrary text to which it is applied belongs to a predetermined set. For example, a model from a well-known statistical method such as the naive Bayes method or the maximum entropy method may be used. A classification model based on the naive Bayes method is illustrated below.

[Specific example of step S11 (naive Bayes method)]
In the naive Bayes method, the probability Pr(c|x) that a text x belongs to a predetermined set (class) c is defined by Equation (1).

Here, w_i denotes the i-th item of the text x, and N(w_i, x) denotes the number of occurrences of the item w_i in the text x. Pr(c) denotes the prior probability that the text x belongs to the set c and is defined by Equation (2).

Pr(w_i|c) is defined by Equation (3).
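The images of Equations (1) to (3) do not survive in this text. Assuming the standard multinomial naive Bayes formulation, which is consistent with the symbol definitions given here, they would take roughly the following form. This is a reconstruction under assumptions, not the patent's own typesetting: V (the set of distinct items) and D_c (the training texts labeled c) are symbols introduced for this sketch, and the add-one smoothing in (3) is an assumption.

```latex
\Pr(c \mid x) =
  \frac{\Pr(c)\,\prod_{w_i \in V} \Pr(w_i \mid c)^{N(w_i,\,x)}}
       {\sum_{c'} \Pr(c')\,\prod_{w_i \in V} \Pr(w_i \mid c')^{N(w_i,\,x)}}
  \tag{1}

\Pr(c) = \frac{|D_c|}{\sum_{c'} |D_{c'}|}
  \tag{2}

\Pr(w_i \mid c) =
  \frac{1 + \sum_{x \in D_c} N(w_i, x)}
       {|V| + \sum_{w \in V} \sum_{x \in D_c} N(w, x)}
  \tag{3}
```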

In the naive Bayes method, the classification model consists of the values of Equations (2) and (3) computed (learned) from the labeled texts used as training data. A classification model learned by the naive Bayes method is illustrated below.

In this example, the following 10 labeled texts are used as training data to generate a classification model configured to output probability data representing the probability that an arbitrary text x to which it is applied belongs to the set P₊.

In this case, the prior probability Pr(P₊) that a labeled training text belongs to the set c = P₊ indicated by the label P₊, and the prior probability Pr(P₋) that a labeled training text belongs to the set c = P₋ indicated by the label P₋, are as follows.

Pr(P₊) = 7/10 …(4)
Pr(P₋) = 3/10 …(5)
Pr(w_i|P₊) for each word w_i = a to i is as follows.

Pr(w_i|P₋) for each word w_i = a to i is as follows.

In this example, Pr(P₊|x) determined by Equation (1), together with Equations (4) to (23), constitutes the classification model (end of the description of [Specific example of step S11 (naive Bayes method)]). Step S11 need not be executed every time pattern extraction is performed; if a previously generated classification model can be used, step S11 may be omitted.
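A naive Bayes classification model of this kind can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the names `train_nb` and `posterior`, the add-one smoothing, the decision to ignore items unseen in training, and the three training texts are all assumptions, and the data are not the ten training texts of the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Model = Tuple[Dict[str, float], Dict[str, Dict[str, float]]]


def train_nb(labeled: List[Tuple[List[str], str]]) -> Model:
    """Learn class priors and per-class item probabilities (add-one smoothing)."""
    classes = sorted({c for _, c in labeled})
    vocab = sorted({w for items, _ in labeled for w in items})
    prior = {c: sum(1 for _, k in labeled if k == c) / len(labeled)
             for c in classes}
    counts = {c: Counter() for c in classes}
    for items, c in labeled:
        counts[c].update(items)
    cond = {}
    for c in classes:
        total = sum(counts[c].values())
        cond[c] = {w: (counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
    return prior, cond


def posterior(model: Model, items: List[str], cls: str = "+") -> float:
    """Pr(cls | text), computed in log space and normalised over classes."""
    prior, cond = model
    log_joint = {}
    for c in prior:
        s = math.log(prior[c])
        for w in items:               # one factor per occurrence of an item
            if w in cond[c]:          # items unseen in training are skipped
                s += math.log(cond[c][w])
        log_joint[c] = s
    top = max(log_joint.values())
    norm = sum(math.exp(v - top) for v in log_joint.values())
    return math.exp(log_joint[cls] - top) / norm


# Hypothetical training data: two positive texts and one negative text.
model = train_nb([(["a", "b"], "+"), (["a", "c"], "+"), (["d"], "-")])
p_pos = posterior(model, ["a"])
```

Applying `posterior` to each unlabeled text yields the probability data that the later steps consume. Working in log space avoids underflow for long texts.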

Next, the unlabeled texts and the classification model described above are input to the classification unit 12. The classification unit 12 applies the classification model to each unlabeled text and generates probability data representing the probability that the unlabeled text belongs to a predetermined set (step S12). The specific method depends on the classification model. For example, when the naive Bayes classification model described above is used, probability data are generated as follows.

[Specific example of step S12 (naive Bayes method)]
When the naive Bayes classification model described above is used, the probability Pr(c|x) that an input unlabeled text x belongs to the predetermined set c is computed according to Equation (1). For example, when the unlabeled text x = a a c c d d f is applied to the classification model consisting of Pr(P₊|x) determined by Equation (1) and Equations (4) to (23), the probability Pr(P₊|x) that this unlabeled text belongs to the predetermined set indicated by the label P₊ is as follows.

(End of the description of [Specific example of step S12 (naive Bayes method)].)

The probability Pr(P₊|x) generated in this way is associated with the corresponding unlabeled text and stored in the storage unit 16c as data with probability data. In the example of FIG. 4, data with probability data in which an identifier (ID), a text (unlabeled text), the probability Pr(P₊|x), and the probability Pr(P₋|x) = 1 - Pr(P₊|x) are associated with one another are stored in the storage unit 16c.

Next, the labeled data stored in the storage unit 16a, the data with probability data stored in the storage unit 16c, and the attenuation parameter λ stored in the storage unit 16e are input to the database synthesis unit 13 (FIG. 2). The database synthesis unit 13 generates a database combining the labeled data and the data with probability data (step S13). In doing so, because the probabilities Pr(P₊|x) and Pr(P₋|x) in the data with probability data were generated automatically by the classification model, they are multiplied by the attenuation parameter λ. In the example of FIG. 4, the database is generated with λ = 0.5. The generated database is stored in the storage unit 16d.
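Step S13 can be sketched as follows. The function name `synthesize`, the tuple layouts, and the sample rows are hypothetical and are not the contents of FIG. 4. Labeled rows keep their 0/1 label values, while rows scored by the classifier carry λ-scaled probabilities.

```python
from typing import List, Tuple

LabeledRow = Tuple[int, List[str], Tuple[int, int]]   # (id, text, (P+, P-))
ScoredRow = Tuple[int, List[str], float]              # (id, text, Pr(P+|x))
DbRow = Tuple[int, List[str], float, float]           # (id, text, w+, w-)


def synthesize(labeled: List[LabeledRow], scored: List[ScoredRow],
               lam: float) -> List[DbRow]:
    """Merge labeled data with classifier-scored data, attenuating the latter."""
    db = [(i, text, float(p), float(n)) for i, text, (p, n) in labeled]
    db += [(i, text, lam * pr, lam * (1.0 - pr)) for i, text, pr in scored]
    return db


# Hypothetical inputs with attenuation parameter 0.5.
db = synthesize(
    labeled=[(1, ["a", "b"], (1, 0)), (2, ["c"], (0, 1))],
    scored=[(3, ["a", "c"], 0.8)],
    lam=0.5,
)
```

With λ = 0.5, the scored text with Pr(P₊|x) = 0.8 enters the database with weights of about 0.4 and 0.1, so an automatically scored text counts for at most half of a hand-labeled one.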

Next, the extraction unit 14 takes as input the database stored in the storage unit 16d and the parameter K and attenuation parameter λ stored in the storage unit 16e, extracts K patterns that contribute strongly to the label classification of texts (that is, to automatic text classification), and outputs an output list whose elements are those patterns (step S14). The processing of step S14 is described in detail below.

[Index]
In the conventional method, both whether to extract a pattern contained in a database and whether to project the database into a new one for recursive processing were decided on the basis of pattern frequency. In this embodiment, by contrast, these decisions are made on the basis of an index representing the degree to which a pattern contributes to label classification (that is, the degree of relevance between the pattern and the classification result obtained when a text containing the pattern is classified into the set to which it belongs), rather than on pattern frequency.

This index is generated by the index generation unit 14a of the extraction unit 14 using at least values determined from the probability data generated by the classification unit 12. An example of the index generated by the index generation unit 14a is given below.

In this example, consider the following contingency table.

Here, the database stored in the storage unit 16d is assumed to contain |D^L| labeled texts (|D^L| ≥ 0) and |D^U| unlabeled texts (|D^U| > 0) for each of which the classification unit 12 has generated probability data. In this example, |D^L| ≥ 1. |D^L| denotes the total number of labeled texts in the database, and |D^U| denotes the total number of unlabeled texts in the database. D^L denotes the set of labeled texts, and D^U denotes the set of unlabeled texts.

N is the sum of a value determined from the total number |D^L| of labeled texts in the database and a value determined from the total number |D^U| of unlabeled texts in the database. In this example, N is the sum of |D^L| and the product of |D^U| and the attenuation parameter λ:

N=|D^L|+λ・|D^U| …(25)

M is the sum of a value determined from the number of labeled texts in the database that are associated with a label indicating membership in the predetermined set, and the total of the values determined from the probability data, generated by the classification unit 12, representing the probability that each unlabeled text belongs to the set P₊. In this example, M is the sum of the number of labeled texts associated with a label indicating membership in P₊ and the total, over the unlabeled texts, of the product of the probability that the text belongs to P₊ and the attenuation parameter λ:

M=Σ_{d_i∈D^L} I(d_i)+λ・Σ_{d_j∈D^U} Pr(P₊|d_j) …(26)

Here, d_i denotes a text, and I(d_i) is a function that is 1 when the labeled text d_i∈D^L belongs to the set P₊ and 0 otherwise. Pr(P₊|d_j) (d_j∈D^U) denotes the probability that the unlabeled text d_j∈D^U belongs to the set P₊.

y(α) is the sum of a value determined from the number of labeled texts in the database that are associated with a label indicating membership in the predetermined set and that contain the pattern α, and the total of the values determined from the probability data representing the probability that each unlabeled text containing the pattern α belongs to the predetermined set. In this example, y(α) is the sum of the number of labeled texts that are associated with a label indicating membership in P₊ and contain α, and the total, over the unlabeled texts containing α, of the product of the probability that the text belongs to P₊ and the attenuation parameter λ:

y(α)=Σ_{d_i∈D^L} I(d_i)・F(d_i,α)+λ・Σ_{d_j∈D^U} Pr(P₊|d_j)・F(d_j,α) …(27)

Here, F(d_i,α) is a function that is 1 when the text d_i contains the pattern α and 0 otherwise.

x(α) is the sum of a value determined from the total number of labeled texts containing the pattern α and a value determined from the total number of unlabeled texts containing the pattern α. In this example, x(α) is the sum of the number of labeled texts containing α and the product of the number of unlabeled texts containing α and the attenuation parameter λ:

x(α)=Σ_{d_i∈D^L} F(d_i,α)+λ・Σ_{d_j∈D^U} F(d_j,α) …(28)
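The weighted counts N, M, y(α), and x(α) described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Text` record, the `contains` helper, and the toy database are hypothetical, and "contains the pattern α" is assumed to mean that the items of α occur in the text in order, possibly with gaps.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Text:
    items: Sequence[str]          # the symbol sequence of the text
    label: Optional[bool] = None  # True/False for labeled text, None if unlabeled
    prob: float = 0.0             # Pr(P+ | d) from the classifier (unlabeled text only)


def contains(items, pattern):
    """F(d, alpha): 1 if pattern occurs in items as an ordered (gapped) subsequence."""
    it = iter(items)
    return int(all(sym in it for sym in pattern))


def weighted_counts(db, pattern, lam):
    """Return (N, M, x, y) for one pattern; lam is the attenuation parameter."""
    labeled = [d for d in db if d.label is not None]
    unlabeled = [d for d in db if d.label is None]
    N = len(labeled) + lam * len(unlabeled)
    M = sum(d.label for d in labeled) + lam * sum(d.prob for d in unlabeled)
    y = (sum(d.label * contains(d.items, pattern) for d in labeled)
         + lam * sum(d.prob * contains(d.items, pattern) for d in unlabeled))
    x = (sum(contains(d.items, pattern) for d in labeled)
         + lam * sum(contains(d.items, pattern) for d in unlabeled))
    return N, M, x, y


# Hypothetical toy database: two labeled texts and two unlabeled texts.
db = [Text("abc", True), Text("acd", False),
      Text("abd", None, 0.8), Text("bcd", None, 0.2)]
N, M, x, y = weighted_counts(db, "ab", lam=0.5)
```

Unlabeled texts enter every count scaled by λ, so a smaller λ reduces the influence of the classifier's soft labels relative to the hand-labeled texts.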

The Χ²(α) given by Equation (29) is then used as the index representing the degree of relevance between the pattern α and the classification result obtained when a text containing α is classified into the set to which the text belongs.
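The contingency table and Equation (29) are not reproduced in this text. As an assumption only, the usual 2×2 table built from N, M, x = x(α), and y = y(α), together with the standard chi-squared statistic for such a table, would take the form below; the patent's actual Equation (29) may differ.

```latex
% Assumed 2x2 contingency table (rows: pattern present/absent; columns: set P_+ / other)
\begin{array}{c|cc|c}
                       & P_{+} & P_{-}        & \text{total} \\ \hline
\alpha\ \text{present} & y     & x - y        & x            \\
\alpha\ \text{absent}  & M - y & N - M - x + y & N - x       \\ \hline
\text{total}           & M     & N - M        & N
\end{array}

% Standard chi-squared statistic for this table; the cross-product
% y(N-M-x+y) - (x-y)(M-y) simplifies to yN - xM.
\chi^{2}(\alpha) = \frac{N\,\bigl(yN - xM\bigr)^{2}}{x\,(N - x)\,M\,(N - M)}
```

Under this form the statistic is zero exactly when y/x = M/N, that is, when texts containing α fall into P₊ at the same rate as the database as a whole.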

In this example, the index generation unit 14a takes as input the database stored in the storage unit 16d and the attenuation parameter λ stored in the storage unit 16e, and generates the index Χ²(α) according to Equation (29). The index Χ²(α) for the pattern α is stored in the storage unit 16e.

Based on this index, the extraction unit 14 outputs, as the elements of the output list L, the K patterns selected in descending order of their contribution to label classification. In this embodiment, the index is generated for each pattern in turn, and once indices have been generated for K or more patterns, the K-th index counted in descending order of contribution to label classification is taken as the threshold τ_K. Thereafter, depending on whether each newly generated index satisfies an output condition based on τ_K, it is decided whether the corresponding pattern becomes an element of the output list L and whether projection is performed to generate a new database for recursive processing. For example, when the index Χ²(α) of Equation (29) is used, the K-th largest of the indices generated so far is taken as the threshold τ_K once K or more indices exist. Thereafter, depending on whether each newly generated index satisfies the output condition of exceeding τ_K, it is decided whether the corresponding pattern becomes an element of the output list L and whether projection is performed to generate a new database for recursive processing.
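The threshold bookkeeping described above can be sketched as a small top-K list. This is an illustrative sketch only: the function name `offer` and the tuple layout are hypothetical, and the index values fed in are taken as given inputs (they are the values quoted in the worked example of step S14) rather than computed.

```python
import math


def offer(L, K, tau, pattern, score):
    """Offer one scored pattern to the top-K list; return (L, tau, accepted)."""
    if len(L) < K:                        # list not yet full: always accept
        L = L + [(pattern, score)]
        if len(L) == K:                   # list just became full: first threshold
            L = sorted(L, key=lambda e: e[1], reverse=True)
            tau = L[-1][1]
        return L, tau, True
    if score > tau:                       # beats the current K-th best: replace it
        L = sorted(L[:-1] + [(pattern, score)], key=lambda e: e[1], reverse=True)
        return L, L[-1][1], True
    return L, tau, False                  # not better than the K-th best


L, tau, history = [], math.nan, []
stream = [("a", 0.0), ("a b", 0.5), ("a b c", 1.3), ("a b c c", 1.3),
          ("a b d", 0.2), ("a c", 2.1), ("a c c", 1.3), ("a d", 0.2)]
for pattern, score in stream:
    L, tau, accepted = offer(L, K=4, tau=tau, pattern=pattern, score=score)
    history.append(tau)                   # threshold after each offered pattern
```

Because τ_K only ever rises once the list is full, later patterns face a stricter output condition than earlier ones, which is what makes the pruning in the search effective.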

[Limit value of the index]
However, even when the index for a pattern α does not satisfy the output condition, the index for a new pattern obtained by appending one or more items to α may satisfy it. That is, even when the index does not satisfy the output condition, performing projection to generate a new database and processing it recursively may still find a pattern that satisfies the output condition.

For this reason, in this embodiment the limit value generation unit 14b of the extraction unit 14 generates a limit value of the index representing the degree of relevance between an arbitrary pattern, which is any sequence obtained by appending one or more items to a pattern α, and the classification result obtained when any text containing that arbitrary pattern is classified into the set to which the text belongs. The limit value represents the best value (the value indicating the highest contribution to label classification) that the index can take for any pattern obtained by appending one or more items to α. For example, when the index Χ²(α) of Equation (29) is used, the following limit value Χ²_max(α) is generated for the pattern α. Here, Χ²(α, y=x) denotes Χ²(α) when y=x, Χ²(α, y=0) denotes Χ²(α) when y=0, and max(ν, μ) is a function that returns ν when ν≧μ and μ when ν＜μ.

Χ²_max(α)=max(Χ²(α, y=x), Χ²(α, y=0)) …(30)

Then, depending on whether the limit value satisfies a predetermined search condition, it is decided whether to perform projection to generate a new database and process it recursively.
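Equation (30) can be sketched in Python as below. The `chi_square` function uses the standard chi-squared statistic for a 2×2 table as a stand-in for Equation (29), which is an assumption; returning 0.0 for a degenerate table (a zero marginal) is also an assumption, and both function names are hypothetical.

```python
def chi_square(x, y, M, N):
    """Assumed stand-in for Equation (29): standard 2x2 chi-squared statistic."""
    denom = x * (N - x) * M * (N - M)
    if denom == 0:
        return 0.0  # assumption: treat a degenerate table as carrying no evidence
    return N * (y * N - x * M) ** 2 / denom


def chi_square_max(x, M, N):
    """Equation (30): the larger of the index evaluated at y = x and at y = 0."""
    return max(chi_square(x, x, M, N), chi_square(x, 0.0, M, N))
```

For a fixed x the assumed statistic is a convex function of y, so over 0 ≦ y ≦ x it is largest at one of the two end points; this is why evaluating only y=x and y=0 suffices in this sketch.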

[Search process]
The search unit 14c of the extraction unit 14 uses the index and the limit value generated for the pattern under evaluation to decide whether that pattern becomes an element of the output list, and whether to perform projection to generate a new database and process it recursively.

(I) When the index generated for a first pattern α satisfies a predetermined first output condition (for example, Χ²(α)＞τ_K), the search unit 14c outputs the first pattern α as an element of the output list, and the index generation unit 14a generates a second index representing the degree of relevance between a second pattern, which is a sequence obtained by appending one or more items to the first pattern α, and the classification result obtained when a second text containing the second pattern is classified into the set to which the second text belongs. If, in the subsequent recursive processing, the second index satisfies a predetermined second output condition, the second pattern is output as an element of the output list.

(II) When the index for the first pattern α does not satisfy the first output condition but the limit value generated by the limit value generation unit 14b satisfies a predetermined search condition (for example, Χ²(α)≦τ_K and Χ²_max(α)＞τ_K), the search unit 14c does not output the first pattern α as an element of the output list, and the index generation unit 14a generates a third index representing the degree of relevance between a third pattern, which is a sequence obtained by appending one or more items to the first pattern α, and the classification result obtained when a third text containing the third pattern is classified into the set to which the third text belongs. If, in the subsequent recursive processing, the third index satisfies a predetermined third output condition, the third pattern is output as an element of the output list.

(III) When the index for the first pattern α does not satisfy the first output condition and the limit value does not satisfy the predetermined search condition either (for example, Χ²(α)≦τ_K and Χ²_max(α)≦τ_K), the first pattern α is not made an element of the output list and the third index is not generated.

Restricted to the case in which the index Χ²(α) of Equation (29) and the limit value Χ²_max(α) of Equation (30) are used, the processing of the extraction unit 14 can be restated as follows.

(I) If Χ²(α)＞τ_K, the pattern α is output as an element of the output list, projection is performed, each sequence obtained by appending one or more items to α is taken as a new pattern α, and the processing is repeated recursively.

(II) If Χ²(α)≦τ_K and Χ²_max(α)＞τ_K, projection is performed without making the pattern α an element of the output list, each sequence obtained by appending one or more items to α is taken as a new pattern α, and the processing is repeated recursively.

(III) If Χ²(α)≦τ_K and Χ²_max(α)≦τ_K, the pattern α is not made an element of the output list, projection is not performed, and no sequence obtained by appending one or more items to α is processed.
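The three cases (I) to (III) reduce to two comparisons against the threshold. The sketch below is illustrative: the `Action` enum and the `decide` function are hypothetical names, and the threshold is assumed to be already determined (not nan).

```python
from enum import Enum


class Action(Enum):
    OUTPUT_AND_EXPAND = "I"    # add the pattern to the output list, then project
    EXPAND_ONLY = "II"         # do not output the pattern, but still project
    PRUNE = "III"              # neither output nor project


def decide(chi2, chi2_max, tau_k):
    """Map an index, its limit value, and the threshold to case (I), (II), or (III)."""
    if chi2 > tau_k:
        return Action.OUTPUT_AND_EXPAND
    if chi2_max > tau_k:
        return Action.EXPAND_ONLY
    return Action.PRUNE
```

For instance, with τ_K=1.3, an index of 0 and a limit value of 2.7 fall under case (II), while an index of 0.2 and a limit value of 0.5 fall under case (III).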

[Specific example of the process of step S14]
FIGS. 5 to 7 are flowcharts illustrating a specific example of the process of step S14. FIG. 8 shows pseudocode for the same example. The specific example is described below with reference to these figures.

First, the search unit 14c of the extraction unit 14 sets the pattern α to empty (α=[ ]), sets R to the database stored in the storage unit 16d, and sets the threshold τ_K=nan (a value indicating that it is undetermined) (step S1401). These values are stored in the storage unit 16e.

Next, the search unit 14c determines whether R is the empty set (R={ }) (step S1402). If R={ }, the process ends. Otherwise, the search unit 14c sets RR to the empty set (RR←{ }) (step S1403) and determines whether α is empty (α=[ ]) (step S1404).

If α is determined to be empty, the search unit 14c sets RR to R (RR←R) (step S1405), sets β to the set of items contained in R (β←itemset(R)) (step S1406), and proceeds to step S1415, described later.

If, on the other hand, α is determined not to be empty, the search unit 14c selects an entry trans (trans∈R), which is a sequence of items contained in R (step S1408). When projection has never been performed and R is the database stored in the storage unit 16d, each text in that database corresponds to an entry trans. When projection has been performed previously, an entry trans is the sequence that remains after the items deleted by the earlier projections are removed from a text. Next, the search unit 14c deletes from the selected entry trans everything from its beginning up to the position where the last item of α, last(α), first occurs, and sets the remaining sequence as subseq (subseq←postseq(last(α), trans)) (step S1409). Next, the search unit 14c determines whether subseq is non-empty (subseq≠[ ]) (step S1410). If subseq is not empty, the search unit 14c appends subseq to RR as an entry to form the new RR (RR←append(RR, subseq)) (step S1411) and proceeds to step S1412. If subseq is empty, the search unit 14c proceeds to step S1412 without executing step S1411. In step S1412, the search unit 14c determines whether all entries trans∈R have been processed. If not, the process returns to step S1408. If so, the search unit 14c sets β to the set of items contained in RR (β←itemset(RR)) (step S1413) and proceeds to step S1415. The processing of steps S1408 to S1413 corresponds to projection.
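The projection of steps S1408 to S1413 can be sketched as follows. The names `postseq` and `project` mirror the pseudocode identifiers, but the bodies are assumptions: in particular, an entry that does not contain last(α) at all is assumed to yield an empty subseq and is therefore dropped, and the per-entry weights needed to compute x(α) and y(α) are omitted for brevity.

```python
def postseq(item, trans):
    """Return the part of trans after the first occurrence of item (empty if absent)."""
    for pos, sym in enumerate(trans):
        if sym == item:
            return trans[pos + 1:]
    return trans[:0]  # assumption: no occurrence means nothing remains


def project(R, alpha):
    """Project database R on the last item of alpha; return (RR, beta)."""
    RR = []
    for trans in R:                              # S1408, S1412: loop over entries
        subseq = postseq(alpha[-1], trans)       # S1409
        if len(subseq) > 0:                      # S1410
            RR.append(subseq)                    # S1411
    beta = sorted({sym for seq in RR for sym in seq})  # S1413: itemset(RR)
    return RR, beta


# Hypothetical database of three texts over the items a, b, c, d.
R = ["abcc", "acbd", "bcd"]
RR1, beta1 = project(R, "a")       # suffixes after the first "a"
RR2, beta2 = project(RR1, "ab")    # then suffixes after the first "b"
```

Each projection only shortens the entries, so the recursion depth is bounded by the length of the longest text.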

In step S1415, the search unit 14c selects an item belonging to β (item∈β). Next, the search unit 14c generates, as the new pattern α, the sequence obtained by appending the item selected in step S1415 to the end of α (α←append(α,[item])) (step S1416). Next, the search unit 14c determines whether the number of elements |L| of the output list L is less than the parameter K stored in the storage unit 16e (|L|<K) (step S1417). If |L|<K, the index generation unit 14a generates the index Χ²(α) for the pattern α according to Equation (29), and the search unit 14c adds the pair [α, Χ²(α)] of the pattern α and the index Χ²(α) to the output list L as an element. The updated output list L (L←append(L, [α, Χ²(α)])) is stored in the storage unit 16e (step S1418). Next, the search unit 14c determines whether |L| equals the parameter K stored in the storage unit 16e (|L|=K) (step S1419). If |L|=K does not hold, the process proceeds to step S1422, described later. If |L|=K holds, the search unit 14c sorts the elements [α, Χ²(α)] of the output list L in descending order of the index Χ²(α), takes the result as the new output list L (L=sort(L)), and stores it in the storage unit 16e (step S1420). Next, the search unit 14c sets the index of the K-th element of the output list L (the smallest index) as the threshold τ_K (τ_K=Χ²(L[K])), stores the updated threshold τ_K in the storage unit 16e (step S1421), and proceeds to step S1422.

In step S1422, the extraction unit 14 recursively executes the processing of steps S1402 to S1431 on the current (α, RR, K) (call WTPS(α, RR, K)). The process then proceeds to step S1431, described later.

If, on the other hand, it is determined in step S1417 that |L|<K does not hold, the index generation unit 14a generates the index Χ²(α) for the pattern α according to Equation (29) and stores it in the storage unit 16e, and the search unit 14c determines, using the threshold τ_K stored in the storage unit 16e, whether Χ²(α)＞τ_K holds (step S1423). When the threshold is undetermined (τ_K=nan), Χ²(α)＞τ_K is treated as not holding. If Χ²(α)＞τ_K holds, the search unit 14c deletes the last element of the output list L stored in the storage unit 16e (the element with the smallest index Χ²(α)), generates a new output list L consisting of the remaining elements (L=lastdel(L)), and stores it in the storage unit 16e (step S1424). Next, the search unit 14c adds the pair [α, Χ²(α)] of the pattern α and the index Χ²(α) to the output list L as an element, and the updated output list L (L←append(L, [α, Χ²(α)])) is stored in the storage unit 16e (step S1425). Next, the search unit 14c sorts the elements [α, Χ²(α)] of the output list L in descending order of the index Χ²(α), takes the result as the new output list L (L=sort(L)), and stores it in the storage unit 16e (step S1426). Next, the search unit 14c sets the index of the K-th element of the output list L (the smallest index) as the threshold τ_K (τ_K=Χ²(L[K])) and stores the updated threshold τ_K in the storage unit 16e (step S1427). Next, the extraction unit 14 recursively executes the processing of steps S1402 to S1431 on the current (α, RR, K) (call WTPS(α, RR, K)) (step S1428). The process then proceeds to step S1431, described later.

If, on the other hand, it is determined in step S1423 that Χ²(α)＞τ_K does not hold, the limit value generation unit 14b generates the limit value Χ²_max(α) for the pattern α according to Equation (30) and stores it in the storage unit 16e, and the search unit 14c determines, using the threshold τ_K stored in the storage unit 16e, whether Χ²_max(α)＞τ_K holds (step S1429). If Χ²_max(α)＞τ_K holds, the extraction unit 14 recursively executes the processing of steps S1402 to S1431 on the current (α, RR, K) (call WTPS(α, RR, K)) (step S1430), and the process then proceeds to step S1431. If Χ²_max(α)＞τ_K does not hold, the process proceeds to step S1431 without executing step S1430.

In step S1431, the search unit 14c determines whether all items item∈β have been processed. If so, the process of step S14 ends (within a recursive call, that call ends and control returns to its caller). If not, the process returns to step S1415.
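Putting steps S1401 to S1431 together, the recursion can be sketched as below. This is a sketch under stated assumptions rather than the pseudocode of FIG. 8: `chi_square` is the standard 2×2 chi-squared statistic standing in for Equation (29), each database entry is a hypothetical tuple (sequence, weight, positive mass) where the weight is 1 for labeled text and λ for unlabeled text, items are single characters, and the extended pattern is kept in a local variable instead of overwriting α.

```python
import math


def chi_square(x, y, M, N):
    """Assumed stand-in for Equation (29); 0.0 when a marginal total is zero."""
    denom = x * (N - x) * M * (N - M)
    return 0.0 if denom == 0 else N * (y * N - x * M) ** 2 / denom


def wtps(alpha, R, K, M, N, state):
    """Recursive search; state holds the output list 'L' and the threshold 'tau'."""
    if not R:                                            # S1402
        return
    if alpha:                                            # S1408-S1413: projection
        RR = []
        for seq, w, p in R:
            if alpha[-1] in seq:
                sub = seq[seq.index(alpha[-1]) + 1:]
                if sub:
                    RR.append((sub, w, p))
    else:                                                # S1405
        RR = R
    beta = sorted({sym for seq, _, _ in RR for sym in seq})
    L = state["L"]
    for item in beta:                                    # S1415 ... S1431
        new_alpha = alpha + (item,)                      # S1416
        x = sum(w for seq, w, _ in RR if item in seq)    # weighted count with pattern
        y = sum(p for seq, _, p in RR if item in seq)    # weighted count in P+
        score = chi_square(x, y, M, N)
        if len(L) < K:                                   # S1417-S1422
            L.append((new_alpha, score))
            if len(L) == K:
                L.sort(key=lambda e: e[1], reverse=True)
                state["tau"] = L[-1][1]
            wtps(new_alpha, RR, K, M, N, state)
        elif score > state["tau"]:                       # S1423-S1428, case (I)
            L[-1] = (new_alpha, score)                   # drop the K-th, add the new
            L.sort(key=lambda e: e[1], reverse=True)
            state["tau"] = L[-1][1]
            wtps(new_alpha, RR, K, M, N, state)
        elif max(chi_square(x, x, M, N),
                 chi_square(x, 0.0, M, N)) > state["tau"]:  # S1429-S1430, case (II)
            wtps(new_alpha, RR, K, M, N, state)
        # otherwise case (III): prune this branch


# Hypothetical database: (sequence, weight, positive mass), with lambda = 0.5.
lam = 0.5
db = [("abc", 1.0, 1.0), ("bd", 1.0, 0.0),
      ("acd", lam, lam * 0.8), ("bcd", lam, lam * 0.2)]
N = sum(w for _, w, _ in db)
M = sum(p for _, _, p in db)
state = {"L": [], "tau": math.nan}                       # S1401
wtps((), db, 2, M, N, state)
```

After the call, `state["L"]` holds at most K (pattern, index) pairs in descending order of the index, and `state["tau"]` holds the index of the last of them.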

[Worked example of step S14]
FIGS. 9 to 11 illustrate a worked example of step S14, which is described below with reference to these figures. In this example, the database of FIG. 3 (with the attenuation parameter λ=0.5 applied) is used as the initial database stored in the storage unit 16d, with K=4 and λ=0.5. M=1.8 and N=3.5; these are constants for the initial database. In the tree structures shown in FIGS. 10 and 11, the upper-right superscript on the item a, b, c, or d of each node is the index Χ²(α) of the pattern α formed by the sequence of items from the root of the tree to that node, and the lower-right subscript is the limit value Χ²_max(α) of that pattern. The x shown in FIGS. 10 and 11 is x(α), generated according to Equation (28) for the new pattern α obtained by appending an item of β to the end of the pattern α (α←append(α,[item])). Likewise, the y shown in FIGS. 10 and 11 is y(α), generated according to Equation (27) for that new pattern α.

1. First, α=[ ], RR←R, and β={a,b,c,d} are set (steps S1401 to S1406 / FIG. 10(A)).

2. Next, α=a is set (steps S1415, S1416). Since |L|<K (K=4), the pair [a, 0] of a and its index Χ²(a)=0 is added to the output list L (steps S1417, S1418 / FIG. 9(A)). At this point the threshold is τ_K=nan. Projection is then performed (step S1422), and within that recursive process (call WTPS(α, RR, K) / steps S1402 to S1431) β={b,c,d} is set (FIG. 10(B)).

3. Within the recursive process of item 2, α=a b is set (steps S1415, S1416 / FIG. 10(B)). Since |L|<K (K=4), the pair [a b, 0.5] of a b and its index Χ²(a b)=0.5 is added to the output list L (steps S1417, S1418 / FIG. 9(B)). At this point the threshold is τ_K=nan. Projection is then performed (step S1422), and within that recursive process β={c,d} is set (FIG. 10(B)).

4. Within the recursive process of item 3, c is selected as item from β={c, d}, and α=a b c is set (steps S1415, S1416 / FIG. 10(B)). Since |L|<K (K=4), the pair [a b c, 1.3] of a b c and its index Χ²(a b c)=1.3 is added to the output list L (steps S1417, S1418 / FIG. 9(C)). At this point the threshold is τ_K=nan. Projection is then performed (step S1422), and within that recursive process β={c} is set (FIG. 10(B)).

5. Within the recursive process of item 4, α=a b c c is set (steps S1415, S1416 / FIG. 10(B)). Since |L|<K (K=4), the pair [a b c c, 1.3] of a b c c and its index Χ²(a b c c)=1.3 is added to the output list L (steps S1417, S1418).

6. Within the recursive process of item 4, |L|=K (K=4) now holds, so the elements of the output list L are sorted by the index Χ²(α) and the threshold τ_K=0 is obtained (steps S1419 to S1421 / FIG. 9(D)).

7. No further projection is possible (step S1431 of the recursive process of item 4 yields yes), so the recursive process of item 4 ends and control returns to the recursive process of item 3. There, α=a b and β={c, d}, and c has already been processed.

8. Within the recursive process of item 3, d is selected as item from β={c, d}, and α=a b d is set (steps S1415, S1416 / FIG. 10(B)). |L|<K (K=4) does not hold (step S1417), and the index Χ²(a b d)=0.2 for α=a b d exceeds the threshold τ_K=0 (step S1423). Therefore [a, 0] is deleted from the output list L, the pair [a b d, 0.2] is added, the elements of L are re-sorted, and the threshold is updated to τ_K=0.2 (steps S1424 to S1427 / FIG. 9(E)).

9. No further projection is possible (step S1431 of the recursive process of item 3 yields yes), so the recursive process of item 3 ends and control returns to the recursive process of item 2. There, α=a and β={b, c, d}, and b has already been processed (FIG. 10(B)).

10. Within the recursive process of item 2, c is selected as item from β={b, c, d}, and α=a c is set (steps S1415, S1416 / FIG. 10(B)). |L|<K (K=4) does not hold (step S1417), and the index Χ²(a c)=2.1 for α=a c exceeds the threshold τ_K=0.2 (step S1423). Therefore [a b d, 0.2] is deleted from the output list L, the pair [a c, 2.1] is added, the elements of L are re-sorted, and the threshold is updated to τ_K=0.5 (steps S1424 to S1427 / FIG. 9(F)). Projection is then performed (step S1428), and within that recursive process β={c} is set (FIG. 10(B)).

11. 10の再帰的処理の中で、β={c}のうちcがitemとして選択され、α=a c cとされる（ステップ１４１５，Ｓ１４１６／図１０（Ｂ））。|L|<K(k=4)でなく（ステップＳ１４１７）、α=a c cに対する指標Χ²(a c c)=1.3が閾値τ_K=0.5を超えるため（ステップＳ１４２３）、出力リストLから[a b, 0.5]が削除され、a c cと指標Χ²(a c c)=1.3との組[a c, 2.1]が出力リストLの要素に追加され、出力リストLの要素が並び替えられて、閾値がτ_K=1.3に更新される（ステップＳ１４２４〜Ｓ１４２７／図９（Ｇ））。 11. In 10 recursive processes, c of β = {c} is selected as item, and α = acc is set (step 1415, S1416 / FIG. 10B). Since | L | <K (k = 4) is not satisfied (step S1417) and the index Χ ² (acc) = 1.3 for α = acc exceeds the threshold τ _K = 0.5 (step S1423), [ab, 0.5] is deleted, the set [ac, 2.1] with acc and index Χ ² (acc) = 1.3 is added to the elements of the output list L, the elements of the output list L are rearranged, and the threshold is τ _K = Updated to 1.3 (steps S1424 to S1427 / FIG. 9G).

12. No further projection is possible (step S1431 in the recursive process of step 10 yields yes), so the recursive process of step 10 ends and control returns to the recursive process of step 2. There, α = a and β = {b, c, d}, of which b and c have already been processed (FIG. 10(B)).

13. In the recursive process of step 2, d is selected from β = {b, c, d} as item, and α = a d is set (steps S1415, S1416 / FIG. 10(B)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(a d) = 0.2 nor the limit value Χ²_max(a d) = 0.5 for α = a d exceeds the threshold τ_K = 1.3 (steps S1423, S1429). Since all elements of β = {b, c, d} have now been processed (step S1431), the recursive process of step 2 ends and control returns to the initial loop. In the initial loop, α = [ ] and β = {a, b, c, d}, of which a has already been processed.

14. In the initial loop, b is selected from β = {a, b, c, d} as item, and α = b is set (steps S1415, S1416 / FIG. 10(C)). |L| < K (K = 4) does not hold (step S1417). The index Χ²(b) = 0 for α = b does not exceed the threshold τ_K = 1.3 (step S1423), but the limit value Χ²_max(b) = 2.7 does (step S1429), so projection is performed (step S1430), and β = {a, c, d} is set in the resulting recursive process (FIG. 10(C)).

15. In the recursive process of step 14, a is selected from β = {a, c, d} as item, and α = b a is set (steps S1415, S1416 / FIG. 10(C)). |L| < K (K = 4) does not hold (step S1417). The index Χ²(b a) = 0.5 for α = b a does not exceed the threshold τ_K = 1.3 (step S1423), but the limit value Χ²_max(b a) = 1.6 does (step S1429), so projection is performed (step S1430), and β = {c} is set in the resulting recursive process (FIG. 10(C)).

16. In the recursive process of step 15, c is selected from β = {c} as item, and α = b a c is set (steps S1415, S1416 / FIG. 10(C)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(b a c) = 0.3 nor the limit value Χ²_max(b a c) = 0.5 for α = b a c exceeds the threshold τ_K = 1.3 (steps S1423, S1429). Since all elements of β = {c} have been processed (step S1431), the recursive process of step 15 ends and control returns to the recursive process of step 14. There, β = {a, c, d}, of which a has already been processed.

17. In the recursive process of step 14, c is selected from β = {a, c, d} as item, and α = b c is set (steps S1415, S1416 / FIG. 10(C)). |L| < K (K = 4) does not hold (step S1417). The index Χ²(b c) = 0.2 for α = b c does not exceed the threshold τ_K = 1.3 (step S1423), but the limit value Χ²_max(b c) = 2.3 does (step S1429), so projection is performed (step S1430), and β = {a, c} is set in the resulting recursive process (FIG. 10(C)).

18. In the recursive process of step 17, a is selected from β = {a, c} as item, and α = b c a is set (steps S1415, S1416 / FIG. 10(C)). |L| < K (K = 4) does not hold (step S1417), and the index Χ²(b c a) = 1.5 for α = b c a exceeds the threshold τ_K = 1.3 (step S1423). Therefore [a c c, 1.3] is deleted from the output list L, the pair [b c a, 1.5] of b c a and the index Χ²(b c a) = 1.5 is added as an element of L, the elements of L are re-sorted, and the threshold is set to τ_K = 1.3 (steps S1424 to S1427 / FIG. 9(H)).

19. No further projection is possible, so in the recursive process of step 17, c is selected from β = {a, c} as item, and α = b c c is set (steps S1415, S1416 / FIG. 10(C)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(b c c) = 1.3 nor the limit value Χ²_max(b c c) = 1.3 for α = b c c exceeds the threshold τ_K = 1.3 (steps S1423, S1429). Step S1431 then yields yes, so the recursive process of step 17 ends and control returns to the recursive process of step 14. There, β = {a, c, d}, of which a and c have already been processed.

20. In the recursive process of step 14, d is selected from β = {a, c, d} as item, and α = b d is set (steps S1415, S1416 / FIG. 10(C)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(b d) = 0.2 nor the limit value Χ²_max(b d) = 0.5 for α = b d exceeds the threshold τ_K = 1.3 (steps S1423, S1429). Step S1431 then yields yes, so the recursive process of step 14 ends and control returns to the initial loop. In the initial loop, β = {a, b, c, d}, of which a and b have already been processed.

21. In the initial loop, c is selected from β = {a, b, c, d} as item, and α = c is set (steps S1415, S1416 / FIG. 11(A)). |L| < K (K = 4) does not hold (step S1417). The index Χ²(c) = 0.2 for α = c does not exceed the threshold τ_K = 1.3 (step S1423), but the limit value Χ²_max(c) = 3.1 does (step S1429), so projection is performed (step S1430), and β = {a, c} is set in the resulting recursive process (FIG. 11(A)).

22. In the recursive process of step 21, a is selected from β = {a, c} as item, and α = c a is set (steps S1415, S1416 / FIG. 11(A)). |L| < K (K = 4) does not hold (step S1417), and the index Χ²(c a) = 1.5 for α = c a exceeds the threshold τ_K = 1.3 (step S1423). Therefore [a b c c, 1.3] is deleted from the output list L, the pair [c a, 1.5] of c a and the index Χ²(c a) = 1.5 is added as an element of L, the elements of L are re-sorted, and the threshold is set to τ_K = 1.3 (steps S1424 to S1427 / FIG. 9(I)).

23. No further projection is possible, so in the recursive process of step 21, c is selected from β = {a, c} as item, and α = c c is set (steps S1415, S1416 / FIG. 11(A)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(c c) = 1.3 nor the limit value Χ²_max(c c) = 1.3 for α = c c exceeds the threshold τ_K = 1.3 (steps S1423, S1429). Step S1431 then yields yes, so the recursive process of step 21 ends and control returns to the initial loop. In the initial loop, β = {a, b, c, d}, of which a, b, and c have already been processed.

24. In the initial loop, d is selected from β = {a, b, c, d} as item, and α = d is set (steps S1415, S1416 / FIG. 11(B)). |L| < K (K = 4) does not hold (step S1417), and the index Χ²(d) = 2.1 for α = d exceeds the threshold τ_K = 1.3 (step S1423). Therefore [a b c, 1.3] is deleted from the output list L, the pair [d, 2.1] of d and the index Χ²(d) = 2.1 is added as an element of L, the elements of L are re-sorted, and the threshold is set to τ_K = 1.5 (steps S1424 to S1427 / FIG. 9(J)). Projection is then performed (step S1428), and β = {a, b, c, d} is set in the resulting recursive process (FIG. 11(B)).

25. In the recursive process of step 24, a is selected from β = {a, b, c, d} as item, and α = d a is set (steps S1415, S1416 / FIG. 11(B)). |L| < K (K = 4) does not hold (step S1417), and the index Χ²(d a) = 2.1 for α = d a exceeds the threshold τ_K = 1.5 (step S1423). Therefore [c a, 1.5] is deleted from the output list L, the pair [d a, 2.1] of d a and the index Χ²(d a) = 2.1 is added as an element of L, the elements of L are re-sorted, and the threshold is set to τ_K = 1.5 (steps S1424 to S1427 / FIG. 9(K)). Projection is then performed (step S1428), and β = {b, d} is set in the resulting recursive process (FIG. 11(B)).

26. In the recursive process of step 25, b is selected from β = {b, d} as item, and α = d a b is set (steps S1415, S1416 / FIG. 11(B)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(d a b) = 0.2 nor the limit value Χ²_max(d a b) = 0.5 for α = d a b exceeds the threshold τ_K = 1.5 (steps S1423, S1429).

27. Next, in the recursive process of step 25, d is selected from β = {b, d} as item, and α = d a d is set (steps S1415, S1416 / FIG. 11(B)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(d a d) = 0.2 nor the limit value Χ²_max(d a d) = 0.5 for α = d a d exceeds the threshold τ_K = 1.5 (steps S1423, S1429). Since all elements of β = {b, d} have been processed (step S1431), the recursive process of step 25 ends and control returns to the recursive process of step 24. There, β = {a, b, c, d}, of which a has already been processed.

28. In the recursive process of step 24, b is selected from β = {a, b, c, d} as item, and α = d b is set (steps S1415, S1416 / FIG. 11(B)). |L| < K (K = 4) does not hold (step S1417), and the index Χ²(d b) = 2.1 for α = d b exceeds the threshold τ_K = 1.5 (step S1423). Therefore [b c a, 1.5] is deleted from the output list L, the pair [d b, 2.1] of d b and the index Χ²(d b) = 2.1 is added as an element of L, the elements of L are re-sorted, and the threshold is set to τ_K = 2.1 (steps S1424 to S1427 / FIG. 9(L)). Projection is then performed (step S1428), and β = {a, c, d} is set in the resulting recursive process (FIG. 11(B)).

29. In the recursive process of step 28, a is selected from β = {a, c, d} as item, and α = d b a is set (steps S1415, S1416 / FIG. 11(B)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(d b a) = 1.5 nor the limit value Χ²_max(d b a) = 1.5 for α = d b a exceeds the threshold τ_K = 2.1 (steps S1423, S1429).

30. In the recursive process of step 28, c is selected from β = {a, c, d} as item, and α = d b c is set (steps S1415, S1416 / FIG. 11(B)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(d b c) = 1.5 nor the limit value Χ²_max(d b c) = 1.5 for α = d b c exceeds the threshold τ_K = 2.1 (steps S1423, S1429).

31. In the recursive process of step 28, d is selected from β = {a, c, d} as item, and α = d b d is set (steps S1415, S1416 / FIG. 11(B)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(d b d) = 0.5 nor the limit value Χ²_max(d b d) = 0.2 for α = d b d exceeds the threshold τ_K = 2.1 (steps S1423, S1429). Step S1431 in the recursive process of step 28 then yields yes, so that process ends and control returns to the recursive process of step 24. There, β = {a, b, c, d}, of which a and b have already been processed.

32. In the recursive process of step 24, c is selected from β = {a, b, c, d} as item, and α = d c is set (steps S1415, S1416 / FIG. 11(B)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(d c) = 1.5 nor the limit value Χ²_max(d c) = 1.5 for α = d c exceeds the threshold τ_K = 2.1 (steps S1423, S1429).

33. In the recursive process of step 24, d is selected from β = {a, b, c, d} as item, and α = d d is set (steps S1415, S1416 / FIG. 11(B)). |L| < K (K = 4) does not hold (step S1417), and neither the index Χ²(d d) = 0.2 nor the limit value Χ²_max(d d) = 0.5 for α = d d exceeds the threshold τ_K = 2.1 (steps S1423, S1429). Step S1431 in the recursive process of step 24 then yields yes, so that process ends and control returns to the initial loop.

34. Step S1431 of the initial loop yields yes, so the processing of step S14 ends, and the output list L stored in the storage unit 16e is output.
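The bookkeeping traced in the steps above (an output list L of at most K elements, a threshold τ_K equal to the smallest index in L, and pruning by the limit value) can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name top_k_patterns, the callables score, bound, and extensions, and the toy values are all hypothetical. It also assumes that every candidate is added to L and projected while |L| < K, which is not shown in the steps above.

```python
from typing import Callable, Iterable, List, Tuple

Pattern = Tuple[str, ...]


def top_k_patterns(
    items: Iterable[str],
    k: int,
    score: Callable[[Pattern], float],   # stands in for the index (e.g. chi-square)
    bound: Callable[[Pattern], float],   # stands in for the limit value
    extensions: Callable[[Pattern], Iterable[str]],  # items left after projection
) -> List[Tuple[Pattern, float]]:
    """Depth-first search keeping the k best-scoring patterns (hypothetical sketch)."""
    if k <= 0:
        return []
    out: List[Tuple[Pattern, float]] = []  # output list L, largest index first

    def tau() -> float:
        # Threshold tau_K: the smallest index in L once L holds k elements.
        return out[-1][1] if len(out) >= k else float("-inf")

    def visit(alpha: Pattern, beta: Iterable[str]) -> None:
        for item in beta:
            cand = alpha + (item,)
            s = score(cand)
            if len(out) < k or s > tau():
                if len(out) >= k:
                    out.pop()  # delete the element with the smallest index
                out.append((cand, s))
                out.sort(key=lambda e: e[1], reverse=True)
                visit(cand, extensions(cand))  # projection after output
            elif bound(cand) > tau():
                visit(cand, extensions(cand))  # projection justified by the limit value
            # otherwise neither value exceeds tau_K and the branch is pruned

    visit((), list(items))
    return out


# Toy data (hypothetical values, patterns of length at most 2).
ITEMS = ["a", "b"]
SCORE = {("a",): 1.0, ("a", "a"): 0.5, ("a", "b"): 3.0,
         ("b",): 0.2, ("b", "a"): 2.0, ("b", "b"): 0.1}
BOUND = {("a",): 3.0, ("b",): 2.5}

result = top_k_patterns(
    ITEMS, 2,
    score=lambda p: SCORE[p],
    bound=lambda p: BOUND.get(p, 0.0),
    extensions=lambda p: ITEMS if len(p) < 2 else [],
)
print(result)
```

With these toy values the pattern b is never placed in the list (its index 0.2 is below the threshold), yet its limit value 2.5 keeps the branch alive long enough to find b a, which mirrors what happens with α = b in step 14.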

[Modifications, etc.]
The present invention is not limited to the embodiment described above. For example, in the above embodiment, when the database combining the labeled data and the data with probability data is generated in step S13, the probabilities Pr(P₊|x) and Pr(P₋|x) contained in the data with probability data are multiplied by the attenuation parameter λ. However, a configuration in which the attenuation parameter λ is not applied is also possible. Conversely, the label values (1 or 2) of the labeled data may be multiplied by an amplification parameter ρ (ρ ≥ 1, preferably ρ > 1). Both the attenuation parameter λ and the amplification parameter ρ may also be applied.

Further, the following equations (31) to (34) may be used instead of equations (25) to (28).

N = ρ・|D^L| + |D^U| …(31)

Alternatively, the following equations (35) to (38) may be used instead of equations (25) to (28).

N = ρ・|D^L| + λ・|D^U| …(35)

That is, the value determined from the total number |D^L| of labeled texts in the database may be the product of |D^L| and a predetermined amplification parameter; the value determined from the total number of labeled texts in the database that are associated with a label indicating membership in the predetermined set may be the product of that total number and the amplification parameter; the value determined from the total number of labeled texts in the database that are associated with a label indicating membership in the predetermined set and that contain the first pattern α may be the product of that total number and the amplification parameter; and the value determined from the total number of labeled texts containing the first pattern α may be the product of that total number and the amplification parameter.
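As a rough illustration of how the amplification parameter ρ and the attenuation parameter λ enter the counts, the following Python sketch computes N as in equation (35) together with M, x(α), and y(α) as they are defined in words in claim 5. The function names, the data layout, and the example values are hypothetical, and treating "a text contains a pattern" as a subsequence match with gaps allowed is an assumption made only for this sketch.

```python
from typing import List, Sequence, Tuple


def contains(text: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if pattern occurs in text as a subsequence (gaps allowed; an assumption)."""
    it = iter(text)
    return all(item in it for item in pattern)


def weighted_counts(
    labeled: List[Tuple[str, bool]],     # (text, belongs to the predetermined set?)
    unlabeled: List[Tuple[str, float]],  # (text, probability of belonging to the set)
    pattern: str,
    rho: float = 1.0,                    # amplification parameter
    lam: float = 1.0,                    # attenuation parameter
) -> Tuple[float, float, float, float]:
    """Return (N, M, x, y) with labeled counts scaled by rho and unlabeled ones by lam."""
    n = rho * len(labeled) + lam * len(unlabeled)
    m = rho * sum(1 for _, pos in labeled if pos) \
        + lam * sum(p for _, p in unlabeled)
    x = rho * sum(1 for t, _ in labeled if contains(t, pattern)) \
        + lam * sum(1 for t, _ in unlabeled if contains(t, pattern))
    y = rho * sum(1 for t, pos in labeled if pos and contains(t, pattern)) \
        + lam * sum(p for t, p in unlabeled if contains(t, pattern))
    return n, m, x, y


# Hypothetical example: two labeled texts and two unlabeled texts.
labeled = [("abc", True), ("bd", False)]
unlabeled = [("ac", 0.8), ("cd", 0.25)]
n, m, x, y = weighted_counts(labeled, unlabeled, "ac", rho=2.0, lam=0.5)
print(n, m, x, y)
```

Setting rho = 1 and lam = 1 gives unweighted counts, and rho = 1 with lam < 1 scales down only the contribution of the unlabeled texts.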

In the above embodiment, Χ²(α) itself is used as the index. However, any other value corresponding to a broad-sense monotonic function value of Χ²(α) (a monotonic non-decreasing function value) may be used as the index. A value corresponding to a broad-sense monotonic function value of Χ²(α) is a concept that includes Χ²(α) itself. For example, if a value corresponding to a broad-sense monotonically increasing function value of Χ²(α) is used as the index, a pattern α with a larger index value contributes more to label classification. Conversely, if a value corresponding to a broad-sense monotonically decreasing function value of Χ²(α) is used as the index, a pattern α with a smaller index value contributes more to label classification. Alternatively, a convex function value representing the degree of relevance between the pattern α and the classification result obtained when texts containing α are classified into the sets to which they belong may be used as the index.
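The following small Python sketch illustrates why a monotonic function of Χ²(α) serves equally well as an index: a strictly increasing transform leaves the ranking of patterns unchanged, and a strictly decreasing transform gives the same ranking once patterns are read from the smallest value upward. The dictionary of index values and the helper rank are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical index values for four patterns (illustration only).
chi2 = {"a c": 2.1, "b c a": 1.5, "a c c": 1.3, "b a": 0.5}


def rank(index, largest_first=True):
    """Return the patterns ordered by their index value."""
    return sorted(index, key=index.get, reverse=largest_first)


by_chi2 = rank(chi2)                                         # index is chi-square itself
by_log = rank({p: math.log1p(v) for p, v in chi2.items()})   # increasing transform
by_neg = rank({p: -v for p, v in chi2.items()},              # decreasing transform,
              largest_first=False)                           # so smallest value first

print(by_chi2)
print(by_log == by_chi2, by_neg == by_chi2)
```

If the transform is only non-decreasing rather than strictly increasing, distinct Χ² values may map to equal index values, so ties can appear that were not present in the original ranking.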

Similarly, in the above embodiment, Χ²_max(α) itself is used as the limit value. However, any other value corresponding to a broad-sense monotonic function value of Χ²_max(α) may be used instead. A value corresponding to a broad-sense monotonic function value of Χ²_max(α) is a concept that includes Χ²_max(α) itself.

The various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the apparatus executing them or as needed. It goes without saying that other modifications may be made as appropriate without departing from the spirit of the present invention.

When the above configuration is implemented by a computer, the processing content of the functions that each apparatus should have is described by a program. Executing this program on a computer realizes the above processing functions on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time a program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and acquisition of results. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has properties that define the processing of the computer).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

DESCRIPTION OF SYMBOLS
1 Pattern extraction apparatus
11 Training unit
12 Classification unit
13 Database synthesis unit
14 Extraction unit
15 Control unit
16a to 16e Storage units

Claims

A pattern extraction apparatus comprising:
a storage unit that stores unlabeled text, where an item is an element consisting of one or more symbols, a text is a sequence of one or more items, a pattern is a sequence of one or more items contained in a text, a labeled text is a text associated with a label, the label being data representing the set to which the text belongs, and an unlabeled text is a text not associated with a label;
a classification unit that applies, to the unlabeled text, a classification model, which is a statistical model generated using labeled text as training data and configured to output probability data representing the probability that any text to which it is applied belongs to a predetermined set, and thereby generates probability data representing the probability that the unlabeled text belongs to the predetermined set; and
an extraction unit that uses a value determined from the probability data generated by the classification unit to generate an index representing the degree of relevance between a first pattern, which is an arbitrary pattern, and the classification result obtained when a text containing the first pattern is classified into the set to which the text belongs.

The pattern extraction apparatus according to claim 1,
wherein the extraction unit is configured to:
when the index satisfies a predetermined first output condition, output the first pattern as an element of an output list; generate a second index representing the degree of relevance between a second pattern, which is a sequence obtained by adding one or more items to the first pattern, and the classification result obtained when a second text containing the second pattern is classified into the set to which the second text belongs; and,
when the second index satisfies a predetermined second output condition, output the second pattern as an element of the output list.

The pattern extraction apparatus according to claim 2,
wherein the extraction unit is configured to:
generate a limit value of an index representing the degree of relevance between an arbitrary pattern, which is any sequence obtained by adding one or more items to the first pattern, and the classification result obtained when an arbitrary text containing the arbitrary pattern is classified into the set to which that text belongs; and, when the index does not satisfy the first output condition but the limit value satisfies a predetermined search condition, without outputting the first pattern as an element of the output list, generate a third index representing the degree of relevance between a third pattern, which is a sequence obtained by adding one or more items to the first pattern, and the classification result obtained when a third text containing the third pattern is classified into the set to which the third text belongs, and output the third pattern as an element of the output list when the third index satisfies a predetermined third output condition.
A pattern extraction apparatus characterized by the above.

The pattern extraction apparatus according to any one of claims 1 to 3,
wherein the value determined from the probability data is the product of the probability data and a predetermined attenuation parameter.
A pattern extraction apparatus characterized by the above.

The pattern extraction apparatus according to any one of claims 1 to 4,
wherein the index
is generated for a database containing |D^L| (|D^L| ≥ 0) labeled texts and |D^U| (|D^U| > 0) unlabeled texts for which the probability data has been generated, and,
where the first pattern is α,
N is the sum of the value determined from the total number |D^L| of labeled texts in the database and the value determined from the total number |D^U| of unlabeled texts in the database,
M is the sum of the value determined from the total number of labeled texts in the database that are associated with a label indicating membership in the predetermined set and the sum total of the values determined from the probability data generated by the classification unit,
y(α) is the sum of the value determined from the total number of labeled texts in the database that are associated with a label indicating membership in the predetermined set and that contain the first pattern α, and the sum total of the values determined from the probability data representing the probability that each unlabeled text containing the first pattern α belongs to the predetermined set, and
x(α) is the sum of the value determined from the total number of labeled texts containing the first pattern α and the value determined from the total number of unlabeled texts containing the first pattern α,

is a value corresponding to a broad-sense monotonic function value of the expression above.
A pattern extraction apparatus characterized by the above.

The pattern extraction apparatus according to claim 5,
wherein the extraction unit further,
where Χ²(α, y=x) denotes Χ²(α) when y = x, Χ²(α, y=0) denotes Χ²(α) when y = 0, and max(ν, μ) is ν when ν ≥ μ and μ when ν < μ,
generates the limit value of the index, the limit value corresponding to a broad-sense monotonic function value of
Χ²_max(α) = max(Χ²(α, y=x), Χ²(α, y=0)).
A pattern extraction apparatus characterized by the above.

The pattern extraction apparatus according to claim 5 or 6,
wherein the value determined from the total number |D^U| of unlabeled texts in the database is the product of |D^U| and a predetermined attenuation parameter, the value determined from the probability data is the product of the probability represented by the probability data and the attenuation parameter, and the value determined from the total number of unlabeled texts containing the first pattern α is the product of that total number and the attenuation parameter,
and/or
the value determined from the total number |D^L| of labeled texts in the database is the product of |D^L| and a predetermined amplification parameter, the value determined from the total number of labeled texts in the database that are associated with a label indicating membership in the predetermined set is the product of that total number and the amplification parameter, the value determined from the total number of labeled texts in the database that are associated with a label indicating membership in the predetermined set and that contain the first pattern α is the product of that total number and the amplification parameter, and the value determined from the total number of labeled texts containing the first pattern α is the product of that total number and the amplification parameter.
A pattern extraction apparatus characterized by the above.

The pattern extraction apparatus according to any one of claims 1 to 7,
wherein the extraction unit is configured to output, as elements of an output list, no more than a predetermined number of patterns selected in descending order of their corresponding index values.
A pattern extraction apparatus characterized by the above.

A pattern extraction method comprising:
a step in which, where an item is an element consisting of one or more symbols, a text is a sequence of one or more items, a pattern is a sequence of one or more items contained in a text, a labeled text is a text associated with a label, the label being data representing the set to which the text belongs, and an unlabeled text is a text not associated with a label, a classification unit applies, to the unlabeled text stored in a storage unit, a classification model, which is a statistical model generated using labeled text as training data and configured to output probability data representing the probability that any text to which it is applied belongs to a predetermined set, and thereby generates probability data representing the probability that the unlabeled text belongs to the predetermined set; and
a step in which an extraction unit uses a value determined from the probability data generated by the classification unit to generate an index representing the degree of relevance between a first pattern, which is an arbitrary pattern, and the classification result obtained when a text containing the first pattern is classified into the set to which the text belongs.

A program for causing a computer to function as the pattern extraction apparatus according to claim 1.