JP2014137722A

JP2014137722A - Rule generation device and extraction device

Info

Publication number: JP2014137722A
Application number: JP2013006256A
Authority: JP
Inventors: Keisuke Ogawa (小川 圭介); Masayuki Hashimoto (橋本 真幸); Kazunori Matsumoto (松本 一則)
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2013-01-17
Filing date: 2013-01-17
Publication date: 2014-07-28
Anticipated expiration: 2033-01-17
Also published as: JP6061337B2

Abstract

PROBLEM TO BE SOLVED: To automatically generate pattern-matching extraction rules for named entities that a person can easily check, and to automatically extract named entities using those rules. SOLUTION: A morphological analysis unit 11 performs morphological analysis on a learning text annotated with tags that mark named entities. A variation measurement unit 12 computes, for each named entity, the frequency distribution of the part-of-speech sequence patterns of a predetermined number of surrounding words, and a classification unit 13 classifies the named entities into clusters according to the magnitude of the variation of that distribution. A rule generation unit 14 generates extraction rules, as one set of rules per cluster, that extract the named entities belonging to that cluster by pattern matching, taking into account how well the named entities of each cluster are actually extracted from the learning text. When named entities are extracted from an unknown text, a word extracted by the rules of a given cluster is judged to be a named entity if its variation in the unknown text matches the variation of that cluster.

Description

The present invention relates to a rule generation device and an extraction device. More specifically, it relates to a rule generation device that generates extraction rules for named entities, well suited to cases where the targets are personal information such as names, and to an extraction device that uses those extraction rules to extract named entities from unknown text.

Techniques that extract personal information from text by named entity extraction and delete it by replacing it with other words are known, for example, from Non-Patent Documents 1 to 3. Here, named entity extraction is a computer-based natural language processing technique that extracts proper nouns (person names, place names, etc.), dates, time expressions, and the like.

Patent Document 1 discloses a technique that, in order to conceal text information, extracts sensitive words using predetermined rules and deletes personal information.

Patent Document 1: JP 2006-309406 A

Non-Patent Document 1: Fumito Masui, Shinya Suzuki, Junichi Fukumoto: "Development of NExT, a Named Entity Extraction Tool for Text Processing" (テキスト処理のための固有表現抽出ツールNExT の開発), Proceedings of the 8th Annual Meeting of the Association for Natural Language Processing, pp. 176-179, 2002.
Non-Patent Document 2: "Japanese Named Entity Extraction Using Support Vector Machine" (Support Vector Machineを用いた日本語固有表現抽出), IPSJ Journal (情報処理学会論文誌), Vol. 43, No. 1.
Non-Patent Document 3: Min Tang et al., "Preserving Privacy in Spoken Language Data bases", Proceedings of the International Workshop on Privacy and Security Issues in Data Mining, ECML/PKDD, Pisa, Italy, September 2004.

Systems that apply this kind of named entity extraction generally extract and delete personal information on the basis of a small number of fixed rules. They therefore cannot handle cases in which the information can only be deleted by a rule that is not known in advance.

That is, rule-based named entity extraction requires rules to be set heuristically in advance, so a personal information deletion task requires preparing a huge number of deletion rules tailored to that task beforehand. Personal information inherently comes in many kinds: some kinds do not need many deletion rules, while others do.

Some named entity extraction methods delete on the basis of statistical techniques, but in general omissions inevitably occur. When deletion is based on a statistical technique, the user cannot properly understand the internal deletion rules and so cannot tell which part should be corrected (there is no feedback to the user). In the end, everything has to be checked by eye from scratch.

That is, extracting passages that appear to be named entities with a statistical technique is in itself possible. In a personal information deletion task, however, deletion must be exhaustive, and final confirmation by a human (the user) is indispensable. With a statistical technique, a human cannot predict where errors will occur, so the amount of checking becomes enormous. (By contrast, with simple rule-based deletion the deletion rules are human-readable, so correction is easy: because it is clear which rule caused a given deletion, that rule can readily be modified.)

In view of the above problems of the prior art, a first object of the present invention is to provide a rule generation device that automatically generates extraction rules for named entities using a rule-based approach that is easy for the user to check.

A second object of the present invention is to provide an extraction device that automatically extracts named entities from text using the extraction rules generated by the rule generation device, or those extraction rules as modified by the user.

To achieve the first object, the present invention provides, as a first feature, a rule generation device that generates extraction rules for named entities using a learning text annotated with tags indicating named entities, the device comprising: a morphological analysis unit that morphologically analyzes the learning text, splitting it into words and determining the part of speech of each word; a variation measurement unit that, for each named entity indicated by the tags, obtains from the morphologically analyzed learning text a frequency distribution over the part-of-speech sequence patterns of a predetermined number of words surrounding that named entity, and measures the variation of that frequency distribution; a classification unit that classifies the tagged named entities into clusters on the basis of the measured variation; and a rule generation unit that, for each cluster so classified, generates the extraction rules as a set of rules that extract the named entities belonging to that cluster by matching the part-of-speech sequence pattern of a predetermined number of words surrounding a candidate word, the rules being derived from the patterns obtained by the variation measurement unit while taking into account the deletion performance for named entities on the learning text.

To achieve the second object, the present invention provides, as a second feature, an extraction device that extracts named entities from an input text using the extraction rules generated for each cluster by the rule generation device, or those rules as modified by the user, the device comprising: a morphological analysis unit that morphologically analyzes the input text, splitting it into words and determining the part of speech of each word; a variation measurement unit that, for each word obtained by the morphological analysis, obtains from the analyzed input text a frequency distribution over the part-of-speech sequence patterns of a predetermined number of words surrounding that word, and measures the variation of that frequency distribution; a classification unit that classifies the words into clusters on the basis of the measured variation; and an extraction unit that applies to the input text the extraction rules generated for each cluster, or those rules as modified by the user, and judges an extracted word to be a named entity when the classification unit has classified that word into a cluster with the same variation as the cluster corresponding to the applied extraction rules.

According to the first feature, for each named entity, the variation of the frequency distribution of the part-of-speech sequence patterns of its surrounding words is obtained from the learning text, the named entities are classified into clusters according to that variation, and extraction rules that extract the named entities of each cluster by pattern matching can be obtained automatically.

According to the second feature, the extraction rules obtained for each cluster are applied to the input text, and an extracted word is output as a named entity only when it also has the same variation as the cluster of the applied rules. Named entities can thus be extracted using the extraction rules produced by the rule generation device of the first feature.

FIG. 1 is a functional block diagram of a rule generation device according to one embodiment.
FIG. 2 is a conceptual diagram of the distribution that the variation measurement unit examines for each named entity.
FIG. 3 is a diagram conceptually illustrating the generation of extraction rules in the present invention.
FIG. 4 is a functional block diagram of the rule generation unit.
FIG. 5 is a diagram illustrating a specific example in which the initial rule set generation unit applies a predetermined conversion method to rules obtained from the learning text for named entities belonging to the minimum-variation cluster, to derive new rules.
FIG. 6 is a flowchart of the processing performed by the automatic generation unit.
FIG. 7 is a diagram showing an example of exception rules generated by the exception definition unit, corresponding to the specific example of FIG. 5.
FIG. 8 is a functional block diagram of an extraction device that extracts named entities from text using the rule sets generated by the rule generation device.

FIG. 1 is a functional block diagram of a rule generation device according to one embodiment. The rule generation device 10 comprises a morphological analysis unit 11, a variation measurement unit 12, a classification unit 13, and a rule generation unit 14, and automatically generates rules for extracting named entities (the rule sets described later) from a learning text that has been tagged in advance. An overview of each unit follows.

The morphological analysis unit 11 morphologically analyzes a predetermined learning text prepared in advance and identifies each word and its part of speech. An example of such a learning text, labeled EX1 for reference, is shown below.

[Example learning text EX1]
「私は阿比留です。おはようございます。立先生。私は市立山田病院の松本です。本日は松本市立松本病院に出張しています。」
(English gloss: "I am Abiru. Good morning, Dr. Tachi. I am Matsumoto of the municipal Yamada Hospital. Today I am on a business trip to the Matsumoto municipal Matsumoto Hospital.")

In addition to body text like the example above, the learning text is annotated in advance with tags indicating which spans are named entities and the type of each. Here, a named entity in the learning text and its type are written using the three delimiters 「＜」, 「／」 and 「＞」, as follows:
＜named entity span／named entity: type＞
Using this notation, example EX1 is tagged as shown below, indicating that 「阿比留」 (Abiru), 「松本」 (Matsumoto) and 「立」 (Tachi) are named entities of type "name" (名前), and that 「山田病院」 (Yamada Hospital) and 「松本病院」 (Matsumoto Hospital) are named entities of type "affiliation" (所属).

[Example learning text EX1 (with tags shown)]
「私は＜阿比留／固有表現：名前＞です。おはようございます。＜立／固有表現：名前＞先生。私は市立＜山田病院／固有表現：所属＞の＜松本／固有表現：名前＞です。本日は松本市立＜松本病院／固有表現：所属＞に出張しています。」
(English gloss: "I am ＜Abiru／named entity: name＞. Good morning, Dr. ＜Tachi／named entity: name＞. I am ＜Matsumoto／named entity: name＞ of the municipal ＜Yamada Hospital／named entity: affiliation＞. Today I am on a business trip to the Matsumoto municipal ＜Matsumoto Hospital／named entity: affiliation＞.")
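The tag notation above can be read mechanically. The following is a minimal sketch, assuming the full-width delimiters and the literal label 「固有表現」 exactly as they appear in EX1; the function name `parse_tagged` and the tuple layout are hypothetical and not part of the patent.

```python
import re

# Matches ＜span／固有表現：type＞ with the full-width delimiters used in EX1.
TAG = re.compile(r"＜([^／＞]+)／固有表現：([^＞]+)＞")

def parse_tagged(text):
    """Strip the tags and return (plain_text, entities).

    Each entity is (start, end, surface, type), with offsets into plain_text.
    """
    parts, entities, cursor, length = [], [], 0, 0
    for m in TAG.finditer(text):
        before = text[cursor:m.start()]          # untagged text preceding the tag
        surface, etype = m.group(1), m.group(2)
        start = length + len(before)
        entities.append((start, start + len(surface), surface, etype))
        parts += [before, surface]
        length = start + len(surface)
        cursor = m.end()
    parts.append(text[cursor:])                  # trailing untagged text
    return "".join(parts), entities

tagged = "私は＜阿比留／固有表現：名前＞です。＜立／固有表現：名前＞先生。"
plain, entities = parse_tagged(tagged)
```

Keeping character offsets rather than only the surface strings matters here, because the same characters (for example 「立」 inside 「市立」) can occur untagged elsewhere in the text.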

The variation measurement unit 12 uses the learning text, which has been tagged with named entities in advance as described above and in which the morphological analysis unit 11 has identified each word and its part of speech. For each named entity word, it examines the part-of-speech distribution (part-of-speech sequence patterns) of a predetermined number of words before and after it, and measures the variation of that distribution.

For example, suppose that for a named entity N the following distribution of the parts of speech C_i (i = 1, 2, 3, 4) of the two words before it and the two words after it is examined:
{C₁, C₂, N, C₃, C₄} … (Formula 1)
In this case, the named entity N = 「阿比留」 (Abiru) in example EX1 yields only the following single type of distribution:
{C₁, C₂, N, C₃, C₄}
= {pronoun (「私」, "I"), particle (「は」), named entity (「阿比留」), auxiliary verb (「です」), punctuation (「。」)}

As an illustration of the predetermined number of surrounding words, each distribution is specified, as above, by the parts of speech C_i of the two words before and the two words after. For each named entity N_k (k = 1, 2, ..., n) indicated by a tag in the learning text, the variation measurement unit 12 examines every part-of-speech sequence pattern that occurs around it in the learning text, which yields the following information:
The part-of-speech sequence pattern {C_{1[j]}, C_{2[j]}, N_k, C_{3[j]}, C_{4[j]}}_k of each distribution P_j
The total number m_k of distinct distributions P_j (j = 1, 2, ..., m_k)
The number of times f_k(P_j) that a pattern matching each distribution P_j appears in the learning text
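The counting described above can be sketched as follows. This is an illustration only: it assumes the text has already been morphologically analyzed into (surface, part-of-speech) pairs, that the positions of tagged named entities are known, and that positions beyond the text boundary are filled with a padding label. The padding, the function name `context_patterns`, and the English part-of-speech labels are assumptions, not details from the patent.

```python
from collections import Counter, defaultdict

def context_patterns(tokens, entity_indices, window=2, pad="PAD"):
    """For each tagged entity N_k, count its surrounding POS patterns P_j.

    Returns {surface: Counter({pattern: f_k(P_j)})}, so that
    len(result[surface]) is the number of distinct patterns m_k.
    """
    dist = defaultdict(Counter)
    for i in entity_indices:
        surface = tokens[i][0]
        left = [tokens[j][1] if j >= 0 else pad
                for j in range(i - window, i)]
        right = [tokens[j][1] if j < len(tokens) else pad
                 for j in range(i + 1, i + 1 + window)]
        # "N" marks the slot of the named entity itself, as in Formula 1.
        dist[surface][tuple(left) + ("N",) + tuple(right)] += 1
    return dist

# Hand-tagged tokens for the first sentence of EX1.
tokens = [("私", "pronoun"), ("は", "particle"), ("阿比留", "noun"),
          ("です", "auxiliary verb"), ("。", "punctuation")]
dist = context_patterns(tokens, entity_indices=[2])
```

For this input the entity 「阿比留」 has a single pattern with count 1, matching the single distribution given for EX1.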

FIG. 2 is a conceptual diagram of the distribution (frequency distribution) examined in this way by the variation measurement unit 12. As the variation measured from the distribution of each named entity N_k, the variation measurement unit 12 can use the total number of distributions m_k. Alternatively, it may use the number of distributions P_j whose count f_k(P_j) is greater than or equal to a predetermined threshold.
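Both variation measures can be computed directly from the per-entity pattern counts. A minimal sketch, in which the function name `variation` and the parameter `min_count` (standing in for the predetermined threshold) are hypothetical:

```python
from collections import Counter

def variation(freq, min_count=None):
    """Variation of one named entity's pattern distribution.

    freq maps each pattern P_j to its count f_k(P_j).
    With min_count=None this is m_k, the number of distinct patterns;
    otherwise it is the number of patterns with f_k(P_j) >= min_count.
    """
    if min_count is None:
        return len(freq)
    return sum(1 for count in freq.values() if count >= min_count)

# Hypothetical counts for three patterns of one named entity.
freq = Counter({"P1": 5, "P2": 1, "P3": 2})
```

With these counts, `variation(freq)` gives 3 and `variation(freq, min_count=2)` gives 2, since the thresholded variant ignores patterns seen only once.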

The classification unit 13 classifies each named entity N_k into a cluster on the basis of the variation measured for it by the variation measurement unit 12. In one example, the named entities N_k are ranked by the magnitude of their variation, the ranking is divided into predetermined intervals, and the named entities falling within each interval are assigned to the single cluster corresponding to that interval.

For example, as shown conceptually in (2) of FIG. 3, the named entities N_k are divided into three levels according to the magnitude of the variation measured in (1): cluster CL1 with "small" variation, cluster CL2 with "medium" variation, and cluster CL3 with "large" variation. The figure shows an example in which, using a learning text that contains example EX1 as one part, the three named entities representing names in EX1 are classified as follows:
「阿比留」 (Abiru): variation "small" (belongs to cluster CL1)
「松本」 (Matsumoto): variation "medium" (belongs to cluster CL2)
「立」 (Tachi): variation "large" (belongs to cluster CL3)

The division need not be into three levels; more generally, division into n levels is possible. In what follows, however, the description assumes the three clusters CL1 to CL3 as an illustrative example.
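One way to realize the rank-and-split classification is sketched below. The text only says that the ranking is divided into predetermined intervals; splitting it into equal-sized rank intervals is an assumption made here for illustration, as are the function name `classify` and the variation values.

```python
def classify(variations, n_clusters=3):
    """Rank named entities by variation and split the ranking into intervals.

    variations maps each named entity to its measured variation.
    Returns {entity: cluster_number}, where 1 is the smallest-variation
    cluster (CL1) and n_clusters is the largest.
    Assumption: intervals are equal-sized slices of the ranking.
    """
    ranked = sorted(variations, key=variations.get)
    return {name: rank * n_clusters // len(ranked) + 1
            for rank, name in enumerate(ranked)}

# Hypothetical variation values for the three names in EX1.
variations = {"阿比留": 1, "松本": 4, "立": 9}
clusters = classify(variations)
```

Passing a different `n_clusters` gives the more general n-level division mentioned above.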

As shown in (3) of FIG. 3, the rule generation unit 14 generates, for each cluster produced by the classification unit 13, rules that extract the named entities belonging to that cluster. These rules follow the rule-based approach described in the prior art section, that is, pattern matching. The text to be processed is morphologically analyzed to identify its words and their parts of speech, and a word N is extracted if the predetermined number of words before and after it form a prescribed sequence of parts of speech (and of words and the like, as described later) as in Formula 1, in other words if the sequence pattern matches. The extraction rules are generated as a set containing multiple individual rules of this kind.
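Applying such a rule set by pattern matching might look like the sketch below. It covers only the part-of-speech matching step; the per-cluster variation check and rules containing concrete words are left out. Representing a rule as a (left context, right context) pair of part-of-speech tuples, the padding label, and the name `match_rules` are all assumptions for illustration.

```python
def match_rules(tokens, rules, window=2, pad="PAD"):
    """Return the indices of words whose surrounding POS sequence matches a rule.

    tokens is a list of (surface, part_of_speech) pairs.
    rules is a set of (left_pos_tuple, right_pos_tuple) pairs.
    """
    pos = [p for _, p in tokens]
    hits = []
    for i in range(len(tokens)):
        left = tuple(pos[j] if j >= 0 else pad
                     for j in range(i - window, i))
        right = tuple(pos[j] if j < len(pos) else pad
                      for j in range(i + 1, i + 1 + window))
        if (left, right) in rules:
            hits.append(i)
    return hits

# One rule taken from the pattern around 「阿比留」 in EX1.
rules = {(("pronoun", "particle"), ("auxiliary verb", "punctuation"))}
tokens = [("私", "pronoun"), ("は", "particle"), ("勅使河原", "noun"),
          ("です", "auxiliary verb"), ("。", "punctuation")]
hits = match_rules(tokens, rules)
```

Here a rule learned from one rare surname extracts a different one, 「勅使河原」, because only the surrounding parts of speech are compared.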

In the present invention, the rules for extracting named entities thus go beyond ordinary pattern-matching extraction rules: by generating the rules separately for each cluster, the variation of the extracted word N is also taken into account. The relationship among the extraction rules of the individual clusters is as follows.

When the rule generation unit 14 generates the extraction rules, as shown in (3) of FIG. 3, it first generates, as general rules R100, the rules that extract the named entities belonging to cluster CL1, which has the smallest variation. Next, it generates the rules that extract the named entities belonging to the larger-variation clusters CL2 and CL3 as special rules R200 and R300, by making the rules R100 more complex so that they can handle more special cases. This per-cluster rule generation is based on the following considerations, which are specific to the present invention.

Named entities such as 「立」 and 「良」, which represent names and belong to the "large" variation cluster CL3 shown in (2) of FIG. 3, can also be used in many contexts other than names, for example in 「市立」 ("municipal") and 「良性」 ("benign"). Consequently, when words such as 「立」 and 「良」 are extracted from the text as a whole, the distribution of the surrounding words and parts of speech can be expected to vary greatly, and these words do in fact belong to cluster CL3. The rules for selecting, by pattern matching, the occurrences of words in cluster CL3 that are named entities are therefore expected to be extremely complex.

By contrast, named entities such as 「阿比留」 (Abiru) and 「勅使河原」 (Teshigawara), which represent names and belong to the "small" variation cluster CL1 shown in (2) of FIG. 3, are rare surnames with almost no conceivable use other than as names. The variation in the distribution of the words and parts of speech surrounding such words is therefore expected to be small, and these words do in fact belong to cluster CL1. The rules for selecting, by pattern matching, the occurrences of words in cluster CL1 that are named entities are therefore expected to be relatively simple.

In other words, named entities such as names inherently come in many kinds. When extracting by pattern matching, some, like the words of cluster CL3, can be extracted only by bringing many extraction rules to bear, while others, like the words of cluster CL1, can be extracted with simple extraction rules alone. Likewise, the words belonging to the intermediate cluster CL2 are expected to require extraction rules of moderate complexity.

Accordingly, the present invention extracts named entities by pattern matching on the surrounding parts of speech and words, takes the variation of each candidate word into account, and changes the extraction rules for each cluster to which a word belongs according to the magnitude of that variation. For this purpose, the rule generation unit 14 generates rules suited to the variation. Unlike the statistical techniques described in the prior art section, this makes it possible to prepare automatically a set of rules, in the form of extraction patterns, that a human can inspect directly.

The rules R100 generated from cluster CL1, which has the smallest variation, are the most general rules for extracting named entities such as names. They take the simplest form and are applicable to every cluster, so they are generated first. The rules R200 and R300 for the larger-variation clusters CL2 and CL3 are generated as specializations and elaborations of the general rules R100. As a result, when a human reviews the complete rules R100, R200 and R300, their meaning can be grasped and manual corrections are easy to make. Generating the rules as specializations and elaborations also avoids, as far as possible, the generation of useless rules.

In particular, without the classification by variation that is specific to the present invention, the pattern-matching extraction rules would become enormous and complex, and it would likely be impossible to keep them readable enough for a human to understand and correct. The present invention, by contrast, can exploit the hierarchy among clusters, which also avoids generating useless rules. As an example of exploiting the hierarchy, when a manual correction is made to the minimum-variation cluster CL1, the same correction can be applied to the larger-variation clusters CL2 and CL3, making the work efficient.

Rule generation by the rule generation unit 14, whose significance is described above, is explained in detail below. FIG. 4 is a functional block diagram of the rule generation unit 14. The rule generation unit 14 includes an initial rule set generation unit 140, an automatic generation unit 145, and a generation process storage unit 146. The initial rule set generation unit 140 includes an exception definition unit 141 and a specialization unit 142.

As explained with reference to FIG. 3, the rule generation unit 14 first generates, as general rules, the extraction rules for the named entities belonging to the minimum-variation cluster CL1. Then, as indicated by line L1 in FIG. 4, it uses those general rules to generate, as special rules, the extraction rules for the named entities belonging to each of the larger-variation clusters CL2 and CL3. The exception definition unit 141 and the specialization unit 142 operate when these special rules are generated.

For each cluster, the initial rule set generation unit 140 generates an initial rule set, which serves as the pool of candidates from which the final extraction rules are obtained. For each cluster, the automatic generation unit 145 automatically selects from the initial rule set those rules that perform well at actually extracting the named entities belonging to that cluster, and thereby obtains the extraction rules that form the final output. This automatic selection can use local search, a genetic algorithm, simulated annealing, or other techniques (metaheuristics, or any technique that yields equivalent results).
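Of the techniques named above, local search is the simplest to illustrate. The sketch below selects a subset of candidate rules by toggling one rule at a time and keeping any change that improves a score. The scoring function used here (correct extractions minus false extractions on the learning text) and the per-rule numbers are hypothetical; this passage does not specify how extraction performance is scored.

```python
def local_search(candidates, score):
    """Select a subset of candidate rules by single-rule hill climbing.

    Starting from the empty set, repeatedly toggle one rule (add it if
    absent, drop it if present) and keep the change when the score rises.
    Stops when a full pass over the candidates yields no improvement.
    """
    current, best = set(), score(set())
    improved = True
    while improved:
        improved = False
        for rule in candidates:
            trial = current ^ {rule}
            trial_score = score(trial)
            if trial_score > best:
                current, best, improved = trial, trial_score, True
    return current, best

# Hypothetical results of each rule on the learning text:
# (named entities correctly extracted, words wrongly extracted).
results = {"P1": (5, 0), "P2": (1, 4), "P3": (2, 1)}

def score(rules):
    """Assumed objective: total correct hits minus total false hits."""
    return sum(results[r][0] - results[r][1] for r in rules)

selected, value = local_search(list(results), score)
```

Because every accepted move strictly increases the score and there are finitely many subsets, the loop always terminates. Like any local search, it may stop at a local optimum when rules interact, which is one reason the text also lists genetic algorithms and simulated annealing.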

The generation process storage unit 146 monitors the processing of the initial rule set generation unit 140 and the automatic generation unit 145. It records how the extraction rules for the named entities of each larger-variation cluster CL2 and CL3 were obtained by modifying the extraction rules for the named entities of the minimum-variation cluster CL1, and provides the user with information on the details of those modifications.

The generation process for the general rules, which are generated first, is described below.

The initial rule set generation unit 140 first obtains from the variation measurement unit 12, for each named entity N_k belonging to the minimum-variation cluster CL1, the distributions that N_k has in the learning text as described for FIG. 2, that is, the surrounding part-of-speech sequence patterns P₁, P₂, ..., P_{m_k}. Each of these patterns is a candidate pattern for extracting the named entity N_k by pattern matching. From here on, to make clear that they are used as rules for extraction by pattern matching, these "patterns" are called "rules". Collecting these rules from every named entity N_k, the full set of rules obtained from all named entities belonging to cluster CL1 is denoted P_i (i = 1, 2, ..., n).

For each rule P_i (i = 1, 2, ..., n) obtained through the variation measurement unit 12 from all named entities N_k of cluster CL1, the initial rule set generation unit 140 applies a predetermined pattern conversion method to derive new rules that may be able to extract named entities, and generates the initial rule set by adding these new rules to the original rules P_i (i = 1, 2, ..., n). That is, if new rules P_ij (j = 1, 2, ..., F(i)) are derived from each original rule P_i (i = 1, 2, ..., n), the initial rule set is as follows.
{P₁, P₁₁, P₁₂, ..., P_{1F(1)}, P₂, P₂₁, ..., P_{2F(2)}, P₃, ..., P_n, ..., P_{nF(n)}} … (Formula 2)

The predetermined pattern conversion method is as follows. For the purpose of explanation, suppose that one original rule P_i for extracting the named entity N_k extracts the candidate word X when, as in the earlier example, the parts of speech of the two words before and the two words after X form the following predetermined sequence.
{C_{1[i]}, C_{2[i]}, X, C_{3[i]}, C_{4[i]}}_k

For each of the parts of speech C_{1[i]}, C_{2[i]}, C_{3[i]}, C_{4[i]} surrounding the word X, the concrete word that the part of speech corresponds to in the training text was already determined when the variation measurement unit 12 computed the distribution P_i of N_k, taking the word X to be the named entity N_k. That is, as shown in FIG. 2, there are f_k(P_i) concrete instances of the distribution P_i extracted from the training text, so for each of the parts of speech C_{1[i]}, C_{2[i]}, C_{3[i]}, C_{4[i]}, f_k(P_i) concrete words are available, counting duplicates. When several named entities N_k correspond to the distribution P_i, the concrete words obtained in this way from the training text for all of the corresponding named entities N_k are used.

Accordingly, as the predetermined pattern conversion method, new rules can be derived by replacing each of the parts of speech C_{1[i]}, C_{2[i]}, C_{3[i]}, C_{4[i]} with the concrete words found for it in the training text. If there are N_{1[i]}, N_{2[i]}, N_{3[i]}, N_{4[i]} distinct concrete words corresponding to the parts of speech C_{1[i]}, C_{2[i]}, C_{3[i]}, C_{4[i]}, respectively, then, counting for each position the option of leaving the part of speech unreplaced, the following number M of new rules can be derived from pattern P_i. The 1 is subtracted from the product to exclude the case in which none of the four parts of speech is replaced by a word, since that case is the original rule itself.
M = (N_{1[i]} + 1)(N_{2[i]} + 1)(N_{3[i]} + 1)(N_{4[i]} + 1) − 1

However, when a part of speech is "punctuation", there is no word to substitute for it, so no replacement is made at that position. For example, if the part of speech C_{4[i]} is "punctuation" and the other parts of speech are not, the replaceable parts of speech are the three C_{1[i]}, C_{2[i]}, C_{3[i]}, so the number M of new rules derived from pattern P_i is as follows.
M = (N_{1[i]} + 1)(N_{2[i]} + 1)(N_{3[i]} + 1) − 1
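The count M can be checked with a short calculation. The following is a minimal sketch, assuming the number of distinct words per position is already known; the function name `count_new_rules` and its arguments are illustrative and do not appear in the specification.

```python
from math import prod

def count_new_rules(word_counts, is_punctuation):
    """Number M of new rules derivable from one part-of-speech pattern.

    word_counts[p]    -- number of distinct concrete words seen at position p
    is_punctuation[p] -- True if the part of speech at position p is punctuation
    (Both arguments are hypothetical inputs chosen for this illustration.)
    """
    # Each replaceable position has N + 1 options: one of its N words,
    # or leaving the part of speech as it is. Punctuation is skipped.
    options = [n + 1 for n, punct in zip(word_counts, is_punctuation) if not punct]
    # Subtract 1 to exclude the combination that replaces nothing,
    # which is the original rule itself.
    return prod(options) - 1

# One word per position, fourth position is punctuation: 2*2*2 - 1
print(count_new_rules([1, 1, 1, 1], [False, False, False, True]))
# Four replaceable positions with 3, 2, 4 and 1 words: 4*3*5*2 - 1
print(count_new_rules([3, 2, 4, 1], [False, False, False, False]))
```

Because M grows as a product over positions, even modest word counts yield many candidate rules, which is one reason an upper limit on replacements can be useful.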

Alternatively, even when the part of speech is "punctuation", the replacement may be performed in the same way as for words, treating the concrete "word" as "。" (full stop) or "、" (comma).

In addition, the following processing may be applied during replacement by the predetermined pattern conversion method. If the number of times a word appears as a concrete instance of the part of speech at a given position in the rule is below a predetermined count, or below a predetermined fraction of the total f_k(P_i) pattern instances extracted from the training text as shown in FIG. 2, that word need not be used to replace the part of speech at that position in the rule.

Also, if performing the replacement exhaustively for all cases would produce an enormous number of new rules, a predetermined upper limit may be placed on the number of replacements.

FIG. 5 illustrates a concrete example of the rules newly obtained by the predetermined pattern conversion method. Here, example EX1 is assumed as the training text, and the following single rule P1 for extracting the named entity X, as described above, is assumed to have been obtained from the one (and only) concrete instance F1, 「私は阿比留です。」 ("I am Abiru."), of the distribution P1 derived from the named entity 「阿比留」 (Abiru).
P1 = {pronoun, particle, X, auxiliary verb, punctuation}
As shown in FIG. 5, assuming that "punctuation" is not replaced, replacing each "part of speech" in the original rule P1 with the corresponding word of instance F1 by the predetermined pattern conversion method yields 7 (= 2*2*2 − 1) new rules P1-1 to P1-7.
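The enumeration behind the FIG. 5 example can be sketched as follows. This is an illustrative reading, not the implementation of the specification: the function `specialize`, the tuple representation of rules, and the English part-of-speech labels are assumptions. For brevity, words and part-of-speech labels share one string type, which a real system would keep distinct.

```python
from itertools import product

def specialize(pattern, instances, skip_pos=("punctuation",)):
    """Enumerate new rules by replacing parts of speech with observed words.

    pattern   -- tuple of part-of-speech labels, with "X" marking the candidate word
    instances -- list of word tuples aligned with pattern (concrete examples)
    skip_pos  -- parts of speech that are never replaced (assumed: punctuation)
    """
    choices = []
    for index, pos in enumerate(pattern):
        if pos == "X" or pos in skip_pos:
            choices.append([pos])                 # this slot is never replaced
        else:
            words = sorted({inst[index] for inst in instances})
            choices.append([pos] + words)         # keep the label, or use an observed word
    # Drop the combination that replaces nothing: it is the original rule.
    return [rule for rule in product(*choices) if rule != tuple(pattern)]

P1 = ("pronoun", "particle", "X", "auxiliary verb", "punctuation")
F1 = ("私", "は", "阿比留", "です", "。")

new_rules = specialize(P1, [F1])
for rule in new_rules:
    print(rule)
print(len(new_rules))
```

With the single instance F1, each of the three replaceable slots has two options, so the sketch yields the same count of seven rules as P1-1 to P1-7.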

Using the initial rule set that the initial rule set generation unit 140 generated as described above from all named entities N_k belonging to the minimum-variation cluster CL1, the automatic generation unit 145 automatically generates the rule set for cluster CL1 by selecting, from within the initial rule set, the rules that perform well at extracting from the training text the named entities N_k belonging to CL1. This rule set is the general rule R100 of FIG. 3. The automatic generation process is as follows.

The initial rule set takes the form of (Formula 2) because of the way it is generated. For clarity, however, the rules are relabeled here as R_i (i = 1, 2, ..., n), and the initial rule set, consisting of these n rules, is written as follows.
(R₁, R₂, ..., R_n)

With this notation, the automatic generation unit 145 generates the rule set for cluster CL1 as the final output by selecting and retaining, from the initial rule set (R₁, R₂, ..., R_n), only the rules to be used for actual extraction. To do so, it prepares a predetermined number of candidate rule sets, changes them into rule sets that perform well at extracting named entities by a technique adapted from simulated annealing or genetic algorithms, and then chooses the final rule set from among them.

To describe this process, each of the candidate rule sets, which are prepared in a predetermined number and change successively, is called an "individual". A rule that an individual uses is written "1" and a rule that it does not use is written "0". For example, if the initial rule set consists of the three rules (R₁, R₂, R₃), then (1, 0, 1) denotes the individual that uses rules R₁ and R₃ and does not use rule R₂.

FIG. 6 is a flowchart of the processing performed by the automatic generation unit 145. In step S1, a predetermined number (m) of initial individuals are prepared from the initial rule set (R₁, R₂, ..., R_n) as follows: for each rule R_i, a "1" is generated at the position of R_i with a probability based on the performance of that individual rule when applied to the training text. Because the individuals are drawn probabilistically, the m individuals may include duplicates. Likewise, the perturbation in step S2 described later may produce identical individuals.

Here, a predetermined conversion formula may be used so that the better the performance, the higher the probability. The F-measure can be adopted as the performance score. The F-measure is computed after obtaining the precision and recall as in [Formula 1A] below.

[Formula 1A]
Precision = (number of named entities belonging to cluster CL1 extracted by rule R_i) / (number of words extracted by rule R_i)
Recall = (number of named entities belonging to cluster CL1 extracted by rule R_i) / (total number of named entities belonging to cluster CL1)
F-measure = 2 / (1 / Recall + 1 / Precision)
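[Formula 1A] can be sketched in code as follows. The function name `f_measure`, the use of sets of distinct words (rather than occurrence counts), and the zero-hit convention are assumptions made for this illustration; the sample words are hypothetical.

```python
def f_measure(extracted, gold):
    """Precision, recall and F-measure in the sense of [Formula 1A].

    extracted -- set of words extracted by a rule (or by a whole rule set)
    gold      -- set of named entities belonging to the target cluster
    """
    hits = len(extracted & gold)
    if hits == 0:
        # Assumed convention: the formula is undefined when precision or
        # recall is zero, so the score is reported as 0.
        return 0.0, 0.0, 0.0
    precision = hits / len(extracted)
    recall = hits / len(gold)
    f = 2 / (1 / recall + 1 / precision)   # harmonic mean of the two
    return precision, recall, f

extracted = {"阿比留", "小川", "東京", "昨日"}   # hypothetical rule output
gold = {"阿比留", "小川", "橋本"}                # hypothetical cluster members
precision, recall, f = f_measure(extracted, gold)
print(precision, recall, f)
```

Passing the union of the words extracted by all active rules as `extracted` gives the score of a whole rule set rather than of a single rule.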

For example, if the initial rule set consists of the three rules (R₁, R₂, R₃) and the probabilities determined from the F-measures of R₁, R₂, R₃ are 90%, 70%, and 30%, then the predetermined number m of individuals (x, y, z) are prepared so that x is "1" with probability 90%, y is "1" with probability 70%, and z is "1" with probability 30%.
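The sampling of step S1 can be sketched as follows, assuming that the per-rule probabilities have already been derived from the F-measures. The function name `initial_population` and the fixed random seed are illustrative choices, not part of the specification.

```python
import random

def initial_population(probabilities, m, rng):
    """Draw m individuals; bit i is 1 with probability probabilities[i]."""
    return [
        tuple(1 if rng.random() < p else 0 for p in probabilities)
        for _ in range(m)
    ]

rng = random.Random(0)   # seeded only to make the illustration repeatable
population = initial_population([0.9, 0.7, 0.3], m=5, rng=rng)
for individual in population:
    print(individual)
```

Since each individual is drawn independently, the same bit pattern can appear more than once in the population.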

Once the predetermined number m of initial individuals have been generated in step S1, processing proceeds to step S2. In step S2, each of the m individuals is perturbed. Perturbation means changing an individual by flipping the "1" or "0" at each position according to a predetermined probability. This flip is called a bit flip. The bit-flip probability may be defined as follows.

First, the overall performance x of each individual is computed as an F-measure in the same way as in [Formula 1A], except that the phrase "extracted by rule R_i" in [Formula 1A] is replaced with "extracted by applying all rules R_i that are set to '1' in the individual". In other words, the performance x that results from applying all the rules belonging to the individual is taken as the performance of the rule set that the individual defines.

The performance y_i of each rule R_i in the initial rule set has already been obtained in step S1. Using the overall performance x of the individual and the performance y_i of each rule R_i, the bit-flip probabilities are defined as follows.

(1) When rule R_i is used in the individual, the probability P_{1→0} of flipping the "1" of R_i to "0" so that it is no longer used:
P_{1→0} = {1 − f(x)} * {1 − g(y_i)} … (Formula 11)
(2) When rule R_i is not used in the individual, the probability P_{0→1} of flipping the "0" of R_i to "1" so that it is used:
P_{0→1} = {1 − F(x)} * G(y_i) … (Formula 12)
Here, in (Formula 11) and (Formula 12), each of the functions f(x), g(y_i), F(x), G(y_i) takes values in the range from 0 to 1, and any predetermined function that is monotonically increasing or non-decreasing in the input performance value can be used. These functions may be defined separately for each rule R_i.
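A single perturbation following (Formula 11) and (Formula 12) can be sketched as follows. The specification leaves f, g, F, and G open. Using the identity function for all four is an assumption made here for illustration, and the names `flip_probability` and `perturb` are hypothetical.

```python
import random

def identity(v):
    return v

def flip_probability(bit, x, y_i, f=identity, g=identity, F=identity, G=identity):
    """Bit-flip probability for one rule.

    bit -- current state of the rule in the individual (1 = used, 0 = unused)
    x   -- F-measure of the whole individual, in [0, 1]
    y_i -- F-measure of the single rule R_i, in [0, 1]
    """
    if bit == 1:
        return (1 - f(x)) * (1 - g(y_i))   # Formula 11: drop weak rules
    return (1 - F(x)) * G(y_i)             # Formula 12: add strong rules

def perturb(individual, x, rule_scores, rng):
    """Flip each bit independently with its flip probability (step S2)."""
    return tuple(
        1 - bit if rng.random() < flip_probability(bit, x, y) else bit
        for bit, y in zip(individual, rule_scores)
    )

rng = random.Random(0)
print(flip_probability(1, 0.5, 0.8))   # used, well-performing rule: rarely dropped
print(flip_probability(0, 0.5, 0.8))   # unused, well-performing rule: often added
print(perturb((1, 0, 1), 0.5, [0.9, 0.7, 0.3], rng))
```

Both probabilities carry the factor 1 − f(x) or 1 − F(x), so an individual that already scores well is changed less, which resembles the cooling behavior of simulated annealing.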

After the perturbation in step S2, processing proceeds to step S3, which determines whether a convergence condition is satisfied. The convergence condition can be [1] that the perturbation of step S2 has been applied a predetermined number of times, or [2] that the performance (F-measure) of the individual has reached a predetermined value or more. If the convergence condition is satisfied, processing proceeds to step S4. Otherwise, processing returns to step S2 and the perturbation continues.

When convergence condition [2] is used, the flow of FIG. 6 is such that, after the predetermined number of initial individuals have been prepared in step S1, the loop of steps S2 and S3 is executed separately for each individual. In this case, the number of perturbations may differ between individuals. For example, one individual may reach the convergence condition after three perturbations, while another reaches it after five. Conditions [1] and [2] may also be imposed together.
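The per-individual loop of steps S2 and S3 with both convergence conditions imposed can be sketched as follows. The function `run_until_converged` and the toy `evaluate` and `perturb` callables are assumptions for illustration. In practice they would be the F-measure of the rule set and the probabilistic bit flip described above.

```python
def run_until_converged(individual, evaluate, perturb, max_steps, target):
    """Repeat step S2 (perturb) and step S3 (check) for one individual.

    Stops when [1] max_steps perturbations have been applied, or
    [2] the individual's score reaches target.
    Returns the final individual and the number of perturbations applied.
    """
    steps = 0
    while True:
        individual = perturb(individual)        # step S2
        steps += 1
        if steps >= max_steps or evaluate(individual) >= target:   # step S3
            return individual, steps

# Toy stand-ins (hypothetical): the score is the fraction of bits set,
# and a perturbation sets the first unset bit.
def toy_evaluate(ind):
    return sum(ind) / len(ind)

def toy_perturb(ind):
    bits = list(ind)
    if 0 in bits:
        bits[bits.index(0)] = 1
    return tuple(bits)

print(run_until_converged((0, 0, 0, 0), toy_evaluate, toy_perturb, max_steps=10, target=0.75))
print(run_until_converged((1, 1, 0, 0), toy_evaluate, toy_perturb, max_steps=10, target=0.75))
```

The two calls show that individuals starting from different states may need different numbers of perturbations before condition [2] is met.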

In step S4, rules that are in an inclusion relationship are deleted from the rule set represented by each perturbed individual, and processing then proceeds to step S5. That is, if within a rule set the extraction result of rule R_i includes the extraction result of rule R_j, the included rule R_j is redundant with respect to the including rule R_i, so the bit of R_j is set to "0" and R_j is deleted. As an example of an inclusion relationship between rules, rule P1 of FIG. 5 includes all of the rules P1-1 to P1-7 newly derived from it.
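Step S4 can be sketched as follows, assuming that the set of words each rule extracts from the training text has been computed beforehand. The function name `remove_subsumed`, the tie-break for rules with identical results, and the sample words are assumptions of this illustration.

```python
def remove_subsumed(individual, extractions):
    """Clear the bit of every active rule whose result is included in another's.

    individual  -- tuple of 0/1 bits, one per rule
    extractions -- extractions[i] is the set of words rule i extracts
    """
    bits = list(individual)
    active = [i for i, bit in enumerate(bits) if bit]
    for j in active:
        for i in active:
            if i == j or not bits[i]:
                continue
            if extractions[j] <= extractions[i]:
                # Assumed tie-break: when two rules extract exactly the same
                # words, keep the one with the lower index.
                if extractions[j] == extractions[i] and j < i:
                    continue
                bits[j] = 0        # R_j is redundant with respect to R_i
                break
    return tuple(bits)

extractions = [
    {"阿比留", "小川", "橋本"},   # general rule
    {"阿比留"},                   # specialized rule, included in the first
    {"松本"},                     # unrelated rule
]
print(remove_subsumed((1, 1, 1), extractions))
print(remove_subsumed((0, 1, 1), extractions))
```

In the second call the general rule is inactive, so the specialized rule is kept: inclusion is judged only among the rules that the individual actually uses.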

When step S4 is complete, the automatic generation unit 145 holds the predetermined number of rule sets prepared in step S1 as they stand after the perturbation and subsequent processing. In step S5, the automatic generation unit 145 selects from these processed rule sets the one with the highest performance (F-measure) as the rule set for the minimum-variation cluster CL1 and makes it the output of the rule generation unit 14 (the general rule R100 of FIG. 3). The flow of FIG. 6 then ends.

Alternatively, in step S5, from among the rule sets whose performance (F-measure) is at or above a predetermined value, the one containing the fewest rules (the fewest bits set to "1") may be selected as the output. Whichever selection method is adopted, the general rule R100 as the final output applies all of the individual rules belonging to the selected rule set, in the same way that the performance of a rule set was defined in step S2.
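The two selection policies of step S5 can be sketched as follows. The function names `select_best` and `select_smallest`, and the sample population and scores, are hypothetical.

```python
def select_best(population, scores):
    """Policy 1: the rule set with the highest F-measure."""
    return max(zip(population, scores), key=lambda pair: pair[1])[0]

def select_smallest(population, scores, threshold):
    """Policy 2: among rule sets scoring at least threshold, the one that
    uses the fewest rules. Returns None if no rule set qualifies."""
    eligible = [ind for ind, s in zip(population, scores) if s >= threshold]
    if not eligible:
        return None
    return min(eligible, key=sum)   # sum of bits = number of rules used

population = [(1, 1, 1), (1, 0, 1), (0, 0, 1)]   # hypothetical individuals
scores = [0.82, 0.80, 0.55]                      # hypothetical F-measures
print(select_best(population, scores))
print(select_smallest(population, scores, threshold=0.75))
```

The second policy trades a small loss in F-measure for a shorter rule set, which may be easier for a person to inspect.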

This completes the description of how the rule generation unit 14 generates the rule set for the minimum-variation cluster CL1. Next, the generation of rule sets for the higher-variation clusters CL2 and CL3, which makes use of that result, is described. Since the generation method is the same for CL2 and CL3, it is described as rule set generation for a cluster CLi (= CL2 or CL3) standing for either of them.

In the rule generation unit 14, rule set generation for the higher-variation cluster CLi follows the same overall framework as for the minimum-variation cluster CL1. That is, after the initial rule set generation unit 140 prepares an initial rule set, the automatic generation unit 145 generates multiple initial individuals from it and processes them according to the flow of FIG. 6. The individual processing steps differ, however, as described below.

The initial rule set generation unit 140 generates the initial rule set for cluster CLi. The exception definition unit 141 and the specialization unit 142 implement the respective embodiments of this generation. Here, the rule set generated for the minimum-variation cluster CL1 is called the general rule set and is assumed to consist of the rules RG_i (i = 1, 2, ..., n) as follows.
(RG₁, RG₂, ..., RG_n)

The exception definition unit 141 derives exception rules RX_j (j = 1, 2, ..., m) for the rules RG_i in the general rule set and thereby generates the initial rule set as follows.
(RG₁, RG₂, ..., RG_n, RX₁, RX₂, ..., RX_m) … (Formula 20)

An exception rule takes a position specified as "part of speech a" in a rule RG_i of the general rule set and changes it to "part of speech a, except when it is the concrete word b". As explained with reference to FIG. 5, the "concrete word b" can be determined by referring to the concrete instances found when the distribution of each named entity N_k belonging to the minimum-variation cluster CL1 was obtained. Multiple exception rules RX_j (j = 1, 2, ..., m) can therefore be determined automatically by enumerating, for the positions specified as particular parts of speech in each rule RG_i of the general rule set, all cases in which a particular word is excluded from that part of speech.

FIG. 7 shows an example of deriving exception rules, corresponding to the example of FIG. 5. Here, the general rule set contains the same rule P1 as in FIG. 5, and that rule was obtained from the single concrete instance F1. The exception rules P2-1 to P2-7 are determined by excluding the cases in which a "part of speech" in rule P1 is the corresponding "word" of instance F1.
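The enumeration of exception rules can be sketched as follows. FIG. 7 itself is not reproduced here, so the slot representation `(part of speech, excluded word)`, the function name `exception_rules`, and the choice to leave punctuation untouched (inferred only from the count of seven rules P2-1 to P2-7) are all assumptions.

```python
from itertools import product

def exception_rules(pattern, instance, skip_pos=("punctuation",)):
    """Enumerate rules of the form "part of speech a, except word b".

    Each slot is a pair (pos, excluded_word); excluded_word is None when the
    slot accepts any word of that part of speech.
    """
    choices = []
    for pos, word in zip(pattern, instance):
        if pos == "X" or pos in skip_pos:
            choices.append([(pos, None)])               # no exception on this slot
        else:
            choices.append([(pos, None), (pos, word)])  # plain, or "except word"
    original = tuple((pos, None) for pos in pattern)
    # The combination with no exception at all is the general rule itself.
    return [rule for rule in product(*choices) if rule != original]

P1 = ("pronoun", "particle", "X", "auxiliary verb", "punctuation")
F1 = ("私", "は", "阿比留", "です", "。")

rules = exception_rules(P1, F1)
for rule in rules:
    print(rule)
print(len(rules))
```

Under these assumptions the count matches the seven exception rules of FIG. 7, mirroring the 2*2*2 − 1 count of the specialization example.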

If enumerating all cases would make the number of exception rules enormous, a predetermined upper limit may be placed on the number of exclusions.

The specialization unit 142 takes each rule RG_i (i = 1, 2, ..., n) of the general rule set and applies the same method (specialization) that was described with reference to FIG. 5 for the minimum-variation cluster CL1. It replaces positions specified as a "part of speech" with "specific words" (those found in the training text when the distribution of each named entity of cluster CL1 was obtained), thereby deriving multiple specialized rules RS_i (i = 1, 2, ..., k), and generates the initial rule set as in (Formula 21) below. The initial rule set generated by this specialization is a subset of the initial rule set for the minimum-variation cluster CL1 given by (Formula 2). If the number of rules would become enormous, a predetermined upper limit may be placed on the number of replacements with "specific words".
(RG₁, RG₂, ..., RG_n, RS₁, RS₂, ..., RS_k) … (Formula 21)

Instead of, or in addition to, the specialization described above, the specialization unit 142 may extend the first predetermined number of parts of speech (or words) before and after the named entity specified in each rule RG_i (i = 1, 2, ..., n) of the general rule set to a larger, second predetermined number of parts of speech before and after it. For example, if each rule RG_i is given, as in (Formula 1), as pattern matching on the two words before and the two words after the candidate word N, one word may be added on each side to extend it to pattern matching on the three words before and the three words after N.

This extension requires the concrete parts of speech at the extended positions to be determined. To find the part-of-speech patterns that can occur before and after the pattern specified by each rule RG_i, the training text may be searched again for part-of-speech patterns, using as the key the named entities in the minimum-variation cluster CL1 that correspond to each rule RG_i, and the extended patterns that exist for each rule RG_i may be determined from the result. Alternatively, one or more extended patterns may be defined in advance for the part-of-speech pattern of each rule RG_i by providing a predetermined correspondence table.

The generation of the initial rule set for the higher-variation cluster CLi has been described above separately for the exception definition unit 141 and the specialization unit 142. The two may also be used together, that is, (Formula 20) and (Formula 21) may be combined to define the initial rule set as in (Formula 22) below.
(RG₁, RG₂, ..., RG_n, RX₁, RX₂, ..., RX_m, RS₁, RS₂, ..., RS_k) … (Formula 22)

The automatic generation unit 145 generates the rule set for cluster CLi using the initial rule set that the initial rule set generation unit 140 generated for that cluster. Here too, processing follows the flow of FIG. 6, within the same framework as for the minimum-variation cluster CL1. The processing in each step is as follows.

In step S1, a predetermined number of initial individuals are prepared from the initial rule set of cluster CLi by the same processing as for the minimum-variation cluster CL1, and processing proceeds to step S2.

In step S2, each individual is perturbed, and processing proceeds to step S3. Here, in [Formula 1A], which defines the F-measure used as the performance score, "named entities belonging to cluster CL1" is replaced with "named entities belonging to cluster CLi". That is, performance is determined by how well the named entities belonging to the target cluster CLi are extracted.

In step S3, convergence is judged in the same way as for the minimum-variation cluster CL1. If the process has converged, it proceeds to step S4. If not, it returns to step S2. In step S4, rules in an inclusion relationship are deleted, again as for the minimum-variation cluster CL1. Finally, in step S5, the rule set that the automatic generation unit 145 outputs for extracting the named entities of cluster CLi is, as for the minimum-variation cluster CL1, either the perturbed rule set with the highest F-measure or, among those with an F-measure at or above a predetermined value, the one containing the fewest rules.

The generation process storage unit 146 records what modifications the initial rule set generation unit 140 and the automatic generation unit 145 applied to the rule set obtained for the minimum-variation cluster CL1 in order to arrive at the rule set obtained as described above for a higher-variation cluster. It provides this modification information to assist the user when manually correcting the rule sets. For example, it provides the modification information by presenting the original sentence containing a named entity, that is, the place where the named entity appears in the training text, together with the rule used to extract the named entity at that place.

This completes the description of how the rule generation device 10 generates a named entity extraction rule set for each cluster, where the clusters are determined in consideration of the variation of each named entity. FIG. 8 is a functional block diagram of an extraction device that extracts named entities from unknown text using the per-cluster named entity extraction rule sets generated by the rule generation device 10. The extraction device 20 includes an (extraction-side) morphological analysis unit 21, an (extraction-side) variation measurement unit 22, an (extraction-side) classification unit 23, and an extraction unit 24. The extraction device 20 is described below.

The morphological analysis unit 21 receives unknown text as input and, like the morphological analysis unit 11 of the rule generation device 10, performs morphological analysis on the text to identify each word and its part of speech. The input text is "unknown" in the sense that it carries no tags indicating named entities. The extraction device 20 extracts the named entities from the unknown text automatically.

Using the result of the morphological analysis of the unknown text by the morphological analysis unit 21, the variation measurement unit 22 identifies each word contained in the text, obtains the distribution of each word, and measures the variation of that distribution. This corresponds to performing, within the unknown text and for each word extracted from it, the same distribution computation as the variation measurement unit 12 of the rule generation device 10. Accordingly, when obtaining the distribution, the variation measurement unit 22 examines the predetermined number of words before and after each word, using the same predetermined number as the variation measurement unit 12 of the rule generation device 10.

The classification unit 23 ranks the variations measured for each word by the variation measurement unit 22 and classifies each word into a cluster by the same criteria as the classification unit 13 of the rule generation device 10. For example, if the classification unit 13 (on the rule generation device 10 side) ranked the specific expressions by magnitude of variation and performed cluster classification by dividing the graph of relative frequency against rank into predetermined intervals, the classification unit 23 should likewise perform cluster classification using the same predetermined intervals in the relative frequency graph.

Here, for the sake of explanation, both classification units 13 and 23 are assumed to have performed clustering into the three levels of variation, "small", "medium", and "large", as described with reference to FIG. 3. On the extraction device 20 side as well, these clusters are called CL1, CL2, and CL3, respectively.
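A rank-based split into the three clusters might look like the sketch below. It is a simplification under assumptions: the function name `classify_by_rank` is hypothetical, the predetermined intervals are expressed as fractions of the rank range (here thirds, chosen arbitrarily), and the relative frequency graph itself is not modeled.

```python
def classify_by_rank(variations, boundaries=(1 / 3, 2 / 3)):
    """Assign each word to CL1 (small), CL2 (medium), or CL3 (large variation).

    variations: dict mapping word -> measured variation.
    boundaries: hypothetical interval limits on the relative rank; the
        extraction side would reuse the limits used on the rule generation side.
    """
    ranked = sorted(variations, key=variations.get)  # smallest variation first
    n = len(ranked)
    clusters = {}
    for rank, word in enumerate(ranked):
        position = rank / n  # relative rank in [0, 1)
        if position < boundaries[0]:
            clusters[word] = "CL1"
        elif position < boundaries[1]:
            clusters[word] = "CL2"
        else:
            clusters[word] = "CL3"
    return clusters

# Hypothetical variation scores for six words.
example = {"a": 0.0, "b": 0.5, "c": 1.0, "d": 1.5, "e": 2.0, "f": 2.5}
clusters = classify_by_rank(example)
```

Passing the same `boundaries` on both sides reflects the requirement that the classification units 13 and 23 use the same criteria.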

The extraction unit 24 applies the per-cluster specific-expression extraction rule sets generated by the rule generation device 10 to the unknown text and extracts specific expressions. Specifically, it extracts words by applying the rule set corresponding to each cluster, and then judges an extracted word to be a specific expression if the cluster into which the classification unit 23 classified that word matches the cluster corresponding to the applied rule set.

For example, using the notation of FIG. 3, let the specific-expression extraction rule sets of clusters CL1, CL2, and CL3 be R100, R200, and R300, respectively. Suppose words W1, W2, and W3 are extracted by applying the rule sets R100, R200, and R300, respectively, to the unknown text. If the classification unit 23 has also classified W1, W2, and W3 as belonging to clusters CL1, CL2, and CL3, respectively, those words are judged to be specific expressions and become the final extraction result of the extraction device 20.

Conversely, suppose that applying the rule set R100 to the unknown text extracted the word W1, but the classification unit 23 classified W1 as belonging to CL2 or CL3 rather than cluster CL1. In that case, W1 is judged not to be a specific expression.
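The acceptance check in the two examples above can be sketched as follows. The function name `accept_candidates` and the dictionary-based inputs are hypothetical; applying the rule sets and classifying the words are assumed to have been done already.

```python
def accept_candidates(candidates_by_ruleset, ruleset_cluster, word_cluster):
    """Keep a candidate only if its own cluster matches the cluster of the
    rule set that extracted it.

    candidates_by_ruleset: dict rule set name -> words extracted by that rule set.
    ruleset_cluster: dict rule set name -> cluster the rule set was generated for.
    word_cluster: dict word -> cluster assigned by the classification step.
    """
    accepted = []
    for ruleset, words in candidates_by_ruleset.items():
        for word in words:
            if word_cluster.get(word) == ruleset_cluster[ruleset]:
                accepted.append(word)
    return accepted

ruleset_cluster = {"R100": "CL1", "R200": "CL2", "R300": "CL3"}

# First example: every extracted word falls in the matching cluster.
all_match = accept_candidates(
    {"R100": ["W1"], "R200": ["W2"], "R300": ["W3"]},
    ruleset_cluster,
    {"W1": "CL1", "W2": "CL2", "W3": "CL3"},
)

# Converse example: W1 was extracted by R100 but classified into CL2.
mismatch = accept_candidates({"R100": ["W1"]}, ruleset_cluster, {"W1": "CL2"})
```

In this sketch the cluster check acts as a second filter on top of pattern matching, so a word that merely happens to match a rule's pattern is dropped when its variation belongs to a different cluster.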

Note that the per-cluster rule sets used by the extraction device 20 need not be exactly as generated by the rule generation device 10; they may be rule sets that a user has manually modified after generation.

Supplementary matters of the present invention are described below. On the rule generation device 10 side, when handling the specific expression "立" as a name in the learning text EX1, a "立" that is not a specific expression may appear as part of a word obtained by morphological analysis, as in the word "市立" ("municipal"). In such a case too, information indicating that the occurrence is part of a word is attached as a flag, and the occurrence is counted as one of the distributions obtained by the variation measurement unit 12, being treated as a different distribution depending on the presence or absence of the flag even when the part-of-speech arrangement of the predetermined number of words before and after is the same.

Similarly, the "松本" in "松本病院" ("Matsumoto Hospital"), a specific expression denoting an affiliation, differs from "松本" ("Matsumoto") as a specific expression denoting a name. Nevertheless, when obtaining the distribution of "Matsumoto" as a name, the distribution is likewise obtained after attaching, as a flag, information indicating that the occurrence is part of the word "Matsumoto Hospital".

However, distributions that the variation measurement unit 12 has flagged as being part of another word or specific expression, as described above, are not used when the rule generation unit 14 generates the initial rule set. Accordingly, information that a specific expression is part of a word or of another specific expression is used only for cluster classification on the rule generation device 10 side; the extraction device 20 side does not use such information.

The rule generation device 10 and the extraction device 20 can process specific expressions in general. However, as one example in which differences in variation are expected to be more pronounced, the specific expressions may be limited to those representing personal information, in particular names. To apply this limitation, the corresponding restriction is applied to the specific expressions that are tagged when preparing the learning text to be input to the rule generation device 10.

The extraction device 20 may further output the originally input text with the extracted specific expressions, such as personal information, deleted. In that case, the text may be output in a form that makes the deleted portions apparent.
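A minimal sketch of such an output step is shown below. The function name `redact`, the token-list input, and the `[REDACTED]` marker are all assumptions made for illustration; the source only states that deleted portions may be made apparent.

```python
def redact(words, extracted, mask="[REDACTED]"):
    """Return the text with every extracted specific expression replaced by
    a visible marker, so the deleted positions remain apparent.

    words: the input text as a list of words from morphological analysis.
    extracted: set of words judged to be specific expressions.
    mask: hypothetical marker string; pass "" to delete without a trace.
    """
    return "".join(mask if word in extracted else word for word in words)

# Japanese text has no spaces between words, so the words are joined directly.
words = ["松本", "さん", "は", "退院", "した"]
masked = redact(words, {"松本"})
removed = redact(words, {"松本"}, mask="")
```

Replacing by word string, as done here, would also mask occurrences that were not accepted as specific expressions; a position-based variant would avoid that at the cost of tracking token indices.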

In the rule generation device 10 and the extraction device 20, the target text may be medical text.

In the present invention, each individual rule is defined to extract by pattern matching on, for example, the parts of speech of a predetermined number of words before and after a candidate word for a specific expression. This predetermined number before and after can be extended: in addition to the case of (Equation 1), where a predetermined number of pattern elements lie both before and after the candidate, the invention can also be practiced with pattern matching on a predetermined number of words surrounding the candidate word, including cases where the pattern elements exist only before or only after it. That is, in addition to the before-and-after case, the case where a pattern is defined only before the candidate with zero elements after it, and conversely the case where a pattern is defined only after the candidate with zero elements before it, may also be considered.
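The before-only and after-only cases can be handled by letting either side of the pattern be empty, as in the sketch below. The representation of a rule as a pair of part-of-speech tuples, the function names, and the tags `TITLE`, `NOUN`, and `VERB` are hypothetical; the exact form of (Equation 1) is not reproduced here.

```python
def matches(tokens, i, before, after):
    """Check whether the parts of speech around position i match the rule.

    tokens: list of (word, part_of_speech) pairs.
    before / after: tuples of parts of speech expected immediately before
        and after the candidate; an empty tuple means zero words on that side.
    """
    start = i - len(before)
    end = i + 1 + len(after)
    if start < 0 or end > len(tokens):
        return False  # not enough surrounding words to match the pattern
    pos = [p for _, p in tokens]
    return tuple(pos[start:i]) == tuple(before) and tuple(pos[i + 1:end]) == tuple(after)

def candidates(tokens, before, after):
    """Return every word whose surroundings match the rule."""
    return [tokens[i][0] for i in range(len(tokens)) if matches(tokens, i, before, after)]

tokens = [("Dr", "TITLE"), ("Sato", "NOUN"), ("visited", "VERB"),
          ("Dr", "TITLE"), ("Ito", "NOUN")]

both_sides = candidates(tokens, ("TITLE",), ("VERB",))  # before and after
before_only = candidates(tokens, ("TITLE",), ())        # zero words after
after_only = candidates(tokens, (), ("VERB",))          # zero words before
```

The before-only rule also catches a candidate at the end of the text, which the two-sided rule cannot match because no following word exists.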

DESCRIPTION OF SYMBOLS 10 ... rule generation device, 11 ... morphological analysis unit, 12 ... variation measurement unit, 13 ... classification unit, 14 ... rule generation unit, 140 ... initial rule set generation unit, 141 ... exception definition unit, 142 ... specialization unit, 145 ... automatic generation unit, 146 ... generation process storage unit, 20 ... extraction device, 21 ... morphological analysis unit, 22 ... variation measurement unit, 23 ... classification unit, 24 ... extraction unit

Claims

A rule generation device that generates extraction rules for specific expressions using a learning text to which tags indicating specific expressions have been given, the device comprising:
a morphological analysis unit that performs morphological analysis on the learning text to decompose it into words and determine the part of speech of each word;
a variation measurement unit that, for each specific expression indicated by the tags, obtains from the morphologically analyzed learning text a frequency distribution of patterns of the part-of-speech arrangement of a predetermined number of words surrounding the specific expression, and measures the variation of the frequency distribution;
a classification unit that classifies each specific expression indicated by the tags into a cluster based on the measured variation; and
a rule generation unit that, for each cluster classified based on the variation, generates the extraction rule as a set of rules for extracting the specific expressions belonging to the cluster by pattern matching on the parts of speech of a predetermined number of words surrounding a candidate word for a specific expression, based on the patterns obtained by the variation measurement unit and in consideration of a deletion result of the specific expressions for the learning text.

The rule generation device according to claim 1, wherein the rule generation unit comprises:
an initial rule set generation unit that, for each of the classified clusters, prepares as initial individuals a plurality of sets, each specified by the use or non-use of each of the rules; and
an automatic generation unit that obtains perturbed individuals by perturbing the use or non-use of each rule in each of the initial individuals, taking into account the extraction results, from the learning text, of the specific expressions corresponding to the cluster under each rule in use and under all the rules in use, selects one individual from among the perturbed individuals based on the extraction results of the specific expressions corresponding to the cluster from the learning text, and adopts the selected individual as the extraction rule.

The rule generation device according to claim 2, wherein:
the classification unit obtains a ranking of the measured magnitudes of variation and classifies each specific expression indicated by the tags into a cluster by applying predetermined divisions to the ranking;
the initial rule set generation unit prepares the initial individuals for the specific expressions belonging to the cluster with the smallest variation using, as the rules each to be used or not used, rules that extract a specific expression by matching the arrangement patterns obtained by the variation measurement unit for the specific expressions belonging to that cluster, and rules that extract a specific expression by matching patterns in which at least part of the parts of speech of an arrangement pattern is replaced with the words found when the arrangement pattern was obtained; and
the initial rule set generation unit prepares the initial individuals for the specific expressions belonging to clusters other than the cluster with the smallest variation from the extraction rule generated by the automatic generation unit for the cluster with the smallest variation, together with rules that are exceptions to that extraction rule and/or rules obtained by specializing that extraction rule.

The rule generation device according to claim 3, wherein the initial rule set generation unit comprises an exception definition unit that prepares the initial individuals for the specific expressions belonging to clusters other than the cluster with the smallest variation using exception rules, each exception rule being defined from the extraction rule generated by the automatic generation unit for the cluster with the smallest variation by excluding, from the cases in which a specific expression is extracted, the case where at least part of the parts of speech in the arrangement pattern of a rule of that extraction rule matches the words found when the variation measurement unit obtained the arrangement pattern.

The rule generation device according to claim 3, wherein the initial rule set generation unit comprises a specialization unit that prepares the initial individuals for the specific expressions belonging to clusters other than the cluster with the smallest variation from the extraction rule generated by the automatic generation unit for the cluster with the smallest variation, using (1) and/or (2) below:
(1) rules obtained by replacing, in the arrangement pattern of each rule of the extraction rule generated by the automatic generation unit for the cluster with the smallest variation, a part of speech of the arrangement pattern with the word found when the variation measurement unit obtained the arrangement pattern;
(2) rules obtained by increasing, in each rule of the extraction rule generated by the automatic generation unit for the cluster with the smallest variation, the predetermined number of surrounding words over which the rule is defined, based on the learning text or on a predetermined part-of-speech conversion rule.

An extraction device that extracts specific expressions from an input text using the extraction rules generated for each cluster by the rule generation device according to any one of claims 1 to 5, or such extraction rules as modified by a user, the device comprising:
a morphological analysis unit that performs morphological analysis on the input text to decompose it into words and determine the part of speech of each word;
a variation measurement unit that, for each word obtained by the morphological analysis, obtains from the morphologically analyzed input text a frequency distribution of patterns of the part-of-speech arrangement of a predetermined number of words surrounding the word, and measures the variation of the frequency distribution;
a classification unit that classifies each word into a cluster based on the measured variation; and
an extraction unit that applies the extraction rules generated for each cluster, or the extraction rules as modified by the user, to the input text, and determines that an extracted word is a specific expression when the classification unit has classified the extracted word into a cluster of the same variation as the cluster corresponding to the applied extraction rule.