JP5585961B2

JP5585961B2 - Predicate normalization apparatus, method, and program

Info

Publication number: JP5585961B2
Application number: JP2011066653A
Authority: JP
Inventors: 朋子泉; 賢治今村; 玄一郎菊井; 理史佐藤
Original assignee: Nagoya University NUC; Nippon Telegraph and Telephone Corp; Tokai National Higher Education and Research System NUC
Current assignee: Nagoya University NUC; Nippon Telegraph and Telephone Corp; Tokai National Higher Education and Research System NUC
Priority date: 2011-03-24
Filing date: 2011-03-24
Publication date: 2014-09-10
Anticipated expiration: 2031-03-24
Also published as: JP2012203584A

Description

The present invention relates to a predicate normalization apparatus, method, and program. In particular, it relates to a predicate normalization apparatus, method, and program that, in natural language processing, take the predicate at the core of a sentence's information, paraphrase the predicate's functional expressions into a simpler form, and judge grammaticality so that the paraphrased predicate is natural Japanese.

To obtain useful information from large volumes of documents such as meeting minutes, questionnaires, web text, and contact-center conversation logs, it is essential to extract the key information of "who did what" from the text, group it by elements with the same meaning, and tally the results. However, because the same content is written in many different ways, preprocessing is needed before a machine can group expressions with the same meaning. In particular, a predicate, which expresses the "did what" part, is a combination of content words (verbs, nouns, adjectives, adjectival nouns, adverbs) and function words (particles, auxiliary verbs, and the like), so its surface forms vary widely. For example, in the sentence below, 「故障しているのかも知れないわね」 ("it may well be broken, you know"), a single predicate carries one content word and five different function words. A group of one or more function words that together form a single unit of meaning is called a "functional expression".

故障し (content word) / ている (function word 1) / の (function word 2) /
かも知れない (function word 3) / わ (function word 4) / ね (function word 5)

Because a predicate is thus a combination of various elements, its surface forms vary widely, which makes it difficult for a machine to extract synonymous expressions.

One way to resolve this difficulty in extracting synonymous expressions is paraphrasing. For example, a predicate common in colloquial speech such as 「故障しているのかも知れないわね」 can be paraphrased into the simpler form 「故障しているかも知れない」 ("it may be broken") without changing the meaning the predicate expresses, so that a machine can judge that the two are synonymous. In other words, automatically paraphrasing predicates with the same meaning into the same surface form makes machine tallying possible. This process is called predicate normalization. Paraphrase-based preprocessing of this kind can be used not only for grouping predicates but also as preprocessing for summarization and machine translation.

As a technique for paraphrasing predicates into simpler forms, a method has been proposed that classifies the function words or functional expressions of a predicate using abstract semantic labels, with "whether or not the expression affects the meaning of the event the predicate describes" as the criterion, and normalizes the predicate by keeping the functional expressions classified as affecting the meaning (see, for example, Non-Patent Document 1). The method of Non-Patent Document 1 sets up three indicators as abstract semantic labels for classifying functional expressions, "difference in tense", "difference in negation", and "difference in modality", and deletes functional expressions that belong to none of them when normalizing a predicate. In addition, the auxiliary verb だ and the particle の are classified into a category called "Grammar", and a functional expression classified as "Grammar" that comes before or after a functional expression belonging to modality is kept rather than deleted.

Izumi T., Imamura K., Kikui G. & Sato S., "Standardizing Complex Functional Expressions in Japanese Predicates: Applying Theoretically-Based Paraphrasing Rules", Proceedings of the Workshop on Multiword Expressions: From theory to applications (MWE 2010), 63-71.

However, when only the functional expressions that affect the meaning of the event described by the predicate are kept and the rest are simply deleted, as in the technique of Non-Patent Document 1, the result may be a grammatically incorrect paraphrase.

For example, consider the predicate 「苦手なのかも知れないね」 ("[he] may be bad at it, you know").

苦手 (content word) / な (functional expression, G) / の (functional expression, G) /
かも知れない (functional expression)

As shown above, the predicate contains three functional expressions. "G" indicates that the functional expression is classified as "Grammar", the category for the auxiliary verb だ and the particle の. In the method of Non-Patent Document 1, functional expressions classified as "Grammar" that come before or after a functional expression belonging to modality are kept, and the others are deleted. In this example, therefore, as shown in FIG. 17, の, which precedes the modality expression かも知れない, is kept, while な, which has no modality expression before or after it, is deleted. The normalized predicate generated is then 「苦手のかも知れない」. The auxiliary verb だ (surface form な) is needed to connect 苦手, a noun (adjectival-noun stem), to the particle の, yet it has been deleted, so the generated paraphrase is not correct Japanese. Paraphrasing a predicate must satisfy two requirements: (1) the predicate is simplified without changing the meaning it expresses, and (2) the paraphrased predicate is correct Japanese. The error arises because requirement (2) is not taken into account.
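The deletion behaviour described above can be reproduced in a few lines of Python. This is only an illustrative sketch, not code from Non-Patent Document 1: the function name `prior_art_normalize`, the category codes "G" (Grammar) and "M" (modality), and the reading of "before or after" as "immediately adjacent" are assumptions made for the example. Expressions that the prior-art method would delete for other reasons (such as ね) are left out of the input.

```python
def prior_art_normalize(content, expressions):
    """Keep a Grammar ("G") expression only if a modality ("M") expression
    is immediately before or after it (assumed reading of the rule)."""
    kept = []
    for i, (surface, category) in enumerate(expressions):
        if category == "G":
            prev_is_modal = i > 0 and expressions[i - 1][1] == "M"
            next_is_modal = (i + 1 < len(expressions)
                             and expressions[i + 1][1] == "M")
            if not (prev_is_modal or next_is_modal):
                continue  # Grammar expression with no adjacent modality: deleted
        kept.append(surface)
    return content + "".join(kept)


# Functional expressions of 苦手なのかも知れない, as segmented in the text.
expressions = [("な", "G"), ("の", "G"), ("かも知れない", "M")]
result = prior_art_normalize("苦手", expressions)
print(result)
```

Here な is dropped because its only neighbour is の, so the output is the ungrammatical form the text describes.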

The present invention has been made in view of the above problem, and its object is to provide a predicate normalization apparatus, method, and program capable of producing predicate paraphrases that are both simple and grammatically correct.

To achieve the above object, the predicate normalization apparatus of the present invention comprises: morphological analysis means for performing morphological analysis on an input document; label assignment means for assigning, based on the result of the morphological analysis by the morphological analysis means, a judgment label to each functional expression that is contained in a predicate of the document and whose grammatical necessity varies depending on the surrounding words, the judgment label indicating that the grammatical necessity varies; calculation means for calculating a score indicating whether the functional expressions given the judgment label are grammatically necessary, for each morpheme sequence forming the predicate when at least one of the functional expressions given the judgment label is included in the predicate and for the morpheme sequence forming the predicate when every functional expression given the judgment label is removed, based on a morpheme N-gram model built from pseudo-words that take the surface form of the morpheme as an element for function words and do not take the surface form as an element for words other than function words; and generation means for generating a normalized predicate from the morpheme sequence selected based on the score calculated by the calculation means.

According to the predicate normalization apparatus of the present invention, the morphological analysis means performs morphological analysis on the input document, and the label assignment means, based on the result of that analysis, assigns a judgment label to each functional expression that is contained in a predicate of the document and whose grammatical necessity varies depending on the surrounding words, the label indicating that the grammatical necessity varies. The calculation means then calculates, based on a morpheme N-gram model, a score indicating whether the functional expressions given the judgment label are grammatically necessary, for each morpheme sequence forming the predicate when at least one of those functional expressions is included in the predicate and for the morpheme sequence forming the predicate when all of them are removed. This morpheme N-gram model is an N-gram model built from pseudo-words that take the surface form of the morpheme as an element for function words and do not take the surface form as an element for words other than function words. The generation means then generates a normalized predicate from the morpheme sequence selected based on the score calculated by the calculation means.

Because the grammaticality of functional expressions whose grammatical necessity varies with the surrounding words is judged using a morpheme N-gram model in this way, predicates can be paraphrased simply and grammatically correctly. Moreover, by using a morpheme N-gram model built from pseudo-words that take the morpheme's surface form as an element for function words and do not for other words, fluctuation of the score caused by variation in the surface forms of the non-function words in the predicate, that is, the content words, is suppressed, while the surface forms of the functional expressions can still serve as a basis for the grammaticality judgment, so the judgment is highly accurate.
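The pseudo-word idea can be sketched as follows. The text only states that function words keep their surface form and other words do not; the exact make-up of a pseudo-word used here (part of speech plus conjugation type, joined with "|") is an assumption, as are the `Morpheme` class, the `pseudo_word` function, the tag strings, and the second content word 得意, which is added purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    surface: str            # surface form as it appears in the text
    pos: str                # part of speech (hypothetical tag strings)
    conj_type: str          # conjugation type, "*" when not applicable
    is_function_word: bool


def pseudo_word(m: Morpheme) -> str:
    """Function words keep their surface form; content words are reduced
    to grammatical information only (assumed: POS and conjugation type)."""
    if m.is_function_word:
        return f"{m.surface}|{m.pos}|{m.conj_type}"
    return f"{m.pos}|{m.conj_type}"


nigate = Morpheme("苦手", "noun-adjectival-noun-stem", "*", False)
tokui = Morpheme("得意", "noun-adjectival-noun-stem", "*", False)
na = Morpheme("な", "auxiliary-verb", "da-type", True)
no = Morpheme("の", "particle", "*", True)

print([pseudo_word(m) for m in (nigate, na, no)])
```

Two content words with the same grammatical information map to the same pseudo-word, so an N-gram model trained on such pseudo-words is not affected by which content word happens to appear, while な and の remain distinguishable.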

The calculation means can calculate the score based on both the morpheme N-gram model and a part-of-speech N-gram model built from pseudo-words whose elements exclude the surface form of the morpheme. Using the part-of-speech N-gram model together with the morpheme N-gram model further improves the accuracy of the grammaticality judgment for the target functional expressions.

The label assignment means can assign to each functional expression contained in the predicate a semantic label that indicates whether the functional expression affects the meaning of the predicate, the semantic labels including the judgment label and an unnecessary label indicating that the functional expression is semantically and grammatically unnecessary. The calculation means can then calculate the score using a predicate from which the functional expressions given the unnecessary label have been deleted and from which, among the functional expressions given the same semantic label, all but at least one have been deleted. This yields a more simplified predicate.

The predicate normalization method of the present invention is a method that causes a computer to execute processing including: performing morphological analysis on an input document; assigning, based on the result of the morphological analysis, a judgment label to each functional expression that is contained in a predicate of the document and whose grammatical necessity varies depending on the surrounding words, the judgment label indicating that the grammatical necessity varies; calculating a score indicating whether the functional expressions given the judgment label are grammatically necessary, for each morpheme sequence forming the predicate when at least one of the functional expressions given the judgment label is included in the predicate and for the morpheme sequence forming the predicate when every functional expression given the judgment label is removed, based on a morpheme N-gram model built from pseudo-words that take the surface form of the morpheme as an element for function words and do not take the surface form as an element for words other than function words; and generating a normalized predicate from the morpheme sequence selected based on the calculated score.

The predicate normalization program of the present invention is a program for causing a computer to function as each of the means constituting the predicate normalization apparatus described above.

As described above, the predicate normalization apparatus, method, and program of the present invention judge the grammaticality of functional expressions whose grammatical necessity varies with the surrounding words by using a morpheme N-gram model built from pseudo-words that take the morpheme's surface form as an element for function words and do not for other words, and therefore have the effect that predicates can be paraphrased simply and grammatically correctly.

FIG. 1 is a block diagram showing the functional configuration of the predicate normalization apparatus of this embodiment.
FIG. 2 is a diagram showing an example of a morphological analysis result.
FIG. 3 is a diagram showing an example of the function-word semantic label dictionary.
FIG. 4 is a diagram showing an example of the semantic labels belonging to "Grammar".
FIG. 5 is a diagram showing an example of the semantic label assignment and predicate extraction result.
FIG. 6 is a diagram showing an example of the processing result of the NULL deletion unit.
FIG. 7 is a block diagram showing the functional configuration of the Ngram grammaticality determination unit.
FIG. 8 is a diagram showing an example of a conventional morpheme Ngram model.
FIG. 9 is a diagram showing an example of the part-of-speech Ngram model.
FIG. 10 is a diagram comparing paraphrase accuracy as the mixing ratio of the conventional morpheme Ngram and the part-of-speech Ngram is varied.
FIG. 11 is a diagram showing an example of the morpheme Ngram model of this embodiment.
FIG. 12 is a diagram comparing paraphrase accuracy as the mixing ratio of the morpheme Ngram of this embodiment and the part-of-speech Ngram is varied.
FIG. 13 is a diagram showing an example of the Ngram score calculation result for each candidate.
FIG. 14 is a flowchart showing the predicate normalization processing routine in the predicate normalization apparatus of this embodiment.
FIG. 15 is a diagram showing another example of the processing result of the NULL deletion unit.
FIG. 16 is a diagram showing another example of the Ngram score calculation result for each candidate.
FIG. 17 is a diagram for explaining the deletion of functional expressions in the prior art.

Embodiments of the present invention are described in detail below with reference to the drawings.

The predicate normalization apparatus 10 according to this embodiment is a computer comprising a CPU, a RAM, and a ROM storing various data and a program for executing the predicate normalization processing routine described later. Functionally, as shown in FIG. 1, the computer can be represented as a configuration including: a morphological analysis unit 12 that performs morphological analysis on an input document (text data); a semantic label assignment / predicate extraction unit 16 that uses a semantic label / predicate model 14 to assign semantic labels of functional expressions to the morphological analysis result and extract the predicates of the document; a NULL deletion unit 18 that deletes functional expressions given the semantic label "NULL"; a redundancy rule application unit 20 that deletes all but one of the functional expressions given the same semantic label; an Ngram grammaticality determination unit 22 that uses Ngram models to judge whether functional expressions are grammatically necessary; and a conjugation generation unit 24 that generates a predicate from the morphemes remaining after the unnecessary functional expressions are deleted.

The morphological analysis unit 12 performs morphological analysis on the input document sentence by sentence using a known morphological analyzer. Morphological analysis splits a sentence into words and attaches information such as part of speech, conjugation type, and conjugated form to each word. FIG. 2 shows an example of the morphological analysis result when one sentence of the input document is 「彼は歌が苦手なのかも知れないね」 ("he may be bad at singing, you know").

The semantic label / predicate model 14 is a model for assigning to each functional expression a semantic label that indicates whether it affects the meaning of the event described by the predicate, and for extracting predicates from the input document. The model is generated by learning the likelihood of predicate ranges and of semantic label sequences, using as training data a corpus in which correct semantic labels were manually assigned to function words on the basis of a function-word semantic label dictionary. The function-word semantic label dictionary defines the functional expressions considered to affect the meaning of a predicate; for example, as shown in FIG. 3, they are classified according to the three indicators "difference in tense", "difference in negation", and "difference in modality" and stored in advance. There are also functional expressions classified into the category "Grammar", which means "an element that is not semantically important but is grammatically necessary depending on the surrounding words". FIG. 4 shows the semantic labels belonging to "Grammar". The typical function words belonging to "Grammar" are the auxiliary verb だ and the particle の; here, according to their use, だ is given the semantic label "judgment" (判断) and の the semantic label "nominalization" (名詞化).

The semantic label assignment / predicate extraction unit 16 automatically assigns functional-expression semantic labels to the morphological analysis result by a statistical method using the semantic label / predicate model 14, and also extracts the range of each predicate. FIG. 5 shows an example of the semantic label assignment and predicate extraction result. Because a predicate consists of a sequence of one or more content words (C) and zero or more function words (F), 「苦手なのかも知れないね」 is extracted as the predicate. This predicate has 苦手 as its content word and the four functional expressions な / の / かも知れない / ね.

This example uses BI tags. Among the content words of the predicate, the first word (苦手) is labeled "C,B-PRED", and any content word other than the first (none in this example) is labeled "C,I-PRED". "C" is the initial of "content words". Function words are labeled "F" (for "function words") and, in addition, the applicable semantic label is assigned per functional expression. The functional expression かも知れない consists of the function words かも, 知れ, and ない, so each of these is given the label "F" indicating a function word together with the semantic label "conjecture" (推量) corresponding to かも知れない. As with content words, when a function word is the first word (かも) of a functional expression (かも知れない), "B" is prefixed to the semantic label of the functional expression containing it, as in "B-conjecture"; when it is a non-initial word (知れ, ない), "I" is prefixed, as in "I-conjecture". Function words belonging to "Grammar" (here the auxiliary verb だ and the particle の) are likewise given the label "F", the label "B" or "I" indicating initial or non-initial position in the functional expression, and the semantic label "judgment" or "nominalization" according to the word's use. Function words that are neither registered in the function-word semantic label dictionary nor belong to "Grammar" are likewise given the label "F", the label "B" or "I", and the label "NULL".
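The tagging scheme just described can be illustrated with a short Python sketch. The function name `bi_tags`, the input layout, and the English label strings are hypothetical; in the apparatus the labels come from the statistical model 14, whereas here they are supplied by hand. The "NULL" label for ね is an assumption consistent with ね being absent from the later candidates.

```python
def bi_tags(units):
    """units: list of (kind, label, words), where kind is "C" or "F",
    label is "PRED" for content words or a semantic label for a
    functional expression, and words are the words of that unit."""
    tagged = []
    for kind, label, words in units:
        for i, word in enumerate(words):
            prefix = "B" if i == 0 else "I"  # first word vs. later words
            tagged.append((word, f"{kind},{prefix}-{label}"))
    return tagged


# 苦手なのかも知れないね, with labels assigned by hand for illustration.
units = [
    ("C", "PRED", ["苦手"]),
    ("F", "judgment", ["な"]),            # 判断
    ("F", "nominalization", ["の"]),      # 名詞化
    ("F", "conjecture", ["かも", "知れ", "ない"]),  # 推量
    ("F", "NULL", ["ね"]),
]
tags = bi_tags(units)
for word, tag in tags:
    print(word, tag)
```

The B/I prefix is what later lets the redundancy rule treat かも, 知れ, and ない as one functional expression rather than three.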

The NULL deletion unit 18 deletes functional expressions given the semantic label "NULL". The "NULL" label indicates that the expression is not defined in the function-word semantic label dictionary as a functional expression affecting the meaning of the predicate, and is also not classified into the "Grammar" category of elements that are not semantically important but may be grammatically necessary depending on the surrounding words. Such functional expressions can be deleted without any problem, so they are deleted.

The redundancy rule application unit 20 deletes all but one of the functional expressions given the same semantic label. This removes unnecessary functional expressions and eliminates redundancy in the predicate. FIG. 6 shows an example of the result after NULL deletion and redundancy rule application. The redundancy rule is applied per functional expression. In the example of FIG. 6, 知れ and ない carry the same semantic label, but each is part of the single functional expression かも知れない, so neither is deleted.
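A minimal sketch of NULL deletion followed by the redundancy rule is given below. The function names and data layout are hypothetical. The text says only that one expression per semantic label is kept; keeping the first occurrence is an assumption of this sketch.

```python
def delete_null(expressions):
    """Drop functional expressions whose semantic label is "NULL"."""
    return [(words, label) for words, label in expressions if label != "NULL"]


def apply_redundancy_rule(expressions):
    """Keep one functional expression per semantic label.
    Assumption: the first occurrence is the one that is kept."""
    seen = set()
    kept = []
    for words, label in expressions:
        if label in seen:
            continue
        seen.add(label)
        kept.append((words, label))
    return kept


# Functional expressions of 苦手なのかも知れないね, one entry per expression,
# so かも / 知れ / ない stay together and are never compared with each other.
expressions = [
    (["な"], "judgment"),
    (["の"], "nominalization"),
    (["かも", "知れ", "ない"], "conjecture"),
    (["ね"], "NULL"),
]
result = apply_redundancy_rule(delete_null(expressions))
print(result)
```

Working at the level of whole functional expressions, rather than individual words, is what prevents 知れ and ない from being treated as duplicates of かも.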

The Ngram grammaticality determination unit 22 judges which of the functional expressions belonging to "Grammar" are necessary and which are unnecessary. As shown in FIG. 7, the Ngram grammaticality determination unit 22 can be represented as a configuration including: a candidate generation unit 22a that creates all combination candidates in which the functional expressions subject to the grammaticality judgment are kept or removed; an Ngram score calculation unit 22d that calculates an Ngram score for each generated candidate based on a morpheme Ngram model 22b and a part-of-speech Ngram model 22c; and a selection unit 22e that selects one of the candidates based on the Ngram scores.

For a predicate from which unnecessary functional expressions have been removed by the NULL deletion unit 18 and the redundancy rule application unit 20, the candidate generation unit 22a creates all combinations of candidate morpheme sequences in which the function words given the "judgment" or "nominalization" semantic label by the semantic label assignment / predicate extraction unit 16 (that is, the functional expressions belonging to "Grammar") are kept or removed. Here the target function words are な (judgment) and の (nominalization), so four candidates are generated: 苦手 / かも / 知れ / ない, with both deleted; 苦手 / の / かも / 知れ / ない, with な deleted and の kept; 苦手 / な / かも / 知れ / ない, with の deleted and な kept; and 苦手 / な / の / かも / 知れ / ない, with both kept.
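Enumerating every keep/remove combination is a small exercise with `itertools.product`. The following sketch uses a hypothetical function `generate_candidates` that takes the morpheme sequence and the positions of the target function words; it is one possible way to realize the enumeration, not the implementation of unit 22a.

```python
from itertools import product


def generate_candidates(morphemes, target_indices):
    """Return every morpheme sequence obtained by keeping or removing
    each target function word (2 ** len(target_indices) candidates)."""
    targets = sorted(target_indices)
    candidates = []
    for keep_flags in product([False, True], repeat=len(targets)):
        dropped = {i for i, keep in zip(targets, keep_flags) if not keep}
        candidates.append(
            [m for i, m in enumerate(morphemes) if i not in dropped]
        )
    return candidates


morphemes = ["苦手", "な", "の", "かも", "知れ", "ない"]
candidates = generate_candidates(morphemes, {1, 2})  # な and の are targets
for c in candidates:
    print(" / ".join(c))
```

With two target function words this produces the four candidates listed in the text, in the same order.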

The morpheme Ngram model 22b and the part-of-speech Ngram model 22c (hereinafter simply "the Ngram models" when the two are described together or without distinction) are models that assign to an input word sequence a probability-based score of its "likelihood as a word sequence" (see, for example, Kenji Kita, Satoshi Nakamura, and Masaaki Nagata, 「音声言語処理−コーパスに基づくアプローチ」 (Spoken Language Processing: A Corpus-Based Approach), Morikita Publishing, Section 2.4). To score a sequence with an arbitrary number of words, an Ngram model obtains the generation probability of each word from a sequence of N words (called an Ngram; a Trigram when N = 3) and takes the product of these probabilities over the entire word sequence. In this embodiment, the logarithm of a probability is called a score, and the score of the entire input word sequence is called the Ngram score. The generation probability of each Ngram is learned from a corpus in advance.
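As a rough illustration of the scoring just described, the sketch below sums log probabilities over a word sequence, which is equivalent to taking the logarithm of the product of the probabilities. The function name `ngram_score`, the `<s>` start-of-sequence padding, the lookup-table layout, and the floor value for unseen Ngrams are all assumptions; the text does not say how sequence boundaries or unseen Ngrams are handled. The probabilities in the usage lines are made-up placeholder values, not learned ones.

```python
import math


def ngram_score(words, prob, n=3, floor=1e-6):
    """Sum of log generation probabilities over the sequence.
    prob maps (context_tuple, word) to a probability; unseen entries
    fall back to `floor` (an assumption of this sketch)."""
    padded = ["<s>"] * (n - 1) + list(words)
    score = 0.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])   # the n-1 preceding words
        score += math.log(prob.get((context, padded[i]), floor))
    return score


# Placeholder trigram table for a two-word sequence.
toy_prob = {
    (("<s>", "<s>"), "a"): 0.5,
    (("<s>", "a"), "b"): 0.25,
}
print(ngram_score(["a", "b"], toy_prob))  # log(0.5) + log(0.25)
```

Working in log space turns the product into a sum and avoids numerical underflow for long sequences.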

The Ngram score calculation unit 22d calculates an Ngram score for each candidate generated by the candidate generation unit 22a, using the Ngram models. When a single Ngram score is calculated from two or more Ngram models, as in the present embodiment, it can be calculated, for example, by equation (1) below.

log P = α log P_a + (1 − α) log P_b   ... (1)

Here, log P is the final Ngram score representing the "likelihood as a word string", log P_a is the Ngram score calculated from the morpheme Ngram model 22b, log P_b is the Ngram score calculated from the part-of-speech Ngram model 22c, and α is the mixing ratio of the two. This Ngram score is used to judge the grammaticality of functional expressions belonging to "Grammar".
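Equation (1) is a linear interpolation of two log scores. A minimal sketch follows; the function name and the sample score values are invented for illustration and do not come from the patent's figures.

```python
import math

def mixed_ngram_score(log_p_a, log_p_b, alpha):
    """Equation (1): log P = alpha * log P_a + (1 - alpha) * log P_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * log_p_a + (1.0 - alpha) * log_p_b

# Invented example values: morpheme-model score -10, POS-model score -20.
mixed = mixed_ngram_score(-10.0, -20.0, alpha=0.8)
print(mixed)  # approximately -12.0
```

With α = 1 the score depends only on the morpheme Ngram model, and with α = 0 only on the part-of-speech Ngram model.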

The principle behind the grammaticality judgment of functional expressions belonging to "Grammar" in the present embodiment is now described.

When the input word string has already been morphologically analyzed, the "likelihood as a word string" can be calculated more accurately by also using information such as each morpheme's part of speech and conjugation type, rather than using only the surface forms of the morphemes. For example, a conventional morpheme Ngram model is built from pseudo-words that bundle each morpheme's surface form, part of speech, and conjugation type, while the part-of-speech Ngram model 22c is built from pseudo-words that bundle only the part of speech and conjugation type. The two Ngram scores are interpolated with a suitable mixing ratio α, and the final Ngram score is calculated with equation (1). FIG. 8 shows an example of a conventional morpheme Ngram model for N = 3, and FIG. 9 shows an example of the part-of-speech Ngram model 22c for N = 3.

Language models used in machine translation and similar applications calculate the Ngram score by combining this conventional morpheme Ngram model and the part-of-speech Ngram model 22c at a suitable mixing ratio (for example, α = 0.8, that is, model a : model b = 0.8 : 0.2). However, when this technique is applied to the grammaticality judgment of functional expressions that the present embodiment targets, the paraphrase accuracy for predicates varies widely with the mixing ratio, as shown in FIG. 10.

To judge grammaticality more correctly, it would seem appropriate to raise the share of the conventional morpheme Ngram model, whose elements include surface forms. For the grammaticality judgment of functional expressions, however, raising the mixing ratio of the conventional morpheme Ngram model (that is, bringing α close to 1) lowers accuracy. This is thought to be an adverse effect of the content words in the predicate. The predicates targeted by the present invention include content words, and content words have many more distinct surface forms expressing the same meaning than function words do, so the scores of the conventional morpheme Ngram model fluctuate. The cause is that the data available for learning the Ngram probabilities is insufficient, so the scores cannot be calculated correctly. Learning correctly without running into these problems would require a large amount of training data.

Therefore, in the present embodiment, the Ngram score calculation for the grammaticality judgment of functional expressions uses an Ngram model (the morpheme Ngram model 22b) built from pseudo-words that include the surface form for the function words constituting a functional expression and exclude the surface form for all other words. For the functional expressions whose grammaticality is to be judged, including the surface form keeps important information in the Ngram model and improves its accuracy. For content words, which have little influence on the grammaticality judgment and whose surface forms vary widely, excluding the surface form avoids a loss of accuracy in the Ngram model.

FIG. 11 shows an example of the morpheme Ngram model 22b in the present embodiment. Which types of morpheme keep their surface form is decided according to the location where grammaticality is to be judged. Here, a morpheme is treated as a function word when its part of speech contains any of the categories "particle (助詞), auxiliary verb (助動詞), filler (フィラー), other (その他), symbol (記号), non-independent (非自立), special (特殊), suffix (接尾), conjunction-like (接続詞的), verb-non-independent-like (動詞非自立的)", and its surface form is used as an element of the morpheme Ngram model 22b. As shown in FIG. 11, the surface forms of content words such as "verb-independent" (動詞−自立) and "adjective-independent" (形容詞−自立) are unified to "*", while for morphemes belonging to "auxiliary verb" or "particle" the surface form is also used as an element. Because function words have fewer types than content words, the model can be learned from a limited amount of data, and sequences of function-word surface forms can still be incorporated into the morpheme Ngram model 22b. Using these pseudo-words, the morpheme Ngram model 22b is trained with an existing training tool. The part-of-speech Ngram model 22c is likewise trained with an existing training tool, using pseudo-words whose elements are the part of speech and conjugation type.
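The pseudo-word construction described above can be sketched as follows. The category list is taken from the text, but the substring test on the part-of-speech string, the `surface/pos/conjugation` string layout, and the function names are assumptions of this sketch; FIG. 11 is not reproduced here, so the actual pseudo-word format may differ.

```python
# Part-of-speech categories that the description treats as function words.
FUNCTION_WORD_TAGS = ("助詞", "助動詞", "フィラー", "その他", "記号",
                      "非自立", "特殊", "接尾", "接続詞的", "動詞非自立的")

def is_function_word(pos):
    """True if the POS string contains any function-word category (assumed substring test)."""
    return any(tag in pos for tag in FUNCTION_WORD_TAGS)

def morpheme_pseudo_word(surface, pos, conj_type):
    """Pseudo-word for the morpheme Ngram model 22b:
    surface form kept for function words, replaced by "*" otherwise."""
    shown = surface if is_function_word(pos) else "*"
    return f"{shown}/{pos}/{conj_type}"

def pos_pseudo_word(pos, conj_type):
    """Pseudo-word for the part-of-speech Ngram model 22c: no surface form at all."""
    return f"{pos}/{conj_type}"

analyzed = [("苦手", "名詞-形容動詞語幹", ""), ("かも", "助詞-副助詞", ""),
            ("知れ", "動詞-自立", "一段"), ("ない", "助動詞", "特殊・ナイ")]
morph_seq = [morpheme_pseudo_word(*m) for m in analyzed]
pos_seq = [pos_pseudo_word(p, c) for _, p, c in analyzed]
print(morph_seq)
print(pos_seq)
```

The same conversion would be applied both when training the model and when scoring candidates, so that the vocabulary seen at scoring time matches the one learned from the corpus.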

As shown in FIG. 12, using the Ngram models of the present embodiment suppresses the variation in accuracy caused by the mixing ratio and keeps paraphrase accuracy high.

When an Ngram score is calculated with these Ngram models, the score based on the morpheme Ngram model 22b is calculated in the same way as during training: the surface form is included as an element for function words and excluded for all other words.

The Ngram score calculation unit 22d appends a period 「。」 to the end of each candidate generated by the candidate generation unit 22a and then calculates the Ngram score of each candidate using the Ngram models. The period is appended in order to find "the most plausible word sequence among predicates that appear at the end of a sentence (that is, in the sentence-final form)". FIG. 13 shows an example of the Ngram scores calculated for each candidate.

The selection unit 22e selects the candidate with the highest Ngram score among the scores calculated by the Ngram score calculation unit 22d and outputs the morpheme sequence constituting the selected candidate. In the example of FIG. 13, the candidate 「苦手／かも／知れ／ない」, from which all function words belonging to "Grammar" have been deleted, has the highest Ngram score and is therefore selected. As a result, for the predicate of the input sentence 「彼は歌が苦手なのかも知れないね。」 ("He may not be good at singing."), the combination 「苦手かも知れない」, which is the simplest grammatically correct paraphrase, is output.
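The scoring-and-selection step (append a period, score, take the maximum) might look like the following sketch. The score table is invented purely for illustration and does not reproduce the values of FIG. 13; `select_best` and `toy_score` are hypothetical names, and in practice the score function would be the mixed Ngram score of equation (1).

```python
def select_best(candidates, score_fn):
    """Return the candidate whose score, computed with a sentence-final
    period appended, is highest."""
    return max(candidates, key=lambda cand: score_fn(cand + ["。"]))

# Invented scores for illustration only (NOT the values of FIG. 13).
toy_scores = {
    "苦手かも知れない。": -11.2,
    "苦手のかも知れない。": -15.8,
    "苦手なかも知れない。": -17.1,
    "苦手なのかも知れない。": -12.4,
}

def toy_score(morphemes):
    """Stand-in for the Ngram score: a table lookup on the joined string."""
    return toy_scores["".join(morphemes)]

candidates = [
    ["苦手", "かも", "知れ", "ない"],
    ["苦手", "の", "かも", "知れ", "ない"],
    ["苦手", "な", "かも", "知れ", "ない"],
    ["苦手", "な", "の", "かも", "知れ", "ない"],
]
best = select_best(candidates, toy_score)
print("／".join(best))
```

The period is added only for scoring; the returned morpheme sequence itself is left unchanged for the conjugation generation step.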

The conjugation generation unit 24 correctly conjugates every element of the morpheme sequence output from the selection unit 22e and generates the final predicate. The present embodiment uses a conjugation generator based on a language model. This generator relies on a model that has learned in advance, from correct-answer data, "which connection is plausible", using as features the surface form, part of speech, and conjugation type of the preceding word and the surface form and part of speech of the following word. Based on this language model, the optimal written form is generated when the surface forms, parts of speech, and conjugation types of a new pair of adjacent words are input. Here, the morpheme sequence 「苦手／名詞−形容動詞語幹」 (noun, adjectival-noun stem), 「かも／助詞−副助詞」 (particle, adverbial particle), 「知れ／動詞−自立／一段」 (verb, independent, ichidan), 「ない／助動詞／特殊・ナイ」 (auxiliary verb, special "nai" type) is input to the conjugation generator, and the correctly connected predicate 「苦手かも知れない」 is generated.

Next, the predicate normalization processing routine executed in the predicate normalization apparatus 10 of the present embodiment is described with reference to FIG. 14.

In step 100, the input document is morphologically analyzed sentence by sentence with a known morphological analyzer; each sentence is divided into words, and information such as part of speech, conjugation type, and conjugated form is attached to each word. The sentence 「主役は彼のようだね」 is used here as an example of a sentence contained in the input document.

Next, in step 102, semantic labels are automatically assigned to the functional expressions in the morphological analysis result of step 100 by a statistical technique using the semantic label / predicate model 14, and the extent of the predicate is extracted. Here, 「彼のようだね」 is extracted as the predicate, with 「彼」 as the content word and the four functional expressions 「の／よう／だ／ね」. Among the functional expressions, 「の」 is given the semantic label "nominalization" (名詞化), which belongs to "Grammar", and 「だ」 is given the semantic label "judgment" (判断), which also belongs to "Grammar". 「ね」 is given the semantic label "NULL".

Next, in step 104, the functional expressions to which "NULL" was assigned as the semantic label in step 102 are deleted. Here, 「ね」 is deleted.

Next, in step 106, the semantic labels assigned in step 102 are consulted, and among functional expressions that share the same semantic label, all but one are deleted. No functional expression qualifies here, so processing proceeds directly to step 108. As shown in FIG. 15, after steps 104 and 106 the predicate is 「彼／の／よう／だ」.
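Steps 104 and 106 can be sketched as two small filters. The `(surface, label)` representation and the function names are assumptions of this sketch. The description says only that one expression per label is kept, so keeping the first occurrence is an assumption, and the label shown for 「よう」 is a hypothetical placeholder because the text does not state it.

```python
def delete_null(morphemes):
    """Step 104: drop functional expressions labeled "NULL"."""
    return [(s, label) for s, label in morphemes if label != "NULL"]

def apply_redundancy_rule(morphemes):
    """Step 106: among expressions sharing a semantic label, keep only one.
    Keeping the first occurrence is an assumption of this sketch."""
    seen = set()
    kept = []
    for surface, label in morphemes:
        if label is not None and label in seen:
            continue  # a same-labeled expression was already kept
        if label is not None:
            seen.add(label)
        kept.append((surface, label))
    return kept

predicate = [("彼", None),            # content word, no functional label
             ("の", "名詞化"),        # nominalization (Grammar)
             ("よう", "MODALITY"),    # hypothetical placeholder label
             ("だ", "判断"),          # judgment (Grammar)
             ("ね", "NULL")]
cleaned = apply_redundancy_rule(delete_null(predicate))
print("／".join(s for s, _ in cleaned))
```

For this example only the NULL deletion has an effect, which matches the text: no two functional expressions share a label, so the redundancy rule removes nothing.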

Next, in step 108, for the predicate from which unnecessary functional expressions were deleted in steps 104 and 106, every combination of candidate morpheme sequences is created: those that keep the functional expressions to which the "Grammar" semantic labels "judgment" and "nominalization" were assigned in step 102, and those that omit them. Here the target functional expressions are 「の」 (nominalization) and 「だ」 (judgment), so four candidates are generated: 「彼／よう」 with both deleted, 「彼／の／よう」 with 「だ」 deleted and 「の」 kept, 「彼／よう／だ」 with 「の」 deleted and 「だ」 kept, and 「彼／の／よう／だ」 with both kept.

Next, in step 110, a period 「。」 is appended to the end of each candidate generated in step 108. For each candidate, the score log P_a is calculated from the morpheme Ngram model 22b, with the surface form included as an element for function words and excluded for all other words, and the score log P_b is calculated from the part-of-speech Ngram model 22c, with the part of speech and conjugation type of each word as elements. The Ngram score is then calculated according to equation (1). FIG. 16 shows the Ngram score of each candidate.

Next, in step 112, the candidate with the highest Ngram score among the scores calculated in step 110 is selected, and the morpheme sequence constituting the selected candidate is output. Here, 「彼／の／よう／だ」 is selected.

Next, in step 114, every element of the morpheme sequence selected in step 112 is correctly conjugated to generate the final predicate. As a result, for the input 「主役は彼のようだね。」, the combination 「彼のようだ。」, which is the simplest grammatically correct paraphrase, is output. The predicate is thus normalized from 「彼のようだね」 to 「彼のようだ」.

As described above, the predicate normalization apparatus of the present embodiment assigns the category "Grammar" to functional expressions that do not affect the meaning of the predicate but may be required by Japanese grammar depending on the surrounding words. For the predicate with and without the functional expressions belonging to "Grammar", it calculates an Ngram score indicating the likelihood as a word string using the Ngram models, and it keeps or deletes those functional expressions so that the Ngram score is maximized. This makes it possible to paraphrase predicates simply and grammatically correctly. In addition, the Ngram models used are a morpheme Ngram model and a part-of-speech Ngram model, and in the morpheme Ngram model the surface form of a morpheme is an element for function words but not for other words. The surface forms of functional expressions can therefore serve as the basis for grammaticality judgment while fluctuation of the Ngram score caused by variation in the surface forms of content words in the predicate is suppressed, so grammaticality can be judged with high accuracy.

Predicates normalized by the present embodiment can also be used as preprocessing for natural language processing technologies such as the predicate aggregation performed in text mining, machine translation, and summarization, and can improve the accuracy of these processes.

The above embodiment describes the case where the morpheme Ngram model and the part-of-speech Ngram model are used together to judge the grammaticality of functional expressions belonging to "Grammar", but the judgment may be made using only the morpheme Ngram model.

The above embodiment also describes the case where the elements of the morpheme Ngram model are the surface form (function words only), part of speech, and conjugation type, and the elements of the part-of-speech Ngram model are the part of speech and conjugation type. Instead of, or in addition to, the part of speech and conjugation type, other attributes of a morpheme, such as its standard form or conjugated form, may be used as elements.

The above embodiment also describes the case where the functional expressions belonging to "Grammar" are the particle 「の」 and the auxiliary verb 「だ」, but other function words may be registered as functional expressions belonging to "Grammar" and given new semantic labels. A restriction may also be added by means of rules, as in Non-Patent Document 1, so that a functional expression is subject to the grammaticality judgment of the present embodiment only when it has been given a semantic label belonging to "Grammar" and a functional expression belonging to modality exists before or after it.

The above embodiment also describes the case where the semantic label assignment / predicate extraction unit uses the learned semantic label / predicate model to assign semantic labels to functional expressions and extract predicates. Alternatively, semantic labels may be assigned to the morphological analysis result based on a function-word semantic label dictionary, and predicates may then be extracted according to predetermined predicate extraction rules.

The present invention is not limited to the above embodiment, and various modifications and applications are possible without departing from the gist of the invention.

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Predicate normalization apparatus
12 Morphological analysis unit
14 Semantic label / predicate model
16 Semantic label assignment / predicate extraction unit
18 NULL deletion unit
20 Redundancy rule application unit
22 Ngram grammaticality judgment unit
22a Candidate generation unit
22b Morpheme Ngram model
22c Part-of-speech Ngram model
22d Ngram score calculation unit
22e Selection unit
24 Conjugation generation unit

Claims

A predicate normalization apparatus comprising:
morphological analysis means for morphologically analyzing an input document;
label assignment means for assigning, based on the result of the morphological analysis by the morphological analysis means, to each functional expression that is included in a predicate of the document and whose grammatical necessity differs depending on nearby words, a determination label indicating that its grammatical necessity differs;
calculation means for calculating, for each of the morpheme sequence constituting the predicate when at least one of the functional expressions to which the determination label is assigned is included in the predicate and the morpheme sequence constituting the predicate when the functional expressions to which the determination label is assigned are excluded, a score indicating whether the functional expression to which the determination label is assigned is grammatically necessary, based on a morpheme N-gram model constructed from pseudo-words that include the surface form of the morpheme as an element for function words and do not include the surface form as an element for words other than function words; and
generation means for generating a normalized predicate from the morpheme sequence selected based on the score calculated by the calculation means.

2. The predicate normalization apparatus according to claim 1, wherein the calculation means calculates the score based on the morpheme N-gram model and a part-of-speech N-gram model constructed from pseudo-words whose elements are attributes of the morpheme other than the surface form.

The predicate normalization apparatus according to claim 1 or 2, wherein the label assignment means assigns to each functional expression included in the predicate a semantic label, the semantic labels including the determination label and an unnecessary label indicating that the functional expression is semantically and grammatically unnecessary, according to whether the functional expression affects the meaning of the predicate, and
the calculation means calculates the score using the predicate from which the functional expressions to which the unnecessary label was assigned by the label assignment means and, among functional expressions to which the same semantic label was assigned, all but one have been deleted.

A predicate normalization method that causes a computer to execute processing comprising:
morphologically analyzing an input document;
assigning, based on the result of the morphological analysis, to each functional expression that is included in a predicate of the document and whose grammatical necessity differs depending on nearby words, a determination label indicating that its grammatical necessity differs;
calculating, for each of the morpheme sequence constituting the predicate when at least one of the functional expressions to which the determination label is assigned is included in the predicate and the morpheme sequence constituting the predicate when the functional expressions to which the determination label is assigned are excluded, a score indicating whether the functional expression to which the determination label is assigned is grammatically necessary, based on a morpheme N-gram model constructed from pseudo-words that include the surface form of the morpheme as an element for function words and do not include the surface form as an element for words other than function words; and
generating a normalized predicate from the morpheme sequence selected based on the calculated score.

A predicate normalization program for causing a computer to function as each of the means constituting the predicate normalization apparatus according to any one of claims 1 to 3.