JP6058563B2 - Model learning device, filter device, method, and program

Info

Publication number: JP6058563B2
Application number: JP2014002494A
Authority: JP
Inventors: 東中 竜一郎; 牧野 俊朗; 松尾 義博; 小林 のぞみ; 平野 徹; 宮崎 千明; 目黒 豊美
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-01-09
Filing date: 2014-01-09
Publication date: 2017-01-11
Anticipated expiration: 2034-01-09
Also published as: JP2015132876A

Description

The present invention relates to a model learning device, a filter device, a method, and a program, and in particular to a model learning device, a filter device, a method, and a program for determining whether an input sentence is suitable as an utterance sentence.

Dialogue systems fall broadly into two types: task-oriented and non-task-oriented. A task-oriented system accomplishes a specific task through dialogue with the user and is used, for example, in flight reservation systems and weather information retrieval systems. Because the content of the conversation can be anticipated in advance, such systems generate utterances by holding a database of hand-crafted utterances or by filling hand-written templates with information, such as weather data, extracted from a database (Non-Patent Document 1).

A non-task-oriented dialogue system handles dialogue that has no particular goal, that is, chat. Because chat ranges over many topics, its content cannot be anticipated in advance, which makes utterance generation very difficult. To cope with a wide range of user input, some recent techniques build a database of sentences from sources such as the web and Twitter (registered trademark) and select a sentence similar to the user utterance as the system utterance (Non-Patent Documents 2 and 3).

Non-Patent Document 1: Ryuichiro Higashinaka, Katsuhito Sudoh, Mikio Nakano, "Incorporating Discourse Features into Confidence Scoring of Intention Recognition Results in Spoken Dialogue Systems", Speech Communication, Volume 48, Issues 3-4, pp. 417-436, 2006.
Non-Patent Document 2: Shibata, M., Nishiguchi, T., and Tomiura, Y. (2009). "Dialog system for open-ended conversation using web documents." Informatica, 33 (3), pp. 277-284.
Non-Patent Document 3: Bessho, F., Harada, T., and Kuniyoshi, Y. (2012). "Dialog System Using Real-Time Crowdsourcing and Twitter Large-Scale Corpus." In Proc. SIGDIAL, pp. 227-231.

When a non-task-oriented dialogue system extracts sentences from the web or Twitter (registered trademark) and uses them as system utterances, some of them are problematic as utterances. For example, if a context-dependent sentence is extracted and suddenly presented to the user as a system utterance, a user who does not share that context cannot understand it. Suppose the system extracts the sentence "It's fun to go, isn't it?" (「行くと楽しいですよね」) and says it to the user: unless the place has already been established, the user cannot tell where one would go and so cannot understand the sentence.

A similar problem arises when a sentence that ends in a noun (taigen-dome), a form common in descriptions of things, is extracted and used as a system utterance: it lacks the basic content of "what did what" and is hard for the user to interpret. For example, if the system says "light-flavored ramen" (「あっさり味のラーメン」), the user cannot tell what is being said about the light-flavored ramen.

The sentence "It's fun to go, isn't it?" lacks the information "where", which is the object. In other words, a required grammatical element is missing. "Light-flavored ramen" lacks the "did what" part of "what did what". In both cases, the problem is that a grammatically required element is missing.

It would be desirable to detect such sentences and filter them out of the system utterance candidates, but detecting them automatically is difficult. Sentences on the web, and on Twitter (registered trademark) in particular, often do not follow the grammar of written language (for example, case particles are omitted), so grammatical analysis frequently fails and it is hard to detect which grammatical element is missing.

In addition, judging whether the required grammatical elements are present is often difficult in itself, because the grammatical elements a verb requires vary with the situation in which it is used. In "You can't live without eating" (「食べないと生きていけない」), "eat" needs no object, whereas "I'd like to go and eat" (「食べに行きたいな」) cannot be understood without one. "Shinjuku Station is so crowded that I'm lost" (「新宿駅が混みすぎて迷子」) is a noun-phrase sentence that ends in a noun, yet it properly conveys "what did what" and poses no problem as an utterance.

The present invention has been made to solve the above problems, and an object thereof is to provide a model learning device, method, and program capable of learning a model that determines with high accuracy whether a sentence is suitable as an utterance sentence.

Another object is to provide a filter device, method, and program capable of determining with high accuracy whether a sentence is suitable as an utterance sentence and outputting the sentences determined to be suitable.

To achieve the above objects, a model learning device according to a first invention includes a dependency analysis unit and a model learning unit. The dependency analysis unit performs dependency analysis on each of a set of morphologically analyzed input sentences, each labeled with positive-example information indicating that the sentence is suitable as an utterance sentence or negative-example information indicating that it is not, and creates a tree structure for each input sentence. The tree structure contains, for each word in the input sentence, a word node representing the part of speech of that word; the word nodes are connected by edges according to the dependency relations between the words; and at least one of a node representing the surface form of the word, a node representing the standard notation of the word, and a node representing the terminal form of the word is added as a child node of each word node. The model learning unit learns a model for determining whether the sentence corresponding to a tree structure is suitable as an utterance sentence, based on a plurality of subtrees obtained from the tree structures created by the dependency analysis unit for the input sentences and on the positive-example or negative-example information attached to each input sentence.

A model learning method according to a second invention is a model learning method in a model learning device that includes a dependency analysis unit and a model learning unit. The dependency analysis unit performs dependency analysis on each of a set of morphologically analyzed input sentences, each labeled with positive-example information indicating that the sentence is suitable as an utterance sentence or negative-example information indicating that it is not, and creates a tree structure for each input sentence. The tree structure contains, for each word in the input sentence, a word node representing the part of speech of that word; the word nodes are connected by edges according to the dependency relations between the words; and at least one of a node representing the surface form of the word, a node representing the standard notation of the word, and a node representing the terminal form of the word is added as a child node of each word node. The model learning unit learns a model for determining whether the sentence corresponding to a tree structure is suitable as an utterance sentence, based on a plurality of subtrees obtained from the tree structures created by the dependency analysis unit for the input sentences and on the positive-example or negative-example information attached to each input sentence.

According to the first and second inventions, the dependency analysis unit performs dependency analysis on each morphologically analyzed input sentence labeled with positive-example or negative-example information and creates a tree structure, and the model learning unit learns a model for determining whether the sentence corresponding to a tree structure is suitable as an utterance sentence, based on a plurality of subtrees obtained from the tree structures created for the input sentences and on the positive-example or negative-example information attached to each input sentence.

In this way, by creating a tree structure for each input sentence labeled with positive-example or negative-example information and using the plurality of subtrees obtained from the created tree structures together with that label information, it is possible to learn a model that determines with high accuracy whether the sentence corresponding to a tree structure is suitable as an utterance sentence.
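The idea of using subtrees as features can be illustrated with a small sketch. It is not the learning method of the invention: the tuple-based tree encoding, the made-up labels, and the restriction to single nodes and parent-child edges (the smallest subtrees) are all assumptions chosen for brevity.

```python
from collections import Counter

# Assumed encoding: a tree node is (label, [children]).
# The labels below are illustrative, not the tag set of any real analyzer.

def subtree_features(node):
    """Count small subtree patterns: single nodes and parent>child edges."""
    label, children = node
    feats = Counter([label])
    for child in children:
        feats[f"{label}>{child[0]}"] += 1
        feats.update(subtree_features(child))
    return feats

def feature_label_counts(examples):
    """For each label (+1 / -1), count in how many trees each feature occurs."""
    counts = {+1: Counter(), -1: Counter()}
    for label, tree in examples:
        counts[label].update(set(subtree_features(tree)))
    return counts

positive_tree = ("verb", [("noun", [("Shinjuku", [])]), ("iku", [])])
negative_tree = ("noun", [("ramen", [])])
counts = feature_label_counts([(+1, positive_tree), (-1, negative_tree)])
```

A learner would then weight such features so that patterns seen mainly in positive examples raise the score and patterns seen mainly in negative examples lower it.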

In the model learning device according to the first invention, the dependency analysis unit may create the tree structure with at least one of a node representing semantic information of the word and a node representing the essential cases of the word further added as a child node of each word node.

A filter device according to a third invention includes a dependency analysis unit, a model application unit, and a filter unit. The dependency analysis unit performs dependency analysis on a morphologically analyzed input sentence and creates a tree structure. The tree structure contains, for each word in the input sentence, a word node representing the part of speech of that word; the word nodes are connected by edges according to the dependency relations between the words; and at least one of a node representing the surface form of the word, a node representing the standard notation of the word, and a node representing the terminal form of the word is added as a child node of each word node. The model application unit calculates a score representing the degree to which the input sentence is suitable as an utterance sentence, based on a plurality of subtrees obtained from the tree structure created by the dependency analysis unit and on a pre-learned model for determining whether the sentence corresponding to a tree structure is suitable as an utterance sentence. The filter unit determines, based on the score calculated by the model application unit, whether the input sentence is suitable as an utterance sentence, and outputs the input sentence when it is determined to be suitable.

A filtering method according to a fourth invention is a filtering method in a filter device that includes a dependency analysis unit, a model application unit, and a filter unit. The dependency analysis unit performs dependency analysis on a morphologically analyzed input sentence and creates a tree structure. The tree structure contains, for each word in the input sentence, a word node representing the part of speech of that word; the word nodes are connected by edges according to the dependency relations between the words; and at least one of a node representing the surface form of the word, a node representing the standard notation of the word, and a node representing the terminal form of the word is added as a child node of each word node. The model application unit calculates a score representing the degree to which the input sentence is suitable as an utterance sentence, based on a plurality of subtrees obtained from the tree structure created by the dependency analysis unit and on a pre-learned model for determining whether the sentence corresponding to a tree structure is suitable as an utterance sentence. The filter unit determines, based on the score calculated by the model application unit, whether the input sentence is suitable as an utterance sentence, and outputs the input sentence when it is determined to be suitable.

According to the third and fourth inventions, the dependency analysis unit performs dependency analysis on a morphologically analyzed input sentence and creates a tree structure; the model application unit calculates a score representing the degree to which the input sentence is suitable as an utterance sentence, based on a plurality of subtrees obtained from the created tree structure and on a pre-learned model for determining whether the sentence corresponding to a tree structure is suitable as an utterance sentence; and the filter unit determines, based on the calculated score, whether the input sentence is suitable as an utterance sentence and outputs the input sentence when it is determined to be suitable.

In this way, by creating a tree structure for the input sentence, calculating a score from the plurality of subtrees obtained from that tree structure and a pre-learned model for determining suitability as an utterance sentence, and judging suitability from the score, it is possible to determine with high accuracy whether the input sentence is suitable as an utterance sentence and to output the input sentences determined to be suitable.
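The score-then-filter step can be sketched as follows. The function name, the lookup table standing in for a learned model, and the decision rule "score greater than a threshold of 0" are assumptions for illustration; the text only states that the decision is based on the score.

```python
from typing import Callable, Iterable, List

def filter_utterances(sentences: Iterable[str],
                      score: Callable[[str], float],
                      threshold: float = 0.0) -> List[str]:
    """Keep only the sentences whose score exceeds the (assumed) threshold."""
    return [s for s in sentences if score(s) > threshold]

# Hypothetical scores; a real system would obtain them by applying the
# learned model to the subtrees of each sentence's tree structure.
toy_scores = {
    "新宿駅が混みすぎて迷子": 0.8,
    "あっさり味のラーメン": -0.6,
}
kept = filter_utterances(toy_scores, score=lambda s: toy_scores[s])
```

Raising the threshold trades recall for precision: fewer candidate sentences survive, but those that do are more likely to be usable as system utterances.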

In the filter device according to the third invention, the dependency analysis unit may create the tree structure with at least one of a node representing semantic information of the word and a node representing the essential cases of the word further added as a child node of each word node.

A program according to the present invention is a program for causing a computer to function as each unit of the above model learning device.

Another program according to the present invention is a program for causing a computer to function as each unit of the above filter device.

As described above, according to the model learning device, method, and program of the present invention, a tree structure is created for each input sentence labeled with positive-example or negative-example information, and a model that determines with high accuracy whether the sentence corresponding to a tree structure is suitable as an utterance sentence can be learned from the plurality of subtrees obtained from the created tree structures and the label information.

According to the filter device, method, and program of the present invention, a tree structure is created for an input sentence, a score is calculated from the plurality of subtrees obtained from the tree structure and a pre-learned model for determining suitability as an utterance sentence, and suitability is judged from the score. This makes it possible to determine with high accuracy whether the input sentence is suitable as an utterance sentence and to output the input sentences determined to be suitable.

FIG. 1 is a block diagram showing the functional configuration of the model learning device according to the embodiment of the present invention.
FIG. 2 is a diagram showing an example of an analysis result produced by JDEP.
FIG. 3 is a diagram showing an example of the dependency structure between bunsetsu (phrase units).
FIG. 4 is a diagram showing an example of a tree structure converted into a dependency structure between words.
FIG. 5 is a diagram showing an example of a tree structure built around part-of-speech nodes.
FIG. 6 is a diagram showing an example of a tree structure to which semantic attribute and essential case information has been added.
FIG. 7 is a diagram showing an example of training data.
FIG. 8 is a diagram showing an example of a learned model.
FIG. 9 is a block diagram showing the functional configuration of the filter device according to the embodiment of the present invention.
FIG. 10 is a diagram showing the model learning routine in the model learning device according to the embodiment of the present invention.
FIG. 11 is a diagram showing the filter routine in the filter device according to the embodiment of the present invention.
FIG. 12 is a diagram showing the results of an experimental example.
FIG. 13 is a diagram showing an example of training data when five-level scores are used for the input sentences.

Embodiments of the present invention are described in detail below with reference to the drawings.

<Principle of the Present Invention>
The embodiment of the present invention introduces a machine learning technique and learns a classifier that distinguishes grammatically problematic sentences from unproblematic ones. Specifically, a number of sentences extracted from the web or Twitter (registered trademark) are prepared, and each is given a score indicating, from a grammatical viewpoint, its suitability as an utterance sentence. A binary classifier is then learned with the high-scoring sentences as positive examples and the low-scoring sentences as negative examples. So that the grammatical relations between words are reflected in the features, the result of syntactic analysis (specifically, dependency analysis) is used as the features for learning.
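Turning graded scores into positive and negative examples might look like the sketch below. The cut-off values and the choice to discard sentences with intermediate scores are assumptions; the text only says that high scores become positive examples and low scores become negative examples.

```python
def to_binary_examples(scored, pos_min, neg_max):
    """Map (sentence, score) pairs to (sentence, +1/-1) training examples.

    Scores >= pos_min become positive, scores <= neg_max become negative,
    and anything in between is dropped (an assumption of this sketch).
    """
    examples = []
    for sentence, score in scored:
        if score >= pos_min:
            examples.append((sentence, +1))
        elif score <= neg_max:
            examples.append((sentence, -1))
    return examples

# Hypothetical scores on a 1-to-5 scale.
scored = [("sentence A", 5), ("sentence B", 1), ("sentence C", 3)]
examples = to_binary_examples(scored, pos_min=4, neg_max=2)
```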

<Configuration of the Model Learning Device According to the Embodiment of the Present Invention>
Next, the configuration of the model learning device according to the embodiment of the present invention is described. As shown in FIG. 1, the model learning device 100 can be implemented by a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the model learning routine described later. Functionally, as shown in FIG. 1, the model learning device 100 includes an input unit 10, a calculation unit 20, and an output unit 90.

The input unit 10 accepts a plurality of sentences extracted from the web, Twitter (registered trademark), or the like, in which each sentence suitable as a system utterance is labeled with positive-example information and each sentence unsuitable as a system utterance is labeled with negative-example information.

The calculation unit 20 includes a morphological analysis unit 30, a dependency analysis unit 40, a model learning unit 50, and a model storage unit 60.

The morphological analysis unit 30 performs morphological analysis on each of the sentences accepted by the input unit 10. This embodiment uses JTAG as the morphological analysis program. ChaSen, MeCab, or a similar program may be used instead.

The dependency analysis unit 40 performs dependency analysis (dependency structure analysis) on each of the sentences analyzed by the morphological analysis unit 30. Specifically, each morphologically analyzed sentence is grouped into bunsetsu (phrase units; a bunsetsu is the basic unit of Japanese, consisting of a content word and its accompanying function words), and the dependency relations between the bunsetsu are determined. For example, for the sentence "I go to a movie with him" (「私は彼と映画に行く」), the four bunsetsu 「私は」 ("I"), 「彼と」 ("with him"), 「映画に」 ("to a movie"), and 「行く」 ("go") are obtained from the morphological analysis result. Determining the dependency structure between them shows that 「私は」, 「彼と」, and 「映画に」 all depend on 「行く」. This embodiment uses JDEP as the dependency analysis program; JDEP is software that performs dependency analysis on the output of JTAG. CaboCha or KNP may be used instead.

The dependency analysis unit 40 also creates a tree structure for each dependency-analyzed sentence, based on the dependency structure of the bunsetsu (phrase units) in the sentence. Specifically, within each bunsetsu, each word is made to depend on the nearest word to its right in the same bunsetsu. The rightmost word of each bunsetsu is made to depend on the head word of the bunsetsu on which that bunsetsu depends. A word with no dependency destination (the last word of the sentence) is made to depend on the root node (root) of the tree. Then, for each word in the sentence, a node representing the part of speech of the word is created as the node that stands for the word (hereinafter, the word node), and the word nodes are connected by edges according to the dependency relations between the words. In addition, for each word, nodes representing its surface form, standard notation, and terminal form are added as child nodes of its word node.
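The conversion from a bunsetsu-level dependency structure to the word-level tree described above can be sketched as follows. The class names, the `surface:`/`standard:`/`base:` label prefixes, and the part-of-speech tags are hypothetical; the input is written by hand here rather than taken from JTAG or JDEP output, and the terminal form node is added only when a terminal form is given.

```python
from dataclasses import dataclass, field

@dataclass
class Word:
    surface: str
    pos: str
    standard: str
    base: str = ""  # terminal form; left empty for non-inflecting words

@dataclass
class Bunsetsu:
    words: list
    dest: int       # index of the bunsetsu depended on, -1 if none
    head: int = 0   # index of the head word inside this bunsetsu

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def build_tree(bunsetsus):
    # One part-of-speech node per word, with form nodes as children.
    nodes = {}
    for b, bun in enumerate(bunsetsus):
        for w, word in enumerate(bun.words):
            node = Node(word.pos)
            node.children.append(Node("surface:" + word.surface))
            node.children.append(Node("standard:" + word.standard))
            if word.base:
                node.children.append(Node("base:" + word.base))
            nodes[(b, w)] = node
    root = Node("root")
    for b, bun in enumerate(bunsetsus):
        last = len(bun.words) - 1
        # Inside a bunsetsu, each word depends on the next word to its right.
        for w in range(last):
            nodes[(b, w + 1)].children.append(nodes[(b, w)])
        # The rightmost word depends on the head word of the destination
        # bunsetsu, or on the root when there is no destination.
        if bun.dest < 0:
            parent = root
        else:
            parent = nodes[(bun.dest, bunsetsus[bun.dest].head)]
        parent.children.append(nodes[(b, last)])
    return root

sentence = [
    Bunsetsu([Word("私", "noun", "私"), Word("は", "particle", "は")], dest=3),
    Bunsetsu([Word("彼", "noun", "彼"), Word("と", "particle", "と")], dest=3),
    Bunsetsu([Word("映画", "noun", "映画"), Word("に", "particle", "に")], dest=3),
    Bunsetsu([Word("行く", "verb", "行く", "行く")], dest=-1),
]
root = build_tree(sentence)
```

In the resulting tree the verb node hangs from the root, the three particle nodes hang from the verb node, and each noun node hangs from its particle node.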

Furthermore, for each word in the sentence, if a semantic attribute exists for the word, a node carrying that semantic attribute is added as a child node of the word's word node. If the word has several semantic attributes, a node is added for each of them. A semantic attribute denotes the semantic content of a word. The dictionary Nihongo Goi-Taikei (日本語語彙大系) stores the correspondence between words and their semantic attributes. There are three kinds of semantic attribute information: common-noun semantic attributes assigned to common nouns, proper-noun semantic attributes assigned to proper nouns, and predicate semantic attributes assigned to predicates (mainly verbs). In this embodiment, the semantic attributes of each word in each sentence are obtained from Nihongo Goi-Taikei. Semantic attribute information is one example of the semantic information of a word.

Furthermore, for each word in the sentence, if the word is a predicate (mainly a verb or adjective) and is listed in the essential case dictionary, a node carrying the predicate's essential case information is added as a child node of the word's word node. The essential case dictionary is data that holds, for each predicate, a list of the cases that the predicate takes; it can be built by analyzing large-scale text data. Specifically, the text data is dependency-parsed, and for each predicate the frequencies of the case particles in the bunsetsu (phrase units) that depend on it are counted. A case particle that occurs disproportionately often with a particular predicate is entered in the dictionary as an essential case of that predicate. Whether a particle occurs disproportionately often can be decided with a statistical test, such as the chi-square test or the log-likelihood ratio test. In the essential case dictionary used in this embodiment, the ni case (ニ格) is defined as an essential case of "go" (「行く」), the wo case (ヲ格) and the de case (デ格) as essential cases of "buy" (「買う」), and the to case (ト格) as an essential case of "think" (「思う」).
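One way to build such a dictionary with the log-likelihood ratio test is sketched below. The function names, the toy counts, the romanized predicate and particle names, and the threshold of 3.84 (the chi-square critical value at p = 0.05 with one degree of freedom) are assumptions of this sketch, as is the extra check that the pair occurs more often than expected.

```python
import math
from collections import Counter

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic of a 2x2 contingency table."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
             (k21, rows[1], cols[0]), (k22, rows[1], cols[1]))
    g2 = 0.0
    for observed, row, col in cells:
        if observed > 0:
            g2 += observed * math.log(observed * total / (row * col))
    return 2.0 * g2

def essential_cases(pairs, threshold=3.84):
    """pairs: list of (predicate, case particle) observations from parsed text."""
    pair_counts = Counter(pairs)
    pred_counts = Counter(pred for pred, _ in pairs)
    case_counts = Counter(case for _, case in pairs)
    total = len(pairs)
    result = {}
    for (pred, case), k11 in pair_counts.items():
        k12 = pred_counts[pred] - k11   # predicate with other particles
        k21 = case_counts[case] - k11   # particle with other predicates
        k22 = total - k11 - k12 - k21
        over_represented = k11 * total > pred_counts[pred] * case_counts[case]
        if over_represented and log_likelihood_ratio(k11, k12, k21, k22) > threshold:
            result.setdefault(pred, set()).add(case)
    return result

# Toy observations, not real corpus counts.
pairs = ([("iku", "ni")] * 20 + [("iku", "ga")] * 5 +
         [("kau", "wo")] * 20 + [("kau", "ga")] * 5)
dictionary = essential_cases(pairs)
```

In this toy data the particle "ga" appears equally often with both predicates, so it is not selected for either, while "ni" and "wo" are each strongly tied to one predicate.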

As a concrete example, consider how the dependency analysis unit 40 creates a tree structure for the sentence 私は彼と映画に行った ("I went to a movie with him"). FIG. 2 shows the dependency analysis result that JDEP produces for this sentence. Lines beginning with an asterisk in FIG. 2 carry phrase (bunsetsu) information. The fields after the asterisk are, in order, the phrase ID, the ID of the phrase it depends on together with the dependency type (normally written D, and P for a parallel dependency), and the head word and functional head word. In the first line of FIG. 2, for instance, the phrase 私は has phrase ID 0 and depends on the phrase with ID 3; its head is word 0 of the phrase (私, "I") and its functional head is word 1 of the phrase (the particle は). The end of the sentence is marked by the EOS (End of Sentence) symbol. Lines that neither begin with an asterisk nor are EOS describe individual words and consist of six fields: notation (surface form), part of speech, standard notation, end form (終止形, the dictionary form, for parts of speech that inflect), reading, and semantic attribute information (given inside [ ] in the order general noun attribute, proper noun attribute, predicate attribute). FIGS. 3 to 6 show, respectively, the dependency structure between phrases, a tree structure representing the dependency relations between words, a tree structure built around word nodes that represent parts of speech, and an example of the tree structure after nodes for semantic attributes and essential cases have been added. Semantic attributes are expressed as numbers; in FIG. 6 the prefixes N, P, and Y are attached to distinguish general noun, proper noun, and predicate attributes, respectively.

The tree structure finally obtained represents the rough structure of the sentence with word nodes that denote parts of speech, and holds the detailed information about each word as child nodes. The semantic attribute child nodes capture the rough semantic content of each word. In addition, the essential case nodes taken from the essential case dictionary record which cases a predicate should have, so the tree carries the information needed to verify whether the required cases are actually present.
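A tree of this kind might be assembled as in the following Python sketch. The class names `Node` and `Word`, the label prefixes (`surface:`, `standard:`, `base:`, `sem:`, `case:`), the order of child nodes, and the convention that each word stores the index of its governing word are assumptions made for illustration; the text does not fix them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Set


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Word:
    surface: str                    # notation as it appears in the sentence
    pos: str                        # part of speech (label of the word node)
    standard: str                   # standard notation
    base: Optional[str] = None      # end (dictionary) form, if the word inflects
    semantics: Sequence[str] = ()   # semantic attribute codes, e.g. with N/P/Y prefix
    predicate: bool = False         # True for verbs and adjectives
    head: int = -1                  # index of the governing word, -1 for the root


def build_tree(words: List[Word], case_dict: Dict[str, Set[str]]) -> Optional[Node]:
    """Build a part-of-speech tree with detail nodes attached to each word node."""
    nodes = []
    for w in words:
        node = Node(w.pos)
        node.children.append(Node("surface:" + w.surface))
        node.children.append(Node("standard:" + w.standard))
        if w.base is not None:
            node.children.append(Node("base:" + w.base))
        for code in w.semantics:               # one node per semantic attribute
            node.children.append(Node("sem:" + code))
        if w.predicate:                        # essential cases only for predicates
            for case in sorted(case_dict.get(w.base or w.standard, ())):
                node.children.append(Node("case:" + case))
        nodes.append(node)

    root = None
    for w, node in zip(words, nodes):          # connect word nodes by dependency
        if w.head < 0:
            root = node
        else:
            nodes[w.head].children.append(node)
    return root
```

Because the essential case nodes come from the dictionary rather than from the sentence, a subtree pattern can contrast "the predicate requires the ni-case" with "a ni-marked dependent is or is not present".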

For the sentences labeled as positive or negative examples, the model learning unit 50 learns a model that determines whether a sentence is eligible as an utterance of the system, based on the tree structure of each sentence created by the dependency analysis unit 40 and on the positive or negative example information attached to that sentence. The present embodiment uses BACT as the learning algorithm. BACT enumerates the subtrees contained in the tree structures of the given data and statistically computes, for each subtree, a weight expressing how much it contributes to the positive/negative decision. The learned model is therefore a set of (subtree, weight) pairs. BACT is an algorithm for classifying tree structures in general (Non-Patent Document 4: Taku Kudo, Yuji Matsumoto (2004) A Boosting Algorithm for Classification of Semi-Structured Text, EMNLP 2004). Any other learning algorithm may be used instead, provided that it can treat the presence of a subtree in a tree structure as a feature and assign a weight to that feature.

FIG. 7 shows an example of the training data. Each positive example is prefixed with +1 and each negative example with -1. The tree structures are written as S-expressions, the input format of BACT, and contain the same information as the tree structure described with reference to FIG. 6. Entries marked "case" represent essential case information. FIG. 8 shows an example of a model obtained by learning.
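For illustration, a labeled tree could be serialized into one training line as sketched below in Python. The exact layout of FIG. 7 is not reproduced in this text, so the parenthesization used here, `(label child child ...)` with no spaces, is an assumption, as are the function names `to_sexpr` and `training_line` and the representation of a tree as a `(label, children)` tuple.

```python
def to_sexpr(tree):
    """Serialize a (label, children) tree as a nested S-expression string.

    Assumes labels contain no parentheses or spaces; a real converter would
    need to escape them.
    """
    label, children = tree
    return "(" + label + "".join(to_sexpr(child) for child in children) + ")"


def training_line(is_positive, tree):
    """Prefix the S-expression with +1 (positive) or -1 (negative)."""
    return ("+1" if is_positive else "-1") + " " + to_sexpr(tree)
```

With the tree `("verb", [("case:ni", []), ("noun", [("surface:eiga", [])])])`, `training_line(True, tree)` yields `+1 (verb(case:ni)(noun(surface:eiga)))`.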

The first line of FIG. 8 is the intercept, a weight given to every input regardless of which subtrees it contains. In the following lines, the left column is a weight and the right column is the string representation of the subtree having that weight. Between adjacent elements of the string representation, ")" indicates a sibling relation and anything else indicates a parent-child relation. For example, "動詞語幹：Ａ名詞" ("verb stem: A" followed by "noun") denotes a subtree consisting of a parent node "verb stem: A" and a child node "noun". A positive weight means that a sentence containing the subtree tends to be a positive example; a negative weight means the opposite. The learned weights are stored as the model in the model storage unit 60 and output to the output unit 90.

The model storage unit 60 stores the model, learned by the model learning unit 50, that determines whether a sentence is eligible as an utterance of the system.

<Configuration of Filter Device According to Embodiment of the Present Invention>
Next, the configuration of the filter device according to the embodiment of the present invention will be described. As shown in FIG. 9, the filter device 200 can be implemented as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the filter processing routine described later. Functionally, as shown in FIG. 9, the filter device 200 includes an input unit 210, a calculation unit 220, and an output unit 290.

The input unit 210 receives a sentence to be filtered.

The calculation unit 220 includes a morphological analysis unit 230, a dependency analysis unit 240, a model application unit 250, a model storage unit 260, and a filter unit 270.

Like the morphological analysis unit 30 of the model learning device 100, the morphological analysis unit 230 performs morphological analysis, using JTAG, on the sentence received by the input unit 210.

Like the dependency analysis unit 40 of the model learning device 100, the dependency analysis unit 240 performs dependency analysis (dependency structure analysis), using JDEP, on the sentence analyzed by the morphological analysis unit 230. Also like the dependency analysis unit 40, it then creates a tree structure for the parsed sentence based on the dependency structure of the phrases in the sentence.

For the sentence received by the input unit 210, the model application unit 250 calculates a score indicating the degree to which the sentence is eligible as an utterance of the system, based on the tree structure created by the dependency analysis unit 240 and on the model stored in the model storage unit 260. Specifically, it enumerates the subtrees contained in the tree structure, looks up each subtree in the model, and adds up the subtree weights according to equation (1):

$$\mathrm{score}(t) = \mathrm{weight}_0 + \sum_{s \in \mathrm{subtrees}(t)} \mathrm{weight}(s) \tag{1}$$

Here, t is the tree structure of the sentence, and score returns the score indicating the degree to which the tree structure is eligible as an utterance of the system. weight_0 is the intercept, subtrees is a function that enumerates the subtrees of a tree structure, and weight is a function that looks up a subtree in the model and returns its weight.
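The score computation can be sketched in Python as follows. Rather than enumerating every subtree of t, this sketch iterates over the subtree patterns stored in the model and tests whether each one occurs in t, which gives the same sum when subtrees absent from the model have weight 0. The tuple representation `(label, children)`, the function names, the matching rule (children of a pattern must match children of a node in the same left-to-right order, with gaps allowed), and counting each pattern at most once are assumptions of this example, not details confirmed by the text.

```python
def matches(node, pattern):
    """True if pattern occurs rooted at node (ordered, gaps between siblings allowed)."""
    label, children = node
    p_label, p_children = pattern
    if label != p_label:
        return False
    i = 0
    for p_child in p_children:
        # Greedily find the next child of node that matches this pattern child.
        while i < len(children) and not matches(children[i], p_child):
            i += 1
        if i == len(children):
            return False
        i += 1
    return True


def contains(tree, pattern):
    """True if pattern occurs anywhere in tree."""
    return matches(tree, pattern) or any(contains(c, pattern) for c in tree[1])


def score(tree, intercept, model):
    """Intercept plus the weights of all model patterns found in tree.

    model: list of (pattern, weight) pairs, as produced by training.
    """
    return intercept + sum(w for pattern, w in model if contains(tree, pattern))
```

A sentence whose tree contains mostly positively weighted patterns receives a score above the intercept; patterns that do not occur in the tree contribute nothing.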

The model storage unit 260 stores the same model as the one stored in the model storage unit 60 of the model learning device 100, that is, the model that determines whether a sentence is eligible as an utterance of the system.

The filter unit 270 determines whether the score calculated by the model application unit 250 for the sentence received by the input unit 210 exceeds a predetermined threshold. If the score exceeds the threshold, the sentence is judged eligible as an utterance and is output to the output unit 290. If the score is equal to or less than the threshold, the sentence is judged to be a removal target, and that result is output to the output unit 290. With the default BACT training settings, classification performance is highest at a threshold of 0. However, when higher precision is wanted even at some cost in recall, the threshold can be set higher, for example to 0.005, which reduces the number of ineligible sentences that are judged eligible.
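The effect of the threshold can be illustrated with a short Python sketch. The function names and the input format (pairs of a score and a gold eligibility label) are assumptions for the example, and any numbers fed to it are illustrative rather than experimental results.

```python
def is_eligible(score, threshold=0.0):
    """A sentence passes the filter only if its score exceeds the threshold."""
    return score > threshold


def precision_recall(scored, threshold):
    """Precision and recall of the filter at a given threshold.

    scored: list of (score, gold_is_eligible) pairs (assumed format).
    """
    decisions = [(is_eligible(s, threshold), gold) for s, gold in scored]
    tp = sum(1 for predicted, gold in decisions if predicted and gold)
    fp = sum(1 for predicted, gold in decisions if predicted and not gold)
    fn = sum(1 for predicted, gold in decisions if not predicted and gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Evaluating `precision_recall` at 0 and again at 0.005 on the same labeled scores shows the trade-off directly: sentences with small positive scores are no longer passed, so fewer ineligible sentences slip through, while some eligible ones are lost.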

<Operation of Model Learning Device According to Embodiment of the Present Invention>
Next, the operation of the model learning device 100 according to the embodiment of the present invention will be described. On receiving a plurality of sentences, each labeled with positive example information or negative example information, the model learning device 100 executes the model learning processing routine shown in FIG. 10.

First, in step S100, morphological analysis is performed, using JTAG, on each of the sentences received by the input unit 10 to which positive example information or negative example information is attached.

Next, in step S102, dependency analysis is performed using JDEP on the sentence being processed, based on its morphological analysis result obtained in step S100.

Next, in step S104, a tree structure is created based on the dependency analysis result obtained in step S102 for the sentence being processed.

Next, in step S106, it is determined whether steps S102 to S104 have been completed for all sentences received by the input unit 10. If they have, the routine proceeds to step S108; if not, the sentence being processed is changed and the routine returns to step S102.

Next, in step S108, a model that determines whether a sentence is eligible as an utterance of the system is learned from the tree structures created in step S104 for the sentences received by the input unit 10 and from the positive or negative example information attached to each of those sentences, and the model is stored in the model storage unit 60.

Next, in step S110, the model learned in step S108 is output to the output unit 90, and the model learning processing routine ends.

<Operation of Filter Device According to Embodiment of the Present Invention>
Next, the operation of the filter device 200 according to the embodiment of the present invention will be described. First, the model learned by the model learning device 100, which determines whether a sentence is eligible as an utterance of the system, is input from the input unit 210 and stored in the model storage unit 260. Then, on receiving a sentence to be processed, the filter device 200 executes the filter processing routine shown in FIG. 11.

First, in step S200, the model stored in the model storage unit 260, which determines whether a sentence is eligible as an utterance of the system, is read.

Next, in step S202, as in step S100, morphological analysis is performed, using JTAG, on the sentence received by the input unit 210.

Next, in step S204, as in step S102, dependency analysis is performed using JDEP on the sentence being processed, based on its morphological analysis result obtained in step S202.

Next, in step S206, as in step S104, a tree structure is created based on the dependency analysis result obtained in step S204 for the sentence being processed.

Next, in step S208, the score indicating the degree to which the tree structure is eligible as an utterance of the system is calculated according to equation (1) above, based on the model read in step S200 and the tree structure of the sentence being processed obtained in step S206.

Next, in step S210, it is determined whether the score obtained in step S208 exceeds the predetermined threshold. If the score exceeds the threshold, the routine proceeds to step S214; if it is equal to or less than the threshold, the routine proceeds to step S212.

In step S212, the sentence received by the input unit 210 is judged to be an exclusion target, the result is output to the output unit 290, and the processing ends.

In step S214, the sentence received by the input unit 210 is judged eligible as an utterance of the system, the result is output to the output unit 290, and the processing ends.
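The control flow of steps S202 to S214 can be summarized in a short Python sketch. JTAG and JDEP are not modeled here: the parameters `analyze`, `parse`, `build_tree`, and `score` are hypothetical callables standing in for the morphological analysis, dependency analysis, tree construction, and scoring stages, and the model is assumed to have been read already (step S200).

```python
def filter_routine(sentence, model, analyze, parse, build_tree, score, threshold=0.0):
    """Run one sentence through the filter and return (decision, sentence).

    analyze, parse, build_tree and score are caller-supplied stand-ins
    (hypothetical) for the processing stages named in the routine.
    """
    morphemes = analyze(sentence)          # S202: morphological analysis
    dependencies = parse(morphemes)        # S204: dependency analysis
    tree = build_tree(dependencies)        # S206: tree structure creation
    value = score(tree, model)             # S208: score by equation (1)
    if value > threshold:                  # S210: compare with the threshold
        return "eligible", sentence        # S214
    return "excluded", sentence            # S212
```

Passing the stages in as arguments keeps the sketch independent of any particular analyzer; in the embodiment the same analyzers are used for learning and filtering so that the trees match the subtree patterns in the model.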

<Experimental Example>
The graph in FIG. 12 compares the performance of the filter device 200 according to the embodiment of the present invention (labeled Syntax in the figure) with that of a baseline SVM (support vector machine) filter that uses word and part-of-speech N-gram (unigram, bigram, trigram) features. FIG. 12 shows how performance changes as the threshold of the filter unit is varied; a curve lying higher in the figure indicates better performance. As FIG. 12 shows, the filter device 200 consistently outperforms the baseline in precision and recall, which confirms the effectiveness of the present invention.

As described above, the model learning device according to the embodiment of the present invention creates a tree structure for each input sentence labeled as a positive or negative example and, based on the subtrees obtained from the created tree structures and on the positive or negative example information, can learn a model that determines with high accuracy whether the sentence corresponding to a tree structure is eligible as an utterance of the system.

Likewise, the filter device according to the embodiment of the present invention creates a tree structure for an input sentence, calculates a score indicating the degree to which the sentence is eligible as an utterance of the system based on the subtrees obtained from the tree structure and on the pre-learned model, and judges from that score whether the input sentence is eligible. It can thereby determine eligibility with high accuracy and output the input sentences judged eligible as utterances of the system.

Furthermore, using the present invention makes the utterances of a dialogue system easier for the user to understand, which facilitates communication and smooths the interaction between the system and the user.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

In the present embodiment, the information attached to each sentence input to the model learning device 100 is binary, either positive example information or negative example information, but the invention is not limited to this. For example, sentences may be rated on a scale such as five levels and then binarized, with scores of 1 or 2 treated as negative examples and scores of 4 or 5 as positive examples. FIG. 13 shows an example in which scores were assigned by external annotators.
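The binarization might look like the following Python sketch. The text does not say what happens to a middle score of 3; this example assumes such sentences are simply left out of the training data. The function name `binarize` and the input format are also assumptions.

```python
def binarize(rated_sentences):
    """Convert five-level ratings to +1 / -1 training labels.

    rated_sentences: list of (sentence, rating) pairs with ratings 1..5.
    Ratings of 3 are dropped (an assumption; the source does not specify).
    """
    labeled = []
    for sentence, rating in rated_sentences:
        if rating in (1, 2):
            labeled.append((sentence, -1))   # negative example
        elif rating in (4, 5):
            labeled.append((sentence, +1))   # positive example
    return labeled
```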

In the present embodiment, semantic attribute nodes are added as child nodes of the word nodes, but the invention is not limited to this. Any information that represents the meaning of a word can be added as a child node, so a Synset ID in WordNet may be used, for example. Alternatively, the words in large-scale data may be clustered according to how they occur, and the number of the cluster assigned to each word may be added as a child node.

The embodiment has been described for the case in which a node representing the notation of the word, a node representing its standard notation, a node representing its end form, a node carrying semantic attribute information, and a node carrying essential case information are all added as child nodes of a word node, but the invention is not limited to this. For example, at least one of these five kinds of node may be added as a child node of the word node.

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium, or provided via a network.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Calculation unit
30 Morphological analysis unit
40 Dependency analysis unit
50 Model learning unit
60 Model storage unit
90 Output unit
100 Model learning device
200 Filter device
210 Input unit
220 Calculation unit
230 Morphological analysis unit
240 Dependency analysis unit
250 Model application unit
260 Model storage unit
270 Filter unit
290 Output unit

Claims

1. A model learning device comprising:
a dependency analysis unit that performs dependency analysis on each of a plurality of morphologically analyzed input sentences, to each of which is attached positive example information indicating that the sentence is eligible as an utterance sentence or negative example information indicating that it is not eligible, and that creates, for each input sentence, a tree structure which includes word nodes each representing the part of speech of a corresponding word in the input sentence, in which the word nodes are connected by edges according to the dependency relations between the words, and in which, for each word node, at least one of a node representing the notation of the corresponding word, a node representing the standard notation of the corresponding word, and a node representing the end form of the corresponding word is added as a child node of the word node; and
a model learning unit that learns a model for determining whether a sentence corresponding to a tree structure is eligible as an utterance sentence, based on a plurality of subtrees obtained from the tree structures created for the input sentences by the dependency analysis unit and on the positive example information or negative example information attached to each input sentence.

2. The model learning device according to claim 1, wherein the dependency analysis unit creates the tree structure in which, for each word node, at least one of a node representing semantic information of the corresponding word and a node representing an essential case of the corresponding word is further added as a child node of the word node.

3. A filter device comprising:
a dependency analysis unit that performs dependency analysis on a morphologically analyzed input sentence and creates a tree structure which includes word nodes each representing the part of speech of a corresponding word in the input sentence, in which the word nodes are connected by edges according to the dependency relations between the words, and in which, for each word node, at least one of a node representing the notation of the corresponding word, a node representing the standard notation of the corresponding word, and a node representing the end form of the corresponding word is added as a child node of the word node;
a model application unit that calculates a score representing the degree to which the input sentence is eligible as an utterance sentence, based on a plurality of subtrees obtained from the tree structure created by the dependency analysis unit and on a pre-learned model for determining whether a sentence corresponding to a tree structure is eligible as an utterance sentence; and
a filter unit that determines, based on the score calculated by the model application unit, whether the input sentence is eligible as an utterance sentence, and outputs the input sentence determined to be eligible as an utterance sentence.

4. The filter device according to claim 3, wherein the dependency analysis unit creates the tree structure in which, for each word node, at least one of a node representing semantic information of the corresponding word and a node representing an essential case of the corresponding word is further added as a child node of the word node.

5. A model learning method in a model learning device including a dependency analysis unit and a model learning unit, the method comprising:
a step in which the dependency analysis unit performs dependency analysis on each of a plurality of morphologically analyzed input sentences, to each of which is attached positive example information indicating that the sentence is eligible as an utterance sentence or negative example information indicating that it is not eligible, and creates, for each input sentence, a tree structure which includes word nodes each representing the part of speech of a corresponding word in the input sentence, in which the word nodes are connected by edges according to the dependency relations between the words, and in which, for each word node, at least one of a node representing the notation of the corresponding word, a node representing the standard notation of the corresponding word, and a node representing the end form of the corresponding word is added as a child node of the word node; and
a step in which the model learning unit learns a model for determining whether a sentence corresponding to a tree structure is eligible as an utterance sentence, based on a plurality of subtrees obtained from the tree structures created for the input sentences by the dependency analysis unit and on the positive example information or negative example information attached to each input sentence.

6. A filter method in a filter device including a dependency analysis unit, a model application unit, and a filter unit, the method comprising:
a step in which the dependency analysis unit performs dependency analysis on a morphologically analyzed input sentence and creates a tree structure which includes word nodes each representing the part of speech of a corresponding word in the input sentence, in which the word nodes are connected by edges according to the dependency relations between the words, and in which, for each word node, at least one of a node representing the notation of the corresponding word, a node representing the standard notation of the corresponding word, and a node representing the end form of the corresponding word is added as a child node of the word node;
a step in which the model application unit calculates a score representing the degree to which the input sentence is eligible as an utterance sentence, based on a plurality of subtrees obtained from the tree structure created by the dependency analysis unit and on a pre-learned model for determining whether a sentence corresponding to a tree structure is eligible as an utterance sentence; and
a step in which the filter unit determines, based on the score calculated by the model application unit, whether the input sentence is eligible as an utterance sentence, and outputs the input sentence determined to be eligible as an utterance sentence.

7. A program for causing a computer to function as each unit of the model learning device according to claim 1 or claim 2.

8. A program for causing a computer to function as each unit of the filter device according to claim 3 or claim 4.