JPH0769709B2

JPH0769709B2 - Dialogue voice recognition device

Info

Publication number: JPH0769709B2
Application number: JP5002103A
Authority: JP
Inventors: 昌明永田; 逞森元
Original assignee: 株式会社エイ・ティ・アール自動翻訳電話研究所
Priority date: 1993-01-08
Filing date: 1993-01-08
Publication date: 1995-07-31
Anticipated expiration: 2010-07-31
Also published as: JPH06208388A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a dialogue speech recognition device. In particular, it relates to a dialogue speech recognition device that improves the accuracy of sentence speech recognition by rejecting, from among the sentence candidates output by a speech recognition device, candidates that are syntactically and semantically correct when viewed one sentence at a time but are contextually inappropriate, and by selecting the contextually most plausible sentence candidate.

[0002]

[Prior Art and Problems to Be Solved by the Invention] In recent years, to improve accuracy, speech recognition has come to perform language processing that uses knowledge of morphology and syntax in addition to conventional acoustic processing. As a result, speech recognition devices can now output sentence candidates that are both syntactically and semantically correct when viewed one sentence at a time. However, the output of a sentence speech recognition device may include candidates that are syntactically and semantically correct as single sentences yet inappropriate in the given situation or context, and rejecting such candidates has been difficult.

[0003] For example, the response expression "wakarimashita" (わかりました, "I understand") is often misrecognized as an assertion such as "arimashita" (ありました, "there was"), "narimashita" (なりました, "it became"), or "kakarimashita" (かかりました, "it took"). This is due to the difficulty of recognizing the sentence-initial semivowel "w".

[0004]
Secretariat: Do you already have a registration form?
Questioner: No, not yet.
Secretariat: I understand.

Utterance: wakarimashita ("I understand")
Examples of speech recognition errors: > arimashita ("there was") > narimashita ("it became")

"Wakarimashita", "arimashita", and "narimashita" are all syntactically and semantically correct sentences, so conventional sentence-by-sentence processing could not rank these recognition candidates by linguistic plausibility. However, this utterance appears within a basic dialogue pattern of question, response, and confirmation. When the recognition result is ambiguous, it is therefore reasonable to regard the candidate that can be interpreted as a "confirmation" utterance as the more plausible one in context.

[0005] Techniques that use knowledge about the situation and context in which a sentence is uttered for speech recognition have conventionally been based mainly on plan recognition. Such methods infer the speaker's intention from knowledge of the problem domain, knowledge of dialogue structure, knowledge of the dialogue participants' goals, and so on, and predict the content of the sentence to be uttered next.

[0006] However, in such "plan recognition-based methods", knowledge about dialogue structure and the domain must be written into the system by hand, and this is a bottleneck when building a system. Very small experimental systems based on this approach have been built so far. Hand-writing the various kinds of knowledge needed for a larger vocabulary and a broader target domain, however, is expected to be a very difficult task.

[0007] Another problem with "plan recognition-based methods" is that they have no mechanism for assigning a likelihood to the system's predictions. In general, a system presents several hypotheses as the result of inference over various kinds of knowledge, but the plan recognition framework offers no way to evaluate how plausible each hypothesis is. Given that a speech recognition device outputs multiple sentence candidates with different acoustic likelihoods, a system that predicts utterance content from the situation and context cannot be practical unless it has a mechanism for assigning likelihoods to its hypotheses.

[0008] The main object of the present invention is therefore to provide a dialogue speech recognition device that can find the contextually most plausible speech recognition candidate by predicting the speech act type of the next utterance with a dialogue model built from statistical information on sequences of speaker changes and speech act types, rather than with hand-written knowledge about dialogue structure and the domain.

[0009]

[Means for Solving the Problems] The present invention is a dialogue speech recognition device comprising: syntactic analysis means for converting a sentence output by a speech recognition device into a description of its linguistic structure; speech act type interpretation means that takes as input the structural description output by the syntactic analysis means and classifies the semantic effect of the sentence from the viewpoint of how the speaker acts on the listener; dialogue model learning means for creating a dialogue model based on statistical information on sequences of speaker changes and speech act types in dialogue; and sentence candidate selection means for selecting the contextually most plausible sentence speech recognition candidate from among the plurality of sentence candidates output by the speech recognition device, using the history of the sequence of speaker changes and speech act types up to the current utterance and the dialogue model generated by the dialogue model learning means. The device is configured to interpret the speech act types of the plurality of sentence candidates output by the speech recognition device and, using the history of the sequence of speaker changes and speech act types up to the current utterance together with the dialogue model that the dialogue model learning means created from statistical information on such sequences, to find the contextually most plausible speech recognition candidate among the plurality of sentence candidates.

[0010]

[Operation] The dialogue speech recognition device according to the present invention selects the contextually most plausible sentence candidate from among the sentence candidates output by the speech recognition device. It thereby rejects sentences that are syntactically and semantically correct when viewed one sentence at a time but are contextually inappropriate, and improves the accuracy of sentence speech recognition.

[0011]

[Embodiment] FIG. 1 is a schematic block diagram of an embodiment of the present invention. The configuration of the embodiment is first described with reference to FIG. 1. Input speech is recognized by the sentence speech recognition device 1, which outputs a plurality of sentence candidates to the syntactic analysis device 2. The syntactic analysis device 2 parses each sentence candidate output by the sentence speech recognition device 1 using the analysis grammar stored in the memory 5, and outputs a description of its structure to the speech act type interpretation device 3. The speech act type interpretation device 3 takes the structural description output by the syntactic analysis device 2 as input and determines the speech act type of the sentence using the speech act type interpretation rules stored in the memory 6.

[0012] The sentence candidate selection device 4 takes the output of the speech act type interpretation device 3 as input and selects the contextually most plausible sentence candidate using the dialogue model stored in the memory 7. The dialogue model is learned by the dialogue model learning device 10 from the dialogue corpus labeled with speech act types, which is stored in the memory 9. The contents of the memory 9 are created by assigning a speech act type to each sentence of the dialogue corpus using the syntactic analysis device 2 and the speech act type interpretation device 3.

[0013] FIG. 2 shows an example of the classification of speech act types used by the speech act type interpretation device in an embodiment of the present invention. FIG. 3 is an example of a speech act type inference rule used by the speech act type interpretation device. FIG. 4 is an example of the operation of the speech act type interpretation device. FIG. 5 is an example of a dialogue corpus labeled with speech act types. FIG. 6 is an example of a dialogue model. FIG. 7 is a flow diagram explaining the operation of the sentence candidate selection device. FIG. 8 shows an example of the operation of the sentence candidate selection device.

[0014] Next, the specific operation of the embodiment is described with reference to FIGS. 1 to 8. The description first covers how the speech act type interpretation device 3 determines the speech act type of an input sentence, then how the dialogue model learning device 10 learns a dialogue model from a dialogue corpus labeled with speech act types, and finally how the sentence candidate selection device 4 selects the contextually most plausible sentence candidate from among the sentence speech recognition candidates.

[0015] First, the method by which the speech act type interpretation device 3 determines the speech act type of an input sentence is described. The speech act type interpretation device 3 serves two purposes: assigning a speech act type to each sentence candidate output by the sentence speech recognition device 1, and assigning a speech act type to each sentence of the dialogue corpus.

[0016] Before describing how the speech act type is determined, we first explain what a speech act and a speech act type are.

[0017] Speech act theory regards the use of language as the performance of an act in pursuit of some purpose. According to this theory, the act performed by an utterance has three aspects: the locutionary act, the illocutionary act, and the perlocutionary act. Of these, the illocutionary act is regarded as the basic unit of human linguistic communication. An illocutionary act basically consists of a propositional content and an illocutionary force. The illocutionary force, that is, the semantic effect of the illocutionary act, is classified into categories such as statement, command, and promise, and the conditions under which an act takes effect have been studied.

[0018] A model of human dialogue, that is, knowledge about strategies for communicating information through language, is best described at the level of the speech act (hereinafter, for convenience, the illocutionary act is referred to simply as the speech act). However, the performative aspect of an act is not always expressed explicitly at the linguistic surface, so such performative analysis is somewhat too abstract to use directly in natural language processing. In this embodiment, therefore, the classification of speech act types shown in FIG. 2 is defined as a classification at a level that corresponds relatively directly to surface syntactic patterns, and it is used as the basic building block for describing patterns of dialogue exchange.

[0019] In FIG. 2, phatic denotes idiomatic expressions used in greetings and the like, such as "moshi moshi" ("hello") and "sayonara" ("goodbye"). expressive denotes idiomatic expressions of the speaker's feelings, such as "arigatou gozaimasu" ("thank you") and "yoroshiku onegaishimasu" (a conventional expression of goodwill). response denotes answers to questions and the like, and backchannel responses. promise denotes expressions in which the speaker commits to performing some act, and request denotes expressions in which the speaker asks the listener to perform an act. inform denotes expressions in which the speaker provides information to the listener. questionif is a yes/no question, questionref is a wh-question, and questionconf is a confirmation. In addition to these speech act types, a special speech act type called DBM (Discourse Boundary Marker), which marks the start and end of a dialogue, is provided to capture the fact that conventional and ritual expressions appear frequently at the start and end of a dialogue.

[0020] The speech act type interpretation device 3 determines the speech act type of an input sentence using speech act type interpretation rules. These are pattern-action rules that detect and rewrite specific patterns in the semantic representation of the input sentence, which is expressed as a feature structure output by the syntactic analysis device 2. The speech act type is expressed as the top-level relation name of the feature structure of the semantic representation that the speech act type interpretation device 3 outputs.

[0021] For example, the speech act type interpretation rule shown in FIG. 3 generates a feature structure whose speech act type is promise whenever the input contains a feature structure corresponding to "-(sa)sete itadakimasu" (a humble form meaning roughly "allow me to ..."). Conceptually, it expresses the following relation.

[0022]
させる-permissive + てもらう-receivefavor → promise

In FIG. 3, the content following in= expresses the pattern to be searched for in the semantic representation of the input sentence, and the content following out= expresses how the matched part is to be rewritten.

[0023] FIG. 4 is an example of the operation of the speech act type interpretation device 3. FIG. 4(a) is the feature structure of the semantic representation of the input sentence output by the syntactic analysis device 2, which is the input to the speech act type interpretation device 3. The part of the feature structure in FIG. 4(a) corresponding to "-(sa)sete itadakimasu" matches the pattern following in= in FIG. 3 and is rewritten into the pattern following out= in FIG. 3. As a result, the speech act type interpretation device 3 generates the feature structure shown in FIG. 4(b), whose speech act type is promise.
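The in=/out= rewriting described above can be illustrated with a minimal sketch. The contents of FIGS. 3 and 4 are not reproduced in this text, so the nested-dictionary encoding and the attribute names `reln` and `obje` are assumptions made only for illustration; only the relation names させる-permissive, てもらう-receivefavor, and promise come from the description.

```python
from typing import Any, Dict, Optional

# Hypothetical encoding: a feature structure is a nested dict whose "reln"
# entry holds the relation name and whose "obje" entry holds an embedded
# structure. These attribute names are assumptions, not taken from FIG. 3.
FeatureStructure = Dict[str, Any]


def apply_promise_rule(fs: FeatureStructure) -> Optional[FeatureStructure]:
    """Rewrite a matching structure so its top-level relation is 'promise'.

    Returns None when the input does not match the rule's `in` pattern.
    """
    # Assumed `in` pattern: てもらう-receivefavor wrapping させる-permissive.
    if fs.get("reln") != "てもらう-receivefavor":
        return None
    inner = fs.get("obje")
    if not isinstance(inner, dict) or inner.get("reln") != "させる-permissive":
        return None
    # Assumed `out` pattern: keep the embedded action, relabel the top level.
    return {"reln": "promise", "obje": inner.get("obje")}


def speech_act_type(fs: FeatureStructure) -> str:
    """Read the speech act type off the top-level relation name after rewriting."""
    rewritten = apply_promise_rule(fs)
    return (rewritten if rewritten is not None else fs)["reln"]
```

A real interpreter would hold many such rules and search for the pattern anywhere in the structure; this sketch checks a single rule at the top level only.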

[0024] Next, the procedure by which the dialogue model learning device 10 learns a dialogue model from the dialogue corpus labeled with speech act types is described.

[0025] FIG. 5 shows an example of a dialogue corpus labeled with speech act types. It is a conversation between a secretariat and a questioner in the task "inquiries about participation in an international conference". As shown in FIG. 1, the labeled dialogue corpus is created by assigning speech act types to a dialogue corpus using the syntactic analysis device 2 and the speech act type interpretation device 3.

[0026] In this embodiment, a Markov model over sequences of speech act types is used as the dialogue model. First, let P(S) be the probability that a dialogue is the speech act type sequence S = S_1, S_2, ..., S_n. The probability P(S) can be expressed as in equation (1).

[0027]

[Equation 1]
$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1, \ldots, S_{i-1}) \tag{1}$$

[0028] Here, P(S_i | S_1, ..., S_{i-1}) is the probability that a sentence with speech act type S_i is uttered after sentences with speech act types S_1, ..., S_{i-1} have been uttered. Because P(S_i | S_1, ..., S_{i-1}) is difficult to obtain in practice, it is approximated by a second-order Markov process as in equation (2).

[0029]

[Equation 2]
$$P(S_i \mid S_1, \ldots, S_{i-1}) \approx P(S_i \mid S_{i-2}, S_{i-1}) \tag{2}$$

[0030] The probability on the right-hand side of equation (2) can be estimated, as in equation (3), from the ratio of the occurrence counts of the pair (bigram), C(S_{i-2}, S_{i-1}), and of the triple (trigram), C(S_{i-2}, S_{i-1}, S_i).

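[0031]

[Equation 3]
$$P(S_i \mid S_{i-2}, S_{i-1}) \approx f(S_i \mid S_{i-2}, S_{i-1}) = \frac{C(S_{i-2}, S_{i-1}, S_i)}{C(S_{i-2}, S_{i-1})} \tag{3}$$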

[0032] Here, f(S_i | S_{i-2}, S_{i-1}) is called the relative frequency of occurrence. The dialogue model learning device 10 therefore counts, in the dialogue corpus labeled with speech act types, the occurrences of consecutive pairs (S_{i-2}, S_{i-1}) and consecutive triples (S_{i-2}, S_{i-1}, S_i) in the speech act type sequences, and determines P(S_i | S_{i-2}, S_{i-1}) by computing the relative frequency of occurrence.
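The counting procedure just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `train_dialogue_model` and the representation of the corpus as lists of label strings are assumptions, and the sketch counts every consecutive pair and triple exactly as the text states, with no smoothing.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Trigram = Tuple[str, str, str]


def train_dialogue_model(dialogues: Iterable[Sequence[str]]) -> Dict[Trigram, float]:
    """Estimate P(S_i | S_{i-2}, S_{i-1}) as C(trigram) / C(bigram)."""
    bigrams: Counter = Counter()
    trigrams: Counter = Counter()
    for labels in dialogues:
        # Count every consecutive pair (S_{i-2}, S_{i-1}).
        for k in range(len(labels) - 1):
            bigrams[(labels[k], labels[k + 1])] += 1
        # Count every consecutive triple (S_{i-2}, S_{i-1}, S_i).
        for k in range(len(labels) - 2):
            trigrams[(labels[k], labels[k + 1], labels[k + 2])] += 1
    # Relative frequency: each trigram count divided by the count of its
    # two-label context. The context of any counted trigram was itself
    # counted as a bigram, so the divisor is never zero.
    return {tri: count / bigrams[tri[:2]] for tri, count in trigrams.items()}
```

Because the final pair of each dialogue has no successor, the relative frequencies for that context sum to less than one; a practical system would likely add smoothing or back-off, which the description does not cover.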

[0033] FIG. 6 is an example of a learned dialogue model. The first column of each row shows a sequence of speech act types separated by ":". The second column is the probability of the sequence shown in the first column, that is, the P(S_i | S_{i-2}, S_{i-1}) described above. The dialogue model shown in FIG. 6 is learned not from plain sequences of speech act types but from sequences of labels that each combine a speaker with a speech act type. This lets the dialogue model reflect the effect of speaker changes, for example that after one party asks a question (questionif) the other party usually answers (inform), as well as differences between speakers, for example that in the "inquiries about participation in an international conference" dialogues the questioner takes the initiative in starting and ending the dialogue.
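As a rough illustration of labels that combine speaker and speech act type, the sketch below builds one label per utterance and renders consecutive triples as ":"-separated keys like those in the first column of the model table. The "speaker-type" label format, the function names, and the choice to wrap each dialogue in DBM markers are assumptions; FIG. 6 is not reproduced in this text.

```python
from typing import List, Sequence, Tuple

DBM = "DBM"  # discourse boundary marker from the speech act type classification


def to_labels(turns: Sequence[Tuple[str, str]]) -> List[str]:
    """Combine speaker and speech act type into one label per utterance.

    The "speaker-type" format is a hypothetical choice for this sketch.
    """
    body = [f"{speaker}-{act}" for speaker, act in turns]
    # Assumption: each dialogue is delimited by DBM at its start and end.
    return [DBM] + body + [DBM]


def trigram_rows(labels: Sequence[str]) -> List[str]:
    """Render each consecutive triple of labels as a ':'-separated row key."""
    return [":".join(labels[k:k + 3]) for k in range(len(labels) - 2)]
```

With labels of this kind, "questioner asks, secretariat informs" and "questioner asks, questioner informs" become distinct sequences, which is what allows the model to capture turn-taking.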

[0034] Finally, the method by which the sentence candidate selection device 4 selects the contextually most plausible sentence candidate from among the sentence speech recognition candidates is described.

[0035] FIG. 7 is a flow diagram explaining the operation of the sentence candidate selection device. The operation of the sentence candidate selection device 4 is described below with reference to FIG. 7. In FIG. 7, S_{ij} denotes the speech act type of the j-th sentence speech recognition candidate for the i-th utterance in the dialogue. The sentence speech recognition device 1 is assumed to output m candidates for each sentence. Prob(i) and Q(i) denote, for the i-th utterance, the probability of the contextually most plausible sentence and the number of that sentence speech recognition candidate, respectively. The speech act type of the candidate selected as contextually most plausible for the i-th utterance is therefore written S_{iQ(i)}.

[0036] In step SP1 of FIG. 7 (steps are abbreviated SP in the figure), the processing performed for each utterance is initialized. First, the utterance number i is set to 1. Next, the initial value S_{-1Q(-1)} of the speech act type of the utterance two before is set to "" (the empty string), which denotes that it is undefined. The initial value S_{0Q(0)} of the speech act type of the immediately preceding utterance is set to DBM, which denotes the starting point of the dialogue. In step SP2, it is determined whether the dialogue has ended. If the end of the dialogue is detected, the operation of the sentence candidate selection device 4 ends. Otherwise, the following processing is performed for each utterance. In step SP3, the processing performed for each sentence candidate is initialized. First, the sentence candidate number j is set to 1. Next, the initial value of Prob(i), the probability of the contextually most plausible sentence candidate obtained from the dialogue model, is set to 0, and the initial value of its speech act type is set to "" (the empty string).

[0037] In step SP4, it is determined whether all sentence candidates have been examined. If they have, the utterance number is advanced by one in step SP7 and processing returns to step SP2. At that point, Q(i) holds the number of the contextually most plausible sentence speech recognition candidate, Prob(i) holds its probability, and S_{iQ(i)} holds its speech act type. If speech recognition candidates remain, the following processing is performed.

[0038] In step SP5, equation (4) is used to check whether the sentence speech recognition candidate currently under consideration is contextually more plausible than the most plausible of the candidates examined so far.

[0039]

[Equation 4]
$$P(S_{ij} \mid S_{i-2\,Q(i-2)}, S_{i-1\,Q(i-1)}) > \mathrm{Prob}(i) \tag{4}$$

[0040] If so, the values of Q(i), Prob(i), and S_{iQ(i)} are updated to those of the current sentence speech recognition candidate. The probability on the left-hand side of equation (4) is obtained from the dialogue model. Because equation (4) excludes the case of equality, when the dialogue model gives the same probability the higher-ranked sentence speech recognition candidate (the one with the higher acoustic likelihood) takes precedence.

[0041] In step SP6, the sentence candidate number is advanced by one and processing returns to step SP4.
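Steps SP1 to SP7 can be summarized in a short sketch. It assumes that the dialogue model is a dictionary mapping a triple of labels to a probability, that candidates for each utterance arrive in descending order of acoustic likelihood, and that the dialogue is given as a finite list rather than detected online; the function name `select_candidates` and the use of 0 for "no candidate selected" are likewise assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Trigram = Tuple[str, str, str]
DBM = "DBM"


def select_candidates(
    utterances: Sequence[Sequence[str]],
    model: Dict[Trigram, float],
) -> List[Tuple[int, float, str]]:
    """Return (Q(i), Prob(i), S_iQ(i)) for each utterance i.

    utterances[i] lists the speech act types of the candidates for one
    utterance, ordered from highest to lowest acoustic likelihood.
    """
    # SP1: the type two utterances back is undefined (""), the previous is DBM.
    prev2, prev1 = "", DBM
    results: List[Tuple[int, float, str]] = []
    for candidates in utterances:  # SP2 / SP7: loop until the dialogue ends
        # SP3: Prob(i) starts at 0 and the selected type at "".
        # Q(i) = 0 here means "nothing selected yet" (an assumption).
        best_j, best_prob, best_type = 0, 0.0, ""
        for j, act in enumerate(candidates, start=1):  # SP4 / SP6
            # SP5: look up P(S_ij | S_{i-2 Q(i-2)}, S_{i-1 Q(i-1)}).
            p = model.get((prev2, prev1, act), 0.0)
            # Strict '>' keeps the higher-ranked candidate on ties.
            if p > best_prob:
                best_j, best_prob, best_type = j, p, act
        results.append((best_j, best_prob, best_type))
        prev2, prev1 = prev1, best_type
    return results
```

If every candidate has probability 0 under the model, no candidate is selected and the recorded type stays empty; the description does not say how such a case should be handled, so a practical system would need a fallback such as taking the top acoustic candidate.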

[0042] Next, an example of the operation of the sentence candidate selection device is described with reference to FIG. 8. FIG. 8(a) shows the dialogue content up to the current utterance together with the sequence of speakers and speech act types, and FIG. 8(b) shows the content of the current utterance. FIG. 8(c) shows the sentence candidates that the sentence speech recognition device 1 outputs for this utterance, with the speech act type of each candidate shown beside it. FIG. 8(d) shows the speech act types of the sentence predicted as the next utterance by the "secretariat", based on the history of speakers and speech act types up to the current utterance. FIG. 8(e) shows the sentence speech recognition candidate that the sentence candidate selection device 4 selected from those in FIG. 8(c) as contextually most plausible, based on the likelihoods given by the dialogue model shown in FIG. 8(d).

[0043]

[Effects of the Invention] As described above, according to the present invention, the sentence candidate selection means selects the contextually most plausible sentence speech recognition candidate from among the sentence candidates output by the speech recognition means, using the history of the sequence of speaker changes and speech act types up to the current utterance and a dialogue model created from statistical information on sequences of speaker changes and speech act types in dialogue. Contextually inappropriate sentence candidates can therefore be rejected, and a dialogue speech recognition device with improved sentence speech recognition accuracy can be realized.

[Brief description of drawings]

[FIG. 1] A schematic block diagram of an embodiment of the present invention.

[FIG. 2] A diagram showing an example of the classification of speech act types used by the speech act type interpretation device.

[FIG. 3] A diagram showing an example of a speech act type interpretation rule.

[FIG. 4] A diagram showing an example of the operation of the speech act type interpretation device.

[FIG. 5] A diagram showing an example of a dialogue corpus labeled with speech act types.

[FIG. 6] A diagram showing an example of a dialogue model.

[FIG. 7] A flow diagram explaining the operation of the sentence candidate selection device in an embodiment of the present invention.

[FIG. 8] A diagram showing an example of the operation of the sentence candidate selection device.

[Explanation of symbols]

1 Sentence speech recognition device
2 Syntactic analysis device
3 Speech act type interpretation device
4 Sentence candidate selection device
5 Memory (analysis grammar)
6 Memory (speech act type interpretation rules)
7 Memory (dialogue model)
8 Memory (dialogue corpus)
9 Memory (dialogue corpus labeled with speech act types)
10 Dialogue model learning device

Claims

[Claims]

1. A dialogue speech recognition device for recognizing dialogue in spoken language, comprising: syntactic analysis means for converting a sentence output by a speech recognition device into a description of its structure, using a grammar description of the structure of the language; speech act type interpretation means that takes as input the structural description of the sentence output by the syntactic analysis means and classifies the semantic effect of the sentence from the viewpoint of how the speaker acts on the listener; dialogue model learning means for creating a dialogue model based on statistical information on sequences of speaker changes and speech act types in dialogue; and sentence candidate selection means for selecting the contextually most plausible sentence speech recognition candidate from among a plurality of sentence candidates output by the speech recognition device, using the history of the sequence of speaker changes and speech act types up to the current utterance and the dialogue model generated by the dialogue model learning means; wherein the speech act types of the plurality of sentence candidates output by the speech recognition device are interpreted using the syntactic analysis means and the speech act type interpretation means, and the contextually most plausible speech recognition candidate is obtained using the history of the sequence of speaker changes and speech act types up to the current utterance and the dialogue model created by the dialogue model learning means.