JP2019015950A - Voice recognition method, program, voice recognition device, and robot

Info

Publication number: JP2019015950A
Application number: JP2018038717A
Authority: JP
Inventors: Yuji Kunitake (國武 勇次); Yusaku Ota (太田 雄策)
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2017-07-05
Filing date: 2018-03-05
Publication date: 2019-01-31

Abstract

To improve recognition accuracy even when the speaker is an infant or when the input utterance is strongly affected by noise. SOLUTION: A speech recognition method includes: extracting a first utterance from sound picked up by a microphone associated with a speech processing device; calculating a recognition result for the first utterance and a reliability of the first utterance; outputting speech that asks the speaker to repeat the utterance, based on the calculated reliability of the first utterance; extracting, via the microphone, a second utterance obtained in response to that request; calculating a recognition result for the second utterance and a reliability of the second utterance; and generating a recognition result from the recognition results of the first and second utterances, based on the calculated reliability of the second utterance. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to a speech recognition technique.

In recent years, various speech recognition methods have been proposed for estimating, from uttered speech data, a word string that represents the content of the utterance.

For example, Patent Document 1 discloses the following speech recognition method. The uttered speech data is divided into a plurality of phoneme sections x, and a phoneme model is assigned to each phoneme section x. For the n-th phoneme section x, the method obtains the likelihood Psn of the assigned phoneme model p, and the difference likelihood Pdn between the likelihood Psn and Pmaxn, which is the highest likelihood for the phoneme section x among the phoneme models other than the phoneme model p. The likelihood Psn and the difference likelihood Pdn are then input to a correct phoneme section likelihood model and an incorrect phoneme likelihood model, respectively, to obtain the likelihood Lnc of the correct phoneme section likelihood model and the likelihood Lni of the incorrect phoneme likelihood model. Next, the difference likelihood cn between the likelihood Lnc and the likelihood Lni is obtained, and the sum of the difference likelihoods cn over all phoneme sections is obtained as the word reliability WC. If the word reliability WC is equal to or greater than a threshold, the phoneme string assigned to the speech data is output; if the word reliability WC is less than the threshold, the phoneme string is rejected.

However, Patent Document 1 considers only phoneme likelihood and does not consider language likelihood at all, so it cannot reproduce a phoneme string that is natural as language.

Non-Patent Document 1 therefore discloses a method that recognizes an utterance and estimates a word string by using both acoustic likelihood and language likelihood. Specifically, in Non-Patent Document 1, the word string W that maximizes the product of the probabilities on the right side of Equation (1) is selected as the recognition result. Here, w is an arbitrary word string, and P(O|w) is the probability that the phoneme string of the word string w is O (the acoustic likelihood), which is calculated by an acoustic model. P(w) is the probability indicating how plausible w is as language (the language likelihood), which is calculated by a language model based on occurrence-frequency information for consecutive words, such as n-grams.
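Equation (1) itself is not reproduced in this text. Based on the definitions of W, w, P(O|w), and P(w) given above, it presumably takes the standard form below; this is a reconstruction from those definitions, not a copy of the original formula.

```latex
\[
W = \operatorname*{arg\,max}_{w} \; P(O \mid w)\, P(w) \tag{1}
\]
```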

This method also outputs the product of the acoustic likelihood and the language likelihood, together with the recognition result, as the certainty of the recognition result (the reliability of the sentence).

Patent Document 1: JP-A-11-249688 (特開平11-249688号公報)

Non-Patent Document 1: Takaaki Hori and Hajime Tsukada (堀貴明・塚田元), "State of the Art in Speech Information Processing Technology: 3. Speech Recognition Using Weighted Finite-State Transducers," Journal of the Information Processing Society of Japan (情報処理学会誌), Vol. 45, No. 10, pp. 1020-1026, October 2004

However, in Non-Patent Document 1, when the input utterance is ambiguous or is strongly affected by noise or the like, the reliability of the sentence takes a low value. When the reliability of the sentence is low, the recognition result is more likely to contain errors.

The present disclosure has been made to solve this problem.

A speech recognition method according to one aspect of the present disclosure includes:
receiving, via a microphone, a first utterance spoken by a speaker intending one word,
the first utterance being composed of N phonemes (N is a natural number of 2 or more);
calculating, for each of the N phonemes constituting the first utterance, the appearance probabilities of all types of phonemes;
recognizing, as a first phoneme string corresponding to the first utterance, a phoneme string in which the phonemes having the maximum appearance probability for each of the N phonemes are arranged in order from the first phoneme to the N-th phoneme of the first utterance;
calculating a first value by multiplying together the appearance probabilities of the N phonemes constituting the first phoneme string;
when the first value is less than a first threshold, outputting, through a loudspeaker, speech prompting the speaker to utter the one word again;
receiving, via the microphone, a second utterance spoken again by the speaker intending the one word, the second utterance being composed of M phonemes (M is a natural number of 2 or more);
calculating, for each of the M phonemes constituting the second utterance, the appearance probabilities of all types of phonemes;
recognizing, as a second phoneme string corresponding to the second utterance, a phoneme string in which the phonemes having the maximum appearance probability for each of the M phonemes are arranged in order from the first phoneme to the M-th phoneme of the second utterance;
calculating a second value by multiplying together the appearance probabilities of the M phonemes constituting the second phoneme string;
when the second value is less than the first threshold, extracting the phonemes in the first phoneme string that have an appearance probability equal to or greater than a second threshold and the phonemes in the second phoneme string that have an appearance probability equal to or greater than the second threshold;
extracting, from a dictionary stored in a memory, words that include the extracted phonemes, the dictionary associating each word with the phoneme string corresponding to that word; and
when exactly one word is extracted, recognizing the extracted word as corresponding to the one word.
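The steps above can be illustrated with a minimal Python sketch. It is not the implementation described in this document: the toy dictionary, the probability tables, the two threshold values, and all function names are hypothetical, and the sketch assumes that "words that include the extracted phonemes" means words with the same phoneme at the same position, and that both utterances have the same number of phonemes.

```python
from math import prod

def best_phoneme_string(posteriors):
    """Pick the most probable phoneme at each position.

    posteriors: one dict per phoneme position, mapping phoneme -> appearance probability.
    Returns a list of (phoneme, probability) pairs.
    """
    return [max(dist.items(), key=lambda kv: kv[1]) for dist in posteriors]

def string_value(best):
    """Product of the appearance probabilities along the phoneme string."""
    return prod(p for _, p in best)

def reliable_positions(best, second_threshold):
    """(index, phoneme) pairs whose probability meets the second threshold."""
    return {(i, ph) for i, (ph, p) in enumerate(best) if p >= second_threshold}

def words_matching(dictionary, constraints):
    """Words whose phoneme string has the required phoneme at every required position
    (positional matching is an assumption of this sketch)."""
    return [
        word for word, phonemes in dictionary.items()
        if all(i < len(phonemes) and phonemes[i] == ph for i, ph in constraints)
    ]

# Hypothetical toy data: word -> phoneme string, and per-position probabilities.
dictionary = {"cat": ["k", "a", "t"], "cap": ["k", "a", "p"], "mat": ["m", "a", "t"]}
first = [{"k": 0.9, "m": 0.1}, {"a": 0.5, "o": 0.4, "e": 0.1}, {"p": 0.4, "t": 0.35, "k": 0.25}]
second = [{"m": 0.5, "k": 0.45, "n": 0.05}, {"a": 0.6, "o": 0.4}, {"t": 0.85, "p": 0.15}]
FIRST_THRESHOLD = 0.7   # assumed value
SECOND_THRESHOLD = 0.8  # assumed value

best1 = best_phoneme_string(first)
best2 = best_phoneme_string(second)

candidates = []
# Only the fallback path is sketched: both utterances score below the first threshold.
if string_value(best1) < FIRST_THRESHOLD and string_value(best2) < FIRST_THRESHOLD:
    constraints = (reliable_positions(best1, SECOND_THRESHOLD)
                   | reliable_positions(best2, SECOND_THRESHOLD))
    candidates = words_matching(dictionary, constraints)

# A single surviving candidate is taken as the recognized word.
recognized = candidates[0] if len(candidates) == 1 else None
print(recognized)
```

In this toy run, neither utterance is reliable on its own, but the first contributes a confident initial phoneme and the second a confident final phoneme, which together leave a single dictionary entry.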

The present disclosure can improve recognition accuracy even when the speaker is an infant or in an environment where the input utterance is strongly affected by noise.

FIG. 1 is a diagram showing an example of the overall configuration of the voice interaction system according to Embodiment 1.
FIG. 2 is a diagram showing an example of the appearance probabilities calculated for each phoneme in an utterance consisting of two phonemes.
FIG. 3 is a diagram summarizing the products of the appearance probabilities for each combination of a first phoneme and a second phoneme in FIG. 2.
FIG. 4 is a flowchart showing an example of the recognition processing in Embodiment 1.
FIG. 5 is a diagram showing an example of a dialogue in Embodiment 1.
FIG. 6 is a diagram showing an example of the first recognition result and the second recognition result for the dialogue example of FIG. 5.
FIG. 7 is a diagram showing an example of the data structure of the word dictionary.
FIG. 8 is a diagram showing an example of recognition candidate words extracted from the first recognition result.
FIG. 9 is a diagram showing another example of the processing for narrowing down recognition candidate words from the first recognition result and the second recognition result in Embodiment 1.
FIG. 10 is a diagram showing an example of the overall configuration of the voice interaction system according to Embodiment 2.
FIG. 11 is a diagram showing an example of a speech signal divided into a plurality of frames.
FIG. 12 is a flowchart showing an example of the recognition processing in Embodiment 2.
FIG. 13 is a diagram showing an example of the search space when a 1-gram language model is adopted in a specific example of Embodiment 2.
FIG. 14 is a diagram showing an example of the word dictionary when a 2-gram language model is adopted in a specific example of Embodiment 2.
FIG. 15 is a diagram showing an example of the search space when a 2-gram language model is adopted in a specific example of Embodiment 2.
FIG. 16 is a diagram showing the search space when the appearance probabilities of each phoneme of the first recognition result and each phoneme of the second recognition result are combined in the specific example of Embodiment 2.
FIG. 17 is a diagram showing an example of the overall configuration of the voice interaction system according to Embodiment 3.
FIG. 18 is a flowchart explaining an example of the recognition processing in Embodiment 3.
FIG. 19 is a diagram showing an example of the 5-best of the first recognition result in Embodiment 3.
FIG. 20 is a diagram showing an example of the 5-best of the second recognition result in Embodiment 3.
FIG. 21 is an external view of a robot in which the speech recognition device according to Embodiments 1 to 3 is mounted.

(Findings underlying the present disclosure)
Technologies are being studied for voice interaction systems that analyze the content of an utterance from the user's speech and return a natural response based on the analysis result, thereby realizing natural dialogue with the user or providing services such as device control or information provision.

In general speech recognition systems intended for adults, recognition accuracy exceeds 90%. Even when an utterance cannot be recognized, the system can discard the low-reliability recognition result and ask the speaker to repeat the utterance slowly or clearly, which is sufficient to obtain a recognition result with high reliability.

However, in a general speech recognition system, recognition accuracy is low for the utterances of infants who are still acquiring language, and in environments where the input utterance is strongly affected by noise. A highly reliable recognition result therefore cannot be obtained even if the system asks the speaker to repeat the utterance.

Non-Patent Document 1 can output a word string that is plausible as language, but it does not disclose asking the speaker to repeat an utterance when a low-reliability recognition result is obtained, so it cannot solve the above problem.

Patent Document 1 merely discloses discarding a recognition result when its reliability is low, and likewise does not disclose asking the speaker to repeat the utterance. Like Non-Patent Document 1, it cannot solve the above problem.

The present inventors therefore found that recognition accuracy can be improved, even when the speaker is an infant or in an environment where the input utterance is strongly affected by noise, by taking into account both the low-reliability recognition result, rather than simply discarding it, and the recognition result obtained by asking the speaker to repeat the utterance. This finding led to the present disclosure.

According to this configuration, even when the first value of the first phoneme string, obtained by recognizing the first utterance intending the one word, is lower than the first threshold and the first phoneme string therefore has low reliability, the first phoneme string is not discarded. When the second value of the second utterance intending the one word, obtained by asking the speaker to repeat, is also lower than the first threshold and the second phoneme string also has low reliability, highly reliable phonemes are extracted from each of the first and second phoneme strings and compared with the dictionary to extract the word corresponding to the one word.

In this way, even if a low-reliability recognition result is obtained for the first utterance, this configuration does not discard that result but uses it when a low-reliability recognition result is also obtained for the second utterance. Consequently, even if asking the speaker to repeat does not yield a highly reliable recognition result, the one word is recognized using the highly reliable phonemes from both recognition results, namely the first and second phoneme strings, so the recognition accuracy for the one word can be improved.

Furthermore, because this configuration extracts from the dictionary a word that includes the highly reliable phonemes of the first and second phoneme strings, it prevents linguistically unnatural recognition results.

As described above, this configuration can improve recognition accuracy even when the speaker is an infant or in an environment where the input utterance is strongly affected by noise.

In the above configuration, when a plurality of words are extracted, the method may:
output, through the loudspeaker, speech asking the speaker whether he or she uttered each of the extracted words;
receive, via the microphone, an affirmative or negative answer from the speaker; and
recognize the word corresponding to the affirmative answer as corresponding to the one word.

According to this configuration, when a plurality of words that include the highly reliable phonemes of the first and second phoneme strings are extracted from the dictionary, the speaker is asked directly which word was uttered, so recognition accuracy can be improved.
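The confirmation step for multiple candidates can be sketched as a short loop. This is only an illustration: `confirm_word` and the `ask` callback are hypothetical names, and `ask` stands in for playing a question through the loudspeaker and interpreting the spoken yes/no answer received via the microphone.

```python
def confirm_word(candidates, ask):
    """Resolve the recognized word among dictionary candidates.

    candidates: words extracted from the dictionary.
    ask: hypothetical callback; ask(word) returns True when the speaker
         answers affirmatively to "Did you say <word>?".
    Returns the confirmed word, or None if no candidate is confirmed.
    """
    if len(candidates) == 1:
        # A single candidate is accepted without asking.
        return candidates[0]
    for word in candidates:
        if ask(word):
            return word
    return None

# Simulated speaker who meant "cap" (a stand-in for real audio I/O).
answer = confirm_word(["cat", "cap"], lambda word: word == "cap")
print(answer)
```

Asking about one candidate at a time keeps each answer a simple yes or no, which suits speakers whose free-form replies are hard to recognize.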

A speech recognition method according to another aspect of the present disclosure includes:
receiving, via a microphone, a first utterance spoken by a speaker intending one word string,
the first utterance being composed of N phonemes (N is a natural number of 2 or more);
calculating a reliability X1 of a word string estimated for the first utterance,

where t is a number designating a frame constituting the first utterance,
T is the total number of frames constituting the first utterance,
P_A1(o_t, s_t | s_{t-1}) is the probability that, following the phoneme string corresponding to the state s_{t-1} from the 1st frame to the (t-1)-th frame of the first utterance, an arbitrary phoneme appears in the t-th frame and a transition is made to the phoneme string corresponding to the state s_t,
o_t is a physical quantity obtained from the first utterance and used to estimate the arbitrary phoneme,
the arbitrary phoneme denotes all types of phonemes, and
P_L1(s_t, s_{t-1}) is the probability that, following the word string corresponding to the state s_{t-1} in the first utterance, an arbitrary word appears in the t-th frame and a transition is made to the word string corresponding to the state s_t;
determining whether the reliability X1 is equal to or greater than a threshold;
when the reliability X1 is less than the threshold, outputting, through a loudspeaker, speech prompting the speaker to utter the one word string again;
receiving, via the microphone, a second utterance spoken again by the speaker intending the one word string;
when the reliability X1 of the second utterance is less than the threshold, calculating a combined reliability X for all word strings estimated from the first utterance and the second utterance,

where t is a number designating a frame constituting the first utterance and the second utterance,
T is the total number of frames constituting the first utterance and the second utterance,
P_A1(o_t, s_t | s_{t-1}) is the probability that, following the phoneme string corresponding to the state s_{t-1} from the 1st frame to the (t-1)-th frame of the first utterance, an arbitrary phoneme appears in the t-th frame and a transition is made to the phoneme string corresponding to the state s_t,
o_t is a physical quantity obtained from the first utterance and used to estimate the arbitrary phoneme,
the arbitrary phoneme denotes all types of phonemes,
P_A2(q_t, s_t | s_{t-1}) is the probability that, following the phoneme string corresponding to the state s_{t-1} from the 1st frame to the (t-1)-th frame of the second utterance, an arbitrary phoneme appears in the t-th frame and a transition is made to the phoneme string corresponding to the state s_t,
q_t is a physical quantity obtained from the second utterance and used to estimate the arbitrary phoneme, and
P_L(s_t, s_{t-1}) is the probability that, following the word string corresponding to the state s_{t-1} in the first utterance, an arbitrary word appears in the t-th frame and a transition is made to the word string corresponding to the state s_t; and
recognizing, as the one word string, the word string corresponding to the state s_t that gives the maximum value of the combined reliability X.

According to this configuration, the first utterance intending the one word string is divided into T frames, and the word string that maximizes the product of the phoneme-string probability P_A1(o_t, s_t | s_{t-1}) and the word-string probability P_L1(s_t, s_{t-1}), for the transition from the state s_{t-1} up to the (t-1)-th frame to the state s_t up to the t-th frame, is recognized as the one word string.

Even when the reliability X1 of the word string of the first utterance is lower than the threshold and that word string therefore has low reliability, the word string of the first utterance is not discarded. When the reliability X1 of the word string of the second utterance intending the one word string, obtained by asking the speaker to repeat, is also lower than the threshold and that word string also has low reliability, the combined reliability X is calculated as the product of (i) the sum of the phoneme-string probability P_A1(o_t, s_t | s_{t-1}) of the first utterance and the phoneme-string probability P_A2(q_t, s_t | s_{t-1}) of the second utterance in the state s_t and (ii) the word-string probability P_L(s_t, s_{t-1}) in the state s_t, and the word string that maximizes the combined reliability X is recognized as the one word string.
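The formulas for the reliability X1 and the combined reliability X are not reproduced in this text. One reading that is consistent with the symbol definitions and with the description of X as a product of the word-string probability and the sum of the two phoneme-string probabilities is given below. Taking the product over the frames t = 1, ..., T is an assumption, so these should be read as a reconstruction rather than as the original equations.

```latex
\[
X_1 = \prod_{t=1}^{T} P_{A1}(o_t, s_t \mid s_{t-1}) \, P_{L1}(s_t, s_{t-1})
\]
\[
X = \prod_{t=1}^{T} \bigl\{ P_{A1}(o_t, s_t \mid s_{t-1}) + P_{A2}(q_t, s_t \mid s_{t-1}) \bigr\} \, P_{L}(s_t, s_{t-1})
\]
```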

In this way, even if a low-reliability recognition result is obtained for the first utterance, this configuration does not discard that result but uses it when a low-reliability recognition result is also obtained for the second utterance. Consequently, even if asking the speaker to repeat does not yield a highly reliable recognition result, the one word string is recognized by combining both recognition results, so the recognition accuracy for the one word string can be improved.

Furthermore, because this configuration takes into account not only the probability of the phoneme string but also the probability of the word string, it prevents linguistically unnatural recognition results.
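As a rough illustration of how two low-reliability utterances might be combined, the following Python sketch scores each candidate word string by multiplying, frame by frame, the sum of the two acoustic probabilities by the language probability, and then picks the highest-scoring candidate. The per-frame product, the candidate strings, and all numbers are assumptions made for this example; a real decoder would search a state space rather than enumerate candidates, and the two utterances are assumed here to be aligned to the same T frames.

```python
from math import prod

def combined_reliability(pa1, pa2, pl):
    """Combined score for one candidate word string.

    pa1, pa2: per-frame acoustic probabilities from the first and second utterance.
    pl: per-frame language (word transition) probabilities.
    Assumes all three lists have the same length T.
    """
    return prod((a1 + a2) * l for a1, a2, l in zip(pa1, pa2, pl))

# Hypothetical candidates with T = 2 frames each: (P_A1 list, P_A2 list, P_L list).
candidates = {
    "red apple":   ([0.2, 0.5], [0.6, 0.3], [0.5, 0.4]),
    "bread apple": ([0.4, 0.5], [0.1, 0.3], [0.5, 0.1]),
}

scores = {w: combined_reliability(*probs) for w, probs in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Because the two acoustic probabilities are summed, the result is a score rather than a normalized probability; only the ranking of candidates matters for choosing the word string.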

A speech recognition method according to yet another aspect of the present disclosure includes:
receiving, via a microphone, a first utterance spoken by a speaker intending one word string,
the first utterance being composed of N phonemes (N is a natural number of 2 or more);
calculating reliabilities X1 of all word strings estimated for the first utterance,

where t1 is a number designating a frame constituting the first utterance,
T1 is the total number of frames constituting the first utterance,
P_A1(o_t1, s_t1 | s_{t1-1}) is the probability that, following the phoneme string corresponding to the state s_{t1-1} from the 1st frame to the (t1-1)-th frame of the first utterance, an arbitrary phoneme appears in the t1-th frame and a transition is made to the phoneme string corresponding to the state s_t1,
o_t1 is a physical quantity obtained from the first utterance and used to estimate the arbitrary phoneme,
the arbitrary phoneme denotes all types of phonemes, and
P_L1(s_t1, s_{t1-1}) is the probability that, following the word string corresponding to the state s_{t1-1} in the first utterance, an arbitrary word appears in the t1-th frame and a transition is made to the word string corresponding to the state s_t1;
determining whether the maximum value MaxX1 of the reliabilities X1 is equal to or greater than a threshold;
when the maximum value MaxX1 is less than the threshold,
extracting first word strings, estimated for the first utterance, that give the top M reliabilities X1 (M is a natural number of 2 or more);
outputting, through a loudspeaker, speech prompting the speaker to utter the one word string again;
receiving, via the microphone, a second utterance spoken again by the speaker intending the one word string;
calculating reliabilities X2 of all word strings estimated for the second utterance,

where t2 is a number designating a frame constituting the second utterance,
T2 is the total number of frames constituting the second utterance,
P_A2(o_t2, s_t2 | s_{t2-1}) is the probability that, following the phoneme string corresponding to the state s_{t2-1} from the 1st frame to the (t2-1)-th frame of the second utterance, an arbitrary phoneme appears in the t2-th frame and a transition is made to the phoneme string corresponding to the state s_t2,
o_t2 is a physical quantity obtained from the second utterance and used to estimate the arbitrary phoneme, and
P_L2(s_t2, s_{t2-1}) is the probability that, following the word string corresponding to the state s_{t2-1} in the second utterance, an arbitrary word appears in the t2-th frame and a transition is made to the word string corresponding to the state s_t2;
determining whether the maximum value MaxX2 of the reliabilities X2 is equal to or greater than the threshold;
when the maximum value MaxX2 is less than the threshold, extracting second word strings, estimated for the second utterance, that give the top M reliabilities X2; and
when the first word strings and the second word strings have a word string in common, recognizing the common word string as the one word string.

According to this configuration, the first utterance intending the one word string is divided into T frames, and the product of the phoneme-string probability P_A1(o_t, s_t | s_{t-1}) and the word-string probability P_L1(s_t, s_{t-1}), for the transition from the state s_{t-1} up to the (t-1)-th frame to the state s_t up to the t-th frame, is calculated as the reliability X1.

When the maximum value MaxX1 of the reliabilities X1 is lower than the threshold and the word string recognized from the first utterance therefore has low reliability, the first word strings having the top M reliabilities X1 are extracted, and a second utterance is obtained by asking the speaker to repeat.

When the maximum value MaxX2 of the reliabilities X2 of the word strings of the second utterance is also lower than the threshold and the word string of the second utterance also has low reliability, the second word strings having the top M reliabilities X2 are extracted. If the first word strings and the second word strings have a word string in common, the common word string is recognized as the one word string.

In this way, even if a low-reliability recognition result is obtained for the first utterance, this configuration does not discard that result but uses it when a low-reliability recognition result is also obtained for the second utterance. Consequently, even if asking the speaker to repeat does not yield a highly reliable recognition result, a word string recognized in both the first and second utterances is recognized as the one word string, so the recognition accuracy for the one word string can be improved.
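The top-M comparison can be sketched in a few lines of Python. This is a simplified illustration, not the method as implemented: the function names, the candidate word strings, their scores, the threshold, and M = 3 are all hypothetical, and the case where several word strings are common to both lists is simply returned as a list because this section does not say how to choose among them.

```python
def top_m(scores, m):
    """Word strings with the m highest reliabilities, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:m]

def common_word_strings(first_scores, second_scores, threshold, m):
    """Word strings that appear in the top-m lists of both utterances.

    Only the fallback path is sketched: if either utterance has a best
    reliability at or above the threshold, None is returned.
    """
    if max(first_scores.values()) >= threshold or max(second_scores.values()) >= threshold:
        return None
    second_top = set(top_m(second_scores, m))
    return [w for w in top_m(first_scores, m) if w in second_top]

# Hypothetical reliabilities X1 and X2 for candidate word strings.
x1 = {"a red apple": 0.12, "a bread apple": 0.10, "red ample": 0.08, "read apple": 0.02}
x2 = {"the red maple": 0.15, "a red apple": 0.11, "red staple": 0.09, "a bread apple": 0.01}

common = common_word_strings(x1, x2, threshold=0.5, m=3)
print(common)
```

Here neither utterance has a confident best hypothesis, yet one word string survives in both shortlists, which is the evidence this aspect relies on.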

The above speech recognition methods may be applied to a robot.

The present disclosure can be realized not only as a speech recognition method that executes the characteristic processing described above, but also as a speech recognition device or the like that includes a processing unit for executing the characteristic steps included in the method. It can also be realized as a computer program that causes a computer to execute each of those characteristic steps. Needless to say, such a computer program can be distributed via a computer-readable non-transitory recording medium such as a CD-ROM, or via a communication network such as the Internet.

Embodiments of the present disclosure are described below with reference to the drawings. Each embodiment described below shows one specific example of the present disclosure. The numerical values, shapes, components, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present disclosure. Among the components in the following embodiments, those not recited in the independent claims, which represent the broadest concept, are described as optional components. The contents of all the embodiments may also be combined.

(Embodiment 1)
FIG. 1 is a diagram illustrating an example of the overall configuration of the voice interaction system according to Embodiment 1. The voice interaction system shown in FIG. 1 includes a speech recognition device 100, a microphone 400, a speaker 410, a service application server 420, and a control device 430.

The speech recognition device 100 includes a CPU (central processing unit) 20 serving as a processor, and a memory 30. The CPU 20 includes a speech recognition unit 200, a word reliability determination unit 210, an intention interpretation unit 220, an action selection unit 230, a response generation unit 240, a speech synthesis unit 250, and an utterance extraction unit 260. The memory 30 includes a word dictionary 301 and a recognition result storage unit 302. The speech recognition unit 200 includes a phoneme estimation unit 201, a word estimation unit 202, and a phoneme appearance probability determination unit 203.

The word dictionary 301 stores combinations of the words that the speech recognition device 100 can recognize and their phoneme strings. FIG. 7 is a diagram illustrating an example of the data structure of the word dictionary. In the word dictionary, words such as "マンゴー" (mango) and "レンガ" (brick) are stored in association with the phoneme string of each word, such as "mango:" and "renga".

Returning to FIG. 1, the program that causes a computer to function as the speech recognition device 100 is stored in the memory 30 incorporated in the robot or terminal on which the speech recognition device 100 is mounted, and is executed by a processor such as the CPU 20. All elements constituting the speech recognition device 100 may be mounted on the same terminal, or they may be mounted individually on other terminals or servers connected via an arbitrary network such as optical fiber, a wireless link, or a public telephone line. In the latter case, the voice interaction processing may be realized by the speech recognition device 100 and the other terminal or server communicating with each other.

マイク４００は、例えば、指向性マイクで構成され、音声認識装置１００が実装された端末又はロボットに組み込まれている。また、マイク４００は、例えばハンドマイク、ピンマイク、又は卓上マイクなど任意の収音デバイスで構成されてもよい。この場合、マイク４００は、有線又は無線を介して音声認識装置１００が実装された端末に接続される。また、マイク４００は、スマートフォン又はタブレット端末などの収音及び通信機能を持つデバイスに搭載されたマイクで構成されてもよい。 The microphone 400 is constituted by, for example, a directional microphone, and is incorporated in a terminal or a robot on which the speech recognition apparatus 100 is mounted. Moreover, the microphone 400 may be configured by an arbitrary sound collection device such as a hand microphone, a pin microphone, or a table microphone. In this case, the microphone 400 is connected to a terminal on which the speech recognition apparatus 100 is mounted via wired or wireless. Moreover, the microphone 400 may be comprised with the microphone mounted in the device with sound collection and communication functions, such as a smart phone or a tablet terminal.

スピーカ４１０は、音声認識装置１００が実装された端末又はロボットに組み込まれてもよいし、音声認識装置１００が実装された端末又はロボットと、有線又は無線を介して接続されてもよい。また、スピーカ４１０は、スマートフォン又はタブレット端末などの集音及び通信機能を持つデバイスに搭載されたスピーカで構成されてもよい。 The speaker 410 may be incorporated in a terminal or robot on which the voice recognition device 100 is mounted, or may be connected to a terminal or robot on which the voice recognition device 100 is mounted via a wired or wireless connection. The speaker 410 may be configured by a speaker mounted on a device having sound collection and communication functions such as a smartphone or a tablet terminal.

The service application server 420 is a cloud server that provides users with a plurality of services, such as weather, storytelling, news, and games, via a network. For example, the service application server 420 acquires the speech recognition result produced by the speech recognition device 100 and determines the service to be executed according to that result. A service provided by the service application server 420 may be realized by a program that has a function of acquiring, via the network, the result of execution on the service application server 420, or it may be realized jointly by the service application server 420 and a program stored in the memory of the robot or terminal on which the speech recognition device 100 is mounted.

制御機器４３０は、有線又は無線によって音声認識装置１００と接続されたテレビ又は空調器等の機器で構成され、音声認識装置１００から音声の認識結果を受信して制御される機器である。 The control device 430 is configured by a device such as a television or an air conditioner connected to the speech recognition device 100 by wire or wireless, and is a device that is controlled by receiving a speech recognition result from the speech recognition device 100.

発話抽出部２６０は、マイク４００から出力された音声信号のうち発話中の音声信号を抽出して音素推定部２０１に出力する。ここで、発話抽出部２６０は、例えば所定音量以上の音声が一定期間以上継続した場合、発話が開始されたことを検出し、マイク４００から入力される音声信号の音素推定部２０１への出力を開始する。また、発話抽出部２６０は所定音量未満の音声が所定期間以上続いたことを検出した場合、音素推定部２０１への音声信号の出力を停止する。本実施の形態では、発話抽出部２６０は、一の単語を意図して発話者が発話した音声の音声信号を抽出するものとする。また、発話者は、言語獲得段階にある幼児とする。 The speech extraction unit 260 extracts a speech signal being spoken from the speech signal output from the microphone 400 and outputs the speech signal to the phoneme estimation unit 201. Here, the utterance extraction unit 260 detects that the utterance has been started, for example, when a sound of a predetermined volume or higher continues for a certain period of time, and outputs an audio signal input from the microphone 400 to the phoneme estimation unit 201. Start. In addition, when the speech extraction unit 260 detects that the voice having a volume lower than the predetermined volume has continued for a predetermined period or longer, the speech extraction unit 260 stops outputting the voice signal to the phoneme estimation unit 201. In the present embodiment, it is assumed that the utterance extraction unit 260 extracts a voice signal of a voice uttered by a speaker with the intention of one word. The speaker is an infant in the language acquisition stage.
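The start and stop conditions described above can be sketched as a small volume-based detector. This is only an illustrative sketch: the function name `extract_utterance`, the per-frame level representation, and the parameters `volume_threshold`, `start_frames`, and `stop_frames` are assumptions and do not appear in the source, which states only that speech at or above a predetermined volume continuing for a certain period starts the output and that speech below that volume continuing for a predetermined period stops it.

```python
from typing import Optional, Sequence, Tuple


def extract_utterance(
    frame_levels: Sequence[float],
    volume_threshold: float,
    start_frames: int,
    stop_frames: int,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first utterance, end exclusive.

    Hypothetical helper: an utterance starts once `start_frames` consecutive
    frames are at or above `volume_threshold`, and ends once `stop_frames`
    consecutive frames are below it.
    """
    loud_run = 0
    quiet_run = 0
    start = None
    for i, level in enumerate(frame_levels):
        if start is None:
            # Waiting for the utterance to begin.
            loud_run = loud_run + 1 if level >= volume_threshold else 0
            if loud_run >= start_frames:
                start = i - start_frames + 1
        else:
            # Utterance in progress; look for a sufficiently long quiet run.
            quiet_run = quiet_run + 1 if level < volume_threshold else 0
            if quiet_run >= stop_frames:
                return (start, i - stop_frames + 1)
    # The input ended while the utterance was still in progress, or none started.
    return (start, len(frame_levels)) if start is not None else None
```

Whether the frames that triggered detection are themselves forwarded to the phoneme estimation unit 201 is a design choice made here for illustration; the source does not specify it.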

The phoneme estimation unit 201 divides the speech signal input from the utterance extraction unit 260 into a plurality of speech sections of a predetermined time unit, and calculates, for each section, the appearance probability of every type of phoneme. A phoneme is the minimum unit of speech in a language and is represented by a symbol such as "a" or "i". "Every type of phoneme" refers to all phonemes used in speech. These phonemes are modeled by an acoustic model, for example an HMM (hidden Markov model).

音素の種類数は言語によっても異なるが、日本語であれば例えば４０程度である。ここで、音素推定部２０１は、ＨＭＭを用いて、連続する共通の音素区間を１つの音素区間として纏めることで、音素列を推定してもよい。そして、音素推定部２０１は、全音素区間における出現確率の積を最大とする音素の組み合わせを、発話者が発話した音素列として推定する。 The number of phoneme types varies depending on the language, but is about 40 for Japanese. Here, the phoneme estimation unit 201 may estimate a phoneme string by collecting continuous common phoneme segments as one phoneme segment using an HMM. Then, the phoneme estimation unit 201 estimates a phoneme combination that maximizes the product of appearance probabilities in all phoneme sections as a phoneme string uttered by the speaker.

単語推定部２０２は、音素推定部２０１により推定された音素列に対して最もマッチする単語を、単語辞書３０１から抽出し、抽出した単語を発話者が発話した単語として推定する。 The word estimation unit 202 extracts from the word dictionary 301 the word that most closely matches the phoneme string estimated by the phoneme estimation unit 201, and estimates the extracted word as a word spoken by the speaker.

図２は、二音素からなる発話において、音素毎に算出された出現確率の一例を示す図である。図３は、図２において第一音素目の音素と第二音素目の音素との組み合わせに対する出現確率の積を纏めた図である。 FIG. 2 is a diagram illustrating an example of the appearance probability calculated for each phoneme in an utterance composed of two phonemes. FIG. 3 is a diagram summarizing the products of appearance probabilities for combinations of phonemes of the first phoneme and phonemes of the second phoneme in FIG. 2.

For example, suppose that a word consisting of two phonemes is uttered and the phoneme appearance probabilities shown in FIG. 2 are obtained. In FIG. 2, the appearance probabilities of the phonemes "a" and "u" for the first phoneme are calculated as 0.4 and 0.5, respectively, and the appearance probabilities of the phonemes "i" and "e" for the second phoneme are calculated as 0.3 and 0.6, respectively.

In this case, four combinations of the first and second phonemes are obtained, "ai", "ae", "ui", and "ue", and the products of their appearance probabilities are 0.12, 0.24, 0.15, and 0.30, respectively.

Therefore, the combination that maximizes the product of the appearance probabilities of the first and second phonemes is "ue", whose product is 0.30. The word dictionary 301 is then searched with the phoneme string "ue", and the word that matches "ue" is output as the recognition result. The product of the appearance probabilities of the phonemes at this point, that is, 0.30 for "ue", is the reliability of the recognized word.
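The arithmetic of the FIG. 2 and FIG. 3 example can be reproduced with a few lines of code. The sketch below enumerates every combination by brute force, which is practical only for toy inputs; the function name `rank_phoneme_strings` and the list-of-dicts input format are assumptions made for illustration and are not taken from the source.

```python
from itertools import product
from math import prod
from typing import Dict, List, Tuple


def rank_phoneme_strings(
    position_probs: List[Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Rank all phoneme combinations by the product of their probabilities.

    `position_probs[k]` maps each candidate phoneme at position k to its
    appearance probability (hypothetical input format).
    """
    ranked = []
    for combo in product(*(p.items() for p in position_probs)):
        phonemes = "".join(ph for ph, _ in combo)
        score = prod(pr for _, pr in combo)
        ranked.append((phonemes, score))
    # Highest product first; the top entry is the estimated phoneme string.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Values from the two-phoneme example in the text.
fig2 = [{"a": 0.4, "u": 0.5}, {"i": 0.3, "e": 0.6}]
ranking = rank_phoneme_strings(fig2)
best_string, reliability = ranking[0]
```

With the values from the text, the ranking is "ue", "ae", "ui", "ai", and the top product serves as the word reliability.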

The word reliability determination unit 210 compares the reliability of the word recognized by the word estimation unit 202 (an example of the first value and the second value) with a predetermined threshold TH1 (an example of the first threshold). If the reliability of the word is less than the threshold TH1, the unit stores in the recognition result storage unit 302, as the first recognition result for the first utterance, a recognition result that contains the phoneme string of the recognized word and the appearance probability of each phoneme. In this case, the word reliability determination unit 210 outputs to the response generation unit 240 an instruction to generate speech prompting a re-utterance, so that the speaker utters the one word again.

When the speaker makes a second utterance intending the one word in response to this prompt and the word estimation unit 202 obtains a second recognition result, the word reliability determination unit 210 determines whether the reliability of the second recognition result is less than the threshold TH1.

When the word reliability determination unit 210 determines that the reliability of the second recognition result is less than the threshold TH1, the phoneme appearance probability determination unit 203 extracts, from each of the word recognized in the first recognition result and the word recognized in the second recognition result, the phonemes whose appearance probability is greater than or equal to a threshold TH2.

単語推定部２０２は、音素出現確率判定部２０３により抽出された音素列を含む単語を単語辞書３０１から抽出し、抽出結果に基づいて最終認識結果としての単語を決定する。 The word estimation unit 202 extracts a word including the phoneme string extracted by the phoneme appearance probability determination unit 203 from the word dictionary 301, and determines a word as a final recognition result based on the extraction result.

また、単語信頼度判定部２１０は、単語の信頼度が閾値ＴＨ１以上である場合、認識結果を意図解釈部２２０に出力する。 Moreover, the word reliability determination part 210 outputs a recognition result to the intention interpretation part 220, when the word reliability is more than threshold TH1.

The intention interpretation unit 220 estimates, from the recognition result, the type of response (for example, a backchannel response or an answer to a question) and the type of action (for example, shiritori, hide-and-seek, or television control). The intention interpretation unit 220 then outputs the estimated response type to the response generation unit 240 and the estimated action type to the action selection unit 230.

The action selection unit 230 determines, from the estimation result of the intention interpretation unit 220, the service to be executed or the control device 430 to be controlled. If the action selection unit 230 determines that a service is to be executed, it transmits a request to provide that service to the service application server 420. If the action selection unit 230 determines that the control device 430 is to be controlled, it outputs a control instruction to the control device 430 to be controlled.

When the response generation unit 240 acquires the estimated response type from the intention interpretation unit 220, it generates a response sentence corresponding to the estimation result. When the response generation unit 240 acquires from the word reliability determination unit 210 an instruction to generate speech prompting a re-utterance, it generates a response sentence that asks the speaker to say the one word again.

音声合成部２５０は、応答生成部２４０により生成された応答文を音声信号に変換し、スピーカ４１０に出力する。スピーカ４１０は、音声合成部２５０から出力された音声信号を音声に変換し、外部に出力する。 The voice synthesis unit 250 converts the response sentence generated by the response generation unit 240 into a voice signal and outputs the voice signal to the speaker 410. The speaker 410 converts the voice signal output from the voice synthesizer 250 into voice and outputs the voice to the outside.

図４は、実施の形態１における認識処理の一例を示すフローチャートである。まず、発話抽出部２６０は、マイク４００における音声入力の有無を判断する（ステップＳ１００）。音声入力が無いと判断された場合（ステップＳ１００でＮＯ）、音声入力が有りになるまでステップＳ１００の処理は繰り返される。 FIG. 4 is a flowchart illustrating an example of recognition processing in the first embodiment. First, the utterance extraction unit 260 determines whether or not there is a voice input in the microphone 400 (step S100). When it is determined that there is no voice input (NO in step S100), the process of step S100 is repeated until there is a voice input.

一方、音声入力が有りと判断された場合（ステップＳ１００でＹＥＳ）、発話抽出部２６０は、マイク４００から出力される音声信号から発話中の音声信号を抽出する（ステップＳ１０１）。 On the other hand, when it is determined that there is a voice input (YES in step S100), the utterance extraction unit 260 extracts a voice signal being uttered from the voice signal output from the microphone 400 (step S101).

Next, the speech recognition unit 200 performs speech recognition processing (step S102). Specifically, the phoneme estimation unit 201 divides the speech signal extracted by the utterance extraction unit 260 into a plurality of speech sections, generates a feature amount of the speech signal in each speech section, and estimates the phoneme of each speech section by collating the generated feature amount with the acoustic model. At this time, the phoneme estimation unit 201 calculates the appearance probability of the phonemes for each speech section and uses the HMM to merge consecutive speech sections of the same phoneme into one. For example, if the uttered speech consists of a first phoneme, a second phoneme, and a third phoneme, the phoneme estimation unit 201 calculates the appearance probability of every type of phoneme for each of the first, second, and third phonemes.

例えば、第一音素は、音素「ａ」の確率が「０．４」、音素「ｉ」の確率が「０．１」、音素「ｕ」の確率が「０．２」というように全種類の音素のそれぞれについて、第一音素の出現確率が計算される。第二音素及び第三音素についても、第一音素と同様にして、全種類の音素のそれぞれの出現確率が計算される。 For example, the first phoneme has all types such that the probability of the phoneme “a” is “0.4”, the probability of the phoneme “i” is “0.1”, and the probability of the phoneme “u” is “0.2”. For each of the phonemes, the appearance probability of the first phoneme is calculated. For the second phoneme and the third phoneme, the appearance probabilities of all types of phonemes are calculated in the same manner as the first phoneme.

そして、音素推定部２０１は、第一音素の出現確率、第二音素の出現確率、及び第三音素の出現確率の積を最大化する３つの音素の組み合わせを発話音声の音素列として推定する。 Then, the phoneme estimation unit 201 estimates a combination of three phonemes that maximizes the product of the first phoneme appearance probability, the second phoneme appearance probability, and the third phoneme appearance probability as a phoneme sequence of the utterance speech.

次に、単語推定部２０２は、メモリ３０に格納されている単語辞書３０１を参照し、音素推定部２０１により推定された音素列とマッチする単語を選択する。単語辞書３０１にマッチする単語がない場合、単語推定部２０２は、各音素の出現確率の積が次に大きい単語の音素列を音素推定部２０１に推定させる。そして、単語推定部２０２は、推定された音素列にマッチする単語を単語辞書３０１から検索する。このようにして、単語辞書３０１にマッチする単語が得られると、単語推定部２０２は、マッチした単語の音素列の出現確率の積をその単語の信頼度して採用すると共に、マッチした単語の音素列と、その音素列を構成する各音素の出現確率とを認識結果として、単語信頼度判定部２１０に出力する。 Next, the word estimation unit 202 refers to the word dictionary 301 stored in the memory 30 and selects a word that matches the phoneme string estimated by the phoneme estimation unit 201. If there is no matching word in the word dictionary 301, the word estimation unit 202 causes the phoneme estimation unit 201 to estimate a phoneme string of a word having the next highest product of the appearance probabilities of each phoneme. Then, the word estimation unit 202 searches the word dictionary 301 for a word that matches the estimated phoneme string. When a word that matches the word dictionary 301 is obtained in this way, the word estimation unit 202 adopts the product of the appearance probabilities of the phoneme string of the matched word as the reliability of the word, The phoneme string and the appearance probability of each phoneme constituting the phoneme string are output to the word reliability determination unit 210 as a recognition result.
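The dictionary lookup with fallback to the next-best phoneme string can be sketched as follows. This is a minimal sketch under stated assumptions: the function `match_word`, the pre-ranked candidate list, and the toy dictionary entries are hypothetical, and the word dictionary 301 is modelled as a plain mapping from phoneme string to word.

```python
from typing import Dict, List, Optional, Tuple


def match_word(
    ranked: List[Tuple[str, float]],
    dictionary: Dict[str, str],
) -> Optional[Tuple[str, str, float]]:
    """Return (word, phoneme_string, reliability) for the best dictionary match.

    `ranked` lists candidate phoneme strings in descending order of the product
    of their phoneme appearance probabilities. Candidates that are not in the
    dictionary are skipped, so the next-highest product is tried in turn.
    """
    for phonemes, score in ranked:
        word = dictionary.get(phonemes)
        if word is not None:
            # The product of appearance probabilities becomes the reliability.
            return (word, phonemes, score)
    return None  # No candidate matched (a case the text does not describe).


# Candidate ranking taken from the two-phoneme example; dictionary is made up.
ranked_example = [("ue", 0.30), ("ae", 0.24), ("ui", 0.15), ("ai", 0.12)]
toy_dictionary = {"ae": "word_ae", "ai": "word_ai"}  # hypothetical entries
result = match_word(ranked_example, toy_dictionary)
```

Here "ue" has the highest product but is absent from the toy dictionary, so the lookup falls through to "ae" and adopts 0.24 as the reliability.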

次に、単語信頼度判定部２１０は、認識された単語の信頼度が閾値ＴＨ１以上であるか否かを判断する（ステップＳ１０３）。単語の信頼度が閾値ＴＨ１以上であった場合（ステップＳ１０３でＹＥＳ）、単語信頼度判定部２１０は、認識結果記憶部３０２に第一認識結果が記憶されているか否かを判断する（ステップＳ１０４）。ここで、第一認識結果とは、ステップＳ１０１で得られた音声以前に発話された音声の認識結果であって、認識結果記憶部３０２に記憶されている認識結果のことを指す。 Next, the word reliability determination unit 210 determines whether or not the reliability of the recognized word is greater than or equal to the threshold value TH1 (step S103). When the word reliability is equal to or higher than the threshold TH1 (YES in step S103), the word reliability determination unit 210 determines whether or not the first recognition result is stored in the recognition result storage unit 302 (step S104). ). Here, the first recognition result is a recognition result of speech uttered before the speech obtained in step S101 and indicates a recognition result stored in the recognition result storage unit 302.

すなわち、前回の発話によって認識された単語の信頼度が閾値ＴＨ１未満であり、その発話の認識結果が認識結果記憶部３０２に記憶されている場合に、その認識結果が第一認識結果となる。 That is, when the reliability of the word recognized by the previous utterance is less than the threshold TH1, and the recognition result of the utterance is stored in the recognition result storage unit 302, the recognition result becomes the first recognition result.

If the first recognition result is stored (YES in step S104), the word reliability determination unit 210 deletes the first recognition result stored in the recognition result storage unit 302 (step S105) and outputs the recognition result to the intention interpretation unit 220. Next, the intention interpretation unit 220 performs intention understanding processing based on the recognition result (step S106).

On the other hand, if the first recognition result is not stored in the recognition result storage unit 302 (NO in step S104), the process proceeds to step S106. In step S106, the intention interpretation unit 220 estimates the response type and the action type from the recognition result. In step S107, the response generation unit 240 generates a response sentence corresponding to the estimated response type. Also in step S107, the action selection unit 230 determines, according to the estimated action type, the service to be executed or the control device 430 to be controlled. If a service is determined, the action selection unit 230 transmits a request to provide the service to the service application server 420; if a control device 430 is determined, it outputs a control instruction to that control device 430.

On the other hand, if the reliability of the recognized word is less than the threshold TH1 (NO in step S103), the word reliability determination unit 210 refers to the recognition result storage unit 302 and determines whether the first recognition result is stored (step S110). If the first recognition result is not stored (NO in step S110), the word reliability determination unit 210 stores the phoneme string of the word estimated by the word estimation unit 202 and the appearance probability of each phoneme in the recognition result storage unit 302 as the recognition result of the first utterance (the first recognition result) (step S109), and outputs to the response generation unit 240 an instruction to generate speech prompting a re-utterance.

Next, the response generation unit 240 generates a response sentence that asks the speaker to repeat, such as "Can you say that again slowly?", has the speech synthesis unit 250 generate a speech signal of the generated response sentence, and has the speech of that signal output from the speaker 410 (step S108). Once the speech of the response sentence has been output in step S108, the speech recognition device 100 enters a standby state, waiting for the speaker to utter the one word again, and the process returns to step S100.

この聞き返しにより、発話者により第二発話が行われ、ステップＳ１００〜ステップＳ１０２の処理により、第一発話と同様、第二発話に対する第二認識結果が得られる。そして、第二認識結果の信頼度が閾値ＴＨ１未満であれば、ステップＳ１０３でＮＯと判定され、処理がＳ１１０に進む。 As a result, the second utterance is made by the speaker, and the second recognition result for the second utterance is obtained in the same manner as the first utterance by the processing in steps S100 to S102. And if the reliability of a 2nd recognition result is less than threshold value TH1, it will determine with NO by step S103, and a process will progress to S110.

一方、第二認識結果の信頼度が閾値ＴＨ１以上であれば（ステップＳ１０３でＹＥＳ）、第二認識結果が発話者が意図する一の単語として決定され、ステップＳ１０５〜ステップＳ１０７の処理が実行される。 On the other hand, if the reliability of the second recognition result is equal to or higher than the threshold value TH1 (YES in step S103), the second recognition result is determined as one word intended by the speaker, and the processing from step S105 to step S107 is executed. The
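The branching around steps S103, S104, S105, S109, and S110 amounts to a small state machine over the recognition result storage unit 302. The sketch below is illustrative only: the names `RecognitionResult`, `ResultStore`, and `decide_next_step`, and the returned action labels, are assumptions introduced here, and what happens to the stored result after the combining branch is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecognitionResult:
    word: str
    phonemes: List[str]
    probs: List[float]
    reliability: float  # product of the phoneme appearance probabilities


@dataclass
class ResultStore:
    """Hypothetical stand-in for the recognition result storage unit 302."""
    first: Optional[RecognitionResult] = None


def decide_next_step(result: RecognitionResult, store: ResultStore, th1: float) -> str:
    """Return which branch of the flowchart applies to `result`."""
    if result.reliability >= th1:   # S103: YES
        store.first = None          # S104/S105: discard any stored first result
        return "interpret"          # continue with S106 and S107
    if store.first is None:         # S110: NO
        store.first = result        # S109: keep the result as the first recognition result
        return "ask_again"          # S108: prompt the speaker to repeat
    return "combine"                # S110: YES, continue with S111 onward
```

A low-reliability first utterance yields "ask_again", a low-reliability second utterance yields "combine", and any sufficiently reliable utterance clears the store and yields "interpret".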

If the first recognition result is stored in the recognition result storage unit 302 in step S110 (YES in step S110), the phoneme appearance probability determination unit 203 extracts the phonemes whose appearance probability is greater than or equal to a predetermined threshold TH2 (an example of the second threshold) from each of the first recognition result stored in the recognition result storage unit 302 and the second recognition result obtained in step S102 for the speaker's re-utterance (step S111).

Next, the word estimation unit 202 refers to the word dictionary 301 and extracts, as recognition candidate words, the words that contain the phonemes of the first recognition result's phoneme string whose appearance probability is greater than or equal to the threshold TH2 (step S112). The word estimation unit 202 then narrows down the list of recognition candidate words extracted in step S112 to the words that also contain the phonemes of the second recognition result's phoneme string whose appearance probability is greater than or equal to the threshold TH2 (step S113).

図５は、実施の形態１における対話の一例を示す図である。図５において、ロボットは、音声認識装置１００が実装されたロボットを指し、ロボットの後に付された数字はロボットの発話順序を示す。また、幼児とは、ロボットと対話する幼児を指し、幼児の後に付された数字は発話順序を示す。 FIG. 5 is a diagram illustrating an example of the dialogue in the first embodiment. In FIG. 5, the robot refers to the robot on which the speech recognition apparatus 100 is mounted, and the numbers attached after the robot indicate the utterance order of the robot. The infant refers to an infant who interacts with the robot, and the numbers attached after the infant indicate the utterance order.

First, the robot says to the infant, "What kind of fruit do you like?" (Robot 1), and the infant answers "リンゴ" (apple) (Infant 1). Here, however, the reliability of the word recognized for the utterance "リンゴ" (Infant 1) was low, so the robot asks the infant to repeat it in step S108.

In response, the infant says "リンゴ" again (Infant 2), but the reliability of this re-utterance was also low. The processing of the speech recognition device 100 in this case is described below with reference to FIGS. 6, 7, and 8.

FIG. 6 is a diagram illustrating an example of the first recognition result and the second recognition result for the dialogue example of FIG. 5. As shown in FIG. 6, in the first recognition result the word "マンゴー" (mango) was recognized for the infant's utterance "リンゴ" (apple), and the reliability of this word was less than the threshold TH1. The first recognition result is therefore stored in the recognition result storage unit 302. As shown in FIG. 6, the first recognition result consists of the recognized word "マンゴー", the recognized phoneme string "m", ..., "o:", and the phoneme appearance probabilities 0.4, ..., 0.6.

Because the reliability of the first recognition result was low, the robot asked "Can you say that again slowly?" and the infant said "リンゴ" (apple) again. This time a second recognition result recognizing "リンドウ" (gentian) was obtained, and its reliability was also below the threshold TH1. As shown in FIG. 6, the second recognition result consists of the recognized word "リンドウ", the recognized phoneme string "r", ..., "o:", and the phoneme appearance probabilities 0.9, ..., 0.5.

ここで、音素の出現確率の閾値ＴＨ２を０．７とする。この場合、音素出現確率判定部２０３は、第一認識結果から、音素の出現確率が０．７以上である音素「ｎ」と音素「ｇ」とを抽出する。また、音素出現確率判定部２０３は、第二認識結果から、音素の出現確率が０．７以上である音素「ｒ」と音素「ｉ」と音素「ｎ」とを抽出する。 Here, the threshold TH2 of the phoneme appearance probability is set to 0.7. In this case, the phoneme appearance probability determination unit 203 extracts a phoneme “n” and a phoneme “g” having a phoneme appearance probability of 0.7 or more from the first recognition result. Moreover, the phoneme appearance probability determination unit 203 extracts a phoneme “r”, a phoneme “i”, and a phoneme “n” having a phoneme appearance probability of 0.7 or more from the second recognition result.

次に、単語推定部２０２は、単語辞書３０１を参照し、第一認識結果から抽出された連続する「ｎ」と「ｇ」との音素列を含む単語を認識候補単語として抽出する。図７に例示された単語のうち、連続する音素列「ｎｇ」を含む単語は、「マンゴー」、「レンガ」、「リンゴ」、及び「リンゴジュース」である。 Next, the word estimation unit 202 refers to the word dictionary 301 and extracts, as recognition candidate words, words including continuous “n” and “g” phoneme sequences extracted from the first recognition result. Among the words illustrated in FIG. 7, words including the continuous phoneme string “ng” are “mango”, “brick”, “apple”, and “apple juice”.

そのため、単語推定部２０２は、図８に示すように「マンゴー」、「レンガ」、「リンゴ」、及び「リンゴジュース」を認識候補単語として抽出する。図８は、第一認識結果から抽出された認識候補単語の一例を示す図である。 Therefore, the word estimation unit 202 extracts “mango”, “brick”, “apple”, and “apple juice” as recognition candidate words as shown in FIG. FIG. 8 is a diagram illustrating an example of recognition candidate words extracted from the first recognition result.

The word estimation unit 202 then narrows down the recognition candidate words by extracting, from the candidates obtained so far, the words that contain the consecutive phoneme string "rin" extracted from the second recognition result. Among the recognition candidate words illustrated in FIG. 8, those containing the consecutive phoneme string "rin" are "ringo" and "ringo juice".

In step S113, the word estimation unit 202 therefore finally narrows the recognition candidate words down to "ringo" and "ringo juice".
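The two-stage narrowing described above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the dictionary contents, the romanized phoneme spellings, the first recognition result, and every probability other than the 0.9 and 0.5 quoted for the second result are assumptions, as are the names `confident_phonemes`, `contains_run`, and `narrow`. It also assumes, as in this example, that the phonemes at or above TH2 are adjacent in the recognized string.

```python
from typing import Dict, List, Sequence, Tuple

TH2 = 0.7  # phoneme appearance probability threshold from the text


def confident_phonemes(result: Sequence[Tuple[str, float]], th2: float) -> List[str]:
    """Keep, in order, the phonemes whose appearance probability is >= th2."""
    return [phoneme for phoneme, prob in result if prob >= th2]


def contains_run(word: Sequence[str], run: Sequence[str]) -> bool:
    """True if `run` occurs as a contiguous phoneme sequence inside `word`."""
    n = len(run)
    return any(list(word[i:i + n]) == list(run) for i in range(len(word) - n + 1))


def narrow(candidates: Dict[str, List[str]], run: Sequence[str]) -> Dict[str, List[str]]:
    """Keep only the candidate words that contain the confident phoneme run."""
    return {w: ph for w, ph in candidates.items() if contains_run(ph, run)}


# Hypothetical stand-in for the word dictionary 301 (word -> phoneme string).
dictionary = {
    "mango:": ["m", "a", "n", "g", "o:"],
    "renga": ["r", "e", "n", "g", "a"],
    "ringo": ["r", "i", "n", "g", "o"],
    "ringoju:su": ["r", "i", "n", "g", "o", "j", "u:", "s", "u"],
    "rindou": ["r", "i", "n", "d", "o:"],
    "rumba": ["r", "u", "n", "b", "a"],
}

# Hypothetical recognition results as (phoneme, appearance probability) pairs.
first_result = [("m", 0.2), ("a", 0.3), ("n", 0.8), ("g", 0.9), ("o:", 0.4)]
second_result = [("r", 0.9), ("i", 0.8), ("n", 0.75), ("d", 0.4), ("o:", 0.5)]

after_first = narrow(dictionary, confident_phonemes(first_result, TH2))    # uses "n", "g"
after_second = narrow(after_first, confident_phonemes(second_result, TH2))  # uses "r", "i", "n"
print(sorted(after_second))
```

With these assumed inputs, the first pass keeps the four words containing "ng" and the second pass leaves "ringo" and "ringoju:su", which matches the outcome described for step S113.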

Suppose the threshold TH3 is 3 in step S115 of FIG. 4. Because two recognition candidate words finally remain, the word estimation unit 202 determines YES in step S115. In step S116, the word estimation unit 202 causes the speech synthesis unit 250 to generate speech signals of confirmation utterances that check the recognition candidate words one at a time, such as "Is it ringo (apple)?" and "Is it ringo juice (apple juice)?", and outputs them from the speaker 410.

In response to a confirmation utterance, the speaker makes, for example, an affirmative utterance (such as "yes") or a negative utterance (such as "no"). When the word estimation unit 202 recognizes an affirmative utterance in response to a confirmation utterance, it recognizes the word corresponding to that confirmation utterance as the one word the speaker intended. When it recognizes a negative utterance, it makes the confirmation utterance for the next recognition candidate word.

FIG. 9 shows another example of the process of narrowing down recognition candidate words from the first and second recognition results in Embodiment 1. The example of FIG. 9 shows a narrowing method for the case where the phonemes at or above the threshold TH2 are not consecutive in the first and second recognition results.

In FIG. 9, the dialogue example is the same as in FIG. 5. In the example of FIG. 9, the first recognition result recognized the word "rumba" for the utterance "ringo", and the second recognition result recognized the word "kinako" (黄粉, roasted soybean flour) for the repeated utterance "ringo". Because the reliability of both the first and second recognition results was less than the threshold TH1 = 0.7, the recognition candidate words are narrowed down using the words "rumba" and "kinako".

As shown in FIG. 9, the phonemes at or above the threshold TH2 = 0.7 in the first recognition result are "r" and "n", with "r" preceding "n". The phonemes at or above the threshold TH2 = 0.7 in the second recognition result are "i" and "o", with "i" preceding "o".

In the example of FIG. 9, the word estimation unit 202 therefore extracts from the word dictionary 301, as recognition candidate words, the words in which phonemes are arranged in the order "r" → "n", regardless of whether other phonemes lie between "r" and "n". Next, the phoneme appearance probability determination unit 203 extracts, from those recognition candidate words, the words in which phonemes are arranged in the order "i" → "o", regardless of whether other phonemes lie between "i" and "o", and thereby narrows the recognition candidate words further.
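The order-only matching used in the FIG. 9 case amounts to a subsequence test, which might be sketched as below. The dictionary and its romanized phoneme strings are hypothetical, and `in_order` is an illustrative name rather than anything defined in the source.

```python
from typing import Dict, List, Sequence


def in_order(word: Sequence[str], phonemes: Sequence[str]) -> bool:
    """True if `phonemes` appear in `word` in the given order, gaps allowed."""
    remaining = iter(word)
    # `p in remaining` advances the iterator, so each match must come after the previous one.
    return all(p in remaining for p in phonemes)


# Hypothetical stand-in for the word dictionary 301 (word -> phoneme string).
dictionary: Dict[str, List[str]] = {
    "ringo": ["r", "i", "n", "g", "o"],
    "rumba": ["r", "u", "n", "b", "a"],
    "kinako": ["k", "i", "n", "a", "k", "o"],
    "renga": ["r", "e", "n", "g", "a"],
    "rindou": ["r", "i", "n", "d", "o:"],
}

# First recognition result: "r" then "n"; second recognition result: "i" then "o".
step1 = [w for w, ph in dictionary.items() if in_order(ph, ["r", "n"])]
step2 = [w for w in step1 if in_order(dictionary[w], ["i", "o"])]
print(step1, step2)
```

In this assumed dictionary the long vowel "o:" is treated as a different phoneme from "o", so only "ringo" survives both filters.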

Returning to FIG. 4: when the recognition candidate words have been narrowed down to one in step S114 (YES in step S114), the word estimation unit 202 determines the remaining word as the recognition result and moves the process to step S105, after which the processing from step S105 onward is executed.

When the recognition candidate words could not be narrowed down to one (NO in step S114), the phoneme appearance probability determination unit 203 determines whether they have been narrowed down to at least two and at most the threshold TH3 (step S115). If the number of narrowed-down recognition candidate words is at least two and at most TH3 (YES in step S115), the word estimation unit 202 instructs the speech synthesis unit 250 to make confirmation utterances that check the candidates with the speaker one at a time (step S116). If, for example, "ringo" is one of the narrowed-down recognition candidate words, the confirmation utterance may be "Did you say ringo (apple)?".

When the speaker responds to a confirmation utterance with an utterance meaning affirmation, such as "yes" or "that's right", the word estimation unit 202 fixes the affirmed recognition candidate word as the recognition result. When the recognition result has been fixed in step S117 (YES in step S117), the process moves to S105 and the processing from S105 onward is executed.

When the recognition candidate words could not be narrowed down to at least two and at most the threshold TH3 (NO in step S115), the process moves to step S109, and the word estimation unit 202 stores the second recognition result in the recognition result storage unit 302 of the memory 30. If the same recognition result already exists from the past, the new recognition result overwrites it. At this time, the word estimation unit 202 may include all of the narrowed-down recognition candidate words in the second recognition result when storing it in the recognition result storage unit 302.

When no affirmative utterance is made for any of the recognition candidate words in step S116 and the recognition result is not fixed (NO in S117), the phoneme appearance probability determination unit 203 abandons recognition and ends the process.
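The branching of steps S114 to S117 can be summarized in a short sketch. The function name `decide`, the returned status strings, and the `confirm` callback (standing in for the confirmation utterance and the recognition of the yes/no reply) are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple


def decide(candidates: List[str], th3: int,
           confirm: Callable[[str], bool]) -> Tuple[str, Optional[str]]:
    """Return (status, word) following the S114-S117 branching described above."""
    if len(candidates) == 1:                 # S114: YES, a single candidate remains
        return ("recognized", candidates[0])
    if 2 <= len(candidates) <= th3:          # S115: YES, few enough to confirm
        for word in candidates:              # S116: confirm candidates one at a time
            if confirm(word):                # S117: YES, the speaker affirmed
                return ("recognized", word)
        return ("gave_up", None)             # S117: NO, every candidate was denied
    return ("ask_again", None)               # S115: NO, store the result (S109) and ask again


# Example: the speaker denies "ringo" and affirms "ringo juice".
status, word = decide(["ringo", "ringo juice"], th3=3,
                      confirm=lambda w: w == "ringo juice")
print(status, word)
```

Passing the confirmation step in as a callback keeps the sketch independent of any particular speech synthesis or recognition interface.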

Thus, with the speech recognition device 100 of Embodiment 1, even when a low-reliability recognition result is obtained for the first utterance, that result is not discarded; it is used when a low-reliability recognition result is also obtained for the second utterance. Consequently, even if asking again does not yield a highly reliable recognition result, the one word is recognized using the highly reliable phonemes among the phoneme string contained in the first recognition result and the phoneme string contained in the second recognition result. As a result, the recognition accuracy for the one word can be improved.

If the recognition result could not be narrowed down uniquely from the first and second recognition results, that is, if NO was determined in step S115 and the second recognition result was stored in the recognition result storage unit 302 (step S109), the speech recognition device 100 may obtain a third recognition result by asking again. If the reliability of the third recognition result is less than the threshold TH1, the phoneme appearance probability determination unit 203 may narrow down the candidates using the first, second, and third recognition results. In that case, it may take the recognition candidate words already narrowed down by the first and second recognition results and narrow them to the words containing those phonemes, in the phoneme string of the third recognition result, whose appearance probability is at or above the threshold TH2. If the number of recognition candidate words still does not fall to TH3 or below, the phoneme appearance probability determination unit 203 may keep asking again until it does.

(Embodiment 2)
FIG. 10 shows an example of the overall configuration of the voice interaction system in Embodiment 2. FIG. 10 differs from FIG. 1 in that the word estimation unit 202, the phoneme appearance probability determination unit 203, and the word reliability determination unit 210 are replaced by a sentence estimation unit 1202, a phoneme appearance probability synthesis unit 1203, and a sentence reliability determination unit 1210, respectively.

The speech recognition unit 200 of Embodiment 1 is configured to recognize only a single word as speech, whereas the speech recognition unit 200 of Embodiment 2 is configured to recognize a sentence (word string) composed of arbitrary words.

The phoneme estimation unit 201 estimates a phoneme string using a hidden Markov model (HMM), and the sentence estimation unit 1202 estimates a sentence (word string) using a finite state grammar or an n-gram.

Combining the HMM with the finite state grammar or n-gram yields a search space formed as a directed graph in which multiple phonemes are connected in a network. Speech recognition therefore reduces to a path search problem on this network: the process finds the path that best matches the input speech signal and takes the word string corresponding to that path as the recognition result. Specifically, the speech recognition process obtains the word string W(S) that maximizes the product of the phoneme and word appearance probabilities in Equation (2).

FIG. 11 shows an example of a speech signal divided into multiple frames. As shown in FIG. 11, a frame is a segment obtained by dividing the input speech signal at fixed time intervals, for example 25 msec. o_t denotes the feature vector of the t-th frame. A feature vector is an example of a physical quantity used to estimate phonemes and is obtained from the volume of the speech signal. T denotes the length of the input speech signal expressed as a number of frames. Mel-frequency cepstral coefficients (MFCC), for example, can be used as the feature vector. s_t denotes the state when processing has reached the t-th frame.
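As a rough illustration of the framing step, the sketch below cuts a list of samples into 25 msec frames and reports T. The 16 kHz sampling rate, the non-overlapping frames, the choice to keep a shorter final frame, and the name `split_into_frames` are all assumptions; the source only states that the signal is divided at fixed intervals such as 25 msec.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate_hz: int,
                      frame_ms: int = 25) -> List[Sequence[float]]:
    """Cut `samples` into consecutive, non-overlapping frames of `frame_ms` milliseconds."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


# 1000 dummy samples at an assumed 16 kHz rate -> 400 samples per 25 msec frame.
frames = split_into_frames([0.0] * 1000, sample_rate_hz=16000)
T = len(frames)  # length of the signal expressed as a number of frames
print(T, [len(f) for f in frames])
```

A feature vector o_t (for example MFCCs) would then be computed per frame; that step is omitted here because the source does not specify it.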

In FIG. 11, the right-pointing arrow 1101 represents the state s_t. For the phoneme string, either "kykyo:o:nno" or "kyo:no" is estimated in state s_t. Which of the two is obtained depends on the acoustic model. If the phoneme estimation unit 201 uses an acoustic model that merges consecutive identical phonemes, the estimation result for state s_t is the latter. For simplicity, the following description uses an acoustic model with one phoneme per frame.
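The relation between the two phoneme strings can be shown with a small sketch that merges consecutive identical per-frame phonemes. The per-frame segmentation of "kykyo:o:nno" into seven frames is an assumption read off the string, and `collapse_repeats` is an illustrative name.

```python
from itertools import groupby
from typing import List, Sequence


def collapse_repeats(frame_phonemes: Sequence[str]) -> List[str]:
    """Merge runs of identical consecutive phonemes into a single phoneme."""
    return [phoneme for phoneme, _ in groupby(frame_phonemes)]


# Assumed per-frame estimates corresponding to "kykyo:o:nno".
per_frame = ["ky", "ky", "o:", "o:", "n", "n", "o"]
print("".join(collapse_repeats(per_frame)))
```

Only adjacent duplicates are merged, so a phoneme that recurs later in the utterance is kept.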

For the word string, 今日の ("kyo: no", "today's") is estimated in state s_t. Accordingly, P_A(o_t, s_t | s_{t-1}) denotes the probability of transitioning from the phoneme string corresponding to state s_{t-1} to the phoneme string corresponding to state s_t (the appearance probability of the phoneme string). P_L(s_t, s_{t-1}) denotes the language model probability of transitioning from the word string corresponding to state s_{t-1} to the word string corresponding to state s_t (the appearance probability of the word string). The word string appearance probability P_L(s_t, s_{t-1}) is applied when states s_{t-1} and s_t lie on a word boundary, and is 1 elsewhere. W(S) denotes the word string corresponding to the state transition process S, that is, to the states s_t.

The word string finally estimated for the speech signal of the input utterance corresponds to the phoneme string from the 1st frame to the T-th frame. The phoneme string is estimated in order from the front: 1st frame → 2nd frame → ... → T-th frame. When an utterance occurs, the phoneme estimation unit 201 first estimates as many phoneme strings as can be estimated for the speech signal of the utterance. Besides the phoneme string for the whole utterance, the estimable phoneme strings include those running from the start of the utterance to a point partway through it: the 1st frame, the 1st through 2nd frames, the 1st through 3rd frames, and so on.

Next, the sentence estimation unit 1202 assigns every assignable word to the estimated phoneme strings. It then multiplies the appearance probability of each estimated phoneme string by the appearance probability of the assigned words, and finally estimates as the word string the combination of phoneme string and words that yields the maximum value. Here, the product of the appearance probability of an estimated phoneme string and the appearance probability of the words assigned to it indicates the reliability of the word string composed of that phoneme string and those words. A concrete description follows.

When 今日の天気 ("kyo: no tenki", "today's weather") is uttered, the phoneme estimation unit 201 estimates phoneme strings in order, from state s_1, that is, the phoneme string of the 1st frame (a single phoneme in this case), up to state s_9, which covers the whole utterance (here, the 1st frame through the T = 9th frame), and calculates the appearance probability of each estimated phoneme string.

If the phoneme string of state s_1 is estimated as "ky", the phoneme string of state s_2, that is, up to the 2nd frame, is estimated as, for example, "kyo:". In this case, the appearance probability P_A(o_2, s_2 | s_1) of the phoneme string up to the 2nd frame represents the probability that the phoneme "o:" appears after the phoneme "ky".

The candidates for the phoneme string of state s_2 are not limited to "kyo:"; there are as many candidates as there are phoneme types. The appearance probability of each phoneme string, however, varies with the characteristics of the speech actually uttered. Because 今日の天気 ("kyo: no tenki") is uttered here, for state s_2 the appearance probability P_A of the phoneme string "kyo:" is higher than that of the phoneme string "kyu:". Likewise, for state s_9 the appearance probability P_A of the phoneme string "kyo:notenki" is higher than that of the phoneme string "kyo:nodenchi".

The sentence estimation unit 1202 first assigns words to the phoneme strings estimated by the phoneme estimation unit 201. For example, if the phoneme string of state s_9 is estimated as "kyo:notenki", word strings such as 今日の天気 ("today's weather") or 京の天気 ("the weather in Kyo") are assigned. Next, for each assigned word, the sentence estimation unit 1202 calculates the word string appearance probability P_L(s_t, s_{t-1}) using word appearance probabilities from a language model such as an n-gram. For example, if the sentence estimation unit 1202 uses a 2-gram language model, the word appearance probability P_L(s_t, s_{t-1}) for 今日の represents the probability that の ("no") appears after 今日 ("kyo:", today), and that for 京の represents the probability that の appears after 京 ("kyo:").

These word appearance probabilities are stored in the word dictionary 301. For the phoneme string "kyo:notenki" of state s_9, if the word appearance probability of 今日の is greater than that of 京の, then the word appearance probability P_L(s_t, s_{t-1}) for 今日の天気 is greater than P_L(s_t, s_{t-1}) for 京の天気. A 2-gram was used as the example here, but word appearance probabilities are calculated in the same way for any n-gram (n being a natural number).

The sentence reliability determination unit 1210 multiplies the phoneme string appearance probability P_A(o_t, s_t | s_{t-1}) estimated by the phoneme estimation unit 201 by the appearance probabilities P_L(s_t, s_{t-1}) of the multiple word strings that the sentence estimation unit 1202 assigned to each of the estimated phoneme strings, and thereby calculates the reliability of each word string. The sentence reliability determination unit 1210 then recognizes the word string with the highest reliability as the final word string. That is, the sentence estimation unit 1202 recognizes W(S) in Equation (2) as the final word string.
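A minimal sketch of this selection follows, assuming the reliability is simply the product of per-frame phoneme probabilities and the word probabilities along one hypothesis. All numeric values are invented for illustration, and `reliability` is a hypothetical helper, not a name from the source.

```python
from math import prod
from typing import Dict, List


def reliability(phoneme_probs: List[float], word_probs: List[float]) -> float:
    """Product of phoneme appearance probabilities (P_A) and word appearance probabilities (P_L)."""
    return prod(phoneme_probs) * prod(word_probs)


# Assumed P_A values for the nine frames of "kyo:notenki" (one phoneme per frame).
p_a = [0.9, 0.8, 0.9, 0.7, 0.8, 0.9, 0.8, 0.7, 0.9]

# Assumed P_L values at the word boundaries of each competing word string.
p_l: Dict[str, List[float]] = {
    "今日の天気": [0.6, 0.5],  # "today's weather"
    "京の天気": [0.1, 0.5],    # "the weather in Kyo"
}

scores = {words: reliability(p_a, probs) for words, probs in p_l.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Because both hypotheses share the same phoneme string here, the language model term alone decides between them, which mirrors the 今日の versus 京の comparison in the text.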

The phoneme appearance probability synthesis unit 1203 combines the appearance probabilities of each phoneme by taking the sum of the appearance probability of each phoneme in the first utterance and the appearance probability of each phoneme in the second utterance. Once the appearance probabilities have been combined, the sentence estimation unit 1202 uses them to calculate the reliabilities of multiple word strings by the same method as that used for the first utterance, and takes the word string with the highest reliability as the final recognition result. That is, the sentence estimation unit 1202 takes the word string W(S) in Equation (3) as the final recognition result.

Here, the first utterance is not a response to a re-ask; it is a response to a question from the speech recognition device 100, or an utterance by which the user addresses the speech recognition device 100. The second utterance is a response to a re-ask, made by the speaker with the same intent as the first utterance.

In Equation (3), P_A1 denotes the appearance probability of the phoneme string of the first utterance, and P_A2 denotes the appearance probability of the phoneme string of the second utterance. The sum of the per-phoneme appearance probabilities of the first and second utterances may instead be a weighted sum that reflects the reliability of each utterance. For example, if the reliability of the first utterance is α and that of the second utterance is β, the sum may be the value obtained by multiplying the appearance probability of each phoneme of the first utterance by the weight α/(α+β), multiplying the appearance probability of each phoneme of the second utterance by the weight β/(α+β), and adding the two.
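The weighted variant can be sketched as follows. The sketch assumes the two utterances have already been aligned so that position i in each list refers to the same phoneme, which the source does not spell out; the function name `weighted_sum` and the sample values are likewise illustrative.

```python
from typing import List


def weighted_sum(p_first: List[float], p_second: List[float],
                 alpha: float, beta: float) -> List[float]:
    """Combine per-phoneme probabilities with weights alpha/(alpha+beta) and beta/(alpha+beta)."""
    if len(p_first) != len(p_second):
        raise ValueError("phoneme probability lists must be aligned")
    w1 = alpha / (alpha + beta)
    w2 = beta / (alpha + beta)
    return [w1 * a + w2 * b for a, b in zip(p_first, p_second)]


# Assumed reliabilities: the second utterance is twice as reliable as the first.
combined = weighted_sum([0.3, 0.9], [0.9, 0.6], alpha=0.2, beta=0.4)
print(combined)
```

Since the two weights sum to 1, each combined value stays between the two inputs, unlike the plain sum, which can exceed 1.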

The sentence reliability determination unit 1210 determines whether the reliability of the recognition result for the first utterance estimated by the sentence estimation unit 1202 (the product of the phoneme string appearance probability and the word string appearance probability) is at or above the threshold TH1. If the reliability is less than TH1, the sentence reliability determination unit 1210 stores the recognition result for the first utterance in the recognition result storage unit 302 as the first recognition result and asks again. The first recognition result contains the information needed to estimate a word string: for example, the recognized word string, the phoneme string corresponding to that word string, and the appearance probability of each phoneme in that phoneme string.

FIG. 12 is a flowchart showing an example of the recognition process in Embodiment 2. The processing in steps S200 and S201 is the same as that in steps S100 and S101 shown in FIG. 4.

The speech recognition unit 200 performs the speech recognition process (step S202). Specifically, as in Embodiment 1, the phoneme estimation unit 201 estimates the phoneme of each speech section using an acoustic model. The sentence estimation unit 1202 assigns word strings registered in the word dictionary 301 to the phoneme strings estimated by the phoneme estimation unit 201. In doing so, it assigns every assignable word string to each of the phoneme strings estimated by the phoneme estimation unit 201, obtaining one or more word string assignments for each estimated phoneme string. The sentence estimation unit 1202 then outputs, as the recognition result, the word string that maximizes the product of the phoneme string appearance probability and the appearance probability of the assigned word string, and outputs the maximum product to the sentence reliability determination unit 1210 as the reliability of that word string.

Next, the sentence reliability determination unit 1210 determines whether the reliability of the word string recognized by the sentence estimation unit 1202 is at or above the threshold TH1 (step S203). If the reliability of the sentence is at or above TH1 (YES in step S203), the process proceeds to step S204. Steps S204 to S207 are the same as steps S104 to S107 shown in FIG. 4.

If the reliability of the word string recognized by the sentence estimation unit 1202 is less than the threshold TH1 (NO in step S203), the sentence reliability determination unit 1210 refers to the recognition result storage unit 302 and determines whether a first recognition result is stored (step S210). If no first recognition result is stored (NO in step S210), the sentence reliability determination unit 1210 stores, in the recognition result storage unit 302, the word string recognized by the sentence estimation unit 1202, the phoneme string corresponding to that word string, and the appearance probability of each phoneme obtained from P_A(o_t, s_t | s_{t-1}) in Equation (2), as the recognition result of the first utterance (first recognition result) (step S209). In step S208, the speech recognition device 100 asks again, as in step S108 shown in FIG. 4. In response, the speaker makes a second utterance, and steps S200 to S202 yield a second recognition result for the second utterance in the same way as for the first utterance. If the reliability of the second recognition result is less than TH1, NO is determined in step S203 and the process proceeds to S210.

If the reliability of the second recognition result is at or above the threshold TH1 (YES in step S203), the second recognition result is determined to be the one word string intended by the speaker, and the processing in steps S205 to S207 is executed.

If a first recognition result is stored in the recognition result storage unit 302 (YES in step S210), the phoneme appearance probability synthesis unit 1203 takes the sum of the appearance probability of each phoneme in the phoneme string contained in the stored first recognition result and the appearance probability of each phoneme in the phoneme string of the second utterance obtained in step S202 (step S211).

Next, the sentence estimation unit 1202 calculates the composite appearance probability, described later, by multiplying together the sums of the per-phoneme appearance probabilities of the first and second utterances. It multiplies this composite appearance probability by the word appearance probability to calculate the reliability of each word string, and recognizes the word string giving the highest reliability as the one word string uttered by the speaker (step S212). After step S212, the process moves to step S203.
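Steps S211 and S212 might look like the following in outline. This is a sketch under stated assumptions: the per-phoneme probabilities and word probabilities are invented, only the five phonemes on which the two hypotheses differ are listed, the utterances are assumed to be aligned phoneme by phoneme, and `composite_reliability` is a hypothetical name.

```python
from math import prod
from typing import Dict, List


def composite_reliability(p_first: List[float], p_second: List[float],
                          word_prob: float) -> float:
    """Sum the per-phoneme probabilities (S211), multiply the sums together, then apply P_L (S212)."""
    summed = [a + b for a, b in zip(p_first, p_second)]
    return prod(summed) * word_prob


# Assumed per-phoneme probabilities for the first and second utterances.
first: Dict[str, List[float]] = {
    "ringo desu": [0.3, 0.4, 0.8, 0.9, 0.5],   # r, i, n, g, o
    "mango: desu": [0.5, 0.5, 0.8, 0.9, 0.4],  # m, a, n, g, o:
}
second: Dict[str, List[float]] = {
    "ringo desu": [0.9, 0.8, 0.7, 0.6, 0.5],
    "mango: desu": [0.1, 0.2, 0.7, 0.6, 0.3],
}
word_prob = {"ringo desu": 0.5, "mango: desu": 0.5}  # assumed language model values

scores = {s: composite_reliability(first[s], second[s], word_prob[s]) for s in first}
best = max(scores, key=scores.get)
print(best, scores)
```

With an unweighted sum the score can exceed 1, so it is better read as a ranking score than as a probability; how it is compared with TH1 in step S203 is not something this sketch tries to settle.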

(Specific example of Embodiment 2)
Next, a specific example of Embodiment 2 will be described. For simplicity, this example describes a speech recognition apparatus 100 that recognizes sentences using a model that can estimate only two word strings (sentences): "リンゴです" ("It is an apple") and "マンゴーです" ("It is a mango").

Assume that the phoneme estimation unit 201 estimates "ringodesu" and "mango:desu" as phoneme strings for an utterance. In this case, the appearance probability of each phoneme string is calculated as the product of the appearance probabilities of the phonemes constituting it.

FIG. 13 is a diagram illustrating an example of the search space when a 1-gram language model is adopted in the specific example of Embodiment 2.

In the search space of FIG. 13, the first phoneme "sil" is an abbreviation of "silent" and indicates a silent section. In FIG. 13, each letter represents a phoneme, and the number written below each letter is the appearance probability of that phoneme. In this search space, the element "sil" is placed at both the beginning and the end, and the phoneme strings "ringodesu" and "mango:desu" are included. Specifically, the search space branches from the leading element "sil" into the two phoneme strings "ringo" and "mango:", merges again at the phoneme string "desu", and reaches the final element "sil".

In this case, the appearance probability of the phoneme string "ringodesu" is calculated as 0.7 × 0.5 × 0.5 × ... × 0.9 × 0.9, and that of the phoneme string "mango:desu" as 0.2 × 0.3 × 0.4 × ... × 0.9 × 0.9.

Here, assume that the word dictionary 301 contains three words, "リンゴ" (apple), "マンゴー" (mango), and "です" ("is"), together with the appearance probability of each word. In this case, the sentence estimation unit 1202 obtains the search space shown in FIG. 13 by assigning these three words to the phoneme strings. The number shown to the right of each word is the word's appearance probability.

In general, an n-gram is used for word appearance probabilities. An n-gram assumes that the appearance probability of a word depends on the immediately preceding words. The example of FIG. 13 uses a 1-gram. Because a 1-gram does not depend on the preceding word, it uses the appearance probability of each word on its own. Here, the probability that "リンゴ" is uttered as the first word is 0.6, and the probability that "マンゴー" is uttered as the first word is 0.4. The probability that "です" is uttered after "マンゴー" or "リンゴ" is 1.

The sentence estimation unit 1202 extracts each path connecting the leading element "sil" to the final "sil" as a phoneme string, assigns to each phoneme string the assignable words among those registered in the word dictionary 301, and thereby obtains a plurality of word strings. In the example of FIG. 13, the word "リンゴ" is assigned to the phoneme string "ringo", the word "マンゴー" to the phoneme string "mango:", and the word "です" to the phoneme string "desu". As a result, the word strings "リンゴです" ("It is an apple") and "マンゴーです" ("It is a mango") are obtained.

Then, the product "0.7 × 0.5 × ... × 0.9" of the appearance probabilities of the phonemes in the phoneme string "ringodesu" + "sil" of the word string "リンゴです" is multiplied by the appearance probability "0.6" of the word "リンゴ" and the appearance probability "1" of "です", giving the reliability of the word string "リンゴです". The reliability of the word string "マンゴーです" is obtained in the same way.

Of the word strings "リンゴです" and "マンゴーです", the one with the maximum reliability is estimated as the recognition result. In the example of FIG. 13, the reliability of "リンゴです" is greater than that of "マンゴーです", so "リンゴです" becomes the recognition result.
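The 1-gram reliability calculation described above can be sketched in a few lines. The word appearance probabilities (0.6, 0.4, 1) are taken from the text; the per-phoneme lists are shortened, illustrative placeholders because the text elides the middle terms, so the resulting numbers are not the values of FIG. 13. The romanized keys and all function and variable names are assumptions made for this sketch.

```python
from math import prod

# Placeholder per-phoneme appearance probabilities (NOT the full FIG. 13 values).
PHONEME_PROBS = {
    "ringo desu": [0.7, 0.5, 0.5, 0.9, 0.9],
    "mango: desu": [0.2, 0.3, 0.4, 0.9, 0.9],
}

# 1-gram word appearance probabilities quoted in the text.
WORD_PROBS = {"ringo": 0.6, "mango:": 0.4, "desu": 1.0}


def reliability(word_string: str) -> float:
    """Product of phoneme probabilities times product of word probabilities."""
    acoustic = prod(PHONEME_PROBS[word_string])
    language = prod(WORD_PROBS[word] for word in word_string.split())
    return acoustic * language


# The word string with the maximum reliability becomes the recognition result.
best = max(PHONEME_PROBS, key=reliability)
```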

In the case of a 2-gram, the appearance probability of a word is assumed to depend only on the immediately preceding word. A 2-gram dictionary consisting of only the three words "リンゴ", "マンゴー", and "です" is as shown in FIG. 14. FIG. 14 is a diagram illustrating an example of the word dictionary 301 when a 2-gram language model is adopted in the specific example of Embodiment 2. Including "sil", the 2-gram combinations obtained from the three words are as follows: three pairs for "sil" (followed by "リンゴ", "マンゴー", or "です"), three pairs for "リンゴ" (followed by "です", "マンゴー", or "sil"), three pairs for "マンゴー" (followed by "です", "リンゴ", or "sil"), and three pairs for "です" (followed by "リンゴ", "マンゴー", or "sil"), for a total of 3 × 4 = 12 combinations. The word dictionary 301 shown in FIG. 14 therefore registers these 12 2-gram word strings.
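The count of 12 can be checked by enumerating every ordered pair of distinct elements among the four tokens. This is a small sketch for illustration; the romanized token names are assumptions standing in for "リンゴ", "マンゴー", and "です".

```python
from itertools import permutations

# "sil" plus the three dictionary words (romanized for this sketch).
TOKENS = ["sil", "ringo", "mango:", "desu"]

# Each token can be followed by any of the other three tokens.
bigrams = list(permutations(TOKENS, 2))
```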

The 2-gram search space using the word dictionary 301 shown in FIG. 14 is represented as in FIG. 15. FIG. 15 is a diagram illustrating an example of the search space when a 2-gram language model is adopted in the specific example of Embodiment 2. In FIG. 15, the phoneme strings and the appearance probability of each phoneme are the same as in FIG. 13.

When the word dictionary 301 of FIG. 14 is stored, the probability that "リンゴ" appears as the first word, that is, the probability that "リンゴ" appears after the element "sil", is 0.3. The probability that "マンゴー" appears as the first word, that is, after the element "sil", is 0.2.

The probability that "です" appears after "リンゴ" is 0.5, and the probability that "です" appears after "マンゴー" is 0.4. Furthermore, the probability that the element "sil" appears after "です" is 0.6. In this case, the word string that maximizes the product of the appearance probability of the phoneme string on each path in the graph of FIG. 15 and the appearance probabilities of the 2-gram word strings is adopted as the recognition result. That is, the product of the appearance probabilities of the phonemes in the phoneme string "ringodesu" and the appearance probabilities of "sil-リンゴ", "リンゴ-です", and "です-sil" (0.3, 0.5, and 0.6, respectively) is calculated as the reliability of the word string "リンゴです". The reliability of the word string "マンゴーです" is calculated in the same way. In this example, the reliability of "リンゴです" is higher than that of "マンゴーです", so "リンゴです" is finally the recognition result. The same processing applies when the n-gram is a 3-gram or higher.
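The 2-gram part of this reliability can be sketched as a product over adjacent word pairs, with "sil" padded at both ends. The five bigram probabilities below are the ones quoted in the text; the remaining entries of the 12-pair dictionary of FIG. 14 are not given there and are omitted. The function names and romanized keys are assumptions made for illustration.

```python
from math import prod

# Bigram appearance probabilities quoted in the text (subset of FIG. 14).
BIGRAM_PROBS = {
    ("sil", "ringo"): 0.3,
    ("sil", "mango:"): 0.2,
    ("ringo", "desu"): 0.5,
    ("mango:", "desu"): 0.4,
    ("desu", "sil"): 0.6,
}


def bigram_score(words):
    """Product of bigram probabilities along sil -> words... -> sil."""
    padded = ["sil", *words, "sil"]
    return prod(BIGRAM_PROBS[pair] for pair in zip(padded, padded[1:]))


def reliability(phoneme_probs, words):
    """Phoneme-string probability multiplied by the 2-gram word score."""
    return prod(phoneme_probs) * bigram_score(words)
```

With these values the language-model factor is 0.3 × 0.5 × 0.6 for "ringo desu" and 0.2 × 0.4 × 0.6 for "mango: desu".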

The sentence reliability determination unit 1210 determines whether the reliability of the word string estimated by the sentence estimation unit 1202 is equal to or greater than the threshold TH1. When the reliability of the first recognition result for the first utterance and the reliability of the second recognition result for the second utterance are both less than the threshold TH1, the phoneme appearance probability synthesis unit 1203 calculates the combined appearance probability by summing, for each phoneme, its appearance probability in the first utterance and its appearance probability in the second utterance, and multiplying these sums together.

The sentence estimation unit 1202 recognizes a word string (sentence) using the combined appearance probability calculated by the phoneme appearance probability synthesis unit 1203.

FIG. 16 is a diagram illustrating the search space when the appearance probabilities of the phonemes of the first recognition result and of the second recognition result are combined in the specific example of Embodiment 2. Like FIG. 15, FIG. 16 shows a directed graph of the phoneme strings "ringodesu" and "mango:desu", and for each phoneme it shows the appearance probability in the first utterance and in the second utterance. In the example of FIG. 16, 1-gram words are assigned. In FIG. 16, the number written immediately below each phoneme is its appearance probability in the first utterance, and the number written immediately below that value is its appearance probability in the second utterance.

For example, for the phoneme string "ringodesu", the appearance probability of the phoneme "r" is 0.7 in the first utterance and 0.3 in the second utterance.

Here, the combined appearance probability of the phoneme string "ringodesu" is (0.7 + 0.3) × (0.5 + 0.4) × ... × (0.9 + 0.9), and that of the phoneme string "mango:desu" is (0.2 + 0.4) × (0.3 + 0.5) × ... × (0.9 + 0.9).

In this case, the sentence estimation unit 1202 assigns the 1-gram word strings registered in the word dictionary 301 to each of the phoneme strings "ringodesu" and "mango:desu".

Then, the sentence estimation unit 1202 calculates the reliability of each word string by multiplying the combined appearance probability calculated by the phoneme appearance probability synthesis unit 1203 by the word appearance probabilities. The sentence estimation unit 1202 then recognizes the word string with the maximum reliability as the one word string intended by the speaker.

In FIG. 16, the probability that "リンゴ" appears as the first word is 0.6 and the probability that "です" appears after "リンゴ" is 1, so the reliability of the word string "リンゴです" is calculated as (0.7 + 0.3) × (0.5 + 0.4) × ... × (0.9 + 0.9) × 0.6 × 1. Similarly, the probability that "マンゴー" appears as the first word is 0.4 and the probability that "です" appears after "マンゴー" is 1, so the reliability of the word string "マンゴーです" is calculated as (0.2 + 0.4) × (0.3 + 0.5) × ... × (0.9 + 0.9) × 0.4 × 1.

Here, the word string "リンゴです" has a higher reliability than the word string "マンゴーです", so it is recognized that "リンゴです" was uttered.
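The combination of two utterances can be sketched as a product of per-phoneme sums. The example uses only the three phoneme pairs that the text quotes explicitly for each phoneme string and leaves out the elided middle terms, so the printed totals are not those of FIG. 16. The sketch also assumes that the phoneme strings of both utterances have the same length and are already aligned, which the text does not state; all names are hypothetical.

```python
from math import prod


def combined_appearance_probability(utterances):
    """Multiply together the per-phoneme sums across all utterances.

    `utterances` is a list of equal-length lists of per-phoneme
    appearance probabilities, one list per utterance (assumed aligned).
    """
    return prod(sum(column) for column in zip(*utterances))


# Only the pairs quoted in the text; elided terms are left out.
ringo = [[0.7, 0.5, 0.9], [0.3, 0.4, 0.9]]    # first, second utterance
mango = [[0.2, 0.3, 0.9], [0.4, 0.5, 0.9]]

# Multiply by the 1-gram word probabilities (0.6 or 0.4, then 1).
ringo_reliability = combined_appearance_probability(ringo) * 0.6 * 1.0
mango_reliability = combined_appearance_probability(mango) * 0.4 * 1.0
```

Because each factor is a sum of two probabilities, the combined value can exceed 1; it serves as a comparison score rather than a normalized probability. Passing more than two lists extends the same calculation to several past utterances.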

As described above, according to the speech recognition apparatus 100 of Embodiment 2, even if a recognition result with low reliability is obtained for the first utterance, that result is not discarded; it is used when a recognition result with low reliability is also obtained for the second utterance. Therefore, even if re-asking does not yield a highly reliable recognition result, the one word string is recognized by combining both recognition results, which can improve the recognition accuracy of the one word string.

Note that the recognition result stored in the recognition result storage unit 302 in step S209 need not be only the immediately preceding recognition result; it may be a plurality of past recognition results obtained by re-asking. In that case, in step S211 the phoneme appearance probability synthesis unit 1203 may combine the appearance probabilities of the phonemes in the plurality of phoneme strings obtained as the past recognition results with the appearance probabilities of the phonemes in the phoneme string obtained as the latest recognition result.

(Embodiment 3)
FIG. 17 is a diagram illustrating an example of the overall configuration of the voice interaction system according to Embodiment 3. FIG. 17 differs from FIG. 10 in that the phoneme appearance probability synthesis unit 1203 is omitted and a common candidate extraction unit 270 is added.

In Embodiment 3, the sentence estimation unit 1202 estimates word strings in the same manner as in Embodiment 2. However, instead of taking the word string with the maximum reliability as the recognition result, it extracts the top n word strings in descending order of reliability as recognition candidates and takes these top n recognition candidates (n-best) as the recognition result. The n-best refers to the n recognition candidates with the highest reliability among the plurality of recognition candidates included in the recognition result.
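Selecting the n-best candidates amounts to taking the n entries with the highest reliability. A minimal sketch follows, assuming the candidates are available as a mapping from word string to reliability; that representation and the function name are assumptions, not something the text specifies.

```python
import heapq


def n_best(candidates, n):
    """Return the n (word_string, reliability) pairs with highest reliability.

    `candidates` maps each word string to its reliability; the result is
    ordered from highest to lowest reliability.
    """
    return heapq.nlargest(n, candidates.items(), key=lambda item: item[1])
```

For example, calling `n_best(scores, 5)` on a hypothetical `scores` mapping would yield a 5-best list of the kind compared in the following steps.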

When the sentence reliability determination unit 1210 determines that the maximum reliability in the first recognition result is less than the threshold TH1 and the maximum reliability in the second recognition result is also less than the threshold TH1, the common candidate extraction unit 270 compares the recognition candidates (n-best) of the first utterance with those of the second utterance, extracts the common recognition candidates, and determines the word string to be finally recognized based on the extraction result.

FIG. 18 is a flowchart illustrating an example of the recognition process in Embodiment 3. Steps S300 to S307 are basically the same as steps S200 to S207 shown in FIG. 12. In step S303, however, the sentence reliability determination unit 1210 determines whether the maximum reliability of the first recognition result estimated by the sentence estimation unit 1202 is equal to or greater than the threshold TH1.

In step S303, when the maximum reliability of the first recognition result is equal to or greater than the threshold TH1 (YES in step S303), the recognition candidate with that maximum value is determined to be the one word string intended by the speaker, and steps S305 to S307 are executed.

On the other hand, when the maximum reliability in the first recognition result is less than the threshold TH1 in step S303 (NO in step S303), the sentence reliability determination unit 1210 refers to the recognition result storage unit 302 and determines whether a first recognition result is stored (step S310). When no first recognition result is stored (NO in step S310), the sentence reliability determination unit 1210 acquires from the sentence estimation unit 1202, as the n-best, the n recognition candidates with the highest reliability (the product of the per-phoneme appearance probabilities and the per-word appearance probabilities) among the recognition candidates included in the recognition result of the first utterance, as shown in FIG. 19, and stores them in the recognition result storage unit 302. In step S308, as in step S208 shown in FIG. 12, the speech recognition apparatus 100 asks the speaker to repeat the utterance. In response to this re-asking, the speaker makes a second utterance, and a second recognition result for the second utterance is obtained through steps S300 to S302 in the same way as for the first utterance. If the maximum reliability of the second recognition result is less than the threshold TH1, NO is determined in step S303 and the process proceeds to step S310.

On the other hand, if the maximum reliability of the second recognition result is equal to or greater than the threshold TH1 (YES in step S303), the recognition candidate with that maximum value is determined to be the one word string intended by the speaker, and steps S305 to S307 are executed.

On the other hand, when the first recognition result is stored in the recognition result storage unit 302 (YES in step S310), the common candidate extraction unit 270 compares the n-best of the first recognition result with the n-best of the second recognition result (step S311).

Next, the common candidate extraction unit 270 determines, from the comparison, whether there is a common recognition candidate (step S312). When a common recognition candidate exists (YES in step S312), the common candidate extraction unit 270 determines whether there are a plurality of common candidates (step S313). When there are a plurality of common recognition candidates (YES in step S313), the common candidate extraction unit 270 calculates, for each of them, the sum of its reliability in the first recognition result and its reliability in the second recognition result. The common candidate extraction unit 270 may then determine the recognition candidate with the maximum sum of reliabilities as the final recognition result, or may determine a plurality of recognition candidates, in descending order of the sum of reliabilities, as the final recognition results. When step S313 ends, the process proceeds to step S304. Alternatively, the common candidate extraction unit 270 may perform the utterance confirmation described in step S116 of FIG. 4 for the plurality of recognition candidates obtained in descending order of the sum of reliabilities, and determine the recognition candidate to which the speaker agrees as the final recognition result.

FIG. 19 is a diagram illustrating an example of the 5-best of the first recognition result in Embodiment 3. FIG. 20 is a diagram illustrating an example of the 5-best of the second recognition result in Embodiment 3. In FIG. 19 and FIG. 20, the common recognition candidates are "リンゴ食べたい" ("I want to eat an apple") and "インコ飛べた" ("The parakeet could fly"). The sum of the reliabilities in the first and second recognition results is 0.96 (= 0.54 + 0.42) for "リンゴ食べたい" and 0.47 (= 0.20 + 0.27) for "インコ飛べた". In this case, "リンゴ食べたい", which has the maximum sum of reliabilities, is determined as the final recognition result. Alternatively, both recognition candidates may be determined as the final recognition results.
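The comparison of the two 5-best lists can be sketched as an intersection followed by a sum. The reliabilities of the two common candidates are those quoted in the text; the other three entries of each list are placeholders invented for this sketch, since the contents of FIG. 19 and FIG. 20 are not reproduced here. Romanized keys and function names are likewise assumptions.

```python
# First and second 5-best lists: word string -> reliability.
# Entries named "placeholder ..." are invented; only the two common
# candidates carry values quoted in the text.
first_5best = {
    "ringo tabetai": 0.54, "inko tobeta": 0.20,
    "placeholder 1a": 0.10, "placeholder 1b": 0.08, "placeholder 1c": 0.05,
}
second_5best = {
    "ringo tabetai": 0.42, "inko tobeta": 0.27,
    "placeholder 2a": 0.15, "placeholder 2b": 0.09, "placeholder 2c": 0.04,
}


def pick_common(first, second):
    """Sum reliabilities of candidates present in both lists; pick the max."""
    common = first.keys() & second.keys()
    sums = {c: first[c] + second[c] for c in common}
    best = max(sums, key=sums.get) if sums else None
    return best, sums


best, sums = pick_common(first_5best, second_5best)
```

When `sums` is empty, `best` is `None`, which corresponds to the case where no common candidate exists and the speaker is asked again.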

On the other hand, when no common recognition candidate exists (NO in step S312), the process proceeds to step S309. In step S309, the common candidate extraction unit 270 stores the second recognition result in the recognition result storage unit 302 in addition to the first recognition result, and outputs to the response generation unit 240 an instruction to generate a re-asking response sentence, so that the speaker is asked again (step S308). A third recognition result is thereby obtained. If the maximum reliability of the third recognition result is less than the threshold TH1, the first, second, and third recognition results are compared and common recognition candidates are extracted. In this case, a recognition candidate that is common to at least two of the first, second, and third recognition results is extracted as a common recognition candidate.
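The "common to at least two of three" rule can be sketched by counting in how many recognition results each candidate appears. The function name and the representation of each result as a collection of word strings are assumptions made for illustration.

```python
from collections import Counter


def candidates_in_at_least_two(results):
    """Return candidates that appear in two or more recognition results.

    `results` is a list of collections of candidate word strings, one
    collection per utterance (for example the keys of each n-best list).
    """
    # set() ensures a candidate is counted at most once per result.
    counts = Counter(c for result in results for c in set(result))
    return {c for c, count in counts.items() if count >= 2}
```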

As described above, according to the speech recognition apparatus 100 of Embodiment 3, even if a recognition result with low reliability is obtained for the first utterance, that result is not discarded; it is used when a recognition result with low reliability is also obtained for the second utterance. Therefore, even if re-asking does not yield a highly reliable recognition result, a word string recognized in both the first utterance and the second utterance is recognized as the one word string, which can improve the recognition accuracy of the one word string.

(Robot)
The speech recognition apparatus 100 may be mounted on a robot 500 as shown in FIG. 21. FIG. 21 is an external view of the robot 500 on which the speech recognition apparatus 100 according to Embodiments 1 to 3 is mounted. The robot 500 includes a main casing 501 in the shape of a spherical band, a first spherical crown portion 502, and a second spherical crown portion 503. The main casing 501, the first spherical crown portion 502, and the second spherical crown portion 503 together form a sphere. That is, the robot 500 has a spherical shape. The robot 500 also includes a camera 504 in the second spherical crown portion 503, and a distance sensor 505, a speaker 410, and a microphone 400 in the first spherical crown portion 502.

The camera 504 acquires images of the environment around the robot 500. The distance sensor 505 acquires information on the distance to the environment around the robot 500. In this aspect, the robot 500 includes the camera 504 in the second spherical crown portion 503 and the distance sensor 505, the speaker 410, and the microphone 400 in the first spherical crown portion 502, but the arrangement is not limited to this. The camera 504, the distance sensor 505, the speaker 410, and the microphone 400 may be provided in at least one of the first spherical crown portion 502 and the second spherical crown portion 503.

The center of the first spherical crown portion 502 and the center of the second spherical crown portion 503 are fixedly connected by a shaft (not shown) provided inside the main casing 501. The main casing 501 is rotatably attached to the shaft. A frame (not shown) and a display unit (not shown) are also attached to the shaft. A first motor (not shown) that rotates the main casing 501 is attached to the frame. When this first motor rotates, the main casing 501 rotates relative to the first spherical crown portion 502 and the second spherical crown portion 503, and the robot 500 moves forward or backward. The first motor and the main casing 501 are an example of a moving mechanism. While the robot 500 moves forward or backward, the first spherical crown portion 502 and the second spherical crown portion 503 remain stationary, so the camera 504, the distance sensor 505, the microphone 400, and the speaker 410 stay facing the front of the robot 500. The display unit displays an image showing the eyes and mouth of the robot 500. The display unit is attached so that its angle relative to the shaft can be adjusted by power from a second motor (not shown). Adjusting the angle of the display unit relative to the shaft therefore adjusts the direction of the robot's eyes and mouth. Because the display unit is attached to the shaft independently of the main casing 501, its angle relative to the shaft does not change even when the main casing 501 rotates. The robot 500 can therefore move forward or backward with the direction of its eyes and mouth fixed.

The present disclosure can improve speech recognition accuracy and is therefore useful, for example, in the technical field of robots that interact with infants whose speech is unclear.

20 CPU
30 Memory
100 Speech recognition apparatus
200 Speech recognition unit
201 Phoneme estimation unit
202 Word estimation unit
203 Phoneme appearance probability determination unit
210 Word reliability determination unit
220 Intention interpretation unit
230 Action selection unit
240 Response generation unit
250 Speech synthesis unit
260 Utterance extraction unit
270 Common candidate extraction unit
301 Word dictionary
302 Recognition result storage unit
400 Microphone
410 Speaker
500 Robot
1202 Sentence estimation unit
1203 Phoneme appearance probability synthesis unit
1210 Sentence reliability determination unit

Claims

A speech recognition method,
A first utterance spoken by a speaker with the intention of one word is received via a microphone;
The first utterance is composed of N phonemes (N is a natural number of 2 or more),
For each of the N phonemes constituting the first utterance, appearance probabilities are calculated for all types of phonemes,
A phoneme string in which the phonemes having the maximum appearance probability for each of the N phonemes, from the first phoneme to the Nth phoneme constituting the first utterance, are arranged in order is recognized as a first phoneme string corresponding to the first utterance,
A first value is calculated by multiplying together the appearance probabilities of the N phonemes constituting the first phoneme string;
If the first value is less than a first threshold, a voice prompting the speaker to speak the one word again is output through a speaker,
A second utterance re-spoken by the speaker with the intention of the one word is received via the microphone, the second utterance being composed of M phonemes (M is a natural number of 2 or more),
For each of the M phonemes constituting the second utterance, appearance probabilities are calculated for all types of phonemes,
A phoneme string in which the phonemes having the maximum appearance probability for each of the M phonemes, from the first phoneme to the Mth phoneme constituting the second utterance, are arranged in order is recognized as a second phoneme string corresponding to the second utterance,
A second value is calculated by multiplying together the appearance probabilities of the M phonemes constituting the second phoneme string;
When the second value is less than the first threshold, phonemes having an appearance probability equal to or higher than a second threshold are extracted from the first phoneme string and from the second phoneme string,
Words including the extracted phonemes are extracted from a dictionary stored in a memory, the dictionary associating each word with the phoneme string corresponding to that word,
If exactly one word is extracted, the extracted word is recognized as corresponding to the one word;
Speech recognition method.
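
The flow of claim 1 can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the function names, the threshold values, the romanized example words, and the per-phoneme probability tables are made up for illustration. In particular, the sketch treats "words including the extracted phonemes" as words whose phoneme set contains every extracted phoneme, ignoring phoneme position, which is an assumption rather than something the claim states.

```python
from math import prod

def recognize_phonemes(slots):
    """For each phoneme slot, pick the phoneme with the maximum appearance probability."""
    return [max(dist.items(), key=lambda kv: kv[1]) for dist in slots]

def string_value(recognized):
    """Multiply the appearance probabilities of the recognized phonemes."""
    return prod(p for _, p in recognized)

def reliable_phonemes(recognized, th2):
    """Phonemes whose appearance probability is at or above the second threshold."""
    return {ph for ph, p in recognized if p >= th2}

def candidate_words(first, second, dictionary, th1, th2):
    """Dictionary words containing every reliable phoneme of both utterances.

    Returns None unless both utterance values are below th1, because that is
    the only case this sketch covers. Position matching is ignored (assumption).
    """
    r1, r2 = recognize_phonemes(first), recognize_phonemes(second)
    if string_value(r1) >= th1 or string_value(r2) >= th1:
        return None
    wanted = reliable_phonemes(r1, th2) | reliable_phonemes(r2, th2)
    return [w for w, phonemes in dictionary.items() if wanted <= set(phonemes)]

# Made-up appearance probabilities for two attempts at the same word.
first = [{"k": 0.9, "t": 0.1}, {"a": 0.4, "o": 0.35}, {"r": 0.8, "l": 0.2}, {"e": 0.45, "i": 0.3}]
second = [{"k": 0.5, "t": 0.3}, {"a": 0.85, "o": 0.1}, {"r": 0.6, "l": 0.3}, {"e": 0.75, "i": 0.2}]
dictionary = {"kare": ["k", "a", "r", "e"], "kore": ["k", "o", "r", "e"], "sake": ["s", "a", "k", "e"]}

print(candidate_words(first, second, dictionary, th1=0.5, th2=0.7))  # ['kare']
```

With these numbers neither utterance reaches the first threshold on its own, but the reliable phonemes of the first attempt (k, r) and of the second attempt (a, e) together narrow the dictionary to a single word.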

If a plurality of words are extracted, a voice asking the speaker whether each of the extracted words was spoken is output through the speaker,
Receiving a positive or negative answer from the speaker via the microphone;
Recognizing the word corresponding to the positive answer as corresponding to the one word;
The speech recognition method according to claim 1.
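
The confirmation step of claim 2 might look like the following sketch. The callback `ask` is a hypothetical stand-in for outputting a question through the speaker and interpreting the reply received via the microphone as positive (True) or negative (False); the question wording is likewise invented.

```python
def confirm_word(candidates, ask):
    """Ask about each candidate in turn and return the first one the speaker affirms.

    `ask` is a hypothetical callback: it takes the question text and returns
    True for a positive answer, False for a negative one.
    """
    for word in candidates:
        if ask(f"Did you say '{word}'?"):
            return word
    return None  # no candidate was affirmed

# Scripted replies stand in for the speaker's answers: "no", then "yes".
replies = iter([False, True])
print(confirm_word(["kore", "kare"], lambda question: next(replies)))  # kare
```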

A speech recognition method,
A first utterance uttered by a speaker with the intention of a single word string is received via a microphone;
The first utterance is composed of N phonemes (N is a natural number of 2 or more),
Calculating the reliability X1 of the word string estimated for the first utterance;

t indicates a number that designates a frame constituting the first utterance and the second utterance;
T represents the total number of frames constituting the first utterance and the second utterance,
$P_{A1}(o_t, s_t \mid s_{t-1})$ indicates the probability that, following the phoneme string corresponding to the state $s_{t-1}$ from the first frame to the (t-1)-th frame of the first utterance, an arbitrary phoneme appears in the t-th frame and a transition is made to the phoneme string corresponding to the state $s_t$,
$o_t$ is a physical quantity obtained from the first utterance and used to estimate the arbitrary phoneme;
The arbitrary phonemes represent all types of phonemes,
$P_{A2}(q_t, s_t \mid s_{t-1})$ indicates the probability that, following the phoneme string corresponding to the state $s_{t-1}$ from the first frame to the (t-1)-th frame of the second utterance, an arbitrary phoneme appears in the t-th frame and a transition is made to the phoneme string corresponding to the state $s_t$,
$q_t$ is a physical quantity obtained from the second utterance and used to estimate the arbitrary phoneme;
$P_L(s_t, s_{t-1})$ indicates the probability that, following the word string corresponding to the state $s_{t-1}$ in the first utterance, an arbitrary word appears in the t-th frame and a transition is made to the word string corresponding to the state $s_t$,
The word string corresponding to the state $s_t$ that gives the maximum value of the composite reliability X is recognized as the one word string,
Speech recognition method.
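
The selection of the word string that maximizes the composite reliability X can be sketched as follows. The equation for X is not reproduced in this text, so the combination rule is an assumption: the sketch multiplies, frame by frame, the language term P_L by the sum of the two acoustic terms P_A1 and P_A2. It also assumes the two utterances have already been aligned to the same T frames. The candidate word strings and all probabilities are invented.

```python
from math import prod

def composite_reliability(pa1, pa2, pl):
    """Frame-wise product over t = 1..T.

    ASSUMPTION: the acoustic terms of the two utterances are combined by
    summation before multiplying by the language term; the source equation
    is not available in this text.
    """
    return prod((a1 + a2) * l for a1, a2, l in zip(pa1, pa2, pl))

def best_word_string(candidates):
    """Return the candidate word string with the largest composite reliability."""
    return max(candidates, key=lambda w: composite_reliability(*candidates[w]))

# Hypothetical per-frame terms (P_A1, P_A2, P_L) for two candidate word strings.
candidates = {
    "curry": ([0.6, 0.5], [0.7, 0.6], [0.9, 0.8]),
    "carry": ([0.5, 0.4], [0.3, 0.2], [0.9, 0.7]),
}
print(best_word_string(candidates))  # curry
```

Because the acoustic terms are summed, the score is not itself a probability and can exceed 1; only the ranking of candidates matters in this sketch.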

A speech recognition method,
A first utterance uttered by a speaker with the intention of a single word string is received via a microphone;
The first utterance is composed of N phonemes (N is a natural number of 2 or more),
Calculating reliability X1 of all word strings estimated for the first utterance;

t2 indicates a number that designates a frame constituting the second utterance;
T2 indicates the total number of frames constituting the second utterance,
$P_{A2}(o_{t2}, s_{t2} \mid s_{t2-1})$ indicates the probability that, following the phoneme string corresponding to the state $s_{t2-1}$ from the first frame to the (t2-1)-th frame of the second utterance, an arbitrary phoneme appears in the t2-th frame and a transition is made to the phoneme string corresponding to the state $s_{t2}$,
$o_{t2}$ is a physical quantity obtained from the second utterance and used to estimate the arbitrary phoneme;
$P_{L2}(s_{t2}, s_{t2-1})$ indicates the probability that, following the word string corresponding to the state $s_{t2-1}$ in the second utterance, an arbitrary word appears in the t2-th frame and a transition is made to the word string corresponding to the state $s_{t2}$,
It is determined whether the maximum value MaxX2 of the reliability X2 is greater than or equal to a threshold,
If the maximum value MaxX2 is less than the threshold, the word strings estimated for the second utterance that give the top M values of the reliability X2 are extracted as second word strings,
If there is a word string common to the first word strings and the second word strings, the common word string is recognized as the one word string;
Speech recognition method.
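
The final step of claim 4, taking the top M word strings by reliability for each utterance and looking for a common one, can be sketched in Python. The function names, the reliability values, and the candidate word strings are hypothetical; the sketch simply assumes that the reliabilities X1 and X2 have already been computed for every candidate.

```python
def top_m(scores, m):
    """Word strings with the M highest reliability values, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word_string for word_string, _ in ranked[:m]]

def common_word_strings(x1, x2, m):
    """Word strings that appear in the top M of both utterances."""
    second = set(top_m(x2, m))
    return [w for w in top_m(x1, m) if w in second]

# Made-up reliabilities X1 (first utterance) and X2 (second utterance).
x1 = {"I like curry": 0.30, "I like cherry": 0.25, "I bike carry": 0.05}
x2 = {"I like curry": 0.28, "eye like curry": 0.27, "I like cherry": 0.10}

print(common_word_strings(x1, x2, m=2))  # ['I like curry']
```

Neither utterance needs to be recognized confidently on its own: a candidate that survives in both top-M lists is accepted even though each maximum reliability is below the threshold.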

A program for causing a computer to execute the speech recognition method according to claim 1.

A speech recognition device comprising a processor, a memory, a microphone, and a speaker,
The processor is
Receiving a first utterance spoken by a speaker with the intention of one word via the microphone;
The first utterance is composed of N phonemes (N is a natural number of 2 or more),
For each of the N phonemes constituting the first utterance, appearance probabilities are calculated for all types of phonemes,
A phoneme string in which the phonemes having the maximum appearance probability for each of the N phonemes, from the first phoneme to the Nth phoneme constituting the first utterance, are arranged in order is recognized as a first phoneme string corresponding to the first utterance,
A first value is calculated by multiplying together the appearance probabilities of the N phonemes constituting the first phoneme string;
If the first value is less than a first threshold, a voice prompting the speaker to speak the one word again is output through the speaker,
A second utterance re-spoken by the speaker with the intention of the one word is received via the microphone, the second utterance being composed of M phonemes (M is a natural number of 2 or more),
For each of the M phonemes constituting the second utterance, appearance probabilities are calculated for all types of phonemes,
A phoneme string in which the phonemes having the maximum appearance probability for each of the M phonemes, from the first phoneme to the Mth phoneme constituting the second utterance, are arranged in order is recognized as a second phoneme string corresponding to the second utterance,
A second value is calculated by multiplying together the appearance probabilities of the M phonemes constituting the second phoneme string;
When the second value is less than the first threshold, phonemes having an appearance probability equal to or higher than a second threshold are extracted from the first phoneme string and from the second phoneme string,
Words including the extracted phonemes are extracted from a dictionary stored in the memory, the dictionary associating each word with the phoneme string corresponding to that word,
If exactly one word is extracted, the extracted word is recognized as corresponding to the one word;
Voice recognition device.

A robot comprising:
The speech recognition device according to claim 6;
A housing containing the speech recognition device; and
A moving mechanism that moves the housing.

A speech recognition device comprising a processor, a microphone, and a speaker,
The processor is
Receiving a first utterance uttered by a speaker with the intention of one word string via the microphone;
The first utterance is composed of N phonemes (N is a natural number of 2 or more),
Calculating the reliability X1 of the word string estimated for the first utterance;

t indicates a number that designates a frame constituting the first utterance;
T indicates the total number of frames constituting the first utterance,
$P_{A1}(o_t, s_t \mid s_{t-1})$ indicates the probability that, following the phoneme string corresponding to the state $s_{t-1}$ from the first frame to the (t-1)-th frame of the first utterance, an arbitrary phoneme appears in the t-th frame and a transition is made to the phoneme string corresponding to the state $s_t$,
$o_t$ is a physical quantity obtained from the first utterance and used to estimate the arbitrary phoneme;
The arbitrary phonemes represent all types of phonemes,
$P_{L1}(s_t, s_{t-1})$ indicates the probability that, following the word string corresponding to the state $s_{t-1}$ in the first utterance, an arbitrary word appears in the t-th frame and a transition is made to the word string corresponding to the state $s_t$,
Determining whether the reliability X1 is greater than or equal to a threshold;
If the reliability X1 is less than the threshold, a voice prompting the speaker to speak the one word string again is output through the speaker.
Receiving a second utterance re-uttered by the speaker with the intention of the one word string, via the microphone;
When the reliability X1 of the second utterance is less than the threshold value, the composite reliability X is calculated for all word strings estimated from the first utterance and the second utterance,

t indicates a number that designates a frame constituting the first utterance and the second utterance;
T represents the total number of frames constituting the first utterance and the second utterance,
$P_{A1}(o_t, s_t \mid s_{t-1})$ indicates the probability that, following the phoneme string corresponding to the state $s_{t-1}$ from the first frame to the (t-1)-th frame of the first utterance, an arbitrary phoneme appears in the t-th frame and a transition is made to the phoneme string corresponding to the state $s_t$,
$o_t$ is a physical quantity obtained from the first utterance and used to estimate the arbitrary phoneme;
The arbitrary phonemes represent all types of phonemes,
$P_{A2}(q_t, s_t \mid s_{t-1})$ indicates the probability that, following the phoneme string corresponding to the state $s_{t-1}$ from the first frame to the (t-1)-th frame of the second utterance, an arbitrary phoneme appears in the t-th frame and a transition is made to the phoneme string corresponding to the state $s_t$,
$q_t$ is a physical quantity obtained from the second utterance and used to estimate the arbitrary phoneme;
$P_L(s_t, s_{t-1})$ indicates the probability that, following the word string corresponding to the state $s_{t-1}$ in the first utterance, an arbitrary word appears in the t-th frame and a transition is made to the word string corresponding to the state $s_t$,
The word string corresponding to the state $s_t$ that gives the maximum value of the composite reliability X is recognized as the one word string,
Voice recognition device.

A robot comprising:
The speech recognition device according to claim 8;
A housing containing the speech recognition device; and
A moving mechanism that moves the housing.

A speech recognition device comprising a processor, a microphone, and a speaker,
The processor is
Receiving a first utterance uttered by a speaker with the intention of one word string via the microphone;
The first utterance is composed of N phonemes (N is a natural number of 2 or more),
Calculating reliability X1 of all word strings estimated for the first utterance;

t1 indicates a number that designates a frame constituting the first utterance;
T1 indicates the total number of frames constituting the first utterance,
$P_{A1}(o_{t1}, s_{t1} \mid s_{t1-1})$ indicates the probability that, following the phoneme string corresponding to the state $s_{t1-1}$ from the first frame to the (t1-1)-th frame of the first utterance, an arbitrary phoneme appears in the t1-th frame and a transition is made to the phoneme string corresponding to the state $s_{t1}$,
$o_{t1}$ is a physical quantity obtained from the first utterance and used to estimate the arbitrary phoneme;
The arbitrary phonemes represent all types of phonemes,
$P_{L1}(s_{t1}, s_{t1-1})$ indicates the probability that, following the word string corresponding to the state $s_{t1-1}$ in the first utterance, an arbitrary word appears in the t1-th frame and a transition is made to the word string corresponding to the state $s_{t1}$,
Determining whether the maximum value MaxX1 of X1 is equal to or greater than a threshold;
When the maximum value MaxX1 is less than the threshold value,
Extracting, as first word strings, the word strings estimated for the first utterance that give the top M values of the reliability X1 (M is a natural number of 2 or more);
A voice prompting the speaker to speak the one word string again is output through the speaker;
Receiving a second utterance re-uttered by the speaker with the intention of the one word string, via the microphone;
Calculating reliability X2 of all word strings estimated for the second utterance;

t2 indicates a number that designates a frame constituting the second utterance;
T2 indicates the total number of frames constituting the second utterance,
$P_{A2}(o_{t2}, s_{t2} \mid s_{t2-1})$ indicates the probability that, following the phoneme string corresponding to the state $s_{t2-1}$ from the first frame to the (t2-1)-th frame of the second utterance, an arbitrary phoneme appears in the t2-th frame and a transition is made to the phoneme string corresponding to the state $s_{t2}$,
$o_{t2}$ is a physical quantity obtained from the second utterance and used to estimate the arbitrary phoneme;
$P_{L2}(s_{t2}, s_{t2-1})$ indicates the probability that, following the word string corresponding to the state $s_{t2-1}$ in the second utterance, an arbitrary word appears in the t2-th frame and a transition is made to the word string corresponding to the state $s_{t2}$,
It is determined whether the maximum value MaxX2 of the reliability X2 is greater than or equal to a threshold,
If the maximum value MaxX2 is less than the threshold, the word strings estimated for the second utterance that give the top M values of the reliability X2 are extracted as second word strings,
If there is a word string common to the first word strings and the second word strings, the common word string is recognized as the one word string;
Voice recognition device.
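
The equations for the reliabilities X1 and X2 do not appear in this text. Judging only from the symbol definitions in the claim above, they plausibly take the familiar form of a frame-wise product of an acoustic term and a language term. The following is a reconstruction under that assumption, not the original equations:

```latex
% Assumed form, reconstructed from the symbol definitions only.
X_1 = \prod_{t1=1}^{T1} P_{A1}(o_{t1}, s_{t1} \mid s_{t1-1}) \, P_{L1}(s_{t1}, s_{t1-1})

X_2 = \prod_{t2=1}^{T2} P_{A2}(o_{t2}, s_{t2} \mid s_{t2-1}) \, P_{L2}(s_{t2}, s_{t2-1})
```

Under this reading, MaxX1 and MaxX2 are the largest values of X1 and X2 over all word strings estimated for the respective utterance.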

A speech recognition device according to claim 10;
A housing containing the voice recognition device;
A moving mechanism for moving the housing;
Robot equipped with.