JP2008052178A

JP2008052178A - Voice recognition device and voice recognition method

Info

Publication number: JP2008052178A
Application number: JP2006230378A
Authority: JP
Inventors: Ryo Murakami; 涼村上; Seisho Watabe; 生聖渡部; Kazuya Shimooka; 和也下岡
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2006-08-28
Filing date: 2006-08-28
Publication date: 2008-03-06

Abstract

PROBLEM TO BE SOLVED: To provide a technique for suppressing incorrect recognition by asking back appropriately when recognizing the voice of a speaker.

SOLUTION: A voice recognition device is equipped with: a voice input means for inputting voice and converting it to voice data; an utterance section extracting means for extracting an utterance section; a voice analysis means for calculating a time sequence of feature quantities of the voice; a word likelihood calculating means for calculating the likelihood of each word in a candidate word group; a sentence likelihood calculating means for calculating the likelihood of each sentence in a candidate sentence group; a reliability calculating means for calculating the reliability of each word in the candidate word group; a sentence specifying means for specifying the sentence which the speaker speaks, based on the reliability of the words included in the sentence; a first asking-back judgment means for judging whether asking back is required, based on the reliability of the words included in the specified sentence; and an asking-back means for asking back to the speaker when it is judged that asking back is required.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to an apparatus and a method for recognizing speech spoken by an interlocutor.

There are techniques that let a person control a device by voice, without operating an interface such as a keyboard or a lever. In such techniques, the content of the words expressed by the voice is recognized from speech captured by voice input means such as a microphone, and control is performed according to the recognized content. When control relies on speech recognition in this way, it is important to suppress misrecognition as far as possible. Controlling the device on the basis of a misrecognized result causes the device to malfunction.

To prevent misrecognition in speech recognition, techniques have been developed that ask back to the interlocutor when the speech could not be recognized well. For example, Patent Document 1 discloses a speech recognition system that asks back to the interlocutor. This system compares the input speech with acoustic models of words prepared in advance, and recognizes the word represented by an acoustic model with a high match rate as the word the interlocutor spoke. When the match rate between the speech and the acoustic model is lower than a predetermined threshold, the system asks back to the interlocutor.

Patent Document 1: JP 2003-44756 A

Misrecognition can occur even when the match rate between the speech and the acoustic model is high, and may not occur even when the match rate is low. In general, the match rate tends to be high when the interlocutor's utterance is clear and low when it is unclear. For example, when the voice of an interlocutor who articulates clearly, such as an announcer, is input, the match rate with the acoustic model of the word actually spoken is naturally high, but the match rates with the acoustic models of other, similar words that were not spoken also tend to be high. In that case there are several candidate words with high match rates, the ambiguity as to which word was actually spoken is not resolved, and misrecognition may result. Conversely, when the voice of an interlocutor whose utterance is unclear is input, the match rate with the acoustic model is low for every candidate word. However, if the match rate for one word is large relative to the match rates for the other words, the candidates have effectively been narrowed down, so misrecognition does not occur even though the match rate is low. In this case there is no need to ask back to the interlocutor.

As described above, the match rate between the speech and the acoustic model does not allow an appropriate decision on whether asking back is needed to suppress misrecognition. A technique that can decide more appropriately whether asking back is needed has long been awaited.

The present invention solves the above problem. It provides a technique that can suppress misrecognition by asking back appropriately when recognizing the speech spoken by an interlocutor.

The present invention is embodied as an apparatus that recognizes speech spoken by an interlocutor. The speech recognition apparatus of the present invention comprises: voice input means for inputting speech and converting it into voice data; utterance section extraction means for extracting an utterance section from the voice data; voice analysis means for calculating, from the voice data, a time series of speech feature quantities in the utterance section; word likelihood calculation means for calculating, from that time series, a likelihood for each word in a group of candidate words; sentence likelihood calculation means for calculating, from the likelihoods of the candidate words, a likelihood for each sentence in a group of candidate sentences; confidence calculation means for calculating, from the likelihoods of the candidate sentences and the likelihoods of the candidate words, a confidence for each candidate word; sentence specifying means for specifying, from among the candidate sentences, the sentence spoken by the interlocutor in the utterance section, based on the confidence of the words included in each sentence; first ask-back determination means for determining, based on the confidence of the words included in the specified sentence, whether it is necessary to ask back to the interlocutor; and ask-back means for asking back to the interlocutor when asking back is determined to be necessary.

The speech recognition apparatus of the present invention calculates, from the time series of speech feature quantities in the utterance section, a likelihood for each candidate word and a likelihood for each candidate sentence. The speech feature quantity may be, for example, the frequency spectrum itself or mel-frequency cepstral coefficients (MFCC). The apparatus then calculates a confidence for each candidate word from the word likelihoods and the sentence likelihoods. The confidence of a word is an index of how reliable that word is relative to the other, competing candidate words. Details of word confidence are described in, for example, Akinobu Lee, Tatsuya Kawahara, and Kiyohiro Shikano, "A fast confidence scoring method based on word posterior probability in a two-pass search algorithm", IEICE Technical Report, The Institute of Electronics, Information and Communication Engineers, December 2003, SP2003-160, pp. 35-40. Based on the confidence of the words included in each sentence, the apparatus specifies, from among the candidate sentences, the sentence spoken by the interlocutor. By specifying a sentence that contains many high-confidence words as the sentence spoken by the interlocutor, the apparatus can recognize that sentence accurately.
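The idea that a word's confidence measures it against competing candidates can be illustrated with a small sketch. This is not the formula of the invention or of the cited report: it is a simplified N-best posterior that uses only sentence log-likelihoods and ignores word positions. The function name `word_confidences` and the scaling factor `alpha` are assumptions made for illustration.

```python
import math


def word_confidences(nbest, alpha=1.0):
    """Simplified word confidence from an N-best list (illustrative only).

    nbest: list of (words, log_likelihood) sentence hypotheses.
    Returns {word: confidence}, where the confidence is the share of the
    normalized sentence weight carried by hypotheses containing the word.
    """
    # Shift by the best score before exponentiating, for numerical stability.
    top = max(log_lik for _, log_lik in nbest)
    weights = [math.exp(alpha * (log_lik - top)) for _, log_lik in nbest]
    total = sum(weights)

    confidences = {}
    for (words, _), weight in zip(nbest, weights):
        for word in set(words):  # count each word once per hypothesis
            confidences[word] = confidences.get(word, 0.0) + weight / total
    return confidences


if __name__ == "__main__":
    hypotheses = [(["red", "apple"], -10.0), (["red", "maple"], -10.0)]
    print(word_confidences(hypotheses))
```

With two equally likely hypotheses that differ in one word, the shared word receives a confidence of 1.0 and each competing word 0.5, regardless of how high or low the absolute likelihoods are.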

The speech recognition apparatus of the present invention determines whether it is necessary to ask back to the interlocutor based on the confidence of the words included in the sentence specified as the sentence the interlocutor spoke. When the confidence of the words in the sentence is low, the likelihood of the specified sentence is presumably not much different from that of the other sentences, so the apparatus asks back to the interlocutor. Conversely, when the confidence of the words in the sentence is high, the specified sentence presumably has a large likelihood relative to the other sentences, so the apparatus does not ask back. Because the apparatus decides whether to ask back by looking at the confidence of the words in the sentence, it asks back only when necessary, avoids unnecessary asking back, and suppresses misrecognition in speech recognition.

The above speech recognition apparatus preferably further comprises volume detection means for detecting, from the voice data, the volume in the utterance section, and second ask-back determination means for determining, based on the volume in the utterance section, whether it is necessary to ask back to the interlocutor.

In general, voice input means has a defined range of detectable loudness, and speech outside this range cannot be detected accurately. Therefore, if the interlocutor's voice is too loud or too quiet, accurate speech recognition is not possible. By deciding whether to ask back based on the detected volume, the above speech recognition apparatus can suppress misrecognition caused by the voice input means failing to detect the speech accurately.

In the above speech recognition apparatus, the second ask-back determination means preferably determines that asking back to the interlocutor is necessary when the volume in the utterance section exceeds an upper limit.

The voice input means has an upper limit on the loudness it can detect. Speech louder than this limit cannot be detected accurately. Therefore, when the interlocutor's voice is too loud, the speech waveform cannot be detected accurately and accurate speech recognition is not possible. The above speech recognition apparatus asks back when the volume is too high and prompts the interlocutor to speak more quietly. This suppresses misrecognition caused by the interlocutor's voice being too loud.

In the above speech recognition apparatus, the second ask-back determination means preferably determines that asking back to the interlocutor is necessary when the volume in the utterance section is below a lower limit.

The speech recognition apparatus has a lower limit on the loudness it can detect. Speech quieter than this limit cannot be detected accurately. Therefore, when the interlocutor's voice is too quiet, the speech waveform cannot be detected accurately and accurate speech recognition is not possible. The above speech recognition apparatus asks back when the volume is too low and prompts the interlocutor to speak more loudly. This suppresses misrecognition caused by the interlocutor's voice being too quiet.

The above speech recognition apparatus desirably further comprises interlocutor identification means for identifying, from among a group of candidate persons, the person who is the interlocutor, and threshold setting means for setting a threshold according to the identified person, wherein the first ask-back determination means determines that asking back to the interlocutor is necessary when the average confidence of the independent words (content words) among the words included in the specified sentence is below the threshold.

With the above speech recognition apparatus, the threshold used to decide whether to ask back can be set individually for each interlocutor. This configuration makes it possible to decide appropriately whether to ask back even for a wide variety of interlocutors.
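The first ask-back determination described above can be sketched as follows. The data layout (a list of word tuples carrying a flag for independent words), the per-person threshold table, and the fallback `default_threshold` are assumptions made for illustration; the source does not specify them.

```python
def needs_ask_back(words, thresholds, person_id, default_threshold=0.5):
    """Decide whether to ask back, from the confidence of independent words.

    words: list of (surface, confidence, is_independent) for the specified
           sentence.
    thresholds: {person_id: threshold} set per identified interlocutor.
    """
    scores = [conf for _, conf, independent in words if independent]
    if not scores:
        # Assumption: with no independent word to judge, ask back.
        return True
    threshold = thresholds.get(person_id, default_threshold)
    average = sum(scores) / len(scores)
    return average < threshold


if __name__ == "__main__":
    sentence = [("Tokyo", 0.9, True), ("to", 0.2, False), ("go", 0.7, True)]
    table = {"visitor_a": 0.85, "visitor_b": 0.6}  # hypothetical identifiers
    print(needs_ask_back(sentence, table, "visitor_a"))
    print(needs_ask_back(sentence, table, "visitor_b"))
```

The same sentence can trigger asking back for one interlocutor and not for another, which is the effect of setting the threshold per identified person.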

The present invention is also embodied as a speech recognition method. The method of the present invention comprises: a voice input step of inputting speech and converting it into voice data; an utterance section extraction step of extracting an utterance section from the voice data; a voice analysis step of calculating, from the voice data, a time series of speech feature quantities in the utterance section; a word likelihood calculation step of calculating, from that time series, a likelihood for each word in a group of candidate words; a sentence likelihood calculation step of calculating, from the likelihoods of the candidate words, a likelihood for each sentence in a group of candidate sentences; a confidence calculation step of calculating, from the likelihoods of the candidate sentences and the likelihoods of the candidate words, a confidence for each candidate word; a sentence specifying step of specifying, from among the candidate sentences, the sentence spoken by the interlocutor in the utterance section, based on the confidence of the words included in each sentence; an ask-back determination step of determining, based on the confidence of the words included in the specified sentence, whether it is necessary to ask back to the interlocutor; and an ask-back step of asking back to the interlocutor when asking back is determined to be necessary.

According to the speech recognition apparatus and speech recognition method of the present invention, misrecognition can be suppressed by asking back appropriately when recognizing the speech spoken by an interlocutor.

The best modes for carrying out the invention are listed below.
(Mode 1) The word likelihood calculation means calculates a likelihood for each candidate word from the time series of speech feature quantities in the utterance section, using a hidden Markov model (HMM).

This embodiment describes an example in which the speech recognition apparatus 100 illustrated in FIG. 1 recognizes speech spoken by an interlocutor V. The speech recognition apparatus 100 is, for example, a guide robot placed in a showroom or at an event venue, and it recognizes the speech of a visitor (interlocutor) V who approaches it for guidance.

The speech recognition apparatus 100 has a head 102, a torso 108, and arms 116. It further comprises a right camera 104 and a left camera 106 arranged side by side on the front of the head 102; a speaker 118 arranged on the front of the head 102; an actuator group 110 that drives the head 102 and the arms 116 relative to the torso 108; a microphone 112 provided on the front of the torso 108; and a controller 114 that controls the operation of the right camera 104, the left camera 106, the speaker 118, the actuator group 110, and the microphone 112.

The right camera 104 and the left camera 106 are ordinary CCD cameras. They capture images repeatedly at a predetermined time interval and output the captured image data to the controller 114.

The microphone 112 senses the sound pressure that the incoming sound applies to its diaphragm and outputs a voltage corresponding to the sensed sound pressure to the controller 114.

The speaker 118 amplifies the signal sent from the controller 114 with an amplifier, vibrates its diaphragm according to the variation of the amplified current, and outputs sound.

The actuator group 110 drives the head 102 and the arms 116 based on control signals sent from the controller 114.

FIG. 2 is a block diagram showing the configuration of the controller 114. The controller 114 is a computer device made up of a processing unit (CPU), storage devices (an optical storage medium, a magnetic storage medium, or semiconductor memory such as RAM or ROM), input/output devices, an arithmetic unit, and the like.

The image A/D conversion unit 202 A/D-converts the image data input from the right camera 104 to generate digital image data, referred to below as right digital image data, and sends it to the image recognition unit 206. The image A/D conversion unit 204 A/D-converts the image data input from the left camera 106 to generate digital image data, referred to below as left digital image data, and sends it to the image recognition unit 206.

The image recognition unit 206 identifies the interlocutor V using a person database 208 (database is abbreviated DB below), based on the right digital image data input from the image A/D conversion unit 202 and the left digital image data input from the image A/D conversion unit 204. The person DB 208 stores, for each person who is a candidate for the interlocutor V, an identification code associated with the positional relationship of that person's facial feature points. The image recognition unit 206 extracts the facial feature points of the photographed interlocutor V from the right and left digital image data, calculates the positional relationship of the extracted feature points, and searches the persons registered in the person DB 208 for the one whose facial feature point positions are most similar.

Each time the right camera 104 and the left camera 106 capture images, the image recognition unit 206 determines the identification code of the person identified as the interlocutor V, based on the image data captured by the two cameras at the same time. The image recognition unit 206 associates the identification code of the interlocutor V with the capture time and outputs it to the first ask-back determination unit 234.

The voice A/D conversion unit 210 A/D-converts the change over time of the sound pressure input from the microphone 112 to generate digital voice data. It outputs the generated digital voice data to the utterance section extraction unit 212, the volume detection unit 214, and the voice analysis unit 216.

The utterance section extraction unit 212 detects the start time and the end time of an utterance from the digital voice data input from the voice A/D conversion unit 210. FIG. 3 shows an example of a speech waveform 302 represented by the digital voice data input to the utterance section extraction unit 212. While no utterance start has been detected, the utterance section extraction unit 212 monitors whether the sound pressure in the speech waveform 302 exceeds a predetermined threshold P1. Specifically, it judges that an utterance has started at the point where, within a unit time T1, the average sound pressure exceeds the threshold P1 and the number of times the speech waveform 302 crosses the zero-sound-pressure line 304 reaches or exceeds a predetermined count. When the start of an utterance is detected, the utterance section extraction unit 212 determines the utterance start time TS and notifies the volume detection unit 214 and the voice analysis unit 216 of it.

The utterance section extraction unit 212 counts the number of times the speech waveform 302 crosses the zero-sound-pressure line 304 within a unit time T2 and monitors whether the count reaches a predetermined threshold. Specifically, it judges that the utterance has ended at the point where, within the unit time T2, the count falls below the predetermined threshold and the average sound pressure falls below a predetermined threshold P2. When the end of the utterance is detected, the utterance section extraction unit 212 determines the utterance end time TE and notifies the volume detection unit 214 and the voice analysis unit 216 of it.
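The start and end conditions above can be sketched as a simple window scan. Several points are assumptions made for illustration: one window length `win` stands in for both unit times T1 and T2, one crossing count `min_cross` stands in for both count thresholds, windows do not overlap, and the "average sound pressure" is taken as the mean absolute sample value.

```python
def zero_crossings(window):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))


def mean_abs(window):
    """Mean absolute sample value, used here as the average sound pressure."""
    return sum(abs(x) for x in window) / len(window)


def detect_utterance(samples, win, p1, p2, min_cross):
    """Return (start_index, end_index) of the first utterance, or None.

    Start: average pressure > p1 and crossings >= min_cross in a window.
    End:   average pressure < p2 and crossings <  min_cross in a window.
    """
    start = None
    for i in range(0, len(samples) - win + 1, win):
        window = samples[i:i + win]
        loud, busy = mean_abs(window), zero_crossings(window)
        if start is None:
            if loud > p1 and busy >= min_cross:
                start = i
        elif loud < p2 and busy < min_cross:
            return start, i
    # Utterance still running at the end of the data, or never started.
    return (start, len(samples)) if start is not None else None


if __name__ == "__main__":
    signal = [0] * 8 + [1, -1] * 8 + [0] * 8
    print(detect_utterance(signal, win=4, p1=0.5, p2=0.1, min_cross=2))
```

Requiring both conditions at once keeps a single loud click (high pressure, few crossings) from being treated as the start of speech.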

The volume detection unit 214 integrates the squared sound pressure from the utterance start time TS to the utterance end time TE, based on the digital voice data input from the voice A/D conversion unit 210. When notified of the utterance start time TS by the utterance section extraction unit 212, it starts accumulating the squared sound pressure. When notified of the utterance end time TE, it stops accumulating. It then divides the accumulated value by the length of the utterance section, TE - TS, to obtain the average volume in the utterance section, and outputs that average volume to the second ask-back determination unit 218.
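For sampled data, the average volume described above reduces to a mean of squared samples over the utterance section. The sketch below assumes a list of samples and a sample rate in samples per second; dividing the sum by the number of samples is the discrete counterpart of dividing the time integral by TE - TS. The function and parameter names are hypothetical.

```python
def average_volume(samples, sample_rate, ts, te):
    """Mean squared sound pressure between times ts and te (in seconds)."""
    first = int(ts * sample_rate)
    last = int(te * sample_rate)
    segment = samples[first:last]
    if not segment:
        raise ValueError("utterance section contains no samples")
    # Sum of squares divided by the number of samples in the section.
    return sum(x * x for x in segment) / len(segment)


if __name__ == "__main__":
    data = [0, 0, 2, -2, 2, -2, 0, 0]
    print(average_volume(data, sample_rate=8, ts=0.25, te=0.75))
```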

The second ask-back determination unit 218 decides whether it is necessary to ask back to the interlocutor V based on the average volume in the utterance section input from the volume detection unit 214. First, it compares the average volume with a predetermined upper limit. If the average volume exceeds the upper limit, it judges that the voice of the interlocutor V in the utterance section was too loud for accurate speech recognition, and it instructs the response determination unit 240 to ask back in a way that prompts the interlocutor V to speak more quietly. Next, it compares the average volume with a predetermined lower limit. If the average volume is below the lower limit, it judges that the voice of the interlocutor V in the utterance section was too quiet for accurate speech recognition, and it instructs the response determination unit 240 to ask back in a way that prompts the interlocutor V to speak more loudly. The second ask-back determination unit 218 performs this processing each time the utterance section extraction unit 212 detects an utterance end time TE and the volume detection unit 214 inputs the average volume for the utterance section.
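The two comparisons made by the second ask-back determination can be sketched as a small function. The enum and its member names are hypothetical, and the limit values in the usage example are arbitrary; the source only states that the limits are predetermined.

```python
from enum import Enum


class VolumeVerdict(Enum):
    OK = "no ask-back needed on volume grounds"
    TOO_LOUD = "ask the interlocutor to speak more quietly"
    TOO_QUIET = "ask the interlocutor to speak more loudly"


def judge_volume(average_volume, lower_limit, upper_limit):
    """Compare the average volume with the upper limit, then the lower."""
    if average_volume > upper_limit:
        return VolumeVerdict.TOO_LOUD
    if average_volume < lower_limit:
        return VolumeVerdict.TOO_QUIET
    return VolumeVerdict.OK


if __name__ == "__main__":
    for volume in (0.001, 0.2, 5.0):
        print(volume, judge_volume(volume, lower_limit=0.01, upper_limit=1.0))
```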

The speech analysis unit 216 calculates a time series of speech features between the utterance start time TS and the utterance end time TE. In this embodiment, the speech analysis unit 216 divides the input digital speech data into frames and determines the frequency spectrum of the speech data in each frame. FIG. 4 shows the framing of the speech data and the determination of the frequency spectrum for each frame. In this embodiment, the frame length is 20 ms and the frame interval is 10 ms, so consecutive frames overlap by 10 ms. As shown in FIG. 4, frames F1, F2, F3, ... are defined on the speech data 402. The speech analysis unit 216 determines the frequency spectra f1, f2, f3, ... of the speech data 402 in the frames F1, F2, F3, ..., respectively. Each frequency spectrum is given as a distribution of amplitude over frequency and can be obtained using, for example, a fast Fourier transform. When the utterance section extraction unit 212 reports the utterance start time TS, the speech analysis unit 216 starts the framing and the spectrum determination. Until the utterance end time TE is reported, it repeats this processing and outputs the frequency spectrum of each frame to the phoneme likelihood calculation unit 220 in sequence. When the utterance section extraction unit 212 reports the utterance end time TE, the speech analysis unit 216 ends the framing and the spectrum determination.
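The framing and spectrum determination described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names, the 8 kHz sampling rate, and the test tone are assumptions, and a naive discrete Fourier transform stands in for the fast Fourier transform to keep the example dependency-free.

```python
import cmath
import math


def frame_signal(samples, sample_rate, frame_ms=20, hop_ms=10):
    """Split samples into frames of frame_ms, one starting every hop_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def amplitude_spectrum(frame):
    """Amplitude per frequency bin (0 .. N/2) using a naive DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


# Assumed test input: 100 ms of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
samples = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(800)]

frames = frame_signal(samples, sample_rate)          # F1, F2, F3, ...
spectra = [amplitude_spectrum(f) for f in frames]    # f1, f2, f3, ...

# Frequency of the strongest bin in the first frame.
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * sample_rate / len(frames[0])
```

With a 20 ms frame at 8 kHz each frame holds 160 samples, so the bins are 50 Hz apart and the 1 kHz tone falls exactly on one bin.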

The phoneme likelihood calculation unit 220, the word likelihood calculation unit 224, the sentence likelihood calculation unit 228, the confidence calculation unit 232, and the sentence specifying unit 238 use a hidden Markov model (HMM) to identify a sentence, as a time series of phonemes, from the per-frame frequency spectra input from the speech analysis unit 216. Here, a phoneme is an element of the speech sounds produced when a person speaks. For example, the speech produced when a person says the word “budou” (grape) consists of four phonemes: “b”, “u”, “d”, and “o:”. When a time series of phonemes is identified using an HMM, each phoneme is assumed to consist of a plurality of states, and each state is characterized by a transition probability of moving to the next state and a stay probability of remaining in the same state without moving on. Hereinafter, the states that make up a phoneme are called phoneme states. This embodiment describes an example in which each phoneme consists of three phoneme states. For example, the phoneme “b” consists of phoneme states b1, b2, and b3. The phoneme “b” is realized by a transition from some phoneme state to phoneme state b1, then from b1 to b2, and then from b2 to b3. Phoneme state b1 may move to the next phoneme state b2 or may remain in b1. The same applies to phoneme states b2 and b3. In this embodiment, a phoneme is identified as a time series of phoneme states, a word as a time series of phonemes, and a sentence as a time series of words. The likelihoods of words and sentences, viewed as time series of phoneme states, are calculated; the confidence of each word in a sentence is calculated from those likelihoods; and the sentence spoken by the interlocutor V is identified based on the word confidences.

The phoneme likelihood calculation unit 220 evaluates, from the frequency spectrum determined for each frame, the likelihood of each phoneme state for that frame. Each phoneme state has a probability distribution over the frequency spectrum observed as speech when that state is realized. This distribution can be obtained in advance, for example by experiment. From the distribution and the frequency spectrum determined for a frame, the likelihood of the phoneme state for that frame can be calculated. In this embodiment, the phoneme DB 222 stores in advance, for each phoneme state of each phoneme subject to likelihood evaluation, a function that calculates the likelihood from a frequency spectrum. The phoneme likelihood calculation unit 220 calculates the likelihood of each phoneme state of each phoneme for each of the frequency spectra f1, f2, f3, .... For example, the likelihoods of phoneme states b1, b2, and b3 of the phoneme “b” for frame F1 are calculated from the frequency spectrum f1 of frame F1. The likelihoods for frame F1 of the phoneme states of the other phonemes are calculated in the same way, as are the likelihoods of every phoneme state for the subsequent frames F2, F3, ....
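As an illustration of a per-state likelihood function, the sketch below assumes that each phoneme state is modeled by a diagonal-covariance Gaussian over a short feature vector. The text only says that a function is stored for each phoneme state, so the Gaussian form, the two-dimensional features, and the contents of `PHONEME_DB` are hypothetical.

```python
import math


def log_gaussian(features, mean, var):
    """Log density of a diagonal-covariance Gaussian at the given features."""
    total = 0.0
    for x, m, v in zip(features, mean, var):
        total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total


# Hypothetical stand-in for the phoneme DB 222: (mean, variance) per state.
PHONEME_DB = {
    "b1": ([1.0, 0.0], [1.0, 1.0]),
    "b2": ([0.0, 1.0], [1.0, 1.0]),
    "b3": ([-1.0, 0.0], [1.0, 1.0]),
}


def state_log_likelihoods(features):
    """Log-likelihood of every phoneme state for one frame's features."""
    return {state: log_gaussian(features, mean, var)
            for state, (mean, var) in PHONEME_DB.items()}


# Assumed features derived from the spectrum of one frame.
scores = state_log_likelihoods([0.9, 0.1])
best_state = max(scores, key=scores.get)
```

Working in the log domain keeps the later products over many frames from underflowing.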

Once the likelihood of each phoneme state for each frame has been calculated, the word likelihood calculation unit 224 evaluates the likelihood of each phoneme and of each word. This evaluation is described with reference to FIG. 5, which takes the word “budou” (grape) as an example. The left column of FIG. 5 shows that the word “budou” is a sequence of the phonemes “b”, “u”, “d”, and “o:”; that the phoneme “b” is a sequence of phoneme states b1, b2, b3; that “u” is a sequence of u1, u2, u3; that “d” is a sequence of d1, d2, d3; and that “o:” is a sequence of o:1, o:2, o:3. In FIG. 5, point 502 represents phoneme state b1 being realized in frame F1, and points 504, 506, 508, 510, 512, ... represent phoneme states b1, b2, b3, ... being realized in the subsequent frames F2, F3, ..., Fn. From each of the points 502, 504, 506, ... extend two paths: one that moves to the next phoneme state in the next frame, and one that stays in the same phoneme state. For example, from point 502, which represents phoneme state b1 being realized in frame F1, extend branch 514, which moves to the next phoneme state b2 in the next frame F2, and branch 516, which stays in phoneme state b1. Branch 514 leads to point 504, which represents phoneme state b2 being realized in frame F2. Branch 516 leads to point 506, which represents phoneme state b1 being realized in frame F2.

The likelihood of each of the points 502, 504, 506, ... in FIG. 5 can be calculated as the likelihood of the corresponding phoneme state for the corresponding frame. The likelihood of each of the branches 514, 516, ... can be calculated from the transition and stay probabilities of the phoneme states. For example, the likelihood of branch 514 can be calculated from the transition probability from phoneme state b1 to b2, and the likelihood of branch 516 from the stay probability of phoneme state b1. The transition and stay probabilities of each phoneme state of each phoneme making up a word are obtained in advance, for example by experiment, and stored in the word DB 226.

Based on the likelihoods of the points 502, 504, 506, ..., calculated as the likelihood of each phoneme state for each frame, and the likelihoods of the branches 514, 516, ..., stored in the word DB 226, the word likelihood calculation unit 224 calculates the likelihood of every path that is possible at that point in time and identifies the path with the highest likelihood. The likelihood of a path is the likelihood that events progressed along that path, and it can be calculated from the likelihoods of the points and branches on the path. Once the most likely path for a word has been identified, the word likelihood calculation unit 224 takes the likelihood of that path as the likelihood of the word.
In the example of FIG. 5, at the point where processing has advanced through frames F1, F2, ..., Fn, path 518 has been identified as the most likely path for the word “budou”. In this case, the likelihood of path 518 is taken as the likelihood of the word “budou”. The likelihood of path 518 is calculated from the likelihoods of the points 502, 504, 510, ... and of the branches 514, ... on path 518.
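The search for the most likely path can be sketched with a Viterbi-style recursion over a left-to-right chain of phoneme states. The sketch makes several assumptions not stated in the text: likelihoods are handled as logarithms, the path starts in the first state at the first frame, and only paths ending in the last state count toward the word. The function name and the three-state toy data are hypothetical.

```python
import math

NEG_INF = float("-inf")


def word_log_likelihood(emissions, log_next, log_stay):
    """Log-likelihood of the best path through a left-to-right state chain.

    emissions[t][s]: log-likelihood of state s at frame t (the points).
    log_next[s]:     log-probability of moving from state s to s + 1 (a branch).
    log_stay[s]:     log-probability of remaining in state s (a branch).
    """
    n_states = len(emissions[0])
    # Assumption: the path starts in the first state at the first frame.
    best = [NEG_INF] * n_states
    best[0] = emissions[0][0]
    for t in range(1, len(emissions)):
        new = [NEG_INF] * n_states
        for s in range(n_states):
            stay = best[s] + log_stay[s]
            move = best[s - 1] + log_next[s - 1] if s > 0 else NEG_INF
            new[s] = max(stay, move) + emissions[t][s]
        best = new
    # Assumption: the word counts as complete only in the last state.
    return best[-1]


# Toy data: three states, four frames, probabilities chosen by hand.
frame_probs = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
]
emissions = [[math.log(p) for p in row] for row in frame_probs]
log_next = [math.log(0.5)] * 3
log_stay = [math.log(0.5)] * 3

best_log = word_log_likelihood(emissions, log_next, log_stay)
```

Here the best path stays in the first state for two frames and then advances one state per frame, giving a likelihood of 0.9 × (0.5 × 0.8)³ = 0.0576.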

For each word in the group of words the interlocutor V is expected to speak, the word DB 226 stores the transition and stay probabilities of each phoneme state of each phoneme making up the word. Although FIG. 5 illustrates the likelihood evaluation for the word “budou”, the word likelihood calculation unit 224 performs the same evaluation for every word stored in the word DB 226, so that a likelihood is obtained for all of them.

In parallel with the evaluation of word likelihoods, the sentence likelihood calculation unit 228 evaluates the likelihood of every sentence stored in the sentence DB 230. For each sentence in the group of sentences the interlocutor V is expected to speak, the sentence DB 230 stores the sequence of words making up that sentence.
FIG. 6 shows how the likelihood of a sentence is evaluated. In the example of FIG. 6, the word sequence “Prius” (registered trademark) - “no” - “nenpi” - “wa” - “ikura” - “desu ka” (“How much is the Prius's fuel economy?”) forms one sentence. The word sequence “Prius” - “no” - “nenpi” - “o” - “oshiete” - “kudasai” (“Please tell me the Prius's fuel economy.”) forms another. These sentences and their word sequences are stored in the sentence DB 230 in advance.

The sentence likelihood calculation unit 228 calculates the likelihood of a sentence from the likelihoods of the words in the sentence and the word-to-word connection probabilities within it. The connection probabilities are determined using the word connection table 700 shown in FIG. 7. The word connection table 700 lists the probability (labeled “appearance rate” in the figure) that a given word (“preceding word” in the figure) is followed by another word (“following word” in the figure). These probabilities can be obtained, for example, by experiment. The word connection table 700 is stored in the sentence DB 230 in advance, and the sentence likelihood calculation unit 228 reads it from the sentence DB 230 as needed. The sentence likelihood calculation unit 228 evaluates the likelihood of every sentence stored in the sentence DB 230.
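One way to combine word likelihoods with connection probabilities is sketched below. It assumes the sentence likelihood is the product of the word likelihoods and the connection probabilities (a sum in the log domain); the text does not give the exact combination rule. The table entries, the word likelihoods, and the shortened word sequences are made-up values for illustration.

```python
import math

# Hypothetical excerpt of a word connection table:
# (preceding word, following word) -> appearance rate.
WORD_CONNECTIONS = {
    ("Prius", "no"): 0.6,
    ("no", "nenpi"): 0.3,
    ("nenpi", "wa"): 0.5,
    ("nenpi", "o"): 0.4,
}


def sentence_log_likelihood(words, word_log_likelihoods, connections):
    """Sum of word log-likelihoods and log connection probabilities."""
    total = sum(word_log_likelihoods[w] for w in words)
    for prev, nxt in zip(words, words[1:]):
        p = connections.get((prev, nxt), 0.0)
        if p <= 0.0:
            return float("-inf")  # this word order never appears in the table
        total += math.log(p)
    return total


# Made-up word likelihoods, stored as logarithms.
word_ll = {w: math.log(p) for w, p in
           {"Prius": 0.5, "no": 0.5, "nenpi": 0.5, "wa": 0.2, "o": 0.1}.items()}

sentence_a = ["Prius", "no", "nenpi", "wa"]
sentence_b = ["Prius", "no", "nenpi", "o"]
ll_a = sentence_log_likelihood(sentence_a, word_ll, WORD_CONNECTIONS)
ll_b = sentence_log_likelihood(sentence_b, word_ll, WORD_CONNECTIONS)
```

Keying the word likelihoods by word alone is a simplification; a sentence that repeats a word would need one likelihood per occurrence.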

Once the likelihood of each sentence has been evaluated, the confidence calculation unit 232 calculates a confidence for each word based on the likelihood of each sentence and the likelihood of each word in each sentence. The confidence of a word is an index of how reliable that word is relative to competing word candidates. For a speech time series X, the confidence C that the period from time τ to time t corresponds to the word w is calculated by the following equation.

Here, W denotes a sentence, and W[w; τ, t] denotes the set of sentences that contain the word w in the period from time τ to time t. g(W) is the logarithm of the likelihood of sentence W. α is a positive number no greater than 1, called the smoothing coefficient. p(X) is the likelihood that the speech time series is X, and is given here by the sum of the likelihoods of all sentences.
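The equation itself does not appear in this text. A form consistent with the definitions above, assumed here rather than taken from the source, is C(w; τ, t | X) = (1 / p(X)) · Σ exp(α · g(W)), where the sum runs over the sentences W in W[w; τ, t] and p(X) = Σ exp(α · g(W)) over all candidate sentences. The sketch below computes the confidence under that assumption; the data layout and the example values are hypothetical.

```python
import math


def word_confidence(sentences, word, start, end, alpha=1.0):
    """Confidence that `word` occupies [start, end], under the assumed formula.

    sentences: list of (log_likelihood, [(word, start, end), ...]) pairs.
    """
    # Assumed normaliser p(X): scaled likelihoods summed over all sentences.
    total = sum(math.exp(alpha * g) for g, _ in sentences)
    # Scaled likelihoods of the sentences containing the word in that period.
    containing = sum(math.exp(alpha * g)
                     for g, segments in sentences
                     if (word, start, end) in segments)
    return containing / total


# Three made-up candidate sentences with likelihoods 0.6, 0.3 and 0.1.
candidates = [
    (math.log(0.6), [("Prius", 0, 50), ("wa", 90, 100)]),
    (math.log(0.3), [("Prius", 0, 50), ("o", 90, 100)]),
    (math.log(0.1), [("other", 0, 50), ("wa", 90, 100)]),
]

conf_prius = word_confidence(candidates, "Prius", 0, 50)
conf_wa = word_confidence(candidates, "wa", 90, 100)
conf_prius_smoothed = word_confidence(candidates, "Prius", 0, 50, alpha=0.5)
```

With α below 1 the scaled likelihoods move closer together, so a word supported by the dominant sentences receives a somewhat lower confidence than with α = 1.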

For each sentence, the confidence calculation unit 232 calculates the confidence of each word in the sentence and then the average confidence of the sentence's independent words. It associates each sentence with the average confidence of its independent words and outputs them to the sentence specifying unit 238.

The sentence specifying unit 238 identifies the sentence with the highest average independent-word confidence as the sentence spoken by the interlocutor V and outputs it to the response determination unit 240. It also outputs the identified sentence, together with the average confidence of its independent words, to the first ask-back determination unit 234.
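The selection by average independent-word confidence can be sketched as follows. The data layout, the flag marking a word as independent, and the confidence values are assumptions made for illustration.

```python
def average_independent_confidence(words):
    """Mean confidence over the independent words of one sentence.

    words: list of (word, confidence, is_independent) tuples.
    """
    values = [conf for _, conf, independent in words if independent]
    return sum(values) / len(values) if values else 0.0


def specify_sentence(candidates):
    """Return the sentence with the highest average, and that average."""
    averages = {sentence: average_independent_confidence(words)
                for sentence, words in candidates.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]


# Made-up confidences; particles such as "no", "wa", "o" are not independent words.
candidates = {
    "Prius no nenpi wa": [("Prius", 0.9, True), ("no", 0.8, False),
                          ("nenpi", 0.7, True), ("wa", 0.6, False)],
    "Prius no nenpi o": [("Prius", 0.9, True), ("no", 0.8, False),
                         ("nenpi", 0.5, True), ("o", 0.3, False)],
}

best_sentence, best_average = specify_sentence(candidates)
```

Only the independent words contribute to the average, so a poorly recognized particle does not lower a sentence's score.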

The first ask-back determination unit 234 determines, from the sentence input from the sentence specifying unit 238 and the average confidence of its independent words, whether it is necessary to ask the interlocutor V to repeat. The input sentence is the one with the highest average independent-word confidence among the candidate sentences. If that confidence is high, the likelihood of the sentence greatly exceeds the likelihoods of the other sentences, and the recognition result has little ambiguity. In that case, the content of what the interlocutor V said can be regarded as correctly recognized without asking back. Conversely, if the average confidence is low, the likelihood of the sentence does not differ much from the likelihoods of the other sentences, and the recognition result is ambiguous. In that case, it is necessary to ask the interlocutor V back in order to recognize the spoken sentence more accurately.

The first ask-back determination unit 234 decides whether asking back is necessary by comparing the average confidence input from the sentence specifying unit 238 with a threshold. If the average confidence is equal to or greater than the threshold, it determines that asking back is unnecessary. If the average confidence is below the threshold, it determines that asking back is necessary and instructs the response determination unit 240 to ask the interlocutor V to speak more clearly.

The first ask-back determination unit 234 sets the threshold used in this decision based on the identification code of the interlocutor V input from the image recognition unit 206. It searches the confidence DB 236 using the identification code as a key and reads out the threshold appropriate for the person indicated by the code. The appropriate threshold for each person is obtained in advance, for example by experiment. The confidence DB 236 stores each person's identification code in association with the threshold appropriate for that person. With this configuration, the need to ask back can be judged appropriately whoever the interlocutor V is.
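The per-person threshold lookup and the comparison can be sketched as below. The identification codes, the threshold values, and the fallback for an unknown person are hypothetical; the text does not say what happens when a code is not found.

```python
# Hypothetical stand-in for the confidence DB 236: identification code -> threshold.
CONFIDENCE_DB = {
    "person_001": 0.70,
    "person_002": 0.55,
}

# Assumption: fallback threshold for a person missing from the DB.
DEFAULT_THRESHOLD = 0.65


def needs_ask_back(average_confidence, person_id):
    """True when the average confidence is below the person's threshold."""
    threshold = CONFIDENCE_DB.get(person_id, DEFAULT_THRESHOLD)
    return average_confidence < threshold
```

For example, `needs_ask_back(0.6, "person_001")` is true while `needs_ask_back(0.6, "person_002")` is false, so the same recognition result leads to an ask-back for one person and a direct response for the other.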

The speech analysis unit 216, the phoneme likelihood calculation unit 220, the word likelihood calculation unit 224, the sentence likelihood calculation unit 228, the confidence calculation unit 232, and the sentence specifying unit 238 repeat the series of processing described above, from framing to estimation of the sentence, until the utterance section extraction unit 212 reports the utterance end time TE. When the utterance end time TE is reported, the sentence specifying unit 238 outputs the sentence identified from the speech in the utterance section to the response determination unit 240 as a character string.

The response determination unit 240 determines the response to the interlocutor V based on whether an ask-back instruction has been received from the second ask-back determination unit 218 or the first ask-back determination unit 234, and on the character string input from the sentence specifying unit 238.

When the second ask-back determination unit 218 has instructed an ask-back urging the interlocutor V to speak more quietly, the response determination unit 240 outputs the character string “Please speak a little more quietly.” to the speech synthesis unit 242 as the ask-back. It also outputs to the motion generation unit 244 a motion pattern for a gesture of pressing the arms 116 downward.

When the second ask-back determination unit 218 has instructed an ask-back urging the interlocutor V to speak louder, the response determination unit 240 outputs the character string “Please speak a little louder.” to the speech synthesis unit 242 as the ask-back. It also outputs to the motion generation unit 244 a motion pattern for a gesture of covering the area around the speaker 118 with the tips of the arms 116.

When the first ask-back determination unit 234 has instructed an ask-back urging the interlocutor V to speak more clearly, the response determination unit 240 outputs the character string “Please speak a little more clearly.” to the speech synthesis unit 242 as the ask-back. It also outputs to the motion generation unit 244 a motion pattern for a gesture of spreading the arms 116 to the left and right and shaking the head 102 from side to side.

When neither the second ask-back determination unit 218 nor the first ask-back determination unit 234 has instructed an ask-back, the response determination unit 240 determines the response to the interlocutor V based on the character string of the sentence input from the sentence specifying unit 238. The response DB 246 stores, in association with one another, character strings of sentences the interlocutor V may speak, character strings of the appropriate spoken responses, and motion patterns of the appropriate response motions. The response determination unit 240 searches the response DB 246 using the input character string as a key and determines the character string of the appropriate spoken response and the motion pattern of the appropriate response motion. It outputs the determined character string to the speech synthesis unit 242 and the determined motion pattern to the motion generation unit 244.

Based on the character string input from the response determination unit 240, the speech synthesis unit 242 generates the spoken response to the interlocutor V as digital speech data and outputs it to the speech D/A conversion unit 248.

The speech D/A conversion unit 248 D/A-converts the digital speech data input from the speech synthesis unit 242 and outputs the result to the speaker 118. As a result, either an appropriate reply to what the interlocutor V said or an ask-back to the interlocutor V is output as speech from the speaker 118.

Based on the motion pattern input from the response determination unit 240, the motion generation unit 244 drives the actuator group 110 to move the head 102 and the arms 116.

The processing performed by the controller 114 is described with reference to the flowchart of FIG. 8. In step S802, the controller 114 waits until the utterance section extraction unit 212 detects the start of an utterance. When the start of an utterance is detected in step S802, the controller 114 executes in parallel the processing of steps S804 to S814, the processing of step S816, and the processing of step S818.

First, the processing of steps S804 to S814 is described. In step S804, the speech analysis unit 216 frames the speech. In step S806, the speech analysis unit 216 determines the frequency spectrum of each frame. In step S808, the phoneme likelihood calculation unit 220 calculates the likelihood of each phoneme state for each frame. In step S810, the word likelihood calculation unit 224 calculates the likelihood of each word from the per-frame likelihoods of the phoneme states. In step S812, the sentence likelihood calculation unit 228 calculates the likelihood of each sentence from the likelihoods of the words. In step S814, the confidence calculation unit 232 calculates the confidence of each word from the likelihoods of the words and the sentences. Steps S804 to S814 are repeated until the end of the utterance is detected in step S822.

In parallel with the above processing, in step S816 the volume detection unit 214 accumulates the squared sound pressure over the utterance section. This accumulation is repeated until the end of the utterance is detected in step S822.

Also in parallel with the above processing, in step S818 the image recognition unit 206 identifies the interlocutor V. Step S818 is repeated until the end of the utterance is detected in step S822.

In step S822, it is determined whether the utterance section extraction unit 212 has detected the end of the utterance. When the end of the utterance is detected in step S822, the processing from step S824 onward is executed.

In step S824, the second ask-back determination unit 218 determines whether the voice of the interlocutor V is too loud. If it is too loud (YES in step S824), the processing proceeds to step S830, and an ask-back urging the interlocutor V to speak more quietly is performed. If it is not too loud (NO in step S824), the processing proceeds to step S826.

In step S826, the second ask-back determination unit 218 determines whether the voice of the interlocutor V is too quiet. If it is too quiet (YES in step S826), the processing proceeds to step S832, and an ask-back urging the interlocutor V to speak louder is performed. If it is not too quiet (NO in step S826), the processing proceeds to step S828.

In step S828, the first ask-back determination unit 234 determines whether the average confidence of the independent words in the sentence identified by the sentence specifying unit 238 is below the threshold. The threshold used here is set according to the identification result for the interlocutor V obtained in step S818. If the average confidence is below the threshold (YES in step S828), the processing proceeds to step S834, and an ask-back urging the interlocutor V to speak more clearly is performed. If the average confidence is equal to or greater than the threshold (NO in step S828), the processing proceeds to step S836, and an appropriate response to the sentence identified by the sentence specifying unit 238 is performed.
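The decision order of steps S824 to S836 can be summarized in a short sketch. It assumes that loudness is judged by comparing the average volume with an upper and a lower limit; the limit values, the function name, and the returned labels are hypothetical.

```python
def decide_response(average_volume, average_confidence, confidence_threshold,
                    volume_upper=0.8, volume_lower=0.2):
    """Choose the action taken after the end of an utterance."""
    if average_volume > volume_upper:               # S824: voice too loud
        return "ask_quieter"                        # S830
    if average_volume < volume_lower:               # S826: voice too quiet
        return "ask_louder"                         # S832
    if average_confidence < confidence_threshold:   # S828: result ambiguous
        return "ask_clearer"                        # S834
    return "respond"                                # S836
```

The volume checks come first, so an utterance that is too loud or too quiet triggers a volume-related ask-back regardless of the confidence of the recognized sentence.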

Specific examples of the present invention have been described in detail above, but they are merely illustrative and do not limit the scope of the claims. The technology described in the claims includes various modifications and alterations of the specific examples illustrated above.
The technical elements described in this specification or the drawings exhibit technical utility either alone or in various combinations, and are not limited to the combinations recited in the claims as filed. The technology illustrated in this specification or the drawings achieves a plurality of objects simultaneously, and achieving any one of those objects in itself has technical utility.

FIG. 1 is a view showing the appearance of the speech recognition apparatus 100.
FIG. 2 is a diagram schematically showing the configuration of the controller 114.
FIG. 3 is a diagram explaining the detection of the utterance start time TS and end time TE.
FIG. 4 is a diagram explaining the framing of the voice data 302 and the identification of the frequency spectrum.
FIG. 5 is a diagram explaining the likelihood evaluation of the word "budou" (grape).
FIG. 6 is a diagram explaining the likelihood evaluation of a sentence.
FIG. 7 is a diagram illustrating the word connection table 700.
FIG. 8 is a flowchart explaining the processing performed by the controller 114.

Explanation of symbols

100: Speech recognition apparatus
102: Head
104: Right camera
106: Left camera
108: Body
110: Actuator
112: Microphone
114: Controller
116: Arm
118: Speaker
202, 204: Image A/D conversion unit
206: Image recognition unit
208: Person DB
210: Voice A/D conversion unit
212: Utterance section extraction unit
214: Volume detection unit
216: Voice analysis unit
218: Second ask-back determination unit
220: Phoneme likelihood calculation unit
222: Phoneme DB
224: Word likelihood calculation unit
226: Word DB
228: Sentence likelihood calculation unit
230: Sentence DB
232: Certainty factor calculation unit
234: First ask-back determination unit
236: Certainty factor DB
238: Sentence specifying unit
240: Response determination unit
242: Speech synthesis unit
244: Motion generation unit
246: Response DB
248: D/A conversion unit
302: Voice waveform
304: Line of zero sound pressure
502, 504, 506, 508, 510, 512: Point
514, 516: Branch
518: Path
700: Word connection table

Claims

A speech recognition apparatus that recognizes speech spoken by an interlocutor, comprising:
Voice input means for inputting voice and converting it into voice data;
Utterance section extraction means for extracting an utterance section from the voice data;
Voice analysis means for calculating, from the voice data, a time series of voice feature values in the utterance section;
Word likelihood calculating means for calculating a likelihood for each word of a candidate word group from the time series of voice feature values in the utterance section;
Sentence likelihood calculating means for calculating a likelihood for each sentence of a candidate sentence group from the likelihood for each word of the candidate word group;
Certainty factor calculating means for calculating a certainty factor for each word of the candidate word group from the likelihood for each sentence of the candidate sentence group and the likelihood for each word of the candidate word group;
Sentence specifying means for specifying, from the candidate sentence group, the sentence spoken by the interlocutor in the utterance section, based on the certainty factors of the words included in each sentence;
First ask-back determination means for determining whether asking back to the interlocutor is necessary, based on the certainty factors of the words included in the specified sentence; and
Ask-back means for asking back to the interlocutor when it is determined that asking back to the interlocutor is necessary.

The speech recognition apparatus according to claim 1, further comprising:
Volume detection means for detecting the volume in the utterance section from the voice data; and
Second ask-back determination means for determining whether asking back to the interlocutor is necessary, based on the volume in the utterance section.

The speech recognition apparatus according to claim 2, wherein the second ask-back determination means determines that asking back to the interlocutor is necessary when the volume in the utterance section exceeds an upper limit value.

The speech recognition apparatus according to claim 3, wherein the second ask-back determination means determines that asking back to the interlocutor is necessary when the volume in the utterance section is below a lower limit value.

The speech recognition apparatus according to claim 1, further comprising:
Interlocutor identifying means for identifying, from among a group of candidates, the person who is the interlocutor; and
Threshold setting means for setting a threshold value according to the identified person,
wherein the first ask-back determination means determines that asking back to the interlocutor is necessary when the average certainty factor of the independent words among the words included in the specified sentence is below the threshold value.

A speech recognition method for recognizing speech spoken by an interlocutor, comprising:
A voice input step of inputting voice and converting it into voice data;
An utterance section extraction step of extracting an utterance section from the voice data;
A voice analysis step of calculating, from the voice data, a time series of voice feature values in the utterance section;
A word likelihood calculating step of calculating a likelihood for each word of a candidate word group from the time series of voice feature values in the utterance section;
A sentence likelihood calculating step of calculating a likelihood for each sentence of a candidate sentence group from the likelihood for each word of the candidate word group;
A certainty factor calculating step of calculating a certainty factor for each word of the candidate word group from the likelihood for each sentence of the candidate sentence group and the likelihood for each word of the candidate word group;
A sentence specifying step of specifying, from the candidate sentence group, the sentence spoken by the interlocutor in the utterance section, based on the certainty factors of the words included in each sentence;
An ask-back determination step of determining whether asking back to the interlocutor is necessary, based on the certainty factors of the words included in the specified sentence; and
An ask-back step of asking back to the interlocutor when it is determined that asking back to the interlocutor is necessary.