JP2007241304A - Device and method for recognizing voice, and program and recording medium therefor

Info

Publication number: JP2007241304A
Application number: JP2007112035A
Authority: JP
Inventors: Koji Asano; 康治浅野
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2007-04-20
Filing date: 2007-04-20
Publication date: 2007-09-20

Abstract

PROBLEM TO BE SOLVED: To provide a device and method for recognizing voice that can raise the recognition accuracy for a user's voice uttered at a position apart from a microphone, with almost no increase in the computational complexity required for the voice recognition processing, and to provide a program and a recording medium therefor. SOLUTION: A distance calculating part 47 finds the distance from an uttering user to the microphone 21 and supplies the distance to a voice recognition part 41B. The voice recognition part 41B stores sets of acoustic models that take the acoustic environment into account, each set produced from voice data collected from utterances at one of two or more different distances. The voice recognition part 41B then selects the set of acoustic models whose distance is closest to the distance supplied from the distance calculating part 47, and performs voice recognition using that set. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, a program, and a recording medium, and in particular to a speech recognition apparatus, speech recognition method, program, and recording medium that can improve speech recognition accuracy by, for example, performing speech recognition processing using a set of acoustic models corresponding to the distance from the speaker to the speech recognition apparatus.

In recent years, as CPUs (Central Processing Units) have become faster and memories larger, large-vocabulary speech recognition systems that cover vocabularies of tens of thousands of words have been realized using statistical modeling techniques based on large amounts of speech data and text data.

Speech recognition systems, including such large-scale systems, achieve highly accurate recognition for speech uttered close to the microphone that receives the speech to be recognized.

However, for speech uttered at a position away from the microphone, recognition accuracy degrades as the distance from the microphone increases, owing to mixed-in noise and the like.

For example, Miki, Nishiura, Nakamura, and Shikano, "Speech recognition under noise and reverberation by microphone array and HMM decomposition/synthesis," IEICE Transactions D-II, Vol. J83-DII, No. 11, pp. 2206-2214, Nov. 2000 (hereinafter referred to as Document 1 where appropriate) proposes, as a first method, a speech recognition method that uses a microphone array to improve the SN (signal-to-noise) ratio of speech uttered at a position away from the microphone.

Also, for example, Shimizu, Kajita, Takeda, and Itakura, "Space diversity robust speech recognition considering spatial acoustic characteristics," IEICE Transactions D-II, Vol. J83-DII, No. 11, pp. 2448-2456, Nov. 2000 (hereinafter referred to as Document 2 where appropriate) proposes a second method. In this method, multiple microphones are distributed around a room, and the impulse response corresponding to the distance from the sound source to each microphone is convolved with the training speech data. Training on the resulting speech data yields HMMs (Hidden Markov Models) that take the impulse response at each distance into account. Then, for the speech input to each of the microphones, the likelihood of the HMM for each distance is calculated.

However, the first and second methods described above place constraints on microphone installation and may therefore be difficult to apply.

That is, in recent years, robots (which in this specification include stuffed-toy-like ones) have been commercialized as toys and the like that recognize speech uttered by a user and, based on the recognition result, autonomously act, for example by making a gesture or outputting synthesized sound. If a speech recognition apparatus based on the first method is mounted on such a robot, physical constraints such as the spacing between the microphones that make up the microphone array hinder miniaturization of the robot and limit design freedom.

If a speech recognition apparatus based on the second method is mounted on the robot, multiple microphones must be installed in every room where the robot is used, which is not realistic. Furthermore, with the second method, the likelihood of the HMM for each distance must be calculated for the speech input from each of the microphones, so the speech recognition processing requires a large amount of computation.

The present invention has been made in view of such circumstances, and makes it possible to improve the recognition accuracy for a user's speech uttered at a position away from the microphone with almost no increase in the amount of computation required for speech recognition processing.

The first speech recognition apparatus of the present invention comprises: distance calculation means for obtaining the distance to the sound source of speech; storage means storing, for each of a plurality of different distances, a set of acoustic models that takes the acoustic environment into account and that was generated using speech emitted from a sound source at that distance; acquisition means for acquiring the set of acoustic models corresponding to the distance obtained by the distance calculation means, by selecting it from among the sets of acoustic models stored in the storage means; and speech recognition means for recognizing the speech using the set of acoustic models acquired by the acquisition means.

The first speech recognition method of the present invention includes: a distance calculation step of obtaining the distance to the sound source of speech; an acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculation step, by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes the acoustic environment into account and that was generated using speech emitted from a sound source at that distance; and a speech recognition step of recognizing the speech using the set of acoustic models acquired in the acquisition step.

The first program of the present invention causes a computer to perform processing including: a distance calculation step of obtaining the distance to the sound source of speech; an acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculation step, by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes the acoustic environment into account and that was generated using speech emitted from a sound source at that distance; and a speech recognition step of recognizing the speech using the set of acoustic models acquired in the acquisition step.

The first recording medium of the present invention has recorded on it a program that causes a computer to perform processing including: a distance calculation step of obtaining the distance to the sound source of speech; an acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculation step, by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes the acoustic environment into account and that was generated using speech emitted from a sound source at that distance; and a speech recognition step of recognizing the speech using the set of acoustic models acquired in the acquisition step.

The second speech recognition apparatus of the present invention comprises: distance calculation means for obtaining the distance to the sound source of speech; first acquisition means for acquiring tap coefficients that realize an inverse filter of the frequency characteristic corresponding to the distance obtained by the distance calculation means; filter means for filtering the speech using the tap coefficients acquired by the first acquisition means; storage means storing, for each of a plurality of different distances, a set of acoustic models that takes the acoustic environment into account and that was generated using speech emitted from a sound source at that distance; second acquisition means for acquiring the set of acoustic models corresponding to the distance obtained by the distance calculation means, by selecting it from among the sets of acoustic models stored in the storage means; and speech recognition means for recognizing the speech filtered by the filter means, using the set of acoustic models acquired by the second acquisition means.

The second speech recognition method of the present invention includes: a distance calculation step of obtaining the distance to the sound source of speech; a first acquisition step of acquiring tap coefficients that realize an inverse filter of the frequency characteristic corresponding to the distance obtained in the distance calculation step; a filter step of filtering the speech using the tap coefficients acquired in the first acquisition step; a second acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculation step, by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes the acoustic environment into account and that was generated using speech emitted from a sound source at that distance; and a speech recognition step of recognizing the speech filtered in the filter step, using the set of acoustic models acquired in the second acquisition step.

The second program of the present invention causes a computer to perform processing including: a distance calculation step of obtaining the distance to the sound source of speech; a first acquisition step of acquiring tap coefficients that realize an inverse filter of the frequency characteristic corresponding to the distance obtained in the distance calculation step; a filter step of filtering the speech using the tap coefficients acquired in the first acquisition step; a second acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculation step, by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes the acoustic environment into account and that was generated using speech emitted from a sound source at that distance; and a speech recognition step of recognizing the speech filtered in the filter step, using the set of acoustic models acquired in the second acquisition step.

The second recording medium of the present invention has recorded on it a program that causes a computer to perform processing including: a distance calculation step of obtaining the distance to the sound source of speech; a first acquisition step of acquiring tap coefficients that realize an inverse filter of the frequency characteristic corresponding to the distance obtained in the distance calculation step; a filter step of filtering the speech using the tap coefficients acquired in the first acquisition step; a second acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculation step, by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes the acoustic environment into account and that was generated using speech emitted from a sound source at that distance; and a speech recognition step of recognizing the speech filtered in the filter step, using the set of acoustic models acquired in the second acquisition step.

In the first speech recognition apparatus, speech recognition method, and program of the present invention, the distance to the sound source of the speech is obtained, and the set of acoustic models that corresponds to that distance and takes the acoustic environment into account is acquired. The speech is then recognized using the acquired set of acoustic models.
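The selection step described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the `AcousticModelSet` class, the `select_model_set` function, the centimetre unit, and the stored distances are all assumptions made for the example. The nearest-distance rule follows the abstract, which states that the set closest to the supplied distance is selected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcousticModelSet:
    """Hypothetical container for one distance-specific set of acoustic models."""
    distance_cm: float  # distance at which the training speech was collected
    label: str          # stand-in for the actual model parameters


def select_model_set(model_sets, measured_distance_cm):
    """Return the stored set whose training distance is closest to the measured one."""
    if not model_sets:
        raise ValueError("no acoustic model sets are stored")
    return min(model_sets, key=lambda m: abs(m.distance_cm - measured_distance_cm))


# Illustrative stored sets; the distances are made up for this sketch.
STORED_SETS = [
    AcousticModelSet(10.0, "close-talk"),
    AcousticModelSet(100.0, "one metre"),
    AcousticModelSet(300.0, "three metres"),
]

chosen = select_model_set(STORED_SETS, 130.0)
print(chosen.label)
```

Because only one set is chosen before recognition starts, the recognizer scores a single set of models per utterance, which is consistent with the stated aim of keeping the added computation small.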

In the second speech recognition apparatus, speech recognition method, and program of the present invention, the distance to the sound source of the speech is obtained, and tap coefficients that realize an inverse filter of the frequency characteristic corresponding to that distance are acquired. The speech is then filtered using the acquired tap coefficients, and the filtered speech is recognized using the set of acoustic models that corresponds to the distance to the sound source and takes the acoustic environment into account.
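As a rough illustration of the filtering step, the sketch below applies distance-dependent tap coefficients as a plain FIR filter. Several points are assumptions rather than statements from the source: that the taps are looked up from a table keyed by distance, that the nearest stored distance is used, and the tap values themselves, which are invented and do not represent a real inverse filter. The names `TAPS_BY_DISTANCE_CM`, `taps_for_distance`, and `fir_filter` are hypothetical.

```python
# Hypothetical table of tap coefficients per distance (values are made up).
TAPS_BY_DISTANCE_CM = {
    50.0: [1.0],
    100.0: [1.0, -0.5],
    200.0: [1.0, -0.7, 0.1],
}


def taps_for_distance(distance_cm):
    """Pick the taps stored for the distance nearest to the measured one (assumed rule)."""
    nearest = min(TAPS_BY_DISTANCE_CM, key=lambda d: abs(d - distance_cm))
    return TAPS_BY_DISTANCE_CM[nearest]


def fir_filter(samples, taps):
    """Direct-form FIR filter: y[n] = sum over k of taps[k] * x[n - k]."""
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for k, coeff in enumerate(taps):
            if n - k >= 0:  # samples before the start are treated as zero
                acc += coeff * samples[n - k]
        output.append(acc)
    return output


# Filter a unit impulse with the taps chosen for a speaker about 120 cm away.
filtered = fir_filter([1.0, 0.0, 0.0], taps_for_distance(120.0))
print(filtered)
```

The filtered samples would then be passed to the recognizer together with the acoustic model set selected for the same distance.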

According to the first speech recognition apparatus, speech recognition method, and program of the present invention, the distance to the sound source of the speech is obtained, and the set of acoustic models that corresponds to that distance and takes the acoustic environment into account is acquired. The speech is then recognized using the acquired set of acoustic models. Speech recognition accuracy can therefore be improved.

According to the second speech recognition apparatus, speech recognition method, and program of the present invention, the distance to the sound source of the speech is obtained, and tap coefficients that realize an inverse filter of the frequency characteristic corresponding to that distance are acquired. The speech is then filtered using the acquired tap coefficients, and the filtered speech is recognized using the set of acoustic models that corresponds to the distance to the sound source and takes the acoustic environment into account. Speech recognition accuracy can therefore be improved.

FIG. 1 is a perspective view showing an example of the external configuration of a pet robot to which the present invention is applied, and FIG. 2 is a block diagram showing an example of its internal configuration.

In the embodiment of FIG. 1, the pet robot is a four-legged animal-type robot and broadly consists of a body unit 1, leg units 2A, 2B, 2C, and 2D, a head unit 3, and a tail unit 4.

The leg units 2A, 2B, 2C, and 2D, which correspond to legs, are connected at the front and rear, left and right, of the body unit 1, which corresponds to the torso. The head unit 3, corresponding to the head, and the tail unit 4, corresponding to the tail, are connected to the front end and the rear end of the body unit 1, respectively.

A back sensor 1A is provided on the upper surface of the body unit 1. The head unit 3 has a head sensor 3A on its upper part and a chin sensor 3B on its lower part. The back sensor 1A, the head sensor 3A, and the chin sensor 3B are all pressure sensors, and each detects the pressure applied to its location.

The tail unit 4 is attached to the body unit 1 so that it can swing horizontally and vertically.

As shown in FIG. 2, the body unit 1 houses a controller 11, an A/D conversion unit 12, a D/A conversion unit 13, a communication unit 14, a semiconductor memory 15, the back sensor 1A, and the like.

The controller 11 incorporates a CPU 11A that controls the overall operation of the controller 11, and a memory 11B that stores the OS (Operating System) and application programs that the CPU 11A executes to control each unit, along with other necessary data.

The A/D (Analog/Digital) conversion unit 12 A/D-converts the analog signals output by the microphone 21, the CCD cameras 22L and 22R, the back sensor 1A, the head sensor 3A, and the chin sensor 3B into digital signals and supplies them to the controller 11. The D/A (Digital/Analog) conversion unit 13 D/A-converts the digital signal supplied from the controller 11 into an analog signal and supplies it to the speaker 23.

The communication unit 14 controls wireless or wired communication with external devices. As a result, when the OS or an application program is upgraded, the upgraded version can be downloaded via the communication unit 14 and stored in the memory 11B, and a predetermined command can be received by the communication unit 14 and passed to the CPU 11A.

The semiconductor memory 15 is, for example, an EEPROM (Electrically Erasable Programmable Read-only Memory) or the like, and can be inserted into and removed from a slot (not shown) provided in the body unit 1. The semiconductor memory 15 stores, for example, the emotion model and other models described later.

The back sensor 1A is provided on the part of the body unit 1 that corresponds to the back of the pet robot. It detects pressure applied there by the user and outputs a pressure detection signal corresponding to that pressure to the controller 11 via the A/D conversion unit 12.

The body unit 1 also houses, for example, a battery (not shown) that serves as the power source of the pet robot, a circuit that detects the remaining battery charge, and the like.

In the head unit 3, as shown in FIG. 2, the following sensors for sensing external stimuli are provided, for example, at their corresponding locations: a microphone 21 corresponding to an "ear" that senses sound; CCD (Charge Coupled Device) cameras 22L and 22R corresponding to a "left eye" and a "right eye" that sense light; and the head sensor 3A and the chin sensor 3B corresponding to the sense of touch, which sense pressure applied by, for example, the user's touch. The head unit 3 also has a speaker 23, corresponding to the "mouth" of the pet robot, installed, for example, at the corresponding location.

Actuators are installed at the joints of the leg units 2A to 2D, at the connections between each of the leg units 2A to 2D and the body unit 1, at the connection between the head unit 3 and the body unit 1, at the connection between the tail unit 4 and the body unit 1, and so on. The actuators operate each part based on instructions from the controller 11. That is, the actuators move, for example, the leg units 2A to 2D, which makes the robot walk.

The microphone 21 installed in the head unit 3 collects surrounding sound, including utterances from the user, and outputs the resulting audio signal to the controller 11 via the A/D conversion unit 12. The CCD cameras 22L and 22R capture images of the surroundings and output the resulting image signals to the controller 11 via the A/D conversion unit 12. The head sensor 3A on the upper part of the head unit 3 and the chin sensor 3B on the lower part of the head unit 3 detect pressure received through physical actions by the user, such as "stroking" or "hitting", and output the detection result as a pressure detection signal to the controller 11 via the A/D conversion unit 12.

Based on the audio signal, image signals, and pressure detection signals supplied via the A/D conversion unit 12 from the microphone 21, the CCD cameras 22L and 22R, the back sensor 1A, the head sensor 3A, and the chin sensor 3B, the controller 11 determines the surrounding situation, whether there is a command from the user, whether the user is acting on the robot, and so on, and decides the pet robot's next action based on the result. The controller 11 then drives the necessary actuators according to that decision, causing the robot to act, for example by swinging the head unit 3 up, down, left, and right, moving the tail unit 4, or driving the leg units 2A to 2D so that the pet robot walks.

Furthermore, as needed, the controller 11 generates synthesized sound and supplies it to the speaker 23 via the D/A conversion unit 13 for output, or turns on, turns off, or blinks LEDs (Light Emitting Diodes), not shown, provided at the positions of the pet robot's "eyes".

In this way, the pet robot acts autonomously based on the surrounding situation and on the user interacting with it.

Next, FIG. 3 shows an example of the functional configuration of the controller 11 of FIG. 2. The functional configuration shown in FIG. 3 is realized by the CPU 11A executing the OS and application programs stored in the memory 11B. The A/D conversion unit 12 and the D/A conversion unit 13 are omitted from FIG. 3.

The sensor input processing unit 41 of the controller 11 recognizes specific external states, specific actions by the user, instructions from the user, and the like, based on the pressure detection signals, audio signal, image signals, and so on supplied from the back sensor 1A, the head sensor 3A, the chin sensor 3B, the microphone 21, the CCD cameras 22L and 22R, and the like. It then notifies the model storage unit 42 and the action determination mechanism unit 43 of state recognition information representing the recognition result.

That is, the sensor input processing unit 41 has a pressure processing unit 41A, a voice recognition unit 41B, and an image processing unit 41C.

The pressure processing unit 41A processes the pressure detection signals supplied from the back sensor 1A, the head sensor 3A, or the chin sensor 3B. When the processing detects, for example, a pressure at or above a predetermined threshold that lasts a short time, the pressure processing unit 41A recognizes that the robot has been "hit (scolded)". When it detects a pressure below the predetermined threshold that lasts a long time, it recognizes that the robot has been "stroked (praised)". It notifies the model storage unit 42 and the action determination mechanism unit 43 of the recognition result as state recognition information.
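The two recognition rules above can be written as a small Python function. This is only a sketch: the source gives no numeric thresholds, so `pressure_threshold` and `duration_threshold_s` are placeholder values, and the function name `classify_pressure` is hypothetical. The source also does not say what happens for the remaining combinations (strong and long, weak and short), so the sketch returns `None` for them.

```python
def classify_pressure(pressure, duration_s,
                      pressure_threshold=0.5, duration_threshold_s=0.3):
    """Classify a touch as described for the pressure processing unit 41A.

    The threshold defaults are placeholders; the source does not specify values.
    """
    strong = pressure >= pressure_threshold       # at or above the threshold
    brief = duration_s < duration_threshold_s     # "short time" (assumed cut-off)
    if strong and brief:
        return "hit (scolded)"
    if not strong and not brief:
        return "stroked (praised)"
    return None  # combinations the source does not describe


print(classify_pressure(0.9, 0.1))
print(classify_pressure(0.2, 1.5))
```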

The voice recognition unit 41B performs speech recognition on the audio signal supplied from the microphone 21. It then notifies the model storage unit 42 and the action determination mechanism unit 43 of the recognition result, for example commands such as "walk", "lie down", or "chase the ball", as state recognition information. The voice recognition unit 41B is also supplied, by a distance calculation unit 47 described later, with the distance from a sound source such as the user to the microphone 21, and performs speech recognition based on this distance.

The image processing unit 41C performs image recognition processing using the image signals supplied from the CCD cameras 22L and 22R. When, as a result of that processing, it detects, for example, "a red round object" or "a plane perpendicular to the ground and at least a predetermined height", it notifies the model storage unit 42 and the behavior determination mechanism unit 43 of an image recognition result such as "there is a ball" or "there is a wall" as state recognition information.

The model storage unit 42 stores and manages an emotion model, an instinct model, and a growth model, which express the robot's emotional, instinctive, and growth states, respectively.

Here, the emotion model represents the state (degree) of each emotion, such as "joy", "sadness", "anger", and "fun", by a value in a predetermined range (for example, −1.0 to 1.0), and changes that value based on the state recognition information from the sensor input processing unit 41, the passage of time, and the like.

The instinct model represents the state (degree) of each instinctive desire, such as "appetite", "desire for sleep", and "desire for exercise", by a value in a predetermined range, and changes that value based on the state recognition information from the sensor input processing unit 41, the passage of time, and the like.

The growth model represents the state (degree) of growth, such as "childhood", "adolescence", "middle age", and "old age", by a value in a predetermined range, and changes that value based on the state recognition information from the sensor input processing unit 41, the passage of time, and the like.

The model storage unit 42 sends the states of emotion, instinct, and growth, represented as described above by the values of the emotion model, instinct model, and growth model, to the behavior determination mechanism unit 43 as state information.

In addition to the state recognition information supplied from the sensor input processing unit 41, the model storage unit 42 is supplied by the behavior determination mechanism unit 43 with behavior information indicating the pet robot's current or past behavior, specifically the content of a behavior such as "walked for a long time". Even when given the same state recognition information, the model storage unit 42 generates different state information depending on the pet robot's behavior indicated by the behavior information.

For example, when the pet robot greets the user and the user then strokes its head, behavior information indicating that it greeted the user and state recognition information indicating that its head was stroked are given to the model storage unit 42. In this case, the model storage unit 42 increases the value of the emotion model representing "joy".
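
A minimal sketch of such a model is given below, assuming the example range of −1.0 to 1.0. The class name, the label strings, and the increment sizes are hypothetical; the description states only that the "joy" value increases in the greeting example and that the same state recognition information can yield different state information depending on the behavior information.

```python
class EmotionModel:
    """Emotion values kept inside a fixed range, as in the example range above."""

    LOW, HIGH = -1.0, 1.0

    def __init__(self) -> None:
        self.values = {"joy": 0.0, "sadness": 0.0, "anger": 0.0, "fun": 0.0}

    def adjust(self, name: str, delta: float) -> None:
        # Clamp so the value never leaves the predetermined range.
        self.values[name] = max(self.LOW, min(self.HIGH, self.values[name] + delta))

    def update(self, state_recognition: str, behavior: str) -> None:
        # Hypothetical rule: the same input raises "joy" by a different
        # amount depending on what the robot itself was doing.
        if state_recognition == "head stroked":
            delta = 0.5 if behavior == "greeted user" else 0.2
            self.adjust("joy", delta)
```

Repeated updates saturate at the upper bound rather than growing without limit, which is the point of expressing each state as a value in a predetermined range.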

The behavior determination mechanism unit 43 determines the next behavior based on the state recognition information from the sensor input processing unit 41, the state information from the model storage unit 42, the passage of time, and the like, and outputs the content of the determined behavior to the posture transition mechanism unit 44 as behavior command information.

That is, the behavior determination mechanism unit 43 manages, as a behavior model that defines the pet robot's behavior, a finite automaton in which the behaviors the pet robot can take are associated with states. It transitions the state of this finite automaton based on the state recognition information from the sensor input processing unit 41, the values of the emotion model, instinct model, or growth model in the model storage unit 42, the passage of time, and the like, and determines the behavior corresponding to the state after the transition as the behavior to take next.

Here, the behavior determination mechanism unit 43 transitions the state when it detects a predetermined trigger. That is, it transitions the state when, for example, the time spent executing the behavior corresponding to the current state reaches a predetermined time, when specific state recognition information is received, or when the value of the emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 42 falls to or below, or rises to or above, a predetermined threshold.

As described above, the behavior determination mechanism unit 43 transitions the state of the behavior model based not only on the state recognition information from the sensor input processing unit 41 but also on the values of the emotion model, instinct model, and growth model in the model storage unit 42. Consequently, even when the same state recognition information is input, the destination state differs depending on the values of the emotion model, instinct model, and growth model (the state information).

As a result, when the state information indicates, for example, "not angry" and "not hungry", and the state recognition information indicates that "a palm has been held out in front of the eyes", the behavior determination mechanism unit 43 generates behavior command information that causes the robot to "give its paw" in response, and sends it to the posture transition mechanism unit 44.

Likewise, when the state information indicates, for example, "not angry" and "hungry", and the state recognition information indicates that "a palm has been held out in front of the eyes", the behavior determination mechanism unit 43 generates behavior command information that causes the robot to "lick the palm" in response, and sends it to the posture transition mechanism unit 44.
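
The dependence of the chosen behavior on both inputs can be illustrated with a small sketch. It covers only the two palm examples given in the text; the function name, the dictionary keys, and the label strings are hypothetical, and a full behavior model would be a finite automaton with many more states and triggers.

```python
from typing import Optional


def decide_behavior(state_info: dict, state_recognition: str) -> Optional[str]:
    """Pick a behavior from internal state plus the latest recognition result."""
    if state_recognition == "palm held out":
        if not state_info["angry"] and not state_info["hungry"]:
            return "give paw"
        if not state_info["angry"] and state_info["hungry"]:
            return "lick palm"
    # Other combinations are not specified in the description.
    return None
```

The same recognition result ("palm held out") leads to different behavior command information depending solely on the state information, which is the property the examples above describe.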

The behavior determination mechanism unit 43 can also be made to determine, based on the emotion, instinct, and growth states indicated by the state information supplied from the model storage unit 42, parameters of the behavior corresponding to the destination state, such as the walking speed or the magnitude and speed of limb movements. In this case, behavior command information including those parameters is sent to the posture transition mechanism unit 44.

In addition to the behavior command information described above for moving the pet robot's head, limbs, and so on, the behavior determination mechanism unit 43 also generates, as needed, behavior command information that causes the pet robot to speak. Behavior command information for speech is supplied to the voice synthesis unit 46. On receiving it, the voice synthesis unit 46 performs speech synthesis in accordance with that information and outputs the resulting synthesized sound from the speaker 23.

Based on the behavior command information supplied from the behavior determination mechanism unit 43, the posture transition mechanism unit 44 generates posture transition information for moving the pet robot from its current posture to the next posture, and sends it to the control mechanism unit 45.

Here, the postures that can be reached next from the current posture are determined by the physical form of the pet robot, such as the shape and weight of the torso, arms, and legs and the way the parts are connected, and by the actuator mechanisms, such as the directions and angles in which the joints bend.

The next posture may be one that can be reached directly from the current posture or one that cannot. For example, a four-legged pet robot can transition directly from lying sprawled with its limbs thrown out to a lying-down posture, but it cannot transition directly to a standing posture; it needs a two-step motion in which it first pulls its limbs in close to its torso to assume the lying-down posture and then stands up. There are also postures that cannot be executed safely. For example, a four-legged pet robot standing on all four legs will easily fall over if it tries to raise both front legs in a banzai pose.

For this reason, the posture transition mechanism unit 44 registers in advance the postures that can be reached directly, and when the behavior command information supplied from the behavior determination mechanism unit 43 indicates a directly reachable posture, it sends that behavior command information to the control mechanism unit 45.

When, on the other hand, the behavior command information indicates a posture that cannot be reached directly, the posture transition mechanism unit 44 generates posture transition information that first moves the robot to another reachable posture and then to the target posture, and sends it to the control mechanism unit 45. This prevents the robot from trying to force a posture it cannot reach and from falling over.
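
One way to realize this routing is a breadth-first search over the registered direct transitions, sketched below. The posture names and the transition table are hypothetical examples modeled on the sprawled, lying-down, and standing postures mentioned in the text, and the description does not state which search method is actually used.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical registry of directly reachable postures.
DIRECT: Dict[str, List[str]] = {
    "sprawled": ["lying down"],
    "lying down": ["sprawled", "sitting", "standing"],
    "sitting": ["lying down", "standing"],
    "standing": ["lying down", "sitting", "walking"],
    "walking": ["standing"],
}


def plan_posture_path(current: str, target: str) -> Optional[List[str]]:
    """Return the shortest chain of registered postures from current to target."""
    queue = deque([[current]])
    visited = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in DIRECT.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # unregistered (for example, unsafe) postures are never produced
```

With this table, a request to stand while sprawled yields the two-step route through the lying-down posture, and a posture that is not registered, such as a banzai pose, yields no route at all.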

The control mechanism unit 45 generates control signals for driving the actuators in accordance with the posture transition information from the posture transition mechanism unit 44, and sends them to the actuators of the respective parts.

The voice synthesis unit 46 receives behavior command information from the behavior determination mechanism unit 43, performs, for example, rule-based speech synthesis in accordance with it, and supplies the resulting synthesized sound to the speaker 23 for output.

The distance calculation unit 47 is supplied with the image signals output by the CCD cameras 22L and 22R. By performing stereo processing (processing by the stereo matching method) on these image signals, it obtains the distance from a sound source, such as the user appearing in the images captured by the CCD cameras 22L and 22R, to the microphone 21, and supplies that distance to the voice recognition unit 41B.

Here, the stereo processing performed by the distance calculation unit 47 matches pixels between multiple images obtained by photographing the same object with cameras from two or more directions (different lines of sight), and thereby obtains the parallax information between corresponding pixels and the distance from the cameras to the object.

That is, let the CCD cameras 22L and 22R be called the reference camera 22L and the detection camera 22R, respectively, and let the images they output be called the reference camera image and the detection camera image. When, as shown in FIG. 4, for example, the reference camera 22L and the detection camera 22R photograph the user as the imaging target, a reference camera image containing a projected image of the user is obtained from the reference camera 22L, and a detection camera image containing a projected image of the user is obtained from the detection camera 22R. If a point P on, say, the user's mouth appears in both images, parallax information can be obtained from the position at which P appears in the reference camera image and the position at which it appears in the detection camera image, that is, the corresponding point (corresponding pixel). Furthermore, the position of P in three-dimensional space (its three-dimensional position) can be obtained using the principle of triangulation.

Stereo processing therefore first requires detecting corresponding points. One method of doing so is, for example, area-based matching using the epipolar line.

That is, as shown in FIG. 5, in the reference camera 22L, the point P on the user is projected onto the intersection n_a between the imaging surface S₁ of the reference camera 22L and the straight line L connecting the point P and the optical center (lens center) O₁ of the reference camera 22L.

Similarly, in the detection camera 22R, the point P on the user is projected onto the intersection n_b between the imaging surface S₂ of the detection camera 22R and the straight line connecting the point P and the optical center (lens center) O₂ of the detection camera 22R.

In this case, the straight line L is projected onto the imaging surface S₂ as the line of intersection L₂ between the imaging surface S₂, on which the detection camera image is formed, and the plane passing through the three points consisting of the optical centers O₁ and O₂ and the point n_a (or the point P). Since the point P lies on the straight line L, the point n_b, which is the projection of P onto the imaging surface S₂, lies on the straight line L₂, which is the projection of L. This straight line L₂ is called the epipolar line. That is, the corresponding point n_b of the point n_a can exist only on the epipolar line L₂, so the search for n_b need only cover the epipolar line L₂.

Here, an epipolar line can be considered for each pixel of the reference camera image formed on the imaging surface S₁. If the positional relationship between the reference camera 22L and the detection camera 22R is known, the epipolar line for each pixel can be obtained, for example, by calculation.

The corresponding point n_b can be detected from among the points on the epipolar line L₂ by, for example, area-based matching as follows.

That is, in area-based matching, as shown in FIG. 6(A), a small block, for example rectangular, centered on the point n_a in the reference camera image (the center being, for example, the intersection of the diagonals) is extracted from the reference camera image (hereinafter referred to as the reference block as appropriate). In addition, as shown in FIG. 6(B), a small block of the same size as the reference block, centered on a point on the epipolar line L₂ projected onto the detection camera image, is extracted from the detection camera image (hereinafter referred to as the detection block as appropriate).

Here, in the embodiment of FIG. 6(B), six points n_b1 to n_b6 are provided on the epipolar line L₂ as the points on which detection blocks are centered. These six points are the projections, onto the imaging surface S₂ of the detection camera 22R, of the points that divide the straight line L in the three-dimensional space of FIG. 5 at a predetermined constant interval, that is, the points at distances of, for example, 1 m, 2 m, 3 m, 4 m, 5 m, and 6 m from the reference camera 22L. They therefore correspond to the points at distances of 1 m, 2 m, 3 m, 4 m, 5 m, and 6 m from the reference camera 22L, respectively.

In area-based matching, detection blocks centered on each of the points n_b1 to n_b6 provided on the epipolar line L₂ are extracted from the detection camera image, and the correlation between each detection block and the reference block is computed using a predetermined evaluation function. The center point n_b of the detection block that correlates most highly with the reference block centered on the point n_a is then taken as the corresponding point of n_a.

That is, suppose that a function whose value becomes smaller as the correlation becomes higher is used as the evaluation function, and that evaluation values (values of the evaluation function) such as those shown in FIG. 7 are obtained for the points n_b1 to n_b6 on the epipolar line L₂. In this case, the point n_b3, which has the smallest evaluation value (the highest correlation), is detected as the corresponding point of n_a. It is also possible to interpolate using the evaluation values near the minimum among those obtained for the points n_b1 to n_b6 (marked ● in FIG. 7), find a point at which the evaluation value is still smaller (marked × in FIG. 7), and detect that point as the final corresponding point.
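
The search over candidate points can be sketched as follows, using the sum of absolute differences as the evaluation function (smaller means higher correlation). Images are plain nested lists indexed as `image[y][x]`, and all function names are hypothetical. The sub-sample refinement fits a parabola through the minimum and its two neighbors, which is one common choice and an assumption here, since the text says only that interpolation is performed.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]


def extract_block(image: Image, cx: int, cy: int, half: int) -> Image:
    """Cut a (2*half+1)-square block centered on (cx, cy)."""
    return [row[cx - half:cx + half + 1] for row in image[cy - half:cy + half + 1]]


def sad(block_a: Image, block_b: Image) -> int:
    """Sum of absolute pixel differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def find_corresponding_point(ref_img: Image, det_img: Image,
                             ref_point: Tuple[int, int],
                             candidates: Sequence[Tuple[int, int]],
                             half: int = 1) -> Tuple[int, List[int]]:
    """Return the index of the best candidate and all evaluation values."""
    ref_block = extract_block(ref_img, ref_point[0], ref_point[1], half)
    scores = [sad(ref_block, extract_block(det_img, x, y, half))
              for x, y in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


def refine_minimum(scores: Sequence[float], best: int) -> float:
    """Parabolic interpolation around the minimum; returns a fractional index."""
    if best == 0 or best == len(scores) - 1:
        return float(best)
    left, mid, right = scores[best - 1], scores[best], scores[best + 1]
    denom = left - 2 * mid + right
    if denom == 0:
        return float(best)
    return best + 0.5 * (left - right) / denom
```

Here `candidates` plays the role of the points n_b1 to n_b6 on the epipolar line, and the fractional index returned by `refine_minimum` corresponds to the × mark in FIG. 7.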

In the embodiment of FIG. 6, as described above, the points set on the epipolar line are the projections, onto the imaging surface S₂ of the detection camera 22R, of the points that divide the straight line L in three-dimensional space at a predetermined equal interval. These points can be set, for example, when the reference camera 22L and the detection camera 22R are calibrated. If this is done for the epipolar line of every pixel of the imaging surface S₁ of the reference camera 22L, and a set point/distance table that associates the points set on the epipolar line (hereinafter referred to as set points as appropriate) with distances from the reference camera 22L is created in advance as shown in FIG. 8(A), then the distance from the reference camera 22L (the distance to the user) can be obtained immediately by detecting the set point that is the corresponding point and referring to the set point/distance table. In other words, the distance can be obtained directly from the corresponding point.
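
A set point/distance table for a single epipolar line might look like the sketch below. The table contents follow the 1 m to 6 m example given for FIG. 6(B); the function name and the handling of fractional positions (linear interpolation between neighboring set points, for use with an interpolated corresponding point) are assumptions.

```python
# Hypothetical table for one epipolar line: set point index -> distance in meters.
SET_POINT_DISTANCES_M = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]


def distance_at(position: float) -> float:
    """Look up the distance for a set point index, allowing fractional positions."""
    last = len(SET_POINT_DISTANCES_M) - 1
    if not 0 <= position <= last:
        raise ValueError("position lies outside the table")
    lower = int(position)
    upper = min(lower + 1, last)
    frac = position - lower
    low_d, up_d = SET_POINT_DISTANCES_M[lower], SET_POINT_DISTANCES_M[upper]
    return low_d + frac * (up_d - low_d)
```

In a full system there would be one such table per pixel of the reference camera image, prepared at calibration time.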

On the other hand, once the corresponding point n_b in the detection camera image has been detected for the point n_a in the reference camera image, the parallax (parallax information) between the two points n_a and n_b can be obtained. Furthermore, if the positional relationship between the reference camera 22L and the detection camera 22R is known, the distance to the user can be obtained from the parallax between n_a and n_b by the principle of triangulation. The distance can be calculated from the parallax by performing a predetermined computation. However, if that computation is carried out beforehand and a parallax/distance table that associates the parallax ζ with the distance is created in advance as shown in FIG. 8(B), then the distance from the reference camera 22L can again be obtained immediately by detecting the corresponding point, obtaining the parallax, and referring to the parallax/distance table.
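
For illustration, the sketch below computes distance from parallax under a simplifying assumption that the description does not make: two identical, rectified cameras with parallel optical axes, for which triangulation reduces to Z = f·B/ζ (f: focal length in pixels, B: baseline between the optical centers, ζ: parallax in pixels). The function names and parameter values are hypothetical.

```python
from typing import Dict


def distance_from_parallax(parallax_px: float, focal_px: float,
                           baseline_m: float) -> float:
    """Triangulated distance for rectified parallel cameras: Z = f * B / parallax."""
    if parallax_px <= 0:
        raise ValueError("parallax must be positive")
    return focal_px * baseline_m / parallax_px


def build_parallax_table(max_parallax_px: int, focal_px: float,
                         baseline_m: float) -> Dict[int, float]:
    """Precompute the parallax/distance table once, for example at calibration."""
    return {d: distance_from_parallax(d, focal_px, baseline_m)
            for d in range(1, max_parallax_px + 1)}
```

At run time a table lookup such as `table[parallax]` then replaces the division, which corresponds to the role of the parallax/distance table of FIG. 8(B).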

Here, the parallax and the distance to the user correspond one to one, so obtaining the parallax is, so to speak, equivalent to obtaining the distance to the user.

Blocks made up of multiple pixels, such as the reference block and the detection block, are used for detecting corresponding points in order to reduce the influence of noise and to make detection reliable by judging the correlation between the pattern of pixels surrounding the pixel (point) n_a in the reference camera image and the pattern of pixels surrounding the corresponding point (pixel) n_b in the detection camera image. In particular, for reference and detection camera images with little variation, the larger the block, the more reliable the detection of corresponding points, owing to the correlation within the images.

In area-based matching, the evaluation function used to assess the correlation between the reference block and the detection block can be, for example, the sum of the absolute values of the differences between the pixel values of the pixels in the reference block and the corresponding pixels in the detection block, the sum of the squares of those differences, or the normalized cross correlation.
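
The three evaluation functions can be written as below for blocks flattened into equal-length lists of pixel values. The function names are hypothetical, and the normalized cross correlation shown is the zero-mean variant, which is one of several definitions in use.

```python
import math
from typing import Sequence


def sum_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences; smaller means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sum_sq_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences; smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalized_cross_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Zero-mean NCC in [-1, 1]; larger means more similar."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0
```

Note the opposite sense of the last function: to use it where a smaller value must mean higher correlation, a value such as `1 - normalized_cross_correlation(a, b)` can be used instead.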

Stereo processing has been described only briefly above. Stereo processing (the stereo matching method) is also described in, for example, Agui and Nagao, "Introduction to Image Processing in C" (Ｃ言語による画像処理入門), Shokodo, p. 127.

Next, FIG. 9 shows a configuration example of the voice recognition unit 41B of FIG. 3.

The audio data input to the voice recognition unit 41B via the microphone 21 and the A/D conversion unit 12 of FIG. 2 is supplied to the feature extraction unit 101 and the speech section detection unit 107.

The feature extraction unit 101 performs acoustic analysis on the audio data from the A/D conversion unit 12 frame by frame, using frames of an appropriate length, and thereby extracts feature vectors as feature quantities such as MFCCs (Mel Frequency Cepstrum Coefficients). The feature extraction unit 101 can also extract other feature vectors (feature parameters), such as spectra, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs.

The feature vectors obtained for each frame by the feature extraction unit 101 are supplied sequentially to the feature vector buffer 102 and stored there. The feature vector buffer 102 thus stores the per-frame feature vectors in time series.

The feature vector buffer 102 stores, for example, the time series of feature vectors obtained from the start to the end of an utterance (a speech section).

Using the feature vectors stored in the feature vector buffer 102, the matching unit 103 recognizes the speech input to the microphone 21 (the input speech) based on, for example, the continuous distribution HMM method, referring as needed to the acoustic model databases 104_n (n = 1, 2, ..., N, where N is an integer of 2 or more), the dictionary database 105, and the grammar database 106.

That is, each acoustic model database 104_n stores a set of acoustic models representing the acoustic features of each predetermined unit (PLU (Phonetic-Linguistic-Units)), such as individual phonemes or syllables, in the language of the speech to be recognized. Because speech recognition here is based on the continuous distribution HMM method, an HMM (Hidden Markov Model) using a probability density function such as a Gaussian distribution is used as the acoustic model. The dictionary database 105 stores a word dictionary describing, for each word (vocabulary item) to be recognized, information about its pronunciation (phonological information). The grammar database 106 stores grammar rules (a language model) describing how the words registered in the word dictionary of the dictionary database 105 chain together (connect). As the grammar rules, rules based on, for example, a context-free grammar (CFG), a regular grammar (RG), or statistical word chain probabilities (N-gram) can be used.

By referring to the word dictionary of the dictionary database 105, the matching unit 103 connects acoustic models stored in the acoustic model database 104_n to form an acoustic model of each word (a word model). It further connects several word models by referring to the grammar rules stored in the grammar database 106 and, using the word models connected in this way, matches them against the time-series feature vectors by the continuous distribution HMM method to recognize the speech input to the microphone 21. That is, the matching unit 103 calculates, for each sequence of word models formed as described above, a score representing the likelihood that the time-series feature vectors stored in the feature vector buffer 102 are observed. The matching unit 103 then detects, for example, the sequence of word models with the highest score and outputs the word string corresponding to that sequence as the speech recognition result.

Because speech recognition is performed here by the HMM method, the matching unit 103, on the acoustic side, accumulates the appearance probabilities of the feature vectors for the word string corresponding to the connected word models and uses the accumulated value as the score.

That is, the matching unit 103 calculates the score by jointly evaluating an acoustic score given by the acoustic models stored in the acoustic model database 104_n (hereinafter referred to as the acoustic score as appropriate) and a linguistic score given by the grammar rules stored in the grammar database 106 (hereinafter referred to as the language score as appropriate).

Specifically, in the case of the HMM method, for example, the acoustic score is computed for each word based on the probability that the sequence of feature vectors output by the feature extraction unit 101 is observed from (emitted by) the acoustic models constituting the word model. In the case of a bigram, for example, the language score is obtained based on the probability that the word of interest follows the word immediately preceding it. The speech recognition result is then determined based on a final score (hereinafter referred to as the final score where appropriate) obtained by comprehensively evaluating the acoustic score and the language score of each word.
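The following is a minimal sketch of how an acoustic score and a bigram language score might be combined into a final score. The description only states that the two are "comprehensively evaluated"; the weighted sum of log probabilities, the function name `final_score`, the `lm_weight` parameter, and the toy numbers are all assumptions made for illustration.

```python
import math

def final_score(words, acoustic_logps, bigram_logp, lm_weight=1.0):
    """Sum per-word acoustic log scores and weighted bigram log probabilities.

    The weighted log-domain sum is an assumed combination rule,
    not one stated in the description.
    """
    total = 0.0
    prev = "<s>"  # hypothetical sentence-start symbol
    for word, acoustic in zip(words, acoustic_logps):
        total += acoustic + lm_weight * bigram_logp[(prev, word)]
        prev = word
    return total

# Toy bigram table and acoustic scores (illustrative values only).
bigram_logp = {
    ("<s>", "hello"): math.log(0.5),
    ("hello", "robot"): math.log(0.25),
}
score = final_score(["hello", "robot"], [-10.0, -12.0], bigram_logp)
```

Among competing word strings, the one with the largest such score would be output as the recognition result.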

The speech recognition unit 41B can also be configured without the grammar database 106. However, the rules stored in the grammar database 106 restrict which word models may be connected, which limits the number of words for which the matching unit 103 must compute acoustic scores. This reduces the amount of computation in the matching unit 103 and increases the processing speed.

In the embodiment of FIG. 9, N acoustic model databases 104₁, 104₂, ..., 104_N are provided. The N acoustic model databases 104₁ to 104_N each store a set of acoustic models for one of a plurality of different distances, generated from speech emitted by sound sources located at those distances from a microphone.

That is, for example, let D₁, D₂, ..., D_N (where D₁ < D₂ < ... < D_N) be the distances from the microphone to the speaker of the training speech, who is the sound source. The speech of a speaker speaking from positions at the distances D₁, D₂, ..., D_N from the microphone is recorded with that microphone, and training is performed using the speech data recorded for each distance. The resulting sets of acoustic models (here, HMMs) for the distances D₁, D₂, ..., D_N are stored in the acoustic model databases 104₁, 104₂, ..., 104_N, respectively.

Accordingly, the acoustic model database 104_n stores the set of acoustic models generated from the speech data of a speaker who spoke from a position at the distance D_n from the microphone.

The minimum value D₁ of the distances D₁ to D_N can be, for example, 0 (in practice, the state in which the user's mouth is close to the microphone). The maximum value D_N can be, for example, a statistic of the largest distance from which users are expected to talk to the robot (for example, the average of the distances given by many users in a survey asking from how far away, at most, they would talk to the robot). The other distances D₂, D₃, ..., D_N-1 can be, for example, distances that divide the distance D_N equally.
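As an illustration of the equal division mentioned above, the sketch below generates D₁ to D_N with D₁ = 0 and D_N equal to a given maximum. The function name `training_distances` and the example values (a 3 m maximum, four distances) are assumptions, not values from the description.

```python
def training_distances(d_max, n):
    """Return [D_1, ..., D_N] with D_1 = 0, D_N = d_max, and equal spacing."""
    if n < 2:
        raise ValueError("at least two distances are needed")
    step = d_max / (n - 1)
    return [i * step for i in range(n)]

# Example: maximum expected talking distance of 3 m, N = 4 model sets.
distances = training_distances(3.0, 4)
```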

The set of acoustic models for the distance D_n stored in the acoustic model database 104_n can be generated from speech data of utterances actually made at a position at the distance D_n from the microphone. Alternatively, it can be generated from speech data obtained by convolving the impulse response of the space between the microphone and a position at the distance D_n from it with speech data recording utterances made close to the microphone (with the distance to the microphone taken as 0), for example speech data recorded with a headset microphone. Obtaining, by means of an impulse response, speech data equivalent to recording with a microphone speech uttered at a position a given distance away is described, for example, in Document 2 mentioned above.
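The convolution-based generation of distant speech data can be sketched as follows. This is a minimal illustration using NumPy; the function name `simulate_distant_speech` and the toy impulse response (a delayed, attenuated direct path plus one weaker reflection) are assumptions, and a measured impulse response would be used in practice.

```python
import numpy as np

def simulate_distant_speech(close_talk, impulse_response):
    """Convolve close-talk speech with the room impulse response for distance D_n."""
    return np.convolve(close_talk, impulse_response)

# Toy close-talk signal and toy impulse response (illustrative values only).
close_talk = np.array([1.0, 0.5, -0.25])
h_n = np.array([0.0, 0.0, 0.6, 0.0, 0.2])
distant = simulate_distant_speech(close_talk, h_n)
```

The output has `len(close_talk) + len(h_n) - 1` samples, and training the models for distance D_n would then proceed on such simulated data.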

The speech segment detection unit 107 detects speech segments based on the output of the A/D conversion unit 12 and supplies a message representing the detection result to the selection control unit 108. One method of detecting a speech segment is, for example, to compute the power of the output of the A/D conversion unit 12 for each predetermined frame and determine whether that power is equal to or greater than a predetermined threshold.
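A frame-power detector of the kind described above might look like the following sketch. The function name `detect_speech_frames`, the frame length of 160 samples, and the threshold value are assumptions chosen for the example.

```python
import numpy as np

def detect_speech_frames(samples, frame_len, threshold):
    """Return one boolean per frame: True where mean power >= threshold."""
    samples = np.asarray(samples, dtype=float)
    n_frames = len(samples) // frame_len
    # Drop the incomplete trailing frame, then split into rows of frame_len samples.
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    power = np.mean(frames ** 2, axis=1)
    return power >= threshold

# One silent frame followed by one frame of constant amplitude 0.5 (power 0.25).
samples = np.concatenate([np.zeros(160), 0.5 * np.ones(160)])
flags = detect_speech_frames(samples, frame_len=160, threshold=0.01)
```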

On receiving a message from the speech segment detection unit 107 indicating a speech segment, the selection control unit 108 requests the distance calculation unit 47 (FIG. 3) to calculate the distance from the microphone 21 to the user who is speaking, and receives the distance supplied by the distance calculation unit 47 in response. The selection control unit 108 then controls the selector 109 based on the distance received from the distance calculation unit 47.

In the embodiment of FIG. 3, the distance calculation unit 47 calculates the distance from the CCD camera 22L or 22R to the speaking user by the stereo processing described above. The CCD cameras 22L and 22R are installed close to the microphone 21, so the distance from the CCD camera 22L or 22R to the speaking user is regarded as the distance from the microphone 21 to the speaking user. When the positional relationship between the CCD cameras 22L and 22R and the microphone 21 is known, however, the distance from the microphone 21 to the user can be obtained from the distance from the CCD camera 22L or 22R to the user.

Under the control of the selection control unit 108, the selector 109 selects one acoustic model database 104_n from among the N acoustic model databases 104₁ to 104_N. The selector 109 then acquires the set of acoustic models for the distance D_n stored in the selected acoustic model database 104_n and provides it to the matching unit 103. The matching unit 103 thus computes acoustic scores using the acoustic models for the distance D_n acquired by the selector 109.

Next, the speech recognition processing performed by the speech recognition unit 41B of FIG. 9 is described with reference to the flowchart of FIG. 10.

First, in step S1, the speech segment detection unit 107 determines whether there has been speech input from the user. That is, the speech segment detection unit 107 determines whether the current segment is a speech segment; if it is, the unit determines that there has been speech input from the user, and if it is not, the unit determines that there has been no speech input from the user.

If it is determined in step S1 that there has been no speech input, steps S2 to S5 are skipped and the processing proceeds to step S6.

If it is determined in step S1 that there has been speech input, the speech segment detection unit 107 has detected a speech segment and supplied a message to that effect to the selection control unit 108, the feature extraction unit 101 has started extracting feature vectors from the speech data of the speech segment, and the feature vector buffer 102 has started storing those feature vectors. In this case the processing proceeds to step S2, where the selection control unit 108 requests the distance calculation unit 47 (FIG. 3) to calculate the distance to the speaking user. In step S2, the distance calculation unit 47 accordingly calculates the distance to the speaking user and supplies it to the selection control unit 108.

Since users are generally expected to talk to the robot mostly from its front, the CCD cameras 22L and 22R, which image the user in order to calculate the distance to the user, are assumed to be installed in the head unit 3 (FIG. 2) so that their imaging direction is the frontal direction of the robot.

In this case, if the user talks to the robot from a direction other than its front, for example from the side or from behind, the CCD cameras 22L and 22R cannot image the user. One possible approach is therefore to use, as the microphone 21, a microphone whose directivity points in the same direction as the imaging direction of the CCD cameras 22L and 22R, and to move the head unit 3 toward the direction in which the speech level input to the microphone 21 is greatest, so that the CCD cameras 22L and 22R can image the user.

Alternatively, the robot can be provided with a plurality of microphones, and the direction of the sound source can be estimated from the power differences and phase differences of the speech signals reaching those microphones. The head unit 3 is then moved toward that direction, that is, toward the microphone that yields the greatest speech level, so that the CCD cameras 22L and 22R can image the user. When the robot is provided with a plurality of microphones, the speech data output by, for example, the microphone that yields the greatest speech level (basically the microphone facing forward once the robot has turned toward the user) is the target of speech recognition.

For the distance calculation unit 47 of FIG. 3 to calculate the distance to the user by stereo processing of the images obtained from the CCD cameras 22L and 22R, the pixels showing the user (hereinafter referred to as user pixels where appropriate) must be detected in the images output by the CCD cameras 22L and 22R. For example, pixels showing a predetermined color, such as a skin color, can be detected as user pixels. Alternatively, the user's face can be imaged in advance with the CCD camera 22L or 22R, and user pixels can be detected by image recognition using that face image as a standard pattern.

On receiving the distance to the user from the distance calculation unit 47 (FIG. 3), the selection control unit 108 proceeds to step S3, detects the distance D_n closest to the distance to the user from among the N distances D₁ to D_N described above, and controls the selector 109 to select the acoustic model database 104_n that stores the set of acoustic models for that distance D_n. In step S3, under the control of the selection control unit 108, the selector 109 thus selects the acoustic model database 104_n, acquires the set of acoustic models for the distance D_n closest to the distance to the user, and supplies it to the matching unit 103. The processing then proceeds to step S4.
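The nearest-distance selection of step S3 can be sketched as follows. The function name `select_model_index` and the example distances are assumptions; the description also does not say how ties are resolved, so returning the smaller index on a tie is an arbitrary choice made here.

```python
def select_model_index(distances, measured):
    """Return the index n whose D_n is closest to the measured distance.

    On a tie, min() keeps the first (smaller) index; this tie rule is
    an assumption, not something the description specifies.
    """
    return min(range(len(distances)), key=lambda i: abs(distances[i] - measured))

distances = [0.0, 1.0, 2.0, 3.0]   # hypothetical D_1..D_N in metres
n = select_model_index(distances, measured=1.7)
```

The selector would then fetch the model set stored for `distances[n]`. Because only one of the N sets is used for matching, the selection adds a single linear scan over N values to the recognition cost.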

In step S4, the matching unit 103 uses the feature vectors extracted from the speech data of the speech segment and stored in the feature vector buffer 102, and refers to the set of acoustic models for the distance D_n supplied by the selector 109, the word dictionary stored in the dictionary database 105, and the grammar rules stored in the grammar database 106. It thereby computes the language score and the acoustic score of each word string (or word) that is a candidate for the speech recognition result, obtains the final score, and determines the word string (or word) with the largest final score as the speech recognition result.

The processing then proceeds to step S5, where the matching unit 103 outputs the speech recognition result determined in step S4, and then proceeds to step S6.

In step S6, it is determined whether to end the speech recognition processing. If it is determined not to end, the processing returns to step S1 and the same processing is repeated.

If it is determined in step S6 that the speech recognition processing is to end, for example because the user has turned off the power of the robot, the speech recognition processing ends.

As described above, the distance to the user who spoke is calculated, and speech recognition is performed using the set of acoustic models generated from speech data recording speech uttered at a position at the distance D_n closest to that distance (the set of acoustic models for the distance D_n). The recognition accuracy for speech uttered by a user at a position away from the microphone can therefore be improved.

That is, speech recognition accuracy can be improved because recognition uses a set of acoustic models trained on speech data recorded in an acoustic environment close to the one in which the user spoke.

Furthermore, because speech recognition is performed with a single set selected from the N sets of acoustic models, the amount of computation required for the speech recognition processing hardly increases.

The acoustic environment varies not only with the distance from the microphone 21 to the user but also with the noise level, the reverberation characteristics, and other factors. Using sets of acoustic models that take these factors into account can further improve speech recognition accuracy.

In the embodiment of FIG. 9, the set of acoustic models corresponding to the distance to the user (the set for the distance closest to the distance to the user) is selected from the acoustic model databases 104₁ to 104_N, which store the sets of acoustic models for the distances D₁ to D_N. The set of acoustic models corresponding to the distance to the user can also be obtained in other ways, for example over a network.

Next, FIG. 11 is a block diagram showing another example of the internal configuration of the pet-type robot of FIG. 1. In the figure, parts corresponding to those in FIG. 2 are given the same reference numerals, and their description is omitted below where appropriate. The pet-type robot of FIG. 11 is configured basically in the same way as that of FIG. 2, except that an ultrasonic sensor 111 is additionally provided in the head unit 3.

The ultrasonic sensor 111 has a sound source and a microphone (neither shown) and, as shown in FIG. 12, emits an ultrasonic pulse from the sound source. The ultrasonic sensor 111 receives with its microphone the wave reflected back when the ultrasonic pulse strikes an obstacle, obtains the time from emitting the ultrasonic pulse to receiving the reflected wave (hereinafter referred to as the lag time where appropriate), and supplies it to the controller 11.

Next, FIG. 13 shows an example of the functional configuration of the controller 11 of FIG. 11. In the figure, parts corresponding to those in FIG. 3 are given the same reference numerals, and their description is omitted below where appropriate. The controller 11 of FIG. 13 is configured in the same way as that of FIG. 3, except that the distance calculation unit 47 is supplied with the output of the ultrasonic sensor 111 instead of the outputs of the CCD cameras 22L and 22R.

In the embodiment of FIG. 13, the distance calculation unit 47 calculates the distance to the user based on the output of the ultrasonic sensor 111.

That is, in the embodiment of FIG. 13, the direction of the speaking user is recognized by, for example, one of the methods described above (using a directional microphone as the microphone 21, using a plurality of microphones, or using image recognition), and the head unit 3 is moved so that the sound source of the ultrasonic sensor 111 faces the user. The ultrasonic sensor 111 then emits an ultrasonic pulse toward the user, receives the reflected wave, obtains the lag time, and supplies it to the distance calculation unit 47. The distance calculation unit 47 calculates the distance to the user based on the lag time supplied by the ultrasonic sensor 111 and supplies it to the speech recognition unit 41B. Thereafter, the speech recognition unit 41B performs the same processing as described with reference to FIGS. 9 and 10.
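The description does not give the formula by which the lag time is converted to a distance. A common time-of-flight conversion, shown below as an assumption, halves the round-trip path travelled at the speed of sound. The function name `distance_from_lag` and the constant 343 m/s (approximately the speed of sound in air at 20 °C) are illustrative choices.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air at about 20 degrees C

def distance_from_lag(lag_seconds, speed=SPEED_OF_SOUND_M_PER_S):
    """Convert a round-trip lag time to a one-way distance in metres."""
    # The pulse travels to the user and back, so the one-way distance is half the path.
    return speed * lag_seconds / 2.0

d = distance_from_lag(0.01)  # a 10 ms round trip
```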

Next, FIG. 14 shows another example of the configuration of the speech recognition unit 41B of FIG. 3 or FIG. 13. In the figure, parts corresponding to those in FIG. 9 are given the same reference numerals, and their description is omitted below where appropriate.

In the embodiment of FIG. 9, the set of acoustic models for the distance closest to the distance to the user is selected from among the sets of acoustic models for the distances D₁ to D_N, and speech recognition is performed with that set. In the embodiment of FIG. 14, by contrast, the speech data output from the microphone 21 is filtered with an inverse filter of the frequency characteristic corresponding to the distance to the user, and speech recognition is performed on the filtered speech data using a predetermined set of acoustic models.

That is, in the embodiment of FIG. 14, the acoustic model database 104 stores, for example, a set of acoustic models generated from speech data recording utterances made close to the microphone (with the distance to the microphone taken as 0).

The filter unit 121 is a digital filter that operates with the tap coefficients supplied from the tap coefficient selection unit 122 (hereinafter referred to as the selected tap coefficients where appropriate). It filters the speech data output by the A/D conversion unit 12 and supplies the result to the feature extraction unit 101.

On receiving a message from the speech segment detection unit 107 indicating that a speech segment has been detected, the tap coefficient selection unit 122 requests the distance calculation unit 47 of FIG. 3 or FIG. 13 to calculate the distance to the user and receives the distance supplied by the distance calculation unit 47 in response. The tap coefficient selection unit 122 then reads from the tap coefficient storage unit 123 the set of tap coefficients that realizes the inverse filter of the frequency characteristic corresponding to the distance closest to the distance to the user. Finally, it supplies the tap coefficients read from the tap coefficient storage unit 123 to the filter unit 121 as the selected tap coefficients and sets them as the tap coefficients of the filter unit 121.

The tap coefficient storage unit 123 stores, for example, sets of tap coefficients that realize inverse filters, that is, digital filters whose characteristics are the inverse of the frequency characteristics corresponding to each of the N distances D₁ to D_N described above.

In the speech recognition unit 41B configured as described above, when the speech segment detection unit 107 detects a speech segment, the tap coefficient selection unit 122 requests the distance calculation unit 47 (FIG. 3 or FIG. 13) to calculate the distance to the user and receives the distance supplied by the distance calculation unit 47 in response. The tap coefficient selection unit 122 then reads from the tap coefficient storage unit 123 the set of tap coefficients that realizes the inverse filter of the frequency characteristic corresponding to the distance closest to the distance to the user, and supplies it to the filter unit 121 as the selected tap coefficients.

The filter unit 121 filters the speech data output by the A/D conversion unit 12 using the selected tap coefficients as its tap coefficients. This removes the frequency characteristic corresponding to the distance to the user from the frequency components of the speech data output by the microphone 21, yielding speech data equivalent to a recording of an utterance made close to the microphone 21, which is supplied to the feature extraction unit 101.
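The tap selection and filtering described above can be sketched as a nearest-distance lookup followed by FIR filtering. The function names `select_taps` and `fir_filter`, the stored distances, and the toy tap sets are assumptions; real tap coefficients would be designed from measured responses.

```python
import numpy as np

def select_taps(tap_sets, distances, measured):
    """Pick the tap set whose associated distance is closest to the measured one."""
    idx = min(range(len(distances)), key=lambda i: abs(distances[i] - measured))
    return tap_sets[idx]

def fir_filter(samples, taps):
    """Apply an FIR filter and keep as many output samples as input samples."""
    return np.convolve(samples, taps)[: len(samples)]

# Hypothetical stored distances and inverse-filter tap sets (toy values).
distances = [0.0, 2.0]
tap_sets = [np.array([1.0]), np.array([2.0, -0.5])]

taps = select_taps(tap_sets, distances, measured=1.8)
filtered = fir_filter(np.array([1.0, 0.0, 0.0]), taps)
```

Feeding a unit impulse through the filter, as in the example, simply reproduces the selected taps, which is a convenient sanity check.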

Here, let x(t) (where t denotes time) be speech data recording an utterance made close to the microphone, let y(t) be speech data recording the same utterance made at a position at the distance D_n, and let h_n(t) be the impulse response of the space from the microphone 21 to the position at the distance D_n. With ω denoting angular frequency, let X(ω), Y(ω), and H_n(ω) be the Fourier transforms of x(t), y(t), and h_n(t), respectively. The following equation then holds.

Y(ω) = X(ω) H_n(ω)   ... (1)

Equation (1) gives the following equation.

X(ω) = Y(ω) / H_n(ω)   ... (2)

Since H_n(ω) represents the frequency characteristic corresponding to the distance D_n, equation (2) shows the following: if the speech data Y(ω) recording an utterance made at a position at the distance D_n from the microphone 21 is filtered with an inverse filter, that is, a filter having the characteristic 1/H_n(ω) inverse to the frequency characteristic H_n(ω) for the distance D_n, the result is equivalent to the speech data X(ω) recording an utterance made close to the microphone 21.
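Equations (1) and (2) can be checked numerically with a discrete Fourier transform. The sketch below is an illustration only: the function name `inverse_filter`, the toy signals, and the small regularization term `eps` (added to avoid division by values near zero, and not part of equation (2)) are assumptions, and circular convolution is used so that the discrete analogue of equation (1) holds exactly.

```python
import numpy as np

def inverse_filter(y, h, eps=1e-8):
    """Recover x from y by dividing by H in the frequency domain.

    conj(H) / (|H|^2 + eps) approaches 1/H as eps goes to 0; eps is an
    assumed safeguard, not something stated in the description.
    """
    n = len(y)
    Y = np.fft.rfft(y, n)
    H = np.fft.rfft(h, n)
    X = Y * np.conj(H) / (np.abs(H) ** 2 + eps)
    return np.fft.irfft(X, n)

# Toy close-talk signal x and toy impulse response h_n.
x = np.array([1.0, -2.0, 3.0, 0.5, 0.0, 0.0, 0.0, 0.0])
h_n = np.array([1.0, 0.5])

# Equation (1): Y = X * H_n (multiplication of spectra).
y = np.fft.irfft(np.fft.rfft(x, 8) * np.fft.rfft(h_n, 8), 8)

# Equation (2): X = Y / H_n.
x_hat = inverse_filter(y, h_n)
```

In this example the magnitude of H_n(ω) never falls below 0.5, so the division is well conditioned; a measured room response may have spectral nulls, in which case a plain 1/H_n(ω) would amplify noise.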

In the embodiment of FIG. 14, the tap coefficient storage unit 123 stores tap coefficients that realize inverse filters having the characteristics 1/H₁(ω) to 1/H_N(ω), which are the inverses of the frequency characteristics H₁(ω) to H_N(ω) corresponding to the distances D₁ to D_N. The tap coefficient selection unit 122 reads from the tap coefficient storage unit 123 the set of tap coefficients that realizes the inverse filter of the frequency characteristic corresponding to the distance closest to the distance to the user, and supplies it to the filter unit 121 as the selected tap coefficients.

The filter unit 121 filters the speech data output by the A/D conversion unit 12 using these selected tap coefficients as its tap coefficients (the tap coefficients of the digital filter). Speech data equivalent to a recording of an utterance made close to the microphone 21 is thereby obtained and supplied to the feature extraction unit 101.

Accordingly, the matching unit 103 performs speech recognition on speech data equivalent to a recording of an utterance made close to the microphone 21, using a set of acoustic models generated from speech data recording utterances made close to a microphone. As in the embodiment of FIG. 9, speech recognition accuracy can therefore be improved without increasing the amount of computation in the matching unit 103.

The inverse characteristic 1/H_n(ω) of the frequency characteristic H_n(ω) corresponding to the distance D_n can be obtained from the relationship in equation (1) or (2). Ideally, an impulse δ(t) is emitted from a position at the distance D_n from the microphone, and the speech data s(t) output by the microphone recording that impulse is observed. In practice, the measurement is made using a TSP (Time Stretched Pulse) signal.
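The ideal impulse measurement can be written out from equation (1) as follows. The symbol S(ω) for the Fourier transform of the observed data s(t) is introduced here for illustration and does not appear in the description.

```latex
\begin{align*}
x(t) = \delta(t) \;&\Rightarrow\; X(\omega) = 1 \\
Y(\omega) = S(\omega) \;&\Rightarrow\; S(\omega) = X(\omega)\,H_n(\omega) = H_n(\omega)
  && \text{by equation (1)} \\
&\Rightarrow\; \frac{1}{H_n(\omega)} = \frac{1}{S(\omega)}
\end{align*}
```

In other words, the spectrum of the recorded impulse is itself the frequency characteristic for the distance D_n, and its reciprocal is the inverse characteristic that the tap coefficients are meant to realize.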

It is desirable that the microphone 21 and the microphone used to record the training speech data have the same frequency characteristics.

The present invention has been described above as applied to a physical robot having a speech recognition function, but it is also applicable to, for example, a virtual robot displayed on a computer, or to any other device.

The series of speech recognition processes described above can be performed by, for example, a general-purpose computer. In that case, a speech recognition device is realized by installing a program that performs the series of speech recognition processes on the general-purpose computer.

In this case, the program can be recorded in advance on a hard disk or a ROM (Read Only Memory) serving as a recording medium built into the computer.

Alternatively, the program can be stored (recorded) temporarily or permanently on a removable recording medium such as a flexible disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto Optical) disk, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium can be provided as so-called packaged software.

Besides being installed on the computer from a removable recording medium as described above, the program can be transferred from a download site to the computer wirelessly via an artificial satellite for digital satellite broadcasting, or by wire via a network such as a LAN (Local Area Network) or the Internet. The computer can receive the program transferred in this way and install it on its built-in hard disk.

In this specification, the processing steps that describe the program for causing the computer (CPU) to perform various processes need not be processed in time series in the order described in the flowcharts. They also include processes executed in parallel or individually (for example, parallel processing or object-based processing).

The program may be processed by a single computer, or may be processed in a distributed manner by a plurality of computers. Furthermore, the program may be transferred to a remote computer and executed there.

Furthermore, the present invention is also applicable to speech recognition using algorithms other than the HMM method.

In the present embodiment, the distance to the user is obtained by stereo processing or by an ultrasonic sensor, but it can be obtained by any other method. For example, the user can be asked to utter the distance, and the distance can be obtained by recognizing that utterance. It is also possible to provide a remote commander with a button for entering a distance and have the user operate the button, so that the robot obtains the distance to the user.

FIG. 1 is a perspective view showing an example of the external configuration of a pet robot to which the present invention is applied.
FIG. 2 is a block diagram showing an example of the hardware configuration of the pet robot.
FIG. 3 is a block diagram showing an example of the functional configuration of the controller 11.
FIG. 4 is a diagram showing a state in which the user is being photographed by the reference camera 22L and the detection camera 22R.
FIG. 5 is a diagram for explaining an epipolar line.
FIG. 6 is a diagram showing a reference camera image and a detection camera image.
FIG. 7 is a diagram showing the transition of evaluation values.
FIG. 8 is a diagram showing a set point/distance table and a parallax/distance table.
FIG. 9 is a block diagram showing a configuration example of the speech recognition unit 41B.
FIG. 10 is a flowchart explaining the processing of the speech recognition unit 41B.
FIG. 11 is a block diagram showing another example of the hardware configuration of the pet robot.
FIG. 12 is a diagram explaining the processing of the ultrasonic sensor 111.
FIG. 13 is a block diagram showing another example of the functional configuration of the controller 11.
FIG. 14 is a block diagram showing another configuration example of the speech recognition unit 41.

Explanation of symbols

1 body unit, 1A back sensor, 2A to 2D leg units, 3 head unit, 3A head sensor, 3B chin sensor, 4 tail unit, 11 controller, 11A CPU, 11B memory, 12 A/D conversion unit, 13 D/A conversion unit, 14 communication unit, 15 semiconductor memory, 21 microphone, 22L, 22R CCD cameras, 23 speaker, 41 sensor input processing unit, 41A pressure processing unit, 41B speech recognition unit, 41C image processing unit, 42 model storage unit, 43 action determination mechanism unit, 44 posture transition mechanism unit, 45 control mechanism unit, 46 speech synthesis unit, 47 distance calculation unit, 101 feature extraction unit, 102 feature vector buffer, 103 matching unit, 104, 104_1 to 104_N acoustic model databases, 105 dictionary database, 106 grammar database, 107 speech section detection unit, 108 selection control unit, 109 selector, 111 ultrasonic sensor, 121 filter unit, 122 tap coefficient selection unit, 123 tap coefficient storage unit

Claims

A speech recognition device that recognizes input speech, comprising:
distance calculating means for obtaining a distance to a sound source of the speech;
storage means for storing, for each of a plurality of different distances, a set of acoustic models that takes into account the acoustic environment at that distance and is generated using speech emitted from sound sources separated by the plurality of different distances;
acquisition means for acquiring the set of acoustic models corresponding to the distance obtained by the distance calculating means by selecting it from among the sets of acoustic models for the plurality of different distances stored in the storage means; and
speech recognition means for recognizing the speech using the set of acoustic models acquired by the acquisition means.

The speech recognition device according to claim 1, wherein the distance calculating means obtains the distance to the sound source by performing stereo processing using images output from a plurality of imaging means that capture images.

The speech recognition device according to claim 1, wherein the distance calculating means obtains the distance to the sound source using the output of an ultrasonic sensor.

A speech recognition method for recognizing input speech, comprising:
a distance calculating step of obtaining a distance to a sound source of the speech;
an acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculating step by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes into account the acoustic environment at that distance and is generated using speech emitted from sound sources separated by the plurality of different distances; and
a speech recognition step of recognizing the speech using the set of acoustic models acquired in the acquisition step.

A program for causing a computer to perform speech recognition processing for recognizing input speech, the processing comprising:
a distance calculating step of obtaining a distance to a sound source of the speech;
an acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculating step by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes into account the acoustic environment at that distance and is generated using speech emitted from sound sources separated by the plurality of different distances; and
a speech recognition step of recognizing the speech using the set of acoustic models acquired in the acquisition step.

A recording medium on which is recorded a program for causing a computer to perform speech recognition processing for recognizing input speech, the processing comprising:
a distance calculating step of obtaining a distance to a sound source of the speech;
an acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculating step by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes into account the acoustic environment at that distance and is generated using speech emitted from sound sources separated by the plurality of different distances; and
a speech recognition step of recognizing the speech using the set of acoustic models acquired in the acquisition step.

A speech recognition device that recognizes input speech, comprising:
distance calculating means for obtaining a distance to a sound source of the speech;
first acquisition means for acquiring tap coefficients that realize an inverse filter of the frequency characteristic corresponding to the distance obtained by the distance calculating means;
filter means for filtering the speech using the tap coefficients acquired by the first acquisition means;
storage means for storing, for each of a plurality of different distances, a set of acoustic models that takes into account the acoustic environment at that distance and is generated using speech emitted from sound sources separated by the plurality of different distances;
second acquisition means for acquiring the set of acoustic models corresponding to the distance obtained by the distance calculating means by selecting it from among the sets of acoustic models for the plurality of different distances stored in the storage means; and
speech recognition means for recognizing the speech filtered by the filter means, using the set of acoustic models acquired by the second acquisition means.

A speech recognition method for recognizing input speech, comprising:
a distance calculating step of obtaining a distance to a sound source of the speech;
a first acquisition step of acquiring tap coefficients that realize an inverse filter of the frequency characteristic corresponding to the distance obtained in the distance calculating step;
a filter step of filtering the speech using the tap coefficients acquired in the first acquisition step;
a second acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculating step by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes into account the acoustic environment at that distance and is generated using speech emitted from sound sources separated by the plurality of different distances; and
a speech recognition step of recognizing the speech filtered in the filter step, using the set of acoustic models acquired in the second acquisition step.

A program for causing a computer to perform speech recognition processing for recognizing input speech, the processing comprising:
a distance calculating step of obtaining a distance to a sound source of the speech;
a first acquisition step of acquiring tap coefficients that realize an inverse filter of the frequency characteristic corresponding to the distance obtained in the distance calculating step;
a filter step of filtering the speech using the tap coefficients acquired in the first acquisition step;
a second acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculating step by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes into account the acoustic environment at that distance and is generated using speech emitted from sound sources separated by the plurality of different distances; and
a speech recognition step of recognizing the speech filtered in the filter step, using the set of acoustic models acquired in the second acquisition step.

A recording medium on which is recorded a program for causing a computer to perform speech recognition processing for recognizing input speech, the processing comprising:
a distance calculating step of obtaining a distance to a sound source of the speech;
a first acquisition step of acquiring tap coefficients that realize an inverse filter of the frequency characteristic corresponding to the distance obtained in the distance calculating step;
a filter step of filtering the speech using the tap coefficients acquired in the first acquisition step;
a second acquisition step of acquiring the set of acoustic models corresponding to the distance obtained in the distance calculating step by selecting it from among sets of acoustic models stored in storage means, the storage means storing, for each of a plurality of different distances, a set of acoustic models that takes into account the acoustic environment at that distance and is generated using speech emitted from sound sources separated by the plurality of different distances; and
a speech recognition step of recognizing the speech filtered in the filter step, using the set of acoustic models acquired in the second acquisition step.