JP5595112B2 - Robot

Info

Publication number: JP5595112B2
Application number: JP2010109213A
Authority: JP
Inventors: 俊一山本; 祥平松本
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2010-05-11
Filing date: 2010-05-11
Publication date: 2014-09-24
Anticipated expiration: 2030-05-11
Also published as: JP2011237621A

Description

The present invention relates to a robot, and more particularly to a robot that has a plurality of microphones installed in its head, acquires sound from all directions, and estimates the direction of a sound source.

A conventional robot has been disclosed that includes a rotatable head and a plurality of microphones mounted on the head so that sound can be acquired from all directions over 360 degrees, and that estimates the direction in which a speaker is located.
However, when the conventional robot itself includes a device that acts as a noise source, or when a noise source is located near the place where the robot is installed, the robot acquires not only the speaker's voice but also the noise emitted by that noise source. Consequently, when the noise level is high, the robot estimates the direction of the noise source as the direction of the speaker and cannot estimate the direction in which the actual speaker is located (the sound source direction to be estimated).
In particular, a robot has a cooling fan for cooling the power equipment, computer, and other components that drive it, and because the noise level of the rotating fan is high, the robot cannot estimate the sound source direction that should be estimated.

For this reason, Patent Document 1 discloses, for a mobile robot, a technique for controlling the noise-source device itself, such as reducing the rotation speed of the cooling fan based on the speech recognition result so that the noise emitted by the robot's own noise source (the cooling fan) becomes quieter, and a technique for asking the speaker to speak again.

JP 2006-95635 A

However, a conventional robot having voice input units on a rotatable head has the problem that, in order to estimate the direction in which the speaker is located (the sound source direction to be estimated), it must control the device that acts as the noise source.

The present invention has been made to solve the above problem. Its object is to provide a robot that estimates the direction in which a speaker is located (the sound source direction to be estimated) from sound that enters voice input units provided on a rotatable head from all directions over 360 degrees, while excluding the direction from which noise is emitted.

To solve the above problem, the robot according to claim 1 of the present invention is a robot comprising: a torso; a head supported on a support shaft so as to be rotatable on the upper surface of the torso; three or more voice input units that are arranged on the head, spaced apart at predetermined angles about the support shaft in the direction in which the head rotates, and that convert input sound into voice data and output it; a voice data processing unit that applies a signal decomposition algorithm to the voice data input from each of the voice input units to generate omnidirectional sound pressure component data indicating sound pressure values from all directions over 360 degrees with the orientation of the head as the reference; a rotation angle measurement unit that measures the rotation angle of the head with the orientation of the torso as the reference; a storage unit that stores an excluded angle range in which a range of directions not to be estimated as a sound source is set in advance with the orientation of the torso as the reference; and a sound source direction estimation unit that estimates the sound source direction, wherein the sound source direction estimation unit includes a filter unit that uses the measured rotation angle to remove the data within the excluded angle range from the omnidirectional sound pressure component data and thereby generates effective direction sound pressure component data.

With this configuration, regardless of the rotation angle of the head on which the voice input units are installed, the robot can remove sound coming from the directions indicated by the excluded angle range, which is defined with the orientation of the torso as the reference (for example, noise emitted by a fan rotating at the back of the robot).
The robot can also acquire effective direction sound pressure component data for sound arriving from outside the excluded angle range.
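The filtering described for claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary representation of the omnidirectional sound pressure component data, and the sign convention (torso-referenced angle = head-referenced angle + rotation angle) are all assumptions made for the example.

```python
def to_body_angle(head_angle_deg, head_rotation_deg):
    """Convert an angle measured from the head's facing direction into an
    angle measured from the torso's facing direction (assumed convention)."""
    return (head_angle_deg + head_rotation_deg) % 360


def in_range(angle_deg, start_deg, end_deg):
    """Return True if angle lies in [start, end], allowing wrap past 360."""
    angle_deg %= 360
    start_deg %= 360
    end_deg %= 360
    if start_deg <= end_deg:
        return start_deg <= angle_deg <= end_deg
    return angle_deg >= start_deg or angle_deg <= end_deg


def filter_excluded(spectrum, head_rotation_deg, excluded_range):
    """Drop entries whose torso-referenced direction is in the excluded range.

    spectrum: dict mapping head-referenced angle (deg) -> sound pressure value
    excluded_range: (start_deg, end_deg), torso-referenced
    """
    start, end = excluded_range
    return {
        angle: value
        for angle, value in spectrum.items()
        if not in_range(to_body_angle(angle, head_rotation_deg), start, end)
    }
```

With an excluded range of 135 to 225 degrees behind the torso, the entry that is dropped changes as the head turns, while the excluded direction stays fixed relative to the torso.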

The robot according to claim 2 is the robot according to claim 1, wherein the storage unit stores voice input angles indicating the direction of each of the voice input units about the support shaft; the sound source direction estimation unit has an angle extraction unit that extracts, from the effective direction sound pressure component data generated by the filter unit, at least one angle whose sound pressure value is larger than both the sound pressure value at the immediately preceding angle and the sound pressure value at the immediately following angle; and the robot comprises: an estimated sound source separation unit that, based on the voice input angles, separates and extracts from the voice data output by the voice input units the voice data input from the direction of each estimated sound source angle extracted by the angle extraction unit; a speech recognition unit that performs speech recognition on the estimated sound source direction voice data extracted by the estimated sound source separation unit, generates a plurality of hypothesis phrases, and calculates a speech recognition likelihood indicating the correctness of each hypothesis phrase; and a phrase certainty factor calculation unit that weights the speech recognition likelihood of each hypothesis phrase by a speech input direction suitability, which indicates the correctness of the relationship between the hypothesis phrase and the direction from which the estimated sound source direction voice data for that hypothesis phrase was input, to calculate a phrase certainty factor for each hypothesis phrase, and estimates the sound source direction based on the calculated phrase certainty factors.
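The angle extraction step of claim 2 (picking angles whose sound pressure value exceeds both neighbours) can be sketched as below. The representation is an assumption made for illustration: the data is a list sampled at a fixed angular step over the full circle, removed (excluded) directions are marked with `None`, and an angle next to a removed direction is not reported as a peak.

```python
def extract_peak_angles(values, step_deg):
    """Return angles whose value is larger than both adjacent values.

    values: sound pressure values at 0, step_deg, 2*step_deg, ... degrees;
            None marks a direction removed by the filter (assumed encoding).
    """
    n = len(values)
    peaks = []
    for i, value in enumerate(values):
        prev_value = values[(i - 1) % n]  # neighbour on the circle
        next_value = values[(i + 1) % n]
        if value is None or prev_value is None or next_value is None:
            continue
        if value > prev_value and value > next_value:
            peaks.append(i * step_deg)
    return peaks
```

Each returned angle corresponds to one estimated sound source angle that would then be passed to source separation and speech recognition.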

With this configuration, the robot can express, as the speech input direction suitability, the correctness of the relationship between the speech recognition result produced by the speech recognition unit and the direction from which the estimated sound source direction voice data separated and extracted by the estimated sound source separation unit was input.
In addition, by weighting the speech recognition likelihood, which indicates the correctness of the speech recognition result, by the speech input direction suitability, the robot can calculate a value that evaluates both the correctness of the speech recognition result and the correctness of the direction from which the speech was input.

The robot according to claim 3 is the robot according to claim 2, wherein the phrase certainty factor calculation unit extracts the maximum phrase certainty factor from the phrase certainty factors of the hypothesis phrases, and estimates, as the sound source direction, the estimated sound source angle of the estimated sound source direction voice data from which the hypothesis phrase having that phrase certainty factor was generated.

With this configuration, the robot can take, as the sound source direction, the estimated sound source angle associated with the speech recognition result whose phrase certainty factor is the largest.
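The weighting and maximum selection of claims 2 and 3 can be sketched as follows. The text only says that the likelihood is "weighted" by the suitability; a simple product is assumed here, and the `Hypothesis` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    phrase: str          # hypothesis phrase from speech recognition
    angle_deg: float     # estimated sound source angle it came from
    likelihood: float    # speech recognition likelihood
    suitability: float   # speech input direction suitability


def certainty(h):
    """Phrase certainty factor, assumed to be likelihood * suitability."""
    return h.likelihood * h.suitability


def estimate_direction(hypotheses):
    """Return (angle, phrase, certainty) of the most certain hypothesis."""
    best = max(hypotheses, key=certainty)
    return best.angle_deg, best.phrase, certainty(best)
```

A hypothesis with a high recognition likelihood can still lose to one whose direction fits better, which is the behaviour the weighting is meant to produce.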

The robot according to claim 4 is the robot according to claim 3, further comprising a phrase storage unit and a phrase determination unit, wherein the phrase storage unit stores, in association with one another, three items: a plurality of phrase patterns collecting, in advance, phrases that may be input; direction angle value patterns each indicating a direction about the support shaft with the orientation of the head as the reference; and the speech input direction suitability indicating the correctness of a phrase pattern being input from the direction of a direction angle value pattern; the phrase certainty factor calculation unit acquires the plurality of hypothesis phrases generated by the speech recognition unit together with their speech recognition likelihoods, uses each hypothesis phrase and the estimated sound source angle extracted by the angle extraction unit to extract the speech input direction suitability associated with the corresponding phrase pattern and direction angle value pattern stored in the phrase storage unit, and weights the speech recognition likelihood by that speech input direction suitability to calculate the phrase certainty factor; and the phrase determination unit takes the hypothesis phrase having the maximum phrase certainty factor extracted by the phrase certainty factor calculation unit as the phrase obtained by recognizing the speech from the sound source direction that the phrase certainty factor calculation unit estimated for that hypothesis phrase.

With this configuration, the robot can express, as the speech input direction suitability, the correctness of the relationship between the estimated sound source angle acquired by the angle extraction unit and the hypothesis phrase generated by the speech recognition unit.
In addition, by weighting the speech recognition likelihood, which indicates the correctness of a hypothesis phrase generated by the speech recognition unit, by the speech input direction suitability, which indicates the correctness of the direction from which that hypothesis phrase was input, the robot can calculate a phrase certainty factor that serves as a criterion for judging the correctness of the speech recognition result for the hypothesis phrase.
The robot can then take the hypothesis phrase with the maximum phrase certainty factor as the correct phrase.
Furthermore, the robot can discard a hypothesis phrase whose speech input direction suitability is low even if its speech recognition likelihood is high.
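The phrase storage lookup of claim 4 can be sketched as a table keyed by phrase pattern and direction bin. Everything concrete here is hypothetical: the 45-degree bin width, the table contents, the default suitability of 0.0 for unlisted combinations, and the use of a product as the weighting.

```python
BIN_DEG = 45  # assumed width of one direction angle value pattern

# Hypothetical phrase storage: phrase pattern -> {bin start angle: suitability}
PHRASE_DB = {
    "hello": {0: 1.0, 45: 0.5, 315: 0.5},
    "come here": {90: 1.0, 270: 1.0},
}


def direction_suitability(phrase, angle_deg, db=PHRASE_DB):
    """Look up the speech input direction suitability for a phrase and angle."""
    bin_start = int(angle_deg % 360 // BIN_DEG) * BIN_DEG
    return db.get(phrase, {}).get(bin_start, 0.0)


def decide_phrase(hypotheses, db=PHRASE_DB):
    """Pick the hypothesis with the largest phrase certainty factor.

    hypotheses: list of (phrase, estimated_angle_deg, likelihood) tuples.
    Returns (phrase, angle_deg, certainty).
    """
    scored = [
        (likelihood * direction_suitability(phrase, angle, db), phrase, angle)
        for phrase, angle, likelihood in hypotheses
    ]
    certainty, phrase, angle = max(scored)
    return phrase, angle, certainty
```

In this sketch a phrase that is never expected from a given direction receives a certainty of zero, so it is effectively discarded even when its recognition likelihood is the highest.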

The robot according to claim 5 is the robot according to claim 4, wherein the phrase storage unit stores, for each phrase pattern, a two-dimensional histogram composed of the direction angle value pattern and the speech input direction suitability.

With this configuration, the robot can extract the phrase certainty factor from the two-dimensional histogram, stored for each phrase pattern, composed of the direction angle value pattern and the speech recognition likelihood.

The robot according to claim 6 is the robot according to any one of claims 2 to 5, further comprising a behavior control unit that rotates the head until it reaches the estimated sound source angle extracted by the angle extraction unit.

With this configuration, the robot can turn its head in the direction from which it was called.

According to the invention of claim 1, a range of directions not to be estimated as a sound source (the excluded angle range) is stored in the storage unit in advance. Therefore, even though the voice input units are mounted on the rotating head and the relative position between the noise source and the voice input units changes, the robot can estimate, from the sound pressure component values of the effective direction sound pressure component data, the direction from which a speaker (sound source) located outside the excluded angle range spoke.

Further, according to the invention of claim 1, setting the excluded angle range based on the position of the noise source makes it possible to remove the sound from that noise source, so there is no need to control the device that acts as the noise source.

According to the invention of claim 2, the robot can obtain a value that evaluates both the correctness of the speech recognition result and the correctness of the direction from which the speech was input. Because a larger calculated value indicates a higher probability that the speech recognition result is correct, the misrecognition rate of speech recognition can be reduced. In addition, for example, by providing a reference value based on the correctness of recognition, the direction of a calculated value equal to or greater than that reference value can be estimated to be the sound source direction.

According to the invention of claim 3, by extracting the maximum value from the plurality of calculated values, the robot can obtain the speech recognition result with the highest probability of being correct, so the misrecognition rate of speech recognition can be reduced further.

According to the inventions of claims 4 and 5, the robot takes the hypothesis phrase whose phrase certainty factor, the criterion for judging the correctness of the speech recognition result, is the largest as the phrase obtained by recognizing the speech from the sound source direction, and can thereby reduce the misrecognition rate.

According to the invention of claim 6, the robot can turn its face toward the speaker who called to it.

FIG. 1 is an overall configuration diagram of a robot system including a robot according to the present invention.
FIG. 2 is a block diagram showing the overall configuration of the robot according to the present invention.
FIG. 3(a) is a front view of the head R1 of the robot according to the present invention, and FIG. 3(b) is an A-A sectional view of the head R1.
FIG. 4 is a block diagram showing the configuration related to the sound source direction estimation processing of the robot according to the present invention.
FIG. 5 is a diagram showing an example of the voice input angle DB of the robot according to the present invention.
FIG. 6 is a diagram showing an example of the phrase DB of the robot according to the present invention.
FIG. 7 is a diagram showing an example of the B axis, the H axis, and the excluded angle range of the robot according to the present invention.
FIG. 8 is a diagram showing an example of a MUSIC spectrum generated by the voice data processing unit of the robot according to the present invention.
FIG. 9 is a flowchart of the sound source direction estimation processing of the robot according to the present invention.
FIG. 10 is a flowchart of the sound source direction estimation processing of the robot, continuing from FIG. 9.
FIG. 11 is a flowchart of the sound source direction estimation processing of the robot, continuing from FIG. 10.

Hereinafter, a robot according to an embodiment of the present invention (hereinafter referred to as "the present embodiment") will be described with reference to the drawings. The drawings are only schematic, to the extent needed for the present invention to be fully understood, so the present invention is not limited to the illustrated examples. In the drawings, common or similar components are given the same reference signs, and duplicate descriptions of them are omitted.

[Robot system configuration]
First, the overall configuration of a robot system including the robot according to the present embodiment will be described with reference to FIG. 1. Here, an autonomous mobile biped walking robot is described as an example.
As shown in FIG. 1, the robot system includes a robot R, a base station 1 connected to the robot R by wireless communication, a management computer 3 connected to the base station 1 via a robot dedicated network 2, and a terminal 5 connected to the management computer 3 via a network 4. Although not illustrated, the robot system assumes a plurality of robots R.

As shown in FIG. 1, the robot R according to the present embodiment has a head R1, arms R2, legs R3, a torso R4, and a rear storage portion R5. The head R1, arms R2, and legs R3, each connected to the torso R4, are driven by actuators (driving means), and biped walking is controlled by an autonomous movement control unit 50 (see FIG. 2) described later. Details of this biped walking are disclosed in, for example, JP 2001-62760 A.
The robot R has the rear storage portion R5 on the back of the torso R4. The rear storage portion R5 houses a control device (the main control unit 40 and the like) that controls the operation of the head R1, arms R2, legs R3, and torso R4, a fan F (noise source) for exhausting the heat produced by the control device's processing, a battery 70 described later, and so on. Rotation of the fan F emits noise.

The robot dedicated network 2 connects the base station 1, the management computer 3, and the network 4, and is implemented by a LAN (Local Area Network) or the like.

The management computer 3 manages the plurality of robots R. It performs various kinds of control, such as movement and speech of the robots R, via the base station 1 and the robot dedicated network 2, and provides the robots R with necessary information. The necessary information includes voice data for speaking, the names of detected persons, maps of the area around a robot R, and the like, and is stored in a storage unit (not shown) provided in the management computer 3.

The terminal 5 connects to the management computer 3 via the network 4 and registers information about persons and the like in the storage means (not shown) provided in the management computer 3, or corrects such registered information. The terminal 5 is also used to register tasks to be executed by the robot R, change the task schedule set in the management computer 3, input operation commands for the robot R, and so on.

The robot R and the management computer 3 are each described in detail below.
[Robot]
FIG. 2 is a block diagram showing the overall configuration of the robot. As shown in FIG. 2, in addition to the head R1, arms R2, legs R3, torso R4, and rear storage portion R5, the robot R has cameras C, a speaker S, voice input units MC, an image processing unit 10, a voice processing unit 20, a storage unit 30, a main control unit 40, an autonomous movement control unit 50, a wireless communication unit 60, and a target detection unit 80.

The head R1 is rotated left and right about the support shaft O on the torso R4 by a head rotation unit R11, described later, under the control of the autonomous movement control unit 50 (behavior control unit). As a result, the direction in which the head R1 (face) is facing (the H axis) deviates by the rotation angle θ from the direction in which the front of the torso R4 is facing (the B axis). The autonomous movement control unit 50 (rotation angle measurement unit) holds the angle through which it caused the head rotation unit R11 to rotate (the rotation angle θ). That is, the autonomous movement control unit 50 holds the rotation angle θ, which is the measured value of the rotation angle of the head R1.
Further, the robot R has a gyro sensor SR1 and a GPS receiver SR2 for detecting its current position.
The cameras C, the speaker S, and the voice input units MC are all disposed inside the head R1.

[Camera]
The cameras C (C1, C2) can capture video as digital data; for example, color CCD (Charge-Coupled Device) cameras are used. The cameras C1 and C2 are arranged side by side in parallel on the left and right, and the captured images are output to the image processing unit 10.

[Image processing unit]
The image processing unit 10 processes the images captured by the cameras C and recognizes surrounding obstacles and persons in order to grasp the situation around the robot R from the captured images. The image processing unit 10 includes a stereo processing unit 11a, a moving body extraction unit 11b, and a face recognition unit 11c. Each component is briefly described below.

The stereo processing unit 11a performs pattern matching using one of the two images captured by the left and right cameras C as the reference, calculates the parallax of each pair of corresponding pixels in the left and right images to generate a parallax image, and outputs the generated parallax image and the original images to the moving body extraction unit 11b. The parallax represents the distance from the robot R to the photographed object.
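The statement that parallax represents distance can be made concrete with the standard relation for two parallel pinhole cameras, distance = focal length × baseline / parallax. The sketch below is a general illustration of that relation; the patent does not give the formula or camera parameters, so the function name and the numeric values in the usage example are assumptions.

```python
def distance_from_parallax(parallax_px, focal_length_px, baseline_m):
    """Distance (m) to a point seen with the given parallax (pixels).

    Assumes two parallel, identical pinhole cameras separated by baseline_m,
    with the focal length expressed in pixels.
    """
    if parallax_px <= 0:
        return float("inf")  # zero parallax: point is effectively at infinity
    return focal_length_px * baseline_m / parallax_px
```

For example, with an assumed focal length of 500 pixels and a baseline of 0.125 m, a parallax of 25 pixels corresponds to 2.5 m; a larger parallax means a closer object.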

The moving body extraction unit 11b extracts a moving body in the captured images based on the data output from the stereo processing unit 11a. Moving objects (moving bodies) are extracted because a moving object is presumed to be a person, and the aim is to recognize persons.
To extract a moving body, the moving body extraction unit 11b stores the images of the past several frames, performs pattern matching by comparing the newest frame (image) with the past frames (images), calculates the amount of movement of each pixel, and generates a movement amount image. If the parallax image and the movement amount image show pixels with a large amount of movement within a predetermined distance range from the cameras C, the unit presumes that a person is present, extracts the moving body as a parallax image restricted to that distance range, and outputs the image of the moving body to the face recognition unit 11c.

The face recognition unit 11c extracts skin-colored portions from the extracted moving body and recognizes the position of the face from their size, shape, and the like. The positions of the hands are recognized in the same way from skin-colored regions and their size, shape, and the like.
The recognized face position is output to the main control unit 40 as information for use when the robot R moves and for communicating with that person.

[Voice input unit]
The voice input unit MC includes a microphone that picks up sound around the robot R as an electrical signal (acoustic signal) and an analog-to-digital converter (A/D converter) that digitizes the acoustic signal to generate acoustic signal data (including voice data), and outputs the data to the voice processing unit 20 (the voice data processing unit 22 and the estimated sound source separation unit 24, described later).
FIG. 3(a) is a front view of the head R1, and FIG. 3(b) is an A-A sectional view of the head R1. FIG. 3(b) shows the voice input units MC disposed inside the head R1.
The voice input units MC1, MC2, ..., MC8 are each disposed about the support shaft O, separated from the direction in which the face is facing (the H axis) by an arrangement angle ∠H_mc (∠H_mc1 (= H axis), ∠H_mc2, ∠H_mc3, ...), so as to collect sound from outside the head R1.

Here, the arrangement angles ∠H_mc may be any angles that allow the voice input units MC to collect sound from all directions over 360 degrees, but when there are eight voice input units MC, as shown in FIG. 3(b), they are desirably spaced at equal angles (45 degrees).
In addition, the number of voice input units MC and the arrangement angles ∠H_mc are set so that two or more voice input units MC are always outside the excluded angle range r described later. In the present embodiment, the sound source direction is obtained from acoustic signal data input from a plurality of directions, so at least two units need to be outside the excluded angle range r.
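The placement condition above (at least two voice input units outside the excluded angle range at all times) can be checked for a candidate layout as sketched below. The function names, the angle convention (torso-referenced angle = arrangement angle + head rotation angle), and the set of head rotation angles to test are assumptions for illustration.

```python
def in_range(angle_deg, start_deg, end_deg):
    """Return True if angle lies in [start, end], allowing wrap past 360."""
    angle_deg %= 360
    start_deg %= 360
    end_deg %= 360
    if start_deg <= end_deg:
        return start_deg <= angle_deg <= end_deg
    return angle_deg >= start_deg or angle_deg <= end_deg


def mics_outside(mic_angles_deg, head_rotation_deg, excluded_range):
    """Count microphones whose torso-referenced direction is not excluded."""
    start, end = excluded_range
    return sum(
        1
        for mic in mic_angles_deg
        if not in_range(mic + head_rotation_deg, start, end)
    )


def layout_is_valid(mic_angles_deg, excluded_range, rotations_deg, minimum=2):
    """True if at least `minimum` microphones stay outside the excluded
    range for every head rotation angle that is tested."""
    return all(
        mics_outside(mic_angles_deg, rotation, excluded_range) >= minimum
        for rotation in rotations_deg
    )
```

With eight microphones at 45-degree spacing and an assumed excluded range of 135 to 225 degrees, the check passes for every head rotation, whereas a sparse layout with most microphones facing backward fails.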

[Speaker]
The speaker S outputs sound based on the voice data generated by the voice processing unit 20 (the speech synthesis unit 21 described later).

［音声処理部］
音声処理部２０は、音声合成部２１と、音声データ処理部２２と、音源方向推定部２３と、推定音源分離部２４と、音声認識部２５と、音源定位部２６とを有する。
音声合成部２１は、主制御部４０により決定された発話行動の指令に基づき、文字情報（テキストデータ）から発話音声データを生成し、スピーカＳに音声を出力する部分である。発話音声データの生成には、予め記憶部３０に記憶されている文字情報（テキストデータ）と、その文字情報が読み上げられた音声データとの対応関係を利用する。なお、この文字情報と音声データとは、管理用コンピュータ３から取得され、記憶部３０に保存される。 [Audio processor]
The speech processing unit 20 includes a speech synthesis unit 21, a speech data processing unit 22, a sound source direction estimation unit 23, an estimated sound source separation unit 24, a speech recognition unit 25, and a sound source localization unit 26.
The speech synthesis unit 21 generates utterance voice data from character information (text data) based on the utterance action command determined by the main control unit 40, and outputs the voice to the speaker S. To generate the utterance voice data, it uses the correspondence, stored in advance in the storage unit 30, between character information (text data) and voice data of that text read aloud. The character information and the voice data are acquired from the management computer 3 and stored in the storage unit 30.

本実施形態の音源方向推定処理に係る処理は、音声処理部２０が備える音声データ処理部２２と、音源方向推定部２３と、推定音源分離部２４と、音声認識部２５と、音源定位部２６とにより行われる。これら構成部により行われる音源方向推定処理の詳細な説明は後記するため、ここでは、音声処理部２０が行う音源方向推定処理について簡単に説明する。
音声処理部２０は、音声入力部ＭＣ（ＭＣ１，ＭＣ２，ＭＣ３，・・・）から入力された音響信号データに信号分解アルゴリズム（例えば、ＭＵＳＩＣ（MUltiple SIgnal Classification）法）を用いて、頭部の向きを基準とした３６０度全方向からの音圧値を示す全方向音圧成分データ（例えば、ＭＵＳＩＣスペクトラム）を生成する。そして、音声処理部２０は、その全方向音圧成分データから音源方向を推定する。次に、音声処理部２０は、推定した音源方向から入力される音響信号データを、音声入力部ＭＣから入力された音響信号データから分離抽出して、分離抽出した音響信号データに対して音声認識を行う。 The sound source direction estimation process of the present embodiment is performed by the audio data processing unit 22, the sound source direction estimation unit 23, the estimated sound source separation unit 24, the speech recognition unit 25, and the sound source localization unit 26 of the speech processing unit 20. The processing performed by these components is described in detail later, so only a brief overview of the sound source direction estimation process performed by the speech processing unit 20 is given here.
The speech processing unit 20 applies a signal decomposition algorithm (for example, the MUSIC (MUltiple SIgnal Classification) method) to the acoustic signal data input from the voice input units MC (MC1, MC2, MC3, ...) to generate omnidirectional sound pressure component data (for example, a MUSIC spectrum) indicating the sound pressure values from all directions through 360 degrees relative to the orientation of the head. The speech processing unit 20 then estimates the sound source direction from the omnidirectional sound pressure component data. Next, it separates and extracts the acoustic signal data arriving from the estimated sound source direction from the acoustic signal data input from the voice input units MC, and performs speech recognition on the extracted data.

［記憶部］
記憶部３０は、例えば、一般的なハードディスクなどから構成され、管理用コンピュータ３から送信された音声合成部２１が発話音声データを生成する際に用いる文字情報および音声データ、後記する対象検知部８０が用いる必要な情報（人物の氏名、ローカル地図データ、会話用データなど）、ロボットＲが識別した人物の識別番号や位置情報の他に、本実施形態の音源方向推定処理に係るデータを記憶するものである。
この音源方向推定処理に係るデータは、音声入力角度ＤＢ３１と、除外角度範囲ＤＢ３２と、フレーズＤＢ３３（フレーズ記憶部）と、伝達関数ＤＢ３４とに記憶されており、これらについての詳細な説明は後記する。 [Storage unit]
The storage unit 30 is composed of, for example, a general hard disk. It stores the character information and voice data that are transmitted from the management computer 3 and used by the speech synthesis unit 21 to generate utterance voice data, the information needed by the target detection unit 80 described later (names of persons, local map data, conversation data, and so on), and the identification numbers and position information of persons identified by the robot R, as well as data related to the sound source direction estimation process of the present embodiment.
The data related to the sound source direction estimation process is stored in the voice input angle DB 31, the excluded angle range DB 32, the phrase DB 33 (phrase storage unit), and the transfer function DB 34, which are described in detail later.

［主制御部］
主制御部４０は、画像処理部１０、音声処理部２０、記憶部３０、自律移動制御部５０、無線通信部６０、および対象検知部８０を統括制御するものである。 [Main control section]
The main control unit 40 performs overall control of the image processing unit 10, the sound processing unit 20, the storage unit 30, the autonomous movement control unit 50, the wireless communication unit 60, and the target detection unit 80.

［自律移動制御部］
自律移動制御部（自律移動制御手段）５０は、主制御部４０の指示に従い頭部Ｒ１、腕部Ｒ２、脚部Ｒ３を駆動する。
また、ジャイロセンサＳＲ１、およびＧＰＳ受信器ＳＲ２が検出したデータは、主制御部４０に出力され、ロボットＲの行動を決定するために利用される。 [Autonomous Movement Control Unit]
The autonomous movement control unit (autonomous movement control means) 50 drives the head R1, the arm R2, and the leg R3 in accordance with instructions from the main control unit 40.
The data detected by the gyro sensor SR1 and the GPS receiver SR2 is output to the main control unit 40 and used to determine the behavior of the robot R.

［無線通信部］
無線通信部６０は、管理用コンピュータ３とデータの送受信を行う通信装置である。無線通信部６０は、公衆回線通信装置６１ａおよび無線通信装置６１ｂを有する。
公衆回線通信装置６１ａは、携帯電話回線やＰＨＳ（Personal Handyphone System）回線などの公衆回線を利用した無線通信手段である。一方、無線通信装置６１ｂは、IEEE802.11b規格に準拠するワイヤレスＬＡＮなどの、近距離無線通信による無線通信手段である。
無線通信部６０は、管理用コンピュータ３からの接続要求に従い、公衆回線通信装置６１ａまたは無線通信装置６１ｂを選択して管理用コンピュータ３とデータ通信を行う。 [Wireless communication part]
The wireless communication unit 60 is a communication device that transmits and receives data to and from the management computer 3. The wireless communication unit 60 includes a public line communication device 61a and a wireless communication device 61b.
The public line communication device 61a is a wireless communication means using a public line such as a mobile phone line or a PHS (Personal Handyphone System) line. On the other hand, the wireless communication device 61b is a wireless communication unit using short-range wireless communication such as a wireless LAN conforming to the IEEE802.11b standard.
The wireless communication unit 60 selects the public line communication device 61a or the wireless communication device 61b in accordance with a connection request from the management computer 3 and performs data communication with the management computer 3.

［バッテリ］
バッテリ７０は、ロボットＲの各部（頭部Ｒ１、腕部Ｒ２、脚部Ｒ３、および胴部Ｒ４）の動作や、ファンＦの回転駆動、背面格納部Ｒ５に格納された制御装置の処理などに必要な電力の供給源である。このバッテリ７０は、充填式の構成をもつものが使用される。 [Battery]
The battery 70 supplies the power needed for the operation of each part of the robot R (head R1, arm R2, leg R3, and torso R4), the rotational drive of the fan F, the processing of the control device housed in the rear storage unit R5, and so on. A rechargeable battery is used as the battery 70.

［対象検知部］
対象検知部８０は、ロボットＲの周囲にタグＴを備える人物が存在するか否かを検知するものである。例えば、ＬＥＤから構成され、ロボットＲの頭部Ｒ１外周に沿って前後左右などに配設される複数の発光部を備える（図示は省略する）。
対象検知部８０は、発光部から、各発光部を識別する発光部ＩＤを示す信号を含む赤外光をそれぞれ発信すると共に、この赤外光を受信したタグＴから受信報告信号を受信する。いずれかの赤外光を受信したタグＴは、その赤外光に含まれる発光部ＩＤに基づいて、受信報告信号を生成するので、ロボットＲは、この受信報告信号に含まれる発光部ＩＤを参照することにより、当該ロボットＲから見てどの方向にタグＴが存在するかを特定することができる。また、対象検知部８０は、タグＴから取得した受信報告信号の電波強度に基づいて、タグＴまでの距離を特定する機能を有する。したがって、対象検知部８０は、受信報告信号に基づいて、タグＴの位置（距離および方向）を、人物の位置として特定することができる。
さらに、対象検知部８０は、発光部から赤外光を発光するだけではなく、ロボットＩＤを示す信号を含む電波を図示しないアンテナから発信する。これにより、この電波を受信したタグＴは、赤外光を発信したロボットＲを正しく特定することができる。なお、対象検知部８０およびタグＴについての詳細は、例えば、特開２００６−１９２５６３号公報に開示されている。 [Target detection unit]
The target detection unit 80 detects whether or not there is a person with the tag T around the robot R. For example, it includes a plurality of light emitting units that are configured by LEDs and are arranged in front, rear, left, and right along the outer periphery of the head R1 of the robot R (not shown).
The target detection unit 80 emits, from each light emitting unit, infrared light carrying a signal that indicates the light emitting unit ID identifying that unit, and receives a reception report signal from a tag T that has received the infrared light. A tag T that receives any of the infrared light generates its reception report signal based on the light emitting unit ID contained in that light, so by referring to the light emitting unit ID contained in the reception report signal, the robot R can determine in which direction the tag T lies as seen from the robot R. The target detection unit 80 also has a function of determining the distance to the tag T from the radio field intensity of the reception report signal acquired from the tag T. The target detection unit 80 can therefore identify the position (distance and direction) of the tag T, as the position of the person, from the reception report signal.
Furthermore, the object detection unit 80 not only emits infrared light from the light emitting unit, but also transmits a radio wave including a signal indicating the robot ID from an antenna (not shown). Thus, the tag T that has received the radio wave can correctly identify the robot R that has transmitted infrared light. Details of the target detection unit 80 and the tag T are disclosed in, for example, Japanese Patent Application Laid-Open No. 2006-192563.

次に、本実施形態に係るロボットＲの音源方向推定処理に係る構成について、図４を参照して説明する。
［記憶部の構成］
図４に示すように、記憶部３０は、音声入力角度ＤＢ３１と、除外角度範囲ＤＢ３２と、フレーズＤＢ３３とを含んで構成される。 Next, a configuration related to the sound source direction estimation process of the robot R according to the present embodiment will be described with reference to FIG.
[Configuration of storage unit]
As shown in FIG. 4, the storage unit 30 includes a voice input angle DB 31, an excluded angle range DB 32, and a phrase DB 33.

（音声入力角度ＤＢ）
音声入力角度ＤＢ３１は、音声入力部ＭＣの識別番号と、その音声入力部ＭＣが顔面が向いている方向（Ｈ軸）からなす配設角度∠Ｈ_mc（∠Ｈ_mc1，∠Ｈ_mc2，∠Ｈ_mc3，・・・）とを対応付けて記憶するデータベースである。
図５は、図３（ｂ）に示すように、音声入力部ＭＣが頭部Ｒ１の内部に配設されたときの音声入力角度ＤＢの一例を示す図である。音声入力角度ＤＢには、音声入力部ＭＣ１に相当する識別番号“ＭＣ１”に、配設角度（∠Ｈ_mc1）＝０度が記録されており、音声入力部ＭＣ２に相当する識別番号“ＭＣ２”に、配設角度（∠Ｈ_mc2）＝４５度が記録されている。 (Audio input angle DB)
The voice input angle DB 31 is a database that stores the identification number of each voice input unit MC in association with its arrangement angle ∠H_mc (∠H_mc1, ∠H_mc2, ∠H_mc3, ...), measured from the direction in which the face is facing (H axis).
FIG. 5 shows an example of the voice input angle DB for the case where the voice input units MC are arranged inside the head R1 as shown in FIG. 3(b). In the voice input angle DB, an arrangement angle (∠H_mc1) of 0 degrees is recorded for the identification number "MC1" corresponding to the voice input unit MC1, and an arrangement angle (∠H_mc2) of 45 degrees is recorded for the identification number "MC2" corresponding to the voice input unit MC2.

（伝達関数ＤＢ）
伝達関数ＤＢ３４は、後記する音声データ処理部２２にて、ＭＵＳＩＣスペクトラムＰ_avg(θ)を生成するために、予めシミュレーションして取得した、頭部の向きを基準としたθ方向に音源がある場合の伝達関数ベクトルｖ(θ)が記録されている。この伝達関数ベクトルｖ(θ)は、シミュレーション時に行った方向の数だけ記録されている。伝達関数ベクトルｖ(θ)についての詳細は、後記する音声データ処理部２２の説明にて記載する。
本実施形態では、３６０度全方向を５度で等分した７２方向に音源がある場合のシミュレーションを予め行っておき、７２個の伝達関数ベクトルｖ(θ)（ｖ₁，ｖ₂，・・・，ｖ₇₂）を算出しておく。 (Transfer function DB)
The transfer function DB 34 records the transfer function vectors v(θ) for a sound source located in the direction θ relative to the orientation of the head. These vectors are obtained in advance by simulation and are used by the audio data processing unit 22, described later, to generate the MUSIC spectrum P_avg(θ). One transfer function vector v(θ) is recorded for each direction simulated. Details of the transfer function vector v(θ) are given in the description of the audio data processing unit 22 below.
In the present embodiment, simulations are run in advance for sound sources in 72 directions, obtained by dividing the full 360 degrees into equal 5-degree steps, and 72 transfer function vectors v(θ) (v₁, v₂, ..., v₇₂) are calculated.

（除外角度範囲ＤＢ）
除外角度範囲ＤＢ３２は、ロボットＲの管理者により端末５から入力され、管理用コンピュータ３を介して、ロボットＲに設定される除外角度範囲ｒを記憶する記憶手段である。
除外角度範囲ｒは、支持軸Ｏを中心に胴部Ｒ４の正面が向いている方向（Ｂ軸，図７参照）からなす、音源として推定しない方向の範囲である。例えば、除外角度範囲ｒとして、∠Ｂ＝＋１２０〜＋１８０〜＋２４０度で記憶されている。 (Exclusion angle range DB)
The exclusion angle range DB 32 is a storage unit that stores an exclusion angle range r that is input from the terminal 5 by the administrator of the robot R and set in the robot R via the management computer 3.
The excluded angle range r is the range of directions that are not estimated as a sound source, measured about the support axis O from the direction in which the front of the torso R4 is facing (B axis, see FIG. 7). For example, ∠B = +120 to +180 to +240 degrees is stored as the excluded angle range r.

この除外角度範囲ｒは、音声入力部ＭＣとロボットＲの動作環境とに基づき予め設定される範囲である。例えば、ロボットＲの背面格納部Ｒ５に格納されたファンＦ（ノイズ発生源）から出力されるノイズ音を除去するために、予め、音声入力部ＭＣから入力される、ファンＦによるノイズ音の音響信号データに対して、音声データ処理部２２に全方向音圧成分データ（例えば、ＭＵＳＩＣスペクトラム）を生成させる。この全方向音圧成分データからノイズ音の音圧成分を抽出して、ノイズ音が与える影響の範囲を考慮する。
ここで、ノイズ音が与える影響とは、後記する音声認識部２５が、ファンＦ（ノイズ発生源）以外からの音声を、ノイズ音により音声として認識しないことである。 The excluded angle range r is set in advance on the basis of the voice input units MC and the operating environment of the robot R. For example, to remove the noise sound output from the fan F (noise generation source) housed in the rear storage unit R5 of the robot R, the audio data processing unit 22 is first made to generate omnidirectional sound pressure component data (for example, a MUSIC spectrum) from acoustic signal data of the fan noise input from the voice input units MC. The sound pressure component of the noise sound is extracted from this omnidirectional sound pressure component data, and the range over which the noise sound has an effect is taken into account.
Here, the effect of the noise sound means that the speech recognition unit 25, described later, fails because of the noise sound to recognize as speech a voice arriving from anywhere other than the fan F (noise generation source).

（フレーズＤＢ）
フレーズＤＢ３３は、フレーズパターン３３１と、方向角度値パターン３３２と、フレーズ入力方向適合度３３３（音声入力方向適合度）とを対応付けて記憶するデータベースである。
図６は、フレーズＤＢの一例を示す図である。
フレーズパターン３３１は、人間が発声するフレーズであり、特に、人（発話者Ｙ，図１参照）が他人（ロボットＲ，図１参照）を自分に振り向かせるために、呼びかけるときに発するフレーズが複数記憶されている。例えば、「こっち向いて」や「こっち見て」、「おーい」、「ごめんください」などである。 (Phrase DB)
The phrase DB 33 is a database that stores the phrase pattern 331, the direction angle value pattern 332, and the phrase input direction suitability 333 (speech input direction suitability) in association with each other.
FIG. 6 is a diagram illustrating an example of the phrase DB.
The phrase pattern 331 holds phrases uttered by humans, in particular a plurality of phrases that a person (speaker Y, see FIG. 1) utters when calling out to make another party (robot R, see FIG. 1) turn toward them. Examples are "Turn this way" (こっち向いて), "Look over here" (こっち見て), "Hey" (おーい), and "Excuse me" (ごめんください).

方向角度値パターン３３２は、顔面が向いている方向（Ｈ軸）（図３（ｂ）参照）からなす角度であり、３６０度全方向の方向角度値が記憶されている。
図６において、方向角度値は、２０度毎の値としているが、１度毎の値であってもよい。 The direction angle value pattern 332 is an angle formed from the direction in which the face is facing (H axis) (see FIG. 3B), and 360 degree directional angle values are stored.
In FIG. 6, the direction angle value is a value every 20 degrees, but may be a value every 1 degree.

フレーズ入力方向適合度３３３（音声入力方向適合度）は、人間が発声するフレーズが、ロボットＲに対してどの角度から発話（入力）されやすいかを示す値であり、つまり、フレーズと方向角度値との関係の正しさ（フレーズと方向角度値との関係が正しい確率）を示す値である。
例えば、「こっち向いて」というフレーズは、ロボットＲの顔面が向いている方向（Ｈ軸上、方向角度値＝０度）から言われるフレーズではなく、ロボットＲの顔面が向いていない方向（横や背面）から言われるフレーズであることに着目して、フレーズ入力方向適合度３３３は、ロボットＲの管理者により予め設定された値である。 The phrase input direction suitability 333 (speech input direction suitability) is a value indicating from which angle relative to the robot R a phrase uttered by a human is likely to be spoken (input); in other words, it indicates how plausible the pairing of a phrase with a direction angle value is (the probability that the pairing is correct).
The phrase input direction suitability 333 is set in advance by the administrator of the robot R, based on observations such as the fact that the phrase "Turn this way" (こっち向いて) is spoken not from the direction the face of the robot R is facing (on the H axis, direction angle value = 0 degrees) but from a direction it is not facing (the side or the back).

図６においてフレーズが「こっち向いて」の場合は、方向角度値が「０度」におけるフレーズ入力方向適合度３３３の値は低く設定され（＝０．００）、一方、方向角度値が「１８０度」におけるフレーズ入力方向適合度３３３の値は高く設定されている（＝０．１０）。
また、フレーズが「あなたのお名前は？」の場合は、ロボットＲの顔面が向いている方向（Ｈ軸上、方向角度値＝０度）から言われるフレーズである。そのため、方向角度値が「０度」におけるフレーズ入力方向適合度３３３の値は高く設定され（＝０．３０）、一方、方向角度値が「１８０度」におけるフレーズ入力方向適合度３３３の値は低く設定されている（＝０．００）。
ここで、図６のフレーズＤＢ３３は、フレーズそれぞれで、全方向角度値のフレーズ入力方向適合度３３３の値の合計が１となるように、方向角度値ごとに、フレーズ入力方向適合度３３３の値が設定されている。 In FIG. 6, for the phrase "Turn this way" (こっち向いて), the phrase input direction suitability 333 is set low (= 0.00) at a direction angle value of 0 degrees and high (= 0.10) at a direction angle value of 180 degrees.
The phrase "What is your name?" (あなたのお名前は？), by contrast, is one spoken from the direction the face of the robot R is facing (on the H axis, direction angle value = 0 degrees). Its phrase input direction suitability 333 is therefore set high (= 0.30) at a direction angle value of 0 degrees and low (= 0.00) at a direction angle value of 180 degrees.
In the phrase DB 33 of FIG. 6, the phrase input direction suitability 333 is set for each direction angle value so that, for each phrase, the values over all direction angle values sum to 1.
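
The sum-to-1 condition on the phrase DB can be illustrated with a small data structure. This is only a sketch: the raw weights below are hypothetical and are not the values of FIG. 6, and `normalize_suitability` is an illustrative helper name.

```python
def normalize_suitability(raw_weights):
    """Scale the per-angle weights of one phrase so that they sum to 1."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one direction must have a positive weight")
    return {angle: weight / total for angle, weight in raw_weights.items()}


# Direction angle values every 20 degrees, as in FIG. 6 (0, 20, ..., 340).
ANGLES = list(range(0, 360, 20))

# Hypothetical raw weights: the phrase is not expected from the front (0 degrees)
# and is most expected from directly behind (180 degrees).
raw_turn = {}
for angle in ANGLES:
    if angle == 0:
        raw_turn[angle] = 0.0
    elif angle == 180:
        raw_turn[angle] = 2.0
    else:
        raw_turn[angle] = 1.0

phrase_db = {"こっち向いて": normalize_suitability(raw_turn)}
```

Normalizing per phrase keeps the values comparable as probabilities across direction angle values, which is the property the text requires.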

［音声処理部の構成］
図４に示すように、音声処理部２０は、音声データ処理部２２と、音源方向推定部２３と、推定音源分離部２４と、音声認識部２５と、音源定位部２６とを含んで構成される。 [Configuration of audio processing unit]
As shown in FIG. 4, the speech processing unit 20 includes an audio data processing unit 22, a sound source direction estimation unit 23, an estimated sound source separation unit 24, a speech recognition unit 25, and a sound source localization unit 26.

（音声データ処理部）
本実施形態において、音声データ処理部２２は、信号分解アルゴリズムとしてＭＵＳＩＣ法を用いる。音声データ処理部２２は、音声入力部ＭＣ（ＭＣ１，ＭＣ２，ＭＣ３，・・・）から入力された音響信号データからＭＵＳＩＣ法を用いて、雑音信号の空間を推定して、予め準備しておいた７２方向の伝達関数ベクトルｖ(θ)それぞれを、その雑音信号の空間に射影することにより、ＭＵＳＩＣスペクトラムＰ_avg(θ)（全方向音圧成分データ）を生成する処理部である。 (Audio data processing unit)
In the present embodiment, the audio data processing unit 22 uses the MUSIC method as the signal decomposition algorithm. Using the MUSIC method, the audio data processing unit 22 estimates the noise-signal space from the acoustic signal data input from the voice input units MC (MC1, MC2, MC3, ...) and projects each of the 72 transfer function vectors v(θ) prepared in advance onto that space, thereby generating the MUSIC spectrum P_avg(θ) (omnidirectional sound pressure component data).

伝達関数ベクトルｖ(θ)は、予めシミュレーションして取得したベクトル値である。
シミュレーションは、頭部の向きを基準としたθ方向の音源からインパルスを出力する。そのインパルスが入力された各音声入力部ＭＣ（ＭＣ１，ＭＣ２，ＭＣ３，・・・）が出力するインパルス応答を音声データ処理部２２が取得する。そして、音声データ処理部２２は、そのインパルス応答に離散フーリエ変換を施し、周波数領域に変換することで、伝達関数ベクトルｖ(θ)が得られる。 The transfer function vector v (θ) is a vector value obtained by simulation in advance.
In the simulation, an impulse is output from a sound source in the θ direction with reference to the head direction. The voice data processing unit 22 acquires an impulse response output from each voice input unit MC (MC1, MC2, MC3,...) To which the impulse is input. Then, the audio data processing unit 22 performs a discrete Fourier transform on the impulse response and converts the impulse response into the frequency domain, thereby obtaining a transfer function vector v (θ).

シミュレーションは次のように行う。頭部の向きを基準とした０度の方向（θ＝０）に位置する音源に対応する伝達関数ベクトルｖ₁を算出する場合、０度の方向からインパルスを出力する。音声データ処理部２２は、各音声入力部ＭＣ（ＭＣ１，ＭＣ２，ＭＣ３，・・・）から取得したインパルス応答に離散フーリエ変換を施し、周波数領域に変換することで、伝達関数ベクトルｖ₁を算出することができる。
本実施形態では、３６０度全方向を５度で等分した７２方向に音源がある場合のシミュレーションを予め行っておき、７２個の伝達関数ベクトルｖ(θ)（ｖ₁，ｖ₂，・・・，ｖ₇₂）を算出しておく。 The simulation proceeds as follows. To calculate the transfer function vector v₁ for a sound source located at 0 degrees (θ = 0) relative to the orientation of the head, an impulse is output from the 0-degree direction. The audio data processing unit 22 applies a discrete Fourier transform to the impulse responses acquired from the voice input units MC (MC1, MC2, MC3, ...), converting them to the frequency domain, and thereby obtains the transfer function vector v₁.
In the present embodiment, simulations are run in advance for sound sources in 72 directions, obtained by dividing the full 360 degrees into equal 5-degree steps, and 72 transfer function vectors v(θ) (v₁, v₂, ..., v₇₂) are calculated.
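
The impulse-response simulation described above can be sketched as follows. The text does not specify the acoustic model, so this sketch assumes a free-field, far-field source and a circular array, with delays rounded to whole samples; the sampling rate, array radius, and response length are hypothetical values, and NumPy is used for the discrete Fourier transform.

```python
import numpy as np

FS = 16000       # sampling rate [Hz] (assumed)
C = 343.0        # speed of sound [m/s]
RADIUS = 0.1     # array radius [m] (assumed)
N_FFT = 512      # impulse response length in samples (assumed)


def impulse_responses(theta_deg, mic_angles_deg):
    """Free-field impulse responses at each unit for a far-field source at theta."""
    theta = np.deg2rad(theta_deg)
    mics = np.deg2rad(np.asarray(mic_angles_deg, dtype=float))
    # A unit facing the source receives the wavefront earlier than the array centre.
    delays = -RADIUS * np.cos(theta - mics) / C
    samples = np.round((delays - delays.min()) * FS).astype(int)
    h = np.zeros((len(mics), N_FFT))
    h[np.arange(len(mics)), samples] = 1.0   # one delayed unit impulse per unit
    return h


def transfer_function_db(mic_angles_deg, step_deg=5):
    """Map each direction (0, 5, ..., 355) to its transfer function vectors v(theta)."""
    db = {}
    for theta in range(0, 360, step_deg):
        h = impulse_responses(theta, mic_angles_deg)
        db[theta] = np.fft.rfft(h, axis=1).T   # shape: (frequency bins, units)
    return db


db = transfer_function_db([i * 45 for i in range(8)])
```

In a real robot head the responses would also include diffraction around the head, which this free-field sketch ignores.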

（ＭＵＳＩＣスペクトラム算出処理）
そして音声データ処理部２２は、ＭＵＳＩＣスペクトラム算出処理を行う。
まず、音声データ処理部２２は、音声入力部ＭＣから音響信号データを取得し、そして、記憶部３０の伝達関数ＤＢ３４から７２個の伝達関数ベクトルｖ(θ)（ｖ₁，ｖ₂，ｖ₃，・・・，ｖ₇₂）を取得する。
そして、音声データ処理部２２は、式（１）を用いてＭＵＳＩＣスペクトラムＰ(θ)を計算する。ｅは固有ベクトルであり、音響信号データから算出される値である。算出手段は後記する。
ここで、Ｍは音声入力部ＭＣの数である。また、Ｎは音源として認識可能な最大数を示し、所定の値に設定することができる。本実施形態では、「Ｎ＝３」と設定しておく。Ｔは転置行列であることを示す。 (MUSIC spectrum calculation process)
Then, the audio data processing unit 22 performs a MUSIC spectrum calculation process.
First, the audio data processing unit 22 acquires the acoustic signal data from the voice input units MC and the 72 transfer function vectors v(θ) (v₁, v₂, v₃, ..., v₇₂) from the transfer function DB 34 of the storage unit 30.
Then, the audio data processing unit 22 calculates the MUSIC spectrum P (θ) using Expression (1). e is an eigenvector, which is a value calculated from acoustic signal data. The calculation means will be described later.
Here, M is the number of voice input units MC. N indicates the maximum number that can be recognized as a sound source, and can be set to a predetermined value. In this embodiment, “N = 3” is set. T indicates a transposed matrix.

（固有ベクトルｅの算出）
ここで、固有ベクトルｅは次のようにして、音響信号データから算出する。
まず、音声データ処理部２２は、音響信号データに離散フーリエ変換を施して、周波数領域に変換し、スペクトルｘを算出する。
そして、相関行列Ｒ_xxは、次の期待値Ｅで示す式（２）で示すことができる。 (Calculation of eigenvector e)
Here, the eigenvector e is calculated from the acoustic signal data as follows.
First, the audio data processing unit 22 performs discrete Fourier transform on the acoustic signal data, converts it into the frequency domain, and calculates the spectrum x.
The correlation matrix R_xx is then given by Expression (2), written in terms of the expectation E.

この相関行列Ｒ_xxが式（３）を満たすような固有値λと、固有ベクトルｅとを算出する。これにより、雑音信号の空間への射影行列である固有ベクトルｅを取得することができる。 The eigenvalues λ and eigenvectors e for which the correlation matrix R_xx satisfies Expression (3) are then calculated. This yields the eigenvectors e, which form the projection matrix onto the noise-signal space.

音声データ処理部２２は、式（３）を満たすすべての固有値λと固有ベクトルｅとの組を保持する。このとき保持した組の数をＫとする。
そして、音声データ処理部２２は、固有値λが一番大きい組からＮ＋１〜Ｋ番目の固有値λと固有ベクトルｅとの組を取得する。このＫ−Ｎ個の固有ベクトルｅを用いて、式（１）に示すＭＵＳＩＣスペクトラムＰ(θ)を算出する。 The audio data processing unit 22 holds a set of all eigenvalues λ and eigenvectors e that satisfy Expression (3). Let K be the number of sets held at this time.
The audio data processing unit 22 then takes the (N+1)-th through K-th pairs of eigenvalue λ and eigenvector e, counted from the pair with the largest eigenvalue λ. These K−N eigenvectors e are used to calculate the MUSIC spectrum P(θ) of Expression (1).

ここで、通常、式（３）を満たす組は、音声入力部ＭＣの数（Ｍ）だけ存在する。そのため、音源として認識可能な最大数として設定されるＮの値は、Ｎ＜Ｍであることが好ましい。 Normally, the number of pairs satisfying Expression (3) equals the number of voice input units MC (M). The value of N, set as the maximum number of recognizable sound sources, is therefore preferably such that N < M.
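
The calculation for a single frequency bin can be sketched as follows. Expressions (1) to (3) are not reproduced in this text, so the sketch assumes one common form of the MUSIC spectrum, the ratio of |v(θ)ᴴv(θ)| to the sum of |v(θ)ᴴe| over the K−N noise eigenvectors, and estimates the expectation in R_xx by averaging over frames. The function and argument names are illustrative.

```python
import numpy as np


def music_spectrum(x_frames, steering, n_sources=3):
    """MUSIC spectrum P(theta) for one frequency bin.

    x_frames: complex array (frames, M), spectra x of the M units at this bin.
    steering: complex array (directions, M), transfer function vectors v(theta).
    Returns (P, largest eigenvalue).
    """
    n_frames = x_frames.shape[0]
    # Sample estimate of the correlation matrix R_xx = E[x x^H].
    r_xx = x_frames.T @ x_frames.conj() / n_frames
    # Eigen-decomposition of the Hermitian matrix, sorted by descending eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(r_xx)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    noise = eigvecs[:, n_sources:]            # (N+1)-th to K-th eigenvectors
    numerator = np.abs(np.sum(steering.conj() * steering, axis=1))
    denominator = np.sum(np.abs(steering.conj() @ noise), axis=1)
    return numerator / denominator, eigvals[0]
```

A direction whose transfer function vector is nearly orthogonal to the noise eigenvectors gives a small denominator and therefore a sharp peak in P(θ).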

以上のようにして、音声データ処理部２２は、ＭＵＳＩＣスペクトラム算出処理を行い、時刻ｔにおける、周波数ωのＭＵＳＩＣスペクトラムＰ(θ)を取得することができる。
そして、音声データ処理部２２は、周波数毎にＭＵＳＩＣスペクトラムＰ(θ)の算出処理を行い、所定の周波数帯域のＭＵＳＩＣスペクトラムＰ(θ)を取得する。
ここで、所定の周波数帯域とは、発話者が発する音声の音圧が大きい周波数帯域であり、かつ雑音の音圧が小さい周波数帯域が望ましく、例えば、0.5〜2.8kHzであればよい。 As described above, the audio data processing unit 22 can perform the MUSIC spectrum calculation process and acquire the MUSIC spectrum P (θ) of the frequency ω at time t.
Then, the audio data processing unit 22 performs a calculation process of the MUSIC spectrum P (θ) for each frequency, and acquires the MUSIC spectrum P (θ) in a predetermined frequency band.
Here, the predetermined frequency band is preferably one in which the sound pressure of the speaker's voice is high and the sound pressure of noise is low, for example 0.5 to 2.8 kHz.

そして、音声データ処理部２２は、各周波数帯域のＭＵＳＩＣスペクトラムＰ(θ)を広帯域信号に拡張する。
音声データ処理部２２は、音響信号データからＳ／Ｎ比がよい（ノイズが少ない）周波数帯域ωを抽出し、広帯域信号へと拡張したときに、周波数帯域ωのＭＵＳＩＣスペクトラムＰ(θ)が強く反映されるように、周波数帯域ωの音響信号データから式（３）で得た一番大きい固有値λ_maxを用いて、式（４）に示すように、ＭＵＳＩＣスペクトラムＰ(θ)に重み付けをして、総和を計算する。これにより、広帯域のＭＵＳＩＣスペクトラムＰ_avg(θ)を取得する。 Then, the audio data processing unit 22 extends the MUSIC spectrum P (θ) of each frequency band to a wideband signal.
The audio data processing unit 22 extracts from the acoustic signal data the frequency bands ω with a good S/N ratio (little noise). So that the MUSIC spectrum P(θ) of such a band ω is strongly reflected when extending to a wideband signal, it weights each MUSIC spectrum P(θ) using the largest eigenvalue λ_max obtained by Expression (3) from the acoustic signal data of that band, as shown in Expression (4), and computes the sum. This yields the broadband MUSIC spectrum P_avg(θ).

Ω：周波数帯域の集合、|Ω|：集合Ωの要素数

Ω: Set of frequency bands, | Ω |: Number of elements in set Ω
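
The broadband integration can be sketched as a weighted average over the set Ω. Expression (4) is not reproduced in this text, so the exact weighting is an assumption: the sketch below weights each band by λ_max itself and divides by |Ω|, and the patent's expression may use a different function of λ_max.

```python
def broadband_spectrum(spectra, lambda_max):
    """Combine narrowband spectra P(theta) into P_avg(theta).

    spectra: dict mapping a frequency band to its list of P(theta) values.
    lambda_max: dict mapping the same bands to their largest eigenvalue.
    """
    bands = list(spectra)
    n_dirs = len(spectra[bands[0]])
    p_avg = [0.0] * n_dirs
    for band in bands:
        weight = lambda_max[band]  # assumed weight (see the note above)
        for i, p in enumerate(spectra[band]):
            p_avg[i] += weight * p
    return [value / len(bands) for value in p_avg]  # divide by |Omega|
```

Bands with a large λ_max, which tend to be bands with strong signal power, contribute more to the result under this weighting.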

以上により、音声データ処理部２２は、ＭＵＳＩＣスペクトラムＰ_avg(θ)（全方向音圧成分データ）を生成することができる。 As described above, the audio data processing unit 22 can generate the MUSIC spectrum P _avg (θ) (omnidirectional sound pressure component data).

（音源方向推定部）
音源方向推定部２３は、フィルタ部２３１と、角度抽出部２３２とを含んで構成される。
フィルタ部２３１は、音声データ処理部２２からＭＵＳＩＣスペクトラムＰ_avg(θ)（全方向音圧成分データ）を取得して、除外角度範囲ＤＢ３２から除外角度範囲ｒを取得して、自律移動制御部５０から頭回動部Ｒ１１の回動角θを取得する。
そして、フィルタ部２３１は、Ｈ軸が基準となるＭＵＳＩＣスペクトラムＰ_avg(θ)から、Ｂ軸を基準とした回動角θを考慮して、Ｂ軸を基準とした除外角度範囲ｒに該当する範囲内のデータを除去して、推定範囲スペクトラム（有効方向音圧成分データ）を抽出する。 (Sound source direction estimation unit)
The sound source direction estimation unit 23 includes a filter unit 231 and an angle extraction unit 232.
The filter unit 231 acquires the MUSIC spectrum P_avg(θ) (omnidirectional sound pressure component data) from the audio data processing unit 22, the excluded angle range r from the excluded angle range DB 32, and the rotation angle θ of the head rotation unit R11 from the autonomous movement control unit 50.
Taking into account the rotation angle θ relative to the B axis, the filter unit 231 then removes from the MUSIC spectrum P_avg(θ), which is referenced to the H axis, the data falling within the excluded angle range r, which is referenced to the B axis, and extracts the estimated range spectrum (effective direction sound pressure component data).
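
The filtering step can be sketched as follows, assuming the spectrum is sampled every 5 degrees on the H axis, that a rightward head rotation is positive (so a B-axis angle is the H-axis angle plus the rotation angle), and that the excluded range is the example ∠B = +120 to +240 degrees. The function names are illustrative.

```python
def wrap180(angle):
    """Normalise an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def is_excluded(angle_h, head_rotation, excl_start_b=120.0, excl_end_b=240.0):
    """True if an H-axis direction falls inside the excluded range r (stored on the B axis)."""
    angle_b = angle_h + head_rotation  # convert from the H axis to the B axis
    return (angle_b - excl_start_b) % 360.0 <= (excl_end_b - excl_start_b) % 360.0


def estimated_range_spectrum(p_avg, head_rotation, step_deg=5):
    """Keep the (H-axis angle, value) pairs outside r; p_avg[i] is the value at i * step_deg."""
    return [
        (wrap180(i * step_deg), value)
        for i, value in enumerate(p_avg)
        if not is_excluded(i * step_deg, head_rotation)
    ]
```

For example, with a head rotation of 45 degrees the excluded directions on the H axis run from +75 to +195 degrees, so samples in that interval are dropped before peak extraction.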

角度抽出部２３２は、推定範囲スペクトラムからすべてのピーク音圧値（Peak1，Peak2，・・・）と、それらピーク音圧値の角度（推定音源角度）とを抽出する。この推定音源角度が成す方向が推定音源方向である。ここで、本実施形態における推定音源角度は、Ｈ軸を基準とした角度∠φ_H（∠φ_H1，∠φ_H2，・・・）とする。
ここで、ピーク音圧値について説明する。推定範囲スペクトラムにおいて、角度θの音圧値が、直前の角度の音圧値および直後の角度の音圧値よりも大きな値である場合に、その角度θの音圧値がピーク音圧値である。また、その角度θが推定音源角度∠φ_Hである。
そして、角度抽出部２３２は、抽出したすべてのピーク音圧値の推定音源角度∠φ_H（∠φ_H1，∠φ_H2，・・・）を推定音源分離部２４に出力する。
この推定音源角度∠φ_Hは、頭部Ｒ１の正面（顔面）を推定音源方向に向けるまでに必要な回動角度でもある。 The angle extraction unit 232 extracts all peak sound pressure values (Peak1, Peak2,...) And the angles (estimated sound source angles) of these peak sound pressure values from the estimated range spectrum. The direction formed by the estimated sound source angle is the estimated sound source direction. Here, the estimated sound source angle in this embodiment is an angle ∠φ _H (∠φ _H1 , ∠φ _H2 ,...) With the H axis as a reference.
The peak sound pressure value is defined as follows. In the estimated range spectrum, when the sound pressure value at an angle θ is larger than both the value at the immediately preceding angle and the value at the immediately following angle, the sound pressure value at θ is a peak sound pressure value, and θ is an estimated sound source angle ∠φ_H.
The angle extraction unit 232 then outputs the estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...) of all the extracted peak sound pressure values to the estimated sound source separation unit 24.
This estimated sound source angle ∠φ _H is also a rotation angle necessary until the front surface (face) of the head R1 faces the estimated sound source direction.
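
The peak definition above amounts to a local-maximum search. A minimal sketch follows, assuming a spectrum sampled every 5 degrees on the H axis with removed (excluded) samples marked as `None`; how samples bordering the excluded range are treated is not stated in the text, so skipping them is an assumption.

```python
def extract_peaks(values, step_deg=5):
    """Return (H-axis angle, value) pairs that exceed both immediate neighbours.

    values[i] is the sound pressure value at i * step_deg, or None if removed.
    """
    n = len(values)
    peaks = []
    for i, value in enumerate(values):
        prev_value = values[i - 1]            # index -1 wraps around 360 degrees
        next_value = values[(i + 1) % n]
        if value is None or prev_value is None or next_value is None:
            continue  # samples next to the excluded range are skipped (assumption)
        if value > prev_value and value > next_value:
            angle = (i * step_deg + 180) % 360 - 180   # angle in [-180, 180)
            peaks.append((angle, value))
    return peaks
```

Each returned angle is an estimated sound source angle in the sense defined above, expressed relative to the H axis.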

ここで、音声入力部ＭＣから入力された音響信号データから音声データ処理部２２が生成したＭＵＳＩＣスペクトラムＰ_avg(θ)（全方向音圧成分データ）を取得して、音源方向推定部２３（フィルタ部２３１および角度抽出部２３２）が、音源方向を推定するまでの処理について、図７と図８とを用いて説明する。図７は、Ｂ軸と、Ｈ軸と、除外角度範囲ｒとの一例を示す図であり、図８は、音声データ処理部２２にて生成されたＭＵＳＩＣスペクトラムの一例を示す図である。 The processing by which the sound source direction estimation unit 23 (the filter unit 231 and the angle extraction unit 232) acquires the MUSIC spectrum P_avg(θ) (omnidirectional sound pressure component data), generated by the audio data processing unit 22 from the acoustic signal data input from the voice input units MC, and estimates the sound source direction is described next with reference to FIGS. 7 and 8. FIG. 7 shows an example of the B axis, the H axis, and the excluded angle range r, and FIG. 8 shows an example of the MUSIC spectrum generated by the audio data processing unit 22.

この図７は、頭部Ｒ１の顔面が向いている方向（Ｈ軸）が、胴部Ｒ４の正面が向いている方向（Ｂ軸）から右に（θ＝）３０度の方向である場合の図である。このとき、Ｈ軸（∠Ｈ＝０度）は、Ｂ軸を基準（０度）として換算すると、∠Ｂ＝＋３０度で示される。
そして、背面格納部Ｒ５に格納されたファンＦ（ノイズ発生源）から出力されるノイズ音を除去するために、除外角度範囲ｒ（ｒ＝１２０度）は、∠Ｂ＝＋１２０〜＋１８０〜＋２４０度の範囲で記憶されている。これは、Ｈ軸で換算すると、∠Ｈ＝＋８０〜＋１４０〜＋２００度の範囲である。 FIG. 7 shows the case where the direction in which the face of the head R1 is facing (H axis) is 30 degrees (θ = 30) to the right of the direction in which the front of the torso R4 is facing (B axis). Expressed relative to the B axis (0 degrees), the H axis (∠H = 0 degrees) is then at ∠B = +30 degrees.
To remove the noise sound output from the fan F (noise generation source) housed in the rear storage unit R5, the excluded angle range r (r = 120 degrees) is stored as the range ∠B = +120 to +180 to +240 degrees. Converted to the H axis, this is the range ∠H = +80 to +140 to +200 degrees.

図８は、図７で示す状態であるときに、音声データ処理部２２が、頭部Ｒ１の音声入力部ＭＣ（ＭＣ１〜ＭＣ８）に入力された音響信号データにＭＵＳＩＣ法を用いて生成した３６０度全方向からの音圧値（スペクトラム強度）を示すＭＵＳＩＣスペクトラムＰ_avg(θ)（全方向音圧成分データ）である。そして、縦軸は、スペクトル強度を示す音圧［ｄＢ］、横軸は、Ｈ軸を基準とした方向を示す角度［度］である。 FIG. 8 shows the MUSIC spectrum P_avg(θ) (omnidirectional sound pressure component data), indicating the sound pressure values (spectrum intensity) from all directions through 360 degrees, that the audio data processing unit 22 generates by applying the MUSIC method to the acoustic signal data input to the voice input units MC (MC1 to MC8) of the head R1 in the state shown in FIG. 7. The vertical axis is the sound pressure [dB] indicating the spectrum intensity, and the horizontal axis is the angle [degrees] indicating the direction relative to the H axis.

フィルタ部２３１は、このＭＵＳＩＣスペクトラムＰ_avg(θ)において除外角度範囲ｒに該当する範囲のデータを除去して、スペクトラム（推定範囲スペクトラム）を取得する。図８における推定範囲スペクトラムは、除外角度範囲ｒ（∠Ｈ＝＋８０〜＋１４０〜＋２００）以外の∠Ｈ＝−１６０〜±０〜＋８０度で示されるスペクトラムデータである（図８で太線で示す）。 The filter unit 231 removes from this MUSIC spectrum P_avg(θ) the data in the range corresponding to the excluded angle range r and obtains a spectrum (the estimated range spectrum). The estimated range spectrum in FIG. 8 is the spectrum data for ∠H = −160 to ±0 to +80 degrees, that is, everything other than the excluded angle range r (∠H = +80 to +140 to +200); it is drawn with a thick line in FIG. 8.

角度抽出部２３２は、この推定範囲スペクトラムから、ピーク音圧値Peak1の推定音源角度∠φ_H1（∠φ_H）（∠Ｈ＝−７０度）と、ピーク音圧値Peak2の推定音源角度∠φ_H2（∠φ_H）（∠Ｈ＝＋１０度）とを抽出する。そして、推定音源角度∠φ_H1および∠φ_H2を、推定音源分離部２４に出力する。 From this estimated range spectrum, the angle extraction unit 232 estimates the estimated sound source angle ∠φ _H1 (∠φ _H ) (∠H = −70 degrees) of the peak sound pressure value Peak1 and the estimated sound source angle ∠φ of the peak sound pressure value Peak2. _H2 (∠φ _H ) (∠H = + 10 degrees) is extracted. Then, the estimated sound source angles ∠φ _H1 and ∠φ _H2 are output to the estimated sound source separation unit 24.

以上のように、フィルタ部２３１と角度抽出部２３２との処理により、ＭＵＳＩＣスペクトラムにおける最大音圧値（ＭＡＸ）（図８参照）は、除外角度範囲ｒ（∠Ｈ＝＋８０〜＋１４０〜＋２００）内であるために、削除される。 As described above, through the processing of the filter unit 231 and the angle extraction unit 232, the maximum sound pressure value (MAX) in the MUSIC spectrum (see FIG. 8) is removed because it lies within the excluded angle range r (∠H = +80 to +140 to +200).

(Estimated sound source separation unit)
The estimated sound source separation unit 24 acquires from the angle extraction unit 232 all the estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...) indicating the estimated sound source directions, and acquires the acoustic signal data from the voice input units MC (MC1, MC2, MC3, ...). From the acoustic signal data covering all directions over 360 degrees that is obtained from the acquired acoustic signal data, it separates and extracts the acoustic signal data that arrived from the direction of each estimated sound source angle ∠φ_H. Finally, it outputs all the estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...), together with the estimated sound source direction voice data separated and extracted for each angle, to the speech recognition unit 25.

The processing by which the estimated sound source separation unit 24 separates and extracts specific acoustic signal data from the acoustic signal data input from the plurality of voice input units MC uses a known technique disclosed in, for example, Japanese Patent Application Laid-Open No. 2008-306712.
This processing is briefly described below.

Each of the voice input units MC receives an audio signal in which the individual source signals from the respective sound sources are superimposed. The estimated sound source separation unit 24 acquires the acoustic signal data from these voice input units MC and, using blind source separation (BSS), separates the source signal (acoustic signal data) of each sound source from the acoustic signal data. It then extracts the acoustic signal data that arrived from the direction of the estimated sound source angle ∠φ_H (the estimated sound source direction voice data).
As the BSS method, a source separation method based on DSS (Decorrelation based Source Separation), ICA (Independent Component Analysis), or HDSS (Higher-order DSS) may be used, as may the methods that add geometric information to each of these: GSS (Geometric constrained Source Separation), GICA (Geometric constrained ICA), and GHDSS (Geometric constrained HDSS).

(Speech recognition unit)
The speech recognition unit 25 acquires from the estimated sound source separation unit 24 all the estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...) and the estimated sound source direction voice data for each angle.
The speech recognition unit 25 has a function of performing speech recognition on each item of estimated sound source direction voice data to generate character information (text data), and it generates a plurality of hypothesis phrases from one item of estimated sound source direction voice data. The hypothesis phrases are generated as text data.

Furthermore, the speech recognition unit 25 has a function of calculating a speech recognition likelihood indicating the correctness of a hypothesis phrase (the probability that the hypothesis phrase is correct). The speech recognition likelihood indicates, from the content of the hypothesis phrase and the surrounding sentence content (the preceding and following hypothesis phrases), whether the hypothesis phrase is correct sentence content.
<Speech recognition likelihood calculation processing of the speech recognition unit 25>
The speech recognition likelihood calculation processing performed by the speech recognition unit 25 is described below.
In general speech recognition, the speech recognition likelihood P(W|X) can be expressed by the following equation (5), where X is the speech signal and W is the phrase (a word sequence).

P(W|X): speech recognition likelihood, P(X|W): acoustic likelihood, P(W): language probability
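The body of equation (5) does not appear in this text. Given the three terms listed, one form consistent with them is Bayes' rule, shown here as an assumption rather than as the patent's exact notation. P(X), the probability of the speech signal, is a symbol introduced for this sketch and does not depend on W.

```latex
P(W \mid X) = \frac{P(X \mid W)\, P(W)}{P(X)} \tag{5}
```

Because P(X) is the same for every hypothesis phrase, ranking hypotheses by P(X|W)P(W) gives the same order as ranking them by P(W|X).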

The language probability P(W) may be calculated using either a grammar or a statistical language model.
When a grammar is used, P(W) = 1 if the phrase W is defined in the grammar and P(W) = 0 otherwise. In keeping with the definition of a probability, the values are finally normalized so that the values of P(W) sum to 1.
When a statistical language model is used, an N-gram is used that has learned from a large number of sentences how readily words follow one another. In general, an N-gram is a model trained on how readily sequences of N words occur.
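The grammar-based case can be sketched in a few lines. This is only an illustration under assumptions: the function name, the representation of the grammar as a set of accepted phrases, and the handling of the case where no candidate is accepted are not taken from the patent.

```python
def grammar_language_probabilities(candidates, grammar):
    """Assign P(W) = 1 to phrases defined in the grammar and 0 to the rest,
    then normalize so that the probabilities sum to 1.

    candidates: list of candidate phrases W
    grammar: set of phrases the grammar defines (a simplification for this sketch)
    """
    raw = {w: (1.0 if w in grammar else 0.0) for w in candidates}
    total = sum(raw.values())
    if total == 0.0:
        # No candidate is defined in the grammar; leave everything at 0 (assumption).
        return raw
    return {w: value / total for w, value in raw.items()}


probs = grammar_language_probabilities(
    ["cross the bridge", "cross the chopsticks", "turn left"],
    {"cross the bridge", "turn left"},
)
print(probs)
```

Here the two phrases defined in the grammar each receive 0.5 and the undefined phrase receives 0.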

Here, a statistical language model is used and the commonly used case N = 3 is described. In general, when a 3-gram statistical language model is trained, a 2-gram model is trained at the same time.
For example, the phrase 「はしをわたって」 ("hashi o watatte", crossing the bridge) is split into four words, 「橋｜を｜渡っ｜て」. The probability P(渡っ｜橋, を) of how readily the word 「渡っ」 appears after the two words 「橋｜を」 is learned and held in a database (a 3-gram database and a 2-gram database).
The probability P(橋｜を｜渡っ｜て) that the phrase W 「橋｜を｜渡っ｜て」 appears can thus be expressed by the following equation (6).
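The body of equation (6) does not appear in this text. A plausible form, assuming the usual chain-rule factorization of a 3-gram model with 2-gram terms for the first words, is sketched below. Here w_1 = 橋, w_2 = を, w_3 = 渡っ, w_4 = て, and the sentence-start symbol ⟨s⟩ is an assumption introduced for this sketch.

```latex
P(w_1 w_2 w_3 w_4) \approx
  P(w_1 \mid \langle s \rangle)\,
  P(w_2 \mid w_1)\,
  P(w_3 \mid w_1, w_2)\,
  P(w_4 \mid w_2, w_3) \tag{6}
```

The first two factors would come from the 2-gram database and the last two from the 3-gram database.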

The right-hand side is obtained from the 3-gram and 2-gram databases, and the left-hand side is the language probability P(W).
As a result, the speech recognition unit 25 calculates a large speech recognition likelihood P(W|X) for the hypothesis phrase 「橋を渡って」 ("crossing the bridge"), meaning the hypothesis phrase is likely to be correct, and a small speech recognition likelihood P(W|X) for the hypothesis phrase 「箸を渡って」 ("crossing the chopsticks"), meaning the hypothesis phrase is unlikely to be correct.
In this way, the speech recognition unit 25 performs the speech recognition likelihood calculation processing and calculates the speech recognition likelihood P(W|X).

The speech recognition unit 25 then outputs all the hypothesis phrases generated from the estimated sound source direction voice data for the direction indicated by the estimated sound source angle ∠φ_H, together with the speech recognition likelihood P(W|X) of each hypothesis phrase, to the sound source localization unit 26 (phrase confidence calculation unit 261) in association with that estimated sound source angle ∠φ_H.
The correspondence between voice data and text data is stored in advance in the storage unit 30.

(Sound source localization unit)
The sound source localization unit 26 includes a phrase confidence calculation unit 261 and a phrase determination unit 262.
(Phrase confidence calculation unit)
The phrase confidence calculation unit 261 acquires from the speech recognition unit 25 all the estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...), all the hypothesis phrases associated with each estimated sound source angle ∠φ_H, and the speech recognition likelihood P(W|X) of each hypothesis phrase.
The phrase confidence calculation unit 261 then extracts one estimated sound source angle ∠φ_H from all the estimated sound source angles and extracts one hypothesis phrase from all the hypothesis phrases associated with that angle (hereinafter, the extracted hypothesis phrase).
Next, the phrase confidence calculation unit 261 extracts, from the phrase patterns 331 of the phrase DB 33 (FIG. 6), the phrase that matches the extracted hypothesis phrase. It then extracts, from the direction angle value pattern 332 associated with that phrase, the direction angle value that matches the estimated sound source angle ∠φ_H. Finally, it extracts the phrase input direction suitability 333 (the probability that the relationship between the phrase and the direction angle value is correct) associated with that direction angle value.

Next, the phrase confidence calculation unit 261 calculates a phrase confidence, which indicates the relationship between the sound source direction of a phrase and the speech recognition result for that phrase. The larger the phrase confidence, the higher the probability that the speech recognition result for the extracted hypothesis phrase (the hypothesis phrase recognized by the speech recognition unit 25) is correct.
<Phrase confidence calculation processing of the phrase confidence calculation unit 261>
The phrase confidence calculation processing performed by the phrase confidence calculation unit 261 is described below.
Extending equation (5) above, which gives the speech recognition likelihood indicating the correctness of a hypothesis phrase (the probability that the hypothesis phrase is correct), with the phrase input direction suitability 333 yields the phrase confidence P(W|X, d), which can be expressed by the following equation (7). Here, X is the speech signal, W is the phrase (a word sequence), and d is the direction.

P(W|X): speech recognition likelihood, P(X|W): acoustic likelihood, P(W): language probability
P(W|X, d): phrase confidence, P(d|W): phrase input direction suitability
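The body of equation (7) does not appear in this text. Since the phrase confidence is described as the product of the phrase input direction suitability and the speech recognition likelihood, a form consistent with the listed terms is sketched below. The second equality substitutes a Bayes-rule form for P(W|X), which is an assumption, as is the symbol P(X).

```latex
P(W \mid X, d) = P(d \mid W)\, P(W \mid X)
              = P(d \mid W)\, \frac{P(X \mid W)\, P(W)}{P(X)} \tag{7}
```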

As described above, in the phrase confidence calculation processing the phrase confidence calculation unit 261 can calculate the phrase confidence P(W|X, d) by multiplying the phrase input direction suitability 333 (P(d|W)) acquired from the phrase DB 33 by the speech recognition likelihood P(W|X).

Here, the acoustic likelihood P(X|W) is a likelihood calculated from the input speech signal X and a database called an acoustic model. Assuming that the input speech signal X is the phrase W, it expresses how closely X resembles W. A hidden Markov model (HMM) is generally used as the acoustic model, and the Viterbi algorithm is one algorithm for calculating the acoustic likelihood of the input speech signal X for the phrase W.

The phrase confidence calculation unit 261 then outputs three values in association with one another, namely the estimated sound source angle ∠φ_H, the hypothesis phrase, and the calculated phrase confidence P(W|X, d), to the phrase determination unit 262.
The phrase confidence calculation unit 261 performs this processing for every estimated sound source angle ∠φ_H acquired from the speech recognition unit 25, calculating the phrase confidence P(W|X, d) of every hypothesis phrase for each estimated sound source angle ∠φ_H and outputting it to the phrase determination unit 262.
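The lookup and multiplication performed for one hypothesis phrase can be sketched as follows. The dictionary layout used for the phrase DB, the function name, the sample values, and the choice to return 0 when no matching entry exists are assumptions made for illustration; the text does not state how a missing entry is handled.

```python
def phrase_confidence(phrase_db, hypothesis, angle_deg, recognition_likelihood):
    """Return P(d|W) * P(W|X) for one hypothesis phrase and one estimated angle.

    phrase_db: dict mapping phrase -> {direction angle value [deg]: suitability}
               (a hypothetical stand-in for phrase pattern 331, direction angle
               value pattern 332 and phrase input direction suitability 333)
    """
    suitability = phrase_db.get(hypothesis, {}).get(angle_deg, 0.0)
    return suitability * recognition_likelihood


# Hypothetical phrase DB contents.
phrase_db = {
    "hey": {-70: 0.8, 10: 0.5},
    "stop": {-70: 0.2, 10: 0.9},
}

# One (angle, phrase, confidence) triple, as passed on to the phrase determination unit.
triple = (-70, "hey", phrase_confidence(phrase_db, "hey", -70, 0.5))
print(triple)
```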

(Phrase determination unit)
The phrase determination unit 262 performs the following processing for every estimated sound source angle ∠φ_H acquired from the phrase confidence calculation unit 261.
First, the phrase determination unit 262 acquires the phrase confidences of all the hypothesis phrases associated with one estimated sound source angle ∠φ_H. It then extracts the maximum phrase confidence from the acquired phrase confidences P(W|X, d) and determines the hypothesis phrase with that maximum phrase confidence to be the recognized phrase. This hypothesis phrase W_max is taken to be the utterance phrase spoken by a sound source (speaker Y, see FIG. 1) located in the direction that forms the estimated sound source angle ∠φ_H with the front of the robot R.
The phrase determination unit 262 then outputs the utterance phrase and the estimated sound source angle ∠φ_H to the main control unit 40.
The hypothesis phrase W_max with the maximum phrase confidence can be expressed by the following equation (8).

X: speech-recognized phrase, W: hypothesis phrase
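The body of equation (8) does not appear in this text. Since W_max is described as the hypothesis phrase whose phrase confidence is largest, a form consistent with that description is the following, given as an assumption rather than as the patent's exact notation.

```latex
W_{\max} = \underset{W}{\operatorname{arg\,max}}\; P(W \mid X, d) \tag{8}
```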

The phrase determination unit 262 performs the above processing for every estimated sound source angle ∠φ_H acquired from the phrase confidence calculation unit 261 and outputs the utterance phrase for each estimated sound source angle ∠φ_H to the main control unit 40.

The robot R according to the present embodiment can perform the following operations.
Having acquired the utterance phrase for each estimated sound source angle ∠φ_H, the main control unit 40 refers to the character information stored in the storage unit 30 and obtains the content of the utterance phrase. When the utterance phrase is a sentence that calls out to someone, the main control unit 40 can instruct the autonomous movement control unit 50 to rotate the head rotation unit R11 until the front (face) of the head R1 reaches the estimated sound source angle ∠φ_H. Thus, when a speaker calls out 「おーい」 ("Hey"), the robot R can turn its face toward that speaker.
Alternatively, the main control unit 40, having acquired the utterance phrase for each estimated sound source angle ∠φ_H, may acquire the peak sound pressure values at those angles, which the angle extraction unit 232 obtained from the estimated range spectrum. The main control unit 40 may then instruct the autonomous movement control unit 50 to rotate the head rotation unit R11 until the front (face) of the head R1 reaches the estimated sound source angle ∠φ_H with the largest peak sound pressure value. The robot R can thereby turn its face toward the speaker with the highest sound pressure, and so toward a speaker who called out loudly rather than one who is merely talking nearby.
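Choosing the turn target by peak sound pressure amounts to taking the angle with the largest peak value. A minimal sketch, in which the function name and the dictionary of peak values are assumptions for illustration:

```python
def angle_of_loudest_source(peak_levels_db):
    """Return the estimated sound source angle whose peak sound pressure is largest.

    peak_levels_db: dict mapping estimated sound source angle [deg] to the
                    peak sound pressure value [dB] found at that angle
    """
    if not peak_levels_db:
        return None  # nothing to turn toward
    return max(peak_levels_db, key=peak_levels_db.get)


# Peak1 at -70 degrees is louder than Peak2 at +10 degrees in this toy data.
print(angle_of_loudest_source({-70: 30.0, 10: 28.0}))  # -70
```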

[Sound source direction estimation processing of the robot]
Next, the sound source direction estimation processing of the robot is described with reference to the flowcharts of FIGS. 9 to 11 (see also FIGS. 1 to 8 as appropriate).
First, speech uttered by a user is input to the voice input units MC (MC1, MC2, MC3, ...) of the robot R (step S101), and acoustic signal data is input to the audio data processing unit 22 (step S102).
Next, the audio data processing unit 22 acquires the arrangement angle ∠H_mc (∠H_mc1, ∠H_mc2, ∠H_mc3, ...) of each voice input unit MC from the voice input angle DB 31 of the storage unit 30 and acquires the transfer function vector v(θ) from the transfer function DB 34 (step S103).
Using the MUSIC method, the audio data processing unit 22 then generates the MUSIC spectrum from the acoustic signal data input from each voice input unit MC, the arrangement angles ∠H_mc, and the transfer function vector v(θ) (step S104).

The filter unit 231 acquires the MUSIC spectrum (omnidirectional sound pressure component data) from the audio data processing unit 22, the excluded angle range r from the excluded angle range DB 32, and the rotation angle θ of the head rotation unit R11 from the autonomous movement control unit 50 (step S105).
Taking the rotation angle θ into account, the filter unit 231 then removes from the MUSIC spectrum the data within the excluded angle range r and extracts the estimated range spectrum (effective direction sound pressure component data) (step S106).
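One way to take the rotation angle θ into account is to shift the excluded angle range from the torso frame into the head frame before filtering. The sketch below assumes that the excluded range is stored relative to the torso direction, that θ is the head's rotation measured from the torso direction, and that both use the same sign convention; the function names are hypothetical.

```python
def wrap_deg(angle):
    """Map an angle in degrees to the interval [-180, 180)."""
    return (angle + 180) % 360 - 180


def excluded_range_in_head_frame(range_torso, head_rotation_deg):
    """Convert an excluded range (start, end) given relative to the torso
    into angles relative to the head, which has turned by head_rotation_deg.

    A direction fixed to the torso appears shifted by -theta when seen from the head.
    """
    start, end = range_torso
    return (wrap_deg(start - head_rotation_deg), wrap_deg(end - head_rotation_deg))


# Head facing the same way as the torso: the range is unchanged (+200 wraps to -160).
print(excluded_range_in_head_frame((80, 200), 0))   # (80, -160)
# Head turned by +30 degrees: the same torso-fixed range is seen at +50 to +170.
print(excluded_range_in_head_frame((80, 200), 30))  # (50, 170)
```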

The angle extraction unit 232 extracts from the estimated range spectrum all the peak sound pressure values (Peak1, Peak2, ...) and the angles at which they occur (the estimated sound source angles ∠φ_H) (step S107). It then outputs all the extracted estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...) to the estimated sound source separation unit 24.

The estimated sound source separation unit 24 acquires all the estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...) from the angle extraction unit 232 and acquires the acoustic signal data input from the voice input units MC (MC1, MC2, MC3, ...) (step S108). From the acquired acoustic signal data, it separates and extracts the acoustic signal data that arrived from the direction of each estimated sound source angle ∠φ_H (step S109). Finally, it outputs the estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...), together with the estimated sound source direction voice data separated and extracted for each angle, to the speech recognition unit 25.

The speech recognition unit 25 acquires from the estimated sound source separation unit 24 all the estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...) and the estimated sound source direction voice data for each angle (step S110).
(Speech recognition processing)
The speech recognition unit 25 performs speech recognition on the estimated sound source direction voice data input from the estimated sound source separation unit 24 to generate a plurality of hypothesis phrases (step S111) and, using a predetermined calculation means, calculates the speech recognition likelihood indicating the correctness of each hypothesis phrase (step S112). The speech recognition unit 25 performs this speech recognition processing on the estimated sound source direction voice data for every estimated sound source angle ∠φ_H.

Through the above processing, the speech recognition unit 25 obtains, for each item of estimated sound source direction voice data, the plurality of hypothesis phrases generated from it and the speech recognition likelihood of each hypothesis phrase. It then outputs to the phrase confidence calculation unit 261, in association with each estimated sound source angle ∠φ_H, all the hypothesis phrases generated from the estimated sound source direction voice data for that angle and the speech recognition likelihood of each hypothesis phrase.

The phrase confidence calculation unit 261 acquires from the speech recognition unit 25 all the estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...), all the hypothesis phrases associated with each estimated sound source angle ∠φ_H, and the speech recognition likelihood of each hypothesis phrase (step S113, FIG. 10).
It then extracts one estimated sound source angle ∠φ_H from the acquired estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...) (step S114) and extracts one of the hypothesis phrases associated with that angle (step S115).
Next, based on the extracted estimated sound source angle ∠φ_H and hypothesis phrase, the phrase confidence calculation unit 261 extracts from the phrase patterns 331 of the phrase DB 33 the phrase that matches the hypothesis phrase. It then extracts, from the direction angle value pattern 332 associated with that phrase, the direction angle value that matches the estimated sound source angle ∠φ_H, and acquires the phrase input direction suitability 333 associated with that direction angle value (step S116).

Next, the phrase confidence calculation unit 261 multiplies the phrase input direction suitability 333 acquired from the phrase DB 33 by the speech recognition likelihood of the hypothesis phrase acquired from the speech recognition unit 25 to calculate the phrase confidence of the hypothesis phrase (step S117). It then outputs the estimated sound source angle ∠φ_H, the hypothesis phrase, and the calculated phrase confidence, in association with one another, to the phrase determination unit 262.
The phrase confidence calculation unit 261 then determines whether any hypothesis phrase remains unextracted in step S115 (step S118). If one remains (step S118, Yes), the processing returns to step S115.

If no hypothesis phrase remains unextracted (step S118, No), the phrase confidence calculation unit 261 determines whether any estimated sound source angle remains unextracted in step S114 (step S119). If one remains (step S119, Yes), the processing returns to step S114.
If no estimated sound source angle remains unextracted (step S119, No), the processing moves to the phrase determination unit 262.

The phrase determination unit 262 acquires all the estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...) and the phrase confidences of all the hypothesis phrases associated with each estimated sound source angle ∠φ_H (step S120, FIG. 11).
First, the phrase determination unit 262 extracts one estimated sound source angle ∠φ_H from the acquired estimated sound source angles ∠φ_H (∠φ_H1, ∠φ_H2, ...) (step S121).
It then extracts the maximum phrase confidence from all the phrase confidences associated with the extracted estimated sound source angle ∠φ_H (step S122) and determines the hypothesis phrase with that phrase confidence to be the phrase that the robot R recognized from the direction of the estimated sound source angle ∠φ_H (step S123). One phrase is thereby determined for the direction of each estimated sound source angle ∠φ_H.

The phrase determination unit 262 then determines whether any estimated sound source angle remains unextracted in step S114 (step S124). If one remains (step S124, Yes), the processing returns to step S121.
If none remains (step S124, No), the phrase determination unit 262 outputs the phrase determined for each estimated sound source angle ∠φ_H by the processing so far to the main control unit 40 (step S125). The robot R then ends the sound source direction estimation processing.
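Steps S120 to S125 reduce to picking, for each estimated sound source angle, the hypothesis phrase with the largest phrase confidence. A minimal sketch under assumptions: the function name, the triple layout, and the sample values are hypothetical, and ties are resolved in favour of the first phrase seen.

```python
def decide_phrases(confidences):
    """Pick one phrase per estimated sound source angle.

    confidences: iterable of (angle_deg, hypothesis_phrase, phrase_confidence)
    Returns a dict mapping each angle to the phrase with the highest confidence.
    """
    best = {}
    for angle, phrase, confidence in confidences:
        # Keep the first phrase seen for an angle unless a strictly better one appears.
        if angle not in best or confidence > best[angle][1]:
            best[angle] = (phrase, confidence)
    return {angle: phrase for angle, (phrase, _) in best.items()}


triples = [
    (-70, "hey", 0.40),
    (-70, "hay", 0.05),
    (10, "stop", 0.63),
    (10, "shop", 0.10),
]
print(decide_phrases(triples))  # {-70: 'hey', 10: 'stop'}
```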

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented with appropriate modifications.
For example, although the excluded angle range DB 32 is described as storing the excluded angle range r, it may instead store the angle range calculated as 360 − r. Effective direction sound pressure component data can then be extracted from the MUSIC spectrum.
The angle extraction unit 232 may also use, as the estimated sound source angle, an angle ∠φ_B referenced to the B axis instead of the angle ∠φ_H referenced to the H axis.

In the excluded angle range DB 32, setting the excluded angle range r to less than 120° when there are three voice input units and to less than 180° when there are four ensures that at least two voice input units MC lie outside the excluded angle range r, so that the direction of the speaker Y (see FIG. 1), that is, the sound source direction to be estimated, can still be estimated.
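The limits of 120° and 180° can be checked numerically. The sketch below assumes the voice input units are evenly spaced around the support shaft and treats the excluded range as an arc with a start angle and a width; the function name and this parameterization are assumptions for illustration.

```python
def inputs_outside_range(input_angles_deg, range_start_deg, range_width_deg):
    """Count voice input units whose arrangement angle lies outside the arc that
    starts at range_start_deg and spans range_width_deg in the positive direction."""
    return sum(
        1
        for angle in input_angles_deg
        if (angle - range_start_deg) % 360 > range_width_deg
    )


three_inputs = [0, 120, 240]
four_inputs = [0, 90, 180, 270]

# Wherever the arc starts, a width just under the limit leaves two units outside.
print(min(inputs_outside_range(three_inputs, s, 119) for s in range(360)))  # 2
print(min(inputs_outside_range(four_inputs, s, 179) for s in range(360)))   # 2
# At exactly 120 degrees, three evenly spaced units can leave only one outside.
print(inputs_outside_range(three_inputs, 0, 120))  # 1
```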

The phrase DB 33 may store, for each phrase pattern 331, a two-dimensional histogram consisting of the direction angle value pattern 332 and the phrase input direction suitability 333. In that case, the phrase confidence calculation unit 261 acquires the phrase input direction suitability 333 that the two-dimensional histogram associates with the direction angle value corresponding to the estimated sound source angle ∠φ_H acquired from the speech recognition unit 25.
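One possible reading of the histogram variant is a table of suitability values per angle bin for each phrase. The sketch below is an assumption about the layout, not the patent's data format: the 30-degree bin width, the function names, and the sample values are all hypothetical.

```python
BIN_WIDTH_DEG = 30  # hypothetical bin width; 360 / 30 = 12 bins per phrase


def direction_bin(angle_deg, bin_width=BIN_WIDTH_DEG):
    """Return the index of the histogram bin that contains angle_deg."""
    return int((angle_deg % 360) // bin_width)


def suitability_from_histogram(histograms, phrase, angle_deg, bin_width=BIN_WIDTH_DEG):
    """Look up the phrase input direction suitability for a phrase and an angle.

    histograms: dict mapping phrase -> list of suitability values, one per bin
    Returns 0.0 for a phrase with no histogram (an assumption for this sketch).
    """
    bins = histograms.get(phrase)
    if bins is None:
        return 0.0
    return bins[direction_bin(angle_deg, bin_width)]


# Hypothetical histogram: "hey" is most plausible from around -70 degrees (bin 9).
histograms = {"hey": [0.1] * 12}
histograms["hey"][direction_bin(-70)] = 0.6

print(suitability_from_histogram(histograms, "hey", -70))  # 0.6
print(suitability_from_histogram(histograms, "hey", 10))   # 0.1
```

Binning lets an estimated angle such as −70 degrees match a stored entry without requiring an exact angle value in the database.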

The robot R may be a robot that communicates with the speaker Y, or it may be a security robot that points a camera C and a high-performance microphone (voice input unit MC) in the direction from which a sound is heard.

DESCRIPTION OF SYMBOLS
20 Speech processing unit
22 Speech data processing unit
23 Sound source direction estimation unit
24 Estimated sound source separation unit
25 Speech recognition unit
26 Sound source localization unit
30 Storage unit
31 Voice input angle DB
32 Exclusion angle range DB
33 Phrase DB
40 Main control unit
50 Autonomous movement control unit (action control unit) (rotation angle measurement unit)
231 Filter unit
232 Angle extraction unit
261 Phrase certainty calculation unit
262 Phrase determination unit
MC (MC1, MC2, MC3, ...) Voice input unit
O Support shaft
R Robot
R1 Head
R11 Head rotation unit

Claims

A robot comprising:
a torso;
a head supported by a support shaft so as to be rotatable on the upper surface of the torso;
three or more voice input units arranged in the head and spaced apart at predetermined angles around the support shaft in the direction of rotation of the head, each converting input sound into sound data and outputting the sound data;
a speech data processing unit that applies a signal decomposition algorithm to the sound data input from each of the voice input units to generate omnidirectional sound pressure component data indicating sound pressure values over all 360 degrees relative to the direction of the head;
a rotation angle measurement unit that measures the rotation angle of the head relative to the direction of the torso;
a storage unit that stores an exclusion angle range specifying, relative to the direction of the torso, a range of directions that are not to be estimated as a sound source; and
a sound source direction estimation unit that estimates a sound source direction,
wherein the sound source direction estimation unit includes a filter unit that uses the measured rotation angle to generate effective direction sound pressure component data by removing data within the exclusion angle range from the omnidirectional sound pressure component data.

The robot according to claim 1, wherein
the storage unit stores voice input angles indicating the directions of the voice input units around the support shaft,
the sound source direction estimation unit has an angle extraction unit that extracts, as estimated sound source angles, one or more angles in the effective direction sound pressure component data generated by the filter unit at which the sound pressure value is larger than both the sound pressure value at the immediately preceding angle and the sound pressure value at the immediately following angle, and
the robot further comprises:
an estimated sound source separation unit that, based on the voice input angles, separates and extracts from the sound data output by the voice input units the sound data input from the direction of each estimated sound source angle extracted by the angle extraction unit, as estimated sound source direction sound data;
a speech recognition unit that performs speech recognition on the estimated sound source direction sound data extracted by the estimated sound source separation unit, generates a plurality of hypothesis phrases, and calculates a speech recognition likelihood indicating the correctness of each hypothesis phrase; and
a phrase certainty calculation unit that calculates a phrase certainty factor for each hypothesis phrase by weighting the speech recognition likelihood of that hypothesis phrase with a voice input direction matching degree, which indicates the correctness of the relationship between the hypothesis phrase and the direction from which the estimated sound source direction sound data related to that hypothesis phrase was input, and that estimates the sound source direction based on the calculated phrase certainty factors.

The robot according to claim 2, wherein the phrase certainty calculation unit extracts the maximum phrase certainty factor from among the phrase certainty factors of the hypothesis phrases, and estimates, as the sound source direction, the estimated sound source angle of the estimated sound source direction sound data from which the hypothesis phrase having that phrase certainty factor was generated.

The robot according to claim 3, further comprising a phrase storage unit and a phrase determination unit, wherein
the phrase storage unit stores three items in association with one another: a plurality of phrase patterns, prepared in advance, that collect phrases that may be input; a direction angle value pattern that indicates directions around the support shaft relative to the direction of the head; and a voice input direction matching degree that indicates the correctness of the phrase pattern being input from the direction of the direction angle value pattern,
the phrase certainty calculation unit acquires the plurality of hypothesis phrases generated by the speech recognition unit and their speech recognition likelihoods, uses each hypothesis phrase and the estimated sound source angle extracted by the angle extraction unit to extract the voice input direction matching degree associated with the corresponding phrase pattern and direction angle value pattern stored in the phrase storage unit, and calculates the phrase certainty factor by weighting the speech recognition likelihood with that voice input direction matching degree, and
the phrase determination unit determines the hypothesis phrase having the maximum phrase certainty factor extracted by the phrase certainty calculation unit to be the phrase recognized from the speech arriving from the sound source direction that the phrase certainty calculation unit estimated for that hypothesis phrase.

The robot according to claim 4, wherein the phrase storage unit stores a two-dimensional histogram composed of the direction angle value pattern and the voice input direction matching degree for each phrase pattern.

The robot according to claim 2, further comprising an action control unit that rotates the head until the estimated sound source angle extracted by the angle extraction unit is reached.